
ISSN: 2347-4688, Vol. 11, No.(3) 2023, pg. 968-980

Current Agriculture Research Journal


www.agriculturejournal.org

Crop Selection and Yield Prediction using Machine Learning Approach
PRITESH PATIL*, PRANAV ATHAVALE, MANAS BOTHARA,
SIDDHI TAMBOLKAR and ADITYA MORE

Department of Information Technology, AISSMS Institute of Information Technology, Pune, India.

Abstract
In recent years, the agriculture sector has been researched extensively alongside advancements in technologies like machine learning and smart computing. With the dynamic economics of Agri-produce, it is becoming challenging for farmers to utilize the land efficiently to get maximum profit in the specific landscape. Crop Yield Prediction (CYP) is crucial and is greatly dependent on environmental factors like soil contents, humidity and rainfall, as well as area under cultivation and other required metrics. Because they do not sufficiently incorporate the multiple environmental circumstances, a number of existing tools and techniques used for CYP, such as historical averages, tend to produce inaccurate findings. In such a situation, with multiple crop options, it is essential for farmers to plan the crop strategy in advance. If the farmer can get an estimate of the crop yield in advance, cultivation can be done accordingly. To solve this problem, a machine learning approach is implemented as a base for accurate predictions. Crop prediction is done by a classification model, and yield prediction uses regression models to learn from the data. Multiple ML models are analyzed based on performance metrics, and the best-performing model is incorporated in the backend. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39%. The study emphasizes how machine learning could revolutionize crop management techniques by giving farmers insights into optimizing resource allocation and boosting overall crop yield.

Article History
Received: 11 May 2023
Accepted: 31 October 2023

Keywords
Crop Yield Prediction; Digital Agriculture; Machine Learning; Naïve Bayes; Random Forest.

Introduction
The field of machine learning is advancing day by day. Learning is needed when we cannot immediately design a computer program to solve a particular problem and must instead rely on sample data or experience. When there is no human

CONTACT Pritesh Patil [email protected] Department of Information Technology, AISSMS Institute of Information
Technology, Pune, India.

© 2023 The Author(s). Published by Enviro Research Publishers.


This is an Open Access article licensed under a Creative Commons license: Attribution 4.0 International (CC-BY).
Doi: https://dx.doi.org/10.12944/CARJ.11.3.26
PATIL et al., Curr. Agri. Res., Vol. 11(3) 968-980 (2023) 969

knowledge or when people are unable to express their expertise, learning becomes important. Computers are programmed with machine learning in order to improve a performance criterion based on actual or hypothetical facts. A computer program learns to optimize the parameters of the model using training input or previous information. The model may be descriptive, drawing conclusions from the data, or predictive, estimating future trends.1 A subset of artificial intelligence (AI), machine learning (ML) enables computers to learn a specific task, such as playing chess or making recommendations on social networks, without having to be explicitly programmed. Precision farming and Agri-technology, now referred to as Digital Agriculture, are emerging fields of research that employ highly data-driven techniques to boost productivity in agriculture while reducing the adverse effects on the environment. Machine learning, alongside big data technology and robust computing infrastructure, has arisen to create potential solutions for unravelling, quantifying, and comprehending data-intensive processes in agricultural operational environments. Data analysis, as an evolved scientific discipline, is essential to the development of a wide range of crop management applications. Many times, it is possible to use ML efficiently without integrating data from many sources. There tends to be less emphasis on data integration when large datasets are easily available, especially on a major scale. The main force behind this development is the complexity of data preprocessing and analytical processes, as opposed to the generally straightforward implementation of machine learning models.2 The agriculture sector contributed almost 20% of India's GDP in the year 2019-20.3 It is also the principal source of employment in India. In addition to being a significant part of the global economy, it is crucial for the continued existence of humanity. Weather, pests, and the readiness of harvesting operations are the main factors that influence agricultural production. For managing agricultural risk, it is essential to have accurate crop history information.4 Unethical practices are being used to produce higher yields of less-nutritious hybrid cultivars as the population grows. These techniques tend to harm soil quality and result in environmental loss. Given the changing patterns of weather conditions and of economics, it is getting difficult to choose the right crop for a farmer. The use of various fertilizers is also unclear because of seasonal climate variations and changes in the availability of fundamental resources like soil, water, and air. The agricultural yield rate is continuously decreasing in this situation.5 Farmers today cultivate crops based on knowledge gained from earlier generations. Since the traditional method of cultivation has been refined, there are either excessive or insufficient yields without really meeting the need.6 If the producer knows yield estimates in advance, it would help to form the crop strategy. Machine learning is a rapidly expanding methodology that supports and guides the decision process in applications across many industries. The majority of modern gadgets benefit from models being examined before deployment. The primary idea is to increase the efficiency and profits of the agriculture industry by using data as a tool with models. Precision farming, which prioritizes quality above unfavorable environmental variables, would be the main focus.7 ML has advanced its applications in agriculture in areas like predicting soil properties, rainfall analysis, yield prediction, disease and weed detection, ML-based computer vision and many more.8

The use of computer vision, machine learning, and IoT applications will help boost productivity, enhance quality, and ultimately increase the profitability of farmers and related industries. To increase the overall harvesting output, precision farming is crucial in the world of agriculture.9 For example, smart irrigation systems, crop disease prediction, crop selection, weather forecasting, and determining the minimum support price are all examples of techniques employed in agriculture. These methods will increase field productivity while requiring less work from farmers.10 Crop yield estimation may be used for a variety of purposes, including helping farmers enhance production, optimizing the supply-demand cycle for fertilizers, insecticides, and other agricultural products, predicting prices, and calculating the risk levels for agricultural insurance.11

Literature Review
Prior research12 used data that included nutrients and other environmental elements to anticipate crops. For CYP, several feature selection techniques and ML models are employed. In this study, the
following factors were looked at: to assess the effectiveness of the feature selection and classification algorithms, F1 Score, Mean Absolute Error (MAE), Logarithmic Loss (LL), Accuracy (ACC), Specificity (S), Precision (P), Recall (R), and AUC were utilized. Using Modified Recursive Feature Elimination (MRFE), six variables (average soil and air temperatures, minimum and maximum air temperatures, precipitation, and humidity) are selected. A variety of train-test split ratios, including (25-75), (30-70), (35-65), (40-60), (45-55), (50-50), (55-45), (60-40), (65-35), (70-30), and (75-25), are used and evaluated against the previously stated accuracy criteria. Additionally, feature selection techniques such as MRFE, RFE, and Boruta have been applied. According to the results, the Random Forest classifier is the most accurate in comparison with kNN and the other classifiers discussed above. As characteristic ranges broadened, the measurement values decreased.

Another study by Anakha Venugopal, Jinsu Mani, Aparna S, Rima Mathew, and Prof. Vinu Williams13 uses several machine learning approaches to forecast agricultural production. By taking into account variables like temperature, rainfall, area, and other characteristics, farmers will be able to select the crop that will provide the highest produce by using the forecasts made by ML models. The study is focused on Kerala's Agri-produce. Among the classifier models utilized there, Random Forest has the highest accuracy, followed by Logistic Regression and Naive Bayes.

In another study,14 a smartphone app used in the proposed method connects farmers to the internet. GPS helps in locating the user. The user enters the location and soil type. The list of most profitable crops can be picked using machine learning algorithms, which can also forecast crop yields for user-selected crops. Machine learning models, including random forest (RF), artificial neural network (ANN), support vector machine (SVM), multivariate linear regression (MLR), and k-nearest neighbor (KNN), are used to estimate crop productivity. Random forest demonstrated the best outcomes with 95% accuracy. The algorithm also makes recommendations on when to apply fertilizers to increase yield. This research focused on the limitations of present approaches and their applicability for yield prediction. The suggested approach then connects the farmers with an effective yield forecasting system via an app for smartphones. To assist them in selecting a crop, people may select from a number of attributes. The integrated prediction system assists farmers in estimating the crop produce. A user may research possible crops and their yield using the integrated recommendation system in order to make better educated judgements. Based on data from the states of Maharashtra and Karnataka, several ML models like RF, KNN, MLR, SVM, and ANN were built and compared for accuracy. The results confirm that the RF Regressor, which has a 95% accuracy rate, is the best standard algorithm when applied to the presented datasets.

In another work,15 the Random Forest algorithm is used. In spite of extensive research into challenges and topics like weather, temperature, humidity, and rainfall, there are still no acceptable remedies or ideas to deal with the difficulty we face. In nations like India, there are numerous different sorts of rising economic growth, including in the agriculture sector. Additionally, crop yield predictions can be made using the processing. That study proved the value of data mining techniques for predicting agricultural output based on input features related to the climate. All new grains and regions chosen for the investigation should have a prediction accuracy above 75%, demonstrating improved predictive performance. The produced website is user-friendly. The website was developed utilizing data from that area to predict crop yield.

According to a study,16 selecting the best crop before sowing will increase agricultural yield. It depends on a variety of factors, such as the soil type and its composition, climate, local terrain, crop yield, market prices, etc. Techniques like Decision Trees, K-nearest Neighbors, and Artificial Neural Networks have a position in the crop selection framework, which depends on a variety of different factors. Machine learning has been used to choose crops based on how natural disasters like hunger could affect them. Researchers have successfully employed artificial neural networks to choose crops depending on soil and climate.
When attempting to create a high-performance predictive model, ML studies face a variety of difficulties. To tackle the issue at hand, it is essential to choose the appropriate algorithms, and both the algorithms and the supporting platforms must be able to handle the sheer amount of data.17

A study18 suggested a method for unsupervised fuzzy categorization that identifies crop kinds with springtime harvests. The categorization outcomes are likely to get better with time. The strategy used in19 made use of a Bayesian network classification model trained with supervised learning. Crop information is analyzed with environmental parameters like temperature and rainfall to categorize crops.

A study by D. A. Reddy, B. Dadore, and A. Watekar20 highlights how, despite being one of the nations with the highest agricultural output, India's agricultural productivity is still fairly low. Productivity needs to be increased so that farmers may get better profit from decreased costs. In order to reliably and successfully propose a suitable crop based on soil data, it offers a recommender utilizing an ensemble approach with majority voting, employing random tree, CHAID, kNN, and naive Bayes classifiers. Soil types, soil characteristics, and crop yield data are taken into consideration when advising the farmer on the best crop to grow. The majority voting process, which is the most popular ensemble technique, is used in this system. Any number of base learners may be used in the voting process, but a minimum of two base learners is required. The chosen learners complement one another and impart knowledge to the others. With more competition, a better forecast may be made. The specified training data set is used to train the model. When a new record has to be classified, each model chooses the class independently. The class predicted by the consensus of the learners is chosen as the class label for the current record.

A study21 builds a random forest, a group of decision trees each grown on about two-thirds of the records in the dataset, using data on temperature, production, precipitation, and rainfall. These decision trees are then applied to the remaining data to ensure accurate classification. For accurate crop production prediction based on the input attributes, the test data may be applied to the generated training sets. The RF method and the dataset were used to evaluate the efficacy of this technique. The advantage of the random forest approach is that overfitting is less of an issue with random forests than it is with a single decision tree-based model. The random forest does not need to be pruned. The loaded data sets are divided into training and test data of 67 and 33 percent (0.67 and 0.33), respectively. In order to enable the mapping of attribute values to appropriate values and list placement, the training data must be categorized. By contrasting the initial data with model predictions, the probability is determined. Based on the result, the highest likelihood is utilized to make a forecast. The accuracy may be calculated by comparing the generated class value with the test data set.

According to a different study,22 agriculture has positive economic effects on the country. It falls short, nevertheless, in terms of using modern machine learning techniques. As a result, farmers ought to be knowledgeable about the most recent machine learning technology and fresh approaches. The productivity of agriculture is increased by using these methods. To increase agricultural productivity rates, a number of machine learning approaches are used. These techniques can help with agricultural problems. We may also assess the accuracy of the yield by looking at several methods. Thus, we may perform better by contrasting the accuracy for several crops. In agriculture, sensor technology is widely used. The study helps increase agricultural yield rates and helps choose the right crop for the chosen site and season.

Materials and Methods
Data Pre-processing
Data pre-processing is a technique that transforms unprocessed, uncleaned data into a form that is ready for further analysis. Data may be gathered from multiple sources, but as they are collected in raw form, analysis is not possible. We convert data into a comprehensible format by using several strategies, such as substituting missing values and null values. Fields in the dataset which are insignificant for label prediction are eliminated. If required, One-Hot Encoding is performed on the dataset to have the dataset ready for regression model fitting. The division of train and test data is the final stage in the
data preparation process. Because training a machine learning algorithm usually requires as many data points as possible, the data is typically split unevenly. The training dataset, which in this case makes up 70% of the total data, is used to train the machine learning models so that they can make accurate predictions.

Factors Affecting Crop Yield
The yield of every crop is impacted by a wide range of variables. These are essentially the characteristics that aid in estimating a crop yield. For crop yield prediction, this study includes parameters such as temperature, rainfall, area, humidity, soil nutrients, pH, and AUC (Area under Cultivation).

Comparison and Selection of ML Algorithm
We must first assess and compare different algorithms before selecting the one that best matches this particular dataset. Machine learning is an effective way to solve the crop prediction problem, as it learns from past data and gives predictions for current parameters. In order to make precise predictions and withstand erratic patterns in weather conditions like temperature and rainfall, various machine learning classifiers (Logistic Regression, Naïve Bayes, Random Forest, KNN) are compared on the performance metrics, and the model with the best accuracy is selected for crop prediction. For yield prediction, regressors (Linear Regression, Random Forest and Decision Tree Regression) are compared on metrics like MAE, Median Absolute Error and R2 Score. The model with the best values is selected for predicting yield.

Naïve Bayes
Based on Bayes' theorem, the Naïve Bayes model is frequently employed in many classification tasks. The multinomial, Bernoulli, and Gaussian variants make up the three Naive Bayes algorithms. The Naive Bayes algorithm is mostly employed for classification problems. It operates under the assumption that each feature contributes equally to the outcome and that the features are independent of one another given the class. Bayes' theorem gives the probability of an event given that another event has occurred. Multi-class classification makes use of Bayes' theorem. Also, in comparison to other ML techniques, the model is quicker and simpler to construct. Additionally, it doesn't need a lot of training data. Both discrete and continuous data may be used with it. It is extremely scalable and unaffected by insignificant features.

Decision Trees
A decision tree is a type of tree structure that resembles a flowchart and is frequently employed in supervised machine learning for classification and prediction. A DT may be transformed into a set of rules, with each path from the root node to a leaf node serving as a different rule. In a decision tree, each leaf node has a class that is reached if the attributes satisfy the conditions on the branches that lead to it. Each internal node corresponds to a test, condition, or attribute.6

KNN
The machine learning approach known as kNN, which is supervised and nonparametric, is used to solve classification and regression problems. Labeled data is used with supervised algorithms. The technique relies on the distances between the points, which may be calculated in a few different ways. Because a distance must always be either zero or positive, the coordinate differences are squared, raised to a given power, or taken as absolute values. Pre-processing of all the labelled data is necessary before we apply the kNN algorithm. All of the data must first be normalized. As kNN struggles to function when there are too many features present, feature selection must then be used to eliminate the insignificant features. Missing data must be filled in; otherwise, that particular record must be eliminated. The performance can be enhanced by including more training samples. The fundamental drawback of KNN is that as the size of the dataset grows, the cost of computing rises and the algorithm's speed decreases.

Random Forests (RF)
The RF technique is a good example of ensemble learning in action, since it combines several classifiers to tackle a challenging problem and improve a model's efficiency. The "forest" created with this approach is actually a collection of decision trees. At each decision split, the candidate features are chosen at random. Selecting features in this way reduces the correlation across trees, which supports prediction and leads to increased efficiency. The Random Forest ML
classification approach generates the final output by combining the results of all the decision trees after segmenting the dataset into smaller subsets. Random Forest belongs to the Bagging subcategory of ensemble learning methods. A sample of rows and features from the primary dataset is selected at random and fed into each of the random forest's decision trees. It can carry out both regression and classification tasks. It also works well with large, high-dimensional data sets and, most significantly, it improves the model's accuracy and mitigates the overfitting problem.
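
As an illustration of the regressor comparison described under "Comparison and Selection of ML Algorithm", the following is a minimal sketch using scikit-learn. The synthetic data from make_regression, the variable names, and the choice of MAE as the selection criterion are assumptions made for the example; they stand in for the study's encoded yield data and are not the authors' code.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded yield data (assumption, not the paper's dataset).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The three regressor families named in the text.
regressors = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each regressor and record the three metrics named in the text.
scores = {}
for name, model in regressors.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "MedAE": median_absolute_error(y_test, predictions),
        "R2": r2_score(y_test, predictions),
    }

# Assumed selection rule: pick the regressor with the lowest mean absolute error.
best_regressor = min(scores, key=lambda name: scores[name]["MAE"])

for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
print("Selected:", best_regressor)
```

On real data the ranking may differ from one metric to another, so a single selection rule such as lowest MAE is only one reasonable choice.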

Fig. 1: System Architecture

System Architecture
The system architecture is represented in Figure 1. The end user interacts with a web user interface (UI) which is hosted on a server. The OpenWeatherMap API is connected to the server to deliver weather data. The machine learning models are trained and tested by the admin and are loaded on the server for predicting the crop and the yield in tonnes per hectare of land area.

Datasets
Public datasets were chosen because they are readily available and easily accessible. Kaggle is a popular platform for finding and sharing datasets, so we were able to find datasets that met our criteria. We selected three datasets, namely:

India Agriculture Crop Production23
The dataset has the following features: State, District, Crop, Year, Season, Area, Area Units, Production, Production Units, Yield. This dataset is used to build the regression model for yield prediction. Yield is the required label. It has a total of 12176 records containing 27 unique crops and 4 unique seasons. The crops are as follows: Arhar/tur, bajra, castor seed, gram, groundnut, jowar, linseed, maize, moong, niger seed, other cereals, kharif pulses, rabi pulses, summer pulses, ragi, rapeseed and mustard, rice, safflower, sesamum, millets, soyabean, sugarcane, sunflower, tobacco, urad, wheat, oilseeds.

District Wise Rainfall Normal23
This dataset is used for collecting district-wise rainfall data to predict yield. It is used for extracting the district-wise average annual rainfall for each district of Maharashtra, India. This feature is combined with the yield dataset mentioned above to get estimates of production for a particular crop in the given season.

Crop Recommendation23
The dataset is used for crop prediction. It has features like N, P, K, rainfall, humidity, pH and crop. N, P and K stand for the Nitrogen, Phosphorous and Potassium nutrients in soil. It has 2200 records in total, containing 22 unique crops. The data consists of 100
records for each of the following crops: rice, maize, chickpea, kidney beans, pigeon peas, moth beans, mung beans, black gram, lentil, pomegranate, banana, mango, grapes, watermelon, muskmelon, apple, orange, papaya, coconut, cotton, jute, coffee.

Data Pre-Processing
Crop Prediction
Before modelling the data, we need to carry out data pre-processing. It is done in the following steps, as shown in Figure 2.

Handling Missing Values
The line df = pd.read_csv('Crop_recommendation2.csv', na_values='=') reads the CSV file into a DataFrame (df), replacing any occurrences of '=' with NaN values, which are commonly used to represent missing data in pandas.

Separating the Target Variable
The line b = df['label'] extracts the target variable column ('label') from the DataFrame df and assigns it to the variable b.

Creating a Preprocessing Pipeline
The code creates a pipeline (my_pipeline) using scikit-learn's Pipeline class. The pipeline consists of two steps:

Imputation
The missing values in the DataFrame are imputed using the SimpleImputer transformer with a strategy set to "mean". This strategy replaces the missing values with the mean of the corresponding column.

Standardization
The features are standardized using the StandardScaler transformer. This step scales the features to have zero mean and unit variance.

Applying the Preprocessing Pipeline
The line X = my_pipeline.fit_transform(df) applies the preprocessing pipeline (my_pipeline) to the entire DataFrame (df). It fits the pipeline on the data to learn the mean values (for imputation) and the standardization parameters. Then, it transforms the data by applying the learned transformations.

Train-Test Split
The train_test_split() function from scikit-learn is used to split the processed features (X) and the target variable (b) into training and testing sets. The stratify=b parameter ensures that the class distribution is maintained in both the training and testing sets. The split is performed with a test size of 30% (test_size=0.3) and a random state of 42 (random_state=42).

These pre-processing steps help handle missing values, standardize the features, and split the data into training and testing sets for further analysis and model training.
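
The crop-prediction preprocessing steps described above can be put together in a short sketch. It is a reconstruction from the prose, not the authors' notebook: the inline CSV text is a small illustrative placeholder for 'Crop_recommendation2.csv' (the rows are not real records), the pipeline step names are hypothetical, and dropping the 'label' column before fitting the pipeline is an assumption, since mean imputation only applies to numeric columns.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative placeholder rows standing in for 'Crop_recommendation2.csv';
# '=' marks a missing value, as in the text.
csv_text = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.9,82.0,6.5,202.9,rice
85,58,41,21.8,80.3,7.0,226.7,rice
60,55,44,23.0,=,7.8,263.9,rice
74,35,40,26.5,80.2,6.9,242.9,rice
78,42,42,20.1,81.6,7.6,262.7,rice
71,54,16,22.6,63.7,5.7,87.8,maize
61,44,17,26.1,71.6,6.9,102.3,maize
80,43,16,23.6,71.6,=,66.7,maize
73,58,21,19.9,57.7,6.6,60.7,maize
61,38,20,18.5,62.7,5.9,65.4,maize
"""

# Handling missing values: '=' is read as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values="=")

# Separating the target variable.
b = df["label"]
features = df.drop(columns="label")  # assumption: only numeric columns are transformed

# Pipeline with the two steps from the text: mean imputation, then standardization.
my_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("std_scaler", StandardScaler()),
])
X = my_pipeline.fit_transform(features)

# Stratified 70:30 split with the parameters quoted in the text.
X_train, X_test, b_train, b_test = train_test_split(
    X, b, test_size=0.3, stratify=b, random_state=42
)
print(X_train.shape, X_test.shape)
```

The sketch keeps the order described in the text, where the pipeline is fitted on all rows before splitting. Fitting it on the training portion only and then transforming the test portion would avoid carrying test-set statistics into training.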

Fig. 2: Preprocessing for Crop prediction


Yield Prediction
In data preprocessing, we cleaned the data of missing values, outliers, and errors that need to be addressed before the data can be used for machine learning. We also integrated the District Wise Rainfall Normal23 and India Agriculture Crop Production23 datasets, as they had to be merged before being passed to the machine learning model. We applied data transformation to the India Agriculture Crop Production23 dataset, as it had categorical variables which need to be encoded as numerical values to pass to the machine learning model. We used One-Hot Encoding for data transformation. We also had to do data reduction to limit the dataset to the state of Maharashtra; otherwise the dataset would have been too large in terms of rows and columns.
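
The cleaning, reduction, integration and encoding steps for yield prediction can be sketched with pandas as follows. The two small DataFrames are made-up placeholders for the Kaggle datasets; the rainfall column name Annual_Rainfall, the join key District, and the use of dropna for cleaning are assumptions for illustration, since the paper does not list the exact operations.

```python
import pandas as pd

# Made-up placeholder rows standing in for the crop production dataset.
crops = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Karnataka", "Maharashtra"],
    "District": ["PUNE", "NASHIK", "MYSORE", "PUNE"],
    "Crop": ["Rice", "Wheat", "Ragi", "Wheat"],
    "Season": ["Kharif", "Rabi", "Kharif", "Rabi"],
    "Yield": [2.1, 1.6, 1.4, None],
})

# Made-up placeholder rows standing in for the district-wise rainfall dataset.
rainfall = pd.DataFrame({
    "District": ["PUNE", "NASHIK"],
    "Annual_Rainfall": [1050.0, 1100.0],
})

# Data reduction: keep only the state of Maharashtra.
maharashtra = crops[crops["State"] == "Maharashtra"].drop(columns="State")

# Data cleaning (assumed here to mean dropping rows without a yield label).
cleaned = maharashtra.dropna(subset=["Yield"])

# Data integration: attach the annual rainfall of each district.
merged = cleaned.merge(rainfall, on="District", how="inner")

# Data transformation: one-hot encode the categorical columns.
encoded = pd.get_dummies(merged, columns=["District", "Crop", "Season"])
print(encoded.columns.tolist())
```

An inner join drops districts that have no rainfall entry; a left join followed by imputation would be an alternative if those rows need to be kept.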

Fig. 3: Preprocessing for Yield Prediction

Feature Selection
A machine learning model's performance can be improved through feature selection, which is the process of choosing a subset of the relevant features from the available data. For crop prediction, the following features were selected: Nitrogen, Phosphorous, Potassium, Temperature, Humidity, pH and Rainfall. For yield prediction, the selected features are as follows: City, Crop, Annual Rainfall (in mm), Season.

Train Test Splitting of Data
We have split the data in the ratio 70:30 using random sampling and stratification. Choosing an appropriate train-test split is important in ML, because it can affect the accuracy and generalization of the resulting model.

API Integration
The user's city, taken as an input, is given to the API call as a parameter. The temperature and humidity fields from the API response are given to the crop prediction model as input along with the other data. This helps the system to give real-time predictions. The "OpenWeatherMap" API is used for this purpose. To create the API URL, a base URL and an API key, which is unique to each subscription, are used. The user's city name is passed in the complete URL as a parameter and the response is collected. From the collected response, the required fields, i.e., temperature and humidity, are passed to the ML model for predicting the crop.

Training and Evaluation of Models
Crop prediction uses a multi-class classification machine learning model to predict the crop for a set of given input features, whereas yield prediction uses a regression model to predict the yield for a given set of input features. The Google Colab platform is used for training and evaluating the models. The user interface for the project is built using ReactJs, and the backend is built using the Python Flask framework.
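
The API integration step described above can be sketched as two small helpers: one that builds the request URL from a base URL, the city name and an API key, and one that pulls temperature and humidity out of the JSON response. The paper does not state which OpenWeatherMap endpoint is used; the current-weather endpoint, the units=metric option, the function names, and the sample payload below are assumptions for illustration. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint: OpenWeatherMap current weather data.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str) -> str:
    """Build the complete request URL for a city (hypothetical helper)."""
    query = urlencode({"q": city, "appid": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"


def extract_temperature_humidity(response_json: dict) -> tuple:
    """Return (temperature, humidity) from a current-weather response."""
    main = response_json["main"]
    return float(main["temp"]), float(main["humidity"])


# Example with a placeholder key and a made-up response payload.
url = build_weather_url("Pune", "YOUR_API_KEY")
sample_response = {"name": "Pune", "main": {"temp": 28.5, "humidity": 60}}
temperature, humidity = extract_temperature_humidity(sample_response)
print(url)
print(temperature, humidity)
```

In a Flask backend, the URL would typically be fetched with an HTTP client and the two extracted values appended to the soil features before calling the crop prediction model.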
Application and Advantages over existing Results and Discussions


versions Crop Prediction
The model can be used to create an impact on right First, datasets are loaded and cleaned from insignificant
crop selection as the user would get a fair prediction of yield as well as crop. Yield prediction would also be important in the financial assessment of a crop strategy. The model is useful if the user wants to compare yield for multiple crop options and then select the best one. It could also be used across a wide geography to estimate the yield for a particular crop. This project can be used directly by end users such as farmers to obtain predictions for their conditions. Alternatively, it can be used by government agencies for planning and policy making if modified with wider access to reliable closed-source government data. It can also be used by NGOs that work on educating farmers in adopting new technologies and precision agriculture. It can further be used in fields where monetary calculations depend on how much yield could be produced, such as insurance claims or loan policies.

The project improves prediction accuracy through suitable data gathering, cleaning and selection of the most accurate model. The project also incorporates both crop and yield prediction, so it uses classification as well as regression models for the necessary functionality. It adds value to the modern agriculture setup by providing a way to improve the reliability of crop selection, which in turn improves yield and financial stability.

features. After Data Preparation, data is split into training and testing data, and various models are fitted and tested for accuracy. Feature Importance is calculated to determine the relative significance or contribution of individual features in the ML model. For the crop prediction model, Drop Column Importance is calculated. It is closely related to "permutation importance" (or "feature importance by feature shuffling"), which shuffles a feature's values instead of removing the feature and retraining the model.

Drop Column Importance = Baseline Metric − Metric Without the Feature

Drop column importance is based on the idea that removing a feature that is crucial to the performance of the model would cause the model to perform significantly worse than before. It is calculated in the following steps:

1. Train a model with all features
2. Measure baseline performance on validation data
3. Select one feature whose importance is to be calculated
4. Train a model with all features except the selected one
5. Calculate performance on the validation data
6. The feature importance is the drop in performance from the baseline
7. Repeat steps 3 through 6 for every feature
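The drop column procedure listed above can be illustrated with a minimal, self-contained sketch. It is not the authors' code: the nearest-centroid classifier is a toy stand-in for the models compared in the paper, and the feature names and data values are made up purely for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]
Model = Callable[[Row], str]


def train_nearest_centroid(X: List[Row], y: List[str]) -> Model:
    """Toy stand-in classifier: predict the class whose feature mean is closest."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row: Row) -> str:
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return predict


def accuracy(model: Model, X: List[Row], y: List[str]) -> float:
    """Fraction of validation rows classified correctly."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def drop_column(X: List[Row], j: int) -> List[List[float]]:
    """Return a copy of X without column j."""
    return [[v for k, v in enumerate(row) if k != j] for row in X]


def drop_column_importance(
    X_train: List[Row], y_train: List[str],
    X_val: List[Row], y_val: List[str],
    names: Sequence[str],
) -> Tuple[float, Dict[str, float]]:
    # Steps 1-2: train on all features and measure the baseline.
    baseline = accuracy(train_nearest_centroid(X_train, y_train), X_val, y_val)
    importances: Dict[str, float] = {}
    # Steps 3-7: retrain without each feature and record the drop.
    for j, name in enumerate(names):
        reduced = train_nearest_centroid(drop_column(X_train, j), y_train)
        score = accuracy(reduced, drop_column(X_val, j), y_val)
        importances[name] = baseline - score
    return baseline, importances


# Hypothetical toy data: rainfall separates the two crops, pH does not.
names = ["rainfall", "ph"]
X_train = [[200.0, 6.5], [220.0, 6.4], [60.0, 6.5], [50.0, 6.6]]
y_train = ["rice", "rice", "millet", "millet"]
X_val = [[210.0, 6.6], [55.0, 6.4]]
y_val = ["rice", "millet"]

baseline, importances = drop_column_importance(
    X_train, y_train, X_val, y_val, names
)
print(baseline, importances)
```

In this toy example, removing rainfall destroys the model's accuracy while removing pH leaves it unchanged, so rainfall receives the larger importance. Any trainable model and any validation metric could be substituted for the stand-ins used here.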

Fig. 4: Feature Importance for Crop Prediction



As shown in Figure 4 above, rainfall is the most important feature for crop prediction, followed by humidity, Potassium (K), Phosphorus (P), Nitrogen (N) and pH. The classification models' results for crop prediction are depicted in Figure 5.

Fig. 5: Accuracy Metric for Classification Models

When trained on the dataset, KNN gives an accuracy of 97.72%, RF gives an accuracy of 99.24%, the Naïve Bayes classifier has an accuracy of 99.39% and Logistic Regression has an accuracy of 94.69%. Based on these results, the Naïve Bayes classifier is incorporated in the backend for Crop Prediction.

Fig. 6: Code Snippet for Feature Importance in Yield Prediction

Yield Prediction

Calculating feature importance for a Random Forest Regressor with one-hot encoded features involves determining the contribution of each feature to the model's predictive performance. It is done through the following steps.

1. Train the Random Forest Regressor.
2. Access Feature Importances: the Random Forest Regressor has a built-in attribute named feature_importances_.
3. Map Feature Importances to Original Features: every one-hot encoded feature is mapped to its original feature.
4. Aggregate Feature Importances: aggregating the mapped values gives the importance of every categorical feature.
5. Rank Feature Importances in descending order of importance.

As shown in Figure 7, Crop is the most important feature for predicting yield, followed by District, Rainfall and Season.

Fig. 7: Feature Importance for Yield Prediction
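The mapping, aggregation and ranking steps described for yield prediction can be sketched as follows. This is an illustrative sketch, not the snippet from Figure 6: the encoded column names assume a "Feature_Category" naming scheme (as produced, for example, by pandas' get_dummies with its default separator), and the importance values are made-up placeholders for what a fitted regressor's feature_importances_ attribute would return.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_importances(
    encoded_names: Sequence[str],
    importances: Sequence[float],
    categorical: Sequence[str],
    sep: str = "_",
) -> List[Tuple[str, float]]:
    """Sum one-hot importances per original feature and rank them."""
    totals: Dict[str, float] = defaultdict(float)
    for name, value in zip(encoded_names, importances):
        # Map an encoded column such as "Crop_Rice" back to "Crop";
        # numeric columns such as "Rainfall" map to themselves.
        original = next(
            (c for c in categorical if name.startswith(c + sep)), name
        )
        totals[original] += value
    # Rank in descending order of aggregated importance.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical encoded columns and placeholder importance values.
encoded_names = [
    "Crop_Rice", "Crop_Wheat",
    "District_Pune", "District_Nashik",
    "Season_Kharif", "Rainfall",
]
importances = [0.25, 0.25, 0.125, 0.125, 0.0625, 0.1875]

ranked = aggregate_importances(
    encoded_names, importances, categorical=["Crop", "District", "Season"]
)
for feature, value in ranked:
    print(f"{feature}: {value}")
```

Summing is one reasonable aggregation choice because the per-column importances of a tree ensemble add up to 1, so the grouped totals remain comparable across categorical and numeric features.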

Yield prediction is done by regression. To compare different regression models, performance metrics such as Mean Absolute Error, Median Absolute Error and R2 Score are used. The results are depicted in Figure 8.

Fig. 8: Regression Results for Yield Prediction
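For reference, the three regression metrics used in this comparison can be computed as in the sketch below. The definitions are the standard ones; the yield values are made up for illustration and are not taken from the paper's dataset.

```python
from statistics import mean, median
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the absolute prediction errors."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Median of the absolute errors; less sensitive to outliers than the mean."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 minus the ratio of residual to total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted yields.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 4.0, 5.0, 10.0]

mae = mean_absolute_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, medae, r2)
```

A Mean Absolute Error that is much larger than the Median Absolute Error, as in this toy data, indicates that a few large errors dominate the average.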

The Random Forest Regressor gives the most reliable results for the required inputs, with a Mean Absolute Error of 0.64, a Median Absolute Error of 0.16 and an R2 score of 0.96. The Decision Tree Regressor has a Mean Absolute Error of 0.80, a Median Absolute Error of 0.18 and an R2 score of 0.94. Linear Regression has a Mean Absolute Error of 1.08, a Median Absolute Error of 0.47 and an R2 score of 0.92.

Conclusion
Crop yield prediction is a complex process which relies on several different factors including weather, soil, fertilizers, pest infestations, etc. In this paper, we predict the crop yield using weather and soil parameters. The research is based on datasets limited to districts in Maharashtra. The system incorporates regression techniques to estimate yield and multi-class classification to predict the type of crop. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39%. The suggested method aids farmers in choosing which crop to plant in the field and in estimating how much yield any crop would give in that specific environment. The dataset used in the research can be improved by collecting real-time data through IoT devices. Factors like irrigation and fertilizer use can also be included for better prediction. A mobile app can be developed with added services like price estimates in accordance with current market prices. Paid datasets may bring more reliable and accurate data, which in turn might help model accuracy. They may contain more features that correlate more strongly with the label.

Acknowledgements
We are grateful to Prof. Pritesh A Patil for providing valuable input and feedback throughout the research process.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflict of Interest
The authors declare no conflict of interest regarding this research. However, it should be noted that the first author of this paper is an employee of a company that develops and markets machine learning software for crop yield prediction. The results and conclusions presented here are solely based on the authors' research and do not reflect any external influence.

