Crop Selection and Yield Prediction
Abstract
In recent years, the agriculture sector has been researched extensively, aided by advances in technologies such as machine learning and smart computing. With the dynamic economics of agri-produce, it is becoming challenging for farmers to use land efficiently and get maximum profit in a specific landscape. Crop Yield Prediction (CYP) is crucial and depends greatly on environmental factors such as soil contents, humidity, and rainfall, as well as area under cultivation and other required metrics. Because they do not sufficiently incorporate multiple environmental circumstances, a number of existing tools and techniques used for CYP, such as historical averages, tend to produce inaccurate findings. In such a situation, with multiple crop options, it is essential for farmers to plan the crop strategy in advance. If the farmer can get an estimate of the crop yield in advance, cultivation can be planned accordingly. To solve this problem, a machine learning approach is implemented as a base for accurate predictions. Crop prediction is done by a classification model, and yield prediction uses regression models to learn from the data. Multiple ML models are analyzed based on performance metrics, and the best-performing model is incorporated in the backend. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39%. The study emphasizes how machine learning could revolutionize crop management techniques by giving farmers insights for optimizing resource allocation and boosting overall crop yield.

Article History
Received: 11 May 2023
Accepted: 31 October 2023

Keywords
Crop Yield Prediction; Digital Agriculture; Machine Learning; Naïve Bayes; Random Forest.
CONTACT Pritesh Patil [email protected] Department of Information Technology, AISSMS Institute of Information
Technology, Pune, India.
knowledge or when people are unable to express their expertise, learning becomes important. Computers are programmed with machine learning in order to improve performance criteria based on actual or hypothetical facts. A computer program learns to optimize the parameters of the model using training input or previous information. The model may be descriptive, drawing conclusions from the data, or predictive, estimating future trends.[1] A subset of artificial intelligence (AI), machine learning (ML) enables computers to learn a specific task, such as playing chess or making recommendations on social networks, without having to be explicitly programmed. Precision farming and Agri-technology, now referred to as Digital Agriculture, are evolving into emerging research fields that employ highly data-driven techniques to boost productivity in agriculture while reducing the adverse effects on the environment. Machine learning, alongside big data technology and robust computing infrastructure, has arisen to create potential solutions for unravelling, quantifying, and comprehending data-intensive processes in agricultural operational environments. Data analysis, as an evolved scientific discipline, is essential to the development of a wide range of crop management applications. In many cases, ML can be used efficiently without integrating data from many sources. There tends to be less emphasis on data integration when large datasets are easily available, especially at scale. The main force behind this development is the complexity of data preprocessing and analytical processes, as opposed to the generally straightforward implementation of machine learning models.[2]

The agriculture sector contributed almost 20% of India's GDP in 2019-20.[3] It is also the principal source of employment in India. In addition to being a significant part of the global economy, it is crucial for the continued existence of humanity. Weather, pests, and the readiness of harvesting operations are the main factors that influence agricultural production. For managing agricultural risk, it is essential to have accurate crop history information.[4] Unethical practices are being used to produce higher yields of less-nutritious hybrid cultivars as the population grows. These techniques tend to harm soil quality and result in environmental loss. Given the changing patterns of weather conditions and economics, it is getting difficult to choose the right crop for a farmer. The use of various fertilizers is also unclear because of seasonal climate variations and changes in the availability of fundamental resources like soil, water, and air. The agricultural yield rate is continuously decreasing in this situation.[5] Farmers today cultivate crops based on knowledge gained from earlier generations. Since the traditional method of cultivation has been refined, there are either excessive or insufficient yields without really meeting the need.[6] If the producer knows yield estimates in advance, it would help to form the crop strategy. Machine learning is a rapidly expanding methodology that supports and guides the decision process in applications across many industries. The majority of modern gadgets benefit from models being examined before deployment. The primary idea is to increase the efficiency and profits of the agriculture industry by using data as a tool with models. Precision farming, which prioritizes quality above unfavorable environmental variables, would be the main focus.[7] ML has advanced its applications in agriculture in areas like soil property prediction, rainfall analysis, yield prediction, disease and weed detection, ML-based computer vision, and many more.[8]

The use of computer vision, machine learning, and IoT applications will help boost productivity, enhance quality, and ultimately increase the profitability of farmers and related industries. To increase the overall harvesting output, precision farming is crucial in the world of agriculture.[9] Smart irrigation systems, crop disease prediction, crop selection, weather forecasting, and determining the minimum support price are all examples of techniques employed in agriculture. These methods will increase field productivity while requiring less work from farmers.[10] Crop yield estimation may be used for a variety of purposes, including helping farmers enhance production, optimizing the supply-demand cycle for fertilizers, insecticides, and other agricultural products, predicting prices, and calculating the risk levels for agricultural insurance.[11]

Literature Review
Prior research[12] used data that included nutrients and other environmental elements to anticipate crops. For CYP, several feature selection techniques and ML models are employed. In this study, the
PATIL et al., Curr. Agri. Res., Vol. 11(3) 968-980 (2023) 970
following factors were looked at. To assess the effectiveness of feature selection and classification algorithms, F1 Score, Mean Absolute Error (MAE), Logarithmic Loss (LL), Accuracy (ACC), Specificity (S), Recall (R), Precision (P), and AUC were utilized. Using Modified Recursive Feature Elimination (MRFE), six variables are selected: average soil and air temperatures, minimum and maximum air temperatures, precipitation, and humidity. A variety of data splitting ratios for validation, including (25-75), (30-70), (35-65), (40-60), (45-55), (50-50), (55-45), (60-40), (65-35), (70-30), and (75-25), are used and evaluated against the previously stated accuracy criteria. Additionally, feature selection techniques such as MRFE, RFE, and Boruta have been applied. According to the results, the Random Forest classifier is the most accurate in comparison with kNN and the other classifiers discussed above. As characteristic ranges broadened, the measurement values decreased.

Another study by Anakha Venugopal, Jinsu Mani, Aparna S, Rima Mathew, and Prof. Vinu Williams[13] uses several machine learning approaches to forecast agricultural production. By taking into account variables like temperature, rainfall, area, and other characteristics, farmers will be able to select the crop that will provide the highest produce by using the forecasts made by ML models. The study is focused on Kerala's agri-produce. Among the classifier models utilized, Random Forest has the highest accuracy, followed by Logistic Regression and Naive Bayes.

In another study,[14] a smartphone app connects farmers to the internet. GPS helps locate the user. The user enters the location and soil type. Machine learning algorithms pick the list of most profitable crops and also forecast crop yields for user-selected crops. Machine learning models, including random forest (RF), artificial neural network (ANN), support vector machine (SVM), multivariate linear regression (MLR), and k-nearest neighbor (KNN), are used to estimate crop productivity. Random forest demonstrated the best outcomes with 95% accuracy. The algorithm also makes recommendations on when to apply fertilizers to increase yield. This research focused on the limitations of present approaches and their applicability for yield prediction. The suggested approach then connects farmers with an effective yield forecasting system via a smartphone app. To assist them in selecting a crop, users may select from a number of attributes. The integrated prediction system assists farmers in estimating the crop produce. A user may research possible crops and their yield using the integrated recommendation system in order to make better-informed judgements. Based on data from the states of Maharashtra and Karnataka, several ML models such as RF, KNN, MLR, SVM, and ANN were built and compared for accuracy. Results confirm that the RF regressor, with a 95% accuracy rate, is the best standard algorithm when applied to the presented datasets.

In [15], the Random Forest algorithm is used. In spite of extensive research into challenges and topics like weather, temperature, humidity, and rainfall, there are still no acceptable remedies or ideas to deal with the difficulty we face. In nations like India, there are numerous different sorts of rising economic growth, including in the agriculture sector. Additionally, crop yield predictions can be made using the processing. The study demonstrated the value of data mining techniques for predicting agricultural output based on climate-related input features. All new grains and regions chosen for the investigation should have a prediction accuracy above 75%, demonstrating improved predictive performance. The resulting website is user-friendly and was developed using data from that area to predict crop yield.

According to a study,[16] selecting the best crop before sowing will increase agricultural yield. The choice depends on a variety of factors, such as the soil type and its composition, climate, local terrain, crop yield, and market prices. Techniques like Decision Trees, K-nearest Neighbors, and Artificial Neural Networks have a place in the crop selection framework, which depends on a variety of different factors. Machine learning has been used to choose crops based on how natural disasters like hunger could affect them. Researchers have successfully employed artificial neural networks to choose crops depending on soil and climate.
When attempting to create a high-performance predictive model, ML studies face a variety of difficulties. To tackle the issue at hand, it is essential to choose the appropriate algorithms, and both the algorithms and the supporting platforms must be able to handle the sheer amount of data.[17]

A study[18] suggested a method for unsupervised fuzzy categorization that identifies crop kinds with springtime harvests. The categorization outcomes are likely to get better with time. The strategy used in [19] made use of a Bayesian network, a supervised learning model for categorization. Crop information is analyzed with environmental parameters like temperature and rainfall to categorize crops.

A study by D. A. Reddy, B. Dadore, and A. Watekar[20] highlights how, despite being one of the nations with the highest agricultural output, India's agricultural productivity is still fairly low. Productivity needs to be increased so that farmers may get better profit from decreased costs. In order to reliably propose a suitable crop based on soil data, it offers a recommender built on an ensemble approach that applies majority voting over random tree, CHAID, kNN, and naive Bayes classifiers. Soil types, soil characteristics, and crop yield data are taken into consideration when advising the farmer on the best crop to grow. The majority voting process, which is the most popular ensemble technique, is used in this system. Any number of base learners may be used in the voting process, but a minimum of two is required. The chosen learners complement one another and impart knowledge to the others. With more competition, a better forecast may be made. The specified training data set is used to train each model. When a new record has to be classified, each model chooses the class independently. The class predicted by the majority of learners is chosen as the class label for the current record.

A study[21] builds a random forest, a group of decision trees each grown on two-thirds of the records in the dataset, using data on temperature, production, precipitation, and rainfall. These decision trees are then applied to the remaining data to ensure accurate categorization. For accurate crop production prediction based on the input attributes, the test data may be applied to the generated training sets. The RF method and the dataset were used to evaluate the efficacy of this technique. The advantage of the random forest approach is that overfitting is less of an issue than with a single decision tree model, and the random forest does not need to be pruned. The loaded dataset is divided into training and test sets in proportions of 67% and 33% (0.67 and 0.33), respectively. In order to enable the mapping of attribute values to appropriate values and list placement, the training data must be categorized. By contrasting the initial data with model predictions, the probability is determined. Based on the result, the highest likelihood is utilized to make a forecast. The accuracy may be calculated by comparing the generated class value with the test data set.

According to a different study,[22] agriculture has positive economic effects on the country. It falls short, nevertheless, in terms of using modern machine learning techniques. As a result, farmers ought to be familiar with the most recent machine learning technology and approaches. The productivity of agriculture is increased by using these methods. To increase agricultural productivity rates, a number of machine learning approaches are used. These techniques can help with agricultural problems. We may also assess the accuracy of the yield prediction by looking at several methods. Thus, we may perform better by contrasting the accuracy across several crops. In agriculture, sensor technology is widely used. The study helps increase agricultural yield rates and helps choose the right crop for the chosen site and season.

Materials and Methods
Data Pre-processing
Data pre-processing transforms raw, uncleaned data into a form ready for further analysis. Data may be gathered from multiple sources, but as they are collected in raw form, direct analysis is not possible. We convert data into a comprehensible format by using several strategies, such as substituting missing values and null values. Fields in the dataset which are insignificant for label prediction are eliminated. If required, One-Hot Encoding is performed on the dataset to make it ready for regression model fitting. The division of train and test data is the final stage in the
data preparation process. As training a machine learning algorithm usually requires as many data points as possible, the data is typically split unevenly. The training dataset, which in this case makes up 70% of the total data, is used to train machine learning models and make accurate predictions.

Factors Affecting Crop Yield
The yield of every crop is impacted by a wide range of variables. These are essentially the characteristics that aid in estimating a crop yield. For crop yield prediction, this study includes parameters such as temperature, rainfall, area, humidity, soil nutrients, pH, and AUC (Area under Cultivation).

Comparison and Selection of ML Algorithm
We first must assess and compare different algorithms before selecting the one that best matches this particular dataset. Machine learning is an effective way to solve the crop prediction problem, as it learns from past data and gives predictions on current parameters. In order to make precise predictions and cope with erratic patterns in weather conditions like temperature and rainfall, various machine learning classifiers such as Logistic Regression, Naïve Bayes, Random Forest, and KNN are compared on performance metrics, and the model with the best accuracy is selected for crop prediction. For yield prediction, regressors such as Linear Regression, Random Forest, and Decision Tree Regression are compared on metrics like MAE, Median Absolute Error, and R2 Score. The model with the best values is selected for predicting yield.

Naïve Bayes
Based on Bayes' theorem, the Naïve Bayes model is frequently employed in many classification tasks. The three Naive Bayes variants are the multinomial, Bernoulli, and Gaussian algorithms. It operates under the presumption that each feature contributes equally to the outcome and that the likelihood of each feature occurring is independent of the occurrence of all other features, given the class. Bayes' theorem determines the likelihood of an event happening given that another event has occurred. Multi-class classification makes use of Bayes' theorem. Also, in comparison to other ML techniques, it is quicker and simpler to construct. Additionally, it doesn't need a lot of training data. Both discrete and continuous data may be used with it. It is extremely scalable and unaffected by insignificant features.

Decision Trees
A decision tree is a type of tree structure that resembles a flowchart and is frequently employed in supervised machine learning for classification and prediction. A DT may be transformed into a set of rules, with each path from the root node to a leaf node serving as a different rule. In a decision tree, each leaf node has a class that is reached if the attributes match the conditions on the branches that lead to it. Each internal node corresponds to a test, condition, or attribute.[6]

KNN
The machine learning approach known as kNN, which is supervised and nonparametric, is used to solve classification and regression problems. Labeled data is used with supervised algorithms. The technique relies on the distances between points, which may be calculated in a few different ways. A distance must always be zero or positive; to ensure this, the differences are squared, raised to a given power, or taken as absolute values. Pre-processing of all the labelled data is necessary before we apply the kNN algorithm. All of the data must first be normalized. As kNN struggles when there are too many features present, feature selection must then be used to eliminate the insignificant features. Missing data must be filled in; otherwise, that particular record must be eliminated. The performance can be enhanced by including more training samples. The fundamental drawback of KNN is that as the size of the dataset grows, the cost of computing rises and the algorithm's speed decreases.

Random Forests (RF)
The RF technique is a good example of ensemble learning in action, since it combines several classifiers to tackle a challenging problem and improve a model's performance. The "forest" created with this approach is actually a collection of decision trees. At each decision split, the candidate features are chosen at random, which reduces the correlation across trees and improves prediction. The Random Forest ML
classification approach generates the final output by combining the results of all the decision trees, each of which is trained on a smaller subset of the dataset. Random Forest belongs to the bagging subcategory of ensemble learning methods. A sample of rows and features from the primary dataset is selected at random and fed into each of the random forest's decision trees. It can carry out both regression and classification tasks. It also works well with large, high-dimensional datasets and, most significantly, it improves the model's accuracy and reduces the overfitting problem.
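The bagging idea described here can be illustrated with a minimal sketch. It assumes scikit-learn's RandomForestClassifier and a synthetic dataset from make_classification; the dataset, the number of trees, and every parameter value are illustrative assumptions, since the paper does not state the forest's settings. The sketch combines the trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the paper's dataset).
X, y = make_classification(n_samples=300, n_features=7, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# bootstrap=True draws a random sample of rows for every tree (bagging);
# max_features="sqrt" considers a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=25, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Combine the trees by hand: average their class probabilities and
# pick the class with the highest averaged probability.
avg_proba = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

forest_pred = forest.predict(X_test)
accuracy = (forest_pred == y_test).mean()
```

In scikit-learn the trees are combined by averaging class probabilities rather than by counting hard votes, which is why the manual combination above uses predict_proba.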
records for each of the following crops: rice, maize, chickpea, kidney beans, pigeon peas, moth beans, mung beans, black gram, lentil, pomegranate, banana, mango, grapes, watermelon, muskmelon, apple, orange, papaya, coconut, cotton, jute, coffee.

Data Pre-Processing
Crop Prediction
Before modelling the data, we need to carry out data pre-processing. It is done in the following steps as shown in Figure.

Handling Missing Values
The line df = pd.read_csv('Crop_recommendation2.csv', na_values='=') reads the CSV file into a DataFrame (df), replacing any occurrences of '=' with NaN values, which are commonly used to represent missing data in pandas.

Separating the Target Variable
The line b = df['label'] extracts the target variable column ('label') from the DataFrame df and assigns it to the variable b.

Creating a Preprocessing Pipeline
The code creates a pipeline (my_pipeline) using scikit-learn's Pipeline class. The pipeline consists of two steps:

Imputation
The missing values in the DataFrame are imputed using the SimpleImputer transformer with a strategy set to "mean". This strategy replaces the missing values with the mean of the corresponding column.

Standardization
The features are standardized using the StandardScaler transformer. This step scales the features to have zero mean and unit variance.

Applying the Preprocessing Pipeline
The line X = my_pipeline.fit_transform(df) applies the preprocessing pipeline (my_pipeline) to the entire DataFrame (df). It fits the pipeline on the data to learn the mean values (for imputation) and the standardization parameters. Then, it transforms the data by applying the learned transformations.

Train-Test Split
The train_test_split() function from scikit-learn is used to split the processed features (X) and the target variable (b) into training and testing sets. The stratify=b parameter ensures that the class distribution is maintained in both the training and testing sets. The split is performed with a test size of 30% (test_size=0.3) and a random state of 42 (random_state=42).

These pre-processing steps help handle missing values, standardize the features, and split the data into training and testing sets for further analysis and model training.
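The pre-processing steps described in this section can be put together as a short runnable sketch. The names df, b, and my_pipeline follow the text; the inline CSV rows are hypothetical stand-ins for 'Crop_recommendation2.csv', the step names 'imputer' and 'std_scaler' are illustrative, and dropping the 'label' column before scaling is an assumption, since mean imputation cannot be applied to a text column.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for 'Crop_recommendation2.csv'; '=' marks a missing value.
csv_text = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.9,82.0,6.5,202.9,rice
85,58,41,21.8,80.3,7.0,226.7,rice
60,55,44,=,82.3,7.8,263.9,rice
74,35,40,26.5,80.2,6.9,242.8,rice
78,42,42,20.1,81.6,7.6,262.7,rice
71,54,16,22.6,63.7,5.7,87.8,maize
61,44,17,26.1,71.6,6.9,102.3,maize
80,43,16,23.6,=,6.6,66.7,maize
73,58,21,19.9,57.7,6.6,60.7,maize
61,38,20,18.5,62.7,5.9,65.4,maize
"""

# Handling missing values: every '=' is read as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values='=')

# Separating the target variable.
b = df['label']
features = df.drop(columns='label')  # assumption: label excluded before scaling

# Two-step pipeline: mean imputation, then standardization.
my_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('std_scaler', StandardScaler()),
])
X = my_pipeline.fit_transform(features)

# Stratified 70/30 split with a fixed random state.
X_train, X_test, b_train, b_test = train_test_split(
    X, b, test_size=0.3, random_state=42, stratify=b)
```

This sketch follows the order given in the text, fitting the pipeline on all rows before splitting. Fitting it on the training rows only and then calling transform on the test rows would keep test-set statistics out of the imputation means and scaling parameters.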
Feature Selection
A machine learning model's performance can be improved through feature selection, which is the process of choosing a subset of the relevant features from available data. For crop prediction, the following features were selected: Nitrogen, Phosphorus, Potassium, Temperature, Humidity, pH, and Rainfall. For yield prediction, the features selected are as follows: City, Crop, Annual Rainfall (in mm), Season.

as input along with other data. This helps the system to give real-time predictions. The "OpenWeatherMap" API is used for this purpose. The API URL is created from the base URL and an API key, which is unique to each subscription. The user's city name is passed in the complete URL as a parameter and the response is collected. From the collected response, the required fields, i.e., temperature and humidity, are passed to the ML model for predicting the crop.
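As a hedged sketch of the weather lookup described above, the following builds the request URL for OpenWeatherMap's current-weather endpoint and extracts temperature and humidity from a decoded response. The helper names build_weather_url and extract_model_inputs, the placeholder key, the units=metric parameter, and the sample response are all assumptions for illustration; the paper does not show its own request code, and no network call is made here.

```python
from urllib.parse import urlencode

# Current-weather endpoint of the OpenWeatherMap API.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Return the complete request URL for a city name (hypothetical helper)."""
    # units=metric requests degrees Celsius; whether the paper's model
    # expects Celsius is an assumption.
    params = {"q": city, "appid": api_key, "units": "metric"}
    return BASE_URL + "?" + urlencode(params)


def extract_model_inputs(response):
    """Pick temperature and humidity out of a decoded JSON response."""
    main = response["main"]
    return {"temperature": main["temp"], "humidity": main["humidity"]}


# Placeholder key and a hand-written sample response, so nothing is fetched.
url = build_weather_url("Pune", "YOUR_API_KEY")
sample_response = {"name": "Pune", "main": {"temp": 27.4, "humidity": 61}}
inputs = extract_model_inputs(sample_response)
```

In a deployed backend the response would come from an HTTP GET on url (for example with urllib.request.urlopen followed by json.load), and the two extracted values would be combined with the user-supplied soil parameters before calling the crop classifier.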
As shown in Figure 4, rainfall is the most important feature for crop prediction, followed by humidity, Potassium (K), Phosphorus (P), Nitrogen (N), and pH.

The classification models' results for crop prediction are depicted in Figure 5.

When trained on the dataset, KNN gives an accuracy of 97.72%, RF gives an accuracy of 99.24%, the Naïve Bayes classifier has an accuracy of 99.39%, and Logistic Regression has an accuracy of 94.69%. Based on these results, the Naïve Bayes classifier is incorporated in the backend for crop prediction.
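A comparison of this kind can be sketched as follows. The data is synthetic (make_blobs), so the printed accuracies will not match the figures reported above; the choice of GaussianNB as the Naïve Bayes variant and all hyperparameters are assumptions, as the paper does not state them.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 7 features (hypothetical stand-in for the crop data).
X, y = make_blobs(n_samples=400, centers=4, n_features=7,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),  # assumption: the Gaussian variant
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every classifier on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# The model with the highest accuracy would be the one kept for the backend.
best_model = max(scores, key=scores.get)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```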
mapped with its original feature.
4. Aggregate Feature Importances: By aggregating, we get every categorical feature's importance.
5. Rank Feature Importances in descending order of importance.

As shown in Figure 7, Crop is the most important feature for predicting yield, followed by District, Rainfall, and Season.
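The aggregation and ranking steps can be sketched as below, assuming the categorical columns were one-hot encoded and the importances come from a random forest regressor. The records are synthetic, the "__" separator used to map encoded columns back to their original feature is an illustrative choice, and the column names District, Crop, Season, and Rainfall are taken from the features named in the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical yield records; the values are synthetic, not the paper's data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "District": rng.choice(["Pune", "Nashik", "Nagpur"], size=n),
    "Crop": rng.choice(["Rice", "Maize", "Cotton"], size=n),
    "Season": rng.choice(["Kharif", "Rabi"], size=n),
    "Rainfall": rng.uniform(300, 1500, size=n),
})
crop_effect = df["Crop"].map({"Rice": 3.0, "Maize": 2.0, "Cotton": 1.0})
y = crop_effect + 0.001 * df["Rainfall"] + rng.normal(0, 0.1, size=n)

# One-hot encode the categorical columns; "__" separates feature and category.
X = pd.get_dummies(df, columns=["District", "Crop", "Season"],
                   prefix_sep="__", dtype=float)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Map each encoded column back to its original feature name.
encoded = pd.Series(model.feature_importances_, index=X.columns)
original = np.array([name.split("__")[0] for name in X.columns])

# Aggregate per original feature, then rank in descending order.
aggregated = encoded.groupby(original).sum().sort_values(ascending=False)
```

Summing the importances of all dummy columns that belong to one categorical feature gives a single score per original feature, which makes features with many categories comparable with numeric ones such as rainfall.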
Yield prediction is done by regression. For comparison between different regression models, performance metrics like Mean Absolute Error, Median Absolute Error, and R2 Score are used. The results are depicted in Figure 8.
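The three metrics can be computed with scikit-learn as in the sketch below. The evaluate helper is hypothetical, the data is synthetic (make_regression), and the hyperparameters are assumptions, so the resulting numbers are not comparable with those reported in the paper.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(y_true, y_pred):
    """Return the three metrics used to compare the regressors."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MedAE": median_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }


# Synthetic regression data (hypothetical stand-in for the yield dataset).
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

regressors = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
}

# Fit each regressor on the same split and score it on the held-out rows.
results = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    results[name] = evaluate(y_test, reg.predict(X_test))
```

Lower MAE and Median Absolute Error and a higher R2 indicate a better fit. The median is less sensitive than the mean to a few large errors, which is one reason to report both.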
Random Forest Regressor gives the most reliable results when given the required inputs, with a Mean Absolute Error of 0.64, Median Absolute Error of 0.16, and R2 score of 0.96. Decision Tree Regressor has a Mean Absolute Error of 0.80, Median Absolute Error of 0.18, and R2 Score of 0.94. Linear Regression
has a Mean Absolute Error of 1.08, Median Absolute Error of 0.47, and R2 Score of 0.92.

Conclusion
Crop yield prediction is a complex process which relies on several different factors including weather, soil, fertilizers, and pest infestations. In this paper, we predict the crop yield using weather and soil parameters. The research is based on datasets limited to districts in Maharashtra. The system incorporates regression techniques to estimate yield and multi-class classification to predict the type of crop. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39%. The suggested method aids farmers in choosing which crop to plant in the field and in estimating how much yield a crop would give in that specific environment. The dataset used in the research can be improved by collecting real-time data through IoT devices. Also, factors like irrigation and fertilizer use can be included for better prediction. A mobile app can be developed with added services like price estimates in accordance with current market prices. Paid datasets may bring more reliable and accurate data, which in turn might improve model accuracy. They may contain more features that correlate better with the label.

Acknowledgements
We are grateful to Prof. Pritesh A Patil for providing valuable input and feedback throughout the research process.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflict of Interest
The authors declare no conflict of interest regarding this research. However, it should be noted that the first author of this paper is an employee of a company that develops and markets machine learning software for crop yield prediction. The results and conclusions presented here are solely based on the authors' research and do not reflect any external influence.
References
Conference on Intelligent Computing and Control Systems (ICICCS). doi:10.1109/iciccs51141.2021.9432236
11. Ranjini B Guruprasad, Kumar Saurav, Sukanya Randhawa, "Machine learning methodologies for paddy yield estimation in India: a case study", 2019.
12. S. P. Raja, B. Sawicka, Z. Stamenkovic and G. Mariammal, "Crop prediction based on characteristics of the agricultural environment using various feature selection techniques and classifiers," IEEE Access, vol. 10, pp. 23625-23641, 2022, doi: 10.1109/ACCESS.2022.3154350.
13. Venugopal, Anakha, S, Aparna, Mani, Jinsu, Mathew, Rima, Williams, Vinu. "Crop yield prediction using machine learning algorithms." International Journal of Engineering Research & Technology (IJERT) NCREIS – 2021, vol. 09, no. 13, pp. 1-6, Dec. 2021.
14. S. M. Pande, P. K. Ramesh, A. Anmol, B. R. Aishwarya, K. Rohilla and K. Shaurya, "Crop recommender system using machine learning approach," 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 2021, pp. 1066-1071, doi: 10.1109/ICCMC51019.2021.9418351.
15. Suresh, N., et al. "Crop yield prediction using random forest algorithm." 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 279-282, 2021, doi: 10.1109/ICACCS51430.2021.9441871.
16. E. Manjula and S. Djodiltachoumy, "A model for prediction of crop yield," Int. J. Comput. Intell. Inform., vol. 6, no. 4, pp. 298–305, 2017.
17. van Klompenburg, T., Kassahun, A., & Catal, C. (2020). Crop yield prediction using machine learning: A systematic literature review. Computers and Electronics in Agriculture, 177, 105709. doi: 10.1016/j.compag.2020.105709.
18. M. Liu, T. Wang, A. K. Skidmore, and X. Liu, "Heavy metal-induced stress in rice crops detected using multi-temporal Sentinel-2 satellite images," Sci. Total Environ., vol. 637-638, pp. 18-29, Oct. 2018.
19. K. E. Eswari and L. Vinitha, "Crop yield prediction in Tamil Nadu using Bayesian network," Int. J. Intell. Adv. Res. Eng. Comput., vol. 6, no. 2, pp. 1571-1576, 2018.
20. D. A. Reddy, B. Dadore, and A. Watekar, "Crop recommendation system to maximize crop yield in Ramtek region using machine learning," Int. J. Sci. Res. Sci. Technol., vol. 6, no. 1, pp. 485-489, Feb. 2019.
21. Priya, P., Muthaiah, U., Balamurugan, M. "Predicting yield of the crop using machine learning algorithm." International Journal of Computer Science and Mobile Computing, vol. 4, no. 5, pp. 1-7, May 2015.
22. Medar, Ramesh, S, Vijay, Shweta. "Crop yield prediction using machine learning techniques." International Journal of Advanced Research in Computer Science and Software Engineering, vol. 9, no. 5, pp. 1-6, May 2019.
23. https://www.kaggle.com/