Final Report
Executive Summary
The data used in this project required extensive preprocessing before use. The dataset initially contained several data types that were not conducive to this research and required streamlining. The data also carries important spatial trends that must be preserved. Finally, z-score scaling and outlier removal were performed before modeling. A 70/30 training/testing split was used in this project.
This project used data mining techniques to analyze a dataset with a binary target variable, whether it will rain the next day, implementing a set of different models to explore their effectiveness on this dataset. First, we employed K-Nearest Neighbors (KNN) and Naive Bayes classifiers, known for their simplicity and efficiency in handling classification problems. Then, we built a Random Forest classifier, which offers robustness against overfitting and can handle complex interactions between features. Finally, to investigate the influence of temporal patterns in the weather data, we included a Long Short-Term Memory (LSTM) neural network.
The project yielded encouraging results. According to the chosen evaluation metrics, the most effective model was the random forest classifier, which yielded the highest ROC AUC, F1 measure, and accuracy. The LSTM was the least effective method by the same metrics. The random forest classifier likely suffered from some amount of overfitting and could be improved in the future by further hyperparameter tuning. Other future improvements could include multivariate imputation by chained equations (MICE) or other imputation methods to fill the NA values rather than excluding those records. This improvement would likely benefit the LSTM most, since it analyzes temporal trends and would then have more consistent daily data to work with. Another possible improvement to this study would be the introduction of the ArcPy library to further analyze spatial patterns in this dataset.
Problem Description
In a world where extreme weather events have become increasingly common, understanding
weather forecasting has also become increasingly important. Data mining and predictive analytics offer us
many techniques to address this issue (Brenner, 2015). The country of Australia will be the focus of this
project. Over the past year Australia has had issues with rainfall. Western Australia has had some of the
worst droughts and dry seasons it has seen in recorded history. Over the past 11 to 16 months Western
Australia has experienced record lows in rainfall across the entire region and projections for the near
future have also forecast low expectations for rainfall. This can be contrasted with the east, which has experienced some of the wettest seasons in recorded history. Despite the low rainfall in the west, rainfall averages for the entire country were 86.7% above the average March rainfall from 1961 to 1990 (Australian Government Bureau of Meteorology, 2024). These unprecedented and divergent trends indicate that further research on rain patterns is necessary. Given weather records from a single day, this model will attempt to predict whether it will rain tomorrow.
Weather Forecasting Using Machine Learning Algorithms Cimorelli, Bulfon 3
Data Description
The dataset, downloaded from Kaggle.com, contains 145,460 records of daily measurements across 20 parameters.
The data contains measurements from December 2008 to June 2017 (‘Date’) for 49 cities in Australia (‘Location’). The dataset also contains a large amount of missing data (10.25% missing values), so further data preparation was necessary before model building.
There were many NaNs in this dataset (Fig. 1). One option to deal with this issue was to apply MICE, as documented in similar modeling procedures. We chose instead to omit records that contained NaNs, as imputation was not viable in this case: certain cities in this dataset systematically did not record certain variables, and imputing values into these records would have compromised their integrity.
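A minimal sketch of this omission step, assuming pandas (the report does not name its tooling; the toy values below are illustrative, not from the dataset):

```python
import pandas as pd

# Drop records containing missing values, mirroring the report's choice
# to omit NaN rows rather than impute them.
df = pd.DataFrame({
    "MinTemp": [13.4, None, 21.0],
    "Rainfall": [0.6, 3.6, None],
})
clean = df.dropna()   # keep only fully observed records
print(len(clean))     # 1
```

`dropna()` removes any row with at least one missing value, which matches the systematic-missingness rationale above: rows from cities that never recorded a variable are excluded outright.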
The dataset was provided with a “RainTomorrow” column holding a string value: yes or no. We changed the column into a binary variable, where 0 indicates that no rain occurred the following day and 1 indicates that rain occurred the next day.
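The yes/no conversion described above can be sketched with a pandas mapping (a hypothetical one-liner; the report's actual code may differ):

```python
import pandas as pd

# Convert the string target into a 0/1 binary, as described above.
df = pd.DataFrame({"RainTomorrow": ["No", "Yes", "No"]})
df["RainTomorrow"] = df["RainTomorrow"].map({"No": 0, "Yes": 1})
print(df["RainTomorrow"].tolist())  # [0, 1, 0]
```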
Cities such as Sydney and Melbourne had separate Location values for the airport and the city itself; for the purposes of this project, airports were combined with the larger city they belong to.
The next data preparation step was applied to the wind direction variable, which was presented in cardinal format as a string. This column was converted to a numeric bearing where 0 indicated north and 90 indicated east.
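One way to sketch this conversion is a lookup from the 16-point compass to degrees (an assumed mapping; note that a 16-point compass yields 22.5-degree steps, so intermediate directions are not whole numbers):

```python
# Hypothetical cardinal-to-degrees mapping with 0 = north and 90 = east,
# as described above. Each compass point is 22.5 degrees apart.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
DIR_TO_DEG = {d: i * 22.5 for i, d in enumerate(COMPASS)}

print(DIR_TO_DEG["N"])  # 0.0
print(DIR_TO_DEG["E"])  # 90.0
```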
Given the size of the original dataset and the spatial nature of the data, we added the ability to filter the data and select measurements from either a single city or all cities (Fig. 3). The user can choose an option in a text box in our model. This allowed us to reduce the program's average response times while still ensuring high accuracy in the predictions, which was important as the dataset contained an average of around 1,600 measurements per city across dozens of cities. Filtering also helps preserve spatial trends integral to the data's nature.
Another step needed before implementing the predictive models was to balance the data. The class distribution of “RainTomorrow” was heavily imbalanced (Fig. 1), so the majority class was randomly undersampled to balance the dataset before use.
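The random undersampling step can be sketched as follows (an assumed approach using plain pandas; the report does not specify a library, and the toy frame below is illustrative):

```python
import pandas as pd

# Downsample the majority class to the size of the minority class,
# mirroring the random undersampling described above.
df = pd.DataFrame({
    "feature": range(10),
    "RainTomorrow": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
minority = df[df["RainTomorrow"] == 1]
majority = df[df["RainTomorrow"] == 0].sample(n=len(minority), random_state=1)
balanced = pd.concat([majority, minority])
print(len(balanced))  # 6
```

After this step both classes contribute the same number of records, at the cost of discarding some majority-class data.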
The last data preparation step was to scale the data. Without scaling, features with larger scales would dominate the model, even if their underlying relationship with the target variable is weaker. StandardScaler transforms all features to have a mean of 0 and a standard deviation of 1,
allowing the model to focus on the relative importance of each feature based on its variation within the
data. Lastly, the ‘Date’ column was dropped.
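A minimal sketch of the z-score scaling step with scikit-learn's StandardScaler (synthetic values for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize each feature to mean 0 and standard deviation 1, so that
# large-scale features do not dominate the model, as described above.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # [0. 0.]
print(X_scaled.std(axis=0).round(6))   # [1. 1.]
```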
Technical Solution
We used several different machine learning algorithms as prediction tools. Three of the four algorithms used in this project are supervised binary classification tools, as the target variable is either 0 (no rain) or 1 (rain); the fourth is a neural network. After analyzing the dataset and selecting the appropriate features and records, we applied the aforementioned data mining techniques to measure their performance and evaluate which one performs best on this dataset. The data was partitioned into a 70/30 training/testing split.
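The 70/30 partition can be sketched with scikit-learn (the fixed seed and stratification here are assumptions; the report does not state either):

```python
from sklearn.model_selection import train_test_split

# 70/30 training/testing split as described above. Toy data stands in
# for the weather features; stratify keeps class balance in both splits.
X = list(range(100))
y = [i % 2 for i in range(100)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
print(len(X_train), len(X_test))  # 70 30
```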
The first data mining technique examined was the K-Nearest Neighbors (KNN) algorithm for classification. KNN is a technique used in machine learning for both classification and regression tasks, which uses proximity to make predictions about the grouping of an individual data point (IBM, 2024). Its core principle assumes that similar data points tend to have similar characteristics or values: data points that are close in the feature space are more likely to share the same value of the dependent variable. When predicting the class of a given data point, the KNN model considers the K closest data points and chooses the majority class label among them.
To determine the best number of neighbors K for the model, we opted to use a grid search coupled with cross-validation. We defined a range of K values (1 to 30), which the grid search uses to train the model and estimate the best K value based on a defined performance metric, in our case accuracy.
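The search described above can be sketched with scikit-learn's GridSearchCV (synthetic data for illustration; the fold count is an assumption, as the report does not state it):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Grid search over K = 1..30 with cross-validation, scored on accuracy.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(search.best_params_["n_neighbors"])  # best K found on this data
```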
The second model we used in this project is the Naïve Bayes classifier. The model is based upon Bayes' theorem, which describes the probability of an event using the known probabilities of related events: the probability of A given event B is equal to the probability of B given A, multiplied by the probability of A, all divided by the probability of B (Lee, 2012). “Naive” comes from the fact that the algorithm assumes that each feature is independent of the others. Our model uses Gaussian Naive Bayes, which assumes that the data within each class follows a normal distribution.
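A minimal Gaussian Naive Bayes sketch with scikit-learn (synthetic data; the report's actual features and settings may differ):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: per class, each feature is modeled as a normal
# distribution, and Bayes' theorem combines the per-feature likelihoods.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
model = GaussianNB().fit(X, y)
preds = model.predict(X[:3])
print(preds.shape)  # (3,)
```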
The third model used in this project was a Random Forest classifier. Random forests are an ensemble learning method; ensemble methods combine the predictions of individual classifiers to improve the model's robustness and accuracy. Random forests are built upon decision trees, which split the data into nodes until certain hyperparameters or constraints are met. Random forests differ from single decision trees in that they introduce additional randomness (Kam Ho, 1995): first through bootstrap aggregation, and then by considering a random subset of features at each split. We chose not to tune certain hyperparameters because of hardware restrictions.
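A sketch of the random forest setup (synthetic data; the estimator count and seed mirror the settings reported in the Conclusions, but the real features differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Ensemble of bootstrapped decision trees; each split considers a random
# subset of features, as described above.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X, y)
print(len(forest.estimators_))  # 500
```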
The fourth and last model we implemented in this project is a Long Short-Term Memory (LSTM)
Neural Network. LSTM models utilize gated cell units, consisting of a forget, an input, and an output
gate. These gates act as sophisticated filters, regulating the flow of information. The forget gate decides
what information from the previous cell state to discard, the input gate determines what new information
to include, and the output gate controls what portion of the current cell state is exposed to the network.
This gating mechanism allows the LSTM to selectively remember relevant information over long time
intervals, effectively addressing the vanishing gradient problem, which characterizes Recurrent Neural
Networks (RNN). LSTMs are effective at processing and analyzing sequential data such as speech, text, and time series, which is the reason behind our choice to add an LSTM to this project.
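The gating mechanism described above can be illustrated with a single LSTM cell step in NumPy (a sketch of the mechanism with random, untrained weights; the project's actual network and its layer sizes are not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate (f, i, o) plus the candidate update g.
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: what to discard
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: what new info to add
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: what to expose
    g = np.tanh(W["g"] @ z + b["g"])  # candidate cell state
    c = f * c_prev + i * g            # updated cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h.shape, c.shape)  # (3,) (3,)
```

Because the cell state `c` is carried forward additively through the forget gate, gradients can flow across many time steps, which is how the architecture mitigates the vanishing gradient problem mentioned above.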
Conclusions
The results of the project were tested on the city of Perth, which was randomly selected as the test city. After finding an optimal K value of 26, the KNN model yielded a 92% ROC AUC and an 84% F1 measure.
Naive Bayes is commonly most effective with categorical data, but nevertheless yielded good results in this study, registering a ROC AUC of 90% and an accuracy of 83%. It was less effective than KNN and, as will be seen, the random forest, but more effective than the LSTM.
For the random forest classifier, the number of estimators was set to 500 and the random state to 1. The accuracy was 87%, the ROC AUC 95%, and the F1 measure 88%. The random forest classifier had the highest ROC AUC and F1 measure of all models, and it did the best job of predicting whether it would rain the next day.
For the LSTM network we did not randomly split the data into training and testing sets, as it is important to feed the model data in time-series order. The results were not as good as expected, and the model needs further fine-tuning. Due to time constraints and lack of experience, we were not able to raise the LSTM's performance to match the other models in this project. As can be seen from the ROC curve, with an AUC of 0.67 the model is not able to effectively distinguish between true positives and false positives.
Appendix
Australian Government Bureau of Meteorology. (2024). Drought statement, March 2024 and outlook. Retrieved from http://www.bom.gov.au/climate/data

Brenner, Steve. (2015). Re: What are the WRF-ARW weather model hardware and software requirements? Retrieved from https://www.researchgate.net/post/What_are_the_WRFARW_weather_model_hardware_and_software_requirements/5575da7a5e9d97487a8b442/citaion/download

IBM. (2024). What is the k-nearest neighbors (KNN) algorithm?

Kam Ho, Tin. (1995). Random decision forests. Retrieved from https://web.archive.org/web/20160417030218/http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf

Project code: https://colab.research.google.com/drive/1RmsL2enDSX44zgP2lHvu8M2UccyxTNPb?usp=sharing