Optimizing Flight Booking Decisions through Machine Learning Price Predictions
Balamurugan
Roll No: 20S2CS06
Technical Architecture:
Project Flow:
● The user interacts with the UI to enter the input.
● The entered input is analysed by the integrated model.
● Once the model analyses the input, the prediction is showcased on the UI.
Project Structure:
Create the project folder with the files shown below.
• We are building a Flask application, which needs the HTML pages stored in the templates folder and a Python script, app.py, for server-side scripting.
• model1.pkl is our saved model; we will later use it for the Flask integration.
• The Training folder contains the model training files, and the training_ibm folder contains the IBM deployment files.
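An indicative layout of the folder, based on the files named above (the root folder name is illustrative):

Flight_Price_Prediction/
├── app.py
├── model1.pkl
├── templates/
│   ├── home.html
│   ├── predict.html
│   └── submit.html
├── Training/
└── training_ibm/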
Milestone 1: Define Problem / Problem Understanding
Activity 1: Specify the business problem
Refer to the Project Description.
Business Model/Impact: Flight booking platforms may charge convenience fees for services such as ticket booking, rescheduling, and cancellation. Advertising for brand awareness and marketing to reach potential travellers also helps generate revenue, and accurate fare predictions build user trust and engagement.
Milestone 2: Data Collection & Preparation
ML depends heavily on data; it is the most crucial aspect that makes algorithm training possible. This section covers downloading the required dataset.
In this project we have used .csv data. This data is downloaded from kaggle.com. Please refer to the link
given below to download the dataset.
Link: https://www.kaggle.com/code/anshigupta01/flight-price-prediction/data
Once the dataset is downloaded, let us read and understand the data properly with the help of some visualisation and analysis techniques.
Note: There are a number of techniques for understanding data, but here we have used only some of them. You can additionally apply other techniques.
The downloaded dataset is not directly suitable for training the machine learning model, as it may contain a lot of noise and inconsistency, so we need to clean it properly in order to get good results. This activity includes the following steps.
Note: These are the general steps for pre-processing data before using it for machine learning. Depending on the condition of your dataset, you may or may not have to go through all of them.
We have 1 missing value in the Route column and 1 missing value in the Total_Stops column. We will meaningfully replace these missing values as we go further.
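A minimal sketch of this step in pandas, assuming the downloaded CSV is named Data_Train.csv (use the actual file name from your Kaggle download):

import pandas as pd

# File name is an assumption; use the name of the CSV downloaded from Kaggle
df = pd.read_csv("Data_Train.csv")

# Inspect the structure and count missing values per column
print(df.info())
print(df.isnull().sum())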
We now start exploring the columns available in our dataset. The first thing we do is create a list of categorical columns and check the unique values present in these columns (a sketch follows the list below).
1. The Airline column has 12 unique values: 'IndiGo', 'Air India', 'Jet Airways', 'SpiceJet', 'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia', 'Vistara Premium economy', 'Jet Airways Business', 'Multiple carriers Premium economy', 'Trujet'.
2. The Source column has 5 unique values: 'Bangalore', 'Kolkata', 'Chennai', 'Delhi' and 'Mumbai'.
3. The Destination column has 6 unique values: 'New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'.
4. The Additional info column has 10 unique values: 'No info', 'In-flight meal not included', 'No check-in baggage included', '1 Short layover', 'No Info', '1 Long layover', 'Change airports', 'Business class', 'Red-eye flight', '2 Long layover'.
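One way to list these unique values, building on the dataframe loaded above:

# Collect the object-typed (categorical) columns and inspect their unique values
categorical = [col for col in df.columns if df[col].dtype == "object"]
for col in categorical:
    print(col, "->", df[col].nunique(), df[col].unique())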
We now split the Date column to extract the 'Date', 'Month' and 'Year' values and store them in new columns in our dataframe, for example:
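A minimal sketch, assuming the journey date column is named 'Date_of_Journey' (as in the Kaggle data) and holds dd/mm/yyyy strings:

# Split the dd/mm/yyyy journey date into separate string columns
df["Date"] = df["Date_of_Journey"].str.split("/").str[0]
df["Month"] = df["Date_of_Journey"].str.split("/").str[1]
df["Year"] = df["Date_of_Journey"].str.split("/").str[2]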
• Further, we split the Route column to create multiple columns with the cities that the flight travels through. We check the maximum number of stops a flight has, to confirm the maximum number of cities in the longest route.
• Since the maximum number of stops is 4, there should be at most 6 cities in any particular route. We split the data in the Route column and store all the city names in separate columns.
• In a similar manner, we split the Dep_time column and create separate columns for the departure hours and minutes.
• Further, to separate the arrival date from the arrival time, we split the 'Arrival_Time' column and create an 'Arrival_date' column. We also split the time into 'Arrival_time_hours' and 'Arrival_time_minutes', similar to what we did with the 'Dep_time' column.
• Next, we divide the 'Duration' column into 'Travel_hours' and 'Travel_mins'.
• We also treat the 'Total_Stops' column: we replace non-stop flights with the value 0 and extract the integer part of the remaining entries.
• We proceed further to the 'Additional_info' column, where we observe two categories signifying 'No info', which are separate only because the 'I' in 'No Info' is capitalised. We replace 'No Info' with 'No info' to merge them into a single category (all of these splitting steps are sketched after this list).
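A sketch of these splitting steps; the column names ('Route', 'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops', 'Additional_Info') are assumed to match the Kaggle data, and the extracted values are kept as strings here since the data types are converted in a later step:

# Split the Route string (e.g. 'BLR → DEL') into one city column per hop;
# with at most 4 stops, 6 city columns suffice
route = df["Route"].str.split(" → ", expand=True)
route.columns = ["city1", "city2", "city3", "city4", "city5", "city6"][: route.shape[1]]
df = df.join(route)

# Departure time: split 'HH:MM' into hours and minutes
df[["Dep_time_hours", "Dep_time_minutes"]] = df["Dep_Time"].str.split(":", expand=True)

# Arrival time may carry a date suffix, e.g. '04:25 07 Jun'
df["Arrival_date"] = df["Arrival_Time"].str.split(" ").str[1]
arr = df["Arrival_Time"].str.split(" ").str[0]
df["Arrival_time_hours"] = arr.str.split(":").str[0]
df["Arrival_time_minutes"] = arr.str.split(":").str[1]

# Duration like '2h 50m' -> travel hours and minutes
df["Travel_hours"] = df["Duration"].str.split("h").str[0]
df["Travel_mins"] = df["Duration"].str.split(" ").str[1].str.replace("m", "")

# Total_Stops: 'non-stop' -> 0, otherwise keep the leading integer part
df["Total_Stops"] = df["Total_Stops"].replace("non-stop", "0 stop")
df["Total_Stops"] = df["Total_Stops"].str.split(" ").str[0]

# Merge the duplicated 'No Info' category into 'No info'
df["Additional_Info"] = df["Additional_Info"].replace("No Info", "No info")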
We now drop all the original columns from which we have extracted the useful information. We also drop some columns like 'city4', 'city5' and 'city6', since the majority of the data in these columns is NaN (null). As a result, we now obtain 20 different columns, which we will feed to our ML model. But first, we treat the missing values and explore the contents of the columns and their impact on the flight price, to arrive at the final set of columns.
• After dropping these columns, we can see the meaningful columns for predicting the flight price, without NaN values.
We replace missing values in the 'Arrival_date' column with the values in the 'Date' column, since these are rows where the flight took off and landed on the same date.
We also replace missing values in 'Travel_mins' with 0, since they represent travel times expressed in hours only, with no additional minutes. For example:
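A sketch of both replacements with pandas fillna:

# Same-day flights: a missing arrival date means the flight landed on the journey date
df["Arrival_date"] = df["Arrival_date"].fillna(df["Date"])

# A missing minutes part means the duration was a whole number of hours
df["Travel_mins"] = df["Travel_mins"].fillna(0)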
• Using the above steps, we were able to treat all the missing values in our data. We check the info of our data again and find that the dataset still has the 'object' data type for multiple columns where it should be 'int'.
• Hence, we change the data type of the required columns.
• During this step, we face an issue converting the 'Travel_hours' column: one row has the value '5m', which prevents its conversion to 'int'.
• This value signifies a flight time of 5 minutes, which is obviously wrong, as a plane cannot fly BOMBAY->GOA->PUNE->HYDERABAD in 5 minutes! (The flight has a 'Total_stops' value of 2.)
• After removing this erroneous row, we convert the 'Travel_hours' column to the 'int' data type, and the operation succeeds, for example:
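A sketch of this step, assuming we simply drop the corrupt row before casting:

# The 5-minute 'duration' row is clearly corrupt, so remove it before casting
df = df[df["Travel_hours"] != "5m"]

# Cast the extracted string columns to integers
for col in ["Date", "Month", "Year", "Dep_time_hours", "Dep_time_minutes",
            "Arrival_date", "Arrival_time_hours", "Arrival_time_minutes",
            "Travel_hours", "Travel_mins", "Total_Stops"]:
    df[col] = df[col].astype(int)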
We now have a treated dataset with 10682 rows and 17 columns (16 independent and 1 dependent
variable).
We create separate lists of categorical columns and numerical columns for plotting and analysing the data.
• We take two variables, 'x' and 'y', to split the dataset for training and testing.
• For x, the dataframe's drop method is used to remove the target variable; y is assigned the target variable ('Price').
• Scaling the features makes the flow of gradient descent smooth and helps algorithms quickly reach the minima of the cost function.
• Without feature scaling, an algorithm may be biased toward the features with larger magnitudes; scaling brings every feature into the same range so the model uses every feature sensibly.
• There are several popular techniques for scaling features; here we use StandardScaler, which transforms each feature so that it has mean = 0 and standard deviation = 1 (a sketch follows this list).
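A sketch of the scaling step; the label-encoding of the remaining categorical columns is an assumed prerequisite, since StandardScaler needs numeric input:

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Categorical columns must be numeric before scaling; label-encoding is one option
for col in df.select_dtypes("object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Separate features and target
x = df.drop("Price", axis=1)
y = df["Price"]

# Standardize every feature to mean 0, standard deviation 1
scaler = StandardScaler()
x = scaler.fit_transform(x)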
Now let's split the dataset into train and test sets.
For splitting the training and testing data, we use the train_test_split() function from sklearn, passing x, y, test_size and random_state as parameters, for example:
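A sketch of the split; the test_size of 0.2 and random_state of 42 are illustrative choices:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; random_state fixes the shuffle for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)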
Milestone 4: Model Building
Now our data is cleaned and it's time to build the model. We can train our data on different algorithms; for this project we apply several regression algorithms, and the best model is saved based on its performance.
Functions named RandomForest, GradientBoosting and AdaBoost are created, and the train and test data are passed as parameters. Inside each function, the corresponding algorithm is initialized and the training data is fitted with the .fit() function. The test data is predicted with the .predict() function and saved in a new variable. To evaluate each model, the r2_score, mean_absolute_error and mean_squared_error are computed. A combined sketch follows.
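Rather than three near-identical functions, a single helper can capture the same fit/predict/evaluate pattern described above; this is a sketch, not the project's exact code:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def evaluate(model, x_train, x_test, y_train, y_test):
    # Fit on the training split and predict the held-out test split
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print(type(model).__name__)
    print("R2 :", r2_score(y_test, y_pred))
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("MSE:", mean_squared_error(y_test, y_pred))
    return model

rf = evaluate(RandomForestRegressor(), x_train, x_test, y_train, y_test)
gb = evaluate(GradientBoostingRegressor(), x_train, x_test, y_train, y_test)
ab = evaluate(AdaBoostRegressor(), x_train, x_test, y_train, y_test)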
Activity 2: Regression Model
KNeighborsRegressor, SVR, DecisionTreeRegressor
Functions named KNN, SVR and DecisionTree are created in the same way, with the train and test data passed as parameters. Inside each function, the corresponding algorithm is initialized, the training data is fitted with .fit(), the test data is predicted with .predict() and saved in a new variable, and the r2_score, mean_absolute_error and mean_squared_error are computed. The evaluation helper sketched above can be reused for these models as well.
Activity 3: Checking Cross Validation for RandomForestRegressor
We perform cross validation of our model to check whether it has any overfitting issue, by checking its ability to make predictions on new data using k-folds. We test cross validation for the Random Forest and Gradient Boosting regressors.
Activity 4: Hyperparameter tuning of the model
RandomizedSearchCV is a technique used to validate the model with different parameter combinations: it samples a random grid of parameters and tries the combinations to compare which one gives the best results. We apply a randomized search to our model, as sketched below.
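A sketch with RandomForestRegressor; the parameter ranges here are illustrative, not the project's actual grid:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Candidate parameter ranges (assumed values; adjust to your data)
param_dist = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(RandomForestRegressor(), param_dist,
                            n_iter=10, cv=3, random_state=42)
search.fit(x_train, y_train)
print(search.best_params_)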
From sklearn, cross_val_score is used to evaluate the score of the model. As parameters, we pass rf (the model), x, y and cv (3 folds). Our model performs well, for example:
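A sketch reusing the random forest fitted earlier:

from sklearn.model_selection import cross_val_score

# 3-fold cross validation; fold scores close to the training score
# suggest the model is not overfitting
scores = cross_val_score(rf, x, y, cv=3)
print(scores, scores.mean())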
Now let’s see the performance of all the models and save the best model
Activity 5: Accuracy
Checking the train and test accuracy of the RandomizedSearchCV-tuned RandomForestRegression model.
Checking the train and test accuracy of the RandomizedSearchCV-tuned KNN model.
Observing the train and test accuracy of the two models, we get better accuracy with RandomForestRegression.
Saving the best model after comparing performance with different evaluation metrics means selecting the model with the highest performance and saving its weights and configuration. This avoids having to retrain the model every time it is needed and allows it to be reused in the future, for example:
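A sketch using pickle, assuming the tuned random forest from the randomized search was the best performer:

import pickle

# Persist the best model so the Flask app can load it later
best_model = search.best_estimator_
with open("model1.pkl", "wb") as f:
    pickle.dump(best_model, f)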
In this section, we will build a web application that is integrated with the model we built. A UI is provided for the user, who has to enter the values for prediction. The entered values are passed to the saved model and the prediction is showcased on the UI.
● home.html
● predict.html
● submit.html
Load the saved model. Importing the flask module into the project is mandatory. An object of the Flask class is our WSGI application, and the Flask constructor takes the name of the current module (__name__) as an argument.
Here we use the route() decorator to bind URLs to the functions that render the HTML pages created earlier.
In the sketch below, the '/' URL is bound to a function that renders home.html; hence, when the home page of the web server is opened in the browser, that HTML page is rendered. Whenever you submit values from the HTML page, they can be retrieved using the POST method.
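A sketch of app.py; the '/predict' and '/pred' route names and the form handling are assumptions that must match the forms in the HTML pages, and if the model was trained on scaled features, the same scaler must be applied to the form inputs:

import pickle
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained model saved during model building
model = pickle.load(open("model1.pkl", "rb"))

@app.route("/")
def home():
    # '/' is bound to this function, which renders the landing page
    return render_template("home.html")

@app.route("/predict")
def predict_page():
    return render_template("predict.html")

@app.route("/pred", methods=["POST"])
def predict():
    # Field names and preprocessing are assumptions; they must match
    # the form in predict.html and the training-time feature order
    features = [float(v) for v in request.form.values()]
    price = model.predict([features])[0]
    return render_template("submit.html", prediction_text=f"Estimated price: {price:.2f}")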
Main Function:
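The script is started through the standard Python entry point:

if __name__ == "__main__":
    # Start the development server; debug=True enables auto-reload
    app.run(debug=True)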
Activity 2.3: Run the web application
● Open anaconda prompt from the start menu
● Navigate to the folder where your python script is.
● Now type the “python app.py” command.
● Navigate to the localhost where you can view your web page.
● Click on the predict button from the top left corner, enter the inputs, click on the submit button, and see the
result/prediction on the web.
Now, go to the web browser and open the localhost URL (http://127.0.0.1:5000) to see the result below.
When you click on the Predict button, you are redirected to the prediction page.
Input 1: the user gives the inputs and clicks the Submit button to get the predicted result.
When you click the Submit button at the top right corner, you are redirected to submit.html.
Activity 1: Record an explanation video for the project's end-to-end solution.