Seminar Report
A Seminar Report
on
Sales Prediction Using Machine Learning
Submitted by
2023-2024
Under guidance of
Prof. Nikhil Khandare
CERTIFICATE
DECLARATION
I declare that this written submission represents my ideas in
my own words and where others ideas or words have been
included; I have adequately cited and referenced the original
sources.
Signature of Student
Anas Ahmad Ilyas Ahmad
Roll No : 222010001
VJTI, Mumbai
Date :
ACKNOWLEDGEMENT
For the help and encouragement in all aspects for this project, I
would like to express my sincere thanks to our guide, Professor
Nikhil Khandare. His expertise and patience were greatly
appreciated and assisted in the successful completion of this
project.
Signature of Student
Anas Ahmad Ilyas Ahmad
Roll No : 222010001
VJTI, Mumbai
Date :
Table of Contents
1 Introduction
2 Problem Statement
3 Literature Review
4 Modules
5 System Requirements
6 Conclusion
7 Future Scope
8 References
Abstract
Sales forecasting is the process of predicting future sales and
is a vital part of a business's financial planning. Most
companies depend heavily on predictions of future sales.
Accurate sales forecasting empowers organizations to make
informed business decisions and helps them predict short-term
and long-term performance. A precise forecast avoids
overestimating or underestimating future sales, either of which
may lead to great losses for a company. Past and current sales
statistics are used to estimate future performance, but
traditional forecasting methods struggle to produce accurate
forecasts. For this purpose, various machine learning
techniques have been developed. In this work, we take the Black
Friday dataset and analyse it in detail. We implement several
machine learning techniques and compare them using different
metrics. By analysing their performance, we try to suggest a
suitable predictive algorithm for our problem statement.
1. INTRODUCTION
Sales play a key role in business. At the company level, sales
forecasting is a major part of the business plan and a
significant input to decision-making. Organizations need to
produce the required quantity at the specified time, and sales
forecasting gives an idea of how an organization should manage
its budget, workforce and resources. It helps management
determine how many products should be manufactured, how much
revenue can be expected, and what employees, investment and
equipment will be required. By analysing future trends and
needs, sales forecasting helps to improve business growth.
Traditional forecasting systems have drawbacks related to the
accuracy of the forecasts and the handling of enormous amounts
of data. To overcome these problems, Machine Learning (ML)
techniques have been developed. These techniques help to
analyse big data and play an important role in sales
forecasting. Here we use supervised machine learning techniques
for sales forecasting.
2. PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base
and on demand prediction of sales trends. Sales forecasting is
the process of estimating future sales. Accurate sales
forecasts enable companies to make informed business decisions
and predict short-term and long-term performance. Companies can
base their forecasts on past sales data, industry-wide
comparisons, and economic trends. Sales forecasts help sales
teams achieve their goals by identifying early warning signals
in their sales pipeline so that they can course-correct before
it is too late. The goal is to improve on the accuracy of the
existing project so that companies can increase their sales and
profit. To that end, different algorithms are compared and the
most efficient one is chosen to improve the prediction further.
3. LITERATURE SURVEY
PAPER-1:
Intelligent Sales Prediction Using Machine Learning
Techniques.
Abstract: This research carries out a detailed study and
analysis of comprehensible predictive models to improve future
sales predictions. Traditional forecasting systems struggle to
handle big data and to forecast sales accurately.
Algorithms: The models implemented for prediction are Random
Forest, Gradient Boosting and Extremely Randomized Trees
(Extra Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.
PAPER-2:
Comparison of Different Machine Learning Algorithms for
Multiple Regression on Black Friday Sales Data.
Abstract: This study focuses on prediction models, with the aim
of developing an accurate and efficient algorithm that analyzes
past customer spending and outputs the future spending of
customers with the same features.
Algorithms: Regression, Decision Tree, XGBoost.
Conclusion: XGBoost.
PAPER-3:
Sales Prediction Using Machine Learning Algorithms.
Abstract: The aim of this paper is to propose a dimension for
predicting the future sales of Big Mart Companies keeping in
view the sales of previous years. A comprehensive study of
sales prediction is done using Machine Learning models.
Algorithms: Linear Regression, K-Neighbours Regressor,
XGBoost Regressor and Random Forest Regressor.
Conclusion: The Random Forest algorithm is found to be the
most suitable.
PAPER-4:
Forecasting of Walmart Sales using Machine Learning
Algorithm.
Abstract: The ability to predict data accurately is extremely
valuable in a vast array of domains such as stocks, sales,
weather or even sports. Presented here is the study and
implementation of several ensemble classification algorithms
employed on sales data, consisting of weekly retail sales
numbers from different departments in Walmart retail outlets
all over the United States of America.
Algorithms: The models implemented for prediction are
Random Forest, Gradient Boosting and Extremely Randomized
Trees (Extra Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.
PAPER-5:
Sales Prediction For Big Mart.
Abstract: A retail company wants a model that can predict sales
accurately so that it can keep track of customers' future
demand and update its inventory in advance. In this work, we
propose a technique to optimize the parameters and select the
best hyperparameters, combined with XGBoost techniques, for
forecasting the future sales of a retail company such as Big
Mart, and we found that our model produces better results.
Algorithms: XGBoost techniques.
Conclusion: Experimental analysis found that the technique
produces more accurate predictions.
4. MODULES
4.1 DATA COLLECTION:
The dataset has been collected from https://www.kaggle.com/
The training dataset contains 12 columns and 550069 rows. The
test dataset contains 12 columns and 233600 rows. The 12
variables are User ID, Gender, City Category, Product ID, Total
count of years stayed in current city, Age, Occupation, Marital
status, Product Category1, Product Category2, Product Category3
and Purchase amount.
4.3 ALGORITHMS
Linear Regression:
Linear Regression is one of the most common ML and data
analysis techniques. It forecasts with a linear regression
equation, which combines a set of independent features (x) to
predict the output value (y), also called the dependent
variable. The linear equation assigns each independent variable
a factor, called a coefficient and represented by β.
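As a minimal sketch of the equation described above, the example below fits y = β0 + β1·x by ordinary least squares for a single feature. The function name fit_simple_linear and the toy data are illustrative assumptions and are not part of the project; the project itself uses several features and a library implementation.

```python
def fit_simple_linear(xs, ys):
    """Return (beta0, beta1) minimising squared error for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical toy data generated from y = 3 + 2x (not the Black Friday dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 7.0, 9.0, 11.0, 13.0]

beta0, beta1 = fit_simple_linear(xs, ys)
prediction = beta0 + beta1 * 6.0  # predicted y for an unseen x = 6
print(beta0, beta1, prediction)
```

With more than one feature the same idea applies, with one coefficient β per feature.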
XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used
to obtain an efficient model with high computational speed and
efficacy. It makes predictions with an ensemble method in which
new decision trees model the errors of the earlier trees in
order to improve the final prediction. The fitted model also
reports how much each feature contributes to the final
prediction.
Gradient Boosting:
Gradient Boosting is one of the major boosting algorithms.
Boosting is an ensemble technique in which successive
predictors learn from the mistakes of their predecessors. It
improves weak learners by combining them into a single
prediction model. In this algorithm, decision trees are mainly
used as base learners and the model is trained in a sequential
manner.
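The sequential idea described above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: squared-error loss, a single feature, and one-split trees (stumps) as weak learners. All function names and the toy data are hypothetical, and this is not the library implementation used in the project.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data (not the Black Friday dataset).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 15.0, 15.5]

mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
model = gradient_boost(xs, ys)
mse_after = mse(ys, [boosted_predict(model, x) for x in xs])
print(mse_before, mse_after)
```

Each round fits only what the earlier rounds got wrong, so the training error shrinks as stumps are added; the learning rate controls how much of each correction is applied.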
Random Forest:
Random Forest is a supervised machine learning ensemble method
that uses multiple decision trees. It relies on a technique
called bootstrap aggregation, also known as bagging, which aims
to reduce the variance of models that overfit the training
data. Rather than depending on an individual decision tree, the
algorithm combines multiple decision trees to determine the
final outcome.
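The bagging step described above can be illustrated with a small sketch. It assumes a single feature and one-split trees as base learners, and it shows only bootstrap sampling and averaging; a full random forest also samples a subset of features at each split. The function names and toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; fall back to the mean if no split exists."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # degenerate bootstrap sample: a single distinct x
        return left_mean
    return left_mean if x <= threshold else right_mean


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def forest_predict(forest, x):
    """Average the predictions of all trees."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)


# Hypothetical toy data (not the Black Friday dataset).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 15.0, 15.5]

forest = bagged_stumps(xs, ys)
low = forest_predict(forest, 1)
high = forest_predict(forest, 8)
print(low, high)
```

Because every tree sees a slightly different sample, their individual errors partly cancel when the predictions are averaged.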
Feature Selection:
The Product_Category_1 feature has by far the highest
regression coefficient and is therefore a very important
feature.
5. SYSTEM REQUIREMENTS
6. CONCLUSION
Sales forecasting is essential to an organization's business
decisions. Accurate forecasting helps companies to enhance
their market growth. Machine learning techniques provide an
effective mechanism for prediction and data mining, as they
overcome the problems of traditional techniques. These
techniques improve efficiency and deliver better results and
greater predictability. After predicting the purchase amount,
companies can apply targeted marketing strategies to particular
customer segments so that profit can be increased.
7. FUTURE SCOPE
In future work, we will use other feature selection techniques
and advanced deep learning architectures to improve the
efficiency and optimization of the model.
The model could also be integrated into a website or app to
provide insights based on this data.
8. REFERENCES
[1] Sunitha Cheriyan, Shaniba Ibrahim, Saju Mohanan & Susan
Treesa (2018). Intelligent Sales Prediction Using Machine
Learning Techniques.
[2] Avinash Kumar, Neha Gopal & Jatin Rajput (2020). An
Intelligent Model for Predicting the Sales of a Product.
[3] Nikhil Sunil Elias & Seema Singh (2019). Forecasting of
Walmart Sales Using Machine Learning Algorithms.
[4] Gopal Behera & Neeta Nain (2019). Sales Prediction for
Big Mart.
Code:
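import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Imports required by the modelling and evaluation steps further down
from sklearn import linear_model
from sklearn.ensemble import (ExtraTreesRegressor,
    GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor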
train_df=pd.read_csv("train.csv")
test_df=pd.read_csv("test.csv")
df=train_df.copy()
train_df.info()
test_df.info()
train_df.head()
train_df.drop('User_ID',axis=1,inplace=True)
test_df.drop('User_ID',axis=1,inplace=True)
train_df.shape
train_df.describe()
# Encode the age brackets as ordered integers (same mapping for both frames)
age_map={'0-17':0,'18-25':1,'26-35':2,'36-45':3,'46-50':4,'51-55':5,'55+':6}
train_df['Age']=train_df['Age'].map(age_map)
test_df['Age']=test_df['Age'].map(age_map)
test_df['Gender'].unique()
train_df['Marital_Status'].unique()
train_df['City_Category'].unique()
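# One-hot encode City_Category (columns 'B' and 'C') and attach the result,
# since 'B' and 'C' are used as columns later in the listing
city=pd.get_dummies(train_df['City_Category'],drop_first=True)
train_df=pd.concat([train_df,city],axis=1)
train_df.drop('City_Category',axis=1,inplace=True)
city_test=pd.get_dummies(test_df['City_Category'],drop_first=True)
test_df=pd.concat([test_df,city_test],axis=1)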
# Fraction of missing values per column in the training data
percent_missing=np.round((train_df.isna().sum()/train_df.isna().count()),3)
percent_missing.sort_values(ascending=False)
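# Fill missing Product_Category_2 values with the most frequent category
train_df['Product_Category_2']=train_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])
train_df['Product_Category_2'].isna().sum()
# Drop Product_Category_3 from the training data as well, so that the
# training and test frames keep the same feature columns
train_df.drop('Product_Category_3',axis=1,inplace=True)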
percent_missing=np.round((test_df.isna().sum()/test_df.isna().count()),3)
percent_missing.sort_values(ascending=False)
test_df.drop('Product_Category_3',axis=1,inplace=True)
test_df['Product_Category_2']=test_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])
train_df.info()
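# The column holds strings such as '4+'; strip the '+' before casting to int
train_df['Stay_In_Current_City_Years']=train_df['Stay_In_Current_City_Years'].str.replace('+','',regex=False).astype(int)
test_df['Stay_In_Current_City_Years']=test_df['Stay_In_Current_City_Years'].str.replace('+','',regex=False).astype(int)
train_df['B']=train_df['B'].astype(int)
train_df['C']=train_df['C'].astype(int)
train_df['Product_Category_2']
test_df.drop('City_Category',axis=1,inplace=True)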
# Recent seaborn versions require x and y to be passed as keywords
sns.barplot(x='Gender',y='Purchase',data=train_df)
sns.barplot(x='Age',y='Purchase',data=train_df)
sns.barplot(x='Marital_Status',y='Purchase',data=train_df)
sns.barplot(x='Occupation',y='Purchase',data=train_df)
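# Assumption (not in the original listing): the regressors need numeric
# inputs, so encode 'Gender' and drop the string identifier 'Product_ID'
for frame in (train_df,test_df):
    frame['Gender']=frame['Gender'].map({'F':0,'M':1})
    frame.drop('Product_ID',axis=1,inplace=True)
X=train_df.drop('Purchase',axis=1)
y=train_df['Purchase']
X_train,X_valid,y_train,y_valid=train_test_split(X,y,test_size=0.5,random_state=42)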
rfr=RandomForestRegressor(n_estimators=150)
rfr.fit(X_train,y_train)
rfrpredict=rfr.predict(X_valid)
regressor = RandomForestRegressor()
regressor.fit(X_train,y_train)
# score() returns the R² coefficient of determination; report it as a percentage
accuracy = regressor.score(X_valid,y_valid)
accuracy1=accuracy*100
gbr=GradientBoostingRegressor()
gbr.fit(X_train,y_train)
gbrpredict= gbr.predict(X_valid)
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train,y_train)
accuracy = regressorgbr.score(X_valid,y_valid)
accuracy2=accuracy*100
xgr=XGBRegressor()
xgr.fit(X_train,y_train)
xgrpredict=xgr.predict(X_valid)
regressorxg = XGBRegressor()
regressorxg.fit(X_train,y_train)
accuracy = regressorxg.score(X_valid,y_valid)
accuracy3=accuracy*100
reg=linear_model.LinearRegression()
lm_model=reg.fit(X_train,y_train)
pred=lm_model.predict(X_valid)
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train,y_train)
accuracy = regressorlr.score(X_valid,y_valid)
accuracy4=accuracy*100
m=ExtraTreesRegressor()
m.fit(X_train,y_train)
mpredict= m.predict(X_valid)
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train,y_train)
accuracy = Exregressor.score(X_valid,y_valid)
accuracy5=accuracy*100
finalpredict=gbr.predict(test_df)
finalpredict
size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors = colors, labels = labels, shadow = True,
explode = explode, autopct = '%.2f%%')
plt.axis('off')
plt.legend()
plt.show()
plt.figure(figsize=[12,8])
sns.countplot(x='Occupation',hue='Age',data=train_df)
print("RMSE score for Random_Forest : ",
np.sqrt(mean_squared_error(y_valid,rfrpredict)))
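# Assumption: the definition of `data` and the plotting call are missing from
# the listing; here `data` maps each model to the score computed above
data = {'Random Forest':accuracy1,'Gradient Boosting':accuracy2,
    'XGBoost':accuracy3,'Linear Regression':accuracy4,'Extra Trees':accuracy5}
courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize = (10, 5))
plt.bar(courses, values)
plt.show()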
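# max_features='auto' was removed in recent scikit-learn releases; for a
# regressor it meant "use all features", which is max_features=1.0
rf_regressor_tune = RandomForestRegressor(n_estimators=100,
max_depth = 40, max_features = 1.0, min_samples_leaf =10,
min_samples_split=2 )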
rf_regressor_tune.fit(X_train, y_train)
# Pair each training feature with its importance and plot them in descending order
columns = pd.DataFrame({"Features": X_train.columns,
"Feature Importance":rf_regressor_tune.feature_importances_})
columns = columns.sort_values("Feature Importance", ascending =
False).reset_index(drop=True)
sns.barplot(y="Features", x = "Feature Importance", data =
columns)
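The listing evaluates the models with two measures: the score method of each scikit-learn regressor, which returns the R² coefficient of determination, and the RMSE computed from mean_squared_error. As a minimal sketch of what these two numbers mean, the pure-Python functions below compute both on a small set of made-up values; the function names and data are illustrative assumptions and are not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of the error, in the target's units."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """R²: share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total
    return 1.0 - ss_res / ss_tot


# Hypothetical purchase amounts and predictions (not project results).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

error = rmse(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(error, r2)
```

RMSE is expressed in the same units as the purchase amount, while R² is unitless, equals 1 for a perfect fit, and can be negative for a model that does worse than always predicting the mean.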