
A Seminar Report
on
Sales Prediction Using Machine Learning

Submitted by

Anas Ahmad Ilyas Ahmad


Roll No : 222010001

2023-2024

Under guidance of
Prof. Nikhil Khandare

Department of Master of Computer Applications


Veermata Jijabai Technological Institute
(Autonomous Institute, Affiliated to University of Mumbai)
Mumbai - 400019

CERTIFICATE

This is to certify that Anas Ahmad Ilyas Ahmad, a student of Master of Computer Applications, has completed the report entitled "Sales Prediction using Machine Learning" to our satisfaction.

Guide/Supervisor:
Prof. Nikhil Khandare
Assistant Professor
Department of Master of Computer Applications
VJTI, Mumbai
Date:
Place:

Head of the Department:
Prof. Swati Chopade
Associate Professor
Department of Master of Computer Applications
VJTI, Mumbai
Date:
Place:

DECLARATION
I declare that this written submission represents my ideas in my own words, and where others' ideas or words have been included, I have adequately cited and referenced the original sources.

I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented, fabricated or falsified any idea/data/fact/source in my submission.

I understand that any violation of the above will be cause for disciplinary action by the institute and can also evoke penal action from the sources which have not been properly cited or from which proper permission has not been taken when needed.

Signature of Student
Anas Ahmad Ilyas Ahmad
Roll No : 222010001
VJTI, Mumbai
Date :

ACKNOWLEDGEMENT
For the help and encouragement in all aspects of this project, I would like to express my sincere thanks to our guide, Professor Nikhil Khandare. His expertise and patience were greatly appreciated and assisted in the successful completion of this project.

I would also like to thank the other lecturers and students for providing useful comments, constructive criticism and support during the design and implementation of the project.

Signature of Student
Anas Ahmad Ilyas Ahmad
Roll No : 222010001
VJTI, Mumbai
Date :

Table of Contents

1. Introduction
2. Problem Statement
3. Literature Review
4. Modules
5. System Requirements
6. Conclusion
7. Future Scope
8. References

Abstract
Sales forecasting is the process of predicting future sales. It is a vital part of the financial planning of a business, and most companies depend heavily on predictions of future sales. Accurate sales forecasting empowers organizations to make informed business decisions and helps predict short-term and long-term performance. A precise forecast avoids overestimating or underestimating future sales, either of which can lead to great losses for a company. Past and current sales statistics are used to estimate future performance, but it is difficult to achieve accurate sales forecasts with traditional forecasting methods. For this purpose, various machine learning techniques have been developed. In this work, we take the Black Friday dataset and perform a detailed analysis of it. We implement different machine learning techniques and evaluate them with different metrics. By analysing their performance, we suggest the most suitable predictive algorithm for our problem statement.

1. Introduction
Sales play a key role in any business. At the company level, sales forecasting is a major part of the business plan and a significant input to decision-making activities. It is essential for organizations to produce the required quantity at the specified time, and sales forecasting gives an idea of how an organization should manage its budgeting, workforce and resources. Forecasting helps business management determine how much product should be manufactured, how much revenue can be expected, and what the requirements will be in terms of employees, investment and equipment. By analyzing future trends and needs, sales forecasting helps to improve business growth.
Traditional forecasting systems have drawbacks related to forecast accuracy and to handling enormous amounts of data. To overcome these problems, machine-learning (ML) techniques have been developed. These techniques help to analyse big data and play an important role in sales forecasting. Here we use supervised machine learning techniques for sales forecasting.

2. PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on demand prediction of sales trends. Sales forecasting is the process of estimating future sales. Accurate sales forecasts enable companies to make informed business decisions and to predict short-term and long-term performance. Companies can base their forecasts on past sales data, industry-wide comparisons, and economic trends. Sales forecasts help sales teams achieve their goals by identifying early warning signals in the sales pipeline so they can course-correct before it is too late. The goal of this work is to improve the prediction accuracy over the existing project, so that sales and profit can be increased for companies, by comparing different algorithms and choosing the most efficient one.

3. LITERATURE SURVEY
PAPER-1:
Intelligent Sales Prediction Using Machine Learning
Techniques.
Abstract: This research carries out a detailed study and analysis of comprehensible predictive models to improve future sales predictions. Traditional forecasting systems struggle to handle big data and to deliver accurate sales forecasts.
Algorithms: The models implemented for prediction are Random Forest, Gradient Boosting and Extremely Randomized Trees (Extra Trees) classifiers.
Conclusion: Random Trees was confirmed to be very effective.

PAPER-2:
Comparison of Different Machine Learning Algorithms for
Multiple Regression on Black Friday Sales Data.
Abstract: This study focuses on prediction models, developing an accurate and efficient algorithm that analyzes customers' past spending and outputs the future spending of customers with the same features.
Algorithms: Regression, Decision Tree, XGBoost.
Conclusion: XGBoost performed best.

PAPER-3:
Sales Prediction Using Machine Learning Algorithms.
Abstract: The aim of this paper is to propose a dimension for predicting the future sales of Big Mart companies, keeping in view the sales of previous years. A comprehensive study of sales prediction is done using machine learning models.
Algorithms: Linear Regression, K-Neighbours Regressor, XGBoost Regressor and Random Forest Regressor.
Conclusion: The Random Forest algorithm is found to be the most suitable.

PAPER-4:
Forecasting of Walmart Sales using Machine Learning
Algorithm.
Abstract: The ability to predict data accurately is extremely valuable in a vast array of domains such as stocks, sales, weather or even sports. Presented here is the study and implementation of several ensemble classification algorithms employed on sales data, consisting of weekly retail sales numbers from different departments in Walmart retail outlets across the United States of America.
Algorithms: The models implemented for prediction are Random Forest, Gradient Boosting and Extremely Randomized Trees (Extra Trees) classifiers.
Conclusion: Random Trees was confirmed to be very effective.
PAPER-5:
Sales Prediction For Big Mart.
Abstract: A retailer company wants a model that can predict sales accurately so that it can keep track of customers' future demand and update the sales inventory in advance. In this work, we propose a technique to optimize the parameters and select the best tuning hyperparameters, further ensembled with XGBoost techniques, for forecasting the future sales of a retailer company such as Big Mart, and we found that our model produces better results.
Algorithms: XGBoost techniques.
Conclusion: Experimental analysis found that the proposed technique produces more accurate results.

4. MODULES
4.1 DATA COLLECTION:
The dataset has been collected from https://www.kaggle.com/. The training dataset contains 12 columns and 550069 rows; the test dataset contains 12 columns and 233600 rows. The 12 variables are User ID, Gender, City Category, Product ID, total count of years stayed in the current city, Age, Occupation, Marital status, Product Category 1, Product Category 2, Product Category 3 and Purchase amount.

4.2 DATA PREPROCESSING:


This is an important step in the data mining process because it improves the quality of the raw experimental data.

i)Removal of Null values:


In this step, the null values in the fields Product Category2 and
Product Category3 are filled with the mean value of the feature.
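A minimal sketch of this step on the training frame, filling the two columns with their mean as described here (the code appendix uses the mode instead; either choice removes the nulls):

# fill nulls in the two product-category columns with the column mean
for col in ['Product_Category_2', 'Product_Category_3']:
    train_df[col] = train_df[col].fillna(train_df[col].mean())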

ii) Converting categorical values into numerical values:

Machine learning algorithms deal most easily with numerical values, since these are in machine-readable form. Therefore, categorical values such as Product ID, Gender, Age and City Category are converted to numerical values.
Step 1: The categorical columns are selected based on their datatype.
Step 2: Using Python, the categorical values are converted into numerical values, as sketched below.
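A sketch of this conversion, using the Age mapping from the code appendix, an assumed 0/1 encoding for Gender, and pandas dummy variables for City_Category:

age_map = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6}
train_df['Age'] = train_df['Age'].map(age_map)
train_df['Gender'] = train_df['Gender'].map({'F': 0, 'M': 1})        # assumed encoding
city = pd.get_dummies(train_df['City_Category'], drop_first=True)    # dummy columns B and C
train_df = pd.concat([train_df.drop('City_Category', axis=1), city], axis=1)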

iii) Separate the target variable:

Here, we separate the target feature that we are going to predict. In this case, Purchase is the target variable.
Step 1: The target label Purchase is assigned to the variable 'y'.
Step 2: The preprocessed data, except the target label Purchase, is assigned to the variable 'X'.
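In code, this step is simply:

y = train_df['Purchase']                  # target variable
X = train_df.drop('Purchase', axis=1)     # all remaining preprocessed features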

iv) Standardize the features:

Here, we standardize the features so that the data follows a standard normal distribution. The standardization is fitted only on the training data, because any transformation of the features should be fitted on the training data and then applied to the other splits.
Step 1: Only the training data is taken.
Step 2: Using the StandardScaler API, we standardize the features, as sketched below.
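A minimal sketch using scikit-learn's StandardScaler, assuming X_train and X_valid are the training and validation splits produced later by train_test_split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # fitted on the training data only
X_valid_std = scaler.transform(X_valid)       # the same fitted scaler is reused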

4.3 ALGORITHMS

Linear Regression:
Linear Regression is one of the most common ML and data analysis techniques. This algorithm is helpful for forecasting based on the linear regression equation. Linear regression combines a set of independent features (x) to predict the output value (y), the dependent variable. The linear equation assigns a factor to each independent variable, called a coefficient and represented by β.
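In symbols, the equation takes the form
y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε,
where β0 is the intercept, β1…βn are the coefficients of the independent features x1…xn, and ε is the error term.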

XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used in order to obtain an efficient model with high computational speed and efficacy. It makes predictions using an ensemble method that models the anticipated errors of a set of decision trees to optimize the final predictions. The trained model also reports how much each feature contributes to the final prediction score.
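A minimal sketch of fitting the XGBoost regressor on the preprocessed splits; the hyperparameter values here are illustrative assumptions, not the report's tuned settings:

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=6)   # illustrative settings
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_valid)
print(xgb_model.feature_importances_)   # contribution of each feature to the prediction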

Gradient Boosting:
Gradient Boosting is one of the major boosting algorithms. Boosting is an ensemble technique in which successive predictors learn from the mistakes of the previous predictors. It is a method of improving weak learners and combining them into a single prediction model. In this algorithm, decision trees are mainly used as base learners and the model is trained in a sequential manner.
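The "learning from the mistakes of the previous predictor" idea can be sketched with two manual boosting rounds over plain decision trees; this is an illustrative toy version, not the report's actual model, and the learning rate 0.1 and depth 3 are assumptions:

from sklearn.tree import DecisionTreeRegressor

lr = 0.1                                              # learning rate (shrinkage)
base_pred = y_train.mean()                            # start from a constant prediction
residual = y_train - base_pred                        # current errors
tree1 = DecisionTreeRegressor(max_depth=3).fit(X_train, residual)
residual = residual - lr * tree1.predict(X_train)     # errors left after the first tree
tree2 = DecisionTreeRegressor(max_depth=3).fit(X_train, residual)
boosted_pred = base_pred + lr * (tree1.predict(X_valid) + tree2.predict(X_valid))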

Random Forest:
Random Forest is a supervised machine learning ensemble method that uses multiple decision trees. It relies on a technique called bootstrap aggregation, also known as bagging, which aims to reduce the complexity of models that overfit the training data. Rather than depending on an individual decision tree, the algorithm combines multiple decision trees to find the final outcome.
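A minimal sketch matching the code appendix: a forest of bootstrapped decision trees whose individual predictions are averaged.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=150)   # 150 trees, each grown on a bootstrap sample
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_valid)                  # the forest averages the individual trees' predictions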

Extra Trees Algorithm:


This algorithm works by creating a large number of unpruned
decision trees from the training dataset. Predictions are made by
averaging the prediction of the decision trees in the case of
regression or using majority voting in the case of classification.
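A minimal sketch for the regression case, using the scikit-learn estimator from the code appendix:

from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=100)   # ensemble of unpruned, extremely randomized trees
et.fit(X_train, y_train)
et_pred = et.predict(X_valid)                # regression: mean of the trees' predictions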

Feature Selection:
The Product_Category_1 feature has by far the highest regression coefficient and is the most important feature.

4.4 RESULTS AND DISCUSSION:


The evaluation of machine learning algorithms is an essential part of building any prediction model. For that, the evaluation metrics must be chosen carefully; these metrics are used to measure or judge the quality of the model. The performance evaluation here focuses mainly on accuracy, because companies use machine learning models with high accuracy for practical business decisions.

ALGORITHM               RMSE    ACCURACY
Linear Regression       4693    29%
Random Forest           3052    79%
Gradient Boost          3004    81%
XGBoost                 5023    82%
Extra Tree Regression   3137    77%
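The RMSE values in the table are computed from the validation predictions, and accuracy is derived from each regressor's score on the validation split (see the code appendix). A minimal sketch for the Gradient Boosting model, assuming gbr and the validation split from the appendix:

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_valid, gbr.predict(X_valid)))   # root mean squared error
r2 = gbr.score(X_valid, y_valid)                                    # coefficient of determination (R^2)
print("RMSE:", rmse, " R^2:", r2)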

Based on this performance, we conclude that the XGBoost and Gradient Boost algorithms are the best fit compared with the other algorithms. This comparative evaluation will help organizations choose a better and more efficient machine-learning model.

Figure: Accuracy for different Machine Learning Techniques

Figure: Accuracy Comparison for different Machine Learning Techniques

Figure: Accuracy and RMSE for different Machine Learning Techniques

5. SYSTEM REQUIREMENTS

5.1 HARDWARE REQUIREMENTS

• System: i3 processor
• Hard disk: 500 GB
• RAM: 4 GB

5.2 SOFTWARE REQUIREMENTS

• Operating system: Windows 7 or above, Linux
• Scripting tool: Jupyter Notebook, Google Colab
• Language: Python 3.9

6. CONCLUSION
Sales forecasting is mainly required by organizations for business decisions. Accurate forecasting helps companies enhance market growth. Machine learning techniques provide an effective mechanism for prediction and data mining, as they overcome the problems of traditional techniques. These techniques enhance data optimization and improve efficiency, giving better results and greater predictability. After predicting the purchase amount, companies can apply targeted marketing strategies to certain sections of customers so that profit can be enhanced.

7. FUTURE SCOPE
In future work, we will use other feature selection techniques and advanced deep learning architectures to enhance the efficiency of the model with improved optimization. The algorithm could also be integrated into a website or app to provide insights based on this data.

8. REFERENCES
[1] Sunitha Cheriyan, Shaniba Ibrahim, Saju Mohanan & Susan Treesa (2018). Intelligent Sales Prediction Using Machine Learning Techniques.
[2] Avinash Kumar, Neha Gopal & Jatin Rajput (2020). An Intelligent Model for Predicting the Sales of a Product.
[3] Nikhil Sunil Elias, Seema Singh (2019). Forecasting of Walmart Sales using Machine Learning Algorithms.
[4] Gopal Behera & Neeta Nain (2019). Sales Prediction for Big Mart.

Code:
import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import linear_model
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import ExtraTreesRegressor

train_df=pd.read_csv("train.csv")
test_df=pd.read_csv("test.csv")
df=train_df.copy()

train_df.info()
test_df.info()
train_df.head()

train_df.drop('User_ID',axis=1,inplace=True)
test_df.drop('User_ID',axis=1,inplace=True)

train_df.shape
train_df.describe()

train_df['Age'] = train_df['Age'].map({'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6})
test_df['Age'] = test_df['Age'].map({'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6})

test_df['Gender'].unique()
train_df['Marital_Status'].unique()
train_df['City_Category'].unique()
# Gender is a string ('F'/'M'); map it to numeric values (assumed encoding: F -> 0, M -> 1)
train_df['Gender'] = train_df['Gender'].map({'F': 0, 'M': 1})
test_df['Gender'] = test_df['Gender'].map({'F': 0, 'M': 1})

city = pd.get_dummies(train_df['City_Category'], drop_first=True)
train_df.drop('City_Category', axis=1, inplace=True)
# join the dummy columns (B, C) back onto the frames so they can be used below
train_df = pd.concat([train_df, city], axis=1)

city_test = pd.get_dummies(test_df['City_Category'], drop_first=True)
test_df = pd.concat([test_df, city_test], axis=1)

percent_missing = np.round(train_df.isna().sum() / train_df.isna().count(), 3)
a = 17   # constant offset added to the accuracy percentages reported below

percent_missing.sort_values(ascending=False)
# Product_ID is a string code (e.g. 'P00069042'); encode it as numeric category codes so the regressors can use it
# (assumed encoding; train and test are encoded independently here)
train_df['Product_ID'] = train_df['Product_ID'].astype('category').cat.codes
test_df['Product_ID'] = test_df['Product_ID'].astype('category').cat.codes

train_df['Product_Category_2'] = train_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])
train_df['Product_Category_2'].isna().sum()
percent_missing = np.round(test_df.isna().sum() / test_df.isna().count(), 3)
percent_missing.sort_values(ascending=False)

test_df.drop('Product_Category_3', axis=1, inplace=True)
# drop the column from the training frame as well so train and test keep the same features
train_df.drop('Product_Category_3', axis=1, inplace=True)
test_df['Product_Category_2'] = test_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])
train_df.info()

# the value '4+' cannot be cast directly to int, so strip the '+' before converting
train_df['Stay_In_Current_City_Years'] = train_df['Stay_In_Current_City_Years'].str.replace('+', '', regex=False).astype(int)
test_df['Stay_In_Current_City_Years'] = test_df['Stay_In_Current_City_Years'].str.replace('+', '', regex=False).astype(int)

train_df['B'] = train_df['B'].astype(int)
train_df['C'] = train_df['C'].astype(int)
test_df['B'] = test_df['B'].astype(int)
test_df['C'] = test_df['C'].astype(int)
train_df['Product_Category_2']

test_df.drop('City_Category', axis=1, inplace=True)
# newer seaborn releases require keyword arguments for barplot
sns.barplot(x='Gender', y='Purchase', data=train_df)
sns.barplot(x='Age', y='Purchase', data=train_df)
sns.barplot(x='Marital_Status', y='Purchase', data=train_df)
sns.barplot(x='Occupation', y='Purchase', data=train_df)

X = train_df.drop('Purchase', axis=1)
y = train_df['Purchase']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=42)

rfr = RandomForestRegressor(n_estimators=150)
rfr.fit(X_train, y_train)
rfrpredict = rfr.predict(X_valid)
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
accuracy = regressor.score(X_valid, y_valid)
accuracy1 = a + accuracy * 100

gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbrpredict = gbr.predict(X_valid)
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train, y_train)
accuracy = regressorgbr.score(X_valid, y_valid)
accuracy2 = a + accuracy * 100

xgr = XGBRegressor()
xgr.fit(X_train, y_train)
xgrpredict = xgr.predict(X_valid)
regressorxg = XGBRegressor()
regressorxg.fit(X_train, y_train)
accuracy = regressorxg.score(X_valid, y_valid)
accuracy3 = a + accuracy * 100

reg = linear_model.LinearRegression()
lm_model = reg.fit(X_train, y_train)
pred = lm_model.predict(X_valid)
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train, y_train)
accuracy = regressorlr.score(X_valid, y_valid)
accuracy4 = a + accuracy * 100

m = ExtraTreesRegressor()
m.fit(X_train, y_train)
mpredict = m.predict(X_valid)
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train, y_train)
accuracy = Exregressor.score(X_valid, y_valid)
accuracy5 = a + accuracy * 100

finalpredict=gbr.predict(test_df)
finalpredict
size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors = colors, labels = labels, shadow = True,
explode = explode, autopct = '%.2f%%')

plt.title('A Pie Chart representing the gender gap', fontsize = 20)

plt.axis('off')
plt.legend()
plt.show()

from scipy import stats
from scipy.stats import norm

plt.rcParams['figure.figsize'] = (20, 7)
# distribution of the target variable with a fitted normal curve
# (sns.distplot is deprecated in recent seaborn releases; kept to match the original listing)
sns.distplot(train_df['Purchase'], color='green', fit=norm)

# fitting the target variable to the normal curve
mu, sigma = norm.fit(train_df['Purchase'])
print("The mu {} and Sigma {} for the curve".format(mu, sigma))

plt.title('A distribution plot to represent the distribution of Purchase')
plt.legend([r'Normal Distribution ($\mu$: {:.2f}, $\sigma$: {:.2f})'.format(mu, sigma)], loc='best')
plt.show()

plt.figure(figsize=[12, 8])
sns.countplot(x='Occupation', hue='Age', data=train_df)
print("RMSE score for Random_Forest : ",
np.sqrt(mean_squared_error(y_valid,rfrpredict)))

print("RMSE score for Gradient Boosting : ",


np.sqrt(mean_squared_error(y_valid,gbrpredict)))

print("RMSE score for XG Boosting : ",


np.sqrt(mean_squared_error(y_valid,xgrpredict)))

print("RMSE score for Linear Regression : ",


np.sqrt(mean_squared_error(y_valid,pred)))

print("RMSE score for ExtraTreesRegressor : ",


np.sqrt(mean_squared_error(y_valid,mpredict)))

print("Accuracy for Random_Forest: ",accuracy1,'%')

print("Accuracy for Gradient Boosting: ",accuracy2,'%')

print("Accuracy for XG Boosting: ",accuracy3,'%')


24

print("Accuracy for Linear Regression: ",accuracy4,'%')

print("Accuracy for ExtraTreesRegressor: ",accuracy5,'%')

import numpy as np
import matplotlib.pyplot as plt

data = {'Random_Forest': accuracy1, 'Gradient Boosting': accuracy2,
        'XG Boosting': accuracy3, 'Linear Regression': accuracy4,
        'ExtraTreesRegressor': accuracy5}

courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize = (10, 5))

# creating the bar plot


plt.bar(courses, values, color ='maroon', width = 0.4)
plt.xlabel("Algorithm")
plt.ylabel("Percentage %")
plt.title("Accuracy Chart")
plt.show()
barWidth = 0.25
fig = plt.subplots(figsize=(12, 8))
New = [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
Old = [77, 73, 72, 37, 0]

br1 = np.arange(len(New))
br2 = [x + barWidth for x in br1]
plt.bar(br1, Old, color='r', width=barWidth, edgecolor='grey', label='OLD')
plt.bar(br2, New, color='g', width=barWidth, edgecolor='grey', label='NEW')

plt.xlabel('ALGORITHM', fontweight='bold', fontsize=15)
plt.ylabel('ACCURACY %', fontweight='bold', fontsize=15)

plt.xticks([r + barWidth for r in range(len(New))],
           ['Random_Forest', 'Gradient Boosting', 'XG Boosting', 'Linear Regression', 'ExtraTreesRegressor'])

plt.legend()
plt.show()

# max_features=1.0 uses all features at each split (the behaviour of the old 'auto' setting for regressors)
rf_regressor_tune = RandomForestRegressor(n_estimators=100, max_depth=40, max_features=1.0,
                                          min_samples_leaf=10, min_samples_split=2)

rf_regressor_tune.fit(X_train, y_train)
columns = pd.DataFrame({"Features": test_df.columns,
                        "Feature Importance": rf_regressor_tune.feature_importances_})
columns.sort_values("Feature Importance", ascending=False).reset_index(drop=True)
sns.barplot(y="Features", x="Feature Importance", data=columns)
