Sales Forecast Prediction - Python
Last Updated : 08 Apr, 2025
Sales forecasting estimates future sales from trends in historical data. It is an important part of business planning, helping organizations make informed decisions about inventory management, marketing strategies and resource allocation. In this article we will build a sales forecast prediction model using Python.
Below is the step-by-step implementation of the sales prediction model.
1. Importing Required Libraries
Before starting, ensure you have the necessary libraries installed. For this project, we will be using pandas, numpy, matplotlib, seaborn, xgboost and scikit-learn. You can install them using pip:
pip install pandas numpy matplotlib seaborn scikit-learn xgboost
Python
import numpy as np  # needed later for np.sqrt when computing RMSE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
2. Loading the Dataset
For this we will be using a sales dataset that contains columns such as Row ID, Order ID, Customer ID, Order Date and Sales. You can download the dataset from here.
Python
file_path = 'train.csv'
data = pd.read_csv(file_path)
data.head()
Output:
Dataset
3. Data Preprocessing and Visualization
In this block, we will preprocess the data and visualize the sales trend over time.
- pd.to_datetime: Converts the "Order Date" column into datetime format allowing us to perform time-based operations.
- groupby: Groups the data by "Order Date" and sums the sales for each date, creating a time series of daily sales.
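The two operations above can be tried on a small toy frame first. This is a minimal sketch: the `toy` and `daily` names and the four order rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical order rows; the values are invented for illustration.
toy = pd.DataFrame({
    "Order Date": ["08/11/2017", "08/11/2017", "12/06/2017", "11/10/2016"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
})

# Parse day/month/year strings into datetime values.
toy["Order Date"] = pd.to_datetime(toy["Order Date"], format="%d/%m/%Y")

# One row per date with the sales summed; groupby sorts the dates ascending.
daily = toy.groupby("Order Date")["Sales"].sum().reset_index()
print(daily)
```

The two orders dated 08/11/2017 collapse into a single row, so `daily` has three rows sorted by date.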
Python
data['Order Date'] = pd.to_datetime(data['Order Date'], format='%d/%m/%Y')
sales_by_date = data.groupby('Order Date')['Sales'].sum().reset_index()
plt.figure(figsize=(12, 6))
plt.plot(sales_by_date['Order Date'], sales_by_date['Sales'], label='Sales', color='red')
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.grid(True)
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output:
Sales Trend Over Time
4. Feature Engineering - Creating Lagged Features
Here we create lagged features to capture the temporal patterns in the sales data.
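- create_lagged_features: This function generates lagged features by shifting the Sales column by 1, 2, ..., lag rows, so each row also carries the sales values of the rows before it. In the code below it is applied to the individual order rows of data, so each lag refers to a preceding row in the file. Lag features let the model use previous sales values to predict the next one.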
- dropna: Drops rows with missing values which are introduced due to the shift operation when lagging.
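To see what shifting and dropping does, here is a small sketch on made-up values (the `toy` frame and its numbers are assumptions for illustration only):

```python
import pandas as pd

# Made-up sales values, purely for illustration.
toy = pd.DataFrame({"Sales": [10.0, 20.0, 30.0, 40.0, 50.0]})

# shift(i) moves each value down by i rows, so row t holds the value from row t - i.
toy["lag_1"] = toy["Sales"].shift(1)
toy["lag_2"] = toy["Sales"].shift(2)

# The first two rows lack a complete history (NaN in lag_1 or lag_2) and are dropped.
toy_clean = toy.dropna()
print(toy_clean)
```

With two lags, the first two rows are lost; with lag = 5 as used in this article, the first five rows are dropped.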
Python
def create_lagged_features(data, lag=1):
    lagged_data = data.copy()
    for i in range(1, lag+1):
        lagged_data[f'lag_{i}'] = lagged_data['Sales'].shift(i)
    return lagged_data
lag = 5
sales_with_lags = create_lagged_features(data[['Order Date', 'Sales']], lag)
sales_with_lags = sales_with_lags.dropna()
5. Preparing the Data for Training
In this step we prepare the data for training and testing.
- drop(columns): Removes the 'Order Date' and 'Sales' columns from the feature set X. 'Sales' is the target variable, and the date itself is not used as a feature.
- train_test_split: Splits the dataset into training (80%) and testing (20%) sets.
- shuffle=False: Keeps the rows in their existing order instead of shuffling them, so the test set is the last 20% of rows. This preserves chronological order when the rows are sorted by date.
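The effect of shuffle=False can be checked on a tiny example. This is a sketch on made-up arrays (`X_toy`, `y_toy`), assuming the rows are already in time order:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations, assumed to be in time order.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# With shuffle=False the split is a simple cut: first 80% train, last 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False
)
print("train:", y_tr)
print("test: ", y_te)
```

The training set contains the first eight observations and the test set the last two, so the model is never evaluated on rows that come before its training rows.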
Python
X = sales_with_lags.drop(columns=['Order Date', 'Sales'])
y = sales_with_lags['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
6. Training the XGBoost Model
Here we will train the XGBoost model. XGBoost is a gradient boosting library that builds an ensemble of decision trees, each one correcting the errors of the previous ones, and is widely used for regression tasks like sales forecasting.
- XGBRegressor: Initializes an XGBoost model for regression tasks.
- objective='reg:squarederror': Indicates that we are solving a regression problem, i.e. predicting continuous sales values, with squared error as the loss.
- learning_rate: Controls the step size at each boosting iteration while moving toward a minimum of the loss function. Smaller values lead to slower convergence and usually need more trees.
- n_estimators: The number of boosting rounds (trees) to build. Higher values can improve the fit but may lead to overfitting.
- max_depth: Defines the maximum depth of each decision tree, controlling the complexity of the model. Deeper trees can model more complex patterns but overfit more easily.
- fit: Trains the model on the training data (X_train, y_train).
Python
model_xgb = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=5)
model_xgb.fit(X_train, y_train)
7. Making Predictions and Evaluating the Model
Here we make predictions and evaluate the model performance using RMSE.
- predict: Makes predictions on the test set (X_test) using the trained XGBoost model.
- mean_squared_error: Computes the Mean Squared Error (MSE) between actual and predicted values. We use np.sqrt to compute the Root Mean Squared Error (RMSE), which is a standard metric for evaluating regression models.
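For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch of that arithmetic in plain NumPy, on made-up numbers (the `actual` and `predicted` arrays are assumptions for illustration):

```python
import numpy as np

# Made-up actual and predicted values.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Errors are -10, 10 and -30; their squares are 100, 100 and 900.
mse = np.mean((actual - predicted) ** 2)  # 1100 / 3
rmse = np.sqrt(mse)
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Because the errors are squared before averaging, the single large error of 30 dominates the result, which is why RMSE penalizes large misses more than small ones.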
Python
predictions_xgb = model_xgb.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, predictions_xgb))
print(f"RMSE: {rmse_xgb:.2f}")
Output:
RMSE: 734.63
The RMSE of 734.63 is the square root of the average squared difference between the actual and predicted sales values, expressed in the same units as Sales. A lower RMSE means the model's predictions are closer to the actual sales data. Whether this score is acceptable depends on the scale and spread of the sales values in the dataset.
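One way to put an RMSE value in context is to compare it with a naive baseline, for example always predicting the mean of the training targets. The sketch below uses made-up numbers; the `rmse` helper and the `*_toy` names are hypothetical and not part of the article's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and model predictions, for illustration only.
y_train_toy = [100.0, 120.0, 80.0, 100.0]
y_test_toy = [90.0, 110.0]
model_preds_toy = [95.0, 100.0]

# Baseline: predict the training mean for every test row.
baseline_preds = [float(np.mean(y_train_toy))] * len(y_test_toy)

baseline_rmse = rmse(y_test_toy, baseline_preds)
model_rmse = rmse(y_test_toy, model_preds_toy)
print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Model RMSE:    {model_rmse:.2f}")
```

The same comparison could be run with y_train, y_test and predictions_xgb. A model is only adding value if its RMSE is clearly below that of the baseline.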
8. Visualizing Results
We will plot both the actual and predicted sales to visually compare the performance of the model.
Python
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test, label='Actual Sales', color='red')
plt.plot(y_test.index, predictions_xgb, label='Predicted Sales', color='green')
plt.title('Sales Forecasting using XGBoost')
plt.xlabel('Row Index')  # y_test.index holds row labels, not dates
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Output:
As we can see, the predicted and actual values follow each other fairly closely, which suggests the model captures much of the pattern in the data. Machine learning models like XGBoost can improve sales forecasts by learning from lagged historical values, helping businesses optimize inventory, pricing and demand planning.