Problem-
For this particular assignment, data on the sales of different types of wine in the 20th century are to be analyzed. Both datasets are from the same company but for different wines. As an analyst at ABC Estate Wines, you are tasked to analyse and forecast Wine Sales in the 20th century.
Data sets for the Problem: Sparkling.csv and Sparkling.csv
Please perform the following tasks on each of these two data sets separately.
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and
also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build various exponential smoothing models on the training data and
evaluate the model using RMSE on the test data.
Other models such as regression, naive forecast models, simple average models, etc. should also be built on the training data, and their performance checked on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the
parameters are selected using the lowest Akaike Information Criteria (AIC)
on the training data and evaluate this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and
PACF on the training data and evaluate this model on the test data using
RMSE.
8. Build a table with all the models built along with their corresponding
parameters and the respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on
the complete data and predict 12 months into the future with appropriate
confidence intervals/bands.
10. Comment on the model thus built, report your findings, and suggest the measures that the company should take for future sales.
OBJECTIVE-
Data on the sales of different types of wine in the 20th century are to be analyzed. Both datasets are from the same company but for different wines. As analysts at ABC Estate Wines, we have to analyze and forecast Wine Sales in the 20th century.
SO THEY HAVE PROVIDED US WITH 2 DATASETS-
1. SPARKLING DATASET
2. SPARKLING DATASET
So basically they want us to go in depth into the datasets provided to us, analyze them, and provide them with information for optimizing their marketing strategy so as to increase the company's growth and sales.
1. WE WILL TAKE SPARKLING DATASET-
The dataset Sparkling.csv was loaded into a dataframe (df_1).
Shape: there are 187 rows and 2 columns in the df_1 dataset.
Head and Tail-
YearMonth Sparkling
0 1980-01 1686
1 1980-02 1591
2 1980-03 2304
3 1980-04 1712
4 1980-05 1471
YearMonth Sparkling
182 1995-03 1897
183 1995-04 1862
184 1995-05 1670
185 1995-06 1688
186 1995-07 2031
We need to convert the YearMonth column into date format.
We have converted the data into date format and named the new column Time_Stamp.
We can also drop the column YearMonth, as the year, month, and date are now captured in the single column named Time_Stamp.
Now that we have seen how to load the data from a '.csv' file as a Time Series
object, let us go ahead and analyse the Time Series plot that we got.
We can see that there is a slight downward trend with a seasonal pattern
associated as well.
INFO-
RangeIndex: 187 entries, 0 to 186
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year Month 187 non-null object
1 Sparkling 187 non-null int64
dtypes: int64(1), object(1)
2. Perform appropriate Exploratory Data Analysis to understand the data and
also perform decomposition.
Descriptive Summary-
The average sales of Sparkling Wine per month are around 2402.
The maximum sale of the Wine is approx 7242.
The minimum sale of the Wine is approx 1070.
Box plot-
Now, let us plot a box and whisker (1.5* IQR) plot to understand the spread of the
data and check for outliers in each year, if any-
As we got to know from the Time Series plot, the box plots here also indicate a measure of trend being present. Also, we see that the sales of Sparkling Wine have some outliers in certain years.
Monthly Box Plot-
Since this is a monthly data, let us plot a box and whisker (1.5* IQR) plot to
understand the spread of the data and check for outliers for every month across
all the years, if any.
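A sketch of how these yearly and monthly box plots could be produced (seaborn is assumed; the original plotting code is not shown in this report):

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Yearly box plot: spread and outliers of sales within each year
sns.boxplot(x=df_1.index.year, y=df_1['Sparkling'], ax=axes[0])
axes[0].set_xlabel('Year')

# Monthly box plot: spread and outliers of sales for each calendar month across all years
sns.boxplot(x=df_1.index.month, y=df_1['Sparkling'], ax=axes[1])
axes[1].set_xlabel('Month')

plt.tight_layout()
plt.show()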
The highest sales numbers are recorded in the month of December across the various years.
Monthly Sales across Years-
The highlighted part indicates the maximum sales of Wine for each year and month.
Quarterly plot-
We can see there is an outlier present in the data.
Missing Values-
Sparkling 0
dtype: int64
There are no missing values in the dataset.
Decompose the Time Series-
Additive-
We see from the residual plot of the decomposition that the residuals are located around 0. The trend keeps changing over the years.
As per the 'additive' decomposition, there is a pronounced trend in the earlier years of the data. There is seasonality as well, and no large outliers are visible in the decomposition residuals.
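A sketch of the additive decomposition using statsmodels (a monthly seasonal period of 12 is assumed):

from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition: observed = trend + seasonal + residual
decomposition = seasonal_decompose(df_1['Sparkling'], model='additive', period=12)
decomposition.plot()

# The residual component referred to above
print(decomposition.resid.head())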
3. Split the data into train and test-
Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the
last time stamp provided.
Train and test value counts-
TRAIN - 0.9999999999999998
TEST - 1.0
Train and test shapes (rows, columns) –
(132, 1)
(55, 1)
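A sketch of the date-based split described above (test starting in 1991):

# Training data up to the end of 1990, test data from January 1991 onwards
train = df_1[df_1.index < '1991-01-01']
test = df_1[df_1.index >= '1991-01-01']

print(train.shape)  # (132, 1)
print(test.shape)   # (55, 1)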
Sales for the last 5 years-
It is difficult to predict future observations if such an instance has not happened in the past. From our train-test split, we are predicting behaviour similar to that of the past years.
4. Building different models and comparing the accuracy metrics-
Model 1: Linear Regression-
For this particular linear regression, we are going to regress the 'Sales' variable
against the order of the occurrence.
Training Time instances: 1 to 132
Test Time instances: 133 to 187
We have successfully generated the numerical time-instance order for both the training and test sets. Now we will add these values to the training and test sets.
First few rows of Training Data -
Sparkling time
Time_stamp
1980-01-31 1686 1
1980-02-29 1591 2
1980-03-31 2304 3
1980-04-30 1712 4
1980-05-31 1471 5
Last few rows of Training Data
Sparkling time
Time_stamp
1990-08-31 1605 128
1990-09-30 2424 129
1990-10-31 3116 130
1990-11-30 4286 131
1990-12-31 6047 132
First few rows of Test Data
Sparkling time
Time_stamp
1991-01-31 1902 133
1991-02-28 2049 134
1991-03-31 1874 135
1991-04-30 1279 136
1991-05-31 1432 137
Last few rows of Test Data
Sparkling time
Time_stamp
1995-03-31 1897 183
1995-04-30 1862 184
1995-05-31 1670 185
1995-06-30 1688 186
1995-07-31 2031 187
Now that our training and test data have been modified, let us go ahead and use LinearRegression to build the model on the training data and test the model on the test data.
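A sketch of the regression-on-time model, assuming scikit-learn and the time instances generated above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Numerical time instances 1..132 for train and 133..187 for test
train_time = np.arange(1, len(train) + 1).reshape(-1, 1)
test_time = np.arange(len(train) + 1, len(train) + len(test) + 1).reshape(-1, 1)

lr = LinearRegression()
lr.fit(train_time, train['Sparkling'])

# Evaluate on the test data with RMSE
test_pred = lr.predict(test_time)
rmse = np.sqrt(mean_squared_error(test['Sparkling'], test_pred))
print(f'For RegressionOnTime forecast on the Test Data, RMSE is {rmse:.2f}')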
Sales data for last 5 years-
Defining the accuracy metrics.
Model Evaluation using RMSE on test data-
For RegressionOnTime forecast on the Test Data, RMSE is 1389.14
                     Test RMSE
RegressionOnTime     1389.14
Model 2: Naive Approach:
For this particular naive model, we say that the prediction for tomorrow is the same as today's value; and since the prediction for tomorrow is the same as today's, the prediction for the day after tomorrow is also today's value.
Time_stamp
1991-01-31 6047
1991-02-28 6047
1991-03-31 6047
1991-04-30 6047
1991-05-31 6047
Name: naive, dtype: int64
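A sketch of the naive forecast, where every test-period prediction equals the last observed training value (6047, for December 1990):

import numpy as np
from sklearn.metrics import mean_squared_error

# Every prediction on the test horizon is the last value seen in training
naive_forecast = test.copy()
naive_forecast['naive'] = train['Sparkling'].iloc[-1]

rmse_naive = np.sqrt(mean_squared_error(test['Sparkling'], naive_forecast['naive']))
print(f'For Naive Model forecast on the Test Data, RMSE is {rmse_naive:.2f}')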
Model Evaluation
Model Evaluation using RMSE on test data-
For Naive Model forecast on the Test Data, RMSE is 3864.28
                     Test RMSE
RegressionOnTime     1389.14
NaiveModel           3864.28
Method 3: Simple Average
For this particular simple average method, we will forecast by using the average
of the training values.
Top 5 rows of test data-
Time_Stamp     Sparkling   mean_forecast
31-01-1991     1902        2403.78
28-02-1991     2049        2403.78
31-03-1991     1874        2403.78
30-04-1991     1279        2403.78
31-05-1991     1432        2403.78
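A sketch of the simple average forecast (the training mean, roughly 2403.78, used for every test period):

import numpy as np
from sklearn.metrics import mean_squared_error

# Forecast every test point with the mean of the training values
simple_avg = test.copy()
simple_avg['mean_forecast'] = train['Sparkling'].mean()

rmse_avg = np.sqrt(mean_squared_error(test['Sparkling'], simple_avg['mean_forecast']))
print(f'For Simple Average forecast on the Test Data, RMSE is {rmse_avg:.2f}')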
Model Evaluation
Model Evaluation using RMSE on test data-
For Simple Average forecast on the Test Data, RMSE is 1275.08
                     Test RMSE
RegressionOnTime     1389.14
NaiveModel           3864.28
SimpleAverageModel   1275.08
Method 4: Moving Average (MA)
For the moving average model, we are going to calculate rolling means (or moving
averages) for different intervals. The best interval can be determined by the
maximum accuracy (or the minimum error) over here.
For Moving Average, we are going to average over the entire data.
Top rows-
Train data-
Let us split the data into train and test and plot this Time Series. The window of the moving average needs to be carefully selected, as too big a window will result in not having any test set, since the whole series might get averaged over.
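A sketch of the trailing moving averages over the whole series for the window sizes used below:

# Trailing (rolling) means over the full data for different window sizes
ma_df = df_1.copy()
for window in [2, 4, 6, 9]:
    ma_df[f'Trailing_{window}'] = ma_df['Sparkling'].rolling(window).mean()

# Split the rolling forecasts the same way as the original series
trailing_train = ma_df[ma_df.index < '1991-01-01']
trailing_test = ma_df[ma_df.index >= '1991-01-01']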
Test data-
Model Evaluation using RMSE-
Train data-
For 2 point Moving Average Model forecast on the Training Data, RMSE is 813.401
For 4 point Moving Average Model forecast on the Training Data, RMSE is
1156.590
For 6 point Moving Average Model forecast on the Training Data, RMSE is
1283.927
For 9 point Moving Average Model forecast on the Training Data, RMSE is
1346.278
Test data-
Before we go on to build the various Exponential Smoothing models, let us plot
all the models and compare the Time Series plots.
Method 5: Simple Exponential Smoothing-
Parameters-
{'smoothing_level': 0.04960659880745982,
 'smoothing_trend': nan,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 1818.5047538435374,
 'initial_trend': nan,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
Top 5 rows of test data-
Plotting on train and test dataset-
Model Evaluation -Simple Exponential Smoothing
For Alpha =0.04 Simple Exponential Smoothing Model forecast on the Test Data,
RMSE is 1316.035
Setting different alpha values.
Remember, the higher the alpha value, the more weightage is given to the more recent observations. That means what happened recently is assumed to happen again.
We will run a loop with different alpha values to understand which particular
value works best for alpha on the test set
First we will define an empty data frame to store our values from the loop-
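A sketch of the alpha grid search for Simple Exponential Smoothing, assuming statsmodels' SimpleExpSmoothing and the empty results frame mentioned above:

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

results = pd.DataFrame(columns=['Alpha', 'Test RMSE'])

for alpha in np.arange(0.1, 1.0, 0.1):
    ses = SimpleExpSmoothing(train['Sparkling'], initialization_method='estimated')
    fit = ses.fit(smoothing_level=alpha, optimized=False)
    pred = fit.forecast(len(test))
    rmse = np.sqrt(mean_squared_error(test['Sparkling'], pred))
    results.loc[len(results)] = [round(alpha, 1), rmse]

print(results.sort_values('Test RMSE'))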
Model Evaluation-
Plotting on both the Training and Test data-
Summary-
Method 6: Double Exponential Smoothing (Holt's Model)
Two parameters 𝛼 and 𝛽 are estimated in this model. Level and Trend are
accounted for in this model.
First we will define an empty data frame to store our values from the loop-
Let us sort the data frame in the ascending ordering of the 'Test RMSE' and the
'Test MAPE' values.
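A sketch of the corresponding alpha/beta grid search for Holt's model (statsmodels' Holt class assumed; only Test RMSE is stored here):

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import Holt

des_results = pd.DataFrame(columns=['Alpha', 'Beta', 'Test RMSE'])

for alpha in np.arange(0.1, 1.0, 0.1):
    for beta in np.arange(0.1, 1.0, 0.1):
        fit = Holt(train['Sparkling'], initialization_method='estimated').fit(
            smoothing_level=alpha, smoothing_trend=beta, optimized=False)
        pred = fit.forecast(len(test))
        rmse = np.sqrt(mean_squared_error(test['Sparkling'], pred))
        des_results.loc[len(des_results)] = [round(alpha, 1), round(beta, 1), rmse]

print(des_results.sort_values('Test RMSE').head())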
Plotting on both the Training and Test data-
Method 7: Triple Exponential Smoothing (Holt-Winters' Model)
Three parameters 𝛼, 𝛽 and 𝛾 are estimated in this model. Level, Trend and Seasonality are accounted for in this model.
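A sketch of the auto-fit Holt-Winters model whose parameters are printed below (additive trend and multiplicative seasonality are assumed, consistent with the seasonal factors shown):

import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing

tes = ExponentialSmoothing(train['Sparkling'],
                           trend='additive',
                           seasonal='multiplicative',
                           seasonal_periods=12,
                           initialization_method='estimated').fit()

print(tes.params)  # smoothing_level, smoothing_trend, smoothing_seasonal, ...

tes_pred = tes.forecast(len(test))
rmse_tes = np.sqrt(mean_squared_error(test['Sparkling'], tes_pred))
print(f'Triple Exponential Smoothing test RMSE: {rmse_tes:.3f}')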
Parameters-
{'smoothing_level': 0.11107308290744182,
 'smoothing_trend': 0.06167745801641925,
 'smoothing_seasonal': 0.39488777704116057,
 'damping_trend': nan,
 'initial_level': 1639.5306320456996,
 'initial_trend': -13.803739314239138,
 'initial_seasons': array([1.04411064, 1.00095858, 1.40459398, 1.20906039, 0.96413947,
        0.96754964, 1.3048211 , 1.69841076, 1.37034155, 1.81659752,
        2.84708154, 3.62462473]),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
Prediction on the test data-
Plotting on both the Training and Test using auto fit-
Model Evaluation-
Test Data-
For Alpha=0.111, Beta=0.061, Gamma=0.395, Triple Exponential Smoothing
Model forecast on the Test Data, RMSE is 469.432
First we will define an empty data frame to store our values from the loop-
Train RMSE-
Test RMSE-
Plotting on both the Training and Test data using brute force alpha, beta and
gamma determination-
Sorted by RMSE values on the Test Data:
Plotting on both the Training and Test data-
In this section we have built several models and gone through a model-building exercise. This exercise has given us an idea of which model gives us the least error on our test set for this data. But in Time Series forecasting, we need to be careful about the fact that after we have done this exercise, we need to build the model on the whole data. Remember, the training data that we used to build the models stops much before the data ends. In order to forecast using any of the models built, we need to build the models again (this time on the complete data) with the same parameters.
The two models to be built on the whole data are the following:
1. Triple Exponential Smoothing with the auto-fit alpha, beta, and gamma
2. Triple Exponential Smoothing with the brute-force (grid-searched) alpha, beta, and gamma
1. MODEL1-
RMSE: 421.30973568581123
MAPE: 14.463167851671658
Getting the predictions for the same number of time stamps that are present in the test data-
One assumption we have made here while calculating the confidence bands is that the standard deviation of the forecast distribution is approximately equal to the residual standard deviation. In the below code, we have calculated the upper and lower confidence bands at the 95% confidence level.
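A minimal sketch of that band calculation (the full-data model and its predictions are referred to here by the assumed names final_tes and future_pred; the report's exact code is not reproduced):

import numpy as np

# Assumed names: final_tes is the Triple Exponential Smoothing model refit on the full data,
# future_pred is its forecast for the prediction horizon
resid_std = np.std(final_tes.resid)

# 95% band, assuming forecast std is approximately the residual std (as stated above)
upper_band = future_pred + 1.96 * resid_std
lower_band = future_pred - 1.96 * resid_std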
Plot the forecast along with the confidence band-
Let us now build the second model using the same parameters on the full data
and check the confidence bands when we forecast into the future for the length
of the test set.
2. MODEL2-
RMSE: 353.89206663885477
MAPE: 11.681458721875629
Getting the predictions for the same number of time stamps that are present in the test data-
In the below code, we have calculated the upper and lower confidence bands. The percentile function in NumPy lets us calculate these; adding and subtracting them from the predictions gives us the necessary confidence bands for the predictions.
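A sketch of the percentile-based band described above (again with assumed variable names; np.percentile on the residuals gives the offsets that are added to and subtracted from the predictions):

import numpy as np

# Assumed names: final_tes2 is the second full-data model, future_pred2 its forecast
lower_q, upper_q = np.percentile(final_tes2.resid, [2.5, 97.5])

# Add/subtract the residual percentiles from the predictions to form the band
upper_band2 = future_pred2 + upper_q
lower_band2 = future_pred2 + lower_q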
Plot the forecast along with the confidence band-
5. Check for stationarity of the whole Time Series data.
Dickey-Fuller Test
Null Hypothesis H0: The series is not stationary.
Alternative Hypothesis H1: The series is stationary.
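A sketch of the Dickey-Fuller test helper used for these checks (statsmodels' adfuller):

from statsmodels.tsa.stattools import adfuller

def adf_test(series):
    # Run the Augmented Dickey-Fuller test and print the standard summary
    result = adfuller(series.dropna())
    labels = ['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
    for label, value in zip(labels, result[:4]):
        print(f'{label}: {value:.2f}')
    for key, value in result[4].items():
        print(f'Critical Value ({key}): {value:.2f}')

adf_test(df_1['Sparkling'])          # original series
adf_test(df_1['Sparkling'].diff())   # after a difference of order 1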
Results of Dickey-Fuller Test:
Test Statistic -1.36
P-value 0.60
#Lags Used 11.00
Number of Observations Used 175.00
Critical Value (1%) -3.47
Critical Value (5%) -2.88
Critical Value (10%) -2.58
Dtype: float64
We see that at the 5% significance level the Time Series is non-stationary.
Let us take a difference of order 1 and check whether the Time Series is stationary
or not.
Results of Dickey-Fuller Test:
Test Statistic -45.05
P-value 0.00
#Lags Used 10.00
Number of Observations Used 175.00
Critical Value (1%) -3.47
Critical Value (5%) -2.88
Critical Value (10%) -2.58
Dtype: float64
We see that at 𝛼 = 0.05 the Time Series is indeed stationary.
Plot the Autocorrelation and the Partial Autocorrelation function plots on the
whole data.
From the above plots, we can say that there seems to be seasonality in the data.
Check for stationarity of the Training Data Time Series.
Results of Dickey-Fuller Test:
Test Statistic -1.21
p-value 0.67
#Lags Used 12.00
Number of Observations Used 119.00
Critical Value (1%) -3.49
Critical Value (5%) -2.89
Critical Value (10%) -2.58
dtype: float64
We see that the series is not stationary at 𝛼 = 0.05.
Results of Dickey-Fuller Test:
Test Statistic -8.01
P-value 0.00
#Lags Used 11.00
Number of Observations Used 119.00
Critical Value (1%) -3.49
Critical Value (5%) -2.89
Critical Value (10%) -2.58
Dtype: float64
We see that after taking a difference of order 1 the series has become stationary at 𝛼 = 0.05.
6. Build an Automated version of an ARIMA/SARIMA model for which the best
parameters are selected in accordance with the lowest Akaike Information
Criteria (AIC).
The following loop helps us get combinations of the different parameters p and q in the range 0 to 2. We have kept the value of d at 1, as we need to take one difference of the series to make it stationary.
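A sketch of that grid search (statsmodels' ARIMA class assumed; p and q range over 0-2 with d fixed at 1):

import itertools
from statsmodels.tsa.arima.model import ARIMA

p = q = range(0, 3)
pdq = list(itertools.product(p, [1], q))   # d fixed at 1

aic_results = []
for order in pdq:
    fit = ARIMA(train['Sparkling'], order=order).fit()
    aic_results.append((order, fit.aic))
    print(f'ARIMA {order} - AIC: {fit.aic}')

# Lowest AIC first
best_order, best_aic = sorted(aic_results, key=lambda x: x[1])[0]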
Some parameter combinations for the Model...
Model: (0, 1, 0)
Model: (0, 1, 1)
Model: (0, 1, 2)
Model: (1, 1, 0)
Model: (1, 1, 1)
Model: (1, 1, 2)
Model: (2, 1, 0)
Model: (2, 1, 1)
Model: (2, 1, 2)
Sort the below AIC values in the ascending order to get the parameters for the
minimum AIC value-
ARIMA (0, 1, 0) - AIC: 2269.582796371201
ARIMA (0, 1, 1) - AIC: 2264.9064421638386
ARIMA (0, 1, 2) - AIC: 2232.783097684079
ARIMA (1, 1, 0) - AIC: 2268.5280607731743
ARIMA (1, 1, 1) - AIC: 2235.0139453492993
ARIMA (1, 1, 2) - AIC: 2233.59764711895
ARIMA (2, 1, 0) - AIC: 2262.035600097813
ARIMA (2, 1, 1) - AIC: 2232.3604898927674
ARIMA (2, 1, 2) - AIC: 2210.61951923921
After arranging in ascending order-
ARIMA MODEL RESULTS-
Predict on the Test Set using this model and evaluate the model
Build an Automated version of a SARIMA model for which the best parameters
are selected in accordance with the lowest Akaike Information Criteria (AIC).
Let us look at the ACF plot once more to understand the seasonal parameter for
the SARIMA model.
Setting the seasonality as 6 for the first iteration of the auto SARIMA model.
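A sketch of the seasonal grid search using SARIMAX (seasonal period 6 for this iteration; the later iteration with period 12 works the same way):

import itertools
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

p = q = range(0, 3)
pdq = list(itertools.product(p, [1], q))
seasonal_pdq = [(P, 0, Q, 6) for P, Q in itertools.product(range(0, 3), range(0, 3))]

best = {'order': None, 'seasonal_order': None, 'aic': np.inf}
for order in pdq:
    for seasonal_order in seasonal_pdq:
        fit = SARIMAX(train['Sparkling'], order=order, seasonal_order=seasonal_order,
                      enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
        if fit.aic < best['aic']:
            best = {'order': order, 'seasonal_order': seasonal_order, 'aic': fit.aic}

print(best)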
Examples of some parameter combinations for Model...
Model: (0, 1, 1)(0, 0, 1, 6)
Model: (0, 1, 2)(0, 0, 2, 6)
Model: (1, 1, 0)(1, 0, 0, 6)
Model: (1, 1, 1)(1, 0, 1, 6)
Model: (1, 1, 2)(1, 0, 2, 6)
Model: (2, 1, 0)(2, 0, 0, 6)
Model: (2, 1, 1)(2, 0, 1, 6)
Model: (2, 1, 2)(2, 0, 2, 6)
Sort values of AIC-
Predict on the Test Set using this model and evaluate the model.
We see a huge gain (reduction) in the RMSE value by including the seasonal parameters as well.
Setting the seasonality as 12 for the second iteration of the auto SARIMA model.
Examples of some parameter combinations for Model...
Model: (0, 1, 1)(0, 0, 1, 12)
Model: (0, 1, 2)(0, 0, 2, 12)
Model: (1, 1, 0)(1, 0, 0, 12)
Model: (1, 1, 1)(1, 0, 1, 12)
Model: (1, 1, 2)(1, 0, 2, 12)
Model: (2, 1, 0)(2, 0, 0, 12)
Model: (2, 1, 1)(2, 0, 1, 12)
Model: (2, 1, 2)(2, 0, 2, 12)
Summary-
Similar to the last iteration of the model, where the seasonality parameter was taken as 6, here too we see that the model diagnostics plot does not indicate any remaining information left to extract.
Predict on the Test Set using this model and evaluate the model.
We see that the RMSE value has not reduced further when the seasonality parameter was changed to 12.
7 AND 8 ANSWER-
Build a version of the ARIMA model for which the best parameters are selected
by looking at the ACF and the PACF plots.
Let us look at the ACF and the PACF plots once more.
Here, we have taken alpha=0.05.
The Auto-Regressive parameter in an ARIMA model is 'p', which comes from the significant lag after which the PACF plot cuts off to 0. The Moving-Average parameter in an ARIMA model is 'q', which comes from the significant lag after which the ACF plot cuts off to 0. By looking at the above plots, we can say that both the PACF and the ACF plots cut off at lag 0.
We get a comparatively simpler model by looking at the ACF and the PACF plots.
Predict on the Test Set using this model and evaluate the model.
We see that there is a difference in the RMSE values of the two models, but remember that the second model is a much simpler model.
Build a version of the SARIMA model for which the best parameters are selected
by looking at the ACF and the PACF plots. - Seasonality at 6.
We see that our ACF plot at the seasonal interval (6) does not taper off. So, we go ahead and take a seasonal differencing of the original series. Before that, let us look at the original series.
We see that there is a trend and seasonality. So, now we take a seasonal
differencing and check the series.
Now we see that there is almost no trend present in the data; only seasonality remains.
Let us go ahead and check the stationarity of the above series before fitting the SARIMA model.
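A sketch of that seasonal differencing and stationarity check (using the adf_test helper sketched earlier; a seasonal lag of 6 is assumed in this section):

# Seasonal difference at lag 6 and re-test for stationarity
seasonal_diff = train['Sparkling'].diff(6)
adf_test(seasonal_diff)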
Results of Dickey-Fuller Test:
Test Statistic -7.02
P-value 0.00
#Lags Used 13.00
Number of Observations Used 111.00
Critical Value (1%) -3.49
Critical Value (5%) -2.89
Critical Value (10%) -2.58
Dtype: float64
Checking the ACF and the PACF plots for the new modified Time Series.
Here, we have taken alpha=0.05.
We are going to take the seasonal period as 6. We will keep the p(1) and q(1)
parameters same as the ARIMA model.
The seasonal Auto-Regressive parameter in a SARIMA model is 'P', which comes from the significant lag after which the PACF plot cuts off to 0. The seasonal Moving-Average parameter is 'Q', which comes from the significant lag after which the ACF plot cuts off to 0. Remember to check the ACF and the PACF plots only at multiples of 6 (since 6 is the seasonal period).
This is a common problem while building models by looking at the ACF and the
PACF plots. But we are able to explain the model.
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-
step).
[2] Covariance matrix is singular or near-singular, with condition number
Predict on the Test Set using this model and evaluate the model.
SARIMA summary-
This is where our model building ends.
Now, we will take our best model and forecast 12 months into the future with
appropriate confidence intervals to see how the predictions look. We have to
build our model on the full data for this.
9. Building the most optimum model on the Full Data-
We can see that we have annual seasonality rather than half-yearly seasonality.
The residuals appear to be NORMALLY DISTRIBUTED (as seen in the model diagnostics plot).
Results-
Evaluate the model on the whole data and predict 12 months into the future (till the end of the next year).
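A sketch of this final step, refitting the best SARIMA order reported below, (0,1,2)(2,0,2,12), on the complete data and producing a 12-month forecast with a 95% confidence band:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Refit the best model on the full series
final_model = SARIMAX(df_1['Sparkling'],
                      order=(0, 1, 2),
                      seasonal_order=(2, 0, 2, 12)).fit(disp=False)

# Forecast 12 months ahead with a 95% confidence interval
forecast = final_model.get_forecast(steps=12)
future_mean = forecast.predicted_mean
conf_band = forecast.conf_int(alpha=0.05)

print(future_mean)
print(conf_band)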
Plot the forecast along with the confidence band-
Final result of RMSE-
After arranging in ascending order
SARIMA (0,1,2)(2,0,2,12)
Insights and recommendations-
1. We have loaded the Sparkling.csv dataset.
2. Performed EDA to check whether there are any missing values and outliers
present in the dataset.
3. After the EDA, we split the data into train and test.
Train is the data the models are fitted on.
Test is the held-out (actual) data against which the predictions are compared.
Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the last time stamp provided.
4. Built different models and evaluated their accuracy using RMSE on the test data:
Linear Regression model
Naïve Model
Simple average model
Simple exponential model
Trailing moving average model
Double exponential model
Triple exponential model
5. Built the automated ARIMA/SARIMA MODEL
6. Many different forecasting algorithms and analysis methods can be applied to extract the relevant information that is required. Regardless of whether we use autoregressive algorithms to determine the trend patterns for forecasting or the ARIMA model to deduce the correlation pattern of the data, it all depends on the application use case and its complexity. Since most time series forecasting analyses are fairly routine, choosing the easiest and simplest adequate model is the best way to look at it.
7. So we have:
read the problem and described it,
built the various models,
evaluated the models.
8. In the month of December, the sales of Sparkling Wine increase; there is more demand than in the other months.
9. Match the season with customer demand and trends.
10. In-store tastings and events can attract customers to the store and can increase sales.
11. We can also see that in the years 1981, 1983 and 1994, wine sales in the months of October and November remained constant and started fluctuating afterwards; the company needs to pay attention to this.