Time Series
Problem:
Sales data for two types of wine in the 20th century are to be analysed. Both datasets come from
the same company but cover different wines. As an analyst at ABC Estate Wines, you are tasked with analysing
and forecasting wine sales in the 20th century.
1. Read the data as an appropriate Time Series data and plot the data.
Time series plots (Rose, Sparkling)
Observations:
Rose
• Rose wine sales show a decreasing trend in the initial years, which stabilizes after a few years and then decreases again.
• Rose wine sales show seasonality; the trend and pattern seem to repeat on a yearly basis.
Sparkling
• Sparkling wine sales show little trend in the yearly sales.
• Sparkling wine sales show seasonality with a yearly pattern.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
• There are 187 observations in each data set, representing the monthly sales of the respective wine from January 1980 to July 1995.
• The data has two variables: the year/month of sale and the sales for that month.
• The mean, minimum and maximum values for Sparkling wine sales are greater than those for Rose wine sales.
Box plots of yearly sales (Rose, Sparkling)
Box plots of monthly sales (Rose, Sparkling)
Rose
• In agreement with the time series plot, the year-wise boxplots also indicate a measure of downward trend.
• The sales of Rose wine have some outliers for certain years.
• December seems to have the highest sales of Rose wine, and there are also outliers in June, July, August and September.
• Maximum sales of Rose wine are in December and minimum sales in January.
Sparkling
• In agreement with the time series plot, the boxplots do not indicate any particular trend.
• The sales of Sparkling wine have some outliers for almost all years except 1995.
• December also has the highest sales value for Sparkling wine.
• Maximum sales of Sparkling wine are in December and minimum sales in June.
Month, cumulative % and month-on-month % sales plots (Rose, Sparkling)
Rose
• The median values keep increasing from January to December. The average sales value shows a decreasing trend.
• There were two missing values, which were interpolated using the linear method.
• For the additive model the residual values are around 0, and for the multiplicative model the residuals are around 1.
Sparkling
• The median values are stable from January to June and show an increasing trend from July to December. The average sales value does not show a trend.
• For the additive model the residual values are around 0, and for the multiplicative model the residuals are around 1.
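The linear interpolation of the two missing Rose values can be illustrated with a minimal sketch. The report does not show its code, so the function name `interpolate_linear` and the sample numbers below are assumptions made only for illustration.

```python
def interpolate_linear(values):
    """Fill None gaps with points on the straight line between the nearest known neighbours.

    Gaps at the very start or end of the series are left as None, because
    there is no neighbour on one side to draw the line to.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        step = (filled[right] - filled[left]) / gap
        for offset in range(1, gap):
            filled[left + offset] = filled[left] + step * offset
    return filled


# Hypothetical monthly sales with two consecutive missing months.
sales = [45.0, None, None, 51.0, 50.0]
filled_sales = interpolate_linear(sales)
print(filled_sales)
```

Each missing month receives an equal share of the change between the last value before the gap and the first value after it.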
3. Split the data into training and test. The test data should start in 1991.
Train and test data plots (Rose wine sales, Sparkling wine sales)
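A minimal sketch of the split, assuming the series is held as (year, month) labels alongside the sales values. The helper names and the placeholder values are hypothetical; only the date range (187 months from January 1980) and the 1991 cut-off come from the report.

```python
def monthly_index(start_year, start_month, periods):
    """Return consecutive (year, month) pairs starting at the given month."""
    index = []
    year, month = start_year, start_month
    for _ in range(periods):
        index.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return index


def split_at_year(index, values, first_test_year):
    """Observations before first_test_year go to train, the rest to test."""
    pairs = list(zip(index, values))
    train = [(date, v) for date, v in pairs if date[0] < first_test_year]
    test = [(date, v) for date, v in pairs if date[0] >= first_test_year]
    return train, test


index = monthly_index(1980, 1, 187)       # January 1980 to July 1995
values = [float(i) for i in range(187)]   # placeholder values, not real sales
train, test = split_at_year(index, values, 1991)
print(len(train), len(test), test[0][0], test[-1][0])
```

With 187 monthly observations, this cut-off leaves 132 months for training (1980 to 1990) and 55 months for testing (January 1991 to July 1995).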
4. Build various exponential smoothing models on the training data and evaluate the model using RMSE
on the test data. Other models such as regression, naïve forecast models, simple average models etc.
should also be built on the training data and check the performance on the test data using RMSE.
Model 1 - Linear regression
Rose: the root mean square error (RMSE) of the linear regression model = 15.25
Sparkling: the RMSE of the linear regression model = 1389.13
The predicted values for the test data from the linear regression model appear as a sloped straight line. The line shows a downward trend for Rose sales and an upward trend for Sparkling wine sales.
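As a rough illustration of this model, the sketch below regresses sales on a simple time index (1, 2, 3, ...) and scores the extrapolated line with RMSE. Using the time index as the only regressor is an assumption, and the numbers are hypothetical, not the wine data.

```python
import math


def fit_line(y):
    """Ordinary least squares of y on the time index 1..n; returns (intercept, slope)."""
    n = len(y)
    t = range(1, n + 1)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    covariance = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    variance = sum((ti - t_mean) ** 2 for ti in t)
    slope = covariance / variance
    return y_mean - slope * t_mean, slope


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


train = [100.0, 98.0, 96.0, 94.0]   # hypothetical, steadily falling sales
test = [92.0, 91.0]
intercept, slope = fit_line(train)
# The test period continues the time index where the training period stops.
first_t = len(train) + 1
forecast = [intercept + slope * t for t in range(first_t, first_t + len(test))]
test_rmse = rmse(test, forecast)
print(slope, forecast, test_rmse)
```

A negative slope corresponds to the downward line seen for Rose, a positive slope to the upward line seen for Sparkling.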
Model 2 – Naïve model
Rose: for the naïve model forecast on the test data, RMSE is 79.672
Sparkling: for the naïve model forecast on the test data, RMSE is 3864.279
In both cases the forecast plots as a straight line.
The RMSE values of the naïve model for both datasets are much higher than those of the regression model.
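A naïve forecast repeats the last training observation for every test period, which is why it plots as a flat line. The sketch below uses hypothetical numbers and helper names.

```python
import math


def naive_forecast(train, horizon):
    """Repeat the last observed training value for every future period."""
    return [train[-1]] * horizon


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


train = [112.0, 118.0, 129.0]   # hypothetical sales
test = [126.0, 135.0]
forecast = naive_forecast(train, len(test))
test_rmse = rmse(test, forecast)
print(forecast, test_rmse)
```

Because the forecast depends on a single observation, an unusually high or low final training month carries straight through to every predicted value.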
Model 3 – Simple Average
Rose: for the simple average forecast on the test data, RMSE is 53.413
Sparkling: for the simple average forecast on the test data, RMSE is 1275.082
In both cases the predicted graph shows a straight line.
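The simple average forecast predicts the mean of all training observations for every test period. A minimal sketch with hypothetical numbers:

```python
def simple_average_forecast(train, horizon):
    """Forecast every future period with the mean of the training data."""
    mean = sum(train) / len(train)
    return [mean] * horizon


train = [10.0, 20.0, 30.0]   # hypothetical sales
forecast = simple_average_forecast(train, 4)
print(forecast)
```

Like the naïve model, this gives a flat line, but its level reflects the whole training history rather than only the last month.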
Model 4 – Rolling Average
Rose
• For the 2-point moving average model forecast on the Training Data, RMSE is 11.530
• For the 4-point moving average model forecast on the Training Data, RMSE is 14.444
• For the 6-point moving average model forecast on the Training Data, RMSE is 14.555
• For the 9-point moving average model forecast on the Training Data, RMSE is 14.722
Sparkling
• For the 2-point moving average model forecast on the Training Data, RMSE is 813.401
• For the 4-point moving average model forecast on the Training Data, RMSE is 1156.590
• For the 6-point moving average model forecast on the Training Data, RMSE is 1283.927
• For the 9-point moving average model forecast on the Training Data, RMSE is 1346.278
For both wines, the lowest score is for the 2-point moving average.
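The report does not say how the rolling windows were aligned. The sketch below assumes a trailing window that includes the current observation, and uses hypothetical numbers.

```python
def trailing_moving_average(values, window):
    """Mean of each value and the (window - 1) values before it.

    Positions where the window is not yet full are returned as None.
    """
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


sales = [10.0, 12.0, 14.0, 16.0, 18.0]   # hypothetical sales
ma2 = trailing_moving_average(sales, 2)
ma4 = trailing_moving_average(sales, 4)
print(ma2)
print(ma4)
```

Under this assumption a shorter window stays closer to the observed series, which is consistent with the 2-point average having the lowest RMSE.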
Models 5 and 6 – Simple exponential smoothing
Rose: the best alpha value according to the best parameters is 0.0987, with RMSE = 36.7
Sparkling: the best alpha value according to the best parameters is 0.04960, with RMSE = 1316
For both wines, the alpha value 0.1 gives the least RMSE.
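Choosing alpha by RMSE can be sketched as a small search over candidate values. The initialisation (level starts at the first observation), the candidate grid and the data are all assumptions made for illustration; the report does not state how its search was set up.

```python
import math


def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing: flat forecast at the final smoothed level."""
    level = train[0]   # assumption: initialise the level with the first observation
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def best_alpha(train, test, alphas):
    """Return the candidate alpha whose forecast has the lowest RMSE on test."""
    return min(alphas, key=lambda a: rmse(test, ses_forecast(train, a, len(test))))


train = [10.0, 10.0, 10.0, 20.0]   # hypothetical sales
test = [12.0, 12.0]
alphas = [round(0.1 * k, 1) for k in range(1, 10)]   # 0.1, 0.2, ..., 0.9
chosen = best_alpha(train, test, alphas)
print(chosen, ses_forecast(train, chosen, len(test)))
```

A small alpha gives older observations more weight, so the forecast reacts slowly to the most recent months.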
Model 7 – Double exponential smoothing
Rose: the best alpha value is 0.1 and the best beta value is 0.1; the RMSE on the test data is 36.87
Sparkling: the best alpha value is 0.1 and the best beta value is 0.1; the RMSE on the test data is 1778
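Double exponential smoothing (Holt's method) adds a smoothed trend to the smoothed level. The sketch below is one common additive formulation; the initial values (first observation for the level, first difference for the trend) and the data are assumptions, not the report's settings.

```python
def holt_forecast(train, alpha, beta, horizon):
    """Holt's linear trend method with additive trend."""
    level = train[0]                 # assumption: start level at first value
    trend = train[1] - train[0]      # assumption: start trend at first difference
    for y in train[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Each step ahead adds one more unit of the final trend.
    return [level + h * trend for h in range(1, horizon + 1)]


train = [10.0, 12.0, 14.0, 16.0]   # hypothetical, perfectly linear sales
forecast = holt_forecast(train, alpha=0.1, beta=0.1, horizon=2)
print(forecast)
```

Unlike simple exponential smoothing, the forecast is a sloped line, so it can follow a rising or falling series but still carries no seasonal pattern.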
Model 8 – Triple exponential smoothing
Triple exponential smoothing predictions on the test set
Rose: alpha = 0.063, beta = 0.055, gamma = 3.11 × 10^-6
Sparkling: alpha = 0.111108, beta = 0.06172, gamma = 0.3950
Model evaluation – table showing the models and their RMSE values in increasing order
Triple exponential smoothing with alpha = 0.1, beta = 0.2 and gamma = 0.2 gave the lowest RMSE and is therefore the best
model for predicting the time series so far.
5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for
stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
The Augmented Dickey-Fuller (ADF) test is a unit root test: it determines whether the series has a unit root and,
consequently, whether it is non-stationary.
• H0: The time series has a unit root and is thus non-stationary.
• H1: The time series does not have a unit root and is thus stationary.
We want the series to be stationary before building ARIMA models, so we want the p-value
of this test to be less than the α value (0.05).
ADF on the original data
Rose: the p-value is greater than alpha, so we cannot reject the null hypothesis. At the 5% significance level the time series is non-stationary.
Sparkling: the p-value is greater than alpha, so we cannot reject the null hypothesis. At the 5% significance level the time series is non-stationary.
ADF on the transformed data
Rose: the p-value is less than alpha = 0.05, so we reject the null hypothesis. The time series is stationary.
Sparkling: the p-value is less than alpha = 0.05, so we reject the null hypothesis. The time series is stationary.
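The report does not name the transformation used to reach stationarity. The sketch below assumes first-order differencing, which would be consistent with d = 1 in the ARIMA models that follow, and pairs it with the decision rule at alpha = 0.05. The p-values in the usage lines are hypothetical, and the ADF statistic itself would come from a statistics library rather than from this code.

```python
def difference(values, lag=1):
    """Return the series of changes y[t] - y[t - lag]."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


def rejects_unit_root(p_value, alpha=0.05):
    """ADF decision rule: reject H0 (unit root, non-stationary) when p < alpha."""
    return p_value < alpha


trending = [100.0, 97.0, 94.0, 91.0, 88.0]   # hypothetical falling series
differenced = difference(trending)
print(differenced)

# Hypothetical p-values, for illustration only.
print(rejects_unit_root(0.34))   # p above alpha: cannot reject, treat as non-stationary
print(rejects_unit_root(0.01))   # p below alpha: reject, treat as stationary
```

Differencing removes a steady trend, leaving a series that fluctuates around a constant level. The result is one observation shorter than the original.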
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.
Automated ARIMA model
Rose
We ran the automated ARIMA model for Rose sales and sorted the resulting AIC values from lowest to highest. We then built the ARIMA model with the lowest Akaike Information Criterion (AIC) and obtained a test RMSE of 15.62.
The table shows the AIC values arranged in descending order for various combinations of p, d and q.
The ARIMA model is built with the best parameters, based on the lowest AIC value in that table. The order (0,1,2) was picked because it is a simpler model than (3,1,3) and the difference between the AIC values of the two combinations is small.
Sparkling
We ran the automated ARIMA model for Sparkling sales and sorted the resulting AIC values from lowest to highest. We then built the ARIMA model with the lowest AIC and obtained a test RMSE of 1374.61.
The table shows the AIC values arranged in descending order for various combinations of p, d and q.
The ARIMA model is built with the best parameters, based on the lowest AIC value in that table.
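The automated search amounts to scoring every (p, d, q) combination and ranking by AIC. In the sketch below, `toy_aic` is a made-up stand-in so the example runs on its own; in practice that function would fit an ARIMA model on the training data with a library such as statsmodels and return the fitted model's AIC. The parameter ranges are also assumptions.

```python
import itertools


def rank_orders_by_aic(p_values, d_values, q_values, aic_of):
    """Score every (p, d, q) combination and return (order, aic) pairs, lowest AIC first."""
    orders = itertools.product(p_values, d_values, q_values)
    scored = [(order, aic_of(order)) for order in orders]
    return sorted(scored, key=lambda pair: pair[1])


def toy_aic(order):
    """Hypothetical scoring function, NOT a real AIC; replace with a model fit."""
    p, d, q = order
    return 1300.0 + 5 * p + 4 * abs(q - 2)


ranking = rank_orders_by_aic(range(0, 4), [1], range(0, 4), toy_aic)
best_order, best_aic = ranking[0]
print(best_order, best_aic)
```

Keeping the full ranking, rather than only the minimum, makes it possible to prefer a simpler order when its AIC is close to the lowest one.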
Automated SARIMA model
Rose
We observe seasonality at intervals of 12 in the ACF plot for Rose sales, so we ran the automated SARIMA models with a seasonal period of 12 and sorted the resulting AIC values from lowest to highest. We then built the SARIMA model with the lowest Akaike Information Criterion. The test RMSE of the model selected by the best AIC value is 26.88.
Sparkling
We observe seasonality at intervals of 12 in the ACF plot for Sparkling sales, so we ran the automated SARIMA models with a seasonal period of 12 and sorted the resulting AIC values from lowest to highest. We then built the SARIMA model with the lowest Akaike Information Criterion. The test RMSE of the model selected by the best AIC value is 526.44.
For both wines, the model diagnostics indicate that the model residuals are normally distributed:
• Standardized residual – does not display any obvious seasonality.
• Histogram plus estimated density – the KDE plot of the residuals is similar to the normal distribution, hence the model residuals are normally distributed.
• Normal Q-Q plot – there is an ordered distribution of residuals (blue dots) following the linear trend of the samples taken from a standard normal distribution N(0, 1).
• Correlogram – the time series residuals have low correlation with lagged versions of themselves.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data
and evaluate this model on the test data using RMSE.
Manual ARIMA
Rose: we built a manual ARIMA model for Rose sales based on the ACF and PACF plots, choosing the AR parameter p = 2, the moving average parameter q = 2 and d = 1. The test RMSE of the ARIMA model at these p, d, q values is 15.34.
Sparkling: we built a manual ARIMA model for Sparkling sales based on the ACF and PACF plots, choosing the AR parameter p = 1, the moving average parameter q = 2 and d = 1. The test RMSE of the ARIMA model at these p, d, q values is 1436.73.
Manual SARIMA
Rose: SARIMA(0,1,2)(2,0,2,12), RMSE: 26.88086124928935
Sparkling: SARIMA(1,1,2)(1,0,1,12), RMSE: 583.5458166126524
For both wines, the model diagnostics indicate that the model residuals are normally distributed:
• Standardized residual – does not display any obvious seasonality.
• Histogram plus estimated density – the KDE plot follows a normal distribution.
• Normal Q-Q plot – there is an ordered distribution of residuals (blue dots) following the linear trend.
• Correlogram – the time series residuals have low correlation with lagged versions of themselves.
8. Build a table (create a data frame) with all the models built along with their corresponding
parameters and the respective RMSE values on the test data.
Tables with the model names and their respective test RMSE values (Rose, Sparkling). The values are sorted in descending order.
Rose: the best model is triple exponential smoothing with alpha = 0.1, beta = 0.2 and gamma = 0.2.
Sparkling: the best model is triple exponential smoothing with alpha = 0.4, beta = 0.1 and gamma = 0.2.
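A comparison table of this kind can be sketched with plain Python lists; a pandas DataFrame sorted by its RMSE column would be the more usual choice for the requested data frame. The rows below are the Rose test RMSE values quoted in the text. The triple exponential smoothing scores, which the report identifies as the lowest, are not stated in the text and are therefore omitted, so this is only a partial table.

```python
# Rose test RMSE values as quoted in the report text (partial list).
results = [
    ("Linear regression", 15.25),
    ("Naive forecast", 79.672),
    ("Simple average", 53.413),
    ("Automated ARIMA", 15.62),
    ("Automated SARIMA", 26.88),
    ("Manual ARIMA(2,1,2)", 15.34),
]

# Sort ascending so the lowest RMSE appears first.
ranked = sorted(results, key=lambda row: row[1])

header = "Model".ljust(24) + "Test RMSE".rjust(10)
rows = [name.ljust(24) + f"{score:10.3f}" for name, score in ranked]
table = "\n".join([header] + rows)
print(table)
```

Sorting in ascending order puts the strongest candidate on the first row; reversing the sort gives the descending layout.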
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and
predict 12 months into the future with appropriate confidence intervals/bands.
Rose: we observed from the RMSE scores that triple exponential smoothing works better for the Rose sales data, which has seasonality and trend. The best model is triple exponential smoothing (Holt-Winters method) with parameters α = 0.1, β = 0.2 and γ = 0.2.
Sparkling: we observed from the RMSE scores that triple exponential smoothing also works better for the Sparkling sales data. The best model is triple exponential smoothing (Holt-Winters method) with parameters α = 0.4, β = 0.1 and γ = 0.2.
Prediction plots at the 95% confidence interval (Rose, Sparkling)
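The report does not state how its 95% bands were computed. One simple approximation, sketched below under the assumption of roughly normal residuals with constant spread, is forecast ± 1.96 × the standard deviation of the model's residuals. The function name and all numbers are hypothetical.

```python
import statistics


def prediction_band(forecast, residuals, z=1.96):
    """Return (lower, upper) lists: forecast minus/plus z times the residual std dev.

    z = 1.96 is the two-sided 95% quantile of the standard normal distribution.
    """
    spread = z * statistics.stdev(residuals)
    lower = [f - spread for f in forecast]
    upper = [f + spread for f in forecast]
    return lower, upper


residuals = [-2.0, 1.0, 0.0, 1.0, 0.0]   # hypothetical in-sample forecast errors
forecast = [50.0, 48.0, 47.0]            # hypothetical next three months
lower, upper = prediction_band(forecast, residuals)
for f, lo, hi in zip(forecast, lower, upper):
    print(f"{f:.1f}  [{lo:.2f}, {hi:.2f}]")
```

This gives a band of constant width; methods that let uncertainty grow with the forecast horizon produce bands that widen further into the future.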
Events and tastings help draw consumers to your store and generate sales. Retailers with economies of
scale successfully sample consumers on more profitable wines. Some even comparison-taste customers
on national brands that are more expensive to demonstrate they are offering a less expensive but superior
product.
And bringing in celebrities, sommeliers or trade reps for tastings can help create excitement and drive
traffic.