Mini Project Time Series
Table of Contents
Project Objective
Exploratory Data Analysis and Descriptive Statistics
Summary Statistics of the dataset
Creation of Training and Testing Data set
Model Building – Random Walk with Drift
Model Building – Holt-Winters
Model Building – ARIMA
Model Comparison
Appendix – Sample Source Code
PROBLEM OBJECTIVE:
Solution:
Read the dataset
library(readxl)
demand <- read_excel("demand.xlsx") # file name assumed; the source file is not named in the report
str(demand)
Data summary
summary(demand[,3:4])
## IteamA IteamB
## Min. :1954 Min. :1153
## 1st Qu.:2748 1st Qu.:2362
## Median :3134 Median :2876
## Mean :3263 Mean :2962
## 3rd Qu.:3741 3rd Qu.:3468
## Max. :5725 Max. :5618
#Plotting Time Series for Item A for monthly data from Jan 2002 to Jul 2017
dem_ItA <- ts(demand[,3], start=c(2002,1), end=c(2017,7), frequency=12)
plot(dem_ItA)
#Plotting Time Series for Item B for monthly data from Jan 2002 to Jul 2017
dem_ItB <- ts(demand[,4], start=c(2002,1), end=c(2017,7), frequency=12)
plot(dem_ItB)
#Plotting the Time Series across Item A and Item B
ts.plot(dem_ItA, dem_ItB, gpars=list(col=c("black", "red")), xlab="year", ylab="demand")
legend("topleft", colnames(demand[3:4]), col=c("black", "red"), lty=1, cex=.45)
From the plots above, Item A shows increasing demand whereas Item B shows falling demand. Both series exhibit seasonality and trend, but neither appears cyclic in nature. The variation of Item A increases with time whereas the variation of Item B decreases. (Besides trend and seasonality, a cyclic component is one that recurs over much longer, irregular periods; none is evident here.)
Additive or multiplicative decomposition? To get a successful decomposition, it is important to choose between the additive and the multiplicative model. To choose the right model we need to look at the time series:
a. The additive model is useful when the seasonal variation is relatively constant over time.
b. The multiplicative model is useful when the seasonal variation increases over time.
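As a quick check, both decompositions can be computed and their remainders compared; a minimal sketch using base R's decompose() (not part of the original report):
dec_add <- decompose(dem_ItA[,1], type="additive") # additive: seasonal swings roughly constant
dec_mul <- decompose(dem_ItA[,1], type="multiplicative") # multiplicative: swings scale with the level
plot(dec_add)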
#Item A
monthplot(dem_ItA)
The seasonal variation looked to be about the same magnitude across time, so an additive
decomposition might give good results.
Decomposing the time series using STL
STL is a very versatile and robust method for decomposing time series. STL is an acronym for
“Seasonal and Trend decomposition using Loess”. It does an additive decomposition and the four
graphs are the original data, seasonal component, trend component and the remainder.
#Item A
ItA_Sea <- stl(dem_ItA[,1], s.window="p") #constant seasonality
plot(ItA_Sea)
#Item B
ItB_Sea <- stl(dem_ItB[,1], s.window="p") #constant seasonality
plot(ItB_Sea)
From the decomposition above, we can see a continuous increase in demand for Item A and, on the contrary, a similar declining pattern for Item B.
Decompose the time series and plot the deseasonalized series
If the focus is on figuring out whether the general trend of demand is up, we deseasonalize and can set the seasonal component aside. However, to forecast the demand in the next month, both the secular trend and the seasonality must be taken into account.
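The deseasonalized series can be extracted directly from the STL fits computed above; a minimal sketch using forecast::seasadj() (not shown in the original report):
library(forecast)
# Remove the seasonal component estimated by STL, leaving trend + remainder
ItA_deseason <- seasadj(ItA_Sea)
ItB_deseason <- seasadj(ItB_Sea)
plot(ItA_deseason)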
Random Walk with Drift
library(forecast)
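The construction of the training and test sets is not shown in the report; a minimal sketch consistent with h=19 below (train through Dec 2015, hold out the remaining 19 months; the STL fits and object names are assumed from the calls that use them):
DataATrain <- window(dem_ItA[,1], end=c(2015,12))
DataATest <- window(dem_ItA[,1], start=c(2016,1))
DataBTrain <- window(dem_ItB[,1], end=c(2015,12))
DataBTest <- window(dem_ItB[,1], start=c(2016,1))
ItmATrn <- stl(DataATrain, s.window="p") # STL fit on the training series (assumed)
ItmBTrn <- stl(DataBTrain, s.window="p")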
fcst.ItA.stl <- forecast(ItmATrn, method="rwdrift", h=19)
fcst.ItB.stl <- forecast(ItmBTrn, method="rwdrift", h=19)
VecA<- cbind(DataATest,fcst.ItA.stl$mean)
VecB<- cbind(DataBTest,fcst.ItB.stl$mean)
Item A
#par(mfrow=c(1,1), mar=c(2, 2, 2, 2), mgp=c(3, 1, 0), las=0)
ts.plot(VecA, col=c("blue", "red"), xlab="year", ylab="demand", main="Monthly Demand A: Actual vs Forecast")
Mean absolute percentage error (MAPE)
MAPE is the mean absolute percentage deviation between the forecasts and the eventual outcomes.
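For reference, a minimal hand-rolled version of the metric (MLmetrics::MAPE, used later in this report, computes the same quantity):
# Mean of |actual - forecast| / |actual|
mape <- function(y_pred, y_true) mean(abs(y_true - y_pred) / abs(y_true))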
MAPE(VecA[,1], VecA[,2])
## [1] 0.1408798
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
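The test can be run on the random-walk-with-drift residuals; a sketch (the residual component of the forecast object is assumed):
Box.test(fcst.ItA.stl$residuals, lag=20, type="Ljung-Box")
The conclusion section reports p-value < 2.2e-16 for Item A, so H0 is rejected: these residuals are not independent.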
Item B
ts.plot(VecB, col=c("blue", "red"), xlab="year", ylab="demand", main="Monthly Demand B: Actual vs Forecast")
Mean absolute percentage error (MAPE)
MAPE for Item B is computed in the same way:
MAPE(VecB[,1], VecB[,2])
## [1] 0.1082608
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
For Item B the conclusion section reports p-value = 2.931e-13, so H0 is again rejected.
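Holt-Winters
The Holt-Winters fitting step is not shown in the report; a minimal sketch, assuming triple exponential smoothing fit on the training series:
hwA <- HoltWinters(DataATrain) # level + trend + seasonal smoothing
hwB <- HoltWinters(DataBTrain)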
plot(hwA)
Item A
MAPE(VecA1[,1], VecA1[,2])
## [1] 0.1160528
Item B
plot(hwB)
MAPE(VecB1[,1], VecB1[,2])
## [1] 0.1867152
ARIMA Model
Check whether the time series is stationary
Dickey-Fuller test
Statistical tests make strong assumptions about your data and can only inform the degree to which a null hypothesis can be rejected; the result must be interpreted in the context of the problem to be meaningful. Nevertheless, they provide a quick, confirmatory check that a time series is stationary or non-stationary.
Null Hypothesis (H0): The time series has a unit root, meaning it is non-stationary and has some time-dependent structure.
Alternate Hypothesis (H1): The time series does not have a unit root, meaning it is stationary and has no time-dependent structure.
p-value > 0.05: Fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
p-value <= 0.05: Reject the null hypothesis (H0); the data does not have a unit root and is stationary.
Item A
library(tseries)
adf.test(dem_ItA)
## Warning in adf.test(dem_ItA): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dem_ItA
## Dickey-Fuller = -7.8632, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Item B
library(tseries)
adf.test(dem_ItB)
## Warning in adf.test(dem_ItB): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dem_ItB
## Dickey-Fuller = -12.967, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Although the test already indicates stationarity, the series is differenced once as a further check:
diff_dem_ItB <- diff(dem_ItB)
plot(diff_dem_ItB)
adf.test(diff(dem_ItB))
## Warning in adf.test(diff(dem_ItB)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(dem_ItB)
## Dickey-Fuller = -9.8701, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
diff_dem_ItA <- diff(dem_ItA)
acf(dem_ItA, lag=15)
acf(diff_dem_ItA, lag=15)
acf(dem_ItB, lag=15)
acf(diff_dem_ItB, lag=15)
Checking with lag 50
acf(dem_ItA,lag=50)
acf(diff_dem_ItA, lag=50)
pacf(dem_ItA)
pacf(diff_dem_ItA)
acf(dem_ItB,lag=50)
acf(diff_dem_ItB, lag=50)
pacf(dem_ItB)
pacf(diff_dem_ItB)
ARIMA model
ARMA models are commonly used in time series modeling; AR stands for auto-regression and MA stands for moving average.
In the ACF and PACF plots above, the values alternate between positive and negative (as expected for a stationary series); there is no sharp cut-off that would indicate an AR(2) process, and the PACF does not decay gradually, so an MA(2) component is not indicated either. We therefore let auto.arima() select the model order.
Item A
ItA.arima.fit.train <- auto.arima(DataATrain, seasonal=TRUE)
plot(ItA.arima.fit.train$x, col="blue", main="Demand A: Actual vs Forecast")
lines(ItA.arima.fit.train$fitted, col="red")
MAPE(ItA.arima.fit.train$fitted, ItA.arima.fit.train$x)
## [1] 0.0733376
The MAPE is now reduced to 7.3% for the ARIMA model.
acf(ItA.arima.fit.train$residuals)
pacf(ItA.arima.fit.train$residuals)
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
With p-value = 0.0809 (> 0.05) we fail to reject H0, so the residuals are independent.
From the plot and data, we can see the forecast follows the actual values almost exactly, with points of intersection at Jan 2016, May 2016, Dec 2016 and Jan 2017.
Item B
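The Item B fit is not shown anywhere in the report; presumably it mirrors Item A:
ItB.arima.fit.train <- auto.arima(DataBTrain, seasonal=TRUE) # assumed, mirroring Item A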
plot(ItB.arima.fit.train$x, col="blue", main="Demand B: Actual vs Forecast")
lines(ItB.arima.fit.train$fitted, col="red")
MAPE(ItB.arima.fit.train$fitted, ItB.arima.fit.train$x)
## [1] 0.07654621
acf(ItB.arima.fit.train$residuals)
pacf(ItB.arima.fit.train$residuals)
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
With p-value = 0.09177 (> 0.05) we again fail to reject H0, so the residuals are independent.
From the plot and data, we can see the forecast does not follow the actual values exactly, but there are points of intersection at Mar 2016, Apr 2016, May 2016, Nov 2016 and Mar 2017.
Conclusion
For this time series forecasting problem, we observed both trend and seasonality in the data.
Item A has an increasing trend, while the trend for Item B is declining.
For both items there are a few months with high seasonal variation, and Item A has a few outliers.
As the seasonal variation did not grow with the trend, we used additive seasonality. We built three models:
1. Random Walk with Drift,
2. Holt-Winters, and
3. ARIMA.
Below are the MAPE and Box-Ljung test observations for the models.
Random Walk with Drift
Item A: 0.1408798 (14.1%), p-value < 2.2e-16
Item B: 0.1082608 (10.8%), p-value = 2.931e-13
Holt-Winters
Item A: 0.1160528 (11.6%), p-value = 0.8188
Item B: 0.1867152 (18.7%), p-value = 0.873
ARIMA
Item A: 0.0733376 (7.3%), p-value = 0.0809
Item B: 0.07654621 (7.7%), p-value = 0.09177
The ARIMA models gave the lowest MAPE values (and residuals that pass the Box-Ljung independence test), so ARIMA was selected for the final forecasting.
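The forecast objects plotted and printed below are not constructed in the report body; a minimal sketch, assuming the selected ARIMA models are refit on the full series and forecast 19 months ahead (Aug 2017 through Feb 2019):
fcastA <- forecast(auto.arima(dem_ItA[,1], seasonal=TRUE), h=19) # assumed refit on full data
fcastB <- forecast(auto.arima(dem_ItB[,1], seasonal=TRUE), h=19)
Item A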
plot(fcastA)
fcastA
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Aug 2017 4320.211 3879.991 4760.431 3646.953 4993.469
## Sep 2017 4169.513 3725.551 4613.476 3490.531 4848.495
## Oct 2017 4428.791 3981.385 4876.197 3744.542 5113.040
## Nov 2017 5102.669 4652.091 5553.246 4413.570 5791.767
## Dec 2017 5879.220 5425.721 6332.719 5185.653 6572.787
## Jan 2018 2819.535 2363.343 3275.727 2121.849 3517.221
## Feb 2018 3990.984 3532.307 4449.660 3289.498 4692.469
## Mar 2018 4181.449 3720.480 4642.419 3476.458 4886.441
## Apr 2018 4081.089 3618.003 4544.174 3372.860 4789.317
## May 2018 3888.336 3423.296 4353.376 3177.118 4599.554
## Jun 2018 4029.525 3562.679 4496.370 3315.545 4743.504
## Jul 2018 4390.292 3921.777 4858.807 3673.760 5106.823
## Aug 2018 4407.590 3900.778 4914.402 3632.487 5182.693
## Sep 2018 4257.019 3747.019 4767.019 3477.041 5036.997
## Oct 2018 4516.419 4003.480 5029.358 3731.946 5300.892
## Nov 2018 5190.414 4674.763 5706.065 4401.794 5979.034
## Dec 2018 5967.079 5448.925 6485.232 5174.632 6759.526
## Jan 2019 2907.502 2387.039 3427.966 2111.522 3703.483
## Feb 2019 4079.056 3556.458 4601.654 3279.812 4878.301
Item B
plot(fcastB)
fcastB
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Aug 2017 2356.3605 1945.1156 2767.605 1727.4157 2985.305
## Sep 2017 2082.9473 1671.7024 2494.192 1454.0025 2711.892
## Oct 2017 1784.7949 1373.5500 2196.040 1155.8501 2413.740
## Nov 2017 2436.4019 2025.1570 2847.647 1807.4571 3065.347
## Dec 2017 2429.8611 2018.6162 2841.106 1800.9163 3058.806
## Jan 2018 965.2270 553.9834 1376.471 336.2842 1594.170
## Feb 2018 1278.2300 866.9865 1689.474 649.2873 1907.173
## Mar 2018 1693.3730 1282.1294 2104.617 1064.4302 2322.316
## Apr 2018 2088.9240 1677.6805 2500.168 1459.9813 2717.867
## May 2018 2342.6726 1931.4290 2753.916 1713.7298 2971.615
## Jun 2018 2587.5703 2176.3268 2998.814 1958.6276 3216.513
## Jul 2018 2903.9519 2492.7084 3315.196 2275.0092 3532.895
## Aug 2018 2267.6121 1814.7093 2720.515 1574.9570 2960.267
## Sep 2018 1945.3845 1492.4816 2398.287 1252.7293 2638.040
## Oct 2018 1663.8271 1210.9242 2116.730 971.1719 2356.482
## Nov 2018 2293.2590 1840.3561 2746.162 1600.6038 2985.914
## Dec 2018 2325.0672 1872.1643 2777.970 1632.4120 3017.722
## Jan 2019 843.6106 390.7094 1296.512 150.9580 1536.263
## Feb 2019 1150.9134 698.0123 1603.815 458.2608 1843.566
Appendix – Sample Source Code
str(demand)
dim(demand)
## [1] 187 4
View(demand)
head(demand)
summary(demand[,3:4])
## IteamA IteamB
## Min. :1954 Min. :1153
## 1st Qu.:2748 1st Qu.:2362
## Median :3134 Median :2876
## Mean :3263 Mean :2962
## 3rd Qu.:3741 3rd Qu.:3468
## Max. :5725 Max. :5618
#Plotting Time Series for Item A for monthly data from Jan 2002 to Jul 2017
dem_ItA <- ts(demand[,3], start=c(2002,1), end=c(2017,7), frequency=12)
plot(dem_ItA)
#Plotting Time Series for Item B for monthly data from Jan 2002 to Jul 2017
dem_ItB <- ts(demand[,4], start=c(2002,1), end=c(2017,7), frequency=12)
plot(dem_ItB)
#Plotting the Time Series across Item A and Item B
ts.plot(dem_ItA, dem_ItB, gpars=list(col=c("black", "red")), xlab="year", ylab="demand")
legend("topleft", colnames(demand[3:4]), col=c("black", "red"), lty=1, cex=.45)
#Item A
monthplot(dem_ItA)
boxplot(dem_ItA ~ cycle(dem_ItA))
#Item B
monthplot(dem_ItB)
boxplot(dem_ItB ~ cycle(dem_ItB))
#Item A
ItA_Sea <- stl(dem_ItA[,1], s.window="p") #constant seasonality
plot(ItA_Sea)
#Item B
ItB_Sea <- stl(dem_ItB[,1], s.window="p") #constant seasonality
plot(ItB_Sea)
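# Assumed steps, not present in the original listing: train/test split
# (training through Dec 2015, holding out 19 months to match h=19) and the
# Holt-Winters fits used below.
DataATrain <- window(dem_ItA[,1], end=c(2015,12))
DataATest <- window(dem_ItA[,1], start=c(2016,1))
DataBTrain <- window(dem_ItB[,1], end=c(2015,12))
DataBTest <- window(dem_ItB[,1], start=c(2016,1))
hwA <- HoltWinters(DataATrain)
hwB <- HoltWinters(DataBTrain)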
plot(hwA)
hwAForecast <- forecast(hwA, h=19)
VecA1 <- cbind(DataATest,hwAForecast)
par(mfrow=c(1,1), mar=c(2, 2, 2, 2), mgp=c(3, 1, 0), las=0)
ts.plot(VecA1[,1], VecA1[,2], col=c("blue","red"), xlab="year", ylab="demand", main="Demand A: Actual vs Forecast")
Box.test(hwAForecast$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: hwAForecast$residuals
## X-squared = 14.227, df = 20, p-value = 0.8188
#install.packages("MLmetrics")
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 3.5.1
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
MAPE(VecA1[,1],VecA1[,2])
## [1] 0.1160528
plot(hwB)
hwBForecast <- forecast(hwB, h=19)
VecB1 <- cbind(DataBTest,hwBForecast)
par(mfrow=c(1,1), mar=c(2, 2, 2, 2), mgp=c(3, 1, 0), las=0)
ts.plot(VecB1[,1], VecB1[,2], col=c("blue","red"), xlab="year", ylab="demand", main="Demand B: Actual vs Forecast")
Box.test(hwBForecast$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: hwBForecast$residuals
## X-squared = 13.101, df = 20, p-value = 0.873
MAPE(VecB1[,1],VecB1[,2])
## [1] 0.1867152
library(tseries)
adf.test(dem_ItA)
## Warning in adf.test(dem_ItA): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dem_ItA
## Dickey-Fuller = -7.8632, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
diff_dem_ItA <- diff(dem_ItA)
plot(diff_dem_ItA)
adf.test(diff(dem_ItA))
## Warning in adf.test(diff(dem_ItA)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(dem_ItA)
## Dickey-Fuller = -8.0907, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
library(tseries)
adf.test(dem_ItB)
## Warning in adf.test(dem_ItB): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dem_ItB
## Dickey-Fuller = -12.967, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
acf(dem_ItA,lag=15)
acf(diff_dem_ItA, lag=15)
acf(dem_ItB,lag=15)
acf(diff_dem_ItB, lag=15)
acf(dem_ItA,lag=50)
acf(diff_dem_ItA, lag=50)
pacf(dem_ItA)
pacf(diff_dem_ItA)
acf(dem_ItB,lag=50)
acf(diff_dem_ItB, lag=50)
pacf(dem_ItB)
pacf(diff_dem_ItB)
ItA.arima.fit.train <- auto.arima(DataATrain, seasonal=TRUE)
ItA.arima.fit.train
## Series: DataATrain
## ARIMA(0,0,0)(0,1,1)[12] with drift
##
## Coefficients:
## sma1 drift
## -0.6581 3.9132
## s.e. 0.0798 0.9188
##
## sigma^2 estimated as 116022: log likelihood=-1133.35
## AIC=2272.71 AICc=2272.86 BIC=2281.86
plot(ItA.arima.fit.train$residuals)
plot(ItA.arima.fit.train$x, col="blue", main="Demand A: Actual vs Forecast")
lines(ItA.arima.fit.train$fitted, col="red")
MAPE(ItA.arima.fit.train$fitted,ItA.arima.fit.train$x)
## [1] 0.0733376
## The MAPE percentage error is now reduced to 7.3% for the ARIMA model
acf(ItA.arima.fit.train$residuals)
pacf(ItA.arima.fit.train$residuals)
Box.test(ItA.arima.fit.train$residuals, lag = 10, type = c("Ljung-Box"), fitdf = 0)
##
## Box-Ljung test
##
## data: ItA.arima.fit.train$residuals
## X-squared = 16.716, df = 10, p-value = 0.0809
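# Assumed, mirroring Item A (the Item B fit is not in the original listing):
ItB.arima.fit.train <- auto.arima(DataBTrain, seasonal=TRUE)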
acf(ItB.arima.fit.train$residuals)
pacf(ItB.arima.fit.train$residuals)
Box.test(ItB.arima.fit.train$residuals, lag = 10, type = c("Ljung-Box"), fitdf = 0)
##
## Box-Ljung test
##
## data: ItB.arima.fit.train$residuals
## X-squared = 16.285, df = 10, p-value = 0.09177
## Conclusion: p-value = 0.09177 > 0.05, fail to reject H0: residuals are independent
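# Assumed final-forecast step (not shown in the original listing): refit the
# selected ARIMA models on the full series and forecast 19 months ahead.
fcastA <- forecast(auto.arima(dem_ItA[,1], seasonal=TRUE), h=19)
fcastB <- forecast(auto.arima(dem_ItB[,1], seasonal=TRUE), h=19)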
fcastA
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Aug 2017 4320.211 3879.991 4760.431 3646.953 4993.469
## Sep 2017 4169.513 3725.551 4613.476 3490.531 4848.495
## Oct 2017 4428.791 3981.385 4876.197 3744.542 5113.040
## Nov 2017 5102.669 4652.091 5553.246 4413.570 5791.767
## Dec 2017 5879.220 5425.721 6332.719 5185.653 6572.787
## Jan 2018 2819.535 2363.343 3275.727 2121.849 3517.221
## Feb 2018 3990.984 3532.307 4449.660 3289.498 4692.469
## Mar 2018 4181.449 3720.480 4642.419 3476.458 4886.441
## Apr 2018 4081.089 3618.003 4544.174 3372.860 4789.317
## May 2018 3888.336 3423.296 4353.376 3177.118 4599.554
## Jun 2018 4029.525 3562.679 4496.370 3315.545 4743.504
## Jul 2018 4390.292 3921.777 4858.807 3673.760 5106.823
## Aug 2018 4407.590 3900.778 4914.402 3632.487 5182.693
## Sep 2018 4257.019 3747.019 4767.019 3477.041 5036.997
## Oct 2018 4516.419 4003.480 5029.358 3731.946 5300.892
## Nov 2018 5190.414 4674.763 5706.065 4401.794 5979.034
## Dec 2018 5967.079 5448.925 6485.232 5174.632 6759.526
## Jan 2019 2907.502 2387.039 3427.966 2111.522 3703.483
## Feb 2019 4079.056 3556.458 4601.654 3279.812 4878.301
fcastB
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Aug 2017 2356.3605 1945.1156 2767.605 1727.4157 2985.305
## Sep 2017 2082.9473 1671.7024 2494.192 1454.0025 2711.892
## Oct 2017 1784.7949 1373.5500 2196.040 1155.8501 2413.740
## Nov 2017 2436.4019 2025.1570 2847.647 1807.4571 3065.347
## Dec 2017 2429.8611 2018.6162 2841.106 1800.9163 3058.806
## Jan 2018 965.2270 553.9834 1376.471 336.2842 1594.170
## Feb 2018 1278.2300 866.9865 1689.474 649.2873 1907.173
## Mar 2018 1693.3730 1282.1294 2104.617 1064.4302 2322.316
## Apr 2018 2088.9240 1677.6805 2500.168 1459.9813 2717.867
## May 2018 2342.6726 1931.4290 2753.916 1713.7298 2971.615
## Jun 2018 2587.5703 2176.3268 2998.814 1958.6276 3216.513
## Jul 2018 2903.9519 2492.7084 3315.196 2275.0092 3532.895
## Aug 2018 2267.6121 1814.7093 2720.515 1574.9570 2960.267
## Sep 2018 1945.3845 1492.4816 2398.287 1252.7293 2638.040
## Oct 2018 1663.8271 1210.9242 2116.730 971.1719 2356.482
## Nov 2018 2293.2590 1840.3561 2746.162 1600.6038 2985.914
## Dec 2018 2325.0672 1872.1643 2777.970 1632.4120 3017.722
## Jan 2019 843.6106 390.7094 1296.512 150.9580 1536.263
## Feb 2019 1150.9134 698.0123 1603.815 458.2608 1843.566