Time Series Forecasting Fundamentals
Ravi Mummigatti
1 Forecasting Concepts
1.1 Fundamental Concepts
Forecasting is a technique that uses historical data to estimate how a quantity will evolve in the future (its trend).
1.3 Forecasting Steps
Predictor variables and time series forecasting
Because the electricity demand data form a time series, we could also
use a time series model for
forecasting. In this case, a suitable time
series forecasting equation is of the form
2. Collect data
3. Explore data
1. How will our company sales look in the next few months?
4. How will the Apple stock price change in the coming days?
Data Collection
For other data, such as inflation rates, oil prices, market demand, and market shares, it may be necessary to turn to external sources.
Two kinds of information are required: (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often it is difficult to obtain enough historical data to fit a good statistical model; in that case, judgmental forecasting methods can be used. Occasionally, old data are less useful because of structural changes in the system being forecast, and we may then choose to use only the most recent data.
Missing and corrupted data: errors can appear while data in electronic format are being written, read, stored, transmitted, or processed.
In this way we can (1) see the general trend, (2) detect cycles and seasonal patterns, and (3) identify outliers.
First, each method is used to forecast the data series. Second, the forecast from each method is evaluated to see how well it fits the actual historical data.
$e_t = y_t - F_t$
where $y_t$ is the actual series value and $F_t$ is the forecasted value for period $t$.
The best forecast model is the one with the smallest overall error measure. Which error criterion is appropriate depends on the forecaster’s business goals, knowledge of the data, and personal preferences.
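As a rough illustration of comparing methods by an overall error measure, the sketch below computes the forecast errors for two sets of forecasts and summarizes them with the mean absolute error (MAE) and the root mean squared error (RMSE). The actual values, both forecast vectors, and the helper names mae() and rmse() are made up for this example; MAE and RMSE are only two of several possible criteria.

```r
# Hypothetical actual values and forecasts from two candidate methods
y  <- c(112, 118, 132, 129, 121)
f1 <- c(110, 120, 130, 130, 120)   # forecasts from method 1 (made up)
f2 <- c(100, 110, 140, 120, 130)   # forecasts from method 2 (made up)

# Forecast errors: actual value minus forecasted value
e1 <- y - f1
e2 <- y - f2

# Two common overall error measures (helper names are our own)
mae  <- function(e) mean(abs(e))     # mean absolute error
rmse <- function(e) sqrt(mean(e^2))  # root mean squared error

errors <- data.frame(method = c("method 1", "method 2"),
                     MAE    = c(mae(e1), mae(e2)),
                     RMSE   = c(rmse(e1), rmse(e2)),
                     stringsAsFactors = FALSE)
print(errors)

# Pick the method with the smallest RMSE
best <- errors$method[which.min(errors$RMSE)]
best
```

Here both measures agree on the better method; when they disagree, the choice comes back to the forecaster's goals.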
Key Notations
Notation                       Description
$y_1, y_2, y_3, \ldots, y_n$   The time series values, measured over $n$ periods
$e_t$                          The forecast error for period $t$: the difference between the actual series value and the forecasted value, $y_t - F_t$
2 Exploratory TS Data Analysis
In this section, we will gain insights into how to organize and visualize time series data in R. We will learn several simplifying assumptions that are widely used in time series analysis, as well as common characteristics of financial time series.
Another useful command for viewing time series data in R is the length() (https://www.rdocumentation.org/packages/base/versions/3.3.1/topics/length) function, which tells you the total number of observations in your data.
Some datasets are very long, and previewing a subset of the data is more suitable than displaying the entire series. The head(___, n = ___) and tail(___, n = ___) functions, in which n is the number of items to display, show the first and last few elements of a given dataset, respectively.
Use the print() function to display the River Nile data. The data object is called Nile.
# Print the Nile dataset
print(Nile)
## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020
## [16] 960 1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840
## [31] 874 694 940 833 701 916 692 1020 1050 969 831 726 456 824 702
## [46] 1120 1100 832 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846 812 742 801
## [76] 1040 860 874 848 890 744 749 838 1050 918 986 797 923 975 815
## [91] 1020 906 901 1170 912 746 919 718 714 740
length(Nile)
## [1] 100
head(Nile, n = 10)
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140
tail(Nile, n = 12)
## [1] 975 815 1020 906 901 1170 912 746 919 718 714 740
2.2 Basic time series plots
While simple commands such as print(), length(), head(), and tail() provide crucial information about your time series data, another very useful way to explore any data is to generate a plot.
In this exercise, we will plot the River Nile annual stream flow data using the plot() (https://www.rdocumentation.org/packages/graphics/versions/3.3.1/topics/plot) function. For time series data objects such as Nile, a time index for the horizontal axis is typically included. From the previous exercise, you know that the data span 1871 to 1970, and the horizontal tick marks are labeled accordingly. The default label of "Time" is not very informative. Since these data are annual measurements, you should use the label "Year". While we’re at it, we should change the vertical axis label to "River Volume (1e9 m^{3})".
plot(Nile)
Use a second call to plot() to display the data, but add the additional arguments xlab = "Year" and ylab = "River Volume (1e9 m^{3})".
# Plot the Nile data with the xlab and ylab arguments
plot(Nile,
     xlab = "Year",
     ylab = "River Volume (1e9 m^{3})")
# Plot AirPassengers
plot(AirPassengers)
# View the start and end dates of AirPassengers
start(AirPassengers)
## [1] 1949 1
end(AirPassengers)
## [1] 1960 12
Now let us gain some additional insight into this dataset by using the time(), deltat(), frequency(), and cycle() commands on AirPassengers.
frequency(AirPassengers)
## [1] 12
deltat(AirPassengers)
## [1] 0.08333333
cycle(AirPassengers)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 1 2 3 4 5 6 7 8 9 10 11 12
## 1950 1 2 3 4 5 6 7 8 9 10 11 12
## 1951 1 2 3 4 5 6 7 8 9 10 11 12
## 1952 1 2 3 4 5 6 7 8 9 10 11 12
## 1953 1 2 3 4 5 6 7 8 9 10 11 12
## 1954 1 2 3 4 5 6 7 8 9 10 11 12
## 1955 1 2 3 4 5 6 7 8 9 10 11 12
## 1956 1 2 3 4 5 6 7 8 9 10 11 12
## 1957 1 2 3 4 5 6 7 8 9 10 11 12
## 1958 1 2 3 4 5 6 7 8 9 10 11 12
## 1959 1 2 3 4 5 6 7 8 9 10 11 12
## 1960 1 2 3 4 5 6 7 8 9 10 11 12
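To see how these commands relate to one another, here is a small sketch on the same built-in AirPassengers data: deltat() is the reciprocal of frequency(), time() gives the time index in decimal years, and cycle() gives the position of each observation within its year. The variable names f and dt are our own.

```r
# AirPassengers ships with base R (datasets package)
f  <- frequency(AirPassengers)  # observations per unit of time (here, per year)
dt <- deltat(AirPassengers)     # time between observations, in years

# deltat() is the reciprocal of frequency()
all.equal(dt, 1 / f)

# time() returns the time index in decimal years
head(time(AirPassengers), 3)

# cycle() returns the position within the year (1 = January for monthly data)
head(cycle(AirPassengers), 3)
```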
2.4 Missing values
Sometimes there are missing values in time series data, denoted NA in R, and it is useful to know their locations. It is also important to know how missing values are handled by various R functions. Sometimes we may want to ignore any missingness, but other times we may wish to impute or estimate the missing values.
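The following is a minimal sketch of both options, ignoring and imputing, on a copy of AirPassengers. The positions set to NA (observations 50 to 61) and the names ap_missing, missing_idx, and ap_imputed are assumptions made for illustration; mean imputation is only one simple way to fill gaps.

```r
# Work on a copy so the built-in dataset is left untouched
ap_missing <- AirPassengers

# Hypothetical gap: set 12 consecutive observations to NA
missing_idx <- 50:61
ap_missing[missing_idx] <- NA

# Locate the missing values
which(is.na(ap_missing))

# Many functions return NA unless told to ignore missing values
mean(ap_missing)                          # NA
ap_mean <- mean(ap_missing, na.rm = TRUE) # mean of the observed values only

# Impute: replace each NA with the mean of the observed values
ap_imputed <- ap_missing
ap_imputed[is.na(ap_imputed)] <- ap_mean

# Overlay the imputed series (red) on the original (black) when run interactively
if (interactive()) {
  plot(ap_imputed, col = "red", ylab = "Passengers")
  lines(AirPassengers)
}
```

Replacing values by index keeps the object a ts, so the time index is still available for plotting.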
my_Airpassengers = data.frame(AirPassengers)
# Inspect
str(df2)
## Time-Series [1:144, 1] from 1949 to 1961: 112 118 132 129 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
plot(df2)
mean(AirPassengers , na.rm=TRUE)
## [1] 280.2986
# Generate another plot of New AirPassengers
plot(df2,
Let us now overlay the original plot and the plot with missing values imputed by the mean.
plot(df2,
The advantage of creating and working with time series objects of the ts class is that many methods are available for utilizing time series attributes, such as time index information. For example, as you’ve seen in earlier exercises, calling plot() on a ts object will automatically generate a plot over time.
data_vector = c(2.0521941073,4.2928852797,3.3294132944,3.5085950670,0.0009576938,
1.9217186345,0.7978134128,0.2999543435,0.9435687536,0.5748283388,
-0.0034005903,0.3448649176,2.2229761136,0.1763144576,2.7097622770,
1.2501948965,-0.4007164754,0.8852732121,-1.5852420266,-2.2829278891,
-2.5609531290,-3.1259963754,-2.8660295895,-1.7847009207,-1.8894912908,
-2.7255351194,-2.1033141800,-0.0174256893,-0.3613204151,-2.9008403327,
-3.2847440927,-2.8684594718,-1.9505074437,-4.8801892525,-3.2634605353,
-1.6396062522,-3.3012575840,-2.6331245433,-1.7058354022,-2.2119825061,
-0.5170595186,0.0752508095,-0.8406994716,-1.4022683487,-0.1382114230,
-1.4065954703,-2.3046941055,1.5073891432,0.7118679477,-1.1300519022)
print(data_vector)
## [1] 2.0521941073 4.2928852797 3.3294132944 3.5085950670 0.0009576938
plot(data_vector)
Let us now convert the data vector to a time series object, setting the start argument to 2004 and the frequency argument to 4, and assign the result to time_series.
# Convert data_vector to a ts object with start = 2004 and frequency = 4
time_series = ts(data_vector ,
frequency = 4 ,
start = 2004)
print(time_series)
plot(time_series)
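To make the effect of start and frequency concrete, here is a short sketch using a stand-in vector of 50 values (the numbers 1 to 50, chosen purely for illustration, as are the names x and x_ts). With quarterly data starting in the first quarter of 2004, 50 observations run through the second quarter of 2016.

```r
# Hypothetical stand-in for a numeric vector of 50 observations
x <- seq_len(50)

# Quarterly series whose first observation is 2004, quarter 1
x_ts <- ts(x, start = c(2004, 1), frequency = 4)

start(x_ts)          # first time point as (year, quarter)
end(x_ts)            # last time point as (year, quarter)
head(time(x_ts), 4)  # time index in decimal years: 2004.00 2004.25 2004.50 2004.75
```

Passing start = 2004 alone, as in the code above, is equivalent to start = c(2004, 1).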
3.2 Validating if an Object is a TS Object
As you can see, ts objects are treated differently by commands such as print() and plot(). For example, automatic use of the time index in your calls to plot() requires a ts object.
When you work to create your own datasets, you can build them as ts objects. Recall the dataset data_vector previously created, which was just a vector of numbers, and time_series, the ts object you created from data_vector using the ts() function and information regarding the start time and the observation frequency. As a reminder, both data_vector and time_series were plotted in the previous section.
# Check whether data_vector and time_series are ts objects
is.ts(data_vector)
## [1] FALSE
is.ts(time_series)
## [1] TRUE
is.ts(Nile)
## [1] TRUE
is.ts(AirPassengers)
## [1] TRUE
is.ts(EuStockMarkets)
## [1] TRUE
start(EuStockMarkets)
end(EuStockMarkets)
frequency(EuStockMarkets)
## [1] 260
Start: 130th business day of 1991; End: 169th business day of 1998; Frequency: 260 observations (business days) per year.
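Because EuStockMarkets holds several series in one ts object, it can help to inspect its shape and column names as well. The sketch below uses only base R functions on the built-in dataset; the variable name dax is our own.

```r
# EuStockMarkets is a multivariate ts object: one column per stock index
dim(EuStockMarkets)        # number of observations and number of series
colnames(EuStockMarkets)   # names of the individual indices

# Time attributes apply to every column at once
start(EuStockMarkets)
end(EuStockMarkets)
frequency(EuStockMarkets)

# Selecting a whole column keeps the time series attributes
dax <- EuStockMarkets[, "DAX"]
is.ts(dax)
```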
plot(EuStockMarkets)
ts.plot(EuStockMarkets,
col = 1:4,
xlab = "Year",
Some time series do not exhibit any clear trend over time, as seems to be the case for figures A and B.
Linear Trend
Here are examples of series with linear trends over time. On the left, you see an upward trend, and on the right, a downward trend.
Rapid Growth
Upward trends may increase more quickly than linearly. Figures A and B are two examples of rapid growth trends over time. Rapid decay is also a possibility, but it is not as common in most applications.
Periodic Trend
Variance in Trends
Time series can also exhibit trends in variability. Figures A and B both show examples of series with increasing variance over time.
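To get a feel for these trend types without the original figures, one option is to simulate a small example of each. Everything in the sketch below (the seed, the slopes, the noise levels, the period of 12, and the variable names) is an arbitrary choice for illustration and is not the data behind the figures.

```r
set.seed(1)      # make the simulated noise reproducible
t <- 1:100       # time index

# Linear trend: constant increase per period plus noise
linear_up <- ts(0.5 * t + rnorm(100, sd = 3))

# Rapid growth: exponential in time, so increases get larger and larger
rapid <- ts(exp(0.05 * t + rnorm(100, sd = 0.1)))

# Periodic trend: a pattern that repeats every 12 periods
periodic <- ts(10 * sin(2 * pi * t / 12) + rnorm(100))

# Increasing variance: noise whose standard deviation grows with time
incr_var <- ts(rnorm(100, sd = 0.1 * t))

# Draw the four series in a 2 x 2 grid when run interactively
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(linear_up, main = "Linear trend")
  plot(rapid,     main = "Rapid growth")
  plot(periodic,  main = "Periodic trend")
  plot(incr_var,  main = "Increasing variance")
  par(op)
}
```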
rapid_growth = read.csv("rapid_growth.csv")
head(rapid_growth)
## Values
## 1 505.9547
## 2 447.3556
## 3 542.5831
## 4 516.0634
## 5 506.9599
## 6 535.0162
plot(rapid_growth_ts)
# Log rapid_growth
linear_growth = log(rapid_growth_ts)
ts.plot(linear_growth)
We see that the logarithmic transformation helps stabilize our data by inducing linear growth over time.
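Why the logarithm linearizes rapid growth can be checked on an idealized series. The sketch below assumes a noise-free exponential series with a growth rate of 0.08 per period (both the series and the rate are made up); after taking logs, every period-to-period increment equals that rate.

```r
# Hypothetical noise-free exponential growth: 100 * exp(0.08 * t)
t <- 1:50
growth <- ts(100 * exp(0.08 * t))

# The log of an exponential is a straight line: log(100) + 0.08 * t
log_growth <- log(growth)

# Increments of the raw series keep getting larger ...
range(diff(growth))

# ... while increments of the logged series are all equal to the growth rate
range(diff(log_growth))
```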
4.2 Removing trends in level by differencing
The first difference transformation of a time series z[t] consists of
the differences (changes) between
successive observations over time,
that is z[t]−z[t−1].
Differencing a time series can remove a time trend. The function diff() (https://www.rdocumentation.org/packages/base/versions/3.3.1/topics/diff) will calculate the first difference or change series. A difference series lets you examine the increments or changes in a given time series. It always has one fewer observation than the original series.
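A tiny worked example may help here. The values in z and the straight-line series trend are invented for illustration; the point is that diff() returns z[t] - z[t-1], drops one observation, and turns a linear trend into a constant.

```r
# Hypothetical short series
z <- ts(c(3, 5, 8, 12, 17))

# First differences: 5-3, 8-5, 12-8, 17-12
dz <- diff(z)
dz

# The differenced series is one observation shorter
length(z)
length(dz)

# A purely linear trend becomes a constant after differencing
trend <- ts(10 + 2 * (1:20))
diff(trend)
```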
linear_trend = read.csv("level_differencing.csv")
plot(linear_trend_ts)
Let us now apply the diff() function to linear_trend_ts and store the result in another variable.
linear_trend_diff_ts = diff(linear_trend_ts)
ts.plot(linear_trend_diff_ts)
Let us now examine the lengths of the two time series.
length(linear_trend_ts)
## [1] 104
length(linear_trend_diff_ts)
## [1] 103
4.3 Removing seasonal trends with seasonal differencing
For time series exhibiting seasonal trends, seasonal differencing can be applied to remove these periodic patterns. For example, monthly data may exhibit a strong twelve-month pattern. In such situations, changes in behavior from year to year may be of more interest than changes from month to month, which may largely follow the overall seasonal pattern.
Let us create a time series with values ranging from below -10 to above +10 and a quarterly seasonality.
seasonal = read.csv("seasonal_ts.csv")
head(seasonal)
## Values
## 1 -4.198033
## 2 9.569009
## 3 5.175143
## 4 -9.691646
## 5 -3.215294
## 6 10.843669
plot(seasonal_ts)
Now let us apply the diff(..., lag = 4) function to the time series, saving the result as dx.
dx = diff(seasonal_ts , lag = 4)
# Plot dx
ts.plot(dx)
Notice how differencing allows us to remove the longer-term pattern (in this case, the seasonal volatility) and focus on the change from one period to the same period a year earlier.
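The same idea can be verified on an idealized quarterly series. In the sketch below, the quarterly pattern, the drift of 0.5 per quarter, the start year, and the names pattern, x, and dx_demo are all assumptions for illustration. With lag = 4, each value is compared with the same quarter one year earlier, so the repeating pattern cancels and only the yearly drift (4 x 0.5 = 2) remains.

```r
# Hypothetical quarterly pattern repeated for 5 years, plus a steady drift
pattern <- c(-4, 9, 5, -10)
x <- ts(rep(pattern, times = 5) + 0.5 * (1:20),
        start = c(2000, 1), frequency = 4)

# Seasonal difference: x[t] - x[t - 4]
dx_demo <- diff(x, lag = 4)
dx_demo

# A lag-4 difference drops the first four observations
length(x) - length(dx_demo)
```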