Statistics 730
Applied Time Series Analysis
Fall 2011
Professor Peter Bloomfield
email: [email protected]
http://www.stat.ncsu.edu/people/bloomfield/courses/st730/
Characteristics of Time Series
A time series is a collection of observations made at different
times on a given system.
For example:
Earnings per share of Johnson and Johnson stock (quarterly);
Global temperature anomalies from 1856 to 1997 (annual);
Investment returns on the New York Stock Exchange (daily).
Digression: Retrieving the Data Using R
jj = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/jj.dat");
jj = ts(jj, frequency = 4, start = c(1960, 1));
plot(jj);
globtemp = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/globtemp.dat");
globtemp = ts(globtemp, start = 1856);
plot(globtemp);
nyse = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/nyse.dat");
nyse = ts(nyse);
plot(nyse);
Correlation
Time series data are almost always correlated with each other: they are autocorrelated.
We may want to exploit that correlation, or merely to cope
with it.
Exploiting Correlation: Forecasting
Suppose Yt is the t-th observation, and we observe Y0, Y1, . . . , Yn−1.
What can we say about Yn?
If we know the correlation structure, or more precisely the joint distribution, of Y0, Y1, . . . , Yn−1, Yn, then we can calculate the conditional distribution of Yn | Y0, Y1, . . . , Yn−1.
The conditional mean is the best forecast of Yn, and the conditional standard deviation is the root-mean-square forecast error. If the conditional distribution is normal, we can use them to make probability statements about Yn.
Coping with Correlation: Regression
Suppose instead that Yt is related to a covariate xt, and we
are interested in the regression of Yt on xt.
Because the Ys are correlated, we should not use Ordinary Least Squares to fit the regression.
If we knew the correlation structure, we would use Generalized Least Squares.
Usually we don't know it, so we must estimate it, typically using a parsimonious parametric model.
Time Domain and Frequency Domain
Methods that focus on how a time series evolves from one
time to the next are called time domain methods.
Some graphs (e.g. residuals of global temperatures from a
quadratic trend) suggest the possibility of waves in the data:
l = lm(globtemp ~ time(globtemp) + I(time(globtemp)^2));
plot(globtemp - fitted(l));
Since a wave is described in terms of its period, or alternatively its frequency, methods that measure the waves in a
time series are called frequency domain methods.
Statistical Models
The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions for sample data. . .
We model a time series as a collection of random variables:
x1, x2, x3, . . . , or more generally {xt, t ∈ T}.
Often the phenomenon being observed evolves in continuous
time, but our observations are always discrete samples.
If the sampling times t1, t2, . . . are equally spaced, their separation Δt = tn − tn−1 is the sampling interval and 1/Δt is the sampling rate (samples per unit time).
Choice of sampling rate affects all aspects of data collection,
analysis, and interpretation.
Example: White Noise
Uncorrelated random variables wt with mean 0 and variance σw², written wt ∼ wn(0, σw²).
Why white noise?
By analogy with white light: in the frequency domain, all
frequencies are present with the same strength.
If in addition the ws are independent and identically distributed, we write wt ∼ iid(0, σw²).
Iid White Noise
# t-distributed with 3 degrees of freedom:
w = ts(rt(500, df = 3));
plot(w);
If in addition the ws are normally distributed, we write wt ∼ iid N(0, σw²).
Iid Normal White Noise
w = ts(rnorm(500));
plot(w);
Example: Moving Average
Many observed series are smoother than white noise.
Possible model:
vt = (1/3)(wt−1 + wt + wt+1)
Moving Average
w = ts(rnorm(500));
v = filter(w, sides = 2, rep(1, 3) / 3);
plot(v);
Averaging attenuates the faster oscillations, leaving the slower
oscillations more apparent.
More generally, a weighted average of 2, 3, or more noise
terms.
Example: Autoregression
Recursive model:
xt = xt−1 − 0.9xt−2 + wt,   t = 1, 2, . . . , 500
Like a regression equation, but the RHS contains past (lagged)
LHS variables, hence autoregression.
Shows many different types of behavior for different choices
of coefficients.
Autoregression
w = ts(rnorm(500));
v = filter(w, filter = c(1, -0.9), method = "recursive");
plot(v);
Example: Random Walk
One model for trend; recursive definition:
xt = δ + xt−1 + wt
Explicitly:
xt = δt + Σ_{j=1}^{t} wj
δ is the drift (per unit time).
Random Walk
# drift delta = 0.2 per sample:
x = ts(cumsum(rnorm(500) + 0.2));
plot(x);
The white noise we build it from could be non-normal.
Non-Normal Random Walk
# t-distributed increments, 1 degree of freedom, no drift:
x = ts(cumsum(rt(500, df = 1)));
plot(x);
Example: Signal in Noise
Sine-wave signal:
xt = 2 cos(2πt/50 + 0.6π) + wt,   t = 1, 2, . . . , 500
More generally, the wave term could be
A cos(2πωt + φ),
where:
A is amplitude;
ω is frequency (in cycles per unit time);
φ is phase (in this case, in radians).
Cosine wave signal plus noise
w = ts(rnorm(500));
x = 2 * cos(2 * pi * time(w) / 50 + 0.6 * pi) + w;
plot(x);
Means
Recall: We model a time series as a collection of random variables: x1, x2, x3, . . . , or more generally {xt, t ∈ T}.
The mean function is
μx,t = E(xt) = ∫ x ft(x) dx
where the expectation is for the given t, across all the possible values of xt. Here ft(·) is the pdf of xt.
Example: Moving Average
wt is white noise, with E (wt) = 0 for all t
the moving average is
vt = (1/3)(wt−1 + wt + wt+1)
so
μv,t = E(vt) = (1/3)[E(wt−1) + E(wt) + E(wt+1)] = 0.
Moving Average Model with Mean Function
Example: Random Walk with Drift
The random walk with drift is
xt = δt + Σ_{j=1}^{t} wj
so
μx,t = E(xt) = δt + Σ_{j=1}^{t} E(wj) = δt,
a straight line with slope δ.
Random Walk Model with Mean Function
Example: Signal Plus Noise
The signal plus noise model is
xt = 2 cos(2πt/50 + 0.6π) + wt
so
μx,t = E(xt) = 2 cos(2πt/50 + 0.6π) + E(wt) = 2 cos(2πt/50 + 0.6π),
the (cosine wave) signal.
Signal-Plus-Noise Model with Mean Function
Covariances
The autocovariance function is, for all s and t,
γx(s, t) = E[(xs − μx,s)(xt − μx,t)]
Symmetry: γx(s, t) = γx(t, s).
Smoothness:
if a series is smooth, nearby values will be very similar,
hence the autocovariance will be large;
conversely, for a choppy series, even nearby values may
be nearly uncorrelated.
Example: White Noise
If wt is white noise wn(0, σw²), then
γw(s, t) = E(ws wt) = σw² if s = t, and 0 if s ≠ t:
definitely choppy!
Autocovariances of White Noise
Example: Moving Average
The moving average is
vt = (1/3)(wt−1 + wt + wt+1)
and E(vt) = 0, so
γv(s, t) = E(vs vt) = (1/9) E[(ws−1 + ws + ws+1)(wt−1 + wt + wt+1)]
= (3/9)σw² if s = t,
  (2/9)σw² if |s − t| = 1,
  (1/9)σw² if |s − t| = 2,
  0 otherwise.
Autocovariances of Moving Average
Example: Random Walk
The random walk with zero drift is
xt = Σ_{j=1}^{t} wj
and E(xt) = 0, so
γx(s, t) = E(xs xt) = E[(Σ_{j=1}^{s} wj)(Σ_{k=1}^{t} wk)] = min{s, t} σw².
Autocovariances of Random Walk
Notes:
For the first two models, γx(s, t) depends on s and t only through |s − t|, but for the random walk γx(s, t) depends on s and t separately.
For the first two models, the variance γx(t, t) is constant, but for the random walk γx(t, t) = tσw² increases indefinitely as t increases.
Correlations
The autocorrelation function (ACF) is
ρ(s, t) = γ(s, t) / √(γ(s, s)γ(t, t))
It measures the linear predictability of xt given only xs.
Like any correlation, −1 ≤ ρ(s, t) ≤ 1.
Across Series
For a pair of time series xt and yt, the cross-covariance function is
γx,y(s, t) = E[(xs − μx,s)(yt − μy,t)]
The cross-correlation function (CCF) is
ρx,y(s, t) = γx,y(s, t) / √(γx(s, s)γy(t, t))
Stationary Time Series
Basic idea: the statistical properties of the observations do
not change over time.
Two specific forms: strong (or strict) stationarity and weak
stationarity.
A time series xt is strongly stationary if the joint distribution of every collection of values {xt1, xt2, . . . , xtk} is the same as that of the time-shifted values {xt1+h, xt2+h, . . . , xtk+h}, for every dimension k and shift h.
Strong stationarity is hard to verify.
If {xt} is strongly stationary, then for instance:
k = 1: the distribution of xt is the same as that of xt+h, for
any h;
in particular, if we take h = −t, the distribution of xt is the same as that of x0;
that is, every xt has the same distribution;
k = 2: the joint (bivariate) distribution of (xs, xt) is the same
as that of (xs+h, xt+h), for any h;
in particular, if we take h = −t, the joint distribution of (xs, xt) is the same as that of (xs−t, x0);
that is, the joint distribution of (xs, xt) depends on s and t only through s − t;
and so on...
A time series xt is weakly stationary if:
the mean function μt is constant; that is, every xt has the same mean;
the autocovariance function γ(s, t) depends on s and t only through their difference |s − t|.
Weak stationarity depends only on the first and second moment functions, so is also called second-order stationarity.
Strongly stationary (plus finite variance) implies weakly stationary.
Weakly stationary does not imply strongly stationary (unless some other property implies it, like normality of all joint distributions).
Simplifications
If xt is weakly stationary, cov(xt+h, xt) depends on h but not on t, so we write the autocovariances as
γ(h) = cov(xt+h, xt)
Similarly corr(xt+h, xt) depends only on h, and can be written
ρ(h) = γ(t + h, t) / √(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0).
Examples
White noise is weakly stationary.
A moving average is weakly stationary.
A random walk is not weakly stationary.
Estimating Means and Covariances
In other statistical applications, means, variances, and covariances are estimated by averaging across samples.
In time series, we often have only one realization.
Stationarity allows us to estimate moments anyway.
Mean
If xt is stationary, μt = E(xt) ≡ μ, so we can estimate μ by the sample mean
x̄ = (1/n) Σ_{t=1}^{n} xt.
We could also use a weighted mean
Σ_{t=1}^{n} wt xt, where Σ_{t=1}^{n} wt = 1.
Both are unbiased; usually some weighted mean has smaller variance than x̄, but not much smaller.
Autocovariance
Similarly, if xt is stationary, γ(t + h, t) = cov(xt+h, xt) ≡ γ(h), so we can estimate γ(h) by
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (xt+h − x̄)(xt − x̄)
for h = 0, 1, . . . , n − 1, with γ̂(−h) = γ̂(h).
We estimate the autocorrelation function (ACF) by
ρ̂(h) = γ̂(h)/γ̂(0).
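As a quick numerical check of these definitions, here is a minimal R sketch (names are illustrative, not from the notes) that computes γ̂(h) and ρ̂(h) directly and compares them with acf(), which uses the same denominator n and centering at the sample mean:
# compute gamma-hat and rho-hat by hand and compare with acf()
set.seed(1)
x <- arima.sim(model = list(ar = 0.6), n = 200)
n <- length(x); xbar <- mean(x)
gamma.hat <- sapply(0:20, function(h)
  sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar)) / n)
rho.hat <- gamma.hat / gamma.hat[1]
max(abs(rho.hat - acf(x, lag.max = 20, plot = FALSE)$acf))  # essentially zero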
Sampling Properties
x̄ is unbiased for μ.
γ̂(h) is not unbiased for γ(h), but
(1/(n − h)) Σ_{t=1}^{n−h} (xt+h − μ)(xt − μ)
would be. Note:
(n − h) denominator instead of n;
centering at μ instead of x̄.
Non-negative Definiteness
The covariance matrix of (x1, x2, . . . , xk)' is
Γk, the k × k matrix with (i, j) entry γ(i − j): γ(0) on the diagonal, γ(1) on the first off-diagonals, and so on out to γ(k − 1) in the corners.
As a covariance matrix, Γk is non-negative definite:
a'Γk a = var(a1x1 + a2x2 + · · · + akxk) ≥ 0
for any vector of constants a = (a1, a2, . . . , ak)'.
With the above definition of γ̂(h), Γ̂k is also non-negative definite; that would not be true if we divided by (n − h).
Another Sampling Property
If xt is white noise, n is large, and some mild conditions hold, ρ̂(h) is approximately normal with zero mean and standard deviation
σρ̂(h) = 1/√n.
So we can look for sample autocorrelations outside ±2/√n as evidence of autocorrelation.
R Examples
White noise:
acf(ts(rnorm(100)));
Southern Oscillation Index and fish recruitment:
soi = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/soi.dat");
soi = ts(soi, start = 1950, frequency = 12);
recruit = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/recruit.dat");
recruit = ts(recruit, start = 1950, frequency = 12);
acf(soi, 50);
acf(recruit, 50);
ccf(soi, recruit, 50);
# Negative lags indicate SOI leads recruitment.
Interpreting the Cross-Correlation
help(ccf) states: The lag k value returned by ccf(x,y)
estimates the correlation between x[t+k] and y[t].
So the graph shows negative correlation between SOI at times t − 5 to t − 9 months and recruitment at time t.
That is, current recruitment is (negatively) correlated with SOI from 5 to 9 months ago.
SAS Example
Southern Oscillation Index and fish recruitment:
options pagesize = 80;
data soi;
infile 'soi.dat';
input soi;
run;
data recruit;
infile 'recruit.dat';
input recruit;
run;
data both;
time +1;
merge soi recruit;
run;
proc gplot data = both;
symbol i = join;
plot (soi recruit) * time;
run;
proc arima data = both;
title 'SOI and recruitment';
identify var = soi nlag = 50;
identify var = recruit crosscorr = soi nlag = 50;
/* Positive lags indicate SOI leads recruitment. */
run;
SAS program and output.
Seasonality in the SOI
The ACF of the SOI suggests that xt has a correlation of
around 0.4 with xt+12, xt+24, and so on.
This correlation is caused by the fact that those values
all fall in the same month of the year, and different months
have different means.
That is, this series has a non-constant mean function μt.
Since it is non-stationary in the mean, the sample ACF does
not estimate the population ACF, and the graph has no
meaning.
We can estimate t and subtract it, to give a series with zero
mean.
The simplest way is to subtract the mean for a given month
of the year from all data for that month.
In R (in SAS, use corresponding proc glm):
soiSA = residuals(lm(soi ~ factor(cycle(soi))));
# transfer the time series structure of soi to soiSA:
soiSA = ts(soiSA, start = start(soi), frequency = frequency(soi));
acf(soiSA, lag = 50);
The ACF graph now shows correlation dropping progressively
from around 0.5 at a one month lag to zero at one year.
The CCF of soiSA and recruit shows correspondingly simpler
structure.
Frequency-domain methods will show that the recruitment
series also has some seasonality, but with much weaker effects.
Replacing recruit with a corresponding recruitSA makes negligible changes to the ACF and CCF.
Frequency-domain methods will also show that the seasonal
effects in SOI consist largely of an annual sine wave.
Instead of estimating 12 separate monthly means, we can fit, and remove, a three-parameter model
μt = β0 + β1 cos(2πt/12) + β2 sin(2πt/12).
In R:
soiCS = residuals(lm(soi ~ cos(2 * pi * time(soi)) +
sin(2 * pi * time(soi))));
soiCS = ts(soiCS, start = start(soi), frequency = frequency(soi));
acf(soiCS, lag = 50);
Vector-Valued Series: Notation
Studies of time series data often involve p > 1 series.
E.g. Southern Oscillation Index and recruitment in a fish
population (p = 2).
Treated as a p × 1 column vector:
xt = (xt,1, xt,2, . . . , xt,p)'
Mean Vector
Assume jointly weakly stationary.
mean vector:
μ = E(xt) = (E(xt,1), E(xt,2), . . . , E(xt,p))' = (μ1, μ2, . . . , μp)'
Autocovariance Matrix
Autocovariance matrix contains individual autocovariances
on the diagonal and cross-covariances off the diagonal:
Γ(h) = E[(xt+h − μ)(xt − μ)'],
the p × p matrix with (i, j) entry γi,j(h).
Sample mean and autocovariances
sample mean:
x̄ = (1/n) Σ_{t=1}^{n} xt
sample autocovariance:
Γ̂(h) = (1/n) Σ_{t=1}^{n−h} (xt+h − x̄)(xt − x̄)'
for h ≥ 0, and Γ̂(−h) = Γ̂(h)'.
Multidimensional Series (Spatial Statistics)
Some studies involve data indexed by more than one variable.
E.g. soil surface temperatures in a field
Notation: xs is the observed value at location s (s for spatial).
Soil temperatures
Autocovariance and Variogram
Stationary: E(xs) and cov(xs+h, xs) do not depend on s.
For a stationary process, the autocovariance function is
γ(h) = cov(xs+h, xs) = E[(xs+h − μ)(xs − μ)]
Intrinsic: E(xs+h − xs) and var(xs+h − xs) do not depend on s.
For an intrinsic process, the (semi-)variogram is
Vx(h) = (1/2) var(xs+h − xs)
A stationary process is intrinsic (see Problem 1.26), but an
intrinsic process is not necessarily stationary.
In one dimension, the random walk is intrinsic but not stationary.
When stationary, Vx(h) = γ(0) − γ(h).
Isotropic: an intrinsic process is isotropic if the variogram is
a function only of |h|, the Euclidean distance between s + h
and s.
Time Series Regression
A regression model relates a response xt to inputs zt,1, zt,2, . . . , zt,q:
xt = β1 zt,1 + β2 zt,2 + · · · + βq zt,q + error.
Time domain modeling: the inputs often include lagged values of the same series, xt−1, xt−2, . . . , xt−p.
Frequency domain modeling: the inputs include sine and cosine functions.
Fitting a Trend
> g1900 = window(globtemp, start = 1900)
> plot(g1900)
possible model:
xt = β1 + β2 t + wt,
where the error (noise) wt is white noise (unlikely!).
fit using ordinary least squares (OLS):
> lmg1900 = lm(g1900 ~ time(g1900)); summary(lmg1900)
Call:
lm(formula = g1900 ~ time(g1900))
Residuals:
     Min       1Q   Median       3Q      Max
-0.30352 -0.09671  0.01132  0.08289  0.33519
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.219e+01  9.032e-01  -13.49   <2e-16 ***
time(g1900)  6.209e-03  4.635e-04   13.40   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1298 on 96 degrees of freedom
Multiple R-Squared: 0.6515,     Adjusted R-squared: 0.6479
F-statistic: 179.5 on 1 and 96 DF, p-value: < 2.2e-16
> plot(g1900)
> abline(reg = lmg1900)
Using PROC ARIMA
Program
data globtemp;
infile 'globtemp.dat';
n + 1;
input globtemp;
year = 1855 + n;
run;
proc arima data = globtemp;
where year >= 1900;
identify var = globtemp crosscorr = year;
/* The ESTIMATE statement fits a model to the
   variable in the most recent IDENTIFY statement */
estimate input = year;
run;
and output.
Regression Review
the regression model:
xt = β1 zt,1 + β2 zt,2 + · · · + βq zt,q + wt = zt'β + wt.
fit by minimizing the residual sum of squares
RSS(β) = Σ_{t=1}^{n} (xt − zt'β)²
find the minimum by solving the normal equations
(Σ_{t=1}^{n} zt zt') β̂ = Σ_{t=1}^{n} zt xt.
Matrix Formulation
factor matrix Znq = (z1, z2, . . . , zn) , response vector xn1 =
(x1, x2, . . . , xn)
= Z x with solution
= (Z Z)1Z x
normal equations (Z Z)
minimized RSS
= x Z
RSS
x Z
Zx
=xx
= x x x Z(Z Z)1Z x
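A small R sketch of the same algebra (illustrative data, not from the notes): the normal-equation solution reproduces lm()'s coefficients.
# OLS via the normal equations, compared with lm()
set.seed(2)
n <- 100
Z <- cbind(1, 1:n)                   # intercept and linear trend
x <- Z %*% c(0.5, 0.02) + rnorm(n)   # response with white-noise errors
beta.hat <- solve(t(Z) %*% Z, t(Z) %*% x)
RSS <- sum((x - Z %*% beta.hat)^2)
cbind(beta.hat, coef(lm(x ~ Z - 1))) # same estimates, two ways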
Distributions
If the (white noise) errors are normally distributed (wt ∼ iid N(0, σw²)), then β̂ is multivariate normal, and the usual t- and F-statistics have the corresponding distributions.
If the errors are not normally distributed, but still iid, the
same is approximately true.
If the errors are not white noise, none of that is true.
Choosing a Regression Model
We want a model that fits well without using too many parameters.
Two estimates of the noise variance:
unbiased: sw² = RSS/(n − q);
maximum likelihood: σ̂w² = RSS/n.
We want small σ̂w² but also small q.
Information Criteria (smaller is better)
Akaike's Information Criterion (with k variables in the model):
AIC = ln σ̂k² + (n + 2k)/n
bias-corrected Akaike's Information Criterion:
AICc = ln σ̂k² + (n + k)/(n − k − 2)
Schwarz's (Bayesian) Information Criterion:
SIC = ln σ̂k² + k ln(n)/n
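A hedged R sketch computing the three criteria in the per-observation form above for a small trend regression (variable names and data are illustrative):
# information criteria in the per-observation form, for a trend regression
set.seed(3)
n <- 120
t.idx <- 1:n
x <- 0.01 * t.idx + rnorm(n)
fit <- lm(x ~ t.idx)
k <- length(coef(fit))                 # number of regression parameters
sigma2 <- sum(residuals(fit)^2) / n    # maximum likelihood variance estimate
AIC.k  <- log(sigma2) + (n + 2 * k) / n
AICc.k <- log(sigma2) + (n + k) / (n - k - 2)
SIC.k  <- log(sigma2) + k * log(n) / n
c(AIC = AIC.k, AICc = AICc.k, SIC = SIC.k)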
Notes
More commonly (e.g. in SAS output and in R's AIC function),
these are all multiplied by n.
AIC, AICc, and SIC (also known as SBC and BIC) can be
generalized to other problems where likelihood methods are
used.
If n is large and the true k is small, minimizing BIC picks k
well, but minimizing AIC tends to over-estimate it.
If the true k is large (or infinite), minimizing AIC picks a value
that gives good predictions by trading off bias vs variance.
Exploratory Data Analysis (or Searching for Stationarity)
When an observed time series appears stationary, we can
calculate its sample autocorrelations, and use them to decide
on a model.
Many time series do not appear stationary; e.g., Johnson and
Johnson earnings, global temperature.
Often we can find a way to relate one series to a different
series, for which stationarity is more plausible.
Trends and Detrending
Some series can be modeled as
xt = μt + yt,
where yt is stationary.
If μt is a parametric form, we can estimate it and subtract it. That is, we use the residuals from a fitted trend.
The form of trend might be linear, or higher degree polynomial, or some other function suggested by theory.
Example: 20th Century Global Temperature
lmg1900 = lm(g1900 ~ time(g1900));
plot(ts(residuals(lmg1900), start = 1900));
Differencing
Some series still appear nonstationary after detrending.
E.g. the trend μt is a random walk with drift:
μt = δt + Σ_{j=1}^{t} wj
Here E(xt) = δt, but
xt − E(xt) = Σ_{j=1}^{t} wj + yt
with a variance that grows with time.
But now the first differences
∇xt = xt − xt−1 = δ + wt + yt − yt−1
are stationary.
Define the backshift operator B by Bxt = xt−1.
Then ∇xt = (1 − B)xt.
Also second differences
∇²xt = (1 − B)²xt = xt − 2xt−1 + xt−2,
etc. Easy for any positive integer d; possible for fractional d.
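In R, diff() implements these differences; a quick sketch (illustrative series) checking the identities:
# first and second differences via diff(), checked against the explicit forms
set.seed(4)
x <- cumsum(rnorm(50))            # a random walk
d1 <- diff(x)                     # (1 - B) x
d2 <- diff(x, differences = 2)    # (1 - B)^2 x
all.equal(d1, x[-1] - x[-length(x)])
all.equal(d2, x[3:50] - 2 * x[2:49] + x[1:48])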
Example: 20th Century Global Temperature
plot(diff(g1900));
Both detrending and differencing give apparently stationary
results.
acf(diff(g1900));
Differencing has removed almost all auto-correlation.
acf(residuals(lmg1900))
Removing the trend without differencing leaves more autocorrelation.
Transformation (Re-expression)
Some series need to be re-expressed.
Most commonly logarithms, sometimes square roots (especially with counted data).
Often re-expression improves stationarity, and other desirable
features such as symmetry of distribution.
E.g. Glacial varve thicknesses, Johnson and Johnson earnings.
Periodic Signals
If a series is plausibly modeled as a cosine wave plus noise, we can fit
xt = A cos(2πωt + φ) + wt = (A cos φ) cos(2πωt) − (A sin φ) sin(2πωt) + wt
by least squares.
If ω is known (e.g., ω = 1/12 for an annual cycle in monthly data), this is a linear regression:
xt = β1 cos(2πωt) + β2 sin(2πωt) + wt
If ω is of the form j/n for integer j (n = series length), then
β̂1 = (2/n) Σ_{t=1}^{n} xt cos(2πtj/n),
β̂2 = (2/n) Σ_{t=1}^{n} xt sin(2πtj/n).
For other ω, use standard linear least squares regression.
If ω is unknown, either:
try all ωs of the form j/n, plotting β̂1(j/n)² + β̂2(j/n)² against j/n (the periodogram);
or use non-linear least squares for other ω.
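A short R check of the closed-form coefficients at a Fourier frequency ω = j/n (a sketch; the series and j are made up). At such a frequency the cosine and sine regressors are orthogonal, so the regression and the sums agree exactly:
# at a Fourier frequency j/n, the regression coefficients have closed forms
set.seed(5)
n <- 144; j <- 12; t <- 1:n
x <- 2 * cos(2 * pi * t * j / n + 0.6 * pi) + rnorm(n)
c1 <- cos(2 * pi * t * j / n); s1 <- sin(2 * pi * t * j / n)
coef(lm(x ~ 0 + c1 + s1))                     # least squares fit
c(2 * sum(x * c1) / n, 2 * sum(x * s1) / n)   # (2/n) sums: same values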
# detrend global temperature using a quadratic fit
gtres = residuals(lm(globtemp ~ time(globtemp) + I(time(globtemp)^2)));
gtres = ts(gtres, start = start(globtemp));
par(mfcol = c(2, 1));
plot(gtres);
# use spectrum() to plot the periodogram of detrended global temperature
spectrum(gtres, log = "no");
Smoothing a Time Series
Smoothing a time series makes long-term behavior (low frequencies) more apparent. E.g. global temperature, Johnson
and Johnson earnings.
Many types of smoother:
moving averages;
kernel smoothers;
lowess, supsmu, etc.;
smoothing splines.
# Trailing yearly average J&J earnings
plot(jj)
lines(filter(jj, rep(1, 4)/4, sides = 1), col = "red")
title("Trailing 4-quarter averages")
# smooth global temperatures over a 30 year window
# (note half weight on end values)
plot(globtemp)
lines(filter(globtemp, c(.5, rep(1, 29), .5)/30), col = "red")
title("Centered 30 year averages")
Smoothing a Scatter Plot
Smoothing a scatter plot can also reveal behavior.
E.g. daily NYSE returns plotted against previous day.
# scatter plot of NYSE return against previous day,
# with lowess smooth
plot(nyse[-length(nyse)], nyse[-1], xlim = c(-0.02, 0.02),
ylim = c(-0.02, 0.02))
lines(lowess(nyse[-length(nyse)], nyse[-1], f = 1/5), col = "red")
title("NYSE daily return against previous day")
Time Domain Models
Box & Jenkins popularized an approach to time series analysis
based on
Auto-Regressive
Integrated
Moving Average
(ARIMA) models.
Autoregressive Models
Autoregressive model of order p (AR(p)):
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + wt,
where:
xt is stationary with mean 0;
φ1, φ2, . . . , φp are constants with φp ≠ 0;
wt is uncorrelated with xt−j, j = 1, 2, . . .
To model a series with non-zero mean μ:
(xt − μ) = φ1(xt−1 − μ) + φ2(xt−2 − μ) + · · · + φp(xt−p − μ) + wt,
or
xt = α + φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + wt,
where
α = μ(1 − φ1 − φ2 − · · · − φp).
Note that the intercept α is not μ.
Note also that
wt = xt − (φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p)
and is therefore also stationary.
Furthermore, for k > 0,
wt−k = xt−k − (φ1 xt−k−1 + φ2 xt−k−2 + · · · + φp xt−k−p)
and wt is uncorrelated with all terms on the right hand side.
So wt is uncorrelated with wt−k.
That is, {wt} is white noise.
The Autoregressive Operator
Use the backshift operator:
xt = φ1 B xt + φ2 B² xt + · · · + φp B^p xt + wt,
or
(1 − φ1 B − φ2 B² − · · · − φp B^p) xt = wt.
The autoregressive operator is
φ(B) = 1 − φ1 B − φ2 B² − · · · − φp B^p.
In operator form, the model equation is φ(B)xt = wt.
Example: AR(1)
For the first-order model:
xt = φ xt−1 + wt.
Also
xt−1 = φ xt−2 + wt−1
so
xt = φ(φ xt−2 + wt−1) + wt = φ² xt−2 + φ wt−1 + wt.
Now use
xt−2 = φ xt−3 + wt−2
so
xt = φ²(φ xt−3 + wt−2) + φ wt−1 + wt = φ³ xt−3 + φ² wt−2 + φ wt−1 + wt.
Continuing:
xt = φ^k xt−k + Σ_{j=0}^{k−1} φ^j wt−j.
We have shown:
xt = φ^k xt−k + Σ_{j=0}^{k−1} φ^j wt−j.
Since xt is stationary, if |φ| < 1 then φ^k xt−k → 0 as k → ∞, so
xt = Σ_{j=0}^{∞} φ^j wt−j,
an infinite moving average, or linear process.
Moments
Mean: E(xt) = 0.
Autocovariances: for h ≥ 0,
γ(h) = cov(xt+h, xt) = E[(Σ_j φ^j wt+h−j)(Σ_k φ^k wt−k)] = σw² φ^h / (1 − φ²).
Autocorrelations: for h ≥ 0,
ρ(h) = γ(h)/γ(0) = φ^h.
Note that
ρ(h) = φ ρ(h − 1),   h = 1, 2, . . .
Compare with the original equation
xt = φ xt−1 + wt.
Simulations
plot(arima.sim(model = list(ar = .9), 100))
Causality
What if |φ| > 1? Rewrite
xt = φ xt−1 + wt
as
xt = φ⁻¹ xt+1 − φ⁻¹ wt+1.
Now
xt = −Σ_{j=1}^{∞} φ^{−j} wt+j,
a sum of future noise terms. This process is said to be not causal. If |φ| < 1 the process is causal.
The Autoregressive Operator Again
Compare the original equation:
xt = φ xt−1 + wt, i.e. (1 − φB)xt = wt, i.e. xt = (1 − φB)⁻¹ wt,
with the (infinite) moving average representation:
xt = Σ_{j=0}^{∞} φ^j wt−j = (Σ_{j=0}^{∞} φ^j B^j) wt.
So
(1 − φB)⁻¹ = Σ_{j=0}^{∞} φ^j B^j.
Compare with
(1 − φz)⁻¹ = 1/(1 − φz) = Σ_{j=0}^{∞} φ^j z^j,
valid for |z| < 1 (because |φ| < 1).
We can manipulate expressions in B as if it were a complex number z with |z| < 1.
Stationary versus Transient
E.g. AR(1):
Stationary version, when |φ| < 1:
xt = Σ_{j=0}^{∞} φ^j wt−j
But suppose we want to simulate, using
xt = φ xt−1 + wt,   t = 1, 2, . . .
What about x0?
One possibility: let x0 = 0.
Then x1 = w1, x2 = w2 + φw1, and generally
xt = Σ_{j=0}^{t−1} φ^j wt−j.
This means that
var(xt) = σw² (1 + φ² + φ⁴ + · · · + φ^{2(t−1)}).
var(xt) depends on t, so this version is not stationary.
But, if |φ| < 1, then for large t,
var(xt) ≈ σw² (1 + φ² + φ⁴ + · · ·) = σw² / (1 − φ²).
Also, under the same conditions (more work!),
cov(xt+h, xt) ≈ σw² φ^{|h|} / (1 − φ²).
This version is called asymptotically stationary.
The non-stationarity is only for small t, and is called transient. Simulations use a burn-in or spin-up period: discard the first few simulated values.
But note: in the stationary version, x0 ∼ N(0, σw²/(1 − φ²)).
If we simulate x0 from this distribution, and for t > 0 use
xt = φ xt−1 + wt,   t = 1, 2, . . . ,
then the result is exactly stationary.
That is, we can use a simulation with no spin-up.
This is harder for AR(p) when p > 1, so most simulators use a spin-up period.
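A sketch in R of the two strategies for AR(1) (parameter values and names are illustrative):
# AR(1) simulation: exact stationary start versus a burn-in period
set.seed(6)
phi <- 0.9; sigw <- 1; n <- 500
# exact: draw x0 from the stationary distribution N(0, sigw^2 / (1 - phi^2))
x <- numeric(n)
x0 <- rnorm(1, 0, sigw / sqrt(1 - phi^2))
for (t in 1:n) {
  x[t] <- phi * (if (t == 1) x0 else x[t - 1]) + rnorm(1, 0, sigw)
}
x.exact <- ts(x)
# burn-in: start at 0, discard the first 100 values
y <- filter(rnorm(n + 100, 0, sigw), filter = phi, method = "recursive")
x.burn <- ts(y[-(1:100)])
# arima.sim() uses a burn-in (its n.start argument) by default
x.sim <- arima.sim(model = list(ar = phi), n = n)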
Moving Average Model
Moving average model of order q (MA(q)):
xt = wt + θ1 wt−1 + θ2 wt−2 + · · · + θq wt−q
where:
θ1, θ2, . . . , θq are constants with θq ≠ 0;
wt is Gaussian white noise wn(0, σw²).
Note that wt is uncorrelated with xt−j, j = 1, 2, . . .
In operator form:
xt = θ(B)wt,
where the moving average operator θ(B) is
θ(B) = 1 + θ1 B + θ2 B² + · · · + θq B^q.
Compare with the autoregressive model φ(B)xt = wt.
The moving average process is stationary for any values of θ1, θ2, . . . , θq.
Moments
Mean: E(xt) = 0.
Autocovariances:
γ(h) = cov(xt+h, xt) = E[(Σ_j θj wt+h−j)(Σ_k θk wt−k)] = σw² Σ_k θk θk+h
= 0 if h > q.
The MA(q) model is characterized by
γ(q) = σw² θq ≠ 0 and γ(h) = 0 for h > q.
The contrast between the ACF of
a moving average model, which is zero except for a finite number of lags h, and
an autoregressive model, which goes to zero geometrically,
makes the sample ACF an important tool in deciding what model to fit.
Inversion
Example: MA(1)
xt = wt + θ wt−1 = (1 + θB)wt,
so if |θ| < 1,
wt = (1 + θB)⁻¹ xt = π(B)xt,
where
π(B) = Σ_{j=0}^{∞} (−θ)^j B^j.
So xt satisfies an infinite autoregression:
xt = −Σ_{j=1}^{∞} (−θ)^j xt−j + wt.
Autoregressive Moving Average Models
Combine! ARMA(p, q):
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p
   + wt + θ1 wt−1 + θ2 wt−2 + · · · + θq wt−q.
In operator form:
φ(B)xt = θ(B)wt.
Issues in ARMA Models
Parameter redundancy: if φ(z) and θ(z) have any common factors, they can be canceled out, so the model is the same as one with lower orders. We assume no redundancy.
Causality: if φ(z) ≠ 0 for |z| ≤ 1, xt can be written in terms of present and past ws. We assume causality.
Invertibility: if θ(z) ≠ 0 for |z| ≤ 1, wt can be written in terms of present and past xs, and xt can be written as an infinite autoregression. We assume invertibility.
Using proc arima
Example: fit an MA(1) model to the differences of the log
varve thicknesses.
options linesize = 80;
ods html file = '../varve1.html';
data varve;
infile '../data/varve.dat';
input varve;
lv = log(varve);
dlv = dif(lv);
run;
proc arima data = varve;
title 'Fit an MA(1) model to differences of log varve';
identify var = dlv;
estimate q = 1;
run;
proc arima output
Using some proc arima options
Example: fit an IMA(1, 1) model to the log varve thicknesses.
options linesize = 80;
ods html file = 'varve2.html';
data varve;
infile 'varve.dat';
input varve;
lv = log(varve);
run;
proc arima data = varve;
title 'Fit an IMA(1, 1) model to log varve, using ML';
title2 'Use minic option to identify a good model';
identify var = lv(1) minic;
estimate q = 1 method = ml;
estimate q = 2 method = ml;
estimate p = 1 q = 1 method = ml;
run;
proc arima output
Notes on the proc arima output
For the MA(1) model, the Autocorrelation Check of Residuals rejects the null hypothesis that the residuals are white
noise.
If the series really had MA(1) structure, the residuals
would be white noise.
So the MA(1) model is not a good fit for this series.
For both the MA(2) and the ARMA(1, 1) models, the Chi-Square statistics are not significant, so these models both
seem satisfactory. ARMA(1, 1) has the better AIC and SBC.
Using R
Fit a given model and test the residuals as white noise:
varve.ma1 = arima(diff(log(varve)),
order = c(p = 0, d = 0, q = 1));
varve.ma1;
Box.test(residuals(varve.ma1), lag = 6,
type = "Ljung", fitdf = 1);
Note: the fitdf argument indicates that these are residuals
from a fit with a single parameter.
As in proc arima, differencing can be carried out within arima():
varve.ima1 = arima(log(varve), order = c(0, 1, 1));
varve.ima1;
Box.test(residuals(varve.ima1), 6, "Ljung", 1);
But note that you cannot include the intercept, so the results
are not identical.
Rerun the original analysis with no intercept:
arima(diff(log(varve)), order = c(0, 0, 1),
include.mean = FALSE);
Make a table of AICs:
AICtable = matrix(NA, 5, 5);
dimnames(AICtable) =
list(paste("p =", 0:4), paste("q =", 0:4));
for (p in 0:4) {
for (q in 0:4) {
varve.arma = arima(diff(log(varve)), order = c(p, 0, q));
AICtable[p+1, q+1] = AIC(varve.arma);
}
}
AICtable;
Note: proc arima's MINIC option tabulates (an approximation
to) BIC, not AIC.
Make a table of BICs:
BICtable = matrix(NA, 5, 5);
dimnames(BICtable) =
list(paste("p =", 0:4), paste("q =", 0:4));
for (p in 0:4) {
for (q in 0:4) {
varve.arma = arima(diff(log(varve)), order = c(p, 0, q));
BICtable[p+1, q+1] =
AIC(varve.arma, k = log(length(varve) - 1));
}
}
BICtable;
Both tables suggest ARMA(1, 1).
ARMA Autocorrelation Functions
For a moving average process, MA(q):
xt = wt + θ1 wt−1 + θ2 wt−2 + · · · + θq wt−q.
So (with θ0 = 1)
γ(h) = cov(xt+h, xt) = E[(Σ_{j=0}^{q} θj wt+h−j)(Σ_{k=0}^{q} θk wt−k)]
= σw² Σ_{j=0}^{q−h} θj θj+h for 0 ≤ h ≤ q, and 0 for h > q.
So the ACF is
ρ(h) = (Σ_{j=0}^{q−h} θj θj+h) / (Σ_{j=0}^{q} θj²) for 0 ≤ h ≤ q, and 0 for h > q.
Notes:
In these expressions, θ0 = 1 for convenience.
ρ(q) ≠ 0 but ρ(h) = 0 for h > q. This characterizes MA(q).
For an autoregressive process, AR(p):
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + wt.
So
γ(h) = cov(xt+h, xt) = E[(Σ_{j=1}^{p} φj xt+h−j + wt+h) xt] = Σ_{j=1}^{p} φj γ(h − j) + cov(wt+h, xt).
Because xt is causal, xt is wt plus a linear combination of wt−1, wt−2, . . .
So
cov(wt+h, xt) = σw² if h = 0, and 0 if h > 0.
Hence
γ(h) = Σ_{j=1}^{p} φj γ(h − j),   h > 0,
and
γ(0) = Σ_{j=1}^{p} φj γ(j) + σw².
If we know the parameters φ1, φ2, . . . , φp and σw², these equations for h = 0 and h = 1, 2, . . . , p form p + 1 linear equations in the p + 1 unknowns γ(0), γ(1), . . . , γ(p).
The other autocovariances can then be found recursively from the equation for h > p.
Alternatively, if we know (or have estimated) γ(0), γ(1), . . . , γ(p), they form p + 1 linear equations in the p + 1 parameters φ1, φ2, . . . , φp and σw².
These are the Yule-Walker equations.
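A minimal R sketch of solving the Yule-Walker equations for an AR(2), followed by ar.yw(), which does the same from the sample ACF (parameter values are illustrative):
# Yule-Walker: recover AR coefficients from the autocorrelations
phi <- c(1.5, -0.75)
rho <- ARMAacf(ar = phi, lag.max = 2)   # rho(0), rho(1), rho(2)
R <- toeplitz(rho[1:2])                 # matrix of rho(0), rho(1)
solve(R, rho[2:3])                      # returns phi1, phi2
# from data: ar.yw() solves the same equations with sample autocovariances
set.seed(7)
x <- arima.sim(model = list(ar = phi), n = 2000)
ar.yw(x, order.max = 2, aic = FALSE)$ar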
For the ARMA(p, q) model with p > 0 and q > 0:
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + wt + θ1 wt−1 + θ2 wt−2 + · · · + θq wt−q,
a generalized set of Yule-Walker equations must be used.
The moving average models ARMA(0, q) = MA(q) are the only ones with a closed form expression for ρ(h).
For AR(p) and ARMA(p, q) with p > 0, the recursive equation means that for h > max(p, q + 1), ρ(h) is a sum of geometrically decaying terms, possibly damped oscillations.
The recursive equation is
γ(h) = Σ_{j=1}^{p} φj γ(h − j),   h > q.
What kinds of sequences satisfy an equation like this?
Try γ(h) = z^{−h} for some constant z.
The equation becomes
0 = z^{−h} − Σ_{j=1}^{p} φj z^{−(h−j)} = z^{−h} (1 − Σ_{j=1}^{p} φj z^j) = z^{−h} φ(z).
So if φ(z) = 0, then γ(h) = z^{−h} satisfies the equation.
Since φ(z) is a polynomial of degree p, there are p solutions, say z1, z2, . . . , zp.
So a more general solution is
γ(h) = Σ_{l=1}^{p} cl zl^{−h},
for any constants c1, c2, . . . , cp.
If z1, z2, . . . , zp are distinct, this is the most general solution; if some roots are repeated, the general form is a little more complicated.
If all z1, z2, . . . , zp are real, this is a sum of geometrically
decaying terms.
If any root is complex, its complex conjugate must also be a
root, and these two terms may be combined into geometrically decaying sine-cosine terms.
The constants c1, c2, . . . , cp are determined by initial conditions; in the ARMA case, these are the Yule-Walker equations.
Note that the various rates of decay are the zeros of φ(z), the autoregressive operator, and do not depend on θ(z), the moving average operator.
Example: ARMA(1, 1)
xt = φ xt−1 + θ wt−1 + wt.
The recursion is
ρ(h) = φ ρ(h − 1),   h = 2, 3, . . .
So ρ(h) = c φ^h for h = 1, 2, . . . , but c ≠ 1.
Graphically, the ACF decays geometrically, but with a different value at h = 0.
plot(ARMAacf(ar = 0.9, ma = 0.5, 24));
The Partial Autocorrelation Function
An MA(q) can be identified from its ACF: non-zero to lag q,
and zero afterwards.
We need a similar tool for AR(p).
The partial autocorrelation function (PACF) fills that role.
Recall: for multivariate random variables X, Y, Z, the partial
correlations of X and Y given Z are the correlations of:
the residuals of X from its regression on Z; and
the residuals of Y from its regression on Z.
Here regression means conditional expectation, or best linear prediction, based on population distributions, not a sample calculation.
In a time series, the partial autocorrelations are defined as
φh,h = partial correlation of xt+h and xt, given xt+h−1, xt+h−2, . . . , xt+1.
For an autoregressive process, AR(p):
xt = φ1 xt−1 + φ2 xt−2 + · · · + φp xt−p + wt.
If h > p, the regression of xt+h on xt+h−1, xt+h−2, . . . , xt+1 is
φ1 xt+h−1 + φ2 xt+h−2 + · · · + φp xt+h−p.
So the residual is just wt+h, which is uncorrelated with xt+h−1, xt+h−2, . . . , xt+1 and xt.
So the partial autocorrelation is zero for h > p:
φh,h = 0,   h > p.
We can also show that φp,p = φp, which is non-zero by assumption.
So φp,p ≠ 0 but φh,h = 0 for h > p. This characterizes AR(p).
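The cut-off is easy to see numerically; a short sketch using R's ARMAacf() (parameter values are illustrative):
# PACF of an AR(2): non-zero at lags 1-2, (numerically) zero beyond
round(ARMAacf(ar = c(1.5, -0.75), lag.max = 6, pacf = TRUE), 4)
# PACF of an MA(1): tails off instead of cutting off
round(ARMAacf(ma = 0.8, lag.max = 6, pacf = TRUE), 4)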
The Inverse Autocorrelation Function
SAS's proc arima also shows the Inverse Autocorrelation Function (IACF).
The IACF of the ARMA(p, q) model
φ(B)xt = θ(B)wt
is defined to be the ACF of the inverse (or dual) process x̃t satisfying
θ(B)x̃t = φ(B)wt.
The IACF has the same property as the PACF: AR(p) is
characterized by an IACF that is nonzero at lag p but zero
for larger lags.
Summary: Identification of ARMA processes
AR(p) is characterized by a PACF or IACF that is:
nonzero at lag p;
zero for lags larger than p.
MA(q) is characterized by an ACF that is:
nonzero at lag q;
zero for lags larger than q.
For anything else, try ARMA(p, q) with p > 0 and q > 0.
For p > 0 and q > 0:

        AR(p)                   MA(q)                   ARMA(p, q)
ACF     Tails off               Cuts off after lag q    Tails off
PACF    Cuts off after lag p    Tails off               Tails off
IACF    Cuts off after lag p    Tails off               Tails off
Note: these characteristics are used to guide the initial choice
of a model; estimation and model-checking will often lead to
a different model.
Other ARMA Identification Techniques
SAS's proc arima offers the MINIC option on the identify
statement, which produces a table of SBC criteria for various
values of p and q.
The identify statement has two other options: ESACF and
SCAN.
Both produce tables in which the pattern of zero and non-zero values characterizes p and q.
See Section 3.4.10 in Brocklebank and Dickey.
options linesize = 80;
ods html file = 'varve3.html';
data varve;
infile '../data/varve.dat';
input varve;
lv = log(varve);
run;
proc arima data = varve;
title 'Use identify options to identify a good model';
identify var = lv(1) minic esacf scan;
estimate q = 1 method = ml;
estimate q = 2 method = ml;
estimate p = 1 q = 1 method = ml;
run;
proc arima output
Forecasting
General problem: predict xn+m given xn, xn−1, . . . , x1.
General solution: the (conditional) distribution of xn+m given xn, xn−1, . . . , x1.
In particular, the conditional mean is the best predictor (i.e. minimum mean squared error).
Special case: if {xt} is Gaussian, the conditional distribution is also Gaussian, with a conditional mean that is a linear function of xn, xn−1, . . . , x1 and a conditional variance that does not depend on xn, xn−1, . . . , x1.
Linear Forecasting
What if xt is not Gaussian?
Use the best linear predictor x^n_{n+m}.
Not the best possible predictor, but computable.
One-step Prediction
The hard way: suppose
x^n_{n+1} = φn,1 xn + φn,2 xn−1 + · · · + φn,n x1.
Choose φn,1, φn,2, . . . , φn,n to minimize the mean squared prediction error E[(xn+1 − x^n_{n+1})²].
Differentiate and equate to zero: n linear equations in the n unknowns.
Solve recursively (in n) using the Durbin-Levinson algorithm.
Incidentally, the PACF is φn,n.
One-step Prediction for an ARMA Model
The easy way: suppose we can write
xn+1 = some linear combination of xn, xn−1, . . . , x1
     + something uncorrelated with xn, xn−1, . . . , x1.
Then the first part is the best linear predictor, and the second part is the prediction error.
E.g. AR(p), p ≤ n:
xn+1 = φ1 xn + φ2 xn−1 + · · · + φp xn+1−p (first part) + wn+1 (second part).
General ARMA case
Now
xn+1 = φ1 xn + φ2 xn−1 + · · · + φp xn+1−p
     + θ1 wn + θ2 wn−1 + · · · + θq wn+1−q
     + wn+1.
The first part on the right hand side is a linear combination of xn, xn−1, . . . , x1.
The last part, wn+1, is uncorrelated with xn, xn−1, . . . , x1.
Middle part? If the model is invertible, wt is a linear combination of xt, xt−1, . . . , so if n is large, we can truncate the sum at x1, and wn, wn−1, . . . , wn+1−q are all (approximately) linear combinations of xn, xn−1, . . . , x1.
So the middle part is also approximately a linear combination of xn, xn−1, . . . , x1, whence
x^n_{n+1} = φ1 xn + φ2 xn−1 + · · · + φp xn+1−p + θ1 wn + θ2 wn−1 + · · · + θq wn+1−q
and wn+1 is the prediction error, xn+1 − x^n_{n+1}.
Multi-step Prediction
The easy way: build on one-step prediction. E.g. two-step:
xn+2 = φ1 xn+1 + φ2 xn + · · · + φp xn+2−p
     + θ1 wn+1 + θ2 wn + · · · + θq wn+2−q
     + wn+2.
Replace xn+1 by x^n_{n+1} + wn+1:
xn+2 = φ1 x^n_{n+1} + φ2 xn + · · · + φp xn+2−p
     + θ2 wn + · · · + θq wn+2−q
     + wn+2 + (φ1 + θ1) wn+1.
The first two parts are again (approximately) linear combinations of xn, xn−1, . . . , x1, and the last is uncorrelated with xn, xn−1, . . . , x1. So
x^n_{n+2} = φ1 x^n_{n+1} + φ2 xn + · · · + φp xn+2−p + θ2 wn + · · · + θq wn+2−q
and the prediction error is
xn+2 − x^n_{n+2} = wn+2 + (φ1 + θ1) wn+1.
Note that the mean squared prediction error is
σw² [1 + (φ1 + θ1)²] ≥ σw².
Mean squared prediction error increases as we predict further into the future.
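In R, predict() on an arima fit returns these forecasts and their standard errors; a hedged sketch on simulated data:
# multi-step forecasts and prediction standard errors from a fitted ARMA model
set.seed(9)
x <- arima.sim(model = list(ar = 0.8, ma = 0.4), n = 200)
fit <- arima(x, order = c(1, 0, 1))
fc <- predict(fit, n.ahead = 12)
fc$pred   # forecasts: approach the series mean as the lead increases
fc$se     # standard errors: increase toward the series standard deviation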
Forecasting with proc arima
E.g. the fishery recruitment data.
proc arima program and output.
Note that predictions approach the series mean, and std
errors approach the series standard deviation.
The autocorrelation test for residuals is borderline, largely
because of residual autocorrelations at lags 12, 24, . . . .
Spectrum analysis shows that these are caused by seasonal
means, which can be removed: proc arima program and
output.
Comments on Choice of ARMA model
Keep it simple! Use small p and q.
Some systems have autoregressive-like structure.
E.g. first order dynamics:
dx(t)/dt = −αx(t),
or in stochastic form,
dx(t) = −αx(t)dt + dW(t),
where W(t) is a Wiener process, the continuous time limit of the random walk.
Discrete time approximation:
Δx(t) = x(t + Δt) − x(t) = −αx(t)Δt + ΔW(t)
or
x(t + Δt) = x(t) − αx(t)Δt + ΔW(t) = (1 − αΔt)x(t) + ΔW(t),
an AR(1) (causal if α > 0 and Δt is small).
Similarly a second order system leads to AR(2).
Since many real-world systems can be approximated by first or second order dynamics, this suggests using p = 1 or 2, and q = 0.
Some systems have more dimensions. E.g. first order vector autoregression, VARp(1):
xt (p × 1) = Φ (p × p) xt−1 + wt (p × 1).
Here each component time series is typically ARMA(p, p − 1).
This suggests using q < p, especially q = p − 1.
Added noise: if yt is ARMA(p, q) with q < p, but we observe xt = yt + wt where wt is white noise, uncorrelated with yt, then xt is ARMA(p, p).
This suggests using q = p.
Summary: you'll often find that you can use small p and q ≤ p, perhaps q = 0 or q = p − 1 or q = p, depending on the background of the series.
Estimation
Current methods are likelihood-based:
f1,2,...,n(x1, x2, . . . , xn) = f1(x1) f2|1(x2|x1) · · · fn|n−1,...,1(xn|xn−1, xn−2, . . . , x1).
If xt is AR(p) and n > p, then
fn|n−1,...,1(xn|xn−1, xn−2, . . . , x1) = fn|n−1,...,n−p(xn|xn−1, xn−2, . . . , xn−p).
Assume xt is Gaussian. E.g. AR(1):
ft|t−1(xt|xt−1) is N[μ(1 − φ) + φxt−1, σw²] for t > 1,
and
f1(x1) is N[μ, σw²/(1 − φ²)].
So the likelihood, still for AR(1), is
L(μ, φ, σw²) = (2πσw²)^{−n/2} (1 − φ²)^{1/2} exp[−S(μ, φ)/(2σw²)],
where
S(μ, φ) = (1 − φ²)(x1 − μ)² + Σ_{t=2}^{n} [(xt − μ) − φ(xt−1 − μ)]².
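R's arima() offers analogous choices through its method argument; a hedged sketch comparing exact maximum likelihood with conditional sum of squares for an AR(1) (simulated data, values illustrative):
# exact maximum likelihood versus conditional sum of squares for an AR(1)
set.seed(10)
x <- arima.sim(model = list(ar = 0.7), n = 200) + 5
fit.ml  <- arima(x, order = c(1, 0, 0), method = "ML")
fit.css <- arima(x, order = c(1, 0, 0), method = "CSS")
rbind(ML = coef(fit.ml), CSS = coef(fit.css))  # similar phi; "intercept" is mu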
Methods in proc arima
method = ml: maximize the likelihood.
method = uls: minimize the unconditional sum of squares S(μ, φ).
method = cls: minimize the conditional sum of squares Sc(μ, φ):
Sc(μ, φ) = S(μ, φ) − (1 − φ²)(x1 − μ)² = Σ_{t=2}^{n} [(xt − μ) − φ(xt−1 − μ)]².
This is essentially least squares regression of xt on xt−1.
AR(p), p > 1, can be handled similarly.
ARMA(p, q) with q > 0 is more complicated; state space
methods can be used to calculate the exact likelihood.
proc arima implements the same three methods in all cases.
All three methods give estimators with the same large-sample
normal distribution; all are asymptotically optimal.
Brute Force
Above methods fail (or need serious modification) if any data
are missing.
Can always fall back to brute force:
x1, x2, . . . , xn Nn(1, ),
where
nn
(0)
(1)
(2)
(1)
(0)
(1)
(2)
(1)
(0)
...
...
...
(n 1) (n 2) (n 3)
. . . (n 1)
. . . (n 2)
. . . (n 3)
...
...
...
(0)
9
Write γ(h) = σw² ρ(h), and use e.g. R's ARMAacf(...) to compute ρ(h).
The likelihood is
det(2πΓ)^{−1/2} exp[−(1/2)(x − μ1)'Γ⁻¹(x − μ1)]
= det(2πσw²P)^{−1/2} exp[−(x − μ1)'P⁻¹(x − μ1)/(2σw²)],
where P is the n × n matrix of autocorrelations ρ(i − j).
Can maximize analytically with respect to μ and σw², then numerically with respect to φ and θ.
Missing data? Just leave out the corresponding rows and columns of Γ.
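A sketch of this brute-force Gaussian log-likelihood for an AR(1) in R, built from ARMAacf() and a full covariance matrix (illustrative, and practical only for small n):
# brute-force Gaussian log-likelihood for an AR(1)
loglik.ar1 <- function(phi, mu, sigw2, x) {
  n <- length(x)
  rho   <- ARMAacf(ar = phi, lag.max = n - 1)      # rho(0), ..., rho(n-1)
  Gamma <- (sigw2 / (1 - phi^2)) * toeplitz(rho)   # gamma(h) = sigw2 phi^h / (1 - phi^2)
  e <- as.numeric(x) - mu
  -0.5 * (n * log(2 * pi) + as.numeric(determinant(Gamma)$modulus) +
          drop(t(e) %*% solve(Gamma, e)))
}
# missing values would be handled by dropping the corresponding rows/columns of Gamma
set.seed(11)
x <- arima.sim(model = list(ar = 0.7), n = 100)
loglik.ar1(0.7, 0, 1, x)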
The Integrated ARMA model: ARIMA(p, d, q)
Some series are nonstationary, but their differences are stationary; e.g. the random walk.
Recall: the first differences of xt are
xt − xt−1 = (1 − B)xt = ∇xt.
The second differences are
∇xt − ∇xt−1 = (1 − B)∇xt = ∇²xt.
If ∇^d xt is ARMA(p, q), we say that xt is ARIMA(p, d, q).
Under-differencing
Suppose that xt is ARIMA(p, d, q), but we analyze yt = ∇^d' xt for some d' < d.
In this case, yt satisfies
φ(B)∇^{d−d'} yt = φ*(B)yt = θ(B)wt
where φ*(z) = (1 − z)^{d−d'} φ(z) has d − d' roots at z = 1.
This looks like an ARMA(p + d − d', q) model, but it is not causal.
Over-differencing
Suppose that xt is ARIMA(p, d, q), but we analyze yt = ∇^d' xt for some d' > d.
In this case, yt satisfies
φ(B)yt = θ(B)∇^{d'−d} wt = θ*(B)wt
where θ*(z) = (1 − z)^{d'−d} θ(z) has d' − d roots at z = 1.
This looks like an ARMA(p, q + d' − d) model, but it is not invertible.
Simplest model with d > 0: ARIMA(0, 1, 1)
Many nonstationary series are found to be fitted quite well
as ARIMA(0, 1, 1).
This model is connected with the exponentially weighted
moving average (EWMA) method of forecasting.
If the model is written xt − xt−1 = wt − λwt−1, the one-step forecast is
x̃n+1 = (1 − λ) Σ_{j=0}^{∞} λ^j xn−j,
the exponentially weighted moving average.
We can calculate the forecast recursively:
xn+1 = xn − λwn + wn+1.
We can find wn from xn, xn−1, . . . , so the one-step forecast is the first part:
x̃n+1 = xn − λwn.
But wn is the previous forecast error, xn − x̃n, so
x̃n+1 = xn − λ(xn − x̃n) = (1 − λ)xn + λx̃n.
In words, the new forecast is a weighted average of the current forecast and the current value.
Also
x̃n+1 = x̃n + (1 − λ)(xn − x̃n),
so the new forecast is the current forecast plus a correction based on the current forecast error.
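A small R sketch of this EWMA update (λ and the starting value are illustrative choices):
# exponentially weighted moving average forecasts via the update formula
ewma.forecast <- function(x, lambda, start = x[1]) {
  n <- length(x)
  xt <- numeric(n + 1)              # xt[t] is the forecast of x[t]
  xt[1] <- start
  for (t in 1:n) {
    xt[t + 1] <- xt[t] + (1 - lambda) * (x[t] - xt[t])   # forecast + correction
  }
  xt[-1]                            # one-step forecasts of x[2], ..., x[n+1]
}
set.seed(12)
x <- cumsum(rnorm(100))
f <- ewma.forecast(x, lambda = 0.6)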
Strategy for Building ARIMA Models
1. First choose d:
ACF of an integrated series tends to die away slowly, so
difference until it dies away quickly;
the IACF of a non-invertible series tends to die away
slowly, which indicates over-differencing.
You may want to try more than one value of d.
2. Next choose p and q, e.g. using MINIC.
7
3. Next estimate the model.
4. Finally check the model diagnostics:
Significance of the highest order coefficients, φp (if p > 0) and θq (if q > 0);
Non-significance in autocorrelation check of residuals;
Low value of AIC or SBC.
5. Repeat from step 2 until satisfactory.
Note: You may not find a completely satisfactory model,
especially for a long data series.
Unit Root Tests
Choice of d can be formulated as a hypothesis test.
E.g. in the AR(1) model xt = φxt−1 + wt, set:
H0: φ = 1, xt is ARIMA(0, 1, 0) (nonstationary, d = 1);
HA: |φ| < 1, xt is ARIMA(1, 0, 0) (stationary, d = 0).
Test using proc arima's stationarity keyword on the identify statement.
E.g. the global temperature data: proc arima program and
output.
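In R, a comparable augmented Dickey-Fuller test is available in the tseries package (an assumption: that package is installed); a minimal sketch on simulated series:
# augmented Dickey-Fuller test: the null hypothesis is a unit root (d = 1)
library(tseries)
set.seed(13)
rw <- cumsum(rnorm(200))                       # random walk: null typically not rejected
st <- arima.sim(model = list(ar = 0.5), 200)   # stationary AR(1): null typically rejected
adf.test(rw)
adf.test(st)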
The statistics on the 'Lags 0' rows in the panel 'Augmented Dickey-Fuller Unit Root Tests' refer to the three models
Zero Mean: xt = φxt−1 + wt;
Single Mean: xt − μ = φ(xt−1 − μ) + wt;
Trend: xt − μ − δt = φ(xt−1 − μ − δ(t − 1)) + wt.
Note that under H0, these models reduce to
xt = xt−1 + wt,
xt = xt−1 + wt,
xt = xt−1 + δ + wt,
the first two being random walks with no drift, the latter being a random walk with drift.
The statistics on the 'Lags 1' rows refer to corresponding AR(2) models, which reduce to integrated AR(1) models under the null hypothesis.
The Tau tests are generally preferred to the Rho tests.
E.g. Case-Shiller housing data: proc arima program and output.
Seasonal ARIMA Models
Many time series collected on a monthly or quarterly basis
have seasonal behavior.
Similarly hourly data and daily behavior.
E.g. Johnson & Johnson quarterly earnings; discussion typically focuses on comparison with:
previous quarter;
same quarter, previous year.
1
That is, we compare xt with xt−1 and xt−4.
More generally, we compare xt with xt−1 and xt−s, where
s = 4 for quarterly data,
s = 12 for monthly data,
s = 24 for daily effects in hourly data,
s = 168 for weekly effects in hourly data,
etc.
This suggests modeling xt in terms of xt1 and xts.
Pure Seasonal ARMA
The pure seasonal ARMA model has the form
xt = Φ1 xt−s + Φ2 xt−2s + · · · + ΦP xt−Ps
   + wt + Θ1 wt−s + Θ2 wt−2s + · · · + ΘQ wt−Qs.
Notation: ARMA(P, Q)s.
In operator form:
ΦP(B^s)xt = ΘQ(B^s)wt.
ΦP(B^s) and ΘQ(B^s) are seasonal autoregressive and moving average operators.
Multiplicative Seasonal ARMA
The ACF of a pure seasonal ARMA is nonzero only at lags s, 2s, . . . ; most seasonal time series have other nonzero values.
For such series, wt(s) = ΘQ(B^s)⁻¹ ΦP(B^s) xt is not white noise for any choice of P and Q.
But suppose that for some P and Q, wt(s) is ARMA(p, q):
φp(B) wt(s) = θq(B) wt,
where {wt} is white noise.
Then xt satisfies
P (B s)p(B)xt = Q(B s)q (B)wt.
This is the Multiplicative
ARMA(p, q) (P, Q)s.
Seasonal
ARMA
model
The non-seasonal parts p and q control short-term correlations (up to half a season, lag s/2), while the seasonal
parts P and Q control the decay of the correlations over
multiple seasons.
Example: Johnson & Johnson earnings; R analysis
par(mfrow = c(2, 1))
plot(log(jj))
jjl = lm(log(jj) ~ time(jj) + factor(cycle(jj)))
summary(aov(jjl))
jjf = ts(fitted(jjl), start = start(jj),
frequency = frequency(jj))
lines(jjf, col = 2, lty = 2)
jjr = ts(residuals(jjl), start = start(jj),
frequency = frequency(jj))
plot(jjr)
acf(jjr)
pacf(jjr)
PACF is simpler than ACF:
ACF spikes at lags 4, 8, perhaps 12; of these, PACF spikes
only at lag 4;
apart from lags 4, 8, . . . , PACF drops off faster.
(P)ACF indicates neither a simple ARMA nor a simple pure seasonal ARMA4.
PACF suggests ARMA(2, 0) × (1, 0)4:
jja = arima(jjr, order = c(2, 0, 0),
seasonal = list(order = c(1, 0, 0), period = 4))
print(jja)
tsdiag(jja)
Note: the original fit of the straight line and seasonal dummies was by OLS;
possibly inefficient;
invalid inferences (standard errors, etc.).
Solution: refit as part of the time series model.
x = model.matrix( ~ time(jj) + factor(cycle(jj)))
jja = arima(log(jj), order = c(2, 0, 0),
seasonal = list(order = c(1, 0, 0), period = 4),
xreg = x, include.mean = FALSE)
print(jja)
tsdiag(jja)
Notes:
The time series being fitted is the original unadjusted log(jj).
The regressors are specified as the matrix argument xreg.
arima does not check for linear dependence, so we must either
omit one dummy variable from xreg or use include.mean =
FALSE in arima.
Regression parameter estimates are similar to OLS, but standard errors are roughly doubled.
Using SAS: proc arima program and output.
Multiplicative Seasonal ARIMA
The seasonal difference operator is ∇s = 1 − B^s.
Some series show slow decay of ACF only at lags s, 2s, . . . ,
which suggests seasonal differencing.
But note: seasonal means also give slow decay of ACF at
those lags.
The Multiplicative Seasonal ARIMA model, ARIMA(p, d, q) × (P, D, Q)s, is
ΦP(B^s) φp(B) ∇s^D ∇^d xt = ΘQ(B^s) θq(B) wt.
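In R, arima() fits such models through its seasonal argument; a hedged sketch using the quarterly J&J earnings loaded earlier (the orders here are chosen only for illustration):
# a multiplicative seasonal ARIMA fit: ARIMA(0,1,1) x (0,1,1)_4 for log(jj)
jj.fit <- arima(log(jj), order = c(0, 1, 1),
                seasonal = list(order = c(0, 1, 1), period = 4))
jj.fit
tsdiag(jj.fit)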
The Frequency Domain
Time domain methods:
regress present on past;
capture dynamics in terms of velocity (first order), acceleration (second order), inertia, etc.
Frequency domain methods:
regress present on periodic sines and cosines;
capture dynamics in terms of resonant frequencies, etc.
E.g. AR(2):
plot(ts(arima.sim(list(order = c(2,0,0), ar = c(1.5,-.95)), n = 144)))
Strong periodicity, around 16 peaks, i.e. a period of around 9 samples.
Fitting an AR model doesn't describe this directly:
xt = 1.50xt−1 − 0.95xt−2 + wt.
Cyclical Behavior
Simplest case is the periodic process
xt = A cos(2πωt + φ) = U1 cos(2πωt) + U2 sin(2πωt),
where:
A is amplitude;
ω is frequency, in cycles per sample;
φ is phase, in radians;
and U1 = A cos(φ), U2 = −A sin(φ).
Folding Frequency; Aliasing
If ω = 0, xt = A cos(φ), constant.
If ω = 1? At t = 0, 1, 2, . . . , same thing!
ω = 0 is an alias of ω = 1.
All frequencies higher than ω = 1/2 have an alias in 0 ≤ ω ≤ 1/2:
cos[2π(k − ω)t + φ] = cos(2πωt − φ),   t = 0, 1, 2, . . . (k an integer).
ω = 1/2 is the folding frequency.
For example, ω = 0.8:
omega = 0.8;
phi = pi / 6;
plot(function(x) cos(2 * pi * omega * x + phi),
     from = 0, to = 10);
plot(function(x) cos(2 * pi * (1 - omega) * x - phi),
     from = 0, to = 10, add = TRUE, col = "red");
abline(v = 0:10, lty = 2, col = "blue");
Note:
ω = 0.8 = 0.5 + 0.3, and 1 − ω = 0.2 = 0.5 − 0.3;
1 − ω is ω folded around 0.5.
Stationarity
If
xt = A cos(2πωt + φ) = U1 cos(2πωt) + U2 sin(2πωt)
and φ is random, uniformly distributed on [0, 2π), then:
E(xt) = 0,
E(xt+h xt) = (1/2) A² cos(2πωh).
So xt is weakly stationary.
Also
E(U1) = E(U2) = 0,
E(U1²) = E(U2²) = (1/2) A²,
and
E(U1U2) = 0.
Alternatively, if the Us have these properties, xt is stationary with the same mean and autocovariances:
E(xt) = 0,
E(xt+h xt) = (1/2) A² cos(2πωh).
More generally, if
xt = Σ_{k=1}^{q} [Uk,1 cos(2πωk t) + Uk,2 sin(2πωk t)],
where:
the Us are uncorrelated with zero mean;
var(Uk,1) = var(Uk,2) = σk²;
then xt is stationary with zero mean and autocovariances
γ(h) = Σ_{k=1}^{q} σk² cos(2πωk h).
Harmonic Analysis
Any time series sample x1, x2, . . . , xn can be written
xt = a0 + Σ_{j=1}^{(n−1)/2} [aj cos(2πjt/n) + bj sin(2πjt/n)]
if n is odd; if n is even, an extra term is needed.
The periodogram is
P(j/n) = aj² + bj².
The R function spectrum can calculate and plot the periodogram.
R examples:
par(mfcol = c(2, 1));
# one frequency:
x = cos(2*pi*(0.123)*(1:144))
plot.ts(x); spectrum(x, log = "no")
# and a second frequency:
x = x + 2 * cos(2*pi*(0.234)*(1:144))
plot.ts(x); spectrum(x, log = "no")
# and added noise:
x = x + rnorm(144)
plot.ts(x); spectrum(x, log = "no")
# the AR(2) series:
x = ts(arima.sim(list(order = c(2,0,0), ar = c(1.5,-.95)), n = 144))
plot(x); spectrum(x, log = "no")
Using SAS: proc spectra program and output.
tr st
rr ss rqs r str
t s
t t t r t s s sttr
t srs
str st s t rrs r
srt rr rsr
t t srt rr trsr s
t t
t
rqs r t rr r t
rqs
tr rr trsr t s rs trsr
t
t t
Prr
rr s
s rr s rr s P
trs s trs r
tt s r s rt s
r
r t rr s
t str st t
rst t str st s t rt
t t rr
t s r t str st ss t s
t t rr t
rs t rr s s sttr
t str st
t t rr s t rs r
s t r stt
tts t tr
r sttr t srs t t trs
s rs str r str strt
t r
s st ts t s str st
t
r rs ts s s
rtt s t s
Prrts t str st
t s stt
t t str t t s
t t s
r t t t t t
s s
r s rsts t r tr s
s tt t str st t q rss
t t
t tt ts s t s rt
t tr str t srs ts t
tt t rq
t rs r rss rrs t
rs
rs
t
t
ts str st s
rs
s t rs rss
t
rr
ttsrsstr
r
s
t t t t
t ts t
sqr t t
t t
s r r
t rs rt ss
t rs t
s r st ts t t r
tt t r t t rs
ss
sst ts t t r r s
ss tt t r
t t s t
r
t r
ss
rtr tr stts
rt
r rt t
Pr
s rt s sttr
t t t rs r t s
t P
rq s
s r t rr rqs r s ttr
s tr r rq t sr
t
s st t st r s
tr
s tr s sr rr s st
t rt t s qtt rt
t
r Prr
s t t rr
rqs
t s rt s r
t t r Prr
r
s strs st tr tr
r str t s s
r s
ts
r
Pts ts
ts r rt
tstrs t sqr
ts r r
rt tt s
r rt t
tstrs t r
t ts r ts r rrq
ts t t t t r t tr str
ts trs s strt
t r r
t t Prr
s t r
t t s r stt t s
qtt
t s t t s t tr ss rt
trt t r
t t t Prr
t r t s rs ts r
trr ts
r
t
r ts rs
r
s strs st tr tr
t ts sttt s s t rt
rtrr r
r str t s s
r s
ts
r
str
rs t ts t s t
t ts sttt s s t rt s
str rs
ts r
r str
rr tt
s st
tr st t r
tr s t ss st s t
sstt s t rr t t rq
t srs s sttr s tr t
r t r str st t
r ss ss ss t ss st srs
s rsss strs
s tss rq rqs strt strts
r
s strs st tr tr
tstrs t sqr
tt r
tstrs t r
tt r
Tapering
The periodogram works well with data containing only Fourier
frequencies:
w = rnorm(128, sd = 0.01);
x5 = cos(2*pi*(5/128)*(1:128)) + w;
x6 = cos(2*pi*(6/128)*(1:128)) + w;
par(mfcol = c(3, 1), mar = c(2, 2, 1, 1));
spectrum(x5, taper = 0, ylim = c(1e-7, 1e2));
spectrum(x6, taper = 0, ylim = c(1e-7, 1e2));
It doesn't work so well with other frequencies:
x5h = cos(2*pi*(5.5/128)*(1:128)) + w;
spectrum(x5h, taper = 0, ylim = c(1e-7, 1e2));
One solution is to taper the data:
spectrum(x5h, taper = 0.5, ylim = c(1e-7, 1e2))
This works by multiplying the data by a data window:
par(mfcol = c(3, 1), mar = c(2, 2, 1, 1));
plot(tapr(rep(1, 128), 0.25));
plot(x5h);
plot(tapr(x5h, 0.25));
The data window modifies a fraction of the data at each end
of the series, to make the data more nearly continuous when
it is wrapped.
Tapering makes the main peak wider, but much reduces side
lobes.
To see the side lobes, make the periodogram graphs on a
finer grid of frequencies:
par(mfcol = c(2, 1), mar = c(2, 2, 1, 1));
spectrum(x5h, taper = 0.0, ylim = c(1e-7, 1e2), pad = 896)
spectrum(x5h, taper = 0.5, ylim = c(1e-7, 1e2), pad = 896)
The default in R's spectrum (or spec.pgram, which does the
work) is to taper 10% at each end of the data.
t rs
t sttr srs t t rss rs
t t
rss str st s
rss str st s
q
r q r t str q str
rst
q q
qr r
s t r t s s
ts t rrstt
t
st s st
r q r t rs t
s s ts t rrstts t
t
rrts r s sr t trrt t rs
t sqr r
srs t strt t rts t t t
t rq
s s t srs t rt
r t t rq t
t tt r
Ps tr
qr r s s t t sqr rr
t t t rs
st rrt t t ts t
t ts s str rrt rt tr
st r t rts
s t rrt s tr r
t t st
t sr
s s r s t
t t t r t t r t
t rtt s
r s t s str
s t s str s sr rs t
s t rt t s r s
s str srs t t r tts
t t rq t sst rt t t
ts t t t s rq
tr tr
r r rt t srs
t t
str tr s
t t
t
t str tr s st
rtr stt
r t rt s
s t r trs t s
r r r
stt r
s t s r t t stt
sqr r s
t tr r s r
s st sr rst s rt
t r
r t tss
t rqs r s ts rt
trr t s tt t r s st
rqs t s ts s s
t r
t s t sqr r t t tr
st t srs rrtt srs
r
s strs r r
tr st
ts tt t
q s
s r
q s
s r t
ss ts s t s t rt
s
s
r
r
s
rsss trs
tss strt strts rq rqs
rssr trr
tsr strt strtr rq rqr
strs r r
tr st
ts tt t
q s
s r
q s
s r t
stt s
ts tt s t
r trs
r tr ts t
t
t tt t
r tr
r ts r r
t t s t s t t
t
t
t tt s t t s t ts r t s
rss t t tr
t
r t
trs
rr
tsr strt
t
t t
r
tr
t t
r
t
ss
t
s r r r r t tr s rt r s
r r
t t s s
t t
t tt s
t
t
r
r tr
r r
r r
s t rq rss t t tr
tr s t r r t
tr t
r tr
tr
t
tr ss
st rtr
tr rtt
t
r tr
r r t
r r
rq rss t s
Ps
rq rss t r rts
s t t r t
s t s t
t t
stt
s r t s
s tr
r tr
t tr s t s ss tr
t s t s ss tr
trs r rrs
t t
r tr
t s t s t tr s t s
t s t s t tr s rrs t s
r s t tt tr t
t s t tr rrs tr
tt tr
t t s sttr rss t trs
t rs t tt r
t t
r tr
sts
r s tr ts
r s r s
s s t t s t s sttr t
r s r s
t r str st t s
r r
r s r s
ss
r srs
rs
tt str t str sqr
s t t rss r
t t
sts
s t ts
s s
s t s t t s t t r t
sttr t
s s
t rss str st s
s s
s
ss
r r rs
rss str t strrq rss t
t tt t sqr r s
r r rq
t s srs s tr rs tr tr
sqr r s t rqs
rs s s tr
rst r
r
trs
t s
rq rss t s
s
r t t s s
rr
tt s r t
t
tt s r t
t
r t s t s
tr r
t
r tr
rq rss t s
r
s
s
t
rr
tt
r t
tt
r t
tt
r t
s s
t
ss s
t
s s
t
Ps t trts t
tr
t t t
s tr rss t t t s
Prt
t str t tt tr s
s t tr s tt
t s stt
t s t tt t tr s t s t s
rt tr r t t t
t rt tr s t t
t t t t s t qt t t
s s tr
rt rt s
r stt
r
s sr t stt t
rst stt s t
Lagged regression
The fisheries recruitment series (yt) and the Southern Oscillation Index (xt) are cross-correlated with lags of several
months.
Perhaps we can model them as
    y_t = Σ_{r=-∞}^{∞} β_r x_{t-r} + v_t,
where v_t is uncorrelated with x_{t-r} at all lags r. That is,
the coherence between v_t and x_t is zero at all frequencies;
in words: v_t and x_t are incoherent.
1
In terms of filters:
    z_t = Σ_{r=-∞}^{∞} β_r x_{t-r}
is the output of a filter whose input is x_t, and y_t is z_t plus
noise that is incoherent with the input.
If the frequency response function of the filter is
    B(ω) = Σ_{r=-∞}^{∞} β_r e^{-2πiωr},
the spectrum of z_t is
    f_zz(ω) = |B(ω)|² f_xx(ω).
2
Also the cross spectrum is
    f_zx(ω) = B(ω) f_xx(ω).
Now y_t = z_t + v_t, and v_t is incoherent with x_t, and therefore
also with z_t.
So the spectrum of y_t is
    f_yy(ω) = f_zz(ω) + f_vv(ω) = |B(ω)|² f_xx(ω) + f_vv(ω),
and the cross spectrum of y_t and x_t is
    f_yx(ω) = f_zx(ω) = B(ω) f_xx(ω).
3
So B(ω) must satisfy
    B(ω) = f_yx(ω) / f_xx(ω).
Can we find a filter with frequency response function B(ω)?
Typically, yes. If x_t and y_t are such that
    ∫_{-1/2}^{1/2} |B(ω)| dω = ∫_{-1/2}^{1/2} |f_yx(ω) / f_xx(ω)| dω < ∞,
the coefficients are
    β_r = ∫_{-1/2}^{1/2} e^{2πiωr} B(ω) dω ≈ (1/n) Σ_{k=0}^{n-1} e^{2πiω_k r} B(ω_k).
SOI and recruitment
We need B(ω_k) for k = 0, 1, . . . , n − 1, but both R's spectrum()
and SAS's proc spectra omit ω_k for k = 0 and k > n/2.
In R, we can use fft() directly (filter.complex() is presumably a course-supplied helper for smoothing complex-valued series):
dy = fft(rec - mean(rec)) / sqrt(length(rec))
dx = fft(soi - mean(soi)) / sqrt(length(soi))
fyx = filter.complex(dy * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
fxx = filter.complex(dx * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
B = fyx/fxx
beta = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(beta[-1], beta)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)
Using the seasonally adjusted series gives a very similar result:
dy = fft(recSA) / sqrt(length(recSA))
dx = fft(soiSA) / sqrt(length(soiSA))
fyx = filter.complex(dy * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
fxx = filter.complex(dx * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
B = fyx/fxx
betaSA = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(betaSA[-1], betaSA)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)
In this case, the response is clearly recruitment, and the input
is SOI.
We could reverse the roles: SOI versus recruitment.
fxy = Conj(fyx)
fyy = filter.complex(dy * Conj(dy), rep(1, 15), sides = 2, circular = TRUE)
B = fxy/fyy
beta = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(beta[-1], beta)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)
In the first version, β_r ≈ 0 for r < 0, so the filter is physically
realizable. In other cases, this method may give unrealizable
filters; we can fit the best realizable filter using time domain
methods.
8
Interpreting Coherence
Recall that
    f_yy(ω) = |B(ω)|² f_xx(ω) + f_vv(ω)
            = |f_yx(ω)|² / f_xx(ω) + f_vv(ω)
            = ρ²_yx(ω) f_yy(ω) + f_vv(ω).
So
    f_vv(ω) = [1 − ρ²_yx(ω)] f_yy(ω).
The squared coherence ρ²_yx(ω) is the proportion of the spectrum of
y_t that is explained by the lagged regression on x_t.
9
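A hedged R sketch of this decomposition, assuming the soi and rec series are loaded as in the lagged-regression code above; spec.pgram() applied to the two series together returns both the smoothed spectra and the squared coherence:
sr = spec.pgram(cbind(rec, soi), spans = c(7, 7), taper = 0.1, plot = FALSE);
fyy = sr$spec[, 1];               # smoothed spectrum estimate for rec
coh2 = sr$coh[, 1];               # squared coherence between rec and soi
fvv = (1 - coh2) * fyy;           # estimated noise spectrum in the lagged regression
plot(sr$freq, coh2, type = "l");  # fraction of f_yy explained, frequency by frequency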
Forecasting
The forecasting problem is also a type of lagged regression:
of x_t on its own lags;
and on only the past.
We have seen that the solution is
    x̂_t = Σ_{r=1}^{∞} φ_r x_{t−r},
where the φ's must satisfy
    cov(x_t − x̂_t, x_{t−r}) = 0   for r = 1, 2, . . .
10
That is, w_t = x_t − x̂_t is uncorrelated with all past x's, and
hence with all past w's, and hence is white noise.
So the filter
    w_t = x_t − Σ_{r=1}^{∞} φ_r x_{t−r} = Σ_{r=0}^{∞} π_r x_{t−r}
(with π_0 = 1 and π_r = −φ_r for r ≥ 1) turns x_t into white noise w_t.
So the spectrum f_xx(ω) satisfies
    σ²_w = f_xx(ω) |Σ_{r=0}^{∞} π_r e^{-2πiωr}|²
         = f_xx(ω) ( Σ_{r=0}^{∞} π_r e^{-2πiωr} ) ( Σ_{r=0}^{∞} π_r e^{2πiωr} ).
11
So, taking logarithms:
    log[f_xx(ω)] = log σ²_w − log Σ_{r=0}^{∞} π_r e^{-2πiωr} − log Σ_{r=0}^{∞} π_r e^{2πiωr}.
Now, provided log[f_xx(ω)] is integrable:
    ∫_{-1/2}^{1/2} |log[f_xx(ω)]| dω < ∞,
we can write
    log[f_xx(ω)] = l_0 + 2 Σ_{r=1}^{∞} l_r cos(2πωr)
                 = l_0 + Σ_{r=1}^{∞} l_r e^{-2πiωr} + Σ_{r=1}^{∞} l_r e^{2πiωr}.
12
Some standard complex variable theory implies that we can
match terms:
    log σ²_w = l_0,
    −log Σ_{r=0}^{∞} π_r e^{-2πiωr} = Σ_{r=1}^{∞} l_r e^{-2πiωr},
    −log Σ_{r=0}^{∞} π_r e^{2πiωr} = Σ_{r=1}^{∞} l_r e^{2πiωr}.
13
That is,
    σ²_w = exp(l_0) = exp( ∫_{-1/2}^{1/2} log[f_xx(ω)] dω ),
and
    Σ_{r=0}^{∞} π_r e^{-2πiωr} = exp( −Σ_{r=1}^{∞} l_r e^{-2πiωr} ),
whence for r = 1, 2, . . .
    φ_r = −π_r = −∫_{-1/2}^{1/2} exp( −Σ_{s=1}^{∞} l_s e^{-2πiωs} ) e^{2πiωr} dω.
This is the essence of Kolmogorov's (1941) solution to the
forecasting problem.
14
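A small numerical check of the innovation-variance formula, for an AR(1) spectrum with coefficient and variance chosen purely for illustration (not from the course data):
phi = 0.6; sig2 = 2;                                                 # illustrative values
f = function(omega) sig2 / Mod(1 - phi * exp(-2i * pi * omega))^2;   # AR(1) spectral density
exp(integrate(function(om) log(f(om)), -0.5, 0.5)$value);            # recovers sig2 = 2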
Long Memory Time Series
A time series has short memory if
    Σ_{h=-∞}^{∞} |γ(h)| < ∞.
So a time series for which
    Σ_{h=-∞}^{∞} |γ(h)| = ∞
is said to have long memory.
Why do we care?
Write the mean of x_1, x_2, . . . , x_n as
    x̄_n = (x_1 + x_2 + · · · + x_n) / n.
Then
    var(x̄_n) = (1/n) Σ_{h=-(n-1)}^{n-1} (1 − |h|/n) γ(h)
             = (1/n) Σ_{h=-∞}^{∞} (1 − |h|/n)_+ γ(h),
where (a)_+ = max(a, 0) is a if a ≥ 0 and 0 if a < 0.
2
If Σ_h |γ(h)| < ∞, then
    Σ_{h=-∞}^{∞} (1 − |h|/n)_+ γ(h) → Σ_{h=-∞}^{∞} γ(h)
as n → ∞.
So
    n var(x̄_n) → Σ_{h=-∞}^{∞} γ(h),
or
    var(x̄_n) = (1/n) Σ_{h=-∞}^{∞} γ(h) + o(1/n).
3
That is, for a short memory time series, var(x̄_n) goes to zero
as the sample size increases at the usual 1/n rate, as σ²/n would,
but with a different multiplier.
Note that
    Σ_{h=-∞}^{∞} γ(h) = f(0),
the spectral density f(ω) evaluated at ω = 0.
So we can also write
    var(x̄_n) = f(0)/n + o(1/n):
σ² is replaced by f(0).
4
But if Σ_h |γ(h)| = ∞, this doesn't work.
In practice, many series show var(x̄_n) decaying more slowly.
Plot log[var(x̄_n)] against log(n), and look for a slope of −1.
vartime = function(x, nmax = round(length(x) / 10)) {
  # variance of moving averages of length n, for n = 1, ..., nmax
  v = rep(NA, nmax);
  for (n in 1:nmax) {
    y = filter(x, rep(1/n, n), sides = 1);   # means of n consecutive observations
    v[n] = var(y, na.rm = TRUE);
  }
  plot(log(1:nmax), log(v));
  lmv = lm(log(v) ~ log(1:nmax));            # slope should be near -1 for short memory
  abline(lmv);
  title(paste(deparse(substitute(x)), "; nmax = ", nmax));
  print(summary(lmv));
}
vartime(log(varve))
vartime(globtemp)
vartime(residuals(lm(globtemp ~ time(globtemp))))
5
Fractional Integration
How can we model such series?
Fractionally integrated white noise:
    (1 − B)^d x_t = w_t,   0 < d < 0.5.
The ACF is
    ρ(h) = Γ(h + d) Γ(1 − d) / [Γ(h − d + 1) Γ(d)] ∼ h^{2d−1}.
So for 0 < d < 0.5,
    Σ_{h=-∞}^{∞} |ρ(h)| = ∞.
6
Notes:
var(x̄_n) decays like n^{2d−1}, so
    d = (1 + slope of the variance-time graph) / 2
gives a rough empirical estimate of d.
The spectral density is
    f(ω) = σ²_w [4 sin²(πω)]^{-d},
so for d > 0, f(ω) → ∞ as ω → 0.
7
Also f(ω) ∝ |ω|^{-2d} as ω → 0, so a graph of log[f(ω)] against
log(|ω|) gives another estimate of d.
If d ≥ 0.5, f(ω) is not integrable, so the series is not stationary.
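A hedged sketch of that second estimate, a log-periodogram regression at low frequencies, assuming the varve series is available as in the fracdiff() examples below:
x = log(varve);
n = length(x);
I = Mod(fft(x - mean(x)))^2 / n;        # periodogram at Fourier frequencies j/n
j = 1:floor(sqrt(n));                   # keep only the lowest frequencies
fit = lm(log(I[j + 1]) ~ log(j / n));   # slope is about -2d near zero frequency
-coef(fit)[2] / 2                       # rough estimate of d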
ARFIMA Model
In some long-memory series, autocorrelations at small lags
do not match those of fractionally integrated noise.
We can add ARMA components to allow for such differences;
the ARIMA(p, d, q) model with fractional d, or ARFIMA.
Use the R function fracdiff():
library(fracdiff)
summary(fracdiff(log(varve)))
summary(fracdiff(log(varve), nar = 1, nma = 1))
summary(fracdiff(residuals(lm(globtemp ~ time(globtemp)))))
9
Trend Estimation with ARFIMA errors
The R function fracdiff() does not allow explanatory variables, but we can use it to calculate a profile likelihood function.
E.g. global temperature versus cumulative CO2 emissions:
source("http://www.stat.ncsu.edu/people/bloomfield/courses/st730/co2w.R");
plot(cbind(globtemp, co2w));
slopes = seq(from = 0, to = 1.5, length = 151);
ll2 = rep(NA, length(slopes));
for (i in 1:length(slopes))
ll2[i] = -2 * fracdiff(globtemp - slopes[i] * co2w)$log.likelihood;
plot(slopes, ll2, type = "l");
abline(h = min(ll2) + qchisq(.95, 1));
10
The point estimate is
slopeEst = slopes[which.min(ll2)];
abline(v = slopeEst, col = "red"); # [1] 0.68
and the 95% confidence interval is roughly:
slopeCI = range(slopes[ll2 <= min(ll2) + qchisq(.95, 1)]);
abline(v = slopeCI, col = "red", lty = 2); # [1] 0.41 1.03
The CO2 series was scaled by its change from 1900 to 2000,
so we estimate the 20th century warming as 0.68°C, with a
confidence interval of (0.41°C, 1.03°C) (note the asymmetry:
0.68 (−0.27, +0.35)°C).
Compare with IPCC: 1906–2005 warming is 0.74°C ± 0.18°C.
11
Conditional Heteroscedasticity (CH)
So far, our models are for the conditional mean.
For instance, the Gaussian AR(1) model
    y_t = φ y_{t−1} + w_t
may be written:
Conditionally on y_{t−1}, y_{t−2}, . . . ,
    y_t ∼ N(φ y_{t−1}, σ²_w).
The conditional mean depends on the past, the conditional
variance does not.
1
Three key features:
The conditional distribution is normal;
The conditional mean is a linear function of y_{t−1}, y_{t−2}, . . . ;
The conditional variance is constant: conditional homoscedasticity.
All three features could be changed.
Non-normal noise: typically longer tails; for fitting, provided
the variance is finite, this changes the likelihood function but
not much else.
Nonlinear mean function: Modeling a nonlinear mean is quite
difficult; for instance, ensuring stationarity is restrictive. Threshold models are perhaps most feasible.
Non-constant variance. Two approaches:
ARCH (AutoRegressive CH), GARCH (Generalized ARCH),
...
Stochastic volatility.
3
ARCH Models
Simplest is ARCH(1):
    y_t = σ_t ε_t
    σ²_t = α_0 + α_1 y²_{t−1},
where ε_t is Gaussian white noise with variance 1.
Alternatively:
Conditionally on y_{t−1}, y_{t−2}, . . . ,
    y_t ∼ N(0, α_0 + α_1 y²_{t−1}).
If |y_{t−1}| happens to be large, σ_t is increased, so |y_t| also tends
to be large.
Conversely, if |y_{t−1}| happens to be small, σ_t is decreased, so
|y_t| also tends to be small.
The result: volatility clusters and long tails.
n = 1000; alpha1 = 0.9; alpha0 = 1 - alpha1;
y = epsilon = ts(rnorm(n));
par(mfcol = c(2, 1));
plot(epsilon);
for (i in 2:n) y[i] = epsilon[i] * sqrt(alpha0 + alpha1 * y[i - 1]^2);
plot(y);
5
ARCH as AR
The ARCH(1) model for y_t implies:
    y²_t = σ²_t ε²_t
         = σ²_t + σ²_t (ε²_t − 1)
         = α_0 + α_1 y²_{t−1} + σ²_t (ε²_t − 1),
or
    y²_t = α_0 + α_1 y²_{t−1} + v_t,
where
    v_t = σ²_t (ε²_t − 1).
6
Note that
    E(v_t | y_{t−1}, y_{t−2}, . . . ) = 0,
and hence that for h > 0,
    E(v_t v_{t−h}) = E[ E(v_t v_{t−h} | y_{t−1}, y_{t−2}, . . . ) ]
                  = E[ v_{t−h} E(v_t | y_{t−1}, y_{t−2}, . . . ) ]
                  = 0,
so v_t is (highly non-normal) white noise, and y²_t is AR(1).
For positivity and stationarity, α_0 > 0 and 0 ≤ α_1 < 1, and
unconditionally,
    E(y²_t) = var(y_t) = α_0 / (1 − α_1).
7
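A quick empirical check, a sketch reusing the ARCH(1) series y simulated above (with alpha1 = 0.9): the squared series should behave like an AR(1).
acf(y^2);                                  # decays roughly geometrically, like an AR(1)
ar(y^2, order.max = 1, aic = FALSE)$ar;    # roughly alpha1, though noisy: v_t has very long tails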
Extensions and Generalizations
Extend to ARCH(m):
    y_t = σ_t ε_t
    σ²_t = α_0 + α_1 y²_{t−1} + α_2 y²_{t−2} + · · · + α_m y²_{t−m}.
Now y²_t is AR(m), with the usual restrictions on the α's.
Generalize to GARCH(m, r):
    y_t = σ_t ε_t
    σ²_t = α_0 + Σ_{j=1}^{m} α_j y²_{t−j} + Σ_{j=1}^{r} β_j σ²_{t−j}.
Now y²_t is ARMA[max(m, r), r], with corresponding restrictions
on the α's and β's.
8
Simplest GARCH model: GARCH(1, 1)
The GARCH(1, 1) model is widely used:
    σ²_t = α_0 + α_1 y²_{t−1} + β_1 σ²_{t−1},
with
    α_1 + β_1 < 1
for stationarity.
The unconditional variance is now
    E(y²_t) = var(y_t) = α_0 / (1 − α_1 − β_1).
9
n = 1000; alpha1 = 0.5; beta1 = 0.4; alpha0 = 1 - alpha1 - beta1;
y = epsilon = ts(rnorm(n));
par(mfcol = c(2, 1));
plot(epsilon);
sigmatsq = 1;
for (i in 2:n) {
sigmatsq = alpha0 + alpha1 * y[i - 1]^2 + beta1 * sigmatsq;
y[i] = epsilon[i] * sqrt(sigmatsq);
}
plot(y);
Volatility clusters are more sustained.
10
In SAS, use proc autoreg and the garch option on the model
statement.
In R, explore and describe volatility:
nyse = ts(scan("nyse.dat"));
par(mfcol = c(2, 1));
plot(nyse);
plot(abs(nyse));
lines(lowess(time(nyse), abs(nyse), f = .005), col = "red");
par(mfcol = c(2, 2));
acf(nyse);
acf(abs(nyse));
acf(nyse^2);
11
In R, fit GARCH (default is 1,1):
library(tseries);
nyse.g = garch(nyse);
summary(nyse.g);
plot(nyse.g);
par(mfcol = c(1, 1));
plot(nyse);
matlines(predict(nyse.g), col = "red", lty = 1);
12
GARCH with a unit root: IGARCH
A special case: α_1 + β_1 = 1 gives IGARCH(1, 1), that is, GARCH(1, 1) with
    y_t = σ_t ε_t
    σ²_t = α_0 + (1 − β_1) y²_{t−1} + β_1 σ²_{t−1}.
Solving recursively with α_0 = 0:
    σ²_t = (1 − β_1) Σ_{j=1}^{∞} β_1^{j−1} y²_{t−j},
an exponentially weighted moving average of y²_t.
13
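A minimal sketch of that recursion used as a volatility estimate; the function name and the choice lambda = 0.94 are illustrative, not from the course, and lambda plays the role of beta1:
ewma_vol = function(y, lambda = 0.94, s0 = var(y)) {
  s = numeric(length(y)); s[1] = s0;
  for (t in 2:length(y)) s[t] = (1 - lambda) * y[t - 1]^2 + lambda * s[t - 1];
  sqrt(s);
}
plot(ewma_vol(nyse), type = "l");   # nyse as loaded earlier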
Tail Length
All xARCH models give y_t with fat tails:
    y_t = σ_t ε_t where ε_t ∼ N(0, 1),
so
    f_y(y) = ∫ (1/σ) f_ε(y/σ) f_σ(σ) dσ:
f_y(·) is a mixture of Gaussian densities with the same mean and
different variances (here f_σ denotes the density of σ_t).
In practice, residuals in xARCH models may not be normal,
but are usually closer to normal than the original data.
14
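A small check of the fat-tail claim, reusing the GARCH(1, 1) series y simulated a few slides back; the kurtosis function is an ad hoc helper, not from a package:
kurt = function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2;   # sample kurtosis
kurt(rnorm(100000));   # about 3 for Gaussian data
kurt(y);               # typically well above 3 for the GARCH series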
R Update, Fall 2011
Shumway and Stoffer's code for Example 5.3 does not work
with the R garch function.
The fGarch package provides another method, garchFit, which
allows simultaneous fitting of ARMA and GARCH models.
15
gnp96 = read.table("http://www.stat.pitt.edu/stoffer/tsa2/data/gnp96.dat");
gnpr = ts(diff(log(gnp96[, 2])), frequency = 4, start = c(1947, 1));
library(fGarch);
gnpr.mod = garchFit(gnpr ~ arma(1, 0) + garch(1, 0), data.frame(gnpr = gnpr));
summary(gnpr.mod);
Title:
GARCH Modelling
Call:
garchFit(formula = gnpr ~ arma(1, 0) + garch(1, 0),
data = data.frame(gnpr = gnpr))
Mean and Variance Equation:
data ~ arma(1, 0) + garch(1, 0)
[data = data.frame(gnpr = gnpr)]
Conditional Distribution:
norm
16
Coefficient(s):
         mu         ar1       omega      alpha1
 0.00527795  0.36656255  0.00007331  0.19447134

Std. Errors:
based on Hessian

Error Analysis:
         Estimate  Std. Error  t value  Pr(>|t|)
mu      5.278e-03   8.996e-04    5.867  4.44e-09 ***
ar1     3.666e-01   7.514e-02    4.878  1.07e-06 ***
omega   7.331e-05   9.011e-06    8.135  4.44e-16 ***
alpha1  1.945e-01   9.554e-02    2.035    0.0418 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log Likelihood:
722.2849    normalized:  3.253536
17
Standardised Residuals Tests:
                                  Statistic    p-Value
 Jarque-Bera Test    R    Chi^2   9.118036     0.01047234
 Shapiro-Wilk Test   R    W       0.9842405    0.01433578
 Ljung-Box Test      R    Q(10)   9.874326     0.4515875
 Ljung-Box Test      R    Q(15)   17.55855     0.2865844
 Ljung-Box Test      R    Q(20)   23.41363     0.2689437
 Ljung-Box Test      R^2  Q(10)   19.2821      0.03682245
 Ljung-Box Test      R^2  Q(15)   33.23648     0.004352734
 Ljung-Box Test      R^2  Q(20)   37.74259     0.009518987
 LM Arch Test        R    TR^2    25.41625     0.01296901

Information Criterion Statistics:
      AIC       BIC       SIC      HQIC
-6.471035 -6.409726 -6.471669 -6.446282
18
garchFit also provides many diagnostic plots:
plot(gnpr.mod);
19
Threshold Models
A simple form of nonlinear model, basically a switching AR(p):
    x_t = φ_0^(j) + φ_1^(j) x_{t−1} + · · · + φ_p^(j) x_{t−p} + σ^(j) w_t   if x_{t−1} ∈ R_j,
where x_{t−1} = (x_{t−1}, . . . , x_{t−p})′, R_1, R_2, . . . , R_r is
a partition of R^p, and w_t is white noise with variance 1.
That is, the AR(p) parameters in the equation for x_t change,
depending on the values of the previous p observations
x_{t−1}, . . . , x_{t−p}.
1
Assuming equal variances, we can estimate using regression.
E.g. for monthly pneumonia and influenza deaths:
flu = ts(scan("flu.dat"));
dflu = diff(flu);
a = dflu;
for (l in 1:6)
a = cbind(a, lag(dflu, -l));
a = cbind(a, lag(dflu, -1) > 0.05);
a = data.frame(a);
names(a) = c("x", paste("x", 1:6, sep = ""), "delta");
summary(lm(x ~ delta + x1*delta + x2*delta + x3*delta + x4*delta +
x5*delta + x6*delta, data = a));
flu.l = lm(x ~ -1 + delta + x1*delta + x2*delta + x3*delta + x4*delta,
data = a);
summary(flu.l);
flu.r = residuals(flu.l);
delta = a$delta[4 + 1:length(flu.r)];
lapply(split(flu.r, delta), sd);
acf(flu.r);
2
This is inefficient, and standard errors are invalid, if variances
are unequal; here F = 1.93, df = (17, 110), P = .022.
We can also fit the model using two separate regressions:
flu.lF = lm(x ~ x1 + x2 + x3 + x4, data = a, subset = (delta == 0));
summary(flu.lF);
flu.lT = lm(x ~ x1 + x2 + x3 + x4, data = a, subset = (delta == 1));
summary(flu.lT);
Setting up a residual series:
flu.r01 = flu.r;
flu.r01[!delta] = residuals(flu.lF) / sd(residuals(flu.lF));
flu.r01[delta] = residuals(flu.lT) / sd(residuals(flu.lT));
acf(flu.r01);
3
Regression with Autocorrelated Errors
Regression model
    y_t = β′z_t + x_t,
where the error series x_t has covariance matrix Γ.
Generalized least squares (GLS) estimate for known Γ:
    β̂ = (Z′Γ⁻¹Z)⁻¹ Z′Γ⁻¹y.
For unknown Γ, plug in an estimate.
4
If x_t is a stationary time series, we can fit using OLS, then either:
get estimated autocovariances γ̂(h) from the residuals, and plug in the resulting Γ̂;
or use the Cochrane-Orcutt method.
More generally, we can use mixed model methods (SAS proc mixed); an R sketch follows.
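One R option along these lines is nlme::gls(), sketched here for the globtemp/co2w regression used elsewhere in these notes, with AR(1) errors assumed purely for illustration:
library(nlme);
d = data.frame(y = as.numeric(globtemp), x = as.numeric(co2w));
summary(gls(y ~ x, data = d, correlation = corAR1()));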
Cochrane and Orcutt suggested:
fit using OLS to get an initial estimate of β;
fit an AR(p) to the OLS residuals (w_t is white noise):
    φ(B) x_t = w_t;
transform the regression to
    φ(B) y_t = β′φ(B) z_t + φ(B) x_t = β′φ(B) z_t + w_t,
or
    u_t = β′v_t + w_t.
The residuals are now white, so fit using OLS; a one-pass R sketch follows.
6
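A hedged sketch of one Cochrane-Orcutt pass for the same globtemp/co2w regression, assuming AR(1) errors:
ols = lm(globtemp ~ co2w);                                 # initial OLS fit
phi = ar(residuals(ols), order.max = 1, aic = FALSE)$ar;   # AR(1) fitted to the residuals
n = length(globtemp);
yt = globtemp[-1] - phi * globtemp[-n];                    # phi(B) applied to y
zt = co2w[-1] - phi * co2w[-n];                            # phi(B) applied to z
summary(lm(yt ~ zt));                                      # refit; errors now roughly white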
SAS proc arima offers a better solution:
Mortality and air pollution (Example 5.6): program and
output.
Global temperature and cumulative CO2 emissions: program and output.
In R, temperature (slightly different) and CO2:
arima(globtemp, order = c(1, 0, 0), xreg = co2w);
arima(globtemp, order = c(4, 0, 0), xreg = co2w);
arima(globtemp, order = c(0, 0, 4), xreg = co2w);
Lagged Regression again: Transfer Functions
To forecast an output series y_t given its own past and the
present and past of an input series x_t, we might use
    y_t = Σ_{j=0}^{∞} α_j x_{t−j} + η_t = α(B) x_t + η_t,
where the noise η_t is uncorrelated with the inputs.
This generalizes regression with correlated errors by including lags, and specializes the frequency domain lagged regression by excluding future inputs.
Preliminary estimation of α_0, α_1, . . . often suggests a parsimonious model
    α(B) = B^d ω(B) / δ(B),
where:
d is the pure delay: α_0 = α_1 = · · · = α_{d−1} = 0 and α_d ≠ 0;
ω(B) and δ(B) are low-order polynomials: the denominator δ(B) is needed
if the α's decay exponentially, and the numerator ω(B) is needed if the
first few nonzero α's do not follow that decay.
Preliminary estimates from frequency domain method, or a
similar time domain method.
2
Time Domain Preliminary Estimates
If the input series x_t were white noise, the cross covariance would be
    γ_{y,x}(h) = E(y_{t+h} x_t)
              = E[ ( Σ_{j=0}^{∞} α_j x_{t+h−j} + η_{t+h} ) x_t ]
              = α_h var(x_t),
so γ̂_{y,x}(h) divided by the sample variance of x_t provides an estimate of α_h.
Usually, x_t is not white noise, but if it is a stationary time
series, we know how to make it white: fit an ARMA model.
3
Prewhitening
Suppose that x_t is ARMA:
    φ(B) x_t = θ(B) w_t,
where w_t is white noise.
Apply the prewhitening filter φ(B)θ(B)⁻¹ to the lagged regression equation:
    ỹ_t = Σ_{j=0}^{∞} α_j w_{t−j} + η̃_t,
where ỹ_t = [φ(B)θ(B)⁻¹] y_t and η̃_t = [φ(B)θ(B)⁻¹] η_t.
4
Now the cross correlation of ỹ_t with w_t provides an estimate of α_h (up to a constant of proportionality).
You can use SAS's proc arima to do this:
first identify and estimate a model for xt;
then identify yt with xt as a crosscorr variable.
At the second step, SAS uses the prewhitening filter from
the first step to filter both xt and yt before calculating cross
correlations.
Note: SAS announces that both series have been prewhitened,
but the filter is designed to prewhiten only xt; yt is filtered,
but typically not prewhitened.
5
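The same prewhitening idea can be sketched directly in R for the SOI and recruitment series (assumed loaded as before); an AR(1) model for the input is used here only for illustration, so the prewhitening filter is just phi(B):
fit = arima(soi, order = c(1, 0, 0));                       # simple AR model for the input
phi = coef(fit)["ar1"];
w = residuals(fit);                                         # approximately white input
ytilde = filter(rec - mean(rec), c(1, -phi), sides = 1);    # same filter applied to the output
ccf(ytilde[-1], w[-1]);                                     # cross correlations trace out the alpha_h pattern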
Finally estimate the model for yt, specifying the input series,
in the form:
input = (d$(L1,1, L1,2, . . . ) . . . (Lk,1, . . . )
/(Lk+1,1, . . . ) . . . (. . . )variable)
E.g. for Southern Oscillation and the fisheries recruitment
series: program and output.
E.g. for global temperature and an estimated historical forcing series: program and output.
Interpreting a Transfer Function
For the global temperature case, we have
    y_t = 0.087917 (x_t + 0.79513 x_{t−1} + 0.79513² x_{t−2} + · · · ) + η_t.
So the effect of an impulse in the forcing x_t, say a dip due
to a volcanic eruption, is felt in the current year and several
subsequent years, with a mean delay of 1/(1 − 0.79513) ≈ 4.9
years.
Also, the effect of a sustained change of +4.4 W/m² would
be
    0.087917 × 4.4 × (1 + 0.79513 + 0.79513² + · · · )
      = 0.087917 × 4.4 / (1 − 0.79513)
      ≈ 1.9°C.
This is the expected forcing for a doubling of CO2 over preindustrial levels, and the temperature response is called the
climate sensitivity. The IPCC states:
Analysis of models together with constraints from
observations suggest that the equilibrium climate sensitivity is likely to be in the range 2°C to 4.5°C, with a
best estimate value of about 3°C. It is very unlikely to
be less than 1.5°C.
8
Our estimate is at the low end of that range, but quantifying
its uncertainty is difficult using proc arima.
The profile likelihood for climate sensitivity, constructed using a grid search in R (with p = 4), gives an estimated value
of 1.85°C and 95% confidence limits of 1.44°C to 2.27°C.
[Figure: -2 log-likelihood contours for climate sensitivity (y-axis, 1.5 to 2.5) and decay factor (x-axis, 0.4 to 0.9).]
10
[Figure: -2 log-likelihood (ll2) profile for climate sensitivity (x-axis: 4.4 * theta, 1.5 to 2.5).]
11
[Figure: -2 log-likelihood (ll2) profile for the decay factor (x-axis: lambda, 0.4 to 0.9).]
12
ARMAX Models
Vector (multivariate) regression:
output vector
    y_t = (y_{t,1}, y_{t,2}, . . . , y_{t,k})′;
input vector
    z_t = (z_{t,1}, z_{t,2}, . . . , z_{t,r})′.
1
Regression equation:
    y_{t,i} = β_{i,1} z_{t,1} + β_{i,2} z_{t,2} + · · · + β_{i,r} z_{t,r} + w_{t,i},
or in vector form
    y_t = B z_t + w_t.
Here {w_t} is multivariate white noise:
    E(w_t) = 0,   cov(w_{t+h}, w_t) = Σ_w if h = 0, and 0 if h ≠ 0.
Given observations for t = 1, 2, . . . , n, the least squares estimator of B, also the maximum likelihood estimator when
{w_t} is Gaussian white noise, is
    B̂ = Y′Z (Z′Z)⁻¹,
where Y is the n × k matrix with rows y_1′, y_2′, . . . , y_n′ and Z is the n × r matrix with rows z_1′, z_2′, . . . , z_n′.
ML estimate of Σ_w (replace n with n − r for unbiased):
    Σ̂_w = (1/n) Σ_{t=1}^{n} (y_t − B̂ z_t)(y_t − B̂ z_t)′.
3
Information criteria:
Akaike:
    AIC = ln|Σ̂_w| + (2/n) [kr + k(k + 1)/2];
Schwarz:
    SIC = ln|Σ̂_w| + (ln n / n) [kr + k(k + 1)/2];
Bias-corrected AIC (incorrect in Shumway & Stoffer):
    AICc = ln|Σ̂_w| + 2 [kr + k(k + 1)/2] / (n − kr − 1).
Vector Autoregression
E.g., VAR(1):
    x_t = α + Φ x_{t−1} + w_t.
Here Φ is a k × k coefficient matrix, and {w_t} is Gaussian
multivariate white noise.
This resembles the vector regression equation, with:
    y_t = x_t,   B = [α, Φ],   z_t = (1, x_{t−1}′)′.
5
Observe x_0, x_1, . . . , x_n, and condition on x_0.
Maximum conditional likelihood estimators of B and Σ_w are the same as for ordinary vector regression.
VAR(p) is similar, but we must condition on the first p observations.
Full likelihood = conditional likelihood × likelihood derived
from the marginal distribution of the first p observations, and is difficult to use.
Example: 1-year, 5-year, and 10-year weekly interest rates
Data from http://research.stlouisfed.org/fred2/series/WGS1YR/,
etc.
a = read.csv("WGS1YR.csv");
WGS1YR = ts(a[,2]);
a = read.csv("WGS5YR.csv");
WGS5YR = ts(a[,2]);
a = read.csv("WGS10YR.csv");
WGS10YR = ts(a[,2]);
a = cbind(WGS1YR, WGS5YR, WGS10YR);
plot(a);
plot(diff(a));
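Before turning to dse, note that the conditional-likelihood VAR(1) fit is just a multivariate regression, so lm() with a matrix response gives the same kind of estimate (a sketch; the dse fit below should broadly agree):
da = diff(a);
Y = da[-1, ];                    # y_t = x_t, t = 2, ..., n
Z = cbind(1, da[-nrow(da), ]);   # z_t = (1, x_{t-1}')'
fit = lm(Y ~ Z - 1);             # multivariate least squares
t(coef(fit));                    # each row: intercept and lag-1 coefficients for one output series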
Use the dse package to fit VAR(1) and VAR(2) models to
differences:
library(dse);
b = TSdata(output = diff(a));
b1 = estVARXls(b, max.lag = 1);
cat("VAR(1)\n print method:\n");
print(b1);
cat("\n summary method:\n");
print(summary(b1));
b2 = estVARXls(b, max.lag = 2);
cat("\nVAR(2)\n print method:\n");
print(b2);
cat("\n summary method:\n");
print(summary(b2));
VAR(1)
print method:
neg. log likelihood = -7188.785
A(L) =
    1-1.014698L1       0-0.02482398L1     0-0.0144053L1
    0+0.05794167L1     1-0.9224325L1      0+0.03872528L1
    0-0.04292339L1     0-0.05304638L1     1-1.024605L1
B(L) =
    1  0  0
    0  1  0
    0  0  1
summary method:
neg. log likelihood = -7188.785
sample length = 2448
          WGS1YR   y.WGS5YR    WGS10YR
RMSE   0.2005654  0.1713752  0.1563661
ARMA: model estimated by estVARXls
inputs :
outputs: WGS1YR y.WGS5YR WGS10YR
9
input dimension = 0
output dimension = 3
order A = 1
order B = 0
order C =
9 actual parameters
6 non-zero constants
trend not estimated.
VAR(2)
print method:
neg. log likelihood = -7414.944
A(L) =
    1-1.329215L1+0.3221239L2        0+0.1030711L1-0.05850615L2      0-0.1539836L1+0.1172694L
    0-0.07336772L1+0.05027099L2     1-1.117284L1+0.1974304L2        0-0.1148573L1+0.0577710
    0+0.0002002881L1-0.01317073L2   0-0.02287398L1+0.06233586L2     1-1.252808L1+0.226
B(L) =
    1  0  0
    0  1  0
    0  0  1
summary method:
neg. log likelihood = -7414.944
sample length = 2448
          WGS1YR   y.WGS5YR    WGS10YR
RMSE   0.1910442  0.1666275  0.1534016
ARMA: model estimated by estVARXls
inputs :
outputs: WGS1YR y.WGS5YR WGS10YR
input dimension = 0
output dimension = 3
order A = 2
order B = 0
order C =
18 actual parameters
6 non-zero constants
trend not estimated.
AIC is smaller (more negative) for VAR(2), but SIC is smaller
for VAR(1).
For VAR(1),
    Φ̂1 =
        0.3288773    0.1534516    0.136938
        0.08581201   0.004959931  0.08875425
        0.06575108   0.04152504   0.2406055
Largest off-diagonal elements are (1,3) and (2,3), suggesting
that changes in the 10-year rate are followed, one week later,
by changes in the same direction in the 1-year and 5-year
rates.
10