
STA3303
Statistics for Climate Research
Faculty of Sciences

Study Book

Written by

Dr Peter Dunn
Department of Mathematics & Computing
Faculty of Sciences
The University of Southern Queensland

Published by

University of Southern Queensland


Toowoomba Queensland 4350
Australia

http://www.usq.edu.au

© The University of Southern Queensland, 2007.

Copyrighted materials reproduced herein are used under the provisions of the Copyright Act 1968 as amended, or as a result of application to the copyright owner.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without prior permission.

Produced using LaTeX in the USQ style by the Department of Mathematics and Computing.

© USQ, February 21, 2007


Table of Contents

I Time Series Analysis 1

1 Introduction 3

2 Autoregressive (AR) models 23

3 Moving Average (MA) models 41

4 arma Models 59

5 Finding a Model 73

6 Diagnostic Tests 105

7 Non-Stationary Models 129

8 Markov chains 173

9 Other Models 205

II Multivariate Statistics 213

10 Introduction 215


11 Principal Components Analysis 225

12 Factor Analysis 255

13 Cluster Analysis 279

A Installing other packages in R 291

B Review of statistical rules 293

C Some time series tricks in R 299

D Time series functions in R 301

E Multivariate analysis functions in R 305



Strand I

Time Series Analysis



Module 1

Introduction
Module contents
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Time-series . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Signal and noise . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Simple methods . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 The r package . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Getting help in r . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.1 Answers to selected Exercises . . . . . . . . . . . . . . . 19

Module objectives

Upon completion of this module students should be able to:

• recognize and define a time series;

• understand what defines a stationary time series;

• know the particular kinds of time series being discussed in this course;

• recognise the reasons for finding statistical models for time series;

• understand the notation used to designate a time series;

• understand that a time series consists of a signal plus noise;

• understand that the signal of a time series can be modelled and that the noise is random;

• list some simple time series modelling methods;

• know how to use the software package r to do basic manipulations with time series data, including loading data, plotting the data, and defining the data as time series data.

1.1 Introduction

This Module introduces time series and associated terminology. Some simple methods are discussed for analysing time series, and the software used in the course is also introduced.

1.2 Time-series

1.2.1 Definitions

A time series is a sequence of observations ordered by time. Examples include the noon temperature measured daily at the Oakey airport, the annual sales of passenger cars in Australia, monthly average values of the southern oscillation index (SOI), the number of people receiving unemployment benefits in Queensland each month, and the number of bits of information sent through a computer line per second. In each case, the observations are taken at regular time intervals. This is not necessary, but greatly simplifies the mathematics; we will only be concerned with time series where observations are taken at regular intervals (that is, equally spaced: each month, each day or each year, for example). In this course, the emphasis is on climatological applications; however time series are used in many branches of science and engineering, and are particularly common in business (sales forecasts, share markets and so on).


A time series is interesting because the series is a function of past values of itself, and so the series is somewhat predictable. The task of the scientist is to find out more about that relationship between observations. Unlike most statistics, the observations in a time series are not independent (that is, they are dependent). Time series are usually plotted using a time-plot, as in the next example.

Example 1.1: The monthly Southern Oscillation Index (the SOI) is available for approximately the last 130 years. A plot of the monthly average SOI (Fig. 1.1) has time on the horizontal axis, and the SOI on the vertical axis. Generally, the observations are joined with a line to indicate that the points are given in a particular order. (Note the horizontal line at zero was added by me, and is not part of the default plot.)

Example 1.2: The seasonal SOI can also be examined. This series certainly does not consist of independent observations. The seasonal SOI can be plotted against the SOI for the previous season, the season before that, and so on (Fig. 1.2).

There is a reasonably strong relationship between the seasonal SOI and the previous season. The relationship between the SOI and the season before that is still obvious; it is less obvious (but still present) with three seasons previous. There is basically no relationship between the seasonal SOI and the SOI four seasons previous.
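A minimal sketch of how plots like these can be made in r (this code is not from the study notes): the lag.plot function plots a series against lagged copies of itself. It assumes the seasonal SOI is in the file soiseason.dat (the file used again in Example 2.11), with a column named SOI:

> soi <- read.table("soiseason.dat", header = TRUE)
> SOI.season <- ts(soi$SOI, frequency = 4)         # four seasons per year
> lag.plot(SOI.season, lags = 4, do.lines = FALSE) # one panel per lag, points only

Each panel plots the series against itself lagged by one, two, three and four seasons.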

A stationary time series is a time series whose statistics do not change over time. Such statistics are typically the mean and the variance (and the covariance, discussed in Sect. 2.5.3). Initially, only stationary time series are considered in this course. In Module 7, methods are discussed for modelling non-stationary time series and for identifying non-stationary time series. At present, identify a non-stationary time series simply using a time series plot of the data, as shown in the next Example.

Example 1.3:
Consider the annual rainfall near Wendover, Utah, USA. (These data are considered in more detail in Example 7.1.) A plot of the data (Fig. 1.3, bottom panel) suggests a non-stationary mean (the mean goes up and down a little). To check this, a smoothing filter was applied

Figure 1.1: A time-plot of the monthly average SOI. Top: the SOI from 1876 to 2001; Bottom: the SOI since 1980, showing more detail. (In this example, the SOI has been plotted using las=1; this just makes the labels on the vertical axis easier to read in my opinion, but is not necessary.)

Figure 1.2: The seasonal SOI plotted against previous values of the SOI. (Four panels: SOI vs SOI one, two, three and four seasons previous; each plots the SOI at time t against the SOI at time t − 1, t − 2, t − 3 and t − 4 respectively.)

that computed the mean of each set of six observations at a time. This smooth gave the thick, dark line in the bottom panel of Fig. 1.3, and suggests that the mean is perhaps non-stationary, as this line is not (approximately) constant. However, it is not too bad. The middle panel of Fig. 1.3 shows a series that is definitely non-stationary. This series, the average monthly sea-level at Darwin, is not stationary as the mean obviously fluctuates. However, the SOI from 1876 to 2001, plotted in the top panel of Fig. 1.3 (and seen in Example 1.1), is approximately stationary.
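A smoothing filter of this kind is easy to compute with r's filter function. The sketch below is illustrative only; it assumes the rainfall data have already been declared as a time series object called rain:

> rain.smooth <- filter(rain, rep(1/6, 6))   # moving average of six observations
> plot(rain, las = 1)                        # the data, as a thin line
> lines(rain.smooth, lwd = 3)                # the thick, dark smoothed line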

All the time series considered in this part of the course will be equally spaced (or regular). These are time series recorded at regular intervals: every day, year, month, second, etc. Until Module 8, the time series considered all consist of continuous data. In addition, only stationary time series will be considered initially (until Module 7).

1.2.2 Purpose

There are two main purposes of gathering time series:


Figure 1.3: Stationary and non-stationary time series. Bottom: the annual rainfall near Wendover, Utah, USA (in mm); the data are plotted with a thin line, and the smoothed data with a thick line, indicating that the mean is perhaps non-stationary. Middle: the monthly average sea level (in metres) in Darwin, Australia; the data are definitely not stationary, as the mean fluctuates. Top: the average monthly SOI from 1876 to 2001; this series looks approximately stationary.

1. First, it helps us understand the process underlying the observations.

2. Secondly, data is gathered to predict, or forecast, what may happen next. It is of great interest in climatology, for example, to predict the value of seasonal climatic indicators. In business, it is important to be able to predict future sales of products.

Forecasting is the process of estimating future values of numerical parameters on the basis of the past. To do this, a model is created. This model is an artificial equation that captures the important features of the data.

Example 1.4: Consider the average monthly sea level (in metres) in Darwin, Australia (Fig. 1.3, middle panel).

Any useful model for this time series would need to capture the important features of this time series. What are the important features? One obvious feature is that the series has a cyclic pattern: the average sea level rises and falls on a regular basis. Is there also an indication that the average sea level has been rising since about 1994? Any good model should capture these important features of the data. As noted in the previous Example, the series is not stationary.

Methods for modelling and forecasting time series are well established and rigorous, and are sometimes quite accurate, but keep in mind the following:

• Any forecast is only as good as the information it is based on. It is not possible for a good method of forecasting to make up for a lack of information, or inaccurate information, about the process being forecast.

• Some processes may be impossible to forecast with any useful accuracy (for example, future outcomes of a coin-tossing experiment).

• Some processes are usefully forecast by means of complex, expensive methods; daily regional weather forecasting, for example.

1.2.3 Notation

Consider a sequence of numbers {Xn} = {X1, X2, . . . , XN}, ordered by time, so that Xa comes before Xb if a is less than b; that is, {Xn} is a time series. This notation indicates that the time series measures the variable X (which may be monthly rainfall, water temperatures or snowfall depths,

for example). The subscript indicates particular observations in the series. Hence, X1 is the first observation, the first recorded in the data. (Note that Y, W or some other letter may be used in place of X.)

The notation Xt (or Xn, or similar) is used to indicate the value of the time series X at a particular point in time t. For different values of t, values of the time series at different points in time are indicated. That is, Xt+1 refers to the next term in the series following Xt.

The entire series is usually written {Xn}n≥1, indicating the variable X is a time sequence of numbers. Sometimes, the upper and lower limits are specified explicitly, writing for example {Xn} for n = 1, . . . , 1000. Quite often, the notation is abbreviated so that

{Xn} ≡ {Xn}n≥1 .

1.3 Signal and noise

The observed and recorded time series, say {Xn}, consists of two components:

1. The signal. This is the component of the data that contains information, say {Sn}. This is the component of the time series that can be forecast.

2. The noise. This is the randomness that is observed, which may be due to numerous other variables affecting the signal, measurement imperfections, etc. Because the noise is random, it cannot be forecast.

The task of the scientist is to extract the signal (or information) from the
time series in the presence of noise. There is no way of knowing exactly what
the signal is; instead, statistical methods are used to separate the random
noise from the forecastable signal. There are many methods for doing this; in
this course, one of those methods will be studied in detail: the Box–Jenkins
method. Some other simple models are discussed in Sect. 1.4; more complex
methods are discussed in Module 9.

Example 1.5: Consider the monthly Pacific Decadal Oscillation, or PDO (obtained from monthly Sea-Surface Temperature (SST) anomalies in the North Pacific Ocean). The data from January 1980 to December 2000 (Fig. 1.4, top panel) is non-stationary. The data consist of a signal and noise. One way to extract the signal is to use a smoother.


Figure 1.4: The monthly Pacific Decadal Oscillation (PDO) from Jan 1980 to Dec 2000 in the top plot. Middle: a lowess smooth is shown superimposed over the PDO. Bottom: the noise is shown (observations minus signal).

A lowess smoother can be applied to the data. (Many statisticians would probably not identify a smoother as a statistical model; in fact, I am one of them. But the use of a smoother here demonstrates a point.) The details are not important: it is simply one type of smoother. For one set of parameters, the smooth is shown in Fig. 1.4 (middle panel). The smoother captures the important features of the time series, and ignores the random noise. The noise is shown in Fig. 1.4 (bottom panel), and if the smooth is good, should be random. (In this example, the noise does not appear random, and so the model is probably not very good.)

One difficulty with using smoothers is that they have limited use for forecasting into the future, as the fitted smoother applies only to the given data. Consequently, other methods are considered here.
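A sketch of how such a signal-plus-noise split might be done in r, assuming the PDO series has been declared as a time series object called PDO (the name and the smoothing parameter f are illustrative only):

> sm <- lowess(time(PDO), PDO, f = 0.1)   # the lowess smooth is the 'signal'
> signal <- ts(sm$y, start = start(PDO), frequency = frequency(PDO))
> noise <- PDO - signal                   # observations minus signal
> plot(noise, las = 1)                    # should look random if the smooth is good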

1.4 Simple methods

Many methods exist for modelling time series. These notes concentrate only
on the Box–Jenkins method, though some other methods will be discussed
very briefly at the end of the course.

It is very important, however, to use the appropriate forecasting technique for each particular application. The Box–Jenkins technique is of general applicability, and has been used in many applications. In addition, studying the Box–Jenkins method will enable the student to learn other techniques as appropriate: the language, basic techniques and skills are applicable to other methods also.

In this section, a variety of simple methods for forecasting are first discussed; a short r sketch after the list illustrates a few of them. Importantly, in some situations these simple methods are also the best available. If this is the case, it may not be obvious; it might require some careful statistical analysis to show that a simple model is the best model.

Constant estimation The simplest possible approach is to use a constant forecast for all future values of the time series. This is appropriate when successive values of the time series are completely uncorrelated but do come from the same distribution.

Slope estimation If the time series appears to have a linear trend, it may be appropriate to estimate this trend by fitting a straight line by linear regression. Future values can then be forecast by extrapolating this line.

Random walk model In some cases, the best estimate of a future value is the most recent observation. This model is called a random walk model. For example, the best forecast of the future price of a share is usually quite close to the present price.

Smoothing Smoothing is the name given to a collection of techniques which estimate future values of a time series by an average of past values. This approach makes sense when there are random variations superimposed on a relatively stable trend in the process under study. An example was seen in Example 1.5.

Regression Another method of forecasting is to relate the parameter under study to some known parameter, or parameters, by means of a functional relationship which is statistically estimated using regression.
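As a rough r sketch of the first three methods (none of this code is from the study book; x is an assumed, already-declared time series object):

> mean(x)                        # constant estimation: forecast all future values by the mean
> t <- as.numeric(time(x))
> trend <- lm(x ~ t)             # slope estimation: fit a straight line by regression
> predict(trend, newdata = data.frame(t = max(t) + 1:3))  # extrapolate three steps ahead
> x[length(x)]                   # random walk model: use the most recent observation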

1.5 Software

Most standard statistical software packages, such as SPSS, SAS, r and S-Plus, can analyse time series data. In addition, many mathematical packages (such as Matlab) can be used, but sometimes require add-ons which usually cost money.

1.5.1 The R package

This course uses the free software package r. r is a free, open source software project which is "not unlike" S-Plus, an expensive commercial software package. r is available for many operating systems from http://cran.r-project.org/, or http://mirror.aarnet.edu.au/pub/CRAN/ for residents of Australia and New Zealand. More information about r, including documentation, is found at http://www.r-project.org/. r is command line driven like Matlab, but has a statistical rather than mathematical focus.

r is object orientated. This means to get the most benefit from r, objects should be correctly defined. For example, time series data should be declared as time series data. When r knows that a particular data set is a time series, it has default mechanisms for working with the data. For example, plotting data in r generally produces a dot-plot; if the data is declared as time series data, the data are joined by lines, which is the standard way of plotting time series data. The following example explains some of these details.


In r, you can set the working directory using (for example) setwd("c:/My Documents/USQ/STA3303/data"). Check the current working directory using getwd(). It is usually sensible, as soon as you start r, to set the working directory to the location of your data files. This will be assumed throughout these study notes.

Example 1.6: In Example 1.1, the monthly average SOI was plotted. Assuming the current folder (or directory) is set as described above, the following code reproduces this plot.

> soidata <- read.table("soiphases.dat", header = TRUE)

The data is loaded using read.table. The option header=TRUE means that the first row of the data contained header information (that is, names for the variables).
An alternative method for loading the data directly from the internet
is:

> soidata <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/appl


+ header = TRUE)

Now, take a quick look at the variables:

> summary(soidata)

year month soi


Min. :1876 Min. : 1.000 Min. :-38.8000
1st Qu.:1907 1st Qu.: 3.000 1st Qu.: -6.6000
Median :1939 Median : 6.000 Median : 0.3000
Mean :1939 Mean : 6.493 Mean : -0.1514
3rd Qu.:1970 3rd Qu.: 9.000 3rd Qu.: 6.7500
Max. :2002 Max. :12.000 Max. : 33.1000
soiphase
Min. :0.000
1st Qu.:2.000
Median :3.000
Mean :3.148
3rd Qu.:5.000
Max. :5.000

> soidata[1:5, ]


year month soi soiphase


1 1876 1 10.8 2
2 1876 2 10.6 2
3 1876 3 -0.7 3
4 1876 4 7.9 4
5 1876 5 6.9 2

> names(soidata)

[1] "year" "month" "soi" "soiphase"

This shows the dataset (or object) soidata consists of four different variables. The one of interest now is soi, and this variable is referred to (and accessed) as soidata$soi. To use this variable, first declare it as a time series object:

> SOI <- ts(soidata$soi, start = c(1876, 1),


+ end = c(2002, 2), frequency = 12)

The first argument is the name of the variable. The input start indicates the time when the data starts. For the SOI data, the data starts at January 1876, which is input to r as c(1876, 1) (the one means January, the first month). The command c means 'concatenate', or join together. The data set ends at February 2002; if an end is not defined, r should be able to deduce it anyway from the rest of the given information. But make sure you check your time series to ensure r has interpreted the input correctly. The argument frequency indicates that the data have a cycle of twelve (that is, each twelve points make one larger grouping; here twelve months make one year).
Now plot the data:

> plot(SOI, las = 1)


> abline(h = 0)

The plot (Fig. 1.5, top panel) is formatted correctly for time series data. (The command abline(h=0) adds a horizontal line at y = 0.) In contrast, if the data is not declared as time series data, the default plot appears as the bottom panel in Fig. 1.5.

When the data are declared as a time series, the observations are plotted and joined by lines, and the horizontal axis is labelled Time by default (the axis label is easily changed using the command title(ylab="New y-axis label")). Other methods also have a standard default if the data have been declared as a time series object.


Figure 1.5: A plot of the monthly average SOI from 1876 to 2001. Top: the data has been declared as a time series, so the observations are joined by lines against Time. Bottom: the data has not been declared as a time series, so the default dot-plot against the observation index is produced.

In the above example, the data was available in a file. If it is not available, a data file can be created, or the data can be entered into r. The following commands show the general approach to entering data in r. The command c is very useful: it is used to create a list of numbers, and stands for 'concatenate'.

> data.values <- c(12, 14, 1, 8, 9, 10, 7)


> data <- ts(data.values, start = 1980)

The first line puts the observations into a list called data.values. The second line designates the data as a time series starting in 1980 (and so r assumes the values are annual measurements).

You can also use scan(); see ?scan. Data stops being read when a blank line is entered if you use scan.
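For example, an interactive session using scan might look something like this (the values typed are made up):

> data.values <- scan()
1: 12 14 1 8
5: 9 10 7
8:
Read 7 items
> data <- ts(data.values, start = 1980)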

Other commands will be introduced as appropriate throughout the course. A full list of the time series functions available in r is given in Appendix D.

1.5.2 Getting help in R

Two commands of particular interest are the commands help and help.search.
The help command gives help on a particular topic. For example, try typing
help("names") or help("plot") at the r command prompt. (The quotes
are necessary.) A short-cut is also available: typing ?names is equivalent to
typing help("names"). Using the short-cut is generally more convenient.

The command help.search searches the help database for particular words.
For example, try typing help.search("eigen") to find how to evaluate
eigenvalues in r. (The quotes are necessary.) This function requires a rea-
sonably specific search phrase. The command help.start starts the r help
in a web browser (if everything is configured correctly).

Further help and information is available at http://stat.ethz.ch/R/manual/doc/html/, including a Web-based manual, An Introduction to r. After starting r, look under the Help menu for available documentation.


1.6 Exercises
Ex. 1.7: Start r and load in the data file qbo.dat. This data file is a time
series of the monthly quasi-biennial oscillation (QBO) from January
1948 to December 2001.

(a) Examine the variables in the data set using names.


(b) Declare the QBO as a time series, setting the start, end and
frequency parameters correctly.
(c) Plot the data.
(d) Is the data stationary? Explain.
(e) Determine the mean and variance of the series.
(f) List important features in the data (if any) that should be mod-
elled.

Ex. 1.8: Start r and load in the data file easterslp.dat. This data file is
a time series of sea-level air pressure anomalies at Easter Island from
Jan 1951 to Dec 1995.

(a) Examine the variables in the data set using names.


(b) Declare the air pressures as a time series, setting the start, end
and frequency parameters correctly.
(c) Plot the data.
(d) List important features in the data (if any) that should be mod-
elled.

Ex. 1.9: Obtain the maximum temperature from your town or residence for
as far back as possible up to, say, thirty days. This may be obtained
from a newspaper or website.

(a) Load the data into r.


(b) Declare the series a time series, and plot the data.
(c) List important features in the data (if any) that should be mod-
elled.
(d) Compute the mean and variance of the series.

Ex. 1.10: The data in Table 1.1 shows the mean annual levels at Lake
Victoria Nyanza from 1902 to 1921, relative to a fixed reference point
(units are not given). The data are from Shaw [41], as quoted in
Hand [19].

(a) Enter the data into r as a time series.


Year Level Year Level


1902 −10 1912 −11
1903 13 1913 −3
1904 18 1914 −2
1905 15 1915 4
1906 29 1916 15
1907 21 1917 35
1908 10 1918 27
1909 8 1919 8
1910 1 1920 3
1911 −7 1921 −5

Table 1.1: The mean annual level of Lake Victoria Nyanza from 1902 to
1921 relative to some fixed level (units are unknown).

(b) Plot the data. Make sure you give appropriate labels.
(c) List important features in the data (if any) that should be mod-
elled.

Ex. 1.11: Many people believe that sunspots affect the climate on the
earth. The mean number of sunspots from 1770 to 1869 for each year
are given in the data file sunspots.dat and are shown in Table 1.2.
(The data are from Izenman [23] and Box & Jenkins [9, p 530], as
quoted in Hand [19]).

(a) Enter the data into r as a time series by loading the data file
sunspots.dat.
(b) Plot the data. Make sure you give appropriate labels.
(c) List important features in the data (if any) that should be mod-
elled.

1.6.1 Answers to selected Exercises

1.7 (a) Here is one solution:


> qbo <- read.table("qbo.dat", header = TRUE)
> names(qbo)
[1] "Year" "Month" "QBO"
(b) One option is:
> qbo <- ts(qbo$QBO, start = c(qbo$Year[1],
+ 1), frequency = 12)


Year Sunspots Year Sunspots Year Sunspots


1770 101 1804 48 1838 103
1771 82 1805 42 1839 86
1772 66 1806 28 1840 63
1773 35 1807 10 1841 37
1774 31 1808 8 1842 24
1775 7 1809 2 1843 11
1776 20 1810 0 1844 15
1777 92 1811 1 1845 40
1778 154 1812 5 1846 62
1779 125 1813 12 1847 98
1780 85 1814 14 1848 124
1781 68 1815 35 1849 96
1782 38 1816 46 1850 66
1783 23 1817 41 1851 64
1784 10 1818 30 1852 54
1785 24 1819 24 1853 39
1786 83 1820 16 1854 21
1787 132 1821 7 1855 7
1788 131 1822 4 1856 4
1789 118 1823 2 1857 23
1790 90 1824 8 1858 55
1791 67 1825 17 1859 94
1792 60 1826 36 1860 96
1793 47 1827 50 1861 77
1794 41 1828 62 1862 59
1795 21 1829 67 1863 44
1796 16 1830 71 1864 47
1797 6 1831 48 1865 30
1798 4 1832 28 1866 16
1799 7 1833 8 1867 7
1800 14 1834 13 1868 37
1801 34 1835 57 1869 74
1802 45 1836 122
1803 43 1837 138

Table 1.2: The annual sunspot numbers from 1770 to 1869.


Figure 1.6: The QBO from January 1948 to December 2001.

Here the square brackets [ . . . ] have been used; they are used by r to indicate elements of an array or matrix. (Matlab, for example, uses round brackets: ( . . . ).) Note that start must have numeric inputs, so qbo$Month[1] will not work, as it returns Jan, which is a text string.

It is worth printing out qbo to ensure that r has interpreted your statements correctly. Type qbo at the prompt, and in particular check that the series ends in December 2001.
(c) The following code plots the graph:
> plot(qbo, las = 1, xlab = "Time", ylab = "Quasi-bienniel oscillation",
+ main = "QBO from 1948 to 2001")
The final plot is shown in Fig. 1.6.

1.10 Here is one way of doing the problem. (Note: The data can be entered
using scan or by typing the data into a data file and loading the usual
way. Here, we assume the data is available as the object llevel.)

> llevel <- ts(llevel, start = c(1902))


> plot(llevel, las = 1, xlab = "Time", ylab = "Level of Lake Victoria Nyanza",
+ main = "The (relative) Level of Lake Nyanza from 1902 to 1921")


Figure 1.7: The mean annual level of Lake Victoria Nyanza from 1902 to 1921. The figures are relative to some fixed level and units are unknown.

The final plot is shown in Fig. 1.7. There is too little data to be sure of any patterns or features to be modelled, but the series suggests there may be some regular up-and-down pattern.

Module 2

Autoregressive (AR) models
Module contents
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Forecasting ar models . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 The backshift operator . . . . . . . . . . . . . . . . . . . 29
2.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.1 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.2 The variance . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.3 Covariance and correlation . . . . . . . . . . . . . . . . 32
2.5.4 Autocovariance and autocorrelation . . . . . . . . . . . 32
2.6 More on stationarity . . . . . . . . . . . . . . . . . . . . 35
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8.1 Answers to selected Exercises . . . . . . . . . . . . . . . 39


Module objectives

Upon completion of this module students should be able to:

• understand what is meant by an autoregressive (ar) model;

• use the ar(p) notation to define ar models;

• use ar models to develop forecasting formulae;

• understand the operation of the backshift operator;

• write ar models using the backshift operator;

• compute the mean of a time series written in ar form;

• understand that the variance is not easily computed from the ar form of a model;

• understand the concepts of autocorrelation and autocovariance;

• understand the term 'lag' used in the context of autocorrelation;

• compute the autocorrelation function (acf) for an ar model;

• know that the acf will always be one at a lag of zero;

• understand that the acf for a lower-order ar model will decay slowly toward zero.

2.1 Introduction

In this Module, one particular type of time series model, the autoregressive model, is discussed. Subsequent Modules examine other types of models.

2.2 Definition

As stated previously, the observations in a time series are somehow related to past values of the series, and the task of the scientist is to find out more about that relationship.

Recall that a time series consists of two components: the signal, or information; and the noise, or random error. If values in a time series are related to past values of the series, one possible model for the signal St is for the series Xt to be expressed as a function of previous values of X. This is exactly the idea behind an autoregressive model, denoted an ar model. An ar model is one particular type of model in the Box–Jenkins methodology.


Example 2.1: Consider the model Wn+1 = 3.12 + 0.63Wn + en+1 for n ≥ 0. This model is an ar(1) model, since Wn+1 is a function of only one past value of the series {Wn}. In this model, the information or signal is Sn+1 = 3.12 + 0.63Wn and the noise is en+1.

Example 2.2: An example of an ar(3) model is Tn = 0.9Tn−1 − 0.4Tn−2 + 0.1Tn−3 + en for n ≥ 1. In this model, the information or signal is Sn = 0.9Tn−1 − 0.4Tn−2 + 0.1Tn−3 and the noise is en.

A more formal definition of an autoregressive process follows.

Definition 2.1 An autoregressive model of order p, or an ar(p) model, satisfies the equation

Xn = m0 + en + Σ_{k=1}^{p} φk Xn−k
   = m0 + en + φ1 Xn−1 + φ2 Xn−2 + · · · + φp Xn−p        (2.1)

for n ≥ 0, where {en}n≥0 is a series of independent, identically distributed (iid) random variables, and m0 is some constant.

The letter p denotes the order of the autoregressive model, defining how many previous values the current value is related to. The model is called autoregressive because the series is regressed onto past values of itself.

The error term {en} in Equation (2.1) refers to the noise in the time series. Above, the errors were said to be iid. Commonly, they are also assumed to have a normal distribution with mean zero and variance σe².

For the model in Equation (2.1) to be of use in practice, the scientist must be able to estimate the value of p (that is, how many terms are needed in the ar model), and then estimate the values of φk and m0. Each of these issues will be addressed in later sections.
Notice the subscripts are defined so that the first value of the series to appear on the left of the equation is always one. Now consider the ar(3) model in Example 2.2: when n = 1 (for the first observation in the time series), the equation reads

T1 = 0.9T0 − 0.4T−1 + 0.1T−2 + e1 .

But the series {T} only exists for positive indices. This means that the model does not apply for the first three terms in the series, because the data T0, T−1 and T−2 are unavailable.


Example 2.3: Using r, it is easy to simulate an ar model. For Exam-


ple 2.1, the following r code simulates the series:

> noise <- rnorm(100, 0, 1)


> W <- array(dim = length(noise))
> W[1] <- 0
> for (i in 2:length(noise)) {
+ W[i] <- 3.12 + 0.63 * W[i - 1] + noise[i]
+ }
> plot(W, type = "l", las = 1)

Note type="l" means to use lines, not points (it is an "ell", not a numeral "one").

More directly, a time series can be simulated using arima.sim as follows:

> sim.ar1 <- arima.sim(model = list(ar = c(0.63)),


+ n = 100)

Figure 2.1: One realization of the ar(1) model Wn+1 = 3.12 + 0.63Wn + en+1.

The final plot is shown in Fig. 2.1. The data created in r are called a realization of the model. Every realization will be different, since each will be based on a different set of random {e}. The first few values are not typical, as the model cannot be used for the first observation (when n = 0 in the ar(1) model in Example 2.1, W0 does not exist); it takes a few terms before the effect of this is out of the system.

Example 2.4: Chu & Katz [13] studied the seasonal SOI time series {Xt} from January 1935 to August 1983 (that is, the average SOI for (northern hemisphere) Summer, Spring, etc.), and concluded the data was well modelled using the ar(3) model

Xt = 0.6885Xt−1 + 0.2460Xt−2 − 0.3497Xt−3 + et .

An ar(3) model was alluded to in Example 1.2 (Fig. 1.2 on p 7).

2.3 Forecasting AR models

One purpose of having models for time series data is to make forecasts. In this section, forecasting with ar models will be discussed. First, some notation is established.

2.3.1 Notation

Consider a time series {Xn}. Suppose the values of {Xn} are known from n = 1 to n = 100. Then the forecast of {Xn} at n = 101 is written as X̂101|100. The 'hat' indicates the quantity is a forecast, not an observed value of the series. The subscript implies the value of {Xn} is known up to n = 100, and the forecast is for the value at n = 101. This is called a one-step ahead forecast, since the forecast is one step ahead of the available data.

In general, the notation X̂n+k|n indicates the value of the time series {Xn} is to be forecast for time n + k assuming that the series is known up to time n. This forecast is a k-step ahead forecast. Note a k-step ahead forecast can be written in many ways: X̂n+k|n, X̂n|n−k and X̂n−2|n−k−2 are all k-step ahead forecasts.

Example 2.5: Consider the forecast Ŷt+3|t+1. This is a forecast of the time series {Yt} at time t + 3 if the time series is known to time t + 1. This is a two-step ahead forecast, since the forecast at t + 3 is two steps ahead of the available information, known up to time t + 1.


2.3.2 Forecasting

Forecasting using an ar model is quite simple. Consider the following ar(2) model:

Fn = 23 + 0.4Fn−1 − 0.2Fn−2 + en ,        (2.2)

where en has a normal distribution with a mean of zero and variance of σe² = 5; that is, en ∼ N(0, 5). Suppose a one-step ahead forecast is required if the information about the time series {Fn} is known up to time n; that is, F̂n+1|n is required.

The value of Fn+1, if we knew exactly what it was, is found from Equation (2.2) as

Fn+1 = 23 + 0.4Fn − 0.2Fn−1 + en+1        (2.3)

by adjusting the subscripts. Then conditioning on what we actually 'know' gives

Fn+1|n = 23 + 0.4Fn|n − 0.2Fn−1|n + en+1|n .

Adding 'hats' to all the terms, the forecast will be

F̂n+1|n = 23 + 0.4F̂n|n − 0.2F̂n−1|n + ên+1|n .

Now, since information is known up to time n, the value of F̂n|n is known exactly: it's the value of F at time n, Fn. Likewise, F̂n−1|n = Fn−1. But what about the value of ên+1|n? It is not known at time n, as it is a future random noise component. So what do we do with the ên+1|n term?

If we know nothing about the value of en+1, a sensible approach would be to use the mean value of {en}, which is zero. Hence,

F̂n+1|n = 23 + 0.4Fn − 0.2Fn−1        (2.4)

is the forecast.

The difference between Fn+1 and F̂n+1|n, determined from Equations (2.3) and (2.4), is

Fn+1 − F̂n+1|n = (23 + en+1 + 0.4Fn − 0.2Fn−1) − (23 + 0.4Fn − 0.2Fn−1)
             = en+1 .

Hence, the error in making the forecast is en+1, and so the terms {en} are actually the one-step ahead forecasting errors.

The same approach can be used for k-step ahead forecasts also, as shown in
the next example.


Example 2.6: Consider the ar(2) model in Equation (2.2). To determine the two-step ahead forecast, first find

Fn+2 = 23 + 0.4Fn+1 − 0.2Fn + en+2 .

Hence

F̂n+2|n = 23 + 0.4F̂n+1|n − 0.2F̂n|n + ên+2|n .

Now, information is known up to time n, so F̂n|n = Fn. As before, ên+2|n is not known, so is replaced by the mean value, which is zero. But what about F̂n+1|n? It is unknown, since information is only known up to time n, so information at time n + 1 is unknown. So what is the best estimate of F̂n+1|n?

Note that F̂n+1|n is simply a one-step ahead forecast itself, available from Equation (2.4). So the two-step ahead forecast here is

F̂n+2|n = 23 + 0.4F̂n+1|n − 0.2Fn ,

where Equation (2.4) can be substituted for F̂n+1|n, but it is not necessary.
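A numerical sketch of these formulae in r (not code from the study book; x is assumed to be a numeric vector holding the observed values of {Fn}):

> n <- length(x)
> f1 <- 23 + 0.4 * x[n] - 0.2 * x[n - 1]   # one-step ahead forecast, Equation (2.4)
> f2 <- 23 + 0.4 * f1 - 0.2 * x[n]         # two-step ahead forecast, as in Example 2.6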

2.4 The backshift operator

This section introduces the backshift operator, a tool that enables complicated time series models to be written in a simple form, and also allows the models to be manipulated. A full appreciation of the value of the backshift operator will not become apparent until later, when the models considered become very complicated and cannot be written down in any other (practical) way (see, for example, Example 7.22).

2.4.1 Definition

The backshift operator, B, is defined on a time series as follows:

Definition 2.2 Consider a time series {Xt}. The backshift operator, B, is defined so that BXt = Xt−1.


Note the backshift operator can be used more than once, so that

B²Xt = B(BXt) = BXt−1 = Xt−2 .

In general,

Bʳ Xt = Xt−r .

The backshift operator allows ar models to be written in a different form, which will later prove very useful.

Note the backshift operator only operates on time series (otherwise it makes no sense to "shift backward" in time). This implies that Bk = k if k is a constant.

Example 2.7: Consider the ar(2) model

Yt+1 = 0.23Yt − 0.15Yt−1 + et+1 .

Using the backshift operator notation, this model is written

Yt+1 − 0.23BYt+1 + 0.15B²Yt+1 = et+1
(1 − 0.23B + 0.15B²)Yt+1 = et+1 .

Example 2.8: The ar(3) model

Xt = et − 0.4Xt−1 + 0.6Xt−2 − 0.1Xt−3

is written using the backshift operator as

φ(B)Xt = et

where φ(B) = (1 + 0.4B − 0.6B² + 0.1B³). The notation φ(B) is often used to denote an autoregressive polynomial in B.

2.5 Statistics

In this Section, the important statistics of an ar model are studied.


2.5.1 The mean

In Equation (2.1), the general form of an ar(p) model is given. Taking expected values of each term in this series gives

E[Xn] = E[m0] + E[en] + E[φ1 Xn−1] + E[φ2 Xn−2] + · · · + E[φp Xn−p]
      = m0 + 0 + φ1 E[Xn−1] + φ2 E[Xn−2] + · · · + φp E[Xn−p],

since E[en] = 0 (the average error is zero). Now, assuming the time series {Xk} is stationary, the mean of this series will be approximately constant at any time (that is, for any subscript). Let this constant mean be µ. (It only makes sense to talk about the 'mean of a series' if the series is stationary.) Then,

µ = m0 + φ1 µ + φ2 µ + · · · + φp µ,

and so, on solving for µ,

µ = m0 / (1 − φ1 − φ2 − · · · − φp).

This enables the mean of the series to be computed from the ar model.

Example 2.9: In Equation (2.2), let the mean of the series be µ = E[F ].
Taking expected values of each term,

µ = 23 + 0.4µ − 0.2µ + 0.

The mean of the series is µ = E[F ] = 23/0.8 = 28.75.

Example 2.10: Consider the ar(1) model of Example 2.3:

Wn+1 = 3.12 + 0.63Wn + en+1 ,

for n ≥ 0. Taking expectations, E[W] = 3.12/(1 − 0.63) ≈ 8.43. The plot of the simulated data in Fig. 2.1 confirms this.
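A quick numerical check in r, reusing the simulated series W from Example 2.3 (dropping the first few atypical values; with only 100 observations the agreement will be rough):

> mean(W[-(1:10)])   # should be near the theoretical mean of about 8.43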

2.5.2 The variance

(It may be useful to refer to Appendix B while reading this section.)

Consider the ar(1) model

Yt = 12 + 0.5Yt−1 + et ,

where {en} ∼ N(0, 4). First, write this as

Yt − 0.5Yt−1 = 12 + et ,

and then taking the variance of both sides gives

var[Yt] + (−0.5)² var[Yt−1] + 2(−0.5) Covar[Yt, Yt−1] = var[et],

since the errors {en} are assumed to be independent of the time series {Yn}. Since the series is assumed stationary, the variance is constant at all time steps; hence define σY² = var[Yn]. Then,

1.25σY² − Covar[Yt, Yt−1] = 4,

since var[en] = 4 in this example. This equation cannot be simplified and solved for σY² unless there is some understanding of the covariance which characterizes the time series.

2.5.3 Covariance and correlation

The covariance is a measure of how two variables change together. For two random variables X (with mean µX and variance σX²) and Y (with mean µY and variance σY²), the covariance is defined as

Covar[X, Y] = E[(X − µX)(Y − µY)].

Then, the correlation is

Corr[X, Y] = Covar[X, Y] / √(σX² σY²).

A correlation of +1 indicates perfect positive correlation; a correlation of −1 indicates perfect negative correlation. A correlation of zero indicates no correlation at all between X and Y.

2.5.4 Autocovariance and autocorrelation

In the case of a time series, the autocovariance is defined between two points in the time series {Xn} (with a mean µ), say Xi and Xj, as

κij = E[(Xi − µ)(Xj − µ)].

Since the time series is stationary, the autocovariance is the same if the time series is shifted in time. For example, consider Example 1.2, which includes a plot of the SOI. If we were to split the SOI series into (say) five equal periods of time, and produce a plot like Fig. 1.2 (top panel) (p 7) for each time period, the correlation would be similar for each time period.

This all means the important information about Xi and Xj is the time between the two observations (that is, |i − j|). Arbitrarily, Xi can then be set to X0, and hence the autocovariance can be written as

γk = Covar[X0, Xk]

for integer k. As with correlation, the autocorrelation is then defined as

ρk = γk / γ0

for integer k, where γ0 = Covar[X0, X0] is simply the variance of the time series.

The series {ρk} is known as the autocorrelation function, or acf, at lag k. For any given ar model, it is possible to determine the acf, which will be unique to that ar model. For this reason, the acf is one of the most important pieces of information to know about a time series. Later, the acf is used to determine which ar model is appropriate for our data.

The term lag indicates the time difference in the acf. Thus, "the acf at lag 2" means the term in the acf for k = 2, which is the correlation of any term in the series with the term two time steps before (or after, as the series is assumed stationary).

Note that since the autocorrelation is a series, the backshift operator can be used with the autocorrelation. It can be shown that the autocovariance for an ar(p) model is

γ(B) = σe² / (φ(B)φ(B⁻¹)).        (2.5)

Example 2.11: In Example 1.2 (p 5), the seasonal SOI was plotted against
the seasonal SOI for one, two, three and four seasons ago. In r, the
correlation coefficients were computed as

> soi <- read.table("soiseason.dat", header = TRUE)


> attach(soi)
> len <- length(soi$SOI)
> lags <- 5
> SOI0 <- soi$SOI[lags:len]
> SOI1 <- soi$SOI[(lags - 1):(len - 1)]
> SOI2 <- soi$SOI[(lags - 2):(len - 2)]
> SOI3 <- soi$SOI[(lags - 3):(len - 3)]
> SOI4 <- soi$SOI[(lags - 4):(len - 4)]
> cor(cbind(SOI0, SOI1, SOI2, SOI3, SOI4))


SOI0 SOI1 SOI2 SOI3


SOI0 1.000000000 0.6319201 0.4098892 0.2001955
SOI1 0.631920149 1.0000000 0.6327576 0.4111551
SOI2 0.409889218 0.6327576 1.0000000 0.6336245
SOI3 0.200195528 0.4111551 0.6336245 1.0000000
SOI4 0.007600544 0.2018563 0.4119918 0.6340156
SOI4
SOI0 0.007600544
SOI1 0.201856306
SOI2 0.411991828
SOI3 0.634015609
SOI4 1.000000000

The correlations between the SOI and lagged values of the SOI can be
written as the series of autocorrelations:
{ρ} = {1, 0.632, 0.41, 0.2, 0.0076}.
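In practice, r's acf function computes sample autocorrelations directly; a minimal sketch, reusing the soi data frame loaded above (the estimates differ slightly from the pairwise correlations because acf uses a slightly different divisor):

> acf(soi$SOI, lag.max = 4, plot = FALSE)   # sample acf at lags 0 to 4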

Example 2.12:
The ar(2) model

Ut+1 = 0.3Ut − 0.2Ut−1 + et+1        (2.6)

is written using the backshift operator as

φ(B)Ut+1 = et+1

where φ(B) = 1 − 0.3B + 0.2B². Suppose for the sake of example that σe² = 10. Then, since φ(B⁻¹) = 1 − 0.3B⁻¹ + 0.2B⁻², the autocovariance is

γ(B) = 10 / [(1 − 0.3B + 0.2B²)(1 − 0.3B⁻¹ + 0.2B⁻²)]
     = 10 / (0.2B⁻² − 0.36B⁻¹ + 1.13 − 0.36B + 0.2B²).

By some detailed mathematics (Sect. 3.6.3), this equals

γ(B) = · · · + 11.11 + 2.78B − 1.39B² − 0.97B³ − 0.0139B⁴ + 0.190B⁵ + · · · ,

only quoting the terms for the non-negative lags (recall that the autocovariance is symmetric). The terms in the autocovariance are therefore (quoting terms from the non-negative lags again):

{γ} = {γ0, γ1, γ2, . . . }
    = {11.11, 2.78, −1.39, −0.97, −0.0139, 0.190, 0.0598, · · · }.


The corresponding terms in the autocorrelation are found by dividing by γ0 = var[U] = 11.11, to give

{ρk} = {1, 0.25, −0.125, −0.0875, −0.00125, 0.017125, 0.0053875, · · · }.

The first term, at lag zero, always has an acf value of one (that is, each term is perfectly correlated with itself). It is usual to plot the acf (Fig. 2.2).

Figure 2.2: The acf for the ar(2) model in Equation (2.6).

The plot is typical of an ar(2) model: the terms in the acf decay slowly towards zero. Indeed, any low-order ar model (such as ar(1), ar(2), ar(3), or similar) shows similar behaviour: a slow decay of the terms towards zero.
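The theoretical acf of model (2.6) can also be computed without the detailed mathematics, using r's ARMAacf function; a short sketch:

> rho <- ARMAacf(ar = c(0.3, -0.2), lag.max = 20)   # theoretical acf of model (2.6)
> rho[1:6]                # 1, 0.25, -0.125, -0.0875, -0.00125, 0.017125
> plot(0:20, rho, type = "h", xlab = "lag", ylab = "ACF")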

2.6 More on stationarity

In an ar(1) model, it can be shown that the model is stationary only if |φ1| < 1; otherwise the model is non-stationary (Exercise 2.24).


For an ar(2) process to be stationary, the following conditions must be satisfied:

φ1 + φ2 < 1
φ2 − φ1 < 1
−1 < φ2 < 1

These inequalities define a triangular region in the (φ1, φ2) plane (Exercise 2.26).
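These conditions can also be checked numerically. Under the usual characterisation that an ar model is stationary when all roots of the polynomial φ(B) lie outside the unit circle, r's polyroot gives a quick test; the sketch below uses the ar(2) model (2.6):

> phi <- c(0.3, -0.2)             # phi1 and phi2 from model (2.6)
> roots <- polyroot(c(1, -phi))   # roots of phi(B) = 1 - 0.3B + 0.2B^2
> all(Mod(roots) > 1)             # TRUE: the model is stationary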

2.7 Summary

In this Module, autoregressive models, or ar models, were studied. Forecasting and the statistics of the models have been considered. In addition, the use of the backshift operator was studied.

2.8 Exercises

Ex. 2.13: Classify the following ar models (that is, state if they are ar(1),
ar(4), etc.)

(a) Xn+1 = en+1 + 78.03 − 0.56Xn − 0.23Xn−1 + 0.19Xn−2 .


(b) Yn = 12.8 − 0.22Yn−1 + en .
(c) Dt − 0.17Dt−1 + 0.18Dt−2 = et .

Ex. 2.14: Classify the following ar models (that is, state if they are ar(1),
ar(4), etc.)

(a) Xn = en + 0.223Xn−1 .
(b) At = 26.7 + 0.2At−1 − 0.2At−2 + et .
(c) Qt + 0.21Qt−1 + 0.034Qt−2 − 0.13Qt−3 = et .

Ex. 2.15: Determine the mean of each series in Exercise 2.13.

Ex. 2.16: Determine the mean of each series in Exercise 2.14.

Ex. 2.17: Write each of the models in Exercise 2.13 using the backshift
operator.

Ex. 2.18: Write each of the models in Exercise 2.14 using the backshift
operator.


Ex. 2.19: The time series {An } has a mean of 47.4. The following ar(2)
model was fitted to the series:

An = m0 + 0.25An−1 + 0.17An−2 + en .

(a) Find the value of m0 .


(b) Write the model using the backshift operator.

Ex. 2.20: The time series {Yn } has a mean of 12.26. The following ar(3)
model was fitted to the series:

Yn = en + m0 − 0.31Yn−1 + 0.12Yn−2 − 0.10Yn−3 .

(a) Find the value of m0 .


(b) Write down formulae for forecasting the series one, two and three
steps ahead.

Ex. 2.21: Yao [52] fits numerous ar models to model the total June rainfall
(in mm) at Shanghai, {Yt }, from 1932 to 1950. One of the fitted models
is
Yt = 309.70 − 0.44Yt−1 − 0.29Yt−2 + et .

(a) Classify the ar model fitted to the series.


(b) Determine the mean of the series {Yt }.
(c) Write down formulae for forecasting the June rainfall in Shanghai
one and two years ahead.
(d) Write the model using the backshift operator.

Ex. 2.22: In Guiot & Tessier [18], ar(3) models are fitted to the widths
of tree rings. This is of interest as there is evidence that pollution
may be affecting tree growth. Each observation in the series {Ct } is
the average of 30 tree-ring widths from 1900 to 1941 of a species of
conifer. Write down the general form of the model used to forecast
tree-ring width.

Ex. 2.23: Woodward and Gray [51] use a number of models, including ar
models, to study change in global temperature. One such ar model,
given in the paper (their Table 2) for modelling the Intergovernmental
Panel on Climate Change (IPCC) data series from 1968 to 1990, has
the factor
(1 + 0.22B + 0.59B 2 )
when the model is written using backshift operators. Write out the
model without using the backshift operator.


Ex. 2.24: Write a short piece of R-code to simulate the ar model Xt =


φXt−1 +et where e ∼ N (0, 4) (see Example 2.3). Plot a simulated series
of length 200 for each of the following eight values of φ: φ = −1.5,
−1, −0.6, −0.2, 0, 0.5, 1, 1.5. Comment on your findings: What effect
does the value of φ have on the stationarity of the series?

Ex. 2.25: Write a short piece of R-code to simulate the ar model Yn =


0.2Yn−1 + en where e ∼ N (0, σe2 ) (see Example 2.3). Plot a simulated
series of length 200 for each of the following four values of σe2 : σe2 = 0.5,
1, 2, 4. Comment on your findings: What effect does changing the
value of σe2 have?

Ex. 2.26: The notes indicate that for an ar(2) process to be stationary,
the following conditions must be satisfied:

φ1 + φ2 < 1
φ2 − φ1 < 1
−1 < φ2 < 1

These inequalities define a triangular region in the (φ1 , φ2 ) plane.


Draw this triangular region, and then write some R-code to simulate
some ar(2) series with parameters in this region, and some with
parameters outside this region. You should observe non-stationary
time series when the parameters are outside this triangular region.

Ex. 2.27: Consider the time series {G}, for which the last three observa-
tions are: G67 = 40.3, G68 = 39.6, G69 = 50.1. A statistician has
developed the ar(2) model

Gn = en − 0.3Gn−1 − 0.1Gn−2 + 63

for modelling the data.

(a) Determine the mean of the series {G}.


(b) Develop a forecasting formula for forecasting {G} one-, two- and
three-steps ahead.
(c) Using the data above, compute numerical forecasts for Ĝ70|69 ,
Ĝ71|69 and Ĝ72|69 .

Ex. 2.28: Use r to generate a time series from the ar(1) model

Ft+1 = 12 + 0.3Ft + et+1 (2.7)

of length 300 (see Example 2.3 for a guideline).

(a) Compute the mean of {F } from Equation (2.7).


(b) Compute the mean of your R-generated time series, ignoring the
first 50 observations. (It usually takes a little while for the sim-
ulations to stabilize; see Fig. 2.1.) Compare to your previous
answer, and comment.
(c) Develop a forecasting formula for forecasting {F } one-, two- and
three-steps ahead.
(d) Using your generated data set, compute numerical forecasts for
the next three observations.

2.8.1 Answers to selected Exercises

2.13 The models are: ar(3), ar(1) and ar(2).

2.15 (a) Let µ = E[X] and take expectations of each term. This gives:
µ = 0 + 78.03 − 0.56µ − 0.23µ + 0.19µ. Solving for µ shows that
µ = E[X] ≈ 48.77.
(b) In a similar manner, E[Y ] = 10.49.
(c) E[D] = 0.

2.17 (a) (1 + 0.56B + 0.23B² − 0.19B³)Xn+1 = 78.03 + en+1 ;


(b) (1 + 0.22B)Yn = 12.8 + en ;
(c) (1 − 0.17B + 0.18B 2 )Dt = et .

2.19 (a) Taking expectations shows that 0.58E[A] = m0 . Since E[A] =


47.4, it follows that m0 = 27.492.
(b) (1 − 0.25B − 0.17B 2 )An = en .

2.20 (a) Taking expectations, 1.29E[Y ] = m0 . Since E[Y ] = 12.26, it


follows that m0 = 15.8154.
(b) The one-step ahead forecast is Ybn+1|n = 15.8154 − 0.31Yn +
0.12Yn−1 − 0.10Yn−2 . The two-step ahead forecast is Ybn+2|n =
15.8154 − 0.31Ybn+1|n + 0.12Yn − 0.10Yn−1 . The three-step ahead
forecast is Ybn+3|n = 15.8154 − 0.31Ybn+2|n + 0.12Ybn+1|n − 0.10Yn .

2.23 If Gt is the global temperature, one model is Gt = −0.22Gt−1 −


0.59Gt−2 + et .



Module 3
Moving Average (MA) models

Module contents
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 The backshift operator . . . . . . . . . . . . . . . . . . . 43
3.4 Forecasting ma models . . . . . . . . . . . . . . . . . . . 44
3.4.1 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Forecasting difficulties with ma models . . . . . . . . . . 47
3.5 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.1 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.2 The variance . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.3 Autocovariance and autocorrelation . . . . . . . . . . . 49
3.6 Why have different types of models? . . . . . . . . . . . 50
3.6.1 Two reasons . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.2 Conversion of models . . . . . . . . . . . . . . . . . . . . 51
3.6.3 The acf for ar models . . . . . . . . . . . . . . . . . . 53
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8.1 Answers to selected Exercises . . . . . . . . . . . . . . . 57


Module objectives

Upon completion of this module students should be able to:

ˆ understand what is meant by a moving average (ma) model;

ˆ use the ma(q) notation to define ma models;

ˆ use ma models to develop forecasting formulae;

ˆ use ma models to develop confidence intervals for forecasts;

ˆ write ma models using the backshift operator;

ˆ compute the mean and variance of a time series written in ma form;

ˆ understand the need for both ar and ma models;

ˆ convert ar models to ma models using appropriate methods;

ˆ compute the autocorrelation function (acf) for an ma model;

ˆ understand that the acf for an ma(q) model will have q non-zero terms
(apart from the term at lag zero, which is always one).

3.1 Introduction

This Module introduces a second type of time series model: moving aver-
age models. Together with autoregressive models, they form the two basic
models in the Box–Jenkins methodology.

3.2 Definition

Another type of (Box–Jenkins) time series model is a Moving Average model,


or ma model. ar models imply the time series signal can be expressed as
a linear function of previous values of the time series. The error (or noise)
term in the equation, en , is the one-step ahead forecasting error.

In contrast, ma models imply the signal can be expressed as a function of


previous forecasting errors. This is sensible: it suggests ma models make
forecasts based on the errors made in the past, and so one can learn from
the errors made in the past to improve later forecasts. (Colloquially, it
means it learns from its own mistakes!)


Example 3.1: Consider the model Xt = et − 0.3et−1 + 0.2et−2 − 0.15et−3 .


The signal or information is St = −0.3et−1 + 0.2et−2 − 0.15et−3 . This
is an ma(3) model, since the signal is based on three previous error terms.
The term et is the error.

Example 3.2: An example of an MA(2) model is Wn+1 = 12 + 0.9en −


0.4en−1 + en+1 .

A more formal definition of a moving average model follows.

Definition 3.1 A moving average model of order q, or an ma(q) model, is


of the form

Xn = m + en + θ1 en−1 + θ2 en−2 + · · · + θq en−q (3.1)


= m + en + Σ_{k=1}^{q} θk en−k    (3.2)

for n ≥ 1 where θ1 , . . . , θq are real numbers and m is a real number.

For the model in Equation (3.2) to be of use in practice, the scientist must
be able to estimate the value of q (that is, how many terms are needed in
the ma model), and then estimate the values of θk and m. Each of these
issues will be addressed in later sections.

3.3 The backshift operator

The backshift operator can be used to write ma models in the same way as
ar models. Consider the model in Example 3.1. Using backshift operators,
this is written

Xt = (1 − 0.3B + 0.2B 2 − 0.15B 3 )et


= θ(B)et .


3.4 Forecasting MA models

3.4.1 Forecasting

The principles of forecasting were developed in Sect. 2.3.1 (it may be worth
reading this section again) in the context of ar models. The same principles
apply for ma models. Consider the following ma(2) model:
Rn = 12 + en − 0.3en−1 − 0.12en−2 , (3.3)
where en has a normal distribution with a mean of zero and variance of
σe2 = 3; that is, en ∼ N (0, 3). Suppose a one-step ahead forecast is required
if the information about the time series {Rn } is known up to time n; that
is, R̂n+1|n is required.

Proceeding as before, first adjust the subscripts:


Rn+1 = 12 + en+1 − 0.3en − 0.12en−1 ;
then write
R̂n+1|n = 12 + ên+1|n − 0.3ên|n − 0.12ên−1|n .
Now, en|n and en−1|n are both known at time n to be en and en−1 , but en+1
is not known at time n. So what do we use for the value of en+1 ? Again,
if we have no other information, use the mean value of {en }, which is zero.
So the forecast is
R̂n+1|n = 12 − 0.3en − 0.12en−1 .    (3.4)
The same procedure is used for k-step ahead forecasts.

Example 3.3: A two-step ahead forecast for the ma(2) model in Equa-
tion (3.3) is found by first adjusting the subscripts:
Rn+2 = 12 + en+2 − 0.3en+1 − 0.12en ,
and then writing
R̂n+2|n = 12 + ên+2|n − 0.3ên+1|n − 0.12ên|n .
Of the terms on the right, only en|n is known; the rest must be replaced
by the mean value of zero. So the two-step ahead forecast is
R̂n+2|n = 12 − 0.12en .
The forecast for three steps ahead is
R̂n+3|n = 12,

which is also the forecast for further steps ahead as well.


3.4.2 Confidence intervals

Consider the ma(2) model in Equation (3.3):

Rn = 12 + en − 0.3en−1 − 0.12en−2 . (3.5)

For this equation, one- and two-step ahead forecasts were developed. The
one-step ahead forecast is
R̂n+1|n = 12 − 0.3en − 0.12en−1 .

The forecasting error is the difference between the value forecast, and the
value actually observed. It is found as follows:

Rn+1 − R̂n+1|n .

Now, even though Rn+1 is not known exactly, it can be expressed as

Rn+1 = 12 + en+1 − 0.3en − 0.12en−1 ,

from Equation (3.5). (The reason Rn+1 is not known exactly is that Rn+1
depends on the unknown random value of en+1 ; this is the error we make
when we make our forecast, which is of course unknown.) This means that
the forecasting error is

Rn+1 − R̂n+1|n = [12 + en+1 − 0.3en − 0.12en−1 ] − [12 − 0.3en − 0.12en−1 ]
= en+1 .

This tells us that the series {en } is actually just the one-step ahead forecast-
ing errors. A confidence interval for the forecast of Rn+1 can also be formed.
The actual error about to be made, en+1 is, of course, unknown. But this
information can be used to develop confidence intervals for the forecast.

The variance of {en } can generally be estimated by computing all the previ-
ous forecasting errors (r computes these) and then computing the variance.

Suppose for the sake of example the variance of the errors is 5.8. Then the
variance of the forecast is

var[Rn+1 − R̂n+1|n ] = var[en+1 ] = 5.8.

Then a 95% confidence interval for the one-step ahead forecast is


R̂n+1|n ± z∗ √(var[Rn+1 − R̂n+1|n ])
R̂n+1|n ± z∗ √5.8


for the appropriate value of z ∗ . Generally, this is taken as 2 for a 95% con-
fidence interval. (1.96 is more precise; t-values with an appropriate number
of degrees of freedom even more precise. In practice, however, the value of
2 is often used.) So the confidence interval for the forecast is approximately

R̂n+1|n ± 2 × √5.8, or R̂n+1|n ± 4.82.

The same principles apply for other forecasts.

Example 3.4: In Example 3.3, the following two-step ahead forecast was
obtained for Equation (3.5):
R̂n+2|n = 12 − 0.12en .

The actual value of Rn+2 is

Rn+2 = 12 + en+2 − 0.3en+1 − 0.12en ,

so the forecasting error is

Rn+2 − R̂n+2|n = [12 + en+2 − 0.3en+1 − 0.12en ] − [12 − 0.12en ]
= en+2 − 0.3en+1 .

The variance of the forecasting error is

var[en+2 − 0.3en+1 ] = var[en+2 ] + (−0.3)2 var[en+1 ]


= 5.8 + (0.09 × 5.8) = 6.322.

The confidence interval becomes



R̂n+2|n ± 2 × √6.322 = R̂n+2|n ± 5.03.

The same principle is used for three-, four- and further steps ahead,
when the confidence interval is

R̂n+k|n ± 2 × √6.40552 = R̂n+k|n ± 5.06

when k > 2. Notice that the confidence interval gets wider as we


predict further ahead of our knowledge. This should be expected.
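In practice r computes these forecast standard errors directly once a model has been fitted to data (model fitting is covered in Sect. 5.4). A minimal sketch, assuming x holds a series believed to follow an ma(2) model (the object names here are illustrative only):

> fit <- arima(x, order = c(0, 0, 2))   # fit an MA(2) model to x
> fc <- predict(fit, n.ahead = 3)       # forecasts and their standard errors
> fc$pred                               # the one- to three-step ahead forecasts
> fc$pred + 2 * fc$se                   # approximate upper 95% limits
> fc$pred - 2 * fc$se                   # approximate lower 95% limits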


3.4.3 Forecasting difficulties with MA models

Consider the ma model Tn = en − 0.3en−1 . The one-step ahead forecasting


formula is
T̂n+1|n = −0.3en .
Suppose we seek a forecast; the last three observations are: T8 = 4.6; T9 = −3.0; T10 = 0.1. Let’s use the forecasting formula to produce a forecast for T̂11|10 : We would use
T̂11|10 = −0.3e10 .
So we need to know the one-step ahead forecasting error at n = 10; that is e10 . What is this forecasting error? We know the actual observed value at 10: it is T10 = 0.1. But to know the one-step ahead error in forecasting T10 , we need to know T̂10|9 . What is this value?
By the forecasting formula, it is computed using

T̂10|9 = −0.3e9 .

And so we need the one-step ahead forecasting error for n = 9, which requires knowledge of T̂9|8 . From the forecasting formula, we find this using

T̂9|8 = −0.3e8 .

And so the cycle continues, right back to the start of the series.
In practice, we need to compute all the one-step ahead forecasting errors. r
can compute these errors and produce predictions without having to worry
about these difficulties in a real (data-driven) situation; see Sect. 5.4.
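For example, once a model has been fitted, the estimated one-step ahead forecasting errors are simply the model residuals. A brief sketch (again assuming x holds the data; the model order is illustrative):

> fit <- arima(x, order = c(0, 0, 1))   # fit an MA(1) model
> e.hat <- residuals(fit)               # estimated one-step ahead errors
> var(e.hat)                            # estimate of the error variance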

3.5 Statistics

In this Section, the important statistics of a model are found.

3.5.1 The mean

In Equation (3.2), the general form of an ma(q) model is given. Taking


expected values of each term in this series gives

E[Xn ] = E[m] + E[en ] + E[θ1 en−1 ] + E[θ2 en−2 ] + · · · + E[θq en−q ]


= m,

since the average error is zero. Hence, for an ma model, the constant term
m is actually the mean of the series {Xn }.


Example 3.5: In Equation (3.3), let the mean of the series be µ = E[R].
Then taking expected values of each term gives

µ = 12,

so that the mean of the series is µ = E[R] = 12. This should not be
unexpected given the forecasts in Example 3.3.

3.5.2 The variance

The variance of a time series written in ma form is found by taking the
variance of each term. Consider again Equation (3.3); taking the variance
of each term (the constant 12 contributes nothing) gives

var[Rn ] = var[en ] + (−0.3)² var[en−1 ] + (−0.12)² var[en−2 ],

where {en } ∼ N (0, 3), since the errors {en } are independent of the time
series {Rn } and independent of each other. This gives

var[Rn ] = {1 + (−0.3)2 + (−0.12)2 }var[en ],

and so var[R] = 1.1044 × 3 = 3.3132. This approach can be applied to other


ma models also.

Example 3.6: The above results can be checked numerically in r as fol-


lows (set.seed() sets the random number seed so these results are
reproducible):

> set.seed(100)
> ma.sim <- arima.sim(model = list(ma = c(-0.3,
+ -0.12)), n = 10000, sd = sqrt(3))
> var(ma.sim)

[1] 3.321068

> ma.sim <- arima.sim(model = list(ma = c(-0.3,


+ -0.12)), n = 10000, sd = sqrt(3))
> var(ma.sim)

[1] 3.309557


3.5.3 Autocovariance and autocorrelation

The autocovariance for a time series is written, as shown earlier, as

γk = Covar[X0 , Xk ]

for integer k. The autocorrelation is then defined as


ρk = γk / γ0
for integer k, where γ0 = Covar[X0 , X0 ] is simply the variance of the time
series. The series {ρk } is the autocorrelation function, or acf. For any ma
model, the acf can be computed, which will be unique to that ma model.
For this reason, the acf is one of the most important pieces of information
that we can know about a time series. Later, the acf will be used to
determine which ma model might be appropriate for our data.

Note that since the autocovariance is a series, it can be written using the
backshift operator. It can be shown that the autocovariance for an ma(q)
model is
γ(B) = θ(B)θ(B −1 )σe2 .

Example 3.7: The ma(2) model Vn+1 = en+1 − 0.39en − 0.22en−1 can be
written
Vn+1 = θ(B)en+1
where θ(B) = 1 − 0.39B 1 − 0.22B 2 . Suppose for the sake of example
that σe2 = 2. Then, since θ(B −1 ) = 1 − 0.39B −1 − 0.22B −2 , the
autocovariance is

γ(B) = 2(1 − 0.39B −1 − 0.22B −2 )(1 − 0.39B 1 − 0.22B 2 )


= 2(−0.22B −2 − 0.3042B −1 + 1.2005 − 0.3042B 1 − 0.22B 2 )
= −0.44B −2 − 0.6084B −1 + 2.4010 − 0.6084B 1 − 0.44B 2 .

The terms in the autocovariance are therefore (quoting only the terms
for the non-negative lags, as the autocovariance is symmetric):

{γ} = {2.4010, −0.6084, −0.4400}.

and so the corresponding terms in the autocorrelation are ρk = γk /γ0 ,


where γ0 = 2.4010. Hence

{ρ} = {1, −0.253, −0.183}.

The first element of the autocorrelation is always one. It is usual to


plot the acf (Fig. 3.1).


[Figure: the acf plotted against lag (0 to 7).]

Figure 3.1: The acf for the ma(2) model in Example 3.7.

The plot is typical of an ma(2) model: there are two terms in the acf
that are non-zero (apart from the term at a lag of zero, which is always
one).
In general, the acf of an ma(q) model has q non-zero terms excluding
the term at lag zero which is always one.
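The theoretical acf of an ma model can also be computed directly in r using the ARMAacf function, which gives a quick check on the hand calculation in Example 3.7 (the ma coefficients are entered as they appear in the model equation):

> ARMAacf(ma = c(-0.39, -0.22), lag.max = 5)
# lags 0 to 5: approximately 1, -0.253, -0.183, 0, 0, 0

Only the first two lags (beyond lag zero) are non-zero, as expected for an ma(2) model.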

3.6 Why have different types of models?

Why are both ar and ma models necessary? ar models are far more popular
in the literature than ma models, so why not just have ar models? There
are two important reasons why both ma and ar models are necessary.

3.6.1 Two reasons

The first reason is that only ma models can be used to create confidence
intervals on forecasts (Sect. 3.4.2). If an ar model is developed, it must be
written as an ma model to produce confidence intervals for the forecasts.


Secondly, it is necessary to again recall one of the principles of statistical


modelling: to find the simplest possible model that captures the important
features of the data. In some applications, the only suitable ar model has
a large number of parameters. In these situations, there will probably be
an ma model that will be almost identical in terms of forecasting ability,
but has fewer parameters to estimate. In this case, the ma model would be
preferred. In other applications, a simpler ar model will be preferred over
a more complicated ma model.

3.6.2 Conversion of models

This discussion implies that it is possible to convert ar models into ma


models, and ma models into ar models. This is indeed true, and the vehicle
through which this is done is the backshift operator.
Consider an ar model, written using backshift notation as φ(B)Xn = en . If
it is possible and sensible to divide by φ(B), the model can be expressed as

Xn = (1/φ(B)) en .
Denoting 1/φ(B) by θ(B) gives

Xn = θ(B)en ,

which looks like an ma model. This is exactly the way models are converted
from ar to ma.
Consider writing the ar(1) model Xn = 0.6Xn−1 +en as an ma model. There
are three ways of proceeding. The first can only be used for ar(1) models
as it uses a mathematical result relevant only then. The second approach
is more difficult, but is used for any ar model. The third approach uses r,
and so is the easiest but of no use in the examination.
Using the first approach, write the model using the backshift operator as:
φ(B)Xn = en , where φ(B) = 1 − 0.6B. Then divide by φ(B) to obtain
Xn = θ(B)en , where θ(B) = 1/φ(B). So,
θ(B) = 1/(1 − 0.6B).    (3.6)
The mathematical result for the sum of a geometric series (1 + r + r² + r³ + · · · = 1/(1 − r) when |r| < 1) is then used to obtain

θ(B) = 1/(1 − 0.6B) = 1 + 0.6B + (0.6)²B² + (0.6)³B³ + · · · .


So, the corresponding ma model is the infinite ma model (written ma(∞))

Xn = en + 0.6en−1 + (0.6)2 en−2 + (0.6)3 en−3 + · · · .

This shows an ar(1) model has an equivalent ma(∞) form. Since both are
equivalent, the simpler ar(1) form would be preferred, but the ma form is
necessary for computing confidence intervals of forecasts.

In the second approach, start with Equation (3.6), and equate it to an un-
known infinite sequence of θ’s:
1/(1 − 0.6B) = 1 + θ1 B + θ2 B² + · · · .
Then multiply both sides by 1 − 0.6B to get

1 = (1 − 0.6B)(1 + θ1 B + θ2 B 2 + · · · )
= 1 + B(θ1 − 0.6) + B 2 (θ2 − 0.6θ1 ) + · · · ,

and then equate the powers of B on both sides of the equation. For example,
looking at constants, there is one on both sides. Looking at powers of B, zero
are on the left, and −0.6 + θ1 on the right after multiplying out. Equating,
we find that θ1 = 0.6 (as before). Then equating powers of B 2 , the left hand
side has zero, and the right hand side has θ2 − 0.6θ1 . Substituting θ1 = 0.6
and solving gives θ2 = (0.6)2 (as before). A general pattern emerges, giving
the same result as before.

Remember the second method is used to convert any ar model into an ma


model (and also any ma model into an ar model).

The third approach uses r. This is useful, but you will need to know other
methods for the examination. Naturally, the answers are the same as using
the other two methods.

> imp <- as.ts(c(1, rep(0, 19)))


> phi <- 0.6

Note the leading one is not needed in the list of ar coefficients, as it is always one. Note also that the coefficient is entered as +0.6, its sign in the recursion Xn = 0.6Xn−1 + en , which is opposite to its sign in φ(B) = 1 − 0.6B.

> theta <- filter(imp, phi, method = "recursive")


> theta


Time Series:
Start = 1
End = 20
Frequency = 1
[1] 1.000000e+00 6.000000e-01 3.600000e-01
[4] 2.160000e-01 1.296000e-01 7.776000e-02
[7] 4.665600e-02 2.799360e-02 1.679616e-02
[10] 1.007770e-02 6.046618e-03 3.627971e-03
[13] 2.176782e-03 1.306069e-03 7.836416e-04
[16] 4.701850e-04 2.821110e-04 1.692666e-04
[19] 1.015600e-04 6.093597e-05

3.6.3 The ACF for AR models

Briefly, we digress to again consider the acf for ar models, seen previously
in Sect. 2.5.4, Equation 2.5, and Example 2.12 (p 34) in particular. In this
example, the following is stated:

. . . the autocovariance is
γ(B) = 10 / [(1 − 0.3B + 0.2B²)(1 − 0.3B⁻¹ + 0.2B⁻²)]
= 10 / (0.2B⁻² − 0.36B⁻¹ + 1.13 − 0.36B + 0.2B²).

By some detailed mathematics (covered in Sect. 3.6), this equals

γ(B) = · · · + 11.11 + 2.78B − 1.39B² − 0.97B³ − 0.0139B⁴ + 0.190B⁵ + · · · ,    (3.7)

only quoting the terms for the non-negative lags.

Since this is Sect. 3.6, we had better deliver!

The way to convert to Equation (3.7) is to proceed as in this section. First,


write

γ(B) = · · · + γ−2 B −2 + γ−1 B −1 + γ0 + γ1 B + γ2 B 2 + · · ·

(recalling that the autocovariance is a series in both directions, but is symmetric). Then, rearrange the original equation to get

metric.) Then, rearrange the original equation to get

10 = γ(B)(0.2B⁻² − 0.36B⁻¹ + 1.13 − 0.36B + 0.2B²)
= (· · · + γ1 B⁻¹ + γ0 + γ1 B + γ2 B² + · · · ) ×
(0.2B⁻² − 0.36B⁻¹ + 1.13 − 0.36B + 0.2B²)


Then, expand and equate powers of B as before in this section. In this


situation, it is just a lot trickier.

On the left, the constant term is 10; on the right, a constant can be found
from:

γ0 (1.13) + γ1 (−0.36) + γ2 (0.2) + γ1 (−0.36) + γ2 (0.2),

where these terms arise from the products γ0 × 1.13, γ1 B⁻¹ × (−0.36B), γ2 B⁻² × (0.2B²), γ1 B × (−0.36B⁻¹) and γ2 B² × (0.2B⁻²) respectively.

So we have

10 = γ0 (1.13) + γ1 (−0.36) + γ2 (0.2) + γ1 (−0.36) + γ2 (0.2)

Proceed for other powers of B also, and develop a set of equations to be


solved for γ1 , γ2 , and so on.

Far easier is to use r after first converting to an ma model, whose parameters


we call theta:

> imp <- as.ts(c(1, rep(0, 99)))


> theta <- filter(imp, c(0.3, -0.2), "recursive")

That computes the ma(∞) coefficients equivalent to the ar model. Note that the leading one of φ(B) is assumed; it should not be included in the list of coefficients.

> theta[1:4]

[1] 1.000 0.300 -0.110 -0.093

> gamma <- convolve(theta, theta) * 10


> gamma[1:4]

[1] 11.1111111 2.7777778 -1.3888889 -0.9722222

> rho <- gamma/gamma[1]


> rho[1:4]

[1] 1.0000 0.2500 -0.1250 -0.0875


3.7 Summary

In this Module, moving average models were studied, including forecasting,


establishing confidence intervals on forecasts, and writing using the backshift
operator. In addition, three methods were shown that can be used to convert
ar models to ma models.

3.8 Exercises
Ex. 3.8: Classify the following ma models (that is, state if they are ma(3),
ma(2), etc.)

(a) At+1 = et+1 + 8.39 − 0.06et + 0.35et−1 .


(b) Xn = −0.12en−1 + en .
(c) Yt − 0.29et−1 + 0.19et−2 + 0.62et−3 − 0.26et−4 − et = 12.40.

Ex. 3.9: Classify the following ma models (that is, state if they are ma(3),
ma(2), etc.)

(a) Bt = 0.1et−1 + et .
(b) Yn = 0.036en−2 − 0.36en−1 + en .
(c) Wt + 0.39et−1 + 0.25et−2 − 0.21et−3 − et = 8.00.

Ex. 3.10: Determine the mean of each series in Exercise 3.8.

Ex. 3.11: Determine the mean of each series in Exercise 3.9.

Ex. 3.12: Write each of the models in Exercise 3.8 using the backshift op-
erator.

Ex. 3.13: Write each of the models in Exercise 3.9 using the backshift op-
erator.

Ex. 3.14: Convert the ar model

Xt+1 = et+1 + 0.4Xt

into the equivalent ma model using each of the three methods outlined
in Sect. 3.6, and confirm that they give the same answer.

Ex. 3.15: Convert the ma(2) model

Yn = en + 0.3en−1 − 0.1en−2

into the equivalent ar model using one of the three methods outlined
in Sect. 3.6.


Ex. 3.16: Convert the AR model

Yn = 0.25Yn−1 − 0.13Yn−2 + en

into the equivalent ma model using one of the three methods outlined
in Sect. 3.6.

Ex. 3.17: Compute forecasting formula for each of the ma models in Exer-
cise 3.8 for one-, two- and three-steps ahead, and compute confidence
intervals for each forecast in terms of the error variance σe2 .

Ex. 3.18: Compute forecasting formula for each of the ma models in Exer-
cise 3.9 for one-, two- and three-steps ahead, and compute confidence
intervals for each forecast. In each case, assume σe2 = 2.

Ex. 3.19: Write a short piece of r-code to simulate the ma model Xt =


et + θet−1 where e ∼ N (0, 1) (see Example 2.3 for a guideline). Plot a
simulated series of length 200 for each of the following eight values of
θ: θ = −1.5, −1, −0.6, −0.2, 0, 0.5, 1, 1.5. Comment on your findings:
What effect does the value of θ have on the stationarity of the series?

Ex. 3.20: Consider the ma(1) model

Xn = 0.4en−1 + en

where e ∼ N (0, 3).

(a) Write the model using backshift operators.


(b) Find the autocovariance series {γ}.
(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.21: Consider the ma(1) model

Sn+1 = 0.2en + en+1 ,

where e ∼ N (0, 2).

(a) Write the model using backshift operators.


(b) Find the autocovariance series {γ}.
(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.22: Consider the time series model

Zt = 0.2et−1 − 0.1et−2 + et ,

where e ∼ N (0, 5).


(a) Write the model using backshift operators.


(b) Find the autocovariance series {γ}.
(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.23: Consider the ar(1) model

Wn = 0.3Wn−1 + en ,

where e ∼ N (0, 2.5).

(a) Write the model using backshift operators.


(b) Find the autocovariance series {γ} using R.
(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.24: Consider the AR model

Yt = 0.45Yt−1 − 0.2Yt−2 + et ,

where e ∼ N (0, 5).

(a) Write the model using backshift operators.


(b) Find the autocovariance series {γ} using R.
(c) Compute the autocorrelation function (acf), {ρ}.

3.8.1 Answers to selected Exercises

3.8 The models are: ma(2); ma(1) and ma(4).

3.10 The means are: E[A] = 8.39; E[X] = 0; and E[Y ] = 12.40.

3.12 (a) At+1 = (1 − 0.06B + 0.35B 2 )et+1 + 8.39;


(b) Xn = (1 − 0.12B)en ;
(c) Yt = (1 + 0.29B − 0.19B 2 − 0.62B 3 + 0.26B 4 )et + 12.40.

3.14 First, convert to backshift operator notation: (1 − 0.4B)Xt+1 = et+1 .


The infinite ma model is given by
1
Xt+1 = et+1 .
1 − 0.4B
Then the equivalent ma model, using any method, is

Xt+1 = (1 + 0.4B + (0.4)2 B 2 + (0.4)3 B 3 + · · · )et+1 ,

or
Xt+1 = et+1 + 0.4et + 0.16et−1 + 0.064et−2 + · · · .


3.17 For (a) only:

(a) One-step ahead: Ât+1|t = 8.39 − 0.06et + 0.35et−1 ; var[Ât+1|t − At+1 ] = var[et+1 ] = σe² ; the CI is Ât+1|t ± 2√σe² .
(b) Two-steps ahead: Ât+2|t = 8.39 + 0.35et ; var[Ât+2|t − At+2 ] = (1 + (0.06)²)var[et ] = 1.0036σe² ; the CI is Ât+2|t ± 2√(1.0036σe²) .
(c) Three-steps ahead: Ât+3|t = 8.39 ; var[Ât+3|t − At+3 ] = (1 + (0.06)² + (0.35)²)var[et ] = 1.1261σe² ; the CI is Ât+3|t ± 2√(1.1261σe²) .

3.20 (a) Xn = (1 + 0.4B)en , or Xn = θ(B)en where θ(B) = (1 + 0.4B).


(b) The autocovariance using the backshift operator is γ(B) = θ(B)θ(B −1 )σe2 ,
so γ(B) = 3(1 + 0.4B)(1 + 0.4B −1 ) = 1.2B −1 + 3.48 + 1.2B, so
the series is {1.2, 3.48, 1.2}.
(c) Dividing the autocovariance by γ0 = 3.48 gives the acf series as
{0.345, 1, 0.345}

3.23 (a) (1 − 0.3B)Wn = en .


(b) > imp <- as.ts(c(1, rep(0, 99)))
> theta <- filter(imp, c(0.3), "recursive")
> gamma <- convolve(theta, theta) * 2.5
> gamma[1:6]
[1] 2.747252747 0.824175824 0.247252747
[4] 0.074175824 0.022252747 0.006675824
(c) > rho <- gamma/gamma[1]
> rho[1:6]
[1] 1.00000 0.30000 0.09000 0.02700 0.00810
[6] 0.00243

Module 4

ARMA Models
Module contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 The backshift operator for arma models . . . . . . . . . 62
4.4 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.1 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.2 The autocovariance and autocorrelation . . . . . . . . . 63
4.5 Conversion of arma models to ar and ma models . . . 64
4.6 Forecasting arma models . . . . . . . . . . . . . . . . . . 65
4.6.1 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . 66
4.6.3 Forecasting difficulties with arma models . . . . . . . . 67
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.8.1 Answers to selected Exercises . . . . . . . . . . . . . . . 70

Module objectives

Upon completion of this module students should be able to:


ˆ understand what is meant by an autoregressive moving average (arma)


model;

ˆ use the arma(p, q) notation to define arma models;

ˆ use arma models to develop forecasting formulae;

ˆ develop confidence intervals for forecasts from arma models;

ˆ write arma models using the backshift operator;

ˆ compute the mean of a time series written in arma form;

ˆ understand the need for ar, ma and arma models;

ˆ convert arma models to ma and ar models using appropriate meth-


ods;

ˆ compute the autocorrelation function (acf) for an arma model.

4.1 Introduction

This Module examines models with both autoregressive and moving average
components.

4.2 Definition

The principle of parsimony—that the best model is the simplest model that
captures the important features of the data—has been mentioned before,
where it was noted that a complex ar model can often be replaced by a
simpler ma model.

Sometimes, however, neither a simple ar model nor a simple ma model
exists. In these cases, a combination of ar and ma models will almost always
produce a simple model. These models are called AutoRegressive Moving
Average models, or arma models. Once again, p is used for the number of
autoregressive components, and q for the number of moving average compo-
nents. Consider first some examples.

Example 4.1: An example of an arma(2, 1) model is

Wn+1 = 0.56 + 0.8Wn − 0.4Wn−1 + en+1 + 0.5en ,

where 0.8Wn − 0.4Wn−1 are the two ar components and 0.5en is the single ma component.


The signal or information is of the form Sn+1 = 0.56 + 0.8Wn −


0.4Wn−1 + 0.5en , and has both ar and ma components.

Example 4.2: An example of an arma(1, 3) model is

Xt = 120.78 + 0.88Xt−1 + et − 0.41et−1 − 0.15et−2 + 0.08et−3 ,

where 0.88Xt−1 is the ar(1) component and −0.41et−1 − 0.15et−2 + 0.08et−3 is the ma(3) component.

A more formal definition follows.

Definition 4.1 The form of an arma(p, q) model is the equation:


Xn − Σ_{k=1}^{p} φk Xn−k = m0 + en + Σ_{j=1}^{q} θj en−j ,    n ≥ 0,    (4.1)

where {Xn }n≥1 is the time series, m0 is some constant, and the φk and θj are defined as for
ar and ma models respectively.

Example 4.3: Chu & Katz [13] studied the monthly SOI time series from
January 1935 to August 1983, and concluded the data could be mod-
elled by an arma(1, 1) model.

Example 4.4: Davis & Rappoport [15] use an arma(2, 2) model for the
Palmer Drought Index, {Yt }. The final fitted model is

Yt = 1.344Yt−1 − 0.431Yt−2 + et − 0.419et−1 + 0.034et−2 .

Katz & Skaggs [26] claim the equivalent ar(2) model is almost as good
as the model given by Davis & Rappoport, yet has half the number of
parameters. For this reason, they prefer the ar(2) model.


4.3 The backshift operator for ARMA models

arma models have both ar and ma components; the model is easily writ-
ten using the backshift operator by following the guidelines for ar and ma
models.

Example 4.5: Consider the arma(1, 2) model


Zt = 0.83 + et − 0.66et−1 + 0.72et−2 − 0.29Zt−1 . (4.2)
First, re-write as
Zt + 0.29Zt−1 = 0.83 + et − 0.66et−1 + 0.72et−2 ;
then use the backshift operator to get
φ(B)Zt = m0 + θ(B)et
(1 + 0.29B)Zt = 0.83 + (1 − 0.66B + 0.72B 2 )et .

4.4 Statistics

4.4.1 The mean

In Equation (4.1), the general form of an arma(p, q) model is given. Taking


expected values of each term in this series gives
E[Xn ] = E[m0 ] + E[en ] + E[θ1 en−1 ] + · · · + E[θq en−q ]
+ E[φ1 Xn−1 ] + E[φ2 Xn−2 ] + · · · + E[φp Xn−p ]
= m0 + φ1 E[Xn−1 ] + φ2 E[Xn−2 ] + · · · + φp E[Xn−p ],
since the average error is zero. Now, since the assumption is that the time
series {X} is stationary, the mean of this series is approximately constant,
so the expected value of the series will be the same at any time step. Let
this constant mean be µ. Then,
µ = m0 + φ1 µ + φ2 µ + · · · + φp µ,
and so, on solving for µ,
µ = m0 / (1 − φ1 − φ2 − · · · − φp ).

This enables the mean of the series to be computed from the arma
model.


Example 4.6: The mean of {Zt }, say µ, in the arma(1, 2) model in Equa-
tion (4.2) is found by taking expectations of each term:

E[Zt ] = 0.83 + E[et ] − 0.66E[et−1 ] + 0.72E[et−2 ] − E[0.29Zt−1 ]


µ = 0.83 − 0.29µ

so that E[Z] = µ = 0.643.

See also Example 4.8.

4.4.2 The autocovariance and autocorrelation

For an arma(p, q) model, the autocovariance can be expressed as

γ(B) = σe² θ(B)θ(B⁻¹) / [φ(B)φ(B⁻¹)].

Example 4.7: In Example 4.5, the following were found:

φ(B) = (1 + 0.29B 1 )
θ(B) = (1 − 0.66B 1 + 0.72B 2 ).

Suppose for the sake of example that σe2 = 4. Then, the autocovariance
is
γ(B) = 4(1 − 0.66B + 0.72B²)(1 − 0.66B⁻¹ + 0.72B⁻²) / [(1 + 0.29B)(1 + 0.29B⁻¹)].

This can be converted into the series

γ(B) = · · · 5.4430B −2 − 8.8380B −1 + 11.9381


−8.8380B + 5.4430B 2 − 1.5785B 3 + 0.4578B 4 + · · ·

so that the autocorrelation is

{ρ} = {1.0000, −0.7403, 0.4559, −0.1322, 0.0383, . . . }.
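These values can be checked in r with the ARMAacf function, entering the ar coefficient with the sign it has in the recursion Zt = −0.29Zt−1 + · · · :

> round(ARMAacf(ar = -0.29, ma = c(-0.66, 0.72), lag.max = 4), 4)
# expect approximately: 1.0000 -0.7403 0.4559 -0.1322 0.0383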


4.5 Conversion of ARMA models to AR and MA models

Using similar approaches as used before in Sect. 3.6, arma models can be
converted to pure ar or pure ma models.

Example 4.8: Consider the arma(1, 2) model

Xt = 0.3Xt−1 + et + 0.4et−1 − 0.1et−2 + 10. (4.3)

For the moment, ignore the constant term m0 = 10 and write

(1 − 0.3B)Xt = (1 + 0.4B − 0.1B 2 )et


φ(B)Xt = θ(B)et .

To write the model as a pure ma model,


Xt = [θ(B)/φ(B)] et = θ′(B)et    (4.4)

where θ′(B) = θ′0 + θ′1 B + θ′2 B² + · · · . To convert to this pure ma form,
the values of θ′0 , θ′1 , and so on must be found. Rearrange Equation (4.4)
to obtain

θ(B) = θ′(B)φ(B)
1 + 0.4B − 0.1B² = (θ′0 + θ′1 B + θ′2 B² + θ′3 B³ + · · · )(1 − 0.3B)
= θ′0 + B(−0.3θ′0 + θ′1 )
+ B²(−0.3θ′1 + θ′2 )
+ B³(−0.3θ′2 + θ′3 ) + · · ·

Now, equate powers of B so that both sides of the equation are equal.
Equating constant terms: 1 = θ′0 as expected. Equating terms in B:

0.4 = −0.3θ′0 + θ′1 ,

so that θ′1 = 0.4 + 0.3θ′0 = 0.7.

Equating terms in B²:

−0.1 = −0.3θ′1 + θ′2 ,

so that θ′2 = −0.1 + 0.3θ′1 = 0.11.


Equating terms in B³:

0 = −0.3θ′2 + θ′3 ,


so that θ′3 = 0.3θ′2 = 0.11(0.3).

Continuing, a pattern emerges showing that θ′k = (0.3)k−2 (0.11) when
k ≥ 2. Hence,

θ′(B) = 1 + 0.7B + 0.11B² + 0.11(0.3)B³ + · · · + 0.11(0.3)k−2 Bᵏ .

This means the arma(1, 2) model has an equivalent ma(∞) represen-


tation of

Xt = et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2 et−k .

However, there will probably be a constant term in the model yet to


be found, so that

Xt = m0 + et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2 et−k . (4.5)

Taking expectations of Equation (4.3) shows that the mean of the series
is E[X] = 10/0.7 ≈ 14.2857. Taking expectations of Equation (4.5)
shows that m0 = E[X] ≈ 14.2857. So the arma(1, 2) model has the
equivalent ma model

Xt = 14.2857 + et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2 et−k .

Note that these conversions can also be done in r; one possible approach is sketched below.
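A sketch using ARMAtoMA (part of base r), which returns the coefficients of et−1 , et−2 , . . . (the leading coefficient of et , always one, is implicit):

> ARMAtoMA(ar = 0.3, ma = c(0.4, -0.1), lag.max = 5)
# expect 0.7, 0.11, 0.033, 0.0099, 0.00297,
# matching the theta values found by hand above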

4.6 Forecasting ARMA models

4.6.1 Forecasting

Forecasting arma models uses the same principles as for forecasting ma and
ar models. This procedure is called the hat principle, summarized below:

The forecasting equation for an arma model is obtained from the model
equation by “placing hats” on all the terms of the equation, and adjusting
subscripts accordingly. The “hat” designates the best linear estimate of the
quantity underneath the hat. This equation is then adjusted by noting:

1. An êk|j for which k is in the future (i.e. k > j) just equals zero (the
mean of {ek }), while one for which k is in the present or past (k ≤ j)
just equals ek . In other words, hats change future ek ’s to zeros and
they fall off present and past ek ’s.


2. An X̂k|j for which k is in the present or past (i.e. k ≤ j) just equals


Xk , while one for which k is in the future can be expressed in terms
of another forecasting equation, which ultimately will allow it to be
expressed in terms of known quantities. In other words, hats fall off
present and past Xk ’s and they stay on future ones.

Example 4.9: Consider the arma(2, 1) model

Wn = 0.72 + 0.44Wn−1 + 0.17Wn−2 + en − 0.26en−1 . (4.6)

A one-step ahead forecast is

Ŵn+1|n = 0.72 + 0.44Ŵn|n + 0.17Ŵn−1|n + ên+1|n − 0.26ên|n .

Since ên+1|n is in the future, it is replaced by the mean of the {ek },


which is zero. In contrast, ên|n = en . Likewise, Ŵn|n = Wn and
Ŵn−1|n = Wn−1 , so the forecasting formula is

Ŵn+1|n = 0.72 + 0.44Wn + 0.17Wn−1 − 0.26en .    (4.7)

Using the same principles, the two-step ahead forecasting formula is

Ŵn+2|n = 0.72 + 0.44Ŵn+1|n + 0.17Wn .

Again, Ŵn+1|n can be replaced by Equation (4.7) (though this is not
necessary) to get

Ŵn+2|n = 0.72 + 0.44 {0.72 + 0.44Wn + 0.17Wn−1 − 0.26en } + 0.17Wn ,

which can be simplified if you wish.

4.6.2 Confidence intervals

As with ar models, arma models must be first converted to pure ma models


before confidence intervals can be computed for forecasts. After conversion
to a pure ma form, the same principles as used in Sect. 3.4.2 are used.

Example 4.10: Consider the arma(1, 2) model from Example 4.8:

Xt = 0.3Xt−1 + et + 0.4et−1 − 0.1et−2 + 10.

This model has the equivalent ma(∞) form

Xt = 14.2857 + et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2 et−k .


The one-step ahead forecast of the model in ma form is


X̂t+1|t = 14.2857 + 0.7et + 0.11et−1 + · · · + 0.11(0.3)k−2 et−k+1 ,

whereas the exact (but unknown) value will be

Xt+1 = 14.2857 + et+1 + 0.7et + 0.11et−1 + · · · + 0.11(0.3)k−2 et−k+1 .

The difference between them is et+1 , and so the variance of the forecasting
error is just the error variance, say σe² .
The two-step ahead forecast is
X̂t+2|t = 14.2857 + 0.11et + · · · + 0.11(0.3)k−2 et−k+2 ,

whereas the exact (but unknown) value is

Xt+2 = 14.2857 + et+2 + 0.7et+1 + 0.11et + · · · + 0.11(0.3)k−2 et−k+2 .

The difference between them is

Xt+2 − X̂t+2|t = et+2 + 0.7et+1 ,

so the variance of the forecast is σe2 (1 + 0.72 ) = 1.49σe2 . Confidence


intervals can be constructed from the values of the error variance.
Continuing in the same manner, the variance of a three-step ahead
forecast is 1.5021σe2 and a four-step ahead forecast is 1.503189σe2 .
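These forecast-error variances can also be generated from the ψ-weights of the ma(∞) form; a sketch using ARMAtoMA (the cumulative sums give the multipliers of σe²):

> psi <- ARMAtoMA(ar = 0.3, ma = c(0.4, -0.1), lag.max = 10)
> cumsum(c(1, psi^2))[1:4]
# expect 1, 1.49, 1.5021, 1.503189: the one- to four-step ahead
# forecast error variances as multiples of the error variance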

4.6.3 Forecasting difficulties with ARMA models

In Sect. 3.4.3, some difficulties forecasting with ma models were presented.


In short, the one-step ahead forecasting errors need to be determined right
to the beginning of the series. Because aspects of ma models are present in
arma models, this same difficulty is also present. Of course, r can compute
these errors and produce predictions without having to worry about these
difficulties in a real (data-driven) situation; see Sect. 5.4.

4.7 Summary

In this Module, a combination of autoregressive and moving average models,


called arma models, was discussed. Forecasting methods were also exam-
ined for these models.


4.8 Exercises

Ex. 4.11: Classify the following models as ar, ma or arma, and state the
orders of the models (for example, an answer may be arma(1, 3)):

(a) At = 12.6 − 0.44At−1 + 0.37et−1 + et ;


(b) Xn − 0.24Xn−1 + 0.38Xn−2 − 14.8 = en ;
(c) Yt+1 = et+1 − 0.19et − 0.44Yt ;
(d) Rn = 0.46en−1 + en ;
(e) Pn+1 = 8.69 + en+1 − 0.35Pn − 0.26en − 0.18en−1 + 0.11en−2 .

Ex. 4.12: Classify the following models as ar, ma or arma, and state the
orders of the models (for example, an answer may be arma(1, 3)):

(a) An − 0.1An−1 = 7.40 + 0.22en−1 + en ;


(b) Bn − 0.5Bn−1 = en ;
(c) Xt − et = 0.61Xt−1 − 0.67et−1 ;
(d) Zt+1 = 0.26et + 0.10et−1 + 0.17Zt − 0.16Zt−1 + et+1 ;
(e) Xt+1 − 0.2et + 0.2et−1 = et+1 + 7;
(f) Yn = −2.2 + en + 0.23Yn−1 − 0.19en−1 − 0.18en−2 + 0.17en−3 .

Ex. 4.13: Find the mean of each series in Exercise 4.11.

Ex. 4.14: Find the mean of each series in Exercise 4.12.

Ex. 4.15: Write each model in Exercise 4.11. using the backshift operator.

Ex. 4.16: Write each model in Exercise 4.12. using the backshift operator.

Ex. 4.17: Consider the arma(1, 1) model

Xn = 0.2Xn−1 + en − 0.1en−1

where var[en ] = 9.3.

(a) Write the model using the backshift operator.


(b) Find a one- and two-step ahead forecast for the model.
(c) Convert the model into a pure ma model.
(d) Find 95% confidence intervals for the forecasts in (b).


Ex. 4.18: Consider the arma(1, 1) model

Yn = 0.3Yn−1 + en + 0.2en−1

where var[en ] = 7.0.

(a) Write the model using the backshift operator.


(b) Find a one-, two- and three-step ahead forecast for the model.
(c) Convert the model into a pure ma model.
(d) Find 95% confidence intervals for the forecasts in (b).

Ex. 4.19: Consider the arma(1, 1) model

Wt+1 + 0.2Wt = 2 + et+1 + 0.2et

where var[et ] = 7.0.

(a) Write the model using the backshift operator.


(b) Find a one-, two- and three-step ahead forecast for the model.
(c) Convert the model into a pure ma model.
(d) Find 95% confidence intervals for the forecasts in (b).

Ex. 4.20: Give two reasons why it is sometimes necessary to convert ar


and arma models into pure ma models.
Ex. 4.21: Claps & Morrone [14] give the following model for modelling
runoff Dt under certain conditions:

Dt − exp{−1/K3 }Dt−1 =
(1 − c3 exp{−1/K3 })It − exp{−1/K3 }(1 − c3 )It−1 ,

where c3 is a recharge coefficient (constant in any given problem), It


is the effective rainfall input, and K3 is a storage coefficient (constant
in any given problem). The authors state that if the effective rainfall
input It is white noise, then the model is equivalent to an arma(1, 1)
model. Use the fact that (1 − c3 exp{−1/K3 })It = et to show that this is the
case.
Ex. 4.22: Sales, Pereira & Vieira [40] discuss numerous arma-type models
in connection with the Brazilian Electrical Sector. A significant pro-
portion of electricity is sourced from hydroelectricty in Brazil. In their
paper, the author use arma-type models to model natural monthly av-
erage flow rate (in cubic metres per second) of the reservoir of Furnas
on the Grande River in Brazil. Initially, the logarithm of the data was
found to create a time series {Ft }, and then an arma(1, 1) model was
fitted. The information in Table 4.1 comes from their Table 2.


Table 4.1: Parameter estimates and standard errors for the arma(1, 1)
model fitted by Sales, Pereira & Vieira [40].
Parameter Estimate Standard Error
φ1 0.8421 0.0237
θ1 −0.2398 0.0426
σe2 0.4343

(a) Write down the fitted model.


(b) Convert the model to a pure ma model.
(c) Develop one-, two- and three- step ahead forecasts for the log of
the flowrate.
(d) Determine 95% confidence intervals for each of these forecasts.

Ex. 4.23: Consider the arma(2, 2) model for the Palmer Drought Index
seen in Example 4.4. Write this model using the backshift operator.
Then create forecasting formulae for forecasting one-, two-, three- and
four-steps ahead.

4.8.1 Answers to selected Exercises

4.11 The models are arma(1, 1); ar(2) (or arma(2, 0)); arma(1, 1); ma(1)
(or arma(0, 1)); arma(1, 3).

4.13 The means are: E[A] = 8.75; E[X] ≈ 13.0; E[Y ] = 0; E[R] = 0;
E[P ] ≈ 6.44.

4.15 (a) (1 + 0.44B)At = 12.6 + (1 + 0.37B)et ;


(b) (1 − 0.24B + 0.38B 2 )Xn = 14.8 + en ;
(c) (1 + 0.44B)Yt+1 = (1 − 0.19B)et+1 ;
(d) Rn = (1 + 0.46B)en ;
(e) (1 + 0.35B)Pn+1 = 8.69 + (1 − 0.26B − 0.18B 2 + 0.11B 3 )en+1 .

4.17 (a) (1 − 0.2B 1 )Xn = (1 − 0.1B 1 )en ;


(b) The one-step ahead forecast is X̂n+1|n = 0.2Xn − 0.1en . The
two-step ahead forecast is X̂n+2|n = 0.2X̂n+1|n .
(c) We have θ′(B) = (1 − 0.1B)/(1 − 0.2B). Solving shows that
θ′(B) = 1 + 0.1B + 0.1(0.2)B² + · · · + 0.1(0.2)k−1 Bᵏ . The pure
ma model is therefore

Xt = et + 0.1et−1 + 0.1(0.2)et−2 + · · · + 0.1(0.2)k−1 et−k .


(d) The variance of the forecasting error for the one-step ahead forecast is σe² = 9.3. For the two-step ahead forecast, the variance of the forecast error is σe² + (0.1)²σe² = 9.393. The 95% confidence intervals therefore are X̂t+1|t ± 2√9.3 for the one-step ahead forecast; and X̂t+2|t ± 2√9.393 for the two-step ahead forecast.

4.20 Firstly, models must be in ma form to compute confidence intervals


for forecasts; secondly, sometimes the ma model will be the simplest
model in a given situation.

4.21 Hint: First write φ = exp(−1/K3 ), and the right-hand side looks like
the ar(1) part. Then, use the given relationship between It and et to
find It−1 and hence show that θ = φ(1 − c3 )/(1 − c3 φ) for the ma(1)
part.



Module 5
Finding a Model

Module contents
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Identifying a Model . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 The Autocorrelation Function . . . . . . . . . . . . . . . 75
5.2.2 Sample acf . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.3 Sample pacf . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.4 Tips for using the sample acf and pacf . . . . . . . . . 83
5.2.5 Model selection using aic . . . . . . . . . . . . . . . . . 83
5.2.6 Selecting arma models . . . . . . . . . . . . . . . . . . 84
5.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Preliminary estimation for ar models: The Yule–Walker
equations . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Parameter estimation in R . . . . . . . . . . . . . . . . 86
5.4 Forecasting using R . . . . . . . . . . . . . . . . . . . . . 88
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6.1 Answers to selected Exercises . . . . . . . . . . . . . . . 95


Module objectives

Upon completion of this module students should be able to:

ˆ understand the information contained in the sample autocorrelation


function (acf);

ˆ understand the information contained in the sample partial acf (PACF);

ˆ use the sample acf and sample pacf to select ar and ma models for
time series data;

ˆ use r to plot the sample acf and pacf for time series data;

ˆ write down the fitted ar or ma model given the r output;

ˆ use the Akaike Information Criterion (aic) to select the order of ar


models for time series data using r;

ˆ understand that selecting arma models is more difficult than selecting


ar and ma models;

ˆ compute initial parameter estimates of an ar model using the Yule–


Walker equations;

ˆ use r to compute predictions from an ar or ma model;

ˆ use r to compute parameter estimates for ar and ma models of a given


order.

5.1 Introduction

In this Module, methods are discussed for finding the best model for a par-
ticular time series. This consists of two stages: first, determining which type
of model is appropriate for the given data (for example, ar(1) or ma(2));
then secondly estimating the parameters in the chosen model.

The choice of ar, ma and arma models are discussed, as well as the number
of parameters necessary for the chosen type of model.

The two most important tools in making these decisions are the sample
autocorrelation function (acf) and sample partial autocorrelation function
(pacf).


5.2 Identifying a Model

5.2.1 The Autocorrelation Function

The autocorrelation function, or acf, was studied in earlier Modules. The


approach then was to take a given model and deduce the acf characteristic
of that particular model.

In practice, the scientist doesn’t start with a known model, but instead starts
with data for which a model is sought. Using software, the acf is estimated
from the data (using a sample acf), and the characteristics of the sample
acf are used to select the best model.

5.2.2 Sample ACF

The autocorrelation function is estimated from the data using the formulae
γ̂k = (1/N) Σ_{i=1}^{N−k} (Xi − µ̂)(Xi+k − µ̂),   k ≥ 0

ρ̂k = γ̂k / γ̂0 ,   k ≥ 0
where N is the number of terms in the time series and µ̂ is the sample mean
of the time series. Of course, the actual computations are performed by
computer, using a package such as r. Since the quantities γ̂k (and hence
ρ̂k ) are estimated, there will be some sampling error. Formulae exist for
estimation of the sampling error but will not be given here. However, r uses
these formulae to produce approximate 95% confidence intervals for ρ̂k .
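As a check on these formulae, the lag-one estimates can be computed by hand and compared with r's built-in acf function. A sketch, assuming x is a numeric vector holding a time series:

> N <- length(x); mu.hat <- mean(x)
> gamma1 <- sum((x[1:(N - 1)] - mu.hat) * (x[2:N] - mu.hat))/N
> gamma0 <- sum((x - mu.hat)^2)/N
> gamma1/gamma0                  # rho-hat at lag one
> acf(x, plot = FALSE)$acf[2]    # r's value at lag one: should agree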

Consider the ma(2) model as used in Example 3.7 (p 49): Vn+1 = en+1 −
0.39en − 0.22en−1 , where σe2 = 2. In that example, the theoretical acf was
computed as
{ρ} = {1, −0.253, −0.183}.
The series {Vn } is simulated in r as follows:

> ma.terms <- c(-0.39, -0.22)


> sim.ma2 <- arima.sim(model = list(ma = ma.terms),
+ n = 1000, sd = sqrt(2))

Note the variance of the errors is given as 2.

The sample acf of this data is found as follows (Fig. 5.1):


[Figure: the sample acf of sim.ma2[10:1000] plotted against lag.]

Figure 5.1: The sample acf for the ma(2) model in Example 3.7 (p 49).

> acf(sim.ma2[10:1000])

Note the first few terms have been ignored; this allows the simulation to
recover from the initial (arbitrary) choice of errors needed to begin the simulation.

First, note the dotted horizontal lines on the plot. These indicate the approx-
imate 95% confidence intervals for ρ̂k . In other words, if the autocorrelation
value lies within the dotted lines, the value can be considered as zero; the
reason it is not exactly zero is due to sampling error only.

We would expect that the sample acf would demonstrate the features of the
acf for the model. Compare Figures 3.1 (p 50) and 5.1; the sample acf and
theoretical acf do look similar—they both show two components in the plot that are
larger than the rest when we ignore the term at a lag of zero which will always
be one. (Recall that only two acf values are outside the dotted confidence
bands, so the rest can be considered as zero, and that the first term will
always be one so is of no importance.) Notice there are two components in
the acf that are non-zero for a two-parameter ma model (that is, ma(2)).

In fact, this is typical. Here is one of the most important rules for identifying
time series models:


If the sample acf has k non-zero components from 1 to k, then


an ma(k) model is appropriate.

In r, the sample acf is produced by typing acf( time.series ) at the r


prompt, where time.series is the name of the time series.

Example 5.1: Consider the ar(2) model

Xn = 0.4Xn−1 − 0.3Xn−2 + en .

The theoretical acf can be computed and plotted in r (by first con-
verting to an ma model):

> imp <- as.ts(c(1, rep(0, 99)))


> ar.terms <- c(0.4, -0.3)
> theta <- filter(imp, ar.terms, "recursive")
> errorvar <- 1
> gamma <- convolve(theta, theta) * errorvar
> rho <- gamma/gamma[1]

Note we used σe2 = 1; it doesn’t matter what we use since we eventually


compute ρ anyway.
> plot(c(1, 10), c(1, -0.2), type = "n", las = 1,
+ main = "Actual ACF ", xlab = "Lag", ylab = "ACF")
> lines(rho, type = "h", lwd = 2)
> abline(h = 0)
This theoretical acf is shown in the top panel of Fig. 5.2.
Suppose we generated some random numbers from this time series and
computed the sample acf; we would expect the sample acf to look
similar to Fig. 5.2. Proceed:

> ar2.sim <- arima.sim(model = list(ar = ar.terms),


+ n = 1000)
> acf(ar2.sim[10:1000], lwd = 2, las = 1, lag.max = 10,
+ main = "Sample ACF")

This sample acf is shown in the bottom panel of Fig. 5.2. They are
very similar as expected.

Example 5.2:
Parzen [36] studied a time series of yearly snowfall in Buffalo from
1910 to 1972 (recorded to the nearest tenth of an inch):


[Figure: two panels plotting the acf against lag. Top: “Actual ACF”; bottom: “Sample ACF”.]

Figure 5.2: Top: the theoretical acf for the ar(2) model in Example 5.1;
Bottom: the sample acf for data simulated from the ar(2) model in Ex-
ample 5.1.


[Figure: top panel: the snowfall series sf plotted against time (1910–1970); bottom panel: the sample acf of sf.]

Figure 5.3: Yearly Buffalo snowfall from 1910 to 1972. Top: the plot of the
data; Bottom: the sample acf.

> bs <- read.table("buffalosnow.dat", header = TRUE)


> sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1)

The data are plotted in the top panel of Fig. 5.3. The time series
is small, but the series appears to be approximately stationary. The
sample acf for the data has been computed in r (Fig. 5.3, bottom
panel).
The acf has two non-zero terms (ignoring the term at lag zero, which
is always one), suggesting an ma(2) model is appropriate for modelling
the data. Note the confidence bands are approximate only. Here is
some of the code used to produce the plots:

> plot(sf, las = 1)


> acf(sf, lwd = 2)


5.2.3 Sample PACF

In the previous section, the acf was introduced to indicate the order of the
ma model appropriate for a dataset. How do we choose the appropriate
order of the ar model? To identify ar models, a partial acf is used, which
is explained below.

Consider three random variables X, Y and Z. Suppose X and Y are corre-


lated, and Y and Z are correlated. Does this mean X and Z will be cor-
related? Generally yes—because both are correlated with Y . If Y changes,
both X and Z will change, and so there will be a non-zero correlation be-
tween X and Z. Partial correlation measures the correlation between X
and Z after removing the effect of the variable Y on both X and Z.

Likewise, the partial autocorrelation measures the correlation between Xi


and Xi+k after removing the effect of the joint correlations with Xi+1 , Xi+2 ,
. . . , Xi+(k−1) . The number of non-zero terms in the partial acf or pacf
suggests the order of the ar model. Here is the second of the most important
rules for identifying time series models:

If the sample pacf has k non-zero components from 1 to k, then


an ar(k) model is appropriate.

In r, the sample pacf is produced by typing pacf( time.series ) at the


r prompt, where time.series is the name of the time series. Note there is
no term at a lag of zero for the sample pacf, as it makes no sense given the
explanation above about removing the effect of intermediate observations.

Example 5.3: Consider the ar(2) model from Example 5.1. As this is an
ar(2) model, the sample pacf from the simulated data is expected
to have two significant terms. The sample pacf (Fig. 5.4) has two
significant terms as expected.
As explained, note there is no term at a lag of zero for the sample
pacf.
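The sample pacf in Fig. 5.4 could be produced with a command like the
following (a sketch, assuming the simulated series ar2.sim from Example 5.1
is still available):

> pacf(ar2.sim[10:1000], lwd = 2, las = 1, lag.max = 10,
+     main = "Sample Partial ACF")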

Example 5.4: In Example 5.2 (p 77), the annual Buffalo snowfall data
was examined using the acf and an ma(2) model was found to be
suitable.


Figure 5.4: The sample pacf of data simulated from an ar(2) model.


Figure 5.5: The sample pacf of yearly Buffalo snowfall from 1910 to 1972.



Figure 5.6: Simulated ar(2) data. Two models have been used to make pre-
dictions; the simple model is better for prediction. Note the more complex
model predicts the series will increase linearly over time!

The sample pacf for the data has been computed in r (Fig. 5.5); there
is no term at a lag of zero for the sample pacf.
The pacf has only one non-zero term, suggesting an ar(1) model is
appropriate for modelling the data. Recall the acf suggested an ma(2)
model. Which model do we choose? Since the one-parameter ar model
is simpler than the two-parameter ma model, the ar(1) model would
be chosen as the best model. We almost certainly do not need both
ma(2) and ar(1) components in the model. (Later, we will learn about
other criteria that also help make this decision.)
Now that an ar(1) model is chosen, it remains to estimate the param-
eters of the model. This will be discussed in Sect. 5.3.

Example 5.5: Consider some simulated ar(2) data. An ar(2) model and
a more complicated model (an arima(9, 2, 9); we learn about arima
models in Sect. 7.4) are fitted to the data. Predictions can be made
using both models; these predictions are compared in Fig. 5.6.
The simple model is far better for making predictions!


Table 5.1: Typical features of a sample acf and sample pacf for ar and
ma models. The ‘slow decay’ may not always be observed.
              acf                  pacf
ar(k) model   slow decay           k non-zero terms
ma(k) model   k non-zero terms     slow decay

5.2.4 Tips for using the sample ACF and PACF

When using the sample acf and pacf it is important to realize they are
obtained from sample information. This means they have sampling error. To
allow for this, the dotted lines produced by r represent confidence intervals
(95% by default). This implies a small number of terms (about 1 in 20) will
lie outside the dotted lines even if they are truly zero. In addition, these
confidence intervals are approximate only. Since 5% (or 1 in 20) components
are expected to be outside these approximate limits anyway, it is important
not to place too much emphasis on terms in the sample acf and pacf that
are marginal. For example, if the sample acf has two significant terms, but
one is just over the confidence bands, perhaps an ma(1) model will be just
as good as an ma(2). Tools for assisting in making this decision will be
considered in Module 6.

An ar(k) model is implied by a sample pacf with non-zero terms from 1


to k, and typically (but not always) the terms in the sample acf will decay
slowly toward zero. Similarly, a ma(k) model will be implied by a sample acf
with k non-zero terms from 1 to k, and typically (but not always) the terms
in the sample pacf will decay slowly toward zero. Table 5.1 summarizes
these very important facts for selecting time series models.

5.2.5 Model selection using AIC

Another method of selecting the order of the ar model is to
use the Akaike Information Criterion (aic). The aic is used in many areas
of statistics, and details will not be considered here. The aic, in general
terms, determines the size of the errors by evaluating the log-likelihood, but
also penalizes overfitting of models by including a penalty term (usually
twice the number of parameters used). While including extra (but possibly
unnecessary) parameters in the model will reduce the size of the errors,
the penalty function ensures these unnecessary terms will be less attractive
when using the aic. There are numerous variations of the aic which use
different forms for the penalty function, and often produce different models


than produced using the aic. In each case, the model with the minimum
aic is selected.
In r, the function ar uses the aic to select the order of the ‘best’ ar model;
unfortunately, ma and arma models are not considered.
The advantage of this method is that it is automatic, and any two people using
the same data and software will select the same model. The disadvantage is
the computer is very strict in its decision making and does not allow for a
human’s expert knowledge or interpretation of the information.

Example 5.6: Using the snowfall data from Example 5.4 (p 80), the func-
tion ar can be used to select the order of the ar model.
> sf.armodel <- ar(sf)
> sf.armodel

Call:
ar(x = sf)

Coefficients:
1 2
0.2379 0.2229

Order selected 2 sigma^2 estimated as 500.7

(We will consider writing down the actual model in Sect. 5.3.2.)
Thus the ar function recommends an ar(2) model (from the line
Order selected 2). There are therefore three models to consider:
an ma(2) from the sample acf; an ar(1) from the sample pacf; and
now an ar(2) from r using the aic. Which do we choose?
This predicament happens often in time series analysis: there are often
many good models from which to choose. In Module 6, some methods
will be discussed for evaluating various models. If one of the models
appears better than the others using these methods, that model should
be chosen. But what if they all appear to be equally good? In that
case, the simplest model would be chosen—the ar(1) model in this
case.
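As a sketch, the aic values of the three candidate models can also be
compared directly (assuming sf holds the snowfall series as before); the
model with the smallest aic is preferred:

> # fit each candidate model and extract its AIC
> sapply(list(ar1 = arima(sf, order = c(1, 0, 0)),
+     ar2 = arima(sf, order = c(2, 0, 0)),
+     ma2 = arima(sf, order = c(0, 0, 2))), AIC)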

5.2.6 Selecting ARMA models

Selecting arma models is not easy from the acf and the pacf. To select
arma models, it is first necessary to study some diagnostics of ar and ma


models in the next Module. The issue of selecting arma models will be
reconsidered in Sect. 6.3.

5.3 Parameter estimation

Previous sections have given the basis for selecting an ar or ma model for a
given data set, and for determining the order of the model. This section now
discusses how to estimate the unknown parameters in the model using r.
The actual mathematics is not discussed here; indeed, it is not easy.

5.3.1 Preliminary estimation for AR models: The Yule–Walker


equations

Consider the ar model in Equation (2.1). If the number of terms p
in the series is finite, it is possible to write down a system of equations for
calculating the autoregressive coefficients {φk }pk=1 from the autocorrelation
coefficients {ρk }k≥0 .

Multiplying Equation (2.1) by Xn−k and taking expectations, we obtain

γk = φ1 γk−1 + · · · + φp γk−p

for k ≥ 1. Dividing through by γ0 ,

ρk = φ1 ρk−1 + · · · + φp ρk−p ,                                      (5.1)

for k ≥ 1. The set of equations (5.1) with k = 1, . . . , p can be written as a
matrix equation, and solved for the coefficients φk . These are known as the
Yule–Walker equations. In matrix form, we have

\begin{pmatrix}
1          & \rho_1     & \rho_2     & \cdots & \rho_{p-1} \\
\rho_1     & 1          & \rho_1     & \cdots & \rho_{p-2} \\
\vdots     & \vdots     & \vdots     & \ddots & \vdots     \\
\rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}
=
\begin{pmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{pmatrix} .          (5.2)

This matrix equation can be solved for the coefficients {φk }pk=1 via the formula

\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}
=
\begin{pmatrix}
1          & \rho_1     & \rho_2     & \cdots & \rho_{p-1} \\
\rho_1     & 1          & \rho_1     & \cdots & \rho_{p-2} \\
\vdots     & \vdots     & \vdots     & \ddots & \vdots     \\
\rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & 1
\end{pmatrix}^{-1}
\begin{pmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{pmatrix} .          (5.3)


Example 5.7: Suppose we have a set of time series data. A plot of the
acf reveals that the first few non-zero terms of the acf (and hence the
ρk values) are 0.36, −0.14, 0.01 and −0.03. We could use the Yule–Walker
equations to determine approximate values for φk :

\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \end{pmatrix}
=
\begin{pmatrix}
 1    &  0.36 & -0.14 &  0.01 \\
 0.36 &  1    &  0.36 & -0.14 \\
-0.14 &  0.36 &  1    &  0.36 \\
 0.01 & -0.14 &  0.36 &  1
\end{pmatrix}^{-1}
\begin{pmatrix} 0.36 \\ -0.14 \\ 0.01 \\ -0.03 \end{pmatrix},

which gives φ = (0.6032, −0.5247, 0.3708, −0.2430). Using more terms
would give estimates of φk for k > 4, but this is sufficient to demon-
strate the use of the Yule–Walker equations.
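A sketch of this computation in r, using solve on the matrix equation
(the values are those given above):

> # autocorrelation matrix and right-hand side from the example
> R <- matrix(c(1, 0.36, -0.14, 0.01,
+     0.36, 1, 0.36, -0.14,
+     -0.14, 0.36, 1, 0.36,
+     0.01, -0.14, 0.36, 1), nrow = 4, byrow = TRUE)
> r <- c(0.36, -0.14, 0.01, -0.03)
> solve(R, r)   # approximately 0.6032 -0.5247 0.3708 -0.2430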

The Yule–Walker equations are used to find an initial estimate of the pa-
rameters. Note also that they are based on finding parameters for an ar
model only.

5.3.2 Parameter estimation in R

The function used by r to estimate parameters in arma models is the
function arima. To demonstrate how to use this function, consider again
the yearly Buffalo snowfall from Example 5.2 (p 77), Example 5.4 (p 80)
and Example 5.6 (p 84). In these examples, the following models were
considered: ar(1) (from the pacf); ma(2) (from the acf); and a ar(2)
(from the aic).

Example 5.8: To fit the ar(1) model, use

> snow.ar1 <- arima(sf, order = c(1, 0, 0))


> snow.ar1

Call:
arima(x = sf, order = c(1, 0, 0))

Coefficients:
ar1 intercept
0.3302 80.8809
s.e. 0.1236 4.1722

sigma^2 estimated as 496.8: log likelihood = -285.01, aic = 576.01


Importantly, r always fits a model to the mean-corrected time series.


That is, the mean of the series is subtracted from the observations
before computing the acf and pacf. Hence, if yearly Buffalo snowfall
is {Bt }, the output indicates the fitted model is

Bt − 80.88 = 0.3302(Bt−1 − 80.88) + et .

Rearranging produces the model

Bt = 54.17 + 0.3302Bt−1 + et .

The parameter estimates are also given in the output. Either form is
acceptable as the final model.
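As a quick check (a sketch, using the fitted object from above), the
constant 54.17 is the intercept multiplied by one minus the ar coefficient:

> # the constant term implied by the mean-corrected form
> co <- coef(snow.ar1)
> unname(co["intercept"] * (1 - co["ar1"]))   # about 54.17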

Example 5.9: Similarly, the ar(2) model is found thus:

> snow.ar2 <- arima(sf, order = c(2, 0, 0))


> snow.ar2

Call:
arima(x = sf, order = c(2, 0, 0))

Coefficients:
ar1 ar2 intercept
0.2542 0.2373 81.5422
s.e. 0.1262 0.1262 5.2973

sigma^2 estimated as 469.6: log likelihood = -283.3, aic = 574.59

This indicates the ar(2) model is

Bt − 81.54 = 0.2542(Bt−1 − 81.54) + 0.2373(Bt−2 − 81.54) + et .

Rearranging produces

Bt = 41.46 + 0.2542Bt−1 + 0.2373Bt−2 + et .

Comparing the aic for the two ar models shows that the ar(2) model
is only slightly better using this criterion than the ar(1) model.
The output from using the function ar can also be used to write down
the fitted model but it doesn’t estimate the intercept; see Example 5.6.
The estimates are also slightly different as a different algorithm is used
for estimating the parameters.


Example 5.10: To fit the ma(1) model, use

> snow.ma1 <- arima(sf, order = c(0, 0, 1))


> snow.ma1

Call:
arima(x = sf, order = c(0, 0, 1))

Coefficients:
ma1 intercept
0.2104 80.5421
s.e. 0.0982 3.4616

sigma^2 estimated as 517.6: log likelihood = -286.27, aic = 578.53

This indicates the ma(1) model is

Bt − 80.54 = et + 0.2104et−1 ,

or

Bt = 80.54 + et + 0.2104et−1 .

In general, the model is fitted with arima using the order option. The first
component in order is the order of the ar component, and the third is the
order of the ma component. What is the second term?

The second term is only necessary if the series is non-stationary. Module 7
discusses this issue, where the meaning of the second term in the
order parameter will be explained.
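For example, an arma(2, 1) model for the snowfall series would be fitted
(as a sketch) with:

> # ar order 2; middle term 0 for a stationary series; ma order 1
> arima(sf, order = c(2, 0, 1))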

5.4 Forecasting using R

Once a model has been found, r can be used to make forecasts. The function
to use is predict. The following example shows how to use this function.

Example 5.11: To demonstrate how to use this function, consider again


the yearly Buffalo snowfall recently seen in Examples 5.8 to 5.10. The
data contain the annual snowfall in Buffalo up to 1972.
Consider just Example 5.8, where an ar(1) model was fitted. To make
a forecast, the following commands are used (note that the object
snow.ar1 was created earlier by fitting an ar(1) model to the data):


> snow.pred <- predict(snow.ar1, n.ahead = 10)


> snow.pred

$pred
Time Series:
Start = 1973
End = 1982
Frequency = 1
[1] 90.49534 84.05536 81.92903 81.22696 80.99516
[6] 80.91862 80.89335 80.88500 80.88225 80.88134

$se
Time Series:
Start = 1973
End = 1982
Frequency = 1
[1] 22.28815 23.47162 23.59705 23.61068 23.61217
[6] 23.61233 23.61235 23.61235 23.61235 23.61235

r has made predictions for the next ten years based on the ar(1)
model, and has included the standard errors of the forecasts as well.
(This makes it easy to compute confidence intervals.) Notice the
forecasts from about six years ahead and further are almost the same.
This implies that the model has little skill at forecasting that far ahead
(which is not surprising). Forecasts a long way into the future tend
towards the mean of the series, which is reasonable.
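For example, approximate 95% limits for the forecasts can be sketched
directly from the output above:

> # approximate 95% limits for the forecasts
> lower <- snow.pred$pred - 1.96 * snow.pred$se
> upper <- snow.pred$pred + 1.96 * snow.pred$se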
The data and the forecasts can be plotted together (Fig. 5.7) as follows:

> snow.and.preds <- ts.union(sf, snow.pred$pred)
> plot(snow.and.preds, plot.type = "single",
+ lty = c(1, 2), lwd = 2, las = 1)

Similar forecasts and plots can be constructed from the other types of
models (that is, ma or arma models) in a similar way. The forecasts
are shown for each of these models in Table 5.2.

5.5 Summary

This Module considered the identification of ar and ma models for a given


set of stationary time series data, primarily using the acf and the pacf.
The Akaike Information Criterion (aic) was also considered.



Figure 5.7: Forecasting the Buffalo snowfall data ten years ahead. There is
little skill in the forecast after a few years. The forecasts are shown using a
dashed line.

Table 5.2: Comparison of the predictions for forecasting ten-steps ahead
using the ar(1), ar(2) and ma(2) models for the Buffalo snowfall data.

Step   ar(1)   ar(2)   ma(2)
1 90.50 92.44 86.18
2 84.06 91.06 85.90
3 81.93 86.55 80.90
4 81.23 85.07 80.90
5 81.00 83.63 80.90
6 80.92 82.91 80.90
7 80.89 82.38 80.90
8 80.89 82.08 80.90
9 80.88 81.88 80.90
10 80.88 81.76 80.90


Note that most time series (including climatological time series) are not
stationary, but the methods developed so far apply only to stationary data.
In Module 7, non-stationary time series will be examined.

5.6 Exercises
Ex. 5.12: Consider a time series {L}. The fitted model is an arma(1, 0)
model.

(a) The model is a special case of an arma model. What is another


way of expressing the model?
(b) Write this model using the backshift operator.
(c) Sketch the possible sample acf and pacf that lead to the selec-
tion of this model.

Ex. 5.13: Consider a time series {Y }. The fitted model is an arma(0, 2)


model.

(a) The model is a special case of an arma model. What is another


way of expressing the model?
(b) Write this model using the backshift operator.
(c) Sketch the possible sample acf and pacf that lead to the selec-
tion of this model.

Ex. 5.14: The mean annual streamflow in Cache River at Forman, Illinois,
from 1925 to 1988 is given in the file cacheriver.dat. (The data are
not reported by calendar year, but by ‘water year’. A water year starts
in October of the calendar year one year less than the water year and
ends in September of the calendar year the same as the water year. For
example, water year 1980 covers the period October 1, 1979 through
September 30, 1980. However, this does not affect the model or your
analysis.) There are two variables of interest: Mean reports the mean
annual flow, and Max reports the maximum flow each water year, each
measured in cubic feet per second. (The data have been obtained from
USGS [4].)

(a) Use r to find a suitable model for the mean annual stream flow
using the acf and pacf.
(b) Use r to find a suitable model for the maximum annual stream
flow using the function ar and the sample acf and sample pacf.
(c) Using your chosen model, produce forecasts up to three-steps
ahead.


0.77 1.74 0.81 1.20 1.95 1.20 0.47 1.43


3.37 2.20 3.00 3.09 1.51 2.10 0.52 1.62
1.31 0.32 0.59 0.81 2.81 1.87 1.18 1.35
4.75 2.48 0.96 1.89 0.90 2.05

Table 5.3: Thirty consecutive values of March precipitation in inches at
Minneapolis, St Paul. The data should be read across the rows.

Ex. 5.15: Simulate the ar(2) model

Rn+1 = 0.2Rn − 0.4Rn−1 + en+1

where {e} ∼ N (0, 4). Compute the sample acf and sample pacf from
this simulated data. Do they show the features you expect?

Ex. 5.16: Simulate the ma(2) model

Xt = −0.3et−1 − 0.2et−2 + et

where {e} ∼ N (0, 8). Compute the sample acf and sample pacf from
this simulated data. Do they show the features you expect?

Ex. 5.17: The data in Table 5.3 are thirty consecutive values of March
precipitation in inches for Minneapolis, St. Paul obtained from Hand
et al. [19]. The years are not given. (The data are available in the
data file minn.txt.)

(a) Load the data into r and find a suitable model (ma or ar) for
the data.
(b) Produce forecasts up to three-steps ahead with your chosen model.

Ex. 5.18: The data in the file lake.dat give the mean annual levels at
Lake Victoria Nyanza from 1902 to 1921, relative to a fixed reference
point (units are not given). The data are from Shaw [41] as quoted in
Hand et al. [19]. Explain why an ar, ma or arma model cannot be fitted to
this data set.

Ex. 5.19: The Easter Island sea level air pressure anomalies from 1951 to
1995 are given in the data file easterslp.dat, which were obtained
from the IRI/LDEO Climate Data Library (http://ingrid.ldgo.
columbia.edu/). Find a suitable ar or ma model for the series using
the sample acf and pacf. Use this model to forecast up to three
months ahead.


Ex. 5.20: The Western Pacific Index (WPI) measures the mode of low-
frequency variability over the North Pacific. The time series in the data
file wpi.txt is from the Climate Prediction Center [3] and the Climate
Diagnostic Centre [2], and gives the monthly WPI from January 1950
to December 2001.

(a) Confirm that the data are approximately stationary by plotting


the data.
(b) Find an appropriate model for the data using the acf and pacf.
(c) Find an appropriate model using the ar function.
(d) Which model is your preferred model? Explain your answer.
(e) Find parameter estimates for your preferred model.

Ex. 5.21: The seasonal average SOI from (southern hemisphere) summer
1876 to (southern hemisphere) summer 2001 is given in the file soiseason.dat.

(a) Confirm that the data are approximately stationary by plotting


the data.
(b) Find an appropriate model for the data using the acf and pacf.
(c) Find an appropriate model using the ar function.
(d) Which model is your preferred model? Explain your answer.
(e) Find parameter estimates for your preferred model.

Ex. 5.22: The monthly average solar flux from January 1948 to December
2002 is given in the file solarflux.txt.

(a) Confirm that the data are approximately stationary by plotting


the data.
(b) Find an appropriate model for the data using the acf and pacf.
(c) Find an appropriate model using the ar function.
(d) Which model is your preferred model? Explain your answer.
(e) Find parameter estimates for your preferred model.

Ex. 5.23: The acf in Fig. 5.8 was produced for a time series {P }. In this
question, the Yule–Walker equations are used to form initial estimates
for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker
equations, and solve for the ar parameters. (Any terms within
the confidence limits can be assumed to be zero.)
(b) Repeat, but use four terms of the acf. Compare your answers to
those in part (a).



Figure 5.8: The acf for the time series {P }.

Ex. 5.24: The acf in Fig. 5.9 was produced for a time series {Q}. In this
question, the Yule–Walker equations are used to form initial estimates
for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker
equations, and solve for the ar parameters. (Any terms within
the confidence limits can be assumed to be zero.)
(b) Repeat, but use four terms of the acf. Compare your answers to
those in part (a).
(c) Repeat, but use five terms of the acf. Compare your answers to
those in parts (a) and (b).

Ex. 5.25: The acf in Fig. 5.10 was produced for a time series {R}. In this
question, the Yule–Walker equations are used to form initial estimates
for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker
equations, and solve for the ar parameters. (Any terms within
the confidence limits can be assumed to be zero.)
(b) Repeat, but use four terms of the acf. Compare your answers to
those in part (a).



Figure 5.9: The acf for the time series {Q}.

(c) Repeat, but use five terms of the acf. Compare your answers to
those in parts (a) and (b).

5.6.1 Answers to selected Exercises

5.12 (a) ar(1) model.


(b) (1 − φB)Ln = en for some value of φ.
(c) The possible sample acf and pacf are shown in Fig. 5.11. The
actual details of the plots are not important; what is important is
that there is only one significant term in the sample pacf and
the sample acf takes a long time to decay (and the term at lag
zero in the acf is one as always).

5.14 (a) The time series is plotted in Fig. 5.12. The data appears to be
approximately stationary. The sample acf and pacf are shown
in Fig. 5.13.
The sample acf has no significant terms, suggesting no particular
ma model will be useful. The sample pacf has only one term
marginally significant at a lag of 14. This suggests that there is



Figure 5.10: The acf for the time series {R}.

no obvious ar model. What is the conclusion? The conclusion is


that there is no suitable ar or ma model for modelling the data.
In fact, it suggests that the observations are actually random,
and therefore unpredictable. Using the function ar suggests the
same.
Notice that a lot of work is sometimes needed to come to the con-
clusion that no model is useful. This does not mean the exercise
has been a waste of time—after all, it is now known that there
is no useful ar or ma model, which is in itself useful informa-
tion. If {St } is the mean annual streamflow, then the model is
St = m + et for the appropriate value of m (which will be the
mean in this case). Since the mean value of the mean stream-
flow is 299.3, the model is St = 299.3 + et . The forecasts up to
three-steps ahead are all 299.3.
(b) Using the function ar, a suitable ar model is an ar(5) model:
> cr <- read.table("cacheriver.dat", header = TRUE)
> ar(cr$Max)
Call:
ar(x = cr$Max)



Figure 5.11: A possible sample acf and pacf for an ar(1) model. The acf
is shown in the top plot; the pacf in the bottom plot.

Coefficients:
1 2 3 4 5
-0.2096 -0.1298 -0.3270 -0.2821 -0.2111

Order selected 5 sigma^2 estimated as 4662001


In contrast, using the acf and pacf would suggest that the data
are random. This is an example of a situation where the human is
probably correct, and the computer doesn’t actually know best.
(c) The chosen model is St = 4133 + et where 4133 is the mean. The
forecasts are all 4133.

5.19 The time series is plotted in Fig. 5.14. The data appears to be approx-
imately stationary. The sample acf and pacf are shown in Fig. 5.15.
The sample acf has seven significant terms, suggesting an ma(7)



Figure 5.12: A plot of the mean annual streamflow in cubic feet per second
at Cache River, Illinois, from 1925 to 1988.



Figure 5.13: The sample acf and pacf of the mean annual streamflow in
cubic feet per second at Cache River, Illinois, from 1925 to 1988. Top: the
sample acf; Bottom: the sample pacf.



Figure 5.14: A plot of the Easter Island sea level air pressure anomaly from
1951 to 1995.



Figure 5.15: The sample acf and pacf of the Easter Island sea level air
pressure anomaly. Top: the sample acf; Bottom: the sample pacf.


model. It is likely that a more compact ar model can be found. The
sample pacf suggests an ar(3) model may be appropriate (the terms
at lags 5 and 6 are so marginal they can probably be ignored). The
second term is not significant, but the third term in the pacf is sig-
nificant, so an ar(3) model is needed if the significant term at
lag 3 is to be included. Given the choice of either ma(7) or ar(3), the
more compact ar model is to be preferred. The code used to generate
the above plots is shown below:

> ei <- read.table("easterslp.dat", header = TRUE)


> eislp <- ts(ei$slpa, start = c(1951, 1), frequency = 12)
> plot(eislp, main = "", las = 1)
> acf(eislp, main = "")
> pacf(eislp, main = "")

To estimate the parameters, use

> eislp.model <- arima(eislp, order = c(3, 0, 0))
> eislp.model$coef

ar1 ar2 ar3 intercept


0.25139496 0.02663228 0.16891009 -0.15173751

The fitted ar model is therefore

Et = 0.251Et−1 + 0.0266Et−2 + 0.1689Et−3 + et ,

if {Et } is the Easter Island sea level air pressure anomaly.
The one-step ahead forecast is

Êt+1|t = 0.251Et + 0.0266Et−1 + 0.1689Et−2 .

The last few values in the series are:


> length(eislp)

[1] 540

> eislp[535:540]

[1] 0.1 3.0 2.9 -1.2 3.2 -1.0

So the one-step ahead forecast is

Êt+1|t = 0.251 × (−1.0) + 0.0266 × 3.2 + 0.1689 × (−1.2) = −0.36896,

and likewise for further steps ahead.
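The forecasts can be checked with predict (a sketch; predict works from
the mean-corrected form, so its values will differ slightly from the hand
computation above, which ignored the small intercept):

> predict(eislp.model, n.ahead = 3)$pred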


5.23 From the acf, ρ1 ≈ 0.3, ρ2 ≈ −0.2 and ρ3 ≈ 0.2 (and the rest are
essentially zero). So the matrix equation is

\begin{pmatrix} 1 & 0.3 & -0.2 \\ 0.3 & 1 & 0.3 \\ -0.2 & 0.3 & 1 \end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{pmatrix}
=
\begin{pmatrix} 0.3 \\ -0.2 \\ 0.2 \end{pmatrix}

with solution

\begin{pmatrix} 0.5416667 \\ -0.5 \\ 0.4583333 \end{pmatrix} .

In r:

> Mat <- matrix(data = c(1, 0.3, -0.2, 0.3,
+     1, 0.3, -0.2, 0.3, 1), byrow = FALSE,
+     nrow = 3, ncol = 3)
> rhs <- matrix(nrow = 3, data = c(0.3, -0.2,
+ 0.2))
> sol1a <- solve(Mat, rhs)
> Mat <- matrix(data = c(1, 0.3, -0.2, 0.2,
+ 0.3, 1, 0.3, -0.2, -0.2, 0.3, 1, 0.3,
+ 0.2, -0.2, 0.3, 1), byrow = FALSE, nrow = 4,
+ ncol = 4)
> rhs <- matrix(nrow = 4, data = c(0.3, -0.2,
+ 0.2, 0))
> sol1b <- solve(Mat, rhs)
> sol1b

[,1]
[1,] 0.7870968
[2,] -0.7677419
[3,] 0.7483871
[4,] -0.5354839

The solutions are very different. In practice, all the available informa-
tion is used (and hence very large matrices result).



Module 6
Diagnostic Tests

Module contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Residual acf and pacf . . . . . . . . . . . . . . . . . . . . 107
6.3 Identification of arma models . . . . . . . . . . . . . . . 110
6.4 The Box–Pierce test (Q-statistic) . . . . . . . . . . . . . 116
6.5 The cumulative periodogram . . . . . . . . . . . . . . . 117
6.6 Significance of parameters . . . . . . . . . . . . . . . . . 118
6.7 Normality of residuals . . . . . . . . . . . . . . . . . . . . 119
6.8 Alternative models . . . . . . . . . . . . . . . . . . . . . 120
6.9 Evaluating the performance of a model . . . . . . . . . 121
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.11.1 Answers to selected Exercises . . . . . . . . . . . . . . . 124

Module objectives

Upon completion of this module students should be able to:

ˆ use r to create residual acf and pacf plots;


ˆ understand the information contained in residual acf and pacf plots;

ˆ use the residual acf and pacf to identify arma models;

ˆ write down the arma model from the r output;

ˆ use r to make forecasts from a fitted arma model;

ˆ use r to evaluate the Box–Pierce statistic and Ljung–Box statistic and


understand what they imply about the fitted model;

ˆ use r to create a cumulative periodogram and understand what it


implies about the fitted model;

ˆ use r to create Q–Q plot and understand what it implies about the
fitted model;

ˆ use r to test the significance of fitted parameters in a fitted model;

ˆ fit competing models to a time series, and use the appropriate tests to
compare the possible models;

ˆ select a good model for given stationary time series data.

6.1 Introduction

Once a model is fitted, it is important to know if the model is a ‘good’ model,


or if it can be improved. But first, what is a ‘good’ model? A good model
should be able to capture the important features of the data, or, in other
words, capture the signal . After removing the signal from the time series,
only random noise should remain. So to test if a model is a good model
or not, the noise is usually tested to ensure it is indeed random (and hence
unpredictable). If the residuals are somehow predictable, the model should
be refined so the residuals are unpredictable and random.

In addition, a good model is as simple as possible. To ensure the model is


as simple as possible, each term in the model should be tested to make sure
it is significant; otherwise, the insignificant parameters should be removed
from the model.

The process of evaluating a model is called diagnostic testing. A number of


diagnostic tests are considered in this Module.


6.2 Residual ACF and PACF

Since the residuals should be white noise (that is, they are independent and
contain no predictable elements), the acf and pacf of the residuals
should contain no hint of being forecastable. In other words, the terms of the
residual acf and residual pacf should all lie between the (approximate) 95%
confidence limits. If not, there are elements in the residuals that are forecastable,
and these forecastable aspects should be included in the signal of the model.

Example 6.1: In Sect. 5.3 (p 85), numerous models were fitted to the
yearly Buffalo snowfall data first introduced in Example 5.2 (p 77).
Two of those models were ar models. Here, consider the ar(1) model.
The model was fitted in Example 5.8 (p 86).
There are two ways to do diagnostic tests in r. The first way is to
use the tsdiag function; this function plots the standardized residuals
in order and plots the acf of the residuals. (It also produces another
plot studied in Sect. 6.4). Here is how the function can be used:

> par(mfrow = c(1, 1))


> bs <- read.table("buffalosnow.dat", header = TRUE)
> sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1)
> ar1 <- arima(sf, order = c(1, 0, 0))
> tsdiag(ar1)

The result is shown in Fig. 6.1. The middle panel in Fig. 6.1 indicates
the residual acf is fine and no model could be fitted to the residuals.
The second method involves using the output object from the arima
command, as shown below.

> ar1 <- arima(sf, order = c(1, 0, 0))


> names(ar1)

[1] "coef" "sigma2" "var.coef" "mask"


[5] "loglik" "aic" "arma" "residuals"
[9] "call" "series" "code" "n.cond"
[13] "model"

The residuals are given by ar1$resid, or more directly as resid(ar1):

> summary(resid(ar1))

Min. 1st Qu. Median Mean 3rd Qu. Max.


-65.6600 -14.6800 1.4540 -0.2791 16.8700 47.3700



Figure 6.1: Diagnostic plots after fitting an ar(1) model to the yearly Buffalo
snowfall data. This is the output of using the tsdiag command in r.



Figure 6.2: Diagnostic plots after fitting an ar(1) model to the yearly Buffalo
snowfall data. Top: the residual acf; Bottom: the residual pacf.


These residuals can be used to perform diagnostic tests. For example,


the residual acf and residual pacf are shown in Fig. 6.2.
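The plots in Fig. 6.2 can be produced directly from the residuals:

> acf(resid(ar1))
> pacf(resid(ar1))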
The residual acf and pacf indicate the residuals (or the noise) are
not forecastable. This suggests the ar(1) model fitted in Sect. 5.3 is
adequate, considering this single criterion.

6.3 Identification of ARMA models

Using the residual acf and pacf is often how arma models are fitted. A
researcher may look at the sample acf and sample pacf and conclude an
ar(2) model is appropriate. After fitting such a model, an examination of
the residual acf and residual pacf indicates an ma(1) model now seems
appropriate. The best model for the data might then be an arma(2, 1) model.
The researcher would hope the residuals from this arma(2, 1) would be
white noise. As was alluded to in Sect. 6.2, using the residual acf and pacf
allows arma models to be identified.

Example 6.2: In Example 4.3, Chu & Katz [13] were said to fit an arma(1, 1)
model to the monthly SOI time series from January 1935 to August
1983. In this example we see how that model may have been chosen.
Keep in mind that selecting arma models is very much an art and
requires experience to do well.
As with any time series, the data must be stationary (Fig. 1.3, bot-
tom panel, p 8), which they appear to be. The next step is to look at
the acf and pacf (Fig. 6.3). The acf suggests a very large order ma
model; the pacf suggests possibly an ar(2) model or an ar(4) model.
To begin, select an ar(2) model as it is simpler and the terms at lags 3
and 4 are only just significant; if an ar(4) model is necessary, it will
become apparent in the diagnostic analysis. The code so far:

> ms <- read.table("soiphases.dat", header = TRUE)


> acf(ms$soi)
> pacf(ms$soi)
> ms.ar2 <- arima(ms$soi, order = c(2, 0, 0))

The residuals can be examined now to see if the fitted ar(2) model
is adequate using the residual acf and pacf from the ar(2) model
(Fig. 6.4).
The residual acf suggests the model is reasonable, but the residual
pacf suggests at least one ma term at lag 2 may be necessary. (There



Figure 6.3: The acf and pacf of monthly SOI. Top: the acf; Bottom: the
pacf.



Figure 6.4: The acf and pacf of residuals for the ar(2) model fitted to the
monthly SOI. Top: the acf; Bottom: the pacf.


are significant terms at lags 5, 6, 14 and 15 also; it is more common
that observations will be strongly related to more recent observations
than to those some time ago. Initially, then, deal with the problem at
lag 2; if the problems at the other lags persist, they can be dealt with
later.)
This is surprising as we fitted an ar(2) model which we would ex-
pect to account for significant terms at lag 2. This suggests trying to
add an ma(2) component to the ar(2) component above, making an
arma(2, 2) model. Fit this and look again at the residual plots:

> acf(ms.ar2$residuals)
> pacf(ms.ar2$residuals)
> ms.arma22 <- arima(ms$soi, order = c(2, 0,
+ 2))
> acf(ms.arma22$residuals)
> pacf(ms.arma22$residuals)

Again the residual acf looks fine; the residual pacf looks better, but
still not ideal (Fig. 6.5). The significant term at lag 2 has gone as
well as those at lags 5 and 6 however; this is more important than the
significant terms at lags 14 and higher (as lags 14 time steps away are
less likely to be of importance). So perhaps the arma(2, 2) model will
suffice. Here’s the model:

> ms.arma22

Call:
arima(x = ms$soi, order = c(2, 0, 2))

Coefficients:
ar1 ar2 ma1 ma2 intercept
0.9192 -0.0473 -0.4273 -0.0131 -0.0903
s.e. 0.3801 0.3250 0.3792 0.1451 0.8158

sigma^2 estimated as 53.19: log likelihood = -5156.87, aic = 10325.74

Note the second ar term and the second ma term are both unnecessary
(the estimates divided by their standard errors are much less than one).
This suggests the second ar term and the second ma term should be
excluded from the model. In other words, try fitting an arma(1, 1)
model.

> ms.arma11 <- arima(ms$soi, order = c(1, 0, 1))
> acf(ms.arma11$residuals)
> pacf(ms.arma11$residuals)



Figure 6.5: The acf and pacf of residuals for the arma(2, 2) model fitted
to the monthly SOI. Top: the acf; Bottom: the pacf.



Figure 6.6: The acf and pacf of residuals for the arma(1, 1) model fitted
to the monthly SOI. Top: the acf; Bottom: the pacf.

The residual acf and pacf from this model (Fig. 6.6) look very similar
to those in Fig. 6.5, suggesting the arma(1, 1) model fits about as well
as the arma(2, 2) model, while also being simpler.
Here’s the arma(1, 1) model:

> ms.arma11

Call:
arima(x = ms$soi, order = c(1, 0, 1))

Coefficients:
ar1 ma1 intercept
0.8514 -0.3698 -0.1183
s.e. 0.0196 0.0355 0.7927


sigma^2 estimated as 53.25: log likelihood = -5157.63, aic = 10323.26

The aic implies this is a better model than the arma(2, 2) model, and
so the arma(1, 1) is appropriate for the data.

6.4 The Box–Pierce test (Q-statistic)

Another test to apply to the residuals is to calculate the Box–Pierce statistic,


or the Q-statistic, also known as a Portmanteau test. The purpose of the test
is to check if the residuals are independent. The null hypothesis is that the
residuals are independent, and the alternative is they are not independent.
This test computes the sum of the squares of the first m (e.g. m = 15)
sample acf coefficients, multiplied by the length of the time series (say N )
and calls this Q:

Q = N \sum_{k=1}^{m} \hat{\rho}_k^2 .
If the residuals are taken from a white noise process, the Q statistic will have
approximately a chi-square (χ2 ) distribution with m − g degrees of freedom,
where m is the number of autocorrelation coefficients used in computing the
statistic (15 above), and g is the number of autoregressive and moving
average parameters estimated for the model (writing g rather than N ,
since N already denotes the length of the series). Some authors use m rather
than m − g degrees of freedom (as does r). Chatfield [11, p 62] and others
note the test is really only useful when the time series has more than 100
observations. An alternative test, which is better for shorter series, is

Q = N (N + 2) \sum_{k=1}^{m} \frac{\hat{\rho}_k^2}{N − k} ,

called the Ljung–Box test. Both tests, however, may lack statistical power.
In r, the function Box.test is used for both tests.
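As a sketch, the Box–Pierce Q can also be computed directly from its
definition; here e is assumed to hold the residual series, with m = 15:

> # compute the Box-Pierce Q statistic by hand
> N <- length(e)
> rho <- acf(e, lag.max = 15, plot = FALSE)$acf[-1]   # sample acf, dropping lag 0
> Q <- N * sum(rho^2)

This should agree with the statistic reported by Box.test(e, lag = 15).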

Example 6.3: In Example 6.1, the yearly Buffalo snowfall data were con-
sidered. In that Example, the residual acf and pacf showed the
residuals were not forecastable using an ar(1) model. To test if the
residuals appear to be independent, use the Box.test function in r.
The input variables are the residuals from the fitted model, and the
number of terms in the acf to be used to compute the statistic. The
default value is one, which is far too few. Typically, a value such as 15
is used (it is often more if the series is longer or is seasonal, and shorter
if the time series is short).


> Box.test(resid(ar1), lag = 15)

Box-Pierce test

data: resid(ar1)
X-squared = 7.3209, df = 15, p-value = 0.9481

> Box.test(resid(ar1), lag = 15, type = "Ljung-Box")

Box-Ljung test

data: resid(ar1)
X-squared = 8.1009, df = 15, p-value = 0.9197

The P -value indicates there is no evidence that the residuals are de-
pendent. The conclusion from the Ljung–Box test is similar. This
further confirms that the ar(1) model is adequate. If the P -value was
below about 0.05, there would be some cause for concern: it would
imply that the terms in the acf are too large to be a white noise.

Note the r function tsdiag produces a plot of the P -values of the Ljung–Box
statistic for various values of the lag; see the third (bottom) panel in Fig. 6.1.
The dotted line in the plot corresponds to a P -value of 0.05.

6.5 The cumulative periodogram

Another test applied to the residuals is to calculate the cumulative (or in-
tegrated) periodogram and apply a Kolmogorov–Smirnov test to check the
assumption that the residuals form a white noise process. The r function
cpgram performs this test.

The cumulative periodogram from a white noise process will lie close to the
central diagonal line. Thus, if the residuals do form a white noise process as
they should do approximately if the model is correct, the cumulative peri-
odogram of the residuals will lie within the indicated bounds with probability
95%.

Example 6.4: In Example 6.1, the yearly Buffalo snowfall data were con-
sidered and an ar(1) model fitted. The cumulative periodogram is
found as follows:



Figure 6.7: The cumulative periodogram after fitting an ar(1) model to the
yearly Buffalo snowfall data.

> cpgram(ar1$resid, main = "")

The result (Fig. 6.7) indicates that the model is adequate as it remains
between the confidence bands.

6.6 Significance of parameters

The next important test to perform is to check on the statistical significance


of the parameters. Standard errors of the parameter estimates are computed
and shown by r when the model is fitted using arima.

Roughly speaking, the parameters of a model are accepted as significant


if the estimated value of the parameter is twice the standard error of this
estimate or more. This is made a little more precise by using a statistical
test (the t-test), however in practice this amounts to almost the same thing.
If a parameter shows up as not significant, it should be removed from the
model.


Example 6.5: In Example 6.1, the yearly Buffalo snowfall data were con-
sidered. An ar(1) model was fitted to the data. There were two
estimated parameters: the constant term in the model, m0 , and the
ar term. The ar term can be tested for significance. (Recall that the
intercept is of no interest to the structure of the model.) The param-
eter estimates and the standard errors are shown in Example 5.8 (p 86).
Dividing the estimate by the standard error produces an approximate
t-score. The parameter estimate for the ar term has a t-score greater
than two in absolute value, indicating that it is necessary in the model.
The actual t-scores can be computed using the output from the fitting
of the model, as shown below.

> coef(ar1)

ar1 intercept
0.3301765 80.8808921

> ar1$coef

ar1 intercept
0.3301765 80.8808921

> ar1$var.coef

ar1 intercept
ar1 0.01528329 0.03975151
intercept 0.03975151 17.40728000

> coef(ar1)/sqrt(diag(ar1$var.coef))

ar1 intercept
2.670778 19.385655

The conclusion is that the ar parameter in the model is necessary,
and so the ar(1) model seems appropriate.

6.7 Normality of residuals

Throughout, the residuals have been assumed to be normally distributed. To
test this, use a Q–Q plot of the residuals. If the residuals do have a normal
distribution, the points in the plot will lie close to the diagonal line.



Figure 6.8: The Q–Q plot of the residuals after fitting an ar(1) model to the
yearly Buffalo snowfall data.

Example 6.6: Continuing Example 6.1 (the yearly Buffalo snowfall), con-
sider again the fitted ar(1) model. The Q–Q plot of the residu-
als (Fig. 6.8) indicates the residuals are approximately normally dis-
tributed.

> qqnorm(resid(ar1))
> qqline(resid(ar1))

(Note: qqnorm plots the points; qqline draws the diagonal line.)

6.8 Alternative models

The last type of test is to check if an alternative model might be better. This
is open-ended, because there is an endless variety of alternative models from
which to choose. But, as seen before, there are sometimes a small number
of models that are suggested from which the researcher has to choose. If one
model proves to be better using the diagnostic tests, that model should be
used. If all perform similarly, choose the simplest model. But what if there


is more than one model that performs similarly, and each is as simple as the
other? If you can’t decide between them, then it probably doesn’t matter!

6.9 Evaluating the performance of a model

Finally, consider an evaluation tool that is slightly different from those pre-
viously discussed. The idea is that the model is fitted to the first portion of
the data (perhaps half the data), and then forecasts are made on the basis of
the model fitted to this portion (called the training set). One-step ahead
forecasts are then made for each of the remaining data points (called the
testing set) to see how adequately the model forecasts—which, after all, is
one of the main reasons for developing time series models.

This approach generally requires a time series with a large number of ob-
servations to work well, since splitting the data into two parts halves the
amount of information available for model selection. Obviously, smaller por-
tions can be withheld from the model selection stage if necessary, as shown
in the next example. The approach discussed here is called cross-validation.
The ‘best’ model is the model whose predictions in the testing set are
closest to the actual observed values; this can be summarised by noting the
mean and variance of the differences. More sophisticated cross-validation
techniques are possible, but not discussed here.
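A minimal sketch of the procedure in r, assuming sf holds the Buffalo
snowfall series as before: the last ten observations are withheld, and a
one-step ahead forecast is made for each by refitting the ar(1) model to
the data available at the time (a common variation holds the parameters
fixed at their training-set estimates instead):

> n <- length(sf)
> preds <- numeric(10)
> for (i in 1:10) {
+     history <- sf[1:(n - 11 + i)]   # data available at the time
+     fit <- arima(history, order = c(1, 0, 0))
+     preds[i] <- predict(fit, n.ahead = 1)$pred
+ }
> errors <- sf[(n - 9):n] - preds   # one-step ahead forecast errors
> c(mean(errors), var(errors))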

Example 6.7: Because the Buffalo snowfall data is a short series, we with-
hold only the last ten observations and retain those for model evalu-
ation. The one-step ahead forecasts for the remaining ten observations
for each model are shown in Table 6.1.
These one-step ahead predictions are plotted in Fig. 6.9. Table 6.1
suggests little difference between the models; the ar(2) model has
smaller errors on average (compare the means), but the ar(1) model
is more consistent (compare the variances).

6.10 Summary

Before accepting a time series model, it must be tested. The main tests are
based on analysing the “residuals”—the one-step ahead forecast errors of the
model. Table 6.2 summarises the diagnostic tests discussed.


Table 6.1: The one-step ahead forecasts for the ar(1), ar(2) and ma(2)
model after withholding the last ten observations and using the remainder
as a training set.
                 Prediction from Model:
Step   Actual   ar(1)   ar(2)   ma(2)
1 89.80 87.08 91.67 86.91
2 71.50 83.21 88.52 84.59
3 70.90 77.10 80.89 77.14
4 98.30 76.89 75.88 73.97
5 55.50 86.05 82.53 85.08
6 66.10 71.75 79.17 79.26
7 78.40 75.29 70.43 66.64
8 120.50 79.40 76.31 79.19
9 97.00 93.46 90.05 95.84
10 110.00 85.61 95.39 93.69
Errors: Mean: 4.215 2.717 3.570
Var: 414.9 444.7 427.0


Figure 6.9: The cross-validation one-step ahead predictions for the ar(1),
ar(2) and ma(2) models applied to the Buffalo snowfall data.


Assumption to Test               Test to Use                    R Commands to Use
Residuals unforecastable         residual acf & residual pacf   use acf and pacf on residuals
Residuals independent            Box–Pierce test                Box.test
Residuals white noise            cumulative periodogram         cpgram
Simple model                     significance of parameters     output from arima
Residuals normally distributed   Q–Q plot of residuals          qqnorm

Table 6.2: A summary of the diagnostic tests to use on given time series
models.

6.11 Exercises

Ex. 6.8: In Exercise 4.22, an arma(1, 1) model was discussed that was
fitted by Sales, Pereira & Vieira [40] to the natural monthly average
flow rate (in cubic metres per second) of the reservoir of Furnas on the
Grande River in Brazil. Table 4.1 (p 70) gave the parameter estimates
and their standard errors. Determine if each parameter is significant
at the 95% level.

Ex. 6.9: In Exercise 5.14 (p 91), data concerning the mean annual stream-
flow from 1925 to 1988 in Cache River at Forman, Illinois, were given in the
file cacheriver.dat. There are two variables of interest: Mean reports
the mean annual flow, and Max reports the maximum flow each water
year, each measured in cubic feet per second. Perform the diagnostic
checks to see if the model found for the variable Mean in that exercise
is adequate.

Ex. 6.10: In Exercise 5.19 (p 92), the Easter Island sea level air pressure
anomalies from 1951 to 1995, given in the data file easterslp.dat,
were analysed. An ar(3) model was considered a suitable model. Per-
form the appropriate diagnostic checks on this model, and determine
if the model is adequate.

Ex. 6.11: In Exercise 4.4, Davis & Rappoport [15] were reported to use
an arma(2, 2) model for modelling the Palmer Drought Index, {Yt }.
Katz & Skaggs [26] claim the equivalent ar(2) model is almost as good


as the model given by Davis & Rappoport, yet has half the number of
parameters. For this reason, they prefer the ar(2) model.
Load the data into r and decide on the best model. Give reasons for
your solution, and include diagnostics analyses.

Ex. 6.12: In Exercise 5.20, a model was fitted to the Western Pacific Index
(WPI). The time series in the data file wpi.txt gives the monthly
WPI from January 1950 to December 2001. Perform some diagnostic
analyses and select the ‘best’ model for the data, justifying your choice
and illustrating your answer with appropriate diagrams.

Ex. 6.13: In Exercise 5.21, the seasonal average SOI from (southern hemi-
sphere) summer 1876 to (southern hemisphere) summer 2001 was stud-
ied. The data is given in the file soiseason.dat. Fit an appropriate
model to the data justifying your choice and illustrating your answer
with appropriate diagrams.

Ex. 6.14: In Exercise 5.22, the monthly average solar flux from December
1950 to December 2001 was studied. The data is given in the file
solarflux.txt. Fit an appropriate model to the data justifying your
choice and illustrating your answer with appropriate diagrams.

Ex. 6.15: The data file rionegro.dat contains the average monthly heights
of the Rio Negro river at Manaus from 1903–1992 in metres (relative
to an arbitrary reference point). Find a suitable model for the times
series, including a diagnostic analysis of possible models.

6.11.1 Answers to selected Exercises

6.9 The model chosen for the variable Mean was simply that the data were
random. Hence the residual acf and residual pacf are just the sam-
ple acf and sample pacf as shown in Fig. 5.13. The cumulative
periodogram shows no problems with this model; see Fig. 6.10. The
Box–Pierce test likewise indicates no problems. The Q–Q plot is not
ideal though (and looks better if an ar(3) model is fitted). Here is
some of the code:

> Box.test(rflow)

Box-Pierce test

data: rflow
X-squared = 0.865, df = 1, p-value = 0.3523

6.15 First, load and prepare the data:



Figure 6.10: The cumulative periodogram of the annual streamflow at Cache
River; the plot suggests that the data are random. However, the Q–Q plot
suggests that the data are perhaps not normally distributed.

> RN <- read.table("rionegro.dat", header = TRUE)


> ht <- ts(RN$Height, start = c(RN$Year[1],
+ RN$Month[1]), frequency = 12)
A plot of the data shows the series is reasonably stationary (Fig. 6.11).
See the acf and pacf (Fig. 6.12); the acf suggests a very large order
ma model, while the pacf suggests an ar(3) model. Decide to start
with the ar(3) model!
> rn.ar3 <- arima(ht, order = c(3, 0, 0))
The residual acf and pacf are pretty good if not perfect (Fig. 6.13);
there are a couple of components outside the approximate confidence
limits, but probably nothing of importance.
Let’s examine more diagnostics (Fig. 6.14); the cumulative periodogram
looks fine, but the normal probability plot looks bad. However, a his-
togram shows the residuals have a decent distribution that looks roughly
normal, so things aren’t so bad (try hist(resid(rn.ar3))).
So, for some final diagnostics:
> Box.test(resid(rn.ar3))

Box-Pierce test

data: resid(rn.ar3)
X-squared = 0.1847, df = 1, p-value = 0.6674


> plot(ht)


Figure 6.11: A plot of the Rio Negro river data

> par(mfrow = c(1, 2))


> acf(ht)
> pacf(ht)


Figure 6.12: The acf and pacf of the Rio Negro river data


> par(mfrow = c(1, 2))


> acf(resid(rn.ar3))
> pacf(resid(rn.ar3))


Figure 6.13: The residual acf and pacf of the Rio Negro river data after
fitting the ar(3) model

> par(mfrow = c(1, 2))


> cpgram(resid(rn.ar3))
> qqnorm(resid(rn.ar3))
> qqline(resid(rn.ar3))


Figure 6.14: Further diagnostic plots of the Rio Negro river data after fitting
the ar(3) model


> coef(rn.ar3)/sqrt(diag(rn.ar3$var.coef))

ar1 ar2 ar3 intercept


38.72317372 -11.39726406 6.14219789 -0.01352855

The Box test shows no problems; all the parameters seem necessary.
This model seems fine (if not perfect).

> rn.ar3

Call:
arima(x = ht, order = c(3, 0, 0))

Coefficients:
ar1 ar2 ar3 intercept
1.1587 -0.4985 0.1837 -0.0020
s.e. 0.0299 0.0437 0.0299 0.1462

sigma^2 estimated as 0.567: log likelihood = -1226.89, aic = 2463.77

Module 7

Non-Stationary Models
Module contents
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 Non-stationarity in the mean . . . . . . . . . . . . . . . 131
7.3 Non-stationarity in the variance . . . . . . . . . . . . . 134
7.4 arima models . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.4.3 Backshift operator . . . . . . . . . . . . . . . . . . . . . 138
7.5 Seasonal models . . . . . . . . . . . . . . . . . . . . . . . 138
7.5.1 Identifying the season length . . . . . . . . . . . . . . . 141
7.5.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5.3 The backshift operator . . . . . . . . . . . . . . . . . . . 147
7.5.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.7 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.8 A summary of model fitting . . . . . . . . . . . . . . . . 154
7.9 A complete example . . . . . . . . . . . . . . . . . . . . . 156
7.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.11.1 Answers to selected Exercises . . . . . . . . . . . . . . . 164


Module objectives

Upon completion of this module students should be able to:

ˆ identify time series that are not stationary in the mean;

ˆ use differences to remove non-stationarity in the mean;

ˆ identify time series that are not stationary in the variance;

ˆ use logarithms to remove non-stationarity in the variance;

ˆ understand what is meant by an arima model;

ˆ use the arima(p, d, q) notation to define arima models;

ˆ develop forecasting formulae for arima models;

ˆ develop confidence intervals for forecasts for arima models;

ˆ write arima models using the backshift operator;

ˆ identify seasonal time series;

ˆ identify the length of a season in a seasonal time series;

ˆ use the arima(p, d, q) (P, D, Q)s notation to define seasonal arima models;

ˆ develop forecasting formulae for seasonal arima models;

ˆ develop confidence intervals for forecasts for seasonal arima models;

ˆ write seasonal arima models using the backshift operator;

ˆ use r to estimate the parameters in seasonal arima models;

ˆ use r to fit an appropriate Box–Jenkins model to time series data.

7.1 Introduction

Up to now, all the time series considered have been assumed stationary. This
assumption was crucial to the definitions of the autocorrelation and partial
autocorrelation. In practice, however, many time series are not stationary.
In this Module, methods for identifying non-stationary series are considered,
and then models for modelling these series are examined.

In this Module, three types of non-stationarity are discussed:


1. series that have a non-stationary mean;

2. series that have a non-stationary variance; and

3. series with a periodic or seasonal component.

Many series may exhibit more than one of these types of non-stationarity.

7.2 Non-stationarity in the mean

One common type of non-stationarity is a non-stationary mean. Typically,


the mean of the series tends to increase or fluctuate. This is easiest to
identify by looking at a plot of the data. Sometimes, the sample acf may
indicate a non-stationary mean if the terms take a long time to decay to
zero.

If a dataset exhibits a non-stationary mean, the solution is to take differ-


ences. That is, if a time series {X} is non-stationary in the mean, compute
the differences Yn = Xn − Xn−1 . Generally, this makes any time series with
a non-stationary mean into a time series with a stationary mean {Y }. Oc-
casionally, the differenced time series {Y } will also be non-stationary in the
mean, and another set of differences will be needed. It is rare to ever need
more than two sets of differences. When differences of this kind are taken
(soon another type of difference is considered), this is referred to as taking
first differences.

Note that each time a set of differences is calculated, the new series has one
less observation than the original. In r, differences are created using diff.
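As a small illustration (the numbers here are invented, not from the text),
note how each set of differences shortens the series by one observation:

> x <- c(5, 7, 6, 9, 8)
> diff(x)                    # first differences: 2 -1 3 -1
> diff(x, differences = 2)   # second differences: -3 4 -4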

Example 7.1:
Consider the annual rainfall near Wendover, Utah, USA. The data
appear to have a non-stationary mean (Fig. 7.1) as the mean goes up
and down, though it is not too severe. To check this, a smoothing
filter was applied computing the mean of each set of six observations
at a time. This smooth (Fig. 7.1, top panel) suggests the mean is
probably non-stationary as this line is not (approximately) constant.
The following code fragment shows how the differences series was found
in r.

> rfdata <- read.table("./rainfall/wendover.dat",


+ header = TRUE)
> rf <- rfdata[(rfdata$Year > 1907) & (rfdata$Year <
+ 1999), ]


Figure 7.1: The annual rainfall near Wendover, Utah, USA in mm. Top:
the original data is plotted with a thin line, and a smooth in a thick line,
indicating that the mean is non-stationary. Bottom: the differenced data is
plotted with a thin line, and a smooth in a thick line. Since the smooth is
relatively flat, the differenced data has a stationary mean.

> ann.rain <- tapply(rfdata$Rain, list(rfdata$Year),


+ sum)
> ann.rain <- ts(as.vector(ann.rain), start = rfdata$Year[1],
+ end = rfdata$Year[length(rfdata$Year)])
> plot(ann.rain, type = "l", las = 1, ylab = "Annual rainfall (in mm)",
+ xlab = "Year")
> ar.l <- lowess(ann.rain, f = 0.1)
> lines(ar.l, lwd = 2)

If differences are applied, the series appears more stationary in the


mean (Fig. 7.1, bottom panel).


Figure 7.2: The Atlantic Multidecadal Oscillation from 1948 to 1994. Top:
the plot shows the data is not stationary. Middle: the first differences are
also not stationary. Bottom: taking two sets of differences has produced a
stationary series.

Example 7.2: Enfield et al. [16] used the Kaplan SST to compute a ten-
year running mean of detrended Atlantic SST anomalies north of the
equator. This data series is called the Atlantic Multidecadal Oscilla-
tion (AMO). The data, obtained from the NOAA Climatic Diagnostic
Center [2], are stored as amo.dat. A plot of the data shows the series is
non-stationary in the mean (Fig. 7.2, top panel). The first differences
are also non-stationary (Fig. 7.2, middle panel). Taking one more
set of differences produces approximately stationary data (Fig. 7.2,
bottom panel).
Here is the code used.

> amo <- read.table("amo.dat", header = TRUE)


> amo <- ts(amo$AMO, start = c(amo$Year[1]),


+ frequency = 1)
> par(mfrow = c(3, 1))
> plot(amo, main = "AMO", las = 1)
> damo <- diff(amo)
> plot(damo, main = "One difference of AMO",
+ las = 1)
> ddamo <- diff(damo)
> plot(ddamo, main = "Two differences of AMO",
+ las = 1)

7.3 Non-stationarity in the variance

A less common type of non-stationarity with climate data is non-stationarity


in the variance. A non-stationary variance is a common difficulty, however,
in many business applications. Generally, a series that is non-stationary
in the variance has a variance that gets larger over time (that is, as time
progresses, the observations become more variable). In these cases, usually
taking logarithms of the time series will help. Another possible difficulty is
that the time series contains negative values (for example, SOI series). In
these cases, add a sufficiently large constant to the data (which won’t affect
the variance), and then take logarithms. If the time series is non-stationary
in the mean and the variance, logs should be taken before differences (to
avoid taking logs of negative values).
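As a rough sketch of this advice (here x is a hypothetical series that may
contain negative values; the offset used is one simple choice, not the only
one), logs come first, differences after:

> offset <- abs(min(x)) + 1    # shift the series so all values are positive
> y <- diff(log(x + offset))   # logs first, then first differences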

7.4 ARIMA models

Once a non-stationary time series has been made stationary, it can be


analysed like any other (stationary) time series. These models, which in-
clude some differencing, are called Autoregressive Integrated Moving Aver-
age models, or arima models.

7.4.1 Notation

ar, ma or arma models in which differences have been taken are collectively
called autoregressive integrated moving average models, or arima models.
Consider an arima model in which the original time series has been differ-
enced d times (d is mostly 1, sometimes 2, and almost never greater than
2). If this now-stationary time series can be well modelled by an arma(p, q)


Figure 7.3: The differences of the annual rainfall near Wendover, Utah, USA
in mm. Top: the sample acf. Bottom: the sample pacf.

model, then the final model is said to be an arima(p, d, q) model, where d


is the number of sets of differences needed to make the series stationary.

Example 7.3: In Example 7.1, the annual rainfall near Wendover, Utah,
say {Xn }, was considered. The time series was non-stationary, and
differences were taken. The differenced time series, say {Yn }, is now
stationary. The sample acf and pacf of the stationary series {Yn } is
shown in Fig. 7.3.
The sample acf suggests an ma(1) model is appropriate (again re-
calling that the term at lag 0 is always one), while the sample pacf
suggests an ar(2) model is appropriate. The AIC recommends an
ar(1) model. If the ar(2) model is chosen, the model would be an
arima(2, 1, 0). If the ma(1) model is chosen, the model would be an
arima(0, 1, 1). If the ar(1) model is chosen, the model would be an


arima(1, 1, 0), since there is one set of differences.


Here is some of the code used:

> rf <- read.table("wendover.dat", header = TRUE)


> rf <- rf[(rf$Year > 1907) & (rf$Year < 1999),
+ ]
> ann.rain <- tapply(rf$Rain, list(rf$Year),
+ sum)
> ann.rain <- ts(as.vector(ann.rain), start = rf$Year[1],
+ end = rf$Year[length(rf$Year)])
> plot(ann.rain, type = "n", las = 1, ylab = "Annual rainfall (in mm)",
+ xlab = "Year")
> lines(ann.rain)
> ann.rain.d <- diff(ann.rain)
> plot(ann.rain.d, type = "n", las = 1, ylab = "Differences of Annual rainfall (in mm)",
+ xlab = "Year")
> acf(ann.rain.d, main = "")
> pacf(ann.rain.d, main = "")

Example 7.4: An example of an arima(2, 1, 1) model is

Wt = 0.3Wt−1 − 0.1Wt−2 + et − 0.24et−1 ,

where Wt = Yt − Yt−1 is the stationary, differenced time series. The


model for the original series, {Yt }, is therefore

(Yt − Yt−1 ) = 0.3(Yt−1 − Yt−2 ) − 0.1(Yt−2 − Yt−3 ) + et − 0.24et−1


⇒ Yt = 1.3Yt−1 − 0.4Yt−2 + 0.1Yt−3 + et − 0.24et−1 .

7.4.2 Estimation

The r function arima can be used to fit arima models, with only a simple
change to what was seen for stationary models.

Example 7.5: In Example 7.3, three models are considered. To fit the
arima(0, 1, 1) model, use the code

> ann.rain.ma1 <- arima(ann.rain, order = c(0,


+ 1, 1))
> ann.rain.ma1


Call:
arima(x = ann.rain, order = c(0, 1, 1))

Coefficients:
ma1
-0.7036
s.e. 0.1208

sigma^2 estimated as 5548: log likelihood = -516, aic = 1035.99

We have now seen what the second element of order is for: it indicates
the order of the differencing necessary to make the series stationary.
The fitted model for the first differences of the annual rainfall series
is therefore Wt = −0.7036et−1 + et where Wt = Yt − Yt−1 , and {Y } is
the original time series of annual rainfall (since first differences were
taken). This can be written as
Yt − Yt−1 = −0.7036et−1 + et
and further unravelled to
Yt = Yt−1 − 0.7036et−1 + et .
To fit the arima(1, 1, 0) model, proceed as follows:

> ann.rain.ar1 <- arima(ann.rain, order = c(1,


+ 1, 0))
> ann.rain.ar1

Call:
arima(x = ann.rain, order = c(1, 1, 0))

Coefficients:
ar1
-0.4494
s.e. 0.0933

sigma^2 estimated as 6296: log likelihood = -521.46, aic = 1046.92

So the model for the first difference of annual rainfall is


Wt = −0.4494Wt−1 + et ,
where Wt = Yt − Yt−1 and {Y } is the original rainfall series. This can
also be expressed as
Yt = 0.5506Yt−1 + 0.4494Yt−2 + et
in terms of the original rainfall series.


7.4.3 Backshift operator

When differences are taken of a time series {Xt }, this is written using the
backshift operator as Yt = (1 − B)Xt .

Example 7.6: In Example 7.4 an arma(2, 1) model was fitted to a station-


ary series {Wt } (hence making an arima(2, 1, 1) model). The model
can be written using the backshift operator as
(1 − 0.3B + 0.1B 2 )Wt = (1 − 0.24B)et .
Since {Wt } is a differenced time series, Yt = (1 − B)Wt , so the model
for {Yt } written using the backshift operator is
(1 − 0.3B + 0.1B 2 )(1 − B)Yt = (1 − 0.24B)et .
This expression can be expanded to give
(1 − 1.3B + 0.4B 2 − 0.1B 3 )Yt = (1 − 0.24B)et ,
producing the same model as before.
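Expansions of this kind can be checked numerically; the following sketch
(not in the text) multiplies the two backshift polynomials of this example
using convolve:

> ar.poly <- c(1, -0.3, 0.1)    # coefficients of 1 - 0.3B + 0.1B^2
> diff.poly <- c(1, -1)         # coefficients of 1 - B
> convolve(ar.poly, rev(diff.poly), type = "open")
[1]  1.0 -1.3  0.4 -0.1

agreeing with 1 − 1.3B + 0.4B^2 − 0.1B^3 above.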

Example 7.7: In Example 7.2, the AMO from 1948 to 1994 was examined.
Two sets of differences were required to make the data stationary.
Looking at the sample acf and sample pacf of the twice-differenced
data shows that no model is necessary. The fitted model is therefore
an arima(0, 2, 0) model. Using the backshift operator, the model is
(1 − B)2 At = et , where {A} is the AMO series.

7.5 Seasonal models

The most common type of non-stationarity is when the time series exhibits
a ‘seasonal’ pattern. ‘Seasonal’ does not necessarily have anything to do
with the seasons of Winter, Spring, and so on. It means that there is some
kind of regular pattern in the data. This type of non-stationarity is very
common in climatological and meteorological applications, where there is
often an annual pattern evident in the data. Seasonal data is time series
data that shows regular fluctuation, usually aligned with some natural time
period (not just the actual seasons of Winter, Spring, etc). The length of
a season is the time period over which the pattern repeats. For example,
monthly data might show an annual pattern with a season of length 12, as
the data may have a pattern that repeats each year (that is, each twelve
months). These patterns usually appear in the sample acf and pacf.


Example 7.8:
The average monthly sea level at Darwin, Australia (in millimetres),
obtained from the Joint Archive for Sea Level [1], is plotted in the top
panel of Fig. 7.4. The sample acf and sample pacf are also shown.
The code used to produce this Figure is given below:

> sealevel <- read.table("darwinsl.txt", header = TRUE)


> sl <- ts(sealevel$Sealevel/1000, start = c(1988,
+ 1), end = c(1999, 12), frequency = 12)
> plot(sl, ylab = "Sea level (in m)", las = 1)
> acf(sl, lag.max = 40, main = "")
> pacf(sl, lag.max = 40, main = "")

The data show a seasonal pattern—the sea level has a regular rise and
fall according to the months of the year (as expected). The length of
the season is therefore twelve, since the pattern is of length twelve,
when the pattern then repeats. This seasonality also appears in the
sample acf.

Seasonal time series have a non-stationary mean, but the non-stationarity is


of a regular kind (that is, every year or every month a cycle repeats). These
types of time series can be represented using a model that explicitly allows
for the seasonality.

In seasonal models, ar and ma components can be introduced at the value


of the season. For example, the model

Xt = et − 0.23Xt−12

might be used to model monthly data (where the season length is twelve, as
the data might be expected to repeated each year). This model explicitly
models the seasonal pattern by incorporating an autoregressive term at a
lag of twelve.

This model is a seasonal ar model. More generally, a seasonal arma model


may have the usual non-seasonal ar and ma components (or the “ordinary”
ar and ma components) but also seasonal ar and ma components. The
model in the previous paragraph is a seasonal ar(1) model, since the one
ar term is one season before. Similarly, for a time series with a season of
length twelve, an example of a seasonal ar(2) model is

Yn+1 = en+1 + 0.17Yn−11 − 0.55Yn−23 ,


Figure 7.4: The monthly average sea level at Darwin, Australia in metres.
Top: the data are plotted. Centre: the sample acf and Bottom: the sample
pacf.


since the first ar term is one ‘season’ (12 time steps) behind, and the second
ar term is two ‘seasons’ (2 × 12 = 24 time steps) behind.

Sometimes it is also necessary to take seasonal differences. If a time series


{Xt } shows a very strong seasonal component with a season of length s,
then a seasonal difference of the form

Yt = Xt − Xt−s

is used to create a more stationary time series. Again, the r function diff
is used with an optional parameter given to indicate the season length.
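For example (a hedged sketch, with x a hypothetical monthly series), a
seasonal difference at lag 12 followed, if one is still needed, by an ordinary
difference:

> y <- diff(x, lag = 12)   # seasonal differences: Y_t = X_t - X_t-12
> yy <- diff(y)            # then ordinary first differences, if required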

Example 7.9: The Darwin sea average monthly sea level data (Exam-
ple 7.8, p 139) has a strong seasonal pattern. Taking seasonal differ-
ences seems appropriate:

> dsl <- diff(sl, 12)


> plot(dsl, las = 1)

The plot of the seasonally differenced data (Fig. 7.5, top panel) sug-
gests the series is still possibly non-stationary in the mean, so taking
ordinary (non-seasonal) differences also seems appropriate:

> ddsl <- diff(dsl)


> plot(ddsl, las = 1)

The plot of the twice-differenced data (Fig. 7.5, bottom panel) is now
approximately stationary.

Example 7.10: Kärner & Rannik [25] used the seasonal ma model

xt − xt−12 = et − Θ1 et−12

to model cloud amount, where seasonal differences have been initially taken.

7.5.1 Identifying the season length

Sometimes it is easy to identify the length of a season as it is aligned with a


yearly or seasonal cycle. If this is the case, the season length should be made
to align with the natural season. But for many climatological variables
this is not true. In these cases, identifying the season length can be difficult.


Figure 7.5: The differences in monthly average sea level at Darwin, Australia
in metres (see also Fig. 7.4). Top: the seasonal differences are plotted, while
in the bottom plot, both seasonal and non-seasonal differences have been
taken.


Figure 7.6: The Quasi-Biennial Oscillation (QBO) from 1955 to 2001. Top:
the QBO is plotted and shows cyclic behaviour. Middle: the spectrum is
shown. Bottom: the spectrum is shown again, but has been smoothed.

To help identify the season length, a periodogram, or spectrum, is used.


The spectrum examines many frequencies in the data and computes the
strength of each possible frequency. Hence any frequency that is very strong
is an indication of the period of the season.

Example 7.11: The quasi-biennial oscillation, or QBO, is calculated at


the Climate Diagnostic Centre from the zonal average of the 30mb
zonal wind at the equator. The monthly data have a distinct seasonal
pattern (Fig. 7.6, top panel) but it is not aligned with years or seasons,
or anything else useful.
Using r, the spectrum is found using the function spectrum as follows:

> qbo <- ts(qbo, start = c(1955, 1), end = c(2001,


+ 12), frequency = 12)


> qbo.spec <- spectrum(qbo)

This spectrum (Fig. 7.6, centre panel) is very noisy. It is best to


smooth the plot, as shown below. (You do not need to understand
what this smoother does or how it works; the point is that it smooths
the plot.)

> k5 <- kernel("daniell", 5)


> qbo.spec <- spectrum(qbo, kernel = k5)

The result is a much smoother spectrum (Fig. 7.6, bottom panel). The
season length is identified as the frequency where the spectrum is at
its greatest. This can also be done in r:

> max.spec <- max(qbo.spec$spec)


> max.freq <- qbo.spec$freq[max.spec == qbo.spec$spec]
> max.freq

[1] 0.4375

> 1/max.freq

[1] 2.285714

The maximum frequency corresponds to 2.3 “seasons”, or 2.3 years in


this case.
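As a follow-on check (assuming, as here, that the frequency axis is in cycles
per year because the series was created with frequency = 12), the dominant
period in months is

> 12/max.freq
[1] 27.42857

about 27.4 months, close to the quasi-biennial period of roughly 28 months
usually quoted for the QBO.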

Random numbers are expected to have a spectrum that is fairly constant


for all frequencies; see the following example.

Example 7.12: In this example, we look at the spectrum of four sets of


random numbers from a Normal distribution.
> set.seed(102030)
> par(mfrow = c(2, 2))
> k5 <- kernel("daniell", 5)
> for (i in (1:4)) {
+ random.numbers <- rnorm(1000)
+ spectrum(random.numbers, kernel = k5)
+ }
In the output (Fig. 7.7), no frequencies stand out as being much
stronger than others.


Figure 7.7: Four replications of a spectrum from 1000 Normal random num-
bers. There is no evidence of one frequency dominating.

7.5.2 Notation

These models are very difficult to write down. There are a number of pa-
rameters that must be included:

1. The order of non-seasonal (or ordinary) differencing, d;


2. The order of the non-seasonal ar model, p;
3. The order of the non-seasonal ma model, q;
4. The length of a season, s;
5. The order of seasonal differencing, D;
6. The order of the seasonal ar model, P ;
7. The order of the seasonal ma model, Q.

Note that any one model should only have a few parameters, and so some
of p, q, P , and Q are expected to be zero. In addition, d + D is most often
one, sometimes two, and rarely greater than two. These parameters are
summarized by writing a model down as follows: a model with all of the
above parameters would be written as an arima(p, d, q) (P, D, Q)s model.


Example 7.13: Consider a time series {Rt }. The series is non-stationary,


and ordinary differences and seasonal differences (period 7) are taken
to make the series stationary in the mean. An ordinary ar(2) model
and seasonal ma(1) model is then fitted. The final model is an arima(2, 1, 0)
(0, 1, 1)7 model.

Example 7.14: Consider a time series {Zt }. The series is non-stationary,


and two sets of seasonal differences (period 12) are taken to make the
series stationary in the mean. An ordinary arma(1, 1) model is then
fitted. The final model is an arima(1, 0, 1) (0, 2, 0)12 model.

Example 7.15: Consider a time series {Pn }. The series is non-stationary,


and one set of seasonal differences (period 4) are taken to make the
series stationary in the mean. An ordinary ma(1) model and seasonal
ar(2) model is then fitted. The final model is an arima(0, 0, 1) (2, 1, 0)4
model.

Example 7.16: Consider a time series {An }. The series is non-stationary,


and one set of seasonal differences (period 12) are taken to make the
series stationary in the mean. The data then appears to be white noise
(that is, the acf and pacf suggest no model to be fitted). The final
model is an arima(0, 0, 0) (0, 1, 0)12 model.

When writing seasonal components of the model, it is usual to write seasonal


ar terms with a capital phi: Φ. Likewise, seasonal ma models are written
using a capital theta: Θ. This is in line with using capital P and Q for the
orders of the seasonal components.

Example 7.17: In Example 7.8 (p 139), the average monthly sea level at
Darwin was analysed. In Example 7.9 (p 141), seasonal differences
were taken to make the data stationary.
The seasonally differenced data (Fig. 7.5, top panel) was non-stationary.
The seasonally differenced and non-seasonally differenced data (Fig. 7.5,
bottom panel) looks approximately stationary. The sample acf and
pacf of this series is shown in Fig. 7.8.


Figure 7.8: The sample acf and pacf of the twice-differenced monthly
average sea level at Darwin, Australia in metres.

For the non-seasonal components of the model, the sample acf sug-
gests no model is necessary (the one component above the dotted confi-
dence interval can probably be ignored—it is just over the approximate
lines and is at a lag of two). The sample pacf suggest no model is
needed either—though there is again a marginal component at a lag of
two. (It may be necessary to include these terms later, as will become
evident in the diagnostic analysis, but it is unlikely.)
For the seasonal model, the sample pacf decays very slowly (there
is one at seasonal lag 1, lag 2 and lag 3), suggesting a large number of
seasonal ar terms would be necessary. In contrast, the sample acf
suggests one seasonal ma term is needed. In summary, two differences
have been taken (so d = 1 and D = 1). No non-seasonal model seems
necessary (so p = q = 0), but a seasonal ma(1) term is suggested (so
P = 0 and Q = 1). So the model is arima(0, 1, 0) (0, 1, 1)12 , and there
is only one parameter to estimate (the seasonal ma(1) parameter).

7.5.3 The backshift operator

Earlier, it was shown the backshift operator equivalent of taking non-seasonal


differences was (1 − B). Similarly, if the series {Xt } is seasonally differ-
enced with a season of length s, then the backshift operator equivalent is
(1 − B s )Xt .


The general form of an arima(p, d, q) (P, D, Q)s model is written using the
backshift operator as

(1 − B)d (1 − B s )D φ(B)Φ(B)Xt = θ(B)Θ(B)et ,

where φ(B) is the non-seasonal ar component written using the backshift


operator, Φ(B) is the seasonal ar component written using the backshift
operator, θ(B) is the non-seasonal ma component written using the backshift
operator, and Θ(B) is the seasonal ma component written using the
backshift operator. The terms in the seasonal components decay in steps
of the season length (that is, if the season has a length of seven,
Φ(B) = 1 + 0.31B^7 − 0.19B^14 is a typical term).

Example 7.18: In Example 7.17, one model suggested for the average
monthly sea level at Darwin was arima(0, 1, 0) (0, 1, 1)12 . Using the
backshift operator, this model is

(1 − B)(1 − B 12 )Xt = Θ(B)et ,

where Θ(B) = 1 + Θ1 B 12 . Using r, the unknown parameter is calcu-


lated to be −0.9996, so the model is

(1 − B)(1 − B 12 )Xt = (1 − 0.9996B 12 )et .

Example 7.19: Maier & Dandy [31] use arima models to model the daily
salinity at Murray Bridge, South Australia from Jan 1, 1987 to 31 Dec
1991. They examined numerous models, including some models not
in the Box–Jenkins methodology. The best Box–Jenkins models were
those based on one set of non-seasonal differences, and one or two sets
of seasonal differences, with a season of length s = 365. One of their
final models was the arima(1, 1, 1) (1, 2, 0)365 model

(1 − B)(1 − B 365 )2 (1 + 0.267B)(1 − 0.513B 365 )Xt


= (1 − 0.455B)et ,

where the daily salinity is {Xt }.


7.5.4 Estimation

Estimation of seasonal arima models is quite tricky, as there are many pa-
rameters that could be specified: the ar and ma components both seasonally
and non-seasonally. This is part of the help from the arima function

arima(x, order = c(0, 0, 0),


seasonal = list(order = c(0, 0, 0), period = NA) )

The input order has been used previously; to also specify seasonal compo-
nents, the input seasonal must be used.

Example 7.20: In Example 7.17 (p 146), an arima(0, 1, 0) (0, 1, 1)12 was


suggested for the average monthly sea level at Darwin. This model is
fitted in r as follows:

> dsl.small <- arima(sealevel$Sealevel, order = c(0,


+ 1, 0), seasonal = list(order = c(0, 1,
+ 1), period = 12))
> dsl.small

Call:
arima(x = sealevel$Sealevel, order = c(0, 1, 0), seasonal = list(order = c(0,
1, 1), period = 12))

Coefficients:
sma1
-0.9996
s.e. 0.2305

sigma^2 estimated as 1013: log likelihood = -654.01, aic = 1312.02

So the fitted model is written

(1 − B)(1 − B^12)Dt = (1 − 0.99957B^12)et ,

where (1 − B) is the ordinary difference and (1 − B^12) the seasonal
difference. Expanding,

Dt − Dt−1 − Dt−12 + Dt−13 = et − 0.99957et−12 .

This means the estimated seasonal ma parameter is −0.99957.


7.6 Forecasting

The principles of forecasting used earlier apply to arima and seasonal arima
models without significant differences. However, it is necessary to write the
model without using the backshift operator first, which can be quite tedious.

Example 7.21: Consider the arima(1, 0, 0) (0, 1, 1)4 model Wn = 0.20Wn−1 +


en − 0.16en−4 , where Wn = Zn − Zn−4 is the seasonally differenced se-
ries. Using the backshift operator, the model is

(1 − B^4)(1 − 0.20B)Zt = (1 − 0.16B^4)et ,

where (1 − B^4) is the seasonal difference.
Expanding the backshift terms gives

(1 − 0.2B − B 4 + 0.20B 5 )Zt = (1 − 0.16B 4 )et ,

so the model is written as

Zn = 0.2Zn−1 + Zn−4 − 0.20Zn−5 − 0.16en−4 + en .

A one-step ahead forecast is given by

Ẑn+1|n = 0.2Zn + Zn−3 − 0.20Zn−4 − 0.16en−3 .

A two-step ahead forecast is given by

Ẑn+2|n = 0.2Ẑn+1|n + Zn−2 − 0.20Zn−3 − 0.16en−2 .
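In r, forecasts of this kind come from predict applied to a fitted model
object; a minimal sketch, where fit is assumed to be the result of an
arima call like those in this Module:

> fc <- predict(fit, n.ahead = 2)   # one- and two-step ahead forecasts
> fc$pred                           # the point forecasts
> fc$pred + 1.96 * fc$se            # approximate 95% forecast limits
> fc$pred - 1.96 * fc$se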

Example 7.22: In Example 7.19, the arima(1, 1, 1) (1, 2, 0)365 model

(1 − B)(1 − B 365 )2 (1 + 0.267B)(1 − 0.513B 365 )Xt


= (1 − 0.455B)et ,

was given for the daily salinity at Murray Bridge, South Australia,
say {Xt }. After expanding the terms on the left-hand side, there will
be terms involving B, B 2 , B 365 , B 366 , B 367 , B 730 , B 731 , B 732 , B 1095 ,
B 1096 and B 1097 . This makes it very difficult to write down. Indeed,
without using the backshift operator as above, it would be very tedious
to write down the model at all, even though only three parameters have
been estimated. Note this is an unusual case of model fitting in that
three sets of differences were taken.


Figure 7.9: Some residual plots for the arima(0, 1, 0) (0, 1, 1)12 fitted to the
monthly sea level at Darwin. Top left: the residual acf; top right: the
residual pacf; bottom left: the cumulative periodogram; bottom right: the
Q–Q plot.

7.7 Diagnostics

The usual diagnostics apply equally for non-stationary models; see Module 6.

Example 7.23: In Example 7.17, the model arima(0, 1, 0) (0, 1, 1)12 was
suggested for the monthly sea level at Darwin. The residual acf,
residual pacf and the cumulative periodogram can be produced in
r (Fig. 7.9).

> sma1 <- arima(sl, order = c(0, 1, 0), seasonal = list(order = c(0,
+ 1, 1), period = 12))
> acf(resid(sma1))
> pacf(resid(sma1))
> cpgram(resid(sma1))

Both the residual acf and pacf look OK, but both have a significant
term at lag 2; the periodogram looks a little suspect, but isn’t too bad.
The Box–Pierce Q statistic can be computed, and the standard error
of the estimated parameter found also:


> Box.test(resid(sma1))

Box-Pierce test

data: resid(sma1)
X-squared = 2.9538, df = 1, p-value = 0.08567

> coef(sma1)/sqrt(diag(sma1$var.coef))

sma1
-4.335704

The Box–Pierce test is OK, but is marginal. The estimated parameter


is significant.
The Q–Q plot is OK if not perfect.
In summary, the arima(0, 1, 0) (0, 1, 1)12 looks OK, but there are some
points of minor concern. Is there perhaps a better model? Perhaps
a model with a term at lag 2 such as arima(2, 1, 0) (0, 1, 1)12 ? (The
reason for proposing this model is that the residual acf and pacf
suggest difficulties at lag 2.)
We fit this model and compare the residual analyses; see Fig. 7.10.

> oth.mod <- arima(sl, order = c(2, 1, 0), seasonal = list(order = c(0,
+ 1, 1), period = 12))
> acf(resid(oth.mod))
> pacf(resid(oth.mod))
> cpgram(resid(oth.mod))
> qqnorm(resid(oth.mod))
> qqline(resid(oth.mod))
> Box.test(resid(oth.mod))

Box-Pierce test

data: resid(oth.mod)
X-squared = 0.0116, df = 1, p-value = 0.9144

> coef(oth.mod)/sqrt(diag(oth.mod$var.coef))

ar1 ar2 sma1


-1.309895 2.202391 -4.992391

The residual acf and pacf appear better, as does the periodogram.
The Box–Pierce statistic is now certainly not significant, but one of the
parameters is unnecessary. (This was expected; we only really wanted


Figure 7.10: Some residual plots for the arima(2, 1, 0) (0, 1, 1)12 fitted to
the monthly sea level at Darwin. Top left: the residual acf; top right: the
residual pacf; bottom left: the cumulative periodogram; bottom right: the
Q–Q plot.


the second lag, but were forced to take the first, insignificant one.)
The Q–Q plot looks marginally improved also.
Fitting an arima(0, 1, 2) (0, 1, 1)12 produces similar results. Which is
the better model? It is not entirely clear; either is probably OK.

7.8 A summary of model fitting

To summarise, these are the steps that need to be taken to fit a good model:

ˆ Plot the data. Check that the data is stationary. If the data is not
stationary, deal with it appropriately (by taking logarithms or differ-
ences (seasonal and/or non-seasonal), or perhaps both). Remember
that it is rare to require many levels of differencing.

ˆ Examine the sample acf, sample pacf and/or the AIC to determine
possible models for the data. Models may include ma, ar, arma or
arima models, with non-seasonal and/or seasonal aspects. (Remem-
ber that it is rare to have models with a large number of parameters to
be estimated.) You may have to use a periodogram to identify season
length.

ˆ Use r’s arima function to fit the models and determine the parameter
estimates.

ˆ Perform the following diagnostic checks for each of the possible models.

– examine the residual acf and pacf;


– the cumulative periodogram of residuals;
– and the Box–Pierce statistic;
– examine the Q–Q plot; and
– the significance of the parameter estimates.

ˆ Choose the best model from the available information, and write down
the model (probably using backshift operators). Remember that the
simplest, most adequate model is the best model; more parameters do
not necessarily make a better model.

These steps are summarized in the flowchart in Fig. 7.11.


[Flowchart: Plot the series → Is the time series stationary? If no, use
differences and/or logs → Identify possible models (ar, ma, arma or
arima, using the acf, pacf and/or AIC) → Estimate model parameters →
Perform diagnostic checks → Is the model adequate? If no, return to model
identification; if yes, write down the final model.]

Figure 7.11: A flowchart for fitting arima (Box–Jenkins) type models.
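The flowchart translates into an r session along the following lines (a
schematic sketch only: the file name, the differencing and the model orders
here are illustrative, not a recipe):

> x <- ts(scan("series.dat"), frequency = 12)  # hypothetical data file
> plot(x)                        # is the series stationary?
> y <- diff(diff(x, lag = 12))   # difference (and/or log) if not
> acf(y)                         # identify candidate models
> pacf(y)
> fit <- arima(x, order = c(0, 1, 1),
+     seasonal = list(order = c(0, 1, 1), period = 12))
> acf(resid(fit))                # diagnostic checks
> cpgram(resid(fit))
> Box.test(resid(fit))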


7.9 A complete example

The data file mlco2.dat contains monthly measurements of carbon dioxide


above Mauna Loa Hawaii from Jan 1959 to Dec 1990 in parts per million
(ppm). (Missing values have been filled in by linear interpolation.) The
data were collected by Scripps Institute of Oceanography, La Jolla, Cali-
fornia. The original source is the climatology database maintained by the
Oak Ridge National Laboratory, and the data here have been obtained from
Hyndman [5].
We will find a suitable model for the data, and use diagnostic tests to de-
termine if the model is adequate.
A plot (Fig. 7.12, top panel) shows the series is clearly non-stationary in
the mean, and is also seasonal. Taking non-seasonal difference produces an
approximately stationary series in the mean, but the series is still strikingly
seasonal (Fig. 7.12, centre panel). Taking seasonal differences (length 12)
produces a series that appears stationary (Fig. 7.12, bottom panel).
Using the stationary (twice-differenced) series, the sample acf and pacf are
shown in Fig. 7.13.
To find a model, first consider the non-seasonal components. The acf sug-
gests an ma(1) or perhaps ma(3) model. The pacf suggests an ar(1) or
perhaps ar(3) model. At this stage, choosing either the ma(1) or ar(1)
model seems appropriate as the terms at lag 3 are marginally over the ap-
proximate confidence limits. Which to choose? Since the lag 3 term seems
more marginal in the acf, perhaps the ma(1) model is the best choice (this
may not turn out to be the case).
Consider now the seasonal components. The acf has a strong term at a
seasonal lag of 1 only, suggesting a seasonal ma(1) model. In contrast, the
pacf shows significant terms at a seasonal lag of 1, 2 and 3 (and there may
be more if we looked at higher seasonal lags). This suggests a seasonal model
of at least ar(2). For the seasonal component, the best model is the ma(1).
Combining this information suggests the model arima(0, 1, 1) (0, 1, 1)12 .
This model is fitted as follows:

> co.model <- arima(co, order = c(0, 1, 1),


+ seasonal = list(order = c(0, 1, 1), season = 12))

Is this model an adequate model? The residual acf and pacf (Fig. 7.14)
suggest the model is adequate.
The cumulative periodogram and Q–Q plots (Fig. 7.15) indicate the model
is adequate. The two estimated parameters are also significant:


Figure 7.12: The monthly measurements of carbon dioxide above Mauna
Loa, Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). Top:
the data is clearly non-stationary in the mean and is seasonal; Middle: the
first differences have been taken; Bottom: the seasonal differences have also
been taken, and now the series appears stationary.


Figure 7.13: The monthly measurements of carbon dioxide above Mauna
Loa, Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). Top:
the sample acf of the twice-differenced series; Bottom: the sample pacf of
the twice-differenced series.


Figure 7.14: The monthly measurements of carbon dioxide above Mauna
Loa, Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). Top:
the residual acf; Bottom: the residual pacf.


Figure 7.15: The monthly measurements of carbon dioxide above Mauna
Loa, Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). The
cumulative periodogram and Q–Q plot indicate that the model is adequate.

> co.model$coef/sqrt(diag(co.model$var.coef))

ma1 sma1
-6.62813 -27.24159

The model suggested is

> co.model

Call:
arima(x = co, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), season = 12))

Coefficients:
ma1 sma1
-0.3634 -0.8581
s.e. 0.0548 0.0315

sigma^2 estimated as 0.0803: log likelihood = -66.66, aic = 139.33

Using backshift operators, the fitted model is

(1 − B)(1 − B^12)Ct = (1 − 0.3634B)(1 − 0.8581B^12)et .


7.10 Summary

In this Module, three types of non-stationarity have been considered: non-


stationarity in the mean, non-stationarity in the variance, and seasonal mod-
els. For each, identification and estimation has been considered, as well as
the notation for each. The diagnostic testing involved is the same as for
stationary models.

7.11 Exercises

Ex. 7.24: Consider an arima(1, 0, 0) (0, 1, 1)7 model fitted to a time series
{Pn }. Write this model using the backshift operator notation (make
up some reasonable parameter estimates).

Ex. 7.25: Consider an arima(1, 1, 0) (1, 1, 0)12 model fitted to a time series
{Yn }. Write this model using the backshift operator notation (make
up some reasonable parameter estimates).

Ex. 7.26: Consider some non-stationary data {W }. After taking non-


seasonal differences, the series seems stationary. Let this differenced
data be {Y }. A non-seasonal ar(1) model and seasonal ma(2) is fitted
to the stationary data (the season is of length 12).

(a) Write down the model fitted to the series {W } using the backshift
notation;
(b) Write down the model fitted to the series {W } using arima no-
tation.
(c) Write the model out in terms of Wt , et and previous terms (that
is, don’t use the backshift operator).

Ex. 7.27: Consider some non-stationary data {Z}. After taking seasonal
differences, the series seems stationary. Let this differenced data be
{Y }. A non-seasonal ma(2) model and seasonal arma(1, 1) is fitted
to the stationary data (the season is of length 24).

(a) Write down the model fitted to the series {Z} using the backshift
notation;
(b) Write down the model fitted to the series {Z} using arima no-
tation.
(c) Write the model out in terms of Zt , et and previous terms (that
is, don’t use the backshift operator).


Ex. 7.28: For each of the following cases, write down the final model using
the backshift operator and using notation.

(a) The time series {P } is non-stationary; after taking ordinary dif-


ferences, an arma(1, 0) model was fitted to the data.
(b) The time series {T } is seasonal with period 12. After seasonal
difference were taken, a seasonal ma(2) model was fitted to the
data.

Ex. 7.29: For each of the following cases, write down the final model using
the backshift operator and using notation.

(a) The time series {Y } is non-stationary; after taking seasonal dif-


ferences (season of length 12), an arma(1, 1) model was fitted to
the data.
(b) The daily time series {S} is seasonal with period 365. After
ordinary and seasonal difference were taken, an arma(1, 1) model
was fitted to the data.

Ex. 7.30: For the following models written using backshift operators, ex-
pand the model and write down the model in standard form. In addi-
tion, write down the model using arima notation.

(a) (1 − B)(1 − 0.3B 12 )Xt = (1 + 0.2B)et .


(b) (1 − B 7 )Hn = (1 − 0.5B − 0.2B 2 )en .
(c) (1 − 0.3B)Wn+1 = (1 − 0.4B)en+1 .

Ex. 7.31: For the following models written using backshift operators, ex-
pand the model and write down the model in standard form. In addi-
tion, write down the model using arima notation.

(a) (1 − B)2 (1 + 0.3B)Yt = et .


(b) (1 − B 12 )(1 − B)(1 + 0.3B)Mn+1 = en+1 .
(c) Wn+1 = (1 − 0.4B)(1 + 0.3B 7 )en+1 .

Ex. 7.32: Consider some non-stationary monthly data {G}. After taking
seasonal differences, the series seems stationary. Let this differenced
data be {H}. A non-seasonal ma(2) model, a seasonal ma(1) and a
seasonal ar(1) model are fitted to {H}. Write down the model fitted
to the series {G} using

(a) the backshift operator;


(b) arima notation.


(c) Make up some (reasonable) numbers for the parameters in this


model. Then write the model out in terms of Gt , et and previous
terms.

Ex. 7.33: Trenberth & Stepaniak [43] defined an index of El Niño evolution
they called the Trans-Niño Index (TNI). This monthly time series is
given in the data file tni.txt, and contains values of the TNI from
January 1958 to December 1999. (The data have been obtained from
the Climate Diagnostic Center [2].)

(a) Plot the series and see that it is a little non-stationary.


(b) Use differences to make the series stationary.
(c) Find a suitable ar model for the series.
(d) Find a suitable ma models for the series.
(e) Which model would you prefer: the ar or ma model? Explain
your answer using diagnostic analyses.
(f) For your preferred model, estimate the parameters.

Ex. 7.34: The sunspot numbers from 1770 to 1869 were given in Table 1.2
(p 20). The data are given in the data file sunspots.dat.

(a) Plot the data and decide if a seasonal component appears to exist.
(b) Use spectrum (and a smoother) to find any seasonal components.
(c) Suggest a possible model for the data (make sure to do a diag-
nostic analysis).

Ex. 7.35: The quasi-bienniel oscillation (QBO) was considered in Exer-


cise 1.7.

(a) Plot the data and decide if a seasonal component appears to exist.
(b) Use spectrum (and a smoother) to find any seasonal components.
(c) Suggest a possible model for the data (make sure to do a diag-
nostic analysis).

Ex. 7.36: The average monthly air temperatures in degrees Fahrenheit at


Nottingham Castle has been recorded for 20 years and is given in the
data file nottstmp.txt. (The data are from Anderson [6, p 166], as
quoted in Hand et al. [19, p 279].) Find a suitable time series model
for the data.
Note that the season is expected to be of length 12. See if you can dis-
cover this from the unsmoothed spectrum, and also from the smoothed
spectrum.


Ex. 7.37: Kärner & Rannik [25] fit an arima(0, 0, 0) (0, 1, 1)12 to the Inter-
national Satellite Cloud Climatology Project (ISCCP) cloud detection
time series {Cn }. They fit different model for different latitudes. At
−90◦ latitude, the unknown model parameter is about 0.7 (taken from
their Figure 5).

(a) Write this model using the backshift operator.


(b) Write the model in terms of Cn and en .
(c) Develop a forecasting model for forecasting one-, two-, twelve-
and thirteen- steps ahead.

Ex. 7.38: The streamflow in Little Mahoning Creek, McCormick, Pennsyl-


vania, from 1940 to 1988 is given in the data file mcreek.txt. The
file contains the monthly mean values of streamflow in cubic feet per
second. Find a suitable time series model for the data. (Make sure to
do a diagnostic analysis.)

Ex. 7.39: The data file wateruse.dat contains the annual water usage in
Baltimore city in litres per capita per day from 1885 to 1963. The
data are from Hipel & McLeod [21] and Hyndman [5]. Plot the data
and confirm that the data is non-stationary.

(a) Use appropriate methods to make the series stationary.


(b) Find a suitable model for the series and estimate the parameters
of the model. Make sure to do a diagnostic analysis.
(c) Write the model using the backshift operator.

Ex. 7.40: The file firring.txt contains the tree ring indices for the Dou-
glas fir at the Navajo National Monument in Arizona, USA from 1107
to 1968. Find a suitable model for the data.

Ex. 7.41: The data file venicesealevel.dat contains the maximum sea
levels recorded at Venice from 1887–1981. Find a suitable model for
the times series, including a diagnostic analysis of possible models.

7.11.1 Answers to selected Exercises

7.24 The model is of the form

(1 − B 7 )(1 − φB)Pn = (1 − ΘB 7 )et

for some values φ and Θ.


7.26 (a) (1 − B)(1 − φB)Wt = (1 + Θ1 B 12 + Θ2 B 24 )et for some numbers


φ, Θ1 and Θ2 .
(b) arima(1, 1, 0) (0, 0, 2)12 .
(c) Expanding the model written using backshift operators gives

(1 − (1 + φ)B + φB 2 )Wt = (1 + Θ1 B 12 + Θ2 B 24 )et .

This is equivalent to

Wt = (1 + φ)Wt−1 − φWt−2 + et + Θ1 et−12 + Θ2 et−24 .

7.28 (1 − B)(1 − φB)Pt = et which is arima(1, 1, 0) (0, 0, 0)0 ; (1 − B 12 )Tt =


(1 + Θ1 B 12 + Θ2 B 24 )et which is arima(0, 0, 0) (0, 1, 2)12 .

7.30 (a) Xt = Xt−1 + 0.3Xt−12 − 0.3Xt−13 + et + 0.2et−1 , which is an


arima(0, 1, 1) (1, 0, 0)12 model.
(b) Hn = Hn−7 + en − 0.5en−1 − 0.2en−2 which is a arima(0, 0, 2)
(1, 0, 0)7 model.
(c) Wn+1 = 0.3Wn + en+1 − 0.4en which is a arima(1, 0, 1) (0, 0, 0)?
model; ie, it is not seasonal.

7.39 The series is plotted in the top plot in Fig. 7.16. The data are clearly
non-stationary in the mean. Taking difference produces an approxi-
mately stationary series; see the bottom plot in Fig. 7.16.
Using the stationary differenced series, the sample acf and pacf are
shown in Fig. 7.17. These plots suggest that no model can be fitted
to the differenced series. That is, the first differences are random.
The model for the water usage {Wt } is therefore

(1 − B)Wt = et

or Wt = Wt−1 + et . There are no parameters to estimate.


Here is the code used:

> wu <- read.table("wateruse.dat", header = TRUE)


> wu <- ts(wu$Use, start = 1885)
> plot(wu, las = 1)
> dwu <- diff(wu)
> plot(dwu, main = "First difference of water use",
+ las = 1)
> acf(dwu, main = "")
> pacf(dwu, main = "")

The diagnostics have been left for you.


Figure 7.16: The annual water usage in Baltimore city in litres per capita
per day from 1885 to 1968. Top: the data is clearly non-stationary in the
mean; Bottom: the first differences are approximately stationary.


Figure 7.17: The annual water usage in Baltimore city in litres per capita
per day from 1885 to 1968. Top: the sample acf of the differenced series;
Bottom: the sample pacf of the differenced series.


> par(mfrow = c(1, 2))


> plot(vs)
> plot(diff(vs))

Figure 7.18: A plot of the Venice sea level data. Left: original data; right:
after taking first differences.

7.41 First, load and prepare the data:


> VSL <- read.table("venicesealevel.dat", header = TRUE)
> vs <- ts(VSL$MaxSealevel, start = c(1887))
A plot of the data shows the series is non-stationary (Fig. 7.18, left
panel) and the data are increasing (what is the implication there?). Taking
differences produces a more stationary series (Fig. 7.18, right panel).
See the acf and pacf (Fig. 7.19); the acf suggests an ma(1) model
(or possibly ma(3), but start with the simpler choice), while the pacf
suggests an ar(2) model. Decide to start with the ar(1) model:
> vs.ar1 <- arima(vs, order = c(1, 1, 0))
The residual acf and pacf aren’t great (Fig. 7.20); there are quite
a few components outside the approximate confidence limits, but the
components at lag 1 are fine in both plots. Maybe the ma(1) would
be better? That does appear to be true (Fig. 7.21).

> vs.ma1 <- arima(vs, order = c(0, 1, 1))

This model appears fine, if not perfect, so let’s examine more diagnos-
tics (Fig. 7.22); these look OK too.
So, for some final diagnostics:


> par(mfrow = c(1, 2))


> acf(diff(vs))
> pacf(diff(vs))

Figure 7.19: The acf and pacf of the Venice sea level data.

> par(mfrow = c(1, 2))


> acf(resid(vs.ar1))
> pacf(resid(vs.ar1))

Figure 7.20: The residual acf and pacf of the Venice sea level data after
fitting the ar(1) model.


> par(mfrow = c(1, 2))


> acf(resid(vs.ma1))
> pacf(resid(vs.ma1))

Figure 7.21: The residual acf and pacf of the Venice sea level data after
fitting the ma(1) model.

> par(mfrow = c(1, 2))


> cpgram(resid(vs.ma1))
> qqnorm(resid(vs.ma1))
> qqline(resid(vs.ma1))

Figure 7.22: Further diagnostic plots of the Venice sea level data after fitting
the ma(1) model.


> Box.test(resid(vs.ma1))

Box-Pierce test

data: resid(vs.ma1)
X-squared = 0.8388, df = 1, p-value = 0.3597

> coef(vs.ma1)/sqrt(diag(vs.ma1$var.coef))

ma1
-16.48790

All looks well; decide the ma(1) model is suitable:

> vs.ma1

Call:
arima(x = vs, order = c(0, 1, 1))

Coefficients:
ma1
-0.8677
s.e. 0.0526

sigma^2 estimated as 319.5: log likelihood = -405.11, aic = 814.23



Module 8
Markov chains

Module contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.3 The transition matrix . . . . . . . . . . . . . . . . . . . . 177
8.4 Forecast the future with powers of the transition matrix . . . 181
8.5 Classification of finite Markov chains . . . . . . . . . . 184
8.6 Limiting state (steady state) probabilities . . . . . . . . 187
8.6.1 Share of the market model . . . . . . . . . . . . . . . . 190
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
8.7.1 Answers to selected Exercises . . . . . . . . . . . . . . . 200

Module objectives: Upon completion of this Module students should be


able to:

ˆ state and understand the Markov property;

ˆ identify processes requiring a Markov chain description;

ˆ determine the transition and higher order matrices of a Markov chain;

ˆ calculate state probabilities;


ˆ determine and interpret the steady state distribution;

ˆ calculate and interpret mean recurrence intervals;

ˆ apply Markov chain techniques to basic decision making problems;

ˆ determine future states or conditions using Markov analysis.

8.1 Introduction

Up to now, only continuous time series have been considered; that is, the
quantity being measured over time is continuous. In this Module¹, a simple
method is considered for time series that take on discrete values. A simple
example is the state of the weather: whether it is fine or whether it is raining.

8.2 Terminology

A stochastic process is a collection of random variables {X(t)} where the


parameter t denotes time (possibly space) and ranges over some interval of
interest; e.g. t ≥ 0. X(t) denotes the random variable X at time t. The
values assumed by X(t) may be called states and the set of all possible
states is called the state space. The state space (and hence X(t)) may
be discrete or continuous: the state space of a queue is discrete; the state
space of inter-event times is continuous. The time parameter t (sometimes
called the indexing parameter) may also be discrete or continuous. In this
Module, we study the case where the state space is discrete, and the time
parameter t is also discrete (and equally spaced). Two examples of a discrete
time stochastic process follow.

ˆ Let Y (n) be the volume of water in a reservoir at the start of month n.
The parameter n is used in place of t to emphasise the fact that this
parameter is discrete, taking on values 0, 1, 2,. . . . Although Y (n) is
naturally continuous, since it is a measure of volume, it may be suffi-
cient in some applications to measure Y (n) on a crude scale containing
relatively few values, in which case Y (n) would be treated as discrete.

ˆ Let T (n) be the time between the nth and (n + 1)th pulses registered on
a Geiger counter. The indexing parameter n ∈ {0, 1, 2, . . .} is discrete
and the state space continuous. A realisation of this process would be
a discrete set of real numbers with values in the range (0, ∞).
1
Most of the material in this Module has been drawn from previous work by Dr Ashley
Plank and Professor Tony Roberts.


In this section we consider stochastic models with both discrete state space
and discrete parameter space. Some examples include annual surveys of
biological populations; monthly assessments of the water level in a reservoir;
weekly inventories of stock; daily inspections of a vending machine;
microsecond sampling of a buffer state. These models are used occasionally in
climate modelling.

Example 8.1: Tomorrow's weather Consider the state of the weather
on a day-to-day basis. Days may be classified as either fine/sunny or
overcast/cloudy. Suppose a fine day follows a fine sunny day 40% of
the time, and an overcast cloudy day follows an overcast day 20% of the
time. For example, the data behind this conclusion may be the
following sequence of observations for consecutive days: C, S, C, C, S,
S, S, C, S, C, S (though illustrative, this sample is far too small for real
applications). See that, as stated above, 2/5 of the sunny days are
followed by a sunny day and 1/5 of the cloudy days are followed
by a cloudy day. Define

Xt = 1 if day t is fine/sunny, and Xt = 2 if day t is overcast/cloudy.

In other words, let state 1 correspond to sunny days, and state 2 to
cloudy. We model this process as a Markov chain as defined below by
assuming that for any two consecutive days t and t + 1 in the future:

Pr {Xt+1 = 1 | Xt = 1} = 0.4 , Pr {Xt+1 = 2 | Xt = 2} = 0.2 .

It follows that

Pr {Xt+1 = 2 | Xt = 1} = 1 − 0.4 = 0.6 ,

Pr {Xt+1 = 1 | Xt = 2} = 1 − 0.2 = 0.8 .


This information is recorded on a state transition diagram. (Always draw
a state transition diagram.) Here state 1 (sunny) has a self-loop with
probability 0.4 and state 2 (cloudy) a self-loop with probability 0.2; the
arrow from 1 to 2 carries probability 0.6, and the arrow from 2 to 1
carries probability 0.8.

© USQ, February 21, 2007


176 Module 8. Markov chains

These four probabilities are conveniently represented as the matrix

P = [ P11  P12 ]  =  [ 0.4  0.6 ]                          (8.1)
    [ P21  P22 ]     [ 0.8  0.2 ]

Note that the rows sum to one.
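These numbers are easily checked in r. The following minimal sketch (the object name P.weather is illustrative, not from the text) enters the matrix and verifies that each row sums to one:

> P.weather <- matrix(c(0.4, 0.6,
+     0.8, 0.2), nrow = 2, byrow = TRUE)
> rowSums(P.weather)   # each row of a transition matrix must sum to 1
[1] 1 1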

Markov chains are a special type of discrete-time stochastic process. For
convenience, as above, we write times as an integral number of some basic
units such as days, weeks, months, years or microseconds.

Definition 8.1 (Markov chain) Suppose a discrete-time stochastic process
can be in one of a finite number of states, generally labelled 1, 2, 3, . . . , s;
then the stochastic process is called a Markov chain if

Pr {Xt+1 = it+1 | Xt = it , Xt−1 = it−1 , . . . , X1 = i1 , X0 = i0 }
        = Pr {Xt+1 = it+1 | Xt = it } .

This expression says that the probability distribution of the state at time
t + 1 depends only on the state at time t (namely it ) and does not depend
on the states the chain passed through on the way to it at time t. Usu-
ally we make a further assumption that for all states i and j and all t,
Pr {Xt+1 = j | Xt = i} is independent of t. This assumption applies when-
ever the system under study behaves consistently over time. Any stochastic
process with this behaviour is called stationary. Based on this assumption
we write
Pr {Xt+1 = j | Xt = i} = Pij , (8.2)
so that Pij is the probability that, given the system is in state i at time t, the
system will be in state j at time t + 1. The Pij are referred to as the transition
probabilities.

Note that it is crucial that you clearly define the states and the discrete
times.

Example 8.2: Preisendorfer and Mobley [37] and Wilks [48] use a three-
state Markov chain to model the transitions between below-normal,
normal and above-normal months for temperature and precipitation.


8.3 The transition matrix

For a system with s states the transition probabilities are conveniently repre-
sented as an s × s matrix P . Such a matrix P is called the transition matrix
and each Pij is called a one-step transition probability. For example, P12
represents the probability that the process makes a transition from state 1
to state 2 in one period, whereas P22 is the probability that the system stays
in state 2. Each row represents the one-step transition probability distri-
bution over all states. If we observe the system in state i at the beginning
of any period, then the ith row of the transition matrix P represents the
probability distribution over the states at the beginning of the next period.

The same transition matrix completely describes the probabilistic behaviour


of the system for all future one-step transitions. The probabilistic behaviour
of such a system over time is called a stationary Markov chain. Stationary
because the matrix P is the same for transitions between all times.

Example 8.3: Tomorrow's weather continued Consider the weather
example with transition matrix (8.1) and suppose today, t = 0, is
sunny: state X0 = 1. Then from the given data the probability of being
sunny tomorrow, state X1 = 1, is 0.4, and the probability of it being
cloudy, X1 = 2, is thus 0.6. So our forecast for tomorrow's weather is
the probabilistic mix p(1) = [ 0.4  0.6 ], called a probability vector
and denoting the probability of being sunny or cloudy respectively. As
claimed above, this is just the first row of the transition matrix P .
What can we say about the weather in two days time? We seek a
vector of probabilities, say p(2), giving the probabilities of the day
after tomorrow being sunny or cloudy respectively. Given today is
sunny, then the day after tomorrow can be sunny, X2 = 1, via two
possible routes: it can be cloudy tomorrow then sunny the day after,
with probability (as the Markov assumption is that these transitions
are independent)

Pr {X2 = 1 | X1 = 2} × Pr {X1 = 2 | X0 = 1} = P21 × P12 = 0.8 × 0.6 ;

or it can be sunny tomorrow then sunny the day after, with probability
(as the transitions are assumed independent)

Pr {X2 = 1 | X1 = 1} × Pr {X1 = 1 | X0 = 1} = P11 × P11 = 0.4 × 0.4 .

Since these are two mutually exclusive routes, we add their probability
to determine

Pr {X2 = 1 | X0 = 1} = 0.4 × 0.4 + 0.6 × 0.8 = 0.64 .


Similarly, the probability that it is cloudy the day after tomorrow is
the sum of two possible routes:

Pr {X2 = 2 | X0 = 1} = 0.4 × 0.6 + 0.6 × 0.2 = 0.36 .

Combining these into one probability vector, our probabilistic forecast
for the day after tomorrow is p(2) = [ 0.64  0.36 ]. The important
general feature of this example is that post-multiplication by the
transition matrix determines how the vector of probabilities evolves,
p(2) = p(1)P , as you see realised in the above two displayed expressions.
This formula applies to the initial forecast of tomorrow's weather too.
Since we know today is sunny, the current state is p(0) = [ 1  0 ],
denoting that we are certain the weather is in state 1. Then observe
in the above that p(1) = p(0)P .
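In r these forecasts are one-line matrix products (a sketch continuing the illustrative P.weather object entered earlier; %*% is r's matrix multiplication operator):

> p0 <- c(1, 0)            # today is certainly sunny (state 1)
> p1 <- p0 %*% P.weather   # forecast for tomorrow
> p2 <- p1 %*% P.weather   # forecast for the day after tomorrow
> p1
     [,1] [,2]
[1,]  0.4  0.6
> p2
     [,1] [,2]
[1,] 0.64 0.36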

Using independence of transitions from step to step, and the mutual exclu-
siveness of different possible paths we establish the general key result:

Theorem 8.2 If a Markov chain has transition matrix P and is in states
with probability vector p(t) at time t, then one time step later its probability
vector is p(t + 1) = p(t)P .

Proof: Consider a schematic, general but partial, state transition diagram:
each state i = 1, . . . , s is occupied at time t with probability pi (t), and an
arrow labelled Pij leads from each state i into the state j of interest.



The system arrives in some state j at time t + 1 by s mutually exclusive
possibilities, depending upon the state of the system at time t:

pj (t + 1) = Σ_{i=1}^{s} Pr {make state i to j transition}
           = Σ_{i=1}^{s} Pr {in state i} × Pr {Xt+1 = j | Xt = i}
           = Σ_{i=1}^{s} pi (t)Pij     (by their definition)
           = jth element of p(t)P .

Hence, putting these elements together: p(t + 1) = p(t)P . ♠

Note that the future behaviour of the system (for example, the states of
the weather) only depends on the current state and not on how it entered
this state. Given the transition matrix P , knowledge of the current state
occupied by the process is sufficient to completely describe the future proba-
bilistic behaviour of the process. This lack of memory of earlier history may
be viewed as an extreme limitation. However, this is not so. As the next
example shows, we can build into the current state such a memory. The
trick is widely applicable and creates a powerful modelling mechanism.

Example 8.4: Remembering yesterday's weather. Assume that tomorrow's
weather depends on the weather conditions during the last
two days as follows:

ˆ if the last two days have been sunny, then 95% of the time tomorrow
will be sunny;
ˆ if yesterday was cloudy and today is sunny, then 70% of the time
tomorrow will be sunny;
ˆ if yesterday was sunny and today is cloudy, then 60% of the time
tomorrow will be cloudy;
ˆ if the last two days have been cloudy, then 80% of the time tomorrow
will be cloudy.

Using this information, model the weather as a Markov chain, draw
the state transition diagram and write down its transition matrix. If
tomorrow's weather depends on the weather conditions during the last
three days, how many states would be needed to model the weather
as a Markov chain?


Solution: Since each day is classified as either sunny (S) or cloudy
(C), we have 4 states: SS, SC, CS, and CC. In these labels the first
letter denotes what the weather was yesterday and the second letter
denotes today's weather. For example, the second rule above says that
if today we are in state CS (that yesterday was cloudy and today is
sunny), then with probability 70% tomorrow will be in state SS because
tomorrow will be sunny, the second S, and tomorrow's yesterday,
namely today, was sunny, the first S. The state transition diagram is:

(diagram: arrows SS→SS with probability 0.95, SS→SC 0.05, SC→CS 0.4,
SC→CC 0.6, CS→SS 0.7, CS→SC 0.3, CC→CS 0.2 and CC→CC 0.8)

The transition matrix is thus

         SS     SC     CS     CC
    SS [ 0.95   0.05   0.0    0.0  ]
P = SC [ 0.0    0.0    0.40   0.60 ]
    CS [ 0.70   0.30   0.0    0.0  ]
    CC [ 0.0    0.0    0.20   0.80 ]

If tomorrow's weather depends on the weather conditions during the
last 3 days, then 2^3 = 8 states are needed: SSS, SSC, SCS, SCC,
CSS, CSC, CCS and CCC.

See that you may write down the states of a Markov chain in any order that
you please. But once you have decided on an ordering, you must stick to
that ordering throughout the analysis. In the above example, the labels for
both the rows and the columns of the transition matrix must be, and are,
in the same order, namely SS, SC, CS, and CC. In applying Markov chains,
there need not be a natural order for the states, and so you will have to
decide and fix upon one.


8.4 Forecast the future with powers of the transition matrix

Using independence of transitions from step to step, and the mutual exclu-
siveness of different possible paths:

Theorem 8.3 If the process is in states with probability vector p(t) at
time t, then n steps later its probability vector is p(t + n) = p(t)P n .

Example 8.5: In the weather Example 8.3 we saw that p(1) = p(0)P and
p(2) = p(1)P , so that

p(2) = p(1)P = p(0)P P = p(0)P 2 .

Thus the forecast 2 days later is the current probability vector
post-multiplied by P 2 .

Proof: It is certainly true for the n = 1 case: p(t + 1) = p(t)P by
Theorem 8.2. For the case n = 2:

p(t + 2) = p(t + 1)P    by Theorem 8.2
         = p(t)P P      by Theorem 8.2 again
         = p(t)P 2 .

For the case n = 3:

p(t + 3) = p(t + 2)P    by Theorem 8.2
         = p(t)P 2 P    by the n = 2 case
         = p(t)P 3 .

For the case n = 4:

p(t + 4) = p(t + 3)P    by Theorem 8.2
         = p(t)P 3 P    by the n = 3 case
         = p(t)P 4 .

And so on (formally by induction) for the general case. ♠

Given a Markov chain with transition probability matrix P , if the chain is
in state i at time t, we might be interested to know the probability that
n periods later the chain will be in state j. Since we are dealing with a
stationary Markov chain, this probability will be independent of t.


Corollary 8.4 The (i, j)th element of P n gives the probability of starting
from state i and being in state j precisely n steps later.

Proof: Being in state i at time t corresponds to p(t) being zero except for
the ith element, which is one; then the right-hand side of p(t + n) = p(t)P n
shows p(t + n) must be just the ith row of P n . Thus the corollary follows.

Example 8.6: Assume that the population movement of people between
city and country is modelled as a Markov chain with transition matrix

P = [ P11  P12 ]
    [ P21  P22 ]

where:

ˆ P11 = 0.9 is the probability that a person currently living in the
city will remain in the city after one transition (year);
ˆ P12 = 0.1 is the probability that a person currently living in the
city will move to the country after one transition (year);
ˆ P21 = 0.2 is the probability that a person currently living in the
country will move to the city after one transition (year); and
ˆ P22 = 0.8 is the probability that a person currently living in the
country will remain in the country after one transition (year).

If a person is currently living in the city, what is the probability that
this person will be living in the country 2 years from now?
If 75% of the population is currently living in the city and 25% in the
country, what is the population distribution after 1, 2, 3 and 10 years
from now?

Solution: To answer the first question we determine element (1, 2)
of the matrix P 2 :

P 2 = [ 0.9  0.1 ] [ 0.9  0.1 ]  =  [ 0.83  0.17 ]
      [ 0.2  0.8 ] [ 0.2  0.8 ]     [ 0.34  0.66 ] .

Hence [P 2 ]12 = 0.17. This means that the probability that a city
person will live in the country after 2 transitions (years) is 17%.
To find the population distribution after 1, 2, 3 and 10 years, given
that the initial distribution is p(0) = [ 0.75  0.25 ], we perform the
following calculations.


After 1 year the distribution is:

p(1) = [ 0.75  0.25 ] [ 0.9  0.1 ]  =  [ 0.725  0.275 ] .
                      [ 0.2  0.8 ]

Use this result to find the population distribution after 2 years:

p(2) = [ 0.725  0.275 ] [ 0.9  0.1 ]  =  [ 0.7075  0.2925 ] .
                        [ 0.2  0.8 ]

And after 3 years:

p(3) = [ 0.7075  0.2925 ] [ 0.9  0.1 ]  =  [ 0.6952  0.3048 ] .
                          [ 0.2  0.8 ]

We continue with this process to obtain the population distribution
after 9 and 10 years:

p(9) = [ 0.6700  0.3300 ] ,    p(10) = [ 0.6690  0.3310 ] .

Notice that after many transitions the population distribution tends to settle
down to a steady state distribution.

The above calculations can also be performed via p(n) = p(0)P n . Hence to
calculate p(10) we multiply the initial population distribution by

P 10 = [ 0.6761  0.3239 ]
       [ 0.6478  0.3522 ] :

p(10) = [ 0.75  0.25 ] P 10 = [ 0.6690  0.3310 ] ,

which is the same result as before.

For large n notice that P n also approaches a steady state with identical
rows. For example,

lim  P n = [ 0.6667  0.3333 ]
n→∞        [ 0.6667  0.3333 ] .

The probabilities in each row represent the population distribution in the
steady state. This distribution is independent of the initial conditions. For
example, if a fraction x (0 ≤ x ≤ 1) of the population initially lived in the
city and a fraction (1 − x) in the country, in the steady state we will
find 66.67 percent living in the city and 33.33 percent living in the country
regardless of the value of x. This is verified by computing

p(∞) = [ x  1 − x ] [ 0.6667  0.3333 ]  =  [ 0.6667  0.3333 ] .
                    [ 0.6667  0.3333 ]
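All of these calculations can be reproduced in r with a short loop (a minimal sketch; the object names are illustrative):

> P <- matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2, byrow = TRUE)
> (P %*% P)[1, 2]      # element (1,2) of P^2: city to country in two years
[1] 0.17
> p <- c(0.75, 0.25)   # initial population distribution
> for (year in 1:10) p <- p %*% P   # apply Theorem 8.3 ten times
> p
          [,1]      [,2]
[1,] 0.6690206 0.3309794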


8.5 Classification of finite Markov chains

The long term behaviour of a Markov chain depends on the general structure
of the transition matrix. For some transition matrices the chain will
settle down to a steady state condition which is independent of the initial
state. In this subsection we identify the characteristics of a Markov chain
that will ensure a steady state exists. In order to do this we must classify a
Markov chain according to the structure of its transition diagram and matrix.
The critical property we need for a steady state is that the Markov
chain is “ergodic”—you may meet this term in completely different contexts,
such as in fluid turbulence, but the meaning is essentially the same: here
it means that the probabilities get “mixed up” enough to ensure there are
no long time correlations in behaviour, and hence a steady state will appear.
Further, an ergodic system is one in which time averages, such as might be
obtained from an experiment, are identical to ensemble averages (averages
over many realisations), which is what we often want to discuss and report
in applications. (There are biological situations with intriguing non-ergodic
effects.)
Consider the following transition matrix:

          1    2    3    4    5
    1 [ 0.3  0.7  0    0    0   ]
    2 [ 0.9  0.1  0    0    0   ]
P = 3 [ 0    0    0.2  0.8  0   ]
    4 [ 0    0    0.5  0.3  0.2 ]
    5 [ 0    0    0    0.4  0.6 ]
This matrix is depicted by a state transition diagram: each node represents
a state and the label on each arrow represents the transition probability Pij .
(Always draw such a state transition diagram for your Markov chains.)
Here states 1 and 2 are linked by arrows 1→2 (0.7) and 2→1 (0.9) with
self-loops 1→1 (0.3) and 2→2 (0.1), while states 3, 4 and 5 are linked by
arrows 3→4 (0.8), 4→3 (0.5), 4→5 (0.2) and 5→4 (0.4) with self-loops
3→3 (0.2), 4→4 (0.3) and 5→5 (0.6).

The following properties refer to this particular Markov chain as a first
example.

ˆ Given two states i and j, a path from i to j is a sequence of transitions
that begins in i and ends in j such that each transition in the sequence
has a positive probability of occurrence: thus [P n ]ij > 0 for an n-step
path.
For example, see that there are paths from 1 to 2, from 1 to 1, and
from 3 to 5, but not from 3 to 1.

ˆ A state j is accessible from i if there is a path leading from i to j after
one or more transitions.
For example, state 5 is accessible from state 3, but state 5 is not
accessible from state 1 or state 2.

ˆ Two states i and j communicate with each other if j is accessible
from i and i is accessible from j. If state i communicates with j and
with k, then j also communicates with k. Therefore, all states that
communicate with i also communicate with each other.2
For example, states 1 and 2 communicate with each other. Similarly
states 3 and 5 communicate, but states 1 and 5 do not.

ˆ A set of states S in a Markov chain is a closed set if no state outside
of S is accessible from any state in S.
For example, S1 = {1, 2} and S2 = {3, 4, 5} are both closed sets.

ˆ A state i is an absorbing state if Pii = 1. Once we enter such an
absorbing state, we never leave it, because with probability 1 we can
only make the transition from i to i; there is no “spare probability”
to go elsewhere.
There are no absorbing states in the above example. However, in
many models of biological populations, the population going extinct is
an absorbing state, because with zero females the species cannot breed
and so remains extinct forever.

ˆ A state i is a transient state if a state j exists that is accessible from
i, but the state i is not accessible from j. If a state is not transient
it is called a recurrent state. After a large number of transitions the
probability of being in a transient state is zero.
There are no transient states in the above example. States 1 and 2
in the Markov chain with the following state transition diagram are
transient; states 3 and 4 are recurrent:
2 For those who did Discrete Mathematics for Computing: communication is an equivalence relation.


(diagram: the transient states 1 and 2 feed forward into states 3 and 4,
which alternate with each other)

ˆ A recurrent state i is cyclic (periodic) with period d > 1 if the system
can never return to state i except after a multiple of d steps. (Thus
d is the greatest common divisor, over all possibilities, of the number
of transitions, n, for the process to move from state i back to state i:
d = gcd{n | [P n ]ii > 0}.) A state that is not cyclic is called aperiodic.
In the earlier example, all states are aperiodic because from each
state the system can revisit that state after any integer number of
steps. However, for the example immediately above, the two recurrent
states 3 and 4 are cyclic with period d = 2 as, for example, state 3
can only be returned to after a multiple of d = 2 steps; similarly for
state 4.

ˆ If all states in a chain are recurrent, aperiodic and communicate with
each other, the chain is ergodic.
The above examples are not ergodic because not all states communicate
with each other. See in the earlier example that no unique steady
state exists: if the system starts in states 1 or 2 it must stay in those
states forever, whereas if it starts in states 3–5 then it stays in those
states forever. The long time behaviour is quite different depending
upon which case occurs, and thus there is no unique steady state.

Example 8.7: Determine which of the chains with the following transition
matrices is ergodic.

     [ 0    0    0.5  0.5 ]          [ 0.2  0.4  0.4 ]
P1 = [ 0    0    0.4  0.6 ] ,   P2 = [ 0.1  0.2  0.7 ] .
     [ 0.1  0.9  0    0   ]          [ 0.3  0.3  0.4 ]
     [ 0.4  0.6  0    0   ]

Solution: Draw a state transition diagram for each; then the following
observations easily follow. The states in P1 communicate with each
other. However, if the process is in state 1 or 2 it will always move to
either state 3 or 4 in the next transition. Similarly, if the process is in
state 3 or 4 it will move back to state 1 or 2. All states in such a chain
are cyclic with period d = 2. This chain is not ergodic.
All states in P2 communicate with each other. The states are recurrent
and aperiodic. Therefore P2 is an ergodic chain.
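A quick numerical check of this conclusion in r: powers of an ergodic chain's transition matrix settle down to identical rows, whereas powers of the periodic chain P1 keep oscillating (a sketch; the object names are illustrative):

> P1 <- matrix(c(0, 0, 0.5, 0.5,
+     0, 0, 0.4, 0.6,
+     0.1, 0.9, 0, 0,
+     0.4, 0.6, 0, 0), nrow = 4, byrow = TRUE)
> P1.64 <- diag(4)
> for (n in 1:64) P1.64 <- P1.64 %*% P1   # P1^64
> round(P1.64, 3)          # even power: mass stays within {1,2} or {3,4}
> round(P1.64 %*% P1, 3)   # odd power: the two blocks swap, so P1^n has no limit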

8.6 Limiting state (steady state) probabilities

Theorem 8.5 Let P be the transition matrix of an s-state ergodic
Markov chain; then a vector π = [ π1  π2  . . .  πs ] exists such that

           [ π1  π2  · · ·  πs ]
lim  P n = [ π1  π2  · · ·  πs ]                           (8.3)
n→∞        [ ..  ..   . .   .. ]
           [ π1  π2  · · ·  πs ]

The common row vector π represents the limiting state probability distribution,
or the steady state probability distribution, that the process approaches
regardless of the initial state. When the above limit occurs, then following
any initial condition p(0) the probability vector after a large number n of
transitions is

p(n) = p(0)P n → π .

To show this last step, consider just the first element, p1 (n), of the probability
vector p(n). It is computed as p(0) times the first column of P n ; but
the first column of P n tends to [ π1  · · ·  π1 ]T , hence

p1 (n) → p(0) [ π1  · · ·  π1 ]T = π1 p(0) [ 1  · · ·  1 ]T = π1 ,

as the elements of p(0) have to sum to 1. Similarly for all the other
elements in p(n).

How do we find these limiting state probabilities π? For a given chain with
transition matrix P we have observed that, as the number of transitions n
increases,

p(n) → π .


But we know p(n + 1) = p(n)P and so taking the limit as n → ∞:

π = πP . (8.4)

The limiting steady state probabilities are therefore the solution of this
system of linear equations, together with the condition that the elements
of π sum to one:

Σ_{j=1}^{s} πj = 1 .                                       (8.5)

Unfortunately, with the above condition we have s + 1 linear equations in
s unknowns. To solve for the unknowns we may replace any one of the s
linear equations obtained from (8.4) with equation (8.5).

Example 8.8: To illustrate how to solve for the steady state probabilities,
consider the transition matrix

P = [ 0.7  0.2  0.1  0   ]
    [ 0.3  0.4  0.2  0.1 ]
    [ 0    0.3  0.4  0.3 ]
    [ 0    0    0.3  0.7 ] .

Solving π = πP , written out component by component, gives

π1 = 0.7π1 + 0.3π2 + 0π3 + 0π4 ,
π2 = 0.2π1 + 0.4π2 + 0.3π3 + 0π4 ,
π3 = 0.1π1 + 0.2π2 + 0.4π3 + 0.3π4 ,
π4 = 0π1 + 0.1π2 + 0.3π3 + 0.7π4 ,

together with

π1 + π2 + π3 + π4 = 1 .                                    (8.6)

Discarding any one of the first four equations and solving the remaining
equations we find the steady state probabilities:

π = [ 3/15  3/15  4/15  5/15 ] .

The steady state probabilities can also be found by first noting that π = πP
can be written as

π(I − P ) = 0 ,

where I is an identity matrix of appropriate size (and remembering
that the order of multiplication is important in matrix multiplication).
This equation is of the form xA = b. To turn it into the more familiar
form Ax = b, transpose both sides:

(I − P )T π T = 0

(since (AB)T = B T AT ). Now, this system has four equations, only
three of which are necessary. One row (say the last row) can be replaced
with the equation

[ 1  1  1  1 ] π T = 1

(that is, Equation (8.6)). This can all be done in r—albeit with some
effort.

> data <- c(0.7, 0.2, 0.1, 0, 0.3, 0.4,
+     0.2, 0.1, 0, 0.3, 0.4, 0.3, 0, 0,
+     0.3, 0.7)
> P <- matrix(data, nrow = 4, ncol = 4,
+ byrow = T)
> eye <- diag(4)
> tIP <- t(eye - P)
> tIP[4, ] <- c(1, 1, 1, 1)
> rhs <- matrix(c(0, 0, 0, 1), nrow = 4,
+ ncol = 1)
> steady.state <- solve(tIP, rhs)
> steady.state

[,1]
[1,] 0.2000000
[2,] 0.2000000
[3,] 0.2666667
[4,] 0.3333333

Of course, in r it can be easier just to raise the transition matrix to a
large power:

> P2 <- P %*% P
> P4 <- P %*% P %*% P %*% P
> P16 <- P4 %*% P4 %*% P4 %*% P4
> P64 <- P16 %*% P16 %*% P16 %*% P16
> P256 <- P64 %*% P64 %*% P64 %*% P64
> P256


[,1] [,2] [,3] [,4]


[1,] 0.2 0.2 0.2666667 0.3333333
[2,] 0.2 0.2 0.2666667 0.3333333
[3,] 0.2 0.2 0.2666667 0.3333333
[4,] 0.2 0.2 0.2666667 0.3333333

The answers are the same.
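A third approach notes that transposing π = πP shows the steady state is an eigenvector of P T corresponding to the eigenvalue 1. A sketch, reusing the P entered above (eigen sorts eigenvalues by decreasing modulus, so eigenvalue 1 appears first for a transition matrix):

> e <- eigen(t(P))
> v <- Re(e$vectors[, 1])   # eigenvector for the eigenvalue 1
> v/sum(v)                  # rescale so the probabilities sum to one
[1] 0.2000000 0.2000000 0.2666667 0.3333333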

8.6.1 Share of the market model

One application area of Markov chains is in brand switching or share
of the market models. Suppose NoFrill Airlines (nfa) is competing for
the market share of domestic passengers with the other two major carriers,
KangaRoo Airways (kra) and emu Airlines. The major airlines have
commissioned a survey to determine the likely impact of the newcomer on their
market share. The results of a random survey have revealed the following
information:

ˆ 40% of passengers currently fly with kra;

ˆ 50% of passengers currently fly with emu;

ˆ 10% of passengers currently fly with nfa.

The survey results also showed that:

ˆ 80% of the passengers who currently fly with kra will fly with kra
next time, 15% will switch to emu and the remaining 5% will switch
to nfa;
ˆ 90% of the passengers who currently fly with emu will fly with emu
next time, 6% will switch to kra and the remaining 4% will switch to
nfa;
ˆ 90% of the passengers who currently fly with nfa will fly with nfa
next time, 4% will switch to kra and the remaining 6% will switch to
emu Airlines.

The preference pattern of passengers is here modelled as a Markov chain
with the following transition matrix:

          kra   emu   nfa
    kra [ 0.80  0.15  0.05 ]
P = emu [ 0.06  0.90  0.04 ]
    nfa [ 0.04  0.06  0.90 ] .


We also have the initial market share

p(0) = [ 0.4  0.5  0.1 ] .

To determine the long term market share for each airline we find the steady
state probabilities of the transition matrix. Solving π = πP , and replacing
any one of the equations with π1 + π2 + π3 = 1, we get

π = [ 0.2077  0.4918  0.3005 ] .

Therefore, in the long term, the market share of kra would drop from an
initial 40% to 20.77%, the market share of emu would remain essentially
steady, and nfa would increase their market share from 10% to 30.05%.
Note that the future market share for each airline only depends on the
transition matrix and not on the initial market share. The management of
the kra could launch an advertising campaign to regain some of the 20% of
their customers who are switching to the other two airlines.
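The market share figures follow from the same r technique as in Example 8.8 (a sketch; the object names are illustrative):

> P.air <- matrix(c(0.80, 0.15, 0.05,
+     0.06, 0.90, 0.04,
+     0.04, 0.06, 0.90), nrow = 3, byrow = TRUE)
> tIP <- t(diag(3) - P.air)
> tIP[3, ] <- 1             # replace one equation by pi1 + pi2 + pi3 = 1
> solve(tIP, c(0, 0, 1))
[1] 0.2076503 0.4918033 0.3005464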

8.7 Exercises
Ex. 8.9: The SOI is a well known climatological indicator for eastern Australia.
Stone and Auliciems [42] developed SOI phases in which the average
monthly SOI is allocated to one of five phases corresponding to the
SOI falling rapidly (phase 1), staying consistently negative (phase 2),
staying consistently near zero (phase 3), staying consistently positive
(phase 4), and rising rapidly (phase 5).
The transition matrix, based on data collected from July 1877 to
February 2002 is
 
P = [ 0.668  0.000  0.081  0.154  0.101 ]
    [ 0.000  0.683  0.125  0.062  0.130 ]
    [ 0.354  0.000  0.063  0.370  0.212 ]
    [ 0.000  0.387  0.204  0.102  0.303 ]
    [ 0.036  0.026  0.132  0.276  0.529 ] .

(a) Draw a transition diagram for the SOI phases.


(b) Determine if the Markov chain is ergodic.
(c) Determine the steady state probabilities.

Ex. 8.10: Draw the state transition diagram for the Markov chain given by
 
P = [ 1/3  2/3 ]
    [ 1/4  3/4 ] .


Ex. 8.11: Draw the state transition diagram and hence determine if the
following Markov chain is ergodic. Also determine the recurrent, tran-
sient and absorbing states of the chain.
 
P = [ 0    0    1    0    0    0   ]
    [ 0    0    0    0    0    1   ]
    [ 0    0    0    0    1    0   ]
    [ 1/4  1/4  0    1/2  0    0   ]
    [ 1    0    0    0    0    0   ]
    [ 0    1/3  0    0    0    2/3 ]

Ex. 8.12: The daily rainfall in Melbourne has been recorded from 1981 to
1990. The data is contained in the file melbrain.dat, and is from
Hyndman [5] (and originally from the Australian Bureau of Meteorol-
ogy).
A large number of days recorded no rainfall at all. The following is the
transition matrix for the two states ‘Rain’ and ‘No rain’:

P = [ 0.721  0.279 ]
    [ 0.440  0.560 ] .

(a) Draw a transition diagram from the matrix P .


(b) Use r to determine the steady state probabilities of days with
rain in Melbourne.
(c) Determine the probability of having a wet day two days after a
fine day.

Ex. 8.13: The daily rainfall in Melbourne has been recorded from 1981 to
1990, and was used in the previous exercise. In that exercise, two states
(‘Rain’ (R) or ‘No rain’ (N)) were used. Then, the state yesterday was
used to deduce probabilities of the two states today. In this exercise,
four states are used, taking into account the weather for the previous
two days.
There are four states RR, RN, NR, NN; the left-most state occurs
earlier. (That is, RN means a rain-day followed by a day with no
rain). The following transition matrix shows the transition matrix for
the four states:
 
P = [ 0.564  0.436  0      0     ]
    [ 0      0      0.315  0.685 ]
    [ 0.555  0.445  0      0     ]
    [ 0      0      0.265  0.735 ] .

(a) Draw a transition diagram from the matrix P .


(b) Explain why eight entries in the transition matrix must be exactly
zero.
(c) Use r to determine the steady state probabilities of the four states
for the data.
(d) Determine the probability that two wet days will be followed by
two dry days.

Ex. 8.14: A computer laboratory has become notorious in service because
of computer breakdowns. Data collected on its status every 15 minutes
for about 12 hours (50 observations) is given below (1 indicates “system
up” and 0 indicates “system down”):

1110010011111110011110111
1111001111111110001101101

Assuming this process can be modelled as a Markov chain, estimate
from the data the probabilities of the system being “up” or “down” each
15 minutes given it was “up” or “down” in the previous period, draw
the state transition diagram and write down the transition matrix.

Ex. 8.15: Suppose that if it has rained for the past three days, then it will
rain today with probability 0.8; if it did not rain for any of the past
three days, then it will rain today with probability 0.2; and in any
other case the weather today will, with probability 0.6, be the same
as the weather yesterday. Determine the transition matrix for this
Markov chain.

Ex. 8.16: Let {Xn | n = 0, 1, 2, ...} be a Markov chain with state space
{1, 2, 3} and transition matrix
 1 1 1 
2 4 4
2 1
P = 3 0 3
.
3 2
5 5 0

Determine the following probabilities:

(a) being in state 3 two steps after being in state 2;


(b) Pr {X4 = 1 | X2 = 1} ;
 
(c) p(2), given that p(0) = [ 1  0  0 ] ;
(d) Pr {X2 = 3} given that Pr {X0 = 1} = Pr {X0 = 2} = Pr {X0 = 3} ;
(e) Pr {X2 = 3 | X1 = 2 & X0 = 1} ;
(f) Pr {X2 = 3 & X1 = 2 | X0 = 1} .


Ex. 8.17: Determine the limiting state probabilities for Markov chains with
the following transition matrices.

P1 = [ 0.5  0.5 ]    P2 = [ 0    1    0   ]    P3 = [ 0.2  0.4  0.4 ]
     [ 0.7  0.3 ]         [ 0    0    1   ]         [ 0.5  0.2  0.3 ]
                          [ 0.4  0    0.6 ]         [ 0.3  0.4  0.3 ]

Ex. 8.18: Two white and two black balls are distributed in two urns in
such a way that each contains two balls. We say that the system is in
state i, i = 0, 1, 2, if the first urn contains i white balls. At each step,
we randomly select one ball from each urn and place the ball drawn
from the first urn into the second, and conversely with the ball from
the second urn. Let Xn denote the state of the system after the nth step.
Assuming that the process can be modelled as a Markov chain, draw
the state transition diagram and determine the transition matrix.

Ex. 8.19: A company has two machines. During any day each machine that
is working at the beginning of the day has a 1/3 chance of breaking
down. If a machine breaks down during the day, it is sent to repair
facility and will be working 3 days after it breaks down. (i.e. if a
machine breaks down during day 3, it will be working at the beginning
of day 6). Letting the state of the system be the number of machines
working at the beginning of the day, draw a state transition diagram
and formulate a transition probability matrix for this situation.

Ex. 8.20: The State Water Authority plans to build a reservoir for flood
mitigation and irrigation purposes on the Macintyre river. The pro-
posed maximum capacity of the reservoir is 4 million cubic metres.
The weekly flow of the river can be approximated by the following
discrete probability distribution:

weekly inflow (10^6 m^3)     2     3     4     5
probability                 0.3   0.4   0.2   0.1

Irrigation demand is 2 million cubic metres per week. Environmental
demand is 1 million cubic metres per week. Minimum storage require-
ment is 1 million cubic metres. Any demand shortage is at the expense
of irrigation. Excess inflow would be released over the spillway. As-
sume that the irrigation water may be supplied after the inflow arrives.
Before proceeding with the construction, the Water Authority wishes
to have some idea of the behaviour of the reservoir.

(a) Model the system as a Markov chain and determine the steady
state probabilities. State any assumptions you make.


(b) Explain the steady state probabilities in the context of this ques-
tion.

Ex. 8.21: Past records indicate that the survival function for light bulbs of
traffic lights has the following pattern:

Age of bulbs in months (n)        0     1     2     3
Number surviving to age n       100    85    60     0

(a) If each light bulb is replaced after failure, draw a state transition
diagram and find the transition matrix associated with this pro-
cess. Assume that a replacement during the month is equivalent
to a replacement at the end of the month.
(b) Determine the steady state probabilities.
(c) If an intersection has 20 bulbs, how many bulbs fail on average
per month?
(d) If an individual replacement has a cost of $15, what is the long-
run average cost per month ?

Ex. 8.22: A machine in continuous service requires frequent adjustment to
ensure quality output. If it gets out of adjustment, on average $600 of
defective parts are made before it can be corrected. Adjustment costs
$200 in labour and downtime. Data collected on the operation of the
machine is summarised below:

Time since adjustment (hours)    Probability of defective production
            1                                  0.00
            2                                  0.20
            3                                  0.50
            4                                  1.00

In answering the following questions make suitable assumptions where
appropriate.

(a) If the machine is adjusted only when defective production occurs,
find the transition matrix associated with this process.
(b) Determine the steady state probabilities. What is the long run
mean hourly cost of this policy?
(c) Suppose a policy of readjustment when needed or after three
hours of running time (whichever comes first) is introduced. What
is the long run mean cost of this policy?

Ex. 8.23: The following exercises involve computer work in R.


(a) The weather can be classified as Sunny (S) or Cloudy (C). Con-
sider the previous two days classified in this manner; then there
are four states: SS, SC, CS and CC. The transition matrix, en-
tered in R, is
pp <- matrix( nrow=4, ncol=4, byrow=TRUE,
data=c(.9, .1, 0, 0,
0, 0, .4, .6,
.7, .3, 0, 0,
0, 0, .2, .8) )
(You can read ?matrix for assistance.) Verify this could be a
valid transition matrix by computing the row sums with rowSums(pp)
and checking they are all one. (See ?rowSums.)
Suppose today is the second of two sunny days in a row, state SS;
that is, p0 = [ 1  0  0  0 ]. Enter this state into r by typing
pie <- c(1,0,0,0), then compute the probabilities of being in
various states tomorrow as pie <- pie %*% pp. (See ?"%*%"
for help here (quotes necessary). Note that the operator * does an
element-by-element multiplication; the command %*% is used for
matrix multiplication in r.) Why is Pr {cloudy tomorrow} = 0.1?
Evaluate pie <- pie %*% pp again to compute the probabilities
for two days time. Why is Pr {cloudy in 2 days} = 0.15?
What is the probability of being sunny in 3 days time?
(b) Keep applying pie <- pie %*% pp iteratively and see that the
predicted probabilities recognisably converge in about 10–20 days
to π = [ 0.58  0.08  0.08  0.25 ]. These are the long-term
probabilities of the various states.
Compute P 10 , P 20 and P 30 and see that the rows of powers of
the transition matrix also converge to the same probabilities.
(c) So far we have only addressed patterns of probabilities. Some-
times we run simulations to see how the Markov chain may ac-
tually evolve. That is, we need to generate a sequence of states
according to the probabilities of transitions. For this weather ex-
ample, if we start in zeroth state SS we need to generate for the
first state either SS with probability 0.9 or SC with probability
0.1. Suppose it was SC, then for the second state we need gener-
ate either CS with probability 0.4 and CC with probability 0.6.
How is this done?
Sampling from general probability distributions is done using the
cumulative probability distribution (cdf) and a uniform random
number. For example, if we are in state SS, i = 1, then the cdf for
the choice of next state is [ 0.9  1  1  1 ], obtained in r by
cumsum( pp[1,] ). Thus in general the next state j is found from
the current state i by, for example,


j <- sample( c(1,2,3,4), prob=pp[i,], size=1)


Run a simulation by wrapping this in a loop such as
i<- 1 # Initial state
for (t in (1:99)) {
i <- sample( c(1,2,3,4), prob=pp[ i,] , size=1)
cat(i) # prints the value of i
}
cat("\n") # Ends the line
Now save the history of states by executing
i <- array( dim=200) # Set up an array
i[1] <- 1 # Initial state
for (t in (1:199)) {
i[t+1] <- sample( c(1,2,3,4),
prob=pp[ i[t],] , size=1)
}
Use hist(i, freq=FALSE, breaks=c(0,1,2,3,4) ) to draw a
histogram and verify the long-term histogram is reasonably close
to that predicted by theory. (The proportions are found by
table(i)/sum(table(i)); compare to pie.)

Ex. 8.24: Packets of information sent via modems down a noisy telephone
line often fail. For example, suppose in 31 attempts we find that
packets are sent successfully or fail with the following pattern, where
1 denotes success and 0 failure:

1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1 .

There seem to be runs of success and runs of failure, so we guess that
these may not be independent of each other (failures in communication
networks are indeed generally correlated). Thus we try to
model the series as a Markov chain.
Suppose the probability of success of the next packet only depends
upon chance and the success or failure of the current attempt. Argue
from the data that the transition matrix should be
 
P ≈ [ 1/2  1/2 ]
    [ 1/3  2/3 ] .

Given this model, what is the long-term probability of success for each
packet?
Suppose the probability of success of packet transmission depends
upon chance and the success or failure of the previous two attempts.


Write down and interpret the four states of this Markov chain model.
Use the data to estimate the transition probabilities, then form them
into a 4 × 4 transition matrix P . Compute in r a high power
of P to see that the long-term distribution of states is approximately
π = [ 0.4  0.2  0.2  0.2 ], and hence deduce this model would predict a
slightly higher overall success rate.

Ex. 8.25: Let a Markov chain with the state space S = {0, 1, 2} be such
that:

ˆ from state 0 the particle jumps to states 1 or 2 with equal probability 1/2;
ˆ from state 2 the particle must next jump to state 1;
ˆ state 1 is absorbing (that is, once the particle enters state 1, it
cannot leave).

Draw the transition diagram and write down the transition matrix.

Ex. 8.26: For a Markov chain with the transition matrix


 
P = [ 0    0.1  0.9 ]
    [ 0.8  0    0.2 ]
    [ 0.7  0.3  0   ] ,

draw the transition diagram and find the probability that the particle
will be in state 1 after three jumps given it started in state 1.

Ex. 8.27: (sampling problem) Let X be a Markov Chain. Show that the
sequence Yn = X2n , n ≥ 0 is a Markov chain (such chains are called
imbedded in X).

Ex. 8.28: (lumping states together) Let X be a Markov chain. Show that
Yn = |Xn | , n ≥ 0 is not necessarily a Markov chain.

Ex. 8.29: Classify the states of the following Markov chains and determine
whether they are absorbing, transient or recurrent:
 
P1 = [ 0    1/2  1/2 ]
     [ 1/2  0    1/2 ]
     [ 1/2  1/2  0   ] ;

P2 = [ 0    0    0    1 ]
     [ 0    0    0    1 ]
     [ 1/2  1/2  0    0 ]
     [ 0    0    1    0 ] ;

P3 = [ 1/2  0    1/2  0    0   ]
     [ 1/4  1/2  1/4  0    0   ]
     [ 1/2  0    1/2  0    0   ]
     [ 0    0    0    1/2  1/2 ]
     [ 0    0    0    1/2  1/2 ] ;

P4 = [ 1/4  3/4  0    0    0 ]
     [ 1/2  1/2  0    0    0 ]
     [ 0    0    1    0    0 ]
     [ 0    0    1/3  2/3  0 ]
     [ 1    0    0    0    0 ] .
Ex. 8.30: Classify the states of the Markov chains with the following tran-
sition probability matrices:
 
P1 = [ 0    1/2  1/2 ]
     [ 1/2  0    1/2 ]
     [ 1/2  1/2  0   ] ;

P2 = [ 0  0  1/2  1/2 ]
     [ 1  0  0    0   ]
     [ 0  1  0    0   ]
     [ 0  1  0    0   ] ;

P3 = [ 1/2  1/2  0    0    0   ]
     [ 1/2  1/2  0    0    0   ]
     [ 0    0    1/2  1/2  0   ]
     [ 0    0    1/2  1/2  0   ]
     [ 1/4  1/4  0    0    1/2 ] .
Ex. 8.31: Consider the Markov chain consisting of the four states and hav-
ing transition probability matrix
 
P = [ 0  0  1/2  1/2 ]
    [ 1  0  0    0   ]
    [ 0  1  0    0   ]
    [ 0  1  0    0   ] .
Which states are recurrent?
Ex. 8.32: Let a Markov chain be defined by the matrix
 
P = [ 1/2  1/2  0    0    0   ]
    [ 1/2  1/2  0    0    0   ]
    [ 0    0    1/2  1/2  0   ]
    [ 0    0    1/2  1/2  0   ]
    [ 1/4  1/4  0    0    1/2 ] .
What can you say about its decomposability into disjoint Markov
chains and the transient and recurrent nature of its states?


Ex. 8.33: (A Communications system) Consider a communications system
which transmits the digits 0 and 1. Each digit transmitted must pass
through several stages, at each of which there is a probability p that
the digit entered will be unchanged when it leaves. Letting Xn denote
the digit entering the nth stage, define its transmission probability
matrix. Show by induction that

P n = [ 1/2 + (1/2)(2p − 1)^n    1/2 − (1/2)(2p − 1)^n ]
      [ 1/2 − (1/2)(2p − 1)^n    1/2 + (1/2)(2p − 1)^n ] .

Ex. 8.34: Suppose that coin 1 has probability 0.7 of coming up heads, and
coin 2 has probability 0.6 of coming up heads. If the coin flipped
today comes up heads, then we select coin 1 to flip tomorrow, and if
it comes up tails, then we select coin 2 to flip tomorrow. If the coin
initially flipped is equally likely to be coin 1 or coin 2, then what is
the probability that the coin flipped on the third day after the initial
flip is coin 1?

Ex. 8.35: For a series of dependent trials the probability of success on any
trial is (k + 1)/(k + 2) where k is equal to the number of successes on
the previous two trials. Compute

lim (n → ∞) Pr {success on the nth trial} .

Ex. 8.36: An organisation has N employees, where N is a large number.
Each employee has one of three possible job classifications and changes
classifications (independently) according to a Markov chain with
transition probabilities

P = [ 0.7  0.2  0.1 ]
    [ 0.2  0.6  0.2 ]
    [ 0.1  0.4  0.5 ] .

What percentage of employees are in each classification in the long run?

8.7.1 Answers to selected Exercises

8.9 The chain is ergodic, and the steady state probabilities are (to three
decimal places)
[0.165, 0.247, 0.126, 0.183, 0.278].

8.11 States 1, 2, 3, 5 and 6 are recurrent. State 4 is transient. S1 = {1, 3, 5}
and S2 = {2, 6} are two closed sets. Since states 4 and 1 do not
communicate, the chain is not ergodic.


8.14 With state 0 = “system down” and state 1 = “system up”:

P = [ 6/14   8/14  ]
    [ 8/35  27/35 ]

8.15 The process may be modelled as an 8 state Markov chain with states
{[111], [112], [121], [122], [211], [212], [221], [222]}, where 1 indicates no
rain, 2 indicates rain, and a triple [abc] indicates the weather was a the
day before yesterday, b yesterday and c today.

        [111]  [112]  [121]  [122]  [211]  [212]  [221]  [222]
[111] [ 0.8    0.2    0      0      0      0      0      0   ]
[112] [ 0      0      0.4    0.6    0      0      0      0   ]
[121] [ 0      0      0      0      0.6    0.4    0      0   ]
[122] [ 0      0      0      0      0      0      0.4    0.6 ]
[211] [ 0.6    0.4    0      0      0      0      0      0   ]
[212] [ 0      0      0.4    0.6    0      0      0      0   ]
[221] [ 0      0      0      0      0.6    0.4    0      0   ]
[222] [ 0      0      0      0      0      0      0.2    0.8 ]

8.16
P 2 = [ 17/30  9/40  5/24  ]
      [ 16/30  9/30  1/6   ]
      [ 17/30  3/20  17/60 ] .

(a) [P 2 ]23 = 1/6
(b) Pr {X4 = 1 | X2 = 1} = 17/30
(c) p(2) = p(0)P 2 = [ 17/30  9/40  5/24 ]
(d) p(0) = [ 1/3  1/3  1/3 ], therefore Pr {X2 = 3} = [p(0)P 2 ]3 = 79/360
(e) Pr {X2 = 3 | X1 = 2 & X0 = 1} = Pr {X2 = 3 | X1 = 2} = 1/3
(f) Pr {X2 = 3 & X1 = 2 | X0 = 1} = Pr {X2 = 3 | X1 = 2} × Pr {X1 = 2 | X0 = 1} = 1/3 × 1/4 = 1/12
8.17 (a) π = [ 7/12  5/12 ]
(b) π = [ 2/9  2/9  5/9 ]
(c) π = [ 1/3  1/3  1/3 ]
 

8.18
P = [ 0    1    0   ]
    [ 1/4  1/2  1/4 ]
    [ 0    1    0   ]

8.19 The process may be modelled as a 6 state Markov chain with states
∈ {[200], [101], [110], [020], [011], [002]}.
The three numbers in the label for each state describe the number
of machines currently working, in repair for 1 day, and in repair for 2
days. For example, the state [020] means no machines are currently
working and both machines broke down yesterday, so both would
be available again the day after tomorrow. If we are currently in
state [020] then after one transition (day) the process will move to
state [002]. Following this process we find the transition matrix

          [200]  [101]  [110]  [020]  [011]  [002]
[200] [   4/9    0      4/9    1/9    0      0   ]
[101] [   2/3    0      1/3    0      0      0   ]
[110] [   0      2/3    0      0      1/3    0   ]
[020] [   0      0      0      0      0      1   ]
[011] [   0      1      0      0      0      0   ]
[002] [   1      0      0      0      0      0   ]

8.20 (a) The states are the volume of water in the reservoir, which although
continuous is assumed to take discrete values ∈ {1, 2, 3, 4}.
Hence the transition matrix is

P = [ 0.7  0.2  0.1  0.0 ]
    [ 0.3  0.4  0.2  0.1 ]
    [ 0.0  0.3  0.4  0.3 ]
    [ 0.0  0.0  0.3  0.7 ]

The steady state probabilities may be computed by solving π =
πP , where the elements of π sum to one, to give

π = [ 0.2  0.2  0.2667  0.3333 ] .

(b) The steady state probabilities represent the long term average
probability of finding the reservoir in each state. For example,
in the long run we expect the reservoir will start, or end, with
a volume of 1 million m^3 20% of the time and a volume of 4
million m^3 33.3% of the time.

8.21 (a) The states are the age of lights in months ∈ {0, 1, 2}; the
transition matrix associated with this process is

P = [ 0.15  0.85  0.0  ]
    [ 0.29  0.0   0.71 ]
    [ 1.0   0.0   0.0  ] .

(b) π = [ 0.407  0.346  0.246 ]
(c) Average number of failures per month = 0.4076 × 20 = 8.15 units
(d) Long term average cost per month = $15 × 8.15 = $122.28


8.22 (a) Let Xn = elapsed time in hours (at time n) since adjustment,
∈ {0, 1, 2, 3}. Assume that adjustments occur on the hour only
and that the time taken to service the machine is negligible.
(Alternative sets of assumptions are possible.) The transition
probabilities can be found by converting the given table as follows. If
the machine is adjusted 100 times, the number of these adjustments
which we expect to “survive” is given by

Time since adjustment (hours)    Number surviving
            0                          100
            1                          100
            2                           80
            3                           50
            4                            0
We then have

P01 = Pr {Xn+1 = 1 | Xn = 0} = 100/100 = 1 ,
P12 = 80/100 = 0.8 ,
P23 = 50/80 = 0.625 .

Since none survive to “age” 4, a state 4 is not needed. Hence the
required transition matrix is

P = [ 0      1  0    0     ]
    [ 0.2    0  0.8  0     ]
    [ 0.375  0  0    0.625 ]
    [ 1      0  0    0     ] .

(b)
π = [ 10/33  10/33  8/33  5/33 ] .

In the long run “breakdowns” occur in a proportion π0 of the
hours and each breakdown costs $800. Therefore mean cost per
hour = 10/33 × 800 = $242.42 .

(c) Now Xn ∈ {0, 1, 2} and

P = [ 0    1  0   ]
    [ 0.2  0  0.8 ] ,
    [ 1    0  0   ]

with steady state distribution

π = [ 5/14  5/14  4/14 ] .


In the long run, the proportion of hours in which a breakdown
occurs is

5/14 × 0 + 5/14 × 0.2 + 4/14 × 0.375 = 5/28 ,

and each breakdown costs $800. The proportion of time that
readjustment occurs without a breakdown is

Pr {reaching age 3 and no breakdown occurs} = 4/14 × 0.625 = 5/28 ,

and each readjustment alone costs $200. Hence, the long term
cost per hour of this policy is

5/28 × 800 + 5/28 × 200 = $178.57 .

8.34 Model this as a Markov chain with two states: C1 means that coin 1 is
to be tossed; C2 means that coin 2 is to be tossed. From the question,
a head (probability 0.7) keeps the chain at C1 and a tail (probability
0.3) moves it to C2; from C2, a head (probability 0.6) moves the chain
to C1 and a tail (probability 0.4) keeps it at C2. From this the
transition matrix is read off to be

P = [ 0.7  0.3 ]
    [ 0.6  0.4 ]

whence, starting from the state p(0) = [ 0.5  0.5 ], the prediction for
the states after three tosses is

p(3) = p(0)P 3 = [ 0.6665  0.3335 ] .

Thus the probability of tossing coin 1 on the third day is 0.6665 .



Module 9

Other Models
Module contents
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 206
9.2 Using other models . . . . . . . . . . . . . . . . . . . . . 206
9.3 Seasonally adjusted models . . . . . . . . . . . . . . . . 206
9.4 Regime-dependent autoregressive models . . . . . . . . 207
9.5 Neural networks . . . . . . . . . . . . . . . . . . . . . . . 207
9.6 Trend-regression models . . . . . . . . . . . . . . . . . . 208
9.7 Multivariate time series models . . . . . . . . . . . . . . 208
9.8 Forecasting by means of a model . . . . . . . . . . . . . 211
9.9 Finding similar past patterns . . . . . . . . . . . . . . . 211
9.10 Singular spectrum analysis . . . . . . . . . . . . . . . . . 212

Module objectives

Upon completion of this module students should be able to:

ˆ understand there are numerous other types of models for modelling
time series;
ˆ name some other time series models used in climatology;
ˆ explain one of the methods in more detail.

205
206 Module 9. Other Models

9.1 Introduction

In this part of the course, one particular type of time series methodology
has been discussed: the Box–Jenkins models, or arma type models. There
are a large number of other possible models for time series however. In this
Module, some of these models are briefly discussed. You are required to
know the details of only one of these models in particular, but should at
least know the names and ideas behind the others.
You don’t need to understand all the details in this Module; but see Assign-
ment 3.

9.2 Using other models

The time series models previously discussed—arima models and Markov
chain models—are reasonably simple. There are, however, many far more
complicated models that have not been studied. In an attempt to compare
numerous types of forecasting methods, Spyros Makridakis and Michèle Hibon
conducted the M3-Competition (following the M- and M2-Competitions),
which compared 24 time series methods (including the Box–Jenkins
approach adopted here plus many more complicated models) on 3003 different
time series. This was one of the conclusions from the competition:

Statistically sophisticated or complex methods do not necessarily
produce more accurate forecasts than simpler ones.

In particular, the method that unofficially won the competition was the
theta-method (see Assimakopoulos and Nikolopoulos [7]) which was shown
later (see Hyndman and Billah [22]) to be simply exponential smoothing
with a drift (or trend) component. Exponential smoothing was listed in
Section 1.4 as a simple method.
The lesson is clear: methods that appear clever, complicated or technical
do not necessarily forecast better; simple methods are often the best.
However, all methods have situations in which they perform well, and there
are other methods worthy of consideration. Some of those are considered here.

9.3 Seasonally adjusted models

Activity 9.A: Read Chu & Katz [13] in the selected read-
ings.


Table 9.1: The parameter estimates for an ar(3) model with seasonally
varying parameters for modelling the seasonal SOI. Note the seasons refer
to northern hemisphere seasons.

                            SOI predictand
Parameter      Spring     Summer     Fall       Winter
estimates      (t = 1)    (t = 2)    (t = 3)    (t = 4)
φ̂1 (t)         0.5268     0.7832     0.8554     0.7736
φ̂2 (t)         0.1158     0.2568     0.1700     0.1971
φ̂3 (t)        −0.2011    −0.3816    −0.0674    −0.1808

Chu & Katz [13] discuss fitting arma type models to the seasonal and
monthly SOI using an arma model whose coefficients change according to
the season. They fit a seasonally varying ar(3) model to the seasonal SOI,
{Xt }, of the form

Xt = φ1 (t)Xt−1 + φ2 (t)Xt−2 + φ3 (t)Xt−3 + et ,

with the parameters as shown in Table 9.1.
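To get a feel for what such a model behaves like, it can be simulated in r using the Table 9.1 estimates (a minimal sketch: the unit innovation variance is an assumption, as Chu & Katz's estimate is not reproduced here):

> phi <- matrix(c(0.5268, 0.7832, 0.8554, 0.7736,
+     0.1158, 0.2568, 0.1700, 0.1971,
+     -0.2011, -0.3816, -0.0674, -0.1808),
+     nrow = 3, byrow = TRUE)   # row i holds phi_i(t) for seasons t = 1,...,4
> x <- rep(0, 400)
> for (t in 4:400) {
+     s <- (t - 1)%%4 + 1       # season corresponding to time t
+     x[t] <- sum(phi[, s] * x[(t - 1):(t - 3)]) + rnorm(1)
+ }
> plot(ts(x, frequency = 4), ylab = "Simulated seasonal SOI")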

9.4 Regime-dependent autoregressive models

Activity 9.B: Read Zwiers & von Storch [53] in the selected
readings.

Zwiers & von Storch [53] fit a regime-dependent ar model (ram) to the SOI
described by a stochastic differential equation. (These models are also called
Threshold Autoregressive Models by other authors, such as Tong [44].) In
essence, the SOI is modelled using one of two indicators of the SOI (ei-
ther the South Pacific Convergence Zone hypothesis, or the Indian Monsoon
hypothesis, as explained in the article), and a seasonal indicator.

9.5 Neural networks

Neural networks consist of processing elements (called nodes) joined by
weighted connections. Each processing element takes as its input the
weighted sum of the outputs of the nodes connected to it. This input is
transformed (linearly or non-linearly) to produce the element's output,
which can be passed on to other processing elements. Neural networks are
said to be loosely based on the operation of the human brain (!).

Maier & Dandy [31] fit neural networks to daily salinity data at Murray
Bridge, South Australia, as well as numerous Box–Jenkins models. They
conclude the Box–Jenkins models produce better one-day ahead forecasts,
while the neural networks produce better long term forecasts.

Guiot & Tessier [18] use neural networks and ar(3) models to detect the
effects of pollution of the widths of tree rings, and hence tree growth, from
1900 to 1983.

9.6 Trend-regression models

Visser & Molenaar [47] discuss a trend-regression model for modelling a time
series {Yt } in the presence of k other variables {Xi,t } for i = 1, . . . , k. These
models are of the form

Yt = µt + δ1,t X1,t + · · · + δk,t Xk,t + et

where the stochastic trend µt is described using an arima(p, d, q) process,
and et is the noise term. These models are written as TR(p, d, q, k) models,
where p, d and q are the usual parameters for an arima(p, d, q) model, and
k is the number of explanatory variables. The authors state the trend-regression
models ‘include most trend and regression models used in the
literature’.

One particular model they fit is for modelling annual mean surface air
temperatures in the northern hemisphere from 1851 to 1990, {Tt }. They fit
a TR(0, 2, 0, 2) model using the Southern Oscillation Index (SOI) and the
index of volcanic dust (VDI) on the northern hemisphere as covariates. The
fitted model is

    Tt = µt − 0.050 SOIt − 0.086 VDIt + et

where the trend µt is described using an arima(0, 2, 0) model (parameters


not given).
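A model of this general form can be approximated in r using arima with
external regressors. The sketch below uses simulated placeholder data (in
practice Temp, SOI and VDI would hold the observed annual series); it also
treats the δ coefficients as constant, so it is only an approximation to the
TR model:

> # Sketch: approximate a TR(0,2,0,2) model by an arima(0,2,0)
> # with SOI and VDI as external regressors; data are placeholders.
> set.seed(1)
> n <- 140                               # e.g. the years 1851 to 1990
> SOI <- rnorm(n)
> VDI <- rnorm(n)
> Temp <- cumsum(cumsum(rnorm(n, sd = 0.05))) -
+     0.050 * SOI - 0.086 * VDI          # an I(2) trend plus covariates
> fit <- arima(Temp, order = c(0, 2, 0), xreg = cbind(SOI, VDI))
> coef(fit)                              # estimates of the two deltas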

9.7 Multivariate time series models

In this course, only univariate time series have been discussed. It is possible,
however, for two time series to be related to each other. In this case, there
is a multivariate time series.



Figure 9.1: Two time series that might be expected to vary together: Top:
the SOI; Bottom: the sea level air pressure anomaly at Easter Island.

Example 9.1: The SOI and the sea level air pressure anomaly at Easter
Island might be expected to vary together, since the SOI is related to
pressure anomalies at Darwin and Tahiti. The two are plotted together
in Figure 9.1.
In a similar way as the autocovariance was measured, the cross covariance
can be defined as:

    γXY (k) = E[(Xt − µX )(Yt+k − µY )],

where µX is the mean of the time series {Xt } and µY is the mean of
the time series {Yt }, and k is again the lag; scaling by the two standard
deviations gives the cross correlation. The cross correlation can
be computed for various k. For this example, the plot of the cross
correlation is shown in Figure 9.2.
The cross correlation indicates there is a significant correlation between



Figure 9.2: The cross correlation between the SOI and the sea level air
pressure anomaly at Easter Island.


the two series near a lag of zero. That is, when the SOI goes up, there
is a strong chance the sea level air pressure anomaly at Easter Island
will also go up at the same time.
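A plot like Figure 9.2 can be produced with the function ccf. The
sketch below uses simulated placeholder series; in practice soi and
slpa would be ts objects holding the two observed series:

> # Sketch: cross correlation of two related monthly series.
> soi <- ts(rnorm(480), start = c(1958, 1), frequency = 12)
> slpa <- ts(0.4 * soi + rnorm(480, sd = 0.8),
+     start = c(1958, 1), frequency = 12)
> ccf(soi, slpa, main = "SOI & slpa")    # peaks near lag zero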

9.8 Forecasting by means of a model

Forecasting by means of a model is common in meteorology and astron-


omy. The weather is routinely forecast by special groups in all developed
countries. They use data from satellites and terrestrial weather stations as
input to a fluid dynamical model of the earth’s atmosphere. The model is
simply projected forward by a type of numerical integration to produce the
forecasts.

9.9 Finding similar past patterns

Suppose we have a time series {Xt }t≥0 and we wish to be able to forecast
future values. We wish to identify an estimator of the next value of the time
series, say X̂t+1|t . One way of doing this is to search through the history
of the time series and find a time when the past m values of the time series
have approximately occurred before.

For example, suppose we wish to forecast tomorrow’s maximum daily tem-


perature at Hervey Bay and wish to use the past five days maximum temper-
atures to make this forecast. The strategy is to search through the available
history of the maximum temperatures at Hervey Bay and find a time is
the past when five maximum temperatures have been very similar to the
maximum temperatures over the last five days. Whatever the next day’s
maximum temperature was in the past will be the prediction for tomorrow.

How do we determine which five past values are “like” the pattern we are
currently observing?

Call the current m values vector x. For any past series of m values, say
vector y, one measure of the distance between these two vectors is defined
by

    d(x, y) = √( (x1 − y1)² + · · · + (xm − ym)² ),

where xk is the kth element of the vector x. (This is the most common way
to define distance between vectors.)


The choice of m needs to be made carefully after consideration of the time


series in question. The idea here is simply this: we are trying to find all the
times in the past when things were similar to now.
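The following is a minimal sketch of how this analogue search could be
coded in r; the function name and the placeholder series are our own
illustration:

> # Sketch: find the past window of length m most like the current
> # one, and use the value that followed it as the forecast.
> analogue.forecast <- function(x, m = 5) {
+     n <- length(x)
+     current <- x[(n - m + 1):n]      # the last m observed values
+     best.d <- Inf
+     best.end <- NA
+     for (i in m:(n - 1)) {           # candidate windows ending at i
+         # (for simplicity, windows overlapping the present are allowed)
+         d <- sqrt(sum((x[(i - m + 1):i] - current)^2))
+         if (d < best.d) { best.d <- d; best.end <- i }
+     }
+     x[best.end + 1]                  # what followed the best match
+ }
> temps <- 25 + arima.sim(list(ar = 0.7), 1000)  # placeholder series
> analogue.forecast(temps, m = 5)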

9.10 Singular spectrum analysis

Singular spectrum analysis is a method which attempts to identify naturally


occurring patterns in a time series. Keeping only the most important
patterns, and removing the others (which are regarded as noise), potentially
leaves a series which represents the underlying dynamics and is also easier
to forecast.

Strand II

Multivariate Statistics
Module 10

Introduction
Module contents
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 216
10.2 Multivariate data . . . . . . . . . . . . . . . . . . . . . . 216
10.3 Preview of methods . . . . . . . . . . . . . . . . . . . . . 217
10.4 Review of mathematical concepts . . . . . . . . . . . . . 217
10.5 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.6 Displaying multivariate data . . . . . . . . . . . . . . . . 217
10.7 Some hypothesis tests . . . . . . . . . . . . . . . . . . . . 221
10.8 Further comments . . . . . . . . . . . . . . . . . . . . . . 221
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.9.1 Answers to selected Exercises . . . . . . . . . . . . . . . 223

Module objectives

Upon completion of this module students should be able to:

ˆ recognize multivariate data;


ˆ give some examples of multivariate data;
ˆ list some types of multivariate statistical methods;
ˆ appropriately display multivariate data.


10.1 Introduction

In this Module, some basic multivariate statistical techniques are introduced.


The emphasis is on the application rather than the details and the theory;
there is insufficient time to delve too far into the theory.

This Module is based on the textbook Multivariate Statistical Methods by


Bryan F. J. Manly. This book includes numerous examples using real data
sets, although most examples do not have a climatological flavour. Some
examples with such a flavour are given in these notes.

There are numerous books available about multivariate statistics, and many
are available from the USQ library. You may find other books useful to refer
to during your study of the multivariate analysis component of this course.

As a general comment, you will be expected to read the textbook to
understand this Module. The Study Book will supplement the textbook where
necessary, provide extra examples, and make notes about using the r software
for performing the analyses.

10.2 Multivariate data

Activity 10.A: Read Manly, section 1.1.

Multivariate analysis is popular in many areas of science, engineering and


business; the examples give some flavour of typical problems. Climatology is
filled with examples of multivariate data. There are numerous climatological
variables measured on a routine basis which can collectively be considered
multivariate data.

One of the most common sources of multivariate data are the Sea Sur-
face Temperatures (SST). SSTs are measurements of the temperature of the
oceans, measured at locations all around the world.

In addition, multivariate data can be created from any univariate series,
since climatological variables are often time-dependent. The original data,
with say n observations, can be designated as X1 . The series can then be
shifted back t time steps to create a new variable X2 . Both variables can be
adjusted to have a length of n − t, after which the variables can be
identified as X1′ and X2′ . The two variables (X1′ , X2′ ) can be considered
multivariate data.


10.3 Preview of methods

Activity 10.B: Read Manly, section 1.2.

This section introduces some different types of multivariate methods. Not


all the methods will be discussed in this course, but it is useful to know the
types of methods available.

10.4 Review of mathematical concepts

Activity 10.C: Briefly read Manly, Chapter 2.

This Chapter contains material that should be revision for the most part.
You may find it useful to refer back to Chapter 2 throughout this course. Pay
particular attention to sections 2.5 to 2.7 as many multivariate techniques
use these concepts.

10.5 Software

The software package r will be used for this Part, as with the time series
component. See Sect. 1.5.1 for more details. Most statistical programs will
have multivariate analysis capabilities.

For this part of the course, the r multivariate analysis library is needed; this
should be part of the package that you install by default. To enable this
package to be available to r, type library(mva) at the r prompt when r
is started. For an idea of what functions are available in this library, type
library(help=mva) at the r prompt.

10.6 Displaying multivariate data

With multivariate data, any plots will be of a multi-dimensional nature, and


will therefore be difficult to display on a two-dimensional page. Plotting
data is, of course, always useful for understanding the data and detecting
possible problems in the data (outliers, errors, missing values, and so on).
Some creative solutions have been developed for plotting multivariate data.


Activity 10.D: Read Manly, Chapter 3. We will not discuss


Andrew’s method.

Many of the plots discussed are available in the package S-Plus, a commercial
package not unlike r. In the free software, r, however, some of these plots
are not available (in particular, Chernoff faces). The general consensus is
that it would be a lot of work for a graphic that isn’t that useful. One
particular problem with Chernoff faces is that the faces (and interpretations)
can change dramatically depending on what variables are allocated to which
dimensions of the face.

However, the star plot is available using the function stars. The “Drafts-
man’s display” is available just by plotting a multivariate dataset; see the
following example.

The profile plots are useful, but only when there are not too many variables
or too many groups, otherwise the plots become too cluttered to be of any
use.

Example 10.1: Hand et al. [19, dataset 26] give a number of measurements
of air pollution from 41 cities in the USA. The data consist of
seven variables (plus the names of the cities), generally means from
1969 to 1971:

ˆ SO2: The SO2 content in micrograms per cubic metre;


ˆ temp: The average annual temperature in degrees F;
ˆ manufac: The number of manufacturing enterprises employing
20 or more workers;
ˆ population: The population in thousands, in the 1970 census;
ˆ wind.speed: The average annual wind speed in miles per hours;
ˆ annual.precip: The average annual precipitation in inches;
ˆ days.precip: The average number of days with precipitation
each year.

The following code shows how to plot this multi-dimensional data in


r. First, a Draftsman’s display (this is an unusual term; it is often
called a pairwise scatterplot):

> library(mva)
> us <- read.table("usair.dat", header = TRUE)
> plot(us[, 1:4])
> pairs(us[, 1:4])

Figure 10.1: A multivariate plot of the US pollution dataset.

The plot is shown in Fig. 10.1. Star plots can also be produced:

> stars(us[1:11, ], main = "Pollution measures in 41 US cities",
+     flip.labels = FALSE, key.loc = c(7.8, 2))
> stars(us[1:11, ], main = "Pollution measures in 41 US cities",
+     flip.labels = FALSE, draw.segments = TRUE,
+     key.loc = c(7.8, 2))

The input key.loc changes the location of the ‘key’ that shows which
variable is displayed where on the star; it was determined through trial
and error. Alter the value of the input flip.labels to true (that is,
set flip.labels=TRUE) to see what effect this has.
The star plot discussed in the text is in Fig. 10.2. Only the stars for
the first eleven cities are shown so that the detail can be seen here.
A variation of the star plot is given in Fig. 10.3, and is particularly
instructive when seen in colour.
From the star plots, can you find any cities that look very similar?
That look very different?


Figure 10.2: A star plot of the US pollution dataset.

Figure 10.3: A variation of the star plot of the US pollution dataset.


10.7 Some hypothesis tests

Activity 10.E: Read Manly, Chapter 4. We will not dwell


on the details, but it is important that you understand the
issues involved (especially section 4.4).

Currently, Hotelling’s T² test is not implemented in r.
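It is not difficult to compute by hand, however. Below is a minimal sketch
of a one-sample Hotelling's T² test; the function is our own illustration
(not a built-in r function) and the data are simulated placeholders.

> # Sketch: a one-sample Hotelling's T^2 test of H0: mu = mu0.
> # This function is our own illustration, not a built-in.
> hotelling.T2 <- function(X, mu0) {
+     n <- nrow(X)
+     p <- ncol(X)
+     xbar <- colMeans(X)
+     T2 <- drop(n * t(xbar - mu0) %*% solve(cov(X)) %*% (xbar - mu0))
+     Fstat <- (n - p)/(p * (n - 1)) * T2  # F with (p, n - p) df
+     c(T2 = T2, F = Fstat, p.value = 1 - pf(Fstat, p, n - p))
+ }
> X <- matrix(rnorm(60), ncol = 3)  # placeholder data with p = 3
> hotelling.T2(X, mu0 = c(0, 0, 0))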

10.8 Further comments

One difficulty with multivariate data has already been discussed: it may be
hard to display the data in a useful way. Because of this, it is often difficult
to find any outliers in multivariate data. Note that an observation may not
appear as an outlier with regard to any particular variables, but it may have
a strange combination of variables.

Multivariate data also can present computational difficulties. The math-


ematics involved in using multivariate techniques is usually matrix based,
and so often very large matrices will be in use. This can create memory
problems, particularly when matrix inversion is necessary. Many computa-
tional tricks and advanced methods are employed in standard software for
performing the computations. Techniques such as singular value decompo-
sition (SVD) are common. Indeed, different answers are often obtained in
different software packages because different algorithms are used.

The main multivariate techniques can be broadly divided into the following
categories:

ˆ Data reduction techniques. These techniques reduce the dimension of


the data at the expense of losing a small amount of information. A
balance is made between reducing the dimension of the data and re-
taining as much information as possible. Techniques such as principal
components analysis (PCA; see Module 11) and factor analysis (FA;
see Module 12) are in this category.

ˆ Classification techniques. These techniques attempt to classify data


into a number of groups. Techniques such as cluster analysis (see
Module 13) and discriminant analysis fall into this category.

Consider the data in Example 10.1. We may wish to reduce the number
of variables from eight to two or three. If we could reduce the number of


Figure 10.4: A star plot of the Toowoomba weather data.

variables to just one, this might be called a ‘pollution index’. This would be
an example of data reduction. Data reduction works with the variables.

However, we may wish to classify the 41 cities into a number of groups


depending on their characteristics. We may be able to identify three groups:
high pollution, moderate pollution and low pollution categories. This is a
classification problem. Classification works with the individuals.

10.9 Exercises

Ex. 10.2: The data set twdecade.dat contains (among other things) the
average rainfall, maximum temperature and minimum temperature at
Toowoomba for the decades 1890s to the 1990s.
Produce a multivariate plot of the three variables by decade. Which
decades appear similar?

Ex. 10.3: The data set twdecade.dat contains the average rainfall, max-
imum temperature and minimum temperature at Toowoomba for
each month. It should be possible to see the seasonal pattern in tem-
peratures and rainfall. Produce a multivariate plot that shows the
features by month.


Ex. 10.4: The data set emdecade.dat contains the average rainfall, max-
imum temperature and minimum temperature at Emerald for the
decades 1890s to the 1990s.
Produce a multivariate plot of the three variables by decade. Which
decades appear similar? How similar are the patterns to those observed
for Toowoomba?

Ex. 10.5: The data set emdecade.dat contains the average rainfall, maxi-
mum temperature and minimum temperature at Emerald for each
month. It should be possible to see the seasonal pattern in temper-
atures and rainfall. Produce a multivariate plot that shows the fea-
tures by month. How similar are the patterns to those observed for
Toowoomba?

Ex. 10.6: The data in the file countries.dat contains numerous variables
from a number of countries, and the countries have been classified by
region. Create a plot to see which countries appear similar.

Ex. 10.7: This question concerns a data set that is not climatological, but
you may find interesting. The data file chocolates.dat, available
from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/
popular/chocolates.html, contains measurements of the price, weight
and nutritional information for 17 chocolates commonly available in
Queensland stores. The data was gathered in April 2002 in Brisbane.
Create a plot to see which chocolates appear similar. Are there any
surprises?

Ex. 10.8: The data file soitni.txt contains the SOI and TNI from 1958 to
1999. The TNI is related to sea surface temperatures (SSTs), and SOI
is also known to be related to SSTs. It may be expected, therefore,
that there may be a relationship between the two indices. Create a
plot to examine if such a relationship exists.

10.9.1 Answers to selected Exercises

10.2 A star plot can be found as follows:

> td <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/clim
+     header = TRUE)
> head(td)

rain maxt mint radn pan vpd


1890 1087.22 22.426 11.430 17.798 4.521 14.526
1900 850.78 22.426 11.430 17.798 4.521 14.526


1910 856.65 22.426 11.430 17.798 4.521 14.526


1920 921.28 22.427 11.431 17.798 4.521 14.527
1930 931.85 22.426 11.430 17.798 4.521 14.526
1940 969.08 22.427 11.431 17.798 4.521 14.527

> stars(td[, 1:3], draw.segments = TRUE,
+     key.loc = c(7, 2), main = "Toowoomba weather by Decade")

The plot (Fig. 10.4) shows a trend of increasing rainfall from the 1900s
to the 1950s, a big drop in the 1960s, then a big jump in the 1970s. The
1990s were very dry again. The 1990s were also a very warm decade
(relatively speaking), and the 1960s very cold (relatively speaking).



Module 11

Principal Components Analysis

Module contents
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 226
11.2 The procedure . . . . . . . . . . . . . . . . . . . . . . . . 228
11.2.1 When should the correlation matrix be used? . . . . . . 232
11.2.2 Selecting the number of pcs . . . . . . . . . . . . . . . . 233
11.2.3 Interpretation of pcs . . . . . . . . . . . . . . . . . . . . 234
11.3 pca and other statistical techniques . . . . . . . . . . . 235
11.4 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.5 Spatial pca . . . . . . . . . . . . . . . . . . . . . . . . . . 242
11.5.1 A small example . . . . . . . . . . . . . . . . . . . . . . 242
11.5.2 A larger example . . . . . . . . . . . . . . . . . . . . . . 245
11.6 Rotation of pcs . . . . . . . . . . . . . . . . . . . . . . . . 247
11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11.7.1 Answers to selected Exercises . . . . . . . . . . . . . . . 252

Module objectives

Upon completion of this module students should be able to:


ˆ understand the principles underlying principal components analysis;

ˆ give a geometric interpretation of the principal components method;

ˆ compute principal components from given data using r;

ˆ select an appropriate number of principal components using suitable


techniques;

ˆ make sensible interpretations of the principal components where pos-


sible;

ˆ compute the principal components scores for each subject;

ˆ conduct a spatial pca;

ˆ understand that rotation of principal components is a contentious is-


sue.

11.1 Introduction

Principal components analysis (pca) is one of the basic multivariate tech-


niques, and is also one of the simplest. Wilks [49, p 373] says of pca that
it is “possibly the most widely used multivariate statistical technique in the
atmospheric sciences. . . ” (to which statistical climatology belongs). pca is
an example of a data reduction technique, one that reduces the dimension
of the data. This is possible if the variables are correlated. pca attempts to
find a new coordinate system for the data.

In climatology and related sciences, numerous variables are correlated, so


pca is a commonly used technique. pca is also called empirical orthogonal
functions (EOFs) or sometimes empirical eigenvector analysis (EEA).

Activity 11.A: Read Manly, Section 6.1. Read Wilks, the


introduction to Section 9.3.

For a geometric interpretation of principal components in the two-dimensional


case, see Fig. 11.1. Fig. 11.1 (top left panel) shows the original data. The
data have a strong trend in the SW–NE direction. In Fig. 11.1 (top right
panel), the two principal components are shown. The first principal compo-
nent is in the SW–NE direction as expected. Fig. 11.1 (centre left) shows one
particular point being mapped to the new coordinates. In Fig. 11.1 (bottom
right panel), a screeplot (see the next section) shows that most (almost 96%)
of the original variation in the data can be explained by the first principal



Figure 11.1: A geometric interpretation for principal components in the
two-dimensional case. Top left: some points are shown. They tend to be
strongly oriented in one direction. Top right: the corresponding principal
components are shown as bold lines. The main principal component is in
the SW–NE direction. Centre left: a particular point is mapped to the new
coordinates. Centre right and bottom left: histograms of the scores on the
two principal components. Bottom right: the scree plot shows that the first
pc accounts for most of the variation in the data.


component only. That is, using just the first principal component reduces
the dimension of the problem from two to one, with only a small loss of
information.

Note the pcs are simply linear combinations of the variables, and that they
are orthogonal. Also note that my computer struggles to perform the
required computations (it seems to manage despite complaining).

11.2 The procedure

Activity 11.B: Read Manly, Sections 6.2 and 6.3. Read


Wilks, Section 9.3.1.

A pca is conducted on a set of n observations of p (probably correlated)


variables.

It is important to realize that pca—and most other multivariate methods


also—is based on finding the eigenvalues and eigenvectors. Also note that
the eigenvalues and eigenvectors are found from either the correlation matrix
or the covariance matrix. The next section discusses which should be used.

The four steps outlined at the bottom of p 80 of Manly show the general
procedure. Software is used to do the computations.

Example 11.1: Consider the following data matrix X with two variables
X1 and X2 , with three observations (so n = 3) for each variable:

    X = [ 1  0 ]
        [ 1  1 ]
        [ 4  2 ].

The data are plotted in Fig. 11.2 (a). The mean for each variable is
X̄1 = 2 and X̄2 = 1, so the centred matrix is

    Xc = [ −1  −1 ]
         [ −1   0 ]
         [  2   1 ].
The centred data are plotted in Fig. 11.2 (b). It is usual to find the pcs
from the correlation matrix. First, find the covariance matrix, found


by computing (X − X̄)ᵀ(X − X̄)/n as follows¹:

    P = (X − X̄)ᵀ(X − X̄)/n = (1/3) [ 6  3 ]   [ 2  1   ]
                                   [ 3  2 ] = [ 1  2/3 ] .
This matrix is always symmetric. From the diagonals of this matrix,
var[X1 ] = 2 and var[X2 ] = 2/3. Using these two numbers, the diagonal
matrix D can be formed:

    D = [ 2  0   ]
        [ 0  2/3 ],

when, by convention, D^{−1/2} refers to the matrix with the diagonals
raised to the power −1/2:

    D^{−1/2} = [ 1/√2  0     ]
               [ 0     √3/√2 ].
The correlation matrix, say R, can then be found as follows:

    R = D^{−1/2} P D^{−1/2} = [ 1     √3/2 ]
                              [ √3/2  1    ].
This matrix will always have ones on the diagonals.
The data can be scaled after being centred by dividing by the standard
deviation (obtained from matrix D^{−1/2}); in this case, the centred and
scaled data are

    Xcs = [ −1/√2  −√3/√2 ]
          [ −1/√2   0     ]
          [  √2     √3/√2 ].
The centred and scaled data are plotted in Fig. 11.2 (c). In effect,
it is this data for which principal components are sought (since
R = Xcsᵀ Xcs /n). Now,

    R = (1/3) [ 3      3√3/2 ]   [ 1     √3/2 ]
              [ 3√3/2  3     ] = [ √3/2  1    ],
the correlation matrix.
The eigenvectors e and eigenvalues λ of matrix R are now required²,
which are the solutions of

(R − Iλ)e = 0. (11.1)
¹ Notice that we have divided by n rather than n − 1. This is simply to follow what r
does; more commonly, the divisor is n − 1 when sample variances (and covariances) are
computed. I do not know why r divides by n instead of n − 1.

² This is a quick review of work already studied in MAT2100. Eigenvalues and eigen-
vectors are covered in most introductory algebra texts.


This system of equations is only consistent if

    |R − Iλ| = 0

(where |W | means the determinant of matrix W ). This becomes

    | 1 − λ    √3/2  |
    | √3/2     1 − λ | = 0,

or

    (1 − λ)² − 3/4 = 0,
with solutions λ1 = 1 + √3/2 ≈ 1.866 and λ2 = 1 − √3/2 ≈ 0.134.
Substituting these eigenvalues into Equation (11.1) to find the
eigenvectors gives

    e1 = [ 1/√2 ]         e2 = [  1/√2 ]
         [ 1/√2 ] ;            [ −1/√2 ] .
These eigenvectors become the principal components, or pcs. There
are two pcs as there were originally two variables. Note that the two
eigenvectors (or the two pcs) are orthogonal: e1 · e2 = 0.
Generally, a matrix of eigenvectors is defined:

    C = [ 1/√2   1/√2 ]
        [ 1/√2  −1/√2 ].

(Note that these vectors are only defined up to a constant. These
vectors have been defined to have a length of one, and the signs
determined to be equivalent to those given in the current version of r I
have³.)
The (directions of the) eigenvectors are shown plotted with the centred
and scaled data in Fig. 11.2 (d).
The pcs are defined in the direction of the two eigenvectors. The proportion
of the variance explained by each is found from the eigenvalues, and
can be reported in a table like that shown below.

    pc       e'value    % variance    cumulative %
    pc 1     1.866       93.3%         93.3%
    pc 2     0.134        6.7%        100%
    Total    2           100%
³ The signs may change from one version of r to another, or even differ between copies
of r on different operating systems. This is true for almost any computer package gener-
ating eigenvectors. A change in sign simply means the eigenvectors point in the opposite
direction and makes no effective difference to the analysis.



Figure 11.2: The data from Example 11.1. Top left: the original data; Top
right: the data have been centred; Bottom left: the data have been centred
and then scaled; Bottom right: the directions of the principal components
have been added.


A scree plot can be drawn from this if you wish. In any case, one pc
would be taken (otherwise, no simplification has been made for all this
work!).
It is possible to then determine what ‘score’ each of the original points
now have on the new variables (or principal components). These new
scores, say Y , can be found from the ‘original’ variables, X, using

Y = XC.

In our case in this example, the matrix X will refer to the centred,
scaled variables since the pcs were computed using these. Hence,

    Y = [ −1/√2  −√3/√2 ]  [ 1/√2   1/√2 ]   [ (−1 − √3)/2  (−1 + √3)/2 ]
        [ −1/√2   0     ]  [ 1/√2  −1/√2 ] = [ −1/2         −1/2        ]
        [  √2     √3/√2 ]                    [ 1 + √3/2     1 − √3/2    ].
Thus, the point (1, 0) is now mapped to ((−1 − √3)/2, (−1 + √3)/2),
the point (1, 1) is now mapped to (−1/2, −1/2), and the point (4, 2)
is now mapped to (1 + √3/2, 1 − √3/2) in the new system. In
Fig. 11.2 (d), the point (1, 1) can be seen to be mapped to a negative
value for the first pc, and the same (possibly negative) value for the
second pc⁴. Thus, (−1/2, −1/2) seems a sensible value to which the
second point could be mapped.
Since we only take one pc, the new variable takes the values

    ( (−1 − √3)/2, −1/2, 1 + √3/2 )

which accounts for about 93% of the variation in the original data.

11.2.1 When should the correlation matrix be used?

Activity 11.C: Read Wilks, Section 9.3.4.

When the variables measure similar information, or have similar units of


measurement, the covariance matrix is generally used. If the variables are
on very different scales, the correlation matrix is usually the basis for pca.
⁴ We say ‘possibly’ since it depends on which direction the eigenvectors are pointing.


For example, Example 10.1 involves variables that are measured on different
scales: SO2 was measured in micrograms per cubic metre, whereas man-
ufac is simply the number of manufacturing enterprises with more than 20
employees. These are very different and measured in different units of mea-
surement. For this reason, the pca should be based on the correlation
matrix.

In effect, the correlation matrix transforms all of the variables to a similar


scale so that the actual units of measurement are not important. Commonly,
but not always, the correlation matrix is used. It is important to realize that,
in general, different results are obtained using the correlation and covariance
matrices.

11.2.2 Selecting the number of PC s

Activity 11.D: Read Wilks, Sections 9.3.2 and 9.3.3.

One of the difficult decisions to make in pca is how many principal com-
ponents (pcs) are necessary to keep. The analysis will always produce as
many pcs as there are variables, so keeping all the pcs means that no infor-
mation is lost, but it also completely reproduces the data. This defeats the
purpose of performing a data reduction technique such as pca—it simply
complicates matters!

There are many criteria for making this decision, but no formal procedure
(involving tests, etc.). There are only guidelines; some are given below.
Using any of the methods without thought is dangerous and prone to error.
Always examine the information and make a sensible decision that you can
justify. Sometimes, there is not one clear decision. Remember the purpose
of pca is to reduce the dimension of the data, so a small number of pcs is
preferred.

Scree plots

One way to help make the decision is to use a scree plot. The scree plot is
used to help decide between the important pcs (with large eigenvalues) and
the less important pcs (with small eigenvalues). Some authors claim this
method generally includes too many pcs. When using a screeplot, some pcs
should be clearly more important than others. (This is not always the case,
however.)


Total variance rule

Another proposed method is to take as many pcs as necessary until a certain


percentage (often 90%) of the variance has been explained.

Use above average PCs

This method recommends only keeping those pcs whose eigenvalues are
greater than the average. (Note that if the correlation matrix has been used
to compute the pcs, this means that pcs are retained if their eigenvalues are
greater than one.) For a small number of variables (say 20), this method is
reported to include too few techniques.

Example 11.2: Kidson [29] analysed monthly means of surface pressures,


temperature and rainfall using principal components analysis. In each
case considered, 10 out of a possible 120 components accounted for
more than 80% of the observed variance.

Example 11.3: Katz & Glantz [27] use a principal components analysis
on rainfall data to show that no single rainfall index (or principal
component) can adequately explain rainfall variation.

11.2.3 Interpretation of PC s

It is often useful to find an interpretation for the pcs, recalling that the pcs
are simply linear combinations of the variables. It is not uncommon for the
first pc to be a measure of ‘size’. Finding interpretations is often quite an
art, and sometimes any interpretation is difficult to find.

Example 11.4: Mantua et al. [33] define the Pacific Decadal Oscilla-
tion (PDO) as the leading pc of monthly SST anomalies in the North
Pacific Ocean.


11.3 PCA and other statistical techniques

pca is often used as a data reduction technique, as has been described in


these notes. But there are other uses as well. For example, pca can be used
on various types of data, often as a preliminary step before further analysis.

pca is sometimes used as a preliminary step before a regression analysis. In


particular, if there are a large number of covariates, or there are a number
of large correlations between covariates, a pca is often performed, a number
of pcs selected, and these pcs used as covariates in a regression analysis.
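As a minimal sketch of this idea in r (using simulated placeholders for the
response y and the correlated covariates X):

> # Sketch: use the first few pcs of correlated covariates in a
> # regression; y and X are simulated placeholders.
> set.seed(3)
> X <- matrix(rnorm(100 * 6), ncol = 6)
> X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)  # strongly correlated pair
> y <- X[, 1] - X[, 3] + rnorm(100)
> pcs <- predict(prcomp(X, center = TRUE, scale = TRUE))[, 1:3]
> summary(lm(y ~ pcs))             # regression on the first three pcs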

Example 11.5: Wolff, Morrisey & Kelly [50] use principal components
analysis followed by a regression to identify source areas of the fine
particles and sulphates which are the primary components of summer
haze in the Blue Ridge Mountains of Virginia, USA.

Example 11.6: Fritts [17] describes two techniques for examining the rela-
tionship between ring-width of conifers in western North America and
climatic variables. The first technique is a multiple regression on the
principal components of climate.

pca is sometimes used with cluster analysis (see Module 13) to classify
climatological variables.

Example 11.7: Stone & Auliciems [42] use a combination of cluster analy-
sis and pca to define phases of the Southern Oscillation Index (SOI).

Example 11.8: One use of principal components analysis is to extract prin-


cipal components from a multivariate time series. Michaelsen [35] used
this method (which he called frequency domain principal components
analysis) on the movement of sea surface temperatures (SST) anoma-
lies in the North Pacific, and found a low frequency SST field.


    r function    Computation method    Matrix used
    princomp      Eigen-analysis        correlation or covariance
    prcomp        SVD*                  centre and/or scale

Table 11.1: Two methods for computing principal components in r. The


stars indicate the preferred option. SVD stands for ‘singular-value decom-
position’. princomp uses the less-preferred eigenvalue-based analysis (for
compatibility with programs such as S-Plus). The functions use different
methods of specifying the matrix on which to base the computations: using
center=TRUE and scale=TRUE in prcomp is equivalent to using cor=TRUE
in princomp. (The default for prcomp is center=TRUE, scale=FALSE; the
default for princomp is cor=FALSE (that is, use the covariance matrix)).

11.4 Using R

r can be used to find principal components; confusingly, two different meth-


ods exist; Table 11.1 compares the methods. In general, the function prcomp
will be used here.

The next example continues on from Example 11.1 and uses a very small
data matrix to show how the calculations done by hand can be compared to
those performed in r.

Example 11.9:
Refer to Example 11.1; the data are plotted in Fig. 11.2 (a). How
can this analysis be done in r?
Of course, tasks such as multiplying matrices and computing the eigen-
values can be done in r (using the commands %*% and eigen respec-
tively). First, define the data matrix (and then centre it also):

> testd <- matrix(byrow = TRUE, nrow = 3,


+ data = c(1, 0, 1, 1, 4, 2))
> means <- colMeans(testd)
> means <- c(1, 1, 1) %o% means
> ctestd <- testd - means

Some of the matrices we used can be defined also:

> XtX <- t(ctestd) %*% ctestd


> P <- XtX/length(testd[, 1])


> st.devs <- sqrt(diag(P))
> cstestd <- testd
> cstestd[, 1] <- ctestd[, 1]/st.devs[1]
> cstestd[, 2] <- ctestd[, 2]/st.devs[2]
> cormat <- cor(ctestd)
> D.power <- diag(1/st.devs)   # the matrix D^(-1/2)
> cormat2 <- D.power %*% P %*% D.power   # reproduces cormat
> es <- eigen(cormat)
> es

$values
[1] 1.8660254 0.1339746

$vectors
[,1] [,2]
[1,] 0.7071068 0.7071068
[2,] 0.7071068 -0.7071068

These results agree with those in Example 11.1. But of course, r can
compute principal components without us having to resort to matrix
multiplication and finding eigenvalues.

> p <- prcomp(testd, center = TRUE, scale = TRUE)


> names(p)

[1] "sdev" "rotation" "center" "scale"


[5] "x"

Specifying center=TRUE and scale=TRUE instructs r to use the corre-


lation matrix to find the pcs. The standard deviations of the principal
components are

> p$sdev

[1] 1.3660254 0.3660254

> p$sdev^2

[1] 1.8660254 0.1339746

Likewise, the centres (means) of each variable are found using p$center
(but are not shown here). The eigenvectors are in the columns of:

> p$rotation


PC1 PC2
[1,] 0.7071068 0.7071068
[2,] 0.7071068 -0.7071068

A screeplot can be produced using

> screeplot(p)

or just

> plot(p)

but is not shown here. The proportion of the variance explained by


each pc is found using summary:

> summary(p)

Importance of components:
PC1 PC2
Standard deviation 1.366 0.366
Proportion of Variance 0.933 0.067
Cumulative Proportion 0.933 1.000

The eigenvalues are given by

> p$sdev^2

[1] 1.8660254 0.1339746

The new scores, called the principal components or pcs (and called Y
earlier), can be found using

> predict(p)

PC1 PC2
[1,] -1.1153551 0.2988585
[2,] -0.4082483 -0.4082483
[3,] 1.5236034 0.1093898

This example was to show you how to perform a pca by hand, and how to
find those bits-and-pieces in the r output. Notice that once the correlation
matrix has been found, the analysis proceeds without knowledge of anything
else. Hence, given only a correlation matrix, pca can be performed. (Note


r requires a data matrix for use in prcomp; to use only a correlation matrix,
you must use eigen and so on.)

Commonly, a small number of the pcs are chosen for further analysis; these
can be extracted as follows (where the first two pcs here are extracted as an
example):

> p.pcs <- predict(p)[, 1:2]

The next example is more practical.

Example 11.10: Consider the sparrow example used by Manly in Exam-


ple 6.1. (While not climatological, it will demonstrate how to do
equivalent analyses in r.) We use the correlation matrix since the
variables are dissimilar.
First, load the data

> sp <- read.table("sparrows.txt", header = TRUE)

It is then interesting to examine the correlations between the variables:

> cor(sp)

Length Extent Head Humerus


Length 1.0000000 0.7349642 0.6618119 0.6269482
Extent 0.7349642 1.0000000 0.6737411 0.7621451
Head 0.6618119 0.6737411 1.0000000 0.7184943
Humerus 0.6269482 0.7621451 0.7184943 1.0000000
Sternum 0.6051247 0.5290138 0.5262701 0.5787743
Sternum
Length 0.6051247
Extent 0.5290138
Head 0.5262701
Humerus 0.5787743
Sternum 1.0000000

There are many high correlations, so it may be possible to reduce the


number of variables and retain most of the information. That is, a pca
may be useful. The following code analyses the data:

> sp <- read.table("sparrows.txt", header = TRUE)


> sp.prcomp <- prcomp(sp, center = TRUE,
+ scale = TRUE)
> names(sp.prcomp)


[1] "sdev" "rotation" "center" "scale"


[5] "x"

The command prcomp returns numerous variables, as can be seen. The


table at the bottom of Manly, p 81 is found as follows:

> sp.prcomp$rotation

PC1 PC2 PC3


Length 0.4548793 -0.06760175 0.7340681
Extent 0.4662631 0.30512343 0.2671031
Head 0.4494628 0.29277283 -0.3470235
Humerus 0.4635108 0.22746613 -0.4772988
Sternum 0.3985280 -0.87457014 -0.2038638
PC4 PC5
Length 0.23424318 0.4413490
Extent -0.47737764 -0.6247119
Head 0.73389847 -0.2307272
Humerus -0.41989524 0.5738386
Sternum -0.04818454 -0.1800565

Can these pcs be interpreted? The first pc is almost equally loaded
for each variable; it therefore measures the general size of the bird.
The second pc is highly loaded with the sternum length, not very
loaded with length, and equally loaded for the rest. It is not easy to
interpret, but perhaps is a measure of sternum length. The third pc
has a high loading for length; perhaps it is a length pc. The fourth
is a measure of head size; the fifth the contrast between extent and
humerus (since these two variables are loaded with different signs). As
can be seen, some creativity may be necessary to develop meaningful
interpretations!
The table above is equivalent to Table 6.3 in Manly, but information
is transposed (try t(sp.prcomp$rotation)). The numbers are also
slightly different, but certainly similar. The eigenvalues (variances of
the pcs) in Manly’s Table 6.3 are found as follows:

> sp.prcomp$sdev^2

[1] 3.5762941 0.5355019 0.3788619 0.3273533


[5] 0.1819888

A screeplot is produced using screeplot:

> screeplot(sp.prcomp)
> screeplot(sp.prcomp, type = "lines")



Figure 11.3: Two different ways of presenting the screeplot for the sparrow
data. In (a), the default screeplot; in (b), the more standard screeplot
produced with the option type="lines".

The final plot is shown in Fig. 11.3. The first pc obviously is much
larger than the rest, and easily accounts for most of the variation in
the data.
If we use the screeplot, you may decide to keep only one pc. Using
the total variance rule, you may decide that three or four pcs are
necessary:

> sp.vars <- sp.prcomp$sdev^2/sum(sp.prcomp$sdev^2)  # proportions of variance
> summary(sp.vars)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.03640 0.06547 0.07577 0.20000 0.10710 0.71530

Using the above average pc rule would select only one pc:

> mean(sp.vars)

[1] 0.2

> sp.vars > mean(sp.vars)

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5


TRUE FALSE FALSE FALSE FALSE

The values of the pcs for each bird is found using (for the first 10 birds
only)


> predict(sp.prcomp)[1:10]

[1] 0.07836554 -2.16233078 -1.13609553


[4] -2.29462019 -0.28519596 1.93013405
[7] -1.03954232 0.44378025 2.70477182
[10] 0.19259851

Note that the first bird has a score of 0.07837 on the first pc, whereas
the score is 0.064 in Manly. The scores on the second pc are very
similar: 0.6166 (above) compared to 0.602 (Manly).
The first three pcs are extracted for further analysis using

> sp.pcs <- predict(sp.prcomp)[, 1:3]

11.5 Spatial PCA

One important use of pca in climatology is spatial pca, or field pca.

Activity 11.E: Read Wilks, Section 9.3.5.

As noted by Wilks, this is a very common use of pca. The idea is this: Data,
such as rainfall, may be available for a large number of locations (usually
called ‘stations’), usually over a long time period. pca can be used to find
patterns over those locations.

11.5.1 A small example

Example 11.11: As a preliminary example, consider some rainfall data


from selected rainfall stations in Australia, as shown in Table 11.2.
Each column consists of 15 observations of the rainfall at each station.
Thus, there are the equivalent of 10 variables with 15 repeated ob-
servations each. A pca can be performed to reduce the information
contained in 10 stations to a smaller number. Notice that the 15
observations for each station constitute a time series.

> p <- prcomp(rain, center = TRUE, scale = TRUE)
> plot(p, main = "Small rainfall example")


Table 11.2: Monthly rainfall figures for ten stations in Australia. There are
15 observations for each station, given in order of time (the actual recording
months are unknown; the source did not state).
Station number

1 2 3 4 5 6 7 8 9 10

1 111.70 30.80 78.70 58.60 30.60 63.60 53.40 15.90 27.60 72.60
2 25.50 2.80 19.20 4.00 8.10 7.80 10.30 1.00 4.10 27.30
3 82.90 47.50 98.90 65.20 73.50 117.00 95.60 37.50 93.40 139.90
4 174.30 81.50 106.80 80.90 73.90 123.50 155.80 51.20 81.50 177.10
5 77.70 22.00 48.90 56.20 67.10 113.00 256.40 38.30 65.60 253.30
6 117.10 35.90 118.10 86.90 81.90 98.60 84.00 42.40 67.30 154.30
7 111.20 52.70 69.10 56.80 27.20 51.60 76.00 16.30 50.40 191.50
8 147.40 109.70 150.70 101.20 102.80 112.40 32.60 42.60 52.50 47.30
9 66.50 29.00 41.70 22.60 50.60 73.10 92.80 26.40 36.00 80.10
10 107.70 37.70 77.00 52.80 27.60 34.80 16.20 7.60 5.50 12.20
11 26.70 6.10 16.20 11.90 14.20 34.80 32.60 18.00 28.70 118.30
12 92.40 25.70 45.50 58.00 22.20 32.30 35.70 8.80 13.80 37.80
13 157.00 63.00 79.20 70.10 45.70 66.80 76.00 14.40 16.30 71.50
14 20.80 4.10 12.50 7.90 7.40 11.70 9.30 14.80 6.60 19.40
15 137.20 38.10 82.40 59.70 27.60 58.00 45.30 5.00 34.30 108.40

The scree plot is shown in Fig. 11.4; it is not clear how many pcs should
be retained. We shall select three for the purpose of this example; three
is not unreasonable as they account for over 90% of the variation in
the data (see the cumulative proportions from summary(p)).
There a few important points to note:
(a) In practice, there are often hundreds of stations with available
data, and over a hundred years worth of rainfall data for most
stations. This creates huge data files that, in practice, take large
amounts of computing power to analyse.
(b) If latitudes and longitudes of the stations are known, contour
maps can be drawn of the principal components over a map of
Australia (see the next example).
(c) Each pc is a vector of length 15 and is also a time series. These
can be plotted as time series (see Fig. 11.5) and even analysed
as a time series using the techniques previously studied. This
analysis can detect time trends in the pcs.
In this small example, the time trends of 10 stations have been reduced
to time trends of three new variables that capture the important in-
formation carried by all 10.
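The code that produced Figure 11.5 was not shown; one possible
sketch, where p is the prcomp object fitted above, is:

> # Sketch: plot the first three pcs over time, as in Figure 11.5.
> scores <- predict(p)[, 1:3]
> matplot(scores, type = "l", lty = 1:3, col = 1, xlab = "Time",
+     ylab = "PCs")
> legend("bottomleft", legend = c("PCA 1", "PCA 2", "PCA 3"),
+     lty = 1:3)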



Figure 11.4: The scree plot for the pca of the small rainfall example.


Figure 11.5: The pcs plotted over time for the small rainfall example.



Figure 11.6: The scree plot for the full rainfall example.

11.5.2 A larger example

Example 11.12: Using the larger data file from which the data in the
previous example came, a more thorough pca can be performed. This
analysis was over 1188 time points for 52 stations. The data matrix has
1188 × 52 = 61 776 entries; this needs a lot of storage in the computer,
and a lot of memory for performing operations such as matrix multi-
plication and matrix inversions. The scree plot is shown in Fig. 11.6.
Plotting the first pc over a map of Australia gives Fig. 11.7 (a). The
second pc has been plotted over a map of Australia in Fig. 11.7 (b).
This time, the first three pcs account for about 57% of the total varia-
tion. Notice that even with 52 stations, the contours are jagged; they
could, of course, be smoothed.
It requires special methods to handle data files of this size. The code
used to generate these pictures is given below. Be aware that you
probably cannot run this code as it requires installing r libraries that
you probably do not have by default (but can perhaps be installed;
see Appendix A). The huge data files necessary are in a format called
netCDF, and a special library is required to read these files.



Figure 11.7: The first two pcs plotted over a map of Australia.


> library(oz)
> library(ncdf)
> set.datadir()
> d <- open.ncdf("./pca/oz-rain.nc")
> rawrain <- get.var.ncdf(d, "RAIN")
> missing <- attr(rawrain, "missing_value")
> rawrain[rawrain == missing] <- NA
> set.docdir()
> longs <- get.var.ncdf(d, "LONGITUDE79_90")
> nx <- length(longs)
> lats <- get.var.ncdf(d, "LATITUDE19_33")
> ny <- length(lats)
> times <- get.var.ncdf(d, "TIME")
> ntime <- length(times)
> rain <- matrix(0, ntime, nx * ny)
> for (ix in (1:nx)) {
+ for (iy in (1:ny)) {
+ idx <- (iy - 1) * nx + ix
+ t <- rawrain[ix, iy, 1:ntime]
+ if (length(na.omit(t)) == ntime) {
+ rain[, idx] <- t
+ }
+ }
+ }
> pc.rain <- rain[, colSums(rain) > 0]
> p1 <- prcomp(pc.rain, center = TRUE, scale = TRUE)
> plot(p1$sdev^2, type = "b", main = "Full rainfall example",
+     ylab = "Variances")
> par(mfrow = c(2, 1))
> oz(add = TRUE, lwd = 2)
> oz(add = TRUE, lwd = 2)

The gaps in the plots are because there is so little data in those
remote parts of Australia, and rainfall is scarce there anyway. Note the
pcs are deduced from the correlations, so the contours are for small
and sometimes negative numbers, not rainfall amounts.

11.6 Rotation of PCs

One controversial topic is the rotation of principal components, which we


briefly discuss here.


One constraint on the pcs is they must be orthogonal, which some authors
argue limits how well they can be interpreted. If the physical interpretation
of the pcs is more important than data reduction, some authors argue that
the orthogonality constraint should be relaxed to allow better interpretation
(see, for example, Richman [38]). This is called rotation of the pcs. Many
methods exist for rotation of the pcs.

However, there are many arguments against rotation of pcs (see, for ex-
ample, Basilevsky [8]). Accordingly, r does not explicitly allow for pcs to
be rotated, but it can be accomplished using functions designed to be used
in factor analysis (where rotations are probably the norm rather than the
exception). We will not discuss this topic any further, except to note two
issues:

1. Rotation is discussed further in Chapter 12 on factor analysis, where


it is more appropriate;

2. The purpose of rotation of the pcs appears to generally be to ‘cluster’


the pcs together. This can be accomplished using a cluster analysis
(see Chapter 13).
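For completeness, here is a sketch of how such a rotation could be obtained
using the factor-analysis machinery mentioned above (the varimax function,
applied here to placeholder data):

> # Sketch: a varimax rotation of the first three pc loadings;
> # USArrests is a built-in dataset used as a placeholder.
> p <- prcomp(USArrests, center = TRUE, scale = TRUE)
> rot <- varimax(p$rotation[, 1:3])
> rot$loadings    # the rotated loadings (no longer true pcs)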

11.7 Exercises

Ex. 11.13: Consider the following data:

    X = [ 3  3 ]
        [ 3  4 ]
        [ 1  3 ]
        [ 1  6 ].

(a) Perform a pca ‘by hand’ using the correlation matrix (follow
Example 11.1 or Example 11.9). (Don’t use prcomp or similar
functions; you may use r to do the matrix multiplication and so
on for you.)
(b) Perform a pca ‘by hand’, but using the covariance matrix.
(c) Compare and comment on the two strategies.

Ex. 11.14: Consider the following data:

    X = [ 1  2 ]
        [ 0  3 ]
        [ 3  5 ]
        [ 4  6 ].


(a) Perform a pca ‘by hand’ using the correlation matrix (follow
Example 11.1 or Example 11.9). (Don’t use prcomp or similar
functions; you may use r to do the matrix multiplication and so
on for you.)
(b) Perform a pca ‘by hand’, but using the covariance matrix.
(c) Compare and comment on the two strategies.

Ex. 11.15: Consider the correlation matrix

    R = [ 1    0.6 ]
        [ 0.6  1   ].

Perform a pca using the correlation matrix. Define the new variables,
and explain how many new pcs are necessary.

Ex. 11.16: Consider the correlation matrix

    R = [ 1  r ]
        [ r  1 ].

(a) Perform a pca using the correlation matrix and show it always
produces new axes at 45◦ to the original axes.
(b) Explain what happens in the pca for r = 0, r = 0.25, r = 0.5
and r = 1.

Ex. 11.17: The data file toowoomba.dat contains (among other things) the
daily rainfall, maximum and minimum temperatures at Toowoomba
from 1 January 1889 to 21 July 2002 (a total of 41474 observations
on three variables). Perform a pca. How many pcs are necessary to
summarize the data?

Ex. 11.18: Consider again the air quality data from 41 cities in the USA,
as seen in Example 10.1. For each city, seven variables have been mea-
sured (see p 218). The first is the concentration of SO2 in microgram
per cubic metre; the other six are potential identifiers of pollution
problems. The original source treats the concentration of SO2 as a
response variable, and the other six as covariates.

(a) Examine the correlation matrix; which variables are highly corre-
lated?
(b) Produce a star plot of the data, and comment.
(c) Is it possible to reduce these six covariates to a smaller number,
without losing much information? Use a pca to perform a data
reduction.


(d) Should a correlation or covariance matrix be used for the pca?


Explain your answer.
(e) Examine the loadings; is there any sensible interpretation?

Ex. 11.19: Consider the example in 11.5.2. If you can load the appropri-
ate libraries, try the same steps in that example but for the data in
oz-slp.nc.

Ex. 11.20: The data file emerald.dat contains the daily rainfall, maximum
and minimum temperatures, radiation, pan evaporation and maximum
vapour pressure deficit (in hPa) at Emerald from 1 January 1889 to
15 September 2002 (a total of 41530 observations on six variables).
Perform a pca. How many pcs are necessary to summarize the data?

Ex. 11.21: The data file gatton.dat contains the daily rainfall, maximum
and minimum temperatures, radiation, pan evaporation and maximum
vapour pressure deficit (in hPa) at Gatton from 1 January 1889 to 15
September 2002 (a total of 41530 observations on six variables).

(a) Perform a pca using the covariance matrix.


(b) Perform a pca using the correlation matrix. Compare to the
previous pca. Which would you choose: a pca based on the
covariance or the correlation matrix? Explain.
(c) How many pcs are necessary to summarize the data? Explain.
(d) If possible, interpret the pcs.
(e) Take the first pc; perform a quick time series analysis on this pc.
(Don’t necessarily attempt to find an ‘optimal’ model; doing so
will be time consuming because of the amount of data, and may
be difficult also. Just plot an ACF, PACF and suggest a model
based on those.)

Ex. 11.22: The data file strainfall.dat contains the average monthly and
annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a pca on the monthly averages (and not the annual
average) using the correlation matrix. How many pcs seem necessary?
(b) Perform a pca on the monthly averages (and not the annual
average) using the covariance matrix. How many pcs seem necessary?
(c) Which pca would you prefer? Why?
(d) Select the first two pcs. Confirm that they are uncorrelated.


Ex. 11.23: The data file jondaryn.dat contains the daily rainfall, max-
imum and minimum temperatures, radiation, pan evaporation and
maximum vapour pressure deficit (in hPa) at Jondaryn from 1 Jan-
uary 1889 to 15 September 2002 (a total of 41474 observations on six
variables). Perform a pca. How many pcs are necessary to summarize
the data?

Ex. 11.24: The data file wind_ca.dat contains numerous weather and wind
measurements from Canberra during 1989.

(a) Explain why it is best to use the correlation matrix for this data.
(b) Perform a pca using the correlation matrix.
(c) How many pcs are necessary to summarize the data? Explain.
(d) If possible, interpret the pcs.
(e) Perform a time series analysis on the first pc.

Ex. 11.25: The data file wind_wp.dat contains numerous weather and wind
measurements from Wilson’s Promontory, Victoria (the most southerly
point of mainland Australia) during 1989.

(a) Explain why it is best to use the correlation matrix for this data.
(b) Perform a pca using the correlation matrix.
(c) How many pcs are necessary to summarize the data? Explain.
(d) If possible, interpret the pcs.
(e) Explain why a time series analysis on, say, the first pc cannot be
done here. (Hint: Read the help about the data.)

Ex. 11.26: The data file qldweather.dat contains six weather-related vari-
ables for 20 Queensland cities.

(a) Perform a pca using the correlation matrix. How many pcs seem
necessary?
(b) Perform a pca using the covariance matrix. How many pcs seem
necessary?
(c) Which pca would you prefer? Why?
(d) Select the first three pcs. Confirm that they are uncorrelated.

Ex. 11.27: This question concerns a data set that is not climatological,
but you may find interesting. The data file chocolates.dat, available
from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/
popular/chocolates.html, contains measurements of the price, weight
and nutritional information for 17 chocolates commonly available in
Queensland stores. The data was gathered in April 2002 in Brisbane.


(a) Would it be best to use the correlation or covariance matrix for
the pca? Explain.
(b) Perform this pca using the nutritional information.
(c) How many pcs are useful?
(d) If possible, give an interpretation for the pcs.

11.7.1 Answers to selected Exercises

11.13 First, use the correlation matrix.

> testd <- matrix(byrow = TRUE, nrow = 4,
+     data = c(3, 3, 3, 4, 1, 3, 1, 6))
> means <- colMeans(testd)
> ctestd <- sweep(testd, 2, means)           # centre each column
> XtX <- t(ctestd) %*% ctestd
> P <- XtX/nrow(testd)                       # covariance matrix, divisor n
> st.devs <- sqrt(diag(P))
> cstestd <- sweep(ctestd, 2, st.devs, "/")  # centred and scaled data
> cormat <- cor(testd)
> D.power <- diag(1/st.devs)
> cormat2 <- D.power %*% P %*% D.power       # the same correlation matrix
> es <- eigen(cormat)
> es

$values
[1] 1.4082483 0.5917517

$vectors
           [,1]      [,2]
[1,]  0.7071068 0.7071068
[2,] -0.7071068 0.7071068

Using the covariance matrix:

> es <- eigen(cov(testd))


> es

$values
[1] 2.4120227 0.9213107


$vectors
[,1] [,2]
[1,] 0.5257311 0.8506508
[2,] -0.8506508 0.5257311

As expected, the eigenvalues and pcs are different.

11.18 Here is some r code:

> us <- read.table("usair.dat", header = TRUE,


+ row.names = 1)
> us.pca <- prcomp(us[, 2:7], center = TRUE,
+ scale = TRUE)
> plot(us.pca, main = "Screeplot for US air data")

How many pcs should be selected? The screeplot is shown in Fig. 11.8,
from which three or four might be selected. The variances of the pcs are

> summary(us.pca)

Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.482 1.225 1.181 0.872
Proportion of Variance 0.366 0.250 0.232 0.127
Cumulative Proportion 0.366 0.616 0.848 0.975
PC5 PC6
Standard deviation 0.3385 0.18560
Proportion of Variance 0.0191 0.00574
Cumulative Proportion 0.9943 1.00000

Perhaps three pcs are appropriate. The first three account for almost
85% of the total variance. It would also be possible to choose four pcs,
but with six original variables, this isn’t a large reduction.

                      PC1        PC2         PC3
temp          -0.32964613  0.1275974 -0.67168611
manufac        0.61154243  0.1680577 -0.27288633
population     0.57782195  0.2224533 -0.35037413
wind.speed     0.35383877 -0.1307915  0.29725334
annual.precip -0.04080701 -0.6228578 -0.50456294
days.precip    0.23791593 -0.7077653  0.09308852
                      PC4         PC5         PC6
temp          -0.30645728  0.55805638 -0.13618780
manufac        0.13684076 -0.10204211 -0.70297051
population     0.07248126  0.07806551  0.69464131
wind.speed    -0.86942583  0.11326688 -0.02452501
annual.precip -0.17114826 -0.56818342  0.06062222
days.precip    0.31130693  0.58000387 -0.02196062

Figure 11.8: The scree plot for the US air data.

Is there a sensible interpretation for these pcs? The first pc contrasts
temperature against the other variables (annual precipitation aside,
since its loading is near zero): temperature loads with one sign, and
the remaining variables with the opposite sign. It is hard to see any
useful interpretation for such a pc. Likewise, interpretations for the
next two pcs are difficult to determine.

Module 12

Factor Analysis
Module contents
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 256
12.2 The Procedure . . . . . . . . . . . . . . . . . . . . . . . . 257
12.2.1 Path model . . . . . . . . . . . . . . . . . . . . . . . . . 258
12.2.2 Steps in a fa . . . . . . . . . . . . . . . . . . . . . . . . 260
12.3 Factor rotation . . . . . . . . . . . . . . . . . . . . . . . . 262
12.3.1 Methods of factor rotation . . . . . . . . . . . . . . . . . 262
12.4 Interpretation of factors . . . . . . . . . . . . . . . . . . 263
12.5 The differences between pca and fa . . . . . . . . . . . 266
12.6 Principal components factor analysis . . . . . . . . . . . 267
12.7 How many factors to choose? . . . . . . . . . . . . . . . 268
12.8 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.9 Concluding comments . . . . . . . . . . . . . . . . . . . . 274
12.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
12.10.1 Answers to selected Exercises . . . . . . . . . . . . . . . 277

Module objectives

Upon completion of this module students should be able to:


• understand the principles underlying factor analysis;

• give a geometric interpretation of the factors used, where possible;

• perform a factor analysis from given data using r;

• select an appropriate number of factors using suitable techniques.

12.1 Introduction

Factor analysis is a data reduction technique very similar to pca. Indeed,


many students find it hard to see the differences between the two methods;
see Sect. 12.5 for a discussion on this issue.

Activity 12.A: Read Manly, Sect. 7.1.

Factor analysis refers to a variety of statistical techniques whose common


objective is to represent a set of variables in terms of a smaller number of
hypothetical variables or factors. pca is therefore an example of a factor
analysis. Usually, however, factor analysis refers to so-called common factor
analysis, which is considered here.
In general, the first step is an examination of the interrelationships between
the variables. Usually correlation coefficients are used as a measure of the as-
sociation between variables. Inspection of the correlation matrix may reveal
relationships within some subsets of variables, and that these correlations are
higher than those between subsets. Factor analysis explains these observed
correlations by postulating the existence of a small number of hypothetical
variables or factors which are causing the observed correlations.
It can be argued that, ignoring sampling errors, a causal system of fac-
tors will lead to a unique correlation system of observed variables. How-
ever, the reverse is not true. Only under very limiting conditions can one
unequivocally determine the underlying causal structure from the correla-
tional structure. In practice, only a correlational structure is presented. The
construction of a causal system of factors from this structure relies as much
on mathematics as judgement, knowledge of the system under investigation
and interpretation of the analysis.
At one extreme, the researcher may not have any idea as to how many
underlying factors exist. Then fa is an exploratory technique aiming at
ascertaining the minimum number of hypothetical factors that can account
for the observed covariation. The majority of applications of this type are
in the social sciences.


fa may also be used as a means of testing specific hypotheses. A researcher


with a considerable depth of knowledge of an area may hypothesize two
different underlying dimensions or factors, and that certain variables belong
to one dimension while others belong to the second. If fa is used to test this
expectation, then it is used as a means of confirming a certain hypothesis,
not as a means of exploring underlying dimensions. Thus, it is referred to
as confirmatory factor analysis.

The idea of having underlying, but unobservable, factors may sound odd.
But consider an example: annual taxable income, number of cars owned,
value of home, and occupation may all measure various observable so-
cioeconomic status indicators. Likewise, heart rate, muscle strength, blood
pressure and hours of exercise per week may all be measurements of fitness.
The observable measurements are all aspects of the underlying factor called
‘fitness’. In both cases, the true, underlying variable of interest (‘socioeco-
nomic status’ and ‘fitness’) is hard to measure directly, but can be measured
using the observed variables given.

12.2 The Procedure

Activity 12.B: Read Manly, Sect 7.2 and 7.3.

Factor analysis (fa), like pca, is a data reduction technique. fa and pca are
very similar, and indeed some computer programs and texts barely distinguish
between them. However, there are certainly differences. As with pca, the
analysis starts with n observations on p variables. These p variables are
assumed to have a common set of m factors underlying them; the role of fa
is to identify these factors.

Mathematically, the p variables are

$$\begin{aligned}
X_1 &= a_{11}F_1 + a_{12}F_2 + \cdots + a_{1m}F_m + e_1 \\
X_2 &= a_{21}F_1 + a_{22}F_2 + \cdots + a_{2m}F_m + e_2 \\
&\ \vdots \\
X_p &= a_{p1}F_1 + a_{p2}F_2 + \cdots + a_{pm}F_m + e_p
\end{aligned} \qquad (12.1)$$

where Fj are the underlying factors common to all the variables Xi , aij are
called factor loadings, and the ei are the parts of each variable unique to
that variable. In matrix notation,

x = Λf + e, (12.2)


where the factor loadings are in the matrix Λ. In general, the Xi are stan-
dardized to have mean zero and variance one. Likewise, the factors Fj are
assumed to have mean zero and variance one, and are independent of ei .
The factor loadings aij are assumed constant. Under these assumptions,

$$\mathrm{var}[X_i] = 1 = a_{i1}^2 + a_{i2}^2 + \cdots + a_{im}^2 + \mathrm{var}[e_i].$$

Hence, the observed variance in Xi is due to two components:

1. The effect of the common factors Fj , through the constants aij . Hence,
the quantity $a_{i1}^2 + a_{i2}^2 + \cdots + a_{im}^2$ is called the communality for Xi .

2. The effect of the component specific to Xi , through var[ei ]. Hence


var[ei ] is called the specificity or uniqueness of Xi . This can also be
seen as the error variance.
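
To make this variance decomposition concrete, here is a small sketch in r
(the loadings matrix lam is hypothetical, with three variables and two
factors) showing how the communalities and uniquenesses follow from the
loadings:

> lam <- matrix(c(0.6, 0.5,
+     0.9, 0.2,
+     0.1, 0.9), ncol = 2, byrow = TRUE)
> communality <- rowSums(lam^2)  # a_{i1}^2 + ... + a_{im}^2 for each X_i
> uniqueness <- 1 - communality  # var[e_i], since var[X_i] = 1
> cbind(communality, uniqueness)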

12.2.1 Path model

The relationship between (observed) variables and factors is often displayed


using a path model. For example, consider the (unlikely) situation where
there are three observed variables, X1 , X2 and X3 , and two factors F1 and
F2 . Suppose further that the factor loadings aij in Eq. (12.1) are known.
Then a path model can be constructed which is consistent with the original
data:

[Path diagram: arrows run from F1 to X1, X2 and X3 with loadings a11, a21
and a31, and from F2 to X1, X2 and X3 with loadings a12, a22 and a32; each
variable Xi also receives an arrow from its unique component ei.]

Using properties of expectations and covariances, the original variances of


the Xi (which are 1, recall) can be recovered.


Example 12.1: Consider a (hypothetical) example where three variables


are observed on a number of fit men: X1 is the number of hours of
exercise performed each week; X2 is the time taken to run 10km; and
X3 is the time taken to sprint 100m. The correlation matrix is
$$\begin{pmatrix} 1 & 0.64 & 0.51 \\ 0.64 & 1 & 0.27 \\ 0.51 & 0.27 & 1 \end{pmatrix}.$$

One possible allocation of the factors is shown below.

[Path diagram: F1 has loadings 0.6, 0.9 and 0.1 on X1, X2 and X3; F2 has
loadings 0.5, 0.2 and 0.9; the unique components have variances 0.38, 0.15
and 0.18 respectively.]

Note that, for example,

Covar[X1 , X2 ]
= Covar[0.6F1 + 0.5F2 , 0.9F1 + 0.2F2 ]
= Covar[0.6F1 + 0.5F2 , 0.9F1 ] + Covar[0.6F1 + 0.5F2 , 0.2F2 ]
= Covar[0.6F1 , 0.9F1 ] + Covar[0.5F2 , 0.9F1 ] +
Covar[0.6F1 , 0.2F2 ] + Covar[0.5F2 , 0.2F2 ]
= 0.54Covar[F1 , F1 ] + 0 + 0 + 0.1Covar[F2 , F2 ]
= 0.64,

as in the original correlation matrix. The unique variances (uniquenesses)
are var[e1 ] = 0.38, var[e2 ] = 0.15 and var[e3 ] = 0.18. At this stage, we are
assuming F1 and F2 are orthogonal, so Covar[F1 , F2 ] = 0. (Recall var[Fi ] =
Covar[Fi , Fi ] = 1 and var[Xi ] = 1.) In addition,

var[X1 ] = var[0.6F1 ] + var[0.5F2 ] + var[e1 ]


= 0.36 + 0.25 + 0.38 ≈ 1

as required. Thus this path model represents one possible allocation


of the factors; there are, however, others possible. Often, the relation-
ships between the factors and the observable variables are given in a
table:


      F1    F2
X1    0.6   0.5
X2    0.9   0.2
X3    0.1   0.9

Is there a sensible interpretation of the factors? F1 is strongly related


to the time to run 10km, and also to the hours of exercise per week;
perhaps this factor could be interpreted as measuring stamina. The
second factor is highly related to the time to sprint 100m, and the
hours of exercise per week; perhaps this factor could be interpreted as
measuring strength.
Written using the matrix notation of Eq. (12.2), $x = \Lambda f + e$:
$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} =
\begin{pmatrix} 0.6 & 0.5 \\ 0.9 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}
\begin{pmatrix} F_1 \\ F_2 \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix},$$
where
$$\Lambda = \begin{pmatrix} 0.6 & 0.5 \\ 0.9 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}$$
and the var[ei ] are 0.38, 0.15 and 0.18 as above.
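
As a check, the correlation matrix of this example can be recovered in r
from Λ and the unique variances, since the model implies corr(x) = ΛΛᵀ + Ψ
(a sketch using the numbers above):

> Lambda <- matrix(c(0.6, 0.5, 0.9, 0.2, 0.1, 0.9),
+     ncol = 2, byrow = TRUE)
> Psi <- diag(c(0.38, 0.15, 0.18))  # unique variances var[e_i]
> Lambda %*% t(Lambda) + Psi        # approximately the correlation matrix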

12.2.2 Steps in a FA

fa has three steps:

1. Find some provisional factor loadings. Commonly, this is done using a


pca. Since the number of underlying factors is often unknown, m pcs
are chosen to become the m underlying factors. Since these factors are
actually pcs, they are uncorrelated. However, the choice of factors F1 ,
F2 , . . . , Fm is not unique. Any linear combination of these is also a
valid choice for the factors. That is,
$$\begin{aligned}
F_1' &= d_{11}F_1 + d_{12}F_2 + \cdots + d_{1m}F_m \\
F_2' &= d_{21}F_1 + d_{22}F_2 + \cdots + d_{2m}F_m \\
&\ \vdots \\
F_m' &= d_{m1}F_1 + d_{m2}F_2 + \cdots + d_{mm}F_m
\end{aligned}$$

are also valid factors. The original factor loadings Λ are effectively
replaced by ΛT for some rotation matrix T .


2. The second step involves selecting a linear combination of the factors


to help interpretation; that is, computing the dij above. This step is
called rotation. There are two types of rotation:
(a) Orthogonal: With this type of rotation, the factors remain or-
thogonal. A common example is the varimax rotation. This
method maximizes $\sum_{ij} (d_{ij}^2 - \bar{d}_{\cdot j})^2$, where $\bar{d}_{\cdot j}$ is the mean over i of
the $d_{ij}^2$.
A transformation y = Ax is orthogonal if the transformation
matrix A is orthogonal; a square matrix A is orthogonal if and
only if its column vectors (say, $a_1, a_2, \ldots, a_n$) form an orthonormal
set; that is,
$$a_i^T a_j = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j. \end{cases}$$
For example, the matrix
$$P = \begin{pmatrix} 0.9397 & -0.3420 \\ 0.3420 & 0.9397 \end{pmatrix}$$
is orthogonal. First, write $a_1 = [0.9397, 0.3420]^T$ and $a_2 = [-0.3420, 0.9397]^T$.
Then $a_1^T a_1 = 0.9397^2 + 0.3420^2 = 1$ and $a_2^T a_2 = (-0.3420)^2 + 0.9397^2 = 1$;
also, $a_1^T a_2 = (0.9397 \times -0.3420) + (0.3420 \times 0.9397) = 0$. Thus, a
transformation based on matrix P is an orthogonal transformation.
(In fact, it represents a rotation of −20°; a numerical check appears
in the sketch after this list.)
(b) Oblique: The factors do not have to remain orthogonal with this
type of rotation. The promax rotation is an example. This pro-
cedure tends to increase large loadings in magnitude relative to
small loadings.
3. The third step is to compute the factor scores; that is, how much of
each variable is explained by each factor. This leads to interpretations
of the factors. To make interpretation easier, a good rotation should
produce factor loadings so that some are close to one, and the others
close to zero.
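
As a quick numerical illustration of the orthogonality condition in step 2(a)
(a sketch only, using the matrix P from above and the loadings of
Example 12.1):

> P <- matrix(c(0.9397, 0.3420, -0.3420, 0.9397), ncol = 2)
> round(t(P) %*% P, 4)  # the 2-by-2 identity matrix, so P is orthogonal
> Lambda <- matrix(c(0.6, 0.5, 0.9, 0.2, 0.1, 0.9),
+     ncol = 2, byrow = TRUE)  # the loadings of Example 12.1
> Lambda %*% P                 # the rotated loadings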

Some points to note:

• pca is often the first step in a factor analysis;

• Factor analysis, like pca, is based on eigenvalues;

• Many types of rotation may be performed. The software package S-
Plus (which is very similar to r) implements twelve different criteria
(Venables & Ripley [46, p 409]). The varimax method is probably the
most popular.


12.3 Factor rotation

In general, with two or more common factors, the initial factor solution
may be converted to another equally valid solution with the same number of
factors by an orthogonal rotation. Such a rotation preserves the correlations
and communalities amongst variables, but of course changes the loadings or
correlations between the original variables and the factors. Recalling that
the initial factor solution may result in loadings which do not allow easy
interpretation of the factors, rotation can be used to “simplify” the loadings
in the sense of enabling easier interpretation. The rotational process of
factor analysis allows the researcher a degree of flexibility by presenting a
multiplicity of views of the same data set. A parsimonious or simple
structure can be obtained by following these guidelines:

1. Any column of the factor loadings matrix should have mostly small
values, as close to zero as possible.

2. Any row of the matrix should have only a few entries far from zero.

3. Any two columns of the matrix should exhibit a different pattern of


high and low loadings.

12.3.1 Methods of factor rotation

Orthogonal rotation discussed above preserves the orientation between the


initial factors so that they are still perpendicular after rotation. In fact
the initial factor axes can be rotated independently giving factors which
are not necessarily perpendicular to each other but still explain the reduced
correlation matrix. This rotation technique is called oblique.

Orthogonal rotation methods enjoy some distinctive properties:

1. Factors remain uncorrelated.

2. The communality estimates are not affected but the proportion of


variability accounted for by a given factor will change as a result of
the rotation.

3. Although the total amount of variance explained by the common fac-


tors won’t change with orthogonal rotation, the percentage accounted
for by an individual factor will, in general, be different.


The standard orthogonal rotation techniques are the varimax (which is in


r), quartimax , and equimax methods. They each aim to simplify the factor
structure but in different ways. Varimax is the most popular and is usually
used with pca extraction. It aims to create small, medium and large loadings
within a particular factor. Quartimax aims, for each variable, to obtain one
and only one major loading across the factors. Equimax attempts to simplify
both the rows and the columns of the structure matrix.

Unfortunately, the use of orthogonal rotation techniques may not result in


uncovering an easily interpretable set of factors. Also there is often no reason
to believe that the hypothetical factors should be uncorrelated. Thus, it is
possible to arrive at much more interpretable factors if oblique rotation is
allowed.

The most popular oblique factor rotation methods are promax (which is in
r), oblimax , quartimin, covarimin, biquartimin, and oblimin. Similar to or-
thogonal rotation methods, oblique methods are designed to satisfy various
definitions of simple structure, and no algorithm is clearly superior to an-
other. Oblique methods present complexities that don’t exist for orthogonal
methods. They include:

1. The factors are no longer uncorrelated and hence the pattern and
structure matrices will not in general be identical.

2. Communalities and variances accounted for are not invariant under


oblique rotation.

For more information on some popular rotation techniques, see Kim and
Mueller [30].

Example 12.2: Buell & Bundgaard [10] use factor analysis to represent
wind soundings over Battery MacKenzie.

12.4 Interpretation of factors

It is often useful to find an interpretation for the resultant factors; rotation


is usually performed to help with this. As with pca, finding interpretations
is often quite an art, and sometimes any interpretation is difficult to find.

Sometimes using a different kind of rotation may help.


Example 12.3: Kalnicky [24] used factor analysis to classify the atmo-
spheric circulation over the midlatitudes of the northern hemisphere
from 1899–1969.

Example 12.4: Hannes [20] used rotated factors to explore the relationship
between water temperatures measured at Blunt’s Reef Light Ship and
the air pressure at Eureka, California. The factor loadings indicated
that the water temperatures measured at Trinidad Head and Blunt’s
Reef were quite different.

Example 12.5: Rogers [39] used factor analysis to find areal patterns of
anomalous sea surface temperature (SST) over the eastern North Pa-
cific based on monthly SSTs, surface pressure and 1000–500mb layer
thickness over North America during 1960–1973.

Example 12.6: Consider Example 12.1. An orthogonal rotation can be


used to rotate the matrix of factor loadings. For example (and this is
probably not a practical example of a rotation but serves to demon-
strate the point), an orthogonal rotation could be achieved using the
matrix
$$T = \begin{pmatrix} \sqrt{3}/2 & -1/2 \\ 1/2 & \sqrt{3}/2 \end{pmatrix}. \qquad (12.3)$$
(Check this transformation matrix is orthogonal!) Then, the factor
loadings become
$$\Lambda T = \begin{pmatrix} 0.6 & 0.5 \\ 0.9 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}
\begin{pmatrix} \sqrt{3}/2 & -1/2 \\ 1/2 & \sqrt{3}/2 \end{pmatrix}
\approx \begin{pmatrix} 0.77 & 0.13 \\ 0.88 & -0.28 \\ 0.54 & 0.73 \end{pmatrix}.$$

This allocation of factor loadings produces the following path diagram:

[Path diagram: F1 has loadings 0.77, 0.88 and 0.54 on X1, X2 and X3; F2
has loadings 0.13, −0.28 and 0.73; the unique components have variances
0.38, 0.15 and 0.18.]

Figure 12.1: The effect on the cartesian plane of applying the orthogonal
transform in matrix T in Eq. (12.3).
Note that still
$$\mathrm{Covar}[X_1, X_2] = \mathrm{Covar}[0.77F_1 + 0.13F_2,\ 0.88F_1 - 0.28F_2]
= (0.77 \times 0.88) + (0.13 \times -0.28) \approx 0.64.$$

It is not clear that this (arbitrary) rotation aids interpretation; it
has been used merely to demonstrate the concepts. The transforma-
tion represents a rotation of −30° (Fig. 12.1).
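
The rotated loadings can be reproduced in r (a sketch; the numbers are
those of this example, and the rotation matrix is named T.rot because T
means TRUE in r):

> Lambda <- matrix(c(0.6, 0.5, 0.9, 0.2, 0.1, 0.9),
+     ncol = 2, byrow = TRUE)
> T.rot <- matrix(c(sqrt(3)/2, 1/2, -1/2, sqrt(3)/2),
+     ncol = 2)                # the rotation matrix T of Eq. (12.3)
> round(Lambda %*% T.rot, 2)   # the rotated loadings, as above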

Example 12.7: A non-orthogonal rotation for Example 12.1 can be ob-


tained using the rotation matrix
$$S = \begin{pmatrix} 1.07 & -0.288 \\ -0.116 & 1.04 \end{pmatrix}.$$
Then, the factor loadings become
$$\Lambda S \approx \begin{pmatrix} 0.58 & 0.35 \\ 0.94 & -0.052 \\ 0.0023 & 0.90 \end{pmatrix}.$$
Figure 12.2: The effect on the cartesian plane of applying the oblique trans-
form in matrix S.

With oblique rotations, matters become more complicated because


now the factors are correlated. In a path diagram, this is indicated as
shown below, where r is the correlation between the two factors.

[Path diagram: F1 has loadings 0.58, 0.94 and 0.0023 on X1, X2 and X3; F2
has loadings 0.35, −0.052 and 0.90; a double-headed arrow labelled r joins
F1 and F2, indicating the factors are correlated; the unique components
have variances 0.38, 0.15 and 0.18.]

12.5 The differences between PCA and FA

pca and factor analysis are similar methods, which is often a source of
confusion for students. This section lists some of the difference (also see
Mardia, Kent & Bibby [34, §9.8]).

1. As seen above, a pca is often a first step in a factor analysis.


2. There is an essential difference between the two analyses. In pca, the
hypothetical new variables (the principal components) are defined as
linear combinations of the observed variables. In factor analysis, it is
the other way around: The observed variables are conceptualized as
being linear composites of some unobserved variables or factors.


3. In pca, the major objective is to select a number of components that


explain as much of the total variance as possible. The values of the
principal components for an individual are usually relatively simple to
compute and interpret. In contrast, the factors obtained in factor
analysis are selected mainly to explain the interrelationships between
the original variables.
4. In pca, computations are started with the covariance matrix or the
correlation matrix. In factor analysis computations often begin with
a reduced correlation matrix, a matrix in which the 1’s on the main
diagonal are replaced by communalities. These are further explained
below.
5. In pca, the principal components are just a transformation of the origi-
nal data, with no assumptions made about the form of the covariance
matrix of the data. In factor analysis, a definite form is assumed.

12.6 Principal components factor analysis

In the previous section, differences between fa and pca were pointed out.
However, pca can actually be used to assist in performing a fa. This is
called principal components factor analysis, and uses a pca to perform the
first step in the fa (note that this is not the only option) from which the
next two steps can be done. This idea is presented in this section.
Begin with p original variables Xi for i = 1 . . . p. Performing a pca will
produce p pcs, Zi for i = 1 . . . p. The pcs are defined as
$$\begin{aligned}
Z_1 &= b_{11}X_1 + b_{12}X_2 + \cdots + b_{1p}X_p \\
&\ \vdots \\
Z_p &= b_{p1}X_1 + b_{p2}X_2 + \cdots + b_{pp}X_p,
\end{aligned}$$
where the $b_{ij}$ are given by the eigenvectors of the correlation matrix. In
matrix form, write $Z = BX$. Since B is a matrix of eigenvectors, $B^{-1} = B^T$,
so also $X = B^T Z$, or
$$\begin{aligned}
X_1 &= b_{11}Z_1 + b_{21}Z_2 + \cdots + b_{p1}Z_p \\
&\ \vdots \\
X_p &= b_{1p}Z_1 + b_{2p}Z_2 + \cdots + b_{pp}Z_p.
\end{aligned}$$
Now in a factor analysis, we only keep m of the p factors; hence
$$\begin{aligned}
X_1 &= b_{11}Z_1 + b_{21}Z_2 + \cdots + b_{m1}Z_m + e_1 \\
&\ \vdots \\
X_p &= b_{1p}Z_1 + b_{2p}Z_2 + \cdots + b_{mp}Z_m + e_p
\end{aligned}$$


where the ei are unexplained components after omitting the last p − m pcs.
In this equation, the bij are like factor loadings. But true factors have a
variance of one; here, var[Zi ] = λi since the Zi is a pc. This means the Zi
are not ‘true’ factors. Of course, the Zi can be rescaled to have a variance
of one:
$$\begin{aligned}
X_1 &= (\sqrt{\lambda_1}\, b_{11})\, Z_1/\sqrt{\lambda_1} + (\sqrt{\lambda_2}\, b_{21})\, Z_2/\sqrt{\lambda_2} + \cdots + (\sqrt{\lambda_m}\, b_{m1})\, Z_m/\sqrt{\lambda_m} + e_1 \\
&\ \vdots \\
X_p &= (\sqrt{\lambda_1}\, b_{1p})\, Z_1/\sqrt{\lambda_1} + (\sqrt{\lambda_2}\, b_{2p})\, Z_2/\sqrt{\lambda_2} + \cdots + (\sqrt{\lambda_m}\, b_{mp})\, Z_m/\sqrt{\lambda_m} + e_p,
\end{aligned}$$

from which we can also write
$$\begin{aligned}
X_1 &= a_{11}F_1 + a_{12}F_2 + \cdots + a_{1m}F_m + e_1 \\
&\ \vdots \\
X_p &= a_{p1}F_1 + a_{p2}F_2 + \cdots + a_{pm}F_m + e_p,
\end{aligned}$$
where $F_j = Z_j/\sqrt{\lambda_j}$ and $a_{ij} = b_{ji}\sqrt{\lambda_j}$ (note the subscripts carefully!). In
matrix form,
$$X = \Lambda f + e.$$
A rotation can be performed by writing
$$X = \Lambda T f + e$$
for an appropriate rotation matrix T .
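
A minimal sketch of these calculations in r, assuming the data are in a
numeric matrix X and m factors are retained:

> es <- eigen(cor(X))      # pca on the correlation matrix
> m <- 2                   # number of factors retained (say)
> B <- es$vectors[, 1:m]   # eigenvectors b_ij
> A <- B %*% diag(sqrt(es$values[1:m]))  # scale column j by sqrt(lambda_j)
> rowSums(A^2)             # communalities of the original variables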

12.7 How many factors to choose?

In pca, there were some guidelines for selecting the number of pcs. Similar
guidelines also exist for factor analysis. r will not let you have too many
factors; for example, if you try to extract three factors from four variables,
you will be told this is too many.

As usual, there are two competing criteria: To have the simplest model
possible, and to explain as much of the variation as possible.

There is no easy answer to how many factors should be chosen; this is one
of the major criticisms of fa. Try to find a number of factors that explains as
much variation as possible (using the communalities and uniquenesses), but
is not too complicated, and preferably leads to a useful interpretation. The
best method is probably to perform a pca, note the 'best' number of pcs,
and then use this many factors in the fa.


Note also that choosing the number of factors is a separate issue to the
rotation. The rotation will not alter the communalities or uniquenesses.
The first step is therefore to decide on the number of factors using commu-
nalities and uniquenesses, and then try various rotations to find the best
interpretation.
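
One further aid (a sketch, assuming the data are in a data frame dd): the
function factanal, used in the next section, reports a test of the hypothesis
that m factors are sufficient, so its p-value can be compared across candidate
values of m:

> for (m in 1:3) cat(m, "factors: p-value =",
+     factanal(dd, factors = m)$PVAL, "\n")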

12.8 Using R

r can be used to perform factor analysis using the function factanal.

The help file for this r function states

The fit is done by optimizing the log likelihood assuming


multivariate normality over the uniquenesses.

Actually doing this is beyond the scope of this course; we will just use r
trusting the code gives sensible answers.

Example 12.8: Consider the European employment data used by Manly


in Example 7.1. (While not climatological, it will demonstrate how
to do equivalent analyses in r.) The following code analyses the data.
First, Manly’s Table 7.1 can be found directly, or using factanal: The
factor analysis without rotation, shown in the middle of Manly p 101,
can be obtained as follows:

> ee <- read.table("europe.txt", header = TRUE)


> cmat <- cor(ee)
> ee.fa4 <- factanal(ee, factors = 4, rotation = "none")
> print(ee.fa4$loadings, cutoff = 0)

Loadings:
Factor1 Factor2 Factor3 Factor4
AGR -0.961 0.178 -0.178 0.094
MIN 0.143 0.625 -0.410 -0.078
MAN 0.744 0.416 -0.102 -0.508
PS 0.582 0.576 -0.017 0.569
CON 0.449 0.034 0.376 -0.375
SER 0.601 -0.327 0.600 0.089
FIN 0.103 -0.121 0.631 0.228
SPS 0.697 -0.672 -0.138 0.196
TC 0.615 -0.121 -0.233 0.146


Factor1 Factor2 Factor3 Factor4


SS loadings 3.274 1.516 1.184 0.858
Proportion Var 0.364 0.168 0.132 0.095
Cumulative Var 0.364 0.532 0.664 0.759

Notice that the values are not identical to those shown in Manly; there
are numerous different algorithms for factor analysis, so this is of no
concern. The help for the function factanal in r states
There are so many variations on factor analysis that it is hard to
compare output from different programs. Further, the optimiza-
tion in maximum likelihood factor analysis is hard, and many
other examples we compared had less good fits than produced by
this function.

The values are, however, similar. The signs are different, but this is of
no consequence.
The results using the varimax rotation are obtained as follows:

> ee.fa4r <- factanal(ee, factors = 4, rotation = "varimax")


> print(ee.fa4r$loadings, cutoff = 0)

Loadings:
Factor1 Factor2 Factor3 Factor4
AGR -0.695 -0.633 -0.278 -0.185
MIN -0.142 0.194 -0.546 0.479
MAN 0.199 0.882 -0.293 0.302
PS 0.205 0.086 0.084 0.969
CON 0.081 0.644 0.250 -0.033
SER 0.427 0.368 0.720 0.023
FIN -0.022 0.041 0.686 0.055
SPS 0.972 0.051 0.197 -0.091
TC 0.614 0.160 -0.061 0.249

Factor1 Factor2 Factor3 Factor4


SS loadings 2.097 1.803 1.563 1.368
Proportion Var 0.233 0.200 0.174 0.152
Cumulative Var 0.233 0.433 0.607 0.759

Again, the factors are not identical, but are similar.


The communalities are not produced by r; instead, the uniqueness is
computed (these are called specificity in Manly). Simply, the variance
of each variable consists of two parts: the uniqueness plus the
communality. The communalities represent the
proportion of each variable that is shared with the other variables


through the common factors. The uniqueness is the proportion of


the variance unique to each variable and not shared with the other
variables. The communalities are computed in r as follows:

> 1 - ee.fa4$uniquenesses

AGR MIN MAN PS CON


0.9950000 0.5852025 0.9950000 0.9950000 0.4853668
SER FIN SPS TC
0.8366518 0.4758879 0.9950000 0.4682174

Again, while they are somewhat similar to those shown in Manly, they
are not identical.
We now show how to extract the ‘scores’ from the factor analysis.
In this example, the ‘scores’ represent how each country scores on
each factor. First, we need to adjust the call to factanal by adding
scores="regression":

> ee.fa4scores <- factanal(ee, factors = 4,


+ scores = "regression")
> names(ee.fa4scores)

[1] "converged" "loadings" "uniquenesses"


[4] "correlation" "criteria" "factors"
[7] "dof" "method" "scores"
[10] "STATISTIC" "PVAL" "n.obs"
[13] "call"

> ee.fa4scores$scores

Factor1 Factor2
Belgium 0.735864909 0.39347899
Denmark 1.660414922 -0.66961020
France 0.196749209 0.34219028
W.Germany 0.367203694 1.17655067
Ireland 0.109519146 -1.15479327
Italy -0.243011308 0.72606452
Luxemborg -0.387309499 1.19266940
Netherlands 0.998482010 -0.43751930
UK 1.296604056 -0.10590831
Austria -0.554916391 0.47914938
Finland 0.674141842 -0.53711442
Greece -1.463580232 -0.79224633
Norway 0.910020084 -0.36175539
Portugal -0.582448069 -0.01995971


Spain -1.446012474 0.95896642


Sweden 1.826033446 -0.42669861
Switzerland -0.904728288 2.13805967
Turkey -1.041975726 -2.66833845
Bulgaria -0.045481468 0.55490800
Czechoslovakia 0.000259302 0.62003588
E.Germany 0.668431037 1.24247646
Hungary 0.027698217 -0.79180745
Poland -0.354297646 -0.48659964
Romania -1.022644195 0.39156263
USSR 0.760473031 -0.50031050
Yugoslavia -2.185489608 -1.26345072
Factor3 Factor4
Belgium 1.05675332 -0.29779175
Denmark 0.42843017 -1.16880321
France 0.76331126 -0.16036086
W.Germany -0.55782348 -0.15329451
Ireland 0.78688797 1.07501857
Italy 0.52722797 -1.17426449
Luxemborg 0.95859219 -0.38523076
Netherlands 1.45523464 -0.04966847
UK 0.16428566 1.06316594
Austria 0.88978886 1.33940370
Finland 0.38459638 0.93802719
Greece 0.60623551 -0.51373635
Norway 1.10665730 -0.54306128
Portugal 0.05541367 -0.72558905
Spain 0.69160242 -0.40625354
Sweden -0.11120997 -0.63396595
Switzerland 0.32945045 -0.32444541
Turkey -1.07926842 -1.66123747
Bulgaria -1.64948488 -0.73494006
Czechoslovakia -1.36338513 0.86309299
E.Germany -1.65825030 0.96913085
Hungary -0.72829821 2.83417294
Poland -0.93180860 0.18018067
Romania -1.52721874 -0.52834041
USSR -1.32194122 -0.83876243
Yugoslavia 0.72422119 1.03755315

These scores may be used in further analysis. For example, a factor


analysis (or pca) is often used to reduce the number of covariates used
in a regression analysis. Suppose, in this example, the given variables
were to be used in a regression analysis where the response variable


is gross domestic product (GDP). (There is no such variable, but this


will demonstrate the ideas). In r, to perform the regression of GDP
against the four factors identified above, use

> ee.lm <- lm(GDP ~ ee.fa4scores$scores)

To learn more about this regression fit, use

> summary(ee.lm)
> names(ee.lm)

Example 12.9: Consider Example 11.10, where a pca was performed on


Manly’s sparrow data. Here, a fa is conducted for comparison.

> sp <- read.table("sparrows.txt", header = TRUE)


> sp.fa.vm <- factanal(sp, factors = 2,
+ rotation = "varimax")
> loadings(sp.fa.vm)

Loadings:
Factor1 Factor2
Length 0.370 0.926
Extent 0.659 0.530
Head 0.638 0.459
Humerus 0.901 0.317
Sternum 0.475 0.463

Factor1 Factor2
SS loadings 2.017 1.665
Proportion Var 0.403 0.333
Cumulative Var 0.403 0.736

> 1 - sp.fa.vm$uniquenesses

Length Extent Head Humerus Sternum


0.9950000 0.7151875 0.6186305 0.9123528 0.4403432

> sp.fa.pm <- factanal(sp, factors = 2,


+ rotation = "promax")
> loadings(sp.fa.pm)

Loadings:
Factor1 Factor2


Length -0.184 1.143


Extent 0.588 0.293
Head 0.614 0.200
Humerus 1.138 -0.234
Sternum 0.358 0.338

Factor1 Factor2
SS loadings 2.180 1.601
Proportion Var 0.436 0.320
Cumulative Var 0.436 0.756

> 1 - sp.fa.pm$uniquenesses

Length Extent Head Humerus Sternum


0.9950000 0.7151875 0.6186305 0.9123528 0.4403432

12.9 Concluding comments

Activity 12.C: Read Manly, Sect. 7.6.

Factor analysis is perceived as valuable by many, and with scepticism by


many others. We present the technique here as a tool, without judgement.
Of note, however, is that Wilks [49] does not consider fa; he only mentions
in passing that pca and fa are distinct methods.

12.10 Exercises

Ex. 12.10: The data file toowoomba.dat contains the daily rainfall, max-
imum and minimum temperatures, radiation, pan evaporation and
maximum vapour pressure deficit (in hPa) at Toowoomba from 1 Jan-
uary 1889 to 21 July 2002 (a total of 41474 observations on six
variables). Perform a fa to find two underlying factors, and compare
the factors using no rotation, promax rotation and varimax rotation.

Ex. 12.11: In a certain factor analysis, the factor loadings were computed
as shown in the following table.


      F1    F2
X1    0.3   0.5
X2    0.8   0.1
X3    0.1   0.8
X4    0.6   0.7

(a) Draw the path model for this problem.


(b) Determine the uniqueness for each variable.

Ex. 12.12: Consider again the air quality data from 41 cities in the USA,
as seen in Example 10.1. For each city, seven variables have been mea-
sured (see p 218). The first is the concentration of SO2 in microgram
per cubic metre; the other six are potential identifiers of pollution
problems. The original source treats the concentration of SO2 as a
response variable, and the other six as covariates.

(a) Is it possible to reduce these six covariates to a smaller number,


without losing much information? How many factors are ade-
quate?
(b) Use an appropriate fa to perform a data reduction. If possible,
find a useful interpretation of the resultant factors.
(c) Perform a regression analysis using SO2 as the response, and the
factors as regressors. Compare to a regression of SO2 on all the
original variables, and comment. (To regress variables A and B
against Y in r, use m1 <- lm ( Y ~ A + B); then names(m1)
and summary(m1) may prove useful.)

Ex. 12.13: The data file gatton.dat contains the daily rainfall, maximum
and minimum temperatures, radiation, pan evaporation and maximum
vapour pressure deficit (in hPa) at Gatton from 1 January 1889 to
15 September 2002 (a total of 41474 observations on six variables).
Perform a fa to find two underlying
factors, and compare the factors using no rotation, promax rotation
and varimax rotation.

Ex. 12.14: The data file strainfall.dat contains the average month and
annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a fa. How many factors seem necessary?


(b) How many factors are useful?
(c) If possible, find a rotation that provides a useful interpretation
for the factors.


Ex. 12.15: The data file jondaryn.dat contains the daily rainfall, max-
imum and minimum temperatures, radiation, pan evaporation and
maximum vapour pressure deficit (in hPa) at Jondaryn from 1 January
1889 to 21 July 2002 (a total of 41474 observations on six variables).
Perform a fa to find two underlying factors, and compare the factors
using no rotation, promax rotation and varimax rotation.

Ex. 12.16: The data file emerald.dat contains the daily rainfall, maximum
and minimum temperatures, radiation, pan evaporation and maximum
vapour pressure deficit (in hPa) at Emerald from 1 January 1889 to
21 July 2002 (a total of 41474 observations on six variables). Perform
a fa to find two underlying factors, and compare the factors using no
rotation, promax rotation and varimax rotation.

Ex. 12.17: The data file wind_ca.dat contains numerous weather and wind
measurements from Canberra during 1989.

(a) Perform a fa on the data.


(b) How many factors are necessary to summarize the data? Explain.
(c) If possible, interpret the factors. What rotation makes for easiest
interpretation?

Ex. 12.18: The data file wind_wp.dat contains numerous weather and wind
measurements from Wilson’s Promontory, Victoria (the most southerly
point of mainland Australia) during 1989.

(a) Perform a fa on the data.


(b) How many factors are necessary to summarize the data? Explain.
(c) If possible, interpret the factors. What rotation makes for easiest
interpretation?

Ex. 12.19: The data file qldweather.dat contains six weather-related vari-
ables for 20 Queensland cities.

(a) Perform a fa. How many factors seem necessary?


(b) How many factors are useful?
(c) If possible, find a rotation that provides a useful interpretation
for the factors.

Ex. 12.20: This question concerns a data set that is not climatological,
but you may find interesting. The data file chocolates.dat, available
from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/
popular/chocolates.html, contains measurements of the price, weight
and nutritional information for 17 chocolates commonly available in
Queensland stores. The data was gathered in April 2002 in Brisbane.


(a) Perform a fa using the nutritional information.


(b) How many factors are useful?
(c) If possible, find a rotation that provides a useful interpretation
for the factors.

12.10.1 Answers to selected Exercises

12.10 Here is a brief analysis.

> tw <- read.table("toowoomba.dat", header = TRUE)


> tw.2.n <- factanal(tw[4:9], factors = 2,
+ rotation = "none")
> tw.2.v <- factanal(tw[4:9], factors = 2,
+ rotation = "varimax")
> tw.2.p <- factanal(tw[4:9], factors = 2,
+ rotation = "promax")



Module 13

Cluster Analysis
Module contents
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 280
13.2 Types of cluster analysis . . . . . . . . . . . . . . . . . . 280
13.2.1 Hierarchical methods . . . . . . . . . . . . . . . . . . . . 280
13.3 Problems with cluster analysis . . . . . . . . . . . . . . 281
13.4 Measures of distance . . . . . . . . . . . . . . . . . . . . 281
13.5 Using PCA and cluster analysis . . . . . . . . . . . . . . 281
13.6 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
13.7 Some final comments . . . . . . . . . . . . . . . . . . . . 287
13.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
13.8.1 Answers to selected Exercises . . . . . . . . . . . . . . . 290

Module objectives

Upon completion of this module students should be able to:

• understand the principles underlying cluster analysis;

• compute clusters using r;

• select an appropriate number of clusters for a given data set;

• plot a dendrogram using r.


13.1 Introduction

Cluster analysis, unlike PCA and factor analysis, is a classification technique.

Activity 13.A: Read Manly, section 9.1.

Example 13.1: Kavvas and Delleur [28] use a cluster analysis for modelling
sequences of daily rainfall in Indiana.

Example 13.2: Fritts [17] describes two techniques for examining the rela-
tionship between ring-width of conifers in western North America and
climatic variables. The second technique is a cluster analysis which
he uses to identify similarities and differences in the response function
and then to classify the tree sites.

13.2 Types of cluster analysis

Activity 13.B: Read Manly, section 9.2.

The simple idea of cluster analysis is explained in Manly, section 9.1. The
actual mechanics, however, can be performed in numerous ways. Manly
discusses two of these: hierarchical clustering (using hclust in r), the first
method mentioned by Manly; and k-means clustering (using kmeans), the
second. The hierarchical methods are discussed in more detail in both
Manly and these notes.

13.2.1 Hierarchical methods

Activity 13.C: Read Manly, section 9.3.

The hierarchical methods discussed in this section are well explained by the
text. The third method, using group averages, can be performed in r using
the option method="average" in the call to hclust. A similar approach to
the first method is found using the option method="single". r also provides
other hierarchical clustering methods; see ?hclust.
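
For instance, the following sketch (with dd a data frame of observations)
compares two of the hierarchical methods:

> d <- dist(dd)                                # Euclidean distances
> dd.single <- hclust(d, method = "single")    # nearest neighbour
> dd.average <- hclust(d, method = "average")  # group average
> plot(dd.average, hang = -1)                  # dendrogram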


13.3 Problems with cluster analysis

Activity 13.D: Read Manly, section 9.4.

13.4 Measures of distance

The hierarchical clustering methods are all based on measures of distance


between observations. There are different measures of distance that can be
used besides the standard Euclidean distance.

Activity 13.E: Read Manly, sections 9.5, 5.1, 5.2 and 5.3.
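
In r, the dist function provides several such measures (a sketch; dd is
assumed to be a numeric data matrix):

> dist(dd)                        # Euclidean distance (the default)
> dist(dd, method = "manhattan")  # city-block distance
> dist(dd, method = "maximum")    # maximum coordinate difference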

13.5 Using PCA and cluster analysis

As mentioned in Sect. 11.3, PCA is often a preliminary step before conduct-


ing a cluster analysis.

Activity 13.F: Read Manly, section 9.6.

Example 13.3: Stone & Auliciems [42] use a combination of cluster analy-
sis and PCA to define phases of the Southern Oscillation Index (SOI).
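
A typical sequence (a sketch; dd again assumed numeric) computes the pcs
first and then clusters on the leading few:

> dd.pca <- prcomp(dd, scale. = TRUE)
> scores <- dd.pca$x[, 1:2]  # keep the first two pcs, say
> dd.hc <- hclust(dist(scores), method = "ward")
> plot(dd.hc, hang = -1)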

13.6 Using R

Cluster analysis can be performed in r, as briefly mentioned previously.


The primary functions to use for hierarchical methods are hclust (which
performs the clustering), and dist (which computes the distance matrix on
which the clustering is based). The default distance measure is the standard
Euclidean distance.

After hclust is used, the resultant object can be plotted; the default plot
is the dendrogram (Manly, Figure 9.1).

For k-means clustering (called partitioning in Manly), the function kmeans


can be used.


Example 13.4: Consider the example concerning European countries used


by Manly in Example 9.1. (While not climatological, it will demonstrate
how to do equivalent analyses in r.) The following code analyses
the data. First, the data are loaded into ec:

> ec <- read.table("europe.txt", header = TRUE)

Then, attach the data:

> attach(ec)

The example in Manly uses standardized data (see the top of page 135).
Here is one way to do this in r:

> ec.std <- scale(ec)

Now that the data is prepared, the clustering can commence. The clus-
tering method used in the Example is the nearest neighbour method;
the most similar of the methods available in r is called method="single".
The distance measure used is the default Euclidean distance.

> es.hc <- hclust(dist(ec.std), method = "single")


> plot(es.hc, hang = -1)

The final plot, shown in Fig. 13.1, looks very similar to that shown in
Manly Figure 9.3.
You can try other methods if you want to experiment. To then de-
termine which countries are in which cluster, the function cutree is
used; here is an example of extracting four clusters:

> cutree(es.hc, k = 4)

Belgium Denmark France


1 1 1
W.Germany Ireland Italy
1 1 1
Luxemborg Netherlands UK
1 1 1
Austria Finland Greece
1 1 1
Norway Portugal Spain
1 1 2
Sweden Switzerland Turkey
1 1 3
Bulgaria Czechoslovakia E.Germany
1 1 1
Hungary Poland Romania
1 1 1
USSR Yugoslavia
1 4

Figure 13.1: The dendrogram after fitting a hierarchical clustering model
(using the single agglomeration method) to the European countries data.

> sort(cutree(es.hc, k = 4))

Belgium Denmark France


1 1 1
W.Germany Ireland Italy
1 1 1
Luxemborg Netherlands UK
1 1 1
Austria Finland Greece
1 1 1
Norway Portugal Sweden
1 1 1
Switzerland Bulgaria Czechoslovakia
1 1 1
E.Germany Hungary Poland
1 1 1
Romania USSR Spain
1 1 2
Turkey Yugoslavia
3 4

Later (Example 13.6), we will see that using Ward’s method is common
in the climatological literature. This produces four different clusters
(Fig. 13.2).

> es.hc.w <- hclust(dist(ec.std), method = "ward")


> plot(es.hc.w, hang = -1)
> sort(cutree(es.hc.w, k = 4))

Belgium Denmark France


1 1 1
Ireland Netherlands UK
1 1 1
Austria Finland Norway
1 1 1
Sweden W.Germany Italy
1 2 2
Luxemborg Greece Portugal
2 2 2
Spain Switzerland Turkey
2 2 3
Yugoslavia Bulgaria Czechoslovakia
3 4 4
E.Germany Hungary Poland
4 4 4
Romania USSR
4 4

Figure 13.2: The dendrogram after fitting a hierarchical clustering model
(using Ward's method) to the European countries data.

Which clustering seems to produce the more sensible clusters? Why?

Example 13.5: On page 137, Manly discusses using the partitioning, or


k-means, method, on the European countries data. This can also be done
in r; firstly, grouping into two groups:

> ec.km2 <- kmeans(ec, centers = 2)


> row.names(ec)[ec.km2$cluster == 1]


[1] "Belgium" "Denmark"


[3] "France" "W.Germany"
[5] "Ireland" "Italy"
[7] "Luxemborg" "Netherlands"
[9] "UK" "Austria"
[11] "Finland" "Norway"
[13] "Portugal" "Spain"
[15] "Sweden" "Switzerland"
[17] "Bulgaria" "Czechoslovakia"
[19] "E.Germany" "Hungary"
[21] "USSR"

> row.names(ec)[ec.km2$cluster == 2]

[1] "Greece" "Turkey" "Poland"


[4] "Romania" "Yugoslavia"

These are different groups than those given in Manly (since a different algo-
rithm is used). Six groups can also be specified:

> ec.km6 <- kmeans(ec, centers = 6)


> row.names(ec)[ec.km6$cluster == 1]

[1] "Greece" "Yugoslavia"

> row.names(ec)[ec.km6$cluster == 2]

[1] "W.Germany" "Switzerland"


[3] "Czechoslovakia" "E.Germany"

> row.names(ec)[ec.km6$cluster == 3]

[1] "Belgium" "Denmark" "Netherlands"


[4] "UK" "Norway" "Sweden"

> row.names(ec)[ec.km6$cluster == 4]

[1] "Ireland" "Portugal" "Spain" "Bulgaria"


[5] "Hungary" "Poland" "Romania" "USSR"

> row.names(ec)[ec.km6$cluster == 5]

[1] "Turkey"

> row.names(ec)[ec.km6$cluster == 6]


[1] "France" "Italy" "Luxemborg"


[4] "Austria" "Finland"

Example 13.6:
Unal, Kindap and Karaca [45] use cluster analysis to analyse Turkey’s
climate. The abstract states:

Climate zones of Turkey are redefined by using . . . cluster


analysis. Data from 113 climate stations for temperatures
(mean, maximum and minimum) and total precipitation
from 1951 to 1998 are used after standardizing with zero
mean and unit variance, to confirm that all variables are
weighted equally in the cluster analysis. Hierarchical cluster
analysis is chosen to perform the regionalization. Five dif-
ferent techniques were applied initially to decide the most
suitable method for the region. Stability of the clusters is
also tested. It is decided that Ward’s method is the most
likely to yield acceptable results in this particular case, as
is often the case in climatological research. Seven different
climate zones are found, as in conventional climate zones,
but with considerable differences at the boundaries.

In the above quote, it is noted that Ward’s method is commonly used in


climatology. This is specified in r as follows:

hclust( dist( data ), method="ward")

The clusters produced using different methods can be quite different (the
default method is the complete agglomeration method).

13.7 Some final comments

A cluster analysis is generally used to classify data into clusters. It is rarely


obvious how many clusters are ideal. There are, however, hypothesis tests
available for helping make this decision (see Wilks [49, p 424] for some
references). In terms of hierarchical cluster analysis as described here, the
dendrogram can help in the decision. One can select a value of the ‘Height’
or ‘Distance’ appropriately. An appropriate distance may be that value


under which the clusters change rapidly; alternatively, interpretations may
aid the clustering. After clustering, there are sometimes useful labels that
can be applied to the clusters (in a data set containing climatic variables
for various countries, for example, countries may be clustered by climatic
regions and labels such as ‘Desert’, ‘Mediterranean’ etc, may be applied).

In addition, it is often recommended that data be first standardized to sim-


ilar scales before a cluster analysis is performed, especially if the data are in
different units. (That is, subtract the mean from the variable, and divide by
the standard deviation; use the r function scale). This prevents variables
with large variances dominating the distance measure.
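
Both points can be combined in a short sketch (dd a data frame; the cutting
height h = 3 is arbitrary):

> dd.std <- scale(dd)   # standardize each variable
> dd.hc <- hclust(dist(dd.std), method = "ward")
> cutree(dd.hc, h = 3)  # cut the dendrogram at height 3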

13.8 Exercises

Ex. 13.7: Try to reproduce Manly’s Figure 9.4 by standardizing and then
using hclust.

Ex. 13.8: The data file tempppt.dat contains the average July temperature
(in °F) and the average July precipitation for 28 stations in the USA.
Each station has also been classified as belonging to southeastern,
central or northeastern USA.

(a) Plot the temperature and precipitation data on a set of axes,


identifying on the plot the three regions the stations are from.
Do the three regions appear to form clusters?
(b) Perform a cluster analysis using the temperature and precipita-
tion data. Use various clustering methods and compare.
(c) How well are the stations clustered according to the three prede-
fined classifications?
(d) Using a dendrogram, which two regions are most similar?

Ex. 13.9: The data file strainfall.dat contains the average month and
annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a cluster analysis using the monthly averages. Use vari-
ous clustering methods and compare.
(b) Using a dendrogram, how many classifications seem useful?

Ex. 13.10: Consider the data file strainfall.dat again.

(a) Perform a PCA on the data. Show that two PCs is reasonable.
(b) Plot the first PC against the second PC. What does this indicate?


(c) Perform a cluster analysis on the first two PCs.


(d) Using a dendrogram, how many classifications seem useful?

Ex. 13.11: This question concerns a data set that is not climatological,
but you may find interesting. The data file chocolates.dat, available
from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/
popular/chocolates.html, contains measurements of the price, weight
and nutritional information for 17 chocolates commonly available in
Queensland stores. The data was gathered in April 2002 in Brisbane.

(a) Perform a cluster analysis using the nutritional information using


various clustering methods, and compare.
(b) Using a dendrogram, how many classifications seem useful?
What broad names could be given to these classifications?

Ex. 13.12: The data file ustemps.dat contains the normal average January
minimum temperature in degrees Fahrenheit with the latitude and
longitude of 56 U.S. cities. (See the help file for full details.) Perform
a cluster analysis. How many clusters seem appropriate? Explain.

Ex. 13.13: In Exercise 11.18, the US pollution data was examined, and a
PCA performed.

(a) Perform a cluster analysis of the first two PCs. Produce a den-
drogram. Does it appear the cities can be clustered into a small
number of groups, based on the first two PCs?
(b) Repeat the above exercise, but use the first three PCs. Compare
the two cluster analyses.

Ex. 13.14: The data file qldweather.dat contains six weather-related
variables for 20 Queensland cities (covering temperatures, rainfall,
number of raindays and humidity) plus elevation.

(a) Perform a PCA to summarise the seven variables into a small
number. How many PCs seem appropriate?
(b) Using the first three PCs, perform a cluster analysis (use Ward’s
method).
(c) Plot a dendrogram. Can you identify and find useful names for
some clusters? A map of Queensland may be useful (Fig. 13.3).
(d) Compare your results to a cluster analysis on all numerical vari-
ables.
(e) Based on the cluster analysis of all variables, use cutree to divide
into four clusters.
(f) Plot a star plot, and see if the clusters can be identified.


[Figure 13.3: a map of Queensland marking the cities Weipa, Cairns, Atherton,
Innisfail, Townsville, Mt Isa, Mackay, Rockhampton, Gladstone, Childers,
Theodore, Birdsville, Maryborough, Gympie, Roma, Nambour, Toowoomba, Brisbane,
Cunnamulla, Warwick, Mt Tamborine and Stanthorpe. Such a map may be useful to
label the clusters in Exercise 13.14.]

Ex. 13.15: The data in the file countries.dat contains numerous variables
from a number of countries, and the countries have been classified by
region.
(a) Perform a cluster analysis on the original data. Given the re-
gions of the countries, is there a sensible clustering that emerges?
Explain.
(b) Perform a PCA on the data. How many PCs seem necessary?
Let this number of PCs be p.
(c) Cluster the first p PCs. Given the regions of the countries, is
there a sensible clustering that emerges? Explain.
(d) How do these clusters compare to the clusters identified using all
the data?

13.8.1 Answers to selected Exercises

13.7 The following r code will work:


> mn <- read.table("mandible.txt", header = TRUE)     # read in the data
> mn.hc <- hclust(dist(scale(mn)), method = "single") # standardize, then single-linkage clustering
> plot(mn.hc, hang = -1)                              # draw the dendrogram



Appendix A

Installing other packages in R
Installing extra r packages can be a tricky business in Windows (I have
never had any trouble in Linux, however). To install the packages oz and
ncdf as used in Section 11.5, there are a couple of options.

ˆ First try using the menu in r: click Packages|Install packages from
CRAN. I have never got this to work for me, but some people have.

ˆ If the above doesn’t work, there is an alternative. Follow these steps:

1. Check your version of r by typing version at the r prompt.


2. If the CD has packages for your version of r, then use the r menu
to select Packages|Install package from local zip file and install the
packages from the zip files on the CD.
3. If this doesn’t work, or your version of r differs from that on the
CD, use your browser to go to http://mirror.aarnet.edu.au/pub/CRAN/,
then click on r Binaries, then Windows and then contrib. (Alternatively,
go directly to http://mirror.aarnet.edu.au/pub/CRAN/bin/windows/contrib/.)
4. Select the directory/folder that corresponds to your version of r.
5. Then download the zip file you need and put it somewhere that
you’ll remember.


6. From the r menu, select Install package from local zip file, and
point to where you saved the file.

You should then have the package installed ready for use. At the r prompt,
type library(oz), for example, and the library is loaded.
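If your machine has internet access, the command line may be simpler than
the menus; a minimal sketch (assuming a reachable CRAN mirror):

> install.packages(c("oz", "ncdf"))  # download and install from CRAN
> library(oz)                        # load the package for use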



Appendix B

Review of statistical rules
B.1 Basic definitions

Experiment: Any situation where the outcome is uncertain is called an
experiment.
Experiments range from the simple tossing of a coin to the complex
simulation of a queuing system.

Sample space: For any experiment, the sample space S of the experiment
consists of all possible outcomes for the experiment.
For example, the sample space of a coin toss is simply S = {tail, head},
usually abbreviated to {T, H}, whereas for a queuing system the sam-
ple space is the huge set of all possible realisations over time of people
arriving and being served in the queue.

Event: An event E consists of any collection of points (set of outcomes)
in the sample space.
For example, in the coin toss there are two possible outcomes: either
T or H. These engender three possible nontrivial events: {T}, {H}
and {T, H} (this last event always happens). When two coins are tossed,
however, there are four possible outcomes: either TT, TH, HT or HH
(using what I trust is an obvious notation). There are then fifteen
possible nontrivial events, such as: two heads, E1 = {HH}; the first
coin is a head, E2 = {HT, HH}; at least one of the coins is a tail,
E3 = {TT, TH, HT}; etc.
This definition of an event as a set of outcomes is very important, as it
allows us to discuss events at a level appropriate to the circumstances.
For example, a driver generates the event “drunk” if his/her blood-alcohol
content is above 0.05%. This groups all the possible outcomes of
the level of alcohol (a real percentage) into two possible sets, that is,
events: “drunk” or “not drunk.”
Mutually exclusive: A collection of events E1 , E2 , E3 , . . . are mutually
exclusive if for i ≠ j, Ei and Ej have no outcomes in common.
For example, when tossing two coins, the above events E1 and E3 are
mutually exclusive because HH (the only outcome in E1 ) is not in
E3 . But E2 and E3 are not mutually exclusive because HT is in both;
nor are E1 and E2 mutually exclusive (HH is in both).

The probabilities of events must satisfy the following rules of probability.

ˆ For any event E, Pr {E} ≥ 0 .

ˆ Something always happens: Pr {S} = 1 .

ˆ If E1 , E2 , . . . , En are mutually exclusive events, then

Pr {E1 ∪ E2 ∪ · · · ∪ En } = Pr {E1 } + Pr {E2 } + · · · + Pr {En } .

For example, we used these last two properties to determine the steady
state probabilities in a queue. Let event Ej denote that the queue is
in state j (that is, with j people in the queue). These are clearly
mutually exclusive events, as the queue cannot be in two states at
once. Further, the sample space is the union of all possible states:
S = E0 ∪ E1 ∪ E2 ∪ · · · and hence

1 = Pr {S} = Pr {E0 ∪ E1 ∪ E2 ∪ · · · }
  = Pr {E0 } + Pr {E1 } + Pr {E2 } + · · ·
  = π0 + π1 + π2 + · · · .

ˆ Pr {Ē} = 1 − Pr {E}, where Ē is the complement of E, that is, Ē is
the set of outcomes that are not in E .

We used this before too. For example, let the event E be that no-one
is waiting in the queue, that is, the system is in state 0 or 1; then

Pr {Ē} = 1 − Pr {E} = 1 − Pr {E0 ∪ E1 } = 1 − Pr {E0 } − Pr {E1 }

gives the probability that there is someone waiting in the queue.


ˆ If two events are not mutually exclusive, then

Pr {E1 ∪ E2 } = Pr {E1 } + Pr {E2 } − Pr {E1 ∩ E2 } .

This is known as the general addition rule of probability. Note that
if E1 and E2 are mutually exclusive then Pr {E1 ∩ E2 } = 0, and so
Pr {E1 ∪ E2 } = Pr {E1 } + Pr {E2 }, as given above.

ˆ For two events E1 and E2 , the conditional probability that event E2
will occur given that E1 has already occurred is

Pr {E2 | E1 } = Pr {E1 ∩ E2 } / Pr {E1 } .

This gives rise to the general multiplication rule:

Pr {E2 ∩ E1 } = Pr {E1 } Pr {E2 | E1 } = Pr {E2 } Pr {E1 | E2 } .

Events E1 and E2 are termed independent if and only if
Pr {E2 | E1 } = Pr {E2 }, or equivalently Pr {E1 | E2 } = Pr {E1 }, or
equivalently Pr {E2 ∩ E1 } = Pr {E1 } Pr {E2 }.
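These rules are easy to verify by simulation; the sketch below (illustrative
only, using base r) estimates probabilities for the two-coin events E2 and E3
defined earlier, and checks the addition rule and a conditional probability:

set.seed(42)
n <- 100000
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)
E2 <- coin1 == "H"                     # first coin is a head: {HT, HH}
E3 <- (coin1 == "T") | (coin2 == "T")  # at least one tail: {TT, TH, HT}
mean(E2 | E3)                          # Pr{E2 or E3}: exactly 1, as the union is S
mean(E2) + mean(E3) - mean(E2 & E3)    # the general addition rule gives the same
mean(E2 & E3) / mean(E3)               # Pr{E2 | E3}: approximately 1/3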

Example B.1: The probability that a person convicted of dangerous driving
will be fined is 0.87, and the probability that he/she will lose
his/her licence is 0.52. The probability that such a person will be
fined and lose their licence is 0.41.
What is the probability that a person convicted of dangerous driving
will either be fined, or lose their licence, or both?

Solution: The events of being fined and losing the licence are not
mutually exclusive, so apply the general addition rule:

Pr {F ∪ L} = Pr {F } + Pr {L} − Pr {F ∩ L}
           = 0.87 + 0.52 − 0.41
           = 0.98 .

Example B.2: A researcher knows that 60% of the goats in a certain dis-
trict are male and that 30% of female goats have a certain disease.
Find the probability that a goat picked at random from the district is
a female and has the disease.


Solution: Since 60% of the goats are male, Pr {F } = 1 − 0.6 = 0.4, and
we are given Pr {D | F } = 0.3. Apply the general multiplication rule:

Pr {F ∩ D} = Pr {F } Pr {D | F }
           = 0.4 × 0.3
           = 0.12 .

B.2 Mean and variance for sums of random variables

If X1 and X2 are random variables and c is a constant, then the following
relationships must hold.

ˆ E(cX1 ) = cE(X1 )

ˆ E(X1 + c) = E(X1 ) + c

ˆ E(X1 + X2 ) = E(X1 ) + E(X2 )

ˆ Var(cX1 ) = c2 Var(X1 )

ˆ Var(X1 + c) = Var(X1 )

If X1 and X2 are independent random variables,

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) .

If X1 and X2 are not independent random variables,

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2 Covar[X1 , X2 ] ,

where Covar[X1 , X2 ] is the covariance between X1 and X2 .
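A brief simulation sketch (the dependence between X1 and X2 below is
arbitrary, chosen just to make the covariance non-zero) confirms the last
rule:

set.seed(1)
n <- 100000
X1 <- rnorm(n, mean = 10, sd = 2)
X2 <- 0.5 * X1 + rnorm(n)             # X2 depends on X1, so they are correlated
var(X1 + X2)                          # direct estimate
var(X1) + var(X2) + 2 * cov(X1, X2)   # should agree closely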

Example B.3: A random variable X has a mean of 10 and a variance of 5.
Determine the mean and variance of 3X − 1.


Solution: Given that E(X) = 10 and Var(X) = 5


ˆ E(3X − 1) = E(3X) − 1 = 3E(X) − 1 = 3(10) − 1 = 29
ˆ Var(3X − 1) = Var(3X) = 9 Var(X) = 9(5) = 45
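Again, a simulation check is immediate (any distribution with the stated
mean and variance will do; a normal is used here for convenience):

x <- rnorm(100000, mean = 10, sd = sqrt(5))  # mean 10, variance 5
mean(3 * x - 1)                              # approximately 29
var(3 * x - 1)                               # approximately 45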

Example B.4: The alternative formula for the variance is derived as follows:

Var(X) = E[(X − µX)²]
       = E[X² − 2µX X + µX²]
       = E[X²] + E[−2µX X] + µX²     by the addition rules
       = E[X²] − 2µX E[X] + µX²      by the multiplication rule
       = E[X²] − 2µX² + µX²          as µX = E(X)
       = E[X²] − E[X]² .
Definition B.1 (general expectation, variance and standard deviation)

ˆ The expected value of any given function g(X) of a discrete random
variable is

E(g(X)) = Σx g(x) p(x) ,

where p(x) is its probability distribution.

ˆ For a discrete random variable X, the variance of X is the expected
value of g(X) where g(x) = (x − µX)² (recall µX = E(X)); that is,

Var(X) = σX² = E[(X − µX)²] = Σx (x − µX)² p(x) .

The standard deviation of X is σX = √Var(X) .

For any distribution, the variance of X may also be computed from

Var(X) = E(X²) − E(X)² ,

as was derived in Example B.4 using the rules above.

Example B.5: A random variable X has the following probability
distribution:

x      0     1     2     3     4
p(x)   0.05  0.15  0.35  0.25  0.20

Determine the expected value of X and the variance of X.


Solution:

ˆ E(X) = µX = Σx x p(x) = 0 × 0.05 + 1 × 0.15 + 2 × 0.35 + 3 × 0.25 +
4 × 0.20, therefore E(X) = 2.40.

ˆ E(X²) = Σx x² p(x) = 0² × 0.05 + 1² × 0.15 + 2² × 0.35 + 3² × 0.25 +
4² × 0.20 = 7.0, therefore Var(X) = E(X²) − E(X)² = 7.0 − 2.40² = 1.24.
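In r these calculations are one-liners; a quick check of Example B.5:

x <- 0:4
p <- c(0.05, 0.15, 0.35, 0.25, 0.20)
EX <- sum(x * p)     # expected value: 2.40
EX2 <- sum(x^2 * p)  # E(X^2): 7.0
EX2 - EX^2           # variance: 1.24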



Appendix C

Some time series tricks in R
C.1 Helpful R commands

To convert an AR model to an MA model:

imp <- as.ts( c(1, rep(0, 19)) )
# Creates a time series (1, 0, 0, 0, ...)
theta <- filter(imp, c(ar1, ar2, ...), "recursive")
# Note that ar0 = 1 is assumed, and should not be included.

To find the ACF of an AR model:

imp <- as.ts( c(1, rep(0, 99)) )
# Creates a time series (1, 0, 0, 0, ...)

# Now convert to an MA model, as above:
theta <- filter(imp, c(ar1, ar2, ...), "recursive")
# Note that ar0 = 1 is assumed, and should not be included.

# Now get gamma:
convolve( theta, theta )
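As a concrete sketch (the AR(2) coefficients 0.5 and 0.3 here are chosen
purely for illustration), the recipe above reproduces the theoretical ACF
given by ARMAacf:

imp <- as.ts( c(1, rep(0, 99)) )                # unit impulse
theta <- filter(imp, c(0.5, 0.3), "recursive")  # MA (psi) weights of the AR(2)
gamma <- convolve(theta, theta)                 # autocovariances, up to a constant
rho <- gamma / gamma[1]                         # autocorrelations, lags 0, 1, 2, ...
rho[1:5]
ARMAacf(ar = c(0.5, 0.3), lag.max = 4)          # theoretical ACF for comparison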



Appendix D

Time series functions in R
The following is a list of the functions available in r for time series analysis.

Table D.1: The time series library in r.

Function Description
acf Autocovariance and Autocorrelation
Function Estimation
ar Fit Autoregressive Models to Time Series
ar.burg Fit Autoregressive Models to Time Series
ar.mle Fit Autoregressive Models to Time Series
ar.ols Fit Autoregressive Models to Time Series by OLS
ar.yw Fit Autoregressive Models to Time Series
arima ARIMA Modelling of Time Series
austres Quarterly Time Series: Number of Australian Residents
bandwidth.kernel Smoothing Kernel Objects
beaver1 Body Temperature Series of Two Beavers
beaver2 Body Temperature Series of Two Beavers
beavers Body Temperature Series of Two Beavers
BJsales Sales Data with Leading Indicator.
Box.test Box–Pierce and Ljung–Box Tests
ccf Cross-Covariance and Cross-Correlation Function Estimation

301
302 Appendix D. Time series functions in R

Function (cont.) Description (cont.)


cpgram Plot Cumulative Periodogram
df.kernel Smoothing Kernel Objects
diffinv Discrete Integrals: Inverse of Differencing
embed Embedding a Time Series
EuStockMarkets Daily Closing Prices of Major European
Stock Indices, 1991-1998.
fdeaths Monthly Deaths from Lung Diseases in the UK
filter Linear Filtering on a Time Series
is.tskernel Smoothing Kernel Objects
kernapply Apply Smoothing Kernel
kernel Smoothing Kernel Objects
lag Lag a Time Series
lag.plot Time Series Lag Plots
LakeHuron Level of Lake Huron 1875–1972
ldeaths Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
lynx Annual Canadian Lynx trappings 1821–1934
mdeaths Monthly Deaths from Lung Diseases in the UK
na.contiguous NA Handling Routines for Time Series
nottem Average Monthly Temperatures at
Nottingham, 1920–1939
pacf Autocovariance and Autocorrelation Function Estimation
plot.acf Plotting Autocovariance and Autocorrelation Functions
plot.spec Plotting Spectral Densities
plot.stl Methods for STL Objects
plot.tskernel Smoothing Kernel Objects
PP.test Phillips-Perron Unit Root Test
predict.ar Fit Autoregressive Models to Time Series
predict.arima0 ARIMA Modelling of Time Series - Preliminary Version
print.ar Fit Autoregressive Models to Time Series
print.arima0 ARIMA Modelling of Time Series - Preliminary Version
print.stl Methods for STL Objects
print.tskernel Smoothing Kernel Objects
spec Spectral Density Estimation
spec.ar Estimate Spectral Density of a Time
Series from AR Fit
spec.pgram Estimate Spectral Density of a Time
Series from Smoothed Periodogram
spec.taper Taper a Time Series

spectrum Spectral Density Estimation
stl Seasonal Decomposition of Time Series by Loess
summary.stl Methods for STL Objects
sunspot Yearly Sunspot Data, 1700-1988.
Monthly Sunspot Data, 1749-1997.
toeplitz Form Symmetric Toeplitz Matrix
treering Yearly Treering Data, -6000-1979.
ts.intersect Bind Two or More Time Series
ts.plot Plot Multiple Time Series
ts.union Bind Two or More Time Series
UKDriverDeaths Deaths of Car Drivers in Great Britain, 1969-84
UKLungDeaths Monthly Deaths from Lung Diseases in the UK
USAccDeaths Accidental Deaths in the US 1973-1978
tskernel Smoothing Kernel Objects



Appendix E

Multivariate analysis functions in R

Table E.1: The multivariate statistics library in r.

Function Description
ability.cov Ability and Intelligence Tests
as.dendrogram General Tree Structures
as.dist Distance Matrix Computation
as.hclust Convert Objects to Class hclust
as.matrix.dist Distance Matrix Computation
biplot Biplot of Multivariate Data
biplot.princomp Biplot for Principal Components
cancor Canonical Correlations
cmdscale Classical (Metric) Multidimensional Scaling
cut.dendrogram General Tree Structures
cutree Cut a tree into groups of data
dist Distance Matrix Computation
factanal Factor Analysis
factanal.fit.mle Factor Analysis
format.dist Distance Matrix Computation
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4

hclust Hierarchical Clustering
identify.hclust Identify Clusters in a Dendrogram
kmeans K-Means Clustering
loadings Print Loadings in Factor Analysis
names.dist Distance Matrix Computation
plclust Hierarchical Clustering
plot.dendrogram General Tree Structures
plot.hclust Hierarchical Clustering
plot.prcomp Principal Components Analysis
plot.princomp Principal Components Analysis
plotNode General Tree Structures
plotNodeLimit General Tree Structures
prcomp Principal Components Analysis
predict.princomp Principal Components Analysis
princomp Principal Components Analysis
print.dist Distance Matrix Computation
print.factanal Print Loadings in Factor Analysis
print.hclust Hierarchical Clustering
print.loadings Print Loadings in Factor Analysis
print.prcomp Principal Components Analysis
print.princomp Principal Components Analysis
print.summary.prcomp Principal Components Analysis
print.summary.princomp Summary method for Principal Components Analysis
promax Rotation Methods for Factor Analysis
rect.hclust Draw Rectangles Around Hierarchical Clusters
screeplot Screeplot of PCA Results
summary.prcomp Principal Components Analysis
summary.princomp Summary method for Principal Components Analysis
varimax Rotation Methods for Factor Analysis



Bibliography

[1] Joint Archive for Sea Level, http://uhslc.soest.hawaii.edu/uhslc/jasl.html.

[2] Climate Indices, from the Climate Diagnostics Center, http://www.cdc.noaa.gov/ClimateIndices/

[3] Climate Prediction Center http://www.cpc.noaa.gov/

[4] U.S. Geological Survey, Hydro-Climatic Data Network (HCDN):


Streamflow Data Set, 1874–1988 By J.R. Slack, Alan M. Lumb,
and Jurate Maciunas Landwehr. http://water.usgs.gov/pubs/wri/
wri934076/1st_page.html

[5] Hyndman, Rob. The Time Series Data Library, http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/index.htm

[6] Anderson, O. D. (1976). Time Series Analysis and Forecasting: The


Box–Jenkins Approach. London and Boston: Butterworths.

[7] Assimakopoulos, V. and Nikolopoulos, K. (2000) ‘The theta model:


a decomposition approach to forecasting’ in International Journal of
Forecasting 16(4) 521–530.

[8] Basilevsky, A. (1994). Statistical factor analyasis and related methods,


New York: John Wiley and Sons.

[9] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, San


Francisco: Holden-Day.

[10] Buell, C. Eugene and Bundgaard, Robert C. (1971). ‘A factor analysis


of winds to 60 km over Battery MacKenzie, C.Z.’ in Journal of Applied
Meteorology, 10(4), 803–810.

[11] Chatfield, Chris (1996). The Analysis of Time Series: an introduction,


Boca Raton:Chapman and Hall.


[12] Chin, Roland T., Jau, Jack Y. C. and Weinman, James A. (1987). ‘The
application of time series models to cloud field morphology analysis’ in
Journal of Climate and Applied Meteorology, 26, 363–373.

[13] Chu, Pao-Shin and Katz, Richard W. (1985). ‘Modeling and forecast-
ing the southern oscillation: A Time-Domain Approach’ in Monthly
Weather Review, 113, 1876–1888.

[14] Claps, P. and Murrone, F. (1994). ‘Optimal parameter estimation of
conceptually-based streamflow models by time series aggregation’ in
Stochastic and Statistical Methods in Hydrology and Environmental
Engineering, Volume 3, eds Keith W. Hipel, A. Ian McLeod, U. S. Panu
and Vijay P. Singh, Netherlands: Kluwer Academic Publishers, p421–434.

[15] Davis, J. M. and Rapoport, P. N. (1974). ‘The use of time series analysis
techniques in forecasting meteorological drought’, in Monthly Weather
Review 102, 176–180.

[16] Enfield, D.B., A. M. Mestas-Nunez and P.J. Tribble, (2001). ‘The
Atlantic multidecadal oscillation and its relation to rainfall and river
flows in the continental U.S.’ in Geophysical Research Letters, 28, 2077–2080.

[17] Fritts, Harold C. (1974). ‘Relationships of ring widths in arid-site


conifers to variations in monthly temperature and precipitation’ in Eco-
logical Monographs, 44, 411–440.

[18] Guiot, J. and Tessier, L. (1997). ‘Detection of pollution signals in tree-


ring series using AR processes and neural networks’ in Applications of
Time Series Analysis in Astronomy and Meteorology, eds T. Subba Rao,
M. B. Priestley and O. Lessi, London: Chapman and Hall, p413–426.

[19] Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E.
(1994). A Handbook of Small Data Sets, London: Chapman and Hall.

[20] Hannes, Gerald (1976). ‘Factor analysis of coastal air pressure and
water temperature’ in Journal of Applied Meteorology, 15(2), 120–126.

[21] Hipel and McLeod (1984). Time Series Modelling of Water Resources
and Environmental Systems, Elsevier.

[22] Hyndman, Rob J. and Billah, Baki. ‘Unmasking the Theta method’ to
appear in International Journal of Forecasting.

[23] Izenman, A. J. (1983). ‘J. R. Wolf and H. A. Wolfer: An historical
note on the Zurich sunspot relative numbers’ in Journal of the Royal
Statistical Society A, 146, 311–318.


[24] Kalnicky, Richard A. (1987) ‘Seasons, singularities, and climatic
changes over the midlatitudes of the northern hemisphere during 1899–1969’
in Journal of Applied Meteorology, 26(11), 1496–1510.

[25] Kärner, Olavi and Rannik, Üllar (1996). ‘Stochastic models to repre-
sent the temporal variability of zonal average cloudiness’ in Journal of
Climate, 9, 2718–2726.

[26] Katz, Richard W. and Skaggs, Richard H. (1981). ‘On the use of
autoregressive-moving average processes to model meteorological time
series’ in Monthly Weather Review, 109, 479–484.

[27] Katz, Richard W. and Glantz, Michael H. (1986). ‘Anatomy of a rainfall


index’ in Monthly Weather Review, 114(4), 764–771.

[28] Kavvas, M. L. and Delleur, J. W. (1981). ‘A stochastic cluster model


of daily rainfall sequences’ in Water Resources Research, 17(4), 1151–
1160.

[29] Kidson, John W. (1975). ‘Eigenvector analysis of monthly mean surface


data’ in Monthly Weather Review, 103(3), 177–186.

[30] Kim, Jae-On and Mueller, Charles W. (1990). Factor Analysis:
Statistical Methods and Practical Issues, Sage University Paper series on
Quantitative Applications in the Social Sciences, series no. 14. Beverly
Hills and London: Sage Publications.

[31] Maier, H. R. and Dandy, G. C. (1995). Comparison Of The Box-Jenkins


Procedure with Artificial Neural Network Methods for Univariate Time
Series Modelling, Volume 1, Research Report R127, Department of Civil
and Environmental Engineering, The University of Adelaide.

[32] Makridakis, Spyros and Hibon, Michèle (2000). ‘The M3-Competition:


results, conclusions and implications’ in International Journal of Fore-
casting 16(4), 451–476

[33] Mantua, Nathan J. Hare, Steven R., Zhang, Yuan, Wallace, John M.,
and Francis, Robert C. (1997). ‘A Pacific interdecadal climate oscilla-
tion with impacts on salmon production’ in the Bulletin of the American
Meteorological Society, 78, 1069–1079.

[34] Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Anal-


ysis, London: Academic Press.

[35] Michaelsen, Joel (1982). ‘A statistical study of large-scale, long period


variability in North Pacific sea surface temperature anomalies’ in Jour-
nal of Physical Oceanography, 12(7), 694–703.


[36] Parzen, Emanuel (1979). ‘Nonparametric statistical data modeling’ in


Journal of the American Statistical Association, 74(365), 105–121.

[37] Preisendorfer, R. W. and Mobley, C. D. (1984). ‘Climate forecast
verifications, United States Mainland 1974–83’ in Monthly Weather Review,
112, 809–825.

[38] Richman, MB (1986). ‘Rotation of principal components’ in Journal of


Climatology, 6, 293–335.

[39] Rogers, Jeffery C. (1976). ‘Sea surface temperature anomalies in the


eastern North Pacific and associated wintertime atmospheric fluctu-
ations over North America, 1960–73’ in Monthly Weather Review,
104(8), 985–993.

[40] Sales, P. R. H., Pereira, B. de B. and Vieira, A. M. (1994). ‘Linear
procedures for time series analysis in hydrology’ in Stochastic and
Statistical Methods in Hydrology and Environmental Engineering, Volume 3,
eds Keith W. Hipel, A. Ian McLeod, U. S. Panu and Vijay P. Singh,
Netherlands: Kluwer Academic Publishers, p105–117.

[41] Shaw, N. (1942). Manual of Meteorology, Volume 1, London: Cambridge
University Press.

[42] Stone RC and Auliciems A. (1992). ‘SOI phase relationships with rain-
fall in eastern Australia’ in International Journal of Climatology, 12,
625–636.

[43] Trenberth, Kevin E. and Stepaniak, David P. (2001). ‘Indices of
El Niño Evolution’ in Journal of Climate, 14, 1697–1701.

[44] Tong, Howell (1983). Threshold Models in Nonlinear Time Series Anal-
ysis, Springer-Verlag.

[45] Unal, Yurdanur, Kindap, Tayfun and Karaca, Mehmet (2003). ‘Redefining
the climate zones of Turkey using cluster analysis’ in International
Journal of Climatology, 23, 1045–1055.

[46] Venables, W. N. and Ripley, B. D. (1997). Modern Applied Statistics


with S-PLUS, second edition, Springer-Verlag: New York.

[47] Visser, H. and Molenaar, J. (1995). ‘Trend estimation and regression


analysis in climatological time series: an application of structural time
series models and Kalman filter’ in Journal of Climate, 8, 969–979.

[48] Wilks, DS (1989). ‘Conditioning stochastic daily precipitation models


on total monthly precipitation’ in Water Resources Research,
25, 1429–1439.


[49] Wilks, Daniel S. (1995). Statistical Methods in the Atmospheric Sci-


ences. Academic Press, San Diego.

[50] Wolff, George T., Morrisey, Mark L. and Kelly, Nelson A. (1984). ‘An
investigation of the sources of summertime haze in the Blue Ridge
Mountains using multivariate statistical methods’ in Journal of Applied
Meteorology, 23(9), 1333–1341.

[51] Woodward, Wayne A. and Gray, H. L. (1995). ‘Selecting a model for
detecting the presence of a trend’ in Journal of Climate, 8, 1929–1937.

[52] Yao, C. S. (1983). ‘Fitting a linear autoregressive model for long-range


forecasting’ in Monthly Weather Review, 111, 692–700.

[53] Zwiers, Francis and von Storch, Hans (1990). ‘Regime-dependent au-
toregressive time series models of the southern oscillation’ in Journal
of Climate, 3, 1347–1363.
