STAT1301 Notes
Advanced Analysis of
Scientific Data
by the
Statistics Group of
the School of Mathematics and Physics
November 4, 2020
CONTENTS
2 Describing Data 21
2.1 Data as a Spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Structuring Variables According to Type . . . . . . . . . . . . . . . . 23
2.2.1 Structuring Nominal Factors . . . . . . . . . . . . . . . . . . 25
2.2.2 Structuring Ordinal Factors . . . . . . . . . . . . . . . . . . . 25
2.2.3 Structuring Discrete Quantitative Data . . . . . . . . . . . . . 25
2.2.4 Structuring Continuous Quantitative Variables . . . . . . . . . 26
2.2.5 Good Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Summary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Making Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.1 Plotting Qualitative Variables . . . . . . . . . . . . . . . . . . 30
2.5.2 Plotting Quantitative Variables . . . . . . . . . . . . . . . . . 31
2.5.3 Graphical Representations in a Bivariate Setting . . . . . . . . 36
2.5.4 Visualising More than Two Variables . . . . . . . . . . . . . . 39
3 Understanding Randomness 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Random Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Random Variables and their Distributions . . . . . . . . . . . . . . . 56
3.7 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Estimation 93
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Estimates and Estimators . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Approximate Confidence Interval for the Mean . . . . . . . . 98
6.3.2 Normal Data, One Sample . . . . . . . . . . . . . . . . . . . 100
6.3.3 Normal Data, Two Samples . . . . . . . . . . . . . . . . . . . 104
6.3.4 Binomial Data, One Sample . . . . . . . . . . . . . . . . . . 106
6.3.5 Binomial Data, Two Samples . . . . . . . . . . . . . . . . . . 107
8 Regression 127
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2.1 Estimation for Linear Regression . . . . . . . . . . . . . . . . 130
8.2.2 Hypothesis Testing for Linear Regression . . . . . . . . . . . 132
8.2.3 Using the Computer . . . . . . . . . . . . . . . . . . . . . . . 132
8.2.4 Confidence and Prediction Intervals for a New Value . . . . . 135
8.2.5 Validation of Assumptions . . . . . . . . . . . . . . . . . . . 136
8.3 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 137
8.3.1 Analysis of the Model . . . . . . . . . . . . . . . . . . . . . 138
8.3.2 Validation of Assumptions . . . . . . . . . . . . . . . . . . . 140
A R Primer 191
A.1 Installing R and RStudio . . . . . . . . . . . . . . . . . . . . . . . . 191
A.2 Learning R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
A.2.1 R as a Calculator . . . . . . . . . . . . . . . . . . . . . . . . 193
A.2.2 Vector and Data Frame Objects . . . . . . . . . . . . . . . . . 194
A.2.3 Component Selection . . . . . . . . . . . . . . . . . . . . . . 196
A.2.4 List Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
A.2.5 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.2.6 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . 202
A.2.7 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
A.2.8 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
A.2.9 Reading and Writing Data . . . . . . . . . . . . . . . . . . . 208
A.2.10 Workspace, Batch Files, Package Installation . . . . . . . . . 209
Index 211
PREFACE
These notes are intended for first-year students who would like to more fully under-
stand the logical reasoning and computational techniques behind statistics. Our inten-
tion was to make something that would be useful as a self-learning guide for advanced
first-year students, providing both a sound theoretical foundation of statistics as well
as a comprehensive introduction to the statistical language R.
STAT1301 is the advanced version of STAT1201, and will explain several concepts
on a deeper level than is feasible in STAT1201. Our guiding principle was that it is just
as important to know the “why” as the “how”. To get the most use out of these notes
it is important that you carefully read the whole story from beginning to end, annotate
the notes, check the results, make connections, do the exercises, try the R programs,
visit the lectures, and most importantly, ask questions about things you do not under-
stand. If you are frightened by the maths, it is good to remember that the mathematics
is there to make life easier, not harder. Mathematics is the language of science, and
many things can be said more precisely in one single formula, definition, or with a
simple artificial example, than is possible in many pages of verbose text. Moreover,
by using mathematics it becomes possible to build up statistical knowledge from very
basic facts to a high level of sophistication. Of course in a first-year statistics course,
however advanced, it is not possible to cover all the knowledge that has been built
up over hundreds of years. We will sometimes only give a glimpse of new things to
discover, but we have to leave something for your future studies! Knowing the math-
ematical reasoning behind statistics avoids using statistics only as a black box, with
many “magic” buttons. Especially when you wish to do further research it is impor-
tant to be able to develop your own statistical reasoning, separate from any statistical
package.
• Dirk P. Kroese and Joshua C.C. Chan (2014). Statistical Modeling and Computation,
Springer, New York.
• Pierre Lafaye de Micheaux, Rémy Drouilhet, and Benoit Liquet (2014). The R Soft-
ware: Fundamentals of Programming and Statistical Analysis, Springer, New York.
We will introduce the topics in these notes in a linear fashion, starting with a brief introduction to data and evidence in Chapter 1. In Chapter 2 we describe how to summarize
and visualize data. We will use the statistical package R to read and structure the data
and make figures and tables and other summaries. Chapter 3 is about probability,
which deals with the modeling and understanding of randomness. We will learn about
concepts such as random variables, probability distributions, and expectations. Vari-
ous important probability distributions in statistics, including the binomial and normal
distributions, receive special attention in Chapter 4. We then continue with a few more
probability topics in Chapter 5, including multiple random variables, independence,
and the central limit theorem. At the end of that chapter we introduce some sim-
ple statistical models. After this chapter, we will have built up enough background
to properly understand the statistical analysis of data. In particular, we discuss esti-
mation in Chapter 6 and hypothesis testing in Chapter 7, for basic models. The
remaining chapters consider the statistical analysis of more advanced models, includ-
ing regression (Chapter 8) and analysis of variance (Chapter 9), both of which are
special examples of a linear model (Chapter 10). The final Chapter 11 touches on
additional statistical techniques, such as goodness of fit tests, logistic regression, and
nonparametric tests. The R program will be of great help here. Appendix A gives a
short introduction to R.
CHAPTER 1
DATA AND EVIDENCE
The aim of this chapter is to give a short introduction to the statistical reasoning
that we will be developing during this course. We will discuss the typical steps
taken in a statistical study, emphasize the distinction between observational and
designed statistical experiments, and introduce you to the language of hypothesis
testing.
Statistics is an essential part of science, providing the language and techniques neces-
sary for understanding and dealing with chance and uncertainty in nature. It involves
the design, collection, analysis, and interpretation of numerical data, with the aim of
extracting patterns and other useful information.
5. Analyse this model and make decisions about the model based on the ob-
served data.
6. Translate decisions about the model to decisions and predictions about the
research question.
To fully understand statistics it is important that you follow the reasoning behind
the steps above. Let’s look at a concrete example.
Example 1.1 (Biased Coin) Suppose we have a coin and wish to know if it is fair
— that is, if the probability of Heads is 1/2. Thus the research question here is: is
the coin fair or biased? What we could do to investigate this question is to conduct
an experiment where we toss the coin a number of times, say 100 times, and observe
when Heads or Tails appears. The data is thus a sequence of Heads and Tails— or
we could simply write a 1 for Heads and 0 for Tails. We thus have a sequence of 100
observations, such as 1 0 0 1 0 1 0 0 1 . . . 0 1 1. These are our data. We can visualize
the data by drawing a bar graph such as in Figure 1.1.
Figure 1.1: Outcome of an experiment where a fair coin is tossed 100 times. The dark
bars indicate when Heads (=1) appears.
Think about the pros and cons of this plot. If we are only interested in the bi-
asedness of the coin, then a simple chart that shows the total numbers of Heads and
Tails would suffice, as knowing exactly where the Heads or Tails appeared is irrele-
vant. Thus, we can summarize the data by giving only the total number of Heads, x
say. Suppose we observe x = 60. Thus, we find 60 Heads in 100 tosses. Does this
mean that the coin is not fair, or is this outcome simply due to chance?
Note that if we were to repeat the experiment with the same coin, we would likely get a different series of Heads and Tails (see Figure 1.2), and therefore a different outcome for x.
Figure 1.2: Outcomes of three different experiments where a fair coin is tossed 100
times.
We can now reason as follows (and this is crucial for the understanding of statis-
tics): if we denote by X (capital letter) the total number of Heads (out of 100) that
we will observe tomorrow, then we can view x = 60 as just one possible outcome
of the random variable X. To answer the question whether the coin is fair, we need
to say something about how likely it is that X takes a value of 60 or more for a fair
coin. To calculate probabilities and other quantities of interest involving X we need
an appropriate statistical model for X, which tells us how X behaves probabilistically.
Using such a model we can calculate, in particular, the probability that X takes a value
of 60 or more, which is about 0.028 for a fair coin — so quite small. However, we did
observe this quite unlikely event, providing reasonable evidence that the coin may not
be fair.
Interestingly, we don’t actually need any formulas to calculate this probability.
Computers have become so fast and powerful that we can quickly approximate prob-
abilities via simulations. Simulating this coin flip experiment in R is equivalent to
sampling 100 times (with replacement) from a “population” {0, 1} and counting how
many 1s there are. In R:
> coin = c(0, 1)                          # the population {0, 1}
> x = sample(coin, 100, replace = TRUE)   # 100 draws with replacement
> x
[1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0
[29] 1 0 0 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0
[57] 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0
[85] 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 1
> sum(x)                                  # number of Heads in this experiment
[1] 48
> sum(sample(coin, 100, replace = TRUE))  # two more repetitions
[1] 54
> sum(sample(coin, 100, replace = TRUE))
[1] 38
Now, let’s repeat this 1000 times and save the output in a variable:
> data.we.could.have.seen
= replicate(1000, sum(sample(coin, 100, replace = T)))
> data.we.could.have.seen
[1] 43 47 56 54 49 45 46 51 41 47 48 44 54 53 43 54 46 49
[19] 48 44 47 52 53 39 44 52 53 45 52 57 49 54 48 56 42 47
[37] 42 46 44 47 49 46 51 53 59 57 50 45 51 55 50 53 60 53
...
[973] 45 49 42 53 54 51 56 46 49 48 53 46 55 37 47 49 51 54
[991] 50 49 49 50 57 35 44 49 45 52
> data.we.could.have.seen >= 60
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[46] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
...
[991] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[1000] FALSE
> sum(data.we.could.have.seen >= 60)
[1] 21
> sum(data.we.could.have.seen >= 60) / 1000
[1] 0.021
So, without any knowledge of probability, we have found that the probability that
X takes a value of 60 or more is approximately 0.021 for a fair coin. If the coin is
indeed fair, then what we have witnessed was quite a rare event — entirely possible,
but rather rare. We can either:
• accept that the coin is fair and that we just happened to see a rather rare occur-
rence; or
• refuse to accept that we have been so unlucky, and instead suspect that the coin is
rigged.
You have already carried out your first scientific study and statistical hypothesis
test! The rest of this course will build up your foundational knowledge in probability
and statistics so that you can tackle a wider range of research questions and data types.
1.2 Data
Data comes in many shapes and forms, but is often represented as a spreadsheet in
a “standard format”, where columns represent features such as height, gender, and
income, and rows represent individuals or units.
The data in a spreadsheet could be the result of an observational study, where we have no control over any feature (corresponding to a column in the spreadsheet). A typical example is survey data. If we were to repeat the whole study, the values in all columns would change.
Alternatively, the data in a spreadsheet could be the result of a designed experiment, where certain experimental conditions (features) are controlled (fixed) to reduce unwanted variability in the measurements. If we were to repeat the whole experiment, the experimental feature columns would stay the same.
Regardless of whether we have data from an observational study or a designed experiment, some measurements would change if the data were collected again. There may be
various causes of variability/randomness in measurements. For example, in height
data the main source of variability is the natural diversity of heights in a population.
Another source of variability is the measurement variability (how accurately we can
measure each height). Later on in this course we will consider statistical models that
aim to explain the variability in the data. For example, we could try to explain the
variability in heights not only via the natural variability in the population, but also
taking into account variables such as gender, ethnicity, and shoe size. Any remaining
variability that cannot be explained by the model is called the residual variability. A
good model has a small residual error and predicts new data well.
• Give 10 friends 250mL of caffeinated Diet Coke while the other 10 friends are
given 250mL of caffeine-free (decaf) Diet Coke.
• Wait half an hour after the drink, and then measure the heart rates again.
1. Alice chose 20 subjects in her study. Is this a sufficient sample size? There are
many considerations for choosing a “good” sample size. In general, the sample
size is determined by (a) the size of the effect that we are trying to detect and (b)
the variability in the data. If the data has little variability, it may suffice to only
use a small sample size to detect a certain effect. In contrast, if the data exhibits
a large amount of variability it may require a large sample size to detect any
effects, especially if they are small. However, bigger sample sizes do not always
make for better experiments. Running an overly large study often leads to poor
quality of data, as it is difficult to enforce compliance to the study protocols at all
levels of the study. A large sample size may also be impractical, potentially dan-
gerous (e.g., for experimental medical treatments), costly, or time-consuming.
Think of all the cola Alice would have to buy for a study with 1000 individuals!
2. Alice chose her friends as test subjects. Is that fair? Let us go back to the
research question: to detect if caffeinated cola increases the heart rate. If the
population of interest is not just Alice’s friends, but the general population, then
choosing the subjects within her circle of friends may introduce a sampling bias.
For example, suppose that the effect of caffeine depends on age and gender. If most of Alice's friends are between 19 and 22 years old and female, the sample group would no longer be representative of the general population. Any
conclusions from this study would pertain to the smaller population of people
that are similar to Alice’s friends. But maybe the effect of caffeine does not
depend on factors such as age, weight, gender, or the subject’s cola drinking
behaviour, and then the conclusions might be applicable to a larger population.
There is another possible source of bias in Alice’s experiment. Recall that she
gives 10 friends caffeinated Diet Coke and 10 friends decaf Diet Coke. How
are the two groups chosen? Perhaps she divides the groups into 10 males and
10 females. But this could lead to a bias in the results, if the difference in heart
rate would depend on gender. To avoid any bias in the group selection, we can
randomly select the treatment and placebo group, by using the random number
generator of R, for example. Such a randomization process is an important
ingredient in many designed experiments.
Another issue is whether Alice’s friends know if they are getting caffeinated or
decaf cola. This may influence the measurements (increase in heart rate) through
the placebo effect: friends that know they have consumed caffeinated Diet Coke
may increase their heart rate beyond what would have happened if they knew
their cola did not have caffeine. This is especially pertinent in comparative ex-
periments of new medical treatments. In such experiments it is customary that
one portion (say one half) of the subjects is offered the new treatment and the
other half receives a placebo. In a blind experiment the subjects do not know
whether they receive the treatment or a placebo. Even in this case, the experi-
menter may inadvertently influence the outcome if they know which subjects are
assigned the treatments and placebos. To remove this bias as well, the gold-standard
procedure is to employ a double-blind experiment, where the experimenters do
not know how the treatments and placebos are distributed over the subjects.
3. Why did Alice wait half an hour after consuming drinks before measuring the
pulse rates again? In this experiment, Alice used “subject-specific” knowledge.
Namely, she did a thorough literature search on the average time it takes for
humans to absorb and metabolize caffeine in drink form — most sources claimed
this to be around 20–30 minutes.
4. Why did Alice choose to compare Diet Coke with decaf Diet Coke instead of the
regular (non-Diet) cola? The reason is given in Figure 1.3: the only difference
between caffeinated Diet Coke and decaf Diet Coke is the amount of caffeine.
In contrast, caffeinated Regular Coke and decaf Regular Coke have, in addition
to the caffeine content, different energy, protein, carbohydrate, and sodium con-
tents. So a change in heart rate could be the result of the sugar content, for
example, rather than caffeine content. We say that decaf Diet Coke serves as a
suitable control for caffeinated Diet Coke, as their only difference is the caffeine
content.
Later on in this course we will have a closer look at designed experiments and
how to analyse them via Analysis of Variance (ANOVA) methods. But for now let us
examine Alice’s data and introduce some experimental design terminology on the way.
Alice’s experiment involves actively applying treatments to subjects and observing
their responses. An experimental treatment is a combination of factors at different
levels. The variables describing the treatments are the explanatory variables in the
study. The response from an experiment is the variable (or variables) of interest.
For Alice’s experiment, the response variable is the change in pulse rate, and the
explanatory variable is the caffeine content, which is considered at two levels (yes and
no). The resulting changes in pulse rate are given in Table 1.1:
Table 1.1: Changes in pulse rate (in beats per minute) for the two groups.
Caffeinated   17  22  21  16   6  −2  27  15  16  20
Decaf          4  10   7  −9   5   4   5   7   6  12
We mentioned in Section 1.2 that when using software such as R to analyse and
display data it is important that the data is represented/stored in a standard format.
Table 1.1 is not in a standard format. To convert it to standard format, we should store the pulse rate changes of all 20 subjects (Alice's friends) in a single column — called pb, for example — and use a second column, Caffeine, to indicate whether a subject received the caffeinated or decaf cola. Of course such a table is very tall and skinny and does not present as well on paper as Table 1.1.
http://www.coca-cola.com.au/ourdrinks/nutrition-comparison-tool.jsp
Figure 1.3: Nutritional information for caffeinated/decaf Regular Coke (top) and caf-
feinated/decaf Diet Coke (bottom).
Do the results in Table 1.1 provide any evidence that caffeine increases pulse rate?
Let us now go through the same 6 steps of a statistical study as in Section 1.1. The first
two steps (designing the study and collecting the data) have already been discussed
above.
Step 3 is about visualizing and summarizing data. Figure 1.4 shows a possible
visualization of the data in a so-called stripplot.
Figure 1.4: Stripplot of the change in pulse rate (pb) for the decaf (No) and caffeinated (Yes) groups.
• The mean (i.e., average) increase in pulse rate for the decaf group is 5.1 bpm.
• The mean increase in pulse rate for caffeine group is 15.8 bpm.
The group difference is thus 15.8 − 5.1 = 10.7 bpm. Is this evidence that caffeine
increases pulse rate? In other words, if we summarize our data via the group differ-
ence, what evidence is there that the group difference 10.7 was due to the effect of the
caffeine presence, rather than this happening by chance?
Compare this with the coin toss experiment in Section 1.1, where we summarized
the data via the total number of Heads. There, we considered the behaviour of the total
number of heads X for a fair coin, and compared it with the observed number of heads
x = 60. In the next section we will carry out a similar comparison for the Alice data in
the context of hypothesis testing.
There are two possible explanations for this result:
1. Caffeine really has no effect on pulse rate and the observed group difference of 10.7 was just due to the chance variability in pulse rates.
2. The group difference of 10.7 arose because caffeine does increase pulse rate.
Suppose that the first explanation is correct — i.e., caffeine really has no effect. In-
stead of two different groups we have really made 20 observations of the same process.
That is, the changes in pulse rate for each subject would have been the same regard-
less of which treatment they were given. Only in the selection of the control/treatment
groups did the observations happen to end up in the groups that they did. If we dis-
tribute the 20 measurements randomly amongst the two groups, how likely is it that
we end up with a group difference of 10.7 or more?
Suppose we want to randomly select 10 of the 20 subjects to be in the caffeinated
group (with the other 10 going to the decaf group). Let’s label the first subject by “1”,
second subject by “2”, third subject by “3”, . . . , and so on. We can do this in R via:
> subjects = 1:20
We can then sample 10 numbers from this list at random, without replacement, via:
> sample(subjects, 10)
 [1]  4 14 11  3 16 15  2 18  6  7
We have now simulated one particular randomization that could have been observed. How many possible randomizations are there? A total of $\binom{20}{10} = 184{,}756$ — you will learn how to count this in Section 3.4. For our original data, the 10 subjects in the caffeinated group were chosen as 1, 2, . . . , 10. We need a fast way to compute the
group difference for any group. To do this, we first store the original data in a variable
alice:
> alice = c(17, 22, 21, 16, 6, -2, 27, 15, 16, 20, 4, 10, 7, -9, 5, 4, 5, 7, 6, 12)
and then define a function to calculate the group differences (i.e., differences in the
sample means):
> diff.mean = function(data){
mean(data[1:10]) - mean(data[11:20]) }
> diff.mean(alice)
[1] 10.7
For example, one random reshuffle of the data and its corresponding group difference:
> shuffled = sample(alice)    # a random permutation of the 20 measurements
> shuffled
 [1]  7  7 -9 -2 17 15  4 12 21  5 10  4  6 20 16 22  5 16  6 27
> diff.mean(shuffled)
[1] -5.5
We now repeat this process many times, e.g., 200,000 times, and count how many reshuffles lead to a group difference of 10.7 or more.
> sum(replicate(200000, diff.mean(sample(alice)) >= 10.7))
[1] 410
> 410/200000
[1] 0.00205
We see that this number is very small, indicating that it is very unlikely that we obtain
a group difference of 10.7 or more by reshuffling only. However, we did observe this
group difference, giving strong evidence against the hypothesis that caffeine has no
effect on the pulse rate.
In the language of hypothesis testing, we conducted a specific statistical test called
a randomization test. The first explanation we gave is called the null hypothesis, H0 ,
of the test (“nothing is really happening”). The second explanation (the statement we
wish to show) is the alternative hypothesis, H1 . The function of the data on which
we base our conclusion is called the test statistic; in this case the group difference.
The probability of obtaining such unusual data (or even more unusual) under the null
hypothesis is the P-value of the test (here, p = 0.002).
• A small P-value suggests the null hypothesis may be wrong, giving evidence for
the alternative hypothesis.
• A large P-value suggests that the data are consistent with the null explanation,
giving inconclusive evidence of an effect.
Figure 1.5 illustrates the strength of evidence, expressed in words, associated with
a P-value.
CHAPTER 2
DESCRIBING DATA
This chapter describes how to structure data, calculate simple numerical sum-
maries and draw standard summary plots.
2.1 Data as a Spreadsheet
1. The first column is usually an identifier column, where each unit/row is given a
unique name or ID.
2. Certain columns can correspond to the design of the experiment, specifying for
example to which experimental group the unit belongs, after using a randomiza-
tion procedure.
In this course, we will store data in CSV (Comma Separated Values) format.
That is, the data is given as a text file where, as the name suggests, values are sep-
arated by commas. You can open and create a CSV file/spreadsheet via Excel or,
better, via R. It will be convenient to illustrate various data concepts by using the CSV
file nutrition_elderly.csv, which contains nutritional measurements of thirteen
variables (columns) for 226 elderly individuals (rows) living in Bordeaux, who were
interviewed in the year 2000 for a nutritional study (see Table 2.1 for a description of
the variables).
You can import the data into R using for example the function read.csv, as in:
> nutri = read.csv("nutrition_elderly.csv",header=TRUE)
The R function head gives the first few rows of the data frame, including the variable
names.
> head(nutri)
gender situation tea coffee height weight age meat fish
1 2 1 0 0 151 58 72 4 3
2 2 1 1 1 162 60 68 5 2
3 2 1 0 4 162 75 78 3 1
4 2 1 0 0 154 45 91 0 4
5 2 1 2 1 154 50 65 5 3
6 2 1 2 0 159 66 82 4 2
raw_fruit cooked_fruit_veg chocol fat
1 1 4 5 6
2 5 5 1 4
3 5 2 5 4
4 4 0 3 2
5 5 5 3 2
6 5 5 1 3
The names of the variables can also be obtained directly via the function names, as
in names(nutri). This returns a list of all the names of the data frame. The data for
each individual column (corresponding to a specific name) can be accessed by using
R’s list$name construction. For example, nutri$age gives the vector of ages of the + 196
individuals in the nutrition data set.
Note that all the entries in nutri are numerical (that is, they are numbers). However, the meaning of each number depends on the respective column. For example, a 1 in the “gender” column means here that the person is male (and 2 for female), while a 1 in the “fish” column indicates that this person eats fish less than once a week. Note also that it does not make sense to take the average of the values in the “gender” column, but it makes perfect sense for the “weight” column. To better manipulate the data it is important to specify exactly what the structure of each variable is. We discuss this next.
For qualitative variables (often called factors), we can distinguish between nomi-
nal and ordinal variables:
¹As all measurements are recorded to a finite level of accuracy, one could argue that all quantitative variables are discrete. The actual issue is how we model the data, via discrete or continuous (random) variables. Random variables and their probability distributions will be introduced in Chapter 3.
Nominal factors represent groups of measurements without order. For example, record-
ing the sex of subjects is essentially the same as making a group of males and a
group of females.
Ordinal factors represent groups of measurements that do have an order. A common
example of this is the age group someone falls into. We can put these groups in
order because we can put ages in order.
Example 2.1 (Variable Types) The variable types for the data frame nutri are
given in Table 2.2.
Table 2.2: The variable types for the data frame nutri
Nominal gender, situation, fat
Ordinal meat, fish, raw_fruit, cooked_fruit_veg, chocol
Discrete quantitative tea, coffee
Continuous quantitative height, weight, age
Initially, all variables in nutri are identified as quantitative, because they happened to be entered as numbers². You can check the type (or structure) of the variables with the function str.
> str(nutri)
'data.frame': 226 obs. of 13 variables:
$ gender : int 2 2 2 2 2 2 2 2 2 2 ...
$ situation : int 1 1 1 1 1 1 1 1 1 1 ...
$ tea : int 0 1 0 0 2 2 2 0 0 0 ...
$ coffee : int 0 1 4 0 1 0 0 2 3 2 ...
$ height : int 151 162 162 154 154 159 160 163 154
160 ...
$ weight : int 58 60 75 45 50 66 66 66 60 77 ...
$ age : int 72 68 78 91 65 82 74 73 89 87 ...
$ meat : int 4 5 3 0 5 4 3 4 4 2 ...
$ fish : int 3 2 1 4 3 2 3 2 3 3 ...
$ raw_fruit : int 1 5 5 4 5 5 5 5 5 5 ...
$ cooked_fruit_veg: int 4 5 2 0 5 5 5 5 5 4 ...
$ chocol : int 5 1 5 3 3 1 5 1 5 0 ...
$ fat : int 6 4 4 2 2 3 6 6 6 3 ...
We shall now set up a modified R structure for each variable that better reflects its type and meaning.
²If gender had been entered as M and F, the variable would have automatically been structured as a factor. In the same way, the entries for the other two factor variables, situation and fat, could have been entered as letters or words.
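The restructuring assignments take the following form — a sketch only, showing three of the thirteen variables; the remaining ones are handled analogously, with level labels taken from Table 2.1:
> nutri$gender = factor(nutri$gender, levels = c(1, 2),
                        labels = c("Male", "Female"))
> nutri$height = as.numeric(nutri$height)
> nutri$meat = factor(nutri$meat, levels = 0:5, ordered = TRUE)  # labels omitted in this sketch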
Note that nutri$tea and nutri$coffee were initially classified as integer types
anyway, so that the above assignments are superfluous.
> str(nutri)
'data.frame': 226 obs. of 13 variables:
$ gender : Factor w/ 2 levels "Male","Female": 2 2 2
2 2 2 2 2 2 2 ...
$ situation : Factor w/ 4 levels "single","couple",..: 1
1 1 1 1 1 1 1 1 1 ...
$ tea : int 0 1 0 0 2 2 2 0 0 0 ...
$ coffee : int 0 1 4 0 1 0 0 2 3 2 ...
$ height : num 151 162 162 154 154 159 160 163
154 160 ...
$ weight : num 58 60 75 45 50 66 66 66 60 77 ...
$ age : num 72 68 78 91 65 82 74 73 89 87 ...
$ meat : Ord.factor w/ 6 levels "never"<"< 1/week."
<..: 5 6 4 1 6 5 4 5 5 3 ...
$ fish : Ord.factor w/ 6 levels "never"<"< 1/week."
<..: 4 3 2 5 4 3 4 3 4 4 ...
$ raw_fruit : Ord.factor w/ 6 levels "never"<"< 1/week."
<..: 2 6 6 5 6 6 6 6 6 6 ...
$ cooked_fruit_veg: Ord.factor w/ 6 levels "never"<"< 1/week."
<..: 5 6 3 1 6 6 6 6 6 5 ...
$ chocol : Ord.factor w/ 6 levels "never"<"< 1/week."
<..: 6 2 6 4 4 2 6 2 6 1 ...
$ fat : Factor w/ 8 levels "butter","margarine",..
: 6 4 4 2 2 3 6 6 6 3 ...
We can access the variables (columns) of a data frame via the $ construction.
> nutri$gender[1:3] #first three elements of gender
[1] Female Female Female
Levels: Male Female
> class(nutri$gender)
[1] "factor"
You can save your data in another CSV file via the write.csv function, as in:
> write.csv(nutri,"nutri_restructured.csv")
In the remaining sections of this chapter we discuss various ways to extract sum-
mary information from a data frame. Which type of plots and numerical summaries
can be performed depends strongly on the structure of the data frame, and on the type
of the variable(s) in play.
2.3 Summary Tables
For a factor such as fat, the function table produces a frequency table of the counts for each level:
> table(nutri["fat"])
fat
   butter margarine    peanut sunflower     olive     Isio4
       15        27        48        68        40        23
 rapeseed      duck
        1         4
It is also possible to use table to cross tabulate between two or more variables:
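For example, a two-way table of gender against family situation can be obtained as follows (a sketch; any two factor columns of nutri can be cross tabulated in the same way):
> table(nutri$gender, nutri$situation)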
Numerical summaries cannot be computed when some data are missing (NA). If necessary, missing data can be omitted with the function na.omit.
> x = na.omit(nutri$height) # Useless in this case
# since height has no NA.
The mean of the data x1, . . . , xn is denoted by x̄ and is simply the average of the data values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
We will often refer to x̄ as the sample mean, rather than “the mean of the data”. Using
the mean function in R for our nutri data, we have, for example:
> mean(nutri$height)
[1] 163.9602
• if n is even, then any value between the values at positions n/2 and n/2 + 1 can be used as a median of the series. In practice, the median is usually taken to be the average of these two values.
> median(nutri$height)
[1] 163
The p-quantile (0 < p < 1) of the data x1 , . . . , xn is a value y such that a fraction
p of the data is less than or equal to y and a fraction 1 − p of the data is greater
than or equal to y. For example, the sample 0.5-quantile corresponds to the sample
median. The p-quantile is also called the 100 × p percentile. The 25, 50, and 75
sample percentiles are sometimes called the first, second, and third quartiles. Using R
we have, for example,
> quantile(nutri$height,probs=c(0.1,0.9))
10% 90%
153 176
While the sample mean and median say something about the location of the data, they do not provide information about the dispersion (spread) of the data. The following summary statistics are useful for this purpose.
The range of the data x1, . . . , xn is given by the difference between the largest and the smallest data value.
In R, the function range returns the minimum and maximum of the data, so to get the
actual range we have to take the difference of the two.
> range(nutri$height)
140 188
Typically, when the sample size increases, the range becomes wider, and so it is difficult to compare the spreads of two data sets via their ranges when the sample sizes are different. A more robust measure for the spread of the data is the interquar-
tile range (IQR), which is the difference between the third and first quartile.
> IQR(nutri$height)
[1] 13
The sample variance of the data x1, . . . , xn is defined as
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 ,$$
where x̄ is the sample mean. We will see in later chapters that it plays an essential role in the analysis of statistical data. The square root of the sample variance, $s = \sqrt{s^2}$, is called the sample standard deviation. In R, we have, as an example,
> var(nutri$height)
81.06063
> sd(nutri$height)
9.003368
The function barplot is part of the base (i.e., default) plotting library. In addition
to the base graphics package, R has many other packages for plotting. We will be
using the lattice package frequently.
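As a minimal illustration (not the original figure code), a bar chart of the fat variable can be produced from its frequency table with the base function barplot:
> barplot(table(nutri$fat))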
In RStudio you can save a plot as an image or PDF file via the graphical inter-
face.
Boxplot
A boxplot (or, more generally, a box-and-whiskers plot) can be viewed as a graphical
representation of a five-number summary of the data consisting of the minimum, max-
imum, and the first, second, and third quartiles. Figure 2.2 gives a boxplot of the age
variable of the nutri data. It was made with the bwplot function from the lattice
package.
> library(lattice)
> bwplot(~age, data=nutri)
Figure 2.2: Boxplot of the age variable of the nutri data.
The box is drawn from the first quartile (Q1) to the third quartile (Q3). The solid dot inside the box signifies the location of the second quartile, i.e., the median. So-called “whiskers” extend to either side of the box. The width of the box is the interquartile range: IQR = Q3 − Q1. The left whisker extends to the larger of (a) the minimum of the data and (b) Q1 − 1.5 IQR. Similarly, the right whisker extends to the smaller of (a) the maximum of the data and (b) Q3 + 1.5 IQR. Any data point outside the whiskers is indicated by a small open dot, marking a suspicious or deviant point (outlier). Note that a boxplot may also be used for discrete quantitative variables.
Histogram
A histogram is a main graphical representation of the distribution of a quantitative
variable. We start by breaking the range of the values into a number of bins or classes.
We tally the counts of the values falling in each bin and then make the plot by drawing
rectangles whose bases are the bin intervals and whose heights are the counts. In R we
can use the standard graphics function hist or, from the package lattice, we can
use histogram. For example, Figure 2.3 shows a histogram of the 226 ages in data
nutri.
> histogram(~age, data=nutri)
Figure 2.3: Histogram of the age variable of the nutri data (vertical axis: percent of total).
Here 9 bins were used. Rather than using raw counts, the vertical axis here gives the percentage in each class, defined by (count/total) × 100%. Histograms can also be used for discrete variables, although it may be necessary to explicitly specify the bins and placement of the ticks. The number of bins can be changed via the parameter nint.
Density Plot
Instead of a histogram, we could use a density plot to visualize the distribution of a
continuous variable. For example, using the function densityplot from lattice:
> densityplot(~weight,lwd=2,data=nutri)
Figure 2.4: Density plot of the weight variable of the nutri data.
This plot indicates that perhaps there is a bimodal distribution of the weights, caused by the two different genders. This is corroborated by a density plot of the weights by gender.
> densityplot(~weight,groups=gender,lwd=2,data=nutri)
Figure 2.5: Density plots of weight, grouped by gender.
The smoothness of density plots made with densityplot can be tuned with the
“bandwidth” parameter bw.
The empirical cumulative distribution function of the data x1, . . . , xn is the step function
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{\{x_i \le x\}} ,$$
where I{xi ≤ x} is equal to 1 when xi ≤ x and 0 otherwise. To produce the plot of the empirical cumulative distribution function using R, we can combine the functions plot and ecdf. The result for the age data is shown in Figure 2.6. The empirical distribution function for a discrete quantitative variable is obtained in the same way.
> plot(ecdf(nutri$age),xlab="age")
Figure 2.6: Plot of the empirical cdf for a continuous quantitative variable.
The “inverse” of the empirical cdf is obtained by swapping the x and y coordinates
in the plot above. This gives a plot of the p-quantile of the data against p ∈ [0, 1].
We can view the variable p as the theoretical p-quantile of the uniform distribution on
[0,1]; see also Section 4.4. Plots that compare quantiles against quantiles, whether theoretical or sample quantiles, are called quantile-quantile plots or qq-plots for short.
In the following code we use the function qqmath from the lattice package.
> qqmath(~age,data=nutri, type = c("l","p"),distribution=qunif)
Figure 2.7: Quantile plot of age against the quantiles of the uniform distribution (qunif).
In Chapters 8–10 we will be using qq-plots to assess if data could be coming from
a prescribed distribution, such as the normal distribution.
Figure 2.9: Scatterplot of nutri$weight against nutri$height, made with base graphics.
In the second command we see the formula weight ∼ height being used in a “base
graphics” setting. It is often desired to add a smooth “trend” curve or a straight line
through the scatterplot. This is easy to accomplish via the function xyplot from the
lattice package, by specifying type parameter(s) to be plotted. Choices are "p"
(points), "smooth" (smooth curve), or "r" (straight line).
> xyplot(weight~height,type=c("p","smooth"),
col.line="darkorange",lwd=3, data=nutri)
Figure 2.10: Scatterplot of weight against height, with a smoothed loess curve.
Figure 2.11: Birth weight against age for smoking and non-smoking mothers.
Figure 2.12: Plot of coffee consumption against gender.
Figure 2.13: Growth curves for five orange trees from the R data set Orange.
Using the lattice package, we can go even further and display a plot with 4 variables, using the groups argument. Figure 2.14 shows a plot for the electroconductivity of soil, for three different water contents (0%, 5%, and 15%), three levels of salinity, and three different soil types (clay, loam, and sand). It was made with the following code:
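A sketch of the kind of call involved is given below; the data frame and variable names (soils, EC, salinity, water, soil) are placeholders, not the ones used in the original code.
> xyplot(EC ~ salinity | water, groups = soil, data = soils,
         type = "b", auto.key = list(space = "top", columns = 3))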
The auto.key argument provides a list of further options for changing the figure.
In this case we move the legend to a more suitable place.
Figure 2.14: Electroconductivity as a function of salinity, water content (%) and soil
type.
CHAPTER 3
UNDERSTANDING RANDOMNESS
3.1 Introduction
Statistical data is inherently random: if we were to repeat the process of collecting the data, we would most likely obtain different measurements. Various reasons why there is variability in the data were already discussed in Section 1.2.
To better understand the role that randomness plays in statistical analyses, we need
to know a few things about the theory of probability first.
5. conducting a survey on the nutrition of the elderly, resulting in a data frame such as nutri discussed in Chapter 2.
Example 3.1 (Coin Tossing) One of the most fundamental random experiments is
the one where a coin is tossed a number of times. Indeed, much of probability theory can be based on this simple experiment. In Section 1.1 we viewed this experiment
from a statistical point of view (is the coin fair?). As we have already seen, to better
understand how this coin toss experiment behaves, we can carry it out on a computer.
The following R program simulates a sequence of 100 tosses with a fair coin (that is,
Heads and Tails are equally likely), and plots the results in a bar chart.
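A minimal sketch of such a program (the exact plotting command in the original may differ):
> x = (runif(100) < 0.5)        # TRUE = Heads, FALSE = Tails
> barplot(as.numeric(x))        # dark bars where Heads (= 1) appears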
This is what the first line of code does: the function runif is used to draw a vector of
100 uniform random numbers from the interval [0, 1]. By testing whether the uniform
numbers are less than 0.5, we obtain a vector x of logical (TRUE or FALSE) variables,
indicating Heads and Tails, say. Typical outcomes for three such experiments were
given in Figure 1.2.
We can also plot the average number of Heads against the number of tosses. This
is accomplished by adding two lines of code:
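A sketch of what these two extra lines could look like, using the vector x from above:
> avg = cumsum(x) / (1:100)     # running proportion of Heads
> plot(avg, type = "l", xlab = "Number of tosses", ylab = "Average number of Heads")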
The result of three such experiments is depicted in Figure 3.1. Notice that the aver-
age number of Heads seems to converge to 0.5, but there is a lot of random fluctuation.
Figure 3.1: The average number of Heads against the number of tosses, for three experiments.
Similar results can be obtained for the case where the coin is biased, with a proba-
bility of Heads of p, say. Here are some typical probability questions.
• What is the probability of waiting more than 4 tosses before the first Head comes
up?
A statistical analysis would start from observed data of the experiment — for example,
all the outcomes of 100 tosses are known. Suppose the probability of Heads p is not
known. Typical statistics questions are:
To answer these types of questions, we need to have a closer look at the models that
are used to describe random experiments.
A = [80, 140) .
3. The event that the third selected person in the group of 10 is taller than 2 metres:
Since events are sets, we can apply the usual set operations to them, as illustrated
in the Venn diagrams in Figure 3.2.
Figure 3.2: Venn diagrams of the set operations A ∩ B, A ∪ B, Aᶜ, and B ⊂ A. Each square shows the sample space Ω.
Two events A and B which have no outcomes in common, that is, A ∩ B = ∅ (empty
set), are called disjoint events.
Example 3.2 (Casting Two Dice) Suppose we cast two dice consecutively. The
sample space is Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)}. Let A = {(6, 1), . . . , (6, 6)}
be the event that the first die is 6, and let B = {(1, 6), . . . , (6, 6)} be the event that the
second die is 6. Then A ∩ B = {(6, 1), . . . , (6, 6)} ∩ {(1, 6), . . . , (6, 6)} = {(6, 6)} is the
event that both dice are 6.
The third ingredient in the model for a random experiment is the specification of
the probability of the events. It tells us how likely it is that a particular event will occur.
We denote the probability of an event A by P(A) — note the special “blackboard bold” font. No matter how we define P(A) for different events A, the probability must always satisfy three conditions, given in the following definition.
1. 0 ≤ P(A) ≤ 1.
2. P(Ω) = 1.
3. For any sequence of disjoint events A1, A2, . . . ,
$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots . \tag{3.1}$$
The crucial property (3.1) is called the sum rule of probability. It simply states that
if an event can happen in several distinct ways, then the probability that at least one of
these events happens (that is, the probability of the union) is equal to the sum of the
probabilities of the individual events. We see a similar property in an area measure:
the total area of the union of nonoverlapping regions is simply the sum of the areas of
the individual regions.
The following theorem lists some important consequences of the definition above.
Make sure you understand the meaning of each of them, and try to prove them yourself,
using only the three rules above.
1. P(∅) = 0 ,
3. P(Ac ) = 1 − P(A) ,
We have now completed our general model for a random experiment. Of course
for any specific model we must carefully specify the sample space Ω and probability P
that best describe the random experiment.
An important case where P is easily specified is where the sample space has a finite
number of outcomes that are all equally likely. The probability of an event A ⊂ Ω is in
this case simply
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{Number of elements in } A}{\text{Number of elements in } \Omega} . \tag{3.2}$$
The calculation of such probabilities thus reduces to counting.
3.4 Counting
Counting is not always easy. Let us first look at some examples:
1. A multiple choice form has 20 questions; each question has 3 choices. In how
many possible ways can the exam be completed?
2. Consider a horse race with 8 horses. How many ways are there to gamble on the placings (1st, 2nd, 3rd)?
3. Jessica has a collection of 20 CDs; she wants to take 3 of them to work. How many possibilities does she have?
To be able to comfortably solve a multitude of counting problems requires a lot of
experience and practice, and even then, some counting problems remain exceedingly
hard. Fortunately, many counting problems can be cast into the simple framework of
drawing balls from an urn, see Figure 3.3.
Consider an urn with n different balls, numbered 1, . . . , n from which k balls are
drawn. This can be done in a number of different ways. First, the balls can be drawn
one-by-one, or one could draw all the k balls at the same time. In the first case the
order in which the balls are drawn can be noted, in the second case that is not possible.
In the latter case we can (and will) still assume the balls are drawn one-by-one, but that
the order is not noted. Second, once a ball is drawn, it can either be put back into the
urn (after the number is recorded), or left out. This is called, respectively, drawing with
and without replacement. All in all there are 4 possible experiments: (ordered, with replacement), (ordered, without replacement), (unordered, without replacement), and (unordered, with replacement). The art is to recognize a seemingly unrelated counting
problem as one of these four urn problems. For the three examples above we have the
following
Figure 3.4: Enumerating the number of ways in which three ordered positions can be
filled with 4 possible numbers, where repetition is allowed.
Figure 3.5: Enumerating the number of ways in which three ordered positions can be
filled with 4 possible numbers, where repetition is NOT allowed.
Note the two different notations for this number. Summarising, we have the following
table:
Table 3.1: Number of ways k balls can be drawn from an urn containing n balls.

                    Replacement
  Order          Yes          No
  Yes            n^k          nPk
  No             —            nCk
Returning to our original three problems, we can now solve them easily:
1. The total number of ways the exam can be completed is 3^20 = 3,486,784,401.
2. The number of ways the three placings can be filled is 8P3 = 8 × 7 × 6 = 336.
3. The number of ways Jessica can choose 3 of her 20 CDs is 20C3 = 1140.
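These counts can be verified directly in R with the functions choose and factorial:
> 3^20
[1] 3486784401
> factorial(8) / factorial(5)      # ordered: 8 * 7 * 6
[1] 336
> choose(20, 3)                    # unordered: 20 choose 3
[1] 1140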
Once we know how to count, we can apply the equilikely principle to calculate
probabilities:
1. What is the probability that out of a group of 40 people all have different birth-
days?
Answer: Choosing the birthdays is like choosing 40 balls with replacement from an urn containing the balls 1, . . . , 365. Thus, our sample space Ω consists of vectors of length 40, whose components are chosen from {1, . . . , 365}. There are |Ω| = 365^40 such vectors possible, and all are equally likely. Let A be the event that all 40 people have different birthdays. Then, |A| = 365P40 = 365!/325!. It follows that P(A) = |A|/|Ω| ≈ 0.109, so not very big!
2. What is the probability that in 10 tosses with a fair coin we get exactly 5 Heads
and 5 Tails?
Answer: Here Ω consists of vectors of length 10 consisting of 1s (Heads) and
0s (Tails), so there are 210 of them, and all are equally likely. Let A be the
event of exactly 5 Heads. We must count how many binary vectors there are with exactly 5 1s. This is equivalent to determining in how many ways the positions of the 5 1s can be chosen out of the 10 positions, that is, $\binom{10}{5}$. Consequently, P(A) = $\binom{10}{5}/2^{10}$ = 252/1024 ≈ 0.25.
3. We draw at random 13 cards from a full deck of cards. What is the probability
that we draw 4 Hearts and 3 Diamonds?
Answer: Give the cards a number from 1 to 52. Suppose 1–13 is Hearts, 14–26 is Diamonds, etc. Ω consists of unordered sets of size 13, without repetition, e.g., {1, 2, . . . , 13}. There are |Ω| = $\binom{52}{13}$ of these sets, and they are all equally likely. Let A be the event of 4 Hearts and 3 Diamonds. To form A we have to choose 4 Hearts out of 13 and 3 Diamonds out of 13, followed by 6 cards out of the 26 Spades and Clubs. Thus, |A| = $\binom{13}{4} \times \binom{13}{3} \times \binom{26}{6}$, so that P(A) = |A|/|Ω| ≈ 0.074.
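Each of these probabilities can be checked numerically in R:
> prod((365:326) / 365)                                            # approx. 0.109
> choose(10, 5) / 2^10                                             # = 252/1024, approx. 0.246
> choose(13, 4) * choose(13, 3) * choose(26, 6) / choose(52, 13)   # approx. 0.074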
and the relative chance of A occurring is therefore P(A ∩ B)/P(B), which is called the
conditional probability of A given B. The situation is illustrated in Figure 3.6.
Figure 3.6: What is the probability that A occurs (that is, the outcome lies in A) given
that the outcome is known to lie in B?
Example 3.3 (Casting Two Dice) We cast two fair dice consecutively. Given that
the sum of the dice is 10, what is the probability that one 6 is cast? Let B be the event
that the sum is 10:
B = {(4, 6), (5, 5), (6, 4)} .
Let A be the event that one 6 is cast:
A = {(1, 6), . . . , (5, 6), (6, 1), . . . , (6, 5)} .
Then, A ∩ B = {(4, 6), (6, 4)}. And, since for this experiment all elementary events are
equally likely, we have
$$P(A \mid B) = \frac{2/36}{3/36} = \frac{2}{3} .$$
Independent Events
When the occurrence of B does not give extra information about A, that is P(A | B) = P(A), the events A and B are said to be independent. A slightly more general definition (which includes the case P(B) = 0) is: A and B are independent if
$$P(A \cap B) = P(A)\, P(B) .$$
Example 3.4 (Casting Two Dice (Continued)) We cast two fair dice consecutively. Suppose A is the event that the first toss is a 6 and B is the event that the second one is a 6. Then A and B are naturally independent: knowing that the first die is a 6 does not give any information about what the result of the second die will be. Let's check this formally. We have A = {(6, 1), (6, 2), . . . , (6, 6)} and B = {(1, 6), (2, 6), . . . , (6, 6)}, so that A ∩ B = {(6, 6)}, and
$$P(A \mid B) = \frac{1/36}{6/36} = \frac{1}{6} = P(A) .$$
Product Rule
By the definition of conditional probability (3.3) we have
$$P(A \cap B) = P(B)\, P(A \mid B) ,$$
and, more generally, for any events A1, . . . , An,
$$P(A_1 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}) . \tag{3.5}$$
Example 3.5 (Urn Problem) We draw consecutively 3 balls from an urn with 5
white and 5 black balls, without putting them back. What is the probability that all
drawn balls will be black?
Let Ai be the event that the i-th ball is black. We wish to find the probability of
A1 A2 A3 , which by the product rule (3.5) is
$$P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 A_2) = \frac{5}{10} \cdot \frac{4}{9} \cdot \frac{3}{8} \approx 0.083 .$$
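As a quick check, this probability can also be approximated by simulation in R (a sketch; the estimate will be close to 1/12 ≈ 0.083):
> urn = rep(c("black", "white"), each = 5)
> mean(replicate(100000, all(sample(urn, 3) == "black")))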
We give some more examples of random variables without specifying the sample
space:
• Continuous random variables can take a continuous range of values; for exam-
ple, any value on the positive real line R+ .
We have used P(X ≤ x) as a shorthand notation for P({X ≤ x}). From now on we will use this type of abbreviation throughout the notes. In Figure 3.7 the graph of a general cdf is depicted. Note that any cdf is increasing (if x ≤ y then F(x) ≤ F(y)) and lies between 0 and 1. We can use any function F with these properties to specify the distribution of a random variable X.
If X has cdf F, then the probability that X takes a value in the interval (a, b] (excluding a, including b) is given by
$$P(a < X \le b) = F(b) - F(a) .$$
To see this, note that P(X ≤ b) = P({X ≤ a} ∪ {a < X ≤ b}), where the events {X ≤ a} and {a < X ≤ b} are disjoint. Thus, by the sum rule: F(b) = F(a) + P(a < X ≤ b), which leads to the result above.
We sometimes write fX instead of f to stress that the pmf refers to the discrete random variable X. The easiest way to specify the distribution of a discrete random variable is to specify its pmf. Indeed, by the sum rule, if we know f(x) for all x, then we can calculate the probability of any event {X ∈ B} by summing the pmf over B:
$$P(X \in B) = \sum_{x \in B} f(x) . \tag{3.6}$$
Example 3.6 (Sum of Two Dice) Toss two fair dice and let X be the sum of their face values. The pmf is given in Table 3.2.

Table 3.2: The pmf of the sum X of the face values of two fair dice.
  x      2     3     4     5     6     7     8     9    10    11    12
  f(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
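This pmf is easy to tabulate in R by listing all 36 equally likely outcomes:
> table(outer(1:6, 1:6, "+")) / 36    # probabilities 1/36, 2/36, ..., 6/36, ..., 1/36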
Note that we use the same notation f for both the pmf and pdf, to stress the
similarities between the discrete and continuous case. Henceforth we will use
the notation X ∼ f and X ∼ F to indicate that X is distributed according to pdf
f or cdf F.
In analogy to the discrete case (3.6), once we know the pdf, we can calculate any
probability that X lies in some set B by means of integration:
$$P(X \in B) = \int_B f(x)\, dx . \tag{3.8}$$
Suppose that f and F are the pdf and cdf of a continuous random variable X, as in
Definition 3.6. Then F is simply a primitive (also called anti-derivative) of f :
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\, du .$$
It is important to understand that in the continuous case f (x) is not equal to the proba-
bility P(X = x), because the latter is 0 for all x. Instead, we interpret f (x) as the density
of the probability distribution at x, in the sense that for any small h,
$$P(x \le X \le x + h) = \int_{x}^{x+h} f(u)\, du \approx h f(x) . \tag{3.9}$$
Example 3.7 (Random Point in an Interval) Draw a random number X from the
interval of real numbers [0, 2], where each number is equally likely to be drawn. What
are the pdf f and cdf F of X? We have
$$F(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 0, \\ x/2 & \text{if } 0 \le x \le 2, \\ 1 & \text{if } x > 2. \end{cases}$$
By differentiating F we find
$$f(x) = \begin{cases} 1/2 & \text{if } 0 \le x \le 2, \\ 0 & \text{otherwise.} \end{cases}$$
Note that this density is constant on the interval [0, 2] (and zero elsewhere), reflecting
the fact that each point in [0, 2] is equally likely to be drawn.
3.7 Expectation
Although all probability information about a random variable is contained in its cdf
or pmf/pdf, it is often useful to consider various numerical characteristics of a ran-
dom variable. One such number is the expectation of a random variable, which is a
“weighted average” of the values that X can take. Here is a more precise definition for the discrete case: the expectation of a discrete random variable X with pmf f is the number
$$E(X) = \sum_x x\, f(x) . \tag{3.10}$$
Example 3.8 (Expected Payout) Suppose in a game of dice the payout X (dollars)
is the largest of the face values of two dice. To play the game a fee of d dollars must
be paid. What would be a fair amount for d? If the game is played many times, the
long-run fraction of tosses in which the maximum face value takes the value 1, 2,. . . , 6,
is P(X = 1), P(X = 2), . . . , P(X = 6), respectively. Hence, the long-run average payout
of the game is the weighted sum of 1, 2, . . . , 6, where the weights are the long-run
fractions (probabilities). Since the maximum of two dice equals k with probability P(X = k) = (2k − 1)/36, the long-run average payout is
$$E(X) = 1 \times \tfrac{1}{36} + 2 \times \tfrac{3}{36} + 3 \times \tfrac{5}{36} + 4 \times \tfrac{7}{36} + 5 \times \tfrac{9}{36} + 6 \times \tfrac{11}{36} = \tfrac{161}{36} \approx 4.47 ,$$
so a fair fee would be about d = 4.47 dollars.
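A quick numerical check of this value in R, using P(X = k) = (2k − 1)/36:
> k = 1:6
> sum(k * (2*k - 1) / 36)
[1] 4.472222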
For a symmetric pmf/pdf the expectation (if finite) is equal to the symmetry
point.
For continuous random variables we can define the expectation in a similar way,
replacing the sum with an integral.
Example 3.9 (Die Experiment and Expectation) Find E(X²) if X is the outcome
of the toss of a fair die. We have

E(X²) = 1² × 1/6 + 2² × 1/6 + 3² × 1/6 + · · · + 6² × 1/6 = 91/6 .
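As a quick check in R (not part of the original example):

sum((1:6)^2 * 1/6)    # = 91/6 = 15.16667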
1. E(a X + b) = a E(X) + b ,
Proof: We show it for the discrete case. The continuous case is proven analogously,
simply by replacing sums with integrals. Suppose X has pmf f . The first statement
follows from
E(aX + b) = Σ_x (ax + b) f(x) = a Σ_x x f(x) + b Σ_x f(x) = a E(X) + b .
where µ = E(X). The square root of the variance is called the standard deviation. The number E(X^r) is called the r-th moment of X.
Theorem 3.8 (Properties of the Variance). For any random variable X the following properties hold for the variance.
1. Var(X) = E(X²) − (E(X))².
2. Var(a + bX) = b² Var(X).
Proof: To see this, write E(X) = µ, so that Var(X) = E[(X − µ)²] = E(X² − 2µX + µ²).
By the linearity of the expectation, the last expectation is equal to the sum E(X²) −
2µE(X) + µ² = E(X²) − µ², which proves the first statement. To prove the second
statement, note that the expectation of a + bX is equal to a + bµ. Consequently,

Var(a + bX) = E[(a + bX − (a + bµ))²] = E[b²(X − µ)²] = b² Var(X) .
CHAPTER 4
COMMON PROBABILITY DISTRIBUTIONS
This chapter presents four probability distributions that are the most frequently
used in the study of statistics: the Bernoulli, Binomial, Uniform, and Normal dis-
tributions. We give various properties of these distributions and show how to com-
pute probabilities of interest for them. You will also learn how to simulate random
data from these distributions.
4.1 Introduction
In the previous chapter, we have seen that a random variable that takes values in a
continuous set (such as an interval) is said to be continuous and a random variable that
can have only a finite or countable number of different values is said to be discrete;
see Section 3.6. Recall that the distribution of a continuous variable is specified by its
probability density function (pdf), and the distribution of a discrete random variable by
its probability mass function (pmf).
In the following, we first present two distributions for discrete variables: they are
the Bernoulli and Binomial distributions. Then, we describe two key distributions for
continuous variables: the Uniform and Normal distributions. All of these distributions
are actually families of distributions, which depend on a few (one or two in this case)
parameters — fixed values that determine the shape of the distribution. Although in
statistics we only employ a relatively small collection of distribution families (binomial, normal, etc.), we can generate an infinite number of distributions through the
choice of their parameters.
We write X ∼ Ber(p).
Figure 4.1: Probability mass function for the Bernoulli distribution, with parameter p
(the case p = 0.6 is shown)
The expectation and variance of X ∼ Ber(p) are easy to determine. We leave the
proof as an exercise, as it is instructive to do it yourself, using the definitions of the
expectation and variance; see (3.10) and (3.12).
1. E(X) = p
2. Var(X) = p(1 − p)
More generally, when we toss a coin n times and the probability of Heads is p (not
necessarily 1/2), the outcomes are no longer equally likely (for example, when p is
close to 1 the sequence coin flips 1, 1, . . . , 1 is more likely to occur than 0, 0, . . . , 0). We
can use the product rule (3.5) to find that the probability of having a particular sequence
with x heads and n − x tails is p^x (1 − p)^(n−x). Since the number of such sequences is
the binomial coefficient n!/(x!(n − x)!), we see that X has a Bin(n, p) distribution, as
given in the following definition.
The following theorem lists the expectation and variance for the Bin(n, p) distribution. A simple proof will be given in the next chapter; see Example 5.4. In any case,
the expression for the expectation should come as no surprise, as we would expect np
successes in a sequence of Bernoulli experiments (coin flips) with success probability
p. Note that both the expectation and variance are n times the expectation and variance
of a Ber(p) random variable. This is no coincidence, as a Binomial random variable
can be seen as the sum of n independent Bernoulli random variables.
1. E(X) = np
2. Var(X) = np(1 − p)
Counting the number of successes in a series of coin flip experiments might seem
a bit artificial, but it is important to realize that many practical statistical situations
can be treated exactly as a sequence of coin flips. For example, suppose we wish to
conduct a survey of a large population to see what the proportion p is of males, where
p is unknown. We can only know p if we survey everyone in the population, but
suppose we do not have the resources or time to do this. Instead we select at random
n people from the population and note their gender. We assume that each person is
chosen with equal probability. This is very much like a coin flipping experiment. In
fact, if we allow the same person to be selected more than once, then the two situations
are exactly the same. Consequently, if X is the total number of males in the group of
n selected persons, then X ∼ Bin(n, p). You might, rightly, argue that in practice we
would not select the same person twice. But for a large population this would rarely
happen, so the Binomial model is still a good model. For a small population, however,
we should use a (more complicated) urn model to describe the experiment, where we
draw balls (select people) without replacement and without noting the order. Counting
for such experiments was discussed in Section 3.4.
A random variable X ∼ U[a, b] can model a randomly chosen point from the inter-
val [a, b], where each choice is equally likely. A graph of the density function is given
in Figure 4.3. Note that the total area under the pdf is 1.
1. E(X) = (a + b)/2
2. Var(X) = (b − a)²/12
Proof: The expectation is finite (since it must lie between a and b) and the pdf is
symmetric. It follows that the expectation is equal to the symmetry point (a + b)/2.
To find the variance, it is useful to write X = a + (b − a)U where U ∼ U[0, 1]. In
words: randomly choosing a point between a and b is equivalent to first randomly
choosing a point in [0, 1], multiplying this by (b − a), and adding a. We can now write
Var(X) = Var(a + (b − a)U), which is the same as (b − a)2 Var(U), using the second
property for the variance in Theorem 3.7. So, it suffices to show that Var(U) = 1/12.
Writing Var(U) = E(U²) − (E(U))² = E(U²) − 1/4, it remains to show that E(U²) = 1/3.
This follows by direct integration:
E(U²) = ∫_0^1 u² du = [u³/3]_0^1 = 1/3 .
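This value is also easy to check by simulation (an illustrative sketch, using the runif function discussed later in this chapter):

u = runif(1e6)
var(u)    # close to 1/12 = 0.0833...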
We write X ∼ N(µ, σ2 ).
The parameters µ and σ2 turn out to be the expectation and variance of the distri-
bution, respectively. If µ = 0 and σ = 1 then the distribution is known as the standard
normal distribution. Its pdf is often denoted by ϕ (phi), so
ϕ(x) = (1/√(2π)) e^(−x²/2) ,   x ∈ R.
The corresponding cdf is denoted by Φ (capital phi). In Figure 4.4 the density function
of the N(µ, σ2 ) distribution for various µ and σ2 is plotted.
[Figure 4.4 shows the density curves for (µ = 0, σ² = 0.5), (µ = 0, σ² = 1), and (µ = 2, σ² = 1).]
You may verify yourself, by applying the definitions of expectation and variance,
that indeed the following theorem holds:
1. E(X) = µ
2. Var(X) = σ2
The normal distribution is symmetric about the expectation µ and the dispersion is
controlled by the variance parameter σ2 , or the standard deviation σ (see Figure 4.4).
An important property of the normal distribution is that any normal random variable
can be thought of as a simple transformation of a standard normal random variable.
Proof: Suppose Z is standard normal. So, P(Z ≤ z) = Φ(z) for all z. Let X = µ + σZ.
We wish to derive the pdf f of X and show that it is of the form (4.3). We first derive
the cdf F:

F(x) = P(X ≤ x) = P(µ + σZ ≤ x) = P(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ) .

By taking the derivative f(x) = F′(x) we find (apply the chain rule of differentiation):

f(x) = F′(x) = (1/σ) Φ′((x − µ)/σ) = ϕ((x − µ)/σ)/σ ,

which is the pdf of a N(µ, σ²)-distributed random variable (replace x with (x − µ)/σ in
the formula for ϕ and divide by σ; this gives precisely (4.3)).
By using the standardization (4.4) we can simplify calculations involving arbitrary
normal random variables to calculations involving only standard normal random vari-
ables.
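For instance (an illustrative check; the specific numbers are not from the text), for X ∼ N(1, 4) the probability P(X ≤ 3) can be computed either directly or after standardizing:

pnorm(3, mean = 1, sd = 2)   # P(X <= 3) computed directly
pnorm((3 - 1)/2)             # same value via Z = (X - mu)/sigma: 0.8413447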
Since Φ(3) is larger than Φ(2), the tail probability 1 − Φ(3) is smaller than 1 − Φ(2): finding a male taller than 2 m is more unusual than finding a female taller than 180 cm.
In the days before the computer it was customary to provide tables of Φ(x) for
0 ≤ x ≤ 4, say. Nowadays we can simply use statistical software. For example, the
cdf Φ is encoded in R as the function pnorm. So to find 1 − Φ(2) and 1 − Φ(3) we can
type:
> 1 - pnorm(2)
[1] 0.02275013
> 1 - pnorm(3)
[1] 0.001349898
Unfortunately there is no simple formula for working out areas under the Normal
density curve. However, as a rough rule for X ∼ N(µ, σ²): about 68% of the probability
mass lies within one standard deviation of µ, about 95% within two standard deviations,
and about 99.7% within three standard deviations.
For example, to compute P(X ≤ 3) for X ∼ N(1, 4) we can type:
> pnorm(3,mean=1,sd=2)
[1] 0.8413447
Note that R uses the standard deviation as an argument, not the variance!
We can also go the other way around: let X ∼ N(1, 4). For what value z does it hold
that P(X ≤ z) = 0.9? Such a value z is called a quantile of the distribution — in this
case the 0.9-quantile. The concept is closely related to the sample quantile discussed
in Section 2.4, but the two are not the same. Figure 4.6 gives an illustration. For the
normal distribution the quantiles can be obtained via the R function qnorm.
> qnorm(0.975)
[1] 1.959964
> qnorm(0.90,mean=1,sd=2)
[1] 3.563103
> qnorm(0.5,mean=2,sd=1)
[1] 2
> runif(1)
[1] 0.6453129
> runif(1)
[1] 0.8124339
We can use a uniform random number to simulate a toss with a fair coin by returning
TRUE if the number is less than 0.5 and FALSE otherwise.
> runif(1) < 0.5
[1] TRUE
We can turn the logical values into 0s and 1s by using the function as.integer:
> as.integer(runif(20)<0.5)
[1] 1 1 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0
We can, in principle, draw from any probability distribution including the normal dis-
tribution, using only uniform random numbers. However, to draw from a normal dis-
tribution we will use R’s inbuilt rnorm function. For example, the following generates
5 outcomes from the standard normal distribution:
> rnorm(5)
In R, every function for generating random variables starts with an “r” (e.g.,
runif, rnorm). This also holds for discrete random variables:
> rbinom(1,size=10,p=0.5)
[1] 5
Generating artificial data can be a very useful way to understand probability distributions. For example, if we generate many realizations from a certain distribution, then
the histogram and empirical cdf of the data (see Section 2.5) will resemble closely the
true pdf/pmf and cdf of the distribution. Moreover, the summary statistics (see Section 2.4) of the simulated data, such as the sample mean and sample quantiles, will
resemble the true distributional properties such as the expected value and the quantiles.
Let us illustrate this by drawing 10,000 samples from the N(2, 1) distribution.
> x = rnorm(1e4,mean=2,sd=1)
> summary(x)
The true first and third quartiles are 1.32551 and 2.67449, respectively, which are
quite close to the sample quartiles. Similarly the true expectation and median are 2,
which is again close to the sample mean and sample median.
The following R script (program) was used to produce Figure 4.7. We see a very
close correspondence between the true pdf (on the left, in red) and a histogram of the
10,000 data points. The true cdf (on the right, in red) is virtually indistinguishable
from the empirical cdf.
# simnorm.R
par(mfrow=c(1,2),cex=1.5)  # two plot windows, use larger font
x = rnorm(1e4,mean=2,sd=1)  # generate data
hist(x,prob=TRUE,breaks=100)  # make histogram
curve(dnorm(x,mean=2,sd=1),col="red",ylab="",lwd=2,add=T)  # true pdf
plot(ecdf(x))  # draw the empirical cdf
curve(pnorm(x,mean=2,sd=1),col="red",lwd=1,add=TRUE)  # true cdf
Figure 4.7: Left: pdf of the N(2, 1) distribution (red) and histogram of the generated
data. Right: cdf of the N(2, 1) distribution (red) and empirical cdf of the generated data.
Density functions (pmf or pdf) always start in R with “d” (e.g., dnorm, dunif).
The cumulative distribution functions (cdf), which give a probability, al-
ways start in R with “p” (e.g., pnorm, punif). Quantiles start with “q” (e.g.,
qnorm,qunif).
To summarize, we present in Table 4.1 the main R functions for the evaluation of
densities, cumulative distribution functions, quantiles, and the generation of random
variables for the distributions described in this chapter. Later on we will encounter
more distributions such as the Student’s t distribution, the F distribution, and the chi-
squared distribution. You can use the “d”, “p”, “q” and “r” construction to evaluate
pmfs, cdfs, quantiles, and random numbers in exactly the same way!
78 Common Probability Distributions
Table 4.1: Standard discrete and continuous distributions. R functions for the mass
or density function (d), cumulative distribution function (p) and quantile function
(q). Instruction to generate (r) pseudo-random numbers from these distributions.
Distr.      R functions
Ber(p)      dbinom(x,size=1,prob=p), pbinom(x,size=1,prob=p),
            qbinom(γ,size=1,prob=p), rbinom(n,size=1,prob=p)
Bin(n, p)   dbinom(x,size=n,prob=p), pbinom(x,size=n,prob=p),
            qbinom(γ,size=n,prob=p), rbinom(n,size=n,prob=p)
N(µ, σ2)    dnorm(x,mean=µ,sd=σ), pnorm(x,mean=µ,sd=σ),
            qnorm(γ,mean=µ,sd=σ), rnorm(n,mean=µ,sd=σ)
U[a, b]     dunif(x,min=a,max=b), punif(x,min=a,max=b),
            qunif(γ,min=a,max=b), runif(n,min=a,max=b)
CHAPTER 5
MULTIPLE RANDOM VARIABLES
In this chapter you will learn how random experiments that involve more than
one random variable can be described via their joint cdf and joint pmf/pdf. When
the random variables are independent of each other, the joint density has a simple
product form. We will discuss the most basic statistical model for data — indepen-
dent and identically distributed (iid) draws from a common distribution. We will
show that the expectation and variance of sums of random variables obey simple
rules. We will also illustrate the central limit theorem, explaining the central role
that the normal distribution has in statistics. The chapter concludes with the con-
ceptual framework for statistical modeling and gives various examples of simple
models.
5.1 Introduction
In the previous chapters we considered random experiments that involved only a single
random variable, such as the number of heads in 100 tosses, the number of left-handers
in 50 people, or the amount of rain on the 2nd of January 2021 in Brisbane. This is
obviously a simplification: in practice most random experiments involve multiple ran-
dom variables. Here are some examples of experiments that we could do “tomorrow”.
2. We toss a coin repeatedly. Let Xi = 1 if the ith toss is Heads and Xi = 0 other-
wise. The experiment is thus described by the sequence X1 , X2 , . . . of Bernoulli
random variables.
3. We randomly select a person from a large population and measure his/her mass
X and height Y.
4. We simulate 10,000 realizations from the standard normal distribution using the
rnorm function. Let X1 , . . . , X10,000 be the corresponding random variables.
How can we specify the behavior of the random variables above? We should not
just specify the pdf of the individual random variables, but also say something about
the interaction (or lack thereof) between the random variables. For example, in the
third experiment above if the height Y is large, then most likely the mass X is large
as well. In contrast, in the first two experiments it is reasonable to assume that the
random variables are “independent” in some way; that is, information about one of
the random variables does not give extra information about the others. What we need
to specify is the joint distribution of the random variables. The theory below for
multiple random variables follows a similar path to that of a single random variable
described in Section 3.6.
Let X1 , . . . , Xn be random variables describing some random experiment. Recall
that the distribution of a single random variable X is completely specified by its cu-
mulative distribution function. For multiple random variables we have the following
generalization.
F(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn) .
— giving the last column of Table 5.1. Similarly, the distribution of Y is given by the
column totals in the last row of the table.
Table 5.1: The joint distribution of X (die number) and Y (face value).
              y
         1     2     3     4     5     6   |  P
x   1   1/18  1/18  1/18  1/18  1/18  1/18 | 1/3
    2   1/18  1/18  1/18  1/18  1/9    0   | 1/3
    3   1/18  1/18  1/18  1/18   0    1/9  | 1/3
    P   1/6   1/6   1/6   1/6   1/6   1/6  |  1
f (x1 , . . . , xn ) = P(X1 = x1 , . . . , Xn = xn ) .
We sometimes write fX1 ,...,Xn instead of f to show that this is the pmf of the random
variables X1 , . . . , Xn . To save on notation, we can refer to the sequence X1 , . . . , Xn
simply as a random “vector” X = (X1 , . . . , Xn ). If the joint pmf f is known, we can
calculate the probability of any event via summation as

P(X ∈ B) = Σ_{x∈B} f(x) .   (5.1)
That is, to find the probability that the random vector lies in some set B (of dimension
n), all we have to do is sum up all the probabilities f (x) over all x in the set B. This
is simply a consequence of the sum rule and a generalization of (3.6). In particular, as
illustrated in Example 5.1, we can find the pmf of Xi — often referred to as a marginal
pmf, to distinguish it from the joint pmf — by summing the joint pmf over all possible
values of the other variables. For example,
fX(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y fX,Y(x, y) .   (5.2)
The converse is not true: from the marginal distributions one cannot in general recon-
struct the joint distribution. For example, in Example 5.1 we cannot reconstruct the
inside of the two-dimensional table if only given the column and row totals.
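A small R sketch (relying on the reconstruction of Table 5.1 above; the code itself is not part of the original notes) showing that the marginal pmfs are simply the row and column sums of the joint pmf:

f = rbind(c(1, 1, 1, 1, 1, 1)/18,
          c(1, 1, 1, 1, 2, 0)/18,
          c(1, 1, 1, 1, 0, 2)/18)   # joint pmf of X (rows) and Y (columns)
rowSums(f)    # marginal pmf of X: 1/3, 1/3, 1/3
colSums(f)    # marginal pmf of Y: 1/6 for each value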
For the continuous case we need to replace the joint pmf with the joint pdf.
The integral in (5.3) is now a multiple integral — instead of evaluating the area
under f , we now need to evaluate the (n-dimensional) volume. Figure 5.1 illustrates
the concept for the 2-dimensional case.
Figure 5.1: Left: a two-dimensional joint pdf of random variables X and Y. Right: the
volume under the pdf corresponds to P(0 ≤ X ≤ 1, Y > 0).
This means that any information about what the outcome of X is does not provide any
extra information about Y. For the pmfs this means that the joint pmf f (x, y) is equal
to the product of the marginal ones fX (x) fY (y). We can take this as the definition for
independence, also for the continuous case, and when more than two random variables
are involved.
Example 5.3 (Bivariate Standard Normal Distribution) Suppose X and Y are in-
dependent and both have a standard normal distribution. We say that (X, Y) has a
bivariate standard normal distribution. What is the joint pdf? We have
f(x, y) = fX(x) fY(y) = (1/√(2π)) e^(−x²/2) × (1/√(2π)) e^(−y²/2) = (1/(2π)) e^(−(x²+y²)/2) .
The graph of this joint pdf is the hat-shaped surface given in the left pane of Figure 5.1.
We can also simulate independent copies X1 , . . . , Xn ∼iid N(0, 1) and Y1 , . . . , Yn ∼iid N(0, 1)
and plot the pairs (X1 , Y1 ), . . . , (Xn , Yn ) to gain insight into the joint distribution. The
following lines of R code produce the scatter plot of simulated data in Figure 5.2.
> x = rnorm(2000)
> y = rnorm(2000)
> plot(y~x,xlim = c(-3,3), ylim= c(-3,3))
Figure 5.2: Scatter plot of 2000 points from the bivariate standard normal distribution.
We see a “spherical” pattern in the data. This is corroborated by the fact that the
joint pdf has contour lines that are circles.
where the sum is taken over all possible values of (x1 , . . . , xn ). In the continuous case
replace the sum above with a (multiple) integral.
Two important special cases are the expectation of the sum (or more generally any
linear transformation plus a constant) of random variables and the product of random
variables.
E[a + b1 X1 + b2 X2 + · · · + bn Xn ] = a + b1 µ1 + · · · + bn µn (5.6)
E[X1 X2 · · · Xn ] = µ1 µ2 · · · µn . (5.7)
Proof: We show it for the discrete case with two variables only. The general case
follows by analogy and, for the continuous case, by replacing sums with integrals. Let
X1 and X2 be discrete random variables with joint pmf f . Then, by (5.5),
E[a + b1X1 + b2X2] = Σ_{x1,x2} (a + b1x1 + b2x2) f(x1, x2)
    = a + b1 Σ_{x1} Σ_{x2} x1 f(x1, x2) + b2 Σ_{x1} Σ_{x2} x2 f(x1, x2)
    = a + b1 Σ_{x1} x1 Σ_{x2} f(x1, x2) + b2 Σ_{x2} x2 Σ_{x1} f(x1, x2)
    = a + b1 Σ_{x1} x1 fX1(x1) + b2 Σ_{x2} x2 fX2(x2) = a + b1µ1 + b2µ2 .
Next, assume that X1 and X2 are independent, so that f (x1 , x2 ) = fX1 (x1 ) fX2 (x2 ). Then,
E[X1X2] = Σ_{x1,x2} x1 x2 fX1(x1) fX2(x2) = ( Σ_{x1} x1 fX1(x1) ) × ( Σ_{x2} x2 fX2(x2) ) = µ1µ2 .
The covariance is a measure of the amount of linear dependency between two ran-
dom variables. A scaled version of the covariance is given by the correlation coeffi-
cient:
Cov(X, Y)
%(X, Y) = , (5.8)
σ X σY
where σ²X = Var(X) and σ²Y = Var(Y). For easy reference, Theorem 5.2 lists some
important properties of the variance and covariance.
6. Cov(X, X) = Var(X).
In particular, combining Properties (7) and (8) we see that if X and Y are indepen-
dent, then the variance of their sum is equal to the sum of their variances. It is not
difficult to deduce from this the following more general result.
Example 5.4 (Expectation and Variance for the Binomial Distribution) We now
show a simple way to prove Theorem 4.2; that is, to prove that the expectation and vari-
ance for the Bin(n, p) distribution are np and np(1− p), respectively. Let X ∼ Bin(n, p).
Hence, we can view X as the total number of successes in n Bernoulli trials (coin flips)
with success probability p. Let us introduce Bernoulli random variables X1 , . . . , Xn ,
where Xi = 1 if the ith trial is a success (and Xi = 0 otherwise). We thus have that
X1, . . . , Xn ∼iid Ber(p). The key to the proof is to observe that X is simply the sum of
the Xi's; that is,
X = X1 + · · · + Xn .
Since we have seen that each Bernoulli variable has expectation p and variance p(1−p),
we have by Theorem 5.1 that
E(X) = E(X1 ) + · · · + E(Xn ) = np
and by Theorem 5.3 that
Var(X) = Var(X1 ) + · · · + Var(Xn ) = np(1 − p) ,
as had to be shown.
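The result can also be checked by simulation (an illustrative sketch with hypothetical values n = 20 and p = 0.3):

n = 20; p = 0.3; N = 1e5
x = rbinom(N, size = n, prob = p)                                  # Bin(n, p) directly
y = colSums(matrix(rbinom(n * N, size = 1, prob = p), nrow = n))   # sum of n Bernoullis
c(mean(x), mean(y), n * p)              # all close to np = 6
c(var(x), var(y), n * p * (1 - p))      # all close to np(1 - p) = 4.2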
Example 5.5 (Square Root of a Uniform) Let U ∼ U(0, 1). What is the expectation of √U? We know that the expectation of U is 1/2. Would the expectation of √U
be √(1/2)? We can determine the expectation exactly in this case, but let us use
simulation and the law of large numbers instead. All we have to do is simulate a large
number of uniform numbers, take their square roots, and average over all values:
> u = runif(10e6)
> x = sqrt(u)
> mean(x)
[1] 0.6665185
Repeating the simulation gives consistently 0.666 in the first three digits behind the
decimal point. You can check that the true expectation is 2/3, which is smaller than
√(1/2) ≈ 0.7071.
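For completeness, the exact value follows by a one-line integration (a standard calculation, not spelled out in the text):

E(√U) = ∫_0^1 √u du = [ (2/3) u^(3/2) ]_0^1 = 2/3 .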
The central limit theorem describes the approximate distribution of Sn (or Sn/n),
and it applies to both continuous and discrete random variables. Informally, it states
the following: the sum of a large number of independent and identically distributed
random variables has approximately a normal distribution.
Figure 5.3: Histogram for the sum of 2 independent uniform random variables.
The pdf seems to be triangle shaped and, indeed, this is not so difficult to show.
Now let us do the same thing for sums of 3 and 4 uniform numbers. Figure 5.4 shows
that the pdfs have assumed a bell-shaped form reminiscent of the normal distribution.
Indeed, if we superimpose the normal distribution with the same mean and variance as
the sums, the agreement is excellent.
Figure 5.4: The histograms for the sums of 3 (left) and 4 (right) uniforms are in close
agreement with normal pdfs.
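Figures of this kind can be produced along the following lines (a sketch; the exact code and plotting options used for Figures 5.3 and 5.4 are not shown in the notes):

par(mfrow = c(1, 2))
N = 1e5
x2 = runif(N) + runif(N)                  # sums of 2 uniforms
hist(x2, prob = TRUE, breaks = 100)
curve(dnorm(x, mean = 1, sd = sqrt(2/12)), col = "red", add = TRUE)    # matching normal pdf
x3 = runif(N) + runif(N) + runif(N)       # sums of 3 uniforms
hist(x3, prob = TRUE, breaks = 100)
curve(dnorm(x, mean = 3/2, sd = sqrt(3/12)), col = "red", add = TRUE)  # matching normal pdf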
The central limit theorem does not only hold if we add up continuous random
variables, such as uniform ones, but it also holds for the discrete case. In particular,
recall that a binomial random variable X ∼ Bin(n, p) can be viewed as the sum of n iid
Ber(p) random variables, so for large n the Bin(n, p) distribution is approximately normal.
Note that the expectation and variance of Y are a direct consequence of Theorems 5.1 and 5.3.
The simplest class of statistical models is the one where the data X1, . . . , Xn are assumed to be independent and identically distributed (iid), as we already mentioned. In
many cases it is assumed that the sampling distribution is normal. Here is an example.
Example 5.6 (One-sample Normal Model) From a large population we select 300
men between 40 and 50 years of age and measure their heights. Let Xi be the height of
the i-th selected person, i = 1, . . . , 300. As a model take

X1, . . . , X300 ∼iid N(µ, σ2)
for some unknown parameters µ and σ2 . We could interpret these as the population
mean and variance.
A simple generalization of a single sample of iid data is the model where there are
two independent samples of iid data, as in the examples below.
The objective is to assess the difference p1 − p2 on the basis of the observed values for
X1, . . . , X100, Y1, . . . , Y100. Note that it suffices to only record the total number of boys
or girls who prefer Sweet cola in each group; that is, X = Σ_{i=1}^{100} Xi and Y = Σ_{i=1}^{100} Yi.
This gives the two-sample binomial model:
X ∼ Bin(100, p1 ) ,
Y ∼ Bin(100, p2 ) ,
X, Y independent, with p1 and p2 unknown.
Example 5.8 (Two-sample Normal Model) From a large population we select 200
men between 25 and 30 years of age and measure their heights. For each person we
also record whether the mother smoked during pregnancy or not. Suppose that 60
mothers smoked during pregnancy.
Let X1 , . . . , X60 be the heights of the men whose mothers smoked, and let Y1 , . . . , Y140
be the heights of the men whose mothers did not smoke. Then, a possible model is the
two-sample normal model

X1, . . . , X60 ∼iid N(µ1, σ1²),   Y1, . . . , Y140 ∼iid N(µ2, σ2²),   with all variables independent,

where the model parameters µ1, µ2, σ1², and σ2² are unknown. One would typically like
to assess the difference µ1 − µ2 . That is, does smoking during pregnancy affect the
(expected) height of the sons? A typical simulation outcome of the model is given in
Figure 5.6, using parameters µ1 = 170, µ2 = 175, σ21 = 200, and σ22 = 100.
[Figure 5.6: simulated heights (in cm) for the smoker and non-smoker groups.]
Remark 5.1 (About Statistical Modeling) At this point it is good to emphasize a few
points about statistical modeling.
• Any model for data is likely to be wrong. For example, in Example 5.8 the
height would normally be recorded on a discrete scale, say 1000 – 2200 (mm).
However, samples from a N(µ, σ2 ) can take any real value, including negative
values! Nevertheless, the normal distribution could be a reasonable approxima-
tion to the real sampling distribution. An important advantage of using a normal
distribution is that it has many nice mathematical properties as we have seen.
• Any model for data needs to be checked for suitability. An important criterion
is that data simulated from the model should resemble the observed data — at
least for a certain choice of model parameters.
CHAPTER 6
ESTIMATION
In this chapter you will learn how to estimate parameters of simple statistical
models from the observed data. The difference between estimate and estimator
will be explained. Confidence intervals will be introduced to assess the accuracy
of an estimate. We will derive confidence intervals for a variety of one- and two-
sample models. Various probability distributions, such as the Student’s t and the
χ2 distribution will make their first appearance.
6.1 Introduction
Recall the framework of statistical modeling in Figure 5.5. We are given some data
(measurements) for which we construct a model that depends on one or more parame-
ters. Based on the observed data we try to say something about the model parameters.
For example, we wish to estimate the parameters. Here are some concrete examples.
Example 6.1 (Biased Coin) We throw a coin 1000 times and observe 570 Heads.
Using this information, what can we say about the “fairness” of the coin? The data here
(or better, datum, as there is only one observation) is the number x = 570. Suppose we
view x as the outcome of a random variable X which describes the number of Heads
in 1000 tosses. Our statistical model is then:
X ∼ Bin(1000, p) ,
where p ∈ [0, 1] is unknown. Any statement about the fairness of the coin is expressed
in terms of p and is assessed via this model. It is important to understand that p will
never be known. The best we can do is to provide an estimate of p. A common sense
estimate of p is simply the proportion of Heads x/1000 = 0.570. But how accurate is
this estimate? Is it possible that the unknown p could in fact be 0.5? One can make
sense of these questions through detailed analysis of the statistical model.
Example 6.2 (Iid Sample from a Normal Distribution) Consider the standard model
for data
X1, . . . , Xn ∼iid N(µ, σ2) ,
where µ and σ2 are unknown. The random measurements {Xi } could represent the
masses of randomly selected teenagers, the heights of the dorsal fin of sharks, the
dioxin concentrations in hamburgers, and so on. Suppose, for example, that with n = 10
the observed measurements x1, . . . , xn are:
77.01, 71.37, 77.15, 79.89, 76.46, 78.10, 77.18, 74.08, 75.88, 72.63.
Note that the estimates x̄ and s2 are functions of the data x = (x1 , . . . , xn ) only. We
encountered these summary statistics already in Section 2.4.
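In R, these estimates can be computed directly from the data (a sketch; not part of the original example):

x = c(77.01, 71.37, 77.15, 79.89, 76.46, 78.10, 77.18, 74.08, 75.88, 72.63)
mean(x)   # sample mean x-bar = 75.975
var(x)    # sample variance s^2
sd(x)     # sample standard deviation s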
Why are these numbers good estimates (guesses) for our unknown parameters µ
and σ2? How accurate are these numbers? That is, how far away are they from the true
parameters? To answer these questions we need to investigate the statistical properties
of the sample mean and sample variance.
The first equation gives the sample mean μ̂ = x̄ as our estimate for µ. Substituting
μ̂ = x̄ in the second equation, we find that the second equation gives

σ̂² = (1/n) Σ_{i=1}^n xi² − x̄² = (1/n) ( Σ_{i=1}^n xi² − n x̄² )   (6.3)

as an estimate for σ². This estimate seems quite different from the sample variance
s² in (6.2). But the two estimates are actually very similar. To see this, expand the
quadratic term in (6.2), to get

s² = 1/(n − 1) Σ_{i=1}^n (xi² − 2 xi x̄ + x̄²) ,

and simplify:

s² = 1/(n − 1) ( Σ_{i=1}^n xi² − 2 x̄ · n x̄ + n x̄² ) = 1/(n − 1) ( Σ_{i=1}^n xi² − n x̄² ) .
To find out how good an estimate is, we need to investigate the properties of the
corresponding estimator. The estimator is obtained by replacing the fixed observa-
tions xi with the random variables Xi in the expression for the estimate. For example,
the estimator corresponding to the sample mean x̄ is
X1 + · · · + Xn
X̄ = .
n
The interpretation is that X1 , . . . , Xn are the data that we will obtain if we carry out the
experiment tomorrow, and X̄ is the (random) sample mean of these data, which again
will be obtained tomorrow.
Let us go back to the basic model where X1, . . . , Xn are independent and identically
distributed with some unknown expectation µ and variance σ2 . We do not require that
the {Xi } are normally distributed — we are only interested in estimating the expectation
and variance. To justify why x̄ is a good estimate of µ, think about what we can say
(today) about the properties of the estimator X̄. The expectation and variance of X̄
follow easily from the rules for expectation and variance in Chapter 5. In particular,
by (5.6) we have

E(X̄) = E( (1/n)(X1 + · · · + Xn) ) = (1/n) (E(X1) + · · · + E(Xn)) = (1/n)(µ + · · · + µ) = µ ,

and from (5.9) we have

Var(X̄) = Var( (1/n)(X1 + · · · + Xn) ) = (1/n²) (Var(X1) + · · · + Var(Xn)) = (1/n²)(σ² + · · · + σ²) = σ²/n .
The first result says that the estimator X̄ is “on average” equal to the unknown quantity
that we wish to estimate (µ). We call an estimator whose expectation is equal to the
quantity that we wish to estimate unbiased. The second result shows that the larger
we take n, the closer the variance of X̄ is to zero, indicating that X̄ goes to the constant
µ for large n. This is in essence the law of large numbers; see Section 5.5.
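Both properties are easy to check by simulation (an illustrative sketch with hypothetical values µ = 2, σ = 3, n = 10):

xbar = replicate(1e5, mean(rnorm(10, mean = 2, sd = 3)))
mean(xbar)   # close to mu = 2 (the sample mean is unbiased)
var(xbar)    # close to sigma^2/n = 9/10 = 0.9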
To assess how close X̄ is to µ, one needs to look at a confidence interval for µ.
If t1 and t2 are the observed values of T 1 and T 2 , then the interval (t1 , t2 ) is called
the numerical confidence interval for θ with confidence 1 − α. If (6.4) only
holds approximately, the interval is called an approximate confidence interval.
[Figure: confidence intervals computed from ten different samples (samples 1–10).]
Only in (on average) 9 out of 10 cases would these intervals contain our unknown
θ. To put it in another way: Consider an urn with 90 white and 10 black balls. We pick
at random a ball from the urn but we do not open our hand to see what colour ball we
have. Then we are pretty confident that the ball we have in our hand is white. This is
how confident you should be that the unknown θ lies in the interval (9.5, 10.5).
We can rewrite the event

A = { −1.96 < (X̄ − µ)/(S/√n) < 1.96 }

as follows. Multiplying the left, middle, and right parts of the inequalities by S/√n
still gives the same event, so

A = { −1.96 S/√n < X̄ − µ < 1.96 S/√n } .

Subtracting X̄ from the left, middle, and right parts still does not change anything about
the event, so

A = { −X̄ − 1.96 S/√n < −µ < −X̄ + 1.96 S/√n } .

Finally, we multiply the left, middle, and right parts by −1. This will flip the < signs
to >. For example, −3 < −2 is the same as 3 > 2. So, we get:

A = { X̄ + 1.96 S/√n > µ > X̄ − 1.96 S/√n } ,

which is the same as

A = { X̄ − 1.96 S/√n < µ < X̄ + 1.96 S/√n } .
If we write this as A = {T1 < µ < T2}, with P(A) ≈ 0.95, then we see that (T1, T2) is
an approximate 95% confidence interval for µ. We can repeat this procedure with any
quantile of the normal distribution. This leads to the following result.
Example 6.4 (Oil Company) An oil company wishes to investigate how much on
average each household in Melbourne spends on petrol and heating oil per year. The
company randomly selects 51 households from Melbourne, and finds that these spent
on average $1136 on petrol and heating oil, with a sample standard deviation of $178.
We wish to construct a 95% confidence interval for the expected amount of money
per year that the households in Melbourne spend on petrol and heating oil. Call this
parameter µ.
We assume that the outcomes of the survey, x1 , . . . , x51 , are realizations of an iid
sample with expectation µ. Although we do not know the outcomes themselves, we
know their sample mean x̄ = 1136 and standard deviation s = 178. An approximate
numerical 95% confidence interval is thus

1136 ± 1.96 × 178/√51 = (1087, 1185) .
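In R, the interval can be computed from the summary statistics alone (a sketch; not part of the original example):

xbar = 1136; s = 178; n = 51
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)   # approximately (1087, 1185)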
(X̄ − µ) / (S/√n) ∼ tn−1 .   (6.7)
Figure 6.2 gives graphs of the probability density functions of the t1, t2, t5, and
t50 distributions. Notice a similar bell-shaped curve as for the normal distribution, but the tails of
the distribution are “fatter” than for the normal distribution. As n grows larger, the pdf
of the tn distribution gets closer and closer to the pdf of the N(0, 1) distribution.
Figure 6.2: The pdfs of Student t distributions with various degrees of freedom (df).
We can use R to calculate the pdf, cdf, and quantiles for this distribution. For
example, the following R script produces Figure 6.2.
curve(dt(x,df=1),ylim=c(0,0.4),xlim=c(-5,5),col=1,ylab="Density")
curve(dt(x,df=2),col=2,add=TRUE)
curve(dt(x,df=5),col=3,add=TRUE)
curve(dt(x,df=50),col=4,add=TRUE)
legend(2.1,0.35,lty=1,bty="n",
       legend=c("df=1","df=2","df=5","df=50"),col=1:4)
To obtain the 0.975 quantile of the tdf distribution for df = 1, 2, 5, 50, and 100,
enter the following commands.
> qt(0.975,df=c(1,2,5,50,100))
As a comparison, the 0.975 quantile for the standard normal distribution is given by
qnorm(0.975) = 1.959964 (≈ 1.96).
Returning to the pivot T in (6.5), it has a tn−1 distribution. By repeating the rearrangement steps from Section 6.3.1, we find the following exact confidence interval
for µ in terms of the quantiles of the tn−1 distribution.
Let X1 , X2 , . . . , Xn ∼iid N(µ, σ2 ) and let q be the 1 − α/2 quantile of the Student’s
tn−1 distribution. An exact stochastic confidence interval for µ is
X̄ ± q S/√n .   (6.8)
Example 6.5 (Volume of a Drop of Water) A buret is a glass tube with scales that
can be used to add a specified volume of a fluid to a receiving vessel. We wish to
determine a 95% confidence interval for the average volume of one drop of water that
leaves the buret, based on the data in Table 6.1.
Our model for the data is as follows: let X1 be the volume of the first 50 drops,
and X2 the volume of the second 50 drops. We assume that X1 , X2 are iid and N(µ, σ2 )
distributed, with unknown µ and σ2 . Note that µ is the expected volume of 50 drops,
and therefore µ/50 is the expected volume of one drop.
With n = 2 and α = 0.05, we have that the 0.975 quantile of the t1 distribution
is q = 12.71. The outcomes of X1 and X2 are respectively x1 = 2.52 and x2 = 2.48.
Hence,

s = √( (2.52 − 2.50)² + (2.48 − 2.50)² ) = 0.02 √2 .

Hence, a numerical 95% CI for µ is

x̄ ± q s/√2 = 2.50 ± 12.71 × 0.02 = (2.25, 2.75) ,

and dividing by 50 gives (0.045, 0.055) as a 95% CI for the expected volume of one drop.
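As a check (not in the original text), t.test returns the same interval when applied to the two recorded volumes:

x = c(2.52, 2.48)
t.test(x)$conf.int    # 95% CI for mu, approximately (2.25, 2.75)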
It turns out that (n − 1)S²/σ² = Σ_{i=1}^n (Xi − X̄)²/σ² has a known distribution, called
the χ² distribution, where χ is the Greek letter chi. Hence, the distribution is also
written (and pronounced) as the chi-squared distribution. Like the t distribution, the χ²
distribution is actually a family of distributions, depending on a parameter that is again
called the degrees of freedom. We write Z ∼ χ²df to denote that Z has a chi-square
distribution with df degrees of freedom. Figure 6.3 shows the pdf of the χ²1, χ²2, χ²5, and
χ²10 distributions. Note that the pdf is not symmetric and starts at x = 0. The χ²1 has a
density that is infinite at 0, but that is no problem — as long as the total integral under
the curve is 1.
Figure 6.3: The pdfs of chi-square distributions with various degrees of freedom (df).
Figure 6.3 was made in a very similar way to Figure 6.2, mostly by replacing dt
with dchisq in the R code. Here is the beginning of the script — you can work out
the rest.
> curve(dchisq(x,df=1),xlim=c(0,15),ylim=c(0,1),ylab="density")
To obtain the 0.025 and 0.975 quantiles of the χ²24 distribution, for example, we can
issue the command:
> qchisq(p=c(0.025,0.975),24)
Because (n − 1)S²/σ² has a χ²n−1 distribution, if we denote the α/2 and 1 − α/2
quantiles of this distribution by q1 and q2, then

P( q1 < (n − 1)S²/σ² < q2 ) = 1 − α .

Rearranging, this shows

P( (n − 1)S²/q2 < σ² < (n − 1)S²/q1 ) = 1 − α .
This gives the following exact confidence interval for σ² in terms of the quantiles of
the χ²n−1 distribution.
Theorem 6.4
Let X1, X2, . . . , Xn ∼iid N(µ, σ2) and let q1 and q2 be the α/2 and 1 − α/2 quantiles
of the χ²n−1 distribution. An exact stochastic confidence interval for σ2 is

( (n − 1)S²/q2 , (n − 1)S²/q1 ) .   (6.9)
Example 6.6 (Aspirin) On the label of a certain packet of aspirin it is written that
the standard deviation of the tablet weight (actually mass) is 1.0 mg. To investigate
if this is true we take a sample of 25 tablets and discover that the sample standard
deviation is 1.3 mg. A 95% numerical confidence interval for σ2 is

( 24 × 1.3²/39.4 , 24 × 1.3²/12.4 ) = (1.03, 3.27) ,
where we have used (in rounded numbers) q1 = 12.4 and q2 = 39.4 calculated before
with the qchisq() function. A 95% numerical confidence interval for σ is found by
taking square roots (why?):
(1.02, 1.81) .
Note that this CI does not contain the asserted standard deviation of 1.0 mg. We therefore have
some doubt whether the “true” standard deviation is indeed equal to 1.0 mg.
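In R (a sketch reproducing the calculation above):

s2 = 1.3^2; n = 25
q = qchisq(c(0.975, 0.025), df = n - 1)    # the quantiles q2 and q1
(n - 1) * s2 / q                           # 95% CI for sigma^2, roughly (1.03, 3.27)
sqrt((n - 1) * s2 / q)                     # corresponding 95% CI for sigma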
So, if we standardize and replace σ²X and σ²Y with their sample variances, we find that

( X̄ − Ȳ − (µX − µY) ) / √( S²X/m + S²Y/n )

has approximately a N(0, 1) distribution.
For small m and n the standard normal approximation may not be very accurate. Fortunately, it is possible to obtain a much better approximation using a Student tdf distribution,
where df is given by the so-called effective degrees of freedom:

df = ( S²X/m + S²Y/n )² / [ (1/(m−1)) (S²X/m)² + (1/(n−1)) (S²Y/n)² ] .   (6.10)
We thus have the following approximate confidence interval for µX − µY, using the
above Satterthwaite approximation.
Example 6.7 (Human Movement Study) A human movement student has a the-
ory that the expected mass of 3rd year students differs from that of 1st years. To
investigate this theory, random samples are taken from each of the two groups. A sam-
ple of 15 1st years has a mean of 62.0kg and a standard deviation of 15kg, while a
sample of 10 3rd years has a mean of 71.5kg and a standard deviation of 12kg. Does
this show that the expected masses are indeed different?
Here we have m = 15 and n = 10. The outcome of X̄ − Ȳ is x̄ − ȳ = 62 − 71.5 = −9.5.
Using (6.10), the effective degrees of freedom is df = 22.09993. You may verify also
that

√( s²X/m + s²Y/n ) = 5.422177 .
To construct a 95% numerical confidence interval for µX − µY , we need to also evaluate
the 0.975 quantile of the tdf distribution, using the R command qt(0.975,22.09993).
This gives q = 2.073329, so the 95% numerical confidence interval for µX − µY is
given by

−9.5 ± 2.073329 × 5.422177 = (−20.74, 1.74) .
This contains the value 0, so there is not enough evidence to conclude that the two
expectations are different.
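The calculation can be reproduced in R from the summary statistics alone (a sketch):

m = 15; n = 10
xbar = 62.0; ybar = 71.5; sx = 15; sy = 12
se = sqrt(sx^2/m + sy^2/n)                               # 5.422177
df = se^4 / ((sx^2/m)^2/(m - 1) + (sy^2/n)^2/(n - 1))    # 22.09993
(xbar - ybar) + c(-1, 1) * qt(0.975, df) * se            # (-20.74, 1.74)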
Example 6.8 (Opinion Poll) In an opinion poll of 1000 registered voters, 227 vot-
ers say they will vote for the Greens. How can we construct a 95% confidence interval
for the proportion p of Green voters of the total population? A systematic way to pro-
ceed is to view the datum, 227, as the outcome of a random variable X (the number of
Green voters among the 1000 registered voters) with a Bin(1000, p) distribution. In other
words, we view X as the total number of “Heads” (= votes Green) in a coin flip ex-
periment with some unknown probability p of getting Heads. Note that this is only
a model for the data. In practice it is not always possible to truly select 1000 people
at random from the population and find their true party preference. For example, a
randomly selected person may not wish to participate or could deliberately give the
“wrong answer”.
Now, let us proceed to make a confidence interval for p, in the general situation that
we have an outcome of some random variable X with a Bin(n, p) distribution. It is not
so easy to find an exact confidence interval for p that satisfies (6.4) in Definition 6.1.
Instead, for large n we rely on the central limit theorem (see Section 5.5) to construct
an approximate confidence interval. The reasoning is as follows:
For large n, X has approximately a N(np, np(1 − p)) distribution. Let P̂ = X/n
denote the estimator of p. We use the capital letter P̂ to stress that the estimator is a
random variable. The outcome of P̂ is denoted p̂, which is an estimate of the parameter
p. Then P̂ has approximately a N(p, p(1 − p)/n) distribution. For some small α (e.g.,
α = 0.05) let q be the 1 − α/2 quantile of the standard normal distribution. Thus, with
Φ the cdf of the standard normal distribution, we have

Φ(q) = 1 − α/2 ,

so that P(−q < Z < q) = 1 − α for Z ∼ N(0, 1). Applying this to the (approximately)
standard normal quantity (P̂ − p)/√(p(1 − p)/n) and rearranging gives:

P( P̂ − q √(p(1 − p)/n) < p < P̂ + q √(p(1 − p)/n) ) ≈ 1 − α .
This would suggest that we take p̂ ± q √(p(1 − p)/n) as a numerical (approximate) (1 − α)
confidence interval for p, were it not for the fact that the bounds still contain the
unknown p! However, for large n the estimator P̂ is close to the real p, so that we
have

P( P̂ − q √( P̂(1 − P̂)/n ) < p < P̂ + q √( P̂(1 − P̂)/n ) ) ≈ 1 − α .
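Applied to the opinion poll of Example 6.8, a sketch of the calculation in R is:

n = 1000; phat = 227 / n
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # approximately (0.201, 0.253)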
Example 6.10 (Nightmares) Two groups of men and women are asked whether
they experience nightmares “often” (at least once a month) or “seldom” (less than
once a month). The results are given in Table 6.2.
The observed proportions of frequent nightmares by men and women are 34.4%
and 31.3%. Is this difference statistically significant, or due to chance? To assess this
we could make a confidence interval for the difference of the true proportions pX and
pY .
The confidence interval is based on the approximately standard normal pivot

( P̂X − P̂Y − (pX − pY) ) / √( pX(1 − pX)/m + pY(1 − pY)/n ) .
This interval contains 0, so there is no evidence that men and women are different in
their experience of nightmares.
CHAPTER 7
HYPOTHESIS TESTING
7.1 Introduction
We had a first look at hypothesis testing in Chapter 1. Namely, in Section 1.1 we
investigated a coin flip experiment (is the coin fair?) and in Section 1.4 we studied
Alice's cola experiment (does drinking caffeinated Diet cola increase the heart rate?).
In this chapter we will revisit both these experiments and describe their analysis in a
framework that is more generally applicable.
In particular, suppose that we have a general model for data X that is described by
a family of probability distributions that depend on a parameter θ. For example, in the
one-sample normal model, we have X = (X1, . . . , Xn), where X1, . . . , Xn ∼iid N(µ, σ2).
In this case θ is the vector (µ, σ2 ).
The aim of hypothesis testing is to decide, on the basis of the observed data x,
which of two competing hypotheses on the parameters is true. For example, one hy-
pothesis could be that µ = 0 and the other that µ ≠ 0. Traditionally, the two hypotheses
do not play equivalent roles. One of the hypotheses contains the “status quo” statement.
This is the null hypothesis, often denoted by H0 . The alternative hypothesis, denoted
H1 , contains the statement that we wish to show. A good analogy is found in a court of
law. Here, H0 (present state of affairs) could be the statement that a suspect is innocent,
while H1 is the statement that the suspect is guilty (what needs to be demonstrated).
The legal terms such as “innocent until proven guilty”, and “without reasonable doubt”
show clearly the asymmetry between the hypotheses. We should only be prepared to
reject H0 if the observed data, that is the evidence, is very unlikely to have happened
under H0 .
The decision whether to reject H0 or not is dependent on the outcome of a test
statistic T , which is a function of the data X only. The P-value is the probability that
under H0 the (random) test statistic takes a value as extreme as or more extreme than
the one observed. Let t be the observed outcome of the test statistic T . We consider
three types of tests:
• Left one-sided test. Here H0 is rejected for small values of t, and the P-value is
defined as p = P_H0(T ≤ t).
• Right one-sided test. Here H0 is rejected for large values of t, and the P-value
is defined as p = P_H0(T ≥ t).
• Two-sided test. In this test H0 is rejected for small or large values of t, and the
P-value is defined as p = min{ 2 P_H0(T ≤ t), 2 P_H0(T ≥ t) }.
The smaller the P-value, the greater the strength of the evidence against H0 provided
by the data. As a rule of thumb (see also Figure 1.5):
p < 0.10 weak evidence,
p < 0.05 moderate evidence,
p < 0.01 strong evidence.
The following decision rule is generally used to decide between H0 and H1 :
Decision rule : Reject H0 if the P-value is smaller than some significance level α.
In general, a statistical test involves the following steps.
1. Formulate a statistical model for the data.
2. Specify the null hypothesis H0 and the alternative hypothesis H1.
3. Choose an appropriate test statistic T.
4. Determine the distribution of T under H0.
5. Evaluate the outcome t of the test statistic from the observed data.
6. Calculate the P-value.
7. Accept or reject H0 on the basis of the P-value.
Choosing an appropriate test statistic is akin to selecting a good estimator for the
unknown parameter θ. The test statistic should summarize the information about θ and
make it possible to distinguish between the two hypotheses.
Example 7.1 (Blood Pressure) Suppose the systolic blood pressure for white males
aged 35–44 is known to be normally distributed with expectation 127 and standard de-
viation 7. A paper in a public health journal considers a sample of 101 diabetic males
and reports a sample mean of 130. Is this good evidence that diabetics have on average
a higher blood pressure than the general population?
To assess this, we could ask the question how likely it would be, if diabetics were
similar to the general population, that a sample of 101 diabetics would have a mean
blood pressure this far from 127.
Let us perform the seven steps of a statistical test. A reasonable model for the data
is X1 , . . . , X101 ∼iid N(µ, 49). Alternatively, the model could simply be X̄ ∼ N(µ, 49/101),
since we only have an outcome of the sample mean of the blood pressures. The null
hypothesis (the status quo) is H0 : µ = 127; the alternative hypothesis is H1 : µ > 127.
We take X̄ as the test statistic. Note that we have a right one-sided test here, because
we would reject H0 for high values of X̄. Under H0 we have X̄ ∼ N(127, 49/101). The
outcome of X̄ is 130, so that the P-value is given by

P(X̄ > 130) = P( (X̄ − 127)/√(49/101) > (130 − 127)/√(49/101) ) = P(Z > 4.31) ≈ 8.16 · 10⁻⁶

(in R: 1 - pnorm(4.31)), where Z ∼ N(0, 1). So it is extremely unlikely that the event {X̄ > 130} occurs if
the two groups are the same with regard to blood pressure. However, the event has
occurred. Therefore, there is strong evidence that the blood pressure of diabetics is higher
than that of the general population.
Example 7.2 (Biased Coin (Revisited)) We revisit Example 1.1, where we observed
60 out of 100 Heads for a coin that we suspect to be biased towards Heads. Is there
enough evidence to justify our suspicion?
What are the 7 hypothesis steps in this case? A good model (step 1) for the data X
(the total number of Heads in 100 tosses) is: X ∼ Bin(100, p), where the probability of
Heads, p, is unknown. We would like to show (step 2) the hypothesis H1 : p > 1/2;
otherwise, we do not reject (accept) the null hypothesis H0 : p = 1/2. Our test statistic
(step 3) could simply be X. Under H0 , X ∼ Bin(100, 1/2) (step 4). The outcome of X
(step 5) is x = 60, so the P-value for this right one-sided test is

P(X ≥ 60) = Σ_{k=60}^{100} (100 choose k) (1/2)^100 ≈ 0.02844397

(in R: 1 - pbinom(59,100,1/2)).
This is quite small. Hence, we have reasonable evidence that the coin is biased towards Heads.
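The P-value can be verified in R (a check, not part of the original example):

1 - pbinom(59, size = 100, prob = 0.5)                          # 0.02844397
binom.test(60, 100, p = 0.5, alternative = "greater")$p.value   # the same exact P-value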
In the rest of this chapter we are going to look at a selection of basic tests, involving
one or two iid samples from either a Normal or Bernoulli distribution.
Table 7.1: Changes in pulse rate for the Decaf group in Alice’s cola experiment.
4 10 7 −9 5 4 5 7 6 12
To answer this question, we again consider an appropriate model for this situation.
We represent the observations by X1, . . . , X10, and assume that they form an iid sample
from a N(µ, σ2) distribution, where both µ and σ2 are unknown; note that this is different
from Example 7.1, where the variance is known. The hypotheses can now be formulated
as: H0 : µ = 0 against H1 : µ > 0.
Which test statistic should we choose? Since we wish to make a statement about
µ, the test statistic should reflect this. We could take X̄ as our test statistic and reject
H0 for large values of X̄. However, this leads to a complication. It looks like our null
hypothesis only contains one parameter value, but in fact it contains many, because we
should have written
H0 : µ = 0, 0 < σ2 < ∞ .
It is the unknown variance σ2 that leads to the complication in choosing X̄ as our
test statistic. To see this, consider the following two cases. First consider the case
where the standard deviation σ is small, say 1. In that case, X̄ is under H0 very much
concentrated around 0, and therefore any deviation from 0, such as 7 would be most
unlikely under H0 . We would therefore reject H0 . On the other hand, if σ is large, say
10, then a value of 7 could very well be possible under H0 , so we would not reject it.
This shows that X̄ is not a good test statistic, but that we should “scale” it with
the standard deviation. That is, we should measure our deviation from 0 in units of
σ rather than in units of 1. However, we do not know σ. But this is easily fixed by
replacing σ with an appropriate estimator. This leads to the test statistic
T = X̄ / (S/√10) .

The factor √10 is a “standardising” constant which enables us to utilize Theorem 6.7.
Namely, under H0 the random variable T has a tn−1 = t9 distribution, which does not
depend on the unknown σ2. More generally, for testing H0: µ = µ0 we use the test statistic

T = (X̄ − µ0) / (S/√n) ,   (7.1)
Using R
Note that in order to carry out a one-sample t-test, we only need the summary statistics x̄ and s of the data. When the individual measurements are available, it is convenient to carry out the t-test using the R function t.test. As an example, we enter the
data in Table 7.1 into R and print out the sample mean and standard deviation using
the function sprintf, which can be used to format output neatly:
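(The following lines are a sketch of how this could be done; the exact format string used in the original is not shown.)

x = c(4, 10, 7, -9, 5, 4, 5, 7, 6, 12)
sprintf("mean=%g sd=%.3f", mean(x), sd(x))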
"mean=5.1 sd=5.587"
> t.test(x,alternative="greater")
data: x
t = 2.8868, df = 9, p-value = 0.008989
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
1.8615 Inf
sample estimates:
mean of x
5.1
The main output of the function t.test are: the outcome of the T statistic (t =
2.8868), the P-value = 0.008989, the alternative hypothesis (true mean is greater than
0) and the sample mean x̄ = 5.1. To output just the P-value, we can use:
> t.test(x,alternative="greater")$p.value
[1] 0.008988979
By default, the t.test function takes a two-sided alternative. The option alternative
= "greater" forces a right one-sided alternative. Note that in this case t.test returns a one-sided confidence interval. To obtain a 99% two-sided confidence interval
for µ we can use:
> t.test(x,conf.level=0.99)$conf.int
To find the variable names that are returned by a function, use names(), as in
h = t.test(x)
names(h)
Whether we make a right or wrong decision is the result of a random process. Thus,
for any statistical test where we make a decision in the end, there is a probability of
a Type I error, Type II error, or correct decision. Ideally, we would like to construct
tests which make the probabilities of Type-I and Type-II errors, (let’s call them eI and
eII ) as small as possible. Unfortunately, this is not possible, because the two errors
“compete” with each other: if we make eI smaller, eII will increase, and vice versa.
Because, as mentioned, the null and alternative hypothesis do not play equivalent
roles, a standard approach is to keep the probability eI of a Type I error at (or below)
a certain threshold: the significance level, say 0.05. The decision rule: reject H0 if the
P-value is smaller than some significance level α ensures that eI 6 α.
Next, given that eI remains at (or below) level α, we should try to make eII as small
as possible. The probability 1 − eII is called the power of the test. It is the probability
of making the right decision (reject H0 ) under some alternative in H1 . So, minimizing
the probability of a Type II error is the same as maximizing the power. Note that the
power heavily depends on what alternative is used.
Example 7.3 (Simulating the Power) Suppose we have a one-sample t-test, where
we want to test H0: µ = 0 versus H1: µ > 0. Our test statistic is X̄/(S/√n) and under
H0, this test statistic has a tn−1 distribution. Suppose we have a significance level of
α = 0.05. What is the power of the test when the real parameters are µ = 1 and σ = 2,
for example?
Imagine what would happen if we conducted the test tomorrow, with the data
X1, . . . , Xn coming from N(1, 4). We would form the test statistic T = X̄/(S/√n) and
then calculate the corresponding P-value for this right one-sided test. In R we would
do it via: 1 - pt(T,df=n-1). Finally, we would reject the null hypothesis if the P-value
is less than 0.05. So let’s do this many times on a computer and see how many times
we correctly reject the null hypothesis. In the program below we use a sample size of
n = 5.
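The program itself is not reproduced in these notes; a minimal sketch (assuming, as in the text, µ = 1, σ = 2, n = 5, α = 0.05, and with a power.t.test call as its last lines) could look as follows:

n = 5; N = 10^4; alpha = 0.05
reject = logical(N)
for (i in 1:N) {
  x = rnorm(n, mean = 1, sd = 2)          # data generated under the alternative
  tstat = mean(x) / (sd(x) / sqrt(n))     # one-sample t statistic for H0: mu = 0
  pval = 1 - pt(tstat, df = n - 1)        # right one-sided P-value
  reject[i] = (pval < alpha)
}
mean(reject)                              # estimated power (about 0.24)
# exact power calculation; type = "one.sample" is assumed here
power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")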
The above example illustrates that the power depends on various factors: the sam-
ple size, the significance level, as well as µ and σ. In fact (you can verify it yourself)
the power in the above code only depends on µ/σ, which is sometimes called the
“signal-to-noise ratio”. In R, we can make power calculations via the power.t.test
function. Here is the output of the last lines in the code above:
n = 5
delta = 1
sd = 2
sig.level = 0.05
power = 0.2389952
alternative = one.sided
A power analysis, as carried out above, allows us to choose a sample size large
enough to determine some minimal effect, as long as we have an idea of the standard
deviation. The latter can be estimated with a trial run, for example.
data: 8 and 10
number of successes = 8, number of trials = 10, p-value = 0.05469
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.4930987 1.0000000
sample estimates:
probability of success
0.8
[1] 0.02888979
The normal approximation thus gives a noticeably smaller P-value (0.0289) than the
exact test (0.0547). Hence, for small sample sizes it is recommended to use binom.test.
Caffeinated   17  22  21  16   6  −2  27  15  16  20
Decaf          4  10   7  −9   5   4   5   7   6  12

[Figure: box plots of the change in pulse beat (pb) for the decaf (No) and caffeinated (Yes) groups.]
Let us go through the 7 steps of a hypothesis test. First, we could model the data
as coming from different normal distributions. Let X1 , . . . , Xm (with m = 10) be the
change in heartbeat for the caffeinated Diet cola (treatment) group and let Y1 , . . . , Yn
(with n = 10) be the change in heartbeat for the decaf Diet cola (control) group. We
assume that
• X1, . . . , Xm ∼iid N(µX, σ²X),
• Y1, . . . , Yn ∼iid N(µY, σ²Y),
• X1, . . . , Xm, Y1, . . . , Yn are independent.
We wish to test H0 : µX = µY against H1 : µX > µY, and use the test statistic
$$T = \frac{\bar X - \bar Y}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}},$$
which under H0 has approximately a Student t_df distribution, where df is given by
$$\mathrm{df} = \frac{\left(\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}\right)^2}{\dfrac{1}{m-1}\left(\dfrac{S_X^2}{m}\right)^2 + \dfrac{1}{n-1}\left(\dfrac{S_Y^2}{n}\right)^2}, \qquad (7.2)$$
which we already encountered in (6.10). Even for small m and n this approximation is
very accurate. This two-sample t-test is attributed to Bernard Welch. This completes
steps 1–4. Let us finish the remaining steps of the test by using R as a calculator.
x = c(17,22,21,16,6,-2,27,15,16,20)
y = c(4,10,7,-9,5,4,5,7,6,12)
mx = mean(x)                         # sample means
my = mean(y)
sx = sd(x)                           # sample standard deviations
sy = sd(y)
a = sx^2/10
b = sy^2/10
t = (mx - my)/sqrt(a + b)            # test statistic
df = (a + b)^2/(a^2/9 + b^2/9)       # degrees of freedom, as in (7.2)
pval = 1 - pt(t,df=df)               # right one-sided P-value
cat("t = ", t, ", df =", df, ", pval =", pval)  # print the values
This gives the output (using the cat (for concatenate) function):
t = 3.37521 , df = 15.74042 , pval = 0.001965818
We conclude that there is strong evidence that the caffeine has an effect on the change
in pulse beat.
Having defined x and y as in the above code, we can obtain the same results by
using the t.test function:
> t.test(x,y, alternative="greater")
data: x and y
t = 3.3752, df = 15.74, p-value = 0.001966
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
5.159642 Inf
sample estimates:
mean of x mean of y
15.8 5.1
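For comparison, the classical two-sample t-test that assumes equal variances is obtained by adding var.equal=TRUE; a call of this form produces the output below (a sketch):

> t.test(x, y, alternative = "greater", var.equal = TRUE)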
data: x and y
t = 3.3752, df = 18, p-value = 0.001686
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
5.202718 Inf
sample estimates:
mean of x mean of y
15.8 5.1
We see that the P-value is slightly smaller under the assumption of equal variances,
but that in essence we come to the same conclusions.
Paired Data
When conducting a two-sample t-test, it is important to ascertain that the random vari-
ables are not paired. Paired data often arise in “before–after” experiments or in repli-
cated experiments involving the same subjects, as in the following example.
Example 7.7 (Paired Lab Data) We wish to compare the results from two labs for
a specific examination. Both labs made the necessary measurement on the same fifteen
patients.
In this case the measurements between the groups are not independent, as the measure-
ments are conducted on the same patient. For example, both labs report high measure-
ments (29 and 28) for patient 10, and both report low measurements (8 and 10)
for patient 6.
The paired t-test works with the differences Di between the paired measurements.
Under H0 (no difference in means),
$$T = \frac{\bar D}{S/\sqrt{n}} \sim t_{n-1},$$
with $\bar D = \frac{1}{n}\sum_{i=1}^n D_i$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^n (D_i - \bar D)^2$.
Example 7.8 (Paired Lab Data (Continued)) To use t.test on the paired lab data,
we need to set the argument paired=TRUE:
> t.test(lab1,lab2,paired=TRUE)
Paired t-test
Since the P-value is rather high (0.1) there is not enough evidence to conclude that
the two labs give different results on average.
Example 7.9 (Are ABC Viewers More Left-wing?) A politician believes that au-
dience members of the ABC news are in general more left wing than audience mem-
bers of a commercial news broadcast. A poll of two-party preferences is taken. Of
seventy ABC viewers, 40 claim left wing allegiance, while of 100 commercial station
viewers, 50 claim left wing allegiance. Is there any evidence to support the politician’s
claim?
Our model is as follows. Let X be the number of left-wing ABC viewers out of
m = 70, and let Y be the number of left-wing “commercial” viewers out of n = 100.
We assume that X and Y are independent, with X ∼ Bin(m, pX ) and Y ∼ Bin(n, pY ), for
some unknown pX and pY . We wish to test H0 : pX = pY against H1 : pX > pY .
Since m and n are fairly large here, we proceed by using the central limit theorem
(CLT), analogously to Sections 6.3.4 and 6.3.5. Let P̂X := X/m and P̂Y := Y/n be
the empirical proportions. By the CLT, P̂X has approximately a N(pX, pX(1 − pX)/m)
distribution, and P̂Y has approximately a N(pY, pY(1 − pY)/n) distribution. It follows
that
$$\frac{\hat P_X - \hat P_Y}{\sqrt{\dfrac{p_X(1-p_X)}{m} + \dfrac{p_Y(1-p_Y)}{n}}}$$
has approximately a standard normal distribution. Under H0 we have pX = pY = p for
some unknown p, which we estimate by the pooled proportion P̂ = (X + Y)/(m + n).
Replacing pX and pY by P̂ gives the test statistic
$$Z = \frac{\hat P_X - \hat P_Y}{\sqrt{\hat P(1-\hat P)\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}, \qquad (7.5)$$
which under H0 has approximately a standard normal distribution.
Our general formulation for the two-sample binomial test (also called the test for
proportions) is as follows. First, the model is:
• X ∼ Bin(m, pX) and Y ∼ Bin(n, pY), for unknown pX and pY,
• X and Y independent.
Example 7.10 (Are ABC Viewers More Left-wing? (Continued)) Returning to Ex-
ample 7.9, from our data we have the estimates p̂X = 40/70, p̂Y = 50/100, and
$$\hat p = \frac{40 + 50}{70 + 100} = \frac{90}{170}.$$
This gives a P-value of PH0 (Z > 0.9183) ≈ 0.1792 (in R type 1 - pnorm(0.9183)),
so there is no evidence to support the politician’s claim.
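As a check, the test statistic and P-value can also be computed directly in R (a small sketch using the numbers of this example):

x = 40; m = 70; y = 50; n = 100
pX = x/m; pY = y/n
p = (x + y)/(m + n)                        # pooled estimate of the proportion
z = (pX - pY)/sqrt(p*(1 - p)*(1/m + 1/n))  # test statistic (7.5); approx. 0.9183
1 - pnorm(z)                               # P-value; approx. 0.1792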
As for the one-sample test for proportions, we can also use the R function prop.test
to compare two proportions:
> prop.test(x=c(40,50),n=c(70,100),alternative="greater",correct=F)
Note that, as expected, we obtain the same P-value. However, the test statistic reported
here is the square of the one we used before (0.9183² = 0.8433). The function also
provides the sample proportions p̂X and p̂Y.
CHAPTER 8
REGRESSION
8.1 Introduction
Francis Galton observed in an article in 1889 that the heights of adult offspring are,
on the whole, more “average” than the heights of their parents. Galton interpreted this
as a degenerative phenomenon, using the term regression to indicate this “return to
mediocrity”. Karl Pearson continued Galton’s original work and conducted compre-
hensive studies comparing various relationships between members of the same family.
Figure 8.1 depicts the measurements of the heights of 1078 fathers and their adult sons
(one son per father).
The average height of the fathers was 67 inches, and of the sons 68 inches. Because
sons are on average 1 inch taller than the fathers we could try to “explain” the height of
the son by taking the height of his father and adding 1 inch. However, the line y = x+1
(dashed) does not seem to predict the height of the sons as accurately as the solid line
in Figure 8.1. This line has a slope less than 1, and demonstrates Galton’s “regression”
effect. For example, if a father is 5% taller than average, then his son will be on the
whole less than 5% taller than average.
Regression analysis is about finding relationships between a response variable
which we would like to “explain” via one or more explanatory variables. In regression,
the response variable is usually a quantitative (numerical) variable.
[Figure 8.1: Scatter plot of the heights of the fathers (x-axis, in inches) against the heights of their sons (y-axis, in inches), with the line y = x + 1 (dashed) and the regression line (solid).]
$$E(Y_i) = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, n. \qquad (8.1)$$
The line
$$y = \beta_0 + \beta_1 x \qquad (8.2)$$
is called the regression line. To completely specify the model, we need to designate
the joint distribution of Y1, . . . , Yn. The most common linear regression model is given
next. The adjective “simple” refers to the fact that a single explanatory variable is used
to explain the response.
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (8.3)$$
where ε1, . . . , εn ∼iid N(0, σ²).
This formulation makes it even more obvious that we view the responses as ran-
dom variables which would lie exactly on the regression line, were it not for some
“disturbance” or “error” term (represented by the {εi }).
To make things more concrete let us consider the student survey dataset stored in
the dataset studentsurvey.csv, which can be found on Blackboard. Suppose we
wish to investigate the relation between the shoe size (explanatory variable) and the
height (response variable) of a person.
First we load the data:
> rm(list=ls()) # good practice to clear the workspace
> survey = read.csv("studentsurvey.csv")
> names(survey) # check the names
Figure 8.2: Scatter plot of height (in cm) against shoe size (in cm).
We observe a slight increase in the height as the shoe size increases, although this
relationship is not very clear.
The values for β̂0 and β̂1 that minimize the least-squares criterion (8.4) are:
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} \qquad (8.5)$$
and
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x. \qquad (8.6)$$
Proof: We seek to minimize the function $g(a,b) = \mathrm{SSE} = \sum_{i=1}^n (y_i - a - b x_i)^2$ with
respect to a and b. To find the optimal a and b, we take the derivatives of SSE with
respect to a and b and set them equal to 0. This leads to two linear equations:
$$\frac{\partial}{\partial a}\sum_{i=1}^n (y_i - a - bx_i)^2 = -2 \sum_{i=1}^n (y_i - a - b x_i) = 0$$
and
$$\frac{\partial}{\partial b}\sum_{i=1}^n (y_i - a - bx_i)^2 = -2 \sum_{i=1}^n x_i\,(y_i - a - b x_i) = 0.$$
From the first equation, we find $\bar y - a - b\bar x = 0$ and hence $a = \bar y - b \bar x$. We substitute this
expression for a in the second equation and get (omitting the factor −2):
$$\sum_{i=1}^n x_i (y_i - a - b x_i) = \sum_{i=1}^n x_i\,(y_i - \bar y + b \bar x - b x_i) = \sum_{i=1}^n x_i y_i - \bar y \sum_{i=1}^n x_i + b\left( n \bar x^2 - \sum_{i=1}^n x_i^2 \right).$$
Setting this to zero and solving for b gives $b = \dfrac{\sum_{i=1}^n x_i y_i - n\bar x \bar y}{\sum_{i=1}^n x_i^2 - n\bar x^2}$, which is exactly the
expression for β̂1 in (8.5). Replacing a with β̂0 and b with β̂1, we have completed the proof.
If we replace in (8.5) and (8.6) the values yi and ȳ with the random variables Yi and
Ȳ, then we obtain the estimators of β1 and β0. Think of these as the parameters for the
line of best fit that we would obtain if we were to carry out the experiment tomorrow.
When dealing with parameters from the Greek alphabet, such as β, it is custom-
ary in the statistics literature — and you might better get used to it — to use the
same notation (Greek letter) for the estimate and the corresponding estimator,
both indicated by the “hat” notation: β̂. Whether β̂ is to be interpreted as random
(estimator) or fixed (estimate) should be clear from the context.
Theorem 8.2: Properties of the Estimators β̂0 and β̂1
Both β̂0 and β̂1 have a normal distribution. Their expected values are
$$E(\hat\beta_0) = \beta_0 \quad\text{and}\quad E(\hat\beta_1) = \beta_1, \qquad (8.7)$$
and their variances are
$$\mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right) \qquad (8.8)$$
and
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}. \qquad (8.9)$$
It turns out that we need to scale this slightly, to $\frac{1}{n-2}\sum_i e_i^2$, to get an unbiased estimator
of σ². This is sometimes called the mean squared error (MSE) or residual squared
error (RSE).
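The model can be fitted with lm; the object name model1 is used in the remainder of this section (a sketch consistent with the Call shown below):

> model1 = lm(height ~ shoe, data = survey)
> model1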
Call:
lm(formula = height ~ shoe, data = survey)
Coefficients:
(Intercept) shoe
145.778 1.005
The above R output gives the least squares estimates of β0 and β1. For the above
example, we get β̂0 = 145.778 and β̂1 = 1.005. We can now draw the regression line
on the scatter plot, using:
> xyplot(height~shoe,data=survey,type = c("p","r"),
cex=1.2,pch=16, col.line="red", lwd =3)
Figure 8.3: Scatter plot of height (in cm) against shoe size (in cm), with the fitted line.
The function lm performs a complete analysis of the linear model. The function
summary provides a summary of the calculations:
Call:
lm(formula = height ~ shoe, data = survey)
Residuals:
Min 1Q Median 3Q Max
-18.9073 -6.1465 0.1096 6.3626 22.6384
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 145.7776 5.7629 25.296 < 2e-16 ***
shoe 1.0048 0.2178 4.613 1.2e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary object can be stored, say as sumr1 = summary(model1), and its components
accessed via the dollar ($) construction. For example, the following code extracts the P-value for the slope:
> sumr1$coefficients[2,4]
[1] 1.1994e-05
where c is a constant that depends on n and on the confidence level (it is the 1 − α/2
quantile of the tn−2 distribution). Recall that MSE estimates the variance σ² of the
model error.
If we wish to predict the value of Y (not just its expectation) for a given value of x,
then we have two sources of variation:
1. Y itself is a random variable, which is normally distributed with variance σ²;
2. the position of the regression line is itself estimated from the data, so β̂0 and β̂1 carry estimation uncertainty.
We can also predict the height of a person whose shoe size is 30 to lie in the
following interval, with probability 0.95.
> predict(model1,data.frame(shoe=30),interval="prediction")
> par(mfrow=c(1,2))
> plot(model1,1:2)
[Figure: diagnostic plots for model1: residuals against fitted values, and normal QQ-plot of the standardized residuals.]
Examining the residuals as a function of predicted values, we see that the residuals are
correctly spread, symmetrical about the x axis: the conditions of the model seem valid.
Note that the instruction plot(model1) can draw four plots; some of these are for
outlier detection.
To illustrate the idea, let us go back to the student survey data set survey. Instead
of “explaining” the student height via their shoe size, we could include other quantita-
tive explanatory variables, such as the weight (stored in weight). The corresponding
R formula for this model would be
height∼shoe + weight
The corresponding model is
$$\text{height} = \beta_0 + \beta_1\,\text{shoe} + \beta_2\,\text{weight} + \varepsilon,$$
where ε is a normally distributed error term with mean 0 and variance σ². The model
thus has 4 parameters.
Before analysing the model we present a scatter plot of all pairs of variables, using
the R function pairs.
> pairs(height ∼ shoe + weight, data = survey)
[Figure: pairwise scatter plots of height, shoe size, and weight.]
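A sketch of the fitting commands (the object name model2 is used below; the Call line in the output suggests the variables were accessed directly, e.g. after attaching the data):

> model2 = lm(height ~ shoe + weight, data = survey)
> summary(model2)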
Call:
lm(formula = height ~ shoe + weight)
Residuals:
Min 1Q Median 3Q Max
-21.4193 -4.0596 0.1891 4.8364 19.5371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.2677 5.2473 25.207 < 2e-16 ***
shoe 0.5304 0.1962 2.703 0.0081 **
weight 0.3744 0.0572 6.546 2.82e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results returned by summary are presented in the same fashion as for simple linear
regression. The individual Student tests indicate that:
• shoe size is linearly associated with student height, after adjusting for weight
(P-value = 0.0081). At the same weight, an increase of one cm in shoe size
corresponds to an increase of 0.53 cm in average student height;
• weight is linearly associated with student height, after adjusting for shoe size
(P-value = 2.82 × 10⁻⁹). At the same shoe size, an increase of one kg in
weight corresponds to an increase of 0.3744 cm in average student height.
Confidence intervals for regression parameters can be found with confint:
> confint(model2)
2.5 % 97.5 %
(Intercept) 121.8533072 142.6821199
shoe 0.1410087 0.9198251
weight 0.2608887 0.4879514
Confidence and prediction intervals can be obtained via the predict function.
Suppose we wish to predict the height of a person with shoe size 30 and weight 75 kg.
A confidence interval for the expected height is obtained as follows (notice that we can
abbreviate "confidence" to "conf").
> predict(model2,data.frame(shoe=30,weight=75),interval="conf")
The residuals are correctly spread, symmetrical about the x axis: the conditions of the
model seem valid. Moreover, the QQ-plot indicates no extreme departure from the
normality.
CHAPTER 9
ANALYSIS OF VARIANCE
9.1 Introduction
Analysis of variance (ANOVA) is used to study the relationship between a quantita-
tive variable of interest and one or several categorical variables. As in regression, the
variable of interest is called the response variable (in some fields confusingly called
“dependent variable”) and the other variables are called explanatory variables (or “in-
dependent variables”). Recall (see Section 2.2) that categorical variables take values in
a finite number of categories, such as yes/no, green/blue/brown, and male/female. In R,
such variables are called factors. They often arise in designed experiments: controlled
statistical experiments in which the aim is to assess how a response variable is affected
by one or more factors tested at several levels. A typical example is an agricultural
experiment where one wishes to investigate how the yield of a food crop depends on
two factors: (1) pesticide, at two levels (yes and no), and (2) fertilizer, at three levels
(low, medium, and high). Treatment pairs were assigned to plots via randomization.
Table 9.1 gives an example of data that is produced in such an experiment. Here three
responses (crop yield) are collected from each of the six different combinations of
levels.
Note that the pesticide factor only has two levels. To investigate whether using
pesticide is effective (produces increased crop yield) we could simply carry out a two-
sample t-test; see Section 7.5. Let us carry out the usual steps for a statistical test
here:
1. The model is a two-sample normal model: the yields for the plots with pesticide
are iid N(µ1, σ²), the yields for the plots without pesticide are iid N(µ2, σ²), and all
yields are independent.
2. H0 is the hypothesis that there is no difference between the groups; that is, µ1 =
µ2. The alternative hypothesis is that there is a difference: µ1 ≠ µ2.
Note that the above t-test does not tell us whether the pesticide was successful (that
is, gives a higher average yield). Think how you would assess this.
What if we consider instead whether fertilizer “explains” crop yield? For this factor
we have three levels: low, medium, and high, so a two-sample t-test no longer
works. Nevertheless, we would like to carry out a similar analysis. Steps 1 and 2
are easily adapted:
1. The model is a three-sample normal model. Let Y1 , Y2 , Y3 , Y10 , Y11 , Y12 ∼iid N(µ1 , σ2 )
be the crop yields with low fertilizer, Y4 , Y5 , Y6 , Y13 , Y14 , Y15 ∼iid N(µ2 , σ2 ) be the
crop yields with medium fertilizer, and Y7 , Y8 , Y9 , Y16 , Y17 , Y18 ∼iid N(µ3 , σ2 ) be
the crop yield with high fertilizer. We assume equal variances for all three
groups, and that all variables are independent of each other.
2. H0 is the hypothesis that there is no difference between the groups; that is, µ1 =
µ2 = µ3 . The alternative hypothesis is that there is a difference.
The question is now how to formulate a test statistic (a function of the data) that
makes it easy to distinguish between the null and alternative hypothesis. This is where
ANOVA comes in. It will allow us to compare the means of any number of levels within
a factor. Moreover, we will be able to explain the response variable using multiple fac-
tors at the same time; for example, how the crop yield depends on both pesticide
and fertilizer.
The following code reads the data and produces Figure 9.1. We have used a few
tricks in this code that you might find useful to know. Firstly, we plotted the levels in
the order from "Low" to "High". This is done in Line 5. Without this line, the levels
would be taken in alphabetical order, starting with "High". Secondly, we indicated
what the Pesticide level was for the data in each Fertilizer group. This is done by
specifying the plotting characters (numbers) in Line 6, for each data point; in
Line 7, we use these characters via the pch option. Note that the rep function
replicates numbers or strings; in this case nine 4s (producing crosses) and nine 1s
(producing circles).
1 crop = read.csv("cropyield.csv")
2 library(lattice)
3 # reorder the levels from low to high
4 crop$Fertilizer = factor(crop$Fertilizer,
5 levels = c("Low", "Medium", "High"))
6 chs = c(rep(4,9),rep(1,9)) # define two groups of plotting characters
7 stripplot(Yield∼Fertilizer,pch=chs,cex=1.5,data=crop,xlab="Fertilizer")
8 #stripplot(Yield∼Fertilizer,groups=Pesticide,cex=1.5,data=crop,
9 # xlab="Fertilizer")
Figure 9.1: Strip plot of crop yield against fertilizer level. Whether pesticide was
applied is also indicated (circle for Yes, cross for No).
9.2.1 Model
Consider a response variable which depends on a single factor with d levels, denoted
1, . . . , d. Let us use the letter ℓ to indicate a level; so ℓ ∈ {1, . . . , d}. Within each level
ℓ there are nℓ independent measurements of the response variable. The total number
of measurements is thus n = n1 + · · · + nd . As in the crop yield example, think of the
response data as a single column (of size n), and the factor (explanatory variable) as
another column (with entries in {1, . . . , d}). An obvious model for the data is that the
{Yi } are assumed to be independent and normally distributed with a mean and variance
which depend only on the level. Such a model is simply a d-sample generalization of
the two-sample normal model in Example 5.8. To be able to analyse the model via
ANOVA one needs, however, the additional model assumption that the variances are
all equal; that is, they are the same for each level. Using the indicator notation
$$I(x = \ell) = \begin{cases} 1 & \text{if } x = \ell, \\ 0 & \text{if } x \neq \ell, \end{cases}$$
we can write this model in a very similar way to the regression model in Definition 8.1:
where µ1 is the mean effect of the reference level (level 1 in this case) and αℓ = µℓ − µ1
is the incremental effect of level ℓ, relative to the reference level. The latter approach
is used in R.
9.2.2 Estimation
The model (9.2) has d + 1 unknown parameters: µ1, . . . , µd, and σ². Each µℓ can be
estimated exactly as for the 1-sample normal model, by only taking into account the
data in level ℓ. In particular, the estimator of µℓ is the sample mean within the ℓ-th
level:
$$\hat\mu_\ell = \bar Y_\ell = \frac{1}{n_\ell}\sum_{i=1}^n Y_i\, I(x_i = \ell), \qquad \ell = 1, \ldots, d.$$
To estimate σ², we should utilize the fact that all {Yi} are assumed to have the same
variance σ². So, as in the two-sample normal model case, we should pool our data
and not just calculate, say, the sample variance of the first level only. The model (9.1)
assumes that the errors {εi} are independent and normally distributed, with a constant
variance σ². If we knew each $\varepsilon_i = Y_i - \sum_{\ell=1}^d \mu_\ell\, I(x_i = \ell)$, we could just take the sample
variance $\sum_i \varepsilon_i^2/(n-1)$ to estimate σ² unbiasedly. Unfortunately, we do not know the
{µℓ}. However, we can estimate each µℓ with $\bar Y_\ell$. This suggests that, similar to the
regression case, we replace the unknown true errors εi with the residual errors (or
simply residuals), which are here given by
$$e_i = Y_i - \sum_{\ell=1}^d \bar Y_\ell\, I(x_i = \ell).$$
A sensible estimator for σ² is therefore $\sum_{i=1}^n e_i^2/(n-1)$. However, this turns out not to be un-
biased. An unbiased estimator is obtained by dividing the sum of the squared residuals,
$\mathrm{SSE} = \sum_{i=1}^n e_i^2$, by n − d:
$$\hat\sigma^2 = \frac{\mathrm{SSE}}{n-d}. \qquad (9.4)$$
The latter is also called the mean squared residual error (MSE).
Let us denote this by SSF (Sum of Squares due to the Factor). It measures the variabil-
ity between the different levels of the factor. If we further abbreviate SSF/(d − 1) to
MSF (mean square factor) and SSE/(n − d) to MSE (mean square error), then we can
write our test statistic as
$$F = \frac{\mathrm{MSF}}{\mathrm{MSE}}.$$
The test statistic F thus compares the variability between levels with the variability
within the levels. We reject H0 for large values of F (a right one-sided test). To actually
carry out the test we need to know the distribution of F under H0, which is given in the
following theorem, the proof of which is beyond a first-year course.
Theorem 9.1
Under H0 , F = MSF/MSE has an F(d − 1, n − d) distribution.
This F-distribution is named after R.A. Fisher — one of the founders of modern
statistics. So, in addition to the Student’s t distribution and the χ2 distribution this
is the third important distribution that appears in the study of statistics. Again, this
is a family of distributions, this time depending on two parameters (called, as usual,
degrees of freedom). We write F(df1, df2) for an F distribution with degrees of freedom
df1 and df2. Figure 9.2 gives a plot of various pdfs of this family. We used a similar
script as for the plotting of Figure 6.2. Here is the beginning of the script — you can
work out the rest.
> curve(df(x,df1=1,df2=3),xlim=c(0,8),ylim=c(0,1.5),ylab="density")
Figure 9.2: The pdfs of F distributions with various degrees of freedom (df).
It is outside the scope of this first-year course to discuss all the properties of the F
distribution (or indeed of the t and the χ² distributions), but the thing to remember is that it is just a
probability distribution, like the normal and uniform distributions, and we can calculate pdfs,
cdfs, and quantiles exactly as for the normal distribution, using the “d, p, q, r” con-
struction, as in Table 4.1.
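For example, for the F(4, 25) distribution used later in this chapter (a quick illustration; the numbers are arbitrary):

> df(1, df1 = 4, df2 = 25)      # pdf evaluated at 1
> pf(3.896, df1 = 4, df2 = 25)  # cdf: P(F <= 3.896)
> qf(0.95, df1 = 4, df2 = 25)   # 0.95 quantile
> rf(3, df1 = 4, df2 = 25)      # three random draws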
Fortunately, software can do all the calculations for us and summarize the results
in an ANOVA table. For the one-factor case, it is of the form given in Table 9.2.
Table 9.3: Cold sore healing times for 5 different treatments. T 1 is a placebo treatment.
T1 T2 T3 T4 T5
5 4 6 7 9
8 6 4 4 3
7 6 4 6 5
7 3 5 6 7
10 5 4 3 7
8 6 3 5 6
The aim here is to compare the mean healing times. The times in the placebo
column seem a little higher. But is this due to chance or is there a real difference? To
answer this question, let us first load the data into R.
> x = data.frame(Placebo=c(5,8,7,7,10,8),T2=c(4,6,6,3,5,6),
+ T3=c(6,4,4,5,4,3),T4=c(7,4,6,6,3,5),T5=c(9,3,5,7,7,6))
The first important point to note is that while Table 9.3 (and the data frame x) is
a perfectly normal table (and data frame) it is in the wrong format for an ANOVA
study. Remember (see Chapter 2) that the measurements (the healing times) must be
in a single column. In this case we should have a table with only two columns (apart
from the index column): one for the response variable (healing time) and one for the
factor (treatment). The factor here has 5 levels (T1, . . . , T5). An example of a correctly
formatted table is Table 9.1.
We need to first “stack” the data using the stack function. This creates a new data
frame with only two columns: one for the healing times and the other for the factor
(at levels T 1 , . . . , T 5 ). The default names for these columns are values and ind. We
rename them to times and treatment.
> coldsore = stack(x)
> names(coldsore) = c("times", "treatment")
The second important point is that both columns in the reformatted data frame
coldsore now have the correct type (check with str(coldsore)): the response is a
quantitative variable (numerical) and the treatment is a categorical variable (factor) at
five levels.
We can do a brief descriptive analysis, giving a data summary for the healing times
within each of the factor levels. In R this can be done conveniently via the function
tapply, which applies a function to a table.
> tapply(coldsore$times,coldsore$treatment,summary)
This applies the summary function to the vector times, grouped into treatment lev-
els. The output is as follows.
$Placebo
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.0 7.0 7.5 7.5 8.0 10.0
$T2
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 4.25 5.50 5.00 6.00 6.00
$T3
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 4.000 4.000 4.333 4.750 6.000
$T4
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 4.250 5.500 5.167 6.000 7.000
$T5
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.250 6.500 6.167 7.000 9.000
In particular, the level means (the ȳℓ, ℓ = 1, . . . , 5) are given in the fourth column.
A boxplot of times versus treatment gives more information:
> library(lattice)
> bwplot(times~treatment, data = coldsore, xlab="treatment")
[Figure: box plots of the healing times for each treatment level.]
Using a 1-factor ANOVA model, we wish to test the hypothesis H0 that all treat-
ment levels have the same means versus the alternative that this is not the case. Our
test statistic is F = MSF/MSE, which, if H0 is true, we know has an F distribution;
see Theorem 9.1. In this case d = 5 and n = 30, so F has an F(4, 25) distribution
under H0. The next step is to evaluate the outcome f of F based on the observed data,
and then to calculate the P-value. Since we have a right one-sided test (we reject H0 for
large values of F), the P-value is P(F > f), where F ∼ F(4, 25). Fortunately, R can do
all these calculations for us, using for instance the function aov. All we need to do is
specify the R formula.
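A sketch of the call (the object name my.aov matches the one used below; the exact form of the original call may differ):

> my.aov = aov(times ~ treatment, data = coldsore)
> summary(my.aov)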
The values listed are the parameters (degrees of freedom, Df) for the F distribution
(4 and 25), the sum of squares of the treatment SSF = 36.467 and the residuals SSE =
58.500, the corresponding mean squares MSF = 9.1167 and MSE = 2.3400 and,
finally, the outcome of the test statistic f = 3.896, with corresponding P-value 0.01359,
which is quite small. There is thus fairly strong evidence to believe that the treatments
have an effect.
> plot(my.aov)
[Figure 9.4: residuals against fitted values and normal QQ-plot of the standardized residuals for the fitted model aov(times ~ treatment).]
R actually returns four diagnostic plots, but we have listed only two in Figure 9.4.
Examining the residuals as a function of predicted values, the residuals are correctly
spread, symmetrical about the x-axis: the conditions of the model (i.e., zero mean
and constant variance) seem valid. The normality of the residuals is indicated by the
observed straight line in the qq-plot.
9.3.1 Model
Consider a response variable which depends on two factors. Suppose Factor 1 has d1
levels and Factor 2 has d2 levels. Within each pair of levels (ℓ1, ℓ2) there are $n_{\ell_1 \ell_2}$ repli-
cations, so that the total number of observations is $n = \sum_{\ell_1=1}^{d_1}\sum_{\ell_2=1}^{d_2} n_{\ell_1 \ell_2}$. A direct gener-
alization of (9.2) gives the following model.
$$Y_i = \sum_{\ell_1=1}^{d_1}\sum_{\ell_2=1}^{d_2} \mu_{\ell_1,\ell_2}\, I(x_{i,1} = \ell_1,\, x_{i,2} = \ell_2) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (9.8)$$
where ε1, . . . , εn ∼iid N(0, σ²).
This is just saying that the response variables are independent of each other and that
for each pair of explanatory variables (x1, x2) = (ℓ1, ℓ2), the corresponding response Y
has a N(µℓ1,ℓ2, σ²) distribution. The model thus has d1d2 + 1 parameters.
Note that the variances of the responses are all assumed to be the same (equal to
σ²). To obtain a “factor effects” representation, we can reparameterize the model as
$$Y = \mu_{1,1} + \sum_{\ell_1=2}^{d_1} \alpha_{\ell_1} I(x_1 = \ell_1) + \sum_{\ell_2=2}^{d_2} \beta_{\ell_2} I(x_2 = \ell_2) + \sum_{\ell_1=2}^{d_1}\sum_{\ell_2=2}^{d_2} \gamma_{\ell_1,\ell_2}\, I(x_1 = \ell_1,\, x_2 = \ell_2) + \varepsilon. \qquad (9.9)$$
The parameter µ1,1 is the “reference” mean response, with both factors at level 1. For
any explanatory pair (ℓ1, ℓ2) that is not the reference pair (1,1) we add to this reference
mean response:
• an incremental effect αℓ1 due to Factor 1,
• an incremental effect βℓ2 due to Factor 2,
• an interaction effect γℓ1,ℓ2 due to both factors.
Notice that there are again d1d2 + 1 parameters. The advantage of the formulation
(9.9) is that we can consider “nested” models by setting some parameters to zero. For
example, if no interaction terms are included, we get the model
$$Y = \mu_{1,1} + \sum_{\ell_1=2}^{d_1} \alpha_{\ell_1} I(x_1 = \ell_1) + \sum_{\ell_2=2}^{d_2} \beta_{\ell_2} I(x_2 = \ell_2) + \varepsilon. \qquad (9.10)$$
The assumption that there is no interaction and that Factor 2 has no effect leads to the
model
$$Y = \mu_{1,1} + \sum_{\ell_1=2}^{d_1} \alpha_{\ell_1} I(x_1 = \ell_1) + \varepsilon, \qquad (9.11)$$
which is a 1-factor ANOVA model. The simplest model is the default normal model,
where neither of the factors has an effect:
$$Y = \mu_{1,1} + \varepsilon. \qquad (9.12)$$
Which of these models is most appropriate can be investigated via statistical tests.
9.3.2 Estimation
For the model (9.8), a natural estimator of µℓ1,ℓ2 is the sample mean of all the responses
at level ℓ1 of Factor 1 and level ℓ2 of Factor 2; that is,
$$\hat\mu_{\ell_1,\ell_2} = \bar Y_{\ell_1,\ell_2} = \frac{1}{n_{\ell_1 \ell_2}} \sum_{i=1}^n Y_i\, I(x_{i,1} = \ell_1,\, x_{i,2} = \ell_2).$$
For the factor effects representation (9.9) the parameters can be estimated in a similar
way. The reference mean is estimated via Ȳ1,1, as given above. The incremental effect
αℓ1 can be estimated via Ȳℓ1,• − Ȳ1,1, where Ȳℓ1,• is the average of all the {Yi} within
level ℓ1 of Factor 1. Similarly, βℓ2 can be estimated via Ȳ•,ℓ2 − Ȳ1,1, where Ȳ•,ℓ2 is the
average of all the {Yi} within level ℓ2 of Factor 2. Finally, γℓ1,ℓ2 is estimated by taking the
average of all responses at the level pair (ℓ1, ℓ2) and subtracting from this the estimates
for µ1,1, αℓ1 and βℓ2.
To estimate σ² we can reason similarly to the 1-factor case and consider the resid-
uals ei = Yi − Ŷi as our best guess of the true model errors, where Ŷi is the fitted value
for the i-th response. Specifically, if the i-th explanatory pair is (ℓ1, ℓ2), then Ŷi = Ȳℓ1,ℓ2.
Similar to (9.5) we have the unbiased estimator
$$\hat\sigma^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - d_1 d_2} = \frac{\sum_{i=1}^n e_i^2}{n - d_1 d_2}.$$
With this model in place, we wish to know:
• whether Factor 1 has an effect on the response variable,
• whether Factor 2 has an effect on the response variable,
• and whether there is an interaction effect between Factors 1 and 2 on the response
variable.
Following the usual steps for hypothesis testing, we need to formulate the ques-
tions above in terms of hypotheses on the model parameters. Let us take the model
formulation (9.3). Remember that the null hypothesis should contain the “conserva-
tive” statement and the alternative hypothesis contains the statement that we wish to
demonstrate. So, whether Factor 1 has an effect can be assessed by testing
where SSF1 measures the variability between the levels of Factor 1, SSF2 measures
the variability between the levels of Factor 2, SSF12 measures the variability due to
interaction between the factors, and SSE measures the residual variability (i.e., within
the levels).
As in the 1-factor ANOVA case, the test statistics for the above hypotheses are
quotients of the corresponding mean square errors, and have an F distribution with a
certain number of degrees of freedom. The various quantities of interest in an ANOVA
table are summarized in Table 9.4.
These data can be entered into R using the following script. The code also shows a few
“tricks of the trade”. The attach function makes the variables region, fertilizer, and
yield available without having to use the $ construction.
The function paste concatenates (joins) strings, after converting numbers into strings.
So, we can get the string "Region 1", for example. The function gl generates factors
by specifying the pattern of their levels.
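For example (a small illustration of paste and gl, not the actual crop-yield script):

> paste("Region", 1:4)
[1] "Region 1" "Region 2" "Region 3" "Region 4"
> gl(2, 3, labels = c("Low","High"))   # 2 levels, 3 replications each
[1] Low  Low  Low  High High High
Levels: Low High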
Alternatively, you could of course enter the data in a CSV file with appropriate
headers, and read the file into a data frame with read.csv.
We wish to study the effect of the type of fertilizer on the yield of the crop and
whether there is a significantly different yield between the four regions. There could
also be an interaction effect; for example, if a certain treatment works better in a spe-
cific region.
[Figure 9.5: Interaction plots of mean yield. Left: mean yield against fertilizer, with one curve per region. Right: mean yield against region, with one curve per fertilizer.]
These plots contain a lot of information. For example, the left figure makes it easier
to investigate the Fertilizer effect. We can observe that the mean yield is always better
with Fertilizer 2, whatever the region. A graph with horizontal lines would indicate
no effect of the Fertilizer factor. The figure on the right may indicate an effect of
the Region factor, as we can observe an increase of the mean yield from Region 1 to
Region 4, whatever the Fertilizer used.
If there is no interaction between the two factors, the effect of one factor on the
response variable is the same irrespective of the level of the second factor. This corre-
sponds to observing parallel curves on both plots in Figure 9.5. Indeed, the differences
between the black dotted curve (Region 1) and the red dotted curve (Region 2) in the left plot
represent the differential effect of Region 2 versus Region 1 for each Fertilizer. If
there is no interaction, these differences should be the same (i.e., parallel curves). Both
plots in Figure 9.5 might indicate an absence of interaction, as we can observe roughly parallel
curves. We will confirm this by testing the interaction effect in the next subsection.
We plotted the two interaction plots in two different ways. To find out about the
possible plotting parameters, type: ?interaction.plot and ?par.
ANOVA Table
Similar to the 1-factor ANOVA case, the R function aov provides the ANOVA table:
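Assuming yield, region and fertilizer are available as described above, a call along these lines produces the table with the interaction term (a sketch):

> summary(aov(yield ~ region * fertilizer))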
The P-value associated with the test of interaction is not significant (P-value =
0.99). This implies that the effect of fertilizer on yield is the same whatever the re-
gion. In this case, we perform an ANOVA without an interaction term, which makes
it easier to interpret the principal effects. The corresponding additive model is given in
(9.10):
$$Y = \mu_{1,1} + \sum_{\ell_1=2}^{d_1} \alpha_{\ell_1} I(x_1 = \ell_1) + \sum_{\ell_2=2}^{d_2} \beta_{\ell_2} I(x_2 = \ell_2) + \varepsilon.$$
Both P-values are significant, which indicates significant effects of both region and fer-
tilizer on crop yield.
When you have only one observation per combination of levels of the factors A
and B (i.e., nij = 1 for all i, j), you can only fit a two-way ANOVA without
interaction: aov(yield∼region+fertilizer).
Note that when there is interaction, we do not interpret the principal effects in
the ANOVA table output. Suppose we had found in our example a significant effect of
the interaction term. This would imply that the effect of fertilizer on yield can be different
depending on the region. For example, we may wish to know whether there is a fertilizer
effect in Region 1. To this end, we use the function subset, which selects only the data from
a given region, as in the sketch below.
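For example (a sketch; crop2 and aov.region1 are names we introduce here for illustration — the original script's names may differ):

> aov.region1 = aov(yield ~ fertilizer, data = subset(crop2, region == "Region 1"))
> summary(aov.region1)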
The test in this ANOVA table corresponds to ANOVA with one factor (fertilizer)
of the yield of wheat in Region 1. It does not take into account any information
from data in the other regions, which would allow for a better estimation of the
residual variance.
Validation of Assumptions
As in one-way ANOVA, we validate the model with a study of the residuals of the
underlying linear model.
> plot(my.aov)
[Figure: residuals against fitted values and normal QQ-plot for the two-factor ANOVA model.]
However, if the data size is large enough for each pair of factor levels, it is better
to check for normality in each subpopulation and for homoscedasticity.
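A random assignment of the 48 subplots can be generated with sample; a call of this form produced a permutation like the one below (a sketch):

> sample(1:48)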
[1] 14 38 19 40 42 2 23 37 44 18 41 17 25 21 4 30 8 43
[19] 10 28 36 46 47 29 16 26 12 13 6 3 39 24 22 45 1 7
[37] 33 27 34 11 31 9 35 32 48 15 20 5
Then we assign treatment 1 to the first 16 subplots, treatment 2 to the next 16, and
treatment 3 to the last 16. If we colour the subplots red, green, and yellow, this gives
the left panel in Figure 9.7.
Figure 9.7: Left: completely randomized design. Right: randomized block design,
with 4 treatments per block (row).
Now suppose that the soil conditions vary a lot within each column; for example,
the bottom row could lie at the bottom of a hill and the top row at the top of a hill.
Then the soil condition of the row in which the crop was planted could be an important
factor (but a nuisance factor) in explaining the crop yield. Complete randomization as
described above would alleviate the bias caused by the row soil conditions. However,
note that in the left panel of Figure 9.7 rows 1 and 3 only have two treatments of type
1 (red). If the rows indeed are a factor, it would be better (less variability in the data)
if we chose our design to block the treatments such that each block (i.e., row) has the
same number of each treatment. Of course we should still randomize within each block.
The right panel of Figure 9.7 shows such a randomized block design; a design of this
kind can be generated with code along the following lines.
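A minimal sketch (not the original script), assuming 4 row-blocks of 12 subplots and 4 plots of each of the 3 treatments per block:

design = matrix(NA, nrow = 4, ncol = 12)
for (r in 1:4) {
  design[r, ] = sample(rep(1:3, times = 4))  # random permutation within each block
}
design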
Figure 9.8: Non-random design where treatments are evenly distributed over both rows
and columns.
CHAPTER 10
LINEAR MODEL
Much of modeling in applied statistics is done via the versatile class of linear
models. We will give a brief introduction to such models, which requires some
knowledge of linear algebra (mostly vector/matrix notation). We will learn that
both linear regression and ANOVA models are special cases of linear models, so
that these can be analysed in a similar way (i.e., using the lm and aov functions).
In addition to estimation and hypothesis testing, we consider model selection to
determine which of many competing linear models is the most descriptive of the
data.
10.1 Introduction
The linear regression and ANOVA models in Chapters 8 and 9 are both special cases
of a (normal) linear model. Let Y be the column vector of response data, Y =
(Y1, . . . , Yn)⊤.
Example 10.1 (Simple Linear Regression) For the simple linear regression model
(see Definition 8.1) we have
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \quad\text{and}\quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}.$$
The situation for linear models in which the explanatory variables are factors is a
little more complicated, requiring the introduction of indicator variables. We explain
it with an example.
Example 10.2 (One-factor ANOVA) Consider a one-factor ANOVA model (see
Section 9.2) with 3 levels and 2 replications per level. Denote the responses by
$$\underbrace{Y_1, Y_2}_{\text{level 1}},\ \underbrace{Y_3, Y_4}_{\text{level 2}},\ \underbrace{Y_5, Y_6}_{\text{level 3}}.$$
Let µ1 be the mean (i.e., expected) response at level 1 — the reference level — and let
α2 and α3 be the incremental effects of the other two levels. We can write the vector Y
as
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \mu_1 \\ \mu_1 + \alpha_2 \\ \mu_1 + \alpha_2 \\ \mu_1 + \alpha_3 \\ \mu_1 + \alpha_3 \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
= \underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}}_{X}
\underbrace{\begin{pmatrix} \mu_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}}_{\beta} + \varepsilon.$$
If we denote for each response Y the level by x, then we can write
$$Y = \mu_1 + \alpha_2 I(x = 2) + \alpha_3 I(x = 3) + \varepsilon, \qquad (10.1)$$
where I(x = k) is, as in Chapter 9, an indicator variable that is 1 if x = k and 0
otherwise.
In R, all data from a general linear model is assumed to be of the form
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (10.2)$$
where xij is the j-th explanatory variable for individual i and the errors εi are inde-
pendent random variables such that E(εi) = 0 and Var(εi) = σ². In matrix form,
Y = Xβ + ε, with
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} \quad\text{and}\quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Thus, the first column can always be interpreted as an “intercept” parameter. The
corresponding R formula for this model would be
y ∼ x1 + x2 + · · · + xp .
Examples 10.1 and 10.2 show that it is important to treat quantitative (numbers) and
qualitative (factors) explanatory variables differently. Fortunately, R automatically in-
troduces indicator variables when the explanatory variable is a factor. We illustrate
this with a few examples in which we print the model matrix, obtained via the function
model.matrix.
In the first model variables x1 and x2 are both considered (by R) to be quantitative.
> my.dat = data.frame(y = c(10,9,4,2,4,9),
+ x1=c(7.4,1.2,3.1,4.8,2.8,6.5),x2=c(1,1,2,2,3,3))
> mod1 = lm(y~x1+x2,data = my.dat)
> print(model.matrix(mod1))
(Intercept) x1 x2
1 1 7.4 1
2 1 1.2 1
3 1 3.1 2
4 1 4.8 2
5 1 2.8 3
6 1 6.5 3
Suppose we want the second variable to be factorial instead. We can change the type
as follows, using the function factor. Observe how this changes the model matrix.
> my.dat$x2 = factor(my.dat$x2)
> mod2 = lm(y~x1+x2,data=my.dat)
> print(model.matrix(mod2))
[1] 1 1 2 2 3 3
Levels: 1 2 3
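(The two lines above display the converted factor itself.) With R's default treatment coding, the model matrix for mod2 should now contain an intercept column, the quantitative column x1, and indicator columns for levels 2 and 3 of the factor x2 — something along these lines (our illustration, with column names following R's default level naming; not output copied from the original):

  (Intercept)  x1 x22 x23
1           1 7.4   0   0
2           1 1.2   0   0
3           1 3.1   1   0
4           1 4.8   1   0
5           1 2.8   0   1
6           1 6.5   0   1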
By default, R sets the incremental effect αi of the first-named level (in alphabet-
ical order) to zero. To impose the model constraint $\sum_i \alpha_i = 0$ for a factor x, use
C(x,sum) in the R formula, instead of x.
$$e_i = y_i - \{\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_p x_{ip}\}$$
is the i-th residual error. Hence, the least squares criterion minimizes the sum of the
squares of the residual errors, denoted SSE. To estimate σ² we can, as in Chapters 8
and 9, take the mean square error
$$\hat\sigma^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - (p+1)},$$
where p + 1 is the number of components in the vector β.
For hypothesis testing, we can test whether certain parameters in β are zero or
not. This can be investigated with an analysis of variance, where the residual variance
of the full model is compared with the residual variance of the reduced model. The
corresponding test statistics have an F distribution under the null hypothesis. The exact
details are beyond a first introduction to statistics, but fortunately R provides all the
information necessary to carry out a statistical analysis of quite complicated linear
models.
If we are interested in a single parameter βi, we can also use the Student's t-test to
test whether that parameter is equal to zero or not; see (8.10). In a multivariate model,
the individual test statistic used in R follows a Student's t distribution with n − (p + 1)
degrees of freedom (p being the number of covariates in the model).
We can see the structure of the variables via str(birthwt). Check yourself that
all variables are defined as quantitative (int). However, the variables race, smoke,
ht, and ui should really be interpreted as qualitative (factors). To fix this, we could
redefine them with the function as.factor, similar to what we did in Chapter 2.
Alternatively, we could use the function factor in the R formula to let the program
know that certain variables are factors. We will use the latter approach.
For binary explanatory variables (that is, variables taking only the values 0 or 1) it does
not matter whether they are interpreted as factorial or numerical, as R
will return identical summary tables in both cases.
We can now investigate all kinds of models. For example, let us see if the mother’s
weight, her age, her race, and whether she smokes explain the baby’s birthweight.
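A sketch of the fitting command (consistent with the Call line in the output below):

> summary(lm(bwt ~ lwt + age + factor(race) + smoke, data = birthwt))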
Call:
lm(formula = bwt ~ lwt + age + factor(race) + smoke, data = birthwt)
Residuals:
Min 1Q Median 3Q Max
-2281.9 -449.1 24.3 474.1 1746.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2839.433 321.435 8.834 8.2e-16 ***
lwt 4.000 1.738 2.301 0.02249 *
age -1.948 9.820 -0.198 0.84299
factor(race)2 -510.501 157.077 -3.250 0.00137 **
factor(race)3 -398.644 119.579 -3.334 0.00104 **
smoke -401.720 109.241 -3.677 0.00031 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results returned by summary are presented in the same fashion as for simple
linear regression. Parameter estimates are given in the column Estimate.
The realizations of Student’s test statistics associated with the hypotheses H0 : βi =
0 and H1 : βi , 0 are given in column t value; the associated P-values are in column
Pr(>|t|). Residual standard error gives the estimate of σ and the number of
associated degrees of freedom n− p−1. The coefficient of determination R2 (Multiple
R-squared) and an adjusted version (Adjusted R-squared) are given, as are the re-
alization of Fisher’s global test statistic (F-statistic) and the associated P-value.
Fisher’s global F test is used to test the global joint contribution of all explana-
tory variables in the model for “explaining” the variability in Y. The null hypoth-
esis is H0 : β1 = β2 = . . . = β p = 0 (under the linear model, the p explanatory
variables give no useful information to predict Y). The assertion of interest is
H1 : at least one of the coefficients β j ( j = 1, 2, . . . , p) is significantly different
from zero (at least one of the explanatory variables is associated with Y after
adjusting for the other explanatory variables).
Given the result of Fisher's global test (P-value = 1.758 × 10⁻⁵), we can conclude
that at least one of the explanatory variables is associated with child weight at birth,
after adjusting for the other variables. The individual Student tests indicate that:
• mother weight is linearly associated with child weight, after adjusting for age,
race and smoking status, with risk of error less than 5% (P-value = 0.022). At
the same age, race status and smoking status, an increase of one pound in the
mother’s weight corresponds to an increase of 4 g of average child weight at
birth;
• the age of the mother is not significantly linearly associated with child weight
at birth when mother weight, race and smoking status are already taken into
account (P-value = 0.843);
• weight at birth is significantly lower for a child born to a mother who smokes,
compared to children born to non-smoker mothers of same age, race and weight,
with a risk of error less than 5 % (P-value=0.00031). At the same age, race and
mother weight, the child weight at birth is 401.720 g less for a smoking mother
than for a non-smoking mother;
• regarding the interpretation of the variable race, we recall that the fitted model
uses the group race=1 (white) as the reference. The estimate of −510.501 g then
represents the difference in average child birth weight between black mothers
(race=2) and white mothers (the reference group), and this result is significantly
different from zero (P-value = 0.001) in a model adjusted for mother weight,
mother age and smoking status. Similarly, the difference in average weight at
birth between group race = 3 and the reference group is −398.644 g and is sig-
nificantly different from zero (P-value = 0.00104), adjusting for mother weight,
mother age and smoking status.
Interaction
We can also include interaction terms in the model. Let us see whether there is any
interaction effect between smoke and age via the model
$$\text{bwt} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{smoke} + \beta_3\,(\text{age} \times \text{smoke}) + \varepsilon,$$
fitted with the R formula bwt ~ age * smoke:
Call:
lm(formula = bwt ~ age * smoke, data = birthwt)
Residuals:
Min 1Q Median 3Q Max
-2189.27 -458.46 51.46 527.26 1521.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2406.06 292.19 8.235 3.18e-14 ***
age 27.73 12.15 2.283 0.0236 *
smoke 798.17 484.34 1.648 0.1011
age:smoke -46.57 20.45 -2.278 0.0239 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For non-smoking mothers, the age effect β1 is significantly positive; its 95% confidence
interval is:
2.5 % 97.5 %
3.76278 51.69998
For smoking mothers, there seems to be a decrease in birthweight with age: β̂1 + β̂3 =
27.73138 − 46.57191 = −18.84054. To see if this is significant, we can again make a confidence
interval and see if 0 is contained in it or not. A clever way of doing this is to create
a new variable nonsmoke = 1-smoke, which reverses the encoding for the smokers
and nonsmokers. Then, the parameter β1 + β3 in the original model is the same as the
parameter β1 in the following model
Bwt = β0 + β1 age + β2 nonsmoke + β3 age × nonsmoke + ε .
Hence the confidence interval can be found as follows.
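A sketch of how this can be done (the variable name nonsmoke follows the text above):

> birthwt$nonsmoke = 1 - birthwt$smoke
> confint(lm(bwt ~ age * nonsmoke, data = birthwt))["age", ]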
2.5 % 97.5 %
-51.28712 13.60605
Since 0 lies in this confidence interval, the effect of age on bwt is not significant for
smoking mothers.
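Presumably the comparison started with single-predictor models; the first of the two commands would be along these lines (a sketch inferred from the P-values quoted below):

> summary(aov(bwt~lwt,data=birthwt))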
and
> summary(aov(bwt~age,data=birthwt))
The numbers of interest are here the P-values 0.0105 and 0.216. Repeating this for
all 7 variables is tedious, and fortunately we can automate this in R using the add1
function. Watch this method in action:
> form1 = formula(bwt~lwt+age+ui+smoke+ht+ftv1+ptl1) #formula
> add1(lm(bwt~1),form1,test="F", data=birthwt)
Model:
bwt ~ 1
Df Sum of Sq RSS AIC F value Pr(F)
<none> 99969656 2492.8
lwt 1 3448639 96521017 2488.1 6.6814 0.010504 *
age 1 815483 99154173 2493.2 1.5380 0.216475
ui 1 8059031 91910625 2478.9 16.3968 7.518e-05 ***
smoke 1 3625946 96343710 2487.8 7.0378 0.008667 **
ht 1 2130425 97839231 2490.7 4.0719 0.045032 *
ftv1 1 1340387 98629269 2492.2 2.5414 0.112588
ptl1 1 4755731 95213925 2485.6 9.3402 0.002570 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude that ui is the most significant variable to be included into the model.
Next, we investigate which variable could be further added.
> add1(lm(bwt~ui),form1,test="F", data=birthwt)
Model:
bwt ~ ui
Df Sum of Sq RSS AIC F value Pr(F)
<none> 91910625 2478.9
lwt 1 2074421 89836203 2476.6 4.2950 0.03960 *
age 1 478369 91432256 2479.9 0.9731 0.32518
So, ht is the most significant variable to be added to the model. We now look for a
third possible variable:
> add1(lm(bwt~ui+ht),form1,test="F", data=birthwt)
Model:
bwt ~ ui + ht
Df Sum of Sq RSS AIC F value Pr(F)
<none> 88748030 2474.3
lwt 1 3556661 85191369 2468.5 7.7236 0.006013 **
age 1 420915 88327114 2475.4 0.8816 0.348988
smoke 1 2874044 85873986 2470.0 6.1916 0.013720 *
ftv1 1 698945 88049085 2474.8 1.4686 0.227120
ptl1 1 2678123 86069907 2470.5 5.7564 0.017422 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
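So lwt is added next. The following call (a sketch matching the model line in the output below) examines the remaining candidates:

> add1(lm(bwt~ui+ht+lwt),form1,test="F", data=birthwt)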
Model:
bwt ~ ui + ht + lwt
Df Sum of Sq RSS AIC F value Pr(F)
<none> 85191369 2468.5
age 1 97556 85093813 2470.3 0.2109 0.64657
smoke 1 2623742 82567628 2464.6 5.8469 0.01658 *
ftv1 1 510128 84681241 2469.4 1.1084 0.29380
ptl1 1 2123998 83067371 2465.8 4.7048 0.03136 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> add1(lm(bwt~ui+ht+lwt+smoke),form1,test="F",data=birthwt)
Model:
bwt ~ ui + ht + lwt + smoke
Df Sum of Sq RSS AIC F value Pr(F)
<none> 82567628 2464.6
age 1 67449 82500178 2466.5 0.1496 0.69935
ftv1 1 274353 82293275 2466.0 0.6101 0.43576
ptl1 1 1425291 81142337 2463.3 3.2145 0.07464 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
No further variable is significant. The method thus stops at the model with variables:
ui, ht, lwt, and smoke.
Alternatively, we can start from the full model and check which variables can be dropped, using the drop1 function:
> drop1(lm(form1),test="F",data=birthwt)
Model:
bwt ~ lwt + age + ui + smoke + ht + ftv1 + ptl1
Df Sum of Sq RSS AIC F value Pr(F)
<none> 80682074 2466.2
lwt 1 2469731 83151806 2469.9 5.5405 0.0196536 *
age 1 90142 80772217 2464.5 0.2022 0.6534705
ui 1 5454284 86136359 2476.6 12.2360 0.0005899 ***
smoke 1 1658409 82340484 2468.1 3.7204 0.0553149 .
ht 1 3883249 84565324 2473.1 8.7116 0.0035808 **
ftv1 1 270077 80952151 2464.9 0.6059 0.4373584
ptl1 1 1592757 82274831 2467.9 3.5731 0.0603190 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We delete variable age. Let us see which variable should be dropped next, if any:
> drop1(lm(bwt~lwt+ui+smoke+ht+ftv1+ptl1),test="F",data=birthwt)
Model:
bwt ~ lwt + ui + smoke + ht + ftv1 + ptl1
Df Sum of Sq RSS AIC F value Pr(F)
<none> 80772217 2464.5
lwt 1 2737552 83509769 2468.8 6.1684 0.0139097 *
ui 1 5561240 86333456 2475.0 12.5309 0.0005082 ***
smoke 1 1680651 82452868 2466.3 3.7869 0.0531944 .
ht 1 3953082 84725299 2471.5 8.9073 0.0032306 **
ftv1 1 370120 81142337 2463.3 0.8340 0.3623343
ptl1 1 1521058 82293275 2466.0 3.4273 0.0657462 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
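Next we delete ftv1, the least significant variable in the table above; a call along these lines gives the output below (a sketch):

> drop1(lm(bwt~lwt+ui+smoke+ht+ptl1),test="F",data=birthwt)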
Model:
bwt ~ lwt + ui + smoke + ht + ptl1
Df Sum of Sq RSS AIC F value Pr(F)
<none> 81142337 2463.3
lwt 1 2887694 84030031 2467.9 6.5126 0.011528 *
ui 1 5787979 86930316 2474.3 13.0536 0.000391 ***
smoke 1 1925034 83067371 2465.8 4.3415 0.038583 *
ht 1 4215957 85358294 2470.9 9.5082 0.002362 **
ptl1 1 1425291 82567628 2464.6 3.2145 0.074642 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We delete variable ptl1. The method stops at the model with variables: ui, ht, lwt
and smoke.
It should be noted that different methods of automatic selection may not lead to
the same choice of variables in the final model. They have the advantage of being
easy to use, and of treating the question of variable selection in a systematic manner.
The main drawback is that variables are included or deleted based on purely statistical
criteria, without taking into account the aim of the study. This usually leads to a model
which may be satisfactory from a statistical point of view, but in which the variables
are not the most relevant when it comes to understanding and interpreting the data in
the study.
Our final model is
$$\text{bwt} = \beta_0 + \beta_1\,\text{smoke} + \beta_2\,\text{age} + \beta_3\,\text{lwt} + \beta_4\,\text{race2} + \beta_5\,\text{race3} + \beta_6\,\text{ui} + \beta_7\,\text{ht} + \beta_8\,(\text{smoke} \times \text{age}) + \varepsilon.$$
The following R code checks various model assumptions.
> finalmodel=lm(bwt~smoke+age+lwt+factor(race)+ui+ht+smoke:age,data=birthwt)
> par(mfrow=c(1:2))
> plot(finalmodel,1:2,col.smooth="red")
[Figure: residuals against fitted values and normal QQ-plot of the standardized residuals for finalmodel.]
It can also be useful to plot the residuals as a function of each explanatory variable,
as shown in Figure 10.2. Such plots help to check whether there is a relationship
between the error term and the explanatory variables, and also to detect outliers; a
sketch of how they can be produced is given below.
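A sketch of how such plots can be produced (the 2 × 3 layout and the exact set of variables plotted are assumptions based on the figure's axis labels):

res = residuals(finalmodel)
par(mfrow = c(2, 3))
plot(birthwt$age,   res, xlab = "AGE",   ylab = "res")
plot(birthwt$lwt,   res, xlab = "LWT",   ylab = "res")
plot(birthwt$smoke, res, xlab = "SMOKE", ylab = "res")
plot(birthwt$race,  res, xlab = "RACE",  ylab = "res")
plot(birthwt$ui,    res, xlab = "UI",    ylab = "res")
plot(birthwt$ht,    res, xlab = "HT",    ylab = "res")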
[Figure 10.2: residuals of finalmodel plotted against each explanatory variable (including RACE, UI, and HT).]
The purpose of this chapter is to give you a taste of other useful techniques for
data analysis. Goodness of fit tests can be used to verify that the data come
from a specified distribution; logistic regression makes it possible to perform
regression on binary responses; and nonparametric tests are useful when standard
model assumptions, such as normality, are not valid.
You will see that the ideas behind goodness of fit tests, logistic regression and non-
parametric tests naturally extend the concepts that you have already learned in pre-
vious chapters. The emphasis will be on the practical side (how the techniques can be
applied, for example in R) rather than on full mathematical proofs, which would be
out of scope for a first-year course.
Example 11.1 (Army Recruits) Suppose the IQ of army recruits is N(100, 16²)
distributed. Army recruits are classified as
Class 1 : IQ ≤ 90
Class 2 : 90 < IQ ≤ 110
Class 3 : IQ > 110
The proportions p1, p2 and p3 of army recruits in the three classes are given by
P(Y ≤ 90) = p1, P(90 < Y ≤ 110) = p2 and P(Y > 110) = p3, where Y ∼ N(100, 16²).
It follows that we have the following proportions:
Class 1 : p1 = 0.266
Class 2 : p2 = 0.468
Class 3 : p3 = 0.266
Now suppose we have 7 new recruits. What is the probability that of these 7 new
recruits, two are Class 1, four are Class 2 and one is Class 3?
To answer this, let Xi be the number in class i, i = 1, 2, 3. Then (X1, X2, X3) ∼
Mnom(7, p1, p2, p3). Thus, it follows immediately that
P(X1 = 2, X2 = 4, X3 = 1) = (7!/(2! 4! 1!)) p1² p2⁴ p3 ≈ 0.095.
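These calculations are easily checked in R; a small sketch using the built-in functions pnorm and dmultinom:
> p1 = pnorm(90, mean=100, sd=16)        # P(Y <= 90)
> p2 = pnorm(110, mean=100, sd=16) - p1  # P(90 < Y <= 110)
> p3 = 1 - pnorm(110, mean=100, sd=16)   # P(Y > 110)
> round(c(p1, p2, p3), 3)
[1] 0.266 0.468 0.266
> dmultinom(c(2, 4, 1), size=7, prob=c(p1, p2, p3))  # P(X1=2, X2=4, X3=1), approximately 0.095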
For the goodness-of-fit tests that we will discuss next, the following theorem is of
utmost importance. The proof relies on the Central Limit Theorem and the fact that
the square of a standard normal random variable has a χ² distribution with 1 degree of freedom.
Remark 11.1 As a rule of thumb, we can use the approximation above provided that the expected count in each class is at least 5.
x = 1:10
n = 100
R = 1000
t = vector()                      # initialize a vector t
for (i in 1:R){
  s = sample(x, size=n, replace=TRUE)
  h = hist(s, breaks=0:10)        # create a histogram object
  ob = h$counts                   # contains the observed counts
  ex = 10                         # expected count in each class: n*p = 100*(1/10)
  t[i] = sum((ob - ex)^2/ex)
}

hist(t, breaks=30, freq=FALSE, main="")
curve(dchisq(x, df=9), xlim=c(0,30), col=2, lwd=2, add=TRUE)
Figure 11.1: The histogram of the test statistic values closely matches the pdf of the
χ² distribution with 9 degrees of freedom (red curve).
(X1 , . . . , Xk ) ∼ Mnom(n, p1 , . . . , pk ).
which, under H0, has a χ² distribution with k − 1 degrees of freedom, by Theorem 11.1. We reject H0 at the α level
of significance if
T > q,
where q is the (1 − α)-quantile of the χ² distribution with k − 1 degrees of freedom.
In general, the test statistic is of the form
T = Σ_{i=1}^{k} (Oi − Ei)² / Ei ,
where Oi is the observed number of observations in class i and Ei is the expected num-
ber of observations in class i. A statistic of this form appears in every goodness
of fit test.
5. The outcome of T is
t = (23 − 93/4)²/(93/4) + (50 − 93/2)²/(93/2) + (20 − 93/4)²/(93/4) = 0.72 .
6. The P-value for this right one-sided test is PH0(T > 0.72) = 0.70.
7. Because the P-value is large (0.70), we do not reject H0; that is, we find no evidence to
reject the theory.
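Steps 5–7 can be checked quickly in R; a small sketch, assuming (as the expected counts 93/4, 93/2, 93/4 suggest) that the hypothesised class proportions are 1/4, 1/2 and 1/4:
> ob = c(23, 50, 20)             # observed counts
> ex = 93*c(1/4, 1/2, 1/4)       # expected counts under H0: 23.25, 46.5, 23.25
> t = sum((ob - ex)^2/ex)        # 0.72
> 1 - pchisq(t, df=2)            # P-value, approximately 0.70
> chisq.test(ob, p=c(1/4, 1/2, 1/4))  # the same test in one call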
Comparing this with Theorem 11.1 we see that we apparently “lose” r degrees of
freedom if we have to estimate r parameters.
An important application of Theorem 11.2 occurs in a two-way table of counts (also
called a contingency table), where we wish to test for an association (i.e., dependence)
between the two variables. We explain the idea via a specific example first.
Example 11.4 (ESP Belief) We wish to examine whether artists differ from non-
artists in Extra-Sensory Perception (ESP) belief. Table 11.1 lists the amount of belief
in ESP for a group of 114 Artists and a group of 344 Non-artists. We wish to investigate
whether being an artist or not is “independent” of the ESP belief (strong, moderate or
not).
Table 11.1: Belief in ESP for a group of 114 Artists and a group of 344 Non-artists.
                     ESP belief
              Strong  Moderate  Not   Total
Artists           67        41    6     114
Non-artists      129       183   32     344
Total            196       224   38     458
To see that this is a type of goodness of fit situation, we need to properly formulate
a model for the data and express the null and alternative hypotheses in terms of the
parameters in the model.
If we ignore the row and column totals, we have a table with r = 2 rows and c = 3
columns. We can imagine the table to be filled in the following way: We randomly
select 458 people and ask whether they are an artist or not and what their ESP belief is.
Let (Uk, Vk) denote the response for the kth selected person, where Uk ∈ {1, 2}
(1 = artist, 2 = non-artist) and Vk ∈ {1, 2, 3} (1 = strong belief, 2 = moderate
belief, 3 = no belief). We assume that (U1, V1), . . . , (Un, Vn) are independent and dis-
tributed as a random vector (U, V) that can take the values (1, 1), (1, 2), (1, 3), (2, 1), (2, 2)
and (2, 3) with probabilities p11, p12, . . . , p23.
Now, instead of recording all (Uk, Vk), we could count how many people
are artists with a strong ESP belief, artists with a moderate ESP belief, and so on. Let Xij
be the count in row i and column j; that is, the total number of observations out of
n = 458 that fall in “cell” (i, j). For example, the outcome of X22 is 183. From the
model above we have
(X11 , . . . , X23 ) ∼ Mnom(n, p11 , . . . , p23 ) .
We wish to test the null hypothesis that the random variables U and V are inde-
pendent. In terms of the parameters of the model, the null hypothesis can be written
as
H0 : pij = pi qj for all i, j,
where p1, p2, q1, q2 and q3 are unknown probabilities. Using Theorem 11.2, we can
test the null hypothesis against the alternative hypothesis that pij ≠ pi qj for some i
and j, by using the test statistic
T = Σ_{i=1}^{2} Σ_{j=1}^{3} (Xij − Eij)² / Eij ,
where Eij is an estimator of npij, the expected number of observations in cell (i, j).
Under H0, this is npiqj. The natural estimators for pi and qj are
p̂i = (Σ_{j=1}^{3} Xij)/458   and   q̂j = (Σ_{i=1}^{2} Xij)/458 ;
Table 11.2: Observed counts and estimated expected counts (in parentheses) for the ESP belief data.
                      ESP belief
              Strong      Moderate    Not        Total
Artists       67 (48.8)   41 (55.8)    6 (9.46)    114
Non-artists  129 (147)   183 (168)    32 (28.5)    344
Total        196         224          38           458
For example, E11 = 114 × 196/458 ≈ 48.8. It follows that the outcome of T is
t = 6.79 + 3.93 + 1.27 + 2.20 + 1.34 + 0.43 = 15.96. The p-value is 0.00034. Hence,
we strongly reject H0 . Artists indeed seem to differ from non-artists in ESP belief.
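The whole test can also be carried out with chisq.test; a sketch using the counts in Table 11.1:
> esp = matrix(c(67, 41, 6, 129, 183, 32), nrow=2, byrow=TRUE)
> chisq.test(esp)   # X-squared approximately 15.9, df = 2, P-value approximately 0.0003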
For the general contingency table, we have count data in r rows and c columns.
Again, we wish to test for association between the two variables. Let Xij be the total
number of observations (out of n) that fall in cell (i, j) (i.e., in the ith row and jth
column). The null hypothesis of no association states that
pij = pi qj for all i, j,
for some (unknown) p1, . . . , pr and q1, . . . , qc. We can test this by using the test statistic
T = Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − Eij)² / Eij ,
where Eij is as in (11.1). Under the null hypothesis of no association, the test statistic
has approximately a χ² distribution with degrees of freedom parameter
df = rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1) .
We reject H0 for large values of T.
In the logistic regression model with a single explanatory variable x, the response Y takes the value 1 with probability
p = 1/(1 + e^(−β0−β1 x)) = e^(β0+β1 x)/(1 + e^(β0+β1 x)),
and Y is 0 otherwise. Large values of β0 + β1 x lead to a high probability that Y = 1,
and small (negative) values of β0 + β1 x cause Y to be 0 with high probability. Note that
there is a linear relationship between the logarithm of the “odds” p/(1 − p) and x:
ln( p/(1 − p) ) = β0 + β1 x.
The parameters β0 and β1 can be estimated from the observed data (x1 , y1 ), . . . , (xn , yn )
by maximizing the likelihood of the observed {yi }; that is, P(Y1 = y1 , . . . , Yn = yn ) seen
as a function of β0 and β1 . For example, if we have the sample y1 = 1, y2 = 0, y3 = 1
and denote pi = (1 + exp(−β0 − β1 xi ))−1 , i = 1, 2, 3, then
P(Y1 = 1, Y2 = 0, Y3 = 1) = p1 (1 − p2 )p3 .
We can use the glm function in R to do this estimation/optimization for us, as ex-
plained in the following example.
Example 11.5 (Logistic Regression) The code below first simulates 100 explana-
tory variables, {xi}, chosen uniformly between −1 and 1. Then, the binary responses
{yi} are obtained from the logistic model with parameters β0 = −3 and β1 = 10.
The data are stored in a data frame mydata, consisting of two columns (one for the
{xi} and one for the {yi}).
n = 100
x = runif(n, min=-1, max=1)           # explanatory variables, uniform on (-1,1), as described above
b0 = -3
b1 = 10
p = exp(b0+b1*x)/(1 + exp(b0+b1*x))   # logistic success probabilities
y = rbinom(n, size=1, p)              # binary responses
mydata = data.frame(x, y)
Figure 11.2 shows the (xi , yi ) pairs as black circles. The true logistic curve is also
shown (dashed line). Let us now try to “recover” the parameters and the curve from
the observed data only, using the glm function:
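The output below comes from a call of the following form, as shown in its Call: line (fit is just a name chosen here):
> fit = glm(y ~ x, family = binomial, data = mydata)
> summary(fit)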
Call:
glm(formula = y ~ x, family = binomial, data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.25039 -0.05859 -0.00072 0.03185 1.95385
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.520 1.439 -3.141 0.00168 **
x 14.701 4.587 3.205 0.00135 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find estimates β̂0 = −4.520 and β̂1 = 14.701, which are not that far from the
true parameter values. The output also allows us to construct confidence intervals. For
example, an approximate 95% confidence interval for β1 is 14.701 ± 1.96 × 4.587.
We can test for association between the response and the explanatory variable by testing
whether β1 is 0 or not. The P-values in the output summary indicate that both β0 and β1
are not zero (as indeed they are not). The predicted probability p̂ for a new explanatory
variable x = 0.3 is
p̂ = 1/(1 + e^(4.520 − 14.701×0.3)) = 0.4726.
The same probability (up to rounding errors) can also be obtained via the predict
function:
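A sketch of the corresponding call (using the fitted object fit from above):
> predict(fit, newdata = data.frame(x = 0.3), type = "response")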
1
0.472675
In fact, we can estimate and plot the entire logistic curve via:
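One way to do this (a sketch; it assumes the data have already been plotted, as in Figure 11.2, and uses the fitted object fit from above):
> xs = seq(-1, 1, by = 0.01)
> lines(xs, predict(fit, newdata = data.frame(x = xs), type = "response"), col = "red")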
The estimated curve is given by the red curve in Figure 11.2. For a new explana-
tory variable x we can classify (predict) the response as 1 or 0 according to whether the
estimate of p is greater than 1/2 or less than 1/2.
Figure 11.2: Logistic regression data (black circles), fitted curve (red), and true curve
(black dashed).
Later on, in Section 7.5, we made extra assumptions on the distribution of the data.
In particular, we assumed that the data in both groups came from two possibly dif-
ferent normal distributions, characterized by 4 unknown parameters (two expectations
and two variances). This then led to the 2-sample t-test. The advantage of such a para-
metric approach is that (1) we can specify our hypotheses in terms of the parameters
of the model, and (2) the distribution of our test statistic can be readily obtained; e.g.,
a t-distribution.
However, making assumptions about the distribution of the data is fraught with
risks (e.g., incorrect calculation of P-values may lead to the wrong conclusions), es-
pecially if the assumptions are not true. The randomization test is an example of a
nonparametric test, where we still may make assumptions about the data (e.g., inde-
pendence), but we do not model the data via a specific parametric class of distributions.
Nonparametric tests tend to be more “robust” to outliers in the data. The downside is
that they are less “powerful” than parametric tests, in the sense that it is more difficult
to reject the null hypothesis when it indeed should be rejected.
Fortunately, there are nonparametric versions of the standard tests (e.g., 1- and 2-
sample t-tests) available and we will discuss a number of these. The simplest one is
the sign test, which is a robust alternative to the 1-sample t-test. This is simply the
1-sample binomial test in disguise, as we will see in the following example.
Example 11.6 (Sign Test) In Section 7.2 we had the following data for the Decaf
group in Alice’s cola experiment:
Table 11.3: Changes in pulse rate for the Decaf group in Alice’s cola experiment.
change   4   10    7   −9    5    4    5    7    6   12
sign     +    +    +    −    +    +    +    +    +    +
To apply the 1-sample t-test, it was assumed that these data came from some nor-
mal distribution. What if we do not assume anything about the underlying distribution
of the data, other than that the 10 independent observations come from the same prob-
ability distribution? Can we still conduct a test? We could test the null hypothesis
that the median of the unknown distribution is 0 against the alternative that it is greater
than 0. As a summary of the data we could record the signs of the changes in pulse rate:
positive or negative (or 0, but that does not occur here). In fact, as our test statistic, we
could simply take the number of positive changes — in this case 9 out of 10. The situation
is now equivalent to a 1-sample binomial test for the probability p of a positive change.
The null and alternative hypotheses are H0 : p = 1/2 vs. H1 : p > 1/2. The P-value of the test is
P(X ≥ 9), where X ∼ Bin(10, 1/2). Using R:
> 1 - pbinom(8,size=10,prob=0.5)
[1] 0.01074219
There is thus strong evidence that the change in pulse rate is positive for the Decaf
(control) group.
Recall that the data in Table 11.3 came from paired measurements (before, after), and
we recorded only the changes in pulse rate (after − before). The sign test is particularly
useful for such paired data, as a nonparametric alternative to the paired t-test (see
Page 123). Another nonparametric alternative for paired data is the Wilcoxon signed
rank test. The procedure is as follows:
1. First rank the absolute differences, where ties are given equal fractional ranks.
2. The test statistic S is the sum of the ranks of the positive differences.
Under the null hypothesis and for large sample size n, the test statistic has approxi-
mately a normal distribution with
E(S) = n(n + 1)/4   and   sd(S) = √( n(n + 1)(2n + 1)/24 ).
Example 11.7 (Wilcoxon Signed Rank Test) Consider again the data in Table 11.3,
which lists the differences (changes) in pulse rate for the Decaf group. The absolute
differences are exactly the same, apart from −9, which becomes 9. The ranks of
the absolute differences are given in Table 11.4. Note that there are several ties;
tied values receive the same average rank.
Table 11.4: Ranks of absolute differences (changes) in pulse rate for the Decaf group
in Alice’s cola experiment.
|change| 4 10 7 9 5 4 5 7 6 12
rank 1.5 9 6.5 8 3.5 1.5 3.5 6.5 5 10
The outcome of the test statistic is s = 1.5 + 9 + 6.5 + 3.5 + 1.5 + 3.5 + 6.5 + 5 + 10 =
47. Under H0, S has approximately a normal distribution with expectation 27.5 and
standard deviation 9.810708, so that the P-value for this right one-sided test (we reject
for large values of S) can be computed as follows:
n = 10
s = 1.5 + 9 + 6.5 + 3.5 + 1.5 + 3.5 + 6.5 + 5 + 10
es = n*(n+1)/4                     # E(S) = 27.5
sds = sqrt(n*(n+1)*(2*n+1)/24)     # sd(S) = 9.810708
1 - pnorm((s - es)/sds)            # P-value via the normal approximation
[1] 0.02342664
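R's built-in function wilcox.test carries out the same test; a sketch of a call matching the output below (with x holding the Decaf changes):
> x = c(4, 10, 7, -9, 5, 4, 5, 7, 6, 12)
> wilcox.test(x, alternative = "greater")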
data: x
V = 47, p-value = 0.02616
alternative hypothesis: true location is greater than 0
Note that the answers differ slightly because R applies a continuity correction to the
normal approximation. R also gives a warning message (not shown here) that it cannot
compute the P-value exactly because of the ties in the data.
We can also replace the 2-sample t-test with a nonparametric alternative called
Wilcoxon’s rank sum test. As in the 2-sample t-test, the assumption is that we have
two independent groups of data, and we wish to test if the distributions of the two
groups are the same or not. The test statistic is the sum of the ranks of the first group,
adjusted for ties if necessary.
Under the null hypothesis and for large sample sizes n1 and n2, the test statistic has approx-
imately a normal distribution with
E(S) = n1(n1 + n2 + 1)/2   and   sd(S) = √( n1 n2 (n1 + n2 + 1)/12 ).
To compute ES above, note that under the null hypothesis all ranks are equally likely,
so the rank R of one observation is a discrete uniform random variable taking values
in 1, . . . , n1 + n2 . Its expectation is thus ER = (n1 + n2 + 1)/2, and since there are
n1 observations in the first group, we have an expected rank total of ES = n1 ER =
n1 (n1 + n2 + 1)/2. The variance of R and S is a bit more difficult to derive, and we leave
this as an exercise. In R, the same function wilcox.test can be used to perform a
rank sum test. However, the test statistic is here the rank sum of the first group minus
n1 ×(n1 +1)/2. This equivalent test statistic is known as the Mann–Whitney test statistic.
Example 11.8 (Rank Sum Test) To see how the rank sum test works, consider Al-
ice's caffeine study one last time. Table 11.5 shows the original observations along
with their ranks.
Table 11.5: Changes in pulse rate for Alice’s caffeine experiment. Ranks are given
below the observations.
Caffeinated 17 22 21 16 6 −2 27 15 16 20
Rank 16 19 18 14.5 7.5 2 20 13 14.5 17
Decaf 4 10 7 −9 5 4 5 7 6 12
Rank 3.5 11 9.5 1 5.5 3.5 5.5 9.5 7.5 12
The smallest increase was −9 bpm, so it gets rank 1, while the second smallest
increase, −2 bpm, gets rank 2. The next smallest increase, 4 bpm, occurs twice, so we give
both of them the average (3.5) of the ranks 3 and 4 that they would have had if they
were not tied.
The null hypothesis is that there is no difference between the distributions of the
Caffeinated and Decaf groups. The outcome of the test statistic is:
s = 16 + 19 + 18 + 14.5 + 7.5 + 2 + 20 + 13 + 14.5 + 17 = 141.5.
If the subjects with caffeine tended to have higher increases in pulse rate then they
would tend to have higher ranks and so S would tend to be bigger. The P-value is
the probability of getting a value as extreme or more extreme, so here we want P(S >
141.5). Using the normal approximation, with ES = 105 and sd(S ) = 13.22876, we
obtain:
> s = 16 + 19 + 18 + 14.5 + 7.5 + 2 + 20 + 13 + 14.5 + 17   # 141.5 (from above)
> es = 10*(10 + 10 + 1)/2                                    # E(S) = 105
> sds = sqrt(10*10*(10 + 10 + 1)/12)                         # sd(S) = 13.22876
> Pval = 1 - pnorm((s - es)/sds)
> Pval
[1] 0.002897679
1 x = c(17, 22, 21, 16, 6, -2, 27, 15, 16, 20) # caffeinated
2 y = c(4, 10, 7 ,-9, 5, 4, 5, 7, 6, 12) # decaf
3 wilcox.test(x,y,alternative="greater")
data: x and y
W = 86.5, p-value = 0.003201
alternative hypothesis: true location shift is greater than 0
We see a similar P-value (obtained via a more accurate computation than we did
by hand) and observe that the outcome of the Mann–Whitney test statistic is indeed
141.5 − 10 × 11/2 = 86.5.
APPENDIX A
R PRIMER
• R is free and easy to install and maintain, especially when using integrated de-
velopment environments (IDEs) such as RStudio.
• R has many external packages: collections of functions and data tailored to cer-
tain tasks. These packages can be easily installed, e.g., via RStudio.
• R has many efficient inbuilt procedures for statistics, data management and vi-
sualization.
• R has an integrated and accessible documentation system.
R’s base system comes with a rudimentary Graphical User Interface. We recom-
mend instead the use of RStudio’s integrated development environment (IDE), de-
picted in Figure A.1.
This IDE comprises (customizable) windows for R programs (top-left), the R con-
sole (bottom-left), environment variables and history (top-right), and plotting and pack-
ages information (bottom-right).
A.2 Learning R
There are many resources available to help you learn R. In RStudio, for example, the
Help>R Help menu gives access to the comprehensive tutorial “An Introduction to
R”, as well as to the precise “R Language Definition”. In this section we will merely
give an overview of R. If at any point you need help, the first thing to do is consult
R’s help function. For example help("sin") will show information about the sin
function and other trigonometric functions. An Internet search is nowadays also a
good alternative, which often will bring you to a Stack Exchange question and answer
website.
A.2.1 R as a Calculator
The simplest thing you can do with R is to use it as a basic calculator, as in
> 1*2*3*4
[1] 24
and
> sin(1)
[1] 0.841471
Here sin is the built-in trigonometric function. As on your calculator, numbers can be
stored in memory. This is done via the assignment operator =, as in:
> xx = 10
> xx
[1] 10
Here 10 is the contents of xx, and [1] indicates that 10 is the first element of the object.
You can create objects containing words and other characters in the same way.
> my.text = "I like R"
An object's type is important to keep in mind, as it determines what we can do with it. For
example, you cannot take the mean of a character object like my.text:
> mean(my.text)
[1] NA
Warning message:
In mean.default(my.text) : argument is not numeric or logical:
returning NA
Trying to find the mean of the my.text object gives a warning message and returns
NA (not available). To find out an object's type use the class function:
> class(my.text)
[1] "character"
Names of objects are case-sensitive; they must begin with a letter and must not contain
spaces. Names may include full stops, such as my.name.
Vectors consisting of a sequence of numbers can be created via the “colon” operator,
e.g., 1:5 is the same as c(1,2,3,4,5), or via the seq function, as in:
> my.sequence = seq(from=1, to=20, by=2)
> my.sequence
[1] 1 3 5 7 9 11 13 15 17 19
If the number of elements in the vector is smaller than the number of elements in the
matrix, the vector elements will be “recycled”:
> matrix(1:5, ncol=5, nrow=2)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    2    4
[2,]    2    4    1    3    5
Let’s now combine the two vectors age and Author into a new object with the
cbind (column bind) function.
> AgeAuthorObject = cbind(age,Author)
> AgeAuthorObject
age Author
[1,] "50" "Dirk"
[2,] "38.5" "Benoit"
[3,] "37.5" "Michael"
We have again created a matrix object. Since all elements of a matrix must be of the same
type, R has coerced (cast) the numeric age vector into a vector of strings. You
can see that the numbers in the age column are between quotation marks. In R, a matrix
object is seen as a vector with extra attributes, in particular the dimension of the matrix
and possibly the row and column names. The attributes of an object can be obtained
and set via the attributes function. The functions colnames and rownames make
it possible to retrieve or set the column or row names of a matrix-like object.
If you want to have an object with rows and columns and allow the columns to
contain data with different types, you need to use data frame objects, which can be
constructed via the data.frame function.
> AgeAuthorObject = data.frame(age,Author)
> AgeAuthorObject
age Author
1 50.0 Dirk
2 38.5 Benoit
3 37.5 Michael
You can use the names command to see the names of the data frame's variables (columns).
The command names is not specific to data.frame objects but can be applied to other R
objects as well, such as the list object, which is defined later.
> names(AgeAuthorObject)
[1] "age"    "Author"
Notice that the first column of the data set has no name and is a series of numbers. This
is the row.names attribute of the data frame. We can use the rownames command to
set the row names from a vector.
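For example (a sketch; the original call is not shown in the text):
> rownames(AgeAuthorObject) = c("First", "Second", "Third")
> AgeAuthorObject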
age Author
First 50 Dirk
Second 38.5 Benoit
Third 37.5 Michael
The component selector $ extracts a single column from a data frame; for example,
AgeAuthorObject$age extracts the age column from AgeAuthorObject. You can
then compute, for example, the mean age by using
> mean(AgeAuthorObject$age)
[1] 39.66667
Using the component selector can create long repetitive code if you want to select
many components. You can streamline your code by using the attach command. This
command attaches a database to R’s search path (you can see what is in your current
search path with the search command; just type search() into your R console). R
will then search the database for variables you specify. You don’t need to use the com-
ponent selector to tell R again to look in a particular data frame after you have attached
it. For example, let's attach the cars data that comes with the default packages of R. It
has two variables, speed and dist (type ?cars for more information on this data set):
> attach(cars)
> head(speed) # Display the first values of speed
[1] 4 4 7 7 8 9
> mean(speed)
[1] 15.4
Learning R 197
It is a good idea to detach a data frame after you are done using it, so that its variable
names do not clash with (mask) other objects of the same name.
> detach(cars)
Another way to select parts of an object is to use subscripts, which are written
with square brackets []. We can use subscripts to select not only columns from data
frames but also rows and individual values. Let us see this in action with the data frame
cars:
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
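Rows can be selected by giving a row subscript before the comma; a call such as the following (a sketch) produces the output below:
> cars[3:7, ]   # rows 3 to 7, all columns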
speed dist
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
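A single value can be selected by giving both a row and a column index, by number or by name; sketches matching the two outputs below are:
> cars[4, 2]        # the element in row 4, column 2
> cars[4, "dist"]   # the same element, selected by column name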
[1] 22
[1] 22
Also note the functions which, which.min and which.max, which are often very
useful to extract information.
[1] 1 3 7
> x = c(0:4,0:5,11)
> which.min(x) # Outputs the index of the smallest value.
[1] 1
> which.max(x) # Outputs the index of the largest value.
[1] 12
We can also select the cars with a speed less than 9 mph by using
> cars[which(cars$speed<9),]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
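The same rows can also be selected with a logical subscript directly; a sketch of an equivalent command is:
> cars[cars$speed < 9, ]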
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
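A list can hold elements of completely different types. A sketch of how the list printed below might have been created (my.list is a hypothetical name; the original call is not shown):
> my.list = list(TRUE, -1:3, matrix(1:4, nrow=2), "A character string")
> my.list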
[[1]]
[1] TRUE
[[2]]
[1] -1 0 1 2 3
[[3]]
[,1] [,2]
[1,] 1 3
[2,] 2 4
[[4]]
[1] "A character string"
In such a structure, with heterogeneous data types, element ordering is often com-
pletely arbitrary. Elements can therefore be explicitly named, which makes the output
more user-friendly. Here is an example:
> B = list(my.matrix=matrix(1:4,nrow=2),my.numbers=-1:3)
> B
$my.matrix
[,1] [,2]
[1,] 1 3
[2,] 2 4
$my.numbers
[1] -1 0 1 2 3
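A named element can then be extracted with the $ operator, for example (a sketch):
> B$my.matrix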
[,1] [,2]
[1,] 1 3
[2,] 2 4
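Definitions of the matrix A and vector x consistent with the output below are (a sketch):
> A = matrix(1:6, nrow=2)   # a 2 x 3 matrix with columns (1,2), (3,4), (5,6)
> x = c(3, 2, 1)
> x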
[1] 3 2 1
Multiplying A with x results in a matrix object with 2 rows and 1 column.
> A %*% x # multiply matrix A with (column) vector x
[,1]
[1,] 14
[2,] 20
We can coerce this matrix back into a vector (if needed) with the as.vector function.
The transpose of a matrix is found via the t function.
> t(A) # transpose of A
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
Functions of a matrix are mostly applied in an elementwise way; for example, with A as above,
> 1/A
     [,1]      [,2]      [,3]
[1,]  1.0 0.3333333 0.2000000
[2,]  0.5 0.2500000 0.1666667
The inverse of an invertible square matrix B can be found by solving the linear equation
BX = I for the matrix X. In R we use the solve function. For example, with
> B = matrix(1:4, nrow=2)
> B
     [,1] [,2]
[1,]    1    3
[2,]    2    4
the inverse is
> solve(B) # compute and print the inverse of B
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
The function apply can be used to apply a function to the rows or columns of a
matrix. For example, with matrix A as above, the row means and column means are:
> apply(A,MARGIN=1,FUN=mean) # row means
[1] 3 4
> apply(A,MARGIN=2,FUN=mean) # column means
[1] 1.5 3.5 5.5
The R code below gives some examples. We can execute the code in RStudio via the
“source” button; typing source("filename") in the console, where filename is
the name of your script file, executes the code as well. The code also illustrates the use
of the cat and print functions to output results; see the help files for their different
uses. The function scan can be used to input data from a file, a URL, or the keyboard. To
output a new line, use the special character “\n”. Note that x == y (that is, a double
equal sign) is used to compare x with y. In the code below, two strings are compared.
1 cat("Input name");
2 name = scan(,what="char",nmax=1) # read the name from keyboard input
3 if (name == "Dirk"){ print("Welcome back Dirk")
4 } else {cat("Hello",name,"\n")} #important to have "} else"
5
6 for (i in 1:10) cat(i^2," ") # output numbers in a row
7 cat("\n") # put newline in output
8
9 # this does the same but prints the results as a column
10 i = 1
11 while (i <= 10){
12 print(i^2)
13 i = i+1
14 }
A common mistake in “if else” statements is to start the “else” as the first word
of a new line. This confuses R, as it deals with the previous statement as an “if”
statement without the “else” part.
A.2.7 Functions
Functions are simply sets of statements that transform “input” objects into an “out-
put” object. We have already seen several examples of functions, such as the function
mean. The input to this function is a vector of numbers, and the output is the mean
(i.e., average) of these numbers.
The standard way to use a function is to assign the result of the function f to
an object y, as in y = f(x). However, some functions in R can change the
attribute f (x) of x to z via an assignment f(x) = z. A common example is
the names function, which not only shows the names of an object, but can be
used to change the names of that object as well. Another example is the levels
function.
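For example, the assignment form of names changes the names attribute of an object (v is just an illustrative vector):
> v = c(1, 2, 3)
> names(v) = c("a", "b", "c")   # changes the names attribute of v
> v
a b c 
1 2 3 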
Some functions can be called with an “empty” argument list. For example, getwd
gives the current working directory:
> getwd()
[1] "c:/Users/JohnSmith/DataProject"
Arguments are the input to a function and are specified with the ARGUMENTLABEL=VALUE syntax.
To find all of the arguments that a function accepts, look at the Arguments section of
its help file. Argument labels may be given in any order and can be
abbreviated provided there is no ambiguity. It is advisable, though, to keep the labels
and their order exactly as in the specified argument list.
> ?rnorm #open help file for rnorm
> x = rnorm(n=10,mean=3,sd=2) #generate normal random variables
> (q = quantile(x, probs = c(0.25,0.75))) #output two quantiles
25% 75%
1.021414 4.237529
Basic Functions
Here are some important data manipulation functions. See the help files for extra
arguments.
[1] 9
[1] 0 1 1 2 3 4 6 7 8
• order, rank: the first function returns the vector of ranking indices of the el-
ements. In case of a tie, the ordering is always from left to right. The second
function returns the vector of ranks of the elements. In case of a tie, the ranks
are shared and can be non-integer.
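The examples below use a small named vector; a definition consistent with the printed output is (a sketch):
> vec = c(1, 3, 6, 2, 7, 4, 8, 1, 0)
> names(vec) = 1:9
> vec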
1 2 3 4 5 6 7 8 9
1 3 6 2 7 4 8 1 0
> sort(vec)
9 1 8 4 2 6 3 5 7
0 1 1 2 3 4 6 7 8
> order(vec)
[1] 9 1 8 4 2 6 3 5 7
> rank(vec)
1 2 3 4 5 6 7 8 9
2.5 5.0 7.0 4.0 8.0 6.0 9.0 2.5 1.0
[1] 1 3 6 2 7 4 8 0
[1] 2 3 5 6 7
> x[ind]
[1] 3 6 7 4 8
[1] 1 2 3 1 2 3 1 2 3 1 2 3
1 BMI = function(weight,height){
2 bmi = weight/height^2
3 res = list(Weight=weight,Height=height,BMI=bmi)
4 return(res)}
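A call such as the following produces the output shown below (the argument values 70 and 1.82 can be read off the output):
> BMI(weight=70, height=1.82)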
$Weight
[1] 70
$Height
[1] 1.82
$BMI
[1] 21.13271
A.2.8 Graphics
R comes with a “base” graphics package. To see the available functions and vari-
ables, type:
> ls(package:graphics)
Some of these are high-level functions, which produce complete plots with a single
command or just a few commands. Examples include plot, boxplot, contour, barplot, and
hist. Other plotting functions are low-level, plotting only parts of plots; examples
include abline, points, curve, frame, axis, text, and legend.
The function plot is a generic function for plotting R objects. Each object
invokes its own plot function, called a method. Type methods(plot) to see all
the methods that are associated with the plot function.
• cex : changes the size of the characters (especially useful when exporting
graphs to be included in a LaTeX document).
• mar: a vector of form c(bottom, left, top, right) giving the mar-
gins of the plot.
• pch : the point type: either specified by a character or an integer. See
?points for a list.
• lty: the line type, specified by an integer.
• lwd: the line width.
• col: the color of a line or character, specified as a character string or a
number. Type colors() for a list of the available colors.
1 f = function(x) sin(x)
2 g = function(x) sin(x)*exp(-x)
3 windows(width=8,height=5) # draws an external window in MS Windows
4 par(cex=1.5,mar = c(4,2,0.2,0.2))
5 curve(f,0,pi, lwd = 3,col="blue",xlab = "x",ylab="")
6 curve(g,0,pi, lwd = 3,lty="dashed",col="darkorange",xlab = "x",add=T)
Learning R 207
7 legend(-0.05,1.02,c("f(x)","g(x)"),col=c("blue","darkorange"),lwd=2,
8 lty=c("solid","dashed"),bty="n")
[Plot produced by the code above: f(x) = sin(x) (solid, blue) and g(x) = sin(x) exp(−x) (dashed, orange) for x in [0, π], with a legend identifying f(x) and g(x).]
Using the commands below we can quickly plot a fitted line to the cars regression
data. The result is depicted in Figure A.3.
> plot(cars)
> abline(lm(dist~speed,data=cars),col="blue")
> points(cars[30,],col="red",pch=20)
Figure A.3: Plotting a fitted line to regression data, and highlighting one point.
Instead of writing to a window, R can also write to other devices, such as a pdf or
postscript file. For example:
> pdf("cars.pdf")
> plot(cars)
> dev.off()
This plots the cars data to a pdf file called cars.pdf. The file is not written until the
command dev.off() is issued.
Note that file paths are specified using slashes (/). This notation comes from
the UNIX environment. In R, you cannot use backslashes (\), as you would in
Microsoft Windows, unless you double all the backslashes (\\).
Another option is using the function setwd to change the work directory. The
argument file will then accept the file name alone, without its path.
> setwd("C:/MyFolder")
> my.file = "mydata.txt"
> data = read.table(file=my.file)
Your data are now available in the R console: they are stored in the object which
you have chosen to call data. You can view them by typing data; you can also
type head(data) or tail(data) to display only the beginning or the end of the
data set. Use str(data) to see the type of each column of your data.
For writing data, the relevant function is write.table. Suppose you have a
data.frame called mydata, containing data that you wish to save in a text file. You
would then use the instruction:
> write.table(mydata, file = "myfile.txt", sep = "\t")
The command
R CMD BATCH infile.R outfile.txt
executes the statements from infile.R and writes the results to outfile.txt, or to
standard output when the output file is not provided. You can also use Rscript instead
of R CMD BATCH. To try things out, run the following batch file, named batch.R for example,
in a command shell:
1 x = seq(0,2*pi,by=0.1)
2 print(x) # write to standard output
3 windows() # open a window
4 plot(sin(x))
5 Sys.sleep(5) # wait 5 seconds before exiting
Before you can load a package with library(), you need to install it first. In
RStudio, the “Packages” tab in the lower-right IDE pane shows the packages that have
already been installed. Clicking on the “Install” button opens a window in which you can
search for new packages, which will then be installed automatically. Packages can also be
installed manually via the install.packages function.
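For example (using ggplot2 purely as an illustration of a package name):
> install.packages("ggplot2")   # download and install the package (needed only once)
> library(ggplot2)              # load the package in the current R session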
INDEX

A
absolute deviation to the median, 30
alternative hypothesis, 19, 111
Analysis of Variance (ANOVA)
    model, 141, 154
    single-factor, 145, 152
    two-factor, 152

B
barplot, 30
Bernoulli trial, 66
Bernoulli variable, 66
binomial distribution, 90
    normal approximation to, 90

C
central limit theorem, 88, 98
charts
    boxplots, 39
chi-squared distribution, 102, 178
coefficient of variation, 30
coin tossing, 10, 44, 93
combinations, 52
Comma Separated Values, 21
comparative experiments, 14
completely randomized design, 159
conditional probability, 54
confidence interval, 97
    approximate, 97
    approximate – for p (2-sample, binomial distribution), 108