Chapter 4 Hand Out
STAT 763
$$y_i = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon_i$$
Residual Analysis
We have defined the residuals as
$$e_i = y_i - \hat{y}_i \quad \text{for } i = 1, \ldots, n$$
so we can think of these as a measure of deviation between the actual data and the fit. This is a measure of
the variability not explained by the model itself.
We can view e_i as an approximation of ε_i; given this, if the assumptions for ε_i do not hold, this should also be reflected in the residuals.
The residuals have a mean of 0 (E(e_i) = 0) and we approximate their variance by SS_E/(n − p), the mean squared error (MSE). An important point is that the residuals are not independent (owing to the fact that we are estimating the model), but this does not have serious consequences in model adequacy checking provided that the sample size n is large relative to the number of parameters p.
Extracting the residuals from a model is very easy in R
library(robustbase) #### need for delivery data
g = lm(delTime ~ n.prod + distance, data=delivery)
### the original residuals are attached to the lm object
e_i = g$residuals
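We can also compute the variance approximation mentioned above directly from the fitted object; a minimal sketch (g$df.residual stores the n − p residual degrees of freedom):

### residual variance estimate SS_E/(n - p); agrees with summary(g)$sigma^2
MSE = sum(e_i^2) / g$df.residual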
There are two visual approaches we use to check if our error assumptions are reasonable. The residual plot is designed to check if the assumption of constant error variance is reasonable, while the Normal QQ plot is designed to check the normality assumption.
Residual Plots
Two of our four assumptions can be qualitatively checked using residual plots; specifically these plots help
identify if the assumption of the error terms having a 0 mean and a constant standard deviation may not be
valid. A residual plot simply plots the fitted value on the x axis and the residual on the y axis.
The code below shows how to create a residual plot for the residuals from the delivery driver model.
### residual plot
plot(g$fit,g$residuals,xlab='Fitted',ylab='Residual',pch= 20,
main='Residual Plot')
abline(h=0)### adds reference line
[Figure: residual plot for the delivery model; fitted values on the x axis, residuals on the y axis]
What we would like to see is a plot where the points look randomly scattered (sometimes called a sneeze pattern). In addition there should not be residuals more than ±3√MSE from zero, and the number of points above zero should be about the same as the number of points below zero for any fitted value.
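As a quick numeric companion to this rule of thumb, a sketch using the delivery model g fit above:

### count residuals more than 3*sqrt(MSE) away from zero
MSE = sum(g$residuals^2) / g$df.residual
which(abs(g$residuals) > 3 * sqrt(MSE))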
The following residual plots are meant to show examples of patterns we might see, with an explanation of how to interpret them.
[Figure: residual plot showing a random scatter; residuals vs. fitted values]
Here we see no clear pattern. The residuals are scattered evenly above and below the horizontal line (marking the expected value of 0) and we do not see any trend across the x axis.
[Figure: residual plot showing a fan pattern; residuals vs. fitted values]
This is a fan pattern. We can see the variability is increasing as the fitted value increases. This indicates that the variability of the error term depends on the expected value and, as a result, is not constant.
[Figure: residual plot with spread varying with the fitted value; residuals vs. fitted values]
Similar to the last plot, we see a pattern indicating that the variability is linked to the expected value; in this case the variability grows with the square of the explanatory variable.
Sometimes the patterns we see are not a result of a violation of the constant error variance assumption, but rather a failure in the model fit. This occurs when we attempt to fit a linear model when the relationship between the explanatory and response variables is not linear (over either part or all of the observed range of the explanatory variable).
[Figure: residual plot with a curved (nonlinear) pattern; residuals vs. fitted values]
We see that the model is consistently overestimating points in the middle range of the fitted values. This is because the relationship between X and Y is not actually defined by a straight-line relationship. Note we would have to do further testing to determine this in practice.
Normal QQ Plot
A very simple visual method for checking the assumption of normality is to use normal quantile plots. If the errors are normal we expect our sample to behave the same way a sample of normal random variables would. We can check this by comparing the quantiles of our residuals to the theoretical quantiles we would see with a normal distribution. The quantiles are computed as follows: we order our (standardized) residuals from smallest to largest, e_[1], …, e_[n]. The empirical cumulative probability at each index is F(e_[i]) ≡ F_i = (i − 1/2)/n. We then compute the expected normal value (the value we would get at each F_i if the values were generated from a normal distribution), X_i = Φ⁻¹((i − 1/2)/n), and plot the residuals against these expected theoretical values. If the residuals are approximately normal we should see a fairly straight line between the sample values and the theoretical values.
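The construction just described can be done by hand; a minimal sketch using the raw residuals from the delivery model:

### manual normal QQ plot: sorted residuals vs. theoretical normal quantiles
e_sorted = sort(g$residuals)
n = length(e_sorted)
F_i = (1:n - 0.5) / n        ### empirical cumulative probabilities
X_i = qnorm(F_i)             ### expected quantiles under normality
plot(X_i, e_sorted, pch=20, xlab='Theoretical Quantiles', ylab='Sample Quantiles')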
The code below shows how to make a QQ plot for the residuals from the delivery driver model.
### QQ plot
qqnorm(g$residuals,pch=20)
qqline(g$residuals)
[Figure: Normal Q-Q plot of the delivery model residuals; theoretical quantiles vs. sample quantiles]
Following are several examples of Normal QQ plots for different types of error distributions.
[Figure: Normal Q-Q plot, normal error terms]
Notice that the points do not follow the line perfectly. However, there are no obvious and clear deviations between what we got (sample) and what we expected to get (theoretical).
Compare the above to the following 3 Normal QQ plots; these are instances where the error terms do not
follow a normal distribution.
[Figure: Normal Q-Q plot, Poisson error terms]
[Figure: Normal Q-Q plot, t-distributed error terms]
[Figure: Normal Q-Q plot, Gamma error terms]
Aside: Formal Testing for Normality. While the use of normal Q-Q plots is very common, any decision ultimately comes down to a somewhat subjective judgment. There are, however, statistical tests designed to formally test whether it is plausible that the data come from a specified distribution (in our case we would be testing the null that the data are normal). The most common of these tests is the Kolmogorov-Smirnov test, which can easily be performed using the ks.test() function in R. Keep in mind that the R function will complain if there are ties in the data (the basic computation for the test statistic assumes all values in the data are unique). There are ways to do this test with tied data, but that requires more advanced statistical packages.
The main drawback with this test is that we can never prove that the data are normal; if we do not reject
the null hypothesis of normality all we can say is there is not enough evidence that the data are not normal.
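For example, a sketch (not part of the original handout code) applying the test to the delivery model residuals, comparing them to a normal with mean 0 and the model's estimated σ:

### KS test of the residuals against a normal with the model's estimated sigma
ks.test(g$residuals, "pnorm", mean = 0, sd = summary(g)$sigma)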
Scaling Residuals
We can choose to work with the residuals we compute directly, or there are several scalings we can use. The
scalings are designed to make it easier to visually identify potential violations of the assumptions. The two
most popular scaling methods are standardized and studentized.
A standardized residual is

$$d_i = \frac{e_i}{\sqrt{MSE}}$$

which we can view as a unit scaling. Since we are assuming e_i is normal, we should not see |d_i| > 3 (since virtually all of the density of a normal distribution lies within 3 standard deviations of the mean).
A studentized residual accounts for the fact that √MSE is only an approximation and does not reflect the exact standard deviation of each residual. Recall the hat matrix H = X(XᵀX)⁻¹Xᵀ; we use the diagonals of the hat matrix to compute the studentized residual
$$r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$
where h_ii is the ith diagonal of the hat matrix. Note most software will compute this for you automatically. The advantage of studentized residuals over standardized residuals is that the studentized residuals account for the fact that linear models tend to fit points at the edges of the X space better than at the center; residuals (and the standardized residuals) tend to be smaller at the edges of the X space, making it harder to detect violations of the assumptions there.
A more involved approach is the PRESS residual. Here we fit the model without the ith point, and then use the estimated model to compute the residual of the excluded point. Fortunately we do not have to fit n separate models, as it turns out that the ith PRESS residual is

$$e_{(i)} = \frac{e_i}{1 - h_{ii}}.$$
Another option is the R-student, which is an externally studentized residual. When we use MSE as our estimate of σ², the scaling is internal since MSE is computed using all n observations. Using instead S²_(i), the estimate of the variance computed with the ith observation removed, gives

$$t_i = \frac{e_i}{\sqrt{S^2_{(i)}\,(1 - h_{ii})}}.$$
To demonstrate how these are computed in R, the following code uses the delivery time data. You can compare the output of the 5 different residual types to the table on page 137.
library(robustbase) #### need for delivery data
g = lm(delTime ~ n.prod + distance, data=delivery)
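The printout that follows lists all five residual types side by side. A sketch of how those columns can be computed with the formulas above and R's built-in helpers (the names e_i, d_i, r_i, p_i, t_i mirror the column headers; hatvalues(), rstandard(), and rstudent() are base R):

e_i = g$residuals                  ### raw residuals
MSE = sum(e_i^2) / g$df.residual   ### estimate of sigma^2
h_ii = hatvalues(g)                ### diagonals of the hat matrix
d_i = e_i / sqrt(MSE)              ### standardized residuals
r_i = rstandard(g)                 ### (internally) studentized residuals
p_i = e_i / (1 - h_ii)             ### PRESS residuals
t_i = rstudent(g)                  ### R-student (externally studentized)
cbind(e_i, d_i, r_i, p_i, t_i)
summary(g)                         ### the model summary shown after the table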
## e_i d_i r_i p_i t_i
## 1 -5.0280843 -1.54260631 -1.62767993 -5.59796734 -1.69562881
## 2 1.1463854 0.35170879 0.36484267 1.23360321 0.35753764
## 3 -0.0497937 -0.01527661 -0.01609165 -0.05524867 -0.01572177
## 4 4.9243539 1.51078203 1.57972040 5.38401290 1.63916491
## 5 -0.4443983 -0.13634053 -0.14176094 -0.48043610 -0.13856493
## 6 -0.2895743 -0.08884082 -0.09080847 -0.30254339 -0.08873728
## 7 0.8446235 0.25912883 0.27042496 0.91986749 0.26464769
## 8 1.1566049 0.35484408 0.36672118 1.23532680 0.35938983
## 9 7.4197062 2.27635117 3.21376278 14.78889824 4.31078012
## 10 2.3764129 0.72907878 0.81325432 2.95682585 0.80677584
## 11 2.2374930 0.68645843 0.71807970 2.44837821 0.70993906
## 12 -0.5930409 -0.18194377 -0.19325733 -0.66908638 -0.18897451
## 13 1.0270093 0.31508443 0.32517935 1.09387183 0.31846924
## 14 1.0675359 0.32751789 0.34113547 1.15815364 0.33417725
## 15 0.6712018 0.20592338 0.21029137 0.69997845 0.20566324
## 16 -0.6629284 -0.20338513 -0.22270023 -0.79482144 -0.21782566
## 17 0.4363603 0.13387449 0.13803929 0.46393280 0.13492400
## 18 3.4486213 1.05803019 1.11295196 3.81594602 1.11933065
## 19 1.7931935 0.55014821 0.57876634 1.98460588 0.56981420
## 20 -5.7879699 -1.77573772 -1.87354643 -6.44313971 -1.99667657
## 21 -2.6141789 -0.80202492 -0.87784258 -3.13179171 -0.87308697
## 22 -3.6865279 -1.13101946 -1.44999541 -6.05913500 -1.48962473
## 23 -4.6075679 -1.41359270 -1.44368977 -4.80585779 -1.48246718
## 24 -4.5728535 -1.40294240 -1.49605875 -5.20001871 -1.54221512
## 25 -0.2125839 -0.06522033 -0.06750861 -0.22776283 -0.06596332
##
## Call:
## lm(formula = delTime ~ n.prod + distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## n.prod 1.615907 0.170735 9.464 3.25e-09 ***
## distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
We can see that our p value for the significance of regression is extremely low, indicating that at least one of the parameters is useful in predicting delivery time. Further examination of the individual t tests would indicate both parameters are useful. Now we need to check that the necessary model assumptions hold; if they don't, the p values are suspect. Recall we had 5 different choices of residuals. Regardless of which one you choose to use, they all should lead to the same conclusion. The most commonly used is the studentized residual (what we denoted as r_i). Below is the residual plot using the studentized residuals.
plot(g$fitted,r_i,pch=20,xlab='Fitted Values',ylab='Studentized Residuals')
abline(h=0)
[Figure: studentized residual plot for the delivery model; fitted values on the x axis, studentized residuals on the y axis]
The plot does not show any strong unusual patterns, but we should take note of the unusually large residual at the fitted value of 70+. We should also note that the model tends to overpredict short delivery times and underpredict longer delivery times. While this residual plot is not ideal, keep in mind that a relatively small sample was used to fit this model.
Another option is to plot the residuals against each individual regressor to further identify if any deviation
of the residuals is the result of a single explanatory variable. These are not commonly used, and are only
utilized if there are clear violations seen in examining the residual values versus fitted values.
We should also check for normality. The QQ Normal plot of the studentized residuals is shown below.
qqnorm(r_i,pch=20)
qqline(r_i)
[Figure: Normal Q-Q plot of the studentized residuals for the delivery model]
Here we see some deviation in the tails, which would indicate there may be an issue with the normality
assumption. This pattern indicates a light tailed distribution, meaning the errors do not range as far as we
would expect for a normal distribution. Again keep in mind that we are using a relatively small sample, but
given what the QQ plot is showing we should be cautious about accepting conclusions based on the p values
presented.
Additional Diagnostics
Partial Regression Plots
Partial regression plots are used as a method to examine linear relationships between the response variable and each individual explanatory variable. Recall we discussed plotting each explanatory variable against the response variable; those plots do not account for the relationships among the remaining explanatory variables. Partial regression plots are designed to account for this.
As a working example using the delivery data: to create a partial regression plot examining the relationship between number of cases and delivery time, we would fit a model predicting delivery time from distance only. We would then plot the number of cases against each observation's residual from that distance-only model.
Below is an example of this in R (done manually).
g_case = lm(delTime ~ distance, data=delivery)
g_dist = lm(delTime ~ n.prod, data=delivery)
par(mfrow=c(1,2))
plot(delivery$n.prod, rstandard(g_case), pch=20, xlab='Number of Cases',
     ylab='Residuals', main='Partial Regression Cases')
plot(delivery$dist, rstandard(g_dist), pch=20, xlab='Distance',
     ylab='Residuals', main='Partial Regression Distance')
[Figure: two panels, "Partial Regression Cases" (number of cases vs. residuals) and "Partial Regression Distance" (distance vs. residuals)]
dev.off()
## null device
## 1
Note these plots are limited; they do not indicate any potential interaction and can still lead to false conclusions about the presence of linearity between explanatory and response variables. There are packages in R that will create all the necessary models automatically, but we will not cover them here.
PRESS statistic
Another aspect of any model is how well it fits the data. Previously we defined the PRESS residual, which is simply a residual computed from a model fit without that data point. We can extend that idea and measure the total squared distance between each observation and the prediction made by the model without including that observation. Specifically, the PRESS statistic is
$$\text{PRESS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2.$$
A small PRESS statistic indicates the model will predict a new data point fairly accurately (the residual will be small). Note that, like R² and R²_Adj, there is no formal cutoff for what counts as small.
Similar to the PRESS statistic we have an R² for prediction,

$$R^2_{\text{prediction}} = 1 - \frac{\text{PRESS}}{SS_T}$$
which will give some idea of how much variability we expect the model to explain given a new data point.
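Both quantities are straightforward to compute for the delivery model; a minimal sketch using the hat-value shortcut above:

### PRESS statistic and R^2 for prediction
PRESS = sum((g$residuals / (1 - hatvalues(g)))^2)
SST = sum((delivery$delTime - mean(delivery$delTime))^2)
R2_pred = 1 - PRESS / SST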
Outliers and Lack of Fit
Another potential problem with models can be the presence of outliers (extreme values) in the data. While
there are some formal tests (which we will cover later) outliers can be difficult to deal with. The issue is
how to determine if an outlier should be ignored; do we treat outliers as incorrectly recorded data or do
they indicate something we should account for about the relationship between the response and explanatory
variables? Ultimately the decision about how to handle outliers should be made carefully and on a case-by-case basis.
Lack of fit is a formal way to test whether the deviation we see from our model is due to error (the fact that points will not exactly equal the expected value) or whether the model itself is not accounting for all sources of variation. In order to perform the formal test we need to have replicate observations at at least one of the explanatory levels.
For each of these replicate observation sets the Sum of Squares for Pure Error is

$$SS_{PE} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2.$$
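The printout below summarizes, for each distinct x level, the pure-error sum of squares (column y.FUN1) and its degrees of freedom (column y.FUN2). One way such a table could be assembled is sketched here; the data frame dat and its columns x and y are assumptions, since the underlying data are not shown in the handout:

### per-level pure-error SS and df, merged into one table (assumed data frame `dat`)
ss_lev = aggregate(y ~ x, data = dat, FUN = function(v) sum((v - mean(v))^2))
df_lev = aggregate(y ~ x, data = dat, FUN = function(v) length(v) - 1)
SSPE = merge(ss_lev, df_lev, by = 'x', suffixes = c('.FUN1', '.FUN2'))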
## x y.FUN1 y.FUN2
## 1 1.0 1.185800 1
## 2 2.0 0.000000 0
## 3 3.3 1.080450 1
## 4 4.0 11.246667 2
## 5 4.7 0.000000 0
## 6 5.0 0.000000 0
## 7 5.6 1.434067 2
## 8 6.0 0.616050 1
## 9 6.5 0.000000 0
## 10 6.9 0.000000 0
dfPE = sum(SSPE$y.FUN2)   ### total pure-error degrees of freedom
SSPE = sum(SSPE$y.FUN1)   ### total pure-error sum of squares
### SSLOF: lack-of-fit sum of squares (residual SS minus pure-error SS)
SSLOF = sum(g$residuals^2) - SSPE
### m = number of distinct x levels, p = number of model parameters,
### n = total number of observations; g is the model fit to these data (not shown)
F_0 = (SSLOF/(m-p))/(SSPE/(n-m))
F_0
## [1] 13.18827
pf(F_0,m-p,n-m,lower.tail=FALSE)
## [1] 0.001388715
Given that our p value is very small, we would conclude that the fitted linear model is not appropriate for predicting the expected value of the response.