Chapter 06.05
Adequacy of Regression Models
After reading this chapter, you should be able to
1. determine if a linear regression model is adequate
2. determine how well the linear regression model predicts the response variable.
Quality of Fitted Model
In the application of regression models, one objective is to obtain an equation $y = f(x)$ that best describes the $n$ response data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Consequently, we are faced with answering two basic questions.
1. Does the model $y = f(x)$ describe the data adequately, that is, is there an adequate fit?
2. How well does the model predict the response variable (predictability)?
To answer these questions, let us limit our discussion to straight line models as
nonlinear models require a different approach. Some authors [1] claim that nonlinear model
parameters are not unbiased.
To exemplify our discussion, we will take example data to go through the process of
model evaluation. Given below is the data for the coefficient of thermal expansion vs.
temperature for steel. We assume a linear relationship between the data as
$$\alpha(T) = a_0 + a_1 T$$
Table 1 Values of coefficient of thermal expansion vs. temperature.

    T (°F)    α (μin/in/°F)
    -340      2.45
    -260      3.58
    -180      4.52
    -100      5.28
     -20      5.86
      60      6.36
Following the procedure for conducting linear regression as given in Chapter 06.03, we get
$$\alpha(T) = 6.0325 + 0.0096964 T$$
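As a quick numerical check, the fit above can be reproduced with a few lines of code. The Python sketch below is not part of the original chapter; it applies the standard least-squares formulas (the procedure of Chapter 06.03) to the Table 1 data, with illustrative variable names.

```python
# Reproduce the straight-line fit alpha(T) = a0 + a1*T for the Table 1 data
# using the standard least-squares formulas.
import numpy as np

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])   # temperature, deg F
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])        # thermal expansion coefficient

n = len(T)
a1 = (n * np.sum(T * alpha) - np.sum(T) * np.sum(alpha)) / (n * np.sum(T**2) - np.sum(T)**2)
a0 = np.mean(alpha) - a1 * np.mean(T)

print(f"a0 = {a0:.4f}, a1 = {a1:.7f}")   # approximately a0 = 6.0325, a1 = 0.0096964
```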
Let us now look at how we can evaluate the adequacy of a linear regression model.
1. Plot the data and the regression model.
Figure 1 shows the data and the regression model. From a visual check, it looks like the
model explains the data adequately.
Figure 1 Plot of coefficient of thermal expansion vs. temperature data points and regression
line.
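A plot like Figure 1 can be generated in a few lines. The sketch below is not from the chapter; it assumes matplotlib is available and reuses the fitted coefficients obtained above.

```python
# Plot the data points of Table 1 together with the fitted straight line.
import numpy as np
import matplotlib.pyplot as plt

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])
a0, a1 = 6.0325, 0.0096964                     # fitted coefficients from the chapter

T_line = np.linspace(T.min(), T.max(), 100)
plt.plot(T, alpha, "o", label="data")
plt.plot(T_line, a0 + a1 * T_line, "-", label="regression line")
plt.xlabel("Temperature, T (°F)")
plt.ylabel("Coefficient of thermal expansion, α (μin/in/°F)")
plt.legend()
plt.show()
```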
2. Calculate the standard error of estimate.
The standard error of estimate is defined as
$$s_{\alpha/T} = \sqrt{\frac{S_r}{n-2}}$$
where
$$S_r = \sum_{i=1}^{n} \left(\alpha_i - a_0 - a_1 T_i\right)^2$$
Table 2 Residuals of the data.

    T_i     α_i     a_0 + a_1 T_i     α_i − a_0 − a_1 T_i
    -340    2.45    2.7357            -0.28571
    -260    3.58    3.5114             0.068571
    -180    4.52    4.2871             0.23286
    -100    5.28    5.0629             0.21714
     -20    5.86    5.8386             0.021429
      60    6.36    6.6143            -0.25429
Table 2 shows the residuals of the data, from which the sum of the squares of the residuals is calculated as
$$S_r = (-0.28571)^2 + (0.068571)^2 + (0.23286)^2 + (0.21714)^2 + (0.021429)^2 + (-0.25429)^2 = 0.25283$$
The standard error of estimate is then
$$s_{\alpha/T} = \sqrt{\frac{S_r}{n-2}} = \sqrt{\frac{0.25283}{6-2}} = 0.25141$$
The units of $s_{\alpha/T}$ are the same as the units of $\alpha$. How is the value of the standard error of
estimate interpreted? We may say that, on average, the difference between the observed and
predicted values is 0.25141 μin/in/°F. We can also look at the value as follows: about 95%
of the observed values fall within $\pm 2 s_{\alpha/T}$ of the predicted value (see Figure 2). This
would lead us to believe that the value of $\alpha$ in the example is expected to be accurate within
$\pm 2 s_{\alpha/T} = \pm 2 \times 0.25141 = \pm 0.50282$ μin/in/°F.
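The arithmetic above is easy to verify in code. The following Python sketch (not from the chapter; variable names are illustrative) computes $S_r$ and the standard error of estimate directly from the data and the fitted coefficients.

```python
# Compute the sum of the squares of the residuals and the standard error of estimate.
import numpy as np

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])
a0, a1 = 6.0325, 0.0096964                 # fitted straight-line coefficients

residuals = alpha - (a0 + a1 * T)          # observed minus predicted values
S_r = np.sum(residuals**2)                 # sum of squares of residuals, about 0.25283
s_alpha_T = np.sqrt(S_r / (len(T) - 2))    # standard error of estimate, about 0.25141

print(f"S_r = {S_r:.5f}, s_alpha/T = {s_alpha_T:.5f}")
```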
Figure 2 Plotting the linear regression line and showing the regression standard error.
One can also look at this criterion as finding if 95% of the scaled residuals for the model are
in the domain [-2,2], that is
$$\text{Scaled residual} = \frac{\alpha_i - a_0 - a_1 T_i}{s_{\alpha/T}}$$
For the example, $s_{\alpha/T} = 0.25141$, and the scaled residuals are calculated in Table 4.
Table 4 Residuals and scaled residuals for the data.

    T_i     α_i     α_i − a_0 − a_1 T_i    Scaled residual
    -340    2.45    -0.28571               -1.1364
    -260    3.58     0.068571               0.27275
    -180    4.52     0.23286                0.92622
    -100    5.28     0.21714                0.86369
     -20    5.86     0.021429               0.085235
      60    6.36    -0.25429               -1.0115
All the scaled residuals are in the [−2, 2] domain.
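This check is straightforward to carry out in code. The sketch below (illustrative, not from the chapter) computes the scaled residuals and confirms that they all lie in [−2, 2].

```python
# Compute the scaled residuals and check that they all fall in [-2, 2].
import numpy as np

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])
a0, a1 = 6.0325, 0.0096964

residuals = alpha - (a0 + a1 * T)
s_alpha_T = np.sqrt(np.sum(residuals**2) / (len(T) - 2))

scaled = residuals / s_alpha_T             # scaled residuals of Table 4
print(np.round(scaled, 3))                 # reproduces Table 4 to rounding
print(bool(np.all(np.abs(scaled) <= 2)))   # True: all scaled residuals lie in [-2, 2]
```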
3. Calculate the coefficient of determination.
Denoted by $r^2$, the coefficient of determination is another criterion to use for checking the
adequacy of the model.
To answer the above questions, let us start by examining some measures of the discrepancy
between the data and a measure of central tendency. Look at the two equations given below.
$$S_r = \sum_{i=1}^{n} \left(\alpha_i - \hat{\alpha}_i\right)^2 = \sum_{i=1}^{n} \left(\alpha_i - a_0 - a_1 T_i\right)^2 \qquad (1)$$

$$S_t = \sum_{i=1}^{n} \left(\alpha_i - \bar{\alpha}\right)^2 \qquad (2)$$

where

$$\bar{\alpha} = \frac{\sum_{i=1}^{n} \alpha_i}{n}$$

For the example data,

$$\bar{\alpha} = \frac{\sum_{i=1}^{6} \alpha_i}{6} = \frac{2.45 + 3.58 + 4.52 + 5.28 + 5.86 + 6.36}{6} = 4.6750 \text{ μin/in/°F}$$

and

$$S_t = \sum_{i=1}^{6} \left(\alpha_i - \bar{\alpha}\right)^2 = 10.783$$

The coefficient of determination is then obtained from these two measures as

$$r^2 = \frac{S_t - S_r}{S_t} = \frac{10.783 - 0.25283}{10.783} = 0.97655$$
Based on the value obtained above, we can claim that 97.7% of the original uncertainty in the
value of $\alpha$ can be explained by the straight-line regression model
$\alpha(T) = 6.0325 + 0.0096964 T$.
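The quantities $S_t$, $S_r$, and $r^2$ for the example can be verified with the short Python sketch below (illustrative, not from the chapter).

```python
# Compute S_t, S_r, and the coefficient of determination for the example data.
import numpy as np

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])
a0, a1 = 6.0325, 0.0096964

S_r = np.sum((alpha - (a0 + a1 * T))**2)   # variation left after the straight-line fit
S_t = np.sum((alpha - np.mean(alpha))**2)  # variation about the mean, about 10.783
r2 = (S_t - S_r) / S_t                     # coefficient of determination, about 0.9766

print(f"S_t = {S_t:.4f}, S_r = {S_r:.5f}, r^2 = {r2:.4f}")
```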
Going back to the definition of the coefficient of determination, one can see that $S_t$ is the
variation without any relationship of $y$ vs. $x$, while $S_r$ is the variation with the straight-line
relationship.
The limits of the values of $r^2$ are between 0 and 1. What do these limiting values of
$r^2$ mean? If $r^2 = 0$, then $S_t = S_r$, which means that regressing the data to a straight line
does nothing to explain the data any further. If $r^2 = 1$, then $S_r = 0$, which means that the
straight line passes through all the data points and is a perfect fit.
Caution in the use of $r^2$
a) The coefficient of determination $r^2$ can be made larger by adding more terms to the
model (assuming no collinear points). For instance, $n-1$ terms in a regression equation
for which $n$ data points are used will give an $r^2$ value of 1 if there are no collinear
points (see the sketch after this list).
b) The magnitude of $r^2$ also depends on the range of variability of the regressor ($x$)
variable. An increase in the spread of $x$ increases $r^2$, while a decrease in the spread of $x$
decreases $r^2$.
c) A large regression slope will also yield an artificially high $r^2$.
d) The coefficient of determination $r^2$ does not measure the appropriateness of the
linear model; $r^2$ may be large for nonlinearly related $x$ and $y$ values.
e) A large coefficient of determination $r^2$ does not necessarily imply that the regression
will predict accurately.
f) The coefficient of determination $r^2$ does not measure the magnitude of the regression
slope.
g) The statements above imply that one should not choose a regression model solely
based on consideration of $r^2$.
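As an illustration of caution (a), the Python sketch below (not from the chapter) fits the six example data points with a straight line and with a polynomial of degree $n-1 = 5$. The interpolating polynomial passes through every point, so its $r^2$ is (numerically) 1 even though it has no physical justification.

```python
# Illustrate that adding terms inflates r^2: a degree n-1 polynomial through n
# points interpolates the data exactly.
import numpy as np
from numpy.polynomial import Polynomial

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])

S_t = np.sum((alpha - np.mean(alpha))**2)      # total variation about the mean

for degree in (1, 5):                          # straight line vs. degree n - 1 = 5
    p = Polynomial.fit(T, alpha, deg=degree)   # least-squares polynomial fit (scaled domain)
    S_r = np.sum((alpha - p(T))**2)            # residual variation for this model
    print(f"degree {degree}: r^2 = {(S_t - S_r) / S_t:.6f}")
    # degree 5 interpolates all 6 points, so its r^2 is essentially 1
```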
4. Find if the model meets the assumptions of random errors.
These assumptions include that the residuals are negative as well as positive so that they give a mean
of zero, that the variation of the residuals as a function of the independent variable is random, that the
residuals follow a normal distribution, and that there is no autocorrelation between the data
points.
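Simple numerical checks of these assumptions can be sketched as follows. The Python code below is illustrative only (it is applied to the six-point example, and the diagnostics chosen are common ones rather than the chapter's); in practice one would also examine a residual plot and a histogram or normal probability plot.

```python
# Simple diagnostics for the random-error assumptions on the residuals.
import numpy as np

T = np.array([-340.0, -260.0, -180.0, -100.0, -20.0, 60.0])
alpha = np.array([2.45, 3.58, 4.52, 5.28, 5.86, 6.36])
a0, a1 = 6.0325, 0.0096964                     # straight-line model from the chapter

e = alpha - (a0 + a1 * T)                      # residuals

# 1. Residuals should be both positive and negative, with a mean near zero.
print("mean of residuals:", np.mean(e))

# 2. No systematic pattern of residuals versus the independent variable:
#    count sign changes; very few changes suggest a trend rather than randomness.
print("sign changes:", int(np.sum(np.diff(np.sign(e)) != 0)))

# 3. Little autocorrelation between successive residuals (lag-1 correlation near 0).
print("lag-1 autocorrelation:", np.corrcoef(e[:-1], e[1:])[0, 1])
```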
To illustrate this better, we take an extended data set for the example. Instead of 6 data
points, this set has 22 data points (Table 6). Drawing conclusions about the assumptions of
random errors from small data sets is not recommended.
Figure 3 Plot of thermal expansion coefficient vs. temperature data points and regression line
for more data points.
Regressing the data from Table 6 to the straight-line regression model
$$\alpha(T) = a_0 + a_1 T$$
and following the procedure for conducting linear regression as given in Chapter 06.03, we
get (Figure 3)
$$\alpha(T) = 6.0248 + 0.0093868 T$$