9 Regression (Statistics IEM 2-2)
REGRESSION
Correlation vs. Regression
A scatter plot can be used to show the relationship between two
variables
Correlation analysis is used to measure the strength of the
association (linear relationship) between two variables
Correlation is only concerned with strength of the relationship
No causal effect is implied with correlation
[Figure: four scatter plots of Y against X]
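As a rough illustration of how correlation measures the strength of a linear relationship, the sketch below computes the sample correlation coefficient for two small data sets. The data values and the helper name `pearson_r` are hypothetical and do not come from the slides.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, chosen only to contrast a strong and a weak relationship
x = [1, 2, 3, 4, 5]
y_strong = [2.1, 3.9, 6.2, 7.8, 10.1]  # points lie close to a straight line
y_weak = [5.0, 1.0, 6.0, 2.0, 4.0]     # little linear pattern

r_strong = pearson_r(x, y_strong)
r_weak = pearson_r(x, y_weak)
print(round(r_strong, 3), round(r_weak, 3))
```

A value near +1 or -1 indicates a strong linear association and a value near 0 a weak one; neither says anything about cause and effect.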
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Types of Relationships
[Figure: scatter plots of Y against X showing strong relationships (left) and weak relationships (right)]
Types of Relationships
No relationship
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Linear component: $\beta_0 + \beta_1 X_i$
Random error component: $\varepsilon_i$
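[Figure: the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ plotted as Y against X. Intercept = $\beta_0$, slope = $\beta_1$. At a given $X_i$, the observed value of Y differs from the predicted value of Y on the line by the random error $\varepsilon_i$ for that $X_i$ value.]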
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:
$$\hat{Y}_i = b_0 + b_1 X_i$$
where
$\hat{Y}_i$ = estimated (or predicted) Y value for observation i
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
$X_i$ = value of X for observation i
Building a house costs about $75 per square foot.
Most lots sell for $25,000.
House cost = 25000 + 75(Size)
[Figure: House cost plotted against House size]
However, house costs vary even among houses of the same size!
Since costs behave unpredictably, we add a random component.
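The idea of adding a random component to a deterministic line can be sketched as follows. The $25,000 lot price and $75 per square foot come from the slide; the normal error term, its standard deviation `noise_sd`, and the function name `house_cost` are assumptions made only for illustration.

```python
import random

LOT_PRICE = 25_000     # from the slide: most lots sell for $25,000
COST_PER_SQ_FT = 75    # from the slide: about $75 per square foot

def house_cost(size_sq_ft, noise_sd=10_000, rng=random):
    """Deterministic line plus a random error term.

    noise_sd is a hypothetical spread for the error; the slides give no value.
    """
    epsilon = rng.gauss(0, noise_sd)
    return LOT_PRICE + COST_PER_SQ_FT * size_sq_ft + epsilon

rng = random.Random(0)  # fixed seed so the run is repeatable
same_size_costs = [house_cost(2_000, rng=rng) for _ in range(5)]
print([round(c) for c in same_size_costs])
```

Five houses of the same size get five different costs, which is the behaviour the random component is meant to capture.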
Question: What should be considered a good line?
[Figure: scatter plot of Y against X]
The Least Squares (Regression) Line
Let us compare two lines.
The second line is horizontal.
[Figure: the data points (1,2), (2,4), (3,1.5) and (4,3.2) with two candidate lines; the horizontal line is at height 2.5]
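One way to compare candidate lines is by the sum of squared vertical differences between the points and each line. The sketch below uses the four data points and the horizontal line at 2.5 from the slide. The first line is not recoverable from the text, so `first_line` (the line Y = X) is only a hypothetical stand-in.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]  # data points from the slide

def sum_squared_diff(points, line):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)

def first_line(x):
    # Hypothetical stand-in for the first line (assumption: Y = X)
    return x

def second_line(x):
    # The horizontal line from the slide
    return 2.5

ssd_first = sum_squared_diff(points, first_line)
ssd_second = sum_squared_diff(points, second_line)
print(round(ssd_first, 2), round(ssd_second, 2))
```

Under this criterion, the line with the smaller sum of squared differences fits the points better.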
To calculate the estimates of the line coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
$$b_1 = \frac{\operatorname{cov}(X,Y)}{s_X^2} = \frac{s_{XY}}{s_X^2}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
The regression equation that estimates the equation of the first order linear model is:
$$\hat{Y} = b_0 + b_1 X$$
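The coefficient formulas can be sketched directly in code. The function name `least_squares_line` is an assumption for illustration, and the data are the four points (1,2), (2,4), (3,1.5), (4,3.2) that appear earlier in the slides.

```python
from statistics import mean, variance

def least_squares_line(x, y):
    """Return (b0, b1) using b1 = cov(X,Y) / s_X^2 and b0 = Y_bar - b1 * X_bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance with the n - 1 divisor
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    b1 = cov_xy / variance(x)  # variance() is the sample variance
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4]
y = [2, 4, 1.5, 3.2]
b0, b1 = least_squares_line(x, y)
print(round(b0, 3), round(b1, 3))
```

For these points the fitted line is approximately Ŷ = 2.4 + 0.11X.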
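The Simple Linear Regression Line
$\bar{X} = 36{,}009.45$; $\bar{Y} = 14{,}822.823$; $s_X^2 = 43{,}528{,}690$;
$$\operatorname{cov}(X,Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = -2{,}712{,}511$$
where n = 100.
$$b_1 = \frac{\operatorname{cov}(X,Y)}{s_X^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$$
$$b_0 = \bar{Y} - b_1 \bar{X} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$$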
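The arithmetic for the two coefficients can be checked with a few lines of code. The inputs are the summary statistics quoted in the example (covariance of -2,712,511, variance of X of 43,528,690, and the two means); the variable names are illustrative.

```python
# Summary statistics quoted in the example (n = 100)
cov_xy = -2_712_511   # sample covariance of X and Y
s2_x = 43_528_690     # sample variance of X
x_bar = 36_009.45     # mean of X
y_bar = 14_822.823    # mean of Y

b1 = cov_xy / s2_x          # slope
b0 = y_bar - b1 * x_bar     # intercept

print(round(b1, 5), round(b0))
```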
Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       303.1
Observations         100

ANOVA
              df    SS          MS          F         Significance F
Regression    1     16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

$$\hat{Y} = 17{,}067 - 0.0623X$$
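The ANOVA entries are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below recomputes those entries from the SS and df columns of the output; the variable names are illustrative.

```python
# Values taken from the ANOVA table
ss_regression, df_regression = 16_734_111, 1
ss_residual, df_residual = 9_005_450, 98

ms_regression = ss_regression / df_regression  # mean square for regression
ms_residual = ss_residual / df_residual        # mean square for residual
f_stat = ms_regression / ms_residual           # F statistic
ss_total = ss_regression + ss_residual         # total sum of squares
standard_error = ms_residual ** 0.5            # standard error of estimate

print(round(ms_residual), round(f_stat, 2), ss_total, round(standard_error, 1))
```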
Interpreting the Linear Regression Equation
[Figure: Odometer Line Fit Plot, Price against Odometer; the line has intercept 17,067, and there is no data near Odometer = 0]
$$\hat{Y} = 17{,}067 - 0.0623X$$
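As a small illustration of reading the fitted equation, the sketch below evaluates Ŷ = 17,067 - 0.0623X at one odometer value. The function name `predicted_price` and the input 40,000 are hypothetical; the input is chosen to lie inside the range where the plot shows data.

```python
def predicted_price(odometer):
    """Predicted price from the fitted line Y_hat = 17,067 - 0.0623 * X."""
    return 17_067 - 0.0623 * odometer

price_at_40k = predicted_price(40_000)  # hypothetical odometer reading
# Change in predicted price per additional unit on the odometer (the slope)
drop_per_unit = predicted_price(40_001) - price_at_40k

print(round(price_at_40k, 2), round(drop_per_unit, 4))
```

Because there is no data near an odometer reading of 0, the intercept should be read as a parameter of the line and not as a prediction for an undriven car.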
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
– A shortcut formula
$$SSE = (n-1)\left[ s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2} \right]$$
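A quick numerical check suggests that the shortcut gives the same value as summing the squared residuals directly. The sketch below uses four illustrative points, (1,2), (2,4), (3,1.5) and (4,3.2); the variable names are assumptions.

```python
from statistics import mean, variance

x = [1, 2, 3, 4]
y = [2, 4, 1.5, 3.2]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Least squares coefficients
b1 = cov_xy / variance(x)
b0 = y_bar - b1 * x_bar

# Direct definition: sum of squared residuals around the fitted line
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Shortcut: (n - 1) * (s_Y^2 - cov(X,Y)^2 / s_X^2)
sse_shortcut = (n - 1) * (variance(y) - cov_xy ** 2 / variance(x))

print(round(sse_direct, 4), round(sse_shortcut, 4))
```

The shortcut needs only the two sample variances and the covariance, so SSE can be obtained without computing each fitted value.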
Standard Error of Estimate
◦ The mean error is equal to zero.
◦ If $\sigma_\varepsilon$ is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
◦ Therefore, we can use $\sigma_\varepsilon$ as a measure of the suitability of using a linear model.
◦ An estimator of $\sigma_\varepsilon$ is given by $s_\varepsilon$:
Standard Error of Estimate
$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}$$
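◦ Example 17.3
◦ Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
◦ Solution
$$s_Y^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1} = 259{,}996 \quad \text{(calculated before)}$$
$$SSE = (n-1)\left[ s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2} \right] = 99\left[ 259{,}996 - \frac{(-2{,}712{,}511)^2}{43{,}528{,}690} \right] = 9{,}005{,}450$$
$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{9{,}005{,}450}{98}} = 303.13$$
It is hard to assess the model based on $s_\varepsilon$, even when compared with the mean value of Y: $s_\varepsilon = 303.1$, $\bar{y} = 14{,}823$.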
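The solution's arithmetic can be reproduced as below. The inputs are the rounded summary statistics quoted in the example, so the computed SSE differs slightly from the 9,005,450 shown, which presumably reflects unrounded inputs. The variable names are illustrative.

```python
from math import sqrt

n = 100
s2_y = 259_996        # sample variance of Y (rounded, from the example)
s2_x = 43_528_690     # sample variance of X
cov_xy = -2_712_511   # sample covariance of X and Y

# Shortcut formula for the sum of squared errors
sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

# Standard error of estimate
s_eps = sqrt(sse / (n - 2))

print(round(sse), round(s_eps, 1))
```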
Inferences About the Slope: t Test
◦ Test statistic
$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}}$$
where:
$b_1$ = regression slope coefficient
$\beta_1$ = hypothesized slope
$S_{b_1}$ = standard error of the slope
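As a sketch of the test statistic in use, the code below plugs in the Odometer slope and its standard error as printed in the regression output. The hypothesized slope of 0 is an assumption (the usual null hypothesis of no linear relationship). Because the printed inputs are rounded, the result is close to, but not exactly, the -13.49 reported in the output.

```python
b1 = -0.0623      # slope estimate, as printed in the regression output
s_b1 = 0.0046     # standard error of the slope, as printed in the output
beta1_h0 = 0.0    # hypothesized slope (assumption: H0 is beta1 = 0)

t_stat = (b1 - beta1_h0) / s_b1

print(round(t_stat, 2))
```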
$$R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2 \, s_Y^2}, \quad \text{or} \quad R^2 = r_{XY}^2;$$
$$\text{or} \quad R^2 = 1 - \frac{SSE}{\sum (Y_i - \bar{Y})^2} \quad \text{(see p. 18 above)}$$
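Both forms of R² can be evaluated from numbers already quoted in the example: the covariance and variances from the coefficient calculation, and the residual and total sums of squares from the ANOVA output. The sketch below is only a numerical check, with illustrative variable names.

```python
cov_xy = -2_712_511     # sample covariance of X and Y
s2_x = 43_528_690       # sample variance of X
s2_y = 259_996          # sample variance of Y
sse = 9_005_450         # residual sum of squares
ss_total = 25_739_561   # total sum of squares, sum of (Y_i - Y_bar)^2

r2_from_cov = cov_xy ** 2 / (s2_x * s2_y)
r2_from_sse = 1 - sse / ss_total

print(round(r2_from_cov, 4), round(r2_from_sse, 4))
```

The two routes agree to four decimal places and match the R Square value of 0.6501 in the output.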
– Using the computer
From the regression output we have

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       303.1
Observations         100

ANOVA
              df    SS          MS          F         Significance F
Regression    1     16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.