
9 Regression (Statistics IEM 2-2)

The document provides an overview of simple linear regression, explaining the difference between correlation and regression, and how regression analysis is used to predict the value of a dependent variable based on one independent variable. It details the structure of the regression model, including the least squares method for estimating coefficients, and emphasizes the importance of assessing model fit through various statistical measures. Additionally, it includes examples and calculations to illustrate the process of deriving the regression equation and interpreting its results.


SIMPLE LINEAR REGRESSION
Correlation vs. Regression
 A scatter plot can be used to show the relationship between two variables
 Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
 Correlation is concerned only with the strength of the relationship
 No causal effect is implied by correlation

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.


12.1 Regression Models DCOVA

[Figure: scatter plots of linear relationships (straight-line patterns, rising and falling) and curvilinear relationships (curved patterns), each plotting Y against X]
Types of Relationships

[Figure: scatter plots contrasting strong relationships (points tightly clustered about a line) with weak relationships (points loosely scattered), each plotting Y against X]
Types of Relationships

[Figure: scatter plot showing no relationship: points scattered with no discernible pattern]


12.2 Introduction to Regression Analysis

 Regression analysis is used to:
 Predict the value of a dependent variable based on the value of at least one independent variable
 Explain the impact of changes in an independent variable on the dependent variable

 Dependent variable: the variable we wish to predict or explain
 Independent variable: the variable used to predict or explain the dependent variable



Simple Linear Regression Model

◦ Only one independent variable, X
◦ The relationship between X and Y is described by a linear function
◦ Changes in Y are assumed to be related to changes in X


Simple Linear Regression Model

Yi = β0 + β1Xi + εi

where
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term

β0 + β1Xi is the linear component; εi is the random error component.



Simple Linear Regression Model (continued) DCOVA

Yi = β0 + β1Xi + εi

[Figure: the population regression line with intercept β0 and slope β1; for a given Xi, the observed value of Y lies a random error εi above or below the predicted value of Y on the line]
Simple Linear Regression Equation (Prediction Line) DCOVA

The simple linear regression equation provides an estimate of the population regression line:

Ŷi = b0 + b1Xi

where
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i



The Model

The model has a deterministic component and a probabilistic component.

[Figure: house cost plotted against house size. Building a house costs about $75 per square foot, and most lots sell for $25,000, so House cost = 25,000 + 75(Size).]

However, house costs vary even among houses of the same size! Since cost behaves unpredictably, we add a random component:

House cost = 25,000 + 75(Size) + ε
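The deterministic-plus-random structure above can be sketched in Python. The $25,000 lot price and $75 per square foot come from the slide; the error spread sigma is an illustrative assumption.

```python
import random

def house_cost(size_sqft, sigma=2500, seed=None):
    """House cost = 25,000 + 75 * size, plus a random error term.

    The lot price and per-square-foot cost come from the slide;
    sigma, the spread of the error, is an illustrative assumption.
    """
    rng = random.Random(seed)
    deterministic = 25_000 + 75 * size_sqft
    epsilon = rng.gauss(0, sigma)   # the probabilistic component
    return deterministic + epsilon

# Two same-size houses cost different amounts because of epsilon:
print(house_cost(2000, seed=1))
print(house_cost(2000, seed=2))
```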
Estimating the Coefficients

◦ The estimates are determined by:
◦ drawing a sample from the population of interest,
◦ calculating sample statistics,
◦ producing a straight line that cuts through the data.

[Figure: scatter of sample points with a candidate straight line] Question: What should be considered a good line?
The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared differences between the points and the line.

Let us compare two lines through the points (1,2), (2,4), (3,1.5), and (4,3.2); the second line is the horizontal line Y = 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
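The comparison of the two candidate lines can be checked numerically. Assuming (from the residuals shown) that the first line predicts 1, 2, 3, 4 at the four points, a short script reproduces both sums:

```python
# Points from the slide: (x, observed y)
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(predict, pts):
    """Sum of squared differences between observed y and the line's prediction."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

line1 = lambda x: x      # assumed from the slide's residuals: predictions 1, 2, 3, 4
line2 = lambda x: 2.5    # the horizontal line Y = 2.5

print(round(sum_sq_diff(line1, points), 2))  # 7.89
print(round(sum_sq_diff(line2, points), 2))  # 3.99: the smaller sum, the better the fit
```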
The Estimated Coefficients

To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X,Y)/sX²
b0 = Ȳ − b1X̄

The regression equation that estimates the equation of the first-order linear model is:

Ŷ = b0 + b1X



The Simple Linear Regression Line

• Example 17.2 (Xm17-02)
◦ A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
◦ A random sample of 100 cars is selected, and the data recorded.
◦ Find the regression line.

Car   Odometer (independent variable X)   Price (dependent variable Y)
1     37388                               14636
2     44758                               14122
3     45833                               14016
4     30862                               15590
5     31705                               15568
6     34010                               14718
.     .                                   .
.     .                                   .
.     .                                   .
• Solution
– Solving by hand: calculate a number of statistics (n = 100):

X̄ = 36,009.45;  sX² = Σ(Xi − X̄)²/(n − 1) = 43,528,690
Ȳ = 14,822.82;  cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ)/(n − 1) = −2,712,511

b1 = cov(X,Y)/sX² = −2,712,511/43,528,690 = −.06232
b0 = Ȳ − b1X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067

Ŷ = b0 + b1X = 17,067 − .0623X



• Solution – continued
– Using the computer (Xm17-02):
Tools > Data Analysis > Regression > [Shade the Y range and the X range] > OK

Ŷ = 17,067 − .0623X

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS         MS         F        Significance F
Regression  1    16734111   16734111   182.11   0.0000
Residual    98   9005450    91892
Total       99   25739561

            Coefficients   Standard Error   t Stat   P-value
Intercept   17067          169              100.97   0.0000
Odometer    -0.0623        0.0046           -13.49   0.0000
Interpreting the Linear Regression Equation

Ŷ = 17,067 − .0623X

[Figure: Odometer Line Fit Plot of Price (13000 to 16000) against Odometer, with the fitted line; there are no data near Odometer = 0]

The intercept is b0 = $17,067. Do not interpret the intercept as the "price of cars that have not been driven": there are no data in the region around X = 0.

The slope is b1 = −.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623.
Error Variable: Required Conditions
◦ The error ε is a critical part of the regression model.
◦ Four requirements involving the distribution of ε must be satisfied:
◦ The probability distribution of ε is normal.
◦ The mean of ε is zero: E(ε) = 0.
◦ The standard deviation of ε is σε, a constant, for all values of X.
◦ The errors associated with different values of Y are all independent.
Assessing the Model
◦ The least squares method will produce a regression line whether or not there is a linear relationship between X and Y.
◦ Consequently, it is important to assess how well the linear model fits the data.
◦ Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
Sum of Squares for Errors
◦ This is the sum of squared differences between the points and the regression line.
◦ It can serve as a measure of how well the line fits the data. SSE is defined by

SSE = Σi=1..n (Yi − Ŷi)²

– A shortcut formula:

SSE = (n − 1)[sY² − cov(X,Y)²/sX²]
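A quick way to confirm the shortcut formula: on toy data, SSE computed from the residuals at the least squares coefficients matches the shortcut. The data and coefficients below are illustrative, not from the example:

```python
def sse_direct(xs, ys, b0, b1):
    """SSE by definition: the sum of squared residuals about the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def sse_shortcut(xs, ys):
    """SSE = (n - 1) * (s_Y^2 - cov(X,Y)^2 / s_X^2)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s2x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    s2y = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    return (n - 1) * (s2y - cov ** 2 / s2x)

# Toy data; b0 = 2.4, b1 = 0.11 are its least squares coefficients.
xs, ys = [1, 2, 3, 4], [2, 4, 1.5, 3.2]
print(round(sse_direct(xs, ys, 2.4, 0.11), 3))  # 3.807
print(round(sse_shortcut(xs, ys), 3))           # 3.807
```

Note the shortcut only equals the residual-based SSE when the residuals are taken about the least squares line itself.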
Standard Error of Estimate
◦ The mean error is equal to zero.
◦ If σε is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well.
◦ Therefore, we can use σε as a measure of the suitability of using a linear model.
◦ An estimator of σε is given by sε, the standard error of estimate:

sε = √(SSE/(n − 2))
◦ Example 17.3
◦ Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
◦ Solution

Calculated before: sY² = Σ(Yi − Ȳ)²/(n − 1) = 259,996

SSE = (n − 1)[sY² − cov(X,Y)²/sX²] = 99[259,996 − (−2,712,511)²/43,528,690] = 9,005,450

sε = √(SSE/(n − 2)) = √(9,005,450/98) = 303.13

It is hard to assess the model based on sε even when compared with the mean value of Y: sε = 303.1, Ȳ = 14,823.
Inferences About the Slope: t Test DCOVA

◦ t test for a population slope: is there a linear relationship between X and Y?
◦ Null and alternative hypotheses:
◦ H0: β1 = 0 (no linear relationship)
◦ H1: β1 ≠ 0 (linear relationship does exist)
◦ Test statistic:

tSTAT = (b1 − β1)/Sb1,   d.f. = n − 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
◦ Example 17.4
◦ Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 17.2. Use α = 5%.
◦ Solving by hand
◦ To compute t we need the values of b1 and sb1:

b1 = −.0623
sb1 = sε/√((n − 1)sX²) = 303.1/√((99)(43,528,690)) = .00462
t = (b1 − β1)/sb1 = (−.0623 − 0)/.00462 = −13.49

◦ The rejection region is t > t.025 or t < −t.025 with ν = n − 2 = 98. Approximately, t.025 = 1.984. Since −13.49 < −1.984, we reject H0.
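The t statistic for Example 17.4 can be reproduced from the slide's numbers:

```python
import math

# Slide statistics for Example 17.4:
b1 = -0.0623
s_eps = 303.1
n, s2_x = 100, 43_528_690

s_b1 = s_eps / math.sqrt((n - 1) * s2_x)   # standard error of the slope
t = (b1 - 0) / s_b1                        # test statistic under H0: beta1 = 0
print(round(s_b1, 5), round(t, 2))         # 0.00462 -13.49
```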
• Using the computer (Xm17-02)

Price   Odometer
14636   37388
14122   44758
14016   45833
15590   30862
15568   31705
14718   34010
14470   45854
15690   19057
15072   40149
14802   40237
15190   32359
14660   43533
15612   32744
15610   34470
14634   37720
14632   41350
15740   24469

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS         MS         F        Significance F
Regression  1    16734111   16734111   182.11   0.0000
Residual    98   9005450    91892
Total       99   25739561

            Coefficients   Standard Error   t Stat   P-value
Intercept   17067          169              100.97   0.0000
Odometer    -0.0623        0.0046           -13.49   0.0000

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
◦ To measure the strength of the linear relationship we use the coefficient of determination:

R² = cov(X,Y)²/(sX²sY²)   (equivalently, r²XY, the squared correlation);

or

R² = 1 − SSE/Σ(Yi − Ȳ)²   (see p. 18 above)
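Both forms of R² can be sketched in code. The function below illustrates the covariance form on toy data, and the ANOVA-table numbers from Example 17.2 illustrate the 1 − SSE/SST form:

```python
def r_squared(xs, ys):
    """R^2 = cov(X,Y)^2 / (s_X^2 * s_Y^2), the squared sample correlation."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s2x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    s2y = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    return cov ** 2 / (s2x * s2y)

# The equivalent form R^2 = 1 - SSE/SST, using the ANOVA table of Example 17.2:
r2 = 1 - 9_005_450 / 25_739_561
print(round(r2, 4))  # 0.6501, matching R Square in the regression output
```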


– Using the computer
From the regression output we have R Square = 0.6501: 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS         MS         F        Significance F
Regression  1    16734111   16734111   182.11   0.0000
Residual    98   9005450    91892
Total       99   25739561

            Coefficients   Standard Error   t Stat   P-value
Intercept   17067          169              100.97   0.0000
Odometer    -0.0623        0.0046           -13.49   0.0000
