1. Introduction to regression analysis
Regression analysis
- Describe a relationship between two variables in
mathematical terms.
- Predict the value of a dependent variable based on
the value of at least one independent variable
- Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Names for y and the xs in a regression model
Names for y            Names for the xs
Dependent variable     Independent variables
Regressand             Regressors
Effect variable        Causal variables
Explained variable     Explanatory variables
Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is described
by a linear function
Changes in y are assumed to be caused by
changes in x
Types of Regression Models
[Figure: four scatter-plot panels: positive linear relationship, negative linear relationship, non-linear relationship, no relationship]
Population Linear Regression
The population regression model:
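$$y = \beta_0 + \beta_1 x + \varepsilon$$
where
y is the dependent variable
β0 is the population y intercept
β1 is the population slope coefficient
x is the independent variable
ε is the random error term, or residual
β0 + β1x is the linear component and ε is the random error component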
Linear Regression Assumptions
Error values (ε) are statistically independent
Error values are normally distributed for any given
value of x
The probability distribution of the errors has
constant variance
The underlying relationship between the x variable
and the y variable is linear
Population Linear Regression
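$$y = \beta_0 + \beta_1 x + \varepsilon$$
[Figure: the population regression line with intercept β0 and slope β1. For a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line.]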
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line
$$\hat{y}_i = b_0 + b_1 x_i$$
where
ŷi is the estimated (or predicted) y value
b0 is the estimate of the regression intercept
b1 is the estimate of the regression slope
xi is the independent variable
The individual random error terms ei have a mean of zero
Least Squares Criterion
b0 and b1 are obtained by finding the values of
b0 and b1 that minimize the sum of the squared
residuals
$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$
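A short derivation may help show where the estimates come from. This is the standard calculus argument, sketched here as a supplement rather than taken from the slides; S is a name introduced only for this sketch. Setting both partial derivatives of the sum of squared residuals to zero gives two normal equations:

```latex
\begin{aligned}
S(b_0, b_1) &= \sum \bigl(y - b_0 - b_1 x\bigr)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum (y - b_0 - b_1 x) = 0
  \;\Longrightarrow\; \sum y = n b_0 + b_1 \sum x \\
\frac{\partial S}{\partial b_1} &= -2 \sum x\,(y - b_0 - b_1 x) = 0
  \;\Longrightarrow\; \sum xy = b_0 \sum x + b_1 \sum x^2
\end{aligned}
```

Dividing the first equation by n gives b0 in terms of the sample means; substituting that into the second equation and solving gives b1.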
The Least Squares Equation
The formulas for b1 and b0 are:
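$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
or, written with sample means,
$$b_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$$
and
$$b_0 = \bar{y} - b_1 \bar{x}$$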
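The two forms of the slope formula can be sketched in a few lines of Python. The function names and the five data points are made up for illustration (the points lie exactly on y = 1 + 2x), so both functions should recover an intercept of 1 and a slope of 2.

```python
def fit_simple(xs, ys):
    """Least-squares estimates (b0, b1) from the sum-based formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


def fit_simple_means(xs, ys):
    """The same estimates from the mean-based form of the formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical points that lie exactly on y = 1 + 2x
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
print(fit_simple(xs, ys))        # (1.0, 2.0)
print(fit_simple_means(xs, ys))  # (1.0, 2.0)
```

Both versions assume the x values are not all identical; otherwise the denominator is zero.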
Interpretation of the
Slope and the Intercept
b0 is the estimated average value of y when the
value of x is zero
b1 is the estimated change in the average value
of y as a result of a one-unit change in x
Example
A real estate agent wishes to examine the
relationship between the selling price of a home and
its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y): house price in $1000s
Independent variable (x): square feet
Sample Data for House Price Model
House Price in $1000s (y)    Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
        y      x       xy        x²         y²
        245    1400    343000    1960000    60025
        312    1600    499200    2560000    97344
        279    1700    474300    2890000    77841
        308    1875    577500    3515625    94864
        199    1100    218900    1210000    39601
        219    1550    339450    2402500    47961
        405    2350    951750    5522500    164025
        324    2450    793800    6002500    104976
        319    1425    454575    2030625    101761
        255    1700    433500    2890000    65025
Sum     2865   17150   5085975   30983750   853423
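Estimate b0 and b1
$$b_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{508597.5 - 1715 \times 286.5}{3098375 - 1715^2} = \frac{17250}{157150} \approx 0.10977$$
$$b_0 = \bar{y} - b_1 \bar{x} = 286.5 - \frac{17250}{157150} \times 1715 \approx 98.2483$$
The intercept is computed with the unrounded slope. The regression equation, with the slope written to four decimals, is:
$$\hat{y} = 98.2483 + 0.1097x$$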
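The arithmetic above can be reproduced from the ten sample houses. This is a minimal sketch using the mean-based formula; the variable names are chosen for illustration.

```python
# Sample data for the 10 houses: price in $1000s (y) and size in square feet (x)
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
mean_x = sum(sizes) / n                                  # 1715.0
mean_y = sum(prices) / n                                 # 286.5
mean_xy = sum(x * y for x, y in zip(sizes, prices)) / n  # 508597.5
mean_x2 = sum(x * x for x in sizes) / n                  # 3098375.0

b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
b0 = mean_y - b1 * mean_x

print(f"b1 = {b1:.5f}")  # b1 = 0.10977
print(f"b0 = {b0:.4f}")  # b0 = 98.2483
```

Keeping b1 unrounded until the end is what yields b0 = 98.2483; rounding the slope to 0.1097 first would shift the intercept slightly.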
Graphical Presentation
House price model: scatter plot and regression line
[Figure: scatter plot of house price ($1000s) against square feet with the fitted line ŷ = 98.2483 + 0.1097x; slope = 0.10977, intercept = 98.248]
Interpretation of the
Intercept, b0
$$\hat{y} = 98.2483 + 0.1097x$$
b0 is the estimated average value of y when the value of x is zero
No house in the sample has a size near zero square feet, so b0 should not be read literally as the price of such a house
Here b0 = 98.2483 reflects the portion of the house price not explained by square feet, that is, the portion due to factors other than square feet
Interpretation of the
Slope Coefficient, b1
$$\hat{y} = 98.2483 + 0.1097x$$
b1 measures the estimated change in the average value of y as a result of a one-unit change in x
Here, b1 = 0.10977 indicates that the average house price increases by 0.10977 × $1000 = $109.77 for each additional square foot of size
Coefficient of Determination R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R²
$$R^2 = \frac{RSS}{TSS}, \quad \text{where } 0 \le R^2 \le 1$$
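Coefficient of Determination R²
$$R^2 = \frac{RSS}{TSS} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$
Note that RSS here denotes the regression (explained) sum of squares, Σ(ŷ − ȳ)², not the residual sum of squares, and TSS = Σ(y − ȳ)²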
Examples of Approximate R² Values
[Figure: scatter plots illustrating approximate R² values]
R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x
0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x
R² = 0: no linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x)
        y      x       ŷ          (ŷ − ȳ)²    (y − ȳ)²
        245    1400    251.8283   1202.127    1722.25
        312    1600    273.7683   162.0962    650.25
        279    1700    284.7383   3.103587    56.25
        308    1875    303.9358   304.0071    462.25
        199    1100    218.9183   4567.286    7656.25
        219    1550    268.2833   331.8482    4556.25
        405    2350    356.0433   4836.271    14042.25
        324    2450    367.0133   6482.391    1406.25
        319    1425    254.5708   1019.474    1056.25
        255    1700    284.7383   3.103587    992.25
Sum     2865   17150   2863.838   18911.71    32600.5
Coefficient of determination
$$R^2 = \frac{RSS}{TSS} = \frac{18911.71}{32600.5} = 0.58$$
Only 58% of the variation in house price is explained by variation in square feet
Put another way, 42% of the variation in house price is due to factors other than square feet
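The R² calculation can be checked against the raw data. The sketch below refits the line and keeps the slope unrounded, so its RSS differs slightly from the tabulated 18911.71 (which was computed with the slope written as 0.1097); R² still rounds to 0.58. Variable names are illustrative.

```python
# House data: price in $1000s (y) and size in square feet (x)
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
mean_xy = sum(x * y for x, y in zip(sizes, prices)) / n
mean_x2 = sum(x * x for x in sizes) / n

# Least-squares estimates, kept at full precision
b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
b0 = mean_y - b1 * mean_x

predicted = [b0 + b1 * x for x in sizes]
rss = sum((y_hat - mean_y) ** 2 for y_hat in predicted)  # explained by the regression
tss = sum((y - mean_y) ** 2 for y in prices)             # total variation in y
r_squared = rss / tss

print(f"TSS = {tss:.1f}")        # TSS = 32600.5
print(f"R^2 = {r_squared:.2f}")  # R^2 = 0.58
```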
2. Correlation analysis
Correlation is a technique used to measure the
strength of the relationship between two variables.
The stronger the correlation, the better the regression line fits the data, and vice versa.
Scatter Plot Examples
[Figure: scatter plots showing a high degree of correlation, a low degree of correlation, and no relationship]
The correlation coefficient (r)
The correlation coefficient is used to
measure the strength of the linear
relationship between two variables
The product moment correlation
coefficient is calculated using the
formula:
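$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}$$
$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}, \quad \text{where } \sigma_x = \sqrt{\overline{x^2} - \bar{x}^2} \text{ and } \sigma_y = \sqrt{\overline{y^2} - \bar{y}^2}$$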
Note
In the single independent variable case, the coefficient of determination is R² = r²
where
r : simple correlation coefficient
Features of r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative linear
relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear relationship
Examples of Approximate r Values
[Figure: five scatter plots with r = -1, r = -0.6, r = 0, r = +0.3 and r = +1]
Example calculation
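$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\left(\overline{x^2} - \bar{x}^2\right)\left(\overline{y^2} - \bar{y}^2\right)}}$$
$$r = \frac{508597.5 - 1715 \times 286.5}{\sqrt{(3098375 - 1715^2)(85342.3 - 286.5^2)}} = 0.762$$
The result shows a fairly strong positive linear correlation between house price and square feet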
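As a check on this calculation, the sketch below computes r for the house data with the mean-based formula and confirms that r² agrees with the coefficient of determination of about 0.58. Variable names are illustrative.

```python
import math

# House data: price in $1000s (y) and size in square feet (x)
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
mean_xy = sum(x * y for x, y in zip(sizes, prices)) / n
mean_x2 = sum(x * x for x in sizes) / n
mean_y2 = sum(y * y for y in prices) / n  # 85342.3

numerator = mean_xy - mean_x * mean_y
denominator = math.sqrt((mean_x2 - mean_x ** 2) * (mean_y2 - mean_y ** 2))
r = numerator / denominator

print(f"r   = {r:.3f}")      # r   = 0.762
print(f"r^2 = {r * r:.2f}")  # r^2 = 0.58
```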
Example
The data below relates the working experience (years) to the productivity of 10 workers in a small firm
Working experience (years)    Productivity (items/h)
1     2
3     8
4     9
5     15
6     15
7     20
9     23
12    25
14    22
15    36
Example calculation
$$\sum x = 76, \quad \sum x^2 = 782$$
$$\sum y = 175, \quad \sum y^2 = 3933$$
$$\sum xy = 1722$$
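Estimate b0 and b1
$$b_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{172.2 - 7.6 \times 17.5}{78.2 - 7.6^2} = 1.918$$
$$b_0 = \bar{y} - b_1 \bar{x} = 17.5 - 1.918 \times 7.6 = 2.923$$
Linear regression equation
$$\hat{y} = 2.923 + 1.918x$$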
Interpretation of b0 and b1?
Coefficient of determination and correlation
coefficient
$$RSS = \sum (\hat{y} - \bar{y})^2 = 751.9312$$
$$TSS = \sum (y - \bar{y})^2 = 870.5$$
$$R^2 = \frac{RSS}{TSS} = \frac{751.9312}{870.5} = 0.8637$$
Coefficient of determination and correlation
coefficient
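$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\left(\overline{x^2} - \bar{x}^2\right)\left(\overline{y^2} - \bar{y}^2\right)}} = \frac{172.2 - 7.6 \times 17.5}{\sqrt{(78.2 - 7.6^2)(393.3 - 17.5^2)}} = 0.9293$$
or, taking the positive root because b1 > 0,
$$r = \sqrt{R^2} = \sqrt{0.8637} = 0.9293$$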
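The whole worked example can be reproduced from the ten observations. This is a sketch with illustrative variable names; it keeps full precision throughout, so the last digit of b0 and R² can differ slightly from the hand calculation, which rounds b1 to 1.918 before continuing.

```python
import math

experience = [1, 3, 4, 5, 6, 7, 9, 12, 14, 15]        # years (x)
productivity = [2, 8, 9, 15, 15, 20, 23, 25, 22, 36]  # items/h (y)

n = len(experience)
mean_x = sum(experience) / n    # 7.6
mean_y = sum(productivity) / n  # 17.5
mean_xy = sum(x * y for x, y in zip(experience, productivity)) / n
mean_x2 = sum(x * x for x in experience) / n
mean_y2 = sum(y * y for y in productivity) / n

cov_xy = mean_xy - mean_x * mean_y
var_x = mean_x2 - mean_x ** 2
var_y = mean_y2 - mean_y ** 2

b1 = cov_xy / var_x
b0 = mean_y - b1 * mean_x
r = cov_xy / math.sqrt(var_x * var_y)

print(f"b1  = {b1:.3f}")     # about 1.918
print(f"b0  = {b0:.3f}")     # close to the hand-calculated 2.923
print(f"r   = {r:.4f}")      # about 0.9293
print(f"R^2 = {r * r:.4f}")  # close to the hand-calculated 0.8637
```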
The Multiple Regression Model
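Idea: Examine the linear relationship between 1 dependent variable (y) and 2 or more independent variables (xi)
Population model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where β0 is the y-intercept, β1, ..., βk are the population slopes, and ε is the random error
Estimated multiple regression model:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, ..., bk are the estimated slope coefficients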
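Estimates b0, b1, b2, ..., bk
The estimates solve the following system of equations:
$$\begin{aligned}
\sum y &= n b_0 + b_1 \sum x_1 + b_2 \sum x_2 + \cdots + b_k \sum x_k \\
\sum x_1 y &= b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 + \cdots + b_k \sum x_1 x_k \\
\sum x_2 y &= b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 + \cdots + b_k \sum x_2 x_k \\
&\;\;\vdots \\
\sum x_k y &= b_0 \sum x_k + b_1 \sum x_1 x_k + b_2 \sum x_2 x_k + \cdots + b_k \sum x_k^2
\end{aligned}$$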
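A system of this shape can be built and solved mechanically. The sketch below is one possible implementation, not the method prescribed by the slides: it forms the sums of cross-products and solves the system by Gaussian elimination. The function names and the small data set (generated exactly from y = 1 + 2x1 + 3x2) are made up for illustration, and the code assumes the system has a unique solution.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk] by solving the normal equations.

    rows: list of observations, each [x1, ..., xk]; ys: list of y values.
    """
    # Prepend a constant 1 so that b0 is handled like any other coefficient
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    # Entry (i, j) is sum(x_i * x_j); the right-hand side is sum(x_i * y)
    lhs = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    rhs = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(lhs, rhs)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple(rows, ys)
print([round(b, 6) for b in coefficients])  # [1.0, 2.0, 3.0]
```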
Interpretation of Estimated Coefficients
Slope (bi)
Estimates that the average value of y changes by bi units for each 1-unit increase in xi, holding all other variables unchanged
Intercept (b0)
The estimated average value of y when all xi = 0
Multiple Regression Model
Two variable model
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
[Figure: regression plane over the x1 and x2 axes, showing the slope for variable x1 and the slope for variable x2. For a sample observation yi at (x1i, x2i), the residual is e = (y − ŷ).]
The best fit equation, ŷ, is found by minimizing the sum of squared errors, Σe²
Multiple Regression Assumptions
Errors (residuals) from the regression model: e = (y − ŷ)
The errors are normally distributed
The mean of the errors is zero
Errors have a constant variance
The model errors are independent
Example
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
Data are collected for 15 weeks
Week    Pie Sales    Price ($)    Advertising ($100s)
1 350 5.50 3.3
2 460 7.50 3.3
3 350 8.00 3.0
4 430 8.00 4.5
5 350 6.80 3.0
6 380 7.50 4.0
7 430 4.50 3.0
8 470 6.40 3.7
9 450 7.00 3.5
10 490 5.00 4.0
11 340 7.20 3.5
12 300 7.90 3.2
13 440 5.90 4.0
14 450 5.00 3.5
15 300 7.00 2.7
Example
Dependent variable (y): Pie sales
Independent variable 1 (x1): Price ($)
Independent variable 2 (x2): Advertising ($100s)
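Estimated (Predicted) regression equation:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
Estimates b0, b1, b2
$$\begin{aligned}
\sum y &= n b_0 + b_1 \sum x_1 + b_2 \sum x_2 \\
\sum x_1 y &= b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \\
\sum x_2 y &= b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2
\end{aligned}$$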
Example calculation
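$$\sum y = 5990, \quad \sum x_1 = 99.2, \quad \sum x_2 = 52.2$$
$$\sum x_1^2 = 675.26, \quad \sum x_2^2 = 185, \quad \sum x_1 x_2 = 345.46$$
$$\sum x_1 y = 39152, \quad \sum x_2 y = 21087, \quad \sum y^2 = 2448500$$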
Example calculation
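$$\begin{aligned}
5990 &= 15 b_0 + 99.2 b_1 + 52.2 b_2 \\
39152 &= 99.2 b_0 + 675.26 b_1 + 345.46 b_2 \\
21087 &= 52.2 b_0 + 345.46 b_1 + 185 b_2
\end{aligned}$$
Solving the system gives:
$$b_0 = 306.526, \quad b_1 = -24.975, \quad b_2 = 74.131$$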
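The three equations can be solved numerically as a check. The sketch below uses Cramer's rule because it is short for a 3×3 system; that is a choice made for this illustration, not necessarily how the slide values were obtained, and `det3` is a helper written here.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Coefficient matrix and right-hand side of the three normal equations
A = [
    [15.0, 99.2, 52.2],
    [99.2, 675.26, 345.46],
    [52.2, 345.46, 185.0],
]
rhs = [5990.0, 39152.0, 21087.0]

det_a = det3(A)
coeffs = []
for col in range(3):
    # Cramer's rule: replace one column of A with the right-hand side
    replaced = [row[:] for row in A]
    for r in range(3):
        replaced[r][col] = rhs[r]
    coeffs.append(det3(replaced) / det_a)

b0, b1, b2 = coeffs
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
# b0 = 306.526, b1 = -24.975, b2 = 74.131
```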
Example calculation
Estimated (Predicted) regression equation:
$$\hat{y} = 306.526 - 24.975 x_1 + 74.131 x_2$$
Interpretation b0, b1, b2?
The Multiple Regression Equation
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
where
Sales is in number of pies per week
Price is in $
Advertising is in $100s
b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price
Using The Model to Make Predictions
Predict sales for a week in which the selling
price is $5.50 and advertising is $350:
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
= 306.526 - 24.975(5.50) + 74.131(3.5)
= 428.62
Advertising enters the equation in $100s, so $350 is entered as 3.5
Predicted sales is 428.62 pies
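The prediction step, including the unit conversion for advertising, can be written as a small function. The function name and its choice to take advertising in plain dollars are assumptions made for this sketch; the coefficients are the rounded estimates from the fitted equation.

```python
def predict_sales(price, advertising_dollars):
    """Predicted weekly pie sales from the fitted equation.

    price is in dollars; advertising_dollars is converted to $100s
    because that is the unit the model was fitted in.
    """
    advertising_hundreds = advertising_dollars / 100
    return 306.526 - 24.975 * price + 74.131 * advertising_hundreds


print(round(predict_sales(5.50, 350), 2))  # 428.62
```

Passing 350 instead of 3.5 directly into the equation would overstate the advertising effect by a factor of 100, which is why the conversion is kept inside the function.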
Multiple Coefficient of Determination
Reports the proportion of total variation in y
explained by all x variables taken together
$$R^2 = \frac{RSS}{TSS} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
Multiple correlation (R)
Multiple correlation provides a measure of the overall
strength of the relationship between dependent
variable and independent variables.
It is defined as the positive square root of the coefficient of determination
$$R = \sqrt{R^2}$$
Example calculation
$$RSS = \sum (\hat{y}_i - \bar{y})^2 = 29459.96$$
$$TSS = \sum (y_i - \bar{y})^2 = 56493.33$$
$$R^2 = \frac{29459.96}{56493.33} = 0.521$$
Indication?
$$R = \sqrt{R^2} = 0.722$$
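These figures can be checked from the 15 weeks of raw data. The sketch below fits the two-variable model in deviation form (an equivalent rearrangement of the three normal equations with b0 eliminated, used here only to keep the code short) and then computes RSS, TSS, R² and R. Because it keeps full precision, its RSS may differ from 29459.96 in the first decimal. All names are illustrative.

```python
import math

sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
               3.5, 3.2, 4.0, 3.5, 2.7]

n = len(sales)
mean_y = sum(sales) / n
mean_1 = sum(price) / n
mean_2 = sum(advertising) / n


def centered_sum(u, mean_u, v, mean_v):
    """Sum of products of deviations from the means."""
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


s11 = centered_sum(price, mean_1, price, mean_1)
s22 = centered_sum(advertising, mean_2, advertising, mean_2)
s12 = centered_sum(price, mean_1, advertising, mean_2)
s1y = centered_sum(price, mean_1, sales, mean_y)
s2y = centered_sum(advertising, mean_2, sales, mean_y)

# Solve the 2x2 system for the slopes, then recover the intercept
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean_y - b1 * mean_1 - b2 * mean_2

predicted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(price, advertising)]
rss = sum((y_hat - mean_y) ** 2 for y_hat in predicted)  # regression sum of squares
tss = sum((y - mean_y) ** 2 for y in sales)              # total sum of squares
r_squared = rss / tss

print(f"R^2 = {r_squared:.3f}")             # R^2 = 0.521
print(f"R   = {math.sqrt(r_squared):.3f}")  # R   = 0.722
```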
Correlation matrix
Provides measures of the strength of the relationship between the dependent variable and each independent variable, and between pairs of independent variables
        y        x1        x2
y       1
x1      rx1y     1
x2      rx2y     rx1x2     1
Example calculation
               Pie Sales    Price      Advertising
Pie Sales      1
Price          -0.44327     1
Advertising    0.55632      0.03044    1
Price vs. Sales : r = -0.44327
There is a negative association between
price and sales
Advertising vs. Sales : r = 0.55632
There is a positive association between
advertising and sales
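The entries of this matrix can be recomputed from the weekly data. The sketch below uses the deviation form of the product-moment formula; `pearson_r` is a helper name chosen for this illustration.

```python
import math

sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
               3.5, 3.2, 4.0, 3.5, 2.7]


def pearson_r(u, v):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


print(f"Price vs. Sales:       {pearson_r(price, sales):.5f}")        # -0.44327
print(f"Advertising vs. Sales: {pearson_r(advertising, sales):.5f}")  # 0.55632
print(f"Price vs. Advertising: {pearson_r(price, advertising):.5f}")  # 0.03044
```

The near-zero correlation between price and advertising suggests the two independent variables carry largely separate information in this sample.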
Exercise 1
The table below shows data on a bank's interest rate and the amount of money granted during 2005-2011:
Year    Interest (%)    Amount of money granted (billion VND)
1       9.05            20.1
2       10.1            20.9
3       12.5            19.8
4       14.2            18.3
5       12              17.9
6       11.1            19.4
7       10.2            21.6
- Display the data in a scatter plot
- Do the bank's interest rate and the amount of money granted have any relationship?
- If yes, use the regression model to present their relationship
Exercise 2
A motion picture industry analyst wants to estimate the
gross earnings generated by a movie. The estimate will
be based on different variables involved in the film
production. The independent variables considered are
X1 = production cost of the movie (million USD) and X2
= total cost of all promotion activities (million USD).
The analyst obtains information on a random sample of
10 Hollywood movies made within the last 5 years. The
variable Y is gross earnings, in millions of dollars. The
data are given in the following table. Use multiple
regression to describe the relationship among these
variables.
Gross earnings (m. USD)    Production cost (m. USD)    Promotion cost (m. USD)
72 12 5
76 11 8
78 15 6
70 10 5
68 11 3
80 16 9
82 14 12
65 8 4
62 8 3
90 18 10
Exercise 3
The data below relates cost of advertising (VND m) to number of products sold (1000 units).
Prepare the regression model and make a conclusion about this relation.
Cost of advertisement (m. VND)    No. of products sold (1000 units)
1     2
3     8
4     9
5     15
6     15
7     20
9     23
12    25
14    22
15    36
Exercise 4
A CEO is considering whether the company can use the unemployment rate to
predict the number of products sold. Write down the regression model describing the
relationship between these variables and draw a conclusion
based on your results.
Period                                  1     2     3     4     5     6     7     8     9     10
Unemployment rate (%)                   1.3   2.0   1.7   1.5   1.6   1.2   1.6   1.4   1.0   1.1
Number of products sold (1000 units)    10    6     5     12    10    15    5     12    17    20