Linear regression
• Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, . . . , Xp is linear.

η(x) = β0 + β1x1 + β2x2 + · · · + βpxp is almost always thought of as an approximation to the truth; functions in nature are rarely linear.
• True regression functions are never linear!
[Figure: plot of a nonlinear true regression function f(X) together with a linear approximation.]
• Although it may seem overly simplistic, linear regression is
extremely useful both conceptually and practically.
Linear regression for the advertising data
Consider the advertising data shown on the next slide.
Questions we might ask:
• Is there a relationship between advertising budget and
sales?
• How strong is the relationship between advertising budget
and sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?
Advertising data
[Figure: scatterplots of Sales versus TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Simple linear regression using a single predictor X.
• We assume a model
Y = β0 + β1X + ε,
where β0 and β1 are two unknown constants that represent
the intercept and slope, also known as coefficients or
parameters, and ε is the error term.
• Given some estimates β̂0 and β̂1 for the model coefficients,
we predict future sales using
ŷ = β̂0 + β̂1 x,
where ŷ indicates a prediction of Y on the basis of X = x.
The hat symbol denotes an estimated value.
Estimation of the parameters by least squares
• Let ŷi = β̂0 + β̂1 xi be the prediction for Y based on the ith
value of X. Then ei = yi − ŷi represents the ith residual.
• We define the residual sum of squares (RSS) as
RSS = e1² + e2² + · · · + en²,
or equivalently as
RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + · · · + (yn − β̂0 − β̂1xn)².
• The least squares approach chooses β̂0 and β̂1 to minimize
the RSS. The minimizing values can be shown to be
β̂1 = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / Σᵢ₌₁ⁿ (xi − x̄)²,
β̂0 = ȳ − β̂1 x̄,

where ȳ ≡ (1/n) Σᵢ₌₁ⁿ yi and x̄ ≡ (1/n) Σᵢ₌₁ⁿ xi are the sample means.
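These estimates are easy to compute directly. Below is a minimal NumPy sketch of the two formulas; the simulated x and y are hypothetical stand-ins for a predictor and response, not the Advertising data.

```python
import numpy as np

# Hypothetical data: x plays the role of a single predictor, y of the response.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0.0, 3.0, size=200)

# Least squares estimates from the formulas above.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x  # fitted values
```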
Example: advertising data
[Figure 3.1: for the Advertising data, the least squares fit for the regression of sales onto TV. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares.]

The least squares fit for the regression of sales onto TV gives β̂0 = 7.03 and β̂1 = 0.0475. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
Assessing the Accuracy of the Coefficient Estimates
• The standard error of an estimator reflects how it varies
under repeated sampling. We have
SE(β̂1)² = σ² / Σᵢ₌₁ⁿ (xi − x̄)²,    SE(β̂0)² = σ² [ 1/n + x̄² / Σᵢ₌₁ⁿ (xi − x̄)² ],

where σ² = Var(ε).
• These standard errors can be used to compute confidence
intervals. A 95% confidence interval is defined as a range of
values such that with 95% probability, the range will
contain the true unknown value of the parameter. It has
the form
β̂1 ± 2 · SE(β̂1 ).
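As a sketch, continuing the simulated example above (with σ² replaced by its usual estimate RSS/(n − 2); the variable names carry over from the previous snippet):

```python
import numpy as np

n = len(x)
rss = np.sum((y - y_hat) ** 2)
sigma2_hat = rss / (n - 2)  # plug-in estimate of sigma^2 = Var(eps)

sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x.mean() ** 2 / sxx))

# Approximate 95% confidence interval for beta1.
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
```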
Confidence intervals — continued
That is, there is approximately a 95% chance that the interval
[ β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1) ]
will contain the true value of β1 (under a scenario where we got
repeated samples like the present sample)
For the advertising data, the 95% confidence interval for β1 is
[0.042, 0.053]
Hypothesis testing
• Standard errors can also be used to perform hypothesis
tests on the coefficients. The most common hypothesis test
involves testing the null hypothesis of
H0 : There is no relationship between X and Y
versus the alternative hypothesis
HA : There is some relationship between X and Y .
• Mathematically, this corresponds to testing
H0 : β1 = 0
versus
HA : β1 ≠ 0,
since if β1 = 0 then the model reduces to Y = β0 + , and
X is not associated with Y .
Hypothesis testing — continued
• To test the null hypothesis, we compute a t-statistic, given
by
t = (β̂1 − 0) / SE(β̂1),
• This will have a t-distribution with n − 2 degrees of
freedom, assuming β1 = 0.
• Using statistical software, it is easy to compute the
probability of observing any value equal to |t| or larger. We
call this probability the p-value.
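A sketch of this computation, continuing the snippets above (SciPy's t distribution supplies the tail probability):

```python
from scipy import stats

# t-statistic for H0: beta1 = 0, using the estimate and standard error above.
t_stat = (beta1_hat - 0.0) / se_beta1

# Two-sided p-value: P(|T| >= |t|) for T ~ t_{n-2} under the null.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```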
Results for the advertising data
            Coefficient  Std. Error  t-statistic   p-value
Intercept        7.0325      0.4578        15.36  < 0.0001
TV               0.0475      0.0027        17.67  < 0.0001
Assessing the Overall Accuracy of the Model
• We compute the Residual Standard Error
RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σᵢ₌₁ⁿ (yi − ŷi)² ),

where the residual sum-of-squares is RSS = Σᵢ₌₁ⁿ (yi − ŷi)².
• R-squared, or the fraction of variance explained, is

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σᵢ₌₁ⁿ (yi − ȳ)² is the total sum of squares.
• It can be shown that in this simple linear regression setting, R² = r², where r is the correlation between X and Y:

r = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / ( √(Σᵢ₌₁ⁿ (xi − x̄)²) √(Σᵢ₌₁ⁿ (yi − ȳ)²) ).
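Continuing the simulated example, a sketch of these quantities in NumPy; the final check confirms R² = r² numerically:

```python
import numpy as np

rse = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # residual standard error

tss = np.sum((y - y.mean()) ** 2)                  # total sum of squares
r2 = 1.0 - np.sum((y - y_hat) ** 2) / tss          # fraction of variance explained

# In simple linear regression, R^2 equals the squared correlation of X and Y.
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r2, r ** 2)
```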
Advertising data results
Quantity                   Value
Residual Standard Error     3.26
R²                         0.612
F-statistic                312.1
Multiple Linear Regression
• Here our model is
Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε,
• We interpret βj as the average effect on Y of a one unit
increase in Xj , holding all other predictors fixed. In the
advertising example, the model becomes
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.
Interpreting regression coefficients
• The ideal scenario is when the predictors are uncorrelated
— a balanced design:
- Each coefficient can be estimated and tested separately.
- Interpretations such as “a unit change in Xj is associated
with a βj change in Y , while all the other variables stay
fixed”, are possible.
• Correlations amongst predictors cause problems (see the simulation sketch after this list):
- The variance of all coefficients tends to increase, sometimes
dramatically
- Interpretations become hazardous — when Xj changes,
everything else changes.
• Claims of causality should be avoided for observational
data.
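To see the variance inflation concretely, here is a small simulation sketch (all names and settings are hypothetical): two predictors with fixed true coefficients, where the spread of β̂1 over repeated samples grows as the correlation ρ between the predictors increases.

```python
import numpy as np

rng = np.random.default_rng(1)

def sd_of_beta1_hat(rho, n=100, reps=2000):
    """Empirical std. dev. of beta1_hat when corr(X1, X2) = rho."""
    slopes = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])     # design matrix
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(beta_hat[1])
    return np.std(slopes)

print(sd_of_beta1_hat(0.0))  # uncorrelated predictors
print(sd_of_beta1_hat(0.9))  # highly correlated: noticeably larger
```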
The woes of (interpreting) regression coefficients
“Data Analysis and Regression” Mosteller and Tukey 1977
• a regression coefficient βj estimates the expected change in
Y per unit change in Xj , with all other predictors held
fixed. But predictors usually change together!
• Example: Y = total amount of change in your pocket;
X1 = # of coins; X2 = # of pennies, nickels and dimes. By
itself, the regression coefficient of Y on X2 will be > 0. But
how about with X1 in the model?
• Y = number of tackles by a football player in a season; W
and H are his weight and height. Fitted regression model
is Ŷ = b0 + .50W − .10H. How do we interpret β̂2 < 0?
Two quotes by famous Statisticians
“Essentially, all models are wrong, but some are useful”
George Box
“The only way to find out what will happen when a complex
system is disturbed is to disturb the system, not merely to
observe it passively”
Fred Mosteller and John Tukey, paraphrasing George Box
Estimation and Prediction for Multiple Regression
• Given estimates β̂0 , β̂1 , . . . β̂p , we can make predictions
using the formula
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp .
• We estimate β0 , β1 , . . . , βp as the values that minimize the
sum of squared residuals
RSS = Σᵢ₌₁ⁿ (yi − ŷi)²
    = Σᵢ₌₁ⁿ (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − · · · − β̂p xip)².
This is done using standard statistical software. The values
β̂0 , β̂1 , . . . , β̂p that minimize RSS are the multiple least
squares regression coefficient estimates.
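As a sketch of this computation via a design matrix (the simulated predictors here are hypothetical; in practice one would use statistical software such as R's lm()):

```python
import numpy as np

# Hypothetical data with p = 3 predictors.
rng = np.random.default_rng(2)
n = 200
X1, X2, X3 = rng.normal(size=(3, n))
y = 3.0 + 0.05 * X1 + 0.2 * X2 + 0.0 * X3 + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), X1, X2, X3])

# Multiple least squares: minimizes RSS over all coefficients at once.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
rss = np.sum((y - y_hat) ** 2)
```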
[Figure: the least squares regression fit with two predictors X1 and X2 (a plane).]
Results for advertising data
            Coefficient  Std. Error  t-statistic   p-value
Intercept         2.939      0.3119         9.42  < 0.0001
TV                0.046      0.0014        32.81  < 0.0001
radio             0.189      0.0086        21.89  < 0.0001
newspaper        -0.001      0.0059        -0.18    0.8599
Correlations:
               TV   radio  newspaper   sales
TV         1.0000  0.0548     0.0567  0.7822
radio              1.0000     0.3541  0.5762
newspaper                     1.0000  0.2283
sales                                 1.0000