Multiple Linear Regression
The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable
The General Idea
Multiple regression simultaneously considers the
influence of multiple explanatory variables on a
response variable Y
The intent is to look at the independent effect of each variable while "adjusting out" the influence of potential confounders
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
Regression Modeling
A simple regression model (one independent variable) fits a regression line in 2-dimensional space
A multiple regression
model with two
explanatory variables fits
a regression plane in 3-
dimensional space
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
β0 is the intercept; β1, ..., βp are the partial regression coefficients (slopes); ε is the residual.
Partial regression coefficient of X: the regression coefficient of X after controlling for (holding all other predictors constant) the influence of the other variables on both X and Y.
Multiple Regression Model
Intercept α predicts
where the regression
plane crosses the Y
axis
Slope for variable X1
(β1) predicts the
change in Y per unit X1
holding X2 constant
The slope for variable
X2 (β2) predicts the
change in Y per unit X2
holding X1 constant
[Venn diagram: unique variance in Y explained by X1, unique variance explained by X2, common variance explained by X1 and X2, and variance of Y NOT explained by X1 and X2]
Simple Regression Model
Regression coefficients are estimated by minimizing ∑residuals² (i.e., the sum of the squared residuals) to derive this model:
ŷ = b0 + b1x
The standard error of the regression (sY|x) is based on the squared residuals:
sY|x = √( ∑residuals² / (n − 2) )
Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimizing ∑residuals² to derive this multiple regression model:
ŷ = b0 + b1x1 + b2x2 + ... + bkxk
Again, the standard error of the regression is based on the ∑residuals²:
sY|x = √( ∑residuals² / (n − k − 1) )
Polynomial Model
ŷ = b0 + b1x + b2x² + ... + brx^r
Linear in the parameters, so it is still a linear model
A special case of multiple linear regression, obtained by setting x1 = x, x2 = x², ..., xr = x^r
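As a rough illustration (in Python, with made-up data), a polynomial model can be fit by building the powers of x as separate columns and running ordinary multiple linear regression on them:

```python
import numpy as np

# Made-up data following a rough quadratic trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 5.3, 10.4, 17.0, 26.2])

r = 2  # degree of the polynomial
# Treat x, x^2, ..., x^r as separate predictors x1, ..., xr.
X = np.column_stack([x**j for j in range(r + 1)])  # columns: 1, x, x^2

# Ordinary least squares fit of the linear-in-parameters model.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, ..., br =", b)
```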
Estimating coefficients
The matrix algebra of Ordinary Least Squares
Design matrix:
X = [ 1  x11  x21  …  xk1
      1  x12  x22  …  xk2
      ⋮   ⋮    ⋮         ⋮
      1  x1n  x2n  …  xkn ]
Intercept and slopes:  β = (X′X)⁻¹ X′Y
Predicted values:      Y′ = Xβ
Residuals:             Y − Y′
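A minimal numpy sketch of these matrix formulas, using small made-up data (x1, x2, y are hypothetical):

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix: a column of ones (intercept) plus one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# beta = (X'X)^-1 X'Y; solve() is more stable than forming the inverse explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ y)

y_pred = X @ beta        # predicted values  Y' = X beta
residuals = y - y_pred   # residuals         Y - Y'

print("coefficients:", beta)
print("residuals:", residuals)
```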
Example 12.3
Regression Statistics
How good is our model?
SST = ∑(Y − Ȳ)²
SSR = ∑(Y′ − Ȳ)²
SSE = ∑(Y − Y′)²
SST = SSR + SSE
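A short sketch (with hypothetical data) showing how SST, SSR and SSE are computed from a least squares fit, and that they satisfy SST = SSR + SSE:

```python
import numpy as np

# Hypothetical data and a least squares fit (intercept + one predictor).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b  # fitted values Y'

sst = np.sum((y - y.mean()) ** 2)      # SST: total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # SSR: regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # SSE: error (residual) sum of squares

print(sst, ssr + sse)          # SST = SSR + SSE for a least squares fit with intercept
print("R^2 =", ssr / sst)
```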
The Regression Picture
ŷi = βxi + α
[Figure: for each observation, A = distance of yi from the mean ȳ, B = distance of ŷi from ȳ, C = distance of yi from the regression line.]
Least squares estimation gave us the line (β) that minimized C²:
∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(ŷi − yi)²
A² = SStotal: total squared distance of observations from the naïve mean of y (total variation)
B² = SSreg: distance from the regression line to the naïve mean of y (variability due to x, the regression)
C² = SSresidual: variance around the regression line; additional variability not explained by x, which the least squares method aims to minimize
R² = SSreg / SStotal
ANOVA
H0: β1 = β2 = ... = βk = 0
HA: βi ≠ 0 for at least one i

Source       df          SS     MS        F          P-value
Regression   k           SSR    SSR/df    MSR/MSE    P(F)
Residual     n − k − 1   SSE    SSE/df
Total        n − 1       SST
If P(F) < α then we know that we get significantly better prediction of Y from the regression model than by just predicting the mean of Y.
ANOVA to test significance of regression
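A sketch of this F test in Python, assuming y are the observed responses, y_hat the fitted values from a least squares fit with an intercept, and k the number of predictors (scipy is used for the F tail probability):

```python
import numpy as np
from scipy import stats

def regression_anova(y, y_hat, k):
    """Overall F test of H0: beta_1 = ... = beta_k = 0 for a least squares
    fit with an intercept. y: observed responses, y_hat: fitted values,
    k: number of predictors."""
    n = len(y)
    ssr = np.sum((y_hat - np.mean(y)) ** 2)   # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)            # residual sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    f = msr / mse
    p_value = stats.f.sf(f, k, n - k - 1)     # P(F > f)
    return f, p_value

# Usage (with y and y_hat from a fitted model): f, p = regression_anova(y, y_hat, k=3)
```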
If we revisit Example 12.3 and carry out the ANOVA:
f = 30.98 and the p-value is less than 0.0001
How to interpret the result:
The regression is significant
This model is not the only model that can be used to explain the data
The model may have been more effective with inclusion or deletion of variables
Regression Statistics
R² = 1 − SSE/SST = SSR/SST
Coefficient of multiple determination, used to judge the adequacy of the regression model.
Drawback of this concept: one can always increase the value of the coefficient of determination by including more independent variables.
Regression Statistics
R²adj = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − k − 1)
n = sample size
k = number of independent variables
Adjusted R² is not biased!
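A one-line helper for the adjusted R² formula above (the numbers in the example call are made up):

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination for n observations and
    k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Example with made-up numbers: R^2 = 0.95, n = 13 observations, k = 3 predictors.
print(adjusted_r2(0.95, 13, 3))   # about 0.933
```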
Revisit example 12.3
Properties of the least squares estimator
Under the model assumption that the random errors ε1, ε2, ..., εn are iid, we have that b0, b1, ..., bk are unbiased estimators of the regression coefficients β0, β1, ..., βk.
The elements of the matrix (X′X)⁻¹σ² give the variances of the estimators on the main diagonal and the covariances on the off-diagonal:
σ²bi = Cii σ²
σbibj = cov(bi, bj) = Cij σ², for i ≠ j
where C = (X′X)⁻¹.
Regression Statistics
Standard Error for the regression model
Se² = σ̂² = SSE/(n − k − 1), where SSE = ∑(Y − Y′)²
Se² = MSE
Hypothesis Tests for Regression Coefficients
H0: βi = βi0
H1: βi ≠ βi0
t(n−k−1) = (bi − βi0) / se(bi) = (bi − βi0) / √(Cii Se²)
(for simple regression, se(b1) = √(Se² / Sxx))
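A sketch of this t test in Python, where c_ii is the i-th diagonal element of (X'X)⁻¹ and se2 is Se² (all argument names are illustrative):

```python
import numpy as np
from scipy import stats

def coefficient_t_test(b_i, beta_i0, c_ii, se2, df):
    """Two-sided t test of H0: beta_i = beta_i0.
    b_i: estimated coefficient, c_ii: i-th diagonal element of (X'X)^-1,
    se2: S_e^2 = SSE/(n-k-1), df: n - k - 1 degrees of freedom."""
    t = (b_i - beta_i0) / np.sqrt(c_ii * se2)
    p_value = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
    return t, p_value
```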
Considering the importance of X3 in Example 12.3:
H0: β3 = 0
H1: β3 ≠ 0
We test using the t-distribution with 9 degrees of freedom.
We cannot reject the null hypothesis: the variable is insignificant in the presence of the other regressors in the model.
Confidence Interval on Regression Coefficients
bi − tα/2,(n−k−1) √(Se² Cii) ≤ βi ≤ bi + tα/2,(n−k−1) √(Se² Cii)
Confidence interval for βi
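A short sketch of computing this interval (argument names are illustrative, as above):

```python
import numpy as np
from scipy import stats

def coefficient_ci(b_i, c_ii, se2, df, alpha=0.05):
    """100(1 - alpha)% confidence interval for beta_i.
    c_ii: i-th diagonal element of (X'X)^-1, se2: S_e^2, df: n - k - 1."""
    half_width = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(se2 * c_ii)
    return b_i - half_width, b_i + half_width
```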
Hypothesis Tests for Regression Coefficients: F test
Regression sum of squares if one variable, X1, is removed from the regression model:
SSR(β1 | β2, β3, ..., βk) = SSR(β1, β2, β3, ..., βk) − SSR(β2, β3, ..., βk)
H0: β1 = 0
H1: β1 ≠ 0    (Example 12.3)
F = SSR(β1 | β2, β3, ..., βk) / Se²
Compare it with Fα,1,n−k−1.
Hypothesis Tests for Regression Coefficients: F test
H0: β1 = β2 = 0
H1: β1 ≠ 0 or β2 ≠ 0
F = [SSR(β1, β2 | β3, ..., βk) / 2] / Se²
Comparing it with Fα,2,n−k−1.
Confidence Interval on the mean response
t-statistic with n − k − 1 degrees of freedom.
At x0 = (1, x10, x20, ..., xk0)′ the estimated mean response is ŷ0 = x0′β.
A 100(1 − α)% confidence interval for the mean response:
ŷ0 − tα/2,n−k−1 √(Se² x0′(X′X)⁻¹x0) ≤ μY|x0 ≤ ŷ0 + tα/2,n−k−1 √(Se² x0′(X′X)⁻¹x0)
Confidence Interval on an observed response
t-statistic with n − k − 1 degrees of freedom.
A 100(1 − α)% prediction interval for a future observation at x0:
ŷ0 − tα/2,n−k−1 √(Se² (1 + x0′(X′X)⁻¹x0)) ≤ y0 ≤ ŷ0 + tα/2,n−k−1 √(Se² (1 + x0′(X′X)⁻¹x0))
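A sketch combining both intervals, assuming X is the design matrix (with the leading column of ones), b the estimated coefficients, se2 = Se², and x0 the new predictor vector including the leading 1 (all names illustrative):

```python
import numpy as np
from scipy import stats

def response_intervals(x0, X, b, se2, alpha=0.05):
    """Confidence interval for the mean response and prediction interval for a
    new observation at x0. X: design matrix with leading column of ones,
    b: estimated coefficients, se2: S_e^2, x0: new row including the leading 1."""
    n, p = X.shape                          # p = k + 1 columns
    df = n - p                              # n - k - 1
    y0_hat = x0 @ b
    h = x0 @ np.linalg.solve(X.T @ X, x0)   # x0'(X'X)^-1 x0
    t = stats.t.ppf(1 - alpha / 2, df)
    mean_hw = t * np.sqrt(se2 * h)          # half-width for the mean response
    pred_hw = t * np.sqrt(se2 * (1 + h))    # half-width for a new observation
    return (y0_hat - mean_hw, y0_hat + mean_hw), (y0_hat - pred_hw, y0_hat + pred_hw)
```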
Orthogonality
In a designed experiment in which the variables Xp and Xq are orthogonal, the contribution of each individual variable in explaining the variance is readily given.
Qualitative variables
Qualitative variables provide information on discrete characteristics
The number of categories taken by a qualitative variable is generally small.
These can be numerical values, but each number denotes an attribute – a characteristic.
A qualitative variable may have several categories
Two categories: male – female
Three categories: nationality (French, German, Turkish)
More than three categories: sectors (car, chemical, steel, electronic equip., etc.)
Qualitative variables
There are several ways to code qualitative variables with n categories:
Using one categorical variable
Producing n − 1 dummy variables
A dummy variable is a variable that takes the value 0 or 1.
We also call them binary variables, or dichotomous variables.
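A small illustration of producing n − 1 dummy variables with pandas (the "sector" column is made-up data):

```python
import pandas as pd

# Made-up qualitative variable with three categories.
df = pd.DataFrame({"sector": ["car", "chemical", "steel", "car", "steel"]})

# n - 1 = 2 dummy (0/1) variables; the dropped category "car" is the reference level.
dummies = pd.get_dummies(df["sector"], drop_first=True)
print(dummies)
```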
Stepwise regression
Avoiding predictors (Xs) that do not contribute significantly to model prediction
- Forward selection
The ‘best’ predictor variables are entered, one by one.
- Backward elimination
The ‘worst’ predictor variables are eliminated, one by one.
Forward selection
Step 1: Do simple linear regressions of y vs. each x variable individually. Select the x variable with the largest value of R². (Suppose it is X1.)
Step 2: Do all possible 2-variable regressions in which one of the two variables is X1. Choose the variable that, when inserted, gives the largest increase in R². (Suppose it is X2.)
Step 3: Do all possible 3-variable regressions in which two of the three variables are X1 and X2. Choose the variable that gives the largest increase in R².
Repeat the process until the most recently inserted variable fails to induce a significant increase in the explained regression. Such an increase can be determined by using an appropriate F-test or t-test. A sketch of this procedure in code appears below.
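A rough sketch of forward selection in Python. It greedily adds the predictor with the largest R² increase and stops when the increase falls below a threshold; the threshold (min_increase) is a simplification standing in for the F-test or t-test mentioned above:

```python
import numpy as np

def r2(X, y):
    """R^2 of a least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ b
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, min_increase=0.01):
    """Greedily add the column of X giving the largest increase in R^2,
    stopping when the increase drops below min_increase (a simple threshold
    standing in for the F test / t test)."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        new_r2, best_j = max((r2(X[:, selected + [j]], y), j) for j in remaining)
        if new_r2 - best_r2 < min_increase:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_r2 = new_r2
    return selected, best_r2
```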
Why use logistic regression?
There are many important research topics for which the dependent variable is "limited."
For example: voting, marketing, and participation data are not continuous or normally distributed.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).
The Linear Probability Model
In the ordinary least squares regression:
Y = α + βX + ε, where Y = (0, 1)
ε is not normally distributed because Y takes on only two values
The predicted probabilities can be greater
than 1 or less than 0
The Logistic Regression Model
The "logit" model solves these problems:
ln[p/(1-p)] = α + βX + e
p is the probability that the event Y occurs,
p(Y=1)
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
More:
The logistic distribution constrains the
estimated probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(−α − βX)]
If you let α + βX = 0, then p = 0.50
As α + βX gets really big, p approaches 1
As α + βX gets really small, p approaches 0
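A tiny numerical check of these statements (the alpha and beta values are arbitrary):

```python
import numpy as np

def logistic_p(alpha, beta, x):
    """Estimated probability p = 1 / (1 + exp(-(alpha + beta*x)))."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

print(logistic_p(0.0, 1.0, 0.0))     # alpha + beta*x = 0            -> p = 0.5
print(logistic_p(0.0, 1.0, 10.0))    # large alpha + beta*x          -> p near 1
print(logistic_p(0.0, 1.0, -10.0))   # very negative alpha + beta*x  -> p near 0
```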
What if β = 0 or infinity?
Maximum Likelihood Estimation
(MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample:
L = Prob(p1 * p2 * ... * pn)
The higher the L, the higher the probability of observing the ps in the sample.
MLE involves finding the coefficients (α, β) that make the log of the likelihood function (LL < 0) as large as possible
Or, it finds the coefficients that make −2 times the log of the likelihood function (−2LL) as small as possible
The maximum likelihood estimates can be obtained by differentiating the log-likelihood function with respect to α and β and setting the partial derivatives equal to zero
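A minimal sketch of this idea: write −2LL for the binary logit model and minimize it numerically with scipy (the x and y data are made up; in practice the derivative conditions above are solved by such numerical optimizers):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up binary data (not from the slides).
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1])

def neg2_log_likelihood(params):
    alpha, beta = params
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    # log L = sum[ y*ln(p) + (1-y)*ln(1-p) ]; we minimize -2LL.
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg2_log_likelihood, x0=np.array([0.0, 0.0]))
print("alpha, beta =", result.x)
```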
Interpreting Coefficients
Since:
ln[p/(1-p)] = α + βX + e
the slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes … not very useful.
Since:
p = 1/[1 + exp(−α − βX)]
the marginal effect of a change in X on the probability is:
∂p/∂X = β p (1 − p)
An interpretation of the logit coefficient which
is usually more intuitive is the "odds ratio"
Since:
p/(1-p) = exp(α + βX)
exp(β) is the effect of the independent variable on the "odds ratio"