
5 TakingBackControl

The document discusses multivariate regression, emphasizing the importance of including control variables to address confounding factors and avoid biased estimates. It highlights the complexities of causality between variables, particularly in the context of education, gender, and experience, and the implications for interpreting regression results. Additionally, it touches on issues like multicollinearity and the need for careful model selection to accurately assess relationships between variables.


Taking back control … of confounding factors

AKA multivariate regression

Sam Asher


Housekeeping

• Proposals: will get you feedback by next week

• Class participation: constructive comments in class and on Forum, plus on-time submissions (visualization, proposal, etc.)
• No extra marks for going above once a week

• Next week: no class but extra Yuan office hours on Wed 19 Oct: Business
School Student Vault 12 (cave room on lower ground)
• Stream 1: 16.30-17.30
• Stream 2: 17.30-18.30

• R/data/etc support: Forum!

• Revision lecture/tutorials in Week 10

Imperial College Business School Imperial means Intelligent Business 2


Objectives for this lecture

• Learn how to include further (control) variables in a regression ⇒ multivariate regression

• Interpret multivariate regressions properly

• Understand that including more variables is not always better

• Learn how to perform hypothesis tests involving several parameters



Why Multivariate Regression?

We have more than one explanatory variable

e.g. \( Wage = \beta_0 + \beta_1 EDUC + \beta_2 FEMALE + u \)

where EDUC = years of schooling and FEMALE = 1 for women, 0 for men
Why?

1. Interest in several effects at once

2. Address confounding/endogeneity

3. Can help to reduce variance of estimate



Endogeneity and Multivariate Regression

Suppose we are only interested in schooling:

\[ Wage = \beta_0 + \beta_1 EDUC + \epsilon \]

so that \( \epsilon = \beta_2 FEMALE + u \)

• But there is a correlation between schooling and gender, and gender also has a separate effect on wages
• e.g. women tend to have less schooling. We will see in the data that this is still happening in many parts of the world, unfortunately.

⟹ Estimate of \( \beta_1 \) is biased

Your turn: What bias do you expect? Upward, downward, none?



Endogeneity and Multivariate Regression

\[ Wage = \beta_0 + \beta_1 EDUC + \epsilon \]

\[ \epsilon = \beta_2 FEMALE + u, \qquad \beta_2 < 0 \]

• The negative effect of FEMALE on both WAGE and EDUC implies a positive correlation between \( \epsilon \) and EDUC
• We get upward bias when attempting to estimate \( \beta_1 \)
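
The direction of the bias can be read off the standard omitted-variable-bias expression. The following is a sketch in the slide's symbols, assuming that \( u \) is uncorrelated with EDUC:

```latex
\[
\operatorname{plim}\hat\beta_1
  = \beta_1 + \beta_2\,\frac{\operatorname{Cov}(EDUC,\,FEMALE)}{\operatorname{Var}(EDUC)}
\]
```

With \( \beta_2 < 0 \) and a negative covariance between EDUC and FEMALE, the second term is positive, so the univariate estimate overstates \( \beta_1 \).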



Endogeneity and Multivariate Regression

• To avoid the problem we can include FEMALE as an additional variable:

\[ Wage = \beta_0 + \beta_1 EDUC + \beta_2 FEMALE + u \]

• We get a different regression model with a different residual term \( u \)
• This will lead to an unbiased \( \hat\beta_1 \) if EDUC is independent of (i.e. uncorrelated with) \( u \)



A stylised example

[Figure: wage (£/h) plotted against years of education, with men and women shown separately]



A stylised example
[Figure: wage (£/h) against years of education for men and women, with a line labelled “True model in univariate case”]

• Positive \( \epsilon \)-shocks when EDUC is high
• Negative \( \epsilon \)-shocks when EDUC is low
Causality vs all else equal

Another variable that is likely correlated with schooling and affects wages is experience (EXPER).

\[ Wage = \beta_0 + \beta_1 EDUC + \beta_2 EXPER + u \]

If we run a regression of this equation (and EDUC and EXPER are independent of \( u \)), the estimate of \( \beta_1 \) gives us the change in wage for one more year of schooling, keeping experience (and everything else) constant.

However, it might not give us the causal effect of increasing schooling on wages.



Directions of causality: EDUC → EXPER

[Diagram: EDUC → Wage; EDUC → EXPER (−); EXPER → Wage (+)]

The reason why EDUC and EXPER are correlated is likely a chain of causality from schooling to experience (i.e. if you go to school longer you don’t have as much time to get job experience; also note that EDUC is typically determined before EXPER, which supports the suggested chain).

If you include EXPER as a separate explanatory variable, then your coefficient on EDUC will not reflect this causal channel. This is good if you really want the all else equal effect of EDUC. However, if you want the full causal effect of EDUC (e.g. you want to advise the government on what an extra year of schooling does to wages) you get the wrong answer, as you are pretending that you can have extra schooling without reducing people’s experience. So it would be better to exclude EXPER.
Directions of causality: EDUC ← FEMALE

[Diagram: Female → EDUC (−); Female → Wage (−); EDUC → Wage]

• Gender is mostly (but not exclusively) determined before schooling
• Hence EDUC and the Female variable are correlated because of a causal chain from Female to EDUC
• In this case it is vital to include the Female variable to get the correct causal estimate of a change in EDUC



Directions of causality: EDUC ⇄ EXPER

EDUC = Schooling + Uni + Vocational

Suppose EDUC includes further education while working

[Diagram: EDUC → Wage; EXPER → Wage; causality between EDUC and EXPER runs in both directions]
h er

If the causality between the two explanatory variables goes both ways, we are in trouble as far as finding the causal effect of EDUC is concerned (we are fine for finding the ceteris paribus effect). Both including and dropping the EXPER variable will lead to a biased estimate. We have to use other methods, some of which we shall discuss later in the module (e.g. Instrumental Variables).



Key insight: You can be too controlling

• More control variables are not always better to identify a causal effect

• To include or not to include → depends on the direction of causation between the control and the explanatory variable of interest

• Sometimes there is no clear cut answer as causation goes both ways

• Report regression with and without control and discuss limitations of your analysis
• More research with other data or better model (e.g. Instrumental Variables which we
discuss later) might be needed.
• Might sometimes be beyond the scope of a study (e.g. in group coursework)



Multivariate OLS in practice – Let’s start univariate

# Load the wage data (one row per worker)
data <- read.csv("https://www.dropbox.com/s/9agc2vmamfztlel/WAGE1.csv?dl=1")

# Univariate regression of hourly wage on years of education
mod1 <- lm(wage ~ educ, data = data)
summary(mod1)
##
## Call:
## lm(formula = wage ~ educ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3396 -2.1501 -0.9674 1.1921 16.6085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90485 0.68497 -1.321 0.187
## educ 0.54136 0.05325 10.167 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
## F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16

\[ \widehat{WAGE} = -0.9 + 0.54 \times EDUC \]

Indicates that earnings per hour increase by $0.54 for every extra year of schooling



Wage & Gender

• Women tend to have less education than men (in this dataset)

Your turn: If we include female as an additional variable in a regression of WAGE, what effect do you expect that to have on the coefficient for education?
(a) EDUC coefficient goes up
(b) EDUC coefficient goes down
(c) EDUC coefficient remains unchanged
Wage & Gender

• Including the female variable makes the educ coefficient smaller


• This is because women are paid less than men irrespective of education
• Hence part of what we thought was the wage-depressing effect of little education is actually the wage-depressing effect of being female
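
The mechanism can be reproduced on simulated data. This is only a sketch: the numbers below are made up for illustration and are not the WAGE1 dataset, and the variable names simply mirror the slide.

```r
# Simulated data (hypothetical numbers, not the WAGE1 dataset)
set.seed(1)
n <- 5000
female <- rbinom(n, 1, 0.5)
# Women get less schooling on average in this simulated sample
educ <- 12 - 1.5 * female + rnorm(n, sd = 2)
# Assumed true model: a year of schooling adds 0.5, being female subtracts 2
wage <- 1 + 0.5 * educ - 2 * female + rnorm(n, sd = 1)

short <- lm(wage ~ educ)            # omits female
long  <- lm(wage ~ educ + female)   # controls for female

b_short <- unname(coef(short)["educ"])
b_long  <- unname(coef(long)["educ"])
print(c(short = b_short, long = b_long))
```

In this setup the regression without female should report the larger educ coefficient, while the regression with female should land close to the assumed 0.5.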



EDUC vs EXPER

• One more year of education means 1.4 years less experience

Your turn: If we include exper as an additional variable in a regression of WAGE, what effect do you expect that to have on the coefficient for education?
(a) EDUC coefficient goes up
(b) EDUC coefficient goes down
(c) EDUC coefficient remains unchanged
Multivariate OLS in practice

\[ \widehat{WAGE} = -3.39 + 0.644 \times EDUC + 0.07 \times EXPER \]

• It indicates that earnings per hour increase by $0.64 for every extra year of schooling and by $0.07 for every extra year of work experience.
• EDUC coefficient went up (previously 0.54) because of the negative correlation between EDUC and EXPER and because EXPER has a positive influence on wage.
• EDUC coefficient now represents the “all else equal” effect but no longer the causal effect
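
The reported numbers fit together arithmetically: 0.644 + 0.07 × (−1.4) ≈ 0.546, which is close to the univariate 0.54 (the gap is rounding in the reported coefficients). The sketch below checks the same in-sample identity on simulated data; all numbers and names are illustrative assumptions, not the WAGE1 data.

```r
# Simulated data (hypothetical numbers chosen to resemble the slide)
set.seed(2)
n <- 2000
educ  <- rnorm(n, mean = 12, sd = 3)
# More schooling leaves less time to accumulate experience
exper <- 30 - 1.4 * educ + rnorm(n, sd = 5)
wage  <- -3 + 0.6 * educ + 0.07 * exper + rnorm(n, sd = 1)

short <- lm(wage ~ educ)            # univariate
long  <- lm(wage ~ educ + exper)    # multivariate
aux   <- lm(exper ~ educ)           # how exper moves with educ

# short coefficient = long educ coefficient + exper coefficient * aux slope
lhs <- unname(coef(short)["educ"])
rhs <- unname(coef(long)["educ"] + coef(long)["exper"] * coef(aux)["educ"])
print(c(short = lhs, reconstructed = rhs))
```

The two printed values should agree up to floating-point error, because the decomposition is an algebraic property of OLS rather than a feature of this particular sample.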
More than 2 variables



Back to the criminal foreigners

Migration effect goes away if we control for population, unemployment rate & median age

## lm(formula = crimesPc ~ b_migr11 + pop11 + urate2011 + medianage,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8873 -0.2680 -0.0783 0.1434 3.1754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.689e+00 4.855e-01 7.599 3.57e-13 ***
## b_migr11 5.446e-03 3.879e-03 1.404 0.16130
## pop11 -8.656e-07 2.793e-07 -3.099 0.00212 **
## urate2011 4.016e-02 9.320e-03 4.309 2.20e-05 ***
## medianage -6.305e-02 1.027e-02 -6.138 2.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4774 on 310 degrees of freedom
## (105 observations deleted due to missingness)
## Multiple R-squared: 0.3468, Adjusted R-squared: 0.3383



Perfect Multi-collinearity
• A specific problem in the multivariate case
• The sample does not have enough variation in the explanatory variables
• Example: “educ in days” takes different values from “educ in years”, but the two are perfectly correlated, which implies perfect multi-collinearity



Perfect Multi-collinearity

We cannot work out the effect of “educ in days” holding “educ in years” constant, because all observations with a given “educ in years” have the same “educ in days”.

• R will drop one of the variables to allow OLS
• Which one is dropped has no meaning
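
A small simulated example of what this looks like in R. The data and variable names (educ_years, educ_days) are made up for illustration; only the behaviour of lm() is the point.

```r
# Simulated data (hypothetical): educ_days is an exact multiple of educ_years
set.seed(3)
n <- 200
educ_years <- sample(8:18, n, replace = TRUE)
educ_days  <- 365 * educ_years
wage <- 1 + 0.5 * educ_years + rnorm(n)

mod <- lm(wage ~ educ_years + educ_days)
print(coef(mod))   # the coefficient on educ_days is reported as NA
```

Swapping the order of the two regressors in the formula would leave the other one as NA instead, which is one way to see that the choice of dropped variable carries no meaning.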
Imperfect Multi-collinearity

Explanatory variables are closely but not perfectly correlated, i.e. it’s hard, but not impossible, for the OLS algorithm to distinguish between the separate effects for all the variables

Consequences:
• We can estimate all coefficients
• Variance of estimates might be high, i.e. estimates could be quite far off from the true value
• However: estimates will be unbiased (if x is not correlated with \( \epsilon \))

So what’s the problem?
• Possibly none
• Sometimes it won’t be possible to reliably identify all desired effects
• You might think something doesn’t matter when it does
Imperfect Multi-collinearity: An example

We include a wider set of controls for age bands; e.g. shxage017 reports the share of 0-17 year olds in percent.

None of the age variables is significant. Does this mean the age of the population is not important in explaining crime?

Age mattered before. The reason it doesn’t matter now is collinearity.



Joint Hypothesis Test

F-test….but it’s enough to look at the P value



F-tests: comparing unrestricted and restricted models
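Unrestricted Model:
\[ Y = \beta_{U1} X + \beta_{U2} AGE_2 + \beta_{U3} AGE_3 + \epsilon_U \]

Restricted Model (the 2 parameters are restricted to be 0):
\[ Y = \beta_{R1} X + 0 \times AGE_2 + 0 \times AGE_3 + \epsilon_R \]

To compute the F-statistic, the computer compares residuals \( \hat\epsilon_U \) with \( \hat\epsilon_R \)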
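
In R, one way to run such a joint test is to fit both models and pass them to anova(). The sketch below uses simulated data; the variable names (x, age2, age3) and all numbers are illustrative assumptions, not the crime dataset.

```r
# Simulated data (hypothetical): two correlated age variables
set.seed(4)
n <- 300
x    <- rnorm(n)
age2 <- rnorm(n)
age3 <- 0.9 * age2 + rnorm(n, sd = 0.3)
y    <- 1 + 0.5 * x + 0.3 * age2 + 0.3 * age3 + rnorm(n)

unrestricted <- lm(y ~ x + age2 + age3)
restricted   <- lm(y ~ x)          # both age coefficients restricted to 0

# anova() compares the residual sums of squares of the two nested models
ftest <- anova(restricted, unrestricted)
print(ftest)
p_value <- ftest[2, "Pr(>F)"]
```

A small p-value leads to rejecting the joint restriction, even in cases where the individual t-tests on the age variables look weak because of collinearity.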



Takeaways

• We can easily include further variables in a regression

• There are two reasons we might want to do that
1. To deal with endogeneity
2. We are interested in several variables at the same time

• Be careful about the causal relationships between explanatory variables

• There might be collinearity, which might imply that we cannot (precisely) distinguish between the effects of several explanatory variables.

• With several explanatory variables we might want to test several hypotheses jointly.

• We can use an F-test for that.



Extra Slides
Back to the criminal foreigners
## lm(formula = crimesPc ~ b_migr11 + pop11, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6243 -0.4052 -0.1253 0.2347 13.8304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.124e+00 1.018e-01 11.034 < 2e-16 ***
## b_migr11 4.105e-02 5.335e-03 7.694 1.77e-13 ***
## pop11 -1.033e-06 5.078e-07 -2.034 0.0428 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
Foreigners come to more populous areas but … those areas are less “crime intensive”

## lm(formula = b_migr11 ~ pop11, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.039 -5.187 -2.698 1.225 40.835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.240e+00 9.530e-01 6.548 2.18e-10 ***
## pop11 3.088e-05 4.883e-06 6.326 8.02e-10 ***
Back to the criminal foreigners

## lm(formula = crimesPc ~ b_migr11 + pop11 + urate2011 + medianage,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8873 -0.2680 -0.0783 0.1434 3.1754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.689e+00 4.855e-01 7.599 3.57e-13 ***
## b_migr11 5.446e-03 3.879e-03 1.404 0.16130
## pop11 -8.656e-07 2.793e-07 -3.099 0.00212 **
## urate2011 4.016e-02 9.320e-03 4.309 2.20e-05 ***
## medianage -6.305e-02 1.027e-02 -6.138 2.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4774 on 310 degrees of freedom
## (105 observations deleted due to missingness)
## Multiple R-squared: 0.3468, Adjusted R-squared: 0.3383

Migration effect goes away if we control for population, unemployment rate & median age

Is it always a good idea to control for those variables?

Can you think of an alternative strategy?


An Alternative Strategy

[Diagram with nodes: Pre 2004 foreigners, Post 2004 foreigners, Foreign 2011, Crime 2011, Unemployment 2004, Unemployment 2011]



An Alternative Strategy

Migration coefficient:
• Still not significant
• Value has become slightly lower



Imperfect Multi-collinearity: Variance Inflation

• We see that some of the age variables are highly correlated
• What matters is whether a lot of the variation of an x variable is accounted for by a linear combination of all other x variables.
• We can examine this by looking at \( R^2 \) in regressions of the following kind:

\[ X_1 = \gamma_{11} + \gamma_{12} X_2 + \gamma_{13} X_3 + \dots + u \]

• i.e. we regress the explanatory variables on each other and compute \( R^2 \) each time
• The Variance Inflation Factor (VIF) is an index that informs us about this by computing for every x-variable:

\[ VIF = \frac{1}{1 - R^2} \]

• VIF = 1 → \( R^2 \) = 0 → no problem
• VIF > 5 → \( R^2 \) > 80% → maybe an issue
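
The VIF can be computed by hand from the auxiliary regressions described above. This is a minimal base-R sketch on simulated data; the helper vif_one() and the variables x1, x2, x3 are hypothetical names introduced for illustration.

```r
# Simulated regressors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
set.seed(5)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.95 * x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for one column: regress it on all other columns and use that R^2
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  aux <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(X), vif_one, data = X)
print(round(vifs, 2))
```

Here x1 and x2 should come out well above the rule-of-thumb threshold of 5, while x3 should stay close to 1.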
The Variance inflation factor in practice



Finding \( R^2 \)

• Accounting is not necessarily explaining
• \( R^2 \) is mechanically increasing as we add further variables
• If we have as many parameters as observations \( R^2 \) is always 100% (e.g. consider 2 observations)
• Hence Adjusted \( \bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \) where k = # of variables
• i.e. the higher k the lower \( \bar{R}^2 \)
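
The adjustment can be checked against what R reports. The sketch below uses simulated data with two deliberately irrelevant regressors; all names and numbers are illustrative assumptions.

```r
# Simulated data (hypothetical): only x1 actually matters for y
set.seed(6)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)

small <- summary(lm(y ~ x1))
big   <- summary(lm(y ~ x1 + x2 + x3))
k     <- 3   # number of explanatory variables in the bigger model

# Adjusted R^2 by hand: 1 - (1 - R^2)(n - 1) / (n - k - 1)
adj_manual <- 1 - (1 - big$r.squared) * (n - 1) / (n - k - 1)
print(c(r2_small = small$r.squared, r2_big = big$r.squared,
        adj_manual = adj_manual, adj_from_R = big$adj.r.squared))
```

The plain R² of the bigger model can never be lower than that of the smaller one, whereas the adjusted version penalises the extra regressors.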
R2=100%

If we have as many observations as parameters being estimated, R2 is always 100%
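
A quick way to see this is to fit pure noise with as many parameters as observations. The example below is a sketch with three made-up observations and three parameters (intercept plus two slopes).

```r
# Three observations of pure noise, three parameters to estimate
set.seed(7)
y  <- rnorm(3)
x1 <- rnorm(3)
x2 <- rnorm(3)

fit <- lm(y ~ x1 + x2)

# R^2 computed by hand: 1 - RSS / TSS
rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
r2  <- 1 - rss / tss
print(r2)
```

The fit passes through every point, so the residuals vanish and R² equals 1 up to floating-point error, even though x1 and x2 have nothing to do with y.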



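Accounting for variation: \( R^2 \)

• How much of the variation in \( Y \) is accounted for by \( \hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \dots \)?

[Figure: two simulated examples, yXhighepslow (most of the variation in y accounted for) and yXlowepshigh (not so much)]

\[ R^2 = \frac{VAR(\hat{Y})}{VAR(Y)} \]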
Some more details on F-tests
Unrestricted Model:
\[ Y = \beta_{U1} X + \beta_{U2} AGE_2 + \beta_{U3} AGE_3 + \epsilon_U \]

Restricted Model (2 parameters are restricted to be 0):
\[ Y = \beta_{R1} X + 0 \times AGE_2 + 0 \times AGE_3 + \epsilon_R \]

We can compute an F-statistic as

\[ F = \frac{(RSS_R - RSS_U)/(p_U - p_R)}{RSS_U/(n - p_U)} \]

where \( RSS_R = \sum_i \hat\epsilon_{Ri}^2 \)

• How much more error do we get when restricting the model?
• If it’s a lot (F big) then the restriction should be rejected
• How do we know when F is big?
• Somebody worked out how F is distributed (turns out his name was Fisher)
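
The formula can be evaluated directly from the two residual sums of squares and compared with R's built-in test. This is a sketch on simulated data; variable names and numbers are illustrative, and the parameter counts here include the intercept.

```r
# Simulated data (hypothetical)
set.seed(8)
n <- 300
x    <- rnorm(n)
age2 <- rnorm(n)
age3 <- rnorm(n)
y    <- 1 + 0.5 * x + 0.2 * age2 + 0.2 * age3 + rnorm(n)

unrestricted <- lm(y ~ x + age2 + age3)
restricted   <- lm(y ~ x)

rss_u <- sum(resid(unrestricted)^2)
rss_r <- sum(resid(restricted)^2)
p_u   <- length(coef(unrestricted))   # 4 parameters including the intercept
p_r   <- length(coef(restricted))     # 2 parameters including the intercept

# F statistic from the formula, and its p-value from the F distribution
F_manual <- ((rss_r - rss_u) / (p_u - p_r)) / (rss_u / (n - p_u))
p_manual <- pf(F_manual, p_u - p_r, n - p_u, lower.tail = FALSE)

# Same test via anova() for comparison
tab <- anova(restricted, unrestricted)
print(c(F_manual = F_manual, F_anova = tab[2, "F"], p_value = p_manual))
```

The hand-computed statistic and the anova() value should coincide, since anova() applies the same comparison of nested models.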



F distribution

After R.A. Fisher (1890-1962) (did not work for Guinness)

F distribution has two arguments
1. Number of restrictions
2. Degrees of freedom of the unrestricted model

[Figure: density of the F distribution against the F statistic, for Fden(2,50,x) and Fden(3,50,x)]

If the F statistic is large it means that some or all of the restrictions jointly tested are probably not true

Find the critical value by equating the area in the right tail to the significance level, e.g. 5%
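
In R the critical value can be looked up with qf(), the quantile function of the F distribution. The sketch below uses the two parameter pairs from the plot, (2, 50) and (3, 50), and a 5% significance level.

```r
alpha <- 0.05

# df1 = number of restrictions, df2 = degrees of freedom of the unrestricted model
crit_2_50 <- qf(1 - alpha, df1 = 2, df2 = 50)
crit_3_50 <- qf(1 - alpha, df1 = 3, df2 = 50)
print(c(crit_2_50 = crit_2_50, crit_3_50 = crit_3_50))

# The area to the right of the critical value equals the significance level
tail_area <- pf(crit_2_50, df1 = 2, df2 = 50, lower.tail = FALSE)
print(tail_area)
```

An F statistic above the relevant critical value means the joint restriction is rejected at the 5% level, which is equivalent to the reported p-value being below 0.05.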
