
Univariate Linear Regression

Joonhyung Lee
University of Memphis

Econ 7810/8810

Contents
1 Relating Two Variables
2 Estimation
  2.1 Notation
  2.2 Ordinary Least Squares (OLS)
  2.3 The OLS Assumptions
3 Properties of the OLS Estimators
  3.1 OLS is unbiased
  3.2 OLS is consistent
  3.3 Estimating $\operatorname{Var}(\hat\beta_1)$
  3.4 Asymptotic Distribution of $\hat\beta_1$ (skip)
    3.4.1 Convergence in Probability
    3.4.2 Convergence in Distribution
    3.4.3 The Central Limit Theorem
    3.4.4 Slutsky's Theorem (Combines Conv in Prob and Dist)
4 Skedasticity
  4.1 Heteroskedasticity and Homoskedasticity
  4.2 Weighted Least Squares (WLS)
  4.3 The Variance of X and the Variance of $\hat\beta_1$
5 Statistical Inference
6 Regression When X is a Binary Variable
7 Goodness of Fit
8 Units of Measurement
9 Estimation in Stata
  9.1 Effects of Education on Hourly Wage (WAGE1.DTA)
  9.2 Test score and student ratio
  9.3 CPS data

1 Relating Two Variables
• Econometrics is concerned with understanding relationships between variables that
we as economists care about.

– Education and wages, investment and innovation, advertising and sales, class
size and test scores...

• But given what we know so far, all we can do to study the relationship between two
(or more) variables is to use covariance and correlation.

• Covariance measures how 2 variables move together:

– Cov (X, Y ) = E [(X − µX ) (Y − µY )]


– But how can we estimate this? Suppose (Xi , Yi ) ∼ iid (pairs of observations are
iid)
– Then we can use $s_{XY} = \frac{1}{n-1}\sum_i \left(X_i - \bar X\right)\left(Y_i - \bar Y\right)$
– Furthermore, we can show that $s_{XY} \xrightarrow{p} \sigma_{XY}$

• Correlation also measures how two variables move together


– In particular, $r_{XY} = \frac{s_{XY}}{s_X s_Y}$ (it's also true that $r_{XY} \xrightarrow{p} \rho_{XY}$)

• But does rXY > 0 mean that high values of X cause the values of Y to be high?

• No, correlation is not causation. Moreover, correlation represents a linear relation.

• In many cases we want to know: if we increase X by a certain amount, what is the expected effect on Y?

• Are averages enough to answer this question?

• Let’s start with a case where X is discrete and compare E (Y | X) for two values of
X.

• Example: consider data on average hourly earnings for men and women. Is there a significant gender gap?

– The wage gap $\bar Y_m - \bar Y_w$ is $4.11 per hour.
– The standard error is $SE\left(\bar Y_m - \bar Y_w\right) = 0.35$, so the t-statistic for $H_0: \mu_m - \mu_w = 0$ is $\frac{4.11 - 0}{0.35} = 11.74$, which has a p-value that's very close to 0 ($2\Phi(-11.74) \approx 0$).
– Indeed, a 99% CI for the wage gap is $4.11 \pm 2.58 \cdot 0.35 = (3.21, 5.01)$.

– So there is a gender gap and it’s statistically significant.
– But is this a sign of discrimination?
– Quite possibly. But why might it not be?
– Some “other factor” could be driving the relationship (experience, education).

• To establish gender bias we need to keep “everything else” constant which means
that instead of looking at

E(earnings | gender)

• We should be concerned with

E(earnings | gender, age, experience, education, etc.)

• It turns out that we can do both using regression analysis.

2 Estimation
• Let’s keep it simple in the beginning and start with E (Y | X).

• Of course, in most cases a univariate regression will be inadequate.

• Consider the following example. Does
$$E(\text{TestScore} \mid \text{ClassSize})$$
really capture the causal effect of class size on test scores?

• Aren’t other variables also important and perhaps driving the relationship?

– Neighborhood, teacher quality, parents’ income....


– Can we identify the impact of class size on test scores without controlling for
these other factors? Probably not.

• For now we’ll just stick to one X (and attribute these other factors to random vari-
ation).

• Adding more X’s turns out to be pretty simple and will allow us to account for these
additional factors explicitly.

2.1 Notation
• Y is the dependent, explained, response, predicted variable or regressand.
• X is the independent, explanatory, control, predictor variable or regressor, covariate.
• We know that E (Y | X) is a function of X, but what function?
• Let’s start by assuming it’s linear.
• Suppose E (Y | X) is linear in X:
E (Y | X) = β0 + β1 X

• In words, this is saying that if we know X, the expected value of Y is a linear function
of X.
• β0 + β1 X is then called the population regression line (the relationship that holds
between Y and X on average).
• So what do β0 & β1 represent? Consider the impact on Y of a one unit change in X:
$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$
$$E(Y \mid X = x + 1) = \beta_0 + \beta_1 (x + 1)$$
$$E(Y \mid x + 1) - E(Y \mid x) = \beta_0 + \beta_1(x+1) - \beta_0 - \beta_1 x = \beta_1$$
• So $\beta_1$ is the expected change in Y associated with a one unit change in X (i.e. the slope: $\beta_1 = \frac{\Delta Y}{\Delta X}$).
• β0 is the intercept: the expected value of Y when X = 0.

– The intercept is simply the point at which the population regression line intersects the Y axis.
– Note that in applications where X cannot equal 0, the intercept has no real meaning. For example, if X is class size and Y is a test score, the intercept is the expected test score when class size is 0, which is not meaningful.

• Note E (Y | X) = β0 + β1 X doesn’t mean that the data will all lie on the same line.
• Notice that we didn’t write Yi = β0 +β1 Xi , but wrote E (Y | X) = β0 +β1 X instead.
• E (Y | X) is an expectation, the actual observations will be scattered around the
population regression line:
Yi = β0 + β1 Xi + ui

• ui represents all the other factors besides Xi that determine the value of Yi for a
particular observation i

2.2 Ordinary Least Squares (OLS)
• Given that we’ve assumed there’s a linear relationship between E (Y | X) and X,
how do we estimate it?
• Intuitively, we want to estimate $\hat E(Y \mid X) = \hat\beta_0 + \hat\beta_1 X$, where $\hat\beta_0$ & $\hat\beta_1$ are estimates of the population parameters β0 & β1 (just like $\bar X$ is an estimate of µ).
• So how do we find βb0 & βb1 ? By minimizing the prediction error.
• Our estimates βb0 & βb1 will give us the predicted value of Y conditional on X (the
predicted values are Ybi = βb0 + βb1 Xi ).
• Although we expect our estimates of β0 & β1 to be correct on average, for any
particular observation i, we are likely to make a prediction error.
• The error made in predicting the ith observation is given by Yi − Ybi = Yi − βb0 − βb1 Xi
• Intuitively, we would like to choose βb0 & βb1 to make all of these errors as small as
possible. But how?
• The OLS estimator chooses the regression coefficients by minimizing the sum of the
squared prediction errors
$$\min_{\hat\beta_0,\hat\beta_1} \sum_i \left(Y_i - \hat Y_i\right)^2 = \min_{\hat\beta_0,\hat\beta_1} \sum_i \left[Y_i - \left(\hat\beta_0 + \hat\beta_1 X_i\right)\right]^2$$
• Another way: minimize $\sum_i \left|Y_i - \hat Y_i\right|$ (median/quantile regression)

• Taking partial derivatives yields
$$\frac{\partial}{\partial \hat\beta_0} \sum_i \left[Y_i - \hat\beta_0 - \hat\beta_1 X_i\right]^2 = -2\sum_i \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) \;\rightarrow\; \bar Y - \hat\beta_0 - \hat\beta_1 \bar X = 0$$
$$\frac{\partial}{\partial \hat\beta_1} \sum_i \left[Y_i - \hat\beta_0 - \hat\beta_1 X_i\right]^2 = -2\sum_i \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right)X_i \;\rightarrow\; \frac{1}{n}\sum_i Y_i X_i - \hat\beta_0 \bar X - \hat\beta_1 \frac{1}{n}\sum_i X_i^2 = 0$$
• Setting the partial derivatives equal to zero, collecting terms, dividing by n, and
solving the resulting two equations in two unknowns for βb0 & βb1 yields:
$$\hat\beta_1 = \frac{\sum_i \left(X_i - \bar X\right)\left(Y_i - \bar Y\right)}{\sum_i \left(X_i - \bar X\right)^2} = \frac{s_{XY}}{s_X^2}$$
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
• So we can derive the estimating equations for βb0 & βb1 by minimizing the sum of
squared prediction errors (recall that X can be constructed in a similar way).
• Moreover, just like X, βb0 & βb1 are themselves random variables. (We’ll derive their
distributions in a bit.)
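As a quick check of these estimating equations, here is a minimal Stata sketch on simulated data (the variable names, sample size, and true coefficients are all made up for illustration):

    * Minimal sketch: compute the OLS coefficients "by hand" and compare with regress
    clear
    set obs 500
    set seed 12345
    gen x = rnormal(10, 2)
    gen y = 2 + 0.5*x + rnormal(0, 1)    // true beta0 = 2, beta1 = 0.5
    quietly summarize x
    scalar xbar = r(mean)
    quietly summarize y
    scalar ybar = r(mean)
    gen num = (x - xbar)*(y - ybar)
    gen den = (x - xbar)^2
    quietly summarize num
    scalar sxy = r(sum)
    quietly summarize den
    scalar sxx = r(sum)
    scalar b1 = sxy/sxx
    scalar b0 = ybar - b1*xbar
    display "b1 = " b1 "    b0 = " b0
    regress y x                           // should report the same coefficients

The hand-computed b1 and b0 should match the coefficients reported by regress.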

2.3 The OLS Assumptions
• So why should we have faith in the OLS methodology?

• Do the OLS estimators have the same desirable properties that X had (unbiasedness,
consistency, asymptotic normality, efficiency)?

• Do the OLS estimators have causal interpretation?

• The answer is yes, pending assumptions.

• Actually, these assumptions are enough to give us unbiasedness, consistency and


asymptotic normality (which will let us build confidence intervals and conduct hy-
pothesis tests).

• Efficiency will require an additional assumption (homoskedasticity or the iid assumption) that we'll discuss later.

• The assumptions of OLS are

1. Linear in parameters. The population model can be written as

y = β0 + β1 x + u

where β0 and β1 are the (unknown) population parameters.


2. Simple random sample: $(X_i, Y_i)$, $i = 1, \dots, n$, are iid; each individual in the population is equally likely to be included in the sample.
3. Sample variation in the explanatory variable.
4. Zero conditional mean (strict exogeneity): $E(u_i \mid X_i) = 0$.
5. Homoskedasticity: $\operatorname{Var}(u_i \mid X_i) = \sigma^2$.

• In fact, a central purpose of these assumptions is to allow us to derive sampling


distributions for the estimates (which turn out to be normal).

• This will allow us to construct CIs and test hypotheses just like we did for µ.

• A second role of the assumptions is to highlight situations in which OLS regressions


might run into trouble.

• Much of the second half of the course is focused on handling these situations.

Assumption 1 & 4

• Yi = β0 + β1 Xi + ui

• The conditional distribution of ui given Xi has mean 0, called the zero conditional
mean assumption.

• Extending to multiple regressors, if u is correlated with any of the Xi , this assumption


is violated. This is usually a good way to think about the problem.

• Implication: given Xi, ui behaves like pure noise, i.e. Xi carries no information about the mean of ui (mean independence). In the wage equation, suppose u is “ability” and x is years of education. We need, for example,
$$E(\text{ability} \mid x = 8) = E(\text{ability} \mid x = 12) = E(\text{ability} \mid x = 16)$$
so that the average ability is the same in the different portions of the population with an 8th grade education, a 12th grade education, and a four-year college education.

• Because people choose education levels partly based on ability, this assumption is
almost certainly false.

• As another example, suppose u is “land quality” and x is fertilizer amount. Then E(u|x) = E(u) if fertilizer amounts are chosen independently of quality. This assumption is reasonable if fertilizer amounts are assigned at random.

• We will relax this to E(ui |X1 , X2 ) = E(ui |X2 ), which is called conditional mean independence. In that case, we can still give the coefficient on X1 a causal interpretation, but not the coefficient on X2. The idea is that X1 is exogenous as long as X2 is controlled for. We will get back to this issue in linear regression with multiple regressors.

• E (ui | Xi ) = 0 ⇒ E (Yi − (β0 + β1 Xi ) | Xi ) = 0 ⇒ E(Yi |Xi ) = β0 + β1 Xi

• Assumptions 1 & 4 together imply that the conditional expectation is linear.

• Intuition: Given Xi , the mean of the distribution of ui is 0.

• This means the conditional distribution is centered around the population regression
line.

• Note also that $E(u_i \mid X_i) = 0 \Rightarrow \rho_{uX} = 0$, but not the reverse.
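A one-line check of this implication, using the law of iterated expectations (a sketch of the argument, not spelled out in the original notes):
$$\operatorname{Cov}(u_i, X_i) = E(u_i X_i) - E(u_i)E(X_i) = E\left[X_i\,E(u_i \mid X_i)\right] - E\left[E(u_i \mid X_i)\right]E(X_i) = 0 - 0 = 0.$$
The converse fails because zero correlation only rules out a linear relationship between $u_i$ and $X_i$, while $E(u_i \mid X_i) = 0$ rules out any systematic relationship with the mean of $u_i$.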

Assumption 2

• Intuition: You have a random sample!

• This assumption is likely to hold in cross-sections, but often violated in time series
or panel data.

3 Properties of the OLS Estimators
3.1 OLS is unbiased
• We are going to show that OLS is unbiased and find its asymptotic distribution.

• Let’s start with unbiasedness.

• We want to show that

E(βb0 ) = β0 and E(βb1 ) = β1

• We’ll calculate E(βb1 ) now.

• To find $E(\hat\beta_1)$, we first need to know the formula for $\hat\beta_1$:
$$\hat\beta_1 = \frac{\sum_i \left(X_i - \bar X\right)\left(Y_i - \bar Y\right)}{\sum_i \left(X_i - \bar X\right)^2}$$

• It will be useful to rewrite this using a clever “trick”.

• Since we are assuming $Y_i = \beta_0 + \beta_1 X_i + u_i$, it follows that
$$\frac{1}{n}\sum_i Y_i = \beta_0 + \beta_1 \frac{1}{n}\sum_i X_i + \frac{1}{n}\sum_i u_i$$
$$\bar Y = \beta_0 + \beta_1 \bar X + \bar u$$
Taking the difference, we have
$$Y_i - \bar Y = \beta_1\left(X_i - \bar X\right) + \left(u_i - \bar u\right)$$

• Using this trick and some additional algebra, we can rewrite the formula for βb1 in a
more useful way

$$\begin{aligned}
\hat\beta_1 &= \frac{\sum_i \left(X_i - \bar X\right)\left(Y_i - \bar Y\right)}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\sum_i \left(X_i - \bar X\right)\left[\beta_1\left(X_i - \bar X\right) + \left(u_i - \bar u\right)\right]}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\beta_1 \sum_i \left(X_i - \bar X\right)^2 + \sum_i \left(X_i - \bar X\right)\left(u_i - \bar u\right)}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\beta_1 \sum_i \left(X_i - \bar X\right)^2 + \sum_i \left(X_i - \bar X\right)u_i - \bar u \sum_i \left(X_i - \bar X\right)}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\beta_1 \sum_i \left(X_i - \bar X\right)^2 + \sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \beta_1 + \frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2}
\end{aligned}$$

• So
$$\hat\beta_1 = \beta_1 + \frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2},$$
which is a very useful result (we'll use it to derive the distribution of $\hat\beta_1$ later on).

• Now, to show E(βb1 ) = β1 I just need to show that the expected value of the second
term is zero.

• Now let’s do the proof (i.e. show that E(βb1 ) = β1 ).

• Since we have
$$\hat\beta_1 = \beta_1 + \frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2}$$
it follows that
$$E(\hat\beta_1) = \beta_1 + E\left[\frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2}\right]$$
(using the LIE, $E(Z) = E\left[E(Z \mid X)\right]$, on the 2nd term)
$$= \beta_1 + E\left[E\left(\frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2} \,\Big|\, X_1, \dots, X_n\right)\right]$$
and since we know $E(XY \mid X) = X\,E(Y \mid X)$ for any Y,
$$= \beta_1 + E\left[\frac{\sum_i \left(X_i - \bar X\right)E\left(u_i \mid X_1, \dots, X_n\right)}{\sum_i \left(X_i - \bar X\right)^2}\right]$$
which, since $E(u_i \mid X_1, \dots, X_n) = E(u_i \mid X_i) = 0$ by OLS Assumptions 4 and 2 respectively,
$$= \beta_1 + 0 = \beta_1$$

• Therefore, we have shown that E(βb1 ) = β1 , so βb1 is an unbiased estimator of β1 .


• A similar approach can be used to show that E(βb0 ) = β0
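A small Monte Carlo experiment makes unbiasedness concrete. The sketch below (all names and parameter values are illustrative, not from the notes) repeatedly draws samples from a model with β1 = 0.5 and checks that the OLS slope is right on average:

    * Sketch: Monte Carlo check that E(b1_hat) = beta1 (true beta1 = 0.5)
    capture program drop olssim
    program define olssim, rclass
        clear
        set obs 100
        gen x = rnormal(10, 2)
        gen y = 2 + 0.5*x + rnormal(0, 1)
        regress y x
        return scalar b1 = _b[x]
    end
    simulate b1 = r(b1), reps(1000) seed(12345): olssim
    summarize b1    // the mean of b1 across replications should be close to 0.5

Individual estimates vary from sample to sample, but their average across replications should be very close to the true slope.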

3.2 OLS is consistent


$$\begin{aligned}
\operatorname{plim}(\hat\beta_1) &= \beta_1 + \operatorname{plim}\left(\frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2}\right) \\
&= \beta_1 + \operatorname{plim}\left(\frac{\frac{1}{n}\sum_i \left(X_i - \bar X\right)u_i}{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2}\right) \\
&= \beta_1 + \frac{1}{\operatorname{var}(X_i)}\operatorname{plim}\left(\frac{1}{n}\sum_i \left(X_i - \bar X\right)u_i\right) \\
&= \beta_1 + \frac{\operatorname{cov}(X_i, u_i)}{\operatorname{var}(X_i)} \\
&= \beta_1 + 0
\end{aligned}$$

3.3 Estimating $\operatorname{Var}(\hat\beta_1)$

$$\begin{aligned}
\operatorname{var}(\hat\beta_1) &= \operatorname{var}\left(\frac{\sum_i \left(X_i - \bar X\right)u_i}{\sum_i \left(X_i - \bar X\right)^2}\right) \\
&= \frac{1}{\left(\sum_i \left(X_i - \bar X\right)^2\right)^2}\operatorname{var}\left(\sum_i \left(X_i - \bar X\right)u_i\right) \\
&= \frac{1}{\left(\sum_i \left(X_i - \bar X\right)^2\right)^2}\sum_i \left(X_i - \bar X\right)^2 \operatorname{var}(u_i) \\
&= \frac{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2 \operatorname{var}(u_i)}{n\left(\frac{\sum_i \left(X_i - \bar X\right)^2}{n}\right)^2} \\
&= \frac{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2 \operatorname{var}(u_i)}{n\left(\operatorname{var}(X_i)\right)^2}
\end{aligned}$$
If we further assume Assumption 5, i.e. homoskedasticity (the iid assumption), we can go further:
$$\begin{aligned}
&= \frac{1}{\left(\sum_i \left(X_i - \bar X\right)^2\right)^2}\sum_i \left(X_i - \bar X\right)^2 \sigma^2 \\
&= \frac{\sigma^2}{\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\sigma^2}{n \cdot \frac{1}{n}\sum_i \left(X_i - \bar X\right)^2} \\
&= \frac{\sigma^2}{n \cdot \operatorname{var}(X_i)}
\end{aligned}$$
• So, $SE(\hat\beta_1) = \sqrt{\widehat{\operatorname{Var}}(\hat\beta_1)}$.

• These are the formulas that Stata uses to construct the standard errors.
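As a sketch of how these formulas translate into output, the homoskedasticity-only standard error can be computed by hand (using Stata's usual n − 2 degrees-of-freedom correction) and compared with the one regress reports. This assumes a dataset with variables y and x in memory, e.g. the simulated-data sketch from Section 2.2; the names are illustrative:

    * Sketch: homoskedasticity-only SE of the slope "by hand"
    quietly regress y x
    predict uh, resid
    gen uh2 = uh^2
    quietly summarize uh2
    scalar ssr  = r(sum)
    scalar sig2 = ssr/(r(N) - 2)              // s^2 = SSR/(n-2)
    quietly summarize x
    scalar sxx = (r(N) - 1)*r(Var)            // sum of (x - xbar)^2
    display "SE(b1) by hand = " sqrt(sig2/sxx)
    regress y x                               // compare with the reported Std. Err.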

3.4 Asymptotic Distribution of $\hat\beta_1$ (skip)

This subsection sketches the proof of the asymptotic normality of the $\hat\beta$'s.

3.4.1 Convergence in Probability


• Let a1 , a2 , ..., an , .. be a sequence of random variables

– Example: $\bar Y_1, \bar Y_2, \dots, \bar Y_n$, where n is the # of observations.

• Loosely speaking, we say that a random variable an converges in probability to a if


an becomes closer and closer to a as n → ∞.
– This is written as $a_n \xrightarrow{p} a$ or $\operatorname{plim}(a_n) = a$

• Formally, an converges in probability to a if for every ε > 0

P (|an − a| > ε) → 0 as n → ∞

• Examples: $\bar Y_n \xrightarrow{p} \mu_Y$, $\;s_Y^2 \xrightarrow{p} \sigma_Y^2$

3.4.2 Convergence in Distribution
• Let F1 , F2 , ..., Fn , .. be a sequence of CDFs corresponding to a sequence of random
variables W1 , W2 , ..., Wn , ..
 
• $W_n$ converges in distribution to W ($W_n \xrightarrow{d} W$) if the CDFs $\{F_n\}$ converge to F (the CDF of W):
$$W_n \xrightarrow{d} W \iff \lim_{n \to \infty} F_n(t) = F(t)$$
• We will sometimes also use the notation
$$W_n \stackrel{a}{\sim} F$$
• Examples: $\frac{\bar Y - \mu_Y}{\sigma_Y / \sqrt{n}} \xrightarrow{d} N(0, 1)$, $\quad \bar Y \stackrel{a}{\sim} N\!\left(\mu_Y, \frac{\sigma_Y^2}{n}\right)$

3.4.3 The Central Limit Theorem


• If $Y_1, \dots, Y_n$ are iid with $E(Y_i) = \mu_Y$, $\operatorname{var}(Y_i) = \sigma_Y^2$ where $0 < \sigma_Y^2 < \infty$, then the standardized sample average
$$\frac{\bar Y - \mu_Y}{\sigma_Y / \sqrt{n}} \xrightarrow{d} N(0, 1)$$
• Since $\frac{\bar Y - \mu_Y}{\sigma_Y / \sqrt{n}} = \frac{\sqrt{n}\left(\bar Y - \mu_Y\right)}{\sigma_Y}$, the CLT can also be written as
$$\sqrt{n}\left(\bar Y - \mu_Y\right) \xrightarrow{d} \sigma_Y\, N(0, 1)$$
$$\sqrt{n}\left(\bar Y - \mu_Y\right) \xrightarrow{d} N\!\left(0, \sigma_Y^2\right)$$
• You will sometimes see this written as
$$\bar Y \stackrel{a}{\sim} N\!\left(\mu_Y, \frac{\sigma_Y^2}{n}\right)$$

3.4.4 Slutsky’s Theorem (Combines Conv in Prob and Dist)


• Suppose $a_n \xrightarrow{p} a$ and $W_n \xrightarrow{d} W$. Then $a_n + W_n \xrightarrow{d} a + W$, $\;a_n W_n \xrightarrow{d} aW$, and $\frac{W_n}{a_n} \xrightarrow{d} \frac{W}{a}$ (if $a \neq 0$).

– Example: using Slutsky to find the asymptotic distribution of the t-statistic.
– Assume $Y_i \sim \text{iid}\left(\mu_Y, \sigma_Y^2 < \infty\right)$.
– Recall that the t-statistic based on $\bar Y$ is
$$t = \frac{\bar Y - \mu_Y}{s_Y / \sqrt{n}}$$
and let $a_n = \frac{\sigma_Y}{s_Y}$ and $W_n = \frac{\bar Y - \mu_Y}{\sigma_Y / \sqrt{n}}$, so that $t = a_n W_n$.
– Since $s_Y^2 \xrightarrow{p} \sigma_Y^2$ (and by using the continuous mapping theorem, stated below), we know that $a_n \xrightarrow{p} 1$; also, from the Central Limit Theorem, we know that
$$W_n \xrightarrow{d} N(0, 1)$$
Therefore, applying the Slutsky theorem,
$$t = a_n W_n \xrightarrow{d} N(0, 1)$$

• Now let’s derive the asymptotic distribution of βb1 .

• Recall our trick from before

$$\hat\beta_1 - \beta_1 = \frac{\frac{1}{n}\sum_i \left(X_i - \bar X\right)u_i}{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2}$$
• Let $\frac{1}{n}\sum_i \left(X_i - \bar X\right)u_i \approx \frac{1}{n}\sum_i \left(X_i - \mu_X\right)u_i = \frac{1}{n}\sum_i v_i = \bar v$ (in large samples, $\bar X$ can be replaced by $\mu_X$).

• Thus, we can write
$$\sqrt{n}\left(\hat\beta_1 - \beta_1\right) = \frac{\sqrt{n}\,\bar v}{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2} = \frac{\text{“}W_n\text{”}}{\text{“}a_n\text{”}}$$
• We know $\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2 \xrightarrow{p} \operatorname{var}(X_i) = \sigma_X^2$ (so $a_n \xrightarrow{p} \sigma_X^2$).
• The continuous mapping theorem states that, for any continuous function g:
– if $a_n \xrightarrow{p} a$ then $g(a_n) \xrightarrow{p} g(a)$, and
– if $W_n \xrightarrow{d} W$ then $g(W_n) \xrightarrow{d} g(W)$.

• Suppose we can prove that
$$\sqrt{n}\,\bar v \xrightarrow{d} N\!\left(0, \sigma_v^2\right) \iff W_n \xrightarrow{d} N\!\left(0, \sigma_v^2\right)$$
• Then $\sqrt{n}\left(\hat\beta_1 - \beta_1\right) \xrightarrow{d} \frac{1}{\sigma_X^2}\, N\!\left(0, \sigma_v^2\right)$, so $\hat\beta_1 \stackrel{a}{\sim} N\!\left(\beta_1, \frac{\sigma_v^2}{n\left(\sigma_X^2\right)^2}\right)$.
• So how do we show that $\sqrt{n}\,\bar v \xrightarrow{d} N\!\left(0, \sigma_v^2\right)$? By the CLT!

• First, note that
$$E(v_i) = E\left[\left(X_i - \mu_X\right)u_i\right] = 0 \quad \text{(Assumption 4)}$$
$$\operatorname{Var}(v_i) = \operatorname{Var}\left(\left(X_i - \mu_X\right)u_i\right) = \sigma_v^2 < \infty$$
• So we can apply the CLT:
$$\frac{\bar v - \mu_v}{\sigma_v / \sqrt{n}} \xrightarrow{d} N(0, 1) \iff \sqrt{n}\,\bar v \xrightarrow{d} N\!\left(0, \sigma_v^2\right)$$

• Finally, applying the Slutsky theorem to
$$\sqrt{n}\left(\hat\beta_1 - \beta_1\right) = \frac{\sqrt{n}\,\bar v}{\frac{1}{n}\sum_i \left(X_i - \bar X\right)^2}, \qquad \text{where } \sqrt{n}\,\bar v \xrightarrow{d} N\!\left(0, \sigma_v^2\right) \text{ and } \frac{1}{n}\sum_i \left(X_i - \bar X\right)^2 \xrightarrow{p} \sigma_X^2,$$
we get
$$\sqrt{n}\left(\hat\beta_1 - \beta_1\right) \xrightarrow{d} \frac{1}{\sigma_X^2}\, N\!\left(0, \sigma_v^2\right) = N\!\left(0, \frac{\sigma_v^2}{\left(\sigma_X^2\right)^2}\right)$$
• So, we conclude that
$$\hat\beta_1 \stackrel{a}{\sim} N\!\left(\beta_1, \frac{\operatorname{Var}\left(\left(X_i - \mu_X\right)u_i\right)}{n\left(\operatorname{var}(X_i)\right)^2}\right)$$

4 Skedasticity
4.1 Heteroskedasticity and Homoskedasticity
• Note that the ui ’s determine how the data will be scattered around the regression
line.

• But so far we've made no assumptions about $\operatorname{Var}(u_i)$ (aside from a nonzero finite fourth moment: $0 < E(u_i^4) < \infty$).
• We did assume that E (ui | Xi ) = 0 but we did not assume that V ar (ui | Xi ) = σu2
(i.e. that the variance does not depend on the regressors).

• If it’s true that V ar (ui | Xi ) = σu2 (a constant) then we have homoskedasticity, which
is a useful property to have!

• If instead, V ar (ui | Xi ) = f (Xi ) we have heteroskedasticity.

• “Skedasticity”, sometimes spelled “scedasticity”, is a statistical word meaning “ten-


dency to scatter”.

• Homoskedasticity: All conditional distributions have the same variance (spread).

• Heteroskedasticity: The conditional distributions can have different variances.

• Examples : education and income, a firm’s productivity and foreign investment, etc.

• If the tendency to scatter has some pattern, we may use quantile regression.

• So why is homoskedasticity nice to have?

• First, it simplifies the formulas for the SEs quite a bit.

• Second, more importantly, assumptions above plus homoskedasticity mean that OLS
is BLUE.

– BLUE means best (min. var.) linear unbiased estimator.


– βb0 & βb1 are efficient among all estimators that are linear and unbiased, condi-
tional on the Xi ’s.
– The OLS estimators have the smallest variance of all unbiased estimators.
– Before we showed the OLS estimators were unbiased, consistent and asymptot-
ically normal (all still true).
– Now OLS is BLUE (also called the Gauss-Markov Theorem).

• However,

– If the errors are instead heteroskedastic, which is true in most practical ques-
tions, OLS is no longer BLUE.
– Even if the errors are homoskedastic, OLS is only the best in the linear sense. That is, there may be a better non-linear estimator than OLS.

• So why don’t we use these simple formulas all the time?

• Homoskedasticity often does not hold in practice.

• Also, since we compute $\hat\sigma^2_{\hat\beta_1}$ using a computer, we don't care so much about having a simple formula.

• Moreover, the heteroskedasticity robust (HR) estimator can handle homoskedasticity


since HR assumes less.

• Remember that the point estimate is the same either way: only the SE’s change
depending on what you assume about V ar (ui | Xi ) .

• Typically
$$\hat\sigma^2_{\hat\beta_1}(\text{HR}) > \hat\sigma^2_{\hat\beta_1}(\text{homoskedasticity-only}),$$
so you'll have bigger CIs and p-values.


• Since a bigger variance leads to a smaller t-ratio (recall $t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)}$), you are less likely to reject $H_0$ if you use HR SEs, so your assumptions matter for inference!

– Point estimates are the same, but tests and CI’s change

• Thus, it’s much safer to always use HR standard errors unless you know that V ar (ui | Xi ) =
σu2 .
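For reference, here is a sketch of the two variance estimators behind these standard errors; the notes refer to HR SEs without writing them out, so this is the standard Eicker–Huber–White form (shown without the small-sample degrees-of-freedom corrections, which differ slightly across software):
$$\hat\sigma^2_{\hat\beta_1,\text{homosk}} = \frac{s_{\hat u}^2}{\sum_i \left(X_i - \bar X\right)^2}, \qquad \hat\sigma^2_{\hat\beta_1,\text{HR}} = \frac{\sum_i \left(X_i - \bar X\right)^2 \hat u_i^2}{\left[\sum_i \left(X_i - \bar X\right)^2\right]^2}.$$
Under homoskedasticity, $E(\hat u_i^2 \mid X_i)$ is the same constant for every i and the two formulas estimate the same quantity; under heteroskedasticity only the HR version remains valid.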

4.2 Weighted Least Squares (WLS)


• Can we transform heteroskedasticity to homoskedasticity?

• Yes, if we know the form of heteroskedasticity.

• Suppose

1. E (Yi | Xi ) = β0 + β1 Xi

2. (Yi , Xi ) ∼ iid

3. Xi , ui have finite fourth moments, and

4. $\operatorname{Var}(u_i \mid X_i) = \lambda f(X_i)$, where $f(\cdot)$ is a known function and λ is an unknown constant (so we know the form of heteroskedasticity up to the proportionality factor λ), and the error covariances are still zero ($E(u_i u_j) = 0$ for $i \neq j$).

Define
$$\tilde Y_i = \frac{Y_i}{\sqrt{f(X_i)}}, \quad \tilde X_{0i} = \frac{1}{\sqrt{f(X_i)}}, \quad \tilde X_{1i} = \frac{X_i}{\sqrt{f(X_i)}}, \quad \text{and} \quad \tilde u_i = \frac{u_i}{\sqrt{f(X_i)}}$$
Then
$$Y_i = \beta_0 + \beta_1 X_i + u_i \implies \tilde Y_i = \beta_0 \tilde X_{0i} + \beta_1 \tilde X_{1i} + \tilde u_i$$
but now
$$\operatorname{Var}(\tilde u_i \mid X_i) = \operatorname{Var}\left(\frac{u_i}{\sqrt{f(X_i)}} \,\Big|\, X_i\right) = \frac{\operatorname{Var}(u_i \mid X_i)}{f(X_i)} = \lambda, \quad \text{which is a constant.}$$

• So we can create $\tilde Y_i$, $\tilde X_{0i}$, & $\tilde X_{1i}$ and run OLS on
$$\tilde Y_i = \beta_0 \tilde X_{0i} + \beta_1 \tilde X_{1i} + \tilde u_i$$

• This WLS regression will be BLUE.

• It's called weighted least squares because we calculate the coefficients by minimizing the sum of the squared residuals, weighted by $\frac{1}{f(X_i)}$.

• WLS can be extended to cases where we have to estimate the function f (·), which
is called feasible WLS.

• WLS is more efficient than OLS with heteroskedasticity-robust SEs.

• So what’s the caveat?

– We don’t know much about f (X)


– Since the functional form of f (X) is rarely known (and using the wrong one
invalidates the method), WLS is rarely used in practice.
– In most cases, it is preferable to simply use HR SEs
∗ They produce asymptotically valid inferences even when you don’t know
f (X) .
∗ They are computed automatically by Stata and other regression packages.
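As a hedged illustration (the functional form f(x) = x is an assumption made up for this sketch, and it requires x > 0), WLS can be run in Stata with analytic weights of 1/f(x), which reproduce the weighted minimization described above:

    * Sketch: WLS when Var(u|x) is assumed proportional to x
    regress y x [aweight = 1/x]      // WLS point estimates
    regress y x, vce(robust)         // OLS with HR SEs, for comparison

In practice, since f(x) is rarely known, the second line is what most applied work reports.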

4.3 The Variance of X and the Variance of $\hat\beta_1$

• First, a high variance of X yields a low variance of $\hat\beta_1$:
$$\sigma^2_{\hat\beta_1} = \frac{1}{n}\,\frac{\operatorname{Var}\left(\left(X_i - \mu_X\right)u_i\right)}{\left(\operatorname{Var}(X_i)\right)^2}$$
– Intuitively, the more spread out the Xi are in a scatterplot, the easier it is to pin down the slope of the fitted line.

• Second, a low variance of u yields a low variance of $\hat\beta_1$. This is the rationale for adding more control variables that explain Y.

5 Statistical Inference
• We are now able to construct confidence intervals and conduct hypothesis tests.

• We can construct a 95% confidence interval as
$$\hat\beta_1 \pm 1.96 \cdot SE\!\left(\hat\beta_1\right)$$
• Similarly, our t-ratio or t-statistic is $t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)}$

• So a two-sided test has

p-value = P (|Z| > |t|) = 2Φ (− |t|)

• Remember that the p-value is the smallest significance level at which the null hy-
pothesis could be rejected.

• Equivalently, it’s the probability of obtaining a statistic, by random sampling vari-


ation, at least as different from the null hypothesis value as the statistic actually
observed (assuming H0 is correct).
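In Stata, all of these quantities come directly from the regression output (a sketch; variable names are illustrative):

    * Sketch: inference after OLS
    regress y x, vce(robust)             // reports t-statistics, p-values, and 95% CIs
    regress y x, vce(robust) level(99)   // same regression with 99% CIs
    test x = 0                           // Wald test of H0: beta1 = 0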

6 Regression When X is a Binary Variable


• So far, we have only looked at examples where the regressor X is a continuous variable
(e.g. class size).

• Regression Analysis can also be used when X is a binary or dummy variable (i.e. can
only take on the values 0 and 1).

– gender, drug treatment, democrat...

• Although the coefficients are calculated in exactly the same way when X is binary,
the interpretation of β1 differs.

– Why? Because a regression with a binary regressor is equivalent to performing


a difference of means analysis.

• For example, let Yi be average hourly earnings in 2008 and Di equal 1 if the worker
is male and 0 if the worker is female.

• The population regression model with Di as the regressor is

Yi = β0 + β1 Di + ui

• Since Di is not continuous, we can’t really think of β1 as a slope (there’s no “line”


since Di only takes on 2 values).

• For this reason, we just call β1 the coefficient on Di , instead of the slope.

• So how do we interpret β1 if it’s not a slope? Let’s look at what we have for each
value of Di

• When Di = 0 (the worker is female)

Yi = β0 + β1 · 0 + ui = β0 + ui

• Since E (Yi | Di = 0) = β0 , β0 is the population mean value of earnings for women.

• Whereas when Di = 1 (the worker is male)

Yi = β0 + β1 · 1 + ui = β0 + β1 + ui

• So E (Yi | Di = 1) = β0 + β1 , the population mean value of earnings for men.

• β1 is then the difference between the two population means.

• The Stata command margins reports this result as well.
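A quick way to see the equivalence with a difference-of-means analysis (a sketch with made-up variable names, in the spirit of the CPS example in Section 9.3):

    * Sketch: the coefficient on a dummy equals the difference in group means
    regress earnings male, robust
    ttest earnings, by(male) unequal   // reports the two group means; their
                                       // difference matches the regression
                                       // coefficient (up to sign convention)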

7 Goodness of Fit
• So we’ve learned how to estimate β0 & β1 and how to test hypotheses and build CI’s
using these estimates.

• But how “good” is our regression?

• In other words, how close is the line to the actual data?

• Or more precisely, how much of the variation in Y is explained by our regression?

• Can we measure how much better the regression does at estimating Y than just using
Y?

• We need to measure how close we are getting to the data.

• Define the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (or sum of squared residuals, SSR) as
$$SST = TSS = \sum_{i=1}^{n}\left(y_i - \bar y\right)^2$$
$$SSE = ESS = \sum_{i=1}^{n}\left(\hat y_i - \bar y\right)^2$$
$$SSR = RSS = \sum_{i=1}^{n}\hat u_i^2$$

• The $R^2$ is the percentage of the total variation in Y “explained” by the estimated regression:
$$R^2 = \frac{ESS}{TSS} = \frac{\sum_i \left(\hat Y_i - \bar Y\right)^2}{\sum_i \left(Y_i - \bar Y\right)^2} = \frac{\text{“explained variation”}}{\text{“total variation”}} = \frac{\text{sample variance of } \hat Y_i}{\text{sample variance of } Y_i}$$
• Since $TSS = ESS + RSS$, we can also show that
$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\text{“unexplained variation”}}{\text{“total variation”}}$$

• Note that 0 ≤ R2 ≤ 1

• R2 = 1 is a perfect fit (all the data points are on the regression line).

• R2 = 0 means you are explaining none of the variation in Y (so your best guess for
any Yi is just the sample mean Y ).

• It turns out that there is a close link between $R^2$ in the univariate regression model and the sample correlation coefficient $r_{XY} = \frac{s_{XY}}{s_X s_Y}$.

• R2 is a measure of the fit of the linear model.

• The sample correlation (rXY ) is a measure of the linear relationship between two
variables.

• In fact, in the univariate case $R^2 = r_{XY}^2$ (you can prove this using the definitions of $R^2$ and $r_{XY}$).

• This is useful to know since it gives us some idea of what a high or low R2 should
“look like”.

• A high R2 means that a lot of the total variation is explained by the regression (data
is tightly concentrated around the line).

• But, R2 does not tell you about the statistical significance of the coefficients (for this
you need SEs).

– R2 also does not prove our model is right or wrong: you can have a good model
but a low R2 because V ar(ui ) is large.
– Can also have a bad model with R2 ≈ 1
– Spurious regression: X and Y move together because of something else.
– Ex: Regress the number of supermarkets on the number of cars (or video stores).
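A by-hand check of the R² decomposition in Stata (a sketch; assumes a dataset with no missing values and illustrative variable names):

    * Sketch: compute R^2 from TSS and RSS and compare with e(r2)
    regress y x
    predict uhat, resid
    gen uhat2 = uhat^2
    quietly summarize uhat2
    scalar rss = r(sum)
    quietly summarize y
    scalar tss = (r(N) - 1)*r(Var)       // sum of (y - ybar)^2
    display "R2 by hand = " 1 - rss/tss
    display "e(r2)      = " e(r2)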

8 Units of Measurement
• It is very important to know how y and x are measured in order to interpret regression
functions. Consider an equation estimated from CEOSAL1.DTA, where annual CEO
salary is in thousands of dollars and the return on equity is a percent:
$$\widehat{salary} = 963.191 + 18.501\,roe, \qquad n = 209, \; R^2 = .0132$$

• When roe = 0 (it never is in the data), $\widehat{salary} = 963.191$. But salary is in thousands of dollars, so $963,191.

• A one percentage point increase in roe increases predicted salary by 18.501, or


$18,501.

• What if we measure roe as a decimal, rather than a percent? Define

roedec = roe/100

• What will happen to the intercept, slope, and R2 when we regress

salary on roedec?

• Nothing should happen to the intercept: roedec = 0 is the same as roe = 0. But the
slope will increase by 100. The goodness-of-fit should not change, and it does not.

• The new regression is
$$\widehat{salary} = 963.191 + 1{,}850.1\,roedec, \qquad n = 209, \; R^2 = .0132$$

• Now a one percentage point change in roe is the same as ∆roedec = .01, and so we
get the same effect as before.
• What if we measure salary in dollars, rather than thousands of dollars?
• Both the intercept and slope get multiplied by 1,000:
$$\widehat{salarydol} = 963{,}191 + 18{,}501\,roe, \qquad n = 209, \; R^2 = .0132$$
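The algebra behind these rescalings is worth writing out once. If $roedec = roe/100$, then
$$963.191 + 18.501\,roe = 963.191 + (100 \times 18.501)\left(\frac{roe}{100}\right) = 963.191 + 1{,}850.1\,roedec,$$
so the fitted values, and hence $R^2$, are unchanged while the slope is scaled by 100. Likewise, measuring salary in dollars multiplies the left-hand side by 1,000, so both coefficients are multiplied by 1,000 and the fit is again unchanged.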

9 Estimation in Stata
9.1 Effects of Education on Hourly Wage (WAGE1.DTA)
• Data are from 1991 on men only. wage is reported in dollars per hour, educ is highest
grade completed.
• reg wage educ
• Negative intercept. Each additional year of schooling is estimated to be worth $0.54.
• Plugging in educ = 0 gives the silly prediction $\widehat{wage} = -.904$. Extrapolating outside the range of the data can produce strange predictions.
• When educ = 12, the predicted hourly wage is $5.59, which we can think of as our
estimate of the average wage in the population when educ = 12.
• margins, at(educ=12)
• We are explaining about 16% of the variation in wage with our regression. In other
words, 84% of variation in wage remains unexplained.
• predict wagehat
• predict uhat, resid
• Some residuals are positive, others are negative. None is especially close to zero.
Years of schooling, by itself, need not be a very good predictor of wage.
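A couple of optional follow-up commands (a sketch, assuming the wagehat and uhat variables created above):

    summarize uhat                            // residuals average to (essentially) zero
    twoway (scatter wage educ) (line wagehat educ, sort), ///
        ytitle("Hourly wage") xtitle("Years of education")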

9.2 Test score and student ratio


• Stata command: reg testscr str, robust
• The estimated regression line is
$$\widehat{TestScore} = \underset{(10.4)}{698.9} \;-\; \underset{(0.52)}{2.28} \cdot STR$$
(standard errors in parentheses).

• So what is the expected impact on test scores of a two-student increase in class size? $-2.28 \times 2 = -4.56$.

• What is the expected test score in a district with 20 students per teacher? How about
30 students? 0 students?

• Note that SE(βb0 ) = 10.4 & SE(βb1 ) = .52.

• Of course, we can use them to test hypotheses as we did before. For example, suppose
you want to test

H0 : β1 = 0
HA : β1 6= 0

$$t\text{-stat} = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \frac{-2.28 - 0}{.52} = -4.39 \;\Rightarrow\; \text{p-value} = 2\cdot\Phi(-4.39) \approx 0 \text{ (so we reject the null).}$$
Alternatively, a 95% CI for β1 is simply $\hat\beta_1 \pm 1.96 \cdot SE(\hat\beta_1) = -2.28 \pm 1.02 = (-3.3, -1.26)$ (same conclusion).

• rXY = −.226. (Stata command : pwcorr str testscr )

• We can see that (−.226)2 = .051, which is the R2 from the regression!

• What do you expect the test score of Orange County to be, compared to other counties? Run the regression and interpret it.
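To answer the two-student-increase and predicted-score questions above directly in Stata (a sketch, assuming reg testscr str, robust has just been run):

    lincom 2*str                    // effect of a two-student increase, with its SE
    margins, at(str=(0 20 30))      // predicted test scores at 0, 20, and 30 students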

9.3 CPS data


• Here’s the result of the regression above using the 2008 data:

• Stata command: reg ahe08 a_sex if year==2008, robust

$$\widehat{Earnings} = 29.08 - 4.10 \cdot Female$$

• βb0 = 29.08 is the average value of earnings for men.

• βb0 + βb1 = 24.98 is the average value of earnings for women.

• βb1 = −4.10 is the difference between the two sample averages.

• Stata commands: reg ahe08 i.a_sex if year==2008, robust; margins a_sex

• We can test the hypothesis $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$ by calculating the t-statistic
$$t^{act} = \frac{\hat\beta_1 - 0}{SE\!\left(\hat\beta_1\right)} = \frac{-4.10}{.353} = -11.59$$
and then calculating the p-value
$$\text{p-value} = 2\Phi\!\left(-\left|t^{act}\right|\right) = 2\Phi(-11.59) \approx 0$$




• We can reject the null hypothesis at any positive level of significance (just as before).
