Regression With One Regressor
Joonhyung Lee
University of Memphis
Econ 7810/8810
Contents
1 Relating Two Variables
2 Estimation
  2.1 Notation
  2.2 Ordinary Least Squares (OLS)
  2.3 The OLS Assumptions
3 Properties of the OLS Estimators
4 Skedasticity
  4.1 Heteroskedasticity and Homoskedasticity
  4.2 Weighted Least Squares (WLS)
  4.3 The Variance of X and the Variance of β̂1
5 Statistical Inference
7 Goodness of Fit
8 Units of Measurement
9 Estimation in Stata
  9.1 Effects of Education on Hourly Wage (WAGE1.DTA)
  9.2 Test score and student ratio
  9.3 CPS data
1 Relating Two Variables
• Econometrics is concerned with understanding relationships between variables that
we as economists care about.
– Education and wages, investment and innovation, advertising and sales, class
size and test scores...
• But given what we know so far, all we can do to study the relationship between two
(or more) variables is to use covariance and correlation.
• But does rXY > 0 mean that high values of X cause the values of Y to be high?
• Let’s start with a case where X is discrete and compare E (Y | X) for two values of
X.
• Example: suppose we have data on average hourly earnings for men and women. Is there a significant gender gap?
– The wage gap Ȳm − Ȳw is $4.11 per hour (a Stata sketch of this comparison appears at the end of this section).
– The standard error is SE(Ȳm − Ȳw) = .35, so the t-stat for H0 : µm − µw = 0 is (4.11 − 0)/.35 = 11.74, which has a p-value that's very close to 0 (2Φ(−11.74) ≈ 0).
– Indeed, a 95% CI for the wage gap is 4.11 ± 1.96 · .35 = (3.41, 4.80)
– So there is a gender gap and it’s statistically significant.
– But is this a sign of discrimination?
– Quite possibly. But why might it not be?
– Some “other factor” could be driving the relationship (experience, education).
• To establish gender bias we need to keep "everything else" constant, which means that instead of looking at E(earnings | gender), we would want to look at E(earnings | gender, education, experience, ...).
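• A minimal Stata sketch of the two-group comparison above, assuming a CPS-style dataset (as in Section 9.3) with average hourly earnings in ahe08 and a male/female indicator a_sex is loaded; ttest reports the difference in means, its standard error, t-statistic, p-value, and confidence interval:
    * difference in mean earnings between the two groups, with SE, t-stat, p-value, and CI
    ttest ahe08, by(a_sex)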
2 Estimation
• Let's keep it simple in the beginning and start with E(Y | X), for example E(TestScore | ClassSize).
• Aren’t other variables also important and perhaps driving the relationship?
• For now we'll just stick to one X (and attribute these other factors to random variation).
• Adding more X’s turns out to be pretty simple and will allow us to account for these
additional factors explicitly.
2.1 Notation
• Y is the dependent, explained, response, predicted variable or regressand.
• X is the independent, explanatory, control, predictor variable or regressor, covariate.
• We know that E (Y | X) is a function of X, but what function?
• Let’s start by assuming it’s linear.
• Suppose E (Y | X) is linear in X:
E (Y | X) = β0 + β1 X
• In words, this is saying that if we know X, the expected value of Y is a linear function
of X.
• β0 + β1 X is then called the population regression line (the relationship that holds
between Y and X on average).
• So what do β0 & β1 represent? Consider the impact on Y of a one unit change in X:
  E(Y | X = x) = β0 + β1x
  E(Y | X = x + 1) = β0 + β1(x + 1)
  E(Y | X = x + 1) − E(Y | X = x) = β0 + β1(x + 1) − β0 − β1x = β1
• So β1 is the expected change in Y associated with a one unit change in X (i.e. the slope: β1 = ΔY/ΔX).
• β0 is the intercept: the expected value of Y when X = 0.
– The intercept is simply the point at which the population regression line intersects the Y axis.
– Note that in applications where X cannot equal 0, the intercept has no real meaning. For example, if X is class size and Y is a test score, the intercept is the expected test score when class size is 0, which means nothing.
• Note E (Y | X) = β0 + β1 X doesn’t mean that the data will all lie on the same line.
• Notice that we didn’t write Yi = β0 +β1 Xi , but wrote E (Y | X) = β0 +β1 X instead.
• E (Y | X) is an expectation, the actual observations will be scattered around the
population regression line:
Yi = β0 + β1 Xi + ui
• ui represents all the other factors besides Xi that determine the value of Yi for a particular observation i (the simulation sketch below illustrates this scatter).
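• A minimal simulation sketch of this idea (the parameter values β0 = 2, β1 = 0.5 and the error spread are illustrative, not from the notes): the data are generated from an assumed population regression line, and the ui scatter the observations around it.
    clear
    set obs 100
    set seed 42
    generate x = runiform(0, 10)
    generate u = rnormal(0, 2)            // all the "other factors"
    generate y = 2 + 0.5*x + u            // Yi = b0 + b1*Xi + ui with b0 = 2, b1 = 0.5
    twoway (scatter y x) (function 2 + 0.5*x, range(0 10)), ///
        legend(order(1 "observations" 2 "population regression line"))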
2.2 Ordinary Least Squares (OLS)
• Given that we’ve assumed there’s a linear relationship between E (Y | X) and X,
how do we estimate it?
• Intuitively, we want to estimate Ê(Y | X) = β̂0 + β̂1X, where β̂0 & β̂1 are estimates of the population parameters β0 & β1 (just like X̄ is an estimate of µ).
• So how do we find β̂0 & β̂1? By minimizing the prediction error.
• Our estimates β̂0 & β̂1 will give us the predicted value of Y conditional on X (the predicted values are Ŷi = β̂0 + β̂1Xi).
• Although we expect our estimates of β0 & β1 to be correct on average, for any
particular observation i, we are likely to make a prediction error.
• The error made in predicting the ith observation is given by Yi − Ŷi = Yi − β̂0 − β̂1Xi.
• Intuitively, we would like to choose β̂0 & β̂1 to make all of these errors as small as possible. But how?
• The OLS estimator chooses the regression coefficients by minimizing the sum of the squared prediction errors:
  min over β̂0, β̂1 of Σ(Yi − Ŷi)² = Σ[Yi − (β̂0 + β̂1Xi)]²
• Another way: min Σ|Yi − Ŷi| (least absolute deviations, i.e. median/quantile regression).
• Solving the OLS minimization problem gives
  β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²   and   β̂0 = Ȳ − β̂1X̄
• So we can derive the estimating equations for β̂0 & β̂1 by minimizing the sum of squared prediction errors (recall that X̄ can be constructed in a similar way).
• Moreover, just like X̄, β̂0 & β̂1 are themselves random variables. (We'll derive their distributions in a bit; a Stata sketch that computes β̂0 & β̂1 by hand follows.)
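• A sketch computing β̂0 and β̂1 by hand from the formulas above and checking them against regress; Stata's built-in auto data are used as a stand-in for any (Y, X) pair:
    sysuse auto, clear
    quietly summarize mpg
    scalar xbar = r(mean)
    quietly summarize price
    scalar ybar = r(mean)
    generate double num = (mpg - xbar)*(price - ybar)
    generate double den = (mpg - xbar)^2
    quietly summarize num
    scalar sxy = r(sum)
    quietly summarize den
    scalar sxx = r(sum)
    scalar b1 = sxy/sxx                   // slope: sum[(Xi - Xbar)(Yi - Ybar)] / sum[(Xi - Xbar)^2]
    scalar b0 = ybar - b1*xbar            // intercept: Ybar - b1*Xbar
    display "by hand: b0 = " b0 "   b1 = " b1
    regress price mpg                     // coefficients should match the hand calculation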
2.3 The OLS Assumptions
• So why should we have faith in the OLS methodology?
• Do the OLS estimators of the model
  y = β0 + β1x + u
  have the same desirable properties that X̄ had (unbiasedness, consistency, asymptotic normality, efficiency)?
• This will allow us to construct CIs and test hypotheses just like we did for µ.
• Much of the second half of the course is focused on handling situations where these assumptions fail.
Assumption 1 & 4
• Yi = β0 + β1Xi + ui
• The conditional distribution of ui given Xi has mean 0, E(ui | Xi) = 0, called the zero conditional mean assumption.
• In a regression of wages on education, for example, people choose education levels partly based on ability (which is left in ui), so this assumption is almost certainly false.
• We will relax this to E(ui | X1, X2) = E(ui | X2), which is called conditional mean independence. In this case, we can still give the coefficient on X1 a causal interpretation, but not the coefficient on X2. The idea is that X1 is an independent (exogenous) variable as long as X2 is controlled for. We will get back to this issue in linear regression with multiple regressors.
• This means the conditional distribution is centered around the population regression
line.
Assumption 2
• (Xi, Yi), i = 1, ..., n, are i.i.d. draws from their joint distribution.
• This assumption is likely to hold in cross-sections (with random sampling), but is often violated in time series or panel data.
3 Properties of the OLS Estimators
3.1 OLS is unbiased
• We are going to show that OLS is unbiased and find its asymptotic distribution.
• Note that Ȳ = β0 + β1X̄ + ū, so Yi − Ȳ = β1(Xi − X̄) + (ui − ū).
• Using this trick and some additional algebra, we can rewrite the formula for β̂1 in a more useful way:
  β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
     = Σ(Xi − X̄)[β1(Xi − X̄) + (ui − ū)] / Σ(Xi − X̄)²
     = [β1Σ(Xi − X̄)² + Σ(Xi − X̄)(ui − ū)] / Σ(Xi − X̄)²
     = [β1Σ(Xi − X̄)² + Σ(Xi − X̄)ui − ūΣ(Xi − X̄)] / Σ(Xi − X̄)²
     = [β1Σ(Xi − X̄)² + Σ(Xi − X̄)ui] / Σ(Xi − X̄)²        (since Σ(Xi − X̄) = 0)
     = β1 + Σ(Xi − X̄)ui / Σ(Xi − X̄)²
• So
  β̂1 = β1 + Σ(Xi − X̄)ui / Σ(Xi − X̄)²
  which is a very useful result (we'll use it to derive the distribution of β̂1 later on).
• Now, to show E(β̂1) = β1 I just need to show that the expected value of the second term is zero.
• Since we have
  β̂1 = β1 + Σ(Xi − X̄)ui / Σ(Xi − X̄)²
  it follows that
  E(β̂1) = β1 + E[ Σ(Xi − X̄)ui / Σ(Xi − X̄)² ]
  (& using the LIE (E(Z) = E[E(Z | X)]) on the 2nd term)
        = β1 + E[ E( Σ(Xi − X̄)ui / Σ(Xi − X̄)² | X1, ..., Xn ) ]
  and since we know E(XY | X) = X·E(Y | X) for any Y
        = β1 + E[ Σ(Xi − X̄)E(ui | X1, ..., Xn) / Σ(Xi − X̄)² ]
  which, since E(ui | X1, ..., Xn) = E(ui | Xi) = 0 by OLS Assumptions 4 and 2 respectively,
        = β1 + 0 = β1
• Starting from the conditional variance
  Var(β̂1 | X1, ..., Xn) = Σ(Xi − X̄)² Var(ui | Xi) / [Σ(Xi − X̄)²]²,
  if we further assume assumption #5, i.e. homoskedasticity (Var(ui | Xi) = σ², a constant), we can go further:
  Var(β̂1 | X1, ..., Xn) = (1/[Σ(Xi − X̄)²]²) · Σ(Xi − X̄)²σ²
    = σ² / Σ(Xi − X̄)²
    = (σ²/n) / [ (1/n)Σ(Xi − X̄)² ]
    = σ² / (n · var(Xi))
• So, SE(β̂1) = √Var(β̂1).
• These are the formulas that Stata uses to construct the standard errors (a hand computation is sketched below).
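• A sketch of that hand computation, using the homoskedasticity-only formula SE(β̂1)² = σ̂² / Σ(Xi − X̄)² with σ̂² = Σûi²/(n − 2), and comparing it with the default regress output (built-in auto data as a stand-in):
    sysuse auto, clear
    quietly regress price mpg
    predict double uhat if e(sample), resid
    generate double uhat2 = uhat^2
    quietly summarize uhat2
    scalar sig2 = r(sum)/(r(N) - 2)       // estimated error variance
    quietly summarize mpg if e(sample)
    generate double dev2 = (mpg - r(mean))^2
    quietly summarize dev2
    scalar se_b1 = sqrt(sig2/r(sum))      // sqrt( sigma^2 / sum (Xi - Xbar)^2 )
    display "hand-computed SE(b1) = " se_b1
    regress price mpg                     // compare with the reported Std. Err. on mpg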
3.4.1 Convergence in Probability
• A sequence of random variables a1, a2, ..., an, ... converges in probability to a (an →p a) if, for every ε > 0,
  P(|an − a| > ε) → 0 as n → ∞
• Examples: Ȳ →p µY, s²Y →p σ²Y
3.4.2 Convergence in Distribution
• Let F1, F2, ..., Fn, ... be a sequence of CDFs corresponding to a sequence of random variables W1, W2, ..., Wn, ...
• Wn converges in distribution to W (Wn →d W) if the CDFs {Fn} converge to F (the CDF of W):
  Wn →d W  ⟺  lim(n→∞) Fn(t) = F(t)
• Examples:
  (Ȳ − µY) / (σY/√n) →d N(0, 1),   Ȳ ∼a N(µY, σ²Y/n)
• Since (Ȳ − µY)/(σY/√n) = √n(Ȳ − µY)/σY, the CLT can also be written as
  √n(Ȳ − µY) →d σY·N(0, 1)
  √n(Ȳ − µY) →d N(0, σ²Y)
– Assume Yi ∼ iid(µY, σ²Y), with σ²Y < ∞.
– Recall that the t-statistic based on Ȳ is
  t = (Ȳ − µY) / (sY/√n)
  and let an = σY/sY and Wn = (Ȳ − µY)/(σY/√n), so that t = an·Wn.
– Since s²Y →p σ²Y (and by using the continuous mapping theorem¹), we know that an →p 1.
• Similarly, from the earlier decomposition,
  β̂1 − β1 = [ (1/n)Σ(Xi − X̄)ui ] / [ (1/n)Σ(Xi − X̄)² ]
• Let (1/n)Σ(Xi − X̄)ui ≈ (1/n)Σ(Xi − µX)ui = (1/n)Σvi = v̄, where vi = (Xi − µX)ui.
• We know (1/n)Σ(Xi − X̄)² →p var(Xi) = σ²X (so the denominator plays the role of an, with an →p σ²X).
¹ The continuous mapping theorem states that, for any continuous function g:
  ∗ if an →p a then g(an) →p g(a), and
  ∗ if Wn →d W then g(Wn) →d g(W)
• Suppose we can prove that
  √n·v̄ →d N(0, σ²v)  ⟺  Wn →d N(0, σ²v)
• Then
  √n(β̂1 − β1) →d (1/σ²X)·N(0, σ²v),   so   β̂1 ∼a N(β1, σ²v / [n(σ²X)²])
• So how do we show that √n·v̄ →d N(0, σ²v)? By the CLT!
  (v̄ − µv)/(σv/√n) →d N(0, 1)  ⟺  √n·v̄ →d N(0, σ²v)
  (note that µv = E[(Xi − µX)ui] = 0)
• Putting the pieces together,
  √n(β̂1 − β1) = √n·v̄ / [ (1/n)Σ(Xi − X̄)² ]
  where the numerator →d N(0, σ²v) and the denominator →p σ²X.
• Or,
  √n(β̂1 − β1) →d (1/σ²X)·N(0, σ²v) = N(0, σ²v/(σ²X)²)
  A small Monte Carlo sketch below illustrates this approximate normality.
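• A small Monte Carlo sketch of this result (the design, with β0 = 1, β1 = 0.5, and n = 200, is illustrative and not from the notes): across repeated samples, β̂1 is centered at β1 and approximately normally distributed.
    clear all
    program define olssim, rclass
        drop _all
        set obs 200
        generate x = rnormal(0, 1)
        generate u = rnormal(0, 2)
        generate y = 1 + 0.5*x + u        // true beta0 = 1, beta1 = 0.5
        regress y x
        return scalar b1 = _b[x]
    end
    simulate b1 = r(b1), reps(1000) seed(12345): olssim
    summarize b1                          // mean should be close to 0.5 (unbiasedness)
    histogram b1, normal                  // roughly bell-shaped around beta1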
4 Skedasticity
4.1 Heteroskedasticity and Homoskedasticity
• Note that the ui ’s determine how the data will be scattered around the regression
line.
• But so far we've made no assumptions about Var(ui) (aside from a nonzero finite fourth moment: 0 < E(ui⁴) < ∞).
• We did assume that E(ui | Xi) = 0 but we did not assume that Var(ui | Xi) = σ²u (i.e. that the variance does not depend on the regressors).
• If it's true that Var(ui | Xi) = σ²u (a constant) then we have homoskedasticity, which is a useful property to have!
• Examples: education and income, a firm's productivity and foreign investment, etc.
• If the tendency to scatter has some pattern, we may use quantile regression.
• Second, more importantly, assumptions above plus homoskedasticity mean that OLS
is BLUE.
• However,
– If the errors are instead heteroskedastic, which is true in most practical applications, OLS is no longer BLUE.
– Even if the errors are homoskedastic, OLS is only the best in the linear sense. That is, there may be a better non-linear estimator than OLS.
• Remember that the point estimate is the same either way: only the SEs change depending on what you assume about Var(ui | Xi).
• Typically
  σ̂²β̂1(HR) > σ̂²β̂1(homoskedasticity only)
– Point estimates are the same, but tests and CI’s change
• Thus, it's much safer to always use HR standard errors unless you know that Var(ui | Xi) = σ²u.
• Suppose
1. E (Yi | Xi ) = β0 + β1 Xi
2. (Yi , Xi ) ∼ iid
4. Var(ui | Xi) = λf(Xi), where f(·) is a known function and λ is an unknown constant (so we know the form of heteroskedasticity up to the proportionality factor λ), but the errors are still uncorrelated across observations (E(ui·uj) = 0 for i ≠ j).
Define
  Ỹi = Yi/√f(Xi),   X̃0i = 1/√f(Xi),   X̃1i = Xi/√f(Xi),   and   ũi = ui/√f(Xi)
Then
  Yi = β0 + β1Xi + ui   ⟹   Ỹi = β0X̃0i + β1X̃1i + ũi
but now
  Var(ũi | Xi) = Var( ui/√f(Xi) | Xi ) = Var(ui | Xi)/f(Xi) = λ,   which is a constant.
So the transformed equation
  Ỹi = β0X̃0i + β1X̃1i + ũi
is homoskedastic and can be estimated by least squares.
• It's called weighted least squares because we calculate the coefficients by minimizing the sum of the squared residuals, weighted by 1/f(Xi).
• WLS can be extended to cases where we have to estimate the function f(·), which is called feasible WLS. (A minimal Stata sketch of WLS follows.)
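• A minimal WLS sketch in Stata for the known-f(·) case above, with the illustrative (assumed) choice f(X) = X; weighting each observation by 1/f(Xi) via analytic weights reproduces the WLS coefficients:
    sysuse auto, clear
    generate double w = 1/mpg             // weights 1/f(Xi), with the assumed f(X) = X = mpg
    regress price mpg [aweight = w]       // weighted least squares
    regress price mpg                     // unweighted OLS, for comparison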
4.3 The Variance of X and the Variance of β̂1
• The (asymptotic) variance of β̂1 is
  σ²β̂1 = (1/n) · Var((Xi − µX)ui) / [Var(Xi)]²
– Which of these two scatterplots would you rather fit a line through? The one where X varies more: a high variance of X yields a low variance of β̂1.
– Second, a low variance of u yields a low variance of β̂1. This is the rationale for adding more control variables that explain y.
5 Statistical Inference
• We are now able to construct confidence intervals and conduct hypothesis tests.
• Similarly our t-ratio or t-statistic is t = (β̂1 − β1,0) / SE(β̂1)
• Remember that the p-value is the smallest significance level at which the null hy-
pothesis could be rejected.
• Regression Analysis can also be used when X is a binary or dummy variable (i.e. can
only take on the values 0 and 1).
• Although the coefficients are calculated in exactly the same way when X is binary,
the interpretation of β1 differs.
• For example, let Yi be average hourly earnings in 2008 and Di equal 1 if the worker
is male and 0 if the worker is female.
Yi = β0 + β1 Di + ui
• For this reason, we just call β1 the coefficient on Di , instead of the slope.
• So how do we interpret β1 if it's not a slope? Let's look at what we have for each value of Di:
  When Di = 0:  Yi = β0 + β1 · 0 + ui = β0 + ui
  When Di = 1:  Yi = β0 + β1 · 1 + ui = β0 + β1 + ui
  So β1 = E(Yi | Di = 1) − E(Yi | Di = 0), the difference in the population means of the two groups (see the Stata sketch below).
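• A minimal sketch showing that regression on a binary regressor reproduces the difference in group means; Stata's built-in auto data and its dummy foreign stand in for the earnings data and the male indicator:
    sysuse auto, clear
    regress price foreign                 // coefficient = mean(price|foreign=1) - mean(price|foreign=0)
    ttest price, by(foreign)              // same difference in means (and the same t with homoskedastic SEs)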
7 Goodness of Fit
• So we’ve learned how to estimate β0 & β1 and how to test hypotheses and build CI’s
using these estimates.
• Can we measure how much better the regression does at estimating Y than just using the sample mean Ȳ?
• We need to measure how close we are getting to the data.
• Define the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (or sum of squared residuals) as
  SST = TSS = Σ(yi − ȳ)²,   SSE = ESS = Σ(ŷi − ȳ)²,   SSR = RSS = Σûi²   (sums over i = 1, ..., n)
• The R² of the regression is then R² = SSE/SST = 1 − SSR/SST, the fraction of the sample variation in Y explained by the regression.
• Note that 0 ≤ R² ≤ 1
• R² = 1 is a perfect fit (all the data points are on the regression line).
• R² = 0 means you are explaining none of the variation in Y (so your best guess for any Yi is just the sample mean Ȳ).
• It turns out that there is a close link between R² in the univariate regression model and the sample correlation coefficient rXY = sXY/(sX·sY).
• The sample correlation (rXY ) is a measure of the linear relationship between two
variables.
• In fact, R² = r²XY in the univariate case (you can prove this using the definitions of R² and rXY); the sketch at the end of this section verifies it numerically.
• This is useful to know since it gives us some idea of what a high or low R2 should
“look like”.
• A high R2 means that a lot of the total variation is explained by the regression (data
is tightly concentrated around the line).
• But, R2 does not tell you about the statistical significance of the coefficients (for this
you need SEs).
– R2 also does not prove our model is right or wrong: you can have a good model
but a low R2 because V ar(ui ) is large.
– Can also have a bad model with R2 ≈ 1
– Spurious regression: X and Y move together because of something else.
– Ex: Regress the number of supermarkets on the number of cars (or video stores).
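• A quick numerical check of R² = r²XY in the univariate case, using Stata's built-in auto data as a stand-in (any single-regressor example will do):
    sysuse auto, clear
    regress price mpg
    display "R-squared from regress:      " e(r2)
    correlate price mpg
    display "squared sample correlation:  " r(rho)^2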
8 Units of Measurement
• It is very important to know how y and x are measured in order to interpret regression
functions. Consider an equation estimated from CEOSAL1.DTA, where annual CEO
salary is in thousands of dollars and the return on equity is a percent:
  \widehat{salary} = 963.191 + 18.501 roe
  n = 209, R² = .0132
• What happens if we define roedec = roe/100 (the return on equity expressed as a decimal) and regress salary on roedec?
• Nothing should happen to the intercept: roedec = 0 is the same as roe = 0. But the slope will be multiplied by 100 (18.501 becomes 1,850.1). The goodness-of-fit should not change, and it does not.
• Now a one percentage point change in roe is the same as ∆roedec = .01, and so we
get the same effect as before.
• What if we measure salary in dollars, rather than thousands of dollars?
• Both the intercept and slope get multiplied by 1,000:
  \widehat{salarydol} = 963,191 + 18,501 roe
  n = 209, R² = .0132
  (The Stata sketch below runs through the same rescaling exercise.)
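• The rescaling exercise above, as a Stata sketch (assuming CEOSAL1.DTA, with salary in thousands of dollars and roe in percent, is in the working directory):
    use CEOSAL1.DTA, clear
    regress salary roe                    // intercept 963.191, slope 18.501
    generate roedec = roe/100             // roe as a decimal
    regress salary roedec                 // slope multiplied by 100; intercept and R^2 unchanged
    generate salarydol = salary*1000      // salary in dollars
    regress salarydol roe                 // intercept and slope multiplied by 1,000; R^2 unchanged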
9 Estimation in Stata
9.1 Effects of Education on Hourly Wage (WAGE1.DTA)
• Data are from 1991 on men only. wage is reported in dollars per hour, educ is highest
grade completed.
• reg wage educ
• Negative intercept. Each additional year of schooling is estimated to be worth $0.54.
• Plugging in educ = 0 gives the silly prediction \widehat{wage} = −.90. Extrapolating outside the range of the data can produce strange predictions.
• When educ = 12, the predicted hourly wage is $5.59, which we can think of as our
estimate of the average wage in the population when educ = 12.
• margins, at(educ=12)
• We are explaining about 16% of the variation in wage with our regression. In other
words, 84% of variation in wage remains unexplained.
• predict wagehat
• predict uhat, resid
• Some residuals are positive, others are negative. None is especially close to zero. Years of schooling, by itself, need not be a very good predictor of wage. (The commands above are collected in the do-file sketch below.)
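• The commands above, collected as a short do-file sketch (assuming WAGE1.DTA is in the working directory):
    use WAGE1.DTA, clear
    regress wage educ
    margins, at(educ=12)                  // predicted wage at 12 years of schooling
    predict wagehat                       // fitted values
    predict uhat, resid                   // residuals
    summarize wage wagehat uhat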
9.2 Test score and student ratio
• So what is the expected impact on test scores of a two student increase in class size? −2.28 × 2 = −4.56
• What is the expected test score in a district with 20 students per teacher? How about
30 students? 0 students?
• Of course, we can use them to test hypotheses as we did before. For example, suppose you want to test
  H0: β1 = 0  vs.  HA: β1 ≠ 0
  t-stat = (β̂1 − 0)/SE(β̂1) = −2.28/.52 = −4.39  ⇒  p-value = 2·Φ(−4.39) ≈ 0 (so we reject the null).
  Alternatively, a 95% CI for β1 is simply β̂1 ± 1.96·SE(β̂1) = −2.28 ± 1.02 = (−3.30, −1.26) (same conclusion).
• We can see that (−.226)² = .051, which is the R² from the regression! (Here −.226 is the sample correlation between test scores and the student-teacher ratio.)
• What do you expect the test score of Orange County to be, compared to other counties? Run the regression and interpret it. (A sketch of the commands for this example follows.)
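• A sketch of the commands for this example, assuming the California school-district data are loaded with the test score in testscr and the student-teacher ratio in str (hypothetical variable names):
    regress testscr str, robust
    display _b[str]*2                     // expected change from a 2-student increase in class size
    margins, at(str=(20 30))              // predicted test scores at 20 and 30 students per teacher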
9.3 CPS data
• Stata commands: reg ahe08 i.a_sex if year==2008, robust; margins a_sex
• We can test the hypothesis H0: β1 = 0 vs. HA: β1 ≠ 0 by calculating the t-statistic
  t_act = (β̂1 − 0)/SE(β̂1) = −4.10/.353 = −11.59
• We can reject the null hypothesis at any positive level of significance (just as before).