Simple Regression Model
Juergen Meinecke
Roadmap
Selected Topics
Measures of Fit
There are two regression statistics that provide measures of how well
the regression line “fits” the data:
• regression 𝑅2 , and
• standard error of the regression (SER)
Main idea: how closely does the scatterplot “fit” around the
regression line?
Graphical illustration of “fit” of the regression line
The regression 𝑅2 is the fraction of the sample variation of 𝑌𝑖 that is
explained by the explanatory variable 𝑋𝑖
Total variation in the dependent variable can be broken down as
• total sum of squares (TSS)
  $TSS := \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
• explained sum of squares (ESS)
  $ESS := \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
• residual sum of squares (RSS)
  $RSS := \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
It follows that 𝑇𝑆𝑆 = 𝐸𝑆𝑆 + 𝑅𝑆𝑆
Definition
$R^2$ is defined by
$$R^2 := \frac{ESS}{TSS}.$$
Corollary
Based on the preceding terminology, it is easy to see that
$$R^2 = 1 - \frac{RSS}{TSS}$$
Therefore,
• 𝑅2 = 0 means 𝐸𝑆𝑆 = 0 (the regressor X explains none of the
variation of the dependent variable Y)
• 𝑅2 = 1 means 𝐸𝑆𝑆 = 𝑇𝑆𝑆
(the regressor X explains all the variation of the dependent
variable Y)
• 0 ≤ 𝑅2 ≤ 1
• For a regression with a single regressor 𝑋, 𝑅2 is the square of
the sample correlation coefficient between X and Y
• Python routinely calculates and reports 𝑅2 when it runs
regressions
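
To see these pieces concretely, here is a minimal sketch (not lecture code) that computes TSS, ESS, RSS and 𝑅2 by hand and compares them with the 𝑅2 reported by statsmodels; it assumes the caschool.csv data (columns testscr and str) used later in these slides.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('caschool.csv')               # assumption: columns testscr, str
reg = smf.ols('testscr ~ str', data=df, missing='drop').fit()

y = reg.model.endog                            # the Y_i actually used in the fit
y_hat = reg.fittedvalues.to_numpy()            # the fitted values

tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)          # explained sum of squares
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares

print(ess / tss, 1 - rss / tss)                # two equivalent ways to get R^2
print(reg.rsquared)                            # should coincide with the above
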
In contrast, the standard error of the regression measures the spread
of the distribution of the errors
Because you don’t observe the errors 𝑢𝑖 , you use the residuals 𝑢̂ 𝑖
instead
It is defined as the estimator of the standard deviation of 𝑢𝑖 :
$$SER := \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (\hat{u}_i - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2} = \sqrt{\frac{RSS}{n-2}}$$
The second equality holds because $\bar{\hat{u}} := \frac{1}{n} \sum_{i=1}^{n} \hat{u}_i = 0$
The SER
• has the units of u, which are the units of Y
• measures the spread of the OLS residuals around the estimated
PRF
Technical note: why divide by 𝑛 − 2 instead of 𝑛 − 1?
• Division by 𝑛 − 2 is a “degrees of freedom” correction – just like
division by 𝑛 − 1 in 𝑠2𝑌 , except that for the SER, two parameters
have been estimated (𝛽0 and 𝛽1 ), whereas in 𝑠2𝑌 only one has
been estimated (𝜇𝑌 )
• When sample size 𝑛 is large, it doesn’t really matter whether 𝑛
or 𝑛 − 1 or 𝑛 − 2 is being used
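
Here is the matching sketch (again not lecture code, same caschool.csv assumption): the SER is just the square root of RSS/(n − 2) computed from the OLS residuals.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('caschool.csv')
reg = smf.ols('testscr ~ str', data=df, missing='drop').fit()

u_hat = reg.resid.to_numpy()                   # residuals
n = len(u_hat)

ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))    # RSS divided by n - 2, then sqrt
print(ser)
print(np.sqrt(reg.mse_resid))                  # statsmodels' RSS/(n-2); same number
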
Roadmap
Selected Topics
Binary Regressor
Quite often an explanatory variable is binary
• 𝑋𝑖 = 1 if small class size (else zero)
• 𝑋𝑖 = 1 if identify as female (else zero)
• 𝑋𝑖 = 1 if smokes (else zero)
Binary regressors are called dummy variables
So far, we have looked at 𝛽1 as a slope
But does this make sense when 𝑋𝑖 is binary?
How should we interpret 𝛽1 and its estimator 𝛽̂1 ?
The linear model 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖 reduces to
• 𝑌𝑖 = 𝛽0 + 𝑢𝑖 when 𝑋𝑖 = 0
• 𝑌𝑖 = 𝛽0 + 𝛽1 + 𝑢𝑖 when 𝑋𝑖 = 1
Analogously, the population regression functions are
• 𝐸[𝑌𝑖 |𝑋𝑖 = 0] = 𝛽0
• 𝐸[𝑌𝑖 |𝑋𝑖 = 1] = 𝛽0 + 𝛽1
It therefore follows that
𝛽1 = 𝐸[𝑌𝑖 |𝑋𝑖 = 1] − 𝐸[𝑌𝑖 |𝑋𝑖 = 0]
In words: the coefficient 𝛽1 captures the difference in group means
Do moms who smoke have babies with lower birth weight?
Python Code
> import pandas as pd
> df = pd.read_csv('birthweight.csv')
> smokers = df[df.smoker == 1]
> nonsmokers = df[df.smoker == 0]
> t_test(smokers.birthweight, nonsmokers.birthweight)
Two-sample t-test
Mean in group 1: 3178.831615120275
Mean in group 2: 3432.0599669148055
Point estimate for difference in means: -253.22835179453068
Test statistic: -9.441398919580234
95% confidence interval: (-305.7976345612996, -200.65906902776175)
Regression with smoker dummy gives exact same numbers
Python Code (output edited)
> import statsmodels.formula.api as smf
> formula = 'birthweight ~ smoker'
> model1 = smf.ols(formula, data=df, missing='drop')
> reg1 = model1.fit(use_t=False)
> print(reg1.summary())
OLS Regression Results
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3432.0600 11.871 289.115 0.000 3408.793 3455.327
smoker -253.2284 26.951 -9.396 0.000 -306.052 -200.404
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
• 𝛽̂0 equal to average birthweight in sub-sample 𝑋𝑖 = 0
• 𝛽̂1 equal to difference in average birthweights between groups
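
A quick way to convince yourself of this is to compare the group means with the OLS coefficients directly. A rough sketch (not lecture code), assuming the same birthweight.csv with columns birthweight and smoker as above:

Python Code (illustrative sketch, not part of the lecture)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('birthweight.csv')            # assumption: columns birthweight, smoker (0/1)
group_means = df.groupby('smoker')['birthweight'].mean()

reg = smf.ols('birthweight ~ smoker', data=df, missing='drop').fit()
beta0_hat = reg.params['Intercept']            # should equal the nonsmoker mean
beta1_hat = reg.params['smoker']               # should equal the difference in means

print(group_means.loc[0], beta0_hat)
print(group_means.loc[1] - group_means.loc[0], beta1_hat)
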
Roadmap
Selected Topics
Gauss-Markov Theorem
OLS estimator is not the only estimator of the PRF
You can nominate anything you want as your estimator
Similar to lecture 2, here are some alternative estimators:
• $\operatorname*{argmin}_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^p$,
  where $p$ is any natural number
• $\operatorname*{argmin}_{b_0, b_1} \sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|$,
  this is called the least absolute deviations estimator (see the sketch after this list)
• the number 42
  (the ‘answer to everything estimator’)
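
Just to show that such alternatives are perfectly computable, here is a rough sketch (not lecture code) of the least absolute deviations estimator on simulated data; scipy.optimize.minimize is one way to minimize the criterion.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)         # true beta0 = 1, beta1 = 2

def sum_abs_dev(b):
    b0, b1 = b
    return np.sum(np.abs(y - b0 - b1 * x))     # the LAD criterion

lad = minimize(sum_abs_dev, x0=[0.0, 0.0], method='Nelder-Mead')
print(lad.x)                                   # LAD estimates of (beta0, beta1)
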
Clearly, these are all estimators
(they satisfy the definition given earlier)
Are they sensible estimators?
Clearly, the last one is silly
The point is: there always exists an endless number of possible
estimators for any given estimation problem
Most of them do not make any sense
What then constitutes a good estimator?
Let’s determine ‘goodness’ of an estimator by two properties:
1. bias
2. variance
Let’s briefly look at these again
Definition
An estimator 𝜃̂ for an unobserved population parameter 𝜃 is
unbiased if its expected value is equal to 𝜃, that is
E[𝜃̂] = 𝜃
Definition
An estimator 𝜃̂ for an unobserved population parameter 𝜃 has
minimum variance if its variance is (weakly) smaller than the
variance of any other estimator of 𝜃. Sometimes we will also say
that the estimator is efficient.
Let’s see if the OLS estimator satisfies these two properties
But first we need to take a brief detour:
Definition
An estimator 𝜃̂ is linear in 𝑌𝑖 if it can be written as
$$\hat{\theta} = \sum_{i=1}^{n} a_i Y_i,$$
where the weights 𝑎𝑖 are functions of 𝑋𝑖 but not of 𝑌𝑖 .
It is easy to see that the OLS estimator is a linear estimator
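
To see the linearity concretely: for the OLS slope the weights are $a_i = (X_i - \bar{X}) / \sum_{j=1}^{n} (X_j - \bar{X})^2$, which depend on the $X_i$ only. A small numerical check on simulated data (a sketch, not lecture code):

Python Code (illustrative sketch, not part of the lecture)
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 + 1.5 * x + rng.normal(size=100)

a = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # weights: functions of X only
beta1_linear = np.sum(a * y)                        # linear-in-Y representation
beta1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(beta1_linear, beta1_ols)                      # the two should agree
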
Definition
A Best Linear Unbiased Estimator (BLUE) is an estimator that is
linear, unbiased, and has minimal variance (efficient).
If an estimator is BLUE, you can’t beat it: it’s the optimum
When we did univariate statistics (we only looked at one random
variable 𝑌𝑖 ) we discovered that the sample average was indeed BLUE
Currently we are doing bivariate statistics (we study the joint
distribution between 𝑌𝑖 and 𝑋𝑖 )
Our estimator of choice is the OLS estimator
Now, similarly to the sample average in the univariate world,
a powerful result holds for the OLS estimator…
Theorem
Under OLS Assumptions 1 through 4a, the OLS estimator
$$\hat{\beta}_0, \hat{\beta}_1 := \operatorname*{argmin}_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
is BLUE.
The Gauss-Markov theorem provides a theoretical justification for
using OLS
This theorem holds only for the subset of estimators that are linear
in 𝑌𝑖
There may be nonlinear estimators that are better
Roadmap
Selected Topics
Homoskedasticity versus Heteroskedasticity
We introduced the idea of homoskedasticity last week
We learned about it in OLS Assumption 4a
Homoskedasticity concerns the variance of the error terms 𝑢𝑖
Mathematically, the error terms are homoskedastic when
$$\mathrm{Var}(u_i \,|\, X_i) = \sigma_u^2$$
The essence of this equation is that the variance of 𝑢𝑖 is not a
function of 𝑋𝑖 ; instead, the variance is just a constant 𝜎2𝑢 whatever
the value of 𝑋𝑖
Example of homoskedasticity
Scatterplot is distributed evenly around PRF
Variance of error term is constant; does not vary with 𝑋𝑖
But why would we want to assume this?
It seems a bit arbitrary to make an assumption about the variance of
the unobserved error term
After all, the error term is unobserved; so why would we make
assumptions on the variance of it?
Well, the reason I gave during lecture 5 was that homoskedasticity
makes the derivation of the asymptotic distribution a little bit easier
The results just look a little bit cleaner
But homoskedasticity is not a necessary assumption
If the error terms are not homoskedastic, what are they?
If they are not homoskedastic, they are called heteroskedastic
How should we think about them?
The next three pictures illustrate…
Example of heteroskedasticity
Scatterplot gets wider as 𝑋 increases
Variance of error term increases in 𝑋
Example of heteroskedasticity
Scatterplot gets narrower as 𝑋 increases
Variance of error term decreases in 𝑋
Example of heteroskedasticity
Scatterplot gets narrower at first but then gets wider again
Variance of error term decreases in 𝑋 at first, then increases again
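
If you want to generate scatterplots like the ones just described, here is a small simulation sketch (not lecture code); the functional form of the variance is made up purely for illustration.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, size=n)

u_homo = rng.normal(scale=2.0, size=n)         # Var(u|X) constant
u_hetero = rng.normal(scale=0.5 * x)           # Var(u|X) grows with X (an assumption)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, 5 + 1.0 * x + u_homo, s=5)
axes[0].set_title('homoskedastic')
axes[1].scatter(x, 5 + 1.0 * x + u_hetero, s=5)
axes[1].set_title('heteroskedastic: spread grows in X')
plt.show()
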
What do these three pictures have in common?
The variance of 𝑌𝑖 itself varies in 𝑋𝑖
The following assumption clarifies what we mean by
heteroskedasticity
Assumption (OLS Assumption 4b)
The error terms 𝑢𝑖 are heteroskedastic if their variance has the
following form:
$$\mathrm{Var}(u_i \,|\, X_i) = \sigma_u^2(X_i),$$
that is, the variance is a function in 𝑋𝑖 .
Corollary
If the error terms 𝑢𝑖 are not homoskedastic, they are
heteroskedastic.
How do the OLS standard errors from last week change if the error
terms are heteroskedastic instead of homoskedastic?
Recall the asymptotic distribution of the OLS estimator 𝛽̂1
$$\hat{\beta}_1 \overset{approx.}{\sim} N\!\left(\beta_1, \ \frac{1}{n}\frac{\sigma_u^2}{\sigma_X^2}\right)$$
This result only holds under OLS Assumptions 1 through 4a
In particular, it only holds under homoskedasticity (Assumption 4a)
If the error terms are heteroskedastic instead, we have to adjust the
asymptotic variance
This is tedious, but let’s do it!
Recall from lecture 5 how the asymptotic variance collapses to
something nice and simple under homoskedasticity:
$$
\begin{aligned}
\mathrm{Var}(\hat{\beta}_1 \,|\, X_i) = \cdots
&= \frac{1}{\left( \sum_{i=1}^{n} (X_i - \bar{X})^2 \right)^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \, \mathrm{Var}(u_i \,|\, X_i) \\
&= \frac{1}{\left( \sum_{i=1}^{n} (X_i - \bar{X})^2 \right)^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sigma_u^2 \\
&= \frac{\sigma_u^2}{\left( \sum_{i=1}^{n} (X_i - \bar{X})^2 \right)^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \\
&\simeq \frac{\sigma_u^2}{(n \sigma_X^2)^2} \, n \sigma_X^2 \\
&= \frac{1}{n} \frac{\sigma_u^2}{\sigma_X^2},
\end{aligned}
$$
where we plugged in $\sum_{i=1}^{n} (X_i - \bar{X})^2 \simeq n \sigma_X^2$ and $\mathrm{Var}(u_i \,|\, X_i) = \sigma_u^2$
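
A quick Monte Carlo sketch (purely illustrative, not lecture code) that checks this homoskedastic variance formula on simulated data:

Python Code (illustrative sketch, not part of the lecture)
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5000
sigma_u, sigma_X = 2.0, 1.5
beta1_hats = np.empty(reps)

for r in range(reps):
    x = rng.normal(scale=sigma_X, size=n)
    u = rng.normal(scale=sigma_u, size=n)      # variance does not depend on x
    y = 1.0 + 0.5 * x + u
    beta1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(beta1_hats.var())                        # simulated variance of beta1 hat
print(sigma_u ** 2 / (n * sigma_X ** 2))       # (1/n) * sigma_u^2 / sigma_X^2
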
In contrast, under heteroskedasticity, we make our lives a bit easier
by imposing an asymptotic approximation at a much earlier stage:
$$
\begin{aligned}
\mathrm{Var}(\hat{\beta}_1 \,|\, X_i) = \cdots
&= \frac{1}{\left( \sum_{i=1}^{n} (X_i - \bar{X})^2 \right)^2} \sum_{i=1}^{n} \mathrm{Var}\big( (X_i - \bar{X}) u_i \,|\, X_i \big) \\
&\simeq \frac{1}{(n \sigma_X^2)^2} \, n \, \mathrm{Var}\big( (X_i - \mu_X) u_i \big) \\
&= \frac{1}{n} \frac{\mathrm{Var}\big( (X_i - \mu_X) u_i \big)}{\sigma_X^4}
\end{aligned}
$$
(Note: the use of the conditional variance and the subsequent
approximation are a bit dubious; the actual math is a bit more
complicated and I am taking shortcuts here to make things easy)
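
The same kind of Monte Carlo check works for the heteroskedastic formula; again a purely illustrative sketch with made-up data, where Var((X − μ_X)u) and σ²_X are approximated from one large simulated sample.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 5000
beta1_hats = np.empty(reps)

# approximate the population quantities Var((X - mu_X) u) and sigma_X^2 numerically
x_big = rng.uniform(0, 10, size=1_000_000)
u_big = rng.normal(scale=0.5 * x_big)          # Var(u|X) grows with X
var_xu = np.var((x_big - x_big.mean()) * u_big)
var_x = np.var(x_big)

for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(scale=0.5 * x)
    y = 1.0 + 0.5 * x + u
    beta1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(beta1_hats.var())                        # simulated variance of beta1 hat
print(var_xu / (n * var_x ** 2))               # (1/n) * Var((X - mu_X) u) / sigma_X^4
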
Putting things together and invoking the CLT once more
Theorem
The asymptotic distribution of the OLS estimator 𝛽̂1 under
OLS Assumptions 1 through 4b is
$$\hat{\beta}_1 \overset{approx.}{\sim} N\!\left(\beta_1, \ \frac{1}{n}\frac{\mathrm{Var}\big((X_i - \mu_X)\, u_i\big)}{\sigma_X^4}\right)$$
A similar theorem holds for 𝛽̂0 ; it just looks a little bit uglier
The previous theorem is the basis for deriving confidence intervals
for 𝛽1 under heteroskedasticity
With our knowledge from the previous weeks, it is easy to propose a
95% confidence interval
$$CI(\beta_1) := \left[ \hat{\beta}_1 - 1.96 \cdot \frac{\sqrt{\mathrm{Var}\big((X_i - \mu_X)\, u_i\big)}}{\sqrt{n}\, \sigma_X^2}, \ \ \hat{\beta}_1 + 1.96 \cdot \frac{\sqrt{\mathrm{Var}\big((X_i - \mu_X)\, u_i\big)}}{\sqrt{n}\, \sigma_X^2} \right]$$
Only problem: we do not know $\mathrm{Var}\big((X_i - \mu_X)\, u_i\big)$ and $\sigma_X$
But we can estimate them easily instead:
• $\mathrm{Var}\big((X_i - \mu_X)\, u_i\big)$ is estimated by
  $$s_{ux}^2 := \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \, \hat{u}_i^2$$
• 𝜎𝑋 is estimated by 𝑠𝑋
(Do you remember the definition of 𝑢̂ 𝑖 and 𝑠𝑋 ?)
An operational version of the confidence interval therefore is given
by
$$CI(\beta_1) := \left[ \hat{\beta}_1 - 1.96 \cdot \frac{s_{ux}}{\sqrt{n}\, s_X^2}, \ \ \hat{\beta}_1 + 1.96 \cdot \frac{s_{ux}}{\sqrt{n}\, s_X^2} \right]$$
The ratio $s_{ux}/(\sqrt{n}\, s_X^2)$ is, of course, the standard error under
heteroskedasticity
The standard error will differ under homoskedasticity and
heteroskedasticity
The standard error under heteroskedasticity has the term 𝑠𝑢𝑥 in the
numerator which makes it seem a little bit more complicated to
calculate
But it is actually less complicated than it looks
In practice, Python computes this for you anyway
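
For the curious, here is a rough sketch (not lecture code) of that computation on the caschool data used below; with the 1/n version of $s_X^2$ (a simplification I am assuming here) it reproduces the HC0 flavor of the robust standard error, which is close to the HC1 number Python reports.

Python Code (illustrative sketch, not part of the lecture)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('caschool.csv')               # assumption: columns testscr, str
reg = smf.ols('testscr ~ str', data=df, missing='drop').fit()

x = reg.model.exog[:, 1]                       # the regressor str (column 0 is the intercept)
u_hat = reg.resid.to_numpy()
n = len(x)

s2_ux = np.mean((x - x.mean()) ** 2 * u_hat ** 2)   # estimates Var((X - mu_X) u)
s2_X = np.var(x)                                    # 1/n version of s_X^2

se_manual = np.sqrt(s2_ux) / (np.sqrt(n) * s2_X)    # s_ux / (sqrt(n) * s_X^2)
print(se_manual)
print(reg.HC0_se)                                   # robust standard errors from statsmodels
print(reg.HC1_se)
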
Default in Python is homoskedasticity
Python Code (output edited)
> import pandas as pd
> import statsmodels.formula.api as smf
> df = pd.read_csv('caschool.csv')
> formula = 'testscr ~ str'
> model1 = smf.ols(formula, data=df, missing='drop')
> reg1 = model1.fit(use_t=False)
> print(reg1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: testscr R-squared: 0.051
Model: OLS Adj. R-squared: 0.049
Method: Least Squares F-statistic: 22.58
No. Observations: 420
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 698.9330 9.467 73.825 0.000 680.377 717.489
str -2.2798 0.480 -4.751 0.000 -3.220 -1.339
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
New way to do things:
Python Code (output edited)
> reg1_heterosk = model1.fit(cov_type='HC1', use_t=False)
> print(reg1_heterosk.summary())
OLS Regression Results
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 698.9330 10.364 67.436 0.000 678.619 719.247
str -2.2798 0.519 -4.389 0.000 -3.298 -1.262
==============================================================================
Notes: [1] Standard Errors are heteroscedasticity robust (HC1)
Using the option cov_type='HC1' inside ols.fit()
is Python’s way of adjusting for heteroskedasticity
This is called the heteroskedasticity robust option
(Aside: cov_type='HC1' makes the same standard error
adjustment as Stata’s robust)
Homoskedastic standard errors are only correct if OLS
Assumption 4a is satisfied
Heteroskedastic standard errors are correct under both OLS
Assumption 4a and Assumption 4b
Practical implication
• If you know for sure that the error terms are homoskedastic, you
should simply use Python’s ols.fit()
• If you know for sure that the error terms are heteroskedastic,
you should use Python’s ols.fit(cov_type='HC1')
• If you do not know for sure, it is always safer to use
heteroskedasticity robust standard errors