
MS&E 226: “Small” Data

Lecture 5: Bias and variance (v3)

Ramesh Johari
[email protected]

Fall 2015

1 / 45
The road ahead

Thus far we have seen how we can select and evaluate predictive
models using the train-validate-test methodology. This approach
works well if we have “enough” data.
What if we don’t have enough data to blindly train and validate
models? We have to understand the behavior of prediction error
well enough to intelligently explore the space of models.

2 / 45
The road ahead

Starting with this lecture:


- We characterize exactly how prediction error behaves through the ideas of bias and variance.
- We develop measures of model complexity that we can use to help us effectively search for “good” models.
- We develop methods of evaluating models using limited data.
A word of caution: All else being equal, more data leads to more
robust model selection and evaluation! So these techniques are not
“magic bullets”.

3 / 45
Conditional expectation

4 / 45
Conditional expectation

Given the population model for X⃗ and Y, suppose we are allowed to choose any predictive model f̂ we want. What is the best one?

minimize E[(Y − f̂(X⃗))²].

(Here the expectation is over (X⃗, Y).)

Theorem
The predictive model that minimizes squared error is f̂(X⃗) = E[Y|X⃗].
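As a quick numerical illustration (my own, not from the lecture), here is a small R simulation comparing the average squared error of the conditional mean against two other predictors, for one simple joint distribution; the distribution and the candidate predictors below are arbitrary choices:

set.seed(1)
x <- runif(1e5, -2, 2)
y <- x^2 + rnorm(1e5)                        # here E[Y | X = x] = x^2
predictors <- list(
  cond_mean = function(x) x^2,               # the conditional expectation
  linear    = function(x) 1 + x,             # some other predictive model
  constant  = function(x) rep(mean(y), length(x))
)
sapply(predictors, function(g) mean((y - g(x))^2))   # cond_mean attains the smallest value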

5 / 45
Conditional expectation
Proof:

E[(Y − f̂(X⃗))²]
= E[(Y − E[Y|X⃗] + E[Y|X⃗] − f̂(X⃗))²]
= E[(Y − E[Y|X⃗])²] + E[(E[Y|X⃗] − f̂(X⃗))²]
+ 2E[(Y − E[Y|X⃗])(E[Y|X⃗] − f̂(X⃗))].

The first two terms are nonnegative; the first does not depend on f̂, and the second is minimized by choosing f̂(X⃗) = E[Y|X⃗].
For the third term, using the tower property of conditional expectation:

E[(Y − E[Y|X⃗])(E[Y|X⃗] − f̂(X⃗))]
= E[ E[Y − E[Y|X⃗] | X⃗] (E[Y|X⃗] − f̂(X⃗)) ] = 0.

So the squared-error-minimizing solution is to choose f̂(X⃗) = E[Y|X⃗].
6 / 45
Conditional expectation

Why don’t we just choose E[Y|X⃗] as our predictive model?
Because we don’t know the distribution of (X⃗, Y)!
Nevertheless, the preceding result is a useful guide:
- It provides the benchmark that every squared-error-minimizing predictive model is striving for: approximate the conditional expectation.
- It provides intuition for why linear regression approximates the conditional mean.

7 / 45
Population model

For the rest of the lecture write:


Y = f(X⃗) + ε,

where E[ε|X⃗] = 0. In other words, f(X⃗) = E[Y|X⃗].
We make the assumption that Var(ε|X⃗) = σ² (i.e., it does not depend on X⃗).
We will make additional assumptions about the population model
as we go along.

8 / 45
Prediction error revisited

9 / 45
A note on prediction error and conditioning

When you see “prediction error”, it typically means:

E[(Y − f̂(X⃗))² | (∗)],

where (∗) can be one of many things:

- X, Y, X⃗;
- X, Y;
- X, X⃗;
- X;
- nothing.

As long as we don’t condition on both X and Y, the model is random!

10 / 45
Models and conditional expectation
So now suppose we have data X, Y, and use it to build a model f̂.
What is the prediction error if we see a new X⃗?

E[(Y − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²
= σ² + (f̂(X⃗) − f(X⃗))².

I.e.: When minimizing mean squared error, “good” models should behave like conditional expectation.¹
Our goal: understand the second term.

¹ This is just another way of deriving that the prediction-error-minimizing solution is the conditional expectation.
11 / 45
Models and conditional expectation [∗]

Proof of preceding statement:


The proof is essentially identical to the earlier proof for conditional
expectation:

E[(Y − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗) + f(X⃗) − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²
+ 2E[Y − f(X⃗) | X⃗] (f(X⃗) − f̂(X⃗))
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²,

because E[Y − f(X⃗) | X⃗] = E[Y − E[Y|X⃗] | X⃗] = 0.

12 / 45
A thought experiment

Our goal is to understand:

(f̂(X⃗) − f(X⃗))². (∗∗)

Here is one way we might think about the quality of our modeling approach:
- Fix the design matrix X.
- Generate data Y many times (parallel “universes”).
- In each universe, create an f̂.
- In each universe, evaluate (∗∗).

13 / 45
A thought experiment

Our goal is to understand:

(f̂(X⃗) − f(X⃗))². (∗∗)

What kinds of things can we evaluate?

- If f̂(X⃗) is “close” to the conditional expectation, then on average in our universes, it should look like f(X⃗).
- f̂(X⃗) might be close to f(X⃗) on average, but still vary wildly across our universes.

The first is bias. The second is variance.

14 / 45
Example

Let’s carry out some simulations with a synthetic model.


Population model:
- We generate X1, X2 as i.i.d. N(0, 1) random variables.
- Given X1, X2, the distribution of Y is given by:

Y = 1 + X1 + 2X2 + ε,

where ε is an independent N(0, 5) random variable.

Thus f(X1, X2) = 1 + X1 + 2X2.
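A minimal R sketch of drawing one dataset from this population model (the variable names, and reading N(0, 5) as a standard deviation of 5, are my assumptions rather than details taken from the lecture code):

set.seed(1)
n  <- 1000
X1 <- rnorm(n)                     # i.i.d. N(0, 1)
X2 <- rnorm(n)
f  <- 1 + X1 + 2 * X2              # the conditional expectation E[Y | X1, X2]
Y  <- f + rnorm(n, sd = 5)         # add independent noise ε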

15 / 45
Example: Parallel universes

Generate a design matrix X by sampling 1000 i.i.d. values of (X1, X2).
Now we run m = 500 simulations. These are our “universes.” In each simulation, generate data Y according to:

Yi = f(Xi) + εi,

where the εi are i.i.d. N(0, 5) random variables.

In each simulation, what changes are the specific values of the εi. This is what it means to condition on X.

16 / 45
Example: OLS in parallel universes

In each simulation, given the design matrix X and Y, we build a fitted model f̂ using ordinary least squares.
Finally, let X̃ denote a fixed matrix of 1000 i.i.d. test values of (X1, X2). (We use the same test data X̃ across all the universes.)
In each universe we evaluate:

MSE = (1/1000) Σ_{i=1}^{1000} (f̂(X̃i1, X̃i2) − f(X̃i1, X̃i2))².

(This is an estimate of E[(f̂(X⃗) − f(X⃗))² | X, Y] in the given universe.)
We then create a density plot of these mean squared errors across the 500 universes.
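A hedged R sketch of the whole experiment for the correctly specified model Y ~ 1 + X1 + X2 (the helper names, the use of lm()/predict(), and sd = 5 for the noise are my own choices; the lecture’s actual code is not shown here):

set.seed(1)
n <- 1000; m <- 500; sigma <- 5
X_train <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed design matrix X
X_test  <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed test points X̃
f <- function(d) 1 + d$X1 + 2 * d$X2                  # true conditional expectation

mse <- replicate(m, {
  train <- X_train
  train$Y <- f(X_train) + rnorm(n, sd = sigma)        # fresh noise: a new “universe”
  fit <- lm(Y ~ 1 + X1 + X2, data = train)            # OLS fit in this universe
  mean((predict(fit, newdata = X_test) - f(X_test))^2)
})

plot(density(mse), xlab = "MSE", main = "MSE across 500 universes")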

17 / 45
Example: Results
We go through the process with three models: A, B, C.
The three we try are:
- Y ~ 1 + X1.
- Y ~ 1 + X1 + X2.
- Y ~ 1 + X1 + X2 + ... + I(X1^4) + I(X2^4).
Which is which?

18 / 45
Example: Results
Results:

[Figure: density plot of the MSE values across the 500 universes, one curve per model; x-axis: MSE (0 to 4), y-axis: density.]

19 / 45
A thought experiment: Aside

This is our first example of a frequentist approach:


- The population model is fixed.
- The data is random.
- We reason about a particular modeling procedure by considering what happens if we carry out the same procedure over and over again (in this case, fitting a model from data).

20 / 45
Examples: Bias and variance

Suppose you are predicting, e.g., wealth based on a collection of demographic covariates.

- Suppose we make a constant prediction: f̂(Xi) = c for all i. Is this biased? Does it have low variance?
- Suppose that every time you get your data, you use enough parameters to fit Y exactly: f̂(Xi) = Yi for all i. Is this biased? Does it have low variance?

21 / 45
The bias-variance decomposition

We can be more precise about our discussion.

E[(Y − f̂(X⃗))² | X, X⃗]
= σ² + E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= σ² + (f(X⃗) − E[f̂(X⃗) | X, X⃗])²
+ E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗])² | X, X⃗ ].

The first term is irreducible error.
The second term is BIAS².
The third term is VARIANCE.
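These three pieces can be estimated by simulation at a single test point X⃗, reusing the synthetic example from earlier. A sketch in R (the point x0 and all names below are my own choices):

set.seed(2)
n <- 1000; m <- 500; sigma <- 5
X_train <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed design matrix
x0 <- data.frame(X1 = 0.5, X2 = -1)                   # a fixed test point X⃗
f  <- function(d) 1 + d$X1 + 2 * d$X2

fhat_x0 <- replicate(m, {                             # f̂(x0) in each universe
  train <- X_train
  train$Y <- f(X_train) + rnorm(n, sd = sigma)
  predict(lm(Y ~ X1 + X2, data = train), newdata = x0)
})

bias2    <- (f(x0) - mean(fhat_x0))^2                 # squared bias at x0
variance <- var(fhat_x0)                              # variance of f̂(x0) across universes
c(irreducible = sigma^2, bias2 = bias2, variance = variance)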

22 / 45
The bias-variance decomposition

The bias-variance decomposition measures how sensitive prediction error is to changes in the training data (in this case, Y).

- If there are systematic errors in prediction made regardless of the training data, then there is high bias.
- If the fitted model is very sensitive to the choice of training data, then there is high variance.

23 / 45
The bias-variance decomposition: Proof [∗]
Proof: We already showed that:

E[(Y − f̂(X⃗))² | X, Y, X⃗] = σ² + (f̂(X⃗) − f(X⃗))².

Take expectations over Y:

E[(Y − f̂(X⃗))² | X, X⃗] = σ² + E[(f̂(X⃗) − f(X⃗))² | X, X⃗].

Add and subtract E[f̂(X⃗) | X, X⃗] in the second term:

E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗] + E[f̂(X⃗) | X, X⃗] − f(X⃗))² | X, X⃗ ].

24 / 45
The bias-variance decomposition: Proof [∗]

Proof (continued): Expand the square:

E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗])² | X, X⃗ ]
+ (E[f̂(X⃗) | X, X⃗] − f(X⃗))²
+ 2E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗]) (E[f̂(X⃗) | X, X⃗] − f(X⃗)) | X, X⃗ ].

(The conditional expectation drops out of the middle term, because given X and X⃗ it is no longer random.)

The first term is the VARIANCE. The second term is the BIAS².
We have to show the third term is zero.

25 / 45
The bias-variance decomposition: Proof [∗]

Proof (continued). We have:

E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗]) (E[f̂(X⃗) | X, X⃗] − f(X⃗)) | X, X⃗ ]
= (E[f̂(X⃗) | X, X⃗] − f(X⃗)) E[ f̂(X⃗) − E[f̂(X⃗) | X, X⃗] | X, X⃗ ]
= 0,

using the tower property of conditional expectation.

26 / 45
Example: k-nearest-neighbor fit

Generate data the same way as before:


We generate 1000 X1, X2 as i.i.d. N(0, 1) random variables.
We then generate 1000 Y random variables as:

Yi = 1 + 2Xi1 + 3Xi2 + εi,

where the εi are i.i.d. N(0, 5) random variables.

27 / 45
Example: k-nearest-neighbor fit
Using the first 500 points, we create a k-nearest-neighbor (k-NN) model: For any X⃗, let f̂(X⃗) be the average value of Yi over the k nearest neighbors X(1), . . . , X(k) to X⃗ in the training set.

[Figure: scatter plot of the training data in the (X1, X2) plane.]

How does this fit behave as a function of k?
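A sketch of a hand-rolled k-NN regressor for this setup in R (my own illustration, not the lecture’s code; the Euclidean distance metric and the 500/500 train/test split below are assumptions):

set.seed(3)
n <- 1000; sigma <- 5
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2 * X1 + 3 * X2 + rnorm(n, sd = sigma)
train <- 1:500                                    # first 500 points form the training set
test  <- 501:1000

# k-NN prediction at a single point (x1, x2): average Y over the k nearest training points
knn_predict <- function(x1, x2, k) {
  d <- sqrt((X1[train] - x1)^2 + (X2[train] - x2)^2)
  mean(Y[train][order(d)[1:k]])
}

# held-out RMSE as a function of k
rmse <- sapply(c(1, 5, 25, 100, 500), function(k) {
  pred <- mapply(knn_predict, X1[test], X2[test], MoreArgs = list(k = k))
  sqrt(mean((Y[test] - pred)^2))
})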


28 / 45
Example: k-nearest-neighbor fit
The graph shows root mean squared error (RMSE) as a function of k:

[Figure: k-NN RMSE versus the number of neighbors k (0 to 500), with a horizontal reference line marking the best linear regression prediction error.]

29 / 45
k-nearest-neighbor fit

Given X⃗ and X, let X(1), . . . , X(k) be the k closest points to X⃗ in the data. You will show that:

BIAS = f(X⃗) − (1/k) Σ_{i=1}^{k} f(X(i)),

and

VARIANCE = σ²/k.

This type of result is why we often refer to a “tradeoff” between bias and variance.

30 / 45
Linear regression: Linear population model

Now suppose the population model itself is linear:²

Y = β0 + Σ_j βj Xj + ε, i.e. Y = X⃗β + ε,

for some fixed parameter vector β.
The errors ε are i.i.d. with E[εi|X] = 0 and Var(εi|X) = σ². Also assume the errors ε are uncorrelated across observations. So for our given data X, Y, we have Y = Xβ + ε, where:

Var(ε|X) = σ² I.

² Note: Here X⃗ is viewed as a row vector (1, X1, . . . , Xp).
31 / 45
Linear regression

Suppose we are given data X, Y and fit the resulting model by ordinary least squares. Let β̂ denote the resulting fit:

f̂(X⃗) = β̂0 + Σ_j β̂j Xj,

with β̂ = (XᵀX)⁻¹XᵀY.

What can we say about bias and variance?
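The closed form β̂ = (XᵀX)⁻¹XᵀY can be computed directly and compared against lm(); a brief R sketch (illustrative only, on freshly simulated data):

set.seed(4)
n  <- 200
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + X1 + 2 * X2 + rnorm(n, sd = 5)
X  <- cbind(1, X1, X2)                            # design matrix with an intercept column
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% Y))   # (X'X)^{-1} X'Y
cbind(by_hand = beta_hat, via_lm = coef(lm(Y ~ X1 + X2)))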

32 / 45
Linear regression

Let’s look at bias:

E[f̂(X⃗) | X⃗, X] = E[X⃗β̂ | X⃗, X]
= E[X⃗(XᵀX)⁻¹XᵀY | X⃗, X]
= X⃗(XᵀX)⁻¹Xᵀ E[Xβ + ε | X⃗, X]
= X⃗(XᵀX)⁻¹(XᵀX)β = X⃗β = f(X⃗).

In other words: the ordinary least squares solution is unbiased!

33 / 45
Linear regression: The Gauss-Markov theorem
In fact: among all unbiased linear models, the OLS solution has minimum variance.
This famous result in statistics is called the Gauss-Markov theorem.

Theorem
Assume a linear population model with uncorrelated errors. Fix a (row) covariate vector X⃗, and let γ = X⃗β = Σ_j βj Xj.
Given data X, Y, let β̂ be the OLS solution, and let γ̂ = X⃗β̂ = X⃗(XᵀX)⁻¹XᵀY.
Let δ̂ = g(X⃗, X)Y be any other estimator for γ that is linear in Y and unbiased for every β: E[δ̂ | X, X⃗] = γ.
Then Var(δ̂ | X, X⃗) ≥ Var(γ̂ | X, X⃗), with equality if and only if δ̂ = γ̂.

34 / 45
The Gauss-Markov theorem: Proof [∗]

Proof: We compute the variance of δ̂.

E[(δ̂ − E[δ̂ | X, X⃗])² | X, X⃗]
= E[(δ̂ − X⃗β)² | X, X⃗]
= E[(δ̂ − γ̂ + γ̂ − X⃗β)² | X, X⃗]
= E[(δ̂ − γ̂)² | X, X⃗]
+ E[(γ̂ − X⃗β)² | X, X⃗]
+ 2E[(δ̂ − γ̂)(γ̂ − X⃗β) | X, X⃗].

Look at the last equality: If we can show the last term is zero, then we would be done, because the first two terms are uniquely minimized if δ̂ = γ̂.

35 / 45
The Gauss-Markov theorem: Proof [∗]

Proof continued: For notational simplicity let c = g(X⃗, X). We have:

E[(δ̂ − γ̂)(γ̂ − X⃗β) | X, X⃗]
= E[(cY − X⃗β̂)(X⃗β̂ − X⃗β) | X, X⃗].

Now using the fact that β̂ = (XᵀX)⁻¹XᵀY, that Y = Xβ + ε, and the fact that E[εεᵀ | X, X⃗] = σ² I, the last quantity reduces (after some tedious algebra) to:

σ² X⃗(XᵀX)⁻¹(Xᵀcᵀ − X⃗ᵀ).

36 / 45
The Gauss-Markov theorem: Proof [∗]

Proof continued: To finish the proof, notice that from unbiasedness we have:

E[cY | X, X⃗] = X⃗β.

But since Y = Xβ + ε, where E[ε | X, X⃗] = 0, we have:

cXβ = X⃗β.

Since this has to hold true for every β, we must have cX = X⃗, i.e., that:

Xᵀcᵀ − X⃗ᵀ = 0.

This concludes the proof.

37 / 45
Linear regression: Variance of OLS

We can explicitly work out the variance of OLS in the linear population model.

Var(f̂(X⃗) | X⃗, X) = Var(X⃗(XᵀX)⁻¹XᵀY | X⃗, X)
= X⃗(XᵀX)⁻¹Xᵀ Var(Y|X) X(XᵀX)⁻¹X⃗ᵀ.

Now note that Var(Y|X) = Var(ε|X) = σ² I, where I is the n × n identity matrix.
Therefore:

Var(f̂(X⃗) | X⃗, X) = σ² X⃗(XᵀX)⁻¹X⃗ᵀ.
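This formula can be checked against simulation: across universes, the empirical variance of f̂(X⃗) = X⃗β̂ at a fixed X⃗ should match σ²X⃗(XᵀX)⁻¹X⃗ᵀ. A hedged R sketch (the particular test point and all names are mine):

set.seed(5)
n <- 1000; m <- 2000; sigma <- 5
X  <- cbind(1, rnorm(n), rnorm(n))                # fixed design matrix (intercept included)
x0 <- c(1, 0.5, -1)                               # a fixed covariate row vector X⃗
beta <- c(1, 1, 2)
XtX_inv <- solve(t(X) %*% X)

fhat_x0 <- replicate(m, {
  Y <- X %*% beta + rnorm(n, sd = sigma)          # fresh noise in each universe
  sum(x0 * (XtX_inv %*% t(X) %*% Y))              # f̂(x0) = x0 β̂
})

c(theory    = sigma^2 * drop(t(x0) %*% XtX_inv %*% x0),
  simulated = var(fhat_x0))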

38 / 45
Linear regression: In-sample prediction error

The preceding formula is not particularly intuitive. To get more intuition, evaluate in-sample prediction error:

Errin = (1/n) Σ_{i=1}^{n} E[(Y − f̂(X⃗))² | X, Y, X⃗ = Xi].

This is the prediction error if we received new samples of Y corresponding to each covariate vector in our existing data. Note that the only randomness in the preceding expression is in the new observations Y.

39 / 45
Linear regression: In-sample prediction error [∗]

Taking expectations over Y and using the bias-variance decomposition on each term of Errin, we have:

E[(Y − f̂(X⃗))² | X, X⃗ = Xi] = σ² + 0² + σ² Xi(XᵀX)⁻¹Xiᵀ.

The first term is irreducible error; the second term is zero because OLS is unbiased; and the third term is the variance when X⃗ = Xi.

Now note that:

Σ_{i=1}^{n} Xi(XᵀX)⁻¹Xiᵀ = Trace(H),

where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
It can be shown that Trace(H) = p (see appendix).
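A quick R check of this identity (illustrative; here p is taken to be the number of columns of X, including the intercept column):

set.seed(6)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n), rnorm(n))       # n x p design matrix with p = 4 columns
XtX_inv <- solve(t(X) %*% X)
H <- X %*% XtX_inv %*% t(X)                       # hat matrix
sum_quad <- sum(sapply(1:n, function(i) X[i, ] %*% XtX_inv %*% X[i, ]))
c(sum_of_quadratic_forms = sum_quad, trace_H = sum(diag(H)), p = ncol(X))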

40 / 45
Linear regression: In-sample prediction error

Therefore we have:

E[Errin | X] = σ² + 0² + (p/n) σ².

This is the bias-variance decomposition for linear regression:
- As before, σ² is the irreducible error.
- 0² is the BIAS²: OLS is unbiased.
- The last term, (p/n)σ², is the VARIANCE. It increases with p.

41 / 45
Linear regression: Model specification

It was very important in the preceding analysis that the covariates we used were the same as in the population model!
This is why:
- The bias is zero.
- The variance is (p/n)σ².
On the other hand, with a large number of potential covariates, will we want to use all of them?

42 / 45
Linear regression: Model specification

What happens if we use a subset of covariates S ⊂ {0, . . . , p}?

- In general, the resulting model will be biased, if the omitted variables are correlated with the covariates in S.³
- It can be shown that the same analysis holds:

E[Errin | X] = σ² + (1/n) Σ_{i=1}^{n} (Xi β − E[f̂(Xi) | X])² + (|S|/n) σ².

(Second term is BIAS².)

So bias increases while variance decreases.

³ This is called the omitted variable bias in econometrics.
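A sketch of this decomposition by simulation, fitting the misspecified model Y ~ 1 + X1 (which omits X2) in the synthetic example from earlier; the way each term is estimated below is my own construction:

set.seed(7)
n <- 1000; m <- 500; sigma <- 5
X1 <- rnorm(n); X2 <- rnorm(n)
f_true <- 1 + X1 + 2 * X2                          # Xi β under the full population model

fhat <- replicate(m, {                             # n x m matrix of in-sample fitted values
  Y <- f_true + rnorm(n, sd = sigma)
  fitted(lm(Y ~ X1))                               # subset model: X2 omitted
})

bias2    <- mean((f_true - rowMeans(fhat))^2)      # (1/n) Σ_i (Xi β − E[f̂(Xi) | X])²
variance <- mean(apply(fhat, 1, var))              # averaged variance term; compare (|S|/n) σ²
c(bias2 = bias2, variance = variance, var_formula = 2 / n * sigma^2)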
43 / 45
Linear regression: Model specification

What happens if we introduce a new covariate that is uncorrelated with the existing covariates and the outcome?
- The bias remains zero.
- However, now the variance increases, since VARIANCE = (p/n)σ².

44 / 45
Appendix: Trace of H [∗]

Why does H have trace p?

- The trace of a (square) matrix is the sum of its eigenvalues.
- Recall that H is the matrix that projects Y into the column space of X. (See Lecture 2.)
- Since it is a projection matrix, for any v in the column space of X, we will have Hv = v.
- This means that 1 is an eigenvalue of H with multiplicity p.
- On the other hand, the remaining eigenvalues of H must be zero, since the column space of X has rank p.

So we conclude that H has p eigenvalues that are 1, and the rest are zero.

45 / 45
