226 Lecture 5: Prediction
Ramesh Johari
[email protected]
Fall 2015
The road ahead
Thus far we have seen how we can select and evaluate predictive
models using the train-validate-test methodology. This approach
works well if we have “enough” data.
What if we don’t have enough data to blindly train and validate
models? We have to understand the behavior of prediction error
well enough to intelligently explore the space of models.
Conditional expectation
Proof:
$$
\begin{aligned}
E[(Y - \hat{f}(\vec{X}))^2]
&= E[(Y - E[Y \mid \vec{X}] + E[Y \mid \vec{X}] - \hat{f}(\vec{X}))^2] \\
&= E[(Y - E[Y \mid \vec{X}])^2] + E[(E[Y \mid \vec{X}] - \hat{f}(\vec{X}))^2] \\
&\quad + 2\,E[(Y - E[Y \mid \vec{X}])(E[Y \mid \vec{X}] - \hat{f}(\vec{X}))].
\end{aligned}
$$
The cross term is zero (condition on $\vec{X}$ and use the definition of conditional expectation), so the error is minimized by taking $\hat{f}(\vec{X}) = E[Y \mid \vec{X}]$.
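To make this concrete, here is a minimal numerical sketch (not from the slides), under an assumed population model $Y = X^2 + \varepsilon$ with $\varepsilon \sim N(0,1)$: the conditional expectation $E[Y \mid X] = X^2$ achieves a lower mean squared error than a competing predictor.

```r
# Minimal sketch (assumed model Y = X^2 + eps, eps ~ N(0,1)): the conditional
# expectation E[Y|X] = X^2 has lower squared error than another predictor.
set.seed(1)
n <- 1e5
X <- runif(n, -2, 2)
Y <- X^2 + rnorm(n)

mean((Y - X^2)^2)      # predictor f(X) = E[Y|X]; close to 1, the noise variance
mean((Y - (1 + X))^2)  # an arbitrary alternative predictor; larger
```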
Population model
Prediction error revisited
A note on prediction error and conditioning
$$E[(Y - \hat{f}(\vec{X}))^2 \mid (*)],$$
Models and conditional expectation
So now suppose we have data $\mathbf{X}, \mathbf{Y}$, and use it to build a model $\hat{f}$.
What is the prediction error if we see a new $\vec{X}$?
$$
\begin{aligned}
E[(Y - \hat{f}(\vec{X}))^2 \mid \mathbf{X}, \mathbf{Y}, \vec{X}]
&= E[(Y - f(\vec{X}))^2 \mid \vec{X}] + (\hat{f}(\vec{X}) - f(\vec{X}))^2 \\
&= \sigma^2 + (\hat{f}(\vec{X}) - f(\vec{X}))^2.
\end{aligned}
$$
$$
\begin{aligned}
E[(Y - \hat{f}(\vec{X}))^2 \mid \mathbf{X}, \mathbf{Y}, \vec{X}]
&= E[(Y - f(\vec{X}) + f(\vec{X}) - \hat{f}(\vec{X}))^2 \mid \mathbf{X}, \mathbf{Y}, \vec{X}] \\
&= E[(Y - f(\vec{X}))^2 \mid \vec{X}] + (\hat{f}(\vec{X}) - f(\vec{X}))^2 \\
&\quad + 2\,E[Y - f(\vec{X}) \mid \vec{X}]\,(f(\vec{X}) - \hat{f}(\vec{X})) \\
&= E[(Y - f(\vec{X}))^2 \mid \vec{X}] + (\hat{f}(\vec{X}) - f(\vec{X}))^2,
\end{aligned}
$$
because $E[Y - f(\vec{X}) \mid \vec{X}] = E[Y - E[Y \mid \vec{X}] \mid \vec{X}] = 0$.
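As a quick check (an assumed simulation setup, not from the slides), we can fix one training set and one fitted model, then estimate the left-hand side at a single new covariate vector by redrawing the new response many times; it approximately matches $\sigma^2 + (\hat{f}(\vec{X}) - f(\vec{X}))^2$.

```r
# Check E[(Y - fhat(x))^2 | X, Y, x] = sigma^2 + (fhat(x) - f(x))^2 at one
# fixed new point x, under an assumed model f(x) = 1 + x1 + 2*x2, sigma = 1.
set.seed(2)
n <- 100; sigma <- 1
f <- function(x1, x2) 1 + x1 + 2 * x2

train <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
train$Y <- f(train$X1, train$X2) + rnorm(n, sd = sigma)
fit <- lm(Y ~ X1 + X2, data = train)      # one fitted model fhat

xnew <- data.frame(X1 = 0.5, X2 = -1)     # one fixed new covariate vector
fhat_x <- predict(fit, newdata = xnew)

Ynew <- f(xnew$X1, xnew$X2) + rnorm(1e5, sd = sigma)  # many new responses at x
mean((Ynew - fhat_x)^2)                               # simulated error
sigma^2 + (fhat_x - f(xnew$X1, xnew$X2))^2            # sigma^2 + (fhat - f)^2
```

Note that here only the new response is random; the training data, and hence $\hat{f}$, are held fixed.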
A thought experiment
$$(\hat{f}(\vec{X}) - f(\vec{X}))^2. \qquad (**)$$
Here is one way we might think about the quality of our modeling approach (a code sketch follows the list):
- Fix the design matrix $\mathbf{X}$.
- Generate data $\mathbf{Y}$ many times (parallel "universes").
- In each universe, create an $\hat{f}$.
- In each universe, evaluate $(**)$.
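A minimal R sketch of this thought experiment (the model, noise level, and sample size are assumptions of mine): the design is fixed, $\mathbf{Y}$ is redrawn in each universe, and $(**)$ is evaluated at one fixed new covariate vector.

```r
# Parallel universes: fix the design, redraw Y, refit, evaluate (**).
set.seed(3)
n <- 100; sigma <- 1
f <- function(x1, x2) 1 + x1 + 2 * x2
X1 <- rnorm(n); X2 <- rnorm(n)            # fixed design matrix
xnew <- data.frame(X1 = 0.5, X2 = -1)     # fixed new covariate vector
f_x <- f(xnew$X1, xnew$X2)

one_universe <- function() {
  Y <- f(X1, X2) + rnorm(n, sd = sigma)   # new Y in this universe
  fit <- lm(Y ~ X1 + X2)
  (predict(fit, newdata = xnew) - f_x)^2  # (**) in this universe
}
errs <- replicate(2000, one_universe())
mean(errs)                                # average of (**) across universes
```

Unlike the previous check, here the training responses (and hence $\hat{f}$) change from universe to universe.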
Example
$$Y = 1 + X_1 + 2 X_2 + \varepsilon,$$
Example: Parallel universes
$$Y_i = f(X_i) + \varepsilon_i,$$
Example: OLS in parallel universes
Example: Results
We go through the process with three models: A, B, C.
The three we try are:
- Y ~ 1 + X1
- Y ~ 1 + X1 + X2
- Y ~ 1 + X1 + X2 + ... + I(X1^4) + I(X2^4)

Which is which?
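One plausible way to generate the comparison on the next slide (the sample size, noise level, and the exact polynomial terms hidden behind the "..." in the third formula are assumptions of mine, and "MSE" is taken here to mean the average of $(\hat{f}(X_i) - f(X_i))^2$ over the fixed design points; the models are labeled m1, m2, m3 so as not to give away which is A, B, or C):

```r
# Fit the three candidate formulas in each universe and record, for each,
# the average of (fhat(X_i) - f(X_i))^2 over the fixed design points.
set.seed(4)
n <- 100; sigma <- 1
f <- function(x1, x2) 1 + x1 + 2 * x2
X1 <- rnorm(n); X2 <- rnorm(n)            # fixed design
truth <- f(X1, X2)

forms <- list(m1 = Y ~ 1 + X1,
              m2 = Y ~ 1 + X1 + X2,
              m3 = Y ~ 1 + X1 + X2 + I(X1^2) + I(X2^2) + I(X1^3) + I(X2^3) +
                       I(X1^4) + I(X2^4))

one_universe <- function() {
  d <- data.frame(X1 = X1, X2 = X2, Y = truth + rnorm(n, sd = sigma))
  sapply(forms, function(fm) mean((fitted(lm(fm, data = d)) - truth)^2))
}
mse <- t(replicate(1000, one_universe()))
colMeans(mse)   # omitting X2 inflates bias; the large model inflates variance
```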
Example: Results
Results:
[Figure: estimated densities of MSE across universes for models A, B, and C.]
A thought experiment: Aside
Examples: Bias and variance
The bias-variance decomposition
$$
\begin{aligned}
E[(Y - \hat{f}(\vec{X}))^2 \mid \mathbf{X}, \vec{X}]
&= \sigma^2 + E[(\hat{f}(\vec{X}) - f(\vec{X}))^2 \mid \mathbf{X}, \vec{X}] \\
&= \sigma^2 + \bigl(f(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)^2 \\
&\quad + E\Bigl[\bigl(\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)^2 \,\Big|\, \mathbf{X}, \vec{X}\Bigr].
\end{aligned}
$$
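In the BIAS/VARIANCE terminology used later in these slides, the three terms can be read as:
$$\underbrace{\sigma^2}_{\text{irreducible error}} + \underbrace{\bigl(f(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)^2}_{\text{BIAS}^2} + \underbrace{E\bigl[\bigl(\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)^2 \mid \mathbf{X}, \vec{X}\bigr]}_{\text{VARIANCE}}.$$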
The bias-variance decomposition: Proof [∗]
Proof: We already showed that the prediction error is $\sigma^2$ plus $E[(\hat{f}(\vec{X}) - f(\vec{X}))^2 \mid \mathbf{X}, \vec{X}]$. Expand the second term by adding and subtracting $E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]$:
$$
E[(\hat{f}(\vec{X}) - f(\vec{X}))^2 \mid \mathbf{X}, \vec{X}]
= E\Bigl[\bigl(\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] + E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] - f(\vec{X})\bigr)^2 \,\Big|\, \mathbf{X}, \vec{X}\Bigr].
$$
The bias-variance decomposition: Proof [∗]
$$
\begin{aligned}
E[(\hat{f}(\vec{X}) - f(\vec{X}))^2 \mid \mathbf{X}, \vec{X}]
&= E\Bigl[\bigl(\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)^2 \,\Big|\, \mathbf{X}, \vec{X}\Bigr] \\
&\quad + \bigl(E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] - f(\vec{X})\bigr)^2 \\
&\quad + 2\,E\Bigl[\bigl(\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}]\bigr)\bigl(E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] - f(\vec{X})\bigr) \,\Big|\, \mathbf{X}, \vec{X}\Bigr].
\end{aligned}
$$
The bias-variance decomposition: Proof [∗]
The last (cross) term equals zero: conditional on $\mathbf{X}$ and $\vec{X}$, the factor $E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] - f(\vec{X})$ is a constant, and
$$E\bigl[\hat{f}(\vec{X}) - E[\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}] \,\big|\, \mathbf{X}, \vec{X}\bigr] = 0,$$
which completes the proof.
Example: k-nearest-neighbor fit
$$Y_i = 1 + 2 X_{i1} + 3 X_{i2} + \varepsilon_i,$$
Example: k-nearest-neighbor fit
Using the first 500 points, we create a k-nearest-neighbor (k-NN) model: for any $\vec{X}$, let $\hat{f}(\vec{X})$ be the average value of $Y_i$ over the $k$ nearest neighbors $X_{(1)}, \ldots, X_{(k)}$ to $\vec{X}$ in the training set.
[Figure: scatterplot of the training covariates, X2 against X1; plot of k-NN RMSE.]
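A minimal base-R sketch of such a k-NN predictor (an assumed implementation: the helper name knn_predict, the use of Euclidean distance, and the simulated data are mine):

```r
# k-NN prediction: fhat(x) is the average of Y over the k training points
# whose covariates are closest (in Euclidean distance) to x.
knn_predict <- function(Xtrain, Ytrain, xnew, k) {
  d2 <- rowSums(sweep(Xtrain, 2, xnew)^2)   # squared distances to xnew
  mean(Ytrain[order(d2)[1:k]])              # average Y over the k nearest
}

# Example usage on data simulated from the model above.
set.seed(5)
n <- 500
Xtrain <- cbind(X1 = rnorm(n), X2 = rnorm(n))
Ytrain <- 1 + 2 * Xtrain[, "X1"] + 3 * Xtrain[, "X2"] + rnorm(n)
knn_predict(Xtrain, Ytrain, xnew = c(0.5, -1), k = 10)
```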
k-nearest-neighbor fit
It can be shown that
$$\text{BIAS} = f(\vec{X}) - \frac{1}{k}\sum_{\ell=1}^{k} f(X_{(\ell)}), \qquad \text{VARIANCE} = \frac{\sigma^2}{k}.$$
This type of result is why we often refer to a “tradeoff” between bias and variance.
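A brief sketch of where these come from (not from the slides): conditional on the training covariates and $\vec{X}$, the $k$ nearest neighbors are fixed and
$$\hat{f}(\vec{X}) = \frac{1}{k}\sum_{\ell=1}^{k} Y_{(\ell)} = \frac{1}{k}\sum_{\ell=1}^{k} f(X_{(\ell)}) + \frac{1}{k}\sum_{\ell=1}^{k}\varepsilon_{(\ell)}.$$
Taking the conditional expectation kills the noise terms and leaves the bias above, while the variance of the average of $k$ uncorrelated noise terms, each with variance $\sigma^2$, is $k\sigma^2/k^2 = \sigma^2/k$.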
Linear regression: Linear population model
Assume the linear population model $\mathbf{Y} = \mathbf{X}\beta + \varepsilon$, where $E[\varepsilon \mid \mathbf{X}] = 0$ and
$$\mathrm{Var}(\varepsilon \mid \mathbf{X}) = \sigma^2 I.$$
Linear regression
$$
\begin{aligned}
E[\hat{f}(\vec{X}) \mid \vec{X}, \mathbf{X}]
&= E[\vec{X}\hat{\beta} \mid \vec{X}, \mathbf{X}] \\
&= E[\vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{X}^\top\mathbf{Y}) \mid \vec{X}, \mathbf{X}] \\
&= \vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top E[\mathbf{X}\beta + \varepsilon \mid \vec{X}, \mathbf{X}] \\
&= \vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{X}^\top\mathbf{X})\beta = \vec{X}\beta = f(\vec{X}).
\end{aligned}
$$
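The derivation uses the closed form $\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$; here is a quick base-R check on simulated data (the data-generating choices are mine) that this matches lm():

```r
# Check that betahat = (X^T X)^{-1} X^T Y matches the coefficients from lm().
set.seed(6)
n <- 200
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - X2 + rnorm(n)

X <- cbind(1, X1, X2)                      # design matrix with intercept
betahat <- solve(t(X) %*% X, t(X) %*% Y)   # (X^T X)^{-1} X^T Y
drop(betahat)                              # closed-form coefficients
coef(lm(Y ~ X1 + X2))                      # same, via lm()
```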
Linear regression: The Gauss-Markov theorem
In fact: among all unbiased linear models, the OLS solution has
minimum variance.
This famous result in statistics is called the Gauss-Markov theorem.
Theorem
Assume a linear population model with uncorrelated errors. Fix a (row) covariate vector $\vec{X}$, and let $\gamma = \vec{X}\beta = \sum_j \beta_j X_j$. Let $\hat{\gamma} = \vec{X}\hat{\beta}$ be the OLS estimate of $\gamma$. Then for any other linear unbiased estimator $\hat{\delta}$ of $\gamma$,
$$\mathrm{Var}(\hat{\gamma} \mid \mathbf{X}, \vec{X}) \le \mathrm{Var}(\hat{\delta} \mid \mathbf{X}, \vec{X}).$$
The Gauss-Markov theorem: Proof [∗]
Look at the last equality: If we can show the last term is zero,
then we would be done, because the first two terms are uniquely
minimized if δ̂ = γ̂.
The Gauss-Markov theorem: Proof [∗]
Proof continued: For notational simplicity let $c = g(\mathbf{X}, \vec{X})$. We have:
$$E[(\hat{\delta} - \hat{\gamma})(\hat{\gamma} - \vec{X}\beta) \mid \mathbf{X}, \vec{X}] = E[(c\mathbf{Y} - \vec{X}\hat{\beta})(\vec{X}\hat{\beta} - \vec{X}\beta) \mid \mathbf{X}, \vec{X}].$$
The Gauss-Markov theorem: Proof [∗]
But since $\mathbf{Y} = \mathbf{X}\beta + \varepsilon$, where $E[\varepsilon \mid \mathbf{X}, \vec{X}] = 0$, we have $E[\hat{\delta} \mid \mathbf{X}, \vec{X}] = c\mathbf{X}\beta$; unbiasedness of $\hat{\delta}$ requires
$$c\mathbf{X}\beta = \vec{X}\beta.$$
Since this has to hold true for every $\beta$, we must have $c\mathbf{X} = \vec{X}$, i.e., that:
$$\mathbf{X}^\top c^\top - \vec{X}^\top = 0.$$
Linear regression: Variance of OLS
$$
\begin{aligned}
\mathrm{Var}(\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X})
&= \mathrm{Var}(\vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} \mid \mathbf{X}, \vec{X}) \\
&= \vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\,\mathrm{Var}(\mathbf{Y} \mid \mathbf{X})\,\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\vec{X}^\top.
\end{aligned}
$$
Since $\mathrm{Var}(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 I$, this gives
$$\mathrm{Var}(\hat{f}(\vec{X}) \mid \mathbf{X}, \vec{X}) = \sigma^2\,\vec{X}(\mathbf{X}^\top\mathbf{X})^{-1}\vec{X}^\top.$$
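A small Monte Carlo check of this formula (assumed simulation settings; as in the parallel-universe experiment, the design matrix is fixed and $\mathbf{Y}$ is redrawn):

```r
# Compare the variance of fhat(x) = x betahat across redraws of Y with the
# formula sigma^2 * x (X^T X)^{-1} x^T, at a fixed design X and fixed x.
set.seed(7)
n <- 100; sigma <- 2
X <- cbind(1, rnorm(n), rnorm(n))     # fixed design matrix (with intercept)
beta <- c(1, 1, 2)
x <- c(1, 0.5, -1)                    # fixed new covariate vector (row)

fhat_x <- replicate(5000, {
  Y <- X %*% beta + rnorm(n, sd = sigma)
  sum(x * solve(t(X) %*% X, t(X) %*% Y))       # x %*% betahat
})
var(fhat_x)                                    # simulated variance
sigma^2 * c(t(x) %*% solve(t(X) %*% X) %*% x)  # the formula above
```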
Linear regression: In-sample prediction error
Therefore we have:
$$E[\mathrm{Err}_{\mathrm{in}} \mid \mathbf{X}] = \sigma^2 + 0^2 + \frac{p}{n}\sigma^2.$$
This is the bias-variance decomposition for linear regression:
- As before, $\sigma^2$ is the irreducible error.
- $0^2$ is the BIAS$^2$: OLS is unbiased.
- The last term, $(p/n)\sigma^2$, is the VARIANCE; it increases with $p$ (a short derivation sketch follows this list).
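Here is a brief sketch of where the $(p/n)\sigma^2$ term comes from (along the lines of the appendix on the trace of $H$): writing $X_i$ for the $i$th row of $\mathbf{X}$ and $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ for the hat matrix, average the variance formula over the $n$ training points:
$$\frac{1}{n}\sum_{i=1}^{n} \sigma^2 X_i(\mathbf{X}^\top\mathbf{X})^{-1}X_i^\top = \frac{\sigma^2}{n}\operatorname{tr}(\mathbf{H}) = \frac{\sigma^2}{n}\operatorname{tr}\bigl((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\bigr) = \frac{\sigma^2}{n}\operatorname{tr}(I_p) = \frac{p}{n}\sigma^2.$$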
Linear regression: Model specification
This is called the omitted variable bias in econometrics.
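A small illustration of omitted variable bias (an assumed setup, not from the slides): when X2 is correlated with X1 and omitted from the regression, the coefficient on X1 absorbs part of X2's effect.

```r
# Omitted variable bias: the true model is Y = 1 + X1 + 2*X2 + eps, with X2
# correlated with X1; regressing Y on X1 alone biases the X1 coefficient.
set.seed(8)
n <- 1e4
X1 <- rnorm(n)
X2 <- 0.8 * X1 + rnorm(n)   # correlated with X1
Y  <- 1 + X1 + 2 * X2 + rnorm(n)

coef(lm(Y ~ 1 + X1))        # X1 coefficient is close to 1 + 2 * 0.8 = 2.6
coef(lm(Y ~ 1 + X1 + X2))   # close to the true coefficients (1, 1, 2)
```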
Appendix: Trace of H [∗]