
MS&E 226: “Small” Data

Lecture 5: Bias and variance (v3)

Ramesh Johari
[email protected]

Fall 2015

1 / 45
The road ahead

Thus far we have seen how we can select and evaluate predictive
models using the train-validate-test methodology. This approach
works well if we have “enough” data.
What if we don’t have enough data to blindly train and validate
models? We have to understand the behavior of prediction error
well enough to intelligently explore the space of models.

2 / 45
The road ahead

Starting with this lecture:


- We characterize exactly how prediction error behaves through the ideas of bias and variance.
- We develop measures of model complexity that we can use to help us effectively search for “good” models.
- We develop methods of evaluating models using limited data.
A word of caution: All else being equal, more data leads to more
robust model selection and evaluation! So these techniques are not
“magic bullets”.

3 / 45
Conditional expectation

4 / 45
Conditional expectation

Given the population model for X⃗ and Y, suppose we are allowed to choose any predictive model f̂ we want. What is the best one?

minimize E[(Y − f̂(X⃗))²].

(Here the expectation is over (X⃗, Y).)

Theorem
The predictive model that minimizes squared error is f̂(X⃗) = E[Y|X⃗].
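As a quick numerical illustration (my own, not from the lecture), here is a small R simulation comparing the average squared error of the conditional mean against two other predictors, for one simple joint distribution; the distribution and the candidate predictors below are arbitrary choices:

set.seed(1)
x <- runif(1e5, -2, 2)
y <- x^2 + rnorm(1e5)                        # here E[Y | X = x] = x^2
predictors <- list(
  cond_mean = function(x) x^2,               # the conditional expectation
  linear    = function(x) 1 + x,             # some other predictive model
  constant  = function(x) rep(mean(y), length(x))
)
sapply(predictors, function(g) mean((y - g(x))^2))   # cond_mean attains the smallest value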

5 / 45
Conditional expectation
Proof:

E[(Y − f̂(X⃗))²]
= E[(Y − E[Y|X⃗] + E[Y|X⃗] − f̂(X⃗))²]
= E[(Y − E[Y|X⃗])²] + E[(E[Y|X⃗] − f̂(X⃗))²]
+ 2E[(Y − E[Y|X⃗])(E[Y|X⃗] − f̂(X⃗))].

The first two terms are nonnegative; the first does not depend on f̂, and the second is minimized by choosing f̂(X⃗) = E[Y|X⃗].
For the third term, using the tower property of conditional expectation:

E[(Y − E[Y|X⃗])(E[Y|X⃗] − f̂(X⃗))]
= E[ E[Y − E[Y|X⃗] | X⃗] (E[Y|X⃗] − f̂(X⃗)) ] = 0.

So the squared-error-minimizing solution is to choose f̂(X⃗) = E[Y|X⃗].
6 / 45
Conditional expectation

Why don’t we just choose E[Y|X⃗] as our predictive model?
Because we don’t know the distribution of (X⃗, Y)!
Nevertheless, the preceding result is a useful guide:
- It provides the benchmark that every squared-error-minimizing predictive model is striving for: approximate the conditional expectation.
- It provides intuition for why linear regression approximates the conditional mean.

7 / 45
Population model

For the rest of the lecture write:


Y = f(X⃗) + ε,

where E[ε|X⃗] = 0. In other words, f(X⃗) = E[Y|X⃗].
We make the assumption that Var(ε|X⃗) = σ² (i.e., it does not depend on X⃗).
We will make additional assumptions about the population model
as we go along.

8 / 45
Prediction error revisited

9 / 45
A note on prediction error and conditioning

When you see “prediction error”, it typically means:

E[(Y − f̂(X⃗))² | (∗)],

where (∗) can be one of many things:

- X, Y, X⃗;
- X, Y;
- X, X⃗;
- X;
- nothing.

As long as we don’t condition on both X and Y, the model is random!

10 / 45
Models and conditional expectation
So now suppose we have data X, Y, and use it to build a model f̂.
What is the prediction error if we see a new X⃗?

E[(Y − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²
= σ² + (f̂(X⃗) − f(X⃗))².

I.e.: When minimizing mean squared error, “good” models should behave like conditional expectation.¹
Our goal: understand the second term.

¹ This is just another way of deriving that the prediction-error-minimizing solution is the conditional expectation.
11 / 45
Models and conditional expectation [∗]

Proof of preceding statement:


The proof is essentially identical to the earlier proof for conditional
expectation:

E[(Y − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗) + f(X⃗) − f̂(X⃗))² | X, Y, X⃗]
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²
+ 2E[Y − f(X⃗) | X⃗] (f(X⃗) − f̂(X⃗))
= E[(Y − f(X⃗))² | X⃗] + (f̂(X⃗) − f(X⃗))²,

because E[Y − f(X⃗) | X⃗] = E[Y − E[Y|X⃗] | X⃗] = 0.

12 / 45
A thought experiment

Our goal is to understand:

(f̂(X⃗) − f(X⃗))². (∗∗)

Here is one way we might think about the quality of our modeling approach:
- Fix the design matrix X.
- Generate data Y many times (parallel “universes”).
- In each universe, create an f̂.
- In each universe, evaluate (∗∗).

13 / 45
A thought experiment

Our goal is to understand:

(f̂(X⃗) − f(X⃗))². (∗∗)

What kinds of things can we evaluate?

- If f̂(X⃗) is “close” to the conditional expectation, then on average in our universes, it should look like f(X⃗).
- f̂(X⃗) might be close to f(X⃗) on average, but still vary wildly across our universes.

The first is bias. The second is variance.

14 / 45
Example

Let’s carry out some simulations with a synthetic model.


Population model:
- We generate X1, X2 as i.i.d. N(0, 1) random variables.
- Given X1, X2, the distribution of Y is given by:

Y = 1 + X1 + 2X2 + ε,

where ε is an independent N(0, 5) random variable.

Thus f(X1, X2) = 1 + X1 + 2X2.
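A minimal R sketch of drawing one dataset from this population model (the variable names, and reading N(0, 5) as a standard deviation of 5, are my assumptions rather than details taken from the lecture code):

set.seed(1)
n  <- 1000
X1 <- rnorm(n)                     # i.i.d. N(0, 1)
X2 <- rnorm(n)
f  <- 1 + X1 + 2 * X2              # the conditional expectation E[Y | X1, X2]
Y  <- f + rnorm(n, sd = 5)         # add independent noise ε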

15 / 45
Example: Parallel universes

Generate a design matrix X by sampling 1000 i.i.d. values of (X1, X2).
Now we run m = 500 simulations. These are our “universes.” In each simulation, generate data Y according to:

Yi = f(Xi) + εi,

where the εi are i.i.d. N(0, 5) random variables.

In each simulation, what changes are the specific values of the εi. This is what it means to condition on X.

16 / 45
Example: OLS in parallel universes

In each simulation, given the design matrix X and Y, we build a fitted model f̂ using ordinary least squares.
Finally, let X̃ denote a fixed matrix of 1000 i.i.d. test values of (X1, X2). (We use the same test data X̃ across all the universes.)
In each universe we evaluate:

MSE = (1/1000) Σ_{i=1}^{1000} (f̂(X̃i1, X̃i2) − f(X̃i1, X̃i2))².

(This is an estimate of E[(f̂(X⃗) − f(X⃗))² | X, Y] in the given universe.)
We then create a density plot of these mean squared errors across the 500 universes.
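A hedged R sketch of the whole experiment for the correctly specified model Y ~ 1 + X1 + X2 (the helper names, the use of lm()/predict(), and sd = 5 for the noise are my own choices; the lecture’s actual code is not shown here):

set.seed(1)
n <- 1000; m <- 500; sigma <- 5
X_train <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed design matrix X
X_test  <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed test points X̃
f <- function(d) 1 + d$X1 + 2 * d$X2                  # true conditional expectation

mse <- replicate(m, {
  train <- X_train
  train$Y <- f(X_train) + rnorm(n, sd = sigma)        # fresh noise: a new “universe”
  fit <- lm(Y ~ 1 + X1 + X2, data = train)            # OLS fit in this universe
  mean((predict(fit, newdata = X_test) - f(X_test))^2)
})

plot(density(mse), xlab = "MSE", main = "MSE across 500 universes")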

17 / 45
Example: Results
We go through the process with three models: A, B, C.
The three we try are:
- Y ~ 1 + X1.
- Y ~ 1 + X1 + X2.
- Y ~ 1 + X1 + X2 + ... + I(X1^4) + I(X2^4).
Which is which?

18 / 45
Example: Results
Results:

[Figure: density plot of the MSE values across the 500 universes, one curve per model; x-axis: MSE (0 to 4), y-axis: density.]

19 / 45
A thought experiment: Aside

This is our first example of a frequentist approach:


- The population model is fixed.
- The data is random.
- We reason about a particular modeling procedure by considering what happens if we carry out the same procedure over and over again (in this case, fitting a model from data).

20 / 45
Examples: Bias and variance

Suppose you are predicting, e.g., wealth based on a collection of demographic covariates.

- Suppose we make a constant prediction: f̂(Xi) = c for all i. Is this biased? Does it have low variance?
- Suppose that every time you get your data, you use enough parameters to fit Y exactly: f̂(Xi) = Yi for all i. Is this biased? Does it have low variance?

21 / 45
The bias-variance decomposition

We can be more precise about our discussion.

E[(Y − f̂(X⃗))² | X, X⃗]
= σ² + E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= σ² + (f(X⃗) − E[f̂(X⃗) | X, X⃗])²
+ E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗])² | X, X⃗ ].

The first term is irreducible error.
The second term is BIAS².
The third term is VARIANCE.
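These three pieces can be estimated by simulation at a single test point X⃗, reusing the synthetic example from earlier. A sketch in R (the point x0 and all names below are my own choices):

set.seed(2)
n <- 1000; m <- 500; sigma <- 5
X_train <- data.frame(X1 = rnorm(n), X2 = rnorm(n))   # fixed design matrix
x0 <- data.frame(X1 = 0.5, X2 = -1)                   # a fixed test point X⃗
f  <- function(d) 1 + d$X1 + 2 * d$X2

fhat_x0 <- replicate(m, {                             # f̂(x0) in each universe
  train <- X_train
  train$Y <- f(X_train) + rnorm(n, sd = sigma)
  predict(lm(Y ~ X1 + X2, data = train), newdata = x0)
})

bias2    <- (f(x0) - mean(fhat_x0))^2                 # squared bias at x0
variance <- var(fhat_x0)                              # variance of f̂(x0) across universes
c(irreducible = sigma^2, bias2 = bias2, variance = variance)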

22 / 45
The bias-variance decomposition

The bias-variance decomposition measures how sensitive prediction error is to changes in the training data (in this case, Y).

- If there are systematic errors in prediction made regardless of the training data, then there is high bias.
- If the fitted model is very sensitive to the choice of training data, then there is high variance.

23 / 45
The bias-variance decomposition: Proof [∗]
Proof: We already showed that:

E[(Y − f̂(X⃗))² | X, Y, X⃗] = σ² + (f̂(X⃗) − f(X⃗))².

Take expectations over Y:

E[(Y − f̂(X⃗))² | X, X⃗] = σ² + E[(f̂(X⃗) − f(X⃗))² | X, X⃗].

Add and subtract E[f̂(X⃗) | X, X⃗] in the second term:

E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗] + E[f̂(X⃗) | X, X⃗] − f(X⃗))² | X, X⃗ ].

24 / 45
The bias-variance decomposition: Proof [∗]

Proof (continued): Expand the square:

E[(f̂(X⃗) − f(X⃗))² | X, X⃗]
= E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗])² | X, X⃗ ]
+ (E[f̂(X⃗) | X, X⃗] − f(X⃗))²
+ 2E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗]) (E[f̂(X⃗) | X, X⃗] − f(X⃗)) | X, X⃗ ].

(The conditional expectation drops out of the middle term, because given X and X⃗ it is no longer random.)

The first term is the VARIANCE. The second term is the BIAS².
We have to show the third term is zero.

25 / 45
The bias-variance decomposition: Proof [∗]

Proof (continued). We have:

E[ (f̂(X⃗) − E[f̂(X⃗) | X, X⃗]) (E[f̂(X⃗) | X, X⃗] − f(X⃗)) | X, X⃗ ]
= (E[f̂(X⃗) | X, X⃗] − f(X⃗)) E[ f̂(X⃗) − E[f̂(X⃗) | X, X⃗] | X, X⃗ ]
= 0,

using the tower property of conditional expectation.

26 / 45
Example: k-nearest-neighbor fit

Generate data the same way as before:


We generate 1000 X1, X2 as i.i.d. N(0, 1) random variables.
We then generate 1000 Y random variables as:

Yi = 1 + 2Xi1 + 3Xi2 + εi,

where the εi are i.i.d. N(0, 5) random variables.

27 / 45
Example: k-nearest-neighbor fit
Using the first 500 points, we create a k-nearest-neighbor (k-NN) model: For any X⃗, let f̂(X⃗) be the average value of Yi over the k nearest neighbors X(1), . . . , X(k) to X⃗ in the training set.

[Figure: scatter plot of the training data in the (X1, X2) plane.]

How does this fit behave as a function of k?
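A sketch of a hand-rolled k-NN regressor for this setup in R (my own illustration, not the lecture’s code; the Euclidean distance metric and the 500/500 train/test split below are assumptions):

set.seed(3)
n <- 1000; sigma <- 5
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2 * X1 + 3 * X2 + rnorm(n, sd = sigma)
train <- 1:500                                    # first 500 points form the training set
test  <- 501:1000

# k-NN prediction at a single point (x1, x2): average Y over the k nearest training points
knn_predict <- function(x1, x2, k) {
  d <- sqrt((X1[train] - x1)^2 + (X2[train] - x2)^2)
  mean(Y[train][order(d)[1:k]])
}

# held-out RMSE as a function of k
rmse <- sapply(c(1, 5, 25, 100, 500), function(k) {
  pred <- mapply(knn_predict, X1[test], X2[test], MoreArgs = list(k = k))
  sqrt(mean((Y[test] - pred)^2))
})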


28 / 45
Example: k-nearest-neighbor fit
The graph shows root mean squared error (RMSE) as a function of k:

[Figure: k-NN RMSE versus the number of neighbors k (0 to 500), with a horizontal reference line marking the best linear regression prediction error.]

29 / 45
k-nearest-neighbor fit

Given X⃗ and X, let X(1), . . . , X(k) be the k closest points to X⃗ in the data. You will show that:

BIAS = f(X⃗) − (1/k) Σ_{i=1}^{k} f(X(i)),

and

VARIANCE = σ²/k.

This type of result is why we often refer to a “tradeoff” between bias and variance.

30 / 45
Linear regression: Linear population model

Now suppose the population model itself is linear:²

Y = β0 + Σ_j βj Xj + ε, i.e. Y = X⃗β + ε,

for some fixed parameter vector β.
The errors ε are i.i.d. with E[εi|X] = 0 and Var(εi|X) = σ². Also assume the errors ε are uncorrelated across observations. So for our given data X, Y, we have Y = Xβ + ε, where:

Var(ε|X) = σ² I.

² Note: Here X⃗ is viewed as a row vector (1, X1, . . . , Xp).
31 / 45
Linear regression

Suppose we are given data X, Y and fit the resulting model by ordinary least squares. Let β̂ denote the resulting fit:

f̂(X⃗) = β̂0 + Σ_j β̂j Xj,

with β̂ = (XᵀX)⁻¹XᵀY.

What can we say about bias and variance?
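The closed form β̂ = (XᵀX)⁻¹XᵀY can be computed directly and compared against lm(); a brief R sketch (illustrative only, on freshly simulated data):

set.seed(4)
n  <- 200
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + X1 + 2 * X2 + rnorm(n, sd = 5)
X  <- cbind(1, X1, X2)                            # design matrix with an intercept column
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% Y))   # (X'X)^{-1} X'Y
cbind(by_hand = beta_hat, via_lm = coef(lm(Y ~ X1 + X2)))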

32 / 45
Linear regression

Let’s look at bias:

E[f̂(X⃗) | X⃗, X] = E[X⃗β̂ | X⃗, X]
= E[X⃗(XᵀX)⁻¹XᵀY | X⃗, X]
= X⃗(XᵀX)⁻¹Xᵀ E[Xβ + ε | X⃗, X]
= X⃗(XᵀX)⁻¹(XᵀX)β = X⃗β = f(X⃗).

In other words: the ordinary least squares solution is unbiased!

33 / 45
Linear regression: The Gauss-Markov theorem
In fact: among all unbiased linear models, the OLS solution has minimum variance.
This famous result in statistics is called the Gauss-Markov theorem.

Theorem
Assume a linear population model with uncorrelated errors. Fix a (row) covariate vector X⃗, and let γ = X⃗β = Σ_j βj Xj.
Given data X, Y, let β̂ be the OLS solution, and let γ̂ = X⃗β̂ = X⃗(XᵀX)⁻¹XᵀY.
Let δ̂ = g(X⃗, X)Y be any other estimator for γ that is linear in Y and unbiased for every β: E[δ̂ | X, X⃗] = γ.
Then Var(δ̂ | X, X⃗) ≥ Var(γ̂ | X, X⃗), with equality if and only if δ̂ = γ̂.

34 / 45
The Gauss-Markov theorem: Proof [∗]

Proof: We compute the variance of δ̂.

E[(δ̂ − E[δ̂ | X, X⃗])² | X, X⃗]
= E[(δ̂ − X⃗β)² | X, X⃗]
= E[(δ̂ − γ̂ + γ̂ − X⃗β)² | X, X⃗]
= E[(δ̂ − γ̂)² | X, X⃗]
+ E[(γ̂ − X⃗β)² | X, X⃗]
+ 2E[(δ̂ − γ̂)(γ̂ − X⃗β) | X, X⃗].

Look at the last equality: If we can show the last term is zero, then we would be done, because the first two terms are uniquely minimized if δ̂ = γ̂.

35 / 45
The Gauss-Markov theorem: Proof [∗]

Proof continued: For notational simplicity let c = g(X⃗, X). We have:

E[(δ̂ − γ̂)(γ̂ − X⃗β) | X, X⃗]
= E[(cY − X⃗β̂)(X⃗β̂ − X⃗β) | X, X⃗].

Now using the fact that β̂ = (XᵀX)⁻¹XᵀY, that Y = Xβ + ε, and the fact that E[εεᵀ | X, X⃗] = σ² I, the last quantity reduces (after some tedious algebra) to:

σ² X⃗(XᵀX)⁻¹(Xᵀcᵀ − X⃗ᵀ).

36 / 45
The Gauss-Markov theorem: Proof [∗]

Proof continued: To finish the proof, notice that from unbiasedness we have:

E[cY | X, X⃗] = X⃗β.

But since Y = Xβ + ε, where E[ε | X, X⃗] = 0, we have:

cXβ = X⃗β.

Since this has to hold true for every β, we must have cX = X⃗, i.e., that:

Xᵀcᵀ − X⃗ᵀ = 0.

This concludes the proof.

37 / 45
Linear regression: Variance of OLS

We can explicitly work out the variance of OLS in the linear population model.

Var(f̂(X⃗) | X⃗, X) = Var(X⃗(XᵀX)⁻¹XᵀY | X⃗, X)
= X⃗(XᵀX)⁻¹Xᵀ Var(Y|X) X(XᵀX)⁻¹X⃗ᵀ.

Now note that Var(Y|X) = Var(ε|X) = σ² I, where I is the n × n identity matrix.
Therefore:

Var(f̂(X⃗) | X⃗, X) = σ² X⃗(XᵀX)⁻¹X⃗ᵀ.
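This formula can be checked against simulation: across universes, the empirical variance of f̂(X⃗) = X⃗β̂ at a fixed X⃗ should match σ²X⃗(XᵀX)⁻¹X⃗ᵀ. A hedged R sketch (the particular test point and all names are mine):

set.seed(5)
n <- 1000; m <- 2000; sigma <- 5
X  <- cbind(1, rnorm(n), rnorm(n))                # fixed design matrix (intercept included)
x0 <- c(1, 0.5, -1)                               # a fixed covariate row vector X⃗
beta <- c(1, 1, 2)
XtX_inv <- solve(t(X) %*% X)

fhat_x0 <- replicate(m, {
  Y <- X %*% beta + rnorm(n, sd = sigma)          # fresh noise in each universe
  sum(x0 * (XtX_inv %*% t(X) %*% Y))              # f̂(x0) = x0 β̂
})

c(theory    = sigma^2 * drop(t(x0) %*% XtX_inv %*% x0),
  simulated = var(fhat_x0))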

38 / 45
Linear regression: In-sample prediction error

The preceding formula is not particularly intuitive. To get more intuition, evaluate in-sample prediction error:

Errin = (1/n) Σ_{i=1}^{n} E[(Y − f̂(X⃗))² | X, Y, X⃗ = Xi].

This is the prediction error if we received new samples of Y corresponding to each covariate vector in our existing data. Note that the only randomness in the preceding expression is in the new observations Y.

39 / 45
Linear regression: In-sample prediction error [∗]

Taking expectations over Y and using the bias-variance decomposition on each term of Errin, we have:

E[(Y − f̂(X⃗))² | X, X⃗ = Xi] = σ² + 0² + σ² Xi(XᵀX)⁻¹Xiᵀ.

The first term is irreducible error; the second term is zero because OLS is unbiased; and the third term is the variance when X⃗ = Xi.

Now note that:

Σ_{i=1}^{n} Xi(XᵀX)⁻¹Xiᵀ = Trace(H),

where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
It can be shown that Trace(H) = p (see appendix).
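A quick R check of this identity (illustrative; here p is taken to be the number of columns of X, including the intercept column):

set.seed(6)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n), rnorm(n))       # n x p design matrix with p = 4 columns
XtX_inv <- solve(t(X) %*% X)
H <- X %*% XtX_inv %*% t(X)                       # hat matrix
sum_quad <- sum(sapply(1:n, function(i) X[i, ] %*% XtX_inv %*% X[i, ]))
c(sum_of_quadratic_forms = sum_quad, trace_H = sum(diag(H)), p = ncol(X))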

40 / 45
Linear regression: In-sample prediction error

Therefore we have:

E[Errin | X] = σ² + 0² + (p/n) σ².

This is the bias-variance decomposition for linear regression:
- As before, σ² is the irreducible error.
- 0² is the BIAS²: OLS is unbiased.
- The last term, (p/n)σ², is the VARIANCE. It increases with p.

41 / 45
Linear regression: Model specification

It was very important in the preceding analysis that the covariates we used were the same as in the population model!
This is why:
- The bias is zero.
- The variance is (p/n)σ².
On the other hand, with a large number of potential covariates, will we want to use all of them?

42 / 45
Linear regression: Model specification

What happens if we use a subset of covariates S ⊂ {0, . . . , p}?

- In general, the resulting model will be biased, if the omitted variables are correlated with the covariates in S.³
- It can be shown that the same analysis holds:

E[Errin | X] = σ² + (1/n) Σ_{i=1}^{n} (Xi β − E[f̂(Xi) | X])² + (|S|/n) σ².

(Second term is BIAS².)

So bias increases while variance decreases.

³ This is called the omitted variable bias in econometrics.
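A sketch of this decomposition by simulation, fitting the misspecified model Y ~ 1 + X1 (which omits X2) in the synthetic example from earlier; the way each term is estimated below is my own construction:

set.seed(7)
n <- 1000; m <- 500; sigma <- 5
X1 <- rnorm(n); X2 <- rnorm(n)
f_true <- 1 + X1 + 2 * X2                          # Xi β under the full population model

fhat <- replicate(m, {                             # n x m matrix of in-sample fitted values
  Y <- f_true + rnorm(n, sd = sigma)
  fitted(lm(Y ~ X1))                               # subset model: X2 omitted
})

bias2    <- mean((f_true - rowMeans(fhat))^2)      # (1/n) Σ_i (Xi β − E[f̂(Xi) | X])²
variance <- mean(apply(fhat, 1, var))              # averaged variance term; compare (|S|/n) σ²
c(bias2 = bias2, variance = variance, var_formula = 2 / n * sigma^2)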
43 / 45
Linear regression: Model specification

What happens if we introduce a new covariate that is uncorrelated with the existing covariates and the outcome?
- The bias remains zero.
- However, now the variance increases, since VARIANCE = (p/n)σ².

44 / 45
Appendix: Trace of H [∗]

Why does H have trace p?

- The trace of a (square) matrix is the sum of its eigenvalues.
- Recall that H is the matrix that projects Y into the column space of X. (See Lecture 2.)
- Since it is a projection matrix, for any v in the column space of X, we will have Hv = v.
- This means that 1 is an eigenvalue of H with multiplicity p.
- On the other hand, the remaining eigenvalues of H must be zero, since the column space of X has rank p.

So we conclude that H has p eigenvalues that are 1, and the rest are zero.

45 / 45
