
richardwu.ca

STAT 444/844 Course Notes


Statistical Learning: Function Estimation
Kun Liang • Winter 2019 • University of Waterloo

Last Revision: March 26, 2019

Table of Contents

1 January 8, 2019
  1.1 What is a function?
  1.2 Advertising data example
  1.3 Notation
  1.4 Definitions and properties

2 January 10, 2019
  2.1 Linear models
  2.2 Piecewise linear
  2.3 Piecewise quadratic
  2.4 Weighted least squares

3 January 15, 2019
  3.1 Weighted least squares applications
  3.2 Types of errors

4 January 17, 2019
  4.1 Notes on terminology and lm in R
  4.2 Notes on model selection
  4.3 Geometric interpretation of linear models

5 January 24, 2019
  5.1 Discrepancy function
  5.2 Discrepancy function and log-likelihood
  5.3 Iteratively re-weighted least squares (IRLS)
  5.4 Why IRLS?
  5.5 Robust regression

6 January 29, 2019
  6.1 Remark on robust regression and constants
  6.2 Sensitivity curve and breakdown point
  6.3 Least median squares (LMS)
  6.4 Least trimmed average sum of squares (LTS)

7 January 31, 2019
  7.1 Local linear regression with k-nearest neighbours
  7.2 Piecewise polynomials (splines)
  7.3 Cubic splines

8 February 5, 2019
  8.1 Natural cubic splines (NCS)
  8.2 Fitting NCS
  8.3 General function fitting with basis functions

9 February 7, 2019
  9.1 Choosing k for NCS
  9.2 Smoothing splines

10 February 14, 2019
  10.1 B-splines
  10.2 Smoothing splines in R

11 February 26, 2019
  11.1 KNN local linear regression
  11.2 Kernel local linear regression
  11.3 Linear smoother

12 February 28, 2019
  12.1 Reinsch form of smoother matrix
  12.2 Penalty form of linear smoother
  12.3 Regression vs smoothing splines

13 March 5, 2019
  13.1 Local linear regression as a linear smoother

14 March 7, 2019
  14.1 Multivariate local regression
  14.2 Multivariate regression splines with tensor products
  14.3 Multivariate smoothing splines with thin plates
  14.4 Curse of dimensionality
  14.5 Structured regression additive approach

15 March 12, 2019

16 March 14, 2019

17 March 19, 2019
  17.1 Tuning parameter selection

18 March 26, 2019
  18.1 Tree-based methods

Abstract
These notes are intended as a resource for myself; past, present, or future students of this course; and anyone interested in the material. The goal is to provide an end-to-end resource that covers all material discussed in the course in an organized manner. These notes are my interpretation and transcription of the content covered in lectures. The instructor has not verified or confirmed the accuracy of these notes, and any discrepancies, misunderstandings, typos, etc. as they relate to the course's content are not the responsibility of the instructor. If you spot any errors or would like to contribute, please contact me directly.

1 January 8, 2019
1.1 What is a function?
Suppose we have some measured response variate y and one or more explanatory variates x_1, ..., x_p.
The response and explanatory variables are approximately related through an unknown function µ(x) (to be estimated/learned) where

y = µ(x) + r

where r is the residual that cannot be explained by µ(x).
Some other names for response and explanatory variables include:

response / explanatory
response / predictor
response / design
output / input
dependent / independent
endogenous / exogenous

1.2 Advertising data example


Suppose we want to predict Sales (response) from how much companies spend on TV, Radio, and Newspaper
advertising (explanatory).
The dataset records the TV, Radio, and Newspaper advertising budgets along with Sales for a number of companies.

If we plot Sales against TV, we see there is some positive correlation. Similar plots can be made against Newspaper and Radio.

What if we tried a simple linear model µ̂(x_1) = α̂ + β̂x_1, where x_1 is the TV advertising budget? We obtain estimates α̂ = 7.03 and β̂ = 0.05, which are interpretable. However, if we take a look at the residual plot against x_1, we see that the residuals are not distributed independently of x_1, which violates the Gauss-Markov assumptions.
The residual plots for the models on Newspaper and Radio likewise do not show constant variance across the explanatory variables.
Therefore a linear model does not seem to work (we could of course introduce scaling e.g. log-scaling for the Radio
variate or polynomial terms).

1.3 Notation
Some notes on notation:

• Capital letters are matrices or vectors: A, X, Σ

• Lowercase letters are scalars: a, x, σ


• Arrows on letters are vectors: ~a, ~x

• All vectors are column vectors

• The transpose of any matrix A is A^T (occasionally written A′)

1.4 Definitions and properties


Quadratic form Suppose A = (a_ij)_{n×n} is symmetric, i.e. a_ij = a_ji for all i, j. Then

f = Y^T A Y = Σ_i Σ_j a_ij y_i y_j

is called a quadratic form.

Trace For a square matrix A_{m×m}, the trace is defined as

tr(A_{m×m}) = Σ_{i=1}^{m} a_ii

Note that tr(BC) = tr(CB).

Rank The rank of a matrix denoted rank(A) is the maximum number of linearly independent columns (or rows)
of A.
Note that vectors Y1 , . . . , Yn are linearly independent iff

c1 Y1 + . . . + cn Yn = 0

implies c1 = . . . = cn = 0 (i.e. no non-trivial solution).

Eigenvector and eigenvalue A non-zero vector ~vi is an eigenvector of Am×m if

A~vi = λi~vi i = 1, 2, . . . , m

where λi is the corresponding ith eigenvalue.

Idempotent A matrix A is idempotent if AA = A.


Some notable results:

1. If A is idempotent, then all its eigenvalues are either 0 or 1.


2. If A is idempotent, there exists an orthogonal matrix P such that A = P Λ P^T where

Λ = diag(1, ..., 1, 0, ..., 0)

and tr(A) = rank(A) = tr(Λ), which equals the number of eigenvalues that are 1.


2 January 10, 2019


2.1 Linear models
A linear model is generally of the form

y_i = β_0 + β_1 x_i1 + ... + β_p x_ip + ε_i,  i = 1, ..., n

which holds under the assumptions that

• E(ε_i) = 0

• Var(ε_i) = σ^2 (constant variance)

• ε_1, ..., ε_n are independent

• ε_1, ..., ε_n ~ N(0, σ^2) (i.i.d.)

In matrix form we have

Y_{n×1} = X_{n×(p+1)} β_{(p+1)×1} + ε_{n×1}

where Y = (y_1, ..., y_n)^T, β = (β_0, β_1, ..., β_p)^T, ε = (ε_1, ..., ε_n)^T, and the ith row of X is (1, x_i1, ..., x_ip); in short matrix form, Y = Xβ + ε.


The Least Squares Estimator (LSE) of β minimizes the discrepancy function

S(β) = (Y − Xβ)^T (Y − Xβ)

and has the closed form solution

β̂ = (X^T X)^{-1} X^T Y

The fitted values are thus

Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y = HY

where H = X(X^T X)^{-1} X^T is the hat matrix. Note that H is idempotent and symmetric.
Geometric interpretation of LSE: Ŷ is the projection of Y onto C(X), the column space of X (we can thus see that the fitted residuals are orthogonal to our fitted values in LSE).
The degrees of freedom of our model is n − (p + 1), where p + 1 is the number of free parameters in our model. This is equivalent to n − tr(H), i.e. tr(H) = p + 1.
Under normality:

• β̂ ~ MVN(β, σ^2 (X^T X)^{-1})

• β̂ and σ̂^2 are independent (note σ̂^2 = SSE / df)

• (n − p − 1) σ̂^2 / σ^2 ~ χ^2_{n−p−1}


Let a_p = (1, x_1, ..., x_p)^T (an observation x extended with the intercept term). The (1 − α) prediction interval at a_p is

a_p^T β̂ ± t_{n−p−1, α/2} · σ̂ · sqrt(1 + a_p^T (X^T X)^{-1} a_p)

We can also construct confidence intervals for the mean response (drop the "1 +" term above).
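As a quick sanity check, the closed-form LSE and the hat matrix can be computed directly in R and compared against lm; a minimal sketch on simulated data (the data and variable names are illustrative assumptions, not from the course):

    # Simulated data: n = 100 observations, p = 2 explanatory variates (illustrative)
    set.seed(1)
    n <- 100
    x1 <- runif(n); x2 <- runif(n)
    X <- cbind(1, x1, x2)                       # design matrix with intercept column
    y <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T Y
    H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
    y_hat <- H %*% y                            # fitted values HY

    all.equal(as.vector(beta_hat), unname(coef(lm(y ~ x1 + x2))))  # should be TRUE
    sum(diag(H))                                # tr(H) = p + 1 = 3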

2.2 Piecewise linear


We can specify the following piecewise linear function (continuous, with a change of slope at the knot a) as two linear functions

y = β_0 + β_1 x  if x ≤ a
y = β_2 + β_3 x  if x ≥ a

subject to β_0 + β_1 a = β_2 + β_3 a (continuity at a).
A more convenient way to express the above

y = β0 + β1 x + β2 (x − a)I(x ≥ a)

where I is the indicator function. Note the above is linear in terms of β~ BUT NOT in terms of x. However we can
simply construct a new variate (x − a)I(x ≥ a) from x.
Note that β2 is the change in slope right of a for samples where x ≥ a.
Extension to more than one interesting point (knot) is straightforward.
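A minimal R sketch of fitting the broken-stick model above with lm, constructing the derived variate (x − a)I(x ≥ a) by hand (the knot location and the simulated data are assumptions for illustration):

    set.seed(2)
    x <- sort(runif(200, 0, 10))
    a <- 5                                       # knot (assumed for illustration)
    y <- 1 + 0.5 * x + 2 * pmax(x - a, 0) + rnorm(200, sd = 0.3)

    z <- pmax(x - a, 0)                          # the constructed variate (x - a) I(x >= a)
    fit <- lm(y ~ x + z)
    coef(fit)                                    # coefficient of z is the change in slope right of a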


2.3 Piecewise quadratic


Similar to piecewise linear models, we can specify

y = β_0 + β_1 x + β_2 x^2  if x ≤ a
y = β_3 + β_4 x + β_5 x^2  if x ≥ a

subject to β_0 + β_1 a + β_2 a^2 = β_3 + β_4 a + β_5 a^2 (continuity) and β_1 + 2β_2 a = β_4 + 2β_5 a (differentiability at a).


Alternatively we can express this as one linear model

y = β_0 + β_1 x + β_2 x^2 + β_3 (x − a)^2 I(x ≥ a)

where continuity is trivially satisfied. Note the first derivative is

dy/dx = β_1 + 2β_2 x + 2β_3 (x − a) I(x ≥ a)

where the last term is 0 when x = a, so the additional term does not affect the continuity of the derivative at a.

Remark 2.1. We choose to omit the (x − a)I(x ≥ a) term to ensure y is differentiable at x = a.

2.4 Weighted least squares


Sometimes we would like to give more importance to some observations than others.
Instead of minimizing (Y − Xβ)^T (Y − Xβ), we can minimize

(Y − Xβ)^T W (Y − Xβ)

where W = diag(w_1, ..., w_n) is an n × n diagonal matrix. Here w_i is the weight assigned to observation i (the higher w_i, the more important that observation is).

Claim. The closed form solution is

β̂_WLS = (X^T W X)^{-1} X^T W Y

Proof. Note that

S(β) = (Y − Xβ)^T W (Y − Xβ)
     = Y^T W Y − Y^T W X β − β^T X^T W Y + β^T X^T W X β

Since Y^T W X β = (β^T X^T W Y)^T is a scalar, Y^T W X β = β^T X^T W Y (transposes of scalars are equivalent). Thus

S(β) = Y^T W Y − 2 Y^T W X β + β^T X^T W X β

where −2 Y^T W X β is the "linear term" and β^T X^T W X β is a quadratic form.


Recall that

d(c^T Y)/dY = c^T
d(Y^T A Y)/dY = 2 Y^T A

so

dS(β)/dβ = −2 Y^T W X + 2 β^T X^T W X

Setting dS(β)/dβ = 0 gives

β^T X^T W X = Y^T W X
⇒ (X^T W X) β = X^T W Y      (using W^T = W)
⇒ β = (X^T W X)^{-1} X^T W Y

as claimed.

Here is an alternative proof:


Proof. Let Y* = W^{1/2} Y and X* = W^{1/2} X.
Note that minimizing (Y − Xβ)^T W (Y − Xβ) is equivalent to minimizing (Y* − X*β)^T (Y* − X*β) (simply expand out X* and Y*).
Thus the LSE of β with X*, Y* is

β̂ = (X*^T X*)^{-1} X*^T Y*
   = ((X^T W^{1/2})(W^{1/2} X))^{-1} (X^T W^{1/2})(W^{1/2} Y)
   = (X^T W X)^{-1} X^T W Y

which is equivalent to our previous derivation.
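The WLS closed form can also be checked numerically against lm with its weights argument; a minimal sketch with assumed weights and simulated data:

    set.seed(3)
    n <- 50
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.2 + x)      # non-constant variance (illustrative)
    w <- 1 / (0.2 + x)^2                         # weights w_i = 1 / sigma_i^2

    X <- cbind(1, x)
    W <- diag(w)
    beta_wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)   # (X^T W X)^{-1} X^T W Y

    fit <- lm(y ~ x, weights = w)
    all.equal(as.vector(beta_wls), unname(coef(fit)))        # should be TRUE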

3 January 15, 2019


3.1 Weighted least squares applications
Example 3.1. We can apply weighted least squares to do local regression where we downweight observations
farther away from a given observation.

Example 3.2. Suppose that Var(ε_i) = σ_i^2 (i.e. not all observations are drawn with the same variance). If we want to give more weight to observations that have lower variance, we can set w_i = 1/σ_i^2 to obtain the unbiased estimator of β with the smallest variance (the Best Linear Unbiased Estimator, or BLUE).

3.2 Types of errors


We will use the example of 790 Facebook posts published by a cosmetics company to illustrate.
The population being examined is called the study population (the 790 posts). The analysis of these posts may be applied to a larger population (whether it is future Facebook posts for this company or Facebook posts for any company), which we call the target population.
The difference between the study and target population is called the study error.


In a paper by Soros et al., they ended up using a sample of only 500 posts for confidentiality reasons.
The difference between the sample and study population is called sample error.

4 January 17, 2019


4.1 Notes on terminology and lm in R
• When using lm the intercept term is included by default. To remove it simply specify Y ∼ X - 1.

• Factors are like categorical variables in R: there are a finite number of categories (called factor levels).

• In lm almost any function of variates may appear in the formula, e.g. Y ~ X + sin(X) or Y ~ X + sin(X * Y).
  To specify the model with the product term X · Z, we need to use Y ~ I(X * Z) or Y ~ X:Z instead of Y ~ X * Z, since X * Z represents a full interaction in lm and translates to the model y = αx + βz + γxz + r.

• Some arithmetic operators, e.g. +, -, *, ^, are interpreted as formula operators rather than arithmetic operators in lm. One should wrap them in I(·).

4.2 Notes on model selection

Figure 4.1: Quadratic and cubic polynomial linear models on Facebook data.

In the above figure we see that while both the quadratic and cubic models are global (they predict at any value of x), the quadratic model seems to predict likes returning to 0 as impressions approach infinity.
The cubic function, on the contrary, continues to increase, which makes more sense intuitively; thus examining a model often requires human understanding of the data and the problem.


4.3 Geometric interpretation of linear models


A linear model is a linear combination of functions called generators e.g.

µ(x) = β1 g1 (x) + β2 g2 (x) + β3 g3 (x) + β4 g4 (x)

where g1 , . . . , g4 could be arbitrary continuous functions of x.


All possible linear combinations of the generators form a subspace (the functions generate the subspace). The functions are a basis for this subspace. µ(x) lies in the subspace, whose dimension equals the number of basis functions.
The functions should be linearly independent of each other: otherwise the solution for the parameters will be ill-defined.

5 January 24, 2019


5.1 Discrepancy function
Let the discrepancy function for a fit of parameters β be denoted

S(β) = Σ_{i=1}^{n} ρ(y_i − x_i^T β) = Σ_{i=1}^{n} ρ(r_i)

where ρ is a real-valued loss function (in the OLS case, this was simply the square function) and r_i is the residual for observation i.
Taking the derivative,

dS(β)/dβ = Σ_{i=1}^{n} ρ′(y_i − x_i^T β)(−1) x_i^T = − Σ_{i=1}^{n} ρ′(r_i) x_i^T

Setting this to zero to solve for the extremum, we have

Σ_{i=1}^{n} ψ(r_i) x_i^T = 0^T

where ψ(r) = ρ′(r).

Remark 5.1 (LSE). If ρ(r) = r^2 we recover the LSE; that is, with ψ(r) = 2r,

0 = Σ_{i=1}^{n} 2 r_i x_i = 2 Σ_{i=1}^{n} (x_i y_i − x_i x_i^T β) = 2 (X^T Y − X^T X β)

which solves to exactly our LSE closed form solution.


5.2 Discrepancy function and log-likelihood


Let us compare our discrepancy function with the log-likelihood for linear models:

l(β) = Σ_{i=1}^{n} l_i(β) = Σ_{i=1}^{n} l(r_i)

where l(r_i) = −r_i^2 / (2σ^2), a function only of r_i.
The second equality follows from the following remark:

Remark 5.2. Note that l_i(β) is the ith observation's contribution to l(β), i.e.

l_i(β) = log f(y_i | x_i^T β, σ^2)

For a linear model we have

y_i | x_i ~ N(x_i^T β, σ^2)

so l_i(β) = −r_i^2 / (2σ^2) + C, where C = −(1/2) log(2π) − log σ is a constant.
We let l(r_i) = −r_i^2 / (2σ^2). Since the constant does not change with respect to β, we can omit it from our objective function.

From above we observe that minimizing the discrepancy function is the same as maximizing the log likelihood where
ρ(r) = −l(r) in the discrepancy function.
Definition 5.1 (M-estimator). We call the estimator β that minimizes Σ_{i=1}^{n} ρ(r_i) the M-estimator or maximum-likelihood-type estimator.

5.3 Iteratively re-weighted least squares (IRLS)


Note that the solution turns out to be a WLS estimator:

0 = Σ_{i=1}^{n} ψ(r_i) x_i
  = Σ_{i=1}^{n} (ψ(r_i) / r_i) r_i x_i
  = Σ_{i=1}^{n} w(r_i) (y_i − x_i^T β) x_i
  = Σ_{i=1}^{n} w_i (y_i − x_i^T β) x_i

where we let w_i = w(r_i) = ψ(r_i) / r_i. If we solve this we see that the solution is WLS, where

β̂ = (X^T W X)^{-1} X^T W Y

with W = diag(w_1, ..., w_n).


However, the weights of this WLS depend on the residuals, which in turn depend on β. Given an initial estimate β̂^(0), we can iteratively update the residuals and β̂ to converge to a solution. We proceed as follows:

Initialization Set j = 0 and choose an initial estimate β̂^(0).

Step 1 Compute residuals

r̂_i^(j) = y_i − x_i^T β̂^(j),  i = 1, ..., n

Step 2 Update weights

w_i^(j) = ψ(r̂_i^(j)) / r̂_i^(j)

and let W^(j) = diag(w_1^(j), ..., w_n^(j)).

Step 3 Use WLS to obtain the next estimate:

β̂^(j+1) = (X^T W^(j) X)^{-1} X^T W^(j) Y

Step 4 Set j = j + 1 and return to Step 1 if the convergence criterion is not met.

We call this procedure iteratively re-weighted least squares (IRLS).


The convergence criterion is typically

‖β̂^(j+1) − β̂^(j)‖ ≤ ε

with the L2 (Euclidean) norm, for some small positive constant ε.
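A minimal R sketch of the IRLS loop above, using Huber's ψ (introduced in Section 5.5 below) as an example loss; the tuning constant, initialization, and tolerance are assumptions, and a production implementation such as MASS::rlm also rescales the residuals at each step:

    irls <- function(X, y, c = 1.345, tol = 1e-8, maxit = 50) {
      psi <- function(r) ifelse(abs(r) <= c, r, c * sign(r))   # Huber psi (assumed loss)
      beta <- solve(t(X) %*% X, t(X) %*% y)                    # initialize with OLS (assumption)
      for (j in 1:maxit) {
        r <- as.vector(y - X %*% beta)                         # Step 1: residuals
        w <- ifelse(r == 0, 1, psi(r) / r)                     # Step 2: weights w_i = psi(r_i)/r_i
        W <- diag(w)
        beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # Step 3: WLS update
        if (sqrt(sum((beta_new - beta)^2)) <= tol) break       # Step 4: convergence check
        beta <- beta_new
      }
      beta_new
    }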

5.4 Why IRLS?


Question 5.1. Why do we need to use iteratively re-weighted least squares?
In ordinary least squares with Gaussian response and loss function ri2 , there is no reason to use IRLS since OLS
and IRLS are equivalent (the loss function ri2 simplifies IRLS to OLS).
However in generalized linear models (GLMs) (STAT 431/831) we may have a different type of response (e.g.
Bernoulli 0/1 or categorical) and thus we may define our loss function ρ(ri ) differently.
We may also want to modify our ρ(ri ) to de-emphasize huge outliers (see next section).

5.5 Robust regression


Robust regression tries to de-emphasize the influence of large outliers.

Question 5.2. What loss function ρ (and ψ) should we use?


In ordinary least squares (OLS) we can use

ρ(r) = (1/2) r^2
ψ(r) = r
w(r) = ψ(r) / r = 1

so we essentially have W = I_{n×n}, which devolves into OLS as expected.


Remark 5.3. For OLS, ψ(r) = r is unbounded, so extreme outliers with large residuals have significantly more influence.
Huber (1964) proposed a modified loss function (the Huber loss) which de-emphasizes outliers:

ρ(r) = (1/2) r^2        if |r| ≤ c
ρ(r) = c (|r| − c/2)    if |r| > c

The modified loss function essentially makes the loss linear beyond a certain threshold c. We also have

ψ(r) = r                if |r| ≤ c
ψ(r) = c · sign(r)      if |r| > c

and thus

w(r) = 1                if |r| ≤ c
w(r) = c / |r|          if |r| > c

The ψ and weight functions are shown in Figure 5.1.

Figure 5.1: Left: ψ(r). Right: w(r) for Huber’s loss function.

How do we decide c? Huber suggested c = 1.345 and showed it achieves 95% efficiency relative to the LSE asymptotically when the true distribution is normal (95% efficiency essentially means the variance of the OLS estimates is 95% of the variance of the estimates using Huber's loss).


Question 5.3. Since c is fixed, what if our residuals are scaled to very large or small values (e.g. O(1e5) or
O(1e − 4))? We would have to scale our data beforehand to make it within a sensible range so that c = 1.345 makes
sense.
Sometimes we prefer the ψ function to “redescend” i.e. ψ(r) → 0 when |r| is large (that is: we fully de-emphasize
outliers). Other ψ functions include
Redescending M-estimator (Hampel)

ψ(r) = r                                  if 0 ≤ |r| ≤ a
ψ(r) = a · sign(r)                        if a ≤ |r| ≤ b
ψ(r) = a · (c − |r|)/(c − b) · sign(r)    if b ≤ |r| ≤ c
ψ(r) = 0                                  if |r| > c

The recommended settings are a = 2, b = 4, c = 8 (with appropriately scaled data and residuals).
Tukey's biweight

ψ(r) = r (1 − (r/c)^2)^2    if |r| ≤ c
ψ(r) = 0                    if |r| > c

where c = 4.685 is typically used. This is also designed to have 95% efficiency when the true distribution is normal.
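In R these M-estimators are available through rlm in the MASS package, which also rescales the residuals internally with a MAD-based scale estimate (see the next section); a brief sketch on simulated outlier-contaminated data (the data are an assumption for illustration):

    library(MASS)
    set.seed(5)
    x <- runif(100)
    y <- 1 + 2 * x + rnorm(100, sd = 0.2)
    y[1:5] <- y[1:5] + 10                     # a few gross outliers

    fit_huber <- rlm(y ~ x, psi = psi.huber)     # Huber loss, default k = 1.345
    fit_tukey <- rlm(y ~ x, psi = psi.bisquare)  # Tukey's biweight, default c = 4.685
    fit_ols   <- lm(y ~ x)
    rbind(ols = coef(fit_ols), huber = coef(fit_huber), tukey = coef(fit_tukey))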

6 January 29, 2019


6.1 Remark on robust regression and constants
Remark 6.1. All recommended constants in the various robust regression methods (Huber, Hampel, Tukey) are based on the assumption that Var(r) = 1. Therefore in practice we typically need to scale the residuals, i.e. r_i′ = r_i / s where s is a scale parameter.
One simple solution is to estimate the scale with the median absolute deviation (MAD):

MAD = median(|r_i|)

and let ŝ = MAD / 0.6745. For the standard normal distribution we note that MAD = 0.6745.

6.2 Sensitivity curve and breakdown point


Let T_n(y_1, ..., y_n) be a population attribute (that is, a function of the sample points). To see how sensitive T_n is to an individual data point, define

SC(y) = [ T_n(y_1, ..., y_{n−1}, y) − T_{n−1}(y_1, ..., y_{n−1}) ] / (1/n)

which is the difference between T_n(·) (with the additional point y) and T_{n−1}(·) (without the point y), compared to the contamination size 1/n.
Example 6.1. Let T_n(y_1, ..., y_n) = (1/n) Σ_{i=1}^{n} y_i = ȳ_n (the sample mean).
Note that

T_n = (1/n) ( Σ_{i=1}^{n−1} y_i + y ) = ((n − 1)/n) ȳ_{n−1} + (1/n) y


Note that SC(y) is simply

SC(y) = n (T_n − T_{n−1}) = (n − 1) ȳ_{n−1} + y − n ȳ_{n−1} = y − ȳ_{n−1}
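A quick numerical illustration of SC(y) comparing the sample mean and the sample median (the simulated base sample is an assumption):

    set.seed(6)
    y_base <- rnorm(19)                                  # y_1, ..., y_{n-1} with n = 20
    sc <- function(y_new, stat) {
      n <- length(y_base) + 1
      (stat(c(y_base, y_new)) - stat(y_base)) / (1 / n)  # SC(y) as defined above
    }
    y_new <- c(-10, 0, 10, 100)
    rbind(mean   = sapply(y_new, sc, stat = mean),       # grows without bound with y_new
          median = sapply(y_new, sc, stat = median))     # stays bounded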

Definition 6.1 (Breakdown point). Informally, the breakdown point of a statistic is the largest proportion of
contamination before the statistic breaks down.
Formally, let ~zi = (xi1 , xi2 , . . . , xip , yi )T for i = 1, . . . , n be the ith data vector.
Let Z = (z_1, ..., z_n) be the whole data set and let T be the statistic of interest. The worst error from swapping out m of the z_i's is

e(m; T, Z) = sup_{Z_m*} ‖ T(Z_m*) − T(Z) ‖

where Z_m* is Z with any m of its data vectors replaced.
The breakdown point is then defined as

min { m/n : e(m; T, Z) = ∞ }

Remark 6.2. That is, the breakdown point measures the minimum proportion of contaminated points required to make the statistic arbitrarily bad (unbounded).

Some breakdown point examples:

Sample mean We can swap out just m = 1 point and send it to infinity so that e(1; T, Z) → ∞; thus the breakdown point is 1/n → 0 as n → ∞.

Median The breakdown point is 1/2 as n → ∞: we need to change at least half of the points to arbitrarily influence the median, e.g. make it go to infinity.

k% trimmed mean The k% trimmed mean is defined as the mean after discarding the lowest k% and highest k% of the y_i's.
The breakdown point is k% (we would need to swap out the top k% of points plus one more).

6.3 Least median squares (LMS)


Recall for regression, the LSE of β is

argmin_β Σ_{i=1}^{n} (y_i − x_i^T β)^2

or equivalently

argmin_β average_i (y_i − x_i^T β)^2

To make it robust to "outliers" or contamination, i.e. to ensure we have a high breakdown point, we could consider the least median squares (LMS) estimator:

β̂_LMS = argmin_β median_i (y_i − x_i^T β)^2

which has a breakdown point of 1/2 (compared to a breakdown point of 1/n for OLS).


6.4 Least trimmed average sum of squares (LTS)


Similar to how we made our objective function for OLS more robust by considering the median in LMS, we can also consider the (least) trimmed average sum of squares (LTS) estimator:

β̂_LTS = argmin_β Σ_{i=1}^{k} r_(i)^2

where r_(i)^2 is the ith smallest squared residual.
Note the breakdown point for LTS is (n − k + 1)/n (compared to a breakdown point of 1/n for OLS).
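Both LMS and LTS fits are available via lqs in the MASS package (ltsreg and lmsreg are wrappers around it); a brief sketch on simulated contaminated data:

    library(MASS)
    set.seed(7)
    x <- runif(100)
    y <- 1 + 2 * x + rnorm(100, sd = 0.2)
    y[1:20] <- y[1:20] + 20                      # 20% gross contamination

    coef(lm(y ~ x))                              # OLS: badly affected
    coef(lqs(y ~ x, method = "lms"))             # least median of squares
    coef(lqs(y ~ x, method = "lts"))             # least trimmed squares (the default method)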

7 January 31, 2019


7.1 Local linear regression with k-nearest neighbours
Instead of fitting one linear regression model with all points, we can instead fit local linear regression models for
neighbourhoods of points. In essence we are fitting piecewise linear functions.
We first look at piecewise polynomials and splines.

7.2 Piecewise polynomials (splines)


Definition 7.1 (Spline). We collectively call functions that aim to interpolate and smooth over some data splines. Piecewise polynomials are a common choice for splines.
For continuous piecewise polynomial functions, the simplest form is piecewise linear (as seen before):

which can be specified as a single linear model

f (x) = β0 + β1 x + β2 (x − a)I(x ≥ a)


Remark 7.1. Piecewise linear is also called the broken stick method.

For notation simplicity let us define (x)+ = max(x, 0) such that we have

f (x) = β0 + β1 x + β2 (x − a)+

Thus our basis functions are 1, x, (x − a)+.

This is an example of the truncated power basis. We can easily generalize this model to accommodate more break points or knots.
However, piecewise linear functions are not differentiable at their break points since f′(x) is not continuous.
Recall that for a piecewise quadratic function we have

f (x) = β0 + β1 x + β2 x2 + β3 (x − a)2+

where our basis functions are 1, x, x2 , (x − a)2+ . Note that a piecewise quadratic model f (x) is indeed differentiable
at the break points.

7.3 Cubic splines


Remark 7.2. The most commonly used spline is the cubic spline, which is piecewise cubic with f(x), f′(x), f″(x) all continuous.

Let t_1 < t_2 < ... < t_k be fixed and known knots, where t_1 and t_k are boundary knots and t_2, ..., t_{k−1} are interior knots.
Then the basis consists of the functions 1, x, x^2, x^3, (x − t_1)^3_+, ..., (x − t_k)^3_+. That is, any cubic spline with the above k knots can be expressed as

f(x) = β_0 + β_1 x + β_2 x^2 + β_3 x^3 + Σ_{j=1}^{k} β_{j+3} (x − t_j)^3_+

Remark 7.3. 1. There are k + 4 parameters.

2. f (x) is continuous up to the 2nd derivative.


Proof. This is obviously true between knots. We verify continuity at x = t_i:

f(t_i) = β_0 + β_1 t_i + β_2 t_i^2 + β_3 t_i^3 + Σ_{j=1}^{i−1} β_{j+3} (t_i − t_j)^3_+

noting that (t_i − t_j)^3_+ = 0 for j = i, ..., k.
Note that lim_{x→t_i^−} f(x) = f(t_i), since (x − t_i)_+ = 0 for x < t_i and so lim_{x→t_i^−} (x − t_i)^3_+ = 0.
Also lim_{x→t_i^+} f(x) = f(t_i) since lim_{x→t_i^+} (x − t_i)^3_+ = 0.
Therefore lim_{x→t_i^−} f(x) = lim_{x→t_i^+} f(x) = f(t_i), so f is continuous at t_i for all i = 1, ..., k.
Similarly we can show continuity of f′(x) and f″(x).
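A minimal sketch of fitting a cubic spline by hand with the truncated power basis and lm (the knots and simulated data are assumptions; as discussed later, the B-spline basis from splines::bs is preferred in practice for numerical stability):

    set.seed(8)
    x <- sort(runif(200, 0, 10))
    y <- sin(x) + rnorm(200, sd = 0.3)
    knots <- c(2.5, 5, 7.5)                           # knots t_1, ..., t_k (assumed, k = 3)

    # truncated power basis: 1, x, x^2, x^3, (x - t_j)^3_+
    tp <- sapply(knots, function(t) pmax(x - t, 0)^3)
    fit <- lm(y ~ x + I(x^2) + I(x^3) + tp)           # k + 4 parameters in total
    length(coef(fit))                                 # 3 + 4 = 7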

8 February 5, 2019
8.1 Natural cubic splines (NCS)
A cubic spline is called a natural cubic spline (NCS) with knots {t_1, ..., t_k} if f(x) is linear when x ∉ [t_1, t_k], that is

f(x) = t_0(x) = a_0 + b_0 x    if x < t_1
f(x) = t_k(x) = a_k + b_k x    if x > t_k

Question 8.1. How many free parameters are there in the natural cubic spline?
Answer. Note that in general cubic splines, we have k + 4 parameters. If we constrain our spline to be linear at
both ends (x < t1 and x > tk ) then we essentially remove the quadratic and cubic terms and thus parameters at
each end. So we remove 4 parameters and thus we have k free parameters.
To express an NCS, note that for a regular cubic spline we have

f(x) = β_0 + β_1 x + β_2 x^2 + β_3 x^3 + Σ_{j=1}^{k} β_{j+3} (x − t_j)^3_+

Our constraints are:

f(x) is linear when x < t_1: The terms involving β_4, ..., β_{k+3} are already 0 when x < t_1, so we need only specify that β_2 = β_3 = 0.

f(x) is linear when x > t_k: We require that

Σ_{j=1}^{k} β_{j+3} = 0  and  Σ_{j=1}^{k} β_{j+3} t_j = 0

since when we expand out the cubic terms (for x > t_k)

Σ_{j=1}^{k} β_{j+3} (x^3 − 3 t_j x^2 + 3 t_j^2 x − t_j^3)

we want all the x^3 terms to have coefficient 0 (first term of the expansion) and all the x^2 terms to also have coefficient 0 (second term of the expansion).

These conditions are necessary and sufficient.

Claim. We claim that N_1(x) = 1, N_2(x) = x, and N_j(x) = d_{j−1}(x) − d_1(x) for j = 3, ..., k, where

d_j(x) = [ (x − t_j)^3_+ − (x − t_k)^3_+ ] / (t_k − t_j)

form a basis for NCS.


Proof. From Σ_{j=1}^{k} β_{j+3} = 0 we have β_{k+3} = − Σ_{j=1}^{k−1} β_{j+3}.
Substituting into the second constraint,

0 = Σ_{j=1}^{k−1} β_{j+3} t_j + β_{k+3} t_k = Σ_{j=1}^{k−1} β_{j+3} (t_j − t_k)

i.e.

β_4 (t_k − t_1) = − Σ_{j=2}^{k−1} β_{j+3} (t_k − t_j)

Thus from our original equation (where β_2 = β_3 = 0)

f(x) = β_0 + β_1 x + Σ_{j=1}^{k} β_{j+3} (x − t_j)^3_+
     = β_0 + β_1 x + Σ_{j=1}^{k−1} β_{j+3} (x − t_j)^3_+ + β_{k+3} (x − t_k)^3_+
     = β_0 + β_1 x + Σ_{j=1}^{k−1} β_{j+3} [ (x − t_j)^3_+ − (x − t_k)^3_+ ]
     = β_0 + β_1 x + β_4 [ (x − t_1)^3_+ − (x − t_k)^3_+ ] + Σ_{j=2}^{k−1} β_{j+3} [ (x − t_j)^3_+ − (x − t_k)^3_+ ]
     = β_0 + β_1 x + Σ_{j=2}^{k−1} β_{j+3} [ (x − t_j)^3_+ − (x − t_k)^3_+ − (t_k − t_j) ((x − t_1)^3_+ − (x − t_k)^3_+) / (t_k − t_1) ]
     = β_0 + β_1 x + Σ_{j=2}^{k−1} β_{j+3} (t_k − t_j) [ ((x − t_j)^3_+ − (x − t_k)^3_+) / (t_k − t_j) − ((x − t_1)^3_+ − (x − t_k)^3_+) / (t_k − t_1) ]
     = β_0 + β_1 x + Σ_{j=3}^{k} β′_{j+2} ( d_{j−1}(x) − d_1(x) )

(re-indexing and absorbing the constants (t_k − t_{j−1}) into the coefficients β′_{j+2}), as desired.

Note that we have 4 separate (linearly independent) constraints on the parameters, which is why we lose 4 degrees of freedom.


Let d_j(x) = [ (x − t_j)^3_+ − (x − t_k)^3_+ ] / (t_k − t_j); then an NCS can be expressed as a linear combination of the basis functions

N_0(x) = 1
N_1(x) = x
N_j(x) = (t_k − t_j) [ d_j(x) − d_1(x) ],  j = 2, ..., k − 1

More conveniently, we can express the NCS as

f(x) = Σ_{j=1}^{k} β_j N_j(x)

where N_1(x) = 1, N_2(x) = x and N_j(x) = d_{j−1}(x) − d_1(x) for j = 3, ..., k.

Remark 8.1.
1. If x < t_1, then d_j(x) = 0 and hence N_j(x) = 0 for j = 3, ..., k.

2. If x > t_k, then d_j(x) = [ (x − t_j)^3 − (x − t_k)^3 ] / (t_k − t_j) reduces to a quadratic function of x in which the coefficient of the x^2 term is 3.
Since N_j(x) = d_{j−1}(x) − d_1(x), the quadratic terms cancel and N_j(x) is a linear function of x for x > t_k, j = 3, ..., k.

Definition 8.1 (Regression splines). The fixed-knot splines, such as cubic splines and NCS, are called regression
splines.

8.2 Fitting NCS


Let y_i = f(x_i) + ε_i for some response y_i, explanatory variate x_i, and some arbitrary continuous function f(·).
We can approximate/regress f(x) by Σ_{j=1}^{k} β_j N_j(x) (an NCS), i.e.

y_i ≈ Σ_{j=1}^{k} β_j N_j(x_i) + ε_i

Now we simply fit a linear model with design matrix

X = [ N_1(x_1)  ...  N_k(x_1) ]
    [    ...    ...     ...   ]
    [ N_1(x_n)  ...  N_k(x_n) ]

with β = (β_1, ..., β_k)^T and Y = (y_1, ..., y_n)^T.

Remark 8.2. The problem becomes a regular regression problem with design matrix generated from the basis
functions Nj ’s.
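A minimal sketch of building the NCS design matrix from the basis functions N_1(x) = 1, N_2(x) = x, N_j(x) = d_{j−1}(x) − d_1(x) and fitting it with lm (the knots and data are assumptions for illustration; splines::ns provides the same functionality):

    ncs_basis <- function(x, knots) {
      k <- length(knots)
      d <- function(j) (pmax(x - knots[j], 0)^3 - pmax(x - knots[k], 0)^3) / (knots[k] - knots[j])
      X <- cbind(1, x)                               # N_1(x) = 1, N_2(x) = x
      for (j in 3:k) X <- cbind(X, d(j - 1) - d(1))  # N_j(x) = d_{j-1}(x) - d_1(x)
      X
    }

    set.seed(9)
    x <- sort(runif(200, 0, 10))
    y <- sin(x) + rnorm(200, sd = 0.3)
    knots <- seq(1, 9, length.out = 5)               # k = 5 knots (assumed)

    X <- ncs_basis(x, knots)                         # n x k design matrix
    fit <- lm(y ~ X - 1)                             # intercept is already in the basis
    length(coef(fit))                                # k = 5 free parameters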


8.3 General function fitting with basis functions


We extend our method for fitting NCS: more generally, for a p-dimensional input vector x, we can consider the following approximation to f(x)

f(x) = Σ_{j=1}^{k} β_j h_j(x)

where {h_j} is a set of basis functions.
That is, we approximate f(x) as a linear basis expansion. Then we form the design matrix X = [h_j(x_i)], where i indexes the row (ith sample) and j indexes the column (jth basis function).
Some examples:

1. h_j(x) = x_j for j = 1, ..., p is the original linear model, where each basis function picks out the jth component.

2. h_j(x) = log(x_j) (or other arbitrary transformations).

3. h_j(x) = x_j^k for k ∈ ℕ gives polynomial regression.

4. h_j(x) = N_j(x) gives the NCS.

9 February 7, 2019
9.1 Choosing k for NCS
Recall that the basis functions for NCS are

N_1(x) = 1
N_2(x) = x
N_j(x) = d_{j−1}(x) − d_1(x),  j = 3, ..., k

We still need to choose a k and our knots t1 , . . . , tk .


Some examples of how to choose k and knots:

Equal-distance knots We choose k first arbitrarily, e.g. k = 5, then use an equally spaced grid between the min and max of the x_i's.

Quantiles Quantiles are also a popular choice, e.g. placing knots at the i/(k−1) quantiles of the x_i's for i = 0, ..., k − 1.

Degrees of freedom Alternatively we can instead specify the degrees of freedom for an NCS, i.e. the number of free parameters. For df = k we would have k − 2 knots (if an intercept term is also included). Usually knots are placed at equally spaced quantiles.

9.2 Smoothing splines


Consider the following penalized regression problem:

f̂_λ(x) = argmin_f Σ_{i=1}^{n} [y_i − f(x_i)]^2 + λ ∫_{−∞}^{∞} [f″(x)]^2 dx

Remark 9.1.
1. Σ_{i=1}^{n} [y_i − f(x_i)]^2 is the sum of squared residuals, which measures the goodness of fit.


2. ∫_{−∞}^{∞} [f″(x)]^2 dx measures the "roughness" of f(x).

Remark 9.2. Note that we try to minimize the integral of f″(x) squared, which essentially pushes f″(x) towards 0.
For example, if f(x) = β_0 + β_1 x (a straight line) then f″(x) = 0 and thus ∫_{−∞}^{∞} [f″(x)]^2 dx = 0, i.e. no penalty for a linear fit.

3. The role of λ: if λ = 0 then we have no roughness penalty, we minimize the SSR over all functions, and f̂_λ(x) is an interpolating curve.
If λ = ∞ then we force ∫_{−∞}^{∞} [f″(x)]^2 dx = 0 and thus f̂_λ(x) is the ordinary least squares (straight-line) fit.

4. Remarkably, we can show that f̂_λ(x) is a natural cubic spline with knots at the distinct values of {x_i}_{i=1}^{n}.

5. The NCS is the "smoothest" interpolator.
For any function f(x), if we only know its value at k points {f(t_i)}_{i=1}^{k}, then we can use {t_i, f(t_i)}_{i=1}^{k} to determine an NCS s(x) such that s(t_i) = f(t_i) for i = 1, ..., k.

Claim.

∫_{−∞}^{∞} [s″(x)]^2 dx ≤ ∫_{−∞}^{∞} [f″(x)]^2 dx

Proof. Left as exercise in assignment.

Definition 9.1 (Smoothing spline). We call the function fitted by the penalized regression a smoothing spline.

We now determine the β for the NCS smoothing spline. Note that

f̂_λ(x) = Σ_{j=1}^{k} β_j N_j(x)

so that

β̂_λ = argmin_β Σ_{i=1}^{n} [ y_i − Σ_{j=1}^{k} β_j N_j(x_i) ]^2 + λ ∫_{−∞}^{∞} [ Σ_{j=1}^{k} β_j N″_j(x) ]^2 dx

We can re-express this in matrix notation:

Σ_{i=1}^{n} [ y_i − Σ_{j=1}^{k} β_j N_j(x_i) ]^2 = (Y − Xβ)^T (Y − Xβ)

where X is the n × k matrix with (i, j)-th entry N_j(x_i).


Also,

∫_{−∞}^{∞} [ Σ_{j=1}^{k} β_j N″_j(x) ]^2 dx
  = ∫_{−∞}^{∞} [ Σ_{j=1}^{k} β_j N″_j(x) ] [ Σ_{l=1}^{k} β_l N″_l(x) ] dx
  = Σ_{j=1}^{k} Σ_{l=1}^{k} β_j β_l ∫_{−∞}^{∞} N″_j(x) N″_l(x) dx
  = β^T N β

where N = (N_jl) with (j, l)-th entry N_jl = ∫_{−∞}^{∞} N″_j(x) N″_l(x) dx.
Therefore we can write

S(β) = (Y − Xβ)^T (Y − Xβ) + λ β^T N β
     = Y^T Y − β^T X^T Y − Y^T X β + β^T X^T X β + λ β^T N β
     = Y^T Y − 2 Y^T X β + β^T (X^T X + λN) β

and β̂_λ = argmin_β S(β).
Recall that for a matrix A and vector c,

∂(c^T Y)/∂Y = c^T
∂(Y^T A Y)/∂Y = 2 Y^T A^T

thus we have

∂S(β)/∂β = −2 Y^T X + 2 β^T (X^T X + λN)^T = 0
⇒ (X^T X + λN) β̂_λ = X^T Y
⇒ β̂_λ = (X^T X + λN)^{-1} X^T Y

To calculate the effective number of parameters or effective df (edf): recall for NCS we have k knots and in OLS
with Xn×p
Ŷ = HY = X(X T X)−1 X T Y
where the number of parameters is df = tr(H).
Now in the smoothing spline, we have

Ŷλ = X(X T X + λN )−1 X T Y = Aλ Y

where the effective number of parameters is dfλ = tr(Aλ ).

Remark 9.3. When λ → ∞, dfλ → 2.


10 February 14, 2019


10.1 B-splines
A computationally efficient alternative to cubic splines and NCS is the B-spline.
The basis functions of B-splines are strictly local. For a degree d B-spline (e.g. d = 3 for a cubic B-spline), each basis function is non-zero over an interval spanned by d + 2 adjacent knots (and zero everywhere else), i.e. d + 1 intervals.
Advantages of the B-spline:

1. Numerically stable: recall cubic splines have x^3 terms which grow fast as x → ∞, whereas each B-spline basis function is non-zero only over a small window of adjacent knots.

2. Computationally efficient when the number of knots k is large. More specifically, least squares estimation with n observations and k variables takes O(nk^2 + k^3) operations; if k → n then this becomes O(n^3). B-splines reduce this cost to O(n), since at any x only O(d) basis functions are non-zero, with d a constant.

We define the degree-0 B-spline basis:

B_{i,0}(x) = 1 if t_i ≤ x < t_{i+1}, and 0 otherwise

i.e. B_{i,0}(x) is the interval indicator function, also known as the Haar basis function.
In general, for a degree-d B-spline we define its basis recursively as

B_{i,d}(x) = (x − t_i)/(t_{i+d} − t_i) · B_{i,d−1}(x) + (t_{i+d+1} − x)/(t_{i+d+1} − t_{i+1}) · B_{i+1,d−1}(x)

After we compute the basis functions given our x, we can fit the model as an OLS or robust LR model. In R, we can
use the function bs in the package splines to generate the B-spline basis functions (note there are no intercepts
included). This will give us a design matrix with d + k basis functions (so d + k degrees of freedom) where d is the
degree and k is the number of knots (d starts at 0 for the constant function).
Then we simply feed this to lm or rlm as usual (which will add the intercept term). Note that lm will add one more degree of freedom with the intercept, for (d + 1) + k degrees of freedom in total.
Similarly we can generate NCS basis functions with ns in splines.
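A brief sketch of this workflow (the simulated data and df choices are assumptions):

    library(splines)
    set.seed(10)
    x <- sort(runif(200, 0, 10))
    y <- sin(x) + rnorm(200, sd = 0.3)

    B <- bs(x, df = 8, degree = 3)        # B-spline basis, no intercept column
    dim(B)                                # 200 x 8: d + k columns (d = 3, k = 5 interior knots here)
    fit <- lm(y ~ B)                      # lm adds the intercept: (d + 1) + k df in total
    ns_fit <- lm(y ~ ns(x, df = 6))       # analogous natural cubic spline fit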

10.2 Smoothing splines in R


To fit smoothing splines (penalized splines), we can use the smooth.spline function.
First, specify an appropriate number of degrees of freedom df.
Let nx be the number of distinct values of x. smooth.spline uses nx knots if nx ≤ 49 and O(nx^0.2) knots if nx > 49. Although the result is not strictly a smoothing spline when nx > 49, it is very close to one.

Remark 10.1. Since smoothing splines are penalized for roughness, we can afford to choose a high number of knots.
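A minimal usage sketch (the data are simulated and df is an assumed choice; smooth.spline can also pick λ itself by cross-validation):

    set.seed(11)
    x <- runif(300, 0, 10)
    y <- sin(x) + rnorm(300, sd = 0.3)

    fit_df <- smooth.spline(x, y, df = 8)       # specify the effective degrees of freedom
    fit_cv <- smooth.spline(x, y)               # lambda chosen by (generalized) cross-validation
    fit_cv$df                                   # resulting effective df
    grid <- seq(0, 10, length.out = 200)
    pred <- predict(fit_cv, x = grid)           # list with $x and $y, e.g. for plotting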

11 February 26, 2019


11.1 KNN local linear regression
An alternative to fitting splines with prespecified knots is to fit the data more locally by considering a neighbourhood
at each point x.


We take the k nearest neighbours of every x and compute the mean response value of those neighbours. This becomes the fitted value at x, and we may interpolate linearly, quadratically, or with a higher-order polynomial between the fitted points.
This can be accomplished in R with knn.reg from the FNN package.
This can be accomplished in R with knn.reg from the FNN package.

Remark 11.1. As the neighbourhood size k increases, the fitted function becomes smoother.

Instead of taking the mean response based on the k neighbours, we can instead use the value from any fitted model
based on those k neighbours (e.g. lm, rlm with Huber, Tukey’s, etc., ltsreg).

Remark 11.2. We can think of KNN local linear regression as weighted linear regression where wj = 0 if xj is
outside the neighbourhood of xi .
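A short sketch using knn.reg from FNN (the value of k and the simulated data are assumptions):

    library(FNN)
    set.seed(12)
    x <- sort(runif(300, 0, 10))
    y <- sin(x) + rnorm(300, sd = 0.3)

    grid <- seq(0, 10, length.out = 200)
    fit <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 15)
    fit$pred        # mean response of the 15 nearest neighbours at each grid point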

11.2 Kernel local linear regression


In the KNN case, we essentially assign equal weight to all k neighbours. Instead, we can use a kernel to assign higher weights to points closer to x and lower weights to points farther away.
A kernel K(t) must satisfy

∫ K(t) dt = 1,  ∫ t K(t) dt = 0,  ∫ t^2 K(t) dt < ∞

where the first two standardize K(t) and the last ensures weight is spread along the real line without too much weight in the extremes.
Some examples of kernels:

Epanechnikov K(t) = (3/4)(1 − t^2) I(|t| ≤ 1)

Tukey's tri-cube K(t) = (1 − |t|^3)^3 I(|t| ≤ 1)

Gaussian K(t) = (1/√(2π)) e^{−t^2/2}

Figure 11.1: Graph of kernels used in kernel local linear regression.

Thus, for a bandwidth parameter h, the weight for neighbour x_i at the current point x is defined as

w_i = w(x, x_i) = K((x_i − x)/h) / Σ_{j=1}^{N} K((x_j − x)/h)


For the mean response µ̂(x) we take the weighted average, also known as the Nadaraya-Watson estimator:

µ̂(x) = Σ_{i=1}^{N} w_i y_i
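A hand-rolled sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the bandwidth h and the data are assumptions; in practice ksmooth or loess would typically be used):

    nw <- function(x0, x, y, h) {
      w <- dnorm((x - x0) / h)                  # Gaussian kernel weights K((x_i - x)/h)
      w <- w / sum(w)                           # normalize so the weights sum to 1
      sum(w * y)                                # weighted average = fitted mean response at x0
    }

    set.seed(13)
    x <- runif(300, 0, 10)
    y <- sin(x) + rnorm(300, sd = 0.3)
    grid <- seq(0, 10, length.out = 200)
    mu_hat <- sapply(grid, nw, x = x, y = y, h = 0.5)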

Remark 11.3. The boundary effect occurs when no points lie on one side of the kernel and thus the weights are
distributed in a biased way to the points on the other side. This occurs at the extremes of the explanatory variate
space.

Figure 11.2: The boundary effect causes the kernel fitted line (green) at the left end to bias the fitted value higher, since most of the available points are on the right of the kernel and have a higher response value.

To avoid the boundary effect, local regression is typically used.
Local linear regression is simple and performs well at the boundaries, while local quadratic performs well at interior points. Higher-order local regression is rarely used.

In R we can use the loess function where span defines the proportion of points in the local neighbourhood. The
kernel used is Tukey’s tri-cube.

11.3 Linear smoother


A smoothing spline with fixed λ is an example of a linear smoother, where

Ŷ_λ = X(X^T X + λN)^{-1} X^T Y = S_λ Y

A linear smoother produces fitted values that are linear combinations of the y_i's, with S_λ being the smoother matrix.
Consider a regression with a small number p of basis functions. That is, for basis functions b_1, ..., b_p (e.g. the NCS basis), let B be the n × p matrix with (i, j)-th element

(B_{n×p})_{i,j} = b_j(x_i)


Then the ordinary least squares (OLS) fitted values are

Ŷ = B(B T B)−1 B T Y

where HB = B(B T B)−1 B T (the hat matrix).

Claim. It can be shown (for any hat matrix) that the column space of B satisfies C(B) = C(HB ). Note however
B has p columns whereas HB has n columns, so there is some redundancy.

Proof. Since H_B is symmetric we can take its eigendecomposition

H_B = U P U^T = (u_1, ..., u_n) diag(ρ_1, ..., ρ_n) (u_1, ..., u_n)^T = Σ_{i=1}^{n} ρ_i u_i u_i^T

where ρ_1 ≥ ... ≥ ρ_n ≥ 0 are the eigenvalues and u_1, ..., u_n are the corresponding orthonormal eigenvectors.
Then

Ŷ = H_B Y = Σ_{i=1}^{n} ρ_i u_i u_i^T Y = Σ_{i=1}^{n} ρ_i ⟨u_i, Y⟩ u_i

where ⟨·, ·⟩ denotes the inner product. Thus Y is first projected onto the orthonormal basis {u_1, ..., u_n} and then modulated by {ρ_1, ..., ρ_n}.
Because H_B is idempotent and assuming B is of full rank (i.e. rank(B) = p), we have P^m = P for all m ∈ ℕ and thus

ρ_i = 1 if i = 1, ..., p
ρ_i = 0 if i = p + 1, ..., n

12 February 28, 2019


12.1 Reinsch form of smoother matrix
For a smoothing spline, assuming the n values x_i are distinct, we have the n × n matrix

(X_{n×n})_{i,j} = h_j(x_i)

where h_j is the jth NCS basis function.

Claim. Given the smoothing spline

Ŷ_λ = X(X^T X + λN)^{-1} X^T Y = S_λ Y

we claim the smoother matrix S_λ can be written in the Reinsch form

S_λ = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1} does not depend on λ and N is the roughness penalty matrix for NCS, that is, N = (N_jl)_{n×n} with

N_jl = ∫_{−∞}^{∞} N″_j(x) N″_l(x) dx

Proof. First note that X is a square (and invertible) matrix since we assume all x_i values are distinct, so

S_λ = X(X^T X + λN)^{-1} X^T
    = [ (X^T)^{-1} (X^T X + λN) X^{-1} ]^{-1}
    = [ I + λ (X^T)^{-1} N X^{-1} ]^{-1}
    = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1}.

12.2 Penalty form of linear smoother


It can be shown that Ŷ_λ = S_λ Y is the solution to the minimization problem

min_µ (Y − µ)^T (Y − µ) + λ µ^T K µ

where K is known as the penalty matrix. K is symmetric and has eigendecomposition

K = V D V^T

with D = diag(d_1, ..., d_n) the eigenvalues, d_i ≥ 0, and V = (v_1, ..., v_n) an orthonormal matrix of eigenvectors.

Remark 12.1.
1. K = Σ_{i=1}^{n} d_i v_i v_i^T (a sum of rank-one matrices) and µ^T K µ = Σ_{i=1}^{n} d_i ⟨v_i, µ⟩^2.
This implies that µ is penalized more in the directions of the v_i's with large d_i values.

2. It can be shown that d_{n−1} = d_n = 0.


It is straightforward to show that

S_λ = V (I + λD)^{-1} V^T

is the eigendecomposition of S_λ, with eigenvalues

ρ_i(λ) = 1 / (1 + λ d_{n−i+1}),  i = 1, ..., n

Remark 12.2.
1. S_λ and K share the same eigenvectors, which do not depend on λ.

2. Large eigenvalues d_i of D lead to small eigenvalues ρ_i(λ) of S_λ.

3. Large λ leads to small eigenvalues ρ_i(λ).

4. ρ_1(λ) = ρ_2(λ) = 1 since d_{n−1} = d_n = 0. All other ρ_i(λ)'s are less than 1.


Returning to our fitted values µ̂:

µ̂ = S_λ Y = Σ_{i=1}^{n} ρ_i(λ) v_i v_i^T Y = Σ_{i=1}^{n} ρ_i(λ) ⟨v_i, Y⟩ v_i

That is, we project Y onto each eigenvector v_i and scale each projection by the corresponding eigenvalue ρ_i(λ).
Note the first two projections are not shrunk (ρ_1(λ) = ρ_2(λ) = 1) but the rest are shrunk towards 0 since ρ_i(λ) < 1.

12.3 Regression vs smoothing splines


How do smoothing splines compare to regression splines (e.g. fitted by OLS)?
To draw a parallel with OLS, recall

Ŷ = Σ_{i=1}^{n} ρ_i ⟨u_i, Y⟩ u_i

where ρ_i = 1 for i = 1, ..., p and ρ_i = 0 for i = p + 1, ..., n.
Comparing the two: OLS keeps the eigenvectors whose eigenvalues are 1 and drops the others (hard thresholding), whereas smoothing splines shrink Y in the direction of each eigenvector according to the corresponding eigenvalue ρ_i(λ).
For this reason, OLS or regression splines are called projection smoothers and smoothing splines are called shrinking smoothers.

Remark 12.3.
1. The sequence of v_i's, ordered by decreasing eigenvalues ρ_i(λ), appears to increase in complexity (i.e. roughness or "wiggliness").

2. Recall the effective degrees of freedom

df_λ = tr(S_λ) = Σ_{i=1}^{n} 1 / (1 + λ d_i)

thus if we want a specific df_λ we can simply do a line search for the corresponding λ (since the d_i's are fixed).

13 March 5, 2019
13.1 Local linear regression as a linear smoother
We show that local linear regression is indeed a linear smoother.
For a target value x (the point about which we are doing local regression), local linear regression is equivalent to solving the weighted optimization problem

argmin_{α,β} Σ_{i=1}^{n} k_h(x − x_i) (y_i − (α + β x_i))^2

and our fitted value at x from local linear regression is f̂(x) = α̂(x) + β̂(x) x (k_h is the kernel function used in the local regression).
Note the above optimization problem has an explicit solution. Let

B = [ 1  x_1 ]
    [ 1  x_2 ]
    [ ...    ]
    [ 1  x_n ]

W(x) = diag( k_h(x − x_1), ..., k_h(x − x_n) )

where B is n × 2 and W(x) is n × n.
Then f̂(x) can be rewritten as

f̂(x) = (1, x) (B^T W(x) B)^{-1} B^T W(x) Y = l^T(x) Y

where l^T(x) = (1, x) (B^T W(x) B)^{-1} B^T W(x).


Remark 13.1.
1. Local linear regression is a linear smoother, i.e. f̂(x) is a linear combination of the y_i's.

2. Recall that for smoothing splines Ŷ_λ = S_λ Y and we define df_λ = tr(S_λ).
If we let

L_h = [ l^T(x_1) ]
      [ l^T(x_2) ]
      [   ...    ]
      [ l^T(x_n) ]

where L_h is n × n, then Ŷ_h = L_h Y and we define df_h = tr(L_h).

14 March 7, 2019
14.1 Multivariate local regression
We want to make a prediction at a target value x = (x_1, ..., x_p)^T. A simple kernel function we could use for local regression in R^p is

K_h(x) = (1/h) K( ‖x‖ / h )

where ‖·‖ is the Euclidean norm.

Remark 14.1. The issue with the Euclidean norm is that we weight every coordinate/variate x_i equally. If a variate has less importance, we should not give it the same weight as the other variates in the kernel.
That is, the kernel above is spherical about the variates x_i and gives equal weight to each coordinate.
Instead we can use structured local regression. This is a more general approach where we use a positive semidefinite matrix A_{p×p} to weight each coordinate, that is

K_{h,A}(x) = (1/h) K( x^T A x / h )


Typically A is chosen to be symmetric, so there are p(p+1)/2 free elements in A.
A popular choice is a diagonal matrix, i.e. A = diag(a_1, ..., a_p), where a_i modulates the variance of the kernel along the ith dimension and hence how strongly that dimension is smoothed.

14.2 Multivariate regression splines with tensor products


An extension of regression splines to R^p is given by tensor products.

Example 14.1. Let p = 2 and x = (x_1, x_2)^T ∈ R^2.
Let {h_11, h_12, ..., h_1k1} be a set of spline basis functions in x_1 and {h_21, ..., h_2k2} a set of spline basis functions in x_2.
Consider the tensor product basis with k_1 × k_2 functions

g_jk(x) = h_1j(x_1) h_2k(x_2)

and

f(x) = Σ_{j,k} β_jk g_jk(x)

that is, f is a multiplicative/interaction-based model.

We note that the number of parameters with a tensor product basis grows exponentially with p.

14.3 Multivariate smoothing splines with thin plates


An extension of smoothing splines to R^p is given by thin plate splines.

Example 14.2. Let p = 2. We solve for

f̂_λ = argmin_f { Σ_{i=1}^{n} [y_i − f(x_i)]^2 + λ J(f) }

where

J(f) = ∫∫ [ (∂^2 f / ∂x_1^2)^2 + 2 (∂^2 f / ∂x_1 ∂x_2)^2 + (∂^2 f / ∂x_2^2)^2 ] dx_1 dx_2

We can show the optimal solution has the form (at a given target x)

f_λ(x) = β_{0,λ} + β_λ^T x + Σ_{i=1}^{n} α_i h_i(x)

where the generator function is the radial basis function h_i(x) = ‖x − x_i‖^2 log ‖x − x_i‖.
Note that the x_i, i = 1, ..., n, are our control/knot points.

14.4 Curse of dimensionality


Any dataset in a high-dimensional space R^p is sparse.

Example 14.3. Consider fixed-width neighbourhoods in e.g. local regression.


For p = 1, assume data points are uniformly distributed across the domain x ∈ [0, 1]. Suppose for a target x we
have a neighbourhood x ± 0.05. Then we will capture ≈ 10% of points.


For p = 2 again assume data points are uniformly distributed across [0, 1] × [0, 1]. Suppose we have a neighbourhood
x ± 0.1: note that we capture an area of 0.2 × 0.2 = 0.04 which only captures ≈ 4% of points!
As p increases, even if we widen the neighbourhood along each dimension, the data become so sparse that the neighbourhood captures relatively few points.

14.5 Structured regression additive approach


For predictor values x = (x_1, ..., x_p)^T, instead of modelling

µ(x) = f(x_1, ..., x_p)

which may be an arbitrary, possibly interactive function of every variate, we consider the additive model

µ(x) = α + Σ_{j=1}^{p} f_j(x_j)

where the f_j's can be linear functions or any smooth functions of x_j.

Remark 14.2. We can extend the above model to allow a limited number of interactions.
For example, if p is small we can consider additional pairwise interactions:

f(x) = α + Σ_{j=1}^{p} f_j(x_j) + Σ_{j=1}^{p} Σ_{k=1}^{p} f_jk(x_j, x_k)

We may also be more selective, e.g. for p = 5,

f(x) = α + f_1(x_1) + f_2(x_2, x_3) + f_3(x_4, x_5)
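Additive models of this form can be fit in R with gam (the mgcv package is one option); a brief sketch with assumed simulated variates:

    library(mgcv)
    set.seed(14)
    n <- 400
    x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
    y <- sin(2 * pi * x1) + (x2 - 0.5)^2 + 0.5 * x3 + rnorm(n, sd = 0.2)

    fit <- gam(y ~ s(x1) + s(x2) + x3)     # smooth f_1, f_2 plus a linear term for x3
    summary(fit)
    # a limited interaction can be added with a bivariate smooth, e.g. s(x2, x3)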

15 March 12, 2019


Review of assignment 3 solutions and slides on generalized additive models.

16 March 14, 2019


Review of assignment 4 solutions and more slides on generalized additive models.

17 March 19, 2019


17.1 Tuning parameter selection
Suppose we are given the true model y_i = f(x_i) + ε_i where E(ε_i) = 0 and Var(ε_i) = σ^2, observations (x_i, y_i)_{i=1}^{n} (the training set T), and a prediction model f̂(x).
The methods considered so far typically involve some tuning parameter, e.g. the number of knots in a regression spline, λ in a smoothing spline, the bandwidth h in local regression, etc.
These tuning parameters are "complexity parameters" that regulate the complexity (i.e. degrees of freedom) of the prediction model.
In prediction we want f̂ to provide good estimates for future observations.


Definition 17.1 (Test error). We define the test error as

Err_T = E[ (Y − f̂(X))^2 | T ]

where the expectation is over the true joint distribution of (X, Y), i.e. the population.

Definition 17.2 (Expected test error). We define the expected test error as

Err = E[ (Y − f̂(X))^2 ] = E(Err_T)

where the expectation is over the distribution of (X, Y) and the random generation of training sets.

Definition 17.3 (Training error). We define the training error as

err = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))^2 = RSS / n

However, err uses the same data twice (once for producing f̂ and once for calculating the error) and does not track Err well.
We note that as the model complexity grows, the training error keeps decreasing, while the test error starts to increase beyond the optimal complexity. Test error increases after a certain point due to overfitting to the training set.
To regularize our model for complexity, some solutions include:
Information criteria Let d denote the number of parameters. We define the Akaike Information Criterion (AIC) as

AIC = 2d + n log(RSS)

and the Bayesian Information Criterion (BIC), which has a larger regularization effect, as

BIC = log(n) d + n log(RSS)

Remark 17.1. The information criteria are only useful for comparing models for the same training sample.
Their absolute values are meaningless.


Cross-validation (CV) Recall we can estimate the test error Err by repeatedly sampling test sets from the
population.
We hold out a part of the training set as our “population” (cross-validation set) and construct our model on
the remaining training set. We can then validate our complexity and model on the cross-validation set as an
estimate of the test error.
This error on the cross-validation set is called the cross-validation error.

k-fold CV
1. Given a training set T, randomly partition it into k disjoint equal-sized parts ("folds") T_1, ..., T_k.

2. For every i = 1, ..., k, train the model on the folds T_1, ..., T_{i−1}, T_{i+1}, ..., T_k to obtain f̂^(i), then evaluate f̂^(i) on T_i to get the cross-validation error for fold T_i. Let i(j) be the index of the fold containing example j; then the overall cross-validation error is

CV(f̂) = (1/n) Σ_{j=1}^{n} (y_j − f̂^(i(j))(x_j))^2

The choice k = n is called leave-one-out (LOO) CV. The justification for LOO CV: let f̂^{−i}(x_i) be the fitted value at x_i without using (x_i, y_i) during training. Then

E(y_i − f̂^{−i}(x_i))^2 = E(y_i − f(x_i) + f(x_i) − f̂^{−i}(x_i))^2
                        = E(y_i − f(x_i))^2 + 2 E[ε_i (f(x_i) − f̂^{−i}(x_i))] + E(f(x_i) − f̂^{−i}(x_i))^2
                        = σ^2 + E(f(x_i) − f̂^{−i}(x_i))^2
                        ≈ σ^2 + E(f(x_i) − f̂(x_i))^2

since the cross term vanishes (ε_i is independent of f̂^{−i}) and leaving out a single observation changes the fit very little. That is, LOO CV provides an approximate estimate of the test error Err (up to the constant σ^2).
Thus minimizing the LOO CV error is essentially minimizing the test error.

Remark 17.2. Since LOO CV requires fitting the model n times, for large n this can be infeasible.

Remark 17.3. For most linear smoothers Ŷ = SY, where S is the smoother matrix, it can be shown that

CV(f̂) = (1/n) Σ_{i=1}^{n} (y_i − f̂^{−i}(x_i))^2 = (1/n) Σ_{i=1}^{n} [ (y_i − f̂(x_i)) / (1 − s_ii) ]^2

where s_ii is the ith diagonal element of S. We can thus fit the data just once and weight the squared residuals by 1/(1 − s_ii)^2.
The proof of this for OLS is in A2 Q2 part (d).

Generalized Cross Validation (GCV) For any linear smoother Ŷ = SY we define the GCV error as

GCV(f̂) = (1/n) Σ_{i=1}^{n} [ (y_i − f̂(x_i)) / (1 − tr(S)/n) ]^2

where we use the average leverage tr(S)/n in place of each individual s_ii.

Note that LOO CV is approximately unbiased for Err, but can have high variance due to the n training sets being
very similar to one another. On the other hand, a small k tends to have large bias but small variance. To balance
bias and variance, k = 5 or k = 10 is recommended.
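A minimal k-fold CV sketch for choosing a tuning parameter, here the span of loess (the fold count, candidate spans, and data are assumptions):

    set.seed(15)
    n <- 300
    dat <- data.frame(x = runif(n, 0, 10))
    dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

    k <- 10
    folds <- sample(rep(1:k, length.out = n))            # random partition into k folds
    spans <- c(0.2, 0.3, 0.5, 0.75, 1)                   # candidate tuning parameter values

    cv_err <- sapply(spans, function(sp) {
      fold_err <- sapply(1:k, function(i) {
        fit <- loess(y ~ x, data = dat[folds != i, ], span = sp,
                     control = loess.control(surface = "direct"))  # "direct" allows extrapolation
        pred <- predict(fit, newdata = dat[folds == i, ])
        mean((dat$y[folds == i] - pred)^2)               # error on held-out fold i
      })
      mean(fold_err)
    })
    spans[which.min(cv_err)]                             # span with the smallest CV error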


18 March 26, 2019


18.1 Tree-based methods
A decision (regression) tree is essentially local regression with K neighbourhoods, where at an iteration with i neighbourhoods we split one neighbourhood to get i + 1 neighbourhoods. We aim to optimize the objective function

Σ_{i=1}^{N} Σ_{k=1}^{K} (y_i − I_{R_k}(x_i) µ̂_k)^2 + λK

where I_{R_k}(x_i) = 1 if x_i is in neighbourhood (region) R_k, and µ̂_k is the average response of the points in neighbourhood k. This is equivalent to local average regression.
Optimizing the above directly is computationally difficult, since there is a combinatorial number of ways to partition N points into K ∈ ℕ neighbourhoods.
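Regression trees of this kind can be grown greedily in R with rpart; a short sketch where the complexity parameter cp plays a role analogous to λ (the data are simulated for illustration):

    library(rpart)
    set.seed(16)
    x <- runif(400, 0, 10)
    y <- sin(x) + rnorm(400, sd = 0.3)

    fit <- rpart(y ~ x, method = "anova", control = rpart.control(cp = 0.01))
    printcp(fit)                       # complexity table: tree size vs cross-validated error
    pred <- predict(fit, data.frame(x = seq(0, 10, length.out = 200)))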
