Stat444 Notes
richardwu.ca
Table of Contents
1 January 8, 2019
  1.1 What is a function?
  1.2 Advertising data example
  1.3 Notation
  1.4 Definitions and properties
8 February 5, 2019
  8.1 Natural cubic splines (NCS)
  8.2 Fitting NCS
  8.3 General function fitting with basis functions
9 February 7, 2019
  9.1 Choosing k for NCS
  9.2 Smoothing splines
13 March 5, 2019
  13.1 Local linear regression as a linear smoother
14 March 7, 2019
  14.1 Multivariate local regression
  14.2 Multivariate regression splines with tensor products
  14.3 Multivariate smoothing splines with thin plates
  14.4 Curse of dimensionality
  14.5 Structured regression additive approach
Abstract
These notes are intended as a resource for myself; for past, present, or future students of this course; and for anyone
interested in the material. The goal is to provide an end-to-end resource that covers all material discussed
in the course, presented in an organized manner. These notes are my interpretation and transcription of the
content covered in lectures. The instructor has not verified or confirmed the accuracy of these notes, and any
discrepancies, misunderstandings, typos, etc. in these notes relative to the course's content are not the responsibility of
the instructor. If you spot any errors or would like to contribute, please contact me directly.
1 January 8, 2019
1.1 What is a function?
Suppose we have some measured response variate y and one or more explanatory variables x1, . . . , xp.
The response and explanatory variables are approximately related through an unknown function µ(x) (to be
estimated/learned) where

y = µ(x) + r

where r is the residual that cannot be explained by µ(x).
Some other names for response and explanatory variables include:
response     explanatory
response     predictor
response     design
output       input
dependent    independent
endogenous   exogenous
1.2 Advertising data example

What if we tried a simple linear model where µ̂(x1) = α̂ + β̂x1, where x1 is the TV advertising? We obtain estimates
α̂ = 7.03 and β̂ = 0.05, which are interpretable. However, if we take a look at the residuals
we see that the residuals are not identically distributed across x1 (their spread varies with x1), which violates our
Gauss-Markov assumptions.
Looking at the residuals of the models with Newspaper and Radio, we again observe that the variance is not constant across the explanatory variables.
Therefore a linear model does not seem to work (we could of course introduce scaling e.g. log-scaling for the Radio
variate or polynomial terms).
1.3 Notation
Some notes on notation:
Quadratic form For a matrix A, the quadratic form in Y is

f = Y^T A Y = Σ_i Σ_j a_ij yi yj
Rank The rank of a matrix denoted rank(A) is the maximum number of linearly independent columns (or rows)
of A.
Note that vectors Y1, . . . , Yn are linearly independent iff c1 Y1 + . . . + cn Yn = 0 implies c1 = . . . = cn = 0.
If A is symmetric and idempotent with eigenvalues A~vi = λi ~vi, i = 1, 2, . . . , m, then each eigenvalue is either 0 or 1
and tr(A) = rank(A) = tr(Λ), which equals the number of eigenvalues that are 1.
2 January 10, 2019
Consider the multiple linear regression model

yi = β0 + β1 xi1 + . . . + βp xip + εi,   i = 1, . . . , n

where we assume:
• E(εi) = 0
• ε1, . . . , εn are independent
• ε1, . . . , εn are iid N(0, σ²)
The least squares criterion is

S(β) = (Y − Xβ)^T (Y − Xβ)

which is minimized by β̂ = (X^T X)^{-1} X^T Y, giving fitted values Ŷ = X β̂ = HY,
where H = X(X^T X)^{-1} X^T (the hat matrix). Note that H is idempotent and symmetric.
Geometric interpretation of LSE : Ŷ is the projection of Y onto C(X), the column space of X (we can thus see that
the fitted errors should be orthogonal to our fitted values in LSE).
The degrees of freedom of our model is n − (p + 1) where p + 1 is the number of free parameters in our model.
This is equivalent to n − tr(H), i.e. tr(H) = p + 1 (see the short numerical check below).
Under normality:
• β̂ ∼ MVN(β, σ²(X^T X)^{-1})
• β̂ and σ̂² are independent (note σ̂² = SSE/df)
• (n − p − 1)σ̂²/σ² ∼ χ²_{n−p−1}
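A quick numerical check of the hat matrix facts above (my own sketch with simulated data, not from lecture):

set.seed(1)
n <- 100; p <- 2
X1 <- rnorm(n); X2 <- rnorm(n)
y  <- 1 + 2 * X1 - X2 + rnorm(n)

fit <- lm(y ~ X1 + X2)

# Design matrix (includes the intercept column) and hat matrix
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)

sum(diag(H))                                            # tr(H) = p + 1 = 3
all.equal(unname(fitted(fit)), unname(drop(H %*% y)))   # Y-hat = H Y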
Let ~ap = (1, x1 , . . . , xp )T (observation ~x extended with intercept term). The (1 − α) prediction interval at ~ap is
~ap^T β̂ ± t_{n−p−1, α/2} σ̂ √(1 + ~ap^T (X^T X)^{-1} ~ap)
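In R the same interval is available from predict with interval = "prediction"; a small sketch with simulated data (the manual computation just mirrors the formula above):

set.seed(1)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

# Built-in (1 - alpha) prediction interval at a new point
predict(fit, newdata = data.frame(x = 1.5), interval = "prediction", level = 0.95)

# Manual version: a_p' beta-hat +/- t * sigma-hat * sqrt(1 + a_p'(X'X)^{-1} a_p)
X  <- model.matrix(fit); ap <- c(1, 1.5)
s  <- summary(fit)$sigma
se <- s * sqrt(1 + t(ap) %*% solve(t(X) %*% X) %*% ap)
drop(t(ap) %*% coef(fit)) + c(-1, 1) * qt(0.975, df = fit$df.residual) * drop(se)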
Suppose instead we fit two separate lines, β0 + β1 x for x < a and β2 + β3 x for x ≥ a, subject to the continuity constraint β0 + β1 a = β2 + β3 a.
A more convenient way to express the above
y = β0 + β1 x + β2 (x − a)I(x ≥ a)
where I is the indicator function. Note the above is linear in terms of β~ BUT NOT in terms of x. However we can
simply construct a new variate (x − a)I(x ≥ a) from x.
Note that β2 is the change in slope right of a for samples where x ≥ a.
Extension to more than one interesting point (knot) is straightforward.
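A minimal sketch of fitting the broken-stick model with lm by constructing the variate (x − a)I(x ≥ a) directly (the knot a and the data are made up):

set.seed(2)
a <- 3                                   # knot (chosen for illustration)
x <- runif(200, 0, 6)
y <- 1 + 0.5 * x + 1.5 * pmax(x - a, 0) + rnorm(200, sd = 0.3)

# (x - a) * I(x >= a) is just the truncated term pmax(x - a, 0)
fit <- lm(y ~ x + I(pmax(x - a, 0)))
coef(fit)   # the third coefficient estimates the change in slope to the right of a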
Similarly, a piecewise quadratic model with a knot at a (continuous and differentiable at a) can be written as

y = β0 + β1 x + β2 x² + β3 (x − a)² I(x ≥ a)
In weighted least squares (WLS) we minimize

S(β) = (Y − Xβ)^T W (Y − Xβ)

where

W = diag(w1, w2, . . . , wn)

is an n × n diagonal matrix. wi corresponds to the weight assigned to observation i (the higher wi, the more important that
observation is).
3 January 15, 2019
Recall that

d(~c^T Y)/dY = ~c^T          d(Y^T A Y)/dY = 2 Y^T A

so

dS(β)/dβ = −2 Y^T W X + 2 β^T X^T W X.

Setting dS(β)/dβ = 0:

⇒ β^T X^T W X = Y^T W X
⇒ (X^T W X) β = X^T W Y          (since W^T = W)
⇒ β̂ = (X^T W X)^{-1} X^T W Y

as claimed.
Example 3.2. Suppose that Var(εi) = σi² (i.e. not all observations are drawn with the same variance). If we want
to overweight observations that have lower variance, we can set wi = 1/σi² to obtain the unbiased estimator of β
with the smallest variance (the Best Linear Unbiased Estimator or BLUE).
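A sketch illustrating Example 3.2 with simulated unequal variances: the closed form (X^T W X)^{-1} X^T W Y agrees with lm(..., weights = w) when wi = 1/σi² (all names are illustrative):

set.seed(3)
n     <- 100
x     <- runif(n)
sigma <- runif(n, 0.2, 2)               # known per-observation standard deviations
y     <- 2 + 3 * x + rnorm(n, sd = sigma)
w     <- 1 / sigma^2                    # inverse-variance weights

fit_wls <- lm(y ~ x, weights = w)

# Closed form (X' W X)^{-1} X' W Y
X <- cbind(1, x); W <- diag(w)
beta_hat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
cbind(coef(fit_wls), beta_hat)          # the two columns should match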
4 January 17, 2019
In a paper by Soros et al., they ended up using a sample of only 500 posts for confidentiality reasons.
The difference between the sample and study population is called sample error.
• Factors are like categorical variables in R: there are a finite number of categories (called factor levels).
• In lm almost any function of variates may appear in the formula, e.g. Y ∼ X + sin(X) or Y ∼ X + sin(X * Y).
  To specify the product term Y = X · Z, we need to use Y ∼ I(X * Z) or Y ∼ X:Z instead of X * Z, since X * Z represents
  an interaction in lm and translates to the model y = αx + βz + γxz + r.
• Some arithmetic operations, e.g. +, −, *, ^, are interpreted as formula operators rather than arithmetic operators
  in lm. One should wrap them in I(·). (See the sketch below.)
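A short sketch of these formula conventions (the data frame and variable names are made up; only the formula interpretation matters):

set.seed(4)
dat <- data.frame(X = rnorm(50), Z = rnorm(50))
dat$Y <- 1 + dat$X + 2 * dat$X * dat$Z + rnorm(50)

lm(Y ~ X + sin(X), data = dat)      # arbitrary functions of variates are allowed
lm(Y ~ X * Z,      data = dat)      # expands to X + Z + X:Z (interaction model)
lm(Y ~ X:Z,        data = dat)      # the product term X*Z alone
lm(Y ~ I(X * Z),   data = dat)      # same product term, via I()
lm(Y ~ I(X^2),     data = dat)      # arithmetic ^ must be wrapped in I()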
Figure 4.1: Quadratic and cubic polynomial linear models on Facebook data.
In the above figure we see that, while both the quadratic and cubic models are global (they predict a value for any x), the
quadratic model predicts likes returning to 0 as impressions approach infinity.
The cubic function, on the contrary, continues to increase, which makes more sense intuitively. Thus examining a model
often requires human understanding of the data and the problem.
5 January 24, 2019
Consider the general discrepancy function S(β) = Σ_{i=1}^n ρ(ri), where ρ is a real-valued loss function (in the OLS case, this
was simply the square function) and ri = yi − ~xi^T β is our residual for observation i.
Taking the derivative,

dS(β)/dβ = Σ_{i=1}^n ρ′(yi − ~xi^T β)(−1) ~xi^T = −Σ_{i=1}^n ρ′(ri) ~xi^T
where l(ri) = −ri²/(2σ²), a function only of ri.
The second equality follows from the following remark:
Remark 5.2. Note li(β) is the ith observation’s contribution to l(β), i.e. l(β) = Σ_{i=1}^n li(β).
From above we observe that minimizing the discrepancy function is the same as maximizing the log likelihood where
ρ(r) = −l(r) in the discrepancy function.
Definition 5.1 (M-estimator). We call the estimator β̂ that minimizes Σ_{i=1}^n ρ(ri) the M-estimator or the
maximum-likelihood type estimator.

Setting the derivative to zero gives the estimating equations Σ_{i=1}^n ψ(ri) ~xi = 0 where ψ = ρ′. Writing ψ(ri) = w(ri) ri,
i.e. letting wi = w(ri) = ψ(ri)/ri, and solving, we see that the solution is WLS where

β̂ = (X^T W X)^{-1} X^T W Y

with W = diag(w1, . . . , wn).
However, the weights of this WLS depend on the residuals, which in turn depend on β. If we are given an initial
estimate β^(0), we can iteratively update the residuals and β until convergence (iteratively reweighted least squares). We proceed roughly as follows:
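A minimal iteratively reweighted least squares sketch (my own illustration, using Huber's weight function introduced in the next remark and the MAD scale estimate discussed later; c = 1.345 assumes sensibly scaled residuals):

set.seed(5)
x <- runif(100); y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 10                      # a few gross outliers

X <- cbind(1, x)
c_huber <- 1.345
w_huber <- function(r) ifelse(abs(r) <= c_huber, 1, c_huber / abs(r))

beta <- solve(t(X) %*% X, t(X) %*% y)      # OLS as the initial estimate
for (iter in 1:50) {
  r        <- drop(y - X %*% beta)
  s        <- median(abs(r)) / 0.6745      # robust scale estimate (MAD / 0.6745)
  W        <- diag(w_huber(r / s))
  beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
  if (max(abs(beta_new - beta)) < 1e-8) break
  beta <- beta_new
}
beta                                       # robust coefficient estimates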
Remark 5.3. The OLS loss function is unbounded in the residual, and so extreme outliers with large residuals have
significantly more influence.
Huber (1964) proposed a modified loss function (Huber loss) which de-emphasizes outliers:
ρ(r) = { r²/2            if |r| ≤ c
       { c(|r| − c/2)    if |r| > c
The modified loss function essentially makes the loss function linear after a certain threshold c:
We also let

ψ(r) = { r              if |r| ≤ c
       { c · sign(r)    if |r| > c

and thus

w(r) = { 1        if |r| ≤ c
       { c/|r|    if |r| > c
The ψ and weight w functions look like
Figure 5.1: Left: ψ(r). Right: w(r) for Huber’s loss function.
How do we decide c? Huber suggested c = 1.345 and showed it achieves 95% efficiency relative to LSE asymptotically when the true
distribution is normal (95% efficiency essentially means the variance of the betas from OLS is 95% of the
variance of the betas using Huber’s loss).
Question 5.3. Since c is fixed, what if our residuals are scaled to very large or small values (e.g. O(1e5) or
O(1e − 4))? We would have to scale our data beforehand to make it within a sensible range so that c = 1.345 makes
sense.
Sometimes we prefer the ψ function to “redescend” i.e. ψ(r) → 0 when |r| is large (that is: we fully de-emphasize
outliers). Other ψ functions include
Redescending M-estimator (Hampel)

ψ(r) = { r                                 if 0 ≤ |r| ≤ a
       { a · sign(r)                       if a ≤ |r| ≤ b
       { a · (c − |r|)/(c − b) · sign(r)   if b ≤ |r| ≤ c
       { 0                                 if |r| > c
The recommended settings are a = 2, b = 4, c = 8 (with appropriately scaled data and residuals).
Tukey’s biweight

ψ(r) = { r [1 − (r/c)²]²    if |r| ≤ c
       { 0                  if |r| > c
where c = 4.685 is typically used. This is designed to have 95% efficiency as well for a true normal distribution.
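In practice these M-estimators are available through rlm in the MASS package; a sketch (psi.huber and psi.bisquare are MASS's Huber and Tukey-biweight ψ functions, and the tuning constants are passed through to them):

library(MASS)

set.seed(6)
x <- runif(100); y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 10                                    # outliers

fit_ols   <- lm(y ~ x)
fit_huber <- rlm(y ~ x, psi = psi.huber,    k = 1.345)   # Huber, c = 1.345
fit_tukey <- rlm(y ~ x, psi = psi.bisquare, c = 4.685)   # Tukey's biweight

rbind(ols = coef(fit_ols), huber = coef(fit_huber), tukey = coef(fit_tukey))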
To put the residuals on a sensible scale (cf. Question 5.3), compute

MAD = median(|ri|)

and let ŝ = MAD/0.6745; the residuals are then scaled as ri/ŝ. For the standard normal distribution we note that MAD = 0.6745, so ŝ is approximately unbiased for σ under normality.
The sensitivity curve of a statistic Tn to a contaminating point y is

SC(y) = [Tn(y1, . . . , yn−1, y) − Tn−1(y1, . . . , yn−1)] / (1/n)

which is the difference between Tn(·) (with all n points, including y) and Tn−1(·) (with the point y omitted), compared to the
contamination size 1/n.
Example 6.1. Let Tn(y1, . . . , yn) = (1/n) Σ_{i=1}^n yi = ȳn (the sample mean). Note that

Tn = (1/n) [ Σ_{i=1}^{n−1} yi + y ] = ((n − 1)/n) ȳ_{n−1} + (1/n) y

so SC(y) = y − ȳ_{n−1}, which is unbounded in y: a single point can move the sample mean arbitrarily.
Definition 6.1 (Breakdown point). Informally, the breakdown point of a statistic is the largest proportion of
contamination before the statistic breaks down.
Formally, let ~zi = (xi1 , xi2 , . . . , xip , yi )T for i = 1, . . . , n be the ith data vector.
Let Z = (~z1 , . . . , ~zn ) be the whole set. Let T be the statistic of interest. The worst error for swapping m zi ’s is
e(m; T, Z) = sup_{Z*_m} ‖T(Z*_m) − T(Z)‖

where Z*_m ranges over all datasets obtained from Z by replacing m of the ~zi with arbitrary values. The breakdown point is then
the smallest proportion m/n for which e(m; T, Z) is unbounded.
Remark 6.2. That is: the breakdown point measures the minimum proportion of points required to influence
the statistic significantly.
Sample mean Note we can simply swap out m = 1 point arbitrarily such that e(1; T, Z) → ∞, thus the breakdown
point is 1/n → 0 as n → ∞.
Median The breakdown point is 1/2 as n → ∞: we need to change at least half of the points to arbitrarily influence the
median, e.g. make it go to infinity.
k% trimmed mean The k% trimmed mean is defined as the mean after discarding the lowest k% and highest k%
of yi ’s.
Breakdown point is k% (we swap out the top k% + 1 points).
or equivalently

argmin_β average_i (yi − ~xi^T β)²

To make it robust to “outliers” or contamination, i.e. to ensure we have a high breakdown point, we could consider
the least median of squares (LMS) estimator:

β̂_LMS = argmin_β median_i (yi − ~xi^T β)²

which has a breakdown point of 1/2 (compared to a breakdown point of 1/n for OLS).
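A sketch comparing OLS with the least median of squares fit using MASS::lqs with method = "lms" (the contamination level is made up):

library(MASS)

set.seed(7)
x <- runif(100); y <- 1 + 2 * x + rnorm(100, sd = 0.2)
y[1:20] <- 10 + rnorm(20)            # 20% gross contamination

coef(lm(y ~ x))                      # OLS is pulled towards the outliers
coef(lqs(y ~ x, method = "lms"))     # least median of squares resists them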
7 January 31, 2019
f (x) = β0 + β1 x + β2 (x − a)I(x ≥ a)
Remark 7.1. Piecewise linear is also called the broken stick method.
For notation simplicity let us define (x)+ = max(x, 0) such that we have
f (x) = β0 + β1 x + β2 (x − a)+
Thus our basis functions are 1, x, (x − a)+. Here is a plot of the basis:
This is an example of the truncated power series. We can easily generalize this model to accommodate many
break points or knots.
However, piecewise linear functions are not differentiable at their break points since f′(x) is not continuous.
Recall that for a piecewise quadratic function we have
f (x) = β0 + β1 x + β2 x2 + β3 (x − a)2+
where our basis functions are 1, x, x2 , (x − a)2+ . Note that a piecewise quadratic model f (x) is indeed differentiable
at the break points.
Let t1 < t2 < . . . < tk be fixed and known knots, where t1 and tk are boundary knots and t2, . . . , tk−1 are interior
knots.
Then the basis consists of the functions 1, x, x2 , x3 , (x − t1 )3+ , . . . , (x − tk )3+ . That is any cubic spline with the
above k knots can be expressed as
f(x) = β0 + β1 x + β2 x² + β3 x³ + Σ_{j=1}^k β_{j+3} (x − tj)³₊
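A sketch that builds this truncated power basis by hand and fits it with lm (the knots are arbitrary choices):

set.seed(8)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

knots <- c(2.5, 5, 7.5)
# Basis: x, x^2, x^3, and (x - t_j)^3_+ for each knot (lm adds the intercept 1)
trunc3 <- sapply(knots, function(t) pmax(x - t, 0)^3)
fit <- lm(y ~ x + I(x^2) + I(x^3) + trunc3)

length(coef(fit))    # k + 4 parameters: here 3 + 4 = 7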
8 February 5, 2019
8.1 Natural cubic splines (NCS)
A cubic spline is called a natural cubic spline with knots {t1 , . . . , tk } if f (x) is linear when x 6∈ [t1 , tk ], that is
f(x) = { a0 + b0 x    if x < t1
       { ak + bk x    if x > tk
Question 8.1. How many free parameters are there in the natural cubic spline?
Answer. Note that in general cubic splines, we have k + 4 parameters. If we constrain our spline to be linear at
both ends (x < t1 and x > tk ) then we essentially remove the quadratic and cubic terms and thus parameters at
each end. So we remove 4 parameters and thus we have k free parameters.
To express an NCS, note that for a regular cubic spline we have
f(x) = β0 + β1 x + β2 x² + β3 x³ + Σ_{j=1}^k β_{j+3} (x − tj)³₊
we want all the x3 terms to have 0 coefficients (first term of expansion) and all x2 terms to also have 0
coefficients (second term of expansion).
These conditions are necessary and sufficient.
Claim. We claim N1 (x) = 1, N2 (x) = x, and Nj (x) = dj−1 (x) − d1 (x) for j = 3, . . . , k where
dj(x) = [(x − tj)³₊ − (x − tk)³₊] / (tk − tj)
i.e.

β4 (tk − t1) = − Σ_{j=2}^{k−1} β_{j+3} (tk − tj)

as desired.
Note that we have 4 separate (linearly independent) constraints on the parameters hence why we lose 4 degrees of
freedom.
N0(x) = 1
N1(x) = x
Nj(x) = (tk − tj)[dj(x) − d1(x)],   j = 2, . . . , k − 1
Definition 8.1 (Regression splines). The fixed-knot splines, such as cubic splines and NCS, are called regression
splines.
yi ≈ Σ_{j=1}^k βj Nj(xi) + εi
Now we simply fit the following linear model with design matrix
X = [ N1(x1)  . . .  Nk(x1) ]
    [   ..     ..      ..   ]
    [ N1(xn)  . . .  Nk(xn) ]

where

β = (β1, . . . , βk)^T    and    Y = (y1, . . . , yn)^T
Remark 8.2. The problem becomes a regular regression problem with design matrix generated from the basis
functions Nj ’s.
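A sketch of exactly this workflow: generate the NCS basis functions Nj with splines::ns at chosen knots, then fit the coefficients by ordinary lm (knots are arbitrary):

library(splines)

set.seed(9)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# ns() builds the natural cubic spline design matrix N_j(x_i)
N <- ns(x, knots = c(2.5, 5, 7.5))    # interior knots; boundary knots default to range(x)
dim(N)                                # n rows, one column per basis function

fit <- lm(y ~ N)                      # ordinary least squares on the basis
plot(x, y); lines(x, fitted(fit), lwd = 2)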
8.3 General function fitting with basis functions

More generally, we can fit models of the form f(~x) = Σ_j βj hj(~x) for chosen basis functions hj. For example:

1. hj(~x) = xj for j = 1, . . . , p gives the original linear model, where the basis functions are simply the coordinates of ~x.
9 February 7, 2019
9.1 Choosing k for NCS
Recall that the basis functions for NCS are
N0(x) = 1
N1(x) = x
Nj(x) = dj−1(x) − d1(x),   j = 3, . . . , k
Equal-distance knots We choose k first arbitrarily e.g. k = 5, then we use an equal-distance grid between the
min and max of xi ’s.
Quantiles Quantiles are also a popular choice, e.g. place knots at the i/(k − 1) quantiles of the xi's for i = 0, . . . , k − 1.
Degrees of freedom Alternatively we can instead specify the degrees of freedom for an NCS i.e. the number of
free parameters. For df = k, we would have k − 2 knots (if intercept term is also included). Usually knots are
placed at equal distance quantiles.
9.2 Smoothing splines

Consider choosing f to minimize the penalized criterion

Σ_{i=1}^n [yi − f(xi)]² + λ ∫_{−∞}^{∞} [f″(x)]² dx

Remark 9.1. 1. Σ_{i=1}^n [yi − f(xi)]² is the sum of squared residuals, which measures the goodness of fit.
2. ∫_{−∞}^{∞} [f″(x)]² dx measures the “roughness” of f(x).
Remark 9.2. Note that we try to minimize the integral over the f 00 (x) (squared), which is essentially
minimizing f 00 (x) so that it is close to 0.
For example, if f(x) = β0 + β1 x (OLS) then f″(x) = 0, thus ∫_{−∞}^{∞} [f″(x)]² dx = 0, i.e. no penalty for OLS.
3. The role of λ: if λ = 0 then we have no roughness penalty, we minimize the SSR over all functions, and f̂λ(x) is an
interpolating curve through the data.
If λ = ∞ then we force ∫_{−∞}^{∞} [f″(x)]² dx = 0, thus f̂λ(x) is the ordinary least squares (linear) fit.
4. Remarkably we can show that fˆλ (x) is just the natural cubic spline with knots at distinct values of {xi }ni=1 .
Claim. If s(x) is the natural cubic spline interpolating the same values as f at the points {xi}, then

∫_{−∞}^{∞} [s″(x)]² dx ≤ ∫_{−∞}^{∞} [f″(x)]² dx
Definition 9.1 (Smoothing spline). We call the function fitted by the penalized regression a smoothing spline.
that is

β̂λ = argmin_β  Σ_{i=1}^n [yi − Σ_{j=1}^k βj Nj(xi)]²  +  λ ∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)]² dx

The first term is (Y − Xβ)^T (Y − Xβ), where

X = [ N1(x1)  . . .  Nk(x1) ]
    [   ..     ..      ..   ]
    [ N1(xn)  . . .  Nk(xn) ]
Also,

∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)]² dx = ∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)] [Σ_{l=1}^k βl Nl″(x)] dx
                                     = ∫_{−∞}^{∞} Σ_{j=1}^k Σ_{l=1}^k βj βl Nj″(x) Nl″(x) dx
                                     = Σ_{j=1}^k Σ_{l=1}^k βj βl ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx
                                     = β^T N β

where N = (Njl) with Njl = ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx (the (j, l)-th entry of N is Njl).
Therefore we can let

S(β) = (Y − Xβ)^T (Y − Xβ) + λ β^T N β
     = Y^T Y − β^T X^T Y − Y^T X β + β^T X^T X β + λ β^T N β
     = Y^T Y − 2 Y^T X β + β^T (X^T X + λN) β

and β̂λ = argmin_β S(β).
Recall that for a matrix A and vectors Y and ~c

∂(~c^T Y)/∂Y = ~c^T          ∂(Y^T A Y)/∂Y = 2 Y^T A^T

thus we have

∂S(β)/∂β = −2 Y^T X + 2 β^T (X^T X + λN)^T = 0
⇒ (X^T X + λN) β̂λ = X^T Y
⇒ β̂λ = (X^T X + λN)^{-1} X^T Y
To calculate the effective number of parameters or effective df (edf): recall for NCS we have k knots, and in OLS with X (n × p)

Ŷ = HY = X(X^T X)^{-1} X^T Y

where the number of parameters is df = tr(H).
Now in the smoothing spline we have Ŷ = Sλ Y with Sλ = X(X^T X + λN)^{-1} X^T, so we define the effective degrees of freedom as dfλ = tr(Sλ).
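In R, smooth.spline fits this penalized criterion directly; a sketch showing how the effective degrees of freedom tr(Sλ) track the amount of smoothing (the df values are arbitrary choices):

set.seed(10)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_cv <- smooth.spline(x, y)                # lambda chosen by (generalized) cross-validation
fit_cv$df                                    # effective degrees of freedom tr(S_lambda)

fit_rough  <- smooth.spline(x, y, df = 20)   # small penalty: wigglier fit, larger edf
fit_smooth <- smooth.spline(x, y, df = 4)    # large penalty: smoother fit, smaller edf

plot(x, y)
lines(predict(fit_rough,  x), col = "red")
lines(predict(fit_smooth, x), col = "blue", lwd = 2)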
1. Numerically stable: recall cubic splines have x³ terms which grow quickly as x → ∞. B-splines are fitted to a
restriction of x (the d + 5 knots), so each basis function has only local support.
2. Computationally efficient when the number of knots k is large. More specifically, least squares estimation with n
observations and k variables takes O(nk² + k³) operations; if k → n then this becomes O(n³). B-splines
reduce this cost to O(n), since each row of the design matrix has only O(d) non-zero entries (a constant).
where Bi,0 (x) is the interval indicator function. It is also known as the Haar basis function.
In general, for a degree-d B-spline we define its basis recursively as

Bi,d(x) = [(x − ti)/(ti+d − ti)] Bi,d−1(x) + [(ti+d+1 − x)/(ti+d+1 − ti+1)] Bi+1,d−1(x)
After we compute the basis functions given our x, we can fit the model as an OLS or robust LR model. In R, we can
use the function bs in the package splines to generate the B-spline basis functions (note there are no intercepts
included). This will give us a design matrix with d + k basis functions (so d + k degrees of freedom) where d is the
degree and k is the number of knots (d starts at 0 for the constant function).
Then we simply feed this to lm or rlm as usual (which will subsequently introduce the bias term). Note that lm will
add one more degree of freedom with the intercept for (d + 1) + k degrees of freedom.
Similarly we can generate NCS basis functions with ns in splines.
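A sketch of the bs-then-lm workflow just described (degree and knots are arbitrary); the design matrix has d + k columns and lm adds the intercept:

library(splines)
library(MASS)

set.seed(11)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

B <- bs(x, degree = 3, knots = c(2.5, 5, 7.5))   # B-spline basis, no intercept column
dim(B)                                           # n x (d + k) = 200 x 6

fit <- lm(y ~ B)                                 # lm adds the intercept: (d + 1) + k parameters
fit_robust <- rlm(y ~ B)                         # or a robust fit on the same basis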
Remark 10.1. Since smoothing splines are penalized for their “smoothness” this allows us to choose a high number
of knots.
11 February 26, 2019
We take the k nearest neighbours for every x and compute the mean response value of the neighbours. This then
becomes the fitted value at xi and we may linearly interpolate or even quadratically interpolate (or even higher
order polynomial interpolation) between points.
This can be accomplished in R with knn.reg from the FNN package.
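A sketch of KNN regression with FNN::knn.reg on simulated data (the k values are arbitrary; larger k gives a smoother fit, as the following remark notes):

library(FNN)

set.seed(12)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

grid  <- seq(0, 10, length.out = 400)
fit5  <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 5)
fit50 <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 50)

plot(x, y)
lines(grid, fit5$pred,  col = "red")            # wiggly
lines(grid, fit50$pred, col = "blue", lwd = 2)  # smoother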
Remark 11.1. As the neighbourhood size k increases, the fitted function becomes smoother.
Instead of taking the mean response based on the k neighbours, we can instead use the value from any fitted model
based on those k neighbours (e.g. lm, rlm with Huber, Tukey’s, etc., ltsreg).
Remark 11.2. We can think of KNN local linear regression as weighted linear regression where wj = 0 if xj is
outside the neighbourhood of xi .
where the first two conditions standardize K(t) and the last ensures that weight is spread along the real line but not
too much weight lies in the extremes.
Some examples of kernels:
Thus for a bandwidth parameter h, the weight assigned to neighbour xi when fitting at the current point x is defined as

wi = w(x, xi) = K((xi − x)/h) / Σ_{j=1}^N K((xj − x)/h)
For the mean response µ̂(x) we take the weighted average or the Nadaraya-Watson estimator:
µ̂(x) = Σ_{i=1}^N wi yi
Remark 11.3. The boundary effect occurs when no points lie on one side of the kernel and thus the weights are
distributed in a biased way to the points on the other side. This occurs at the extremes of the explanatory variate
space.
Figure 11.2: The boundary effect causes the kernel fitted line (green) at the left end to bias the fitted value higher,
since most of the available points are to the right of the kernel and have a higher response value.
In R we can use the loess function where span defines the proportion of points in the local neighbourhood. The
kernel used is Tukey’s tri-cube.
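A sketch of local regression with loess on simulated data; span controls the proportion of points in each local neighbourhood (the values are arbitrary):

set.seed(13)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_wiggly <- loess(y ~ x, span = 0.15, degree = 1)   # small neighbourhood
fit_smooth <- loess(y ~ x, span = 0.75, degree = 1)   # large neighbourhood

grid <- seq(min(x), max(x), length.out = 400)
plot(x, y)
lines(grid, predict(fit_wiggly, newdata = data.frame(x = grid)), col = "red")
lines(grid, predict(fit_smooth, newdata = data.frame(x = grid)), col = "blue", lwd = 2)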
A linear smoother produces fitted values Ŷ = Sλ Y that are a linear combination of the yi's, with Sλ being the smoother matrix.
Consider a regression with a small number p of basis functions. That is, for basis functions b1, . . . , bp (e.g. the NCS
basis) let B be the n × p matrix with entries B_{i,j} = bj(xi).
Ŷ = B(B T B)−1 B T Y
Claim. It can be shown (for any hat matrix) that the column space of B satisfies C(B) = C(HB ). Note however
B has p columns whereas HB has n columns, so there is some redundancy.
HB = U P U^T
   = [~u1 · · · ~un] diag(ρ1, . . . , ρn) [~u1 · · · ~un]^T
   = Σ_{i=1}^n ρi ~ui ~ui^T

where ρ1 ≥ . . . ≥ ρn ≥ 0 are the eigenvalues and ~u1, . . . , ~un are the corresponding orthonormal eigenvectors.
Then

Ŷ = HB Y = Σ_{i=1}^n ρi ~ui ~ui^T Y = Σ_{i=1}^n ρi ⟨~ui, Y⟩ ~ui        (⟨·, ·⟩ the inner product)
Thus Y is first projected to the orthonormal basis {~u1 , . . . , ~un } then modulated by {ρ1 , . . . , ρn }.
Because HB is idempotent, and assuming B is full rank (i.e. rank(B) = p), we have P^n = P for all n ∈ N, thus

ρi = { 1    if i = 1, . . . , p
     { 0    if i = p + 1, . . . , n
Returning to smoothing splines (with knots at the n distinct xi values), we claim the smoother matrix Sλ can be written in the Reinsch form

Sλ = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1} does not depend on λ and N is the roughness penalty matrix for NCS, that is N = (Njl)_{n×n} with

Njl = ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx
Proof. First remark that X is a square (n × n) matrix since we assume all xi values are distinct, so

Sλ = X (X^T X + λN)^{-1} X^T
   = [ (X^T)^{-1} (X^T X + λN) X^{-1} ]^{-1}
   = [ I + λ (X^T)^{-1} N X^{-1} ]^{-1}
   = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1}.
More generally, consider the penalized problem

min_{~µ}  (Y − ~µ)^T (Y − ~µ) + λ ~µ^T K ~µ

where K is known as the penalty matrix; K is symmetric and has eigendecomposition

K = V D V^T

with D = diag(d1, . . . , dn) the eigenvalues (di ≥ 0) and V = (~v1, . . . , ~vn) an orthonormal matrix (of eigenvectors).
Remark 12.1.
1. K = Σ_{i=1}^n di ~vi ~vi^T (a sum of rank-one matrices) and ~µ^T K ~µ = Σ_{i=1}^n di ⟨~vi, ~µ⟩².
   This implies that ~µ is penalized more in the directions of the ~vi’s with large di values.
Remark 12.2. 1. Sλ and K share the same eigenvectors which do not depend on λ.
~µ̂ = Sλ Y = Σ_{i=1}^n ρi(λ) ~vi ~vi^T Y = Σ_{i=1}^n ρi(λ) ⟨~vi, Y⟩ ~vi        where ρi(λ) = 1/(1 + λ di)
that is, we project Y onto every eigenvector ~vi and scale each projection by the corresponding eigenvalue ρi(λ).
Note the first two directions (those with di = 0) are not shrunk by ρi(λ), but the rest are shrunk towards 0 since ρi(λ) < 1.
Recall that for OLS / regression splines, by contrast,

ρi = { 1    if i = 1, . . . , p
     { 0    if i = p + 1, . . . , n
Comparing the two, OLS selects the eigenvector directions with eigenvalue 1 and drops the other eigenvectors (hard
thresholding), whereas smoothing splines shrink Y in the direction of each eigenvector according to its corresponding
eigenvalue ρi(λ).
For this reason OLS or regression splines are called projection smoothers and smoothing splines are shrinking
smoothers.
Remark 12.3. 1. The sequence of ~vi, ordered by decreasing eigenvalues ρi(λ), appears to increase in complexity
(i.e. roughness or “wiggliness”).
2. dfλ = tr(Sλ) = Σ_{i=1}^n ρi(λ) = Σ_{i=1}^n 1/(1 + λ di) is monotone in λ; thus if we want a specific dfλ we can simply do a linear search for the corresponding λ (since the di's are also fixed).
13 March 5, 2019
13.1 Local linear regression as a linear smoother
We show that local linear regression is indeed a linear smoother.
For target value x (point we are doing local regression about), local linear regression is equivalent to solving the
weighted optimization problem
argmin_{α,β} Σ_{i=1}^n kh(x − xi) [yi − (α + β xi)]²
and our fitted value at x from the local regression is f̂(x) = α̂(x) + β̂(x) x, where kh is the kernel function used to form
the local weights.
Note the above optimization problem has an explicit solution. Let

B = [ 1  x1 ]
    [ 1  x2 ]
    [ ..  .. ]
    [ 1  xn ]

W(x) = diag(kh(x − x1), . . . , kh(x − xn))

where B is n × 2 and W(x) is n × n.
Then the fitted value at x can be rewritten as f̂(x) = (1, x) (B^T W(x) B)^{-1} B^T W(x) Y, a linear combination of the yi's; stacking these rows over the target values x = x1, . . . , xn gives f̂(X) (the n × 1 vector of fitted values) as SY for a smoother matrix S, so local linear regression is indeed a linear smoother.
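A from-scratch sketch of this explicit solution with a Gaussian kernel (the bandwidth h is an arbitrary choice; this is my own illustration, not course code):

set.seed(14)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

local_linear <- function(x0, x, y, h = 0.5) {
  kh <- dnorm((x - x0) / h) / h                    # Gaussian kernel weights k_h(x0 - x_i)
  B  <- cbind(1, x)                                # n x 2 design matrix
  W  <- diag(kh)                                   # n x n weight matrix W(x0)
  ab <- solve(t(B) %*% W %*% B, t(B) %*% W %*% y)  # (alpha(x0), beta(x0))
  ab[1] + ab[2] * x0                               # fitted value at x0
}

grid <- seq(0.5, 9.5, length.out = 200)
fhat <- sapply(grid, local_linear, x = x, y = y, h = 0.5)
plot(x, y); lines(grid, fhat, lwd = 2)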
14 March 7, 2019
14.1 Multivariate local regression
We want to make a prediction at target value ~x = (x1 , . . . , xp )T . A simple kernel function we could use in local
regression in Rp
Kh(~x) = (1/h) K(‖~x‖ / h)
Remark 14.1. The issue with the Euclidean norm is we weight every coordinate/variate xi equally. If a variate
has less importance we should not give it the same weight as other variates in the kernel.
That is, the kernel above is spherically symmetric in ~x and gives equal weight to each coordinate.
We use structured local regression instead. This is a more general approach where we use a positive semidefinite
matrix Ap×p to weight each coordinate, that is:
Kh,A(~x) = (1/h) K(~x^T A ~x / h)
and

f(~x) = Σ_{j,k} βjk gjk(~x)
We note that the number of parameters with a tensor product basis grows exponentially with p.
where

J(f) = ∫∫ [ (∂²f/∂x1²)² + 2 (∂²f/∂x1∂x2)² + (∂²f/∂x2²)² ] dx1 dx2

with the double integral taken over R × R.
We can show the optimal solution has the form (at a given target ~x)
fλ(~x) = β0,λ + βλ^T ~x + Σ_{i=1}^n αi hi(~x)
where the generator function is the radial basis function hi (~x) = k~x − x~i k2 logk~x − x~i k.
Note x~i for i = 1, . . . , n are our control/knot points.
For p = 2, again assume data points are uniformly distributed across [0, 1] × [0, 1]. Suppose we have a neighbourhood
x ± 0.1: it covers an area of 0.2 × 0.2 = 0.04, which only captures ≈ 4% of the points!
As p increases, the data become so sparse that a neighbourhood of fixed width along each dimension captures a vanishing
fraction of the points, so any neighbourhood containing a reasonable fraction of the data must become very wide.
14.5 Structured regression additive approach

Instead of modelling

µ(~x) = f(x1, . . . , xp)

which may be some arbitrary, possibly interactive function of every variate, we consider the additive model

µ(~x) = α + Σ_{j=1}^p fj(xj)
Remark 14.2. We can extend the above model to allow a limited number of interactions.
For example if p is small we can consider additional pairwise interactions
f(~x) = α + Σ_{j=1}^p fj(xj) + Σ_{j=1}^p Σ_{k=1}^p fjk(xj, xk)
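A sketch of fitting such an additive model with the mgcv package, where each s(xj) is a smooth fj estimated by penalized splines (the data are simulated; ti() adds a bivariate interaction smooth):

library(mgcv)

set.seed(15)
n  <- 300
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)   # x3 is irrelevant

fit <- gam(y ~ s(x1) + s(x2) + s(x3))   # additive model: alpha + f1(x1) + f2(x2) + f3(x3)
summary(fit)
plot(fit, pages = 1)                    # estimated f_j's

# A limited pairwise interaction can be added with a bivariate smooth, e.g.
fit2 <- gam(y ~ s(x1) + s(x2) + ti(x1, x2))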
17 March 19, 2019
The test error of a model f̂ fit on a training set T is

ErrT = E[(Y − f̂(X))² | T]

where the expectation is over the true joint distribution of (X, Y), i.e. the population.
Definition 17.2 (Expected test error). We define the expected test error as

Err = E[ErrT] = E[(Y − f̂(X))²]

where the expectation is over the distribution of (X, Y) and the random generation of training sets.
Definition 17.3 (Training error). We define the training error (written err to distinguish it from Err) as

err = (1/n) Σ_{i=1}^n (yi − f̂(xi))² = RSS/n

However, err uses the same data twice (once for producing f̂ and once for calculating the error) and does not track
Err well.
We note that, as model complexity grows, the training error keeps decreasing while the test error starts to increase beyond the
optimal complexity: test error increases after a certain point due to overfitting to the training set.
To regularize our model for complexity, some solutions include:
Information criteria let d denote the number of parameters. We define the Akaike Information Criterion
(AIC) as:
AIC = 2d + n log(RSS)
and the Bayesian Information Criterion (BIC), which has a larger regularization effect, as:
Remark 17.1. The information criteria are only useful for comparing models for the same training sample.
Their absolute values are meaningless.
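A sketch computing the criteria as defined above for two candidate models on the same sample (note R's built-in AIC()/BIC() use the log-likelihood and so differ from these formulas by constants, which does not affect comparisons on the same data):

set.seed(16)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_small <- lm(y ~ x)                       # d = 2 parameters
fit_big   <- lm(y ~ poly(x, 5))              # d = 6 parameters

crit <- function(fit) {
  rss <- sum(resid(fit)^2)
  d   <- length(coef(fit))
  n   <- length(resid(fit))
  c(AIC = 2 * d + n * log(rss),              # as defined in these notes
    BIC = log(n) * d + n * log(rss))
}
rbind(small = crit(fit_small), big = crit(fit_big))  # compare models on the same sample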
Cross-validation (CV) Recall we can estimate the test error Err by repeatedly sampling test sets from the
population.
We hold out a part of the training set as our “population” (cross-validation set) and construct our model on
the remaining training set. We can then validate our complexity and model on the cross-validation set as an
estimate of the test error.
This error on the cross-validation set is called the cross-validation error.
k-fold CV 1. Given a training set T, randomly partition it into k disjoint equal-sized parts (“folds”) T1, . . . , Tk.
2. For every i = 1, . . . , k, we train our model on the folds T1, . . . , Ti−1, Ti+1, . . . , Tk to obtain f̂^(i), then we
evaluate f̂^(i) on Ti to get the cross-validation error for that fold. Let i(j) be the fold containing
example j; then the overall cross-validation error is

CV(f̂) = (1/n) Σ_{j=1}^n (yj − f̂^(i(j))(xj))²
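A minimal k-fold cross-validation sketch for choosing a smoothing parameter (here the df of a natural spline; k = 10 and the df grid are arbitrary choices):

library(splines)

set.seed(17)
n   <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

k     <- 10
folds <- sample(rep(1:k, length.out = n))        # random partition into k folds

cv_error <- function(df) {
  fold_errs <- sapply(1:k, function(i) {
    test <- folds == i
    fit  <- lm(y ~ ns(x, df = df, Boundary.knots = c(0, 10)), data = dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ])
    mean((dat$y[test] - pred)^2)                 # CV error contribution of fold i
  })
  mean(fold_errs)
}

dfs <- 2:15
cv  <- sapply(dfs, cv_error)
dfs[which.min(cv)]                               # df with the smallest CV error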
The choice of k = n is called leave-one-out (LOO) CV. The justification for LOO CV: let f̂−i(xi) be
the fitted value at xi without using (xi, yi) during training. Then

E(yi − f̂−i(xi))² = E(yi − f(xi) + f(xi) − f̂−i(xi))²
                 = E(yi − f(xi))² + 2 E[εi (f(xi) − f̂−i(xi))] + E(f(xi) − f̂−i(xi))²
                 = σ² + E(f(xi) − f̂−i(xi))²          (the cross term is 0 since εi is independent of f̂−i)
                 ≈ σ² + E(f(xi) − f̂(xi))²

that is, LOO CV provides an approximate estimate of the test error Err (up to the constant σ²).
Remark 17.2. Since LOO CV requires fitting the model n times, it is infeasible for large n.
Remark 17.3. For most linear smoothers with Ŷ = SY, where S is the smoother matrix, it can be shown
that

CV(f̂) = (1/n) Σ_{i=1}^n (yi − f̂−i(xi))² = (1/n) Σ_{i=1}^n [ (yi − f̂(xi)) / (1 − sii) ]²

where sii is the ith diagonal element of S. We can thus simply fit the data once and weight the squared
residuals by 1/(1 − sii)².
The above proof for OLS is in A2 Q2 part (d).
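A sketch verifying the shortcut for OLS: the single-fit weighted formula matches refitting with each observation left out (hatvalues returns the diagonal entries sii of the hat/smoother matrix):

set.seed(18)
dat <- data.frame(x = runif(50))
dat$y <- 1 + 2 * dat$x + rnorm(50)
fit <- lm(y ~ x, data = dat)

# Shortcut: one fit, residuals weighted by 1 / (1 - s_ii)^2
loo_shortcut <- mean((resid(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit n times, each time leaving one observation out
loo_brute <- mean(sapply(1:nrow(dat), function(i) {
  f_i <- lm(y ~ x, data = dat[-i, ])
  (dat$y[i] - predict(f_i, newdata = dat[i, , drop = FALSE]))^2
}))

c(shortcut = loo_shortcut, brute_force = loo_brute)   # identical up to rounding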
Generalized Cross Validation (GCV) For any linear smoother Ŷ = SY we define the GCV error as

GCV(f̂) = (1/n) Σ_{i=1}^n [ (yi − f̂(xi)) / (1 − tr(S)/n) ]²

where we use the average trace tr(S)/n in place of each individual sii.
Note that LOO CV is approximately unbiased for Err, but can have high variance due to the n training sets being
very similar to one another. On the other hand, a small k tends to have large bias but small variance. To balance
bias and variance, k = 5 or k = 10 is recommended.
18 March 26, 2019
Consider partitioning the explanatory space into neighbourhoods R1, . . . , RK and fitting the piecewise-constant model

f(x) = Σ_{k=1}^K µ̂k I_{Rk}(x)

where I_{Rk}(xi) = 1 if xi ∈ Rk (neighbourhood k) and µ̂k is the average response of the points in neighbourhood k. This is
equivalent to local average regression.
Optimizing the above directly is still computationally difficult, since there is a combinatorial number of ways to partition
the N points into K ∈ N neighbourhoods.