Stat444 Notes
richardwu.ca
Table of Contents
1 January 8, 2019
  1.1 What is a function?
  1.2 Advertising data example
  1.3 Notation
  1.4 Definitions and properties
8 February 5, 2019
  8.1 Natural cubic splines (NCS)
  8.2 Fitting NCS
  8.3 General function fitting with basis functions
9 February 7, 2019
  9.1 Choosing k for NCS
  9.2 Smoothing splines
13 March 5, 2019
  13.1 Local linear regression as a linear smoother
14 March 7, 2019
  14.1 Multivariate local regression
  14.2 Multivariate regression splines with tensor products
  14.3 Multivariate smoothing splines with thin plates
  14.4 Curse of dimensionality
  14.5 Structured regression additive approach
Abstract
These notes are intended as a resource for myself; for past, present, or future students of this course; and for anyone
interested in the material. The goal is to provide an end-to-end resource that covers all material discussed
in the course, presented in an organized manner. These notes are my interpretation and transcription of the
content covered in lectures. The instructor has not verified or confirmed the accuracy of these notes, and any
discrepancies, misunderstandings, typos, etc. in these notes relative to the course's content are not the responsibility of
the instructor. If you spot any errors or would like to contribute, please contact me directly.
1 January 8, 2019
1.1 What is a function?
Suppose we have some measured response variate y and one or more explanatory variables x1, . . . , xp.
The response and explanatory variables are approximately related through an unknown function µ(x) (to be
estimated/learned) where

y = µ(x) + r

where r is the residual that cannot be explained by µ(x).
Some other names for response and explanatory variables include:
response     explanatory
response     predictor
response     design
output       input
dependent    independent
endogenous   exogenous
1.2 Advertising data example

What if we tried a simple linear model where µ̂(x1) = α̂ + β̂x1, where x1 is the TV advertising? We obtain estimates
α̂ = 7.03 and β̂ = 0.05, which are interpretable. However, if we take a look at the residuals
we see that the residuals are not identically distributed across x1 (their spread varies with x1), which violates our
Gauss-Markov assumptions.
Looking at the residuals of the models with Newspaper and Radio, we again observe that the variance is not constant across the explanatory variables.
Therefore a linear model does not seem to work (we could of course introduce scaling e.g. log-scaling for the Radio
variate or polynomial terms).
1.3 Notation
Some notes on notation:
Quadratic form For a matrix A, the quadratic form in Y is

f = Y^T A Y = Σ_i Σ_j a_ij yi yj
Rank The rank of a matrix denoted rank(A) is the maximum number of linearly independent columns (or rows)
of A.
Note that vectors Y1, . . . , Yn are linearly independent iff c1 Y1 + . . . + cn Yn = 0 implies c1 = . . . = cn = 0.
If A is symmetric and idempotent with eigenvalues A~vi = λi ~vi, i = 1, 2, . . . , m, then each eigenvalue is either 0 or 1
and tr(A) = rank(A) = tr(Λ), which equals the number of eigenvalues that are 1.
2 January 10, 2019
Consider the multiple linear regression model

yi = β0 + β1 xi1 + . . . + βp xip + εi,   i = 1, . . . , n

where we assume:
• E(εi) = 0
• ε1, . . . , εn are independent
• ε1, . . . , εn are iid N(0, σ²)
The least squares criterion is

S(β) = (Y − Xβ)^T (Y − Xβ)

which is minimized by β̂ = (X^T X)^{-1} X^T Y, giving fitted values Ŷ = X β̂ = HY,
where H = X(X^T X)^{-1} X^T (the hat matrix). Note that H is idempotent and symmetric.
Geometric interpretation of LSE : Ŷ is the projection of Y onto C(X), the column space of X (we can thus see that
the fitted errors should be orthogonal to our fitted values in LSE).
The degrees of freedom of our model is n − (p + 1) where p + 1 is the number of free parameters in our model.
This is equivalent to n − tr(H), i.e. tr(H) = p + 1 (see the short numerical check below).
Under normality:
• β̂ ∼ MVN(β, σ²(X^T X)^{-1})
• β̂ and σ̂² are independent (note σ̂² = SSE/df)
• (n − p − 1)σ̂²/σ² ∼ χ²_{n−p−1}
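A quick numerical check of the hat matrix facts above (my own sketch with simulated data, not from lecture):

set.seed(1)
n <- 100; p <- 2
X1 <- rnorm(n); X2 <- rnorm(n)
y  <- 1 + 2 * X1 - X2 + rnorm(n)

fit <- lm(y ~ X1 + X2)

# Design matrix (includes the intercept column) and hat matrix
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)

sum(diag(H))                                            # tr(H) = p + 1 = 3
all.equal(unname(fitted(fit)), unname(drop(H %*% y)))   # Y-hat = H Y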
Let ~ap = (1, x1 , . . . , xp )T (observation ~x extended with intercept term). The (1 − α) prediction interval at ~ap is
~ap^T β̂ ± t_{n−p−1, α/2} σ̂ √(1 + ~ap^T (X^T X)^{-1} ~ap)
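In R the same interval is available from predict with interval = "prediction"; a small sketch with simulated data (the manual computation just mirrors the formula above):

set.seed(1)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

# Built-in (1 - alpha) prediction interval at a new point
predict(fit, newdata = data.frame(x = 1.5), interval = "prediction", level = 0.95)

# Manual version: a_p' beta-hat +/- t * sigma-hat * sqrt(1 + a_p'(X'X)^{-1} a_p)
X  <- model.matrix(fit); ap <- c(1, 1.5)
s  <- summary(fit)$sigma
se <- s * sqrt(1 + t(ap) %*% solve(t(X) %*% X) %*% ap)
drop(t(ap) %*% coef(fit)) + c(-1, 1) * qt(0.975, df = fit$df.residual) * drop(se)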
Suppose instead we fit two separate lines, β0 + β1 x for x < a and β2 + β3 x for x ≥ a, subject to the continuity constraint β0 + β1 a = β2 + β3 a.
A more convenient way to express the above
y = β0 + β1 x + β2 (x − a)I(x ≥ a)
where I is the indicator function. Note the above is linear in terms of β~ BUT NOT in terms of x. However we can
simply construct a new variate (x − a)I(x ≥ a) from x.
Note that β2 is the change in slope right of a for samples where x ≥ a.
Extension to more than one interesting point (knot) is straightforward.
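A minimal sketch of fitting the broken-stick model with lm by constructing the variate (x − a)I(x ≥ a) directly (the knot a and the data are made up):

set.seed(2)
a <- 3                                   # knot (chosen for illustration)
x <- runif(200, 0, 6)
y <- 1 + 0.5 * x + 1.5 * pmax(x - a, 0) + rnorm(200, sd = 0.3)

# (x - a) * I(x >= a) is just the truncated term pmax(x - a, 0)
fit <- lm(y ~ x + I(pmax(x - a, 0)))
coef(fit)   # the third coefficient estimates the change in slope to the right of a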
Similarly, a piecewise quadratic model with a knot at a (continuous and differentiable at a) can be written as

y = β0 + β1 x + β2 x² + β3 (x − a)² I(x ≥ a)
In weighted least squares (WLS) we minimize

S(β) = (Y − Xβ)^T W (Y − Xβ)

where

W = diag(w1, w2, . . . , wn)

is an n × n diagonal matrix. wi corresponds to the weight assigned to observation i (the higher wi, the more important that
observation is).
3 January 15, 2019
Recall that

d(~c^T Y)/dY = ~c^T          d(Y^T A Y)/dY = 2 Y^T A

so

dS(β)/dβ = −2 Y^T W X + 2 β^T X^T W X.

Setting dS(β)/dβ = 0:

⇒ β^T X^T W X = Y^T W X
⇒ (X^T W X) β = X^T W Y          (since W^T = W)
⇒ β̂ = (X^T W X)^{-1} X^T W Y

as claimed.
Example 3.2. Suppose that Var(εi) = σi² (i.e. not all observations are drawn with the same variance). If we want
to overweight observations that have lower variance, we can set wi = 1/σi² to obtain the unbiased estimator of β
with the smallest variance (the Best Linear Unbiased Estimator or BLUE).
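A sketch illustrating Example 3.2 with simulated unequal variances: the closed form (X^T W X)^{-1} X^T W Y agrees with lm(..., weights = w) when wi = 1/σi² (all names are illustrative):

set.seed(3)
n     <- 100
x     <- runif(n)
sigma <- runif(n, 0.2, 2)               # known per-observation standard deviations
y     <- 2 + 3 * x + rnorm(n, sd = sigma)
w     <- 1 / sigma^2                    # inverse-variance weights

fit_wls <- lm(y ~ x, weights = w)

# Closed form (X' W X)^{-1} X' W Y
X <- cbind(1, x); W <- diag(w)
beta_hat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
cbind(coef(fit_wls), beta_hat)          # the two columns should match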
4 January 17, 2019
In a paper by Soros et al., they ended up using a sample of only 500 posts for confidentiality reasons.
The difference between the sample and study population is called sample error.
• Factors are like categorical variables in R: there are a finite number of categories (called factor levels).
• In lm almost any function of variates may appear in the formula, e.g. Y ∼ X + sin(X) or Y ∼ X + sin(X * Y).
  To specify the product term Y = X · Z, we need to use Y ∼ I(X * Z) or Y ∼ X:Z instead of X * Z, since X * Z represents
  an interaction in lm and translates to the model y = αx + βz + γxz + r.
• Some arithmetic operations, e.g. +, −, *, ^, are interpreted as formula operators rather than arithmetic operators
  in lm. One should wrap them in I(·). (See the sketch below.)
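A short sketch of these formula conventions (the data frame and variable names are made up; only the formula interpretation matters):

set.seed(4)
dat <- data.frame(X = rnorm(50), Z = rnorm(50))
dat$Y <- 1 + dat$X + 2 * dat$X * dat$Z + rnorm(50)

lm(Y ~ X + sin(X), data = dat)      # arbitrary functions of variates are allowed
lm(Y ~ X * Z,      data = dat)      # expands to X + Z + X:Z (interaction model)
lm(Y ~ X:Z,        data = dat)      # the product term X*Z alone
lm(Y ~ I(X * Z),   data = dat)      # same product term, via I()
lm(Y ~ I(X^2),     data = dat)      # arithmetic ^ must be wrapped in I()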
Figure 4.1: Quadratic and cubic polynomial linear models on Facebook data.
In the above figure we see that, while both the quadratic and cubic models are global (they predict a value for any x), the
quadratic model predicts likes returning to 0 as impressions approach infinity.
The cubic function, on the contrary, continues to increase, which makes more sense intuitively. Thus examining a model
often requires human understanding of the data and the problem.
5 January 24, 2019
Consider the general discrepancy function S(β) = Σ_{i=1}^n ρ(ri), where ρ is a real-valued loss function (in the OLS case, this
was simply the square function) and ri = yi − ~xi^T β is our residual for observation i.
Taking the derivative,

dS(β)/dβ = Σ_{i=1}^n ρ′(yi − ~xi^T β)(−1) ~xi^T = −Σ_{i=1}^n ρ′(ri) ~xi^T
where l(ri) = −ri²/(2σ²), a function only of ri.
The second equality follows from the following remark:
Remark 5.2. Note li(β) is the ith observation’s contribution to l(β), i.e. l(β) = Σ_{i=1}^n li(β).
From above we observe that minimizing the discrepancy function is the same as maximizing the log likelihood where
ρ(r) = −l(r) in the discrepancy function.
Definition 5.1 (M-estimator). We call the estimator β̂ that minimizes Σ_{i=1}^n ρ(ri) the M-estimator or the
maximum-likelihood type estimator.

Setting the derivative to zero gives the estimating equations Σ_{i=1}^n ψ(ri) ~xi = 0 where ψ = ρ′. Writing ψ(ri) = w(ri) ri,
i.e. letting wi = w(ri) = ψ(ri)/ri, and solving, we see that the solution is WLS where

β̂ = (X^T W X)^{-1} X^T W Y

with W = diag(w1, . . . , wn).
However, the weights of this WLS depend on the residuals, which in turn depend on β. If we are given an initial
estimate β^(0), we can iteratively update the residuals and β until convergence (iteratively reweighted least squares). We proceed roughly as follows:
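A minimal iteratively reweighted least squares sketch (my own illustration, using Huber's weight function introduced in the next remark and the MAD scale estimate discussed later; c = 1.345 assumes sensibly scaled residuals):

set.seed(5)
x <- runif(100); y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 10                      # a few gross outliers

X <- cbind(1, x)
c_huber <- 1.345
w_huber <- function(r) ifelse(abs(r) <= c_huber, 1, c_huber / abs(r))

beta <- solve(t(X) %*% X, t(X) %*% y)      # OLS as the initial estimate
for (iter in 1:50) {
  r        <- drop(y - X %*% beta)
  s        <- median(abs(r)) / 0.6745      # robust scale estimate (MAD / 0.6745)
  W        <- diag(w_huber(r / s))
  beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
  if (max(abs(beta_new - beta)) < 1e-8) break
  beta <- beta_new
}
beta                                       # robust coefficient estimates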
Remark 5.3. The OLS loss function is unbounded in the residual, and so extreme outliers with large residuals have
significantly more influence.
Huber (1964) proposed a modified loss function (Huber loss) which de-emphasizes outliers:
ρ(r) = { r²/2            if |r| ≤ c
       { c(|r| − c/2)    if |r| > c
The modified loss function essentially makes the loss function linear after a certain threshold c:
We also let

ψ(r) = { r              if |r| ≤ c
       { c · sign(r)    if |r| > c

and thus

w(r) = { 1        if |r| ≤ c
       { c/|r|    if |r| > c
The ψ and weight w functions look like
Figure 5.1: Left: ψ(r). Right: w(r) for Huber’s loss function.
How do we decide c? Huber suggested c = 1.345 and showed it achieves 95% efficiency relative to LSE asymptotically when the true
distribution is normal (95% efficiency essentially means the variance of the betas from OLS is 95% of the
variance of the betas using Huber’s loss).
Question 5.3. Since c is fixed, what if our residuals are scaled to very large or small values (e.g. O(1e5) or
O(1e − 4))? We would have to scale our data beforehand to make it within a sensible range so that c = 1.345 makes
sense.
Sometimes we prefer the ψ function to “redescend” i.e. ψ(r) → 0 when |r| is large (that is: we fully de-emphasize
outliers). Other ψ functions include
Redescending M-estimator (Hampel)

ψ(r) = { r                                 if 0 ≤ |r| ≤ a
       { a · sign(r)                       if a ≤ |r| ≤ b
       { a · (c − |r|)/(c − b) · sign(r)   if b ≤ |r| ≤ c
       { 0                                 if |r| > c
The recommended settings are a = 2, b = 4, c = 8 (with appropriately scaled data and residuals).
Tukey’s biweight

ψ(r) = { r [1 − (r/c)²]²    if |r| ≤ c
       { 0                  if |r| > c
where c = 4.685 is typically used. This is designed to have 95% efficiency as well for a true normal distribution.
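In practice these M-estimators are available through rlm in the MASS package; a sketch (psi.huber and psi.bisquare are MASS's Huber and Tukey-biweight ψ functions, and the tuning constants are passed through to them):

library(MASS)

set.seed(6)
x <- runif(100); y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 10                                    # outliers

fit_ols   <- lm(y ~ x)
fit_huber <- rlm(y ~ x, psi = psi.huber,    k = 1.345)   # Huber, c = 1.345
fit_tukey <- rlm(y ~ x, psi = psi.bisquare, c = 4.685)   # Tukey's biweight

rbind(ols = coef(fit_ols), huber = coef(fit_huber), tukey = coef(fit_tukey))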
To put the residuals on a sensible scale (cf. Question 5.3), compute

MAD = median(|ri|)

and let ŝ = MAD/0.6745; the residuals are then scaled as ri/ŝ. For the standard normal distribution we note that MAD = 0.6745, so ŝ is approximately unbiased for σ under normality.
The sensitivity curve of a statistic Tn to a contaminating point y is

SC(y) = [Tn(y1, . . . , yn−1, y) − Tn−1(y1, . . . , yn−1)] / (1/n)

which is the difference between Tn(·) (with all n points, including y) and Tn−1(·) (with the point y omitted), compared to the
contamination size 1/n.
Example 6.1. Let Tn(y1, . . . , yn) = (1/n) Σ_{i=1}^n yi = ȳn (the sample mean). Note that

Tn = (1/n) [ Σ_{i=1}^{n−1} yi + y ] = ((n − 1)/n) ȳ_{n−1} + (1/n) y

so SC(y) = y − ȳ_{n−1}, which is unbounded in y: a single point can move the sample mean arbitrarily.
Definition 6.1 (Breakdown point). Informally, the breakdown point of a statistic is the largest proportion of
contamination before the statistic breaks down.
Formally, let ~zi = (xi1 , xi2 , . . . , xip , yi )T for i = 1, . . . , n be the ith data vector.
Let Z = (~z1 , . . . , ~zn ) be the whole set. Let T be the statistic of interest. The worst error for swapping m zi ’s is
e(m; T, Z) = sup_{Z*_m} ‖T(Z*_m) − T(Z)‖

where Z*_m ranges over all datasets obtained from Z by replacing m of the ~zi with arbitrary values. The breakdown point is then
the smallest proportion m/n for which e(m; T, Z) is unbounded.
Remark 6.2. That is: the breakdown point measures the minimum proportion of points required to influence
the statistic significantly.
Sample mean Note we can simply swap out m = 1 point arbitrarily such that e(1; T, Z) → ∞, thus the breakdown
point is 1/n → 0 as n → ∞.
Median The breakdown point is 1/2 as n → ∞: we need to change at least half of the points to arbitrarily influence the
median, e.g. make it go to infinity.
k% trimmed mean The k% trimmed mean is defined as the mean after discarding the lowest k% and highest k%
of yi ’s.
Breakdown point is k% (we swap out the top k% + 1 points).
or equivalently

argmin_β average_i (yi − ~xi^T β)²

To make it robust to “outliers” or contamination, i.e. to ensure we have a high breakdown point, we could consider
the least median of squares (LMS) estimator:

β̂_LMS = argmin_β median_i (yi − ~xi^T β)²

which has a breakdown point of 1/2 (compared to a breakdown point of 1/n for OLS).
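A sketch comparing OLS with the least median of squares fit using MASS::lqs with method = "lms" (the contamination level is made up):

library(MASS)

set.seed(7)
x <- runif(100); y <- 1 + 2 * x + rnorm(100, sd = 0.2)
y[1:20] <- 10 + rnorm(20)            # 20% gross contamination

coef(lm(y ~ x))                      # OLS is pulled towards the outliers
coef(lqs(y ~ x, method = "lms"))     # least median of squares resists them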
7 January 31, 2019
f (x) = β0 + β1 x + β2 (x − a)I(x ≥ a)
Remark 7.1. Piecewise linear is also called the broken stick method.
For notation simplicity let us define (x)+ = max(x, 0) such that we have
f (x) = β0 + β1 x + β2 (x − a)+
Thus our basis functions are 1, x, (x − a)+. Here is a plot of the basis:
This is an example of the truncated power series. We can easily generalize this model to accommodate many
break points or knots.
However, piecewise linear functions are not differentiable at their break points since f′(x) is not continuous.
Recall that for a piecewise quadratic function we have
f (x) = β0 + β1 x + β2 x2 + β3 (x − a)2+
where our basis functions are 1, x, x2 , (x − a)2+ . Note that a piecewise quadratic model f (x) is indeed differentiable
at the break points.
Let t1 < t2 < . . . < tk be fixed and known knots, where t1 and tk are boundary knots and t2, . . . , tk−1 are interior
knots.
Then the basis consists of the functions 1, x, x2 , x3 , (x − t1 )3+ , . . . , (x − tk )3+ . That is any cubic spline with the
above k knots can be expressed as
f(x) = β0 + β1 x + β2 x² + β3 x³ + Σ_{j=1}^k β_{j+3} (x − tj)³₊
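A sketch that builds this truncated power basis by hand and fits it with lm (the knots are arbitrary choices):

set.seed(8)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

knots <- c(2.5, 5, 7.5)
# Basis: x, x^2, x^3, and (x - t_j)^3_+ for each knot (lm adds the intercept 1)
trunc3 <- sapply(knots, function(t) pmax(x - t, 0)^3)
fit <- lm(y ~ x + I(x^2) + I(x^3) + trunc3)

length(coef(fit))    # k + 4 parameters: here 3 + 4 = 7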
8 February 5, 2019
8.1 Natural cubic splines (NCS)
A cubic spline is called a natural cubic spline with knots {t1 , . . . , tk } if f (x) is linear when x 6∈ [t1 , tk ], that is
f(x) = { a0 + b0 x    if x < t1
       { ak + bk x    if x > tk
Question 8.1. How many free parameters are there in the natural cubic spline?
Answer. Note that in general cubic splines, we have k + 4 parameters. If we constrain our spline to be linear at
both ends (x < t1 and x > tk ) then we essentially remove the quadratic and cubic terms and thus parameters at
each end. So we remove 4 parameters and thus we have k free parameters.
To express an NCS, note that for a regular cubic spline we have
f(x) = β0 + β1 x + β2 x² + β3 x³ + Σ_{j=1}^k β_{j+3} (x − tj)³₊
we want all the x3 terms to have 0 coefficients (first term of expansion) and all x2 terms to also have 0
coefficients (second term of expansion).
These conditions are necessary and sufficient.
Claim. We claim N1 (x) = 1, N2 (x) = x, and Nj (x) = dj−1 (x) − d1 (x) for j = 3, . . . , k where
dj(x) = [(x − tj)³₊ − (x − tk)³₊] / (tk − tj)
i.e.

β4 (tk − t1) = − Σ_{j=2}^{k−1} β_{j+3} (tk − tj)

as desired.
Note that we have 4 separate (linearly independent) constraints on the parameters hence why we lose 4 degrees of
freedom.
N0(x) = 1
N1(x) = x
Nj(x) = (tk − tj)[dj(x) − d1(x)],   j = 2, . . . , k − 1
Definition 8.1 (Regression splines). The fixed-knot splines, such as cubic splines and NCS, are called regression
splines.
yi ≈ Σ_{j=1}^k βj Nj(xi) + εi
Now we simply fit the following linear model with design matrix
X = [ N1(x1)  . . .  Nk(x1) ]
    [   ..     ..      ..   ]
    [ N1(xn)  . . .  Nk(xn) ]

where

β = (β1, . . . , βk)^T    and    Y = (y1, . . . , yn)^T
Remark 8.2. The problem becomes a regular regression problem with design matrix generated from the basis
functions Nj ’s.
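A sketch of exactly this workflow: generate the NCS basis functions Nj with splines::ns at chosen knots, then fit the coefficients by ordinary lm (knots are arbitrary):

library(splines)

set.seed(9)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# ns() builds the natural cubic spline design matrix N_j(x_i)
N <- ns(x, knots = c(2.5, 5, 7.5))    # interior knots; boundary knots default to range(x)
dim(N)                                # n rows, one column per basis function

fit <- lm(y ~ N)                      # ordinary least squares on the basis
plot(x, y); lines(x, fitted(fit), lwd = 2)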
8.3 General function fitting with basis functions

More generally, we can fit models of the form f(~x) = Σ_j βj hj(~x) for chosen basis functions hj. For example:

1. hj(~x) = xj for j = 1, . . . , p gives the original linear model, where the basis functions are simply the coordinates of ~x.
9 February 7, 2019
9.1 Choosing k for NCS
Recall that the basis functions for NCS are
N0(x) = 1
N1(x) = x
Nj(x) = dj−1(x) − d1(x),   j = 3, . . . , k
Equal-distance knots We choose k first arbitrarily e.g. k = 5, then we use an equal-distance grid between the
min and max of xi ’s.
Quantiles Quantiles are also a popular choice, e.g. place knots at the i/(k − 1) quantiles of the xi's for i = 0, . . . , k − 1.
Degrees of freedom Alternatively we can instead specify the degrees of freedom for an NCS i.e. the number of
free parameters. For df = k, we would have k − 2 knots (if intercept term is also included). Usually knots are
placed at equal distance quantiles.
9.2 Smoothing splines

Consider choosing f to minimize the penalized criterion

Σ_{i=1}^n [yi − f(xi)]² + λ ∫_{−∞}^{∞} [f″(x)]² dx

Remark 9.1. 1. Σ_{i=1}^n [yi − f(xi)]² is the sum of squared residuals, which measures the goodness of fit.
2. ∫_{−∞}^{∞} [f″(x)]² dx measures the “roughness” of f(x).
Remark 9.2. Note that we try to minimize the integral over the f 00 (x) (squared), which is essentially
minimizing f 00 (x) so that it is close to 0.
For example, if f(x) = β0 + β1 x (OLS) then f″(x) = 0, thus ∫_{−∞}^{∞} [f″(x)]² dx = 0, i.e. no penalty for OLS.
3. The role of λ: if λ = 0 then we have no roughness penalty, we minimize the SSR over all functions, and f̂λ(x) is an
interpolating curve through the data.
If λ = ∞ then we force ∫_{−∞}^{∞} [f″(x)]² dx = 0, thus f̂λ(x) is the ordinary least squares (linear) fit.
4. Remarkably we can show that fˆλ (x) is just the natural cubic spline with knots at distinct values of {xi }ni=1 .
Claim. If s(x) is the natural cubic spline interpolating the same values as f at the points {xi}, then

∫_{−∞}^{∞} [s″(x)]² dx ≤ ∫_{−∞}^{∞} [f″(x)]² dx
Definition 9.1 (Smoothing spline). We call the function fitted by the penalized regression a smoothing spline.
that is

β̂λ = argmin_β  Σ_{i=1}^n [yi − Σ_{j=1}^k βj Nj(xi)]²  +  λ ∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)]² dx

The first term is (Y − Xβ)^T (Y − Xβ), where

X = [ N1(x1)  . . .  Nk(x1) ]
    [   ..     ..      ..   ]
    [ N1(xn)  . . .  Nk(xn) ]
Also,

∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)]² dx = ∫_{−∞}^{∞} [Σ_{j=1}^k βj Nj″(x)] [Σ_{l=1}^k βl Nl″(x)] dx
                                     = ∫_{−∞}^{∞} Σ_{j=1}^k Σ_{l=1}^k βj βl Nj″(x) Nl″(x) dx
                                     = Σ_{j=1}^k Σ_{l=1}^k βj βl ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx
                                     = β^T N β

where N = (Njl) with Njl = ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx (the (j, l)-th entry of N is Njl).
Therefore we can let

S(β) = (Y − Xβ)^T (Y − Xβ) + λ β^T N β
     = Y^T Y − β^T X^T Y − Y^T X β + β^T X^T X β + λ β^T N β
     = Y^T Y − 2 Y^T X β + β^T (X^T X + λN) β

and β̂λ = argmin_β S(β).
Recall that for a matrix A and vectors Y and ~c

∂(~c^T Y)/∂Y = ~c^T          ∂(Y^T A Y)/∂Y = 2 Y^T A^T

thus we have

∂S(β)/∂β = −2 Y^T X + 2 β^T (X^T X + λN)^T = 0
⇒ (X^T X + λN) β̂λ = X^T Y
⇒ β̂λ = (X^T X + λN)^{-1} X^T Y
To calculate the effective number of parameters or effective df (edf): recall for NCS we have k knots, and in OLS with X (n × p)

Ŷ = HY = X(X^T X)^{-1} X^T Y

where the number of parameters is df = tr(H).
Now in the smoothing spline we have Ŷ = Sλ Y with Sλ = X(X^T X + λN)^{-1} X^T, so we define the effective degrees of freedom as dfλ = tr(Sλ).
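In R, smooth.spline fits this penalized criterion directly; a sketch showing how the effective degrees of freedom tr(Sλ) track the amount of smoothing (the df values are arbitrary choices):

set.seed(10)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_cv <- smooth.spline(x, y)                # lambda chosen by (generalized) cross-validation
fit_cv$df                                    # effective degrees of freedom tr(S_lambda)

fit_rough  <- smooth.spline(x, y, df = 20)   # small penalty: wigglier fit, larger edf
fit_smooth <- smooth.spline(x, y, df = 4)    # large penalty: smoother fit, smaller edf

plot(x, y)
lines(predict(fit_rough,  x), col = "red")
lines(predict(fit_smooth, x), col = "blue", lwd = 2)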
1. Numerically stable: recall cubic splines have x³ terms which grow quickly as x → ∞. B-splines are fitted to a
restriction of x (the d + 5 knots), so each basis function has only local support.
2. Computationally efficient when the number of knots k is large. More specifically, least squares estimation with n
observations and k variables takes O(nk² + k³) operations; if k → n then this becomes O(n³). B-splines
reduce this cost to O(n), since each row of the design matrix has only O(d) non-zero entries (a constant).
where Bi,0 (x) is the interval indicator function. It is also known as the Haar basis function.
In general, for a degree-d B-spline we define its basis recursively as

Bi,d(x) = [(x − ti)/(ti+d − ti)] Bi,d−1(x) + [(ti+d+1 − x)/(ti+d+1 − ti+1)] Bi+1,d−1(x)
After we compute the basis functions given our x, we can fit the model as an OLS or robust LR model. In R, we can
use the function bs in the package splines to generate the B-spline basis functions (note there are no intercepts
included). This will give us a design matrix with d + k basis functions (so d + k degrees of freedom) where d is the
degree and k is the number of knots (d starts at 0 for the constant function).
Then we simply feed this to lm or rlm as usual (which will subsequently introduce the bias term). Note that lm will
add one more degree of freedom with the intercept for (d + 1) + k degrees of freedom.
Similarly we can generate NCS basis functions with ns in splines.
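A sketch of the bs-then-lm workflow just described (degree and knots are arbitrary); the design matrix has d + k columns and lm adds the intercept:

library(splines)
library(MASS)

set.seed(11)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

B <- bs(x, degree = 3, knots = c(2.5, 5, 7.5))   # B-spline basis, no intercept column
dim(B)                                           # n x (d + k) = 200 x 6

fit <- lm(y ~ B)                                 # lm adds the intercept: (d + 1) + k parameters
fit_robust <- rlm(y ~ B)                         # or a robust fit on the same basis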
Remark 10.1. Since smoothing splines are penalized for their “smoothness” this allows us to choose a high number
of knots.
11 February 26, 2019
We take the k nearest neighbours for every x and compute the mean response value of the neighbours. This then
becomes the fitted value at xi and we may linearly interpolate or even quadratically interpolate (or even higher
order polynomial interpolation) between points.
This can be accomplished in R with knn.reg from the FNN package.
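A sketch of KNN regression with FNN::knn.reg on simulated data (the k values are arbitrary; larger k gives a smoother fit, as the following remark notes):

library(FNN)

set.seed(12)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

grid  <- seq(0, 10, length.out = 400)
fit5  <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 5)
fit50 <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 50)

plot(x, y)
lines(grid, fit5$pred,  col = "red")            # wiggly
lines(grid, fit50$pred, col = "blue", lwd = 2)  # smoother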
Remark 11.1. As the neighbourhood size k increases, the fitted function becomes smoother.
Instead of taking the mean response based on the k neighbours, we can instead use the value from any fitted model
based on those k neighbours (e.g. lm, rlm with Huber, Tukey’s, etc., ltsreg).
Remark 11.2. We can think of KNN local linear regression as weighted linear regression where wj = 0 if xj is
outside the neighbourhood of xi .
where the first two conditions standardize K(t) and the last ensures that weight is spread along the real line but not
too much weight lies in the extremes.
Some examples of kernels:
Thus for a bandwidth parameter h, the weight assigned to neighbour xi when fitting at the current point x is defined as

wi = w(x, xi) = K((xi − x)/h) / Σ_{j=1}^N K((xj − x)/h)
For the mean response µ̂(x) we take the weighted average or the Nadaraya-Watson estimator:
µ̂(x) = Σ_{i=1}^N wi yi
Remark 11.3. The boundary effect occurs when no points lie on one side of the kernel and thus the weights are
distributed in a biased way to the points on the other side. This occurs at the extremes of the explanatory variate
space.
Figure 11.2: The boundary effect causes the kernel fitted line (green) at the left end to bias the fitted value higher,
since most of the available points are to the right of the kernel and have a higher response value.
In R we can use the loess function where span defines the proportion of points in the local neighbourhood. The
kernel used is Tukey’s tri-cube.
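A sketch of local regression with loess on simulated data; span controls the proportion of points in each local neighbourhood (the values are arbitrary):

set.seed(13)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_wiggly <- loess(y ~ x, span = 0.15, degree = 1)   # small neighbourhood
fit_smooth <- loess(y ~ x, span = 0.75, degree = 1)   # large neighbourhood

grid <- seq(min(x), max(x), length.out = 400)
plot(x, y)
lines(grid, predict(fit_wiggly, newdata = data.frame(x = grid)), col = "red")
lines(grid, predict(fit_smooth, newdata = data.frame(x = grid)), col = "blue", lwd = 2)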
A linear smoother produces fitted values Ŷ = Sλ Y that are a linear combination of the yi's, with Sλ being the smoother matrix.
Consider a regression with a small number p of basis functions. That is, for basis functions b1, . . . , bp (e.g. the NCS
basis) let B be the n × p matrix with entries B_{i,j} = bj(xi).
Ŷ = B(B T B)−1 B T Y
Claim. It can be shown (for any hat matrix) that the column space of B satisfies C(B) = C(HB ). Note however
B has p columns whereas HB has n columns, so there is some redundancy.
HB = U P U^T
   = [~u1 · · · ~un] diag(ρ1, . . . , ρn) [~u1 · · · ~un]^T
   = Σ_{i=1}^n ρi ~ui ~ui^T

where ρ1 ≥ . . . ≥ ρn ≥ 0 are the eigenvalues and ~u1, . . . , ~un are the corresponding orthonormal eigenvectors.
Then

Ŷ = HB Y = Σ_{i=1}^n ρi ~ui ~ui^T Y = Σ_{i=1}^n ρi ⟨~ui, Y⟩ ~ui        (⟨·, ·⟩ the inner product)
Thus Y is first projected to the orthonormal basis {~u1 , . . . , ~un } then modulated by {ρ1 , . . . , ρn }.
Because HB is idempotent, and assuming B is full rank (i.e. rank(B) = p), we have P^n = P for all n ∈ N, thus

ρi = { 1    if i = 1, . . . , p
     { 0    if i = p + 1, . . . , n
Returning to smoothing splines (with knots at the n distinct xi values), we claim the smoother matrix Sλ can be written in the Reinsch form

Sλ = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1} does not depend on λ and N is the roughness penalty matrix for NCS, that is N = (Njl)_{n×n} with

Njl = ∫_{−∞}^{∞} Nj″(x) Nl″(x) dx
Proof. First remark that X is a square (n × n) matrix since we assume all xi values are distinct, so

Sλ = X (X^T X + λN)^{-1} X^T
   = [ (X^T)^{-1} (X^T X + λN) X^{-1} ]^{-1}
   = [ I + λ (X^T)^{-1} N X^{-1} ]^{-1}
   = (I + λK)^{-1}

where K = (X^T)^{-1} N X^{-1}.
More generally, consider the penalized problem

min_{~µ}  (Y − ~µ)^T (Y − ~µ) + λ ~µ^T K ~µ

where K is known as the penalty matrix; K is symmetric and has eigendecomposition

K = V D V^T

with D = diag(d1, . . . , dn) the eigenvalues (di ≥ 0) and V = (~v1, . . . , ~vn) an orthonormal matrix (of eigenvectors).
Remark 12.1.
1. K = Σ_{i=1}^n di ~vi ~vi^T (a sum of rank-one matrices) and ~µ^T K ~µ = Σ_{i=1}^n di ⟨~vi, ~µ⟩².
   This implies that ~µ is penalized more in the directions of the ~vi’s with large di values.
Remark 12.2. 1. Sλ and K share the same eigenvectors which do not depend on λ.
~µ̂ = Sλ Y = Σ_{i=1}^n ρi(λ) ~vi ~vi^T Y = Σ_{i=1}^n ρi(λ) ⟨~vi, Y⟩ ~vi        where ρi(λ) = 1/(1 + λ di)
that is, we project Y onto every eigenvector ~vi and scale each projection by the corresponding eigenvalue ρi(λ).
Note the first two directions (those with di = 0) are not shrunk by ρi(λ), but the rest are shrunk towards 0 since ρi(λ) < 1.
Recall that for OLS / regression splines, by contrast,

ρi = { 1    if i = 1, . . . , p
     { 0    if i = p + 1, . . . , n
Comparing the two, OLS selects the eigenvector directions with eigenvalue 1 and drops the other eigenvectors (hard
thresholding), whereas smoothing splines shrink Y in the direction of each eigenvector according to its corresponding
eigenvalue ρi(λ).
For this reason OLS or regression splines are called projection smoothers and smoothing splines are shrinking
smoothers.
Remark 12.3. 1. The sequence of ~vi, ordered by decreasing eigenvalues ρi(λ), appears to increase in complexity
(i.e. roughness or “wiggliness”).
2. dfλ = tr(Sλ) = Σ_{i=1}^n ρi(λ) = Σ_{i=1}^n 1/(1 + λ di) is monotone in λ; thus if we want a specific dfλ we can simply do a linear search for the corresponding λ (since the di's are also fixed).
13 March 5, 2019
13.1 Local linear regression as a linear smoother
We show that local linear regression is indeed a linear smoother.
For target value x (point we are doing local regression about), local linear regression is equivalent to solving the
weighted optimization problem
argmin_{α,β} Σ_{i=1}^n kh(x − xi) [yi − (α + β xi)]²
and our fitted value at x from the local regression is f̂(x) = α̂(x) + β̂(x) x, where kh is the kernel function used to form
the local weights.
Note the above optimization problem has an explicit solution. Let

B = [ 1  x1 ]
    [ 1  x2 ]
    [ ..  .. ]
    [ 1  xn ]

W(x) = diag(kh(x − x1), . . . , kh(x − xn))

where B is n × 2 and W(x) is n × n.
Then the fitted value at x can be rewritten as f̂(x) = (1, x) (B^T W(x) B)^{-1} B^T W(x) Y, a linear combination of the yi's; stacking these rows over the target values x = x1, . . . , xn gives f̂(X) (the n × 1 vector of fitted values) as SY for a smoother matrix S, so local linear regression is indeed a linear smoother.
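A from-scratch sketch of this explicit solution with a Gaussian kernel (the bandwidth h is an arbitrary choice; this is my own illustration, not course code):

set.seed(14)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

local_linear <- function(x0, x, y, h = 0.5) {
  kh <- dnorm((x - x0) / h) / h                    # Gaussian kernel weights k_h(x0 - x_i)
  B  <- cbind(1, x)                                # n x 2 design matrix
  W  <- diag(kh)                                   # n x n weight matrix W(x0)
  ab <- solve(t(B) %*% W %*% B, t(B) %*% W %*% y)  # (alpha(x0), beta(x0))
  ab[1] + ab[2] * x0                               # fitted value at x0
}

grid <- seq(0.5, 9.5, length.out = 200)
fhat <- sapply(grid, local_linear, x = x, y = y, h = 0.5)
plot(x, y); lines(grid, fhat, lwd = 2)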
14 March 7, 2019
14.1 Multivariate local regression
We want to make a prediction at target value ~x = (x1 , . . . , xp )T . A simple kernel function we could use in local
regression in Rp
Kh(~x) = (1/h) K(‖~x‖ / h)
Remark 14.1. The issue with the Euclidean norm is we weight every coordinate/variate xi equally. If a variate
has less importance we should not give it the same weight as other variates in the kernel.
That is, the kernel above is spherically symmetric in ~x and gives equal weight to each coordinate.
We use structured local regression instead. This is a more general approach where we use a positive semidefinite
matrix Ap×p to weight each coordinate, that is:
Kh,A(~x) = (1/h) K(~x^T A ~x / h)
and

f(~x) = Σ_{j,k} βjk gjk(~x)
We note that the number of parameters with a tensor product basis grows exponentially with p.
where

J(f) = ∫∫ [ (∂²f/∂x1²)² + 2 (∂²f/∂x1∂x2)² + (∂²f/∂x2²)² ] dx1 dx2

with the double integral taken over R × R.
We can show the optimal solution has the form (at a given target ~x)
fλ(~x) = β0,λ + βλ^T ~x + Σ_{i=1}^n αi hi(~x)
where the generator function is the radial basis function hi (~x) = k~x − x~i k2 logk~x − x~i k.
Note x~i for i = 1, . . . , n are our control/knot points.
For p = 2, again assume data points are uniformly distributed across [0, 1] × [0, 1]. Suppose we have a neighbourhood
x ± 0.1: it covers an area of 0.2 × 0.2 = 0.04, which only captures ≈ 4% of the points!
As p increases, the data become so sparse that a neighbourhood of fixed width along each dimension captures a vanishing
fraction of the points, so any neighbourhood containing a reasonable fraction of the data must become very wide.
14.5 Structured regression additive approach

Instead of modelling

µ(~x) = f(x1, . . . , xp)

which may be some arbitrary, possibly interactive function of every variate, we consider the additive model

µ(~x) = α + Σ_{j=1}^p fj(xj)
Remark 14.2. We can extend the above model to allow a limited number of interactions.
For example if p is small we can consider additional pairwise interactions
f(~x) = α + Σ_{j=1}^p fj(xj) + Σ_{j=1}^p Σ_{k=1}^p fjk(xj, xk)
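A sketch of fitting such an additive model with the mgcv package, where each s(xj) is a smooth fj estimated by penalized splines (the data are simulated; ti() adds a bivariate interaction smooth):

library(mgcv)

set.seed(15)
n  <- 300
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)   # x3 is irrelevant

fit <- gam(y ~ s(x1) + s(x2) + s(x3))   # additive model: alpha + f1(x1) + f2(x2) + f3(x3)
summary(fit)
plot(fit, pages = 1)                    # estimated f_j's

# A limited pairwise interaction can be added with a bivariate smooth, e.g.
fit2 <- gam(y ~ s(x1) + s(x2) + ti(x1, x2))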
17 March 19, 2019
The test error of a model f̂ fit on a training set T is

ErrT = E[(Y − f̂(X))² | T]

where the expectation is over the true joint distribution of (X, Y), i.e. the population.
Definition 17.2 (Expected test error). We define the expected test error as

Err = E[ErrT] = E[(Y − f̂(X))²]

where the expectation is over the distribution of (X, Y) and the random generation of training sets.
Definition 17.3 (Training error). We define the training error (written err to distinguish it from Err) as

err = (1/n) Σ_{i=1}^n (yi − f̂(xi))² = RSS/n

However, err uses the same data twice (once for producing f̂ and once for calculating the error) and does not track
Err well.
We note that, as model complexity grows, the training error keeps decreasing while the test error starts to increase beyond the
optimal complexity: test error increases after a certain point due to overfitting to the training set.
To regularize our model for complexity, some solutions include:
Information criteria let d denote the number of parameters. We define the Akaike Information Criterion
(AIC) as:
AIC = 2d + n log(RSS)
and the Bayesian Information Criterion (BIC), which has a larger regularization effect, as:
Remark 17.1. The information criteria are only useful for comparing models for the same training sample.
Their absolute values are meaningless.
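A sketch computing the criteria as defined above for two candidate models on the same sample (note R's built-in AIC()/BIC() use the log-likelihood and so differ from these formulas by constants, which does not affect comparisons on the same data):

set.seed(16)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_small <- lm(y ~ x)                       # d = 2 parameters
fit_big   <- lm(y ~ poly(x, 5))              # d = 6 parameters

crit <- function(fit) {
  rss <- sum(resid(fit)^2)
  d   <- length(coef(fit))
  n   <- length(resid(fit))
  c(AIC = 2 * d + n * log(rss),              # as defined in these notes
    BIC = log(n) * d + n * log(rss))
}
rbind(small = crit(fit_small), big = crit(fit_big))  # compare models on the same sample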
Cross-validation (CV) Recall we can estimate the test error Err by repeatedly sampling test sets from the
population.
We hold out a part of the training set as our “population” (cross-validation set) and construct our model on
the remaining training set. We can then validate our complexity and model on the cross-validation set as an
estimate of the test error.
This error on the cross-validation set is called the cross-validation error.
k-fold CV 1. Given a training set T, randomly partition it into k disjoint equal-sized parts (“folds”) T1, . . . , Tk.
2. For every i = 1, . . . , k, we train our model on the folds T1, . . . , Ti−1, Ti+1, . . . , Tk to obtain f̂^(i), then we
evaluate f̂^(i) on Ti to get the cross-validation error for that fold. Let i(j) be the fold containing
example j; then the overall cross-validation error is

CV(f̂) = (1/n) Σ_{j=1}^n (yj − f̂^(i(j))(xj))²
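A minimal k-fold cross-validation sketch for choosing a smoothing parameter (here the df of a natural spline; k = 10 and the df grid are arbitrary choices):

library(splines)

set.seed(17)
n   <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

k     <- 10
folds <- sample(rep(1:k, length.out = n))        # random partition into k folds

cv_error <- function(df) {
  fold_errs <- sapply(1:k, function(i) {
    test <- folds == i
    fit  <- lm(y ~ ns(x, df = df, Boundary.knots = c(0, 10)), data = dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ])
    mean((dat$y[test] - pred)^2)                 # CV error contribution of fold i
  })
  mean(fold_errs)
}

dfs <- 2:15
cv  <- sapply(dfs, cv_error)
dfs[which.min(cv)]                               # df with the smallest CV error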
The choice of k = n is called leave-one-out (LOO) CV. The justification for LOO CV: let f̂−i(xi) be
the fitted value at xi without using (xi, yi) during training. Then

E(yi − f̂−i(xi))² = E(yi − f(xi) + f(xi) − f̂−i(xi))²
                 = E(yi − f(xi))² + 2 E[εi (f(xi) − f̂−i(xi))] + E(f(xi) − f̂−i(xi))²
                 = σ² + E(f(xi) − f̂−i(xi))²          (the cross term is 0 since εi is independent of f̂−i)
                 ≈ σ² + E(f(xi) − f̂(xi))²

that is, LOO CV provides an approximate estimate of the test error Err (up to the constant σ²).
Remark 17.2. Since LOO CV requires fitting the model n times, it is infeasible for large n.
Remark 17.3. For most linear smoothers with Ŷ = SY, where S is the smoother matrix, it can be shown
that

CV(f̂) = (1/n) Σ_{i=1}^n (yi − f̂−i(xi))² = (1/n) Σ_{i=1}^n [ (yi − f̂(xi)) / (1 − sii) ]²

where sii is the ith diagonal element of S. We can thus simply fit the data once and weight the squared
residuals by 1/(1 − sii)².
The above proof for OLS is in A2 Q2 part (d).
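A sketch verifying the shortcut for OLS: the single-fit weighted formula matches refitting with each observation left out (hatvalues returns the diagonal entries sii of the hat/smoother matrix):

set.seed(18)
dat <- data.frame(x = runif(50))
dat$y <- 1 + 2 * dat$x + rnorm(50)
fit <- lm(y ~ x, data = dat)

# Shortcut: one fit, residuals weighted by 1 / (1 - s_ii)^2
loo_shortcut <- mean((resid(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit n times, each time leaving one observation out
loo_brute <- mean(sapply(1:nrow(dat), function(i) {
  f_i <- lm(y ~ x, data = dat[-i, ])
  (dat$y[i] - predict(f_i, newdata = dat[i, , drop = FALSE]))^2
}))

c(shortcut = loo_shortcut, brute_force = loo_brute)   # identical up to rounding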
Generalized Cross Validation (GCV) For any linear smoother Ŷ = SY we define the GCV error as

GCV(f̂) = (1/n) Σ_{i=1}^n [ (yi − f̂(xi)) / (1 − tr(S)/n) ]²

where we use the average trace tr(S)/n in place of each individual sii.
Note that LOO CV is approximately unbiased for Err, but can have high variance due to the n training sets being
very similar to one another. On the other hand, a small k tends to have large bias but small variance. To balance
bias and variance, k = 5 or k = 10 is recommended.
18 March 26, 2019
Consider partitioning the explanatory space into neighbourhoods R1, . . . , RK and fitting the piecewise-constant model

f(x) = Σ_{k=1}^K µ̂k I_{Rk}(x)

where I_{Rk}(xi) = 1 if xi ∈ Rk (neighbourhood k) and µ̂k is the average response of the points in neighbourhood k. This is
equivalent to local average regression.
Optimizing the above directly is still computationally difficult, since there is a combinatorial number of ways to partition
the N points into K ∈ N neighbourhoods.