
Kernel and Local Regression

MAST90083 Computational Statistics and Data Mining

Karim Seghouane
School of Mathematics & Statistics
The University of Melbourne


Outline

§5.1 Introduction

§5.2 One Dimensional Kernel

§5.3 Local Polynomial Regression

§5.4 Generalized Additive Models


Introduction

Fitting a good linear model often involves considerable time spent adequately modelling:
▶ Nonlinear dependencies
▶ Significant and insignificant variables
▶ Interactions between variables
Various methods have been proposed to overcome these limitations, among them spline regression. Here we look at an alternative to linear and spline regression that overcomes the issue of nonlinearities.


Introduction

▶ We discuss an alternative regression technique for estimating a regression function $f(x)$ over a domain in $\mathbb{R}^p$
▶ The approximation is realized by fitting a simple model at each point $x_i$, $i = 1, \ldots, n$
▶ At each point $x_i$, the model makes use of the training samples close to $x_i$, producing a smooth estimate $\hat f(x)$ in $\mathbb{R}^p$
▶ The selection of the training samples is realized using a weighting function known as a kernel $K_h(x_i, x_j)$


Introduction

▶ $K_h(x_i, x_j)$ assigns a weight to $x_j$ based on its scaled distance to $x_i$, where the scale is controlled by a parameter $h$
▶ The scale $h$ controls the size of the effective neighborhood used for estimation
▶ These methods differ in the shape of the kernel function and do not require training
▶ The only parameter that needs to be tuned using training samples is the kernel width $h$


Introduction

Kernel regression has been around since the 1960s and is one of the most popular methods for “nonparametrically” fitting a model to data. We work here in the regression context, but there exist extensions to classification models via logistic regression.

We will focus on the most popular approaches: kernel regression and local polynomial regression.


One Dimensional Kernel

▶ Consider the regression model
$$y_i = f(x_i) + \epsilon_i, \qquad E(\epsilon_i) = 0$$
▶ We are interested in estimating the regression function
$$f(x) = E(y \mid x)$$
▶ using a training set $(x_i, y_i)$, $i = 1, \ldots, n$.
▶ The relationship between $x$ and $y$ is likely to be nonlinear


One Dimensional Kernel

▶ A direct method: the $k$-nearest-neighbor average. Use the average of those observations in the defined neighborhood $N_k(x)$ of $x$ to build the estimator of $f(x)$:
$$\hat f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$$
▶ $N_k(x)$ defines the $k$ closest points $x_i$ to $x$ in the training sample to use for the estimation
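As an illustration, a minimal sketch of this estimator in Python (the arrays x, y and the function name are illustrative, not part of the original slides):

```python
import numpy as np

def knn_average(x0, x, y, k):
    """k-nearest-neighbour estimate of f(x0): the average of the y_i
    whose x_i are the k closest sample points to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest x_i
    return y[idx].mean()
```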


One Dimensional Kernel


▶ The average changes in a discrete way, leading to a
discontinuous fˆ(x)


One Dimensional Kernel

▶ Problem: the $k$-nearest-neighbor estimator gives the same weight to all the points in the neighborhood used for the estimation of $\hat f(x)$
▶ Alternative: make the weights attributed to the points used in the estimation decrease smoothly with their distance from the point of estimation interest


Nadaraya-Watson Kernel

The Nadaraya-Watson kernel leads to a weighted average estimation
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_h(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_h(x_0, x_i)} \quad \text{if } \sum_{i=1}^{N} K_h(x_0, x_i) \neq 0,$$
and $\hat f(x_0) = 0$ if $\sum_{i=1}^{N} K_h(x_0, x_i) = 0$.
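A minimal sketch of the Nadaraya-Watson estimator in Python; the Gaussian kernel used here is one possible choice, and the names are illustrative:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Weighted average of the y_i with weights K_h(x0, x_i)."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # Gaussian kernel K((x0 - x_i)/h)
    s = k.sum()
    return (k @ y) / s if s != 0 else 0.0    # f_hat(x0) = 0 when all weights vanish
```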


Kernel Function

▶ The kernel function plays a central role in the fitting and is defined by
$$K_h(x_0, x) = K\!\left(\frac{x - x_0}{h}\right)$$
▶ $K(x)$ needs to be smooth, maximal at $0$, symmetric around $0$ and decreasing with respect to $|x|$
▶ Having
$$\int K(u)\, du = 1 \qquad \text{and} \qquad \int u\, K(u)\, du = 0$$
is also common


Kernel Functions

Common kernel functions are

Name           K(x)                           Support
Epanechnikov   (3/4)(1 − x²) I{|x|<1}         [−1, 1]
Gaussian       (2π)^(−1/2) exp{−x²/2}         (−∞, ∞)
Biweight       (15/16)(1 − x²)² I{|x|<1}      [−1, 1]
Triweight      (35/32)(1 − x²)³ I{|x|<1}      [−1, 1]
Uniform        (1/2) I{|x|<1}                 [−1, 1]
Tricube        (70/81)(1 − |x|³)³ I{|x|<1}    [−1, 1]
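These kernels translate directly to code; a sketch, with the indicator I{|x|<1} implemented by boolean masking:

```python
import numpy as np

# Kernel functions K(x) from the table; each evaluates to 0 outside its support
kernels = {
    "epanechnikov": lambda x: 3/4 * (1 - x**2) * (np.abs(x) < 1),
    "gaussian":     lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi),
    "biweight":     lambda x: 15/16 * (1 - x**2)**2 * (np.abs(x) < 1),
    "triweight":    lambda x: 35/32 * (1 - x**2)**3 * (np.abs(x) < 1),
    "uniform":      lambda x: 1/2 * (np.abs(x) < 1),
    "tricube":      lambda x: 70/81 * (1 - np.abs(x)**3)**3 * (np.abs(x) < 1),
}
```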


Triweight kernel Kh (x) for various choices of h

[Figure: the triweight kernel K_h(x) plotted for h = 0.5, 1 and 2 over x ∈ (−2, 2).]


Kernel Functions

▶ The Gaussian kernel is a non-compact kernel, where $\sigma^2$ plays the role of the window size


Nadaraya-Watson Kernel
Epanechnikov quadratic kernel application example

The contribution of each point (its weight in the estimation) is initially zero and then increases smoothly as the approximation evolves.

Example

▶ The nearest-neighbor average corresponds to the uniform kernel
$$K(x) = \frac{1}{2}\, I\{|x| < 1\}$$
▶ In this case $\hat f(x)$ is the average of the $y_i$'s such that $x_i \in [x - h, x + h]$, i.e. $|x_i - x| \le h$


Example
There are two extreme cases:
▶ $h \to \infty$: $\hat f$ is independent of $x$ (high bias case),
$$\hat f(x) \to \frac{1}{N} \sum_{i=1}^{N} y_i = \text{const.}$$
▶ $h \to 0$, $h < \min_{i \neq j} |x_i - x_j|$ (high variance case):
$$\hat f(x_i) = y_i \quad \text{and} \quad \hat f(x) = 0 \text{ for } x \neq x_i$$
▶ The estimator reproduces the data $y_i$ at $x_i$ and is zero at all other points.
▶ The optimal $h$ lies between these two extremes and provides the appropriate compromise between bias and variance.


Linear Estimator

▶ The Nadaraya-Watson estimator can be written as a weighted sum
$$\hat f(x) = \sum_{i=1}^{N} y_i\, W_i(x)$$
▶ where the weights
$$W_i(x) = \frac{K_h(x, x_i)}{\sum_{j=1}^{N} K_h(x, x_j)}\; I\!\left( \sum_{j=1}^{N} K_h(x, x_j) \neq 0 \right)$$
▶ are independent of the responses $y_i$


Justification or Interpretation

Let $(x, y)$ be a pair of random variables in $\mathbb{R}^2$ with joint density $p(x, y)$ and marginal density $p(x) = \int p(x, y)\, dy > 0$. Then
$$f(x) = E[y \mid x] = \int y\, p(y \mid x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy = \frac{\int y\, p(x, y)\, dy}{\int p(x, y)\, dy}$$
If we replace $p(x, y)$ by its estimator $\hat p(x, y)$ and $p(x)$ by $\hat p(x)$, we recover $\hat f(x)$.
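A sketch of this substitution, assuming a product-kernel density estimate built from a kernel with $\int K(u)\,du = 1$ and $\int u\,K(u)\,du = 0$: with
$$\hat p(x, y) = \frac{1}{Nh^2} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) K\!\left(\frac{y - y_i}{h}\right),$$
the change of variables $u = (y - y_i)/h$ gives $\int y\, K\!\left(\frac{y - y_i}{h}\right) dy = h\, y_i$, so
$$\int y\, \hat p(x, y)\, dy = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) y_i, \qquad \int \hat p(x, y)\, dy = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) = \hat p(x),$$
and the ratio of the two is exactly the Nadaraya-Watson estimator $\hat f(x)$.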


Justification or Interpretation

If the density $p(x)$ is assumed uniform, then
$$\hat f(x) = \frac{1}{Nh} \sum_{i=1}^{N} y_i\, K\!\left(\frac{x - x_i}{h}\right)$$


Properties
▶ The width $h$ of the local neighborhood used plays the role of the smoothing parameter
▶ Large values of $h$ imply lower variance (more samples are used for the estimation) but higher bias (the function is assumed approximately constant within the window)
▶ For $k$-nearest neighborhoods, the neighborhood size $k$ plays the role of the window size $h$, with $h_k(x_i) = |x_i - x_{[k]}|$ where $x_{[k]}$ is the $k$-th closest $x_j$ to $x_i$
▶ An adaptive width $h(x)$ can also be used instead of a constant width $h(x) = h$, with kernel
$$K_h(x_i, x) = K\!\left(\frac{|x - x_i|}{h(x_i)}\right)$$


Local Polynomial Regression

▶ The kernel fit can still have problems due to the asymmetry of the kernel at the boundaries
▶ or in the interior, if the $x$ values are not equally spaced
▶ Locally weighted linear regression provides an alternative local approximation



Local Linear Regression

▶ It is obtained by solving a weighted least squares criterion at each target point $x_0$
$$\min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^{N} K_h(x_0, x_i)\, \left[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \right]^2$$
▶ and the estimate at $x_0$ is given by
$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$$



Local Linear Regression

Let $B$ be the $N \times 2$ regression matrix with $i$-th row $b(x_i)^\top = (1, x_i)$, and $W(x_0)$ the $N \times N$ diagonal matrix with $i$-th diagonal element $K_h(x_0, x_i)$. Then
$$\hat f(x_0) = b(x_0)^\top \left( B^\top W(x_0) B \right)^{-1} B^\top W(x_0)\, y = \sum_{i=1}^{N} \ell_i(x_0)\, y_i$$
where the $\ell_i(x_0)$'s do not involve $y$.

Local linear regression tends to be biased in curved regions of the true function.
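A minimal sketch of this formula in Python, using a Gaussian kernel for $K_h$ (the kernel choice and names are illustrative):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear fit at x0: b(x0)^T (B^T W B)^{-1} B^T W y."""
    B = np.column_stack([np.ones_like(x), x])   # rows b(x_i)^T = (1, x_i)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)      # diagonal of W(x0): K_h(x0, x_i)
    BtW = B.T * w                               # B^T W(x0) without forming the N x N matrix
    alpha, beta = np.linalg.solve(BtW @ B, BtW @ y)
    return alpha + beta * x0                    # f_hat(x0) = alpha_hat(x0) + beta_hat(x0) x0
```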


An example of local linear regression


[Figure: scatter plot of the example data, x ∈ (−3, 3), y roughly in (0, 0.4).]


An example of local linear regression


[Figure: the example data with the local linear regression fit for h = 1.]


Overfitting and underfitting

The choice of bandwidth $h$ directly controls the bias-variance tradeoff. Choosing $h$ too small will tend to give overfitted models (high variance, low bias), while $h$ too large will give underfitted models (high bias, low variance).
In practice we can employ methods like cross-validation, or plug-in estimates, to decide on an appropriate value of $h$.
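A sketch of bandwidth selection by leave-one-out cross-validation, assuming a pointwise fitter such as the local_linear function sketched above (the candidate grid is illustrative):

```python
import numpy as np

def loocv_bandwidth(x, y, candidates, fit):
    """Return the h in `candidates` minimising the leave-one-out squared error."""
    scores = []
    for h in candidates:
        errs = [(y[i] - fit(x[i], np.delete(x, i), np.delete(y, i), h)) ** 2
                for i in range(len(x))]
        scores.append(np.mean(errs))
    return candidates[int(np.argmin(scores))]

# e.g. h_opt = loocv_bandwidth(x, y, np.linspace(0.2, 5.0, 25), fit=local_linear)
```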


Underfitted local linear regression


[Figure: the example data with an underfitted local linear regression fit, h = 5.]


Overfitted local linear regression


[Figure: the example data with an overfitted local linear regression fit, h = 0.4.]


Adaptive choices of h

A common alternative to using a fixed $h$ is to vary it with respect to $x$. The most common example of this is the nearest neighbour bandwidth, where $h_x$ is chosen so that the window always contains a fixed proportion $t$ of the data,
$$t = \frac{\sum_i I\{|x - x_i| < h_x\}}{n}.$$
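A sketch of the nearest-neighbour bandwidth: $h_x$ is taken as the distance to the $\lceil tn \rceil$-th closest observation, so that roughly a proportion $t$ of the data lies in the window (names are illustrative):

```python
import numpy as np

def nn_bandwidth(x0, x, t):
    """Bandwidth h_x such that roughly a proportion t of the x_i satisfy |x0 - x_i| < h_x."""
    k = max(1, int(np.ceil(t * len(x))))    # number of points the window should contain
    return np.sort(np.abs(x - x0))[k - 1]   # distance to the k-th nearest x_i
```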


Local Polynomial Regression

▶ Local polynomial regression is generally able to correct this bias
▶ In this case we fit a local polynomial
$$\min_{\alpha(x_0), \beta_j(x_0)} \sum_{i=1}^{N} K_h(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j \right]^2$$
▶ with fit
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j$$
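A sketch of the degree-$d$ fit, extending the local linear code above by building the basis $(1, x_i, \ldots, x_i^d)$ with np.vander (names and kernel choice are illustrative):

```python
import numpy as np

def local_poly(x0, x, y, h, d=2):
    """Local polynomial fit of degree d at x0."""
    B = np.vander(x, N=d + 1, increasing=True)   # rows (1, x_i, ..., x_i^d)
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)       # K_h(x0, x_i), Gaussian kernel
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y)     # (alpha_hat, beta_1_hat, ..., beta_d_hat)
    return sum(c * x0**j for j, c in enumerate(coef))   # f_hat(x0)
```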


Local Polynomial Regression

▶ The reduction in bias generates an increase in variance


▶ The bias-variance tradeoff is controlled by the polynomial
degree d


Local Constant Regression

A special kernel regression smoother is the local constant regression smoother, which minimises
$$\sum_{i=1}^{n} (Y_i - \alpha_x)^2\, K_h(x - x_i).$$
The minimiser can be found to be
$$\hat\alpha_x = \left[ \sum_{i=1}^{n} K_h(x - x_i) \right]^{-1} \sum_{i=1}^{n} Y_i\, K_h(x - x_i)$$


The aim

▶ We are interested in a flexible model to predict $y$ using multiple predictors, say $x_1, \cdots, x_p$.
▶ LS: $f(x_1, \cdots, x_p) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p$
▶ GAM: $f(x_1, \cdots, x_p) = \alpha + f_1(x_1) + \cdots + f_p(x_p)$, where each $f_j(x_j)$ is a smoothing spline function of $x_j$.
▶ A GAM is an additive model of many functions, each depending on a single predictor. (You can create a ‘new’ predictor $x_k x_l$ and add a function $f_{p+1}(x_k x_l)$, but this quickly leads to an over-fit model.)


GAMs

Each $f_j(x_j)$ is a building block and can take many forms. For example
▶ Smoothing spline (the most popular)

▶ Natural spline

▶ Local regression

▶ Polynomial regression


GAMs

In the regression context, the model is fit by minimising
$$\sum_{i=1}^{n} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2\, dt_j,$$
which involves parameter estimation.

In the fitting procedure, each $\lambda_j$ is a constant that controls the degree of smoothing (determined by the degrees of freedom specified for $f_j$).
Each $f_j$ is then estimated by estimating the associated regression coefficient parameters, as for the smoothing spline fit.


GAMs
Estimating all the $f_j$'s simultaneously is difficult. The backfitting algorithm is an iterative solution to this, fitting each $f_j$ in turn:
1. Initialize $\hat\alpha = \bar y$ and all $\hat f_j = 0$.
2. For $j = 1, \cdots, p$:
$$\hat f_j \leftarrow \text{Smooth fit using } \{x_{ij}\}_{i=1}^{n} \text{ of the partial residuals } \left\{ y_i - \hat\alpha - \sum_{k \neq j} \hat f_k(x_{ik}) \right\}_{i=1}^{n}$$
$$\hat f_j \leftarrow \hat f_j - \frac{1}{n} \sum_{i=1}^{n} \hat f_j(x_{ij}) \quad \text{(so that } \sum_{i=1}^{n} \hat f_j(x_{ij}) = 0 \text{ is assured)}$$
3. Repeat step 2 until convergence (each $\hat f_j$ changes by less than some threshold).
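A minimal sketch of backfitting in Python; the one-dimensional smoother is left abstract (a hypothetical smooth(x, r) returning fitted values, e.g. a smoothing spline or the local linear fit sketched earlier):

```python
import numpy as np

def backfit(X, y, smooth, n_iter=20, tol=1e-6):
    """Backfitting for y = alpha + sum_j f_j(x_j) + eps.
    X: (n, p) predictor matrix; smooth(x, r): 1-D smoother returning fitted values."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))                              # F[:, j] holds f_j evaluated at x_ij
    for _ in range(n_iter):
        F_old = F.copy()
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals, leaving out f_j
            F[:, j] = smooth(X[:, j], r)              # smooth residuals against x_j
            F[:, j] -= F[:, j].mean()                 # centre so that sum_i f_j(x_ij) = 0
        if np.max(np.abs(F - F_old)) < tol:           # stop when each f_j changes little
            break
    return alpha, F
```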