L4 Kernel Regression
Karim Seghouane
School of Mathematics & Statistics
The University of Melbourne
Outline
§5.1 Introduction
Introduction
Kernel regression has been around since the 1960s and is one of
the most popular methods for “nonparametrically” fitting a model
to data. We work here in the regression context, but there exist
extensions to classification models via logistic regression.
$$y_i = f(x_i) + \epsilon_i, \qquad E(\epsilon_i) = 0$$
▶ and we are interested in estimating the regression function $f(x) = E(y \mid x)$
▶ using a training set $(x_i, y_i)$, $i = 1, \ldots, n$.
▶ The relationship between $x$ and $y$ is more likely to be nonlinear than linear
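As a concrete illustration of this setup, here is a minimal Python sketch that simulates data from $y_i = f(x_i) + \epsilon_i$; the particular function and noise level are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical nonlinear regression function f(x) = E(y | x)
    return np.sin(2 * x) / (1 + x**2)

n = 100
x = rng.uniform(-3, 3, size=n)           # design points x_i
eps = rng.normal(scale=0.1, size=n)      # noise with E(eps_i) = 0
y = f_true(x) + eps                      # observed responses y_i
```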
Nadaraya-Watson Kernel
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_h(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_h(x_0, x_i)}$$
and $\hat f(x_0) = 0$ if $\sum_{i=1}^{N} K_h(x_0, x_i) = 0$
Kernel Function
▶ is also common
Kernel Functions
[Figure: kernel function $K_h(x)$ plotted for bandwidths $h = 0.5$, $h = 1$, and $h = 2$.]
Nadaraya-Watson Kernel
Epanechnikov quadratic kernel application example
Example
$$K(x) = \frac{1}{2}\, I\{|x| < 1\}$$
▶ In this case $\hat f(x)$ = average of the $y_i$'s such that $x_i \in [x - h, x + h]$, or equivalently $|x_i - x| \le h$
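A minimal sketch of this boxcar-kernel local average (names and the zero convention follow the slides; the rest is illustrative):

```python
import numpy as np

def boxcar_fit(x0, x, y, h):
    """Average of the y_i whose x_i lie within h of x0 (uniform kernel)."""
    in_window = np.abs(x - x0) <= h
    if not in_window.any():
        return 0.0  # convention: fhat(x0) = 0 when no points fall in the window
    return y[in_window].mean()
```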
Example
There are two extreme cases
▶ h → ∞, fˆ is independent of x (high bias case)
$$\hat f(x) \to \frac{1}{N} \sum_{i=1}^{N} y_i = \text{const.}$$
▶ $h \to 0$, $h < \min_{i \ne j} |x_i - x_j|$ (high variance case): the estimate interpolates the data, $\hat f(x_i) = y_i$
Linear Estimator
$$\hat f(x) = \sum_{i=1}^{N} y_i\, W_i(x)$$
▶ where the weights
$$W_i(x) = \frac{K_h(x, x_i)}{\sum_{i=1}^{N} K_h(x, x_i)}\; I\!\left(\sum_{i=1}^{N} K_h(x, x_i) \ne 0\right)$$
▶ are independent of the responses yi
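A minimal Python sketch of the Nadaraya-Watson estimator written as this linear smoother; the Gaussian kernel and the synthetic data are assumptions for illustration, not prescribed by the slides:

```python
import numpy as np

def gaussian_kernel(u):
    # One common kernel choice; any valid kernel K works here.
    return np.exp(-0.5 * u**2)

def nw_fit(x0, x, y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate fhat(x0) = sum_i W_i(x0) y_i."""
    k = kernel((x0 - x) / h)
    s = k.sum()
    if s == 0:
        return 0.0               # convention from the slides: fhat(x0) = 0
    w = k / s                    # weights W_i(x0), independent of the y_i
    return np.dot(w, y)

# Example usage on synthetic data
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 80)
y = np.sin(x) + rng.normal(scale=0.2, size=80)
grid = np.linspace(-3, 3, 7)
fhat = np.array([nw_fit(x0, x, y, h=0.5) for x0 in grid])
```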
Justification or Interpretation
Properties
▶ The width of the used local neighborhood h plays the role of
the smoothing parameter
▶ Large values of h imply lower variance (more samples are used for the estimate) but higher bias (the function is assumed constant within the window)
▶ For k-nearest neighbors, the neighborhood size k plays the role of the window width h, with $h_k(x_i) = |x_i - x_k|$ where $x_k$ is the $k$th closest $x_j$ to $x_i$
▶ An adaptive width h(x) can also be considered instead of a constant width h(x) = h, and the kernel is
$$K_h(x_i, x) = K\!\left(\frac{|x - x_i|}{h(x_i)}\right)$$
▶ The kernel fit can still have problems due to the asymmetry at the boundaries
▶ or in the interior if the x values are not equally spaced
▶ Locally weighted linear regression provides an alternative local approximation (see the sketch after the formula below)
$$\hat f(x_0) = b(x_0)^\top \left( B^\top W(x_0) B \right)^{-1} B^\top W(x_0)\, y = \sum_{i=1}^{N} \ell_i(x_0)\, y_i$$
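A minimal sketch of this locally weighted linear fit, assuming $b(x_0) = (1, x_0)^\top$, $B$ the $N \times 2$ matrix with rows $(1, x_i)$, and $W(x_0)$ the diagonal matrix of kernel weights; the Gaussian kernel is an illustrative choice:

```python
import numpy as np

def local_linear_fit(x0, x, y, h):
    """Locally weighted linear regression estimate at x0."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)     # kernel weights K_h(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])  # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)                             # W(x0)
    b0 = np.array([1.0, x0])                   # b(x0)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return b0 @ beta                           # = b(x0)^T (B^T W B)^{-1} B^T W y
```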
[Figure: scatter plot of (x, y) data (x from −3 to 3, y from 0 to about 0.4) with kernel regression fits for bandwidths h = 1, h = 5 (underfit), and h = 0.4 (overfit).]
Adaptive choices of h
$$\min_{\alpha(x_0),\, \beta_j(x_0)} \; \sum_{i=1}^{N} K_h(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{j} \right]^2$$
▶ with fit
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^{j}$$
▶ For $d = 0$ (a locally constant fit), the solution can be found to be
$$\hat\alpha_x = \left[ \sum_{i=1}^{n} K_h(x - x_i) \right]^{-1} \sum_{i=1}^{n} Y_i\, K_h(x - x_i)$$
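A minimal sketch of this local polynomial fit as a kernel-weighted least squares problem; the degree d, the Gaussian kernel, and the variable names are illustrative assumptions:

```python
import numpy as np

def local_poly_fit(x0, x, y, h, d=2):
    """Minimize sum_i K_h(x0, x_i) * (y_i - alpha - sum_j beta_j x_i^j)^2
    and return fhat(x0) = alpha_hat + sum_j beta_j_hat * x0^j."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # K_h(x0, x_i)
    B = np.vander(x, N=d + 1, increasing=True)    # columns 1, x_i, ..., x_i^d
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)   # (alpha, beta_1, ..., beta_d)
    return np.vander(np.atleast_1d(x0), N=d + 1, increasing=True)[0] @ coef
```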
The aim
▶ LS: $f(x_1, \cdots, x_p) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p$
▶ GAM: $f(x_1, \cdots, x_p) = \alpha + f_1(x_1) + \cdots + f_p(x_p)$
GAMs
fi (xi ) is a building block and can take many forms. For example
▶ Smoothing spline (the most popular)
▶ Natural spline
▶ Local regression
▶ Polynomial regression
GAMs
Estimating all the $f_j$'s simultaneously is difficult. The backfitting
algorithm is an iterative solution: it fits each $f_j$ in turn, cycling
through them repeatedly:
1. Initialize α̂ = ȳ , all fˆj = 0.
2. For j = 1, · · · , p,
$$\hat f_j \leftarrow \text{Smooth fit using } \{x_{ij}\}_{i=1}^{n} \text{ for } \left\{ y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik}) \right\}_{i=1}^{n}$$
$$\hat f_j \leftarrow \hat f_j - \frac{1}{n} \sum_{i=1}^{n} \hat f_j(x_{ij}) \quad \text{(so } \textstyle\sum_{i=1}^{n} \hat f_j(x_{ij}) = 0 \text{ is assured)}$$
3. Repeat step 2 until the $\hat f_j$'s stabilize.
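A minimal sketch of backfitting where each building block is, for illustration, a Nadaraya-Watson kernel smoother; the smoother choice, bandwidth, and fixed number of cycles are assumptions, not prescribed by the slides (smoothing splines are the more common choice in practice):

```python
import numpy as np

def kernel_smooth(x, r, h=0.5):
    """Nadaraya-Watson smooth of the partial residuals r on x,
    returning fitted values at the observed x (plays the role of f_j)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (K @ r) / K.sum(axis=1)

def backfit(X, y, n_iter=20, h=0.5):
    """Backfitting for an additive model y = alpha + sum_j f_j(x_j) + eps."""
    n, p = X.shape
    alpha = y.mean()                      # step 1: alpha_hat = ybar
    F = np.zeros((n, p))                  # step 1: all f_j = 0
    for _ in range(n_iter):               # step 3: repeat until stable
        for j in range(p):                # step 2: cycle over predictors
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals
            fj = kernel_smooth(X[:, j], r, h)
            F[:, j] = fj - fj.mean()      # center so sum_i f_j(x_ij) = 0
    return alpha, F
```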