Lecture 36

The document discusses support vector machines (SVM) for classification and regression, highlighting the importance of kernel functions and support vector expansions. It explains how kernels facilitate the learning of nonlinear models and generalize distance measures for assessing similarity in pattern recognition. Additionally, it introduces concepts like positive definite kernels and the Reproducing Kernel Hilbert Space (RKHS), emphasizing their role in formulating classifiers and estimating class conditional densities.

• We have discussed the support vector machines both for classification and regression.
• We have also seen some generalizations of the basic SVM idea.
• In the SVM method, there are two important ingredients.
• One is the kernel function.
• The second is the ‘support vector’ expansion – the final model is expressed as a (‘sparse’) linear combination of some of the data vectors.


• Kernel functions allow us to learn nonlinear models using essentially linear techniques.
• Kernels are a good way to capture ‘similarity’ and are useful in general.
• The support vector expansion is also a general property of Kernel based methods.
• We briefly look at this general view of Kernels next.


• Often, in pattern recognition, we use distance between pattern vectors as a means to assess similarity (e.g. nearest neighbour classifier).
• Kernels allow us to generalize such notions of distance or similarity between patterns.
• We illustrate such uses of kernels with a simple example.


• Consider a 2-class classification problem with training data {(Xi, yi), i = 1, · · · , n}, Xi ∈ ℜ^m, yi ∈ {+1, −1}.
• Suppose we implement a nearest neighbour classifier, by computing distance of a new pattern to a set of prototypes.
• Keeping with the viewpoint of SVM, suppose we want to transform the patterns into a new space using φ and find the distances there.


• Suppose we use two prototypes given by

  C+ = (1/n+) Σ_{i: yi = +1} φ(Xi)   and   C− = (1/n−) Σ_{i: yi = −1} φ(Xi)

  where n+ is the number of examples in class +1 and n− is that in class −1.
• The prototypes are ‘centers’ of the two classes.
• Given a new X, we would put it in class +1 if

  ||φ(X) − C+||² < ||φ(X) − C−||²


• We can implement this using kernels. We have

  ||φ(X) − C+||² = φ(X)^T φ(X) − 2 φ(X)^T C+ + C+^T C+

• Thus we would put X in class +1 if

  φ(X)^T C+ − φ(X)^T C− + (1/2) (C−^T C− − C+^T C+) > 0

• All these inner products are now easily done using kernel functions.


• By the definition of C+, we get

  φ(X)^T C+ = (1/n+) Σ_{i: yi = +1} φ(X)^T φ(Xi) = (1/n+) Σ_{i: yi = +1} K(Xi, X)

• Similarly we get

  C+^T C+ = (1/n+²) Σ_{i,j: yi = yj = +1} K(Xi, Xj)
• Thus, our classifier is sgn(h(X)) where

  h(X) = (1/n+) Σ_{i: yi = +1} K(Xi, X) − (1/n−) Σ_{i: yi = −1} K(Xi, X) + b

  where

  b = (1/2) [ (1/n−²) Σ_{yi = yj = −1} K(Xi, Xj) − (1/n+²) Σ_{yi = yj = +1} K(Xi, Xj) ]

  (A small code sketch of this classifier follows below.)
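As a concrete illustration, here is a minimal NumPy sketch of the kernel-based nearest-mean classifier above. The Gaussian (RBF) kernel and the helper names (rbf_kernel, kernel_nearest_mean_classifier) are illustrative assumptions, not something prescribed by the lecture.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel matrix between the rows of X1 and the rows of X2
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_nearest_mean_classifier(X_train, y_train, X_test, kernel=rbf_kernel):
    # h(X) = (1/n+) sum_{yi=+1} K(Xi, X) - (1/n-) sum_{yi=-1} K(Xi, X) + b
    pos, neg = (y_train == +1), (y_train == -1)
    n_pos, n_neg = pos.sum(), neg.sum()
    K_pos = kernel(X_test, X_train[pos])   # K(Xi, X) over class +1 examples
    K_neg = kernel(X_test, X_train[neg])   # K(Xi, X) over class -1 examples
    # b = 0.5 * ( (1/n-^2) sum_- K(Xi, Xj) - (1/n+^2) sum_+ K(Xi, Xj) )
    b = 0.5 * (kernel(X_train[neg], X_train[neg]).sum() / n_neg**2
               - kernel(X_train[pos], X_train[pos]).sum() / n_pos**2)
    h = K_pos.mean(axis=1) - K_neg.mean(axis=1) + b
    return np.sign(h)                      # the classifier sgn(h(X))

# usage sketch: X_train is (n, m), y_train holds +1/-1 labels, X_test is (k, m)
# y_pred = kernel_nearest_mean_classifier(X_train, y_train, X_test)
```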


• Thus we can implement such nearest neighbour classifiers by implicitly transforming the feature space and using the kernel function for the inner product in the transformed space.
• The kernel function allows us to formulate the right kind of similarity measure in the original space.


• Define

  P+(X) = (1/n+) Σ_{i: yi = +1} K(Xi, X),   P−(X) = (1/n−) Σ_{i: yi = −1} K(Xi, X)

• With a proper normalization, these are essentially non-parametric estimators for the class conditional densities – the kernel density estimates.


• We could, for example, use a Gaussian kernel, and then these are the nonparametric density estimators we studied earlier.
• Thus, our nearest neighbour classifier is essentially a Bayes classifier using nonparametric estimators for class conditional densities. (A short sketch of this connection follows.)
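For instance, with a Gaussian kernel the properly normalized versions of P+ and P− are Parzen-window density estimates. The sketch below (with an assumed bandwidth h and equal class priors) makes the connection to a Bayes-style decision explicit; it is an illustration, not the lecture's own code.

```python
import numpy as np

def gaussian_kde(X_class, x, h=1.0):
    # Parzen-window estimate of the class conditional density at x,
    # using a Gaussian kernel with bandwidth h in d dimensions
    n, d = X_class.shape
    norm = 1.0 / (n * (2 * np.pi * h**2) ** (d / 2))
    sq = ((X_class - x) ** 2).sum(axis=1)
    return norm * np.exp(-sq / (2 * h**2)).sum()

def bayes_classify(x, X_pos, X_neg, h=1.0):
    # With equal priors, assign x to the class with the larger estimated density
    return +1 if gaussian_kde(X_pos, x, h) > gaussian_kde(X_neg, x, h) else -1
```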


• We next look at positive definite kernels in some detail.
• We show that for any such kernel there is a vector space with an inner product, H, such that the kernel is an inner product in that space.
• This is called the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel.
• We also show that if we are doing regularized empirical risk minimization on this space, then the final solution would have the ‘support vector expansion’ form.
Positive definite kernels

• Let X be the original feature space.
• Let K : X × X → ℜ be a positive definite kernel.
• Given any n points, X1, · · · , Xn ∈ X, the n × n matrix with (i, j) element as K(Xi, Xj) is called the Gram matrix of K.
• Recall that K is positive definite if the Gram matrix is positive semi-definite for all n and all X1, · · · , Xn.


• Positive definiteness of the kernel means, for all n,

  Σ_{i,j=1}^{n} ci cj K(Xi, Xj) ≥ 0,   ∀ci ∈ ℜ, ∀Xi ∈ X

• Taking n = 1, we get K(X, X) ≥ 0, ∀X ∈ X.
• Taking n = 2 and remembering that K is symmetric, we get

  K(X1, X2)² ≤ K(X1, X1) K(X2, X2),   ∀X1, X2 ∈ X

  Thus, K satisfies the Cauchy-Schwartz inequality. (A quick numerical check of these properties is given below.)
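A quick numerical sanity check of these properties is easy to write. The sketch below is an assumed example (RBF kernel, random points): it builds a Gram matrix, confirms that its smallest eigenvalue is non-negative up to round-off, and checks the Cauchy-Schwartz inequality for a pair of points.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # Gaussian RBF kernel between two vectors; positive definite for gamma > 0
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 arbitrary points in R^3
G = np.array([[rbf(a, b) for b in X] for a in X])  # Gram matrix G[i, j] = K(Xi, Xj)

print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())   # >= 0 up to round-off

# The n = 2 consequence (Cauchy-Schwartz) for one pair of points
x1, x2 = X[0], X[1]
assert rbf(x1, x2) ** 2 <= rbf(x1, x1) * rbf(x2, x2) + 1e-12
```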
• Suppose K(X, X′) = φ(X)^T φ(X′).
• Then K is a positive definite kernel:

  Σ_{i,j} ci cj φ(Xi)^T φ(Xj) = (Σ_i ci φ(Xi))^T (Σ_j cj φ(Xj)) = || Σ_i ci φ(Xi) ||² ≥ 0

• Thus, if K satisfies the Mercer theorem, then it is a positive definite kernel. (A small worked example of such a feature map is given below.)
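To make this direction concrete, here is a small illustrative example (not from the lecture): the homogeneous quadratic kernel K(X, X′) = (X^T X′)² on ℜ² has the explicit feature map φ(X) = (x1², √2 x1 x2, x2²), so the kernel is an ordinary dot product in the transformed space.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous quadratic kernel on R^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, y):
    # The same kernel computed directly in the original space
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
assert np.isclose(K(x, y), phi(x) @ phi(y))   # K(x, y) = phi(x)^T phi(y)
```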


• We now show that all positive definite kernels are also inner products on some appropriate space.
• Given a kernel K, we will construct a space endowed with an inner product and show how any positive definite kernel is essentially implementing an inner product in this space.
• This space is called the Reproducing Kernel Hilbert Space associated with the kernel, K.


• Let ℜ^X be the set of all real-valued functions on X.
• Let K be a positive definite kernel.
• For any X ∈ X, let K(· , X) ∈ ℜ^X denote the function that maps X′ ∈ X to K(X′, X) ∈ ℜ.
• Consider the set of functions H1 = {K(· , X) : X ∈ X}.
• Let H be the set of all functions that are finite linear combinations of functions in H1.


• Any f(·) ∈ H can be written as

  f(·) = Σ_{i=1}^{n} αi K(· , Xi),   for some n, Xi ∈ X, αi ∈ ℜ

• It is easy to see that if f, g ∈ H then f + g ∈ H and αf ∈ H for α ∈ ℜ.
• Thus, H is a vector space.
• We now define an inner product on H.


• Let f, g ∈ H with

  f(·) = Σ_{i=1}^{n} αi K(· , Xi),   g(·) = Σ_{j=1}^{n′} βj K(· , Xj′)

• We define the inner product as

  <f, g> = Σ_{i=1}^{n} Σ_{j=1}^{n′} αi βj K(Xi, Xj′)

• We first show this is well defined.
• We note that

  <f, g> = Σ_{i=1}^{n} Σ_{j=1}^{n′} αi βj K(Xi, Xj′) = Σ_{i=1}^{n} αi g(Xi)

• Similarly we have

  <f, g> = Σ_{j=1}^{n′} βj Σ_{i=1}^{n} αi K(Xi, Xj′) = Σ_{j=1}^{n′} βj f(Xj′)

• Thus our inner product does not depend on the αi and βj and hence is well defined. (A short numerical illustration of this inner product follows.)
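As a small illustration (an assumed sketch, not from the lecture), a function in H can be stored as its coefficients and centres, and the inner product can be computed either from the defining double sum or as Σ_i αi g(Xi); the two agree, as shown below.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def evaluate(coeffs, centres, x):
    # f(x) = sum_i alpha_i K(x, X_i)
    return sum(a * rbf(x, c) for a, c in zip(coeffs, centres))

def inner_product(f_coeffs, f_centres, g_coeffs, g_centres):
    # <f, g> = sum_i sum_j alpha_i beta_j K(X_i, X'_j)
    return sum(a * b * rbf(ci, cj)
               for a, ci in zip(f_coeffs, f_centres)
               for b, cj in zip(g_coeffs, g_centres))

rng = np.random.default_rng(1)
f_alpha, f_centres = [0.7, -1.2], rng.normal(size=(2, 3))
g_beta, g_centres = [2.0, 0.5, -0.3], rng.normal(size=(3, 3))

ip1 = inner_product(f_alpha, f_centres, g_beta, g_centres)
ip2 = sum(a * evaluate(g_beta, g_centres, c) for a, c in zip(f_alpha, f_centres))
assert np.isclose(ip1, ip2)   # the two expressions for <f, g> agree
```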
• Now we have to show that this is actually an inner product.
• It is symmetric by definition.
• It is easily verified that it is bilinear:

  <f, g1 + g2> = <f, g1> + <f, g2>
  <f1 + f2, g> = <f1, g> + <f2, g>

• It is also easy to see that <cf, g> = c <f, g>.
• We have, by the positive definiteness of K,

  <f, f> = Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj K(Xi, Xj) ≥ 0


• Finally, we have to show <f, f> = 0 ⇒ f = 0.
• Let f1, · · · , fp ∈ H and let γ1, · · · , γp ∈ ℜ. Let g1 = Σ_{i=1}^{p} γi fi ∈ H.
• Now we get

  Σ_{i,j=1}^{p} γi γj <fi, fj> = < Σ_{i=1}^{p} γi fi , Σ_{j=1}^{p} γj fj > = <g1, g1> ≥ 0


• Note that <· , ·> is a symmetric function that maps H × H to ℜ.
• Thus what we have shown is that <· , ·> is a positive definite kernel on H.
• Since positive definite kernels satisfy the Cauchy-Schwartz inequality, for any f ∈ H, we have

  | <K(·, X), f> |² ≤ <K(·, X), K(·, X)> <f, f>


• Recall that for f, g ∈ H with

  f(·) = Σ_{i=1}^{n} αi K(· , Xi),   g(·) = Σ_{j=1}^{n′} βj K(· , Xj′)

  the inner product is

  <f, g> = Σ_{i=1}^{n} Σ_{j=1}^{n′} αi βj K(Xi, Xj′)


• Hence we have

  <K(· , X) , K(· , X′)> = K(X, X′)   and   <K(· , X) , f> = f(X)

• This is the reproducing kernel property. Now, ∀X,

  |f(X)|² = | <K(·, X), f> |² ≤ K(X, X) <f, f>

• This shows <f, f> = 0 ⇒ f = 0.
• This shows that what we defined is indeed an inner product. (A numerical check of the reproducing property is given below.)
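The reproducing property can be checked numerically on the finite-expansion representation used above. The sketch below is an assumed illustration (RBF kernel, arbitrary coefficients and centres): it computes <K(·, x), f> from the defining double sum and compares it with f(x).

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def inner_product(f_coeffs, f_centres, g_coeffs, g_centres):
    # <f, g> = sum_i sum_j alpha_i beta_j K(X_i, X'_j)
    return sum(a * b * rbf(ci, cj)
               for a, ci in zip(f_coeffs, f_centres)
               for b, cj in zip(g_coeffs, g_centres))

# f = 0.7 K(., X1) - 1.2 K(., X2)
alpha = [0.7, -1.2]
centres = [np.array([1.0, 0.0, 0.5]), np.array([-0.4, 2.0, 1.0])]
x = np.array([0.3, -1.0, 2.0])

lhs = inner_product([1.0], [x], alpha, centres)           # <K(., x), f>
rhs = sum(a * rbf(x, c) for a, c in zip(alpha, centres))  # f(x)
assert np.isclose(lhs, rhs)                               # reproducing property
```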


• Given any positive definite kernel, we can construct this inner product space H as explained here.
• We can complete it in the norm induced by the inner product.
• It is called the Reproducing Kernel Hilbert Space (RKHS) associated with K.
• The reproducing kernel property is

  <K(· , X) , f> = f(X),   ∀f ∈ H

• Note that elements of the RKHS are certain real-valued functions on X – essentially, a kind of generalization of linear functionals on X.
• Given this RKHS H associated with K, define φ : X → H by

  φ(X) = K(· , X)

• Now we have

  K(X, X′) = <φ(X) , φ(X′)>

• This shows that any positive definite kernel gives us the inner product in some other space, as needed.


• As a simple example, let X = ℜ^m and K(X, X′) = X^T X′.
• Now, K(· , X) is the function that takes the dot product of its argument with X.
• Let X = [x1, · · · , xm]^T. Let ei, i = 1, · · · , m, be the coordinate unit vectors. For any X′ ∈ ℜ^m,

  K(X′, X) = X^T X′ = Σ_{i=1}^{m} xi ei^T X′ = Σ_{i=1}^{m} xi K(X′, ei)

• This gives us K(· , X) = Σ_{i=1}^{m} xi K(· , ei).
• This means all functions in H are linear combinations of the K(· , ei).
• Thus any f ∈ H can be written as f = Σ_{i=1}^{m} wi K(· , ei).
• So, the RKHS is isomorphic to ℜ^m and thus it represents hyperplanes on X.
• The inner product in this H would be simply the usual dot product.
• Learning hyperplanes is searching over this H for a minimizer of empirical risk with the usual norm as a regularizer. (A small numerical check of this correspondence follows.)
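As a quick check of this correspondence (an assumed sketch, not part of the lecture), for the linear kernel the element f = Σ_i wi K(· , ei) of H acts as the linear functional X ↦ w^T X, and its squared RKHS norm is just w^T w.

```python
import numpy as np

def K(x, y):
    return x @ y                        # linear kernel on R^m

m = 4
e = np.eye(m)                           # coordinate unit vectors e_1, ..., e_m
w = np.array([0.5, -1.0, 2.0, 0.1])     # coefficients of f = sum_i w_i K(., e_i)
X = np.array([1.0, 2.0, -0.5, 3.0])

f_X = sum(w[i] * K(X, e[i]) for i in range(m))
assert np.isclose(f_X, w @ X)           # f is the hyperplane functional X -> w^T X

# <f, f> = sum_{i,j} w_i w_j K(e_i, e_j) = w^T w, the usual squared norm
norm_sq = sum(w[i] * w[j] * K(e[i], e[j]) for i in range(m) for j in range(m))
assert np.isclose(norm_sq, w @ w)
```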
• What we have shown is the following.
• Given a positive definite kernel, there is a vector space with an inner product, namely, the RKHS associated with K, and a mapping φ from X to H such that the kernel is an inner product in H.
• This RKHS represents a space of functions where we can search for the empirical risk minimizer.
• An important insight gained by this viewpoint is the Representer theorem.


Representer Theorem

• Let K be a positive definite kernel and let H be the RKHS associated with it.
• Let {(Xi, yi), i = 1, · · · , n} be the training set.
• For any function f, the empirical risk under any loss function can be represented as a function

  R̂n(f) = C( (Xi, yi, f(Xi)), i = 1, · · · , n )

• We search over H for a minimizer of the empirical risk.
• Let ||f||² = <f, f> be the norm under our inner product.
• Theorem: Let Ω : [0, ∞) → ℜ+ be a strictly monotonically increasing function. Then any minimizer g over H of the regularized risk

  C( (Xi, yi, g(Xi)), i = 1, · · · , n ) + Ω(||g||²)

  admits a representation

  g(X) = Σ_{i=1}^{n} αi K(Xi, X)


• What this means is the following.
• Functions in H are linear combinations of kernels centered at all points of X.
• Though we are searching over this space, the minimizer can always be expressed as a linear combination of kernels centered on the data points only.
• Thus, though H may be infinite dimensional, we can solve the optimization problem by searching for only n real numbers αi.
• This is essentially what we have done in solving the dual for SVM. (A standard concrete instance is sketched below.)
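For example, with the squared-error loss and Ω(t) = λt, the regularized empirical risk minimizer is kernel ridge regression, and the αi of the representer theorem have a closed form. The sketch below is an illustration under those assumptions (the RBF kernel, the regularization constant λ and the function names are all choices made here, not by the lecture).

```python
import numpy as np

def rbf_gram(X1, X2, gamma=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    # Minimize (1/n) sum_i (y_i - g(X_i))^2 + lam * ||g||_H^2 over the RKHS.
    # By the representer theorem g(.) = sum_i alpha_i K(X_i, .), and the
    # coefficients solve (K + n*lam*I) alpha = y.
    n = len(y)
    K = rbf_gram(X, X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(alpha, X_train, X_new, gamma=1.0):
    # g(X) = sum_i alpha_i K(X_i, X)
    return rbf_gram(X_new, X_train, gamma) @ alpha

# usage sketch on toy 1-D data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y, lam=0.01, gamma=0.5)
y_hat = kernel_ridge_predict(alpha, X, X)
```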
Proof of Representer Theorem

• In the vector space H, consider the span of the functions K(X1, ·), · · · , K(Xn, ·). This will be a subspace.
• Given any f ∈ H, we can decompose it into two components – one in this subspace and one in the subspace orthogonal to it.
• Let us call these two components f∥ and f⊥.


• Thus, for any f ∈ H and any X ∈ X, we have

  f(X) = f∥(X) + f⊥(X) = Σ_{i=1}^{n} αi K(Xi, X) + f⊥(X)

  where αi ∈ ℜ, f⊥ ∈ H and <f⊥, K(Xi, ·)> = 0, i = 1, · · · , n.
• Since H is the RKHS of K, the reproducing kernel property gives us

  f(X′) = <f, K(X′, ·)>


• Hence, for any of the data points Xj, j = 1, · · · , n,

  f(Xj) = <f, K(Xj, ·)>
        = <f∥ + f⊥, K(Xj, ·)>
        = Σ_{i=1}^{n} αi K(Xi, Xj) + <f⊥, K(Xj, ·)>
        = Σ_{i=1}^{n} αi K(Xi, Xj)
        = f∥(Xj)

• This is true for any f ∈ H.
• Now let g ∈ H be a minimizer of the regularized risk.
• We can write g = g∥ + g⊥ and g(Xj) = g∥(Xj) for all data vectors Xj.
• Now the empirical risk of g,

  C( (Xi, yi, g(Xi)), i = 1, · · · , n )

  would be the same as the empirical risk of g∥.
• Since g∥ and g⊥ are orthogonal,

  ||g||² = ||g∥||² + ||g⊥||² ≥ ||g∥||²
• Since Ω is strictly monotone increasing, Ω(||g||²) ≥ Ω(||g∥||²).
• This shows that the regularized risk of g∥ can only be less than or equal to that of g; in fact, since Ω is strictly increasing, it is strictly smaller unless g⊥ = 0.
• Hence any minimizer would be in the subspace spanned by the K(Xi, ·) and hence would have a representation

  g(X) = Σ_{i=1}^{n} αi K(Xi, X)

• This completes the proof of the theorem. (A small numerical illustration of the argument follows.)
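The mechanism of this proof is easy to see numerically. The sketch below (an assumed illustration with an RBF kernel) builds a component orthogonal to span{K(Xi, ·)} and checks that adding it leaves the values at the data points unchanged while only adding to the squared RKHS norm.

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))            # data points X_1, ..., X_n
K = rbf_gram(X, X)
alpha = rng.normal(size=10)             # g_par = sum_i alpha_i K(X_i, .)

# Build f_perp = K(., X0) - projection of K(., X0) onto span{K(X_i, .)}
X0 = np.array([[0.5, -0.5]])            # a point not in the training set
k0 = rbf_gram(X, X0)[:, 0]              # K(X_i, X0)
beta = np.linalg.solve(K, k0)           # projection coefficients
ortho = k0 - K @ beta                   # <f_perp, K(X_j, .)> for each j
assert np.allclose(ortho, 0)            # f_perp is orthogonal to the span

# g = g_par + f_perp agrees with g_par at every data point ...
g_par_vals = K @ alpha
g_vals = K @ alpha + ortho              # f_perp(X_j) = <f_perp, K(X_j, .)> = 0
assert np.allclose(g_par_vals, g_vals)

# ... while ||g||^2 = ||g_par||^2 + ||f_perp||^2 can only be larger
norm_perp = rbf_gram(X0, X0)[0, 0] - 2 * beta @ k0 + beta @ K @ beta
assert norm_perp >= -1e-10
```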
