• We have discussed support vector machines, both for classification and for regression.
• We have also seen some generalizations of the basic SVM idea.
• In the SVM method there are two important ingredients.
• One is the kernel function.
• The second is the ‘support vector’ expansion – the final model is expressed as a (‘sparse’) linear combination of some of the data vectors.
• Kernel functions allow us to learn nonlinear models using essentially linear techniques.
• Kernels are a good way to capture ‘similarity’ and are useful in general.
• The support vector expansion is also a general property of kernel-based methods.
• We briefly look at this general view of kernels next.
• Often, in pattern recognition, we use the distance between pattern vectors as a means to assess similarity (e.g., the nearest neighbour classifier).
• Kernels allow us to generalize such notions of distance or similarity between patterns.
• We illustrate such uses of kernels with a simple example.
• Consider a 2-class classification problem with training data
  {(X_i, y_i), i = 1, …, n},  X_i ∈ ℜ^m,  y_i ∈ {+1, −1}
• Suppose we implement a nearest neighbour classifier by computing the distance of a new pattern to a set of prototypes.
• Keeping with the viewpoint of SVM, suppose we want to transform the patterns into a new space using φ and find the distances there.
• Suppose we use two prototypes given by
  C_+ = (1/n_+) Σ_{i: y_i=+1} φ(X_i)   and   C_− = (1/n_−) Σ_{i: y_i=−1} φ(X_i)
  where n_+ is the number of examples in class +1 and n_− is that in class −1.
• The prototypes are ‘centers’ of the two classes.
• Given a new X, we would put it in class +1 if
  ||φ(X) − C_+||² < ||φ(X) − C_−||²
• We can implement this using kernels. We have
  ||φ(X) − C_+||² = φ(X)ᵀφ(X) − 2 φ(X)ᵀC_+ + C_+ᵀC_+
• Thus we would put X in class +1 if
  φ(X)ᵀC_+ − φ(X)ᵀC_− + (1/2)(C_−ᵀC_− − C_+ᵀC_+) > 0
• All these inner products are now easily done using kernel functions.
• By the definition of C_+, we get
  φ(X)ᵀC_+ = (1/n_+) Σ_{i: y_i=+1} φ(X)ᵀφ(X_i) = (1/n_+) Σ_{i: y_i=+1} K(X_i, X)
• Similarly we get
  C_+ᵀC_+ = (1/n_+²) Σ_{i,j: y_i=y_j=+1} K(X_i, X_j)
• Thus, our classifier is sgn(h(X)) where
  h(X) = (1/n_+) Σ_{i: y_i=+1} K(X_i, X) − (1/n_−) Σ_{i: y_i=−1} K(X_i, X) + b
  where
  b = (1/2) [ (1/n_−²) Σ_{y_i=y_j=−1} K(X_i, X_j) − (1/n_+²) Σ_{y_i=y_j=+1} K(X_i, X_j) ]
• Thus we can implement such nearest neighbour classifiers by implicitly transforming the feature space and using the kernel function for the inner product in the transformed space.
• The kernel function allows us to formulate the right kind of similarity measure in the original space. (A minimal sketch of this classifier is given below.)
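As an illustration, here is a minimal Python sketch of this kernel ‘nearest class centre’ classifier. The Gaussian kernel, its width, and the toy data are assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2, axis=-1))

def kernel_nearest_mean(X, y, Xnew, kernel=gaussian_kernel):
    """Classify Xnew by comparing kernelized distances to the two class centres.

    Implements h(X) = (1/n+) sum_{y_i=+1} K(X_i, X) - (1/n-) sum_{y_i=-1} K(X_i, X) + b,
    with b = 0.5 * [ (1/n-^2) sum_{class -1} K(X_i, X_j) - (1/n+^2) sum_{class +1} K(X_i, X_j) ].
    """
    Xp, Xm = X[y == +1], X[y == -1]
    n_p, n_m = len(Xp), len(Xm)

    Kp = np.array([kernel(xi, Xnew) for xi in Xp])   # shape (n+, n_new)
    Km = np.array([kernel(xi, Xnew) for xi in Xm])   # shape (n-, n_new)

    Gpp = np.array([[kernel(a, b) for b in Xp] for a in Xp])
    Gmm = np.array([[kernel(a, b) for b in Xm] for a in Xm])
    b = 0.5 * (Gmm.sum() / n_m**2 - Gpp.sum() / n_p**2)

    h = Kp.sum(axis=0) / n_p - Km.sum(axis=0) / n_m + b
    return np.sign(h)

# Toy usage (assumed data): two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)
print(kernel_nearest_mean(X, y, np.array([[2.0, 2.0], [-2.0, -2.0]])))
```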
• Define
  P_+(X) = (1/n_+) Σ_{i: y_i=+1} K(X_i, X),   P_−(X) = (1/n_−) Σ_{i: y_i=−1} K(X_i, X)
• With a proper normalization, these are essentially non-parametric estimators for the class conditional densities – the kernel density estimates.
• We could, for example, use a Gaussian kernel; then these are exactly the nonparametric density estimators we studied earlier.
• Thus, our nearest neighbour classifier is essentially a Bayes classifier using nonparametric estimators for the class conditional densities.
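A quick numerical sketch of this connection. The Gaussian kernel and bandwidth below are assumptions for the example: with K(x, z) = exp(−||x − z||²/(2h²)), the quantity P_+(X) differs from the Parzen-window density estimate of the class +1 density only by the Gaussian normalizing constant.

```python
import numpy as np

h = 0.7                                        # assumed bandwidth
K = lambda x, z: np.exp(-np.sum((x - z) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(1)
Xplus = rng.normal(0.0, 1.0, size=(200, 2))    # samples from class +1
x0 = np.array([0.3, -0.2])

# P_+(x0) as on the slide: average kernel value over the class-+1 points.
P_plus = np.mean([K(xi, x0) for xi in Xplus])

# The Parzen-window density estimate differs only by the Gaussian normalizer.
d = Xplus.shape[1]
density_hat = P_plus / ((2 * np.pi * h ** 2) ** (d / 2))
print(P_plus, density_hat)
```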
• We next look at positive definite kernels in some detail.
• We show that for any such kernel there is a vector space with an inner product, H, such that the kernel is an inner product in that space.
• This is called the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel.
• We also show that if we do regularized empirical risk minimization over this space, then the final solution has the ‘support vector expansion’ form.
Positive definite kernels
• Let 𝒳 be the original feature space.
• Let K : 𝒳 × 𝒳 → ℜ be a positive definite kernel.
• Given any n points X_1, …, X_n ∈ 𝒳, the n × n matrix whose (i, j)-th element is K(X_i, X_j) is called the Gram matrix of K.
• Recall that K is positive definite if the Gram matrix is positive semi-definite for all n and all X_1, …, X_n.
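A small numerical sanity check of this definition. The RBF kernel and the random points are assumptions for the example: build the Gram matrix for a set of points and verify that its eigenvalues are (numerically) nonnegative.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 3))                          # X_1, ..., X_n in R^3
G = np.array([[rbf(a, b) for b in pts] for a in pts])   # Gram matrix of K

eigvals = np.linalg.eigvalsh(G)          # G is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)           # True: the Gram matrix is PSD
```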
• Positive definiteness of the kernel means that, for all n,
  Σ_{i,j=1}^n c_i c_j K(X_i, X_j) ≥ 0,  ∀c_i ∈ ℜ, ∀X_i ∈ 𝒳
• Taking n = 1, we get K(X, X) ≥ 0, ∀X ∈ 𝒳.
• Taking n = 2 and remembering that K is symmetric, we get (since the 2 × 2 Gram matrix is positive semi-definite, its determinant is nonnegative)
  K(X_1, X_2)² ≤ K(X_1, X_1) K(X_2, X_2),  ∀X_1, X_2 ∈ 𝒳
  Thus, K satisfies the Cauchy–Schwarz inequality.
• Suppose K(X, X′) = φ(X)ᵀφ(X′).
• Then K is a positive definite kernel:
  Σ_{i,j} c_i c_j φ(X_i)ᵀφ(X_j) = (Σ_i c_i φ(X_i))ᵀ (Σ_j c_j φ(X_j)) = ||Σ_i c_i φ(X_i)||² ≥ 0
• Thus, if K satisfies Mercer's theorem (and hence admits such a φ), then it is a positive definite kernel.
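A sketch of this direction with an explicit feature map. The homogeneous quadratic map below is an assumption chosen for illustration: when K comes from some φ, the quadratic form is a squared norm and hence nonnegative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel (x^T z)^2."""
    return np.outer(x, x).ravel()

def K(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(2)
pts = rng.normal(size=(6, 3))
c = rng.normal(size=6)

# K(x, z) really is an inner product in the feature space: phi(x)^T phi(z).
print(np.allclose(K(pts[0], pts[1]), phi(pts[0]) @ phi(pts[1])))    # True

# The quadratic form equals || sum_i c_i phi(X_i) ||^2, hence >= 0.
quad = sum(c[i] * c[j] * K(pts[i], pts[j]) for i in range(6) for j in range(6))
sq_norm = np.linalg.norm(sum(ci * phi(xi) for ci, xi in zip(c, pts))) ** 2
print(np.allclose(quad, sq_norm), quad >= 0)                        # True True
```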
• We now show that all positive definite kernels are also inner products on some appropriate space.
• Given a kernel K, we will construct a space endowed with an inner product and show how any positive definite kernel is essentially implementing an inner product in this space.
• This space is called the Reproducing Kernel Hilbert Space associated with the kernel, K.
• Let ℜ^𝒳 be the set of all real-valued functions on 𝒳.
• Let K be a positive definite kernel.
• For any X ∈ 𝒳, let K(·, X) ∈ ℜ^𝒳 denote the function that maps X′ ∈ 𝒳 to K(X′, X) ∈ ℜ.
• Consider the set of functions H₁ = {K(·, X) : X ∈ 𝒳}.
• Let H be the set of all functions that are finite linear combinations of functions in H₁.
• Any f(·) ∈ H can be written as
  f(·) = Σ_{i=1}^n α_i K(·, X_i),  for some n, X_i ∈ 𝒳, α_i ∈ ℜ
• It is easy to see that if f, g ∈ H then f + g ∈ H, and αf ∈ H for α ∈ ℜ.
• Thus, H is a vector space.
• We now define an inner product on H.
• Let f, g ∈ H with
  f(·) = Σ_{i=1}^n α_i K(·, X_i),   g(·) = Σ_{j=1}^{n′} β_j K(·, X′_j)
• We define the inner product as
  ⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^{n′} α_i β_j K(X_i, X′_j)
• We first show that this is well defined.
• We note that
  ⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^{n′} α_i β_j K(X_i, X′_j) = Σ_{i=1}^n α_i g(X_i)
• Similarly we have
  ⟨f, g⟩ = Σ_{j=1}^{n′} β_j Σ_{i=1}^n α_i K(X_i, X′_j) = Σ_{j=1}^{n′} β_j f(X′_j)
• Thus our inner product does not depend on the particular expansions (the α_i and β_j) chosen for f and g, and hence it is well defined.
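A small numerical sketch of this well-definedness argument. The Gaussian kernel and the particular expansions are assumptions for the example: the double-sum formula and the two single-sum forms above all give the same number.

```python
import numpy as np

def K(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(3)
Xc, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)    # f = sum_i alpha_i K(., X_i)
Zc, beta = rng.normal(size=(3, 2)), rng.normal(size=3)     # g = sum_j beta_j  K(., Z_j)

f = lambda x: sum(a * K(x, xi) for a, xi in zip(alpha, Xc))
g = lambda x: sum(b * K(x, zj) for b, zj in zip(beta, Zc))

# <f, g> = sum_i sum_j alpha_i beta_j K(X_i, Z_j)
ip = sum(a * b * K(xi, zj) for a, xi in zip(alpha, Xc) for b, zj in zip(beta, Zc))

# Equivalent forms, showing the value depends only on f and g themselves:
print(np.allclose(ip, sum(a * g(xi) for a, xi in zip(alpha, Xc))))   # True
print(np.allclose(ip, sum(b * f(zj) for b, zj in zip(beta, Zc))))    # True
```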
• Now we have to show that this is actually an inner product.
• It is symmetric by definition.
• It is easily verified that it is bilinear:
  ⟨f, g₁ + g₂⟩ = ⟨f, g₁⟩ + ⟨f, g₂⟩
  ⟨f₁ + f₂, g⟩ = ⟨f₁, g⟩ + ⟨f₂, g⟩
• It is also easy to see that ⟨cf, g⟩ = c⟨f, g⟩.
• By the positive definiteness of K, we have
  ⟨f, f⟩ = Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(X_i, X_j) ≥ 0
• Finally, we have to show ⟨f, f⟩ = 0 ⇒ f = 0.
• Let f₁, …, f_p ∈ H and let γ₁, …, γ_p ∈ ℜ. Let g₁ = Σ_{i=1}^p γ_i f_i ∈ H.
• Now we get
  Σ_{i,j=1}^p γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_{i=1}^p γ_i f_i, Σ_{j=1}^p γ_j f_j⟩ = ⟨g₁, g₁⟩ ≥ 0
• Note that ⟨·, ·⟩ is a symmetric function that maps H × H to ℜ.
• Thus what we have shown is that ⟨·, ·⟩ is a positive definite kernel on H.
• Since positive definite kernels satisfy the Cauchy–Schwarz inequality, for any f ∈ H we have
  |⟨K(·, X), f⟩|² ≤ ⟨K(·, X), K(·, X)⟩ ⟨f, f⟩
• Recall that for f, g ∈ H with
  f(·) = Σ_{i=1}^n α_i K(·, X_i),   g(·) = Σ_{j=1}^{n′} β_j K(·, X′_j)
  the inner product is
  ⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^{n′} α_i β_j K(X_i, X′_j)
• Hence we have
  ⟨K(·, X), K(·, X′)⟩ = K(X, X′)   and   ⟨K(·, X), f⟩ = f(X)
• This is the reproducing kernel property. Now, ∀X,
  |f(X)|² = |⟨K(·, X), f⟩|² ≤ K(X, X) ⟨f, f⟩
• This shows ⟨f, f⟩ = 0 ⇒ f = 0.
• This shows that what we defined is indeed an inner product.
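A quick numerical spelling-out of the reproducing property. The kernel and expansion below are assumptions for the example; taking g = K(·, X) in the inner-product formula collapses the double sum to f(X).

```python
import numpy as np

def K(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(4)
Xc, alpha = rng.normal(size=(5, 2)), rng.normal(size=5)    # f = sum_i alpha_i K(., X_i)
f = lambda x: sum(a * K(x, xi) for a, xi in zip(alpha, Xc))

x0 = rng.normal(size=2)
# <K(., x0), f> = sum_i alpha_i K(X_i, x0); the reproducing property says this is f(x0).
ip = sum(a * K(xi, x0) for a, xi in zip(alpha, Xc))
print(np.allclose(ip, f(x0)))                               # True
```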
• Given any positive definite kernel, we can construct this inner product space H as explained here.
• We can complete it in the norm induced by the inner product.
• It is called the Reproducing Kernel Hilbert Space (RKHS) associated with K.
• The reproducing kernel property is
  ⟨K(·, X), f⟩ = f(X),  ∀f ∈ H
• Note that the elements of the RKHS are certain real-valued functions on 𝒳 – essentially, a kind of generalization of linear functionals on 𝒳.
• Given this RKHS H associated with K, define φ : 𝒳 → H by
  φ(X) = K(·, X)
• Now we have
  K(X, X′) = ⟨φ(X), φ(X′)⟩
• This shows that any positive definite kernel gives us an inner product in some other space, as needed.
• As a simple example, let 𝒳 = ℜ^m and K(X, X′) = XᵀX′.
• Now, K(·, X) is the function that takes the dot product of its argument with X.
• Let X = [x₁, …, x_m]ᵀ and let e_i, i = 1, …, m, be the coordinate unit vectors. For any X′ ∈ ℜ^m,
  K(X′, X) = XᵀX′ = Σ_{i=1}^m x_i e_iᵀX′ = Σ_{i=1}^m x_i K(X′, e_i)
• This gives us K(·, X) = Σ_{i=1}^m x_i K(·, e_i).
• This means all functions in H are linear combinations of the K(·, e_i).
• Thus any f ∈ H can be written as f = Σ_{i=1}^m w_i K(·, e_i).
• So the RKHS is isomorphic to ℜ^m, and thus it represents hyperplanes on 𝒳.
• The inner product in this H would simply be the usual dot product.
• Learning hyperplanes is then searching over this H for a minimizer of the empirical risk, with the usual norm as a regularizer. (A sketch of this special case is given below.)
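A sketch of this special case; the dimensions and coefficients are assumptions for the example. With the linear kernel, an element f = Σ_i w_i K(·, e_i) of H is just the linear function f(X) = wᵀX, and the RKHS inner product of two such elements reduces to the Euclidean dot product of their coefficient vectors.

```python
import numpy as np

m = 4
K = lambda x, z: x @ z                      # linear kernel on R^m
E = np.eye(m)                               # coordinate unit vectors e_1, ..., e_m

rng = np.random.default_rng(5)
w, v = rng.normal(size=m), rng.normal(size=m)

f = lambda x: sum(wi * K(x, ei) for wi, ei in zip(w, E))   # f = sum_i w_i K(., e_i)
g = lambda x: sum(vi * K(x, ei) for vi, ei in zip(v, E))

x0 = rng.normal(size=m)
print(np.allclose(f(x0), w @ x0))           # f is the hyperplane x -> w^T x

# RKHS inner product: sum_{i,j} w_i v_j K(e_i, e_j) = w^T v, since K(e_i, e_j) = delta_ij.
ip = sum(wi * vj * K(ei, ej) for wi, ei in zip(w, E) for vj, ej in zip(v, E))
print(np.allclose(ip, w @ v))               # True: the usual dot product
```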
• What we have shown is the following.
• Given a positive definite kernel K, there is a vector space with an inner product, namely the RKHS H associated with K, and a mapping φ from 𝒳 to H, such that the kernel is an inner product in H.
• This RKHS represents a space of functions over which we can search for the empirical risk minimizer.
• An important insight gained from this viewpoint is the Representer Theorem.
Representer Theorem
• Let K be a positive definite kernel and let H be the RKHS associated with it.
• Let {(X_i, y_i), i = 1, …, n} be the training set.
• For any function f, the empirical risk under any loss function can be represented as a function
  R̂_n(f) = C((X_i, y_i, f(X_i)), i = 1, …, n)
• We search over H for a minimizer of the empirical risk.
• Let ||f||² = ⟨f, f⟩ be the norm under our inner product.
• Theorem: Let Ω : [0, ∞) → ℜ⁺ be a strictly monotonically increasing function. Then any minimizer g over H of the regularized risk
  C((X_i, y_i, g(X_i)), i = 1, …, n) + Ω(||g||²)
  admits a representation
  g(X) = Σ_{i=1}^n α_i K(X_i, X)
• What this means is the following.
• Functions in H are linear combinations of kernels centered at arbitrary points of 𝒳.
• Though we are searching over this whole space, the minimizer can always be expressed as a linear combination of kernels centered on the data points only.
• Thus, though H may be infinite dimensional, we can solve the optimization problem by searching for only n real numbers α_i.
• This is essentially what we have done in solving the dual for the SVM. (A kernel ridge regression sketch of the same idea is given below.)
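A sketch of the theorem in action. Kernel ridge regression with squared loss and a Gaussian kernel is an assumed concrete instance, not the only case the theorem covers: minimizing Σ_i (y_i − g(X_i))² + λ||g||² over g ∈ H reduces, by the representer theorem, to solving for the n coefficients α, which in this instance have the closed form α = (G + λI)⁻¹ y with G the Gram matrix.

```python
import numpy as np

def K(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

G = np.array([[K(a, b) for b in X] for a in X])          # Gram matrix
lam = 0.1
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)     # closed-form KRR coefficients

# The minimizer is the kernel expansion promised by the representer theorem.
g = lambda x: sum(ai * K(xi, x) for ai, xi in zip(alpha, X))
print(g(np.array([1.0])), np.sin(1.0))                   # roughly comparable
```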
Proof of Representer Theorem
• In the vector space H, consider the span of the functions K(X₁, ·), …, K(X_n, ·). This is a subspace.
• Given any f ∈ H, we can decompose it into two components – one in this subspace and one in the subspace orthogonal to it.
• Let us call these two components f_∥ and f_⊥.
• Thus, for any f ∈ H and any X ∈ 𝒳, we have
  f(X) = f_∥(X) + f_⊥(X) = Σ_{i=1}^n α_i K(X_i, X) + f_⊥(X)
  where α_i ∈ ℜ, f_⊥ ∈ H and ⟨f_⊥, K(X_i, ·)⟩ = 0, i = 1, …, n.
• Since H is the RKHS of K, the reproducing kernel property gives us
  f(X′) = ⟨f, K(X′, ·)⟩
• Hence for any of the data points X_j, j = 1, …, n,
  f(X_j) = ⟨f, K(X_j, ·)⟩
         = ⟨f_∥ + f_⊥, K(X_j, ·)⟩
         = Σ_{i=1}^n α_i K(X_i, X_j) + ⟨f_⊥, K(X_j, ·)⟩
         = Σ_{i=1}^n α_i K(X_i, X_j) = f_∥(X_j)
• This is true for any f ∈ H. (A small numerical illustration is given below.)
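A small numerical illustration of this step. The kernel, the data, and the extra centres are assumptions for the example: take an f whose expansion uses centres outside the training set, project it onto span{K(X_i, ·)}, and observe that the projection agrees with f at every X_j even though the two functions generally differ elsewhere.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # "data" points X_1, ..., X_n
Z = rng.normal(size=(3, 2))          # extra centres, so f is NOT in span{K(X_i, .)}
beta = rng.normal(size=3)            # f(.) = sum_k beta_k K(., Z_k)

def f(x):
    return sum(b * rbf(x, z) for b, z in zip(beta, Z))

# Coefficients of the projection f_par onto span{K(X_i, .)}:
# <f_par, K(X_j, .)> = f(X_j) for all j  =>  G @ alpha = f_vec.
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])
f_vec = np.array([f(xi) for xi in X])
alpha = np.linalg.solve(G, f_vec)    # assumes the Gram matrix is nonsingular

def f_par(x):
    return sum(a * rbf(x, xi) for a, xi in zip(alpha, X))

# f and its projection agree at every data point X_j (the key proof step) ...
print(np.allclose([f_par(xi) for xi in X], f_vec))    # True
# ... but they generally differ away from the data.
print(abs(f(Z[0]) - f_par(Z[0])))
```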
• Now let g ∈ H be a minimizer of the regularized risk.
• We can write g = g_∥ + g_⊥, and g(X_j) = g_∥(X_j) for all data vectors X_j.
• Hence the empirical risk of g,
  C((X_i, y_i, g(X_i)), i = 1, …, n),
  is the same as the empirical risk of g_∥.
• Since g_∥ and g_⊥ are orthogonal,
  ||g||² = ||g_∥||² + ||g_⊥||² ≥ ||g_∥||²
• Since Ω is strictly monotonically increasing, Ω(||g||²) ≥ Ω(||g_∥||²).
• This shows that the regularized risk of g_∥ can only be less than or equal to that of g.
• Hence any minimizer lies in the subspace spanned by the K(X_i, ·) and hence has a representation
  g(X) = Σ_{i=1}^n α_i K(X_i, X)
• This completes the proof of the theorem.