
The SVM approach

• We have briefly discussed the Support Vector Machine (SVM) idea.
• The idea is to map the feature vectors nonlinearly into another space and learn a linear classifier there.
• The linear classifier in this new space would be an appropriate nonlinear classifier in the original space.


• Recall the simple example we saw earlier.
• Let X = [x1 x2] and let φ : ℜ² → ℜ⁵ be given by

  Z = φ(X) = [x1  x2  x1²  x2²  x1x2]

• Now,

  g(X) = a0 + a1 x1 + a2 x2 + a3 x1² + a4 x2² + a5 x1 x2

  is a quadratic discriminant function in ℜ²; but

  g(Z) = a0 + a1 z1 + a2 z2 + a3 z3 + a4 z4 + a5 z5

  is a linear discriminant function in the ‘φ(X)’ space.
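As a concrete illustration of this feature map, here is a minimal Python sketch; the coefficients a0 and a are hypothetical, chosen so that the resulting quadratic discriminant is the circle x1² + x2² = 1.

```python
import numpy as np

def phi(x):
    """Quadratic feature map from R^2 to R^5 for x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical coefficients: g(X) = x1^2 + x2^2 - 1, the circle of radius 1.
a0 = -1.0
a = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

def g(x):
    # Linear in phi(x), quadratic in x.
    return a0 + a @ phi(x)

print(g(np.array([0.0, 0.0])))   # -1.0 (inside the circle)
print(g(np.array([2.0, 0.0])))   #  3.0 (outside the circle)
```

The sign change of g marks a nonlinear (circular) decision boundary in the original space, even though g is linear in Z.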


• There are two major issues in naively using this idea.
• If we want, e.g., a pth-degree polynomial discriminant function in the original feature space (ℜᵐ), then the transformed feature vector, Z, has dimension O(mᵖ).
• This results in a huge computational cost, both for learning and for the final operation of the classifier.
• We need to learn O(mᵖ) parameters rather than O(m) parameters, and hence may need a much larger number of examples to achieve proper generalization.
• SVM offers an elegant solution to both issues.
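As a quick check on the O(mᵖ) growth, the number of monomials of degree at most p in m variables is C(m + p, p); a small sketch:

```python
from math import comb

def num_monomials(m, p):
    """Number of monomials of degree at most p in m variables: C(m + p, p)."""
    return comb(m + p, p)

print(num_monomials(2, 2))    # 6: the quadratic example above (five features plus the constant)
print(num_monomials(100, 3))  # 176851: grows like m^p, already large for modest m
```
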
Support Vector Machines

• Learning of the optimal hyperplane: the separating hyperplane that maximizes the separation between classes.
• Effectively maps the original feature vectors into a high-dimensional space, and hence learns nonlinear discriminant functions.
• By using a kernel function we never need to explicitly calculate the mapping.
• We need to solve only a quadratic optimization problem.
• We now formulate the SVM method, first for the linearly separable case.

• Training set: {(Xi, yi), i = 1, . . . , n}, Xi ∈ ℜᵐ, yi ∈ {+1, −1}.
• To start with, assume the training set is linearly separable. That is, there exist W ∈ ℜᵐ and b ∈ ℜ such that

  Wᵀ Xi + b > 0, ∀i s.t. yi = +1
  Wᵀ Xi + b < 0, ∀i s.t. yi = −1

  (Note that both inequalities are strict.)
• Wᵀ X + b = 0 is then a separating hyperplane.
• Infinitely many separating hyperplanes exist.
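A minimal sketch of this separability check, assuming the training set is stored as an n × m array X and a label vector y with entries in {+1, −1} (the data below are made up):

```python
import numpy as np

def separates(W, b, X, y):
    """True iff y_i * (W^T X_i + b) > 0 for every training pattern, i.e. strict separation."""
    return bool(np.all(y * (X @ W + b) > 0))

# Example with two patterns per class (hypothetical data):
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
print(separates(np.array([1.0, 1.0]), 0.0, X, y))   # True
```
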
[Figure: A good separating hyperplane]

[Figure: Another separating hyperplane]


• Recall that we assume the training set is linearly separable and hence

  Wᵀ Xi + b > 0, ∀i s.t. yi = +1
  Wᵀ Xi + b < 0, ∀i s.t. yi = −1

• Since the training set is finite, ∃ ε > 0 s.t.

  Wᵀ Xi + b ≥ +ε, ∀i s.t. yi = +1
  Wᵀ Xi + b ≤ −ε, ∀i s.t. yi = −1


• Hence, dividing both W and b by ε, we can scale them such that

  Wᵀ Xi + b ≥ +1 if yi = +1
  Wᵀ Xi + b ≤ −1 if yi = −1

  or, equivalently,

  yi (Wᵀ Xi + b) ≥ 1, ∀i.

  (Recall that yi ∈ {+1, −1}.)


• When the training set is separable, any separating hyperplane (W, b) can be scaled to satisfy

  yi (Wᵀ Xi + b) ≥ 1, ∀i.

• Then there are no training patterns between the two parallel hyperplanes

  Wᵀ X + b = +1  and  Wᵀ X + b = −1.

[Figure: Margin of a hyperplane]


Optimal hyperplane

• The distance between these two hyperplanes is 2/||W||. This is called the margin of the separating hyperplane.
• Hence the distance between the separating hyperplane and the closest pattern is 1/||W||.
• Intuitively, the larger the margin, the better the chance of correctly classifying new patterns.
• Optimal hyperplane: the separating hyperplane with maximum margin.
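A small numeric illustration of these distance formulas, with a hypothetical scaled hyperplane W = [1, 1], b = 0:

```python
import numpy as np

def margin(W):
    """Margin between the hyperplanes W^T x + b = +1 and W^T x + b = -1: 2 / ||W||."""
    return 2.0 / np.linalg.norm(W)

def distance_to_hyperplane(x, W, b):
    """Perpendicular distance of a point x from the hyperplane W^T x + b = 0."""
    return abs(W @ x + b) / np.linalg.norm(W)

W, b = np.array([1.0, 1.0]), 0.0
print(margin(W))                                           # 2 / sqrt(2), about 1.414
print(distance_to_hyperplane(np.array([1.0, 0.0]), W, b))  # 1 / sqrt(2), about 0.707
```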


The optimization problem

• Among all separating hyperplanes, the one with the largest margin is the optimal hyperplane.
• So, the optimal hyperplane is a solution to the following optimization problem.
• Find W ∈ ℜᵐ, b ∈ ℜ to

  minimize    (1/2) Wᵀ W
  subject to  yi (Wᵀ Xi + b) ≥ 1,  i = 1, . . . , n

• This is a constrained optimization problem with a quadratic cost function and linear inequality constraints.
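As a sketch, this primal problem can be handed to a generic constrained solver on a tiny made-up dataset (SciPy's SLSQP here; the data and variable names are purely illustrative, and a dedicated QP solver would normally be preferred):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical): two patterns per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, m = X.shape

def objective(v):
    # v = [W, b]; the cost (1/2) W^T W does not involve b.
    W = v[:m]
    return 0.5 * W @ W

# One inequality constraint per pattern: y_i (W^T X_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:m] + v[m]) - 1.0}
               for i in range(n)]

res = minimize(objective, x0=np.zeros(m + 1), method='SLSQP', constraints=constraints)
W_opt, b_opt = res.x[:m], res.x[m]
print(W_opt, b_opt, 2.0 / np.linalg.norm(W_opt))  # hyperplane and its margin
```
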
Constrained Optimization

• Consider the following optimization problem:

  minimize    f(x)
  subject to  ajᵀ x + bj ≤ 0,  j = 1, . . . , r

  where f : ℜᵐ → ℜ is a continuously differentiable function, and aj ∈ ℜᵐ, bj ∈ ℜ, j = 1, · · · , r.
• A point x ∈ ℜᵐ is called a feasible point (for this problem) if ajᵀ x + bj ≤ 0, j = 1, · · · , r.


• A point x∗ ∈ ℜᵐ is called a local minimum of the problem if f(x∗) ≤ f(x) for all x that are feasible and lie in a small neighbourhood of x∗.
• If f(x∗) ≤ f(x) for all feasible x ∈ ℜᵐ, then x∗ is a global minimum.
• Unlike in unconstrained optimization, here we need to minimize only over the feasible set.


• Here we consider only the case where f is a convex function.
• f : ℜᵐ → ℜ is said to be a convex function if, for all x1, x2 ∈ ℜᵐ and for all α ∈ (0, 1),

  f(α x1 + (1 − α) x2) ≤ α f(x1) + (1 − α) f(x2)

• For example, f(x) = xᵀ x is a convex function.
• When f is convex, every local minimum of our optimization problem is also a global minimum.
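A quick numerical spot-check of this inequality for f(x) = xᵀx, at random points and random α (a check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x @ x          # f(x) = x^T x

# Spot-check f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2) at random points.
for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform(0.0, 1.0)
    assert f(a * x1 + (1 - a) * x2) <= a * f(x1) + (1 - a) * f(x2) + 1e-12
print("convexity inequality held at all sampled points")
```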


• We now look at one method of solving the constrained optimization problem.
• Given our optimization problem, define

  L(x, µ) = f(x) + Σ_{j=1}^{r} µj (ajᵀ x + bj)

• L is called the Lagrangian of the problem and the µj are called the Lagrange multipliers.
• Essentially, the constrained optimization problem can be solved through unconstrained optimization of L.

Kuhn-Tucker Conditions

• Consider the optimization problem with f convex.
• A point x∗ is a global minimum if and only if x∗ is feasible and there exist µj∗, j = 1, · · · , r, such that
  1. ∇x L(x∗, µ∗) = 0
  2. µj∗ ≥ 0, ∀j
  3. µj∗ (ajᵀ x∗ + bj) = 0, ∀j
• These are the so-called Kuhn-Tucker conditions for our optimization problem with a convex cost function and linear constraints.
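A sketch of these conditions as a numerical check, writing the r constraints as rows of a matrix A so that Ax + b ≤ 0 componentwise; the function and argument names are illustrative:

```python
import numpy as np

def kt_check(grad_f, A, b, x, mu, tol=1e-8):
    """Check the Kuhn-Tucker conditions for: minimize f(x) s.t. A x + b <= 0 (componentwise)."""
    stationary = np.linalg.norm(grad_f(x) + A.T @ mu) <= tol       # 1. grad_x L(x, mu) = 0
    nonneg     = np.all(mu >= -tol)                                # 2. mu_j >= 0 for all j
    comp_slack = np.all(np.abs(mu * (A @ x + b)) <= tol)           # 3. mu_j (a_j^T x + b_j) = 0
    feasible   = np.all(A @ x + b <= tol)                          # x must also be feasible
    return bool(stationary and nonneg and comp_slack and feasible)
```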


• We can use the above conditions to obtain an x∗ which is a minimum of the optimization problem.
• We can also solve the constrained optimization problem using the so-called dual of this problem.
• This is the approach taken in the SVM algorithm.
• Duality is an important concept in optimization.
• Here we discuss only one way of formulating the dual, which is useful when the objective function is convex and the constraints are linear.


• Our optimization problem is

  minimize    f(x)
  subject to  ajᵀ x + bj ≤ 0,  j = 1, . . . , r

  where f : ℜᵐ → ℜ is a continuously differentiable convex function, and aj ∈ ℜᵐ, bj ∈ ℜ, j = 1, · · · , r.
• This is known as the primal problem.
• Here the optimization variables are x ∈ ℜᵐ.


• Recall that the Lagrangian is

  L(x, µ) = f(x) + Σ_{j=1}^{r} µj (ajᵀ x + bj)

  Here, x ∈ ℜᵐ and µ ∈ ℜʳ.
• Define the dual function, q : ℜʳ → [−∞, ∞), by

  q(µ) = inf_x L(x, µ)

• If, for a particular µ, L(·, µ) is unbounded below in x (so the infimum is not attained), then q(µ) takes the value −∞.
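A rough numerical sketch of the dual function, assuming the constraints are supplied as A, b so that L(x, µ) = f(x) + µᵀ(Ax + b); when L(·, µ) is unbounded below, a local minimiser cannot return −∞, so this is only an illustration of the definition:

```python
import numpy as np
from scipy.optimize import minimize

def dual_value(f, A, b, mu, x0):
    """Numerically evaluate q(mu) = inf_x [ f(x) + mu^T (A x + b) ].
    If L(., mu) is unbounded below, q(mu) = -inf; a local minimiser cannot
    detect that, so treat the returned value only as an illustration."""
    L = lambda x: f(x) + mu @ (A @ x + b)
    return minimize(L, x0).fun
```
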
The Dual problem

• The dual problem is:

  maximize    q(µ)
  subject to  µj ≥ 0,  j = 1, . . . , r

• This is also a constrained optimization problem.
• Here the optimization is over ℜʳ and µ ∈ ℜʳ are the optimization variables.
• There is a nice connection between the primal and dual problems.


Primal-Dual Relationship

• Now we have the following.
  1. If the primal has a solution, so does the dual, and the optimal values are equal.
  2. x∗ is optimal for the primal and µ∗ is optimal for the dual if and only if
     • x∗ is feasible for the primal and µ∗ is feasible for the dual, and
     • f(x∗) = L(x∗, µ∗) = min_x L(x, µ∗).
• We will be using the dual formulation for the optimization problem in SVM.


The optimization problem for SVM

• The optimal hyperplane is a solution of the following constrained optimization problem.
• Find W ∈ ℜᵐ, b ∈ ℜ to

  minimize    (1/2) Wᵀ W
  subject to  1 − yi (Wᵀ Xi + b) ≤ 0,  i = 1, . . . , n

• Quadratic cost function and linear (inequality) constraints.
• The Kuhn-Tucker conditions are necessary and sufficient; every local minimum is a global minimum.

[Figure: Optimal hyperplane]

[Figure: Non-optimal hyperplane]


• The Lagrangian is given by

  L(W, b, µ) = (1/2) Wᵀ W + Σ_{i=1}^{n} µi [1 − yi (Wᵀ Xi + b)]

• The Kuhn-Tucker conditions give

  ∇W L = 0  ⇒  W∗ = Σ_{i=1}^{n} µi∗ yi Xi
  ∂L/∂b = 0  ⇒  Σ_{i=1}^{n} µi∗ yi = 0
  1 − yi (Xiᵀ W∗ + b∗) ≤ 0, ∀i
  µi∗ ≥ 0  and  µi∗ [1 − yi (Xiᵀ W∗ + b∗)] = 0, ∀i

• Let S = {i | µi∗ > 0}.
• By the complementary slackness condition,

  i ∈ S  ⇒  yi (Xiᵀ W∗ + b∗) = 1,

  which implies that Xi is closest to the separating hyperplane.
• {Xi | i ∈ S} are called support vectors. We have

  W∗ = Σ_i µi∗ yi Xi = Σ_{i∈S} µi∗ yi Xi

• The optimal W is a linear combination of the support vectors.
• The support vectors constitute a very useful output of the method.

[Figure: Optimal hyperplane]


The SVM solution

• The optimal hyperplane (W∗, b∗) is given by:

  W∗ = Σ_i µi∗ yi Xi = Σ_{i∈S} µi∗ yi Xi

  b∗ = yj − Xjᵀ W∗,  for any j s.t. µj∗ > 0

  (Note that µj∗ > 0 ⇒ yj (Xjᵀ W∗ + b∗) = 1.)
• Thus, W∗ and b∗ are determined by the µi∗, i = 1, . . . , n.
• We can use the dual of the optimization problem to get the µi∗.
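A minimal sketch of this recovery step, assuming the optimal multipliers µ∗ have already been obtained from a QP solver (array and function names are illustrative):

```python
import numpy as np

def recover_hyperplane(mu, X, y, tol=1e-8):
    """Given optimal dual multipliers mu, recover (W*, b*) and the support-vector indices."""
    W = (mu * y) @ X                     # W* = sum_i mu_i* y_i X_i
    support = np.where(mu > tol)[0]      # S = { i : mu_i* > 0 }
    j = support[0]                       # any support vector will do for b*
    b = y[j] - X[j] @ W                  # b* = y_j - X_j^T W*
    return W, b, support
```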


Dual optimization problem for SVM

• The dual function is

  q(µ) = inf_{W,b} { (1/2) Wᵀ W + Σ_{i=1}^{n} µi [1 − yi (Wᵀ Xi + b)] }

• If Σ_i µi yi ≠ 0 then q(µ) = −∞ (the Lagrangian is linear in b with coefficient −Σ_i µi yi, so it is unbounded below in b).
• Hence we need to maximize q only over those µ s.t. Σ_i µi yi = 0.
• The infimum w.r.t. W is attained at W = Σ_i µi yi Xi.
• We obtain the dual by substituting W = Σ_i µi yi Xi and imposing Σ_i µi yi = 0.

• By substituting W = Σ_i µi yi Xi and using Σ_i µi yi = 0, we get

  q(µ) = (1/2) Wᵀ W + Σ_{i=1}^{n} µi − Σ_{i=1}^{n} µi yi (Wᵀ Xi + b)

       = (1/2) (Σ_i µi yi Xi)ᵀ (Σ_j µj yj Xj) + Σ_i µi − Σ_i µi yi Xiᵀ (Σ_j µj yj Xj)

       = Σ_i µi − (1/2) Σ_i Σ_j µi yi µj yj Xiᵀ Xj

• Thus, the dual problem is:

  max_µ  q(µ) = Σ_{i=1}^{n} µi − (1/2) Σ_{i,j=1}^{n} µi µj yi yj Xiᵀ Xj

  subject to  µi ≥ 0, i = 1, . . . , n,  and  Σ_{i=1}^{n} yi µi = 0

• Quadratic cost function and linear constraints.
• The training data vectors appear only through inner products.
• The optimization is over ℜⁿ, irrespective of the dimension of Xi.
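As a final sketch, the dual can be solved numerically on the same toy data used in the earlier primal example and the hyperplane recovered from the multipliers (illustrative only; a dedicated QP solver would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

# Same toy linearly separable data as in the earlier primal sketch (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)
K = X @ X.T                                   # Gram matrix of inner products X_i^T X_j

def neg_q(mu):
    # Negative of the dual objective: -( sum_i mu_i - (1/2) sum_ij mu_i mu_j y_i y_j X_i^T X_j )
    return -(mu.sum() - 0.5 * (mu * y) @ K @ (mu * y))

res = minimize(neg_q, x0=np.full(n, 0.5), method='SLSQP',
               bounds=[(0.0, None)] * n,                                # mu_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda mu: mu @ y}])  # sum_i y_i mu_i = 0

mu = res.x
W = (mu * y) @ X                              # W* = sum_i mu_i* y_i X_i
j = int(np.argmax(mu))                        # index of a support vector
b = y[j] - X[j] @ W                           # b* = y_j - X_j^T W*
print(mu, W, b)
```

Note that the data enter only through the Gram matrix K of inner products Xiᵀ Xj, which is exactly what allows the inner product to be replaced by a kernel function so that the mapping never has to be computed explicitly.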
