Support Vector Machines
(Informal: Version 0)
Introduction
Linear Classifier
Classifier:
• If f(x1, x2) < 0, assign Class 1;
• If f(x1, x2) > 0, assign Class 2.
Decision boundary: f(x1, x2) = w1x1 + w2x2 + b = 0
[Figure: Class 1 and Class 2 points in the (x1, x2) plane, separated by the line f(x1, x2) = 0]
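A minimal sketch of this decision rule in Python (the weights w1, w2 and the bias b below are made-up values for illustration, not learned from data):

def f(x1, x2, w1, w2, b):
    # Linear function f(x1, x2) = w1*x1 + w2*x2 + b
    return w1 * x1 + w2 * x2 + b

def classify(x1, x2, w1=1.0, w2=-2.0, b=0.5):
    # Points with f < 0 go to Class 1, points with f > 0 go to Class 2
    return "Class 1" if f(x1, x2, w1, w2, b) < 0 else "Class 2"

print(classify(3.0, 2.0))  # example query point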
Perceptron
• Perceptron is the name given to this kind of linear classifier.
• If there exists a Perceptron that correctly classifies all training examples, then we say that the training set is linearly separable.
• Different Perceptron learning techniques are available; a minimal sketch of the classical mistake-driven rule follows.
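A minimal sketch of the classical Perceptron update rule (Rosenblatt's rule), assuming labels coded as +1 and -1; the toy data and learning rate are illustrative, not from the slides:

import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """X: (n, d) array of training points; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        made_update = False
        for xi, yi in zip(X, y):
            # Mistake-driven: update w and b only on misclassified points
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                made_update = True
        if not made_update:  # no mistakes in a full pass: training set separated
            break
    return w, b

# Toy linearly separable data (illustrative values)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)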
Perceptron – Let us begin with linearly separable data
• For linearly separable data, many Perceptrons are possible that correctly classify the training set.
• All of them do equally well on the training set, so which one is good on the unseen test set?
[Figure: several separating lines between Class 1 and Class 2, all of them consistent with the training data]
Hard Linear SVM
• The best Perceptron for linearly separable data is called the "hard linear SVM".
• For each linear function we can define its margin; a standard definition is given below.
• The linear function which has the maximum margin is the best one.
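One standard way to define this margin (the usual convention, not spelled out on the slide), assuming labels $y_i \in \{-1, +1\}$: the margin of the hyperplane $w^\top x + b = 0$ on a training point $(x_i, y_i)$ is
\[
\gamma_i = \frac{y_i \, (w^\top x_i + b)}{\lVert w \rVert},
\]
and the margin of the classifier on the training set is $\gamma = \min_i \gamma_i$.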
[Figure: two separating hyperplanes for the same Class 1 / Class 2 data; both classify the training set correctly, but their margins differ]
Maximizing the Margin
IDEA: Select the separating hyperplane that maximizes the margin! (The optimization problem this leads to is sketched below.)
[Figure: two candidate separating hyperplanes in the (Var1, Var2) plane, each shown with its margin width; the wider margin is preferred]
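The usual way this idea is written as an optimization problem (the standard formulation, not written out on the slides): maximizing the margin is equivalent to
\[
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \, (w^\top x_i + b) \ge 1 \ \text{ for every training point } (x_i, y_i),
\]
and the resulting margin width is $2 / \lVert w \rVert$.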
Support Vectors
[Figure: the maximum-margin hyperplane in the (Var1, Var2) plane; the training points that lie on the margin boundaries are the support vectors, and the distance between the boundaries is the margin width]
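A minimal scikit-learn sketch that exposes the support vectors (a very large C approximates the hard-margin case); the toy data is made up for illustration:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 2.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)         # the points lying on the margin boundaries
print(clf.coef_, clf.intercept_)    # w and b of the separating hyperplane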
What if the data is not linearly separable?
But solving a non-linear problem is mathematically more difficult.
[Figure: data in the (Var1, Var2) plane that no straight line can separate; a non-linear boundary is needed]
Kernel Mapping
An example
[Figure: points with labels y = -1 and y = +1 that are not linearly separable in the Input Space become linearly separable after mapping to the Feature Space]
The Trick !!
• There is no need to do this mapping explicitly.
• For some mappings, the dot product in the feature space can be expressed as a function in the input space:
• $\varphi(X_1) \cdot \varphi(X_2) = k(X_1, X_2)$
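A small numerical check of this identity for one concrete choice, the degree-2 polynomial kernel $k(a, b) = (a \cdot b)^2$ with explicit feature map $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (this particular kernel is an illustrative example, not one named on the slides):

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def k(a, b):
    # The same quantity computed entirely in the input space
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print(np.dot(phi(a), phi(b)))  # dot product in the feature space
print(k(a, b))                 # identical value from the input space: 1.0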
• So, if the solution involves only dot products, then it can be obtained using the kernel trick (of course, an appropriate kernel function has to be chosen).
• The problem is that with powerful kernels like the Gaussian kernel, it is possible to learn a non-linear classifier which does extremely well on the training set; a sketch of this behaviour is given below.
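A minimal scikit-learn sketch of this behaviour, on made-up noisy 2-D data; with a Gaussian (RBF) kernel and a large gamma, the classifier can drive the training error to zero, fitting the noise as well:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                    # made-up 2-D points
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # a non-linear concept
y[rng.choice(60, size=6, replace=False)] *= -1  # flip a few labels (noise)

clf = SVC(kernel="rbf", gamma=100.0, C=1e6)     # a very flexible model
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))    # typically 1.0: it fits the noise too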
Discriminant functions: non-linear
[Figure: a highly wiggly non-linear decision boundary that separates every training point]
This makes zero mistakes with the training set.
Other important issues …
• This classifier is doing very well as far as the training data is concerned.
• But this does not guarantee that the classifier works well on a data element that is not in the training set (that is, on unseen data).
• This is overfitting the classifier to the training data.
• Maybe we are fitting the noise as well (there might be mistakes made while taking the measurements).
• The ability “to perform well with unseen test patterns too” is called the generalization ability of the classifier.
Generalization ability
• This has been discussed extensively.
• It is argued that the simpler one will have better generalization ability (e.g., Occam's razor: between two solutions, if everything else is the same, choose the simpler one).
• How do we quantify this?
• (Training error + a measure of complexity) should be taken into account while designing the classifier; a rough sketch of such an objective is given below.
• Support vector machines have been proved to have better generalization ability.
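One common way to write this trade-off (a generic regularized-risk sketch, not a formula taken from the slides): choose the classifier $f$ that minimizes
\[
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)}_{\text{training error}}
\;+\;
\lambda \, \underbrace{\Omega(f)}_{\text{measure of complexity}},
\]
where $\lambda > 0$ controls how heavily complexity is penalized. For the linear SVM, the complexity term is $\tfrac{1}{2}\lVert w \rVert^2$, and keeping $\lVert w \rVert$ small is the same as keeping the margin large.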
Discriminant functions …
[Figure: a smooth, nearly linear decision boundary that misclassifies a few training points]
This has some training error, but it is a relatively simple one.
Overfitting and underfitting
[Figure: three fits to the same data, labelled underfitting, good fit, and overfitting]
Soft SVM
• Allow for some mistakes with the training set!
• But this is to achieve a better margin; the standard soft-margin formulation is sketched below.
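The standard soft-margin formulation (the usual textbook form; the slides do not write it out): slack variables $\xi_i$ measure the mistakes, and $C$ trades them off against the margin:
\[
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \, (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
where a small $C$ tolerates more training mistakes in exchange for a larger margin, and a large $C$ approaches the hard-margin SVM.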