Introduction to Support Vector Machines
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition:
a 1.1% test error rate for SVM, the same as the error rate of a carefully
constructed neural network, LeNet 4
SVM is now regarded as an important example of “kernel methods”, one of the
key areas in machine learning
Note: the meaning of “kernel” here is different from the “kernel” function
used in Parzen windows
Linear Classifiers
A linear classifier maps an input x to an estimated label yest:
f(x,w,b) = sign(w·x + b)
w: weight vector
x: data vector
(Figure: a 2-D data set with points labeled +1 and −1.)
How would you classify this data?
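As a quick illustration of this decision rule (not part of the original slides), a minimal Python sketch; the weight vector and bias values are made up for the example:

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear classifier: return +1 or -1 according to sign(w·x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative (made-up) weight vector and bias.
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classify(np.array([1.0, 1.0]), w, b))   # 2 - 1 + 0.5 = 1.5  -> +1
print(linear_classify(np.array([-1.0, 2.0]), w, b))  # -2 - 2 + 0.5 = -3.5 -> -1
```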
(The next three slides repeat the same question for the same data set, each showing a different candidate separating line.)
Linear Classifiers
f(x,w,b) = sign(w·x + b)
(Figure: several candidate separating lines drawn through the labeled data.)
Any of these would be fine...
...but which is best?
Classifier Margin
f(x,w,b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary
could be increased by before hitting a datapoint.
(Figure: the margin drawn as a band around the separating line.)
Maximum Margin
f(x,w,b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the
maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Maximum Margin
f(x,w,b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the
maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Why Maximum Margin?
f(x,w,b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the
maximum margin, and the support vectors are the datapoints that the margin
pushes up against.
Intuitively, the widest margin is the safest choice: if the location of the
boundary is slightly off, a wide margin gives the least chance of causing a
misclassification.
How to calculate the distance from a point to a line?
(Figure: a point x and the line wx + b = 0, with normal vector w.)
x – data vector
w – normal vector of the line
b – scale (bias) value
In our 2-D case the line is w1*x1 + w2*x2 + b = 0, so w = (w1, w2) and
x = (x1, x2)
Estimate the Margin
(Figure: a point x at distance d(x) from the line wx + b = 0.)
What is the distance from a point x to the line wx + b = 0?
d(x) = |w·x + b| / ||w|| = |w·x + b| / sqrt(w1² + … + wd²)
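To make the distance formula concrete, a minimal NumPy sketch (not part of the original slides; the example line and point are made up):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w·x + b = 0: |w·x + b| / ||w||."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Example: the line x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
w = np.array([1.0, 1.0])
b = -1.0
x = np.array([2.0, 2.0])
print(distance_to_hyperplane(x, w, b))  # |2 + 2 - 1| / sqrt(2) ≈ 2.121
```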
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes
as possible
We should maximize the margin, m
The distance between the origin and the line wᵀx = −b is |b| / ||w||
(Figure: Class 1 and Class 2 separated by the decision boundary, with margin m.)
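For reference, the standard expression for the margin (not spelled out in the extracted slide text): with the two margin hyperplanes taken as w·x + b = +1 and w·x + b = −1, which is the support-vector condition used on the next slide, the distance between them is

```latex
m \;=\; \frac{(+1) - (-1)}{\lVert\mathbf{w}\rVert} \;=\; \frac{2}{\lVert\mathbf{w}\rVert},
```

so maximizing the margin m is the same as minimizing ||w||.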
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {+1, −1} be the class label of xi
The decision boundary should classify all points correctly
To see this: when yi = −1 we require w·xi + b ≤ −1, and when yi = +1 we
require w·xi + b ≥ +1; both cases can be written as yi(w·xi + b) ≥ 1. For
support vectors, yi(w·xi + b) = 1.
The decision boundary can be found by solving the following constrained
optimization problem
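The problem itself is not reproduced in the extracted slide text; the standard hard-margin primal that these constraints lead to is:

```latex
\min_{\mathbf{w},\,b}\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, n.
```

Minimizing ||w||² under these constraints is exactly maximizing the margin m = 2/||w||.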
Extension to Non-linear Decision Boundary
So far, we have only considered a large-margin classifier with a linear
decision boundary
How to generalize it to become nonlinear?
Key idea: transform xi to a higher-dimensional space to “make life easier”
Input space: the space where the points xi are located
Feature space: the space of φ(xi) after transformation (a small example is
sketched below)
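A minimal illustration of the idea, not taken from the original slides: points on the real line labeled +1 when |x| > 1 and −1 otherwise cannot be separated by any single threshold on x, but they become linearly separable after the map

```latex
\varphi(x) = (x,\; x^{2}), \qquad
\mathbf{w} = (0,\, 1),\; b = -1
\;\Longrightarrow\;
\mathbf{w}\cdot\varphi(x) + b = x^{2} - 1,
```

which is positive exactly for the +1 points and negative for the −1 points.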
Transforming the Data (c.f. DHS Ch. 5)
(Figure: points in the input space mapped by φ(·) into the feature space.)
Note: the feature space is of higher dimension than the input space in
practice
Computation in the feature space can be costly because it is high
dimensional
The feature space can even be infinite-dimensional (e.g., for the RBF kernel)
The kernel trick comes to the rescue
The Kernel Trick
Recall the SVM optimization problem (its standard dual form is given below)
The data points only appear as inner products
As long as we can calculate the inner product in the feature space, we do
not need the mapping explicitly
Many common geometric operations (angles, distances) can be expressed by
inner products
Define the kernel function K by K(xi, xj) = φ(xi)·φ(xj)
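The optimization problem referred to above is not reproduced in the extracted text; the standard dual of the hard-margin SVM, in which the data enter only through the inner products xi·xj, is:

```latex
\max_{\boldsymbol{\alpha}}\;
\sum_{i=1}^{n} \alpha_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
\alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{subject to} \quad
\alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0.
```

Replacing xi·xj by K(xi, xj) = φ(xi)·φ(xj) trains the same classifier in the feature space without ever computing φ.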
An Example for φ(·) and K(·,·)
Suppose φ(·) is given explicitly (a standard example is sketched below)
An inner product in the feature space can then be written as a simple
function of the inner product in the input space
So, if we define the kernel function as that simple function, there is no
need to carry out φ(·) explicitly
This use of the kernel function to avoid carrying out φ(·) explicitly is
known as the kernel trick
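The particular φ and K on the original slide are not reproduced here; a standard example that makes the point, for 2-D inputs x = (x1, x2), is:

```latex
\varphi(\mathbf{x}) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^{2},\ x_2^{2},\ \sqrt{2}\,x_1 x_2\right),
\qquad
\varphi(\mathbf{x})\cdot\varphi(\mathbf{y}) = (1 + \mathbf{x}\cdot\mathbf{y})^{2} \equiv K(\mathbf{x},\mathbf{y}),
```

so K can be evaluated with one 2-D inner product instead of an explicit 6-D mapping.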
Examples of Kernel Functions
Polynomial kernel with degree d
Radial basis function (RBF) kernel with a width parameter
Closely related to radial basis function neural networks
The corresponding feature space is infinite-dimensional
Sigmoid kernel with two parameters
It does not satisfy the Mercer condition for all parameter values
(Standard forms of these kernels are given below.)
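The kernel formulas themselves are not in the extracted text; the standard forms, with the conventional parameter names σ for the RBF width and κ, θ for the sigmoid, are:

```latex
K(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y} + 1)^{d}
\quad\text{(polynomial, degree } d\text{)}
```
```latex
K(\mathbf{x},\mathbf{y}) = \exp\!\left(-\frac{\lVert\mathbf{x}-\mathbf{y}\rVert^{2}}{2\sigma^{2}}\right)
\quad\text{(RBF, width } \sigma\text{)}
```
```latex
K(\mathbf{x},\mathbf{y}) = \tanh(\kappa\,\mathbf{x}\cdot\mathbf{y} + \theta)
\quad\text{(sigmoid)}
```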
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
Summary: Steps for Classification
Prepare the pattern matrix
Select the kernel function to use
Select the parameters of the kernel function and the value of C (the
soft-margin penalty parameter)
You can use the values suggested by the SVM software, or you can set apart
a validation set to determine the values of the parameters
Execute the training algorithm and obtain the αi
Unseen data can be classified using the αi and the support vectors
(A code sketch of these steps is given below.)
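As a concrete walk-through of these steps, a minimal sketch using scikit-learn, which is one implementation among many and not necessarily the software the slides assume; the toy data and parameter values are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Step 1: prepare the pattern matrix X (one row per example) and labels y in {+1, -1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # made-up, non-linearly separable data

# Set apart a validation set to help choose the kernel parameter and C.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2-3: select the kernel function, its parameter (gamma here), and the value of C.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)

# Step 4: execute the training algorithm; clf.dual_coef_ holds y_i * alpha_i for the support vectors.
clf.fit(X_train, y_train)

# Step 5: unseen data are classified using the alphas and the support vectors.
print("validation accuracy:", clf.score(X_val, y_val))
print("support vectors per class:", clf.n_support_)
```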
Conclusion
SVM is a useful alternative to neural networks
Two key concepts of SVM: maximizing the margin and the kernel trick
Many SVM implementations are available on the web for
you to try on your data set!