Foundations of Machine Learning
DSA 5102X • Lecture 2
Soufiane Hayou
Department of Mathematics
Consultation, Homework, Project
Consultation
We have three TAs for this class
• Wang Shida
• Jiang Haotian
• Wang Weixi
Each has set up three 15-minute consultation slots per week.
Please use LumiNUS (consultation tab) to sign up.
I will also address common questions during the lectures.
Homework
Regression problem on the UCI Concrete Compressive
Strength dataset
• Instructions are in DSA5102X-homework1.ipynb
• Due: 4th Sept 2021 (2 weeks)
• Submission: LumiNUS submission folder under Files
• Late submission policy
• To ensure fairness, 20% of the total homework grade is deducted for each day late, down to 0
• Example: actual grade 8/10, 1 day late, obtained grade: 6/10
Project
Instructions are found in the Project Instructions folder on LumiNUS
Use this homework as a starting point for the project
Due date: End of reading week before exam week
Last time
From linear models to linear basis models via feature maps
From this, we can derive the least squares formula etc.
Today, we will focus on the role of feature maps and their
relationship with kernels
Interpreting Feature Maps
What do feature maps really do?
[Figure: a feature map with components $\phi_1, \phi_2, \phi_3$ transforming the input data into feature space]
Another view of feature maps
One can also view feature maps as implicitly defining some sort
of similarity measure
Consider two vectors $u$ and $v$. Then, the dot product $u^\top v$ measures how similar they are.
[Figure: increasing $u^\top v$ corresponds to increasing similarity]
A feature map defines a similarity between two samples $x$ and $x'$ by computing the dot product in feature space:
$\phi(x)^\top \phi(x')$
[Figure: increasing $\phi(x)^\top \phi(x')$ corresponds to increasing similarity in feature space]
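As a small illustrative sketch (not from the slides): for the quadratic feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ on $\mathbb{R}^2$, the feature-space dot product equals the squared input dot product, so the "similarity" can be computed without ever forming $\phi$ explicitly.

```python
import numpy as np

def phi(x):
    """Quadratic feature map on R^2: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])

feature_similarity = phi(x) @ phi(x_prime)   # dot product in feature space
input_similarity = (x @ x_prime) ** 2        # (x . x')^2 computed in input space

print(feature_similarity, input_similarity)  # both equal 1.0
```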
Least Squares Revisited
Let us revisit the linear basis hypothesis space
$$f(x) = w^\top \phi(x), \qquad \phi : \mathcal{X} \to \mathbb{R}^M, \quad w \in \mathbb{R}^M$$
The (regularized) least squares problem is
$$\min_{w} \ \sum_{i=1}^{N} \big(y_i - w^\top \phi(x_i)\big)^2 + \lambda \|w\|^2$$
Recall: $\Phi \in \mathbb{R}^{N \times M}$ denotes the feature (design) matrix with rows $\phi(x_i)^\top$.
This is known as ridge regression.
Solution of ridge regression:
$$\hat{w} = \big(\Phi^\top \Phi + \lambda I\big)^{-1} \Phi^\top y$$
Making new predictions: $\hat{f}(x) = \hat{w}^\top \phi(x)$
Two observations:
• The dataset is memorized by $\hat{w}$ and is not needed for new predictions
• For each new prediction, we have $O(M)$ operations
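A minimal NumPy sketch of this closed-form solution (the names Phi and lam and the toy polynomial features are illustrative assumptions, not from the slides):

```python
import numpy as np

def ridge_fit(Phi, y, lam=1.0):
    """Closed-form ridge regression: w_hat = (Phi^T Phi + lam*I)^(-1) Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def ridge_predict(w_hat, phi_new):
    """Prediction for a new feature vector phi(x): O(M) operations."""
    return phi_new @ w_hat

# Toy example with a polynomial feature map phi(x) = (1, x, x^2)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.standard_normal(50)
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)

w_hat = ridge_fit(Phi, y, lam=0.1)
print(ridge_predict(w_hat, np.array([1.0, 0.5, 0.25])))  # prediction at x = 0.5
```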
Reformulation of Ridge Regression
Let us now write the ridge regression solution another way. Using the identity $(\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1}$, new predictions become
$$\hat{f}(x) = \phi(x)^\top \Phi^\top \big(K + \lambda I\big)^{-1} y = \sum_{i=1}^{N} \alpha_i\, \phi(x)^\top \phi(x_i), \qquad \alpha = (K + \lambda I)^{-1} y,$$
where $K = \Phi \Phi^\top$ is the Gram matrix with $K_{ij} = \phi(x_i)^\top \phi(x_j)$.
Two observations:
• The input data participates in the predictions only through the Gram matrix $K$
• For each new prediction, we have $O(N)$ operations
Original solution: $\hat{f}(x) = \phi(x)^\top (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$
Reformulated solution: $\hat{f}(x) = \phi(x)^\top \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} y$
What did we gain with this reformulation?
• They are exactly the same function, but…
• The left side requires inverting an $M \times M$ matrix ($O(M^3)$ operations), while the right side requires inverting an $N \times N$ matrix ($O(N^3)$ operations)
• Most importantly, the right side only depends on the feature map through the dot product $\phi(x)^\top \phi(x')$
Reformulated Solution
We only need the "similarity" measure $k(x, x') = \phi(x)^\top \phi(x')$.
So why not just specify $k$ directly and forget about $\phi$!
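A minimal sketch of the reformulated solution, computed purely from a similarity function $k$ (the Gaussian similarity used here is just one possible choice; names like alpha and lam are illustrative):

```python
import numpy as np

def kernel_ridge_fit(X, y, k, lam=1.0):
    """Solve (K + lam*I) alpha = y, where K_ij = k(x_i, x_j)."""
    N = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(N), y)

def kernel_ridge_predict(X_train, alpha, k, x_new):
    """Prediction f(x) = sum_i alpha_i * k(x_i, x): only similarities are needed."""
    return sum(a * k(xi, x_new) for a, xi in zip(alpha, X_train))

# Example with a Gaussian similarity k(x, x') = exp(-||x - x'||^2)
gauss = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = np.sin(3 * X[:, 0])
alpha = kernel_ridge_fit(X, y, gauss, lam=0.1)
print(kernel_ridge_predict(X, alpha, gauss, np.array([0.3])))
```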
Kernel Ridge Regression
Kernel: $k(x, x') = \phi(x)^\top \phi(x')$
Can we choose any $k$ we want?
Observe that $k$ cannot be arbitrary, since we need
• $k(x, x') = k(x', x)$ (Symmetry)
• $k(x, x) = \|\phi(x)\|^2 \geq 0$ (Non-negativity)
• $\sum_{i,j} c_i c_j\, k(x_i, x_j) = \big\|\sum_i c_i\, \phi(x_i)\big\|^2 \geq 0$ (Positive Semi-definiteness)
for all $x, x'$, all finite sets of samples $x_1, \dots, x_N$, and all coefficients $c_1, \dots, c_N$.
If $k$ satisfies these conditions, it is called Symmetric Positive
Definite (SPD). Are these conditions all we need?
Symmetric Positive Definite Kernels
For a kernel $k$ to represent a valid feature map, we define the
notion of Symmetric Positive Definite (SPD) kernels. These
satisfy
1. Symmetry: $k(x, x') = k(x', x)$ for all $x, x'$
2. Positive Semi-definiteness: For any $N$ and any $x_1, \dots, x_N$, the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$
is positive semi-definite
(Recall: a matrix $A$ is positive semi-definite if $v^\top A v \geq 0$ for any vector $v$)
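A quick numerical sanity check of these two conditions (illustrative, not from the slides): build the Gram matrix of a Gaussian kernel on random points and verify symmetry and non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))   # 30 random points in R^2

# Gram matrix of the kernel k(x, x') = exp(-||x - x'||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)

print(np.allclose(K, K.T))                    # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # all eigenvalues >= 0 (up to round-off)
```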
Mercer’s Theorem (1909)
Suppose $k$ is an SPD kernel. Then, there exists a feature space $\mathcal{H}$ and a
feature map $\phi$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
In fact,
$$k(x, x') = \sum_{j} \lambda_j\, e_j(x)\, e_j(x'), \qquad \text{so we may take } \phi_j(x) = \sqrt{\lambda_j}\, e_j(x),$$
where $\lambda_j, e_j$ are the eigenvalues/eigenfunctions of the linear integral operator $f \mapsto \int k(\cdot, x') f(x')\, dx'$.
Hence: Feature map $\Rightarrow$ SPD kernel,
and (by Mercer) SPD kernel $\Rightarrow$ Feature map.
Examples of SPD kernels
• Linear kernel: $k(x, x') = x^\top x'$
• Polynomial kernel: $k(x, x') = (x^\top x' + c)^p$
• Gaussian (Radial Basis Function, RBF) kernel: $k(x, x') = \exp\!\big(-\gamma \|x - x'\|^2\big)$
• Many more…
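Minimal NumPy implementations of these kernels (the parameter names c, p, and gamma are illustrative conventions):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, c=1.0, p=3):
    return (x @ xp + c) ** p

def rbf_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))
```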
Flexibility of using kernels
For example, consider the RBF kernel (with $\gamma = 1$) in 1 input dimension, $k(x, x') = e^{-(x - x')^2}$.
Taylor expanding $e^{2 x x'}$ gives
$$k(x, x') = e^{-x^2} e^{-x'^2} \sum_{j=0}^{\infty} \frac{(2 x x')^j}{j!} = \sum_{j=0}^{\infty} \phi_j(x)\, \phi_j(x'),$$
where $\phi_j(x) = \sqrt{\tfrac{2^j}{j!}}\; x^j e^{-x^2}$.
The feature space is infinite-dimensional!
Kernel ridge regression with different types of kernels
Key Idea of Kernel Methods:
1. Express the solution in terms of similarity (dot products)
2. Replace the dot products with kernels
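As a hedged illustration of this two-step recipe, one could compare kernel ridge regression fits under different kernels with scikit-learn's KernelRidge (assuming scikit-learn is available; the data and hyperparameters below are made up):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

# Same kernel ridge regression machinery, different choices of kernel
for kernel, kwargs in [("linear", {}),
                       ("polynomial", {"degree": 3}),
                       ("rbf", {"gamma": 0.5})]:
    model = KernelRidge(alpha=0.1, kernel=kernel, **kwargs)
    model.fit(X, y)
    print(kernel, model.score(X, y))  # in-sample R^2, just for illustration
```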
Support Vector Machines
Linear (Affine) Functions and Hyperplanes
Linear (affine) functions: $f(x) = w^\top x + b$
What are hyperplanes?
In two dimensions: a line
In general: hyperplanes are the solution sets of a linear equation, $\{x : w^\top x + b = 0\}$
Classification using linear functions
Binary classification: data $\{(x_i, y_i)\}_{i=1}^{N}$ with labels $y_i \in \{-1, +1\}$
Linear decision function: predict $\mathrm{sign}(w^\top x + b)$
Linear separability assumption:
There exists a linear decision function such that
$y_i (w^\top x_i + b) > 0$ for all $i = 1, \dots, N$
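A minimal check of the linear separability assumption for a candidate decision function (the data and the candidate $(w, b)$ below are made up for illustration):

```python
import numpy as np

def is_linearly_separated(X, y, w, b):
    """True if y_i * (w^T x_i + b) > 0 for all i, i.e. (w, b) separates the data."""
    return bool(np.all(y * (X @ w + b) > 0))

X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(is_linearly_separated(X, y, w=np.array([1.0, 1.0]), b=0.0))  # True
```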
Margin
There can be many possible decision functions!
[Figure: several separating decision boundaries with their margins annotated]
Maximum Margin Solution
Mathematically, the margin of a decision function $f(x) = w^\top x + b$ is
$$\mathrm{margin}(w, b) = \min_{i=1,\dots,N} \frac{|w^\top x_i + b|}{\|w\|}$$
The goal of support vector machines (SVM) is to find the
maximum margin solution.
Why?
$$\max_{w,\,b}\ \frac{1}{\|w\|} \min_{i=1,\dots,N} |w^\top x_i + b| \quad \text{subject to} \quad y_i (w^\top x_i + b) > 0 \ \ \forall i$$
Since rescaling $(w, b)$ does not change the decision boundary, we may normalize so that $\min_i |w^\top x_i + b| = 1$; maximizing $1/\|w\|$ then amounts to minimizing $\tfrac{1}{2}\|w\|^2$.
Reformulated as a constrained convex optimization problem:
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \geq 1 \ \ \forall i$$
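As an illustration (not part of the slides), this constrained problem can be handed to a generic solver; here is a sketch using scipy.optimize.minimize with SLSQP on made-up linearly separable data:

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

def objective(wb):
    w = wb[:-1]
    return 0.5 * w @ w                      # (1/2) ||w||^2

constraints = [{"type": "ineq",             # SLSQP convention: fun(z) >= 0
                "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:-1] + wb[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),  # feasible starting point
               method="SLSQP", constraints=constraints)
w_hat, b_hat = res.x[:-1], res.x[-1]
print("w:", w_hat, "b:", b_hat, "margin:", 1.0 / np.linalg.norm(w_hat))
```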
The Method of Lagrange Multipliers
Minimizers of an unconstrained function $F(z)$ can be found by solving $\nabla F(z) = 0$. What if there are
constraints?
First example: minimize $F(z)$ subject to the linear constraint $z^\top a = 0$.
[Figure: minimizing $F(z)$ on the constraint set $z^\top a = 0$; at the constrained minimizer $\hat{z}$, the gradient $\nabla F(\hat{z})$ is parallel to the normal $a$]
What about general equality constraints $G(z) = 0$?
The gradients $\nabla F$ and $\nabla G$ must be parallel at a local optimum point, i.e.
$\nabla F(\hat{z}) + \lambda \nabla G(\hat{z}) = 0$ for some $\lambda$.
The gradient $\nabla G(\hat{z})$ plays the role of $a$, so we must have $\nabla F(\hat{z}) \parallel \nabla G(\hat{z})$ and $G(\hat{z}) = 0$.
[Figure: level sets of $F(z)$ and the constraint curve $G(z) = 0$, with $\nabla F(\hat{z})$ and $\nabla G(\hat{z})$ parallel at the optimum $\hat{z}$]
What about general inequality constraints $G(z) \leq 0$?
Two cases at a local optimum $\hat{z}$:
• Constraint inactive ($G(\hat{z}) < 0$): the constraint can be ignored locally, so $\nabla F(\hat{z}) = 0$
• Constraint active ($G(\hat{z}) = 0$): the optimum lies on the boundary, and $\nabla F(\hat{z}) = -\mu\, \nabla G(\hat{z})$ for some $\mu \geq 0$
[Figure: two panels, "Constraint Inactive" and "Constraint Active", showing $F(z)$, the feasible region $G(z) \leq 0$, and the gradients $\nabla F(\hat{z})$, $\nabla G(\hat{z})$]
Define the Lagrangian $L(z, \mu) = F(z) + \mu\, G(z)$.
Then these two cases can be combined into the following
conditions:
$$\nabla_z L(z, \mu) = 0, \qquad G(z) \leq 0, \qquad \mu \geq 0, \qquad \mu\, G(z) = 0$$
The variable $\mu$ is called a Lagrange multiplier.
The most general case has only inequality constraints (why no
equality constraints? An equality $G(z) = 0$ can be written as the two inequalities $G(z) \leq 0$ and $-G(z) \leq 0$).
Karush-Kuhn-Tucker (KKT) Conditions
Define the Lagrangian $L(z, \mu) = F(z) + \sum_{j=1}^{m} \mu_j\, G_j(z)$ for the problem $\min_z F(z)$ subject to $G_j(z) \leq 0$, $j = 1, \dots, m$.
Then, under technical conditions, for each locally optimal $\hat{z}$, there
exist Lagrange multipliers $\mu_1, \dots, \mu_m$ such that
1. Stationarity: $\nabla_z L(\hat{z}, \mu) = 0$
2. Primal Feasibility: $G_j(\hat{z}) \leq 0$ for all $j$
3. Dual Feasibility: $\mu_j \geq 0$ for all $j$
4. Complementary Slackness: $\mu_j\, G_j(\hat{z}) = 0$ for all $j$
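A tiny worked example (not from the slides) of the KKT conditions: minimize $F(z) = z^2$ subject to $G(z) = 1 - z \leq 0$.

```latex
\begin{align*}
L(z, \mu) &= z^2 + \mu (1 - z), \qquad \mu \ge 0 \\
\text{Stationarity:}\quad & 2z - \mu = 0 \\
\text{Complementary slackness:}\quad & \mu (1 - z) = 0 \\
\text{Case } \mu = 0:\quad & z = 0 \text{, which violates } G(z) = 1 - z \le 0 \\
\text{Case } 1 - z = 0:\quad & \hat{z} = 1,\ \mu = 2 \ge 0
  \quad\Rightarrow\quad \text{constrained minimum at } \hat{z} = 1
\end{align*}
```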
Dual Problem
Moreover, under some additional conditions, we can find the
multipliers via the dual problem
$$\max_{\mu \geq 0}\ \min_{z}\ L(z, \mu)$$
(Analogy: the primal variables $z$ are the "choice", the dual variables $\mu$ are the "price" of violating the constraints.)
Back to the SVM…
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \geq 1 \ \ \forall i$$
We apply the KKT conditions with $F(w, b) = \tfrac{1}{2}\|w\|^2$ and $G_i(w, b) = 1 - y_i(w^\top x_i + b) \leq 0$.
We obtain the following
1. From Stationarity: $w = \sum_{i=1}^{N} \mu_i y_i x_i$ and $\sum_{i=1}^{N} \mu_i y_i = 0$
2. From Dual Feasibility: $\mu_i \geq 0$
3. From Complementary Slackness: $\mu_i \big(1 - y_i(w^\top x_i + b)\big) = 0$
4. The multipliers $\mu_i$ can be found by solving the dual problem
Dual Formulation of SVM
Dual problem:
$$\max_{\mu}\ \sum_{i=1}^{N} \mu_i - \frac{1}{2} \sum_{i,j=1}^{N} \mu_i \mu_j\, y_i y_j\, x_i^\top x_j \quad \text{subject to} \quad \mu_i \geq 0, \quad \sum_{i=1}^{N} \mu_i y_i = 0$$
Decision function: $f(x) = \mathrm{sign}\Big(\sum_{i=1}^{N} \mu_i y_i\, x_i^\top x + b\Big)$
Complementary slackness: $\mu_i > 0$ only if $y_i(w^\top x_i + b) = 1$, i.e. only for points on the margin
Crucial Observations
From the dual formulation, we observe the following
1. Only the vectors closest to the decision boundary matter for predictions. These are called support vectors.
2. The dual formulation of the problem depends on the inputs only through the dot product $x_i^\top x_j$.
[Figure: maximum margin classifier with the support vectors on the margin highlighted]
Kernel Support Vector Machines
Decision function: $f(x) = \mathrm{sign}\Big(\sum_{i=1}^{N} \mu_i y_i\, k(x_i, x) + b\Big)$
As before, only the support vectors, i.e. points satisfying $\mu_i > 0$,
matter for predictions. This is a sparse kernel method.
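A hedged sketch with scikit-learn's SVC (which solves a soft-margin variant; with a large C it approximates the hard-margin problem described here), showing that only a few support vectors carry the prediction. The toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(+2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", gamma=0.5, C=10.0)
clf.fit(X, y)

# Only the support vectors enter the decision function
print("support vectors:", clf.support_vectors_.shape[0], "out of", len(X))
print("prediction at the origin:", clf.predict([[0.0, 0.0]]))
```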
Summary
The essence of kernel methods
• Write solution only in terms of dot products (this usually
involves going to a “dual” formulation)
• Go nonlinear by using kernels to replace dot products
Support vector machines
• Maximum margin solution
• Example of sparse kernel method: only some points
(support vectors) are used for prediction