Learning With Prototypes
CS771: Introduction to Machine Learning
Nisheeth
[Figure: a test image is fed into the prediction model, which outputs a predicted label (cat vs dog)]
Some Types of Supervised Learning Problems
Consider building an ML module for an e-mail client
For denoting multiple vectors, we will use a subscript with each vector, e.g., N images denoted by N feature vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, or compactly as $\{\mathbf{x}_n\}_{n=1}^{N}$
The vector $\mathbf{x}_n$ denotes the $n$-th image
$x_{nd}$ (a scalar) denotes the $d$-th feature of the $n$-th image (assuming each image is represented by $D$ features)
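As a small illustration of this notation, here is a sketch with made-up numbers (the array names and values are hypothetical, used only for this example):

import numpy as np

# Hypothetical toy data: N = 3 images, each described by D = 4 features
X = np.array([[0.2, 1.5, 3.0, 0.7],   # x_1, the feature vector of image 1
              [1.1, 0.3, 2.2, 0.9],   # x_2
              [0.5, 0.8, 1.7, 2.4]])  # x_3

x_2 = X[1]       # the vector x_2 (2nd image)
x_23 = X[1, 2]   # the scalar x_23: 3rd feature of the 2nd image
print(x_2, x_23)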
Some Basic Operations on Vectors
Addition/subtraction of two vectors gives another vector of the same size
For a vector $\mathbf{a}$, its Euclidean norm is defined via its inner product with itself: $\|\mathbf{a}\|_2 = \sqrt{\mathbf{a}^\top \mathbf{a}} = \sqrt{\sum_{i=1}^{D} a_i^2}$
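A tiny sketch of these operations (the values are arbitrary; np.linalg.norm is shown only as a cross-check):

import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([0.5, -1.0, 3.0])

c = a + b                    # addition gives another vector of the same size
norm_a = np.sqrt(a @ a)      # Euclidean norm via the inner product of a with itself
print(c, norm_a, np.linalg.norm(a))  # the last two values agree (= 3.0)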
Computing Distances
Euclidean (L2 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$:
$d_2(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top (\mathbf{a} - \mathbf{b})} = \sqrt{\mathbf{a}^\top \mathbf{a} + \mathbf{b}^\top \mathbf{b} - 2\,\mathbf{a}^\top \mathbf{b}}$
(the third expression is the square root of the inner product of the difference vector; the last is an equivalent expression in terms of inner products of the individual vectors)
Weighted Euclidean distance between two vectors $\mathbf{a}$ and $\mathbf{b}$:
$d_w(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{D} w_i (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
where $\mathbf{W}$ is a $D \times D$ diagonal matrix with weights $w_1, \ldots, w_D$ on its diagonal. Weights may be known or even learned from data (in ML problems)
Absolute (L1 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$:
$d_1(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{i=1}^{D} |a_i - b_i|$
The L1 norm distance is also known as the Manhattan distance or taxicab norm (it's a very natural notion of distance between two points in some vector space)
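The three distances above can be sketched in a few lines of NumPy (a minimal illustration; the vectors and weights below are arbitrary example values):

import numpy as np

def euclidean(a, b):
    # L2 distance: sqrt of the inner product of the difference vector
    diff = a - b
    return np.sqrt(diff @ diff)

def weighted_euclidean(a, b, w):
    # weighted L2 distance; w holds the diagonal entries of the matrix W
    diff = a - b
    return np.sqrt(np.sum(w * diff**2))

def manhattan(a, b):
    # L1 (Manhattan / taxicab) distance
    return np.sum(np.abs(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 2.5, 1.0])
w = np.array([1.0, 0.5, 2.0])   # example feature weights (could instead be learned)
print(euclidean(a, b), weighted_euclidean(a, b, w), manhattan(a, b))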
Prelude: A Very Primitive Classifier
Consider a binary classification problem – cat vs dog
A simple idea: Predict the test image's label using its distance from each of the 2 training images (one cat image, one dog image)
If d(test image, cat image) < d(test image, dog image), predict cat, else predict dog (a small code sketch of this rule appears below)
The idea also applies to multi-class classification: use one image per class, and predict the label based on the distances of the test image from all such images
Wait. Is it ML? Seems to be like just a simple "rule". Where is the "learning" part in this?
Even this simple model can be learned, e.g., the feature extraction/selection part and/or the distance computation part
Some possibilities: Use a feature learning/selection algorithm to extract features, and use a Mahalanobis distance where you learn the W matrix (instead of using a predefined W), using "distance metric learning" techniques
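A minimal sketch of the one-example-per-class rule described above (the feature vectors and names here are hypothetical; any of the distances from earlier could be plugged in):

import numpy as np

def predict_nearest_example(test_x, class_examples):
    # predict the label whose single stored example is closest (Euclidean distance)
    distances = {label: np.linalg.norm(test_x - x) for label, x in class_examples.items()}
    return min(distances, key=distances.get)

# one stored feature vector per class -- works for any number of classes
class_examples = {"cat": np.array([0.9, 0.1, 0.4]),
                  "dog": np.array([0.2, 0.8, 0.5])}
print(predict_nearest_example(np.array([0.8, 0.2, 0.3]), class_examples))  # prints "cat"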
Improving Our Primitive Classifier
Just one input per class may not sufficiently capture variations in a class
Learning to predict categories
[Figure: several labeled training points marked "dog" and "cat", plus an unlabeled test point "??" whose category must be predicted]
Learning with Prototypes (LwP)
Basic idea: Represent each class by a “prototype” vector
Predict label of each test input based on its distances from the class prototypes
Predicted label will be the class whose prototype is closest to the test input
How we compute distances can have an effect on the accuracy of this model
(may need to try Euclidean, weighted Euclidean, Mahalanobis, or something else)
Pic from: https://www.reddit.com/r/dataisbeautiful/comments/3wgbv9/average_handwritten_digit_oc/
Learning with Prototypes (LwP): An Illustration
Suppose the task is binary classification (two classes assumed pos and neg)
$\boldsymbol{\mu}_{-} = \frac{1}{N_{-}} \sum_{n:\, y_n = -1} \mathbf{x}_n \qquad\qquad \boldsymbol{\mu}_{+} = \frac{1}{N_{+}} \sum_{n:\, y_n = +1} \mathbf{x}_n$
[Figure: positive and negative class training points with their prototypes $\boldsymbol{\mu}_{+}$ and $\boldsymbol{\mu}_{-}$, and a test example to be classified]
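A minimal LwP sketch following these definitions (NumPy, Euclidean distance; the toy data and names are illustrative only):

import numpy as np

def fit_prototypes(X, y):
    # one prototype (mean feature vector) per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_lwp(prototypes, x):
    # predict the class whose prototype is closest to x (Euclidean distance)
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# toy 2-D binary data: labels +1 (pos) and -1 (neg)
X = np.array([[1.0, 1.2], [1.3, 0.9], [0.8, 1.1],          # positive class
              [-1.0, -0.8], [-1.2, -1.1], [-0.9, -1.3]])   # negative class
y = np.array([+1, +1, +1, -1, -1, -1])

protos = fit_prototypes(X, y)                      # mu_plus and mu_minus
print(predict_lwp(protos, np.array([0.7, 0.6])))   # prints 1 (closer to mu_plus)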
LwP: The Prediction Rule, Mathematically
Let's expand the prediction rule expression a bit more (denote the test example by $\mathbf{x}$)
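For the Euclidean distance, the expansion (a standard derivation) goes as follows: LwP predicts $+1$ exactly when $\|\mathbf{x} - \boldsymbol{\mu}_{+}\|^2 < \|\mathbf{x} - \boldsymbol{\mu}_{-}\|^2$, and
$\|\mathbf{x} - \boldsymbol{\mu}_{-}\|^2 - \|\mathbf{x} - \boldsymbol{\mu}_{+}\|^2 = 2(\boldsymbol{\mu}_{+} - \boldsymbol{\mu}_{-})^\top \mathbf{x} + \|\boldsymbol{\mu}_{-}\|^2 - \|\boldsymbol{\mu}_{+}\|^2 = \mathbf{w}^\top \mathbf{x} + b$
with $\mathbf{w} = 2(\boldsymbol{\mu}_{+} - \boldsymbol{\mu}_{-})$ and $b = \|\boldsymbol{\mu}_{-}\|^2 - \|\boldsymbol{\mu}_{+}\|^2$, so the decision rule is linear in $\mathbf{x}$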
In general, if classes are not equisized and spherical, LwP with Euclidean
distance will usually not work well (but improvements possible; will discuss
later)
LwP: Some Key Aspects
Very simple, interpretable, and lightweight model
Just requires computing and storing the class prototype vectors
Works with any number of classes (thus for multi-class classification as well)
With a learned distance function, can work very well even with very few
examples from each class (used in some “few-shot learning” models nowadays
– if interested, please refer to “Prototypical Networks for Few-shot Learning”)
Learning with Prototypes (LwP)
$\boldsymbol{\mu}_{-} = \frac{1}{N_{-}} \sum_{n:\, y_n = -1} \mathbf{x}_n \qquad\qquad \boldsymbol{\mu}_{+} = \frac{1}{N_{+}} \sum_{n:\, y_n = +1} \mathbf{x}_n$
[Figure: the two-class data with prototypes $\boldsymbol{\mu}_{+}$, $\boldsymbol{\mu}_{-}$ and a vector $\mathbf{w}$]
W will be a 2x2 symmetric matrix in this case (chosen by us or learned). A good W will help bring points from the same class closer and move different classes apart
$d_w(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
Note: Mahalanobis distance also has the effect of rotating the axes, which helps
[Figure: the prototypes $\boldsymbol{\mu}_{-}$ and $\boldsymbol{\mu}_{+}$ before and after the transformation induced by W]
Improving LwP when classes are complex-shaped
Even with weighted Euclidean or Mahalanobis distance, LwP is still a linear classifier
Exercise: Prove the above fact. You may use the following hint
Mahalanobis distance can be written as $d_W(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
$\mathbf{W}$ is a symmetric (positive semi-definite) matrix and thus can be written as $\mathbf{W} = \mathbf{L}^\top \mathbf{L}$ for some matrix $\mathbf{L}$
Showing it for Mahalanobis is enough; weighted Euclidean is a special case with a diagonal $\mathbf{W}$
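A quick numerical sanity check of this fact, as a sketch (W here is an arbitrary symmetric PSD matrix built as L^T L, matching the hint above; the prototypes and test points are made up):

import numpy as np

rng = np.random.default_rng(0)
D = 2
L = rng.normal(size=(D, D))
W = L.T @ L                          # symmetric PSD matrix, W = L^T L
mu_pos = np.array([1.0, 0.5])
mu_neg = np.array([-0.8, -1.0])

def maha_sq(a, b):
    # squared Mahalanobis distance under W
    d = a - b
    return d @ W @ d

# equivalent linear rule: sign(w^T x + b)
w = 2 * W @ (mu_pos - mu_neg)
b = mu_neg @ W @ mu_neg - mu_pos @ W @ mu_pos

for _ in range(1000):
    x = rng.normal(size=D)
    lwp_pred = +1 if maha_sq(x, mu_pos) < maha_sq(x, mu_neg) else -1
    lin_pred = +1 if w @ x + b > 0 else -1
    assert lwp_pred == lin_pred      # the LwP rule and the linear rule always agree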