
Learning with Prototypes

CS771: Introduction to Machine Learning


Nisheeth
Supervised Learning
[Figure: Labeled training data (images tagged "cat" or "dog") is fed to a supervised learning algorithm, which produces a cat-vs-dog prediction model; a new test image given to this model yields a predicted label (cat/dog).]

Some Types of Supervised Learning Problems
 Consider building an ML module for an e-mail client

 Some tasks that we may want this module to perform


 Predicting whether an email is spam or normal: Binary Classification
 Predicting which of the many folders the email should be sent to: Multi-class Classification
 Predicting all the relevant tags for an email: Tagging or Multi-label Classification
 Predicting the spam-score of an email: Regression
 Predicting which email(s) should be shown at the top: Ranking
 Predicting which emails are work/study-related emails: One-class Classification

 These predictive modeling tasks can be formulated as supervised learning problems

 Today: A very simple supervised learning model for binary/multi-class classification


 This model doesn’t require any fancy maths – just computing means and distances
Some Notation and Conventions
 In ML, inputs are usually represented by vectors, e.g., [0.5, 0.3, 0.6, 0.1, 0.2, 0.5, 0.9, 0.2, 0.1, 0.5]

 A vector consists of an array of scalar values


 Geometrically, a vector is just a point in a vector space, e.g.,
 A length-2 vector, e.g., (0.5, 0.3), is a point in a 2-dim vector space
 A length-3 vector, e.g., (0.5, 0.3, 0.6), is a point in a 3-dim vector space
(Likewise for higher dimensions, even though harder to visualize)

 Unless specified otherwise


 Small letters in bold font will denote vectors, e.g., $\mathbf{a}$, $\mathbf{b}$, $\mathbf{x}$, etc.
 Small letters in normal font will denote scalars, e.g., $a$, $b$, $x$, etc.
 Capital letters in bold font will denote matrices (2-dim arrays), e.g., $\mathbf{A}$, $\mathbf{W}$, $\mathbf{X}$, etc.
Some Notation and Conventions
 A single vector will be assumed to be of the form $\mathbf{x} = [x_1, x_2, \ldots, x_D]$

 Unless specified otherwise, vectors will be assumed to be column vectors


 So we will assume $\mathbf{x}$ to be a column vector of size $D \times 1$
 Assuming each element to be a real-valued scalar, $x_d \in \mathbb{R}$, or $\mathbf{x} \in \mathbb{R}^D$ ($\mathbb{R}$: space of reals)

 If $\mathbf{x}$ is a feature vector representing, say, an image, then


 $D$ denotes the dimensionality of this feature vector (number of features)
 $x_d$ (a scalar) denotes the value of feature $d$ in the image

 For denoting multiple vectors, we will use a subscript with each vector, e.g., $\mathbf{x}_1, \mathbf{x}_2, \ldots$
 N images denoted by N feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, or compactly as $\{\mathbf{x}_n\}_{n=1}^{N}$
 The vector $\mathbf{x}_n$ denotes the $n$-th image
 $x_{nd}$ (a scalar) denotes the $d$-th feature (of $D$) of the $n$-th image
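A minimal NumPy sketch of these conventions (the feature values and data matrix below are made up, purely to illustrate the indexing):

import numpy as np

# N = 3 images, each represented by a D = 4 dimensional feature vector
# (the numbers are invented, purely for illustration)
X = np.array([[0.5, 0.3, 0.6, 0.1],
              [0.2, 0.5, 0.9, 0.2],
              [0.1, 0.5, 0.4, 0.7]])

N, D = X.shape      # number of inputs, dimensionality
x_2 = X[1]          # the 2nd feature vector (0-indexed in code)
x_23 = X[1, 2]      # scalar: the 3rd feature of the 2nd image
print(N, D, x_2, x_23)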
Some Basic Operations on Vectors
 Addition/subtraction of two vectors gives another vector of the same size

 The mean (average or centroid) of $N$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$

$\boldsymbol{\mu} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$   (of the same size as each $\mathbf{x}_n$)
 The inner/dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$

$\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top \mathbf{b} = \sum_{i=1}^{D} a_i b_i$   (a real-valued number denoting how "similar" $\mathbf{a}$ and $\mathbf{b}$ are, assuming both $\mathbf{a}$ and $\mathbf{b}$ have unit Euclidean norm)

 For a vector $\mathbf{a}$, its Euclidean norm is defined via its inner product with itself: $\|\mathbf{a}\|_2 = \sqrt{\langle \mathbf{a}, \mathbf{a} \rangle} = \sqrt{\sum_{i=1}^{D} a_i^2}$
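A quick NumPy sketch of these operations (the vectors are arbitrary example values):

import numpy as np

a = np.array([0.5, 0.3, 0.6])
b = np.array([0.1, 0.2, 0.5])
X = np.stack([a, b])             # a small collection of vectors, one per row

mean_vec = X.mean(axis=0)        # mean/centroid, same size as each vector
inner = a @ b                    # inner/dot product, a real number
norm_a = np.sqrt(a @ a)          # Euclidean norm via inner product with itself
print(mean_vec, inner, norm_a, np.linalg.norm(a))   # last two values match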
Computing Distances
 Euclidean (L2 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

$d_2(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top (\mathbf{a} - \mathbf{b})} = \sqrt{\mathbf{a}^\top \mathbf{a} + \mathbf{b}^\top \mathbf{b} - 2\,\mathbf{a}^\top \mathbf{b}}$
(square root of the inner product of the difference vector with itself; the last expression is in terms of inner products of the individual vectors)
 Weighted Euclidean distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

$d_w(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{D} w_i (a_i - b_i)^2} = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
($\mathbf{W}$ is a DxD diagonal matrix with the weights $w_i$ on its diagonal. Weights may be known or even learned from data in ML problems)
 Absolute (L1 norm) distance between two vectors $\mathbf{a}$ and $\mathbf{b}$

$d_1(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{i=1}^{D} |a_i - b_i|$
(the L1 norm distance is also known as the Manhattan distance or taxicab distance; it's a very natural notion of distance between two points in some vector space)

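A short NumPy sketch of the three distances above (the weight vector is an arbitrary choice, just for illustration):

import numpy as np

a = np.array([0.5, 0.3, 0.6])
b = np.array([0.1, 0.2, 0.5])
w = np.array([1.0, 2.0, 0.5])     # per-feature weights (assumed known here)
W = np.diag(w)                    # D x D diagonal weight matrix

d2 = np.sqrt(np.sum((a - b) ** 2))               # Euclidean (L2) distance
d2_alt = np.sqrt(a @ a + b @ b - 2 * (a @ b))    # same, via inner products
dw = np.sqrt((a - b) @ W @ (a - b))              # weighted Euclidean distance
d1 = np.sum(np.abs(a - b))                       # absolute (L1 / Manhattan) distance
print(d2, d2_alt, dw, d1)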

Our First Supervised Learner

Prelude: A Very Primitive Classifier
(The idea also applies to multi-class classification: use one image per class, and predict the label based on the distances of the test image from all such images)

 Consider a binary classification problem – cat vs dog

 Assume training data with just 2 images – one cat image and one dog image

 Given a new test image (cat/dog), how do we predict its label?

 A simple idea: Predict using its distance from each of the 2 training images

If d(test image, cat image) < d(test image, dog image), predict cat, else predict dog
Wait. Is it ML? It seems to be just a simple "rule". Where is the "learning" part in this?
Even this simple model can be learned, e.g., for the feature extraction/selection part and/or for the distance computation part.
Some possibilities: Use a feature learning/selection algorithm to extract features, and use a Mahalanobis distance where you learn the W matrix (instead of using a predefined W), using "distance metric learning" techniques.
Improving Our Primitive Classifier
 Just one input per class may not sufficiently capture variations in a class

 A natural improvement can be by using more inputs per class


[Figure: several training images per class, each labeled "cat" or "dog"]

 We will consider two approaches to do this


 Learning with Prototypes (LwP)
 Nearest Neighbors (NN)
 Both LwP and NN will use multiple inputs per class but in different ways

Learning to predict categories

[Figure: labeled "dog" and "cat" points in feature space, with an unlabeled test point marked "??" between them]

Learning with Prototypes (LwP)
 Basic idea: Represent each class by a “prototype” vector

 Class Prototype: The “mean” or “average” of inputs from that class

Averages (prototypes) of each of the handwritten digits 1-9

 Predict label of each test input based on its distances from the class prototypes
 Predicted label will be the class whose prototype is closest to the test input

 How we compute distances can have an effect on the accuracy of this model
(may need to try Euclidean, weighted Euclidean, Mahalanobis, or something else)
Pic from: https://www.reddit.com/r/dataisbeautiful/comments/3wgbv9/average_handwritten_digit_oc/
Learning with Prototypes (LwP): An Illustration
 Suppose the task is binary classification (two classes assumed pos and neg)

 Training data: $N$ labelled examples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $y_n \in \{-1, +1\}$


 Assume $N_+$ examples from the positive class and $N_-$ examples from the negative class
 Assume green is positive and red is negative

$\boldsymbol{\mu}_- = \frac{1}{N_-} \sum_{n: y_n = -1} \mathbf{x}_n$        $\boldsymbol{\mu}_+ = \frac{1}{N_+} \sum_{n: y_n = +1} \mathbf{x}_n$
[Figure: training points of the two classes with their prototypes $\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$, and a test example]

For LwP, the prototype vectors ($\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$ here) define the "model"
LwP straightforwardly generalizes to more than 2 classes as well (multi-class classification) – K prototypes for K classes
LwP: The Prediction Rule, Mathematically
 What does the prediction rule for LwP look like mathematically?

 Assume we are using Euclidean distances here

$\|\boldsymbol{\mu}_- - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_-\|^2 + \|\mathbf{x}\|^2 - 2\langle \boldsymbol{\mu}_-, \mathbf{x} \rangle$
$\|\boldsymbol{\mu}_+ - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_+\|^2 + \|\mathbf{x}\|^2 - 2\langle \boldsymbol{\mu}_+, \mathbf{x} \rangle$
[Figure: prototypes $\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$ with a test example $\mathbf{x}$]

 Prediction Rule: Predict label as +1 if $\|\boldsymbol{\mu}_+ - \mathbf{x}\| < \|\boldsymbol{\mu}_- - \mathbf{x}\|$, otherwise -1

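A minimal NumPy sketch of LwP for binary classification with Euclidean distance (the toy data and labels are invented for illustration):

import numpy as np

# Toy training data: N x D feature matrix and labels in {-1, +1} (made up)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([-1, -1, +1, +1])

# "Training" = computing the two class prototypes (means)
mu_pos = X[y == +1].mean(axis=0)
mu_neg = X[y == -1].mean(axis=0)

def predict(x):
    """Predict +1 if the test point is closer to the positive prototype."""
    return +1 if np.linalg.norm(mu_pos - x) < np.linalg.norm(mu_neg - x) else -1

print(predict(np.array([5.5, 8.5])), predict(np.array([1.2, 1.9])))   # +1, -1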
LwP: The Prediction Rule, Mathematically
 Let’s expand the prediction rule expression a bit more

 Thus LwP with Euclidean distance is equivalent to a linear model with


 Weight vector $\mathbf{w} = 2(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)$
 Bias term $b = \|\boldsymbol{\mu}_-\|^2 - \|\boldsymbol{\mu}_+\|^2$
(Will look at linear models more formally and in more detail later)

 Prediction rule therefore is: Predict +1 if $\mathbf{w}^\top \mathbf{x} + b > 0$, else predict -1


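Continuing the earlier sketch, one can check numerically that the distance rule and the equivalent linear-model form agree (the prototype values below are assumed, for illustration only):

import numpy as np

mu_pos = np.array([5.5, 8.5])     # assumed class prototypes (illustrative values)
mu_neg = np.array([1.25, 1.9])

w = 2 * (mu_pos - mu_neg)                       # weight vector
b = mu_neg @ mu_neg - mu_pos @ mu_pos           # bias term

x = np.array([3.0, 4.0])                        # a test input
dist_rule = np.linalg.norm(mu_pos - x) < np.linalg.norm(mu_neg - x)
linear_rule = (w @ x + b) > 0
print(dist_rule, linear_rule)                   # the two rules give the same answer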
LwP: Some Failure Cases
 Here is a case where LwP with Euclidean distance may not work well

(Can use feature scaling or use Mahalanobis distance to handle such cases; will discuss this in the next lecture)
[Figure: two elongated, non-spherical classes with prototypes $\boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_+$, and a test example that Euclidean LwP assigns to the wrong class]

 In general, if classes are not equisized and spherical, LwP with Euclidean
distance will usually not work well (but improvements possible; will discuss
later)
LwP: Some Key Aspects
 Very simple, interpretable, and lightweight model
 Just requires computing and storing the class prototype vectors

 Works with any number of classes (thus for multi-class classification as well)

 Can be generalized in various ways to improve it further, e.g.,


 Modeling each class by a probability distribution rather than just a prototype vector
 Using distances other than the standard Euclidean distance (e.g., Mahalanobis)

 With a learned distance function, can work very well even with very few
examples from each class (used in some “few-shot learning” models nowadays
– if interested, please refer to “Prototypical Networks for Few-shot Learning”)
Learning with Prototypes (LwP)
$\boldsymbol{\mu}_- = \frac{1}{N_-} \sum_{n: y_n = -1} \mathbf{x}_n$        $\boldsymbol{\mu}_+ = \frac{1}{N_+} \sum_{n: y_n = +1} \mathbf{x}_n$
[Figure: the two class prototypes with the vector $\mathbf{w}$ pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_+$, and the decision boundary as the perpendicular bisector of the line joining them]

 Prediction rule for LwP (for binary classification with Euclidean distance): predict +1 if $\mathbf{w}^\top \mathbf{x} + b > 0$, otherwise -1, where $\mathbf{w} = \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$ if Euclidean distance is used

 Decision boundary: the perpendicular bisector of the line joining the class prototype vectors

For LwP, the prototype vectors (or their difference) define the "model". $\boldsymbol{\mu}_+$ and $\boldsymbol{\mu}_-$ (or just $\mathbf{w}$ in the Euclidean distance case) are the model parameters.
Exercise: Show that for the binary classification case, the score can be written as $f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \langle \mathbf{x}_n, \mathbf{x} \rangle + b$

So the "score" of a test point is a weighted sum of its similarities with each of the N training inputs. Many supervised learning models have $f(\mathbf{x})$ in this form, as we will see when we discuss kernel methods later.

Note: Even though $f(\mathbf{x})$ can be expressed in this form, if N > D, this may be more expensive to compute (O(N) time) compared to $\mathbf{w}^\top \mathbf{x} + b$ (O(D) time). However, the form is still very useful, as we will see later.

Can throw away the training data after computing the prototypes and just keep the model parameters for test time in such "parametric" models.
Improving LwP when classes are complex-shaped
 Using weighted Euclidean or Mahalanobis distance can sometimes help
$d_w(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{D} w_i (a_i - b_i)^2}$
(Use a smaller $w_i$ for the horizontal-axis feature in this example)
[Figure: two elongated classes with prototypes $\boldsymbol{\mu}_+$ and $\boldsymbol{\mu}_-$, where down-weighting the horizontal feature makes the distances more sensible]

 Note: Mahalanobis distance also has the effect of rotating the axes, which helps

$d_w(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
($\mathbf{W}$ will be a 2x2 symmetric matrix in this case, chosen by us or learned. A good $\mathbf{W}$ will help bring points from the same class closer and move different classes apart)
[Figure: the two classes and their prototypes $\boldsymbol{\mu}_-$, $\boldsymbol{\mu}_+$ before and after the transformation induced by $\mathbf{W}$]
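A small NumPy sketch of the weighted Euclidean and Mahalanobis-style distances used above (the weights and the W matrix are assumed example values, not learned):

import numpy as np

def weighted_euclidean(a, b, w):
    """Weighted Euclidean distance with per-feature weights w (a 1-D array)."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

def mahalanobis_like(a, b, W):
    """Distance sqrt((a-b)^T W (a-b)) for a symmetric PSD matrix W."""
    d = a - b
    return np.sqrt(d @ W @ d)

a = np.array([2.0, 1.0])
b = np.array([0.0, 2.0])
w = np.array([0.1, 1.0])          # smaller weight on the first (horizontal) feature
W = np.diag(w)                    # weighted Euclidean is the diagonal special case
print(weighted_euclidean(a, b, w), mahalanobis_like(a, b, W))   # identical here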
Improving LwP when classes are complex-shaped
 Even with weighted Euclidean or Mahalanobis dist, LwP still a linear classifier

 Exercise: Prove the above fact. You may use the following hint
 Mahalanobis dist can be written as $d_W(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
 $\mathbf{W}$ is a symmetric matrix and thus can be written as $\mathbf{W} = \mathbf{L}^\top \mathbf{L}$ for some matrix $\mathbf{L}$
 Showing it for Mahalanobis is enough. Weighted Euclidean is a special case with diagonal $\mathbf{W}$

 LwP can be extended to learn nonlinear decision boundaries if we use nonlinear distances/similarities (more on this when we talk about kernels)

Note: Modeling each class not just by a mean but by a probability distribution can also help in learning nonlinear decision boundaries. More on this when we discuss probabilistic models for classification.
LwP as a subroutine in other ML models
 For data-clustering (unsupervised learning), K-means clustering is a popular algo

 K-means also computes means/centres/prototypes of groups of unlabeled points


 Harder than LwP since labels are unknown. But we can do the following
 Guess the label of each point, compute means using the guessed labels (will see K-means in detail later; a small sketch follows below)
 Refine labels using these means (assign each point to the current closest mean)
 Repeat until means don’t change anymore
 Many other models also use LwP as a subroutine
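A compact sketch of this loop in NumPy, with prototype computation and closest-prototype assignment as the inner steps (the toy 2-cluster data is made up; real K-means also needs care with empty clusters):

import numpy as np

rng = np.random.default_rng(0)
# Toy unlabeled data: two well-separated blobs in 2-D (made up for illustration)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])

K = 2
labels = rng.integers(0, K, size=len(X))           # guess the label of each point
for _ in range(10):                                # repeat until means don't change
    mus = np.stack([X[labels == k].mean(axis=0) for k in range(K)])   # LwP-style prototypes
    dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)   # distance to each prototype
    new_labels = dists.argmin(axis=1)              # assign each point to the closest mean
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
print(mus)                                         # the K cluster centres (prototypes)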
Next Lecture
 Nearest Neighbors
