Machine Learning Basics (I)
簡韶逸 Shao-Yi Chien
Department of Electrical Engineering
National Taiwan University
1
References and Slide Credits
• Slides from Deep Learning for Computer Vision, Prof. Yu-Chiang
Frank Wang, EE, National Taiwan University
• Slides from CE 5554 / ECE 4554: Computer Vision, Prof. J.-B.
Huang, Virginia Tech
• Slides from CSE 576 Computer Vision, Prof. Steve Seitz and Prof.
Rick Szeliski, U. Washington
• Slides from EECS 498-007/598-005 Deep Learning for Computer
Vision, Prof. Justin Johnson
• Slides from CS291A Introduction to Pattern Recognition, Artificial
Neural Networks, and Machine Learning, Prof. Yuan-Fang Wang, UCSB
• Duda et al., Pattern Classification
• Bishop, Pattern Recognition and Machine Learning
• Reference papers
2
Outline
• Overview of recognition/classification pipeline
• Overview of machine learning
• From probability to Bayes decision rule
• Nonparametric techniques: Parzen window and nearest neighbor
• Unsupervised learning and supervised learning
• Unsupervised learning
• Clustering: k-means
• Dimension reduction: PCA and LDA
• Training, testing, & validation
• Supervised learning
• Linear classification: support vector machine (SVM)
• Combining models: decision tree, boosting
• Examples
3
Image Classification
Input: image. Output: assign the image to one
of a fixed set of categories
cat
bird
deer
dog
truck
This image by Nikita is
licensed under CC-BY 2.0
4
Problem: Semantic Gap
What the computer sees: an image is just a big grid of numbers
in [0, 255], e.g., 800 x 600 x 3 (3 RGB channels)
5
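To make the semantic gap concrete, here is a minimal sketch (not from the slides; it assumes NumPy and uses a randomly generated image in place of a real photo) of the grid of numbers the computer sees:

```python
import numpy as np

# A hypothetical 800 x 600 RGB image: nothing but a 3-D array of integers in [0, 255].
img = np.random.randint(0, 256, size=(600, 800, 3), dtype=np.uint8)
print(img.shape)   # (600, 800, 3): height x width x 3 RGB channels
print(img[0, 0])   # the three channel values of the top-left pixel
```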
Challenges
• Viewpoint variation
• Intraclass variation
• Fine-grained categories
• Background clutter
• Illumination changes
• Deformation
• Occlusion
•…
6
Overview of Recognition Pipeline
7
AI, Machine Learning, and Deep Learning
[Kaggle]
8
Machine Learning:
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
Example training set
9
Machine Learning/
Pattern Recognition
• Uncorrelated Events
  • Supervised Learning: parameter density, decision boundary
  • Unsupervised Clustering: minimum distance, hierarchical clustering
• Correlated Events
  • Hidden Markov Models
10
From Probability to Bayes Decision Rule
• Example: Testing/Screening of COVID-19
• Distributions of positive/negative test results (e.g., PCR, antibody, etc.):
the further away from each other they are, the more accurate the COVID diagnosis
[Figure: score distributions for "negative for COVID" and "positive for COVID" around a decision
threshold; testing result (+/-) vs. ground truth (positive/negative)]
11
Bayesian Decision Theory
• Fundamental statistical approach to classification/detection tasks
• Take a 2-class classification/detection task as an example:
• Let’s see if a student would pass or fail the course of CV.
• Define a probabilistic variable ω to describe the case of pass or fail.
• That is, ω = ω1 for pass, and ω = ω2 for fail.
• Prior Probability
• The a priori or prior probability reflects our knowledge of
how likely a certain state of nature is before any observation.
• P(ω = ω1), or simply P(ω1), as the prior that the next student would pass CV.
• The priors must exhibit exclusivity and exhaustivity, i.e.,
$$\sum_{j=1}^{C} P(\omega_j) = 1$$
• Equal priors
• If we have equal numbers of students pass/fail CV, then the priors are equal;
in other words, the priors are uniform: $P(\omega_1) = P(\omega_2) = 0.5$
12
Prior Probability (cont’d)
• Decision rule based on priors only
• If the only available info is the prior,
and the cost of any type of incorrect classification is equal,
what would be a reasonable decision rule?
• Decide ω1 if
$$P(\omega_1) > P(\omega_2)$$
otherwise decide ω2.
• What's the incorrect classification rate (or error rate) Pe?
$$P_e = \min\{P(\omega_1), P(\omega_2)\}$$
13
Class-Conditional Probability Density
(or Likelihood)
• The probability density function (PDF) for input/observation x given a state of nature ω
is written as: $p(x|\omega_j)$
• Here are (hopefully) the hypothetical class-conditional densities
reflecting the time spent on CV by the students who eventually pass/fail this course.
[Figure: class-conditional densities $p(x|\omega_1)$ (pass) and $p(x|\omega_2)$ (fail), estimated from
training data (observed/collected in advance), e.g., via maximum likelihood estimation (MLE)]
14
Posterior Probability & Bayes Formula
• If we know the prior distribution and the class-conditional density,
can we come up with a better decision rule?
• Yes We Can!
• By calculating the posterior probability.
• Posterior probability $P(\omega|x)$:
• The probability of a certain state of nature ω given an observable x.
• Bayes formula:
$$P(\omega_j, x) = p(x|\omega_j)\,P(\omega_j) = P(\omega_j|x)\,p(x)$$
$$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$$
And, we have $\sum_{j=1}^{C} P(\omega_j|x) = 1$.
15
Decision Rule & Probability of Error
• For a given observable x (e.g., the time you can spend on CV),
the decision rule (to take CV or not) will now be based on:
Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$
$$\omega^* = \arg\max_i P(\omega_i|x) \qquad \text{Maximum A Posteriori (MAP)}$$
• What's the probability of error P(error) (or Pe)?
$$P_e = \min\{P(\omega_1|x), P(\omega_2|x)\} \quad \text{averaged over all } x$$
[Figure: posteriors $P(\omega_1|x)$ and $P(\omega_2|x)$ with the decision threshold T]
16
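As an illustration of Bayes formula and the MAP rule, here is a small sketch (not from the slides: the priors and the Gaussian class-conditionals for study time are made-up numbers, and SciPy is assumed):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: x = hours/week spent on CV; w1 = pass, w2 = fail.
priors = np.array([0.6, 0.4])                               # assumed P(w1), P(w2)
likelihoods = [norm(loc=8, scale=2), norm(loc=4, scale=2)]  # assumed p(x|w1), p(x|w2)

def posterior(x):
    """Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)."""
    joint = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()                   # normalize by the evidence p(x)

post = posterior(6.0)
decision = np.argmax(post) + 1                   # MAP rule: w* = argmax_i P(wi|x)
p_error = post.min()                             # Pe(x) = min_i P(wi|x) for two classes
print(f"P(w1|x)={post[0]:.3f}  P(w2|x)={post[1]:.3f}  ->  decide w{decision}, Pe(x)={p_error:.3f}")
```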
From Bayes Decision Rule to Detection Theory
• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), rejection (TN)
$$TP = \int_{T}^{\infty} p(x|\omega_1)\,P(\omega_1)\,dx \qquad FP = \int_{T}^{\infty} p(x|\omega_2)\,P(\omega_2)\,dx$$
[Figure: $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$ around the threshold T; at T*, the equal error rate (EER): FP = FN]
• Receiver Operating Characteristic (ROC)
• To assess the effectiveness of the designed features/classifiers
• False alarm (PFA or FP) vs. detection (Pd or TP) rates
17
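To see how the ROC arises, a short sketch (reusing the made-up Gaussians from the previous example; for simplicity it sweeps the threshold T and reports the conditional detection and false-alarm rates, i.e., without the prior weighting used in the integrals above):

```python
import numpy as np
from scipy.stats import norm

pos, neg = norm(loc=8, scale=2), norm(loc=4, scale=2)   # assumed p(x|w1), p(x|w2)
for T in np.linspace(0.0, 12.0, 7):                     # sweep the decision threshold
    pd = 1 - pos.cdf(T)     # detection rate   P(x > T | w1)
    pfa = 1 - neg.cdf(T)    # false-alarm rate P(x > T | w2)
    print(f"T={T:5.2f}  Pd={pd:.3f}  PFA={pfa:.3f}")    # (PFA, Pd) pairs trace the ROC curve
```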
Nonparametric Techniques:
Parzen Window
• Parzen-window approach to estimate densities: assume,
e.g., that the region $R_n$ is a d-dimensional hypercube:
$$V_n = h_n^d \qquad (h_n: \text{length of the edge of } R_n)$$
Let $\varphi(u)$ be the following hypercube window function:
$$\varphi(u) = \begin{cases} 1 & |u_j| \le \frac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
• $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the
hypercube of volume $V_n$ centered at x, and equal to zero
otherwise.
• The number of samples in this hypercube is:
$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$
Substituting $k_n$ into $p_n(x) = (k_n/n)/V_n$, we obtain:
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$$
• $p_n(x)$ estimates $p(x)$ as an average of window functions of x and
the samples $x_i$ ($i = 1, \dots, n$), with $p_n(x) \to p(x)$ as $n \to \infty$.
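The estimator above transcribes almost directly into code; a minimal sketch (hypercube window, NumPy assumed; the test point and bandwidth h are illustrative):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window density estimate p_n(x) with a hypercube window of edge h."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # phi(u) = 1 iff |u_j| <= 1/2 for all j = 1..d
    inside = np.all(np.abs((x - samples) / h) <= 0.5, axis=1)
    k_n = inside.sum()              # number of samples inside the hypercube at x
    V_n = h ** d                    # hypercube volume V_n = h^d
    return (k_n / n) / V_n

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 1))            # 1-D toy samples from N(0,1)
print(parzen_estimate(np.array([0.0]), data, h=0.5))   # approaches ~0.399 for large n
```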
Nonparametric Techniques:
Nearest Neighborhood
Memorize all data
and labels
Predict the label of
the most similar
training image
20
21
Nearest Neighbor Decision
Boundaries
[Figure: nearest-neighbor classification regions in the plane, axes $x_0$ and $x_1$]
• Nearest neighbors in two dimensions
• Points are training examples; colors give training labels
• Background colors give the category a test point would be assigned
• Decision boundary is the boundary between two classification regions
• Decision boundaries can be noisy; affected by outliers
• How to smooth out decision boundaries? Use more neighbors!
22
K-Nearest Neighbors (kNN)
• Instead of copying label from nearest neighbor,
take majority vote from K closest points
K=1 K=3
23
K-Nearest Neighbors (kNN)
• Make the decision boundary more smooth
• Reduce the effect of outliers
K=1 K=3
http://vision.stanford.edu/teaching/cs231n-demos/knn/
24
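The whole method fits in a few lines; here is a sketch of a brute-force k-NN classifier (NumPy assumed; the class name and toy data are illustrative):

```python
import numpy as np

class KNNClassifier:
    """Memorize the training set; predict by majority vote among the k nearest points."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict(self, X, k=3):
        X = np.asarray(X, float)
        # Pairwise Euclidean distances between every test and training point
        d = np.linalg.norm(X[:, None, :] - self.X[None, :, :], axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]   # indices of the k closest points
        return np.array([np.bincount(self.y[nn]).argmax() for nn in nearest])

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(KNNClassifier().fit(X_train, y_train).predict([[0.5, 0.5], [5.5, 5.5]], k=3))  # [0 1]
```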
Minor Remarks on NN-based Methods
• k-NN is easy to implement but not of much interest in practice. Why?
• Choice of distance metrics might be an issue (see example below)
• Measuring distances in high-dimensional spaces might not be a good idea.
• Moreover, NN-based methods require lots of space and computation time!
(NN-based methods are viewed as data-driven approaches.)
All three images have the same Euclidean distance to the original one.
Image credit: Stanford CS231n 25
Nonparametric Techniques
• kNN is also a nonparametric technique:
• Specify $k_n$ as a function of n, such as $k_n = \sqrt{n}$; the volume
$V_n$ is grown until it encloses $k_n$ neighbors of x
26
Unsupervised Learning and
Supervised Learning
• Machine Learning/Pattern Recognition
  • Uncorrelated Events
    • Supervised Learning: parameter density, decision boundary
    • Unsupervised Clustering: minimum distance, hierarchical clustering
  • Correlated Events
    • Hidden Markov Models
27
Clustering
• Clustering is an unsupervised algorithm.
• Given:
a set of N unlabeled instances {x1, …, xN}; # of clusters K
• Goal: group the samples into K partitions
• Remarks:
• High within-cluster (intra-cluster) similarity
• Low between-cluster (inter-cluster) similarity
• But…how to determine a proper similarity measure?
28
Similarity is NOT Always Objective…
29
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:
• Euclidean distance (L2 norm): $d(x,z) = \|x - z\|_2 = \sqrt{\sum_{i=1}^{D}(x_i - z_i)^2}$
• Manhattan distance (L1 norm): $d(x,z) = \|x - z\|_1 = \sum_{i=1}^{D}|x_i - z_i|$
• Note that the p-norm distance is denoted as:
$$L_p(\boldsymbol{x}, \boldsymbol{z}) = \left(\sum_{i=1}^{D} |x_i - z_i|^p\right)^{1/p} \qquad L_0(\boldsymbol{x}, \boldsymbol{z}) = \lim_{p \to 0} \left(\sum_{i=1}^{D} |x_i - z_i|^p\right)^{1/p}$$
30
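All three metrics are one-liners in code; a quick sketch (NumPy assumed, vectors chosen arbitrarily):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.0])

l2 = np.sqrt(np.sum((x - z) ** 2))           # Euclidean: sqrt(1 + 4 + 0) ~= 2.236
l1 = np.sum(np.abs(x - z))                   # Manhattan: 1 + 2 + 0 = 3
p = 3
lp = np.sum(np.abs(x - z) ** p) ** (1 / p)   # general p-norm distance, here p = 3
print(l2, l1, lp)
```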
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:
• Kernelized (non-linear) distance:
$$d(x,z)^2 = \|\Phi(x) - \Phi(z)\|_2^2 = \|\Phi(x)\|_2^2 + \|\Phi(z)\|_2^2 - 2\,\Phi(x)^T\Phi(z)$$
• Taking the Gaussian kernel for example: $K(x,z) = \Phi(x)^T\Phi(z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right)$,
we have $\|\Phi(x)\|_2^2 = \Phi(x)^T\Phi(x) = 1$, so $d(x,z)^2 = 2 - 2K(x,z)$;
the distance is more sensitive for smaller σ.
• For example, L2 or kernelized distance metrics for the following two cases?
31
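A sketch of the kernelized distance with the Gaussian kernel (NumPy assumed; the σ values are illustrative), showing the $d^2 = 2 - 2K(x,z)$ identity and the sensitivity to σ:

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_distance_sq(x, z, sigma):
    # d(x,z)^2 = K(x,x) + K(z,z) - 2 K(x,z); for the Gaussian kernel K(x,x) = 1
    return 2.0 - 2.0 * gaussian_kernel(x, z, sigma)

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(kernel_distance_sq(x, z, sigma=2.0))   # ~0.44: large sigma, mild distance
print(kernel_distance_sq(x, z, sigma=0.2))   # ~2.00: small sigma, distance saturates quickly
```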
K-Means Clustering
• Input: N examples {x1, . . . , xN } (xn ∈ RD ); number of partitions K
• Initialize: K cluster centers µ 1, . . . , µ K . Several initialization options:
• Randomly initialize µ 1, . . . , µ K anywhere in RD
• Or, simply choose any K examples as the cluster centers
• Iterate:
• Assign each example xn to its closest cluster center
• Recompute the new cluster centers µk (mean/centroid of the set Ck)
• Repeat until convergence
• Possible convergence criteria:
• Cluster centers do not change anymore
• Max. number of iterations reached
• Output:
• K clusters (with centers/means of each cluster)
32
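The iteration above translates almost line-for-line into code; a minimal sketch (NumPy assumed, toy 2-D data, no handling of empty clusters):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: alternate assignment and center update until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # K examples as initial centers
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each center as the mean/centroid of its cluster
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):   # converged: centers no longer change
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centers, labels = kmeans(X, K=2)
print(centers)   # two centers near (0, 0) and (4, 4)
```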
K-Means Clustering
• Example (K = 2): Initialization, iteration #1: pick cluster centers
33
K-Means Clustering
• Example (K = 2): iteration #1-2, assign data to each cluster
34
K-Means Clustering
• Example (K = 2): iteration #2-1, update cluster centers
35
K-Means Clustering
• Example (K = 2): iteration #2, assign data to each cluster
36
K-Means Clustering
• Example (K = 2): iteration #3-1
37
K-Means Clustering
• Example (K = 2): iteration #3-2
38
K-Means Clustering
• Example (K = 2): iteration #4-1
39
K-Means Clustering
• Example (K = 2): iteration #4-2
40
K-Means Clustering
• Example (K = 2): iteration #5, cluster means no longer change; converged.
41
K-Means Clustering (cont’d)
• Limitation
• Preferable for round-shaped clusters with similar sizes
• Sensitive to initialization; how to alleviate this problem?
• Sensitive to outliers; possible change from K-means to…
• Hard assignment only.
• Remarks
• Expectation-maximization (EM) algorithm
• Speed-up possible by hierarchical clustering (e.g., 100 = 10² clusters)
42
Dimension Reduction
• Principal Component Analysis (PCA)
• Unsupervised & linear dimension reduction
• Related to Eigenfaces and similar feature extraction and classification techniques
• Still very popular, thanks to its simplicity and effectiveness.
• Goal:
• Determine the projection so that the variation of the projected data is maximized.
[Figure: the first principal axis is the axis that describes the largest variation for data projected onto it]
43
Formulation & Derivation for PCA
• Input: a set of instances x without label info
• Output: a projection vector u1 maximizing the variance of the projected data
$$S = TT^T$$
where T is the matrix of preprocessed training examples: each
column contains one mean-subtracted image.
44
Formulation & Derivation for PCA
• Maximize the variance of the projected data: $\max_{u_1} u_1^T S u_1$ subject to $u_1^T u_1 = 1$.
• Introducing a Lagrange multiplier $\lambda_1$ gives $S u_1 = \lambda_1 u_1$:
the optimal $u_1$ is the eigenvector of S with the largest eigenvalue.
45
Formulation & Derivation for PCA
However, $S = TT^T$ is a large matrix, so instead we take the eigenvalue
decomposition of
$$T^T T\, u_i = \lambda_i u_i$$
Then we notice that by pre-multiplying both sides of the equation
with T, we obtain
$$TT^T (T u_i) = \lambda_i (T u_i)$$
Meaning that, if $u_i$ is an eigenvector of $T^T T$, then $v_i = T u_i$ is an
eigenvector of S. If we have a training set of 300 images of 100 × 100
pixels, the matrix $T^T T$ is a 300 × 300 matrix, which is much more
manageable than the 10,000 × 10,000 covariance matrix.
46
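The trick is easy to verify numerically; a sketch (NumPy assumed, with random data standing in for the 300 mean-subtracted images):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 10_000                       # 300 images of 100 x 100 pixels, flattened
T = rng.standard_normal((d, n))          # columns: training images (random stand-ins)
T -= T.mean(axis=1, keepdims=True)       # subtract the mean image from every column

# Solve the small n x n eigenproblem T^T T u_i = lambda_i u_i  (300 x 300)
vals, U = np.linalg.eigh(T.T @ T)
order = np.argsort(vals)[::-1]           # sort by decreasing eigenvalue
vals, U = vals[order], U[:, order]

V = T @ U                                # v_i = T u_i are eigenvectors of S = T T^T
V /= np.linalg.norm(V, axis=0)           # normalize each eigenvector to unit length

# Sanity check: S v_0 = lambda_0 v_0, without ever forming the 10,000 x 10,000 matrix S
print(np.allclose(T @ (T.T @ V[:, 0]), vals[0] * V[:, 0]))   # True
```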
Eigenanalysis
• A d x d covariance matrix contains a maximum of d
eigenvector/eigenvalue pairs.
• How is dimension reduction realized? How do we reconstruct
the input data?
• Expanding a signal via eigenvectors as bases
• With symmetric matrices (e.g., covariance matrix),
eigenvectors are orthogonal.
• They can be regarded as unit basis vectors to span any
instance in the d-dim space.
47
Let’s See an Example (CMU AMP Face Database)
• Let's take 5 face images × 13 people = 65 images, each of size 64 x 64 = 4096 pixels.
• The number of eigenvectors needed to perfectly reconstruct the input is 64
(after mean subtraction, the 65 images span at most a 64-dimensional subspace).
• Let’s check it out!
49
What Do the Eigenvectors/Eigenfaces Look Like?
[Figure: the mean face and the first 15 eigenfaces, V1–V15]
50
All 64 Eigenvectors, do we need them all?
51
Use only 1 eigenvector, MSE = 1233.16
52
Use 2 eigenvectors, MSE = 1027.63
53
Use 3 eigenvectors, MSE = 758.13
54
Use 4 eigenvectors, MSE = 634.54
55
Use 8 eigenvectors, MSE = 285.08
56
With 20 eigenvectors, MSE = 87.93
57
With 30 eigenvectors, MSE = 20.55
58
With 50 eigenvectors, MSE = 2.14
59
With 60 eigenvectors, MSE = 0.06
60
All 64 eigenvectors, MSE = 0.00
61
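The reconstruction experiment above is easy to reproduce; a sketch (NumPy assumed, with random data standing in for the 65 face images, so the MSE values differ from the slides but show the same trend toward 0 at 64 eigenvectors):

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project mean-subtracted rows of X onto the top-k eigenvectors and reconstruct."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: principal directions
    W = Vt[:k].T                                       # d x k basis (top-k eigenvectors)
    X_hat = mean + (Xc @ W) @ W.T                      # project, then map back
    return X_hat, np.mean((X - X_hat) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((65, 4096))      # stand-in for 65 flattened 64 x 64 faces
for k in (1, 8, 30, 64):
    _, mse = pca_reconstruct(X, k)
    print(f"k={k:2d}  MSE={mse:.4f}")    # MSE drops to ~0 at k = 64
```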
Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA)
• Classify objects into one of two or more groups
• Based on a set of features
• The transform tries to maximize the ratio of between-class
variance to within-class variance
• Between-class scatter: $S_B = \sum_{c=1}^{C} N_c\,(\mu_c - \mu)(\mu_c - \mu)^T$
• Within-class scatter: $S_W = \sum_{c=1}^{C} \sum_{x \in \mathcal{C}_c} (x - \mu_c)(x - \mu_c)^T$
62
Mathematical Operations
• Maximize $J = \dfrac{|S_B|}{|S_W|}$
• If y is the transform of x:
• $y = W^T x$
• Compute J after the transform:
• $\tilde{S}_B = W^T S_B W$
• $\tilde{S}_W = W^T S_W W$
• $J(W) = \dfrac{|W^T S_B W|}{|W^T S_W W|}$
• Find W to maximize $J(W)$
63
Find W
• If we are lucky, $S_W$ is a non-singular matrix
• We can find W by solving
• $S_W^{-1} S_B\, w_i = \lambda_i w_i$
• Calculate the eigenvectors of $S_W^{-1} S_B$
• If not, well... it's tough to handle.
• Everyone tries to avoid this
• e.g., by first reducing the dimension using PCA
64
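Putting the LDA pieces together, a sketch of the lucky (non-singular $S_W$) case (NumPy assumed; the two-class toy data is illustrative):

```python
import numpy as np

def lda_directions(X, y):
    """Fisher LDA: eigenvectors of Sw^{-1} Sb, assuming Sw is non-singular."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)         # within-class scatter
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)       # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]       # directions with the largest ratio J first
    return vecs[:, order].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (40, 2)), rng.normal([3, 1], 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
print(lda_directions(X, y)[:, 0])   # the most discriminative projection direction
```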
Small Example
65
Experiment
• Matthew Turk and Alex Pentland, “Eigenfaces for Recognition,” Journal of
Cognitive Neuroscience, March 1991.
• Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman, “Eigenfaces
vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 1997.
66
Hyperparameters in ML
• Recall that for k-NN, we need to determine the k value in advance.
• What is the best k value?
• Or, take PCA for example, what is the best reduced dimension number?
• Hyperparameters: parameter choices for the learning model/algorithm
• We need to determine such hyperparameters instead of guessing.
• Let’s see what we can and cannot do…
k=1 k=3 k=5
Image credit: Stanford CS231n 67
How to Determine Hyperparameters?
• Idea #1
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• You take a dataset to train your model, and select your hyperparameters
(e.g., k of k-NN) based on the resulting performance.
• Might not generalize well.
Dataset
68
How to Determine Hyperparameters? (cont’d)
• Idea #2
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For a dataset of interest, you split it into training and test sets.
• You train your model with possible hyperparameter choices (e.g., k in k-NN),
and select those that work best on the test set.
• That’s called cheating…
Training set Test set
69
How to Determine Hyperparameters? (cont’d)
• Idea #3
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For the dataset of interest, you split it into training, validation, and test sets.
• You train your model with possible hyperparameter choices (k in k-NN),
and select those that work best on the validation set.
• OK, but…
Training set Validation set Test set
70
How to Determine Hyperparameters? (cont’d)
• Idea #3.5
• What if only training and test sets are given, not the validation set?
• Cross-validation (or k-fold cross validation)
• For each hyperparameter choice, split the training set into k folds
• Keep 1 fold as the validation set and the remaining k-1 folds for training
• After each of the k folds has served as the validation set, report the average
validation performance.
• Choose the hyperparameter(s) which result in the highest average validation
performance.
• Take a 4-fold cross-validation as an example…
Training set Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
71
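A 4-fold cross-validation for picking k in k-NN can be sketched as follows (NumPy assumed; the helper reuses the brute-force k-NN idea from earlier, and the toy data is illustrative):

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_val, y_val, k):
    d = np.linalg.norm(X_val[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(y_tr[nn]).argmax() for nn in nearest])
    return np.mean(pred == y_val)

def cross_validate_k(X, y, candidate_ks, n_folds=4, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = {}
    for k in candidate_ks:
        accs = []
        for i in range(n_folds):              # each fold serves once as validation
            val = folds[i]
            tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
            accs.append(knn_accuracy(X[tr], y[tr], X[val], y[val], k))
        scores[k] = float(np.mean(accs))      # average validation performance
    return max(scores, key=scores.get), scores  # k with the highest average

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (40, 2)), rng.normal(3, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
print(cross_validate_k(X, y, candidate_ks=[1, 3, 5, 7]))
```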