
Machine Learning Basics (I)

簡韶逸 Shao-Yi Chien


Department of Electrical Engineering
National Taiwan University

1
References and Slide Credits
• Slides from Deep Learning for Computer Vision, Prof. Yu-Chiang
Frank Wang, EE, National Taiwan University
• Slides from CE 5554 / ECE 4554: Computer Vision, Prof. J.-B.
Huang, Virginia Tech
• Slides from CSE 576 Computer Vision, Prof. Steve Seitz and Prof.
Rick Szeliski, U. Washington
• Slides from EECS 498-007/598-005 Deep Learning for Computer
Vision, Prof. Justin Johnson
• Slides from CS291A Introduction to Pattern Recognition, Artificial
  Neural Networks, and Machine Learning, Prof. Yuan-Fang Wang, UCSB
• Duda et al., Pattern Classification
• Bishop, Pattern Recognition and Machine Learning
• Reference papers

2
Outline
• Overview of recognition/classification pipeline
• Overview of machine learning
• From probability to Bayes decision rule
• Nonparametric techniques: Parzen window and nearest neighbor
• Unsupervised learning and supervised learning
• Unsupervised learning
• Clustering: k-means
• Dimension reduction: PCA and LDA
• Training, testing, & validation
• Supervised learning
• Linear classification: support vector machine (SVM)
• Combining models: decision tree, boosting
• Examples

3
Image Classification
Input: image
Output: assign the image to one of a fixed set of categories

cat
bird
deer
dog
truck

This image by Nikita is licensed under CC-BY 2.0

4
Problem: Semantic Gap

What the computer sees


An image is just a big grid of numbers in [0, 255],
e.g., 800 x 600 x 3
(3 RGB channels)
5
Challenges
• Viewpoint variation
• Intraclass variation
• Fine-grained categories
• Background clutter
• Illumination changes
• Deformation
• Occlusion
•…

6
Overview of Recognition Pipeline

7
AI, Machine Learning, and Deep Learning

[Kaggle]
8
Machine Learning:
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
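
A minimal Python sketch of this data-driven API (the train/predict function names and the stored-data "model" are illustrative, not a specific library; here the classifier simply memorizes the training set, as the nearest-neighbor approach introduced later does):

    import numpy as np

    def train(images, labels):
        # Build a model from (image, label) pairs; here we simply store the data,
        # as a nearest-neighbor classifier would.
        return {"images": images, "labels": labels}

    def predict(model, test_images):
        # Assign each test image the label of its closest training image (L2 distance).
        diffs = model["images"][None, :, :] - test_images[:, None, :]
        nearest = np.argmin((diffs ** 2).sum(axis=2), axis=1)
        return model["labels"][nearest]

    # Hypothetical usage with flattened images as rows of NumPy arrays:
    # model = train(train_images, train_labels)
    # accuracy = (predict(model, test_images) == test_labels).mean()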

Example training set

9
Machine Learning/
Pattern Recognition
[Diagram: taxonomy of machine learning / pattern recognition.
 Uncorrelated events: supervised learning (parameter/density estimation, decision boundary, minimum distance) and unsupervised clustering (hierarchical clustering);
 correlated events: hidden Markov models.]
10
From Probability to Bayes Decision Rule
• Example: Testing/Screening of COVID-19
• The distributions of positive/negative test results (e.g., PCR, antibody, etc.):
  • the further away they are from each other,
  • the more accurate the COVID diagnosis.

  [Figure: test-result distributions for ground-truth positive and negative cases, with a threshold separating "negative for COVID" from "positive for COVID" decisions.]
11
Bayesian Decision Theory
• Fundamental statistical approach to classification/detection tasks
• Take a 2-class classification/detection task as an example:
• Let’s see if a student would pass or fail the CV course.
• Define a probabilistic variable ω to describe the case of pass or fail.
• That is, ω = ω1 for pass, and ω = ω2 for fail.
• Prior Probability
• The a priori or prior probability reflects the knowledge of
how likely we expect a certain state of nature before observation.
• P(ω = ω1) or simply P(ω1) as the prior that the next student would pass CV.
• The priors must exhibit exclusivity and exhaustivity, i.e.,

  $\sum_{j=1}^{C} P(\omega_j) = 1$

• Equal priors
  • If we have equal numbers of students pass/fail CV, then the priors are equal;
    in other words, the priors are uniform:

    $P(\omega_1) = P(\omega_2) = 0.5$
12
Prior Probability (cont’d)
• Decision rule based on priors only
• If the only available info is the prior,
and the cost of any type of incorrect classification is equal,
what would be a reasonable decision rule?
• Decide ω1 if
$P(\omega_1) > P(\omega_2)$
otherwise decide ω2 .
• What’s the incorrect classification rate (or error rate) Pe?

$P_e = \min\{P(\omega_1),\ P(\omega_2)\}$

13
Class-Conditional Probability Density
(or Likelihood)
• The probability density function (PDF) for input/observation x given a state of nature ω
is written as:
$p(x|\omega_1)$

• Here are (hopefully) the hypothetical class-conditional densities $p(x|\omega_1)$ and $p(x|\omega_2)$,
  reflecting the time spent on CV by students who eventually pass or fail this course.

  [Figure: the two densities, estimated from training data observed/collected in advance, e.g., via maximum likelihood estimation (MLE).]

14
Posterior Probability & Bayes Formula
• If we know the prior distribution and the class-conditional density,
can we come up with a better decision rule?
• Yes We Can!
• By calculating the posterior probability.
• Posterior probability $P(\omega|x)$:
• The probability of a certain state of nature ω given an observable x.
• Bayes formula:

$P(\omega_j, x) = p(x|\omega_j)\,P(\omega_j) = P(\omega_j|x)\,p(x)$

$P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$

And, we have $\sum_{j=1}^{C} P(\omega_j|x) = 1$.

15
Decision Rule & Probability of Error
• For a given observable x (e.g., the time you can spend on CV),
  the decision rule (to take CV or not) will now be based on:

  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$

  $\omega^* = \arg\max_i P(\omega_i|x)$   Maximum A Posteriori (MAP)
• What’s the probability of error P(error) (or $P_e$)?

  $P_e = \min\{P(\omega_1|x),\ P(\omega_2|x)\}$ over all x

  [Figure: posteriors $P(\omega_2|x)$ and $P(\omega_1|x)$ with decision threshold T.]
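
A minimal numerical sketch of the MAP rule, assuming (hypothetically) Gaussian class-conditional densities and made-up priors:

    import numpy as np

    def gaussian_pdf(x, mean, std):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

    # Hypothetical priors and class-conditional densities p(x|w1), p(x|w2).
    priors = np.array([0.6, 0.4])                    # P(w1), P(w2)
    means, stds = np.array([8.0, 3.0]), np.array([2.0, 2.0])

    def map_decision(x):
        # Posteriors are proportional to likelihood * prior; p(x) cancels in the argmax.
        likelihoods = gaussian_pdf(x, means, stds)
        posteriors = likelihoods * priors
        posteriors /= posteriors.sum()               # normalize by p(x)
        return np.argmax(posteriors) + 1, posteriors # class index: 1 = pass, 2 = fail

    # e.g., map_decision(6.0) decides w1 exactly when P(w1|x) > P(w2|x)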

16
From Bayes Decision Rule to Detection Theory
• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), rejection (TN)


  $TP = \int_{T} p(x|\omega_1)\,P(\omega_1)\,dx$

  $FP = \int_{T} p(x|\omega_2)\,P(\omega_2)\,dx$

  (integration over the region beyond threshold T where the detector decides $\omega_1$)

  [Figure: curves $p(x|\omega_2)P(\omega_2)$ and $p(x|\omega_1)P(\omega_1)$ with threshold T; at T*, the equal error rate (EER) is reached, i.e., FP = FN.]

• Receiver Operating Characteristics (ROC)


• To assess the effectiveness of the designed features/classifiers
• False alarm (PFA or FP) vs. detection (Pd or TP) rates
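
A sketch of how an ROC curve and the EER can be traced by sweeping the threshold T over detection scores (the scores below are synthetic, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic detection scores: class w1 (positives) vs. class w2 (negatives).
    pos_scores = rng.normal(2.0, 1.0, 1000)
    neg_scores = rng.normal(0.0, 1.0, 1000)

    thresholds = np.linspace(-4, 6, 200)
    tpr = np.array([(pos_scores >= t).mean() for t in thresholds])  # detection (hit) rate
    fpr = np.array([(neg_scores >= t).mean() for t in thresholds])  # false-alarm rate

    # Approximate equal error rate (EER): threshold where false-alarm rate = miss rate.
    fnr = 1.0 - tpr
    eer_index = np.argmin(np.abs(fpr - fnr))
    print("EER ~", (fpr[eer_index] + fnr[eer_index]) / 2)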

17
Nonparametric Techniques:
Parzen Window
• Parzen-window approach to estimate densities: assume, e.g., that the region $\mathcal{R}_n$ is a d-dimensional hypercube:

  $V_n = h_n^d$   ($h_n$: length of the edge of $\mathcal{R}_n$)

  Let $\varphi(u)$ be the following hypercube window function:

  $\varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2},\ j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$

• $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the hypercube of volume $V_n$ centered at $x$, and equal to zero otherwise.
• The number of samples in this hypercube is:

  $k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$

  Substituting $k_n$ into $p_n(x) = (k_n/n)/V_n$, we obtain:

  $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$

• $p_n(x)$ estimates $p(x)$ as an average of window functions of $x$ and the samples $x_i$ ($i = 1, \dots, n$), and

  $p_n(x) \rightarrow p(x)$ as $n \rightarrow \infty$
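
A minimal 1-D sketch of this estimator with the hypercube (box) window; the sample data are synthetic and purely illustrative:

    import numpy as np

    def parzen_estimate(x, samples, h):
        # Hypercube (box) window: phi(u) = 1 if |u| <= 1/2, else 0.
        u = (x[:, None] - samples[None, :]) / h
        phi = (np.abs(u) <= 0.5).astype(float)
        k_n = phi.sum(axis=1)              # number of samples inside each window
        return k_n / (len(samples) * h)    # p_n(x) = (k_n / n) / V_n, with V_n = h in 1-D

    samples = np.random.default_rng(1).normal(0.0, 1.0, 500)  # hypothetical data
    xs = np.linspace(-3, 3, 61)
    density = parzen_estimate(xs, samples, h=0.5)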
Nonparametric Techniques:
Nearest Neighbor

• Memorize all training data and labels

• Predict the label of the most similar training image

20
21
Nearest Neighbor Decision
Boundaries
[Figure: nearest neighbors in two dimensions; training points plotted over colored decision regions in the (x0, x1) plane.]

• Points are training examples; colors give training labels
• Background colors give the category a test point would be assigned
• The decision boundary is the boundary between two classification regions
• Decision boundaries can be noisy; affected by outliers
• How to smooth out decision boundaries? Use more neighbors!

22
K-Nearest Neighbors (kNN)
• Instead of copying label from nearest neighbor,
take majority vote from K closest points
K=1 K=3

23
K-Nearest Neighbors (kNN)
• Make the decision boundary smoother
• Reduce the effect of outliers
K=1 K=3

http://vision.stanford.edu/teaching/cs231n-demos/knn/
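
A small NumPy sketch of the kNN rule above (majority vote among the K closest training points); the data arrays are hypothetical placeholders:

    import numpy as np

    def knn_predict(train_X, train_y, test_X, k=3):
        # L2 distances between every test point and every training point.
        d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
        # Indices of the k closest training points for each test point.
        nearest = np.argsort(d, axis=1)[:, :k]
        votes = train_y[nearest]           # labels (non-negative ints) of those neighbors
        # Majority vote per test point.
        return np.array([np.bincount(v).argmax() for v in votes])

    # Hypothetical usage: train_X is (N, D) floats, train_y is (N,) integer labels.
    # preds = knn_predict(train_X, train_y, test_X, k=3)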
24
Minor Remarks on NN-based Methods
• k-NN is easy to implement but not of much interest in practice. Why?
• Choice of distance metrics might be an issue (see example below)
• Measuring distances in high-dimensional spaces might not be a good idea.
• Moreover, NN-based methods require lots of space and computation time!
(NN-based methods are viewed as data-driven approaches.)

All three images have the same Euclidean distance to the original one.

Image credit: Stanford CS231n 25


Nonparametric Techniques
• kNN is also a nonparametric technique:
• Specify $k_n$ as a function of n, such as $k_n = \sqrt{n}$; the volume
  $V_n$ is grown until it encloses $k_n$ neighbors of x

26
Unsupervised Learning and
Supervised Learning
[Diagram: taxonomy of machine learning / pattern recognition.
 Uncorrelated events: supervised learning (parameter/density estimation, decision boundary, minimum distance) and unsupervised clustering (hierarchical clustering);
 correlated events: hidden Markov models.]

27
Clustering
• Clustering is an unsupervised algorithm.
• Given:
a set of N unlabeled instances {x1, …, xN}; # of clusters K
• Goal: group the samples into K partitions
• Remarks:
• High within-cluster (intra-cluster) similarity
• Low between-cluster (inter-cluster) similarity
• But…how to determine a proper similarity measure?

28
Similarity is NOT Always Objective…

29
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:

• Euclidean distance (L2 norm): $d(x, z) = \|x - z\|_2 = \sqrt{\sum_{i=1}^{D} (x_i - z_i)^2}$

• Manhattan distance (L1 norm): $d(x, z) = \|x - z\|_1 = \sum_{i=1}^{D} |x_i - z_i|$

• Note that the p-norm distance is denoted as:

  $L_p(x, z) = \left( \sum_{i=1}^{D} |x_i - z_i|^p \right)^{1/p}$

  $L_0(x, z) = \lim_{p \to 0} \left( \sum_{i=1}^{D} |x_i - z_i|^p \right)^{1/p}$
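
These metrics translate directly into code; a small NumPy sketch with arbitrary example vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([2.0, 0.0, 3.5])

    l2 = np.sqrt(np.sum((x - z) ** 2))             # Euclidean distance, same as np.linalg.norm(x - z)
    l1 = np.sum(np.abs(x - z))                     # Manhattan distance
    p = 3
    lp = np.sum(np.abs(x - z) ** p) ** (1.0 / p)   # general p-norm distance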

30
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:
• Kernelized (non-linear) distance:

  $d^2(x, z) = \|\Phi(x) - \Phi(z)\|_2^2 = \|\Phi(x)\|_2^2 + \|\Phi(z)\|_2^2 - 2\,\Phi(x)^T \Phi(z)$

• Taking the Gaussian kernel for example: $K(x, z) = \Phi(x)^T \Phi(z) = \exp\!\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right)$,
  we have $\|\Phi(x)\|_2^2 = \Phi(x)^T \Phi(x) = 1$;
  the distance is thus more sensitive for smaller σ (see the sketch below).


• For example, L2 or kernelized distance metrics for the following two cases?
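
A sketch contrasting the plain L2 distance with the Gaussian-kernelized distance above (the vectors and σ values are illustrative):

    import numpy as np

    def gaussian_kernel(x, z, sigma):
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    def kernel_distance_sq(x, z, sigma):
        # d^2 = K(x,x) + K(z,z) - 2 K(x,z); for the Gaussian kernel K(x,x) = K(z,z) = 1.
        return 2.0 - 2.0 * gaussian_kernel(x, z, sigma)

    x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
    l2_sq = np.sum((x - z) ** 2)
    print(l2_sq, kernel_distance_sq(x, z, sigma=0.5), kernel_distance_sq(x, z, sigma=5.0))
    # A smaller sigma drives the kernel toward 0, so the kernelized distance saturates near 2 sooner.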

31
K-Means Clustering
• Input: N examples {x1, . . . , xN } (xn ∈ RD ); number of partitions K
• Initialize: K cluster centers µ 1, . . . , µ K . Several initialization options:
• Randomly initialize µ 1, . . . , µ K anywhere in RD
• Or, simply choose any K examples as the cluster centers
• Iterate:
• Assign each example xn to its closest cluster center
• Recompute the new cluster centers µk (mean/centroid of the set Ck)
• Repeat until convergence
• Possible convergence criteria:
• Cluster centers do not change anymore
• Max. number of iterations reached
• Output:
• K clusters (with centers/means of each cluster)
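
A compact NumPy sketch of the procedure above (the 2-D data here are random and purely illustrative; the sketch assumes no cluster becomes empty):

    import numpy as np

    def kmeans(X, K, num_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]       # choose K examples as initial centers
        for _ in range(num_iters):
            # Assignment step: each example goes to its closest cluster center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = np.argmin(d, axis=1)
            # Update step: recompute each center as the mean of its assigned examples.
            new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):               # converged: centers unchanged
                break
            centers = new_centers
        return centers, assign

    X = np.random.default_rng(1).normal(size=(200, 2))          # hypothetical 2-D data
    centers, assign = kmeans(X, K=2)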

32
K-Means Clustering

• Example (K = 2): Initialization, iteration #1: pick cluster centers

33
K-Means Clustering

• Example (K = 2): iteration #1-2, assign data to each cluster

34
K-Means Clustering

• Example (K = 2): iteration #2-1, update cluster centers

35
K-Means Clustering

• Example (K = 2): iteration #2, assign data to each cluster

36
K-Means Clustering

• Example (K = 2): iteration #3-1

37
K-Means Clustering

• Example (K = 2): iteration #3-2

38
K-Means Clustering

• Example (K = 2): iteration #4-1

39
K-Means Clustering

• Example (K = 2): iteration #4-2

40
K-Means Clustering

• Example (K = 2): iteration #5, cluster means no longer change (converged).

41
K-Means Clustering (cont’d)
• Limitation
• Preferable for round shaped clusters with similar sizes

• Sensitive to initialization; how to alleviate this problem?


• Sensitive to outliers; possible change from K-means to…
• Hard assignment only.

• Remarks
• Expectation-maximization (EM) algorithm
• Speed-up possible by hierarchical clustering (e.g., 100 = 10² clusters)

42
Dimension Reduction
• Principal Component Analysis (PCA)
• Unsupervised & linear dimension reduction
• Related to Eigenfaces and other feature extraction and classification techniques
• Still very popular owing to its simplicity and effectiveness.
• Goal:
• Determine the projection, so that the variation of projected data is maximized.

[Figure: 2-D data points (axis x) with the principal axis drawn through them: the axis that describes the largest variation for data projected onto it.]

43
Formulation & Derivation for PCA
• Input: a set of instances x without label info
• Output: a projection vector u1 maximizing the variance of the projected data

$S = T T^T$, where T is the matrix of preprocessed training examples:
each column contains one mean-subtracted image.
44
Formulation & Derivation for PCA

45
Formulation & Derivation for PCA

However, $S = T T^T$ is a large matrix. If instead we take the eigenvalue decomposition of the smaller matrix $T^T T$,

  $T^T T\, u_i = \lambda_i u_i$,

then we notice that by pre-multiplying both sides of the equation with T, we obtain

  $T T^T (T u_i) = \lambda_i (T u_i)$.

Meaning that, if $u_i$ is an eigenvector of $T^T T$, then $v_i = T u_i$ is an
eigenvector of $S = T T^T$. If we have a training set of 300 images of 100 × 100
pixels, the matrix $T^T T$ is a 300 × 300 matrix, which is much more
manageable than the 10,000 × 10,000 covariance matrix.
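
A NumPy sketch of this small-matrix trick, assuming images are stacked one flattened image per column (array names are illustrative):

    import numpy as np

    def eigenfaces(images):
        # images: (num_pixels, num_images), one flattened image per column.
        mean = images.mean(axis=1, keepdims=True)
        T = images - mean                                   # mean-subtracted data
        # Eigen-decomposition of the small num_images x num_images matrix T^T T.
        evals, U = np.linalg.eigh(T.T @ T)
        order = np.argsort(evals)[::-1]                     # sort by decreasing eigenvalue
        evals, U = evals[order], U[:, order]
        V = T @ U                                           # eigenvectors of S = T T^T
        V /= np.linalg.norm(V, axis=0, keepdims=True)       # normalize each eigenface
        return mean, V, evals                               # in practice, keep only leading columns of V

    # e.g., 300 images of 100x100 pixels -> images has shape (10000, 300),
    # but only a 300 x 300 eigen-decomposition is required.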

46
Eigenanalysis
• A d x d covariance matrix contains a maximum of d
eigenvector/eigenvalue pairs.
• How is dimension reduction realized? How do we reconstruct
  the input data?

• Expanding a signal via eigenvectors as bases


• With symmetric matrices (e.g., covariance matrix),
eigenvectors are orthogonal.
• They can be regarded as unit basis vectors to span any
instance in the d-dim space.

47
Let’s See an Example (CMU AMP Face Database)
• Let’s take 5 face images × 13 people = 65 images, each of size 64 × 64 = 4096 pixels.
• # of eigenvectors expected for perfectly reconstructing the input = 64
  (the 65 mean-subtracted images span at most 64 dimensions).
• Let’s check it out!
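
A sketch of the reconstruction experiment that follows, keeping only the top k eigenvectors and reporting the mean squared error; it reuses the hypothetical eigenfaces() helper sketched earlier:

    import numpy as np

    def reconstruct(image, mean, V, k):
        # Project the mean-subtracted image onto the top-k eigenvectors, then expand back.
        coeffs = V[:, :k].T @ (image - mean.ravel())
        return mean.ravel() + V[:, :k] @ coeffs

    def mse(a, b):
        return np.mean((a - b) ** 2)

    # Hypothetical usage, with `faces` of shape (4096, 65) for 64x64 images of 65 faces:
    # mean, V, evals = eigenfaces(faces)
    # for k in (1, 2, 4, 8, 20, 30, 50, 64):
    #     rec = reconstruct(faces[:, 0], mean, V, k)
    #     print(k, mse(faces[:, 0], rec))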

49
What Do the Eigenvectors/Eigenfaces Look Like?
Mean V1 V2 V3

V4 V5 V6 V7

V8 V9 V10 V11

V12 V13 V14 V15

50
All 64 Eigenvectors, do we need them all?

51
Use only 1 eigenvector, MSE = 1233

MSE=1233.16

52
Use 2 eigenvectors, MSE = 1027

MSE=1027.63

53
Use 3 eigenvectors, MSE = 758

MSE=758.13

54
Use 4 eigenvectors, MSE = 634

MSE=634.54

55
Use 8 eigenvectors, MSE = 285

MSE=285.08

56
With 20 eigenvectors, MSE = 87

MSE=87.93

57
With 30 eigenvectors, MSE = 20

MSE=20.55

58
With 50 eigenvectors, MSE = 2.14

MSE=2.14

59
With 60 eigenvectors, MSE = 0.06

MSE=0.06

60
All 64 eigenvectors, MSE = 0

MSE=0.00

61
Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA)
  • classifies objects into one of two or more groups
  • based on a set of features
• The transform tries to maximize the ratio of between-class
  variance to within-class variance
  • Between-class variance
  • Within-class variance

  [Figure: two different classes and their projections onto a discriminant direction.]
62
Mathematical Operations
• Maximize Fisher's criterion

  $J(W) = \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}$

  (where $S_B$ and $S_W$ are the between-class and within-class scatter matrices)
• If $y = W^T x$ is the transform of x
• Compute J after the transform
• Find W to maximize $J(W)$
63
Find W
• If we are lucky, $S_W$ is a non-singular matrix
• We can then find $S_W^{-1}$
• Calculate the eigenvectors of $S_W^{-1} S_B$
• If not, well… it is tough work to do
• Everyone tries to avoid this
• e.g., by using PCA first to reduce the dimension
64


Small Example

65


Experiment
• Matthew Turk and Alex Pentland, “Eigenfaces for Recognition,” Journal of Cognitive Neuroscience, March 1991.
• Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.

66
Hyperparameters in ML
• Recall that for k-NN, we need to determine the k value in advance.
• What is the best k value?
• Or, take PCA for example, what is the best reduced dimension number?

• Hyperparameters: parameter choices for the learning model/algorithm


• We need to determine such hyperparameters instead of guessing.
• Let’s see what we can and cannot do…

k=1 k=3 k=5

Image credit: Stanford CS231n 67


How to Determine Hyperparameters?
• Idea #1
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• You take a dataset to train your model, and select your hyperparameters
(e.g., k of k-NN) based on the resulting performance.

• Might not generalize well.

Dataset

68
How to Determine Hyperparameters? (cont’d)
• Idea #2
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For a dataset of interest, you split it into training and test sets.
• You train your model with possible hyperparameter choices (e.g., k in k-NN),
  and select those that work best on the test set data.

• That’s called cheating…

Training set Test set

69
How to Determine Hyperparameters? (cont’d)
• Idea #3
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For the dataset of interest, it is split into training, validation, and test sets.
• You train your model with possible hyperparameter choices (k in k-NN),
  and select those that work best on the validation set.

• OK, but…

Training set Validation set Test set

Training set Validation set Test set

70
How to Determine Hyperparameters? (cont’d)
• Idea #3.5
• What if only training and test sets are given, not the validation set?
• Cross-validation (or k-fold cross validation)
• For each hyperparameter choice, split the training set into k folds
• Keep 1 fold as the validation set and use the remaining k-1 folds for training
• After each of the k folds has been evaluated, report the average validation performance.
• Choose the hyperparameter(s) which result in the highest average validation
performance.
• Take a 4-fold cross-validation as an example…
Training set Test set

Fold 1 Fold 2 Fold 3 Fold 4 Test set


Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
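
A sketch of this k-fold procedure for choosing k in k-NN; it assumes a hypothetical knn_predict(train_X, train_y, test_X, k) helper (such as the one sketched earlier) and a labeled training set:

    import numpy as np

    def cross_validate(train_X, train_y, candidate_ks, num_folds=4, seed=0):
        idx = np.random.default_rng(seed).permutation(len(train_X))
        folds = np.array_split(idx, num_folds)
        best_k, best_acc = None, -1.0
        for k in candidate_ks:
            accs = []
            for i in range(num_folds):
                val_idx = folds[i]                 # keep 1 fold for validation
                trn_idx = np.concatenate([folds[j] for j in range(num_folds) if j != i])
                preds = knn_predict(train_X[trn_idx], train_y[trn_idx], train_X[val_idx], k)
                accs.append((preds == train_y[val_idx]).mean())
            avg = float(np.mean(accs))             # average validation accuracy over the folds
            if avg > best_acc:
                best_k, best_acc = k, avg
        return best_k, best_acc

    # e.g., best_k, _ = cross_validate(train_X, train_y, candidate_ks=(1, 3, 5, 7))
    # Only after choosing best_k do we evaluate once on the held-out test set.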

71
