
Machine Learning Basics (I)

簡韶逸 Shao-Yi Chien


Department of Electrical Engineering
National Taiwan University

1
References and Slide Credits
• Slides from Deep Learning for Computer Vision, Prof. Yu-Chiang
Frank Wang, EE, National Taiwan University
• Slides from CE 5554 / ECE 4554: Computer Vision, Prof. J.-B.
Huang, Virginia Tech
• Slides from CSE 576 Computer Vision, Prof. Steve Seitz and Prof.
Rick Szeliski, U. Washington
• Slides from EECS 498-007/598-005 Deep Learning for Computer
Vision, Prof. Justin Johnson
• Slides from CS291A Introduction to Pattern Recognition, Artificial
  Neural Networks, and Machine Learning, Prof. Yuan-Fang Wang, UCSB
• Duda et al., Pattern Classification
• Bishop, Pattern Recognition and Machine Learning
• Reference papers

2
Outline
• Overview of recognition/classification pipeline
• Overview of machine learning
• From probability to Bayes decision rule
• Nonparametric techniques: Parzen window and nearest neighbor
• Unsupervised learning and supervised learning
• Unsupervised learning
• Clustering: k-means
• Dimension reduction: PCA and LDA
• Training, testing, & validation
• Supervised learning
• Linear classification: support vector machine (SVM)
• Combining models: decision tree, boosting
• Examples

3
Image Classification
Input: image
Output: assign the image to one of a fixed set of categories

cat
bird
deer
dog
truck

This image by Nikita is licensed under CC-BY 2.0

4
Problem: Semantic Gap

What the computer sees


An image is just a big grid of numbers in [0, 255],
e.g., 800 x 600 x 3
(3 RGB channels)
5
Challenges
• Viewpoint variation
• Intraclass variation
• Fine-grained categories
• Background clutter
• Illumination changes
• Deformation
• Occlusion
•…

6
Overview of Recognition Pipeline

7
AI, Machine Learning, and Deep Learning

[Kaggle]
8
Machine Learning:
Data-Driven Approach
1. Collect a dataset of images and labels
2. Use Machine Learning to train a classifier
3. Evaluate the classifier on new images
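
A minimal Python sketch of this data-driven API (the train/predict function names and the stored-data "model" are illustrative, not a specific library; here the classifier simply memorizes the training set, as the nearest-neighbor approach introduced later does):

    import numpy as np

    def train(images, labels):
        # Build a model from (image, label) pairs; here we simply store the data,
        # as a nearest-neighbor classifier would.
        return {"images": images, "labels": labels}

    def predict(model, test_images):
        # Assign each test image the label of its closest training image (L2 distance).
        diffs = model["images"][None, :, :] - test_images[:, None, :]
        nearest = np.argmin((diffs ** 2).sum(axis=2), axis=1)
        return model["labels"][nearest]

    # Hypothetical usage with flattened images as rows of NumPy arrays:
    # model = train(train_images, train_labels)
    # accuracy = (predict(model, test_images) == test_labels).mean()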

Example training set

9
Machine Learning/
Pattern Recognition
[Diagram: taxonomy of machine learning / pattern recognition.
 Uncorrelated events: supervised learning (parameter/density estimation, decision boundary, minimum distance) and unsupervised clustering (hierarchical clustering);
 correlated events: hidden Markov models.]
10
From Probability to Bayes Decision Rule
• Example: Testing/Screening of COVID-19
• The distributions of positive/negative test results (e.g., PCR, antibody, etc.):
  • the further away they are from each other,
  • the more accurate the COVID diagnosis.

  [Figure: test-result distributions for ground-truth positive and negative cases, with a threshold separating "negative for COVID" from "positive for COVID" decisions.]
11
Bayesian Decision Theory
• Fundamental statistical approach to classification/detection tasks
• Take a 2-class classification/detection task as an example:
• Let’s see if a student would pass or fail the CV course.
• Define a probabilistic variable ω to describe the case of pass or fail.
• That is, ω = ω1 for pass, and ω = ω2 for fail.
• Prior Probability
• The a priori or prior probability reflects the knowledge of
how likely we expect a certain state of nature before observation.
• P(ω = ω1) or simply P(ω1) as the prior that the next student would pass CV.
• The priors must exhibit exclusivity and exhaustivity, i.e.,

  $\sum_{j=1}^{C} P(\omega_j) = 1$

• Equal priors
  • If we have equal numbers of students pass/fail CV, then the priors are equal;
    in other words, the priors are uniform:

    $P(\omega_1) = P(\omega_2) = 0.5$
12
Prior Probability (cont’d)
• Decision rule based on priors only
• If the only available info is the prior,
and the cost of any type of incorrect classification is equal,
what would be a reasonable decision rule?
• Decide ω1 if
$P(\omega_1) > P(\omega_2)$
otherwise decide ω2 .
• What’s the incorrect classification rate (or error rate) Pe?

$P_e = \min\{P(\omega_1),\ P(\omega_2)\}$

13
Class-Conditional Probability Density
(or Likelihood)
• The probability density function (PDF) for input/observation x given a state of nature ω
is written as:
$p(x|\omega_1)$

• Here are (hopefully) the hypothetical class-conditional densities $p(x|\omega_1)$ and $p(x|\omega_2)$,
  reflecting the time spent on CV by students who eventually pass or fail this course.

  [Figure: the two densities, estimated from training data observed/collected in advance, e.g., via maximum likelihood estimation (MLE).]

14
Posterior Probability & Bayes Formula
• If we know the prior distribution and the class-conditional density,
can we come up with a better decision rule?
• Yes We Can!
• By calculating the posterior probability.
• Posterior probability $P(\omega|x)$:
• The probability of a certain state of nature ω given an observable x.
• Bayes formula:

$P(\omega_j, x) = p(x|\omega_j)\,P(\omega_j) = P(\omega_j|x)\,p(x)$

$P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$

And, we have $\sum_{j=1}^{C} P(\omega_j|x) = 1$.

15
Decision Rule & Probability of Error
• For a given observable x (e.g., the time you can spend on CV),
  the decision rule (to take CV or not) will now be based on:

  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$

  $\omega^* = \arg\max_i P(\omega_i|x)$   Maximum A Posteriori (MAP)
• What’s the probability of error P(error) (or $P_e$)?

  $P_e = \min\{P(\omega_1|x),\ P(\omega_2|x)\}$ over all x

  [Figure: posteriors $P(\omega_2|x)$ and $P(\omega_1|x)$ with decision threshold T.]
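
A minimal numerical sketch of the MAP rule, assuming (hypothetically) Gaussian class-conditional densities and made-up priors:

    import numpy as np

    def gaussian_pdf(x, mean, std):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

    # Hypothetical priors and class-conditional densities p(x|w1), p(x|w2).
    priors = np.array([0.6, 0.4])                    # P(w1), P(w2)
    means, stds = np.array([8.0, 3.0]), np.array([2.0, 2.0])

    def map_decision(x):
        # Posteriors are proportional to likelihood * prior; p(x) cancels in the argmax.
        likelihoods = gaussian_pdf(x, means, stds)
        posteriors = likelihoods * priors
        posteriors /= posteriors.sum()               # normalize by p(x)
        return np.argmax(posteriors) + 1, posteriors # class index: 1 = pass, 2 = fail

    # e.g., map_decision(6.0) decides w1 exactly when P(w1|x) > P(w2|x)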

16
From Bayes Decision Rule to Detection Theory
• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), rejection (TN)


  $TP = \int_{T} p(x|\omega_1)\,P(\omega_1)\,dx$

  $FP = \int_{T} p(x|\omega_2)\,P(\omega_2)\,dx$

  (integration over the region beyond threshold T where the detector decides $\omega_1$)

  [Figure: curves $p(x|\omega_2)P(\omega_2)$ and $p(x|\omega_1)P(\omega_1)$ with threshold T; at T*, the equal error rate (EER) is reached, i.e., FP = FN.]

• Receiver Operating Characteristics (ROC)


• To assess the effectiveness of the designed features/classifiers
• False alarm (PFA or FP) vs. detection (Pd or TP) rates
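
A sketch of how an ROC curve and the EER can be traced by sweeping the threshold T over detection scores (the scores below are synthetic, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic detection scores: class w1 (positives) vs. class w2 (negatives).
    pos_scores = rng.normal(2.0, 1.0, 1000)
    neg_scores = rng.normal(0.0, 1.0, 1000)

    thresholds = np.linspace(-4, 6, 200)
    tpr = np.array([(pos_scores >= t).mean() for t in thresholds])  # detection (hit) rate
    fpr = np.array([(neg_scores >= t).mean() for t in thresholds])  # false-alarm rate

    # Approximate equal error rate (EER): threshold where false-alarm rate = miss rate.
    fnr = 1.0 - tpr
    eer_index = np.argmin(np.abs(fpr - fnr))
    print("EER ~", (fpr[eer_index] + fnr[eer_index]) / 2)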

17
Nonparametric Techniques:
Parzen Window
• Parzen-window approach to estimate densities: assume, e.g., that the region $\mathcal{R}_n$ is a d-dimensional hypercube:

  $V_n = h_n^d$   ($h_n$: length of the edge of $\mathcal{R}_n$)

  Let $\varphi(u)$ be the following hypercube window function:

  $\varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2},\ j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$

• $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the hypercube of volume $V_n$ centered at $x$, and equal to zero otherwise.
• The number of samples in this hypercube is:

  $k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$

  Substituting $k_n$ into $p_n(x) = (k_n/n)/V_n$, we obtain:

  $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$

• $p_n(x)$ estimates $p(x)$ as an average of window functions of $x$ and the samples $x_i$ ($i = 1, \dots, n$), and

  $p_n(x) \rightarrow p(x)$ as $n \rightarrow \infty$
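
A minimal 1-D sketch of this estimator with the hypercube (box) window; the sample data are synthetic and purely illustrative:

    import numpy as np

    def parzen_estimate(x, samples, h):
        # Hypercube (box) window: phi(u) = 1 if |u| <= 1/2, else 0.
        u = (x[:, None] - samples[None, :]) / h
        phi = (np.abs(u) <= 0.5).astype(float)
        k_n = phi.sum(axis=1)              # number of samples inside each window
        return k_n / (len(samples) * h)    # p_n(x) = (k_n / n) / V_n, with V_n = h in 1-D

    samples = np.random.default_rng(1).normal(0.0, 1.0, 500)  # hypothetical data
    xs = np.linspace(-3, 3, 61)
    density = parzen_estimate(xs, samples, h=0.5)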
Nonparametric Techniques:
Nearest Neighbor

• Memorize all training data and labels

• Predict the label of the most similar training image

20
21
Nearest Neighbor Decision
Boundaries
[Figure: nearest neighbors in two dimensions; training points plotted over colored decision regions in the (x0, x1) plane.]

• Points are training examples; colors give training labels
• Background colors give the category a test point would be assigned
• The decision boundary is the boundary between two classification regions
• Decision boundaries can be noisy; affected by outliers
• How to smooth out decision boundaries? Use more neighbors!

22
K-Nearest Neighbors (kNN)
• Instead of copying label from nearest neighbor,
take majority vote from K closest points
K=1 K=3

23
K-Nearest Neighbors (kNN)
• Make the decision boundary smoother
• Reduce the effect of outliers
K=1 K=3

http://vision.stanford.edu/teaching/cs231n-demos/knn/
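
A small NumPy sketch of the kNN rule above (majority vote among the K closest training points); the data arrays are hypothetical placeholders:

    import numpy as np

    def knn_predict(train_X, train_y, test_X, k=3):
        # L2 distances between every test point and every training point.
        d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
        # Indices of the k closest training points for each test point.
        nearest = np.argsort(d, axis=1)[:, :k]
        votes = train_y[nearest]           # labels (non-negative ints) of those neighbors
        # Majority vote per test point.
        return np.array([np.bincount(v).argmax() for v in votes])

    # Hypothetical usage: train_X is (N, D) floats, train_y is (N,) integer labels.
    # preds = knn_predict(train_X, train_y, test_X, k=3)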
24
Minor Remarks on NN-based Methods
• k-NN is easy to implement but not of much interest in practice. Why?
• Choice of distance metrics might be an issue (see example below)
• Measuring distances in high-dimensional spaces might not be a good idea.
• Moreover, NN-based methods require lots of space and computation time!
(NN-based methods are viewed as data-driven approaches.)

All three images have the same Euclidean distance to the original one.

Image credit: Stanford CS231n 25


Nonparametric Techniques
• kNN is also a nonparametric technique:
• Specify $k_n$ as a function of n, such as $k_n = \sqrt{n}$; the volume
  $V_n$ is grown until it encloses $k_n$ neighbors of x

26
Unsupervised Learning and
Supervised Learning
[Diagram: taxonomy of machine learning / pattern recognition.
 Uncorrelated events: supervised learning (parameter/density estimation, decision boundary, minimum distance) and unsupervised clustering (hierarchical clustering);
 correlated events: hidden Markov models.]

27
Clustering
• Clustering is an unsupervised algorithm.
• Given:
a set of N unlabeled instances {x1, …, xN}; # of clusters K
• Goal: group the samples into K partitions
• Remarks:
• High within-cluster (intra-cluster) similarity
• Low between-cluster (inter-cluster) similarity
• But…how to determine a proper similarity measure?

28
Similarity is NOT Always Objective…

29
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:

• Euclidean distance (L2 norm): $d(x, z) = \|x - z\|_2 = \sqrt{\sum_{i=1}^{D} (x_i - z_i)^2}$

• Manhattan distance (L1 norm): $d(x, z) = \|x - z\|_1 = \sum_{i=1}^{D} |x_i - z_i|$

• Note that the p-norm distance is denoted as:

  $L_p(x, z) = \left( \sum_{i=1}^{D} |x_i - z_i|^p \right)^{1/p}$

  $L_0(x, z) = \lim_{p \to 0} \left( \sum_{i=1}^{D} |x_i - z_i|^p \right)^{1/p}$
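
These metrics translate directly into code; a small NumPy sketch with arbitrary example vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([2.0, 0.0, 3.5])

    l2 = np.sqrt(np.sum((x - z) ** 2))             # Euclidean distance, same as np.linalg.norm(x - z)
    l1 = np.sum(np.abs(x - z))                     # Manhattan distance
    p = 3
    lp = np.sum(np.abs(x - z) ** p) ** (1.0 / p)   # general p-norm distance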

30
Clustering (cont’d)
• Similarity:
• A key component/measure to perform data clustering
• Inversely proportional to distance
• Example distance metrics:
• Kernelized (non-linear) distance:

  $d^2(x, z) = \|\Phi(x) - \Phi(z)\|_2^2 = \|\Phi(x)\|_2^2 + \|\Phi(z)\|_2^2 - 2\,\Phi(x)^T \Phi(z)$

• Taking the Gaussian kernel for example: $K(x, z) = \Phi(x)^T \Phi(z) = \exp\!\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right)$,
  we have $\|\Phi(x)\|_2^2 = \Phi(x)^T \Phi(x) = 1$;
  the distance is thus more sensitive for smaller σ (see the sketch below).


• For example, L2 or kernelized distance metrics for the following two cases?
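
A sketch contrasting the plain L2 distance with the Gaussian-kernelized distance above (the vectors and σ values are illustrative):

    import numpy as np

    def gaussian_kernel(x, z, sigma):
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    def kernel_distance_sq(x, z, sigma):
        # d^2 = K(x,x) + K(z,z) - 2 K(x,z); for the Gaussian kernel K(x,x) = K(z,z) = 1.
        return 2.0 - 2.0 * gaussian_kernel(x, z, sigma)

    x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
    l2_sq = np.sum((x - z) ** 2)
    print(l2_sq, kernel_distance_sq(x, z, sigma=0.5), kernel_distance_sq(x, z, sigma=5.0))
    # A smaller sigma drives the kernel toward 0, so the kernelized distance saturates near 2 sooner.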

31
K-Means Clustering
• Input: N examples {x1, . . . , xN } (xn ∈ RD ); number of partitions K
• Initialize: K cluster centers µ 1, . . . , µ K . Several initialization options:
• Randomly initialize µ 1, . . . , µ K anywhere in RD
• Or, simply choose any K examples as the cluster centers
• Iterate:
• Assign each example xn to its closest cluster center
• Recompute the new cluster centers µk (mean/centroid of the set Ck)
• Repeat until convergence
• Possible convergence criteria:
• Cluster centers do not change anymore
• Max. number of iterations reached
• Output:
• K clusters (with centers/means of each cluster)
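
A compact NumPy sketch of the procedure above (the 2-D data here are random and purely illustrative; the sketch assumes no cluster becomes empty):

    import numpy as np

    def kmeans(X, K, num_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]       # choose K examples as initial centers
        for _ in range(num_iters):
            # Assignment step: each example goes to its closest cluster center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = np.argmin(d, axis=1)
            # Update step: recompute each center as the mean of its assigned examples.
            new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):               # converged: centers unchanged
                break
            centers = new_centers
        return centers, assign

    X = np.random.default_rng(1).normal(size=(200, 2))          # hypothetical 2-D data
    centers, assign = kmeans(X, K=2)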

32
K-Means Clustering

• Example (K = 2): Initialization, iteration #1: pick cluster centers

33
K-Means Clustering

• Example (K = 2): iteration #1-2, assign data to each cluster

34
K-Means Clustering

• Example (K = 2): iteration #2-1, update cluster centers

35
K-Means Clustering

• Example (K = 2): iteration #2, assign data to each cluster

36
K-Means Clustering

• Example (K = 2): iteration #3-1

37
K-Means Clustering

• Example (K = 2): iteration #3-2

38
K-Means Clustering

• Example (K = 2): iteration #4-1

39
K-Means Clustering

• Example (K = 2): iteration #4-2

40
K-Means Clustering

• Example (K = 2): iteration #5, cluster means no longer change (converged).

41
K-Means Clustering (cont’d)
• Limitation
• Preferable for round shaped clusters with similar sizes

• Sensitive to initialization; how to alleviate this problem?


• Sensitive to outliers; possible change from K-means to…
• Hard assignment only.

• Remarks
• Expectation-maximization (EM) algorithm
• Speed-up possible by hierarchical clustering (e.g., 100 = 10² clusters)

42
Dimension Reduction
• Principal Component Analysis (PCA)
• Unsupervised & linear dimension reduction
• Related to Eigenfaces and other feature extraction and classification techniques
• Still very popular owing to its simplicity and effectiveness.
• Goal:
• Determine the projection, so that the variation of projected data is maximized.

[Figure: 2-D data points (axis x) with the principal axis drawn through them: the axis that describes the largest variation for data projected onto it.]

43
Formulation & Derivation for PCA
• Input: a set of instances x without label info
• Output: a projection vector u1 maximizing the variance of the projected data

$S = T T^T$, where T is the matrix of preprocessed training examples:
each column contains one mean-subtracted image.
44
Formulation & Derivation for PCA

45
Formulation & Derivation for PCA

However, $S = T T^T$ is a large matrix. If instead we take the eigenvalue decomposition of the smaller matrix $T^T T$,

  $T^T T\, u_i = \lambda_i u_i$,

then we notice that by pre-multiplying both sides of the equation with T, we obtain

  $T T^T (T u_i) = \lambda_i (T u_i)$.

Meaning that, if $u_i$ is an eigenvector of $T^T T$, then $v_i = T u_i$ is an
eigenvector of $S = T T^T$. If we have a training set of 300 images of 100 × 100
pixels, the matrix $T^T T$ is a 300 × 300 matrix, which is much more
manageable than the 10,000 × 10,000 covariance matrix.
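
A NumPy sketch of this small-matrix trick, assuming images are stacked one flattened image per column (array names are illustrative):

    import numpy as np

    def eigenfaces(images):
        # images: (num_pixels, num_images), one flattened image per column.
        mean = images.mean(axis=1, keepdims=True)
        T = images - mean                                   # mean-subtracted data
        # Eigen-decomposition of the small num_images x num_images matrix T^T T.
        evals, U = np.linalg.eigh(T.T @ T)
        order = np.argsort(evals)[::-1]                     # sort by decreasing eigenvalue
        evals, U = evals[order], U[:, order]
        V = T @ U                                           # eigenvectors of S = T T^T
        V /= np.linalg.norm(V, axis=0, keepdims=True)       # normalize each eigenface
        return mean, V, evals                               # in practice, keep only leading columns of V

    # e.g., 300 images of 100x100 pixels -> images has shape (10000, 300),
    # but only a 300 x 300 eigen-decomposition is required.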

46
Eigenanalysis
• A d x d covariance matrix contains a maximum of d
eigenvector/eigenvalue pairs.
• How is dimension reduction realized? How do we reconstruct
  the input data?

• Expanding a signal via eigenvectors as bases


• With symmetric matrices (e.g., covariance matrix),
eigenvectors are orthogonal.
• They can be regarded as unit basis vectors to span any
instance in the d-dim space.

47
Let’s See an Example (CMU AMP Face Database)
• Let’s take 5 face images × 13 people = 65 images, each of size 64 × 64 = 4096 pixels.
• # of eigenvectors expected for perfectly reconstructing the input = 64
  (the 65 mean-subtracted images span at most 64 dimensions).
• Let’s check it out!
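
A sketch of the reconstruction experiment that follows, keeping only the top k eigenvectors and reporting the mean squared error; it reuses the hypothetical eigenfaces() helper sketched earlier:

    import numpy as np

    def reconstruct(image, mean, V, k):
        # Project the mean-subtracted image onto the top-k eigenvectors, then expand back.
        coeffs = V[:, :k].T @ (image - mean.ravel())
        return mean.ravel() + V[:, :k] @ coeffs

    def mse(a, b):
        return np.mean((a - b) ** 2)

    # Hypothetical usage, with `faces` of shape (4096, 65) for 64x64 images of 65 faces:
    # mean, V, evals = eigenfaces(faces)
    # for k in (1, 2, 4, 8, 20, 30, 50, 64):
    #     rec = reconstruct(faces[:, 0], mean, V, k)
    #     print(k, mse(faces[:, 0], rec))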

49
What Do the Eigenvectors/Eigenfaces Look Like?
Mean V1 V2 V3

V4 V5 V6 V7

V8 V9 V10 V11

V12 V13 V14 V15

50
All 64 Eigenvectors, do we need them all?

51
Use only 1 eigenvector, MSE = 1233

MSE=1233.16

52
Use 2 eigenvectors, MSE = 1027

MSE=1027.63

53
Use 3 eigenvectors, MSE = 758

MSE=758.13

54
Use 4 eigenvectors, MSE = 634

MSE=634.54

55
Use 8 eigenvectors, MSE = 285

MSE=285.08

56
With 20 eigenvectors, MSE = 87

MSE=87.93

57
With 30 eigenvectors, MSE = 20

MSE=20.55

58
With 50 eigenvectors, MSE = 2.14

MSE=2.14

59
With 60 eigenvectors, MSE = 0.06

MSE=0.06

60
All 64 eigenvectors, MSE = 0

MSE=0.00

61
Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA)
  • classifies objects into one of two or more groups
  • based on a set of features
• The transform tries to maximize the ratio of between-class
  variance to within-class variance
  • Between-class variance
  • Within-class variance

  [Figure: two different classes and their projections onto a discriminant direction.]
62
Mathematical Operations
• Maximize Fisher's criterion

  $J(W) = \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}$

  (where $S_B$ and $S_W$ are the between-class and within-class scatter matrices)
• If $y = W^T x$ is the transform of x
• Compute J after the transform
• Find W to maximize $J(W)$
63
Find W
• If we are lucky, $S_W$ is a non-singular matrix
• We can then find $S_W^{-1}$
• Calculate the eigenvectors of $S_W^{-1} S_B$
• If not, well… it is tough work to do
• Everyone tries to avoid this
• e.g., by using PCA first to reduce the dimension
64


Small Example

65


Experiment
• Matthew Turk and Alex Pentland, “Eigenfaces for Recognition,” Journal of Cognitive Neuroscience, March 1991.
• Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.

66
Hyperparameters in ML
• Recall that for k-NN, we need to determine the k value in advance.
• What is the best k value?
• Or, take PCA for example, what is the best reduced dimension number?

• Hyperparameters: parameter choices for the learning model/algorithm


• We need to determine such hyperparameters instead of guessing.
• Let’s see what we can and cannot do…

k=1 k=3 k=5

Image credit: Stanford CS231n 67


How to Determine Hyperparameters?
• Idea #1
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• You take a dataset to train your model, and select your hyperparameters
(e.g., k of k-NN) based on the resulting performance.

• Might not generalize well.

Dataset

68
How to Determine Hyperparameters? (cont’d)
• Idea #2
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For a dataset of interest, you split it into training and test sets.
• You train your model with possible hyperparameter choices (e.g., k in k-NN),
  and select those that work best on the test set data.

• That’s called cheating…

Training set Test set

69
How to Determine Hyperparameters? (cont’d)
• Idea #3
• Let’s say you are working on face recognition.
• You come up with your very own feature extraction/learning algorithm.
• For the dataset of interest, it is split into training, validation, and test sets.
• You train your model with possible hyperparameter choices (k in k-NN),
  and select those that work best on the validation set.

• OK, but…

Training set Validation set Test set

Training set Validation set Test set

70
How to Determine Hyperparameters? (cont’d)
• Idea #3.5
• What if only training and test sets are given, not the validation set?
• Cross-validation (or k-fold cross validation)
• For each hyperparameter choice, split the training set into k folds
• Keep 1 fold as the validation set and use the remaining k-1 folds for training
• After each of the k folds has been evaluated, report the average validation performance.
• Choose the hyperparameter(s) which result in the highest average validation
performance.
• Take a 4-fold cross-validation as an example…
Training set Test set

Fold 1 Fold 2 Fold 3 Fold 4 Test set


Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
Fold 1 Fold 2 Fold 3 Fold 4 Test set
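
A sketch of this k-fold procedure for choosing k in k-NN; it assumes a hypothetical knn_predict(train_X, train_y, test_X, k) helper (such as the one sketched earlier) and a labeled training set:

    import numpy as np

    def cross_validate(train_X, train_y, candidate_ks, num_folds=4, seed=0):
        idx = np.random.default_rng(seed).permutation(len(train_X))
        folds = np.array_split(idx, num_folds)
        best_k, best_acc = None, -1.0
        for k in candidate_ks:
            accs = []
            for i in range(num_folds):
                val_idx = folds[i]                 # keep 1 fold for validation
                trn_idx = np.concatenate([folds[j] for j in range(num_folds) if j != i])
                preds = knn_predict(train_X[trn_idx], train_y[trn_idx], train_X[val_idx], k)
                accs.append((preds == train_y[val_idx]).mean())
            avg = float(np.mean(accs))             # average validation accuracy over the folds
            if avg > best_acc:
                best_k, best_acc = k, avg
        return best_k, best_acc

    # e.g., best_k, _ = cross_validate(train_X, train_y, candidate_ks=(1, 3, 5, 7))
    # Only after choosing best_k do we evaluate once on the held-out test set.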

71
