Lecture 12: Unsupervised Learning
Agenda
1. Unsupervised learning

2. Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean shift clustering
d) DBSCAN
e) Clustering metrics

3. Anomaly Detection
a) Gaussian Mixture Models and Expectation-Maximization
b) One-class support vector machine
c) Isolation forest
Dataset and Notebook

Water Treatment Plant: Original Dataset


Water Treatment Plant: Processed Dataset
Data description
Colab Notebook
Unsupervised Learning

Unsupervised learning:
• No labels
• Find hidden structure in the data
• Main algorithms: clustering, dimensionality reduction, anomaly detection

Supervised learning:
• Labels are given
• Learn a mapping from X to y, based on the training dataset
• Main algorithms: classification, regression
Training and Validation

Historical data → Data cleaning and preparation → ML algorithms → Patterns and rules → Validation on real data

Inference

Real data → Data cleaning and preparation → Exploitation of algorithms and rules → Data grouping
Clustering
Main task: determine distinct groups (clusters), putting similar data into one group.
Clustering should satisfy the following conditions:
- points inside a cluster should be similar - high intra-cluster similarity;
- points from different clusters should be far away from each other - low inter-cluster similarity.

What’s the natural grouping between the data points?


What is the best way to group the points?
What is the number of clusters? Can it be determined automatically?
How can trivial clusters be avoided?
Should very big or very small clusters be allowed?
What metrics are used for clustering evaluation?
Clustering methods taxonomy
Distance between points

One way to find similar points is to define a distance function between them.
The following metrics are often used:

- Euclidean distance
- Manhattan distance
- Minkowski distance
- Jaccard similarity index (for categorical data)
- Cosine similarity
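As a quick illustration (not part of the original slides), these distances can be computed with scipy; the toy vectors below are invented for the example.

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski, cosine, jaccard

# Two toy numeric vectors (made up for illustration)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

print("Euclidean:", euclidean(a, b))
print("Manhattan:", cityblock(a, b))            # L1 distance
print("Minkowski (p=3):", minkowski(a, b, p=3))
print("Cosine distance:", cosine(a, b))         # 1 - cosine similarity

# Jaccard is intended for binary/categorical indicator vectors
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
print("Jaccard distance:", jaccard(u, v))       # 1 - Jaccard similarity index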
K-means
1. Define the number of clusters 𝑘.
2. Randomly initialize 𝑘 centroids.
3. Calculate the Euclidean distance between
centroids and data points.
4. Group data points into clusters around
centroids.
5. Recalculate centroids: let each centroid
be the mean of the data points in the
corresponding cluster.
6. Repeat steps 3-5 until convergence.
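A minimal sketch of these steps with scikit-learn; the synthetic blobs dataset is an assumption for illustration, not the water treatment data from the notebook.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2-D data with 3 well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Steps 1-6: choose k, initialize centroids, assign points, update centroids, repeat
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # sum of squared distances to the closest centroid

Setting n_init runs several random initializations and keeps the best, which mitigates the sensitivity of step 2 to the initial centroid placement.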
K-means problems
- the number of clusters must be known a priori
- the method is not robust to clusters of different size, density and shape
- the method is not robust to outliers
- inapplicable to large / high-dimensional datasets
Hierarchical clustering

Agglomerative (bottom-up):
1. Each point is considered as a separate cluster.
2. The closest points/clusters are merged.
3. Repeat steps 1-2, until a termination condition is met.

Divisive (top-down):
1. Data is split into two parts.
2. The best splitting is determined.
3. Repeat steps 1-2, until a termination condition is met.

Termination conditions:
❑ a predefined cluster number is achieved;
❑ all points are merged into a single cluster (agglomerative) or each point forms a cluster (divisive);
❑ the distance between the closest clusters is bigger than a predefined threshold (agglomerative);
❑ the maximum distance between any partition of a cluster is smaller than a predefined threshold (divisive).
Hierarchical clustering
Agglomerative clustering represents the hierarchy of clusters in a tree form, a
dendrogram.

The root of the dendrogram is a single cluster that contains all data points.
Each leaf is a separate point.
This model can be easily interpreted and shows similarity between
different data points.
Final clusters are obtained by pruning the dendrogram at some level. Thus,
one can choose the number of clusters to use.
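A small sketch, assuming synthetic blob data, that builds the dendrogram with scipy and then "prunes" it to a chosen number of clusters with scikit-learn:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge hierarchy (Ward linkage) and draw the dendrogram
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()

# "Prune" the dendrogram by asking for a fixed number of clusters
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)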
Example

Multi-dimensional visualizations for clustering


DBSCAN: density-based spatial clustering
of applications with noise
Suitable for data with a large number of intersecting groups.

The previous algorithms were based on a distance function: points that are close in distance were put in
the same cluster. For scattered data, algorithms based only on distance tend to perform badly and to
treat all points as clusters.

DBSCAN is robust to outliers and can also be considered as an anomaly detection algorithm.

The algorithm can handle arbitrary cluster shapes and does not require the number of clusters to be
specified a priori.

Parameters:
ε – the maximum neighborhood radius
n – the minimum number of points to form a region
DBSCAN algorithm
1. For the current point, calculate the distance to all other points.
2. Mark all points that are in the ε-neighborhood of the current point
as neighbors.
3. If the number of neighbors is greater than or equal to n, merge the
points to form a cluster.
4. If the number of neighbors is smaller than n, mark the point as an
outlier.
5. Continue with the other points, until all of them are either marked as
outliers or belong to a cluster.
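A sketch of DBSCAN with scikit-learn; the two-moons dataset and the ε / n values below are illustrative assumptions:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two intersecting half-moons: a shape k-means handles poorly (illustration only)
X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

# eps is the neighborhood radius ε, min_samples is the minimum number of points n
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points marked as outliers:", np.sum(labels == -1))   # label -1 = noise

Points that end up with label -1 are exactly the outliers from step 4, which is why DBSCAN doubles as an anomaly detector.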
Mean Shift Clustering
Non-parametric feature-space analysis
technique for locating the maxima of a
density function.
It is also called the mode-seeking algorithm.

1. Define the kernel (window) size and place a window on the current point.
2. Calculate the centroid of all points located inside the window.
3. Move the window to the centroid.
4. Repeat steps 2-3, until convergence.

More information here.


Pros:
• Does not require the number of clusters to be specified a priori
• Can handle various cluster shapes
• Can handle various feature spaces
• Robust to outliers

Cons:
• Window size can significantly affect the performance: a wrong size can merge the modes or create multiple false-positive modes
• Computationally expensive

OpenCV: cv.meanShift
Sklearn: sklearn.cluster.MeanShift
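A minimal sketch using sklearn.cluster.MeanShift (listed above); the synthetic data and the automatically estimated bandwidth are assumptions for illustration:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=1)

# The bandwidth plays the role of the window size; a bad choice merges or splits modes
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("modes found:", len(ms.cluster_centers_))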
How to choose number of clusters?

"Do not forget that clusters are, in large part, in the eye of the beholder."
(Estivill-Castro, 2002)

The number of clusters is chosen:

❑ based on metrics, so that the clusters explain the variance in the data;
❑ intuitively or based on prior knowledge (for example, from the
customer).
Clustering Metrics
Elbow rule
1. Group the data into 2, 3, ..., n clusters.
2. For each cluster, calculate the average
distance to the centroid and average
the results (D).
3. Plot the graph D = f(n).
4. The optimal cluster number N is located
immediately after the last steep
decline, between D(N-1) and D(N).

yellowbrick.cluster.elbow.KElbowVisualizer
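The slide points to yellowbrick's KElbowVisualizer; as a plain sketch of the same idea, the curve can also be drawn by hand, here using K-means inertia as a stand-in for D(n) on assumed synthetic data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# D(n): approximated here by K-means inertia (sum of squared distances to centroids)
ks = range(2, 11)
scores = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of clusters n")
plt.ylabel("D(n)")
plt.show()   # pick n just after the last steep decline (the "elbow")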
Clustering Metrics
The silhouette value measures the consistency within clusters of data.

It shows how similar an object is to its own cluster (cohesion) compared to other
clusters (separation).

The silhouette ranges from −1 to +1, where a high
value indicates that the object is well matched to its
own cluster and poorly matched to neighboring
clusters.

Clustering metrics on synthetic data

Sklearn: Silhouette analysis
sklearn.metrics.silhouette_score
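A short sketch of sklearn.metrics.silhouette_score on assumed synthetic data, scanning several candidate cluster counts:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Higher mean silhouette (closer to +1) suggests better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))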
Anomaly Detection

An anomaly (also called an outlier or noise) is an unexpected value or
event that significantly differs from
normal behaviour.

The appearance of anomalies may indicate objective
internal changes or external interventions.

Examples of anomaly detection here.


Anomaly Detection

• Anomalies occur very rarely (anomaly detection is a problem
with imbalanced classes);
• The properties of anomalies significantly
differ from those of normal instances.
Gaussian Mixture Models and
Expectation-Maximization
Each separate cluster can be
represented as a Gaussian distribution,
so data is a mixture of Gaussian
distributions.
GMM tries to group points that
come from the same distribution.

If data is a realization of the mixture of three Gaussians with means (μ1, μ2,
μ3) and standard deviations (σ1, σ2, σ3), then GMM will identify for each
point a probability distribution among different clusters.
Gaussian Mixture Models and
Expectation-Maximization
Expectation-Maximization is an
iterative algorithm that
estimates the parameters of a
probability distribution with
latent variables by maximizing
the likelihood function.

The optimization is done by
iteratively switching between
two steps: Expectation and
Maximization. It can be shown
that each parameter update does
not decrease the likelihood.
Gaussian Mixture Models and
Expectation-Maximization
Estimate Z_i:

1. Initialize μ, σ, π and calculate the log-likelihood of the data.
2. E-step: estimate the conditional probabilities Z_i.
3. M-step: based on the E-step, update the parameters μ, σ, π.
4. Calculate the log-likelihood with the updated parameters.
5. Repeat steps 2-4, until the change in log-likelihood is smaller than a
predefined threshold.
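A minimal sketch with sklearn.mixture.GaussianMixture, which runs this EM loop internally; the synthetic data and the 1% likelihood threshold for flagging anomalies are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# EM under the hood: E-step (responsibilities Z_i) and M-step (update mu, sigma, pi)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

resp = gmm.predict_proba(X)      # per-point probability of belonging to each cluster
log_lik = gmm.score_samples(X)   # per-point log-likelihood under the fitted mixture

# Flag the least likely points as anomalies (1% threshold chosen for illustration)
threshold = np.percentile(log_lik, 1)
anomalies = X[log_lik < threshold]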
One-class SVM
A modification of the traditional SVM that
transforms the feature space so that
observations lie as far as possible from the
origin. Two common approaches are
Schölkopf et al. and Tax & Duin.

As a result, on one side of the curve there
are "normal" data points, and on the other
side are abnormal values or anomalies.

The algorithm can be used for novelty detection.

The training dataset should not contain any anomalies.

The kernel trick can be used for non-linear transformations of the feature space.
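A sketch with sklearn.svm.OneClassSVM; the training data is assumed anomaly-free, and the nu value and injected outliers are made up for illustration:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.5 * rng.randn(200, 2)                        # "normal" training data only
X_test = np.vstack([0.5 * rng.randn(20, 2),
                    rng.uniform(-4, 4, size=(5, 2))])    # a few injected outliers

# nu bounds the fraction of training errors; the RBF kernel is the kernel trick in action
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

pred = oc_svm.predict(X_test)                # +1 = normal, -1 = anomaly
print("anomalies detected:", np.sum(pred == -1))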
Isolation Forest (iForest)
Based on the assumption that there is a small number of
anomalies and that they are significantly different from normal
observations.
That is why iForest deliberately isolates abnormal points.

The key idea is that it is faster (smaller tree depth) to
separate abnormal points.

Implementation of iForest:
❑ The training set is used to build the isolation forest.
❑ Each point of the test set is run through the iForest and receives an
anomaly score.

The algorithm has linear complexity and is easily applicable to large
datasets. It usually performs better than the one-class SVM
approach.
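A minimal sketch with sklearn.ensemble.IsolationForest; the synthetic data and the contamination value are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(500, 2)                        # mostly normal points
X_test = np.vstack([0.3 * rng.randn(50, 2),
                    rng.uniform(-4, 4, size=(10, 2))])   # plus a few outliers

# Build the forest on the training set, then score each test point
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iforest.fit(X_train)

labels = iforest.predict(X_test)             # +1 = normal, -1 = anomaly
scores = iforest.decision_function(X_test)   # lower score = more anomalous
print("anomalies detected:", np.sum(labels == -1))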
Summary
Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean-shift clustering
d) DBSCAN
e) Clustering metrics

Anomaly Detection
a) Gaussian Mixture Models and Expectation-Maximization
b) One-class SVM
c) Isolation forest
References
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on pattern analysis and machine
intelligence, 17(8), 790-799.

Dangeti, P. (2017). Statistics for machine learning: Build supervised, unsupervised, and reinforcement learning
models using both Python and R. Birmingham, UK : Packt Publishing.
Ertel, W., Black, N., & Mast, F. (2017). Introduction to artificial intelligence. Cham, Switzerland : Springer.

Igual, L., & Seguí, S. (2017). Introduction to Data Science: A Python approach to concepts, techniques and
applications. Springer International Publishing : Imprint : Springer. *

Johnston, B., Jones, A., & Kruger, C. (2019). Applied unsupervised learning with Python. Packt Publishing.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. 2008 Eighth IEEE International Conference on Data
Mining.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge
Discovery from Data, 6(1).
References
Patel, A. A. (2019). Hands-On Unsupervised learning using Python: How to build applied machine learning

solutions from unlabeled data. O'Reilly Media. *


Pradhan M., Kumar U. (2019). Machine Learning using Python. Wiley India. *

Swamynathan, M. (2019). Mastering Machine Learning with Python in Six Steps: A Practical Implementation
Guide to Predictive Data Analytics Using Python. Berkeley, CA: Apress L.P.
References
Ayramo, S., Karkkainen, T. (2006). Introduction to partitioning-based clustering methods with a robust example.
Reports of the Department of Mathematical Information Technology Series C. Software and Computational
Engineering, 1, 1-34.

Comaniciu, D., Meer, P. (2002). Mean Shift: A robust approach toward feature space analysis. IEEE Transactions
on pattern analysis and machine intelligence, 24(5), 603-619.

Estivill-Castro, V. (2002). Why so many clustering algorithms — A Position Paper. SIGKDD Explorations, 4 (1),
65-75.
Ghassabeh, Y. A. (2015). A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. Journal of
Multivariate Analysis, 135, 1-10.
Zimek, A., & Filzmoser, P. (2018). There and back again: Outlier detection between statistical reasoning and
data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6).
