Lecture 12: Unsupervised Learning
Agenda
1. Unsupervised learning

2. Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean shift clustering
d) DBSCAN
e) Clustering metrics

3. Anomaly Detection
a) Gaussian Mixture Models and Expectation-Maximization
b) One-class support vector machine
c) Isolation forest
Dataset and Notebook

Water Treatment Plant: Original Dataset


Water Treatment Plant: Processed Dataset
Data description
Colab Notebook
Unsupervised Learning

Unsupervised learning:
• No labels
• Find hidden structure in the data
• Main algorithms: clustering, dimensionality reduction, anomaly detection

Supervised learning:
• Labels are given
• Learn a mapping from X to y, based on the training dataset
• Main algorithms: classification, regression
Training and Validation

Historical data → Data cleaning and preparation → ML algorithms → Patterns and rules → Validation on real data

Inference

Real data → Data cleaning and preparation → Exploitation of algorithms and rules → Data grouping
Clustering
Main task: determine distinct groups (clusters), putting similar data into one group.
Clustering should satisfy the following conditions:
- points inside a cluster should be similar - high intra-cluster similarity;
- points from different clusters should be far away from each other - low inter-cluster similarity.

What’s the natural grouping between the data points?


What is the best way to group the points?
What is the number of clusters? Can it be determined automatically?
How can trivial clusters be avoided?
Should very big or very small clusters be allowed?
What metrics are used for clustering evaluation?
Clustering methods taxonomy
Distance between points

One way to find similar points is to define a distance function between them.
The following metrics are often used:

- Euclidean distance
- Manhattan distance
- Minkowski distance
- Jaccard similarity index (for categorical data)
- Cosine similarity
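As a quick illustration (not part of the original slides), these distances can be computed with scipy; the toy vectors below are invented for the example.

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski, cosine, jaccard

# Two toy numeric vectors (made up for illustration)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

print("Euclidean:", euclidean(a, b))
print("Manhattan:", cityblock(a, b))            # L1 distance
print("Minkowski (p=3):", minkowski(a, b, p=3))
print("Cosine distance:", cosine(a, b))         # 1 - cosine similarity

# Jaccard is intended for binary/categorical indicator vectors
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
print("Jaccard distance:", jaccard(u, v))       # 1 - Jaccard similarity index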
K-means
1. Define the number of clusters 𝑘.
2. Randomly initialize 𝑘 centroids.
3. Calculate the Euclidean distance between
centroids and data points.
4. Group data points into clusters around
centroids.
5. Recalculate centroids: let each centroid
be the mean of the data points in the
corresponding cluster.
6. Repeat steps 3-5 until convergence.
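A minimal sketch of these steps with scikit-learn; the synthetic blobs dataset is an assumption for illustration, not the water treatment data from the notebook.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2-D data with 3 well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Steps 1-6: choose k, initialize centroids, assign points, update centroids, repeat
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # sum of squared distances to the closest centroid

Setting n_init runs several random initializations and keeps the best, which mitigates the sensitivity of step 2 to the initial centroid placement.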
K-means problems
- the number of clusters must be known a priori
- the method is not robust to clusters of different size, density and shape
- the method is not robust to outliers
- inapplicable to large / high-dimensional datasets
Hierarchical clustering

Agglomerative (bottom-up):
1. Each point is considered as a separate cluster.
2. The closest points/clusters are merged.
3. Repeat steps 1-2, until a termination condition is met.

Divisive (top-down):
1. Data is split into two parts.
2. The best splitting is determined.
3. Repeat steps 1-2, until a termination condition is met.

Termination conditions:
❑ a predefined cluster number is achieved;
❑ all points are merged into a single cluster (agglomerative) or each point forms a cluster (divisive);
❑ the distance between the closest clusters is bigger than a predefined threshold (agglomerative);
❑ the maximum distance between any partition of a cluster is smaller than a predefined threshold (divisive).
Hierarchical clustering
Agglomerative clustering represents the hierarchy of clusters in a tree form, a
dendrogram.

The root of the dendrogram is a single cluster that contains all data points.
Each leaf is a separate point.
This model can be easily interpreted and shows similarity between
different data points.
Final clusters are obtained by pruning the dendrogram at some level. Thus,
one can choose the number of clusters to use.
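A small sketch, assuming synthetic blob data, that builds the dendrogram with scipy and then "prunes" it to a chosen number of clusters with scikit-learn:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge hierarchy (Ward linkage) and draw the dendrogram
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()

# "Prune" the dendrogram by asking for a fixed number of clusters
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)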
Example

Multi-dimensional visualizations for clustering


DBSCAN: density-based spatial clustering
of applications with noise
Suitable for data with a large number of intersecting groups.

The previous algorithms were based on a distance function: points that are close in distance were put in
the same cluster. For scattered data, algorithms based only on distance tend to perform badly and to
treat all points as clusters.

DBSCAN is robust to outliers and can also be considered as an anomaly detection algorithm.

The algorithm can handle arbitrary cluster shapes and does not require the number of clusters to be
specified a priori.

Parameters:
ε – the maximum neighborhood radius
n – the minimum number of points to form a region
DBSCAN algorithm
1. For the current point, calculate the distance to all other points.
2. Mark all points that are in the ε-neighborhood of the current point
as neighbors.
3. If the number of neighbors is greater than or equal to n, merge the
points to form a cluster.
4. If the number of neighbors is smaller than n, mark the point as an
outlier.
5. Continue with the other points, until all of them are either marked as
outliers or belong to a cluster.
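A sketch of DBSCAN with scikit-learn; the two-moons dataset and the ε / n values below are illustrative assumptions:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two intersecting half-moons: a shape k-means handles poorly (illustration only)
X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

# eps is the neighborhood radius ε, min_samples is the minimum number of points n
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points marked as outliers:", np.sum(labels == -1))   # label -1 = noise

Points that end up with label -1 are exactly the outliers from step 4, which is why DBSCAN doubles as an anomaly detector.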
Mean Shift Clustering
Non-parametric feature-space analysis
technique for locating the maxima of a
density function.
It is also called the mode-seeking algorithm.

1. Define the kernel (window) size and place a window on the current point.
2. Calculate the centroid of all points located inside the window.
3. Move the window to the centroid.
4. Repeat steps 2-3, until convergence.

More information here.


Pros:
• Does not require the number of clusters to be specified a priori
• Can handle various cluster shapes
• Can handle various feature spaces
• Robust to outliers

Cons:
• Window size can significantly affect the performance: a wrong size can merge the modes or create multiple false-positive modes
• Computationally expensive

OpenCV: cv.meanShift
Sklearn: sklearn.cluster.MeanShift
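A minimal sketch using sklearn.cluster.MeanShift (listed above); the synthetic data and the automatically estimated bandwidth are assumptions for illustration:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=1)

# The bandwidth plays the role of the window size; a bad choice merges or splits modes
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("modes found:", len(ms.cluster_centers_))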
How to choose number of clusters?

"Do not forget that clusters are, in large part, in the eye of the beholder."
(Estivill-Castro, 2002)

The number of clusters is chosen:

❑ based on metrics, so that the clusters explain the variance in the data;
❑ intuitively or based on prior knowledge (for example, from the
customer).
Clustering Metrics
Elbow rule
1. Group the data into 2, 3, ..., n clusters.
2. For each cluster, calculate the average
distance to the centroid and average
the results (D).
3. Plot the graph D = f(n).
4. The optimal cluster number N is located
immediately after the last steep
decline, between D(N-1) and D(N).

yellowbrick.cluster.elbow.KElbowVisualizer
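The slide points to yellowbrick's KElbowVisualizer; as a plain sketch of the same idea, the curve can also be drawn by hand, here using K-means inertia as a stand-in for D(n) on assumed synthetic data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# D(n): approximated here by K-means inertia (sum of squared distances to centroids)
ks = range(2, 11)
scores = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of clusters n")
plt.ylabel("D(n)")
plt.show()   # pick n just after the last steep decline (the "elbow")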
Clustering Metrics
The silhouette value measures the consistency within clusters of data.

It shows how similar an object is to its own cluster (cohesion) compared to other
clusters (separation).

The silhouette ranges from −1 to +1, where a high
value indicates that the object is well matched to its
own cluster and poorly matched to neighboring
clusters.

Clustering metrics on synthetic data

Sklearn: Silhouette analysis
sklearn.metrics.silhouette_score
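A short sketch of sklearn.metrics.silhouette_score on assumed synthetic data, scanning several candidate cluster counts:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Higher mean silhouette (closer to +1) suggests better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))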
Anomaly Detection

An anomaly (also called an outlier or noise) is an unexpected value or
event that significantly differs from
normal behaviour.

The appearance of anomalies may indicate objective
internal changes or external interventions.

Examples of anomaly detection here.


Anomaly Detection

• Anomalies occur very rarely (anomaly detection is a problem
with imbalanced classes);
• The properties of anomalies significantly
differ from those of normal instances.
Gaussian Mixture Models and
Expectation-Maximization
Each separate cluster can be
represented as a Gaussian distribution,
so data is a mixture of Gaussian
distributions.
GMM tries to group points that
come from the same distribution.

If data is a realization of the mixture of three Gaussians with means (μ1, μ2,
μ3) and standard deviations (σ1, σ2, σ3), then GMM will identify for each
point a probability distribution among different clusters.
Gaussian Mixture Models and
Expectation-Maximization
Expectation-Maximization is an
iterative algorithm that
estimates the parameters of a
probability distribution with
latent variables by maximizing
the likelihood function.

The optimization is done by
iteratively switching between
two steps: Expectation and
Maximization. It can be shown
that each parameter update does
not decrease the likelihood.
Gaussian Mixture Models and
Expectation-Maximization
Estimate Z_i:

1. Initialize μ, σ, π and calculate the log-likelihood of the data.
2. E-step: estimate the conditional probabilities Z_i.
3. M-step: based on the E-step, update the parameters μ, σ, π.
4. Calculate the log-likelihood with the updated parameters.
5. Repeat steps 2-4, until the change in log-likelihood is smaller than a
predefined threshold.
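A minimal sketch with sklearn.mixture.GaussianMixture, which runs this EM loop internally; the synthetic data and the 1% likelihood threshold for flagging anomalies are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# EM under the hood: E-step (responsibilities Z_i) and M-step (update mu, sigma, pi)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

resp = gmm.predict_proba(X)      # per-point probability of belonging to each cluster
log_lik = gmm.score_samples(X)   # per-point log-likelihood under the fitted mixture

# Flag the least likely points as anomalies (1% threshold chosen for illustration)
threshold = np.percentile(log_lik, 1)
anomalies = X[log_lik < threshold]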
One-class SVM
A modification of the traditional SVM that
transforms the feature space so that
observations lie as far as possible from the
origin. Two common approaches are
Schölkopf et al. and Tax & Duin.

As a result, on one side of the curve there
are "normal" data points, and on the other
side are abnormal values or anomalies.

The algorithm can be used for novelty detection.

The training dataset should not contain any anomalies.

The kernel trick can be used for non-linear transformations of the feature space.
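A sketch with sklearn.svm.OneClassSVM; the training data is assumed anomaly-free, and the nu value and injected outliers are made up for illustration:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.5 * rng.randn(200, 2)                        # "normal" training data only
X_test = np.vstack([0.5 * rng.randn(20, 2),
                    rng.uniform(-4, 4, size=(5, 2))])    # a few injected outliers

# nu bounds the fraction of training errors; the RBF kernel is the kernel trick in action
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

pred = oc_svm.predict(X_test)                # +1 = normal, -1 = anomaly
print("anomalies detected:", np.sum(pred == -1))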
Isolation Forest (iForest)
Based on the assumption that there is a small number of
anomalies and that they are significantly different from normal
observations.
That is why iForest deliberately isolates abnormal points.

The key idea is that it is faster (smaller tree depth) to
separate abnormal points.

Implementation of iForest:
❑ The training set is used to build the isolation forest.
❑ Each point of the test set is run through the iForest and receives an
anomaly score.

The algorithm has linear complexity and is easily applicable to large
datasets. It usually performs better than the one-class SVM
approach.
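A minimal sketch with sklearn.ensemble.IsolationForest; the synthetic data and the contamination value are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(500, 2)                        # mostly normal points
X_test = np.vstack([0.3 * rng.randn(50, 2),
                    rng.uniform(-4, 4, size=(10, 2))])   # plus a few outliers

# Build the forest on the training set, then score each test point
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iforest.fit(X_train)

labels = iforest.predict(X_test)             # +1 = normal, -1 = anomaly
scores = iforest.decision_function(X_test)   # lower score = more anomalous
print("anomalies detected:", np.sum(labels == -1))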
Summary
Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean-shift clustering
d) DBSCAN
e) Clustering metrics

Anomaly Detection
a) Gaussian Mixture Models and Expectation-Maximization
b) One-class SVM
c) Isolation forest
References
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on pattern analysis and machine
intelligence, 17(8), 790-799.

Dangeti, P. (2017). Statistics for machine learning: Build supervised, unsupervised, and reinforcement learning
models using both Python and R. Birmingham, UK : Packt Publishing.
Ertel, W., Black, N., & Mast, F. (2017). Introduction to artificial intelligence. Cham, Switzerland : Springer.

Igual, L., & Seguí, S. (2017). Introduction to Data Science: A Python approach to concepts, techniques and
applications. Springer International Publishing : Imprint : Springer. *

Johnston, B., Jones, A., & Kruger, C. (2019). Applied unsupervised learning with Python. Packt Publishing.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. 2008 Eighth IEEE International Conference on Data
Mining.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge
Discovery from Data, 6(1).
References
Patel, A. A. (2019). Hands-On Unsupervised learning using Python: How to build applied machine learning

solutions from unlabeled data. O'Reilly Media. *


Pradhan M., Kumar U. (2019). Machine Learning using Python. Wiley India. *

Swamynathan, M. (2019). Mastering Machine Learning with Python in Six Steps: A Practical Implementation
Guide to Predictive Data Analytics Using Python. Berkeley, CA: Apress L.P.
References
Ayramo, S., Karkkainen, T. (2006). Introduction to partitioning-based clustering methods with a robust example.
Reports of the Department of Mathematical Information Technology Series C. Software and Computational
Engineering, 1, 1-34.

Comaniciu, D., Meer, P. (2002). Mean Shift: A robust approach toward feature space analysis. IEEE Transactions
on pattern analysis and machine intelligence, 24(5), 603-619.

Estivill-Castro, V. (2002). Why so many clustering algorithms — A Position Paper. SIGKDD Explorations, 4 (1),
65-75.
Ghassabeh, Y. A. (2015). A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. Journal of
Multivariate Analysis, 135, 1-10.
Zimek, A., & Filzmoser, P. (2018). There and back again: Outlier detection between statistical reasoning and
data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6).
