Lecture 12 - Unsupervised Learning
Unsupervised Learning
Agenda
1. Unsupervised learning
2. Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean shift clustering
d) DBSCAN
e) Clustering metrics
3. Anomaly Detection
a) Gaussian Mixture of Models and Expectation-Maximization
b) One-class support vector machine
c) Isolation forest
Dataset and Notebook
Typical workflow: historical data → data cleaning and preparation → ML algorithms → patterns and rules → validation on real data → inference.
One way to find similar points is to define a distance function between them.
The following metrics are often used (a short sketch follows the list):
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Jaccard similarity index (for categorical data)
- Cosine similarity
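A minimal sketch of these measures, assuming SciPy is available; the points a, b, x, y below are made up for illustration:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

# Euclidean distance: straight-line distance between the two points
print(distance.euclidean(a, b))

# Manhattan (city-block) distance: sum of absolute coordinate differences
print(distance.cityblock(a, b))

# Minkowski distance of order p (p=1 gives Manhattan, p=2 Euclidean)
print(distance.minkowski(a, b, p=3))

# Cosine similarity = 1 - cosine distance
print(1.0 - distance.cosine(a, b))

# Jaccard similarity for categorical data: |intersection| / |union|
x = {"red", "round"}
y = {"red", "square"}
print(len(x & y) / len(x | y))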
K-means
1. Define the number of clusters 𝑘.
2. Randomly initialize 𝑘 centroids.
3. Calculate the Euclidean distance between the centroids and the data points.
4. Assign each data point to the cluster of its nearest centroid.
5. Recalculate the centroids: let each centroid be the mean of the data points in the corresponding cluster.
6. Repeat steps 3-5 until convergence (a sketch of these steps follows the list).
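A minimal NumPy sketch of the loop above, using made-up two-dimensional data and assuming no cluster ever ends up empty:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # toy data: 300 points in 2D
k = 3                           # step 1: number of clusters

# Step 2: randomly pick k data points as initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):            # step 6: iterate (with a fixed cap)
    # Steps 3-4: distance from every point to every centroid, then assign
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 5: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # convergence check
        break
    centroids = new_centroids

print(centroids)

In practice one would normally use sklearn.cluster.KMeans rather than hand-rolling this loop.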
K-means problems
- the number of clusters must be known a priori
- the method isn’t robust to clusters of different sizes, densities and shapes
- the method isn’t robust to outliers
- it scales poorly to very large or high-dimensional datasets
Hierarchical clustering
Agglomerative (bottom-up): the closest points/clusters are merged; this step is repeated until a termination condition is met.
Divisive (top-down): the data is split into two parts and the best splitting is determined; this is repeated until a termination condition is met.
Termination conditions
❑ agglomerative: all points are merged into a single cluster, or the distance between the closest clusters is bigger than a predefined threshold
❑ divisive: each point forms a cluster, or the maximum distance between any partitions of a cluster is smaller than a predefined threshold
Hierarchical clustering
Agglomerative clustering represents the hierarchy of clusters as a tree called a dendrogram.
The root of the dendrogram is a single cluster that contains all data points.
Each leaf is a separate point.
This model is easy to interpret and shows the similarity between different data points.
Final clusters are obtained by pruning the dendrogram at some level; thus, one can choose the number of clusters to use.
Example
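A minimal sketch of agglomerative clustering and dendrogram pruning, assuming SciPy and Matplotlib are available and using made-up two-dimensional data:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

# Build the merge hierarchy (Ward linkage minimizes within-cluster variance)
Z = linkage(X, method="ward")

# Plot the dendrogram: the root is one cluster with all points, leaves are single points
dendrogram(Z)
plt.show()

# "Prune" the dendrogram by cutting it into a chosen number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)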
DBSCAN
The previous algorithms are based on a distance function: points that are close in distance are put in the same cluster. On scattered data, algorithms based only on distance tend to perform badly and treat every point as its own cluster.
DBSCAN is robust to outliers and can also be used as an anomaly detection algorithm.
The algorithm handles arbitrary cluster shapes and doesn’t require the number of clusters to be specified a priori.
Parameters:
ε – the maximum neighborhood radius
n – the minimum number of points to form a dense region
DBSCAN algorithm
1. For the current point, calculate the distance to all other points.
2. Mark all points that lie in the ε-neighborhood of the current point as neighbors.
3. If the number of neighbors is greater than or equal to n, merge the points to form a cluster.
4. If the number of neighbors is smaller than n, mark the point as an outlier.
5. Continue with the other points until every point is either marked as an outlier or belongs to a cluster (a sketch follows these steps).
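A minimal sketch using scikit-learn's DBSCAN; the data and the values of eps (ε) and min_samples (n) are made up:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(50, 2)),
    rng.normal(loc=(3, 3), scale=0.2, size=(50, 2)),
    rng.uniform(-2, 5, size=(10, 2)),          # scattered points acting as noise
])

# eps is the neighborhood radius ε, min_samples is the minimum number of points n
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points labeled -1 are treated as outliers/noise
print("clusters:", set(db.labels_) - {-1})
print("number of outliers:", int(np.sum(db.labels_ == -1)))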
Mean Shift Clustering
A non-parametric feature-space analysis technique for locating the maxima of a density function.
An alternative name is the mode-seeking algorithm.
OpenCV: cv.meanShift
Sklearn: sklearn.cluster.MeanShift
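A minimal sketch using scikit-learn's MeanShift, with the kernel bandwidth estimated from made-up data (the quantile value is illustrative):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (0, 4)])

# Estimate a bandwidth, then shift points toward local density maxima (modes)
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print("number of clusters found:", len(ms.cluster_centers_))
print("cluster centers (modes):", ms.cluster_centers_)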
How to choose the number of clusters?
yellowbrick.cluster.elbow.KElbowVisualizer
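A minimal sketch of the elbow method with Yellowbrick's KElbowVisualizer, assuming yellowbrick is installed; the data and the range of k are made up:

import numpy as np
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

# Fit K-means over a range of k values and plot the distortion score;
# the "elbow" of the curve suggests a suitable number of clusters
visualizer = KElbowVisualizer(KMeans(n_init=10), k=(2, 10))
visualizer.fit(X)
visualizer.show()
print("suggested k:", visualizer.elbow_value_)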
Clustering Metrics
The silhouette value measures the consistency within clusters of data.
It shows how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
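A minimal sketch of computing the silhouette score for a K-means clustering with scikit-learn; the data and the choice of k are made up:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Mean silhouette over all points: values near 1 indicate compact, well-separated clusters
print("silhouette score:", silhouette_score(X, labels))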
Gaussian Mixture Models and Expectation-Maximization
If the data is a realization of a mixture of three Gaussians with means (μ1, μ2, μ3) and standard deviations (σ1, σ2, σ3), then a GMM will identify, for each point, a probability distribution over the different clusters.
Expectation-Maximization is an iterative algorithm that estimates the parameters of a probability distribution with latent variables by maximizing the likelihood function.
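A minimal sketch with scikit-learn's GaussianMixture, which fits the mixture parameters via EM and returns per-point cluster probabilities; the three Gaussians below are made up:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up realization of a mixture of three Gaussians
X = np.concatenate([
    rng.normal(0.0, 0.5, size=200),
    rng.normal(4.0, 1.0, size=200),
    rng.normal(9.0, 0.7, size=200),
]).reshape(-1, 1)

# EM estimates the means, variances and mixture weights
gmm = GaussianMixture(n_components=3).fit(X)
print("means:", gmm.means_.ravel())

# Soft assignment: probability of each component for every point
print(gmm.predict_proba(X[:3]))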
Isolation Forest
The key idea is that abnormal points are faster to separate, i.e. they are isolated at a smaller depth of the tree.
Implementation of iForest:
❑ The training set is used to build the isolation forest.
❑ Each point of the test set is run through the iForest and receives an anomaly score (sketched below).
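A minimal sketch with scikit-learn's IsolationForest; the training and test data are made up:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))                  # mostly "normal" points
X_test = np.vstack([rng.normal(size=(10, 2)),        # normal test points
                    rng.uniform(4, 6, size=(5, 2))]) # far-away, abnormal points

# Build the isolation forest on the training set
clf = IsolationForest(random_state=0).fit(X_train)

# Each test point is run through the forest and receives an anomaly score;
# lower score_samples values mean "more abnormal" (easier to isolate)
print(clf.score_samples(X_test))
print(clf.predict(X_test))   # +1 for inliers, -1 for detected anomalies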