Clustering
Course: Artificial Intelligence Fundamentals
Instructor: Marco Bonzanini
Machine Learning Tasks
• Supervised + discrete data: Classification (predict a label)
• Unsupervised + discrete data: Clustering (group similar items)
• Supervised + continuous data: Regression (predict a quantity)
• Unsupervised + continuous data: Dimensionality Reduction (reduce n. of variables)
Agenda
• Introduction to Clustering
• Clustering Algorithms
• Centroid-based
• Optional: Connectivity-based
• Optional: Density-based
• Evaluation
Introduction to
Clustering
Clustering
• Place similar items in the same group
• (Place dissimilar items in different groups)
• How to define “similarity”?
• Clustering cannot give a comprehensive
description of an object
Clustering Applications
• Customer Segmentation
• Fraud Detection
• Social Network Analysis
• Search engines (navigation, indexing, …)
• … Your use case?
Clustering vs Classification
• Classification:
- supervised
- requires a set of labelled training samples
• Clustering:
- unsupervised
- learns without labels
Classification Example
• Training: items with labels
• New, unseen item
• Prediction: assign item to class
Clustering Example
• Training: no labels
• Prediction: group items
More Definitions
• “Learn from raw data”
• “Find structure in data”
• “Unsupervised classification”
Flat vs Hierarchical
• Flat approach:
• There’s a number of clusters, and the relation
between clusters is undetermined
• Often start with a random partial partitioning
• Refine it iteratively (e.g. K-Means)
• Measurement: error minimisation
Flat vs Hierarchical
• Hierarchical approach:
• Bottom-up, agglomerative
• Top-down, divisive
• A hierarchy of clusters (i.e. tree structure)
• Measurement: similarity of instances
Hard vs Soft
• Cluster assignment
• Hard clustering: each item belongs to one and only
one cluster (more common)
• Soft clustering: items can belong to more than one
cluster (e.g. a pair of sneakers can be in “sports”
and “shoes”)
Common Issues
• Item representation (e.g. vector)
• Need a notion of distance / similarity
• Ideal: semantic similarity
• Practical: Euclidean distance, cosine similarity
• How many clusters?
• Fixed a priori? Data-driven?
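As a quick illustration of the two practical options above, a NumPy sketch with two made-up vectors a and b (not course material):

import numpy as np

# Two items represented as real-valued vectors (made-up example data)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

# Euclidean distance: smaller = more similar
euclidean_distance = np.linalg.norm(a - b)

# Cosine similarity: closer to 1 = more similar (angle only, magnitude ignored)
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance, cosine_similarity)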
Clustering
Algorithms
Different Approaches
• Centroid-based clustering:
K-Means
• Connectivity-based clustering:
Hierarchical Agglomerative
• Density-based clustering:
DBSCAN
Centroid-based Clustering
• Clusters are represented by their centroid
• Centroid: central vector, centre of gravity, arithmetic
mean
• Centroid: not necessarily a member of the cluster
• Based on distance between items and centroids
• Most common algorithm: K-Means
K-Means Overview
• Input:
- Set of items
- Desired n. of clusters K
• Output:
- A partition of the input set into K clusters
• Assumption:
- Input items are real-valued vectors
- Notion of distance / similarity
K-Means Algorithm
1. Initialise K centroids randomly
2. For each point: assign to closest centroid
3. Update centroids
4. Repeat 2-3 until convergence
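A minimal NumPy sketch of these four steps (illustrative only: the function name, the simple convergence test, and the initialisation by sampling K input points are my own choices, not a production implementation):

import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Naive K-Means: X is an (n_items, n_features) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialise K centroids randomly (here: K distinct input points)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment step: each point goes to the closest centroid
        #    (closest = least squared Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update step: new centroid = mean of the points in the cluster
        #    (note: empty clusters are not handled in this simple sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat 2-3 until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids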
K-Means Example (K=2)
1. Centroids: random init
2. Assign items to closest centroid
3. Update centroids
4. Repeat: assign items to centroids, update centroids
Convergence!
Assignment Step
• Assign to “closest” centroid
• Closest = least squared Euclidean distance
Update Step
• New centroid is the mean of the cluster
K-Means Discussion
• Pros: intuitive, quite good in practice
• Cons: requires knowing (or finding out) K
• Elbow method to find K
Elbow Method
• Intrinsic metric: within-cluster Sum of Squared Errors, a.k.a. distortion
• As K increases, the distortion decreases
• Pick K at the "elbow", where the curve of distortion vs. K flattens out
Notebook Intermezzo:
Clustering - KMeans
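A possible scikit-learn sketch combining K-Means with the elbow method; the make_blobs data and the range of K values are assumptions for illustration, not necessarily what the notebook uses:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 4 blobs (made-up example, not the course data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: fit K-Means for several values of K and record the distortion
# (inertia_ = within-cluster sum of squared distances to the closest centroid)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)

# Plotting distortion vs. K, the curve flattens (the "elbow") around the true number of blobs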
Connectivity-based
Clustering
• a.k.a. Hierarchical Clustering
• It needs a notion of pairwise dissimilarity between
groups (clusters), called linkage
• Top-down (divisive): start with all items in one group,
then divide them to maximise within-group similarity
• Bottom-up (agglomerative): start with items in
individual groups, then aggregate the most similar
ones (until one cluster containing all objects is formed)
Agglomerative Clustering Example
Steps (seven points, labelled 1-7):
{1}, {2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3, 4}, {5}, {6}, {7}
{1, 2}, {3, 4, 5}, {6}, {7}
{1, 2, 6}, {3, 4, 5}, {7}
{1, 2, 6, 7}, {3, 4, 5}
{1, 2, 3, 4, 5, 6, 7}
Dendrogram
• Used to illustrate the output of hierarchical
clustering
• The algorithm builds a tree-based hierarchical
taxonomy
Dendrogram Example
• Figure: the seven points from the previous example and their dendrogram (leaf order: 1, 2, 6, 7, 5, 3, 4)
Dendrogram Example on
Iris dataset
From Dendrogram to Clusters
• Cutting the dendrogram horizontally partitions the
data points into clusters
• Choice of distance
• Choice of number of clusters
Linkage
• The notion of dissimilarity is described with a distance
function d(G, H), with G and H groups of nodes (cluster
assignments at any level)
• Single linkage: smallest dissimilarity between two points in
opposite groups, i.e. nearest neighbour interpretation
• Complete linkage: largest dissimilarity between two points
in opposite groups, i.e. furthest neighbour interpretation
• Average linkage: average dissimilarity between all points
in opposite groups
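A possible SciPy sketch of these linkage choices, and of turning a dendrogram into clusters; the random data and the threshold values are made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # made-up 2-D points

# method can be 'single', 'complete' or 'average' (also 'centroid', 'median', 'ward')
Z = linkage(X, method='average')

# Cut the dendrogram at a distance threshold ...
labels_by_distance = fcluster(Z, t=1.5, criterion='distance')
# ... or ask directly for a number of clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# dendrogram(Z) draws the tree (requires matplotlib)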
Single Linkage
• Dendrogram cut at distance 0.7
• For each point x there is a point y in its cluster where d(x, y) ≤ 0.7
Complete Linkage
• Dendrogram cut at distance 0.7
• For each point x, all points y in its cluster satisfy d(x, y) ≤ 0.7
Average Linkage
• Dendrogram cut at distance 0.7
• Cut interpretation: there isn't a good one!
Linkage Issues
• Single linkage suffers from chaining.
Only one pair of points needs to be close in order to
merge two groups, i.e. clusters can be spread out and
not very compact
• Complete linkage suffers from crowding.
Score based on worst-case dissimilarity between pairs
of points, i.e. clusters are compact but not far apart
• Average linkage strikes a balance, but doesn’t have a
clear interpretation
More on Linkage
• Centroid linkage (new centroid is avg of all group items)
• Median linkage: like Centroid, but new centroid
calculated as avg of the two old centroids
• Ward linkage (Ward’s variance minimisation)
Hierarchical Clustering
Discussion
• Pros: repeatability (why?), no prior knowledge of K
is required (can choose cut-off threshold, or
number of clusters)
• Cons: complexity (why?)
• No silver bullet
Notebook Intermezzo:
Clustering - Hierarchical
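A possible scikit-learn sketch of agglomerative clustering; the data and parameter values are my own illustrative choices, not necessarily the notebook's:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # made-up data

# Fix the number of clusters and the linkage ('ward', 'complete', 'average' or 'single')
model = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = model.fit_predict(X)

# Alternatively, cut by a distance threshold instead of fixing the number of clusters
model_cut = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage='average')
labels_cut = model_cut.fit_predict(X)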
Density-based Clustering
• Clusters are regions in the data space with higher
density, separated by lower density regions
• Objects in sparse areas are considered noise or
borders between clusters
• A cluster is defined as a maximal set of density-
connected points
• Shape of clusters: arbitrary
• Most common algorithm: DBSCAN
DBSCAN Overview
• Density-Based Spatial Clustering of Applications
with Noise
• Given an input set of points, it groups together points that are closely packed, i.e. points with many close neighbours.
DBSCAN Concepts
• ε-Neighbourhood
N(p) = {q | d(p, q) ≤ ε }
• High density points (core points)
p is a core point if N(p) contains at least minPts objects
• Density-reachable points
q is directly reachable from p if it’s in N(p) and p is a
core point
q is reachable from p if p is a core point and there’s a
path of directly reachable core points between them
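A tiny Python sketch of these two definitions (function names and arguments are made up for illustration):

import numpy as np

def epsilon_neighbourhood(points, p, eps):
    """Indices q with d(points[p], points[q]) <= eps (the point itself included)."""
    dists = np.linalg.norm(points - points[p], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighbourhood contains at least min_pts objects."""
    return len(epsilon_neighbourhood(points, p, eps)) >= min_pts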
DBSCAN Concepts
• Example figure: points 1-8 with neighbourhood radius ε, minPts = 2
• 1 is reachable from 6
• 3 and 5 are directly reachable from 4
• 7 is an outlier
DBSCAN Algorithm
• Find the ε-neighbourhood of all points
• Identify the core points (at least minPts neighbours)
• Find the connected components of core points, ignoring the non-core points
• Assign each non-core point to a nearby cluster if it is within ε of one of that cluster's core points, otherwise assign it to noise
DBSCAN Example
DBSCAN Discussion
• Pros: no prior knowledge of K is required, can find arbitrarily shaped clusters, robust to outliers (notion of noise)
• Cons: complexity (why?), sensitive to data sets with large differences in density
Notebook Intermezzo:
Clustering - DBSCAN
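A possible scikit-learn sketch of DBSCAN; the moon-shaped data and the eps / min_samples values are assumptions for illustration, not necessarily the notebook's:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a classic arbitrarily-shaped-clusters example
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighbourhood radius, min_samples = minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_          # cluster index per point; -1 means "noise"
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", list(labels).count(-1), "noise points")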
Clustering
Evaluation
Cluster Quality
• How good is the clustering result?
• What is its interpretation?
• What is the purpose of the clustering task?
Internal Evaluation
• Evaluation based on the data set itself
• No external gold standard
• Idea: good clustering produces clusters with high
within-cluster similarity and low between-cluster
similarity
• Drawback: is the evaluation biased?
• e.g. Sum of Squared Errors (SSE)
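A small sketch of the within-cluster SSE for any item matrix X and cluster labels (the helper name is made up):

import numpy as np

def within_cluster_sse(X, labels):
    """Sum of squared distances of each item to the centroid of its cluster."""
    sse = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        sse += ((members - centroid) ** 2).sum()
    return sse

# (for K-Means, scikit-learn exposes this same quantity as the fitted model's inertia_)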
External Evaluation
• Requires externally supplied labels
• Relationship with classification evaluation
• Metrics: Precision, Recall, F-Measure, Jaccard
Index, Dice Index, …
• Drawback: are we missing out on knowledge
discovery?
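One way to make the link with classification metrics concrete is pair counting: a pair of items counts as a true positive if both the gold labels and the clustering put the two items together. A small sketch (the helper name and example labels are made up):

from itertools import combinations

def pair_counting_scores(gold, predicted):
    """Precision / Recall / F1 / Jaccard over pairs of items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred and not same_gold:
            fp += 1
        elif same_gold and not same_pred:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, jaccard

# Made-up example: gold classes vs. predicted cluster ids (ids don't need to match)
print(pair_counting_scores([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))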
Questions?