Machine Learning – COMS3007
Clustering
Benjamin Rosman
Based heavily on course notes by Chris
Williams, Victor Lavrenko, Charles Sutton,
David Blei, David Sontag, Shimon Ullman,
Tomaso Poggio, Danny Harari, Daniel
Zysman, Darren Seibart, and Clint van Alten
Previously on ML…
• So far: focused exclusively on supervised learning
• Data 𝑋 = {𝑥^(0), …, 𝑥^(𝑛)}, where 𝑥^(𝑖) ∈ ℝ^𝑑
• Labels 𝐲 = {𝑦^(0), …, 𝑦^(𝑛)}
• Want to learn function 𝑦 = 𝑓(𝑥, 𝜃) to predict y for a
new x
• Two main types:
• Classification: 𝑦 ∈ {0,1} (or more classes)
• Regression: 𝑦 ∈ ℝ
• Conveniently: similar models work for both
Unsupervised learning
• In supervised learning, we know what we are looking for
• We have appropriately labelled data
• This isn’t always the case!
• Unsupervised learning:
• Find patterns in the data (without labels)
• Understanding the hidden structure of the data
• Useful when you don’t know what you’re looking for
• Data:
• Given 𝐷 = {𝒙1 , … , 𝒙𝑁 }, where each 𝒙 ∈ ℝ𝑑
• No labels!
Clustering
• Clustering: one of the most common unsupervised
learning problems
• Involves automatically segmenting data into groups of
similar points
• Why?
• Automatically organising data
• Understanding structure of
the data
• Finding sub-populations
• Representing high dimensional
data in a low dimensional
space
Examples
• Make groupings from data, such as:
• Customers based on their purchase histories
• Genes according to expression profile
• Search results according to topic
• Facebook users according to interests
• Artifacts in a museum according to visual similarity
Note: this is different to
classifying. We don’t even
know what the classes are!
This gives us a way to
discover them.
Properties
• What makes a good clustering?
• Intra-cluster cohesion (compactness)
• Points in the same cluster are close together
• Inter-cluster separation (isolation)
• Points in different clusters are far apart
Distance metrics
• Notions of “closeness” require a distance metric
• Euclidean distance:
  • 𝑑(𝑥_𝑖, 𝑥_𝑗) = √( Σ_{𝑘=1}^{𝑑} (𝑥_𝑖^(𝑘) − 𝑥_𝑗^(𝑘))² )
• Manhattan (city block) distance:
  • 𝑑(𝑥_𝑖, 𝑥_𝑗) = Σ_{𝑘=1}^{𝑑} |𝑥_𝑖^(𝑘) − 𝑥_𝑗^(𝑘)|
  • Approximation to Euclidean distance
• Both are special cases of the Minkowski distance:
  • 𝑑(𝑥_𝑖, 𝑥_𝑗) = ( Σ_{𝑘=1}^{𝑑} |𝑥_𝑖^(𝑘) − 𝑥_𝑗^(𝑘)|^𝑝 )^(1/𝑝), where 𝑝 is a positive integer
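To make the metrics concrete, here is a minimal NumPy sketch (not from the slides; the function name and example vectors are illustrative). Setting p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.

```python
import numpy as np

def minkowski(x_i, x_j, p):
    """Minkowski distance: (sum_k |x_i^(k) - x_j^(k)|^p)^(1/p)."""
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])

print(minkowski(x_i, x_j, p=2))   # Euclidean distance: sqrt(13) ≈ 3.606
print(minkowski(x_i, x_j, p=1))   # Manhattan distance: 5.0
```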
K-means
• K-means is one of the most commonly used clustering
algorithms
• Partitional clustering algorithm (maintains partitions over
the space)
• Data points 𝐷 = {𝒙1 , … , 𝒙𝑁 }, where each 𝒙 ∈ ℝ𝑑
• K-means partitions the data into k clusters
• Each cluster has a cluster centre, called the centroid
• k is user specified
K-means algorithm
• Input: 𝐷 = {𝒙1 , … , 𝒙𝑁 }, where each 𝒙 ∈ ℝ𝑑
• Place centroids 𝑐1 , 𝑐2 , … , 𝑐𝑘 at random locations in ℝ𝑑
• Repeat until convergence (cluster assignments don’t change):
• For each point 𝑥_𝑖:
  • Find the closest centroid 𝑐_𝑗, i.e. 𝑗 = argmin_𝑗 𝑑(𝑥_𝑖, 𝑐_𝑗)
  • Assign 𝑥_𝑖 to cluster 𝑗
  (Choose the distance metric 𝑑(⋅,⋅) appropriately)
• For each cluster 𝑗:
  • Move the cluster centre 𝑐_𝑗 to the average of the assigned points:
    𝑐_𝑗 = (1/𝑛_𝑗) Σ_{𝑖:𝑥_𝑖→𝑗} 𝑥_𝑖
  (Can compute the median, etc., instead of the mean)
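A minimal Python sketch of the algorithm above, assuming Euclidean distance and initialising the centroids at randomly chosen data points rather than at random locations in ℝ^𝑑 (all names are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialise centroids at k randomly chosen data points
    centroids = X[rng.choice(N, size=k, replace=False)].astype(float)
    assignments = np.full(N, -1)
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # converged: cluster assignments no longer change
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments
```

Later sketches in these notes reuse this kmeans function.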
Performance
• Need an objective function to measure performance of the
algorithm
• K-means objective function is the sum of squared distances of
each point to its assigned mean.
• Let 𝑥_𝑖 be assigned to cluster 𝑧_𝑖
• Then
  𝐽(𝑥_{1:𝑁}, 𝑐_{1:𝐾}) = ½ Σ_{𝑖=1}^{𝑁} ‖𝑥_𝑖 − 𝑐_{𝑧_𝑖}‖²
  (the squared distance between each point and its cluster centre, summed over all points)
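A short sketch of computing this objective, assuming X is an N×d array, centroids is K×d, and z holds each point's cluster index (names are illustrative):

```python
import numpy as np

def kmeans_objective(X, centroids, z):
    """J = 1/2 * sum_i ||x_i - c_{z_i}||^2."""
    diffs = X - centroids[z]          # subtract each point's assigned centre
    return 0.5 * np.sum(diffs ** 2)
```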
Example
[Figure: example clustering, with a panel showing the objective value]
• Cluster boundaries: points equidistant from 2 cluster centres; this defines a partitioning.
Example (further iterations)
Converged!
Convergence
• Note the decreasing objective
from the example
• K-means takes an alternating
optimization approach:
• Optimising cluster
assignments
• Optimising cluster positions
• Each step is guaranteed not to increase the objective
• So k-means is guaranteed to converge!
• But: only to a local optimum
Properties of k-means
• Strengths:
  • Simple to understand and implement
  • Efficient: complexity O(NKT)
    • N = number of data points
    • K = number of clusters
    • T = number of iterations
    (Can you convince yourself this is right?)
• Weaknesses:
  • Converges to a local optimum
  • Only applicable if a mean can be defined (may need to use something like a median instead)
  • K must be specified
  • Sensitive to outliers
Limitations
• K-means finds a local
optimum
• Thus very reliant on
good initialisation
• May need to restart
several times
Limitations
• Very sensitive to outliers
  • Points very far away from other points
  • [Figures: the clustering we want vs. what we may get instead when an outlier pulls a centroid away]
• Strategies:
  • Remove outliers manually (monitor them over a few iterations first)
  • Random sampling: choose a subset of the data, which is less likely to contain outliers
  • Use the median instead of the mean?
Limitations
• Not suitable for clusters that are not hyper-ellipsoids
• Nonlinear features may be useful here
Choosing k
• Not always clear what the correct number of clusters is
• Often heuristics are used
• Although there are algorithms that do this more
automatically
Changing values of k
• A common heuristic is to
look at the changing value
of the objective with
changing k
• The “kink” or “elbow” is
often taken to indicate the
best k
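One way to apply the elbow heuristic, assuming the kmeans and kmeans_objective sketches from earlier (the data here is random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(300, 2))   # illustrative data

ks = range(1, 10)
objectives = []
for k in ks:
    centroids, z = kmeans(X, k)
    objectives.append(kmeans_objective(X, centroids, z))

plt.plot(list(ks), objectives, marker="o")
plt.xlabel("k")
plt.ylabel("objective J")
plt.show()   # look for the 'elbow' where J stops dropping sharply
```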
Running k-means online
• Online/sequential k-means
• Why? Clustering online articles as they’re written
• Algorithm:
• Place centroids 𝑐1 , 𝑐2 , … , 𝑐𝑘 at random locations in ℝ𝑑
• Set initial counts 𝑛1 , 𝑛2 , … , 𝑛𝑘 = 0
• Repeat until bored:
• Acquire new point 𝑥𝑖
• Find the closest centroid 𝑐_𝑗, i.e. 𝑗 = argmin_𝑗 𝑑(𝑥_𝑖, 𝑐_𝑗)
• Assign 𝑥_𝑖 to cluster 𝑗
• 𝑛_𝑗 ← 𝑛_𝑗 + 1
• 𝑐_𝑗 ← 𝑐_𝑗 + (1/𝑛_𝑗)(𝑥_𝑖 − 𝑐_𝑗)
  (This updates the appropriate cluster centre by moving it closer to 𝑥_𝑖; 1/𝑛_𝑗 acts as an adaptive learning rate.)
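A minimal sketch of the online version, assuming Euclidean distance; stream can be any iterable of d-dimensional NumPy vectors (names are illustrative):

```python
import numpy as np

def online_kmeans(stream, k, d, seed=0):
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(k, d))   # random initial locations in R^d
    counts = np.zeros(k, dtype=int)       # n_1, ..., n_k = 0
    for x in stream:
        # Closest centroid to the newly acquired point
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        counts[j] += 1
        # Move c_j towards x with adaptive learning rate 1/n_j
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids, counts
```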
Applications: supervised learning
• Use clustering to discretise continuous values for
supervised learning
• Instead of using a set discretisation interval
• Cluster training data and use cluster ID
• This can also lower the dimension of high
dimensional data
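A small sketch of this idea, assuming the kmeans function from earlier; the training data and the choice of k = 8 are illustrative:

```python
import numpy as np

X_train = np.random.default_rng(1).normal(size=(500, 10))  # illustrative data

centroids, _ = kmeans(X_train, k=8)

def discretise(x, centroids):
    """Replace a continuous point by the ID of its nearest cluster centre."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Each 10-dimensional point is now summarised by a single discrete feature,
# which can be fed to, e.g., naive Bayes or a decision tree.
```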
Applications: visual words
• Use a similar idea for images
• What if you wanted to use a naïve Bayes or decision
tree model to classify images?
• Pixels as attributes?
• Huge space, and not useful for learning
• Bag-of-words would be nice: {“water”, “grass”, “tiger”}
• Needs human annotation
Applications: visual words
• Idea:
• Break image into set of patches
• Compute appearance features of each patch
• Relative position, distribution of colours, texture,
edge orientations
• Convert to a “word” (code) to reflect patch appearance
• Similar feature vectors → same “word”
[Figure: a patch's feature vector (𝑥_1, 𝑥_2, …, 𝑥_𝑑) is mapped to a visual word such as "grass" or the code "C27"]
Applications: visual words
• Use k-means to:
• Group all feature vectors from all images into K clusters
• Provide a cluster ID for every patch in every image
• Similar-looking patches have the same ID
• Represent patch with cluster ID
• Image = bag of cluster IDs
• K-dimensional representation:
• {4 x “C14”, 7 x “C27”, 24 x “C79”, 0 x else}
• Similar to bag-of-words
• Cluster IDs called vis-terms or “visual words”
• Plug these into a classifier
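A sketch of building the K-dimensional bag-of-visual-words representation, assuming the kmeans function from earlier and that each image's patches have already been converted to feature vectors (names are illustrative):

```python
import numpy as np

def visual_word_histogram(patch_features, centroids):
    """Count how many of an image's patches fall in each of the K clusters."""
    K = len(centroids)
    ids = [int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
           for p in patch_features]
    return np.bincount(ids, minlength=K)   # K-dimensional image representation

# centroids, _ = kmeans(all_patch_features, k=K)   # the visual-word vocabulary
# histogram = visual_word_histogram(patches_of_one_image, centroids)
```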
Applications: image compression
• Every pixel in an image has a red, green, blue value
• How many bits per pixel?
• We can use k-means to compress the image!
Applications: image compression
• Clustering in the colour space
• Replace each pixel 𝑥_𝑖 by its cluster centre 𝑐_{𝑥_𝑖}
• The k means are called the codebook
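A sketch of this compression, assuming the kmeans function from earlier; image is an H x W x 3 array of RGB values (names are illustrative):

```python
import numpy as np

def compress(image, k=16):
    H, W, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)   # cluster in colour space
    codebook, ids = kmeans(pixels, k)             # the k means = the codebook
    # Store only the codebook plus one cluster ID per pixel
    # (log2(k) bits per pixel instead of 24); reconstruct by replacing
    # each pixel with its cluster centre.
    return codebook[ids].reshape(H, W, 3)
```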
Applications: image compression
• Sometimes known as vector quantisation
• Also an easy way to do image segmentation
Recap
• Clustering and applications
• Distance metrics
• The k-means algorithm
• Limitations of k-means
• How to choose K
• Online k-means
• Representations for supervised learning
• Visual words
• Image compression and segmentation