Clustering
Course: Artificial Intelligence Fundamentals
Instructor: Marco Bonzanini
Machine Learning Tasks
• Supervised + discrete data: Classification (predict a label)
• Unsupervised + discrete data: Clustering (group similar items)
• Supervised + continuous data: Regression (predict a quantity)
• Unsupervised + continuous data: Dimensionality Reduction (reduce n. of variables)
Agenda
• Introduction to Clustering
• Clustering Algorithms
• Centroid-based
• Optional: Connectivity-based
• Optional: Density-based
• Evaluation
Introduction to
Clustering
Clustering
• Place similar items in the same group
• (Place dissimilar items in different groups)
• How to define “similarity”?
• Clustering cannot give a comprehensive
description of an object
Clustering Applications
• Customer Segmentation
• Fraud Detection
• Social Network Analysis
• Search engines (navigation, indexing, …)
• … Your use case?
Clustering vs Classification
• Classification:
- supervised
- requires a set of labelled training samples
• Clustering:
- unsupervised
- learns without labels
Classification Example
• Training: items with labels
• New, unseen item
• Prediction: assign item to class
Clustering Example
• Training: no labels
• Prediction: group items
More Definitions
• “Learn from raw data”
• “Find structure in data”
• “Unsupervised classification”
Flat vs Hierarchical
• Flat approach:
• There’s a number of clusters, and the relation
between clusters is undetermined
• Often start with a random partial partitioning
• Refine it iteratively (e.g. K-Means)
• Measurement: error minimisation
Flat vs Hierarchical
• Hierarchical approach:
• Bottom-up, agglomerative
• Top-down, divisive
• A hierarchy of clusters (i.e. tree structure)
• Measurement: similarity of instances
Hard vs Soft
• Cluster assignment
• Hard clustering: each item belongs to one and only
one cluster (more common)
• Soft clustering: items can belong to more than one
cluster (e.g. a pair of sneakers can be in “sports”
and “shoes”)
Common Issues
• Item representation (e.g. vector)
• Need a notion of distance / similarity
• Ideal: semantic similarity
• Practical: Euclidean distance, cosine similarity
• How many clusters?
• Fixed a priori? Data-driven?
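As a quick illustration of the two practical options above, a NumPy sketch with two made-up vectors a and b (not course material):

import numpy as np

# Two items represented as real-valued vectors (made-up example data)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

# Euclidean distance: smaller = more similar
euclidean_distance = np.linalg.norm(a - b)

# Cosine similarity: closer to 1 = more similar (angle only, magnitude ignored)
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance, cosine_similarity)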
Clustering
Algorithms
Different Approaches
• Centroid-based clustering:
K-Means
• Connectivity-based clustering:
Hierarchical Agglomerative
• Density-based clustering:
DBSCAN
Centroid-based Clustering
• Clusters are represented by their centroid
• Centroid: central vector, centre of gravity, arithmetic
mean
• Centroid: not necessarily a member of the cluster
• Based on distance between items and centroids
• Most common algorithm: K-Means
K-Means Overview
• Input:
- Set of items
- Desired n. of clusters K
• Output:
- A partition of the input set into K clusters
• Assumption:
- Input items are real-valued vectors
- Notion of distance / similarity
K-Means Algorithm
1. Initialise K centroids randomly
2. For each point: assign to closest centroid
3. Update centroids
4. Repeat 2-3 until convergence
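A minimal NumPy sketch of these four steps (illustrative only: the function name, the simple convergence test, and the initialisation by sampling K input points are my own choices, not a production implementation):

import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Naive K-Means: X is an (n_items, n_features) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialise K centroids randomly (here: K distinct input points)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment step: each point goes to the closest centroid
        #    (closest = least squared Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update step: new centroid = mean of the points in the cluster
        #    (note: empty clusters are not handled in this simple sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat 2-3 until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids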
K-Means Example (K=2)
1. Centroids: random init
2. Assign items to closest centroid
3. Update centroids
4. Repeat: assign items to centroids, update centroids
Convergence!
Assignment Step
• Assign to “closest” centroid
• Closest = least squared Euclidean distance
Update Step
• New centroid is the mean of the cluster
K-Means Discussion
• Pros: intuitive, quite good in practice
• Cons: requires knowing (or finding out) K
• Elbow method to find K
Elbow Method
• Intrinsic metric: within-cluster Sum of Squared Errors, a.k.a. distortion
• As K increases, the distortion decreases
• Pick K at the "elbow", where the curve of distortion vs. K flattens out
Notebook Intermezzo:
Clustering - KMeans
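A possible scikit-learn sketch combining K-Means with the elbow method; the make_blobs data and the range of K values are assumptions for illustration, not necessarily what the notebook uses:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 4 blobs (made-up example, not the course data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: fit K-Means for several values of K and record the distortion
# (inertia_ = within-cluster sum of squared distances to the closest centroid)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)

# Plotting distortion vs. K, the curve flattens (the "elbow") around the true number of blobs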
Connectivity-based
Clustering
• a.k.a. Hierarchical Clustering
• It needs a notion of pairwise dissimilarity between
groups (clusters), called linkage
• Top-down (divisive): start with all items in one group,
then divide them to maximise within-group similarity
• Bottom-up (agglomerative): start with items in
individual groups, then aggregate the most similar
ones (until one cluster containing all objects is formed)
Agglomerative Clustering Example
Steps (seven points, labelled 1-7):
{1}, {2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3, 4}, {5}, {6}, {7}
{1, 2}, {3, 4, 5}, {6}, {7}
{1, 2, 6}, {3, 4, 5}, {7}
{1, 2, 6, 7}, {3, 4, 5}
{1, 2, 3, 4, 5, 6, 7}
Dendrogram
• Used to illustrate the output of hierarchical
clustering
• The algorithm builds a tree-based hierarchical
taxonomy
Dendrogram Example
• Figure: the seven points from the previous example and their dendrogram (leaf order: 1, 2, 6, 7, 5, 3, 4)
Dendrogram Example on
Iris dataset
From Dendrogram to Clusters
• Cutting the dendrogram horizontally partitions the
data points into clusters
• Choice of distance
• Choice of number of clusters
Linkage
• The notion of dissimilarity is described with a distance
function d(G, H), with G and H groups of nodes (cluster
assignments at any level)
• Single linkage: smallest dissimilarity between two points in
opposite groups, i.e. nearest neighbour interpretation
• Complete linkage: largest dissimilarity between two points
in opposite groups, i.e. furthest neighbour interpretation
• Average linkage: average dissimilarity between all points
in opposite groups
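A possible SciPy sketch of these linkage choices, and of turning a dendrogram into clusters; the random data and the threshold values are made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # made-up 2-D points

# method can be 'single', 'complete' or 'average' (also 'centroid', 'median', 'ward')
Z = linkage(X, method='average')

# Cut the dendrogram at a distance threshold ...
labels_by_distance = fcluster(Z, t=1.5, criterion='distance')
# ... or ask directly for a number of clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# dendrogram(Z) draws the tree (requires matplotlib)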
Single Linkage
• Dendrogram cut at distance 0.7
• For each point x there is a point y in its cluster where d(x, y) ≤ 0.7
Complete Linkage
• Dendrogram cut at distance 0.7
• For each point x, all points y in its cluster satisfy d(x, y) ≤ 0.7
Average Linkage
• Dendrogram cut at distance 0.7
• Cut interpretation: there isn't a good one!
Linkage Issues
• Single linkage suffers from chaining.
Only one pair of points needs to be close in order to
merge two groups, i.e. clusters can be spread out and
not very compact
• Complete linkage suffers from crowding.
Score based on worst-case dissimilarity between pairs
of points, i.e. clusters are compact but not far apart
• Average linkage strikes a balance, but doesn’t have a
clear interpretation
More on Linkage
• Centroid linkage (new centroid is avg of all group items)
• Median linkage: like Centroid, but new centroid
calculated as avg of the two old centroids
• Ward linkage (Ward’s variance minimisation)
Hierarchical Clustering
Discussion
• Pros: repeatability (why?), no prior knowledge of K
is required (can choose cut-off threshold, or
number of clusters)
• Cons: complexity (why?)
• No silver bullet
Notebook Intermezzo:
Clustering - Hierarchical
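A possible scikit-learn sketch of agglomerative clustering; the data and parameter values are my own illustrative choices, not necessarily the notebook's:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # made-up data

# Fix the number of clusters and the linkage ('ward', 'complete', 'average' or 'single')
model = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = model.fit_predict(X)

# Alternatively, cut by a distance threshold instead of fixing the number of clusters
model_cut = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage='average')
labels_cut = model_cut.fit_predict(X)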
Density-based Clustering
• Clusters are regions in the data space with higher
density, separated by lower density regions
• Objects in sparse areas are considered noise or
borders between clusters
• A cluster is defined as a maximal set of density-
connected points
• Shape of clusters: arbitrary
• Most common algorithm: DBSCAN
DBSCAN Overview
• Density-Based Spatial Clustering of Applications
with Noise
• Given an input set of points, it groups together points that are closely packed, i.e. points with many close neighbours.
DBSCAN Concepts
• ε-Neighbourhood
N(p) = {q | d(p, q) ≤ ε }
• High density points (core points)
p is a core point if N(p) contains at least minPts objects
• Density-reachable points
q is directly reachable from p if it’s in N(p) and p is a
core point
q is reachable from p if p is a core point and there’s a
path of directly reachable core points between them
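A tiny Python sketch of these two definitions (function names and arguments are made up for illustration):

import numpy as np

def epsilon_neighbourhood(points, p, eps):
    """Indices q with d(points[p], points[q]) <= eps (the point itself included)."""
    dists = np.linalg.norm(points - points[p], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighbourhood contains at least min_pts objects."""
    return len(epsilon_neighbourhood(points, p, eps)) >= min_pts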
DBSCAN Concepts
• Example figure: points 1-8 with neighbourhood radius ε, minPts = 2
• 1 is reachable from 6
• 3 and 5 are directly reachable from 4
• 7 is an outlier
DBSCAN Algorithm
• Find the ε-neighbourhood of all points
• Identify the core points (at least minPts neighbours)
• Find the connected components of core points, ignoring the non-core points
• Assign each non-core point to a nearby cluster if it is within ε of one of that cluster's core points, otherwise assign it to noise
DBSCAN Example
DBSCAN Discussion
• Pros: no prior knowledge of K is required, can find arbitrarily shaped clusters, robust to outliers (notion of noise)
• Cons: complexity (why?), sensitive to data sets with large differences in density
Notebook Intermezzo:
Clustering - DBSCAN
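A possible scikit-learn sketch of DBSCAN; the moon-shaped data and the eps / min_samples values are assumptions for illustration, not necessarily the notebook's:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a classic arbitrarily-shaped-clusters example
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighbourhood radius, min_samples = minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_          # cluster index per point; -1 means "noise"
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", list(labels).count(-1), "noise points")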
Clustering
Evaluation
Cluster Quality
• How good is the clustering result?
• What is its interpretation?
• What is the purpose of the clustering task?
Internal Evaluation
• Evaluation based on the data set itself
• No external gold standard
• Idea: good clustering produces clusters with high
within-cluster similarity and low between-cluster
similarity
• Drawback: is the evaluation biased?
• e.g. Sum of Squared Errors (SSE)
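A small sketch of the within-cluster SSE for any item matrix X and cluster labels (the helper name is made up):

import numpy as np

def within_cluster_sse(X, labels):
    """Sum of squared distances of each item to the centroid of its cluster."""
    sse = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        sse += ((members - centroid) ** 2).sum()
    return sse

# (for K-Means, scikit-learn exposes this same quantity as the fitted model's inertia_)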
External Evaluation
• Requires externally supplied labels
• Relationship with classification evaluation
• Metrics: Precision, Recall, F-Measure, Jaccard
Index, Dice Index, …
• Drawback: are we missing out on knowledge
discovery?
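One way to make the link with classification metrics concrete is pair counting: a pair of items counts as a true positive if both the gold labels and the clustering put the two items together. A small sketch (the helper name and example labels are made up):

from itertools import combinations

def pair_counting_scores(gold, predicted):
    """Precision / Recall / F1 / Jaccard over pairs of items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred and not same_gold:
            fp += 1
        elif same_gold and not same_pred:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, jaccard

# Made-up example: gold classes vs. predicted cluster ids (ids don't need to match)
print(pair_counting_scores([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))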
Questions?