DATA SCIENCE
LECTURE 8
Clustering
Basic Data Model: Sets
• Document: A document is represented as a set of
shingles (more accurately, hashes of shingles)
• Document similarity: Jaccard similarity of the sets
of shingles.
• Common shingles over the union of shingles
• Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.
• Applicable to any kind of sets.
• E.g., similar customers or items.
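A minimal Python sketch of this similarity measure; the shingle sets below are made up for illustration:

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity: |C1 ∩ C2| / |C1 ∪ C2| of two shingle sets."""
    if not c1 and not c2:
        return 1.0  # convention: two empty documents are identical
    return len(c1 & c2) / len(c1 | c2)

# Toy 3-character shingles of "the cat" and "the dog".
doc1 = {"the", "he ", "e c", " ca", "cat"}
doc2 = {"the", "he ", "e d", " do", "dog"}
print(jaccard(doc1, doc2))  # 2 shared / 8 in the union = 0.25
```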
What is a Clustering?
• In general, a grouping of objects such that the objects in a
group (cluster) are similar (or related) to one another and
different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]
Applications of Cluster Analysis
• Understanding
• Group related documents for browsing, group genes and proteins
that have similar functionality, or group stocks with similar
price fluctuations

Discovered Clusters → Industry Group
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
• Summarization
• Reduce the size of large data sets
[Figure: clustering precipitation in Australia.]
Early applications of cluster analysis
• John Snow, London 1854
Notion of a Cluster can be Ambiguous
How many clusters? [Figure: the same set of points interpreted as two, four, or six clusters.]
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
• A division of data objects into subsets (clusters) such
that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical
tree
Partitional Clustering
[Figure: original points and a partitional clustering of them.]
Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1–p4 and its dendrogram.]
[Figure: a non-traditional hierarchical clustering of the same points and its dendrogram.]
Other types of clustering
• Exclusive (or non-overlapping) versus non-
exclusive (or overlapping)
• In non-exclusive clusterings, points may belong to
multiple clusters.
• Points that belong to multiple classes, or ‘border’ points
• Fuzzy (or soft) versus non-fuzzy (or hard)
• In fuzzy clustering, a point belongs to every cluster
with some weight between 0 and 1
• Weights usually must sum to 1 (often interpreted as probabilities)
• Partial versus complete
• In some cases, we only want to cluster some of the
data
Types of Clusters: Well-Separated
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
• The center of a cluster is often a centroid, the minimizer of
distances from all the points in the cluster, or a medoid, the
most “representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or
Transitive)
• A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points that is separated from
other regions of high density by regions of low density.
• Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
• Finds clusters that share some common property or represent
a particular concept.
2 Overlapping Circles
Types of Clusters: Objective Function
• Clustering as an optimization problem
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters
and evaluate the 'goodness' of each potential set of clusters by
using the given objective function (NP-hard).
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the
data to a parameterized model.
• The parameters for the model are determined from the data, and they
determine the clustering
• E.g., mixture models assume that the data is a 'mixture' of a number of
statistical distributions.
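As a small illustration of the mixture-model view, scikit-learn's GaussianMixture fits such a parameterized model; the two-Gaussian data below is synthetic, and this is a sketch assuming scikit-learn is available:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from a mixture of two Gaussians.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(5.0, 0.5, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard = gm.predict(X)         # hard cluster assignments
soft = gm.predict_proba(X)   # soft memberships; each row sums to 1
print(gm.means_)             # fitted component centers (the model parameters)
```

Note how the fitted parameters (means, covariances, mixing weights) determine the clustering, exactly as the bullet above describes.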
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• DBSCAN
K-MEANS
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid
(center point)
• Each point is assigned to the cluster with the
closest centroid
• Number of clusters, K, must be specified
• The objective is to minimize the sum of
distances of the points to their respective
centroid
K-means Clustering
• Problem: Given a set X of n points in a d-dimensional space
and an integer K, group the points into K clusters
C = {C1, C2, …, CK} such that

\[ \sum_{i=1}^{K} \sum_{x \in C_i} d(x, c_i) \]

is minimized, where ci is the centroid of the points
in cluster Ci
K-means Clustering
• The most common definition uses Euclidean distance,
minimizing the Sum of Squares Error (SSE) function
• Sometimes K-means is defined this way
• Problem: Given a set X of n points in a d-dimensional space
and an integer K, group the points into K clusters
C = {C1, C2, …, CK} such that the Sum of Squares Error (SSE)

\[ \mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2 \]

is minimized, where ci is the mean of the points in cluster Ci
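The SSE objective is a one-liner in NumPy, given the points, their cluster labels, and the centroids (a sketch with illustrative names):

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of ||x - c_i||^2."""
    return float(((X - centroids[labels]) ** 2).sum())
```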
Complexity of the k-means problem
• NP-hard if the dimensionality of the data is at
least 2 (d>=2)
• Finding an optimal solution in polynomial time is believed to be infeasible
• For d=1 the problem is solvable in polynomial
time (how?)
• A simple iterative algorithm works quite well in
practice
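One answer to the '(how?)' above: in one dimension, every cluster in an optimal solution is a contiguous segment of the sorted points, so dynamic programming over segment boundaries finds the exact optimum. A sketch (function name and structure are illustrative):

```python
import numpy as np

def kmeans_1d_sse(xs, k):
    """Exact 1-D K-means via dynamic programming.
    After sorting, every optimal cluster is a contiguous segment, so
    D[m, j] = best SSE of the first j points split into m segments.
    Runs in O(K * n^2) time."""
    xs = np.sort(np.asarray(xs, dtype=float))
    n = len(xs)
    p1 = np.concatenate([[0.0], np.cumsum(xs)])       # prefix sums
    p2 = np.concatenate([[0.0], np.cumsum(xs ** 2)])  # prefix sums of squares

    def seg_sse(i, j):  # SSE of xs[i:j] around its own mean
        s, s2, m = p1[j] - p1[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    D = np.full((k + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            D[m, j] = min(D[m - 1, i] + seg_sse(i, j) for i in range(m - 1, j))
    return D[k, n]  # optimal SSE; backpointers would recover the clusters

print(kmeans_1d_sse([1, 2, 3, 10, 11, 12], 2))  # 4.0: {1,2,3} and {10,11,12}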
K-means Algorithm
• Also known as Lloyd’s algorithm.
• K-means is sometimes synonymous with this
algorithm
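A minimal NumPy sketch of Lloyd's algorithm as described above, with random initialization (names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A minimal sketch of Lloyd's algorithm. Alternates:
      1. assignment: each point goes to its nearest centroid;
      2. update: each centroid moves to the mean of its points;
    and stops when no point changes cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

In practice one would use a library implementation (e.g., scikit-learn's KMeans, which runs essentially this loop with better seeding and multiple restarts).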
K-means Algorithm – Initialization
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
Two different K-means Clusterings
[Figure: the same original points clustered two ways by K-means; one run yields the optimal clustering, the other a sub-optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: iterations 1–6 of K-means from a good choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: the six iterations shown side by side; the run converges to the optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: iterations 1–5 of K-means from a poor choice of initial centroids.]
Importance of Choosing Initial Centroids …
[Figure: the five iterations shown side by side; the run converges to a sub-optimal clustering.]
Dealing with Initialization
• Do multiple runs and select the clustering with the
smallest error
• Select the initial set of points by methods other than random,
e.g., pick points that are far apart from each other as cluster
centers, or use the K-means++ algorithm, which samples each new
center with probability proportional to its squared distance from
the centers already chosen
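A sketch of K-means++ seeding under that description (in practice, e.g., scikit-learn's KMeans defaults to k-means++ seeding and supports multiple restarts via its n_init parameter):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: the first center is chosen uniformly at random;
    each subsequent center is drawn with probability proportional to its
    squared distance from the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```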
K-means Algorithm – Centroids
• The centroid depends on the distance function
• The minimizer for the distance function
• ‘Closeness’ is measured by Euclidean distance
(SSE), cosine similarity, correlation, etc.
• Centroid:
• The mean of the points in the cluster for SSE, and cosine
similarity
• The median for Manhattan distance.
• Finding the centroid is not always easy
• It can be an NP-hard problem for some distance functions
• E.g., the median in multiple dimensions
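A quick brute-force check, on toy 1-D numbers, that the mean minimizes the squared-error (SSE) objective while the median minimizes the Manhattan (L1) objective:

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # a toy 1-D cluster

grid = np.linspace(0.0, 12.0, 1201)  # candidate centers, step 0.01
sq = ((pts[None, :] - grid[:, None]) ** 2).sum(axis=1)  # SSE objective
l1 = np.abs(pts[None, :] - grid[:, None]).sum(axis=1)   # Manhattan objective

print(pts.mean(), grid[sq.argmin()])      # 4.0 4.0 -> mean minimizes SSE
print(np.median(pts), grid[l1.argmin()])  # 3.0 3.0 -> median minimizes L1
```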
K-means Algorithm – Convergence
• K-means will converge for common similarity
measures mentioned above.
• Most of the convergence happens in the first few
iterations.
• Often the stopping condition is changed to ‘Until
relatively few points change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = dimensionality
• In general a fast and efficient algorithm
Limitations of K-means
• K-means has problems when clusters are of
different
• Sizes
• Densities
• Non-globular shapes
• K-means has problems when the data contains
outliers.
Limitations of K-means: Differing Sizes
[Figure: original points vs. the three clusters found by K-means.]
Limitations of K-means: Differing Density
[Figure: original points vs. the three clusters found by K-means.]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. the two clusters found by K-means.]