Clustering Data Mining
Wei Wang
Outline
• What is clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based clustering methods
• Outlier analysis
What Is Clustering?
• Group data into clusters
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
– Unsupervised learning: no predefined classes
Outliers
Cluster 1
Cluster 2
Application Examples
• A stand-alone tool: explore data
distribution
• A preprocessing step for other algorithms
• Pattern recognition, spatial data analysis,
image processing, market research,
WWW, …
– Cluster documents
– Cluster web log data to discover groups of
similar access patterns
What Is A Good Clustering?
• High intra-class similarity and low inter-
class similarity
– Depending on the similarity measure
• The ability to discover some or all of the
hidden patterns
Requirements of Clustering
• Scalability
• Ability to deal with various types of
attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain
knowledge to determine input parameters
Requirements of Clustering
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Data Matrix
• For memory-based clustering
– Also called object-by-variable structure
• Represents n objects with p variables
(attributes, measures)
– A relational table:
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity Matrix
• For memory-based clustering
– Also called object-by-object structure
– Proximities of pairs of objects
– d(i,j): dissimilarity between objects i and j
– Nonnegative
– Close to 0: similar
$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
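As a minimal sketch (my own illustration, not from the lecture), the following Python builds an object-by-object dissimilarity matrix from an n x p data matrix, using Euclidean distance as d(i, j):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Object-by-object structure: D[i, j] = d(i, j), Euclidean here."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):                  # only the lower triangle is computed
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]               # symmetry: d(i, j) = d(j, i)
    return D

# Three objects with p = 2 variables (an object-by-variable data matrix)
X = np.array([[1.0, 2.0], [2.0, 4.0], [8.0, 9.0]])
print(dissimilarity_matrix(X))
```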
How Good Is A Clustering?
• Dissimilarity/similarity depends on
distance function
– Different applications have different functions
• Judgment of clustering quality is typically
highly subjective
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Similarity and Dissimilarity
Between Objects
• Distances are normally used as measures
• Minkowski distance: a generalization
$$d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q} \quad (q > 0)$$
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• Weighed distance
$$d(i,j) = \sqrt[q]{w_1|x_{i1}-x_{j1}|^q + w_2|x_{i2}-x_{j2}|^q + \cdots + w_p|x_{ip}-x_{jp}|^q} \quad (q > 0)$$
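A minimal sketch (my own illustration, not from the lecture) of the Minkowski distance with optional weights; q = 2 gives Euclidean distance and q = 1 gives Manhattan distance:

```python
def minkowski(x, y, q=2, weights=None):
    """Minkowski distance between two p-dimensional objects x and y."""
    if weights is None:
        weights = [1.0] * len(x)            # unweighted case
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y))
    return total ** (1.0 / q)

print(minkowski([1, 2], [4, 6], q=2))       # 5.0  (Euclidean)
print(minkowski([1, 2], [4, 6], q=1))       # 7.0  (Manhattan)
```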
Properties of Minkowski
Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
– d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangular inequality
– d(i,j) ≤ d(i,k) + d(k,j)
Categories of Clustering
Approaches (1)
• Partitioning algorithms
– Partition the objects into k clusters
– Iteratively reallocate objects to improve the
clustering
• Hierarchy algorithms
– Agglomerative: each object is a cluster,
merge clusters to form larger ones
– Divisive: all objects are in a cluster, split it up
into smaller clusters
Categories of Clustering
Approaches (2)
• Density-based methods
– Based on connectivity and density functions
– Filter out noise, find clusters of arbitrary
shape
• Grid-based methods
– Quantize the object space into a grid structure
• Model-based
– Use a model to find the best fit of data
Partitioning Algorithms: Basic
Concepts
• Partition n objects into k clusters
– Optimize the chosen partitioning criterion
• Global optimal: examine all partitions
– $(k^n - (k-1)^n - \dots - 1)$ possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
– K-means: a cluster is represented by the center
– K-medoids or PAM (partition around medoids): each
cluster is represented by one of the objects in the
cluster
K-means
• Arbitrarily choose k objects as the initial
cluster centers
• Until no change, do
– (Re)assign each object to the cluster to which
the object is the most similar, based on the
mean value of the objects in the cluster
– Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
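A minimal sketch of the loop above (my own illustration, not from the lecture), assuming numeric attributes and Euclidean distance:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: X is an n x p array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the cluster means
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # no change: stop
            break
        centers = new_centers
    return labels, centers
```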
K-Means: Example
[Figure: K-means example with K = 2. Arbitrarily choose K objects as the initial cluster means; assign each object to the cluster with the most similar center; update the cluster means; reassign and repeat until no change.]
Pros and Cons of K-means
• Relatively efficient: O(tkn)
– n: # objects, k: # clusters, t: # iterations; k, t << n.
• Often terminates at a local optimum
• Applicable only when mean is defined
– What about categorical data?
• Need to specify the number of clusters
• Unable to handle noisy data and outliers
• Unsuitable for discovering clusters with non-convex shapes
Variations of the K-means
• Aspects of variations
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes
– Use mode instead of mean
• Mode: the most frequent item(s)
– A mixture of categorical and numerical data: k-prototype method
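A minimal sketch (my own illustration, not from the lecture) of the two ingredients k-modes changes: simple matching dissimilarity for categorical attributes, and the mode (most frequent value per attribute) as the cluster "center":

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Mode-based 'center': the most frequent value of each attribute."""
    p = len(objects[0])
    return tuple(Counter(obj[f] for obj in objects).most_common(1)[0][0]
                 for f in range(p))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_mode(cluster))                                        # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "small")))  # 1
```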
A Problem of K-means
• Sensitive to outliers
– Outlier: objects with extremely large values
• May substantially distort the distribution of the data
• K-medoids: the most centrally located
object in a cluster
PAM: A K-medoids Method
• PAM: partitioning around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
– (Re)assign each object to the cluster represented by its nearest medoid
– Randomly select a non-medoid object o’, compute the
total cost, S, of swapping medoid o with o’
– If S < 0 then swap o with o’ to form the new set of k
medoids
Swapping Cost
• Measure whether o’ is better than o as a
medoid
• Use the squared-error criterion
– Compute $E_{o'} - E_o$, where
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$$
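A minimal sketch (my own illustration, not from the lecture) of a simplified PAM loop that uses the squared-error swapping cost above; the random-swap strategy and the stopping rule are simplifications of the algorithm on the previous slide:

```python
import numpy as np

def squared_error(X, medoid_idx):
    """E = sum over clusters of squared distances of objects to their medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def pam(X, k, max_tries=200, seed=0):
    """Simplified PAM: accept a medoid/non-medoid swap when S = E_o' - E_o < 0."""
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = list(rng.choice(n, size=k, replace=False))  # arbitrary initial medoids
    E = squared_error(X, medoids)
    tries = 0
    while tries < max_tries:                 # stop once swaps stop helping
        o_random = int(rng.integers(n))      # random non-medoid candidate o'
        if o_random in medoids:
            continue
        i = int(rng.integers(k))             # medoid o considered for replacement
        candidate = medoids.copy()
        candidate[i] = o_random
        S = squared_error(X, candidate) - E  # swapping cost
        if S < 0:
            medoids, E, tries = candidate, E + S, 0
        else:
            tries += 1
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :],
                            axis=2).argmin(axis=1)
    return labels, X[medoids]
```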
[Figure: PAM example. Arbitrarily choose k objects as the initial medoids; assign each remaining object to its nearest medoid; in a loop, compute the total cost of swapping a medoid O with a random non-medoid O_random and swap if the quality is improved, until no change.]
Pros and Cons of PAM
• PAM is more robust than k-means in the
presence of noise and outliers
– Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well to large data sets
– O(k(n-k)^2) for each iteration
• Sampling based method: CLARA
CLARA (Clustering LARge
Applications)
• CLARA (Kaufmann and Rousseeuw in 1990)
– Built into statistical analysis packages, such as S+
• Draw multiple samples of the data set, apply PAM on each sample, and return the best clustering
• Performs better than PAM on larger data sets
• Efficiency depends on the sample size
– A good clustering on samples may not be a good
clustering of the whole data set
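A minimal sketch (my own illustration, not from the lecture) of the CLARA idea; it assumes a k-medoids routine such as the `pam` sketch above is passed in, and it judges each sample's medoids on the whole data set:

```python
import numpy as np

def clara(X, k, pam, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run a k-medoids routine on several random samples
    and keep the medoids that look best on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        _, medoids = pam(X[idx], k)                    # cluster the sample only
        # Judge the sample's medoids against all objects, not just the sample
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        cost = d.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = np.linalg.norm(X[:, None, :] - best_medoids[None, :, :],
                            axis=2).argmin(axis=1)
    return labels, best_medoids
```

For example, `clara(X, k=2, pam=pam)` reuses the `pam` sketch from the swapping-cost slide; the quality of the result depends on how representative the samples are.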
CLARANS (Clustering Large Applications
based upon RANdomized Search)
• The problem space: a graph of clusterings
– A vertex is a set of k medoids chosen from the n objects; there are $\binom{n}{k}$ vertices in total
– PAM searches the whole graph
– CLARA searches some random sub-graphs
• CLARANS climbs mountains (randomized hill climbing)
– Randomly sample a set and select k medoids
– Consider neighbors of the current medoids as candidates for new medoids
– Use the sample set to verify
– Repeat multiple times to avoid bad samples
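A minimal sketch (my own illustration, not from the lecture) of a CLARANS-style randomized search, where a node is a set of k medoids and a neighbor differs in exactly one medoid; the parameters `numlocal` and `maxneighbor` bound the number of restarts and of unsuccessful neighbor tests:

```python
import numpy as np

def clarans(X, k, numlocal=2, maxneighbor=50, seed=0):
    """Randomized search over medoid sets (a simplified CLARANS sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    def cost(meds):
        d = np.linalg.norm(X[:, None, :] - X[meds][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best_cost, best = np.inf, None
    for _ in range(numlocal):                    # repeat to avoid bad starting points
        current = list(rng.choice(n, size=k, replace=False))
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:
            i = int(rng.integers(k))             # medoid to replace
            o = int(rng.integers(n))             # candidate non-medoid
            if o in current:
                continue
            neighbor = current.copy()
            neighbor[i] = o
            c = cost(neighbor)
            if c < current_cost:                 # move to the better neighbor
                current, current_cost, tried = neighbor, c, 0
            else:
                tried += 1
        if current_cost < best_cost:             # keep the best local minimum
            best_cost, best = current_cost, current
    labels = np.linalg.norm(X[:, None, :] - X[best][None, :, :],
                            axis=2).argmin(axis=1)
    return labels, X[best]
```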