Cluster Analysis
Basic Concepts and Algorithms
What is Cluster Analysis?
Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
Cluster analysis
– Grouping a set of data objects into clusters
Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis
Understanding
– Information Retrieval: group related documents for browsing
– Finance: group stocks with similar price fluctuations
– Biology: group genes and proteins that have similar functionality; cluster gene expression data
– Marketing: help marketers discover distinct groups in their customer bases, and develop targeted marketing programs
Example: clusters of stocks discovered from similar price movements, with their industry group
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Applications of Cluster Analysis
Image segmentation
– Goal: Break up the image
into meaningful or
perceptually similar regions
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
Notion of a Cluster can be Ambiguous
How many clusters? The same set of points can be grouped into two, four, or six clusters.
Clustering results are crucially dependent on the measure of similarity (or distance) between the “points” to be clustered.
Measure the Quality of Clustering
Quality of clustering:
– There is usually a separate “quality” function that
measures the “goodness” of a cluster.
– It is hard to define “similar enough” or “good
enough”
The answer is typically highly subjective
Considerations for Cluster Analysis
Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space (Partial versus complete)
– Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)
Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
Types of Clusters: Well-Separated
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every point in the cluster than to any
point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of its cluster than to the
center of any other cluster
– The center of a cluster is often a centroid
4 center-based clusters
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent
a particular concept.
2 Overlapping Circles
Characteristics of the Input Data Are Important
Type of proximity or density measure
– This is a derived measure, but central to clustering
Sparseness
– Dictates type of similarity
– Adds to efficiency
Attribute type
– Dictates type of similarity
Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
Dimensionality
Noise and Outliers
Type of Distribution
Similarity and Dissimilarity
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Types of Attributes
There are different types of attributes
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, time, counts
Attribute types, with descriptions, examples, and allowed operations:

Nominal
– Description: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test

Ordinal
– Description: the values of an ordinal attribute provide enough information to order objects (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
– Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
– Description: for ratio variables, both differences and ratios are meaningful (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
Meaning-preserving transformations by attribute level:

Nominal
– Transformation: any permutation of values
– Comment: if all employee ID numbers were reassigned, would it make any difference?

Ordinal
– Transformation: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
– Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}

Interval
– Transformation: new_value = a * old_value + b, where a and b are constants
– Comment: thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree)

Ratio
– Transformation: new_value = a * old_value
– Comment: length can be measured in meters or feet
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Distance Measures
Manhattan Distance
dist(p, q) = \sum_{k=1}^{n} |p_k - q_k|
Euclidean Distance
dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Manhattan Distance
Actual points in 2D:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Manhattan distance is the sum of the absolute values of the differences of the coordinates.
Distance between p1 and p2: d = |0 - 2| + |2 - 0| = 2 + 2 = 4
Distance between p1 and p3: d = |0 - 3| + |2 - 1| = 3 + 1 = 4
Distance between p1 and p4: d = |0 - 5| + |2 - 1| = 5 + 1 = 6
L1 distance matrix:
L1   p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0
Euclidean Distance
Actual points in 2D (the same points as above):
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Euclidean distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
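As a small illustration (not part of the original slides), both distance matrices above can be reproduced with NumPy and SciPy; the coordinates are the example points p1 through p4:

```python
import numpy as np
from scipy.spatial.distance import cdist

# The four example points p1..p4
points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Manhattan (L1) distance: sum of absolute coordinate differences
l1_matrix = cdist(points, points, metric='cityblock')

# Euclidean (L2) distance: square root of the sum of squared differences
l2_matrix = cdist(points, points, metric='euclidean')

print(np.round(l1_matrix, 3))
print(np.round(l2_matrix, 3))
```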
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset
Typical methods: k-means, k-medoids, CLARANS
[Figures: the original points and a partitional clustering of them]
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
[Figures: a traditional hierarchical clustering of points p1, p2, p3, p4 with its traditional dendrogram, and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram]
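As a hedged sketch (library usage assumed, not taken from the slides), SciPy can build such a nested clustering and draw the corresponding dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A few illustrative 2-D points labelled p1..p4 (hypothetical coordinates)
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

# Single-linkage agglomerative clustering: repeatedly merge the two
# closest clusters until only one remains
Z = linkage(points, method='single', metric='euclidean')

# The dendrogram shows the resulting tree of nested clusters
dendrogram(Z, labels=['p1', 'p2', 'p3', 'p4'])
plt.show()
```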
K-Means: Partitioning approach
An iterative clustering algorithm
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
Initialize: pick K random points as cluster centers
Repeat:
  1. Assign data points to the closest cluster center
  2. Change each cluster center to the average of its assigned points
Until the centroids don't change
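A minimal NumPy sketch of this loop (an illustration, not the slides' own code), assuming the data is a 2-D array X and using Euclidean distance; other proximity measures are possible, as noted later:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1: assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For example, `labels, centers = kmeans(X, k=2)` assigns each row of X to one of two clusters and returns the final centroids.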
K-means clustering Example
– Iterative Step 1: assign data points to the closest cluster center
– Iterative Step 2: change each cluster center to the average of its assigned points
– Repeat until convergence
[Figures: the cluster assignments and centroids after Iterations 1, 2, and 3]
K-means Clustering – Details
Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
– K-means will converge for common similarity measures mentioned
above.
K-means Clustering – Details
Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
Complexity is O( n * K * I )
– n = number of points,
– K = number of clusters,
– I = number of iterations
Two different K-means Clusterings
[Figures: the original points and two K-means clusterings of them: an optimal clustering and a sub-optimal clustering]
Importance of Choosing Initial Centroids
[Figures: snapshots of K-means at Iterations 1 through 6 for one choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figures: snapshots of K-means at Iterations 1 through 5 for a different choice of initial centroids]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
– It can be shown that m_i corresponds to the center (mean) of the cluster
– Given two clusters, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
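A short sketch of this computation (an illustration, not from the slides), assuming NumPy arrays X (the points), labels, and centers such as those returned by the kmeans sketch above:

```python
import numpy as np

def sse(X, labels, centers):
    # Squared Euclidean distance from each point to the centroid of its cluster
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))
```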
Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial
centroids
Select more than k initial centroids and then select among
these initial centroids
– Select most widely separated
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing and Post-processing
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
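A hedged scikit-learn sketch of the pre-processing step (the library calls and parameter values are illustrative, not the slides' own); it also uses multiple runs to address the initialization problem discussed above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data whose attributes live on very different scales
rng = np.random.default_rng(0)
X = rng.random((100, 3)) * np.array([1.0, 100.0, 1000.0])

# Normalize: zero mean and unit variance per attribute, so that no single
# attribute dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

# n_init runs K-means several times from different initial centroids and
# keeps the run with the lowest SSE (inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)  # SSE of the selected clustering
```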
Limitations of K-means
K-means has problems when clusters are of
different
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains
outliers.
Limitations of K-means: Differing Sizes
Original Points vs. K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points vs. K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points vs. K-means (2 Clusters)
Overcoming K-means Limitations
Original Points vs. K-means Clusters
Overcoming K-means Limitations
Original Points vs. K-means Clusters
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-
connected points
Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Locates regions of high density that are separated by regions of low density.
In the center-based approach, the density of a point is the number of points within a specified radius, Eps, of that point.
A cluster is defined as a maximal set of density-connected points.
In the center-based approach, we can classify a point as being
1) in the interior of a dense region (a core point)
2) on the edge of a dense region (a border point)
3) in a sparsely occupied region (a noise point)
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number
of points (MinPts) within Eps
These are points that are at the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border
point.
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm
– Label all points as core, border or noise.
– Eliminate noise points.
– Put an edge between all core points that are within
Eps of each other.
– Make each group of connected core points into a
separate cluster.
– Assign each border point to one of the clusters of its associated core points.
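A hedged usage sketch with scikit-learn's DBSCAN implementation (the data and parameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small dense blobs plus a few scattered points that should become noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 7.0, (10, 2))])

# eps plays the role of Eps, min_samples the role of MinPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                     # cluster index per point; -1 marks noise
core_points = db.core_sample_indices_   # indices of the core points
print(sorted(set(labels)))
```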
DBSCAN Algorithm
Eliminate noise points
Perform clustering on the remaining points
DBSCAN Algorithm
Time Complexity
– O(N x time to find points in Eps-neighbourhood)
– where N is the no of points
– Worst case O(N2)
– KD-trees, allow efficient retreivel of all points within
given distance of a specified point in O(N logN)
Space Complexity
– O(N)
DBSCAN: Core, Border and Noise Points
Original Points vs. point types: core, border, and noise
Eps = 10, MinPts = 4
When DBSCAN Works Well
Original Points vs. Clusters found by DBSCAN
• Resistant to Noise
• Can handle clusters of different shapes and sizes
DBSCAN: Sensitive to Parameters
DBSCAN online demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html