
Clustering: Overview and K-means algorithm

K-means illustrations thanks to 2006 student Martin Makowiecki

Informal goal

•  Given a set of objects and a measure of similarity between them, group similar objects together
•  What do we mean by “similar”?
•  What is a good grouping?
•  Computation time / quality tradeoff

General types of clustering

•  “Soft” versus “hard” clustering
   –  Hard: partition the objects
      •  each object is in exactly one partition
   –  Soft: assign a degree to which each object belongs to each cluster
      •  view as a probability or score
      •  (hard and soft assignments are contrasted in the sketch after the applications list)
•  Flat versus hierarchical clustering
   –  hierarchical = clusters within clusters

Applications: many

–  biology
–  astronomy
–  computer-aided design of circuits
–  information organization
–  marketing
–  …
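To make the hard/soft distinction concrete, here is a minimal sketch in Python; the documents and membership scores are invented for illustration:

    # Hard clustering: each object gets exactly one cluster label (a partition).
    hard_assignment = {"doc1": 0, "doc2": 0, "doc3": 1}

    # Soft clustering: each object gets a degree of membership per cluster,
    # viewed here as probabilities that sum to 1 for each object.
    soft_assignment = {
        "doc1": [0.9, 0.1],
        "doc2": [0.7, 0.3],
        "doc3": [0.2, 0.8],
    }

    # A hard clustering can be read off a soft one by taking the highest score.
    hardened = {obj: scores.index(max(scores))
                for obj, scores in soft_assignment.items()}
    print(hardened)  # {'doc1': 0, 'doc2': 0, 'doc3': 1}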

Clustering in information search and analysis

•  Group information objects
   ⇒  discover topics
   ?  other groupings desirable
•  Clustering versus classifying
   –  classifying: have pre-determined classes with example members
   –  clustering:
      –  get groups of similar objects
      –  added problem of labeling clusters by topic
      –  e.g. common terms within a cluster of docs

Example applications in search

•  Query evaluation: cluster pruning (§7.1.6)
   –  cluster all documents
   –  choose a representative for each cluster
   –  evaluate the query w.r.t. the cluster representatives
   –  evaluate the query for docs in the cluster(s) having the most similar cluster rep(s)
   –  (a sketch of this scheme follows this slide)
•  Results presentation: labeled clusters
   –  cluster only the query results
   –  e.g. Yippy.com (metasearch)
   –  hard / soft? flat / hier?
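A minimal sketch of the cluster-pruning scheme above, assuming documents and the query live in the vector model with cosine similarity; the two-cluster toy data and all names are invented for illustration:

    import math

    def cosine(u, v):
        # Cosine similarity between two equal-length vectors.
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    # Toy corpus: two clusters of document vectors, each with a representative.
    clusters = [
        {"rep": [1.0, 0.1], "docs": [[0.9, 0.2], [1.1, 0.0]]},
        {"rep": [0.1, 1.0], "docs": [[0.2, 0.9], [0.0, 1.2]]},
    ]
    query = [1.0, 0.0]

    # Evaluate the query against cluster representatives only ...
    best = max(clusters, key=lambda c: cosine(query, c["rep"]))
    # ... then against the documents of the most similar cluster only.
    ranked = sorted(best["docs"], key=lambda d: cosine(query, d), reverse=True)
    print(ranked)  # [[1.1, 0.0], [0.9, 0.2]]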

Issues

•  What attributes represent items for clustering purposes?
•  What is the measure of similarity between items?
   •  General objects and a matrix of pairwise similarities
   •  Objects with specific properties that allow other specifications of the measure
      –  Most common: objects are d-dimensional vectors
         »  Euclidean distance
         »  cosine similarity
         »  (both are sketched below)
•  What is the measure of similarity between clusters?

Issues continued

•  Cluster goals?
   –  Number of clusters?
   –  flat or hierarchical clustering?
   –  cohesiveness of clusters?
•  How to evaluate cluster results?
   –  relates to the measure of closeness between clusters
•  Efficiency of clustering algorithms
   –  large data sets => external storage
•  Maintain clusters in a dynamic setting?
•  Clustering methods? - MANY!
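The two most common vector measures named above, as a minimal sketch:

    import math

    def euclidean_distance(u, v):
        # Straight-line distance between two d-dimensional vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def cosine_similarity(u, v):
        # Cosine of the angle between the vectors: 1 = same direction.
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    print(euclidean_distance([0, 0], [3, 4]))  # 5.0
    print(cosine_similarity([1, 0], [1, 1]))   # ~0.707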

Quality of clustering methods

•  In applications, the quality of a clustering depends on how well it solves the problem at hand
•  An algorithm uses a measure of quality that can be optimized, but that may or may not do a good job of capturing the application's needs
•  Underlying graph-theoretic problems are usually NP-complete
   –  e.g. graph partitioning
•  Usually the algorithm is not finding an optimal clustering

General types of clustering

•  Constructive versus iterative improvement
   –  constructive: decide in what cluster each object belongs and don't change it
      •  often faster
   –  iterative improvement: start with a clustering and move objects around to see if the clustering can be improved
      •  often slower but better

Vector model: K-means algorithm

•  Well known, well used
•  Flat clustering
•  Number of clusters picked ahead of time
•  Iterative improvement
•  Uses the notion of a centroid
•  Typically uses Euclidean distance

K-means overview

•  Choose k points among the set to cluster
   –  Call them the k centroids
•  For each point not selected, assign it to its closest centroid
   –  All assignments give the initial clustering
•  Until “happy” do:
   –  Recompute the centroids of the clusters
      •  New centroids may not be points of the original set
   –  Reassign all points to the closest centroid
      •  Updates clusters
•  (a runnable sketch of this loop follows)
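A runnable sketch of the loop above in the vector model with Euclidean distance; the toy points, taking the first k points as seeds, and the iteration cap are illustrative assumptions, not part of the slides:

    import math

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def kmeans(points, k, max_iter=100):
        # Choose k points among the set as the initial centroids (here: the first k).
        centroids = [list(p) for p in points[:k]]
        for _ in range(max_iter):
            # Assign every point to its closest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: euclidean(p, centroids[j]))
                clusters[i].append(p)
            # Recompute centroids; they need not be points of the original set.
            new_centroids = [
                [sum(cs) / len(c) for cs in zip(*c)] if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:  # "happy": centroids stabilized
                break
            centroids = new_centroids
        return centroids, clusters

    points = [(1, 1), (1.5, 2), (0.5, 1), (8, 8), (9, 9)]
    print(kmeans(points, k=2)[0])  # two centroids, one per cluster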

An Example

[figure: start: choose centroids and cluster]
[figure: recompute centroids]
[figure: re-cluster around new centroids]
[figure: 2nd recompute centroids and re-cluster]

An Example

[figure: 3rd (final) recompute and re-cluster]

Details for K-means

•  Need a definition of the centroid
   –  c_i = (1/|C_i|) ∑_{x∈C_i} x  for the ith cluster C_i containing objects x
   –  requires a notion of a sum of objects
•  Need a definition of distance to (similarity to) a centroid
•  Typically the vector model with Euclidean distance
•  Minimizing the sum of squared distances of each point to its centroid = Residual Sum of Squares:
   RSS = ∑_{i=1}^{K} ∑_{x∈C_i} dist(c_i, x)²
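The centroid and RSS definitions above, transcribed directly; the toy clustering is invented:

    def centroid(cluster):
        # c_i = (1/|C_i|) * (coordinate-wise sum of the vectors in C_i)
        return [sum(coords) / len(cluster) for coords in zip(*cluster)]

    def rss(clusters):
        # RSS = sum over clusters i of sum over x in C_i of dist(c_i, x)^2
        total = 0.0
        for c in clusters:
            ci = centroid(c)
            total += sum(sum((a - b) ** 2 for a, b in zip(ci, x)) for x in c)
        return total

    clusters = [[(1, 1), (2, 2)], [(8, 8), (9, 9), (10, 10)]]
    print(centroid(clusters[0]))  # [1.5, 1.5]
    print(rss(clusters))          # 1.0 + 4.0 = 5.0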

K-means performance

•  Can prove RSS decreases with each iteration, so the algorithm converges
•  Can achieve a local optimum
   –  No change in the centroids
•  Running time depends on how demanding the stopping criteria are
•  Works well in practice
   –  speed
   –  quality

Time Complexity of K-means

•  Let tdist be the time to calculate the distance between two objects
•  Each iteration has time complexity O(K*n*tdist)
   –  n = number of objects
•  Bounding the number of iterations by I gives O(I*K*n*tdist)
•  For m-dimensional vectors: O(I*K*n*m)
   –  m large and centroids not sparse
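For a feel of the O(I*K*n*m) bound, a back-of-the-envelope count; all four values are invented for illustration:

    I = 50         # iteration bound
    K = 10         # number of clusters
    n = 1_000_000  # number of objects
    m = 100        # vector dimension, so tdist is ~m operations

    ops = I * K * n * m  # one distance per object-centroid pair per iteration
    print(f"~{ops:.1e} basic operations")  # ~5.0e+10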

Space Complexity of K-means

•  Store the points and centroids
   –  vector model: O((n + K)m)
•  External algorithm versus internal?
   –  store the k centroids in memory
   –  run through the points on each iteration

Choosing Initial Centroids

•  Bad initialization leads to poor results

[figure: optimal versus not optimal initial centroids]

Choosing Initial Centroids

Many people have spent much time examining how to choose seeds
•  Random
   –  Fast and easy, but often poor results
•  Run random multiple times, take the best
   –  Slower, and still no guarantee of results
   –  (see the sketch after the next slide)
•  Pre-conditioning
   –  remove outliers
•  Choose seeds algorithmically
   –  run hierarchical clustering on a sample of the points and use the resulting centroids
   –  Works well on small samples and for few initial centroids

K-means weakness

[figure: non-globular clusters]
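A sketch of “run random multiple times, take the best”, scoring each run by RSS; it reuses the kmeans() and rss() functions from the earlier sketches, and the restart count and shuffle-based seeding are arbitrary illustrative choices:

    import random

    def kmeans_restarts(points, k, restarts=10):
        best_clusters, best_score = None, float("inf")
        for _ in range(restarts):
            shuffled = points[:]
            random.shuffle(shuffled)      # random seeds: the first k of a shuffle
            _, clusters = kmeans(shuffled, k)
            score = rss([c for c in clusters if c])  # skip empty clusters
            if score < best_score:        # keep the clustering with the lowest RSS
                best_clusters, best_score = clusters, score
        return best_clusters, best_score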

K-means weakness

[figure: wrong number of clusters]

K-means weakness

[figure: outliers and empty clusters]

Real cases tend to be harder

•  Different attributes of the feature vector have vastly different sizes
   –  the size of a star versus its color
•  Can weight different features
   –  how you weight them greatly affects the outcome (see the sketch below)
•  Difficulties can be overcome
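A minimal sketch of feature weighting in the distance measure; the star-like feature values and the weights are invented to show how strongly the weighting drives the outcome:

    import math

    def weighted_euclidean(u, v, w):
        # Each squared coordinate difference is scaled by its feature weight.
        return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, u, v)))

    # Invented features: (size, color index); size is on a vastly larger scale.
    star_a = (1_000_000.0, 0.3)
    star_b = (1_000_050.0, 0.9)

    print(weighted_euclidean(star_a, star_b, (1.0, 1.0)))    # ~50.0: size dominates
    print(weighted_euclidean(star_a, star_b, (1e-10, 1.0)))  # ~0.6: color dominates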
