
Image segmentation

By Prof. Jyotsna Singh


Division of ECE
What is segmentation?
• Segmentation divides an image into groups of pixels
• Pixels are grouped because they share some local property (gray level,
color, texture, motion, etc.)

[Figure: segmentation output shown as boundaries, labels, pseudocolors, and mean colors — different ways of displaying the output]


Region based segmentation
• Goal: find coherent (homogeneous) regions in the image
• Coherent regions contain pixels which share some similar property
• Advantages: better for noisy images
• Disadvantages: over-segmentation (too many regions) or under-segmentation (too few regions)
• Can’t find objects that span multiple disconnected regions


Two errors

oversegmentation
(hair should
be one group)

undersegmentation
(water should be
separated from trees)
Types of segmentations

Input Over-segmentation Under-segmentation

Multiple Segmentations
Image Segmentation
• So all we have to do is define and implement the similarity predicate.
• But what do we want to be similar in each region?
• Is there any property that will cause the regions to be meaningful objects?
Foreground / background separation

Background subtraction provides figure-ground separation, which is a type of segmentation.
Image segmentation
• Segmentation = partitioning
Carve dense data set into (disjoint) regions
– Divide image based on pixel similarity
– Divide spatiotemporal volume based on image similarity (shot
detection)
– Figure / ground separation (background subtraction)
– Regions can be overlapping (layers)

• Grouping = clustering
Gather sets of items according to some model
– If items are dense, then essentially the same problem as
above
(e.g., clustering pixels)
– If items are sparse, then problem has a slightly different
flavor:
• Collect tokens that lie on a line
• Collect pixels that share the same fundamental matrix
• Group 3D surface elements that belong to the same surface
Segmentation as partitioning
• A partition of an image I is a collection of sets S1, …, SN such that
  I = S1 ∪ S2 ∪ … ∪ SN (the sets cover the entire image)
  Si ∩ Sj = ∅ for all i ≠ j (the sets do not overlap)
• A predicate H(Si) measures region homogeneity:
  H(R) = true if the pixels in region R are similar
  H(R) = false otherwise
• We want
  1. Regions to be homogeneous: H(Si) = true for all i
  2. Adjacent regions to be different from each other: H(Si ∪ Sj) = false for all adjacent Si, Sj
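As a concrete illustration (not from the slides), the predicate H can be as simple as a bound on the gray-level spread of a region; the tolerance max_range below is an assumed parameter, and any other similarity test could be substituted.

```python
import numpy as np

def homogeneous(region_pixels, max_range=20):
    """H(R): True when the gray-level spread inside region R is small.

    region_pixels : array of gray values belonging to the region.
    max_range     : assumed tolerance; any other similarity test could be used.
    """
    region_pixels = np.asarray(region_pixels)
    return region_pixels.max() - region_pixels.min() <= max_range
```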
Region based segmentation
• An image domain X must be segmented into N different regions R1, …, RN
• The segmentation rule is a logical predicate of the form P(R)
• Image segmentation partitions the set X into the subsets Ri, i = 1, …, N, having the following properties:
  – Every pixel must be in a region.
  – Points in a region must be connected.
  – Regions must be disjoint.
  – All pixels in a region satisfy specific properties.
  – Different regions have different properties.


Region based segmentation
• Each image R is a set of regions Ri.
  – Every pixel belongs to one region.
  – One pixel can only belong to a single region.
[Figure: an image partitioned into regions R1–R7]
Thresholding
How can we divide an image into uniform regions?
• One of the simplest methods is that of histogram and
thresholding.
– If we plot the number of pixels which have a specific grey
value versus that value, we create the histogram of the
image.
– Properly normalized, the histogram is essentially the
probability density function of the grey values of the
image.
• Assume that we have an image consisting of a bright object
on a dark background and assume that we want to extract
the object.
How can we divide an image into uniform regions?
• For such an image, the histogram will have two peaks and a valley
between them.
• We can then choose as the threshold the grey value which corresponds to the valley of the histogram, indicated by t0, and label all pixels with grey values greater than t0 as object pixels and all pixels with grey values smaller than t0 as background pixels.

The histogram of an image with a bright object on a dark background.


What do we mean by “labelling” an image?

• When we say we “extract” an object in an image, we


mean that we identify the pixels that make the
object up.
• To express this information, we create an array of
the same size as the original image and we give to
each pixel a label.
• All pixels that make up the object are given the
same label and all pixels that make up the
background are given a different label.
How can we choose the minimum error threshold?
• Let us assume that the pixels which make up the object are
distributed according to the probability density function po(x)
and the pixels which make up the background are distributed
according to function pb(x).

• Their weighted sum, i.e. po(x) and pb(x) multiplied with the total
number of pixels that make up the object, No, and the
background, Nb, respectively, and added, is the histogram of the
image
How can we choose the minimum error threshold?
• Assume that we choose a threshold value t. Then the
error committed by misclassifying object pixels as
background pixels will be given by

• and the error committed by misclassifying


background pixels as object pixels is:

• In other words, the error that we commit arises from


misclassifying the two tails of the two probability
density functions on either side of threshold t.
How can we choose the minimum error threshold?
• Let us assume that the fraction of the pixels that make up the
object is θ, and,
• by inference, the fraction of the pixels that make up the
background is 1 − θ. Then, the total error is:

• We would like to choose t so that E(t) is minimum. We take


the first derivative of E(t) with respect to t and set it to zero:

• The solution of this equation gives the minimum error


threshold, for any type of probability density functions that
are used to model the two pixel populations.
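The equations referred to above appear as images in the original slides; they follow the standard minimum-error derivation and can be reconstructed as follows, assuming the object is brighter than the background and t is the threshold.

```latex
% Error from misclassifying object pixels (values below t) as background:
E_o(t) = \int_{-\infty}^{t} p_o(x)\,dx
% Error from misclassifying background pixels (values above t) as object:
E_b(t) = \int_{t}^{\infty} p_b(x)\,dx
% Total error, with \theta the fraction of object pixels:
E(t) = \theta \int_{-\infty}^{t} p_o(x)\,dx + (1-\theta) \int_{t}^{\infty} p_b(x)\,dx
% Setting dE/dt = 0 gives the minimum-error threshold condition:
\theta\, p_o(t) - (1-\theta)\, p_b(t) = 0
\quad\Rightarrow\quad \theta\, p_o(t) = (1-\theta)\, p_b(t)
```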
Example
Example
Image Segmentation
• Segmentation of an image entails the division
or separation of the image into regions of
similar attribute.

1. Region Growing
2. Clustering
3. Split and Merge
4. Watershed algorithm
1. Split-and-Merge
• Split-and-merge algorithm combines these two ideas
– Split image into quadtree, where each region satisfies homogeneity
criterion
– Merge neighboring regions if their union satisfies criterion (like
connected components)

image after split after merge


Two approaches

Splitting (Divisive clustering)
– start with a single region covering the entire image
– repeat: split inhomogeneous regions
– even better: repeat: split a cluster to yield two distant components (difficult)
– Property 2 is always true: H(Si ∪ Sj) = false for adjacent regions
– Goal is to satisfy Property 1: H(Si) = true for every region

Merging (Agglomerative clustering)
– start with each pixel as a separate region
– repeat: merge adjacent regions if their union is homogeneous
– even better: repeat: merge the two closest clusters
– Property 1 is always true: H(Si) = true for every region
– Goal is to satisfy Property 2: H(Si ∪ Sj) = false for adjacent regions
Split and Merge
• The basic idea of region splitting is to break the image into a set of disjoint regions which are coherent within themselves:
• Initially take the image as a whole to be the area of interest.
• Look at the area of interest and decide if all pixels contained in the region satisfy some similarity constraint.
• If TRUE, then the area of interest corresponds to a region in the image.
• If FALSE, split the area of interest (usually into four equal sub-areas) and consider each of the sub-areas as the area of interest in turn.
Region splitting
• Start with entire image as a single region
• Repeat:
– Split any region that does not satisfy homogeneity criterion into
subregions
• Quad-tree representation is convenient
• Then need to merge regions that have been split
[Figure: quad-tree splitting example with blocks Aa, Ab, B, C, D and sub-blocks Ada, Adc, Add]
Split and Merge

• This process continues until no further splitting occurs. In the


worst case this happens when the areas are just one pixel in
size.
• This is a divide and conquer or top down method.
• If only a splitting schedule is used then the final segmentation
would probably contain many neighbouring regions that have
identical or similar properties.
• Thus, a merging process is used after each split which
compares adjacent regions and merges them if necessary.
• Algorithms of this nature are called split and
merge algorithms.
• To illustrate the basic principle of these methods let us consider
an imaginary image.
Split and Merge
• Let I denote the whole image shown in Fig (a).
• Not all the pixels in I are similar so the region is split as in Fig
(b).
• Assume that all pixels within regions I1, I2 and I3 respectively
are similar but those in I4 are not.
• Therefore I4 is split next as in Fig (c).
• Now assume that all pixels within each region are similar with
respect to that region, and that after comparing the split
regions, regions I43 and I44 are found to be identical.
• These are thus merged together as in Fig (d).
Split and Merge
Split and Merge
We can describe the splitting of the image using a tree structure, using a modified quadtree.
Each non-terminal node in the tree has at most four descendants, although it may have fewer due to merging. See Fig.
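A minimal sketch of split-and-merge on a square, power-of-two gray-level image. The max–min gray-level spread used as the homogeneity criterion and the greedy merge order are assumptions for illustration, not the slides' exact method.

```python
import numpy as np

def split_merge(img, max_range=20, min_size=8):
    """Split-and-merge sketch: quadtree splitting followed by a greedy merge pass."""
    def homogeneous(values):
        return values.max() - values.min() <= max_range

    leaves = []                           # list of (row, col, size) leaf blocks

    def split(r, c, size):
        block = img[r:r + size, c:c + size]
        if homogeneous(block) or size <= min_size:
            leaves.append((r, c, size))
        else:
            h = size // 2                 # quarter the block (quadtree split)
            for dr, dc in [(0, 0), (0, h), (h, 0), (h, h)]:
                split(r + dr, c + dc, h)

    split(0, 0, img.shape[0])

    # Merge pass: join a leaf block to an already labelled neighbour when the
    # union of their pixels still satisfies the homogeneity predicate.
    labels = np.full(img.shape, -1, dtype=int)
    next_label = 0
    for r, c, size in leaves:
        block = img[r:r + size, c:c + size]
        merged = False
        for nbr_r, nbr_c in [(r - 1, c), (r, c - 1)]:      # look up and left
            if nbr_r >= 0 and nbr_c >= 0 and labels[nbr_r, nbr_c] != -1:
                lbl = labels[nbr_r, nbr_c]
                union = np.concatenate([img[labels == lbl].ravel(), block.ravel()])
                if homogeneous(union):
                    labels[r:r + size, c:c + size] = lbl
                    merged = True
                    break
        if not merged:
            labels[r:r + size, c:c + size] = next_label
            next_label += 1
    return labels
```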
2. Region Growing
• Region growing techniques start with one pixel of a potential region and try to grow it by adding adjacent pixels till the pixels being compared are too dissimilar.
• The first pixel selected can be just the first unlabeled pixel in the image.
• For region growing we need a rule describing a growth mechanism and a rule checking the homogeneity of the regions after each growth step.
Region Growing
• The region growing approach is the opposite of the split and merge approach:
• An initial set of small areas are iteratively merged according to similarity constraints.
• Start by choosing an arbitrary seed pixel and compare it with neighbouring pixels.
• The region is grown from the seed pixel by adding in neighbouring pixels that are similar, increasing the size of the region.
• When the growth of one region stops we simply choose another seed pixel which does not yet belong to any region and start again.
• This whole process is continued until all pixels belong to some region.
• A bottom-up method.
Region growing methods often give very good segmentations that correspond well to the observed edges.
Region Growing
• Start with a pixel, or a group of pixels, and examine the
neighboring pixels. If a neighboring pixel meets a
certain criteria, it is added to the group, and if it does
not meet the criteria, it is not added.
• This process is continued until no more neighboring
pixels can be added to the group. Thus, a region is
defined.
• The criterion depends on how you wish to segment the region.
• It can be a limit on the derivative between pixels, a change in color, or any other criterion you wish to use to differentiate between pixels.
Method One - Recursive Region Growing
The idea behind recursive region growing is as follows:
• Start with a pixel, and examine the eight pixels bordering it.
• If a pixel meets the criteria for addition to the group, you recursively call the function on that pixel.
• This process continues until all possible pixels have been examined.
• The problem with this recursive segmentation routine is processing power, because for a large region many, many pixels will be admissible to the region, and thus there will be many, many recursions before the recursion sequence terminates.
• In fact, with our implementation of the recursive region growing routine, we had the problem that MATLAB would crash with a segmentation fault.
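A sketch of region growing in Python (not the MATLAB routine mentioned above); it uses an explicit stack rather than recursion, which sidesteps the recursion-depth problem just described. The tolerance test against the running region mean is an assumed criterion.

```python
import numpy as np

def grow_region(img, seed, tol=15):
    """Grow a single region from `seed` (row, col) using an explicit stack.

    A neighbouring pixel is added when its gray value differs from the
    current region mean by at most `tol` (assumed criterion).
    """
    h, w = img.shape
    in_region = np.zeros((h, w), dtype=bool)
    in_region[seed] = True
    stack = [seed]
    total, count = float(img[seed]), 1       # running sum and size of the region

    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):                 # 8-connected neighbours
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and not in_region[nr, nc]:
                    if abs(float(img[nr, nc]) - total / count) <= tol:
                        in_region[nr, nc] = True
                        total += float(img[nr, nc])
                        count += 1
                        stack.append((nr, nc))
    return in_region
```

Repeating this from a fresh, still-unlabelled seed until every pixel belongs to some region gives the full bottom-up segmentation described above.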
Region growing
• Starting with a particular seed pixel and letting this region grow
completely before trying other seeds biases the segmentation in
favor of the regions which are segmented first.

• This can have several undesirable effects:

• Current region dominates the growth process -- ambiguities


around edges of adjacent regions may not be resolved
correctly.

• Different choices of seeds may give different segmentation


results.

• Problems can occur if the (arbitrarily chosen) seed point lies


on an edge.
Region growing
• To counter the above problems, simultaneous region
growing techniques have been developed.

• Similarities of neighbouring regions are taken into account in the


growing process.

• No single region is allowed to completely dominate the


proceedings.

• A number of regions are allowed to grow at the same time.

• similar regions will gradually coalesce into expanding regions.

• Control of these methods may be quite complicated but efficient


methods have been developed.

• Easy and efficient to implement on parallel computers.


3. Clustering
• There are K clusters C1, …, CK with means m1, …, mK.
• The least-squares error is defined as
  D = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} || x_i − m_k ||²
• Out of all possible partitions into K clusters, choose the one that minimizes D.
Why don’t we just do this?
If we could, would we get meaningful objects?
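Computing D for a given partition is direct; a small sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def squared_error(points, labels, means):
    """D = sum over clusters k of sum over x_i in C_k of ||x_i - m_k||^2.

    points : (n, d) array, labels : (n,) cluster index per point,
    means  : (K, d) array of cluster means m_1, ..., m_K.
    """
    diffs = points - means[labels]        # each point minus its own cluster mean
    return float(np.sum(diffs ** 2))
```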
Clustering
• Task of grouping a set of objects
• Objects in the same group (called a cluster) are more similar (in some sense or another) to each other
• An object in one cluster is different from an object in another cluster
• Connectivity model, centroid model, distribution model, density model, graph based model, hard clustering, soft clustering, …
An Ideal Clustering Situation
[Fig. 20.1: clusters plotted against Variable 1 and Variable 2]
More Common Clustering Situation
[Fig. 20.2: clusters plotted against Variable 1 and Variable 2, with a point marked X]
Statistics Associated with Cluster Analysis

• Agglomeration schedule. Gives information on the objects


being combined at each stage of a hierarchical clustering
process.
• Cluster centroid. Mean values of the variables for all the cases
in a particular cluster.
• Cluster centers. Initial starting points in nonhierarchical
clustering. Clusters are built around these centers, or seeds.
• Cluster membership. Indicates the cluster to which each
object or case belongs.
Statistics Associated with Cluster Analysis
• Dendrogram (A tree graph). A
graphical device for displaying
clustering results.
-Vertical lines represent clusters
that are joined together.
-The position of the line on the
scale indicates distances at which
clusters were joined.
• Distances between cluster centers.
These distances indicate how
separated the individual pairs of
clusters are. Clusters that are
widely separated are distinct, and
therefore desirable.
• Icicle diagram. Another type of
graphical display of clustering
results.
Dendrograms
Dendrogram yields a picture of output as clustering process continues

[Figure: raw data and the same clusters represented as a tree, from http://en.wikipedia.org/wiki/Dendrogram]
the icicle plot
• In their paper, Kruskal and Landwehr described a way to implement icicle plots with simple plotters.
• “In the icicle plot each vertical line topped by a label corresponds to an object.
• The label is repeated vertically with the symbol "&" used to separate successive copies,
down to the level where the object becomes a singleton cluster.
• Each horizontal line in the icicle plot shows one level of the clustering, as illustrated on
the right.
• Objects in the same cluster are joined by the symbol "=," while clusters are separated
by a blank space.
• At the left of the line are a serial number and proximity level for this stage of the
clustering.”
Conducting Cluster Analysis
Formulate the Problem

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters

Assess the Validity of Clustering


Formulating the Problem
• Most important is selecting the variables on which
the clustering is based. Inclusion of even one or
two irrelevant variables may distort a clustering
solution.
• Variables selected should describe the similarity
between objects.
• Should be selected based on past research,
theory, or a consideration of the hypotheses being
tested.
Select a Similarity Measure
• Similarity measure can be correlations or distances

• The most commonly used measure of similarity is the


Euclidean distance. The city-block distance is also used.

• If variables measured in vastly different units, we must


standardize data. Also eliminate outliers

• Use of different similarity/distance measures may lead


to different clustering results.

• Hence, it is advisable to use different measures and


compare the results.
Classification of Clustering Procedures
Clustering Procedures
• Hierarchical
  – Agglomerative
    • Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
    • Variance Methods: Ward’s Method
    • Centroid Methods
  – Divisive
• Nonhierarchical
  – Sequential Threshold
  – Parallel Threshold
  – Optimizing Partitioning
Hierarchical Clustering Methods
• Hierarchical clustering is characterized by the development of
a hierarchy or tree-like structure.
-Agglomerative clustering starts with each object in a
separate cluster. Clusters are formed by grouping objects into
bigger and bigger clusters. Agglomerative methods are
commonly used in marketing research.
-Divisive clustering starts with all the objects grouped in a
single cluster. Clusters are divided or split until each object is
in a separate cluster.
Hierarchical Agglomerative Clustering - Linkage Methods
• The single linkage method is based on minimum distance, or the nearest neighbor rule.
• The complete linkage method is based on the maximum distance or the furthest neighbor approach.
• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects.
Linkage Methods of Clustering
[Figure: Single Linkage — minimum distance between Cluster 1 and Cluster 2; Complete Linkage — maximum distance; Average Linkage — average distance]
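As a sketch of these three definitions, all of them can be read off the pairwise point-to-point distances between two clusters (NumPy assumed):

```python
import numpy as np

def linkage_distances(cluster1, cluster2):
    """Single, complete and average linkage distances between two clusters.

    cluster1 : (n1, d) array of points, cluster2 : (n2, d) array of points.
    """
    # Pairwise Euclidean distances between every point of cluster1 and cluster2.
    d = np.linalg.norm(cluster1[:, None, :] - cluster2[None, :, :], axis=2)
    return d.min(), d.max(), d.mean()   # single, complete, average linkage
```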
Hierarchical Agglomerative Clustering - Variance and Centroid Methods
• Variance methods generate clusters to minimize the within-cluster variance.
• Ward's procedure is commonly used. For each cluster, the sum of squares is calculated. The two clusters whose merger gives the smallest increase in the overall within-cluster sum of squares are combined.
• In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables).
• Of the hierarchical methods, average linkage and Ward's methods have been shown to perform better than the other procedures.
Other Agglomerative Clustering Methods
Ward’s Procedure

Centroid Method
Select a Clustering Procedure
• The hierarchical and nonhierarchical methods should be used
in tandem.
-First, an initial clustering solution is obtained using a
hierarchical procedure (e.g. Ward's).
-The number of clusters and cluster centroids so obtained
are used as inputs to the optimizing partitioning method.
• Choice of a clustering method and choice of a distance
measure are interrelated.
• For example, squared Euclidean distances should be used
with the Ward's and centroid methods. Several
nonhierarchical procedures also use squared Euclidean
distances.
Nonhierarchical Clustering Methods
• The nonhierarchical clustering methods are frequently referred to as k-means clustering.
  – In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together.
  – In the parallel threshold method, several cluster centers are selected and objects within the threshold level are grouped with the nearest center.
  – The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize an overall criterion, such as average within-cluster distance for a given number of clusters.
Centroid model
• Computational time is short
• The user has to decide the number of clusters before starting to classify the data
• The concept of the centroid
• One of the most famous methods: the K-means method


Idea Behind K-Means
Algorithm for K-means clustering
1. Partition items into K clusters.
2. Assign items to cluster with nearest centroid mean.
3. Recalculate centroids both for cluster receiving and
losing item.
4. Repeat steps 2 and 3 till no more reassignments.
k-Means Algorithm
• k-Means clustering algorithm proposed by J. Hartigan and M. A. Wong
[1979].
• Given a set of n distinct objects, the k-Means clustering algorithm partitions
the objects into k number of clusters such that intracluster similarity is high
but the intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters; the objects are assumed to be described by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
k-Means Algorithm
The algorithm can be stated as follows.
• First it selects k number of objects at random from the set of n objects. These
k objects are treated as the centroids or center of gravities of k clusters.
• For each of the remaining objects, it is assigned to one of the closest centroid.
Thus, it forms a collection of objects assigned to each centroid and is called a
cluster.
• Next, the centroid of each cluster is then updated (by calculating the mean
values of attributes of each object).
• The assignment and update procedure is repeated until some stopping criterion is reached (such as a maximum number of iterations, centroids remaining unchanged, or no reassignments).



k-Means Algorithm
k-Means clustering
Algorithm
Input: D is a dataset containing n objects, k is the number of cluster
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
   • Compute the distance between the current object and the k cluster centroids
   • Assign the current object to the cluster to which it is closest.

3. Compute the “cluster centers” of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied
5. Stop
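The steps above can be written directly in code; the following is a minimal NumPy sketch of the algorithm as stated (not a library routine), assuming numeric attributes and the L2 norm.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """Plain k-Means following the steps above.

    X : (n, d) array of objects with numeric attributes; k : number of clusters.
    Returns the final cluster labels and centroids.
    """
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1

    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid (Euclidean / L2 norm).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)

        # Step 3: recompute the cluster centres (keep old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```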
k-Means Algorithm

Illustration of k-Means clustering algorithms
• Fig 16.1: Plotting data of Table 16.1

A1 A2
6.8 12.6
0.8 9.8
1.2 11.6
2.8 9.6
3.8 9.9
4.4 6.5
4.8 1.1
6.0 19.9
6.2 18.5
7.6 17.4
7.8 12.2
6.6 7.7
8.2 4.5
8.4 6.9
9.0 3.4
9.6 11.1
Illustration of k-Means clustering algorithms
• Suppose, k=3. Three objects are chosen at random shown as circled (see Fig
16.1). These three centroids are shown below.
Initial centroids chosen randomly

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5
• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively. The distance calculations are shown in Table 16.2.
• Assignment of each object to the respective centroid is shown in the
right-most column and the clustering so obtained is shown in Fig 16.2.



Illustration of k-Means clustering algorithms
Table 16.2: Distance calculation (Fig 16.2: initial clusters with respect to Table 16.2)

A1    A2    d1    d2    d3    cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2



Illustration of k-Means clustering algorithms
The calculation new centroids of the three cluster using the mean of attribute
values of A1 and A2 is shown in the Table below. The cluster with new centroids
are shown in Fig 16.3.

Calculation of new centroids

New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 16.3: Initial cluster with new centroids


Illustration of k-Means clustering algorithms
We next reassign the 16 objects to three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 16.4.
Note that point p moves from cluster C2 to cluster C1.

Fig 16.4: Cluster after first iteration



Illustration of k-Means clustering algorithms
• The newly obtained centroids after second iteration are given in the table below.
Note that the centroid c3 remains unchanged, where c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the k-means algorithm stops here.
Hence, the final cluster in Fig 16.5 is same as Fig 16.4.

Fig 16.5: Cluster after Second iteration

Cluster centres after second iteration

Centroid   A1 (revised)   A2 (revised)
c1         5.0            7.1
c2         8.1            12.0
c3         6.6            18.6



Comments on k-Means algorithm



Partitional Clustering
• K-means algorithm:
  – Decide the number of clusters N for the final result; here we assume N = 3.
  – Randomly choose N points as the initial centroids of the clusters (N = 3).
  – Assign every point to the cluster of its nearest centroid (notice the definition of “nearest”!).
  – Calculate the new centroid of every cluster (notice the definition of the centroid!).
  – Repeat the assignment and centroid-update steps until all the points are classified and the assignments no longer change.
• Data clustering completed (for N = 3).
Example

Image Clusters on intensity Clusters on color


Example
Input image Segmentation using K-means
K-Means Example 1
K-Means Example 2
[Figure: comparison of the original image, K-means with K = 6, and Isodata where K became 5]
Comments on k-Means algorithm
Example : k versus cluster quality
• With reference to an arbitrary experiment, suppose the following results are
obtained.

k SSE
1 62.8
2 12.3
3 9.4
4 9.3
5 9.2
6 9.1
7 9.05
8 9.0
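A common way to read such a k-versus-SSE table is the “elbow” heuristic: plot SSE against k and pick the k beyond which the curve flattens. The sketch below simply plots the figures above (matplotlib assumed).

```python
import numpy as np
import matplotlib.pyplot as plt

k_values = np.arange(1, 9)
sse = np.array([62.8, 12.3, 9.4, 9.3, 9.2, 9.1, 9.05, 9.0])

plt.plot(k_values, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()   # the "elbow" of the curve suggests a suitable k
```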
Comments on k-Means algorithm
2. Choosing initial centroids:
• Another requirement of the k-Means algorithm is to choose an initial cluster centroid for each of the k clusters.
• It is observed that the k-Means algorithm terminates whatever the initial choice of the cluster centroids.
• It is also observed that the initial choice influences the ultimate cluster quality. In other words, the result may be trapped in a local optimum if the initial centroids are not chosen properly.
• One technique that is usually followed to avoid the above problem is to run the algorithm multiple times, each with a different set of randomly chosen initial centroids, and then select the best clustering (with respect to some quality measurement criterion, e.g. SSE).
• However, this strategy suffers from a combinatorial explosion due to the number of all possible solutions.
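A minimal sketch of the multiple-runs strategy described above, reusing the k_means() and squared_error() sketches given earlier in this document (these helper names are illustrative, not a library API):

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run k-Means several times with different random initial centroids
    and keep the clustering with the lowest SSE."""
    best = None
    for seed in range(n_runs):
        labels, centroids = k_means(X, k, rng=seed)
        sse = squared_error(X, labels, centroids)
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best   # (best SSE, labels, centroids)
```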
Comments on k-Means algorithm
3. Distance Measurement:
• To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of “closest” for the objects under clustering.
• Usually the Euclidean distance (L2 norm) is the best measure when object points are defined in n-dimensional Euclidean space.
• Another measure, namely cosine similarity, is more appropriate when the objects are of document type.
• Further, there may be other types of proximity measures that are appropriate in the context of the application, for example the Manhattan distance (L1 norm), the Jaccard measure, etc.
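For reference, minimal implementations of the three measures mentioned above (a sketch; NumPy assumed, with a and b as 1-D vectors):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))        # L2 norm

def manhattan(a, b):
    return np.sum(np.abs(a - b))                # L1 norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```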
Comments on k-Means algorithm
Distance with document objects
Suppose a set of n document objects is defined as a document-term matrix (DTM); a typical layout is shown below.

Document   t1   t2   …   tn
D1
D2
…
Dn
Comments on k-Means algorithm
Note: the criteria of the objective function with different proximity measures

1. SSE (using the L2 norm): minimize the SSE (sum of squared errors).
2. SAE (using the L1 norm): minimize the SAE (sum of absolute errors).
3. TC (using cosine similarity): maximize the TC (total cohesion).
Comments on k-Means algorithm
Question: interpret the best centroid for maximizing TC (with the cosine similarity measure) of a cluster.

The above discussion is quite sufficient for the validation of the k-Means algorithm.
Different variants of the k-means algorithm
There are quite a few variants of the k-Means algorithm. These can differ in the procedure for selecting the initial k means, the calculation of proximity, and the strategy for calculating cluster means. Other variants of k-means cluster categorical data.
A few variants of the k-Means algorithm include:
• Bisecting k-Means (addressing the issue of the initial choice of cluster means).
  – M. Steinbach, G. Karypis and V. Kumar, “A comparison of document clustering techniques”, Proceedings of the KDD Workshop on Text Mining, 2000.
• Means of clusters (proposing various strategies to define means and variants of means).
  – B. Zhang, “Generalized k-Harmonic Means – Dynamic Weighting of Data in Unsupervised Learning”, Technical Report, HP Labs, 2000.
  – A. D. Chaturvedi, P. E. Green, J. D. Carroll, “k-Modes Clustering”, Journal of Classification, Vol. 18, pp. 35-36, 2001.
  – D. Pelleg, A. Moore, “x-Means: Extending k-Means with efficient estimation of the number of clusters”, 17th International Conference on Machine Learning, 2000.
Different variants of the k-means algorithm
• N. B. Karayiannis, M. M. Randolph-Gips, “Non-Euclidean c-Means clustering algorithms”, Intelligent Data Analysis, Vol. 7(5), pp. 405-425, 2003.
• J. V. de Oliveira, W. Pedrycz, “Advances in Fuzzy Clustering and its Applications”, edited book, John Wiley, 2007. (Fuzzy c-Means algorithm.)
• A. K. Jain and R. C. Dubes, “Algorithms for Clustering Data”, Prentice Hall, 1988. Online book at http://www.cse.msu.edu/~jain/clustering_Jain_Dubes.pdf
• A. K. Jain, M. N. Murty and P. J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, 31(3), 264-323, 1999. Also available online.
