
CS 476 Introduction to Machine Learning, Module 6

MODULE 6 – SYLLABUS
Unsupervised Learning - Clustering Methods - K-means, Expectation-Maximization
Algorithm, Hierarchical Clustering Methods, Density based clustering
➢ Explain Clustering with an example/application. Why is it said to be Unsupervised
Learning? (can refer to Module 1 too)
➢ Explain the K-Means procedure/algorithm with an example
➢ When do we say the K-means algorithm has converged, or when do we stop
cluster reorganisation in K-means?
➢ Explain the reconstruction error to be minimized in clustering
➢ How can we choose the initial clusters in K-means? How do we determine the optimal
number of clusters to choose in clustering?
➢ What are the drawbacks of K-means?
CLUSTERING
Clustering or cluster analysis is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other
than to those in other groups (clusters).
Example for Clustering – Color Quantization - Let us say we have an image that is stored
with 24 bits/pixel and can have up to 16 million colors. Assume we have a color screen with
8 bits/pixel that can display only 256 colors. We want to find the best 256 colors among all
16 million colors such that the image using only the 256 colors in the palette looks as close as
possible to the original image. This is color quantization where we map from high to lower
resolution.
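
In practice this mapping can be computed with a clustering algorithm such as k-means (described in the next section). The following is a minimal sketch added for illustration; it assumes scikit-learn, NumPy and Pillow are installed, and 'image.png' / 'quantized.png' are placeholder file names.

# Colour quantization with k-means: reduce an image to a 256-colour palette.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("image.png").convert("RGB"), dtype=np.float64)  # placeholder file
h, w, _ = img.shape
pixels = img.reshape(-1, 3)                 # one row per pixel: (R, G, B)

kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_           # the 256 "best" colours found
labels = kmeans.labels_                     # palette index assigned to every pixel

quantized = palette[labels].reshape(h, w, 3).astype(np.uint8)
Image.fromarray(quantized).save("quantized.png")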
Other Examples – Digit Classification, Categorizing News articles, Categorizing users in
Social Media
k-means Clustering
The k-means clustering algorithm is one of the simplest unsupervised learning
algorithms for solving the clustering problem.
Let it be required to classify a given data set into a certain number of clusters, say, k
clusters. We start by choosing k points arbitrarily as the “centres” of the clusters, one
for each cluster. We then associate each of the given data points with the nearest
centre. We now take the averages of the data points associated with a centre and
replace the centre with the average, and this is done for each of the centres. We repeat
the process until the centres converge to some fixed points. The data points nearest to
the centres form the various clusters in the dataset. Each cluster is represented by the
associated centre.


The aim is to minimize the reconstruction error, given as

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} ‖x − v_i‖²

i.e., we take the intra-cluster error by taking the distance of each data point from its cluster
centre (inner summation), and we add this error over all the clusters (outer summation, where k is
the number of clusters); we aim to minimize this sum. The v_i's are the cluster centres and C_i is
the set of data points assigned to the i-th cluster.

The steps are repeated until convergence; we stop if any of the following holds (see the sketch
below):

● the reconstruction error is within a (small) threshold
● no data points are reassigned to a different cluster
● the cluster centres do not change
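
The whole procedure fits in a few lines of code. The following is a minimal NumPy sketch added for illustration (not the notation of the notes): it picks k data points as the initial centres, alternates assignment and centre updates, and stops when the centres stop moving.

import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # arbitrary initial centres
    for _ in range(max_iter):
        # assign every point to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each centre by the average of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:         # centres no longer change
            break
        centers = new_centers
    error = ((X - centers[labels]) ** 2).sum()                  # reconstruction error
    return centers, labels, error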

Some methods to choose the initial cluster centres
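
Two common choices, sketched below for illustration (these are standard methods and not necessarily the exact ones listed in the original notes): picking k distinct data points at random, and a k-means++ style rule that chooses each new centre with probability proportional to its squared distance from the centres already chosen, so the initial centres are spread out.

import numpy as np

def init_random(X, k, rng):
    """Pick k distinct data points at random as the initial centres."""
    return X[rng.choice(len(X), size=k, replace=False)]

def init_kmeans_pp(X, k, rng):
    """k-means++ style initialization: spread the initial centres out."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                 # far-away points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# usage: rng = np.random.default_rng(0); centers = init_kmeans_pp(X, 3, rng)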


Disadvantages/Drawbacks of K-means Clustering

Even though the k-means algorithm is fast, robust and easy to understand, it has
several disadvantages:

● The learning algorithm requires a priori specification of the number of cluster centres.
● The final cluster centres depend on the initial choice of the vi's.
● With different representations of the data we get different results.
● Euclidean distance measures can unequally weight underlying factors.
● The learning algorithm finds only a local optimum of the squared error function.
● Randomly choosing the initial cluster centres may not lead to a fruitful result.
● The algorithm cannot be applied directly to categorical data.

The optimum number of clusters (k) can be identified by plotting the number of clusters against
the error: as k increases the error keeps falling, but beyond a certain value of k the decrease
becomes marginal. That value of k, where the curve bends, is taken as the optimal number of
clusters – this is the Elbow Method.
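
A minimal sketch of the elbow method, added for illustration (assumes scikit-learn and matplotlib; X is whatever array of data points is being clustered): run k-means for several values of k, record the total within-cluster squared error (inertia_), and look for the value of k after which the error stops dropping sharply.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_values=range(1, 11)):
    errors = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        errors.append(km.inertia_)          # sum of squared distances to the nearest centre
    plt.plot(list(k_values), errors, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("reconstruction error (inertia)")
    plt.title("Elbow method")
    plt.show()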


In the problem, the required number of clusters is 2, so we take k = 2. We choose two points
arbitrarily as the initial cluster centres. We then compute the distances of the given data
points from the cluster centres.


Calculating distances of the data points w.r.t. the new centres


Hierarchical Clustering
➢ Explain types of hierarchical clustering
➢ Compare Agglomerative and divisive clustering methods
➢ Explain Dendrograms with an example
➢ Explain the various methods to find the distance between groups of data points (maximum
distance - complete linkage, minimum distance - single linkage, average distance)
➢ Explain Agglomerative clustering algorithm
➢ Explain Divisive clustering (DIANA) with example
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method
of cluster analysis which seeks to build a hierarchy of clusters (or groups) in a given
dataset. The hierarchical clustering produces clusters in which the clusters at each
level of the hierarchy are created by merging clusters at the next lower level. At the
lowest level, each cluster contains a single observation. At the highest level there is
only one cluster containing all of the data.

The decision regarding whether two clusters are to be merged or not is taken based on
the measure of dissimilarity between the clusters. The distance between two clusters
is usually taken as the measure of dissimilarity between the clusters.

Dendrograms
Hierarchical clustering can be represented by a rooted binary tree. The nodes of the
tree represent groups or clusters. The root node represents the entire data set. The
terminal nodes each represent one of the individual observations (singleton clusters).
Each nonterminal node has two daughter nodes.

The distance between merged clusters is monotone increasing with the level of the
merger. The height of each node above the level of the terminal nodes in the tree is
proportional to the value of the distance between its two daughters.

A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced by
hierarchical clustering. The dendrogram may be drawn with the root node at the top and the
branches growing vertically downwards.

Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node represents
the entire dataset and the terminal nodes represent the individual observations.
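
Such a dendrogram can be produced directly with SciPy. A minimal sketch added for illustration (the five two-dimensional points are made-up stand-ins for the observations a, b, c, d, e; they are not the data of Figure 13.7):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# five made-up observations standing in for a, b, c, d, e
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(X, method="single")               # agglomerative clustering, single linkage
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("distance between merged clusters")
plt.show()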

Methods for hierarchical clustering

There are two methods for the hierarchical clustering of a dataset. These are known as the

1. agglomerative method (or the bottom-up method), and
2. divisive method (or the top-down method).
Agglomerative Hierarchical Clustering

In the agglomerative method we start at the bottom and at each level recursively merge a
selected pair of clusters into a single cluster. This produces a grouping at the next
higher level with one less cluster. If there are N observations in the dataset, there will
be N − 1 levels in the hierarchy. The pair chosen for merging consists of the two
groups with the smallest “intergroup dissimilarity”.

Divisive method

The divisive method starts at the top and at each level recursively splits one of the
existing clusters at that level into two new clusters. If there are N observations in the
dataset, the divisive method will also produce N − 1 levels in the hierarchy. The
split is chosen to produce the two new groups with the largest “between-group
dissimilarity”.

Measure of distance between Two data points
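
Typical examples are the Euclidean (straight-line) distance and the Manhattan (city-block) distance. A small sketch added for illustration, assuming the data points are given as numeric vectors:

import numpy as np

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

# e.g. euclidean((1, 2), (4, 6)) -> 5.0, manhattan((1, 2), (4, 6)) -> 7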


➢ What does a measure of dissimilarity measure? Give examples of measures of dissimilarity.
➢ Explain different types of linkages in clustering
Measures of distance between groups of data points


Algorithm for agglomerative hierarchical clustering

In step 3 the cluster distance is calculated using complete linkage clustering, single linkage
clustering or average linkage clustering.
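
The three linkage rules differ only in how the pairwise point-to-point distances between the two clusters are combined. A minimal sketch added for illustration (dist can be any point-to-point distance; Euclidean is used here):

import numpy as np

def euclidean(x, y):
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def single_linkage(A, B, dist=euclidean):
    """Cluster distance = minimum distance over all pairs (a in A, b in B)."""
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B, dist=euclidean):
    """Cluster distance = maximum distance over all pairs (a in A, b in B)."""
    return max(dist(a, b) for a in A for b in B)

def average_linkage(A, B, dist=euclidean):
    """Cluster distance = average distance over all pairs (a in A, b in B)."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))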

The complete-linkage clustering uses the “maximum formula”, that is, the following
formula to compute the distance between two clusters A and B:

d(A, B) = max { d(x, y) : x ∈ A, y ∈ B }


Dendrogram for the hierarchical clustering

Algorithm for divisive hierarchical clustering


Divisive clustering algorithms begin with the entire data set as a single cluster, and
recursively divide one of the existing clusters into two daughter clusters at each
iteration in a top-down fashion. To apply this procedure, we need a separate algorithm
to divide a given dataset into two clusters.

DIANA (DIvisive ANAlysis)


Average dissimilarity of a from the other objects = ¼ (d(a,b) + d(a,c) + d(a,d) + d(a,e)) = ¼ (9 + 3 + 6 + 11) = 7.25
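
The same computation written as a small helper, added for illustration (the pairwise dissimilarities below are the hypothetical values of this worked example): DIANA uses the average dissimilarity of an object from the other objects of its cluster to decide which object starts the splinter group.

# Hypothetical pairwise dissimilarities from the worked example (d is symmetric).
d = {("a", "b"): 9, ("a", "c"): 3, ("a", "d"): 6, ("a", "e"): 11}

def dissim(x, y):
    return 0 if x == y else d.get((x, y), d.get((y, x)))

def avg_dissimilarity(obj, cluster):
    """Average dissimilarity of obj from the other objects in its cluster."""
    others = [o for o in cluster if o != obj]
    return sum(dissim(obj, o) for o in others) / len(others)

print(avg_dissimilarity("a", ["a", "b", "c", "d", "e"]))   # (9 + 3 + 6 + 11) / 4 = 7.25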

DENSITY BASED CLUSTERING


➢ Explain Density based Clustering (DBSCAN) with example and illustrations.
(look up the terms in the procedure – core point, neighbourhood, outlier, border point,
density reachable)
➢ When is Density based clustering preferred over K means clustering
In density-based clustering, clusters are defined as areas of higher density than the
remainder of the data set. Objects in these sparse areas - that are required to separate
clusters - are usually considered to be noise and border points. The most popular
density based clustering method is DBSCAN (Density-Based Spatial Clustering of
Applications with Noise).

K-means clustering will fail to cluster based on density, as it assigns points based on distance
to the nearest centroid. It therefore obtains different clusters than a density-based method
and fails to capture complex density patterns in the data.

The figure shows examples of cases where density-based clustering can be applied to capture
such complex patterns.
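
A minimal sketch added for illustration (assumes scikit-learn is available) contrasting the two on a synthetic non-spherical “two moons” dataset: k-means splits the moons by distance to its two centroids, while DBSCAN recovers them as two density-connected clusters and labels sparse points -1 (noise).

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # eps = neighbourhood radius

# DBSCAN labels are 0 and 1 for the two moons, with -1 marking noise/outlier points.
print(set(km_labels), set(db_labels))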


DBSCAN ALGORITHM


Read the following for an explanation of the Expectation-Maximization algorithm.


Examples of probability distributions

A bimodal distribution is a continuous probability distribution with two different modes. The
modes appear as distinct peaks in the graph of the probability density function.


Consider a mixture of k normal distributions. Let us define a k-dimensional random variable
z = (z1, ..., zk), where zi = 1 if the observation comes from the i-th component of the mixture
and zi = 0 otherwise.
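
For illustration (a sketch with made-up parameters, not data from the notes): sampling from a two-component mixture of normal distributions gives a bimodal dataset, and the hidden component label z drawn for each sample is exactly the latent variable described above.

import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters: weights, means and standard deviations of two components.
weights, means, stds = [0.4, 0.6], [-2.0, 3.0], [0.7, 1.0]

n = 1000
z = rng.choice(2, size=n, p=weights)                      # latent variable: which component
x = rng.normal(loc=np.take(means, z), scale=np.take(stds, z))

# A histogram of x shows two distinct peaks, i.e. a bimodal distribution.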


➢ Write the Expectation Maximization algorithm


Expectation-maximisation algorithm
The maximum likelihood estimation method (MLE) is a method for estimating the
parameters of a statistical model, given observations (see Section 6.5 for details). The
method attempts to find the parameter values that maximize the likelihood function, or
equivalently the log-likelihood function, given the observations.

The expectation-maximisation algorithm (sometimes abbreviated as the EM algorithm) is used to
find maximum likelihood estimates of the parameters of a
statistical model in cases where the equations cannot be solved directly. These models
generally involve latent or unobserved variables in addition to unknown parameters
and known data observations. For example, a Gaussian mixture model can be
described by assuming that each observed data point has a corresponding unobserved
data point, or latent variable, specifying the mixture component to which each data
point belongs.


In the case of Gaussian mixture problems, because of the nature of the function,
finding a maximum likelihood estimate by taking the derivatives of the log-likelihood
function with respect to all the parameters and simultaneously solving the resulting
equations is nearly impossible. So we apply the EM algorithm to solve the problem.
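
As a concrete illustration (a minimal sketch, not the derivation in the notes): for a one-dimensional two-component Gaussian mixture, each EM iteration performs an E-step, computing the posterior probability (responsibility) that each point came from each component, and an M-step, re-estimating the weights, means and variances from those responsibilities.

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # crude initial guesses for the weights, means and variances
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        dens = np.stack([w[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var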

As already indicated, the EM algorithm is a general procedure for estimating the parameters in a
statistical model.


Further tutorials to understand the EM algorithm:

https://www.kdnuggets.com/2016/08/tutorial-expectation-maximization-algorithm.html
https://www.cmi.ac.in/~madhavan/courses/dmml2018/literature/EM_algorithm_2coin_example.pdf

Prepared By Abin Philip, Asst Prof, Toc H.
Reference: Introduction to Machine Learning, II edition, Ethem Alpaydin; Lecture Notes in Machine
Learning by Dr V N Krishnachandra