
A Comparative Study of Various Algorithms To Detect Clustering in Spatial Data

The document describes a comparative study of various clustering algorithms for spatial data. It discusses k-means, k-medoids and AGNES algorithms for clustering crime data on female rapes in India from 2013. The methodology section explains data preprocessing, mapping data to a shapefile, performing clustering analysis using the algorithms, and comparing results using Jaccard index, Jaccard distance and Rand index.


A comparative study of various algorithms to detect clustering in spatial data


A Graduate Project Final Report submitted to Manipal Academy of Higher Education in partial
fulfilment of the requirement for the award of the degree of
 
BACHELOR OF TECHNOLOGY
in
 Electronics and Communication Engineering
 
Submitted by
Hanish Woona
 Reg No. 160907316
Under the guidance of
 
Internal Guide: Vishnumurthy Kedlaya K, Electronics and Communication
External Guide: Amita Puranik, Dept. of Data Science

Department of Electronics & Communication Engineering, MIT, Manipal


Contents

1. Introduction

2. Background theory/literature review

3. Methodology

4. Result analysis

5. Conclusion and future work

6. References



Introduction

In this project we compare various clustering algorithms using aggregated data. To compare
these algorithms we use LISA (Local Indicators of Spatial Association) as the standard. For this
project we use crime data on female rapes in India from 2013.

• Cluster analysis is the process of partitioning a set of data objects (or observations)
into clusters, such that objects in a cluster are similar to one another, yet dissimilar to
objects in other clusters.

• Different clustering methods may generate different clusterings on the same data set.
The partitioning is done by the clustering algorithm rather than by a person. Hence, clustering
is useful for discovering previously unknown groups within the data.

• It is an important part of spatial data mining since it provides certain insights into the
distribution of data and characteristics of spatial clusters.


• Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on
earth, such as natural or constructed features, oceans, and more. Spatial data is
usually stored as coordinates and topology and is data that can be mapped.



Methodology

• Step I: Data preprocessing
• Step II: Mapping the data into a shapefile
• Step III: Performing clustering analysis
• Step IV: Comparing the results


Data Preprocessing
• We consider the crime data on female rapes from the year 2013 to perform
clustering analysis.
• We treat each district as an object. The raw data must have the same number of
objects, with the same names, as the shapefile so that the two can be mapped to
each other.
• We divide the total number of cases in each district by its total female population
and multiply by 10,000 to standardize, giving a rate per 10,000 women.
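
The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the district names and the counts are made up for the example and do not come from the project data.

```python
def normalise(name):
    """Lower-case a district name and collapse extra whitespace."""
    return " ".join(name.strip().lower().split())


def rate_per_10000(cases, female_population):
    """Cases per 10,000 women, the standardization described above."""
    return cases * 10000 / female_population


def match_districts(records, shapefile_names):
    """Compute rates and report name mismatches with the shapefile.

    records maps district name -> (cases, female_population).
    Returns (rates, missing, extra): districts in the shapefile with no
    data are 'missing'; districts in the data but not in the shapefile
    are 'extra'.
    """
    wanted = {normalise(n) for n in shapefile_names}
    rates = {normalise(n): rate_per_10000(c, p) for n, (c, p) in records.items()}
    have = set(rates)
    return rates, sorted(wanted - have), sorted(have - wanted)


# Hypothetical example values, not real figures.
records = {"District A ": (50, 250000), "district b": (12, 100000)}
shapefile_names = ["district a", "District B", "District C"]
rates, missing, extra = match_districts(records, shapefile_names)
```

Any district listed in `missing` or `extra` would need its name corrected before the data can be mapped to the shapefile.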


Mapping the data into shapefile


• The shapefile is a geospatial vector data format for geographic information
system (GIS) software.
• We use two of the shapefile's component files for this project.
shp – holds the geometry used for geospatial visualization
dbf – holds the data of each object as a table that can be opened as an Excel sheet
• The data from the Excel sheet is mapped into the Indian districts shapefile.
• We use ArcGIS software to map the data into the shapefile.
• Mapping the data into a shapefile is an important step; the result is later used to
perform clustering analysis.

Comparing the results

Jaccard Index:
It is a measure of similarity between two sets of data, with a range from 0% to 100%.
The higher the percentage, the more similar the two sets.

Jaccard Index J(X,Y) = (number of elements in both sets) / (number of elements in either set) × 100

Jaccard Distance:
It is a measure of how dissimilar two sets are. It
is the complement of the Jaccard index and can be found by subtracting the Jaccard
index from 100%.

D(X,Y) = 100% - J(X,Y), or equivalently D(X,Y) = 1 - J(X,Y) when J is expressed as a fraction between 0 and 1
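
As a small illustration of the two definitions above, here is a sketch for plain Python sets. The function names are illustrative, and treating two empty sets as identical is a convention chosen for the example rather than something the slides specify.

```python
def jaccard_index(x, y):
    """Jaccard index of two collections, as a percentage (0 to 100)."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Convention for this sketch: two empty sets count as identical.
        return 100.0
    return len(x & y) / len(union) * 100


def jaccard_distance(x, y):
    """Complement of the Jaccard index, also as a percentage."""
    return 100.0 - jaccard_index(x, y)


# Two elements are shared (B, C) out of four distinct elements overall.
print(jaccard_index({"A", "B", "C"}, {"B", "C", "D"}))     # 50.0
print(jaccard_distance({"A", "B", "C"}, {"B", "C", "D"}))  # 50.0
```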


Rand Index:
It is a measure of the similarity between two data clusterings.

Given a set of n elements S = {o1, …, on} and two partitions of S to compare, X = {X1, …, Xr},
a partition of S into r subsets, and Y = {Y1, …, Ys}, a partition of S into s subsets, define the
following:

• a, the number of pairs of elements in S that are in the same subset in X and in the
same subset in Y.
• b, the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y.
• c, the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y.
• d, the number of pairs of elements in S that are in different subsets in X and in
the same subset in Y.


The Rand index R is:

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$$

a + b can be considered as the number of agreements between X and Y and c + d as the
number of disagreements between X and Y.
Since the denominator is the total number of pairs, the Rand index represents the
frequency of occurrence of agreements over the total pairs, or the probability that X
and Y will agree on a randomly chosen pair.
Similarly, one can also view the Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed using the following formula:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
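
The pair counts a, b, c and d can be obtained by looping over all pairs of elements. The sketch below assumes each clustering is given as a list of cluster labels, one per element and in the same order; that representation and the function names are assumptions made for the example. It needs at least two elements, otherwise there are no pairs to count.

```python
from itertools import combinations


def pair_counts(labels_x, labels_y):
    """Return (a, b, c, d) as defined for the Rand index."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1      # together in X, together in Y
        elif not same_x and not same_y:
            b += 1      # apart in X, apart in Y
        elif same_x:
            c += 1      # together in X, apart in Y
        else:
            d += 1      # apart in X, together in Y
    return a, b, c, d


def rand_index(labels_x, labels_y):
    """Agreements (a + b) divided by the total number of pairs."""
    a, b, c, d = pair_counts(labels_x, labels_y)
    return (a + b) / (a + b + c + d)


# Four elements give six pairs; three of them agree.
print(pair_counts([0, 0, 1, 1], [0, 0, 0, 1]))  # (1, 2, 1, 2)
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))   # 0.5
```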



K-means



Background theory

1. We start by randomly selecting k locations as the initial centroids, one for each cluster.
2. We form clusters by assigning each observation to the closest centroid,
based on Euclidean distance.
3. We recompute each centroid as the mean of the observations assigned to it and repeat
step 2 until the assignments no longer change.
4. To select the value of k, we add up the total within-cluster variation (the sum of
squared distances between each observation and its centroid).
5. We plot a graph with k on the x axis and this total variation on the y axis, for k
starting from 2 and going up to between 10 and 20.
6. The graph resembles an exponentially decreasing curve. We choose the k
value beyond which the total variation decreases only slightly (the elbow of the curve).
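
A minimal sketch of the procedure follows, using only the Python standard library. It includes the usual centroid update step and measures the spread as the sum of squared distances to the centroids; the function names and the `seed` parameter are assumptions for the example, not the project's code.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids.

    Returns (labels, centroids, wcss), where wcss is the sum of squared
    distances between each point and the centroid of its cluster.
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss


def elbow_curve(points, k_values, seed=0):
    """Spread (wcss) for each k, starting from randomly sampled points."""
    rng = random.Random(seed)
    return {k: kmeans(points, rng.sample(points, k))[2] for k in k_values}


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids, wcss = kmeans(points, [(0, 0), (10, 10)])
```

Because the starting centroids are random, a single run can settle on a poor grouping, so in practice the algorithm is often run several times for each k and the lowest spread is kept.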



Result analysis



K-medoids



Background theory

1. The k-medoids algorithm is developed from the k-means algorithm to eliminate the
drawback that a centroid need not coincide with an observation point.
2. We follow the same steps as for the k-means algorithm, but we use k
observation points (medoids) as the cluster centres.
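
One simple way to realize this idea is to alternate between assigning points to the nearest medoid and picking, within each cluster, the observation with the smallest total distance to the other members. The sketch below uses that alternating scheme, which is a simplification and not necessarily the variant used in the project. The function name is illustrative, and the initial medoids are assumed to be distinct points.

```python
import math


def kmedoids(points, medoid_idx, max_iter=100):
    """Alternating k-medoids from the given initial medoid indices.

    Returns (labels, medoid_idx); every medoid is an index into points,
    so each cluster centre is always an actual observation.
    """
    medoid_idx = list(medoid_idx)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest medoid by Euclidean distance.
        labels = [
            min(range(len(medoid_idx)),
                key=lambda j: math.dist(p, points[medoid_idx[j]]))
            for p in points
        ]
        # Update step: the member with the smallest total distance
        # to the rest of its cluster becomes the new medoid.
        new_idx = []
        for j in range(len(medoid_idx)):
            members = [i for i, l in enumerate(labels) if l == j]
            new_idx.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[m]) for m in members),
            ))
        if new_idx == medoid_idx:
            break
        medoid_idx = new_idx
    return labels, medoid_idx


points = [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2),
          (10, 10), (12, 10), (11, 11), (10, 12), (12, 12)]
labels, medoids = kmedoids(points, [0, 5])
```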



Result analysis



Agnes



Background theory
1. Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of
clusters. Strategies for hierarchical clustering generally fall into two types:
2. Agglomerative: This is a bottom-up approach: each observation starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.

3. Divisive: This is a top-down approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.

4. The AGNES (Agglomerative Nesting) algorithm works by merging the data one pair at a
time, each time joining the two clusters with the smallest of all the pairwise distances. After
each merge, the distances between clusters are recalculated.
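
The bottom-up merging can be sketched as follows. The sketch assumes single linkage (the distance between two clusters is the smallest distance between any of their points), which is one reading of "nearest distance"; other linkage rules exist, and the function name is illustrative. It recomputes all cluster distances after every merge, which is simple but slow for large data sets.

```python
import math


def agnes(points, n_clusters):
    """Agglomerative clustering with single linkage.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: closest pair of points across the two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Recompute all cluster-to-cluster distances and take the smallest.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = agnes(points, 2)
```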



Result analysis



DBSCAN



Background theory

1. Take a point and draw a circle around it with epsilon as the radius.
2. If the number of points inside the circle is greater than or equal to MinPoints, then the
point is considered a core point.
3. If a point doesn’t satisfy the MinPoints condition but has at least one core point inside
its circle, then it becomes a border point.
4. If both the above conditions fail, then the point becomes a noise point.
5. Only core and border points are considered when forming a cluster. Noise points are never
taken into consideration.
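
The core, border and noise rules can be sketched in plain Python as below. Two choices here are assumptions for the example: a point counts as lying inside its own circle, and a border point that is near two clusters is given to whichever cluster reaches it first. The function and parameter names are illustrative.

```python
import math


def dbscan(points, eps, min_pts):
    """Return (labels, is_core); noise points get the label -1."""
    n = len(points)
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] is not None:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:      # only core points keep expanding
                        stack.append(q)
        cluster += 1
    return [-1 if l is None else l for l in labels], is_core


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 0),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels, is_core = dbscan(points, eps=1.5, min_pts=4)
```

In this example the point (2.2, 0) is a border point (it is close to a core point but has too few neighbours of its own) and (5, 5) is noise.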



Result analysis



CLIQUE



Background theory

1. Partition the data space and find the number of points that lie inside each cell of the
partition.
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters:
a. Determine dense units in all subspaces of interest.
b. Determine connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters:
a. Determine maximal regions that cover a cluster of connected dense units for each
cluster.
b. Determine the minimal cover for each cluster.
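
Steps 1 and 3 can be illustrated with a heavily simplified sketch that works only in the full data space: it counts points per grid cell, keeps the dense cells, and groups dense cells that share a face. It does not perform the Apriori-based subspace search of step 2 or the minimal description of step 4. The names `cell_size` and `threshold` are assumptions for the example.

```python
from collections import Counter


def dense_cells(points, cell_size, threshold):
    """Grid cells (as integer index tuples) holding at least threshold points."""
    counts = Counter(tuple(int(c // cell_size) for c in p) for p in points)
    return {cell for cell, n in counts.items() if n >= threshold}


def connected_dense_units(dense):
    """Group dense cells that touch along a face into clusters."""
    remaining = set(dense)
    groups = []
    while remaining:
        start = remaining.pop()
        group, stack = {start}, [start]
        while stack:
            cell = stack.pop()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    # Neighbouring cell one step away along this axis.
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        group.add(nb)
                        stack.append(nb)
        groups.append(group)
    return groups


points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.4),
          (5.5, 5.5), (5.6, 5.7), (9.0, 0.0)]
dense = dense_cells(points, cell_size=1.0, threshold=2)
groups = connected_dense_units(dense)
```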



Result analysis



FUZZY



Background theory

1. Fuzzy clustering is an extension of k-means, which is one of the partitioning methods.
2. In fuzzy clustering each data point can belong to more than one cluster: each data
point has a degree of membership in each cluster. The main advantage of fuzzy clustering
is that it yields much more detailed information on the structure of the data.
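
To make "degree of membership" concrete, the sketch below computes memberships for one point given fixed cluster centres, using the membership formula of the fuzzy c-means algorithm. The slides do not name a specific fuzzy algorithm, so fuzzy c-means and the fuzzifier value m = 2 are assumptions here, as is the function name.

```python
import math


def fuzzy_memberships(point, centres, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    The memberships lie between 0 and 1 and sum to 1; closer centres
    receive higher membership. m > 1 controls how soft the split is.
    """
    dists = [math.dist(point, c) for c in centres]
    if 0.0 in dists:
        # The point sits exactly on a centre: full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# A point at distance 1 from the first centre and 3 from the second.
u = fuzzy_memberships((1, 0), [(0, 0), (4, 0)])
```

Here the point belongs mostly to the first cluster but keeps a small membership in the second, which is the extra information a hard assignment such as k-means discards.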



Result analysis



LISA



Background theory

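Moran’s “I” Statistic:

$$I = \frac{N \sum_{i} \sum_{j} W_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\left(\sum_{i} \sum_{j} W_{ij}\right) \sum_{i} (X_i - \bar{X})^2}$$

N is the number of cases
Xi is the value of a variable at a particular location
Xj is the value of the same variable at another location (where i ≠ j)
X̄ is the mean of the variable
Wij is a weight applied to the comparison between location i and location j, with
Wij = 1/dij, where dij is the distance between location i and location j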
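
A minimal sketch of the statistic, assuming the standard global form of Moran's I with the inverse-distance weights Wij = 1/dij described above. The function name and the toy values are illustrative. The locations must be distinct and the values must not all be equal, otherwise a division by zero occurs.

```python
import math


def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights W_ij = 1 / d_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_cross = 0.0   # sum of W_ij * (X_i - mean) * (X_j - mean)
    weight_sum = 0.0       # sum of W_ij over all i != j
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            weight_sum += w
            weighted_cross += w * dev[i] * dev[j]
    return (n / weight_sum) * (weighted_cross / sum(d * d for d in dev))


coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
grouped = morans_i(coords, [1, 1, 5, 5])      # similar values sit side by side
alternating = morans_i(coords, [1, 5, 1, 5])  # neighbours always differ
```

The grouped arrangement scores higher than the alternating one, reflecting that higher values of I indicate similar values lying closer together.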



Result analysis



Comparison



Result analysis

Sl. No.   Algorithm   Jaccard Index (%)   Rand Index (%)
1         K-means     26.5                68.5
2         K-medoids   23                  68.3
3         AGNES       29                  69.8
4         Fuzzy       28.12               68.7
5         DBSCAN      30.4                71.3



Conclusion and future work
Partitioning methods like k-means and k-medoids are more useful for applications like
facility allocation, where the objective is not to find natural clusters but to minimize the
sum of distances from the data objects to their cluster centres.

The AGNES algorithm fixes the membership of a data object once it has been allocated to a
cluster.

Instead of using distance to judge the membership of a data object, density-based
clustering algorithms like DBSCAN make use of the density of data points within a region
to discover clusters. DBSCAN loses efficiency in high-dimensional
clustering.

To increase the efficiency of clustering, grid-based clustering methods approximate the
dense regions of the clustering space by quantizing it into a finite number of cells and
identifying cells that contain more than a threshold number of points as dense. A grid-based
approach is usually more efficient than a density-based approach.

To conclude, the hierarchical clustering methods are similar in performance but take
more time than the others. Partition-based clustering
methods like the k-means and k-medoids algorithms do not handle irregularly
shaped clusters well. The density-based and grid-based methods are more suitable
for handling spatial data, and when time complexity is considered, grid-based methods are
preferable.

The problem with LISA is that it requires the frequency of events associated with each data
point, so it is not suitable for point data, where each crime is reported individually and
the count at each data point is therefore one. From the research papers we concluded that
fuzzy clustering is the best when dealing with point data. Fuzzy clustering also shows decent
results on aggregated data.



References
[1] Neethu C V and Subu Surendra, “Review of Spatial Clustering Methods”,
SCT College of Engineering, Trivandrum, India, 2013, 24.
[2] S. Sivaranjani, S. Sivakumari and Aasha M., “Crime Prediction and Forecasting
in Tamilnadu using Clustering Approaches”, Avinashilingam University, Coimbatore,
India, 2016, 6.
[3] Tony H. Grubesic, “On the Application of Fuzzy Clustering for Crime Hot Spot
Detection”.
[4] Wei Luo, Michael Steptoe, Zheng Chang, Robert Link, Leon Clarke and Ross
Maciejewski, “Impact of Spatial Scales on the Intercomparison of Climate Scenarios”.
