A Comparative Study of Various Algorithms To Detect Clustering in Spatial Data
1. Introduction
3. Methodology
4. Result analysis
6. References
• Cluster analysis is the process of partitioning a set of data objects (or observations)
into clusters, such that objects in a cluster are similar to one another, yet dissimilar to
objects in other clusters.
• Different clustering methods may generate different clusterings on the same data set.
The partitioning is performed by the clustering algorithm. Hence, clustering is useful for
discovering previously unknown groups within the data.
• It is an important part of spatial data mining since it provides certain insights into the
distribution of data and characteristics of spatial clusters.
• Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on
earth, such as natural or constructed features, oceans, and more. Spatial data is
usually stored as coordinates and topology and is data that can be mapped.
Step I: Data Preprocessing
• We are considering the crime data on female rapes from the year 2013 to perform
clustering analysis.
• We are considering each district as an object, but the raw data we are working with
must have the same number of objects and the same object names in order to map
to the shapefile.
• We are going to divide the total number of cases by the total female population and
multiply by 10,000 to standardize the rates, as sketched below.
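A minimal pandas sketch of this standardization step, assuming hypothetical file and column names (district, total_cases, female_population); the actual 2013 data set will differ:

```python
import pandas as pd

# Hypothetical inputs: one row per district in each file.
crime = pd.read_csv("rape_cases_2013.csv")       # district, total_cases
pop = pd.read_csv("female_population.csv")       # district, female_population

# District names must match exactly so the result can later be joined to the shapefile.
df = crime.merge(pop, on="district", how="inner")

# Standardize: cases per 10,000 women in each district.
df["rate_per_10k"] = df["total_cases"] / df["female_population"] * 10000
```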
Jaccard Index:
It is a measure of similarity between two sets of data, J(X,Y) = |X ∩ Y| / |X ∪ Y|, with a
range from 0% to 100%. The higher the percentage, the more similar the two populations.
Jaccard Distance:
It is a measure of how dissimilar two sets are. It is the complement of the Jaccard index
and can be found by subtracting the Jaccard index from 100%:
D(X,Y) = 1 – J(X,Y)
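A small Python sketch of the two measures; the district names in the example are made up purely for illustration:

```python
def jaccard_index(a, b):
    """Jaccard similarity of two sets, expressed as a percentage."""
    a, b = set(a), set(b)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance: the complement of the Jaccard index."""
    return 100.0 - jaccard_index(a, b)

# Example: districts placed in a cluster by two different algorithms.
x = {"Pune", "Nagpur", "Nashik", "Thane"}
y = {"Pune", "Nagpur", "Aurangabad"}
print(jaccard_index(x, y))     # 2 common / 5 total = 40.0
print(jaccard_distance(x, y))  # 60.0
```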
Rand Index:
It is a measure of the similarity between two data clusterings. Given a set S of elements
and two clusterings X and Y of S, define:
• a, the number of pairs of elements in S that are in the same subset in X and in the
same subset in Y.
• b, the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y.
• c, the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y.
• d, the number of pairs of elements in S that are in different subsets in X and in
the same subset in Y.
a + b can be considered the number of agreements between X and Y, and c + d the
number of disagreements between X and Y. The Rand index is then
R = (a + b) / (a + b + c + d)
Since the denominator is the total number of pairs, the Rand index represents the
frequency of agreements over the total number of pairs, or the probability that X
and Y will agree on a randomly chosen pair.
Similarly, one can also view the Rand index as the percentage of correct decisions
made by the algorithm. It can be computed using the following formula:
R = (TP + TN) / (TP + TN + FP + FN)
where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
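The pair-counting definition above translates directly into code. Below is a straightforward O(n²) sketch; if scikit-learn is available, a library routine such as sklearn.metrics.rand_score could be used instead:

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Rand index of two clusterings given as per-object label lists."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            tp += 1          # pair together in both clusterings (agreement)
        elif not same_x and not same_y:
            tn += 1          # pair apart in both clusterings (agreement)
        elif same_x and not same_y:
            fp += 1          # together in X, apart in Y (disagreement)
        else:
            fn += 1          # apart in X, together in Y (disagreement)
    return (tp + tn) / (tp + tn + fp + fn)

# Two clusterings of five objects:
print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8
```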
3. Divisive: This is a top-down approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
4. The AGNES algorithm works by merging the data one by one on the basis of the nearest
distance among all the pairwise distances between the data points; after each merge, the
distances between the resulting clusters are recalculated (a minimal sketch using SciPy
follows).
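A minimal sketch of agglomerative (AGNES-style) clustering using SciPy's hierarchical clustering routines; the input rates are invented for illustration, and average linkage is only one possible choice of distance update:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical input: one standardized rate per district (as a column vector).
rates = np.array([[1.2], [1.4], [0.3], [0.35], [2.8], [2.6]])

# Agglomerative clustering: every point starts as its own cluster and the two
# nearest clusters are merged repeatedly, recomputing distances after each merge.
Z = linkage(rates, method="average")   # 'single' or 'complete' linkage also possible

# Cut the resulting tree into a fixed number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                          # three groups: low, mid and high rates
```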
1. Partition the data space into a grid and count the number of points that lie inside each
cell of the partition (a minimal sketch of this step appears after the list).
2. Identify the subspaces that contain clusters using the Apriori principle
3. Identify clusters
a. Determine dense units in all subspaces of interest
b. Determine connected dense units in all subspaces of interest.
4. Generate minimal description for the clusters
a. Determine maximal regions that cover a cluster of connected dense units for each
cluster
b. Determine the minimal cover for each cluster
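Below is a minimal sketch of the grid-partitioning and dense-cell identification idea behind steps 1 and 3a for 2-D point data; it is not a full implementation of the grid-based algorithm, since the Apriori-based subspace search and the minimal-description steps are omitted, and the bin count and density threshold are arbitrary choices:

```python
import numpy as np

def dense_cells(points, n_bins=10, threshold=5):
    """Partition 2-D space into an n_bins x n_bins grid and return the cells
    that hold more than `threshold` points (the dense cells)."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    # Map every point to the index of the grid cell it falls into.
    cells = np.floor((points - mins) / (maxs - mins + 1e-12) * n_bins).astype(int)
    cells = np.minimum(cells, n_bins - 1)          # keep boundary points in range
    ids, counts = np.unique(cells, axis=0, return_counts=True)
    return ids[counts > threshold]

# Hypothetical spatial points (e.g. longitude/latitude of incidents).
pts = np.random.RandomState(0).rand(500, 2)
print(dense_cells(pts, n_bins=10, threshold=8))
```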
Moran's I is computed as
I = (N / Σi Σj Wij) × [Σi Σj Wij (Xi − X̄)(Xj − X̄)] / [Σi (Xi − X̄)²]
where
N is the number of cases,
Xi is the value of a variable at a particular location,
Xj is the value of the same variable at another location (where i ≠ j),
X̄ is the mean of the variable, and
Wij is a weight applied to the comparison between location i and location j; here
Wij = 1/dij, the inverse of the distance between locations i and j.
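A minimal NumPy sketch of global Moran's I with the inverse-distance weights defined above; the coordinates and rates in the example are invented for illustration:

```python
import numpy as np

def morans_i(x, coords):
    """Global Moran's I with inverse-distance weights W_ij = 1/d_ij."""
    x = np.asarray(x, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(x)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.zeros((n, n))
    off = ~np.eye(n, dtype=bool)
    w[off] = 1.0 / d[off]                  # W_ij = 1/d_ij, with W_ii = 0
    z = x - x.mean()                       # deviations from the mean
    num = (w * np.outer(z, z)).sum()
    return (n / w.sum()) * num / (z ** 2).sum()

# Hypothetical district centroids and standardized rates.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rates = [1.2, 1.3, 1.1, 0.2]
print(morans_i(rates, coords))             # positive value -> spatial clustering
```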
2  K-medoids  23  68.3
3  AGNES      29  69.8
The AGNES algorithm fixes the membership of a data object once it has been allocated to a
cluster.
To increase the efficiency of clustering, grid-based clustering methods approximate the
dense regions of the clustering space by quantizing it into a finite number of cells and
identifying cells that contain more than a threshold number of points as dense. A grid-based
approach is usually more efficient than a density-based approach.
The problem with LISA is that it requires an event frequency associated with each data
point, so it is not suitable for point data where each crime is reported individually and the
count at every data point is therefore one. From the research papers we reviewed, we
concluded that fuzzy clustering is the best choice when dealing with point data; it also
shows decent results on aggregated data.
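As a rough illustration of the fuzzy approach on point data, below is a minimal NumPy sketch of fuzzy c-means, not tied to any particular library; the number of clusters, the fuzzifier m, and the random points are all illustrative choices:

```python
import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centres and the membership matrix u,
    where u[i, k] is the degree to which point i belongs to cluster k."""
    rng = np.random.RandomState(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))         # random fuzzy memberships
    for _ in range(n_iter):
        um = u ** m
        centres = um.T @ X / um.sum(axis=0)[:, None]    # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1) + 1e-12
        u = 1.0 / (d ** (2 / (m - 1)))                  # update memberships
        u /= u.sum(axis=1, keepdims=True)               # normalise each row to 1
    return centres, u

# Hypothetical individual crime locations (point data, one row per incident).
pts = np.random.RandomState(1).rand(200, 2)
centres, u = fuzzy_cmeans(pts, c=3)
hard_labels = u.argmax(axis=1)                          # hard assignment if needed
```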