Clustering
• Clustering, or cluster analysis, is a machine learning technique that groups the data
points of an unlabelled dataset.
• It can be defined as a way of grouping data points into clusters of similar data points:
objects that are similar to one another stay in one group, which has little or no
similarity to the other groups.
• It does this by finding similar patterns in the unlabelled dataset, such as shape, size,
color, or behavior, and divides the data points according to the presence or absence of
those patterns.
• It is an unsupervised learning method: no supervision is provided to the algorithm,
and it works with an unlabelled dataset.
• After clustering is applied, each cluster or group is given a cluster ID. An ML system
can use this ID to simplify the processing of large and complex datasets.
• The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of a
shopping mall:
• When we visit a shopping mall, we can observe that items with similar uses are
grouped together.
• For example, t-shirts are in one section and trousers in another; similarly, in the
fruit section, apples, bananas, mangoes, etc. are kept in separate groups so that we
can find things easily.
• The clustering technique works in the same way. Another example of clustering is
grouping documents by topic.
• Clustering is widely used across many tasks. Some of the most common uses of this
technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
• Apart from these general uses, Amazon uses it in its recommendation system to
recommend products based on past product searches.
• Netflix also uses this technique to recommend movies and web series to its users
based on their watch history.
Types of Clustering Methods
• Clustering methods are broadly divided into hard clustering (each data point belongs to
exactly one group) and soft clustering (a data point can belong to more than one group).
• Various other approaches to clustering also exist. Below are the main clustering
methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
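The difference between hard and soft clustering can be seen in how the result is stored. The snippet below is only an illustrative sketch: the numbers are made up, and `harden` is a hypothetical helper name, not part of any library.

```python
# Hard clustering: exactly one cluster ID per data point.
hard_labels = [0, 0, 1]

# Soft clustering: one membership weight per cluster for every data point.
# Illustrative values only; each row sums to 1.
soft_memberships = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
]

def harden(memberships):
    """Turn soft memberships into hard labels by taking the strongest cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

print(harden(soft_memberships))
```

Collapsing soft memberships with `harden` discards the information about how strongly each point belongs to its cluster, which is exactly what soft methods are meant to preserve.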
Partitioning Clustering
• It is a type of clustering that divides the data into
non-hierarchical groups. It is also known as the
centroid-based method.
• The most common example of partitioning
clustering is the K-Means Clustering algorithm.
• In this type, the dataset is divided into k
groups, where k is the number of groups,
specified in advance.
• The cluster centers are placed so that each
data point is closer to the centroid of its own
cluster than to the centroid of any other
cluster.
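The idea behind K-Means can be sketched in a few lines of plain Python. This is a minimal illustration of the usual assign-then-update loop (often called Lloyd's algorithm), not a production implementation. The sample points and the choice of the first k points as initial centroids are assumptions made to keep the example deterministic.

```python
import math

def kmeans(points, k, max_iter=100):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Made-up sample data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Real implementations usually pick the initial centroids more carefully and run the algorithm several times, because the result depends on the starting positions.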
Density-Based Clustering
• The density-based clustering method connects
highly dense areas into clusters, so clusters of
arbitrary shape can form as long as the dense
regions can be connected.
• The algorithm does this by identifying dense
regions in the dataset and connecting adjacent
high-density areas into clusters.
• The dense areas in data space are divided from
each other by sparser areas.
• These algorithms can have difficulty clustering
the data points if the dataset has varying densities
or high dimensionality.
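The text above does not name a specific algorithm. As an illustration, here is a minimal sketch of DBSCAN, one well-known density-based method. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum number of neighbors for a dense region), along with the sample points, are assumptions chosen for this example.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point: 0, 1, ... or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # sparse area; may later be claimed as a border point
            continue
        cluster += 1  # point i is dense: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:  # j is dense too, so keep expanding
                queue.extend(nearby)
    return labels

# Made-up data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is not passed in: it follows from the density parameters, and the isolated point is reported as noise rather than forced into a cluster.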
Distribution Model-Based Clustering
• In the distribution model-based clustering
method, the data is divided based on the
probability that a data point belongs to a
particular distribution.
• The grouping is done by assuming a
distribution for the data, most commonly
the Gaussian distribution.
• An example of this type is the
Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models
(GMM).
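The Expectation-Maximization idea can be sketched for the simplest case: one-dimensional data and a mixture of Gaussians. This is a simplified illustration under stated assumptions, not a full GMM implementation. The data, the initial means, the unit starting variances, and the fixed iteration count are all choices made for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, initial_means, iterations=50):
    """Minimal EM sketch for a 1-D Gaussian mixture."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k      # assumption: start with unit variances
    weights = [1.0 / k] * k    # assumption: start with equal mixing weights
    for _ in range(iterations):
        # E-step: probability (responsibility) of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances, resp

# Made-up data: one group near 1 and one group near 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weights, means, variances, resp = fit_gmm_1d(data, initial_means=(0.0, 6.0))
print(means)
print(resp[0])  # responsibilities for the first point (from the last E-step)
```

Each row of `resp` is a probability distribution over the components, which is what makes this a probabilistic (soft) assignment rather than a hard one.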
Hierarchical Clustering
• Hierarchical clustering can be used as an
alternative to partitioning clustering, as
there is no need to specify in advance the
number of clusters to be created.
• In this technique, the dataset is divided into
clusters to create a tree-like structure, which
is also called a dendrogram.
• Any desired number of clusters can then be
obtained by cutting the tree at the
appropriate level.
• The most common example of this method is
the Agglomerative Hierarchical algorithm.
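The agglomerative (bottom-up) approach can be sketched as follows: start with every point in its own cluster and repeatedly merge the two closest clusters. This is a minimal illustration that assumes single linkage (distance between the closest pair of points) as the merge criterion. The `cut_distance` parameter is a hypothetical name for the level at which the tree is cut, and the sample points are made up.

```python
import math

def agglomerative(points, cut_distance=math.inf):
    """Minimal single-linkage sketch. Returns (clusters, merges) as index lists."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # merge history, bottom-up: this is the dendrogram

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, a, b = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut_distance:  # cutting the tree: stop merging above this level
            break
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, cut_distance=2.0)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Calling `agglomerative(points)` without a cut builds the full tree. Different cut levels then give different numbers of clusters from the same merge history, which is why the number of clusters does not have to be fixed in advance.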
Fuzzy Clustering
• Fuzzy clustering is a soft clustering method in which a data object may belong to more
than one group or cluster.
• Each data point has a set of membership coefficients, which express its degree of
membership in each cluster.
• The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes
also known as the Fuzzy k-means algorithm.
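The membership coefficients can be illustrated with a minimal sketch of Fuzzy C-means using its standard update rules. The fuzzifier `m = 2`, the initial centers, the fixed iteration count, and the sample points are assumptions chosen for this example.

```python
import math

def memberships_for(point, centers, m):
    """Membership of one point in every cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:  # the point sits exactly on a center
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    # Closer centers get larger weights; m controls how fuzzy the result is.
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Minimal Fuzzy C-means sketch. Returns (memberships, centers)."""
    centers = [tuple(c) for c in initial_centers]
    dims = len(points[0])
    for _ in range(iterations):
        u = [memberships_for(p, centers, m) for p in points]
        for c in range(len(centers)):
            # Each center is the mean of all points, weighted by membership**m.
            w = [row[c] ** m for row in u]
            total = sum(w)
            centers[c] = tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / total
                for d in range(dims)
            )
    return [memberships_for(p, centers, m) for p in points], centers

# Made-up data: two groups plus one point halfway between them.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
          (3.0, 3.0)]
memberships, centers = fuzzy_c_means(points, initial_centers=[(0.0, 0.0), (6.0, 6.0)])
for p, row in zip(points, memberships):
    print(p, [round(v, 3) for v in row])
```

Points inside a group get a membership close to 1 for that group, while the point between the groups gets intermediate memberships in both, which a hard method could not express.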