How to Perform Clustering Algorithms in Machine Learning
What are the learning techniques in machine learning?
There are three types of learning techniques in machine learning, based on the dataset and the problem: supervised, unsupervised, and semi-supervised. In the real world, most problems and datasets revolve around supervised and semi-supervised machine learning, because those approaches rely on labelled data. Supervised machine learning involves a fully labelled dataset, which means there are accurate dependent and independent variables for the prediction. Semi-supervised machine learning involves a partially labelled dataset, which means the labels are available only for part of the data. The third kind is unsupervised machine learning, which involves discovering patterns in a given dataset. For example, in sentiment analysis, grouping positive, negative, and neutral reviews based on data from the first 24-48 hours of a movie's release. Unsupervised machine learning is about grouping data based on a condition.
Clustering-based analysis comes under unsupervised machine learning. In the real world, clustering-based analysis is used less often than supervised and semi-supervised learning. In this article, we will look at clustering analysis and how to perform a clustering algorithm in machine learning.
There are two main types of unsupervised learning:
● Clustering
● Association
Clustering:
Clustering is often depicted as areas of density in feature space, where examples from the domain are closer to their own cluster than to neighbouring clusters. Each cluster has a centroid, which is a point in feature space, and it may or may not have a boundary. Clustering helps in problem domains like pattern recognition, pattern discovery, or knowledge discovery (e.g. sentiment analysis). Clustering also has several types, as follows:
● K means clustering
● KNN- K Nearest Neighbor
● Hierarchical clustering
● Principal component analysis
● Singular value decomposition
● Independent component analysis
Exclusive - also known as partitioning, in which data points are grouped in such a way that each point belongs to exactly one cluster. Example: k-means clustering.
Agglomerative - every data point starts as its own cluster, and the iterative union of the two nearest neighbouring clusters reduces the number of clusters. Example: hierarchical clustering.
Overlapping - in this technique, a fuzzy set is used to cluster the data. A fuzzy set is also known as an uncertain set, in which each element has a degree of membership. Based on the fuzzy-set principle, each point can belong to two or more clusters, each with a separate degree of membership.
Probabilistic - in this technique, a probability distribution is used to create the clusters of data points. For example,
● Nvidia RTX GPU
● AMD GPU
● Nvidia GTX GPU
● AMD Fidelity RTX GPU
Here, we can group the items by keywords such as 'Nvidia' vs. 'AMD', or 'RTX' vs. 'GTX'.
Association:
Association rules are used for establishing associations among data objects in a large database. Association is a type of unsupervised machine learning used for discovering interesting relationships between variables in a database.
For example, people who buy new phones tend to also purchase power banks, scratch-proof back cases, and so on.
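As a minimal sketch of this idea (not a full association-rule miner), the support and confidence of a rule like "phone → power bank" can be computed directly from transactions. The item names and purchases below are made up purely for illustration:

```python
# Hypothetical purchase transactions (items are illustrative only)
transactions = [
    {'phone', 'power bank', 'back case'},
    {'phone', 'power bank'},
    {'phone', 'back case'},
    {'headphones'},
    {'phone', 'power bank', 'headphones'},
]

n = len(transactions)
# support(X) = fraction of transactions that contain X
support_phone = sum('phone' in t for t in transactions) / n
support_both = sum({'phone', 'power bank'} <= t for t in transactions) / n
# confidence(phone -> power bank) = support(both) / support(phone)
confidence = support_both / support_phone

print(support_phone, support_both, confidence)  # 0.8 0.6 0.75
```

A confidence of 0.75 means that 75% of customers who bought a phone also bought a power bank, which is the kind of relationship association mining surfaces.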
Types of clustering:
There are six types of clustering algorithms in machine learning, as listed above.
K means clustering:
K-means is an iterative clustering algorithm that refines the clusters on every iteration. The first step is to choose the desired number of clusters, k. The objective is to group the data points into k clusters. A large k value means more clusters with finer granularity, while a low k value means fewer, larger clusters with minimal granularity.
In the k-means algorithm, each group is formed around a centroid, and this centroid pulls the nearest data points together to form the cluster. The centroid acts as the nucleus, surrounded by the nearest data points.
K-means groups points based on feature similarity (note that KNN, despite the similar name, is a separate, supervised algorithm). The process of finding the value of k is called parameter tuning, and it influences the accuracy. The value of k determines the number of clusters. The elbow method, covered later in this article, is a common way to find the value of k.
Hierarchical clustering:
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall
into the following two categories.
Agglomerative hierarchical algorithms − in agglomerative algorithms, each data point starts as its own cluster, and the nearest pairs of clusters are merged step by step (bottom-up approach) until one big cluster, or the desired number of clusters, remains.
Divisive hierarchical algorithms − on the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster, and the clustering process involves dividing (top-down approach) the one big cluster into various smaller clusters.
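As a minimal sketch of the bottom-up (agglomerative) variant, scikit-learn's AgglomerativeClustering can separate two obvious groups of 2-D points. The data here is made up for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (illustrative data)
X = np.array([[0, 0], [0, 1], [1, 0],
              [9, 9], [9, 10], [10, 9]])

# Bottom-up: every point starts as its own cluster, and the nearest
# clusters are merged until only n_clusters remain
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)
```

The first three points end up in one cluster and the last three in the other, mirroring the merge process described above.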
Singular value decomposition:
Singular value decomposition (SVD) factorizes an m×n matrix A as A = U Σ V^T, where U is an m×m orthogonal matrix, Σ is an m×n diagonal matrix of singular values, and V is an n×n orthogonal matrix. SVD is widely used for dimensionality reduction and noise filtering before clustering.
Independent component analysis:
Independent Component Analysis (ICA) is a technique in statistics used to detect hidden factors
that exist in datasets of random variables, signals, or measurements. An alternative to principal
component analysis (PCA), Independent Component Analysis helps to divide multivariate
signals into subcomponents that are assumed to be non-Gaussian in nature, and independent
of each other.
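As a small sketch of this idea, scikit-learn's FastICA can recover two synthetic, non-Gaussian source signals from their mixture. The signals and mixing matrix below are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic, non-Gaussian source signals (illustrative)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)              # smooth sinusoid
s2 = np.sign(np.sin(3 * t))     # square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],       # mixing matrix: observed = sources @ A.T
              [0.5, 1.0]])
X = S @ A.T                     # the mixed "measurements" we observe

# FastICA estimates the hidden independent components from the mixture
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)
print(S_est.shape)
```

Each column of S_est approximates one of the original sources (up to sign and scale), which is exactly the "hidden factor" separation ICA is used for.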
Performing K-means clustering:
Before we perform the operation using k-means, it is better to understand one important topic: the elbow method. An important aspect of unsupervised learning is finding the optimal number of clusters. The elbow method is a popular technique for determining the number of clusters by finding the value of 'k'. Now, let us start the analysis.
NOTE: It is always important to scale the data before applying the algorithm
Importing the K-means model with the help of sklearn. The python code is shown below:
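The import might look like this:

```python
# KMeans lives in scikit-learn's cluster module
from sklearn.cluster import KMeans
```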
Matplotlib helps us visualize the elbow method mentioned previously, as well as the final presentation of the clusters.
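The corresponding import:

```python
# pyplot provides the plotting interface used for the graphs below
import matplotlib.pyplot as plt
```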
For explanation purposes, the IRIS dataset is used as an example. The python code for importing the dataset from the local system is shown below.
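The article loads the file from the local system; since that exact file path is unknown, the sketch below falls back to the copy of IRIS bundled with scikit-learn (a local CSV would be read with pandas.read_csv instead):

```python
from sklearn.datasets import load_iris

# A local file would be loaded with pd.read_csv('<path to iris csv>');
# here we use the copy bundled with scikit-learn instead.
df = load_iris(as_frame=True).frame
print(df.shape)  # 150 rows: 4 feature columns plus a 'target' column
```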
This step mostly involves data cleaning. For this example, one of the independent variables is dropped due to its insignificant effect on the dataset. The python code for preprocessing is shown below.
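The article does not say which independent variable is dropped, so removing the sepal-width column below is purely an illustrative assumption:

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
# Which feature to drop is an assumption made for illustration only;
# 'target' is removed because clustering uses no labels
X = df.drop(columns=['sepal width (cm)', 'target'])
print(list(X.columns))
```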
As mentioned, it is better to normalize the data, since it makes the data easier to process. In this example, the reason for normalizing is that the petal length increases from the head to the tail of the dataset. Scaling keeps all the data points within a common range for better visualization.
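One way to do this scaling is with scikit-learn's MinMaxScaler, which maps every feature into the 0-1 range:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris(as_frame=True).frame.drop(columns=['target'])
# MinMaxScaler rescales each feature column into the 0-1 range
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```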
Step: 5 Performing the elbow method to find the 'k' value:
The main step in the clustering algorithm is to find the optimal number of clusters. For this example, as mentioned, the elbow method is used to find the number of clusters. The python code is shown below.
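A typical elbow-method loop fits k-means for a range of k values and plots the inertia (within-cluster sum of squares); the range 1-10 and the output filename are assumptions for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for script use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)

# Inertia (within-cluster sum of squares) for k = 1..10
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel("number of clusters k")
plt.ylabel("inertia (WCSS)")
plt.title("Elbow method")
plt.savefig("elbow.png")
```

The inertia always decreases as k grows; the elbow is the k after which the decrease flattens out.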
Here, as mentioned, the number of clusters is found to be 2. But how do we find the k value here? It is simple: the graph shows where the line "bends", and the k value is read off at that elbow, the point where the curve's decrease slows sharply. In this example, the elbow occurs at 2, so the k value is 2, which means two clusters is the optimal number of clusters for this dataset.
Step: 6 Applying the k-means algorithm:
Now that the number of clusters has been found, the final step is to apply the k-means algorithm and visualize the clustered data. The python code is below.
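A sketch of this final fit-and-plot step, with the feature pair plotted and the output filename chosen for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for script use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)

km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Scatter of the first two scaled features, coloured by cluster,
# with the two centroids marked as red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker='x', s=100, c='red')
plt.savefig("clusters_k2.png")
```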
The scaled data shows that the data points are within the range 0-1, which makes them easy to process and to visualize clearly and neatly.
Step: 7 Output:
Here, one question arises: why k = 2 instead of k = 3 or 4, if we treat the end points as bends? Let us explore that option as well.
Here, the number of clusters is taken as 3, and k-means is applied; the result shows the partition.
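The rerun with three clusters is the same pipeline with only n_clusters changed (a sketch, mirroring the earlier setup):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)
# Identical pipeline to the k=2 run; only n_clusters is changed to 3
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(sorted(set(labels)))  # [0, 1, 2]
```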
Now, comparing the k = 2 and k = 3 outputs, it is noticeable that cluster one (yellow) has been divided into two (yellow and blue), and this is not the optimal clustering. So, it is best not to treat the end points as bends in the elbow method.
FAQs:
1. What are the clustering methods used in machine learning?
● K means clustering
● KNN- K Nearest Neighbor
● Hierarchical clustering
● Principal component analysis
● Singular value decomposition
● Independent component analysis
2. What is unsupervised machine learning?
Unsupervised machine learning deals with the natural discovery of patterns in the dataset.
Conclusion:
Clustering-based analysis is very helpful in increasing the performance of a model and procuring faster results. Other advantages are greater scalability and simplified management. In this article, unsupervised machine learning, the types of clustering, and a live example of one of those types were discussed. The important points are: there is no single best clustering algorithm in machine learning, as every algorithm has its own purpose and advantages; it is important to scale the data before applying the algorithm; and do not treat the end points in the elbow-method graph as bends, as that leads to improper clustering of the data points.