
Unit 5

Unsupervised Machine Learning is a technique that identifies hidden patterns in unlabeled datasets without supervision, contrasting with supervised learning which uses labeled data. It encompasses methods like clustering and association to group data based on similarities, with popular algorithms including K-means and DBSCAN. While it offers advantages such as ease of obtaining unlabeled data, it also faces challenges like less accuracy due to the absence of labeled outputs.

Uploaded by

vkvijaykumarvk22
Copyright
© All Rights Reserved
Unit-5

Unsupervised Machine Learning


In the previous topic, we learned supervised machine learning, in which models
are trained on labeled data. But in many cases we do not have labeled data and
still need to find the hidden patterns in a given dataset. To handle such
cases in machine learning, we need unsupervised learning techniques.
What is Unsupervised Learning?
As the name suggests, unsupervised learning is a machine learning technique in
which models are not supervised using a labeled training dataset. Instead, the
models themselves find the hidden patterns and insights in the given data. It
can be compared to the learning that takes place in the human brain while
learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to
similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The algorithm is
never given labels for this dataset, which means it has no prior idea
about the features that distinguish the images. The task of the unsupervised
learning algorithm is to identify the image features on its own. It performs
this task by clustering the image dataset into groups
according to the similarities between images.

Why use Unsupervised Learning?


Below are some of the main reasons that describe the importance of
Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights in the data.
o Unsupervised learning is similar to the way a human learns to think from
their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes it all the more important.
o In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized
and the corresponding outputs are not given. This unlabeled input data is
fed to the machine learning model in order to train it. The model first
interprets the raw data to find hidden patterns, and then a suitable
algorithm is applied, such as k-means clustering or hierarchical clustering.
Once applied, the algorithm divides the data objects into
groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
Unsupervised learning algorithms can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters
such that objects with the most similarities remain in one group and have few
or no similarities with the objects of another group. Cluster analysis finds
the commonalities between the data objects and categorizes them according to
the presence or absence of those commonalities.
o Association: Association rule learning is an unsupervised learning method
used for finding relationships between variables in a large
database. It determines the sets of items that occur together in the
dataset. Association rules make marketing strategy more effective. For
example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rule learning is Market
Basket Analysis.
Note: We will learn these algorithms in later chapters.
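The bread-and-butter example above can be made concrete with the two basic association measures, support and confidence. The following is a minimal sketch in plain Python; the transactions and the function names `support` and `confidence` are made up for illustration and do not come from any particular library.

```python
# Toy market-basket data; every transaction here is invented for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

bread_butter_support = support({"bread", "butter"}, transactions)
bread_to_butter_conf = confidence({"bread"}, {"butter"}, transactions)
print(bread_butter_support, bread_to_butter_conf)
```

In this toy data, three of the five baskets contain both bread and butter (support 3/5), and three of the four baskets with bread also contain butter (confidence 3/4), which is the kind of rule Market Basket Analysis looks for.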
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors; usually a supervised method, although nearest-neighbor search is also used on unlabeled data)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks than
supervised learning because it can work without
labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than supervised
learning as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate
as input data is not labeled, and algorithms do not know the exact output
in advance.

Clustering

Clustering in Machine Learning


Clustering or cluster analysis is a machine learning technique that groups an
unlabeled dataset. It can be defined as "A way of grouping the data points
into different clusters, consisting of similar data points. The objects
with possible similarities remain in a group that has few or no
similarities with another group."
It does this by finding similar patterns in the unlabeled dataset, such as
shape, size, color, or behavior, and divides the data according to the presence
or absence of those patterns.
It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.
After applying a clustering technique, each cluster or group is given a
cluster ID. An ML system can use this ID to simplify the processing of large
and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhat similar to classification, but the difference is the
type of dataset that we are using. In classification, we work with a labeled dataset,
whereas in clustering, we work with an unlabeled dataset.
Example: Let's understand the clustering technique with the real-world example
of a mall. When we visit a shopping mall, we can observe that things with
similar usage are grouped together. T-shirts are grouped in one
section and trousers in another; similarly, in the fruit and vegetable section,
apples, bananas, mangoes, etc., are kept in separate groups so that we can
easily find things. The clustering technique works in the same way.
Another example of clustering is grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, it is used by Amazon in its
recommendation system to provide recommendations based on past product
searches. Netflix also uses this technique to recommend movies and web
series to its users based on their watch history.
The diagram below explains the working of the clustering algorithm. We can see
that the different fruits are divided into several groups with similar properties.

Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (each data
point belongs to only one group) and Soft clustering (a data point can belong
to more than one group). Various other approaches to clustering also
exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is
also known as the centroid-based method. The most common example of
partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K is the
number of groups, fixed in advance. The cluster centers are placed in such a
way that each data point is closer to the centroid of its own cluster than
to the centroid of any other cluster.

Density-Based Clustering
The density-based clustering method connects highly dense areas into
clusters, and arbitrarily shaped clusters can be formed as long as the dense
regions can be connected. The algorithm identifies the dense regions
in the dataset and joins neighboring areas of high density into clusters. The
dense areas in the data space are separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset
has varying densities or high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on
the probability that a data point belongs to a particular distribution. The
grouping is done by assuming some distribution, most commonly the Gaussian
distribution.
An example of this type is the Expectation-Maximization Clustering
algorithm, which uses Gaussian Mixture Models (GMM).

Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning
clustering, as there is no need to pre-specify the number of clusters to
be created. In this technique, the dataset is divided into clusters that form a
tree-like structure, which is also called a dendrogram. Any
number of clusters can then be obtained by cutting the tree at the appropriate
level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to
more than one group or cluster. Each data object has a set of membership
coefficients, which express its degree of membership in each
cluster. The Fuzzy C-means algorithm is an example of this type of clustering;
it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided according to the models
explained above. Many clustering algorithms have been published,
but only a few are commonly used. The choice of algorithm depends on the kind
of data that we are using. For example, some algorithms need the number
of clusters in the given dataset to be guessed in advance, whereas others need
to find the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely
used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It partitions the dataset by dividing the samples into
different clusters of roughly equal variance. The number of clusters must be
specified in advance. It is fast and needs relatively few computations: each
iteration is linear in the number of samples, O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas
in a smooth density of data points. It is an example of a centroid-based
model that works by updating each centroid candidate to be the center
of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering
of Applications with Noise. It is an example of a density-based model
similar to the mean-shift, but with some remarkable advantages. In this
algorithm, the areas of high density are separated by the areas of low
density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can
be used as an alternative to the k-means algorithm, or in those cases
where K-means fails. In GMM, it is assumed that the data points
are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each
data point is treated as a single cluster at the outset and then successively
merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it
does not require the number of clusters to be specified. Pairs of data points
exchange messages until convergence. Its O(N²T) time complexity, where N is
the number of data points and T the number of iterations, is the main
drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in
Machine Learning:
o In Identification of Cancer Cells: Clustering algorithms are widely
used for the identification of cancerous cells. They divide the cancerous and
non-cancerous data into different groups.
o In Search Engines: Search engines also work on the clustering
technique. The search results appear based on the objects closest to the
search query. This is done by grouping similar data objects in one group that
is far from the dissimilar objects. The accuracy of a query's results
depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment
customers based on their choices and preferences.
o In Biology: It is used in biology to classify different species of
plants and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of
similar land use in a GIS database. This can be very useful for deciding
the purpose for which a particular piece of land should be used, that is, the
purpose for which it is most suitable.

K-Means Clustering

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning and data science. In this topic, we will
learn what the K-means clustering algorithm is and how the algorithm works, along
with the Python implementation of k-means clustering.
What is K-Means Algorithm?
K-Means Clustering is an unsupervised learning algorithm that groups an
unlabeled dataset into different clusters. Here K defines the number of
pre-defined clusters that need to be created in the process: if K=2, there will
be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a
way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in the unlabeled dataset on its own, without
the need for any labeled training data.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of squared distances
between the data points and the centroids of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into K
clusters, and repeats the process until the cluster assignments stop
changing. The value of K must be chosen in advance.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best positions for the K center points or centroids by an
iterative process.
o Assigns each data point to its closest centroid. The data points that are
nearest to a particular centroid form a cluster.
Hence each cluster contains data points with some commonalities, and it is away
from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not come
from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the
predefined K clusters.
Step-4: Compute the mean of the points in each cluster and place the cluster's new centroid there.
Step-5: Repeat the third step, that is, reassign each data point to the new
closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
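The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a production implementation: the function name `kmeans`, the toy points, and the choice to draw the initial centroids from the data itself are assumptions made here, and the new centroid of a cluster is taken to be the mean of its points.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Two visibly separated groups of toy points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the result depends on the random initial centroids, the cluster numbering (which group is 0 and which is 1) may change with the seed, even when the grouping itself is the same.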
Let's understand the above steps by considering the visual plots:
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two
variables is given below:

o Let's take the number of clusters K=2, to partition the dataset and put
the points into different clusters. It means here we will try to group the
data points into two different clusters.
o We need to choose K random points or centroids to form the clusters.
These points can be either points from the dataset or any other points.
So, here we are selecting the two points below as the K points, which are not
part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point
or centroid. We compute this using the usual formula for
the distance between two points. To visualize it, we
draw the median line between the two centroids, that is, the perpendicular
bisector of the segment joining them. Consider the below image:
From the above image, it is clear that the points on the left side of the line
are nearer to K1, the blue centroid, and the points to the right of the line
are closer to the yellow centroid. Let's color them blue and yellow for clear
visualization.

o As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we will compute
the center of gravity of the points in each cluster, and will place the new
centroids as below:

o Next, we will reassign each data point to the new closest centroid. For this,
we will repeat the same process of finding a median line. The median will be
like the below image:
From the above image, we can see that one yellow point is on the left side of
the line, and two blue points are to the right of the line. So, these three
points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is
finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the points in
each cluster, so the new centroids will be as shown in the below image:

o As we have new centroids, we will again draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:

As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends on how well
separated and compact the clusters it forms are. But choosing the optimal number
of clusters is a big task. There are several ways to find the optimal number of
clusters, but here we discuss one of the most widely used methods to find the
number of clusters, or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of the WCSS value. WCSS stands for Within
Cluster Sum of Squares, which measures the total variation within the clusters.
The formula to calculate the value of WCSS (for 3 clusters) is given below:

\[ \mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster}_1} \operatorname{distance}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} \operatorname{distance}(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster}_3} \operatorname{distance}(P_i, C_3)^2 \]
In the above formula of WCSS,
\( \sum_{P_i \in \mathrm{Cluster}_1} \operatorname{distance}(P_i, C_1)^2 \) is the sum of the squares of the distances between
each data point in Cluster 1 and its centroid C1, and likewise for the other
two terms.
To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
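As a small worked example of the WCSS formula, the sketch below computes it for a hand-made clustering using Euclidean distance. The function name `wcss` and the toy points, labels and centroids are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Two clusters of two points each; every point is exactly 1 unit from its centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]
total = wcss(points, labels, centroids)
print(total)
```

Each of the four points contributes a squared distance of 1, so the WCSS of this clustering is 4.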
To find the optimal number of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different values of K
(for example, from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters
K.
o The point where the curve bends sharply, like the elbow of an arm, is
considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is
known as the elbow method. The graph for the elbow method looks like the
below image:
Density based Clustering

Density-based clustering
Density-based clustering refers to a method that is based on local cluster
criterion, such as density connected points.
What is Density-based clustering?
Density-based clustering is one of the most popular unsupervised
learning methodologies used in model building and machine learning.
Data points that lie in the low-density regions separating clusters are
considered noise. The surroundings of a given object within a radius ε are
known as the ε-neighborhood of the object. If the ε-neighborhood of the object
contains at least a minimum number, MinPts, of objects, then the object is
called a core object.
Density-Based Clustering - Background
Density-based clustering is governed by two parameters:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points required in the Eps-neighborhood
of a point.
For a dataset D, the Eps-neighborhood of a point i is
\[ N_{Eps}(i) = \{\, k \in D : \operatorname{dist}(i, k) \le Eps \,\} \]
Directly density reachable:
A point i is directly density reachable from a point k with
respect to Eps and MinPts if
\( i \in N_{Eps}(k) \)
and k satisfies the core point condition:
\( |N_{Eps}(k)| \ge MinPts \)

Density reachable:
A point i is density reachable from a point j with respect to Eps and
MinPts if there is a chain of points \( i_1, \dots, i_n \) with \( i_1 = j \) and \( i_n = i \) such that
\( i_{m+1} \) is directly density reachable from \( i_m \) for every m from 1 to n-1.
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if
there is a point o such that both i and j are density reachable from
o with respect to Eps and MinPts.
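The first of these definitions translates almost directly into code. The sketch below is illustrative only: the function names and the toy points are assumptions made here, Euclidean distance is assumed for dist, and a point is counted as a member of its own neighborhood.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices k with dist(points[i], points[k]) <= eps (the point itself included)."""
    return [k for k, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """Core point condition: the Eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, i, k, eps, min_pts):
    """True if point i lies in the Eps-neighborhood of k and k is a core point."""
    return i in eps_neighborhood(points, k, eps) and is_core(points, k, eps, min_pts)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

print(eps_neighborhood(points, 0, eps))                       # neighbors of point 0
print(directly_density_reachable(points, 1, 0, eps, min_pts))  # 1 from 0
print(directly_density_reachable(points, 0, 1, eps, min_pts))  # 0 from 1
```

With these toy values, point 0 is a core point (its neighborhood holds points 0, 1 and 2), so point 1 is directly density reachable from point 0. The reverse does not hold, because point 1 has only two points in its neighborhood and is not a core point. Direct density reachability is therefore not symmetric.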

Working of Density-Based Clustering


Suppose a set of objects is denoted by D'. We can say that an object i is
directly density reachable from an object j only if it is located within the
ε-neighborhood of j, and j is a core object.
An object i is density reachable from an object j with respect to ε and MinPts
in a given set of objects D' only if there is a chain of objects \( i_1, \dots, i_n \)
with \( i_1 = j \) and \( i_n = i \) such that \( i_{m+1} \) is directly density reachable from \( i_m \) with respect to ε
and MinPts.
An object i is density connected to an object j with respect to ε and MinPts in
a given set of objects D' only if there is an object o in D' such that both i
and j are density reachable from o with respect to ε and MinPts.
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
o It needs only one scan of the data.
o It requires density parameters as a termination condition.
o It can handle noise in the data.
o Density-based clustering is used to identify clusters of arbitrary shape.
Density-Based Clustering Methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It
relies on a density-based notion of a cluster. It identifies clusters of
arbitrary shape in a spatial database that contains outliers.
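To show how the core-object and reachability ideas combine into a clustering procedure, here is a minimal DBSCAN sketch in plain Python. It is a simplified illustration rather than a reference implementation: the function name `dbscan`, the `NOISE` marker, the brute-force neighbor search and the toy points are all assumptions made here, and Euclidean distance is assumed.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: one cluster id per point, NOISE for outliers."""
    def neighbors(i):
        return [k for k, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [k for k in seeds if k != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy run the two dense squares of points become two clusters and the isolated point at (5, 5) is labeled as noise. Note that the number of clusters was never specified; it follows from Eps and MinPts.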

OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It
produces an ordering of the database with respect to its density-based
clustering structure. This cluster ordering contains information equivalent to
the density-based clusterings obtained over a broad range of parameter
settings. OPTICS is useful for both automatic and interactive cluster analysis,
including determining an intrinsic clustering structure.
DENCLUE
DENCLUE (DENsity-based CLUstEring) was proposed by Hinneburg and Keim. It
allows a compact mathematical description of arbitrarily shaped clusters in
high-dimensional data, and it works well for datasets with a large amount of
noise.

Dimensionality Reduction
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is
known as its dimensionality, and the process of reducing these features is called
dimensionality reduction.
In many cases a dataset contains a huge number of input features, which
makes the predictive modeling task more complicated. Because it is very difficult
to visualize or make predictions for a training dataset with a high number of
features, dimensionality reduction techniques are needed in such cases.
A dimensionality reduction technique can be defined as "a way of
converting a higher-dimensional dataset into a lower-dimensional
dataset while ensuring that it provides similar information." These techniques
are widely used in machine learning to obtain a better-fitting predictive model
when solving classification and regression problems.
It is commonly used in fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also
be used for data visualization, noise reduction, cluster analysis, etc.

The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice, a problem
commonly known as the curse of dimensionality. As the dimensionality of the
input dataset increases, any machine learning algorithm and model becomes more
complex. As the number of features increases, the number of samples needed to
generalize well also increases, and so does the chance of overfitting. A
machine learning model trained on high-dimensional data with too few samples
tends to overfit and perform poorly on new data.
Hence, it is often required to reduce the number of features, which can be done
with dimensionality reduction.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given
dataset are given below:
o By reducing the dimensions of the features, the space required to store
the dataset is also reduced.
o Less computation and training time is required with fewer
features.
o A reduced number of features helps in visualizing the data
quickly.
o It removes redundant features (if present) by taking care of
multicollinearity.
Disadvantages of Dimensionality Reduction
There are also some disadvantages of applying dimensionality reduction,
which are given below:
o Some information may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, it is sometimes unclear how
many principal components should be kept.
Approaches of Dimension Reduction
There are two ways to apply the dimension reduction technique, which are given
below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features
and leaving out the irrelevant features present in a dataset to build a model of
high accuracy. In other words, it is a way of selecting the optimal features from
the input dataset.
Three methods are used for the feature selection:
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a
machine learning model for its evaluation. In this method, some features are fed
to the ML model and its performance is evaluated. The performance decides
whether those features are added or removed to increase the accuracy of the
model. This method is more accurate than the filter method but more complex to
apply. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training
iterations of the machine learning model and evaluate the importance of each
feature. Some common techniques of Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
Feature Extraction:
Feature extraction is the process of transforming a space with many
dimensions into a space with fewer dimensions. This approach is useful when we
want to keep most of the information but use fewer resources while processing
it.
Some common feature extraction techniques are:
a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis
Common techniques of Dimensionality Reduction
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features
with the help of an orthogonal transformation. These new transformed features
are called the Principal Components. It is one of the popular tools used for
exploratory data analysis and predictive modeling.
PCA works by considering the variance along each direction: directions with
high variance are assumed to carry the most information, so keeping only those
directions reduces the dimensionality. Some real-world applications of PCA are
image processing, movie recommendation systems, and optimizing the power
allocation in various communication channels.
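To make the idea of "the direction of highest variance" concrete, the sketch below finds the first principal component of a small dataset in plain Python. It is a rough illustration under stated assumptions: the function names and the toy rows are invented here, and power iteration on the sample covariance matrix is used as a simple stand-in for the eigendecomposition or SVD that numerical libraries typically use.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit direction of largest variance, via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means

def project(rows, direction, means):
    """1-D coordinate of each row along `direction` after centering."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Two correlated features: the second is roughly twice the first.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
direction, means = first_principal_component(rows)
scores = project(rows, direction, means)
print(direction, scores)
```

Because the second feature is roughly twice the first, the recovered direction has a slope close to 2, and the five two-dimensional rows are summarized by five one-dimensional scores along that direction.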
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing
Linear Regression or Logistic Regression model. Below steps are performed in
this technique to reduce the dimensionality or in feature selection:
o In this technique, firstly, all the n variables of the given dataset are taken
to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1
features for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features;
after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the
maximum tolerable error rate, we can define the optimal number of features
required for the machine learning algorithm.
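The steps above can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in for "train the model and measure its performance", and the feature names and `tolerance` value are assumptions chosen only for illustration.

```python
def backward_elimination(features, score_fn, tolerance):
    """Drop one feature at a time while the score loss stays within tolerance."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature.
        trials = [(score_fn([f for f in selected if f != drop]), drop)
                  for drop in selected]
        trial_score, drop = max(trials)
        # Stop when even the least harmful removal costs too much.
        if best_score - trial_score > tolerance:
            break
        selected.remove(drop)
        best_score = trial_score
    return selected

# Hypothetical contribution of each feature to the model score.
USEFUL = {"age": 0.30, "income": 0.25, "noise_a": 0.0, "noise_b": 0.001}

def toy_score(subset):
    return sum(USEFUL[f] for f in subset)

kept = backward_elimination(list(USEFUL), toy_score, tolerance=0.01)
```

In this toy setting the two noise features are removed first, and the loop stops as soon as removing any remaining feature would cost more than the tolerance.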
Forward Feature Selection
Forward feature selection follows the inverse process of the backward
elimination process. It means, in this technique, we don't eliminate the feature;
instead, we will find the best features that can produce the highest increase in
the performance of the model. Below steps are performed in this technique:
o We start with a single feature only, and progressively add one feature
at a time.
o Here we train the model on each candidate feature separately.
o The feature that gives the best performance is selected.
o The process is repeated until adding a feature no longer gives a
significant increase in the performance of the model.
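A minimal sketch of this greedy procedure is shown below. As before, `toy_score` is a hypothetical stand-in for training and evaluating a model, and `min_gain` is an assumed name for the smallest improvement that counts as significant.

```python
def forward_selection(features, score_fn, min_gain):
    """Greedily add the feature that improves the score the most."""
    selected = []
    remaining = list(features)
    best_score = score_fn(selected)
    while remaining:
        # Try adding each remaining feature to the current selection.
        trials = [(score_fn(selected + [f]), f) for f in remaining]
        trial_score, best_feature = max(trials)
        # Stop when the best candidate no longer helps enough.
        if trial_score - best_score < min_gain:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_score
    return selected

# Hypothetical contribution of each feature to the model score.
GAINS = {"age": 0.30, "income": 0.25, "noise_a": 0.0, "noise_b": 0.001}

def toy_score(subset):
    return sum(GAINS[f] for f in subset)

chosen = forward_selection(list(GAINS), toy_score, min_gain=0.01)
```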
Missing Value Ratio
If a variable in a dataset has too many missing values, we drop that variable
because it does not carry much useful information. To perform this, we can set
a threshold level, and if the share of missing values in a variable is higher
than that threshold, we drop that variable. The lower the threshold value, the
more variables are dropped and the more aggressive the reduction.
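The threshold rule can be sketched as below, assuming a dataset stored as a dictionary of columns in which a missing value is represented by `None`. The column names and values are made up for illustration.

```python
def drop_sparse_columns(table, threshold):
    """Keep only columns whose share of missing (None) values is <= threshold."""
    kept = {}
    for name, values in table.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio <= threshold:
            kept[name] = values
    return kept

# Hypothetical dataset: column name -> list of values.
table = {
    "age": [25, 31, None, 40],                   # 25% missing
    "income": [None, None, None, 52000],         # 75% missing
    "city": ["Pune", "Delhi", "Agra", "Goa"],    # 0% missing
}
reduced = drop_sparse_columns(table, threshold=0.5)
```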
Low Variance Filter
Similar to the missing value ratio technique, data columns with few changes in
the data carry less information. Therefore, we calculate the variance of
each variable, and all data columns with variance lower than a given threshold
are dropped, because low-variance features are unlikely to affect the target
variable. Since variance depends on the scale of a variable, the columns are
usually normalized before this filter is applied.
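A minimal sketch of this filter, using the population variance from the Python standard library. The column names, values, and threshold are assumptions for illustration.

```python
from statistics import pvariance

def low_variance_filter(table, threshold):
    """Keep only columns whose variance is above the threshold."""
    return {name: values for name, values in table.items()
            if pvariance(values) > threshold}

# Hypothetical numeric dataset: column name -> list of values.
table = {
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [5.0, 5.0, 5.1, 5.0],
    "varied": [1.0, 4.0, 2.0, 9.0],
}
kept = low_variance_filter(table, threshold=0.01)
```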
High Correlation Filter
High correlation refers to the case when two variables carry approximately
the same information. Such redundancy can degrade the performance of the
model. For each pair of independent numerical variables, we calculate the
correlation coefficient. If its value is higher than a threshold value, we can
remove one of the two variables from the dataset. Of the pair, we usually keep
the variable that shows the higher correlation with the target variable.
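The sketch below applies this idea with the Pearson correlation coefficient. It visits features in order of their correlation with the target and keeps a feature only if it is not too strongly correlated with one already kept. The data and names are hypothetical, and the code assumes that no column is constant (otherwise the correlation is undefined).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def high_correlation_filter(table, target, threshold):
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    # Visit features from most to least correlated with the target.
    ranked = sorted(table, key=lambda name: abs(pearson(table[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(table[name], table[other])) <= threshold
               for other in kept):
            kept.append(name)
    return kept

# Hypothetical data: height is recorded twice, in different units.
table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [7, 9, 6, 10, 8],
}
weight = [50, 58, 66, 75, 83]
kept = high_correlation_filter(table, weight, threshold=0.9)
```

Only one of the two height columns survives, because they carry nearly the same information.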
Random Forest
Random Forest is a popular and very useful algorithm for feature selection in
machine learning. Most implementations provide built-in feature importance
scores, so we do not need to program them separately. In this technique, we
generate a large set of trees against the target variable, and with the help of
usage statistics of each attribute, we find the subset of features.
Many random forest implementations accept only numerical variables, so we need
to convert categorical input data into numeric data using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group
according to the correlation with other variables, it means variables within a
group can have a high correlation between themselves, but they have a low
correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, income
and spend. These two variables have a high correlation, which means people
with a high income spend more, and vice versa. So, such variables are put into a
group, and that group is known as a factor. The number of factors is
smaller than the original dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder,
which is a type of ANN (artificial neural network) whose main aim is to copy
its inputs to its outputs. The input is compressed into a latent-space
representation, and the output is reconstructed from this representation. It
has mainly two parts:
o Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the
latent-space representation.

Reinforcement Learning
What is Reinforcement Learning?
o Reinforcement Learning is a feedback-based Machine learning technique
in which an agent learns to behave in an environment by performing the
actions and seeing the results of actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or penalty.
o In Reinforcement Learning, the agent learns automatically from
feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its
experience only.
o RL solves a specific type of problem where decision making is sequential,
and the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve the
performance by getting the maximum positive rewards.
o The agent learns by trial and error, and based on the
experience, it learns to perform the task in a better way. Hence, we can
say that "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts
with the environment and learns to act within that." How a robotic
dog learns to move its limbs is an example of reinforcement
learning.
o It is a core part of Artificial Intelligence, and many AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program
the agent, as it learns from its own experience without any human
intervention.
o Example: Suppose there is an AI agent present within a maze
environment, and its goal is to find the diamond. The agent interacts with
the environment by performing some actions, and based on those actions,
the state of the agent changes, and it also receives a reward or
penalty as feedback.
o The agent continues doing these three things (take an action, change
state or remain in the same state, and get feedback), and by doing
so, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards
and which actions lead to negative feedback or penalties. As a positive
reward, the agent gets a positive point, and as a penalty, it gets a
negative point.
Association Rule
Association Rule Learning
Association rule learning is a type of unsupervised learning technique that
checks for the dependency of one data item on another data item and maps them
accordingly so that the relationship can be used profitably. It tries to find
interesting relations or associations among the variables of a dataset, using
different rules to discover these relations between variables in the database.
Association rule learning is one of the very important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Market basket analysis is a technique used
by various big retailers to discover the associations between items. We can
understand it by taking the example of a supermarket, where products that are
frequently purchased together are placed together.
For example, if a customer buys bread, they are likely to also buy butter, eggs,
or milk, so these products are stored on the same shelf or close by. Consider
the below diagram:
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
We will understand these algorithms in later chapters.
How does Association Rule Learning work?
Association rule learning works on the concept of an if-then statement, such as
if A then B.
Here the "if" element is called the antecedent, and the "then" statement is
called the consequent. A relationship in which we find an association between
two items is known as single cardinality. Association rule learning is all
about creating rules, and if the number of items increases, then the
cardinality also increases accordingly. So, to measure the associations between
thousands of data items, there are several metrics. These metrics are given
below:
o Support
o Confidence
o Lift
Let's understand each of them:
Support
Support is the frequency of an itemset, or how frequently it appears in the
dataset. It is defined as the fraction of the transactions T that contain the
itemset X. For an itemset X and a total of T transactions, it can be written as:
Supp(X) = Freq(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, that is, how
often the items X and Y occur together in the dataset when the occurrence of X
is already given. It is the ratio of the number of transactions that contain
both X and Y to the number of transactions that contain X:
Confidence(X -> Y) = Freq(X, Y) / Freq(X)
Lift
Lift is the strength of a rule, which can be defined by the formula below:
Lift(X -> Y) = Supp(X, Y) / (Supp(X) × Supp(Y))
It is the ratio of the observed support to the support expected if X and Y
were independent of each other. There are three possible cases:
o Lift = 1: The occurrences of the antecedent and the consequent are
independent of each other.
o Lift > 1: The two itemsets are positively dependent on each other; the
larger the lift, the stronger the dependence.
o Lift < 1: One item is a substitute for the other, which
means the presence of one item has a negative effect on the other.
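The three metrics can be computed directly from a list of transactions. The sketch below assumes the usual definitions (support as a fraction of transactions, confidence as support of both itemsets over support of the antecedent, lift as observed over expected support); the shopping baskets are made up for illustration.

```python
# Hypothetical supermarket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """How often Y appears in transactions that already contain X."""
    return support(x | y) / support(x)

def lift(x, y):
    """Observed support of X and Y together over the support expected
    if X and Y were independent."""
    return support(x | y) / (support(x) * support(y))

bread, butter = {"bread"}, {"butter"}
rule_support = support(bread | butter)      # 3 of 5 transactions
rule_confidence = confidence(bread, butter)
rule_lift = lift(bread, butter)
```

Here the lift of the rule bread -> butter is above 1, which suggests that the two items are bought together more often than independence would predict.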
Types of Association Rule Learning
Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. This algorithm uses
a breadth-first search and a hash tree to count the itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the
products that can be bought together. It can also be used in the healthcare field
to find drug reactions for patients.
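A simplified sketch of the level-wise (breadth-first) search is given below. It counts candidates with a plain scan over the transactions rather than a hash tree, so it illustrates the idea and is not an efficient implementation. The function name, the baskets, and the `min_support` value are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count the support of all size-k candidates (one breadth-first level).
        level = {}
        for candidate in current:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(sub) in level
                          for sub in combinations(c, k - 1))]
    return frequent

# Hypothetical supermarket transactions.
baskets = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
frequent = apriori(baskets, min_support=0.4)
```

With this data the search stops after the second level: the only possible three-item candidate contains the infrequent pair butter and milk, so it is pruned without being counted.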
Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm
uses a depth-first search technique to find frequent itemsets in a transaction
database. It is often faster than the Apriori algorithm.
F-P Growth Algorithm
FP in F-P growth stands for Frequent Pattern; the algorithm is an improved
version of the Apriori Algorithm. It represents the database in the form of a
tree structure known as a frequent pattern tree (FP-tree). The purpose of this
tree is to extract the most frequent patterns.
Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:
o Market Basket Analysis: It is one of the popular examples and
applications of association rule mining. This technique is commonly used
by big retailers to determine the association between items.
o Medical Diagnosis: Association rules can support diagnosis and
treatment, as they help in identifying the probability of illness for a
particular disease.
o Protein Sequence: Association rules help in the synthesis of
artificial proteins.
o It is also used for catalog design, loss-leader analysis, and
many other applications.

Upper Confidence Bound


In Reinforcement learning, the agent or decision-maker generates its training
data by interacting with the world. The agent must learn the consequences of
its actions through trial and error, rather than being explicitly told the correct
action.
Multi-Armed Bandit Problem
In Reinforcement Learning, we use the multi-armed bandit problem to formalize
the notion of decision-making under uncertainty using k-armed bandits. In the
multi-armed bandit problem, a decision-maker or agent chooses between k
different actions and receives a reward based on the action it chooses. The
bandit problem is used to describe fundamental concepts in reinforcement
learning, such as rewards, timesteps, and values.
The picture above represents a slot machine also known as a bandit with two
levers. We assume that each lever has a separate distribution of rewards and
there is at least one lever that generates maximum reward.
The probability distribution for the reward corresponding to each lever is
different and is unknown to the gambler (the decision-maker). Hence, the goal
here is to identify which lever to pull to get the maximum reward after a given
set of trials.
For Example:
Imagine an online advertising trial where an advertiser wants to measure the
click-through rate of three different ads for the same product. Whenever a user
visits the website, the advertiser displays an ad at random. The advertiser then
monitors whether the user clicks on the ad or not. After a while, the advertiser
notices that one ad seems to be working better than the others. The advertiser
must now decide between sticking with the best-performing ad or continuing
with the randomized study.
If the advertiser displays only one ad, then he can no longer collect data on
the other two ads. Perhaps one of the other ads is better and only appears
worse due to chance. If the other two ads are worse, then continuing the study
can affect the click-through rate adversely. This advertising trial exemplifies
decision-making under uncertainty.
In the above example, the role of the agent is played by an advertiser. The
advertiser has to choose between three different actions, to display the first,
second, or third ad. Each ad is an action. Choosing that ad yields some
unknown reward. Finally, the profit of the advertiser after the ad is the reward
that the advertiser receives.
Action-Values:
For the advertiser to decide which action is best, we must define the value of
taking each action. We define these values with the action-value function,
using the language of probability. The value of selecting an action a, written
q*(a), is defined as the expected reward Rt we receive when taking action a
from the possible set of actions, where At denotes the action taken at time t:
q*(a) = E[Rt | At = a]
The goal of the agent is to maximize the expected reward by selecting the
action that has the highest action-value.
Action-value Estimate:
Since the value of selecting an action, q*(a), is not known to the agent, we
use the sample-average method to estimate it:
Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
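The sample-average estimate does not require storing every past reward: it can be maintained incrementally, as in the sketch below. The class name and the sequence of (action, reward) pairs are assumptions made for illustration.

```python
class ActionValueEstimate:
    """Sample-average estimates of the action values, updated incrementally."""

    def __init__(self, n_actions):
        self.counts = [0] * n_actions    # times each action was taken
        self.values = [0.0] * n_actions  # current estimate per action

    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.values[action] += (reward - self.values[action]) / self.counts[action]

    def greedy_action(self):
        """Index of the action with the largest current estimate."""
        return max(range(len(self.values)), key=self.values.__getitem__)

# Hypothetical history: (ad shown, 1 if clicked else 0).
est = ActionValueEstimate(3)
for action, reward in [(0, 1), (0, 0), (1, 1), (2, 0), (1, 1), (0, 0)]:
    est.update(action, reward)
```

After these six observations the estimate for the first ad is the mean of its three rewards, and the second ad is the greedy choice.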

Exploration vs Exploitation:
o Greedy Action: The action that currently has the largest estimated
value. The agent exploits its current knowledge by choosing the greedy
action.
o Non-Greedy Action: Any action that does not have the largest
estimated value. By choosing it, the agent sacrifices immediate reward,
hoping to gain more information about the other actions.
o Exploration: It allows the agent to improve its knowledge about each
action, hopefully leading to a long-term benefit.
o Exploitation: It allows the agent to choose the greedy action to try to
get the most reward for short-term benefit. Purely greedy action
selection can lead to sub-optimal behaviour.
A dilemma occurs between exploration and exploitation because an agent
cannot both explore and exploit in the same step. One way to address the
exploration-exploitation dilemma is the Upper Confidence Bound algorithm.
Upper Confidence Bound Action Selection:
Upper-Confidence Bound action selection uses uncertainty in the action-value
estimates for balancing exploration and exploitation. Since there is inherent
uncertainty in the accuracy of the action-value estimates when we use a
sampled set of rewards, UCB uses the uncertainty in the estimates to drive
exploration:
At = argmax over a of [ Qt(a) + c × sqrt( ln t / Nt(a) ) ]
Qt(a) here represents the current estimate for action a at time t, Nt(a) is the
number of times action a has been selected before time t, and c > 0 controls
the degree of exploration. We select the action that has the highest estimated
action-value plus the upper-confidence bound exploration term.
Q(A) in the above picture represents the current action-value estimate for
action A. The brackets represent a confidence interval around Q*(A) which says
that we are confident that the actual action-value of action A lies somewhere in
this region.
The lower bracket is called the lower bound, and the upper bracket is the upper
bound. The region between the brackets is the confidence interval which
represents the uncertainty in the estimates. If the region is very small, then we
become very certain that the actual value of action A is near our estimated
value. On the other hand, if the region is large, then we become uncertain that
the value of action A is near our estimated value.
The Upper Confidence Bound follows the principle of optimism in the face of
uncertainty which implies that if we are uncertain about an action, we should
optimistically assume that it is the correct action.
For example, let’s say we have the four actions with associated uncertainties
shown in the picture below, and our agent has no idea which is the best action.
According to the UCB algorithm, it will optimistically pick the action that has
the highest upper bound, i.e. A. Either A really has the highest value and the
agent gets the highest reward, or, by taking it, the agent gets to learn about
an action it knows least about.

Let’s assume that after selecting the action A we end up in the state depicted
in the picture below. This time UCB will select the action B, since Q(B) has
the highest upper-confidence bound: its action-value estimate is the
highest, even though its confidence interval is small.

Initially, UCB explores more to systematically reduce uncertainty, but its
exploration reduces over time. As a result, UCB often obtains greater
reward on average than other algorithms such as Epsilon-greedy or Optimistic
Initial Values, although this depends on the problem.
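The behaviour described in this section can be sketched with a small simulation of the advertising example. The code assumes the commonly used selection rule "estimate plus c × sqrt(ln t / N(a))", where N(a) is the number of times action a has been chosen; the click rates, the value of c, and the function name are assumptions for illustration.

```python
import math
import random

def ucb_bandit(click_rates, rounds, c=2.0, seed=0):
    """Simulate UCB action selection on a Bernoulli bandit."""
    rng = random.Random(seed)
    k = len(click_rates)
    counts = [0] * k    # N(a): times each ad has been shown
    values = [0.0] * k  # Q(a): sample-average click estimate per ad
    for t in range(1, rounds + 1):
        if t <= k:
            # Show every ad once so that N(a) > 0 in the bonus term.
            action = t - 1
        else:
            # Pick the ad with the highest upper confidence bound.
            action = max(
                range(k),
                key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]),
            )
        reward = 1 if rng.random() < click_rates[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return counts, values

# Hypothetical true click-through rates, unknown to the agent.
counts, values = ucb_bandit([0.05, 0.10, 0.60], rounds=3000)
```

Ads that have been shown rarely keep a large bonus term, so they are still tried from time to time, while most of the rounds go to the ad with the best estimate.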
Thompson Sampling

Reinforcement Learning is a branch of Machine Learning in which learning often
happens online, while the agent interacts with the environment. It is used to
decide what action to take at t+1 based on data up to time t. This concept is
used in Artificial Intelligence applications such as learning to walk. A
popular example of reinforcement learning is a chess engine. Here, the agent
decides upon a series of moves depending on the state of the board (the
environment), and the reward can be defined as a win or loss at the end of the
game.

Thompson Sampling (Posterior Sampling or Probability Matching) is an
algorithm for choosing actions that addresses the exploration-exploitation
dilemma in the multi-armed bandit problem. Actions are performed several
times to gather information; this is called exploration. It uses training
information that evaluates the actions taken rather than instructing by giving
correct actions. This is what creates the need for active exploration, for an
explicit trial-and-error search for good behavior. Based on the results of
those actions, a reward (1) or penalty (0) is given to the machine for each
action. Further actions are performed in order to maximize the reward, which
may improve future performance. Suppose a robot has to pick several cans and
put them in a container. Each time it puts a can into the container, it
memorizes the steps followed and trains itself to perform the task with better
speed and precision (reward). If the robot is not able to put the can in the
container, it does not memorize that procedure (hence speed and performance do
not improve), and the attempt is considered a penalty.
Thompson Sampling has the advantage of the tendency to decrease the search
as we get more and more information, which mimics the desirable trade-off in
the problem, where we want as much information as possible in fewer
searches. Hence, this Algorithm has a tendency to be more “search-oriented”
when we have fewer data and less “search-oriented” when we have a lot of
data.
Multi-Armed Bandit Problem
Multi-armed Bandit is synonymous with a slot machine with many arms. Each
action selection is like a play of one of the slot machine’s levers, and the
rewards are the payoffs for hitting the jackpot. Through repeated action
selections you are to maximize your winnings by concentrating your actions on
the best levers. Each machine provides a random reward drawn from a probability
distribution specific to that machine, with its own mean reward. Without
knowing these distributions, the gambler has to maximize the sum of rewards
earned through a sequence of arm pulls. If you maintain estimates of the action
values,
then at any time step there is at least one action whose estimated value is
greatest. We call this a greedy action. The analogy to this problem can be
advertisements displayed whenever the user visits a webpage. Arms are ads
displayed to the users each time they connect to a web page. Each time a user
connects to the page makes a round. At each round, we choose one ad to
display to the user. At each round n, ad i gives reward ri(n) ∈ {0, 1}: ri(n) = 1 if
the user clicked on ad i, 0 if the user didn’t. The goal of the algorithm will
be to maximize the reward. Another analogy is that of a doctor choosing
between experimental treatments for a series of seriously ill patients. Each
action selection is a treatment selection, and each reward is the survival or
well-being of the patient.
Algorithm
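A minimal sketch of Thompson Sampling for the ad example with 0/1 rewards is shown below. It assumes the common Beta-Bernoulli formulation, in which each ad keeps a Beta(clicks + 1, misses + 1) distribution over its unknown click rate; the click rates and function name are hypothetical.

```python
import random

def thompson_sampling(click_rates, rounds, seed=0):
    """Simulate Thompson Sampling on a Bernoulli bandit."""
    rng = random.Random(seed)
    k = len(click_rates)
    wins = [0] * k    # rounds in which ad i was shown and clicked (reward 1)
    losses = [0] * k  # rounds in which ad i was shown and not clicked (reward 0)
    for _ in range(rounds):
        # Draw one plausible click rate per ad from its Beta posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        # Show the ad whose sampled click rate is highest.
        ad = samples.index(max(samples))
        if rng.random() < click_rates[ad]:
            wins[ad] += 1
        else:
            losses[ad] += 1
    return wins, losses

# Hypothetical true click-through rates, unknown to the algorithm.
wins, losses = thompson_sampling([0.05, 0.10, 0.60], rounds=2000)
shown = [w + l for w, l in zip(wins, losses)]
```

Early on the posteriors are wide, so every ad has a fair chance of producing the highest sample. As data accumulates, the posteriors narrow and the best ad is shown almost every round, which matches the decreasing search behaviour described above.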

Some Practical Applications


o Netflix Item based recommender systems: Images related to
movies/shows are shown to users in such a way that they are more
likely to watch it.
o Bidding and Stock Exchange: Predicting Stocks based on Current
data of stock prices.
o Traffic Light Control: Predicting the delay in the signal.
o Automation in Industries: Bots and Machines for transporting and
Delivering items without human intervention.
o Robotics: Reinforcement learning is used in robotics for motion
planning, grasping objects, and controlling the robot’s movement. It
enables robots to learn from experience and make decisions based on
their environment.
o Game AI: Reinforcement learning has been used to train AI agents to
play games like Chess, Go, and Poker. It has been used to develop
game bots that can compete against human players.
o Natural Language Processing (NLP): Reinforcement learning is
used in NLP to train chatbots and virtual assistants to provide
personalized responses to users. It enables chatbots to learn from
user interactions and improve their responses over time.
o Advertising: Reinforcement learning is used in advertising to
optimize ad placements and target audiences. It enables advertisers
to learn which ads perform best and adjust their campaigns
accordingly.
o Finance: Reinforcement learning is used in finance for portfolio
management, fraud detection, and risk assessment. It enables
financial
Q Learning

What Is Q-Learning?

Q-Learning is a reinforcement learning algorithm that finds the next best
action, given a current state. While learning, it may choose actions at random
to explore, and it aims to maximize the total reward.

Figure 3: Components of Q-Learning

Q-learning is a model-free, off-policy reinforcement learning algorithm that
finds the best course of action, given the current state of the agent.
Depending on where the agent is in the environment, it decides the next action
to be taken.

The objective of the model is to find the best course of action given its
current state. To do this, it may come up with rules of its own, or it may act
outside the policy it is learning about, for example by taking random
exploratory actions. Because the values it learns do not depend on the policy
used to pick actions, we call it off-policy.

Model-free means that the agent does not build or use a model that predicts the
environment’s response. Instead, it learns directly from the rewards it
receives through trial and error.

An example of Q-learning is an Advertisement recommendation system. In a
normal ad recommendation system, the ads you get are based on your previous
purchases or websites you may have visited. If you’ve bought a TV, you will get
recommended TVs of different brands.

Figure 4: Ad Recommendation System

Using Q-learning, we can optimize the ad recommendation system to
recommend products that are frequently bought together. The reward will be if
the user clicks on the suggested product.

Figure 5: Ad Recommendation System with Q-Learning

Important Terms in Q-Learning

1. States: The State, S, represents the current position of an agent in an
environment.

2. Action: The Action, A, is the step taken by the agent when it is in a
particular state.

3. Rewards: For every action, the agent will get a positive or negative
reward.

4. Episodes: An episode ends when the agent reaches a terminating state and
can’t take a new action.

5. Q-Values: Used to determine how good an Action, A, taken at a
particular state, S, is. Written Q(S, A).

6. Temporal Difference: A formula used to find the Q-Value by using the
value of the current state and action and the previous state and action.

What Is The Bellman Equation?

The Bellman Equation is used to determine the value of a particular state and
deduce how good it is to be in that state. The optimal state will give us the
highest optimal value.

The equation is given below. It uses the current Q-value, the reward associated
with the action taken, and the maximum expected future reward, weighted by a
discount rate that determines its importance relative to the immediate reward,
to compute the updated Q-value. The learning rate determines how fast or slow
the model will be learning.
New Q(S, A) = Q(S, A) + α [ R + γ max Q(S', A') − Q(S, A) ]
Here α is the learning rate, γ is the discount rate, R is the reward, and S' is
the next state.
Figure 6: Bellman Equation
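
A single update step can be sketched as follows, assuming the standard Q-learning form of the rule: the old value is moved towards "reward + discount × best value of the next state" by a fraction given by the learning rate. The state names, action names, and numbers are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Best value the agent could obtain from the next state.
    best_next = max(q_table[next_state].values())
    # Temporal-difference target and error.
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    # Move the old estimate a fraction alpha towards the target.
    q_table[state][action] += alpha * td_error
    return q_table[state][action]

# Hypothetical Q-table: state -> {action: value}.
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")
```

With these numbers the target is 1.0 + 0.9 × 2.0 = 2.8, and a learning rate of 0.1 moves the old value of 0 one tenth of the way there.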

How to Make a Q-Table?

While running our algorithm, we will come across various solutions and the agent
will take multiple paths. How do we find out the best among them? This is done
by tabulating our findings in a table called a Q-Table.

A Q-Table helps us to find the best action for each state in the environment. We
use the Bellman Equation at each state to get the expected future state and
reward and save it in a table to compare with other states.

Let us create a Q-table for an agent that has to learn to run, fetch and sit on
command. The steps taken to construct a Q-table are:

Step 1: Create an initial Q-Table with all values initialized to 0

When we initially start, the values of all states and rewards will be 0. Consider
the Q-Table shown below which shows a dog simulator learning to perform
actions :

Figure 7: Initial Q-Table

Step 2: Choose an action and perform it. Update values in the table
This is the starting point. We have performed no other action as of yet. Let us
say that we want the agent to sit initially, which it does. The table will change to:

Figure 8: Q-Table after performing an action

Step 3: Get the value of the reward and calculate the Q-Value using the
Bellman Equation

For the action performed, we need to calculate the value of the actual reward
and the Q(S, A) value.

Figure 9: Updating Q-Table with Bellman Equation

Step 4: Continue the same until the table is filled or an episode ends

The agent continues taking actions and for each action, the reward and Q-value
are calculated and it updates the table.
Figure 10: Final Q-Table at end of an episode
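
The four steps can be put together in a short training loop. The sketch below does not reproduce the dog simulator from the figures. Instead it assumes a hypothetical corridor of five cells in which the agent starts in cell 0 and receives a reward of 1 on reaching cell 4. The hyperparameters and the epsilon-greedy exploration scheme are also assumptions.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: create a Q-table with every value initialised to 0.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Step 2: choose an action. Explore with probability epsilon
            # (or when both values are tied), otherwise act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + ACTIONS[action], 0), GOAL)
            # Step 3: observe the reward and update the Q-value.
            reward = 1.0 if next_state == GOAL else 0.0
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            # Step 4: continue until the episode ends at the goal.
            state = next_state
    return q

q_table = train()
# Greedy policy read from the table for every non-terminal cell.
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:GOAL]]
```

After training, the largest value in each row of the table points towards the goal, so reading the best action per state from the Q-table gives the learned behaviour.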
