Unit 5
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to
similarities, and represent the dataset in a compressed format.
Example: Suppose an unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The algorithm has
never been trained on the given dataset, which means it has no prior idea
about the features of the dataset. The task of the unsupervised learning
algorithm is to identify the image features on its own. It performs this task
by clustering the image dataset into groups according to the similarities
between images.
Here, we have unlabeled input data, which means it is not categorized
and no corresponding outputs are given. This unlabeled input data is
fed to the machine learning model in order to train it. First, the model interprets the
raw data to find hidden patterns, and then a suitable algorithm,
such as k-means clustering or hierarchical clustering, is applied.
Once a suitable algorithm is applied, it divides the data objects into
groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters
such that objects with the most similarities remain in one group and have few
or no similarities with the objects of another group. Cluster analysis finds
the commonalities between data objects and categorizes them according to
the presence or absence of those commonalities.
o Association: An association rule is an unsupervised learning method
used for finding relationships between variables in a large
database. It determines the sets of items that occur together in the
dataset. Association rules make marketing strategies more effective; for
example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of an association rule is Market Basket
Analysis.
Note: We will learn these algorithms in later chapters.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks compared to
supervised learning because, in unsupervised learning, we don't have
labeled input data.
o Unsupervised learning is preferable in many settings because unlabeled data
is much easier to obtain than labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than supervised
learning because there is no corresponding output to learn from.
o The result of an unsupervised learning algorithm might be less accurate
because the input data is not labeled, and the algorithm does not know the
exact output in advance.
Clustering
Density-Based Clustering
The density-based clustering method connects highly dense areas into
clusters, and arbitrarily shaped distributions are formed as long as the dense
regions can be connected. The algorithm does this by identifying different clusters
in the dataset and connecting the areas of high density into clusters. The dense
areas in data space are separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset
has varying densities and high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on
the probability that a data point belongs to a particular distribution. The grouping
is done by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
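As a hedged illustration, here is a minimal sketch of EM clustering with a Gaussian Mixture Model using scikit-learn (the synthetic two-blob dataset and n_components=2 are illustrative assumptions):

# Minimal sketch: EM clustering with a Gaussian Mixture Model
# (scikit-learn assumed available).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs in 2-D
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) memberships, one column per component
print(labels[:5], probs[:5].round(2))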
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitional
clustering, as there is no requirement to pre-specify the number of clusters to
be created. In this technique, the dataset is divided into clusters to create a tree-
like structure, which is also called a dendrogram. The desired number of
clusters can then be selected by cutting the tree at the appropriate level. The
most common example of this method is the Agglomerative Hierarchical
algorithm.
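A minimal sketch of bottom-up agglomerative clustering with SciPy (assumed available); cutting the linkage tree at a chosen level plays the role of selecting the number of clusters from the dendrogram:

# Minimal sketch: agglomerative hierarchical clustering, then "cutting the tree".
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

Z = linkage(X, method="ward")                    # build the cluster hierarchy (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)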
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to
more than one group or cluster. Each data point has a set of membership
coefficients that indicate its degree of membership in each
cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is
sometimes also known as the fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided based on the models
explained above. Many different clustering algorithms have been published,
but only a few are commonly used. The choice of algorithm depends on the kind
of data we are using: some algorithms require a guess at the number
of clusters in the given dataset, whereas others require a minimum
distance between observations in the dataset.
Here we discuss the most popular clustering algorithms that are widely
used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variance. The number of clusters must be
specified for this algorithm. It is fast, needs few computations, and has
linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas
in a smooth density of data points. It is an example of a centroid-based
model that works by updating the candidate centroids to be the center
of the points within a given region.
3. DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering
of Applications with Noise. It is an example of a density-based model,
similar to mean-shift but with some notable advantages. In this
algorithm, areas of high density are separated by areas of low
density; because of this, clusters can be found in arbitrary shapes.
4. Expectation-Maximization Clustering using GMM: This algorithm can
be used as an alternative to the k-means algorithm, or for cases
where k-means fails. In GMM, the data points are assumed to be
Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical
algorithm performs bottom-up hierarchical clustering. Each
data point is treated as a single cluster at the outset, and clusters are then
successively merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: This algorithm differs from the others in that it
does not require the number of clusters to be specified. Each pair of data
points exchanges messages until convergence. Its
O(N²T) time complexity is the main drawback of this algorithm. (Mean-shift
and affinity propagation, which both discover the number of clusters
themselves, are illustrated in the sketch after this list.)
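As a hedged sketch of the point that mean-shift and affinity propagation discover the number of clusters themselves (unlike k-means), using scikit-learn with an illustrative synthetic dataset:

# Minimal sketch: clustering without pre-specifying the number of clusters.
import numpy as np
from sklearn.cluster import MeanShift, AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (30, 2)), rng.normal(3, 0.4, (30, 2))])

ms_labels = MeanShift().fit_predict(X)
ap_labels = AffinityPropagation(random_state=0).fit_predict(X)
print(len(set(ms_labels)), "clusters found by mean-shift")
print(len(set(ap_labels)), "clusters found by affinity propagation")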
Applications of Clustering
Below are some commonly known applications of the clustering technique in
machine learning:
o In Identification of Cancer Cells: Clustering algorithms are widely
used for the identification of cancerous cells. They divide cancerous and
non-cancerous data points into different groups.
o In Search Engines: Search engines also work on the clustering
technique. Search results appear based on the objects closest to the
search query; this is done by grouping similar data objects into one group that
is far from the dissimilar objects. The accuracy of a query's results
depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment
customers based on their choices and preferences.
o In Biology: It is used in biology to classify different species of
plants and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of
similar land use in a GIS database. This is very useful for determining
the purpose for which a particular piece of land is most suitable.
K-Means Clustering
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the
need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the
data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides it into k
clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best values for the K center points or centroids by an iterative
process.
o Assigns each data point to its closest k-center. The data points near a
particular k-center form a cluster.
Hence each cluster has data points with some commonalities and is away from
other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put
it into different clusters. It means we will try to group these
data points into two different clusters.
o We need to choose k random points or centroids to form the clusters.
These points can be points from the dataset or any other points.
Here we select the below two points as k points, which are not
part of our dataset. Consider the below image:
o Now we assign each data point of the scatter plot to its closest k-point
or centroid. We compute this using the usual mathematics for calculating
the distance between two points, and we draw a median line between the
two centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are nearer to the K1
or blue centroid, and points to the right of the line are closer to the yellow
centroid. Let's color them blue and yellow for clear visualization.
o Next, we reassign each data point to its new closest centroid. For this, we
repeat the same process of finding a median line. The median will be as in the
below image:
From the above image, we can see that one yellow point is on the left side of the
line, and two blue points are to the right of the line. So, these three points will be
assigned to new centroids.
o As we have new centroids, we again draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:
As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
To choose the optimal number of clusters, the elbow method uses the WCSS (Within-Cluster Sum of Squares) value, which defines the total variation within the clusters. For K = 3 clusters, it is:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
In the above formula of WCSS,
∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances between
each data point and its centroid within cluster 1, and the same holds for the other two
terms.
To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
o It executes K-means clustering on a given dataset for different K
values (ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters
K.
o The sharp point of bend, where the plot looks like an arm, is
considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is
known as the elbow method. The graph for the elbow method looks like the
below image:
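A minimal, hedged sketch of K-means together with the elbow method, using scikit-learn (the three-blob dataset and the K range 1-10 are illustrative; scikit-learn's inertia_ attribute is its name for WCSS):

# Minimal sketch: K-means plus the elbow method.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(4, 0.5, (50, 2)),
               rng.normal(8, 0.5, (50, 2))])

# Elbow method: compute WCSS for K = 1..10 and look for the sharp bend
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
print([round(w, 1) for w in wcss])

# Fit the final model at the chosen K and read off the cluster assignments
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)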
Density-Based Clustering
Density-based clustering refers to a method that is based on a local cluster
criterion, such as density-connected points.
What is Density-based clustering?
Density-based clustering is one of the most popular unsupervised
learning methodologies used in model building and machine learning algorithms.
Data points in the region separating two clusters of low point density are
considered noise. The surroundings within a radius ε of a given object are
known as the ε-neighborhood of the object. If the ε-neighborhood of an object
contains at least a minimum number of objects, MinPts, then the object is called a core
object.
Density-Based Clustering - Background
There are two parameters used to calculate density-based clustering:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points in the Eps-neighborhood of a point.
The Eps-neighborhood of a point i in dataset D is defined as:
NEps(i) = { k ∈ D | dist(i, k) ≤ Eps }
Directly density reachable:
A point i is directly density-reachable from a point k with
respect to Eps and MinPts if
i belongs to NEps(k), and
k satisfies the core point condition:
|NEps(k)| ≥ MinPts
Density reachable:
A point i is density-reachable from a point j with respect to Eps and
MinPts if there is a chain of points i1, ..., in with i1 = j and in = i such that
each point in the chain is directly density-reachable from the previous one.
Density connected:
A point i is density-connected to a point j with respect to Eps and MinPts if
there is a point o such that both i and j are density-reachable from
o with respect to Eps and MinPts.
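These definitions map directly onto the two parameters of scikit-learn's DBSCAN implementation (eps for Eps, min_samples for MinPts); a minimal, hedged sketch with illustrative data and parameter values:

# Minimal sketch: DBSCAN; points in no dense region are labeled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               [[10.0, 10.0]]])           # one isolated point, likely noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                        # e.g. {0, 1, -1}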
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an
ordering of the database with respect to its density-based clustering
structure. This cluster ordering contains information equivalent to the
density-based clusterings obtained from a wide range of parameter settings. OPTICS
methods are useful for both automatic and interactive cluster analysis,
including determining an intrinsic clustering structure.
DENCLUE
DENCLUE is a density-based clustering method by Hinneburg and Keim. It enables a compact
mathematical description of arbitrarily shaped clusters in high-dimensional
data, and it works well for datasets with a large amount of noise.
Dimensionality Reduction
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is
known as dimensionality, and the process to reduce these features is called
dimensionality reduction.
In many cases a dataset contains a huge number of input features, which
makes the predictive modeling task more complicated. Because it is very difficult
to visualize or make predictions for a training dataset with a high number of
features, dimensionality reduction techniques are required in such cases.
A dimensionality reduction technique can be defined as "a way of
converting a higher-dimensional dataset into a lower-dimensional
dataset while ensuring that it provides similar information." These techniques
are widely used in machine learning for obtaining a better-fitting predictive model
while solving classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also
be used for data visualization, noise reduction, cluster analysis, etc.
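A minimal, hedged sketch of dimensionality reduction with PCA in scikit-learn (the 10-feature random dataset and the choice of 2 components are illustrative):

# Minimal sketch: reduce a 10-dimensional dataset to 2 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)            # the same samples in 2 dimensions
print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # information carried by each kept axis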
Reinforcement Learning
What is Reinforcement Learning?
o Reinforcement Learning is a feedback-based machine learning technique
in which an agent learns to behave in an environment by performing
actions and seeing the results of those actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or a penalty.
o In reinforcement learning, the agent learns automatically using
feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its
experience only.
o RL solves a specific type of problem where decision making is sequential,
and the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve its
performance by obtaining the maximum positive reward.
o The agent learns through trial and error, and based on its
experience, it learns to perform the task in a better way. Hence, we can
say that "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts
with the environment and learns to act within it." How a robotic
dog learns the movement of its arms is an example of reinforcement
learning.
o It is a core part of artificial intelligence, and all AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program
the agent, as it learns from its own experience without any human
intervention.
o Example: Suppose there is an AI agent within a maze
environment, and its goal is to find the diamond. The agent interacts with
the environment by performing actions; based on those actions,
the state of the agent changes, and it also receives a reward or
penalty as feedback.
o The agent keeps doing these three things (take an action, change
state or remain in the same state, and get feedback), and by doing
so, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards
and which actions lead to negative feedback or penalties. For a positive reward,
the agent gets a positive point, and as a penalty, it gets a negative point.
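The three-step loop described above can be written down as a tiny sketch; the one-dimensional corridor environment, its step function, and the reward values below are illustrative assumptions, not part of any standard library:

# Minimal sketch of the agent-environment loop: take an action, change state,
# get feedback. A toy 1-D corridor with a "diamond" at position 4.
import random

GOAL = 4

def step(state, action):
    # Apply an action (-1 or +1) and return (new_state, reward)
    new_state = max(0, min(GOAL, state + action))
    reward = 1.0 if new_state == GOAL else -0.1   # positive point at the goal, small penalty otherwise
    return new_state, reward

state = 0
for t in range(20):
    action = random.choice([-1, 1])               # explore with random actions
    state, reward = step(state, action)
    print(t, state, reward)
    if state == GOAL:
        break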
Association Rule
Association Rule Learning
Association rule learning is a type of unsupervised learning technique that
checks for the dependency of one data item on another data item and maps
them accordingly so that the result can be more profitable. It tries to find interesting
relations or associations among the variables of a dataset, based on different
rules for discovering the interesting relations between variables in the database.
Association rule learning is one of the very important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Market basket analysis is a technique used
by various big retailers to discover the associations between items. We can
understand it with the example of a supermarket, where all
products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs,
or milk, so these products are stored on the same shelf or mostly nearby. Consider
the below diagram:
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
We will understand these algorithms in later chapters.
How does Association Rule Learning work?
Association rule learning works on the concept of an if-then rule, such as
"if A, then B". The strength of a rule is commonly measured with support,
confidence, and lift.
Support
Support tells how frequently an itemset appears in the dataset. It is the ratio of
transactions containing X to the total number of transactions.
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often
the items X and Y occur together in the dataset given that X occurs. It is the ratio of the transactions that contain both X and Y to the
number of transactions that contain X.
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support
expected if X and Y were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible value ranges:
o Lift = 1: the occurrence of the antecedent and the consequent is
independent of each other.
o Lift > 1: the two itemsets are positively dependent on
each other.
o Lift < 1: one item is a substitute for the other, which
means one item has a negative effect on the other.
Types of Association Rule Learning
Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. The algorithm uses
a breadth-first search and a hash tree to count itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the
products that can be bought together. It can also be used in the healthcare field
to find drug reactions for patients.
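A minimal, hedged sketch of market basket analysis with Apriori, using the mlxtend library (assumed installed; the four-transaction toy dataset and the thresholds are illustrative):

# Minimal sketch: frequent itemsets and association rules with mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])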
Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm
uses a depth-first search technique to find frequent itemsets in a transaction
database. It executes faster than the Apriori algorithm.
F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved
version of the Apriori algorithm. It represents the database in the form of a tree
structure known as a frequent pattern tree (FP-tree). The purpose of this
tree is to extract the most frequent patterns.
Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:
o Market Basket Analysis: This is one of the most popular examples and
applications of association rule mining. The technique is commonly used
by big retailers to determine the associations between items.
o Medical Diagnosis: With the help of association rules, patients can be
diagnosed more easily, as the rules help in identifying the probability of
illness for a particular disease.
o Protein Sequence: Association rules help in determining the
synthesis of artificial proteins.
o It is also used for catalog design, loss-leader analysis, and
many other applications.
Exploration vs Exploitation:
Greedy Action: The agent chooses the action that currently has
the largest estimated value. The agent exploits its current knowledge
by choosing the greedy action.
Non-Greedy Action: The agent does not choose the action with the largest
estimated value, sacrificing immediate reward in the hope of gaining more
information about the other actions.
Exploration: It allows the agent to improve its knowledge about each
action, hopefully leading to a long-term benefit.
Exploitation: It allows the agent to choose the greedy action to try to
get the most reward for a short-term benefit. Pure greedy action
selection can lead to sub-optimal behaviour.
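As a concrete illustration of this trade-off, here is a minimal sketch of epsilon-greedy action selection, a standard baseline (not named above) that takes the greedy action most of the time and a non-greedy, exploratory action with probability epsilon; the action-value estimates are made-up numbers:

# Minimal sketch: greedy vs. non-greedy (exploratory) action selection.
import random

Q = {"run": 0.4, "fetch": 0.9, "sit": 0.1}    # current action-value estimates (illustrative)

def select_action(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q))          # explore: a non-greedy action
    return max(Q, key=Q.get)                   # exploit: the greedy action

print([select_action() for _ in range(10)])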
A dilemma arises between exploration and exploitation because an agent cannot
choose to both explore and exploit at the same time. Hence, we use
the Upper Confidence Bound (UCB) algorithm to resolve the exploration-exploitation
dilemma.
Upper Confidence Bound Action Selection:
Upper-Confidence-Bound action selection uses uncertainty in the action-value
estimates to balance exploration and exploitation. Since there is inherent
uncertainty in the accuracy of the action-value estimates when we use a
sampled set of rewards, UCB uses this uncertainty in the estimates to drive
exploration.
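The selection rule described below is usually written as the standard UCB1 formula; a hedged reconstruction, where c is an exploration-strength constant and N_t(a) is the number of times action a has been selected up to time t:

A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{ \frac{\ln t}{N_t(a)} } \right]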
Qt(a) here represents the current estimate for action a at time t. We select the
action that has the highest estimated action-value plus the upper-confidence-bound
exploration term.
Q(A) in the above picture represents the current action-value estimate for
action A. The brackets represent a confidence interval around Q*(A), which says
that we are confident that the actual action-value of action A lies somewhere in
this region.
The lower bracket is called the lower bound, and the upper bracket is the upper
bound. The region between the brackets is the confidence interval, which
represents the uncertainty in the estimates. If the region is very small, we
are very certain that the actual value of action A is near our estimated
value. On the other hand, if the region is large, we are uncertain that
the value of action A is near our estimated value.
The Upper Confidence Bound follows the principle of optimism in the face of
uncertainty which implies that if we are uncertain about an action, we should
optimistically assume that it is the correct action.
For example, let's say we have the four actions with associated uncertainties
shown in the picture below, and our agent has no idea which is the best action.
According to the UCB algorithm, it will optimistically pick the action that has the
highest upper bound, i.e., A. By doing this, either it will have the highest value
and get the highest reward, or we will get to learn about the
action we know least about.
Let's assume that after selecting action A we end up in the state depicted in
the picture below. This time UCB will select action B, since Q(B) has the
highest upper confidence bound: its action-value estimate is the
highest, even though its confidence interval is small.
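A minimal, hedged sketch of UCB1 action selection on a toy three-armed bandit (the arms' reward probabilities and the constant c are illustrative):

# Minimal sketch: UCB1 on a 3-armed bandit; most pulls should converge to arm 2.
import math, random

true_probs = [0.2, 0.5, 0.8]    # unknown to the agent
counts = [0, 0, 0]              # N_t(a): times each action was selected
values = [0.0, 0.0, 0.0]        # Q_t(a): running average reward per action
c = 2.0                         # exploration strength

for t in range(1, 1001):
    if 0 in counts:             # try every action once first
        a = counts.index(0)
    else:                       # otherwise pick the highest upper confidence bound
        a = max(range(3), key=lambda i: values[i] + c * math.sqrt(math.log(t) / counts[i]))
    reward = 1.0 if random.random() < true_probs[a] else 0.0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]   # incremental mean update

print(counts, [round(v, 2) for v in values])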
What Is Q-Learning?
Q-Learning is a reinforcement learning policy that finds the next best action,
given a current state. It chooses this action at random and aims to maximize the
reward.
The objective of the model is to find the best course of action given its current
state. To do this, it may come up with rules of its own, or it may operate outside
the policy given to it to follow. This means that there is no actual need for a
policy, hence we call Q-learning off-policy.
Model-free means that the agent does not rely on a model of the environment's
expected responses to move forward; rather than learning from such a model, it
learns from the rewards it receives, by trial and error.
Q-learning relies on three basic components:
1. States: The situation or position the agent is currently in.
2. Actions: The steps the agent can take in a given state.
3. Rewards: For every action, the agent will get a positive or negative
reward.
The Bellman Equation is used to determine the value of a particular state and
deduce how good it is to be in or take that state. The optimal state will give us the
highest optimal value.
The equation is given below. It uses the current state and the reward associated
with that state, along with the maximum expected future reward and a discount rate
(which determines the future reward's importance relative to the current state),
to find the next state of our agent. The learning rate determines how fast or slow
the model learns.
Figure 6: Bellman Equation
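The equation in the figure is typically the Q-learning form of the Bellman update; a hedged reconstruction, where α is the learning rate, γ the discount rate, r the reward for the current step, and s' the next state:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]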
While running our algorithm, we will come across various solutions, and the agent
will take multiple paths. How do we find the best among them? We do this
by tabulating our findings in a table called a Q-table.
A Q-table helps us find the best action for each state in the environment. We
use the Bellman equation at each state to get the expected future state and
reward, and save it in the table to compare with other states.
Let us create a Q-table for an agent that has to learn to run, fetch, and sit on
command. The steps taken to construct a Q-table are:
Step 1: Create an initial Q-table with all values set to 0
When we initially start, the values of all states and rewards will be 0. Consider
the Q-table shown below, which shows a dog simulator learning to perform
actions:
Step 2: Choose an action and perform it; update values in the table
This is the starting point, and we have performed no other action as of yet. Let us
say that we want the agent to sit initially, which it does. The table will change to:
Step 3: Get the value of the reward and calculate the Q-value using the
Bellman equation
For the action performed, we need to calculate the value of the actual reward
and the Q(S, A) value.
Step 4: Continue the same until the table is filled or an episode ends
The agent continues taking actions, and for each action the reward and Q-value
are calculated and the table is updated.
Figure 10: Final Q-Table at end of an episode
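Putting the four steps together, a minimal, hedged sketch of tabular Q-learning for the dog-simulator example (the states, commands, and reward values are illustrative assumptions):

# Minimal sketch: a tabular Q-learning loop for a toy "dog simulator" where
# the correct response to each command earns a reward.
import random

states = ["run_cmd", "fetch_cmd", "sit_cmd"]   # commands the dog hears
actions = ["run", "fetch", "sit"]
correct = {"run_cmd": "run", "fetch_cmd": "fetch", "sit_cmd": "sit"}

# Step 1: Q-table initialized to 0 for every (state, action) pair
Q = {s: {a: 0.0 for a in actions} for s in states}
alpha, gamma, epsilon = 0.5, 0.9, 0.2          # learning rate, discount rate, exploration

for episode in range(500):
    s = random.choice(states)
    # Step 2: choose an action (epsilon-greedy) and perform it
    a = random.choice(actions) if random.random() < epsilon else max(Q[s], key=Q[s].get)
    r = 1.0 if a == correct[s] else -1.0       # Step 3: get the reward...
    s_next = random.choice(states)
    # ...and update Q(s, a) with the Bellman equation
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
    # Step 4: continue until the table converges or the episode ends

for s in states:
    print(s, "->", max(Q[s], key=Q[s].get))    # learned best action per command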