ML Unit 4
B. TECH (CSE)
3rd YEAR – 2nd SEM
UNIT-IV
Genetic Algorithms
Learning Sets of Rules
Reinforcement Learning
NOTES MATERIAL
RAVIKRISHNA B
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
Updated on 31.10.2018
Genetic Algorithms (GAs) are search-based optimization techniques inspired by the principles of natural selection and genetics. A GA starts with a pool (population) of candidate solutions to the given problem.
How a genetic algorithm actually works:
▪ These solutions then undergo recombination and mutation (like in natural genetics),
producing new children, and the process is repeated over various generations.
▪ Each individual (or candidate solution) is assigned a fitness value (based on its objective
function value) and the fitter individuals are given a higher chance to mate and yield
more “fitter” individuals. This is in line with the Darwinian Theory of “Survival of the
Fittest”.
▪ In this way we keep “evolving” better individuals or solutions over generations, till we
reach a stopping criterion.
▪ Genetic Algorithms are sufficiently randomized in nature, but they perform much better
than random local search (in which we just try various random solutions, keeping track
of the best so far), as they exploit historical information as well.
Advantages of GAs
(factors behind the popularity of genetic algorithms)
GAs have various advantages which have made them immensely popular. These include –
• GAs do not require any derivative information (which may not be available for many real-world problems).
• They are faster and more efficient than traditional methods.
• They have very good parallel capabilities.
• They optimize both continuous and discrete functions, as well as multi-objective problems.
• They provide a list of “good” solutions, not just a single solution.
• They always get an answer to the problem, and the answer gets better over time.
• They are useful when the search space is very large and a large number of parameters are involved.
Benefits of GAs:
Genetic algorithms differ from traditional search and optimization methods in four significant
points:
• Genetic algorithms search in parallel from a population of points. Therefore, they have the ability to avoid being trapped in a local optimal solution, unlike traditional methods, which search from a single point.
• Genetic algorithms use probabilistic selection rules, not deterministic ones.
• Genetic algorithms work on the chromosome, which is an encoded version of the potential solution’s parameters, rather than the parameters themselves.
• Genetic algorithms use a fitness score, which is obtained from objective functions, without any other derivative or auxiliary information.
Limitations of GAs:
Like any technique, GAs also suffer from a few limitations. These include –
• GAs are not suited for all problems, especially problems which are simple and for which
derivative information is available.
• Fitness value is calculated repeatedly which might be computationally expensive for some
problems.
• Being stochastic, there are no guarantees on the optimality or the quality of the solution.
• If not implemented properly, the GA may not converge to the optimal solution.
Genotype – Genotype is the population in the computation space. In the computation space,
the solutions are represented in a way which can be easily understood and manipulated using
a computing system.
Phenotype – Phenotype is the population in the actual real-world solution space, in which solutions are represented the way they appear in real-world situations.
Basic Terminology:
Before beginning a discussion on Genetic Algorithms, it is essential to be familiar with some
basic terminology which will be used throughout this tutorial.
• Population – It is a subset of all the possible (encoded) solutions to the given problem. The population for a GA is analogous to a population of human beings, except that instead of human beings we have candidate solutions.
• Chromosomes – A chromosome is one such
solution to the given problem.
• Gene – A gene is one element position of a
chromosome.
• Allele – It is the value a gene takes for a
particular chromosome.
• The diversity of the population should be maintained, otherwise it might lead to premature convergence.
• The population size should not be kept very large as it can cause a GA to slow down,
while a smaller population might not be enough for a good mating pool. Therefore, an
optimal population size needs to be decided by trial and error.
Crossover
Crossover is the most significant phase in a genetic algorithm. For each pair of parents to be
mated, a crossover point is chosen at random from within the genes.
For example, consider the crossover point to be 3 as shown below.
Crossover point:
Offspring are created by exchanging the genes of parents among themselves until the
crossover point is reached.
Mutation
• In certain new offspring formed, some of their genes can be subjected to a mutation
with a low random probability. This implies that some of the bits in the bit string can
be flipped.
• Mutation occurs to maintain diversity within the population and prevent premature
convergence.
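A minimal sketch of this bit-flip mutation in Python (illustrative, not part of the original notes); the mutation rates shown are assumed example values.

import random

def mutate(chromosome, rate=0.01):
    # Flip each bit of a binary chromosome with a small probability `rate`.
    return [1 - gene if random.random() < rate else gene for gene in chromosome]

# Example: with a low rate, only the occasional bit of an offspring gets flipped.
offspring = [1, 0, 0, 1, 1, 0]
print(mutate(offspring, rate=0.2))   # a higher rate is used here only to make a flip likely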
Termination
The algorithm terminates if the population has converged (does not produce offspring which
are significantly different from the previous generation). Then it is said that the genetic
algorithm has provided a set of solutions to our problem.
Pseudocode:
1. START
2. Generate the initial population
3. Compute fitness
4. REPEAT
5. Selection
6. Crossover
7. Mutation
8. Compute fitness
9. UNTIL population has converged
10. STOP
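The pseudocode above can be turned into a short runnable sketch. The Python snippet below is illustrative only (not part of the original notes); the chromosome length, population size, rates, and the placeholder fitness function are all assumed values.

import random

GENES = 6              # length of each binary chromosome (assumed)
POP_SIZE = 8           # population size (assumed)
GENERATIONS = 50       # fixed number of generations used as the stopping criterion

def fitness(chromosome):
    # Placeholder fitness: count of 1s; replace with a problem-specific function.
    return sum(chromosome)

def select(population):
    # Fitness-proportionate (roulette wheel) selection of one parent.
    total = sum(fitness(c) for c in population)
    weights = [fitness(c) / total for c in population] if total else None
    return random.choices(population, weights=weights, k=1)[0]

def crossover(parent1, parent2):
    point = random.randint(1, GENES - 1)      # single random crossover point
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.05):
    return [1 - g if random.random() < rate else g for g in chromosome]

# Step 2: generate the initial population; Steps 4-9: evolve until the stop rule is met.
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    new_population = []
    while len(new_population) < POP_SIZE:
        child1, child2 = crossover(select(population), select(population))
        new_population += [mutate(child1), mutate(child2)]
    population = new_population[:POP_SIZE]

print(max(population, key=fitness))    # best individual found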
Application/Example/Case-study:
Here, to make things easier, let us understand it through the famous Knapsack problem. Suppose you are going to spend a month in the wilderness. The only thing you are carrying is a backpack which can hold a maximum weight of 30 kg. You have different survival items, each having its own “Survival Points” (which are given for each item in the table). Your objective is to maximize the survival points.
Here is the table giving details about each item.
1. Initialization
To solve this problem using a genetic algorithm, our first step would be defining our population. Our population will contain individuals, each having its own set of chromosomes.
We know that chromosomes are binary strings, where for this problem 1 means that the corresponding item is taken and 0 means that it is dropped.
2. Fitness Function:
Let us calculate the fitness points for our first two chromosomes.
For chromosome A1 [100110], the fitness is the total of the survival points of the items it selects (the calculation is shown in the figure).
So, for this problem, a chromosome is considered fitter when it carries more survival points.
Therefore, chromosome 1 is fitter than chromosome 2.
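Since the item table appears only as a figure, the weights and survival points used below are hypothetical placeholders; the sketch only shows how a chromosome such as [1, 0, 0, 1, 1, 0] would be scored (total survival points of the selected items, with zero fitness if the 30 kg limit is exceeded).

# Hypothetical item data (the actual table is given as a figure in the notes).
weights = [10, 4, 7, 6, 5, 3]        # item weights in kg (assumed values)
points  = [12, 5, 8, 9, 6, 4]        # survival points (assumed values)
CAPACITY = 30                         # maximum backpack weight from the problem statement

def knapsack_fitness(chromosome):
    total_weight = sum(w for w, bit in zip(weights, chromosome) if bit)
    total_points = sum(p for p, bit in zip(points, chromosome) if bit)
    return total_points if total_weight <= CAPACITY else 0   # infeasible solutions score zero

print(knapsack_fitness([1, 0, 0, 1, 1, 0]))   # score of chromosome A1 under these assumed values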
3. Selection:
Now, we can select fit chromosomes from our population which can
mate and create their off-springs.
The general thought is that we should select the fit chromosomes and allow them to produce off-springs. But that would lead to chromosomes that are very close to one another after a few generations, and therefore less diversity.
Therefore, we generally use Roulette Wheel Selection method.
Don’t be afraid of the name; just take a look at the image below. I suppose we all have seen this, either in real life or in the movies. So, let’s build our roulette wheel.
Consider a wheel, and let’s divide it into m divisions, where m is the number of chromosomes in our population. The area occupied by each chromosome is proportional to its fitness value.

Chromosome        Survival Points     Percentage
Chromosome 1            28            28.9% (28/97)
Chromosome 2            23            23.7%
Chromosome 3            12            12.4%
Chromosome 4            34            35.1%
TOTAL                   97            (total survival points)

(Figure: Survival Points pie chart, one slice per chromosome.)
Based on these values, let us create our roulette wheel.
So, now this wheel is rotated, and the region of the wheel that comes in front of the fixed point is chosen as the parent. For the second parent, the same process is repeated.
Sometimes we mark two fixed points, as shown in the figure below.
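A minimal Python sketch of this roulette-wheel selection (illustrative, not from the notes): each chromosome is picked with probability proportional to its share of the total survival points, exactly as in the percentage table above.

import random

# Survival points from the table above.
fitnesses = {"Chromosome1": 28, "Chromosome2": 23, "Chromosome3": 12, "Chromosome4": 34}

def roulette_select(fitnesses):
    # Pick one chromosome with probability proportional to its fitness.
    total = sum(fitnesses.values())
    spin = random.uniform(0, total)       # where the fixed point lands on the wheel
    running = 0
    for name, fit in fitnesses.items():
        running += fit
        if spin <= running:
            return name

parent1 = roulette_select(fitnesses)
parent2 = roulette_select(fitnesses)      # the spin is repeated for the second parent
print(parent1, parent2)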
4. Crossover
In the previous step, we selected the parent chromosomes that will produce off-springs. In biological terms, crossover is nothing but reproduction.
So let us find the crossover of chromosomes 1 and 4, which were selected in the previous step. Take a look at the image below.
This is the most basic form of crossover, known as one-point crossover. Here we select a random crossover point, and the tails of both chromosomes are swapped to produce new off-springs.
If you take two crossover points, then it is called multi-point crossover, which is shown below.
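The sketch below (illustrative Python, not part of the original notes) implements the one-point crossover just described, together with a simple two-point variant for the multi-point case.

import random

def one_point_crossover(parent1, parent2):
    point = random.randint(1, len(parent1) - 1)              # random crossover point
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def two_point_crossover(parent1, parent2):
    a, b = sorted(random.sample(range(1, len(parent1)), 2))  # two distinct crossover points
    return (parent1[:a] + parent2[a:b] + parent1[b:],
            parent2[:a] + parent1[a:b] + parent2[b:])

parent1 = [1, 0, 0, 1, 1, 0]
parent2 = [0, 1, 1, 0, 1, 1]
print(one_point_crossover(parent1, parent2))
print(two_point_crossover(parent1, parent2))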
The crossover operator:
The crossover operator produces two new offspring from two parent strings, by copying selected bits from each parent. The bit at position i in each offspring is copied from the bit at position i in one of the two parents. The choice of which parent contributes the bit for position i is determined by an additional string called the crossover mask. To illustrate, consider the single-point crossover operator at the top of the table.
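A short Python illustration of the crossover-mask idea (the mask values here are assumed for the example): for each bit position, the mask decides which parent contributes the bit to each offspring; a contiguous block of 1s reproduces single-point crossover.

def mask_crossover(parent1, parent2, mask):
    # Offspring 1 takes bits from parent1 where mask == 1 and from parent2 where mask == 0;
    # offspring 2 is the mirror image.
    child1 = [a if m else b for a, b, m in zip(parent1, parent2, mask)]
    child2 = [b if m else a for a, b, m in zip(parent1, parent2, mask)]
    return child1, child2

mask = [1, 1, 1, 0, 0, 0]            # single-point crossover after position 3
print(mask_crossover([1, 0, 0, 1, 1, 0], [0, 1, 1, 0, 1, 1], mask))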
5. Mutation
Now if you think in the biological sense, do the children produced have exactly the same traits as their parents? The answer is NO. During their growth, there is some change in the genes of the children which makes them different from their parents.
This process is known as mutation, which may be defined as a random tweak in the chromosome, which also promotes the idea of diversity in the population.
A simple method of mutation is shown in the image below.
The off-springs thus produced are again evaluated using our fitness function, and if considered fit they will replace the less fit chromosomes in the population.
But the question is, how will we know that we have reached our best possible solution?
So basically, there are different termination conditions, which are listed below:
1. There is no improvement in the population for over x iterations.
2. We have already predefined an absolute number of generations for our algorithm.
3. When our fitness function has reached a predefined value.
Source: https://www.analyticsvidhya.com/blog/2017/07/introduction-to-genetic-algorithm/
Fitness Function and Selection:
• The fitness function defines the criterion for ranking potential hypotheses and for
probabilistically selecting them for inclusion in the next generation population.
Often other criteria may be included as well, such as the complexity or generality
of the rule. More generally, when the bit-string hypothesis is interpreted as a
complex procedure, the fitness function may measure the overall performance of
the resulting procedure rather than performance of individual rules.
• In our prototypical GA shown, the probability that a hypothesis will be selected
is given by the ratio of its fitness to the fitness of other members of the current
population. This method is sometimes called fitness proportionate selection, or
roulette wheel selection. Other methods for using fitness to select hypotheses
have also been proposed.
Prototypical Genetic Algorithm:
• The probability that hypothesis hi will be selected is:
Pr(hi) = Fitness(hi) / Σ (j = 1 to p) Fitness(hj)
where p is the number of hypotheses in the current population.
• Thus, the probability that a hypothesis will be selected is proportional to its own
fitness and is inversely proportional to the fitness of the other competing
hypotheses in the current population.
• Once these members of the current generation have been selected for inclusion
in the next generation population, additional members are generated using a
crossover operation.
• Crossover, defined in detail in the next section, takes two parent hypotheses
from the current generation and creates two offspring hypotheses by
recombining portions of both parents. The parent hypotheses are chosen
probabilistically from the current population, again using the probability
function given by Equation.
• After new members have been created by this crossover operation, the new
generation population now contains the desired number of members.
• At this point, a certain fraction m of these members is chosen at random, and random mutations are performed to alter these members.
• This GA algorithm thus performs a randomized, parallel beam search for
hypotheses that perform well according to the fitness function. In the following
subsections, we describe in more detail the representation of hypotheses and
genetic operators used in this algorithm.
GENETIC PROGRAMMING
• In effect, the ability to learn allows an individual to perform a small local search
during its lifetime to maximize its fitness.
• In contrast, non-learning individuals whose fitness is fully determined by their
genetic makeup will operate at a relative disadvantage.
• Those individuals who are able to learn many traits will rely less strongly on
their genetic code to "hard-wire" traits. As a result, these individuals can support
a more diverse gene pool, relying on individual learning to overcome the
"missing" or "not quite optimized" traits in the genetic code.
• This more diverse gene pool can, in turn, support more rapid evolutionary
adaptation. Thus, the ability of individuals to learn can have an indirect
accelerating effect on the rate of evolutionary adaptation for the entire
population.
Parallelizing Genetic Algorithms:
• GAs are naturally suited to parallel implementation, and a number of approaches to parallelization have been explored. Coarse-grained approaches to parallelization subdivide the population into somewhat distinct groups of individuals, called demes.
• Each deme is assigned to a different computational node, and a standard GA
search is performed at each node.
• Communication and cross-fertilization between demes occur on a less frequent
basis than within demes.
• Transfer between demes occurs by a migration process, in which individuals
from one deme are copied or transferred to other demes. This process is modeled
after the kind of cross-fertilization that might occur between physically
separated subpopulations of biological species.
• One benefit of such approaches is that they reduce the crowding problem often encountered in nonparallel GAs, in which the system falls into a local optimum due to the early appearance of a genotype that comes to dominate the entire population.
• In contrast to coarse-grained parallel implementations of GAs, fine-grained implementations typically assign one processor per individual in the population. Recombination then takes place among neighboring individuals.
• These two rules compactly describe a recursive function that would be very difficult to represent using a decision tree or other propositional representation. One way to see the representational power of first-order rules is to consider the general-purpose programming language PROLOG. In PROLOG, programs are sets of first-order rules such as the two shown above (rules of this form are also called Horn clauses).
SEQUENTIAL COVERING ALGORITHMS
• we consider a family of algorithms for learning rule sets based on the strategy of learning
one rule, removing the data it covers, then iterating this process. Such algorithms are
called sequential covering algorithms.
• To elaborate, imagine we have a subroutine LEARN-ONE-RULE that accepts a set of
positive and negative training examples as input, then outputs a single rule that covers
many of the positive examples and few of the negative examples.
• Given this LEARN-ONE-RULE subroutine for learning a single rule, one obvious
approach to learning a set of rules is to invoke LEARN-ONE-RULE on all the available
training examples, remove any positive examples covered by the rule it learns, then
invoke it again to learn a second rule based on the remaining training examples.
• This procedure can be iterated as many times as desired to learn a disjunctive set of
rules that together cover any desired fraction of the positive examples. This is called a
sequential covering algorithm because it sequentially learns a set of rules that together
cover the full set of positive examples.
• The final set of rules can then be sorted so that more accurate rules will be considered
first when a new instance must be classified.
• This sequential covering algorithm is one of the most widespread approaches to learning
disjunctive sets of rules.
• It reduces the problem of learning a disjunctive set of rules to a sequence of simpler
problems, each requiring that a single conjunctive rule be learned. Because it performs
a greedy search, formulating a sequence of rules without backtracking, it is not guaranteed to find the smallest or best set of rules that cover the training examples.
The sequential covering algorithm takes care of, to some extent, the low-coverage problem of the Learn-One-Rule algorithm by covering all the rules in a sequential manner.
It extracts the best rule for a particular class ‘y’, where a rule is defined as:
(Figure: general form of a rule)
In the beginning:
Step 2.a – If a training example ∈ class ‘y’, it is classified as a positive example.
Step 2.b – Else, if a training example ∉ class ‘y’, it is classified as a negative example.
Step 3 – The rule becomes ‘desirable’ when it covers a majority of the positive examples.
Step 4 – When this rule is obtained, delete all the training data associated with that rule (i.e. when the rule is applied to the dataset, it covers most of the training data and has to be removed).
Step 5 – The new rule is added to the bottom of the decision list, ‘R’. (Fig. 3)
• Once we cover these 6 positive examples, we get our first rule R1, which is then pushed into the decision list, and those positive examples are removed from the dataset.
• Now, we take the next majority of positive examples and follow the same process until we get rule R2.
• In the end, we obtain our final decision list with all the desirable rules.
Sequential Learning is a powerful algorithm for generating rule-based classifiers in
Machine Learning. It uses ‘Learn-One-Rule’ algorithm as its base to learn a sequence
of disjunctive rules.
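The loop described above can be summarised in a short, schematic Python sketch (not from the notes). Here learn_one_rule and the rule objects with their performance and covers attributes are placeholders standing in for the LEARN-ONE-RULE subroutine, not a specific library API.

def sequential_covering(examples, learn_one_rule, performance_threshold):
    # Learn a decision list of rules, one rule at a time.
    learned_rules = []
    rule = learn_one_rule(examples)                       # placeholder subroutine
    while rule is not None and rule.performance >= performance_threshold:
        learned_rules.append(rule)
        # Remove the training examples covered by the new rule and repeat.
        examples = [e for e in examples if not rule.covers(e)]
        rule = learn_one_rule(examples)
    # Sort so that more accurate rules are considered first at classification time.
    learned_rules.sort(key=lambda r: r.performance, reverse=True)
    return learned_rules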
Rule-Based Classifier – Machine Learning
Rule-based classifiers are just another type of classifier which makes the class decision by using various “if...then” rules. These rules are easily interpretable, and thus these classifiers are generally used to generate descriptive models. The condition used with “if” is called the antecedent, and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
• Coverage: The percentage of records which satisfy the antecedent conditions of a
particular rule.
• The rules generated by the rule-based classifiers are generally not mutually
exclusive, i.e. many rules can cover the same record.
• The rules generated by the rule-based classifiers may not be exhaustive, i.e. there
may be some records which are not covered by any of the rules.
• The decision boundaries created by them are linear, but these can be much more complex than those of a decision tree, because many rules can be triggered for the same record.
An obvious question that comes to mind, after knowing that the rules are not mutually exclusive, is: how would the class be decided when different rules with different consequents cover the same record?
There are two solutions to the above problem:
• Either the rules can be ordered, i.e. the class corresponding to the highest-priority triggered rule is taken as the final class.
• Otherwise, we can assign votes for each class depending on the rules’ weights, i.e. the rules remain unordered.
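A tiny illustrative sketch (Python, with made-up rules) of the first option, an ordered rule list: the first rule whose antecedent fires decides the class, and a default class handles records that no rule covers.

# Each rule is (antecedent predicate, consequent class); list order encodes priority.
rules = [
    (lambda r: r["blood_type"] == "warm" and r["gives_birth"] == "yes", "mammal"),   # assumed example rule
    (lambda r: r["blood_type"] == "cold", "reptile"),                                # assumed example rule
]
DEFAULT_CLASS = "unknown"

def classify(record):
    for antecedent, consequent in rules:
        if antecedent(record):            # highest-priority rule that covers the record wins
            return consequent
    return DEFAULT_CLASS                  # record not covered by any rule

print(classify({"blood_type": "warm", "gives_birth": "yes"}))   # -> mammal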
LEARN-ONE-RULE ALGORITHM
Learn-One-Rule:
This method is used in the sequential learning algorithm for learning the rules. It
returns a single rule that covers at least some examples. However, what makes it
really powerful is its ability to create relations among the attributes given, hence
covering a larger hypothesis space.
Learn-One-Rule Algorithm
The Learn-One-Rule algorithm follows a greedy searching paradigm where it searches for rules with high accuracy, even though their coverage may be very low. For a particular class, it tries to cover the positive examples and returns a single rule that covers some of them.
Simpler form of Learn-One-Rule:
Learn-One-Rule(target_attribute, attributes, examples, k):
    best_hypothesis = the most general hypothesis (empty precondition)
    while candidate hypotheses remain:
        // generate the next, more specific candidate hypotheses
        // evaluate each candidate with PERFORMANCE and update best_hypothesis
        // keep only the best k candidates (beam of width k)
    return the rule: IF best_hypothesis THEN prediction
    // where prediction is the most frequent target_attribute value among the examples matched by best_hypothesis
It involves a PERFORMANCE method that calculates the performance of each candidate hypothesis (i.e. how well the hypothesis matches the given set of examples in the training data).
For example:
IF Mother(y, x) AND Female(y), THEN Daughter(x, y).
Here, any person can be associated with the variables x and y.
Figure:
Learn-One-Rule Example
Learn-One-Rule Example
Let us understand the working of the algorithm using an example:
• The SEQUENTIAL-COVERING algorithm described above and the decision tree learning
algorithms, suggest a variety of possible methods for learning sets of rules.
• First, sequential covering algorithms learn one rule at a time, removing the covered examples and repeating the process on the remaining examples. In contrast, decision
tree algorithms such as ID3 learn the entire set of disjuncts simultaneously as part of
the single search for an acceptable decision tree. We might, therefore, call algorithms
such as ID3 simultaneous covering algorithms, in contrast to sequential covering
algorithms. The key difference occurs in the choice made at the most primitive step in
the search. At each search step ID3 chooses among alternative attributes by comparing
the partitions of the data they generate.
• To learn a set of n rules, each containing k attribute-value tests in their preconditions, sequential covering algorithms will perform n·k primitive search steps, making an independent decision to select each precondition of each rule.
• Entropy. This is the measure used by the PERFORMANCE subroutine in the algorithm.
Let S be the set of examples that match the rule preconditions. Entropy measures the
uniformity of the target function values for this set of examples.
• We take the negative of the entropy so that better rules will have higher scores.
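A small Python sketch of this PERFORMANCE measure (the negative entropy of the target values of the examples that match the rule's preconditions); the helper name and inputs are illustrative.

from collections import Counter
from math import log2

def performance(covered_target_values):
    # Negative entropy of the target-function values of the examples matched by the rule.
    n = len(covered_target_values)
    counts = Counter(covered_target_values)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return -entropy            # negate so that purer (better) rules get higher scores

print(performance(["yes", "yes", "yes", "no"]))   # a perfectly pure rule would score 0, the maximum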
• we consider learning rules that contain variables-in particular, learning first-order Horn
theories. Our motivation for considering such rules is that they are much more
expressive than propositional rules.
• Inductive learning of first-order rules or theories is often referred to as inductive logic programming (or ILP for short), because this process can be viewed as automatically inferring PROLOG programs from examples.
• PROLOG is a general purpose, Turing-equivalent programming language in which
programs are expressed as collections of Horn clauses.
First-Order Horn Clauses:
Before getting into the FOIL Algorithm, let us understand the meaning of first-order rules
and the various terminologies involved in it.
FIRST-ORDER LOGIC:
All expressions in first-order logic are composed of the following symbols:
constants — e.g. tyler, 23, a
variables — e.g. A, B, C
predicate symbols — e.g. male, father (True or False values only)
function symbols — e.g. age (can take on any constant as a value)
connectives — e.g. ∧, ∨, ¬, →, ←
quantifiers — e.g. ∀, ∃
Term: It can be defined as any constant, variable or function applied to any term.
e.g. age(bob)
Terminology:
Before moving on to algorithms for learning sets of Horn clauses, let us learn some basic
terminology from formal logic. All expressions are composed of constants (e.g., Bob, Louise),
variables (e.g., x, y), predicate symbols (e.g., Married, Greater-Than), and function symbols
(e.g., age).
The difference between predicates and functions is that predicates take on values of True or
False, whereas functions may take on any constant as their value.
We will use lowercase symbols for variables and capitalized symbols for constants. Also, we
will use lowercase for functions and capitalized symbols for predicates.
From these symbols, we build up expressions as follows:
A term is any constant, any variable, or any function applied to any term (e.g., Bob, x, age(Bob)). A literal is any predicate or its negation applied to any term (e.g., Married(Bob, Louise), ¬Greater-Than(age(Sue), 20)). If a literal contains a negation (¬) symbol, we call it a negative literal; otherwise, it is a positive literal.
A clause is any disjunction of literals, where all variables are assumed to be universally
quantified.
A Horn clause is a clause containing at most one positive literal, such as
H ∨ ¬L1 ∨ ... ∨ ¬Ln
where H is the positive literal and ¬L1 ... ¬Ln are negative literals. Because of the equalities (B ∨ ¬A) = (B ← A) and ¬(A ∧ B) = (¬A ∨ ¬B), the above Horn clause can alternatively be written in the form
H ← (L1 ∧ L2 ∧ ... ∧ Ln)
Whatever the notation, the Horn clause preconditions L1 ∧ ... ∧ Ln are called the clause body or, alternatively, the clause antecedents. The literal H that forms the postcondition is called the clause head or, alternatively, the clause consequent.
LEARNING SETS OF FIRST-ORDER RULES: FOIL
• The FOIL algorithm employs an approach very similar to the SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms.
• In fact, the FOIL program is the natural extension of these earlier algorithms to
first-order representations. Formally, the hypotheses learned by FOIL are sets of
first-order rules, where each rule is similar to a Horn clause with two exceptions.
• First, the rules learned by FOIL are more restricted than general Horn clauses,
because the literals are not permitted to contain function symbols (this reduces
the complexity of the hypothesis space search).
• Second, FOIL rules are more expressive than Horn clauses, because the literals
appearing in the body of the rule may be negated. FOIL has been applied to a
variety of problem domains. For example, it has been demonstrated to learn a
recursive definition of the QUICKSORT algorithm and to learn to discriminate
legal from illegal chess positions.
• Also, FOIL performs a simple hill-climbing search rather than a beam search (equivalently, it uses a beam of width one). The hypothesis space search performed by FOIL is best understood by viewing it hierarchically. Each iteration through FOIL's outer loop adds a new rule to its disjunctive hypothesis, Learned_rules.
REINFORCEMENT LEARNING:
What is Reinforcement Learning?
o Reinforcement Learning is a feedback-based Machine learning technique in which an
agent learns to behave in an environment by performing the actions and seeing the
results of actions. For each good action, the agent gets positive feedback, and for each
bad action, the agent gets negative feedback or penalty.
o In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of
an agent in reinforcement learning is to improve the performance by getting the
maximum positive rewards.
o The agent learns through trial and error, and based on experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and his
goal is to find the diamond. The agent interacts with the environment by performing
some actions, and based on those actions, the state of the agent gets changed, and it
also receives a reward or penalty as feedback.
o The agent continues doing these three things (take
action, change state/remain in the same state,
and get feedback), and by doing these actions, he
learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalty. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum value achievable at a state under any policy. With this approach, the agent expects a long-term return at any state(s) under policy π.
2. Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply such a policy
that the action performed in each step helps to maximize the future reward. The policy-
based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
There are four main elements of reinforcement learning:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

1) Policy: A policy defines the behaviour of the agent at a given time. It is a mapping from the perceived states of the environment to the actions to be taken in those states.
2) Reward Signal: The goal of reinforcement learning is defined by the reward
signal. At each state, the environment sends an immediate signal to the learning
agent, and this signal is known as a reward signal. These rewards are given
according to the good and bad actions taken by the agent. The agent's main
objective is to maximize the total number of rewards for good actions. The reward
signal can change the policy, such as if an action selected by the agent leads to
low reward, then the policy may change to select other actions in the future.
3) Value Function: The value function gives information about how good the
situation and action are and how much reward an agent can expect. A reward
indicates the immediate signal for each good and bad action, whereas a value
function specifies the good state and action for the future. The value function
depends on the reward, as, without reward, there could be no value. The goal of
estimating values is to achieve more rewards.
4) Model: The last element of reinforcement learning is the model, which mimics
the behaviour of the environment. With the help of the model, one can make
inferences about how the environment will behave. Such as, if a state and an
action are given, then a model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by considering all future situations before actually experiencing those situations. The approaches that solve RL problems with the help of such a model are called model-based approaches.
Let's take an example of a maze environment that the agent needs to explore. Consider
the below image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward point.
It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but he needs to make it in the fewest possible steps. Suppose the agent considers the path S9-S5-S1-S2-S3; then he will get the +1 reward point.
The agent will try to remember the preceding steps that it has taken to reach the final
step. To memorize the steps, it assigns 1 value to each previous step. Consider the
below step:
Now, the agent has successfully stored the previous steps, assigning the value 1 to each previous block. But what will the agent do if he starts moving from a block which has blocks with value 1 on both sides? Consider the below diagram:
It will be a difficult condition for the agent to decide whether he should go up or down, as each block has the same value. So, the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman equation is:
V(s) = max [R(s, a) + γV(s')]
where:
V(s) = value of the current state
R(s, a) = reward obtained for taking action a in state s
V(s') = value of the next state
γ = discount factor
In the above equation, we are taking the max of the complete values because the agent
tries to find the optimal solution always.
So now, using the Bellman equation, we will find value at each state of the given
environment. We will start from the block, which is next to the target block.
V(s3) = max [R(s,a) + γV(s')]; here V(s') = 0, because there is no further state to move to.
V(s2) = max [R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.
V(s1) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.
V(s5) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at this state either.
V(s9) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at this state either.
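The arithmetic above can be checked with a few lines of Python. This sketch assumes γ = 0.9 and simply propagates values backwards along the single path S3 → S2 → S1 → S5 → S9, with the +1 reward collected when moving from S3 into the diamond block.

gamma = 0.9
path = ["s3", "s2", "s1", "s5", "s9"]     # states ordered from the block next to the diamond, backwards

values, next_value = {}, 0.0
for i, state in enumerate(path):
    reward = 1.0 if i == 0 else 0.0       # only the move from s3 into the goal earns the +1 reward
    values[state] = reward + gamma * next_value   # V(s) = max[R(s,a) + γV(s')] along this path
    next_value = values[state]

print(values)   # approximately {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}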
Now, we will move further to the 6th block, and here the agent may change the route because it always tries to find the optimal path. So now, let's consider the block next to the fire pit.
Now, the agent has three options to move: if he moves to the blue box, then he will feel a bump; if he moves to the fire pit, then he will get the -1 reward. But here we are considering only positive rewards, so he will move upwards only. The complete block values will be calculated using this formula. Consider the below image:
There are mainly two types of reinforcement:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
Positive reinforcement means adding something to increase the tendency that the expected behaviour will occur again. It has a positive impact on the behaviour of the agent and increases the strength of the behaviour.
This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can reduce the consequences.
Negative Reinforcement:
Negative reinforcement increases the tendency that a behaviour will occur again by removing or avoiding a negative condition. It can be more effective than positive reinforcement, depending on the situation and behaviour, but it provides reinforcement only to meet the minimum required behaviour.
Markov Decision Process (MDP):
An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
An MDP relies on the Markov property, and to better understand MDPs, we need to learn about it.
o Q-Learning:
Q-learning is a value-based reinforcement learning algorithm that learns an estimate Q(s, a) of how good it is to take action a in state s. The following parameters are commonly used:
1. Gamma (γ): the discount rate. A value between 0 and 1. The higher the value the less
you are discounting.
2. Lambda (λ): the credit assignment variable. A value between 0 and 1. The higher the
value the more credit you can assign to further back states and actions.
3. Alpha (α): the learning rate. How much of the error should we accept and therefore adjust our estimates towards. A value between 0 and 1. A higher value adjusts aggressively, accepting more of the error, while a smaller value adjusts conservatively, making smaller moves towards the actual values.
4. Delta (δ): a change or difference in value.
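As a hedged sketch of how these parameters appear in the standard Q-learning update rule, Q(s, a) ← Q(s, a) + α[r + γ·max over a' of Q(s', a') − Q(s, a)]; the states and actions below are an assumed toy setup, not from the notes.

from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
alpha, gamma = 0.1, 0.9                    # learning rate and discount factor (assumed values)
Q = defaultdict(float)                      # Q[(state, action)] -> estimated action value

def q_update(state, action, reward, next_state):
    # One step of the standard Q-learning (temporal-difference) update.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    delta = reward + gamma * best_next - Q[(state, action)]    # the TD error (the 'delta')
    Q[(state, action)] += alpha * delta

# Example step: the agent moved from s3 into the diamond block and received the +1 reward.
q_update("s3", "right", 1.0, "terminal")
print(Q[("s3", "right")])                   # 0.1 after a single update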
Confidence Intervals:
One common way to describe the uncertainty associated with an estimate is to give an interval
within which the true value is expected to fall, along with the probability with which it is
expected to fall into this interval. Such estimates are
called confidence interval estimates.
Definition: An N% confidence interval for some parameter p is an interval that is expected with
probability N% to contain p.
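A small worked sketch, assuming the usual normal-approximation interval for the error of a discrete-valued hypothesis measured on n test examples: errorS(h) ± zN · sqrt(errorS(h)(1 − errorS(h)) / n), with z = 1.96 for a 95% interval.

from math import sqrt

def confidence_interval(sample_error, n, z=1.96):
    # Approximate N% confidence interval for the true error (z = 1.96 gives a 95% interval).
    margin = z * sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - margin, sample_error + margin

# Example: a hypothesis misclassifies 12 of 40 test examples -> sample error 0.30.
print(confidence_interval(0.30, 40))   # approximately (0.16, 0.44)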
Reinforcement Learning vs. Supervised Learning:
• Reinforcement Learning works by interacting with the environment.
• Supervised Learning works on an existing (labeled) dataset.
Applications of Reinforcement Learning:
1. Robotics: RL is used in robot navigation, Robo-soccer, walking, juggling, etc.
2. Control: RL can be used for adaptive control such as factory processes and admission control in telecommunication; a helicopter pilot is an example of reinforcement learning.
3. Game Playing: RL can be used in game playing such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing chemical reactions.
5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put them in containers.
7. Finance Sector: RL is currently used in the finance sector for evaluating trading strategies.
Source:
https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56
https://www.geeksforgeeks.org/what-is-reinforcement-learning/
https://www.javatpoint.com/reinforcement-learning#Q-Learning