Reinforcement Learning (RL)

Prepared By: Sara Qamar Sultan


Research Associate, ComSens Lab, Department of Computer Science.
Supervised By: Prof. Dr. Nadeem Javaid
Professor, Department of Computer Science
COMSATS University Islamabad, Pakistan
What is Reinforcement Learning?
• Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent must learn from its experience alone.
• RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
What is Reinforcement Learning?
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by getting the maximum positive reward.
• The agent learns through trial and error and, based on its experience, learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
• It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
What is Reinforcement Learning?
• Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment.
• The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. For a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Example 1:
Example 2:
Terms used in Reinforcement Learning:
• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by the agent within the environment.
• State: The situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
• Policy: The strategy applied by the agent to decide the next action based on the current state.
• Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
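To make these terms concrete, here is a minimal sketch (not part of the original slides) of the agent–environment interaction loop in Python. The CorridorEnv environment and random_policy function are hypothetical names invented purely for illustration.

```python
import random

class CorridorEnv:
    """Toy environment: positions 0..4, the goal (reward +1) is at position 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action: -1 = move left, +1 = move right.
        self.state = min(max(self.state + action, 0), 4)
        reward = 1 if self.state == 4 else 0   # Reward: feedback from the environment
        done = self.state == 4                 # Episode ends at the goal
        return self.state, reward, done

def random_policy(state):
    """Policy: maps the current state to an action (here, purely random)."""
    return random.choice([-1, +1])

env = CorridorEnv()
state = env.reset()
total_reward = 0                               # Return accumulated by the agent
done = False
while not done:
    action = random_policy(state)              # Agent chooses an action
    state, reward, done = env.step(action)     # Environment returns new state and reward
    total_reward += reward
print("Episode finished with return:", total_reward)
```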
Key Features of Reinforcement Learning:
• In RL, the agent is not instructed about the environment or what actions need to be taken.
• It is based on a trial-and-error process.
• The agent takes the next action and changes state according to the feedback of the previous action.
• The agent may get a delayed reward.
• The environment is stochastic (random), and the agent needs to explore it to get the maximum positive reward.
• Reinforcement learning is not suitable for environments where complete information is available, for example fraud detection or face recognition, which can be solved better with a classifier than with reinforcement learning.
Types of Reinforcement Learning:

There are two types:


1. Positive Reinforcement:
Positive reinforcement occurs when an event, caused by a specific behavior, increases the strength and frequency of that behavior. It has a positive impact on behavior.
Advantages:
– Maximizes the performance of an action.
– Sustains change for a longer period.
Disadvantage:
– Excess reinforcement can lead to an overload of states, which would diminish the results.
Types of Reinforcement Learning:

2. Negative Reinforcement:
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided as a consequence of that behavior.
Advantages:
– Maximizes the desired behavior.
– Helps establish a minimum standard of performance.
Disadvantage:
– It only does enough to meet the minimum required behavior.
Reinforcement Learning Workflow:
• Create the Environment
• Define the reward
• Create the agent
• Train and validate the agent
• Deploy the policy
How does Reinforcement Learning Work?

To understand the working process of the RL, we need to consider two


main things:
Environment: It can be anything such as a room, maze, football
ground, etc.
Agent: An intelligent agent such as AI robot.
Let's take an example of a maze environment that the agent needs to
explore. Consider the below image:
In the below image, the agent is at the very first block of the maze. The
maze is consisting of an S6 block, which is a wall, S8 a fire pit, and S4 a
diamond block.
How does Reinforcement Learning Work?
• The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
• The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
• The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below image:
How does Reinforcement Learning Work?
• Now, the agent has successfully stored the previous steps by assigning a value of 1 to each previously visited block. But what will the agent do if it starts from a block that has blocks with value 1 on both sides? Consider the below diagram:
How does Reinforcement Learning Work?
• It will be a difficult choice for the agent whether to go up or down, as each block has the same value. So the above approach is not suitable for the agent to reach the destination. Hence, to solve this problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman Equation:
The key elements used in the Bellman equation are:
• The action performed by the agent is referred to as "a".
• The state reached by performing the action is "s".
• The reward/feedback obtained for each good and bad action is "R".
• The discount factor is gamma "γ".
• The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s')]
Where,
• V(s) = value calculated at a particular point.
• R(s,a) = reward obtained at a particular state s by performing action a.
• γ = discount factor.
• V(s') = value of the next state s'.
The Bellman Equation:
In the above equation, we take the max over all possible actions because the agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find the value at each state of the given environment. We will start from the block that is next to the target block.
For the 1st block:
• V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for reaching the diamond.
• V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
The Bellman Equation:
For the 2nd block:
• V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.
• V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9.
For the 3rd block:
• V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.
• V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81.
The Bellman Equation:
For the 4th block:
• V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at this state either.
• V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73.
For the 5th block:
• V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at this state either.
• V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66.
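The backward calculation above can be checked with a few lines of Python. This is only a sketch of the same hand computation, assuming γ = 0.9 and the path S3 ← S2 ← S1 ← S5 ← S9; the state names and dictionary layout are illustrative.

```python
# Reproduce the hand calculation above with gamma = 0.9.
# We walk backwards from the diamond along the path s3 <- s2 <- s1 <- s5 <- s9
# and apply V(s) = max[R(s, a) + gamma * V(s')].
gamma = 0.9

values = {}
values["s3"] = max([1 + gamma * 0])      # R = +1 for reaching the diamond, no next state
for state, next_state in [("s2", "s3"), ("s1", "s2"), ("s5", "s1"), ("s9", "s5")]:
    # No immediate reward on these blocks, so only the discounted next value remains.
    values[state] = max([0 + gamma * values[next_state]])

for state, v in values.items():
    print(state, round(v, 2))
# Expected output: s3 1.0, s2 0.9, s1 0.81, s5 0.73, s9 0.66 (0.729 and 0.6561 rounded)
```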
The Bellman Equation:
• Consider the below image:
The Bellman Equation:
• Consider the below image:
• Now, we will move further to the 6th block, and here the agent may change its route because it always tries to find the optimal path. So now, let's consider the block next to the fire pit.
• Now, the agent has three options to move:
• if it moves to the blue box, it will feel a bump;
• if it moves to the fire pit, it will get a -1 reward;
• but here we are considering only positive rewards, so it will move upwards only. The complete block values are calculated using this formula. Consider the below image:
The Bellman Equation:
Reinforcement Learning Applications:
• Robotics: RL is used in robot navigation, Robo-soccer, walking, juggling, etc.
• Control: RL can be used for adaptive control, such as factory processes and admission control in telecommunication; a helicopter autopilot is another example of reinforcement learning.
• Game Playing: RL can be used in game playing, such as tic-tac-toe, chess, etc.
• Chemistry: RL can be used for optimizing chemical reactions.
• Business: RL is now used for business strategy planning.
• Manufacturing: In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put them in containers.
• Finance Sector: RL is currently used in the finance sector for evaluating trading strategies.
Learning Models of Reinforcement:
Hidden Markov Models (HMM)
Hidden Markov Processes:
• The first order Markov process makes a very important simplification
to observed sequential data—the current system state depends only
on the previous system state.

• Additionally, hidden Markov models make one more important modification to the Markov process — the actual system states are assumed to be unobservable; they are hidden.
Hidden Markov Processes:
• For a sequence of hidden states Z, the hidden Markov process emits
a corresponding sequence of observable processes X. Using the
observed processes X, we try to guess what Z really is using hidden
Markov models!
Hidden Markov Models:
• As mentioned in the previous section, hidden Markov models are used
to model a hidden Markov process. Hidden Markov models are defined
by the following 3 model parameters:
• Initial hidden state probabilities 𝝅 = [𝝅₀, 𝝅₁, 𝝅₂, …]ᵀ. This vector
describes the initial probabilities of the system being in a particular
hidden state.
• Hidden state transition matrix A. Each row in A corresponds to a
particular hidden state, and the columns for each row contain the
transition probabilities from the current hidden state to a new hidden
state. For example, A[1, 2] contains the transition probability from
hidden state 1 to hidden state 2.
Hidden Markov Models:
• Observable emission probabilities 𝜽 = [𝜽₀, 𝜽₁, 𝜽₂, …]ᵀ. This vector
describes the emission probabilities for the observable process Xᵢ
given some hidden state Zᵢ.
• Let us next study an extremely simple case study to get a better
appreciation of how hidden Markov models work.
Guessing Someone’s Mood (Example):
• An example of a hidden Markov process is the guessing of someone’s
mood. We cannot directly observe or measure the mood of a person
(at least without sticking electrodes in the person’s brain), instead we
observe his or her facial features, and then try to guess the mood.
• We assume that moods can be described as a Markov process, and
that there are 2 possible moods — good and bad. We also assume
that there are 2 possible observable facial features — smiling and
frowning.
Initial Hidden State Probabilities:
• When we first meet someone, we assume that there is a 70% chance
that the person is in a good mood and a 30% chance that the person
is in a bad mood. These are our initial probabilities:
𝝅 = [0.7, 0.3]ᵀ, i.e.
P(initially good mood) = 0.7 and
P(initially bad mood) = 0.3.
Hidden State Transition Matrix:
• We also assume that when a person is in a good mood, moments later
there is a 80% chance that he or she will still be in a good mood, and a
20% chance that he or she will now be in a bad mood. We also assume
the same probabilities for the opposite situation in order to simplify
the problem.
• These are the elements of the hidden state transition matrix:
A = [[0.8, 0.2], [0.2, 0.8]], i.e.
P(good to good) = 0.8, P(good to bad) = 0.2,
P(bad to good) = 0.2, P(bad to bad) = 0.8.
Observable Emission Probabilities:
• Finally, we assume that when a person is in a good mood, there is a 90% chance that he or she will be smiling, and a 10% chance that he or she will be frowning. We also assume the same probabilities for the opposite situation in order to simplify the problem.
• These are the emission probabilities:
𝜽 = [[0.9, 0.1], [0.1, 0.9]], i.e.
P(smiling|good mood) = 0.9, P(frowning|good mood) = 0.1,
P(smiling|bad mood) = 0.1, P(frowning|bad mood) = 0.9.
Guessing Someone’s Mood from their Facial Features:

• Now, if for example we observed that for a couple of moments, someone had
a facial feature sequence of: [smiling, frowning], we can use the 3 model
parameters above to calculate the probabilities of their actual hidden moods:
• P([good, good]) = (𝝅[0]·𝜽[0, 0])·(A[0, 0]·𝜽[0, 1]) = (0.7·0.9)·(0.8·0.1) = 0.0504,
• P([good, bad]) = (𝝅[0]·𝜽[0, 0])·(A[0, 1]·𝜽[1, 1]) = (0.7·0.9)·(0.2·0.9) = 0.1134,
• P([bad, good]) = (𝝅[1]·𝜽[1, 0])·(A[1, 0]·𝜽[0, 1]) = (0.3·0.1)·(0.2·0.1) = 0.0006,
• P([bad, bad]) = (𝝅[1]·𝜽[1, 0])·(A[1, 1]·𝜽[1, 1]) = (0.3·0.1)·(0.8·0.9) = 0.0216.
Guessing Someone’s Mood from their Facial Features:

• By normalizing the sum of the 4 probabilities above to 1, we get the following normalized joint probabilities:
• P’([good, good]) = 0.0504 / 0.186 = 0.271,
• P’([good, bad]) = 0.1134 / 0.186 = 0.610,
• P’([bad, good]) = 0.0006 / 0.186 = 0.003,
• P’([bad, bad]) = 0.0216 / 0.186 = 0.116.
• From these normalized probabilities, it might appear that we already have an
answer to the best guess: the person’s mood was most likely: [good, bad].
However this is not the actual final result we are looking for when dealing
with hidden Markov models — we still have one more step to go in order to
marginalise the joint probabilities above.
Guessing Someone’s Mood from their Facial Features:

• We calculate the marginal mood probabilities for each element in the sequence to
get the probabilities that the 1st mood is good/bad, and the 2nd mood is good/bad:
• P’(1st mood is good) = P’([good, good]) + P’([good, bad]) = 0.881,
• P’(1st mood is bad) = P’([bad, good]) + P’([bad, bad]) = 0.119,
• P’(2nd mood is good) = P’([good, good]) + P’([bad, good]) = 0.274,
• P’(2nd mood is bad) = P’([good, bad]) + P’([bad, bad]) = 0.726.
• The optimal mood sequence is obtained by taking the more probable mood at each step: P’(1st mood is good) is larger than P’(1st mood is bad), and P’(2nd mood is good) is smaller than P’(2nd mood is bad). In this case, it turns out that the optimal mood sequence is indeed: [good, bad].
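For readers who want to verify these numbers, here is a small Python sketch that encodes the assumed model parameters 𝝅, A and 𝜽 from the mood example and reproduces the joint, normalized and marginal probabilities above. The variable names are illustrative, not taken from any particular HMM library.

```python
import numpy as np
from itertools import product

# Index 0 = good mood / smiling, index 1 = bad mood / frowning.
pi = np.array([0.7, 0.3])                    # initial hidden state probabilities
A = np.array([[0.8, 0.2], [0.2, 0.8]])       # hidden state transition matrix
theta = np.array([[0.9, 0.1], [0.1, 0.9]])   # emission probabilities theta[z, x]
obs = [0, 1]                                 # observed sequence: [smiling, frowning]

# Joint probability of each hidden mood sequence together with the observations.
joint = {}
for z in product([0, 1], repeat=len(obs)):
    p = pi[z[0]] * theta[z[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[z[t - 1], z[t]] * theta[z[t], obs[t]]
    joint[z] = p                             # e.g. joint[(0, 1)] = 0.1134

total = sum(joint.values())                  # 0.186
normalized = {z: p / total for z, p in joint.items()}

# Marginal probability that the mood at each step is good (index 0).
for t in range(len(obs)):
    p_good = sum(p for z, p in normalized.items() if z[t] == 0)
    print(f"P(mood {t + 1} is good) = {p_good:.3f}")   # 0.881 and 0.274
```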
Monte Carlo Learning
Monte Carlo Learning:
• Any method which solves a problem by generating suitable random
numbers, and observing that fraction of numbers obeying some property
or properties, can be classified as a Monte Carlo method.
• It’s used when there is no prior information of the environment and all the
information is essentially collected by experience.
• The Monte Carlo method for reinforcement learning learns directly from
episodes of experience without any prior knowledge of MDP transitions.
Here, the random component is the return or reward.
• One caveat is that it can only be applied to episodic MDPs. It's fair to ask why at this point. The reason is that the episode has to terminate before we can calculate any returns. Here, we don't do an update after every action, but rather after every episode.
Example:
• Let’s do a fun exercise where we will try to find
out the value of pi using pen and paper. Let’s
draw a square of unit length and draw a quarter
circle with unit length radius. Now, we have a
helper bot C3PO with us. It is tasked with putting
as many dots as possible on the square randomly
3,000 times, resulting in the following figure:
Example:
• C3PO needs to count each time it puts a dot inside the circle. The value of pi is then given approximately by:
π ≈ 4 · N / 3,000
• where N is the number of times a dot landed inside the quarter circle. As you can see, we did not do anything except count the random dots that fall inside the circle and then take a ratio to approximate the value of pi.
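A few lines of Python can stand in for C3PO. This is a sketch under the same assumptions as the example: 3,000 random dots in the unit square, counting those that land inside the quarter circle of radius 1.

```python
import random

# Drop random dots in the unit square and count how many land inside
# the quarter circle of radius 1 centred at the origin.
n_dots = 3000
inside = 0
for _ in range(n_dots):
    x, y = random.random(), random.random()   # a random dot in the unit square
    if x * x + y * y <= 1.0:                  # inside the quarter circle?
        inside += 1

# The quarter circle covers pi/4 of the unit square, so pi is about 4 * N / n_dots.
pi_estimate = 4 * inside / n_dots
print("Estimated pi:", pi_estimate)
```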
Monte Carlo Policy Evaluation:
• The goal here, again, is to learn the value function v_π(s) from episodes of experience under a policy π. Each episode is a sequence S1, A1, R2, ..., S_T generated by following π. Recall that the return is the total discounted reward:
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... + γ^(T-t-1)R_T
• Also recall that the value function is the expected return:
v_π(s) = E_π[G_t | S_t = s]
• We know that we can estimate any expected value simply by adding up samples and dividing by the total number of samples:
V(s) ≈ (1/N) Σᵢ G_{i,s}
Monte Carlo Policy Evaluation:
• i – Episode index
• s – Index of state
• The question is how do we get these sample returns? For that, we
need to play a bunch of episodes and generate them.
• For every episode we play, we’ll have a sequence of states and
rewards. And from these rewards, we can calculate the return by
definition, which is just the sum of all future rewards.
Monte Carlo Policy Evaluation:
First Visit Monte Carlo: average returns only for the first time s is visited in an episode. Here's a step-by-step view of how the algorithm works (a code sketch is given after this list):
1. Initialize the policy and the state-value function.
2. Generate an episode according to the current policy.
   2.1 Keep track of the states encountered through that episode.
3. Select each state encountered in 2.1.
   3.1 Add to a list the return received after the first occurrence of this state.
   3.2 Average over all returns.
   3.3 Set the value of the state as that computed average.
4. Repeat step 3 for all states encountered.
5. Repeat steps 2–4 until satisfied.
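Here is the sketch referred to above: a minimal first-visit Monte Carlo prediction loop in Python. The two-state environment, its rewards, and the generate_episode helper are made-up stand-ins, since the slides do not specify the actual dynamics.

```python
import random
from collections import defaultdict

gamma = 1.0

def generate_episode():
    """Return a list of (state, reward) pairs; each episode lasts 10 steps then terminates."""
    episode, state = [], "A"
    for _ in range(10):
        next_state = random.choice(["A", "B"])
        reward = random.choice([-4, -3, 2, 3, 4])    # hypothetical rewards
        episode.append((state, reward))
        state = next_state
    return episode

returns = defaultdict(list)                          # state -> list of sampled returns
for _ in range(1000):                                # play a bunch of episodes
    episode = generate_episode()
    G, returns_to_go = 0, [0] * len(episode)
    for t in reversed(range(len(episode))):          # compute returns backwards
        G = episode[t][1] + gamma * G
        returns_to_go[t] = G
    seen = set()
    for t, (state, _) in enumerate(episode):
        if state not in seen:                        # first visit only
            seen.add(state)
            returns[state].append(returns_to_go[t])

V = {s: sum(g) / len(g) for s, g in returns.items()}  # average the sampled returns
print(V)
```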
Monte Carlo Policy Evaluation:
Every visit Monte Carlo: Average returns for every time s is visited in an
episode.
For this algorithm, we just change step 3.1 to "Add to a list the return received after every occurrence of this state".
Let’s consider a simple example to further understand this concept.
Suppose there’s an environment where we have 2 states – A and B.
Let’s say we observed 2 sample episodes:
Monte Carlo Policy Evaluation:
A+3 => A indicates a transition from state A to state A, with a reward of +3. Let's find out the value function using both methods:
Q-Learning
Q-Learning:

• Q-learning is an off-policy RL algorithm, which is used for temporal difference learning. Temporal difference learning methods are a way of comparing temporally successive predictions.
• It learns the value function Q(s, a), which measures how good it is to take action "a" at a particular state "s".
• The below flowchart explains the working of Q-learning:
Q-Learning Explanation:

• Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
• The main objective of Q-learning is to learn a policy that can inform the agent what actions should be taken to maximize the reward under what circumstances.
• It is an off-policy RL method that attempts to find the best action to take in the current state.
• The goal of the agent in Q-learning is to maximize the value of Q.
• The Q-value can be derived from the Bellman equation. Consider the Bellman equation given below:
Q-Learning Explanation:

• In the equation, we have various components, including the reward, the discount factor (γ), a probability, and the end state s'. But no Q-value is given yet, so first consider the below image:
• In the image, we can see there is an agent who has three value options, V(s1), V(s2), V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (Up, Left, or Right), so it needs to decide where to go for the optimal path.
Q-Learning Explanation:

• Here the agent will make a move on a probability basis and change state. But if we want some exact moves, we need to make some changes in terms of the Q-value. Consider the below image:
• Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e. Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
Q-Learning Explanation:

• To perform any action, the agent will get a reward R(s, a), and it will also end up in a certain state s', so the Q-value equation will be:
Q(s, a) = R(s, a) + γ max [Q(s', a')]
• Hence, we can say that V(s) = max [Q(s, a)].
• The above formula is used to estimate the Q-values in Q-learning.
• What is 'Q' in Q-learning?
• The Q stands for quality in Q-learning, which means it specifies the quality of an action taken by the agent.
How to Make a Q-Table?

• While running our algorithm, we will come across various solutions and the agent will take multiple paths. How do we find the best among them? This is done by tabulating our findings in a table called a Q-table.
• A Q-table helps us find the best action for each state in the environment. We use the Bellman equation at each state to get the expected future state and reward and save it in a table to compare with other states.
• Let us create a Q-table for an agent that has to learn to run, fetch and sit on command. The steps taken to construct a Q-table are:
How to Make a Q-Table?

• Step 1: Create an initial Q-table with all values initialized to 0.
• When we initially start, the values of all states and rewards will be 0. Consider the Q-table shown below, which shows a dog simulator learning to perform actions:
Initial Q-Table
How to Make a Q-Table?

• Step 2: Choose an action and perform it. Update the values in the table.
• This is the starting point; we have performed no other action yet. Let us say that we want the agent to sit initially, which it does. The table will change to:
Q-Table after performing an action
How to Make a Q-Table?

• Step 3: Get the value of the reward and calculate the Q-value using the Bellman equation.
• For the action performed, we need to calculate the value of the actual reward and the Q(S, A) value.
Updating Q-Table with Bellman Equation
How to Make a Q-Table?

• Step 4: Continue until the table is filled or an episode ends.
• The agent continues taking actions, and for each action the reward and Q-value are calculated and the table is updated.
Final Q-Table at the end of an episode
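As a rough illustration of steps 2–4, here is a sketch of a tabular Q-learning update in Python. The dog-simulator actions and the env_step placeholder are assumptions for illustration; the update rule is the standard Q-learning rule Q(s,a) ← Q(s,a) + α[R + γ·max Q(s',a') − Q(s,a)].

```python
import random
from collections import defaultdict

# Hypothetical dog-simulator setup: three actions, epsilon-greedy action choice,
# and a Q-table stored as a dictionary keyed by (state, action).
actions = ["run", "fetch", "sit"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1                 # learning rate, discount, exploration

Q = defaultdict(float)                                # all Q-values start at 0 (Step 1)

def choose_action(state):
    """Epsilon-greedy: mostly pick the best known action, sometimes explore (Step 2)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)] (Step 3)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Usage inside an episode loop (env_step is a stand-in for the real environment, Step 4):
# action = choose_action(state)
# next_state, reward, done = env_step(state, action)
# update(state, action, reward, next_state)
```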
Example:
Why use Reinforcement Learning?

Here are the prime reasons for using Reinforcement Learning:
• It helps you to find which situations need an action.
• It helps you to discover which action yields the highest reward over the longer period.
• Reinforcement Learning also provides the learning agent with a reward function.
• It also allows the agent to figure out the best method for obtaining large rewards.
When Not to Use Reinforcement Learning?

You can’t apply a reinforcement learning model in every situation. Here are some conditions in which you should not use a reinforcement learning model:
• When you have enough data to solve the problem with a supervised learning method.
• Reinforcement Learning is computing-heavy and time-consuming, in particular when the action space is large.
Challenges of Reinforcement Learning:

Here are the major challenges you will face while doing Reinforcement Learning:
• Feature/reward design, which can be very involved.
• Parameters may affect the speed of learning.
• Realistic environments can have partial observability.
• Too much reinforcement may lead to an overload of states, which can diminish the results.
• Realistic environments can be non-stationary.
Conclusion:

From the above discussion, we can say that Reinforcement Learning is one of the most interesting and useful parts of Machine Learning. In RL, the agent learns by exploring the environment without any human intervention. It is a core learning paradigm in Artificial Intelligence. But there are some cases where it should not be used: for example, if you have enough data to solve the problem, then other ML algorithms can be used more efficiently. The main issue with RL algorithms is that some of the parameters, such as delayed feedback, may affect the speed of learning.
References:

• https://www.javatpoint.com/reinforcement-learning
• https://medium.com/@natsunoyuki/hidden-markov-models-with-python-c026f778dfa7
• https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-q-learning
• https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
• https://www.youtube.com/watch?v=J3qX50yyiU0&list=PL4gu8xQu0_5JBO1FKRO5p20wc8DprlOgn&index=122
• https://www.guru99.com/reinforcement-learning-tutorial.html
