Reinforcement Learning

Reinforcement Learning
• Reinforcement Learning (RL) is about learning the optimal behaviour
in an environment to obtain maximum reward
• The optimal behaviour is learned through interactions with the
environment and observations of how it responds
• Similar to children exploring the world around them and learning the
actions that help them achieve a goal
Reinforcement Learning
• Reinforcement Learning is a feedback-based machine learning
technique in which an agent learns to behave in an environment by
performing actions and seeing the results of those actions. For each
good action, the agent gets positive feedback, and for each bad
action, the agent gets negative feedback or a penalty.
• RL looks at the behaviour of an agent in some environment in order to
maximise some given reward.
• RL algorithms study the behaviour of the objects in such an
environment and learn to optimise that behaviour.
Markov Decision Processes (MDP)

An MDP gives the formalisation of sequential decision making.
MDPs are the basis for problems solved with Reinforcement Learning.
In an MDP there is a decision maker, called an agent, that interacts with the
environment it is placed in; these interactions are done
sequentially over time.
Markov Decision Processes (MDP)
At each time step the agent will get some representation of the
environment’s state,
given this representation the agent selects an action to take,
which transforms the environment into some new state,
and the agent is given a reward as a result of its previous action
Markov Decision Processes (MDP)
The components of an MDP include
1. the environment,
2. the agent,
3. all the possible states of the environment,
4. all the actions that an agent can take in the environment, and
5. all the rewards that the agent can receive from taking actions
in the environment.
Markov Decision Processes (MDP)
This cycle of selecting an action from a given state,
transitioning to a new state, and
receiving a reward
happens sequentially over and over again, which creates a trajectory
that shows the sequence of states, actions, and rewards.
Markov Decision Processes (MDP)
Throughout the process, the goal of the agent is to maximise the total
amount of rewards that it receives from taking actions in given states of
the environment.

This means the agent wants to maximise not just the immediate reward
but the cumulative rewards that it will receive over time
Markov Decision Processes (MDP)
Mathematically:
Markov Decision Processes (MDP)
In an MDP, we have a set of states S, a set of actions A, and a set of
rewards R. Assume that each of these sets has a finite number of
elements.
At each time step t = 0, 1, 2, …, the agent receives some representation
of the environment’s state St ∈ S. Based on this state, the agent selects
an action At ∈ A, and this gives us the state-action pair (St, At)
Markov Decision Processes (MDP)
Time is then incremented to the next time step t + 1, and the environment is
transitioned to a new state St+1 ∈ S. At this time the agent receives a numerical
reward Rt+1 ∈ R for the action At taken from state St.
We can think of the process of receiving a reward as an arbitrary function f that
maps state-action pairs to rewards.
So at time t we have
f(St, At) = Rt+1
The trajectory representing the sequential process of selecting an action from a
state, transitioning to a new state, and receiving a reward can be represented as
S0, A0, R1, S1, A1, R2, S2, A2, R3, …
Markov Decision Processes (MDP)

1. At time t, the environment is in state St.
2. The agent observes the current state and selects action At.
3. The environment transitions to state St+1 and grants the agent reward Rt+1.
4. The process starts over for the time step t + 1 (a minimal code sketch of this loop follows below).
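To make the loop concrete, here is a minimal Python sketch of the agent-environment interaction described in steps 1-4. The GridEnvironment and RandomAgent classes are illustrative stand-ins invented for this sketch, not something defined in these slides.

```python
import random

class GridEnvironment:
    """Toy environment: states 0..4; reaching state 4 gives reward +1 and ends the episode."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # 3. The environment transitions to S_{t+1} and grants reward R_{t+1}.
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    """Placeholder agent that acts at random; a learning agent would use a policy instead."""
    def select_action(self, state):
        return random.choice([-1, +1])   # move left or right

env, agent = GridEnvironment(), RandomAgent()
state, trajectory = env.state, []        # trajectory holds (S_t, A_t, R_{t+1}) triples

for t in range(100):                     # cap the episode length
    action = agent.select_action(state)             # 2. observe S_t, select A_t
    next_state, reward, done = env.step(action)     # 3. receive S_{t+1} and R_{t+1}
    trajectory.append((state, action, reward))
    state = next_state                               # 4. repeat for time step t + 1
    if done:
        break

print(trajectory)   # e.g. [(0, 1, 0.0), (1, -1, 0.0), ...]
```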
Markov Decision Processes (MDP)
Deep Reinforcement Learning Algorithms
Deep Reinforcement Learning Algorithms
Value Learning
How good is a specific action or a given state for the agent?
Selecting one action or another in a given state may increase or decrease the
agent’s return,
so knowing this in advance will help the agent decide which action to take
in which state in order to maximise its expected return.
Value functions are functions of states or of state-action pairs that estimate
how good it is for an agent to be in a given state, or how good it is for the
agent to perform a given action in a given state.
This notion of how good a state or a state-action pair is, is given in
terms of expected return.
Value Learning tries to estimate the Q function,
given our state and action,
and then uses that Q function to optimise for the best action to take
given the particular state that we find ourselves in.
Deep Reinforcement Learning Algorithms
Policy
• We’d like to know how probable it is for an agent to take any given
action from any given state
• What is the probability that an agent will select a specific action from
a specific state?
• A policy is a function that maps a given state to probabilities of
selecting each possible action from that state
Policy Learning
Instead of first optimising the Q function, finding the Q values, and
then using that Q function to optimise our actions,
what if we try to directly optimise our policy, which tells us what action to
take based on the particular state that we find ourselves in?
If we can find this function, then we can directly sample
from that policy distribution to obtain the optimal action.
Reinforcement Learning
In Reinforcement Learning, the agent learns automatically using
feedback, without any labelled data.
Since there is no labelled data, the agent is bound to learn from its
experience only.
RL solves a specific type of problem where decision making is
sequential and the goal is long-term.
Reinforcement Learning
The agent interacts with the environment and explores it by itself.
The primary goal of an agent in reinforcement learning is to improve
its performance by getting the maximum positive reward.
The agent learns through trial and error, and based on this
experience, it learns to perform the task in a better way.
Terms used in Reinforcement Learning
• Agent: An entity that can perceive/explore the environment and act
upon it.
• Environment: The situation in which the agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is
random in nature.
• Action: Actions are the moves taken by an agent within the
environment.
• State: The situation returned by the environment after each
action taken by the agent.
Terms used in Reinforcement Learning
• Reward: Feedback returned to the agent from the environment to
evaluate the agent's action.
• Policy: A strategy applied by the agent to choose the next action
based on the current state.
• Value: The expected long-term return with the discount factor, as
opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one additional
parameter, the current action (a).
Return / Total Reward
The goal of the agent is to maximise its cumulative rewards.
The return, or total reward Rt, can be expressed as
Rt = rt + rt+1 + rt+2 + … + rT
up to the final time step T.
This works well with episodic tasks but not with continuing tasks.
Continuing tasks make this definition of the return problematic because the
final time step T would be infinite.
It is better to maximise the expected discounted return:
• Include the discount rate γ, giving Rt = rt + γrt+1 + γ²rt+2 + …
• We place the gamma factor in front of the rewards, so the discounting factor is
multiplied by every future reward that the agent gets.
• The reason is that this dampening factor is designed to make future rewards
worth less than rewards we might see at this instant.
• This enforces a degree of short-term greediness in the algorithm; the discounting
factor is between 0 and 1 (a short sketch of the computation follows below).
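As a small illustration of the discounted return described above, the following sketch shows how gamma dampens future rewards; the function name and reward values are illustrative choices, not from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., with 0 < gamma < 1."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma          # each step further into the future is worth less
    return total

# The same future reward is worth less the further away it is,
# and a smaller gamma makes the agent more short-term greedy:
print(round(discounted_return([0.0, 0.0, 1.0], gamma=0.9), 2))   # 0.81
print(round(discounted_return([0.0, 0.0, 1.0], gamma=0.5), 2))   # 0.25
```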
Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment and what
actions need to be taken.
• It is based on the trial and error process.
• The agent takes the next action and changes states according to the
feedback of the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it in
order to get the maximum positive reward.
Approaches for Reinforcement Learning
1. Value-based
2. Policy-based
3. Model-based

Value-based:

The value-based approach is about finding the optimal value function, which is
the maximum value attainable at a state under any policy.
The value is the long-term return the agent expects at any state s under policy
π.
Approaches for Reinforcement Learning
• Policy-based:
The policy-based approach finds the optimal policy for the maximum
future reward without using the value function. In this approach, the
agent tries to apply a policy such that the action performed at each
step helps to maximise the future reward.
The policy-based approach has two main types of policy:
• Deterministic: the same action is produced by the policy (π) in a given state.
• Stochastic: a probability distribution determines the action produced.
Approaches for Reinforcement Learning
Model-based:
In the model-based approach, a virtual model of the environment is created,
and the agent explores that environment to learn it.
There is no single solution or algorithm for this approach because
the model representation is different for each environment.
Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are
given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
Elements of Reinforcement Learning
1) Policy:
A policy can be defined as the way an agent behaves at a given time.
It maps the perceived states of the environment to the actions to take in those
states.
A policy is the core element of RL, as it alone can define the behaviour of the
agent.
It may be a simple function or a lookup table; in other cases it may
involve general computation such as a search process.
A policy can be deterministic or stochastic (a small sketch of both follows below):
• Deterministic policy: a = π(s)
• Stochastic policy: π(a | s) = P[At = a | St = s]
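A minimal sketch of the two kinds of policy; the state names and probabilities are made up purely for illustration.

```python
import random

# Deterministic policy: a = pi(s) -- always the same action in a given state.
def deterministic_policy(state):
    return "left" if state == "ball_on_left" else "right"

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s] -- a distribution over actions.
def stochastic_policy(state):
    probs = {"left": 0.7, "right": 0.2, "stay": 0.1}   # illustrative probabilities
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy("ball_on_left"))   # always "left"
print(stochastic_policy("ball_on_left"))      # "left" about 70% of the time
```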
Elements of Reinforcement Learning
2) Reward Signal:
The goal of reinforcement learning is defined by the reward signal.
At each state, the environment sends an immediate signal to the learning
agent, and this signal is known as a reward signal.
These rewards are given according to the good and bad actions taken by the
agent.
The agent's main objective is to maximise the total reward it receives for
good actions.
The reward signal can change the policy: if an action selected by the
agent leads to a low reward, the policy may change to select other actions
in the future.
Elements of Reinforcement Learning
3) Value Function:
The value function gives information about how good the situation and
action are and how much reward an agent can expect.
A reward is the immediate signal for each good and bad action,
whereas a value function specifies which states and actions are good over
the long run.
The value function depends on the reward as, without reward, there
could be no value.
The goal of estimating values is to achieve more rewards.
Elements of Reinforcement Learning
• 4) Model:
• The model mimics the behaviour of the environment.
• With the help of the model, one can make inferences about how the environment will
behave.
• For example, given a state and an action, the model can predict the next state and
reward.
• The model is used for planning, which means it provides a way to choose a course of
action by considering future situations before actually experiencing those
situations.
• Approaches that solve RL problems with the help of a model are termed
model-based approaches.
• Comparatively, an approach that does not use a model is called a model-free approach.
Q-function
The total reward, Rt, is the discounted sum of all rewards obtained
from time t:
Rt = rt + γrt+1 + γ²rt+2 + …
The Q-function captures the expected total future reward an agent in
state s can receive by executing a certain action a:
Q(s, a) = E[Rt | st = s, at = a]
Q-function
So st is the state at time t
at is the action you may want to take at time t
And the Q function of these two inputs captures
what the expected total return of the agent would be if it took that
action in that particular state.
Q-function
The Q function tells us, for any possible action, what the expected
return for that action is;
so given a specific state,
we ultimately need to figure out which action is the best one to take.
Q-function
The way we do that from a Q function is to pick the action that will
maximise our future reward, and we can simply try the actions out.
If we have a discrete action space, we can try out all possible actions,
compute the Q value for every possible action
based on the state that we currently find ourselves in, and pick the
action that results in the highest Q value (see the sketch below).
If we have a continuous action space, we follow the gradients along the
Q-value curve,
maximising it as part of an optimisation procedure.
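A minimal sketch of the "try every action and take the argmax" step for a discrete action space. The Q table and its values are illustrative; in practice the Q values would come from a learned function.

```python
# Suppose the Q-function is stored as a table: Q[state][action] -> expected return.
Q = {
    "s0": {"left": 20.0, "stay": 3.0, "right": 0.0},   # illustrative values
}

def best_action(Q, state):
    """Evaluate every possible action in this state and pick the one with the highest Q value."""
    q_values = Q[state]
    return max(q_values, key=q_values.get)

print(best_action(Q, "s0"))   # "left"
```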
Q-function
e.g. Atari
Q-function
How can we train a NN to learn this
Q function?
• The Q function takes two things as input,
• so our NN will also take two inputs: the state of the board (the pixels of
the screen,
• i.e. the image of the board at a particular time)
• and an action. The actions that a NN or
agent could take in this game are to move to the right, move to the left,
or stay put: 3 different actions that can be parameterised as input
to the NN. The goal is to estimate a single output number that
measures the expected value, or the expected Q value, of this
particular state-action pair.
Q-function and Policy
• If you want to evaluate a very large action space,
• it will be inefficient to use the approach above (feeding in one state-action
pair at a time),
• because you would have to run your NN forward many different
times, one time for every single element of the action space.
• Instead you provide only the state as input, and as output the network
gives you all n different Q values,
• one Q value for every single possible action.
Q-function and Policy
Q-function and Policy
• That way you only need to run your NN once for the given state that
you are in.
• The NN will tell you, for all possible actions, what their Q values are.
• The agent looks at the output and picks the action that has the highest Q
value (a minimal network sketch follows below).
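A minimal sketch of such a network, assuming a PyTorch-style setup is available. The input size and hidden-layer width are illustrative assumptions, not values from the slides; the three outputs correspond to the left/stay/right actions.

```python
import torch
import torch.nn as nn

n_state_features = 84 * 84     # e.g. a flattened greyscale game frame (assumed size)
n_actions = 3                  # left, stay, right

# One forward pass per state: the network outputs all n Q values at once.
q_network = nn.Sequential(
    nn.Linear(n_state_features, 128),
    nn.ReLU(),
    nn.Linear(128, n_actions),
)

state = torch.rand(1, n_state_features)     # a dummy observation
q_values = q_network(state)                 # shape (1, 3): one Q value per action
action = torch.argmax(q_values, dim=1)      # pick the action with the highest Q value
print(q_values, action)
```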
How do we train the network?
We want to use a NN to learn the Q-function and then use it to infer the
optimal policy.
The deep NN that we want to train looks like this:
Q-function and Policy
• It takes a state as input and tries to output n different numbers which
correspond to the Q values associated with n different actions, one Q value
per action.
• In Atari Breakout there are 3 actions: go left, go right, or do nothing.
• The next step, once we have this Q-value output, is to develop a policy function.
• A policy function is a function that, given a state, determines the
best action.
• The Q function tells us, given a state and the actions, the value (the return) of
every action we could take.
• The policy goes one step further: given a state, what is the right action?
Q-function and Policy
Q-function and Policy
• The policy function can be determined directly from the Q function by
maximising over the Q values for all the different actions.
• In the example, given this state, the Q function has a value of 20 if the agent
goes to the left, a Q value of 3 if it stays put, and a Q value of 0 if it
goes to the right.
• In order to continue the game, it needs to move to the left.
Q-function and Policy
• The optimal action is the one with the maximum of these Q values.
• This action can be sent back to the environment, in the form of the
game, to execute the next step.
• As the agent moves through this environment, it is going to be presented not
only with new pixels that come from the game, but more importantly with
a reward signal.
• The reward signals in Atari Breakout are very sparse.
Q-function and Policy
Q-function and Policy
The difference is that instead of trying to infer the policy from the Q
function, we build a NN that will directly learn the policy function from the
data.
Q-function and Policy
Q function: we build a NN that outputs the Q values, one
value per action, and we determine the policy by looking over the set
of Q values, picking the one that is highest, and taking its
corresponding action.
Q-function and Policy
• With Policy Networks:
• Instead of predicting the Q values themselves, let's directly try to
optimise the policy function π(s).
• π(s) is our policy, s is the state.
• So it's a function that takes the state as input and is going to directly
output the action.
• The outputs give us the desired action that we should take in any
given state that we find ourselves in.
Q-function and Policy
• The output represents not only the best action that could be taken, but
also the probability that selecting that action would result in a
very desirable outcome,
• i.e. the probability that selecting that action will yield the highest value:
• what is the likelihood that selecting this action will give the best
performing return that you could expect?
Q-function and Policy
Q-function and Policy
• For example: going left 90%, staying 10%, going right 0%.
• Because it is a probability distribution, all the numbers have to sum to one.
• This is easier to train than a Q function.
• Having the output in the form of a probability distribution allows the
action space to be continuous,
• so the network can output not only which action should be taken (left,
right, or stay) but also, say, with what speed it should be moving
(a minimal sketch of such a policy network follows below).
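A minimal sketch of a policy network, again assuming PyTorch is available; layer sizes and input dimensionality are illustrative. The softmax output layer makes the n outputs a probability distribution over actions that sums to one, and the action is sampled from it.

```python
import torch
import torch.nn as nn

n_state_features = 84 * 84     # assumed flattened frame size
n_actions = 3                  # left, stay, right

policy_network = nn.Sequential(
    nn.Linear(n_state_features, 128),
    nn.ReLU(),
    nn.Linear(128, n_actions),
    nn.Softmax(dim=1),          # outputs pi(a|s): one probability per action, summing to 1
)

state = torch.rand(1, n_state_features)
action_probs = policy_network(state)                     # roughly uniform for an untrained net
action = torch.multinomial(action_probs, num_samples=1)  # sample an action from pi(a|s)
print(action_probs, action)
```

For a continuous quantity such as speed, the final layer could instead output the parameters of a distribution (for example the mean and variance of a Gaussian) from which the action is sampled.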
How does Reinforcement Learning Work?
Consider the following:
Environment: It can be anything such as a room, maze, football ground,
etc.
Agent: An intelligent agent such as an AI robot.
Take an example of a maze environment that the agent needs to
explore.
How does Reinforcement Learning Work?
The agent is at the very first block of the maze.
The maze consists of an S6 block (a solid wall), a fire pit at S8, and a diamond at
S4.
The agent cannot cross the S6 block, as it is a solid wall.
If the agent reaches the S4 block, it gets the +1 reward;
if it reaches the fire pit, it gets a -1 reward.
It can take four actions: move up, move down, move left, and move
right.
How does Reinforcement Learning Work?
The agent can take any path to reach the final point,
but it needs to do so in as few steps as possible.
Suppose the agent considers the path S9-S5-S1-S2-S3 so as to get the
+1 reward.
The agent will try to remember the preceding steps that it has taken to
reach the final step.
To memorise the steps, it assigns a value of 1 to each previous step.
How does Reinforcement Learning Work?
How does Reinforcement Learning Work?
We need to use the Bellman equation.
It is associated with dynamic programming and
is used to calculate the value of a decision problem at a certain point by
including the value of the state that follows.
Bellman Equation
It is a main concept behind RL.
The key elements used in the Bellman equation are:
1. The action performed by the agent, referred to as "a".
2. The state the agent is in, referred to as "s".
3. The reward/feedback obtained for each good and bad action, "R".
4. The discount factor, gamma "γ".

V(s) = max[R(s,a) + γV(s')]


Bellman Equation

V(s) = max[R(s,a) + γV(s')]

Where:
• V(s) = the value calculated at a particular state.
• R(s,a) = the reward obtained at state s by performing action a.
• γ = the discount factor.
• V(s') = the value of the next state (the state the action leads to).
• In the above equation, we take the max over all possible actions because
the agent always tries to find the optimal solution.
Bellman Equation
Using the Bellman equation, we will find the value of each state of the
given environment.
We will start from the block that is next to the target block.
How does Reinforcement Learning Work?
For the 1st block:
V(s3) = max[R(s,a) + γV(s')], here V(s') = 0 because there is no further
state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1
How does Reinforcement Learning Work?
For the 2nd block:
V(s2) = max[R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0,
because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9


How does Reinforcement Learning Work?
For the 3rd block:
V(s1) = max[R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0,
because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81


How does Reinforcement Learning Work?
For the 4th block:
V(s5) = max[R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0,
because there is no reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

For the 5th block:
V(s9) = max[R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0,
because there is no reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
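A short sketch that reproduces the backups above for the path S9-S5-S1-S2-S3, assuming γ = 0.9, a +1 reward only for the step from s3 into the diamond, and (as a simplification of the max) only the single action along the chosen path at each state.

```python
gamma = 0.9

# Work backwards from the block next to the diamond (s3) along the path s3, s2, s1, s5, s9.
path = ["s3", "s2", "s1", "s5", "s9"]
reward_on_step = {"s3": 1.0, "s2": 0.0, "s1": 0.0, "s5": 0.0, "s9": 0.0}

V = {}
next_value = 0.0                      # no further state beyond the diamond
for s in path:
    V[s] = reward_on_step[s] + gamma * next_value    # V(s) = R(s,a) + gamma * V(s')
    next_value = V[s]

print({s: round(v, 2) for s, v in V.items()})
# {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}
```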
How does Reinforcement Learning Work?
How does Reinforcement Learning Work?
How does Reinforcement Learning Work?
The agent has three options to move:
if it moves to the blue box, it will feel a bump;
if it moves to the fire pit, it will get the -1 reward.
Since we are considering only positive rewards,
it will move upwards only.
How does Reinforcement Learning Work?
Uses of Reinforcement Learning
Reinforcement learning is designed to maximise the rewards earned by
agents while they accomplish a specific task.
RL is beneficial for several real-life scenarios and applications.
These include autonomous cars, robotics, robotic surgeons, and even AI bots.
Uses of Reinforcement Learning
1. Managing self-driving cars
Vehicles operating autonomously in an urban environment need
substantial support from ML models that simulate all the possible
scenarios or scenes that the vehicle may encounter.
RL models are trained in a dynamic environment, wherein all the
possible pathways are studied and sorted through the learning
process.
Learning from experience makes RL well suited to self-driving cars
that need to make optimal decisions on the fly.
Uses of Reinforcement Learning
1. Managing self-driving cars

Several variables, such as


managing driving zones,
handling traffic,
monitoring vehicle speeds, and
controlling accidents,
are handled well through RL methods.
Uses of Reinforcement Learning
2. Traffic signal control
RL models can be used for management of traffic congestion in urban
environments
RL models introduce traffic light control based on the traffic status
within a locality
The model considers the traffic from multiple directions and then
learns, adapts, and adjusts traffic light signals in urban traffic networks.
Uses of Reinforcement Learning
3. Healthcare

RL supports medical professionals in managing patients' health in the
healthcare sector through DTRs (Dynamic Treatment Regimes).
DTRs use a sequence of decisions to come up with a final solution.
Uses of Reinforcement Learning
3. Healthcare

DTRs use a sequence of decisions to come up with a final solution.

This sequential process may involve the following steps:


Determine the patient's current (live) status
Decide the treatment type
Discover the appropriate medication dosage based on the patient’s state
Decide dosage timings, etc.
Uses of Reinforcement Learning
3. Healthcare
With this sequence of decisions, doctors can fine-tune their treatment
strategy and
diagnose complex diseases such as
mental fatigue,
diabetes,
cancer, etc.
DTRs can further help in offering treatments at the right time, without
any complications arising due to delayed actions.
Uses of Reinforcement Learning
4. Robotics
Robotics is a field that trains a robot to mimic human behaviour as it
performs a task.
Deep Learning and RL can be blended (Deep Reinforcement Learning)
to get better results
Deep RL models are trained on multimodal data that are key to
identifying missing parts, cracks, scratches, or overall damage to
machines in warehouses by scanning images with billions of data
points.
Uses of Reinforcement Learning
5. Gaming

Reinforcement learning agents learn and adapt to the gaming environment


as they continue to apply logic through their experiences and achieve the
desired results by performing a sequence of steps.
RL agents are employed for game testing and bug detection within the
gaming environment.
Potential bugs are easily identified as RL runs multiple iterations without
external intervention.
Reinforcement Learning and Supervised
Learning
Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing dataset.
The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns under the supervision of a guide.
No labelled dataset is present. | A labelled dataset is present.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
RL makes decisions sequentially. | In supervised learning, a decision is made when the input is given.
Benefits of Reinforcement Learning
Focuses on the problem as a whole
Traditional ML algorithms are designed to excel at specific subtasks
RL does not divide the problem into subproblems
It directly works to maximise the long-term reward
RL understands the goal
Capable of trading off short-term rewards for long-term benefits
Benefits of Reinforcement Learning
Does not need a separate data collection step
Training data is obtained through the direct interaction with the
environment
Training data is the learning agent’s experience
No separate collection of data to be fed into the algorithm
Benefits of Reinforcement Learning
Works in dynamic, uncertain environments
RL algorithms are inherently adaptive, built to respond to changes in
the environment
Time matters
The experience collected by the agent is not independently and
identically distributed (i.i.d.)
Learning is adaptive
