Reinforcement Learning
• Reinforcement Learning (RL) is about learning the optimal behaviour
in an environment to obtain maximum reward
• Similar to children exploring the world around them and learning the
actions that help them achieve a goal
Reinforcement Learning
• Reinforcement Learning is a feedback-based machine learning
technique in which an agent learns to behave in an environment by
performing actions and observing the results of those actions. For
each good action the agent receives positive feedback, and for each
bad action it receives negative feedback or a penalty.
RL looks at the behaviour of an agent in some environment with the aim of
maximising some given reward
This means the agent wants to maximise not just the immediate reward
but the cumulative rewards that it will receive over time
Markov Decision Processes (MDP)
Mathematically, the interaction between the agent and the environment is
formalised as a Markov Decision Process, defined in terms of states, actions
and rewards as follows.
Markov Decision Processes (MDP)
In MDP, we have a set of states S, a set of actions A and a set of
rewards R. Assume that each of the sets has a finite number of
elements.
At each time step t = 0, 1, 2, …, the agent receives some representation
of the environment's state St ∈ S. Based on this state, the agent selects
an action At ∈ A, and this gives us the state–action pair (St, At)
Markov Decision Processes (MDP)
Time is then incremented to the next time step t + 1, and the environment
transitions to a new state St+1 ∈ S. At this point the agent receives a numerical
reward Rt+1 ∈ R for the action At taken from state St
We can think of the process of receiving a reward as an arbitrary function f that
maps state-action pairs to rewards.
So at time t we have
f(St, At) = Rt+1
The trajectory representing the sequential process of selecting an action from a
state, transitioning to a new state, and receiving a reward can be represented as
S0, A0, R1, S1, A1, R2, S2, A2, R3, ….
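A minimal sketch of this interaction loop (not tied to any particular RL library): the ToyEnv class below is hypothetical and invented purely to illustrate how the sequence S0, A0, R1, S1, A1, R2, … is generated.

```python
import random

class ToyEnv:
    """Hypothetical two-state environment, used only to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Invented dynamics: action 1 moves to state 1, which pays a reward of +1.
        next_state = 1 if action == 1 else 0
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward

env = ToyEnv()
state = env.reset()                          # S0
trajectory = []                              # will hold S0, A0, R1, S1, A1, R2, ...
for t in range(3):
    action = random.choice([0, 1])           # At, selected from the current state St
    next_state, reward = env.step(action)    # environment returns S(t+1) and R(t+1)
    trajectory += [state, action, reward]
    state = next_state
print(trajectory)
```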
Markov Decision Processes (MDP)
The agent learns by trial and error, and based on this experience it
learns to perform the task in a better way.
Terms used in Reinforcement Learning
• Agent: An entity that can perceive/explore the environment and act
upon it.
• Environment: The situation in which the agent is present or by which it
is surrounded. In RL the environment is assumed to be stochastic, which
means it is random in nature.
• Action: Actions are the moves taken by an agent within the
environment.
• State: State is the situation returned by the environment after each
action taken by the agent.
Terms used in Reinforcement Learning
• Reward: Feedback returned to the agent from the environment to
evaluate the agent's action.
• Policy: The strategy applied by the agent to choose the next action
based on the current state.
• Value: The expected long-term return, including the discount factor, as
opposed to the short-term reward.
• Q-value: Similar to the value, but it takes the current action a as an
additional parameter.
Return / Total Reward
The goal of the agent is to maximise its cumulative rewards
The return, or total reward, Rt can be expressed as the sum of the rewards
received from time t onwards:
Rt = Rt+1 + Rt+2 + Rt+3 + ….
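A small sketch of how such a return could be computed from a recorded list of rewards; the helper total_reward and the example rewards are hypothetical, and setting gamma = 1 gives the plain cumulative sum above.

```python
def total_reward(rewards, gamma=1.0):
    """Return from time t: the (optionally discounted) sum of all future rewards.

    rewards is the list [R(t+1), R(t+2), ...]; gamma = 1.0 gives the plain sum.
    """
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(total_reward([1.0, 0.0, 2.0]))         # 3.0
print(total_reward([1.0, 0.0, 2.0], 0.9))    # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```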
Approaches for Reinforcement Learning
• Value-based:
The value-based approach aims to find the optimal value function, which is
the maximum value achievable at a state under any policy.
In this approach the agent seeks the greatest long-term return it can expect
from any state s under a policy π.
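As an illustrative sketch of the value-based idea, the snippet below runs value iteration on a tiny hand-made MDP (the transition table and rewards are hypothetical): the optimal value of each state is obtained by repeatedly backing up the best one-step return over all actions.

```python
# Hypothetical 2-state, 2-action MDP: (state, action) -> (next_state, reward)
transitions = {
    (0, 0): (0, 0.0), (0, 1): (1, 1.0),
    (1, 0): (0, 0.0), (1, 1): (1, 2.0),
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}

# Value iteration: V(s) <- max over actions of [reward + gamma * V(next state)]
for _ in range(100):
    V = {s: max(r + gamma * V[s2]
                for (s_, a), (s2, r) in transitions.items() if s_ == s)
         for s in V}

print(V)   # V[s] approximates the maximum value achievable from s under any policy
```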
Approaches for Reinforcement Learning
• Policy-based:
The policy-based approach finds the optimal policy for maximising
future reward directly, without using a value function. In this approach,
the agent tries to apply a policy such that the action performed at each
step helps to maximise the future reward.
The policy-based approach has two main types of policy:
• Deterministic: the policy (π) produces the same action at any given state.
• Stochastic: the produced action is determined by a probability distribution.
Approaches for Reinforcement Learning
• Model-based:
In the model-based approach, a virtual model of the environment is
created, and the agent explores that environment in order to learn it.
There is no single solution or algorithm for this approach because
the model representation differs for each environment.
Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are
given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
Elements of Reinforcement Learning
1) Policy:
A policy can be defined as the way an agent behaves at a given time.
It maps the perceived states of the environment to the actions taken in those
states.
A policy is the core element of RL, as it alone can define the behaviour of the
agent.
It may be a simple function or a lookup table; in other cases it may involve more
general computation, such as a search process.
It can be a deterministic or a stochastic policy:
• For a deterministic policy: a = π(s)
• For a stochastic policy: π(a | s) = P[At = a | St = s]
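A minimal sketch of these two forms with hypothetical states and actions: a deterministic policy is just a lookup from state to action, while a stochastic policy assigns a probability to each action and is sampled.

```python
import random

# Deterministic policy: a = π(s), e.g. a simple lookup table
deterministic_policy = {"s0": "left", "s1": "right"}
a = deterministic_policy["s0"]               # always "left" in state s0

# Stochastic policy: π(a | s) = P[At = a | St = s]
stochastic_policy = {"s0": {"left": 0.8, "right": 0.2}}
probs = stochastic_policy["s0"]
a = random.choices(list(probs), weights=list(probs.values()))[0]  # "left" about 80% of the time
```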
Elements of Reinforcement Learning
2) Reward Signal:
The goal of reinforcement learning is defined by the reward signal.
At each state, the environment sends an immediate signal to the learning
agent, and this signal is known as a reward signal.
These rewards are given according to the good and bad actions taken by the
agent.
The agent's main objective is to maximise the total reward it receives for
good actions.
The reward signal can change the policy: if an action selected by the
agent leads to a low reward, the policy may change to select other actions
in the future.
Elements of Reinforcement Learning
3) Value Function:
The value function gives information about how good a situation or
action is, and how much reward an agent can expect from it.
A reward is the immediate signal for each good or bad action,
whereas the value function specifies which states and actions are good
in the long run.
The value function depends on the reward as, without reward, there
could be no value.
The goal of estimating values is to achieve more rewards.
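As a rough sketch of how values can be estimated from rewards, the snippet below averages the observed discounted returns per state, Monte Carlo style; the episodes and the discount factor are hypothetical.

```python
from collections import defaultdict

gamma = 0.9
# Hypothetical episodes: lists of (state, reward received after leaving that state)
episodes = [
    [("s0", 0.0), ("s1", 1.0), ("s2", 5.0)],
    [("s0", 0.0), ("s2", 5.0)],
]

returns = defaultdict(list)
for episode in episodes:
    G = 0.0
    # Work backwards so G accumulates the discounted future reward for each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns[state].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # a higher V[s] means more future reward can be expected from state s
```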
Elements of Reinforcement Learning
4) Model:
The model mimics the behaviour of the environment.
With the help of the model, one can make inferences about how the environment
will behave.
For example, given a state and an action, a model can predict the next state and
reward.
The model is used for planning, which means it provides a way to decide on a
course of action by considering possible future situations before actually
experiencing them.
Approaches that solve RL problems with the help of a model are termed
model-based approaches.
In contrast, an approach that does not use a model is called a model-free approach.
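A minimal sketch of planning with a model: here a hypothetical learned model is stored as a lookup table mapping (state, action) to a predicted next state and reward, and the agent compares courses of action entirely inside the model, without touching the real environment.

```python
# Hypothetical learned model: (state, action) -> (predicted next state, predicted reward)
model = {
    ("s0", "a0"): ("s0", 0.0), ("s0", "a1"): ("s1", 1.0),
    ("s1", "a0"): ("s0", 0.0), ("s1", "a1"): ("s1", 2.0),
}

def plan(state, depth):
    """Pick the action whose simulated rollout (greedy, fixed depth) yields the most reward."""
    def rollout(s, d):
        if d == 0:
            return 0.0
        return max(r + rollout(s2, d - 1)
                   for (s_, a), (s2, r) in model.items() if s_ == s)
    return max((a for (s_, a) in model if s_ == state),
               key=lambda a: model[(state, a)][1] + rollout(model[(state, a)][0], depth - 1))

print(plan("s0", depth=3))   # chooses "a1", based only on the model's predictions
```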
Q-function
The total reward, Rt, is the discounted sum of all rewards obtained
from time t:
Rt = Rt+1 + γ Rt+2 + γ² Rt+3 + …, where γ ∈ [0, 1] is the discount factor.
The Q-function Q(s, a) captures the expected total future reward an agent in
state s can obtain by taking action a:
Q(s, a) = E[Rt | St = s, At = a]
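As a sketch of how a Q-function can be learned from experience, the snippet below applies a standard tabular Q-learning update in a hypothetical two-state environment; the dynamics, learning rate and discount factor are invented for illustration.

```python
import random
from collections import defaultdict

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)                 # Q[(state, action)] -> estimated total future reward

def step(state, action):
    # Invented dynamics: action 1 reaches state 1, which pays a reward of +1.
    next_state = 1 if action == 1 else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    action = random.choice([0, 1])
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s, a) towards reward + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})   # action 1 ends up with the higher Q-value
```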
Reinforcement Learning vs Supervised Learning
• RL works by interacting with the environment; supervised learning works on an
existing dataset.
• The RL algorithm works like the human brain works when making decisions;
supervised learning works the way a human learns things under the supervision
of a guide.
• No previous training is provided to the learning agent; in supervised learning,
training is provided to the algorithm so that it can predict the output.
• RL helps to take decisions sequentially; in supervised learning, decisions are
made when the input is given.
Benefits of Reinforcement Learning
Focuses on the problem as a whole
Traditional ML algorithms are designed to excel at specific subtasks
RL does not divide the problem into subproblems
It directly works to maximise the long-term reward
RL understands the goal
Capable of trading off short-term rewards for long-term benefits
Benefits of Reinforcement Learning
Does not need a separate data collection step
Training data is obtained through the direct interaction with the
environment
Training data is the learning agent’s experience
No separate collection of data to be fed into the algorithm
Benefits of Reinforcement Learning
Works in dynamic, uncertain environments
RL algorithms are inherently adaptive, built to respond to changes in
the environment
Time matters
The experience collected by the agent is not independently and
identically distributed (i.i.d.)
Learning is adaptive