
Computer Science & Engineering, Khulna University

Artificial Intelligence
Reinforcement Learning

Dr. Amit Kumar Mondal,
Associate Professor, CSE KU

1
References
● Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, Second Edition.
● Mondal, Amit Kumar. "A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions." arXiv preprint arXiv:2001.06921 (2020).

2
Class Outline
● Reinforcement in Learning Classifier Systems
● Definition
● Basic Architecture
● Baseline Techniques
● Future Work Opportunities

3
Recap: Learning Agent

4
RL is an MDP

5
Reinforcement Learning vs Supervised Learning
● Reinforcement learning (RL) techniques mimic human tasks such as playing games [13], grabbing and fetching objects/boxes [1], driving cars [32], and so on.
● In reinforcement learning, an agent must interact with its environment via perception and action (see the interaction-loop sketch after this list).
● The main difference -- the agent is not given input/output examples; rather, the agent itself performs actions and receives an immediate reward along with the subsequent state/path to explore.
● The agent builds experience about the probable states, actions, transitions and rewards of the system in order to perform effectively and on-line.
● Challenges of an RL approach -- the environment model, exploration, exploitation, policy handling, convergence rate, learning speed, and so on.
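Below is a minimal sketch of that perception-action loop, assuming a hypothetical GridEnvironment with Gym-style reset/step methods and a purely random policy; it only illustrates the loop structure and is not the lecture's reference implementation.

```python
import random

class GridEnvironment:
    """Hypothetical 1-D grid: the agent starts at cell 0 and is rewarded at the last cell."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = max(0, min(self.n_states - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done

env = GridEnvironment()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])           # perceive the state, pick an action (here: at random)
    state, reward, done = env.step(action)   # the environment returns the reward and the next state
    total_reward += reward                   # the agent's objective is to maximize accumulated reward
print("episode return:", total_reward)
```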

6
LCS and Reinforcement Learning

7
LCS and Reinforcement Learning
● A learning classifier system (LCS) interacts with the real world, from which it obtains feedback, mostly in the form of a numerical reward (R).
● Learning is driven by trying to maximize the amount of reward received.
● LCS consists of four components
➢ A finite population of condition-action rules, called classifiers, that represents the current
knowledge of the system;
➢ The performance component -- governs the interaction with the environment;
➢ The reinforcement component (or credit assignment component) -- distributes the
reward received from the environment to the classifiers accountable for the rewards
obtained;
➢ The discovery component -- responsible for discovering better rules and improving
existing ones through a genetic algorithm.
● Reinforcement learning is essential for capturing the diachronic behaviours of an intelligent
system.
8
Reinforcement Learning Basics
● Reinforcement learning is formally defined by
➢ A finite set of states S;
➢ A finite set of actions A;
➢ A transition function T : S × A → Π(S), which assigns to each state-action pair a probability distribution over the set S; and
➢ A reward function R : S × A → ℝ.
● Given a reinforcement learning problem modelled as an MDP, the Q-learning algorithm converges with probability one to the optimal action-value function Q*, which maps state-action pairs to the associated expected payoff.
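To make the formal definition concrete, here is a minimal, hypothetical encoding of (S, A, T, R) in Python; the states, actions, probabilities and rewards are invented for illustration only.

```python
import random

S = ["s0", "s1", "s2"]           # finite set of states
A = ["left", "right"]            # finite set of actions

# T : S x A -> Pi(S): each state-action pair maps to a distribution over next states
T = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s1", "left"):  {"s0": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
}

# R : S x A -> real-valued reward
R = {(s, a): 0.0 for s in S for a in A}
R[("s1", "right")] = 1.0         # moving right from s1 (towards s2) is rewarded

def sample_next_state(s, a):
    """Draw the next state from the probability distribution T(s, a)."""
    dist = T[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

print(sample_next_state("s0", "right"))
```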

9
Reinforcement Learning Actions

10
Q-Table

11
Reinforcement Learning Basics
● Q-learning computes, by successive approximations, the table of all values Q(s, a), called the Q-table.
● Q(s, a) is defined as the payoff predicted under the hypothesis that the agent performs action a in state s, and then carries on always selecting the actions that predict the highest payoff.
● For each state-action pair, Q(s, a) is initialized with a random value and updated at each time step t (t > 0) at which action a_{t-1} has been performed in state s_{t-1}, according to the formula:

Q(s_{t-1}, a_{t-1}) ← Q(s_{t-1}, a_{t-1}) + α [ r + γ max_a Q(s_t, a) - Q(s_{t-1}, a_{t-1}) ]

The term α is the learning rate (0 < α < 1); γ is the discount factor, which affects how much future rewards are valued at present; r is the reward received for performing a_{t-1} in state s_{t-1}; and s_t is the state the agent encounters after performing a_{t-1}.
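As a concrete illustration of this update rule, here is a minimal tabular Q-learning sketch on a toy chain environment; the environment, hyperparameters, and zero (rather than random) initialization are simplifications chosen for brevity, not taken from the lecture.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.3    # learning rate, discount factor, exploration rate
n_states, actions = 5, [0, 1]            # toy chain: 0 = left, 1 = right; the last state is the goal
Q = defaultdict(float)                   # Q-table, Q[(s, a)], initialized to 0 for brevity

def step(s, a):
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s_next, (1.0 if s_next == n_states - 1 else 0.0), s_next == n_states - 1

for episode in range(300):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s_next, r, done = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = r + gamma * max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
        s = s_next

print({s: round(max(Q[(s, a)] for a in actions), 2) for s in range(n_states)})
```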
12
Reinforcement Learning Performance
● Tabular Q-learning is simple and easy to implement, but it is infeasible for problems of interest because the size of the Q-table (which is |S × A|) grows exponentially in the problem dimensions -- generalization is essential.
● The performance of learning is measured with three major metrics [14]:
➢ Eventual convergence to optimality -- many algorithms come with a provable guarantee of asymptotic convergence to optimal behavior;
➢ Speed of convergence to optimality -- the speed of convergence to near-optimality, or the level of performance after a given time; and
➢ Regret -- the expected decrease in reward incurred by executing the learning algorithm instead of behaving optimally from the very beginning; it penalizes mistakes wherever they occur during the run (see the sketch below).
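A back-of-the-envelope sketch of the regret metric, assuming we know the per-episode return of an optimal policy (a made-up constant here) and have logged the learner's returns; it is only meant to make the definition concrete.

```python
# Hypothetical logged per-episode returns of a learning agent and the (assumed known)
# return an optimal policy would have achieved in every episode.
learner_returns = [0.2, 0.4, 0.5, 0.7, 0.9, 1.0, 1.0, 1.0]
optimal_return = 1.0

# Regret: total reward lost relative to behaving optimally from the very beginning.
regret = sum(optimal_return - g for g in learner_returns)
print("cumulative regret:", regret)   # 2.3 for the numbers above
```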
13
Example: 7-DOF Robot Arm

14
Example
Three tasks of the robot:
● Pushing -- a box is placed on a table in front of the robot, and the task is to move it to a target location on the table. The learned behaviour is a mixture of pushing and rolling.
● Sliding -- a puck is placed on a long, slippery table, and the target position is outside of the robot's reach, so the robot has to hit the puck with such force that it slides and then stops in the appropriate place due to friction.
● Pick-and-place -- similar to pushing, but the target position is in the air.

15
16
Example
➢ The system state (s) is represented in joint coordinates. Goals can be defined as the desired position of the object, depending on the task (Eqn 1).
➢ s_object is the position of the object in the state s.
➢ The policy is given as input:
✓ the absolute position of the gripper,
✓ the relative position of the object and the target,
✓ as well as the distance between the fingers.

17


Example
➢ The Q-function can additionally be given [1]:
✓ the linear velocity of the gripper and fingers,
✓ along with the relative linear and angular velocity of the object.
➢ The action space (A) is usually 4-dimensional:
✓ three dimensions indicate the desired relative gripper position at the next time step t;
✓ the last dimension defines the desired distance between the two fingers, which are position-controlled.
➢ A strategy is defined for sampling goals such as Eqn (1). A sketch of this observation/action layout is given below.
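To make the state/action layout concrete, here is a hypothetical sketch of how such observation and 4-D action vectors might be assembled with NumPy; the field names and numbers are invented for illustration and are not taken from [1].

```python
import numpy as np

# Hypothetical raw quantities from a simulated 7-DOF arm (values are made up).
gripper_pos = np.array([0.50, 0.10, 0.42])   # absolute gripper position
object_pos  = np.array([0.55, 0.05, 0.40])   # absolute object position
target_pos  = np.array([0.70, 0.00, 0.40])   # goal position for the object
finger_dist = np.array([0.03])               # distance between the two fingers

# Policy input: absolute gripper position, the object's position relative to the target
# (one reading of "relative position of the object and the target"), and the finger distance.
observation = np.concatenate([gripper_pos, object_pos - target_pos, finger_dist])

# 4-D action: desired relative gripper displacement (3-D) + desired finger distance (1-D).
action = np.array([0.01, -0.02, 0.00, 0.05])
print(observation.shape, action.shape)   # (7,) (4,)
```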

18
Reinforcement Learning Architecture
Models: how the reward is considered for optimality
● Model-based
● Value-based
● Policy-based
● Other (i.e., off-policy)

19
Reinforcement Learning Architecture
Model-based -- learning is adopted from a predefined world model.
● It learns the sequential uncertainties of events and actions in a task (which outcomes follow which actions),
○ which can be used adaptively and dynamically to determine ideal actions by simulating their consequences.
● A model-based approach first estimates the transition (p) and cost (c) functions and then uses them to estimate the value (v).
● By contrast, a model-free approach can build a policy without estimating either the transition function or the reward function.

20
Reinforcement Learning Architecture
Value-based --
● The decision maker keeps an estimate of the value of the objective criterion starting from each state in the environment.
● These estimates are updated whenever a new experience is encountered.
● The optimal value is computed through value iteration (see the sketch below).
● Many algorithms of this type have been proven to converge asymptotically to optimal value estimates, which in turn can be used to generate optimal behavior.

➢ Technically, value-based RL can be applied to model-based RL.

● Convergence theory is very important when applying value-function-based RL. An estimate of the optimal value function is built gradually from the decision maker's experience, and sometimes this estimate is used for control.
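For reference, a minimal value iteration sketch over a toy MDP (transition and reward tables invented for illustration); it shows the repeated Bellman backup that the slide alludes to, not the lecture's own example.

```python
# Toy MDP: states 0..2, actions "left"/"right"; reaching state 2 yields reward 1.
states, actions, gamma = [0, 1, 2], ["left", "right"], 0.9
# Deterministic transitions for brevity: T[(s, a)] = next state.
T = {(0, "left"): 0, (0, "right"): 1, (1, "left"): 0, (1, "right"): 2,
     (2, "left"): 2, (2, "right"): 2}
R = {(s, a): (1.0 if T[(s, a)] == 2 and s != 2 else 0.0) for s in states for a in actions}

V = {s: 0.0 for s in states}
for sweep in range(100):
    delta = 0.0
    for s in states:
        # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * V(T(s,a)) ]
        new_v = max(R[(s, a)] + gamma * V[T[(s, a)]] for a in actions)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:        # stop once the estimates have (numerically) converged
        break
print({s: round(v, 3) for s, v in V.items()})
```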
21
Reinforcement Learning Architecture
Policy-based --
● A function approximator is used for the policy itself, for example a parameterized policy represented by a neural network.
● An example is the actor-critic model. In a value-based approach, a small change in value can cause an action to be selected or not selected; this is not the case in policy gradient techniques.
● In the off-policy setting, the agent cannot pick the actions itself; it learns from experts and from recorded sessions (logged data).
● Double Q-learning is an off-policy reinforcement learning algorithm in which a different policy is used for value evaluation than the one used to select the next action. Two separate value functions are trained in a mutually symmetric fashion using separate experiences (see the sketch below).
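A minimal sketch of the double Q-learning update on a toy chain problem: two tables QA and QB are updated symmetrically, each evaluating the action selected by the other; the environment and hyperparameters are invented for illustration.

```python
import random
from collections import defaultdict

alpha, gamma, actions, n_states = 0.1, 0.9, [0, 1], 5
QA, QB = defaultdict(float), defaultdict(float)   # two separate value functions

def step(s, a):  # toy chain environment: move left/right, goal at the last state
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # behave (mostly) greedily with respect to the sum of both tables
        if random.random() < 0.3:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: QA[(s, x)] + QB[(s, x)])
        s2, r, done = step(s, a)
        if random.random() < 0.5:
            # update QA: QA selects the argmax action, QB evaluates it
            a_star = max(actions, key=lambda x: QA[(s2, x)])
            QA[(s, a)] += alpha * (r + gamma * QB[(s2, a_star)] - QA[(s, a)])
        else:
            # symmetric update for QB, evaluated by QA
            b_star = max(actions, key=lambda x: QB[(s2, x)])
            QB[(s, a)] += alpha * (r + gamma * QA[(s2, b_star)] - QB[(s, a)])
        s = s2

print(round(QA[(3, 1)], 2), round(QB[(3, 1)], 2))
```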
22
Baseline Techniques of Reinforcement Learning
● TDL
● Q-learning
● Monte Carlo method
● Policy gradient
● Simple Actor-critic
● Maximum Entropy

23
Temporal Difference Learning
➢ Temporal difference methods focus on the sensitivity of changes in successive predictions rather than on the overall error between predictions and the final outcome.
➢ In response to an increase (decrease) in prediction from P_t to P_{t+1}, an increment ∆w_t is computed that increases (decreases) the predictions for some or all of the previous observation vectors x_1, ..., x_t:

∆w_t = α (P_{t+1} - P_t) Σ_{k=1}^{t} λ^{t-k} ∇_w P_k

where α is a positive parameter called the rate of learning, λ (0 ≤ λ ≤ 1) weights how strongly earlier predictions are adjusted, and the gradient ∇_w P_k is the vector of partial derivatives of P_k with respect to each component of the weights w.
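A minimal TD(0) state-value prediction sketch on a toy random walk, using a tabular value function instead of the weight-vector form above; it only illustrates moving each prediction toward the reward plus the next prediction.

```python
import random

n_states, alpha, gamma = 5, 0.1, 1.0
V = [0.0] * n_states                       # tabular value estimates (predictions)

for episode in range(1000):
    s = n_states // 2                      # start in the middle of the random walk
    done = False
    while not done:
        s2 = s + random.choice([-1, 1])    # drift left or right at random
        r = 1.0 if s2 == n_states - 1 else 0.0
        done = s2 < 0 or s2 == n_states - 1
        target = r if done else r + gamma * V[s2]
        # TD(0): move the current prediction toward the reward plus the next prediction
        V[s] += alpha * (target - V[s])
        s = s2

print([round(v, 2) for v in V])            # roughly the probability of terminating on the right
```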
24
Q-learning:
● Q-learning is treated as a controlled Markovian process.
● Q-learning is a model-free reinforcement learning algorithm based on TD.
● The goal of Q-learning is to learn a policy which tells the agent what action to take under what circumstances.
● The Q value is the expected discounted (γ) reward R for executing action a in state s and then following policy π:

Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

25
Monte Carlo method:
● Monte Carlo algorithms do not require explicit knowledge of the transition matrix, P.
● Monte Carlo algorithms can approximate the solution for some variables without
expending the computational effort required to approximate the solution for all of
the variables.
● In the Monte Carlo method, the value estimate is a statistical average of sampled returns (a simple sum).
● It is somewhat similar to MDP and TD approaches in RL.
● It waits for a full episode to complete before computing returns and updating its value estimates (see the sketch below).
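A minimal every-visit Monte Carlo sketch: after each (hypothetical, made-up) episode completes, returns are computed backwards and averaged into the value estimates; it is only meant to illustrate the "wait for a full episode" point.

```python
from collections import defaultdict

gamma = 0.9
# Two hypothetical completed episodes as (state, reward) pairs.
episodes = [
    [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)],
    [("s0", 0.0), ("s1", 1.0)],
]

returns_sum, returns_cnt = defaultdict(float), defaultdict(int)
for episode in episodes:
    G = 0.0
    # Walk the episode backwards so G accumulates the discounted future reward.
    for s, r in reversed(episode):
        G = r + gamma * G
        returns_sum[s] += G
        returns_cnt[s] += 1

V = {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
print({s: round(v, 2) for s, v in V.items()})
```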

26
Policy gradient
● This algorithm typically uses an independent function approximator for the policy (not for the value).
● Policy gradient algorithms adjust the policy to maximise the expected reward, L_π = -E_{s∼π}[R_{1:∞}], using the gradient

∇_θ L_π = -E_{s∼π}[ ∇_θ log π_θ(a|s) Q^π(s, a) ]

The policy is defined as a parametric probability distribution, π_θ(a|s) = P[a | s; θ], where θ is a parameter vector and ∇_θ denotes the gradient with respect to θ (see the sketch below).
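A minimal REINFORCE-style policy gradient sketch on a two-armed bandit with a softmax policy parameterized by θ; the bandit, step size and reward values are invented, and a return baseline is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                    # one preference per arm; softmax gives pi_theta(a)
true_means = [0.2, 0.8]                # hypothetical expected rewards of the two arms
lr = 0.1

for step in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()       # pi_theta(a)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)                # sample a reward
    # grad of log pi_theta(a) for a softmax policy: one-hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi                     # ascend the policy gradient

print(np.round(np.exp(theta) / np.exp(theta).sum(), 2))   # should favour the better arm
```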

27
Simple Actor-critic
● It is a policy-based technique that uses the gradient of the actual return.
● In this technique, both the value and the policy are learned.
● The critic provides the TD error as a useful reinforcement feedback signal to the actor.
● The actor updates the stochastic policy using the TD error (TD_err).
● The critic updates the estimated value function V̂(x_t) according to TD methods (see the sketch below).
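A minimal one-step actor-critic sketch with a tabular softmax actor and a tabular critic on the same kind of toy chain used earlier; all constants are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha_v, alpha_pi = 5, 0.9, 0.1, 0.1
V = np.zeros(n_states)                    # critic: estimated value function V_hat
H = np.zeros((n_states, 2))               # actor: action preferences (softmax policy)

def step(s, a):
    s2 = int(np.clip(s + (1 if a == 1 else -1), 0, n_states - 1))
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

for episode in range(1000):
    s, done = 0, False
    while not done:
        probs = np.exp(H[s]) / np.exp(H[s]).sum()
        a = rng.choice(2, p=probs)
        s2, r, done = step(s, a)
        # critic: compute the TD error and update the value estimate
        td_err = r + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += alpha_v * td_err
        # actor: push the policy toward actions with positive TD error
        grad = -probs
        grad[a] += 1.0
        H[s] += alpha_pi * td_err * grad
        s = s2

print(np.round(V, 2))
```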

28
Maximum Entropy
● Focuses on a control method for dealing with the uncertainty in learning from demonstrated behaviours.
● Under the condition of matching the reward value of the demonstrated behaviours, it applies maximum entropy to resolve the ambiguity in selecting a distribution over decisions as a probabilistic model.
● It is well suited to imitation learning and follows a model-based technique (it knows about the environment).
● It is specifically targeted at navigation and route-preference problems for self-driving cars; pre-recorded trajectories/paths ζ_i and features of the roads f_{s_j} are also considered.
● It uses a partition function Z(θ) for the convergence of both the finite and the infinite horizon (see the sketch below).
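A minimal sketch of the standard maximum-entropy trajectory model P(ζ|θ) ∝ exp(θ·f_ζ)/Z(θ) over a handful of hypothetical demonstrated paths; the feature vectors and weights are invented to illustrate the role of the partition function Z(θ).

```python
import numpy as np

# Hypothetical road features per trajectory zeta_i (e.g., length, number of turns).
F = np.array([
    [1.0, 2.0],   # zeta_0
    [2.0, 1.0],   # zeta_1
    [3.0, 3.0],   # zeta_2
])
theta = np.array([-0.5, -0.2])           # reward weights (made up)

# Maximum-entropy model: P(zeta | theta) = exp(theta . f_zeta) / Z(theta)
scores = F @ theta
Z = np.exp(scores).sum()                 # partition function Z(theta)
P = np.exp(scores) / Z
print(np.round(P, 3), "sum =", P.sum())  # a proper distribution over the demonstrated paths
```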

29
Baseline Techniques of Reinforcement Learning

30
