Artificial Intelligence: Computer Science & Engineering, Khulna University
Artificial Intelligence
Reinforcement Learning
References
● Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second Edition, MIT Press, 2018.
● Mondal, Amit Kumar. "A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions." arXiv preprint arXiv:2001.06921 (2020).
Class Outlines
● Reinforcement in Learning Classifier Systems
● Definition
● Basic Architecture
● Baseline Techniques
● Future Work Opportunity
Recap: Learning Agent
RL as an MDP
Reinforcement Learning vs Supervised Learning
● Reinforcement learning (RL) techniques -- mimic human tasks such as playing games [13], grabbing and fetching objects/boxes [1], driving cars [32], and so on.
● In reinforcement learning -- an agent must interact with its environment via perception and action.
● The main difference from supervised learning -- the agent is not given input/output examples; rather, the agent itself performs actions and gets an immediate reward along with the subsequent state/path to explore.
● The agent develops experience about the probable states, actions, transitions and rewards of the system so that it can perform effectively and online.
● Challenges of an RL approach -- environment modelling, exploration, exploitation, policy handling, convergence rate, learning speed and so on.
LCS and Reinforcement Learning
LCS and Reinforcement Learning
● A learning classifier system (LCS) -- interacts with the real world, from which it obtains feedback, mostly in the form of a numerical reward (R).
● Learning is driven by trying to maximize the amount of reward received.
● An LCS consists of four components (see the sketch after this list):
➢ A finite population of condition-action rules, called classifiers, that represents the current knowledge of the system;
➢ The performance component -- governs the interaction with the environment;
➢ The reinforcement component (or credit assignment component) -- distributes the
reward received from the environment to the classifiers accountable for the rewards
obtained;
➢ The discovery component -- responsible for discovering better rules and improving
existing ones through a genetic algorithm.
● Reinforcement learning is essential for capturing the diachronic behaviours of an intelligent
system.
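As a rough illustration of how these four components could fit together, here is a minimal Python sketch of an LCS loop. The class and method names (Classifier, MinimalLCS, performance, reinforcement, discovery), the ternary conditions and all numeric constants are assumptions made for illustration, not part of any specific LCS implementation.

```python
class Classifier:
    """A condition-action rule with a predicted payoff and a fitness."""
    def __init__(self, condition, action):
        self.condition = condition   # e.g. a string over {'0', '1', '#'}; '#' is a wildcard
        self.action = action
        self.prediction = 10.0       # predicted payoff of this rule
        self.fitness = 0.1

    def matches(self, state):
        return all(c == '#' or c == s for c, s in zip(self.condition, state))


class MinimalLCS:
    def __init__(self, classifiers):
        self.population = classifiers                    # finite population of rules

    def performance(self, state):
        """Performance component: pick an action among the rules matching the state."""
        match_set = [cl for cl in self.population if cl.matches(state)]
        return max(match_set, key=lambda cl: cl.prediction) if match_set else None

    def reinforcement(self, classifier, reward, beta=0.2):
        """Credit assignment: move the responsible rule's estimates towards the reward."""
        classifier.prediction += beta * (reward - classifier.prediction)
        classifier.fitness += beta * (reward - classifier.fitness)

    def discovery(self):
        """Discovery component: a heavily reduced genetic-algorithm step."""
        if len(self.population) < 2:
            return
        parents = sorted(self.population, key=lambda cl: cl.fitness)[-2:]
        child = Classifier(parents[0].condition, parents[1].action)   # crossover stub
        self.population.append(child)


# Illustrative use on a 3-bit state:
lcs = MinimalLCS([Classifier("1#0", "left"), Classifier("##0", "right")])
chosen = lcs.performance("110")
lcs.reinforcement(chosen, reward=100.0)
lcs.discovery()
```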
Reinforcement Learning Basics
● Reinforcement learning is formally defined (as a Markov decision process) by
➢ A finite set of states S;
➢ A finite set of actions A;
➢ A transition function P(s' | s, a), giving the probability of reaching state s' when action a is performed in state s;
➢ A reward function r(s, a), giving the immediate reward received for performing action a in state s.
Reinforcement Learning Actions
Q-Table
Reinforcement Learning Basics
● Q-learning computes, by successive approximations, the table of all values Q(s, a), named the Q-table.
● Q(s, a) is defined as the payoff predicted under the hypothesis that the agent performs action a in state s, and then carries on always selecting the actions which predict the highest payoff.
● For each state-action pair, Q(s, a) is initialized with a random value and updated at each time step t (t > 0) at which action a_{t-1} has been performed in state s_{t-1}, according to the formula:

Q(s_{t-1}, a_{t-1}) ← Q(s_{t-1}, a_{t-1}) + α [ r + γ max_a Q(s_t, a) − Q(s_{t-1}, a_{t-1}) ]

The term α (0 < α < 1) is the learning rate; γ is the discount factor, which determines how much future rewards are valued at present; r is the reward received for performing a_{t-1} in state s_{t-1}; s_t is the state the agent encounters after performing a_{t-1} in s_{t-1}.
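To make the update concrete, here is a minimal tabular sketch in Python; the dictionary-based Q-table and the example states "s0"/"s1" and actions "left"/"right" are invented for illustration.

```python
import random
from collections import defaultdict

# Q-table: Q[(state, action)] -> predicted payoff, lazily initialised at random
Q = defaultdict(lambda: random.random())

def q_update(s_prev, a_prev, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s_{t-1}, a_{t-1}) towards r + gamma * max_a Q(s_t, a)."""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_prev, a_prev)] += alpha * (r + gamma * best_next - Q[(s_prev, a_prev)])

# Example step with invented states and actions
actions = ["left", "right"]
q_update(s_prev="s0", a_prev="right", r=1.0, s_next="s1", actions=actions)
```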
Reinforcement Learning Performance
● Tabular Q-learning is simple and easy to implement, but it is infeasible for problems of interest because the size of the Q-table (which is |S × A|) grows exponentially in the problem dimensions -- generalization is essential.
● The performance of learning is measured with three major metrics [14]:
➢ Eventual convergence to optimality -- many algorithms come with a provable guarantee of asymptotic convergence to optimal behaviour;
➢ Speed of convergence to (near-)optimality;
➢ Regret -- the expected reward lost by executing the learning algorithm instead of behaving optimally from the very beginning; it penalizes mistakes wherever they occur during the run.
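As a tiny illustration of the regret metric, the sketch below accumulates, step by step, the gap between the reward an optimal policy would have obtained and the reward the learner actually obtained; both reward sequences are invented.

```python
def cumulative_regret(optimal_rewards, obtained_rewards):
    """Regret after T steps: total reward an optimal policy would have earned
    minus the reward actually obtained, so mistakes are penalized wherever they occur."""
    return sum(opt - got for opt, got in zip(optimal_rewards, obtained_rewards))

# Illustrative 5-step run: the learner behaves suboptimally on steps 2 and 4
print(cumulative_regret([1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.5, 1.0]))  # 1.5
```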
Example: 7-DOF Robot Arm
Example
Three tasks of the robot:
● Pushing -- a box is placed on a table in front of the robot, and the task is to move it to the target location on the table. The learned behaviour is a mixture of pushing and rolling.
● Sliding -- a puck is placed on a long slippery table, and the target position is outside of the robot's reach, so the robot has to hit the puck with such force that it slides and then stops at the appropriate place due to friction.
● Pick-and-place -- similar to pushing, but the target position is in the air.
Example
➢ The system state (s) is represented in joint coordinates. Goals are defined as the desired position of the object, depending on the task.
Reinforcement Learning Architecture
Models: how the reward is considered for optimality
● Model-based
● Value-based
● Policy-based
● Other (e.g., off-policy)
Reinforcement Learning Architecture
Model-based -- adopts learning from a predefined world model.
● It learns the sequential uncertainties of events and actions in a task (which outcomes follow which actions),
○ which can be used adaptively and dynamically to determine ideal actions by simulating their consequences.
● The model-based approach first estimates the transition function (p) and the cost function (c) and then uses them to estimate the value (v), as in the sketch below.
● By contrast, the model-free approach can build a policy without estimating either the transition function or the reward function.
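The following Python sketch illustrates, under simple tabular assumptions added purely for illustration, how such a model can be estimated from experience: maximum-likelihood estimates of the transition function p and of the expected reward/cost c are built from counts of observed (s, a, r, s') tuples and can then feed a planner.

```python
from collections import defaultdict

visit_counts = defaultdict(int)      # N(s, a): times action a was tried in state s
trans_counts = defaultdict(int)      # N(s, a, s'): observed transitions
reward_sums = defaultdict(float)     # total reward observed for (s, a)

def record(s, a, r, s_next):
    """Store one experience tuple (s, a, r, s')."""
    visit_counts[(s, a)] += 1
    trans_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += r

def estimated_p(s, a, s_next):
    """Maximum-likelihood estimate of the transition function p(s' | s, a)."""
    n = visit_counts[(s, a)]
    return trans_counts[(s, a, s_next)] / n if n else 0.0

def estimated_c(s, a):
    """Estimated expected immediate reward (or negated cost) for (s, a)."""
    n = visit_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0

# Example: after recording experience, the estimates can feed value iteration
record("s0", "a1", 1.0, "s1")
print(estimated_p("s0", "a1", "s1"), estimated_c("s0", "a1"))
```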
Reinforcement Learning Architecture
Value-based --
● The decision maker keeps an estimate of the value of the objective criterion starting from each state in the environment;
● these estimates are updated whenever new experience is encountered.
● The optimal value is computed through value iteration (sketched below).
● Many algorithms of this type have been proven to converge asymptotically to optimal value estimates, which in turn can be used to generate optimal behavior.
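A compact value-iteration sketch on a made-up two-state, two-action MDP (a concrete instance of the S, A, P, r definition given earlier); the transition table, rewards and discount factor are invented purely to show how the value estimates are swept to convergence and how a greedy policy is then read off.

```python
# Hypothetical MDP: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s1", 2.0)],
    ("s1", "a1"): [(1.0, "s0", 0.0)],
}
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

V = {s: 0.0 for s in states}
for _ in range(1000):   # repeated sweeps until the value estimates settle
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]) for a in actions)
         for s in states}

# Greedy policy read off from the converged value estimates
policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print(V, policy)
```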
Temporal difference Learning
➢ Temporal difference (TD) methods focus on the changes between successive predictions rather than on the overall error between predictions and the final outcome.
➢ In response to an increase (decrease) in the prediction from P_t to P_{t+1}, an increment ∆w_t is computed that increases (decreases) the predictions associated with some or all of the previous observation vectors x_1, ..., x_t:

∆w_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} ∇_w P_k

where α is a positive parameter called the rate of learning, and the gradient ∇_w P_t is the vector of partial derivatives of P_t with respect to each component of the weights w.
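A small sketch of this increment for a linear predictor P_t = w · x_t (so ∇_w P_k = x_k), assuming for simplicity that the final outcome is folded into the last observation; the vectors, initial weights and step size are invented.

```python
import numpy as np

def td_update(w, observations, alpha=0.1):
    """Apply the increments Δw_t = α (P_{t+1} − P_t) Σ_{k≤t} ∇_w P_k
    for a linear predictor P_t = w · x_t, where ∇_w P_k = x_k."""
    w = w.copy()
    grad_sum = np.zeros_like(w)              # running sum of past gradients x_1 + ... + x_t
    for x_t, x_next in zip(observations[:-1], observations[1:]):
        grad_sum += x_t
        delta = w @ x_next - w @ x_t         # change between successive predictions
        w += alpha * delta * grad_sum
    return w

# Illustrative usage: three 4-dimensional observation vectors
obs = [np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]), np.array([0, 0, 1.0, 0])]
print(td_update(np.array([0.5, 0.2, 0.1, 0.0]), obs))
```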
Q-learning:
● Q-learning is treated as a controlled Markovian process.
● Q-learning is a model-free reinforcement learning algorithm based on TD.
● The goal of Q-learning is to learn a policy, which tells an agent what action to take
under what circumstances.
● The Q value is the expected discounted (γ) reward R for executing action a in state s and then following policy π.
Monte Carlo method:
● Monte Carlo algorithms do not require explicit knowledge of the transition matrix, P.
● Monte Carlo algorithms can approximate the solution for some variables without
expending the computational effort required to approximate the solution for all of
the variables.
● In the Monte Carlo method, the value is a statistical average of sampled returns (a simple sum of rewards per episode).
● It is somewhat similar to MDP-based (dynamic programming) and TD methods in RL.
● It waits for a full episode to complete before updating the value estimates (sketched below).
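A minimal first-visit Monte Carlo sketch reflecting these points: returns are only computed once an episode has finished, and the value of a state is the average of the returns sampled so far; the episode data and discount factor are invented.

```python
from collections import defaultdict

returns = defaultdict(list)          # all sampled returns observed so far for each state
V = {}                               # value estimate: average of the sampled returns

def mc_update(episode, gamma=0.9):
    """episode: list of (state, reward) pairs from one *completed* episode."""
    G = 0.0
    first_visit_return = {}
    for s, r in reversed(episode):   # walk backwards to accumulate the discounted return
        G = r + gamma * G
        first_visit_return[s] = G    # ends up holding the return from the first visit
    for s, g in first_visit_return.items():
        returns[s].append(g)
        V[s] = sum(returns[s]) / len(returns[s])

mc_update([("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
print(V)
```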
Policy gradient
● This algorithm typically uses an independent function approximator for the policy (not for the value).
● Policy gradient algorithms adjust the policy parameters θ to maximise the expected reward, i.e. to minimise L_π = −E_{s∼π}[R_{1:∞}], using the (score-function) gradient

∂L_π/∂θ = −E_{s∼π}[∇_θ log π(a|s) R_{1:∞}]
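A sketch of this score-function (REINFORCE) gradient estimate for a linear-softmax policy over a small discrete action set; the feature vectors, returns and step size are invented, and no particular RL library is assumed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episode):
    """Monte Carlo estimate of E[∇_θ log π_θ(a|s) R] for a linear-softmax
    policy π_θ(a|s) ∝ exp(theta[a] · s); ascending it maximises expected return."""
    grad = np.zeros_like(theta)
    for s, a, R in episode:                  # (state features, action index, return)
        probs = softmax(theta @ s)
        d_log = -np.outer(probs, s)          # rows b: (1[a=b] - π(b|s)) * s
        d_log[a] += s
        grad += d_log * R
    return grad / len(episode)

# Illustrative usage: 2 actions, 3 state features, invented returns
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, 0.5]), 0, 1.0), (np.array([0.0, 1.0, 0.5]), 1, 0.5)]
theta += 0.1 * reinforce_gradient(theta, episode)   # one gradient-ascent step
```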
Simple Actor-critic
● It is a policy-based technique that actually uses a gradient of the actual return.
● In this technique, both the value and the policy are learned.
● The critic provides the TD error as useful reinforcement feedback to the actor.
● The actor updates the stochastic policy using the TD error TD_err.
● The critic updates the estimated value function V̂(x_t) according to TD methods.
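A hedged tabular actor-critic sketch following these bullets: the critic maintains V̂(x) and emits the TD error, which drives both the critic's own update and the actor's softmax action preferences; the state/action names and learning rates are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

n_actions = 2
V = defaultdict(float)                              # critic's estimated value function V̂(x)
prefs = defaultdict(lambda: np.zeros(n_actions))    # actor's action preferences per state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(x, a, r, x_next, alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    """One update: the critic computes the TD error, which reinforces the actor."""
    td_err = r + gamma * V[x_next] - V[x]           # TD error from the critic
    V[x] += alpha_v * td_err                        # critic update (TD method)
    probs = softmax(prefs[x])
    grad = -probs
    grad[a] += 1.0                                  # gradient of log π(a|x) for softmax preferences
    prefs[x] += alpha_p * td_err * grad             # actor update driven by the TD error
    return td_err

# Illustrative transition: action 1 taken in state "x0", reward 1.0, next state "x1"
actor_critic_step("x0", 1, 1.0, "x1")
```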
Maximum Entropy
● Focuses on a control method for dealing with the uncertainty in learning from demonstrated behaviours.
● Under the condition of matching the reward value of the demonstrated behaviours, it applies maximum entropy to resolve the ambiguity in selecting a distribution over decisions as a probabilistic model.
● It is well suited to imitation learning and follows a model-based technique (it knows about the environment).
● It is specifically targeted at navigation and route-preference problems for self-driving cars; pre-recorded trajectories/paths ζ_i and road features f_{s_j} are also considered.
● It uses a partition function Z(θ) for convergence in both the finite and infinite horizon settings.
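As a rough illustration of the maximum-entropy idea over a handful of enumerated trajectories, the sketch below scores each trajectory ζ by exp(θ · f_ζ) and normalises with the partition function Z(θ); the reward weights θ and per-trajectory feature counts are invented.

```python
import numpy as np

def maxent_trajectory_distribution(theta, trajectory_features):
    """P(ζ | θ) = exp(θ · f_ζ) / Z(θ), where Z(θ) is the partition function."""
    scores = np.array([theta @ f for f in trajectory_features])
    scores -= scores.max()             # shift for numerical stability
    weights = np.exp(scores)
    Z = weights.sum()                  # partition function Z(θ)
    return weights / Z

# Invented reward weights θ and per-trajectory feature counts f_ζ for three paths
theta = np.array([1.0, -0.5])
features = [np.array([3.0, 1.0]), np.array([2.0, 0.0]), np.array([1.0, 2.0])]
print(maxent_trajectory_distribution(theta, features))
```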
Baseline Techniques of Reinforcement Learning