Artificial Intelligence: Computer Science & Engineering, Khulna University
Artificial Intelligence
Reinforcement Learning
References
● Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second Edition, MIT Press, 2018.
● Mondal, Amit Kumar. "A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions." arXiv preprint arXiv:2001.06921 (2020).
Class Outlines
● Reinforcement in Learning Classifier Systems
● Definition
● Basic Architecture
● Baseline Techniques
● Future Work Opportunity
Recap: Learning Agent
RL as an MDP
Reinforcement Learning vs Supervised Learning
● Reinforcement learning (RL) techniques -- mimic human tasks such as playing games [13], grabbing and fetching objects/boxes [1], driving cars [32], and so on.
● In reinforcement learning -- an agent must interact with its environment via perception and action.
● The main difference from supervised learning -- the agent is not given input/output examples; rather, the agent itself performs actions and gets an immediate reward along with the subsequent state/path to explore.
● The agent develops experience about the probable states, actions, transitions and rewards of the system so that it can perform effectively and online.
● Challenges of an RL approach -- environment modelling, exploration, exploitation, policy handling, convergence rate, learning speed and so on.
LCS and Reinforcement Learning
LCS and Reinforcement Learning
● A learning classifier system (LCS) -- interacts with the real world, from which it obtains feedback, mostly in the form of a numerical reward (R).
● Learning is driven by trying to maximize the amount of reward received.
● An LCS consists of four components (see the sketch after this list):
➢ A finite population of condition-action rules, called classifiers, that represents the current knowledge of the system;
➢ The performance component -- governs the interaction with the environment;
➢ The reinforcement component (or credit assignment component) -- distributes the
reward received from the environment to the classifiers accountable for the rewards
obtained;
➢ The discovery component -- responsible for discovering better rules and improving
existing ones through a genetic algorithm.
● Reinforcement learning is essential for capturing the diachronic behaviours of an intelligent
system.
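As a rough illustration of how these four components could fit together, here is a minimal Python sketch of an LCS loop. The class and method names (Classifier, MinimalLCS, performance, reinforcement, discovery), the ternary conditions and all numeric constants are assumptions made for illustration, not part of any specific LCS implementation.

```python
class Classifier:
    """A condition-action rule with a predicted payoff and a fitness."""
    def __init__(self, condition, action):
        self.condition = condition   # e.g. a string over {'0', '1', '#'}; '#' is a wildcard
        self.action = action
        self.prediction = 10.0       # predicted payoff of this rule
        self.fitness = 0.1

    def matches(self, state):
        return all(c == '#' or c == s for c, s in zip(self.condition, state))


class MinimalLCS:
    def __init__(self, classifiers):
        self.population = classifiers                    # finite population of rules

    def performance(self, state):
        """Performance component: pick an action among the rules matching the state."""
        match_set = [cl for cl in self.population if cl.matches(state)]
        return max(match_set, key=lambda cl: cl.prediction) if match_set else None

    def reinforcement(self, classifier, reward, beta=0.2):
        """Credit assignment: move the responsible rule's estimates towards the reward."""
        classifier.prediction += beta * (reward - classifier.prediction)
        classifier.fitness += beta * (reward - classifier.fitness)

    def discovery(self):
        """Discovery component: a heavily reduced genetic-algorithm step."""
        if len(self.population) < 2:
            return
        parents = sorted(self.population, key=lambda cl: cl.fitness)[-2:]
        child = Classifier(parents[0].condition, parents[1].action)   # crossover stub
        self.population.append(child)


# Illustrative use on a 3-bit state:
lcs = MinimalLCS([Classifier("1#0", "left"), Classifier("##0", "right")])
chosen = lcs.performance("110")
lcs.reinforcement(chosen, reward=100.0)
lcs.discovery()
```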
Reinforcement Learning Basics
● Reinforcement learning is formally defined (as a Markov decision process) by
➢ A finite set of states S;
➢ A finite set of actions A;
➢ A transition function P(s' | s, a), giving the probability of reaching state s' when action a is performed in state s;
➢ A reward function r(s, a), giving the immediate reward received for performing action a in state s.
Reinforcement Learning Actions
Q-Table
Reinforcement Learning Basics
● Q-learning computes, by successive approximations, the table of all values Q(s, a), named the Q-table.
● Q(s, a) is defined as the payoff predicted under the hypothesis that the agent performs action a in state s, and then carries on always selecting the actions which predict the highest payoff.
● For each state-action pair, Q(s, a) is initialized with a random value and updated at each time step t (t > 0) at which action a_{t-1} has been performed in state s_{t-1}, according to the formula:

Q(s_{t-1}, a_{t-1}) ← Q(s_{t-1}, a_{t-1}) + α [ r + γ max_a Q(s_t, a) − Q(s_{t-1}, a_{t-1}) ]

The term α (0 < α < 1) is the learning rate; γ is the discount factor, which determines how much future rewards are valued at present; r is the reward received for performing a_{t-1} in state s_{t-1}; s_t is the state the agent encounters after performing a_{t-1} in s_{t-1}.
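To make the update concrete, here is a minimal tabular sketch in Python; the dictionary-based Q-table and the example states "s0"/"s1" and actions "left"/"right" are invented for illustration.

```python
import random
from collections import defaultdict

# Q-table: Q[(state, action)] -> predicted payoff, lazily initialised at random
Q = defaultdict(lambda: random.random())

def q_update(s_prev, a_prev, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s_{t-1}, a_{t-1}) towards r + gamma * max_a Q(s_t, a)."""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_prev, a_prev)] += alpha * (r + gamma * best_next - Q[(s_prev, a_prev)])

# Example step with invented states and actions
actions = ["left", "right"]
q_update(s_prev="s0", a_prev="right", r=1.0, s_next="s1", actions=actions)
```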
Reinforcement Learning Performance
● Tabular Q-learning is simple and easy to implement, but it is infeasible for problems of interest because the size of the Q-table (which is |S × A|) grows exponentially in the problem dimensions -- generalization is essential.
● The performance of learning is measured with three major metrics [14]:
➢ Eventual convergence to optimality -- many algorithms come with a provable guarantee of asymptotic convergence to optimal behaviour;
➢ Speed of convergence to (near-)optimality;
➢ Regret -- the expected reward lost by executing the learning algorithm instead of behaving optimally from the very beginning; it penalizes mistakes wherever they occur during the run.
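As a tiny illustration of the regret metric, the sketch below accumulates, step by step, the gap between the reward an optimal policy would have obtained and the reward the learner actually obtained; both reward sequences are invented.

```python
def cumulative_regret(optimal_rewards, obtained_rewards):
    """Regret after T steps: total reward an optimal policy would have earned
    minus the reward actually obtained, so mistakes are penalized wherever they occur."""
    return sum(opt - got for opt, got in zip(optimal_rewards, obtained_rewards))

# Illustrative 5-step run: the learner behaves suboptimally on steps 2 and 4
print(cumulative_regret([1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.5, 1.0]))  # 1.5
```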
Example: 7-DOF Robot Arm
Example
Three tasks of the robot:
● Pushing -- a box is placed on a table in front of the robot, and the task is to move it to the target location on the table. The learned behaviour is a mixture of pushing and rolling.
● Sliding -- a puck is placed on a long slippery table, and the target position is outside of the robot's reach, so the robot has to hit the puck with such force that it slides and then stops at the appropriate place due to friction.
● Pick-and-place -- similar to pushing, but the target position is in the air.
Example
➢ The system state (s) is represented in joint coordinates. Goals are defined as the desired position of the object, depending on the task.
Reinforcement Learning Architecture
Models: how the reward is considered for optimality
● Model-based
● Value-based
● Policy-based
● Other (e.g., off-policy)
Reinforcement Learning Architecture
Model-based -- adopts learning from a predefined world model.
● It learns the sequential uncertainties of events and actions in a task (which outcomes follow which actions),
○ which can be used adaptively and dynamically to determine ideal actions by simulating their consequences.
● The model-based approach first estimates the transition function (p) and the cost function (c) and then uses them to estimate the value (v), as in the sketch below.
● By contrast, the model-free approach can build a policy without estimating either the transition function or the reward function.
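The following Python sketch illustrates, under simple tabular assumptions added purely for illustration, how such a model can be estimated from experience: maximum-likelihood estimates of the transition function p and of the expected reward/cost c are built from counts of observed (s, a, r, s') tuples and can then feed a planner.

```python
from collections import defaultdict

visit_counts = defaultdict(int)      # N(s, a): times action a was tried in state s
trans_counts = defaultdict(int)      # N(s, a, s'): observed transitions
reward_sums = defaultdict(float)     # total reward observed for (s, a)

def record(s, a, r, s_next):
    """Store one experience tuple (s, a, r, s')."""
    visit_counts[(s, a)] += 1
    trans_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += r

def estimated_p(s, a, s_next):
    """Maximum-likelihood estimate of the transition function p(s' | s, a)."""
    n = visit_counts[(s, a)]
    return trans_counts[(s, a, s_next)] / n if n else 0.0

def estimated_c(s, a):
    """Estimated expected immediate reward (or negated cost) for (s, a)."""
    n = visit_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0

# Example: after recording experience, the estimates can feed value iteration
record("s0", "a1", 1.0, "s1")
print(estimated_p("s0", "a1", "s1"), estimated_c("s0", "a1"))
```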
Reinforcement Learning Architecture
Value-based --
● The decision maker keeps an estimate of the value of the objective criterion starting from each state in the environment;
● these estimates are updated whenever new experience is encountered.
● The optimal value is computed through value iteration (sketched below).
● Many algorithms of this type have been proven to converge asymptotically to optimal value estimates, which in turn can be used to generate optimal behavior.
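A compact value-iteration sketch on a made-up two-state, two-action MDP (a concrete instance of the S, A, P, r definition given earlier); the transition table, rewards and discount factor are invented purely to show how the value estimates are swept to convergence and how a greedy policy is then read off.

```python
# Hypothetical MDP: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s1", 2.0)],
    ("s1", "a1"): [(1.0, "s0", 0.0)],
}
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

V = {s: 0.0 for s in states}
for _ in range(1000):   # repeated sweeps until the value estimates settle
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]) for a in actions)
         for s in states}

# Greedy policy read off from the converged value estimates
policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print(V, policy)
```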
Temporal difference Learning
➢ Temporal difference (TD) methods focus on the changes between successive predictions rather than on the overall error between predictions and the final outcome.
➢ In response to an increase (decrease) in the prediction from P_t to P_{t+1}, an increment ∆w_t is computed that increases (decreases) the predictions associated with some or all of the previous observation vectors x_1, ..., x_t:

∆w_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} ∇_w P_k

where α is a positive parameter called the rate of learning, and the gradient ∇_w P_t is the vector of partial derivatives of P_t with respect to each component of the weights w.
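A small sketch of this increment for a linear predictor P_t = w · x_t (so ∇_w P_k = x_k), assuming for simplicity that the final outcome is folded into the last observation; the vectors, initial weights and step size are invented.

```python
import numpy as np

def td_update(w, observations, alpha=0.1):
    """Apply the increments Δw_t = α (P_{t+1} − P_t) Σ_{k≤t} ∇_w P_k
    for a linear predictor P_t = w · x_t, where ∇_w P_k = x_k."""
    w = w.copy()
    grad_sum = np.zeros_like(w)              # running sum of past gradients x_1 + ... + x_t
    for x_t, x_next in zip(observations[:-1], observations[1:]):
        grad_sum += x_t
        delta = w @ x_next - w @ x_t         # change between successive predictions
        w += alpha * delta * grad_sum
    return w

# Illustrative usage: three 4-dimensional observation vectors
obs = [np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]), np.array([0, 0, 1.0, 0])]
print(td_update(np.array([0.5, 0.2, 0.1, 0.0]), obs))
```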
Q-learning:
● Q-learning is treated as a controlled Markovian process.
● Q-learning is a model-free reinforcement learning algorithm based on TD.
● The goal of Q-learning is to learn a policy, which tells an agent what action to take
under what circumstances.
● The Q value is the expected discounted (γ) reward R for executing action a in state s and then following policy π.
Monte Carlo method:
● Monte Carlo algorithms do not require explicit knowledge of the transition matrix, P.
● Monte Carlo algorithms can approximate the solution for some variables without
expending the computational effort required to approximate the solution for all of
the variables.
● In the Monte Carlo method, the value is a statistical average of sampled returns (a simple sum of rewards per episode).
● It is somewhat similar to MDP-based (dynamic programming) and TD methods in RL.
● It waits for a full episode to complete before updating the value estimates (sketched below).
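A minimal first-visit Monte Carlo sketch reflecting these points: returns are only computed once an episode has finished, and the value of a state is the average of the returns sampled so far; the episode data and discount factor are invented.

```python
from collections import defaultdict

returns = defaultdict(list)          # all sampled returns observed so far for each state
V = {}                               # value estimate: average of the sampled returns

def mc_update(episode, gamma=0.9):
    """episode: list of (state, reward) pairs from one *completed* episode."""
    G = 0.0
    first_visit_return = {}
    for s, r in reversed(episode):   # walk backwards to accumulate the discounted return
        G = r + gamma * G
        first_visit_return[s] = G    # ends up holding the return from the first visit
    for s, g in first_visit_return.items():
        returns[s].append(g)
        V[s] = sum(returns[s]) / len(returns[s])

mc_update([("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
print(V)
```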
Policy gradient
● This algorithm typically uses an independent function approximator for the policy (not for the value).
● Policy gradient algorithms adjust the policy parameters θ to maximise the expected reward, i.e. to minimise L_π = −E_{s∼π}[R_{1:∞}], using the (score-function) gradient

∂L_π/∂θ = −E_{s∼π}[∇_θ log π(a|s) R_{1:∞}]
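A sketch of this score-function (REINFORCE) gradient estimate for a linear-softmax policy over a small discrete action set; the feature vectors, returns and step size are invented, and no particular RL library is assumed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episode):
    """Monte Carlo estimate of E[∇_θ log π_θ(a|s) R] for a linear-softmax
    policy π_θ(a|s) ∝ exp(theta[a] · s); ascending it maximises expected return."""
    grad = np.zeros_like(theta)
    for s, a, R in episode:                  # (state features, action index, return)
        probs = softmax(theta @ s)
        d_log = -np.outer(probs, s)          # rows b: (1[a=b] - π(b|s)) * s
        d_log[a] += s
        grad += d_log * R
    return grad / len(episode)

# Illustrative usage: 2 actions, 3 state features, invented returns
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, 0.5]), 0, 1.0), (np.array([0.0, 1.0, 0.5]), 1, 0.5)]
theta += 0.1 * reinforce_gradient(theta, episode)   # one gradient-ascent step
```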
Simple Actor-critic
● It is a policy-based technique that actually uses a gradient of the actual return.
● In this technique, both the value and the policy are learned.
● The critic provides the TD error as useful reinforcement feedback to the actor.
● The actor updates the stochastic policy using the TD error TD_err.
● The critic updates the estimated value function V̂(x_t) according to TD methods.
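A hedged tabular actor-critic sketch following these bullets: the critic maintains V̂(x) and emits the TD error, which drives both the critic's own update and the actor's softmax action preferences; the state/action names and learning rates are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

n_actions = 2
V = defaultdict(float)                              # critic's estimated value function V̂(x)
prefs = defaultdict(lambda: np.zeros(n_actions))    # actor's action preferences per state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(x, a, r, x_next, alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    """One update: the critic computes the TD error, which reinforces the actor."""
    td_err = r + gamma * V[x_next] - V[x]           # TD error from the critic
    V[x] += alpha_v * td_err                        # critic update (TD method)
    probs = softmax(prefs[x])
    grad = -probs
    grad[a] += 1.0                                  # gradient of log π(a|x) for softmax preferences
    prefs[x] += alpha_p * td_err * grad             # actor update driven by the TD error
    return td_err

# Illustrative transition: action 1 taken in state "x0", reward 1.0, next state "x1"
actor_critic_step("x0", 1, 1.0, "x1")
```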
Maximum Entropy
● Focuses on a control method for dealing with the uncertainty in learning from demonstrated behaviours.
● Under the condition of matching the reward value of the demonstrated behaviours, it applies maximum entropy to resolve the ambiguity in selecting a distribution over decisions as a probabilistic model.
● It is well suited to imitation learning and follows a model-based technique (it knows about the environment).
● It is specifically targeted at navigation and route-preference problems for self-driving cars; pre-recorded trajectories/paths ζ_i and road features f_{s_j} are also considered.
● It uses a partition function Z(θ) for convergence in both the finite and infinite horizon settings.
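As a rough illustration of the maximum-entropy idea over a handful of enumerated trajectories, the sketch below scores each trajectory ζ by exp(θ · f_ζ) and normalises with the partition function Z(θ); the reward weights θ and per-trajectory feature counts are invented.

```python
import numpy as np

def maxent_trajectory_distribution(theta, trajectory_features):
    """P(ζ | θ) = exp(θ · f_ζ) / Z(θ), where Z(θ) is the partition function."""
    scores = np.array([theta @ f for f in trajectory_features])
    scores -= scores.max()             # shift for numerical stability
    weights = np.exp(scores)
    Z = weights.sum()                  # partition function Z(θ)
    return weights / Z

# Invented reward weights θ and per-trajectory feature counts f_ζ for three paths
theta = np.array([1.0, -0.5])
features = [np.array([3.0, 1.0]), np.array([2.0, 0.0]), np.array([1.0, 2.0])]
print(maxent_trajectory_distribution(theta, features))
```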
Baseline Techniques of Reinforcement Learning