
CSC 311: Introduction to Machine Learning

Lecture 12 - Reinforcement Learning

Roger Grosse, Rahul G. Krishnan, Guodong Zhang

University of Toronto, Fall 2021



Reinforcement Learning Problem
Recall: we categorized types of ML by how much information they
provide about the desired behavior.
Supervised learning: labels of desired behavior
Unsupervised learning: no labels
Reinforcement learning: reward signal evaluating the outcome of
past actions
Bandit problems (Lecture 10) are a simple instance of RL where each
decision is independent.
More commonly, we focus on sequential decision making: an agent
chooses a sequence of actions which each affect future possibilities
available to the agent.

Reinforcement Learning (RL): an agent observes the world, takes an action (and the world's state changes), with the goal of achieving long-term rewards.



Reinforcement Learning
Most RL is done in a mathematical framework called a Markov Decision Process
(MDP).



MDPs: States and Actions

First let’s see how to describe the dynamics of the environment.


The state is a description of the environment in sufficient detail to
determine its evolution.
Think of Newtonian physics.
What would be the state variables for a puck sliding on a
frictionless table?

Markov assumption: the state at time t + 1 depends directly on the


state and action at time t, but not on past states and actions.
To describe the dynamics, we need to specify the transition
probabilities P(St+1 | St , At ).
In this lecture, we assume the state is fully observable, a highly
nontrivial assumption.



MDPs: States and Actions

Suppose you’re controlling a robot hand. What should be the set


of states and actions?

In general, the right granularity of states and actions depends on


what you’re trying to achieve.
MDPs: Policies

The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: At = π(St ) for some function π : S → A
Stochastic policy: At ∼ π(· | St ) for some function π : S → P(A).
(Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or
trajectories, factorizes:
p(s1 , a1 , . . . , sT , aT ) = p(s1 ) π(a1 | s1 ) P(s2 | s1 , a1 ) π(a2 | s2 ) · · · P(sT | sT −1 , aT −1 ) π(aT | sT )

Note: the fact that policies need consider only the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
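To make the factorized rollout distribution above concrete, here is a minimal Python sketch that samples a trajectory from a small finite MDP; the arrays p0, pi, and P are made-up placeholders, not anything specified in the lecture.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
p0 = np.array([1.0, 0.0, 0.0])                   # p(s1): always start in state 0
pi = np.full((n_states, n_actions), 0.5)         # stochastic policy pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s' | s, a)

def sample_rollout(T=5):
    # Sample (s1, a1, ..., sT, aT) from p(s1) pi(a1|s1) P(s2|s1,a1) ... pi(aT|sT).
    s = rng.choice(n_states, p=p0)
    trajectory = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])
        trajectory.append((s, a))
        s = rng.choice(n_states, p=P[s, a])
    return trajectory

print(sample_rollout())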



MDPs: Rewards
In each time step, the agent receives a reward from a distribution
that depends on the current state and action

Rt ∼ R(· | St , At )

For simplicity, we’ll assume rewards are deterministic, i.e.

Rt = r(St , At )

What’s an example where Rt should depend on At ?


The return measures how good the outcome of an episode was.
Undiscounted: G = R0 + R1 + R2 + · · ·
Discounted: G = R0 + γR1 + γ 2 R2 + · · ·
The goal is to maximize the expected return, E[G].
γ is a hyperparameter called the discount factor which determines
how much we care about rewards now vs. rewards later.
What is the effect of large or small γ?
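As a small illustration of the effect of γ, a sketch that computes the discounted return for a made-up reward sequence:

def discounted_return(rewards, gamma):
    # G = R_0 + gamma * R_1 + gamma^2 * R_2 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]          # a large reward arrives three steps late
print(discounted_return(rewards, 0.99))  # ~10.7: with large gamma the delayed reward dominates
print(discounted_return(rewards, 0.1))   # ~1.01: with small gamma it is almost ignored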
MDPs: Rewards

How might you define a reward function for an agent learning to


play a video game?
Change in score (why not current score?)
Some measure of novelty (this is sufficient for most Atari games!)
Consider two possible reward functions for the game of Go. How
do you think the agent’s play will differ depending on the choice?
Option 1: +1 for win, 0 for tie, -1 for loss
Option 2: Agent’s territory minus opponent’s territory (at end)
Specifying a good reward function can be tricky.
https://www.youtube.com/watch?v=tlOIHko8ySg



Markov Decision Processes

Putting this together, a Markov Decision Process (MDP) is defined by a


tuple (S, A, P, R, γ).
S: State space. Discrete or continuous
A: Action space. Here we consider finite action space, i.e.,
A = {a1 , . . . , a|A| }.
P: Transition probability
R: Immediate reward distribution
γ: Discount factor (0 ≤ γ < 1)
Together these define the environment that the agent operates in, and
the objectives it is supposed to achieve.
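As a concrete sketch, a finite MDP can be stored as a few arrays; the 2-state, 2-action numbers below are made up purely for illustration.

import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    P: np.ndarray     # transition probabilities P(s' | s, a), shape (|S|, |A|, |S|)
    r: np.ndarray     # deterministic rewards r(s, a), shape (|S|, |A|)
    gamma: float      # discount factor, 0 <= gamma < 1

mdp = FiniteMDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.0, 1.0], [0.5, 0.5]]]),
    r=np.array([[1.0, 0.0],
                [0.0, 2.0]]),
    gamma=0.9,
)
assert np.allclose(mdp.P.sum(axis=-1), 1.0)   # each P(. | s, a) must be a distribution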



Finding a Policy

Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: agent interacts with an environment with unknown
dynamics.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
Which framework would be most appropriate for chess? Super
Mario?



Value Functions



Value Function

The value function V π for a policy π measures the expected return if you
start in state s and follow policy π.
"∞ #
X
V (s) , Eπ [Gt | St = s] = Eπ
π k
γ Rt+k | St = s .
k=0

This measures the desirability of state s.
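One way to read this definition operationally: V^π(s) can be approximated by averaging (truncated) discounted returns over simulated rollouts, if a simulator is available. A minimal sketch with made-up MDP arrays:

import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                    # r(s, a)
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])                   # pi(a | s)
gamma = 0.9

def monte_carlo_value(s, n_rollouts=2000, horizon=200):
    # Average the discounted return over rollouts that start in state s and follow pi.
    total = 0.0
    for _ in range(n_rollouts):
        state, G, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(2, p=pi[state])
            G += discount * r[state, a]
            discount *= gamma
            state = rng.choice(2, p=P[state, a])
        total += G
    return total / n_rollouts

print(monte_carlo_value(0))   # estimate of V^pi(0)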



Value Function

Rewards: −1 per time-step


Actions: N, E, S, W
States: Agent’s location

[Slide credit: D. Silver]



Value Function

Arrows represent policy π(s)


for each state s

[Slide credit: D. Silver]



Value Function

Numbers represent value


V π (s) of each state s

[Slide credit: D. Silver]



Bellman equations
The foundation of many RL algorithms is the fact that value functions
satisfy a recursive relationship, called the Bellman equation:
V^π(s) = E_π[G_t | S_t = s]
       = E_π[R_t + γ G_{t+1} | S_t = s]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) E_π[G_{t+1} | S_{t+1} = s'] ]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]

Viewing V^π as a vector (where entries correspond to states), define the Bellman backup operator T^π:

(T^π V)(s) := Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]

The Bellman equation can be seen as saying that V^π is a fixed point of the Bellman operator:

T^π V^π = V^π.
Value Function

A value function for golf (example from Sutton and Barto, Reinforcement Learning: An Introduction): the state is the location of the ball, we receive a penalty (negative reward) of 1 for each stroke until we hit the ball into the hole, and the value of a state is the negative of the number of strokes needed to reach the hole from that location. The actions are how we aim and hit the ball, and which club we select (assumed to be either a putter or a driver).

(Figure: the state-value function v_putt(s) for the policy that always putts, with the terminal state having value 0, and the action-value q*(s, driver).)
State-Action Value Function
A closely related but usefully different function is the state-action
value function, or Q-function, Qπ for policy π, defined as:
 
Q^π(s, a) := E_π[ Σ_{k≥0} γ^k R_{t+k} | S_t = s, A_t = a ].

If you knew Qπ , how would you obtain V π ?


V^π(s) = Σ_a π(a | s) Q^π(s, a).

If you knew V π , how would you obtain Qπ ?


Apply a Bellman-like equation:
Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s')

This requires knowing the dynamics, so in general it’s not easy to


recover Qπ from V π .
State-Action Value Function

Qπ satisfies a Bellman equation very similar to V π (proof is


analogous):
Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Σ_{a'} π(a' | s') Q^π(s', a')  =: (T^π Q^π)(s, a)



Dynamic Programming and Value Iteration



Optimal State-Action Value Function

Suppose you’re in state s. You get to pick one action a, and then
follow (fixed) policy π from then on. What do you pick?

arg max_a Q^π(s, a)

If a deterministic policy π is optimal, then it must be the case that for any state s:

π(s) = arg max_a Q^π(s, a),

otherwise you could improve the policy by changing π(s). (see


Sutton & Barto for a proper proof)



Optimal State-Action Value Function

Bellman equation for optimal policy π ∗ :


Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Q^{π*}(s', π*(s'))
             = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q^{π*}(s', a')

Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:

Q*(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a')  =: (T* Q*)(s, a)

Turns out this is sufficient to characterize the optimal policy. So


we simply need to solve the fixed point equation T ∗ Q∗ = Q∗ , and
then we can choose π*(s) = arg max_a Q*(s, a).



Bellman Fixed Points

So far: showed that some interesting problems could be reduced


to finding fixed points of Bellman backup operators:
Evaluating a fixed policy π

T π Qπ = Qπ

Finding the optimal policy

T ∗ Q∗ = Q∗

Idea: keep iterating the backup operator over and over again.

Q ← T^π Q   (policy evaluation)

Q ← T* Q   (finding the optimal policy)
We’re treating Qπ or Q∗ as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.
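As a sketch of the first idea, here is policy evaluation by repeatedly applying T^π to a Q-vector, written for the same kind of made-up finite-MDP arrays used above:

import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])
gamma = 0.9

def policy_evaluation(P, r, pi, gamma, n_iters=1000):
    # Iterate Q <- T^pi Q, where (T^pi Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) sum_a' pi(a'|s') Q(s',a').
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        V = (pi * Q).sum(axis=1)     # V(s') = sum_a' pi(a'|s') Q(s', a')
        Q = r + gamma * P @ V        # vectorized form of the backup
    return Q

print(policy_evaluation(P, r, pi, gamma))   # approximates Q^pi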



Bellman Fixed Points

An operator f (mapping from vectors to vectors) is a contraction


map if
‖f(x1) − f(x2)‖ ≤ α ‖x1 − x2‖

for some scalar 0 ≤ α < 1 and vector norm ‖·‖.
Let f^(k) denote f iterated k times. A simple induction shows

‖f^(k)(x1) − f^(k)(x2)‖ ≤ α^k ‖x1 − x2‖.

Let x* be a fixed point of f. Then for any x,

‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.

Hence, iterated application of f , starting from any x, converges


exponentially to a unique fixed point.
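A tiny numerical illustration with a made-up contraction f(x) = 0.5x + 1 (so α = 0.5 and the fixed point is x* = 2):

def f(x):
    return 0.5 * x + 1.0       # contraction with alpha = 0.5; fixed point x* = 2

x, x_star = 10.0, 2.0
for k in range(1, 6):
    x = f(x)
    print(k, abs(x - x_star))  # the error halves every iteration: 4.0, 2.0, 1.0, 0.5, 0.25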



Finding the Optimal Value Function: Value Iteration

Let’s use dynamic programming to find Q∗ .


Value Iteration: Start from an initial function Q1 . For each k = 1, 2, . . . ,
apply
Qk+1 ← T ∗ Qk

Writing out the update in full,


Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s' | s, a) max_{a'∈A} Q_k(s', a')

Observe: a fixed point of this update is exactly a solution of the optimal


Bellman equation, which we saw characterizes the Q-function of an
optimal policy.
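A minimal value iteration sketch for a finite MDP stored as arrays (the MDP numbers are made up):

import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, r, gamma, n_iters=1000):
    # Iterate Q <- T* Q, i.e. Q(s,a) <- r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a').
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        Q = r + gamma * P @ Q.max(axis=1)    # Q.max(axis=1)[s'] = max_a' Q(s', a')
    return Q

Q_star = value_iteration(P, r, gamma)
pi_star = Q_star.argmax(axis=1)              # greedy policy pi*(s) = argmax_a Q*(s, a)
print(Q_star, pi_star)

In practice one would stop once the change between successive iterates falls below a tolerance rather than running a fixed number of iterations.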



Value Iteration
(Diagram: starting from an initial Q1, each application of T* (or T^π) produces the next iterate Q2, Q3, ..., converging toward the fixed point.)

Claim: The value iteration update is a contraction map:


‖T* Q1 − T* Q2‖_∞ ≤ γ ‖Q1 − Q2‖_∞

‖·‖_∞ denotes the L∞ norm, defined as:

‖x‖_∞ = max_i |x_i|

If this claim is correct, then value iteration converges exponentially to


the unique fixed point.
The exponential decay factor is γ (the discount factor), which means
longer term planning is harder.
Bellman Operator is a Contraction (optional)
" #
∗ ∗
X 0 0 0
|(T Q1 )(s, a) − (T Q2 )(s, a)| = r(s, a) + γ P(s | s, a) max
0
Q1 (s , a ) −
a
s0
" #
X 0 0 0
r(s, a) + γ P(s | s, a) max
0
Q2 (s , a )
a
s0

X  
=γ P(s0 | s, a) max
0
Q1 (s0
, a 0
) − max
0
Q2 (s0
, a 0
)
a a
s0
X
≤γ P(s0 | s, a) max
0
Q1 (s0 , a0 ) − Q2 (s0 , a0 )
a
s0
X
≤ γ max
0 0
Q1 (s0 , a0 ) − Q2 (s0 , a0 ) P(s0 | s, a)
s ,a
s0

= γ max
0 0
Q1 (s0 , a0 ) − Q2 (s0 , a0 )
s ,a

= γ kQ1 − Q2 k∞
This is true for any (s, a), so
kT ∗ Q1 − T ∗ Q2 k∞ ≤ γ kQ1 − Q2 k∞ ,
which is what we wanted to show.
Value Iteration Recap

So far, we’ve focused on planning, where the dynamics are known.


The optimal Q-function is characterized in terms of a Bellman
fixed point update.
Since the Bellman operator is a contraction map, we can just keep
applying it repeatedly, and we’ll converge to a unique fixed point.
What are the limitations of value iteration?
assumes known dynamics
requires explicitly representing Q∗ as a vector
|S| can be extremely large, or infinite
|A| can be infinite (e.g. continuous voltages in robotics)
But value iteration is still a foundation for a lot of more practical
RL algorithms.



Towards Learning

Now let’s focus on reinforcement learning, where the


environment is unknown. How can we apply learning?
1 Learn a model of the environment, and do planning in the model
(i.e. model-based reinforcement learning)
You already know how to do this in principle, but it’s very hard to
get to work. Not covered in this course.
2 Learn a value function (e.g. Q-learning, covered in this lecture)
3 Learn a policy directly (e.g. policy gradient, not covered in this
course)
How can we deal with extremely large state spaces?
Function approximation: choose a parametric form for the policy
and/or value function (e.g. linear in features, neural net, etc.)



Q-Learning



Monte Carlo Estimation
Recall the optimal Bellman equation:
Q*(s, a) = r(s, a) + γ E_{S'∼P(·|s,a)}[ max_{a'} Q*(S', a') ]

Problem: we need to know the dynamics to evaluate the expectation


Monte Carlo estimation of an expectation µ = E[X]: repeatedly sample
X and update
µ ← µ + α(X − µ)
Idea: Apply Monte Carlo estimation to the Bellman equation by
sampling S' ∼ P(· | s, a) and updating:
Q(s, a) ← Q(s, a) + α ( r(s, a) + γ max_{a'} Q(S', a') − Q(s, a) )

where the quantity in parentheses (multiplying α) is the Bellman error.

This is an example of temporal difference learning, i.e. updating our


predictions to match our later predictions (once we have more
information).
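A small sketch of the running-average update µ ← µ + α(X − µ) on made-up samples; the Q-learning update below applies exactly this form, with the Bellman target playing the role of X:

import numpy as np

rng = np.random.default_rng(0)

mu, alpha = 0.0, 0.05
for _ in range(5000):
    x = rng.normal(loc=3.0, scale=1.0)   # samples of X with E[X] = 3
    mu = mu + alpha * (x - mu)           # running-average (Monte Carlo) update
print(mu)                                # close to 3, fluctuating on the order of alpha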
Monte Carlo Estimation

Problem: Every iteration of value iteration requires updating Q


for every state.
There could be lots of states
We only observe transitions for states that are visited
Idea: Have the agent interact with the environment, and only
update Q for the states that are actually visited.
Problem: We might never visit certain states if they don’t look
promising, so we’ll never learn about them.
Idea: Have the agent sometimes take random actions so that it
eventually visits every state.
ε-greedy policy: a policy which picks arg max_a Q(s, a) with
probability 1 − ε and a random action with probability ε. (Typical
value: ε = 0.05)
Combining all three ideas gives an algorithm called Q-learning.



Q-Learning with ε-Greedy Policy
Parameters:
Learning rate α
Exploration parameter ε
Initialize Q(s, a) for all (s, a) ∈ S × A
The agent starts at state S0 .
For time step t = 0, 1, ...,
Choose At according to the ε-greedy policy, i.e.,
A_t ← argmax_{a∈A} Q(S_t, a)           with probability 1 − ε
      a uniformly random action in A   with probability ε
Take action At in the environment.
The state changes from St to St+1 ∼ P(·|St , At )
Observe St+1 and Rt (could be r(St , At ), or could be stochastic)
Update the action-value function at state-action (St , At ):
 
Q(S_t, A_t) ← Q(S_t, A_t) + α ( R_t + γ max_{a'∈A} Q(S_{t+1}, a') − Q(S_t, A_t) )
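A tabular sketch of this loop, assuming a generic environment object with reset() and step(action) methods (a hypothetical gym-style interface, not something defined in the lecture):

import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99, eps=0.05, n_steps=100_000):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # update toward the Bellman target (no bootstrapping past a terminal state)
        target = r + gamma * Q[s_next].max() * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q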



Exploration vs. Exploitation

The ε-greedy is a simple mechanism for managing the


exploration-exploitation tradeoff.
π_ε(S; Q) = argmax_{a∈A} Q(S, a)           with probability 1 − ε
            a uniformly random action in A  with probability ε

The ε-greedy policy ensures that most of the time (probability 1 − ε) the
agent exploits its incomplete knowledge of the world by choosing the best
action (i.e., the one with the highest action-value), but occasionally
(probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but most widely used, methods for
trading off exploration and exploitation. The exploration-exploitation
tradeoff is an important topic of research.



Examples of Exploration-Exploitation in the Real World

Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]



An Intuition on Why Q-Learning Works (Optional)

Consider a tuple (S, A, R, S'). The Q-learning update is

Q(S, A) ← Q(S, A) + α ( R + γ max_{a'∈A} Q(S', a') − Q(S, A) ).

To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have

E[ R + γ max_{a'∈A} Q(S', a') − Q(S, A) | S, A ] = 0
⇒ (T* Q)(S, A) = Q(S, A)

So at the stochastic equilibrium, we have (T ∗ Q)(S, A) = Q(S, A).


Because the fixed-point of the Bellman optimality operator is unique
(and is Q∗ ), Q is the same as the optimal action-value function Q∗ .



Off-Policy Learning

Q-learning update again:


 
Q(S, A) ← Q(S, A) + α ( R + γ max_{a'∈A} Q(S', a') − Q(S, A) ).

Notice: this update doesn’t mention the policy anywhere. The


only thing the policy is used for is to determine which states are
visited.
This means we can follow whatever policy we want (e.g. ε-greedy),
and it still converges to the optimal Q-function. Algorithms like
this are known as off-policy algorithms, and this is an extremely
useful property.
Policy gradient (another popular RL algorithm, not covered in this
course) is an on-policy algorithm. Encouraging exploration is
much harder in that case.



Function Approximation



Function Approximation

So far, we’ve been assuming a tabular representation of Q: one


entry for every state/action pair.
This is impractical to store for all but the simplest problems, and
doesn’t share structure between related states.
Solution: approximate Q using a parameterized function, e.g.
linear function approximation: Q(s, a) = w^T ψ(s, a)
compute Q with a neural net
Update Q using backprop:

t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).
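A sketch of this update for the linear case Q(s, a) = w^T ψ(s, a), where ∇_w Q(s, a) = ψ(s, a); the feature map and the sample transition below are illustrative placeholders.

import numpy as np

n_actions, n_features = 4, 16

def psi(s, a):
    # Placeholder feature map psi(s, a); a real one would encode the state-action pair.
    rng = np.random.default_rng(abs(hash((s, a))) % (2**32))
    return rng.normal(size=n_features)

def q(w, s, a):
    return w @ psi(s, a)

def q_learning_step(w, s, a, r, s_next, alpha=0.01, gamma=0.99):
    # w <- w + alpha * (t - Q(s_t, a_t)) * grad_w Q(s_t, a_t); the gradient is psi(s, a) here.
    target = r + gamma * max(q(w, s_next, b) for b in range(n_actions))
    return w + alpha * (target - q(w, s, a)) * psi(s, a)

w = np.zeros(n_features)
w = q_learning_step(w, s=0, a=1, r=1.0, s_next=2)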



Function Approximation

Approximating Q with a neural net is a decades-old idea, but


DeepMind got it to work really well on Atari games in 2013 (“deep
Q-learning”)
They used a very small network by today’s standards

Main technical innovation: store experience into a replay buffer,


and perform Q-learning using stored experience
Gains sample efficiency by separating environment interaction from
optimization — don’t need new experience for every SGD update!
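A minimal replay-buffer sketch of the kind described here (the real DQN implementation has more moving parts, e.g. a target network):

import random
from collections import deque

class ReplayBuffer:
    # Stores past transitions; updates sample from it instead of using only the latest step.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.add(s=0, a=1, r=1.0, s_next=2, done=False)   # each environment step adds one transition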



Atari

Mnih et al., Nature 2015. Human-level control through deep


reinforcement learning
Network was given raw pixels as observations
Same architecture shared between all games
Assume fully observable environment, even though that’s not the
case
After about a day of training on a particular game, often beat
“human-level” performance (number of points within 5 minutes of
play)
Did very well on reactive games, poorly on ones that require
planning (e.g. Montezuma’s Revenge)
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=4MlZncshy1Q



Recap and Other Approaches
All discussed approaches estimate the value function first. They are
called value-based methods.
There are methods that directly optimize the policy, i.e., policy search
methods.
Model-based RL methods estimate the true, but unknown, model of the
environment P by an estimate P̂, and use the estimate P̂ in order to
plan.
There are hybrid methods.

(Diagram: value-based, policy-based, and model-based methods as three corners of the space of RL algorithms.)



Reinforcement Learning Resources

Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute’s Reinforcement Learning Summer
School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley



Closing Thoughts



Overview

What this course focused on:


Supervised learning: regression, classification
Choose model, loss function, optimizer
Parametric vs. nonparametric
Generative vs. discriminative
Iterative optimization vs. closed-form solutions
Unsupervised learning: dimensionality reduction and clustering
Reinforcement learning: value iteration
This lecture: what we left out, and teasers for other courses



CSC413 Teaser: Neural Nets

This course covered some fundamental ideas, most of which are


more than 10 years old.
Big shift of the past decade: neural nets and deep learning
2010: neural nets significantly improved speech recognition accuracy
(after 20 years of stagnation)
2012–2015: neural nets reduced error rates for object recognition by
a factor of 6
2016: a program called AlphaGo defeated the human Go champion
2015–2018: neural nets learned to produce convincing
high-resolution images
2018–today: transformers demonstrate a sophisticated ability to
generate natural language text and learn from few examples



CSC413 Teaser: Automatic Differentiation

In this course, you derived update rules by hand


Backprop is totally mechanical. Now we have automatic
differentiation tools that compute gradients for you.
In CSC413, you learn how an autodiff package can be
implemented
Lets you do fancy things like differentiate through the whole
training procedure to compute the gradient of validation loss with
respect to the hyperparameters.
With TensorFlow, PyTorch, etc., we can build much more complex
neural net architectures than we could previously.
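For instance, a minimal sketch with PyTorch: you write only the forward computation, and the package fills in the gradient.

import torch

X = torch.randn(10, 3)
y = torch.randn(10)
w = torch.zeros(3, requires_grad=True)

loss = ((X @ w - y) ** 2).mean()   # forward pass: squared-error loss as a function of w
loss.backward()                    # autodiff computes d(loss)/dw
print(w.grad)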



CSC413 Teaser: Beyond Scalar/Discrete Targets

This course focused on regression and classification,


i.e. scalar-valued or discrete outputs
That only covers a small fraction of use cases. Often, we want to
output something more structured:
text (e.g. image captioning, machine translation)
dense labels of images (e.g. semantic segmentation)
graphs (e.g. molecule design)
This used to be known as structured prediction, but now it’s so
routine we don’t need a name for it.



CSC413 Teaser: Representation Learning

We talked about neural nets as learning feature maps you can use
for regression/classification
More generally, want to learn a representation of the data such
that mathematical operations on the representation are
semantically meaningful
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors



CSC413 Teaser: Representation Learning
Here’s a linear projection of word representations for cities and capitals
into 2 dimensions (part of a representation learned using word2vec)
The mapping city → capital corresponds roughly to a single direction in
the vector space:

Mikolov et al., 2013, “Efficient estimation of word representations in vector space”



CSC413 Teaser: Representation Learning
In other words, vec(Paris) − vec(France) ≈ vec(London) − vec(England)
This means we can solve analogies by doing arithmetic on word vectors:
e.g. “Paris is to France as London is to ”
Find the word whose vector is closest to
vec(France) − vec(Paris) + vec(London)
Example analogies:

Mikolov et al., 2013, “Efficient estimation of word representations in vector space”
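A sketch of the analogy lookup with a toy, made-up embedding table (real word2vec vectors are learned and have a few hundred dimensions):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "london", "england", "tokyo", "japan"]
vec = {w: rng.normal(size=50) for w in vocab}   # made-up vectors; word2vec would learn these

def analogy(a, b, c):
    # Answer "a is to b as c is to ?" by finding the word closest to vec(b) - vec(a) + vec(c).
    query = vec[b] - vec[a] + vec[c]
    cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], query))

print(analogy("paris", "france", "london"))   # with real embeddings this would return "england"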



CSC413 Teaser: Representation Learning

One of the big goals is to learn disentangled representations, where individual dimensions tell you something meaningful

(Figure: qualitative comparisons of latent traversals on CelebA for a β-VAE and a β-TCVAE, with panels for attributes such as baldness, face width, gender, and mustache; traversal ranges shown in parentheses.)

Chen et al., 2018, “Isolating sources of disentanglement in variational autoencoders”
CSC413 Teaser: Image-to-Image Translation
Due to convenient autodiff frameworks, we can combine multiple
neural nets together into fancy architectures. Here’s the CycleGAN.

Zhu et al., 2017, “Unpaired image-to-image translation using cycle-consistent adversarial networks”



CSC413 Teaser: Image-to-Image Translation

Style transfer problem: change the style of an image while preserving


the content.

Data: Two unrelated collections of images, one for each style



CSC412 Teaser: Probabilistic Graphical Models
In this course, we just scratched the surface of probabilistic
models.
Probabilistic graphical models (PGMs) let you encode complex
probabilistic relationships between lots of variables.

Ghahramani, 2015, “Probabilistic ML and artificial intelligence”



CSC412 Teaser: PGM Inference

We derived inference methods by inspection for some easy special


cases (e.g. GDA, naïve Bayes)
In CSC412, you’ll learn much more general and powerful inference
techniques that expand the range of models you can build
Exact inference using dynamic programming, for certain types of
graph structures (e.g. chains)
Markov chain Monte Carlo
forms the basis of a powerful probabilistic modeling tool called Stan
Variational inference: try to approximate a complex, intractable,
high-dimensional distribution using a tractable one
Try to minimize the KL divergence
Based on the same math from our EM lecture



CSC412 Teaser: Beyond Clustering

We’ve seen unsupervised learning algorithms based on two ways of


organizing your data
low-dimensional spaces (dimensionality reduction)
discrete categories (clustering)
Other ways to organize/model data
hierarchies
dynamical systems
sets of attributes
topic models (each document is a mixture of topics)
Motifs can be combined in all sorts of different ways



CSC412 Teaser: Beyond Clustering

Latent Dirichlet Allocation (LDA)

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.

Figure 8: An example article from the AP corpus. Each color codes a different factor from which
the word is putatively generated.
Blei et al., 2003, “Latent Dirichlet Allocation”



CSC412 Teaser: Automatic Statistician

Automatic search over Gaussian process kernel structures

Duvenaud et al., 2013, “Structure discovery in nonparametric regression through compositional kernel
search”
Image: Ghahramani, 2015, “Probabilistic ML and artificial intelligence”



Resources

Continuing with machine learning


Courses
csc413/2516, “Neural Networks and Deep Learning”
csc412/2506, “Probabilistic Learning and Reasoning”
Various topics courses (varies from year to year)
Videos from top ML conferences (NIPS/NeurIPS, ICML, ICLR,
UAI)
Tutorials and keynote talks are aimed at people with your level of
background (know the basics, but not experts in a subfield)
Try to reproduce results from papers
If they’ve released code, you can use that as a guide if you get stuck
Lots of excellent free resources available online!

