
CSC 311: Introduction to Machine Learning

Lecture 12 - Reinforcement Learning

Roger Grosse, Rahul G. Krishnan, Guodong Zhang

University of Toronto, Fall 2021



Reinforcement Learning Problem
Recall: we categorized types of ML by how much information they
provide about the desired behavior.
Supervised learning: labels of desired behavior
Unsupervised learning: no labels
Reinforcement learning: reward signal evaluating the outcome of
past actions
Bandit problems (Lecture 10) are a simple instance of RL where each
decision is independent.
More commonly, we focus on sequential decision making: an agent
chooses a sequence of actions which each affect future possibilities
available to the agent.

Reinforcement Learning (RL): an agent observes the world, takes an action (and the world's state changes), with the goal of achieving long-term rewards.



Reinforcement Learning
Most RL is done in a mathematical framework called a Markov Decision Process
(MDP).



MDPs: States and Actions

First let’s see how to describe the dynamics of the environment.


The state is a description of the environment in sufficient detail to
determine its evolution.
Think of Newtonian physics.
What would be the state variables for a puck sliding on a
frictionless table?

Markov assumption: the state at time t + 1 depends directly on the


state and action at time t, but not on past states and actions.
To describe the dynamics, we need to specify the transition
probabilities P(St+1 | St , At ).
In this lecture, we assume the state is fully observable, a highly
nontrivial assumption.



MDPs: States and Actions

Suppose you’re controlling a robot hand. What should be the set


of states and actions?

In general, the right granularity of states and actions depends on


what you’re trying to achieve.
MDPs: Policies

The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: At = π(St ) for some function π : S → A
Stochastic policy: At ∼ π(· | St ) for some function π : S → P(A).
(Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or
trajectories, factorizes:
p(s1 , a1 , . . . , sT , aT ) = p(s1 ) π(a1 | s1 ) P(s2 | s1 , a1 ) π(a2 | s2 ) · · · P(sT | sT −1 , aT −1 ) π(aT | sT )

Note: the fact that policies need consider only the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
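To make the factorized rollout distribution above concrete, here is a minimal Python sketch that samples a trajectory from a small finite MDP; the arrays p0, pi, and P are made-up placeholders, not anything specified in the lecture.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
p0 = np.array([1.0, 0.0, 0.0])                   # p(s1): always start in state 0
pi = np.full((n_states, n_actions), 0.5)         # stochastic policy pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s' | s, a)

def sample_rollout(T=5):
    # Sample (s1, a1, ..., sT, aT) from p(s1) pi(a1|s1) P(s2|s1,a1) ... pi(aT|sT).
    s = rng.choice(n_states, p=p0)
    trajectory = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])
        trajectory.append((s, a))
        s = rng.choice(n_states, p=P[s, a])
    return trajectory

print(sample_rollout())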



MDPs: Rewards
In each time step, the agent receives a reward from a distribution
that depends on the current state and action

Rt ∼ R(· | St , At )

For simplicity, we’ll assume rewards are deterministic, i.e.

Rt = r(St , At )

What’s an example where Rt should depend on At ?


The return measures how good the outcome of an episode was.
Undiscounted: G = R0 + R1 + R2 + · · ·
Discounted: G = R0 + γR1 + γ 2 R2 + · · ·
The goal is to maximize the expected return, E[G].
γ is a hyperparameter called the discount factor which determines
how much we care about rewards now vs. rewards later.
What is the effect of large or small γ?
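As a small illustration of the effect of γ, a sketch that computes the discounted return for a made-up reward sequence:

def discounted_return(rewards, gamma):
    # G = R_0 + gamma * R_1 + gamma^2 * R_2 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]          # a large reward arrives three steps late
print(discounted_return(rewards, 0.99))  # ~10.7: with large gamma the delayed reward dominates
print(discounted_return(rewards, 0.1))   # ~1.01: with small gamma it is almost ignored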
MDPs: Rewards

How might you define a reward function for an agent learning to


play a video game?
Change in score (why not current score?)
Some measure of novelty (this is sufficient for most Atari games!)
Consider two possible reward functions for the game of Go. How
do you think the agent’s play will differ depending on the choice?
Option 1: +1 for win, 0 for tie, -1 for loss
Option 2: Agent’s territory minus opponent’s territory (at end)
Specifying a good reward function can be tricky.
https://www.youtube.com/watch?v=tlOIHko8ySg



Markov Decision Processes

Putting this together, a Markov Decision Process (MDP) is defined by a


tuple (S, A, P, R, γ).
S: State space. Discrete or continuous
A: Action space. Here we consider finite action space, i.e.,
A = {a1 , . . . , a|A| }.
P: Transition probability
R: Immediate reward distribution
γ: Discount factor (0 ≤ γ < 1)
Together these define the environment that the agent operates in, and
the objectives it is supposed to achieve.
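As a concrete sketch, a finite MDP can be stored as a few arrays; the 2-state, 2-action numbers below are made up purely for illustration.

import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    P: np.ndarray     # transition probabilities P(s' | s, a), shape (|S|, |A|, |S|)
    r: np.ndarray     # deterministic rewards r(s, a), shape (|S|, |A|)
    gamma: float      # discount factor, 0 <= gamma < 1

mdp = FiniteMDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.0, 1.0], [0.5, 0.5]]]),
    r=np.array([[1.0, 0.0],
                [0.0, 2.0]]),
    gamma=0.9,
)
assert np.allclose(mdp.P.sum(axis=-1), 1.0)   # each P(. | s, a) must be a distribution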



Finding a Policy

Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: agent interacts with an environment with unknown
dynamics.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
Which framework would be most appropriate for chess? Super
Mario?



Value Functions



Value Function

The value function V π for a policy π measures the expected return if you
start in state s and follow policy π.
"∞ #
X
V (s) , Eπ [Gt | St = s] = Eπ
π k
γ Rt+k | St = s .
k=0

This measures the desirability of state s.
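One way to read this definition operationally: V^π(s) can be approximated by averaging (truncated) discounted returns over simulated rollouts, if a simulator is available. A minimal sketch with made-up MDP arrays:

import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                    # r(s, a)
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])                   # pi(a | s)
gamma = 0.9

def monte_carlo_value(s, n_rollouts=2000, horizon=200):
    # Average the discounted return over rollouts that start in state s and follow pi.
    total = 0.0
    for _ in range(n_rollouts):
        state, G, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(2, p=pi[state])
            G += discount * r[state, a]
            discount *= gamma
            state = rng.choice(2, p=P[state, a])
        total += G
    return total / n_rollouts

print(monte_carlo_value(0))   # estimate of V^pi(0)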



Value Function

Rewards: −1 per time-step


Actions: N, E, S, W
States: Agent’s location

[Slide credit: D. Silver]



Value Function

Arrows represent policy π(s)


for each state s

[Slide credit: D. Silver]



Value Function

Numbers represent value


V π (s) of each state s

[Slide credit: D. Silver]



Bellman equations
The foundation of many RL algorithms is the fact that value functions
satisfy a recursive relationship, called the Bellman equation:
V^π(s) = E_π[G_t | S_t = s]
       = E_π[R_t + γ G_{t+1} | S_t = s]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) E_π[G_{t+1} | S_{t+1} = s'] ]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]

Viewing V^π as a vector (where entries correspond to states), define the Bellman backup operator T^π:

(T^π V)(s) := Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]

The Bellman equation can be seen as saying that V^π is a fixed point of the Bellman operator:

T^π V^π = V^π.
Value Function

A value function for golf (example from Sutton and Barto, Reinforcement Learning: An Introduction): the state is the location of the ball, we receive a penalty (negative reward) of 1 for each stroke until we hit the ball into the hole, and the value of a state is the negative of the number of strokes needed to reach the hole from that location. The actions are how we aim and hit the ball, and which club we select (assumed to be either a putter or a driver).

(Figure: the state-value function v_putt(s) for the policy that always putts, with the terminal state having value 0, and the action-value q*(s, driver).)
State-Action Value Function
A closely related but usefully different function is the state-action
value function, or Q-function, Qπ for policy π, defined as:
 
Q^π(s, a) := E_π[ Σ_{k≥0} γ^k R_{t+k} | S_t = s, A_t = a ].

If you knew Qπ , how would you obtain V π ?


V^π(s) = Σ_a π(a | s) Q^π(s, a).

If you knew V π , how would you obtain Qπ ?


Apply a Bellman-like equation:
Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s')

This requires knowing the dynamics, so in general it’s not easy to


recover Qπ from V π .
State-Action Value Function

Qπ satisfies a Bellman equation very similar to V π (proof is


analogous):
Q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Σ_{a'} π(a' | s') Q^π(s', a')  =: (T^π Q^π)(s, a)



Dynamic Programming and Value Iteration



Optimal State-Action Value Function

Suppose you’re in state s. You get to pick one action a, and then
follow (fixed) policy π from then on. What do you pick?

arg max_a Q^π(s, a)

If a deterministic policy π is optimal, then it must be the case that for any state s:

π(s) = arg max_a Q^π(s, a),

otherwise you could improve the policy by changing π(s). (see


Sutton & Barto for a proper proof)



Optimal State-Action Value Function

Bellman equation for optimal policy π ∗ :


Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) Q^{π*}(s', π*(s'))
             = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q^{π*}(s', a')

Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:

Q*(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a')  =: (T* Q*)(s, a)

Turns out this is sufficient to characterize the optimal policy. So


we simply need to solve the fixed point equation T ∗ Q∗ = Q∗ , and
then we can choose π*(s) = arg max_a Q*(s, a).



Bellman Fixed Points

So far: showed that some interesting problems could be reduced


to finding fixed points of Bellman backup operators:
Evaluating a fixed policy π

T π Qπ = Qπ

Finding the optimal policy

T ∗ Q∗ = Q∗

Idea: keep iterating the backup operator over and over again.

Q ← T^π Q   (policy evaluation)

Q ← T* Q   (finding the optimal policy)
We’re treating Qπ or Q∗ as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.
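As a sketch of the first idea, here is policy evaluation by repeatedly applying T^π to a Q-vector, written for the same kind of made-up finite-MDP arrays used above:

import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])
gamma = 0.9

def policy_evaluation(P, r, pi, gamma, n_iters=1000):
    # Iterate Q <- T^pi Q, where (T^pi Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) sum_a' pi(a'|s') Q(s',a').
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        V = (pi * Q).sum(axis=1)     # V(s') = sum_a' pi(a'|s') Q(s', a')
        Q = r + gamma * P @ V        # vectorized form of the backup
    return Q

print(policy_evaluation(P, r, pi, gamma))   # approximates Q^pi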



Bellman Fixed Points

An operator f (mapping from vectors to vectors) is a contraction


map if
‖f(x1) − f(x2)‖ ≤ α ‖x1 − x2‖

for some scalar 0 ≤ α < 1 and vector norm ‖·‖.
Let f^(k) denote f iterated k times. A simple induction shows

‖f^(k)(x1) − f^(k)(x2)‖ ≤ α^k ‖x1 − x2‖.

Let x* be a fixed point of f. Then for any x,

‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.

Hence, iterated application of f , starting from any x, converges


exponentially to a unique fixed point.
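A tiny numerical illustration with a made-up contraction f(x) = 0.5x + 1 (so α = 0.5 and the fixed point is x* = 2):

def f(x):
    return 0.5 * x + 1.0       # contraction with alpha = 0.5; fixed point x* = 2

x, x_star = 10.0, 2.0
for k in range(1, 6):
    x = f(x)
    print(k, abs(x - x_star))  # the error halves every iteration: 4.0, 2.0, 1.0, 0.5, 0.25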



Finding the Optimal Value Function: Value Iteration

Let’s use dynamic programming to find Q∗ .


Value Iteration: Start from an initial function Q1 . For each k = 1, 2, . . . ,
apply
Qk+1 ← T ∗ Qk

Writing out the update in full,


Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s' | s, a) max_{a'∈A} Q_k(s', a')

Observe: a fixed point of this update is exactly a solution of the optimal


Bellman equation, which we saw characterizes the Q-function of an
optimal policy.
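A minimal value iteration sketch for a finite MDP stored as arrays (the MDP numbers are made up):

import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made up)
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, r, gamma, n_iters=1000):
    # Iterate Q <- T* Q, i.e. Q(s,a) <- r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a').
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        Q = r + gamma * P @ Q.max(axis=1)    # Q.max(axis=1)[s'] = max_a' Q(s', a')
    return Q

Q_star = value_iteration(P, r, gamma)
pi_star = Q_star.argmax(axis=1)              # greedy policy pi*(s) = argmax_a Q*(s, a)
print(Q_star, pi_star)

In practice one would stop once the change between successive iterates falls below a tolerance rather than running a fixed number of iterations.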



Value Iteration
(Diagram: starting from an initial Q1, each application of T* (or T^π) produces the next iterate Q2, Q3, ..., converging toward the fixed point.)

Claim: The value iteration update is a contraction map:


‖T* Q1 − T* Q2‖_∞ ≤ γ ‖Q1 − Q2‖_∞

‖·‖_∞ denotes the L∞ norm, defined as:

‖x‖_∞ = max_i |x_i|

If this claim is correct, then value iteration converges exponentially to


the unique fixed point.
The exponential decay factor is γ (the discount factor), which means
longer term planning is harder.
Bellman Operator is a Contraction (optional)
" #
∗ ∗
X 0 0 0
|(T Q1 )(s, a) − (T Q2 )(s, a)| = r(s, a) + γ P(s | s, a) max
0
Q1 (s , a ) −
a
s0
" #
X 0 0 0
r(s, a) + γ P(s | s, a) max
0
Q2 (s , a )
a
s0

X  
=γ P(s0 | s, a) max
0
Q1 (s0
, a 0
) − max
0
Q2 (s0
, a 0
)
a a
s0
X
≤γ P(s0 | s, a) max
0
Q1 (s0 , a0 ) − Q2 (s0 , a0 )
a
s0
X
≤ γ max
0 0
Q1 (s0 , a0 ) − Q2 (s0 , a0 ) P(s0 | s, a)
s ,a
s0

= γ max
0 0
Q1 (s0 , a0 ) − Q2 (s0 , a0 )
s ,a

= γ kQ1 − Q2 k∞
This is true for any (s, a), so
kT ∗ Q1 − T ∗ Q2 k∞ ≤ γ kQ1 − Q2 k∞ ,
which is what we wanted to show.
Value Iteration Recap

So far, we’ve focused on planning, where the dynamics are known.


The optimal Q-function is characterized in terms of a Bellman
fixed point update.
Since the Bellman operator is a contraction map, we can just keep
applying it repeatedly, and we’ll converge to a unique fixed point.
What are the limitations of value iteration?
assumes known dynamics
requires explicitly representing Q∗ as a vector
|S| can be extremely large, or infinite
|A| can be infinite (e.g. continuous voltages in robotics)
But value iteration is still a foundation for a lot of more practical
RL algorithms.



Towards Learning

Now let’s focus on reinforcement learning, where the


environment is unknown. How can we apply learning?
1 Learn a model of the environment, and do planning in the model
(i.e. model-based reinforcement learning)
You already know how to do this in principle, but it’s very hard to
get to work. Not covered in this course.
2 Learn a value function (e.g. Q-learning, covered in this lecture)
3 Learn a policy directly (e.g. policy gradient, not covered in this
course)
How can we deal with extremely large state spaces?
Function approximation: choose a parametric form for the policy
and/or value function (e.g. linear in features, neural net, etc.)



Q-Learning



Monte Carlo Estimation
Recall the optimal Bellman equation:
Q*(s, a) = r(s, a) + γ E_{S'∼P(·|s,a)}[ max_{a'} Q*(S', a') ]

Problem: we need to know the dynamics to evaluate the expectation


Monte Carlo estimation of an expectation µ = E[X]: repeatedly sample
X and update
µ ← µ + α(X − µ)
Idea: Apply Monte Carlo estimation to the Bellman equation by
sampling S' ∼ P(· | s, a) and updating:
Q(s, a) ← Q(s, a) + α ( r(s, a) + γ max_{a'} Q(S', a') − Q(s, a) )

where the quantity in parentheses (multiplying α) is the Bellman error.

This is an example of temporal difference learning, i.e. updating our


predictions to match our later predictions (once we have more
information).
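A small sketch of the running-average update µ ← µ + α(X − µ) on made-up samples; the Q-learning update below applies exactly this form, with the Bellman target playing the role of X:

import numpy as np

rng = np.random.default_rng(0)

mu, alpha = 0.0, 0.05
for _ in range(5000):
    x = rng.normal(loc=3.0, scale=1.0)   # samples of X with E[X] = 3
    mu = mu + alpha * (x - mu)           # running-average (Monte Carlo) update
print(mu)                                # close to 3, fluctuating on the order of alpha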
Monte Carlo Estimation

Problem: Every iteration of value iteration requires updating Q


for every state.
There could be lots of states
We only observe transitions for states that are visited
Idea: Have the agent interact with the environment, and only
update Q for the states that are actually visited.
Problem: We might never visit certain states if they don’t look
promising, so we’ll never learn about them.
Idea: Have the agent sometimes take random actions so that it
eventually visits every state.
ε-greedy policy: a policy which picks arg max_a Q(s, a) with
probability 1 − ε and a random action with probability ε. (Typical
value: ε = 0.05)
Combining all three ideas gives an algorithm called Q-learning.



Q-Learning with ε-Greedy Policy
Parameters:
Learning rate α
Exploration parameter ε
Initialize Q(s, a) for all (s, a) ∈ S × A
The agent starts at state S0 .
For time step t = 0, 1, ...,
Choose At according to the ε-greedy policy, i.e.,
A_t ← argmax_{a∈A} Q(S_t, a)           with probability 1 − ε
      a uniformly random action in A   with probability ε
Take action At in the environment.
The state changes from St to St+1 ∼ P(·|St , At )
Observe St+1 and Rt (could be r(St , At ), or could be stochastic)
Update the action-value function at state-action (St , At ):
 
Q(S_t, A_t) ← Q(S_t, A_t) + α ( R_t + γ max_{a'∈A} Q(S_{t+1}, a') − Q(S_t, A_t) )
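A tabular sketch of this loop, assuming a generic environment object with reset() and step(action) methods (a hypothetical gym-style interface, not something defined in the lecture):

import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99, eps=0.05, n_steps=100_000):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # update toward the Bellman target (no bootstrapping past a terminal state)
        target = r + gamma * Q[s_next].max() * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q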



Exploration vs. Exploitation

The ε-greedy is a simple mechanism for managing the


exploration-exploitation tradeoff.
π_ε(S; Q) = argmax_{a∈A} Q(S, a)           with probability 1 − ε
            a uniformly random action in A  with probability ε

The ε-greedy policy ensures that most of the time (probability 1 − ε) the
agent exploits its incomplete knowledge of the world by choosing the best
action (i.e., the one with the highest action-value), but occasionally
(probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but most widely used, methods for
trading off exploration and exploitation. The exploration-exploitation
tradeoff is an important topic of research.



Examples of Exploration-Exploitation in the Real World

Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]



An Intuition on Why Q-Learning Works (Optional)

Consider a tuple (S, A, R, S'). The Q-learning update is

Q(S, A) ← Q(S, A) + α ( R + γ max_{a'∈A} Q(S', a') − Q(S, A) ).

To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have

E[ R + γ max_{a'∈A} Q(S', a') − Q(S, A) | S, A ] = 0
⇒ (T* Q)(S, A) = Q(S, A)

So at the stochastic equilibrium, we have (T ∗ Q)(S, A) = Q(S, A).


Because the fixed-point of the Bellman optimality operator is unique
(and is Q∗ ), Q is the same as the optimal action-value function Q∗ .



Off-Policy Learning

Q-learning update again:


 
Q(S, A) ← Q(S, A) + α ( R + γ max_{a'∈A} Q(S', a') − Q(S, A) ).

Notice: this update doesn’t mention the policy anywhere. The


only thing the policy is used for is to determine which states are
visited.
This means we can follow whatever policy we want (e.g. ε-greedy),
and it still converges to the optimal Q-function. Algorithms like
this are known as off-policy algorithms, and this is an extremely
useful property.
Policy gradient (another popular RL algorithm, not covered in this
course) is an on-policy algorithm. Encouraging exploration is
much harder in that case.



Function Approximation



Function Approximation

So far, we’ve been assuming a tabular representation of Q: one


entry for every state/action pair.
This is impractical to store for all but the simplest problems, and
doesn’t share structure between related states.
Solution: approximate Q using a parameterized function, e.g.
linear function approximation: Q(s, a) = w^T ψ(s, a)
compute Q with a neural net
Update Q using backprop:

t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).
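A sketch of this update for the linear case Q(s, a) = w^T ψ(s, a), where ∇_w Q(s, a) = ψ(s, a); the feature map and the sample transition below are illustrative placeholders.

import numpy as np

n_actions, n_features = 4, 16

def psi(s, a):
    # Placeholder feature map psi(s, a); a real one would encode the state-action pair.
    rng = np.random.default_rng(abs(hash((s, a))) % (2**32))
    return rng.normal(size=n_features)

def q(w, s, a):
    return w @ psi(s, a)

def q_learning_step(w, s, a, r, s_next, alpha=0.01, gamma=0.99):
    # w <- w + alpha * (t - Q(s_t, a_t)) * grad_w Q(s_t, a_t); the gradient is psi(s, a) here.
    target = r + gamma * max(q(w, s_next, b) for b in range(n_actions))
    return w + alpha * (target - q(w, s, a)) * psi(s, a)

w = np.zeros(n_features)
w = q_learning_step(w, s=0, a=1, r=1.0, s_next=2)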



Function Approximation

Approximating Q with a neural net is a decades-old idea, but


DeepMind got it to work really well on Atari games in 2013 (“deep
Q-learning”)
They used a very small network by today’s standards

Main technical innovation: store experience into a replay buffer,


and perform Q-learning using stored experience
Gains sample efficiency by separating environment interaction from
optimization — don’t need new experience for every SGD update!
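A minimal replay-buffer sketch of the kind described here (the real DQN implementation has more moving parts, e.g. a target network):

import random
from collections import deque

class ReplayBuffer:
    # Stores past transitions; updates sample from it instead of using only the latest step.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.add(s=0, a=1, r=1.0, s_next=2, done=False)   # each environment step adds one transition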



Atari

Mnih et al., Nature 2015. Human-level control through deep


reinforcement learning
Network was given raw pixels as observations
Same architecture shared between all games
Assume fully observable environment, even though that’s not the
case
After about a day of training on a particular game, often beat
“human-level” performance (number of points within 5 minutes of
play)
Did very well on reactive games, poorly on ones that require
planning (e.g. Montezuma’s Revenge)
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=4MlZncshy1Q



Recap and Other Approaches
All discussed approaches estimate the value function first. They are
called value-based methods.
There are methods that directly optimize the policy, i.e., policy search
methods.
Model-based RL methods estimate the true, but unknown, model of the
environment P by an estimate P̂, and use the estimate P̂ in order to
plan.
There are hybrid methods.

(Diagram: value-based, policy-based, and model-based methods as three corners of the space of RL algorithms.)



Reinforcement Learning Resources

Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute’s Reinforcement Learning Summer
School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley



Closing Thoughts



Overview

What this course focused on:


Supervised learning: regression, classification
Choose model, loss function, optimizer
Parametric vs. nonparametric
Generative vs. discriminative
Iterative optimization vs. closed-form solutions
Unsupervised learning: dimensionality reduction and clustering
Reinforcement learning: value iteration
This lecture: what we left out, and teasers for other courses



CSC413 Teaser: Neural Nets

This course covered some fundamental ideas, most of which are


more than 10 years old.
Big shift of the past decade: neural nets and deep learning
2010: neural nets significantly improved speech recognition accuracy
(after 20 years of stagnation)
2012–2015: neural nets reduced error rates for object recognition by
a factor of 6
2016: a program called AlphaGo defeated the human Go champion
2015–2018: neural nets learned to produce convincing
high-resolution images
2018–today: transformers demonstrate a sophisticated ability to
generate natural language text and learn from few examples



CSC413 Teaser: Automatic Differentiation

In this course, you derived update rules by hand


Backprop is totally mechanical. Now we have automatic
differentiation tools that compute gradients for you.
In CSC413, you learn how an autodiff package can be
implemented
Lets you do fancy things like differentiate through the whole
training procedure to compute the gradient of validation loss with
respect to the hyperparameters.
With TensorFlow, PyTorch, etc., we can build much more complex
neural net architectures than we could previously.
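For instance, a minimal sketch with PyTorch: you write only the forward computation, and the package fills in the gradient.

import torch

X = torch.randn(10, 3)
y = torch.randn(10)
w = torch.zeros(3, requires_grad=True)

loss = ((X @ w - y) ** 2).mean()   # forward pass: squared-error loss as a function of w
loss.backward()                    # autodiff computes d(loss)/dw
print(w.grad)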



CSC413 Teaser: Beyond Scalar/Discrete Targets

This course focused on regression and classification,


i.e. scalar-valued or discrete outputs
That only covers a small fraction of use cases. Often, we want to
output something more structured:
text (e.g. image captioning, machine translation)
dense labels of images (e.g. semantic segmentation)
graphs (e.g. molecule design)
This used to be known as structured prediction, but now it’s so
routine we don’t need a name for it.



CSC413 Teaser: Representation Learning

We talked about neural nets as learning feature maps you can use
for regression/classification
More generally, want to learn a representation of the data such
that mathematical operations on the representation are
semantically meaningful
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors



CSC413 Teaser: Representation Learning
Here’s a linear projection of word representations for cities and capitals
into 2 dimensions (part of a representation learned using word2vec)
The mapping city → capital corresponds roughly to a single direction in
the vector space:

Mikolov et al., 2013, “Efficient estimation of word representations in vector space”



CSC413 Teaser: Representation Learning
In other words, vec(Paris) − vec(France) ≈ vec(London) − vec(England)
This means we can solve analogies by doing arithmetic on word vectors:
e.g. “Paris is to France as London is to ”
Find the word whose vector is closest to
vec(France) − vec(Paris) + vec(London)
Example analogies:

Mikolov et al., 2013, “Efficient estimation of word representations in vector space”
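A sketch of the analogy lookup with a toy, made-up embedding table (real word2vec vectors are learned and have a few hundred dimensions):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "london", "england", "tokyo", "japan"]
vec = {w: rng.normal(size=50) for w in vocab}   # made-up vectors; word2vec would learn these

def analogy(a, b, c):
    # Answer "a is to b as c is to ?" by finding the word closest to vec(b) - vec(a) + vec(c).
    query = vec[b] - vec[a] + vec[c]
    cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], query))

print(analogy("paris", "france", "london"))   # with real embeddings this would return "england"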



CSC413 Teaser: Representation Learning

One of the big goals is to learn disentangled representations, where individual dimensions tell you something meaningful

(Figure: qualitative comparisons of latent traversals on CelebA for a β-VAE and a β-TCVAE, with panels for attributes such as baldness, face width, gender, and mustache; traversal ranges shown in parentheses.)

Chen et al., 2018, “Isolating sources of disentanglement in variational autoencoders”
CSC413 Teaser: Image-to-Image Translation
Due to convenient autodiff frameworks, we can combine multiple
neural nets together into fancy architectures. Here’s the CycleGAN.

Zhu et al., 2017, “Unpaired image-to-image translation using cycle-consistent adversarial networks”



CSC413 Teaser: Image-to-Image Translation

Style transfer problem: change the style of an image while preserving


the content.

Data: Two unrelated collections of images, one for each style



CSC412 Teaser: Probabilistic Graphical Models
In this course, we just scratched the surface of probabilistic
models.
Probabilistic graphical models (PGMs) let you encode complex
probabilistic relationships between lots of variables.

Ghahramani, 2015, “Probabilistic ML and artificial intelligence”



CSC412 Teaser: PGM Inference

We derived inference methods by inspection for some easy special


cases (e.g. GDA, naïve Bayes)
In CSC412, you’ll learn much more general and powerful inference
techniques that expand the range of models you can build
Exact inference using dynamic programming, for certain types of
graph structures (e.g. chains)
Markov chain Monte Carlo
forms the basis of a powerful probabilistic modeling tool called Stan
Variational inference: try to approximate a complex, intractable,
high-dimensional distribution using a tractable one
Try to minimize the KL divergence
Based on the same math from our EM lecture



CSC412 Teaser: Beyond Clustering

We’ve seen unsupervised learning algorithms based on two ways of


organizing your data
low-dimensional spaces (dimensionality reduction)
discrete categories (clustering)
Other ways to organize/model data
hierarchies
dynamical systems
sets of attributes
topic models (each document is a mixture of topics)
Motifs can be combined in all sorts of different ways



CSC412 Teaser: Beyond Clustering

Latent Dirichlet Allocation (LDA)

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.

Figure 8: An example article from the AP corpus. Each color codes a different factor from which
the word is putatively generated.
Blei et al., 2003, “Latent Dirichlet Allocation”



CSC412 Teaser: Automatic Statistician

Automatic search over Gaussian process kernel structures

Duvenaud et al., 2013, “Structure discovery in nonparametric regression through compositional kernel
search”
Image: Ghahramani, 2015, “Probabilistic ML and artificial intelligence”



Resources

Continuing with machine learning


Courses
csc413/2516, “Neural Networks and Deep Learning”
csc412/2506, “Probabilistic Learning and Reasoning”
Various topics courses (varies from year to year)
Videos from top ML conferences (NIPS/NeurIPS, ICML, ICLR,
UAI)
Tutorials and keynote talks are aimed at people with your level of
background (know the basics, but not experts in a subfield)
Try to reproduce results from papers
If they’ve released code, you can use that as a guide if you get stuck
Lots of excellent free resources available online!

