Reinforcement Learning (Lecture 12)
The way the agent chooses the action in each step is called a
policy.
We’ll consider two types:
Deterministic policy: At = π(St ) for some function π : S → A
Stochastic policy: At ∼ π(· | St ) for some function π : S → P(A).
(Here, P(A) is the set of distributions over actions.)
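To make the distinction concrete, here is a minimal Python sketch (toy state and action spaces of my own choosing, not from the lecture):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    # pi: S -> A, always returns the same action for a given state
    return "left" if state < 0 else "right"

def stochastic_policy(state):
    # pi(. | s): a distribution over actions; acting means sampling from it
    probs = [0.8, 0.2] if state < 0 else [0.3, 0.7]
    return random.choices(ACTIONS, weights=probs, k=1)[0]
```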
With stochastic policies, the distribution over rollouts, or
trajectories, factorizes:
p(s1, a1, . . . , sT, aT) = p(s1) π(a1 | s1) P(s2 | s1, a1) π(a2 | s2) · · · P(sT | sT−1, aT−1) π(aT | sT)
Note: the fact that a policy need only consider the current state is
a powerful consequence of the Markov assumption and full
observability.
If the environment is partially observable, then the policy needs to
depend on the history of observations.
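Under this factorization, sampling a rollout just alternates between the policy and the transition distribution. A minimal sketch, where sample_initial_state, sample_action, and sample_next_state are hypothetical stand-ins for p(s1), π(· | s), and P(· | s, a):

```python
def sample_rollout(sample_initial_state, sample_action, sample_next_state, T):
    """Sample a trajectory (s1, a1, ..., sT, aT) from the factorized distribution."""
    trajectory = []
    state = sample_initial_state()                    # s1 ~ p(s1)
    for t in range(T):
        action = sample_action(state)                 # a_t ~ pi(. | s_t)
        trajectory.append((state, action))
        if t < T - 1:
            state = sample_next_state(state, action)  # s_{t+1} ~ P(. | s_t, a_t)
    return trajectory
```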
The reward may itself be stochastic, Rt ∼ R(· | St, At), or a deterministic function of the state and action, Rt = r(St, At).
Now that we’ve defined MDPs, let’s see how to find a policy that
achieves a high return.
We can distinguish two situations:
Planning: given a fully specified MDP.
Learning: the agent interacts with an environment whose dynamics are
unknown.
I.e., the environment is a black box that takes in actions and
outputs states and rewards.
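Concretely, "black box" means the agent only gets to call an interface like the following sketch (hypothetical class and method names, in the spirit of common RL toolkits; not a specific library's API):

```python
class BlackBoxEnvironment:
    """The agent sees only states and rewards; the dynamics stay hidden inside."""

    def reset(self):
        """Start an episode and return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (next_state, reward)."""
        raise NotImplementedError
```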
Which framework would be most appropriate for chess? Super
Mario?
The value function V^π for a policy π measures the expected return if you
start in state s and follow policy π:
V^π(s) ≜ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s ].
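For a finite episode, the discounted sum inside this expectation can be computed directly; a small sketch (my own, truncating the infinite sum at the end of the episode):

```python
def discounted_return(rewards, gamma):
    """Discounted return sum_k gamma^k * R_k over a finite list of rewards."""
    G = 0.0
    for reward in reversed(rewards):
        G = reward + gamma * G
    return G

# e.g. discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9**2 * 2.0
```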
Suppose you’re in state s. You get to pick one action a, and then
follow (fixed) policy π from then on. What do you pick?
Now Q* = Q^{π*} is the optimal state-action value function, and we
can rewrite the optimal Bellman equation without mentioning π*:
Q*(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) max_{a'} Q*(s', a'),
where the right-hand side is, by definition, (T* Q*)(s, a); T* is the optimal Bellman backup operator.
Both Q^π and Q* are fixed points of their respective backup operators:
T^π Q^π = Q^π
T* Q* = Q*
Idea: keep iterating the backup operator over and over again.
Q ← T^π Q   (policy evaluation)
Q ← T* Q    (finding the optimal policy)
We're treating Q^π or Q* as a vector with |S| · |A| entries.
This type of algorithm is an instance of dynamic programming.
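As a sketch of what this looks like in the tabular case (my own NumPy implementation of the Q ← T*Q iteration, assuming arrays P[s, a, s'] of transition probabilities and r[s, a] of rewards):

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Repeatedly apply the optimal backup T* until Q stops changing; returns Q ~ Q*."""
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # (T*Q)(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')
        Q_new = r + gamma * P @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```

Because T* is a γ-contraction (shown next), this loop converges from any initialization.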
If f is a contraction with constant α < 1, then iterating f converges geometrically to its unique fixed point x*:
‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.
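A small worked consequence (my own annotation, not from the slides): since the error shrinks by a factor of α per iteration, the number of iterations needed to reach accuracy ε is only logarithmic in the initial error:

```latex
\|f^{(k)}(x) - x^\ast\| \;\le\; \alpha^{k}\,\|x - x^\ast\| \;\le\; \varepsilon
\qquad \text{whenever} \qquad
k \;\ge\; \frac{\log\big(\|x - x^\ast\| / \varepsilon\big)}{\log(1/\alpha)} .
```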
[Figure: applying T* (or T^π) to two value functions Q1 and Q2 yields T*Q1 and T*Q2, which are closer together than Q1 and Q2.]
|(T*Q1)(s, a) − (T*Q2)(s, a)|
    = γ | Σ_{s'} P(s' | s, a) [ max_{a'} Q1(s', a') − max_{a'} Q2(s', a') ] |
    ≤ γ Σ_{s'} P(s' | s, a) max_{a'} | Q1(s', a') − Q2(s', a') |
    ≤ γ max_{s',a'} | Q1(s', a') − Q2(s', a') | Σ_{s'} P(s' | s, a)
    = γ max_{s',a'} | Q1(s', a') − Q2(s', a') |
    = γ ‖Q1 − Q2‖∞
(using the triangle inequality and | max_a f(a) − max_a g(a) | ≤ max_a | f(a) − g(a) |).
This is true for any (s, a), so
‖T* Q1 − T* Q2‖∞ ≤ γ ‖Q1 − Q2‖∞,
which is what we wanted to show.
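As a numerical sanity check (my own sketch, not from the slides), one can verify the inequality on a random tabular MDP; P, r, Q1, Q2 below are made-up arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition probabilities P[s, a, s'] (rows sum to 1) and a random reward table.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def optimal_backup(Q):
    # (T*Q)(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')
    return r + gamma * P @ Q.max(axis=1)

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))
lhs = np.abs(optimal_backup(Q1) - optimal_backup(Q2)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
print(lhs <= rhs + 1e-12)  # True: T* contracts distances by at least a factor of gamma
```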
Value Iteration Recap
The ε-greedy policy ensures that most of the time (with probability 1 − ε) the
agent exploits its incomplete knowledge of the world by choosing the best
action (i.e., the one with the highest action-value), but occasionally
(with probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, yet widely used, methods for
trading off exploration and exploitation. The exploration-exploitation
tradeoff is an important topic of research.
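A minimal sketch of ε-greedy action selection over a tabular action-value estimate Q[s, a] (hypothetical NumPy array; not from the slides), followed by the classic examples of the tradeoff:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(Q[state]))          # exploit
```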
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]
[Figure: the components an RL agent can learn: Value function, Policy, and Model.]
Books:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning:
An Introduction, 2nd edition, 2018.
Csaba Szepesvari, Algorithms for Reinforcement Learning, 2010.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien
Ernst, Reinforcement Learning and Dynamic Programming Using
Function Approximators, 2010.
Dimitri P. Bertsekas and John N. Tsitsiklis, Neuro-Dynamic
Programming, 1996.
Courses:
Video lectures by David Silver
CIFAR and Vector Institute’s Reinforcement Learning Summer
School, 2018.
Deep Reinforcement Learning, CS 294-112 at UC Berkeley
We talked about neural nets as learning feature maps you can use
for regression/classification
More generally, we want to learn a representation of the data such
that mathematical operations on the representation are
semantically meaningful.
Classic (decades-old) example: representing words as vectors
Measure semantic similarity using the dot product between word
vectors (or dissimilarity using Euclidean distance)
Represent a web page with the average of its word vectors
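A toy illustration of these operations (made-up three-dimensional vectors and vocabulary; real word vectors would come from a learned embedding):

```python
import numpy as np

word_vecs = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def similarity(w1, w2):
    # Semantic similarity as a dot product between word vectors.
    return float(word_vecs[w1] @ word_vecs[w2])

def dissimilarity(w1, w2):
    # Dissimilarity as Euclidean distance.
    return float(np.linalg.norm(word_vecs[w1] - word_vecs[w2]))

def page_vector(words):
    # Represent a web page by the average of its word vectors.
    return np.mean([word_vecs[w] for w in words], axis=0)

print(similarity("cat", "dog") > similarity("cat", "car"))  # True
```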
[Figure: latent traversals on CelebA for the attributes (a) Baldness (range −6 to 6), (b) Face width (0 to 6), and (c) Gender (−6 to 6); the most semantically similar latent variables from a β-VAE are shown for comparison.]
Zhu et al., 2017, “Unpaired image-to-image translation using cycle-consistent adversarial networks”
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan
Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.
Figure 8: An example article from the AP corpus. Each color codes a different factor from which
the word is putatively generated.
Blei et al., 2003, “Latent Dirichlet Allocation”
Duvenaud et al., 2013, “Structure discovery in nonparametric regression through compositional kernel search”
Image: Ghahramani, 2015, “Probabilistic ML and artificial intelligence”