Assignment 5
Reinforcement Learning
Prof. B. Ravindran
1. For a particular finite MDP with bounded rewards, let V be the space of bounded functions
on S, the state space of the MDP. Let Π be the set of all policies, and let vπ be the value
function corresponding to policy π, where π ∈ Π. Is it true that vπ ∈ V, ∀π ∈ Π?
(a) no
(b) yes
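(One standard bound worth recalling here, assuming the discounted setting used throughout the course and writing Rmax for a bound on the magnitude of the rewards and γ < 1 for the discount factor, neither of which is named explicitly in the question: for any policy π and any state s, |vπ(s)| ≤ Rmax (1 + γ + γ^2 + ...) = Rmax/(1 − γ).)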
2. In the proof of the value iteration theorem, we saw that Lvn+1 = Lπ vn+1. Is it true, in general,
that for an arbitrary bounded function v, Lv = Lπ v (disregarding any special conditions that
may exist in the aforementioned proof)?
(a) no
(b) yes
3. Continuing with the previous question, why is it the case that Lvn+1 = Lπ vn+1 in the proof
of the value iteration theorem?
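(For reference, one common way of writing these two operators, using γ for the discount factor and p(s′|s, a) for the transition probabilities (notation the questions above leave implicit), is
(Lv)(s) = max_a [ r(s, a) + γ Σ_s′ p(s′|s, a) v(s′) ]  and
(Lπ v)(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_s′ p(s′|s, a) v(s′) ].)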
5. Recall the problem described in the first question of the previous assignment. Using the MDP
formulation arrived at in that question, and starting with the policy π(laughing) = π(silent) =
(incense, no organ), perform a couple of policy iterations or value iterations (by hand!) until
you find an optimal policy (if you are taking a lot of iterations, stop and reconsider your
formulation!). What are the resulting optimal state-action values for all state-action pairs?
(A small computational sketch for checking your answer is given after the options below.)
(a) q∗(s, a) = 8, ∀a
(b) q∗(s, a) = 10, ∀a
(c) q∗(s, a∗) = 10, q∗(s, a) = −10, ∀a ≠ a∗
(d) q∗(s, a∗) = 10, q∗(s, a) = 8, ∀a ≠ a∗
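If you want to check your hand calculations, here is a minimal policy-iteration sketch in Python for a generic two-state, two-action MDP; the transition probabilities, rewards, and discount factor below are placeholders only, not the formulation from the previous assignment, so substitute your own numbers.

import numpy as np

# Placeholder MDP: 2 states, 2 actions. Replace P, R, gamma with your own formulation.
n_states, n_actions = 2, 2
gamma = 0.9                                    # placeholder discount factor
P = np.array([[[0.8, 0.2], [0.2, 0.8]],        # P[s, a, s'] = transition probability
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                      # R[s, a] = expected immediate reward
              [0.0, 2.0]])

policy = np.zeros(n_states, dtype=int)         # start from an arbitrary deterministic policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
    P_pi = P[np.arange(n_states), policy]
    r_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to the resulting q-values
    q = R + gamma * P @ v                      # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v[s']
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy (optimal) policy:", policy)
print("state-action values q*(s, a):")
print(q)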
6. In the previous question, what does the state value function converge to for the policy we
started off with?
7. In solving an episodic problem, we observe that all trajectories from the start state to the goal
state pass through a particular state exactly twice. In such a scenario, is it preferable to use
first-visit or every-visit MC for evaluating the policy? (A short illustrative sketch is given after
the options below.)
(a) first-visit MC
(b) every-visit MC
(c) every-visit MC with exploring starts
(d) neither, as there are issues with the problem itself
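To make the distinction concrete, here is a small Python sketch of how the two estimators accumulate returns from a single episode in which one state is visited twice; the states, rewards, and discount factor used are invented purely for illustration.

from collections import defaultdict

gamma = 1.0
# One recorded episode as (state, reward received on leaving that state);
# note that s1 is visited twice, as in the scenario described above.
episode = [("s1", 0.0), ("s2", 1.0), ("s1", 5.0)]

# Compute the return G_t following every time step
G, returns = 0.0, [0.0] * len(episode)
for t in range(len(episode) - 1, -1, -1):
    G = episode[t][1] + gamma * G
    returns[t] = G

first_visit, every_visit = defaultdict(list), defaultdict(list)
seen = set()
for t, (s, _) in enumerate(episode):
    every_visit[s].append(returns[t])    # every occurrence of s contributes a return
    if s not in seen:                    # only the first occurrence of s contributes
        first_visit[s].append(returns[t])
        seen.add(s)

# Averaging these lists over many episodes gives the two Monte Carlo estimates of v_pi
print("first-visit returns:", dict(first_visit))
print("every-visit returns:", dict(every_visit))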
8. Which of the following are advantages of Monte Carlo methods over dynamic programming
techniques?
9. For a specific MDP, suppose we have a policy that we want to evaluate using Monte Carlo
methods, relying on actual experience in the environment alone. We decide to use the first-visit
approach along with the technique of always picking the start state at random from the
available set of states. Will this approach ensure complete evaluation of the action-value
function corresponding to the policy? (A sketch of this setup is given after the options below.)
(a) no
(b) yes
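The following Python sketch simply mirrors the setup described in the question: first-visit Monte Carlo estimation of action values where only the start state is randomised and every action thereafter comes from the policy being evaluated. The toy dynamics, policy, and state/action names are invented for illustration and are not part of the question.

import random
from collections import defaultdict

states, actions, gamma = ["s1", "s2"], ["a1", "a2"], 0.9
policy = {"s1": "a1", "s2": "a2"}               # made-up deterministic policy to evaluate

def step(s, a):
    """Toy dynamics: returns (next_state, reward, done); purely illustrative."""
    if s == "s1":
        return ("s2", 1.0, False) if a == "a1" else ("s1", 0.0, False)
    return (None, 5.0, True)                    # from s2 the episode terminates in one step

returns = defaultdict(list)
for _ in range(1000):
    s = random.choice(states)                   # random start STATE; the start action comes from pi
    episode, done = [], False
    while not done:
        a = policy[s]
        s_next, r, done = step(s, a)
        episode.append((s, a, r))
        s = s_next
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):
        s_t, a_t, r_t = episode[t]
        G = r_t + gamma * G
        if (s_t, a_t) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit of (s, a)
            returns[(s_t, a_t)].append(G)

print("(state, action) pairs for which estimates were obtained:", sorted(returns))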
10. Assuming an MDP where there are n actions a ∈ A, each of which is applicable in each state
s ∈ S, if π is an ε-soft policy for some ε > 0, then