
Assignment 5

Reinforcement Learning
Prof. B. Ravindran

1. For a particular finite MDP with bounded rewards, let V be the space of bounded functions
on S, the state space of the MDP. Let Π be the set of all policies, and let v_π be the value
function corresponding to policy π, where π ∈ Π. Is it true that v_π ∈ V, ∀π ∈ Π?

(a) no
(b) yes

2. In the proof of the value iteration theorem, we saw that Lv_{n+1} = L_π v_{n+1}. Is it true,
in general, that for an arbitrary bounded function v, Lv = L_π v (disregarding any special
conditions that may exist in the aforementioned proof)?

(a) no
(b) yes

3. Continuing with the previous question, why is it the case that Lv_{n+1} = L_π v_{n+1} in the
proof of the value iteration theorem?

(a) because the equality holds in general
(b) because v_{n+1} is the optimal value function
(c) because we are considering only deterministic policies choosing a max valued action in
each state
(d) because v_{n+1} is not a value function
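
As a reminder of the notation (assuming, as in the lectures, that L denotes the Bellman
optimality operator and L_π the Bellman operator for policy π; γ is the discount factor, p and
r the transition and reward functions):

    (L v)(s)   = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) v(s') ]
    (L_π v)(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} p(s'|s, a) v(s') ]

L maximizes the bracketed term over actions, while L_π averages it over the actions chosen by π.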

4. Given that q_π(s, a) > v_π(s), we can conclude

(a) action a is the best action that can be taken in state s
(b) π may be an optimal policy
(c) π is not an optimal policy
(d) none of the above
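
A useful identity when reasoning about this is

    v_π(s) = Σ_a π(a|s) q_π(s, a),

i.e. the state value is the π-weighted average of the action values in s.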

5. Recall the problem described in the first question of the previous assignment. Use the MDP
formulation arrived at in that question and, starting with the policy π(laughing) = π(silent) =
(incense, no organ), perform a couple of iterations of policy iteration or value iteration (by
hand!) until you find an optimal policy (if you are taking a lot of iterations, stop and
reconsider your formulation!). What are the resulting optimal state-action values for all
state-action pairs? (A generic value-iteration sketch for cross-checking appears after
question 6.)

(a) q∗(s, a) = 8, ∀a
(b) q∗(s, a) = 10, ∀a
(c) q∗(s, a∗) = 10, q∗(s, a) = −10, ∀a ≠ a∗
(d) q∗(s, a∗) = 10, q∗(s, a) = 8, ∀a ≠ a∗

6. In the previous question, what does the state value function converge to for the policy we
started off with?

(a) v_π(laughing) = v_π(silent) = 10
(b) v_π(laughing) = 8, v_π(silent) = 10
(c) v_π(laughing) = −10, v_π(silent) = 10
(d) v_π(laughing) = −8, v_π(silent) = 10
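
As a cross-check for questions 5 and 6, a minimal value-iteration sketch in Python is given
below. It assumes nothing about the specific MDP from the previous assignment: the transition
array P[a, s, s'], the reward array R[s, a], the discount factor, and the tiny example MDP in
the __main__ block are all placeholders to be replaced with your own formulation.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        """Generic value iteration for a finite MDP.

        P[a, s, s2] -- probability of moving from s to s2 under action a
        R[s, a]     -- expected immediate reward for taking a in s
        Returns (optimal state values, optimal action values, greedy policy).
        """
        n_actions, n_states, _ = P.shape
        v = np.zeros(n_states)
        while True:
            # q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * v[s2]
            q = R + gamma * np.einsum('asn,n->sa', P, v)
            v_new = q.max(axis=1)
            if np.max(np.abs(v_new - v)) < tol:
                return v_new, q, q.argmax(axis=1)
            v = v_new

    if __name__ == "__main__":
        # Placeholder two-state, two-action MDP -- NOT the MDP from the
        # previous assignment; substitute the P and R you derived there.
        P = np.array([[[0.9, 0.1],
                       [0.2, 0.8]],
                      [[0.5, 0.5],
                       [0.4, 0.6]]])
        R = np.array([[1.0, 0.0],
                      [0.0, 2.0]])
        v, q, pi = value_iteration(P, R, gamma=0.9)
        print("v* =", v, "q* =", q, "greedy policy =", pi)

For question 6, the same update with the max over actions replaced by the fixed action
prescribed by the starting policy gives iterative policy evaluation.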

7. In solving an episodic problem we observe that all trajectories from the start state to the goal
state pass through a particular state exactly twice. In such a scenario, is it preferable to use
first-visit or every-visit MC for evaluating the policy?

(a) first-visit MC
(b) every-visit MC
(c) every-visit MC with exploring starts
(d) neither, as there are issues with the problem itself
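
To make the first-visit/every-visit distinction in questions 7-9 concrete, here is a minimal
Monte Carlo policy-evaluation sketch in Python. The episode format, a list of (state, reward
received on leaving that state) pairs, is an assumption made for illustration, not something
specified in the questions.

    from collections import defaultdict

    def mc_state_values(episodes, gamma=1.0, first_visit=True):
        """Monte Carlo evaluation of state values from sampled episodes.

        Each episode is a list of (state, reward) pairs in time order.
        first_visit=True: only the first occurrence of a state in an
        episode contributes a return; otherwise every occurrence does.
        """
        returns = defaultdict(list)
        for episode in episodes:
            # Compute the return G_t for every time step, back to front.
            g = 0.0
            rets = []
            for state, reward in reversed(episode):
                g = reward + gamma * g
                rets.append((state, g))
            rets.reverse()
            seen = set()
            for state, g_t in rets:
                if first_visit and state in seen:
                    continue
                seen.add(state)
                returns[state].append(g_t)
        # Value estimate = average of the collected returns.
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

For example, mc_state_values([[('A', 1), ('B', 0), ('A', 2)]], first_visit=False) averages the
two returns 3 and 2 collected for state 'A' (giving 2.5), while first_visit=True keeps only the
return 3 from its first occurrence.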

8. Which of the following are advantages of Monte Carlo methods over dynamic programming
techniques?

(a) the ability to learn from actual experience
(b) the ability to learn from simulated experience
(c) the ability to estimate the value of a single state independent of the number of states
(d) the ability to show guaranteed convergence to an optimal policy

9. For a specific MDP, suppose we have a policy that we want to evaluate with Monte Carlo
methods, using actual experience in the environment alone. We decide to use the first-visit
approach along with the technique of always picking the start state at random from the
available set of states. Will this approach ensure complete evaluation of the action value
function corresponding to the policy?

(a) no
(b) yes

10. Assuming an MDP where there are n actions a ∈ A, each of which is applicable in every
state s ∈ S, if π is an ε-soft policy for some ε > 0, then

(a) π(a|s) = ε, ∀a, s
(b) π(a|s) = ε/n, ∀a, s
(c) π(a|s) ≥ ε/n, ∀a, s
(d) π(a′|s) = 1 − ε + ε/n, π(a|s) = ε/n, ∀a ≠ a′, ∀s
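
For a concrete picture of the probabilities involved, below is a short sketch of an ε-greedy
policy, one common way of constructing an ε-soft policy; the q_values argument is just an
illustrative stand-in for whatever action-value estimates are available.

    import numpy as np

    def epsilon_greedy_probs(q_values, epsilon):
        """Action probabilities for an epsilon-greedy policy in one state.

        With n actions, the greedy action gets 1 - epsilon + epsilon/n
        and every other action gets epsilon/n, so every action keeps
        probability at least epsilon/n.
        """
        n = len(q_values)
        probs = np.full(n, epsilon / n)
        probs[int(np.argmax(q_values))] += 1.0 - epsilon
        return probs

    # Example: three actions, epsilon = 0.3 -> [0.1, 0.8, 0.1]
    print(epsilon_greedy_probs(np.array([0.1, 0.5, 0.2]), 0.3))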
