RL - Exam2023 Solved


Faculty of Social Sciences

Exam: SOW-BKI258 Reinforcement Learning


Date: 30/03/2023
Time of exam: 08:30-10:30
Number of pages of which this exam consists: 4
Number of questions of which this exam consists: 7

Student number:

Name:

Exam instructions:
Any exam taken by students who are not eligible to sit that exam will be declared invalid. Registering for
the exam is compulsory.
Mobile phones may not be brought into the exam hall.
Candidates will only be permitted to sit an examination if they can present a valid ID.
No part of this exam may be reproduced and/or made public by means of writing out, photo, photocopy or
other medium.
Fill in your name, student number and number of the version, where appropriate, on every form you use.
After the exam, hand in all the forms received to the invigilator.
Any breach of the above-mentioned rules will be reported to the examination committee as an instance of
fraud. As a consequence, your exam could be declared invalid.

Use of aids:
A calculator may be used during the exam.
No (other) aids are permitted.

Use of the forms handed out:


Answer all the questions on the exam form
Use of scrap paper is permitted. In the assignments, include your calculations.

Exam result
10 points can be gained for the entire exam.
The exam results will be announced before: 20/04/2023

Exam review
Date, time and location of access will be announced on Brightspace.

SOW-BKI258: Reinforcement Learning
Exam Date: 30/03/2023
Duration: 2h
Name:
Student Number:

1. (0.5 points) Indicate whether the statements below are True or False:
(a) Episodic tasks have a clear terminal state, whereas continuing tasks do not.

Solution: True

(b) When we define the (discounted) return as Gt = Σ_{k=t+1}^{T} γ^(k−t−1) Rk, the following two conditions can be simultaneously true: T = ∞, γ = 1.

Solution: False. To keep the return finite, at most one of the two conditions may hold: either T = ∞ (with γ < 1) or γ = 1 (with T < ∞); both cannot be true simultaneously.

(c) Exploration is important when learning in model-free scenarios because it ensures more state-action pairs
can be experienced.

Solution: True

(d) TD(0) uses local, sample backups, whereas Dynamic Programming uses full backups. The main difference
between the two is thus the depth of the update.

Solution: False, the main difference described is the width of the backup (sample vs. full), not the depth.

(e) In off-policy Monte Carlo methods, the returns observed following the behavioral policy b are scaled by an importance sampling ratio determined as ρ_{t:T−1} = Π_{k=t}^{T−1} π(Ak|Sk) / b(Ak|Sk).

Solution: True
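For illustration, a minimal Python sketch of computing such a ratio for a single trajectory (the per-step probability values below are made-up examples, not taken from the exam):

    import numpy as np

    # Importance-sampling ratio rho_{t:T-1} = prod_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k).
    pi_probs = np.array([1.0, 0.5, 1.0])    # pi(A_k | S_k), k = t, ..., T-1 (illustrative)
    b_probs  = np.array([0.25, 0.25, 0.5])  # b(A_k | S_k) for the same steps (illustrative)
    rho = np.prod(pi_probs / b_probs)       # ratio used to scale the return G_t observed under b
    print(rho)                              # 4.0 * 2.0 * 2.0 = 16.0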

2. (0.5 points) Maintaining exploration is important when learning in model-free scenarios because (choose all the
options that are correct):
A. Without it Q-value estimates can become biased
B. It imposes a deterministic policy
C. It allows the agent to plan using existing knowledge
D. It ensures a larger number of state-action pairs are experienced
E. The agent receives a larger reward

Solution: A and D are correct. Points are awarded only if these two options and no others are selected.

3. (1 point) Which of the following expressions defines an ϵ-greedy policy update (mark the correct option or write it below):

A. π(a|s) = ϵ/|A(s)| + 1       if a = argmax_{a∈A} Q(s, a);   0.1 otherwise

B. π(a|s) = ϵ/|A(s)| + 1 − ϵ   if a = argmin_{a∈A} Q(s, a);   ϵ/|A(s)| otherwise

C. π(a|s) = ϵ/|A(s)| + 1 − ϵ   if a = argmax_{a∈A} Q(s, a);   ϵ/|A(s)| otherwise

D. π(a|s) = ϵ/|A(s)|           if a = argmax_{a∈A} Q(s, a);   0 otherwise

Solution: Option C
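For illustration, a minimal Python sketch of option C used as an action-sampling rule (the function name and the toy Q-table are illustrative):

    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon, rng):
        # Option C: the greedy action gets probability eps/|A(s)| + 1 - eps,
        # every other action gets probability eps/|A(s)|.
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[state])] += 1.0 - epsilon
        return rng.choice(n_actions, p=probs)

    rng = np.random.default_rng(0)
    Q = np.zeros((6, 4))                         # toy table: 6 states x 4 actions
    a = epsilon_greedy(Q, state=0, n_actions=4, epsilon=0.1, rng=rng)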

4. (1 point) Among the classes of algorithms covered during the course, namely Dynamic Programming (DP),
Monte Carlo (MC) and Temporal Difference Learning (TD),
(a) Which are model-free?

Solution: MC and TD

(b) Which strictly require terminal states (episodic tasks)?

Solution: MC

(c) Which involve planning from knowledge of the environment?

Solution: DP

(d) Which use sampled experience and shallow backups?

Solution: TD

(e) Which use sampled experience and deep backups?

Solution: MC

5. (2 points) An agent receives the following sequence of rewards: R1 = −1, R2 = 2, R3 = 8, R4 = 2, R5 = 3 in an episodic task where the state entered after time step 5 is terminal. Based on the definition of returns provided below and assuming a discounting factor γ = 0.8, calculate the partial returns obtained: G0, G1, G2, G3, G4, G5.
Hint: it is often easier to work backwards. Remember the definition of returns:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + · · · = Rt+1 + γ Gt+1    (1)

Solution: Note: the final values don’t need to be calculated, but the students should show they understand
the expressions. This can also be calculated using only the rewards. It is important that the students realize
that the return from the terminal state is 0 (GT = 0).
G5 = GT = 0
G4 = R5 + γG5 = 3
G3 = R4 + γG4 = 2 + 0.8 ∗ 3 = 4.4
G2 = R3 + γG3 = 8 + 0.8 ∗ 4.4 = 11.52
G1 = R2 + γG2 = 2 + 0.8 ∗ 11.52 = 11.216
G0 = R1 + γG1 = −1 + 0.8 ∗ 11.216 ≈ 7.97
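The same backward recursion can be checked with a short Python sketch (the reward list is taken from the question; variable names are illustrative):

    rewards = [-1, 2, 8, 2, 3]     # R1..R5
    gamma = 0.8
    G = 0.0                        # G5 = G_T = 0 (terminal)
    returns = [G]
    for r in reversed(rewards):    # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()              # returns[t] == G_t, t = 0..5
    print(returns)                 # approx. [7.97, 11.216, 11.52, 4.4, 3.0, 0.0]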

6. (3 points) Consider the 2x3 Grid World MDP depicted below. An agent is placed at the bottom left node (state
1) and the goal is to navigate to the top right node (state 6). Actions that lead to state 6 receive a reward of
+10 and terminate the episode (the agent is returned to the start state for a new episode). For all other actions
taken from any of the other states that do not lead to state 6, the reward is -1. In each state, the agent has 4
possible actions: up, down, left, right. The environment is deterministic, i.e. for each action, the agent moves
in the specified direction.
(a) (2 points) Assuming that actions that would take the agent outside the grid are not allowed (e.g. from
state 1, the agent can only move right or up) and that we start our exploration with an equiprobable
stochastic policy: π0 (a|s) = 1/|A(s)|, ∀s ∈ S, a ∈ A, use the Bellman expectation equation as an update
rule:

[Figure: 2x3 grid world. Top row: states 4, 5, 6 (state 6 = Goal); bottom row: states 1, 2, 3 (state 1 = Start). Available actions: Up, Down, Left, Right.]

vk+1(s) = Eπ[Rt+1 + γ vk(St+1) | St = s] = Σ_a π(a | s) Σ_{s′,r} p(s′, r | s, a) [r + γ vk(s′)]    (2)

and calculate the first updates of the value function for each state, assuming a discounting factor γ = 0.9
and V0 (s) = 0, ∀s ∈ S, i.e. complete the table below.

k    Vk(1)   Vk(2)   Vk(3)   Vk(4)   Vk(5)   Vk(6)
0     0       0       0       0       0       0
1    -1

Solution: Note: the most important insight the students should demonstrate is how to use the
equiprobable random policy when different states allow for a different number of actions. Otherwise, it
is just application of the Bellman equation. For example:

V1 (1) = 0.5 ∗ [(−1) + γV0 (4)] + 0.5 ∗ [(−1) + γV0 (2)] = −1


V1 (2) = 1/3 ∗ [(−1) + γV0 (1)] + 1/3 ∗ [(−1) + γV0 (5)] + 1/3 ∗ [(−1) + γV0 (3)] = −1
V1 (3) = 0.5 ∗ [(−1) + γV0 (2)] + 0.5 ∗ [(+10) + γV0 (6)] = 4.5
···

The final results are: V1 (1) = −1, V1 (2) = −1, V1 (3) = 4.5, V1 (4) = −1, V1 (5) = 8/3, V1 (6) = −1
k    Vk(1)   Vk(2)   Vk(3)   Vk(4)   Vk(5)   Vk(6)
0     0       0       0       0       0       0
1    -1      -1       4.5    -1       2.67   -1
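For reference, a minimal Python sketch of this first sweep under the equiprobable policy (the transition table is reconstructed from the grid description, and state 6 is handled as in the solution table above; only the first sweep is computed):

    # States 1..6 on a 2x3 grid (top row 4 5 6, bottom row 1 2 3); deterministic moves,
    # actions leaving the grid are not allowed. Entering state 6 gives +10 and ends the
    # episode; every other transition gives -1.
    neighbours = {1: [2, 4], 2: [1, 3, 5], 3: [2, 6],
                  4: [1, 5], 5: [2, 4, 6], 6: [3, 5]}    # state 6 as in the solution table
    gamma = 0.9
    V0 = {s: 0.0 for s in range(1, 7)}                   # V_0(s) = 0

    V1 = {}
    for s, succ in neighbours.items():
        total = 0.0
        for s2 in succ:                                  # equiprobable policy: prob 1/|A(s)| each
            r = 10.0 if s2 == 6 else -1.0
            bootstrap = 0.0 if s2 == 6 else V0[s2]       # episode ends on entering the goal
            total += (r + gamma * bootstrap) / len(succ)
        V1[s] = total
    print(V1)    # {1: -1.0, 2: -1.0, 3: 4.5, 4: -1.0, 5: 2.666..., 6: -1.0}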

(b) (0.2 points) Which algorithm did you employ to solve the problem above:
A. Policy iteration
B. Q-learning
C. SARSA
D. TD-learning
E. Value iteration
F. First-visit MC

Solution: A) policy iteration: one sweep of iterative policy evaluation under π0 (using the Bellman expectation equation), followed by the greedy improvement in (c). Value iteration would instead take the maximum over actions in the update.

(c) (0.8 points) Given the value function estimated above, if the agent acts greedily, how would it modify the initial policy, i.e. what would the new policy π1 (s) look like? (complete the grid below).

[Two grids to complete: π0 (initial policy) and π1 (greedy policy).]

Solution: Given the value function V1 computed above, and because the greedy action values q(s, a) = r + γV1(s′) depend directly on the successor-state values, the greedy policy replaces the equiprobable actions in states 2 and 4 with a deterministic move right (towards the higher-valued states 3 and 5, respectively):
V1 (top row: states 4, 5, 6; bottom row: states 1, 2, 3):

 -1    2.67   -1
 -1   -1      4.5
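A short Python sketch of this greedy improvement step (the V1 values and move table mirror the grid above; names are illustrative, and ties such as state 1 are broken arbitrarily):

    V1 = {1: -1.0, 2: -1.0, 3: 4.5, 4: -1.0, 5: 8 / 3, 6: -1.0}
    gamma = 0.9
    moves = {1: {"right": 2, "up": 4},
             2: {"left": 1, "right": 3, "up": 5},
             3: {"left": 2, "up": 6},
             4: {"down": 1, "right": 5},
             5: {"down": 2, "left": 4, "right": 6}}

    for s, acts in moves.items():
        # q(s, a) = r + gamma * V1(s'); entering state 6 yields +10 and terminates
        q = {a: (10.0 if s2 == 6 else -1.0 + gamma * V1[s2]) for a, s2 in acts.items()}
        print(s, max(q, key=q.get))   # greedy choice: 2 -> right, 4 -> right, 3 and 5 -> towards the goal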

7. (2 points) The agent from the previous question is now placed in a new, unfamiliar environment knowing only that
it comprises the same state and action spaces as before, i.e. S = {1, 2, 3, 4, 5, 6}, and A = {up, down, left, right}.
It starts exploring this environment and experiences the following two episodes:

Episode 1:

Timestep (t)   Reward (Rt)   State (St)   Action (At)
0              -             1            right
1              -1            2            left
2              -1            1            up
3              -1            1            right
4              -1            2            right
5              -1            3            up
6              +10           6            -

Episode 2:

Timestep (t)   Reward (Rt)   State (St)   Action (At)
0              -             1            right
1              -1            2            up
2              -1            2            right
3              -1            3            left
4              -1            2            right
5              -1            3            right
6              -1            4            down
7              -1            4            up
8              +10           5            -

(a) (0.7 point) Based on these two episodes and assuming γ = 1, what Q-value estimates can be calculated for
Q(1, right) using an every-visit Monte Carlo method?

Solution: We observe 3 visits to state-action pair (1, right), 2 in the first episode, 1 in the second.
The estimate is the average return:

Q(1, right) = (G11 + G12 + G21) / 3

with G11 = 5 ∗ (−1) + 10 = 5, G12 = 2 ∗ (−1) + 10 = 8 and G21 = 7 ∗ (−1) + 10 = 3,

Q(1, right) = (5 + 8 + 3) / 3 ≈ 5.33
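A minimal Python sketch of this every-visit estimate, with the two episodes transcribed from the tables above (tuple layout and names are illustrative):

    # Each episode is a list of (R_t, S_t, A_t); R_0 and the terminal action are None.
    ep1 = [(None, 1, "right"), (-1, 2, "left"), (-1, 1, "up"), (-1, 1, "right"),
           (-1, 2, "right"), (-1, 3, "up"), (10, 6, None)]
    ep2 = [(None, 1, "right"), (-1, 2, "up"), (-1, 2, "right"), (-1, 3, "left"),
           (-1, 2, "right"), (-1, 3, "right"), (-1, 4, "down"), (-1, 4, "up"), (10, 5, None)]

    returns = []
    for ep in (ep1, ep2):
        rewards = [r for r, _, _ in ep[1:]]          # R_1 .. R_T
        for t, (_, s, a) in enumerate(ep[:-1]):
            if (s, a) == (1, "right"):               # every visit to the pair (1, right)
                returns.append(sum(rewards[t:]))     # G_t with gamma = 1
    print(returns, sum(returns) / len(returns))      # [5, 8, 3] -> 5.33...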

(b) (0.3 points) Using the first-visit method, what estimates can be obtained for the value of state 3, V (3)?

Solution: State 3 is first visited at t = 5 in the first episode, giving the return G = R6 = +10, and at t = 3 in the second episode, giving G = 4 ∗ (−1) + 10 = 6, so

V (3) = (10 + 6) / 2 = 8
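A corresponding first-visit sketch for V(3), using the same episode data as above (names are illustrative):

    ep1_states  = [1, 2, 1, 1, 2, 3, 6]
    ep1_rewards = [-1, -1, -1, -1, -1, 10]                # R_1 .. R_6
    ep2_states  = [1, 2, 2, 3, 2, 3, 4, 4, 5]
    ep2_rewards = [-1, -1, -1, -1, -1, -1, -1, 10]        # R_1 .. R_8

    first_visit_returns = []
    for states, rewards in ((ep1_states, ep1_rewards), (ep2_states, ep2_rewards)):
        t = states.index(3)                               # first visit to state 3
        first_visit_returns.append(sum(rewards[t:]))      # G_t with gamma = 1
    print(first_visit_returns, sum(first_visit_returns) / 2)   # [10, 6] -> 8.0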

(c) (0.7 point) If we change to a TD-learning algorithm to evaluate the state value under the current policy,
what estimates can we obtain for the value of state 3, V (3), given these episodes, assuming the state value
function is initialized at zero, i.e. V0 (s) = 0∀s ∈ S? Remember that, in TD-learning:

V (St ) ← V (St ) + α (Rt+1 + γV (St+1 ) − V (St ))


Additionally, for simplicity, assume the only value being updated by the algorithm is that of state 3, i.e. all other state values are clipped to 0 and are not updated (i.e., V (s) = V0 (s) = 0, ∀s ∈ {1, 2, 4, 5, 6}), and use a fixed learning rate α = 0.1 and discounting factor γ = 0.9.

Solution: State 3 is observed 3 times in the two episodes, so we can estimate its value based on these
observations:
V1 (3) = V0 (3) + α[r + γV0 (6) − V0 (3)] = +1

V2 (3) = V1 (3) + α[r + γV0 (2) − V1 (3)] = +1 − 0.2 = 0.8

V3 (3) = V2 (3) + α[r + γV0 (4) − V2 (3)] = 0.8 − 0.18 ≈ 0.62


So, the final value estimate, based on these 2 episodes is V (3) ≈ 0.62.
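The three updates can be reproduced with a short Python sketch (the (reward, next-state value) pairs follow the visits to state 3 listed above; all next-state values are clipped to 0 as stated in the question):

    alpha, gamma = 0.1, 0.9
    v3 = 0.0                                   # V_0(3)
    # Visits to state 3: (episode 1, t=5) -> next state 6, reward +10;
    # (episode 2, t=3) -> next state 2, reward -1; (episode 2, t=5) -> next state 4, reward -1.
    steps = [(10.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]
    for r, v_next in steps:
        v3 += alpha * (r + gamma * v_next - v3)
        print(v3)                              # approx. 1.0, 0.8, 0.62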

(d) (0.3 points) Considering the two value estimates obtained in (b) and (c) and the corresponding methods,
which of the approaches (MC/TD):
A. Provides estimates with a larger variance?
B. Provides estimates with a larger bias?
C. Is more sensitive to initial values?
Indicate your answer (MC or TD) in front of each question or write the answer key below.

Solution: A. MC; B. TD; C. TD
