RL - Exam2023 Solved


Faculty of Social Sciences

Exam: SOW-BKI258 Reinforcement Learning


Date: 30/03/2023
Time of exam: 08:30-10:30
Number of pages of which this exam consists: 4
Number of questions of which this exam consists: 7

Student number:

Name:

Exam instructions:
Any exam taken by students who are not eligible to sit that exam will be declared invalid. Registering for
the exam is compulsory.
Mobile phones may not be brought into the exam hall.
Candidates will only be permitted to sit an examination if they can present a valid ID.
No part of this exam may be reproduced and/or made public by means of writing out, photo, photocopy or
other medium.
Fill in your name, student number and number of the version, where appropriate, on every form you use.
After the exam, hand in all the forms received to the invigilator.
Any breach of the above-mentioned rules will be reported to the examination committee as an instance of
fraud. As a consequence, your exam could be declared invalid.

Use of aids:
A calculator may be used during the exam.
No (other) aids are permitted.

Use of the forms handed out:


Answer all the questions on the exam form
Use of scrap paper is permitted. In the assignments, include your calculations.

Exam result
10 points can be gained for the entire exam.
The exam results will be announced before: 20/04/2023

Exam review
Date, time and location of access will be announced on Brightspace.

SOW-BKI258: Reinforcement Learning
Exam Date: 30/03/2023
Duration: 2h
Name:
Student Number:

1. (0.5 points) Indicate whether the statements below are True or False:
(a) Episodic tasks have a clear terminal state, whereas continuing tasks do not.

Solution: True

(b) When we define the (discounted) return as Gt = Σ_{k=t+1}^{T} γ^(k−t−1) Rk, the following two conditions can be simultaneously true: T = ∞, γ = 1.

Solution: False. To keep the return finite, at most one of the two conditions may hold: either T = ∞ (with γ < 1) or γ = 1 (with T < ∞); both cannot be true simultaneously.

(c) Exploration is important when learning in model-free scenarios because it ensures more state-action pairs
can be experienced.

Solution: True

(d) TD(0) uses local, sample backups, whereas Dynamic Programming uses full backups. The main difference
between the two is thus the depth of the update.

Solution: False, the main difference described is the width of the backup (sample vs. full), not the depth.

(e) In off-policy Monte Carlo methods, the returns observed following the behavioral policy b are scaled by an importance sampling ratio determined as ρ_{t:T−1} = Π_{k=t}^{T−1} π(Ak|Sk) / b(Ak|Sk).

Solution: True
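For illustration, a minimal Python sketch of computing such a ratio for a single trajectory (the per-step probability values below are made-up examples, not taken from the exam):

    import numpy as np

    # Importance-sampling ratio rho_{t:T-1} = prod_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k).
    pi_probs = np.array([1.0, 0.5, 1.0])    # pi(A_k | S_k), k = t, ..., T-1 (illustrative)
    b_probs  = np.array([0.25, 0.25, 0.5])  # b(A_k | S_k) for the same steps (illustrative)
    rho = np.prod(pi_probs / b_probs)       # ratio used to scale the return G_t observed under b
    print(rho)                              # 4.0 * 2.0 * 2.0 = 16.0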

2. (0.5 points) Maintaining exploration is important when learning in model-free scenarios because (choose all the
options that are correct):
A. Without it Q-value estimates can become biased
B. It imposes a deterministic policy
C. It allows the agent to plan using existing knowledge
D. It ensures a larger number of state-action pairs are experienced
E. The agent receives a larger reward

Solution: A and D are correct. Points are awarded only if these two options and no others are selected.

3. (1 point) Which of the following expressions defines an ϵ-greedy policy update (mark the correct option or write it below):

A. π(a|s) = ϵ/|A(s)| + 1       if a = argmax_{a∈A} Q(s, a);   0.1 otherwise

B. π(a|s) = ϵ/|A(s)| + 1 − ϵ   if a = argmin_{a∈A} Q(s, a);   ϵ/|A(s)| otherwise

C. π(a|s) = ϵ/|A(s)| + 1 − ϵ   if a = argmax_{a∈A} Q(s, a);   ϵ/|A(s)| otherwise

D. π(a|s) = ϵ/|A(s)|           if a = argmax_{a∈A} Q(s, a);   0 otherwise

Solution: Option C
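For illustration, a minimal Python sketch of option C used as an action-sampling rule (the function name and the toy Q-table are illustrative):

    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon, rng):
        # Option C: the greedy action gets probability eps/|A(s)| + 1 - eps,
        # every other action gets probability eps/|A(s)|.
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[state])] += 1.0 - epsilon
        return rng.choice(n_actions, p=probs)

    rng = np.random.default_rng(0)
    Q = np.zeros((6, 4))                         # toy table: 6 states x 4 actions
    a = epsilon_greedy(Q, state=0, n_actions=4, epsilon=0.1, rng=rng)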

4. (1 point) Among the classes of algorithms covered during the course, namely Dynamic Programming (DP),
Monte Carlo (MC) and Temporal Difference Learning (TD),
(a) Which are model-free?

Solution: MC and TD

(b) Which strictly require terminal states (episodic tasks)?

Solution: MC

(c) Which involve planning from knowledge of the environment?

Solution: DP

(d) Which use sampled experience and shallow backups?

Solution: TD

(e) Which use sampled experience and deep backups?

Solution: MC

5. (2 points) An agent receives the following sequence of rewards: R1 = −1, R2 = 2, R3 = 8, R4 = 2, R5 = 3 in an episodic task where the state entered after time step 5 is terminal. Based on the definition of returns provided below and assuming a discounting factor γ = 0.8, calculate the partial returns obtained: G0, G1, G2, G3, G4, G5.
Hint: it is often easier to work backwards. Remember the definition of returns:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + · · · = Rt+1 + γ Gt+1    (1)

Solution: Note: the final values don’t need to be calculated, but the students should show they understand
the expressions. This can also be calculated using only the rewards. It is important that the students realize
that the return from the terminal state is 0 (GT = 0).
G5 = GT = 0
G4 = R5 + γG5 = 3
G3 = R4 + γG4 = 2 + 0.8 ∗ 3 = 4.4
G2 = R3 + γG3 = 8 + 0.8 ∗ 4.4 = 11.52
G1 = R2 + γG2 = 2 + 0.8 ∗ 11.52 = 11.216
G0 = R1 + γG1 = −1 + 0.8 ∗ 11.216 ≈ 7.97
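The same backward recursion can be checked with a short Python sketch (the reward list is taken from the question; variable names are illustrative):

    rewards = [-1, 2, 8, 2, 3]     # R1..R5
    gamma = 0.8
    G = 0.0                        # G5 = G_T = 0 (terminal)
    returns = [G]
    for r in reversed(rewards):    # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()              # returns[t] == G_t, t = 0..5
    print(returns)                 # approx. [7.97, 11.216, 11.52, 4.4, 3.0, 0.0]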

6. (3 points) Consider the 2x3 Grid World MDP depicted below. An agent is placed at the bottom left node (state
1) and the goal is to navigate to the top right node (state 6). Actions that lead to state 6 receive a reward of
+10 and terminate the episode (the agent is returned to the start state for a new episode). For all other actions
taken from any of the other states that do not lead to state 6, the reward is -1. In each state, the agent has 4
possible actions: up, down, left, right. The environment is deterministic, i.e. for each action, the agent moves
in the specified direction.
(a) (2 points) Assuming that actions that would take the agent outside the grid are not allowed (e.g. from
state 1, the agent can only move right or up) and that we start our exploration with an equiprobable
stochastic policy: π0 (a|s) = 1/|A(s)|, ∀s ∈ S, a ∈ A, use the Bellman expectation equation as an update
rule:

[Figure: 2x3 grid world. Top row: states 4, 5, 6 (state 6 = Goal); bottom row: states 1, 2, 3 (state 1 = Start). Available actions: Up, Down, Left, Right.]

vk+1(s) = Eπ[Rt+1 + γ vk(St+1) | St = s] = Σ_a π(a | s) Σ_{s′,r} p(s′, r | s, a) [r + γ vk(s′)]    (2)

and calculate the first updates of the value function for each state, assuming a discounting factor γ = 0.9
and V0 (s) = 0, ∀s ∈ S, i.e. complete the table below.

k    Vk(1)   Vk(2)   Vk(3)   Vk(4)   Vk(5)   Vk(6)
0     0       0       0       0       0       0
1    -1

Solution: Note: the most important insight the students should demonstrate is how to use the
equiprobable random policy when different states allow for a different number of actions. Otherwise, it
is just application of the Bellman equation. For example:

V1 (1) = 0.5 ∗ [(−1) + γV0 (4)] + 0.5 ∗ [(−1) + γV0 (2)] = −1


V1 (2) = 1/3 ∗ [(−1) + γV0 (1)] + 1/3 ∗ [(−1) + γV0 (5)] + 1/3 ∗ [(−1) + γV0 (3)] = −1
V1 (3) = 0.5 ∗ [(−1) + γV0 (2)] + 0.5 ∗ [(+10) + γV0 (6)] = 4.5
···

The final results are: V1 (1) = −1, V1 (2) = −1, V1 (3) = 4.5, V1 (4) = −1, V1 (5) = 8/3, V1 (6) = −1
k    Vk(1)   Vk(2)   Vk(3)   Vk(4)   Vk(5)   Vk(6)
0     0       0       0       0       0       0
1    -1      -1       4.5    -1       2.67   -1
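For reference, a minimal Python sketch of this first sweep under the equiprobable policy (the transition table is reconstructed from the grid description, and state 6 is handled as in the solution table above; only the first sweep is computed):

    # States 1..6 on a 2x3 grid (top row 4 5 6, bottom row 1 2 3); deterministic moves,
    # actions leaving the grid are not allowed. Entering state 6 gives +10 and ends the
    # episode; every other transition gives -1.
    neighbours = {1: [2, 4], 2: [1, 3, 5], 3: [2, 6],
                  4: [1, 5], 5: [2, 4, 6], 6: [3, 5]}    # state 6 as in the solution table
    gamma = 0.9
    V0 = {s: 0.0 for s in range(1, 7)}                   # V_0(s) = 0

    V1 = {}
    for s, succ in neighbours.items():
        total = 0.0
        for s2 in succ:                                  # equiprobable policy: prob 1/|A(s)| each
            r = 10.0 if s2 == 6 else -1.0
            bootstrap = 0.0 if s2 == 6 else V0[s2]       # episode ends on entering the goal
            total += (r + gamma * bootstrap) / len(succ)
        V1[s] = total
    print(V1)    # {1: -1.0, 2: -1.0, 3: 4.5, 4: -1.0, 5: 2.666..., 6: -1.0}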

(b) (0.2 points) Which algorithm did you employ to solve the problem above:
A. Policy iteration
B. Q-learning
C. SARSA
D. TD-learning
E. Value iteration
F. First-visit MC

Solution: A) policy iteration: one sweep of iterative policy evaluation under π0 (using the Bellman expectation equation), followed by the greedy improvement in (c). Value iteration would instead take the maximum over actions in the update.

(c) (0.8 points) Given the value function estimated above, if the agent acts greedily, how would it modify the initial policy, i.e. what would the new policy π1 (s) look like? (complete the grid below).

[Two grids to complete: π0 (initial policy) and π1 (greedy policy).]

Solution: Given the value function V1 computed above, and because the greedy action values q(s, a) = r + γV1(s′) depend directly on the successor-state values, the greedy policy replaces the equiprobable actions in states 2 and 4 with a deterministic move right (towards the higher-valued states 3 and 5, respectively):
V1 (top row: states 4, 5, 6; bottom row: states 1, 2, 3):

 -1    2.67   -1
 -1   -1      4.5
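A short Python sketch of this greedy improvement step (the V1 values and move table mirror the grid above; names are illustrative, and ties such as state 1 are broken arbitrarily):

    V1 = {1: -1.0, 2: -1.0, 3: 4.5, 4: -1.0, 5: 8 / 3, 6: -1.0}
    gamma = 0.9
    moves = {1: {"right": 2, "up": 4},
             2: {"left": 1, "right": 3, "up": 5},
             3: {"left": 2, "up": 6},
             4: {"down": 1, "right": 5},
             5: {"down": 2, "left": 4, "right": 6}}

    for s, acts in moves.items():
        # q(s, a) = r + gamma * V1(s'); entering state 6 yields +10 and terminates
        q = {a: (10.0 if s2 == 6 else -1.0 + gamma * V1[s2]) for a, s2 in acts.items()}
        print(s, max(q, key=q.get))   # greedy choice: 2 -> right, 4 -> right, 3 and 5 -> towards the goal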

7. (2 points) The agent from the previous question is now placed in a new, unfamiliar environment knowing only that
it comprises the same state and action spaces as before, i.e. S = {1, 2, 3, 4, 5, 6}, and A = {up, down, left, right}.
It starts exploring this environment and experiences the following two episodes:

Episode 1:

Timestep (t)   Reward (Rt)   State (St)   Action (At)
0              -             1            right
1              -1            2            left
2              -1            1            up
3              -1            1            right
4              -1            2            right
5              -1            3            up
6              +10           6            -

Episode 2:

Timestep (t)   Reward (Rt)   State (St)   Action (At)
0              -             1            right
1              -1            2            up
2              -1            2            right
3              -1            3            left
4              -1            2            right
5              -1            3            right
6              -1            4            down
7              -1            4            up
8              +10           5            -

(a) (0.7 point) Based on these two episodes and assuming γ = 1, what Q-value estimates can be calculated for
Q(1, right) using an every-visit Monte Carlo method?

Solution: We observe 3 visits to state-action pair (1, right), 2 in the first episode, 1 in the second.
The estimate is the average return:

Q(1, right) = (G11 + G12 + G21) / 3

with G11 = 5 ∗ (−1) + 10 = 5, G12 = 2 ∗ (−1) + 10 = 8 and G21 = 7 ∗ (−1) + 10 = 3,

Q(1, right) = (5 + 8 + 3) / 3 ≈ 5.33
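A minimal Python sketch of this every-visit estimate, with the two episodes transcribed from the tables above (tuple layout and names are illustrative):

    # Each episode is a list of (R_t, S_t, A_t); R_0 and the terminal action are None.
    ep1 = [(None, 1, "right"), (-1, 2, "left"), (-1, 1, "up"), (-1, 1, "right"),
           (-1, 2, "right"), (-1, 3, "up"), (10, 6, None)]
    ep2 = [(None, 1, "right"), (-1, 2, "up"), (-1, 2, "right"), (-1, 3, "left"),
           (-1, 2, "right"), (-1, 3, "right"), (-1, 4, "down"), (-1, 4, "up"), (10, 5, None)]

    returns = []
    for ep in (ep1, ep2):
        rewards = [r for r, _, _ in ep[1:]]          # R_1 .. R_T
        for t, (_, s, a) in enumerate(ep[:-1]):
            if (s, a) == (1, "right"):               # every visit to the pair (1, right)
                returns.append(sum(rewards[t:]))     # G_t with gamma = 1
    print(returns, sum(returns) / len(returns))      # [5, 8, 3] -> 5.33...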

(b) (0.3 points) Using the first-visit method, what estimates can be obtained for the value of state 3, V (3)?

Solution: State 3 is first visited at t = 5 in the first episode, giving the return G = R6 = +10, and at t = 3 in the second episode, giving G = 4 ∗ (−1) + 10 = 6, so

V (3) = (10 + 6) / 2 = 8
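A corresponding first-visit sketch for V(3), using the same episode data as above (names are illustrative):

    ep1_states  = [1, 2, 1, 1, 2, 3, 6]
    ep1_rewards = [-1, -1, -1, -1, -1, 10]                # R_1 .. R_6
    ep2_states  = [1, 2, 2, 3, 2, 3, 4, 4, 5]
    ep2_rewards = [-1, -1, -1, -1, -1, -1, -1, 10]        # R_1 .. R_8

    first_visit_returns = []
    for states, rewards in ((ep1_states, ep1_rewards), (ep2_states, ep2_rewards)):
        t = states.index(3)                               # first visit to state 3
        first_visit_returns.append(sum(rewards[t:]))      # G_t with gamma = 1
    print(first_visit_returns, sum(first_visit_returns) / 2)   # [10, 6] -> 8.0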

(c) (0.7 point) If we change to a TD-learning algorithm to evaluate the state value under the current policy,
what estimates can we obtain for the value of state 3, V (3), given these episodes, assuming the state value
function is initialized at zero, i.e. V0 (s) = 0∀s ∈ S? Remember that, in TD-learning:

V (St ) ← V (St ) + α (Rt+1 + γV (St+1 ) − V (St ))


Additionally, for simplicity, assume the only value being updated by the algorithm is that of state 3, i.e. all other state values are clipped to 0 and are not updated (i.e., V (s) = V0 (s) = 0, ∀s ∈ {1, 2, 4, 5, 6}), and use a fixed learning rate α = 0.1 and discounting factor γ = 0.9.

Solution: State 3 is observed 3 times in the two episodes, so we can estimate its value based on these
observations:
V1 (3) = V0 (3) + α[r + γV0 (6) − V0 (3)] = +1

V2 (3) = V1 (3) + α[r + γV0 (2) − V1 (3)] = +1 − 0.2 = 0.8

V3 (3) = V2 (3) + α[r + γV0 (4) − V2 (3)] = 0.8 − 0.18 ≈ 0.62


So, the final value estimate, based on these 2 episodes is V (3) ≈ 0.62.
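The three updates can be reproduced with a short Python sketch (the (reward, next-state value) pairs follow the visits to state 3 listed above; all next-state values are clipped to 0 as stated in the question):

    alpha, gamma = 0.1, 0.9
    v3 = 0.0                                   # V_0(3)
    # Visits to state 3: (episode 1, t=5) -> next state 6, reward +10;
    # (episode 2, t=3) -> next state 2, reward -1; (episode 2, t=5) -> next state 4, reward -1.
    steps = [(10.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]
    for r, v_next in steps:
        v3 += alpha * (r + gamma * v_next - v3)
        print(v3)                              # approx. 1.0, 0.8, 0.62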

(d) (0.3 points) Considering the two value estimates obtained in (b) and (c) and the corresponding methods,
which of the approaches (MC/TD):
A. Provides estimates with a larger variance?
B. Provides estimates with a larger bias?
C. Is more sensitive to initial values?
Indicate your answer (MC or TD) in front of each question or write the answer key below.

Solution: A. MC; B. TD; C. TD
