Reinforcement Learning - Unit 6 - Week 3

Week 3: Assignment 3

Your last recorded submission was on 2024-08-14, 16:22 IST. Due date: 2024-08-14, 23:59 IST.
1) The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)? (1 point)

$r_{n-1}$
$r_n$
Action taken ($a_n$)
None of the above
2) Which of the following statements is true about the RL problem? (1 point)

Our main aim is to maximize the cumulative reward.
The agent always performs the actions in a deterministic fashion.
We assume that the agent determines the next state based on the current state and action.
It is impossible to have zero rewards.
3) Let us say we are taking actions according to a Gaussian distribution with parameters $\mu$ and $\sigma$. We update the parameters according to REINFORCE, and let $a_t$ denote the action taken at step $t$. (1 point)

(i) $\mu_{t+1} = \mu_t + \alpha r_t \frac{\mu_t - a_t}{\sigma_t^2}$

(ii) $\sigma_{t+1} = \sigma_t + \alpha r_t \left( \frac{(a_t - \mu_t)^2}{\sigma_t^3} - \frac{1}{\sigma_t} \right)$

(iii) $\sigma_{t+1} = \sigma_t + \alpha r_t \frac{(a_t - \mu_t)^2}{\sigma_t^3}$

(iv) $\mu_{t+1} = \mu_t + \alpha r_t \frac{a_t - \mu_t}{\sigma_t^2}$

Which of the above updates are correct?

(i), (iii)
(i), (iv)
(ii), (iv)
(ii), (iii)
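For a Gaussian policy, the REINFORCE score functions are $\frac{\partial \ln \mathcal{N}(a;\mu,\sigma)}{\partial \mu} = \frac{a - \mu}{\sigma^2}$ and $\frac{\partial \ln \mathcal{N}(a;\mu,\sigma)}{\partial \sigma} = \frac{(a - \mu)^2}{\sigma^3} - \frac{1}{\sigma}$. The sketch below plugs these into the REINFORCE rule on a one-step problem; the reward function, learning rate, iteration count, and lower clamp on $\sigma$ are all made-up assumptions for illustration, not part of the assignment.

```python
import numpy as np

# Minimal REINFORCE sketch for a 1-D Gaussian policy (illustrative only).
rng = np.random.default_rng(0)

def reward(a):
    return np.exp(-(a - 2.0) ** 2)  # hypothetical reward, peaked at a = 2

mu, sigma, alpha = 0.0, 1.0, 0.01
for t in range(20000):
    a = rng.normal(mu, sigma)                            # a_t ~ N(mu_t, sigma_t)
    r = reward(a)
    grad_mu = (a - mu) / sigma ** 2                      # d ln pi / d mu
    grad_sigma = (a - mu) ** 2 / sigma ** 3 - 1 / sigma  # d ln pi / d sigma
    mu += alpha * r * grad_mu
    sigma += alpha * r * grad_sigma
    sigma = max(sigma, 0.2)  # practical guard: keep sigma positive

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")  # mu should drift toward 2
```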
4) The update in REINFORCE is given by $\theta_{t+1} = \theta_t + \alpha r_t \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}$, where $r_t \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}$ is an unbiased estimator of the true gradient of the performance function. However, there is another variant of REINFORCE in which a baseline $b$ that is independent of the action taken is subtracted from the obtained reward, i.e., the update is given by $\theta_{t+1} = \theta_t + \alpha (r_t - b) \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}$. How are $E\!\left[(r_t - b) \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right]$ and $E\!\left[r_t \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right]$ related? (1 point)

$E\!\left[(r_t - b) \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right] > E\!\left[r_t \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right]$
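The comparison hinges on the fact that, for an action-independent baseline, $E\!\left[b \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right] = b \sum_a \pi(a;\theta_t) \frac{\partial \ln \pi(a;\theta_t)}{\partial \theta_t} = b \frac{\partial}{\partial \theta_t} \sum_a \pi(a;\theta_t) = 0$. Here is a small exact check with a three-arm softmax policy; the preferences, rewards, and baseline value are made-up numbers for illustration.

```python
import numpy as np

# Exact check that an action-independent baseline b leaves the expected
# REINFORCE update unchanged. All numbers below are made up for the demo.
theta = np.array([0.5, -0.3, 0.1])         # hypothetical softmax preferences
pi = np.exp(theta) / np.exp(theta).sum()   # pi(a) = softmax(theta)
r = np.array([1.0, 0.0, 2.0])              # hypothetical per-action rewards
b = 0.7                                    # any action-independent baseline

# For a softmax policy, d ln pi(a) / d theta = one_hot(a) - pi.
score = np.eye(3) - pi                     # row a holds d ln pi(a)/d theta
expected_plain = pi @ (r[:, None] * score)            # E[r * d ln pi/d theta]
expected_baselined = pi @ ((r - b)[:, None] * score)  # E[(r - b) * d ln pi/d theta]
print(expected_plain)
print(expected_baselined)  # identical: the baseline term has expectation zero
```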
5) where the indicator term is 1 if $a = a_t$ and 0 otherwise. Which of the following is true for the above algorithm? (1 point)

It is the $L_{R-I}$ algorithm.
6) Assertion: Contextual bandits can be modeled as a full reinforcement learning problem. (1 point)

Reason: We can define an MDP with $n$ states, where $n$ is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode and giving a reward according to the corresponding bandit and arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false
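For intuition, here is a minimal sketch of the construction described in the Reason: each episode starts in one of $n$ context states, and any arm pull pays out and ends the episode immediately. The context count, arm count, and reward means below are made-up numbers.

```python
import numpy as np

# Contextual bandit viewed as an MDP with one-step episodes (illustrative).
rng = np.random.default_rng(0)
n_contexts, n_arms = 3, 4
mean_reward = rng.uniform(0, 1, size=(n_contexts, n_arms))  # made-up bandit means

def reset():
    return rng.integers(n_contexts)  # a fresh episode starts in a random context

def step(state, action):
    # Every action yields a reward and terminates the episode, as in the Reason.
    r = rng.normal(mean_reward[state, action], 0.1)
    return r, True                   # (reward, done)

s = reset()
r, done = step(s, action=2)
print(s, r, done)                    # one-step episode
```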
7) Let us assume that for some full RL problem we are acting according to a policy $\pi$. At some time $t$, we are in a state $s$ where we took action $a_1$. After a few time steps, at time $t'$, the same state $s$ was reached, where we performed an action $a_2$ ($\neq a_1$). Which of the following statements is true? (1 point)
8) Stochastic gradient ascent/descent updates occur in the right direction at every step. (1 point)

True
False
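As a quick numeric illustration of what "at every step" means here: on a made-up linear-regression problem, individual per-sample gradients can point opposite the full-batch gradient even though their average equals it.

```python
import numpy as np

# Per-sample gradients vs. the full-batch gradient for least squares
# (made-up data; model y ~ w * x, loss (y - w*x)^2, evaluated at w = 0).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)  # true slope is 3

w = 0.0
full_grad = -2 * np.mean(x * (y - w * x))  # full-batch gradient at w
wrong_way = sum(
    1 for xi, yi in zip(x, y)
    if (-2 * xi * (yi - w * xi)) * full_grad < 0  # opposite sign to full_grad
)
print(f"{wrong_way} of {len(x)} per-sample gradients point the wrong way")
```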
$\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0, a_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t)$
$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$

where $\gamma$ is a discount factor. Which of the following best explains what happens when $\gamma > 1$ (say $\gamma = 5$)?
The agent will learn that delayed rewards will always be beneficial and so will not learn properly.

None of the above is true.
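A quick numeric check, assuming a constant reward of 1 per step purely for illustration, of how the truncated return behaves for $\gamma < 1$ versus $\gamma > 1$:

```python
# Truncated 50-step return with a constant reward of 1 (made-up setting).
for gamma in (0.9, 1.0, 5.0):
    G = sum(gamma ** k for k in range(50))
    print(f"gamma = {gamma}: 50-step return = {G:.3g}")
# gamma = 0.9 converges (~10); gamma = 5 blows up (~2e34), so later rewards
# dominate and the infinite-horizon return is unbounded.
```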
You may submit any number of times before the due date. The final submission will be considered
for grading.
Submit Answers