
Reinforcement Learning - - Unit 6 - Week 3

The document pertains to the Week 3 assessment for the Reinforcement Learning course on NPTEL, detailing various questions related to the REINFORCE algorithm, policy updates, and Markov Decision Processes (MDPs). It includes multiple-choice questions that test understanding of concepts such as cumulative rewards, Gaussian distributions, and the implications of discount factors in reinforcement learning. The assessment was submitted on August 14, 2024, and allows for multiple submissions before the due date.


8/14/24, 4:12 PM Reinforcement Learning - - Unit 6 - Week 3

Assessment submitted.
Week 3: Assignment 3
Your last recorded submission was on 2024-08-14, 16:22 IST. Due date: 2024-08-14, 23:59 IST.

1) The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)? (1 point)

r_{n-1}
r_n
Action taken (a_n)
None of the above
2) Which of the following statements is true about the RL problem? (1 point)

Our main aim is to maximize the cumulative reward.
The agent always performs the actions in a deterministic fashion.
We assume that the agent determines the next state based on the current state and action.
It is impossible to have zero rewards.

3) Let us say we are taking actions according to a Gaussian distribution with parameters μ and σ. We update the parameters according to REINFORCE; let a_t denote the action taken at step t. (1 point)

(i) μ_{t+1} = μ_t + α r_t (μ_t − a_t)/σ_t^2

(ii) σ_{t+1} = σ_t + α r_t ((a_t − μ_t)^2/σ_t^3 − 1/σ_t)

(iii) σ_{t+1} = σ_t + α r_t (a_t − μ_t)^2/σ_t^3

(iv) μ_{t+1} = μ_t + α r_t (a_t − μ_t)/σ_t^2

Which of the above updates are correct?

(i), (iii)
(i), (iv)
(ii), (iv)
(ii), (iii)
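A quick way to check candidate updates of this form is to compare the closed-form score functions of the Gaussian against finite differences of its log-density; a minimal sketch (the test-point values μ = 0.5, σ = 1.3, a = 2.0 are arbitrary):

```python
import math

# Finite-difference check of the Gaussian score functions
# d/dmu ln N(a; mu, sigma) and d/dsigma ln N(a; mu, sigma),
# i.e. the gradients REINFORCE uses for a Gaussian policy.

def log_pdf(a, mu, sigma):
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - (a - mu) ** 2 / (2 * sigma ** 2))

mu, sigma, a = 0.5, 1.3, 2.0   # arbitrary test point
eps = 1e-6

# Central finite differences of the log-density
d_mu_num = (log_pdf(a, mu + eps, sigma) - log_pdf(a, mu - eps, sigma)) / (2 * eps)
d_sigma_num = (log_pdf(a, mu, sigma + eps) - log_pdf(a, mu, sigma - eps)) / (2 * eps)

# Closed-form score functions
d_mu = (a - mu) / sigma ** 2
d_sigma = (a - mu) ** 2 / sigma ** 3 - 1 / sigma

assert abs(d_mu - d_mu_num) < 1e-5
assert abs(d_sigma - d_sigma_num) < 1e-5
print("score functions verified")
```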
4) The update in REINFORCE is given by θ_{t+1} = θ_t + α r_t ∂ln π(a_t; θ_t)/∂θ_t, where r_t ∂ln π(a_t; θ_t)/∂θ_t is an unbiased estimator of the true gradient of the performance function. There is also a variant of REINFORCE in which a baseline b, independent of the action taken, is subtracted from the obtained reward, i.e., the update is θ_{t+1} = θ_t + α (r_t − b) ∂ln π(a_t; θ_t)/∂θ_t. How are E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ln π(a_t; θ_t)/∂θ_t] related? (1 point)

E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t]

E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ln π(a_t; θ_t)/∂θ_t]

E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ln π(a_t; θ_t)/∂θ_t]

Could be either of a, b or c, depending on the choice of baseline
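The effect of an action-independent baseline on the expected update follows from a standard score-function identity, sketched here for a discrete action set:

```latex
\mathbb{E}\!\left[ b \, \frac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t} \right]
  = b \sum_{a} \pi(a;\theta_t)\, \frac{\partial \ln \pi(a;\theta_t)}{\partial \theta_t}
  = b \sum_{a} \frac{\partial \pi(a;\theta_t)}{\partial \theta_t}
  = b \, \frac{\partial}{\partial \theta_t} \sum_{a} \pi(a;\theta_t)
  = b \, \frac{\partial}{\partial \theta_t} 1
  = 0
```

Because the probabilities sum to one for every θ, the baseline term contributes nothing in expectation.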


5) Consider the following policy-search algorithm for a multi-armed binary bandit: (1 point)

[The update rule appeared as an image in the original page and is not reproduced here; it uses an indicator that is 1 if a = a_t and 0 otherwise.]

Which of the following is true for the above algorithm?

It is the L_{R-I} algorithm.
It is the L_{R-ϵP} algorithm.
It would work well if the best arm had probability 0.9 of resulting in +1 reward and the next best arm had probability 0.5 of resulting in +1 reward.
It would work well if the best arm had probability 0.3 of resulting in +1 reward and the worst arm had probability 0.25 of resulting in +1 reward.

6) Assertion: Contextual bandits can be modeled as a full reinforcement learning problem. (1 point)

Reason: We can define an MDP with n states, where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode and giving a reward according to the corresponding bandit and arm.

Assertion and Reason are both true, and Reason is a correct explanation of Assertion
Assertion and Reason are both true, and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

7) Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. After a few time steps, at time t', the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true? (1 point)

π is definitely a Stationary policy

π is definitely a Non-Stationary policy

π can be Stationary or Non-Stationary

8) The stochastic gradient ascent/descent update occurs in the right direction at every step. (1 point)

True
False
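Whether a single stochastic gradient step moves in the right direction can be probed numerically: an unbiased but noisy gradient estimate can have the wrong sign on individual steps. A small sketch (the noise scale of 5 and the 1000-step count are arbitrary choices):

```python
import random

# The true gradient of f(x) = x^2 at x = 1 is 2. A stochastic
# estimate adds zero-mean Gaussian noise; we count how often the
# noisy estimate points opposite to the true gradient.
random.seed(0)

true_grad = 2.0
n_steps = 1000
wrong_sign = sum(
    1 for _ in range(n_steps)
    if (true_grad + random.gauss(0.0, 5.0)) * true_grad < 0
)

print(wrong_sign, "of", n_steps, "noisy steps point the wrong way")
```

With noise this large, a substantial fraction of individual steps move against the true gradient, even though the estimate is correct in expectation.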

9) Which of the following is true for an MDP? (1 point)

Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})

Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)

Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)

Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t-1}, a_{t-1})

10) Remember that for discounted returns, (1 point)

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

where γ is a discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?

Nothing, γ > 1 is common for many RL problems

Theoretically nothing can go wrong, but this case does not represent any real-world problems


The agent will learn that delayed rewards will always be beneficial and so will not learn properly.

None of the above is true.
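The contrast between γ < 1 and γ > 1 can be illustrated on a constant reward stream, where the γ < 1 return stays bounded by 1/(1 − γ) while the γ > 1 return explodes; a small sketch (the 50-step horizon and unit rewards are arbitrary choices):

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k} over a finite
# horizon, for a constant reward stream r = 1.

def discounted_return(gamma, rewards):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0] * 50

g_small = discounted_return(0.9, rewards)  # bounded: approaches 1/(1 - 0.9) = 10
g_large = discounted_return(5.0, rewards)  # grows without bound as the horizon extends

print(g_small)
print(g_large)
```

Lengthening the horizon leaves g_small essentially unchanged but multiplies g_large by 5 per extra step, which is why γ > 1 breaks the usual convergence arguments for infinite-horizon returns.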

You may submit any number of times before the due date. The final submission will be considered
for grading.
Submit Answers
