RL Cheat Sheet

Definitions
• State (S): the current condition of the environment
• Reward (R): the instant return from the environment, used to appraise the last action
• Value (V): the expected long-term reward with discount, as opposed to the short-term reward R
• Action-Value (Q): similar to Value, except that it takes an extra parameter, the action A
• Policy (π): the agent's approach to determining the next action based on the current state
• Exploitation: using already known information to maximize reward
• Exploration: exploring the environment to capture more information

Discount Factor (γ)
• Varies between 0 and 1
• Closer to 0 → the agent tends to favour the immediate reward
• Closer to 1 → the agent gives future rewards greater weight

Value Function V(s)
• The long-term value of state s
• The state value function V(s) of an MRP is the expected return from state s
• V(s) = E[G_t | S_t = s]

Action Value Function q(s, a)
• q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Bellman Equation
• V(s) = R(s) + γ E_{s'}[V(s')]
• V(s) = R(s) + γ Σ_{s'∈S} P_ss' V(s')
• In matrix form: V = R + γPV, so V = (I − γP)⁻¹ R
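A minimal numpy sketch of the matrix form above, assuming a made-up 3-state MRP (the transition matrix P, reward vector R and γ value are illustrative, not from this sheet):

import numpy as np

# Illustrative 3-state MRP: P is the state transition matrix (rows sum to 1),
# R is the reward vector, gamma the discount factor (all values are made up).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9

# Closed-form solution of the Bellman equation: V = (I - gamma * P)^-1 R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)   # long-term value of each state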
Markov Process
• Consists of a ⟨S, P⟩ tuple, where S is the set of states and P is the state transition matrix
• P_ss' = P(S_{t+1} = s' | S_t = s)
• The state distribution evolves as μ_{t+1} = Pᵀ μ_t, where μ_t = [μ_{t,1} … μ_{t,n}]ᵀ
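A tiny sketch of the distribution update above; the 3-state transition matrix and starting distribution are made-up assumptions:

import numpy as np

# Made-up 3-state transition matrix; row s holds P(s' | s).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [1.0, 0.0, 0.0]])

mu = np.array([1.0, 0.0, 0.0])   # start in state 0 with probability 1
for t in range(5):
    mu = P.T @ mu                # mu_{t+1} = P^T mu_t
print(mu)                        # state distribution after 5 steps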
Markov Reward Process
• Consists of a ⟨S, P, R, γ⟩ tuple, where R is the reward function and γ is the discount factor
• R(s) = E[R_{t+1} | S_t = s]
• G_t = R_{t+1} + γR_{t+2} + … = Σ_{k=0}^∞ γ^k R_{t+k+1} is the total discounted return
• Why discount: uncertainty about the future may not be fully represented, immediate rewards are valued more than delayed ones, and it avoids infinite returns in cyclic processes
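As a quick illustration of the return G_t, a few lines summing a finite list of example rewards with discounting (the reward values and γ are arbitrary):

# G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence
rewards = [1.0, 0.0, -1.0, 2.0]          # example rewards R_{t+1}, R_{t+2}, ...
gamma = 0.9
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)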
Markov Decision Process
• Consists of a ⟨S, A, P, R, γ⟩ tuple, where A is the set of actions
• P^a_ss' = P(S_{t+1} = s' | S_t = s, A_t = a)

Q-Learning
1. Create the reward matrix R, where R_sa is the reward for taking action a in state s, and set the γ parameter.
2. Initialize the Q matrix to 0.
3. Pick a random initial state and assign it as the current state.
4. Select one among all possible actions of the current state:
   a. Use this action to reach the new state s'.
   b. Get the maximum Q value of s' over all its possible actions a'.
   c. Update the Q matrix using
      Q_sa = R_sa + γ × max[Q_s'a']
      ∀ s' accessible from s, ∀ a' available in s'
5. Repeat step 4 until the current state is the goal state.
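A minimal sketch of the tabular procedure above. The environment is a made-up 3-state example in which taking action a from state s simply moves the agent to state a, R[s, a] holds the reward and NaN marks unavailable actions; none of these specifics come from the sheet.

import numpy as np

R = np.array([[np.nan, 0.0,    np.nan],
              [0.0,    np.nan, 100.0 ],
              [np.nan, 0.0,    np.nan]])
gamma, goal = 0.8, 2
n_states = R.shape[0]

Q = np.zeros_like(R)                               # step 2: Q matrix of zeros
rng = np.random.default_rng(0)

for _ in range(200):                               # run many episodes
    s = rng.integers(n_states)                     # step 3: random initial state
    while s != goal:                               # step 5: repeat until goal
        actions = np.where(~np.isnan(R[s]))[0]     # step 4: actions available in s
        a = rng.choice(actions)
        s_next = a                                 # 4a: action a leads to state a
        next_actions = np.where(~np.isnan(R[s_next]))[0]
        max_q = Q[s_next, next_actions].max()      # 4b: max Q over actions in s'
        Q[s, a] = R[s, a] + gamma * max_q          # 4c: update rule
        s = s_next

print(np.round(Q, 1))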

Policy
• π(a|s) = P(A_t = a | S_t = s)
• Either deterministic or stochastic; a deterministic policy puts probability 1 on a single action a_t
• Induced transition kernel: P_π(s'|s) = Σ_a π(a|s) P(s'|s, a)
• One-step expected reward: r_π(s) = Σ_a π(a|s) r(s, a)
• If the reward is a function of the transition: r_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) r(s, a, s')

Monte Carlo Policy Evaluation
1. Goal: estimate the value function V_π(s).
2. At every time step t at which state s is visited in an episode:
   a. Increment the counter N(s) ← N(s) + 1
   b. Increment the total return S(s) ← S(s) + G_t
3. The value estimate is the mean V(s) = S(s)/N(s).
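A minimal sketch of every-visit Monte Carlo evaluation following the steps above; the two hard-coded episodes stand in for trajectories sampled under π and are purely illustrative:

import numpy as np
from collections import defaultdict

# Each episode is a list of (state, reward) pairs collected while following pi.
episodes = [[(0, 1.0), (1, 0.0), (2, 5.0)],
            [(0, 0.0), (2, 5.0)]]
gamma = 0.9

N = defaultdict(int)     # visit counter N(s)
S = defaultdict(float)   # total return  S(s)

for episode in episodes:
    rewards = [r for _, r in episode]
    for t, (s, _) in enumerate(episode):
        # G_t: discounted return from time t to the end of the episode
        G_t = sum(gamma**k * r for k, r in enumerate(rewards[t:]))
        N[s] += 1                      # step 2a
        S[s] += G_t                    # step 2b

V = {s: S[s] / N[s] for s in N}        # step 3: V(s) = S(s)/N(s)
print(V)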
Relation Between V_π and q_π
• V_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)
• V_π(s) = Σ_{a∈A} π(a|s) { r(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s') }
• When the reward depends only on the state: V_π(s) = r(s) + γ Σ_{a∈A} π(a|s) Σ_{s'} P(s'|s, a) V_π(s')
• q_π(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s')
• q_π(s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')
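A small numpy sketch of the backups above for a made-up 2-state, 2-action MDP; P, r, π and the assumed V_π values are illustrative only:

import numpy as np

# P[s, a, s'] = P(s'|s, a); r[s, a] = r(s, a)  (all values are made up)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
V_pi = np.array([3.0, 4.0])          # assume V_pi is already known/estimated

# q_pi(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * V_pi(s')
q_pi = r + gamma * P @ V_pi          # shape (states, actions)

# and back: V_pi(s) = sum_a pi(a|s) q_pi(s, a) for a given policy pi
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])
V_check = (pi * q_pi).sum(axis=1)
print(q_pi, V_check)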
Optimality Condition
• V_π*(s) = max_π V_π(s) ∀ s ∈ S; similarly for q_π*(s, a)

Policy Gradients
• Trajectory probability: p_θ(s_1, a_1, s_2, a_2, …) = p(s_1) Π_{t=1}^T π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)
• Goal: θ* = argmax_θ E_{τ~p_θ(τ)}[Σ_t r(s_t, a_t)] = argmax_θ J(θ)
• Sample estimate: J(θ) ≈ (1/N) Σ_i Σ_t r(s_{i,t}, a_{i,t})
• Log-derivative trick: ∇_θ p_θ(τ) = p_θ(τ) ∇_θ log p_θ(τ)
• ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^N [Σ_{t=1}^T ∇_θ log π_θ(a_{i,t}|s_{i,t})] [Σ_{t=1}^T r(s_{i,t}, a_{i,t})]
• Gradient ascent: θ ← θ + α ∇_θ J(θ)
• log π_θ(a_{i,t}|s_{i,t}) is the log probability of the action, i.e. how likely we are to see a_{i,t} as the action in state s_{i,t}
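A sketch of the REINFORCE-style estimator above with a tabular softmax policy π_θ(a|s); the state/action sizes, trajectories and learning rate are placeholder assumptions:

import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))   # pi_theta(a|s) = softmax(theta[s])_a
alpha = 0.1                               # learning rate

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for a softmax policy."""
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    g = np.zeros_like(theta)
    g[s] = -probs
    g[s, a] += 1.0
    return g

# Each trajectory is a list of (state, action, reward) tuples (made-up rollouts).
trajectories = [[(0, 1, 1.0), (2, 0, 0.0)],
                [(1, 0, 2.0), (0, 1, 1.0)]]

grad_J = np.zeros_like(theta)
for tau in trajectories:                       # (1/N) sum over sampled trajectories
    total_r = sum(r for _, _, r in tau)        # sum_t r(s_t, a_t)
    sum_grad = sum(grad_log_pi(theta, s, a) for s, a, _ in tau)
    grad_J += sum_grad * total_r
grad_J /= len(trajectories)

theta += alpha * grad_J                        # gradient ascent step on J(theta)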
Actor Critic
• Advantage = Q − V
• Q^π(s_t, a_t) = Σ_{t'=t}^T E_{π_θ}[r(s_{t'}, a_{t'}) | s_t, a_t] is the expected reward of taking action a_t in s_t
• V^π(s_t) = E_{a_t~π_θ(a_t|s_t)}[Q^π(s_t, a_t)] is the total expected reward from s_t
• A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t) measures how much better a_t is than the average action
• ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T ∇_θ log π_θ(a_{i,t}|s_{i,t}) A^π(s_{i,t}, a_{i,t})
• Q^π(s_t, a_t) = r(s_t, a_t) + Σ_{t'=t+1}^T E_{π_θ}[r(s_{t'}, a_{t'}) | s_t, a_t] = r(s_t, a_t) + E_{s_{t+1}~p(s_{t+1}|s_t, a_t)}[V^π(s_{t+1})]
• A^π(s_t, a_t) ≈ r(s_t, a_t) + V^π(s_{t+1}) − V^π(s_t)
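A minimal sketch of the one-step advantage estimate above with a tabular critic; the value estimates and the sampled transition are made-up numbers (with γ = 1 this matches the finite-horizon form on the sheet):

import numpy as np

V = np.array([2.0, 3.5, 0.0])        # critic's value estimates per state (assumed)
gamma = 1.0                          # gamma = 1 gives the undiscounted form above

# A made-up transition (s_t, a_t, r_t, s_{t+1}) from a rollout.
s, a, r, s_next = 0, 1, 1.0, 1

# A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t)
advantage = r + gamma * V[s_next] - V[s]
print(advantage)

# The actor step then scales the score function by this advantage:
# theta <- theta + alpha * grad_log_pi(theta, s, a) * advantage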
