[머가] Chap11 Reinforcement Learning

2
Overview
Of several responses made to the same situation, those which are accompanied
or closely followed by satisfaction to the animal will, other things being equal,
be more firmly connected with the situation, so that, when it recurs, they will
be more likely to recur; those which are accompanied or closely followed by
discomfort to the animal will, other things being equal, have their connections
with that situation weakened, so that, when it recurs, they will be less likely to
occur. The greater the satisfaction or discomfort, the greater the strengthening
or weakening of the bond. (E. L. Thorndike, Animal Intelligence, page 244.)
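- The significance of reinforcement learning stems from the concept of trial-and-error learning.
- It originates from Thorndike's law of effect: the stronger the satisfaction that follows an action, the more likely that action is to be repeated.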

3
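Overview (cont’d)
- In reinforcement learning, the agent acts to maximize the reward associated with each state.
- Reinforcement learning terminology
• Agent: the entity that learns
• Environment: where the agent acts
• State ($s_t$): the state of the agent or of the environment
• Action ($a_t$): what the agent does
• Reward ($r_t$): the reward received for taking an action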
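The terms above fit together in an interaction loop. The following is a minimal sketch in Python, not something shown on the slides: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names, and the five-cell corridor is an assumed toy environment.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, reward 1 on reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; moves are clipped at the walls
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (state, action, reward) triples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                        # agent picks an action
        next_state, reward, done = env.step(action)   # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """Trivial hypothetical policy: ignore the state and move right."""
    return 1


episode = run_episode(CorridorEnv(), always_right)
```

Here the agent is the `policy` function, the environment is `CorridorEnv`, and each loop iteration produces one state, action, and reward.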

4
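Overview (cont’d)
From Yann LeCun (NIPS 2016)
- In reinforcement learning, the algorithm receives feedback on how well it is performing through rewards.
- Reinforcement learning evaluates the current state (via the reward) but does not say how to improve.
- Supervised learning, by contrast, learns directly from the correct answers.
- Once the rewards are defined, the agent selects an action ($a_t$) suited to the current state ($s_t$) → policy
- A policy combines exploration and exploitation.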

5
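MDP, Markov Decision Process
- Reinforcement learning assumes that states have the Markov property.
- Markov state: the current state alone provides enough information to compute the reward, with no need to look back at past states.
- For example, in chess, knowing the current position of the pieces helps decide the next move, whereas knowing how the game arrived at that position adds little.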
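The Markov assumption is commonly written as the equation below. This is the standard textbook statement rather than something taken from the slide; the symbol $P$ (the transition probability) is introduced here only for illustration.

```latex
\[
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)
\]
```

In words: conditioning on the full history gives the same prediction as conditioning on the current state and action alone.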

6
Q-Learning
① state
② action
③ reward
Q(state, action) ↔ $Q(s, a)$
- Q-function (state-action value function)

7
Q-Learning (cont’d)
- Q-learning example using Frozen Lake
(Figure: grid world with the four actions Left, Right, Up, Down.)

8
Q-Learning – state, action, reward
- state, action, reward
$s_0, a_0, r_1, s_1, a_1, r_2, \cdots, s_{n-1}, a_{n-1}, r_n, s_n$
(Figure labels: state, action, reward, terminal state.)

9
Q-Learning – state, action, reward
- state, action, reward
$R_t = r_t + R_{t+1}$
$R_t^* = r_t + \max R_{t+1}$
$Q(s, a) = r + \max_{a'} Q(s', a')$
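The recursion $R_t = r_t + R_{t+1}$ can be evaluated by sweeping backwards over an episode's rewards. The sketch below assumes the return after the last reward is 0 (terminal state); `returns_from_rewards` and the sample reward list are hypothetical.

```python
def returns_from_rewards(rewards):
    """Compute R_t = r_t + R_{t+1} for every step, working backwards.

    The return after the final reward is taken to be 0 (terminal state).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + running
        returns[t] = running
    return returns


# Hypothetical episode in which only the last step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
episode_returns = returns_from_rewards(rewards)
```

Without discounting, every step of this episode has the same return, which is the issue the discount factor addresses later in the deck.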

10
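Q-Learning : Policy
- The rule for choosing which action to take in each state in order to obtain the best outcome is called the policy, π
• Greedy: select the action with the highest value, i.e. $\max_a Q(s, a)$
• $\epsilon$-greedy: like the greedy policy, but with a small probability $\epsilon$ a random action is selected
→ occasionally tries alternatives in the hope of finding a better action
• soft-max: a refinement of $\epsilon$-greedy that uses the softmax function to choose the action when exploring (see p. 277)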
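The three policies can be sketched as small functions over a list of Q-values for one state. This is an illustrative sketch only: the function names and the `temperature` parameter are assumptions, and the softmax variant shown samples from the softmax distribution at every step, which is one common formulation and may differ from the one on p. 277.

```python
import math
import random


def greedy(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def softmax_probs(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; higher Q means higher probability."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_policy(q_values, temperature=1.0, rng=random):
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With `epsilon=0.0` the $\epsilon$-greedy function reduces to the greedy one; a lower `temperature` makes the softmax choice closer to greedy.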

11
Q-Learning : Greedy Policy
- Learning the Q(s, a) table
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
Initial Q values are 0

12
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
(Figure, slides 12-14: grid whose Q-table entries are updated to 1 step by step.)

15
Q-Learning Algorithm (greedy)
- Q-function (state-action value function)
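The greedy algorithm can be sketched as a table update on a small deterministic grid. The code below is an illustration under stated assumptions, not the slide's own listing: the 4x4 layout, the action ordering, the helper names, and the random tie-breaking in `random_argmax` (needed so an all-zero table does not always pick the same action) are all hypothetical choices.

```python
import random

# Hypothetical 4x4 layout in the spirit of Frozen Lake:
# S = start, F = frozen (safe), H = hole (episode ends, reward 0), G = goal (reward 1)
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N = 4
ACTIONS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up as (d_row, d_col)


def step(state, action):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    cell = GRID[row][col]
    return row * N + col, (1.0 if cell == "G" else 0.0), cell in "HG"


def random_argmax(values, rng):
    """Greedy choice with random tie-breaking among equally good actions."""
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N * N)]  # initial Q values are 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = random_argmax(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q(s, a) <- r + max_a' Q(s', a'); no future value past a terminal state
            q[state][action] = reward + (0.0 if done else max(q[next_state]))
            state = next_state
    return q


q_table = train()
```

Because there is no discount and no learning rate here, every learned entry is either 0 or 1, matching the values that appear in the grid figures.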

16
Exploitation vs Exploration
(Figure: grid of learned Q values of 1, contrasting Exploitation with Exploration.)

18
Q–Learning : Discounted future reward
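(Figure: grid of learned Q values of 1.)
- Takes into account how certain rewards that occur in the future are.
- An additional parameter $0 \le \gamma \le 1$ is multiplied in to discount future rewards.
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$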

20
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
(Figure: grid with a Q value of 1.)

21
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
$Q(s, a) = 0 + 0.9 \times 1 = 0.9$
(Figure: grid with Q values 1 and 0.9.)

22
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
(Figure: grid with Q values 1, 0.9, 0.81, 0.9, and 0.729.)
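The values 0.9, 0.81, and 0.729 follow from applying the discounted update repeatedly, moving back from the goal. A small sketch, assuming every reward is 0 except a final reward of 1; the function name is hypothetical.

```python
GAMMA = 0.9


def discounted_q_along_path(steps_to_goal, goal_reward=1.0, gamma=GAMMA):
    """Q-value of the on-path action taken `steps_to_goal` moves before the goal.

    Applies Q(s, a) <- r + gamma * max Q(s', a') backwards from the goal,
    assuming every reward is 0 except the final one.
    """
    q = goal_reward              # last move: r = 1, nothing left to bootstrap from
    for _ in range(steps_to_goal - 1):
        q = 0.0 + gamma * q      # earlier moves: r = 0, only the discounted future remains
    return q


for k in range(1, 5):
    print(k, round(discounted_q_along_path(k), 3))
```

Each additional step away from the goal multiplies the value by $\gamma$, so shorter paths end up with higher Q values.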

24
Q–Learning : Deterministic vs Stochastic
- Deterministic: the state changes exactly as determined by the agent's action.
- Stochastic (non-deterministic): the state changes probabilistically in response to the agent's action.
- That is, when the agent takes an action, the resulting state is not fixed deterministically but is decided by chance.
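One way to picture a stochastic environment is a "slippery" move, where the action that takes effect sometimes differs from the intended one. The sketch below is an assumed toy model, not the exact rule used by any particular Frozen Lake implementation; `slippery_action` and `slip_prob` are hypothetical names.

```python
import random

ACTIONS = ["left", "down", "right", "up"]


def slippery_action(intended, slip_prob, rng):
    """Return the action that actually takes effect.

    With probability 1 - slip_prob the intended action is executed;
    otherwise one of the other actions is chosen uniformly at random.
    """
    if rng.random() < slip_prob:
        others = [a for a in ACTIONS if a != intended]
        return rng.choice(others)
    return intended


rng = random.Random(0)
outcomes = [slippery_action("right", slip_prob=0.3, rng=rng) for _ in range(10_000)]
share_intended = outcomes.count("right") / len(outcomes)
```

With `slip_prob=0.0` this reduces to the deterministic case; with a positive value, the same state and action can lead to different next states.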

25
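$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$
- Learning rate, $\alpha = 0.1$
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$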
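The two learning-rate forms above are algebraically the same update. The sketch below checks this numerically; the function names and the sample numbers are hypothetical, and the defaults reuse $\alpha = 0.1$ and $\gamma = 0.9$ from the slides.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Blend form: Q <- (1 - alpha) * Q + alpha * (r + gamma * max Q')."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_next_q)


def q_update_td(q_sa, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Error form: Q <- Q + alpha * (r + gamma * max Q' - Q)."""
    return q_sa + alpha * (reward + gamma * max_next_q - q_sa)


# A single sample only nudges the estimate instead of overwriting it.
q = 0.5
blend = q_update(q, reward=0.0, max_next_q=1.0)
td = q_update_td(q, reward=0.0, max_next_q=1.0)
```

Setting `alpha=1.0` recovers the earlier update $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$, which trusts each sample fully; a small `alpha` averages over many samples, which suits a stochastic environment.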
