Sara Reinforcement Learning
2. Negative Reinforcement:
Negative reinforcement is defined as the strengthening of a behavior. In other
words, when a negative condition is stopped or avoided, the behavior that removes
it is strengthened and more likely to be repeated in the future.
Advantages:
– It maximizes the desired behavior.
– It helps to maintain at least a minimum standard of performance.
Disadvantage:
– It only provides enough to meet the minimum expected behavior.
Reinforcement Learning Workflow:
• Create the Environment
• Define the reward
• Create the agent
• Train and validate the agent
• Deploy the policy
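• As a rough illustration of these stages, here is a minimal Python sketch on a hypothetical 1-D "corridor" environment with a small tabular Q-learning agent; the environment, rewards, and hyperparameters are illustrative assumptions, not part of the original workflow description.
```python
# A minimal sketch of the five workflow stages on a hypothetical
# 1-D "corridor" environment (5 states, goal on the right) with a
# small tabular Q-learning agent; all names and numbers are illustrative.
import random

# 1. Create the environment: states 0..4, the agent starts at state 0.
N_STATES, GOAL = 5, 4

def env_step(state, action):              # action: 0 = left, 1 = right
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    # 2. Define the reward: +1 for reaching the goal state, 0 otherwise.
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

# 3. Create the agent: a Q-table with one row per state, one column per action.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def choose_action(state):                 # epsilon-greedy with random tie-breaking
    if random.random() < epsilon:
        return random.randrange(2)
    best = max(Q[state])
    return random.choice([a for a, q in enumerate(Q[state]) if q == best])

# 4. Train and validate the agent over many episodes.
for episode in range(200):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        nxt, reward, done = env_step(state, action)
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

# 5. Deploy the policy: act greedily with respect to the learned Q-table.
policy = [row.index(max(row)) for row in Q]
print("Greedy policy per state (0 = left, 1 = right):", policy)
```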
How does Reinforcement Learning Work?
• Now, if, for example, we observe that over two consecutive time steps someone had
the facial feature sequence [smiling, frowning], we can use the three model
parameters above (𝝅, A, 𝜽) to calculate the probabilities of their actual hidden moods:
• P([good, good]) = (𝝅[0]·𝜽[0, 0])·(A[0, 0]·𝜽[0, 1]) = (0.7·0.9)·(0.8·0.1) =
0.0504,
• P([good, bad]) = (𝝅[0]·𝜽[0, 0])·(A[0, 1]·𝜽[1, 1]) = (0.7·0.9)·(0.2·0.9) =
0.1134,
• P([bad, good]) = (𝝅[1]·𝜽[1, 0])·(A[1, 0]·𝜽[0, 1]) = (0.3·0.1)·(0.2·0.1) =
0.0006,
• P([bad, bad]) = (𝝅[1]·𝜽[1, 0])·(A[1, 1]·𝜽[1, 1]) = (0.3·0.1)·(0.8·0.9) =
0.0216.
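• The four joint probabilities above can be reproduced with a few lines of Python; the parameter values for 𝝅, A, and 𝜽 are read directly off the calculations on this slide, with index 0 standing for good/smiling and index 1 for bad/frowning.
```python
# Reproducing the four joint mood-sequence probabilities from the slide
# (index 0 = good mood / smiling, index 1 = bad mood / frowning).
import numpy as np

pi = np.array([0.7, 0.3])          # initial mood probabilities
A = np.array([[0.8, 0.2],          # mood transition probabilities
              [0.2, 0.8]])
theta = np.array([[0.9, 0.1],      # emission probabilities:
                  [0.1, 0.9]])     # rows = mood, cols = smiling / frowning

obs = [0, 1]                       # observed sequence: [smiling, frowning]
names = ["good", "bad"]
for m1 in (0, 1):
    for m2 in (0, 1):
        p = pi[m1] * theta[m1, obs[0]] * A[m1, m2] * theta[m2, obs[1]]
        print(f"P([{names[m1]}, {names[m2]}]) = {p:.4f}")
```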
Guessing Someone’s Mood from their Facial Features:
• We calculate the marginal mood probabilities for each element in the sequence to
get the probabilities that the 1st mood is good/bad and the 2nd mood is good/bad.
Here P′ denotes the joint probabilities above normalized by their sum
(0.0504 + 0.1134 + 0.0006 + 0.0216 = 0.186):
• P’(1st mood is good) = P’([good, good]) + P’([good, bad]) = 0.881,
• P’(1st mood is bad) = P’([bad, good]) + P’([bad, bad]) = 0.119,
• P’(2nd mood is good) = P’([good, good]) + P’([bad, good]) = 0.274,
• P’(2nd mood is bad) = P’([good, bad]) + P’([bad, bad]) = 0.726.
• The optimal mood sequence is obtained by simply picking the mood with the highest
marginal probability at each position: P’(1st mood is good) is larger than P’(1st
mood is bad), and P’(2nd mood is good) is smaller than P’(2nd mood is bad). In
this case, it turns out that the optimal mood sequence is indeed: [good, bad].
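• A short sketch of this marginalization step, assuming (as above) that P’ denotes the joint probabilities normalized by their sum; it reproduces the four marginal values quoted on this slide.
```python
# Normalize the joint probabilities from the previous slide and sum them
# into the marginal probability of each mood at each position.
joint = {("good", "good"): 0.0504, ("good", "bad"): 0.1134,
         ("bad", "good"): 0.0006, ("bad", "bad"): 0.0216}
total = sum(joint.values())                        # 0.186
post = {k: v / total for k, v in joint.items()}    # P'(...) = normalized joints

labels = ("1st", "2nd")
for t in (0, 1):                                   # position in the sequence
    for mood in ("good", "bad"):
        p = sum(v for k, v in post.items() if k[t] == mood)
        print(f"P'({labels[t]} mood is {mood}) = {p:.3f}")
```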
Monte Carlo Learning:
• Any method which solves a problem by generating suitable random
numbers, and observing that fraction of numbers obeying some property
or properties, can be classified as a Monte Carlo method.
• It is used when there is no prior information about the environment and all the
information is essentially collected by experience.
• The Monte Carlo method for reinforcement learning learns directly from
episodes of experience without any prior knowledge of MDP transitions.
Here, the random component is the return or reward.
• One caveat is that it can only be applied to episodic MDPs. It’s fair to ask
why at this point. The reason is that the episode has to terminate before
we can calculate any returns. Here, we don’t do an update after every
action, but rather after every episode.
Example:
• Let’s do a fun exercise where we will try to find
out the value of pi using pen and paper. Let’s
draw a square of unit side length and, inside it, a
quarter circle with unit radius. Now, we have a
helper bot, C3PO, with us. It is tasked with randomly
putting dots on the square, 3,000 times in total, so
that the dots scatter over both the square and the
quarter circle.
Example:
• C3PO needs to count each time it puts a dot inside the circle. So, the
value of pi will be given by:
π ≈ 4 · N / 3,000
• where N is the number of times a dot was put inside the circle. As you
can see, we did not do anything except count the random dots that
fall inside the circle and then take a ratio to approximate the value of
pi.
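• A quick Python simulation of the same exercise, with C3PO replaced by a random number generator: scatter 3,000 dots in the unit square, count those inside the quarter circle, and apply the ratio above.
```python
# Monte Carlo estimation of pi: drop 3,000 random dots in the unit square
# and count how many land inside the quarter circle of unit radius.
import random

N_DOTS = 3000
inside = sum(1 for _ in range(N_DOTS)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)

pi_estimate = 4 * inside / N_DOTS   # area ratio: pi/4 ≈ inside / total
print(f"Estimated pi ≈ {pi_estimate:.3f}")
```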
Monte Carlo Policy Evaluation:
• The goal here, again, is to learn the value function v_π(s) from
episodes of experience under a policy π:
S1, A1, R2, …, Sk ~ π
• Recall that the return is the total discounted reward:
G_t = R_{t+1} + γ R_{t+2} + … + γ^{T−1} R_T
• Also recall that the value function is the expected return:
v_π(s) = E_π[ G_t | S_t = s ]
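• To make the episode-based updates concrete, here is a small first-visit Monte Carlo evaluation sketch on a hypothetical two-state episodic chain; the chain, its rewards, and the random policy are illustrative assumptions, not taken from the slides.
```python
# First-visit Monte Carlo policy evaluation on a tiny hypothetical chain:
# from state 0 the policy either terminates (reward 0) or moves to state 1,
# which always terminates with reward 1.
import random
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)                  # all observed returns per state

def generate_episode():
    # random policy: terminate from state 0 with probability 0.5,
    # otherwise move on to state 1, which terminates with reward 1
    if random.random() < 0.5:
        return [(0, 0.0)]                    # (state, reward after leaving it)
    return [(0, 0.0), (1, 1.0)]

for _ in range(5000):
    episode = generate_episode()
    G, G_at_first_visit = 0.0, {}
    for state, reward in reversed(episode):  # G = discounted return from each step
        G = reward + gamma * G
        G_at_first_visit[state] = G          # last overwrite = first visit
    for state, g in G_at_first_visit.items():
        returns[state].append(g)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # roughly {0: 0.45, 1: 1.0}: averages of complete-episode returns
```
• Note that the value estimates are nothing more than averages of returns from complete episodes, which is why the method only applies to episodic MDPs.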
Q-Learning Explanation:
• In the Bellman equation, we have various components, including the reward, the
discount factor (γ), the transition probability, and the next state s′. But no
Q-value appears in it, so first consider the image below:
• In the image, we can see an agent that has three value options,
V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares
about the current state and the future state. The agent can move in
any direction (up, left, or right), so it needs to decide where to go
to follow the optimal path.
Q-Learning Explanation:
• Here, the agent makes a move on a probability basis and changes its state. But if
we want some exact moves, we need to make some changes in terms of
Q-values. Consider the image below:
• Q represents the quality of the actions at
each state. So instead of using a value at each
state, we use a state-action pair, i.e.
Q(s, a). The Q-value specifies which action is
more lucrative than the others, and according to
the best Q-value the agent takes its next
move. The Bellman equation can be used for
deriving the Q-value.
Q-Learning Explanation:
• To perform any action, the agent will get a reward R(s, a), and it will also end up in
a certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ · max_a′ Q(s′, a′)
How to Make a Q-Table?
• Step 1: Create an initial Q-table with all values set to 0.
• Step 2: Choose an action and perform it. Update values in the table.
• This is the starting point. We have performed no other action as of
yet. Let us say that we want the agent to sit initially, which it does.
The table will change to:
Q-Table after performing an action
How to Make a Q-Table?
• Step 3: Get the value of the reward and calculate the Q-value
using the Bellman equation.
• For the action performed, we need to calculate the value of the actual
reward and the Q(S, A) value.
Updating Q-Table with Bellman Equation
How to Make a Q-Table?
• Step 4: Continue the same process until the table is filled or an episode ends.
• The agent continues taking actions, and for each action the reward
and the Q-value are calculated and the table is updated.
Final Q-Table at end of an episode
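• As a sketch of how one such update looks in code, the snippet below mirrors steps 2 to 4 for a hypothetical two-state setup with "sit"/"stand" actions (the environment, rewards, and figures from the slides are not reproduced here); it uses the common incremental form of the Bellman update with a learning rate α.
```python
# A sketch of the Q-table update steps on a hypothetical setup
# (states, actions, and rewards here are illustrative only).
import numpy as np

states, actions = ["start", "sitting"], ["sit", "stand"]
Q = np.zeros((len(states), len(actions)))   # step 1: empty Q-table
alpha, gamma = 0.1, 0.9                     # learning rate and discount factor

# Step 2: choose an action and perform it (here the agent sits first).
s, a = 0, 0                                 # state "start", action "sit"
s_next, r = 1, 1.0                          # illustrative next state and reward

# Step 3: update Q(s, a) with the incremental Bellman update:
#   Q(s, a) <- Q(s, a) + alpha * (R(s, a) + gamma * max_a' Q(s', a') - Q(s, a))
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
print(Q)                                    # the Q-table after one action

# Step 4: keep choosing actions and updating the table after each one
# until the episode ends, then repeat over further episodes.
```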
When Not to Use Reinforcement Learning?
You can’t apply a reinforcement learning model in every situation. Here
are some conditions when you should not use a reinforcement learning
model:
• When you have enough data to solve the problem with a supervised
learning method.
• You need to remember that reinforcement learning is computing-heavy
and time-consuming, in particular when the action space is large.
Challenges of Reinforcement Learning:
Here are the major challenges you will face while doing Reinforcement
Learning:
• Feature/reward design, which can be very involved.
• Parameters may affect the speed of learning.
• Realistic environments can have partial observability.
• Too much Reinforcement may lead to an overload of states which can
diminish the results.
• Realistic environments can be non-stationary.
References:
• https://www.javatpoint.com/reinforcement-learning
• https://medium.com/@natsunoyuki/hidden-markov-models-with-python-c026f778dfa7
• https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-q-learning
• https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
• https://www.youtube.com/watch?v=J3qX50yyiU0&list=PL4gu8xQu0_5JBO1FKRO5p20wc8DprlOgn&index=122
• https://www.guru99.com/reinforcement-learning-tutorial.html