
UNIT-1

The Reinforcement Learning Problem

Reinforcement Learning is a type of machine learning where an agent learns how to act in an
environment in order to achieve a goal. It does this by interacting with the environment and
learning from the consequences of its actions, guided by rewards.

Agent: It is the entity that interacts with the environment, takes actions, and learns from the
outcomes in order to achieve a goal.

Role of the Agent

• Observes the current state of the environment.


• Decides what action to take based on its policy (strategy).
• Receives a reward and the new state from the environment.
• Learns from this feedback to improve future decisions.
Reinforcement Learning Loop in Simple Terms:

1. Agent observes the environment (state).


2. It takes an action.
3. The environment changes and gives a reward.
4. The agent learns which actions lead to better rewards.
5. It tries to maximize rewards over time.
Example:

• Task: Learn to play a video game.


• State: Current game screen.
• Action: Move left, right, jump, shoot.
• Reward: Score increase or decrease.
• Goal: Learn the best way to act to get the highest score.
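To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction. The environment object and its reset/step interface are illustrative assumptions in the style of OpenAI Gym, not a specific library API:

import random

def run_episode(env, policy, max_steps=1000):
    # 1. Agent observes the environment (state).
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # 2. It takes an action chosen by its policy.
        action = policy(state)
        # 3. The environment changes and gives a reward.
        state, reward, done = env.step(action)
        # 4./5. A learning agent would update its policy here to maximize reward over time.
        total_reward += reward
        if done:
            break
    return total_reward

# Example usage with a purely random (non-learning) policy over a hypothetical action list:
# total = run_episode(env, policy=lambda state: random.choice(["left", "right", "jump", "shoot"]))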
Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn
to make decisions through trial and error to maximize cumulative rewards. RL allows machines to
learn by interacting with an environment and receiving feedback based on their actions. This
feedback comes in the form of rewards or penalties.


Figure: Framework for Reinforcement Learning

Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives feedback
to optimize its decision-making over time.

• Agent: The decision-maker that performs actions.


• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based on the agent’s action.
Examples of Reinforcement Learning

1. Game Playing (e.g., Chess, Go, Atari Games)


• Agent: The AI player
• Environment: The game world
• Action: Move a piece or take a game-specific action
• Reward: +1 for winning, -1 for losing, 0 for draw
• Goal: Learn the strategy that maximizes chances of winning
2. Autonomous Driving
• Agent: Self-driving car AI
• Environment: Roads, traffic, pedestrians
• Action: Accelerate, brake, steer, etc.


• Reward: Positive for safe driving; negative for crashes or traffic violations
• Goal: Drive safely and reach the destination efficiently
3. Robotics
• Agent: Robot
• Environment: Physical world (room, lab, etc.)
• Action: Move arm, pick/place objects
• Reward: Success in completing task (e.g., picking up an object)
• Goal: Learn to perform tasks like walking, grasping, or assembling
Example: A robot learns to walk without falling by trying different leg movements and
receiving reward signals based on balance and progress.
4. Recommendation Systems (e.g., Netflix, YouTube)
• Agent: Recommendation algorithm
• Environment: User interface and interaction
• Action: Recommend a movie or video
• Reward: Click, watch time, or like
• Goal: Maximize user engagement
5. Stock Trading
• Agent: Trading bot
• Environment: Stock market data
• Action: Buy, sell, or hold
• Reward: Profit or loss
• Goal: Maximize total return on investment
6. Healthcare (e.g., Personalized Treatment Planning)
• Agent: AI system
• Environment: Patient data and response to treatment
• Action: Suggest treatment or dosage
• Reward: Improvement in health condition
• Goal: Learn optimal treatments over time for best patient outcomes.


Application Area | Agent | Action | Reward
Game Playing | AI player | Moves | Win/loss/draw
Self-Driving Cars | Car control AI | Steering, braking | Safe driving, efficiency
Robotics | Robot | Motor commands | Task success/failure
Recommendations | Recommender system | Show content | User engagement
Stock Trading | Trading algorithm | Buy/sell/hold | Profit/loss
Healthcare | AI medical system | Treatment plan | Patient improvement

Table 1: Summary Table for Examples of RL

Elements of Reinforcement Learning (RL)

Reinforcement Learning is defined by a set of core elements that work together to enable learning through interaction with an environment: a policy, a reward signal, a value function, and, optionally, a model of the environment.

Agent: The artificial entity that is being trained to perform a task by learning from its own experience. A learning agent must be able to sense the state of its environment and take actions that affect that state.

Environment: Comprises everything outside the purview of the agent. The environment has its own internal dynamics and rules, which are usually not visible to the agent. The boundary between agent and environment is typically not the same as the physical boundary of a robot's or animal's body; usually, the boundary is drawn closer to the agent than that.

State: The current situation of the environment (as observed by the agent), which forms the basis for the decisions taken by the agent.

Action: A choice made by the agent to change the state of the environment.

Reward: A scalar quantity emitted by the environment in response to the action taken by the agent. It defines the goal of the reinforcement learning problem and indicates what is good in the short term. It forms the basis for evaluating the choices made by the agent.

Return: The cumulative sum of rewards received, or expected to be received, by the agent.

Goal: The agent's goal is to maximize the total amount of reward (the expected return) it receives.

Policy: Defines the learning agent's behavior. The policy can be viewed as a function that maps perceived states to actions (or probabilities of actions) to be taken in those states. It can be implemented as a lookup table or as a function approximator such as a neural network.

Value Function: Specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate starting from that state. It is used to decide which actions should be taken. The most important component of most RL algorithms is a method for efficiently estimating values.

Model: Something that mimics the behavior of the environment and allows inferences to be made about how the environment will behave. In RL, models are used for planning: deciding on a course of action by considering situations before they have actually occurred.

Table 2: Informal Description of Common RL Terms.

Limitations and Scope of Reinforcement Learning (RL)

Reinforcement Learning is a powerful and widely used paradigm, but it also comes with certain
challenges and boundaries. Here's a balanced view of its scope and limitations:

Scope of Reinforcement Learning

Reinforcement Learning is suitable for problems where:

• Sequential decision-making is required.


• The agent learns from trial and error without explicit supervision.
• Delayed rewards are involved.
• The environment is dynamic or partially unknown.

Applications and Strengths

Domain | Examples
Games & Strategy | Chess, Go, Atari, Poker, StarCraft
Robotics | Robot navigation, object manipulation
Autonomous Systems | Self-driving cars, drones
Finance | Trading strategies, portfolio management
Healthcare | Personalized treatment planning
Recommender Systems | Adaptive content and ad placement
Industrial Control | Smart grids, scheduling, process control

Limitations of Reinforcement Learning

Despite its versatility, RL has significant challenges:

1. Sample Inefficiency
• Requires a large number of interactions with the environment to learn effectively.

• In real-world settings (like robotics or healthcare), this can be costly or unsafe.


2. Exploration vs. Exploitation Dilemma
• Balancing between trying new actions (exploration) and using known good actions
(exploitation) is non-trivial.
• Poor exploration can trap the agent in suboptimal behavior.
3. Sparse or Delayed Rewards
• If rewards are rare or delayed, learning becomes slow and unstable.
4. High Variance and Instability
• Especially in deep RL, training can be unstable due to sensitivity to hyperparameters and
initial conditions.
5. Credit Assignment Problem
• Determining which actions led to a specific outcome (reward) is hard when there are long
delays.
6. Lack of Generalization
• Agents often fail to transfer knowledge to new but similar tasks (poor transfer learning).
7. Computational Costs
• RL, especially deep RL, is resource-intensive (needs GPUs, simulations, etc.).
8. Safety and Ethics Concerns
• In critical domains like healthcare, finance, or driving, trial-and-error learning can lead to
harmful behavior.
An Extended Example: Tic-Tac-Toe

Tic-tac-toe is not a very challenging game for human beings. If you’re an enthusiast, you’ve
probably moved from the basic game to some variant like three-dimensional tic-tac-toe on a larger
grid.

If you sit down right now to play ordinary three-by-three tic-tac-toe with a friend, what will
probably happen is that every game will come out a tie. Both you and your friend can probably
play perfectly, never making a mistake that would allow your opponent to win.

But can you describe how you know where to move each turn? Most of the time, you probably
aren’t even aware of alternative possibilities; you just look at the board and instantly know where
you want to move.


That kind of instant knowledge is great for human beings, because it makes you a fast player. But
it isn’t much help in writing a computer program. For that, you have to know very explicitly what
your strategy is.

By the way, although the example of tic-tac-toe strategy is a relatively trivial one, this issue of
instant knowledge versus explicit rules is a hot one in modern psychology. Some cognitive
scientists, who think that human intelligence works through mechanisms similar to computer
programs, maintain that when you know how to do something without knowing how you know,
you have an explicit set of rules deep down inside. It’s just that the rules have become a habit, so
you don’t think about them deliberately.

Other cognitive scientists disagree. They think that human thought is profoundly different from the way computers work, and that a computer cannot be programmed to simulate the full power of human problem-solving. These people would say, for example, that when you look at a tic-tac-toe board you immediately grasp the strategic situation, and your eye is drawn to the best move without any need to examine alternatives according to a set of rules.

Before you read further, try to write down a set of strategy rules that, if followed consistently, will
never lose a game. Play a few games using your rules. Make sure they work even if the other player
does something bizarre.

I'm going to number the squares in the tic-tac-toe board this way:

 1 | 2 | 3
---+---+---
 4 | 5 | 6
---+---+---
 7 | 8 | 9

Squares 1, 3, 7, and 9 are corner squares. I’ll call 2, 4, 6, and 8 edge squares. And of course number
5 is the center square. I’ll use the word position to mean a specific partly-filled-in board with X
and O in certain squares, and other squares empty.

One way you might meet my challenge of describing your strategy explicitly is to list all the
possible sequences of moves up to a certain point in the game, then say what move you’d make
next in each situation. How big would the list have to be? There are nine possibilities for the first
move. For each first move, there are eight possibilities for the second move. If you continue this
line of reasoning, you’ll see that there are nine factorial, or 362880, possible sequences of moves.


Your computer may not have enough memory to list them all, and you certainly don’t have enough
patience.

Fortunately, not all these sequences are interesting. Suppose you are describing the rules a
computer should use against a human player, and suppose the human being moves first. Then there
are, indeed, nine possible first moves. But for each of these, there is only one possible computer
move! After all, we’re programming the computer. We get to decide which move it will choose.
Then there are seven possible responses by the opponent, and so on. The number of sequences
when the human being plays first is 9 times 7 times 5 times 3, or 945.

Strategy

The highest-priority and the lowest-priority rules seemed obvious to me right away. The highest-
priority are these:
1. If I can win on this move, do it.
2. If the other player can win on the next move, block that winning square.
Here are the lowest-priority rules, used only if there is nothing suggested more strongly by the
board position:
n-2. Take the center square if it’s free.
n-1. Take a corner square if one is free.
n. Take whatever is available.

The highest priority rules are the ones dealing with the most urgent situations: either I or my
opponent can win on the next move. The lowest priority ones deal with the least urgent situations,
in which there is nothing special about the moves already made to guide me.
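As a concrete illustration of how the two highest-priority rules can be made explicit, here is a small Python sketch. The board representation and helper names are my own, not the program discussed in this text:

# Board is a dict from square number (1-9) to 'X', 'O', or None.
LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),    # rows
         (1, 4, 7), (2, 5, 8), (3, 6, 9),    # columns
         (1, 5, 9), (3, 5, 7)]               # diagonals

def winning_square(board, player):
    # Return a square that completes three-in-a-row for player, or None.
    for line in LINES:
        marks = [board[s] for s in line]
        if marks.count(player) == 2 and marks.count(None) == 1:
            return line[marks.index(None)]
    return None

def choose_move(board, me, opponent):
    # Rule 1: win immediately if possible.
    square = winning_square(board, me)
    if square is not None:
        return square
    # Rule 2: block the opponent's immediate win.
    square = winning_square(board, opponent)
    if square is not None:
        return square
    # The intermediate rules (fork handling) and the low-priority rules
    # (center, corner, anything free) would go here.
    return next(s for s in range(1, 10) if board[s] is None)

# Example: board = {s: None for s in range(1, 10)}; board[1] = board[2] = 'X'
# choose_move(board, 'X', 'O')  -> 3 (rule 1: complete the top row)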

What was harder was to find the rules in between. I knew that the goal of my own tic-tac-toe
strategy was to set up a fork, a board position in which I have two winning moves, so my opponent
can only block one of them. Here is an example:

X can win by playing in square 3 or square 4. It’s O’s turn, but poor O can only block one of those
squares at a time. Whichever O picks, X will then win by picking the other one.
Given this concept of forking, I decided to use it as the next highest priority rule:


3. If I can make a move that will set up a fork for myself, do it.
That was the end of the easy part. My first attempt at writing the program used only these six rules.
Unfortunately, it lost in many different situations. I needed to add something, but I had trouble
finding a good rule to add.
My first idea was that rule 4 should be the defensive equivalent of rule 3, just as rule 2 is the
defensive equivalent of rule 1:
4a. If, on the next move, my opponent can set up a fork, block that possibility by moving into the
square that is common to his two winning combinations.
In other words, apply the same search technique to the opponent’s position that I applied to my
own.
This strategy works well in many cases, but not all. For example, here is a sequence of moves
under this strategy, with the human player moving first:

In the fourth grid, the computer (playing O) has discovered that X can set up a fork by moving in
square 6, between the winning combinations 456 and 369. The computer moves to block this fork.
Unfortunately, X can also set up a fork by moving in squares 3, 7, or 8. The computer’s move in
square 6 has blocked one combination of the square-3 fork, but X can still set up the other two. In
the fifth grid, X has moved in square 8. This sets up the winning combinations 258 and 789. The
computer can only block one of these, and X will win on the next move.

Since X has so many forks available, does this mean that the game was already hopeless before
O moved in square 6? No. Here is something O could have done:

In this sequence, the computer’s second move is in square 7. This move also blocks a fork, but it
wasn’t chosen for that reason. Instead, it was chosen to force X’s next move. In the fifth grid, X has
had to move in square 4, to prevent an immediate win by O. The advantage of this situation for O
is that square 4 was not one of the ones with which X could set up a fork. O’s next move, in the
sixth grid, is also forced. But then the board is too crowded for either player to force a win; the
game ends in a tie, as usual.

This analysis suggests a different choice for an intermediate-level strategy rule, taking the
offensive:

4b. If I can make a move that will set up a winning combination for myself, do it.

Compared to my earlier try, this rule has the benefit of simplicity. It’s much easier for the program
to look for a single winning combination than for a fork, which is two such combinations with a
common square.

Unfortunately, this simple rule isn’t quite good enough. In the example just above, the computer
found the winning combination 147 in which it already had square 1, and the other two were free.
But why should it choose to move in square 7 rather than square 4? If the program did choose
square 4, then X’s move would still be forced, into square 7.

We would then have forced X into creating a fork, which would defeat the program on the next
move.

It seems that there is no choice but to combine the ideas from rules 4a and 4b:

4. If I can make a move that will set up a winning combination for myself, do it. But ensure that
this move does not force the opponent into establishing a fork.

What this means is that we are looking for a winning combination in which the computer already
owns one square and the other two are empty. Having found such a combination, we can move in
either of its empty squares. Whichever we choose, the opponent will be forced to choose the other
one on the next move. If one of the two empty squares would create a fork for the opponent, then
the computer must choose that square and leave the other for the opponent.

What if both of the empty squares in the combination we find would make forks for the opponent?
In that case, we’ve chosen a bad winning combination. It turns out that there is only one situation
in which this can happen:

Again, the computer is playing O. After the third grid, it is looking for a possible winning
combination for itself. There are three possibilities: 258, 357, and 456. So far we have not given

the computer any reason to prefer one over another. But here is what happens if the program
happens to choose 357:

By this choice, the computer has forced its opponent into a fork that will win the game for the
opponent. If the computer chooses either of the other two possible winning combinations, the game
ends in a tie. (All moves after this choice turn out to be forced.)

This particular game sequence was very troublesome for me because it goes against most of the
rules I had chosen earlier. For one thing, the correct choice for the program is an edge square, while
the corner squares must be avoided. This is the opposite of the usual priority.

Another point is that this situation contradicts rule 4a (prevent forks for the other player) even
more sharply than the example we considered earlier. In that example, rule 4a wasn’t enough
guidance to ensure a correct choice, but the correct choice was at least consistent with the rule.
That is, just blocking a fork isn't enough, but threatening a win while also blocking a fork is better than just threatening a win alone. This is the meaning of rule 4. But in this new situation, the corner square (the move we have to avoid) does block a fork, while the edge square (the correct move) doesn't block a fork!

When I discovered this anomalous case, I was ready to give up on the idea of beautiful, general
rules. I almost decided to build into the program a special check for this precise board
configuration. That would have been pretty ugly, I think. But a shift in viewpoint makes this case
easier to understand: What the program must do is force the other player’s move, and force it in a
way that helps the computer win. If one possible winning combination doesn’t allow us to meet
these conditions, the program should try another combination. My mistake was to think either
about forcing alone (rule 4b) or about the opponent’s forks alone (rule 4a).

As it turns out, the board situation we’ve been considering is the only one in which a possible
winning combination could include two possible forks for the opponent. What’s more, in this board
situation, it’s a diagonal combination that gets us in trouble, while a horizontal or vertical
combination is always okay. Therefore, I was able to implement rule 4 in a way that only considers
one possible winning combination by setting up the program’s data structures so that diagonal

combinations are the last to be chosen. This trick makes the program’s design less than obvious
from reading the actual program, but it does save the program some effort.

Multi-armed Bandits:

A k-Armed Bandit Problem

It is a foundational problem in Reinforcement Learning that models decision-making in uncertain environments.
Imagine:
You are in front of k slot machines (also called arms), and each gives a random reward when
played.
But here's the twist:
• You don’t know which machine gives the best reward.
• You can only learn by trying each one — that is, by exploring.
• But you also want to earn as much reward as possible — so you need to exploit what you’ve
learned.
The Goal
To maximize the total reward you get over a series of turns.
To do this, you must balance:
• Exploration: Trying different arms to learn their reward patterns.
• Exploitation: Choosing the best-known arm based on past results.
This is called the exploration-exploitation tradeoff.

Why is it called “k-armed”?


• The term “k” just means any number of arms (options).
o E.g., if you have 5 options to choose from → it's a 5-armed bandit.
Mathematical Setup

Let’s define it step by step:

• You have k arms: A1, A2, ..., Ak
• Each arm Ai has an unknown reward distribution with a true mean reward μi
• At each time step t, you choose an arm aₜ and receive a reward rₜ drawn from that arm's distribution


Your objective is to maximize the expected sum of rewards, E[ r₁ + r₂ + ... + r_T ], which means: maximize the total cumulative reward over T rounds.

Regret

Since you don’t know the best arm in advance, you’ll make some wrong choices.

We measure this loss using regret:

Regret(T) = Σ from t = 1 to T of ( μ∗ − μaₜ )

Where:

• μ∗ = expected reward of the best arm
• μaₜ = expected reward of the arm you chose at time t
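As a small, hedged illustration of the regret definition (it assumes the true means are known, which is only possible in simulation):

def total_regret(true_means, chosen_arms):
    # Regret(T) = sum over rounds of (mu* - mu of the arm actually chosen).
    best = max(true_means)
    return sum(best - true_means[a] for a in chosen_arms)

# Example: 3 arms with true means 0.1, 0.5, 0.9; the agent picked arms 0, 2, 2, 1.
# total_regret([0.1, 0.5, 0.9], [0, 2, 2, 1])  ->  0.8 + 0.0 + 0.0 + 0.4 = 1.2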

Real-World Examples

Scenario | Arms (Choices) | Reward
Online advertising | Different ads | Click or no-click (CTR)
Clinical trials | Different drugs | Patient improvement
News recommendation | Headlines | User reads article or not
Investment strategies | Financial assets | Return on investment

Action-Value Methods:

Action-value methods are a family of solutions to the multi-armed bandit problem that focus on forming accurate estimates of the value of each action and using those estimates to decide which action to choose.


Action-value methods have 2 parts:

1. Estimating the value of a given action

2. Selecting which action to choose based on the estimated value of all of the actions.

Estimating Action Values

The simplest way to estimate the value of any particular action is to average the rewards that were
received every time this action was taken. This is called sample averaging.

This can formally be defined as:

Qₜ(a) = ( sum of Rᵢ · 1(Aᵢ = a) over i = 1, ..., t−1 ) / ( sum of 1(Aᵢ = a) over i = 1, ..., t−1 )

• The "1" on the right side of the equation is the indicator function: it returns 1 if the predicate (here, Aᵢ = a) is true and 0 otherwise.

Incremental Updates

The simplest implementation of sample averaging would be to store the rewards for each action as
well as the number of times each action was taken, then these could be used to calculate the above
sample averaging formula.

The problem with this implementation is that the memory and computation grow as more rewards
are seen. Instead, we’d like to avoid having to save all of the rewards to lower memory and
computation needs. This can be done through incremental updating.

For any action:

Rᵢ = reward received after the ith selection of this action

Qₙ = estimate of this action’s value after it has been selected n-1 times


We can now calculate Qₙ₊₁, the new action-value estimate that takes into account the reward that was just received:

Qₙ₊₁ = Qₙ + (1/n) [ Rₙ − Qₙ ]

This implementation requires memory only for Qₙ and n, and only a small constant amount of computation for each new reward.

You can think of this update rule in the format of:

New Estimate = Old Estimate + Step Size × [ Target − Old Estimate ]

• The expression [ Target − Old Estimate ] is the error in the current estimate of the action's value. We move our estimate toward the "Target" (the current reward), which represents a ground truth (though it may be noisy).
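A minimal sketch of this incremental update rule in Python (the function and variable names are illustrative):

def update_estimate(old_estimate, n, reward):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/n,
    # where n is the number of times this action has now been selected.
    return old_estimate + (1.0 / n) * (reward - old_estimate)

# Example: rewards 1.0, 0.0, 1.0 for one action give the same result as a plain average (2/3):
# q = 0.0
# for n, r in enumerate([1.0, 0.0, 1.0], start=1):
#     q = update_estimate(q, n, r)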


Action Selection: Greedy and Epsilon-Greedy

Now that we know how to estimate the value of actions, we can move on to the second part of action-value methods: choosing an action. There are two basic ways to choose an action:

Greedy Action Selection: The simplest way is to always choose the greedy action (the action with
the highest-estimated value). This can be thought of as always exploiting.

Epsilon-Greedy Action Selection: Another method is to choose the greedy action most of the
time, but choose some other action with some probability ε (epsilon). ε could be any value of our
choice (e.g. 0.1 in which case we’d choose the greedy action 90% of the time). In this method we
still exploit most of the time but occasionally we explore other actions that may lead to better
rewards.
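A minimal sketch of these two selection rules, assuming the action-value estimates are kept in a NumPy array (names are illustrative):

import numpy as np

def greedy(q_values):
    # Always exploit: pick the action with the highest estimated value.
    return int(np.argmax(q_values))

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    # Exploit most of the time, but with probability epsilon pick a random action (explore).
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))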

Optimistic Initial Values

The final detail before we can implement a solution is to understand how we can initialize our
estimates of the action values.

• The simplest way is to initialize them all to 0 and this works fine.

• Another way is to initialize them optimistically — start them off with values much higher
than they could ever be in reality.

For sample average methods, whichever way we choose to initialize the estimations doesn’t really
matter in the long-run because the bias begins to disappear once all actions have been taken at least
once.

In the short run, however, optimistic initialization encourages exploration. Since the first few rewards for any action are guaranteed to be below its inflated initial estimate, the agent is pushed to try other actions whose estimates are still high. With this method, greedy agents can be encouraged to explore in their initial phases, which increases the likelihood of their finding the best action.

Implementation

Now that we know how to estimate the values of actions and how to select an action based on
those estimations, we can combine these steps into a simple algorithm to solve the bandit problem.


Action-values are initialized optimistically.


• GreedyAgent: uses greedy action selection
• EpsilonGreedyAgent: uses epsilon-greedy action selection
Performance

We can test these agents on different versions of the Multi-Armed Bandits problem implemented
using the OpenAI Gym Framework.

BanditTwoArmedDeterministicFixed-v0: This is the simplest version of the bandit problem, where there are only 2 arms, one that always pays and one that doesn't.

BanditTenArmedRandomRandom-v0: This is a more complex version of the problem where there are 10 arms, each with a random probability assigned to both payouts and rewards.

For the simple BanditTwoArmedDeterministicFixed-v0 environment, both agents learn quickly, and the greedy agent ends up with a slightly better cumulative reward (999 vs. 941 over 1000 episodes) because it never stops to explore once it finds the best action.


For the more complex BanditTenArmedRandomRandom-v0, the epsilon-greedy agent can beat the greedy agent or vice versa. It depends on whether the initial exploration forced on the greedy agent by the optimistic initial-value estimation is enough for it to stumble on the best action.


The 10-armed Testbed

The RL book introduces a 10-armed testbed (p. 28) in which the true value of each of the ten actions is sampled from a normal distribution with μ = 0 and σ² = 1, and the actual rewards are drawn from a unit-variance normal distribution centered on each action's true value. We can implement the 10-armed testbed as follows:
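The implementation itself is not reproduced in these notes; here is one possible sketch matching the description above (NumPy-based; the class name is my own):

import numpy as np

class TenArmedTestbed:
    # One bandit problem: q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1).
    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.k = k
        self.q_true = self.rng.normal(0.0, 1.0, size=k)   # true action values

    def step(self, action):
        # Reward drawn from a unit-variance normal centered on the true value.
        return self.rng.normal(self.q_true[action], 1.0)

    def optimal_action(self):
        return int(np.argmax(self.q_true))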

Incremental Implementation

As we mentioned earlier, the action-value estimate after an action has been selected n − 1 times is the average of the rewards that action actually received, and we can write the explicit equation as:

Qₙ = ( R₁ + R₂ + ... + Rₙ₋₁ ) / ( n − 1 )

The RL book derives an incremental formula for updating this average in a memory-saving and computationally efficient way:

Qₙ₊₁ = Qₙ + (1/n) [ Rₙ − Qₙ ]


The generalized updating equation can be expressed as follows:

NewEstimate ← OldEstimate + StepSize × [ Target − OldEstimate ]

Then, we can put all the knowledge we have so far into building a simple bandit algorithm:
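Here is a hedged sketch of the resulting ε-greedy bandit algorithm with incremental sample-average updates. It reuses the TenArmedTestbed sketch from above and is not the book's exact code:

import numpy as np

def run_bandit(env, epsilon=0.1, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    q = np.zeros(env.k)            # action-value estimates Q(a)
    n = np.zeros(env.k)            # selection counts N(a)
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(env.k))     # explore
        else:
            a = int(np.argmax(q))            # exploit
        r = env.step(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]            # incremental sample-average update
        rewards[t] = r
    return rewards

# Averaging run_bandit(TenArmedTestbed(), epsilon=e) over many independent testbeds
# (e.g., 2000 runs) for e in {0, 0.01, 0.1} reproduces the comparison described below.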

By running the algorithm on our 10-armed testbed with ε = 0 (greedy), ε = 0.01, and ε = 0.1, and averaging the results over 2000 independent runs, the effect of exploration becomes obvious:


The reward graph shows that the greedy method improved slightly faster than the others at the very beginning but got stuck in a sub-optimal choice afterward. The ε = 0.1 method explores more, so it finds the optimal action earlier, but it can never select that action more than about 91% of the time because of its ε value. The ε = 0.01 method improves more slowly but will eventually find the optimal action, pick it with higher probability, and outperform the ε = 0.1 method in the long run.


Tracking a Nonstationary Problem

All of our discussion so far has assumed a stationary setup, where the reward distribution is static over time. However, most of the time, the reinforcement learning problems we face are non-stationary. In this setting, the plain sample average does not make much sense: because the reward distribution is changing, the weight given to long-past rewards should shrink, while more weight should be assigned to the most recent rewards.

One quick modification to the previous algorithm is to replace the "running-average" step size 1/n with a constant value α:

Qₙ₊₁ = Qₙ + α [ Rₙ − Qₙ ]

where the step-size parameter α ∈ (0, 1] is constant. This makes Qₙ₊₁ a weighted average of past rewards and the initial estimate Q₁, with exponentially more weight given to recent rewards.

Now we run the algorithm for 10,000 timesteps with a constant step size α = 0.1 and ε = 0.1 on the non-stationary version of the 10-armed testbed, where each true action value takes an independent random walk with steps sampled from a normal distribution with μ = 0, σ = 0.01.
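A sketch of this non-stationary experiment (constant step size plus drifting true values); all names and details beyond the description above are my own assumptions:

import numpy as np

def run_nonstationary(k=10, alpha=0.1, epsilon=0.1, steps=10_000, rng=None):
    rng = rng or np.random.default_rng()
    q_true = np.zeros(k)               # all arms start equal, then drift over time
    q = np.zeros(k)                    # constant-step-size estimates
    rewards = np.zeros(steps)
    for t in range(steps):
        a = int(rng.integers(k)) if rng.random() < epsilon else int(np.argmax(q))
        r = rng.normal(q_true[a], 1.0)
        q[a] += alpha * (r - q[a])                 # constant step size: recency-weighted average
        q_true += rng.normal(0.0, 0.01, size=k)    # independent random walks, sigma = 0.01
        rewards[t] = r
    return rewards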


From the plots we can see that the constant-step-size (recency-weighted) method outperforms the sample-average method we implemented previously in the non-stationary environment.

Optimistic Initial Values

In our multi-armed bandit code, we needed to give actions some initial value before proceeding with the estimation process. The natural choice was to give all actions a value of zero (we used a ZeroInitializer for this).

The effect of choosing zero for all action values is that the first time the agent needs an action, it chooses a random one. The second time, the agent can still choose the same action if it makes a greedy choice. Wouldn't it be better to cycle through all actions and try each one first before we go greedy? This is the idea behind optimistic initial values. They promote more exploration in the beginning; once we have some estimates for the action values, we can benefit from our greedy choices.

Effect of optimistic initial values

The way this works is as follows. We start by assuming all actions are optimal. We do this by assigning every action an initial value larger than the mean reward we expect to receive after pulling any arm. As a result, the first estimate of any action will always come out below this optimistic initial value, so even if we're greedy, we won't visit this action again right away and will instead select another unvisited action. This approach encourages the agent to visit all actions several times early on, resulting in early improvement in the estimated action values.
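A minimal sketch of this idea, assuming the TenArmedTestbed from earlier and an initial value (here 5.0, as in S&B's example) chosen to be well above any reward we expect to see:

import numpy as np

def run_optimistic(env, q_init=5.0, alpha=0.1, steps=1000):
    q = np.full(env.k, q_init)          # optimistic initial estimates
    rewards = np.zeros(steps)
    for t in range(steps):
        a = int(np.argmax(q))           # purely greedy selection (epsilon = 0)
        r = env.step(a)
        q[a] += alpha * (r - q[a])      # each early reward drags the estimate down,
        rewards[t] = r                  # so other (still-optimistic) actions get tried next
    return rewards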

Here is how the two initialization methods compare:

Looking at the plot we notice the following:

• Optimistic initialization does not perform well at the beginning, because it explores a lot early on.

• As time goes on, it performs better, because it learned the action values earlier and selects the optimal action more often.

• S&B pose the question: if we are averaging over 2000 runs, which should remove random fluctuations, what explains the spike we see in the early steps? The answer is that with optimistic initialization, the first few steps are not really random. The agent cycles through all the actions, which all share the same large optimistic value. On the first step, every action is equally likely to be chosen, so the % optimal-action curve starts at 1/k, where k is the number of actions (note the 10% in the figure above). On the next step there is a spike, because on average the optimal action now has the largest estimate. This repeats, with decaying spikes, until the effect of the optimistic initial values fades away.


Upper-Confidence-Bound (UCB)

So far, when we explore, we have been selecting actions randomly. The problem with this randomness is that it doesn't discriminate between actions: an action whose value we already estimate well can be selected while we explore, and we gain no new knowledge by selecting it. A better approach is to preferentially select the actions we know little about. The optimistic-initial-values approach achieved this to some extent in the early learning steps, but we can do better.

UCB follows what is called optimism in the face of uncertainty: if we don't have enough confidence in the value of an action, we assume it is optimal and select it. This way we continuously improve our estimates of the values of such actions.

How does UCB work?

The idea of UCB is simple, even though its selection rule may appear intimidating. So far, we have estimated the value Q(a) of each action and either selected the action with the maximum Q (the greedy action) or explored at random. UCB doesn't use Q(a) alone to decide which action to select. Instead, it adds to Q(a) a quantity U(a) that depends on how good our estimate of action a is, or in other words, how confident we are in the value of action a.

We want this quantity U(a) to have the following attributes:

• It must have a small value if the action has been selected frequently enough to have a good estimate. Mathematically, U(a) should be inversely related to the number of times the action has been selected: the more we select an action, the smaller U becomes, and our Q becomes a good enough estimate of its value.

• It must have a large value if we keep choosing actions other than a. This can be represented mathematically by making U grow with the age of the agent, that is, the number of time steps so far. The more we choose actions other than a, the larger the number of time steps becomes relative to N(a), and hence the larger our uncertainty in the value of a.

This is a qualitative description of U(a); in fact, there are many ways to represent U mathematically. S&B use the following form:

Aₜ = argmax over a of [ Qₜ(a) + c · √( ln t / Nₜ(a) ) ]


As we can see:

• We no longer select the action that maximizes Q(a); instead, we select the action that maximizes Q(a) + U(a).

• U(a) increases the values of actions we want to explore; it is essentially an exploration term.

• As mentioned above, U(a) is inversely related to N(a), the number of times action a has been selected. The more we select an action, the smaller the contribution of the U term for that action.

• Actions that have never been selected have U(a) equal to infinity. In other words, they are considered optimal and are selected first.

• U(a) grows with ln t. The more we choose other actions over a, the less confident we are in a: U increases because ln t increases while Nₜ(a) does not. The choice of ln t, as opposed to t, ensures that the rate of increase gets smaller over time; we expect actions we are already confident about to be selected less and less as learning progresses.

• The constant c determines how much we explore, since it is the weight of the exploration term U(a).

• We always perform greedy selection based on the equation above, because exploration is already incorporated into U(a).

Implementing UCB

As you might have guessed, we need a new action-value provider that knows how to calculate U in addition to Q. We also need a new selector that is always greedy. The code for these is relatively straightforward.
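The original provider and selector classes are not reproduced in these notes; as a rough sketch of the UCB selection rule under the assumptions used earlier (NumPy arrays q and n for the estimates and counts, c weighting the exploration term):

import numpy as np

def ucb_action(q, n, t, c=2.0):
    # Select argmax over Q(a) + U(a), with U(a) = c * sqrt(ln t / N(a)).
    q = np.asarray(q, dtype=float)
    n = np.asarray(n, dtype=float)
    if np.any(n == 0):
        # Never-selected actions have U(a) = infinity, so try them first.
        return int(np.argmax(n == 0))
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))

# Inside the bandit loop sketched earlier, replace the epsilon-greedy choice with:
#     a = ucb_action(q, n, t + 1)
# Selection is always "greedy" here because exploration is built into the U(a) bonus.

Here is how the performance curve for UCB looks: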

Note: S&B don't average the rewards received from the environment. In other words, if you study the code, you'll see that the Experiment class passes the reward it receives from the environment to a RunningAverage utility, and the value from that utility is what's returned to our analysis component; hence the smooth curves you see in these articles. To get the exact result in S&B (where the average-reward curve is noisy and has its characteristic early peak), I turned off averaging. You can see a lot of confusion online due to this averaging trick; hopefully this clears things up.

Gradient Bandit Algorithm

As opposed to action-value methods that estimate the value of actions, gradient bandit methods don't care about the value of an action but rather learn a preference for each action over the others.

This preference has no interpretation in terms of reward; only the relative preference of one action over the others matters.

Formalization

The probability of selecting any action is given by a soft-max distribution over the preferences:

πₜ(a) = e^Hₜ(a) / Σ over b of e^Hₜ(b)


• If you’re not familiar with soft-max, it is a function that outputs probabilities over a set of
numbers such that the probabilities sum to 1.

• In this case, the set of numbers are the actions, so each action will be given a probability
of being selected and the probabilities of all those actions will sum to 1.

• For example, if there are 3 actions and the game is at some timestep t you could have: πₜ(1)
= 0.3, πₜ(2) = 0.6, πₜ(3) = 0.1

The value Hₜ(a) represents the preference for selecting action a over the other actions. On each step, after selecting the action Aₜ and receiving the reward Rₜ, the action preferences are updated:

Hₜ₊₁(Aₜ) = Hₜ(Aₜ) + α (Rₜ − R̄ₜ) (1 − πₜ(Aₜ))        for the action just taken
Hₜ₊₁(a) = Hₜ(a) − α (Rₜ − R̄ₜ) πₜ(a)                 for all other actions a ≠ Aₜ

• The first of these updates applies to the action just taken, and the second applies to all other actions.

• α > 0 is a step-size parameter that determines the degree to which the preferences are
changed at each step

• R̄ₜ ("R-bar", the average reward at timestep t) is the average of all the rewards received up to and including timestep t. As with previous methods, it can be maintained incrementally, using either the sample-averaging method or a constant-α step size, which is better suited to nonstationary rewards.

• R̄ₜ can be thought of as a baseline against which the current reward is compared. If the reward is higher than the baseline, the preference for the chosen action is increased; if the reward is lower than the baseline, the preference is decreased.

• The bottom part of the equation moves the preferences for all other actions in the opposite
direction.
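Before looking at the implementation notes below, here is a hedged sketch of these preference updates in code (soft-max selection plus the baseline comparison; it assumes the TenArmedTestbed sketch from earlier):

import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))           # subtract max for numerical stability
    return e / e.sum()

def run_gradient_bandit(env, alpha=0.1, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    h = np.zeros(env.k)                 # action preferences H(a)
    baseline = 0.0                      # running average reward, R-bar
    rewards = np.zeros(steps)
    for t in range(steps):
        pi = softmax(h)
        a = int(rng.choice(env.k, p=pi))
        r = env.step(a)
        baseline += (r - baseline) / (t + 1)       # incremental average of all rewards so far
        # Lower every preference in proportion to its probability, then raise the chosen
        # action's preference; together this matches the two update equations above.
        h -= alpha * (r - baseline) * pi
        h[a] += alpha * (r - baseline)
        rewards[t] = r
    return rewards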

Implementing Gradient Bandit

No surprises here either. We'll use our trusty ZeroInitializer, a new GrandientActionValuesProvider, and a new GradientSelector. Again, the code is really simple; go ahead and take a look and familiarize yourself with it. Here is how the performance looks as generated by the code:

As you can see, we get the same results as in S&B with a very small code change.
