强化学习笔记：多臂老虎机问题(2)--Python仿真

本文链接：https://blog.csdn.net/chenxy_bwave/article/details/121472838

本文通过Python模拟了多臂老虎机问题，详细研究了epsilon-greedy算法在不同ε值下的表现，包括最优选择比例、Q值与理论值的对比和行动选择次数。结果显示算法符合预期，展现了马太效应和学习收敛特性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

0. 前言

1. k_armed_bandit function

2. The first trial

2.1 Optimal selection ratio along the time

2.2 Q value vs qstar

2.3 Number of actions vs qstar

3. Comparison between different epsilon

3.1 Average Reward

3.2 Optimal Action

4 Summary

0. 前言

关于多臂老虎机的介绍参见上一篇博客：强化学习笔记：多臂老虎机问题(1)https://blog.csdn.net/chenxy_bwave/article/details/121395855https://blog.csdn.net/chenxy_bwave/article/details/121395855

本文我们针对多臂老虎机问题做一个基于python的仿真，通过仿真来具体地探索一下epsilon-greedy算法在多臂老虎机问题上的表现。

本文中代码在Jupyter Notebook中运行确认过。

首先导入需要的库：

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

1. k_armed_bandit function

首先，我们将k_armed_bandit的一次仿真的处理封装成一个函数，如下所示：

def k_armed_bandit_one_run(qstar,epsilon,nStep):
    """
    One run of K-armed bandit simulation.
    Here, K is not explicitly specified, instead it is derived from the length qstar
    """
    
    K     = len(qstar)
    Q     = np.zeros(K)                   # Estimation of action value.
    actCnt  = np.zeros(K,dtype='int')     # Record the number of action#k being selected

    a     = np.zeros(nStep+1,dtype='int') # Record the adopted action in each time step.
    r     = np.zeros(nStep+1)             # Recode the reward in each time step
    
    optAct   = np.argmax(qstar)           # The ground-truth optimal action, with the largest qstar value
    optCnt   = 0                          # Count the number of time steps in which the optimal action is selected
    optRatio = np.zeros(nStep+1,dtype='float') # Item#0 for initialization

    for t in range(1,nStep+1): # loop over time step
        #1. action selection
        tmp = np.random.uniform(0,1)
        #print(tmp)
        if tmp < epsilon: # random selection for exploring 
            a[t] = np.random.choice(np.arange(K))
            #print('random selection: a[{0}] = {1}'.format(t,a[t]))
        else:             # greedy selection for exploitation
            #选择Q值最大的那个，当多个Q值并列第一时，从中任选一个--但是如何判断有多个并列第一的呢？
            #对Q进行random permutation处理后再找最大值可以等价地解决这个问题
            #因为np.argmax()是找第一个最大的(当存在多个同为最大时)    
            p = np.random.permutation(K)
            a[t] = p[np.argmax(Q[p])]
            #print('greedy selection: a[{0}] = {1}'.format(t,a[t]))

        actCnt[a[t]] = actCnt[a[t]] + 1

        #2. reward: draw from the normal distribution with mean = qstar[a[t]], and variance = 1.
        r[t] = np.random.randn() + qstar[a[t]]        

        #3.Update Q of the selected action - should be refined
        Q[a[t]] = (Q[a[t]]*(actCnt[a[t]]-1) + r[t])/actCnt[a[t]]    
        
        #4. Optimal Action Ratio tracking
        #print(a[t], optAct)
        if a[t] == optAct:
            optCnt = optCnt + 1
        optRatio[t] = optCnt/t
        
    return a,actCnt,r,Q,optRatio

这个代码写的比较直白，而且也有一些相应的注释，相信对多臂老虎机问题及epsilon-greedy算法有一定了解也熟悉python的小伙伴们应该比较容易读懂，这里就不多解释了。不过有问题的话欢迎在评论区讨论。

2. The first trial

首先我们进行一个单次运行，看看运行结果是不是基本符合预期，据此可以大致判断程序有没有什么大的问题。

The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance. The actual rewards were selected according to a mean q*(a), unit-variance normal distribution. (不知道今天CSDN出啥毛病了，公式编辑器好像出问题了)

nStep = 10000
epsilon = 0.1
K     = 10
qstar = np.random.randn(K)
a,actCnt,r,Q,optRatio = k_armed_bandit_one_run(qstar,epsilon,nStep)

print('qstar = \n',qstar)
print('optimal action is action#{}\n'.format(np.argmax(qstar)))
fig,ax = plt.subplots()
ax.hist(a)

qstar = 
 [ 1.21516579  0.33073445  1.18376738  0.07092504  0.17977624  2.40731134
 -0.40352898 -0.30859606 -0.57574036 -0.0660783 ]
optimal action is action#5

如上所示，qstar的最大值为2.4，是action#5。也就是说，action#5是最优行动(the optimal action)。

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_18,color_FFFFFF,t_70,g_se,x_16

从每个time-step的行动历史记录来看，Action#5的选择占据了绝大多数，这的确是符合预期的。说明以上k-armed-bandit仿真模型（函数）基本上是正确的。

2.1 Optimal selection ratio along the time

让我们来看看随着时间的推进选择最优行动的概率是怎么变化。仿真函数中返回的optRatio反映了这一结果，将它画出来就可以了。

plt.plot(optRatio)
plt.title('Optimal selection ratio along the time')

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_16,color_FFFFFF,t_70,g_se,x_16

可以看出来，到4000步以后，基本上最优行动选择比例就基本稳定到在90%了，这个也是符合预期的（1-epsilon+epsilon/K）

2.2 Q value vs qstar

接下来看看各行动的价值估计与各自的Ground-Truth相比如何？理论预期是足够多步长后，估计值应该逼近Ground-Truth。

Q[k]代表所有采取行动k的reward的平均，而qstar[k]代表行动k的reward的数学期望。当𝑛𝑆𝑡𝑒𝑝→∞时应该有𝑄[𝑘]→𝑞∗[𝑘].

fig,ax = plt.subplots(1,2,figsize=[12,6])
ax[0].scatter(Q,qstar)
ax[0].grid()
ax[0].set_title('One trial of epsilon-greedy k-armed badit game')
ax[1].plot(Q-qstar)
ax[1].grid()
ax[1].set_title('Difference between estimation and ground-truth of action value')
ax[1].set_ylim(-1.5,2.0)

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_20,color_FFFFFF,t_70,g_se,x_16

看上去不错，作图以估计值Q为横坐标GT值qstart为纵坐标得到的散点图，看上去基本上集中在y=x直线附近，说明估计值与真值是基本吻合的。右图是两者的差，也确实非常小。所以说这个结果是符合预期的。

2.3 Number of actions vs qstar

接下来看看各种行动被选择的次数以及它们的价值之间的对比关系。actCnt[k]代表所有采取action#k被选择的次数，而qstar[k]代表action#k的reward的数学期望。当nStep→∞时理论上来说应该有reward期望越高的行动被选择的次数越多。但是与𝑄[𝑘]∝𝑞∗[𝑘]不同，各action被选择的次数应该遵循马太效应，即最后应该集中在最高q值的action或最高的几个actions上。当然从实验结果来看，至少在单次仿真中，并不一定是q值最高的action被选择次数最高，特别是在有多个行动的价值比较接近为近似并行最大值时。

fig, ax2_1 = plt.subplots()
ax2_1.scatter(np.arange(K),actCnt)
ax2_1.set_ylabel("number of action")

ax2_2 = ax2_1.twinx()
ax2_2.plot(np.arange(K),qstar, label='qstar', color='blue', marker='s')
ax2_2.set_ylabel("qstar")
ax2_2.set_ylim(np.min(qstar)-0.5, np.max(qstar)+0.5)
ax2_2.legend(loc='upper left')
print(actCnt)
print(qstar)

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_15,color_FFFFFF,t_70,g_se,x_16

从结果来看，的确如预期一样的马太效应呈现了。除了最优行动的被选择次数一骑绝尘外，其它各行动被选择的次数大抵差不多，都很少。

以上为单次实验的结果，有一定的随机性，重复运行的话每次都会得到不一样的结果。要想定量的性能分析，需要执行蒙特卡洛仿真。参见下一章。

3. Comparison between different epsilon

单次的run的结果有较大的随机性，要比较不同的epsilon条件下的算法行为，需要做蒙特卡洛仿真。针对某个epsilon参数设定，在随机条件下执行多次仿真，然后将多次仿真的结果进行统计平均，得到统计平均行为。然后针对不同epsilon充分以上步骤，最后比较不同epsilon设定下的统计平均行为。

仿真控制参数
- number of steps
- number of run
- K: The number of candidate actions
Input:
- qstar[K]
- epsilon: greedy factor 𝜖ϵ
Output:
- a[t]: action series for each step in one run
- r[t]: reward series for each step in one run
- Q[k]: reward sample average up to t-1 for action[k]. Note!!!: qstar randonmization should be put inside the outer loop to make each run representing a different bandit problem.

nStep = 1000
nRun  = 2000
#qstar   = np.random.randn(10) 

r_0p0   = np.zeros((nRun,nStep+1))
r_0p1   = np.zeros((nRun,nStep+1))
r_0p01  = np.zeros((nRun,nStep+1))
optRatio_0p0 = np.zeros((nRun,nStep+1))
optRatio_0p1 = np.zeros((nRun,nStep+1))
optRatio_0p01= np.zeros((nRun,nStep+1))

for run in range(nRun):
    print('.',end='')
    if run%100==99:        
        print('run = ',run+1)
    
    qstar   = np.random.randn(10) 
    a,aNum,r_0p0[run,:],Q,optRatio_0p0[run,:] = k_armed_bandit_one_run(qstar,epsilon=0,nStep=nStep)
    a,aNum,r_0p1[run,:],Q,optRatio_0p1[run,:] = k_armed_bandit_one_run(qstar,epsilon=0.1,nStep=nStep)
    a,aNum,r_0p01[run,:],Q,optRatio_0p01[run,:] = k_armed_bandit_one_run(qstar,epsilon=0.01,nStep=nStep)

由于要跑2000次仿真，所以每次仿真的步长设定为1000次（上面的单次仿真设定为10000次），所以不一定能看到最后完全收敛的状态。但是就这个2000*1000的设定也需要跑一会儿。如果愿意等待的话，可以将仿真步长设得更长一些。

3.1 Average Reward

我们来看看平均奖励结果。

首先我们定义一个smooth函数，用于对仿真结果进行一个统计平均，使得结果看起来更“好看”一些。

def smooth(a,winsize):
    """
    Smoothing with edge processing.
    Input:
        a:原始数据，NumPy 1-D array containing the data to be smoothed,必须是1-D的，如果不是，请使用 np.ravel()或者np.squeeze()转化 
        winsize: smoothing window size needs, which must be odd number,as in the original MATLAB implementation
    Output:
        
    """
    out0 = np.convolve(a,np.ones(winsize,dtype=int),'valid')/winsize
    r = np.arange(1,winsize-1,2)
    start = np.cumsum(a[:winsize-1])[::2]/r
    stop = (np.cumsum(a[:-winsize:-1])[::2]/r)[::-1]
    return np.concatenate(( start , out0, stop ))

利用smooth函数将仿真结果做一些平滑处理画出图来（平均长度越大得话结果会更平滑一些）。

rEnsembleMean_0p0 = np.mean(r_0p0,axis=0)
rEnsembleMean_0p1 = np.mean(r_0p1,axis=0)
rEnsembleMean_0p01 = np.mean(r_0p01,axis=0)
#plt.plot(rEnsembleMean)
plt.plot(smooth(rEnsembleMean_0p0,5))
plt.plot(smooth(rEnsembleMean_0p1,5))
plt.plot(smooth(rEnsembleMean_0p01,5))
plt.legend(['epsilon = 0','epsilon = 0.1','epsilon = 0.01'])

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_12,color_FFFFFF,t_70,g_se,x_16

有兴趣的小伙伴去对比一下这个图与Sutton-RL-book的Figure2.2的话，会看到基本上是一致的，也就是说本仿真基本上再现了原书的结果。很开心。。。^-^

3.2 Optimal Action

接下来看看最优行动选择结果。

Optimal Action是指对应于𝑞∗q∗中期望值最高的那个。In the above simulation, 𝑞∗ is re-initialized for each run. Hence, in simulation, the optimal action counting must be tracked in each run in a cumulative way.

optRatioEnsembleMean_0p0 = np.mean(optRatio_0p0,axis=0)
optRatioEnsembleMean_0p1 = np.mean(optRatio_0p1,axis=0)
optRatioEnsembleMean_0p01 = np.mean(optRatio_0p01,axis=0)
#plt.plot(rEnsembleMean)
plt.plot(smooth(optRatioEnsembleMean_0p0,5))
plt.plot(smooth(optRatioEnsembleMean_0p1,5))
plt.plot(smooth(optRatioEnsembleMean_0p01,5))
plt.legend(['epsilon = 0','epsilon = 0.1','epsilon = 0.01'])

watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56yo54mb5oWi6ICV,size_12,color_FFFFFF,t_70,g_se,x_16