A First Look at IMPALA, the Parallel Algorithm in Baidu's Reinforcement Learning Framework PARL

cftang9999 · Published 2009-02-12 14:51:00

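Reinforcement learning is a relatively new area of artificial intelligence; the famous AlphaGo was built on reinforcement learning techniques.

In the NeurIPS 2018 and 2019 reinforcement learning competitions, Baidu took first place both times with PARL, its in-house reinforcement learning framework built on PaddlePaddle.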

The source code for the two winning entries is at

https://gitee.com/paddlepaddle/PARL/tree/develop/examples/NeurIPS2018-AI-for-Prosthetics-Challenge

https://gitee.com/paddlepaddle/PARL/tree/develop/examples/NeurIPS2019-Learn-to-Move-Challenge

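I recently took Baidu's free introductory course "7-Day Reinforcement Learning Camp: World Champions Take You from Zero to Practice", taught by Keke, a member of the competition team, and learned a great deal from it.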

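The course mentioned that PARL implements many strong algorithms. The Policy Gradient training on Pong in lesson 4 happened to be slow, and I found material saying that the parallel algorithm IMPALA, running on one P40 GPU and 32 CPUs, can finish training within 7 minutes and reach a score of 20 out of a maximum of 21.

Baidu AI Studio offers 12 hours of free GPU time per day on a machine with one V100 GPU and 8 CPUs, so I tried the program there. Sure enough, within 1700 seconds it reached a top score of 20 and an average score of 18-19.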

The environment and steps are recorded below:

Code location

https://github.com/PaddlePaddle/PARL.git

or

https://gitee.com/paddlepaddle/PARL.git

The runtime environment has 8 CPUs, so in

PARL/examples/IMPALA/impala_config.py

change

'actor_num': 32,

to

'actor_num': 8,

Nothing else needs to be changed.
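The same adjustment can be sketched in Python for machines with a different CPU count. This is only an illustration: the helper `cap_actor_num` is hypothetical and not part of PARL, and the dict below is a one-key stand-in for the real contents of impala_config.py. It assumes each actor should get its own CPU, which is why the value is capped rather than raised.

```python
import os


def cap_actor_num(config: dict, cpu_num: int) -> dict:
    """Return a copy of config whose 'actor_num' does not exceed cpu_num.

    Hypothetical helper for illustration; PARL does not ship this function.
    """
    if cpu_num < 1:
        raise ValueError("cpu_num must be at least 1")
    capped = dict(config)  # leave the caller's dict untouched
    capped["actor_num"] = min(config["actor_num"], cpu_num)
    return capped


# Stand-in for the single setting changed in impala_config.py.
default_config = {"actor_num": 32}

# os.cpu_count() can return None, so fall back to 1 in that case.
local_cpus = os.cpu_count() or 1
print(cap_actor_num(default_config, local_cpus))
```

On the 8-CPU machine used here, `cap_actor_num(default_config, 8)` gives `{'actor_num': 8}`, matching the manual edit.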

Start the parallel environment

xparl start --cpu_num 8 --port 8010

Start the training program

cd PARL/examples/IMPALA/

python train.py
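Before launching train.py it can be handy to confirm that something is actually listening on the port passed to xparl. The following is a minimal sketch using only the Python standard library; `master_is_reachable` is a hypothetical helper, not a PARL API, and it only checks that a TCP connection can be opened, not that the listener is a healthy xparl master.

```python
import socket


def master_is_reachable(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it says nothing about whether
    the process behind the port is really an xparl master.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Connection refused, timed out, host not found, and so on.
        return False
```

With the command above, the call would be `master_is_reachable("localhost", 8010)`; if it returns False, the cluster was probably not started or is using a different port.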

The log output looks like this

[06-25 19:40:10 MainThread @train.py:249] {'sample_steps': 3502250, 'max_episode_rewards': 21.0, 'mean_episode_rewards': 18.0, 'min_episode_rewards': 11.0, 'max_episode_steps': 9726, 'mean_episode_steps': 7513.571428571428, 'min_episode_steps': 6314, 'sample_queue_size': 0, 'total_params_sync': 38969, 'cache_params_sent_cnt': 1, 'total_loss': -16.607183, 'pi_loss': -6.2154207, 'vf_loss': 11.594783, 'entropy': 1618.9156, 'kl': 0.00052885746, 'learn_time_s': 0.5279640007019043, 'elapsed_time_s': 1688, 'lr': 0.001, 'entropy_coeff': -0.01}

After elapsed_time_s=1688 seconds of training, about 3.5 million steps, training has essentially reached the target

Maximum reward: 'max_episode_rewards': 21.0,

Mean reward: 'mean_episode_rewards': 18.0,

Minimum reward: 'min_episode_rewards': 11.0
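The numbers quoted above can also be pulled out of a log line programmatically. The sketch below assumes the format seen in this run, namely a bracketed prefix followed by a Python dict literal; the function name `parse_metrics` is my own, and the sample line is an abridged copy of the log above with only five of its keys.

```python
import ast


def parse_metrics(log_line: str) -> dict:
    """Return the metrics dict embedded in one train.py log line."""
    start = log_line.index("{")       # dict literal starts at the first '{'
    end = log_line.rindex("}") + 1    # and ends at the last '}'
    return ast.literal_eval(log_line[start:end])


# Abridged copy of the log line from this run (most keys omitted).
LOG_LINE = (
    "[06-25 19:40:10 MainThread @train.py:249] "
    "{'sample_steps': 3502250, 'max_episode_rewards': 21.0, "
    "'mean_episode_rewards': 18.0, 'min_episode_rewards': 11.0, "
    "'elapsed_time_s': 1688}"
)

metrics = parse_metrics(LOG_LINE)
steps_per_second = metrics["sample_steps"] / metrics["elapsed_time_s"]
print("mean reward:", metrics["mean_episode_rewards"])
print("throughput: %.0f steps/s" % steps_per_second)
```

For this run the throughput works out to roughly 2075 sampled steps per second with 8 actors.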

Software versions

python 3.7.4

paddlepaddle-gpu 1.8.2.post107

parl 1.3.1