Deep Reinforcement Learning For Robotic Manipulation
Shixiang Gu∗,1,2,3 and Ethan Holly∗,1 and Timothy Lillicrap4 and Sergey Levine1,5
be accurately modeled. In this work we thus focus on model-free reinforcement learning, which includes policy search methods [19], [3], [20] and value-iteration methods [21], [22], [14]. Both approaches have recently been combined with deep neural networks to achieve unprecedented successes in learning complex tasks [23], [24], [18], [10], [11], [25]. However, while policy search methods [23], [18], [25] offer a simple and direct way to optimize the true objective, they often require significantly more data than value iteration methods because of on-policy learning, making them a less obvious choice for robotic applications. We therefore build on two value iteration methods based on Q-learning with function approximation [22], Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Function (NAF) [11], as they successfully extend Deep Q-Learning [24] to continuous action spaces and are significantly more sample-efficient than competing policy search methods due to off-policy learning. DDPG is closely related to the NFQCA [26] algorithm, with the principal differences being that NFQCA uses full-batch updates and resets parameters between episodes.

Accelerating robotic learning by pooling experience from multiple robots has long been recognized as a promising direction in the domain of cloud robotics, where it is typically referred to as collective robotic learning [27], [28], [29], [30]. In deep reinforcement learning, parallelized learning has also been proposed to speed up simulated experiments [25]. The goals of this prior work are fundamentally different from ours: while prior asynchronous deep reinforcement learning work seeks to reduce overall training time under the assumption that simulation time is inexpensive and training is dominated by neural network computation, our work instead seeks to minimize the training time when training on real physical robots, where experience is expensive and computing neural network backward passes is comparatively cheap. In this case, we retain the use of a replay buffer, and focus on asynchronous execution and neural network training. Our results demonstrate that we achieve a significant speedup in overall training time from simultaneously collecting experience across multiple robotic platforms.

The goal is to learn a policy $\pi$ which maximizes the expected sum of returns from the initial state distribution, given by $R = \mathbb{E}_{\pi}[R_1]$.

Among reinforcement learning methods, off-policy methods such as Q-learning offer significant data efficiency compared to on-policy variants, which is crucial for robotics applications. Q-learning trains a greedy deterministic policy $\pi(u_t | x_t) = \delta(u_t = \mu(x_t))$ by iterating between learning the Q-function of a policy, $Q^{\pi_n}(x_t, u_t) = \mathbb{E}_{r_{i \ge t}, x_{i>t} \sim E, u_{i>t} \sim \pi_n}[R_t | x_t, u_t]$, and updating the policy by greedily maximizing the Q-function, $\mu^{n+1}(x_t) = \arg\max_u Q^{\pi_n}(x_t, u_t)$. Let $\theta^Q$ parametrize the action-value function, $\beta$ be an arbitrary exploration policy, and $\rho^\beta$ be the state visitation distribution induced by $\beta$. The learning objective is then to minimize the Bellman error, where we fix the target $y_t$:

$$L(\theta^Q) = \mathbb{E}_{x_t \sim \rho^\beta,\, u_t \sim \beta,\, x_{t+1}, r_t \sim E}\big[(Q(x_t, u_t | \theta^Q) - y_t)^2\big]$$
$$y_t = r(x_t, u_t) + \gamma Q(x_{t+1}, \mu(x_{t+1}))$$

For continuous action problems, the policy update step is intractable for a Q-function parametrized by a deep neural network. Thus, we investigate Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Functions (NAF) [11]. DDPG circumvents the problem by adopting an actor-critic method, while NAF restricts the class of Q-functions to the expression below to enable closed-form updates, as in the discrete action case. During exploration, temporally correlated noise is added to the policy network output. For more details on and comparisons between DDPG and NAF, please refer to [10], [11] as well as the experimental results in Section V-B.

$$Q(x, u | \theta^Q) = A(x, u | \theta^A) + V(x | \theta^V)$$
$$A(x, u | \theta^A) = -\tfrac{1}{2}\,(u - \mu(x | \theta^\mu))^T P(x | \theta^P)\,(u - \mu(x | \theta^\mu))$$

We evaluate both DDPG and NAF in our simulated experiments, where they yield comparable performance, with NAF producing slightly better results overall for the tasks examined here. On real physical systems, we focus on variants of the NAF method, which is simpler, requires only a single optimization objective, and has fewer hyper-parameters.
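To make the NAF parametrization above concrete, the sketch below is a minimal NumPy illustration, not the authors' implementation. The network heads `mu`, `L`, and `V` are hypothetical stand-ins for the outputs parametrized by $\theta^\mu$, $\theta^P$, $\theta^V$, and representing $P(x|\theta^P)$ through a lower-triangular factor $L$ (so that $P = LL^T$ stays positive semi-definite) is one common construction assumed here. Note that for this Q-function, $Q(x_{t+1}, \mu(x_{t+1})) = V(x_{t+1})$ because the advantage vanishes at $u = \mu(x)$, which is what the TD target below uses.

```python
import numpy as np

def naf_q_value(u, mu, L, V):
    """Q(x, u) = V(x) + A(x, u) with a quadratic advantage.

    u  : action to evaluate, shape (du,)
    mu : greedy action mu(x | theta_mu) predicted by the network, shape (du,)
    L  : lower-triangular head output, shape (du, du); P = L @ L.T is PSD
    V  : scalar state value V(x | theta_V)
    """
    P = L @ L.T
    diff = u - mu
    return V - 0.5 * diff @ P @ diff  # maximized at u = mu, where Q = V

def naf_td_target(r, V_next, done, gamma=0.99):
    """Fixed Q-learning target; for NAF, Q'(x', mu(x')) equals V'(x')."""
    return r + gamma * (1.0 - done) * V_next

# Toy usage with made-up numbers for a 2-D action space.
rng = np.random.default_rng(0)
mu = np.array([0.1, -0.2])
L = np.tril(rng.normal(size=(2, 2)))
print(naf_q_value(np.zeros(2), mu, L, V=1.5))
print(naf_td_target(r=0.3, V_next=1.4, done=0.0))
```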
This RL formulation can be applied on robotic systems to learn a variety of skills defined by reward functions. However, the learning process is typically time consuming and requires a number of practical considerations. In the next section, we present our main technical contribution, a parallelized variant of NAF, and also discuss the technical considerations necessary to apply NAF to real-world robotic skill learning.

Algorithm 1 Asynchronous NAF - N collector threads and 1 trainer thread
  // trainer thread
  Randomly initialize normalized Q network $Q(x, u | \theta^Q)$, where $\theta^Q = \{\theta^\mu, \theta^P, \theta^V\}$ as in Eq. 1
  Initialize target network $Q'$ with weight $\theta^{Q'} \leftarrow \theta^Q$
  Initialize shared replay buffer $R \leftarrow \emptyset$
  for iteration = 1, I do
    Sample a random minibatch of m transitions from R
    Set $y_i = r_i + \gamma V'(x'_i | \theta^{Q'})$ if $t_i < T$
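The listing above covers the trainer-side initialization and minibatch sampling of Algorithm 1. As a hedged illustration of the overall trainer/collector split, the Python sketch below wires a shared replay buffer to N collector threads that push noisy on-robot transitions and one trainer thread that samples minibatches; `env_reset`, `env_step`, `policy_action`, and `train_step` are hypothetical placeholders, not the authors' code.

```python
import random
import threading
from collections import deque

replay_buffer = deque(maxlen=1_000_000)  # shared replay buffer R
buffer_lock = threading.Lock()

def collector(env_reset, env_step, policy_action, episodes, horizon):
    """Collector thread: roll out the current policy with exploration noise."""
    for _ in range(episodes):
        x = env_reset()
        for _ in range(horizon):
            u = policy_action(x)  # mu(x) plus temporally correlated noise
            x_next, r, done = env_step(u)
            with buffer_lock:
                replay_buffer.append((x, u, r, x_next, done))
            x = x_next
            if done:
                break

def trainer(train_step, iterations, m=64):
    """Trainer thread: sample minibatches from R and minimize the Bellman error."""
    for _ in range(iterations):
        with buffer_lock:
            if len(replay_buffer) < m:
                continue
            batch = random.sample(list(replay_buffer), m)
        train_step(batch)  # gradient step on L(theta_Q); slowly update the target network

# The collectors and the trainer would be launched as threading.Thread targets, with the
# trainer periodically publishing updated policy parameters back to the collectors.
```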
Fig. 4: The table summarizes the performances of DDPG, Linear-NAF, and NAF across four tasks. Note that the linear
model learns the perfect reaching and pick & place policies given enough time, but fails to learn either of the door tasks.
Figure 5 shows the results on reaching and door pushing.
The x-axis shows the number of parameter updates, which
is proportional to the amount of wall-clock time required for
training, since the amount of data per step increases with the
number of workers. The results demonstrate three points: (1)
under some circumstances, increasing data collection makes
the learning converge significantly faster with respect to the
number of gradient steps, (2) final policy performance depends strongly on the ratio between data-collection and training speeds, and (3) there is a limit beyond which collecting more data does not
help speed up learning. However, we hypothesize that accel-
erating the speed of neural network training, which in these
cases was pegged to one update per time step, could allow
the model to ingest more data and benefit more from greater
parallelism. This is particularly relevant as parallel computational hardware, such as GPUs, is improved and deployed
more widely. Videos of the learned policies are available
in the supplementary materials and online: https://sites.google.com/site/deeproboticmanipulation/
Fig. 7: The 7-DoF arm random target reaching with asynchronous NAF on real robots. Note that 1 worker suffers in both learning speed and final policy performance.

Fig. 8: Learning curves for real-world door opening. Learning with two workers significantly outperforms the single worker, and achieves a 100% success rate in under 500,000 update steps, corresponding to about 2.5 hours of real time.
A. Random Target Reaching
The simulation results in Section V provide approximate performance gains that can be expected from parallelism. However, the simulation setting does not consider several issues that could arise in real-world experiments: delays due to slow resetting procedures, non-constant execution speeds of the training thread, and subtle physics discrepancies among robots. Thus, it is important to demonstrate the benefits of parallel training with real robots.

We set up the same reaching experiment in the real world across up to four robots. Robots execute policies at 20 Hz, while the training thread simply updates the network continuously at approximately 100 Hz. The same network architecture and hyper-parameters from the simulation experiment are used.

Figure 7 confirms that 2 or 4 workers significantly improve learning speed over 1 worker, though the gains on this simple task are not substantial past 2 workers. Importantly, when the training thread is not synchronized with the data collection thread and the data collection is too slow, it may not just slow down learning but also hurt the final policy performance, as observed in the 1-worker case. Further discrepancies from the simulation may also be explained by physical discrepancies among different robots. The learned policies are presented in the supplementary video.

B. Door Opening

The previous section describes a real-world evaluation of asynchronous NAF and demonstrates that learning can be accelerated by using multiple workers. In this section, we describe a more complex door opening task. Door opening presents a practical application of robotic learning that involves complex and discontinuous contact dynamics. Previous work has demonstrated learning of door opening policies using example demonstrations provided by a human expert [6]. In this work, we demonstrate that we can learn policies for pulling open a door from scratch using asynchronous NAF. The entire task required approximately 2.5 hours to learn with two workers learning simultaneously, and the final policy achieves a 100% success rate evaluated across 20 consecutive trials. An illustration of this task is shown in Figure 6, and the supplementary video shows different stages in the learning process, as well as the final learned policy.

Figure 8 illustrates the difference in the learning process between one and two workers, where the horizontal axis shows the number of parameter updates. 100,000 updates correspond to approximately half an hour, with some delays incurred due to periodic policy evaluation, which is only used for measuring the reward for the plot. One worker required significantly more than 4 hours to achieve a 100% success rate, while two workers achieved the same success rate in 2.5 hours. Qualitatively, the learning process goes through a set of stages as the robots learn the task, as illustrated by the learning curves in Figure 8, where the plateau near reward = 0 corresponds to placing the hook near the handle, but not pulling the door open. In the first stage, the robots are unable to reach the handle, and explore the free space to determine an effective policy for reaching. Once the robots begin to contact the handle sporadically, they will occasionally pull on the handle by accident, but require additional training to be able to reach the handle consistently; this corresponds to the plateau in the learning curves. At this point, it becomes much easier for the robots to pull open the door, and a
successful policy emerges. The final policy learned by the two workers was able to open the door every time, including in the presence of exploration noise.
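The discussion in Section VII notes that the door task's reward combines the distance from the gripper to the handle with the difference between the current and desired door pose, while the reacher's reward is the distance to the target. The sketch below shows only a plausible form of such shaped rewards; the weights, function names, and scalar door-angle representation are assumptions for illustration, not the exact reward functions used in these experiments.

```python
import numpy as np

def reaching_reward(gripper_pos, target_pos):
    """Assumed shaped reacher reward: negative Euclidean distance to the target."""
    return -np.linalg.norm(np.asarray(gripper_pos) - np.asarray(target_pos))

def door_reward(gripper_pos, handle_pos, door_angle, target_angle,
                w_reach=1.0, w_door=1.0):
    """Assumed shaped door reward: reach-the-handle term plus door-pose term."""
    reach_term = -np.linalg.norm(np.asarray(gripper_pos) - np.asarray(handle_pos))
    door_term = -abs(door_angle - target_angle)
    return w_reach * reach_term + w_door * door_term
```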
VII. DISCUSSION AND FUTURE WORK

We presented an asynchronous deep reinforcement learning approach that can be used to learn complex robotic manipulation skills from scratch on real physical robotic manipulators. We demonstrate that our approach can learn a complex door opening task with only a few hours of training, and our simulated results demonstrate that training times decrease with more learners. Our technical contribution consists of a novel asynchronous version of the normalized advantage functions (NAF) deep reinforcement learning algorithm, as well as a number of practical extensions to enable safe and efficient deep reinforcement learning on physical systems, and our experiments confirm the benefits of nonlinear deep neural network policies over simpler shallow representations for complex robotic manipulation tasks.

While we have shown that deep off-policy reinforcement learning algorithms are capable of learning complex manipulation skills from scratch and without purpose-built representations, our method has a number of limitations. Although each of the tasks is learned from scratch, the reward function provides some amount of guidance to the learning algorithm. In the reacher task, the reward provides the distance to the target, while in the door task, it provides the distance from the gripper to the handle as well as the difference between the current and desired door pose. If the reward consists only of a binary success signal, both tasks become substantially more difficult and require considerably more exploration. However, such simple binary rewards may be substantially easier to engineer in many practical robotic learning applications. Improving exploration and learning speed in future work to enable the use of such sparse rewards would further improve the practical applicability of the class of methods explored here.

Another promising direction for future work is to investigate how the diverse experience of multiple robotic platforms can be appropriately integrated into a single policy. While we take the simplest approach of pooling all collected experience, multi-robot learning differs fundamentally from single-robot learning in the diversity of experience that multiple robots can collect. For example, in a real-world instantiation of the door opening example, each robot might attempt to open a different door, eventually allowing for generalization across door types. Properly handling such diversity might benefit from explicit exploration or even separate policies trained on each robot, with subsequent pooling based on policy distillation [37]. Exploring these extensions of our method could enable the training of highly generalizable deep neural network policies in future work.

ACKNOWLEDGEMENTS

We sincerely thank Peter Pastor, Ryan Walker, Mrinal Kalakrishnan, Ali Yahya, and Vincent Vanhoucke for their assistance and advice on robot set-ups, Gabriel Dulac-Arnold and Jon Scholz for help on parallelization, and the Google Brain, X, and DeepMind teams for their support.

REFERENCES

[1] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in International Conference on Robotics and Automation (ICRA), 2004.
[2] G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, “Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot,” International Journal of Robotic Research, vol. 27, no. 2, pp. 213–228, 2008.
[3] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[4] E. Theodorou, J. Buchli, and S. Schaal, “Reinforcement learning of motor skills in high dimensions,” in International Conference on Robotics and Automation (ICRA), 2010.
[5] J. Peters, K. Mülling, and Y. Altün, “Relative entropy policy search,” in AAAI Conference on Artificial Intelligence, 2010.
[6] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force control policies for compliant manipulation,” in International Conference on Intelligent Robots and Systems (IROS), 2011.
[7] P. Abbeel, A. Coates, M. Quigley, and A. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Advances in Neural Information Processing Systems (NIPS), 2006.
[8] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” International Journal of Robotic Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[9] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in International Conference on Robotics and Automation (ICRA), 2009.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
[11] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep Q-learning with model-based acceleration,” in International Conference on Machine Learning (ICML), 2016.
[12] M. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
[13] K. J. Hunt, D. Sbarbaro, R. Żbikowski, and P. J. Gawthrop, “Neural networks for control systems: A survey,” Automatica, vol. 28, no. 6, pp. 1083–1112, Nov. 1992.
[14] M. Riedmiller, “Neural fitted Q iteration: First experiences with a data efficient neural reinforcement learning method,” in European Conference on Machine Learning. Springer, 2005, pp. 317–328.
[15] R. Hafner and M. Riedmiller, “Neural reinforcement learning controllers for a real robot application,” in International Conference on Robotics and Automation (ICRA), 2007.
[16] M. Riedmiller, S. Lange, and A. Voigtlaender, “Autonomous reinforcement learning on raw visual input data in a real world application,” in International Joint Conference on Neural Networks, 2012.
[17] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Conference on Genetic and Evolutionary Computation, ser. GECCO ’13, 2013.
[18] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, “Trust region policy optimization,” in International Conference on Machine Learning (ICML), 2015.
[19] R. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, May 1992.
[20] M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
[21] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[22] R. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NIPS), 1999.
[23] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013, pp. 1061–1068.
[24] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1928–1937.
[26] R. Hafner and M. Riedmiller, “Reinforcement learning in feedback control,” Machine Learning, vol. 84, no. 1-2, pp. 137–169, 2011.
[27] M. Inaba, S. Kagami, F. Kanehiro, and Y. Hoshino, “A platform for robotics research based on the remote-brained robot approach,” International Journal of Robotics Research, vol. 19, no. 10, 2000.
[28] J. Kuffner, “Cloud-enabled humanoid robots,” in IEEE-RAS International Conference on Humanoid Robotics, 2010.
[29] B. Kehoe, A. Matsukawa, S. Candido, J. Kuffner, and K. Goldberg, “Cloud-based robot grasping with the Google object recognition engine,” in IEEE International Conference on Robotics and Automation, 2013.
[30] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on cloud robotics and automation,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 2, April 2015.
[31] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[32] J. Ba and D. Kingma, “Adam: A method for stochastic optimization,” 2015.
[33] J. Kober and J. Peters, “Learning motor primitives for robotics,” in International Conference on Robotics and Automation (ICRA), 2009.
[34] R. Tedrake, T. W. Zhang, and H. S. Seung, “Learning to walk in 20 minutes.”
[35] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[36] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[37] A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” in International Conference on Learning Representations (ICLR), 2016.