
Deep Reinforcement Learning for Robotic Manipulation

Shixiang Gu∗,1,2,3 and Ethan Holly∗,1 and Timothy Lillicrap4 and Sergey Levine1,5

∗ Equal contribution. 1 Google Brain, 2 University of Cambridge, 3 MPI Tübingen, 4 Google DeepMind, 5 UC Berkeley
arXiv:1610.00633v1 [cs.RO] 3 Oct 2016

Abstract— Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.

Fig. 1: Two robots in the process of learning a door opening task. We present a method that allows multiple robots to cooperatively learn a single policy with deep reinforcement learning.

I. INTRODUCTION

Reinforcement learning methods have been applied to a range of robotic control tasks, from locomotion [1], [2] to manipulation [3], [4], [5], [6] and autonomous vehicle control [7]. However, practical real-world applications of reinforcement learning have typically required significant additional engineering beyond the learning algorithm itself: an appropriate representation for the policy or value function must be chosen so as to achieve training times that are practical for physical hardware [8], and example demonstrations must often be provided to initialize the policy and mitigate safety concerns during training [9]. In this work, we show that a recently proposed deep reinforcement learning algorithm based on off-policy training of deep Q-functions [10], [11] can be extended to learn complex manipulation policies from scratch, without user-provided demonstrations, and using only general-purpose neural network representations that do not require task-specific domain knowledge.

One of the central challenges with applying direct deep reinforcement learning algorithms to real-world robotic platforms has been their apparent high sample complexity. We demonstrate that, contrary to commonly held assumptions, recently developed off-policy deep Q-function based algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [10] and the Normalized Advantage Function algorithm (NAF) [11] can achieve training times that are suitable for real robotic systems. We also demonstrate that we can further reduce training times by parallelizing the algorithm across multiple robotic platforms. To that end, we present a novel asynchronous variant of NAF, evaluate the speedup obtained with varying numbers of learners in simulation, and demonstrate real-world results with parallelism across multiple robots. An illustration of these robots learning a door opening task is shown in Figure 1.

The main contribution of this paper is a demonstration of asynchronous deep reinforcement learning using our parallel NAF algorithm across a cluster of robots. Our technical contribution consists of the asynchronous variant of the NAF algorithm, as well as practical extensions of the method to enable sample-efficient training on real robotic platforms. We also introduce a simple and effective safety mechanism for constraining exploration at training time, and present simulated experiments that evaluate the speedup obtained from parallelizing across a variable number of learners. Our experiments also evaluate the benefits of deep neural network representations for several complex manipulation tasks, including door opening and pick-and-place, by comparing to more standard linear representations. Our real-world experiments show that our approach can be used to learn a door opening skill from scratch using only general-purpose neural network representations and without any human demonstrations. To the best of our knowledge, this is the first demonstration of autonomous door opening that does not use human-provided examples for initialization.

II. RELATED WORK
Applications of reinforcement learning (RL) in robotics have included locomotion [1], [2], manipulation [3], [4], [5], [6], and autonomous vehicle control [7]. Many of the RL methods demonstrated on physical robotic systems have used relatively low-dimensional policy representations, typically with under one hundred parameters, due to the difficulty of efficiently optimizing high-dimensional policy parameter vectors [12]. Although there has been considerable research on reinforcement learning with general-purpose neural networks for some time [13], [14], [15], [16], [17], such methods have only recently been developed to the point where they could be applied to continuous control of high-dimensional systems, such as 7 degree-of-freedom (DoF) arms, and with large and deep neural networks [18], [10], [11]. This has made it possible to learn complex skills with minimal manual engineering, though it has remained unclear whether such approaches could be adapted to real systems given their sample complexity.

In real robot environments, particularly those with contact events, environment dynamics are rarely available or cannot be accurately modeled. In this work we thus focus on model-free reinforcement learning, which includes policy search methods [19], [3], [20] and value-iteration methods [21], [22], [14]. Both approaches have recently been combined with deep neural networks to achieve unprecedented successes in learning complex tasks [23], [24], [18], [10], [11], [25]. However, while policy search methods [23], [18], [25] offer a simple and direct way to optimize the true objective, they often require significantly more data than value iteration methods because of on-policy learning, making them a less obvious choice for robotic applications. We therefore build on two value iteration methods based on Q-learning with function approximation [22]: Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Functions (NAF) [11], as they successfully extend Deep Q-Learning [24] to continuous action spaces and are significantly more sample-efficient than competing policy search methods due to off-policy learning. DDPG is closely related to the NFQCA [26] algorithm, with the principal differences being that NFQCA uses full-batch updates and parameter resetting between episodes.

Accelerating robotic learning by pooling experience from multiple robots has long been recognized as a promising direction in the domain of cloud robotics, where it is typically referred to as collective robotic learning [27], [28], [29], [30]. In deep reinforcement learning, parallelized learning has also been proposed to speed up simulated experiments [25]. The goals of this prior work are fundamentally different from ours: while prior asynchronous deep reinforcement learning work seeks to reduce overall training time, under the assumption that simulation time is inexpensive and the training is dominated by neural network computations, our work instead seeks to minimize the training time when training on real physical robots, where experience is expensive and computing neural network backward passes is comparatively cheap. In this case, we retain the use of a replay buffer, and focus on asynchronous execution and neural network training. Our results demonstrate that we achieve significant speedup in overall training time from simultaneously collecting experience across multiple robotic platforms.

III. BACKGROUND

In this section, we formulate the robotic reinforcement learning problem, introduce essential notation, and describe the existing algorithmic foundations on which we build the methods for this work. The goal in reinforcement learning is to control an agent attempting to maximize a reward function which, in the context of a robotic skill, denotes a user-provided definition of what the robot should try to accomplish. At state x_t at time t, the agent chooses and executes action u_t according to its policy π(u_t|x_t), transitions to a new state x_{t+1} according to the dynamics p(x_{t+1}|x_t, u_t), and receives a reward r(x_t, u_t). Here, we consider infinite-horizon discounted return problems, where the objective is the γ-discounted future return from time t to ∞, given by R_t = Σ_{i=t}^{∞} γ^{(i−t)} r(x_i, u_i). The goal is to find the optimal policy π∗ which maximizes the expected sum of returns from the initial state distribution, given by R = E_π[R_1].

Among reinforcement learning methods, off-policy methods such as Q-learning offer significant data efficiency compared to on-policy variants, which is crucial for robotics applications. Q-learning trains a greedy deterministic policy π(u_t|x_t) = δ(u_t = µ(x_t)) by iterating between learning the Q-function of a policy, Q^{π_n}(x_t, u_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i>t}∼π_n}[R_t | x_t, u_t], and updating the policy by greedily maximizing the Q-function, µ^{n+1}(x_t) = argmax_u Q^{π_n}(x_t, u_t). Let θ^Q parametrize the action-value function, β be an arbitrary exploration policy, and ρ^β be the state visitation distribution induced by β. The learning objective is to minimize the Bellman error, where we fix the target y_t:

L(θ^Q) = E_{x_t∼ρ^β, u_t∼β, x_{t+1}, r_t∼E}[(Q(x_t, u_t|θ^Q) − y_t)^2]
y_t = r(x_t, u_t) + γ Q(x_{t+1}, µ(x_{t+1}))

For continuous action problems, the policy update step is intractable for a Q-function parametrized by a deep neural network. Thus, we investigate Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Functions (NAF) [11]. DDPG circumvents the problem by adopting an actor-critic method, while NAF restricts the class of Q-functions to the expression below to enable closed-form updates, as in the discrete action case. During exploration, temporally-correlated noise is added to the policy network output. For more details and comparisons of DDPG and NAF, please refer to [10], [11], as well as the experimental results in Section V-B.

Q(x, u|θ^Q) = A(x, u|θ^A) + V(x|θ^V)    (1)
A(x, u|θ^A) = −(1/2) (u − µ(x|θ^µ))^T P(x|θ^P) (u − µ(x|θ^µ))

We evaluate both DDPG and NAF in our simulated experiments, where they yield comparable performance, with NAF producing slightly better results overall for the tasks examined here. On real physical systems, we focus on variants of the NAF method, which is simpler, requires only a single optimization objective, and has fewer hyperparameters.
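To make the NAF parametrization of Eq. 1 concrete, the short NumPy sketch below assembles a Q-value from the three network heads µ(x), L(x), and V(x), forming P = L L^T so that the quadratic advantage is maximized exactly at u = µ(x). The head outputs here are random placeholders rather than values from the trained networks in this paper; the sketch only illustrates the structure of the decomposition.

```python
import numpy as np

def naf_q_value(mu, L, V, u):
    """Q(x, u) = V(x) - 0.5 * (u - mu(x))^T P(x) (u - mu(x)), with P = L L^T."""
    P = L @ L.T                      # positive semi-definite matrix from the Cholesky-style head
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff
    return V + advantage

# Illustrative placeholder head outputs for a 7-dimensional action space (not trained values).
rng = np.random.default_rng(0)
mu = rng.standard_normal(7)                  # policy head mu(x)
L = np.tril(rng.standard_normal((7, 7)))     # lower-triangular head L(x)
V = 1.3                                      # value head V(x)

u_greedy = mu                                # the advantage is maximized (= 0) at u = mu(x)
u_other = mu + 0.1 * rng.standard_normal(7)
assert naf_q_value(mu, L, V, u_greedy) >= naf_q_value(mu, L, V, u_other)
```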
This RL formulation can be applied on robotic systems to learn a variety of skills defined by reward functions. However, the learning process is typically time consuming, and requires a number of practical considerations. In the next section, we present our main technical contribution, which consists of a parallelized variant of NAF, and also discuss a variety of technical contributions necessary to apply NAF to real-world robotic skill learning.

IV. ASYNCHRONOUS TRAINING OF NORMALIZED ADVANTAGE FUNCTIONS

In this section, we present our primary contribution: an extension of NAF that makes it practical for use with real-world robotic platforms. To that end, we describe how online training of the Q-function estimator can be performed asynchronously, with a learner thread that trains the network and one or more worker threads that collect data by executing the current policy on one or more robots. Besides making NAF suitable for real-time applications, this approach also makes it straightforward to collect experience from multiple robots in parallel. This is crucial in real-world robot learning, since the learning time is often constrained by the data collection rate in real time, rather than by network training speed. When data collection is the limiting factor, 2-3 times quicker data collection may translate directly to 2-3 times faster skill acquisition on a real robot. We also describe practical considerations, such as safety constraints, which are necessary in order to allow the exploration required to train complex policies from scratch on real systems. To the best of our knowledge, this is the first direct deep RL method that has been demonstrated on a real robotic platform with many DoFs and contact dynamics, and without demonstrations or simulated pretraining [18], [10], [11]. As we will show in our experimental evaluation, this approach can be used to learn complex tasks such as door opening from scratch, which previously required additional inputs such as human demonstrations to succeed [6].

A. Asynchronous Learning

In asynchronous NAF, the learner thread is separated from the experience collecting worker threads. The asynchronous learning algorithm is summarized in Algorithm 1. The learner thread uses the replay buffer to perform asynchronous updates to the deep neural network Q-function approximator. This thread runs on a central server, and dispatches updated policy parameters to each of the worker threads. The experience collecting worker threads run on the individual robots, and send the observation, action, and reward for each time step to the central server to append to the replay buffer. This decoupling between the training and the collecting threads allows the controllers on each of the robots to run in real time, without experiencing delays due to the computational cost of backpropagation through the network. Furthermore, it makes it straightforward to parallelize experience collection across multiple robots simply by adding additional worker threads. We only use one thread for training the network; however, the gradient computation can also be distributed in the same way as [25] within our framework. While the trainer thread keeps training from the centralized replay buffer, the collector threads sync their policy parameters with the trainer thread at the beginning of each episode, execute commands on the robots, and push experience into the buffer.

Algorithm 1: Asynchronous NAF with N collector threads and 1 trainer thread

  // trainer thread
  Randomly initialize normalized Q network Q(x, u|θ^Q), where θ^Q = {θ^µ, θ^P, θ^V} as in Eq. 1
  Initialize target network Q' with weights θ^{Q'} ← θ^Q
  Initialize shared replay buffer R ← ∅
  for iteration = 1, I do
      Sample a random minibatch of m transitions from R
      Set y_i = r_i + γ V'(x'_i|θ^{Q'}) if t_i < T, and y_i = r_i if t_i = T
      Update the weights θ^Q by minimizing the loss L = (1/m) Σ_i (y_i − Q(x_i, u_i|θ^Q))^2
      Update the target network: θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
  end for

  // collector thread n, n = 1...N
  Randomly initialize policy network µ(x|θ^µ_n)
  for episode = 1, M do
      Sync policy network weights θ^µ_n ← θ^µ
      Initialize a random process N for action exploration
      Receive initial observation state x_1 ∼ p(x_1)
      for t = 1, T do
          Select action u_t = µ(x_t|θ^µ_n) + N_t
          Execute u_t and observe r_t and x_{t+1}
          Send transition (x_t, u_t, r_t, x_{t+1}, t) to R
      end for
  end for

B. Safety Constraints

Ensuring safe exploration poses a significant challenge for real-world training with reinforcement learning. Q-learning requires a significant amount of noisy exploration for gathering the experience necessary for action-value function approximation. For all experiments, we set a maximum commanded velocity allowed per joint, as well as strict position limits for each joint. In addition to joint position limits, we used a bounding sphere for the end-effector position. If the commanded joint velocities would send the end-effector outside of the sphere, we used the forward kinematics to project the commanded velocity onto the surface of the sphere, plus some correction velocity to force it toward the center. For experiments with no contacts, these safety constraints were sufficient to prevent unsafe exploration; for experiments with contacts, additional heuristics were required for safety.
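As an illustration of the bounding-sphere constraint just described, the sketch below clips a commanded Cartesian end-effector velocity (assumed to have already been obtained from the commanded joint velocities via forward kinematics) so that it cannot push the end-effector outside a safety sphere, and adds a small correction velocity back toward the center. The radius, gain, and time step are hypothetical values, not the settings used on the real robots.

```python
import numpy as np

def constrain_ee_velocity(ee_pos, ee_vel_cmd, center, radius, dt=0.05, gain=0.5):
    """Return a safe Cartesian end-effector velocity.

    If the commanded velocity would move the end-effector outside the safety
    sphere, remove the outward radial component (projecting the motion onto the
    sphere surface) and add a small correction velocity toward the center.
    """
    next_pos = ee_pos + ee_vel_cmd * dt
    if np.linalg.norm(next_pos - center) <= radius:
        return ee_vel_cmd                          # still inside the sphere: leave unchanged
    radial = ee_pos - center
    radial /= (np.linalg.norm(radial) + 1e-8)      # unit outward direction
    outward = max(ee_vel_cmd @ radial, 0.0)
    tangential = ee_vel_cmd - outward * radial     # drop the outward component
    correction = -gain * radial                    # gentle pull back toward the center
    return tangential + correction

# Hypothetical usage: the end-effector is near the boundary of a 0.8 m sphere.
safe_vel = constrain_ee_velocity(
    ee_pos=np.array([0.75, 0.0, 0.2]),
    ee_vel_cmd=np.array([0.6, 0.0, 0.0]),
    center=np.zeros(3),
    radius=0.8,
)
print(safe_vel)
```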
C. Network Architectures

To minimize manual engineering, we use a simple and readily available state representation consisting of joint angles and end-effector positions, as well as their time derivatives. In addition, we append a target position to the state, which depends on the task: for the reaching task, this is the goal position for the end-effector; for the door opening task, it is the handle position when the door is closed, together with the quaternion measurement of the sensor attached to the door frame. Since the state representation is compact, we use standard feed-forward networks to parametrize the action-value functions and policies. We use two-hidden-layer networks with 100 units per layer to parametrize each of µ(x), L(x) (the Cholesky decomposition of P(x)), and V(x) in NAF, and µ(x) and Q(x, u) in DDPG. For Q(x, u) in DDPG, the action vector u is added as an additional input to the second hidden layer, followed by a linear projection. ReLU is used for the hidden activations, and the hyperbolic tangent (Tanh) is used as the final activation function in the policy networks µ(x) to bound the action scale.

To illustrate the importance of deep neural networks for representing policies or action-value functions, we study these neural network models against another, simpler parametrization. Specifically, we study a variant of NAF (Linear-NAF) as below, where µ(x) = f(k + Kx), and P, k, K, B, b, c are learnable matrices, vectors, or scalars of appropriate dimension, and f is Tanh to enforce bounded actions.

Q(x, u) = −(1/2) (u − µ(x))^T P (u − µ(x)) + x^T B x + x^T b + c

If f is the identity, then the expression corresponds to a globally quadratic Q-function and a linear feedback policy, though due to the Tanh non-linearity, the Q-function is not linear with respect to state-action features.
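A minimal NumPy sketch of the kind of policy network described above is given below: two hidden layers of 100 ReLU units and a Tanh output that bounds the action scale. The randomly initialized weights and the 20-dimensional input are illustrative stand-ins; the actual networks in the paper are trained with the objectives of Section III.

```python
import numpy as np

def mlp_policy(x, params, action_scale=1.0):
    """Two-hidden-layer (100 units, ReLU) policy with a Tanh-bounded output."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)            # hidden layer 1, ReLU
    h = np.maximum(h @ params["W2"] + params["b2"], 0.0)            # hidden layer 2, ReLU
    return action_scale * np.tanh(h @ params["W3"] + params["b3"])  # bounded action output

def init_params(state_dim=20, hidden=100, action_dim=7, seed=0):
    rng = np.random.default_rng(seed)
    def layer(n_in, n_out):
        return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    return {
        "W1": layer(state_dim, hidden),  "b1": np.zeros(hidden),
        "W2": layer(hidden, hidden),     "b2": np.zeros(hidden),
        "W3": layer(hidden, action_dim), "b3": np.zeros(action_dim),
    }

params = init_params()
state = np.zeros(20)                 # e.g. joint angles, velocities, end-effector and target positions
action = mlp_policy(state, params)   # 7-dimensional joint-velocity command in [-1, 1]
```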
V. SIMULATED EXPERIMENTS

We first performed a detailed investigation of the learning algorithms using simulated tasks modeled in the MuJoCo physics simulator [31]. Simulated environments enable fast comparisons of design choices, including update frequencies, parallelism, network architectures, and other hyperparameters. We modeled a 7-DoF lightweight arm that was also used in our physical robot experiments, as well as a 6-DoF Kinova JACO arm with 3 additional degrees of freedom in the fingers, for a total of 9 degrees of freedom. Both arms were controlled at the level of joint velocities, except the three JACO finger joints, which are controlled with torque actuators. The 7-DoF arm is controlled at 20 Hz to match the real-world robot experiments, and the JACO arm is controlled at 100 Hz. Gravity is turned off for the 7-DoF arm, which is a valid assumption given that the actual robot uses built-in gravity compensation. Gravity is enabled for the JACO arm. The different arm geometries, control frequencies, and gravity settings illustrate the learning algorithm's robustness to different learning environments.

Fig. 2: The 7-DoF arm and the JACO arm in simulation.

A. Simulation Tasks

Tasks include random-target reaching, door pushing, door pulling, and pick & place in a 3D environment, as detailed below. The 7-DoF arm is set up for the random-target reaching and door tasks, while the JACO arm is used for the pick & place task (see Figure 2). Details of each task are given below, where d is the Huber loss and the c_i are non-negative constants. A discount factor of γ = 0.98 is chosen, and the Adam optimizer [32] with a base learning rate of either 0.0001 or 0.001 is used for all experiments. Importantly, almost no hyperparameter search was required to ensure that the employed algorithms were successful across robots and tasks.

1) Reaching (7-DoF arm): The 7-DoF arm tries to reach a random target in space from a fixed initial configuration. A random target is generated per episode by sampling points uniformly from a cube of size 0.2 m centered around a point. State features include the 7 joint angles and their time derivatives, the end-effector position, and the target position, totalling 20 dimensions. Each episode lasts 150 time steps (7.5 seconds). The success rate is computed from 5 random test episodes, where an episode is successful if the arm reaches within 5 cm of the target. Given the end-effector position e and the target position y, the reward function is given below:

r(x, u) = −c_1 d(y, e(x)) − c_2 u^T u
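As a worked example of the reaching reward above, the sketch below computes r(x, u) = −c_1 d(y, e(x)) − c_2 u^T u with d implemented as one common form of the Huber loss applied to the Euclidean distance. The coefficients c_1, c_2 and the Huber threshold are illustrative placeholders; the paper does not list the exact constants.

```python
import numpy as np

def huber(a, b, delta=0.1):
    """Huber distance d(a, b): quadratic near zero, linear for large errors."""
    err = np.linalg.norm(a - b)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def reaching_reward(ee_pos, target_pos, action, c1=1.0, c2=0.01):
    """r(x, u) = -c1 * d(y, e(x)) - c2 * u^T u for the random-target reaching task."""
    return -c1 * huber(target_pos, ee_pos) - c2 * float(action @ action)

r = reaching_reward(
    ee_pos=np.array([0.40, 0.10, 0.30]),
    target_pos=np.array([0.45, 0.05, 0.35]),
    action=np.zeros(7),
)
print(r)  # a small negative value that approaches 0 as the end-effector nears the target
```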
2) Door Pushing and Pulling (7-DoF arm): The 7-DoF arm tries to open the door by pushing or pulling the handle (see Figure 2). For each episode, the door position is sampled randomly within a rectangle of 0.2 m by 0.1 m. The handle can be turned downward by up to 90 degrees, while the door can be opened up to 90 degrees in both directions. The door has a spring such that it closes gradually when no force is applied. The door also has a latch, so that it can only be opened when the handle is turned past approximately 60 degrees. To make the setting similar to the real robot experiment, where the quaternion readings from the VectorNav IMU are used for door angle measurements, the quaternion of the door handle is used to compute the loss. The reward function is composed of two parts: the closeness of the end-effector to the handle, and a measure of how far the door is opened in the right direction. The first part depends on the distance between the end-effector position e and the handle position h in its neutral state. The second part depends on the distance between the quaternion of the handle q and its value q_o when the handle is turned and the door is opened. We also add the distance when the door is in its neutral position as an offset d_i = d(q_o, q_i), such that the agent receives positive reward when the door is opened the correct way. State features include the 7 joint angles and their time derivatives, the end-effector position, the resting handle position, the door frame position, the door angle, and the handle angle, totalling 25 dimensions. Each episode lasts 300 time steps (15 seconds). The success rate is computed from 20 random test episodes, where an episode is successful if the arm opens the door in the correct direction by a minimum of 10 degrees.

r(x, u) = −c_1 d(h, e(x)) + c_2 (−d(q_o, q(x)) + d_i) − c_3 u^T u

3) Pick & Place (JACO): The JACO arm tries to pick up a stick suspended in the air by a string and place it near a target position in the space above (see Figure 2). The hand begins near to, but not in contact with, the stick, so the grasp must be learned. The task is similar to a task previously explored with on-policy methods [25], except that here the task requires moving the stick to multiple targets. For each episode, a new target is sampled from a square of size 0.24 m at a fixed height, while the initial stick position and the arm configuration are fixed. Given the grip site position g (where the three fingers meet when closed), the three fingertip positions f_1, f_2, f_3, the stick position s, and the target position y, the reward function is given below. State features include the positions and rotation matrices of all the geometries in the environment, the target position, and the vector from the stick to the target, totalling 180 dimensions. The large observation dimensionality creates an interesting comparison with the above two tasks. Each episode lasts 300 time steps (3 seconds). The success rate is computed from 20 random test episodes, where an episode is judged successful if the arm brings the stick within 5 cm of the target.

r(x, u) = −c_1 d(s(x), g(x)) − c_2 Σ_{i=1}^{3} d(s(x), f_i(x)) − c_3 d(y, s(x)) − c_4 u^T u

B. Neural Network Policy Representations

Neural networks are powerful function approximators, but they have significantly more parameters than the simpler linear models that are often used in robotic learning [20], [8]. In this section, we compare the empirical performance of DDPG, NAF, and Linear-NAF as described in Section IV-C. In particular, we want to verify whether deep representations for the policy and value function are necessary for solving complex tasks from scratch, and to evaluate how they compare with linear models in terms of convergence rate. For the 7-DoF arm tasks, the DDPG and NAF models have significantly more parameters than Linear-NAF, while the pick & place task has a high-dimensional observation, and thus the parameter sizes are more comparable. Of course, many other linear representations are possible, including DMPs [33], splines [3], and task-specific representations [34]. This comparison only serves to illustrate that our tasks are complex enough that simple, fully generic linear representations are not by themselves sufficient for success. For the experiments in this section, batch normalization [35] is applied. These experiments were conducted synchronously, where 1 parameter update is applied per time step in simulation.

Fig. 3: Learning curves for (a) door pulling and (b) JACO pick & place, comparing DDPG, Linear-NAF, and NAF. Note that the linear model struggles to learn the tasks, indicating the importance of expressive nonlinear policy representations.

Figure 3 shows the experimental results on the 7-DoF door pulling and JACO pick & place tasks, and the table in Fig. 4 summarizes the overall results. For reaching and pick & place, Linear-NAF learns good policies competitive with those of NAF and DDPG, but converges significantly more slowly than both NAF and DDPG. This is contrary to the common belief that neural networks take significantly more data and update steps to converge to good solutions. One possible explanation is that in RL the data collection and the model learning are coupled, and if the model is more expressive, it can explore a greater variety of complex policies efficiently and thus collect diverse and good data quickly. This is not a problem for well-pre-trained policy learning, but could be an important issue when learning from scratch. In the case of the door tasks, the linear model completely fails to learn perfect policies. More thorough investigation into how the expressivity of the policy interacts with reinforcement learning is a promising direction for future work.

Additionally, the experimental results on the door tasks show that Linear-NAF does not succeed in learning such tasks. The difference from the above tasks likely comes from the complexity of the policies. For reaching and pick & place, the tasks mainly require learning single-motion policies, e.g. closing the fingers to grasp the stick and moving it to the target. For the door tasks, the robot is required to learn how to hook onto the door handle in different locations, turn it, and push or pull. See the supplementary video at https://sites.google.com/site/deeproboticmanipulation/ for the learned behaviors for each task.
                Max. success rate (%)            Episodes to 100% success (1000s)
                DDPG      Lin-NAF    NAF         DDPG       Lin-NAF    NAF
  Reach         100±0     100±0      100±0       3.2±0.7    8±3        3.6±1.0
  Door Pull     100±0     5±6        100±0       10±8       N/A        6±3
  Door Push     100±0     40±10      100±0       3.1±1.0    N/A        4.2±1.0
  Pick & Place  100±0     100±0      100±0       4.4±0.6    12±3       2.9±0.9

Fig. 4: The table summarizes the performances of DDPG, Linear-NAF, and NAF across four tasks. Note that the linear model learns the perfect reaching and pick & place policies given enough time, but fails to learn either of the door tasks.

C. Asynchronous Training

In asynchronous training, the training thread continuously trains the network at a fixed frequency determined by the network size and the computational hardware, while each collector thread runs at a specified control frequency. The main question to answer is: given these constraints, how much speedup can we gain from increasing the number of workers, i.e. the data collection speed? To analyze this in a realistic but controlled setting, we first set up the following experiment in simulation. We locked each collector thread to run at S times the speed of the training thread. Then, we varied the number of collector threads N. Thus, the overall data collection speed is approximately S × N times that of the trainer thread. For our experiments, we varied N and fixed S = 1/5, since our training thread runs at approximately 100 updates per second on CPU, while the collector thread on the real robot will be locked to 20 Hz. Layer normalization is applied [36].

Figure 5 shows the results on reaching and door pushing. The x-axis shows the number of parameter updates, which is proportional to the amount of wall-clock time required for training, since the amount of data per step increases with the number of workers. The results demonstrate three points: (1) under some circumstances, increasing data collection makes the learning converge significantly faster with respect to the number of gradient steps, (2) final policy performance depends strongly on the ratio between collecting and training speeds, and (3) there is a limit beyond which collecting more data does not help speed up learning. However, we hypothesize that accelerating the speed of neural network training, which in these cases was pegged to one update per time step, could allow the model to ingest more data and benefit more from greater parallelism. This is particularly relevant as parallel computational hardware, such as GPUs, is improved and deployed more widely. Videos of the learned policies are available in the supplementary materials and online: https://sites.google.com/site/deeproboticmanipulation/
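The trainer/collector split of Algorithm 1, and the data-rate analysis above, can be prototyped with ordinary Python threads sharing a replay buffer, as in the simplified sketch below. The environment interaction and the network update are stubbed out (random transitions and a counter in place of a gradient step), so the snippet demonstrates only the synchronization pattern, not the full NAF learner.

```python
import threading, random, time
from collections import deque

replay_buffer = deque(maxlen=100000)   # shared replay buffer R
buffer_lock = threading.Lock()
stop_event = threading.Event()

def collector(worker_id, control_hz=20):
    """Stand-in for a robot worker: runs the current policy and pushes transitions."""
    while not stop_event.is_set():
        transition = (worker_id, random.random())   # placeholder for an (x, u, r, x') tuple
        with buffer_lock:
            replay_buffer.append(transition)
        time.sleep(1.0 / control_hz)                # collect at the control frequency

def trainer(updates_per_sec=100, minibatch=64):
    """Stand-in for the learner thread: samples minibatches and 'updates' the network."""
    steps = 0
    while not stop_event.is_set():
        with buffer_lock:
            if len(replay_buffer) >= minibatch:
                batch = random.sample(list(replay_buffer), minibatch)
                steps += 1          # a NAF/DDPG gradient step on `batch` would go here
        time.sleep(1.0 / updates_per_sec)
    print(f"trainer performed {steps} updates on {len(replay_buffer)} transitions")

threads = [threading.Thread(target=trainer)] + [
    threading.Thread(target=collector, args=(n,)) for n in range(2)  # N = 2 workers
]
for t in threads:
    t.start()
time.sleep(2.0)        # run the toy system for two seconds
stop_event.set()
for t in threads:
    t.join()
```

In the real system, each collector loop would run on its own robot at the 20 Hz control rate while the single trainer runs on a central server, matching the S = 1/5 ratio discussed above.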

Fig. 5: Asynchronous training of NAF in simulation on (a) reaching and (b) door pushing. Note that both learning speed and final policy success rates depend significantly on the number of workers.

VI. REAL-WORLD EXPERIMENTS

The real-world experiments are conducted with the 7-DoF arm shown in Figure 6. The tasks are the same as the simulation tasks in Section V-A, with some minor changes. For reaching, the same state representation and reward functions are used. The randomized target position is sampled from a cube of 0.4 m, providing more diverse and extreme targets for reaching. We noticed that these more aggressive targets, combined with stricter safety measures (slower movements and tight joint limits), reduced the performance compared to the simulation, and thus we relax the definition of a successful episode for reporting, marking episodes within 10 cm as successful. For the door task, the robot was required to reach for and pull the door open by hooking the handle with the end-effector. Due to the geometry of the workspace, we could not test the door pushing task on the real hardware. The orientation of the door was measured by a VectorNav IMU attached to the back of the door. Unlike in the simulation, we cannot automatically reposition the door for every episode, so the pose of the door was kept fixed. State features for the door task include the joint angles and their time derivatives, the end-effector position, and the quaternion reading from the IMU, totalling 21 dimensions.
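For reference, the 21-dimensional door-task state described above can be assembled as in the sketch below (7 joint angles + 7 joint velocities + 3 end-effector coordinates + 4 quaternion components = 21). The ordering and the helper signature are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

def door_task_state(joint_angles, joint_velocities, ee_position, door_quaternion):
    """Concatenate the door-opening state: 7 + 7 + 3 + 4 = 21 dimensions."""
    state = np.concatenate([joint_angles, joint_velocities, ee_position, door_quaternion])
    assert state.shape == (21,)
    return state

state = door_task_state(
    joint_angles=np.zeros(7),
    joint_velocities=np.zeros(7),
    ee_position=np.array([0.5, 0.0, 0.3]),
    door_quaternion=np.array([1.0, 0.0, 0.0, 0.0]),   # IMU reading (w, x, y, z)
)
```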
Fig. 6: Two robots learning to open doors using asynchronous NAF. The final policy learned with two workers could achieve a 100% success rate on the task across 20 consecutive trials.

Fig. 7: The 7-DoF arm random-target reaching with asynchronous NAF on real robots. Note that 1 worker suffers in both learning speed and final policy performance.

Fig. 8: Learning curves for real-world door opening. Learning with two workers significantly outperforms the single worker, and achieves a 100% success rate in under 500,000 update steps, corresponding to about 2.5 hours of real time.

A. Random Target Reaching

The simulation results in Section V provide approximate performance gains that can be expected from parallelism. However, the simulation setting does not consider several issues that could arise in real-world experiments: delays due to slow resetting procedures, non-constant execution speeds of the training thread, and subtle physics discrepancies among robots. Thus, it is important to demonstrate the benefits of parallel training with real robots.

We set up the same reaching experiment in the real world across up to four robots. Robots execute policies at 20 Hz, while the training thread simply updates the network continuously at approximately 100 Hz. The same network architecture and hyper-parameters from the simulation experiments are used.

Figure 7 confirms that 2 or 4 workers significantly improve learning speed over 1 worker, though the gains on this simple task are not substantial past 2 workers. Importantly, when the training thread is not synchronized with the data collection thread and the data collection is too slow, it may not only slow down learning but also hurt the final policy performance, as observed in the 1-worker case. Further discrepancies from the simulation may also be explained by physical differences among the robots. The learned policies are presented in the supplementary video.

B. Door Opening

The previous section describes a real-world evaluation of asynchronous NAF and demonstrates that learning can be accelerated by using multiple workers. In this section, we describe a more complex door opening task. Door opening presents a practical application of robotic learning that involves complex and discontinuous contact dynamics. Previous work has demonstrated learning of door opening policies using example demonstrations provided by a human expert [6]. In this work, we demonstrate that we can learn policies for pulling open a door from scratch using asynchronous NAF. The entire task required approximately 2.5 hours to learn with two workers learning simultaneously, and the final policy achieves a 100% success rate evaluated across 20 consecutive trials. An illustration of this task is shown in Figure 6, and the supplementary video shows different stages in the learning process, as well as the final learned policy.

Figure 8 illustrates the difference in the learning process between one and two workers, where the horizontal axis shows the number of parameter updates. 100,000 updates correspond to approximately half an hour, with some delays incurred due to periodic policy evaluation, which is only used for measuring the reward for the plot. One worker required significantly more than 4 hours to achieve a 100% success rate, while two workers achieved the same success rate in 2.5 hours. Qualitatively, the learning process goes through a set of stages as the robots learn the task, as illustrated by the learning curves in Figure 8, where the plateau near reward = 0 corresponds to placing the hook near the handle but not pulling the door open. In the first stage, the robots are unable to reach the handle, and explore the free space to determine an effective policy for reaching. Once the robots begin to contact the handle sporadically, they will occasionally pull on the handle by accident, but require additional training to be able to reach the handle consistently; this corresponds to the plateau in the learning curves. At this point, it becomes much easier for the robots to pull open the door, and a successful policy emerges. The final policy learned by the two workers was able to open the door every time, including in the presence of exploration noise.
VII. DISCUSSION AND FUTURE WORK

We presented an asynchronous deep reinforcement learning approach that can be used to learn complex robotic manipulation skills from scratch on real physical robotic manipulators. We demonstrate that our approach can learn a complex door opening task with only a few hours of training, and our simulated results demonstrate that training times decrease with more learners. Our technical contribution consists of a novel asynchronous version of the normalized advantage functions (NAF) deep reinforcement learning algorithm, as well as a number of practical extensions to enable safe and efficient deep reinforcement learning on physical systems, and our experiments confirm the benefits of nonlinear deep neural network policies over simpler shallow representations for complex robotic manipulation tasks.

While we have shown that deep off-policy reinforcement learning algorithms are capable of learning complex manipulation skills from scratch and without purpose-built representations, our method has a number of limitations. Although each of the tasks is learned from scratch, the reward function provides some amount of guidance to the learning algorithm. In the reaching task, the reward provides the distance to the target, while in the door task, it provides the distance from the gripper to the handle as well as the difference between the current and desired door pose. If the reward consists only of a binary success signal, both tasks become substantially more difficult and require considerably more exploration. However, such simple binary rewards may be substantially easier to engineer in many practical robotic learning applications. Improving exploration and learning speed in future work to enable the use of such sparse rewards would further improve the practical applicability of the class of methods explored here.

Another promising direction of future work is to investigate how the diverse experience of multiple robotic platforms can be appropriately integrated into a single policy. While we take the simplest approach of pooling all collected experience, multi-robot learning differs fundamentally from single-robot learning in the diversity of experience that multiple robots can collect. For example, in a real-world instantiation of the door opening example, each robot might attempt to open a different door, eventually allowing for generalization across door types. Properly handling such diversity might benefit from explicit exploration or even separate policies trained on each robot, with subsequent pooling based on policy distillation [37]. Exploring these extensions of our method could enable the training of highly generalizable deep neural network policies in future work.

ACKNOWLEDGEMENTS

We sincerely thank Peter Pastor, Ryan Walker, Mrinal Kalakrishnan, Ali Yahya, and Vincent Vanhoucke for their assistance and advice on robot set-ups, Gabriel Dulac-Arnold and Jon Scholz for help on parallelization, and the Google Brain, X, and DeepMind teams for their support.

REFERENCES

[1] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in International Conference on Robotics and Automation (ICRA), 2004.
[2] G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, "Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot," International Journal of Robotic Research, vol. 27, no. 2, pp. 213–228, 2008.
[3] J. Peters and S. Schaal, "Reinforcement learning of motor skills with policy gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[4] E. Theodorou, J. Buchli, and S. Schaal, "Reinforcement learning of motor skills in high dimensions," in International Conference on Robotics and Automation (ICRA), 2010.
[5] J. Peters, K. Mülling, and Y. Altün, "Relative entropy policy search," in AAAI Conference on Artificial Intelligence, 2010.
[6] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, "Learning force control policies for compliant manipulation," in International Conference on Intelligent Robots and Systems (IROS), 2011.
[7] P. Abbeel, A. Coates, M. Quigley, and A. Ng, "An application of reinforcement learning to aerobatic helicopter flight," in Advances in Neural Information Processing Systems (NIPS), 2006.
[8] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," International Journal of Robotic Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[9] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, "Learning and generalization of motor skills by learning from demonstration," in International Conference on Robotics and Automation (ICRA), 2009.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in International Conference on Learning Representations (ICLR), 2016.
[11] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in International Conference on Machine Learning (ICML), 2016.
[12] M. Deisenroth, G. Neumann, and J. Peters, "A survey on policy search for robotics," Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
[13] K. J. Hunt, D. Sbarbaro, R. Żbikowski, and P. J. Gawthrop, "Neural networks for control systems: A survey," Automatica, vol. 28, no. 6, pp. 1083–1112, Nov. 1992.
[14] M. Riedmiller, "Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method," in European Conference on Machine Learning. Springer, 2005, pp. 317–328.
[15] R. Hafner and M. Riedmiller, "Neural reinforcement learning controllers for a real robot application," in International Conference on Robotics and Automation (ICRA), 2007.
[16] M. Riedmiller, S. Lange, and A. Voigtlaender, "Autonomous reinforcement learning on raw visual input data in a real world application," in International Joint Conference on Neural Networks, 2012.
[17] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, "Evolving large-scale neural networks for vision-based reinforcement learning," in Conference on Genetic and Evolutionary Computation, ser. GECCO '13, 2013.
[18] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust region policy optimization," in International Conference on Machine Learning (ICML), 2015.
[19] R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, May 1992.
[20] M. P. Deisenroth, G. Neumann, J. Peters et al., "A survey on policy search for robotics," Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
[21] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[22] R. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems (NIPS), 1999.
[23] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, "Evolving large-scale neural networks for vision-based reinforcement learning," in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013, pp. 1061–1068.
[24] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning (ICML), 2016, pp. 1928–1937.
[26] R. Hafner and M. Riedmiller, "Reinforcement learning in feedback control," Machine Learning, vol. 84, no. 1-2, pp. 137–169, 2011.
[27] M. Inaba, S. Kagami, F. Kanehiro, and Y. Hoshino, "A platform for robotics research based on the remote-brained robot approach," International Journal of Robotics Research, vol. 19, no. 10, 2000.
[28] J. Kuffner, "Cloud-enabled humanoid robots," in IEEE-RAS International Conference on Humanoid Robotics, 2010.
[29] B. Kehoe, A. Matsukawa, S. Candido, J. Kuffner, and K. Goldberg, "Cloud-based robot grasping with the Google object recognition engine," in IEEE International Conference on Robotics and Automation, 2013.
[30] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, "A survey of research on cloud robotics and automation," IEEE Transactions on Automation Science and Engineering, vol. 12, no. 2, April 2015.
[31] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[32] J. Ba and D. Kingma, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[33] J. Kober and J. Peters, "Learning motor primitives for robotics," in International Conference on Robotics and Automation (ICRA), 2009.
[34] R. Tedrake, T. W. Zhang, and H. S. Seung, "Learning to walk in 20 minutes," in Yale Workshop on Adaptive and Learning Systems, 2005.
[35] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[36] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[37] A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, "Policy distillation," in International Conference on Learning Representations (ICLR), 2016.
