RL algorithms, however, show a great tendency to overfit to the domain in which they are trained [23], [24]. To avoid this, we make sure that the agent's navigation policy is not biased by the objects seen during training, and that the localization of such objects is not affected by the exploration strategies. To this end, we introduce a novel framework composed of two networks. The first one aims to develop exploration strategies in unknown environments, while the other one aims to locate the target object in the image. We design our framework in such a way that the two parts of the architecture are independent of each other, helping generalization, but at the same time perfectly suited to work together. We show that by completely decoupling these two components we can effectively apply domain-transfer techniques to the two networks separately and also control their complexity according to their respective tasks.

To train such an agent, we build two simple simulated environments, one for each subtask, using the photorealistic graphics engine Unreal Engine 4 (UE4, https://www.unrealengine.com). To evaluate our model performance, we analyze its ability to explore unknown environments and locate target objects in different simulated scenarios. Through our experiments, we show that with our approach it is possible to achieve surprising results even in much more complex scenarios compared to the ones used during training. Finally, we verify the generalization capability of our algorithm in a complex real setting with a real robot. The supplementary video can be found at the following link: https://www.youtube.com/watch?v=gZzP0y4AnRY.

II. RELATED WORK

Target-driven visual navigation is a relatively new task in the field of robotics research. Only recently have end-to-end systems been specifically developed to address this problem. A possible naive approach could be to use a classic map-based navigation algorithm along with an image or object recognition model, such as YOLO [9] or ResNet [10]. However, map-based methods cannot be used in unknown or dynamic environments. To cope with this restriction, they can be replaced with SLAM (Simultaneous Localization and Mapping) systems [5], [6], [7], [8], but other limitations still remain, in particular: i) SLAM systems build and use geometric maps to navigate, which are not necessary for target-driven visual navigation; ii) classic navigation algorithms separate mapping from planning, which is again not needed for this task, and can potentially compromise the robustness of the whole system; iii) additional efforts are required to combine SLAM systems and object detection modules, as they are not designed to work together.

To overcome these limits, map-less methods, which try to solve the problem of navigation and target approaching jointly, have been proposed [21], [22], [12]. These systems, like ours, do not build a geometric map of the area; instead, they implicitly acquire the minimum knowledge of the environment necessary for navigation. This is done by directly mapping visual inputs to motion, i.e. pixels to actions. In particular, given its recent successes, the DRL framework proves very promising for this purpose.

A. DRL for Robotic Applications and Visual Navigation

RL has a long history in the Artificial Intelligence research field [25]. However, it is only in recent years, with the adoption of Deep Neural Networks (DNNs), that the first great results have been achieved. In [26] and [27], human-level performance is reached and surpassed in the games of Backgammon and Chess, respectively. In [28], the authors prove that a combination of DNNs and tree search can master the game of Go, beating the world champion Lee Sedol. In 2015, [29] presents the first RL- and CNN-based algorithm able to reach human-level performance in a variety of Atari videogames, playing directly from pixels. This work encourages further research in many other visually complex tasks and videogames.

These successes in DRL also inspire several works in the robotics field. [19] proposes a DRL-based closed-loop predictive controller for soft robotic manipulators. [20] uses RL to learn policies for complex dexterous in-hand manipulation. [30] introduces an Image-Based Visual Servoing (IBVS) controller for object-following applications with multirotor aerial robots. In [31], the authors propose an algorithm, trained with 3D CAD models only, which is able to perform collision-free indoor flight in real environments. In [32], a system to control torques at the robot's motors directly from raw image observations is designed. [33] presents an A3C-based [34] algorithm for end-to-end driving in the rally game TORCS, using only RGB frames as input. They run open-loop tests on real sequences of images, demonstrating some domain adaptation capability. [35] introduces interactive replay to show how to perform zero-shot learning under real-world environmental changes. [36] proposes a new method for object approaching in indoor environments. [13] and [14] show that an agent can be successfully trained to explore complex unknown mazes to find a specific target, using only RGB frames as input. However, in their experiments, the appearance of the goals is fixed during training. As a consequence, the target is embedded in the model parameters, meaning that during the test phase the target must necessarily be the same. Other approaches [37], [38], [39] focus on visual navigation to find objects given the corresponding class labels. [37] tries to learn navigation policies by using State-of-the-Art (SotA) computer vision techniques to acquire semantic segmentation and detection masks. They show that such additional inputs are domain independent and allow joint training in simulated and real scenes, reducing the need for common sim-to-real transfer methods, like domain adaptation or domain randomization. However, we argue that segmentation and detection masks are less beneficial in maze-like environments, where floor and walls are the predominant elements, which, instead, our method addresses. In [38], the authors introduce a new deep learning based approach for visual semantic planning. They use both RL and Imitation Learning (IL) to train a deep predictive model based on successor representations, demonstrating good cross-task knowledge transfer results.
Fig. 2. The overall system architecture. We can see the object localization and the navigation networks with the yellow and blue background, respectively. The first network takes two 224 × 224 RGB images as input: the goal image and the current frame. All the weights are shared between the two branches (red dotted square). The network outputs a six-element one-hot vector encoding the relative target position in the current frame. This vector is then fed into the navigation network together with the current frame, resized to 84 × 84. The network finally produces the policy and the value function.
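To make the data flow of Fig. 2 concrete, the following is a minimal PyTorch-style sketch of the two-network pipeline. Only the input/output sizes stated in the caption (224 × 224 goal and frame, a six-element position vector, an 84 × 84 frame, three actions) and the ResNet-50 backbone are taken from the paper; the specific layer configurations, the absence of recurrent layers, and all module names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ObjectLocalizationNet(nn.Module):
    """Goal image + current frame (224x224) -> 6-way relative-position vector (Fig. 2).
    Layer sizes beyond the stated inputs/outputs are illustrative."""
    def __init__(self):
        super().__init__()
        # single backbone reused for both inputs = weight sharing between the two branches
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])
        self.compare = nn.Sequential(
            nn.Conv2d(2 * 2048, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 6))                      # 5 positions + "target not visible"

    def forward(self, goal, frame):
        feats = torch.cat([self.backbone(goal), self.backbone(frame)], dim=1)
        return self.compare(feats).softmax(dim=-1)

class NavigationNet(nn.Module):
    """84x84 frame + 6-way target vector -> policy logits and value (Fig. 2)."""
    def __init__(self, n_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        feat_dim = 32 * 9 * 9                       # spatial size of an 84x84 input after the two convs
        self.policy = nn.Linear(feat_dim + 6, n_actions)
        self.value = nn.Linear(feat_dim + 6, 1)

    def forward(self, frame, target_vec):
        h = torch.cat([self.conv(frame), target_vec], dim=1)
        return self.policy(h), self.value(h)
```

Keeping the two modules as separate classes mirrors the decoupling argued for above: either network can be retrained or replaced without touching the other.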
Although bootstrapping RL with IL proves to be an effective way to increase sample efficiency in many other applications, one of the main purposes of our work is to study how artificial agents learn complex navigation policies in complete autonomy, i.e. without human influence. [39] addresses generalization in visual semantic navigation by managing prior knowledge with Graph Convolutional Networks [40]. While they prove that, in their settings, prior knowledge improves generalization to unseen scenes and targets, it has limited applicability in maze-like environments and in target-driven visual navigation, where target objects are not specified semantically but with images.

To the best of our knowledge, the only two works which tackle the task of target-driven visual navigation, by using images to specify goals and employing DRL only (i.e., without IL), are [21] and [22]. [21] is the first work that addresses this problem. The authors propose a novel algorithm whose policy is a function of the goal as well as the current state. In this way, it is possible to specify a new arbitrary target without the need to re-train or fine-tune the model. However, the images they use to specify the goal are scenes taken from the same area where the agent is placed. This means that, with this approach, a robot cannot navigate in an environment of which we have no image.

One further step towards generalization is taken by [22], which introduces a framework that integrates a deep neural network based object recognition module. With this module, the agent can identify the target object regardless of where the photo of that object is taken. However, it is still trained or fine-tuned in the same environments where it is tested. Therefore, it is still not able to generalize to unseen scenarios.

B. Contribution and Overview

In all the aforementioned works [21], [22], it is always necessary to re-train or at least fine-tune the model for new environments and objects. In real scenarios, this is not only an inefficient approach, due to the high cost of producing samples with physical robots, but it can also be dangerous.

To avoid that, we design a novel architecture composed of two main DNNs: the first, the navigation network, with the goal of exploring the environment and approaching the target; the second, the object localization network, with the aim of recognising the specified target in the robot's view. They are exclusively trained in simulation, and not a single real image is used. Finally, we show that our algorithm directly transfers to new unknown environments, even much larger than the ones used during training, and, most importantly, also to real ones with real targets.

This work proceeds as follows: Section III describes the DRL framework used and the proposed two-network architecture; Section IV focuses on the training procedure; Section V presents the environment setup used to train our algorithm; Section VI shows the experiments and the results; finally, Section VII draws the conclusions and the path of future work.

III. FRAMEWORK

In the following, the proposed framework is presented. In particular, this section is divided into two parts:
1) Problem Formulation: in which the problem is presented, and our design choices are described;
2) Network Architecture: where our two-network architecture is detailed.

A. Problem Formulation

The target-driven visual navigation problem consists in finding the shortest sequence of actions to reach a specified target, using only visual inputs. Our goal is to design an RL agent able to find that sequence directly from pixels.

We consider the standard RL setting where the agent interacts with the environment over a number of discrete time steps. The environment can be seen as a Partially Observable Markov Decision Process (POMDP) in which the main task of the agent is to find a policy π that maximises the expected sum of future discounted rewards:

$$V_\pi(x) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad (1)$$

where γ ∈ [0, 1) is the discount factor, r_t = r(x_t, a_t) is the reward at time t, x_t is the state at time t and a_t ∼ π(·|x_t) is the action generated by following some policy π. The MDP is clearly partially observable, in our case, because at every step the agent has no access to the true state x_t of the environment, but only to an observation o_t of it.
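As a small worked example of the objective in (1), the sketch below accumulates the discounted return of a single episode; the value of γ is illustrative and not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one episode (Eq. 1)."""
    g = 0.0
    for r in reversed(rewards):   # backward accumulation avoids explicit powers of gamma
        g = r + gamma * g
    return g

# Example with a sparse reward: a single +1 obtained at step 3.
assert abs(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.99) - 0.99**3) < 1e-12
```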
The observation o_t is composed of the current RGB frame from the agent's point of view and the image of the target to be reached. These two inputs are both fed into the architecture, which consists of two different networks. The first, i.e. the
Fig. 4. The two networks during the training phase. (a) The object localization network takes three pictures as input. The features extracted by the ResNet and the first three convolutional layers are used to compute the triplet loss (ℓt) in (2). The features are then further processed by the last two convolutional layers and the fully connected one to produce the two probability vectors. (b) The navigation network takes the current frame and the one-hot vector from the engine as input. The features extracted by the two convolutional layers are used for depth estimation, and concatenated with the vector to produce the policy and the value function.
where m is a positive constant which controls the margin, g represents the features extracted from the goal image, and f+ and f− the features extracted from the picture where the target is visible and where it is not, respectively. This loss allows the model to develop a concept of similarity. The three descriptors are then concatenated together two by two:

$$r_1 = [f^{+}, g], \qquad r_2 = [f^{-}, g].$$

Finally, r1 and r2 are separately processed by the last two convolutional layers and the fully connected one, producing the two probability vectors, p1 and p2.

The loss we use is a weighted Cross Entropy (CE) loss, adapted to our specific task. In particular, in the classical CE loss formulation, every classification error weighs the same. On the contrary, we argue that for our specific task, the value of the loss should increase proportionally to the distance between the estimated and the true object location. For that reason, we modify the CE formulation for our localization loss:

$$\ell_l(p^{*}) = \begin{cases} -d\,\log(p^{*}_{n}) & 1 \le n, k \le 5 \\ -\log(p^{*}_{n}) & n = 0 \vee k = 0 \end{cases}, \qquad (3)$$

where d = |n − k|, p*_n is the n-th element of the probability vector (either p1 or p2), n corresponds to the true location of the target object, and k corresponds to the most likely position according to the network, i.e.:

$$k = \arg\max_{i} p^{*}_{i}.$$

So, the overall loss for a single triplet is given by the sum of the components in (2) and (3):

$$\ell_{obj} = \ell_t + \ell_l(p_1) + \ell_l(p_2).$$
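A minimal sketch of the distance-weighted localization loss in (3) is given below. It assumes index 0 encodes "target not visible" (consistent with the n = 0 case) and indices 1–5 the discretised positions; tensor shapes and the use of raw logits are illustrative, and the triplet term ℓt of (2) would be added on top to form ℓobj.

```python
import torch
import torch.nn.functional as F

def localization_loss(logits, true_pos):
    """Distance-weighted cross entropy of Eq. (3).
    logits: (B, 6) raw scores; index 0 = target not visible, 1..5 = positions.
    true_pos: (B,) int64 ground-truth indices n."""
    log_p = F.log_softmax(logits, dim=-1)
    k = logits.argmax(dim=-1)                          # most likely position, as in Eq. (3)
    d = (true_pos - k).abs().float()                   # d = |n - k|
    # weight is 1 when either n or k is the "not visible" class, else the distance d
    w = torch.where((true_pos == 0) | (k == 0), torch.ones_like(d), d)
    nll = -log_p.gather(1, true_pos.unsqueeze(1)).squeeze(1)
    return (w * nll).mean()
```

Note that a correct visible prediction (n = k, d = 0) contributes no loss, which is exactly the behaviour the weighting in (3) prescribes.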
B. Navigation Training Phase

The navigation network is trained via RL using IMPALA. It requires the use of two main entities: the actor, which runs on the CPU, and the learner, which works on the GPU. Both actor and learner share the same network parameters. The task of the actor is to collect trajectories of experience through interaction with the environment, while the learner processes them to update the network. In our specific implementation, we use 16 actors and one learner.

1) Actor: During the learning phase, the navigation network is completely separated from the object detection network (see Fig. 4b). Each IMPALA actor is placed in a different maze (see Section V-A for details), and for every action a_t, it receives from the environment a reward r_t and a new observation o_t. r_t is always 0, except when the agent reaches the goal, in which case it receives +1. o_t is composed of the target position and the current RGB frame, which, during learning, are both generated by the game engine itself. Once an actor finishes performing a predefined number of steps, it sends its trajectory to the learner, which processes it to update the network.
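A schematic actor loop under the reward scheme just described is sketched below; `env`, its methods and the unroll length are hypothetical placeholders for the UE4 interface, not the authors' API.

```python
def run_actor(env, policy, unroll_length=100):
    """Collect one trajectory with the sparse reward described above:
    r_t = 0 at every step, +1 only when the agent reaches the goal."""
    trajectory = []
    obs = env.reset()                      # obs: (target_one_hot, rgb_frame), both from the engine
    for _ in range(unroll_length):
        action, logit = policy.act(obs)    # behaviour policy of the actor's current weights
        next_obs, reached_goal = env.step(action)
        reward = 1.0 if reached_goal else 0.0
        trajectory.append((obs, action, reward, logit))
        obs = env.reset() if reached_goal else next_obs
    return trajectory                      # sent to the learner for the update
```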
2) Learner: The learner is responsible for loss computation and parameter updates. The first loss relates to depth estimation. According to [14], if this loss shares model weights with the policy, which is true for our architecture, this auxiliary task can be used to speed up learning and to achieve higher performance. Since only the ceiling and floor are located in the lower and upper areas, we compute the depth loss only on the central 80 × 40 pixels of the frame. The loss we use for that is a pixel-wise mean squared error between the predicted depth d_p and the one provided by the engine d_e:

$$l_d = \frac{1}{80 \times 40} \sum \left(d_p - d_e\right)^2. \qquad (4)$$
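A sketch of the auxiliary depth loss in (4); it assumes the predicted and engine depth maps are larger than the 80 × 40 window and simply crops both centrally before the pixel-wise MSE.

```python
import torch.nn.functional as F

def central_crop(depth, height=40, width=80):
    """Keep only the central width x height window of a (B, H, W) depth map."""
    _, H, W = depth.shape
    top, left = (H - height) // 2, (W - width) // 2
    return depth[:, top:top + height, left:left + width]

def depth_loss(pred_depth, engine_depth):
    """Pixel-wise MSE of Eq. (4), averaged over the central 80 x 40 pixels."""
    return F.mse_loss(central_crop(pred_depth), central_crop(engine_depth))
```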
The features extracted by the convolutional layers are concatenated with the one-hot input vector provided by UE4 (Fig. 3b), and processed by the network, which finally outputs the policy and the estimated value, as described in Section III-B2.

To speed up learning, we use an experience replay [48], which is a buffer of trajectories shared among actors. Therefore, at every update, 2 trajectories are randomly picked from the experience replay, batched together with the actor's current one, and processed in parallel in a single pass.

All the trajectories are used to compute (4) and the following three different losses. Before describing them, it is important to clarify that the value IMPALA assigns to a state is different from the one computed with the classical Bellman equation (see Eq. 1). Due to the lag between the time when trajectories are generated by the actors and when the learner
However, as we show in Section VI-B, this increases training inefficiency without improving the performance. As detailed in the following sections, the reasons behind those results can be mainly ascribed to the high complexity of their architectures, which do not decouple the navigation and the target recognition tasks. On the contrary, our approach relies upon two totally separable components (i.e., the navigation and the target recognition networks). Therefore, the navigation network can be much smaller, considerably reducing the training time and increasing its effectiveness. It is also important to emphasize that this in no way compromises the accuracy in locating objects, as the object recognition network can be arbitrarily complex.

The possible actions our actors can perform are three: "turn right", "move forward" and "turn left". To simulate the uncertainty of a real robot's motion, we inject uniform noise into the speed and angle of our actors' movements.

As the training of the two networks takes place separately, the input vector of the navigation network is not the one produced by the object localization network but is generated by UE4 itself. In order to make the navigation network more robust against possible classification errors of the localization network, we ensure that the one-hot vectors generated by UE4 have a 10% probability of being wrong (in that case, the class is picked uniformly from all classes).
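A minimal sketch of the label-noise injection described above; the 10% corruption rate comes from the text, while the number of classes and the helper name are illustrative.

```python
import random

def noisy_target_vector(true_class, num_classes=6, p_wrong=0.10):
    """Return a one-hot target vector that is wrong with probability p_wrong,
    in which case the class is drawn uniformly from all classes."""
    cls = random.randrange(num_classes) if random.random() < p_wrong else true_class
    vec = [0] * num_classes
    vec[cls] = 1
    return vec
```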
B. Capture-Level

Contrary to the navigation network, which is trained via reinforcement learning, the object localization network is trained in a supervised manner. We collect a dataset of synthetic images from the UE4 game engine in the following way. We place a camera in a fixed position and spawn randomly generated objects in its field of view. Every time the camera takes a picture, the current objects are replaced with randomly generated ones. Like for the Maze-level, we follow the approach in [49] to achieve good generalization capabilities across domains.

For each image, we get the relative position of each object with respect to the camera from the engine. We discretise the position into five classes: "extreme right", "right", "center", "left" and "extreme left". Then, we download from the web pictures depicting goal objects and associate each of them with two images. Each entry of the dataset is then composed of: an image of the target, a capture where the target is not present, a capture where it is present, and its relative position in that capture.

The dataset counts 630,000 samples of 9 classes of objects, which are: "Chair", "Monitor", "Trash", "Microwave", "Bottle", "Ball", "Lamp", "Plant" and "Jar". In the simulator, for each object, we use 4 to 10 different meshes.
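One possible way to represent a Capture-level entry as just described is sketched below; the field names, file paths and the ordering of the position labels are illustrative, with index 0 reserved for the "target not visible" case used by the localization loss.

```python
from dataclasses import dataclass

POSITIONS = ["not visible", "extreme left", "left", "center", "right", "extreme right"]

@dataclass
class CaptureSample:
    """One dataset entry: a web image of the goal object, a synthetic capture
    without the target, a capture containing it, and its discretised position."""
    goal_image_path: str
    negative_capture_path: str
    positive_capture_path: str
    target_position: int          # index into POSITIONS, 1..5 for visible targets

# Hypothetical example entry
sample = CaptureSample("web/chair_042.jpg", "captures/no_chair.png",
                       "captures/with_chair.png", POSITIONS.index("center"))
```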
VI. EXPERIMENTS AND RESULTS

In the experiments, we measure the performance of our target-driven visual navigation system in unseen environments. In particular, we analyze our agent's ability to: i) explore the surrounding environment and ii) reach the designated target. To do that, we design three kinds of tests, which are described in the next sections. Furthermore, we propose an ablation study to examine the benefit of the auxiliary depth estimation loss on the agent performance. Finally, to verify the generalization capability of our algorithm, we use it in a complex real setting with a real robot.

A. Training Details

We train the Navigation Network with 16 actors for 70 million steps using SGD without momentum, with learning rate 0.0005 and batch size 8. The specific parameter settings of IMPALA and the Maze-level that are used throughout our experiments are given in detail in Table I and Table II, respectively.

The Object Localization Network is trained with the first 540,000 samples of our synthetic dataset for 50 epochs with learning rate 0.0025, batch size 128, and margin constant m equal to 0.1. We use another 90,000 samples to implement early stopping and select the best model parameters. All the details on the parameters used for the Capture-level are reported in Table III.

The source code is publicly available and can be found at https://github.com/isarlab-department-engineering/DRL4TargetDrivenVN.
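For reference, the training settings reported above can be grouped into a single configuration sketch; this only restates the stated values and is not the authors' actual configuration file.

```python
TRAINING_CONFIG = {
    "navigation_network": {
        "actors": 16,
        "total_steps": 70_000_000,
        "optimizer": "SGD (no momentum)",
        "learning_rate": 0.0005,
        "batch_size": 8,
    },
    "object_localization_network": {
        "train_samples": 540_000,
        "early_stopping_samples": 90_000,
        "epochs": 50,
        "learning_rate": 0.0025,
        "batch_size": 128,
        "triplet_margin": 0.1,
    },
}
```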
Fig. 8. Two images of corners (left) and two of dead ends (right), with their corresponding value. The function clearly presents high peaks when the agent turns corners, while it drops sharply in correspondence of dead ends.
TABLE IV
AGENT EXPLORATION PERFORMANCE WITH DIFFERENT LIGHT INTENSITY LEVELS AND MAZE CONFIGURATIONS
TABLE V
AGENT TARGET-DRIVEN PERFORMANCE IN THE 5 × 5 MAZE BY TARGET OBJECT CLASS. TIME, SUCCESS RATE AND SPL FOR TDVN AND AOP ARE PROVIDED FOR BOTH VERSIONS, WITH AND WITHOUT DR.
Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time − − −
RA Success Rate 0% 0% 0%
SPL 0 0 0
Time −-− −-− −-−
TDNM [21] Success Rate w DR - w/o DR 0% - 0% 0% - 0% 0% - 0%
SPL 0-0 0-0 0-0
Time −-− −-− −-−
AOP [22] Success Rate w DR - w/o DR 0% - 0% 0% - 0% 0% - 0%
SPL 0-0 0-0 0-0
Time 28s 30s 31s − 34s 27s 28s 23s 26s − 48s 38s 31s
Ours Success Rate 83% 33% 33% 0% 100% 50% 100% 100% 100% 0% 50% 67% 72%
SPL 0.3 0.11 0.11 0 0.29 0.19 0.36 0.43 0.38 0 0.1 0.17 0.24
Time 17s 17s 17s
Human Expert Success Rate 100% 100% 100%
SPL 0.59 0.59 0.59
shows that our agent learns a navigation policy (i.e., the wall-follower strategy) that proves to be far superior and more efficient than a random explorer, as expected.

In Fig. 10, very large score ranges can be seen at every light intensity level. The agent's performance oscillates considerably, and with some particular wall textures it reaches rather low levels. However, in the same figure, we can also see that in some other settings our agent can get close to human performance, which is impressive, considering that it is trained in very small 3 × 3 random mazes only.

2) Target-driven Experiment (5 × 5): In this second experiment, we place our agent in a 5 × 5 maze, which ends in a room with 3 different objects, including the target. This experiment is naturally divided into two phases for the agent: it has first to explore the maze to find the room, then it has to distinguish the target from the other objects, locate it, and approach it. An episode ends when the agent reaches the target (i.e., when they collide) or when 90 seconds have elapsed. In Fig. 12, an example of our agent's trajectory is shown; in that case the goal is the red chair.

We try with all the 9 different objects used to train the object localization network, together with 3 other previously unseen object classes ("Can", "Extinguisher" and "Boot"), averaged over 6 runs each. To measure the agent's performance, we use 3 metrics: the time (in seconds) needed to reach the goal, the success rate (in percentage), and the Success weighted by (normalized inverse) Path Length (SPL), introduced in [52]. The SPL is calculated as follows:

$$SPL = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{\ell_i}{\max(p_i, \ell_i)},$$

where N is the number of test episodes, S_i a binary indicator of success in episode i, ℓ_i the shortest-path distance from the agent's starting point to the target in episode i, and p_i the length of the path actually taken by the agent in episode i.
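A direct transcription of the SPL metric above, where S_i is the binary success indicator of episode i as defined in [52]:

```python
def spl(successes, shortest_paths, taken_paths):
    """Success weighted by (normalized inverse) Path Length.
    successes: 0/1 flags S_i; shortest_paths: l_i; taken_paths: p_i."""
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest_paths, taken_paths)) / n

# Example: two episodes, one success along a path twice the shortest length.
print(spl([1, 0], [4.0, 5.0], [8.0, 5.0]))   # -> 0.25
```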
The results of all the models are summarised in Table V, in comparison to human and RA performance. As shown by the results, neither the TDNM [21] nor the AOP [22] are able to generalize in the test environment. Since the test environment is much harder than the training scenarios, an adequate exploration policy is essential to complete the task. Without such a policy, it is extremely unlikely to ever reach the room with the objects, as confirmed by the RA results. We argue that the SotA baselines are not able to effectively explore the mazes, even if trained with domain randomization, for the following reasons. First, both baselines are originally conceived to be quickly fine-tuned on previously unseen environments, thanks to the scene-specific layers they implement. Therefore, they are not specifically designed to directly generalize to new scenarios, contrary to our approach. Secondly, TDNM and AOP rely upon a single complex architecture to simultaneously address exploration and target recognition. Hence, their optimization is more difficult and it is harder to achieve navigation capabilities that generalize over test scenarios that are wider and more complex than those used for training. On the contrary, our approach decouples the two sub-tasks and favors the generalization of exploration policies over other environments.

In addition, as shown in Fig. 11, our approach is also more efficient. The plot represents the average episode reward obtained by our method and the other SotA baselines (with domain randomization) during training, as a function of time. In particular, our model can complete its 70 million steps of training in about 156 hours, while TDVN and AOP require more than 380 hours to perform roughly 40 million steps (we decided not to continue further since the curves do not show any signs of improvement for several million steps). Hence, our approach is considerably faster than [21] and [22], and this further suggests that those approaches are not particularly suited for direct sim-to-real transfer.

On the contrary, as can be observed in Table V, our model produces reasonable results, successfully reaching the targets most of the time. There are particular object meshes, however, for which the agent is not able to complete the task before the end of the episode. In those cases, the agent finds the room but is not able to approach the target. Most of the time, this is due to the uncertainty of the object localization network, which is unable to locate some object meshes continuously and consistently over time.
TABLE VI
AGENT TARGET-DRIVEN PERFORMANCE IN THE 20 × 20 MAZE BY TARGET OBJECT CLASS
Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time 228s 201s 127s − 200s 56s 120s 44s 67s − 276s − 147s
Ours Success Rate 17% 33% 33% 0% 83% 50% 67% 83% 67% 0% 33% 0% 52%
SPL 0.02 0.05 0.08 0 0.12 0.27 0.17 0.57 0.3 0 0.04 0 0.18
Time 66s 66s 66s
Human Expert Success Rate 86% 86% 86%
SPL 0.40 0.40 0.40
TABLE VII
DEPTH LOSS ABLATION STUDY
Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time 56s 67s 77s 76s 65s 84s 76s 58s 71s − 63s 63s 69s
Ours (No depth loss) Success Rate 83% 50% 50% 17% 17% 33% 50% 100% 83% 0% 50% 83% 56%
SPL 0.15 0.07 0.06 0.02 0.03 0.04 0.07 0.17 0.12 0 0.08 0.13 0.09
Time 28s 30s 31s − 34s 27s 28s 23s 26s − 48s 38s 31s
Ours Success Rate 83% 33% 33% 0% 100% 50% 100% 100% 100% 0% 50% 67% 72%
SPL 0.3 0.11 0.11 0 0.29 0.19 0.36 0.43 0.38 0 0.1 0.17 0.24
Fig. 13. Three examples of depth images produced by our agent. Although depth estimation is only an auxiliary loss, and hence the images produced are
not accurate, it is still possible to identify the right wall.
Since our object localization network is trained by using a similarity metric, we expect a certain ability to generalize to previously unseen object classes. From Table V (see "Can", "Extinguisher" and "Boot"), it can be seen that the agent successfully reaches 2 ("Extinguisher" and "Boot") of the 3 new object classes. In particular, the poor results with the "Can" are due to frequent false positives of the object localization network that, by misleading the navigation component, prevent the agent from exploring the environment properly.

3) Target-driven Experiment (20 × 20): Differently from the previous experiment, in this case the test maze is a much larger 20 × 20 maze and the time limit is increased to 300 seconds. The purpose is to assess the ability of the object localization network and the navigation network to collaborate in situations where the distance to be covered is much longer. Considering the poor performance of the two baselines in the 5 × 5 maze, in Table VI the agent results are reported in comparison with human performance only. As expected, it can be noticed that the time needed to reach the target increases significantly, while both the success rate and the SPL considerably decrease. In particular, the score highly depends on the first turns the agent chooses to take, since the maze is so large that it is rather difficult to get back on the right path within the time limit. This consideration is also valid for humans who, as can be seen, sometimes fail to complete the task. The size of the maze also implies a rather low SPL, for both agent and human. In fact, while the shortest path from the starting position to the target is not particularly long, the actual distance covered by the agent/human can be extremely large.

4) Ablation Study on Depth Estimation: In this paragraph, we aim to evaluate the effects of the auxiliary depth estimation loss, proposed in [14], on the general performance of the agent. To this end, we train a second version of our model (using the same training protocol as in the other experiments) without the auxiliary depth estimation loss. In Table VII, the comparison results between the two models in the test 5 × 5 maze are shown. As can be seen, the agent trained with the depth loss shows better performance in terms of success rate and, on average (last column of Table VII), appears to be more than twice as fast in completing the task. By analyzing the navigation trajectories computed by the agent trained without the depth loss, we notice that it does not seem to follow any specific strategy and the exploration appears much more random.

Therefore, the results suggest that depth estimation could be helpful to develop a robust navigation policy. In fact, Fig. 13 shows some depth images produced by the deconvolutional network of the model. Since the main task is navigation and our aim is to teach it only the basic concept of depth, the images produced are poorly accurate. Nevertheless, it can be clearly seen that it is able to distinguish the right wall from the rest of the image, and we suppose that this may have encouraged the development of the wall-following policy.
TABLE VIII
REAL EXPERIMENTS RESULTS: TYPE I

TABLE IX
REAL EXPERIMENTS RESULTS: TYPE II

TABLE X
REAL EXPERIMENTS RESULTS: TYPE III
Fig. 16. (a) The same maze in two different indoor orientations: 0° (left) and 180° (right). (b) Example of an outdoor maze. In this setting, the background is completely different. In particular, it is characterised by the presence of numerous pedestrians, dynamic elements that the agent never sees during training.
two rows of Table VIII, it is evident that the algorithm prefers the first orientation. This shows that, although it can navigate, it is sensitive to the environment surrounding the maze.

In the third experiment, we consider one maze configuration and 5 targets to reach: "Monitor", "Trash", "Microwave", "Bottle" and "Lamp". We make 4 different configurations (C1-C4), with 3 objects each (Fig. 15). For each of them, we run an episode, hence 12 in total.

From the results, reported in Table X, we can say that our agent is able to recognise and reach the objects 75% of the time. It is important to note that our model has never seen any real object, not even the ones we use in the experiments. Interestingly, it is able to approach the "Microwave" every time it is specified as a goal, contrary to what happens in simulation, where it always fails. In this regard, we believe that the use of a pre-trained ResNet-50 plays an important role.

2) Outdoor Experiments: We repeat the type 1 and type 2 tests to measure our agent's performance also in outdoor mazes (Fig. 16b). The bottom row of Fig. 14 shows the configurations used. From the results, summarized in Table VIII, a slight degradation in performance can be seen with regard to the first type of test. The exploration capabilities of the agent, on the other hand, remain practically unchanged, as reported in Table IX. It appears that, despite the great difference in lighting and background between indoor and outdoor settings, the algorithm's performance is consistent.

VII. CONCLUSIONS

In this work, we introduced a new framework to address the challenging task of target-driven visual navigation. Through extensive experimentation, in both simulated and real mazes, we showed that a direct sim-to-real transfer for this problem is possible. The proposed two-network architecture, indeed, not only proved capable of reaching previously unseen objects in much larger mazes than those in which it was trained, but also showed a good ability to generalize in real scenarios.

However, although the results are encouraging, there are still a number of open problems and many aspects to improve. In particular, the navigation performance was fluctuating, and for some combinations of light and textures the agent was not able to achieve satisfactory results. In this regard, we believe that the use of techniques designed to discover and analyze the weaknesses of DRL agents, such as [53] and [54], can be an effective way to prevent particularly bad behaviours and improve overall performance.

Furthermore, as reported at the end of Section VI-B, we observed temporal inconsistency in locating some instances of objects. We attribute this issue to the way the object localization module was trained. We think that standard supervised learning on uncorrelated images is not suitable for this task, and we leave for future work the development of a different learning strategy that takes into account the temporal dependency between frames.

APPENDIX

V-trace is an off-policy actor-critic algorithm, introduced in [44]. Off-policy algorithms are applied when there is the need to learn the value function V^π of a policy π (target policy) from trajectories generated by another policy µ (behaviour policy). In the case of IMPALA, although both the actors and the learners are initialized with the same neural network and the same weights, the models differ from each other after the first steps of the computation. In fact, due to the asynchronous nature of the framework, actors can lag behind each other and the learner even by numerous updates. For this reason, an off-policy algorithm is required.

It is important to make clear that the behaviour policies are followed by the actors only, while the target policy is parameterized by the learner model. For this reason, all the following V-trace computational steps are performed by the learner only.

First of all, consider the trajectory (x_t, a_t, r_t)_{t=s}^{t=s+n} collected by an actor with its policy µ. The n-steps V-trace target for the value approximation V(x_s) at state x_s is then defined by:

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V, \qquad (8)$$

where:

$$\delta_t V = \rho_t \left(r_t + \gamma V(x_{t+1}) - V(x_t)\right). \qquad (9)$$

$$\rho_t = \min\!\left(\bar{\rho},\ \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right) \quad \text{and} \quad c_i = \min\!\left(\bar{c},\ \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\right)$$

are truncated importance sampling (IS) weights, and we assume that ρ̄ ≥ c̄.
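A minimal NumPy sketch of the n-steps V-trace target in (8)–(9) for a single trajectory is given below; it follows the truncated-IS formulation above, with γ and the array-based interface chosen for illustration rather than taken from the authors' implementation.

```python
import numpy as np

def vtrace_targets(rewards, values, next_values, rho, c, gamma=0.99):
    """Compute v_s for every state of one trajectory (Eqs. 8-9).
    rho, c: truncated IS weights min(rho_bar, pi/mu) and min(c_bar, pi/mu)."""
    rewards, values, next_values, rho, c = map(
        np.asarray, (rewards, values, next_values, rho, c))
    delta = rho * (rewards + gamma * next_values - values)   # Eq. (9)
    v = values.astype(float).copy()
    acc = 0.0
    for t in reversed(range(len(rewards))):                   # backward recursion of Eq. (8)
        acc = delta[t] + gamma * c[t] * acc
        v[t] = values[t] + acc
    return v
```

When π = µ and c̄ ≥ 1 the weights collapse to 1 and this function returns the on-policy n-steps target discussed next.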
The use of the V-trace target is necessary because of the lag between the actors and the learner, which makes the learning
off-policy. For this reason, the Bellman equation cannot be applied as it is, but needs to be adapted. V-trace uses the truncated IS weights to correct the estimation error caused by the discrepancy between the policies. However, it should be noticed that the V-trace target is just a generalization of the on-policy n-steps Bellman target. In fact, when π = µ and c̄ ≥ 1, every c_i = 1 and ρ_t = 1, therefore (8) can be simplified:

$$v_s = \sum_{t=s}^{s+n-1} \gamma^{t-s} r_t + \gamma^{n} V(x_{s+n}).$$

The main difference is the presence of the truncated IS weights ρ_t and c_i. ρ_t determines the fixed point of the update rule, since it appears in the definition of δ_t V (Eq. 9). The fixed point is the value function V^{π_ρ̄} of some policy π_ρ̄, defined by:

$$\pi_{\bar{\rho}}(a|x) = \frac{\min\left(\bar{\rho}\,\mu(a|x),\ \pi(a|x)\right)}{\sum_{b \in A} \min\left(\bar{\rho}\,\mu(b|x),\ \pi(b|x)\right)}.$$

This means that for ρ̄ = ∞ this is the value function V^π of the target policy π. Conversely, choosing ρ̄ = 0, the fixed point becomes the value function V^µ of the behavior policy µ. In general, for non-zero finite ρ̄ we obtain the value function V^{π_ρ̄} of a policy π_ρ̄ which lies between the behavior and the target policy. Hence, the larger ρ̄, the smaller the bias in the estimation, but the greater the variance.

The product of the coefficients c_i (Eq. 8) determines the importance of a temporal difference δ_t V observed at time t on the update of the value function at a previous time s. The more different the two policies are, the higher the variance of this product. To avoid that, we clip the coefficients at c̄; in this way, we reduce the variance without affecting the solution to which we converge. In summary, ρ_t controls the value function we converge to, and c_i the convergence speed to this function.

With (8), we can describe the three parameter updates. Recalling that IMPALA is an actor-critic algorithm, it employs two entities, the actor (which produces the policy) and the critic (which calculates the value), to compute the learning updates. Considering the parametric representations of the value function V_θ and the target policy π_ω (notice that θ and ω can be shared, as in our implementation), at training time s, the parameters θ are updated in the direction of:

$$\left(v_s - V_\theta(x_s)\right) \nabla_\theta V_\theta(x_s),$$

to fit the V-trace target. This loss is needed to train the critic to judge the behavior of the actor.

The ω are updated along the direction of the policy gradient:

$$\rho_s \nabla_\omega \log \pi_\omega(a_s|x_s) \left(r_s + \gamma v_{s+1} - V_\theta(x_s)\right).$$

This second loss refers to the actor, which adjusts the policy according to the critic's evaluation. The critic's contribution to the policy update rule is described by the second multiplicative term in the equation.

Finally, in order to avoid premature convergence, we can add a third component, in the direction of:

$$-\nabla_\omega \sum_{a} \pi_\omega(a|x_s) \log \pi_\omega(a|x_s),$$

that favors entropy in action selection. This loss can be crucial for a successful training, because by pushing the probabilities of actions to be similar, it induces the agent to keep on exploring the MDP.

For further details, read [44].
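The three update directions above can be folded into a single surrogate loss, as is common in IMPALA-style implementations; the sketch below does so, with the loss coefficients and tensor layout being illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def learner_loss(logits, actions, values, vtrace_v, vtrace_v_next, rho_s, rewards,
                 gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Critic, actor and entropy terms corresponding to the three gradient
    directions of the appendix. All tensors are per-time-step for one batch."""
    # critic term: descending 0.5*(v_s - V)^2 moves theta along (v_s - V) grad V
    value_loss = 0.5 * (vtrace_v.detach() - values).pow(2).mean()
    # actor term: rho_s * grad log pi(a_s|x_s) * (r_s + gamma * v_{s+1} - V(x_s))
    log_pi = F.log_softmax(logits, dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    adv = (rho_s * (rewards + gamma * vtrace_v_next - values)).detach()
    policy_loss = -(log_pi_a * adv).mean()
    # entropy bonus to keep the policy exploratory
    entropy = -(log_pi.exp() * log_pi).sum(dim=-1).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```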
REFERENCES

[1] J. Borenstein and Y. Koren, "Real-time obstacle avoidance for fast mobile robots," IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1179–1187, 1989.
[2] ——, "The vector field histogram-fast obstacle avoidance for mobile robots," IEEE Transactions on Robotics and Automation, vol. 7, no. 3, pp. 278–288, 1991.
[3] D. Kim and R. Nevatia, "Symbolic navigation with a generic map," Autonomous Robots, vol. 6, no. 1, pp. 69–88, 1999.
[4] G. Oriolo, M. Vendittelli, and G. Ulivi, "On-line map building and navigation for autonomous mobile robots," in Proceedings of 1995 IEEE International Conference on Robotics and Automation, vol. 3. IEEE, 1995, pp. 2900–2906.
[5] R. Mur-Artal and J. D. Tardós, "Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[6] M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (slam) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, 2001.
[7] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: a survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 55–81, 2015.
[8] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, "Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[12] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, "Cognitive mapping and planning for visual navigation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2616–2625.
[13] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, "Reinforcement learning with unsupervised auxiliary tasks," arXiv preprint arXiv:1611.05397, 2016.
[14] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., "Learning to navigate in complex environments," arXiv preprint arXiv:1611.03673, 2016.
[15] A. Devo, G. Costante, and P. Valigi, "Deep reinforcement learning for instruction following visual navigation in 3d maze-like environments," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1175–1182, 2020.
[16] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[17] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
[18] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., "Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation," arXiv preprint arXiv:1806.10293, 2018.
[19] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, "Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators," IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124–134, 2018.
[20] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., "Learning dexterous in-hand manipulation," arXiv preprint arXiv:1808.00177, 2018.
[21] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[22] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, "Active object perceiver: Recognition-guided policy learning for object searching on mobile robots," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 6857–6863.
[23] J. Farebrother, M. C. Machado, and M. Bowling, "Generalization and regularization in dqn," arXiv preprint arXiv:1810.00123, 2018.
[24] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song, "Assessing generalization in deep reinforcement learning," arXiv preprint arXiv:1810.12282, 2018.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[26] G. Tesauro, "Temporal difference learning and td-gammon," Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
[27] J. Baxter, A. Tridgell, and L. Weaver, "Learning to play chess using temporal differences," Machine Learning, vol. 40, no. 3, pp. 243–263, 2000.
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[30] C. Sampedro, A. Rodriguez-Ramos, I. Gil, L. Mejias, and P. Campoy, "Image-based visual servoing controller for multirotor aerial robots using deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 979–986.
[31] F. Sadeghi and S. Levine, "Cad2rl: Real single-image flight without a single real image," arXiv preprint arXiv:1611.04201, 2016.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[33] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi, "End-to-end race driving with deep reinforcement learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2070–2075.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[35] J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, and M. Milford, "One-shot reinforcement learning for robot navigation with interactive replay," arXiv preprint arXiv:1711.10137, 2017.
[36] X. Ye, Z. Lin, J.-Y. Lee, J. Zhang, S. Zheng, and Y. Yang, "Gaple: Generalizable approaching policy learning for robotic object searching in indoor environment," arXiv preprint arXiv:1809.08287, 2018.
[37] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson, "Visual representations for semantic target driven navigation," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8846–8852.
[38] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi, "Visual semantic planning using deep successor representations," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 483–492.
[39] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, "Visual semantic navigation using scene priors," arXiv preprint arXiv:1810.06543, 2018.
[40] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[42] A. V. Clemente, H. N. Castejón, and A. Chandra, "Efficient parallel methods for deep reinforcement learning," arXiv preprint arXiv:1705.04862, 2017.
[43] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "Ga3c: Gpu-based a3c for deep reinforcement learning," CoRR abs/1611.06256, 2016.
[44] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., "Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures," arXiv preprint arXiv:1802.01561, 2018.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[46] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[47] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in European Conference on Computer Vision. Springer, 2016, pp. 241–257.
[48] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, "Sample efficient actor-critic with experience replay," arXiv preprint arXiv:1611.01224, 2016.
[49] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[50] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[51] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 839–847.
[52] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., "On evaluation of embodied navigation agents," arXiv preprint arXiv:1807.06757, 2018.
[53] S. Greydanus, A. Koul, J. Dodge, and A. Fern, "Visualizing and understanding atari agents," in International Conference on Machine Learning, 2018, pp. 1787–1796.
[54] C. Rupprecht, C. Ibrahim, and C. J. Pal, "Finding and visualizing weaknesses of deep reinforcement learning agents," arXiv preprint arXiv:1904.01318, 2019.