0% found this document useful (0 votes)
14 views18 pages

Root

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views18 pages

Root

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/341697027

Towards Generalization in Target-Driven Visual Navigation by Using Deep


Reinforcement Learning

Article in IEEE Transactions on Robotics · May 2020


DOI: 10.1109/TRO.2020.2994002

CITATIONS READS
84 973

5 authors, including:

Alessandro Devo Gabriele Costante


Università degli Studi di Perugia Università degli Studi di Perugia
8 PUBLICATIONS 189 CITATIONS 82 PUBLICATIONS 1,407 CITATIONS

SEE PROFILE SEE PROFILE

mario Fravolini
Università degli Studi di Perugia
222 PUBLICATIONS 3,016 CITATIONS

SEE PROFILE

All content following this page was uploaded by Gabriele Costante on 02 July 2020.

The user has requested enhancement of the downloaded file.


IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA 1

Towards Generalization in Target-Driven Visual


Navigation by Using Deep Reinforcement Learning
Alessandro Devo, Student Member, IEEE, Giacomo Mezzetti, Gabriele Costante, Member, IEEE,
Mario L. Fravolini, and Paolo Valigi, Member, IEEE

Abstract—Among the main challenges in robotics, target-


driven visual navigation has gained increasing interest in recent
years. In this task, an agent has to navigate in an environment
to reach a user specified target, only through vision. Recent
fruitful approaches rely on Deep Reinforcement Learning, which
has proven to be an effective framework to learn navigation
policies. However, current state-of-the-art methods require to re-
train, or at least fine-tune, the model for every new environment
and object. In real scenarios, this operation can be extremely
challenging or even dangerous. For these reasons, we address gen-
eralization in target-driven visual navigation by proposing a novel
architecture composed by two networks, both exclusively trained
Fig. 1. The target-driven visual navigation task. The agent has to explore an
in simulation. The first one has the objective of exploring the
environment to reach a user specified target. The only inputs it receives are
environment, while the other one of locating the target. They are the visual frames from its first person view and an image of the goal.
specifically designed to work together, while separately trained
to help generalization. We test our agent in both simulated and
real scenarios, and validate its generalization capabilities through
extensive experiments with previously unseen goals and unknown in this task, and can make the whole system unnecessarily
mazes, even much larger than the ones used for training. fragile. Furthermore, deep learning object detection models
Index Terms—Target-Driven Visual Navigation, Visual-Based
and classic navigation algorithm are not originally developed
Navigation, Deep Learning in Robotics and Automation, Visual to work together, and combining them would not be trivial.
Learning. For these reasons, map-less methods [11], [12], [13], [14]
have proven to be much more suitable for target-driven vi-
sual navigation. A widespread approach is to combine deep
I. INTRODUCTION
Convolutional Neural Networks (CNNs) with Reinforcement

T ARGET-DRIVEN visual navigation is a longstanding


goal in the robotics community. A robot able to navigate
and reach user specified targets, by using only visual inputs,
Learning (RL). Deep Reinforcement Learning (DRL), indeed,
allows to manage the relationship between vision and motion
in a natural way, and it has shown impressive results for map-
would have a great impact on many robotic applications, such less visual navigation and many other robotic tasks [15], [16],
as people assistance, industrial automation and transportation [17], [18], [19], [20].
(Fig. 1). The common objective in most RL problems is to find a
A naive way to approach this problem is to combine a policy which maximizes the reward w.r.t a specific goal. In
classic navigation system with an object detection module. such tasks, the target is unique and fixed from the beginning
For example, it is possible to couple a map-based navigation of the training phase. On the contrary, in target-driven visual
algorithm [1], [2], [3], [4] or SLAM systems [5], [6], [7], navigation, a possibly different goal is specified for each run of
[8] with one of the state of the art object detection or image the algorithm. The typical approach is to embed into the policy
recognition models [9], [10]. However, map-based approaches both the target goal and the current state. Theoretically, in this
assume the availability of a global map of the environment, way, it is possible to train just one algorithm to find multiple
while SLAM algorithms are not still specifically designed targets, without the need of learning new model parameters
for target-driven visual navigation. A geometric map and the for every possible goal. However, current methods are limited
distinction between mapping and planning are not necessary to consider as goals specific scenes or objects with which the
model is trained [21], [22]. Therefore, in practice, it is still
Manuscript received: June, 21, 2019; Accepted May, 5, 2020.
This paper was recommended for publication by Editor Francois Chaumette necessary to train, or at least fine-tune, the agent for every
upon evaluation of the Associate Editor and Reviewers’ comments. specific object it has to find and for every new environment it
The authors are with the Department of Engineering, University of Perugia, has to explore. If the environment is a real one, the training
06125 Perugia, Italy [email protected],
[email protected], [email protected], procedure for such DRL agent can be very complex and in
[email protected], [email protected] some cases dangerous.
This research has been partially supported by the University of Perugia In order to address this problem, we propose a new architec-
through the 2017 and 2018 Basic Research Funds (Projects: RICBA17MRF
and RICBA18MF). ture, trained exclusively in simulation, which transfers to real
Digital Object Identifier (DOI): see top of this page. environments and real objects, without the need of fine-tuning.
2 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

RL algorithms, however, show a great tendency to overfit in recent successes, the DRL framework proves very promising
the domain in which they are trained [23], [24]. To avoid this, for this purpose.
we make sure that the agent’s navigation policy is not biased
by the objects seen during training, and that the localization
A. DRL for Robotic Applications and Visual Navigation
of such objects is not affected by the exploration strategies. To
this end, we introduce a novel framework composed by two RL has a long history in the Artificial Intelligence research
networks. The first one aims to develop exploration strategies field [25]. However, it is only in recent years, with the adoption
in unknown environments, while the other one to locate the of Deep Neural Networks (DNNs), that the first great results
target object in the image. We design our framework in such have been achieved. In [26] and [27], human level performance
a way that the two parts of the architecture are independent is reached and surpassed in the games of Backgammon and
of each other, helping generalization, but at the same time Chess, respectively. In [28], the authors prove as a combination
perfectly suited to work together. We show that by completely of DNN and tree search can master the game of Go, beating
decoupling these two components we can effectively apply the world champion Lee Sedol. In 2015, [29] presents the
domain-transfer techniques to the two networks separately first RL and CNNs based algorithm able to reach human level
and also control their complexity according to their respective performance in a variety of Atari videogames, playing directly
tasks. from pixels. This work encourages further research in many
To train such agent, we build two simple simulated en- other visually complex tasks and videogames.
vironments, one for each subtask, using the photorealistic These successes in DRL inspires also several works in
graphics engine Unreal Engine 4 (UE4)1 . To evaluate our the robotic field. [19] proposes a DRL based closed-loop
model performance, we analyze its ability to explore unknown predictive controller for soft robotic manipulators. [20] uses
environments and locate target objects in different simulated RL to learn policies for complex dexterous in-hand manipula-
scenarios. Through our experiments, we show that with our tion. [30] introduces an Image-Based Visual Servoing (IBVS)
approach it is possible to achieve surprising results even in controller for object following applications with multirotor
much more complex scenarios compared to the ones used aerial robots. In [31], the authors propose an algorithm, trained
during training. Finally, we verify the generalization capability with 3D CAD models only, which is able to perform collision-
of our algorithm in a complex real setting with a real robot. free indoor flight in real environments. In [32], a system
The supplementary video can be found at the following link: to control torques at the robots motors directly from raw
https://www.youtube.com/watch?v=gZzP0y4AnRY. image observations is designed. [33] presents an A3C [34]
based algorithm for end-to-end driving in the rally game
TORCS, using only RGB frames as input. They run open
II. RELATED WORK loop tests on real sequences of images, demonstrating some
Target-driven visual navigation is a relatively new task in the domain adaptation capability. [35] introduces the interactive
field of robotics research. Only recently, end-to-end systems replay to show how perform zero-shot learning under real-
have been specifically developed to address this problem. A world environmental changes. [36] proposes a new method for
possible naive approach could be to use a classic map-based object approaching in indoor environments. [13] and [14] show
navigation algorithm along with an image or object recog- that an agent can be successfully trained to explore complex
nition model, such as YOLO [9] or ResNet [10]. However, unknown mazes to find a specific target, using only RGB
map-based methods cannot be used in unknown or dynamic frames as input. However, in their experiments, the appearance
environments. To cope with this restriction, they can be re- of the goals is fixed during training. As a consequence, the
placed with SLAM (Simultaneous Localization and Mapping) target is embedded in the model parameters, meaning that
systems [5], [6], [7], [8], but other limitations still remain, in during the test phase, the target must necessarily be the same.
particular: i) SLAM systems build and use geometric maps Other approaches [37], [38] , [39] focus on visual navigation
to navigate, which are not necessary for target-driven visual to find objects given the corresponding class labels. [37] try
navigation; ii) classic navigation algorithms separate mapping to learn navigation policies by using State-of-the-Art (SotA)
from planning, which is again not needed for this task, and can computer vision techniques to acquire semantic segmentation
potentially compromise the robustness of the whole system; and detection masks. They show that such additional inputs are
iii) additional efforts are required to combine SLAM systems domain independent and allow joint training in simulated and
and object detection modules, as they are not designed to work real scenes, reducing the need for common sim-to-real transfer
together. methods, like domain adaptation or domain randomization.
To overcome these limits, map-less methods, which try to However, we argue that segmentation and detection masks
solve the problem of navigation and target approaching jointly, are less beneficial in maze-like environments, where floor
have been proposed [21], [22], [12]. These systems, like and walls are the predominant elements, which, instead, our
ours, do not build a geometric map of the area, instead, they method addresses. In [38], the authors introduce a new deep
implicitly acquire the minimum knowledge of the environment learning based approach for visual semantic planning. They
necessary for navigation. This is done by direct mapping visual use both RL and Imitation Learning (IL) to train a deep
inputs to motion, i.e. pixels to actions. In particular, given its predictive model based on successor representations, demon-
strating good cross-task knowledge transfer results. Despite
1 https://www.unrealengine.com bootstrapping RL with IL proves to be an effective way to
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 3

Fig. 2. The overall system architecture. We can see the object localization and the navigation networks with the yellow and blue background respectively.
The first network takes two 224 × 224 RGB images as input, the goal image and the current frame. All the weights are shared among the two branches (red
dotted square). The network outputs a six length one-hot vector encoding the relative target position in the current frame. This vector is then fed into the
navigation network together with the current frame, resized to 84 × 84. The network finally produced the policy and the value function.

increase sample efficiency in many other applications, one of new unknown environments, even much larger than the ones
the main purposes of our work is to study how artificial agents used during training, and, most importantly, also to real ones
learn complex navigation policies in complete autonomy, i.e. with real targets.
without human influence. [39] addresses generalization in This work proceeds as follows: Section III describes the
visual semantic navigation by managing prior knowledge with DRL framework used and the proposed two-networks archi-
Graph Convolutional Networks [40]. While they prove that, tecture; Section IV focusses on the training procedure; Section
in their settings, prior knowledge improves generalization to V presents the environment setup used to train our algorithm;
unseen scenes and targets, it has limited applicability in maze- Section VI shows the experiments and the results; finally,
like environments and in target-driven visual navigation, where Section VII draws the conclusions and the path of future work.
target objects are not specified semantically but with images.
To the best of our knowledge, the only two works which III. FRAMEWORK
tackle the task of target-driven visual navigation, by using
In the following, the proposed framework is presented. In
images to specify goals and employing DRL only (i.e., without
particular, this section is divided in two parts:
IL), are [21] and [22]. [21] is the first work that addresses this
problem. The authors propose a novel algorithm whose policy 1) Problem Formulation: in which the problem is presented,
is a function of the goal as well as the current state. In this and our design choices are described;
way, it is possible to specify a new arbitrary target without 2) Network Architecture: where our two-networks architec-
the need of re-training or fine-tuning the model. However, the ture is detailed.
images they use to specify the goal are scenes taken from the
same area where the agent is placed. This means that with this A. Problem Formulation
approach, a robot cannot navigate in an environment of which The target-driven visual navigation problem consists in
we have no image. finding the shortest sequence of actions to reach a specified
One further step towards generalization is taken by [22], target, using only visual inputs. Our goal is to design an RL
which introduces a framework that integrates a deep neural agent able to find that sequence directly from pixels.
network based object recognition module. With this module, We consider the standard RL setting where the agent
the agent can identify the target object regardless of where the interacts with the environment over a number of discrete time
photo of that object is taken. However, it is still trained or fine- steps. The environment can be seen as a Partially Observable
tuned in the same environments where it is tested. Therefore, Markov Decision Process (POMDP) in which the main task
it is still not able to generalize to unseen scenarios. of the agent is to find a policy π that maximises the expected
sum of future discounted rewards:
B. Contribution and Overview "∞
X
#
t
In all the aforementioned works [21], [22], it is always Vπ (x) = Eπ γ rt , (1)
t=0
necessary to re-train or at least fine-tune the model for new
environments and objects. In real scenarios, this is not only an where γ ∈ [0, 1) is the discount factor, rt = r(xt , at ) is the
inefficient approach due to the high cost of producing samples reward at time t, xt is the state at time t and at ∼ π(·|xt ) is
with physical robots, but it can also be dangerous. the action generated by following some policy π. The MDP is
To avoid that, we design a novel architecture composed by clearly partially observable, in our case, because in every step
two main DNNs: the first, the navigation network, with the the agent has no access to the true state xt of the environment,
goal of exploring the environment and approaching the target; but only to an observation ot of it.
the second, the object localization network, with the aim of The observation ot is composed by the current RGB frame
recognising the specified target in the robot’s view. They are from the agent point of view and the image of the target to be
exclusively trained in simulation, and no single real image is reached. These two inputs are both fed into the architecture,
used. Finally, we show that our algorithm directly transfers to which consists of two different networks. The first, i.e. the
4 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

the visual field of the agent. We discretise the position of


the goal in the current frame in 5 + 1 different classes:
“extreme right”, “right”, “center”, “left”, “extreme left, “no
target object” (Fig. 3a). The network takes two 224 × 224
RGB images as input: the current frame and the goal’s image.
To extract robust features, they are first preprocessed by a
ResNet-50 [10] (pre-trained on ImageNet [45], with the last
two fully-connected layers dropped). Then, they are fed into
5 convolutional layers with (512, 128, 16, 16, 16) filters with
3 × 3 kernel and stride 1, all of them with rectified linear units
0 1 0 0 0 (ReLU) as activation and followed by a GroupNorm layer.
(a) (b) Finally, the two representations are concatenated together and
Fig. 3. (a) Example of frame with its corresponding one-hot vector. The processed by a fully connected layer with 256 hidden units and
image is divided in 5 parts, and to each of them is associated a 0, except to ReLU activation. The final output is a probability vector of 6
the one containing the target (a monitor, in this case). If the goal is visible elements, one for each possible position. The position with
in more than one part, the left-most one is chosen. (b) Top-down view of the
technique used to collect the ground truth for object localization. 221 rays are the highest probability is chosen as input to the navigation
projected from the agent, distributed horizontally and vertically in 80◦ and network.
60◦ , respectively. 2) Navigation Network: The navigation network has the
main objective of exploring the environment until the object
localization network does not locate the target object. Its inputs
object localization network, has the objective of comparing
are the current 84 × 84 RGB frame and the target estimated
the two images and locate the target. The second, i.e. the
position provided by the object localization network. The
navigation network, is used to learn exploration strategies to
image is first processed by 2 convolutional layers: the first with
solve complex mazes. In particular, both inputs are processed
16 8×8 filters with stride 4, and the second with 32 4×4 filters
by the object localization network, which compares them and
with stride 2. Both layers are followed by a ReLU activation
outputs a vector indicating the target relative position in the
and a GroupNorm layer. Then, the image features extracted are
field of view of the agent. This vector and the current RGB
concatenated with the target estimated position and elaborated
frame are then fed into the navigation network that selects the
by a fully connected layer followed by a Long Short-Term
next action (Fig. 2).
Memory (LSTM) [46] network, both with 256 hidden units
From the results in [13] and [14], it emerges that also a
and ReLU activations. It is important to specify that, during
small and fast CNN can reach great results in synthetic maze
the training phase, the features extracted by the CNNs are
solving, and we argue that such a network could be sufficient
also fed into 2 deconvolutional layers, symmetrical to the
in real ones too. With this reasoning, we employe a lightweight
convolutional ones, to make depth estimation (see Fig. 4b and
CNN for the navigation and a larger one for the object location
Section IV-B for details). The final output is a probability
task, where a powerful model for feature extraction is needed.
vector p of 3 elements, which represents the probability of
In this way, the object localization network can be much
choosing an action: “turn right”, “move forward”, “turn left”,
more efficiently trained offline by supervised learning, and the
and a scalar value which is the estimated expected sum of
navigation network can be faster trained with RL.
future discounted rewards.
Among the several DRL algorithm, e.g. Deep Q-Learning
[41], A3C [34], batched A2C [42], GA3C [43], we choose IV. TRAINING APPROACH
the recent IMPALA (Importance Weighted Actor-Learner Ar-
The training is divided in two different phases: the learning
chitecture), which, in [44], is used to simultaneously learn
phase of the navigation network and the learning phase of the
a large variety of complex visual tasks. For our purposes, it
object localization network. These two phases are completely
offers two main advantages. First, it leverages parallel CPU
independent of each other, hence each network is trained
computations for efficient trajectory generation, and GPUs for
separately.
faster backward computation. Second, it implements V-trace
targets, in replacement to the standard value function, which
A. Object Localization Training Phase
allows sample-efficient off-policy learning.
The architecture of the two networks is explained in detail The object localization network training is posed as a
in the following paragraphs, while the training approach is similarity metric learning problem (Fig. 4a). We use the dataset
described in Section IV. collected in the Capture-Level (Section V-B), whose samples
consist of triplets of images. Each triplet contains: the picture
of the goal, an image in which the goal is visible and an other
B. Networks Architecture
one in which it is not. These three 224 × 224 pictures are first
The proposed model is composed by two different networks: preprocessed by a ResNet-50, and then fed into the first three
the object localization network and the navigation network. In convolutional layers. At this point, a triplet loss function is
Fig. 2, the overview of the architecture is shown. computed as follows [47]:
1) Object Localization Network: The object localization 1
`t = max 0, m + k g − f + k2 − k g − f − k2 ,

network has the objective of locating the target object in (2)
2
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 5

256
ResNet-50
0
3x3x16 3x3x16 3x3x16 1
3x3x512 3x3x128 0
224x224
0
80x40
0
0 84x84 4x4x32 4x4x32
ResNet-50
1 8x8x16 8x8x16
3x3x16 3x3x16 3x3x16 0
224x224 3x3x512 3x3x128 0 V
0
0 Graphics
0 0 0 0 1 0 0
Engine
ResNet-50 256

3x3x16 3x3x16 256


3x3x128 3x3x16
224x224 3x3x512 256

(a) (b)
Fig. 4. The two networks during the training phase. (a) The object localization network takes three pictures as input. The features extracted by the ResNet
and the first three convolutional layers are used to compute the triplet loss (`t ) in (2). The features are then further processed by the last two convolutional
layers and the fully connected one to produce the two probability vectors. (b) The navigation network inputs the current frame and one-hot vector from the
engine. The features extracted by the two convolutional layers are used to make depth estimation, and concatenated to the vector to produce the policy and
the value function.

where m is a positive constant which controls the margin, g interaction with the environment, while the learner processes
represents the features extracted from the goal image, and f + them to update the network. In our specific implementation,
and f − the features extracted from the picture where the target we use 16 actors and one learner.
is visible and where it is not, respectively. This loss allows the 1) Actor: During the learning phase, the navigation net-
model to develop a concept of similarity. The three descriptors work is completely separated from the object detection net-
are then concatenated together two by two: work (see Fig. 4b). Each IMPALA actor is placed in a different
maze (see Section V-A for details), and for every action at ,
r1 = [f + , g], it receives from the environment a reward rt and a new
r2 = [f − , g]. observation ot . rt is always 0, except when the agent reaches
the goal, in which case it receives +1. ot is composed by
Finally, r1 and r2 are separately processed by the last two
the target position and the current RGB frame, which, during
convolutional layers and the fully connected one, producing
learning, are both generated by the game engine itself.
the two probability vectors, p1 and p2 .
Once an actor finishes to perform a predefined number of
The loss we use is a weighted Cross Entropy (CE) loss,
steps, it sends its trajectory to the learner, which reprocesses
adapted to our specific task. In particular, in the classical CE
it to update the network.
loss formulation, every classification error weigh the same. On
2) Learner: The learner is responsible for loss computation
the contrary, we argue that for our specific task, the value of
and parameters update. The first loss is relative to depth
the loss should increase proportionally to the distance between
estimation. According to [14], if this loss shares model weights
the estimated and the true object location. For that reason, we
with the policy, which is true for our architecture, this auxiliary
modify the CE formulation for our localization loss:
( task can be used to speed up learning and to achieve higher
−d log(p∗n ) 1 ≤ n, k ≤ 5 performance. Since only the ceiling and floor are located in
`l (p∗ ) = , (3) the lower and upper areas, we compute the depth loss only on
− log(p∗n ) n=0∨k =0
the central 80 × 40 pixels of the frame. The loss we use for
where that is a pixel-wise mean squared error between the predicted
d = |n − k|, depth dp and the one provided by the engine de :
p∗n is the n-th element of the probability vector (either p1 or 1 X
ld = (dp − de )2 . (4)
p2 ), n corresponds to the true location of the target object, 80 × 40
and k corresponds to the most likely position according to the The features extracted by the convolutional layers are concate-
network, i.e.: nated with the one-hot input vector provided by UE4 (Fig. 3b),
k = arg max p∗i . and processed by the network, which finally output the policy
i
and the estimated value, as described in Section III-B2.
So, the overall loss for a single triplet is given by the sum of To speed up learning, we use an experience replay [48],
the components in (2) and (3): which is a buffer of trajectories shared among actors. There-
`obj = `t + `l (p1 ) + `l (p2 ). fore, every computation, 2 trajectories are randomly picked
from the experience replay, batched together with the actor’s
current one, and processed in parallel in a single pass.
B. Navigation Training Phase All the trajectories are used to compute (4) and the fol-
The navigation network is trained via RL using IMPALA. lowing three different losses. Before describing them, it is
It requires the use of two main entities: the actor, which runs important to clarify that the value IMPALA assigns to a state
in CPU, and the learner, which works in GPU. Both actor is different from the one computed with the classical Bellman
and learner share the same network parameters. The task of equation (see Eq. 1). Due to the lag between the time when
the actor is to collect trajectories of experiences through the trajectories are generated by the actors and when the learner
6 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

estimates the gradient, they actually follow different policies,


µ (behaviour policy) and π (target policy) respectively. There-
fore, the learning takes place off-policy and this must be taken
into account when estimating the value. IMPALA uses V-trace
targets vt for that purpose, which are a generalization of the
Bellman equation for off-policy learning. The first loss we
report needs to fit those V-trace targets vt :
1 2
lv = (vt − Vθ (ot )) , (5)
2
where Vθ (ot ) is the estimated value, parameterized by θ, for
the current observation ot . The second is relative to the policy
π:
lp = ρt log πθ (at |ot ) (rt + γvt+1 − Vθ (ot )) , (6)
 
t |xt )
in which ρt = min ρ̄, π(a µ(at |xt ) is one of the truncated
importance sampling weights. Importance sampling is a well
known technique for estimating properties of a particular
distribution, while only having samples generated from a
different distribution (µ in our case) than the distribution of
interest (π in our case). However, it may suffer from instability
because, in some conditions, the policies can diverge from
t |xt )
each other, resulting in extremely high weights π(a µ(at |xt ) . To
reduce the variance of the gradient estimate, the weights are Fig. 5. Top-down view of the Maze-level. It is composed by 16 3 × 3 mazes,
where the navigation network is trained. In each maze, an actor (the blue
clipped at ρ̄ = 1. sphere) and its target are placed.
It should be noticed that with our architecture πθ and Vθ
share the same parameters θ, however, in general, they can
be different. The third loss consists in a bonus for entropy A. Maze-level
in action selection, and is employed to avoid a premature
convergence: In Fig. 5, a top-down view of this level is shown. It is
X composed by several separate 3 × 3 mazes, one for each
lc = − πθ (a|ot ) log πθ (a|ot ) . (7) IMPALA actor. Every time an actor hits its target or reach the
a
maximum number of allowed steps, its corresponding maze
This third component is especially useful during the first is regenerated and the actor and its goal are re-spawned. To
training steps, because it balances exploration and exploitation, avoid overfitting on a set of specific configuration of mazes
allowing the agent to sufficiently explore the MDP before and actor-to-target paths, we choose to generate mazes and to
converging. For a detailed description of the V-trace algorithm spawn actors/goals completely at random.
and the last three losses, see the APPENDIX and [44]. Most importantly, contrary to the approaches in [21] and
All the afore described equations (4)-(7) define the overall
[22], in which the DRL agent is trained on the same scenes
loss function for the parameters update:
where the objects it has to find are located, we decouple
lnav = ld + blv + lp + clc , navigation from target localization. To this aim, we use a goal
where b and c are the baseline and entropy costs, respectively. invisible to the navigation network. The only way it has to
The new weights are then returned to the actor, which starts reach the target is to follow the 6 element one-hot vector,
a new trajectory. generated by the UE4 engine itself, which indicates its relative
position. In this way, the agent can be trained with no bias
V. ENVIRONMENT w.r.t. the objects.
To train our DRL agent on the target-driven visual naviga- In order to make the navigation network be directly trans-
tion task, we use 3D virtual environments only. One of our ferable from a simulated environment to a real one, we use
main focuses is to design an algorithm capable of generalizing domain randomization (DR) [49]. It is a simple yet effective
to real world scenarios. For this reason, we choose the UE4 way to achieve sim-to-real transfer, which is successfully
graphics engine, which is one of the most recent and best- applied in other robotic tasks [50], [20]. It consists in ran-
known. It is widely spread in the game industry, and a lot of domizing the training environment settings in order to make
companies use it for its photorealism and customisability. the system more robust to domain changes. To do that, every
We have designed two different levels: time an actor hits its target, we change at random the following
1) Maze-level: level where the navigation network has been parameters: “maze wall heights”, “maze wall textures”, “maze
trained; floor textures”, “light color”, “light intensity”, “light source
2) Capture-level: level used to collect the images to train angle”.
the object localization network. It should be noted that applying domain randomization to
In the following subsections we describe both levels in detail. the SotA architectures in [21] or [22] is, in theory, possible.
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 7

(a) (b) (c) (d)


Fig. 6. Four examples of agent trajectories in a 20 × 20 maze with some variants of floor and wall textures. It should be noticed that the agent learns to
follow the right wall, which is an optimal rule to solve simply connected mazes.

However, as we show in Section VI-B, this increases training pictures depicting goal objects, and associate each of them
inefficiency without improving the performance. As detailed in with two images. Each entry of the dataset is then composed
the following sections, the reasons behind those results can be by: an image of the target, a capture where the target is not
mainly ascribed to the high complexity of their architectures, present, a capture where it does, and its relative position in
which do not decouple the navigation and the target recog- that capture.
nition tasks. On the contrary, our approach relies upon two The dataset counts 630,000 samples of 9 classes of objects,
totally separable components (i.e., the navigation and the target which are: “Chair”, “Monitor”, “Trash”, “Microwave”, “Bot-
recognition networks). Therefore, the navigation network can tle”, “Ball”, “Lamp”, “Plant” and “Jar”. In the simulator, for
be much smaller, considerably reducing the training time and each object, we use 4 to 10 different meshes.
increasing its effectiveness. It is also important to emphasize
that this in no way compromises the accuracy in locating
objects, as the object recognition network can be arbitrarily VI. EXPERIMENTS AND RESULTS
complex.
In the experiments, we measure our target-driven visual
The possible actions our actors can do are three: “turn
navigation system performance in unseen environment. In
right”, “move forward”, “turn left”. To simulate the uncertainty
particular, we analyze our agent’s ability to: i) explore the
of a real robot’s motion, we inject uniform noise in the speed
surrounding environment and ii) reach the designated target.
and angle of our actors movement.
To do that, we design three kind of tests, which are described
As the training of the two networks takes place separately,
in the next sections. Furthermore, we propose an ablation study
the input vector of the navigation network is not the one pro-
to examine the benefit of the auxiliary depth estimation loss
duced by the object localization network but is generated by
on the agent performance. Finally, to verify the generalization
the UE4 itself. In order to make the navigation network more
capability of our algorithm, we use it in a complex real setting
robust against possible classification errors of the localization
with a real robot.
network, we ensure that the one-hot vectors generated by UE4
have a 10% probability of being wrong (in that case, it is
picked uniformly from all classes). A. Training Details
We train the Navigation Network with 16 actors for 70
B. Capture-Level million steps using SGD without momentum, with learning
Contrary to the navigation network, which is trained via re- rate 0.0005 and batch size 8. The specific parameter settings
inforcement learning, the object localization network is trained of IMPALA and the Maze-level that are used throughout
in a supervised manner. We collect a dataset of synthetic our experiments are given in detail in Table I and Table II
images from the UE4 game engine in the following way. respectively.
We place a camera in a fixed position and spawn randomly The Object Localization Network is trained with the first
generated objects in its field of view. Every time the camera 540,000 samples of our synthetic dataset for 50 epochs with
takes a picture, the current objects are replaced with ran- learning rate 0.0025, batch size 128, and margin constant m
domly generated ones. Like for the Maze-level, we follow the equals to 0.1. We use other 90,000 samples to implement
approach in [49] to achieve good generalization capabilities early stopping and selecting the best model parameters. All
across domains. the details on the parameters used for the Capture-level are
For each image, we get the relative position of each object reported in Table III.
with respect to the camera from the engine. We discretise The source code is publicly available and can be
the position in five classes: “extreme right”, “right”, “center”, found at https://github.com/isarlab-department-engineering/
“left” and “extreme left”. Then, we download from the web DRL4TargetDrivenVN.
8 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

(a) (b) (c) (d)


Fig. 7. Two images with their corresponding saliency maps. From the pictures, it is clear the agent pays much attention on the right wall, especially on the
margins and edges.

B. Simulated Experiments TABLE I


IMPALA H YPERPARAMETERS
To measure the performance of our target-driven visual
navigation system in unseen environments, we make three Hyperparameter Value
types of tests in simulation: one to check the exploration Entropy Coefficient (c) 0.00025
Baseline Coefficient (b) 0.5
capability of the surrounding environment, and the other two
Unroll Length 100
to verify if it is able to locate and reach the designated target.
Discount (γ) 0.99
Validation on target-driven tasks is achieved by comparing
Experience Replay Capacity (per actor) 500 frames
our approach against different strategies. First, we show that
our method learns effective navigation policies in contrast TABLE II
to a Random Agent (RA) that picks actions by following M AZE - LEVEL P ROPERTIES
a uniform random probability distribution. Furthermore, we
Property Range Type
show comparisons against the Target-Driven Navigation Model
Maze Wall Heights [0.5, 2.0] Real
(TDNM), presented in [21], and the Active Object Perceiver
Maze Wall Textures [0, 45] Categorical
(AOP), introduced in [22]. We consider two different versions
Maze Floor Textures [0, 45] Categorical
for each of the two SotA baselines, that differ from each other
Light Color (per channel) [0.0, 255.0] Real
in the way they are trained:
Light Intensity [1.0, 10.0] Real
1) Without DR: In these experiments we evaluate the Light Source Angle [0, 360] Integer
performance of the two baselines as originally presented
in the papers [21] and [22]. Specifically, both models are TABLE III
C APTURE - LEVEL P ROPERTIES
trained in 16 3 × 3 mazes, as for our method, but with
no domain randomization (i.e., mazes configurations, Property Range Type
textures, lights, targets, etc... are all fixed from the Maze Wall Textures [0, 45] Categorical
beginning of training). A different specific layer is used Maze Floor Textures [0, 45] Categorical
for each one of the 16 mazes, following the training Light Color (per channel) [0.0, 255.0] Real
protocol proposed by the authors. Light Source Angle [0, 360] Integer
2) With DR: This second set of tests aim to analyze the # Objects per Image [0, 6] Integer
effect of domain randomization on the SotA approaches. Distance Objects-Camera [150, 1050] Real
However, since they both employ scene-specific layers,
DR cannot be straightforwardly applied. As originally
conceived in [21] and [22], each scene-specific layer differs from the original methods proposed in [21] and
is trained in a single fixed scene. By applying DR on [22].
each of the 16 mazes, the scenes change at the end of For a fairer comparison, we avoid training the object recog-
each episode, which raises two issues: i) a huge number nition network of the AOP (both with and without DR) from
of scene-specific layers would be generated throughout scratch, preferring to directly feed it with the ground truth
the training process and ii) each scene-specific layers bounding boxes provided by the engine.
would be trained for just one episode, preventing the Finally, at the end of this Section, we present an ablation
whole network from learning properly. To overcome study to evaluate the effects of the auxiliary depth estimation
these issues and evaluate the SotA baselines with DR, loss ld on the agent performance.
we decided to share the scene-specific layer across 1) Exploration Experiment: In this experiment, we place
all the agents and scenes, as for the generic layers. the agent in the center of a 20 × 20 maze, which is much
Nevertheless, it should be highlighted that such strategy larger than the 3 × 3 mazes in which it is trained. We give
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 9

(a) (b)

(c) (d)
Fig. 8. Two images of corners (left) and two of dead ends (right), with their corresponding value. The function clearly presents high picks when the agent
turns corners, while it drops sharply in correspondence of dead ends.

(a) (b) (c) (d)


Fig. 9. Agent first person view with different light intensity levels: (a) 0.5, (b) 3, (c) 7 and (d) 13. It should be noticed that for a value of 0.5 it is really
hard to distinguish the wall contours.

(a) (b) (c) (d)


Fig. 10. Agent exploration score (blue) w.r.t. human expert performance (red), with the four maze seeds: (a) 2, (b) 3, (c), 4, (d) 5. Plots show that the agent
can achieve high scores with good light intensity values. However, the performance drops rapidly when the brightness is reduced. The wide score ranges that
can be seen in all the figures are caused by particular wall textures with which our agent produces poor performance.
10 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

TABLE IV
AGENT E XPLORATION P ERFORMANCE WITH D IFFERENT L IGHT
I NTENSITY L EVELS AND M AZE C ONFIGURATIONS

Maze Seed Light Intensity Agent Human Expert RA


13 5.9%
7 20.3%
2 31.5% 4.1%
3 9.8%
0.5 5.5%
13 14.4%
7 12.4%
3 33.9% 5.4%
3 8.4%
0.5 11.5%
13 11.0%
7 19.4%
4 42.5% 3.3%
3 12.2% Fig. 12. An example of a trajectory followed by our agent in the target-driven
0.5 1.2% (5 × 5) experiment. The environment is a 5 × 5 maze which ends in a room
13 19.1% with 3 different objects, including the target (a chair, in this case).
7 12.5%
5 37.8% 3.4%
3 7.4%
0.5 0.7% encouraged to explore the maze as fast as possible. This
implies that revisiting already inspected locations should be
avoided, which is exactly what the wall-following policy
does. Having developed such strategy suggests that our model
extracts the fundamental features to explore mazes of any size.
As proof of this, we follow the approach in [51] to visualise
the saliency map of the network. In Fig. 7, as expected, it can
be seen that the gradient is much stronger on the right wall,
especially on the margins and edges.
Another evidence that the agent has developed a certain
awareness of the task and a concept of exploration comes from
the analysis of the value function. In Fig. 8, we can see sample
frames with the corresponding estimated values. When the
agent finds a corner, it knows that it has no visibility behind it,
Fig. 11. Average episode reward during training as a function of time,
and that there is an unexplored area that could hide the object
expressed in hours. Our model (blue line) is trained for 70 million steps, it has to find. This is clearly visible from its estimate of the
while the other approaches considered ([21] and [22], trained with domain value function, which shows peaks precisely in correspondence
randomization) for roughly 40 millions. The figure proves that our method
can reach superior final performances and also be much more efficient to train
of the corners (Fig. 8a, 8c). On the other hand, there is an
(more than 4 times faster compared to TDVN). Notice that the difference in opposite behaviour in correspondence of dead ends, to which
terms of training time between TDVN [21] and AOP [22], which employ very the agent assigns very low values (Fig. 8b, 8d).
similar architectures, is related to the need to generate ground truth bounding
boxes by the simulation engine. We measure our agent performance in 4 different mazes,
each of them generated by a different seed, and with 4 levels
of light intensity (Fig. 9). For each maze-light setting, we
it 180 seconds to explore it as much as possible. At the end randomly pick 3 floor and wall textures. For each of the 48
of the episode, we measure the percentage of the maze it has possible combinations, we average the results over 3 runs.
discovered. It is important to say that with this amount of time, In Table IV, the model scores are reported by percentage of
it is impossible to explore the entire maze. explored area, w.r.t. the maze random seed and the intensity
In Fig. 6, four sample trajectories in a 20 × 20 maze are levels, in comparison with human and RA performances. As
shown. In these pictures, we can see that the agent usually can be seen, the agent exploration ability is highly affected
follows the right wall of the maze. The wall follower is a by the light conditions, in particular, the performance drops
well-known algorithm to solve mazes, in particular, for simply for the very low value of 0.5, for which the walls contour are
connected ones, it is a technique that guarantees the agent not fairly visible. It should be noticed that the two extreme values
to get lost and not to walk the same path more than twice. are both outside the range used during training, but only the
The fact that the agent, even if trained in small 3 × 3 mazes lower one caused poor performance. We suppose that the main
only, has been able to develop this algorithm is remarkable. problem is not the overfitting in the training distribution, but
We argue that this can be explained by the reward function the difficulty to distinguish the walls contours in dark frames.
we designed. Specifically, the agent receives a positive reward Indeed, also the score for intensity equals to 3, which is still
only when it reaches the target (which is unknown to the in the training range, is quite low.
agent at the beginning of the exploration). Therefore, it is The comparison between our method and RA (Table IV),
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 11

TABLE V
AGENT TARGET- DRIVEN P ERFORMANCE IN THE 5 × 5 M AZE BY TARGET O BJECT C LASS . T IME , S UCCESS R ATE AND SPL FOR TDVN AND AOP ARE
PROVIDED FOR BOTH VERSIONS , WITH AND WITHOUT DR.

Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time − − −
RA Success Rate 0% 0% 0%
SPL 0 0 0
Time −-− −-− −-−
TDNM [21] Success Rate w DR - w/o DR 0% - 0% 0% - 0% 0% - 0%
SPL 0-0 0-0 0-0
Time −-− −-− −-−
AOP [22] Success Rate w DR - w/o DR 0% - 0% 0% - 0% 0% - 0%
SPL 0-0 0-0 0-0
Time 28s 30s 31s − 34s 27s 28s 23s 26s − 48s 38s 31s
Ours Success Rate 83% 33% 33% 0% 100% 50% 100% 100% 100% 0% 50% 67% 72%
SPL 0.3 0.11 0.11 0 0.29 0.19 0.36 0.43 0.38 0 0.1 0.17 0.24
Time 17s 17s 17s
Human Expert Success Rate 100% 100% 100%
SPL 0.59 0.59 0.59

shows that our agent learns a navigation policy (i.e., the wall adequate exploration policy is essential to complete the task.
follower strategy) that proves to be far superior and more Without such a policy, it is extremely unlikely to ever reach
efficient than a random explorer, as expected. the room with the objects, as confirmed by the RA results.
In Fig. 10, very large score ranges can be seen in correspon- We argue that the SotA baselines are not able to effectively
dence of every light intensity levels. The agent performance explore the mazes, even if trained with domain randomization,
oscillates considerably, and with some particular wall textures for the following reasons. First, both baselines are originally
it reaches rather low levels. However, in the same figure, we conceived to be quickly fine-tuned on previously unseen envi-
can also see that in some other settings our agent can get close ronments, thanks to the scene-specific layers they implement.
to human performance, which is impressive, considering that Therefore, they are not specifically designed to directly gener-
it is trained in very small 3 × 3 random mazes only. alize to new scenarios, contrary to our approach. Secondly,
2) Target-driven Experiment (5 × 5): In this second experi- TDNM and AOP rely upon a single complex architecture
ment, we place our agent in a 5×5 maze, which ends in a room to simultaneously address exploration and target recognition.
with 3 different objects, including the target. This experiment Hence, their optimization is more difficult and it is harder
is naturally divided in two phases for the agent: it has first to to achieve navigation capabilities that generalize over test
explore the maze to find the room, then it has to distinguish scenarios that are wider and more complex than those used
the target from the other objects, locate it, and approach it. for training. On the contrary, our approach decouples the two
An episode ends when the agent reaches the target (i.e., when sub-tasks and favors the generalization of exploration policies
they collide) or when 90 seconds are elapsed. In Fig. 12, an over other environments.
example of our agent trajectory is shown, in that case the goal In addition, as shown in Fig. 11, our approach is also
is the red chair. more efficient. The plot represents the average episode reward
We try with all the 9 different objects used to train the object obtained by our method and the other SotA baselines (with
localization network, together with other 3 previously unseen domain randomization) during training, as function of time. In
object classes (“Can”, “Extinguisher” and “Boot”), averaged particular, our model can complete its 70 million steps training
over 6 runs each. To measure the agent performance, we use in about 156 hours, while TDVN and AOP require more than
3 metrics: the time (in seconds) needed to reach the goal, 380 hours to perform roughly 40 million steps (we decide to
the success rate (in percentage), and the Success weighted by not continue further since the curves do not show any signs of
(normalized inverse) Path Length (SPL), introduced in [52]. improvement for several million steps). Hence, our approach is
The SPL is calculated as follows: considerably faster than [21]and [22], and this further suggests
1 X
N
`i that those approaches are not particularly suited for direct sim-
SP L = Si , to-real transfer.
N i=1 max (pi , li )
On the contrary, as can be observed in Table V, our model
where, N is the number of test episodes, `i the shortest-path produces reasonable results, successfully reaching the targets
distance from the agent starting point to the target in episode most of the time. There are particular object meshes, however,
i, and pi the length of the path actually taken by the agent in for which the agent is not able to complete the task before the
episode i. end of the episode. In that cases, the agent finds the room but
The results of all the models are summarised in Table V, it is not able to approach the target. Most of the time, this
in comparison to human and RA performance. As shown is due to the uncertainty of the object localization network,
by the results, neither the TDNM [21] nor the AOP [22] which is unable to locate some object meshes continuously
are able to generalize in the test environment. Since the test and consistently over time.
environment is much harder than the training scenarios, an Since our object localization network is trained by using a
12 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

TABLE VI
AGENT TARGET- DRIVEN P ERFORMANCE IN THE 20 × 20 M AZE BY TARGET O BJECT C LASS

Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time 228s 201s 127s − 200s 56s 120s 44s 67s − 276s − 147s
Ours Success Rate 17% 33% 33% 0% 83% 50% 67% 83% 67% 0% 33% 0% 52%
SPL 0.02 0.05 0.08 0 0.12 0.27 0.17 0.57 0.3 0 0.04 0 0.18
Time 66s 66s 66s
Human Expert Success Rate 86% 86% 86%
SPL 0.40 0.40 0.40

TABLE VII
D EPTH L OSS A BLATION S TUDY

Approach Metric Chair Monitor Trash Microwave Bottle Ball Lamp Plant Jar Can Extinguisher Boot Avg
Time 56s 67s 77s 76s 65s 84s 76s 58s 71s − 63s 63s 69s
Ours (No depth loss) Success Rate 83% 50% 50% 17% 17% 33% 50% 100% 83% 0% 50% 83% 56%
SPL 0.15 0.07 0.06 0.02 0.03 0.04 0.07 0.17 0.12 0 0.08 0.13 0.09
Time 28s 30s 31s − 34s 27s 28s 23s 26s − 48s 38s 31s
Ours Success Rate 83% 33% 33% 0% 100% 50% 100% 100% 100% 0% 50% 67% 72%
SPL 0.3 0.11 0.11 0 0.29 0.19 0.36 0.43 0.38 0 0.1 0.17 0.24

Fig. 13. Three examples of depth images produced by our agent. Although depth estimation is only an auxiliary loss, and hence the images produced are
not accurate, it is still possible to identify the right wall.

similarity metric, we expect a certain ability to generalize to actual distance covered by the agent/human can be extremely
previously unseen objects classes. From Table V (see “Can”, large.
“Extinguisher” and “Boot”), it can be seen that the agent suc-
cessfully reaches 2 (“Extinguisher” and “Boot”) of the 3 new 4) Ablation Study on Depth Estimation: In this paragraph,
object classes. In particular, the poor results with the “Can” we aim to evaluate the effects of the auxiliary depth estimation
are due to frequent false positives of the object localization loss, proposed in [14], on the general performances of the
network that, misleading the navigation component, prevent agent. To this end, we train a second version of our model
the agent from exploring the environment properly. (using the same training protocol of the other experiments)
without the auxiliary depth estimation loss. In Table VII, the
3) Target-driven Experiment (20 × 20): Differently from comparison results, in the test 5 × 5 maze, between the two
the previous experiment, in this case, the test maze is a much models are shown. As can be seen, the agent trained with depth
larger 20 × 20 maze and the time limit is increased to 300 loss shows better performance in terms of success rate and,
seconds. The purpose is to assess the ability to collaborate on average (last column of Table VII), appears to be more
between the object localization network and the navigation than twice as fast in completing the task. By analyzing the
network, in situations where the distance to be covered is navigation trajectories computed by the agent trained without
much longer. Considering the poor performance of the two the depth loss, we notice that it does not seem to follow
baselines in the 5 × 5 maze, in Table VI, the agent results any specific strategy and the exploration appears much more
are reported in comparison with human performance only. As random.
expected, it can be noticed that the time needed to reach the
target increases significantly, while both the success rate and Therefore, the results suggest that depth estimation could
the SPL considerably decrease. In particular, the score highly be helpful to develop a robust navigation policy. In fact, Fig.
depends on the first turns the agent chooses to take, since the 13 shows some depth images produced by the deconvolutional
maze is so large that it is rather difficult to get back on the network of the model. Since the main task is navigation and
right path within the time limit. This consideration is also valid our aim is to teach it only the basic concept of depth, the
for humans who, as can be seen, sometimes fail to complete images produced are poorly accurate. Nevertheless, it can be
the task. The size of the maze also implies a rather low SPL, clearly seen that it is able to distinguish the right wall from
for both agent and human. In fact, while the shortest path from the rest of the image, and we suppose that this may have
the starting position to the target is not particularly long, the encouraged the development of the wall-following policy.
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 13

TABLE VIII
R EAL E XPERIMENTS R ESULTS : T YPE I

Maze Conf. Indoor/Outdoor Steps Success Rate SPL


(a) I 342 59% 0.4
(b) I 500 27% 0.12
(e), (f), (g), (h) O 427 36% 0.23
(a) (b) (c) (d)
Avg I 379 46% 0.27

TABLE IX
R EAL E XPERIMENTS R ESULTS : T YPE II

Maze Configuration Indoor/Outdoor Area Explored


(a) I 52%
(e) (f) (g) (h) (b) I 50%
(c) I 48%
Fig. 14. The 6 maze configurations used for the first two types of tests. (a)-(d)
In the first row, depicted with grey floors, the mazes placed indoor are shown. (d) I 59%
(e)-(h) In the second row, represented with green floors, the configurations (f) O 48%
used outdoor. (g) O 40%
(h) O 72%
Avg I 52%
L Mi T B B Mo Mi L
Avg O 51%

TABLE X
T Mi L Mo R EAL E XPERIMENTS R ESULTS : T YPE III

Target Object Avg Steps SPL C1 C2 C3 C4


(a) (b) (c) (d)
Monitor 279 0.40 − − 3 3
Fig. 15. The 4 combinations used in the third type of test. We always use
Trash 112 0.49 7 3 − −
the same maze, in which we place the robot (red circle in the bottom-right
corner) and 5 possible objects: “Monitor” (Mo), “Trash” (T), “Microwave” Microwave 253 0.49 3 3 − 3
(Mi), “Bottle” (B) and “Lamp” (L). We form 4 combinations with 3 different Bottle 102 0.5 − 7 3 −
objects each: (a) C1, (b) C2, (c) C3, (d) C4. Then, for each combination, we Lamp 255 0.3 3 − 3 7
perform a run for each target.

C. Real Experiments of this experiment is to verify the agent’s ability to


To test our model performance in real settings, we build distinguish partially occluded objects, locate the goal,
several 4 × 4 mazes, both indoor and outdoor. In particular, and approach it. As for the first, the run ends when the
we use 7 different maze configurations where we place our robot reaches the target or after 1000 steps.
robot and the target objects (Fig. 14, 15). Due to the high Considering all the maze configurations, locations, robot-target
complexity of the background and light conditions, we argue positioning and test types, we run a total of 84 experiments.
that the performance of the algorithm in a specific maze The robot we use in all the experiments has a substantially
is probably affected by the position of the maze itself. To different shape from the avatar used during training, but
check this hypothesis, we test our agent in one of the 7 maze performs the same actions as the agent in simulation: “turn
configurations several times, in two different orientations (Fig. right”, “move forward”, “turn left”.
14a, 14b, 16a). 1) Indoor Experiments: We measure our agent performance
As for the experiments in simulation, we test both the in all three kinds of tests. In the top row of Fig. 14, the maze
exploration and the target detection capabilities of our model. configurations used for the first two indoor tests are shown.
In real settings, we measure our agent performance in three We can see in Table VIII and Table IX that, on average, the
different types of tests: robot successes 46% of the times, while exploring roughly
1) The robot and the target are both placed randomly in the half of the maze. We expect a strong correlation between
maze, and the goal of the agent is to reach the target as the success rate in the first experiment and the exploration
fast as possible. The run ends when the robot approaches performance. However, the results show that the former is
the object or when the maximum number of 1000 steps slightly inferior to the latter. This is probably caused by the
is reached; robot/target positioning. In fact, we ensure that the agent and
2) The robot is placed randomly in the maze, and its the target are at a reasonable distance, and, in particular, the
objective is just to explore as much as possible. In this latter is placed in areas less likely to be explored. As a result,
case, the episode ends when the maximum number of the success rate is consequently slightly lower.
1000 steps is reached; To check the agent’s sensitivity to background and light
3) The robot is placed inside a simple maze, together with changes, we choose a maze configuration and orient it in two
three objects, including the target (Fig 15). The purpose different ways (Fig. 14a, 14b, 16a). From the results in the first
14 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

(a) (b)
Fig. 16. (a) Same maze in two different orientations in indoor: 0◦ (left) and 180◦ (right). (b) Example of maze in outdoor. In this settings, the background
is completely different. In particular, it is characterised by the presence of numerous pedestrians, dynamic elements that the agent never sees during training.

two rows of Table VIII, it is evident that the algorithm prefers the weaknesses of DRL agents, such as [53] and [54], can be
the first orientation. This shows that although it can navigate, an effective way to prevent particularly bad behaviours and
it is sensitive to the environment surrounding the maze. improve overall performance.
In the third experiment, we consider one maze configuration Furthermore, as reported at the end of the Section VI-B, we
and 5 targets to reach: “Monitor”, “Trash”, “Microwave”, observed temporal inconsistency in locating some instances
“Bottle” and “Lamp”. We make 4 different configurations (C1- of objects. We attribute this issue to the way the object
C4), with 3 objects each (Fig. 15). For each of them, we run localization module was trained. We think that a standard
an episode, hence 12 in total. supervised learning on uncorrelated images is not suitable for
From the results, reported in Table X, we can say that this task, and we left for future work the development of a
our agent is able to recognise and reach the objects 75% different learning strategy that take into account the frames
of the times. It is important to note that our model has temporal dependency.
never seen any real object, not even the ones we use in the
experiments. Interesting is the fact that it is able to approach APPENDIX
the “Microwave” every time it is specified as a goal, contrary V-trace is an off-policy actor-critic algorithm, introduced in
to what happens in simulation, where it always fails. In this [44]. Off-policy algorithms are applied when there is the need
regard, we believe that the use of a pre-trained ResNet-50 to learn the value function V π of a policy π (target policy)
plays an important role. from trajectories generated by another policy µ (behaviour
2) Outdoor Experiments: We repeat type 1 and type 2 tests policy). In the case of IMPALA, although both the actors and
to measure our agent performance also in outdoor mazes (Fig. the learners are initialized with the same neural network and
16b). The bottom row of Fig. 14 shows the configurations same weights, the models differ from each other after the first
used. From the results, summarized in Table VIII, a slight steps of the computation. In fact, due to the asynchronous
degradation in performance can be seen with regard to the nature of the framework, actors can lag behind each other and
first type of test. The exploration capabilities of the agent, on the learner even by numerous updates. For this reason, an off-
the other hand, remain practically unchanged, as reported in policy algorithm is required.
Table IX. It appears that, despite the great difference in lighting It is important to make clear that the behaviour policies
and background between indoor and outdoor, the algorithm are followed by the actors only, while the target policy is
performance are consistent. parameterized by the learner model. For this reason, all the
following V-trace computational steps are performed by the
VII. CONCLUSIONS learner only.
t=s+n
First of all, consider the trajectory (xt , at , rt )t=s col-
In this work, we introduced a new framework to address the
lected by an actor with its policy µ. The n-steps V-trace target
challenging task of target-driven visual navigation. Through
for the value approximation V (xs ) at state xs is then defined
extensive experimentation, in both simulated and real mazes,
by:
we showed that a direct sim-to-real transfer for this problem is s+n−1
X t−1
Y
!
possible. The proposed two-network architecture, indeed, not t−s
vs = V (xs ) + γ ci δt V, (8)
only proved capable of reaching previously unseen objects in t=s i=s
much larger mazes than those in which it was trained, but also where:
showed a good ability to generalize in real scenarios. δt V = ρt (rt + γV (xt+1 ) − V (xt )) . (9)
However, although the results are encouraging, there are    
t |xt ) i |xi )
still a number of open problems and many aspects to improve. ρt = min ρ̄, π(a µ(at |xt ) and ci = min c̄, π(a
µ(ai |xi ) are
In particular, the navigation performance were fluctuating, and truncated importance sampling (IS) weights, and we assume
for some combinations of light and textures, the agent was not that ρ̄ ≥ c̄.
able to achieve satisfactory results. In this regard, we believe The use of the V-trace target is necessary because of the lag
that the use of techniques designed to discover and analyze between the actors and the learner, which makes the learning
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 15

be off-policy. For this reason, the Bellman equation cannot be that favors entropy in action selection. This loss can be crucial
applied as it is, but it needs to be adapted. V-trace uses the for a successful training, because by pushing the probabilities
truncated IS weights to correct the estimation error caused by of actions to be similar, it induces the agent to keep on
the discrepancy between the policies. However, it should be exploring the MDP.
noticed that the V-trace target is just a generalization of the on- For further details, read [44].
policy n-steps Bellman target. In fact, when π = µ and c̄ ≥ 1,
every ci = 1 and ρt = 1, therefore (8) can be simplified: R EFERENCES
s+n−1
X [1] J. Borenstein and Y. Koren, “Real-time obstacle avoidance for fast
vs = γ t−s rt + γ n V (xs+n ) . mobile robots,” IEEE Transactions on systems, Man, and Cybernetics,
vol. 19, no. 5, pp. 1179–1187, 1989.
t=s
[2] ——, “The vector field histogram-fast obstacle avoidance for mobile
The main difference is the presence of the truncated IS weights robots,” IEEE transactions on robotics and automation, vol. 7, no. 3,
pp. 278–288, 1991.
ρt and ci . ρt determines the fixed point of the update rule, since [3] D. Kim and R. Nevatia, “Symbolic navigation with a generic map,”
it appears in the definition of δt V (Eq. 9). The fixed point is Autonomous Robots, vol. 6, no. 1, pp. 69–88, 1999.
the value function V πρ̄ of some policy πρ̄ , defined by: [4] G. Oriolo, M. Vendittelli, and G. Ulivi, “On-line map building and
navigation for autonomous mobile robots,” in Proceedings of 1995 IEEE
min (ρ̄µ (a|x) , π (a|x)) International Conference on Robotics and Automation, vol. 3. IEEE,
πρ̄ (a|x) = P . 1995, pp. 2900–2906.
b∈A min (ρ̄µ (b|x) , π (b|x)) [5] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam
system for monocular, stereo, and rgb-d cameras,” IEEE Transactions
This means that for ρ̄ = ∞ this is the value function V π on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
of the target policy π. Conversely, choosing ρ̄ = 0 the fixed [6] M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and
M. Csorba, “A solution to the simultaneous localization and map build-
value become the value function V µ of the behavior policy µ. ing (slam) problem,” IEEE Transactions on robotics and automation,
In general, for non-zero finite ρ̄ we obtain the value function vol. 17, no. 3, pp. 229–241, 2001.
V πρ̄ of a policy πρ̄ which lies between the behavior and the [7] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha,
“Visual simultaneous localization and mapping: a survey,” Artificial
target policy. Hence, the larger ρ̄ the smaller the bias in the Intelligence Review, vol. 43, no. 1, pp. 55–81, 2015.
estimation, but the grater the variance. [8] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
The product of the coefficients ci (Eq. 8) determines the I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
localization and mapping: Toward the robust-perception age,” IEEE
importance of a temporal difference δt V observed at time t Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
on the update of the value function at a previous time s. The [9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
more different the two policies are, the higher the variance once: Unified, real-time object detection,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 779–
of this product is. To avoid that, we clip the coefficient at 788.
c̄, in this way, we reduce the variance without affecting the [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
solution to which we converge. In summary, ρt controls the recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
value function we converge to, and ci the convergence speed [11] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, “Self-
to this function. supervised deep reinforcement learning with generalized computation
With (8), we can describe the three parameters update. graphs for robot navigation,” in 2018 IEEE International Conference on
Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
Recalling that IMPALA is an actor-critic algorithm, it employs [12] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cogni-
two entities, the actor (which produces the policy) and the tive mapping and planning for visual navigation,” in Proceedings of the
critic (which calculates the value), to compute the learning IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2616–2625.
updates. Considering the parametric representations of the [13] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Sil-
value function Vθ and the target policy πω (notice that θ and ver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised
ω can be shared, as in our implementation), at training time auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
[14] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino,
s, the parameters θ are updated in the direction of: M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., “Learning to
navigate in complex environments,” arXiv preprint arXiv:1611.03673,
(vs − Vθ (xs )) ∇θ Vθ (xs ) , 2016.
[15] A. Devo, G. Costante, and P. Valigi, “Deep reinforcement learning for
to fit the V-trace target. This loss is needed to train the critic instruction following visual navigation in 3d maze-like environments,”
to judge the behavior of the actor. IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1175–1182,
2020.
The ω are updated along the direction of the policy gradient: [16] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in
robotics: A survey,” The International Journal of Robotics Research,
ρs ∇ω log πω (as |xs ) (rs + γvs+1 − Vθ (xs )) . vol. 32, no. 11, pp. 1238–1274, 2013.
[17] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement
This second loss refers to the actor, which adjust the policy learning for robotic manipulation with asynchronous off-policy updates,”
in 2017 IEEE international conference on robotics and automation
according to the critic evaluation. The critic contribution in the (ICRA). IEEE, 2017, pp. 3389–3396.
policy update rule is described by the second multiplicative [18] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang,
term in the equation. D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Qt-
opt: Scalable deep reinforcement learning for vision-based robotic
Finally, in order to avoid premature convergence, we can manipulation,” arXiv preprint arXiv:1806.10293, 2018.
add a third component, in the direction of: [19] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, “Model-based
X reinforcement learning for closed-loop dynamic control of soft robotic
−∇ω πω (a|xs ) log πω (a|xs ) , manipulators,” IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124–
134, 2018.
a
16 IEEE TRANSACTIONS ON ROBOTICS, VOL. XX, NO. X, MM AAAA

[20] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, [43] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, “Ga3c:
J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learn- Gpu-based a3c for deep reinforcement learning,” CoRR abs/1611.06256,
ing dexterous in-hand manipulation,” arXiv preprint arXiv:1808.00177, 2016.
2018. [44] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
[21] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., “Impala: Scalable dis-
A. Farhadi, “Target-driven visual navigation in indoor scenes using tributed deep-rl with importance weighted actor-learner architectures,”
deep reinforcement learning,” in 2017 IEEE international conference arXiv preprint arXiv:1802.01561, 2018.
on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364. [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
[22] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, “Active object perceiver: Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large
Recognition-guided policy learning for object searching on mobile scale visual recognition challenge,” International journal of computer
robots,” in 2018 IEEE/RSJ International Conference on Intelligent vision, vol. 115, no. 3, pp. 211–252, 2015.
Robots and Systems (IROS). IEEE, 2018, pp. 6857–6863. [46] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
[23] J. Farebrother, M. C. Machado, and M. Bowling, “Generalization and computation, vol. 9, no. 8, pp. 1735–1780, 1997.
regularization in dqn,” arXiv preprint arXiv:1810.00123, 2018. [47] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval:
[24] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song, “As- Learning global representations for image search,” in European confer-
sessing generalization in deep reinforcement learning,” arXiv preprint ence on computer vision. Springer, 2016, pp. 241–257.
arXiv:1810.12282, 2018. [48] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu,
and N. de Freitas, “Sample efficient actor-critic with experience replay,”
[25] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
arXiv preprint arXiv:1611.01224, 2016.
MIT press, 2018.
[49] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
[26] G. Tesauro, “Temporal difference learning and td-gammon,” Communi-
“Domain randomization for transferring deep neural networks from sim-
cations of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
ulation to the real world,” in 2017 IEEE/RSJ International Conference
[27] J. Baxter, A. Tridgell, and L. Weaver, “Learning to play chess using on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
temporal differences,” Machine Learning, vol. 40, no. 3, pp. 243–263, [50] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real
2000. transfer of robotic control with dynamics randomization,” in 2018 IEEE
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van International Conference on Robotics and Automation (ICRA). IEEE,
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, 2018, pp. 1–8.
M. Lanctot, et al., “Mastering the game of go with deep neural networks [51] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian,
and tree search,” nature, vol. 529, no. 7587, p. 484, 2016. “Grad-cam++: Generalized gradient-based visual explanations for deep
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. convolutional networks,” in 2018 IEEE Winter Conference on Applica-
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, tions of Computer Vision (WACV). IEEE, 2018, pp. 839–847.
et al., “Human-level control through deep reinforcement learning,” [52] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta,
Nature, vol. 518, no. 7540, p. 529, 2015. V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., “On evalu-
[30] C. Sampedro, A. Rodriguez-Ramos, I. Gil, L. Mejias, and P. Campoy, ation of embodied navigation agents,” arXiv preprint arXiv:1807.06757,
“Image-based visual servoing controller for multirotor aerial robots 2018.
using deep reinforcement learning,” in 2018 IEEE/RSJ International [53] S. Greydanus, A. Koul, J. Dodge, and A. Fern, “Visualizing and
Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, understanding atari agents,” in International Conference on Machine
pp. 979–986. Learning, 2018, pp. 1787–1796.
[31] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a [54] C. Rupprecht, C. Ibrahim, and C. J. Pal, “Finding and visualizing
single real image,” arXiv preprint arXiv:1611.04201, 2016. weaknesses of deep reinforcement learning agents,” arXiv preprint
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of arXiv:1904.01318, 2019.
deep visuomotor policies,” The Journal of Machine Learning Research,
vol. 17, no. 1, pp. 1334–1373, 2016.
[33] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi,
“End-to-end race driving with deep reinforcement learning,” in 2018
IEEE International Conference on Robotics and Automation (ICRA).
IEEE, 2018, pp. 2070–2075.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-
forcement learning,” in International conference on machine learning,
2016, pp. 1928–1937.
[35] J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, and M. Milford, “One-
shot reinforcement learning for robot navigation with interactive replay,”
arXiv preprint arXiv:1711.10137, 2017.
[36] X. Ye, Z. Lin, J.-Y. Lee, J. Zhang, S. Zheng, and Y. Yang, “Gaple:
Generalizable approaching policy learning for robotic object searching
in indoor environment,” arXiv preprint arXiv:1809.08287, 2018.
[37] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. David-
son, “Visual representations for semantic target driven navigation,” in
2019 International Conference on Robotics and Automation (ICRA).
IEEE, 2019, pp. 8846–8852.
[38] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mot-
taghi, and A. Farhadi, “Visual semantic planning using deep successor
representations,” in Proceedings of the IEEE International Conference
on Computer Vision, 2017, pp. 483–492.
[39] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual se-
mantic navigation using scene priors,” arXiv preprint arXiv:1810.06543,
2018.
[40] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-
stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-
ing,” arXiv preprint arXiv:1312.5602, 2013.
[42] A. V. Clemente, H. N. Castejón, and A. Chandra, “Efficient
parallel methods for deep reinforcement learning,” arXiv preprint
arXiv:1705.04862, 2017.
DEVO et al.: TOWARDS GENERALIZATION IN TARGET-DRIVEN VISUAL NAVIGATION BY USING DEEP REINFORCEMENT LEARNING 17

Alessandro Devo received the M.Sc. magna cum


laude degree in Information and Robotics Engineer-
ing in 2018 from University of Perugia, with a thesis
on Natural Language Video Description for Service
Robotics Applications. He then joined the ISAR-
lab as a Ph.D. Student. His research interests are
mainly Machine Learning, Reinforcement Learning,
and Computer Vision.

Giacomo Mezzetti received the M. Sc. degree in


Information and Robotics Engineering in 2019 from
University of Perugia, with a thesis on Design and
Experimentation of Target-Driven Visual Navigation
in Simulated and Real Environment via Deep Rein-
forcement Learning Architecture for Robotics Appli-
cations. His research interests are mainly Robotics,
Machine Learning, and Reinforcement Learning.

Gabriele Costante received the Ph.D. degree in


Robotics from the Department of Engineering, Uni-
versity of Perugia, in 2016. He is currently a Post-
Doc Researcher at the Intelligent Systems, Automa-
tion and Robotics Laboratory (ISARLab) and a
Lecturer of Computer Vision at the University of
Perugia, Department of Engineering. His research
interests are mainly Artificial Intelligence, Robotics,
Computer Vision and Machine Learning.

Mario L. Fravolini received the Ph.D. degree in


Electronic Engineering from the University of Pe-
rugia in 2000. He worked as a Research Assistant
in the Control Group at the School of Aerospace
Engineering, Georgia Institute of Technology, and
at the Department of Mechanical and Aerospace
Engineering West Virginia University. Currently, he
is an Associate Professor at the Department of Engi-
neering, University of Perugia. His research interests
include: Fault Diagnosis, Intelligent and Adaptive
Control and Biomedical Imaging.

Paolo Valigi received the Ph.D. degree from Uni-


versity of Rome Tor Vergata in 1991. From 1990 to
1994 he worked with the Fondazione Ugo Bordoni.
Since 2004 he has been Full Professor at the Uni-
versity of Perugia, Department of Engineering. He
is currently the head of the ISARLab. His research
interests are in the field of Robotics and Systems
Biology.

View publication stats

You might also like