0% found this document useful (0 votes)
16 views33 pages

Autonomous Mobile Robot Localization and Navigation

The document presents a novel approach for autonomous mobile robot localization and navigation in dynamic outdoor environments, primarily utilizing visual perception modules alongside a planar laser range finder for obstacle avoidance. It introduces a hybrid topological/grid-occupancy map that integrates various perceptual cues to achieve real-time performance, validated through extensive testing on a university campus. The system successfully navigated over 10 km of routes with high localization accuracy, demonstrating its robustness in crowded settings.

Uploaded by

dopham2k3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views33 pages

Autonomous Mobile Robot Localization and Navigation

The document presents a novel approach for autonomous mobile robot localization and navigation in dynamic outdoor environments, primarily utilizing visual perception modules alongside a planar laser range finder for obstacle avoidance. It introduces a hybrid topological/grid-occupancy map that integrates various perceptual cues to achieve real-time performance, validated through extensive testing on a university campus. The system successfully navigated over 10 km of routes with high localization accuracy, demonstrating its robustness in crowded settings.

Uploaded by

dopham2k3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Autonomous Mobile Robot Localization and Navigation

Using a Hierarchical Map Representation Primarily


Guided by Vision
• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •

Christian Siagian∗ , Chin Kai Chang† , and Laurent Itti


Department of Computer Science, University of Southern California, Los Angeles, California 90089
e-mail: [email protected], [email protected], [email protected]
Received 24 April 2013; accepted 3 December 2013

While impressive progress has recently been made with autonomous vehicles, both indoors and on streets,
autonomous localization and navigation in less constrained and more dynamic environments, such as out-
door pedestrian and bicycle-friendly sites, remains a challenging problem. We describe a new approach that
utilizes several visual perception modules—place recognition, landmark recognition, and road lane detection—
supplemented by proximity cues from a planar laser range finder for obstacle avoidance. At the core of our
system is a new hybrid topological/grid-occupancy map that integrates the outputs from all perceptual mod-
ules, despite different latencies and time scales. Our approach allows for real-time performance through a
combination of fast but shallow processing modules that update the map’s state while slower but more dis-
criminating modules are still computing. We validated our system using a ground vehicle that autonomously
traversed three outdoor routes several times, each 400 m or longer, on a university campus. The routes featured
different road types, environmental hazards, moving pedestrians, and service vehicles. In total, the robot logged
over 10 km of successful recorded experiments, driving within a median of 1.37 m laterally of the center of
the road, and localizing within 0.97 m (median) longitudinally of its true location along the route.  C 2014 Wiley

Periodicals, Inc.

1. INTRODUCTION 2010; Minguez and Montano, 2004) systems rely primarily


on planar proximity sensors such as a laser range finder
The ability to localize or recognize one’s location and to
(LRF). These proximity sensors exploit the confined na-
navigate or move about one’s environment are critical build-
ture of indoor space, e.g., narrow hallways, to create a
ing blocks in creating truly autonomous robots. In the past
unique location signature for localization, as well as to di-
decade, there has been significant progress in this aspect
rect the robot heading. In the open-space outdoors, how-
of autonomy, particularly indoors (Fox, Burgard, Dellaert,
ever, these sensors are less effective because, first, surfaces
and Thrun, 1999; Marder-Eppstein, Berger, Foote, Gerkey, &
detected by laser may not be as structurally simple as
Konolige, 2010; Thrun et al., 1999) and on streets and high-
indoors (e.g., due to trees, shrubs, etc.), and second, of-
ways for autonomous cars (Montemerlo et al., 2008; Thrun,
ten the nearest solid surfaces may be out of the sensor’s
2011). However, despite these successes, ample opportu-
range. Thus, it is desirable to find other ways to robustly
nity remains for substantial progress in less constrained
sense the environment. Two primary subproblems in nav-
pedestrian environments, such as a university campus or
igation are to estimate the road heading, so that the robot
an outdoor shopping center. This type of setting encom-
stays on its route, and to avoid static and dynamic obsta-
passes a major portion of the working environments for
cles. Because obstacle avoidance is readily and identically
human-sized service robots whose tasks are exemplified in
solved by proximity sensors both indoors and outdoors,
Figure 1. One major hurdle is that sensors and techniques
road recognition becomes the more pressing navigational
that are reliable in other areas are not as effective here.
concern.
For one, current indoor localization (Fox et al., 1999;
Another commonly used group of devices are depth
Thrun et al., 2000) and navigation (Marder-Eppstein et al.,
sensors, which produce dense three-dimensional proximity
information by exploiting a number of physical properties.
∗ website:
However, many of them, e.g., the Swiss-ranger or Kinect,
ilab.usc.edu/siagian; *equal authorship
† website: do not work robustly outdoors, especially in the presence
ilab.usc.edu/kai; *equal authorship
Direct correspondence to: Christian Siagian, e-mail: siagian@ of sunlight. Even for those that do, such as stereo cameras,
usc.edu the system still has to perform the difficult task of extracting

Journal of Field Robotics 31(3), 408–440 (2014) C 2014 Wiley Periodicals, Inc.
View this article online at wileyonlinelibrary.com • DOI: 10.1002/rob.21505
Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 409

Figure 1. Examples of tasks for service robots illustrated from left to right: patrolling alongside law enforcement personnel,
searching for target objects, providing information to visitors, and assessing the situations and conditions of the robot’s area of
responsibility.

the shape and appearance of the road, possibly without the primarily using monocular vision and complemented by
help of surrounding walls. LRF-based obstacle avoidance. Our main contributions are
The problem of road recognition can now be robustly fourfold:
solved on urban streets and highways because of distin-
guishing cues such as clear lane markings and bankings. r Using vision algorithms (visual attention, landmark
One key breakthrough of the DARPA Grand Challenge recognition, gist classification, localization, road finding)
(Montemerlo et al., 2008) is the introduction of a sensor to compute localization and navigation cues that are ro-
called the Velodyne (Glennie and Lichti, 2010). It provides bust in busy outdoor pedestrian areas.
combined proximity and appearance information, mea- r Developing a framework for the integration of these cues
sured densely in all 360◦ , with a 120-m range. The sensor using a proposed hierarchical hybrid topological/grid
has played a major role in enabling many teams to finish occupancy map representation.
the race. Furthermore, it has also allowed the Google Car r Exploiting the hierarchical map, along with applying
(Thrun, 2011) to travel more than 300,000 miles for many techniques such as forward projection and tracking, to re-
months in the San Francisco area. However, because of its solve differences in algorithmic complexity, latency, and
cost, the Velodyne has not been widely utilized. In addi- throughput, to create an overall real-time robotics sys-
tion, pedestrian roads, which are less regulated and whose tem.
delineations are more subtle and varied, pose a different r Testing over more than a 10 km total traveled distance,
challenge, one that still has not been generally solved even with routes 400 m or longer, to demonstrate that the ap-
with a Velodyne. proach is indeed viable and robust even in crowded en-
Another aspect of autonomous cars that does not trans- vironments such as a university campus. As far as we
fer to human-sized service robots is how localization is per- know, this is the most extensive evaluation of a service
formed. Unlike faster moving cars driving on a highway, robot localization and navigation system to date.
which can afford to make global positioning system (GPS)
location queries that are far apart, service robots operate on The system is described in Section 3 and tested in Sec-
a smaller environment scale and move at the speed of hu- tion 4. We then discuss our findings in Section 5. First, we
mans. A GPS reading, with its relatively coarse resolution, describe related vision localization and navigation research
would not be able to localize a robot accurately enough to in the following section.
turn correctly at a typical campus road intersection. Further-
more, many pedestrian roads have tall trees or buildings
alongside them, which denies the satellite visibility upon
2. RELATED WORKS
which the GPS relies. We start with similar full systems that perform both nav-
Given these factors, vision, the primary sense of hu- igation and localization in Section 2.1. In addition, be-
mans, is an ideal perceptual modality to localize and nav- cause many research efforts have focused on specific sub-
igate in most outdoor places where humans can travel. In components of our system, we also separately survey vision
addition, it is also an attractive alternative because of the af- localization (Section 2.2), road recognition (Section 2.3), and
fordability and availability of cameras. However, one draw- path planning (Section 2.4). A large portion of these refer-
back of these versatile devices is the associated processing ences directly motivate the implementations of our system
complexity, further exacerbated by the real-time require- components, techniques, and overall architecture. Further-
ments that robotic systems demand. more, we also cite references that contrast our approaches
In this paper, we present a fully autonomous local- to complete the background literature. These cited works
ization and navigation system for a human-sized service shown that our decisions are well founded by the respec-
robot operating in unconstrained pedestrian environments, tive communities.

Journal of Field Robotics DOI 10.1002/rob


410 • Journal of Field Robotics—2014

2.1. Indoor and Outdoor Mobile Robot of visual key points, such as SIFT (Lowe, 2004) and SURF
Localization and Navigation (Bay, Tuytelaars, and Gool, 2006), as well as subsequent tech-
niques such as the visual bag-of-words (Lazebnik, Schmid,
In recent years, a few systems have ventured out to pedes-
and Ponce, 2006), systems can now robustly perform this
trian outdoor environments. Systems such as that of Trulls
critical matching step in many more environments, includ-
et al. (2011) and Montella, Perkins, Spletzer, and Sands
ing outdoors.
(2012) do so by solving localization, road heading estima-
There are now many systems that are capable of ac-
tion, and obstacle avoidance, all by using only an LRF. The
curate coordinate-level localization (Murillo, Guerrero, and
key is being able to take advantage of various detectable and
Sagues, 2007; Royer, Lhuillier, Dhome, and Lavest, 2007;
distinguishable environment features within the LRF range.
Se, Lowe, and Little, 2005; Segvic, Remazeilles, Diosi, and
In their testing, these static cues are obtained from nearby
Chaumette, 2009), as well as simultaneous localization and
building surfaces, walls, poles, and trees. In the absence of
mapping (SLAM) (Angeli, Filliat, Doncieux, and Meyer,
such features, other cues would be necessary in order to use
2008; Eade and Drummond, 2008; Strasdat, Montiel, and
the technique.
Davison, 2010). In fact, many systems are capable of lo-
Another approach to navigation and localization makes
calizing in large-scale areas the size of a few city blocks
use of visual landmarks through the teach-and-replay
(Cummins and Newman, 2008; Schindler, Brown, and
paradigm (Chang, Siagian, & Itti, 2010; Chen and Birchfield,
Szeliski, 2007), as well as across many seasons (Valgren and
2006; Furgale and Barfoot, 2010). The robot is first manu-
Lilienthal, 2008). However, many of these systems have thus
ally steered through a specific route in the teaching stage,
far only been tested offline as the computational complexity
and is then asked to execute the same exact route dur-
can be prohibitive.
ing autonomous operation. Once a few landmarks are
An interesting recent effort focuses on a lower-
detected, the system can synchronize what it currently
resolution localization task called place recognition
sees with the training frames. Localization, like other
(Fazl-Ersi and Tsotsos, 2012; Pronobis, Caputo, Jensfelt, and
visual recognition-based systems, is straightforward be-
Christensen, 2006; Torralba, Murphy, Freeman, and Rubin,
cause training frames are associated with a specific loca-
2003; Ulrich and Nourbakhsh, 2000). Here, the problem is
tion. Navigation, on the other hand, is achieved by mov-
framed as a classification task to distinguish places in the en-
ing the robot to align the image coordinates of the seen
vironment, not the exact coordinate. Because of the coarser
landmarks to the appropriate locations in the training
resolution, these systems easily run in real time.
images.
A practical way to exploit the speed of place recognition
The success of the technique depends on being able to
is by combining it with the accuracy of metric localization
quickly and robustly match the input visual features despite
systems. Many systems (Siagian and Itti, 2009; Wang, Zha,
changes in lighting conditions or in dynamic obstacles such
and Cipolla, 2006; Weiss, Tamimi, Masselli, and Zell, 2007;
as pedestrians. One important concern involves how to get
Zhang and Kosecka, 2005) do so through a coarse-to-fine
back to the set route when the robot has to deviate from the
hierarchical approach, where place recognition primes pos-
set path momentarily, e.g., to avoid unexpected obstacles.
sible areas in the map before refining the location hypothesis
Indeed, if such deviation is not encountered during train-
to a coordinate level.
ing, the robot will have no memory of what it sees after de-
For completeness, it should be noted that many sys-
viating. Recent improvements by Cherubini, Spindler, and
tems combine vision-based localization with other modal-
Chaumette (2012) have shown promising results in over-
ities such as proximity (Newman et al., 2009; Ouerhani,
coming this difficulty.
Bur, and Hugli, 2005; Pronobis, Mozos, Caputo, and Jens-
Another issue arises when the system drifts and be-
felt, 2010), GPS (Agrawal and Konolige, 2006; Vargas, Na-
comes out of synchronization with the reference training
gatani, and Yoshida, 2007), or visual odometry (Nistér,
image sets. In that case, the system has to reset itself. This
Naroditsky, and Bergen, 2006; Tardif, Pavlidis, and Dani-
has to be done while the robot is stationary because the land-
ilidis, 2008). Each modality provides something that com-
marks are responsible for both localization and navigation.
plements vision localization. For example, even though GPS
Our system separates these two components. When our lo-
is coarser, has a lower update frequency, and is less reliable
calization system is temporarily confused about where it
in some places, it provides clean global absolute location in-
is, the robot can still move forward using road recognition,
formation that can be used to limit the matching process
which gives the localization system the chance to recover
to a smaller geographic area. Note that we make a dis-
later on down the route.
tinction between vision localization and visual odometry.
The former is a technique that tries to visually recognize
2.2. Vision Localization one’s global location, while the latter only visually esti-
The key to a successful vision localization system is to be mates one’s motion. However, if an initial global location
able to match observed visual landmarks despite outside in- is available, visual odometry can be critical in maintaining
terference or visual variability. Since the recent introduction localization.

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 411

2.3. Road Lane Recognition but simply detect and avoid potentially hazardous condi-
tions, for example using optical flow (Coombs, Herman,
A critical task in navigating in pedestrian environments is
Hong, and Nashman, 1995; Santos-Victor, Sandini, Curotto,
to recognize and stay on the road. It is important to distin-
and Garibaldi, 1993) or other visual features (i Badia, Pyk,
guish road lane navigation from systems that simply drive
and Verschure, 2007; Michels, Saxena, and Ng, 2005; Naka-
through any flat terrain. The former specifically enforces
mura and Asada, 1995). A standard representation allows a
proper and socially acceptable ways to travel in pedestrian
system to utilize multiple perceptual modalities and natu-
environments, e.g., avoiding excessive zigzags. In addition,
rally fuse them together.
often it is important for the robot to travel in the middle
The mapping is usually in the form of a grid occu-
of the road, because it provides an optimal vantage point
pancy map, whether it be the traditional ground-view two-
to observe signs or landmarks. Also, by trying to recognize
dimensional (Moravec and Elfes, 1985) map, or the more
roads in general, instead of performing teach-and-replay,
recent three-dimensional (Marder-Eppstein et al., 2010) ver-
the system is not limited to a set path that has previously
sion. These maps are primarily filled by planar proximity
been manually driven. Instead, the system has the freedom
sensors such as the LRF, which can also be swept to cre-
to move about its surroundings and explore unmapped
ate a three-dimensional perception. Another way to create
areas.
a three-dimensional map is by using the aforementioned
Aside from using vision, systems have been proposed
Velodyne (Glennie and Lichti, 2010). From our experience,
that use proximity cues from LRFs (Hu, Wang, Zhang, and
because of the openness of the outdoors, a well-placed pla-
Yu, 2009; Wurm, Kummerle, Stachniss, and Burgard, 2009),
nar LRF is sufficient to avoid most obstacles. For more ag-
stereo cameras (Hrabar and Sukhatme, 2009), and Kinect
gressive maneuvers through narrow openings such as un-
(Cunha, Pedrosa, Cruz, Neves, and Lau, 2011). They have
der a table, a three-dimensional representation would be
shown a measure of success in restricted conditions, without
needed.
the presence of sunlight or where the surface characteristics
Given an occupancy map, there are two approaches
of the road and its flanking areas are sufficiently dissimilar.
to compute a motor command: directional space control
Monocular visual cues, on the other hand, readily encode
and velocity space control. In the former, the algorithm se-
road information in most environments.
lects a command that moves the robot to the best neigh-
One visual approach relies on modeling the road ap-
bor location in terms of advancing to the goal while
pearance using color histograms (Kuhnl, Kummert, and
avoiding obstacles. A class of this is called the vector
Fritsch, 2011; Rasmussen, Lu, and Kocamaz, 2009), which
field histogram (Borenstein and Koren, 1991; Ulrich and
assumes that the road is contiguous, uniform, and visually
Borenstein, 1998), which combines an attractive force from
different from the flanking areas (Santana, Alves, Correia,
the goal with repelling forces from nearby obstacles. How-
and Barata, 2010). In addition, to ease the recognition pro-
ever, this technique, as well as related approaches such as
cess, the technique usually simplifies the road shape, as
the nearness-diagram algorithm (Minguez and Montano,
viewed in the image, as a triangle. These assumptions are
2004), suffers from local minima problems because they do
often violated in cases in which there are complex mark-
not explicitly search for complete paths that eventually ar-
ings, shadows, or pedestrians on the road. Furthermore,
rive at the goal. On the other hand, shortest path algorithms
they also do not hold when the road appearance is similar
such as A* (Wilson, 1982) and its derivatives, e.g., D*-lite
to the flanking areas.
(Koenig and Likhachev, 2002), ensure path completion to the
Another way to visually recognize the road direction is
goal.
by detecting the vanishing point (VP). Most systems (Kong,
One difficulty of using a grid map is determining the
Audibert, and Ponce, 2010; Miksik, 2012; Moghamadam,
optimal grid size. An overly large size creates a coarse map
Starzyk, and Wijesoma, 2012) use the consensus direction of
that excludes some needed details, for example in cluttered
local textures and image edgels (edge pixels) to vote for the
settings. A grid size that is too small is also not ideal as
most likely VP. However, edgels, because of their limited
it renders the system inefficient because of the exponential
scope of support, can lead to an unstable result. Further-
increase of the search space. One way to combat such a
more, many systems also attach a triangle to the VP to ide-
problem is through the use of a nonuniform grid size, such
alize the road shape. Our road recognition module (Siagian,
as an octree (Wurm, Hornung, Bennewitz, Stachniss, and
Chang, and Itti, 2013a) improves upon these limitations by
Burgard, 2010).
detecting the VP using more robust contour segments and
Even with an ideal grid size, because of the map dis-
not assuming the triangular shape.
cretization, the resulting path may be too rigid and does not
maximally avoid obstacles. This is because the path con-
2.4. Path Planning sists of straight segments that connect the centers of the
Generally, a path planning system needs a mapping of the visited grid cells. Techniques such as the elastic band ap-
surroundings to generate motion commands. However, a proach (Quinlan and Khatib, 1993) modify and smoothe
few systems do not explicitly model the nearby obstacles, out the resulting rigid path. Another approach by Yang

Journal of Field Robotics DOI 10.1002/rob


412 • Journal of Field Robotics—2014

Figure 2. Diagram for the overall localization and navigation system. We run our system on a wheelchair-based robot called the
Beobot 2.0. As shown on the sensor column, the system utilizes an IMU, camera, LRF, and wheel encoders. We use the camera
to perform vision localization by integrating results from salient landmark recognition and place recognition. For navigation we
use visual road recognition for estimating the road heading, and LRF-based grid-occupancy mapping for obstacle avoidance. The
system utilizes a topological map for global location representation and a grid-occupancy map for local surrounding mapping.
The grid map, which is represented by the blue rectangle on the topological map, is connected with the topological map by the
intermediate goal (green dot).

and Lavalle (2004) achieves the same objective by directly goal is time-consuming considering the number of options
sampling the free space in the map without gridding it in each step.
altogether. One way to alleviate this is using a rapidly explor-
For the robot to behave decisively, it is important to cre- ing random tree (RRT) (Kuffner and LaValle, 2000), which
ate a path that is consistent over time. An extension of the is designed to efficiently search through high-dimensional
shortest path algorithms, called the incremental heuristic space. Another way is to compute a full path in the direc-
search (Koenig, Likhachev, Liu, and Furcy, 2004), encour- tional space and then refine an immediate portion in the
ages such a requirement and speeds up the process by in- velocity space (Marder-Eppstein et al., 2010). This combi-
corporating paths from previous time steps as a prior to nation allows a system to gain the benefit of both tech-
guide the subsequent search. niques. Our presented system is an example of such an
A fundamental problem with directional space control algorithm.
is that it does not deal with kinematic and dynamic con-
straints, for example whether a robot is holonomic. Velocity
space control techniques (Fiorini and Shiller, 1998; Fox et al., 3. DESIGN AND IMPLEMENTATION
1997; Simmons, 1996) address them directly by modeling The flow of our overall system is illustrated in Figure 2. The
factors such as robot geometry, actuator configuration, and left side of the image depicts our mobile platform, Beobot
the robot’s current and achievable velocities. 2.0, with the utilized sensors—a forward-pointing camera, a
The technique requires a good motion model to com- LRF, wheel encoders, an inertial measurement unit (IMU)—
pute precise trajectory estimation. This is in contrast with as well as motors to move the robot. Our IMU (MicroStrain,
directional space control, which does not take into account Inc., 2009) is also equipped with a compass that makes mag-
how fast the robot will get to a certain spot, as long as it netic measurements. It is placed high above the computers
ends up there. In addition, constructing a full path to the and motors to minimize magnetic distortion induced by the

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 413

Figure 3. Diagram of the vision localization system. The system takes an input image and extracts low-level features in the color,
intensity, and orientation domains. They are further processed to produce gist features and salient regions. The localization system
then uses these features for segment classification and salient landmark matching (respectively), along with fused odometry, to
estimate the robot’s location using a Monte-Carlo framework.

robot. Also note that ferrous obstacles in the environment struction, as well as the mechanisms that are utilized to
may affect the IMU. overcome long latencies of some components. We then con-
The two major subsystems, localization and naviga- tinue to the overall system and examine how it accommo-
tion, are displayed in the middle of the figure. The localiza- dates the different component latencies to ensure a timely
tion system (the top of the column) is described in Section system execution.
3.1, while the navigation system (bottom) is in Section 3.2. We start the detailed description with the vision local-
Because earlier versions of these subsystems have been de- ization system.
scribed previously (Siagian and Itti, 2009; Siagian, Chang,
and Itti, 2013a), here we provide summary descriptions of
their technical details, focusing instead on information flow,
3.1. Localization System
timing, and interactions between subsystems, which are key We summarize the vision localization system, which is
to designing a complete working mobile robot system. Note inspired by previously published work (Siagian and Itti,
that both subsystems utilize fused odometry. 2009). We focus on the mechanisms that allow it to run in
The rightmost column in the figure is the hierarchical real time: multilevel hierarchical and multiperception lo-
spatial representation of the robot’s environment. The first calization. As illustrated in Figure 3, the system is divided
level is a global topological map that compactly represents into three stages. The first stage takes an input image and
all the needed metric information for global path planning. extracts low-level visual-cortex-like features, consisting of
The next one is a local grid-occupancy map, which details center-surround color, intensity, and orientation channels.
the robot surroundings. Mechanically, the local map is con- These features are used by two parallel processes to ex-
nected to the global map through an intermediate goal lo- tract the “gist” and salient points of the scene, two exten-
cation that is registered in both maps (see Figure 2). The sively studied human visual capabilities. We define the gist
task of the navigation system is simply to safely reach the features as a low-dimensional vector (compared to a raw
intermediate goal. The local map, which is robot-centric, is image pixel array) that represents a scene and can be ac-
then advanced along the current graph edge in the global quired over very short time frames (Siagian and Itti, 2007).
map to always keep the robot at a fixed location in the local Saliency, on the other hand, is computed as a measure of
map. interest at every image pixel to efficiently direct the time-
For each subsystem, we describe not only the algorithm consuming landmark-matching process toward the most
and its pertinent components, but also the time scale of the likely candidate locations in the image. The gist features
implementation. By analyzing the time scale, we demon- and salient regions, along with fused odometry, are in-
strate the advantages afforded by the hierarchical map con- put to the back-end Monte-Carlo localization (Fox et al.,

Journal of Field Robotics DOI 10.1002/rob


414 • Journal of Field Robotics—2014

1999; Thrun et al., 2000) algorithm to estimate the robot’s graph edges but with exact metric location specification,
position. the search space is effectively continuous one-dimensional.
The topological map, which currently is manually con- That is, even though the locus of the considered locations
structed, is represented by a graph G = {V , E}. The nodes describes a two-dimensional geographical space, it is con-
v ∈ V are augmented with metric information specifying strained to within the path edges. Thus, a small and com-
their coordinate locations. Accordingly, the edge cost We for putationally less demanding N = 100 suffices, even in kid-
each edge e ∈ E is set to the distance between its respective napped robot instances. Also, with this formulation, we
end nodes. The system also defines segments s ⊆ E, each an specify a path as the exact location of the end points and
ordered list of edges, with one edge connected to the next to a list of edges that connects them.
form a continuous path. This grouping is motivated by the The system estimates the location belief Bel(St ) by re-
fact that the views/layouts in a segment are similar and can cursively updating the posterior p(St |zt , ut ) using segment
be classified using gist features. An example of a segment is estimation zt , landmark matching zt , and motion measure-
the highlighted three-edge segment 2 (thicker and in green) ments ut . In addition, the system invokes importance re-
in the map in Figure 3. sampling whenever more than 75% of the particles collapse
The term “salient region” refers to a conspicuous area to within the robot width of 60 cm. It chooses 95% of the
in an input image depicting an easily detected part of the en- total N particles from the existing population by taking into
vironment. An ideal salient region is one that is persistently account their weights wt,i , while it randomly generates the
observed from different points of view and at different times remaining 5%. The latter is done by randomly choosing an
of the day. In addition, the set of salient regions that portray integer among the segment numbers for kt,i and a decimal
the same point of interest is considered jointly, and that set is value between 0.0 and 1.0 for lt,i .
called a landmark. To recognize a landmark and to associate To robustly measure robot motion, we implement a
it with a location, the system employs a training regimen Kalman filter (King, 2008), which fuses the wheel encoders
by manually driving the robot through all the paths in the and the IMU. The filter’s state space is the current absolute
environment. The produced database is quite large, about heading θt and distance traveled dt . Using these estimated
one landmark per meter of path. As a result, unlike the values, we compute the longitudinal distance traveled, ut =
instantaneous segment classification, the landmark recog- dt ∗ cos(θt − θψt ), with θψt being the absolute road direction
nition process can take seconds. However, as opposed to estimate, discussed in Section 3.2.1 below.
just localizing to the segments, salient landmarks produce The system then computes the motion model
coordinate locations. p(St |St−1 , ut ) by advancing the particles by ut and adding
The localization system takes advantage of the con- noise to account for uncertainties such as wheel slippage
trasting segment classification and landmark matching by that can produce large odometry errors. This is done by
coarsely but quickly estimating the robot’s segment loca- drawing a random location from a Gaussian probability
tion in every video frame. It then refines the estimation density p(xt,i |ut , xt−1,i ), where the mean is the previous loca-
to a more accurate metric location when the salient region tion xt−1,i plus the odometry ut , and the standard deviation
matching results are available. In addition, the system also is 3 cm (about one-sixth of a typical single step). If in the
speeds up the database matching process by using contex- process a particle moves past either end of a segment, we
tual information such as current location belief and seg- change the segment number appropriately and normalize
ment classification output. It orders the database items to the excess distance from the end of the original segment. If
be compared from the most likely to be positively matched the original segment ends at an intersection with multiple
to the least. This helps because the search ends, for effi- continuing segments, we randomly select one. If no other
ciency reasons, after the first match is found. Also, to keep segment extends the path, we leave the particle at the end.
the database match results relevant despite the processing For the visual observations, the segment estimator
latency, the system stores the accumulated odometry during (Siagian and Itti, 2007) uses the gist features to classify the
the landmark-matching process. When the results are ready, input scene, and it outputs the likelihood zt = { φt,j }, j =
the landmark recognition module adds the longitudinal 1 . . . , Nsegment , with φt,j being the likelihood value for seg-
movement offset before sending the resulting location to the ment j at time t. The system then updates the weights wt,i
back-end Monte-Carlo localization. of each particle xt,i to
Within the Monte-Carlo framework, we formulate
the location state St as a set of weighted particles St = φt,kt,i
p(zt |xt,i ) = Nsegment ∗ φt,kt,i . (1)
{xt,i , wt,i }, i = 1 . . . , N at time t, with N being the num- j =1 φt,j
ber of particles. Each particle or possible robot location
xt,i = {kt,i , lt,i }, with k being the segment number and l the Here, the likelihood that particle xt,i observes zt is propor-
length traveled along the segment edges normalized by the tional to its segment location estimation value φt,kt,i over
segment length (from 0.0 at the start of the segment to 1.0 the total sum (first term) times the value itself (second
at the end). Because the localization formulation is over the term). The first term accounts for the relative strength of the

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 415

The resulting matches are denoted as zt = { ot,j }, j =


1, . . . , Mt , with ot,j being the j th of the total Mt matched
database regions at time t. The system computes the like-
lihood of simultaneously observing the matched regions
at a given particle location xt,i by updating the weight
wt,i = p(zt |xt,i ):


Mt
p(zt |xt,i ) = p(ot,j |xt,i ). (4)
j =1

Because each salient region match observation is inde-


Figure 4. Matching process of two salient regions using SIFT pendent, the system can multiply each likelihood to calcu-
key points (drawn as red disks) and a salient feature vector, late the joint probability. Each p(ot,j |xt,i ) is modeled by a
which is a set of feature map values taken at the salient point Gaussian where the likelihood value is set to the probabil-
location (yellow disk). The lines indicate the key-point corre- ity of drawing a length longer than the distance between
spondences. The fused image is added to show alignment. the particle and the location where the matched database
region is acquired. The standard deviation of the Gaussian
is set to 5% of the map diagonal to reflect an increased level
segment, while the second preserves the ratio with respect of uncertainty for larger environments.
to the absolute 1.0 maximum value. The system then estimates the current location of the
The landmark-matching process, which is illustrated robot using the highest weighted particle.
in Figure 4, compares incoming salient regions to the land- We illustrate the temporal integration of the localiza-
mark database regions using two sets of signatures: SIFT key tion components with a timeline in Figure 5. The local-
points (Lowe, 2004) within the region bounding box, and a ization system, along with other subsystems, uses a sub-
salient feature vector. The former denotes the shape of the scribe/publish architecture, where each component is an
region, while the latter quantifies its saliency characteris- independent process. The outputting localization module,
tics. We employ a straightforward SIFT-based recognition which is highlighted in the last row, processes observations
system (Lowe, 2004) using the suggested parameters, while from the components as soon as they are available. This
the salient feature vector matching is described here. allows the system to not be explicitly tied to a rigid time
A salient feature vector is a set of feature map values step synchrony. However, because odometry and segment
(Itti, Koch, and Niebur, 1998; Itti, 2000) from the color, inten- estimation come in regular intervals, the system appears to
sity, and orientation channels. These values are taken from produce output regularly from the outside perspective.
a 5-by-5 window centered at the salient point location of The top section of the figure (the first two rows) lists
a region. In total, there are 1,050 features: 7 subchannels the utilized sensors: a camera, which runs 30 Hz, and the
times 6 feature maps times a 5 × 5 array. To compare salient 15 Hz fused odometry. Note that the fused odometry uses
feature vectors of two regions ρ1 and ρ2 , we factor in the a Kalman filter to combine information from the 80,000 Hz
feature similarity sfSim [Eq. (2] and the salient point lo- encoders and the 50 Hz IMU. Even though odometry is
cation proximity sfProx [Eq. (3)] after aligning the regions reported less frequently, the information latency is closer to
(observe the fused image in Figure 4). The former is based that of the encoders. Furthermore, if we consider the robot
on the Euclidian-distance in feature space: speed of under 1 m/s, a 15 Hz or 66.67 ms latency period

Nsf only amounts to a 7 cm localization error.
i=1 (ρ1,i − ρ2,i )
2
Below the sensors are the perceptual components. The
sfSim(ρ1 , ρ2 ) = 1 − , (2)
Nsf raw feature extraction, gist, and saliency computations start
after being triggered by the camera output, as indicated by
with Nsf being the feature dimension. For a match to be the arrow from the first row. The module, which is im-
confirmed, the feature similarity has to be above 0.75 out plemented in parallel over multiple processing cores, takes
of the maximal 1.0. The location proximity sfProx, on the about 25 ms (noted by the length of the solid maroon block).
other hand, is the Euclidian distance in pixel units (denoted The process then idles until the next image arrives. For each
by the function “dist”), normalized by the image diagonal module, we state its throughput in parentheses, with the
length η: output indicated by an arrow that drops down from the
dist(ρ1 , ρ2 ) right end of the processing block. Note that there are also
sfProx(ρ1 , ρ2 ) = 1 − . (3) a number of arrows that seem to do the same from the
η
left side of a block. They are actually pass-through arrows
The matching threshold for sfProx is 95% or within 5% from their respective original processes. The first arrow
of η. from Fused Odometry, for example, is processed by both

Journal of Field Robotics DOI 10.1002/rob


416 • Journal of Field Robotics—2014

Figure 5. Diagram of the localization subsystem timeline. The figure lists all the localization subsystem components: two sensors,
four perceptual processing steps, and a back-end localization estimation. Please see the text for an explanation.

Odometry Tracking and Localization. However, it may look section, we put the whole system together, integrating both
as if Localization takes an input from Odometry Tracking. localization and navigation.
Segment classification, which takes less than 1 ms to The local map represents the robot’s surroundings by
compute, passes its result to landmark matching for search using a 64 by 64 grid occupancy map (Moravec and Elfes,
priming and the localization module for location update. 1985), with each grid cell spanning 15 cm by 15 cm spa-
The landmark-matching process, on the other hand, takes tial area for a 9.6 m by 9.6 m local map extent. We select
3 s on average (0.33 Hz), despite a multithreaded land- 15 cm or 1/4 the robot size (our robot is 60 cm in length and
mark search implementation. However, because odometry width) because it is a balance between sufficiently detailed
tracking keeps track of where the original image is be- depiction and being manageable for real-time update.
ing triggered, the long latency is accounted for. The final As shown in Figure 2, the robot is displayed as a red
localization module itself outputs results at 30 Hz, with rectangle and is placed at the local map horizontal cen-
the input image and, subsequently, segment estimation be- ter and three quarters down the vertical length to increase
ing the trigger. front-of-robot coverage. In addition, there is a layer of grid
A critical point that enables localization to effectively surrounding the map to represent goal locations that are
operate in real time is that each perceptual input already outside the map. The figure shows the goal to be straight
accounts for its own internal latency. In addition, because ahead, which, by convention, is the direction of the road.
the topological map formulation is one-dimensional along Thus a large portion of motion control is trying to minimize
the graph edges and needs only 100 particles to localize, the the deviation between the robot and the road heading.
localization computation itself takes less than 1 ms. Thus Given the robot’s autonomous speed of 0.66 m/s aver-
the system is not burdened by the frequent observation up- age and 0.89 m/s maximum, the 7.2 m (0.75 × 64 × 15 cm)
dates, which allows the location belief to be accurate to mapping of the robot front area affords at least 8.1 s of
the most recent input. Furthermore, the system can still forward look-ahead to properly react to potential static
be expanded with new observations and modalities of any obstacles. On the other hand, dynamic obstacles such as
time scale as long as their latencies are accounted for. Most people have to be evaluated with respect to the local map
importantly, the localization system timely mode of opera- update rate of 20 Hz. Here, using the average human walk-
tion allows the full system to be integrated with real-time ing speed of 1.4 m/s, the system latency of as much as
navigation. 100 ms translates to 0.14 m movement or within one grid
size.

3.2. Navigation System


The components of the navigation system are the visual 3.2.1. Visual Road Recognition
road recognition (Siagian, Chang, and Itti, 2013a) to esti- Our visual road recognition system (Siagian, Chang, and
mate the road direction (described in Section 3.2.1), grid Itti, 2013a), which is illustrated in Figure 6, is a novel van-
map construction (Section 3.2.2), and path planning to gen- ishing point (VP) -based algorithm that uses long and robust
erate motor commands (Section 3.2.3). In the last part of the contour segments, instead of the smaller and often noisier

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 417

Figure 6. Overall visual road recognition system. The algorithm starts by computing a Canny edge map from the input image.
The system has two ways to estimate the road. One is using the slower full recognition step (the top pipeline), where it performs
road segment detection. Here, the segments are used to vote for the vanishing point (VP) as well as to extract road lines. The second
(bottom) is by tracking the previously discovered road lines to update the VP location as well as the robot lateral position. The
tracker is also used to project forward output road lines from the top pipeline, which may fall behind the camera frame rate. The
road recognition system then outputs the VP-based road heading direction as well as the lateral position to the navigation system
to generate motor commands.

edgels. In addition, much like the localization system, it also


implements a tracking mechanism to ensure that it runs in
real time. The output of the system is the absolute road
direction θroad . Note that the road recognition system does
not make use of odometry information.
The system takes the input image and performs Canny
edge detection to create an edge map. From here it has
two ways to recognize the road. One (the top pipeline in
the image) is through a full but slower recognition process,
which detects segments in the edge map, votes for the most
likely VP, and extracts the road lines that are extended from
the VP. On the other hand, the bottom pipeline runs in real Figure 7. VP voting. The VP candidates are illustrated by disks
time because it only tracks the already available road lines on the calibrated horizon line, with radii proportional to their
to update the VP location. Because the system has both respective accumulated votes from the detected segments. For
mechanisms, it balances accuracy and timeliness. clarity, the figure only displays segments that support the win-
The recognition pipeline uses a Hough transform to ning VP. A segment s contributes to a vanishing point p by the
product of its length and distance between p and the intersec-
find straight segments [OpenCV (Bradski, 2001) implemen-
tion point of the horizon line and a line extended from s, labeled
tation] in the edge map. The process then filters out seg-
as h intercept(s).
ments that are above the manually calibrated horizon line,
near horizontal (usually hover around the horizon line),
and near vertical (usually part of close-by objects or build-
ings). The remaining segments vote for candidate VPs on below:
the horizon line, regularly spaced at 1/16th image width  
apart. |h intercept(s), p|
score(s, p) = 1.0 − ∗ |s| (5)
To cast a vote, each segment is extended on a straight μ
line to intersect the horizon line (see Figure 7). The voting
score of segment s for a vanishing point p is the product of with μ set to 1/8th of an image width. μ is also the voting
segment length |s| and the inverse proportion of the prox- influence limit, with any farther segment not considered.
imity of p to the intersection point of the extended line and Note that the exact values of the constants in the algorithm
the horizon line, denoted by h intercept(s) in the equation do not significantly affect overall system performance.

Journal of Field Robotics DOI 10.1002/rob


418 • Journal of Field Robotics—2014

two-pixel spacing. The reason for using the horizon sup-


port line is because we want each new candidate line, when
extended, to intersect the horizon line on the same side of
where it came from. That is, if the candidate line intersects
the bottom of the image on the left side of the VP, it should
intersect the horizon line on the left as well. We find that
true road lines almost never do otherwise.
The top pipeline also uses the same tracking mecha-
nism for other purposes. One is to refine the VP estimate
because the voting scheme only selects a location among
sparsely spaced hypotheses at the calibrated horizon line.
By using the intersection of road lines instead, the estimate
is much more accurate. The second use of tracking is to help
the recognition pipeline catch up to the current frame by
Figure 8. Line-tracking process. The system tracks a line equa- projecting the newly discovered road lines through unpro-
tion from the previous frame (denoted in yellow) in the current cessed frames accumulated during the recognition compu-
edge map by finding an optimally fit line (orange) among a set tation.
of lines obtained by shifting the horizontal coordinate of the
At the end of both pipelines, the system converts the VP
previous line’s bottom point (bottom of the image) and horizon
pixel location to angle deviation from the road direction by
support point (intersection point of line and the horizon sup-
linear approximation, from 0◦ at the center of the image to
port line) by +/− 10 pixels with two-pixel spacing. The set of
27◦ at the edge (or 0.16875◦ per pixel). Here we assume the
candidate shifted points is shown in red on the bottom and on
the horizon support line. camera is fitted with a standard 60◦ field-of-view lens. The
system then calculates the absolute road heading by sum-
ming the angular deviation with the current IMU reading.
To improve the robustness of the current VP estima- By approximating the road direction in the absolute head-
tion (denoted as ψt ), the system multiplies the accumulated ing, the system does not directly couple the vision-based
scores with the inverse proportion of the proximity to the road recognition with motion generation. Instead it can use
previous VP location ψt−1 : IMU readings to maintain the robot heading. This, at times,
  is essential when the road is momentarily undetected visu-
 |p, ψt−1 | ally, e.g., when going under a bridge.
ψt = arg max score(s, p) ∗ 1.0 − . (6) To improve the heading estimation robustness, the sys-
p
s
μ
tem, which runs at 30 Hz, considers the last 100 absolute
We then use the segments that support the winning VP, headings using a histogram of 72 bins, each covering 5◦ . It
indicated in red in Figure 7, to extract lines for fast road only updates the road heading estimate if 75% of the val-
tracking. The system first sorts the supporting segments ues reside within three bins, updating it with the average
based on their lengths. It then fits a line equation through the value within those bins. In addition, the system also discards
longest segment using least-squares, and iteratively adds heading inputs that are farther than 30◦ from the current
the remaining nearby segments (if all edgel in the segment estimate, which usually occurs during obstacle avoidance.
are within five pixels) to the line, always reestimating the The resulting filtered VP-based absolute road direction is
equation after each addition. Once all segments within close called θψt .
proximity are incorporated, the step is repeated to create This angle by itself is actually not accurate enough to
more lines using unclaimed segments, processed in length properly drive a robot. Even though the general heading is
order. To discard weakly supported lines, the system throws correct, a single degree of bias can slowly drift a robot to one
away lines that are represented by fewer than 50 edgels in side of the road. To rectify this, the system locks the robot to
the map. its initial lateral position by storing the locations where the
Given a line equation from the previous frame and the road lines intersect with the bottom of the image and trying
current edge map, the system calculates a new equation by to maintain these coordinates.
perturbing the line’s end points in the image. The first is the The system converts the unit-pixel lateral deviation to
road bottom point, which is the on-screen intersection point metric length dev using an empirically calibrated distance
with either the bottom or the side of the image. The second of 9.525 mm per pixel. The system then corrects for lateral
is the horizon support point, which is the line’s intersection deviation by adding a small proportional bias to θψt ,
point with a line 20 pixels below the horizon line (called the
θroad = θψt + a tan[δ/long(g)], (7)
horizon support line), as illustrated in Figure 8.
The system tries to fit a new line by shifting the hori- with θroad being the final absolute road direction to be sent
zontal coordinate of both end points by +/− 10 pixels with to the navigation system. The second term, which is the

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 419

Figure 9. The conversion of the VP shift from the center of the image and of the lateral deviation from the original location to the
robot’s lateral position. In the VP shift, each pixel (in this case 120 pixels) away from the center is multiplied by 0.16875◦ , while
on lateral deviation, every pixel (in this case 22.8 pixels) away from the original position is equal to 9.525 mm. Since the lateral
deviation δ is of low priority, it only needs to be corrected by the time the robot arrives at the top of the local map or in the goal
longitudinal distance long(g), which is 7.2 m. The resulting angle computes to a tan[δ/long(g)] = a tan(22.8 ∗ 0.009525/7.2) = 1.73◦ .

Figure 10. Diagram of the road recognition timeline. The figure lists all the components in the system: camera and IMU, edge-map
computation, the two visual recognition pipelines, and the final road heading estimation.

lateral deviation correction, is the angle created by the ob- between the robot and the road heading, for both straight
served deviation δ and the current longitudinal distance and curved roads. Siagian, Chang, and Itti (2013a) have
from the robot to the goal location long(g), computed us- shown that the system is able to follow curved roads. In that
ing the a tan function. See Figure 9 for the illustration. The paper, we experimentally tested the system on four paths,
small correction is sufficient because the robot is already go- one of which (HWY path) contained curved segments. On
ing in the right direction and lateral correction only needs to average, the system performed only slightly worse (10 cm
guard against long-term drift, in the scale of 1 m for a 30 m lower) than its average performance. The key to following
traversal. this type of road is being able to extract as much of the
Figure 10 displays the road recognition system time- straight image segments on the curved path as possible. In
line. The canny edge map computations run in 3 ms and this case, these segments point toward a moving vanishing
precede both the full recognition (third row) and road line point, which coincides with the changing direction of the
tracking pipelines (fourth). The former runs close to 40 ms, road.
with Hough segments lasting for 20 ms, VP voting for 3 ms,
and road line extractions for 15 ms. The line-tracking pro-
cess runs in 5 ms, depending on the number of lines being 3.2.2. Constructing the Grid-occupancy Map
tracked. Usually the system tracks no more than six road The system updates the grid-occupancy values using LRF
lines at a time. proximity distances that are already appropriately aligned
The final road heading estimation itself is triggered by with respect to the road heading. The grid values are initially
the IMU output, which reports the current robot absolute set to −1.0 to indicate unknown occupancy. The grid cells
heading. This allows the system to quickly correct deviation are updated only if they are hit by or in line with the LRF

Journal of Field Robotics DOI 10.1002/rob


420 • Journal of Field Robotics—2014

Figure 11. Three stages of the grid occupancy map. The first (leftmost) image is the discrete map D of all the discovered occupancy
values, with the white color being unoccupied, black fully occupied, and cyan unknown. The second map is the blurred map B,
while the last occupancy map F is biased by the most recently produced path to encourage temporal path consistency. Also notice
that the unknown grid cells, such as the ones below the robot, do not extend their influence in either the blurred or the biased map.

rays. For each ray, the system draws a line from the robot heading direction, aligning with the road. The robot move-
location to the end point. If the laser line goes through a ment estimation, which is computed using fused odometry,
grid cell, indicating that it is free, the occupancy value is set is sufficiently accurate for grid shifting purposes. Here per-
to 0.0. If the laser line ends on the grid cell, it is set to 1.0. fect occupancy alignment is not critical because the robot
This discrete valued map D is stored and updated at each quickly passes through the local environment and discards
time step. any briefly accumulated errors. Unlike in SLAM formu-
To generate a path, the system first blurs map D using lations, the system does not intend to create a complete
the equation below. The resulting map is called a blurred global map, which requires reasoning about motion uncer-
map B. tainties. Even when the system encounters an occasional
large odometry error, it can quickly recover from grid map
B(i, j ) = max D(i + di , j + dj ) misalignment because the occupancy values of every grid
−nsize <=di ,dj <=nsize
 visible by the LRF are updated in each frame. This way
⎛ ⎞
di2 + dj2 the misalignment only affects the grids that are in the local
∗ ⎝1 − ⎠. (8) map and in the robot’s current blind spot. This is fine be-
(nsize + 1)2 + (nsize + 1)2 cause the robot always turns toward the direction where it
is moving, thus exposing as well as correcting the erroneous
Here the value B(i, j ) is the maximum of the weighted blind spots.
occupancy values within an nsize = 4 neighborhood, where See Figure 11 for an illustration of the grid occupancy
each neighbor’s discrete value D(i + di , j + dj ) is inversely process. It also includes a path biasing step, described in the
weighted by its distance to (i, j ). By setting the influence next section.
neighborhood to 4 or a full robot width, the repelling force The grid construction timeline is shown in Figure 12.
of the obstacle starts to influence the robot half of its length The process, because the map uses a relatively coarse 15 cm
away from its surface. Note that in the calculation of B(i, j ), by 15 cm resolution, has a low computational requirement
D(i, j ) is included in the neighborhood with the distance of less than 1 ms and outputs results at 20 Hz after receiving
weight of 1.0, while the weight terms for the farthest neigh- each LRF readings. The path planning step, on the other
bors are evaluated as 0.2. In addition, the blurring is not hand, takes substantially longer.
applied to cells with unknown occupancy status, to keep
their influence to a minimum. After blurring, these cells are
given a conservative value of 0.99 to allow the path plan- 3.2.3. Path Generation
ning algorithm to go through them if all other options have The system first applies an A* search to the local map to cre-
been exhausted. ate a coarse path trajectory to the goal. For each grid cell, the
As the robot moves longitudinally toward the goal, search process creates eight directed edges to the surround-
the grid values are shifted down one row at a time, keep- ing nodes. The edge weight is computed by adding the edge
ing the robot location as is. This means that the local map length and F (i, j ) value of the destination cell, which is the
is robot-centric in denoting the location, but allocentric in sum of the destination B(i, j ) and a bias term to encourage

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 421

Figure 12. Diagram of the grid map construction timeline. The figure lists all the components: the three utilized sensors, the road
heading estimation, and the grid map construction process.

path consistency. The bias term is calculated by adding the To allow for maximum obstacle avoidance and to
proximity of the destination cell to the corresponding clos- smooth out the discrete path, the system applies a modified
est point in the previous A* path, divided by the local map elastic band algorithm (Quinlan and Khatib, 1993) using the
unit pixel diagonal length λ, current LRF laser scans as individual obstacles. Each path
point undergoes an iterative process for tmax = 20, where at
(Pk,i − i)2 + (Pk,j − j )2 each time step the point is moved by an attractive force A
F (i, j ) = B(i, j ) + min .
k=1 to npath λ to straighten the path, and a repulsive force R to repel it
(9) perpendicular to the closest obstacle. The attractive force is
Here the previous path P is defined as k ordered and defined as
connected points Pk , with Pk,i and Pk,j referring to the i Pt,k+1 + Pt,k−1
At+1,k = − Pt,k . (10)
and j coordinate locations of the path point. To save time, 2
the algorithm does not add edges to cells going into fully On the other hand, the repulsive force R uses a vector
occupied nodes D(i, j ) = 1.0. Consequently, to include all Ct,k = Pt,k − closestObs(Pt,k ), which extends from the clos-
non-fully-occupied cells, all F (i, j ) values are capped at 0.99 est obstacle closestObs(Pt,k ) of path point Pt,k to that point.
for nodes D(i, j ) = 1.0. In addition, the system zeros out the The force can then be defined as
bias if the distance to the closest obstacle is less than two Ct,k m − |Ct,k |
unit pixels, to allow for flexibility in finding an optimum Rt+1,k = ∗ . (11)
|Ct,k | |Ct,k |
nearby path.
We find that adding a bias and generating a path on Here the first term normalizes the vector Ct,k , while
every frame, which is computed efficiently because of the the second term creates an exponential magnitude weight,
coarse 15 cm grid size, is more practical for encouraging where the closer the obstacle is, the larger is the repul-
temporal smoothness. The more burdensome alternative sive force. Note that, when identifying the closest obstacle
would be to advance and modify an old path whenever points, the system models the robot body more accurately
there are changes to the local map. Here, generating new as a square. Here the maximum obstacle distance m = 5 lim-
paths allows for the fastest possible reaction to new changes its the obstacle consideration to within five grid units of the
without having to explicitly detect them. robot, otherwise Rt+1,k = 0.
There are times when A* cannot return a path to the Equation (12) combines the two factors:
goal, usually because the robot is moving into a dead-end 
1 2
and the goal is on the other side of the wall. In that case, Pt+1,k = Pt,k + Wt ∗ ∗ At+1,k + ∗ Rt+1,k . (12)
3 3
the system creates a straightforward path to keep the robot
moving ahead until it is directly in front of a wall, within The weight Wt decays overtime, becoming linearly
two robot lengths. This ensures that there is no way to the weaker 5% at a time from 0.2 to 0.01, or Wt = 0.2 ∗
goal, and not because of inaccuracies in the local map con- (1.0 − tmax
t
).
struction, especially using far away LRF values. In cases To compute the motor command using the smoothed
in which the robot is truly stuck, the system invokes an path, the system takes the path’s first step and applies the
exploratory procedure by spinning the robot 360◦ . In the fu- dynamic window approach (DWA) (Fox et al., 1997). It
ture, we plan to implement a more robust procedure, such not only calculates the deviation between the current robot
as moving backwards as well as considering pertinent con- heading and the heading of the initial path step, but it also
textual information. takes into account the robot dynamics. This is done by only

Journal of Field Robotics DOI 10.1002/rob


422 • Journal of Field Robotics—2014

Figure 13. The three stages of the path trajectory and motion command generation. The figure displays both the full local map on
the left and a zoomed part within the red rectangle on the right for a clearer picture. The system first runs A* to generate a rigid
path, shown in green in both images. The system then deforms the rigid path (in blue) to optimally and smoothly avoid obstacles.
Finally, the system uses the dynamic window approach (DWA) to generate the actual command, one that accounts for the current
robot dynamics. That resulting command is modeled accurately as a curve in orange. In addition, the figure also displays the
predicted location of the robot, noted by the red disk.

considering commands within the allowable velocity space While other modules influence the robot motion in al-
based on the estimated velocity calculated from the pre- most every time step, localization updates the intermediate
vious command, its execution delay, and motor and body goal in the local map quite infrequently. For the most part,
dynamics. Furthermore, the approach then accurately sim- the goal location is placed in the canonical straight-ahead
ulates each resulting arc trajectory of the allowable com- position (see Figure 13). There are, however, a few occur-
mands (translational and rotational velocity), modeling the rences when the goal is moved, such as during a turn proce-
robot shape to test whether any part of the robot would hit dure, or when the goal is within the local map dimensions.
an obstacle. In the end, the system selects a motor command For turns, the localization system sends a command to the
that is expected to best achieve the target heading and veloc- local map to initiate a turn. It specifies the turn angle as the
ity, while staying clear of the obstacles. Figure 13 illustrates angle created by the involved incoming and outgoing edges
the three stages of the motion generation. in the map.
Note that even though the displayed DWA trajectory The navigation system then moves the goal to the ap-
deviates from the smoothed path, this is only for one step. propriate location, triggering a path generation that forces
When the next time step is executed, the front of the robot the robot to turn immediately. The system then lets the robot
is expected to be in the dotted location. At that point, the move until its angle is aligned to the goal direction, usually
system is going to recompute the path again. After all the in less than 2 s. When the turn is completed, the local map is
steps are computed and executed, the actual robot trajectory rotated so that the goal location is at the canonical straight-
usually ends up similar to the smoothed path. forward position, and the robot can proceed in a normal
The path planning timeline, which integrates the fashion.
subsystems in our presented system, can be viewed in The framework has the flexibility to allow the robot to
Figure 14. The path planning subsystem creates the occu- still perform robustly despite not using some of the subsys-
pancy grid map and computes A* in 5 ms, the path smooth- tems. In the absence of the localization system, the robot can
ing step (elastic band algorithm) in 1.8 ms, and the DWA keep itself on the road. If the road recognition is also absent,
step in 5 ms, which adds up to about 12 ms. The output path the robot then simply tries to move in the direction of its
itself is triggered by the grid-occupancy map output, which heading when the system is turned on, while also avoiding
is triggered by the LRF output. obstacles. In addition, in the case in which any of the sensors

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 423

Figure 14. Diagram of the full system timeline. The final motor command to be executed by the robot (bottom row) is generated
by the path planning subsystem. It utilizes all of the previously described subsystem, with grid map construction being the trigger.

(camera, LRF, and IMU) is broken, it is easy to set up the dimension by the road recognition module, but scaled to
corresponding subsystems (localization, road heading esti- 160 × 120 for vision localization. We did not apply camera
mation, and grid map construction, respectively) to trigger calibration or lens distortion correction, which may help in
on other events, such as fused odometry. salient region matching.
Beobot 2.0 has a 16-core 2.6 GHz distributed com-
puting platform, which we take full advantage of by em-
4. TESTING AND RESULTS
ploying a subscribe/publish architecture to manage all
We rigorously tested the system using our robot, Beobot 2.0 functionalities, from sensor data acquisition, localization,
(Siagian, Chang, Voorhies, and Itti, 2011), pictured in Figure and navigation to actuation. In addition, the setup simpli-
2, at the University of Southern California campus (USC) in fies the expansion process if we want to add capabilities
Los Angeles. We constructed three routes of 511.25, 615.48, such as object recognition and human robot interaction.
and 401.16 m length (paths shown in Figure 15 in red, blue, The full code is available in the INVT toolkit page in Itti
and green, respectively) through many parts of the cam- (2012).
pus. The goal was to cross the finish line at the end of the We selected routes that make up a fairly complete rep-
route, wherever the robot was laterally on the road. In the resentation of the different environments at USC. They are
figure, we also include paths in black, which are roads that also picked to vary the localization and navigation chal-
have been traversed during system development but are lenges and conditions, which are listed in Table I. Each route
not used in the documented experiment. They show the ex- consists of four segments. Between them are intersections
tent of the network of roads that we used during testing. In that force the system to make decisions about which road
addition, some of these paths are currently too dangerous to take. Scenes from each route are displayed on separate
for the robot to traverse. One is the leftmost portion of the rows in Figure 17. The first route is called the libraries route
network. This is a narrow sidewalk with a steep boundary (LIBR) because the path goes through a cluster of libraries
cliff and people walking behind and toward the robot. Cur- at the heart of the campus. This route includes roads with
rently, the situation forces the robot either to quickly get off some of the densest crowds. The second route at the south
the sidewalk or be stuck thinking that the road is a dead part of USC goes through the professional schools (PROF)
end. neighborhood, where there are more trees. The third route,
Beobot 2.0 logged over 3 h and 17 min, or 10.11 km going toward the north part of the campus, is called the
of recorded experiment conducted within a three-month athletic (ATHL) route. This rounds out the variety of scenes
winter period (December through February). The average in the experiments.
robot speed during autonomous driving is 0.66 m/s (about There are parts of the full system that have been pre-
half the average walking speed of people), with a maximum viously published. Siagian and Itti (2009) showed that the
of 0.89 m/s. The difference is caused by a dropping battery vision localization system can maintain comparable results
charge as the experiment wears on. A video summary can be on a number of lighting conditions at different times of the
found in Siagian, Chang, and Itti (2013b), while a snapshot day, from the brightest (noon time) to the darkest (early
is displayed in Figure 16. evening), and in real time (Siagian et al., 2011). Siagian,
For the experiments, as seen in Figure 2, we use a sub- Chang, and Itti (2013a) also demonstrated that the visual
set of Beobot 2.0’s sensors: a front-facing monocular camera road recognition system performs better than the state of
and LRF (117 and 60 cm above the ground, respectively), the art (Chang et al., 2012; Kong et al., 2010; Rasmussen, Lu,
as well as wheel encoders and an IMU. The camera out- and Kocamaz, 2009) on a standard (offline) dataset (Chang
puts (320 × 240)-pixel images, which are kept at the same et al., 2012).

Journal of Field Robotics DOI 10.1002/rob


424 • Journal of Field Robotics—2014

Figure 15. Map of the experiment routes at the University of Southern California. They are the libraries (LIBR) route, indicated in
red, the professional schools (PROF) route in blue, and the athletic complex (ATHL) route in green. The remaining routes in black
are not used in the documented experiment but have been traversed during system development.

In this paper, because the full system connects the nav- fully autonomous. In addition, we systematically manually
igation and localization modules, we examine the effect of drive the robot on a straight line at different offsets from
one module upon the other. In addition, since both modules the center of the road to gauge how navigation error affects
have been shown to be robust in the presence of various localization results. We also isolate localization results at the
lighting conditions, we focus on the effect of crowding on junctions. Junction is a critical part of a route because it is
our system. That is, we test the system in two conditions, where a robot has to make a turn. In the navigation section,
namely low and high crowd, the former during off-peak we evaluate how well the navigation system maintains the
hours and the latter during peak hours with many pedes- robot’s lateral position throughout traversal, and how it is
trians on the road. Siagian, Chang, and Itti (2013a) showed affected by obstacle avoidance.
that the road recognition system is affected by crowding Altogether, we evaluate the system using 12 full path
more so than lighting. Here, we complete the analysis by traversals, four for each route, for all combinations of con-
rigorously testing the whole system. ditions: autonomous mode in low crowd, manual driving
We carry out a number of testing procedures, measur- in low crowd, autonomous mode in high crowd, and man-
ing different metrics. In Section 4.1, we focus on the localiza- ual driving in high crowd. In addition, we also add four
tion system, while Section 4.2 focuses on navigation. In the 180 m manual runs to test the effect of lateral shift on lo-
former, we isolate the localization problem by comparing calization results. Although there are many more environ-
its performance when the robot is manually driven versus ments in which we can test the robot, these selected sites

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 425

Figure 16. A snapshot of the system test run. The top-left (main) image displays the robot moving toward the goal. Below is the
road recognition system, which identifies the road heading (indicated by the green disk) and the lateral position of the robot (blue
stub on the bottom of the image). To the right are three salient landmark matches as well as the segment estimation histogram
below. The top-right of the figure shows the global topological map of the campus, with the current robot location indicated by a red
square, and the goal by a blue star. It also draws the current segment in pink, matching the segment estimation output underneath
landmark matching. Finally, the right bottom of the figure displays the local grid occupancy map, which shows a robot path to
avoid an approaching pedestrian.

Table I. Experiment site information.

Route Pedestrian Total Total Total Obstacle


Length Density Weather/Lighting Speed Training Testing Time Avoidance
Site (m) (Person/m) Condition (m/s) Frames Frames (s) Time (s)

Library 511.25 0.19 4–6 p.m., partial sun 0.61 16,621 13,825 838.35 29.99
0.97 12–2 p.m., sunny 0.44 16,688 34,979 1,165.88 73.13
Professional 615.48 0.06 4–6 p.m., cloudy 0.73 21,488 20,000 840.79 0.00
0.36 12–2 p.m., sunny 0.89 9,702 20,773 692.37 46.10
Athletic 401.16 0.04 4–6 p.m., clear sky 0.77 19,022 11,017 522.21 13.67
0.43 11 a.m.–6 p.m., sunny 0.68 9,102 17,499 583.24 27.10

encompass a large part of the current challenges that out- decided to use this metric because the system never com-
door autonomous mobile systems face. We believe there are pletely failed to localize, and evaluating displacement error
many lessons that can be learned from testing the robot at can provide a better understanding of how it performed.
these challenging sites and under the proposed conditions. The error measured here is in the longitudinal direction,
along the graph edges of the topological map. We record
the ground truth by sending signals to the system, indi-
4.1. Localization Testing cating the true current robot location. The signals are sent
We measure the localization system accuracy by compar- at specific parts of the path and are spaced appropriately
ing its output with manually obtained ground truth. We (as small as 3 m) to accurately record the robot’s location

Journal of Field Robotics DOI 10.1002/rob


426 • Journal of Field Robotics—2014

Figure 17. Examples of scenes from the libraries (LIBR), professional schools (PROF), and athletic complex (ATHL) route, each
displayed in its respective row. The figure also depicts various challenges that the robot encountered, such as pedestrians, service
vehicles, and shadows. These images are adjusted to remove the bluish tint in the original images for better display.

throughout the experiment. Assuming the robot is running creating the box plots. In addition, because our result distri-
at approximately constant speed, we use interpolation to fill butions are not Gaussian, we use nonparametric statistical
in the ground truth between signal points. Note that these significance tests. Because of this, we do not test for inter-
signals are only processed after testing and are not used action between effects, e.g., whether the crowd level causes
during run time. For our experiments, we manually initial- greater error for autonomous navigation than for manual
ized the system with the starting location. In Siagian and Itti driving.
(2009) we have shown that the system can relocalize from a The results show that the system is able to satisfac-
kidnapped robot as well as random initialization. Here we torily localize within 0.97 m (median) of the true location
decided not to do so because the system currently does not while driving autonomously and reaching goals of more
converge to the true location in a reasonable amount of time than 400 m away. The farthest localization disparity is
with 100% certainty. 7.15 m, which happens in the middle of a long segment,
To create a landmark database and a segment classi- allowing the system to recover. The main source of error,
fier, we run the robot manually through the routes to take which is prevalent throughout all conditions, can be at-
images for our training procedure (Siagian and Itti, 2008b). tributed to the less exact nature of the salient region match-
We have systematically shown in Siagian and Itti (2009) for ing step. To speed up the process, the system uses small
the localization system and Siagian and Itti (2007) for the salient regions of about one-third the down-scaled 160 by
segment classifier that training at different times with 3-h 120 input image. For landmarks that are near the robot,
intervals endows our system with lighting invariance. In the small size regions are sufficient to pinpoint locations
those papers, we tested the system from 9 a.m. to late after- with high accuracy. In fact, whenever the system observes
noon, after performing the prescribed training runs multiple nearby landmarks, the localization error tends to decrease.
times. We found that the segment classification errors are However, when the landmarks are far away from the robot,
within 5% of the average, while the localization errors are they can appear identical from a large range of viewing dis-
within 60 cm, regardless of lighting condition. Thus, here, tances. Because of the low discrimination power, oftentimes
we test the system at different times of day, but only with the system cannot quickly correct a sizeable steady-state
the training and testing taken in similar lighting for each error.
site. Table I lists the number of training and testing images The main problem is that there is no way of knowing
involved. which landmarks are far from the robot as the system only
The localization box plots can be viewed in Figure 18 records the robot location when viewing them. A possible
for all three sites as well as the total, in both crowd con- labor-intensive solution without incorporating long-range
dition and driving mode. We also show in Figure 19 an 3D sensors would be to annotate the estimated distances
example run from the PROF route under a low crowd con- of the landmarks to the robot. This way, the system would
dition for more detailed observation. To reduce the effect of know that landmarks in the form of the tops of buildings
dependence in the samples, we average the errors every 30 are hundreds of meters away and would not be as easy to
samples or within 1 s (localization output is 30 Hz) before discriminate as signs on top of a nearby door.

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 427

Figure 18. Box plot of metric localization error from different routes, for both manual and autonomous navigation, and in both
low and high crowd conditions. In addition, the figure also reports the total performance accuracies for each condition. The figure
shows that the overall autonomous mode error median is 0.97 m, while medians of the various conditions remain low, from 0.75
to 1.5. Each box plot notes the median (the thicker middle line), the first and third quartile, and the end whiskers, which are
either the minima/maxima or 1.5 * IQR (interquartile range). The stars above the lines joining various box plots note the statistical
significance, with one star indicating p < 0.05, two stars p < 0.01, and three stars p < 0.005.

In terms of the effect of routes on system performance, are likely to be farther away, thus reducing the accuracy of
the data show that it is statistically significant (Friedman the localization system.
test, p < 0.005). As shown in the figure, the Athletic (ATHL) In addition, we also observe statistically significant lo-
route generally produces lower error distributions than that calization performance differences between autonomous
of the libraries (LIBR) and professional school (PROF). We and manual-drive navigation (Friedman test, p < 0.005)
found that the disparity comes from a sustained spike in with the autonomous drive median of 0.96 m being lower
error, which often occurs when the robot is driving on a than that of manual driving (1.00 m). This is quite surprising
long segment during which the system continually utilizes given that in autonomous drive the robot can veer in and
faraway silhouettes at the end of the road as a landmark. out of the middle of the road due to the granularity of the
An example is taken from the last segment of the PROF road heading perception. That is, it takes substantial devia-
route, where the system sustains errors above 3 m for more tion from the road center for the system to react to the error.
than 50 m (see Figure 19). As the robot is moving through Furthermore, it can also be attributed to robot response de-
a corridor (shown in Figure 20), the light at the end makes lay, uneven ground surfaces, and motors not having equal
that part of the image very salient, rendering other parts efficiency, biasing the robot to drift to one side. In man-
less conspicuous by comparison. ual drive, on the other hand, the robot moves steadily in
As for the difference between low and high crowd con- the middle of the road, which is the same vantage point as
ditions, the difference is statistically significant (Friedman during training.
test, p < 0.005) but with near equal medians (0.98 m low When the robot is not moving at the center of the road,
crowd vs 0.96 m high crowd). Most times, the localization there are fewer landmarks that are matched with the ones
system is able to identify enough landmarks among the stored in the database, which lowers the system accuracy.
pedestrians just as well as when the crowd is much sparser We demonstrate this point by manually driving the robot
because our training procedure rejects moving objects as at different lateral positions, 0, 1.33, 2.33, and 3.33 m away
possible landmarks during training (Siagian and Itti, 2008b). from the road center at the first segment of the LIBR route.
However, there are times when pedestrians do affect the sys- We selected this segment because it has sections with many
tem, particularly when they occlude the landmarks that are nearby salient landmarks as well as sections that primarily
at ground level. These landmarks tend to be the ones that use a faraway silhouette at the end of the segment.
are closer to the robot. When this happens, the robot has to As we can see in Figure 21, there are statistically signifi-
rely on landmarks that are higher above pedestrians. They cant differences in performance between 0 m lateral position

Journal of Field Robotics DOI 10.1002/rob


428 • Journal of Field Robotics—2014

Figure 19. Detailed diagram of localization error while the robot autonomously navigates the professional schools (PROF) route
during a low crowd condition. The error is indicated by a gradation of red color, with the darker hue denoting higher error.

Figure 20. Landmark recognition in a building corridor at the last segment of the PROF route. The opening at the end is the only
salient landmark that is consistently recognized at different distances, with minimally observed changes throughout the traversal.

and the others, with the system becoming increasingly less far away from the center of the road, instead moving in
accurate as it moves away from the center. The good news and out of the center, the system can match the localization
is, even at a deviation of 3.33 m, the localization system still performance of the manual driving.
does not completely misplace the robot. This leads us to be- Another aspect of the testing that we observe closely
lieve that, as long as the robot does not permanently travel is during obstacle avoidance. In total, in our data, there are

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 429

Figure 21. The localization error box plot for lateral deviation
of 0, 1.33, 2.33, and 3.33 m from the center of the road. All the
error differences compared to that of driving in the middle of
the road are statistically significant based on the Wilcoxon rank
sum test (p = 1.093 ∗ 10−7 vs 1.33 m, p = 0.015 vs 2.33 m, and
p = 2.476 ∗ 10−5 vs 3.33 m).

Figure 23. Plot of turning locations throughout the experi-


ment. The black partially blocked arrow coming from the bot-
tom of the figure indicates the direction from which the robot
is coming. It is pointing toward the gray and black circle rep-
resenting the center of the intersection, which is the ideal turn
location. The color of the dots, which corresponds to the color
of the arrows, indicates where the robot is trying to go: red is for
turning left, green for right, and yellow for straight. Each red,
yellow, and green dot represents where the robot was when
turning (the robot turns in place).

cle avoidance traversals for both crowd levels combined


(Friedman test, p < 0.005). It should be pointed out that
the median difference for the last comparison is only 1 cm,
Figure 22. Box plot of localization errors during obstacle 0.95 m during obstacle avoidance versus 0.96 m during
avoidance on the different routes, and in both low and high nonobstacle avoidance. This stability, which is also shown
crowd conditions. Note that the result for the PROF route dur- by the respective distribution shapes, demonstrates that the
ing low crowd is missing because the robot did not execute a localization system can endure these significant but short
single obstacle avoidance maneuver on that run. perturbations. At times we see that an avoidance maneu-
ver actually allows the robot to go back to the center of the
road, improving the localization accuracy. In fact, on aver-
25 trajectory-changing obstacles that affect the robot’s head- age, the localization error actually decreases from 1.39 to
ing. This definition differentiates a select few from many 1.09 m before and after avoidance.
other nearby pedestrians seen in the camera but not close For the most part, the amount of localization error car-
enough to warrant avoidance. A majority of them only re- ried by the system is manageable, especially in the middle
sults in slight deflections in the robot’s trajectory, and not of segments, where the task of the robot is mainly to pro-
a clearly visible maneuver. Refer to Table I for the accumu- ceed forward. However, it becomes critical when the robot
lated times that the robot spent avoiding obstacles. is trying to turn to the next segment at an intersection. In
As shown in Figure 22, we observe statistically sig- Figure 23, we plot the turn locations with respect to the
nificant performance differences between low and high center of the intersection.
crowd conditions during obstacle avoidance maneuvers Fortunately, considering that the robot is traveling on
(Friedman test, p < 0.005), low and high crowd condi- campus roads of 6–8 m in width, it never misses the road
tions during nonobstacle avoidance traversals (Friedman completely. Even when the robot needs to turn to narrower
test, p < 0.005), as well as between obstacle and nonobsta- roads, it manages to do so because these roads tend to be in

Journal of Field Robotics DOI 10.1002/rob


430 • Journal of Field Robotics—2014

ceptable given the challenges that the robot faced. However,


clearly there are a number of aspects that need improving.
The data show that there is a statistically significant dif-
ference between the low and high crowd conditions (Fried-
man test, p < 0.005), with the median of 1.23 m for low
crowd vs. 1.55 m for high. As mentioned in the previous
section, excluding numerous slight changes of direction,
there are only a few (25 in total) obvious obstacle avoidance
situations. One reason for this is because the 7.2 m front
extent of the local map allows the robot to register obsta-
cles early enough to move away gracefully. In addition, the
robot only encounters a small number of static obstacles
because it is moving in the middle of the street. Also, be-
cause of the robot’s appearance and relatively slower mov-
ing speed, pedestrians are aware of it approaching them.
In some cases, however, the robot does have to break from
its path to avoid collision. And sometimes, the navigation
performance suffers because of it.
It is important to note that the starting point of an avoid-
ance maneuver’s temporal window is when the robot first
rapidly changes its direction, and it ends when it stabilizes
Figure 24. Results of the navigation system in each site and
its lateral position. For some situations, however, the sys-
under low and high crowd conditions. We measure the error
tem error remains long after the maneuver is deemed over.
as the lateral deviation of the robot from the road center. The
For example, during the LIBR test run in a low crowd con-
far right bars display the distributions of low vs high crowd
performance, which is statistically significant (Friedman test,
dition, the robot has to avoid two people, street signs, and
p < 0.005). a long bike rack in succession (see Figure 26). Because the
compound maneuver keeps the robot on the left flank of the
road (3 m from the center) for so long, the system resets its
current lateral position, thinking that the robot is at a new
between buildings. An example is when the robot is entering center of the road. Had the robot moved back to the center,
the corridor at the last segment of the PROF route. In this as opposed to staying out in the flank for more than 70 m,
case, the obstacle avoidance, or wall following behavior, the overall median would have been 0.87 m, lower than the
kicks in and funnels the robot to the segment. We believe actual 1.09 m error.
that in order for the robot to properly turn at the center of We decided to add a reset option because the current
an intersection, both localization and navigation must work road recognition system does not have the capability to rec-
in concert. That is, localization can seed the system that it is ognize the true center robustly. Resetting the center when-
nearly time to turn, while navigation should then estimate ever a new stable lateral position is established allows the
the configuration of the junction using the involved road system to react to changing boundary positions. If the sys-
boundaries to fine-tune the turn location. tem is able to estimate the road center, it would help not
only for obstacle avoidance recovery but also in the case
in which the localization system turns at the wrong point
4.2. Navigation Testing of the intersection at the start of a segment. Currently, the
We measure the accuracy of the system in being able to navigation system assumes that the original lateral position
navigate and maintain its position in the middle of the road is the center of the road, and simply tries to maintain that
on many different road types and challenges. They range spot.
from clean roads with clear borders, to roads with complex We also analyze the navigation error during obstacle
markings, to roads without clear boundaries. Navigation avoidance, with the result shown in Figure 27. Again, the fig-
error is defined as the physical lateral distance of the robot ure shows statistical significance in all cases: navigation er-
center from the center of the road. We measure the ground ror during obstacle avoidance for low vs high crowd condi-
truth by manually annotating the robot’s lateral position tions (Friedman test, p < 0.005), during nonobstacle avoid-
when crossing certain points of the path. We then interpolate ance traversals for low vs high crowd conditions (Friedman
the points in between using odometry. test, p < 0.005), as well as between obstacle and nonobstacle
Figure 24 reports the result, which is divided by sites avoidance traversals for both crowd conditions combined
and then by crowding conditions. Overall, the system is (Friedman test, p < 0.005). Here the difference for the latter
able to stay within 1.37 m median of the center, which is ac- comparison is more pronounced, 1.58 m vs 1.36 m medians,

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 431

Figure 25. Trajectory of the robot on the PROF route trial run under a low crowd condition. Here the deviation from the ground
truth center of road is indicated by the gradation of the red color, with the darker hue being higher error. In addition, the green bar
behind the trajectory indicates that the error is within 1.5 m of the center of the road, while the wider black bar is for 5 m.

Figure 26. Successive avoidance maneuver by the robot. It has to avoid two people (first image), street signs (second), and a long
bike rack (third).

respectively. In addition, the disparity is also supported by the missing bar in Figure 27), there is still error in the system.
the average error increase between the start and end of the A major reason for this is because, in some areas, visual cues
obstacle maneuver of 32 cm. for road direction are very sparse. An example is the water
Even without obstacle avoidance, in the case of the fountain area in the PROF route, where the road changes
PROF route during the low crowd condition (indicated by seamlessly to an open space before continuing to another

Journal of Field Robotics DOI 10.1002/rob


432 • Journal of Field Robotics—2014

Throughout the experiment, there were only a few


times when the robot failed to detect obstacles in its path.
On those occasions, we stepped in front of the robot to assert
the presence of the obstacles, and we let the system adjust
by itself. One reason for the mistakes is that the planar LRF
is placed too high above the ground to detect shorter or pro-
truding object parts, such as the seat of a bench. However,
we find that there is no correct height because, if it is placed
low, the robot may attempt to drive underneath an obstacle.
Other reasons to note are the inability of the LRF to con-
sistently detect thin objects (e.g., poles, fences), and objects
with holes such as a bicycle approaching at certain angles.
Furthermore, the navigation system is currently unable to
detect negative obstacles such as a ditch or a hole, making
it unsafe to navigate on roads with hazardous boundaries
such as sidewalks, ramps, and cliffs. In Section 5.3, we de-
Figure 27. Navigation errors during obstacle avoidance from
scribe our plan to rectify these problems.
different routes, and in both low and high crowd conditions.
Note that the result for the PROF route during low crowd is
missing because the robot did not execute a single obstacle
avoidance maneuver during that time. 5. DISCUSSION AND CONCLUSION
In this paper, we present a real-time visually guided au-
tonomous localization and navigation system. We demon-
road (see Figure 28). In the absence of these visual cues, and strated the system’s efficacy using a wheeled robot and driv-
thus of a vanishing point, the robot simply maintains the ing it through long-range and challenging routes of at least
last road IMU heading estimate while avoiding obstacles 400 m at the University of Southern California (USC) cam-
whenever needed. Despite this difficult stretch, however, pus, for a total of 10.11 km of documented testing. In each
the robot was still able to arrive at the target location, about route, the robot had to show its robustness in navigating
200 m further. among crowds, turning at several intersections, and stop-
On a related topic, open areas present a challenge in ping at the assigned destination. In Section 5.1, we discuss
terms of constructing a topological representation because why the system is able to perform as well as it did. We then
the robot can move about in any direction. In a truly open compare its performance with the available state-of-the-art
area devoid of obstacles, there can be any number of maps systems and we describe how our work relates to them in
that can be created to describe the space, but in the end, Section 5.2. We finish by listing the current system failures
these do not matter since the space is open and hence can as well as future improvements in Section 5.3.
be traversed in any way. Our strategy with mapping open For our setup, we have clearly framed the objective of
spaces thus incorporates the robot behavior of continuing to our system to within a specific type of environment and ap-
proceed straight while avoiding obstacles. So, for example, plication, and at this point we do not know whether this ap-
in the case of a large plaza with four roads connecting to proach generalizes to a completely different scenario (e.g.,
it from the north, east, south, and west, we simply reduce autonomous cars). However, the service robot task in un-
the plaza to its center point and extend the four roads to constrained pedestrian environments has the potential to
meet at that point. If the open space does contain fixed have quite a broad impact on society. We believe that our
obstacles, these will be represented as additional nodes in extensive work will serve as an inspiration for many others
our topological map. in continuing the research on autonomous mobile robotics.

Figure 28. Views of the environment as the robot is moving through a water fountain area.

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 433

5.1. Performance Evaluation being responsive enough to detect a change in the road di-
rection. On the other hand, the latter seems to adapt to road
A key factor for our system’s success is the hierarchical rep-
changes noticeably more slowly. Furthermore, when pos-
resentation, which effectively separates the localization and
sible, we and others have previously conducted extensive
navigation tasks. Here, the former is dealt with at the local
parametric testing of system components: for attention (Itti
map level, while the latter is at the global map level. The
and Koch, 2001), gist (Siagian and Itti, 2007), SIFT (Lowe,
separation enables the navigation module to independently
2004), road recognition (Siagian, Chang, and Itti, 2013a).
control the robot, affording the localization module a longer
To obtain a truly optimal set of parameters, we have
period to perform its computations. This, in fact, is a neces-
to test all possible combinations of all possible values. Un-
sity because it takes seconds for the localization module to
fortunately, the presented system is tightly coupled, where
compare input landmarks with its large stored database.
the number of variants to consider when adjusting possible
However, unlike obstacle avoidance and road heading cor-
parameters is subject to combinatorial explosion. Neverthe-
rection, localization-related tasks such as turning at an in-
less, we have shown that the system still performs quite
tersection or stopping at the goal are infrequently invoked
robustly with the parameters chosen. We believe that most
and can be updated at a substantial delay. In addition, the
of these parameters are not extremely sensitive for perfor-
infrequent goal location updates do not affect the naviga-
mance, and there is flexibility in selecting these numbers. In
tion system on a frame-to-frame basis because it is assumed
this paper, we tried to provide an intuition as to what the
that the robot wants to go toward the direction of the road.
factors are that should be considered, and we provide these
On the contrary, by having a mechanism that keeps the
values to gauge what is the expected ideal range. We do not
robot’s viewing perspective from or near the center of the
claim that these are the best or only possible values. It is
road, we do not have to train the vision localization system
possible that better values will be found in future research.
on landmarks from other perspectives. Such requirements
However, we report those numbers to allow the readers to
would almost certainly put our recognition system past its
reproduce our exact system if they want to. To that end, we
limits because we use a regular camera as opposed to an om-
also provide the software source code at Itti (2012) to port to
nidirectional camera (Murillo, Guerrero, and Sagues, 2007;
other robotic systems, or as a guide to adapt our algorithms
Valgren and Lilienthal, 2008) or creating a 3D landmark rep-
for specific usage.
resentation (Trulls et al., 2011). In addition, staying in the
middle of the road also allows the robot to better see other
semantically important objects, such as stop or exit signs,
5.2. System Comparison
building entrances, and doors. In the future, we would like
to have the system be able to navigate through open areas Our evaluation, in terms of scope, is most similar to the
where the road lines are not readily visible, while still being Tsukuba challenge (Yuta, Mizukawa, Hashimoto, Tashiro,
able to optimally view these important cues. & Okubo, 2011), which takes its human-sized service robot
Another important aspect for the success of the system participants through a 1.1 km route of a pedestrian road
is the implementation of many techniques to account for dif- in the city of Tsukuba. This is unlike the DARPA grand
ferent processing delays of the vision modules. For the long challenge (Montemerlo et al., 2008) and the European
latency of the localization module, the system implements Land-Robot Trial (ELROB) (Schneider and Wildermuth,
a multilevel recognition with gist for segment recognition, 2011), which are for autonomous cars, or AUVSI’s Intel-
and landmark recognition to refine the localization hypoth- ligent Ground Vehicle Competition (IGVC) (Theisen, Fred-
esis. In addition, the system also projects the results forward erick, and Smuda, 2011) and NASA’s Centennial Challenge
by adding an odometry difference to the localization estima- (NASA, 2013), which are conducted on open grass fields.
tion between the start and end of the landmark recognition In terms of goals, our system relates to a number of
process. Also, in the road recognition system, the module fully autonomous systems that are operating in outdoor
runs two processing pipelines: one is for the full visual road pedestrian environments. The ACE project robot (Lidoris,
recognition, while another tracks previously detected road Rohrmullera, Wollherr, and Buss, 2009) performs outdoor
lines to produce updated estimates while the full recogni- navigation but relies on passersby to guide it to the final
tion pipeline looks for new road lines. destination 1.5 km away. The EUROPA project robot (Stach-
In terms of implementation, the overall system uses a niss, 2013), on the other hand, is able to navigate and localize
number of parameters. For the most part, we have chosen in unconstrained urban settings using LRF sensors. We did
them experimentally during development. We tried several not find a fully autonomous visually guided localization
values for each parameter and used the one that produced and navigation system that has been tested in a pedestrian
the best performance. For example, we tried to incorporate environment to compare with.
the last 100 and 500 absolute headings in the road direction One that comes closest is RatSLAM (Milford and
filtering step. We found that the former, which we decided Wyeth, 2008), which takes inspiration from research done
on using in the end, allows for robust filtering while also on how rat brains perform navigation tasks. It has a fully

Journal of Field Robotics DOI 10.1002/rob


434 • Journal of Field Robotics—2014

autonomous version (Milford and Wyeth, 2010), where it si- sion rate, on the other hand, is close to 100% because of
multaneously localizes and maps (SLAM) the environment the high threshold on positive matches, at least three key
using vision, and navigates using a planar LRF indoors. Our points in a proper affine configuration. Note that an almost
system, despite not having SLAM yet, performs both nav- perfect precision rate is the norm for key-point-based sys-
igation and localization outdoors. Another noteworthy as- tems. Thus, precision is usually set at 99–100% before eval-
pect of RatSLAM is its visual features. It uses a column aver- uating the recall rate. From Maddern et al. (2012), we gauge
age intensity profile taken at specific windows for three dif- that our recall rate is competitive, as anything above 10% is
ferent purposes: scene recognition, rotation detection, and respectable.
speed estimation. This is a stark difference from what most Recognition rate is also the metric of choice for visual
state-of-the-art vision localization systems, including ours, place recognition systems. As Siagian and Itti (2008a) and
used: computationally complex key-point-based recogni- a number of others (Fazl-Ersi and Tsotsos, 2012; Song and
tion. Our perceptual system, which relies on SIFT-based Tao, 2010) have shown, the performance of our gist-based
salient landmark database matching, is more in line with system, which is about 80% for the standard USC dataset,
that of the current SLAM systems: FAB MAP (Cummins is only 10% lower than some of the more recent systems. In
and Newman, 2008) or CAT-SLAM (Maddern, Milford, and a sense, place recognition is a different classification prob-
Wyeth, 2012). lem than that of visual SLAM systems. The former has to
In terms of results, the best way to benchmark our sys- discriminate a smaller number of classes, but with more
tem is by directly comparing our results with others in the in-class variance, while the latter has to discriminate many
same settings. However, this would be a large undertaking. more classes, with fewer views within a class. We believe
To have a fair comparison, the systems would have to be that, to build a vision localization system that can perform
run on the same robot that they are originally published in, robustly in many environments, a system has to be able to
or at least the same configuration of sensors. In addition, distinguish places at different levels, from the general place
there is also an issue of code availability as well as the care names to the specific locations.
needed to port and have it perform optimally. A way to For navigation, there are a number of metrics that are
circumvent this is by requesting that the authors run their currently used for evaluation. One is the success rate in
systems on recorded data. However, we are dealing with reaching an assigned goal. Although localization is also in-
active systems, where different decisions at each step can volved, it is clear to us that the main difficulty lies in path
lead to very different outcomes. Similarly, simulation exer- creation and execution. Even if the path created is perfect,
cises would also fall short of real-life testing because the if the robot is not able to properly execute the commands,
robot operates in a dynamic, unpredictable, and not fully it can still hit an obstacle. Thus, systems are also evaluated
accessible (in the AI sense) world, which we believe can- based on the severity of the failure, whether it be a square hit
not be well captured by the current world models. Given or just a graze (Marder-Eppstein et al., 2010). Our system,
these difficulties, we decided to compare our results with aside from a number of isolated incidents, usually provides
the results of other systems as reported in their respective about 30 cm of space between its body and the obstacles.
papers. In addition, we can also consider execution speed, which is
Our localization median error of 0.97 m is quite high currently less emphasized. The goal for service robots such
compared to that of outdoor LRF-based systems such as that as ours, which are meant to help people, is to at least match
of Montella et al. (2012), which reports error near 10 cm. This that of an average person (about 1.4 m/s). Our average ve-
is because proximity data are much more discriminating locity is currently half that (0.66 m/s), and it is competitive
when there is a sufficient amount of valid laser readings, with most currently available systems. It should be noted
which is not always the case outdoors. Visual features, on that robot speed also depends on the computational plat-
the other hand, observe the appearance of the environment. form, and the speed can be increased by simply upgrading
They help the system to rely less on nearby surfaces by the platform.
extending its perceptual range. Because of this, we believe The success of a system is also influenced by the travel
our vision-based navigation and localization system will distance, i.e., the longer the distance, the higher the likeli-
complement laser-based systems. hood of failure. For shorter distances of 40–50 m, it is not
The use of geographical distance from ground truth as unusual to achieve a near 100% success rate (Trulls et al.,
a localization metric in our testing is actually different from 2011). Our system is expected to reach a goal of 400 m or
how vision-based SLAM systems are currently evaluated. more in busy outdoor environments. Out of six fully au-
They use a precision-recall curve, which focuses more on tonomous runs, we have to intervene in two of them, all of
the recognition step. Here precision is the number of cor- which are more than 200 m from the start. As a comparison,
rect landmark matches over the number of matches, while there are a number of indoor systems that have achieved
recall is the number of correct matches over the number of much longer autonomy, lasting a full 26.2-mile marathon
possible matches. Our recall rate is close to one landmark (Marder-Eppstein et al., 2010) and 40 km over 37 h (Milford
per five possible matches in an image (20%). Our preci- and Wyeth, 2010).

Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Vision Guided Hierarchical Map Localization Navigation • 435

Because a large part of the failure to avoid an obstacle originates in the detection step, another way to evaluate a system is based on the obstacle avoidance map it produces. This approach is often seen in systems (Maye, Kaestner, and Siegwart, 2012) that try to detect negative obstacles (e.g., a dip next to a sidewalk). Given the large area in which our system is operating, it would be time-consuming to create a complete ground-truth occupancy map. We are, however, able to qualitatively assess our system and produce a list of events in which it fails; see Section 5.3.

Our lateral deviation navigation metric delves deeper into the quality of robot traversal because we believe the middle of the road allows the robot to be in a position to perform more tasks, if called upon, along the way. Our road recognition system has been shown to be statistically superior to a number of other state-of-the-art systems (Chang et al., 2012; Kong et al., 2010; Rasmussen, Lu, and Kocamaz, 2009) in detecting road centers in images, with an accuracy level close to 40 cm. However, the system performs worse in our long-range robot testing, with a median of 1.37 m. Thus, it is hard to predict how this, or other image-based metrics (Miksik, 2012; Rasmussen, Lu, and Kocamaz, 2009) that measure road segmentation overlap, would translate to actual autonomous navigation. As such, it is important to create a testing regimen where all the systems are tested on a robot under the same conditions.

5.3. Failures and Future Improvements

During testing, we observed a number of aspects that we would like to improve upon.

5.3.1. Automatic Localization Initialization

The current localization system has to be given an initial location to guarantee correct convergence. It uses the current location estimate to order the database landmarks when comparing them with the input regions. In addition, the system also employs a search cutoff in the comparison process for efficiency reasons. Thus, instead of going through the whole database, there are times when the matching landmark may not even get to be compared with the input region because of an incorrect prior. Another related issue is the need to fine-tune the matching criteria to maximize matches while keeping false positives to a minimum. In addition, when the robot is lost, it should be more active in searching for better views, where the landmarks are more likely to be matched.
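To illustrate the prior-ordered, cutoff-limited comparison described above, here is a minimal sketch with hypothetical field names and a simple Euclidean prior; it is not the matching code running on the robot, and it shows how an incorrect prior can push the true landmark past the cutoff.

```python
import math

def feature_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def order_landmarks(database, prior_xy, cutoff=200):
    """Sort database landmarks by distance to the location prior; keep the first `cutoff`."""
    def dist(lm):
        return math.hypot(lm['x'] - prior_xy[0], lm['y'] - prior_xy[1])
    return sorted(database, key=dist)[:cutoff]

def match_region(region_feature, database, prior_xy, threshold=0.25, cutoff=200):
    """Compare an input region only against the prior-ordered subset of the database.
    If the prior is wrong, the true landmark may never be compared at all."""
    for lm in order_landmarks(database, prior_xy, cutoff):
        if feature_distance(region_feature, lm['feature']) < threshold:
            return lm  # first sufficiently similar landmark wins
    return None  # no match within the cutoff
```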
5.3.2. Simultaneous Localization and Mapping

Currently, the localization map is still manually constructed. We always understood that this is inconvenient, although for many application scenarios (see Figure 1) we believe this step is well worth the effort because the robot will be used many times in the same environment. In the next iteration, we plan to create the metric topological map automatically using a SLAM algorithm. There are a number of SLAM algorithms that aim to create such a hierarchical map. A system by Marder-Eppstein, Berger, Foote, Gerkey, and Konolige (2011) connects local overlapping LRF-based occupancy maps to a single global topological map. Similarly, Pradeep, Medioni, and Weiland (2010) constructed a topological map by associating stereo-based local submaps. Note that these systems have only been tested indoors, and it may not be easy to adapt them to outdoor environments. For our system, we are particularly interested in creating a map with meaningful nodes that represent important geographical points, similar to Ranganathan and Dellaert (2011), which identifies unique LRF signatures produced by junctions in indoor environments. The key is to robustly do the same in unconstrained pedestrian outdoor environments, where the presence of surrounding walls cannot be assumed. One way is by using an online gist feature grouping technique (Valgren and Lilienthal, 2008), which detects a large change in the feature space, corresponding to moving into a new segment. In addition, related to metric topological SLAM, there is also a research problem (Beeson, Modayil, and Kuipers, 2010; Kuipers, Modayil, Beeson, Macmahon, and Savelli, 2004) of how to topologically represent a more complex path or space, such as winding roads or an open area.
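As a rough sketch of the segment-grouping idea, and only as a much-simplified stand-in for the incremental spectral clustering of Valgren and Lilienthal (2008), a new topological segment can be opened whenever the current gist vector drifts far from the running mean of the current segment; the threshold below is an arbitrary placeholder.

```python
def segment_boundaries(gist_stream, threshold=0.8):
    """Return the indices at which a new segment starts.

    gist_stream: iterable of equal-length gist feature vectors.
    A new segment is opened when the distance from the running
    segment mean exceeds `threshold` (a large appearance change)."""
    boundaries, seg_sum, count = [], None, 0
    for i, g in enumerate(gist_stream):
        if count == 0:
            boundaries.append(i)          # start of the first segment
            seg_sum, count = list(g), 1
            continue
        mean = [s / count for s in seg_sum]
        dist = sum((gi - mi) ** 2 for gi, mi in zip(g, mean)) ** 0.5
        if dist > threshold:
            boundaries.append(i)          # large change: new segment
            seg_sum, count = list(g), 1
        else:
            seg_sum = [s + gi for s, gi in zip(seg_sum, g)]
            count += 1
    return boundaries
```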
5.3.3. Proper Turning at an Intersection

Although the localization system did not completely fail or misplace the robot during testing, at times it converged a few meters away from the true location. This becomes an issue when the robot is near an intersection and needs to turn. If the turn occurs far away from the center of the intersection, the robot may end up on the shoulder instead of near the center of the next road. One way to solve this problem is by explicitly recognizing the junction or intersection. Given a prior that the robot is near an intersection, the system can look for road lines perpendicular to the lines that point toward the current robot direction. Another way to avoid traveling on the side of the road is by recovering the true road center established during training. The visual landmarks in the stored database also carry image coordinate locations and can be used to adjust the target lateral position. This approach is similar to teach-and-replay (Chang et al., 2010), but differs in that the result is not directly coupled with motor command generation. It merely corrects the road center estimate whenever matches are available, preventing the system from having to stop the robot when recognition fails. This technique is also useful for navigating when the vanishing point is undetected, as well as on more extreme curved and winding roads, improving upon the system's current ability to follow more gradually changing roads (Siagian, Chang, and Itti, 2013a).
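One way the correction above could be realized, sketched here only as an assumption and not as the implemented behavior, is to shift the road-center target by the average image-column offset between where matched landmarks were seen during training and where they are seen now; the gain and data layout are hypothetical.

```python
def corrected_road_center(road_center_px, matches, gain=0.5):
    """Nudge the road-center estimate (in image pixels) toward the taught path.

    road_center_px: current road-center estimate from the road recognizer.
    matches: list of (train_column_px, current_column_px) pairs for matched landmarks.
    With no matches the estimate is returned unchanged, so the robot never has
    to stop when landmark recognition fails."""
    if not matches:
        return road_center_px
    offsets = [cur - train for train, cur in matches]
    return road_center_px + gain * sum(offsets) / len(offsets)
```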

5.3.4. Robust Obstacle Avoidance Navigation

In terms of navigational shortcomings, there were a few times when the robot failed to detect obstacles in its path, particularly thin or short objects as well as objects with holes. In addition, there are also a number of scenarios that we did not include in our test, but that are expected to lead to system failures and need to be handled in the future: curbs, ramps, and off-road terrain. To improve the system's competencies in these situations, we are adding an up-down sweeping LRF, as well as applying techniques by Klasing et al. (2008) and Maye et al. (2012). We believe a moving LRF is the best way to maximize obstacle visibility, while those algorithms enable our system to handle such situations more robustly. In addition, we also plan to incorporate the evidence from the sweeping LRF into our visual road recognition module by identifying which road lines are solid boundaries and should never be crossed. Here, a road boundary can be flagged by a visual appearance change, a change in the proximity profile, or a sharp change in elevation. Once we incorporate these improvements, we believe the resulting system will be robust in many different outdoor scenarios, sufficient to travel for even longer distances.
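As a hedged sketch of the planned fusion, a detected road line could be labeled a hard boundary when the proximity or elevation profile measured along it changes sharply; the thresholds and data layout below are illustrative assumptions, not values from the deployed system.

```python
def is_hard_boundary(samples, range_jump=0.5, height_jump=0.10):
    """Decide whether a road line behaves like a solid boundary (curb, wall, drop-off).

    samples: list of (laser_range_m, elevation_m) pairs measured along the line,
             e.g., collected by an up-down sweeping LRF.
    A sharp discontinuity in either profile marks a boundary that should never be crossed."""
    for (r0, h0), (r1, h1) in zip(samples, samples[1:]):
        if abs(r1 - r0) > range_jump or abs(h1 - h0) > height_jump:
            return True
    return False

# Example: a 10+ cm curb appears as an elevation step along the line.
curb = [(2.0, 0.00), (2.0, 0.00), (1.9, 0.12), (1.9, 0.12)]
print(is_hard_boundary(curb))  # -> True
```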
5.3.5. Better Road Detection

The system also needs to be improved with regard to detecting roads under more difficult conditions, such as bright sunlight. This can be done in both hardware and software. For the former, a camera with a wider dynamic range would be a welcome addition. In addition, having a lens with a wider field of view is also helpful, especially for wider uniform roads with no direction cues between the boundaries. For software improvements, a better low-level edge and line segment detector (with local contrast adjustments) can significantly boost performance. A different way to improve the system is by utilizing visual cues other than road lines. We are planning to incorporate the use of histogram differencing to segment out the road from its flanking areas, much like Chang et al. (2012). By creating a system that runs a number of contrasting approaches in parallel, we hope to enable the system to robustly detect more road types, even in more naturalistic off-road environments. Related to this, the localization system, particularly our biologically inspired gist classification and saliency, has been shown to perform well in nonurban environments (Itti and Baldi, 2009; Siagian and Itti, 2009, 2007).

5.3.6. Improved Navigation in Open Areas

In situations in which there is no road in front of the robot, the system simply maintains its current forward heading, set by the last seen road, while trying to avoid obstacles. This goes on until there is a continuing road or the localization module instructs it to make a turn. In these situations, the system performs navigation exclusively using the LRF, producing behaviors such as wall-following or centering along a hallway. Our robot in a sense follows a subsumption architecture (Brooks, 1986): when a road is detected, follow it; otherwise, default to driving straight forward. Thus the navigation system is not dependent on the presence of the vanishing point; rather, it only uses visual road recognition when it is advantageous. This way the road recognition system enhances the standard laser-based navigation system largely utilized today. In pedestrian environments, however, there are open areas such as a square or a large hall where there are no nearby wall surfaces to guide the robot. In the experiment, the system was able to navigate through an open water fountain area (see Figure 28). However, this success is due, in large part, to the fact that the continuing road is aligned with the previous road. In less ideal configurations, the system can be confused and drive to the wrong road. To rectify this problem, the system must have a richer understanding of the space, knowing all the roads that come in and out of the area.

5.3.7. Navigation in Crowded Areas

As for the effect of crowding on our system, we are encouraged by the results that show that it is able to localize and navigate through those difficult conditions. In the future, we would like to improve Beobot's navigation skills so that it moves more naturally among people (Trautman and Krause, 2010), for example using a fast GPU-based pedestrian recognition system (Benenson, Mathias, Timofte, and Gool, 2012) to properly anticipate where they are headed. We can also try to characterize how pedestrians behave as a group (Henry, Vollmer, Ferris, and Fox, 2010), which would be more feasible for denser crowds.

ACKNOWLEDGMENTS

This work was supported by the National Science Foundation (Grants No. CCF-1317433 and No. CMMI-1235539), the Army Research Office (W911NF-11-1-0046 and W911NF-12-1-0433), and the U.S. Army (W81XWH-10-2-0076). The authors affirm that the views expressed herein are solely their own, and do not represent the views of the U.S. government or any agency thereof.

REFERENCES

Agrawal, M., & Konolige, K. (2006). Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In Proceedings of the International Conference on Pattern Recognition (ICPR) (vol. 3, pp. 1063–1068).
Angeli, A., Filliat, D., Doncieux, S., & Meyer, J.-A. (2008). A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, Special Issue on Visual SLAM, 24(5), 1027–1037.

Bay, H., Tuytelaars, T., & Gool, L. V. (2006). SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 404–417).
Beeson, P., Modayil, J., & Kuipers, B. (2010). Factoring the mapping problem: Mobile robot map-building in the hybrid spatial semantic hierarchy. The International Journal of Robotics Research, 29, 428–459.
Benenson, R., Mathias, M., Timofte, R., & Gool, L. V. (2012). Pedestrian detection at 100 frames per second. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2903–2910), Providence, RI: IEEE Computer Society.
Borenstein, J., & Koren, Y. (1991). The vector field histogram—Fast obstacle avoidance for mobile robots. IEEE Journal of Robotics and Automation, 7(3), 278–288.
Bradski, G. (2001). Open source computer vision library.
Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Transactions on Robotics and Automation, 2(1), 14–23.
Chang, C.-K., Siagian, C., & Itti, L. (2010). Mobile robot vision navigation & localization using gist and saliency. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 4147–4154). Both first authors contributed equally.
Chang, C.-K., Siagian, C., & Itti, L. (2012). Mobile robot monocular vision navigation based on road region and boundary estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1043–1050).
Chen, Z., & Birchfield, S. (2006). Quantitative vision-based mobile robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 2686–2692).
Cherubini, A., Spindler, F., & Chaumette, F. (2012). A new tentacles-based technique for avoiding obstacles during visual navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 4850–4855).
Coombs, D., Herman, M., Hong, T., & Nashman, M. (1995). Real-time obstacle avoidance using central flow divergence and peripheral flow. In Proceedings of the International Conference on Computer Vision (ICCV) (pp. 276–283).
Cummins, M., & Newman, P. (2008). FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6), 647–665.
Cunha, J., Pedrosa, E., Cruz, C., Neves, A., & Lau, N. (2011). Using a depth camera for indoor robot localization and navigation. In RGB-D Robotics: Science and Systems (RSS) Workshop, Los Angeles.
Eade, E., & Drummond, T. (2008). Unified loop closing and recovery for real time monocular SLAM. In Proceedings of the British Machine Vision Conference (BMVC) (pp. 6.1–6.10).
Fazl-Ersi, E., & Tsotsos, J. (2012). Histogram of oriented uniform patterns for robust place recognition and categorization. The International Journal of Robotics Research, 31(2), 468–483.
Fiorini, P., & Shiller, Z. (1998). Motion planning in dynamic environments using velocity obstacles. The International Journal of Robotics Research, 17(7), 760–772.
Fox, D., Burgard, W., Dellaert, F., & Thrun, S. (1999). Monte Carlo localization: Efficient position estimation for mobile robots. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI'99).
Fox, D., Burgard, W., & Thrun, S. (1997). The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1), 23–33.
Furgale, P., & Barfoot, T. D. (2010). Visual teach and repeat for long-range rover autonomy. Journal of Field Robotics, 27(5), 534–560.
Glennie, C., & Lichti, D. D. (2010). Static calibration and analysis of the Velodyne HDL-64E S2 for high accuracy mobile scanning. Remote Sensing, 2(6), 1610–1624.
Henry, P., Vollmer, C., Ferris, B., & Fox, D. (2010). Learning to navigate through crowded environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 981–986).
Hrabar, S., & Sukhatme, G. (2009). Vision-based navigation through urban canyons. Journal of Field Robotics, 26(5), 431–452.
Hu, Z., Wang, H., Zhang, L., & Yu, H. (2009). Laser based sensor boundary recognition of mobile robot. In International Conference on Networking, Sensing and Control, Okayama, Japan.
i Badia, S. B., Pyk, P., & Verschure, P. F. M. J. (2007). A fly-locust based neuronal control system applied to an unmanned aerial vehicle: The invertebrate neuronal principles for course stabilization, altitude control and collision avoidance. International Journal of Robotics Research, 26(7), 759–772.
Itti, L. (2000). Models of bottom-up and top-down visual attention. Ph.D. thesis, California Institute of Technology, Pasadena, CA.
Itti, L. (2012). iLab neuromorphic vision C++ toolkit (iNVT). http://ilab.usc.edu/toolkit/. Accessed December 15, 2012.
Itti, L., & Baldi, P. F. (2009). Bayesian surprise attracts human attention. Vision Research, 49(10), 1295–1306. Top cited article 2008–2010 award from Vision Research.
Itti, L., & Koch, C. (2001). Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10(1), 161–169.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
King, P. H. (2008). A low cost localization solution using a Kalman filter for data fusion. Master's thesis, Virginia Polytechnic Institute and State University, Blacksburg, VA.
Klasing, K., Lidoris, G., Bauer, A., Rohrmuller, F., Wollherr, D., & Buss, M. (2008). The autonomous city explorer: Towards semantic navigation in urban environments. In Proceedings of the International Workshop on Cognition for Technical Systems (COTESYS).

Koenig, S., & Likhachev, M. (2002). D* lite. In AAAI MicroStrain, Inc. (2009). 3DM-GX2:: MicroStrain, AHRS Ori-
Conference of Artificial Intelligence (AAAI) (pp. 476– entation Sensor. http://www.microstrain.com/3dm-gx2
483). .aspx. Accessed: July 15, 2009.
Koenig, S., Likhachev, M., Liu, Y., & Furcy, D. (2004). Incre- Miksik, O. (2012). Rapid vanishing point estimation for general
mental heuristic search in artificial intelligence. Artificial road detection. In Proceedings of the IEEE International
Intelligence Magazine, 25(2), 99–112. Conference on Robotics and Automation (ICRA).
Kong, H., Audibert, J.-Y., & Ponce, J. (2010). General road de- Milford, M., & Wyeth, G. (2008). Mapping a suburb with a
tection from a single image. IEEE Transactions on Image single camera using a biologically inspired slam system.
Processing, 19(8), 2211–2220. IEEE Transactions on Robotics, 24(5), 1038–1053.
Kuffner, J. J., & LaValle, S. M. (2000). Rrt-connect: An efficient Milford, M., & Wyeth, G. (2010). Persistent navigation and map-
approach to single-query path planning. In Proceedings ping using a biologically inspired slam system. The In-
of the IEEE International Conference on Robotics and Au- ternational Journal of Robotics Research (IJRR), 29, 1131–
tomation (ICRA) (pp. 995–1001). 1153.
Kuhnl, T., Kummert, F., & Fritsch, J. (2011). Monocular road seg- Minguez, J., & Montano, L. (2004). Nearness diagram (nd) nav-
mentation using slow feature analysis. In IEEE Intelligent igation: Collision avoidance in troublesome scenario. IEEE
Vehicles Symposium (IV) (pp. 800–806). Transactions on Robotics and Automation, 20(1), 45–59.
Kuipers, B., Modayil, J., Beeson, P., Macmahon, M., & Savelli, F. Moghamadam, P., Starzyk, J. A., & Wijesoma, W. S. (2012). Fast
(2004). Local metrical and global topological maps in the vanishing-point detection in unstructured environments.
hybrid spatial semantic hierarchy. In Proceedings of the IEEE Transactions on Image Processing, 21(1), 425–430.
IEEE International Conference on Robotics and Automa- Montella, C., Perkins, T., Spletzer, J., & Sands, M. (2012). To the
tion (ICRA) (pp. 4845–4851). bookstore! Autonomous wheelchair navigation in an ur-
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of ban environment. In Proceedings of the International Con-
features: Spatial pyramid matching for recognizing nat- ference on Field and Service Robotics (FSR), Matsushima,
ural scene categories. In Proceedings of the IEEE Com- Japan.
puter Society Conference on Computer Vision and Pat- Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dol-
tern Recognition (CVPR) (pp. 2169–2178), Washington, gov, D., Ettinger, S., Haehnel, D., Hilden, T., Hoffmann,
DC: IEEE Computer Society. G., Huhnke, B., Johnston, D., Klumpp, S., Langer, D.,
Lidoris, G., Rohrmullera, F., Wollherr, D., & Buss, M. (2009). Levandowski, A., Levinson, J., Marcil, J., Orenstein, D.,
The autonomous city explorer (ace) project mobile robot Paefgen, J., Penny, I., Petrovskaya, A., Pflueger, M., Stanek,
navigation in highly populated urban environments. Pro- G., Stavens, D., Vogt, A., & Thrun, S. (2008). Junior: The
ceedings of the IEEE International Conference on Robotics stanford entry in the urban challenge. Journal of Field
and Automation (ICRA) (pp. 1416–1422). Robotics, 25(9), 569–597.
Lowe, D. (2004). Distinctive image features from scale-invariant Moravec, H., & Elfes, A. (1985). High resolution maps from
keypoints. International Journal of Computer Vision, 60(2), wide angle sonar. In Proceedings of the IEEE International
91–110. Conference on Robotics and Automation (ICRA) (vol. 2,
Maddern, W., Milford, M., & Wyeth, G. (2012). Cat-slam: pp. 116–121).
Probabilistic localisation and mapping using a continu- Murillo, A., Guerrero, J., & Sagues, C. (2007). Surf features for
ous appearance-based trajectory. International Journal of efficient robot localization with omnidirectional images. In
Robotics Research, 31, 429–451. Proceedings of the International Conference on Robotics
Marder-Eppstein, E., Berger, E., Foote, T., Gerkey, B., & Kono- and Automation (ICRA) (pp. 3901–3907).
lige, K. (2010). The office marathon: Robust navigation Nakamura, T., & Asada, M. (1995). Motion sketch: Acquisition
in an indoor office environment. In Proceedings of the of visual motion guided behaviors. In International Joint
IEEE International Conference on Robotics and Automa- Conference on Artificial Intelligence (IJCAI) (pp. 126–132).
tion (ICRA) (pp. 300–307). NASA (2013). Centennial challenges. http://www.nasa.gov/
Marder-Eppstein, E., Berger, E., Foote, T., Gerkey, B., & Kono- offices/oct/stp/centennial_challenges/index.html. Ac-
lige, K. (2011). Kurt konolige, eitan marder-eppstein, cessed January 15, 2013.
bhaskara marthi. In Proceedings of the IEEE Interna- Newman, P., Sibley, G., Smith, M., Cummins, M., Harrison,
tional Conference on Robotics and Automation (ICRA) A., Mei, C., Posner, I., Shade, R., Schroeter, D., Murphy, L.,
(pp. 3041–3047). Churchill, W., Cole, D., & Reid, I. (2009). Navigating, recog-
Maye, J., Kaestner, R., & Siegwart, R. (2012). Curb detection nizing and describing urban spaces with vision and lasers.
for a pedestrian robot in urban environments. In Proceed- The International Journal of Robotics Research, 28(11-12),
ings of the IEEE International Conference on Robotics and 1406–1433.
Automation (ICRA) (pp. 367–373). Nistér, D., Naroditsky, O., & Bergen, J. (2006). Visual odometry
Michels, J., Saxena, A., & Ng, A. Y. (2005). High speed ob- for ground vehicle applications. Journal of Field Robotics,
stacle avoidance using monocular vision and reinforce- 23(1), 3–20.
ment learning. In International Conference on Machine Ouerhani, N., Bur, A., & Hugli, H. (2005). Visual attention-
Learning. based robot self-localization. In Proceedings of the


Pradeep, V., Medioni, G., & Weiland, J. (2010). Robot vision for the visually impaired. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 15–22). IEEE Computer Society.
Pronobis, A., Caputo, B., Jensfelt, P., & Christensen, H. (2006). A discriminative approach to robust visual place recognition. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS) (pp. 3829–3836).
Pronobis, A., Mozos, O. M., Caputo, B., & Jensfelt, P. (2010). Multi-modal semantic place classification. The International Journal of Robotics Research (IJRR), Special Issue on Robotic Vision, 29(2-3), 298–320.
Quinlan, S., & Khatib, O. (1993). Elastic bands: Connecting path planning and control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 802–807).
Ranganathan, A., & Dellaert, F. (2011). Online probabilistic topological mapping. The International Journal of Robotics Research (IJRR), 30, 755–771.
Rasmussen, C., Lu, Y., & Kocamaz, M. (2009). Appearance contrast for fast, robust trail-following. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 3505–3512).
Royer, E., Lhuillier, M., Dhome, M., & Lavest, J.-M. (2007). Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3), 237–260.
Santana, P., Alves, N., Correia, L., & Barata, J. (2010). Swarm-based visual saliency for trail detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 759–765).
Santos-Victor, J., Sandini, G., Curotto, F., & Garibaldi, S. (1993). Divergent stereo in autonomous navigation: Learning from bees. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 434–439).
Schindler, G., Brown, M., & Szeliski, R. (2007). City-scale location recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–7), Los Alamitos, CA: IEEE Computer Society.
Schneider, F. E., & Wildermuth, D. (2011). Results of the European Land Robot Trial and their usability for benchmarking outdoor robot systems. In Toward Autonomous Robotic Systems, volume 6856 of Lecture Notes in Computer Science (pp. 408–409). Springer.
Se, S., Lowe, D. G., & Little, J. J. (2005). Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3), 364–375.
Segvic, S., Remazeilles, A., Diosi, A., & Chaumette, F. (2009). A mapping and localization framework for scalable appearance-based navigation. Computer Vision and Image Understanding, 113(2), 172–187.
Siagian, C., Chang, C., & Itti, L. (2013a). Mobile robot navigation system in outdoor pedestrian environment using vision-based road recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Both first authors contributed equally.
Siagian, C., Chang, C. K., & Itti, L. (2013b). Beobot 2.0. http://ilab.usc.edu/beobot2. Accessed December 15, 2012.
Siagian, C., Chang, C.-K., Voorhies, R., & Itti, L. (2011). Beobot 2.0: Cluster architecture for mobile robotics. Journal of Field Robotics, 28(2), 278–302.
Siagian, C., & Itti, L. (2007). Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2), 300–312.
Siagian, C., & Itti, L. (2008a). Comparison of gist models in rapid scene categorization tasks. In Proceedings of the Vision Science Society Annual Meeting (VSS08).
Siagian, C., & Itti, L. (2008b). Storing and recalling information for vision localization. In IEEE International Conference on Robotics and Automation (ICRA), Pasadena, CA (pp. 1848–1855).
Siagian, C., & Itti, L. (2009). Biologically inspired mobile robot vision localization. IEEE Transactions on Robotics, 25(4), 861–873.
Simmons, R. (1996). The curvature-velocity method for local obstacle avoidance. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 3375–3382).
Song, D., & Tao, D. (2010). Biologically inspired feature manifold for scene classification. IEEE Transactions on Image Processing, 19(1), 174–184.
Stachniss, C. (2013). EUROPA. http://europa.informatik.uni-freiburg.de/. Accessed June 30, 2013.
Strasdat, H., Montiel, J. M. M., & Davison, A. J. (2010). Scale drift-aware large scale monocular SLAM. In Robotics: Science and Systems (RSS), Vol. 2.
Tardif, J.-P., Pavlidis, Y., & Daniilidis, K. (2008). Monocular visual odometry in urban environments using an omnidirectional camera. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 2531–2538).
Theisen, B. L., Frederick, P., & Smuda, W. (2011). The 18th annual intelligent ground vehicle competition: Trends and influences for intelligent ground vehicle control. In Proceedings of SPIE 7878, Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques.
Thrun, S. (2011). Google's driverless car. Talk was viewed at http://www.ted.com/talks/sebastian_thrun_google_s_driverless_car.html on September 1, 2012.
Thrun, S., Bennewitz, M., Burgard, W., Cremers, A., Dellaert, F., Fox, D., Hähnel, D., Rosenberg, C., Roy, N., Schulte, J., & Schulz, D. (1999). MINERVA: A second generation mobile tour-guide robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Thrun, S., Fox, D., Burgard, W., & Dellaert, F. (2000). Robust Monte-Carlo localization for mobile robots. Artificial Intelligence, 128(1-2), 99–141.
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In Proceedings of the International Conference on Computer Vision (ICCV) (pp. 1023–1029), Nice, France.

Trautman, P., & Krause, A. (2010). Unfreezing the robot: Navigation in dense, interacting crowds. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS) (pp. 797–803).
Trulls, E., Murtra, A. C., Pérez-Ibarz, J., Ferrer, G., Vasquez, D., Mirats-Tur, J. M., & Sanfeliu, A. (2011). Autonomous navigation for mobile service robots in urban pedestrian environments. Journal of Field Robotics, 28(3), 329–354.
Ulrich, I., & Borenstein, J. (1998). VFH+: Reliable obstacle avoidance for fast mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 1572–1577).
Ulrich, I., & Nourbakhsh, I. (2000). Appearance-based place recognition for topological localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 1023–1029).
Valgren, C., & Lilienthal, A. J. (2008). Incremental spectral clustering and seasons: Appearance-based localization in outdoor environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (pp. 1856–1861), Pasadena, CA.
Vargas, G. R. A., Nagatani, K., & Yoshida, K. (2007). Adaptive Kalman filtering for GPS-based mobile robot localization. In IEEE International Workshop on Safety, Security and Rescue Robotics, Rome, Italy.
Wang, J., Zha, H., & Cipolla, R. (2006). Coarse-to-fine vision-based localization by indexing scale-invariant features. IEEE Transactions on Systems, Man and Cybernetics, 36(2), 413–422.
Weiss, C., Tamimi, H., Masselli, A., & Zell, A. (2007). A hybrid approach for vision-based outdoor robot localization using global and local image features. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS) (pp. 1047–1052).
Wilson, N. J. (1982). Principles of artificial intelligence. Berlin: Springer-Verlag.
Wurm, K., Kummerle, R., Stachniss, C., & Burgard, W. (2009). Improving robot navigation in structured outdoor environments by identifying vegetation from laser data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1217–1222).
Wurm, K. M., Hornung, A., Bennewitz, M., Stachniss, C., & Burgard, W. (2010). OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems. In ICRA Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation (pp. 300–307).
Yang, L., & Lavalle, S. M. (2004). The sampling-based neighborhood graph: An approach to computing and executing feedback motion strategies. IEEE Transactions on Robotics and Automation, 20(3), 419–432.
Yuta, S., Mizukawa, M., Hashimoto, H., Tashiro, H., & Okubo, T. (2011). Tsukuba Challenge 2009—Towards robots working in the real world: Records in 2009. Journal of Robotics and Mechatronics, 23(2), 201–206.
Zhang, W., & Kosecka, J. (2005). Localization based on building recognition. In IEEE Workshop on Applications for Visually Impaired (pp. 21–28).
