
Proceedings of the IEEE ICRA 2009 Workshop on People Detection and Tracking
Kobe, Japan, May 2009

Robust Stereo-Based Person Detection and Tracking for a Person Following Robot

Junji Satake and Jun Miura
Department of Information and Computer Sciences
Toyohashi University of Technology
Abstract— This paper describes a stereo-based person detection and tracking method for a mobile robot that can follow a specific person in dynamic environments. Many previous works on person detection use laser range finders, which can provide very accurate range measurements. Stereo-based systems have also been popular, but most of them have not been used for controlling a real robot. We propose a detection method that applies depth templates of person shape to a dense depth image. We also develop an SVM-based verifier for eliminating false positives. For person tracking by a mobile platform, we formulate the tracking problem using the Extended Kalman filter. The robot continuously estimates the position and the velocity of persons in the robot local coordinates, which are then used for appropriately controlling the robot motion. Although our approach is relatively simple, our robot can robustly follow a specific person while recognizing the target and other persons, even with occasional occlusions.

Index Terms— Person detection and tracking, Mobile robot, Stereo.

I. INTRODUCTION
Following a specific person is an important task for service robots. Visual person following in public spaces entails tracking of multiple persons by a moving camera.

There has been a lot of work on person detection and tracking using various image features and classification methods [1], [2], [3], [4], [5]. Many of these works, however, use a fixed camera. When a moving camera is used, foreground/background separation becomes an important problem.

This paper deals with the detection and tracking of multiple persons for a mobile robot. Laser range finders are widely used for person detection and tracking by mobile robots [6], [7], [8]. Image information such as color and texture is, however, sometimes necessary for person segmentation and/or identification. Omnidirectional cameras are also used [9], [10], but their limited resolution is sometimes inadequate for analyzing complex scenes.
Stereo is also popular in moving object detection and tracking. Beymer and Konolige [11] developed a method for tracking people by continuously detecting them using distance information obtained from a stationary stereo camera. Howard et al. [12] proposed a person detection method which first converts a depth map into a polar-perspective map on the ground and then extracts regions with largely-accumulated pixels. Calisi et al. [13] developed a robot system that can follow a moving person. It builds an appearance model for each person using stereo in advance. In tracking, the robot extracts candidate regions using the model and
confirms them using stereo. Occlusions between people are not handled in these works.

Fig. 1. Configuration of our system: the image processing part (camera and PC) estimates the 3D position of each person, and the robot control part converts it into robot motion.
Ess et al. [14], [15] proposed integrating various cues such as appearance-based object detection, depth estimation, visual odometry, and ground plane detection using a graphical model for pedestrian detection. Although their method performs well on complicated scenes, it is still too costly to be used for controlling a real robot.

In this paper, we propose a person tracking method using stereo. We prepare several depth templates to be used on dense depth images and detect person regions by template matching, followed by a support vector machine (SVM)-based verifier. Depth information is very effective for data association, for adjusting template size and values, and for occlusion handling. Person detection results are input to Extended Kalman filter-based trackers. The robot continuously estimates the position and the velocity of persons in the robot local coordinates to appropriately control its motion. Fig. 1 shows the configuration of our system. The main contribution of the paper is to show that a simple depth template-based approach, combined with an EKF and an SVM-based verifier, realizes robust person following by a mobile robot.
II. STEREO-BASED PERSON DETECTION AND TRACKING
To track persons stably with a moving camera, we use depth templates, which are templates of the human upper body in depth images (see Fig. 2); we currently use three templates with different body orientations. We made the templates from depth images in which the target person stood 2 [m] away from the camera. A depth template is a binary template; its foreground and background values are adjusted according to the status of the tracks and the input data.
Fig. 2. Depth templates (left, front, and right orientations).

A. Tracking

For a person being tracked, his/her predicted scene position is available from the corresponding EKF tracker (see Sec. III-B). We thus set the foreground depth of the template to the predicted depth of the person's head and search a region around the predicted image position for the person. Since the background depth may change as the camera moves, we estimate it on-line: we make a depth histogram of the current input depth image and use its K-th percentile as the background depth (currently, K = 90).
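As a minimal illustration of this on-line background estimate, the following sketch (the function name and the assumption of a NumPy depth map are ours, not the authors') computes the K-th percentile of the valid depths in the current frame:

```python
import numpy as np

def background_depth(depth_image, k=90.0):
    """Estimate the background depth as the K-th percentile of the
    current frame's depth values (K = 90 in the paper). Invalid
    (non-finite or non-positive) depths are ignored."""
    valid = depth_image[np.isfinite(depth_image) & (depth_image > 0)]
    return np.percentile(valid, k)
```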
For a depth template $T(x, y)$ of $H \times W$ pixels ($x \in [-W/2, W/2]$, $y \in [-H/2, H/2]$) and the depth image $I_D(x, y)$, the 2D image position $(x^*, y^*)$ is given as the position which minimizes the following SSD (sum of squared distances) criterion:

$$\sum_{p=-W/2}^{W/2} \; \sum_{q=-H/2}^{H/2} \left[ T(p, q) - I_D(x+p,\, y+q) \right]^2. \qquad (1)$$

Fig. 3. Detection examples using depth templates: (a) input images, (b) depth images.

We use the three templates simultaneously and take the one with the smallest SSD value as the detection result if that value is less than some threshold. Each template carries the position of the head, and the median depth value of the neighborhood of that position is used as the depth of the detected person from the camera. The accuracy of the depth value is empirically estimated at about one percent when a person is at about 3 [m] distance.
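The following sketch shows an exhaustive SSD search in the spirit of eq. (1); it is an illustration under our own assumptions (top-left-corner indexing, a rectangular search region that keeps the window inside the image), not the authors' implementation, which searches only around the EKF prediction with per-track foreground/background values:

```python
import numpy as np

def match_depth_template(depth, template, search_region):
    """Find the position minimizing the SSD of eq. (1) within a
    search region (y0, y1, x0, x1). Run once per template and keep
    the smallest SSD below a threshold, as in Sec. II-A."""
    h, w = template.shape
    y0, y1, x0, x1 = search_region
    best, best_pos = np.inf, None
    for y in range(y0, y1):
        for x in range(x0, x1):
            window = depth[y:y + h, x:x + w]
            ssd = np.sum((template - window) ** 2)
            if ssd < best:
                best, best_pos = ssd, (x, y)
    return best_pos, best
```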
B. Detection

We continuously check whether a new person appears in the image. In this case, we have no prediction and basically search the entire image. The foreground depth is set to the depth at each image position, and the background depth is set in the same way as in tracking.

We use the same SSD criterion (see eq. (1)) for judging whether a person exists at an image position. Since applying this SSD calculation to the entire image is costly, we examine three boundary points, to the left of, to the right of, and above the head position; only when the depth values of these points are at least one meter farther than the depth of the head is the SSD value calculated and evaluated. We also set a detection volume to search in the scene: its height range is 0.7–2.0 [m] and its depth range from the camera is 0.5–5.5 [m]. In addition, if the image position under consideration lies in an already-detected person region, the detection there is skipped unless its depth is at least one meter smaller than the depth of that region. These techniques greatly reduce the search cost. After collecting pixels with qualified SSD values, we extract the mass centers of all connected regions as the positions of newly detected persons.
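The boundary-point pre-test might look like the following sketch; the pixel offset of the boundary points and the assumption of depths in meters are ours, as the paper does not specify them:

```python
def passes_boundary_test(depth, x, y, head_depth, offset, margin_m=1.0):
    """Cheap pre-test from Sec. II-B: evaluate the SSD of eq. (1)
    only if the points to the left of, to the right of, and above the
    head position are at least one meter farther than the head depth.
    `offset` (pixel distance to the boundary points) is our assumption."""
    points = [(x - offset, y), (x + offset, y), (x, y - offset)]
    return all(depth[py, px] > head_depth + margin_m for (px, py) in points)
```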

Fig. 4. Training samples for the SVM-based verifier: (a) positive samples, (b) negative samples.

Figure 3 shows examples of detection using the depth templates. The three rectangles in each depth image are the detection results with the three templates, and the one with the highest evaluation value is drawn with a bold line. Even when the direction of the body changes, a person can be detected stably by using multiple templates.
C. Intensity-based false detection elimination

Simple template-based detection is effective in reducing the computational cost but may also produce many false detections for objects whose silhouettes are similar to a person's. To cope with this, we use an SVM-based person verifier on intensity images.

We collected many person candidate images detected by the depth templates and manually examined whether they were correct. Fig. 4 shows some of the positive and negative samples. We used 438 positive and 146 negative images for training. The size of the sample images is normalized to 20×20. The SVM uses an RBF kernel ($K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$, $\gamma = 8.0$). We use an OpenCV implementation of the SVM.
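A minimal training sketch along these lines is shown below, using the modern cv2.ml API; the paper states only the kernel and γ = 8.0, so the SVM type and the C parameter are our assumptions:

```python
import cv2
import numpy as np

def train_verifier(pos_patches, neg_patches):
    """Train an RBF-kernel SVM on 20x20 grayscale patches flattened
    to 400-dim vectors, as in Sec. II-C (gamma = 8.0 per the paper;
    C-SVC with C = 1.0 is our assumption)."""
    X = np.vstack([p.reshape(-1) for p in pos_patches + neg_patches]).astype(np.float32)
    y = np.array([1] * len(pos_patches) + [-1] * len(neg_patches), dtype=np.int32)
    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_RBF)
    svm.setGamma(8.0)
    svm.setC(1.0)
    svm.train(X, cv2.ml.ROW_SAMPLE, y)
    return svm
```

At run time, each candidate region detected by the depth templates would be resized to 20×20 and passed to svm.predict for verification.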
We examine the performance of the SVM-based verifier using three image sequences which had not been used for training. The numbers of persons appearing in the sequences are zero, one, and two, respectively. We used the image regions detected using the depth templates. Table I summarizes the results. It is noted that the rate of eliminating false positives is 100%. This is very important because simple depth template-based person detection tends to produce many false positives. On the other hand, the verifier sometimes eliminates actual person regions; the false negative rate is about six percent. The EKF-based tracker can usually cope with such occasional failures of person detection.

TABLE I
PERFORMANCE SUMMARY OF THE SVM-BASED VERIFIER.

# of persons   ground truth   judged to exist   judged not to exist
zero           exist          -                 -
               not exist      0                 126
one            exist          414               5
               not exist      0                 75
two            exist          391               31
               not exist      0                 491

Fig. 5. Definition of coordinate systems.

Fig. 6. Control of a wheeled mobile robot (wheel speeds vL and vR, wheel separation 2d, turning center).

III. PERSON TRACKING AND ROBOT CONTROL

A. Configuration of our system

Figure 5 illustrates the coordinate systems attached to our mobile robot and stereo system. The relation between the robot and the camera coordinate systems is given by

$$Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = A \left[\, R \mid T \,\right] \begin{bmatrix} X_r \\ Y_r \\ Z_r \\ 1 \end{bmatrix}, \qquad (2)$$

where A, R, and T denote the intrinsic parameter matrix, the rotation matrix, and the translation vector, respectively.
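As an illustration of eq. (2), the following sketch (function and variable names are ours) projects a robot-frame point to image coordinates:

```python
import numpy as np

def project_to_image(Xr, A, R, T):
    """Project a point Xr = (X_r, Y_r, Z_r) in robot coordinates to
    image coordinates (x, y) via eq. (2):
    Zc [x y 1]^T = A [R|T] [X_r Y_r Z_r 1]^T."""
    Xr_h = np.append(np.asarray(Xr, dtype=float), 1.0)  # homogeneous point
    p = A @ np.hstack([R, T.reshape(3, 1)]) @ Xr_h      # 3-vector (Zc*x, Zc*y, Zc)
    return p[:2] / p[2]                                  # divide out Zc
```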
B. Estimation of 3D position using EKF

1) State equation: In the robot coordinate system, the person's position at time t is defined as $(X_t, Y_t, Z_t)$. The state variable $\mathbf{x}_t$ is defined as

$$\mathbf{x}_t = \begin{bmatrix} X_t & Y_t & Z_t & \dot{X}_t & \dot{Y}_t \end{bmatrix}^T,$$

where $\dot{X}_t$ and $\dot{Y}_t$ denote velocities in the horizontal plane. We first consider the case where the robot does not move. The system equation is given by

$$\mathbf{x}_{t+1} = F_t \mathbf{x}_t + G_t \mathbf{w}_t, \qquad (3)$$

where $\mathbf{w}_t$ is the process noise and

$$F_t = \begin{bmatrix} 1 & 0 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & 0 & \Delta t \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad G_t = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \Delta t & 0 \\ 0 & \Delta t \end{bmatrix},$$

$$Q_t = \mathrm{Cov}(\mathbf{w}_t) = E\left[\mathbf{w}_t \mathbf{w}_t^T\right] = \sigma_w^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
We then consider the case where the robot moves. Figure 6 shows how a wheeled mobile robot moves. The distance between the two wheels is denoted as 2d. When the wheels rotate with speeds $v_L$ and $v_R$, the velocity $v$, the angular velocity $\omega$, and the turning radius $\rho$ of the robot satisfy

$$v = (v_R + v_L)/2, \quad \omega = (v_R - v_L)/2d, \quad \rho = d(v_R + v_L)/(v_R - v_L).$$

The rotation angle $\theta$ and the moved distance $\Delta L$ during time $\Delta t$ are obtained respectively as

$$\theta = \omega \Delta t, \quad \Delta L = 2\rho \sin(\theta/2).$$

In addition, the robot movement $\Delta X$ and $\Delta Y$, seen from the robot position at time t, are obtained respectively as

$$\Delta X = \Delta L \cos(\theta/2), \quad \Delta Y = \Delta L \sin(\theta/2).$$

We then have the relationship between the position and the velocity of a person before and after the coordinate transformation from the robot coordinates at time t to those at time t+1 as follows:

$$X^{(t+1)} = (X^{(t)} - \Delta X) \cos\theta + (Y^{(t)} - \Delta Y) \sin\theta,$$
$$Y^{(t+1)} = -(X^{(t)} - \Delta X) \sin\theta + (Y^{(t)} - \Delta Y) \cos\theta,$$
$$\dot{X}^{(t+1)} = \dot{X}^{(t)} \cos\theta + \dot{Y}^{(t)} \sin\theta - v,$$
$$\dot{Y}^{(t+1)} = -\dot{X}^{(t)} \sin\theta + \dot{Y}^{(t)} \cos\theta.$$

By combining these equations with eq. (3), the state equation that considers the robot movement $\mathbf{u}_t = [v_L\ v_R]^T$ is expressed as

$$\mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t) + G_t \mathbf{w}_t, \qquad (4)$$

where

$$f_t(\mathbf{x}_t, \mathbf{u}_t) = \begin{bmatrix} (X_t + \Delta t \dot{X}_t - \Delta X) \cos\theta + (Y_t + \Delta t \dot{Y}_t - \Delta Y) \sin\theta \\ -(X_t + \Delta t \dot{X}_t - \Delta X) \sin\theta + (Y_t + \Delta t \dot{Y}_t - \Delta Y) \cos\theta \\ Z_t \\ \dot{X}_t \cos\theta + \dot{Y}_t \sin\theta - v \\ -\dot{X}_t \sin\theta + \dot{Y}_t \cos\theta \end{bmatrix}.$$
2) Observation equation: The observed person position in the robot coordinate system is denoted as $\mathbf{y}_t$. The observation equation is expressed as

$$\mathbf{y}_t = H_t \mathbf{x}_t + \mathbf{v}_t, \qquad (5)$$

where $\mathbf{v}_t$ is the observation noise and

$$\mathbf{y}_t = \begin{bmatrix} X_r \\ Y_r \\ Z_r \end{bmatrix}, \quad H_t = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix},$$

$$R_t = \mathrm{Cov}(\mathbf{v}_t) = E\left[\mathbf{v}_t \mathbf{v}_t^T\right] = \sigma_v^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

3) Extended Kalman filter: The Extended Kalman filter (EKF) is formulated using the state equation (4) and the observation equation (5). The EKF estimates the position and the velocity of a person together with their uncertainties.
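One EKF cycle for eqs. (4)–(5) might look like the sketch below, built on the motion_model sketch above. It is an illustration under our assumptions (finite-difference linearization of f_t; the paper does not describe how the Jacobian is obtained), not the authors' code:

```python
import numpy as np

def ekf_step(x, P, u, z, dt, d, sigma_w, sigma_v):
    """Predict with the motion model of eq. (4), then update with the
    observed 3D position z per eq. (5). If z is None (e.g. the person
    is occluded), only the prediction step is performed (Sec. III-C)."""
    n = x.size
    fx = motion_model(x, u, dt, d)
    # Finite-difference Jacobian of the motion model around x.
    F = np.zeros((n, n))
    eps = 1e-5
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        F[:, i] = (motion_model(x + dx, u, dt, d) - fx) / eps
    G = np.zeros((n, 2)); G[3, 0] = dt; G[4, 1] = dt
    Q = (sigma_w ** 2) * np.eye(2)
    x_pred, P_pred = fx, F @ P @ F.T + G @ Q @ G.T
    if z is None:
        return x_pred, P_pred          # prediction only
    H = np.hstack([np.eye(3), np.zeros((3, 2))])
    R = (sigma_v ** 2) * np.eye(3)
    S = H @ P_pred @ H.T + R           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(n) - K @ H) @ P_pred
    return x_new, P_new
```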

Fig. 7. Correctly tracking two persons in an occlusion case (input images and depth images).

C. Data association and occlusion handling

3D position information is effective in data association. We use the predicted 3D position to adjust the size and the foreground depth of the depth templates to be used (see Section II-A). If a person is detected, his or her 3D position is tested with the Mahalanobis distance to decide whether the detection can be matched to the corresponding track.
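A gating test of this kind could be sketched as follows; the paper only states that the Mahalanobis distance is tested, so the chi-square gate threshold is our assumption:

```python
import numpy as np
from scipy.stats import chi2

def gate(z, x_pred, P_pred, H, R, p=0.99):
    """Mahalanobis-distance gating for data association (Sec. III-C):
    accept observation z for a track if its squared Mahalanobis
    distance to the prediction falls inside a chi-square gate."""
    S = H @ P_pred @ H.T + R           # innovation covariance
    nu = z - H @ x_pred                # innovation
    d2 = float(nu.T @ np.linalg.inv(S) @ nu)
    return d2 <= chi2.ppf(p, df=z.size)
```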
3D information is also used for occlusion handling. When an occlusion between two persons is reliably predicted and the occluding person is correctly detected, only the prediction step of the EKF is performed for the occluded person. Possible occlusion relationships are enumerated by examining the predicted 3D positions of the tracks.

In ordinary situations, persons pass each other while keeping a certain distance (say, one meter) between them. In our current setting, this depth difference can be detected as long as the persons are within about four meters of the camera; this is enough for the robot to correctly recognize person motion in a local region around itself.
Figure 7 shows an example of correctly tracking two persons under occlusion and depth change. In the middle row of the figure, the person behind is completely occluded and only the prediction step of the EKF is performed. After the occlusion, the track continues correctly.
D. Tracking algorithm

The image processing part (see Fig. 1) works as follows:
1) Stereo processing: The depth image is computed with the stereo camera.
2) Person tracking: Each person is tracked using the EKF described in Section III-B.
2.1) Prediction: The 3D position and its uncertainty at the current time t are predicted from the state variable at the previous time t-1. They are then projected to the 2D image by eq. (2). The projected uncertainty region is used for determining the predicted region.
2.2) Observation: The predicted region is searched for the person by the method described in Section II-A. The templates used for the search are generated based on the depth of the person. After the search, the person's 3D position y_t is calculated by eq. (2) from the image coordinates (x, y) and the distance from the camera Z_c = D.
2.3) Data association: Correspondences are made between tracks and observations by the procedure described in Section III-C.
2.4) Update: If an observation is obtained, the state variable is updated.
3) Detection: Persons who newly appear in the image are detected with the depth templates.
4) Communication: The estimated positions are sent to the robot control part, and the rotational speeds of the left and right wheels are received.
IV. CONTROL TO FOLLOW A SPECIFIC PERSON

The robot, which has a two-wheel differential drive, follows a circular trajectory from the current position to the target position (X, Y) (path A in Fig. 8). In this case, the wheel speeds needed to move the robot at velocity v are calculated as follows. Since the turning center lies on the robot's lateral axis at distance $\rho$, we have

$$\rho^2 = X^2 + (\rho - Y)^2,$$

which gives

$$\rho = (X^2 + Y^2)/2Y.$$

Fig. 8. Path to target position (X, Y).

Fig. 10. Initial positions of the robot and the persons (A, B, and C) under the ceiling camera.

Then we can calculate the wheel velocities as

$$v_L = v\left(1 - \frac{d}{\rho}\right) = v\left(1 - \frac{2dY}{X^2 + Y^2}\right), \quad v_R = v\left(1 + \frac{d}{\rho}\right) = v\left(1 + \frac{2dY}{X^2 + Y^2}\right).$$

When the robot follows this circular path, however, the turning rate of the robot orientation is relatively slow, and the target person tends to go out of the field of view. On the other hand, if the robot first turns and then moves straight toward the target, like path B, the robot movement is not smooth. We thus use a path like path C, on which the robot turns toward the target while moving ahead. In this case, the velocity of each wheel is adjusted as follows:

$$v_L = v\left(1 - k\frac{2dY}{X^2 + Y^2}\right), \quad v_R = v\left(1 + k\frac{2dY}{X^2 + Y^2}\right).$$

This means the turning radius is reduced to $\rho/k$.
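A direct transcription of these control formulas (function name is ours; k = 1 recovers the circular path A):

```python
def wheel_speeds(X, Y, v, d, k=1.0):
    """Wheel speeds for following the target at (X, Y) in robot
    coordinates (Sec. IV). k > 1 tightens the turn (radius rho/k),
    as with path C. Assumes (X, Y) != (0, 0)."""
    denom = X ** 2 + Y ** 2
    vL = v * (1.0 - k * 2.0 * d * Y / denom)
    vR = v * (1.0 + k * 2.0 * d * Y / denom)
    return vL, vR
```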
V. EXPERIMENTAL RESULTS

A. Experimental setup

We implemented the proposed method on a PeopleBot (by MobileRobots) with a Bumblebee2 stereo camera (by Point Grey Research) for the experiments (see Fig. 5). A notebook PC (Core 2 Duo, 2.6 GHz) performs all processing, including stereo computation, person detection and tracking, and robot motion control. The processed image size is 512×384, and the processing time is about 90 [msec/frame]. Table II shows the breakdown of the processing time; our system can process about eleven frames per second.

We implemented the software modules for person detection and tracking, motion planning, and robot control as RT components in the RT-middleware environment [16] for easier development and maintenance.
TABLE II
BREAKDOWN OF PROCESSING TIME.

Processing                                        Time
1) Image acquisition & stereo processing          40 [ms]
2) Person tracking (in the case of two persons)   20 [ms]
3) Person detection                               10 [ms]
4) Communication, display, and data saving        20 [ms]
Total                                             90 [ms]

Fig. 11. Traces of two persons and the robot (4 m × 4 m square on the floor).

B. Person following experiments

Figure 9 shows a tracking result. The robot-view images show the results of person detection: each circle shows the result of observation with the depth templates, and each small point shows the 3D head position estimated by the EKF. The ceiling-camera images show the positions of the robot and the persons. The curves in the final frame (#156) show the traces of the robot and the persons.

Figure 10 shows the initial positions of the robot and the persons. The robot moved toward person A, who was detected first and considered the target. Even when persons B and C passed between the robot and person A, the target person was correctly tracked.
C. Evaluation of person position estimation

We evaluated the quality of the person position estimation. Figure 11 shows the traces of the robot and two persons in the robot's initial coordinates. Person A moved along two edges of a 4 × 4 [m] square drawn on the floor. Person B moved so as to temporarily occlude person A.

The robot followed person A while estimating the positions of both persons. The average and maximum errors in position estimation for person A were 125 [mm] and 336 [mm], respectively. This result shows that the position estimation is accurate enough for the robot to follow a specific person.

Fig. 9. Experimental result with one person to follow and two others (frames #038–#156; robot views with the corresponding ceiling camera views).

VI. CONCLUSIONS AND FUTURE WORK

This paper has described a method of detecting and tracking multiple persons for a mobile robot using distance information obtained by stereo. We presented an EKF-based formulation by which the robot continuously estimates the position and the velocity of persons. Distance information is effectively utilized for robust person detection, data association, and occlusion handling. We realized a robot that can robustly follow a specific person while recognizing the target and other persons, even with occasional occlusions.

The current algorithm does not consider the case where multiple persons are too close to be separated by depth information. To cope with such cases, it would be necessary to use other visual information such as color and texture. It is also necessary to handle static obstacles such as furniture, as well as to perform effective path planning, to realize a person following robot that can operate in more complex environments.
Acknowledgment

The authors would like to thank Yuki Ishikawa for his help in implementing the system. This work is supported by the NEDO (New Energy and Industrial Technology Development Organization, Japan) Intelligent RT Software Project.
REFERENCES

[1] P. Viola, M.J. Jones, and D. Snow. Detecting Pedestrians Using Patterns of Motion and Appearance. Int. J. of Computer Vision, Vol. 63, No. 2, pp. 153-161, 2005.
[2] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 886-893, 2005.
[3] B. Han, S.W. Joo, and L.S. Davis. Probabilistic Fusion Tracking Using Mixture Kernel-Based Bayesian Filtering. In Proceedings of the 11th Int. Conf. on Computer Vision, 2007.
[4] D.M. Gavrila. A Bayesian, Exemplar-Based Approach to Hierarchical Shape Matching. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, No. 8, pp. 1408-1421, 2007.
[5] S. Munder, C. Schnörr, and D.M. Gavrila. Pedestrian Detection and Tracking Using a Mixture of View-Based Shape-Texture Models. IEEE Trans. on Intelligent Transportation Systems, Vol. 9, No. 2, pp. 333-343, 2008.
[6] C.-Y. Lee, H. González-Baños, and J.-C. Latombe. Real-Time Tracking of an Unpredictable Target Amidst Unknown Obstacles. In Proceedings of the 7th Int. Conf. on Control, Automation, Robotics and Vision, pp. 596-601, 2002.
[7] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. People Tracking with a Mobile Robot Using Sample-Based Joint Probabilistic Data Association Filters. Int. J. of Robotics Research, Vol. 22, No. 2, pp. 99-116, 2003.
[8] N. Bellotto and H. Hu. Multisensor Data Fusion for Joint People Tracking and Identification with a Service Robot. In Proceedings of the 2007 IEEE Int. Conf. on Robotics and Biomimetics, pp. 1494-1499, 2007.
[9] H. Koyasu, J. Miura, and Y. Shirai. Realtime Omnidirectional Stereo for Obstacle Detection and Tracking in Dynamic Environments. In Proceedings of the 2001 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 31-36, 2001.
[10] M. Kobilarov, G. Sukhatme, J. Hyams, and P. Batavia. People Tracking and Following with a Mobile Robot Using an Omnidirectional Camera and a Laser. In Proceedings of the 2006 IEEE Int. Conf. on Robotics and Automation, pp. 557-562, 2006.
[11] D. Beymer and K. Konolige. Real-Time Tracking of Multiple People Using Continuous Detection. In Proceedings of the 7th Int. Conf. on Computer Vision, 1999.
[12] A. Howard, L.H. Matthies, A. Huertas, M. Bajracharya, and A. Rankin. Detecting Pedestrians with Stereo Vision: Safe Operation of Autonomous Ground Vehicles in Dynamic Environments. In Proceedings of the 13th Int. Symp. of Robotics Research, 2007.
[13] D. Calisi, L. Iocchi, and R. Leone. Person Following through Appearance Models and Stereo Vision using a Mobile Robot. In Proceedings of the VISAPP-2007 Workshop on Robot Vision, pp. 46-56, 2007.
[14] A. Ess, B. Leibe, and L. Van Gool. Depth and Appearance for Mobile Scene Analysis. In Proceedings of the 11th Int. Conf. on Computer Vision, 2007.
[15] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A Mobile Vision System for Robust Multi-Person Tracking. In Proceedings of the 2008 IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
[16] N. Ando, T. Suehiro, K. Kitagaki, T. Kotoku, and W.-K. Yoon. RT-Middleware: Distributed Component Middleware for RT (Robot Technology). In Proceedings of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 3555-3560, 2005.
