
Real-time Face Tracking System
for Human-Robot Interaction

Yoshio Matsumoto†, Alexander Zelinsky‡

†Graduate School of Information Science, Nara Institute of Science and Technology
8916-5, Takayama-cho, Ikoma-city, Nara, 630-0101, Japan
‡Dept. of Systems Engineering, RSISE, The Australian National University
Email: [email protected]
URL: http://robotics.aist-nara.ac.jp/~yoshio

Abstract

When a person instructs operations to a robot, or performs a cooperative task with a robot, it is necessary to inform the robot of the person's intention and attention. Since the motion of a person's face and the direction of the gaze are deeply related to the person's intention and attention, detecting such motions can be utilized as a natural way of communication in human-robot interaction. In this paper, we describe our real-time stereo face tracking system. The key of our system is the use of stereo vision. Since the 3D coordinates of the features on the face can be directly measured in our system, we can drastically simplify the algorithm of the 3D model fitting to obtain the full 3D pose of the head, compared with conventional systems that use a monocular camera. Consequently we achieved a non-contact, passive, real-time, robust, accurate and compact measurement system for a human's head pose and gaze direction.

1 Face and Gaze Detection

1.1 Visual Human Interfaces

When a person instructs operations to a robot, or performs a cooperative task with a robot, it is necessary to inform the robot of the person's intention and attention. Since the motion of a person's face and the direction of the gaze are deeply related to the person's intention and attention, detecting such motions can be utilized as a natural way of communication for human-robot interaction. For example, the direction of the gaze represents a target of operation in some situations, and the motion of the head represents a gesture in other situations.

1.2 Face and Gaze Detection System

Several kinds of commercial products exist to detect a human's head position and orientation, such as magnetic sensors and link mechanisms. There are also several companies offering products that perform eye gaze tracking. These products are generally highly accurate and reliable; however, all require either expensive hardware or artificial environments (helmets, infrared lighting, markings on the face, etc.). Therefore the discomfort and the restriction of motion affect the human's behavior, which makes it difficult to measure the natural behavior of a human.

To solve this problem, many research results have been reported on visually detecting the pose of a head [1, 2, 3, 4, 5, 6, 7]. The recent advances in hardware have allowed vision researchers to develop such real-time face tracking systems. However, all of these systems utilize a monocular camera. Recovering the 3D pose from a monocular image stream is known to be a difficult problem in general, and high accuracy as well as robustness is hard to achieve. Therefore, some cannot compute the full 3D, 6DOF posture of the head. Some researchers have also developed systems to detect both the head pose and gaze point simultaneously [8, 9]; however, none of them can acquire the 3D vector of the gaze line.

In order to construct a system which observes a person without giving him/her any discomfort, it should satisfy the following requirements:

- non-contact
- passive
- real-time
- robust to occlusion and lighting change
- compact
- accurate
- capable of detecting head posture and gaze direction simultaneously

©1999 IEEE   0-7803-5731-0/99 $10.00   II-830
Our system satisfies all those requirements simultaneously by utilizing the following techniques:

- stereo vision with a field multiplexing device
- an image processing board with a normalized correlation capability
- 3D model fitting based on virtual springs

The details of the hardware and software systems are described in sections 2 and 3 respectively. Some experimental results are shown in section 4. Finally, the conclusion and future work are described in section 5.

2 Hardware Configuration of Real-time Stereo Vision System

2.1 System Setup

Figure 1 illustrates the hardware setup of our real-time stereo face tracking system. It has an NTSC camera pair (SONY EVI-370DG x 2) to capture a person's face. The output video signals from the cameras are multiplexed into one video signal by the "field multiplexing technique" [10]. The multiplexed video stream is then fed into a vision processing board (Hitachi IP5000), where the position and the orientation of the face are recognized. The result of the recognition is visualized by a graphics workstation (SGI O2).

2.2 IP5000 Image Processing Board

The IP5000 is a half-sized PCI image processing board which is used connected to an NTSC camera and a TV monitor. It is equipped with 40 frame memories of 512 x 512 pixels. It provides a wide variety of fast image processing functions performed in hardware, such as binarization, convolution, filtering, labeling, histogram calculation, color extraction and normalized correlation. The operating frequency is 73.5[MHz]; therefore it can apply a basic function (e.g. binarization) to one image within 3.6[ms].

2.3 Field Multiplexing Device

Field multiplexing is a technique to generate a multiplexed video stream from two video streams in the analog phase. A diagram of the device is shown in Figure 2. The device takes two video streams which are synchronized. They are input into a video switching IC, and one of them is selected and output in every field. Thus the frequency of the switching is only 60[Hz], which makes the device quite easy and cheap to implement. A photo of the device is also shown in Figure 2. The size is less than 5[cm] square, using only consumer electronic parts.

The advantage of multiplexing video signals in the analog phase is that it can be applied to any vision system which takes a single video stream as an input, and makes it perform stereo vision processing. Since the multiplexed image is stored in a single video frame memory, stereo image processing can be performed within that memory. This means there is no overhead cost for image transfer, which is inevitable in a stereo vision system with two image processing boards. Thus a system with a field multiplexing device can have a higher performance than a system with two boards.

The weak point of field multiplexing is that the image looks strange to human eyes if the signal is displayed directly on a TV monitor, because the two images are superimposed every two lines. However, it doesn't make image processing any harder, since a normal image can be obtained by subsampling the multiplexed image in the vertical direction.

Fig. 1: System configuration of human-machine interface.

Fig. 2: Block diagram and a photograph of the Field Multiplexing Device.
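The vertical subsampling that recovers a normal image from the multiplexed signal amounts to taking every second scan line. A minimal sketch in Python follows; the even/odd assignment of cameras to fields is an assumption, since the paper does not state which camera occupies which field:

```python
def demultiplex(frame):
    """Split a field-multiplexed frame back into two half-height images.

    Assumes the two camera signals alternate line by line: even rows carry
    one camera's field, odd rows the other's (an illustrative assumption).
    `frame` is a list of pixel rows.
    """
    right = frame[0::2]  # every second row, starting at row 0
    left = frame[1::2]   # every second row, starting at row 1
    return right, left

# A toy 4-row "frame": row tags R/L mark which camera each line came from.
frame = [["R0"], ["L0"], ["R1"], ["L1"]]
right, left = demultiplex(frame)
```

Each half-height image can then be processed as an ordinary monocular image, which is why the multiplexing costs the vision board nothing extra.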
3 Stereo Face Tracking Algorithm

3.1 3D Facial Model

The 3D facial model utilized in our stereo face tracking is composed of three items as follows:

- images of the facial features,
- 3D coordinates of the facial features,
- an image of the whole face.

The facial features are defined as both corners of the eyes and of the mouth. Thus there are six feature images and coordinates in a facial model, an example of which is shown in Figure 3. The facial model also contains an image of the whole face in low resolution. This image is utilized in searching for the face both at the initial stage and after the system fails in tracking.

The facial model can be acquired either in the "automatic model acquisition stage" or in the "manual model acquisition stage." In the automatic manner, the eyes and mouth are detected by first finding skin colored regions in the image and then binarizing intensity information within the skin colored facial region. Small patches at both ends of the extracted eyes and mouth are then memorized as feature templates, and the 3D coordinates of the features are calculated based on stereo matching. In the manual manner, the patches of the features are simply selected by clicking their positions in an image, and stereo matching is then performed to calculate the 3D coordinates.

Fig. 4: Tracking Algorithm.
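The stereo matching step that turns a matched feature pair into 3D coordinates can be illustrated as follows. This is a sketch only, not the authors' calibration: it assumes an ideal rectified stereo pair (parallel optical axes, matched scan lines), where depth follows directly from horizontal disparity; `focal_px` and `baseline_mm` are hypothetical parameters.

```python
def triangulate(x_left, y_left, x_right, focal_px, baseline_mm):
    """Recover a 3D point from a matched feature in a rectified stereo pair.

    Assumes pinhole cameras with parallel axes separated by `baseline_mm`,
    so depth is focal * baseline / disparity (standard rectified geometry).
    Image coordinates are measured from the principal point, in pixels.
    """
    disparity = x_left - x_right             # pixels; > 0 for points in front
    if disparity <= 0:
        raise ValueError("non-positive disparity: no valid depth")
    z = focal_px * baseline_mm / disparity   # depth in mm
    x = x_left * z / focal_px                # back-project image coordinates
    y = y_left * z / focal_px
    return (x, y, z)

# Example: 800 px focal length, 100 mm baseline, 20 px disparity.
X, Y, Z = triangulate(40.0, 10.0, 20.0, 800.0, 100.0)
```

Repeating this for the six feature patches yields the set of 3D feature coordinates stored in the facial model.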
3.2 Stereo Tracking Algorithm

The flowchart of the stereo tracking algorithm is shown in Figure 4. Before starting the face tracking, an error recovery procedure is carried out to determine the approximate position of the face in the live video stream using the whole face image.

Then tracking and stereo matching of each feature are carried out to determine the 2D and 3D position of each feature. The 3D facial model is then fitted to these 3D measurements, and the 3D position and orientation of the face are best estimated in terms of six parameters. Finally, the 3D coordinates of each feature are adjusted to keep the consistency of the rigid body of the facial model, and they are projected back onto the 2D image plane in order to update the search area for feature tracking in the next frame.

At the end of the tracking process, the reliability of tracking is calculated based on the correlation values of feature tracking and stereo matching. If the reliability is higher than a threshold, the system jumps to the beginning of the face tracking again. Otherwise the system decides it has lost the face and jumps back to the error recovery stage.

Fig. 3: Upper: Extracted six features in the stereo images (feature templates, feature coordinates and whole face template), Lower: 3D facial model.
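The control flow of this loop (coarse search, per-frame tracking and fitting, reliability test, fall back to recovery) can be sketched as follows. The callables and the 0.6 threshold are hypothetical placeholders, not values from the paper:

```python
def run_tracker(frames, find_face, track_and_fit, threshold=0.6):
    """Sketch of the tracking control flow described in Figure 4.

    `find_face` does the coarse whole-face search (error recovery stage);
    `track_and_fit` returns (pose, reliability) for one frame, where
    reliability stands for the correlation-based measure in the text.
    Both callables and the threshold are illustrative assumptions.
    """
    poses = []
    pose = None                      # None means "face position unknown"
    for frame in frames:
        if pose is None:
            pose = find_face(frame)  # error recovery / initial search
            if pose is None:
                poses.append(None)   # face not found in this frame
                continue
        pose, reliability = track_and_fit(frame, pose)
        if reliability < threshold:
            pose = None              # lost the face; recover next frame
        poses.append(pose)
    return poses
```

The key property is that tracking and recovery alternate automatically: a low-reliability frame simply resets the state so the next frame re-runs the coarse search.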

3.2.1 3D Feature Tracking

In the 3D feature tracking stage, each feature is assumed to have a small motion between the current frame and the previous one, and the 2D position in the previous frame is utilized to determine the search area in the current frame. The feature images stored in the 3D facial model are used as templates, and the right image is used as a search area. Then the matched image from the 2D feature tracking is used as a template and the left image is utilized as a search area. Thus the 3D coordinates of each feature are acquired. The processing time of the whole tracking process (i.e. feature tracking + stereo matching for six features) is about 10[ms] on the IP5000.

3.2.2 3D Model Fitting

Figure 5 illustrates the coordinate system used to represent the position and the orientation of the face. The parameters (φ, θ, ψ) represent the orientation of the face, and (x, y, z) represents the position of the face center relative to the origin of the camera axis.

The diagrams in Figure 6 describe the model fitting method in our face tracking system. In the real implementation six features are used for tracking; however, only three points are illustrated in the diagrams for simplicity. The basic idea of the model fitting is to move the model closer to the measurements iteratively while considering the reliability of the results of the 3D feature tracking. As mentioned before, the face is assumed to have a small motion between frames. This also means there can be only small displacements in terms of the position and the orientation, which are described as (Δx, Δy, Δz, Δφ, Δθ, Δψ) in Figure 6 (1). Then the position and the orientation acquired in the previous frame (at time t) are utilized to rotate and translate the measurement set back closer to the model, as shown in Figure 6 (2). After the rotation and translation, the measurements still have a small disparity to the model due to the motion which occurred during the interval Δt. Then the fine model fitting is performed. To realize robust model fitting, it is essential to take the reliability values of the individual measurements into account. The least squares method is usually adopted for such purposes. In our system, a similar fitting approach based on virtual springs is utilized. The result of 3D tracking gives two correlation values between 0 and 1 for each feature. If a template and a search area utilized in image correlation have the same pattern, the resulting correlation value becomes 1. Therefore the product of the two correlation values can be regarded as a reliability value which is also between 0 and 1.

The reliability values are then used as the stiffness of the springs between each feature in the model and the corresponding measurement, as shown in Figure 6 (3). The model is then rotated and translated gradually and iteratively to reduce the elastic energy of the springs. The weighting based on the reliability makes the fitting result insensitive to partial matching failures, and enables robust face tracking. The iterative model fitting takes less than 2[ms] using a PentiumII 450MHz.

Fig. 5: Coordinate system for face tracking system.

Fig. 6: 3D model fitting algorithm (stiffness of spring k_n ∝ correlation value).
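The spring relaxation can be sketched in two dimensions, where the pose reduces to (tx, ty, θ). This is an illustrative reimplementation of the idea, not the authors' code: the step sizes and iteration count are invented, and the real system fits all six pose parameters in 3D.

```python
import math

def spring_fit_2d(model, measured, stiffness, iters=500, step=0.05):
    """Minimal 2D illustration of the virtual-spring model fitting.

    Each model point is tied to its measured counterpart by a spring whose
    stiffness is that feature's reliability (product of correlation values),
    so unreliable features pull only weakly.  The pose (tx, ty, theta) is
    nudged downhill on the total elastic energy until it settles.
    """
    tx = ty = theta = 0.0
    for _ in range(iters):
        fx = fy = torque = 0.0
        c, s = math.cos(theta), math.sin(theta)
        for (px, py), (qx, qy), k in zip(model, measured, stiffness):
            # model point under the current pose estimate
            mx = c * px - s * py + tx
            my = s * px + c * py + ty
            rx, ry = qx - mx, qy - my          # spring extension vector
            fx += k * rx                        # net force on translation
            fy += k * ry
            # moment of the spring force about the model center
            torque += k * ((mx - tx) * ry - (my - ty) * rx)
        tx += step * fx
        ty += step * fy
        theta += step * 0.1 * torque            # smaller angular step
    return tx, ty, theta
```

Because each term is scaled by its stiffness, a feature whose correlation collapsed (e.g. under occlusion) contributes almost no force, which is exactly the robustness property described above.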

3.3 Error Recovery

The tracking method described above uses only a small search area in the image, which enables real-time processing and continuously stable tracking results. However, once the system fails to track the face, it is hard for the system to recover by using only the local template matching, and a complementary method for finding the face in the image is necessary as an error recovery function. This process is also used at the beginning of the tracking, when the position of the face is unknown.

The whole face image shown in Figure 3 is used in this process. In order to reduce the processing time, the template is memorized in low resolution. The live video stream is also reduced in resolution. The template is first searched for in the right image, and then the matched image is searched for in the left image. As a result, the rough 3D position of the face is determined and is then used as the initial state for the face tracking. This searching process takes about 100[ms].

Fig. 7: Result of face tracking at various head postures.
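The coarse search can be sketched as an exhaustive normalized-correlation scan over the reduced-resolution image. This illustrates the principle only; the IP5000 performs normalized correlation in hardware, and the image/template representation here is a hypothetical one:

```python
import math

def ncc(patch, template):
    """Normalized correlation of two equal-sized patches (flat lists).
    Returns a value in [-1, 1]; 1 means identical patterns."""
    n = len(template)
    mp = sum(patch) / n
    mt = sum(template) / n
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, template))
    dp = math.sqrt(sum((p - mp) ** 2 for p in patch))
    dt = math.sqrt(sum((t - mt) ** 2 for t in template))
    if dp == 0 or dt == 0:
        return 0.0              # flat patch: correlation undefined
    return num / (dp * dt)

def search(image, template, th, tw):
    """Exhaustive template search: return ((row, col), score) maximizing
    NCC.  `image` is a list of rows; the template is th x tw, given flat."""
    best, best_pos = -2.0, None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            score = ncc(patch, template)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos, best
```

Running the same search twice, first with the stored whole-face template against the right image and then with the matched patch against the left image, yields the rough stereo position used to reinitialize the tracker.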

4 Experimental Results

4.1 Face Tracking

Some snapshots obtained as results of tracking with our real-time face tracking system are shown in Figure 7. (1) and (2) in Figure 7 are the results when the face rotates, while (3) and (4) show the results when the face moves closer to and further from the camera. The whole tracking process takes about 30[ms], which is within the NTSC video frame rate. The accuracy of the tracking is approximately ±1[mm] in translation and ±1[deg] in rotation.

The snapshots in Figure 8 show the results of tracking when there is some deformation of the facial features and partial occlusion of the face by a hand. The results indicate that our tracking system works quite robustly in such situations, owing to our model fitting method. By utilizing the normalized correlation function on the IP5000, the tracking system is also tolerant of fluctuations in lighting.

Fig. 8: Result of face tracking in situations with deformation and occlusion.

4.2 Visualization

The results of the tracking are visualized using an SGI O2 graphics workstation. Figure 9 illustrates examples of the tracking results and the corresponding visualization. The 3D model used in the visualization consists of a rigid surface of the face and two eyeballs. The face has six DOF for position and orientation, and the eyeballs have two DOF each. The position of the irises is detected using the circular Hough transform, and is used to move the eyes of the mannequin head. The visualization process is performed online during the tracking; therefore the mannequin head can mimic the person's head and eye motions in real-time.

Fig. 9: Visualization of tracking results.

5 Conclusion

In this paper, our robust real-time stereo face tracking system for visual human interfaces was presented. The system consists of a stereo camera pair and a standard PC equipped with an image processing board, and is able to detect the position and orientation of the face. The face tracking system is (1) non-intrusive, (2) passive, (3) real-time and (4) accurate, a combination which had not been achieved by previous research. The quantitative accuracy and robustness of the tracking are yet to be evaluated; however, we believe that the performance of the system is quite high compared with existing systems.

By extending the system presented in this paper, we have already succeeded in detecting the 3D gaze vector of a person in real-time (at 15[Hz]). Snapshots of the experiment are shown in Figure 10. Our system was developed to be utilized as a visual interface between a human and a robot. However, considering the advantages described above, it can be applied to various targets, such as psychological experiments, ergonomic design, products for the disabled and the amusement industry. In our future work, we will evaluate the accuracy of the head pose and the gaze direction. We also aim to improve the accuracy and processing speed of the gaze detection.

Fig. 10: Result of gaze detection.

References

[1] A. Azarbayejani, T. Starner, B. Horowitz, and A. Pentland. Visually controlled graphics. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(6):602-605, 1993.
[2] A. Zelinsky and J. Heinzmann. Real-time Visual Recognition of Facial Gestures for Human Computer Interaction. In Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, pages 351-356, 1996.
[3] P. Ballard and G. C. Stockman. Controlling a Computer via Facial Aspect. IEEE Trans. Sys. Man and Cybernetics, 25(4):669-677, 1995.
[4] M. Black and Y. Yacoob. Tracking and Recognizing Rigid and Non-rigid Facial Motions Using Parametric Models of Image Motion. In Proc. of Int. Conf. on Computer Vision (ICCV'95), pages 374-381, 1995.
[5] S. Birchfield and C. Tomasi. Elliptical Head Tracking Using Intensity Gradients and Color Histograms. In Proc. of Computer Vision and Pattern Recognition (CVPR'98), 1998.
[6] A. Gee and R. Cipolla. Fast Visual Tracking by Temporal Consensus. Image and Vision Computing, 14(2):105-114, 1996.
[7] K. Toyama. Look, Ma - No Hands! Hands-Free Cursor Control with Real-time 3D Face Tracking. In Proc. of Workshop on Perceptual User Interface (PUI'98), 1998.
[8] J. Heinzmann and A. Zelinsky. 3-D Facial Pose and Gaze Point Estimation using a Robust Real-Time Tracking Paradigm. In Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, 1998.
[9] R. Stiefelhagen, J. Yang, and A. Waibel. Tracking Eyes and Monitoring Eye Gaze. In Proc. of Workshop on Perceptual User Interface (PUI'97), 1997.
[10] Y. Matsumoto, T. Shibata, K. Sakai, M. Inaba, and H. Inoue. Real-time Color Stereo Vision System for a Mobile Robot based on Field Multiplexing. In Proc. of IEEE Int. Conf. on Robotics and Automation, pages 1934-1939, 1997.
