3-D Environment Model Construction and Adaptive
 Foreground Detection for Multi-Camera Surveillance System
          Yi-Yuan Chen1† , Hung-I Pai2† , Yung-Huang Huang∗ , Yung-Cheng Cheng∗ , Yong-Sheng Chen∗
           Jian-Ren Chen† , Shang-Chih Hung† , Yueh-Hsun Hsieh† , Shen-Zheng Wang† , San-Lung Zhao†
                    † Industrial Technology Research Institute, Taiwan 310, ROC
        ∗ Department of Computer and Information Science, National Chiao-Tung University, Taiwan 30010, ROC
                        E-mail: 1 yiyuan@itri.org.tw, 2 HIpai@itri.org.tw



   Abstract— Conventional surveillance systems usually display the acquired video streams on multiple screens, which makes it difficult to keep track of targets because the spatial relationship among the screens is not apparent. This paper presents an effective and efficient surveillance system that integrates multiple video contents into one single comprehensive view. To visualize the monitored area, the proposed system uses planar patches to approximate the 3-D model of the monitored environment and displays the video contents of the cameras by applying dynamic texture mapping on the model. Moreover, a pixel-based shadow detection scheme for surveillance systems is proposed. After an offline training phase, our method maintains, for every pixel, a threshold that determines whether the pixel lies in a shadow part of the frame; these thresholds are automatically adjusted and updated according to the received video streams. The moving objects are extracted accurately after cast shadows are removed and are then visualized through axis-aligned billboarding. The system provides security guards with better situational awareness of the monitored site, including the activities of the tracked targets.

   Index Terms— Video surveillance system, planar patch modeling, axis-aligned billboarding, cast shadow removal

Fig. 1. A conventional surveillance system with multiple screens.

                         I. INTRODUCTION

   Recently, video surveillance has experienced accelerated growth because of the continuously decreasing price and improving capability of cameras [1], and it has become an important research topic in the general field of security. Since the monitored regions are often wide and the fields of view of cameras are limited, multiple cameras are required to cover the whole area. In a conventional surveillance system, security guards in the control center monitor the secured area through a screen wall (Figure 1). It is difficult for the guards to keep track of targets because the spatial relationship between adjacent screens is not intuitively known. Also, it is tiresome to gaze at many screens simultaneously over a long period of time. Therefore, it is beneficial to develop a surveillance system that can integrate all the videos acquired by the monitoring cameras into a single comprehensive view.

   Many approaches to integrated video surveillance have been proposed in the literature. Video billboards and video on fixed planes project camera views, including foreground objects, onto individual vertical planes in a reference map to visualize the monitored area [2]. In the fixed billboard method, billboards face specified directions to indicate the capturing directions of the cameras, and the locations of the billboards indicate the positions of the cameras; however, the billboard contents are hard to perceive if the angles between the viewing direction and the normal directions of the billboards are too large. In the rotating billboard method, on the other hand, when a billboard rotates to face the viewpoint of the user, neither the camera orientations nor the capturing areas are preserved. In outdoor surveillance systems, an aerial or satellite photograph can be used as a reference map and measurement equipment is used to build the 3-D environment [3]–[5]. Neumann et al. utilized an airborne LiDAR (Light Detection and Ranging) sensor system to collect 3-D geometry samples of a specific environment [6]. In [3], image registration seams the videos onto the 3-D model. Furthermore, video projection, such as video flashlight or the virtual projector, is another way to display video on the 3-D model [4], [7].

   However, multi-camera surveillance still has many open problems to be solved, such as object tracking across cameras and object re-identification. The detection of moving objects in video sequences is the first relevant step in the extraction of information in vision-based applications. In general, the quality of object segmentation is very important: the more accurate the positions and shapes of the objects are, the more reliable identification and tracking will be. Cast shadow detection is an issue for precise object segmentation and tracking. The characteristics of shadows are quite different in outdoor and indoor environments. The main difficulties in separating the shadow from an object of interest are due to
the physical properties of the floor, the directions of the light sources, and additive noise in the indoor environment. Based on brightness and chromaticity, several works decide thresholds on these features to roughly separate the shadow from the objects [8]–[10]. However, current local-threshold methods couple blob-level processing with pixel-level detection, which limits their performance because of the averaging effect of considering a large image region.

   Two works remove shadows by updating the thresholds over time and detecting cast shadows in different scenes. Carmona et al. [11] propose a method that detects shadows by using the properties of shadows in angle-module space. Blob-level knowledge is used to distinguish shadows, reflections, and ghosts. This work also proposes a way to update the thresholds so that shadows can be removed at different positions of the scene. However, there are many undetermined parameters for updating the thresholds, and the optimal parameters are hard to find in practice. Martel-Brisson et al. [12] propose a method, called GMSM, which initially uses a Gaussian Mixture Model (GMM) to identify the most stable Gaussian distributions as the shadow and background distributions. Since a background model is part of this method, more computation is needed for object segmentation if a more complex background model is used in the system. Besides, because every pixel has to be updated no matter how many objects are moving, the method wastes computation when only a few objects are present.

   In this paper, we develop a 3-D surveillance system based on the integration of multiple cameras. We first use planar patches to build the 3-D environment model and then visualize the videos by dynamic texture mapping on the 3-D model. To obtain the relationship between the camera contents and the 3-D model, homography transformations are estimated for every pair of an image region in the video contents and the corresponding area in the 3-D model. Before texture mapping, patches are automatically divided into smaller ones with appropriate sizes according to the environment. Lookup tables for the homography transformations are also built to accelerate the coordinate mapping during video visualization. Furthermore, a novel method to detect moving shadows is proposed. It consists of two phases. The first phase is an off-line training phase which determines the threshold of every pixel for judging whether the pixel is in a shadow region. In the second phase, the statistics of every pixel are updated over time, and the threshold is adjusted accordingly. In this way, a fixed parameter setting for detecting shadows can be avoided. The moving objects are segmented accurately from the background and are displayed via axis-aligned billboarding for better 3-D visual effects.

                II. SYSTEM CONFIGURATION

   Figure 2 illustrates the flowchart of constructing the proposed surveillance system. First, we construct lookup tables for the coordinate transformation from the 2-D images acquired from the IP cameras deployed in the scene to the 3-D model by specifying corresponding points between the 3-D model and the 2-D images. Since the cameras are fixed, this configuration procedure needs to be done only once, beforehand. Then, in the on-line monitoring stage, based on the 3-D model, all videos are integrated and visualized in a single view in which the foreground objects extracted from the images are displayed through billboards.

Fig. 2. The flowchart and components of the proposed 3-D surveillance system.

Fig. 3. Planar patch modeling for 3-D model construction. Red patches (top-left), green patches (top-right), and blue patches (bottom-left) represent the mapping textures in three cameras. The yellow point is the origin of the 3-D model. The 3-D environment model (bottom-right) is composed of horizontal and vertical patches from these three cameras.

A. Image registration

   For a point on a planar object, its coordinates on the plane can be mapped to the 2-D image through a homography, which is a transformation between two planar coordinate systems. A homography matrix H represents the
relationship between points on two planes:

                      s c_t = H c_s ,                           (1)

where s is a scale factor and c_s and c_t are a pair of corresponding points in the source and target patches, respectively. If there are at least four correspondences, no three of which are collinear in either patch, we can estimate H through a least-squares approach.

   We regard c_s as points of the 3-D environment model and c_t as points of the 2-D image, and then calculate the matrix H that maps points from the 3-D model to the images. In the reverse direction, we can also map points from the images to the 3-D model.
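As a concrete illustration (not taken from the paper), the following C++ sketch shows how a 3-D model point lying on a patch plane could be projected to an image pixel with a 3×3 homography as in Eq. (1), and how a per-patch lookup table might be precomputed; the structure names and the table layout are our assumptions.

    #include <array>
    #include <vector>

    // 3x3 homography stored row-major; maps homogeneous patch coordinates
    // (u, v, 1) on the 3-D model plane to homogeneous image coordinates.
    struct Homography {
        std::array<double, 9> h;

        // Apply Eq. (1): s * c_t = H * c_s, then divide by the scale factor s.
        std::array<double, 2> map(double u, double v) const {
            double x = h[0] * u + h[1] * v + h[2];
            double y = h[3] * u + h[4] * v + h[5];
            double s = h[6] * u + h[7] * v + h[8];
            return { x / s, y / s };
        }
    };

    // Precompute a lookup table so that texture mapping at run time becomes a
    // simple array access instead of a per-pixel matrix multiplication.
    std::vector<std::array<double, 2>> buildLookupTable(const Homography& H,
                                                        int width, int height) {
        std::vector<std::array<double, 2>> table(width * height);
        for (int v = 0; v < height; ++v)
            for (int u = 0; u < width; ++u)
                table[v * width + u] = H.map(u, v);   // image pixel for patch cell (u, v)
        return table;
    }

The table trades memory for speed, which matches the role of the lookup tables in the on-line monitoring stage.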
B. Planar patch modeling

   Precise camera calibration is not an easy job [13]. In the virtual projector methods [4], [7], the texture image will be misaligned with the model if the camera calibration or the 3-D model reconstruction has a large error. Alternatively, we develop a method that approximates the 3-D environment model through multiple individual planar patches and then renders the image content of every patch to generate a synthesized and integrated view of the monitored scene. In this way we can easily construct a surveillance system with a 3-D view of the environment.

   Mostly we can model the environment with two basic building components, horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually surrounded by doors and walls, which are modeled as the vertical planes. Both kinds of planes are further divided into several patches according to the geometry of the scene (Figure 3). If the scene consists of simple structures, a few large patches can represent the scene well with low rendering cost. On the other hand, more and smaller patches are required to accurately render a complex environment, at the expense of more computation.

   In the proposed system, the 3-D rendering platform is developed on OpenGL and each patch is divided into triangles before rendering. Since OpenGL fills triangles with texture by linear interpolation, which is not suitable for perspective projection, distortion appears in the rendering result. One can use a large number of triangles to reduce this kind of distortion, as shown in Figure 4, but doing so enlarges the computational burden and is therefore not feasible for real-time surveillance systems.

Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for rendering (left). More patches make the rendering much better (right).

   To make a compromise between visualization accuracy and rendering cost, we propose a procedure that automatically divides each patch into smaller ones and decides suitable sizes of patches for accurate rendering (Figure 4). We use the following mean-squared error to estimate the amount of distortion when rendering image patches:

          MSE = \frac{1}{m \times n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left( I_{ij} - \tilde{I}_{ij} \right)^2 ,     (2)

where I_{ij} is the intensity of the point obtained from the homography transformation, \tilde{I}_{ij} is the intensity of the point obtained from texture mapping, i and j are the row and column coordinates in the image, respectively, and m × n is the dimension of the patch in the 2-D image. In order to have a reference scale to quantify the amount of distortion, a peak signal-to-noise ratio is calculated by

          PSNR = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) ,             (3)

where MAX_I is the maximum pixel value of the image. Typical values for the PSNR are between 30 and 50 dB, and an acceptable value is considered to be about 20 dB to 25 dB in this work. We set a threshold T to determine the quality of texture mapping by

          PSNR ≥ T .                                     (4)

If the PSNR of a patch is lower than T, the procedure divides it into smaller patches and repeats the process until the PSNR values of all patches are greater than the given threshold T.
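The quality test of Eqs. (2)–(4) can be sketched as follows; how a patch is actually partitioned and resampled is not specified in the paper, so the splitting step is left to the caller and the 8-bit MAX_I value is an assumption.

    #include <cmath>
    #include <vector>

    // A patch sampled as an m x n grid of intensities: one grid obtained through
    // the homography (reference) and one from the rendered texture mapping.
    struct PatchSamples {
        int m = 0, n = 0;
        std::vector<double> ref;   // I_ij from the homography transformation
        std::vector<double> tex;   // I~_ij from OpenGL texture mapping
    };

    // Eq. (2): mean-squared error between the two samplings of the patch.
    double meanSquaredError(const PatchSamples& p) {
        double sum = 0.0;
        for (std::size_t k = 0; k < p.ref.size(); ++k) {
            double d = p.ref[k] - p.tex[k];
            sum += d * d;
        }
        return sum / (p.m * p.n);
    }

    // Eq. (3): PSNR with MAX_I = 255 assumed for 8-bit images.
    double psnr(const PatchSamples& p) {
        return 10.0 * std::log10(255.0 * 255.0 / meanSquaredError(p));
    }

    // Eq. (4): a patch is accepted when PSNR >= T; otherwise the caller splits
    // the patch (for example into four quadrants), resamples, and tests again.
    bool patchIsAccurateEnough(const PatchSamples& p, double T) {
        return psnr(p) >= T;
    }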
                              III. ON-LINE MONITORING

   The proposed system displays the videos on the 3-D model. However, 3-D foreground objects such as pedestrians are projected to the image frame and become 2-D objects. They will appear flattened on the floor or walls since the system displays them on planar patches. Furthermore, there might be ghosting effects when 3-D objects are in the overlapping areas of different camera views. We tackle these problems by separating and rendering the 3-D foreground objects in addition to the background environment.
Fig. 5. The tracking results obtained by using different shadow thresholds while people stand on different positions of the floor. (a) Tr = 0.8 (b) Tr = 0.3. The threshold value Tθ = 6° is the same for both.

A. Underlying assumption

   Shadow is a type of foreground noise. It can appear in any zone of the camera scene. In [8], each pixel belonging to a shadow blob is detected by two properties. First, the color vector of a pixel in a shadow blob has a direction similar to that of the background pixel at the same position of the image. Second, the magnitude of the color vector in the shadow is slightly less than that of the corresponding background color vector. Similar to [11], RGB or another color space can be transformed into a two-dimensional space (called the angle-module space). For the color vector I_c(x, y) of a pixel at position (x, y) of the current frame, the angle θ(x, y) between the background vector I_b(x, y) and I_c(x, y), and the magnitude ratio r(x, y), are defined as

          θ(x, y) = \arccos\left( \frac{I_c(x, y) \cdot I_b(x, y)}{|I_c(x, y)|\,|I_b(x, y)| + \epsilon} \right) ,        (5)

          r(x, y) = \frac{|I_c(x, y)|}{|I_b(x, y)|} ,                                                                   (6)

where ε is a small number to avoid a zero denominator. In [11], the shadow pixels have to satisfy

                      Tθ < cos θ(x, y) < 1 ,                             (7)
                      Tr < r(x, y) < 1 ,                                 (8)

where Tθ is the angle threshold and Tr is the module ratio threshold. According to the demonstration shown in Figure 5, the best shadow thresholds highly depend on the positions (pixels) in the scene because of the complexity of the environment, the light sources, and the positions of the objects. Therefore, we propose a method that automatically adjusts the shadow detection thresholds for each pixel. The threshold for classifying a pixel as shadow or not is determined from the necessary samples (data) collected over time. Only one parameter has to be manually initialized, namely Tθ(0), where 0 denotes the initial time. The method can then update the thresholds automatically and quickly. Our method is faster than the similar idea, the GMSM method [12], once a background model has been built. There are two major advantages in computation time for our method. First, only the necessary samples are collected. Second, compared with the method in [12], any background or foreground results can be combined with our method such that the background does not have to be determined again.

   In the indoor environment, we assume that the color of a pixel in shadow is similar to that of the background, as expressed by inequality (7), although this is evidently not the case in sunshine outdoors. Only the indoor environment is considered in this paper.
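As a concrete illustration of Eqs. (5)–(8), the sketch below classifies one pixel as a shadow candidate from its current and background RGB vectors; the fixed thresholds passed in are placeholders, since in the proposed method Tθ and Tr are learned per pixel.

    #include <cmath>

    struct RGB { double r, g, b; };

    static double dot(const RGB& a, const RGB& b) { return a.r*b.r + a.g*b.g + a.b*b.b; }
    static double norm(const RGB& a) { return std::sqrt(dot(a, a)); }

    // Eqs. (5) and (6): angle between current and background color vectors and
    // the magnitude ratio; eps avoids a zero denominator.
    void angleModule(const RGB& Ic, const RGB& Ib, double& cosTheta, double& ratio,
                     double eps = 1e-6) {
        cosTheta = dot(Ic, Ib) / (norm(Ic) * norm(Ib) + eps);
        ratio    = norm(Ic) / (norm(Ib) + eps);
    }

    // Eqs. (7) and (8): a pixel is a shadow candidate when both conditions hold.
    bool isShadowCandidate(const RGB& Ic, const RGB& Ib, double Ttheta, double Tr) {
        double cosTheta, ratio;
        angleModule(Ic, Ib, cosTheta, ratio);
        return (Ttheta < cosTheta && cosTheta < 1.0) && (Tr < ratio && ratio < 1.0);
    }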
B. Collecting samples

   Samples I(x, y, t) from a number of frames are collected to decide the shadow area, where t is the time. In [12], all samples are collected, including the background, shadow, and foreground classes, according to how the pixel value changes over time. But if a good background model has already been built and some initial foreground objects have been segmented, the background samples are not necessary; only the foreground and shadow samples I_f(x, y, t) need to be considered. Besides, since background pixels are dropped from the sample list, this saves computation and memory, especially in a scene with few objects. Furthermore, I_f^{Tθ}(x, y, t) is obtained by dropping from I_f(x, y, t) the samples which do not satisfy inequality (7). The resulting sample data are then composed of more shadow samples and fewer foreground samples, which also means that the threshold r(x, y, t) can be derived more easily than a threshold derived from the samples of I_f(x, y, t).

C. Deciding the module ratio threshold

   The initial threshold Tθ(x, y, 0) is set according to experiment; in this case, Tθ(x, y, 0) = cos(6°) is used as the initial value. After enough samples have been collected, the initial module ratio threshold Tr(x, y, 0) can be decided by a method we call fast step minimum searching (FSMS). FSMS can quickly separate the shadow distribution from the foreground distribution in the collected samples described above. The details are as follows. The whole distribution is divided into bins of window size w, and the height of each bin is the number of samples falling in it. Besides the background peak, two other peaks are found. Starting from the peak that is closest to, and smaller than, the average background value, the shadow threshold Tr can be found by searching for the minimum bin value or a value close to zero.
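A minimal sketch of the FSMS idea, under the assumption (not spelled out in the paper) that the collected ratio samples have first been binned into a histogram with window size w: locate the peak just below the background ratio, then walk toward the background value, keeping the lowest bin seen and stopping early at an empty bin; that valley is used as Tr.

    #include <vector>

    // Histogram of magnitude-ratio samples r(x, y, t) for one pixel, bin width w
    // over [0, 1]. Returns the bin center chosen as the module ratio threshold Tr.
    double fsmsRatioThreshold(const std::vector<int>& hist, double w, double bgRatio = 1.0) {
        int bgBin = static_cast<int>(bgRatio / w);
        if (bgBin >= static_cast<int>(hist.size())) bgBin = static_cast<int>(hist.size()) - 1;

        // 1. Find the shadow peak: the highest bin below the background bin.
        int peak = 0;
        for (int b = 0; b < bgBin; ++b)
            if (hist[b] > hist[peak]) peak = b;

        // 2. From that peak, move toward the background bin, tracking the lowest
        //    bin and stopping at an empty one; this valley separates the shadow
        //    samples from the background/foreground samples.
        int valley = peak;
        for (int b = peak + 1; b < bgBin; ++b) {
            if (hist[b] <= hist[valley]) valley = b;
            if (hist[b] == 0) { valley = b; break; }
        }
        return (valley + 0.5) * w;   // bin center used as Tr for this pixel
    }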
D. Updating the angle threshold

   When a pixel satisfies both conditions in inequalities (7) and (8) at the same time, the pixel is classified as shadow. In other words, if the pixel I_s(x, y) is actually a shadow pixel and has been classified as a shadow candidate by FSMS, the pixel is additionally required to satisfy

                0 ≤ cos θ(x, y, t) < Tθ(x, y, t) .               (9)

   Tθ(x, y, t) can be decided by searching for the minimum cos θ of the pixels in I_s obtained by FSMS.
However, we propose another method to find Tθ(x, y, t) more quickly. Let A_{b,s}^{Tr}(x, y, t) be the number of samples classified as shadow or background at time t by FSMS. We define a ratio R(Tr) = A_{b,s}^{Tr} / A_{b,s,f}, where A_{b,s,f} is the number of all samples at position (x, y), and b, s, f denote the background, shadow, and foreground, respectively. The threshold Tθ(x, y, t) can then be updated to T'θ(x, y, t) through R(Tr): the number of samples whose cos θ(x, y) values are larger than T'θ(x, y, t) is required to equal A_{b,s}, that is,

                      R(T'θ(x, y, t)) = R(Tr) .                       (10)

   Besides, we add a perturbation δTθ to T'θ(x, y, t). Since FSMS only finds a threshold within I_f^{Tθ}(x, y, t), if the initial threshold Tθ(x, y, 0) is set larger than the true threshold, the best updated threshold can never become smaller than the current threshold Tθ, and therefore the true angle threshold would never be found over time. To solve this problem, a perturbation is subtracted from the updated threshold:

                  Tθ(x, y, t) = T'θ(x, y, t) − δTθ .                   (11)

   Since the new threshold Tθ(x, y, t) has a smaller value and covers more samples, it can approach the true threshold over time. This perturbation also makes the method more adaptable to changes of the environment. Figure 6 illustrates the whole method.

Fig. 6. A flowchart illustrating the whole method. The purple part is pixel-based.
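The angle-threshold update can be read as a quantile computation: choose T'θ so that the fraction of samples with cos θ above it matches R(Tr), then subtract the perturbation of Eq. (11). The sketch below assumes the per-pixel cos θ samples are available in a vector; the quantile interpretation of Eq. (10) is our reading, not a statement from the paper.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Update the per-pixel angle threshold following Eqs. (10) and (11).
    //   cosThetaSamples : cos(theta) of all collected samples at this pixel
    //   ratioR          : R(Tr) = A_{b,s}^{Tr} / A_{b,s,f} obtained from FSMS
    //   deltaTtheta     : small perturbation letting the threshold keep decreasing
    double updateAngleThreshold(std::vector<double> cosThetaSamples,
                                double ratioR, double deltaTtheta) {
        if (cosThetaSamples.empty()) return 1.0;
        std::sort(cosThetaSamples.begin(), cosThetaSamples.end());

        // Eq. (10): pick T'theta so that roughly a fraction ratioR of the samples
        // have cos(theta) larger than the threshold.
        std::size_t n = cosThetaSamples.size();
        std::size_t above = std::min<std::size_t>(
            static_cast<std::size_t>(ratioR * n + 0.5), n);
        double tPrime = (above == n) ? cosThetaSamples.front()
                                     : cosThetaSamples[n - above - 1];

        // Eq. (11): subtract the perturbation so the threshold can still move
        // below its current value and adapt to scene changes.
        return tPrime - deltaTtheta;
    }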

E. Axis-aligned billboarding

   For visualization, axis-aligned billboarding [14] constructs billboards in the 3-D model for moving objects, such as pedestrians, and a billboard always faces the viewpoint of the user. A billboard has three properties: location, height, and direction. By assuming that all the foreground objects are always moving on the floor, the billboards can be aligned to be perpendicular to the floor in the 3-D model. The 3-D location of the billboard is estimated by mapping the bottom-middle point of the foreground bounding box in the 2-D image through the lookup tables. The ratio between the height of the bounding box and the 3-D model determines the height of the billboard in the 3-D model. The relationship between the direction of a billboard and the viewpoint is defined as shown in Figure 7.

   The following equations are used to calculate the rotation angle of the billboard:

                                    Y = n × v ,                                 (12)

                                 φ = cos^{-1}(v · n) ,                              (13)

where v is the vector from the location of the billboard, L, to the location E projected vertically from the viewpoint onto the floor, n is the normal vector of the billboard, Y is the rotation axis, and φ is the estimated rotation angle. After the rotation, the normal vector of the billboard is parallel to the vector v and the billboard always faces toward the viewpoint of the operator.

Fig. 7. Orientation determination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle.
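A small sketch of Eqs. (12) and (13): given the billboard location L and the viewpoint projected onto the floor at E, compute the rotation axis and angle that turn the billboard normal n toward the viewer. Vectors are normalized here, which the equations implicitly assume; the axis and angle can then be handed to the renderer.

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 cross(const Vec3& a, const Vec3& b) {
        return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    }
    static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3 normalize(const Vec3& a) {
        double len = std::sqrt(dot(a, a));
        return { a.x/len, a.y/len, a.z/len };
    }

    // Eqs. (12)-(13): rotation that makes the billboard at L face the point E,
    // the viewpoint projected vertically onto the floor; n is the current normal.
    void billboardRotation(const Vec3& L, const Vec3& E, const Vec3& n,
                           Vec3& axisY, double& angleRad) {
        Vec3 v = normalize({E.x - L.x, E.y - L.y, E.z - L.z});   // vector from L to E
        Vec3 nn = normalize(n);
        axisY = cross(nn, v);                                    // Eq. (12)
        angleRad = std::acos(dot(v, nn));                        // Eq. (13)
    }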
F. Video content integration

   If the fields of view of the cameras overlap, objects in these overlapping areas are seen by multiple cameras. In this case, there might be ghosting effects when we simultaneously display the videos from these cameras. To deal with this problem, we use the 3-D locations of the moving objects to identify the correspondence of objects across different views. When the operator chooses a viewpoint, the rotation angles of the corresponding billboards are estimated by the method presented above, and the system renders only the billboard whose rotation angle is the smallest among all of the corresponding billboards, as shown in Figure 8.




Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause a ghosting effect (bottom-left). Static background images without foreground objects are used to fill the area of the foreground objects (top). Ghosting effects are removed, and the static background images can be updated by background modeling.

Fig. 9. Determination of the viewpoint switch. We divide the floor area depending on the fields of view of the cameras and associate each area with one of the viewpoints close to a camera. The viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects.

G. Automatic change of viewpoint

   The proposed surveillance system provides a target tracking feature by determining and automatically switching the viewpoints. Before rendering, several viewpoints are specified in advance to be close to the locations of the cameras. When switching from one viewpoint to another, the parameters of the viewpoints are gradually changed from the starting point to the destination point for a smooth view transition.

   The switching criterion is defined by the number of blobs found in specific areas. First, we divide the floor area into several parts and associate them with the cameras, as shown in Figure 9. When people move in the scene, the viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects; a sketch of this criterion is given below. We also make the billboard transparent by setting the alpha values of its texture, so the foreground objects appear with fitting shapes, as shown in Figure 10.
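To make the switching criterion concrete, the sketch below counts foreground blobs per predefined floor area and returns the viewpoint associated with the most populated area; the Blob and Area structures are illustrative assumptions rather than data types from the system.

    #include <vector>

    // Hypothetical structures: a foreground blob located on the floor plane, and
    // a rectangular floor area associated with one predefined viewpoint.
    struct Blob { double x, y; };                       // floor coordinates of a blob
    struct Area { double x0, y0, x1, y1; int viewpointId; };

    // Switching criterion: pick the viewpoint of the area that currently contains
    // the most foreground blobs. Returns currentViewpoint when no blob is visible.
    int selectViewpoint(const std::vector<Blob>& blobs, const std::vector<Area>& areas,
                        int currentViewpoint) {
        int best = currentViewpoint, bestCount = 0;
        for (const Area& a : areas) {
            int count = 0;
            for (const Blob& b : blobs)
                if (b.x >= a.x0 && b.x <= a.x1 && b.y >= a.y0 && b.y <= a.y1)
                    ++count;
            if (count > bestCount) { bestCount = count; best = a.viewpointId; }
        }
        return best;
    }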
                    IV. EXPERIMENTAL RESULTS

   We developed the proposed surveillance system on a PC with an Intel Core Quad Q9550 processor, 2 GB RAM, and one nVidia GeForce 9800GT graphics card. Three IP cameras with 352 × 240 pixel resolution are connected to the PC through the Internet. The frame rate of the system is about 25 frames per second.

   In the monitored area, automated doors and elevators are specified as background objects, albeit their images do change when the doors open or close. These areas are modeled during background construction and are not visualized by billboards; the system uses a ground mask to indicate the region of interest. Only the moving objects located in the indicated areas are considered as moving foreground objects, as shown in Figure 11.

   The experimental results shown in Figure 12 demonstrate that the viewpoint can be chosen arbitrarily in the system, and operators can track targets with a closer view or from any viewing direction by moving the virtual camera. Moreover, the moving objects always face the virtual camera because of billboarding, and the operators can easily perceive the spatial information of the foreground objects from any viewpoint.

Fig. 10. Automatic switching of the viewpoint for tracking targets. People walk in the lobby and the viewpoint of the operator automatically switches to keep track of the targets.

Fig. 11. Dynamic background removal by the ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top-right). A mask covering the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right).

Fig. 12. Immersive monitoring at an arbitrary viewpoint. We can zoom out the viewpoint to monitor the whole surveillance area or zoom in the viewpoint to focus on a particular place.

                         V. CONCLUSIONS

   In this work we have developed an integrated video surveillance system that provides a single comprehensive view of the monitored areas to facilitate tracking moving targets through its interactive control and immersive visualization. We utilize planar patches for 3-D environment model construction. The scenes from the cameras are divided into several patches according to their structures, and the numbers and sizes of the patches are automatically determined to compromise between rendering quality and efficiency. To integrate the video contents, homography transformations are estimated for the relationships between image regions of the video contents and the corresponding areas of the 3-D model. Moreover, the proposed method to remove moving cast shadows can automatically decide the thresholds by on-line learning. In this way, manual settings can be avoided. Compared with frame-based work, our method increases the accuracy of shadow removal. In visualization, the foreground objects are segmented accurately and displayed on billboards.

                           REFERENCES



 [1] R. Sizemore, “Internet protocol/networked video surveillance market: Equipment, technology and semiconductors,” Tech. Rep., 2008.
 [2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, “Contextualized videos: Combining videos with environment models to support situational understanding,” IEEE Transactions on Visualization and Computer Graphics, 2007.
 [3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, “Accurate planar image registration for an integrated video surveillance system,” Computational Intelligence for Visual Intelligence, 2009.
 [4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna, “Video flashlights: real time rendering of multiple videos for immersive model visualization,” in 13th Eurographics Workshop on Rendering, 2002.
 [5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, “Augmented virtual environments (AVE): dynamic fusion of imagery and 3-D models,” IEEE Virtual Reality, 2003.
 [6] S. You, J. Hu, U. Neumann, and P. Fox, “Urban site modeling from LiDAR,” Lecture Notes in Computer Science, 2003.
 [7] I. Sebe, J. Hu, S. You, and U. Neumann, “3-D video surveillance with augmented virtual environments,” in International Multimedia Conference, 2003.
 [8] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-time robust background subtraction and shadow detection,” IEEE ICCV, 1999.
 [9] K. Chung, Y. Lin, and Y. Huang, “Efficient shadow detection of color aerial images based on successive thresholding scheme,” IEEE Transactions on Geoscience and Remote Sensing, 2009.
[10] J. Kim and H. Kim, “Efficient region-based motion segmentation for a video monitoring system,” Pattern Recognition Letters, 2003.
[11] E. J. Carmona, J. Martínez-Cantos, and J. Mira, “A new video segmentation method of moving objects based on blob-level knowledge,” Pattern Recognition Letters, 2008.
[12] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast shadows through a multidistribution approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master, “Calibrated, registered images of an extended urban area,” International Journal of Computer Vision, 2003.
[14] A. Fernandes, “Billboarding tutorial,” 2005.



Morphing And Texturing Based On The Transformation Between Triangle Mesh And Point

                     Wei-Chih Hsu
   Department of Computer and Communication Engineering,
   National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan

                     Wu-Huang Cheng
   Institute of Engineering Science and Technology,
   National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
   u9715901@nkfust.edu.tw


Abstract—This research proposes a methodology for transforming a triangle-mesh object into a point-based object, along with its applications. Considering cost and program functionality, the experiments in this paper adopt C++ instead of 3D computer graphics software to create the point cloud from meshes. The method employs the mesh-bounded area and planar dilation to construct the point cloud of a triangle mesh. Two point-based applications are addressed in this research. 3D model generation can use point-based object morphing to simplify the computing structure. Another application, texture mapping, uses the relation between 2D image pixels and 3D planes. The experimental results illustrate several properties of point-based modeling, among which flexibility and scalability are the biggest advantages. The main goal of this research is to explore more sophisticated methods of 3D object modeling from point-based objects.

   Keywords-point-based modeling; triangle mesh; texturing; morphing

                      I.    INTRODUCTION

    In recent computer-graphics research, form•Z, Maya, 3DS Max, Blender, LightWave, Modo, solidThinking, and other 3D computer graphics packages are frequently adopted tools. For example, Maya is a very popular package that includes many powerful and efficient functions for producing results. The diverse functions of such software can increase working efficiency, but the methodology design must follow specific rules and the cost is usually high. Using C++ as the research tool has many advantages, especially in collecting data information: powerful functions can be created from C language instructions, parameters, and C++ object orientation, and the more complete the extracted 3D data are, the less restricted the analysis that can be produced.

    The polygon mesh is widely used to represent 3D models but has some drawbacks in modeling; an unsmooth surface of combined meshes is one of them. Estimating the vertices of objects and constructing the vertex set of each mesh are factors of modeling inefficiency. Point-based modeling is a solution that conquers some disadvantages of mesh modeling. It is based on point primitives, and no structure relating each point to another is needed. To simplify point-based data, marching cubes and Delaunay triangulation can be employed to transform a point-based model into a polygon mesh. Mark Pauly has published many related studies on point-based modeling in international journals: [1] proposed a method to represent multi-scale surfaces, and M. Müller et al. [2] developed a method for modeling and animation showing that the point-based representation is flexible.

    Morphing can be based on geometric, shape, or other features. Mesh-based morphing sometimes involves geometry, mesh structure, and other feature analysis. [3] demonstrated a method to edit free-form surfaces based on geometry; the method applies complex computation to deal with topology, curved-face properties, and triangulation. [4] not only divided objects into components but also used the components in local-level and global-level morphing. [5] adopted two-model morphing with mesh comparison and merging to generate a new model. These methods involve complicated data structures and computation. This research illustrates a simple approach with less feature analysis that creates new models by using regular points to morph two or more objects.

    Texturing is essential in rendering a 3D model. In virtual reality, the goal of texture mapping is to be as similar to the real object as possible; for special effects, exaggerated texturing is more suitable. [6] built a mesh atlas for texturing: the texture atlas coordinates, considered together with the triangle mesh structure, were mapped to the 3D model. [7] used the conformal equivalence of triangle meshes to find a flat mesh for texture mapping; this method is comprehensible and easy to implement.

    The rest of the paper is arranged as follows. Transforming a triangle mesh into a point set for modeling and point-based morphing for model creation are addressed in Sections II and III. Point-based texture mapping is addressed in Section IV, followed by the conclusion in Section V.

          II. TRANSFORMING TRIANGLE MESH INTO POINT SET

    In order to exploit the advantages of a point-based model, transforming the triangle mesh into points is the first step. The point set can be estimated by using the three normal bound lines of the triangle mesh. The normal, denoted by n, can be calculated from the three triangle vertices. The points inside the triangle area are denoted by B_in, A denotes the triangle mesh area, a point on the 3D space plane is presented by p with coordinates (x, y, z), v_i, i = 1, 2, 3, denotes the three vertices of the triangle mesh, and v̄ denotes the mean of the three triangle vertices. The formula that presents the triangle area is described below.




    A = \{\, p(x, y, z) \mid p\,n^T - v_i\,n^T = 0,\ i \in \{1, 2, 3\},\ p \in B_{in} \,\}

    B_{in} = \{\, p(x, y, z) \mid f_{(i,j)}(p) \times f_{(i,j)}(\bar{v}) > 0 \,\}

    f_{(i,j)}(p) = r \times a - b + s

    r = \frac{b_j - b_i}{a_j - a_i}, \qquad s = b_i - r \times a_i

    i, j = 1, 2, 3, \qquad a, b \in \{x, y, z\}, \qquad i < j, \qquad a < b

    The experiments use object files in the Wavefront file format (.obj) from the NTU 3D Model Database ver. 1 of National Taiwan University. The process of transforming a triangle mesh into a point-based object is shown in Figure 1. It is clear that some areas, marked by the red rectangles in Figure 1, do not obtain the complete point set. The planar dilation process is employed to refine these failed areas.

    The planar dilation process uses the 26-connected planar neighborhood to refine the spots left in the area. The first half of Figure 2 shows the 26 positions of the connected planar neighborhood. The condition to verify is whether a plane point and its 26 neighbor positions belong to the object plane, and the main purpose of estimating the object plane is to verify that this condition is true. The result in the second half of Figure 2 reveals the efficiency of the planar dilation process.

Figure 1. The process of transforming triangle mesh into point-based.

Figure 2. Planar dilation process.
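The following C++ sketch illustrates the mesh-to-point transformation under the definitions above: candidate grid points are kept when they lie on the same side of every edge line as the triangle centroid and are placed on the face plane. The x-y projection, the grid step parameter, and skipping near-vertical faces are simplifying assumptions of this illustration rather than details given in the paper.

    #include <cmath>
    #include <vector>

    struct P3 { double x, y, z; };

    static P3 sub(const P3& a, const P3& b) { return { a.x-b.x, a.y-b.y, a.z-b.z }; }
    static P3 cross(const P3& a, const P3& b) {
        return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    }
    static double dot(const P3& a, const P3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // Edge test f_{(i,j)} evaluated in the x-y projection: a point is kept when it
    // lies on the same side of every edge line as the triangle centroid v_bar.
    static double edgeF(double ax, double bx, double ay, double by, double px, double py) {
        double r = (by - ay) / (bx - ax + 1e-12);   // slope of the edge line
        double s = ay - r * ax;                     // intercept
        return r * px - py + s;                     // f(p) = r*a - b + s
    }

    // Fill one mesh face with points spaced by 'step', following the bounded-area
    // idea: points satisfy the plane equation p*n^T = v_i*n^T and the same-side test.
    std::vector<P3> triangleToPoints(const P3 v[3], double step) {
        P3 n = cross(sub(v[1], v[0]), sub(v[2], v[0]));   // plane normal
        P3 c = { (v[0].x+v[1].x+v[2].x)/3, (v[0].y+v[1].y+v[2].y)/3, (v[0].z+v[1].z+v[2].z)/3 };

        double xmin = std::fmin(v[0].x, std::fmin(v[1].x, v[2].x));
        double xmax = std::fmax(v[0].x, std::fmax(v[1].x, v[2].x));
        double ymin = std::fmin(v[0].y, std::fmin(v[1].y, v[2].y));
        double ymax = std::fmax(v[0].y, std::fmax(v[1].y, v[2].y));

        std::vector<P3> pts;
        for (double x = xmin; x <= xmax; x += step)
            for (double y = ymin; y <= ymax; y += step) {
                bool inside = true;
                int idx[3][2] = { {0, 1}, {0, 2}, {1, 2} };
                for (auto& e : idx) {
                    double fp = edgeF(v[e[0]].x, v[e[1]].x, v[e[0]].y, v[e[1]].y, x, y);
                    double fc = edgeF(v[e[0]].x, v[e[1]].x, v[e[0]].y, v[e[1]].y, c.x, c.y);
                    if (fp * fc <= 0.0) { inside = false; break; }
                }
                if (!inside) continue;
                if (std::fabs(n.z) < 1e-12) continue;          // near-vertical face: skipped here
                double z = (dot(v[0], n) - n.x * x - n.y * y) / n.z;   // plane equation
                pts.push_back({ x, y, z });
            }
        return pts;
    }

The planar dilation step described above would then fill any spots this sampling leaves behind.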
          III. POINT-BASED MORPHING FOR MODEL CREATING

    Greater flexibility in combining objects is one of the properties of point-based modeling. No matter what the shape or category of the objects, the method of this study can put them into the morphing process to create new objects.

    The morphing process includes three steps. Step one is to equalize the objects. Step two is to calculate each normal point of the objects in the morphing process. Step three is to estimate each point of the target object by using the same normal point of the two objects with the formula described below:

          o_t = p_{r1} o_1 + p_{r2} o_2 + \cdots + \Bigl( 1 - \sum_{i=1}^{n-1} p_{ri} \Bigr) o_n ,

          0 \le p_{r1}, p_{r2}, \ldots, p_{r(n-1)} \le 1 , \qquad \sum_{i=1}^{n} p_{ri} = 1 ,

where o_t presents each target object point of the morphing, o_i is an object in the morphing process, p_{ri} denotes the weight of the object's effect in the morphing process, and i indicates the index of the object. The appearance of the new model generated by morphing depends on which objects are chosen and on the value of each object weight. The experiments in this research use two objects, therefore i = 1 or 2 and n = 2.

    The results are shown in Figure 3. The first row is the morphing of a simple flat board and a character. The second row shows the freedom of object selection in point-based modeling, because two totally different objects can be put into the morphing and still produce satisfactory results. The models created by morphing objects with different weights can be seen in Figure 4.
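A minimal sketch of the morphing formula for the two-object case used in the experiments (n = 2); it assumes the two point sets have already been equalized to the same size and that corresponding points share the same index.

    #include <vector>

    struct P3 { double x, y, z; };

    // Weighted point-wise morphing, o_t = w * o_1 + (1 - w) * o_2, for two
    // equalized point-based objects with index-wise correspondence.
    std::vector<P3> morphTwoObjects(const std::vector<P3>& o1,
                                    const std::vector<P3>& o2, double w) {
        std::vector<P3> target(o1.size());
        for (std::size_t k = 0; k < o1.size() && k < o2.size(); ++k) {
            target[k].x = w * o1[k].x + (1.0 - w) * o2[k].x;
            target[k].y = w * o1[k].y + (1.0 - w) * o2[k].y;
            target[k].z = w * o1[k].z + (1.0 - w) * o2[k].z;
        }
        return target;
    }

Varying w between 0 and 1 produces the intermediate models with different weights shown in Figure 4.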
                 IV. POINT-BASED TEXTURE MAPPING

    Texture mapping is quite plain in this research method. It uses a texture matrix to map the 3D model to the 2D image pixels, based on the concept of transforming a 2D image into 3D. Assume the 3D space is divided into α × β blocks, where α is the number of rows and β is the number of columns, and the length, width, and height of the 3D space are h × h × h; hereafter (X, Y) and (x, y, z) denote the image coordinates and the 3D model coordinates, respectively. The texture of each block is assigned by a texture cube, which is made from the 2D image as shown in the middle image of the first row of Figure 5. The process can be expressed by the formula below.
    A t^T = c^T ,

    t = \Bigl[\, x \bmod \tfrac{h}{\alpha},\ \ y \bmod \tfrac{h}{\beta},\ \ z \bmod \tfrac{h}{\beta} \,\Bigr] , \qquad c = [\, X, Y \,] ,

    A = \begin{bmatrix} \alpha & 0 & 0 \\ 0 & \dfrac{\beta (h - z)}{y} & 0 \end{bmatrix} ,

where A denotes the texture transforming matrix, t denotes the current position in the 3D model, and c denotes the image pixel content at the current position.
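The sketch below is a literal transcription of the block texture-mapping formula as printed; the entry that divides by y follows the printed matrix and would need guarding against y = 0 in practice, and the function name is ours.

    #include <cmath>

    // Hypothetical illustration of the block texture-mapping formula A * t^T = c^T.
    // (x, y, z) is a point of the 3D model, h is the extent of the 3D space, and
    // alpha x beta is the number of texture blocks; (X, Y) is the looked-up pixel.
    void textureLookup(double x, double y, double z, double h,
                       double alpha, double beta, double& X, double& Y) {
        // t = [ x mod h/alpha, y mod h/beta, z mod h/beta ]
        double t0 = std::fmod(x, h / alpha);
        double t1 = std::fmod(y, h / beta);
        // The third component of t is multiplied by the zero column of A, so it
        // does not contribute to the result.

        // A = [ alpha   0              0 ]
        //     [ 0       beta*(h-z)/y   0 ]
        X = alpha * t0;
        Y = (beta * (h - z) / y) * t1;
    }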
The experimental results are shown in the second row of Figure 5 and in Figure 6. The results obtained with the setting α = β = 2 are shown in the second row of Figure 5, and the results obtained with α = β = 4 are shown in the first row of Figure 6. The last row of Figure 6 indicates that the proposed texture mapping method can be applied to any point-based model.
                                                                                     [5]   Kosuke Kaneko, Yoshihiro Okada and Koichi Niijima, “3D
                           V.         CONCLUSION                                           Model Generation by Morphing,” IEEE Computer Graphics,
                                                                                           Imaging and Visualisation, 2006.
In summary, this research focuses on point-based modeling applications implemented in C++ rather than with off-the-shelf tools or other computer graphics software. The point-based methodologies developed here rely on simple data structures and involve little computational complexity, and they support two applications: morphing and texture mapping. The experimental results confirm the scalability and flexibility of the proposed methodologies.




                                         Figure 3. The results of point-based modeling using different objects morphing.




Figure 4. The models created by objects morphing with different weights.




Figure 5. The process of 3D model texturing with 2D image shown in first row and the results shown in second row.




Figure 6. The results of point-based texture mapping with α = β = 4 and different objects.




LAYERED LAYOUTS OF DIRECTED GRAPHS USING A GENETIC
                    ALGORITHM

            Chun-Cheng Lin1,∗, Yi-Ting Lin2 , Hsu-Chun Yen2,† , Chia-Chen Yu3
                                  1
                                    Dept. of Computer Science,
            Taipei Municipal University of Education, Taipei, Taiwan 100, ROC
                               2
                                 Dept. of Electrical Engineering,
                   National Taiwan University, Taipei, Taiwan 106, ROC
                           3
                             Emerging Smart Technology Institute,
                  Institute for Information Industry, Taipei, Taiwan, ROC


                    ABSTRACT

By layered layouts of graphs (in which nodes are distributed over several layers and all edges are directed downward as much as possible), users can easily understand the hierarchical relation of directed graphs. The well-known method for generating layered layouts proposed by Sugiyama includes four steps, each of which is associated with an NP-hard optimization problem. It is observed that the four optimization problems are not independent, in the sense that their respective aesthetic criteria may contradict each other. That is, it is impossible to obtain an optimal solution satisfying all aesthetic criteria at the same time. Hence, the choice made for each criterion becomes a very important problem. In this paper, we propose a genetic algorithm that models the first three steps of Sugiyama's algorithm, in the hope of considering the first three aesthetic criteria simultaneously. Our experimental results show that the proposed algorithm can make layered layouts that satisfy human aesthetic viewpoints.

Keywords: Visualization, genetic algorithm, graph drawing.

                 1. INTRODUCTION

Drawings of directed graphs have many applications in our daily lives, including manuals, flow charts, maps, posters, schedules, UML diagrams, etc. It is important that a graph be drawn clearly, such that users can understand and get information from the graph easily. This paper focuses on layered layouts of directed graphs, in which nodes are distributed on several layers and, in general, edges should point downward, as shown in Figure 1(b). With this layout, users can easily trace each edge from top to bottom and understand the priority or ordering information of the nodes clearly.

Figure 1: The layered layout of a directed graph.

   Specifically, we use the following criteria to estimate the quality of a directed graph layout: to minimize the total length of all edges; to minimize the number of edge crossings; to minimize the number of edges pointing upward; and to draw edges as straight as possible. Sugiyama [9] proposed a classical algorithm for producing layered layouts of directed graphs, consisting of four steps: cycle removal, layer assignment, crossing reduction, and assignment of horizontal coordinates, each of which addresses the problem of achieving one of the above criteria. Unfortunately, the first three problems have been proven to be NP-hard when the width of the layout is restricted.

   ∗ Research supported in part by National Science Council under grant NSC 98-2218-E-151-004-MY3.
   † Research supported in part by National Science Council under grant NSC 97-2221-E-002-094-MY3.



There has been a great deal of work in the literature on each step of Sugiyama's algorithm.
   Drawing layered layouts by four independent steps can be executed efficiently, but it may not always yield nice layouts, because preceding steps may restrain the results of subsequent steps. For example, the four nodes assigned to two layers after the layer assignment step lead to an edge crossing in Figure 2(a); this crossing cannot be removed during the subsequent crossing reduction step, which only changes each node's relative position on its layer, yet in fact the crossing can be removed, as drawn in Figure 2(b). Namely, the crossing reduction step is restricted by the layer assignment step. Such a negative effect exists not only for these two particular steps but for every other preceding/subsequent step pair.

Figure 2: Different layouts of the same graph.

   Even if one could obtain the optimal solution for each step, those "optimal solutions" may not constitute the real optimal solution, because the locally optimal solutions are restricted by their respective preceding steps. Since we cannot obtain an optimal solution satisfying all criteria at the same time, we have to make a trade-off among all the criteria.
   For the above reasons, the basic idea of our method for drawing layered layouts is to combine the first three steps, so as to avoid the restrictions caused by the trade-offs between criteria. We then use a genetic algorithm to implement this idea. In the literature, there has been some work on producing layered layouts of directed graphs using genetic algorithms, e.g., using a genetic algorithm to reduce edge crossings in bipartite graphs [7] or in entire acyclic layered layouts [6], modifying nodes in a subgraph of the original graph on a layered layout [2], drawing general layouts of directed or undirected graphs [3] [11], and drawing layered layouts of acyclic directed graphs [10].
   Note that the algorithm for drawing layered layouts of acyclic directed graphs in [10] also combined three steps of Sugiyama's algorithm, but drawing layered layouts of acyclic and of cyclic directed graphs are quite different problems. For acyclic graphs, one does not need to solve the cycle removal problem, and if the algorithm does not restrict the layers to a fixed width, one does not need to solve the limited-width layer assignment problem either. Note that unlimited-width layer assignment is not NP-hard, because the layers of nodes can be assigned by a topological ordering. The algorithm in [10] only focuses on minimizing the number of edge crossings and making the edges as straight as possible; although it also combines three steps of Sugiyama's algorithm, it contains only one NP-hard problem. In contrast, our algorithm combines three NP-hard problems: cycle removal, limited-width layer assignment, and crossing reduction.
   In addition, our algorithm has the following advantages. More customized restrictions on layered layouts can be added to our algorithm; for example, some nodes should be placed to the left of some other nodes, the maximal layer number should be less than or equal to a certain number, etc. Moreover, the weighting ratio of each optimization criterion can be adjusted for different applications. According to our experimental results, our genetic algorithm can effectively adjust the ratio between the number of edge crossings and the total edge length. That is, our algorithm can make layered layouts more appealing to human aesthetic viewpoints.

                2. PRELIMINARIES

The frameworks of three different algorithms for layered layouts of directed graphs (i.e., Sugiyama's algorithm, the cyclic leveling algorithm, and our algorithm) are illustrated in Figure 3(a)-3(c), respectively. Sugiyama's algorithm consists of four steps, as mentioned previously; the other two algorithms are based on Sugiyama's algorithm, in which the cyclic leveling algorithm combines the first two steps, while our genetic algorithm combines the first three steps. Furthermore, a barycenter algorithm is applied to the crossing reduction step of both the cyclic leveling algorithm and our genetic algorithm, and the priority layout method is applied to the x-coordinate assignment step.
Figure 3: Comparison among the different algorithms. (a) Sugiyama's algorithm: cycle removal, layer assignment, crossing reduction, and x-coordinate assignment. (b) Cyclic leveling: cycle removal and layer assignment combined, followed by crossing reduction and x-coordinate assignment. (c) Our genetic algorithm: cycle removal, layer assignment, and crossing reduction combined (crossing reduction via the barycenter algorithm), followed by x-coordinate assignment via the priority layout method.

Figure 4: Two kinds of crossings. (a) An edge crossing. (b) An edge-node crossing.

                2.1. Basic Definitions

A directed graph is denoted by G(V, E), where V is the set of nodes and E is the set of edges. An edge e is denoted by e = (v1, v2) ∈ E, where v1, v2 ∈ V; edge e is directed from v1 to v2. A so-called layered layout is defined by the following conditions: (1) Let the number of layers in the layout be denoted by n, where n ∈ N and n ≥ 2; the n-layer layout is denoted by G(V, E, n). (2) V is partitioned into n subsets, V = V1 ∪ V2 ∪ · · · ∪ Vn, where Vi ∩ Vj = ∅ for all i ≠ j; the nodes in Vk are assigned to layer k, 1 ≤ k ≤ n. (3) A sequence ordering σi of Vi is given for each i (σi = v1 v2 · · · v|Vi| with x(v1) < x(v2) < · · · < x(v|Vi|)). The n-layer layout is then denoted by G(V, E, n, σ), where σ = (σ1, σ2, · · · , σn) with y(σ1) < y(σ2) < · · · < y(σn).
   An n-layer layout is called "proper" when it further satisfies the following condition: E is partitioned into n − 1 subsets, E = E1 ∪ E2 ∪ · · · ∪ En−1, where Ei ∩ Ej = ∅ for all i ≠ j, and Ek ⊂ Vk × Vk+1, 1 ≤ k ≤ n − 1.
   An edge crossing (assuming that the layout is proper) is defined as follows. Consider two edges e1 = (v11, v12), e2 = (v21, v22) ∈ Ei, in which v11 and v21 are the j1-th and the j2-th nodes in σi, respectively, and v12 and v22 are the k1-th and the k2-th nodes in σi+1, respectively. If either j1 < j2 and k1 > k2, or j1 > j2 and k1 < k2, there is an edge crossing between e1 and e2 (see Figure 4(a)).
   An edge-node crossing is defined as follows. Consider an edge e = (v1, v2), where v1, v2 ∈ Vi; v1 and v2 are the j-th and the k-th nodes in σi, respectively. W.l.o.g., assuming that j > k, there are (j − k − 1) edge-node crossings (see Figure 4(b)).

                2.2. Sugiyama's Algorithm

Sugiyama's algorithm [9] consists of four steps. (1) Cycle removal: If the input directed graph is cyclic, we reverse as few edges as possible such that the input graph becomes acyclic. This problem can be stated as the maximum acyclic subgraph problem, which is NP-hard. (2) Layer assignment: Each node is assigned to a layer so that the total vertical length of all edges is minimized. If an edge spans at least two layers, dummy nodes are introduced on each crossed layer. If the maximum width is bounded by a value greater than or equal to three, the problem of finding a layered layout with minimum height is NP-complete. (3) Crossing reduction: The relative positions of nodes on each layer are reordered to reduce edge crossings. Even when the problem is restricted to bipartite (two-layer) graphs, it is NP-hard. (4) x-coordinate assignment: The x-coordinates of nodes and dummy nodes are modified such that all the edges of the original graph structure are as straight as possible. This step involves two objectives: to make all edges as close to vertical lines as possible, and to make all edge paths as straight as possible.

                2.3. Cyclic Leveling Algorithm

The cyclic leveling algorithm (CLA) [1] combines the first two steps of Sugiyama's algorithm, i.e., it focuses on minimizing the number of edges pointing upward and the total vertical length of all edges. It introduces a number called span that represents the number of edges pointing upward and the total vertical length of all edges at the same time.
   The span number is defined as follows. Consider a directed graph G = (V, E). Given k ∈ N, define a layer assignment function ϕ : V → {1, 2, · · · , k}. Let span(u, v) = ϕ(v) − ϕ(u) if ϕ(u) < ϕ(v), and span(u, v) = ϕ(v) − ϕ(u) + k otherwise. For each edge e = (u, v) ∈ E, denote span(e) = span(u, v) and span(G) = Σ_{e∈E} span(e). In brief, span measures the sum of the vertical lengths of all edges plus a penalty for edges pointing upward or horizontally, provided the maximum height of the layout is given.
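To make the definition concrete, the following Python sketch computes span(e) and span(G) for a given layer assignment; the graph is assumed to be a plain list of directed edges and ϕ a dictionary from nodes to layers in {1, ..., k}. This is only an illustration of the definition above, not the implementation from [1].

```python
# Illustrative sketch of span(e) and span(G) as defined above.
# edges: list of (u, v) pairs; phi: dict mapping each node to a layer in {1, ..., k}.
def span_edge(u, v, phi, k):
    # Downward edges contribute their vertical length; upward or horizontal
    # edges are wrapped around the k layers, which acts as a penalty.
    if phi[u] < phi[v]:
        return phi[v] - phi[u]
    return phi[v] - phi[u] + k

def span_graph(edges, phi, k):
    return sum(span_edge(u, v, phi, k) for u, v in edges)
```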



The main idea of the CLA is that if a node causes a high increase in span, then the layer position of that node is determined later. In the algorithm, a distance function is defined to decide which nodes should be assigned first. There are four such functions, but only one can be chosen and applied to all the nodes: (1) Minimum Increase in Span = min_{ϕ(v)∈{1,··· ,k}} span(E(v, V′)); (2) Minimum Average Increase in Span (MST MIN AVG) = min_{ϕ(v)∈{1,··· ,k}} span(E(v, V′))/|E(v, V′)|; (3) Maximum Increase in Span = 1/δ_MIN(v); (4) Maximum Average Increase in Span = 1/δ_MIN_AVG(v). According to the experimental results in [1], using MST MIN AVG as the distance function yields the best results. Therefore, our algorithm is compared with the CLA using MST MIN AVG in the experimental section.
              2.4. Barycenter Algorithm

The barycenter algorithm is a heuristic for solving the edge crossing problem between two layers. The main idea is to order the nodes on each layer by their barycentric values. Assuming that node u is located on layer i (u ∈ Vi), the barycentric value of node u is defined as bary(u) = (1/|N(u)|) Σ_{v∈N(u)} π(v), where N(u) is the set of u's connected nodes on the layer below or above u (Vi−1 or Vi+1), and π(v) is the order of v in σi−1 or σi+1. The algorithm reorders the relative positions of all nodes according to their barycentric values, sweeping from layer 2 to layer n and then from layer n − 1 to layer 1.
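The following Python sketch illustrates one such barycenter sweep. The data layout (a list of per-layer orderings and a neighbor function) is assumed for the purpose of illustration and is not taken from the paper.

```python
# Illustrative sketch of the barycenter reordering described above.
# layers[i] is the left-to-right ordering sigma_i of layer i (0-based here);
# neighbors(u, j) returns the nodes adjacent to u that lie on layer j.
def barycenter_sweep(layers, neighbors):
    def reorder(i, ref):
        pos = {v: p for p, v in enumerate(layers[ref])}
        current = {v: p for p, v in enumerate(layers[i])}
        def bary(u):
            nbrs = [pos[v] for v in neighbors(u, ref) if v in pos]
            # Nodes without neighbors on the reference layer keep their place
            # (a common convention; an assumption made here).
            return sum(nbrs) / len(nbrs) if nbrs else current[u]
        layers[i] = sorted(layers[i], key=bary)

    n = len(layers)
    for i in range(1, n):              # layer 2 .. layer n
        reorder(i, i - 1)
    for i in range(n - 2, -1, -1):     # layer n-1 .. layer 1
        reorder(i, i + 1)
```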
              2.5. Priority Layout Method

The priority layout method solves the x-coordinate assignment problem. Its idea is similar to that of the barycenter algorithm: it assigns the x-coordinate position of each node, layer by layer, according to the priority value of each node.
   At first, the x-coordinate positions of the nodes in each layer are given by x_k^i = x_0 + k, where x_0 is a given integer and x_k^i is the x-coordinate of the k-th element of σ_i. Next, the x-coordinate positions of the nodes are adjusted in the order from layer 2 to layer n, from layer n − 1 to layer 1, and from layer n/2 to layer n. The adjustments of node positions from layer 2 to layer n are called down procedures, while those from layer n − 1 to layer 1 are called up procedures. Based on the above, the priority value of the k-th node v on layer p is defined as follows: if node v is a dummy node, then priority(v) = B − |k − m/2|, in which B is a big given number and m is the number of nodes on layer p; otherwise, for down procedures (resp., up procedures), priority(v) is the number of nodes connected to v on layer p − 1 (resp., p + 1).
   Moreover, the new x-coordinate position of each node v is defined as the average x-coordinate position of the nodes connected to v on layer p − 1 (resp., p + 1) for down procedures (resp., up procedures).
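The sketch below illustrates the priority values and the target x-coordinates used by one down procedure (layer p adjusted against layer p − 1). It is a simplified reading of the description above: the handling of collisions between neighboring nodes on the same layer is omitted, and the data layout is assumed.

```python
# Illustrative sketch of one down procedure of the priority layout method.
# layer_p: nodes of layer p in left-to-right order; x_above: dict mapping each
# node of layer p-1 to its current x-coordinate; neighbors(v): nodes adjacent
# to v; is_dummy(v): whether v is a dummy node; B: a big given number.
def down_procedure(layer_p, x_above, neighbors, is_dummy, B=10**6):
    m = len(layer_p)

    def priority(k, v):
        if is_dummy(v):
            return B - abs(k - m / 2)
        return sum(1 for u in neighbors(v) if u in x_above)

    # Nodes are processed in decreasing order of priority; each one is moved
    # toward the average x-coordinate of its neighbors on layer p-1.
    new_x = {v: k for k, v in enumerate(layer_p)}          # x = x0 + k with x0 = 0
    order = sorted(range(m), key=lambda k: priority(k, layer_p[k]), reverse=True)
    for k in order:
        v = layer_p[k]
        nbrs = [x_above[u] for u in neighbors(v) if u in x_above]
        if nbrs:
            new_x[v] = sum(nbrs) / len(nbrs)
    return new_x
```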
              2.6. Genetic Algorithm

The genetic algorithm (GA) [5] is a stochastic global search method that has proved successful for many kinds of optimization problems. GA is categorized as a global search heuristic: it works with a population of candidate solutions and tries to optimize the answer by using three basic principles, namely selection, crossover, and mutation. For more details on GA, readers are referred to [5].

                  3. OUR METHOD

The major issue in drawing layered layouts of directed graphs is that, within the first three steps of Sugiyama's algorithm, the result of a preceding step may restrict that of the subsequent step. To solve this, we design a GA that combines the first three steps of Sugiyama's algorithm. Figure 5 shows the flow chart of our GA. That is, our method consists of a GA and an x-coordinate assignment step. Note that the barycenter algorithm and the priority layout method are also used in our method: the former is used inside our GA to reduce edge crossings, while the latter is applied in the x-coordinate assignment step.

Figure 5: The flow chart of our genetic algorithm (blocks: Initialization, Assign dummy nodes, Barycenter, Selection, Remove dummy nodes, Crossover, Mutation, Fine tune, Terminate?, Draw the best chromosome).
                 3.1. Definitions

For arranging nodes on layers, if the relative horizontal positions of the nodes are determined, then the exact x-coordinate positions of the nodes are also determined according to the priority layout method. Hence, in the following we only consider the relative horizontal positions of nodes, and each node is arranged on a grid. We use a GA to model the layered layout problem, so we define some basic elements:
Population: A population (generation) includes many chromosomes, and the number of chromosomes depends on the setting of the initial population size.
Chromosome: One chromosome represents one graph layout, in which the absolute position of each (dummy) node on the grid is recorded. Since the adjacencies of nodes and the directions of edges are not altered by our GA, we do not need to record this information on the chromosomes. On the grid, one row represents one layer; a column represents the order of nodes on the same layer, and the nodes on the same layer are always placed successively. The best-chromosome window reserves the best several chromosomes over all preceding generations; the best-chromosome window size ratio is the ratio of the best-chromosome window size to the population size.
Fitness Function: The 'fitness' value in our definition is, with some abuse of the term, defined as a penalty for the bad quality of a chromosome. That is, a larger 'fitness' value implies a worse chromosome, and our GA aims to find the chromosome with the minimal 'fitness' value. The aesthetic criteria used to determine the quality of chromosomes (layouts) are given as follows (these criteria are taken from [8] and [9]): fitness value = Σ_{i=1}^{7} Ci × Fi, where the Ci, 1 ≤ i ≤ 7, are constants; F1 is the total edge vertical length; F2 is the number of edges pointing upward; F3 is the number of edges pointing horizontally; F4 is the number of edge crossings; F5 is the number of edge-node crossings; F6 is the degree by which the layout height exceeds the limited height; and F7 is the degree by which the layout width exceeds the limited width.
   In order to experimentally compare our GA with the CLA in [1], the fitness function of our GA is tailored to match the CLA as follows: fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7, where the weight of the edge crossing number is adjusted in our experiments to reflect the issue we want to examine.
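Both fitness functions above can be written directly as code. The following Python sketch is illustrative; the containers holding the constants C_i and the measured quantities F_i are assumptions, and the defaults C6 = C7 = 500 are the values listed later in Section 5.1.

```python
# Illustrative sketch of the penalty-style fitness value (smaller is better).
# C and F are dicts indexed 1..7 with the constants and measured quantities.
def fitness_value(C, F):
    return sum(C[i] * F[i] for i in range(1, 8))

# Variant tailored for the comparison with the CLA.
def fitness_value_cla(span, edge_crossings, F6, F7, weight, C6=500, C7=500):
    return span + weight * edge_crossings + C6 * F6 + C7 * F7
```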
        4. MAIN COMPONENTS OF OUR GA

Initialization: For each chromosome, we randomly assign the nodes to a ⌈√|V|⌉ × ⌈√|V|⌉ grid.
Selection: To evaluate the fitness value of each chromosome, we have to compute the number of edge crossings, which however cannot be computed at this point because the routing of each edge is not yet determined. Hence, some dummy nodes should be introduced to determine the routing of the edges. Ideally, these dummy nodes would be placed at the relative positions that yield the fewest edge crossings between two adjacent layers. Nevertheless, permuting the nodes on each layer for the fewest edge crossings is an NP-hard problem [4]. Hence, the barycenter algorithm (which is also used by the CLA) is applied to reduce edge crossings on each chromosome before selection. Next, the selection step is implemented by truncation selection, which duplicates the best (selection rate × population size) chromosomes (1/selection rate) times to fill the entire population. In addition, we use a best-chromosome window to reserve some of the best chromosomes of the previous generations, as shown in Figure 6.

Figure 6: The selection process of our GA.
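The truncation selection with a best-chromosome window can be sketched as follows. The representation of chromosomes and the exact bookkeeping of the window are assumptions made for illustration.

```python
# Illustrative sketch of truncation selection with a best-chromosome window.
# population: list of chromosomes; fitness(c) is a penalty, so smaller is better.
def truncation_selection(population, fitness, selection_rate=0.7, window_ratio=0.2):
    ranked = sorted(population, key=fitness)
    keep = max(1, int(round(selection_rate * len(population))))
    copies = max(1, int(round(1.0 / selection_rate)))
    # Duplicate the best chromosomes to refill the whole population.
    new_population = (ranked[:keep] * copies)[:len(population)]
    # Reserve the best chromosomes seen so far in the window.
    window_size = max(1, int(round(window_ratio * len(population))))
    window = ranked[:window_size]
    return new_population, window
```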
Crossover: The main steps of our crossover process are as follows. (1) The two ordered parent chromosomes are called the 1st and the 2nd parent chromosomes. W.l.o.g., we only describe how to generate the first child chromosome from the two parents; the other child is generated similarly. (2) Remove all dummy nodes from the two parent chromosomes. (3) Choose half of the nodes from each layer of the 1st parent chromosome and place them on the same relative layers of the child chromosome in the same horizontal ordering. (4) The relative positions of the remaining nodes are determined by the 2nd parent chromosome. Specifically, we repeatedly choose a node adjacent to the smallest number of unplaced nodes until all nodes are placed; if there are several candidate nodes, we
randomly choose one of them. The layer of the chosen node is equal to its base layer plus its relative layer, where the base layer is the average of the layers of its already-placed connected nodes in the child chromosome, and the relative layer is its layer position relative to those connected nodes in the 2nd parent chromosome. (5) The layers of the new child chromosome are shifted so that they start from layer 1.
Mutation: In the mutated chromosome, a node is chosen randomly, and the position of the chosen node is then altered randomly.
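The crossover steps (1)-(5) can be summarized in code as below. This is a simplified, illustrative reading of the description: a chromosome is assumed to be a dictionary mapping each node to a (layer, position) pair, collisions on the grid are ignored, and ties among candidate nodes are broken deterministically instead of randomly.

```python
# Illustrative sketch of the crossover described above. parent1/parent2 map
# node -> (layer, position); adjacency[v] is the set of nodes adjacent to v.
def crossover(parent1, parent2, adjacency):
    child = {}
    # (3) Take half of the nodes of each layer of the 1st parent, keeping
    #     their relative layers and horizontal ordering.
    by_layer = {}
    for v, (layer, pos) in parent1.items():
        by_layer.setdefault(layer, []).append((pos, v))
    for layer, nodes in by_layer.items():
        for pos, v in sorted(nodes)[: (len(nodes) + 1) // 2]:
            child[v] = (layer, pos)

    # (4) Place the remaining nodes: base layer from already-placed neighbors
    #     in the child, relative layer taken from the 2nd parent.
    remaining = [v for v in parent1 if v not in child]
    while remaining:
        v = min(remaining,
                key=lambda u: sum(1 for w in adjacency[u] if w not in child))
        placed = [w for w in adjacency[v] if w in child]
        if placed:
            base = sum(child[w][0] for w in placed) / len(placed)
            rel = sum(parent2[v][0] - parent2[w][0] for w in placed) / len(placed)
            layer = int(round(base + rel))
        else:
            layer = parent2[v][0]
        child[v] = (layer, parent2[v][1])
        remaining.remove(v)

    # (5) Shift layers so that they start from layer 1.
    lowest = min(layer for layer, _ in child.values())
    return {v: (layer - lowest + 1, pos) for v, (layer, pos) in child.items()}
```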
Termination: If the difference between the average fitness values of successive generations within the latest ten generations is at most 1% of the average fitness value over these ten generations, then our GA stops. The best chromosome of the latest population is then chosen, and its corresponding graph layout (including dummy nodes at barycenter positions) is drawn.
Fine Tune: Before the selection step or after the termination step, we can tune chromosomes for the better according to the fitness function. For example, we remove all layers that contain only dummy nodes and no normal nodes, called dummy layers. Such a process does not necessarily worsen the edge crossing number, but it improves the span number. In addition, some unnecessary dummy nodes on each edge can also be removed after the termination step, where a so-called unnecessary dummy node is a dummy node that can be removed without causing new edge crossings or worsening the fitness value.
            5. EXPERIMENTAL RESULTS

To evaluate the performance of our algorithm, it is experimentally compared with the CLA (combining the first two steps of Sugiyama's algorithm) using MST MIN AVG as the distance function [1], as mentioned in the previous sections. For convenience, the CLA using the MST MIN AVG distance function is called the L_M algorithm (Leveling with MST MIN AVG). The L_M algorithm (for steps 1 and 2) and the barycenter algorithm (for step 3) can replace the first three steps of Sugiyama's algorithm. In order to compare with our GA (for steps 1, 2, and 3), we consider the algorithm combining the L_M algorithm and the barycenter algorithm, which is called the LM_B algorithm throughout the rest of this paper.
   Note that the x-coordinate assignment problem (step 4) is solved by the priority layout method in our experiments. In fact, this step affects neither the span number nor the edge crossing number. In addition, the second step of Sugiyama's algorithm (layer assignment) is an NP-hard problem when the width of the layered layout is restricted. Hence, we investigate the cases of limited and unlimited layout width separately.

          5.1. Experimental Environment

All experiments were run on a 2.0 GHz dual-core laptop with 2 GB of memory under the Java 6.0 platform from Sun Microsystems, Inc. The parameters of our GA are as follows: population size: 100; max generation: 100; selection rate: 0.7; best-chromosome window size ratio: 0.2; mutation probability: 0.2; C6: 500; C7: 500; fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7.

            5.2. Unlimited Layout Width

Because it is necessary to limit the layout width and height for the L_M algorithm, we set both the width and height limits to 30. This implies that there are at most 30 nodes (dummy nodes excluded) on each layer and at most 30 layers in each layout. If we let the maximal node number be 30 in our experiment, then the range of node distribution is effectively unlimited. In our experiments, we consider a graph with 30 nodes under three different densities (2%, 5%, 10%), where the density is the ratio of the edge number to the number of all possible edges, i.e., density = edge number/(|V|(|V| − 1)/2). Let the weight ratio of edge crossing to span be denoted by α. In our experiments, we consider five different α values: 1, 3, 5, 7, and 9. The statistics of the experimental results are given in Table 1.
   Consider an example of a 30-node graph with 5% density. The layered layouts produced by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 7, Figure 8(a), and Figure 8(b), respectively. Obviously, our algorithm performs better than the LM_B algorithm.

             5.3. Limited Layout Width

The input graph used in this subsection is the same as in the previous subsection (i.e., a 30-node graph). The limited width is set to 5, which is smaller than the square root of the node number (30), because we hope that the results under the limited and unlimited conditions show obvious differences. The statistics of the experimental results under the same settings as in the previous subsection are given in Table 2.
Table 1: The results after redrawing random graphs with 30 nodes and unlimited layout width.

  method            measure         density = 2%   density = 5%   density = 10%
  LM_B              span                30.00          226.70          798.64
                    crossing             4.45           57.90          367.00
                    running time       61.2 ms        151.4 ms        376.8 ms
  our GA   α = 1    span                30.27          253.93          977.56
                    crossing             0.65           38.96          301.75
           α = 3    span                31.05          277.65         1338.84
                    crossing             0.67           32.00          272.80
           α = 5    span                30.78          305.62         1280.51
                    crossing             0.67           29.89          218.45
           α = 7    span                32.24          329.82         1359.46
                    crossing             0.75           26.18          202.53
           α = 9    span                31.65          351.36         1444.27
                    crossing             0.53           24.89          200.62
                    running time        3.73 s         17.32 s        108.04 s

Figure 7: Layered layout by LM_B (span: 262, crossing: 38).

Figure 8: Layered layouts by our GA. (a) α = 1 (span: 188, crossing: 30); (b) α = 9 (span: 238, crossing: 14).

Table 2: The results after redrawing random graphs with 30 nodes and limited layout width 5.

  method            measure         density = 2%   density = 5%   density = 10%
  LM_B              span                28.82          271.55          808.36
                    crossing             5.64           59.09          383.82
                    running time       73.0 ms        147.6 ms        456.2 ms
  our GA   α = 1    span                32.29          271.45         1019.56
                    crossing             0.96           39.36          292.69
           α = 3    span                31.76          294.09         1153.60
                    crossing             0.80           33.16          232.76
           α = 5    span                31.82          322.69         1282.24
                    crossing             0.82           30.62          202.31
           α = 7    span                32.20          351.00         1369.73
                    crossing             0.69           27.16          198.20
           α = 9    span                33.55          380.20         1420.31
                    crossing             0.89           24.95          189.25
                    running time        3.731 s         3.71 s         18.07 s

   Consider an example of a 30-node graph with 5% density. The layered layouts produced for this graph by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 9, Figure 10(a), and Figure 10(b), respectively. Obviously, our algorithm also performs better than the LM_B algorithm.

                 5.4. Discussion

Due to the page limitation, only the case of 30-node graphs is included in this paper; in fact, we conducted many experiments on various graphs. Together with those results, the tables and figures here show that under any condition (node number, edge density, and limited width or not) the crossing number obtained by our GA is smaller than that obtained by LM_B, whereas the span number obtained by our GA is not necessarily larger than that of LM_B. When the layout width is limited and the node number is sufficiently small (about 20, according to our experimental evaluation), our GA may produce span and edge crossing numbers that are both smaller than those of LM_B at the same time.
   Moreover, we observed that under any condition the edge crossing number becomes smaller and the span number becomes larger when the weight of edge crossing is increased. This implies that we can effectively adjust the trade-off between edge crossings and span; that is, we can reduce the number of edge crossings at the cost of increasing the span number.
   Under the limited-width condition, because the results of L_M are restricted, its span number should be larger than under the unlimited condition. However, there are some unusual situations with our GA: although its results are also restricted under the limited-width condition, its span number is smaller than under the unlimited-width condition. Our explanation is that the limited-width condition may reduce the possible dimension.
Figure 9: Layered layout by the LM_B algorithm (span: 288, crossing: 29) with limited layout width = 5.

Figure 10: Layered layouts by our GA. (a) α = 1 (span: 252, crossing: 29); (b) α = 9 (span: 295, crossing: 14).

In this problem, the dimension represents the set of positions at which nodes can be placed; if the dimension is smaller, our GA can converge to a better result more easily.

                6. CONCLUSIONS

This paper has proposed an approach for producing layered layouts of directed graphs, which uses a GA to simultaneously consider the first three steps of the classical Sugiyama algorithm (which consists of four steps) and applies the priority layout method for the fourth step. Our experimental results reveal that our GA can efficiently adjust the weighting ratios among all aesthetic criteria.

              ACKNOWLEDGEMENT

This study is conducted under the "Next Generation Telematics System and Innovative Applications/Services Technologies Project" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.
                 REFERENCES

[1] C. Bachmaier, F. Brandenburg, W. Brunner, and G. Lovász. Cyclic leveling of directed graphs. In Proc. of GD 2008, volume 5417 of LNCS, pages 348-359, 2008.
[2] H. do Nascimento and P. Eades. A focus and constraint-based genetic algorithm for interactive directed graph drawing. Technical Report 533, University of Sydney, 2002.
[3] T. Eloranta and E. Mäkinen. TimGA: A genetic algorithm for drawing undirected graphs. Divulgaciones Matematicas, 9(2):55-171, 2001.
[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4(3):312-316, 1983.
[5] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
[6] P. Kuntz, B. Pinaud, and R. Lehn. Minimizing crossings in hierarchical digraphs with a hybridized genetic algorithm. Journal of Heuristics, 12(1-2):23-36, 2006.
[7] E. Mäkinen and M. Sieranta. Genetic algorithms for drawing bipartite graphs. International Journal of Computer Mathematics, 53:157-166, 1994.
[8] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501-516, 2002.
[9] K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109-125, 1981.
[10] J. Utech, J. Branke, H. Schmeck, and P. Eades. An evolutionary algorithm for drawing directed graphs. In Proc. of CISST'98, pages 154-160. CSREA Press, 1998.
[11] Q.-G. Zhang, H.-Y. Liu, W. Zhang, and Y.-J. Guo. Drawing undirected graphs with genetic algorithms. In Proc. of ICNC 2005, volume 3612 of LNCS, pages 28-36, 2005.
Structured Local Binary Haar Pattern for Graphics Retrieval

                      Song-Zhi Su,                                                    Shu-Yuan Chen*, Shang-An Li
 Cognitive Science Department of Xiamen University,                         Department of Computer Science and Engineering of
  Fujian Key Laboratory of the Brain-like Intelligent                                  Yuan Ze University, Taiwan
    Systems (Xiamen University), Xiamen, China                              *corresponding author, cschen@saturn.yzu.edu.tw
               SUSONGZHI@163.com                                                                Der-Jyh Duh
                          Shao-Zi Li                                         Department of Computer Science and Information
 Cognitive Science Department of Xiamen University,                            Engineering, Ching Yun University, Taiwan
  Fujian Key Laboratory of the Brain-like Intelligent                                      djduh@cyu.edu.tw
    Systems (Xiamen University), Xiamen, China
                 szlig@xmu.edu.cn

Abstract—Feature extraction is an important issue in graphics retrieval. Local feature based descriptors are currently the predominant methods used in image retrieval and object recognition. Inspired by the success of the Haar feature and the Local Binary Pattern (LBP), a novel feature named structured local binary Haar pattern (SLBHP) is proposed for graphics retrieval in this paper. SLBHP encodes the polarity instead of the magnitude of the difference between the accumulated gray values of adjacent rectangles. Experimental results on graphics retrieval show that the discriminative power of SLBHP is better than that of edge points (EP), the Haar feature, and LBP, even under noisy conditions.

    Keywords-graphics retrieval; structured local binary Haar pattern; Haar; local binary pattern;

                      I.   INTRODUCTION

    With the advent of computing technology, media acquisition/storage devices, and multimedia compression standards, more and more digital data are generated and made available to users all over the world. Nowadays it is easy to access electronic books, electronic journals, web portals, and video streams. Hence, it is convenient to provide users with an image retrieval system for browsing, searching, and retrieving images from a large database of digital images. Traditional systems add metadata such as captions, keywords, descriptions, or annotations to the images so that retrieval can be converted into a text retrieval problem.
    However, manual annotation is time-consuming, laborious, and expensive. There is a large body of work on content-based image retrieval (CBIR) [1] [2] [3], which is also called query by image content. "Content-based" means that the retrieval process utilizes and analyzes the actual contents of the image, which might refer to colors, shapes, textures, or any other information that can be derived from the images themselves.
    Unfortunately, although there are many content-based retrieval methods for image databases, few of them are specifically designed for graphics. Huet et al. [4] exploit both geometric attributes and structural information to construct a shape histogram for retrieving line-patterns from large databases. Chi et al. [5] proposed an approach combining a local-structure-based shape representation and a new histogram indexing structure to address two issues in the shape retrieval problem: perceptual similarity measures on partial queries, and overcoming the dimensionality curse and adverse environments. Chalechale et al. [6] proposed a sketch-based image retrieval system in which feature extraction for matching is based on angular partitioning of two abstract images obtained from the model image and from the query image. The angular-spatial distribution of pixels in the abstract images is scale and rotation invariant and is made robust against translation by using the Fourier transform.
    Most existing graphics retrieval methods adopt contour-based [4] [5] rather than pixel-based approaches [6]. Since contour-based methods deal with many curves and lines, they are computationally intensive. It is therefore the goal of this paper to propose a pixel-based graphics retrieval method using the novel structured local binary Haar pattern.
    This paper is organized as follows. The original Haar and LBP features are described in Section II. The proposed SLBHP feature is described in Section III. Experimental results and performance comparisons are given in Section IV. Finally, conclusions are given in Section V.

         II.   LOCAL BINARY PATTERN AND HAAR FEATURE

A. Local Binary Pattern
    Local feature based approaches have achieved great success in object detection and recognition in recent years. The original LBP descriptor was proposed by Ojala et al. [7] and has proved to be a powerful means for texture analysis. LBP encodes local primitives including different types of curved edges, spots, flat areas, etc. An advantage of LBP is its invariance to monotonic changes in gray scale, so LBP is widely used in face recognition [8], pedestrian detection [9], and many other computer vision applications.
    The basic LBP operator assigns a label to every pixel of an image by thresholding its 3 × 3 neighborhood and interpreting the result as a binary number. The histogram of the labels can then be used as a descriptor of local regions. See Figure 1(a) for an illustration of the basic LBP operator.
Figure 1. Illustration of LBP and Haar. (a) The basic LBP operator; (b) four types of Haar feature.

B. Haar Feature
    A simple rectangular Haar feature can be defined as the difference between the accumulated sums of the pixels inside adjacent rectangles, which can be at any position and scale within the given image. Oren et al. [10] first used 2-rectangle features for pedestrian classification. Viola and Jones [11] extended them to 3-rectangle and 4-rectangle features in the Viola-Jones object detection framework for faces and pedestrians. The difference values indicate certain characteristics of a particular area of the image. Haar features encode low-frequency information, and each feature type can indicate the existence of certain characteristics in the image, such as vertical or horizontal edges or changes in texture.
    Haar features can be computed quickly using the integral image [11], an intermediate representation of the image with which all rectangular two-dimensional image features can be computed rapidly. Each element of the integral image contains the sum of all pixels located in the upper-left region of the original image. Given the integral image, any rectangular sum of pixel values aligned with the coordinate axes can be computed with four array references.
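The integral image and the four-reference rectangle sum can be sketched as follows; the list-of-lists image representation is assumed for illustration.

```python
# Illustrative sketch of an integral image and a rectangle sum computed with
# four array references, as described above.
def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle with top-left corner (x, y), width w and height h.
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

# A 2-rectangle Haar feature is then a difference of two such sums, e.g.
# rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h) for a horizontal pair.
```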
C. A New Sight into LBP, Haar, and Gradient

Figure 2. LBP can be seen as a weighted combination of binary Haar features.

    The decimal form of the resulting 8-bit LBP code can be expressed as follows:

        LBP(x, y) = Σ_{i=0}^{7} w_i b_i(x, y),

where w_i = 2^i and b_i(x, y) = 1 if Haar_i(x, y) > T, and 0 otherwise. As shown in Figure 2, each component of LBP is actually a binary 2-rectangle Haar feature with rectangle size 1 × 1. Even the gradient can be seen as a combination of Haar features; for example,

        I_x = Haar_0 + Haar_4,    I_y = Haar_2 + Haar_6,

where I_x and I_y are the gradients along the x axis and the y axis with filters [1, −2, 1] and [1, −2, 1]^T, respectively.
                                                                                  information. Let ai , i = 0,1,L,8 denote the corresponding
                                                                                  gray values for a 3×3 window with a0 at the center pixel




(x, y), as shown in Figure 3(a). The value of the SLBHP code of a pixel (x, y) is given by the following equation,

    SLBHP(x, y) = Σ_{p=1..4} B(H_p ⊗ N(x, y)) × 2^p,

where

    N(x, y) = ⎡ a1  a2  a3 ⎤        H1 = ⎡  1   1   0 ⎤        H2 = ⎡  0   1   1 ⎤
              ⎢ a8  a0  a4 ⎥             ⎢  1   0  −1 ⎥             ⎢ −1   0   1 ⎥
              ⎣ a7  a6  a5 ⎦             ⎣  0  −1  −1 ⎦             ⎣ −1  −1   0 ⎦

    H3 = ⎡  1   1   1 ⎤        H4 = ⎡ −1   0   1 ⎤
         ⎢  0   0   0 ⎥             ⎢ −1   0   1 ⎥
         ⎣ −1  −1  −1 ⎦             ⎣ −1   0   1 ⎦

and B(x) = 1 if |x| > T and 0 otherwise, with T a threshold (15 in our experiments). By this binary operation, the feature becomes more robust to global lighting changes. It is noted that H_p denotes a Haar-like basis function and H_p ⊗ N(x, y) denotes the difference between the accumulated gray values of the black and red rectangles, as shown in Figure 3(c). Unlike the traditional Haar feature, here the rectangles overlap by one pixel. Inspired by LBP and by the fact that a single binary Haar feature might not have enough discriminative power, we combine these binary features just as LBP does. Figure 3(c) shows an example of the SLBHP feature. SLBHP extends the merits of both the Haar feature and LBP, and it encodes the most common structure information of graphics. Moreover, SLBHP has a dimension of 16, smaller than the dimension of 256 for LBP, while being more immune to noise since each Haar feature uses more pixels at a time.
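A small sketch (Python/NumPy; ours, not part of the paper) of the per-pixel SLBHP code under the definitions above. The operator ⊗ is taken here as the element-wise product summed over the 3×3 window, which reproduces the black-minus-red rectangle difference; the threshold and the 2^p weighting follow the text, everything else is an illustrative assumption.

import numpy as np

# Haar-like 3x3 basis functions H_1..H_4 from the text.
H = [np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
     np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
     np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
     np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]])]

def slbhp(gray, x, y, T=15):
    # gray is a 2-D array indexed as gray[y, x]; (x, y) must not lie on
    # the image border so that the full 3x3 neighborhood N(x, y) exists.
    N = gray[y - 1:y + 2, x - 1:x + 2].astype(np.int32)
    code = 0
    for p, Hp in enumerate(H, start=1):
        response = np.sum(Hp * N)            # H_p (x) N(x, y)
        bit = 1 if abs(response) > T else 0  # B(.)
        code += bit * (2 ** p)
    return code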
B. SLBHP for Graphics Retrieval
    After the SLBHP value of each pixel is computed, the histogram of SLBHP over a region R is computed by the following equation:

    H(i) = Σ_{(x, y) ∈ R} I{SLBHP(x, y) = i},

where I{A} = 1 if A is true and 0 if A is false. The histogram H contains information about the distribution of local patterns, such as edges, spots, and flat areas, over the image region R. In order to make SLBHP robust to slight translation, a graphics photo is divided into several small spatial regions ("blocks"); an SLBHP histogram is computed for each block, and the histograms are then concatenated to form the representation of the graphics, as shown in Figure 4. For better invariance to illumination, it is useful to contrast-normalize the local responses in each block before using them. Experimental results showed that L2 normalization gives better results than L1 and L1-sqrt normalization. Similar to other popular local-feature-based object detection methods, the detection windows are tiled with a dense (overlapping) grid of SLBHP descriptors. The overlap size is half of the whole block.

    Figure 4. An example of SLBHP histograms for graphics retrieval.

                        IV. EXPERIMENTAL RESULTS

    Figure 5. Some query results for the graphics database. (a) Query graphics; (b) a list of the three most similar graphics ordered by similarity value. The one with the red rectangle is the ground-truth match.

    479 electronic files of graphics were collected to construct the database for the retrieval experiments. The test images comprise 479 graphics photos taken by a digital camera, to which noise was added to obtain noisy test images. The performance of graphics retrieval is measured by the retrieval accuracy, computed as the ratio of the number of graphics correctly retrieved to the total number of queries. Moreover, not only the retrieval accuracy with respect to the first rank but also that of the second and third ranks is considered in our experiments. The retrieval accuracies of the different approaches are listed in Tables 1 through 4 for block sizes from 8×8 to 32×32. The retrieval accuracy for the non-overlapping case is also listed in Table 4. By comparing Tables 1 and 4, we found that overlapping results in higher retrieval accuracy. It is noted that the proposed method and the approaches using EP [6] and LBP all adopt histogram-based matching. For the Haar feature, in contrast, the four Haar values computed for each block are normalized and then concatenated to form the representation; the chi-square distance is also adopted as the similarity measure for the Haar feature.
In our experiments, we found that the chi-square distance is a better similarity measure for histogram-based matching than the Euclidean distance. Some retrieval results are shown in Figure 5.
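For illustration, a sketch (Python/NumPy; ours, not the authors' implementation) of the block-wise SLBHP histogram representation with L2 normalization and the chi-square distance used for matching; the block size, overlap, and bin count below are assumptions consistent with the settings described above.

import numpy as np

def block_histograms(code_img, block=(16, 16), step=(8, 8), n_bins=31):
    # Half-overlapping blocks over the per-pixel SLBHP code image; each
    # block yields an L2-normalized histogram, and the histograms are
    # concatenated into one descriptor. n_bins covers the possible code
    # values (0..30 when the bits are weighted by 2^p, p = 1..4).
    h, w = code_img.shape
    feats = []
    for y in range(0, h - block[0] + 1, step[0]):
        for x in range(0, w - block[1] + 1, step[1]):
            patch = code_img[y:y + block[0], x:x + block[1]]
            hist = np.bincount(patch.ravel(), minlength=n_bins).astype(np.float64)
            hist /= (np.linalg.norm(hist) + 1e-12)   # L2 normalization
            feats.append(hist)
    return np.concatenate(feats)

def chi_square(p, q, eps=1e-12):
    # Chi-square distance between two concatenated descriptors;
    # the best retrieval match is the database entry with the
    # smallest distance to the query.
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))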


            TABLE I.           RETRIEVAL ACCURACIES OF EDGE POINTS (EP), LBP, HAAR, AND SLBHP WITH HALF-OVERLAPPING BLOCKS.
                                          1-best                                     2-best                                  3-best
                          EP      LBP       Haar      SLBHP       EP          LBP       Haar      SLBHP      EP       LBP       Haar      SLBHP
               32x32      85.2    70.4      83.7      88.3        91.6        79.5      90.6      95.6       93.3     82.5      92.5      96.5
               32x16      83.3    62.3      68.9      88.5        91.4        74.9      76.0      94.6       93.5     78.3      78.5      95.7
               16x32      86.8    66.6      60.8      90.2        92.9        76.0      68.7      95.6       94.2     80.0      72.2      96.7
               16x16      85.0    58.2      62.4      89.4        92.3        66.8      70.1      94.4       94.4     69.3      73.3      95.8
               16x8       81.2    42.0      37.4      86.6        89.8        51.8      43.4      91.9       91.2     55.5      45.9      93.7
               8x16       83.3    45.3      29.0      86.6        90.6        55.1      36.5      92.5       92.9     57.8      40.7      94.8
               8x8        79.3    30.5      29.2      82.7        86.8        39.5      34.9      89.3       89.8     44.5      39.2      91.2

                  TABLE II.          RETRIEVAL ACCURACIES UNDER GAUSSIAN NOISE WITH VARIANCE 50 AND PERTURBATION 1%.
                                     1-best                                     2-best                                   3-best
                   EP       LBP       Haar         SLBHP        EP          LBP      Haar        SLBHP      EP       LBP       Haar       SLBHP
          32x32   63.88     71.19     83.09        82.46       74.53        78.91    90.40       90.81     77.87     84.13     92.48      93.95
          32x16   71.61     65.76     68.48        85.18       79.54        75.16    75.78       93.53     84.76     79.54     78.71      94.57
          16x32   72.44     67.22     60.96        87.06       79.54        76.41    68.27       93.53     83.72     81.21     72.03      94.99
          16x16   78.08     59.92     62.42        88.31       85.39        68.27    69.52       93.74     89.14     72.65     73.70      94.99
          16x8    79.96     43.63     37.58        86.01       87.89        52.61    43.42       92.07     89.98     55.74     45.72      93.95
          8x16    79.12     47.60     29.02        86.85       88.31        54.91    36.33       93.11     91.44     59.08     41.34      94.99
          8x8     81.00     31.52     29.23        83.72       87.68        40.08    34.24       90.40     90.61     44.89     38.62      92.48

                       TABLE III.          RETRIEVAL ACCURACIES UNDER SALT AND PEPPER NOISES WITH PERTURBATION 0.5%.
                                      1-best                                         2-best                                    3-best
                        EP       LBP       Haar       SLBHP        EP          LBP       Haar      SLBHP       EP       LBP       Haar      SLBHP
             32x32     15.24     70.77     83.51      84.76       19.83        79.33     91.02     92.48      25.05     82.88     92.48     94.78
             32x16     20.46     64.93     68.48      86.01       27.97        75.57     75.79     94.15      39.25     79.33     78.50     95.62
             16x32     22.55     67.43     60.96      88.10       27.35        76.20     68.48     94.15      34.66     80.17     72.44     95.62
             16x16     37.37     59.71     61.59      88.52       46.97        67.85     68.89     93.95      61.38     71.19     73.28     95.41
             16x8      55.95     42.80     36.74      86.22       67.22        52.40     43.01     92.28      73.90     55.95     45.30     93.32
             8x16      54.28     47.39     28.60      87.27       67.43        54.90     36.12     93.32      78.08     58.87     40.71     94.57
             8x8       70.35     31.11     29.02      83.51       81.00        40.29     34.86     89.77      84.55     45.09     39.25     92.28

                                     TABLE IV.          RETRIEVAL ACCURACIES WITH NON-OVERLAPPING BLOCKS.
                                          1-best                                       2-best                                 3-best
                          EP        LBP       Haar     SLBHP           EP       LBP        Haar    SLBHP       EP       LBP       Haar     SLBHP
              32x32     82.04     70.98       69.73    87.68      90.40        77.66      78.29    94.57     92.49     81.42      82.88    95.82
              32x16     79.75     64.09       61.80    87.27      88.10        72.65      68.89    93.53     89.98     75.57      71.61    94.78
              16x32     82.25     66.18       53.44    89.14      89.77        74.95      61.17    94.78     90.81     78.50      64.30    96.24
              16x16     81.00     57.20       57.83    88.94      88.94        66.18      65.76    93.11     91.23     69.31      69.10    94.36
              16x8      78.50     41.34       29.23    84.97      87.06        51.57      36.33    91.23     89.35     54.90      40.08    92.48
              8x16      79.96     43.01       23.38    86.01      88.52        51.77      29.44    91.23     91.65     55.11      32.57    93.53
              8x8       75.99     29.22       27.97    83.30      84.97        39.25      33.83    88.31     88.31     42.17      36.12    90.61


                        V. CONCLUSION
    A novel local feature, SLBHP, which combines the merits of Haar and LBP, is proposed in this paper. The effectiveness of SLBHP has been demonstrated by various experimental results. Moreover, compared with the other approaches using EP, Haar, and LBP descriptors, SLBHP is superior even under noisy conditions. Further research can be directed to extending the proposed graphics retrieval to slide retrieval or e-learning video retrieval using graphics as query keywords.

                        ACKNOWLEDGMENT
    This work was partially supported by the National Science Council of Taiwan under Grant NSC 99-2221-E-155-072, the National Natural Science Foundation of China under Grant 60873179, the Shenzhen Technology Fundamental Research Project under Grant JC200903180630A, and the Doctoral Program Foundation of Institutions of Higher Education of China under Grant 20090121110032.

                        REFERENCES




[1]    R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: ideas,
       influences, and trends of the new age," ACM Computing Surveys,
       2008, vol. 40, no. 2, Article 5, pp. 1–60.
[2]    J. Deng, W. Dong, R. Socher, et al. ImageNet: A large-scale
       hierarchical image database. In: Proceedings of Computer Vision and
       Pattern Recognition, 2009.
[3]    A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images:
       a large dataset for non-parametric object and scene recognition,”
       IEEE Transactions on Pattern Analysis and Machine Intelligence,
       2008, vol. 30, no.11, pp. 1958- 1970.
[4]    B. Huet and E. R. Hancock, "Line pattern retrieval using relational
       histograms," IEEE Transactions on Pattern Analysis and Machine
       Intelligence, 1999, vol. 12, no. 12, pp. 1363-1370.
[5]    Y. Chi and M.K.H. Leung, “ALSBIR: A local-structure-based image
       retrieval,” Pattern Recognition, 2007, vol. 40, pp. 244-261.
[6]    A. Chalechale, G. Naghdy and A. Mertins, “Sketch-based image
       matching using angular partitioning,” IEEE Transactions on Systems,
       Man, and Cybernetics –Part A: Systems and Humans, 2005, vol. 35,
       no. 1, pp.28-41.
[7]    T. Ojala, M. Pietikainen, and D. Harwood, “A comparative study of
       texture measures with classification based on featured distribution,”
       Pattern Recognition, 1996, vol. 29, no. 1, pp.51-59.
[8]    T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local
       binary patterns: application to face recognition," IEEE Transactions on
       Pattern Analysis and Machine Intelligence, 2006, vol. 28, no. 12, pp.
       2037-2041.
[9]    X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector
       with partial occlusion handling," In: Proceedings of International
       Conference on Computer Vision, 2009.
[10]   M. Oren, C. Papageorgiou, P. Sinha, et al., "Pedestrian detection using
       wavelet templates," In: Proceedings of International Conference on
       Computer Vision and Pattern Recognition, 1997.
[11]   P. Viola and M. Jones, "Robust real-time face detection," International
       Journal of Computer Vision, 2004, vol. 57, no. 2, pp. 137-154.
[12]   L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, "Face detection
       based on multi-block LBP representation," In: Proceedings of
       International Conference on Biometrics, 2007.
[13]   S. Yan, S. Shan, X. Chen, and W. Gao, "Locally assembled binary
       (LAB) feature with feature-centric cascade for fast and accurate face
       detection," In: Proceedings of International Conference on Computer
       Vision and Pattern Recognition, 2008.




IMAGE-BASED INTELLIGENT ATTENDANCE LOGGING SYSTEM
                              Hary Oktavianto1, Gee-Sern Hsu2, Sheng-Luen Chung1
                              1 Department of Electrical Engineering
                              2 Department of Mechanical Engineering
                      National Taiwan University of Science and Technology, Taipei, Taiwan
                                         E-mail: hary35@yahoo.com

Abstract— This paper proposes an extension of the surveillance camera's function as an intelligent attendance logging system. The system works like a time recorder. Based on sitting and standing-up events, the system is designed with a learning phase and a monitoring phase. The learning phase learns the environment to locate the sitting areas. After a defined time, the system switches to the monitoring phase, which monitors the incoming occupants. When an occupant sits at the same location as a sitting area found by the learning phase, the monitoring phase generates a sitting-time report. A leaving-time report is also generated when an occupant stands up from his/her seat. The system employs one static camera, placed 6.2 meters away, 2.6 meters high, and facing down 21° from the horizontal; the camera's view is perpendicular to the working location. The experimental results show that the system achieves good performance.

Keywords— Activity map; Attendance; Logging system; Learning phase; Monitoring phase; Surveillance camera

                        I. INTRODUCTION
    Intelligent buildings have recently grown as a research topic [1], [2], [3]. Many buildings are installed with surveillance cameras for security reasons. This paper extends the function of existing surveillance cameras to an intelligent attendance logging system whose purpose is to report the occupants' attendance. The system works like a time recorder or time clock. A time recorder is a mechanical or electronic timepiece that is used to assist in tracking the hours an employee of a company has worked [4]. Instead of spending more budget on such timepieces, the surveillance camera can be used to perform the same function. The system is called intelligent because it learns from a given environment automatically to build a map. The map consists of the sitting areas of the occupants. A sitting area is the spatial information about where an occupant's working desk is located, so there is no need to select the occupants' working areas manually. Fig. 1 shows an example scenario. Naturally, an occupant enters the room and sits down to start working. Afterward, the occupant stands up from his/her seat and leaves the room. The sitting and standing-up events are used by the system to decide where the occupant's working area is and when the occupant works.

    Fig. 1. Occupant's working room (left) and a map consisting of the occupants' sitting areas (right).

    The flow diagram of the proposed system is shown in Fig. 2. The system consists of an object segmentation unit, a tracking unit, a learning phase, and a monitoring phase. A fixed static camera is placed inside the occupants' working room. The images taken by the camera are pre-processed by the object segmentation unit to extract the foreground objects. A connected foreground object is called a blob. These blobs are processed further in the tracking unit. Once the system detects a blob as an occupant, it keeps tracking the occupant in the scene using the centroid, ground position, color, and size of the occupant as simple tracking features. The learning phase is responsible for learning the environment and constructs a map as its output. The monitoring phase uses the map to monitor whether the occupants are present at their working desks or not. The report on the presence or absence of the occupants is the final output of the system for further analysis. The system is implemented by taking advantage of existing open-source computer vision libraries, OpenCV [5] and cvBlob [6].
    The contributions of this paper are:
(1) A learning mechanism that locates seats in an unknown environment.
(2) A monitoring mechanism that detects the entering and leaving events of occupants.
(3) An integrated system with real-time performance up to 16 fps, ready for context-aware applications.
    This paper is organized as follows. The problem definition and previous research as related work are reviewed in Section II. Section III describes the technical overview of the proposed solution. Section IV explains the tracking that is used to keep track of the occupants during their appearance in the scene based on the
information from the previous frame. The learning phase and the monitoring phase are explained in Section V. Section VI explains the experimental setup, results, and discussion. Finally, the conclusions are summarized in Section VII.

            II. PROBLEM DEFINITION AND RELATED WORK
    This section describes the problem definition and the previous work related to the intelligent attendance logging system.

A. Problem Definition
    The goal of this paper is to design an image-based intelligent attendance logging system. A fixed static camera is given as the input device inside an unknown working environment with a number of fixed seats, each of which belongs to a particular user or occupant. Occupants do not necessarily enter and leave at the same time. We are to design a camera-equipped intelligent attendance logging system such that the system can report, in real time, each occupant's entering and leaving events to and from his/her particular seat.
    The system is designed based on two assumptions. The first assumption is that the environment is unknown, in that the number of seats and their locations are not known before the system starts monitoring. The second assumption is that each occupant has his/her own seat; as such, detecting the presence/absence at a particular seat amounts to answering the presence/absence of the corresponding occupant.
    There are two performance criteria to evaluate the system with respect to its main functions, which are to find the sitting areas and to report the monitoring results. The first criterion is that the system should find the sitting areas given by the ground truth. The second criterion is that the system should be able to monitor the occupants during their appearance in the scene and generate accurate reports.

B. Related Work
    During the past decades, intelligent buildings have been developed. Zhou et al. [3] developed video-based human indoor activity monitoring aimed at assisting the elderly. Demirdjian et al. [7] presented a method for automatically estimating activity zones based on observed user behavior in an office room using a 3-D person tracking technique. They used simple position, motion, and shape features for tracking. The activity zones are used at run time to contextualize user preferences, e.g., allowing "location-sticky" settings for messaging, environmental controls, and/or media delivery. Girgensohn, Shipman, and Wilcox [8] observed that retail establishments want to know about traffic flow in order to better arrange goods and staff placement. They visualized the results as heat maps showing activity, object counts, and average velocities overlaid on the map of the space. Morris and Trivedi [9] extracted human activity. They presented an adaptive framework for live video analysis based on trajectory learning. A surveillance scene is described by a map which is learned in an unsupervised fashion to indicate interesting image regions and the way objects move between these places. These descriptors provide the vocabulary to categorize past and present activity, predict future behavior, and detect abnormalities.

    Fig. 2. Flow diagram of the system.

    The research above detects occupants and builds a map consisting of the locations that those people mostly occupy. This paper extends the advantages of surveillance cameras to monitor the occupants' presence. A static camera is used by the system, as in [2], [8]. Morris and Trivedi applied an omnidirectional camera [9] in their system. Other researchers [1], [3], [7] used stereo cameras to reduce the effect of lighting intensity and occlusion. The system in this paper is intended to work in real time and to have the capability to learn the environment automatically from observed behavior.

                     III. TECHNICAL OVERVIEW
    As shown in Fig. 2, with the details in Fig. 3, the input images acquired from the camera are fed into the object segmentation unit to extract the foreground objects. A foreground object is a moving object in the scene, obtained by subtracting the background image from the current image. To model the background image, a Gaussian Mixture Model (GMM) is used. A GMM represents the variation of each background pixel with a set of weighted Gaussian distributions [10], [11], [12], [13]. The first frame is used to initialize the means. A pixel is decided to be background if it falls within a deviation around the mean of any of the Gaussians that model it. The update process, which is performed on the current frame, increases the weight of the Gaussian model that is matched by the pixel. By taking the difference between the current image and the background image, the foreground object is obtained.
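The paper implements its own per-pixel GMM following [10]-[13]; purely as an illustrative stand-in, OpenCV's built-in MOG2 background subtractor produces an equivalent foreground mask. The sketch below (Python; ours) uses a hypothetical input clip and assumed parameter values.

import cv2

cap = cv2.VideoCapture("office.avi")          # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=False)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that do not fit any of their weighted Gaussians are
    # labeled foreground; the model weights are updated every frame.
    fg_mask = subtractor.apply(frame)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == 27:           # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()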
    After that, the foreground object is converted from an RGB color image to a gray-level image [13]. The edges of the objects in the gray-level image are extracted by applying an edge detector. The edge detector uses a moving-frame algorithm with four steps. In step one, the gray-level image (I) is shifted in eight directions using a fixed distance in pixel units (dx and dy), resulting in eight
images with offsets to the right, left, up, down, up-right, up-left, down-right, and down-left, respectively. These eight shifted images are called moving-frame images (F_i):

    F_i(x, y) = I(x + dx_i, y + dy_i).                    (1)

In step two, each moving-frame image is updated (F*_i) by subtracting it from the image frame (I) to get the extended edges:

    F*_i(x, y) = I(x, y) − F_i(x, y).                     (2)

In step three, each moving frame is converted to binary by applying a threshold value (T_F):

    F^T_i(x, y) = f_T(F*_i),                              (3)

    f_T = 1 if F*_i(x, y) ≥ T_F, and 0 otherwise.

Finally, all the moving-frame images are added together; as the result, the edge image (E) is obtained:

    E(x, y) = Σ_i F^T_i(x, y).                            (4)

The edge detector extracts the object while removing weak shadows at the same time, since weak shadows do not have edges. However, strong shadows may occur and create some edges. Strong edges appearing between the legs can still be tolerated since the system does not consider the occupant's contour.
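To make the four steps concrete, here is a small sketch (Python/NumPy; ours, not from the paper). The shift distance, the threshold value, and the wrap-around border handling of np.roll are simplifying assumptions, and the comparison in step three reconstructs an operator lost in the original typesetting.

import numpy as np

def moving_frame_edges(gray, d=1, t_f=30):
    # Step 1: shift the gray image in eight directions by d pixels
    # (np.roll wraps at the border, a simplification of eq. (1)).
    offsets = [(-d, -d), (-d, 0), (-d, d), (0, -d),
               (0, d), (d, -d), (d, 0), (d, d)]
    gray = gray.astype(np.int32)
    edges = np.zeros_like(gray)
    for dy, dx in offsets:
        shifted = np.roll(np.roll(gray, dy, axis=0), dx, axis=1)  # F_i, eq. (1)
        diff = gray - shifted                                     # F*_i, eq. (2)
        edges += (diff >= t_f).astype(np.int32)                   # f_T, eq. (3)
    return edges                                                  # E, eq. (4)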
    The result of the edge detection process is refined using morphology filters [13]. A dilation filter is applied twice to join the edges, and an erosion filter is applied once to remove noise. The last step in the object segmentation unit is connected-component labeling, which is used to detect connected regions. A connected region is called a blob. In the object segmentation unit, the GMM, the gray-level conversion, the edge detector, and the morphology filters are implemented with the OpenCV library, while the connected-component labeling is implemented with the cvBlob library.

    Fig. 3. The detail of the object segmentation unit and the tracking unit.

    The blob that represents the foreground object may be broken due to occlusion by furniture or because it has a similar color to the background image. Some rules are provided to group the broken blobs. There are three conditions for examining the broken blobs. The first is the intersection distance of the blobs (B_I). The second is the nearest vertical distance of the blobs (B_dy). The third is the angle of the blobs (B_A) from their centroids. B_dy and B_A are calculated using (5), while B_I is explained in [14]:

    B_dy = min(B_i.y, B_j.y),     B_A = ∠(c_i, c_j),      (5)

where B_i.y and B_j.y are the y-coordinates of blob i and blob j, respectively, and c_i and c_j are the centroids of the blobs. If the three conditions in (6) are all satisfied, the broken blobs are grouped:

    G = 1 if (B_I ≤ T_C) and (B_dy ≤ T_D) and (B_A ≤ T_A), and 0 otherwise.      (6)

T_C, T_D, and T_A are the threshold values for the intersection distance, the nearest vertical distance of the blobs, and the angle of the blobs, respectively. In the experiments, T_C is 0 pixels, T_D is 50 pixels, and T_A is 30°.
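A direct transcription of the grouping rule (Python; ours), with the thresholds from the text; the comparison operators in (6) were lost in extraction and are reconstructed here as "less than or equal to".

def should_group(b_i, b_dy, b_a, t_c=0, t_d=50, t_a=30):
    # Eq. (6): broken blobs are grouped only when the intersection
    # distance, vertical distance, and angle all stay within their
    # thresholds (0 px, 50 px, and 30 degrees in the experiments).
    return 1 if (b_i <= t_c and b_dy <= t_d and b_a <= t_a) else 0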
    After the broken blobs are grouped into one, a motion detector tests whether the blob is an occupant or not. The blob is an occupant if its size looks like a human and the blob has movement. The minimum size of a human is an approximation relative to the image size. X-axis displacement and optical flow [13] are used to detect the movement of the blob. If a blob is detected as an occupant, the tracking unit assigns it a unique identification (ID) number and a track. A track is an indicator that a blob is an occupant, and it is represented by a bounding box. Tracking rules are implemented as states to handle each event. There are five basic states: the entering state, person state, sitting state, standing-up state, and leaving state. During tracking, the occlusion problem may occur, so two more states are added: the merge state and the split state. In the tracking unit, the optical flow is implemented with the OpenCV library, while the tracking rules employ the cvBlob library.
    The learning phase is activated if the map has not been constructed yet. The sitting state in the tracking unit triggers the learning phase to locate the occupant's sitting area. After a defined time, the learning phase finishes its job and the monitoring phase is activated. In this phase, the sitting state and the standing-up state in the tracking unit trigger the monitoring phase to generate reports. The reports tell when the occupants sat and left.
    The system is evaluated by testing it with several video clips covering two scenarios. Five occupants are asked to enter the scene. They sit, stand up, leave the scene, and sometimes cross each other.
                         IV. TRACKING
    This section describes the tracking rules in the tracking unit (Fig. 3). The tracking rules keep track of the occupants during their appearance in the scene based on the information (features) from the previous frame. The tracking rules are represented by states. The basic tracking states are shown in Fig. 4. There are five states:

    Fig. 4. Basic tracking states.

•   Entering state (ES): an incoming blob that appears in the scene for the first time is marked with the entering state. This state also receives information from the motion detector to decide whether the incoming blob is an occupant or noise. If the incoming blob is considered noise and it remains there for more than 100 frames, the system deletes it, for instance when the size is too small because of shadows. To erase the noise from the scene, the system re-initializes the Gaussian model in the noise region so that the noise is absorbed into the background image. An incoming blob is classified as an occupant if it has motion for at least 20 consecutive frames and the height of the blob is more than 60 pixels.
•   Person state (PS): if the incoming blob is detected as an occupant, a unique identification (ID) number and a bounding box are attached to this blob. A blob that is detected as an occupant is called a track, and the system adds this track to the tracking list.
•   Sitting state (IS): detects if the occupant is sitting. A sitting occupant can be assumed if there is no movement from the occupant for a defined time. In the experiments, an occupant is sitting when the x-axis displacement is zero for 20 frames and the velocity vectors from the optical flow result are zero for 100 frames, continuously.
•   Standing-up state (US): detects when a sitting occupant starts to move to leave his/her desk. In the experiments, a standing-up occupant is detected when the sitting occupant produces movements, the height increases above 75%, and the size changes to 80%–140% of the size of the current bounding box.
•   Leaving state (LS): deletes the occupant from the list. A leaving occupant is detected when the occupant moves to the edge of the scene and the occupant's track loses its blob for 5 frames.

A. Tracking Features
    The system tries to match every detected occupant in the scene from frame to frame. This is done by matching the features of the occupant. Four features (centroid, ground position, color, and size) are used for tracking. Fig. 5 shows the illustration of a blob (the connected region of the occupant object in the current frame), a track (a connected blob that is considered an individual occupant, surrounded by a bounding box), the size (number of blob pixels, or area density), the centroid (center of mass), and the ground position (foot position of the occupant).

    Fig. 5. An occupant in the scene and the features.

    The first feature is the centroid. The centroid is used to associate an object's location in the 2-D image between two consecutive frames by measuring the distance between centroids. Fig. 6 shows two objects being associated: one object is already defined as a track in the previous frame (t−1), and another object appears in the current frame (t) as a blob. Each object has a centroid (c). The two objects are measured [14] in the following way. If one of the centroids is inside the other object (the boundary of each object is defined as a rectangle), the returned distance value is zero. If both centroids lie outside the boundary of the other object, the returned distance value is the distance from the nearer centroid to the opponent boundary. A threshold value (T_C) is set. When the distance is below T_C, the two objects are considered the same object, and the track position is updated to the blob position. If the distance criterion is not satisfied, the two objects are not correlated with each other; it could be that the previous track loses its object in the next frame while a new object appears at the same time. A track that misses the tracking is handled in the leaving state (LS), and a new object that appears in the scene is handled in the blob state (BS).

    Fig. 6. Centroid feature to check the distance in 2D.
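The rectangle-aware centroid distance described above can be sketched as follows (Python/NumPy; ours, not from the paper); the threshold value used for this feature is not given in the text, so the value below is only a placeholder.

import numpy as np

def point_rect_dist(c, rect):
    # Distance from point c = (x, y) to an axis-aligned rectangle
    # rect = (x0, y0, x1, y1); zero when the point lies inside.
    dx = max(rect[0] - c[0], 0, c[0] - rect[2])
    dy = max(rect[1] - c[1], 0, c[1] - rect[3])
    return float(np.hypot(dx, dy))

def centroid_distance(track_c, track_rect, blob_c, blob_rect):
    # Zero if either centroid falls inside the other object's
    # rectangle, otherwise the nearer centroid-to-boundary distance.
    return min(point_rect_dist(track_c, blob_rect),
               point_rect_dist(blob_c, track_rect))

def same_object(track_c, track_rect, blob_c, blob_rect, t_c=20):
    # The track is updated with the blob position when the distance
    # falls below the threshold (the value 20 px is an assumption).
    return centroid_distance(track_c, track_rect, blob_c, blob_rect) < t_c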
    The second feature is the ground position. It is possible that two objects are not the same object but their centroids are
lying inside each other's boundaries. Fig. 7 shows this problem. There are two occupants in the scene: one occupant is sitting while the other is walking through behind. In the 2-D image (left), the two objects overlap each other. However, it is clear that the walking occupant should not be confused with the sitting occupant. To solve this problem, the ground position is used to associate the object's location in 3-D between two consecutive frames. The ground position feature eliminates the error of an object being updated with another object even though they overlap each other. The occupant's foot location is used as the ground position. A fixed uniform elliptical boundary (25 pixels and 20 pixels for the major and minor axes, respectively) around the ground position is set to indicate the maximum allowable range for the same person to move. In the real scene, this pixel area is equal to 40 centimeters square for the object nearest the camera up to 85 centimeters square for the object furthest from the camera. This wide range is caused by using a uniform elliptical distance for all locations in the image.

    Fig. 7. Ground position feature to check the distance in 3D. Blob and track in the processing stage (left). View in the real image (right).

    The third feature is color. The color feature is used to represent the color information of the occupant's clothing and helps to separate the objects in case of occlusion. A three-dimensional RGB color histogram is used. Let b_k be the bin that counts the number of pixels that fall into the same category and n be the total number of bins; the histogram H^{R,G,B} of occupant i satisfies the following condition:

    H_i^{R,G,B} = Σ_{k=1..n} b_k.                         (7)

The histograms H^{R,G,B} are calculated on the masked image and then normalized. The masked image, shown in Fig. 8, is obtained from the occupant's object and the blob with an AND operation. The method for matching the occupants' histograms is the correlation method. In the experiments, 10 bins per color channel are chosen. The histogram matching procedure uses a threshold value of 0.8 to indicate that the compared histograms are sufficiently matched.

    Fig. 8. Color feature is calculated on the masked image.
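A sketch of the masked color histogram and its correlation matching (Python with OpenCV; ours, not the authors' code). OpenCV's calcHist and compareHist are used as stand-ins, OpenCV images are in BGR order rather than the RGB order of the text, and the mask is assumed to be an 8-bit single-channel image.

import cv2
import numpy as np

def masked_rgb_hist(frame_bgr, mask, bins=10):
    # Eq. (7): a 10-bin-per-channel color histogram computed only on
    # the masked (occupant AND blob) pixels, then normalized.
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], mask,
                        [bins, bins, bins],
                        [0, 256, 0, 256, 0, 256])
    return (hist / (hist.sum() + 1e-12)).astype(np.float32)

def same_appearance(hist_a, hist_b, threshold=0.8):
    # Correlation matching; 0.8 is the acceptance threshold from the text.
    score = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)
    return score >= threshold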
    The fourth feature is size. The size feature is used to match an object between two consecutive frames based on the pixel density, which is the blob itself, as shown in Fig. 9. The allowable change in size in the next frame is set to ±20% of the previous size. Let p(x', y') be a pixel location of an occupant in the binary image. The size feature of object i is calculated as follows:

    s_i = Σ_{x', y'} p(x', y').                           (8)

    Fig. 9. Size feature of occupant.

B. Merge-Split Problem
    A challenging situation may occur: while the occupants are walking in the scene, they cross each other and cause occlusion. Since the system keeps tracking each occupant in the scene, it is necessary to extend the tracking states of Fig. 4. Two states are added for this purpose: the merge state (MS) and the split state (SS). Fig. 10 shows the extended tracking states. Merges and splits can be detected using a proximity matrix [14]. Objects are merged when multiple tracks (in the previous frame) are associated with one blob (in the current frame). Objects are split when multiple blobs (in the current frame) are created from one track (in the previous frame). In the merge condition, only the centroid feature is used to track the next possible position, since the other three features are not useful when objects merge. After a group of occupants splits, their colors are matched to their colors just before they merged.

    Fig. 10. Extended tracking states.

    In the experiments, when more than two occupants split, sometimes an occupant remains occluded and splits off later. When the occluded occupant splits, the system re-identifies each occupant and restores
their previous ID numbers from just before they merged. Fig. 11 shows the algorithm used to handle the occlusion problem.

    Fig. 11. Merge-split algorithm with occlusion handling.
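A minimal sketch (Python/NumPy; ours) of merge and split detection with a proximity matrix built from bounding-box overlap; the paper follows [14], so the overlap test here is only an assumed proximity measure.

import numpy as np

def overlap_area(a, b):
    # Intersection area of two axis-aligned boxes (x0, y0, x1, y1).
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def merge_split_events(tracks, blobs):
    # Proximity matrix between previous-frame tracks and current-frame
    # blobs: a blob touched by several tracks signals a merge, and a
    # track touched by several blobs signals a split.
    if not tracks or not blobs:
        return [], []
    m = np.array([[overlap_area(t, b) > 0 for b in blobs] for t in tracks])
    merged_blobs = [j for j in range(len(blobs)) if m[:, j].sum() > 1]
    split_tracks = [i for i in range(len(tracks)) if m[i, :].sum() > 1]
    return merged_blobs, split_tracks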
                                                                           sitting location with the sitting area in the map. If the
             V. LEARNING AND MONITORING PHASES
    This section introduces how the learning phase and the monitoring phase work. These phases are derived from the tracking unit, namely from the sitting state and the standing-up state in the tracking rules. At the beginning, the system activates the learning phase. Triggered by the sitting event, the learning phase starts to construct the map. When the given time interval has passed, the learning phase is stopped and a map has been constructed. The system then switches to the monitoring phase to report the occupants' attendance based on when they sit down at and stand up from their seats.

A. Learning Phase
    The learning phase is derived from the sitting state in the tracking rules. The output of the learning phase is a map consisting of the occupants' sitting areas. From Fig. 4, the information about when an occupant sits is extracted from the sitting state (IS). When an occupant is detected as sitting, the system starts counting. After a certain counting period, the location where the occupant sits is determined to be a sitting area. The counting period is used as a delay that makes sure the occupant sits for long enough; in the experiments, the delay is defined as 200 frames. Ideally, the learning phase is considered finished after all of the sitting areas have been found. In this paper, to show that the learning phase does its job, the occupants enter the scene and sit one by one without causing occlusion. The scenario for this demonstration is arranged so that after 10 minutes the map is expected to be completely constructed and the learning phase has finished its job; the system is then switched to the monitoring phase. In a real situation, the delay and the duration of the learning phase can be adjusted.

B. Monitoring Phase
    The monitoring phase is derived from the sitting state and the standing-up state in the tracking rules. The monitoring phase generates the reports of the occupants' attendance. It uses the map that has been constructed by the learning phase. From Fig. 4, the sitting state (IS) and the standing-up state (US) trigger the monitoring phase. When an occupant sits, the system tries to match the occupant's current sitting location with a sitting area in the map. If the positions are the same, the system generates a time stamp of the sitting time for that particular sitting area. A time stamp of the leaving time is also generated when the occupant moves out of the sitting area. Fig. 12 shows an example of the report.

    Sitting area number    Event      Time stamp
    1                      Sitting    09:02:09 Wed 2 June 2010
    2                      Sitting    09:07:54 Wed 2 June 2010
    3                      Sitting    09:12:16 Wed 2 June 2010
    2                      Leaving    10:46:38 Wed 2 June 2010
    2                      Sitting    10:49:54 Wed 2 June 2010
    3                      Leaving    12:46:38 Wed 2 June 2010

    Fig. 12. A report example.
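To illustrate the report generation, a toy sketch (Python; ours, not the authors' implementation): the sitting-area coordinates, the matching radius, and the output format are assumptions loosely following Fig. 12.

import time

def log_event(sitting_areas, foot_xy, event, radius=25):
    # Match the occupant's sitting (or leaving) location against the
    # learned sitting areas and emit a time-stamped report line.
    for area_id, (ax, ay) in sitting_areas.items():
        if (foot_xy[0] - ax) ** 2 + (foot_xy[1] - ay) ** 2 <= radius ** 2:
            stamp = time.strftime("%H:%M:%S %a %d %B %Y")
            print(f"{area_id}\t{event}\t{stamp}")
            return area_id
    return None

# Example: areas learned by the learning phase (coordinates assumed).
areas = {1: (60, 180), 2: (140, 185), 3: (220, 190)}
log_event(areas, (142, 183), "Sitting")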
      VI. APPLICATION TO INTELLIGENT ATTENDANCE LOGGING SYSTEM
    This paper demonstrates the use of a surveillance camera as an intelligent attendance logging system. As mentioned earlier, the system works like a time recorder: it assists in tracking the hours of occupant attendance. With this system, the occupants do not need to carry a special tag or badge. In this section, the environment setup, results, and discussion are described.

A. Environment Setup
    A static network camera is used to capture the images of the scene. It is an HLC-83M, a network camera produced by Hunt Electronic. The image size taken from the camera is 320 × 240 pixels. The test room is in our laboratory. The camera is placed about 6.24 meters away, 2.63 meters high, and facing down 21° from the horizontal line. The occupants' desks and the camera view are orthogonal to get the best view. There are 5 desks as the ground truth.
    The room has indoor lighting from fluorescent lamps, and the windows are covered so that sunlight cannot come into the room during the test.

B. Results and Discussion
    A Visual C++ and OpenCV platform on an Intel® Core™2 Quad CPU at 2.33 GHz with 4 GB of RAM is used to implement the system. Both offline and online methods are allowed. In a scene without any detected objects, the system ran at 16 frames per second (fps). When the number
of incoming objects increases, the lowest speed achieved is 8 fps.
    The algorithm was tested with two types of scenarios. The first scenario consists of sitting occupants with no occlusion (Fig. 13) and demonstrates the operation of the learning phase. The second scenario is the same as the first, except that the occupants are allowed to cross each other and create occlusions (Fig. 14); it demonstrates the merge-split handling.

Fig. 13. Scenario type 1: the system builds the map. The current images (left) and the map shown as filled rectangles (right).

Fig. 14. Scenario type 2: the map of 3 desks has been completed. The occupants cross each other and the system can handle this situation.

    Table 1 shows the test results of scenario type 1. There are 5 desks as ground truth (Fig. 1). Five occupants enter the scene; they sit, stand up, and leave the scene one by one without causing any occlusion. The order in which the occupants enter and leave is arranged: the occupants occupy the desks from desk number 5 (the rightmost desk) to desk number 1 (the leftmost desk), and they leave from desk number 1 to desk number 5. This order ensures that no occupant walks behind a sitting occupant. The scenario was repeated 10 times. The results show no problem for desk numbers 2, 3, and 4. However, there are some cases in which the system failed to locate the occupants' sitting areas. For desk number 1, the occupant's blob sometimes merges with that of a neighboring occupant, so the system cannot detect or track the occupant who sits at desk number 1. For desk number 5, the occupant's color was similar to the color of the background image, which produced a small blob; the system cannot track the occupant because the blob becomes too small.

Table 1. Test results of scene type 1: the number of seats detected by the system over 10 experiments.
                               Desk number
  Sitting area      #1      #2      #3      #4      #5
  Detected           7      10      10      10       8
  Missed             3       0       0       0       2

    Table 2 shows the test results of scene type 2. The system monitored the occupants based on the map that had been found. The experiments were run 10 times without occlusion. There are some cases in which the system failed to recognize the sitting occupant, for the same reasons discussed above: the system lost track of an occupant whose color is similar to the background image, so that the occupant suddenly produces a small blob. The system also failed to recognize the leaving event at desk number 1. The system detects a leaving occupant when the occupant's blob splits from his/her seat; since desk number 1 does not provide enough space for the system to observe this split, the system kept reporting desk number 1 as occupied even after the corresponding occupant had left.

Table 2. Test results of scene type 2: the number of successful monitoring results (out of 10 experiments) without occlusion.
                               Desk number
  Occupant          #1      #2      #3      #4      #5
  Sitting            9      10      10      10       9
  Leaving            0       9      10      10       9

    Table 3 shows the test results of scene type 2 with occlusion. The experiments were run 10 times, and the system should be able to keep tracking the occupants. To test the system, three occupants enter the scene and act out the scenarios listed in Table 3: some occupants walk behind the sitting occupant, or the occupants simply walk and cross each other. In most cases, the system can detect which is which after they split. The errors happened because of the occupants' colors and the behavior of the sitting occupant. If the occupants have similar colors, the system may confuse them when telling them apart. In another case, the sitting occupant made a movement and created a blob, but the system did not yet have enough evidence to change that occupant's status from sitting to standing up; another occupant then walked closer and merged with this blob. After they split, the system was confused because the blob had no previous information. As a result, the system miscounted the previously merged track, and the ID number of the occupant was restored incorrectly.

Table 3. Test results of scene type 2: the number of occupants mistakenly assigned in the merge-split case over 10 merges.
  Number of    Sitting      Walking               Split
  occupants    occupants    occupants    Merge    Succeeded    Failed
      2            0            2         10          9           1
      2            1            1         10          9           1
      3            0            3         10          8           2
      3            1            2         10          9           1
      3            2            1         10          9           1

                        VII. CONCLUSIONS
    We have designed an intelligent attendance logging system by integrating open-source components with additional algorithms. The system works in two phases, a learning phase and a monitoring phase, and achieves real-time performance of up to 16 fps. We also demonstrate that the system can handle occlusion of up to three occupants; the scene becomes too crowded for more than three occupants. While a regular time recorder only reports the time stamps of the beginning and the end of an occupant's working hours, this system provides more detailed timing information. Some unexpected behavior may cause errors, for instance when the occupant's color is similar to the background, when the desk position is unfavorable, or when the occupant moves while
sitting.
    In the future, the events generated by this system can be used to deliver messages to other systems. It is possible to control the environment automatically, such as adjusting the lighting, playing relaxing music, or setting the air conditioner when an occupant enters or leaves the room. The summary report of the occupants' attendance can also be used for activity analysis. The current system does not include recognition capability, since it only detects whether a working desk is occupied or not. However, if occupant recognition is needed, there are two options: after the map of sitting areas is found, the user may label each sitting area manually, or a recognition system can be added.

                               REFERENCES
[1]    B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafers,
       “EasyLiving Technologies for Intelligent Environments,”        Lecture
       Notes in Computer Science, Volume 1927/2000, pp. 97-119, 2000.
[2]    S. -L. Chung and W. –Y. Chen, “MyHome: A Residential Server for
       Smart Homes”, Lecture Notes in Computer Science (including
       subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
       Bioinformatics) 4693 LNAI (PART 2), pp. 664-670, 2007.
[3]    Z. Zhou, X. Chen, Y. –C. Chung, Z. He, T. X. Man, and J. M. Keller,
       “Activity analysis, summarization, and visualization for indoor
       human activity monitoring,” IEEE Transactions on Circuits and
       Systems for Video Technology 18 (11), art. no. 4633633, pp. 1489-
       1498, 2008.
[4]    Wikipedia, "Time Clock," http://en.wikipedia.org/wiki/Time_clock
       (June 24, 2010).
[5]    OpenCV. Available: http://sourceforge.net/projects/opencvlibrary/
[6]    cvBlob. Available : http://code.google.com/p/cvblob/
[7]    D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell,
       “Activity maps for location-aware computing,” Proceedings of the
       Sixth IEEE Workshop on Applications of Computer Vision (WACV),
       pp. 70-75, 2002.
[8]    A. Girgensohn, F. Shipman, and L. Wilcox, “Determining Activity
       Patterns in Retail Spaces through Video Analysis,” MM'08 -
       Proceedings of the 2008 ACM International Conference on
       Multimedia, with co-located Symposium and Workshops , pp. 889-
       892, 2008.
[9]    B. Morris and M. Trivedi, “An Adaptive Scene Description for
       Activity Analysis in Surveillance Video,” 2008 19th International
       Conference on Pattern Recognition, ICPR 2008 , art. no. 4761228,
       2008.
[10]   A. Bayona, J.C. SanMiguel, and J.M. Martínez, “Comparative
       evaluation of stationary foreground object detection algorithms based
       on background subtraction techniques,” 6th IEEE International
       Conference on Advanced Video and Signal Based Surveillance, AVSS
       2009 , art. no. 5279450, pp. 25-30, 2009.
[11]   S. Herrero and J. Bescós, “Background subtraction techniques:
       Systematic evaluation and comparative analysis” Lecture Notes in
       Computer Science (including subseries Lecture Notes in Artificial
       Intelligence and Lecture Notes in Bioinformatics) 5807 LNCS, pp. 33-
       42, 2009.
[12]   P. KaewTraKulPong and R. Bowden, “An Improved Adaptive
       Background Mixture Model for Real-time Tracking with Shadow
       Detection,” Proc. 2nd European Workshop on Advanced Video Based
       Surveillance Systems, AVBS01, 2001
[13]   G. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision
       with the OpenCV Library,” Sebastopol, CA: O'Reilly Media, 2008.
[14]   A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, and R.
       Bolle, "Appearance models for occlusion handling," Image and
       Vision Computing 24 (11), pp. 1233-1243, 2006.




i-m-Walk: Interactive Multimedia Walking-Aware System


     1Meng-Chieh Yu(余孟杰), 2Cheng-Chih Tsai(蔡承志), 1Ying-Chieh Tseng(曾映傑), 1Hao-Tien
  Chiang(姜昊天), 1Shih-Ta Liu(劉士達), 1Wei-Ting Chen(陳威廷), 1Wan-Wei Teo(張菀薇), 2Mike Y.
              Chen(陳彥仰), 1,2Ming-Sui Lee(李明穗), and 1,2Yi-Ping Hung(洪一平)

                  1 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
                  2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan



                          Abstract
    i-m-Walk is a mobile application that uses pressure sensors in shoes to visualize the phases of footsteps on a mobile device, in order to raise the user's awareness of his or her walking behaviour and to help improve it. As an example application of slow technology, we used i-m-Walk to help beginners learn "walking meditation," a type of meditation in which users aim to take each pace as slowly as possible and to land every footstep with the toes first. In our experiment, we asked 30 participants to learn walking meditation over a period of 5 days; the experimental group used i-m-Walk from day 2 to day 4, and the control group did not use it at all. The results showed that i-m-Walk effectively assisted beginners in slowing down their pace and decreasing the error rate of paces during walking meditation. This study may therefore be of importance in providing a mechanism that helps users better understand their pace and improve their walking habits. In the future, i-m-Walk could be used in other applications, such as walking rehabilitation.

Keywords: Smart Shoes, Walking Meditation, Visual Feedback, Slow Technology

                     1.    INTRODUCTION
    Walking is an integral part of our daily lives in terms of transportation as well as exercise, and it is a basic exercise that can be done everywhere. In recent years, with the rapid growth of smartphones, many research projects have studied walking-related human-computer interfaces on mobile phones. For example, researchers have evaluated walking user interfaces for mobile devices [9] and proposed minimal attention user interfaces to support ecologists in the field [21]. In addition, several walking-related systems have been developed to help people walk and run. Nike+ uses footstep sensors attached to users' shoes to adjust the playback speed of music while running and to track running-related statistics such as time, distance, pace, and calories burned [16]. adidas used an accelerometer to detect the footsteps of the runner and provides running information audibly [31]. Wii Fit uses a balance board to detect the user's center of gravity and offers several games, such as yoga, gymnastics, aerobics, and balancing [18]. In addition, walking is an important factor in our health. For example, it is one of the earliest rehabilitation exercises and an essential exercise for elders [5], and improper foot pressure distribution can contribute to various types of foot injuries. In recent years, ambient light and biofeedback have been widely used in rehabilitation and healing, and the concept of "slow technology" was proposed. Slow technology aims to use slowness in learning, understanding, and presence to give people time to think and reflect [30]. Meditation is one example of slow technology, and "walking meditation" is an important form of meditation. Although many research projects have focused on meditation, showing benefits such as enhanced synchronization of neuronal excitation [11] and an increased concentration of antibodies in blood after vaccination [3], most projects have focused on meditation while sitting. In order to better understand how users walk in a portable way, we have designed i-m-Walk, which uses multiple force sensitive resistor sensors embedded in the soles of shoes to monitor users' pressure distribution while walking. The sensor data are wirelessly transmitted over ZigBee, and then relayed over Bluetooth to be analyzed in real time on smartphones. Interactive visual feedback can then be provided via the smartphones (see Figure 1).
    In this paper, in order to develop a system that can help users improve their walking habits, we use the training of walking meditation as an example application to evaluate the effectiveness of i-m-Walk. Traditional training of walking meditation demands one-on-one instruction, and there is no standardized evaluation after training. It is therefore challenging for beginners to self-learn walking meditation without feedback from trainers.
Figure 1. A participant using i-m-Walk during walking meditation.

    We designed experiments to test the effect of training with i-m-Walk during walking meditation. Participants were asked to do a 15-minute practice of walking meditation for five consecutive days. During the experiment, participants using i-m-Walk were shown real-time pace information on the screen. We wanted to test whether this could help participants raise the awareness of their walking behaviour and improve it. We proposed two hypotheses: (a) i-m-Walk could help users walk more slowly during walking meditation; (b) i-m-Walk could help users walk correctly according to the method of walking meditation.
    This paper is structured as follows. The first section introduces the walking system. The second section reviews walking detection methods and multimedia-assisted walking applications. This is followed by an introduction to walking meditation. The fourth section describes the system design, after which the experimental design is presented. The results of the various analyses are presented following each of these descriptive sections. Finally, the discussion and conclusion are presented and suggestions are made for further research.

                    2.   RELATED WORKS

2.1 Methods of Walking Detection
    In the past decade, there has been much research on intelligent shoes. The first concept of wearable computing and smart clothing systems included intelligent clothes, glasses, and intelligent shoes; the intelligent shoes could detect the walking condition [12]. Later work used pressure sensors and gyro sensors to detect foot posture, such as heel-off, swing, and heel-strike [22], and another study embedded pressure sensors in the shoes to detect the walking cycle, with a vibrator equipped to assist walking [26]. Besides, there are many other methods for walking detection, such as bend sensors [15], accelerometers [2], ultrasound [29], and computer vision technology [24] to analyze footsteps.

2.2 Multimedia-Assisted Walking Application
    Several studies have used multimedia feedback and walking detection techniques to help people in monitoring or training applications in daily life. In dance training, an intelligent shoe was built that detects the timing of footsteps and plays music to help beginners learn ballroom dancing; if it detected missed footsteps while dancing, it showed warning messages to the user. The device emphasizes the acoustic element of the music to help the dancing couple stay in sync with the music [4]. Another dance application detected dancers' paces and applied them in interactive music for dance performance [20]. For musical tempo and rhythm training for children, a system was built that can write out the music on a timeline along the ground, where each footstep activates the next note in the song [13]. Besides, visual information has been used to adjust foot trajectory during the swing phase of a step when stepping onto a stationary target [23].
    In psychological applications, there are experiments related to walking perception. In an application assisting the walking of stroke patients, lighted targets were projected onto the left and right sides of a walkway, and stroke patients could follow the lighted targets to carry out their steps. The results pointed out that stroke patients might effectively be helped by using vision and hearing as guidance [14]. An fMRI study of multimedia-assisted walking showed increased activation during visually guided, self-generated ankle movements, suggesting that multimedia-assisted walking has a profound effect on people [1]. In the related application of walking in entertainment, Personal Trainer - Walking [17] detects users' footsteps through an accelerometer and encourages users to walk through interesting and interactive games. In the healthcare field, a system applied the concept of intelligent shoes to detect the walking stability of the elderly and thus prevent falls [19]; the system monitored walking behaviours and used a fall-risk estimation model to predict the future risk of a fall. Another application used an electromyography biofeedback system for stroke and rehabilitation patients, and the results showed recovery of foot-drop in the swing phase after training [8].

                  3.   WALKING MEDITATION
    The practice of meditation has several different forms and postures, such as meditation while standing, sitting, walking, or lying down on the back. Compared to sitting meditation, people tend to feel less dull, tense, or easily distracted in walking meditation. In this paper, we focus on meditation while walking, which is also called walking meditation. Walking meditation is a way to align the feeling inside and outside of the body, and it helps people focus and concentrate on their mind and body. Furthermore, it can also deepen our knowledge and wisdom.
Figure 2. Six phases of each footstep in walking meditation [25].

    The method of walking meditation aims to take each pace as slowly as possible and to land each pace with the toes first. The participants can focus on the movement of walking, from raising, lifting, pushing, lowering, stepping, to pressing (Figure 2). The participants should also be aware of the movement of the feet in each stage; it is important to stay aware of the sensation of the feet. As a result, continued practice of walking meditation is an effective way to develop concentration and maintain tranquillity in participants' daily lives. Furthermore, it can also help participants become calmer, so that their minds can be still and peaceful. Long-term practice of walking meditation benefits people by increasing patience, enhancing attention, overcoming drowsiness, and leading to a healthy body [6]. In order to help beginners learn the walking method of walking meditation, the i-m-Walk system was developed.

                     4.    SYSTEM DESIGN
    i-m-Walk includes a pair of intelligent shoes for detecting paces, a ZigBee-to-Bluetooth relay, and a smartphone for walking analysis and visual feedback. Three force sensitive resistor sensors are fixed underneath each shoe insole and send pressure data through the relay. We implemented the analysis and visual feedback application on an HTC HD2 smartphone running Windows Mobile 6.5, which has a 4.3-inch LCD screen. An overview of the system is shown in Figure 3.

Figure 3. System structure of i-m-Walk: each shoe module (force sensors, microcontroller, and XBee transmitter) sends data to an XBee-to-Bluetooth relay, and the smartphone performs footstep detection, stability analysis, and visual feedback.

4.1 i-m-Walk Architecture
    The shoe module is based on Atmel's high-performance, low-power 8-bit AVR ATmega328 microcontroller and transmits sensing values wirelessly through a 2.4 GHz XBee 1 mW chip-antenna module. The module size is 3.9 cm x 5.3 cm x 0.8 cm with an overall weight of 185 g (Figure 4), including an 1800 mAh lithium battery that allows continuous use for 24 hours. We kept the hardware small and lightweight in order not to affect users while walking.
    We use force sensitive resistor sensors to detect the pressure distribution of the feet while walking. The sensing area of each sensor is 0.5 inch in diameter, and the sensor changes its resistance depending on how much pressure is applied to the sensing area. In our system, the intelligent shoes detect the walking speed and the walking method during walking meditation. According to the recommendations of orthopaedic surgery, we fixed three force sensitive resistor sensors underneath each shoe insole, at the three main weight-bearing areas: the structural bunion, the Tailor's bunion, and the heel (see Figure 4). The shoe module is mounted on the outside of the shoes (see Figure 5). With a fully charged battery, the pressure sensing modules can be used continuously for 24 hours, and a power button can switch the module off when the module is not being used.

Figure 4. Sensing module: the micro-controller and wireless module (right), and one of the insoles with three force sensitive resistor sensors (left).

Figure 5. Sensing shoes: the sensing module attached to the shoes.
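As a rough illustration of the data path described above (not the authors' implementation), the following Python sketch reads the six force-sensor values forwarded by the XBee-to-Bluetooth relay; the serial port name, baud rate, and the comma-separated packet format are assumptions made for the example.

    import serial   # pyserial

    PORT = "/dev/rfcomm0"   # assumed Bluetooth serial port exposed by the relay
    BAUD = 115200           # assumed baud rate

    def read_samples(port=PORT, baud=BAUD):
        # Yield one 6-tuple per packet:
        # (left_toe1, left_toe2, left_heel, right_toe1, right_toe2, right_heel)
        with serial.Serial(port, baud, timeout=1) as link:
            while True:
                line = link.readline().decode("ascii", errors="ignore").strip()
                fields = line.split(",")
                if len(fields) != 6:
                    continue            # skip malformed or partial packets
                try:
                    yield tuple(float(v) for v in fields)
                except ValueError:
                    continue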
4.2 Walking detection
    There are many methods for walking detection, and they differ according to the application. In our system, we use three pressure sensors in each shoe and thus obtain six sensing values at a sample rate of 30 times per second. In order to detect whether the user lands each pace with the toes first or the heel first, we divide each shoe into two parts, a toe part and a heel part. The sensing value of the toe part is the average of the two force sensors underneath the structural bunion and the Tailor's bunion, and the sensing value of the heel part is that of the force sensor underneath the heel. The system therefore divides the sensing area into a toe part and a heel part in each shoe, four parts in total per person. We then use a threshold method: at the moment a part's sensing value drops below the threshold value, that part is activated. We define the beginning of each gait cycle as the moment the heel part is lifted, and the end of the gait cycle as the moment the other foot's heel part rises; the previous cycle stops on one foot and the other foot begins a new gait cycle. Figure 6 shows an example. In this case, when the heel of the left foot rose at 5 seconds, its sensing value fell below the threshold and our system detected the left-foot rise at that moment; at the same time, the user's right foot was stepping down. Conversely, when the heel of the right foot rose at 10.7 seconds, its sensing value fell below the threshold and our system detected the right-foot rise at that moment.

Figure 6. Signal processing of the walking signals. The blue line indicates the sensed weight (kg) of the heel and the green line indicates the sensed weight of the toe. The red line marks the threshold for detecting the landing event, and the gray blocks indicate which foot is landing.
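The threshold rule above can be sketched as follows; this is an illustrative Python fragment, not the authors' code, and the threshold value and the ordering of the six sensor values are assumptions.

    SAMPLE_RATE = 30        # samples per second, as stated in Section 4.2
    THRESHOLD = 2.0         # assumed lift/land threshold on the sensed weight (kg)

    def foot_parts(sample):
        # Toe value = average of the two forefoot sensors; heel value = heel sensor.
        lt1, lt2, lh, rt1, rt2, rh = sample
        return (lt1 + lt2) / 2.0, lh, (rt1 + rt2) / 2.0, rh

    def heel_lift_events(samples):
        # Yield (time_in_seconds, foot) whenever a heel value drops below the
        # threshold; each such event marks the start of a new gait cycle.
        lifted = {"left": False, "right": False}
        for i, sample in enumerate(samples):
            _, left_heel, _, right_heel = foot_parts(sample)
            for foot, heel in (("left", left_heel), ("right", right_heel)):
                is_lifted = heel < THRESHOLD
                if is_lifted and not lifted[foot]:
                    yield i / SAMPLE_RATE, foot
                lifted[foot] = is_lifted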
4.3 User interface
    Multimedia feedback can be effectively applied in preventive medicine [7], and it can also effectively assist rehabilitation patients in walking [5, 27]. i-m-Walk is developed to assist users in learning the walking method during walking meditation. The user interface of i-m-Walk includes three components: a warning message, pace awareness, and walking speed (see Figure 7). In this section, we describe the user interface and the design principles of our system.

4.3.1 Pace awareness
    The function of pace awareness is to help users be aware of their walking phases and of whether they use correct footsteps during walking meditation. A feet pattern is shown on the smartphone, and a color block shows where the foot's center of gravity is and how much force is applied to the foot in real time. The transparency of the block decreases while the user lands the foot, and increases while the user raises the foot. Besides, the color block moves top-down when the participant lands with the toes first, and bottom-up when the participant lands with the heel first; if the front of the foot lands first, the colour block moves forward to indicate the landing position. In addition, if the user lands a pace with the toes first, the system regards this as the correct walking method for walking meditation and displays the colour block in green. Conversely, when the user lands a pace with the heel first, the system recognizes a wrong walking method and changes the colour block from green to red.
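One way the toe-first rule of Section 4.3.1 could be expressed for a single foot is sketched below; this is illustrative only, and the landing threshold and colour names are assumptions.

    LAND_THRESHOLD = 2.0   # assumed sensed weight (kg) above which a part counts as landed

    def classify_landing(toe_series, heel_series, threshold=LAND_THRESHOLD):
        # Scan one landing and return 'green' if the toe part crosses the
        # threshold before the heel part (correct pace), otherwise 'red'.
        for toe, heel in zip(toe_series, heel_series):
            toe_down = toe >= threshold
            heel_down = heel >= threshold
            if toe_down and not heel_down:
                return "green"   # toes touched down first: correct walking method
            if heel_down and not toe_down:
                return "red"     # heel touched down first: wrong pace
        return "red"             # no clear toe-first contact was observed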
4.3.2 Walking Speed and Warning Message
    During walking meditation, people should stabilize their walking paces at a low speed. The user interface should therefore provide walking speed information in real time and remind the user when the walking speed is too fast. Walking speed and wrong paces can be measured after the walking signals have been processed. The walking speed is then visualized as a speedometer whose indicator points to the value of the walking speed; for example, if the indicator points to the value "30", the user is walking thirty paces per three minutes. The speedometer thus provides a way to remind the user when he or she is walking too fast. According to the pilot study, we set the walking-speed threshold at 40 paces per three minutes. When the walking speed exceeds this threshold, the indicator points to the red area and the screen shows the warning message "too fast" at the top of the screen; the warning message disappears when the walking speed falls below 40 paces per three minutes.

Figure 7: User interface of i-m-Walk. The user interface shows three events: the warning message, the condition of each footstep, and the walking speed. Picture (a) shows that the user used an incorrect walking method with the right foot, and the colour block on the right foot changed to red. Picture (b) shows that the walking speed is too fast (46 steps per three minutes).
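The speedometer and the "too fast" warning of Section 4.3.2 can be sketched as follows; this is illustrative only, and the three-minute sliding window over footstep timestamps is an assumption about how the rate is computed.

    WINDOW_SECONDS = 180.0   # three-minute window
    TOO_FAST = 40            # warning threshold: paces per three minutes

    def walking_speed(step_times, now):
        # Number of detected paces whose timestamps fall within the last three minutes.
        return sum(1 for t in step_times if now - t <= WINDOW_SECONDS)

    def too_fast(step_times, now):
        # True when the warning message "too fast" should be shown on screen.
        return walking_speed(step_times, now) > TOO_FAST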
                  5.   EXPERIMENT DESIGN
    Two experiments were designed to test the effects of i-m-Walk during walking meditation. The first experiment was a pilot study in which we evaluated the effect of visual feedback showing six sensing curves projected on a wall. The second experiment evaluated the effects of i-m-Walk in improving users' walking behaviour.

5.1 Pilot study
    Before testing i-m-Walk, we designed a preliminary study to test the effects of visual feedback displayed on a wall. Eight master students volunteered to participate in this pilot study. The participants' average age was 26.3 (SD=0.52). All participants had experience of sitting meditation, but none of them had experience of walking meditation. There were four participants (three male and one female) in the experimental group (with visual feedback) and four participants (four male) in the control group (without visual feedback). Participants took part for ten minutes each day on three consecutive days. The experiment took place in a seminar room in the faculty building. Before the experiment, participants were taught the methods and principles of walking meditation. The experiment was a 4 × 2 between-participants design. In the experimental group, participants were asked to watch curves showing the feet's sensing values projected on the wall; in the control group, participants walked by themselves without visual feedback. Participants walked straight in the seminar room. The results showed a significant main effect: the experimental group had a lower walking speed than the control group (p<0.05) over the three days. The average number of wrong paces in the experimental group was also lower than in the control group. From this pilot study we drew two preliminary conclusions: (a) visual biofeedback could help users slow down their walking speed during walking meditation; (b) multimedia guidance could usefully help users be aware of their pace during walking meditation and could decrease the number of wrong paces. However, we also observed some issues in the pilot study; one is that the perspective of the projected display changes as the user walks to different locations, which might influence the effect of learning. Based on these results and recommendations, we designed an experiment to evaluate the effect of i-m-Walk.

5.2 User study

5.2.1 Participants
    Thirty master and PhD students in the Department of Computer Science volunteered to participate in this experiment. The participants' average age was 25.2 (SD=3.71). Twenty-seven participants had experience of sitting meditation and three did not; however, none of the participants had experience of walking meditation. 83.3% of the participants carry a mobile phone all the time, and 63.3% of the participants have experience of using a smartphone. There were fifteen participants (eleven male and four female) in the experimental group (with visual feedback) and fifteen participants (eleven male and four female) in the control group (without visual feedback). Because the participants' feet sizes differed, we prepared two pairs of shoes of different sizes, and participants could choose the more comfortable pair to wear.

5.2.2 Location
    Meditating in a quiet and enclosed area makes it easier to bring the mind inward and reach a calm and peaceful state. In this experiment, we selected a corridor in the faculty building as the experimental place for walking meditation. The corridor is a public place in an enclosed area where few people conduct daily activities such as standing, walking, and interacting with one another. The surroundings of the corridor are quiet and comfortable, allowing users to calm their minds. The corridor is thirty meters long and three meters wide, and the temperature was 21~23 degrees Celsius.

5.2.3 Procedure and analysis
    Before the experiment, participants were asked to walk along the corridor at their usual walking speed, and we recorded it. Then, participants were taught the method of walking meditation. The guideline of walking meditation which we provided to the participants was as follows: "Walking meditation is a way to align the feeling inside and outside of the body. You should focus on the movement of walking, from raising, lifting, pushing, lowering, stepping, to pressing. You have to land every footstep with the toes first and then slowly land your heel down. During walking meditation, you should stabilize your walking paces at as low a speed as possible. You have to relax your body from head to toes."
    The experiment was a 15 × 2 between-participants design. Participants took part for fifteen minutes each day on five consecutive days. Table 1 shows the procedure of this experiment. In the experimental group, participants were asked to use i-m-Walk from day 2 to day 4; in the control group, participants walked by themselves without any feedback during walking meditation.

                          DAY 1   DAY 2   DAY 3   DAY 4   DAY 5
  Experimental Group        ○       ●       ●       ●       ○
  Control Group             ○       ○       ○       ○       ○

Table 1: Experimental procedure: ● means that participants had to use i-m-Walk during walking meditation, and ○ means that participants did not have to use i-m-Walk during walking meditation.

    While learning walking meditation, all participants were asked to walk clockwise around the corridor and hold
the smartphone. In the control group, there was no visual feedback on the smartphone, although participants still needed to hold it. In the experimental group, participants were informed that they could choose not to look at the visual feedback when they were well aware of their pace. The participants of the experimental group were asked to complete a questionnaire after the experiments from day 2 to day 4. Besides, we asked all participants about their feelings and impressions after the experiment on day 5. All participants could also write down any recommendations and feelings after the experiment, and we discuss these issues in the discussion section.

5.2.4 Results
    We analyzed the average walking speed and the wrong paces in both the experimental group and the control group. For the average walking speed, Figure 8 shows the average time of each pace for the experimental group and the control group from day 1 to day 5. On day 1 and day 5, all participants practiced walking meditation without using i-m-Walk. T-tests revealed a significant difference (p < 0.005) in the average walking time per footstep between the experimental group and the control group from day 2 to day 4. In the experimental group, the average walking time per footstep increased from 4.5 seconds on day 1 to 10.9 seconds on day 5; in the control group, it increased from 3.2 seconds (day 1) to 5.1 seconds (day 5). The results showed that the participants in the experimental group had a significant main effect (p < .005) of slowing down the walking speed after learning walking meditation, whereas the participants in the control group had no significant main effect (p > .1). These results show that i-m-Walk could help participants slow down their walking speed during walking meditation.

Figure 8: The average time of one footstep for the experimental group and the control group from day 1 to day 5. Error bars show ±1 SE.

    In our experiment, the rule of the correct walking method was that participants should land every footstep with the toes first during walking meditation; if they landed a footstep with the heel first, it was counted as a wrong pace. Figure 9 shows the median values of total wrong paces in the 15-minute practice of walking meditation for the experimental group and the control group from day 1 to day 5.
    In the experimental group, the median value of wrong paces decreased from eight on day 1 to one on day 5, and the number of wrong paces decreased day by day. In the control group, the median value of wrong paces decreased from 7 on day 1 to 5 on day 5, but the wrong paces decreased only during the first three days. The results showed that i-m-Walk could effectively reduce wrong paces during walking meditation.

Figure 9: The median value of error footsteps for the experimental group and the control group from day 1 to day 5.

    The experimental group was asked to complete a questionnaire after using the i-m-Walk system from day 2 to day 4; the content of the questionnaire was the same each day. Figure 10 shows the results of the questionnaires, which were completed after walking meditation. We asked two questions: (1) to what degree does i-m-Walk help you be aware of your pace? (2) to what degree does i-m-Walk help you slow down your footsteps? There were five options for answers: "1: serious interference", "2: a little interference", "3: no interference and no help", "4: a little help", and "5: very helpful". The results showed that all participants in the experimental group gave positive feedback on both questions, and the questionnaire scores were between "a little helpful" and "very helpful".

Figure 10: The questionnaire results filled in by the experimental group from day 2 to day 4. The red line shows the baseline of satisfaction. Error bars show ±1 SE.
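The between-group comparison reported above is an independent-samples t-test on the average walking time per footstep; a minimal sketch is shown below. The two arrays are placeholders for illustration, not the recorded study data.

    from scipy import stats

    # Hypothetical per-participant mean footstep times (seconds) on one day.
    experimental = [10.2, 11.5, 9.8, 12.0, 10.9]
    control = [5.0, 4.8, 5.5, 5.2, 4.9]

    t_value, p_value = stats.ttest_ind(experimental, control)
    print("t = %.2f, p = %.4f" % (t_value, p_value))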
                      6.   DISCUSSION
    The aim of this section is to summarize, analyze, and discuss the results of this study and to give guidelines for the future development of applications.

6.1 User Interface
    The user interface of i-m-Walk provides pace information, including the walking speed, wrong paces, and the center of the feet. The walking-speed results showed significantly that i-m-Walk could help beginners decrease their walking speed during walking meditation. Some participants' comments from the experimental group follow:
    User E6 on day 3: "I always walked fast before, but when I saw the dashboard and the warning message 'too fast,' it was helpful to remind me to slow down my walking speed."
    We list two design principles of the user interface: (a) we used the form of a dashboard to represent the walking speed; the value of the walking speed is easy to watch, and the user may become aware of changes in walking speed while slowing down or speeding up; (b) i-m-Walk provided an additional alarm mechanism, the warning message "too fast", shown while walking too fast; this mechanism can remind the user when distracted. The wrong-pace results showed that i-m-Walk could effectively reduce wrong paces for beginners during walking meditation. One of the participants from the experimental group said:
    User E1 on day 2: "When I saw the color of the block on the screen change from green to red, I knew that I had made a wrong pace. Then, I would concentrate on my pace deliberately for the next footstep."

6.2 Human Perception
    Human beings receive messages by means of five modalities: vision, sound, smell, taste, and touch. The modalities most used in the field of human-computer interaction are the visual and auditory modalities. There was a comment from an experimental participant:
    User E3 on day 2: "If I could listen to my pace during walking meditation, I would not need to hold the smartphone."
    In cross-modal research, the visual modality is generally considered superior to the auditory modality in the spatial domain. In our case, we need to show the footstep phases accurately and also to show the walking speed and wrong paces at the same time; therefore, we selected visual feedback as the user interface. The advantage is that users can decide whether to watch the information or not, but the shortcoming is that users fail to receive the information when they do not look at it. Therefore, it is possible to provide more interaction methods, such as tactile and acoustic feedback, to remind users.
    On the other hand, the mechanisms of multimedia feedback might attract the user's attention in some cases, and too many inappropriate and redundant events might disturb users. In our system, we provided visual feedback all the time during walking meditation because we did not know whether the user needed the guidance or not, but we informed participants that they could decide not to look at the visual feedback when they were well aware of their pace. In this way, the interference during use could be minimized. The questionnaire showed that the participants felt there was no interference while using i-m-Walk and that the system was helpful.

6.3 Beginners vs. Masters
    In recent years, the concept of "slow technology" has been applied in many mediated systems. The design philosophy of slow technology is that we should use slowness in learning, understanding, and presence to give people time to think and reflect. In our case, walking meditation is one instance of this conception. There are two main parts in walking meditation: the inside condition and the outside condition. The inside condition means the meditation of the mind, and the outside condition means the meditation of the walking posture. All participants in our experiment were beginners, because we focused on the training of the outside condition, the walking posture. The difference between beginners and masters in walking meditation is that beginners are not familiar with walking meditation and need to pay more attention to the control of their pace, whereas masters are familiar with it and can focus on the meditation of aligning the inside and outside of the body. Walking meditation is a way to align the feeling inside and outside of the body, and a beginner should become familiar with the walking posture before the spiritual development. In this paper, the goal of our experiment is to evaluate the learning effects of the i-m-Walk system. The experimental results showed that the participants of the experimental group could slow down their walking speed and decrease their wrong paces after five days of training. Six participants in the experimental group felt that the experimental session on day four was shorter than on the first day, although the experimental time was the same; there was no such comment from the participants in the control group. The results showed that i-m-Walk could help users train the walking posture of walking meditation.

6.4 Reaction Time
    Reaction time is an important issue in human-computer interaction design: if the reaction delay is too long, users cannot control the system well and cannot easily be aware of the interaction. According to our observation, the delay time of i-m-Walk is 0.2 second. However, the delay does not affect users, because the application in this experiment does not need a fast reaction time; the average pace duration was 10.9 seconds in the experimental group on day five. The results of the questionnaires also showed that participants felt the visual feedback could reflect the walking status immediately. Nevertheless, the somatosensation of one's own feet is the most intuitive, and i-m-Walk can only provide assistance for beginners when they need it.
                  7.   CONCLUSIONS AND FUTURE WORK

    In this paper, we present a mobile application that uses
pressure sensors in shoes to visualize the phases of footsteps on
a mobile device, in order to raise the user's awareness of his
walking behaviour and to help him improve it. Our study
showed that i-m-Walk could effectively assist beginners in
slowing down their pace and decreasing the error rate of pace
during walking meditation. The concept behind i-m-Walk could
therefore be used in other applications, such as walking
rehabilitation.
    Despite the encouraging results of this study as to the
positive effect of i-m-Walk, future research is required in a
number of directions. For the intelligent shoes, we will analyze
the user's walking style, such as pigeon-toed and out-toed gait,
while walking. For the biofeedback mechanisms, we will design
further interaction methods, such as tactile and acoustic
feedback. Besides, we will record and analyze the user's
learning status while walking, and provide appropriate and
personalized guidance according to his condition. Currently, we
are adding sensing devices, such as the Breath-Aware Garment
and Sensing Ring, to detect the user's biosignals and activities,
and integrating them into i-m-Walk to analyze breathing status
and heart rate while walking and running.

                       8.   ACKNOWLEDGMENT

   This work was supported in part by the Technology
Development Program for Academia, Ministry of Economic
Affairs, Taiwan, under grant 98-EC-17-A-19-S2-0133.




Object of Interest Detection Using Edge Contrast Analysis


                     Ding-Horng Chen                                                          FangDe Yao
   Department of Computer Science and Information                           Department of Computer Science and Information
                    Engineering                                                              Engineering
            Southern Taiwan University                                               Southern Taiwan University
          Yong Kang City, Tainan County                                            Yong Kang City, Tainan County
              chendh@mail.stut.edu.tw                                              m97g0102@webmail.stut.edu.tw

Abstract— This study presents a novel method to detect the
focused object-of-interest (OOI) in a defocused, low depth-of-
field (DOF) image. The proposed method consists of three
steps. First, we use three different operators, namely saturation
contrast, morphological functions, and the color gradient, to
compute the object's edges. Second, hill-climbing color
segmentation is used to find the color distribution of the image.
Finally, we combine the edge detection and color segmentation
results to detect the object of interest. The proposed method
thus exploits the advantages of both the edge and the color
feature spaces. The experimental results show that our method
works satisfactorily on many challenging images.

    Keywords: Object of Interest (OOI); Depth of Field (DOF);
Object Detection; Edge Detection; Blur Detection.

                       I.    INTRODUCTION
     The market for digital single-lens reflex (DSLR) cameras
has expanded tremendously as prices have become more
affordable. For a professional photographer, a DSLR offers
excellent image quality, interchangeable lenses, and an
accurate, large, and bright optical viewfinder. A DSLR also has
a larger sensor, which produces a more pronounced depth-of-
field (DOF) effect, one of its most significant features.
According to market reports [1][2][3], the DSLR market share
will grow quickly in the near future. Table 1 shows the growth
of the digital camera market.

           Table 1.   Market Estimate of the Digital Cameras

       Year            2006        2011       Growth Rate
       World Market     81         82.2          108%
       DSLR             4.8         8.3          173%
       DSC             76.8        79.9          104%
                                          Unit: Million US$

     The extraction of a local region of interest in an image is
one of the most important research topics in computer vision
and image processing [4][5]. The detection of the object of
interest (OOI) in a low-DOF image can be applied in many
fields, such as content-based image retrieval. Measuring the
sharpness or blurriness of edges in an image is also important
for many image processing applications, for instance checking
the focus of a camera lens, identifying shadows (whose edges
are often less sharp than object edges), separating variations in
illumination from the reflectance of objects (also known as
intrinsic image extraction), and separating in-focus (foreground)
from out-of-focus (background) areas in an image.
     The DOF is the portion of a scene that appears acceptably
sharp in the image. Although a lens can precisely focus at only
one distance, the sharpness decreases gradually on each side of
the focused distance. A low (small) DOF is an effective way to
emphasize the photographic subject. The OOI is thus obtained
through the photographic technique of using a low DOF to
separate the object of interest in a photo. Fig. 1 shows a typical
OOI image with low DOF.

                    Figure 1. A typical OOI image

     The OOI detection problem can be viewed as an extension
of the blur detection problem. In Chung's method [6], the x- and
y-direction derivatives and a gradient map are computed to
measure the blur level, and the edge points are obtained from a
weighted average of the standard deviation of the magnitude
profile around each edge point.
     Renting Liu et al. [7] proposed a method that determines the
blur type of an image: using pre-defined blur features, it trains a
blur classifier to discriminate different regions. This classifier is
based on features such as the local power spectrum slope, the
gradient histogram span, and the maximum saturation. The
blurry regions are then measured by local autocorrelation
congruency to recognize the blur type.
     The above methods determine the blur level and the blurred
regions, but they still cannot extract the OOI from an image. If
the background is complex or the edges are blurred, these
methods are unable to find the OOI [6][7]. N. Santh and
K. Ramar proposed two approaches, i.e., the edge-based




and region-based approaches, to segment low-DOF images [8].
They transform the low-DOF pixels into an appropriate feature
space called the higher-order statistics (HOS) map. The OOI is
then extracted from the low-DOF image by region merging and
a thresholding technique as the final decision.
     However, if the object's shape is complex or its edges are
not fully connected, it is still hard to find the object. The OOI
may not be a compact region with a perfectly sharp boundary,
so edge detection alone cannot find a complete object in a
low-DOF image. In some cases, such as macro or close-up
photography, the depth of field is very low and some parts of
the subject may be out of focus, which causes a partial blur on
the subject. To acquire a satisfactory OOI detection result, not
only the blurred part but also the sharp part needs to be taken
into consideration. Finding a good OOI in such images is
therefore challenging.

                   II.   THE PROPOSED METHOD
     In this paper, we propose a novel method to extract the OOI
from a low-DOF image. The proposed algorithm consists of
three steps. First, we find the object boundaries by computing
the sharpness of edges. Second, hill-climbing color
segmentation is used to find the color distribution and its edges.
Finally, we integrate the above results to obtain the OOI
location.
     The first step is divided into three parts and is illustrated in
Fig. 2. We calculate the feature parameters, including the
maximum saturation, the color gradient, and the local range
image. The image is converted into the CIE Lab color space
and edge detection is performed. For noise reduction, we use a
median filter to remove fragmentary values. All the feature
images are then multiplied together to extract the exact position
of the OOI.

                  Figure 2. Edge detection flowchart

A. Saturation Edge Power Mean
     Fig. 3 shows the original image in which we want to detect
the OOI. The background is out of focus and thus smoother
than the object we want to detect. Color saturation and edge
sharpness are the major differences between the objects and the
background. Color information is very important in blur
detection: blurred pixels tend to have less vivid colors than
un-blurred pixels because of the smoothing effect of the
blurring process, so focused (un-blurred) objects are likely to
have more vivid colors than blurred parts, and the maximum
saturation value in blurred regions is expected to be smaller
than in un-blurred regions. Based on this observation, we use
the following equation to compute the pixel saturation:

    S_P = 1 - \frac{3}{R + G + B} \min(R, G, B)                         (1)

where S_P is the saturation of the pixel. Equation (1) transforms
the original image into the saturation feature space to find the
parts of the image with higher saturation.
     In low-DOF images, the saturation does not change
dramatically in the background because it is smoother; on the
contrary, the color saturation changes sharply along the edges.
Therefore, we define the edge contrast CA, computed in a 3x3
window, as follows:

    CA = \sum_{n \in M,\, n \neq A} \frac{(n - A)^2}{n}                 (2)

where M is the 3x3 window, A is the saturation value at the
window center, and n is a saturation value in the neighborhood
within this window.
     Equation (2) measures the saturation contrast. Here we
show the result images to demonstrate the processing steps:
Fig. 4 is the resulting saturation image, and Fig. 5 shows the
result after the edge contrast computation.

                      Figure 3. Original image

                     Figure 4. Saturation image
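
    As an illustration of Eqs. (1) and (2), the following sketch computes the
saturation map and the edge contrast in Python with NumPy. The function names
are ours, the 3x3 window simply skips the one-pixel image border, and the 1/n
normalization inside Eq. (2) reflects our reading of the partly garbled layout
of the original equation rather than a confirmed definition.

    import numpy as np

    def saturation_map(rgb):
        # Eq. (1): S_P = 1 - 3 * min(R, G, B) / (R + G + B)
        rgb = rgb.astype(np.float64)
        total = rgb.sum(axis=2) + 1e-6            # guard against division by zero
        return 1.0 - 3.0 * rgb.min(axis=2) / total

    def edge_contrast(sat, eps=1e-6):
        # Eq. (2): accumulate (n - A)^2 / n over the eight neighbours n of the
        # centre value A in each 3x3 window; border pixels are left at zero.
        h, w = sat.shape
        ca = np.zeros_like(sat)
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                a = sat[i, j]
                win = sat[i - 1:i + 2, j - 1:j + 2].ravel()
                n = np.delete(win, 4)             # drop the centre value A
                ca[i, j] = np.sum((n - a) ** 2 / (n + eps))
        return ca

Calling edge_contrast(saturation_map(img)) yields a map of the kind visualized
in Fig. 5.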




                   Figure 5. Saturation edge image

B. Color Gradient
    The gradient of a scalar field is a vector field that points in
the direction of the greatest rate of increase of the scalar field
and whose magnitude is that greatest rate of change. It is very
useful in typical edge detection problems.
    To calculate the gradient of the color intensity, we first use
the Sobel operator to separate the vertical and horizontal edges:

    G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A, \qquad
    G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A          (3)

    G = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)         (4)

    Equations (3) and (4) show the traditional way to compute
the gradient. Here \theta is the edge angle, and \theta = 0 for a vertical
edge which is darker on the left side. We modify the above
equations to be more accurate in our case as follows:

    G_x = R_x^2 + G_x^2 + B_x^2
    G_y = R_y^2 + G_y^2 + B_y^2
    G_{xy} = R_x R_y + G_x G_y + B_x B_y
    A = 0.5 \arctan\!\left(\frac{2 G_{xy}}{G_x - G_y}\right)
    G_1 = 0.5 \left[ (G_x + G_y) + (G_x - G_y)\cos(2A) + 2 G_{xy}\sin(2A) \right]

where R_x, G_x, and B_x are the RGB layers filtered by the
horizontal Sobel operator, and R_y, G_y, and B_y are the RGB
layers filtered by the vertical Sobel operator. A is the angle of
G_{xy}, and G_1 is the color gradient of the image at angle 0.
    The definition of G_2 is similar to that of G_1, but the term A
is replaced by A + \pi/2. Therefore, G_2 is computed as

    G_2 = 0.5 \left[ (G_x + G_y) + (G_x - G_y)\cos\!\left(2(A + \tfrac{\pi}{2})\right) + 2 G_{xy}\sin\!\left(2(A + \tfrac{\pi}{2})\right) \right]

    The color gradient CG is obtained by taking the maximum
of G_1 and G_2, i.e.,

    CG = \max(G_1, G_2)

    The CG value reflects the color intensity along the edge
gradient; it increases if the color at an edge point changes
dramatically. Fig. 6 shows the result of the color gradient
computation.

                     Figure 6. Color vector image

C. Local Range Image
    In this study, we adopt the morphological functions
DILATION and EROSION to find the local maximum and
minimum values in a specified neighborhood.
    First, we convert the original image from the RGB color
space to the CIE Lab color space. Because the luminance of an
object is not always flat, we compute the local range value only
for the a and b layers, without the L (luminance) component;
discarding the luminance prevents uneven lighting from being
mistaken for color diversification on the object. The dilation,
erosion, and local range computations are defined by the
following equations:

    Dilation:              A \oplus B = \{ z \mid (\hat{B})_z \cap A \neq \varnothing \}
    Erosion:               A \ominus B = \{ z \mid (\hat{B})_z \subseteq A \}
    Local Range Image:     (A \oplus B) - (A \ominus B)

    Fig. 7 shows the result of the local range operation.

                    Figure 7. A local range image
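
    A sketch of the color gradient of Section II-B, using scipy.ndimage.sobel
for the per-channel Sobel responses; the names and the use of arctan2 (for
numerical robustness) are our choices and are not taken from the paper.

    import numpy as np
    from scipy.ndimage import sobel

    def color_gradient(rgb):
        rgb = rgb.astype(np.float64)
        # Sobel responses of each channel along the two image axes.
        gx_c = [sobel(rgb[..., c], axis=1) for c in range(3)]   # horizontal derivative
        gy_c = [sobel(rgb[..., c], axis=0) for c in range(3)]   # vertical derivative

        # Squared-magnitude and cross terms of the three color channels.
        gxx = sum(g * g for g in gx_c)
        gyy = sum(g * g for g in gy_c)
        gxy = sum(gx * gy for gx, gy in zip(gx_c, gy_c))

        # Principal direction A of the color gradient.
        a = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy)

        def g(theta):
            # Gradient energy along direction theta (the bracketed term of G1/G2).
            return 0.5 * ((gxx + gyy) + (gxx - gyy) * np.cos(2.0 * theta)
                          + 2.0 * gxy * np.sin(2.0 * theta))

        g1 = g(a)                   # G1: evaluated at A
        g2 = g(a + np.pi / 2.0)     # G2: evaluated at A + pi/2
        return np.maximum(g1, g2)   # CG = max(G1, G2)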
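
    For the local range image of Section II-C, a companion sketch using
grey-scale morphology from SciPy; the RGB-to-Lab conversion is assumed to come
from scikit-image, and combining the a and b channel ranges with a maximum is
our own choice, since the paper does not state how the two channels are merged.

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion
    from skimage.color import rgb2lab   # assumed available for RGB -> CIE Lab

    def local_range_image(rgb, size=(3, 3)):
        # Local range = dilation - erosion, computed on the a and b channels
        # only, so that uneven luminance (the L channel) does not dominate.
        lab = rgb2lab(rgb)
        ranges = [grey_dilation(lab[..., c], size=size) -
                  grey_erosion(lab[..., c], size=size)
                  for c in (1, 2)]
        return np.maximum(ranges[0], ranges[1])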




D. Median Filter
    The median filter is a nonlinear digital filtering technique
that is often used to remove noise. Such noise reduction is a
typical pre-processing step to improve the results of later
processing. The edge detection process leaves some
fragmentary values; if these values are low or the fragmentary
edges are not connected, they can be regarded as noise.
Therefore, we adopt the median filter to reduce the fragmentary
pixels.

E. Hill-Climbing Color Segmentation
    Edge detection finds most edges of the OOI, but the
resulting boundaries are usually not completely closed, and
morphological operators cannot link all the disconnected edges
into a complete boundary. Most OOI edges can be detected by
the previous procedures, but some edges remain unconnected.
To turn the OOI boundary into a regular closure, we adopt color
segmentation to connect the isolated edges.
    The color segmentation method is illustrated in Fig. 8. It is
based on T. Ohashi et al. [10] and R. Achanta et al. [11]. The
hill-climbing algorithm detects local maxima of clusters in the
global three-dimensional color histogram of an image. The
algorithm then associates the pixels of the image with the
detected local maxima; as a result, several visually coherent
segments are generated.

     Figure 8. Color segmentation and edge detection flow chart

    The detailed algorithm is described as follows:
    1. Convert the image to the CIE Lab color space.
    2. Build the CIE Lab color histogram.
    3. Search the color histogram for local maximum values.
    4. Use the local maximum colors as the initial centroids of
       a k-means classification.
    5. Re-train the classifier until the cluster centers are stable.
    6. Apply the k-means clustering and remap the original
       pixels to their clusters.
    Fig. 9 shows the result of the color segmentation.

               Figure 9. A color segmentation result

F. Edge Combination
    The OOI edges are obtained by two methods. First, we use
the morphological close operation, which is a dilation followed
by an erosion, to connect isolated points; the close operation
makes the gaps between unconnected edges smaller and the
outer edges smoother. Second, we apply edge detection to the
color segmentation map to find the color distribution, and
merge it with the previous edge detection result.
    After the above procedures we have most of the edge clues,
and we then integrate these clues into a complete OOI
boundary. Let the result of the boundary detection be I_E and
the result of the color segmentation be I_C. The edges are
extended by counting the pixels of I_C and the neighboring
points of I_E. To determine whether a pixel at the end of I_E
should be extended, we assign an "edge extension" value P at
point (i, j) as follows:

    P(i, j) = I_C(i, j) \sum_{n=-1}^{1} \sum_{m=-1}^{1} I_E(i+n, j+m)          (16)

where (n, m) slides over a 3x3 window and I_E is the value of
the previous edge detection image in that neighborhood.
Equation (16) removes the unnecessary pixels and closes the
OOI mask by extending the boundaries. The result is shown in
Fig. 10, and the image that merges the edge extension result
with the color segmentation edges is shown in Fig. 11.

  Figure 10. (a) The result before the edge extension (b) The result after the
                             edge extension
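
    The edge-extension rule of Eq. (16) can be sketched as follows. This is
our reading of the formula: the color-segmentation edge value at (i, j) is
multiplied by the number of previously detected edge pixels in the surrounding
3x3 window, and only positive values are kept; the names and the zero padding
at the border are illustrative.

    import numpy as np

    def edge_extension(i_e, i_c):
        # Eq. (16): P(i, j) = I_C(i, j) * sum of I_E over the 3x3 neighbourhood.
        h, w = i_e.shape
        pad = np.pad(i_e.astype(np.int32), 1)
        neigh = sum(pad[1 + n:1 + n + h, 1 + m:1 + m + w]
                    for n in (-1, 0, 1) for m in (-1, 0, 1))
        p = i_c.astype(np.int32) * neigh
        # Keep a segmentation edge pixel only where it touches at least one
        # previously detected edge pixel, extending the OOI boundary.
        return (p > 0).astype(np.uint8)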
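
    The hill-climbing segmentation steps of Section II-E can be sketched in
the same spirit. Here the peak search is approximated with a 3-D maximum
filter over a coarse Lab histogram, and SciPy's kmeans2 stands in for the
re-training loop, so the bin count and helper names are our assumptions rather
than the authors' settings.

    import numpy as np
    from scipy.ndimage import maximum_filter
    from scipy.cluster.vq import kmeans2
    from skimage.color import rgb2lab          # assumed available

    def hill_climbing_segmentation(rgb, bins=16):
        # Steps 1-2: convert to CIE Lab and build the 3-D color histogram.
        lab = rgb2lab(rgb).reshape(-1, 3)
        hist, edges = np.histogramdd(lab, bins=bins)

        # Step 3: local maxima of the histogram act as color peaks.
        peaks = (hist == maximum_filter(hist, size=3)) & (hist > 0)
        centers = np.array([[0.5 * (e[i] + e[i + 1]) for e, i in zip(edges, idx)]
                            for idx in zip(*np.nonzero(peaks))])

        # Steps 4-6: k-means seeded with the peak colors, then remap pixels.
        _, labels = kmeans2(lab, centers, minit='matrix')
        return labels.reshape(rgb.shape[:2])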




   Figure 11. The result image that merges the edge extension image and the
                        color segmentation image

    We integrate the above edge pieces into a complete OOI
mask: if the boundaries are closed, we add the enclosed region
to the final OOI mask. The edge combination of the final OOI
mask is shown in Fig. 12.

                 Figure 12. Edge combination result

               III.   THE EXPERIMENTAL RESULTS
    The aperture stop of a photographic lens, in combination
with the shutter speed, controls the amount of light reaching the
film or image sensor. In this study, we use a Pentax ist DL
digital camera and a prime lens, the Helios M44-2 60mm F2.0,
to perform the experiments. We choose a prime lens as our test
lens in order to reduce the unstable parameters. To ensure that
all of the exposures are the same, we control the shutter speed
and aperture manually.
    To test the proposed method, we randomly select 5 test
photos from an album of 50 photos, all taken under the same
conditions and camera parameters. Fig. 13 shows the proposed
OOI detection results for different aperture values.

        Figure 13. Five examples with different aperture values

    The DOF becomes smaller as the aperture value gets lower,
and parts of the OOI may be blurred as well. A higher aperture
value increases the edge sharpness, which makes it more
difficult to separate the background from the OOI. Figs. 14 to
17 show the OOI detection results. In our experiments, the
object boundaries become irregular as the aperture value gets
higher, and the proper aperture value for obtaining the best
segmentation results is about f/2.8 to f/5.6.

          Figure 14. The experimental results (sample 1)

          Figure 15. The experimental results (sample 2)




          Figure 16. The experimental results (sample 3)

          Figure 17. The experimental results (sample 4)

    A convincing definition of a "good OOI" is hard to give,
since it depends on human cognition. In this paper, we follow
the experiment of N. Santh and K. Ramar [8] to verify the
proposed method. First, five user-defined OOI boundaries are
drawn; we then compare them with the boundaries detected by
the proposed method. Equation (17) computes the overlapped
region between the reference and the detected OOI boundaries,
i.e.,

    Accuracy = 1 - \frac{\sum_{(x,y)} \left| I_{est}(x,y) - I_{ref}(x,y) \right|}{\sum_{(x,y)} I_{ref}(x,y)}          (17)

where I_est is the OOI mask produced by the proposed method
and I_ref is the mask drawn by the user as the ground truth.
    Fig. 18(a) shows the user-drawn OOI boundaries and
Fig. 18(b) shows the detected OOI boundaries.

   Figure 18. Comparison results: (a) User-drawn OOI boundary (b) The
                        proposed method result

    The detection accuracy decreases when the OOI has
complex texture, such as shirts, cloth, or artificial structures,
and the accuracy is higher when the background is simple.
Even if the image is not correctly focused on the target, the
proposed method can still find a complete object. The accuracy
becomes lower if there is more than one OOI in an image, as
shown for sample 2 in Fig. 18. Table 2 shows the accuracy
computed by Equation (17).

 Table 2.   The comparison result between the reference images and the
                            proposed method

     Sample        1        2        3        4       5
     Accuracy    98.2%    94.6%    96.1%     98%     91%
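
    The accuracy measure of Eq. (17) is straightforward to compute from two
binary masks; a minimal sketch follows (the absolute difference reflects our
reading of the operator lost in this copy of the equation):

    import numpy as np

    def boundary_accuracy(i_est, i_ref):
        # Eq. (17): 1 - sum|I_est - I_ref| / sum(I_ref), masks given as 0/1.
        est = i_est.astype(np.float64)
        ref = i_ref.astype(np.float64)
        return 1.0 - np.abs(est - ref).sum() / ref.sum()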




                        IV.   CONCLUSION
    In this paper we propose a method to extract OOI objects
from a low-DOF image based on edge and color information.
The method needs no user-defined parameters, such as the
shapes and positions of objects, and no extra scene information.
We integrate the color saturation, morphological functions, and
color gradient to detect a rough OOI, and finally use color
segmentation to make the OOI boundaries closed and compact.
Our method thus takes advantage of both edge detection and
color segmentation.
    The experiments show that our method works satisfactorily
on many different kinds of images. The method can be applied
as a pre-processing step in image processing and computer
vision tasks such as object indexing and content-based image
retrieval.

                          REFERENCES
[1]  InfoTrends, "The Consumer Digital SLR Marketplace: Identifying &
     Profiling Emerging Segments," Digital Photography Trends, September
     2008. http://www.capv.com/public/Content/Multiclients/DSLR.html
[2]  Dudubird, "Chinese Photographic Equipment Industry Market Research
     Report," December 2009. http://www.cnmarketdata.com
     /Article_84/2009127175051902-1.html
[3]  Fuji Keizai market report, September 2007.
     https://www.fuji-keizai.co.jp/market/06074.html
[4]  Khalid Idrissi, Guillaume Lavoué, Julien Ricard, and Atilla Baskurt,
     "Object of interest-based visual navigation, retrieval, and semantic
     content identification system," Computer Vision and Image
     Understanding, vol. 94, 2004, pp. 271-294.
[5]  James Z. Wang, Jia Li, Robert M. Gray, and Gio Wiederhold,
     "Unsupervised Multiresolution Segmentation for Images with Low
     Depth of Field," IEEE Transactions on Pattern Analysis and Machine
     Intelligence, vol. 23, no. 1, January 2001, pp. 85-90.
[6]  Yun-Chung Chung, Jung-Ming Wang, Robert R. Bailey, and Sei-Wang
     Chen, "A Non-Parametric Blur Measure Based on Edge Analysis for
     Image Processing Applications," IEEE Conference on Cybernetics and
     Intelligent Systems, Singapore, 1-3 December 2004.
[7]  Renting Liu, Zhaorong Li, and Jiaya Jia, "Image Partial Blur Detection
     and Classification," IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2008, pp. 1-8.
[8]  N. Santh and K. Ramar, "Image Segmentation Using Morphological
     Filters and Region Merging," Asian Journal of Information Technology,
     vol. 6(3), 2007, pp. 274-279.
[9]  D. Kornack and P. Rakic, "Cell Proliferation without Neurogenesis in
     Adult Primate Neocortex," Science, vol. 294, Dec. 2001, pp. 2127-2130.
[10] T. Ohashi, Z. Aghbari, and A. Makinouchi, "Hill-climbing Algorithm
     for Efficient Color-based Image Segmentation," IASTED International
     Conference on Signal Processing, Pattern Recognition, and Applications
     (SPPRA 2003), June 2003, p. 200.
[11] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, "Salient Region
     Detection and Segmentation," International Conference on Computer
     Vision Systems (ICVS 2008), May 2008, pp. 66-75.
[12] Martin Rufli, Davide Scaramuzza, and Roland Siegwart, "Automatic
     Detection of Checkerboards on Blurred and Distorted Images,"
     International Conference on Intelligent Robots and Systems, September
     2008, pp. 22-26.
[13] Hanghang Tong, Mingjing Li, Hongjiang Zhang, and Chanshui Zang,
     "Blur Detection for Digital Images Using Wavelet Transform,"
     International Conference on Multimedia and Expo 2004, pp. 17-20.
[14] Gang Cao, Yao Zhao, and Rongrong Ni, "Edge-based Blur Metric for
     Tamper Detection," Journal of Information Hiding and Multimedia
     Signal Processing, vol. 1, no. 1, January 2009, pp. 20-27.
[15] Rong-bing Gan and Jian-guo Wang, "Minimum Total Variation
     Autofocus Algorithm for SAR Imaging," Journal of Electronics &
     Information Technology, vol. 29, no. 1, January 2007, pp. 12-14.
[16] Ri-Hua Xiang and Run-Sheng Wang, "A Range Image Segmentation
     Algorithm Based on Gaussian Mixture Model," Journal of Software,
     vol. 14, no. 7, 2003, pp. 1250-1257.




Efficient Multi-Layer Background Model on Complex Environment for
                                Foreground Object Detection
      1Wen-kai Tsai (蔡文凱), 2Chung-chi Lin (林正基), 1Ming-hwa Sheu (許明華), 1Siang-min Siao (蕭翔民), 1Kai-min Lin (林凱名)
               1 Graduate School of Engineering Science and Technology, National Yunlin University of Science & Technology
                                      2 Department of Computer Science, Tung Hai University
                                              E-mail: g9610804@yuntech.edu.tw


Abstract—This paper proposes the construction of a multi-layer
background model that can be used in complex environment
scenes. In general, a surveillance system focuses on detecting the
moving objects, but real scenes contain many moving background
elements, such as swaying leaves and falling rain. In order to
detect objects in such moving-background environments, we use
an exponential distribution function to update the background
model and combine background subtraction with homogeneous
region analysis to find the foreground objects. The system runs on
the TI TMS320DM6446 Davinci development platform and
achieves 20 frames per second on benchmark images of size
160x120. The experimental results show that our approach
performs better in terms of detection accuracy and similarity
measure than other modeling techniques.

   Keywords: background modeling; object detection

                     I.    INTRODUCTION
    Foreground object detection is a very important technology
in image surveillance systems, since the system performance
highly depends on whether the foreground objects are detected
correctly. Furthermore, the foreground objects need to be
detected accurately and quickly, so that follow-up work such as
tracking and identification can be performed correctly and
reliably. Conceptually, foreground object detection is mostly
based on background subtraction. This approach seems simple
and has a low computational cost; however, it is difficult to
obtain good results without a reliable background model. To
manage complex background scenarios, the skill of
constructing a suitable background model has become the most
crucial one.
    Generally speaking, most algorithms only regard
non-moving objects as background, but in real environments
many moving objects may also belong to the background; we
call this the moving background, for example waving trees.
However, constructing a moving background model is a
difficult task. The general practice is to use algorithms to learn
and establish the background model; after building up the
model, the system starts to carry out foreground object
detection. Therefore, a number of background models have
been proposed in recent years. The most popular approach is
the Mixture of Gaussians model (MoG) [1-2]. Although MoG
has the advantage of updating its model parameters
automatically, it needs a very long period of time to learn the
background model, and it also faces severe limitations in
memory space and processing speed on embedded systems.
The Codebook background model [3] establishes a rational and
adaptive capability which improves the detection accuracy
under moving background and lighting changes; however, it
still requires a high computational cost and a large memory
space for saving the background data. Subsequently, a
Gaussian model [4] was presented that updates the threshold
value for each pixel, but its disadvantages include a large
amount of computation and a large memory space for
recording the background model. In order to reduce memory
usage, [5] and [6] calculate a weight value for each pixel to
establish the background model; according to the weight value,
the updating mechanism determines whether the pixel is
replaced or not, so a smaller amount of memory space is used
to model the moving background.
    The above works all use multi-layer background models to
store background information, but this is still inadequate for
dealing with moving-background issues. They need to take into
account the dependency between adjacent pixels to inspect
whether the neighboring region possesses homogeneous
characteristics or not. This paper proposes an efficient 4-layer
background model and a homogeneous region analysis to
characterize the background pixels.

       II.   BUILDING MULTI-LAYER BACKGROUND MODELS
    First, the input image pixel x_{i,j}(t) consists of R, G, and B
elements, as shown in Eq. (1). The pixels of the moving
background inevitably appear repeatedly in some regions, so we
have to learn these appearance behaviors when constructing the
multi-layer background model. The first layer of the
background model (BGM1) stores the first input frame. For the
2nd frame, we record the difference between the 1st and 2nd
frames in the second layer (BGM2). Similarly, the difference of
the next pair of consecutive frames is saved in the third layer
(BGM3), and so on. We use the first 4 frames and their
differences as the initial background model. Besides, Eq. (2) is
used to record the number of occurrences of each pixel in the
learning frames.

    x_{i,j}(t) = \left( x_{i,j}^{R}(t),\; x_{i,j}^{G}(t),\; x_{i,j}^{B}(t) \right)          (1)
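
    A minimal sketch of the initialization described above, in Python with
NumPy. The phrase "record the difference" is read here as keeping, in each
further layer, the pixels of the newer frame that differ from the previous one
(copying the previous layer elsewhere); the threshold value and function names
are illustrative assumptions, not the authors' settings.

    import numpy as np

    def init_background_models(frames, th=20):
        # BGM1 stores the first frame; BGM2..BGM4 keep the pixels of frames
        # 2..4 that differ from the preceding frame, as one reading of Sec. II.
        f = [fr.astype(np.float64) for fr in frames[:4]]
        bgm = [f[0].copy()]
        for k in range(1, 4):
            changed = np.abs(f[k] - f[k - 1]).sum(axis=2) > th
            layer = bgm[-1].copy()
            layer[changed] = f[k][changed]
            bgm.append(layer)
        # MATCH counters of Eq. (2), one per layer, start at zero.
        match = [np.zeros(f[0].shape[:2], dtype=np.int32) for _ in range(4)]
        return bgm, match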




    MATCH_{i,j}^{u}(t) = \begin{cases} MATCH_{i,j}^{u}(t-1), & \text{if } \left| x_{i,j}(t) - BGM_{i,j}^{u}(t) \right| > th \\ MATCH_{i,j}^{u}(t-1) + 1, & \text{else} \end{cases}          (2)

where u = 1...4 and th is the threshold value for the similarity
comparison. From the 5th learning frame onward, we calculate
the repetition number of occurrences of every pixel in each
layer of the background model, and Eq. (3) gives its frequency
of occurrence:

    \lambda_{i,j}^{u} = \frac{MATCH_{i,j}^{u}(t)}{N}          (3)

where N is the total number of learning frames. A larger
\lambda_{i,j}^{u} indicates that the corresponding pixel occurred more
often during the learning period and must be preserved in the 4
layers. Conversely, pixels with lower occurrence will be
removed.

                   III.   BACKGROUND UPDATE
    After building up the multi-layer background model, we
must update the content of BGM_{i,j} over time to replace
inadequate background information, so the background update
mechanism is very important for the subsequent object
detection. The proposed background update method uses an
exponential distribution model to calculate the weight value of
each pixel, as shown in Eq. (4); it captures the repetition
condition of occurrence of each pixel in the background model.
A lower weight expresses that the corresponding pixel has not
appeared for a long time, and it should be replaced by a
higher-weight input pixel.

    weight_{i,j}^{u}(t) = \lambda_{i,j}^{u} \, e^{-\lambda_{i,j}^{u} t}, \quad t > 0          (4)

where t is the number of non-matching frames.
    Fig. 1 shows the distribution of the weight values. If a pixel
in the background model is not matched for a period of time, its
weight value decreases exponentially. If the weight value is less
than a threshold, the background pixel is replaced according to
Eq. (5).

          Figure 1. Exponential distribution of the weight

    BGM_{i,j}^{u}(t) = \begin{cases} \text{remove}, & \text{if } weight_{i,j}^{u}(t) < T_e \\ \alpha \times BGM_{i,j}^{u}(t) + (1-\alpha) \times BGM_{i,j}^{u}(t-1), & \text{else} \end{cases}          (5)

where T_e is a threshold for the weight and \alpha is a constant with
\alpha < 1.
    Based on the above approach, Fig. 2 demonstrates a 4-layer
background model constructed after learning 100 frames.

     Figure 2. Multi-layer background model: (a) BGM1 (b) BGM2
                      (c) BGM3 (d) BGM4

                    IV.   OBJECT DETECTION
    After establishing an accurate background model,
background subtraction can be used to obtain the foreground
objects. From practical observation, the moving background
has a homogeneous characteristic. Therefore, the object
detection method carries out the subtraction on both the 4-layer
background and its homogeneous regions.
    As shown in Fig. 2, the information stored in the
background model is the scene of the moving background,
which has the important feature of homogeneity. In Eqs. (6)
and (7), TI(t) is the total matching index between the input
pixel and the homogeneous region of the 4-layer background,
and D_{i+k,j+p}^{u} is the individual matching index between the
input pixel and one background datum BGM_{i+k,j+p}^{u}. The
homogeneous region is defined as (2r+1)x(2r+1) around the
background data at location (i, j).

    TI(t) = \sum_{u=1}^{4} \sum_{k=-r}^{r} \sum_{p=-r}^{r} D_{i+k,j+p}^{u}(t)          (6)

    D_{i+k,j+p}^{u}(t) = \begin{cases} 1, & \text{if } \left| x_{i,j}(t) - BGM_{i+k,j+p}^{u}(t) \right| \le th \\ 0, & \text{else} \end{cases}          (7)

where th is a threshold value that determines whether the pixels
are similar. If TI(t) is greater than a threshold \tau, the input
x_{i,j}(t) is similar to much of the background information and
is not an object pixel.
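
    A sketch of the per-layer match counting and the exponential-weight update
of Eqs. (2)-(5). The blending in Eq. (5) is written against the current input
frame, which is how we read the self-referential BGM term, and th, Te and
alpha are illustrative values, not ones reported by the authors.

    import numpy as np

    def update_match(match, x, bgm, th=20):
        # Eq. (2): MATCH of a layer grows by one whenever the input pixel is
        # within th of that layer's stored value.
        xf = x.astype(np.float64)
        for u in range(4):
            dist = np.abs(xf - bgm[u]).sum(axis=2)
            match[u] += (dist <= th)
        return match

    def layer_weight(lam, t):
        # Eqs. (3)-(4): lam = MATCH / N and weight = lam * exp(-lam * t),
        # where t counts the frames since the layer pixel last matched.
        return lam * np.exp(-lam * t)

    def update_layer(bgm_u, x, weight_u, te=0.05, alpha=0.7):
        # Eq. (5): stale pixels (weight below Te) are replaced by the input;
        # the rest are blended with the previous layer value.
        xf = x.astype(np.float64)
        out = alpha * xf + (1.0 - alpha) * bgm_u
        stale = weight_u < te
        out[stale] = xf[stale]
        return out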
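
    The homogeneous-region matching of Eqs. (6)-(7) and the thresholding
against \tau described above can then be sketched as follows; r, th and tau are
illustrative parameters, and edge padding is our choice for handling the image
border.

    import numpy as np

    def foreground_mask(x, bgm, r=1, th=20, tau=3):
        # Eqs. (6)-(7): TI counts, over all 4 layers, the background samples in
        # the (2r+1)x(2r+1) neighbourhood that lie within th of the input pixel.
        h, w = x.shape[:2]
        xf = x.astype(np.float64)
        ti = np.zeros((h, w), dtype=np.int32)
        for u in range(4):
            pad = np.pad(bgm[u], ((r, r), (r, r), (0, 0)), mode='edge')
            for k in range(-r, r + 1):
                for p in range(-r, r + 1):
                    shifted = pad[r + k:r + k + h, r + p:r + p + w]
                    dist = np.abs(xf - shifted).sum(axis=2)
                    ti += (dist <= th)
        # Pixels similar to enough background information are background;
        # the rest are reported as foreground.
        return (ti < tau).astype(np.uint8)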




and it is not a object pixel. Eq. (8) is used to find out                 sequence. Our proposed approach can achieve the highest
foreground object (FO).                                                   similarity value, i.e. our results are close to those of ground
                                                                          truth.
                ⎧ 0,   if TI (t ) ≥ τ
 FOi , j (t ) = ⎨                                    (8)
                ⎩1,      else

When FOi , j (t ) = 1 , the input pixel belongs to foreground
object pixel. On the other hand, If FOi , j (t ) = 0 , the input
pixel belongs to background pixel.


   V.     EXPERIMENTAL RESULTS OF PROTOTYING SYSTEM
    Based on our proposed approach, the object detection is
implemented by TMS320DM6446 Davinci as shown in
Fig.3. The input image resolution is 160*120 per flame.
Averagely, our approach can process 20 frames per second
for performing object detection on the prototyping platform.




  Figure 3. TI TMS320DM6446 Davinci development kit

Next, by using the presented research methods, the
foreground object with binary-value results are also
displayed in Fig.4. The result of ground truth, which is
segmented the objects manually from the original image
frame, is regarded as the perfect result. It can be found that
our result has the better object detection. In order to make a
fair comparison, we adopt [7] calculating similarity and
total error pixels method to assess these results of the
                                                                                           Figure 4. Foreground Object Detection Result
algorithms. Eq.(9) is used to get the total error pixel number
and Eq. (10) is used to evaluate similarity value.                                                                                                 Wu[2]
                                                                                                                 Total Error Pixels                Chien[5]
$$\text{total error pixels} = fn + fp \qquad (9)$$

$$\text{Similarity} = \frac{tp}{tp + fn + fp} \qquad (10)$$

where fp is the total number of false positives, fn is the total number of false negatives, and tp is the total number of true positives. Fig. 5 depicts the number of error pixels over a video sequence; the numbers of error pixels produced by our proposed method are lower than those of the other algorithms. Fig. 6 shows the similarity over the same video sequence. Our proposed approach achieves the highest similarity value, i.e., our results are the closest to the ground truth.

Figure 5. Error pixels by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method) versus frame number
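As a concrete illustration of Eqs. (9) and (10), both measures can be computed directly from a detected mask and a ground-truth mask. The Python sketch below assumes both masks are binary NumPy arrays of the same size; the function name is only illustrative.

```python
import numpy as np

def evaluate_mask(detected, ground_truth):
    """Return (total_error_pixels, similarity) per Eqs. (9)-(10)."""
    detected = detected.astype(bool)
    ground_truth = ground_truth.astype(bool)
    tp = np.count_nonzero(detected & ground_truth)    # correctly detected foreground
    fp = np.count_nonzero(detected & ~ground_truth)   # false alarms
    fn = np.count_nonzero(~detected & ground_truth)   # missed foreground
    total_error_pixels = fn + fp                      # Eq. (9)
    similarity = tp / float(tp + fn + fp)             # Eq. (10)
    return total_error_pixels, similarity
```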




Figure 6. Similarity by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method) versus frame number

                                        VI.      CONCLUSION
    In this paper, we propose an effective and robust multi-layer background modeling algorithm. Foreground object detection must cope with moving backgrounds, such as fluttering leaves and rain in outdoor scenes or rotating fans in indoor scenes. Therefore, we build the moving background into a multi-layer background model by calculating weight values and analyzing the characteristics of regional homogeneity. In this way, our approach is suitable for a variety of scenes. Finally, we present the foreground detection results in terms of similarity and total error pixels, and show the benefit of our algorithm with explicit data and graphs.



                                             REFERENCES
[1]  C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, 2000.
[2]  H. H. P. Wu, J. H. Chang, P. K. Weng, and Y. Y. Wu, "Improved Moving Object Segmentation by Multi-Resolution and Variable Thresholding," Optical Engineering, vol. 45, no. 11, 117003, 2006.
[3]  K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, "Real-Time Foreground-Background Segmentation using Codebook Model," Real-Time Imaging, pp. 172-185, 2005.
[4]  H. Wang and D. Suter, "A Consensus-Based Method for Tracking: Modelling Background Scenario and Foreground Appearance," Pattern Recognition, pp. 1091-1105, 2006.
[5]  W.-K. Chan and S.-Y. Chien, "Real-Time Memory-Efficient Video Object Segmentation in Dynamic Background with Multi-Background Registration Technique," International Workshop on Multimedia Signal Processing, pp. 219-222, 2002.
[6]  W.-K. Tsai, M.-H. Sheu, C.-L. Su, J.-J. Lin, and S.-Y. Tseng, "Image Object Detection and Tracking Implementation for Outdoor Scenes on an Embedded SoC Platform," International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 386-389, September 2009.
[7]  L. Maddalena and A. Petrosino, "A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications," IEEE Trans. on Image Processing, vol. 17, no. 7, July 2008.




CLEARER 3D ENVIRONMENT CONSTRUCTION USING IMPROVED DM BASED ON
     GAZE TECHNOLOGY APPLIED TO AUTONOMOUS LAND VEHICLES

                                 Kuei-Chang Yang (楊桂彰)1, Rong-Chin Lo (駱榮欽)2
    1 Dept. of Electronic Engineering & Graduate Institute of Computer and Communication Engineering,
                            National Taipei University of Technology, Taipei
    2 Dept. of Electronic Engineering & Graduate Institute of Computer and Communication Engineering,
                             National Taipei University of Technology, Taipei
                                    E-mail: t7418002@ntut.edu.tw



                      ABSTRACT

     In this paper, we propose a gaze approach that sets the binocular cameras at different baseline distances to obtain better resolution in three-dimensional (3D) environment construction. The method is capable of obtaining a more accurate distance to an object and a clearer environment construction, which can be applied to Autonomous Land Vehicle (ALV) navigation. In this study, the ALV is equipped with parallel binocular cameras to imitate human eyes and provide binocular stereo vision. Using the information of binocular stereo vision to build a disparity map (DM), the 3D environment can be reconstructed. Because the baseline of the binocular cameras is usually fixed, the DM, shown as an image, only has good resolution within a specific distance range; that is, only a partial region of the reconstructed 3D environment is clear, so it cannot provide a complete navigation environment. Therefore, this study proposes multiple baselines to obtain clearer DMs for the near, middle and far distances of the environment. Several experimental results, showing the feasibility of the proposed approach, are also included.

Keywords: binocular stereo vision; disparity map

                  1. INTRODUCTION

     In recent years, machine vision has become the most important sensing system for intelligent robots. The image captured from a camera carries a large amount of object information, including shape, color, shading, shadow, etc., unlike other commonly used sensors that can obtain only one kind of measurement, such as ultrasonic sensors [1], infrared sensors [2], or laser sensors [3]. In other words, the visual sensor can acquire a great deal of environmental information, but this information is mixed together; therefore, various image processing techniques are necessary to separate it and obtain meaningful information. A great deal of manpower and resources are devoted to binocular stereo vision [4] research in many countries. As applied to robots and ALVs, the advantage of binocular stereo vision is that it obtains the depth of the environment, and this depth can be used for obstacle avoidance, environment learning, and path planning. In such applications, the disparity is used by the vision system for image recognition and image-signal analysis. Besides requiring the two cameras to be set in parallel and fixed accurately, this disparity method still requires a high-speed computer to store and analyze images. Moreover, setting the binocular cameras of an ALV with a fixed baseline can only obtain a good DM of the environment images within a specific region.

In this paper, we propose an approach that sets the binocular cameras with different baselines to obtain the depths of the DM corresponding to different measuring distances. In the future, this method will be able to obtain the environment image from near to far range, which will help the ALV in path planning.

                  2. STEREO VISION

     In recent years, because the computing speed of computers has become much faster and their hardware performance has also improved, much research relating to computer vision has been proposed for image processing. A computer vision system with depth sensing ability is called a stereo vision system, and stereo vision is at the core of computer vision technologies. However, one camera can only obtain two-dimensional (2D) information of the environment image, which is unable to reconstruct 3D coordinates. To overcome this shortcoming of a single camera, in this study two cameras are used to calculate 3D coordinates. The details are described in the following sub-sections.




2.1. Projective Transform

     The projective transform model of one camera projects real objects or a scene onto the image plane. As shown in Fig. 1, assume that the coordinate of object P in the real world is (X, Y, Z) relative to the origin (0, 0, 0) at the camera center. After the transform, the coordinate of P' projected by P onto the image plane is (x, y, f) relative to the image origin (0, 0, f), where f is the distance from the camera center to the image plane. Using similar-triangle geometry to relate the actual object P and its projected point P' on the image plane, the relationship between the two points is as follows:

$$x = f\,\frac{X}{Z} \qquad (1)$$

$$y = f\,\frac{Y}{Z} \qquad (2)$$

     Therefore, even if P'(x, y, f) captured from the image plane is known, we still cannot calculate the depth Z of point P and determine its coordinate P(X, Y, Z) according to (1) and (2) unless we know one of X, Y (height), or Z (depth).

               Figure 1. Perspective projection of one camera.

2.2. Image Depth

     From the previous discussion, we know that it is impossible to accurately calculate the depth or height of an object or scene from the information of one camera, even if many conditions are known in advance. Therefore, several studies use the overlapping views of two [5] or more cameras to calculate the depth or height of an object or scene, as shown in Fig. 2.

    Figure 2. The relationship between depth and disparity for two cameras.

     Nowadays the cost of a camera has become very low; therefore, in this study, we chose two cameras fixed in parallel to solve the problem of depth and height. The usage of parallel cameras reduces the complexity of the correspondence problem. In Fig. 3, we easily derive Xl and Xr by using similar triangles, and we have:

$$X_l = \frac{Z\,x_l}{f} \qquad (3)$$

$$X_r = \frac{Z\,x_r}{f} \qquad (4)$$

Assume that the optical axes of the two cameras are parallel to each other, where b is the distance between the two camera centers and b = Xl − Xr. C and G are the projected points of P on the left and right image planes, respectively. The disparity d is defined as d = xl − xr. From (3) and (4), we have:

$$b = X_l - X_r = \frac{Z\,(x_l - x_r)}{f} = \frac{Z\,d}{f} \qquad (5)$$

Therefore, the image depth Z can be given by:

$$Z = \frac{f\,b}{d} \qquad (6)$$

          Figure 3. Projection transform of two cameras and disparity.

As shown in Fig. 4, the real-world height of an object can be derived from the height of the object image based on the assumption of a pinhole camera and the image-forming geometry:

$$Y = \frac{y\,Z}{f} \qquad (7)$$

                        Figure 4. Image-forming geometry.
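To make the geometry of Eqs. (1)–(7) concrete, depth and height recovery reduce to two one-line formulas. The Python sketch below is only an illustration of those relations, assuming f is the focal length in pixels, b the baseline in centimeters, d the disparity in pixels, and y the image height of the object in pixels; the function names are illustrative.

```python
def depth_from_disparity(f_pixels, baseline_cm, disparity_pixels):
    """Eq. (6): Z = f * b / d, with Z in the same unit as the baseline."""
    return f_pixels * baseline_cm / float(disparity_pixels)

def object_height(f_pixels, image_height_pixels, depth_cm):
    """Eq. (7): Y = y * Z / f, the real-world height of the object."""
    return image_height_pixels * depth_cm / float(f_pixels)

# Example with the calibrated focal length reported later in the paper
# (f = 874 pixels) and a 20 cm baseline: a disparity of 44 pixels maps
# to a depth of roughly 4 m.
if __name__ == "__main__":
    z = depth_from_disparity(874, 20, 44)
    print(round(z), "cm")   # ~397 cm
```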




Due to the rapid correspondence between the two cameras, the method is highly efficient in calculating the depth and height of objects and is suitable for ALV navigation. The method finds the disparity d from two corresponding points (for instance, C and G in Fig. 3) in the left and right images, respectively. Here, the accuracy of the two corresponding points is very important. Regarding the disparity value d as an image intensity shown by gray values (0 to 255), the whole set of disparities forms an image, called the disparity map (DM) or DM image. The DM construction proposed by Birchfield and Tomasi [6] is employed in this paper. The advantage of this construction method is that it quickly obtains all depths, including the depths of discontinuous, occluded, and mismatched points; the disadvantage is that the obtained disparity map lacks accuracy. Fig. 5 shows a disparity map generated from left and right images.

    Figure 5. The disparity map: (a) left and right images, (b) disparity map.

                 3. PROPOSED METHOD

      From (6) [7], we know that when an object is far from the two cameras, its disparity value becomes small, and vice versa. In Fig. 6, there is obviously a nonlinear relationship between these two terms. The disadvantage of the DM is that the farther the objects are from the two cameras, the smaller the disparity values become, which makes separating the object from the background difficult. Therefore, we need to find the suitable baseline b that yields a clearer DM for each depth region of the two cameras. The processing steps are described in the following sub-sections.

Region segmentation

     We partition the scene into three levels, near, middle and far, and obtain the best DM of the depth in each region. In this paper, we define the near region as the distance from 0 m to 5 m, the middle region as from 5 m to 10 m, and the far region as over 10 m.

Acquisition of the best baseline b

     Acquiring the best baseline b means finding the appropriate camera baseline b on the basis of the different depths of the region. Table I and Table II show the relationship between the depth Z and the two-camera baseline b. We set d = 30 as the threshold value dth, and regard regions with d less than dth as background. Therefore, when the depth Z is known, the disparity d can be obtained from Table I and Table II for the different baselines, and we then find the most appropriate value of b that makes the value of d closest to or greater than dth. For example, 20 cm is the best b for the short-range region (0 m ~ 5 m), and 40 cm for the medium-range region (5 m ~ 10 m).

Calculation of the depth and height

     The cameras are calibrated [8] in advance, and we then obtain the focal length f = 874 pixels. Substituting the obtained d for an object into (6), we find the distance Z between the camera and the object, and Z is then substituted into (7) to calculate the object height Y [9], which can usually be used to decide whether the object is an obstacle.

      Figure 6. The relationship between distance and disparity.

   TABLE I.  DISPARITY VALUES d (PIXELS) VS. DEPTH Z=1M~5M AND BASELINE b=10CM~150CM.

   b (cm) \ Z (m)      1      2      3      4      5
        10            87     44     29     22     17
        20           175     87     58     44    35*
        30           262    131     87     66     52
        40           350    175    117     87     70
        50           437    219    146    109     87
        60           524    262    175    131    105
        70           612    306    204    153    122
        80           699    350    233    175    140
        90           787    393    262    197    157
       100           874    437    291    219    175
       110           961    481    320    240    192
       120          1049    524    350    262    210
       130          1136    568    379    284    227
       140          1224    612    408    306    245
       150          1311    656    437    328    262

   *: The best disparity for short-range region.
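The baseline selection rule described above (choose the smallest b whose predicted disparity at the far edge of the region still reaches dth = 30 pixels) follows directly from Eq. (6). The Python fragment below is a minimal sketch of that rule under the stated assumptions (f = 874 pixels, candidate baselines of 10–150 cm); the function name is illustrative.

```python
def best_baseline(depth_cm, f_pixels=874, d_th=30,
                  candidates_cm=range(10, 151, 10)):
    """Pick the smallest baseline whose disparity at `depth_cm`
    reaches the threshold d_th (cf. Tables I and II)."""
    for b in candidates_cm:
        d = f_pixels * b / float(depth_cm)   # Eq. (6) rearranged: d = f*b/Z
        if d >= d_th:
            return b
    return max(candidates_cm)

# 500 cm (far edge of the near region)    -> 20 cm baseline
# 1000 cm (far edge of the middle region) -> 40 cm baseline
print(best_baseline(500), best_baseline(1000))
```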




   TABLE II.  DISPARITY VALUES d (PIXELS) VS. DEPTH Z=6M~10M AND BASELINE b=10CM~150CM.

   b (cm) \ Z (m)      6         7          8        9          10
       10          15          12         11       10           9
       20          29          25         22       19          17
       30          44          37         33       29          26
       40          58          50         44       39         35*
       50          73          62         55       49          44
       60          87          75         66       58          52
       70          102        87         76       68          61
       80          117        100         87       78          70
       90          131        112         98       87          79
      100          146        125        109      97          87
      110          160        137        120      107          96
      120          175        150        131      117         105
      130          189        162        142      126         114
      140          204        175        153      136         122
      150          219        187        164      146         131
 *: The best disparity for medium-range region.
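The paper relies on the Birchfield–Tomasi construction for the DM itself. For readers who want a quick, runnable stand-in, the OpenCV block-matching stereo matcher gives a comparable dense disparity image; this is only a substitute for experimentation, not the implementation used in the paper, and the file names are placeholders.

```python
import cv2
import numpy as np

# Hypothetical file names; any rectified left/right pair will do.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo; numDisparities must be a multiple of 16.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # pixels

# Scale to 0-255 gray values so the DM can be viewed as an image,
# as described in the text.
dm = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("disparity_map.png", dm)
```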



               4. EXPERIMENTAL RESULTS

      The proposed methods have been implemented and tested on a 2.8 GHz Pentium IV PC. Fig. 7 shows the two cameras mounted on a sliding rail so that they can be pulled apart to change the baseline distance. From Section 3, we know that the best b for the short-range region 0 m ~ 5 m is 20 cm, and 40 cm for the medium-range region 5 m ~ 10 m. Therefore, we set two persons standing at distances of 4 m and 8 m from the two cameras, with the two-camera baseline b = 20 cm, as shown in Fig. 8. Because the person standing at 4 m is in the short-range region, he can be seen clearly. However, the other person, standing at 8 m, is in the medium-range region and is difficult to separate from the background.

             Figure 7. Experiment platform of stereo vision.

    Figure 8. The disparity map: (a) left and right images, (b) disparity map (Z = 400 cm and 800 cm, b = 20 cm).

     Comparing Fig. 9 and Fig. 10, where the distance from the person to the cameras is 8 m (medium-range region) and the baseline is changed from b = 20 cm to b = 40 cm, the results show that with b = 40 cm the person (object) becomes clearer, as shown in Fig. 10.

    Figure 9. The disparity map: (a) left and right images, (b) disparity map (Z = 800 cm, b = 20 cm).








   Figure 10. The disparity map (a) left image and right image (b)
                disparity map (Z=800cm,b=40cm).



                      5. CONCLUSION

     From the experimental results, we have found that a suitable baseline for the two cameras helps us obtain a better disparity. However, if the object is far from the two cameras, its disparity value becomes small and close to that of the background, so the object is not easily detected. Using the proposed method of changing the baseline of the two cameras, the object becomes clearer and easier to detect, and more 3D object information is obtained. The results can be used in many applications, for example ALV navigation. In the future, we plan to remove the horizontal-stripe noise inside the DM so that the DM can be displayed better.

                        REFERENCES

[1] A. Elfes, "Using occupancy grids for mobile robot perception and navigation," Computer Magazine, pp. 46-57, June 1989.
[2] J. Hancock, M. Hebert, and C. Thorpe, "Laser intensity-based obstacle detection," 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 3, pp. 1541-1546, 1998.
[3] E. Elkonyaly, F. Areed, Y. Enab, and F. Zada, "Range sensory-based navigation in unknown terrains," in Proc. SPIE, Vol. 2591, pp. 76-85.
[4] 陳禹旗, Detection of Roads and Obstacles Using 3D Vision Information for Outdoor Autonomous Land Vehicle Navigation with Artificial Intelligence Strategies, Master's thesis, Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei, 2003.
[5] 張煜青, A Study of Outdoor Autonomous Land Vehicle Navigation Using Binocular Stereo Computer Vision with Artificial Intelligence Strategies, Master's thesis, Graduate Institute of Automation Technology, National Taipei University of Technology, Taipei, 2003.
[6] S. Birchfield and C. Tomasi, "Depth Discontinuities by Pixel-to-Pixel Stereo," International Journal of Computer Vision, pp. 269-293, Aug 1999.
[7] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Press, 2008.
[8] http://www.vision.caltech.edu/bouguetj/calib_doc/
[9] L. Zhao and C. Thorpe, "Stereo- and Neural Network-Based Pedestrian Detection," IEEE Trans. Intelligent Transportation Systems, Vol. 3, No. 3, pp. 148-154, Sep 2000.




A MULTI-LAYER GMM BASED ON COLOR-TEXTURE COMBINATION
            FEATURE FOR MOVING OBJECT DETECTION


          Tai-Hwei Hwang (黃泰惠), Chuang-Hsien Huang (黃鐘賢), Wen-Hao Wang (王文豪)

         Advanced Technology Center, Information and Communications Research Laboratories,
            Industrial Technology Research Institute, Chutung, HsinChu, Taiwan ROC 310
                         E-mail: {hthwei, DavidCHHuang, devin}@itri.org.tw



                        ABSTRACT                                          background scene. The background scene contains the
                                                                          images of static or quasi-periodically dynamic objects,
Foreground detection generally plays an important role in the             for instance, sea tides, a fountain, or an escalator. The
intelligent video surveillance systems. The detection is based            representation of background scene is basically a
on the characteristic similarity of pixels between the input              collection of statistics of pixel-wise features such as
image and the background scene. To improve the
                                                                          color intensities or spatial textures. The color feature
characteristic representation of pixel, a color and texture
combination scheme for background scene modeling is
                                                                          can be the RGB components or other features derived
proposed in this paper. The color-texture feature is applied              from the RGB, such as HSI, or YUV expression. The
into a four-layer structured GMM, which can classify a pixel              texture accounts for information of intensity variation in
into one of states of background, moving foreground, static               a small region centered by the input pixel, which can be
foreground and shadow. The proposed method is evaluated                   computed by the conventional edge or gradient
with three in-door videos and the performance is verified by              extraction algorithm, local binary pattern [1], etc. The
pixel detection accuracy, false positive and false negative rate          statistical background models of pixel color and textures
based on ground truth data. The experimental results                      are respectively efficient when the moving objects are
demonstrate it can eliminate shadow significantly but without
                                                                          with different colors from background objects and are
many apertures in foreground object.
                                                                          full of textures for either background or foreground
                                                                          moving objects. For example, it is hard to detect a
                  1.   INTRODUCTION
                                                                          walking man in green from a green bush using the color
                                                                          feature only. In this case, since the bush is full of
Wide range deployment of video surveillance system is
                                                                          different textures from the green cloth, the man can be
getting more and more importance to security
                                                                          easily detected by background subtraction with texture
maintenance in a modern city as the criminal issue is
                                                                          feature. However, this will not be the case when only
strongly concerned by the public today. However,
                                                                          using the texture difference to detect the man walking in
conventional video surveillance systems need heavy
                                                                          the front of flat white wall because of the lack of texture
human monitoring and attention. The more cameras
                                                                          for both the cloth and the wall. Therefore, some studies
deployed, the more inspection personnel employed. In
                                                                          are conducted to combine the color and the texture
addition, attention of inspection personnel is decreased
                                                                          information together as a pixel representation for
over time, resulting in lower effectiveness at recognizing
                                                                          background scene modeling [2][3][4]. In addition to the
events while monitoring real-time surveillance videos.
                                                                          different modeling abilities of color and texture, texture
To minimize the involved man power, research in the
                                                                          feature is much more robust than color under the
field of intelligent video surveillance is blooming in
                                                                          illumination change and is less sensitive to slight cast
recent years.
                                                                          shadow of moving object.
Among the studies, background subtraction is a
                                                                          Though the combination of color and texture can
fundamental element and is commonly used for moving
                                                                          provide a better modeling ability and robustness for
object detection or human behavior analysis in the
                                                                          background scene under illumination change, it is not
intelligent visual surveillance systems. The basic idea
                                                                          enough to eliminate a slightly dark cast shadow or to
behind the background subtraction is to build a
                                                                          keep an invariant scene under stronger illumination
background scene representation so that moving objects
                                                                          change or automatic white balance of camera. To
in the monitored scene can be detected by a distance
                                                                          improve the robustness of background modeling further,
comparison between the input image and the




a simple but efficient way to eliminate shadows is to               waving leaves. In this study, we propose a four-layer
filter pixels casted by shadows according to the                    scene model which classes each pixel into four states, i.e.
chromatic and illuminative changes. In the illuminative             background, static foreground, moving foreground, and
component the value of the shadow pixel is lower than               shadow. We improve Gallego’s work by modeling
that in background model; while in the chromatic                    background with mixture Gaussians of color and texture
component, it shows slightly different from that in the             combined feature and design related mechanisms for
background model. Therefore, shadows can be detected                state transition. In addition, we also bring the concept of
by using thresholding technique to obtain the pixels                shadow learning, based on the work [7], into the
which are satisfied with these physical characteristics.            proposed scene model. The structure and the
Cucchiara et al. [5] transformed video frames from RGB              mechanisms of our background scene model are
space to Hue-Saturation-Intensity (HSI) space to                    described in section 2. Section 3 reveals applicable
highlight these physical characteristics. In the work of            scenarios and experimental results. Section 4 presents
Shan et al. [6], they evaluated the performance of                  the conclusions and our future works
thresholding-based shadow detection approach on
different color spaces such as HSI, YCrCb, c1c2c3,
L*a*b. To sum up, conventional approaches are based                         2. MULTI-LAYER SCENE MODEL
on transforming the RGB features to other color
domains or features, which have better characteristics to           Figure 1 illustrates the flowchart of the multi-layer scene
represent shadows. But no matter what kind of color                 model. In the first stage, the color and texture
spaces or features is adopted, users usually need to set            representation have to be obtained for all pixels in the
one or more threshold values to filter shadows out.                 input image. Four layers which represent the states of
                                                                    background, shadow, static foreground and moving
   Recently, Nicolas et al. [7] proposed an online-                 foreground, are modeled separately. For each pixel i
learning approach named Gaussian Mixture Shadow                     belonging to the current frame, if it is fit to the
Model (GMSM) for shadow detection. The GMSM                         background model, the background model is updated
utilities two Gaussian mixture models (GMM) [8] to                  and the pixel is then labeled as the state of background.
model the static background and casting shadows,                    Otherwise, the pixel is passed to the shadow layer.
respectively. Afterward, Tanaka et al. [9] used the same
idea but modeled the distributions of background and                   In the shadow layer, i is examined whether it is
shadows non-parametrically by Parzon windows. It is                 satisfied to be a shadow candidate by a weak shadow
faster than GMSM but costs more storage space. Both of              classifier, which was designed according to the shadow
them are based on statistical analysis and have better              physical characteristics such as the mentioned chromatic
discriminative power on shadows, especially when the                and illuminative changes. If i is determined as a shadow
color of moving object shows similar attribute to the               candidate, the shadow layer is updated by the pixel’s
pixels covered by shadows.                                          color features. If i shows strong fitness to the dominant
                                                                    Gaussian of the updated shadow model, its state is then
   On the other hand, maintenance of static foreground              labeled as shadow. For the pixel which is not satisfied to
objects is also an important issue for background                   being a shadow candidate or does not fit to the shadow
modeling. The static foreground objects are those                   model, we pass it to the static foreground layer.
objects that, after entering into the surveillance scene,
reach a position and then stop their motion. Examples               Consequentially, if i dose not fit the static foreground
are such as cars waiting for traffic lights, browsing               model if it exists, i is passed to the moving foreground
people in shops, or abandoned luggage in train stations.            layer. When i fits the moving foreground model, it is the
In traditional GMM-based background models [8], the                 circumstance that the state of the moving object is from
static foreground objects are usually absorbed into                 moving to staying at the current position. As a result, we
background after a given time period, which usually                 update the moving foreground model by i’s color
proportional to the learning rate of the background                 features. A counter named CountMF corresponding to the
model. The current state-of-the-art technique to                    moving background model is increased as well. When
distinguish the static foreground objects from static               CountMF reaches another user-defined threshold T2, we
background and moving objects is to maintain a multi-               replace the static foreground model by the moving
layer model representing background, moving                         foreground model and CountSF is set to zero. Otherwise,
foreground and static foreground separately [10,11].                if i does not fit the moving foreground model, we use it
                                                                    to reinitialize the moving foreground model, i.e. set
In the work of Gallego et al. [11], they proposed a three-          CountMF to zero, and then update the background model
layer model which comprises moving foreground, static               by the past moving foreground model. The reason of
foreground and background layer. However, they                      using the moving foreground model to update the
modeled background by using a single Gaussian, which                background is to allow the background model having the
can not cope with the multi-mode background such as                 ability to deal with the multi-mode background problem




such as waving leaves, ocean waves or traffic lights.                   The background model is first initialized with a set of
Details of feature extraction stage, the background and              training data. For example, the training data could be
the shadow layers are described in the following                     collected from the first L frames of the testing video.
subsections.                                                         After that, each pixel at frame t can be determined
                                                                     whether it matches to the m-th Gaussian Nm by satisfying
2.1. Feature extraction stage                                        the following inequality for all components {xC ,i , xT ,i } ∈ x :

The color-texture feature is a vector including
                                                                                    ( xC , i − µC ,i , m ) 2                     ( xT ,i − µT , i, m ) 2
                                                                             dC                                           dT
                                                                         λ                                         1− λ
                                                                                                B                                           B
components of RGB and local difference pattern (LDP)
as the texture in a local region. The LDP is an edge-like               dC   ∑
                                                                             i =1     k × (σ C ,i , m ) 2
                                                                                             B
                                                                                                               +
                                                                                                                    dT    ∑
                                                                                                                          i =1     k × (σT , i , m ) 2
                                                                                                                                         B
                                                                                                                                                           <1   (3)

feature which is consisted of intensity differences
between predefined pixel pairs. Each component of LDP                where dC and dT denote vector dimension of color and
is computed by                                                       texture, respectively, λ is the color-texture combination
                                                                     weight, k is a threshold factor and we set it to three
                LDPn(C)=I(Pn)-I(C),                     (1)
                                                                     according to the three-sigma rule (a.k.a. 68-95-99.7 rule)
where C and Pn represent the pixel and thereof neighbor              of normal distribution. The weights of Gaussian
pixel n, respectively, and I(C) represents the gray level            distribution are sorted in decreasing order. Therefore if
intensity of pixel C. The gray level intensity can be                the pixel matches to the first nB distributions, where nB is
computed by the average of RGB components. Four                      obtained by Eq. (4), it is then classified as the
types of pattern defining the neighbor pixels are                    background [13].
depicted in Figure 2 and are separately adopted to
compare their performance of moving object detection                                                      b             
experimentally.                                                                                       b 
                                                                                                                   ∑
                                                                                           n B = arg min  π m > 1 − p f 
                                                                                                                         
                                                                                                                                                                (4)
                                                                                                          m=1           

                                                                     where pf is a measure of the maximum proportion of the
                                                                     data that belong to foreground objects without
                                                                     influencing the background model.

                                                                        When a pixel fits the background model, the
                                                                     background model is updated in order to adapt it to
Fig. 2. Four types of pattern defining neighbor pixels               progressive image variations. The update for each pixel
for computation of LDP                                               is as follows:
2.2. Background Layer
                                                                                           π m ← π m + α (om − π m ) − αcL
                                                                                             B     B             B
                                                                                                                                                                (5)
The GMM background subtraction approach presented
by Stauffer and Grimson [8] is a widely used approach                                      µ m ← µ m + om (α / π m )(x − µ m )
                                                                                             B     B             B         B
                                                                                                                                                                (6)
for extracting moving objects. Basically, it uses couples
of Gaussian distribution to model the reasonable                        (σ m ) 2 ← (σ m ) 2 + om (α / π m )((x − µ m )T (x − µ m ) − (σ m ) 2 )
                                                                           B          B                 B          B           B        B
                                                                                                                                                                (7)
variation of the background pixels. Therefore, an
unclassified pixel will be considered as foreground if the           where α=1/L is a learning rate and cL is a constant value
variation is larger than a threshold. We consider non-               (set to 0.01 herein [14]). The ownership om is set to 1 for
correlated feature components and model the                          the matched Gaussian, and set to 0 for the others.
background distribution with a mixture of M Gaussian
distributions for each pixel of input image:                         2.3. Shadow Layer
                        M
              p (x) =   ∑π
                        m=1
                              B            B      B
                              m N m ( x; µ m , Iσ m )
                                                        (2)          The problem of color space selection for shadow
                                                                     detection has been discussed in [6][12]. Their
                                                                     experimental results revealed that performing cast
                                                     B
where x represents the feature vector of a pixel, µ m is             shadow detection in CIE L*u*v, YUV or HSV is more
                      B
                                                                     efficient than in RGB color space. Considering that the
the estimated mean, σ m is the variance, and I represents
                                                                     RGB-to-CIE L*u*v transform is nonlinear and the Hue
the identity matrix to keep the covariance matrix                    domain is circular statistics in HSV space, YUV color
isotropic for computational efficiency. The estimated                space shows more computing efficiency due to its
mixing weights, denoted by π m , are non-negative and
                             B
                                                                     linearity of transforming from RGB space. In addition,
they add up to one.                                                  YUV is also for interfacing with analogy and digital
                                                                     television or photographic equipment. As a result, YUV




In addition, YUV is convenient for interfacing with analog and digital television or photographic equipment. As a result, YUV color features were adopted in this study, i.e. the color components x mentioned in the previous subsection. It is worth reminding that Y stands for the illuminative component and U and V are the chromatic components.

A pixel which does not fit the background model is then passed to the shadow layer. First, it is examined whether it qualifies as a shadow candidate by a weak shadow classifier according to the following rules [7]:

    r_min < x^Y / µ_B^Y < r_max                         (8)
    | x^U − µ_B^U | < Λ^U                               (9)
    | x^V − µ_B^V | < Λ^V                               (10)

where x^Y, x^U and x^V are the YUV components of the pixel and µ_B^Y, µ_B^U and µ_B^V are the corresponding means of the background model, respectively. The parameters r_min, r_max, Λ^U and Λ^V are user-defined thresholding values. Users just need to set them roughly through a friendly graphical user interface (GUI) because the more precise shadow classification will further be made by the following shadow GMM.

Similar to the background layer, the shadow layer is also modeled by a GMM, but only the color features of shadow candidates are fed in. For initialization, r_min, r_max, Λ^U and Λ^V are used to derive the first Gaussian of each color component, and its weight is set to one. The corresponding means and variances of the first Gaussian are obtained by the following equations:

    µ_S^Y = µ_B^Y (r_max + r_min) / 2                   (11)
    σ_S^Y = (µ_B^Y r_max − µ_S^Y) / 3                   (12)
    µ_S^U = µ_B^U                                       (13)
    σ_S^U = Λ^U / 3                                     (14)

where the subscripts B and S refer to the background and shadow models, and µ_S^V and σ_S^V are calculated in the same way as Eq. (13) and Eq. (14). If a feature vector x is not matched to any Gaussian distribution, the Gaussian which has the smallest weight is replaced with µ = x and σ = [σ_0 σ_0 σ_0]^T, where σ_0 is an initial variance.
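For illustration, the sketch below applies the weak shadow test of Eqs. (8)-(10) and the shadow-layer initialization of Eqs. (11)-(14) to a single YUV pixel; the default threshold values are placeholders rather than the values used in this work.

    import numpy as np

    def is_shadow_candidate(x, mu_B, r_min=0.4, r_max=0.9, lam_U=10.0, lam_V=10.0):
        """Weak shadow classifier of Eqs. (8)-(10).
        x, mu_B: YUV vectors [Y, U, V] of the pixel and of the matched
        background Gaussian mean. The threshold values here are placeholders."""
        ratio_ok = r_min < x[0] / mu_B[0] < r_max        # Eq. (8): darker, but not too dark
        u_ok = abs(x[1] - mu_B[1]) < lam_U               # Eq. (9): chroma U barely changes
        v_ok = abs(x[2] - mu_B[2]) < lam_V               # Eq. (10): chroma V barely changes
        return ratio_ok and u_ok and v_ok

    def init_shadow_gaussian(mu_B, r_min, r_max, lam_U, lam_V):
        """First Gaussian of the shadow layer, Eqs. (11)-(14)."""
        mu_S = np.empty(3)
        sigma_S = np.empty(3)
        mu_S[0] = mu_B[0] * (r_max + r_min) / 2.0        # Eq. (11)
        sigma_S[0] = (mu_B[0] * r_max - mu_S[0]) / 3.0   # Eq. (12)
        mu_S[1], mu_S[2] = mu_B[1], mu_B[2]              # Eq. (13) and its V analogue
        sigma_S[1] = lam_U / 3.0                         # Eq. (14)
        sigma_S[2] = lam_V / 3.0                         # V component treated the same way
        return mu_S, sigma_S, 1.0                        # weight of the first Gaussian is one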
3. EXPERIMENTS

3.1. Experimental setting

Four videos are used in the experiments. Video 1 is collected from a roadside camera of a real surveillance system operated by the police department of Taichung County (PDTC), and the others are recorded indoors. Video 2 is collected by our colleagues at a porch with a glossy wall that reflects objects slightly; video 3 and video 4 are selected from the video dataset at http://cvrr.ucsd.edu/aton/shadow, entitled intelligentroom_raw and Laboratory_raw, respectively. The image size of these videos is 320x240 pixels per frame. The color-texture representation of a pixel is the vector concatenation of RGB and LDP. The second pattern in Figure 2 is adopted for the computation of LDP in the experiments. The effect of shadow elimination is shown not only by background masks but also by the pixel detection accuracy rate (Acc.), false positive rate (FPR) and false negative rate (FNR) when ground truth data is available. These quantitative measures are defined as follows:

    Acc. = #TP / (#TP + #FP + #FN)                      (15)
    FPR  = #FP / (#TP + #FP + #FN)                      (16)
    FNR  = #FN / (#TP + #FP + #FN)                      (17)

where TP is short for true positive and #TP means the number of TP pixels in a frame. In general, false positives result from moving cast shadows and false negatives are apertures inside the foreground regions. In the following figures of background masks, the pixels depicted in black, red, white and green represent the background region, false negatives, moving foreground and shadows, respectively. The experiments are performed on a personal computer with a Pentium 4 3.0-GHz CPU and 2 GB of RAM. The processing frame rate is about 15 frames/second.
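A minimal sketch of Eqs. (15)-(17), assuming the detection result and the ground truth are given as boolean foreground masks of the same size:

    import numpy as np

    def detection_rates(detected, truth):
        """Pixel-wise Acc., FPR and FNR of Eqs. (15)-(17) for one frame.
        detected, truth: boolean foreground masks of identical shape."""
        tp = np.count_nonzero(detected & truth)          # correctly detected foreground
        fp = np.count_nonzero(detected & ~truth)         # e.g. cast shadows kept as foreground
        fn = np.count_nonzero(~detected & truth)         # apertures inside true foreground
        denom = tp + fp + fn
        if denom == 0:
            return 1.0, 0.0, 0.0                         # empty frame: nothing to miss or over-detect
        return tp / denom, fp / denom, fn / denom        # Acc., FPR, FNR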
3.2. Effect of color-texture combination

To check the effectiveness of the color-texture combination feature, an experiment of background subtraction using this feature but with only a single background-layer GMM is conducted. The result of video 1 is demonstrated in Figure 3. The combination weights λ are set to 1, 0.3 and 0 for the experiments in columns 2, 3 and 4 of Figure 3, respectively. When λ = 1, i.e., only the color feature is effectively used, there are significant false positives caused by shadows and camera brightness control in the background masks of column 2. When λ = 0, i.e., only the texture feature is effectively used, most of the false positives disappear but many apertures show up in the foreground regions of column 4 because of the lack of texture in both the road scene (background) and most of the surface of the car (foreground). When λ = 0.3, i.e., the combined color-texture feature is used, the number and size of the apertures in the results of column 3 become smaller than in the results of column 4.

3.3. Results of using multi-layer GMM

The experimental results of using the multi-layer GMM on videos 2, 3 and 4 are demonstrated in Figures 4, 5 and 6, respectively. The results in columns 2, 3 and 4 of each figure are obtained by using the single-layered RGB, multi-layered RGB and multi-layered RGB+LDP background models, respectively. Detection rates, including the detection accuracy, false positive rate and false negative rate of pixels, of videos 2 and 3 are computed based on ground-truth data and are printed on each frame. In addition, the average detection rates are tabulated for each method in Tables 1 and 2. As shown in these figures and tables, the multi-layer GMM with the RGB+LDP feature significantly outperforms the method without the LDP.

                         Acc. (%)   FPR (%)   FNR (%)
RGB only                  63.89      31.53      4.58
RGB+shadow layer          70.05       4.91     25.04
RGB+LDP+shadow layer      80.27      14.43      5.30
Table 1. Average detection rates of moving objects in video 2.

                         Acc. (%)   FPR (%)   FNR (%)
RGB only                  45.19      53.94      0.87
RGB+shadow layer          76.85       1.21     21.94
RGB+LDP+shadow layer      82.89      15.87      1.25
Table 2. Average detection rates of moving objects in video 3.

5. CONCLUSION

This study presents a multi-layer scene model for applications of video surveillance. The proposed scene model uses an RGB+LDP feature to represent each pixel and classifies each pixel into four different states comprising background, moving foreground, static foreground and shadow. As shown in the experimental results, both the modeling ability and the illumination invariance are significantly improved by including the texture information.

ACKNOWLEDGEMENT

This paper is a partial result of project 9365C51100 conducted by ITRI under the sponsorship of the Ministry of Economic Affairs, Taiwan.

REFERENCES

[1] M. Heikkilä and M. Pietikäinen, "A texture-based method for modeling the background and detecting moving objects", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 4, pp. 657-662, April 2006.

[2] O. Javed, K. Shafique and M. Shah, "A hierarchical approach to robust background subtraction using color and gradient information", In Proc. of IEEE Workshop on Motion and Video Computing, 2002.

[3] K. Yokoi, "Illumination-robust Change Detection Using Texture Based Features", In Proc. of IAPR Conference on Machine Vision Applications, 2007.

[4] J. Yao and J. Odobez, "Multi-Layer Background Subtraction Based on Color and Texture", In Proc. of IEEE CVPR, 2007.

[5] R. Cucchiara, C. Grana, M. Piccardi, A. Prati and S. Sirotti, "Improving Shadow Suppression in Moving Object Detection with HSV Color Information", In Proc. of IEEE Intelligent Transportation Systems Conference, pp. 334-339, 2001.

[6] Y. Shan, F. Yang and R. Wang, "Color Space Selection for Moving Shadow Elimination", In Proc. of 4th International Conference on Image and Graphics, pp. 496-501, 2007.

[7] N. Martel-Brisson and A. Zaccarin, "Learning and Removing Cast Shadows through a Multidistribution Approach", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 7, pp. 1133-1146, 2007.

[8] C. Stauffer and W. E. L. Grimson, "Adaptive Background Mixture Models for Real-time Tracking", In Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 246-252, 1999.

[9] T. Tanaka, A. Shimada, D. Arita and R. Taniguchi, "Non-parametric Background and Shadow Modeling for Object Detection", Lecture Notes in Computer Science, No. 4843, pp. 159-168, 2007.

[10] E. Herrero-Jaraba, C. Orrite-Urunuela and J. Senar, "Detected Motion Classification with a Double-background and a Neighborhood-based Difference", Pattern Recognition Letters, Vol. 24, pp. 2079-2092, 2003.

[11] J. Gallego, M. Pardas and J.-L. Landabaso, "Segmentation and Tracking of Static and Moving Objects in Video Surveillance Scenarios", In Proc. of IEEE International Conference on Image Processing, pp. 2716-2719, 2008.

[12] C. Benedek and T. Sziranyi, "Study on Color Space Selection for Detecting Cast Shadows in Video Surveillance", International Journal of Imaging Systems and Technology, Vol. 17, pp. 190-201, 2007.
[13] M. Izadi and P. Saeedi, "Robust Region-based Background Subtraction and Shadow Removing using Color and Gradient Information", In Proc. of International Conference on Pattern Recognition, pp. 1-5, 2008.

[14] Z. Zivkovic and F. van der Heijden, "Recursive Unsupervised Learning of Finite Mixture Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 7, pp. 773-780, 2006.
Fig. 1. Flowchart of the proposed multi-layer scene model.
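The per-pixel decision cascade of Fig. 1 can be paraphrased as the sketch below; the argument and action names are illustrative only, and the transfer conditions follow the flowchart loosely rather than reproducing the exact implementation.

    def classify_pixel(fits_background, shadow_candidate, fits_shadow,
                       fits_static_fg, fits_moving_fg, sf_count, mf_count, T1, T2):
        """Decision cascade paraphrasing Fig. 1. The boolean arguments are the
        outcomes of matching the pixel's color-texture feature against each
        layer's GMM (plus the weak shadow test); sf_count and mf_count are the
        dwell counters of the static and moving foreground layers. Returns the
        pixel state, the model action to take, and the updated counters."""
        if fits_background:
            return "background", "update background model", sf_count, mf_count
        if shadow_candidate and fits_shadow:
            return "shadow", "update shadow model", sf_count, mf_count
        if fits_static_fg:
            sf_count += 1
            action = ("transfer static foreground to background"  # dwelled longer than T1
                      if sf_count > T1 else "update static foreground model")
            return "static foreground", action, sf_count, mf_count
        if fits_moving_fg:
            mf_count += 1
            if mf_count > T2:                                      # stopped moving long enough
                return "static foreground", "transfer moving to static foreground", 0, 0
            return "moving foreground", "update moving foreground model", sf_count, mf_count
        return "moving foreground", "reinitialize moving foreground model", sf_count, 0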
Fig. 3. Results of background subtraction controlled by combination weight of color and texture




                        Fig. 4. Foreground detection results of video 2.




Fig. 5. Foreground detection results of video 3, the intelligentroom_raw.




 Fig. 6. Foreground detection results of video 4, the Laboratory_raw.




Adaptive Traffic Scene Analysis by Using Implicit Shape Model


                                       Kai-Kai Hsu, Po-Chyi Su and Kai-Yi Cheng
                                 Dept. of Computer Science and Information Engineering
                                               National Central University
                                                    Jhongli, Taiwan
                                            Email: pochyisu@csie.ncu.edu.tw


   Abstract—This research presents a framework for analyzing the traffic information in the surveillance videos from static roadside cameras to assist in resolving the vehicle occlusion problem for more accurate traffic flow estimation and vehicle classification. The proposed scheme consists of two main parts. The first part is a model training mechanism, in which the traffic and vehicle information will be collected and their statistics are employed to automatically establish the model of the scene and the implicit shape model of vehicles. The second part adopts the flexibly trained models for vehicle recognition when possible occlusions of vehicles are detected. Experimental results show the feasibility of the proposed scheme.

   Keywords-Vehicle; traffic surveillance; occlusion; SIFT

                     I. INTRODUCTION

   Developing Intelligent Transportation Systems (ITS) has been a major research effort in recent years. Through the integration of advanced computing facilities, electronics, communication and sensor technologies, ITS can provide real-time information to help maintain the traffic order or to ensure the safety of pedestrians and drivers. As more and more surveillance cameras are deployed along local roads and highways, the visual information provided by these surveillance videos becomes an important part of ITS. The traffic information obtained by the vision-based approach can assist traffic flow control, vehicle counting and categorization, etc. In addition, emergent traffic events may be detected right after they happen by advanced visual processing so that the corresponding processes can be applied in a more active way.

   Vehicle detection/classification by the vision-based approach is a challenging issue and various methods have been proposed in recent years. It should be noted that the appearances of vehicles in the surveillance videos from different cameras are quite diverse because of the different locations, heights, angles and views of the cameras. In addition, the weather condition and the time of video recording, e.g. morning or evening, may also affect the vehicle detection process. It is quite difficult to establish a common model in advance for all the surveillance videos. Nevertheless, if we choose to construct a model for each individual surveillance video, a great deal of human effort will be required, given that there are so many roadside cameras. Therefore, one objective of our research is to enable the procedures of model construction to run in an automatic manner so that the customized model of each scene can be established. The other objective of this research is to provide an approach to deal with the vehicle occlusion problem, in which multiple vehicles appear in the video scene and certain parts of them overlap, in the vehicle detection. The occlusions of vehicles occur quite often in cameras set up at the streets, cause ambiguity in vehicle detection and may lead to inaccurate measurement of traffic parameters, such as the traffic flow volume. We adopt a so-called "Implicit Shape Model" (ISM) to recognize the vehicles and reasonably help in solving the occlusion problem. The proposed scheme has two parts, i.e. the self-training mechanism and the construction of the implicit shape model for resolving vehicle occlusion. The organization of this paper is as follows. A review of the related works is described in Section II. The proposed method is presented in Section III. Preliminary results are shown in Section IV and the concluding remarks are given in Section V.

                     II. RELATED WORKS

   There have been active research efforts on automatic vision-based traffic scene analysis in recent years [1]–[6]. Levin et al. [1] proposed to collect the training examples by a coarse detector, and the training examples are used to build the final pedestrian detector. The classification criterion for the coarse detector has to be defined manually. Wu et al. [3] employed an online boosting method to enhance the performance of the system. A prior detector built by off-line learning is employed to train the posterior detector, which adopts unsupervised learning. Nair et al. [7] also employed a supervised way for the initial training. Hsieh et al. [2] adopted a different approach that detects the lanes of the surveillance video automatically in the initial stage. Vehicle features such as size and linearity are used to detect and classify vehicles, instead of using a large amount of labeled training data. The vehicle size information has to be pre-defined manually. Zhou et al. [4] proposed an example-based moving vehicle detection. The vehicles are detected according to the luminance changes by the background subtraction. The features are extracted from those examples using PCA and trained as a detector by SVM. Celik et al. [5] presented an unsupervised and on-line approach. A coarse object detector is used to extract the moving objects by the background subtraction and then the obtained samples are refined by clustering based on the similarity matrix. The extracted features are separated into good and bad positives for training a final detector via SVM.
Celik et al. [6] then addressed an automatic classification method for identifying pedestrians and vehicles by SIFT.

   Regarding the occlusion problem, various solutions have been proposed [8]–[16]. We roughly classify the approaches into 3D model-based methods, feature-based methods and others. The 3D model is a popular solution for the vehicle occlusion problem. Pang et al. [8], [9] detect a vanishing point first in the traffic scene. A 3D deformable model is used to estimate each viewpoint of the vehicle occlusion and transform it into a 2D representation. The occlusion is detected by obtaining the curvature points on the shape of the vehicles, and occluded vehicles are separated into individual vehicles. The vanishing point is also adopted by Yoneyama et al. [10]. A hexagon is used to approximate the shape of a vehicle for eliminating shadows, and a multiple-camera system is utilized to detect the occlusion problem. Song et al. [11] proposed to employ vehicle shape models, camera calibration and ground plane knowledge to detect, track and classify the occlusion by estimating the related likelihood. Lou et al. [12] established a 3D model for tracking vehicles, and an improved extended Kalman filter was also presented to track and predict the vehicle motion. Most 3D model methods require precise camera calibration and vehicle detection. In feature-based methods, the occlusion can be resolved by tracking partially visible features of the occluded vehicles. Kanhere et al. [13] proposed to track vehicles in low-angle situations and estimate the 3D height of features on the vehicles. The feature points are detected and tracked throughout the image sequence. The feature-based methods may be influenced by similar shapes from the background. Zhang et al. [14] presented a multilevel framework, which consists of the intra-frame, inter-frame and tracking levels. At the intra-frame level, an occlusion is detected by evaluating the convex compactness ratio of the vehicle shape and resolved by removing a "cutting region." At the inter-frame level, an occlusion is detected by the statistics of motion vectors of vehicles. At the tracking level, the detected vehicles are tracked for resolving the full occlusion. Tsai et al. [15] detect the vehicles by using color and edges. The vehicle color usually looks unique and can be used for searching possible vehicle locations. Then the edge maps and coefficients of the wavelet transform are used for examining the vehicle candidates. Wang and Lien [16] proposed an automatic vehicle detection based on significant subregions of vehicles, which are transformed to PCA weighting vectors and ICA coefficient vectors. The position information is estimated by a likelihood probability evaluation process.

            III. THE PROPOSED SELF-TRAINING SCHEME

A. System overview

   Our system is aimed at resolving the vehicle occlusion problem for more accurate estimation of the traffic flow at the scene captured by a static traffic surveillance camera. Our scheme mainly relies on establishing two models, i.e. the traffic scene model and the implicit shape model of vehicles, for effective traffic scene analysis. The models should be trained in advance so that the pixels covering a single vehicle can be correctly located. Since the traffic scenes from different cameras may vary significantly, this training process has to be applied for each individual camera. If this process were carried out manually, it would require considerable amounts of human effort. In order to provide a more feasible solution, we develop a "self-training" adaptive scheme so that these models will be built in a more automatic manner without involving a great deal of human effort. Considering that the settings of traffic surveillance cameras are usually fixed without rotation and the corresponding traffic scenes tend to be static, i.e. the background of the traffic scene is invariant, we extract a long video segment from the target camera for building the models. It should be noted that typical vehicles should appear in the extracted long video segment and their related information can thus be collected as references for future usage.

   Fig. 1 demonstrates our system framework. The background of the scene will be constructed from the traffic surveillance video by an iteratively updating method so that the background subtraction can be applied to extract the vehicle masks. Although an extracted vehicle mask may contain a single vehicle or occluded ones, it is assumed that the long video used for training should contain a large number of single vehicles and that the vehicles of the same type should exhibit a similar shape/size. Even if many occlusions happen, their shapes are usually quite different. Therefore, the majority-voting methodology can be employed to determine such static information in the target traffic video, including the traffic flow directions and the vehicle shape/size of the different types at the scene, to construct our first model, i.e. the scene model. The second model, i.e. the shape model, which will be used for recognizing vehicles, especially the occluded vehicles, is said to be implicitly established since image features are extracted and grouped without explicitly resorting to the exact shapes of vehicles. The scale-invariant feature transform (SIFT) will be used to extract effective features from the segmented vehicle masks of consecutive frames to indicate the pixels covering vehicles more precisely.

            Figure 1. The proposed framework.

   The statistics of the vehicle size information obtained by the occlusion detection is analyzed and will be utilized to classify the vehicle types. By the results of the statistics, the vehicles can be classified into motorcycles, sedan cars and buses according to the vehicle size information.
The step of vehicle pattern extraction and classification will collect various types of vehicle masks. The classification is implemented with the vehicle size information obtained from the traffic information analysis. After the system has run for a period of time, there will be enough vehicle masks to establish the implicit shape model. We detail the procedures of our proposed system as follows.

B. Background Model Construction

   A series of traffic surveillance frames is utilized to construct the background image of the traffic scene captured by a static roadside camera so that the moving vehicles can be detected by the background subtraction. Let B^i_{x,y} be the pixel at (x, y) of the background image; the background updating function is given by

    B^{i+1}_{x,y} = (1 − α M^b_{x,y}) B^i_{x,y} + α M^b_{x,y} F^i_{x,y}        (1)

in which F^i_{x,y} is the pixel at (x, y) in frame i, α is the small learning rate, and M^b_{x,y} is the binary mask of the current frame. If the pixel at (x, y) belongs to the background part, M^b_{x,y} = 1 to turn on the updating; otherwise, M^b_{x,y} is set to 0 to avoid updating the background with the moving objects. An example of a scene with its constructed background is demonstrated in Fig. 2.

            Figure 2. Background image construction.
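A minimal sketch of the update rule of Eq. (1), assuming the frame, the background image and the binary mask are NumPy arrays of compatible shapes:

    import numpy as np

    def update_background(background, frame, update_mask, alpha=0.05):
        """Iterative update of Eq. (1). update_mask is the binary mask M^b
        (1 where the background is allowed to absorb the current frame,
        0 on moving objects); alpha is the small learning rate."""
        m = update_mask.astype(frame.dtype)
        if m.ndim == 2 and frame.ndim == 3:
            m = m[..., None]                 # broadcast the mask over color channels
        return (1.0 - alpha * m) * background + alpha * m * frame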
C. Occlusion Vehicle Detection

   It has been observed that the shape of a non-occluded vehicle should be close to its convex hull and that the shape of occluded vehicles will show certain concavities, as illustrated in Fig. 3. This characteristic can be used to roughly extract the non-occluded vehicles. In our implementation, the compactness Γ is used to evaluate how close the vehicle's shape and its convex hull are. That is,

    Γ = V_s / V_c,                                      (2)

where V_s and V_c represent the vehicle area from the background subtraction and the vehicle convex-hull area, respectively. When the value of Γ is closer to one, the vehicle area is similar to its convex hull area, which indicates that an occlusion may not have happened. In the training process, our system tries to extract non-occluded vehicle patterns, so we set a high threshold to ensure that most of the extracted vehicle patterns contain single vehicles.

            Figure 3. Convex hulls of (a) a non-occluded vehicle and (b) occluded vehicles.

D. Traffic Information Analysis

   As mentioned before, we require that our system be executed in a more automatic manner to reduce the human effort for tuning the parameters. Our scheme obtains the direction of the traffic appearing in the scene and the common vehicle size information from the statistics of the surveillance videos in the training phase. For analyzing the direction of traffic, the vehicle movements must be attained first. SIFT is employed to identify features on vehicles. After the vehicle segmentation, the vehicles are transformed into SIFT feature descriptors. The features of consecutive frames are compared and the positions of the movements are recorded. After a period of time, the main direction of traffic in the surveillance scene can be observed from the resultant movement histogram. In addition, the Region of Interest (ROI) can be identified to facilitate the subsequent processing. The ROI is located in the area of the detected traffic flow and near the bottom of the captured traffic scene, where vehicles appear larger and can offer more information.

   After determining the ROI, we can collect the vehicle patterns or masks that appear in the ROI. In the training phase, vehicle patterns that are determined to contain single vehicles based on the convex hull analysis will be archived. Then we can check the size histogram of the archived vehicles to set up the criterion for roughly classifying them. In our test videos, the most common vehicles are motorcycles, sedan cars and buses.
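The convex-hull compactness test of Eq. (2), used above to archive only single-vehicle patterns, can be sketched with OpenCV as follows; the threshold value is a placeholder, not the one used in the experiments.

    import cv2
    import numpy as np

    def is_single_vehicle(mask, gamma_thresh=0.9):
        """Compactness test of Eq. (2): V_s / V_c, the ratio between the blob
        area from background subtraction and the area of its convex hull.
        A high threshold keeps only patterns that are likely single vehicles."""
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return False
        contour = max(contours, key=cv2.contourArea)   # largest blob in the mask
        v_s = cv2.contourArea(contour)
        v_c = cv2.contourArea(cv2.convexHull(contour))
        return v_c > 0 and v_s / v_c > gamma_thresh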
When we examine the histogram of the sizes of the collected single-vehicle patterns, there will be obvious peaks. To be more specific, we basically make use of the peaks to determine the sizes of common motorcycles and sedan cars since they appear more often. We can then set up the upper and lower bounds of sedan cars and use them as the reference to assign a lower bound of the bus size. In the detection phase, if a vehicle mask is large and close to its convex hull, the pattern may be determined as a bus. Otherwise, an occlusion may have happened and this has to be resolved by using ISM. In other words, after the rough classification according to the vehicle sizes, we proceed to use the vehicle patterns to establish the codebooks of ISM, which will then be used for resolving the vehicle occlusions.

E. Implicit Shape Model

   Leibe et al. [17] proposed to use ISM for learning the shape representations in detecting the most possible locations of vehicles in images or frames. The object categorization is achieved by learning the appearance variability of an object category in a codebook. The investigated image will be compared with the codewords in the codebook that have a similar shape and then a weighted voting procedure will be applied to address the object detection. The steps of ISM are as follows.

   1) Shape Model Establishment: In visual object recognition, we have to determine the correspondence of the image features with the structures of the object, even under different conditions. To employ a flexible representation for object recognition, a codebook is built for representing features that appear on the training images quite often, and similar features are clustered. A codeword in the codebook should be a compact representation of the local appearances of objects. Given an unknown image structure, we will try to match it with a possible representation or codeword in the codebook. Then, many such matches are collected and we can infer the existence of that object. Again, the scale-invariant interest point detector is employed to detect the feature points on the training images and the extracted image regions are then translated into a representation by a local descriptor. Next, the visually similar features are grouped to construct a codebook for representing the local appearances of a certain object. The k-means algorithm is used to partition the features into k clusters, in which each feature is assigned to the cluster center with the nearest distance. The codebook generation process is shown in Fig. 4.

            Figure 4. The codebook training procedure.

   After building the codebook, the spatial probability distribution is defined for each codebook entry. It records the positions of the training vectors where the codebook entry is found. The position of each feature is defined relative to the object center. We match the features from the training images with the codebook entries. When the similarity of a feature with any entry is above a threshold, the position relative to the object center is recorded along with the codebook entry. After matching the training images with the codebook entries, we obtain the spatial probability distribution.

   2) Recognition Approach: Given a target image, the features are extracted by SIFT and matched to the codewords in the codebook. When the similarity between the extracted features and the codebook entries is higher than a threshold, these matches are collected. According to the spatial probability distribution, these matched codebook entries cast votes for the object center. When a feature of the target image extracted at (x_img, y_img, s_img), in which (x, y) is the location and s means the scale, is determined to have a match with a codebook entry, the positions (x_pos, y_pos, s_pos) recorded in this codebook entry cast votes for the object center. The voting is applied by

    x_vote = x_img − x_pos (s_img / s_pos)              (3)
    y_vote = y_img − y_pos (s_img / s_pos)              (4)
    s_vote = s_img / s_pos                              (5)

where (x_vote, y_vote, s_vote) is a vote for the object center. After all the matched codebook entries have voted, we store these votes for a probability density estimation mechanism, which is used to obtain the most possible location of the object center.

   Next, we collect the votes in a binned 3D accumulator array and search for the local maxima to speed up the computation. The local maxima are detected by comparing each member of the binned 3D accumulator array to its 26 neighbors in the 3×3×3 region. Then, the Mean-Shift approach [18] is employed to refine the local maxima for a more accurate location. The Mean-Shift approach can locate the maxima of a density function given discrete data sampled from that function. It will quickly converge to more precise locations of the local maxima after several iterations.

   The refined local maxima can be regarded as candidates of the object center. Thus, the following criterion is used to estimate the existing probability of the object:

    score(l_c) = (1 / V(s_c)) Σ_i w_i Ker((l_c − l_i) / b(s_c)),        (6)

where Ker() is a kernel function; b(s_c) is the kernel bandwidth; V(s_c) is the volume of the kernel; w_i and l_i are the weighting factor and the location of the i-th vote, respectively; l_c and s_c are the location and scale of the local maximum. The kernel function Ker() can be treated as a search window for the position of the object center. If the vote location l_i is inside the kernel, the Ker() function returns a value of one; otherwise, it returns zero.
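The voting of Eqs. (3)-(5) and the binned 3D accumulator can be sketched as follows; the layout of the matched feature/offset pairs is assumed here for illustration.

    import numpy as np

    def cast_votes(matches):
        """Each match pairs an image feature (x_img, y_img, s_img) with one
        recorded offset (x_pos, y_pos, s_pos) of a codebook entry; Eqs. (3)-(5)
        turn it into a vote for the object center in (x, y, scale) space."""
        votes = []
        for (x_img, y_img, s_img), (x_pos, y_pos, s_pos) in matches:
            r = s_img / s_pos
            votes.append((x_img - x_pos * r,          # Eq. (3)
                          y_img - y_pos * r,          # Eq. (4)
                          r))                         # Eq. (5)
        return np.array(votes)

    def binned_accumulator(votes, bins=(64, 48, 8)):
        """Collect the votes in a binned 3D accumulator so that local maxima
        (candidate object centers) can be searched quickly, as described above."""
        hist, edges = np.histogramdd(votes, bins=bins)
        return hist, edges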
For the 3D voting space, we use a spherical kernel whose radius is the bandwidth b(s_c), which is adaptive to the scale s_c of the local maximum. As the object scale increases, the kernel bandwidth should also increase for an accurate estimation. Therefore, we sum up all the weighting values that are inside the kernel and divide them by the volume V(s_c) to obtain an average weight density, which is called the score. After the score is derived, we define a threshold θ for determining whether the object exists. When the score is above θ, the hypothesized object center is preserved. Finally, we back-project the votes that support this hypothesized object center to obtain an approximate shape of the object.

F. Occlusion Resolving

   After detecting the existence of certain occluded vehicles in the image, we need to classify them into different types. In our scheme, we construct the codebooks of the different types of vehicles. Each type of vehicle codebook will be established automatically after we obtain enough vehicle patterns collected by the process of vehicle extraction. However, as shown in Fig. 5, the performance of recognition is not as good as expected since many errors happen on the bus image. Owing to the fact that the area of a bus is much larger than that of a sedan car and that there are many similar local appearances in these two types, errors of this kind occur quite often. We provide a refining procedure as follows.

            Figure 5. (a) Multi-type vehicle error detection and (b) the result after the refining procedure.

   All the hypotheses are supported by the contributing votes that are cast by the matched features. Theoretically, every extracted feature should support only one hypothesis since it is not possible that one feature belongs to two vehicles. Thus, we modify these hypotheses after executing multiple recognition procedures. We first store all the hypotheses whose scores are over a threshold. Then all the hypotheses are refined by checking each contributing vote that appears in two hypotheses at the same time. The hypothesis with the higher score can retain this vote while the vote is eliminated from the others. Next, the scores of these hypotheses are recalculated. When the new score is above the threshold, the hypothesis is preserved. After this refining procedure, the number of error detections can be reduced.

   There exists another problem in the vehicle recognition by using ISM. As shown in Fig. 6, there are three bounding boxes on the same vehicle. It means that the recognition result includes some error detections in which ISM has produced multiple hypotheses on this vehicle. Since this multiple-detection problem comes from the fact that ISM searches the local maxima in the scale-space, as shown in Fig. 7, the scheme may find several local maxima in different scale levels but at a similar location. In fact, these local maxima are generated by the same vehicle center. Therefore, the unnecessary hypotheses should be eliminated. We deal with the problem by estimating the overlapped area between two bounding boxes. When the overlapped area between two bounding boxes is very large, we can claim that the bounding box that has the weaker score is an error detection. For efficient computation, the rate of overlap is evaluated by finding the distance between the central points of the two bounding boxes and using the diagonal line of the larger bounding box as the criterion: the shorter the distance is, the more the areas overlap. In other words, for every two bounding boxes, we need to check

    distance(B1, B2) < (1/3) D,                         (7)

where B1 and B2 denote the central points of the two bounding boxes and D is the diagonal line of the larger one. In our implementation, when the distance is smaller than (1/3) D, the overlapped area of the bounding boxes is above 50% and we thus remove the bounding box that has the lower score. The error detections from ISM can thus be reduced.

            Figure 6. (a) Multiple hypotheses detected in one vehicle and (b) the results from the refining procedure.
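A sketch of this duplicate-hypothesis removal based on Eq. (7); the bounding boxes are assumed to be (x, y, w, h) tuples paired with their detection scores, a layout chosen here only for illustration.

    import math

    def center(box):
        x, y, w, h = box
        return x + w / 2.0, y + h / 2.0

    def remove_duplicates(detections):
        """Keep the stronger hypothesis when two bounding boxes satisfy Eq. (7),
        i.e. their centers are closer than one third of the diagonal of the
        larger box. Each detection is ((x, y, w, h), score)."""
        kept = []
        for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
            cx, cy = center(box)
            is_duplicate = False
            for kept_box, _ in kept:
                kcx, kcy = center(kept_box)
                larger = max(box, kept_box, key=lambda b: b[2] * b[3])
                diag = math.hypot(larger[2], larger[3])          # diagonal of the larger box
                if math.hypot(cx - kcx, cy - kcy) < diag / 3.0:  # Eq. (7)
                    is_duplicate = True                          # overlap above roughly 50%
                    break
            if not is_duplicate:
                kept.append((box, score))
        return kept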
Figure 10. The vehicle size statistics for (a) Scene 1 and (b) Scene 2 (number of occurrences per minute versus vehicle size, in units of 100 pixels).
Figure 7. The smaller the distance between the centers of two bounding boxes, the larger their overlap; this distance is therefore employed to remove duplicated detections.

Table I. VEHICLE PATTERN EXTRACTION

           Total   Error   Correct rate
Scene 1     940      15       98.4%
Scene 2    1251      31       97.5%
Figure 8. The views of the two surveillance videos. (a) Scene 1. (b) Scene 2.

A. Traffic Information Analysis

The directions of the traffic flows of the two scenes are illustrated in Fig. 9. The red points represent forward-moving vehicles and the blue points represent backward-moving vehicles. We can see that the directions of the traffic flows are successfully obtained after training on the video for a while. It should be noted that the heavier the traffic volume is, the less time is needed. The vehicle size statistics for Scene 1 and Scene 2 are exhibited in Fig. 10. There are two peaks in each scene: the left peak, which corresponds to a smaller vehicle size, represents motorcycles, while the right one, which corresponds to a larger vehicle size, stands for sedan cars. In Scene 1, according to Fig. 10, we assign a lower bound of 700 pixels and an upper bound of 1000 pixels for the motorcycle size. The lower and upper bounds of the sedan car size are 1700 pixels and 3300 pixels, respectively. In Scene 2, the motorcycle size is assigned bounds of 1400 and 2100 pixels, while the sedan car size is assigned a lower bound of 4000 pixels and an upper bound of 8500 pixels. We can see that the vehicle size information, i.e., the motorcycle and sedan car sizes, can be obtained successfully from the statistics of the surveillance video.
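As a concrete illustration of how these size bounds can be used, the sketch below labels foreground blobs by area with the Scene 1 bounds quoted above; the blob extraction itself (background subtraction and connected-component analysis) is assumed and not shown.

```python
# Labeling foreground blobs by area with the Scene 1 bounds quoted above.
MOTORCYCLE_RANGE = (700, 1000)     # pixels, Scene 1
SEDAN_RANGE = (1700, 3300)         # pixels, Scene 1

def label_by_size(blob_area,
                  moto_range=MOTORCYCLE_RANGE,
                  sedan_range=SEDAN_RANGE):
    """Return a coarse vehicle label for one foreground blob."""
    if moto_range[0] <= blob_area <= moto_range[1]:
        return 'motorcycle'
    if sedan_range[0] <= blob_area <= sedan_range[1]:
        return 'sedan'
    return 'unknown'               # e.g. buses, occluded groups, noise

# Example: areas (in pixels) of three blobs detected in one frame.
print([label_by_size(a) for a in (850, 2400, 5200)])
# -> ['motorcycle', 'sedan', 'unknown']
```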
B. Vehicle Pattern Extraction and Classification

The various extracted vehicle patterns are demonstrated after they pass the occlusion detection process, which ensures that they have no occlusion problem. In our experiment, we set the threshold of Eq. (2) to 0.9 for extracting sedan cars and buses and to 0.8 for motorcycles. We apply the shape analysis to sedan cars and buses but not to motorcycles, since motorcycles cannot be well approximated by a convex hull. The performance of the vehicle extraction is summarized in Table I. These vehicle patterns will be employed for training. It should be noted that the errors usually come from unstable environmental conditions, which affect the construction of the background image. The vehicle classification results are summarized in Table II. Some extracted patterns from Scene 1 are illustrated in Figs. 11-13. We can see that the vehicle patterns can be effectively extracted, and they will be helpful in training a more accurate codebook or models.
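The shape analysis above relies on sedan cars and buses being well approximated by their convex hulls. A minimal sketch of one plausible convexity (solidity) test is given below; the exact form of Eq. (2) is defined earlier in the paper and may differ, so this measure and the 0.9/0.8 thresholds are used here only for illustration.

```python
import cv2

def solidity(mask):
    """Ratio of blob area to convex hull area for a binary (uint8) vehicle mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    contour = max(contours, key=cv2.contourArea)
    hull_area = cv2.contourArea(cv2.convexHull(contour))
    return cv2.contourArea(contour) / hull_area if hull_area > 0 else 0.0

def accept_pattern(mask, vehicle_type):
    """Keep the pattern only if it looks un-occluded (cf. the thresholds above)."""
    threshold = 0.9 if vehicle_type in ('sedan', 'bus') else 0.8
    return solidity(mask) >= threshold
```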

Figure 9. The directions of the traffic flows for (a) Scene 1 and (b) Scene 2.

Table II. VEHICLE PATTERN CLASSIFICATION

                   Motorcycle                      Sedan car
           Total   Error   Correct rate    Total   Error   Correct rate
Scene 1     135      3        97.8%         765      34       95.6%
Scene 2     159      2        98.7%         826      46       94.4%

C. Occlusion Resolving

Table III and Figs. 14-16 demonstrate the results of occlusion resolving. We use the extracted vehicle patterns to train the ISM codebooks for the two scenes. Table III reports the performance of resolving occlusions of sedan cars: the occlusion part of Table III denotes the sedan cars that actually occlude with other vehicles, while the non-occlusion part stands for the sedan cars that are not occluded by other vehicles but pass the occlusion detection. As shown in Figs. 14 and 15, several sedan cars are partially occluded. We use the trained ISM to resolve the occlusions.
Figure 11. The extracted motorcycle patterns from Scene 1.

Figure 12. The extracted sedan car patterns from Scene 1.

Figure 13. The extracted bus patterns from Scene 1.

Figure 14. Occlusion resolving of sedan cars in Scene 1.

Figure 15. Occlusion resolving of sedan cars in Scene 2.

The red points and bounding boxes represent the vehicles' central coordinates and positions detected by ISM. In Fig. 16, we resolve the occlusion between two types of vehicles, i.e., a bus and a sedan car. By combining ISM and the proposed self-training mechanism, these occlusion problems can be reasonably resolved.
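For reference, the sketch below illustrates the general ISM-style center voting scheme [17] that underlies this occlusion resolving: matched codebook entries cast weighted votes for the object center, and local maxima of the accumulated vote map become hypotheses. The feature matching and the learned vote offsets are assumed to be available; this is a schematic illustration rather than the exact procedure used here.

```python
import numpy as np

def vote_map_from_matches(votes, frame_shape, cell=4):
    """Accumulate weighted center votes on a coarse grid.

    votes: iterable of ((x, y), weight) pairs produced by matching image
    features against an ISM codebook (the matching itself is not shown).
    """
    h, w = frame_shape
    grid = np.zeros((h // cell + 1, w // cell + 1), dtype=np.float32)
    for (x, y), weight in votes:
        grid[int(y) // cell, int(x) // cell] += weight
    return grid

def center_hypotheses(grid, score_threshold, cell=4):
    """Return (x, y, score) for local maxima of the vote map above a threshold."""
    hyps = []
    for r in range(1, grid.shape[0] - 1):
        for c in range(1, grid.shape[1] - 1):
            s = grid[r, c]
            if s >= score_threshold and s == grid[r-1:r+2, c-1:c+2].max():
                hyps.append((c * cell, r * cell, float(s)))
    return sorted(hyps, key=lambda t: t[2], reverse=True)
```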


Figure 16. Resolving the partial occlusion of a sedan car and a bus.

Table III. SEDAN CAR OCCLUSION RESOLVING RATE

                          Total   Miss   False alarm   Recall   Precision
Scene 1   occlusion        177     35        46        80.2%     75.5%
Scene 1   non-occlusion     88      1         2        98.9%     97.8%
Scene 2   occlusion         92     16        21        82.6%     78.2%
Scene 2   non-occlusion    130      2        12        98.4%     99.2%

V. CONCLUSION
We have proposed a framework for analyzing the traffic information in surveillance videos captured by static roadside cameras. The traffic and vehicle information is collected from the videos to train the related models automatically. For vehicles without occlusion, we can use the scene model to record and classify them. If an occlusion happens, the implicit shape model is employed. The experimental results demonstrate a potential solution to the occlusion problems in traffic surveillance videos. Future work will further improve the accuracy and the execution speed.

                        REFERENCES

 [1] O. Javed, S. Ali, and M. Shah, “Online detection and classification of moving objects using progressively improving detectors,” Computer Vision and Pattern Recognition, vol. 1, pp. 696–701, 2005.

 [2] J. Hsieh, S. Yu, Y. Chen, and W. Hu, “Automatic traffic surveillance system for vehicle tracking and classification,” IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 2, pp. 175–187, 2006.

 [3] B. Wu and R. Nevatia, “Improving part based object detection by unsupervised, online boosting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07, 2007, pp. 1–8.
 [4] J. Zhou, D. Gao, and D. Zhang, “Moving vehicle detection
     for automatic traffic monitoring,” IEEE transactions on
     vehicular technology, vol. 56, no. 1, pp. 51–59, 2007.

 [5] H. Celik, A. Hanjalic, E. Hendriks, and S. Boughor-
     bel, “Online training of object detectors from unlabeled
     surveillance video,” in IEEE Computer Society Conference
     on Computer Vision and Pattern Recognition Workshops,
     2008. CVPRW’08, 2008, pp. 1–7.

 [6] H. Celik, A. Hanjalic, and E. Hendriks, “Unsupervised
     and simultaneous training of multiple object detectors from
     unlabeled surveillance video,” Computer Vision and Image
     Understanding, vol. 113, no. 10, pp. 1076–1094, 2009.

 [7] V. Nair and J. Clark, “An unsupervised, online learning framework for moving object detection,” Computer Vision
     and Pattern Recognition, vol. 2, pp. 317–324, 2004.

 [8] C. Pang, W. Lam, and N. Yung, “A novel method for
     resolving vehicle occlusion in a monocular traffic-image
     sequence,” IEEE Transactions on Intelligent Transportation
     Systems, vol. 5, pp. 129–141, 2004.

 [9] ——, “A method for vehicle count in the presence of
     multiple-vehicle occlusions in traffic images,” IEEE Trans-
     actions on Intelligent Transportation Systems, vol. 8, no. 3,
     pp. 441–459, 2007.

[10] A. Yoneyama, C. Yeh, and C. Kuo, “Robust vehicle and
     traffic information extraction for highway surveillance,”
     EURASIP Journal on Applied Signal Processing, vol. 2005,
     p. 2321, 2005.

[11] X. Song and R. Nevatia, “A model-based vehicle segmen-
     tation method for tracking,” in Tenth IEEE International
     Conference on Computer Vision, 2005. ICCV 2005, 2005,
     pp. 1124–1131.

[12] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, “3-D model-based vehicle tracking,” IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1561–1569, 2005.

[13] N. Kanhere, S. Birchfield, and W. Sarasua, “Vehicle segmentation and tracking in the presence of occlusions,” Transportation Research Record: Journal of the Transportation Research Board, vol. 1944, pp. 89–97, 2006.

[14] W. Zhang, Q. Wu, X. Yang, and X. Fang, “Multilevel Framework to Detect and Handle Vehicle Occlusion,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 161–174, 2008.

[15] L. Tsai, J. Hsieh, and K. Fan, “Vehicle detection using normalized color and edge map,” IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 850–864, 2007.

[16] C. Wang and J. Lien, “Automatic Vehicle Detection Using Local Features: A Statistical Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 83–96, 2008.

[17] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” International Journal of Computer Vision, vol. 77, no. 1, pp. 259–289, 2008.

[18] Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790–799, 1995.
An Augmented Reality Based Navigation System for Museum Guidance


             Jun-Ming Pan, Chi-Fa Chen                                   Chia-Yen Chen, Bo-Sen Huang, Jun-Long Huang,
            Dept. of Electrical Engineering,                                    Wen-Bin Hong, Hong-Cyuan Syu
                  I-Shou University                                                     Vision and Graphics Lab.
                                                                              Dept. of Computer Science and Information
                                                                                               Engineering
                                                                            Nat. University of Kaohsiung, Kaohsiung, Taiwan
                                                                                            ayen@nuk.edu.tw


Abstract— The paper describes the design and an implementation of an augmented reality based navigation system used for guidance through a museum. The aim of this work is to improve the level of interaction between a viewer and the system by means of augmented reality. In the implemented system, hand motions are captured via computer vision based approaches and analyzed to extract representative actions which are used to interact with the system. In this manner, tactile peripheral hardware such as the keyboard and mouse can be eliminated. In addition, the proposed system also aims to reduce hardware related costs and avoid health risks associated with contamination by contact in public areas.

   Keywords- augmented reality; computer vision; human computer interaction; multimedia interface;

             I.   INTRODUCTION AND BACKGROUND

    The popularity of computers has induced a widespread usage of computers as information providers in public facilities such as museums or other tourist attractions. However, in most locations, the user is required to interact with the system via tactile means, for example, a mouse, a keyboard, or a touch screen. With a large number of users coming into contact with the hardware devices, it is hard to keep the devices free from bacteria and other harmful contaminants which may cause health concerns to subsequent users. In addition, constant handling increases the risk of damage to the devices, incurring higher maintenance costs to the providing party. Thus, it is our aim to design and implement an interactive system using computer vision approaches, such that the above mentioned negative effects may be eliminated. Moreover, we also intend to enhance the efficiency of the interface by increasing the amount of interaction, which can be achieved by means of a multimedia, user augmented reality interface.

    In this work, we realize the proposed idea by implementing an interactive system that enables the user to interact with a terminal via a pamphlet, which can easily be produced in a museum and distributed to visitors. The pamphlet contains summarized information about the exhibition or objects of interest. However, due to the size of the pamphlet, it is not possible to put in a lot of information; besides, too much textual information tends to make the visitor lose interest in the exhibition. Therefore, the implemented system aims to provide more in-depth visual and auditory information, as well as interactive 3D viewing of the objects, which otherwise cannot be provided by a pamphlet alone. In addition, an interactive guidance system will have more impact and create a more interesting experience for the visitors.

    The implemented system does not require the keyboard or the mouse for interaction. Instead, a camera and a pamphlet are used to provide the necessary input. To use the system, the user is first given a pamphlet, as often given out to visitors to the museum; he/she can then check out the different objects by moving his/her finger across the paper and pointing to the pictures of objects on the pamphlet. The location of the fingertip is captured by an overhead camera and the images are analyzed to determine the user's intended actions. In this manner, the user does not need to come into contact with anything other than the pamphlet that is given to him/her, thus eliminating health risks due to direct contact with harmful substances or contaminated surfaces.

    To implement the system, we make use of technologies in augmented reality. Augmented reality (AR) has received a lot of attention due to its attractive characteristics, including real-time immersive interaction and freedom from cumbersome hardware [7]. There have been many applications designed using AR technologies in areas such as medical applications, entertainment, and military navigation, as well as many other new possibilities.

    An AR system usually incorporates technologies from different fields. For example, technologies from computer graphics are required for the projection and embedding of virtual objects; video processing is required to display the virtual objects in real time; and computer vision technologies are required to analyse and interpret actions from input image frames. As such, an AR system is usually realized by a cross-disciplinary combination of techniques.

    Existing AR systems or applications often use designated markers, such as the AR encyclopedia or other applications written with ARToolkit [8]. The markers are often bi-coloured and without details, to facilitate marker recognition. However, for the guidance application, we intend to have a system that is able to recognize colour and meaningful images of objects




or buildings as printed on a brochure or guide book and use
them for user interactions.
    The paper is organized as follows. Section 2 describes
the design of the system; section 3 describes the different
steps in the implementation of the system; section 4
discusses the operational navigation system; and section 5
provides the conclusion and discusses possible future
research directions.
                     II.    S YSTEM DESIGN
    The section describes how the system is designed and
implemented. Issues that arose during the implementation of
the system, as well as the approaches taken to resolve the
issues are also discussed in the following.
    To achieve the goals and ideas set out in the previous section, the system is designed with the following considerations.

    • Minimum direct contact: the need for a user to come into direct contact with hardware devices such as a keyboard, a mouse, or a touch screen should be minimized.

    • User friendliness: the system should be easy and intuitive to use, with a simple interface and concise instructions.

    • Adaptability: the system should be able to handle other different but similar operations with minimum modifications.

    • Cost effectiveness: we wish to implement the system using readily available hardware, to demonstrate that the integration of simple hardware can have fascinating performance.

    • Simple and robust setup: our goal is to have the system installed at various locations throughout the school or other public facilities. By having a simple and robust setup, we reduce the chances of a system failure.

    In accordance with the considerations listed above, the system is designed to have the input and output interfaces shown in Fig. 1.

           Figure 1. Diagram for the navigation interface.

           Figure 2. Concept of the navigation system.

    The system obtains input via a camera located above and overlooking the pamphlet. The camera captures images of the user's hand and the pamphlet. The images are processed and analyzed to extract the motion and the location of the fingertip. The extracted information is used to determine the multimedia data, including text, 2D pictures, 3D models, sound files, and/or movie clips, to be displayed for the selected location on the pamphlet. Fig. 2 shows the concept of the proposed navigation system.
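The paper does not fix a particular fingertip-localization technique; the sketch below shows one plausible realization of the "extract the location of the fingertip" step (skin-color segmentation plus the extreme point of the largest contour), purely as an assumed illustration.

```python
import cv2
import numpy as np

def find_fingertip(frame_bgr):
    """Return an (x, y) fingertip estimate, or None if no hand-like blob is found.

    Assumed approach: segment skin-colored pixels in HSV, take the largest
    contour, and use its topmost point (the camera looks down at the pamphlet,
    so the pointing finger tends to be an extreme point of the hand blob).
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 30, 60), (25, 180, 255))        # rough skin range
    skin = cv2.morphologyEx(skin, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(skin, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    if cv2.contourArea(hand) < 1000:                             # ignore noise
        return None
    x, y = hand[hand[:, :, 1].argmin()][0]                       # topmost point
    return int(x), int(y)
```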
                 III. SYSTEM IMPLEMENTATION

    The main steps in our system are discussed in the following.

A. Build the system using ARToolKit

    We have selected ARToolKit to develop our system, since it has many readily available high-level functions that can be used for our purpose. It can also be easily integrated with other libraries to provide more advanced functions and to implement many creative applications.

B. Create markers

    The system associates 2D markers on the pamphlet with 3D objects stored in the database, as well as with actions to manipulate the objects. This is achieved by first scanning the marker patterns, storing them in the system, and letting the program learn to recognize the patterns. In the program, each marker is associated with a particular 3D model or action, such that when the marker has been selected by the user, the associated data or action will be displayed or executed. Fig. 3 shows examples of markers used for the system. Each marker is surrounded by a black border to facilitate recognition. The object markers, as indicated by the blue arrows, are designed to match the objects to be displayed. The bottom right shows a row of markers, enclosed by the red oval, used to perform actions on the displayed 3D objects.
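A minimal, library-agnostic sketch of the marker-to-content association described above follows. The marker identifiers, model file names, and actions are illustrative placeholders; in the actual system the marker-tracking library reports which trained pattern is visible.

```python
# Illustrative association table: object markers map to 3D model files,
# action markers map to operations on the currently displayed model.
OBJECT_MARKERS = {
    'marker_gate':   'models/old_city_gate.wrl',   # hypothetical file names
    'marker_temple': 'models/temple.wrl',
}

ACTION_MARKERS = {
    'marker_plus':   ('zoom', +0.1),
    'marker_minus':  ('zoom', -0.1),
    'marker_rotate': ('rotate', 15.0),   # degrees per trigger
    'marker_reset':  ('reset', None),
}

def handle_marker(marker_id, state):
    """Update the display state when a marker is reported as selected."""
    if marker_id in OBJECT_MARKERS:
        state['model'] = OBJECT_MARKERS[marker_id]   # display this model
    elif marker_id in ACTION_MARKERS:
        action, amount = ACTION_MARKERS[marker_id]
        if action == 'zoom':
            state['scale'] = max(0.1, state['scale'] + amount)
        elif action == 'rotate':
            state['angle'] = (state['angle'] + amount) % 360.0
        elif action == 'reset':
            state.update(scale=1.0, angle=0.0)
    return state

state = {'model': None, 'scale': 1.0, 'angle': 0.0}
state = handle_marker('marker_gate', state)
state = handle_marker('marker_plus', state)
```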




                Figure 3. Markers used by the system: object markers and a row of action markers.


Figure 5. The user selects the zoom-in function to magnify the displayed 3D model.

C. Create 3D models
    The 3D models that are associated with the markers are
created using OpenGL or VRML format. These models can
be displayed on top of the live-feed video, such that the user
can interact with the 3D models in real time. The models are
texture mapped to provide realistic appearances. The models
are created in collaboration with the Kaohsiung Museum of
History [9]. Fig. 4 shows examples of the 3D models used in
the navigation system. The models are completely 3D with
texture mapping, and can be viewed from any angle by the
user.




                                                                          Figure 6. The user selects the zoom out function to shrink the displayed
                                                                                                         3D model.




                Figure 4. Examples of the 3D models used in the navigation system.


D. Implement interactive functions
    In addition to displaying the 3D models when the user
selects a marker, the system will also provide a set of actions
that the user can use to manipulate the displayed 3D model
in real time. For example, we have designed “+/-” markers
for the user to magnify or shrink the displayed 3D model.
The user simply places his/her finger on the markers and the
3D model will change size accordingly. There are also
markers for the user to rotate the 3D model, as well as reset
the model to its original size and position. Figs 5 to 7 show
the system with its implemented actions in operation. In the figures, the user simply puts a finger over the marker, and the selected actions will be performed on the displayed 3D model. Note that the actions can be applied to any 3D model that can be displayed by the system.

   Figure 7. The user uses the rotation marker to rotate the 3D object.




E. Determine selection

    A USB camera is used to capture continuous images of the scene. The program automatically scans the field of view in real time for recognized markers. Once a marker is partially obstructed by the hand, it is considered to be selected. The program then matches the selected marker with the associated 3D model or action in the database. Figs. 5 to 7 show the user selecting markers by pointing to them with a finger. From the figures, it can be seen that the selected 3D model is shown within the video window in real time. Also notice that the models are placed on top of the corresponding marker's position in the video window.
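The selection rule above can be phrased as: a marker that has been steadily visible and then disappears from the detections (because the finger covers it) is reported as selected. The sketch below illustrates this logic; the history lengths are illustrative parameters, not values from the paper.

```python
from collections import deque

class MarkerSelector:
    """Report a marker as selected when it becomes occluded after being visible."""

    def __init__(self, history=10, visible_min=6):
        self.history = {}              # marker id -> deque of recent visibility flags
        self.size = history
        self.visible_min = visible_min

    def update(self, detected_ids, known_ids):
        """detected_ids: markers found in the current frame; returns selected ids."""
        selected = []
        for marker in known_ids:
            flags = self.history.setdefault(marker, deque(maxlen=self.size))
            visible_before = sum(flags) >= self.visible_min
            now_visible = marker in detected_ids
            if visible_before and not now_visible:
                selected.append(marker)   # was steadily visible, now occluded
            flags.append(now_visible)
        return selected
```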
                  IV. NAVIGATION SYSTEM

    The proposed navigation system has been designed and implemented according to the descriptions provided in the previous sections. The system does not have high memory requirements and runs effectively on a typical PC or laptop. It also requires no expensive hardware; a USB camera is sufficient to provide the required input. It is also quite easy to set up and to customize for various objects and applications.

    The system can be placed at various points in the museum on separate terminals to enable visitors to access additional museum information in an interactive manner.

  Figure 8. The interface showing the 3D model and other multimedia information.

    Fig. 8 shows a screen shot of the system in operation. In Fig. 8, the left window is the live-feed video, with the selected 3D model shown on top of the corresponding marker's position in the video window. The window on the right-hand side shows the multimedia information that is displayed along with the 3D model to provide more information about the object. For example, when the 3D object is displayed, the window on the right might show additional textual information about the object, as well as audio files that describe the object or provide suitable background music.

                     V. CONCLUSION

    A multimedia, augmented reality interactive navigation system has been designed and implemented in this work. In particular, the system is implemented for application in providing museum guidance.

    The implemented system does not require the user to operate hardware devices such as the keyboard, mouse, or touch screen. Instead, computer vision approaches are used to obtain input information from the user via an overhead camera. As the user points to certain locations on the pamphlet with a finger, the selected markers are identified by the system, and relevant data are shown or played, including a texture-mapped 3D model of the object and textual, audio, or other multimedia information. Actions to manipulate the displayed 3D model can also be selected in a similar manner. Hence, the user is able to operate the system without contacting any hardware device except for the printout of the pamphlet.

    The implementation of the system is hoped to reduce the cost of providing and maintaining peripheral hardware devices at information terminals, while at the same time eliminating health risks associated with contamination by contact in public areas.

    Work to enhance the system is ongoing, and it is hoped that the system will be used widely in the future.

                    ACKNOWLEDGMENT

    This research is supported by the National Science Council (NSC98-2815-C-390-026-E). We would also like to thank the Kaohsiung Museum of History for providing cultural artifacts and kind assistance.

                       REFERENCES

[1]  J.-Z. Jiang, Why can Wii Win?, Awareness Publishing, 2007.
[2]  D.-Y. Lai and M. Liou, Digital Image Processing Technical Manual, Kings Information Co., Ltd., 2007.
[3]  R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, McGraw-Hill, 1995.
[4]  R. Klette, K. Schluns, and K. Koschan, Computer Vision: Three-Dimensional Data from Images, Springer, 1998.
[5]  R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd edition, Prentice Hall, 2002.
[6]  HitLabNZ, http://www.hitlabnz.org/wiki/Home, 2008.
[7]  R. T. Azuma, "A Survey of Augmented Reality," Presence: Teleoperators and Virtual Environments, vol. 6, pp. 355–385, 1997.
[8]  Augmented Reality Network, http://augmentedreality.ning.com, 2008.
[9]  H.-J. Chien, C.-Y. Chen, and C.-F. Chen, "Reconstruction of Cultural Artifact using Structured Lighting with Densified Stereo Correspondence," ARTSIT, 2009.
[10] C.-H. Liu, Hand Posture Recognition, Master thesis, Dept. of Computer Science and Eng., Yuan Ze University, Taiwan, 2006.
[11] C.-Y. Chen, Virtual Mouse: Vision-Based Gesture Recognition, Master thesis, Dept. of Computer Science and Eng., National Sun Yat-sen University, Taiwan, 2003.
[12] J. C. Lai, Research and Development of Interactive Physical Games Based on Computer Vision, Master thesis, Department of Information Communication, Yuan Ze University, Taiwan, 2005.
[13] H.-C. Yeh, An Investigation of Web Interface Modal on Interaction Design - Based on the Project of Burg Ziesar in Germany and the Web of National Palace Museum in Taiwan, Master thesis, Dept. of Industrial Design, Graduate Institute of Innovation and Design, National Taipei University of Technology, Taiwan, 2007.
[14] T. Brown and R. C. Thomas, "Finger tracking for the digital desk," in First Australasian User Interface Conference, vol. 22, no. 5, pp. 11–16, 2000.
[15] P. Wellner, "Interacting with paper on the DigitalDesk," Communications of the ACM, pp. 28–35, 1993.




Facial Expression Recognition Based on Local Binary Pattern and Support
                               Vector Machine
    Ting-Wei Lee (李亭緯)1, Yu-shann Wu (吳玉善)2, Heng-Sung Liu (柳恆崧)3 and Shiao-Peng Huang (黃少鵬)4

                            Chunghwa Telecommunication Laboratories
                               12, Lane 551, Min-Tsu Road Sec.5
                           Yang-Mei, Taoyuan, Taiwan 32601, R.O.C.
                           TEL:886 3 424-5095, FAX:886 3 424-4742
    Email: finas@cht.com.tw, yushanwu@cht.com.tw, lhs306@cht.com.tw, pone@cht.com.tw

      Abstract—Facial expression recognition has long been an important and challenging issue. In this paper, we propose a method for facial expression recognition. First, we apply a face detection method to locate the face. Then the Local Binary Pattern (LBP) operator is used to extract the facial features. When calculating the LBP features, we use an NxN window as a statistical region and move this window by a certain number of pixels. Finally, we adopt the Support Vector Machine (SVM) as a classifier to recognize the facial expression. In the experiments, we use the JAFFE database and recognize seven kinds of expressions. The average correct rate achieves 93.24%. The experimental results show that the proposed method has higher accuracy.

Keywords: facial expression, face detection, LBP, SVM

                    I.     INTRODUCTION

      Analyzing facial expressions can provide much interesting information and can be used in several applications. Taking electronic billboards as an example, we can tell whether the commercials attract the customers by facial expression recognition. In recent years, much research has worked on this technique of human-computer interaction.

      The basic key point of any image processing is to extract the facial features from the original images. Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are two widely used methods. PCA computes a set of eigenvalues and eigenvectors; by selecting the most significant eigenvectors, it produces the projection axes onto which the images are projected, minimizing the reconstruction error. The goal of LDA is to find a linear transformation that minimizes the within-class variance and maximizes the between-class variance. In other words, PCA is suitable for data analysis and reconstruction, while LDA is suitable for classification. However, the dimension of the images is usually high, so the calculations required for feature extraction are significant.

      Besides PCA and LDA, the Gabor filter method [3] is also used for facial feature extraction. This method offers both multi-scale and multi-orientation selection in choosing filters, which can present some local features of facial expression effectively. However, the Gabor filter method suffers from the same problem as PCA and LDA: it costs too much computation and yields a high-dimensional feature space.

      In this paper, we use the Local Binary Pattern (LBP) [4][5] as the facial feature extraction method. LBP has a low computation cost and efficiently encodes the texture of micro-pattern information in the face image. In the first step, we detect the face area to remove the background. We extract Haar-like [6] features and use the Adaboost [7] classifier for face detection; the face detection module can be found in the Open Source Computer Vision Library (OpenCV). After obtaining the face area, we calculate the LBP features of this area. Finally, the Support Vector Machine (SVM) classifies the LBP features and recognizes the facial expression. Experimental results demonstrate the effective performance of the proposed method.

      The rest of this paper is organized as follows: In Section II, we introduce our system flow chart and the face detection. In Section III, we explain the facial LBP representation and the SVM classifier. In Section IV, experimental results are presented. Finally, we give a brief discussion and conclusion in Section V.

                II.    THE PROPOSED METHOD

      The flow chart of the proposed facial expression recognition method is shown in Fig. 1. In the first step, face detection is performed on the original image to locate the face area. In order to reduce the region of hair or background, we take a smaller area from the face area after the face detection. In the second step, the LBP method extracts the facial expression features. When calculating the histogram of LBP features, we use an NxN window as a statistical region and move this window by a certain number of pixels. In the last step, the SVM classifier is used for the facial expression recognition.
   Figure 1. The flow chart of the proposed method: original image → face detection → LBP feature extraction → SVM classification → recognition result.

A. The Face Detection

      Viola and Jones [9] used Haar-like features for face detection. Some Haar-like feature samples are shown in Fig. 2. Haar-like features can highlight the differences between the black region and the white region. Each portion of the facial area has different properties; for example, the eye region is darker than the nose region. Hence, the Haar-like features can extract rich information to discriminate different regions.

   Figure 2. Haar-like features: the first row is for the edge features and the second row is for the line features.

      The cascade of classifiers trained by the Adaboost technique is an effective way to reduce the time for searching the face area. In this cascade algorithm, the boosted classifier combines several weak classifiers to become a strong classifier. Different Haar-like features are selected and processed by different cascaded weak classifiers. Fig. 3 shows the decision process of this algorithm. If the feature set passes through all of the weak classifiers, the region is acknowledged as a face area. On the other hand, if the feature set is denied by any weak classifier, it is rejected.

   Figure 3. The decision process of the cascaded Adaboost classifier: a candidate region passes through weak classifiers 1 to N; if it passes all of them it is accepted as a face area, otherwise it is rejected.

      The face detection module can be found in the Open Source Computer Vision Library (OpenCV) [10]. But if we use the original detection region, it may include some unnecessary areas, such as hair or background. To avoid this situation, we cut a smaller area from the detection region to reduce the unnecessary areas while keeping the important features. This area's width is 126 pixels and its height is 147 pixels. Fig. 4 shows the final result of the face area.

   Figure 4. The first column shows the original images; the second column shows the final face areas.
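As an illustration of this step, the sketch below runs OpenCV's pretrained frontal-face Haar cascade and then cuts a 126 x 147 window out of the detected region. Centering the window inside a resized face box is our assumption, since the paper does not specify the exact offsets.

```python
import cv2

# Haar-cascade face detection followed by a fixed-size crop (126 x 147).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def detect_and_crop(gray, crop_w=126, crop_h=147):
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])     # largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], (160, 160))  # normalize box size
    top = (160 - crop_h) // 2                               # assumed centering
    left = (160 - crop_w) // 2
    return face[top:top + crop_h, left:left + crop_w]
```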
   Figure 7. Representation of the statistics procedure in the width direction (the 18x21 window is shifted by 6 pixels in width and 8 pixels in height).

        III.   THE LBP METHOD AND SVM CLASSIFIER

B. Local Binary Patterns

      LBP was originally used in texture analysis. This approach is defined as a gray-level invariant measurement derived from the texture in a local neighborhood. LBP has been applied to many different fields, including face recognition.

      Considering a 3x3 neighborhood, the operator assigns a label to every pixel of an image by thresholding each neighbor with the center pixel value and regarding the result as a binary number. Then, the histogram of the labels can be used as a texture descriptor. See Figure 5 for an illustration of the basic LBP operator.

      Figure 5. The basic idea of the LBP operator.

      Another extension of the original LBP is called uniform patterns [11]. A Local Binary Pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa. For example, 00011110 and 10000011 are uniform patterns.

      We utilize the above idea of LBP with uniform patterns in our facial expression representation. We compute the uniform patterns using the (8, 2) neighborhood, which is shown in Fig. 6. Here (8, 2) stands for eight neighbors on a circle of radius two. The black rectangle in the center denotes the thresholding pixel, and the circle points around it denote the neighbors. Four of the neighbors are not located at pixel centers, so their values are calculated by interpolation. After that, a sliding window of size 18x21 is used for the uniform pattern statistics, shifting by 6 pixels in width and 8 pixels in height. Fig. 7 illustrates the statistics procedure in the width direction.

      Figure 6. LBP representation using the (8, 2) neighborhood.
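A small sketch of the feature extraction just described is given below, using scikit-image's uniform LBP as a stand-in implementation of the LBP(8, 2) operator and the 18 x 21 window shifted by 6 and 8 pixels; the concrete library choice and the histogram normalization are our assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature_vector(face, window=(21, 18), step=(8, 6), points=8, radius=2):
    """Uniform LBP(8, 2) histograms from overlapping windows, concatenated.

    window/step are (height, width), matching the 18x21 window shifted by
    6 pixels in width and 8 pixels in height described above.
    """
    codes = local_binary_pattern(face, points, radius, method='uniform')
    n_bins = points + 2                    # uniform patterns plus one catch-all bin
    win_h, win_w = window
    step_h, step_w = step
    feats = []
    for top in range(0, face.shape[0] - win_h + 1, step_h):
        for left in range(0, face.shape[1] - win_w + 1, step_w):
            block = codes[top:top + win_h, left:left + win_w]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))   # normalized histogram
    return np.concatenate(feats)
```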
C. Support Vector Machine

      The SVM is a kind of learning machine founded on statistical learning theory. It has been widely applied in pattern recognition.

      The basic scheme of the SVM is to create an optimal hyper-plane as the decision plane, which maximizes the margin between the closest points of the two classes. The points on the margin are called support vectors; in other words, those support vectors are used to decide the hyper-plane.

      Assume we have a set of sample points from two classes

$\{(x_i, y_i)\},\; i = 1, \dots, m,\; x_i \in \mathbb{R}^N,\; y_i \in \{-1, 1\}$    (1)

The discrimination hyper-plane is defined as below:

$f(x) = \sum_{i=1}^{m} y_i a_i k(x, x_i) + b$    (2)

where $f(x)$ indicates the membership of $x$, and $a_i$ and $b$ are real constants. $k(x, x_i) = \langle \phi(x), \phi(x_i) \rangle$ is a kernel function, where $\phi(x)$ is the nonlinear map from the original space to the high-dimensional space. The kernel function can be of various types. For example, the linear kernel is $k(x, x_i) = x \cdot x_i$, the radial basis function (RBF) kernel is $k(x, x_i) = \exp\!\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$, and the polynomial kernel is $k(x, x_i) = (x \cdot x_i + 1)^n$. The SVM can be designed for either two-class or multi-class classification. In this paper, we use the multi-class SVM with a polynomial kernel function [12].
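For illustration, this classification stage can be realized with an off-the-shelf multi-class SVM with a polynomial kernel, e.g. scikit-learn's SVC as sketched below; the kernel degree and regularization constant are placeholders, not values reported in the paper.

```python
from sklearn.svm import SVC

# Multi-class SVM with a polynomial kernel (one-vs-one under the hood).
# X_train / X_test are LBP feature vectors such as those produced by
# lbp_feature_vector() above, and y_train holds the seven expression labels.
def train_and_predict(X_train, y_train, X_test, degree=2):
    clf = SVC(kernel='poly', degree=degree, coef0=1.0, C=1.0)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```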


                IV.    EXPERIMENTAL RESULTS

      In this paper, we use the JAFFE facial expression database [13]. Examples from this database are shown in Table I. The database is composed of 213 gray-scale images of 10 Japanese females. Each person has 7 kinds of expressions, and every expression includes 3 or 4 copies. The 7 expressions are Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.
Table I     THE EXAMPLES OF THE JAFFE DATABASE
(example face images for each of the seven expressions: Anger, Disgust, Fear, Happiness, Neutral, Sadness, Surprise)

The size of each image is 256x256 pixels. Two images of each expression for every person are used as training samples, and the rest are used as testing samples. Hence the total number of training samples is 140, and the number of testing samples is 73.

     Table 2 shows the recognition rate of each facial expression obtained by the proposed method. The last row is the average recognition rate over the 7 expressions, which is 93.24%. The recognition time for each face image is 0.105 seconds.

     We also compare our experimental results with two reference methods. In reference [14], the authors used Gabor features and an NN fusion method. In reference [15], the authors divided the face image into three parts and used the 2DPCA method. The training and test images are the same as for the proposed method. Table 3 shows the comparison result. The average recognition rate of reference [14] is 92.57%, and that of reference [15] is 91.6%.

     Table II        THE RECOGNITION RATE OF THE PROPOSED METHOD

          Anger         90%
          Disgust       88.89%
          Fear          92.3%
          Happiness     100%
          Neutral       100%
          Sadness       81.8%
          Surprise      100%
          Average       93.24%

     Table III       THE COMPARISON RESULTS

                     Reference [14]   Reference [15]   The proposed method
          Anger          95%              95.2%              90%
          Disgust        88%              95.2%              88.89%
          Fear           100%             85.7%              92.3%
          Happiness      100%             84.9%              100%
          Neutral        75%              100%               100%
          Sadness        90%              90.4%              81.8%
          Surprise       100%             89.8%              100%
          Average        92.57%           91.6%              93.24%

     According to Table 3, the proposed method clearly performs better overall than the two reference methods. Even though the recognition rates for some expressions are not as good as those of the reference methods, we still achieve the highest average recognition rate.

                         V.  CONCLUSIONS

     In this paper, we proposed a facial expression recognition method based on LBP features. To reduce the computational effort, we detect the face region before applying the LBP method. After the facial features are extracted from the detected area, the SVM classifier recognizes the facial expression. Using JAFFE as the experimental database, the proposed method achieves a 93.24% recognition rate, which is better than the two reference methods.
     There are several aspects to be studied in future work. The experiments discussed above share one property: the training and testing samples come from the same person. In other words, to recognize someone's expression, we must already have images of that person's various expressions in the database. This assumption is not suitable for real applications, and we want to overcome this limitation in the future, perhaps by modeling the variations between different expressions and using this model for recognition. Other problems in facial expression recognition, such as lighting variation and pose change, still have to be dealt with. These are long-standing difficult issues, and we will try to find better algorithms to enhance our method.

                           REFERENCES
                                                                      [1] L.I. Smith, “A Tutorial on Principal Components Analysis”,
                                                                           2002.

[2] H. Yu and J. Yang, “A Direct LDA Algorithm for High-Dimensional Data with Application to Face Recognition”,
     Pattern Recognition, vol. 34, no. 10, pp. 2067–2070, 2001.

[3] Deng Hb, Jin Lw and Zhen Lx et al, “A New Facial
     Expression Recognition Method Based on Local Gabor Filter
     Bank and PCA plus LDA”, International Journal of
     Information Technology, vol.11, no. 11, pp.86-96, 2005.

[4] Timo Ahonen, Abdenour Hadid and Matti Pietikäinen,
     “Face Description with Local Binary Patterns: Application to
     Face Recognition”, IEEE Transactions on Pattern Analysis
     and Machine Intelligence, vol. 28, no. 12, pp.2037–2041,
     2006.

[5] Timo Ahonen, Abdenour Hadid and Matti Pietikäinen, “Face
     Recognition with Local Binary Patterns”, Springer-Verlag
     Berlin Heidelberg 2004, pp.469–481, 2004.

[6] Pavlovic V. and Garg A. “Efficient Detection of Objects and
     Attributes using Boosting”, IEEE Conf. Computer Vision and
     Pattern Recognition, 2001.

[7] Jerome Friedman, Trevor Hastie and Robert Tibshirani,
     “Additive Logistic Regression: A Statistical View of Boosting”,
     The Annals of Statistics, vol. 28, no. 2, pp.337–407, 2000.

[8] C. Burges, Tutorial on support vector machines for pattern
     recognition, Data Mining and Knowledge Discovery, vol. 2, no.
     2, pp. 955-974, 1998.

[9] P. Viola and M. Jones, “Rapid object detection using a boosted
     cascade of simple features”, Proceedings of the 2001 IEEE
     Computer Society Conference, vol 1, 2001, pp. I-511-I-518.

[10] Intel,     “Open      source      computer      vision   library;
     http://sourceforge.net/projects/opencvlibrary/”, 2001.

[11] T. Ojala, M. Pietikäinen, and T. Mäenpää,
     “Multiresolution Gray-Scale and Rotation Invariant Texture
     Classification with Local Binary Patterns,” IEEE Trans. on
     Pattern Analysis and Machine Intelligence, vol. 24, no. 7,
     pp. 971-987, July 2002.

[12] Dana Simian, “A model for a complex polynomial SVM kernel”,
     Mathematics And Computers in Science and Engineering, pp.
     164-169, 2008.


[13] M. Lyons, S. Akamatsu, et al., “Coding Facial Expressions with Gabor Wavelets”, Proceedings of the Third IEEE
     International Conference on Automatic Face and Gesture Recognition, Nara, Japan, pp. 200-205, 1998.


[14] WeiFeng Liu and ZengFu Wang, “Facial Expression Recognition Based on Fusion of Multiple Gabor Features”,
     International Conference on Pattern Recognition, 2006.

[15] Bin Hua and Ting Liu, “Facial expression recognition based on FB2DPCA and multi-classifier fusion”,
     International Conference on Information Technology and Computer Science, 2009.




MILLION-SCALE IMAGE OBJECT RETRIEVAL

                        1 Yin-Hsi Kuo (郭盈希) and 1,2 Winston H. Hsu (徐宏民)

               1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei
               2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei




                      ABSTRACT

In this paper, we present a real-time system that
addresses three essential issues of large-scale image
object retrieval: 1) image object retrieval—facilitating
pseudo-objects in inverted indexing and novel object-
level pseudo-relevance feedback for retrieval accuracy;
2) time efficiency—boosting the time efficiency and
memory usage of object-level image retrieval by a novel
inverted indexing structure and efficient query
evaluation; 3) recall rate improvement—mining
semantically relevant auxiliary visual features through
visual and textual clusters in an unsupervised and
scalable (i.e., MapReduce) manner. We are able to
search over a one-million image collection and respond to a
user query in 121ms, with significantly better accuracy
(+99%) than the traditional bag-of-words model.
Keywords— Image Object Retrieval; Inverted File; Visual Words; Query Expansion

Figure 1: With the proposed auxiliary visual feature discovery, more accurate and diverse results of image object retrieval can be obtained. The search quality is greatly improved. Regarding efficiency, because the auxiliary visual words are discovered offline on a MapReduce platform, the proposed system takes less than one second searching over a million-scale image collection to respond to a user query.

                 1.   INTRODUCTION

Different from traditional content-based image retrieval (CBIR) techniques, the target images to match might only cover a small region in the database images. This need raises a challenging problem of image object retrieval, which aims at finding images that contain a specific query object rather than images that are globally similar to the query (cf. Figure 1). To improve the accuracy of image object retrieval and ensure retrieval efficiency, in this paper we consider several issues of image object retrieval and propose methods to tackle them accordingly.
     State-of-the-art object retrieval systems are mostly based on the bag-of-words (BoW) [6] representation and inverted-file indexing methods. However, unlike textual queries with few semantic keywords, image object queries are composed of hundreds (or a few thousands) of noisily quantized descriptors. Meanwhile, the target images generally have different visual appearances (lighting condition, occlusion, etc.). To tackle these issues, we propose to mine visual features semantically relevant to the search targets (see the results in Figure 1) and augment each image with such auxiliary visual features. As illustrated in Figure 5, these features are discovered from visual and textual graphs (clusters) in an unsupervised manner by distributed computing (i.e., MapReduce [1]). Moreover, to facilitate object-level indexing and retrieval, we incorporate the idea of pseudo-objects [4] into the inverted file paradigm and the pseudo-relevance feedback mechanism. A novel efficient




Figure 2: The system diagram. Offline part: We extract visual and textual features from images. Textual and visual
image graphs are constructed by an inverted list-based approach and clustered by an adapted affinity propagation
algorithm by MapReduce (18 Hadoop servers). Based on the graphs, auxiliary visual features are mined by
informative feature selection and propagation. Pseudo-objects are then generated by considering the spatial
consistency of salient local features. A compact inverted structure is used over pseudo-objects for efficiency. Online
part: To speed up image retrieval, we propose an efficient query evaluation approach for inverted indexing. The
retrieval process is then completed by relevance scoring and object-level pseudo-relevance feedback. It takes around
121ms to produce the final image ranking of image object retrieval over one-million image collections.

query evaluation method is also developed to remove unreliable features and further improve accuracy and efficiency.
     Experiments show that the automatically discovered auxiliary visual features are complementary to conventional query expansion methods. Its performance is significantly superior to the BoW model. Moreover, the proposed object-level indexing framework is remarkably efficient and takes only 121ms for searching over the one-million image collection.

              2.    SYSTEM OVERVIEW

Figure 2 shows a schematic plot of the proposed system, which consists of offline and online parts. In the offline part, visual features (VWs) and textual features (tf-idf of expanded tags) are extracted from the images. We then propagate semantically relevant VWs from the textual domain to the visual domain, and remove visually irrelevant VWs in the visual domain (cf. Section 4). All these operations are performed in an unsupervised manner on the MapReduce [1] platform, which is famous for its scalability. Operations including image graph construction, clustering, and mining over million-scale images can be performed efficiently. To further enhance efficiency, we index the VWs by the proposed object-level inverted indexing method (cf. Section 3). We incorporate the concept of pseudo-object and adopt compression methods to reduce memory usage.
     In the online part, an efficient retrieval algorithm is employed to speed up the query process without loss of retrieval accuracy. In the end, we apply object-level pseudo-relevance feedback to refine the search result and improve the recall rate. Unlike its conventional counterpart, the proposed object-level pseudo-relevance feedback places more importance on local objects instead of the whole image.

       3.    OBJECT-LEVEL INVERTED INDEXING

The inverted file is a popular way to index large-scale data in the information retrieval community [8]. Because of its superior efficiency, many recent image retrieval systems adopt the concept to index visual features (i.e., VWs). The intuitive way is to record each entry with <image ID, VW frequency> in the inverted file. However, to our best knowledge, most systems simply adopt the conventional method to the visual domain, without considering the differences between documents and images, where the image query is composed of thousands of (noisy) VWs and the object of interest may occupy small portions of the target images.

3.1.   Pseudo-Objects

Images often contain several objects, so we cannot take the whole image's features to represent each object. Each object has its distinctive VWs. Motivated by the novelty and promising retrieval accuracy in [4], we adopt the concept of pseudo-object—a subset of proximate feature points with its own feature vector to represent a local area. The example in Figure 4 shows that the pseudo-objects, efficiently discovered, can almost catch different objects; however, advanced methods such as efficient indexing or query expansion are not considered there. We further propose a novel object-level inverted indexing.

3.2.   Index Construction

Unlike document words, VWs have a spatial dimension. Neighboring VWs often correspond to the same object in an image, and an image consists of several objects. We adopt pseudo-objects and store the object information in the inverted file to support object-level image retrieval. Specifically, we construct an inverted list for each VW t as follows: <Image ID i, ft,i, RID1, ..., RIDf>, which indicates the ID of the image i where the VW appears, the occurrence frequency (ft,i), and the associated object region IDs (RIDf) in each image. The addition of the object ID to the inverted file makes it possible to search for a specific object even if the object only occupies a small region of an image.
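As a rough illustration of the posting-list layout just described, the sketch below builds an in-memory inverted file whose entries carry the visual-word frequency and the pseudo-object region IDs per image. The data structures and field names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed structures, not the authors' code): an inverted file
# whose postings carry, per image, the VW frequency and the pseudo-object
# region IDs in which the VW occurs, i.e. <image ID, f_t_i, RID_1, ..., RID_f>.
from collections import defaultdict
from typing import Dict, List, Tuple

# posting list: visual word id -> list of (image_id, frequency, [region_ids])
InvertedFile = Dict[int, List[Tuple[int, int, List[int]]]]

def build_index(images: Dict[int, List[Tuple[int, int]]]) -> InvertedFile:
    """images maps image_id -> list of (visual_word_id, region_id) occurrences."""
    index: InvertedFile = defaultdict(list)
    for image_id, occurrences in images.items():
        per_word: Dict[int, List[int]] = defaultdict(list)
        for vw, rid in occurrences:
            per_word[vw].append(rid)
        for vw, rids in per_word.items():
            index[vw].append((image_id, len(rids), rids))
    return index

if __name__ == "__main__":
    # Two toy images; region 0 conventionally denotes the whole image.
    toy = {1: [(7, 0), (7, 2), (9, 1)], 2: [(7, 3)]}
    print(build_index(toy)[7])  # -> [(1, 2, [0, 2]), (2, 1, [3])]
```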




Figure 3: Illustration of efficient query evaluation (cf.
Section 3). To achieve time efficiency, first, we rank a
visual word by its salience to the query and then retrieve
the designated number of candidate images (e.g., 7
images, A to G). After deciding the candidate images,
we skip the irrelevant images and cut those non-salient
VWs.
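Figure 3 summarizes the efficient query evaluation that Section 3.5 details. The sketch below is a rough, simplified rendering of those phases under assumed in-memory data layouts (per-word posting lists of (image ID, weight) pairs); the candidate count and cut ratio are illustrative parameters only, not the paper's settings.

```python
# Rough sketch (assumed data layout, not the authors' code) of the efficient
# query evaluation in Figure 3: rank query VWs by salience, collect candidate
# images from the top postings, then score candidates only, skipping the rest
# and cutting the least salient VWs once a cut ratio is reached.
from collections import defaultdict
from typing import Dict, List, Tuple

def eqe_search(query: Dict[int, float], idf: Dict[int, float],
               index: Dict[int, List[Tuple[int, float]]],
               n_candidates: int = 100, cut_ratio: float = 0.8) -> List[Tuple[int, float]]:
    # 1) Query term ranking: order VWs by salience w_{t,Q} * IDF_t.
    terms = sorted(query, key=lambda t: query[t] * idf.get(t, 0.0), reverse=True)
    scores: Dict[int, float] = defaultdict(float)
    candidates: set = set()
    for k, t in enumerate(terms):
        postings = index.get(t, [])
        if len(candidates) < n_candidates:
            # 2) Collecting phase: top postings of salient VWs become candidates.
            candidates.update(img for img, _ in postings[:n_candidates])
        elif k > cut_ratio * len(terms):
            break  # 4) Cutting phase: drop the remaining non-salient VWs.
        # 3) Skipping phase: score candidate images only, skip all others.
        for img, w in postings:
            if img in candidates:
                scores[img] += idf.get(t, 0.0) * min(w, query[t])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```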



Figure 4: Object-level retrieval results by pseudo-objects and object-level pseudo-relevance feedback. The letter below each image represents the region (pseudo-object) with the highest relevance to the query object by (2). The region information is essential for query expansion. Instead of using the whole image as the seed for retrieving other related images, we can easily identify those related objects (e.g., R0, R5, R0) and mitigate the influence of noisy features. Note that the yellow dots in the background are detected feature points.

3.3.   Index Compression

Index compression is a common way to reduce memory usage in the textual domain. First, we discard the top 5% most frequent VWs as stop words to decrease the mismatch rate and reduce the size of the inverted file. We then adopt different coding methods to compress data based on their visual characteristics. Image IDs are ordinal numbers sorted in ascending order in the lists; thus we store the difference between adjacent image IDs instead of the image ID itself, which is called d-gap [8]. For region IDs, we adopt a fixed-length bit-level coding of three bits (e.g., R2 → 010). On the other hand, we use a variable-length bit-level coding to encode frequency (e.g., 3 → 1110). Furthermore, we implement AND and SHIFT operations to efficiently decode the frequency and region IDs at query time. The memory space for indexing pseudo-objects is reduced by about 54.1%.
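A rough sketch of this coding scheme is given below: image IDs are d-gapped, frequencies get a variable-length code, and region IDs are packed into fixed 3-bit fields. The unary-style frequency code and the plain binary gap encoding are assumptions made to keep the example short; they are not necessarily the exact codes used in the paper.

```python
# Minimal sketch (assumed coding details, not the authors' exact scheme):
# d-gap the sorted image IDs, encode each frequency in unary (f ones then a
# zero, e.g. 3 -> "1110"), and pack each region ID into a fixed 3-bit field.
from typing import List, Tuple

def encode_postings(postings: List[Tuple[int, int, List[int]]]) -> str:
    """postings: (image_id, freq, region_ids) sorted by image_id; returns a bit string."""
    bits, prev_id = [], 0
    for image_id, freq, rids in postings:
        gap = image_id - prev_id                     # d-gap instead of the raw image ID
        prev_id = image_id
        bits.append(format(gap, "b"))                # gap bits (a real system would use e.g. gamma coding)
        bits.append("1" * freq + "0")                # unary-style frequency code
        bits.extend(format(r, "03b") for r in rids)  # fixed 3-bit region IDs (R0..R7)
    return "".join(bits)

if __name__ == "__main__":
    print(encode_postings([(1, 2, [0, 2]), (2, 1, [3])]))
```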

3.4.   Object-Level Scoring Method

We use the intersection of TF-IDF, which performs the best for matching, to calculate the score of each region indexed by VW t. Besides the discovered pseudo-objects, we also define a new object R0 to treat the whole image as another object. We first calculate the score of every pseudo-object (R) to the query object (Q) as follows,

          score(R, Q) = ∑_{t∈Q} IDF_t × min( w_{t,R}, w_{t,Q} ),               (1)

where w_{t,R} and w_{t,Q} are the normalized VW frequencies in the pseudo-object and in the query respectively. The pseudo-object with the highest score is then regarded as the most relevant object with respect to the query, as suggested in [4]:

          score(i, Q) = max{ score(R, Q) | R ∈ i }.                            (2)

3.5.   Efficient Query Evaluation (EQE)

Conventional query evaluation in inverted indexing needs to keep track of the scores of all images in the inverted lists. In fact, it is observed that most of the scored images contain only a few matched VWs. We propose an efficient query evaluation (EQE) algorithm that explores only a small part of a large-scale database to reduce the online retrieval time. The procedures of EQE are described below and illustrated in Figure 3.
1.  Query term ranking: The ranking score in (1) favors query terms with higher frequency and IDF_t; therefore, we sort the query terms according to their salience, which is calculated as w_{t,Q} × IDF_t for VW t. The following phases then process the VWs sequentially, ordered and weighted by their visual significance to the query.
2.  Collecting phase: In the retrieval process, the user only cares about the images in the top ranks.




Figure 5: Image clustering results and mining auxiliary visual words: (a) visual cluster example, (b) representative VW selection, (c) example results, (d) auxiliary VW propagation, (e) textual cluster example. (a) and (e) show the sample visual and textual clusters; the former keeps visually similar images in the same cluster, while the latter favors semantic similarities. The former facilitates representative VW selection, while the latter facilitates semantic (auxiliary) VW propagation. (b) and (d) illustrate the selection and propagation operations based on the cluster histogram, as detailed in Section 4. A simple example is shown in (c).


     Therefore, instead of calculating the score of each image, we score the top images of the inverted lists and add them to a set S until we have collected a sufficient number of candidate images.
3.  Skipping phase: After deciding the candidate images, we skip the images that do not appear in the collecting phase. For every image i in the inverted list, score the image i if i ∈ S; otherwise skip it. If the number of visited VWs reaches a predefined cut ratio, go on to the next phase.
4.  Cutting phase: Simply remove the remaining VWs, which usually have little influence on the results. The process then stops here.
     This algorithm works remarkably well, bringing about almost the same retrieval quality with much less computational cost. As image queries are generally composed of thousands or hundreds of (noisy) VWs, rejecting those non-salient VWs significantly improves the efficiency and slightly improves the accuracy.

3.6.   Object-Level Pseudo-Relevance Feedback (OPRF)

The conventional approach of using whole images for pseudo-relevance feedback (PRF) may not perform well when only a part of the retrieved images are relevant. In such a case, many irrelevant objects would be included in PRF, resulting in too many query terms (or noises) and degrading the retrieval accuracy. To tackle this issue, a novel object-level pseudo-relevance feedback (OPRF) algorithm is proposed. Rather than using the whole images, we select the most important objects from each of the top-ranked images and use them for PRF. The importance of each object is estimated according to (2). By selecting relevant objects in each image (e.g., R0, R5, R0 in Figure 4), we can further remove irrelevant objects such as the toy in R4 of the second image.

       4.  AUXILIARY VISUAL WORD (AVW) DISCOVERY

Due to the limitation of VWs, it is difficult to retrieve images with different viewpoints, lighting conditions, occlusions, etc. To improve the recall rate, query expansion is the most adopted method; however, it is limited by the quality of the initial retrieval results. Instead, in an offline stage, we augment each image with auxiliary visual features, considering representative (dominant) features in its visual clusters and semantically related features in its textual graph respectively. Such auxiliary visual features can significantly improve the recall rate, as demonstrated in Figure 1. We can deploy all the processes in a parallel way by MapReduce [1]. Besides, a by-product of auxiliary visual word discovery is the reduction of the number of indexed visual features for each image, for better efficiency in time and memory. Moreover, it is easy to embed the auxiliary visual features in the proposed indexing framework by adding one new region for those discovered auxiliary visual features not existing in the original VW set.

4.1.   Image Clustering by MapReduce

The image clustering is first based on a graph construction. The images are represented by 1M VWs and 50K text tokens expanded by Google snippets from their associated (noisy) tags. However, it is very challenging to construct image graphs for million-scale images. To tackle the scalability problem, we construct
image graphs using the MapReduce model [1], a scalable framework that simplifies distributed computations.
     We take advantage of the sparseness and use the cosine measure as the similarity measure. Our algorithm extends the method proposed in [2], which uses a two-phase MapReduce model—an indexing phase and a calculation phase—to calculate pairwise similarities. It takes around 42 minutes to construct a graph of 550K images on 18-node Hadoop servers. To cluster images on the image graph, we apply affinity propagation (AP) proposed in [3]. AP is a graph-based clustering algorithm. It passes and updates messages among nodes on the graph iteratively and locally—associating with the sparse neighbors only. It takes around 20 minutes for each iteration, and AP generally converges in around 20 iterations (~400 minutes) for 550K images by the MapReduce model.
     The image clustering results are sampled in Figure 5(a) and (e). Note that if an image is close to the canonical image (center image), it has a higher AP score, indicating that it is more strongly associated with the cluster. Moreover, images in the same visual cluster are often visually similar to each other, whereas some of the images in the same textual cluster differ in view, lighting condition, angle, etc., and can potentially bring complementary VWs for other images in the same textual cluster.

4.2.   Representative Visual Word Selection

We first propose to remove irrelevant VWs in each image to mitigate the effect of noise and quantization error, to reduce memory usage in the inverted file system, and to speed up search efficiency. We observe that images in the same visual cluster are visually similar to each other (cf. Figure 5(a)). As illustrated in Figure 5(c), the middle image can then have representative VWs from the visual cluster it belongs to. We accumulate the number of each VW from the images of a cluster to form a cluster histogram. As shown in Figure 5(b), each image contributes the same weight to the cluster histogram. We can then select the VWs whose occurrence frequency is above a predefined threshold (e.g., in Figure 5(b) the VWs in red rectangles are selected).

4.3.   Auxiliary Visual Word Propagation

Due to varying capture conditions, some VWs that strongly characterize the query object may not appear in the query image. It is also difficult to obtain these VWs through query expansion methods such as PRF because of the difference in visual appearance between the query image and the retrieved ones. Mining semantically relevant VWs from other information sources such as text is therefore essential to improve the retrieval accuracy.
     As illustrated in Figure 5(e), we propose to augment each image with VWs propagated from the textual cluster result. This is based on the observation that images in the same textual cluster are semantically close but usually visually different. Therefore, these images provide a comprehensive view of the same object. Propagating the VWs from the textual domain can therefore enrich the visual descriptions of the images. As the example in Figure 5(c) shows, the bottom image can obtain auxiliary VWs with the different lighting condition of the Arc de Triomphe. The similarity score can be weighted to decide the number of VWs to be propagated. Specifically, we derive the VW histogram from the images of each cluster and then propagate VWs based on the cluster histogram weighted by its (semantic) similarity to the canonical image of the textual cluster.

4.4.   Combining Selection and Propagation

The selection and propagation operations described above can be performed iteratively. The selection operation removes visually irrelevant VWs and improves memory usage and efficiency, whereas the propagation operation obtains semantically relevant VWs to improve the recall rate. Though propagation may include too many VWs and thus decrease the precision, we can perform selection after propagation to mitigate this effect.
     A straightforward approach is to iterate the two operations until convergence. However, we find that it is enough to perform a selection first, a propagation next, and finally a selection, for the following reasons. First, only the propagation step updates the auxiliary visual features, and the textual cluster images are fixed; each image will obtain distinctive VWs at the first propagation step. The subsequent propagation steps will only modify the frequency of the VWs. As the objective is to obtain distinctive VWs, frequency is less important here. Second, binary feature vectors perform better than, or at least comparably to, real-valued ones.

                    5.  EXPERIMENTS

5.1.   Experimental Setup

We evaluate the proposed methods using a large-scale photo retrieval benchmark—Flickr550 [7]. Besides, we randomly add Manhattan photos to Flickr550 to make it a 1-million dataset. As suggested in much of the literature (e.g., [5]), we use the Hessian-affine detector to extract feature points in images. The feature points are described by SIFT and quantized into 1 million VWs for better performance. In addition, we use the average precision to evaluate the retrieval accuracy. Since average precision only shows the performance for a single image query, we compute the mean average precision (MAP) to represent the system performance over all the queries.
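For reference, the two evaluation measures mentioned above can be computed as in the short sketch below; it is a generic AP/MAP implementation and is not tied to the Flickr550 tooling or query set.

```python
# Minimal sketch (generic implementation, not the benchmark's scripts): average
# precision for one query and mean average precision (MAP) over all queries.
from typing import List, Set

def average_precision(ranked_ids: List[int], relevant: Set[int]) -> float:
    """AP of one ranked result list against the set of relevant image IDs."""
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this recall point
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(runs: List[List[int]], truths: List[Set[int]]) -> float:
    return sum(average_precision(r, t) for r, t in zip(runs, truths)) / len(runs)

if __name__ == "__main__":
    print(mean_average_precision([[3, 1, 7], [2, 9]], [{1, 3}, {5}]))  # -> 0.5
```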




5.2.   Experimental Results

Table 1: Summary of the impacts on performance and query time compared with the baseline methods. It can be found that our proposed methods achieve better retrieval accuracy and respond to a user query in 121ms over one-million photo collections. The number in parentheses indicates the relative gain over the baseline, and the symbol ‘%’ stands for the relative improvement over the BoW model [6].

               (a) Image object retrieval
     MAP                    Baseline    PRF              OPRF
     Pseudo-objects [4]     0.251       0.290 (+15.5%)   0.324 (+29.1%)

               (b) Time efficiency
                            Flickr550                       One-million
                            Pseudo-objects [4]    +EQE      +EQE
     Query time (ms)        854                   56        121

               (c) Recall rate improvement
                  BoW model [6]    AVW       AVW+OPRF
     MAP          0.245            0.352     0.487
     %            -                43.7%     98.8%

We first evaluate the performance of object-level PRF (OPRF) in boosting the retrieval accuracy. As shown in Table 1(a), OPRF outperforms PRF by a great margin (relative improvement 29.1% vs. 15.5%). The result shows that the pseudo-object paradigm is essential for PRF-based query expansion in object-level image retrieval, since the targets of interest might only occupy a small portion of the images.
     We then evaluate the query time of object-level inverted indexing augmented with efficient query evaluation (EQE) to achieve time efficiency. The query time is 15.2 times faster (854 → 56) after combining with the EQE method, as shown in Table 1(b). The reasons are attributed to the selection of salient VWs and the ignoring of insignificant inverted lists. This is essential since, unlike textual queries with 2 or 3 query terms, an image query might contain thousands (or hundreds) of VWs. Therefore, we can respond to a user query in 121ms over one-million photo collections.
     Finally, to improve recall, we evaluate the performance of auxiliary visual word (AVW) discovery. As shown in Table 1(c), the combination of selection, propagation and further OPRF brings a 99% relative improvement over the BoW model and reduces one-fifth of the feature points. This result shows that the selection and propagation operations are effective in mining useful features and removing the irrelevant ones. In addition, the relative improvement of AVW (+44%) is orthogonal and complementary to OPRF (0.352 → 0.487, +38%).

                    6.    CONCLUSIONS

In this paper, we cover four aspects of a large-scale retrieval system: 1) image object retrieval over one-million image collections—responding to user queries in 121ms, 2) the impact of object-level pseudo-relevance feedback—boosting retrieval accuracy, 3) time efficiency with efficient query evaluation in the inverted file paradigm—compared with the traditional inverted file structure, and 4) image object retrieval based on effective auxiliary visual feature discovery—improving the recall rate. That is to say, the efficiency and effectiveness of the proposed methods are validated over large-scale consumer photos.

                        REFERENCES

[1]   J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” OSDI, 2004.

[2]   T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” ACL, 2008.

[3]   B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, 2007.

[4]   K.-H. Lin, K.-T. Chen, W. H. Hsu, C.-J. Lee, and T.-H. Li, “Boosting object retrieval by estimating pseudo-objects,” ICIP, 2009.

[5]   J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” CVPR, 2007.

[6]   J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” ICCV, 2003.

[7]   Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W. H. Hsu, and H. Chen, “ContextSeer: context search and recommendation at query time for shared consumer photos,” ACM MM, 2008.

[8]   J. Zobel and A. Moffat, “Inverted files for text search engines,” ACM Computing Surveys, 2006.




Sport Video Highlight Extraction Based on
            Kernel Support Vector Machines
                                         Po-Yi Sung, Ruei-Yao Haung, and Chih-Hung Kuo,
                                                Department of Electrical Engineering
                                                 National Cheng Kung University
                                                          Tainan, Taiwan
                                        { n2895130 , n2697169 , chkuo }@mail.ncku.edu.tw



Abstract—This paper presents a generalized highlight extraction method based on kernel support vector machines (Kernel SVM) that can be applied to various types of sport video. The proposed method extracts highlights without any predefined rules for the highlight events. The framework is composed of a training mode and an analysis mode. In the training mode, the Kernel SVM is applied to train a classification plane for a specific type of sport using shot features of selected video sequences. A genetic algorithm (GA) is then adopted to optimize the kernel parameters and select features to improve the classification accuracy. In the analysis mode, we use the classification plane to generate the video highlights of the sport video. Accordingly, viewers can access important segments quickly without watching through the entire sport video.

   Keywords—Highlight extraction; Sport analysis; Kernel support vector machines; Genetic algorithm

                       I.    INTRODUCTION

    Due to the rapid growth of multimedia storage technologies, such as Portable Multimedia Player (PMP), HD DVD and Blu-ray DVD, large amounts of video content can be saved in a small piece of storage device. However, people may not have sufficient time to watch all the recorded programs. They may prefer skipping less important parts and only watching the remarkable segments, especially for sport videos. Highlight extraction is a technique that uses video content analysis to index significant events in video data, and thereby helps viewers access the desired parts of the content more efficiently. This technique can also assist the processes of summarization, retrieval, and abstraction from large video databases.
    In this paper, we focus on highlight extraction techniques for sport videos. Many works have been proposed that can identify objects that appear frequently in sport highlights. Xiong [1] proposes a technique that extracts audio and video objects that frequently appear in highlight scenes, like applause, the baseball catcher, the soccer goalpost, and so on. Tong [2] characterized three essential aspects of sport videos: focus ranges of the camera, object types, and video production techniques. Hanjalic et al. [3]-[4] measured three factors, that is, motion activity, density of cuts, and audio energy, with a derived function to detect highlights. In [5], Duan proposes a technique that searches shots with goalposts and excited voices to find highlights for soccer programs. To locate scenes of the goalposts in football games, the technique of Chang [6] detects white lines in the field, and then verifies touch-down shots via audio features. Wan [7] detects voices in commentaries with high volume, combined with the frequency of shot change and other visual features, to locate goal events. Huang [8] exploited color and motion information to find logo objects in the replays of sport video. All these techniques depend on predefined rules for a single specific type of sport video, and as a result may need lots of human effort to analyze the video sequences and identify the proper objects for highlights in the particular type of sport.
    Many other techniques have employed probabilistic models, such as Hidden Markov Models (HMM), to look for the correlations of events and the temporal dependency of features [9]-[15]. The selected scene types are represented by hidden states, and the state transition probabilities can be evaluated by the HMM. Highlights can be identified accurately by some specific transition rules. However, it is hard to include all types of highlight events in the same set of rules, and the model may fail to detect highlights if the video features are different from the original ones. Cheng [16] proposed a likelihood model to extract audio and motion features, and employed the HMM to detect the transition of the integrated representation for the highlight segments. These kinds of methods all need to estimate the probabilities of state transitions, which have to be set up through intense human observation.
    Most of the previous researches have adopted rule-based methods, whereby the rules are heuristically set to describe the dynamics among objects and scenes in the highlight events of a specific sport. The rules set for one kind of sport video usually cannot be applied to the other kinds. In [17], we have proposed a more generalized technique based on low-level semantic features. In this approach, we can generate highlight tempo curves without defining complicated transitions among hidden states, and hence we can apply this technique to various kinds of videos.



    In this paper, we extend our technique [17] and incorporate it into the framework of kernel support vector machines (Kernel SVM). For each type of sport video, a small number of highlight shots are input so that some unified features can be extracted. We then apply the Kernel SVM system to train the classification plane, and utilize the trained classification plane to analyze other input videos of the same sport type, generating the highlight shots.
    The rest of this paper is organized as follows. Section II presents the overview of the proposed system. Section III details the methods for highlight shot classification and highlight shot generation. The highlight extraction performance and experimental results are shown in Section IV. Section V is the conclusion.

          II.  PROPOSED HIGHLIGHT SHOT EXTRACTION SYSTEM OVERVIEW

    Fig. 1 shows the four stages of the proposed scheme: (1) shot change detection, (2) visual and audio feature computation, (3) Kernel SVM training and analysis, and (4) highlight shot generation. In the first stage, histogram differences are computed to detect the shot change points. In the second stage, the feature parameters of each shot are computed and taken as the input eigenvalues into the Kernel SVM training and analysis system. The shot eigenvalues include shot length (L), color structure (C), shot frame difference (Ds), shot motion (Ms), keyframe difference (Dkey), keyframe motion (Mkey), Y-histogram difference (Yd), sound energy (Es), sound zero-crossing rate (Zs) and short-time sound energy (Est). They are collected as a feature set for the i-th shot:

          Vi = { L, C, Ds, Ms, Dkey, Mkey, Yd, Es, Zs, Est }                       (1)

    In the third stage, the Kernel SVM either trains the parameters or analyzes the input features, according to the mode of the system. Then, in the last stage, highlight shots are generated based on the output of the Kernel SVM. We explain the first two stages in the following, and the other two stages are explained in Section III.

A. Shot Change Detection
    The task in this stage is to detect the transition point from one scene to another. Histogram differences of two consecutive frames are calculated by (2) to detect the shot changes in video sequences. A shot change is said to be detected if the histogram difference is greater than a predefined threshold. The pixel values that are employed to calculate the histogram contain luminance only, since the human visual system is more sensitive to luminance (brightness) than to colors. The histogram difference is computed by the equation

          D_I = ( Σ_{i=0}^{255} | H_I(i) − H_{I−1}(i) | ) / N                      (2)

where N is the total pixel number in a frame, and H_I(i) is the number of pixels of level i in the I-th frame. Finally, the video sequence is separated into several shots according to the shot change detection results.

B. Visual and Audio Features Computation
    Each shot may contain many frames. To reduce the computation complexity, we select a keyframe to represent the shot. In this work, we simply define the 10th frame of each shot as the keyframe, since it is usually more stable than the earlier frames, which may contain mixed frames during the scene transition. Many of the following features are extracted from this keyframe.
    1) Shot Length
    We designate the frame number in each shot as the shot length (L). Experiments show that shot lengths are shorter for non-highlight shots, such as shots of judges or scenes with special effects, and a highlight shot is often longer than a non-highlight shot. For example, pitching in baseball games and shooting a goal in soccer games usually have longer shot lengths. Hence, the shot length is an important feature for the highlights and is included as one of the input eigenvalues.
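A minimal sketch of the shot segmentation in Section II-A is shown below: frames are cut into shots wherever the normalized luminance-histogram difference of (2) exceeds a threshold, and each resulting shot's frame count directly gives the shot-length feature L used in (1). The threshold value and the frame representation are assumptions for illustration only.

```python
# Rough sketch (assumed I/O and threshold, not the authors' code) of the shot
# segmentation in Section II-A using the histogram difference D_I of (2).
import numpy as np
from typing import List

def histogram_difference(prev_y: np.ndarray, curr_y: np.ndarray) -> float:
    """D_I = sum_i |H_I(i) - H_{I-1}(i)| / N over 256 luminance levels."""
    h_prev, _ = np.histogram(prev_y, bins=256, range=(0, 256))
    h_curr, _ = np.histogram(curr_y, bins=256, range=(0, 256))
    return float(np.abs(h_curr - h_prev).sum()) / curr_y.size

def segment_shots(luma_frames: List[np.ndarray], threshold: float = 0.6) -> List[int]:
    """Return the frame indices where a new shot starts (shot-change points)."""
    cuts = [0]
    for i in range(1, len(luma_frames)):
        if histogram_difference(luma_frames[i - 1], luma_frames[i]) > threshold:
            cuts.append(i)
    return cuts  # shot length L of shot k is cuts[k+1] - cuts[k]
```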
Figure 1. The proposed highlight shots extraction system. (Block diagram: video and audio data → shot change detection → visual and audio feature computation → Kernel SVM training and analysis modes, with GA parameter optimization and feature selection and per-sport training data such as baseball, basketball, and soccer → highlight shots generation.)

    2) MPEG-7 Color Structure
    The color structure descriptor (C) is defined in the MPEG-7 standard [18,19] to describe the structuring property of video contents. Unlike the simple statistic of histograms, it counts the color histograms based on a moving window called the structuring element. The descriptor value of the corresponding bin in the color histogram is increased by one if the specified color is within the structuring element. Compared to the simple statistic of one histogram, the color structure descriptor can better reflect the grouping properties of a picture. A smaller C value means the image is more structured. For example, both of the two monochrome images in Fig. 2 have 85 black pixels, and hence their histograms are the same. The color structure descriptor C of the image in Fig. 2-(a) is 129, while the image in Fig. 2-(b) is more scattered with the C value 508. Fig. 3 depicts the curve of the C values in the video of a baseball game. It shows that pictures with a scattered structure usually have higher C values.




                                                                                                                       1078
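As a rough illustration of the idea behind the structuring-element counting (our own simplified binary sketch, not the MPEG-7 descriptor itself, which operates on quantized HMMD colors and scaled structuring elements; all function and variable names are ours), the following Python code slides an 8x8 window over a binary image and counts the windows that contain at least one foreground pixel. A compact arrangement of pixels yields a small count, a scattered one a large count, mirroring the contrast between Fig. 2-(a) and Fig. 2-(b).

    import numpy as np

    def color_structure_binary(mask, elem=8):
        """Count sliding structuring-element windows containing at least one
        foreground pixel.  mask: 2-D boolean array; elem: window side length.
        A lower count means the pixels are more grouped ("structured")."""
        h, w = mask.shape
        count = 0
        for y in range(h - elem + 1):
            for x in range(w - elem + 1):
                if mask[y:y + elem, x:x + elem].any():
                    count += 1
        return count

    # Toy example: the same number of foreground pixels, arranged compactly
    # versus scattered, gives very different descriptor values.
    compact = np.zeros((64, 64), dtype=bool)
    compact[10:20, 10:20] = True                     # 100 clustered pixels

    rng = np.random.default_rng(0)
    scattered = np.zeros((64, 64), dtype=bool)
    idx = rng.choice(64 * 64, size=100, replace=False)
    scattered.flat[idx] = True                       # 100 scattered pixels

    print(color_structure_binary(compact))    # small value (structured)
    print(color_structure_binary(scattered))  # much larger value (scattered)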
Figure 2. The MPEG-7 Color Structure: (a) a highly structured monochrome image; (b) a scattered monochrome image. Both have the same histogram.

Figure 3. MPEG-7 Color Structure Descriptor curve in a baseball game.

    In this paper, we perform edge detection before calculating the color structure descriptors. The resultant C value of each keyframe is regarded as an eigenvalue and included in the input data set of Kernel SVM.
  3) Shot Frame Difference
    The average shot frame difference (Ds) of each shot is defined by

    D_s = \frac{1}{L-1} \sum_{n=1}^{L-1} \left( \frac{1}{WH} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| f_n(i,j) - f_{n-1}(i,j) \right| \right)                (3)

where W and H are the frame width and height respectively, and f_n(i, j) is the pixel intensity at position (i, j) in the n-th frame. This feature shows the frame activities in a shot. In general, highlight shots have higher Ds values than non-highlight shots.
  4) Shot Motion
    To measure the motion activity, we first partition a frame into square blocks of the size K-by-K pixels, and perform motion estimation to find the motion vector of each block [20]. The shot motion Ms is defined as the average magnitude of motion vectors by

    M_s = \frac{1}{(L-1) \cdot WH/K^2} \sum_{n=1}^{L-1} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \sqrt{ MV_{x,n}(i,j)^2 + MV_{y,n}(i,j)^2 }                (4)

where W and H are the block numbers in the horizontal and vertical directions respectively. MV_{x,n}(i, j) and MV_{y,n}(i, j) are the motion vectors in the x and y directions respectively, of the block at the i-th row and j-th column in the n-th frame of the shot. The motion vector of a block represents the displacement in the reference frame from the co-located block to the best matched square, and is searched by minimizing the sum of absolute error (SAE) [21].

    SAE = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \left| C(i,j) - R(i,j) \right|                (5)

where C(i, j) is the pixel intensity of a current block at relative position (i, j), and R(i, j) is the pixel intensity of a reference block.
  5) Keyframe Difference and Keyframe Motion
    We calculate the frame difference and estimate the motion activity between the keyframe and its next frame. Suppose the k-th frame is a keyframe. The keyframe difference Dkey of the shot is defined by

    D_{key} = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| f_k(i,j) - f_{k+1}(i,j) \right|                (6)

where f_k(i, j) represents the intensity of the pixel at position (i, j) in the k-th frame. Similarly, the keyframe motion Mkey represents the average magnitude of the motion vectors inside the keyframe and is defined as

    M_{key} = \frac{1}{WH/K^2} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \sqrt{ MV_x(i,j)^2 + MV_y(i,j)^2 }                (7)

where MV_x(i, j) and MV_y(i, j) denote the components of the motion vectors in the x- and y-directions respectively.
  6) Y-Histogram Difference
    The average Y-histogram difference is calculated by

    Y_d = \frac{1}{L-1} \sum_{n=1}^{L-1} \frac{ \sum_{i=0}^{255} \left| H_n(i) - H_{n-1}(i) \right| }{ WH }                (8)

where H_n(i) represents the number of pixels at level i, counted in the n-th frame. In general, the value of Yd is higher in the highlight shots.
  7) Sound Energy
    The sound energy Es is defined as

    E_s = \frac{1}{M} \sum_{n=1}^{M} \left| S(n) \cdot S(n) \right|                (9)

where S(n) is the signal strength of the n-th audio sample in a shot, and M is the total number of audio samples in the duration of the corresponding shot. In highlight shots, the sound energy is usually higher than in non-highlight shots.
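As an informal illustration of how the per-shot features of Eqs. (3), (6), (8) and (9) could be computed (our own sketch, not the authors' code; array names and shapes are assumptions, and the motion features of Eqs. (4) and (7) are omitted because they require a block-matching motion estimator [20, 21]):

    import numpy as np

    def shot_features(frames, audio, key_idx=0, levels=256):
        """Per-shot features following Eqs. (3), (6), (8), (9).
        frames:  (L, H, W) uint8 luminance frames of one shot
        audio:   (M,) audio samples of the same shot
        key_idx: index of the keyframe inside the shot (key_idx + 1 must exist)"""
        f = frames.astype(np.float64)
        L, H, W = f.shape

        # Eq. (3): average shot frame difference Ds
        ds = np.abs(np.diff(f, axis=0)).sum(axis=(1, 2)).mean() / (W * H)

        # Eq. (6): keyframe difference Dkey (keyframe vs. its next frame)
        dkey = np.abs(f[key_idx] - f[key_idx + 1]).sum()

        # Eq. (8): average Y-histogram difference Yd
        hists = np.stack([np.bincount(fr.ravel(), minlength=levels)
                          for fr in frames])
        yd = np.abs(np.diff(hists, axis=0)).sum(axis=1).mean() / (W * H)

        # Eq. (9): sound energy Es (mean squared sample strength)
        es = np.mean(audio.astype(np.float64) ** 2)

        return ds, dkey, yd, es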
  8) Sound Zero-crossing Rate
    We also adopt the zero-crossing rate (Zs) of the audio signals as one of the input features, since it is a simple indicator of the audio frequency information. Experiments indicate that the zero-crossing rate becomes higher in highlight shots. The zero-crossing rate is defined as

    Z_s = \frac{f_s}{2M} \sum_{i=1}^{M} \left| \mathrm{sign}[S(i)] - \mathrm{sign}[S(i-1)] \right|                (10)

where f_s is the audio sampling rate, and the sign function is defined by

    \mathrm{sign}[S(i)] = \begin{cases} 1, & \text{if } S(i) > 0 \\ 0, & \text{if } S(i) = 0 \\ -1, & \text{otherwise} \end{cases}                (11)

  9) Short-time Sound Energy
    Since the crowd sounds usually last for only one or two seconds, the sound energy cannot represent the crowd sounds in video shots with longer shot lengths. Thus, we select the short-time sound energy (Est) as one of the input eigenvalues. The short-time sound energy is defined as

    e(p) = \frac{1}{24000} \sum_{n=1}^{24000} \left| S_p(n) \cdot S_p(n) \right|,  E_{st} = \max\{ e(1), e(2), e(3), \ldots, e(m) \}                (12)

where S_p(n) is the signal strength of the n-th audio sample in the p-th second of the video shot, e(p) is the sound energy of the p-th second of the video shot, and m is the length of the video shot in seconds.

              III. HIGHLIGHT SHOT CLASSIFICATION METHOD

A. Kernel SVM Training and Analysis System
    In this work, the Kernel SVM is adopted to analyze the input videos and generate the highlight shots. In the training mode, the selected shots of a specific sport type are fed into the system to train the classification hyperplanes, and we apply a genetic algorithm (GA) to select features and optimize kernel parameters for the support vector machines. In the analysis mode, the system simply loads these pre-stored parameters and generates highlight shots for the input sport video. We explain the process in detail in the following.
  1) Support Vector Machines
    SVM is a machine learning technique first suggested by Vapnik [22] and has widespread applications in classification, pattern recognition and bioinformatics. Typical concepts of SVM are for solving binary classification problems [23]. The data may be multidimensional and form several disjoint regions in the space. The feature of SVM is to find the decision functions that optimally separate the data into two classes. In this section, we briefly explain the basic idea of constructing the SVM decision functions.
      a) Linear SVM
    Given a training set (x_1, y_1), (x_2, y_2), ..., (x_i, y_i), x_n \in R^n, y_n \in \{1, -1\}, n = 1, ..., i, where i is the total number of training data, each training data point x_n is associated with one of two classes characterized by a value y_n = \pm 1. In the linear SVM theory, the decision function is supposed to be a linear function and defined as

    f(x) = w^T x + b                (13)

where w, x \in R^n and b \in R; w is the weighting vector of hyperplane coefficients, x is the data vector in the space and b is the bias. The decision function lies halfway between two hyperplanes which are referred to as support hyperplanes. SVM is expected to find the linear function f(x) = 0 that separates the two classes of data. Fig. 4-(a) shows a decision function that separates two classes of data. For separable data, there are many possible decision functions. The basic idea is to determine the margin that separates the two hyperplanes and maximize the margin in order to find the optimal decision function. As shown in Fig. 4-(b), the two hyperplanes consist of data points which satisfy w^T x + b = 1 and w^T x + b = -1 respectively. For example, the data points x_1 of the positive class (y_n = +1) lead to a positive value and the data points x_2 of the negative class (y_n = -1) lead to a negative value. The perpendicular distance between the two hyperplanes is 2/\|w\|. In order to find the maximal margin and the optimal hyperplanes, we must find the smallest \|w\|. Therefore, the data points have to satisfy the condition as one set of inequalities

    y_j (w^T x_j + b) \ge 1, for j = 1, 2, 3, ..., i                (14)

The problem of solving for w and b can be reduced to the following optimization problem

    Minimize  \frac{1}{2} \|w\|^2
    subject to  y_j (w^T x_j + b) \ge 1, for j = 1, ..., i                (15)

This is a quadratic programming (QP) problem and can be solved by the following Lagrange function [24]:

    L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{j=1}^{i} \alpha_j \left[ y_j (w^T x_j + b) - 1 \right]                (16)

where \alpha_j denotes the Lagrange multiplier. The w, b, and \alpha_j at the optimum that minimize (16) are obtained. Then, the Karush-Kuhn-Tucker (KKT) conditions are followed to
simplify this optimization problem. Since the optimization problem has to satisfy the KKT conditions defined by

    w = \sum_{j=1}^{i} \alpha_j y_j x_j
    \sum_{j=1}^{i} \alpha_j y_j = 0                (17)
    \alpha_j \ge 0, for j = 1, ..., i
    \alpha_j \left[ y_j (w^T x_j + b) - 1 \right] = 0, for j = 1, ..., i

Substituting (17) into (16), the Lagrange function is transformed to the dual problem as follows

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k x_j^T x_k                (18)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  \alpha_j \ge 0, j = 1, ..., i

Solve this dual problem to find the Lagrange multipliers \alpha_j. Substitute \alpha_j into (19) to find the optimal w and b.

    w = \sum_{j=1}^{i} \alpha_j y_j x_j
    b = \frac{1}{N_{sv}} \sum_{s=1}^{N_{sv}} \left( \frac{1}{y_s} - w^T x_s \right)                (19)

where x_s are the data points whose Lagrange multipliers \alpha_j > 0 (the support vectors), y_s is the class of x_s, and N_sv is the number of x_s.
      b) Linear Generalized SVM
    In the case where the data are not linearly separable, as shown in Fig. 4-(c), the optimization problem in (15) will be infeasible. The concepts of linear SVM can also be extended to the linearly nonseparable case. Rewrite (14) as (20) by introducing non-negative slack variables \xi_j.

    y_j (w^T x_j + b) \ge 1 - \xi_j, for j = 1, ..., i                (20)

The above inequality constraints are minimized through a penalized objective function. Then the optimization problem can be written as

    Minimize  L(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{j=1}^{i} \xi_j                (21)
    Subject to  y_j (w^T x_j + b) \ge 1 - \xi_j, for j = 1, ..., i

where C is the penalty parameter. This optimization problem can also be solved by the Lagrange function and transformed to the dual problem as follows

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k x_j^T x_k                (22)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  0 \le \alpha_j \le C, j = 1, ..., i

Similarly, we can solve this dual problem and find the optimal w and b.
      c) Non-linear SVM
    The SVM can be extended to the case of nonlinear conditions by projecting the original data sets to a higher dimensional space, referred to as the feature space, via a mapping function \varphi. The nonlinear decision function is obtained by formulating the linear classification problem in the feature space. In nonlinear SVM, the inner products x_j^T x_k in (22) can be replaced by the kernel function k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k). Therefore, the dual problem in (22) can be replaced by the following equation

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k k(x_j, x_k)                (23)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  0 \le \alpha_j \le C, j = 1, ..., i

According to (19), we can also solve the above dual problem and find the optimal w and b. The classification is then obtained by the sign of

    \mathrm{sign}\left( \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \right)                (24)

      d) Types of Kernels
    The most commonly used kernel functions are the multivariate Gaussian radial basis function (MGRBF), the Gaussian radial basis function (GRBF), the polynomial function and the sigmoid function.
MGRBF:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = \exp\left( - \sum_{m=1}^{n} \frac{ (x_{jm} - x_{km})^2 }{ 2\sigma_m^2 } \right)                (25)

where \sigma_m \in R, x_{jm}, x_{km} \in R, and x_j, x_k \in R^n; x_{jm} is the m-th element of x_j, x_{km} is the m-th element of x_k, \sigma_m is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.
GRBF:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = \exp\left( - \frac{ \|x_j - x_k\|^2 }{ 2\sigma^2 } \right)                (26)

where \sigma \in R and x_j, x_k \in R^n; \sigma is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.
Polynomial function:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = (1 + x_j^T x_k)^d                (27)

where d is a positive integer and x_j, x_k \in R^n; d is the adjustable parameter of the polynomial kernel, and x_j, x_k are input data.
  2) Kernel SVM Input Data Structure
    In sport videos, a highlight event usually consists of several consecutive shots. Fig. 5 shows an example of a home run in a baseball game. It includes three consecutive shots: pitching and hitting, ball flying, and base running. Unlike many other highlight extraction algorithms that have to predefine the highlight events with specific constituting shots, we simply propose to collect the feature sets of several consecutive shots together as the input eigenvalues of the Kernel SVM.
  3) Kernel SVM Training Mode
    For the training mode, the data are processed in two steps: a) initialization, b) kernel parameters optimization and feature selection.
      a) Initialization of the Input Data
    The initialization process of the training mode is shown in Fig. 6. The video is partitioned into shots and divided into two sets: highlight shots and non-highlight shots. The eigenvalues of consecutive shots are collected as a data set. All data sets are composed into the input data vector. Then each eigenvalue is normalized into the range of [0, 100]. The order of the data sets in the input data vector is randomized.
      b) Kernel Parameters Optimization and Feature Selection
    Since the parameters in the kernel functions are adjustable, and in order to improve the classification accuracy, these kernel parameters should be properly set. In this process, we adopt the GA-based feature selection and parameter optimization method proposed by Huang [25] to select features and optimize kernel parameters for the support vector machines. Fig. 7 shows the flowchart of the feature selection and parameter optimization method.
    As shown in Fig. 7, we apply the GA to generate kernel parameters and select features to train the hyperplanes of the Kernel SVM. The processes that generate kernel parameters and select features using the GA are shown in Fig. 8. The GA start process includes generating chromosomes randomly and setting up parameters. The chromosome is represented in a binary coding format as shown in Fig. 9, where g_S^1 ~ g_S^{n_s}, g_C^1 ~ g_C^{n_c} and g_f^1 ~ g_f^{n_f} are the parameters of the kernel, the penalty factor and the features respectively. Here n_s, n_c, and n_f are the numbers of bits used to represent the above parameters. The parameters defined in the start process are the bits of parameters and features, the number of generations, the crossover and mutation rates, and the limitations of the parameters. The next step is to output the parameters and features to the Kernel SVM for training. In the selection step, we keep the two chromosomes with the maximum objective value (Of) obtained by (29) for the next generation. These chromosomes will not change in the following crossover and mutation steps. Fig. 10 shows the crossover and mutation operations. As shown in Fig. 10-(a), two new offspring are obtained by randomly exchanging genes between two chromosomes using one-point crossover. After the crossover operation, as shown in Fig. 10-(b), the binary-code genes are occasionally changed from 0 to 1 or vice versa, which is called the mutation operation. Finally, a new generation is obtained and the parameters and features are output again. These processes terminate when the predefined number of generations is reached.
    In this paper, we adopt the precision and recall rates to evaluate the performance of our system. The precision (P) and recall (R) rates are defined as follows

    P = \frac{SN_c}{SN_e},  R = \frac{SN_c}{SN_t}                (28)

where SN_c, SN_e, and SN_t are the numbers of correctly extracted highlight shots, extracted highlight shots, and actual highlight shots respectively.
    In the objective function calculation step, we calculate the objective value (Of) to evaluate the kernel parameters and the selected features generated by the GA. The objective value is calculated by the following equation

    O_f = 0.5 P + 0.5 R                (29)

These steps terminate when the predefined number of generations has been reached, and finally we select the kernel parameters and features which have the maximum objective value.
  4) Kernel SVM Analysis Mode
    In the analysis mode, the user has to select a sport type. The Kernel SVM system directly loads the pre-trained classification function corresponding to the sport type. The classification function is defined as (30), where C_x is the class of a video shot: C_x = +1 represents shots that belong to the highlight shots, and C_x = -1 represents non-highlight shots. This process can be performed very quickly, since the kernel parameters and features do not need to be trained again.

    C_x = \mathrm{sign}\left( \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \right)
        = \begin{cases} +1, & \text{if } \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \ge 0 \\ -1, & \text{if } \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b < 0 \end{cases}                (30)
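To make the analysis mode concrete, the following Python sketch (our own illustration, not the authors' implementation) evaluates the decision function of (30) for a normalized shot feature vector using a pre-trained set of support vectors, multipliers and bias. The GRBF kernel of (26) is used here for simplicity, and all names and the toy parameter values are assumptions.

    import numpy as np

    def grbf_kernel(xj, xk, sigma=1.0):
        """Gaussian radial basis function kernel, Eq. (26)."""
        return np.exp(-np.sum((xj - xk) ** 2) / (2.0 * sigma ** 2))

    def classify_shot(x, support_vectors, alphas, labels, b, sigma=1.0):
        """Decision function of Eq. (30): +1 = highlight shot, -1 = non-highlight.
        support_vectors: (N, d) training vectors with alpha_j > 0
        alphas, labels:  (N,) Lagrange multipliers and class labels y_j
        b:               scalar bias obtained in the training mode"""
        s = sum(y * a * grbf_kernel(x, sv, sigma)
                for sv, a, y in zip(support_vectors, alphas, labels))
        return 1 if s + b >= 0 else -1

    # Toy usage with made-up pre-trained parameters (illustration only).
    sv = np.array([[0.2, 0.9], [0.8, 0.1]])
    alphas = np.array([0.7, 0.7])
    labels = np.array([+1, -1])
    x_new = np.array([0.25, 0.85])        # normalized feature vector of a shot
    print(classify_shot(x_new, sv, alphas, labels, b=0.0))   # -> +1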
Figure 4. Linear decision function separating two classes: (a) Decision function separating the positive class from the negative class; (b) The margin that separates the two hyperplanes; (c) The case of linearly non-separable data sets.

Figure 5. A home run event in a baseball game: (a) pitching and hitting; (b) ball flying; (c) base running.

Figure 6. The initialization of training data.

Figure 7. The flowchart of the feature selection and parameters optimization method.

Figure 8. Genetic algorithm to generate parameters and features.

Figure 9. Chromosome.

Figure 10. (a) Crossover operation; (b) Mutation operation.
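As a rough illustration of the one-point crossover and bit-flip mutation operations depicted in Fig. 10 (our own sketch, not the authors' code; the chromosomes here are arbitrary bit strings and the mutation rate is an assumed value):

    import random

    def one_point_crossover(parent1, parent2):
        """Exchange the tails of two bit-string chromosomes at a random
        point, as in Fig. 10-(a)."""
        point = random.randint(1, len(parent1) - 1)
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
        return child1, child2

    def mutate(chromosome, rate=0.05):
        """Flip each gene from 0 to 1 or vice versa with a small
        probability, as in Fig. 10-(b)."""
        return ''.join(('1' if g == '0' else '0') if random.random() < rate else g
                       for g in chromosome)

    # Toy usage on 8-bit chromosomes (values are illustrative only).
    random.seed(1)
    c1, c2 = one_point_crossover('01011111', '00010010')
    print(c1, c2)
    print(mutate(c1, rate=0.1))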
IV.     EXPERIMENTAL RESULTS                                     [2]    X. Tong, L. Duan, H. Lu, C. Xu, Q. Tian and J. S. Jin, „A mid-level
                                                                                          visual concept generation framework for sports analysis‟, Proc. IEEE
The experimental setup for different sport types are listed in                            ICME, July 2005, pp. 646–649.
Table I. For the baseball game, we take hits, home runs,                           [3]    A. Hanjalic, „Multimodal approach to measuring excitement in
strike out, steal, and replay as highlight events. For                                    video‟, Proc. IEEE ICME, July 2003, pp. 289–292.
basketball game, the highlight events are dunks, three-point                       [4]    A. Hanjalic, „Generic approach to highlights extraction from a sport
shots, jump shots, bank shots and replays. For soccer game,                               video‟, Proc. IEEE ICIP, Sept. 2003, pp. I - 1–4.
we set highlight events as goals, long shoots, close-range                         [5]    L. Y. Duan, M. Xu, T. S. Chua, Q. Tian, and C. S.Xu, „A mid-level
shoots, free kicks, corner kicks, break through, and replays.                             representation framework for semantic sports video analysis‟, Proc.
                                                                                          ACM Multimedia, Nov. 2003, pp. 33–44.
                                                                                   [6]    Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, „Integrated image
    In this paper, we adopt three kernel functions include                                and speech analysis for content-based video indexing‟, Proc. IEEE
multivariate Gaussian radial basis function, Gaussian radial                              ICMCS, May 1996, pp. 306–313.
basis function and polynomial function. Then we evaluate                           [7]    K. Wan and C. Xu, „Efficient multimodal features for automatic
the performance for extracting highlight shots of sport video                             soccer highlight generation‟, Proc. IEEE ICPR, Aug. 2004, pp. 973–
among these kernel functions. Table. II shows the                                         976.
experimental results of NYY vs. NYM, Table. III shows the                          [8]    Q. Huang, J. Hu, W. Hu, T. Wang, H. Bai and Y. Zhang, „A reliable
experimental results in the game NBA Celtics vs. Rockets,                                 logo and replay detector for sports video‟, Proc. IEEE ICME, July
and Table. IV shows the experimental results of the soccer                                2007, pp. 1695–1698.
game Arsenal vs. Hotspur. According to the experimental                            [9]    J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati and P. Pala,
                                                                                          „Soccer highlights detection and recognition using HMMs‟, Proc.
results, we find that the SVM with kernel function MGRBF                                  IEEE ICME, Aug. 2002, pp. 825–828.
have the best performance among these types of sport videos.                       [10]   G. Xu, Y. F. Ma, H. J. Zhang and S. Yang, „A HMM based semantic
                                                                                          analysis framework for sports game event detection‟, Proc. IEEE
 TABLE I.           THE EXPERIMENTAL SETUP FOR DIFFERENT SPORT TYPES                      ICIP, Sept. 2003, pp. I - 25–8.
      Sport type              Sequence         Total length   Shot length          [11]   J. Wang, C. Xu, E. Chng and Q. Tian, „Sports highlight detection
                                                                                          from keyword sequences using HMM‟, Proc. IEEE ICME, June 2004,
       Baseball           NYY vs. NYM          146 minutes       1097                     pp. 599–602.
      Basketball         Celtics vs. Rockets    32 minutes        180              [12]   P. Chang, M. Han and Y. Gong, „Extract highlights from baseball
        Soccer           Asenal vs. Hotspur    48 minutes        280                      game video with hidden Markov models‟, Proc. IEEE ICIP, Sept.
                                                                                          2002, pp. 609–612.
      TABLE II.          THE EXPERIMENTAL RESULTS OF BASEBALL GAME                 [13]   N. H. Bach, K. Shinoda and S. Furui, „Robust highlight extraction
                                                                                          using multi-stream hidden Markov models for baseball video‟, Proc.
                 Sequence               NYY vs. NYM                                       IEEE ICIP, Sept. 2005, pp. III - 173–6.
                  Kernel       MGRBF      GRBF    Polynomial                       [14]   Z. Xiong, R. Radhakrishnan, A. Divakaran and T. S. Huang, „Audio
                  Precision      87%        89%       77%
                   Recall        99%        81%       91%

   TABLE III.            THE EXPERIMENTAL RESULTS OF BASKETBALL GAME

                  Sequence            Celtics vs. Rockets
                   Kernel       MGRBF      GRBF      Polynomial
                  Precision      100%       86%       93%
                   Recall         93%      100%       87%

        TABLE IV.          THE EXPERIMENTAL RESULTS OF SOCCER GAME

                  Sequence            Arsenal vs. Hotspur
                   Kernel       MGRBF      GRBF      Polynomial
                  Precision      100%       76%      100%
                   Recall         88%       96%       73%

                              V.    CONCLUSION

   A kernel SVM can be trained to classify the shots by exploiting the information of a unified set of basic features. The experimental results show that the SVM with the multivariate Gaussian radial basis function (MGRBF) kernel achieves an average precision rate of 96% and an average recall rate of 93%.
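As a rough illustration only (not part of the paper): the sketch below assumes scikit-learn and stand-in shot feature vectors, trains an SVM with a custom multivariate Gaussian RBF kernel passed as a callable, and reports precision and recall as in the tables above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

def mgrbf_kernel(sigmas):
    """Multivariate Gaussian RBF with one bandwidth per feature (assumed form)."""
    inv2s2 = 1.0 / (2.0 * np.asarray(sigmas) ** 2)
    def gram(X, Y):
        # K[i, j] = exp(-sum_d (X[i,d] - Y[j,d])^2 / (2 * sigma_d^2))
        d2 = (((X[:, None, :] - Y[None, :, :]) ** 2) * inv2s2).sum(axis=2)
        return np.exp(-d2)
    return gram

# Stand-in data: each row is a shot's basic-feature vector, label 1 = highlight shot.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 6)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(60, 6)), rng.integers(0, 2, size=60)

clf = SVC(kernel=mgrbf_kernel(np.full(6, 1.0)), C=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("precision = %.2f, recall = %.2f"
      % (precision_score(y_test, pred), recall_score(y_test, pred)))
```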




IMAGE INPAINTING USING STRUCTURE-GUIDED PRIORITY BELIEF
           PROPAGATION AND LABEL TRANSFORMATIONS

            Heng-Feng Hsin (辛恆豐), Jin-Jang Leou (柳金章), Hsuan-Ying Chen (陳軒盈)

                         Department of Computer Science and Information Engineering
                                      National Chung Cheng University
                                  Chiayi, Taiwan 621, Republic of China
                             E-mail: {hhf96m, jjleou, chenhy}@cs.ccu.edu.tw



                     ABSTRACT                                      problem with isophote constraint. They estimate the
                                                                   smoothness value given by the best chromosome of GA,
In this study, an image inpainting approach using                  and project this value in the isophotes direction. Chan
structure-guided priority belief propagation (BP) and              and Shen [3] proposed a new diffusion method, called
label transformations is proposed. The proposed                    curvature-driven diffusions (CDD), as compared to
approach contains five stages, namely, Markov random               other diffusion models. PDE-based approaches are
field (MRF) node determination, structure map                      suitable for thin and elongated missing parts in an image.
generation, label set enlargement by label                         For large and textured missing regions, the processed
transformations, image inpainting by priority-BP                   results of PDE-based approaches are usually
optimization, and overlapped region composition. Based             oversmooth (i.e., blurring).
on experimental results obtained in this study, as                      Exemplar-based approaches try to fill missing
compared with three comparison approaches, the                     regions in an image by simply copying some available
proposed approach provides better image inpainting                        part in the image. Nie et al. [4] improved Criminisi et
results.                                                           al.’s approach [5] by changing the filling order and
                                                                   overcame the problem that gradients of some pixels on
Keywords Image Inpainting; Priority Belief Propagation;                   greedy way of filling an image, resulting in visual
Label Transformation; Markov Random Field (MRF);                   shortcoming of exemplar-based approaches is the
Structure Map.                                                     greedy way of filling an image, resulting in visual
                                                                   inconsistencies. To cope with this problem, Sun et al. [6]
                 1. INTRODUCTION                                   proposed a new approach. However, in their approach,
                                                                   user intervention is required to specify the curves on
Image inpainting is to remove unwanted objects or                  which the most salient missing structures reside. Jia and
recover damaged parts in an image, which can be                    Tang [7] used image segmentation to abstract image
employed in various applications, such as repairing                structures. Note that natural image segmentation is a
aged images and multimedia editing. Image inpainting               difficult task. To cope with this problem, Komodakis
approaches can be classified into three categories,                and Tziritas [8] proposed a new exemplar-based
namely, statistical-based, partial differential equation           approach, which treats image inpainting as a discrete
(PDE) based, and exemplar-based approaches.                        global optimization problem.
Statistical-based approaches are usually used for texture
synthesis and suitable for highly-stochastic parts in an                       2. PROPOSED APPROACH
image. However, statistical-based approaches are hard
to rebuild structure parts in an image.                            The proposed approach contains five stages, namely,
     PDE-based approaches try to fill target regions of            Markov random field (MRF) node determination,
an image through a diffusion process, i.e., diffuse                structure map generation, label set enlargement by label
available data from the source region boundary towards             transformations, image inpainting by priority-BP
the interior of the target region by PDE, which is                 optimization, and overlapped region composition.
typically nonlinear. Bertalmio et al. [1] proposed a
PDE-based image inpainting approach, which finds out               2.1. MRF node determination
isophote directions and propagates image Laplacians to
the target region along these directions. Kim et al. [2]           As shown in Fig. 1 [8], an image I0 contains a target
used genetic algorithms (GA) to solve the inpainting               region T and a source region S with S=I0-T. Image




inpainting is to fill T in a visually plausible way by simply pasting various patches from S. In this study, image inpainting is treated as a discrete optimization problem with a well-defined energy function. Here, discrete MRFs are employed.
     To define the nodes of an MRF, the image lattice is used with horizontal and vertical spacings of gapx and gapy (pixels), respectively. For each lattice point, if its neighborhood of size (2gapx+1) × (2gapy+1) overlaps the target region, it becomes an MRF node p. Each label of the label set L of an MRF consists of (2gapx+1) × (2gapy+1) pixels from the source region S. Based on the image lattice, each MRF node may have 2, 3, or 4 neighboring MRF nodes.
     Assigning a label to an MRF node is equivalent to copying the label (patch) to the MRF node. To evaluate the goodness of a label (patch) for an MRF node, the energy (cost) function of an MRF is defined, which includes the cost of the observed region of an MRF node.
     We will assign a label \hat{x}_p \in L to each MRF node p so that the total energy F(\hat{x}) of the MRF is minimized. Here,

    F(\hat{x}) = \sum_{p \in \nu} V_p(\hat{x}_p) + \sum_{(p,q) \in \varepsilon} V_{pq}(\hat{x}_p, \hat{x}_q),    (1)

where V_p(x_p) (called the label cost hereafter) denotes the single-node potential for placing label x_p over MRF node p, i.e., how well the label x_p agrees with the source region around p. V_{pq}(x_p, x_q) represents the pairwise potential measuring how well node p agrees with its neighboring node q over their overlapped region when pasting x_p at p and x_q at q; \varepsilon denotes the set of MRF edges.

2.2. Structure map generation

In this study, the Canny edge detector [9] is used to extract the edge map of an image, which preserves the important structural properties of the source region in the image. A binary mask E(p) is used to build the structure map of the image, which is simply the edge map after morphological dilation. If E(p) is non-zero, pixel p belongs to the structure part. Then, E(p) is used to formulate the structure weighting function Z(p,q):

    Z(p,q) = \begin{cases} 1, & \text{if } E(p)=0 \text{ and } E(q)=0, \\ w, & \text{otherwise,} \end{cases}    (2)

where w is the structure weighting coefficient. The label cost V_p(x_p) is defined as the sum of weighted squared differences (SWSD):

    V_p(x_p) = \sum_{dp \in [-gap_x,gap_x] \times [-gap_y,gap_y]} Z(p+dp,\, x_p+dp)\, M(p+dp)\, \big(I_0(p+dp) - I_0(x_p+dp)\big)^2,    (3)

where M(p) denotes a binary mask, which is non-zero if pixel p lies inside the source region S. Thus, for an MRF node p, if its neighborhood of size (2gapx+1) × (2gapy+1) does not intersect S, V_p(x_p)=0. V_{pq}(x_p,x_q) for pasting labels x_p and x_q over p and q, respectively, can be similarly defined as:

    V_{pq}(x_p,x_q) = \sum_{(dp,dq) \in R_o} Z(x_p+dp,\, x_q+dq)\, \big(I_0(x_p+dp) - I_0(x_q+dq)\big)^2,    (4)

where R_o is the overlapped region between the two labels x_p and x_q.

2.3. Label set enlargement

To make full use of the label information in the original image, three types of label transformations are used to enlarge the label set. The first type of label transformation contains two different directions: the vertical and horizontal flippings, which can find labels (patches) that do not exist in the original source region but have symmetric properties in the horizontal or vertical direction. The second type of label transformation contains three different rotations: left 90° rotation, right 90° rotation, and 180° rotation, which can find rotated labels (patches) of the above-mentioned three degrees. The third type of label transformation is scaling. To keep the original horizontal and vertical spacings gapx and gapy, the original image is directly up/down scaled so that new labels (patches) can be obtained from the scaled images with the same horizontal and vertical spacings. Here, both the up-sampled (double-resolution by bilinear interpolation) image and the down-sampled (half-resolution) image are used to generate extra candidate labels (patches).

2.4. Image inpainting by priority-BP optimization

Belief propagation (BP) [10] treats an optimization problem by iteratively solving a finite set of equations until the optimal solution is found. Ordinary BP is computationally expensive. For an MRF graph, each node sends "messages" to all its neighboring nodes, whereas the node receives messages from all its neighboring nodes. This process is iterated until the messages no longer change.
     The set of messages sent from node p to its neighboring node q is denoted by \{m_{pq}(x_q)\}_{x_q \in L}. This message expresses the opinion of node p about assigning label x_q to node q. The message formulation is defined as:

    m_{pq}(x_q) = \min_{x_p \in L} \Big\{ V_{pq}(x_p,x_q) + V_p(x_p) + \sum_{r:\, r \neq q,\, (r,p) \in \varepsilon} m_{rp}(x_p) \Big\}.    (5)

That is, if node p wants to send message m_{pq} to node q, node p must traverse its own label set and find the best label to support node q when label x_q is assigned to node q. Each message is based on two factors: (1) the compatibility between labels x_p and x_q, and (2) the likelihood of assigning label x_p to node p, which in turn contains two factors: (1) the label cost V_p(x_p), and (2) the opinions of the other neighboring nodes about x_p, measured by the third term in Eq. (5).
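A minimal sketch (illustrative, not the authors' code) of the label cost of Eq. (3) and the message update of Eq. (5), assuming NumPy, a float image I0, the source mask M, and the dilated edge map E; all helper names here are made up.

```python
import numpy as np

def patch(img, center, gx, gy):
    """The (2*gy+1) x (2*gx+1) window of img centered at (y, x)."""
    y, x = center
    return img[y - gy:y + gy + 1, x - gx:x + gx + 1]

def label_cost(I0, M, E, p, xp, gx, gy, w):
    """V_p(x_p) of Eq. (3): weighted SSD over the known pixels (M == 1) of
    node p's window, with structure weights of Eq. (2) taken from E."""
    diff2 = (patch(I0, p, gx, gy) - patch(I0, xp, gx, gy)) ** 2
    z = np.where((patch(E, p, gx, gy) == 0) & (patch(E, xp, gx, gy) == 0), 1.0, w)
    return float(np.sum(z * patch(M, p, gx, gy) * diff2))

def message(x_q, labels_p, V_p, V_pq, incoming):
    """m_pq(x_q) of Eq. (5): node p scans its own label set for the label that
    best supports assigning x_q to q; `incoming` lists the messages m_rp from
    p's other neighbors r != q, each given as a dict keyed by label."""
    best = np.inf
    for x_p in labels_p:
        cost = V_pq(x_p, x_q) + V_p(x_p) + sum(m[x_p] for m in incoming)
        best = min(best, cost)
    return best
```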




Messages are iteratively updated by Eq. (5) until they converge. Then, a set of beliefs, which represents the probability of assigning label x_p to p, is computed for each MRF node p as:

    b_p(x_p) = -V_p(x_p) - \sum_{r:\, (r,p) \in \varepsilon} m_{rp}(x_p).    (6)

The second term in Eq. (6) means that to calculate a node's belief, all messages from all its neighboring nodes must be gathered. When the beliefs of all MRF nodes have been calculated, each node p is assigned the best label, i.e., the one having the maximum belief:

    \hat{x}_p = \arg\max_{x_p \in L} b_p(x_p).    (7)

     To reduce the computational cost of BP, Komodakis and Tziritas [8] proposed "priority-BP" to control the message passing order of MRF nodes and "dynamic label pruning" to reduce the number of elements in the label set of each MRF node. In [8], the priority of an MRF node p is related to the confidence of node p about the label that should be assigned to it. The confidence depends on the current set of beliefs \{b_p(x_p)\}_{x_p \in L} that has been calculated by BP. Here, the priority of node p is designed as:

    priority(p) = \frac{1}{\left| \{ x_p \in L : b_p^{rel}(x_p) \geq b_{conf} \} \right|},    (8)

    b_p^{rel}(x_p) = b_p(x_p) - b_p^{max},    (9)

where b_p^{rel} is the relative belief value and b_p^{max} is the maximum belief among all labels in the label set of node p. Here, the confidence of an MRF node is determined by the number of candidate labels whose relative belief values exceed a certain threshold b_conf.
     On the other hand, to traverse the MRF nodes, the number of candidate labels for an MRF node can be pruned dynamically. To commit a node p, all labels with relative beliefs less than a threshold b_prune for node p will not be considered as its candidate labels. The remaining labels are called "active labels" for node p. In this study, the label set of an MRF node is sorted by belief values, at least Lmin active labels are selected for an MRF node, and a similarity measure is used to check the remaining labels. If the similarity between two remaining labels is greater than a threshold Sdiff, one of the two remaining labels will be pruned. This process is iterated until the relative belief value of any remaining label is smaller than b_prune or the number of active labels reaches a user-specified parameter Lmax.
     To apply priority-BP to image inpainting, the labels from the source region of the original image and the labels obtained by applying the three types of label transformations are collected so that each MRF node maintains its label set. Then, the number of priority-BP iterations, K, is set, the priorities of all MRF nodes are initialized only by their V_p(x_p) values, and message passing is performed. Each priority-BP iteration consists of a forward and a backward pass. Message passing and dynamic label pruning are performed in the forward pass, and each MRF edge can be bidirectionally traversed. In the forward pass, all the nodes are visited in priority order: the MRF node having the highest priority passes messages to its neighboring MRF nodes having lower priorities, and that node is then marked as "committed," so it will not be visited again in this forward pass. For label pruning, the MRF node having the highest priority can transmit its "cheap" message to all its neighboring MRF nodes that have not yet been committed. The priority of each neighboring MRF node that has received a new message is updated. The above process is iterated until there are no uncommitted MRF nodes. On the other hand, the backward pass is performed in the reverse order of the forward pass. Note that label pruning is not performed in the backward pass.

2.5. Overlapped region composition

When the number of iterations reaches K, each MRF node p is assigned the label having the maximum b_p value. All the MRF nodes are composed to produce the final image inpainting result, where label composition is performed in decreasing order of MRF node priorities. Depending on whether the region contains a global structure or not, two strategies are used to compose each overlapped region. If an overlapped region contains a global structure, graph cuts are used to seam it. Otherwise, each pixel value of the overlapped region is computed as a weighted sum of the two corresponding pixel values, where the weighting coefficient is proportional to the priority of an MRF node.
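The node priority of Eqs. (8)-(9) and the dynamic label pruning rule of Sec. 2.4 can be sketched as follows; this is an illustration under assumed data structures (a NumPy belief array per node and a caller-supplied patch-similarity test), not the implementation used in the paper.

```python
import numpy as np

def node_priority(beliefs, b_conf):
    """Eqs. (8)-(9): relative beliefs b - b_max; the fewer labels above b_conf,
    the more confident the node and the higher its priority."""
    rel = beliefs - beliefs.max()
    n_confident = int(np.count_nonzero(rel >= b_conf))
    return 1.0 / max(n_confident, 1)

def prune_labels(beliefs, labels, b_prune, L_min, L_max, too_similar):
    """Dynamic label pruning: scan labels in decreasing belief order, always keep
    at least L_min, skip near-duplicate labels, and stop once the relative belief
    drops below b_prune or L_max active labels have been collected."""
    order = np.argsort(-beliefs)
    rel = beliefs - beliefs.max()
    active = []
    for i in order:
        if len(active) >= L_min and (rel[i] < b_prune or len(active) >= L_max):
            break
        if any(too_similar(labels[i], labels[j]) for j in active):
            continue
        active.append(i)
    return [labels[i] for i in active]
```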




                  3. EXPERIMENTAL RESULTS

In this study, 21 test images are used to evaluate the performance of the proposed approach. Three comparison inpainting approaches, namely, the PDE-based approach [1], the exemplar-based approach [5], and the ordinary priority-BP-based approach [8], are implemented in this study. Some image inpainting results by the three comparison approaches and the proposed approach are shown in Figs. 2-6.
     In Fig. 2, the image size is 256 × 170, gapx=9, gapy=9, bconf=-180000, bprune=-360000, Lmax=30, Lmin=5, and w=10. Blurring artifacts appear in Fig. 2(c). In Fig. 2(d), because the isophote direction is too complex to guide the inpainting process, the inpainting results are not good. Compared with the ordinary priority-BP-based approach (Fig. 2(e)), the proposed approach (Fig. 2(f)) can keep the global structure in the image by guiding the message passing process with the structure map. In Fig. 3, the image size is 206 × 308, gapx=5, gapy=5, bconf=-40000, bprune=-80000, Lmax=20, Lmin=3, and w=10. In Fig. 3(c), blurring artifacts appear in the upper part of the image. In Fig. 3(d), the stone bridge cannot be well reconstructed because there is no suitable patch in the image. Furthermore, error propagation appears in the lake. In Fig. 3(e), because the priority of the bridge structure is low, the bridge structure is broken. In the proposed approach, the weighting coefficient is used to raise the priority of the bridge structure, resulting in better inpainting results. In Fig. 4, the image size is 208×278, gapx=7, gapy=7, bconf=-150000, bprune=-300000, Lmax=30, Lmin=5, and w=2. For this image, the proposed approach can reconstruct the tower structure by label transformations, whereas the three comparison approaches contain error propagation due to the lack of suitable labels. In Fig. 5, the image size is 287×216, gapx=10, gapy=10, bconf=-200000, bprune=-400000, Lmax=50, Lmin=5, and w=15. In Fig. 5(f), the proposed approach uses both the original labels and the flipped labels to reconstruct the region to be inpainted, resulting in a better inpainted image. In Fig. 6, the image size is 257 × 271, gapx=6, gapy=6, bconf=-200000, bprune=-400000, Lmax=50, Lmin=10, and w=5. Because the building in the original image has a symmetric property, label transformations can be employed in this case. Blurring artifacts appear in Fig. 6(c). In Fig. 6(d), the isophote direction is too complex, so the structures interfere with each other. In Fig. 6(e), the inpainting results are poor due to the lack of valid labels. In Fig. 6(f), for the lower part of the image, the window structure is partially broken because the building is not totally symmetric, so error propagation appears in some inpainted regions of the image. However, the inpainted image by the proposed approach is better than those by the three comparison methods.

             4. CONCLUDING REMARKS

In this study, an image inpainting approach using structure-guided priority BP and label transformations is proposed. In the proposed approach, to reconstruct the global structures in an image, the structure map of the image is generated, which guides the inpainting process by priority-BP optimization. Furthermore, three types of label transformations are employed to obtain more usable labels (patches) for inpainting. Based on the experimental results obtained in this study, the proposed approach provides better image inpainting results than the three comparison approaches.

               ACKNOWLEDGEMENT

This work was supported in part by the National Science Council, Taiwan, Republic of China under Grants NSC 96-2221-E-194-033-MY3 and NSC 98-2221-E-194-034-MY3.

[Figure panels omitted.]
Fig. 1. (a) Nodes and edges of an MRF; (b) labels of an MRF for image inpainting [8].
Fig. 2. (a) The original image, "Lantern;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.




[Figure panels omitted.]
Fig. 4. (a) The original image, "Tower;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 5. (a) The original image, "Picture frame;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.




[Figure panels omitted.]
Fig. 6. (a) The original image, "Building;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

                     REFERENCES

[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of ACM Int. Conf. on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[2] J. B. Kim and H. J. Kim, "Region removal and restoration using a genetic algorithm with isophote constraint," Pattern Recognition Letters, Vol. 24, pp. 1303–1316, 2003.
[3] T. Chan and J. Shen, "Non-texture inpaintings by curvature-driven diffusions," Journal of Visual Comm. Image Rep., Vol. 12, pp. 436–449, 2001.
[4] D. Nie, L. Ma, and S. Xiao, "Similarity based image inpainting method," in Proc. of 2006 Multi-Media Modeling Conf., 2006, pp. 4–6.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003, pp. 721–728.
[6] J. Sun, L. Yuan, J. Jia, and H. Y. Shum, "Image completion with structure propagation," in Proc. of 2005 ACM SIGGRAPH on Computer Graphics, 2005, pp. 861–868.
[7] J. Jia and C. K. Tang, "Image repairing: Robust image synthesis by adaptive and tensor voting," in Proc. of 2003 IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 643–650.
[8] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Trans. on Image Processing, Vol. 16, pp. 2649–2661, 2007.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679–698, 1986.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA, 1988.




CONTENT-BASED BUILDING IMAGE RETRIEVAL

    Wen-Chao Chen(陳文昭), Chi-Min Huang (黃啟銘), Shu-Kuo Sun (孫樹國), Zen Chen (陳稔)

                        Dept. of Computer Science, National Chiao Tung University
                   E-mail:Chaody.cs94g@nctu.edu.tw, toothbrush.cs97g@nctu.edu.tw,
                            sksun@csie.nctu.edu.tw, zchen@cs.nctu.edu.tw


Abstract—This paper addresses an image retrieval                query image, the content-based image retrieval system
system which searches the most similar building for a           extracts the most similar images from a database by
captured building image from an image database based            either spatial information, such as color, texture and
on an image feature extraction and matching method.             shape, or frequency domain features, e.g. wavelet-based
The system then can provide relevant information to             methods [3].
users, such as text or video information regarding the                Existing content-based image retrieval algorithms
query building in augmented reality setting. However,           can be categorized into (a) image classification methods,
the main challenge is the inevitable geometric and              and (b) object identification methods. The first approach
photometric transformations encountered when a                  retrieves images which belong to the same category as a
handheld camera operates at a varying viewpoint under           query image. Jing et al. proposed region-based image
various lighting environments. To deal with these               retrieval architecture [6]. An image is segmented into
transformations, the system measures the similarity             regions by the JSEG method and every region is
between the MSER features of the captured image and             described with color moment. Every region is clustered
database images using the Zernike Moment (ZM)                   to form a codebook by Generalized Lloyd algorithm.
information. This paper also presents algorithms based          The similarity of two images is then measured by Earth
on feature selection by multi-view information and the          Mover’s Distance (EMD). Willamowski et al. presented
DBSCAN clustering method to retrieve the most                   generic visual categorization method by using support
relevant image from database efficiently. The                   vector machine as a classifier [7]. Affine invariant
experimental results indicate that the proposed system          descriptor represents an image as a vector quantization.
has excellent performance in terms of the accuracy and               In the second approach Wu and Yang [8] detected
processing time under the above inevitable imaging              and recognized street landmarks from database images
variations.                                                     by combining salient region detection and segmentation
                                                                techniques. Obdrzalek and Matas [9] developed a
Keywords Image recognition and retrieval; Geometric
                                                                building image recognition system based on local affine
and photometric transformations; Zernike moments;
                                                                features that allows retrieval of objects in images taken
Image indexing;
                                                                from distinct viewpoints. Discrete cosine transform
                 1. INTRODUCTION                                (DCT) is then applied to the local representations to
                                                                reduce the memory usage. Zhang and Kosecka [10] also
     In recent years, there have been an increasing
                                                                proposed a system to recognize building by a
number of applications in Location-Based Service
                                                                hierarchical approach. They first index the model views
(LBS). LBS is a service that can be accessed from
                                                                by localized color histograms. After converting to
mobile devices to provide information based on the
                                                                YCbCr color space and indexing with the hue value,
current geographical position, e.g. GPS information.
                                                                SIFT descriptors [4, 5] are then applied to refine
However, GPS position is only available in open spaces
                                                                recognition results.
since the GPS signal is often blocked by high-rise
                                                                     Most of related image retrieval algorithms detect
buildings or overhead bridges. Magnetic compasses are
                                                                local features of a query image and then compare with
also disturbed by nearby magnetic materials. Vision-
                                                                detected features of database images by feature
based localization is therefore an alternative approach to
                                                                descriptors. However, the feature detectors such as
provide both accurate and robust navigation information.
                                                                Harris corner detector and the SIFT detector, which is
     This paper addresses the aspects of a building
                                                                based on the difference of Gaussians (DOG), utilize a
image retrieval system. The building recognition is a
                                                                circular window to search for a possible location of a
content-based image retrieval technique that can be
                                                                feature. The image content in the circular window is not
extended to applications of object recognition and web
                                                                robust to affine deformations. Furthermore, the feature
image search via a cloud service combined with
                                                                points may not be reliable and may not appear
consumer-oriented augmented reality tools. Given a




simultaneously across the multiple views with wide-                   2. FEATURE DETECTOR AND DESCRIPTOR
baselines.
     Matas et al. [13] presented a maximally stable                 2.1. MSER feature region detector
extremal region (MSER) detector. Mikolajczyk and
                                                                      Recently, a number of local feature detectors using a
Schmid [3] proposed Harris-Affine and Hessian-Affine
                                                                local elliptical window have been investigated. The
detectors. The performances of the existing region
                                                                MSER detector is evaluated as one of the best region
detectors were evaluated in [14] in which the MSER
                                                                detectors [5]. The advantage of MSER detector is the
detector and the Hessian-Affine detector were ranked as
                                                                ability to resist geometry transformation. The MSER
the two best. Chen and Sun [2] compare various popular
                                                                detector performs also well when images contain
feature descriptors, e.g. SIFT, PCA-SIFT, GLOH,
                                                                homogenous regions with distinctive boundaries [1].
steerable filter, with phase-based Zernike Moment (ZM)
                                                                Because building images contain regions with
descriptor. The ZM descriptor performs significantly
                                                                boundaries, such as windows and color bricks, the
better than other descriptors in geometric and
                                                                MSER detector can extract these regions stably.
photometric transformations, such as blur, illumination,
                                                                      After detecting elliptical regions by MSER method,
noise, scale, JPEG compression. To describe a building
                                                                we have to filter out unstable regions such as oversized
image in geometric and photometric transformations,
                                                                area, large aspect ratio, duplicated regions, and high
this paper utilizes the MSER method as the feature
                                                                area variation, as shown in fig. 2.
detector. The Zernike Moment is then applied to
describe each detected feature region.                              2.2. Zernike Moment feature region descriptor
     In order to index a large number of features
                                                                      Once the feature regions are detected, every region
descriptors, KD-tree [12] is a fundamental method to
                                                                is described as a feature vector for similarity
recursively partition the space into two subspaces to
                                                                measurement. This paper presents a method which
construct a binary tree.
                                                                applies Zernike Moment (ZM) as the feature descriptor
     We also introduce a building image dataset, the
                                                                [2].
NCTU-Bud dataset, containing the high resolution
                                                                      Zernike moments (ZMs) have been used in object
images of 22 buildings located on National Chiao Tung
                                                                recognition regardless of variations in position, size and
University campus with a total of 190 database images.
                                                                orientation. Essentially Zernike moments are the
We capture at least one face of each building from 5
                                                                extension of the geometric moments by replacing the
distinct viewing directions. Query images are captured
under 12 different lighting conditions for performance          conventional transform kernel x m y n with orthogonal
evaluation.                                                     Zernike polynomials.
     Fig. 1 shows the overall system block diagram. Section 2 briefly describes the background of the feature detector and descriptor. Section 3 presents a feature selection method to remove unstable features and a clustering method to obtain representative features. In Section 4 the image indexing and retrieval method is described. In Section 5 experimental results on the NCTU-Bud dataset are described. The performance on the publicly available ZuBud dataset is evaluated as well. Finally, Section 6 concludes the paper.
     The Zernike basis function V_{nm}(\rho, \theta) is defined over the unit circle with order n and repetition m such that (a) n - |m| is even and (b) |m| \leq n, as given by

    V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}, \quad \text{for } \rho \leq 1,    (1)

where R_{nm}(\rho) is a radial polynomial of the form

    R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s \frac{(n-s)!}{s!\, \big(\frac{n+|m|}{2}-s\big)!\, \big(\frac{n-|m|}{2}-s\big)!}\, \rho^{n-2s}.    (2)
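For reference, Eqs. (1)-(2) can be evaluated directly; the sketch below is a plain NumPy illustration (not code from the paper) of the radial polynomial and the basis function on the unit disk.

```python
import numpy as np
from math import factorial

def zernike_radial(n, m, rho):
    """Radial polynomial R_nm(rho) of Eq. (2); requires n - |m| even and |m| <= n."""
    m = abs(m)
    return sum((-1) ** s * factorial(n - s)
               / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s))
               * rho ** (n - 2 * s)
               for s in range((n - m) // 2 + 1))

def zernike_basis(n, m, rho, theta):
    """Zernike basis function V_nm(rho, theta) of Eq. (1), valid for rho <= 1."""
    return zernike_radial(n, m, rho) * np.exp(1j * m * theta)
```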




[Figures omitted.]
Figure 1. System block diagram.
Figure 2. (a) Initial MSER results. (b) Results after removing unstable MSER feature regions.
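A rough sketch of the detection and filtering stage of Sec. 2.1, assuming OpenCV; the file name and the filtering thresholds below are placeholders, not values from the paper.

```python
import cv2

# Detect MSER regions and discard obviously unstable ones (oversized regions,
# extreme aspect ratios); further checks such as duplicate removal would follow.
img = cv2.imread("building.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

kept = []
for pts, (x, y, w, h) in zip(regions, bboxes):
    area_ratio = len(pts) / float(gray.size)
    aspect = max(w, h) / float(max(min(w, h), 1))
    if area_ratio < 0.05 and aspect < 5.0:   # placeholder thresholds
        kept.append(pts)
```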




The set of basis functions \{V_{nm}(\rho, \theta)\} is orthogonal, i.e.,

    \int_0^{2\pi}\!\!\int_0^1 V_{nm}^{*}(\rho,\theta)\, V_{pq}(\rho,\theta)\, \rho\, d\rho\, d\theta = \frac{\pi}{n+1}\, \delta_{np}\, \delta_{mq}, \quad \text{with } \delta_{ab} = \begin{cases} 1, & a=b \\ 0, & \text{otherwise.} \end{cases}    (3)

The two-dimensional ZMs for a continuous image function f(\rho, \theta) are represented by

    Z_{nm} = \frac{n+1}{\pi} \int\!\!\int_{\rho \leq 1} f(\rho,\theta)\, V_{nm}^{*}(\rho,\theta)\, \rho\, d\rho\, d\theta = |Z_{nm}|\, e^{i\phi_{nm}}.    (4)

For a digital image function the two-dimensional ZMs are given as

    Z_{nm} = \frac{n+1}{\pi} \sum_{(\rho,\theta) \in \text{unit disk}} f(\rho,\theta)\, V_{nm}^{*}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}}.    (5)

     Define a region descriptor \vec{P} based on the sorted ZMs as follows:

    \vec{P} = [\, |Z_{11}| e^{i\phi_{11}},\ |Z_{31}| e^{i\phi_{31}},\ \ldots,\ |Z_{n_{max} m_{max}}| e^{i\phi_{n_{max} m_{max}}} \,]^T,    (6)

where |Z_{nm}| is the ZM magnitude and \phi_{nm} is the ZM phase.
     The Zernike moments are derived after integrating the normalized region with respect to the Zernike basis functions. In this paper, the ZMs with m = 0 are not included, and both the maximum order n and the maximum repetition m equal 12, so the length of the feature vector is 42. In this way, two feature vectors represent a feature region: mag = [\, |Z_{1,1}|, |Z_{3,1}|, \ldots, |Z_{12,12}| \,]^T and phase = [\, \phi_{1,1}, \phi_{3,1}, \ldots, \phi_{12,12} \,]^T.

[Figure omitted.]
Figure 3. Normalization of an elliptical region.

2.3. A similarity measure

Let \vec{P}_q = (mag_q, phase_q) and \vec{P}_d = (mag_d, phase_d) be two ZM feature vectors, where mag_q = [\, |Z^q_{1,1}|, |Z^q_{3,1}|, \ldots, |Z^q_{12,12}| \,]^T, phase_q = [\, \phi^q_{1,1}, \phi^q_{3,1}, \ldots, \phi^q_{12,12} \,]^T, mag_d = [\, |Z^d_{1,1}|, |Z^d_{3,1}|, \ldots, |Z^d_{12,12}| \,]^T, and phase_d = [\, \phi^d_{1,1}, \phi^d_{3,1}, \ldots, \phi^d_{12,12} \,]^T.
     The similarity of magnitude S_{mag}(\vec{P}_q, \vec{P}_d) is defined as the cosine of the angle between the two magnitude vectors:

    S_{mag}(\vec{P}_q, \vec{P}_d) = \frac{mag_q \cdot mag_d}{\|mag_q\|\, \|mag_d\|}.    (7)

The value ranges between 0 and 1, and a higher value indicates that the two vectors are more similar. This is equivalent to the Euclidean distance between the two normalized unit vectors.
     A similarity measure using the weighted ZM phase differences is expressed by

    S_{phase}(\vec{P}_q, \vec{P}_d) = 1 - \sum_{m}\sum_{n} w_{nm}\, \frac{\min\{\, |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi|,\ 2\pi - |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi| \,\}}{\pi},    (8)

where w_{nm} = \frac{|Z^q_{nm}| + |Z^d_{nm}|}{\sum_{n,m} (|Z^q_{nm}| + |Z^d_{nm}|)} and \Phi_{nm} = (\phi^q_{nm} - \phi^d_{nm}) \bmod 2\pi is the actual phase difference.
     The rotation angle \hat{\alpha} is determined by an iterative computation of \hat{\alpha}_m = (\Phi_{nm} - \hat{\alpha}_{m-1}) \bmod 2\pi, with the initial value \hat{\alpha}_0 = 0, using the entire information of the Zernike moments sorted by m. The value range of S_{phase}(\vec{P}_q, \vec{P}_d) is the interval [0, 1], and a higher value indicates that the two vectors are more similar.

          3. EFFICIENT BUILDING IMAGE DATABASE CONSTRUCTION

     In building image retrieval applications, the scale of the database is typically large, with a considerable number of visual descriptors. In order to index and search rapidly, effective approaches to storing appropriate descriptors are proposed for constructing a large-scale building image database.

3.1. Feature selection from multiple images

     Modern building databases in image retrieval applications normally contain multiple views of a single building. For example, the ZuBud dataset collects five images for each building in the database. We refine the detected MSER feature regions by verifying consistency between multiple images of a building that are captured from distinct viewpoints. The basic idea of the selection is to keep representative feature regions and remove discrepant features as outliers. Feature region selection reduces the storage space of feature descriptors in a database. Furthermore, this method remarkably improves the efficiency and accuracy of the image retrieval process.
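A small sketch of the similarity measures of Eqs. (7)-(8), which the selection step below relies on; it assumes NumPy arrays of ZM magnitudes and phases, the repetition index of each component, and an already-estimated rotation angle alpha (illustrative names only, not the paper's code).

```python
import numpy as np

def s_mag(mag_q, mag_d):
    """Eq. (7): cosine similarity between two ZM magnitude vectors, in [0, 1]."""
    return float(mag_q @ mag_d / (np.linalg.norm(mag_q) * np.linalg.norm(mag_d)))

def s_phase(phase_q, phase_d, mag_q, mag_d, m_orders, alpha):
    """Eq. (8): weighted, rotation-compensated ZM phase differences mapped to [0, 1];
    alpha is assumed to have been estimated beforehand (Sec. 2.3)."""
    w = (mag_q + mag_d) / np.sum(mag_q + mag_d)          # weights w_nm
    big_phi = np.mod(phase_q - phase_d, 2 * np.pi)       # actual phase differences
    d = np.abs(np.mod(big_phi - m_orders * alpha, 2 * np.pi))
    d = np.minimum(d, 2 * np.pi - d)                     # wrap differences to [0, pi]
    return float(1.0 - np.sum(w * d / np.pi))
```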




[Figure panels omitted.]
Figure 4. (a)-(c) Three different images in a group of a building image before feature selection. (d)-(f) Three different images in a group of a building image after feature selection.

The occurrence of discrepant feature regions comes from non-building areas, such as trees, bicycles, and pedestrians, as shown in Fig. 4. Feature regions in non-building areas are not stable compared with regions in building areas. Therefore, excluding these feature regions from the database is necessary to ensure consistent results.

This paper presents a method to select feature regions automatically by measuring the similarity between multiple images of a building. The algorithm for feature region selection is given in Fig. 5. Only feature regions that are similar across the views are preserved. Two regions are considered similar if $S_{mag}(P_q, P_d) \geq 0.7$ and $S_{phase}(P_q, P_d) \geq 0.7$. A comparison of feature regions before and after selection is shown in Fig. 4: unstable feature regions in Figs. 4(a)-4(c), such as trees and pedestrians, are removed by the proposed algorithm, and the results of the selection are shown in Figs. 4(d)-4(f).

  Input: A group of feature regions in multi-view images.
  Output: Selected feature regions.
  For each feature region
    If there are at least two similar regions in other views
        Preserve the feature region;
    Else
        Delete the feature region;

Figure 5. Feature region selection algorithm.

3.2. Feature clustering

After removing non-building feature regions, most of the remaining feature regions belong to the buildings. However, repeated patterns, e.g., windows and doors, are common in a building image. In order to reduce the storage space of the repeated feature descriptors in the database, clustering similar features into a representative feature descriptor is necessary.

In conventional clustering algorithms, e.g., the k-means and k-medoid algorithms, each cluster is represented by its gravity center or by one of the objects of the cluster located near its center. However, determining the number of clusters k is not straightforward. Moreover, the ability to distinguish different features is reduced because isolated feature regions are forced to merge into a nearby cluster whose region appearance may have dissimilar characteristics. Consequently, we adopt the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [15] for clustering.

The DBSCAN algorithm relies on a density-based notion of clusters. Two input parameters, ε and MinPts, determine the clustering conditions in two steps. The first step chooses an arbitrary point from the database as a seed; the second retrieves all points reachable from the seed. The parameter ε defines the size of the neighborhood, and for each point to be included in a cluster there must be at least a minimum number (MinPts) of points in an ε-neighborhood of a cluster point.
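Read as code, the selection rule of Fig. 5 amounts to a support count across views. The following is a minimal Python sketch, assuming callables s_mag and s_phase that implement Eqs. (7) and (8) (defined earlier in the paper and not reproduced here); the 0.7 thresholds and the requirement of at least two similar regions in other views come from the text above.

    # Hedged sketch of the feature region selection rule in Fig. 5.
    # Each region is a tuple (view_id, mag_vector, phase_vector).
    def select_regions(regions, s_mag, s_phase, thr=0.7, min_support=2):
        """Keep a region only if at least `min_support` regions from *other*
        views satisfy S_mag >= thr and S_phase >= thr with it."""
        selected = []
        for view_i, mag_i, phase_i in regions:
            support = sum(
                1
                for view_j, mag_j, phase_j in regions
                if view_j != view_i
                and s_mag(mag_i, mag_j) >= thr
                and s_phase(phase_i, phase_j) >= thr
            )
            if support >= min_support:
                selected.append((view_i, mag_i, phase_i))
        return selected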




Figure 6. (a)-(e) Feature regions in the same cluster. (f)-(j) Another cluster of feature regions after DBSCAN.
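To illustrate the clustering step described in the following paragraphs (DBSCAN over the 42-dimensional ZM magnitude vectors of one building group, with each cluster replaced by the mean of its members and isolated vectors preserved), here is a minimal Python sketch using scikit-learn's DBSCAN. The eps and min_samples values are placeholders only; the paper does not state them in this excerpt.

    # Hedged sketch of the clustering of Section 3.2 using DBSCAN [15].
    import numpy as np
    from sklearn.cluster import DBSCAN

    def compress_group(mag_vectors, eps=0.1, min_samples=3):
        """mag_vectors: (N, 42) array of ZM magnitude descriptors of one group.
        Returns the reduced set of representative descriptors."""
        X = np.asarray(mag_vectors)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        reps = [X[labels == k].mean(axis=0) for k in set(labels) if k != -1]
        reps.extend(X[labels == -1])   # isolated (noise) features are kept as-is
        return np.array(reps)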

The input to the DBSCAN algorithm is the set of 42-dimensional selected ZM magnitude vectors of all images belonging to the same group, i.e., the same building. We calculate the mean of the feature vectors as the representative vector of each cluster, while preserving the isolated feature points.

The elliptical regions in Figs. 6(a)-6(e) are feature vectors in the same cluster and are replaced by a representative feature vector. Figs. 6(f)-6(j) show another feature cluster in the same group of multi-view images.

4. IMAGE INDEXING AND RETRIEVAL

4.1. Descriptor indexing with a KD-tree

After the feature selection and clustering processes described above, all extracted building regions are indexed by a KD-tree according to their ZM magnitude vectors. The goal is to build an indexing structure so that the nearest neighbors of a query vector can be searched rapidly.

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space into two parts by a hyperplane perpendicular to a coordinate axis. The binary space partition is executed recursively until every leaf node contains a single data point. The algorithm for constructing a KD-tree is given in Fig. 7, initialized with dim = 1 and Dataset as the set of N database points.

  Input: N feature vectors in k dimensions.
  Output: A KD-tree in which every leaf node contains a single feature vector.
  kd_tree_build (Dataset, dim)
  {
    If Dataset contains only one point
        Mark a leaf node containing the point;
        Return;
    else
        1. Sort all points in Dataset according to feature dimension dim;
        2. Determine the median value of feature dimension dim in Dataset, make a new node, and save the median value;
        3. Dataset_bigger = the points in Dataset whose value in dimension dim is not less than the median value;
        4. Dataset_smaller = the points in Dataset whose value in dimension dim is less than the median value;
        5. Set Dataset_bigger as the new node's right child and Dataset_smaller as the new node's left child;
        6. Call kd_tree_build (Dataset_bigger, (dim+1) % k);
        7. Call kd_tree_build (Dataset_smaller, (dim+1) % k);
  }

Figure 7. The KD-tree construction algorithm.

4.2. Query by region vote counting

After establishing the KD-tree that organizes the ZM magnitude feature vectors of the database, the tree is descended to find the leaf node into which the query point falls. After obtaining the first candidate nearest neighbor, we verify with the ZM phase feature vector whether the candidate point is qualified. In our experiments, two vectors are qualified as similar when their distance is small and their magnitude and phase similarity measures satisfy $S_{mag}(P_q, P_d) \geq 0.85$ in equation (7) and $S_{phase}(P_q, P_d) \geq 0.85$ in equation (8). Then, based on the current minimum distance between the query point and the single database point in the leaf node, the KD-tree is revisited to search for the next available neighbor within the current minimum distance. This tree backtracking is repeated until no further reduction of the minimum distance to the query point is found.

For each extracted region in the query building image, one vote is cast for the database building image containing the region claimed as the nearest neighbor of the query region. After all extracted regions of the query image have voted, we count the number of votes each database image receives. The database image with the maximum number of votes is returned as the most similar building to the query building.
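As an illustration of the indexing and voting procedure, the sketch below uses SciPy's cKDTree instead of the custom KD-tree of Fig. 7, and abstracts the ZM phase verification with the 0.85 thresholds of Eqs. (7)-(8) into a caller-supplied verify function; it is an approximation of Sections 4.1-4.2 under those assumptions, not the authors' exact implementation.

    # Hedged sketch: KD-tree indexing of ZM magnitude vectors and vote counting.
    from collections import Counter
    import numpy as np
    from scipy.spatial import cKDTree

    def retrieve(query_regions, db_mags, db_labels, verify):
        """query_regions: list of (mag, phase) descriptors of the query image;
        db_mags: (N, 42) array of database magnitude vectors; db_labels[i] is
        the database image owning vector i; verify(i, region) performs the
        phase/similarity check. Returns the database image with most votes."""
        tree = cKDTree(np.asarray(db_mags))
        votes = Counter()
        for region in query_regions:
            _, idx = tree.query(region[0], k=1)   # nearest neighbour in magnitude space
            if verify(idx, region):               # ZM phase verification step
                votes[db_labels[idx]] += 1
        return votes.most_common(1)[0][0] if votes else None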




Figure 8. Examples of the database images in the NCTU-Bud dataset. Rows: EC Building; ED Building, Face 1 (first side view); ED Building, Face 2 (second side view). Columns: Views 1-5.



Figure 9. Examples of query images for the NCTU-Bud dataset. Columns: Class A (correct exposure), Class B (over exposure), Class C (under exposure), Class D (correct exposure with occlusion), Class E (over exposure with occlusion), Class F (under exposure with occlusion). Rows: sunny day and cloudy day.

5. EXPERIMENTAL RESULTS

In our experiments, the proposed algorithm is implemented in Matlab under the Windows environment and evaluated on a platform with a 2.83 GHz processor and 3 GB of RAM. We test the proposed indexing and retrieval system on two sets of building images: the NCTU-Bud dataset created by ourselves and the publicly available ZuBud dataset [11].

5.1. The NCTU-Bud Dataset

To evaluate the proposed approach and to establish a benchmark for future work, we introduce the NCTU-Bud dataset. The dataset contains high-resolution images of 22 buildings on the NCTU campus. For each building in the database we capture at least one facet of the building from five different viewing directions. All database images have a resolution of 1600x1200 pixels, and the database contains a total of 190 building images. Some representative database images are shown in Fig. 8.

For the query images, we capture with a different camera at a resolution of 2352x1568 pixels under two weather conditions: sunny and cloudy. For each weather condition, six images are collected, each with different exposure settings and different occlusion conditions, so that 12 classes of images constitute the query dataset, as shown in Fig. 9. Furthermore, five additional camera poses, with different rotations, focal lengths, and translations, are recorded for further testing. A total of 2280 query images is gathered.

5.2. Experimental results for the NCTU-Bud dataset

Table I shows the total number of region feature vectors stored in the database and the recognition rate for the query images captured with normal exposure on cloudy days. From this table, feature selection using multiple images alone does not raise the query accuracy rate. However, we achieve 100% accuracy after applying both feature selection and DBSCAN clustering: not only is the region storage space reduced, but only the representative feature vectors are stored for the query search, and consequently the image retrieval accuracy rises to 100%.

The storage size (the number of nodes) is determined by the number of region feature vectors obtained from all images in the database. Approximately 50% of the space is saved by applying feature selection and the DBSCAN clustering method.

The time for feature region detection and description depends on the resolution and the content of an image. If the scene is complex, the number of extremal regions detected by MSER increases and the processing time increases as well. Table II shows the average processing time of feature detection and descriptor computation over 92 different images at different resolutions.

With feature selection and DBSCAN clustering, the average time to index the database is 22.4 seconds, and the average query time for an image at a resolution of 2352x1568 pixels is 40 seconds. The query time comprises feature region detection (MSER), descriptor computation (ZM), and the nearest-neighbor search in the database.

Table III shows the query accuracy rate for the 12 classes of query images; each class consists of 190 query images. The accuracy rate on cloudy days is generally higher than on sunny days, possibly because strong shadows are cast by occluding objects on sunny days. Over-exposed images are also harder to recognize than images under the other exposure conditions.

Comparing classes D-F with classes A-C, the proposed method also performs well under occlusion, which shows that the proposed system can distinguish feature regions even when buildings are partially occluded.

5.3. Experimental results for the ZuBud dataset

The ZuBud dataset contains images of 201 different buildings taken in Zurich, Switzerland, with 5 different images per building. Fig. 10 shows some example images. The dataset provides 115 query images, taken with a different camera under different weather conditions.

For the ZuBud dataset, the query accuracy rate with feature selection and DBSCAN clustering is over 95%. The average query time is 3.1 seconds with a variation of 1.16 seconds. These results show that our system also performs well on this publicly available dataset.

TABLE I. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE NCTU-BUD DATASET

                                 Without feature   With feature      With feature
                                 selection         selection only    selection + DBSCAN
  # of region feature vectors    113,194           68,036            56,089
  Memory size of a KD-tree       22 MB             12.9 MB           10.6 MB
  Query accuracy rate            94.7%             94.7%             100%

TABLE II. AVERAGE PROCESSING TIME OF FEATURE DETECTION AND DESCRIPTOR COMPUTATION AT DIFFERENT RESOLUTIONS

  Resolution                           2352x1568    1600x1200    640x480
  Avg. / std. processing time (sec)    13.8 / 4.3   5.8 / 1.58   1.8 / 0.7

TABLE III. QUERY ACCURACY RATE OF THE NCTU-BUD DATASET UNDER DIFFERENT WEATHER CONDITIONS

                                            Sunny day    Cloudy day
  Class A: Correct exposure                 93.6%        100%
  Class B: Over exposure                    92.1%        92.1%
  Class C: Under exposure                   93.1%        96.3%
  Class D: Correct exposure with occlusion  93.6%        96.3%
  Class E: Over exposure with occlusion     92.1%        94.2%
  Class F: Under exposure with occlusion    92.6%        96.8%

Figure 10. Example images of the ZuBud dataset (a query image and its corresponding database image).

TABLE IV. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE ZUBUD DATASET

                               Without feature   With feature    With feature
                               selection         selection       selection + DBSCAN
  # Region feature vectors     488,527           264,311         256,261
  Recognition accuracy         89.57%            94.8%           95.6%




6. CONCLUSION

In this paper, we have presented a novel image retrieval system based on the MSER detector and the ZM descriptor, which is robust to geometric and photometric transformations. Experimental results illustrate that the KD-tree indexing and retrieval system with the magnitude and phase ZM feature vectors achieves a high query accuracy rate: the accuracy rates for our NCTU-Bud dataset and the ZuBud dataset are 100% and 95%, respectively.

The success of our system is attributed to:
(a) Selection of MSER feature vectors using multiple images of the same building captured from different viewpoints, which removes the unreliable regions.
(b) The DBSCAN clustering technique, which groups similar feature vectors into a representative feature descriptor to tackle the problem of repeated feature patterns in the image.

In the future, we will consider optimizing the programs and porting the system to mobile phones for mobile device applications. Furthermore, the query results may be verified using multi-view geometry constraints to eliminate outliers and lower the misrecognition rate.

REFERENCES

[1] J. Wang, G. Wiederhold, O. Firschein, and S. Wei, "Content-Based Image Indexing and Searching Using Daubechies' Wavelets," Int'l J. Digital Libraries, vol. 1, pp. 311-328, 1998.
[2] Z. Chen and S. K. Sun, "A Zernike moment phase-based descriptor for local image representation and matching," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 205-219, 2009.
[3] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, pp. 43-72, 2005.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[6] F. Jing and M. Li, "An efficient and effective region-based image retrieval framework," IEEE Trans. Image Processing, vol. 13, no. 5, 2004.
[7] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan, "Categorizing nine visual classes using local appearance descriptors," ICPR Workshop on Learning for Adaptable Visual Systems, 2004.
[8] W. Wu and J. Yang, "Object fingerprints for content analysis with applications to street landmark localization," Proc. ACM Int'l Conf. on Multimedia, 2008.
[9] S. Obdrzalek and J. Matas, "Image retrieval using local compact DCT-based representation," Pattern Recognition, 25th DAGM Symposium, vol. 2781 of Lecture Notes in Computer Science, Magdeburg, Germany: Springer-Verlag, pp. 490-497, 2003.
[10] W. Zhang and J. Kosecka, "Hierarchical building recognition," Image and Vision Computing, 2007.
[11] H. Shao, T. Svoboda, and L. Van Gool, "ZuBuD—Zurich Buildings Database for Image Based Recognition," Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, 2003.
[12] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, pp. 761-767, 2004.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, and J. Matas, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, no. 1/2, pp. 43-72, 2005.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996.
Using Modified View-Based AAM to Reconstruct the Frontal Facial Image with
                  Expression from Different Head Orientation
1 Po-Tsang Li (李柏蒼), 1 Sheng-Yu Wang (王勝毓), 1,2 Chung-Lin Huang (黃仲陵)

1 Dept. of Electrical Engineering, National Tsing Hua University, Hsin-Chu, Taiwan.
2 Dept. of Informatics, Fo-Guang University, I-Lan, Taiwan.
                                                E-mail: clhuang@ee.nthu.edu.tw

Abstract

This paper develops a method to solve the unpredictable head orientation problem in 2D facial analysis. We extend the expression subspace of the view-based Active Appearance Model (AAM) so that it can be applied to multi-view face fitting and pose correction for facial images with any expression. Our multi-view, model-based facial image fitting system can be applied to a 2D face image (with expression variation) at any pose, and the facial image in any view can be reconstructed in another view. We divide the facial image into an expression component and an identity component to increase the face identification accuracy. The experimental results demonstrate that the proposed algorithm can be applied to improve the facial identification process. We test our system on video sequences with a frame size of 320x240 pixels; it requires 30~45 ms to fit a face and 0.35~0.45 ms for warping.

Keywords: View-based AAM; Facial expression

1. Introduction

Facial image analysis consists of face detection, facial feature extraction, face identification, and facial expression recognition. Currently, 2D face recognition technology is well developed with high recognition accuracy. However, unpredictable head orientation often causes a big problem for 2D facial analysis. Most previous facial identification or facial expression recognition methods are limited to the frontal face and the profile face; they work only for faces in a single view with ±15 degrees of variation.

The best-known 3D model is the 3D Morphable Model (3DMM) proposed by Blanz and Vetter [11]. 3DMM and AAM are similar: both are model-based approaches consisting of a shape model and a texture model, and both use Principal Component Analysis (PCA) for dimension reduction. The two major differences between them are (1) the optimization algorithm used in fitting, and (2) the feature points of the shape model, which are 3D points for 3DMM, whereas for AAM they are 2D locations. In data collection, an AAM can be built from 2D facial images, whereas 3DMM captures the depth information using a 3D face laser scanner. 3DMM can accurately reconstruct a 3D human face; however, it requires so much computation that its applications are limited mainly to academic research.

Blanz et al. [12] apply 3DMM to human identity recognition; however, the fitting process takes 4.5 minutes per frame on a workstation with a 2 GHz Pentium 4 processor. For facial expression recognition, the problem becomes more obvious. Due to insufficient 3D face expression data, one could only rely on a single-expression (neutral) 3D face model for 3D facial identity recognition. However, as more 3D face expression databases become available, researchers such as Wang et al. [16], Amor et al. [17], and Kakadiaris et al. [18] have developed methods to identify the human face under different views and different expressions. Nevertheless, because facial expressions are complicated {surprise, sadness, happiness, disgust, anger, fear}, building 3-D models for the different facial expressions is impractical. Lu et al. [20] only record the variations of the landmark points and then apply Thin-Plate-Spline warping to synthesize facial images of other expressions for fitting face expression images. Chang et al. [15] also divide the training data into an identity space and an expression space and use bilinear interpolation to synthesize human faces with other expressions. Ramanathan et al. [19] propose a method using 3DMM for facial expression recognition.

To capture 3D face information, we may use either a 3D laser scanner or multi-view 2D images. Recently, the 2D+3D active appearance model (AAM) method has been proposed by Xiao et al. [21], Koterba et al. [22], and Sung et al. [23]. Based on the known projection matrix of a certain view, the so-called 2D+3D AAM method trains a 2D AAM for a single view for later tracking and fitting of the landmark points in 2D images; it then uses the corresponding points to calculate the 3D positions of the landmark points. Xiao et al. [21] use only 900 image frames from a single camera to develop the 3D AAM model. Because of the precision error of 2D AAM landmark-point tracking, Lucey et al. [24] point out that the feature points tracked by 2D+3D AAM are worse than the normalized shape obtained by 2D AAM fitting; their argument is that 2D+3D AAM cannot obtain the depth information precisely, which causes recognition errors.

In this paper, we apply the view-based AAM proposed by Cootes et al. [4] for model-fitting of an input face with any




expression and in any view angle; the fitted face can then be warped to any target viewing angle. The view-based AAM consists of several 2D AAMs, which can be further divided into an inter model and an intra model. The inter model describes the parameter transformation between any two 2D AAMs, whereas the intra model describes the relationship between the model parameters and the viewing angle for a single 2D AAM. The view-based AAM is generated by an off-line training process. Besides the identity subspace, this paper extends the expression subspace of the inter model so that the view-based AAM can be applied to multi-view face fitting and pose correction for an input face of any expression.

The flow diagram is shown in Fig. 1. For an input face image, based on the intra model, we find the relationship between the parameters and the viewing angle and then remove the angle effect from the parameters. We then divide the angle-independent model parameters into identity parameters and expression parameters, which can be transformed to the target 2D AAM model by using the inter model. Finally, based on the intra model, we add the influence of the angle parameters back onto the model parameters and synthesize the facial image at the target viewing angle.

Figure 1. The flowchart of our system (input image → facial region detection → pose classification → fitting with the i-th AAM of the modified view-based AAM → selection of the target model for target orientation θ → rotation of model i → j → reconstruction with the j-th AAM → new view at angle θ).

2. Active Appearance Model

In the modified view-based AAM, the 2D AAM plays a crucial part. This section introduces the overall structure of the 2D AAM and the flow of the training and fitting algorithms. The major goal of the AAM, first proposed by Cootes et al. [2], is to find the model parameters that reduce the difference between the synthesized image (generated by the AAM model) and the target image. Based on the parameters and the AAM model, we may regenerate the face.

2.1 Statistical Appearance Models

A statistical appearance model consists of two parts: the shape model, describing the shape of the object, and the texture model, describing the gray-level information of the object. Labeled face images are used to train the AAM. To train the AAM model, we must have an annotated set of facial images with so-called landmark points. These landmark points are selected as salient points on the face that are identifiable on any human face. Figure 2 shows some annotated training face images.

Figure 2. Examples of the training set.

The number of landmark points is determined experimentally. Although more landmark points increase the accuracy of the model, they also increase the computation of the model fitting process. The distribution of landmark points depends on the characteristics of the face, such as the eyebrows, eyes, nose, and mouth. In these regions we need to put more landmark points, whereas in the other regions (such as ears, forehead, or other non-visible areas) we put no landmarks.

2.2 Shape Model

Here, we use triangular meshes to compose the human face. We define a shape $s_i$ as a vector containing the coordinates of $N_s$ landmark points in a face image $I_i$:

$$s_i = (x_1, y_1, x_2, y_2, \ldots, x_{N_s}, y_{N_s})^T \qquad (1)$$

The model is constructed from the coordinates of the labeled points of the training images. We align the locations of the corresponding points on different training faces by using Procrustes analysis as normalization. Given the set of aligned shapes, we then apply Principal Component Analysis (PCA) to the data. Any shape example can then be approximated by

$$s = \bar{s} + P_s b_s \qquad (2)$$

where $\bar{s}$ is the mean shape of all aligned shapes, calculated as $\bar{s} = \sum_{i=1}^{N} s_i / N$, $P_s = (p_{s1}, p_{s2}, \ldots, p_{st})$ is the matrix of the first $t$ eigenvectors of the shape covariance matrix, and $b_s$ is the set of shape parameters. Figure 3 shows the effects of varying the first two shape model parameters by ±2 standard deviations.

Figure 3. First two modes of shape variation (±2 sd).

2.3 Texture Model

The texture of the AAM is defined as the gray-level information at the pixels $x = (x, y)$ that lie inside the mean shape $\bar{s}$. First, we align the control points and the mean shape $\bar{s}$ of every training face
              e                              M




image by using affine warping. Then we sample the gray-level information $g_{im}$ of the warped images over the mean shape region. Before applying PCA to the texture data, to minimize the effect of lighting variation, we first normalize $g_{im}$ by applying a scaling $\alpha$ and an offset $\beta$:

$$g = (g_{im} - \beta \cdot \mathbf{1}) / \alpha \qquad (3)$$

where $\mathbf{1}$ is a vector of ones. Let $\bar{g}$ be the mean of the normalized texture data, scaled and offset so that its sum is zero and its variance is unity. $\alpha$ and $\beta$ are selected to normalize $g_{im}$ as

$$\beta = (g_{im} \cdot \mathbf{1})/n \quad \text{and} \quad \alpha = g_{im} \cdot \bar{g}, \qquad (4)$$

where $n$ is the number of pixels in the mean shape. We iteratively apply Equations (3) and (4) to estimate $\bar{g}$ until the estimate stabilizes. Then, we apply PCA to the normalized texture data so that a texture example can be expressed as

$$g = \bar{g} + P_g b_g \qquad (5)$$

where $P_g$ contains the eigenvectors and $b_g$ is the vector of texture parameters. Figure 4 shows the effects of varying the first two texture model parameters through ±2 standard deviations.

Figure 4. First two modes of texture variation (±2 sd).

2.4 Appearance Model

The shape and texture of any example in the training set can be summarized by $b_s$ and $b_g$. The appearance model combines the two parameter vectors into a single parameter $b_c$ as

$$b_c = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (s - \bar{s}) \\ P_g^T (g - \bar{g}) \end{pmatrix} \qquad (6)$$

where $W_s$ is a diagonal matrix of weights for each shape parameter. A further PCA is applied to remove possible correlations between the shape and texture variations:

$$b_c = Q_c c \qquad (7)$$

where $Q_c$ contains the eigenvectors and $c$ is the appearance parameter.

Given an appearance parameter $c$, we can synthesize a face image by generating the gray levels $g$ in the interior of the mean shape and warping the texture from the mean shape $\bar{s}$ to the model shape $s$, using

$$s = \bar{s} + P_s W_s^{-1} Q_s c, \qquad g = \bar{g} + P_g Q_g c \qquad (8)$$

where $Q_c = (Q_s, Q_g)^T$. Figure 5 shows the effects of varying the first two appearance model parameters through ±2 standard deviations.

Figure 5. First two modes of appearance variation (±2 sd).

2.5 Shape parameter weight

The shape parameters $b_s$ have units of distance and the texture parameters $b_g$ have units of intensity. Because they are of different nature and different relevance, they cannot be compared directly. To estimate a suitable $W_s$, we systematically displace the elements of $b_s$ from each example's best-match parameters in the training set and sample the corresponding texture difference. In addition, the active appearance model has a pose parameter vector describing the similarity transformation of the shape. The pose parameter vector $t$ has four elements, $t = (k_x, k_y, t_x, t_y)^T$, where $(t_x, t_y)$ is the translation and $(k_x, k_y)$ represent the scaling $k$ and the in-plane rotation angle $\theta$, with $k_x = k(\cos\theta - 1)$ and $k_y = k\sin\theta$.

2.6 Active Appearance Model Search

Here, we introduce the kernel of the AAM. The ultimate goal of applying the AAM is that, given an input facial image, we find the model parameters that can be applied to the AAM model to synthesize an image similar to the input image. Given a new image, we have an initial estimate of the appearance parameter $c$ and of the position, orientation, and scaling of the face in the image. We need to minimize the difference

$$E = g_{image} - g_{model} \qquad (9)$$

where, based on the current estimate of $c$, we have $g_{model} = \bar{g} + P_g Q_g c$ and $s_{model} = \bar{s} + P_s W_s^{-1} Q_s c$. Here $g_{image}$ denotes the texture obtained from the target image by applying the warp function defined by $s_{model}$ and $\bar{s}$ and sampling the pixel intensities of the region. An algorithm is needed to adjust the parameters so that the input image and the image generated by the model become as close as possible. Many optimization algorithms have been proposed for this parameter search; in this paper, we apply the so-called AAM-API method [8]. Rewriting (9) as

$$E(p) = g_{image} - g_{model} \qquad (10)$$

where $p$ is the vector of model parameters, $p = (c^T \mid t^T \mid u^T)$ with $u = (\alpha\ \beta)^T$, a Taylor expansion of (10) gives

$$E(p + \nabla p) \approx E(p) + \frac{\partial E}{\partial p}\,\nabla p \qquad (11)$$

where the $ij$-th element of the matrix $\partial E / \partial p$ is $\partial E_i / \partial p_j$. Suppose $E$ is the current matching error. We want to find $\nabla p$ that minimizes $\|E(p + \nabla p)\|^2$. Setting Equation (11) to zero, we obtain the least-squares solution

$$\nabla p = -A\,E(p), \qquad A = \left( \frac{\partial E}{\partial p}^T \frac{\partial E}{\partial p} \right)^{-1} \frac{\partial E}{\partial p}^T.$$

If we applied a conventional optimization process, we would need to recalculate $\partial E / \partial p$ after every match, which requires heavy computation. To simplify the optimization, Cootes et al. assume that $A$ is approximately constant and that the relationship between $E$ and $\nabla p$ is linear.
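To make the search concrete, the following minimal Python sketch shows the on-line matching loop that the step-by-step procedure below summarizes: with the regression matrix A precomputed off-line, each iteration evaluates the texture residual E(p) = g_image − g_model and updates the parameters by ∇p = −A E(p). The functions model_texture and warp_and_sample are hypothetical placeholders for the AAM synthesis and shape-normalized sampling described above, and the stopping test is an assumption rather than the authors' exact criterion.

    # Hedged sketch of the AAM search loop of Section 2.6 with a constant matrix A.
    import numpy as np

    def aam_search(image, p0, A, model_texture, warp_and_sample,
                   max_iters=30, tol=1e-4):
        """Iteratively refine the parameter vector p = (c | t | u)."""
        p = np.asarray(p0, dtype=float)
        prev_err = np.inf
        for _ in range(max_iters):
            g_model = model_texture(p)            # texture synthesized from p
            g_image = warp_and_sample(image, p)   # image sampled in the model frame
            E = g_image - g_model                 # residual texture E(p)
            err = np.linalg.norm(E)
            if abs(prev_err - err) < tol:         # assumed convergence test
                break
            p = p + (-A @ E)                      # p -> p + k*dp, dp = -A E(p), k = 1
            prev_err = err
        return p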




Therefore, we systematically displace the parameters from their optimal values on the example images and record the corresponding effect on the texture difference. Applying multivariate linear regression to the displacements ∇p and the corresponding texture differences E yields A. We therefore need not recalculate the matrix A; it can be computed off-line and stored in memory for later reference. To match an image on-line, the procedure is as follows:

Initial estimate of the parameters p.
1. Calculate the model shape s_model and the model texture g_model.
2. Warp the current image and sample the texture g_image.
3. Evaluate the difference texture E = g_image − g_model.
4. Update the model parameters p → p + k∇p, with ∇p = −AE(p) and initial k = 1.
5. Calculate the new model shape s_model and model texture g_model.
6. Sample the image at the new shape to obtain g_image.
7. Calculate the new error E'.
8. If |E'|^2 < |E|^2, accept the new estimate; otherwise retry with k = 0.5 and then k = 0.25.

The iteration of the preceding steps stops when |E|^2 can no longer be reduced, and we may then assume that the iterative algorithm has converged.
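As a concrete illustration of the on-line matching loop above, the following minimal NumPy sketch assumes a precomputed update matrix A and two hypothetical helper functions, synthesize_texture and sample_texture, for steps 1, 2, 5, and 6; it is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def aam_search(image, p, A, synthesize_texture, sample_texture, max_iters=30):
        """Iterative AAM matching with a fixed, precomputed update matrix A."""
        g_model = synthesize_texture(p)                 # step 1: model texture
        g_image = sample_texture(image, p)              # step 2: image texture under p
        E = g_image - g_model                           # step 3: difference texture
        err = np.sum(E ** 2)
        for _ in range(max_iters):
            dp = -A @ E                                 # step 4: update direction
            for k in (1.0, 0.5, 0.25):                  # step 8: damped retries
                p_new = p + k * dp
                g_model_new = synthesize_texture(p_new)          # step 5
                g_image_new = sample_texture(image, p_new)       # step 6
                E_new = g_image_new - g_model_new
                err_new = np.sum(E_new ** 2)                     # step 7
                if err_new < err:                       # accept the improved estimate
                    p, E, err = p_new, E_new, err_new
                    break
            else:
                break                                   # no step size helped: converged
        return p, err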
   Figure 6. Examples from the training set for the models: (a) right profile face, 90° and 75°; (b) right half face, 60° and 45°; (c) frontal face, 0° and −15°.

3. Modified View-Based AAM

Cootes et al. [4] propose the view-based AAM, which uses several 2D AAMs to fit a 3D model to a 2D image. The model-based fitting for model parameter estimation can be divided into intra-model and inter-model fitting. Their method has been successfully applied to the human face without expression. However, it has problems fitting faces with expression, because in the face parameter space the intra-person changes due to expression are much larger than the inter-person changes. The original linear transformation between the view angle and the AAM parameters is then no longer valid. Here we propose a method that projects the facial space onto an identity subspace and an expression subspace to solve this problem. We divide the viewing angle into five ranges, [−90, −75], [−60, −45], [−15, 15], [45, 60], and [75, 90], from leftward to rightward. Since the human face is symmetric, in the experiments we only develop 2D AAMs for three ranges: [−15, 0], [45, 60], and [75, 90].

3.1 Training Data

Since we do not have a large multi-expression and multi-view facial image database for the 2D AAM training process, we obtained the training data by using six cameras to capture a multi-expression, multi-view facial image database. We collected multi-expression, multi-view facial images of 13 people with multiple expressions (neutral, surprised, happiness, sadness, disgust, anger, and fear). There are 510 facial images in total in the training data set. Figure 6 shows some of the training samples.

3.2 Intra-Model Rotate

Cootes et al. [4] suggest that the model parameters c are related to the view angle θ as

        c = c_0 + c_c cos(θ) + c_s sin(θ)                               (12)

where c_0, c_c, and c_s are vectors learned from the training data. We can find the optimal parameter value c_i of each training example and its corresponding view angle θ_i. Cootes' method does not determine θ_i precisely; it allows errors of about ±10 degrees. In our experiment, we fixed the cameras so that the viewing angle is known beforehand. However, this creates an under-determined problem: we use facial images from only two views to generate one AAM, so there are only two inputs available to estimate three unknowns. We therefore randomly perturb θ_i by ±1 degree. This is reasonable because errors during image capture are unavoidable, such as slight movements of the subject's body or head. Using this method to add more input data, we can estimate c_0, c_c, and c_s by applying multiple linear regression to the relationship between the c_i and (1, cos(θ), sin(θ))^T.

   Given a facial image, after finding the best fitting parameter c_j we may use Equations (13) and (14) to estimate the viewing angle θ_j:

        (x_j, y_j)^T = R_c^-1 (c_j − c_0)                               (13)

where R_c^-1 is the left pseudo-inverse of (c_c | c_s), i.e., R_c^-1 (c_c | c_s) = I_2, and

        θ_j = tan^-1(y_j / x_j)                                         (14)

Figure 7 shows the predicted angle compared with the actual angle over the training set for each model. The results are worse than those of Cootes et al. [4] because our model contains facial images with multiple expressions.

   Figure 7. Predicted angle vs. actual angle across the training set: (a) results on our data; (b) Cootes' experimental results for the view-based active appearance model.
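To make Equations (12)-(14) concrete, the following minimal NumPy sketch fits the angle model by least squares and then recovers the view angle from a fitted appearance parameter vector. The variable names (c0, cc, cs) mirror the text; the construction is our own illustration based on Equations (12)-(14), not the authors' code.

    import numpy as np

    def fit_angle_model(C, thetas):
        """Least-squares fit of c = c0 + cc*cos(theta) + cs*sin(theta) (Eq. 12).
        C: (n, d) array of fitted appearance parameters; thetas: (n,) in radians."""
        basis = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
        coeffs, *_ = np.linalg.lstsq(basis, C, rcond=None)   # rows: c0, cc, cs
        return coeffs[0], coeffs[1], coeffs[2]

    def estimate_view_angle(c_j, c0, cc, cs):
        """Recover theta_j from a fitted parameter vector c_j (Eqs. 13-14)."""
        Rc = np.column_stack([cc, cs])               # d x 2 matrix (cc | cs)
        x_j, y_j = np.linalg.pinv(Rc) @ (c_j - c0)   # left pseudo-inverse, Eq. (13)
        return np.arctan2(y_j, x_j)                  # Eq. (14)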
Given a new person's facial image, we apply AAM fitting to find the best model parameters and to estimate the head angle as well. Then we can remove the angle effect by using

        c_residual = c_j − c_0 − c_c cos(θ_j) − c_s sin(θ_j)            (15)

The model parameters are thereby separated into two parts: one part that describes the variation due to rotation, and another part that describes the remaining variations (e.g., the variation of identity, expression, and illumination). We can use these parameters to reconstruct the appearance at a new angle φ as

        c(φ) = c_residual + c_0 + c_c cos(φ) + c_s sin(φ)               (16)

This method can only perform small-angle rotations based on a single 2D AAM. Cootes et al. [7] and Huisman [25] have shown that this intra-model pose correction can be applied to human face recognition.
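A minimal sketch of Equations (15) and (16), assuming the c0, cc, cs vectors and the angle estimate from the previous sketch; it is illustrative only.

    import numpy as np

    def remove_rotation(c_j, theta_j, c0, cc, cs):
        """Eq. (15): strip the view-angle contribution from the parameters."""
        return c_j - c0 - cc * np.cos(theta_j) - cs * np.sin(theta_j)

    def reapply_rotation(c_residual, phi, c0, cc, cs):
        """Eq. (16): re-synthesize appearance parameters at a new angle phi."""
        return c_residual + c0 + cc * np.cos(phi) + cs * np.sin(phi)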

3.3 Identity and Expression Subspace

To perform a large-angle warping, we must transform the parameters between the 2D models. We intend to find a simple transformation between the two models. However, the parameters in (15) consist of an identity component and an expression component, which makes the transformation non-trivial. Cootes uses two different methods to remove the expression and project into an identity subspace; the parameters are then reduced to the variation of identity, for which the transformation is linear.

   Let r be defined as the residual parameter after (15). We divide the training data into r_neutral and r_exp, where exp ∈ {happiness, sadness, fear, anger, disgust, surprised}, to compute the expression and identity covariance matrices. We remove the identity component of r_exp by

        e_exp = r_exp − r_neutral                                       (17)

where e_exp is defined as the expression component. Figure 8 shows training examples of e_exp, and Figure 9 shows the training examples r_neutral. By applying PCA to r_neutral and e_exp, we can find the projection P_neutral into an identity subspace and P_exp into an expression subspace as

        e_exp = ē_exp + P_exp b_exp                                     (18)

and

        r_neutral = r̄_neutral + P_neutral b_neutral                    (19)

   Figure 8. Some examples from the expression-component training set.

   Figure 9. The neutral images used for training the identity subspace.

Costen et al. [26] suggested that, in this framework, expression changes are orthogonal to the changes due to identity. For a new image with parameter r, the expression parameter b_exp can be calculated by

        b_exp = P_exp^T (r − ē_exp)                                     (20)

Then we can compute r_neutral by

        r_neutral = r − ē_exp − P_exp b_exp                             (21)

and project it into the identity subspace,

        b_neutral = P_neutral^T (r_neutral − r̄_neutral)                (22)

   Figure 10. The relation of the facial space to identity and expression.

3.4 Inter-Model Rotate

We may now use multiple linear regression to relate b_exp and r_neutral in the ith AAM model to b_exp and r_neutral in the jth AAM model (through R_neutral^ij and R_exp^ij) as

        r_neutral^j = e_neutral^ij + R_neutral^ij r_neutral^i           (23)

and

        b_exp^j = e_exp^ij + R_exp^ij b_exp^i                           (24)

where e_neutral^ij and e_exp^ij are constant.
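The following minimal NumPy sketch illustrates the subspace projections of Equations (20)-(22). The matrices P_exp and P_neutral and the mean vectors are assumed to come from the PCA step above; the function name and argument names are ours.

    import numpy as np

    def split_identity_expression(r, P_exp, e_exp_mean, P_neutral, r_neutral_mean):
        """Project a residual parameter vector r into the expression and identity subspaces."""
        b_exp = P_exp.T @ (r - e_exp_mean)                       # Eq. (20)
        r_neutral = r - e_exp_mean - P_exp @ b_exp               # Eq. (21)
        b_neutral = P_neutral.T @ (r_neutral - r_neutral_mean)   # Eq. (22)
        return b_neutral, b_exp, r_neutral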


3.5 Reconstruct a New View

Given a match of a new person in one view, we can reconstruct another view by the following steps (as shown in Fig. 11):

1. Remove the effects of orientation (Eq. 15).
2. Project into the identity and expression subspaces of the model (Eqs. 20, 21, 22).
3. Project into the subspaces of the target model (Eqs. 23, 24).
4. Project back into the residual space and combine the two vectors into one vector (inverse of Eqs. 20, 21, 22).
5. Add the assigned orientation (Eq. 16).
Figure 11. The flowchart of Rotate Model.
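Putting the five steps together, a hedged end-to-end sketch of the rotation pipeline is given below. It reuses the helper functions from the earlier sketches and assumes that each AAM is described by a dict holding its c0, cc, cs vectors, PCA bases and means, and the learned inter-model regression terms of Eqs. (23)-(24); it is an illustration, not the authors' implementation.

    def rotate_to_new_view(c_j, theta_j, phi, src, dst):
        """Warp fitted parameters c_j (at angle theta_j, in model src) to angle phi in model dst."""
        # 1. Remove the orientation effect (Eq. 15).
        r = remove_rotation(c_j, theta_j, src["c0"], src["cc"], src["cs"])
        # 2. Project into the source model's identity/expression subspaces (Eqs. 20-22).
        _, b_exp, r_neutral = split_identity_expression(
            r, src["P_exp"], src["e_exp_mean"], src["P_neutral"], src["r_neutral_mean"])
        # 3. Map the identity residual and expression coefficients to the target model (Eqs. 23-24).
        r_neutral_dst = dst["e_neutral"] + dst["R_neutral"] @ r_neutral
        b_exp_dst = dst["e_exp"] + dst["R_exp"] @ b_exp
        # 4. Recombine the two components into one residual vector (inverse of Eq. 21).
        r_dst = r_neutral_dst + dst["e_exp_mean"] + dst["P_exp"] @ b_exp_dst
        # 5. Add the assigned orientation (Eq. 16).
        return reapply_rotation(r_dst, phi, dst["c0"], dst["cc"], dst["cs"])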


4. Experimental Results

Here we illustrate the results of our methods. We use six cameras to capture the expressions of each person; there are 13 persons in our experiments, each with 5 or 6 different expressions. We select 510 pictures as training data for the multi-pose 2D AAMs. In the testing phase, we apply model fitting to all pictures, and about 90% of the testing pictures are fitted successfully. Because we do not have enough training data, we apply leave-one-out evaluation to train and test our rotation-model algorithm. Besides warping the input face to the pre-trained pose, we also try warping the face to other poses and comparing the results with video captured at those specific poses.

   Although our system allows us to fit the model face and then warp the face to any pose, for some views the warping results are not as good as for others. To compare the results of the rotated model, we warp the input face image in the right-half view to the frontal pose and compare it with the ground truth pre-stored in our database, as shown in Figure 12.

   Figure 12. Result of warping the right-half view to the frontal view vs. the ground truth.

   Then we illustrate the experimental results of warping the right-side view to the frontal view and compare them with the ground truth, as shown in Figure 13. Apparently, the performance is not as good as in the previous case.




Figure 13. The experimental results of warping the right-side view to the frontal view.

   We use a PC equipped with an Intel Core 2 Duo 6300 CPU and 2045 MB of memory to test our algorithm. For a video sequence with frame resolution 320×240, the processing time is less than 45 ms/frame.

   The purpose of warping the non-frontal face to the frontal view is to increase the face identification accuracy. Before the warping process, we have separated the identity component and the expression component from the model parameters. To analyze the warped facial image, we may use the identity parameter or the expression parameter independently to increase the recognition rate. In the following, we synthesize the face image by using only the identity component or only the expression component. The experimental results for the right-half-view and right-side-view facial images are shown in Figures 14 and 15.

   In Figure 14, the lower-right image illustrates the facial image synthesized using only the identity component. The expression can hardly be found, and it shows a neutral face. In Figure 15, the warped image using the identity component is worse than in Figure 14; however, the warped image using the expression parameter looks fine.

   Figure 14. The experimental results of warping the right-half-view facial image to the front view.

   Figure 15. The experimental results of warping the right-side-view facial image to the front view.

   We use the cosine similarity measure x1·x2 / (|x1| |x2|) to evaluate whether the warped image helps increase the recognition rate, where x1 represents a pre-stored frontal neutral face image from the database and x2 represents a testing facial image with any expression and in any viewing direction.
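The similarity used here is the standard cosine similarity; a minimal NumPy version (ours, for illustration) is:

    import numpy as np

    def cosine_similarity(x1, x2):
        """Cosine similarity between two feature vectors."""
        return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))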
Table 1. The improvement in identity recognition with ICO (identity component only) and PC (pose correction) at 15 degrees.

                          ICO      PC       PC+ICO
    Frontal intra-model   18%      3.7%     21.5%

In Table 2, the comparison is done with the expression parameter. We find that the identity component increases the similarity to the neutral faces in the database. On the other hand, for the right-half-view faces with expression processed by PC + ICO (45-60 degrees), the average similarity is about 74.3%, which is only 4.6% lower than that of the PC + ICO frontal expression faces. However, the improvement for the right-side-view faces with expression is very limited; the similarity is about 56.4%.

5. Conclusions

In this paper, we have demonstrated that the expression parameters can be linearly transformed between any two AAMs of the view-based AAM. This can then be used to match an expression-variant face at any angle and to predict the appearance from new viewpoints given a single image of a person. We anticipate that this approach will make face recognition and expression recognition systems more invariant to the viewing angle. In the future, we aim to establish a wide-angle facial detection and recognition system with higher accuracy, less processing time, and more stability.

References

[1]  T.F. Cootes, D. Cooper, C.J. Taylor and J. Graham, Active Shape Models - Their Training and Application, Computer Vision and Image Understanding, Vol. 61, No. 1, pp. 38-59, 1995.
[2]  T.F. Cootes, G.J. Edwards and C.J. Taylor, Active Appearance Models, Proc. European Conf. on Computer Vision, Vol. 2, pp. 484-498, 1998.
[3]  G.J. Edwards, C.J. Taylor and T.F. Cootes, Interpreting Face Images using Active Appearance Models, Int. Conf. on Face and Gesture Recognition, 1998.
[4]  T.F. Cootes, G.V. Wheeler, K.N. Walker and C.J. Taylor, View-Based Active Appearance Models, Image and Vision Computing, Vol. 20, pp. 657-664, 2002.
[5]  T.F. Cootes, G.V. Wheeler, K.N. Walker and C.J. Taylor, Coupled-View Active Appearance Models, British Machine Vision Conference, 2000.
[6]  T.F. Cootes, G.J. Edwards and C.J. Taylor, Active Appearance Models, IEEE Trans. on PAMI, Vol. 23, No. 6, pp. 681-685, 2001.
[7]  H. Kang, T.F. Cootes and C.J. Taylor, A Comparison of Face Verification Algorithms using Appearance Models, Proc. BMVC 2002, Vol. 2, pp. 477-486.
[8]  M.B. Stegmann, B.K. Ersbøll and R. Larsen, FAME - A Flexible Appearance Modelling Environment, IEEE Transactions on Medical Imaging, 2003.
[9]  I. Matthews and S. Baker, Active Appearance Models Revisited, IJCV, 2004, in press.
[10] 陳曉瑩, Real-time Multi-angle Face Detection, Master's thesis, Institute of Electrical Engineering, National Tsing Hua University, 2006 (in Chinese).
[11] V. Blanz and T. Vetter, A Morphable Model for the Synthesis of 3D Faces, Proc. Computer Graphics SIGGRAPH '99, 1999.
[12] V. Blanz and T. Vetter, Face Recognition Based on Fitting a 3D Morphable Model, IEEE Trans. on PAMI, 25(9), September 2003.
[13] C. Christoudias, L. Morency and T. Darrell, Light Field Appearance Manifolds, European Conf. on Computer Vision, (4):482-493, 2004.
[14] R. Gross, I. Matthews and S. Baker, Eigen Light-Fields and Face Recognition Across Pose, Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[15] J. Chang, Y. Zheng and Z. Wang, Facial Expression Analysis and Synthesis: A Bilinear Approach, Int. Conf. on Information Acquisition (ICIA '07), 8-11 July 2007.
[16] Yueming Wang, Gang Pan and Zhaohui Wu, 3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach, IEEE CVPR, 2007.
[17] B.B. Amor, M. Ardabilian and L. Chen, New Experiments on ICP-Based 3D Face Recognition and Authentication, ICPR 2006, Vol. 3, pp. 1195-1199.
[18] I.A. Kakadiaris, G. Passalis, G. Toderici, M.N. Murtuza, Y. Lu, N. Karampatziakis and T. Theoharis, Three-Dimensional Face Recognition in the Presence of Facial Expression: An Annotated Deformable Model Approach, IEEE Trans. on PAMI, Vol. 29, Issue 4, pp. 640-649, April 2007.
[19] S. Ramanathan, A. Kassim, Y. Venkatesh and S.W. Wu, Human Facial Expression Recognition using a 3D Morphable Model, IEEE ICIP, Oct. 2006.
[20] X. Lu and A. Jain, Deformation Modeling for Robust 3D Face Matching, IEEE Trans. on PAMI, 2007.
[21] Jing Xiao, S. Baker, I. Matthews and T. Kanade, Real-Time Combined 2D+3D Active Appearance Models, CVPR 2004, pp. II-535-II-542.
[22] S. Koterba, S. Baker, I. Matthews, Changbo Hu, Jing Xiao, J. Cohn and T. Kanade, Multi-View AAM Fitting and Camera Calibration, IEEE ICCV, Vol. 1, pp. 511-518, 17-21 Oct. 2005.
[23] J. Sung and D. Kim, STAAM: Fitting a 2D+3D AAM to Stereo Images, IEEE ICIP, 8-11 Oct. 2006.
[24] S. Lucey, I. Matthews, Changbo Hu, Z. Ambadar, F. de la Torre and J. Cohn, AAM Derived Face Representations for Robust Action Recognition, Int. Conf. on Automatic Face and Gesture Recognition, pp. 155-160, 10-12 April 2006.
[25] P. Huisman, R. van Munster, S. Moro-Ellenberger, R. Veldhuis and A. Bazen, Making 2D Face Recognition More Robust Using AAMs for Pose Compensation, Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006.
[26] N. Costen, T.F. Cootes and C.J. Taylor, Compensating for Ensemble-Specificity Effects when Building Facial Models, Proc. British Machine Vision Conference 2000, Vol. 1, pp. 62-71.
Patch-Based Occupant Classification for Smart Airbag


                         Shih-Shinh Huang                                                 Er-Liang Jian and Chi-Liang Chien
         Dept. of Computer and Communication Engineering                             Chung-Shan Institute of Science and Technology
    National Kaohsiung First University of Science and Technology                           Email: jianerliang@gmail.com
                    Email: poww@nkfust.edu.tw



   Abstract—This paper presents a vision-based approach for occupant classification. In order to circumvent the intra-class variance, we consider the empty class as a reference and describe each occupant class by its appearance difference rather than by the appearance itself, as in traditional approaches. Each class in this work is modeled by a set of representative parts called patches, each of which is represented by a Gaussian distribution. This alleviates the misclassification resulting from severe lighting changes, which can make the image locally blooming or invisible. Instead of using maximum likelihood (ML) for patch selection and for estimating the parameters of the proposed generative models, we discriminatively learn the models through a boosting algorithm by directly minimizing the training error.
   Keywords-patch-based model, discriminative learning

                        I. INTRODUCTION

   Until now, the integration of airbags into automobiles has significantly improved occupant safety in vehicle crashes. However, inappropriate deployment of airbags in some situations may cause severe or even fatal injuries, for example, deployment against a rear-facing infant seat or when a passenger sits too close to the airbag. According to the report of the U.S. National Highway Traffic Safety Administration (NHTSA), since 1990 more than 200 occupants have been killed by airbags deployed in low-speed crashes. To protect occupants from this kind of injury, NHTSA defined the Federal Motor Vehicle Safety Standard (FMVSS) 208 in 2001. One of the fundamental issues of FMVSS 208 is to recognize the occupant class inside the vehicle for controlling the deployment of airbags. The five basic classes defined in FMVSS 208 are (i) Empty, (ii) RFIS (Rear Facing Infant Seat), (iii) FFCS (Front Facing Child Seat), (iv) Child, and (v) Adult.
   Some existing sensors, such as ultrasound, pressure, or camera, have been used to develop systems that aim at meeting the classification requirements of FMVSS 208. In this work, we choose the camera as the sensing device, since it can provide a rich representation of the occupant in front of the dashboard. This gives the proposed approach potentially higher classification accuracy. Occupant classification based on computer vision is challenging in the presence of severe lighting changes, large intra-class variance, and structure variance. Since the vehicle is moving, the observed image may have a considerably large dynamic range, from bright sunlight to dark shadow. In the extreme, this makes some regions of the image blooming or invisible and thus complicates the classification task. The intra-class variance denotes that the same occupant class may have different appearances. For instance, passengers may wear clothing with different colors, and baby seats may have different styles. The difference in the scene resulting from the configuration change of objects inside the vehicle is referred to as the structure variance. Figure 1 shows some images exhibiting lighting change and intra-class variance. Similar to the works in the literature, we assume that the monitored scene has no structure variance, and the objective of this paper is to achieve a high recognition rate against severe lighting change and intra-class variance.

   Figure 1. Challenges: (a) Severe lighting change. The images have a considerably large dynamic range, and the observed images have significantly different appearance. (b) Intra-class variance. Persons wearing clothing with different styles or colors.

A. Related Work

   Owechko et al. [1], who are pioneers in this area, attempted to eliminate the illumination variance by first applying intensity normalization to the training images. The coefficients of the eigenvectors computed by principal component analysis (PCA) are used to represent the occupant class. An unknown input image is then recognized as having the same class as its nearest-neighbor sample. In order to overcome lighting change, Haar wavelet filters, which describe the intensity difference among neighboring regions, have been used for occupant representation. An over-complete and dense set of Haar filters over thousands of rectangular regions is adopted in [2]. Then, a Support Vector Machine (SVM) is applied to determine the boundaries among different occupant classes for handling intra-class variance.
In [3], [4], the edge map of the passenger appearance is extracted through a background subtraction algorithm and further described by high-order Legendre moments. The classification is achieved using the k-nearest-neighbors strategy. The edge map of the occupant is described by higher-order Tchebichef moments in [5], and the Adaboost algorithm is then applied to select a set of discriminative moments for classification. To utilize more information for classification, multiple features, including range [6], motion information, and the edge map, are fused under a two-layer architecture [7], [8]. The classifiers for each layer are Non-linear Discriminant Analysis (NDA) classifiers.
   The features used in the aforementioned works are all global descriptors, such as the dense edge map [7], [8], Legendre moments [3], [4], or Tchebichef moments [5]. The main limitation of this kind of approach is that the classification accuracy deteriorates in the two extreme cases (blooming and invisible) resulting from severe lighting change. To circumvent this, we present a patch-based model, of the kind commonly used in the recognition literature [9], [10] for handling occlusion effects, to describe the occupant class. Furthermore, the above works directly model the appearance of the occupant and thus suffer from the significant intra-class variance. The general way to solve or alleviate this problem is by introducing classification algorithms such as SVM or NDA. Based on the insight that the silhouettes of different occupant classes are distinct, we consider the empty class as the reference and thus model the appearance difference with respect to the empty class.

B. Approach Overview

   The objective of occupant classification is to assign one of five classes C = {C_Empty, C_RFIS, C_FFCS, C_Child, C_Adult} to the currently observed image. The system mainly consists of two phases: training and classification. In the training phase, we first recover a reflectance image of the empty class by removing the illumination effect. The obtained reflectance image is considered as the reference image for further feature representation. In this work, each occupant class is described by a patch-based generative model in order to handle severe lighting changes which make the image locally blooming or invisible. Traditionally, the parameters of a generative model are estimated using the ML strategy, in which only samples with the same label are considered and used for training the corresponding model. However, models learned in this way suffer from having less discriminativity among different classes. Instead, we adopt a discriminative boosting algorithm to estimate the model parameters by directly minimizing the training error.
   In the classification phase, the appearances at the trained patches of a specific occupant model are taken into consideration for feature representation. The feature used in this work is the difference in appearance between the patch at the observed image and that at the reference image. This makes the proposed approach invariant to intra-class variance. Then, the likelihood ratios evaluating the existence confidence of the given image with respect to the five trained models are computed, and the classification result is the occupant class with the highest confidence.
   The remainder of this paper is organized as follows. In Section II, we introduce the generative models for representing the occupant classes and the way to perform occupant classification. The boosting algorithm for estimating the parameters of the models in a discriminative manner is then described in Section III. Section IV demonstrates the effectiveness of the developed approach by providing experimental results on an abundant database. Finally, we conclude the paper in Section V with some discussion.

                   II. PATCH-BASED CLASSIFIER

   We model every class by a generative model consisting of several patches, each described by a Gaussian distribution. The observed image is classified by maximizing the likelihood probability. Here, the feature representation of a patch is the appearance difference with respect to a reference image. In order to eliminate the illumination factor, we recover the reflectance image of the empty class and consider it as the reference image. Negative normalized correlation is then introduced to measure the appearance difference for representing the feature of a patch.

A. Feature Representation

   The images for training are captured under various lighting conditions. As in the foreground segmentation literature [11], a reference image suitable for difference measurement should be illumination invariant and contain no moving objects. As discussed in [12], an image is the product of two images: a reflectance image and an illumination image. The reflectance image of the scene is constant, while the illumination image changes with the lighting condition in the environment. Accordingly, the reflectance image of the empty class is recovered and considered as the reference image here.
   Given a set of empty-class images in the training database, we apply the approach proposed in [13] to estimate the empty-class reflectance image Ir, based on the assumption that illumination images have lower contrast than the reflectance image. This implies that the derivative filter outputs on the illumination image will be sparse, and the reflectance recovery problem can be re-formulated as a maximum-likelihood estimation problem. Figure 2 shows the decomposition of three empty-class images into a constant reflectance image and the corresponding illumination images.
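The reflectance recovery in [13] is, in essence, a maximum-likelihood estimate obtained by taking the median of derivative-filter outputs across the stack of input images. The following NumPy sketch shows that core idea under the stated sparseness (Laplacian-prior) assumption; the final integration of the gradient field back into a log-reflectance image is only indicated in a comment, and the code is an illustration rather than the method's full implementation.

    import numpy as np

    def reflectance_gradients(images):
        """Estimate the log-reflectance derivatives from a stack of images.
        images: array-like of shape (n, H, W) with positive pixel values."""
        logs = np.log(np.asarray(images, dtype=np.float64) + 1e-6)
        dx = np.diff(logs, axis=2)      # horizontal log-derivatives, per image
        dy = np.diff(logs, axis=1)      # vertical log-derivatives, per image
        # Median over the stack: the ML estimate when illumination derivatives
        # are sparse (Laplacian-distributed), as assumed in the text.
        rx = np.median(dx, axis=0)
        ry = np.median(dy, axis=0)
        # Recovering the log-reflectance image itself requires integrating
        # (rx, ry), e.g. by solving the corresponding Poisson equation (omitted).
        return rx, ry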



B. Classification Model
                                                                                          A generative occupant model Mc = {pc : k = 1, ..., K c}
                                                                                                                                 k
                                                                                       consisting of K c patches is proposed to describe the class
                                                                                       c ∈ C in this work. Each patch pc is modeled by a Gaussian
                                                                                                                        k
                                                                                       distribution Nk = {µc , Σc } associated with the patch
                                                                                                      c
                                                                                                                k   k
                                                                                       configuration θ(pk ), where µc and Σk are the mean and
                                                                                                         c
covariance matrix, respectively. By assuming independence among patches, the log-likelihood of an observed image I belonging to the class c is defined as:

    \log \Pr(I \mid z^c = 1) = \log \Pr(I \mid M^c) = \sum_{k=1}^{K^c} \log \Pr(f(I(p_k^c)) \mid N_k^c)    (3)

where z^c \in \{+1, -1\} is the membership label for the class c and f(I(p_k^c)) is the aforementioned patch representation of the image I at the patch p_k^c. Remarkably, the proposed model, which learns the likelihood of a given observation, is a generative one.

Figure 2. Examples of reflectance image recovery: the first row shows three input images, the second row shows the recovered reflectance images, and the third row shows the corresponding illumination images.

Figure 3. The definition of the quadrant images I_o(q_i) and I_r(q_i).

As illustrated in Figure 3, a patch p is divided into four quadrants \{q_1, q_2, q_3, q_4\}. We denote the quadrant q_i of the observed image I_o and of the recovered reflectance image I_r as I_o(q_i) and I_r(q_i), respectively. Inspired by the work [15] on change detection under sudden illumination variations, a matching function (MF) \gamma(\cdot) is applied to measure the appearance difference between I_o(q_i) and I_r(q_i); \gamma(I_o(q_i), I_r(q_i)) is defined as:

    \gamma(I_o(q_i), I_r(q_i)) = -\frac{\sum_{(x,y) \in q_i} N(x,y)}{\sqrt{\sum_{(x,y) \in q_i} D_o(x,y) \sum_{(x,y) \in q_i} D_r(x,y)}}    (1)

where

    N(x,y)   = (I_o(x,y) - \bar{I}_o(q_i)) \times (I_r(x,y) - \bar{I}_r(q_i))
    D_o(x,y) = (I_o(x,y) - \bar{I}_o(q_i)) \times (I_o(x,y) - \bar{I}_o(q_i))
    D_r(x,y) = (I_r(x,y) - \bar{I}_r(q_i)) \times (I_r(x,y) - \bar{I}_r(q_i))    (2)

Here, \bar{I}_o(q_i) and \bar{I}_r(q_i) denote the average intensities of the quadrant images I_o(q_i) and I_r(q_i), respectively. Remarkably, this function computes the negative normalized correlation between I_o(q_i) and I_r(q_i); hence, the range of \gamma(\cdot) is [-1, 1]. Thus, the feature representation of the patch p is defined as a 4-D vector (one \gamma value per quadrant).
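For concreteness, the following sketch computes this 4-D feature for one patch. It assumes the observed patch and its recovered reflectance counterpart are given as grayscale NumPy arrays of the same size; the function name is illustrative, not taken from the paper.

import numpy as np

def quadrant_feature(patch_obs: np.ndarray, patch_ref: np.ndarray) -> np.ndarray:
    """4-D patch feature of Eqs. (1)-(2): one negative normalized correlation
    value per quadrant between the observed patch and its recovered reflectance."""
    h, w = patch_obs.shape
    quadrants = [(slice(0, h // 2), slice(0, w // 2)),
                 (slice(0, h // 2), slice(w // 2, w)),
                 (slice(h // 2, h), slice(0, w // 2)),
                 (slice(h // 2, h), slice(w // 2, w))]
    feature = []
    for rows, cols in quadrants:
        io = patch_obs[rows, cols].astype(np.float64)
        ir = patch_ref[rows, cols].astype(np.float64)
        io_c = io - io.mean()                     # I_o(x, y) minus the quadrant mean
        ir_c = ir - ir.mean()                     # I_r(x, y) minus the quadrant mean
        num = np.sum(io_c * ir_c)                 # sum of N(x, y)
        den = np.sqrt(np.sum(io_c ** 2) * np.sum(ir_c ** 2))
        feature.append(-num / den if den > 0 else 0.0)   # gamma, in [-1, 1]
    return np.asarray(feature)

Each quadrant contributes one negative-correlation value, so a patch whose observed appearance matches its recovered reflectance yields values close to -1.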
   Instead of solving the occupant classification problem directly with maximum likelihood (ML), that is, c^* = \arg\max_c \log \Pr(I \mid z^c = 1), we introduce an existence confidence to re-formulate it as five one-against-others binary classification problems. The work in [9] argues that this allows both classification and training to be carried out in a discriminative manner and thus improves the classification accuracy. Consequently, we define the existence confidence of a specific class c given an observed image I as the log-likelihood ratio test (LRT):

    H(I, c) = \log \frac{\Pr(I \mid z^c = 1)}{\Pr(I \mid z^c = -1)}    (4)

Without assuming any prior, we approximate the background hypothesis \Pr(I \mid z^c = -1) by a constant \Theta^c. Accordingly, the functional form H(\cdot) of the LRT statistic in (4) becomes:

    H(I, c) = \log \Pr(I \mid z^c = 1) - \Theta^c
            = \sum_{k=1}^{K^c} \log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta^c
            = \sum_{k=1}^{K^c} \{\log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c\}    (5)

where \Theta^c = \sum_{k=1}^{K^c} \Theta_k^c. Therefore, given an image I and the five trained patch-based generative models \{M^c : c \in C\}, the classification result is the class c^* with the highest existence confidence, that is, c^* = \arg\max_c H(I, c). However, we have not yet described how to estimate the model parameters \Omega^c = \{(\theta_k^c, \mu_k^c, \Sigma_k^c, \Theta_k^c) : k = 1, \ldots, K^c\}. In the next section, a boosting algorithm is proposed to train these parameters in a discriminative way.
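As an illustration of this decision rule, the sketch below scores an image against each class model and returns the arg-max class. It assumes the 4-D patch features have already been extracted at each class's selected patch locations (for example with the quadrant function above); the data layout and function names are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def existence_confidence(patch_features, class_model):
    """H(I, c) of Eq. (5): for each selected patch, the 4-D Gaussian
    log-likelihood of its feature minus the per-patch offset Theta_k."""
    h = 0.0
    for feat, (mu, sigma, theta) in zip(patch_features, class_model):
        h += multivariate_normal.logpdf(feat, mean=mu, cov=sigma) - theta
    return h

def classify(per_class_features, models):
    """c* = argmax_c H(I, c) over the five occupant classes."""
    scores = {c: existence_confidence(per_class_features[c], models[c]) for c in models}
    return max(scores, key=scores.get)

Here models[c] would hold the (mu_k, Sigma_k, Theta_k) triplets learned for class c, and per_class_features[c] the corresponding 4-D features extracted from the input image.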
              III. DISCRIMINATIVE LEARNING USING BOOSTING

   In the learning literature [9], [10], several compelling arguments indicate that the model with the parameters
estimated in a discriminative manner is preferable in terms of classification accuracy. Inspired by this, the parameters are determined directly by minimizing the exponential loss of the margin over all training samples [16].

A. Cost Function Definition

   Assume that there is a set of labeled images D = \{(I_i, t_i)\}_{i=1}^{N}. The margin of a sample (I_i, t_i) with respect to a learned model (classifier) H(\cdot) is defined as z_i^c H(I_i, c), where z_i^c \in \{+1, -1\} is the membership label of the ith sample for the class c: z_i^c = 1 if t_i is equal to c; otherwise, z_i^c = -1. Then, the cost function J(\cdot) evaluating the training error of the training set D for the class c is defined as:

    J(D, \Omega^c) = \sum_{i=1}^{N} \exp\{-z_i^c H(I_i, c)\}    (6)

Notably, the smaller the training error of a model H(\cdot) determined by the parameters \Omega^c for the class c, the smaller the cost J(\cdot). In other words, the objective of training the classifier for each class c is to find a set of model parameters in the \Omega^c space so that the cost function is minimized.
   The minimization of (6) is performed with a boosting algorithm, which is a popular way to sequentially approach the solution with a set of additive models. At each round m, our function H(\cdot) is updated as H(\cdot) + h_m(\cdot) so as to decrease the cost; h_m(\cdot) and H(\cdot) are called the weak and strong classifier, respectively, in the boosting literature. Consequently, H(\cdot) has the form:

    H(x) = \sum_{m=1}^{M} h_m(x)    (7)

where M is the number of boosting rounds. By designing h_m(x) as the log-likelihood of a patch minus an offset and setting M = K^c, H(x) in (7) becomes equivalent to H(I, c) in (5). The problem of estimating the model parameters is thus the same as boosting the strong classifier in a sequential manner:

    M = K^c
    h_m(x, c) = \log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c    (8)
    H(x) = \sum_{k=1}^{K^c} \{\log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c\}

B. Gradient Descent Optimization

   Boosting, which chooses a linear combination of weak classifiers to minimize the proposed cost function J(\cdot), is shown to be a greedy gradient descent in [17]. The AnyBoost algorithm presented in [17] shows that the weak hypothesis resulting in the greatest reduction in cost lies in the direction of the negative functional gradient -\nabla J(H)(x). Differentiating J(\cdot) in (6) with respect to H(\cdot) gives:

    \nabla J(D, \Omega^c) = \frac{\partial}{\partial H(I, c)} \sum_{i=1}^{N} \exp\{-z_i^c H(I_i, c)\} = -z^c \exp\{-z^c H(I, c)\}    (9)

Since it is generally not possible to choose h_m(I, c) = -\nabla J(D, \Omega^c), the AnyBoost algorithm instead searches for the function with the greatest inner product with -\nabla J(D, \Omega^c). The inner product between \nabla J(D, \Omega^c) and h_m(I, c) is defined by:

    \langle \nabla J(D, \Omega^c), h_m(I, c) \rangle = \sum_{i=1}^{N} -z_i^c \exp\{-z_i^c H(I_i, c)\} \, h_m(I_i, c)    (10)

Denoting \exp\{-z_i^c H(I_i, c)\} as the weight w_i^c, the task at boosting round m is to find the weak hypothesis most aligned with the negative gradient, that is, the one maximizing \sum_{i=1}^{N} z_i^c w_i^c h_m(I_i, c).
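A minimal sketch of one way to run this greedy selection is given below. It assumes the weak responses have been precomputed for every candidate patch on every training image; the array names are illustrative. With the exponential loss of (6), the negative functional gradient at sample i is z_i^c w_i^c, which is what the weighted score below aligns the chosen weak hypothesis with.

import numpy as np

def select_patches(candidate_scores: np.ndarray, labels: np.ndarray, rounds: int):
    """Greedy AnyBoost-style selection: candidate_scores[j, i] holds the weak
    response h_j(I_i, c) of candidate patch j on training image i (its Gaussian
    log-likelihood minus the offset); labels[i] is z_i^c in {+1, -1}."""
    n_candidates, n_samples = candidate_scores.shape
    H = np.zeros(n_samples)                  # strong classifier scores H(I_i, c)
    chosen = []
    for _ in range(rounds):                  # rounds corresponds to K^c
        w = np.exp(-labels * H)              # w_i^c = exp{-z_i^c H(I_i, c)}
        # the negative functional gradient at sample i is z_i^c w_i^c, so the best
        # weak hypothesis maximizes sum_i z_i^c w_i^c h_j(I_i, c)
        gains = candidate_scores @ (labels * w)
        j = int(np.argmax(gains))
        chosen.append(j)
        H = H + candidate_scores[j]          # H <- H + h_m
    return chosen, H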
                         IV. EXPERIMENT

   In this section, we present experimental results on a large collection of videos.

A. System Setup and Video Collection

   The car used for the experiments is a Mitsubishi Sarvin, and the appearance inside the vehicle is shown in Figure 4(a). We mounted the camera at the center of the roof near the rear-view mirror (see Figure 4(b)) to provide a near-profile view of the occupant and to prevent the camera view from being blocked by the driver. The video sequences used for both training and validation were gathered from the deployed camera while the platform was moving on the road. The camera grabs images at a rate of 30 frames per second.
   In order to build a database with abundant lighting changes, we collected the videos under different weather conditions, such as sunny and cloudy days, over a period of more than two months. In addition, we drove the vehicle through several different scenes, including indoor and outdoor environments such as a basement, scenes facing the sun, and streets shaded by trees. As for intra-class variance, several adults and children with different body types and clothing appear in the videos and were asked to exhibit various postures. Some examples are shown in Figure 5. Our database contains 34 video sets, and each set consists of one video for each occupant class, giving a total of 34 x 5 = 170 videos. Each video is about 5 to 10 minutes long and consists of about 8,000 to 11,000 frames. The total number of frames in our database is 1,633,752, and the detailed statistics can be found in Table I.
Table I
DATABASE STATISTICS

           Empty      RFIS      FFCS      Child     Adult
Fold 1     174,572   164,635   167,589   166,103   176,423
Fold 2     153,322   157,699   157,064   156,352   159,993
Total      327,894   322,334   324,653   322,455   336,416

Figure 4. Camera configuration: (a) the appearance inside the Mitsubishi Sarvin; (b) the deployment of the camera.

Figure 5. Various poses under different illuminations.

B. Classification Results and Analysis

   Our work for occupant classification is based on a set of discriminative patches. To save computation, the grabbed images are normalized to a resolution of 256 x 128. Four types of rectangles are used in both approaches, including 32 x 32, 32 x 16, and 16 x 32. The steps for scanning the entire image in the horizontal and vertical directions are set to 1/2 of the width and height of the rectangles, respectively; for example, the 32 x 16 rectangle is shifted by 16 and 8 pixels, respectively. The number of patches selected for modeling is K^c = 50 for each occupant class. The CPU used is an Intel Core Duo at 2.4 GHz with 1.0 GB of working memory. The Intel Open Source Computer Vision Library (OpenCV) and the libsvm 2.89 library [18] are adopted to support the implementation under Microsoft Windows XP.
   Here, we use 2-fold cross validation to compare the classification performance of the two algorithms. The collected videos in our database are divided into two folds: one is used to learn the models and the other for validation, and vice versa (see Table I). Each fold thus includes 17 sets. For training, we extract 50 frames from every video in the training fold, giving a total of 50 x 85 = 4250 training frames. The training frames of each video are selected by sampling one frame every 100 frames from the first 5000 frames. Figure 6 shows the first 10 selected patches for the five occupant classes.

Figure 6. Selected patches for the five occupant classes.

   The confusion matrices for the classification results of fold 1 and fold 2 are shown in Table II and Table III, respectively. Our proposed approach is clearly effective in both cases. This is because the patch-based model, which relies on local features, is more robust to severe lighting changes than one using a global representation, and the use of appearance differences for the feature representation makes the system robust to intra-class variance. The classification accuracies for the four classes RFIS, FFCS, Child, and Adult all exceed 99.0%. The classification time of our method is about 16 ms; this efficiency results from the simplicity of computing the log-likelihood ratio, which uses 4-D Gaussian distributions. However, there are still some misclassifications between the FFCS and Adult classes; they are hard to distinguish because the two classes have similar appearance.
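As a quick illustration of how the Accuracy column in Tables II and III relates to the raw counts, the snippet below recomputes the per-class accuracy (the diagonal entry divided by the row total) for fold 2; up to rounding, it reproduces the reported values.

import numpy as np

def per_class_accuracy(confusion: np.ndarray) -> np.ndarray:
    """Accuracy of each ground-truth class: diagonal count over the row total."""
    return np.diag(confusion) / confusion.sum(axis=1)

# Fold 2 counts copied from Table III (rows and columns: Empty, RFIS, FFCS, Child, Adult).
fold2 = np.array([
    [153301,      0,      0,     17,      4],
    [     0, 157684,      0,      0,     15],
    [     0,      0, 154642,      0,   2422],
    [     0,      0,      2, 155598,    752],
    [     0,      0,      0,      0, 159993],
])
print(per_class_accuracy(fold2))  # matches the Accuracy column of Table III up to rounding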
                         V. CONCLUSION

   In this paper, we present a patch-based generative model for occupant classification. Each patch is divided into four quadrants, and the appearance difference measured by the proposed negative correlation is used to represent the patch. Instead of using ML for classification, the idea of existence confidence is introduced, so that the model parameters can be estimated in a discriminative manner. To achieve this, a boosting algorithm is applied to approach the solution by directly minimizing the training error. The robustness and effectiveness of our proposed method against severe lighting changes and intra-class variance have been extensively validated on a large database with more than 1,600,000 frames. In the near future, we will introduce semantic cues, such as head or seat detection, to bring the classification accuracy even closer to 100%. In addition, the assumption that there is no structural variation inside the vehicle due to user preference should be relaxed in ongoing work.

                        ACKNOWLEDGMENT

   This research is sponsored by the Chung-Shan Institute of Science and Technology under the project XB98175P.
Table II
CONFUSION MATRICES FOR FOLD 1

Our Approach (99.50%)
          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     171,261   233       0         86        4         98.10%
RFIS      0         164,605   0         1         29        99.98%
FFCS      0         0         167,567   0         22        99.98%
Child     0         0         6         165,597   500       99.69%
Adult     0         116       276       2         176,029   99.77%

Table III
CONFUSION MATRICES FOR FOLD 2

Our Approach (Average: 99.59%)
          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     153,301   0         0         17        4         99.98%
RFIS      0         157,684   0         0         15        99.99%
FFCS      0         0         154,642   0         2,422     98.45%
Child     0         0         2         155,598   752       99.51%
Adult     0         0         0         0         159,993   100.0%



                         REFERENCES

[1] J. Krumm and G. Kirk, "Video Occupant Detection for Airbag Deployment," IEEE Workshop on Applications of Computer Vision, pp. 20-35, 1998.
[2] Y. Zhang, S. J. Kiselewich, and W. A. Bauson, "A Monocular Vision-Based Occupant Classification Approach for Smart Airbag," IEEE Intelligent Vehicles Symposium, pp. 632-637, 2005.
[3] M. E. Farmer and A. K. Jain, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 1, pp. 60-80, January 2007.
[4] M. E. Farmer and A. K. Jain, "Occupant Classification System for Automotive Airbag Suppression," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 756-761, 2003.
[5] S.-S. Huang and P.-Y. Hsiao, "Occupant Classification for Smart Airbag Using Bayesian Filtering," International Conference on Green Circuits and Systems, 2010.
[6] P. R. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, and B. Ottersten, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 4, pp. 1983-1993, July 2007.
[7] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "High Performance Sensor Fusion Architecture for Vision-Based Occupant Detection," IEEE Intl. Conference on Intelligent Transportation Systems, pp. 1128-1132, 2003.
[8] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "Vision-Based Fusion System for Smart Airbag Application," IEEE Intelligent Vehicle Symposium, pp. 245-250, 2002.
[9] A. B. Hillel, T. Hertz, and D. Weinshall, "Object Class Recognition by Boosting a Part-Based Model," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pp. 702-709, 2005.
[10] T. Deselaers, D. Keysers, and H. Ney, "Discriminative Training for Object Recognition Using Image Patches," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 20-25, 2005.
[11] S.-S. Huang, L.-C. Fu, and P.-Y. Hsiao, "Region-Level Motion-Based Foreground Segmentation under a Bayesian Network," IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 522-532, April 2009.
[12] H. Farid and E. H. Adelson, "Separating Reflections from Images by Use of Independent Components Analysis," Journal of the Optical Society of America, vol. 16, no. 9, pp. 2136-2145, 1999.
[13] Y. Weiss, "Deriving Intrinsic Images from Image Sequences," IEEE Intl. Conf. on Computer Vision, vol. 1, pp. 68-75, 2001.
[14] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.
[15] L. D. Stefano, F. Tombari, and S. Mattoccia, "Robust and Accurate Change Detection Under Sudden Illumination Variations," Asian Conference on Computer Vision, pp. 103-109, November 2007.
[16] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing Visual Features for Multiclass and Multiview Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.
[17] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting Algorithms as Gradient Descent," Neural Information Processing Systems (NIPS), pp. 512-518, 2000.
[18] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines," Journal of Machine Learning Research, no. 6, pp. 1889-1918, 2005.
DISPLAY CHARACTERIZATION IN VISUAL CRYPTOGRAPHY FOR
                        COLOR IMAGES

                                             Chao-Hua Wen (溫照華)

                  Color Imaging and Illumination Center, Graduate Institute of Engineering,
                          National Taiwan University of Science and Technology
                                             Taipei, Taiwan
                                   E-mail: chwen@mail.ntust.edu.tw



                     ABSTRACT

Visual cryptography can encrypt visual information and then decrypt it through the human visual system without complicated computation. There are various measures of the performance of visual cryptography schemes, but few studies address exact color reproduction for visual cryptography. This paper proposes a new visual cryptography scheme with a display characterization model which can render the decrypted color image accurately. In the experiments, the processes of encryption and decryption were demonstrated from the source display to the destination display. For color secret images, this method only uses two encrypted share images, and the decryption can be performed via a simple operation.

Keywords: Visual Cryptography; Visual Secret Sharing; Color Visual Cryptography; Display Characterization

                 1. INTRODUCTION

With the rapid deployment of network technology, multimedia information is conveniently transmitted over the Internet. While transmitting secret images, security must be taken into consideration because hackers may exploit weak links in the communication network to expose the hidden information. Various image secret sharing schemes have been developed to strengthen the security of secret images. Information hiding and secret sharing are the two major approaches. For instance, the watermarking method is widely used for information hiding [1], and Visual Cryptography (VC) is adopted for secret sharing [2].
     VC was first introduced by Naor and Shamir (1994); it allows visual information (e.g., plain text, handwritten notes, graphs and pictures) to be encrypted by producing random noise images and decrypted through the human visual system [2]. A visual cryptography scheme (VCS) eliminates complex computation in the decryption process, and the secret images can be reconstructed by a stacking operation. This property makes VCS especially useful when the system requires a low computation load.
     Naor and Shamir proposed the (k, n) threshold scheme, or k out of n threshold scheme, which illustrated a new paradigm in image sharing [2]. In this scheme a secret image is divided into n share images. With any k of the n shares, the secret can be perfectly reconstructed, while even complete knowledge of (k-1) shares reveals no information about the secret image. Consequently, Naor and Shamir's method is restricted to binary images due to the nature of the basic model.
     Verheul and Van Tilborg (1997) proposed a scheme that extended the basic visual cryptography scheme from binary images to color images [3]. In this scheme each pixel is expanded into m subpixels, and each subpixel may take one color from the set of colors 0, 1, ..., c-1, where c is the total number of colors used to represent the pixel. These subpixels are interrelated such that, after all shares are stacked, the color is revealed if the corresponding subpixels of all shares are of the same color; otherwise a level of black is revealed. In this scheme the size of the decrypted image increases by a factor of c^(k-1) when c >= n for a (k, n) threshold scheme.
     Koga and Yamamoto (1998) proposed a lattice-based (k, n) VCS for gray-level and color images [4]. In that scheme, the pixels are treated as elements of a finite lattice and the stacking of pixels is defined as an operation on the finite lattice; a (k, n) VCS for color images with c colors is defined as a collection of c subsets in the nth Cartesian product of the finite lattice.
     Yang (2000) proposed a new VCS for color images [5]. The scheme is implemented based on the basic concept of a black-and-white VCS and achieves a much better block length than the Verheul-Van Tilborg scheme; here each pixel is expanded into 2c-1 subpixels, where c is the number of colors. Hou (2003) proposed a scheme of secret sharing for both gray-level and color images using a halftone technique [6]. The
color secret image is decomposed into individual channels before the halftone technique is applied. The traditional VC is then applied to the halftone image of each channel to create the shares. The size of the decrypted image is increased by a factor of n^(k-1) for a (k, n) threshold VCS, and the quality of the decrypted image depends on the halftone technique used.
     Cimato et al. (2003) proposed a c-color (k, n) threshold cryptography scheme that provides a characterization of contrast-optimal schemes with a pixel expansion of 2c - 1 [7]. Yang and Chen (2008) proposed a VCS for color images based on additive color mixing [8]; in this scheme, each pixel is expanded by a factor of three.
     In order to reduce the size and the distortion of the decrypted image, Dharwadkar et al. (2010) proposed visual cryptography for color images using a color error diffusion dithering technique [10][16]. This technique improves the quality of the decrypted image compared to other dithering techniques, such as Floyd-Steinberg error diffusion [11], as shown by experimental results obtained using Picture Quality Evaluation metrics [12]. Meanwhile, Revenkar et al. (2010) provided an overview of various VCSs and a performance analysis on the basis of pixel expansion, number of secret images, image format, and type of shares generated [13].
     The display is one of the most used media devices in Visual Cryptography. In most applications, the decryption side uses a different display model from the encryption side. Even when the same display model is used, the luminance and color of the displays may differ because of production variance. The color gamut, one of the characteristics of color reproduction media, plays a major role in determining how a given secret image will perform in VCS. The display color gamut that we have been living with for the past several decades is standardized as "Rec. 709" in the video industry [14] or "sRGB" in the computer industry [15]; these systems share the same primaries. However, advanced wide-gamut displays are now being rapidly deployed in specialized professional applications and even in home theaters. That makes display characterization more critical for accurate information communication between the source and the destination.
     The rest of this paper is organized as follows. Section 2 provides an overview of black-and-white VCS, digital halftoning, error diffusion, halftone-based VCS for gray-scale images, and color visual cryptography schemes. Display characterization is elaborated in Section 3. The proposed framework is introduced in Section 4. Results and discussion are given in Section 5. Finally, the conclusion is given in Section 6.

           2. VISUAL CRYPTOGRAPHY SCHEME

Naor and Shamir proposed a (k, n) threshold visual secret sharing scheme to share a secret image [2]. A secret image is hidden into n share images and can be decrypted by superimposing at least k share images, but any k-1 shares cannot reveal the secret.

2.1. Visual Cryptography Scheme for binary images

The (2, 2) VCS is illustrated to introduce the basic concept of threshold visual secret sharing schemes. The encryption process transforms each secret pixel into two shares, and each share belongs to the corresponding share image. In the decryption process the two corresponding shares are stacked together (using an OR/AND operation) to recover the secret pixel. The two shares of a white secret pixel are the same, while those of a black secret pixel are complementary, as shown in Fig. 1. Consequently, a white secret pixel is recovered as a stacked block with half of its subpixels white, and a black secret pixel is recovered as an all-black block. With this basic VCS, the contrast of the decrypted image is reduced because the intensity of the white secret pixels is halved.

Fig. 1: (2, 2) VCS for transforming a binary pixel into two shares.
are rapid deployment in specialized professional                  2.2. Digital Halftoning
applications and even in home theater now. That makes
display characterization more serious in terms of                 Halftone technique is one of the most important parts of
accurate information communications between the                   the image reproduction process for devices with a
source and destination.                                           limited number of colors. According to the physical
     The rest of this paper is organized as follows:              characteristics of different media uses the different ways
Section 2 provides overview of black and white VCS,               of representing the color level of images. The general
digital halftoning, error diffusion, halftone-based VCS           printer such as dot matrix printers and laser printers can
for gray-scale images, and color visual cryptography              only control a single pixel to be printed (black pixel) or
scheme. Display characterization is elaborated in                 not be printed (white pixel). The halftone is applied to
section 3. The proposed framework is introduced in                the given image to render the illusion of the continuous
Section 4. Results and discussion is given in Section 5.          tone images on the devices that are capable of
Finally the conclusion is given in Section 6.                     producing only binary image elements. This illusion is
                                                                  achieved because our eyes perform spatial integration.
                                                                  That is, if we view a very small area from sufficiently
                                                                  large viewing distance our eyes averages the fine detail




within the small area and record only the overall intensity of the area.

2.3. Halftone-based VCS for Gray-level Images

In the (k, n) threshold VCS for gray-level images [3], the pixels have g gray levels ranging from 0 to g-1, and each pixel is expanded to m subpixels with m >= g^(k-1). In this scheme the size of the decoded image is larger than that of the secret image compared to the Naor and Shamir VCS. In order to reduce the size of the decrypted image, the gray-level image is first transformed into an approximate binary (halftone) image; then the basic VCS described in Section 2.1 can be used to create the shares. The following steps are used to generate a less distorted decrypted image:
    1) Transform the gray-level image into a binary image using a halftone technique.
    2) Represent each black or white pixel in the halftone image by m subpixels in the different shares, selected from the shares of black or white pixels.
    3) Repeat step 2 until every pixel in the halftone image is decomposed into shares.

2.4. Error Diffusion

Many mature error diffusion techniques exist in the literature, and because of its exceptionally high image quality, error diffusion continues to be a popular choice among digital halftoning algorithms [9]. Nagaraj V. Dharwadkar et al. used Adaptive Order Dithering (cluster-dot dithering) [16], the Floyd-Steinberg error diffusion technique [11], and a color error diffusion technique, and computed Picture Quality Evaluation metrics for the decrypted images [12]. Their experimental results revealed that color error diffusion produces recovered images of superior quality compared to Adaptive Order Dithering and Floyd-Steinberg error diffusion.
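For reference, the Floyd-Steinberg kernel mentioned above distributes the quantization error of each pixel to its unprocessed neighbours with weights 7/16, 3/16, 5/16 and 1/16; the following is a minimal single-channel sketch (the function name is illustrative).

import numpy as np

def floyd_steinberg(channel: np.ndarray) -> np.ndarray:
    """Floyd-Steinberg error diffusion for one gray-level channel in [0, 255];
    returns a binary halftone image (1 = light/dot present, 0 = dark)."""
    img = channel.astype(np.float64) / 255.0
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            out[y, x] = int(new)
            err = old - new
            # distribute the quantization error to unprocessed neighbours
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out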
2.5. VCS for Color Images

The first color VCS was developed by Verheul and Van Tilborg [3]. Colored secret images can be shared using the concept of arcs to construct a colored visual cryptography scheme. In a c-color VCS, one pixel is transformed into m subpixels, and each subpixel is divided into c color regions. In each subpixel, exactly one color region is colored, and all the other color regions are black. The color of one pixel depends on the interrelations between the stacked subpixels. For a colored visual cryptography scheme with c colors, the pixel expansion m is c x 3. Yang and Laih [19] improved the pixel expansion of Verheul and Van Tilborg [3] to c x 2. Liu et al. developed a color VCS under the visual cryptography model of Naor and Shamir with no pixel expansion [20]; in this scheme, increasing the number of colors of the recovered secret image does not increase the pixel expansion. Wei Qiao et al. also introduced a VCS for color images based on the halftone technique [21].

           3. DISPLAY CHARACTERIZATION

The display is one of the most used media devices in Visual Cryptography. Flat panel displays have become a common peripheral for desktop personal computers and workstations. In general VC tasks, we create an image on one display and take the data file to a second imaging system. When viewed on the second display, the decrypted image is likely to have different color reproduction. Here we address primarily users who need accurate imaging on a monitor.
     The traditional CRT techniques have been summarized by Berns [17] and can be described as the application of the gain-offset-gamma (GOG) model to characterize the electro-optical transfer functions of the display, together with a 3x3 linear transform to go from RGB to CIE XYZ tristimulus values. The accuracy of the GOG characterization is probably adequate for most desktop color applications and color management systems [18].
     The International Color Consortium (ICC) has published a standard file format for storing "profile" information about any imaging device (http://www.color.org/). It has become routine to use such profiles to achieve accurate imaging. The widespread support for profiles allows most users to achieve characterization and correction without needing to understand the underlying characteristics of the imaging device. ICC monitor profiles use the standard CRT model presented in this article.

3.1. Primary transform matrix and inverse

The primary transform matrix for the colorimetric characterization of the display was derived from direct colorimetric measurements of the three full-on primaries after black correction. The matrix and its inverse are given in Equation (1) and Equation (2).

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
    \begin{bmatrix} X_R & X_G & X_B \\ Y_R & Y_G & Y_B \\ Z_R & Z_G & Z_B \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}    (1)

    \begin{bmatrix} R \\ G \\ B \end{bmatrix} =
    \begin{bmatrix} X_R & X_G & X_B \\ Y_R & Y_G & Y_B \\ Z_R & Z_G & Z_B \end{bmatrix}^{-1}
    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}    (2)
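In code, the forward and inverse transforms of Equations (1) and (2) are simply a 3x3 matrix and its inverse; the sketch below uses the HP-monitor matrix that is reported later in Equation (3) purely as a concrete example, and the function names are illustrative.

import numpy as np

# Columns are the measured XYZ of the full-on R, G, B primaries (HP monitor, Eq. (3)).
M = np.array([[0.4009, 0.3294, 0.2261],
              [0.2340, 0.5877, 0.1783],
              [0.0453, 0.0872, 1.1379]])

def rgb_to_xyz(rgb):
    """Equation (1): XYZ = M * RGB for a linear RGB triplet."""
    return M @ np.asarray(rgb, dtype=float)

def xyz_to_rgb(xyz):
    """Equation (2): RGB = M^{-1} * XYZ."""
    return np.linalg.inv(M) @ np.asarray(xyz, dtype=float)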
3.2. Electro-Optical Transfer Function (EOTF)

The EOTF describes the relationship between the signal used to drive a given display channel and the luminance produced by that channel. For displays, this function is sometimes referred to as gamma, and it is the aspect of the display characterization described by the
GOG portion of the display characterization model. The EOTF, however, does not apply in visual cryptography, because a VCS basically deals with fully on/off signals.

           4. THE PROPOSED COLOR VCS

The objective of our proposed scheme is to apply VCS to color images and obtain a better-quality decrypted image through display characterization procedures. Fig. 9 illustrates the framework of the encryption algorithm and the simulated decryption image. In this encryption algorithm the color image is decomposed into three channels, and each channel is considered a gray-level image. For each gray-level image, dithering and VCS are applied independently to create the shares. We use color error diffusion as the dithering technique: it reduces the color sets that render the halftone image and chooses, from these sets, the color by which the desired color can be rendered with minimal brightness variation. Fig. 2 shows how to decompose a magenta pixel (R = 1, G = 0, B = 1) into two sharing blocks and how to reconstruct the magenta block. We superimpose (using an AND operation) the binary shares of each channel to get the decrypted color image.

Fig. 2: An example of the proposed VCS for a magenta pixel.

4.1. Encryption

In the encryption algorithm, the two shares are generated from the color image. Based on Naor and Shamir's basic concept, the color image is decomposed into R, G and B channels. From these channels, six work-in-process shares are created. These six work-in-process shares are then combined into two encrypted color images using the following steps; a code sketch of steps (1)-(4) is given after the list.
     (1) Color decomposition: The color image I is decomposed into IR, IG and IB monochrome gray-level images for the R, G and B color channels respectively.

          [I] -> [IR, IG, IB]

     (2) Digital halftoning: The halftone technique is applied to each color channel to obtain the IRHT, IGHT, and IBHT halftone images respectively.

          [IR, IG, IB] -> [IRHT, IGHT, IBHT]

     (3) Creation of work-in-process shares: The method described in Section 2.1 is used to create the work-in-process shares by (2, 2) VCS for each halftone image. For example, for the red halftone image IRHT, the (2, 2) VCS encodes the halftone image into two shares, IRSH1 and IRSH2 respectively. The green and blue halftone images are processed in the same way as the red halftone image.

          [IRHT, IGHT, IBHT] -> [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2]

     (4) Creation of encrypted shares: The work-in-process shares IRSH1, IGSH1 and IBSH1 are combined into a color Share1 image, and IRSH2, IGSH2 and IBSH2 are combined into a Share2 image.

          [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] -> [Share1, Share2]

     (5) Display characterization: The display model is applied for color correction of the Share1 and Share2 images.

          [Share1, Share2] -> [Share1', Share2']

                                                                           Table 1: Measured Luminance and chromaticities
4.1. Encryption
                                                                                             Color                Luminance and Chromaticity
                                                                    Display
In the encryption algorithm, the two shares are                                      R         G         B       Y (cd/m2)        x      y
generated from the color image. Based on Noar and                                    1         0         0         54.95      0.5894   0.3440
Shamir’s basic concept, the color image is decomposed
                                                                                     0         1         0         138.8      0.328    0.5852
into R, G and B channels. From these channels, six of
the work-in-process shares are created. Next to combine                   HP         0         0         1         41.94      0.1466   0.1156
these six work-in-process shares into two encrypted                                  0         0         0         0.25       0.2009   0.1966
color images using following steps:
     (1) Color Decomposition: The color image I is                                   1         1         1         234.7      0.2964   0.3099
decomposed into IR, IG and IB monochrome gray-level                                  1         0         0         55.67      0.6340   0.3336
images for R, G and B color channels respectively.
                                                                                     0         1         0         192.9      0.3321   0.6273

                            [I]   [IR, IG, IB]                           hTC         0         0         1         36.24      0.1423   0.0778

                                                                                     0         0         0         0.04       0.2329   0.2269
    (2) Digital halftoning: To apply the halftone
                                                                                     1         1         1        284.80      0.2925   0.3044
technique for each color channel to obtain IRHT, IGHT,
and IBHT halftone images respectively.




Fig. 3: Plot of CIE chromaticity coordinate values (u', v') of the HP monitor (purple line), hTC display (blue line) and NTSC color space (yellow line).

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{HP} =
    \begin{bmatrix} 0.4009 & 0.3294 & 0.2261 \\ 0.2340 & 0.5877 & 0.1783 \\ 0.0453 & 0.0872 & 1.1379 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{HP}    (3)

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{hTC} =
    \begin{bmatrix} 0.3714 & 0.3593 & 0.2301 \\ 0.1954 & 0.6787 & 0.1258 \\ 0.0190 & 0.0439 & 1.2613 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{hTC}    (4)

     As described in Section 3, two ICC profiles were first created. Here we assigned the source profile to the HP monitor and the destination profile to the hTC display. The new color transform was created from the source profile and the destination profile, with CIE XYZ adopted as the profile connection space. Therefore, the conversion from an HP image to an hTC image is RGB_HP -> XYZ -> RGB_hTC, as shown in Equation (5). Consequently, we convert the Share1 and Share2 images to Share1' and Share2' using this equation.

    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{hTC} =
    \begin{bmatrix} 0.3714 & 0.3593 & 0.2301 \\ 0.1954 & 0.6787 & 0.1258 \\ 0.0190 & 0.0439 & 1.2613 \end{bmatrix}^{-1}
    \begin{bmatrix} 0.4009 & 0.3294 & 0.2261 \\ 0.2340 & 0.5877 & 0.1783 \\ 0.0453 & 0.0872 & 1.1379 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{HP}    (5)
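Equation (5) is straightforward to apply in code. The sketch below uses the measured matrices of Equations (3) and (4) and assumes linear RGB values; the function name is illustrative.

import numpy as np

M_HP = np.array([[0.4009, 0.3294, 0.2261],
                 [0.2340, 0.5877, 0.1783],
                 [0.0453, 0.0872, 1.1379]])
M_HTC = np.array([[0.3714, 0.3593, 0.2301],
                  [0.1954, 0.6787, 0.1258],
                  [0.0190, 0.0439, 1.2613]])

def hp_to_htc(rgb_hp: np.ndarray) -> np.ndarray:
    """Eq. (5): HP-monitor RGB -> CIE XYZ -> hTC-display RGB.
    Accepts a single triplet or an N x 3 array of linear RGB values."""
    return np.asarray(rgb_hp, dtype=float) @ (np.linalg.inv(M_HTC) @ M_HP).T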
4.2. Decryption

In the decryption algorithm, the color image channels are reconstructed by stacking the shares of the corresponding channels. Our proposed scheme reconstructs the decrypted image Idecrypt in a straightforward way by stacking the Share1' and Share2' images with an AND operation for each color channel individually:

                  [Share1', Share2'] → [Idecrypt]
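As a minimal sketch of this stacking step (assuming the shares are stored as binary 0/1 arrays of shape H x W x 3; the names are illustrative):

```python
import numpy as np

def stack_shares(share1, share2):
    """Decrypt by AND-stacking two binary shares channel by channel (values in {0, 1})."""
    return np.logical_and(share1, share2).astype(np.uint8)

# I_decrypt = stack_shares(share1_prime, share2_prime)
```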
                 5. RESULTS AND DISCUSSION

The color image is decomposed into R, G and B channel images, and those channel images are then halftoned to produce the work-in-process shares. Fig. 5 shows the creation of the work-in-process shares: Fig. 5(a) and Fig. 5(b) are the two work-in-process shares of the red channel, Fig. 5(c) and Fig. 5(d) are the shares of the green channel, and Fig. 5(e) and Fig. 5(f) illustrate the work-in-process shares of the blue channel. Overall, these six shares reveal no information about the secret image.

     (a) IRSH1   (b) IRSH2   (c) IGSH1   (d) IGSH2   (e) IBSH1   (f) IBSH2
   Fig. 5: Six of the work-in-process share images (these are black-and-white images).

To reduce the number of share images for portability and distribution, the proposed scheme divides the six work-in-process shares into two groups and creates two encrypted shares. Fig. 6 shows the combination of the six work-in-process shares into two encrypted color images. Neither encrypted color image reveals any information about the secret image.

     (a) Share1 color image   (b) Share2 color image
           Fig. 6: Two encrypted share color images.

Here we assume that Alice uses the monitor of an HP Pavilion dm3 to encrypt the secret image and then Bob
uses the display of an hTC Diamond One to decrypt the image with his eyes. The simulated encrypted share images are shown in Fig. 7, so Alice can preview the encryption results of the images on the hTC display. As a consequence, Share1' and Share2' were used rather than Share1 and Share2.

     (a) Share1' color image   (b) Share2' color image
   Fig. 7: Color correction of the encrypted images resulting from the display characterization of the HP monitor and the hTC display.
                                                                          scheme,” Design, Codes and Cryptography, vol. 20,
     Finally, the decryption results are illustrated in Fig. 8. Fig. 8(a) depicts the decrypted image without display characterization. Compared with the decrypted image with display characterization in Fig. 8(b), the results reveal a color difference between (a) and (b). Note, however, that Alice can share the encrypted images with Bob, who can then decrypt the secret image and see the same contents, with the same colors, as Alice created.

                 (a)                    (b)
   Fig. 8: Decryption results. (a) The decrypted image without display characterization; (b) the decrypted image with display characterization.
                  6. CONCLUSIONS

In this paper, we proposed a new VCS for color images with display characterization, which applies error-diffusion dithering directly to each primary color channel. We also reduced the encrypted images to two share images for easier reconstruction and distribution of the hidden information. In this work, display characterization is applied to visual cryptography for the first time. The results show that both the color information and the secret image can be delivered accurately. Further work can be done to reduce the size of the share images, improve the quality of the halftone shares, and use the display characterization model as an encryption key.
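The error-diffusion dithering mentioned above can be illustrated with the classic Floyd–Steinberg scheme applied to one primary channel; this is only a sketch of one common choice, and the exact halftoning configuration used in the scheme is described in the paper's earlier sections.

```python
import numpy as np

def error_diffusion_halftone(channel):
    """Floyd-Steinberg error diffusion on one color channel (array of values in [0, 255])."""
    img = channel.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255 if old >= 128 else 0
            out[y, x] = new
            err = old - new
            # Distribute the quantization error to the unprocessed neighbours.
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:
                img[y + 1, x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                img[y + 1, x + 1] += err * 1 / 16
    return out
```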
   Fig. 9: The proposed framework of the color VCS (the original image is decomposed into red, green and blue channels, digitally halftoned, and encrypted by the visual cryptography scheme into Share1 and Share2, which pass through the display models before the decrypted image is obtained).
Data Hiding with Rate-Distortion Optimization on
                                            H.264/AVC Video
                                                   Yih-Chuan Lin and Jung-Hong Li
             Dept. of Computer Sciences and Information Engineering, National Formosa University, Yunlin, Taiwan.
                                                        E-mail: lyc@nfu.edu.tw
Abstract - This paper proposes a data hiding algorithm for H.264/AVC standard videos. The proposed video data hiding scheme embeds information that is useful to specific applications into the symbols of the context adaptive variable length coding (CAVLC) domain in H.264/AVC video streams. In order to minimize the changes to both the reproduced video quality and the output bit-rate, the algorithm selects DCT blocks using a coefficient energy difference (CED) rule and then modifies the minor significant symbols, namely the trailing one (T1) symbols and the least significant bits (LSB) of non-zero quantized coefficient symbols, to hide data in the selected blocks. Considering the joint optimization of rate and distortion, the algorithm treats the data hiding task as a special quantization process and performs it within the rate-distortion optimization loop of the H.264/AVC encoder. Experimental results demonstrate that our scheme achieves good efficiency in hiding capacity, video quality and output bit-rate.
Keywords: H.264/AVC, data hiding, CAVLC, reconstruction loop, coefficient energy difference.

                      I. INTRODUCTION
     Information hiding (also called data hiding interchangeably hereafter) for video is a process that adds useful data to the raw or compressed formats of a video in such a manner that third parties cannot perceive the presence or contents of the hidden message.
     H.264/AVC provides better compression efficiency than other existing standards at the cost of high computational complexity. Owing to the high popularity of this format in many video applications, hiding useful data in it has attracted a great deal of attention. Recently, many researchers have developed watermarking schemes for H.264/AVC [1-4], but in order to balance video quality and bit-rate they usually offer only a small capacity for hiding data. This paper proposes a data hiding (also called watermarking interchangeably hereafter) scheme that is based on the CAVLC at the H.264/AVC encoder and decoder sides. In the proposed method, one watermark bit is embedded by exploiting the relationship among the polarities of all T1 symbols in a 4x4 luminance DCT block. If the DCT block has no T1, the algorithm considers modifying the LSB of the last nonzero coefficient to embed the information. Experimental results show that our proposed method provides more capacity and can enhance the rate-distortion efficiency. The degradation of video quality caused by the watermark hiding can be kept within a bound of less than 2 dB.
     The remainder of this paper is organized as follows. Section 2 describes the watermarking principles and related literature. Section 3 explains our proposed scheme, including the watermark embedding/extracting schemes and the embedding restriction rule. In Section 4, the performance of our proposed scheme is presented. Finally, some conclusions are given in Section 5.

                      II. BACKGROUND
     In general, most data hiding methods for H.264/AVC are based on entropy coding symbols or motion vectors (MV). There are two kinds of entropy coding methods in H.264/AVC: CAVLC and CABAC (context-adaptive binary arithmetic coding). Many researchers choose CAVLC because it is less complicated and easy to work with in most situations. The nonzero coefficients in DCT blocks can be modified for embedding, but doing so carelessly would seriously affect the bit-rate and video quality. Although hiding a watermark in the DCT blocks is easy to implement, unnecessary degradation should be avoided.
     After transform and quantization, a DCT block usually contains many zeros and only sparse nonzero coefficients. The high-frequency nonzero coefficients after the zig-zag reordering are often sequences of ±1; these are called trailing ones (T1) and are limited to at most three in H.264/AVC. The more trailing ones there are, the shorter the coding length becomes, so most researchers focus on this part when developing data hiding algorithms. Consider changing the coefficients in a DCT block. Four symbols are available in CAVLC: coeff_token, trailing_ones_sign_flag, total_zero, and run_before. The coeff_token encodes the numbers of nonzero coefficients and T1 symbols in a DCT block. For the same block, if the number of trailing ones increases, the bit-rate decreases; on the contrary, when the number of nonzero coefficients rises, the bit-rate increases.
hereafter) scheme that is based on the CAVLC in H.264/AVC                        In Wu et al. [4], their proposed method is emphasizing on
encoder and decoder sides. In the proposed method, one                     robustness to the compression attacks for H.264/AVC with
watermark bit is embedded by employing the relationship                    more than a 40:1 compression ratio in I frame. The data
between all of the polarity of T1 symbols in a 4x4 luminance               embedded to the predicted 4x4 DCT block is only one bit. In
DCT block. If the DCT block has no any T1, the algorithm                   Tian et al. [5], this proposed method just modified the nonzero
considers modifying the LSB of the last nonzero coefficient                coefficients. Therefore, the bit-rate increase is about 0.1% and
for embedding information. Experiment results have shown                   the PSNR degradation is less then 0.5dB. It is good at keeping
that our proposed method provide more capacity and can                     low bit-rate and high quality. However the capacity is too low.
enhance the rate-distortion efficiency. The degradation of                 In Liao et al. [6], this method embeds message into the trailing
                                                                           ones of 4x4 blocks during the CAVLC. The feature of this

                                                                    1125
method is that it allows data hiding directly in the compressed stream in real time, and its capacity is higher than that of the others [5-6]. The method of Shahid et al. [7] also embeds the watermark into DCT blocks; it modifies the LSB of coefficients in both inter- and intra-frames and provides a high data hiding capacity. Huang et al. [8] proposed a new steganography scheme with capacity variability and synchronization for the secure transmission of acoustic data. The method of Wang et al. [9] has good efficiency: the PSNR is always higher than 45 dB at a hiding capacity of 1.99 bpp for all test images.

                 III. THE PROPOSED SCHEME

A. OVERVIEW OF OUR METHOD
     Figure 1 depicts the block diagram of our proposed method at the H.264/AVC encoder side. The watermark embedding method is inserted into H.264/AVC during the encoding process, and data is hidden in DCT blocks before entropy coding. In our proposed method, the watermarking is done on luminance DCT blocks in both intra and inter modes; the chrominance DCT blocks are not considered.

Fig. 1. Schematic illustration of our proposed watermarking/embedding procedure.

     When the encoder executes the information hiding method, the rate-distortion trade-off must be considered. Because the changes made by marking are reflected in the reconstructed frame, and the encoding of the next frame refers to this marked reconstructed frame, we must take the reconstruction loop into account [7]. In other words, the data hiding block should operate inside the reconstruction loop, or inside the reconstruction loop together with RDO (Rate-Distortion Optimization); otherwise, the bit-rate and video quality would be seriously affected by the prediction drift between the encoder and decoder sides.
     In H.264/AVC encoding, RDO helps the current frame select the best mode and obtain the best trade-off between quality distortion and bit-rate. Therefore, our method takes RDO into account in order to obtain better coding performance while embedding the information into the video blocks. Fig. 2 illustrates the embedding procedure at the macro-block level. When a macro-block enters the encoder side, the encoder first determines its encoding mode. If the macro-block is inter-mode, the encoder performs both inter- and intra-prediction to select the best mode from the mode set, which includes the PSKIP, P16x16, P16x8, P8x16, P8x8, I4MB, I16MB and IPCM modes. When the macro-block is intra-mode, the encoder performs intra-prediction and the mode set contains only the I4MB, I16MB and IPCM modes.

Fig. 2. The proposed watermarking method at macro-block level.

Fig. 3. The proposed watermarking integrated with the RDO procedure.

     As indicated in Fig. 2, our proposed method is also integrated within the RDO procedure at the encoder side. When the encoder performs the RDO procedure, it selects the best coding mode while watermarking is done at the same time. That mode might differ from the mode chosen without watermarking, but its bit-rate and video quality are the best among the modes in the mode set. Fig. 3 illustrates the detail of the "RDCost with watermarking" block shown in Fig. 2. As described previously, we focus on both intra- and inter-blocks of the luminance component for data hiding. As indicated in Fig. 3, the IPCM and SKIP modes are not considered for embedding.
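As a rough, implementation-agnostic sketch of how embedding can sit inside the mode-decision loop (this is our reading of Figs. 2 and 3, not the JM reference code; all names and signatures are illustrative, and the actual transform, embedding and cost functions are passed in as callables):

```python
from typing import Any, Callable, Iterable, Tuple

def choose_mode_with_watermark(mb: Any,
                               mode_set: Iterable[str],
                               quantize: Callable[[Any, str], Any],
                               embed: Callable[[Any], Any],
                               rd_cost: Callable[[Any, Any, str], float]) -> Tuple[str, Any]:
    """Pick the macro-block mode whose rate-distortion cost, measured after embedding,
    is lowest -- mirroring the 'RDCost with watermarking' block of Figs. 2 and 3."""
    best_mode, best_coeffs, best_cost = None, None, float("inf")
    for mode in mode_set:
        coeffs = quantize(mb, mode)            # transform + quantization for this candidate mode
        if mode not in ("IPCM", "SKIP"):       # Fig. 3: IPCM and SKIP are never used for embedding
            coeffs = embed(coeffs)             # hide the watermark bits in the quantized block
        cost = rd_cost(mb, coeffs, mode)       # distortion + lambda * rate of the marked block
        if cost < best_cost:
            best_mode, best_coeffs, best_cost = mode, coeffs, cost
    return best_mode, best_coeffs
```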
     As previously described, our method can be performed within RDO inside the reconstruction loop. As shown in Fig. 2, the block "Get best MB mode" selects the best mode to carry out the coding task. The performance of data hiding without RDO is not
better than that of the method that considers RDO, based on the results shown in a later section.
     Fig. 4 illustrates the integration of the proposed method with the H.264/AVC decoder. An extracting algorithm is inserted into the H.264/AVC decoding phase, and the extraction is performed on DCT blocks after entropy decoding. In our method, the watermark is embedded in luminance DCT blocks in both intra- and inter-modes, so the extraction needs to be performed only on the luminance DCT blocks.

Fig. 4. Schematic illustration of the proposed watermark extracting procedure.

B. THE RESTRICTION OF OUR METHOD
     In the literature, most methods utilize the quantized coefficients for embedding, and they share the common feature of modifying only the magnitude of a coefficient, not its sign. The proposed algorithm instead utilizes the relation among the polarities of the T1 symbols for embedding; the polarity and the sign of a coefficient are related.
     Based on experiments, we observe that when the nonzero coefficients of a DCT block are sparse, changing the sign of a trailing one causes the bit-rate to increase significantly. In the intra-prediction phase, the current block refers to the upper and left blocks to make its prediction and encodes the prediction residual. When the sign of a trailing one is changed in a block with sparse nonzero coefficients, the block data in the spatial domain changes greatly, because the energy changed by the sign flip is a larger proportion of the whole block's coefficient energy. When the coded block is then referenced by other, not-yet-coded blocks, this adverse effect propagates to them through the reconstruction loop. Thus, we have to devise a mechanism to prevent this effect: if the nonzero coefficients are sparse and the energy of the trailing one whose sign would be changed occupies a significant proportion of the block's coefficient energy, we do not hide any watermark bits in that DCT block.
     In our method, we set a threshold to decide whether a DCT block is suitable for embedding data. First, we calculate the coefficient energy of the current DCT block and the CED after changing the sign of one trailing one. If the change rate of the CED is less than the prespecified threshold, the block is chosen to hide data; otherwise, the block is kept intact. A simple example is shown in Fig. 5.

Fig. 5. Example illustration of the proposed watermark restriction.
     In this example there is a 4x4 DCT block with five nonzero coefficients and the threshold is set to 0.25. After zig-zag scanning, the coefficient sequence is -2, 4, 3, -3, 0, 0, -1, and the last trailing one is -1. Before the embedding phase, we first calculate the CED and compare its value with the threshold. As shown in Fig. 5, the block satisfies our restriction because the CED is lower than the threshold.
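The restriction test can be sketched as follows. The exact CED formula is defined earlier in the paper; here we assume it is the energy of the sign-flipped trailing one taken as a fraction of the total coefficient energy of the block, which is our own simplification, and the function and variable names are illustrative.

```python
def ced_allows_embedding(levels, threshold=0.25):
    """Return True when the assumed CED (|last T1|^2 / total block energy) is below the threshold."""
    nonzero = [c for c in levels if c != 0]            # nonzero levels in zig-zag order
    if not nonzero or abs(nonzero[-1]) != 1:
        return False                                   # no trailing one available to flip
    total_energy = float(sum(c * c for c in nonzero))
    ced = (nonzero[-1] ** 2) / total_energy
    return ced < threshold

# Worked example from Fig. 5: sequence -2, 4, 3, -3, 0, 0, -1
# total energy = 4 + 16 + 9 + 9 + 1 = 39, assumed CED = 1/39 ~ 0.026 < 0.25, so the block is used.
print(ced_allows_embedding([-2, 4, 3, -3, 0, 0, -1]))  # True
```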
                                                                            polarity values of trailing ones to hide data. If the sign of
                                                                            trailing one is negative, the polarity value is 0. On the contrary,
                                                                            the sign of trailing one is positive, the polarity value is 1. The
                                                                            polarity values of trailing ones are through an XOR operation.
                                                                            The result must be identical to the value of the watermark bit to
                                                                            be hided into the block; otherwise we should change the sign
                                                                            of last trailing one to satisfy the hiding condition. If the result


                                                                     1127
equals the watermark bit, the process does not modify anything in the block. The algorithm changes the sign of the last trailing one because the last trailing one, lying in the high-frequency zone, has lower energy than the other trailing ones, so the change does not cause significant degradation of quality or bit-rate.

       Table II. Pseudo code of the embedding algorithm
  Embedding Algorithm
  Input:  DCTB
  Output: DCTB'
  Initialization:
      T1set    <- getT1set(DCTB)
      numT1    <- getT1count(T1set)
      numLevel <- getLevcount(DCTB)
  Begin Embedding()
      if (numT1 != 0)
          coeEnergy <- getEnergy(DCTB)
          if (coeEnergy < Threshold)
              W' <- XorT1Polarity(T1set)
              if (W' != W)
                  LastT1 <- getLastT1Index(DCTB)
                  ChangeSign(DCTB, LastT1)
                  output DCTB'
              end
          end
      else if (numT1 == 0 and numLevel != 0)
          LastLevel <- getLastLevIndex(DCTB)
          ChangeLSB(DCTB, LastLevel, W)
          output DCTB'
      end
  End

     The second part, used when the number of nonzero coefficients is nonzero but the number of trailing ones is zero, changes the LSB of the last level to hide data. Otherwise, if the numbers of levels and trailing ones are both zero, we do not perform any embedding. The advantage of the method in the first case is that the sign change does not affect other symbols in the same block. According to the CAVLC rule, the trailing_ones_sign_flag indicates the sign of a trailing one and is encoded as one bit in the NAL (Network Abstraction Layer): if the sign is negative, it is encoded as bit 1; if it is positive, it is encoded as bit 0. We change only the sign of the last trailing one, so the encoded block has the same length as before the embedding process.

D. EXTRACTING ALGORITHM
     The extracting phase, shown in Table III, is simpler than the embedding phase. The watermark extracting algorithm is performed between the entropy decoding phase and the inverse quantization phase. We first find all of the trailing ones in the current DCT block and calculate the CED value; if the CED is lower than the threshold and the number of trailing ones is nonzero, we collect the polarity values of all the trailing ones and XOR them to obtain the watermark bit. If the number of trailing ones is zero and a last nonzero level exists, we take the LSB of the last level as the watermark bit. If the numbers of levels and trailing ones are both zero, we do nothing.

       Table III. Pseudo code of the extracting algorithm
  Extracting Algorithm
  Input:  DCTB'
  Output: W
  Initialization:
      T1set    <- getT1set(DCTB')
      numT1    <- getT1count(T1set)
      numLevel <- getLevcount(DCTB')
  Begin Extracting()
      if (numT1 != 0)
          coeEnergy <- getEnergy(DCTB')
          if (coeEnergy < Threshold)
              W <- XorT1Polarity(T1set)
              output W
          end
      else if (numT1 == 0 and numLevel != 0)
          LastLevel <- getLastLevIndex(DCTB')
          W <- getLSB(DCTB', LastLevel)
          output W
      end
  End
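A compact, self-contained sketch of the two-part embed/extract logic on a zig-zag-ordered coefficient list is given below. It mirrors Tables II and III but is our own illustrative code, not the JM implementation, and it assumes the CED restriction of Section III-B has already been checked.

```python
def t1_positions(levels):
    """Positions (in zig-zag order) of the trailing +/-1 levels, at most three, counted from the end."""
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    t1 = []
    for i in reversed(nonzero):
        if abs(levels[i]) == 1 and len(t1) < 3:
            t1.append(i)            # t1[0] ends up being the last (highest-frequency) trailing one
        else:
            break
    return t1

def embed_bit(levels, w):
    """Embed one watermark bit w (0 or 1) into a list of quantized levels in zig-zag order."""
    levels = list(levels)
    t1 = t1_positions(levels)
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    if t1:                                              # part 1: XOR of T1 polarities must equal w
        polarity = 0
        for i in t1:
            polarity ^= 1 if levels[i] > 0 else 0
        if polarity != w:
            levels[t1[0]] = -levels[t1[0]]              # flip the sign of the last trailing one
    elif nonzero:                                       # part 2: no T1 -> set LSB of the last level
        i = nonzero[-1]                                 # with no T1, this level has magnitude >= 2,
        sign = 1 if levels[i] > 0 else -1               # so writing the LSB can never zero it out
        levels[i] = sign * ((abs(levels[i]) & ~1) | w)
    return levels

def extract_bit(levels):
    """Recover the hidden bit, or None if the block carries no data."""
    t1 = t1_positions(levels)
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    if t1:
        w = 0
        for i in t1:
            w ^= 1 if levels[i] > 0 else 0
        return w
    if nonzero:
        return abs(levels[nonzero[-1]]) & 1
    return None

# Example: hide bit 1 in the Fig. 5 block and read it back.
marked = embed_bit([-2, 4, 3, -3, 0, 0, -1], 1)
print(marked, extract_bit(marked))
```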
                IV. EXPERIMENTAL RESULTS

A. THE EXPERIMENT ENVIRONMENT

  Table IV. Experimental parameters for the H.264/AVC codec
  Parameter                    Information
  Profile IDC                  66 (baseline)
  Intra period                 15 (I-P-P-P)
  Slice mode                   0
  Frames to be encoded         300
  Motion estimation scheme     Fast Full Search
  Rate control                 Disabled

  Table V. Test video format parameters
  Parameter                    Information
  Video format                 QCIF
  YUV format                   4:2:0
  Frame size                   176×144
  Frame rate                   30 fps

     We utilize the H.264/AVC JM reference software [10] as the platform to simulate our proposed method. This subsection presents the experimental parameters for our method in the JM reference software. We use version 12.2 of the JM software; the related environmental parameters are shown in Table IV. In the experiments, four videos, "akiyo," "foreman,"
"mobile," and "news," are used as the test data set. Their format information is shown in Table V. The secret data to be hidden in the test videos is a random bit stream.

B. THE EXPERIMENT RESULTS
     In this subsection we present the experimental results and explain them. Three methods are considered. The original method refers to the method without data hiding; the "within RDO" method represents the method operated in the RDO loop, while the "without RDO" method executes after the RDO stage in the reconstruction loop of the encoder. As shown in Figs. 6 and 7, the "within RDO" method is superior to the "without RDO" method in terms of the output video bit-rate and the reconstructed video PSNR.

Fig. 6. Comparison of the video quality for video foreman encoded at varying QP values.

Fig. 7. Comparison of output bit-rate for video foreman encoded at varying QP values.

     In Fig. 7, the bit-rate of the "within RDO" method is higher than that of the original, which is not desirable for some applications. We use a threshold value of the CED to select appropriate DCT blocks for embedding data; the number of DCT blocks that can carry data decreases with the restriction threshold. This mechanism helps us control the degradation of the marked video quality, the bit-rate change, and the capacity of data hiding.
     In the experiments, we set the threshold value T of the embedding restriction rule to 1, 0.5, 0.1 or 0.05 for the "within RDO" scheme. The results are shown in Tables VI to VIII. We find that the degradation of quality is reduced from 3 dB to 1 dB and that the bit-rate after embedding does not increase significantly when the restriction rule is applied.

  Table VI. Efficiency of the original and proposed methods for foreman at QP = 15
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 47.32        969.62
  Without ER               45.18       1070.09             337752
  With ER, T = 0.5         46.35       1023.22             165019
  With ER, T = 0.1         46.35       1024.11             164923
  With ER, T = 0.05        46.36       1025.11             165190

  Table VII. Efficiency of the original and embedding methods for foreman at QP = 27
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 37.5         196.26
  Without ER               36.62        228.05              80708
  With ER, T = 0.5         37.33        205.92              22118
  With ER, T = 0.1         37.33        205.7               22216
  With ER, T = 0.05        37.32        205.59              22273

  Table VIII. Efficiency of the original and proposed methods for foreman at QP = 31
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 34.86         74.92
  Without ER               34.1          140.93             48152
  With ER, T = 0.5         34.65         127.25             11449
  With ER, T = 0.1         34.64         126.81             11289
  With ER, T = 0.05        34.63         126.85             11409

     From the experiments, we can observe that the degradation of bit-rate and video quality caused by embedding can be controlled effectively by adding the embedding restriction. This also raises another issue: when the threshold is small, the performance improves only up to a saturation level. In other words, the effectiveness of the embedding restriction rule has a limit in controlling the degradation. For the other test videos, we illustrate the results in terms of video quality and bit-rate in Figs. 8-15.

Fig. 8. Comparison of the video quality between our method and the original for video foreman at varying QP values.
Fig. 9. Comparison of the bit-rate between our method and the
original for video foreman at varying QP values                        Fig. 13. Comparison of the video quality between our method
                                                                       and the original for video mobile at varying QP values




Fig. 10. Comparison of the video quality between our method
and the original for video akiyo at varying QP values
                                                                       Fig. 14. Comparison of the video quality between our method
                                                                       and original for video news at varying QP values




Fig. 11. Comparison of the bit-rate between our method and
the original for video akiyo at varying QP values

Fig. 15. Comparison of the video quality between our method and the original for video news at varying QP values

     For smaller threshold values, most of the DCT blocks in the video are excluded from T1-symbol modification. However, this does not break the scheme, because in that case the LSB of the last coefficient in the block is modified instead. Therefore, for smaller threshold values, the number of DCT blocks hidden using the T1 symbols is smaller than the number using LSB replacement. This means that the bit-rate and video quality remain saturated, since changing only the LSB of the last coefficient in a block does not affect the bit-rate and PSNR significantly.
Fig. 12. Comparison of the video quality between our method            The capacity for each test video is shown in Figs. 16-19
and the original for video mobile at varying QP values


     According to Fig. 3, our proposed method does not target SKIP-mode blocks for data hiding. When the cost of the SKIP mode is lower than that of the other modes, the mode decision phase selects SKIP as the block mode; the number of SKIP-mode blocks increases with the QP value, as shown by the results in Fig. 20.




Fig. 16. Comparison of the capacity between our method and
the original for video foreman at varying QP values




Fig. 20. Comparison of the number of SKIP-mode blocks for video foreman encoded at varying QP values

     In Figs. 21 to 23, our proposed method and Shahid's [7] are compared in terms of bit-rate, PSNR and capacity. Two variants of our proposed method are shown, one with the CED threshold T = 0.1 and the other with T = 0.5. When the QP value is higher than 11, Shahid's capacity declines rapidly because the number of coefficients at high QP values is sparse. The efficiency of our method with the CED is close to Shahid's in terms of bit-rate and video quality.

Fig. 17. Comparison of the capacity between our method and the original for video akiyo at varying QP values




Fig. 18. Comparison of the capacity between our method and
the original for video mobile at varying QP values
Fig. 21. Comparison of video quality between our proposed method and Shahid's for video foreman encoded at varying QP values




Fig. 19. Comparison of the capacity between our method and the original for video news at varying QP values
Fig. 22. Comparison of bit-rate between our proposed method and Shahid's for video foreman encoded at varying QP values

Fig. 23. Comparison of capacity between our proposed method and Shahid's for video foreman encoded at varying QP values

     In Table IX, we compare the capacity performance of Shahid's scheme and our proposed algorithm. At the same QP, our method provides a higher capacity than Shahid's, and the capacity of Shahid's method decreases severely as the QP value increases.

  Table IX. Capacity (bit) of our method and Shahid's [7] for foreman at varying QP
  QP     Proposed, T = 0.5    Proposed, T = 0.1    Shahid [7]
  11          281591               281497             280578
  15          165019               164923             139629
  19           82915                83241              67582
  23           40620                40652              29851
  27           22118                22216              12108
  31           11449                11289               4357

                    V. CONCLUSIONS

     In this paper, we propose a data hiding algorithm that takes the rate-distortion performance of the H.264/AVC standard into account. The algorithm can control the increase of bit-rate and the decrease of PSNR after hiding secret data in the videos, at the cost of reducing the capacity of the data to be hidden. The information is hidden in the T1 symbols of the CAVLC domain in the H.264/AVC encoder. In order to reduce the propagation of the hiding modification to subsequent blocks, the proposed algorithm selects blocks with minor energy change to hide data. With this selection scheme, the proposed algorithm can adjust the threshold value to adapt the capacity to different application requirements.

                    ACKNOWLEDGEMENT
This research is supported in part by the National Science Council, Taiwan, under grant NSC 98-2221-E-150-051.

                        REFERENCES
[1] G. Qiu, P. Marziliano, A. Ho, D. He, Q. Sun, "A Hybrid Watermarking Scheme for H.264 Video," Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 865-868, Aug. 2004.
[2] S.K. Kapotas, E.E. Varsaki, A.N. Skodras, "Data Hiding in H.264 Encoded Video Sequences," IEEE 9th Workshop on Multimedia Signal Processing, Crete, pp. 373-376, October 1-3, 2007.
[3] B.G. Mobasseri, Y.N. Raikar, "Authentication of H.264 Streams by Watermarking CAVLC Blocks," SPIE Conference on Security, Steganography and Watermarking of Multimedia Contents IX, San Jose, CA, January 28-February 2, 2007.
[4] G.Z. Wu, Y.J. Wang, W.H. Hsu, "Robust watermark embedding detection algorithm for H.264 video," Journal of Electronic Imaging, 14(1), 013013, 2005.
[5] L. Tian, N. Zheng, J. Xue and T. Xu, "A CAVLC-Based Blind Watermarking Method for H.264/AVC Compressed Video," Asia-Pacific Services Computing Conference (APSCC 2008), pp. 1295-1299, IEEE, 2008.
[6] K. Liao, D. Ye, S. Lian, Z. Guo, J. Wang, "Lightweight Information Hiding in H.264/AVC Video Stream," 2009 International Conference on Multimedia Information Networking and Security, vol. 1, pp. 578-582, 2009.
[7] Z. Shahid, M. Chaumont, W. Puech, "Considering the Reconstruction Loop for Data Hiding of Intra and Inter Frames of H.264/AVC," European Signal Processing Conference (EUSIPCO), 2009.
[8] X. Huang, Y. Abe, and I. Echizen, "Capacity Adaptive Synchronized Acoustic Steganography Scheme," Journal of Information Hiding and Multimedia Signal Processing, Vol. 1, No. 2, pp. 72-90, Apr. 2010.
[9] Z.H. Wang, T.D. Kieu, C.C. Chang, M.C. Li, "A Novel Information Concealing Method Based on Exploiting Modification Direction," Journal of Information Hiding and Multimedia Signal Processing, Vol. 1, No. 1, pp. 1-9, Jan. 2010.
[10] K. Sühring, H.264/AVC Reference Software [Online]. Available: http://iphome.hhi.de/suehring/tml/, Joint Model 12.2 (JM12.2), Jan. 2009.
Secret-fragment-visible Mosaic — a New Image Art and Its Application to
                                  Information Hiding


                   I-Jen Lai (賴怡臻)                                                      Wen-Hsiang Tsai (蔡文祥)
    Institute of Computer Science and Engineering                                        Dept. of Computer Science
   National Chiao Tung University, Hsinchu, Taiwan                            National Chiao Tung University, Hsinchu, Taiwan
         Email: nekolai.cs97g@g2.nctu.edu.tw                                            Email: whtsai@cs.nctu.edu.tw


Abstract—A new type of art image called the secret-fragment-visible mosaic image is created, which is composed of rectangular-shaped fragments yielded by division of a secret image. To create this kind of mosaic image, the 3-D RGB color space is transformed into a 1-dimensional h-colorscale, based on which a new image similarity measure is proposed; the most similar candidate image from an image database is selected accordingly as the target image. Then, a greedy algorithm is adopted to fit every tile image in the secret image into a properly selected block in the target image, with the effect of embedding the secret image fragmentally and visibly in the composed mosaic image. In addition to this type of secret image hiding, secret message bits may be embedded as well for the purpose of covert communication. Based on the fact that tile images in an identical bin of the histogram of the created mosaic image have similar colors, the tile images in each histogram bin are reordered pairwise and their relative positions are switched accordingly, so that secret message bits are embedded without creating noticeable changes in the resulting mosaic image. The embedded message is protected by a secret key and may be extracted from the stego-image using the key. Additional security measures are also discussed. Experimental results show the feasibility of the proposed methods.
   Keywords: secret-fragment-visible mosaic image, covert communication, data hiding.

                      I. INTRODUCTION
   Mosaics are artworks created by composing small pieces of materials, such as stone, glass and tile. Nowadays, they are popularly used for decorating houses and other constructions. The creation of mosaic images by computers has become a new research topic in recent years. Traditional mosaic images are obtained by arranging a large number of small images, called tile images, in a certain manner so that each tile image represents a small piece of a source image, named the target image. Consequently, when we see a mosaic image from a distance, as a whole it looks like its source image — an effect of a human vision property. Many methods have been proposed to create different types of mosaic images [1-8].
   Haeberli [1] proposed a method for mosaic image creation using voronoi diagrams, placing the sites of blocks randomly and filling colors into the blocks based on the content of the original image. Hausner [2] created tile mosaic images by using centroidal voronoi diagrams. Dobashi et al. [3] improved the voronoi diagram to allow a user to add various effects to the mosaic image, such as the simulation of stained glass. Kim and Pellacini [4] generated jigsaw image mosaics composed of tiles of arbitrary shapes selected from a database. Extending the concept of [4], Blasi et al. [5] presented a new mosaic image called the puzzle image mosaic. Lin and Tsai [6] embedded secret data in image mosaics by adjusting boundary regions and altering pixel color values. Wang and Tsai [7] hid data in image mosaics by utilizing the overlapping spaces of component images. Hung and Tsai [8] embedded data in stained-glass-like mosaic images by modifying the tree structure used in the creation process. Hsu and Tsai [9] presented a new type of art image, the circular-dotted image, and used the characteristics of its creation process to hide secret messages in the generated art image. Chang and Tsai [10] proposed a new type of art image, called the tetromino-based mosaic, which is composed of tetrominoes appearing in a video game; data hiding is made possible by distinct combinations and color shifting of the tetromino elements.
   A new type of art image, called the secret-fragment-visible mosaic image, which contains small fragments of a secret source image, is proposed in this study. Observing such a mosaic image, people can see all of the fragments of the secret image, but the fragments are so tiny in size and so random in position that people cannot figure out what the source image looks like, unless they have some way to rearrange the pieces back into their original positions using a secret key from the image owner. Therefore, the source image may be said to be secretly embedded in the resulting mosaic image, even though the fragment pieces are all visible to an observer of the image. This is why we name the resulting image a secret-fragment-visible mosaic image.
   In the remainder of this paper, the proposed mosaic image creation process is described in Section II, a covert communication method via secret-fragment-visible mosaic images is proposed in Section III, and some experimental results are presented in Section IV, followed by conclusions in Section V.

     II. PROPOSED MOSAIC IMAGE CREATION PROCESS
   The proposed mosaic image creation process is composed of two major stages. The first is the construction of a
database which can be used later to select similar target images for given secret images. The quality of a constructed secret-fragment-visible mosaic image is related to the similarity between the secret image and the target image; the selected target image should be as similar to the secret image as possible. An appropriate similarity measure for this purpose is proposed in this study and described later. The other stage is the creation of the desired mosaic image using the secret image and the target image as input. In this stage, the secret image is divided into fragment pieces taken as tile images, which are then used to create the mosaic image. The number of tile images is limited by the size of the secret image and that of the tile images. Note that this is not the case in traditional mosaic image creation, where the tile images available for fitting into the target image are unlimited in number. In order to solve this problem of fitting a limited number of tile images into a target image, a greedy algorithm is proposed, which is described later as well.

2.1 Database Construction

The database plays an important role in the secret-fragment-visible mosaic image creation process. If a target image is dissimilar to a secret image, the created image will be distinct from the target one. In order to generate good results, the database should be as large as possible.

Searching a database for a target image with the highest similarity to the secret image is a problem of content-based image retrieval. A technique to solve this problem is to base the similarity on a 1-D color histogram transformation [12] of the color distribution of the image. The transformation maps the three color channel values into a single value. Specifically, each color channel is first re-quantized into fewer levels, yielding a new image I′ with a lower resolution in color, specified by (r′, g′, b′). Let Nr, Ng, and Nb denote the numbers of levels of the new color values r′, g′, and b′, respectively. Then, for each pixel P′ in I′ with new colors (r′, g′, b′), the following 1-D function value f is computed:

    f(r′, g′, b′) = r′ + Nr g′ + Nr Ng b′.    (1)

However, according to our experimental experience, this 1-D color function f is found inappropriate for our study, where the human's visual feeling of image similarity must be emphasized, as shown by Fig. 1(a). Therefore, we propose a new function h as follows:

    h(r′, g′, b′) = b′ + Nb r′ + Nb Nr g′,    (2)

where the numbers of levels, Nr, Ng, and Nb, are all set to 8. Differently from the case in (1), in (2) we assign the largest weight NbNr to the green channel value g′ and the smallest weight 1 to the blue channel value b′. The reason is that the eyes of human beings are the most sensitive to the green color and the least sensitive to the blue one. In addition, with all of Nr, Ng, and Nb set to 8 in (2), an advantage of speeding up the mosaic image creation process is obtained according to our experiments. Subsequently, we say that the new color feature function h proposed above defines a 1-D h-colorscale. The resulting image created by our method is shown in Fig. 1(b), which has noticeably less noise than Fig. 1(a).

Figure 1. Effects of mosaic image creation using different color similarity measures. (a) Image created with the similarity measure of [12]. (b) Image created with the proposed similarity measure.

Furthermore, to compute the similarity measure between a tile image in the secret image and a target block in a database image for use in tile-image fitting during mosaic image generation, we propose a new feature, called the h-feature, for each block image C (either a tile image or a target block), denoted as hC, which is computed by the following steps:

1. compute the average of the color values of all the pixels in C as (RC, GC, BC);
2. re-quantize (RC, GC, BC) into (rC′, gC′, bC′) using the new Nr, Ng, and Nb color levels; and
3. calculate the h-feature hC for C by Eq. (2) above, resulting in the following equation:

    hC(rC′, gC′, bC′) = bC′ + Nb rC′ + Nb Nr gC′.    (3)

With Nr, Ng, and Nb all set equal to 8, the computed values of the h-feature hC above range from 0 to 584. The proposed algorithm for constructing a database of candidate images for use in generating secret-fragment-visible mosaic images is described in the following.

Algorithm 1: construction of candidate image database.
Input: a set S of images, a pre-selected tile image size Zt, and a pre-selected candidate image size Zc.
Output: a database DB of candidate images with size Zc and their corresponding h-colorscale histograms.
Steps:
Step 1. For each input image I, perform the following steps.
    1.1 Resize and crop I to yield an image D of size Zc.
    1.2 Divide D into blocks of size Zt.
    1.3 For each block C of D, calculate and round off the h-feature value hC described by Eq. (3).
    1.4 Generate a histogram H of the h-feature values of all the blocks in D.
    1.5 Save H with D into the desired database DB.
Step 2. If the input images are not exhausted, go to Step 1; otherwise, exit.
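As a concrete illustration of Eq. (3) and Steps 1.3-1.4 of Algorithm 1, the following Python sketch computes the h-feature of one block and builds the h-colorscale histogram of an image. The function names, the 0-based quantization, and the use of NumPy are our own assumptions for illustration; the paper itself only specifies the formulas.

    import numpy as np

    N_R = N_G = N_B = 8      # numbers of re-quantized color levels (all set to 8 in the paper)
    H_BINS = 585             # the paper quotes h-feature values in the range 0..584

    def h_feature(block):
        # h-feature of one block (Eq. (3)): average the block's color,
        # re-quantize each channel into 8 levels, and map to a single value
        # with the largest weight on green and the smallest on blue.
        # `block` is an H x W x 3 uint8 array in (R, G, B) order; the 0-based
        # quantization below is our assumption.
        avg = block.reshape(-1, 3).mean(axis=0)          # (R_C, G_C, B_C)
        r, g, b = (avg // (256 // N_R)).astype(int)      # re-quantized (r', g', b')
        return int(b + N_B * r + N_B * N_R * g)

    def h_histogram(image, tile_size):
        # h-colorscale histogram of an image, as in Steps 1.2-1.4 of Algorithm 1:
        # divide the image into tile_size x tile_size blocks and count h-features.
        rows, cols = image.shape[:2]
        hist = np.zeros(H_BINS, dtype=int)
        for y in range(0, rows - tile_size + 1, tile_size):
            for x in range(0, cols - tile_size + 1, tile_size):
                hist[h_feature(image[y:y + tile_size, x:x + tile_size])] += 1
        return hist

Because every block is reduced to one average color before quantization, the histogram remains cheap to compute even for a large candidate database.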
2.2 Similarity Measure Computation

Before generating a mosaic image, we have to choose as the target image the most similar candidate image from the database, based on the content of the given secret image. For this, we define a difference measure e between the 1-D histogram
HS of the secret image S and that of a candidate image D in the database in the following way:

    e = Σ_{m=0}^{584} |HS(m) − HD(m)|,    (4)

where m stands for an h-feature value. The smaller the value e is, the more similar the candidate image D is to the secret image S. After calculating the errors of all the images in the database, we can select the one with the smallest error as the desired target image for use in mosaic image generation. The details of selecting the most similar candidate image from a database are given as follows.

Algorithm 2: selection of the most similar candidate image as a target image.
Input: a secret image S, a database DB of candidate images, and the sizes Zt and Zc mentioned in Algorithm 1.
Output: the target image T in DB which is the most similar to S.
Steps:
Step 1. Resize S to yield an image S′ of size Zc so that it becomes of the same size as the candidate images in DB.
Step 2. Divide S′ into blocks of size Zt, and perform the following steps.
    2.1 For each block C of S′, calculate its h-feature value hC by Eq. (3) and round off the result.
    2.2 Generate a 1-D h-colorscale histogram HS′ for S′ from the h-feature values of all the blocks in S′.
Step 3. For each candidate image D with 1-D h-colorscale histogram HD in DB, perform the following steps.
    3.1 Compute the difference measure e between HS′ and HD according to Eq. (4) described above.
    3.2 Record the value e.
Step 4. If the images in DB are not exhausted, go to Step 3; otherwise, continue.
Step 5. Select the image in DB which has the minimum difference measure e and take it as the desired target image T.
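A minimal sketch of Eq. (4) and the selection loop of Algorithm 2 follows, reusing the h_histogram helper sketched after Algorithm 1. Treating the database as a list of (image, histogram) pairs and passing in the already-resized secret image are our own simplifications.

    def histogram_difference(hist_s, hist_d):
        # Difference measure e of Eq. (4): sum of absolute bin-wise differences.
        return int(np.abs(hist_s - hist_d).sum())

    def select_target_image(secret_resized, database, tile_size):
        # Core of Algorithm 2: `secret_resized` is the secret image already resized
        # to the candidate size Zc (Step 1); `database` is assumed to be a list of
        # (candidate_image, candidate_histogram) pairs produced by Algorithm 1.
        hist_s = h_histogram(secret_resized, tile_size)
        best_image, _ = min(database, key=lambda entry: histogram_difference(hist_s, entry[1]))
        return best_image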
                                                                         result of using such a greedy algorithm to fit the tile images
2.3 Algorithm for Secret-fragment-visible Mosaic Image Creation

Before presenting the algorithm for creating the proposed mosaic images, we discuss some problems encountered in the creation process and present the solutions we propose for them.

A. Problem of fitting tile images optimally and proposed solution

The first problem faced in the creation process is how to find an optimal solution for fitting a tile image of the secret image into an appropriate target block in a target image selected by Algorithm 2. At first sight, it seems that we can reduce it to a single-source shortest path problem. The shortest path problem is one of finding a path in a graph with the smallest sum of between-vertex edge weights. The state of fitting a tile image may be represented by a vertex of the graph, and the action of selecting the most similar tile image for each target block may be represented by an edge of the graph, with its label taken to be that of the tile image and its weight taken to be the average Euclidean distance between the pixels' colors of the selected tile image and those of the target block. Accordingly, we can build a tree structure as the graph for this problem, as shown by Fig. 2.

Figure 2. A tree structure of fitting tile images to target blocks.

In order to find the optimal solution, we may utilize Dijkstra's algorithm, whose running time for obtaining an optimal answer is O(|V|²), where |V| denotes the number of vertices in the tree. Unfortunately, according to Fig. 2, the number of vertices in this problem is Σ_{n=1}^{N−1} [(N−1)!/n!], where N is the number of target blocks, which is larger than 40,000 for the images used in this study; the computation time for obtaining an optimal solution for such a large N is obviously too high to be practical. This means that we have to find other feasible solutions to this problem.

The solution we propose is to use a greedy algorithm. We calculate the average Euclidean distance between the pixels' colors of a tile image T and those of a target block B as the similarity measure between T and B, and then use this measure as the selection function of the greedy algorithm to select the most similar target block for tile-image fitting. However, as shown by the example of Fig. 4(a), which is the result of using such a greedy algorithm to fit the tile images of the secret image, Fig. 3(a), into the target image, Fig. 3(b), the algorithm is found unsatisfactory, often yielding a result with the lower part of the target image filled with fragment pieces of inappropriate colors. This phenomenon comes from the fact that the number of tile images obtained from the secret image, Fig. 3(a), is limited by the secret image's own size, so that the tile images available for fitting the target blocks in Fig. 3(b) become fewer and fewer near the end of the fitting process. As a result, the similarity differences between the later-fitted tile images and the chosen target blocks become much bigger than those of the earlier-fitted ones, yielding a poorly-fitted bottom part like that shown in Fig. 4(a).

A solution to this problem found in this study is to use the previously proposed h-feature to define the selection function of the greedy algorithm. This feature takes the global color distribution of an image into consideration, which helps create a mosaic image with its content resembling the target image more effectively, as shown by
the example of Fig. 4(b), which is an improvement over Fig. 4(a).

Figure 3. Input images. (a) A secret image. (b) A selected target image.

Figure 4. Resulting images using different similarity measures. (a) Image created using the Euclidean distance to define the selection function of the greedy algorithm. (b) Image created using the h-feature to define the selection function of the greedy algorithm.

B. Problem of small-sized candidate image database and proposed solution

A second problem faced in the mosaic image creation process is how to deal with a database which is not large enough. This problem will cause an insufficiently similar image to be selected from the database as the target image for a given secret image. As a result, the created mosaic image will look unlike the target one, as shown by the example of Fig. 6(a), a mosaic image created with Figs. 5(a) and 5(b) as the secret and target images, respectively.

To solve this problem, during the candidate image selection process, after the difference measure between a secret image and a candidate image is computed, if the computed value is large, the selected target image is regarded as inappropriate for the creation process. In this case, we enlarge the size of the selected target image as a remedy. The reason is that if the size of the target image is larger than that of the secret image, the number of target blocks, or equivalently, the number of possible positions to fit each tile image, becomes larger, yielding in general a better fitting result. In this way, the resulting mosaic image becomes visually better than before, as shown by the example of Fig. 6(b).

Figure 5. Input images. (a) A secret image. (b) A selected target image.

Figure 6. Resulting images. (a) Image created without the proposed remedy method, which is four times as large as (b). (b) Image created with the proposed remedy method.

C. Algorithm for secret-fragment-visible mosaic image creation

According to the above discussions, an algorithm for the creation of the proposed secret-fragment-visible mosaic images is described in the following.

Algorithm 3: mosaic image creation.
Input: a secret image S, a database DB, and a selected size Zt of a tile image.
Output: a secret-fragment-visible mosaic image R.
Steps:
Stage 1: embedding secret image fragments into a selected target image.
Step 1. Crop S to yield an image S′ whose size is divisible by Zt.
Step 2. Perform the following steps to select a target image T from DB.
    2.1 Select a candidate image as T from DB by Algorithm 2.
    2.2 If the difference measure e computed in Step 3.1 of Algorithm 2 is larger than a pre-selected threshold Th, then enlarge the size of T by ⌈e/Th⌉ times.
Step 3. Obtain a block-label sequence L1 of S′ by calculating and sorting the h-feature values of all the tile images in S′.
Step 4. Obtain a block-label sequence L2 of T by calculating and sorting the h-feature values of all the target blocks in T.
Step 5. Fit the tile images of S′ to the target blocks of T based on the one-to-one mappings from the ordered labels of L1 to those of L2, thus completing the embedding of all the tile images in S′ into the target blocks of T according to the greedy criterion.
Stage 2: dealing with unfilled target blocks.
Step 6. Perform the following steps to fill each remaining unfilled target block, B, in T, if there is any.
    6.1 Compute the difference e′ between the h-feature hB of B and the h-feature hA of each of the tile images, A, in S′ by the following equation:

        e′ = |hB − hA|.    (5)

    6.2 Pick out the tile image Ao with the smallest difference eo′ and compare eo′ with another pre-selected threshold Th′ to conduct either of the following two operations:
        A. if eo′ ≤ Th′, then fill the tile image Ao into the target block B;
        B. if eo′ > Th′, then fill the averages of the R, G, and B values of all the pixels in B into B.
Stage 3: generating the desired mosaic image.
Step 7. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T.
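The heart of Steps 3-5 above is to sort both block sets by their h-feature values and pair them off in order. The sketch below shows this one-to-one fitting; the list-of-blocks layout and the h_feature helper from the earlier sketch are our own conventions, not the paper's notation.

    def fit_tiles_by_h_feature(tile_blocks, target_blocks):
        # Steps 3-5 of Algorithm 3: h-sort the labels of both block sets and map
        # them one-to-one in sorted order. Returns `mapping` where mapping[j] = i
        # means tile image i is placed at target block j; target blocks left as
        # None are the "unfilled" blocks handled by Step 6.
        l1 = sorted(range(len(tile_blocks)), key=lambda i: h_feature(tile_blocks[i]))
        l2 = sorted(range(len(target_blocks)), key=lambda j: h_feature(target_blocks[j]))
        mapping = [None] * len(target_blocks)
        for tile_label, block_label in zip(l1, l2):
            mapping[block_label] = tile_label
        return mapping

Target blocks left unmapped by the pairing correspond to the unfilled blocks treated in Step 6, each of which is then either filled with the tile image of the closest h-feature (Eq. (5)) or painted with its own average color, depending on the threshold Th′.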

2.4 Experimental Results of Mosaic Image Creation

Some mosaic images generated by the above algorithm are shown in Figs. 7 and 8. Note that in either figure, the secret image of (a) may be thought of as having been embedded into the target image of (b) to yield the stego-image of (c). The database used in running the algorithm includes 841 candidate images. The size of this database is regarded as large enough because the remedy measure of target image enlargement was rarely needed in the mosaic image creation process in our experiments.

Figure 7. An example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic image.

Figure 8. Another example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic image.

III. COVERT COMMUNICATION VIA SECRET-FRAGMENT-VISIBLE MOSAIC IMAGES

3.1 Idea of Proposed Covert Communication Method

In the proposed mosaic image creation process, tile images with the same h-feature values appear to have similar colors. Each tile image is fitted into a corresponding target block based on the one-to-one mappings established between the two label sequences of the secret image and the selected target image. Note that both sequences have been sorted according to the h-feature values of their image blocks; they are said to have been h-sorted. As a result of such h-sorting, every pair of neighboring labels in either sequence specifies two image blocks with similar h-feature values, implying that the average colors of the two blocks are essentially visually similar.

The main idea of secret embedding in the proposed covert communication method is to switch the orders of the target blocks in the h-sorted label sequence of the target image during the mosaic image creation process to embed message bits, thus achieving the goal of hiding a secret message in a secret-fragment-visible mosaic image imperceptibly.

More specifically, after the label switching, if a leading label is smaller than the following one in the target block label sequence, then a bit "0" is regarded to have been embedded in the two neighboring labels; otherwise, a bit "1" is regarded as embedded there. Furthermore, as shown by the example of Fig. 9, because the tile images which
correspond to the target blocks with switched labels have similar average colors, as mentioned previously, no visually perceptible difference will arise in the resulting mosaic image after the message is embedded.

Figure 9. Label switching and the corresponding target block exchange. (a) The original one. (b) After switching the corresponding target blocks of tile images.
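To make the bit convention of Fig. 9 concrete, a pair of neighboring labels carries a "0" when the leading label is the smaller of the two and a "1" otherwise, so embedding reduces to an optional swap. A two-function sketch of this rule (our own illustration, not the paper's notation) is given below.

    def embed_bit_in_pair(l1, l2, bit):
        # Order the pair so that it encodes `bit`: ascending for "0", descending for "1".
        return (min(l1, l2), max(l1, l2)) if bit == 0 else (max(l1, l2), min(l1, l2))

    def extract_bit_from_pair(l1, l2):
        # Read the bit back from the pair order.
        return 0 if l1 < l2 else 1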
                                                                                             Step 6. Transform the message to be embedded, M, into a
3.2 Modified Secret-fragment-visible Mosaic Image Creation Process for Secret Message Embedding

In the proposed covert communication method, the mapping of the labels of the tile images to those of their corresponding target blocks is recorded in a recovery sequence LR for use in later data recovery. An illustration is shown in Fig. 10. Embedding of LR is then accomplished by hiding the labels of LR into randomly chosen tile images by the lossless LSB-modification scheme [11] controlled by a secret key. The detailed algorithm for secret message embedding, a modified version of Algorithm 3, is given as follows.

Figure 10. An illustration of the generation of a recovery sequence LR.

Algorithm 4: embedding a message into a secret-fragment-visible mosaic image.
Input: a secret image S, a secret key K, the size Zt of tile images, a database DB, and a secret message M.
Output: a secret-fragment-visible mosaic image R into which M is embedded.
Steps:
Stage 1: embedding secret image fragments into a selected target image.
Step 1. Crop S to yield S′ with a size divisible by Zt.
Step 2. Select a target image T for S′ with histogram H from the database DB by Algorithm 2.
Step 3. Perform Steps 3 to 4 of Algorithm 3 to obtain the h-sorted label sequences L1 and L2 of S′ and T, respectively.
Step 4. Group the labels of L1 and L2 by the following steps.
    4.1 Group the labels of L1 based on the h-feature values of the tile images in S′, with each resulting group including the labels of a set of tile images having the same h-feature value.
    4.2 Group the labels of L2 based on the grouping of L1 obtained in Step 4.1, resulting in groups of labels G1, G2, …, Gm, with each group Gi including the labels of a set of target blocks whose corresponding tile images have the same h-feature value.
Stage 2: embedding the secret message M.
Step 5. Generate the histogram H of the h-feature values of all the tile images in the resized secret image S′.
Step 6. Transform the message to be embedded, M, into a bit string M′.
Step 7. Perform the following steps to embed the bits of M′ into L2.
    7.1 Select the smallest unprocessed h-feature value hi whose histogram value H(hi) is larger than or equal to two.
    7.2 Take out the group Gi of labels in L2 corresponding to the h-feature value hi.
    7.3 Take out the first two unprocessed labels l1 and l2 in Gi, and switch the order of l1 and l2 in L2 if either of the following two conditions is satisfied, assuming that the first unembedded bit in M′ is denoted as b:
        A. b = 0 and l1 > l2;
        B. b = 1 and l1 < l2.
    7.4 Repeat Step 7.3 until Gi includes at most one unprocessed label, which is left untouched.
    7.5 Repeat Steps 7.1 through 7.4 until the bits of M′ are exhausted.
Step 8. Form an extra string of eight "0" bits as the ending signal of the input message M, and embed it into L2 by Step 7 above.
Step 9. Fit the tile images of S′ into the target blocks of T based on the one-to-one mappings from the labels of L1 to those of the re-ordered L2 obtained in Steps 7 and 8 (denoted as L2′ subsequently), and let the resulting image be denoted as T′.
Stage 3: dealing with unfilled target blocks, generating and embedding the recovery sequence, and generating the desired mosaic image.
Step 10. Perform Step 6 of Algorithm 3 to fill each of the remaining unfilled target blocks, if there is any.
Step 11. Sort all the labels in L1 by their corresponding h-feature values, re-order accordingly the corresponding labels in L2′, take the re-ordering result as a recovery sequence LR, and transform it into a binary string.
Step 12. Embed the width and height of S′ as well as the
size Zt into the first ten pixels of image T′ in a raster-scan order by the LSB modification scheme.
Step 13. Embed the data of LR by the same scheme into unprocessed tile images of T′ randomly selected by the secret key K.
Step 14. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T′.
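Step 7 of Algorithm 4 can be summarized in a few lines of Python: walk over the label groups in increasing order of their h-feature value and, inside each group, re-order consecutive label pairs so that each pair encodes one bit. The group layout (a list of lists) and the embed_bit_in_pair helper from the previous sketch are our own conventions.

    def embed_message_in_groups(l2_groups, message_bits):
        # Step 7 of Algorithm 4: embed the bits by pairwise label switching inside
        # each group, processing groups in increasing order of their h-feature value.
        # `l2_groups` is a list of label lists G1, G2, ... and is modified in place;
        # `message_bits` already ends with the eight-"0" end signal of Step 8.
        bits = iter(message_bits)
        for group in l2_groups:                       # Steps 7.1/7.2: next usable group
            for k in range(0, len(group) - 1, 2):     # Step 7.3: consecutive label pairs
                try:
                    b = next(bits)
                except StopIteration:
                    return                            # Step 7.5: all bits embedded
                group[k], group[k + 1] = embed_bit_in_pair(group[k], group[k + 1], b)
            # Step 7.4: a group of odd size leaves its last label untouched

Groups holding fewer than two labels simply contribute nothing, which mirrors the requirement H(hi) ≥ 2 in Step 7.1.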
                                                                                         the labels of L1 based on the h-feature values of the
3.3 Secret Extraction Process

In the proposed secret message extraction process, we extract the recovery sequence LR first and accordingly retrieve the original secret image S. Also, by calculating the h-feature values of the original secret image, we regain the h-feature values of the tile images and sort them to get the h-sorted label sequence L1.

Next, as illustrated by Fig. 11, the recorded sequence LR, though including only the labels of L2, essentially specifies the one-to-one mappings between the tile images and the target blocks. Therefore, we may regain the h-sorted label sequence L2 of the target blocks from the corresponding mappings from L1 to LR. Then, with the histogram H of the h-feature values of all the tile images, we may group the labels of sequences L1 and L2, and then examine the orders of the labels of L2 to extract the embedded secret message, in a way reverse to the message embedding process described in Algorithm 4.

Figure 11. An illustration of the regaining of the label sequence L2.

Algorithm 5: secret image recovery and secret message extraction.
Input: a secret-fragment-visible mosaic image R, and a secret key K identical to that used in Algorithm 4.
Output: a recovered secret image S, and the secret message M supposedly embedded in R.
Steps:
Stage 1: retrieving the secret image S.
Step 1. Retrieve the width and height of S′ as well as the size Zt of the tile images from the LSBs of the first ten pixels of image R.
Step 2. Extract the recovery sequence LR from the LSBs of blocks in R randomly selected using the secret key K.
Step 3. Compose the desired secret image S based on the sequence LR by extracting the tile images fitted in R in order and placing them at the correct relative positions.
Stage 2: regaining the h-sorted label sequences.
Step 4. Get the h-sorted label sequence L1 of the tile images of the recovered secret image S, and group the labels of L1 based on the h-feature values of the tile images, with each resulting group including the labels of the tile images having the same h-feature value.
Step 5. Perform the following steps to get the h-sorted label sequence L2 of the target image T.
    5.1 Get the re-ordered block sequence QT of the target blocks in T by one-to-one mapping of the labels in sequence L1 to those of LR.
    5.2 Get a new h-sorted label sequence L2 from the labels of the re-ordered block sequence QT.
    5.3 Group the labels of L2 based on the grouping of L1 conducted in Step 4, with each group Gi including the labels of the target blocks whose corresponding tile images have the same h-feature value hi.
Stage 3: extracting the embedded secret message M.
Step 6. Generate the histogram H of the h-feature values of all the tile images in the recovered secret image S.
Step 7. Perform the following steps to extract the bits of the secret message M.
    7.1 Select the smallest h-feature value hi whose histogram value H(hi) is larger than or equal to two.
    7.2 Take out the group Gi of labels in L2 corresponding to hi.
    7.3 Take out the first two unprocessed labels l1 and l2 in Gi, extract a hidden message bit b by the following rule, and append it to the end of a bit version of the message, denoted as D:
        A. if l1 < l2, then set b = 0;
        B. if l1 > l2, then set b = 1.
    7.4 Repeat Step 7.3 until Gi includes at most one unprocessed label, which is then left untouched.
    7.5 Repeat Steps 7.1 through 7.4 if the 8-bit end signal has not been extracted (i.e., if the last extracted 8 bits are not a sequence of eight "0"s).
Step 8. Transform every 8 bits of D into characters to obtain the desired secret message M.
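The bit-extraction loop of Step 7 mirrors the embedding loop; a sketch using the same list-of-groups layout and the extract_bit_from_pair helper introduced earlier (both our own conventions) is shown below.

    def extract_message_bits(l2_groups):
        # Step 7 of Algorithm 5: read one bit from each consecutive label pair,
        # group by group, until the eight-"0" end signal appears.
        bits = []
        for group in l2_groups:
            for k in range(0, len(group) - 1, 2):
                bits.append(extract_bit_from_pair(group[k], group[k + 1]))
                if len(bits) >= 8 and bits[-8:] == [0] * 8:
                    return bits[:-8]                  # drop the end signal (Step 7.5)
        return bits                                    # end signal not found

Every eight recovered bits are then packed into one character, as in Step 8.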
3.4 Experimental Results

An example of experimental results obtained using Algorithms 4 and 5 is given in Fig. 12. The average difference measure value at the block level between Fig. 8(c) and Fig. 12 (computed as the sum of all the Euclidean distances divided by the number of blocks) is 0.05, and the PSNR of Fig. 12 with respect to Fig. 8(c) is 66.6 dB, which is quite satisfactory, meaning that the proposed information hiding method (implemented by Algorithms 4 and 5) provides a good effect for covert communication.
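The PSNR figure quoted above follows the standard definition for 8-bit images; for completeness, a short sketch (assuming 8-bit RGB inputs of equal size) is given below.

    import numpy as np

    def psnr(img_a, img_b):
        # Peak signal-to-noise ratio (in dB) between two 8-bit images of equal size.
        mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)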
Figure 12. An example of covert communication. (a) A mosaic image into which messages are hidden. (b) Resulting image and extracted messages using the right key. (c) Resulting image and extracted messages using a wrong key.

IV. CONCLUSIONS AND SUGGESTIONS

A new type of art image, the secret-fragment-visible mosaic image, and a data hiding technique have been proposed for secret image hiding and covert message communication, respectively. For the former, we have proposed a new 1-D h-colorscale to represent the color distribution of an image more effectively, based on which a new h-feature is proposed for measuring image similarity. A greedy algorithm is proposed accordingly for fitting the tile images of the secret image into appropriate target blocks more efficiently. A remedy method has also been proposed to solve the problem of using a small-sized database, which enlarges a selected target image in proportion to the difference measure between the secret and the target images. For the proposed data hiding method used in covert communication via secret-fragment-visible mosaic images, it was observed that the tile images in an identical histogram bin have similar colors. By switching the relative positions of the target blocks corresponding to such tile images, we can embed secret message bits into a secret-fragment-visible mosaic image imperceptibly.

Future works may be directed to allowing users to select target images freely to create secret-fragment-visible mosaic images. This seems achievable by applying a reversible color shifting technique to fit the color distribution of the secret image to that of a selected target image.

REFERENCES

[1] P. Haeberli, "Paint by numbers: abstract image representations," Proc. SIGGRAPH '90, Dallas, USA, 1990, pp. 207-214.
[2] A. Hausner, "Simulating decorative mosaics," Proc. 2001 International Conf. on Computer Graphics & Interactive Techniques (SIGGRAPH 01), Los Angeles, USA, August 2001, pp. 573-580.
[3] Y. Dobashi, T. Haga, H. Johan and T. Nishita, "A method for creating mosaic image using voronoi diagrams," Proc. 2002 European Association for Computer Graphics (Eurographics 02), Saarbrucken, Germany, September 2002, pp. 341-348.
[4] J. Kim and F. Pellacini, "Jigsaw image mosaics," Proc. 2002 International Conf. on Computer Graphics & Interactive Techniques (SIGGRAPH 02), San Antonio, USA, July 2002, pp. 657-664.
[5] G. D. Blasi, G. Gallo and M. Petralia, "Puzzle image mosaic," Proc. 2005 Int'l Association of Science & Technology for Development on Visualization, Imaging & Image Processing (IASTED/VIIP 2005), Benidorm, Spain, Sept. 2005.
[6] W. L. Lin and W. H. Tsai, "Data hiding in image mosaics by visible boundary regions and its copyright protection application against print-and-scan attacks," Proc. 2004 Int'l Computer Symp. (ICS 2004), Taipei, Taiwan, Dec. 15-17, 2004.
[7] C. C. Wang and W. H. Tsai, "Creation of tile-overlapping mosaic images for information hiding," Proc. 2007 Nat'l Computer Symp., Taichung, Taiwan, Dec. 20-21, 2007, pp. 119-126.
[8] S. C. Hung, D. C. Wu and W. H. Tsai, "Data hiding in stained glass images," Proc. 2005 Int'l Symp. on Intelligent Signal Processing & Communications Systems, Hong Kong, June 2005, pp. 129-132.
[9] C. Y. Hsu and W. H. Tsai, "Creation of a new type of image - circular dotted image - for data hiding by a dot overlapping scheme," Proc. 2006 Conf. on Computer Vision, Graphics & Image Processing, Taoyuan, Taiwan, Aug. 13-15, 2006.
[10] C. P. Chang and W. H. Tsai, "Creation of a new type of art image - tetromino-based mosaic image - and protection of its copyright by losslessly-removable visible watermarking," Proc. 2009 Nat'l Computer Symp., Taipei, Taiwan, Nov. 27-28, 2009, pp. 577-586.
[11] D. Coltuc and J.-M. Chassery, "Very fast watermarking by reversible contrast mapping," IEEE Signal Processing Letters, vol. 14, no. 4, pp. 255-258, April 2007.
[12] J. R. Smith and S. F. Chang, "Tools and techniques for color image retrieval," Proc. Society for Imaging Science & Technology / SPIE (IS&T/SPIE), vol. 2670, Feb. 1995, pp. 2-7.
A Practical Design of High-Volume Steganography in Digital Videos


                                    Ming-Tse Lu, Po-Chyi Su and Ying-Chang Wu
                                Dept. of Computer Science and Information Engineering
                                             National Central University
                                                   Jhongli, Taiwan
                                           Email: pochyisu@csie.ncu.edu.tw


Abstract—In this research, we exploit the large volume of audio/video data streams in compressed video clips/files for effective steganography. Observing that most widely distributed video files employ H.264/AVC and MPEG AAC for video/audio compression, we examine the coding features in these data streams to determine good choices of data modifications for reliable and acceptable information hiding, in which the perceptual quality, compressed bit-stream length, payload of embedding, effectiveness of extraction and efficiency of execution are taken into account. Experimental results demonstrate that the payload of the selected features, chosen to achieve a good balance among several constraints, can be more than 10% of the compressed video file size.

Keywords-Steganography; H.264/AVC; MPEG; AAC; information hiding;

I. INTRODUCTION

Digital videos are widely available nowadays thanks to the fast advances of increasingly cheaper yet powerful computer facilities and broadband Internet technologies. It is now possible to stream high-quality videos on the Internet, and web sites such as YouTube, Yahoo! Video and DailyMotion offer free video viewing, sharing or downloading services. Watching videos anytime and anywhere is becoming a daily activity as portable devices grow more and more popular. As a result, digital videos are ubiquitous and will be the major circulated multimedia content. Due to the large volume of digital videos, data compression is usually applied to facilitate their transmission and storage. Since human perceptual models are not perfect, lossy compression is usually preferred to increase the coding efficiency of digital videos without affecting human perception. In other words, there exists certain redundancy in digital video files. Nevertheless, from the viewpoint of communication, this redundancy can serve as an "invisible" channel and, if one can make good use of it, high-volume secret communication using digital videos as a camouflage is achievable. Such secret communication is also termed "steganography", which means "covered writing", and can be applied to transmit sensitive information between trusted parties or when encryption is not allowed or not safe in the normal communication channel. There are a few requirements in steganography, including a high payload of hidden information, unobtrusiveness of the distortion, security and reliability. To achieve secure, high-volume and reliable covert communication, digital videos can serve as a good host, especially since these files are available to most people and their transmission is increasingly popular. This research aims at developing a steganographic scheme for popular digital video files.

H.264/AVC is the state-of-the-art video codec, and its decent coding performance has made it the major coding mechanism in various applications. The most popular digital video formats/containers for file sharing nowadays, including FLV (Flash Video), MKV (Matroska Multimedia Container), AVI (Audio Video Interleave), MP4, etc., support H.264/AVC, so we choose H.264/AVC as the host. In addition, since FLV has become very popular in file sharing these days, we wrap the resulting H.264/AVC video bit-stream into an FLV file for future usage. As FLV files contain both video and audio data streams for playback, we will make use of both video and audio data streams to embed as much secret information as possible. The chosen audio format is MPEG AAC, which is usually adopted by FLV files.

Two embedding scenarios may be considered in this application. First, a user may acquire a compressed video file that is not coded by H.264/AVC, e.g. an MPEG-2 or MPEG-4 related file. In order to embed the information, this video file will be transcoded into an H.264/AVC bit-stream so that the secret information can be embedded during the encoding process. The resultant H.264/AVC bit-stream will then become the video stream of an FLV file. If the input video is already an FLV file compressed by H.264/AVC, the information may be embedded more efficiently since the existing coding parameters in the original video file can be referenced. It should be noted that both embedding procedures are carried out in the encoding phase, so transcoding is always needed. To achieve high-volume information hiding and to retain the fidelity of the audio/visual data, we investigate combinations of embedding methods to satisfy most of the requirements or restrictions. The paper is organized as follows. Some previous works are described in Sec. II and our proposed scheme is delineated in Sec. III. Experimental results are shown in Sec. IV to validate the trade-offs that we make among several different requirements. The conclusive remarks are given in Sec. V.

II. REVIEW OF THE RELATED WORKS

Unlike digital watermarking, in which the embedded information should be able to withstand some common processing attacks such as re-compression at a different bit-rate, random video frame dropping, resizing, etc.,
the high-volume steganography emphasizes more on the                                       III. T HE P ROPOSED S CHEME
payload, reliability and the difficulty of detection even              A. System Overview
with steganalysis [1], which is a process for revealing the
                                                                         The block diagram of our proposed scheme is shown in
existence of certain hidden information in a suspicious
                                                                      Fig. 1. A video file is parsed first to extract the video and
video. Of course, the quality should always be main-
                                                                      audio data streams. As the transcoding will be applied, the
tained to avoid affecting its applications. Some data hiding
                                                                      video and audio decoder will extract the compressed bit-
schemes in digital videos [2]–[7] have been proposed. As
                                                                      streams into raw data. After obtaining the reconstructed
the video file consists of a large number of frames, the
                                                                      video and audio signals, H.264/AVC and AAC encoders
similar data hiding techniques of still images may also
                                                                      will encode the raw data right after they are extracted and
be applied on videos. The most widely used technique
                                                                      the embedding procedure is triggered. If the input video
to hide the data in digital images or video data is the
                                                                      file is processed by H.264/AVC already, a mode copying
usage of the Least-Significant Bit (LSB) modification [8],
                                                                      procedure that records features of the original video stream
[9], in which the LSB’s of samples, usually coefficients or
                                                                      may be applied to speed up the whole process. The hidden
quantization indices if the compressed data are used, are
                                                                      information can be extracted efficiently in the decoder.
substituted by the secret message bits. JStego, F3, F4 and
F5 described in [8] are popular approaches. In the original                                            ¤£¢¡ 
JStego algorithm, the LSB’s of JPEG residual coefficients                                           ©¨§¦¥             ¢ ¨ ¨

are overwritten with the binary secret message consisting                                           ¨¡¦


of “0” and “1”. JStego skips the embedding operation
                                                                                                                                        CD
                                                                                           32©1¨§©0                                    EFG

when it encounters 0 and ±1 to avoid generating zero                                                           ¢ ©¨§¦¥
                                                                                                                                    P
                                                                                                                                        HI


values, which will cause ambiguity to the hidden infor-
mation extraction. Other values are grouped into pairs, i.e.
                                                                                 ¨$©(                 ©¨§¦¥ !        ©¦§'% %%
                                                                                £¨$'¢¨                $¨§©#¨           $¨§©#¨
(±2, ±3), (±4, ±5)... In F3 algorithm, the LSB of non-                         ©¦¢)$©¦
                                                                                                                                                   ¦§§¨A)4


zero coefficients will be matched with the secret message
after the information embedding, which decreases the                                                   ©¨§¦¥ !        ©¦§'% %%
absolute values of coefficients. If the coefficient becomes                           ¦££%1¨§©0          $¨§©#4          $¨§©#4                   ¨$#¨(

zero after this modification operation, we will embed                                                                                               ¨¢££¨0


this bit once again in the next sample. F4 algorithm is                                                        ©¨§¦¥ ¤£¢¡ 
developed to complement the weakness of F3 algorithm. In                                                         $¨'0
                                                                                 986765
F4 algorithm, a negative coefficient is presented inversely.                    @ ¡©$©

In F5 algorithm, permutating straddling process is adopted                     @    ¢¢                                                      RQ
                                                                                                               B¨¡¦ © §4

to improve the perceptive characteristic and enhance the
security level. In addition, the so-called matrix coding is
                                                                               Figure 1.        The flowchart of the proposed scheme
applied to avoid modifying too many data samples.
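To make the baseline concrete, the following minimal Python sketch (our own illustration, not code taken from [8] or [9]) overwrites the LSB of the magnitude of each non-zero sample with a message bit; it also shows why plain substitution is problematic when a magnitude of 1 is driven to 0, which is exactly the ambiguity that JStego sidesteps by skipping 0 and ±1.

def lsb_substitute(coefficients, message_bits):
    """Baseline LSB substitution on quantized coefficients (illustrative only)."""
    bits = iter(message_bits)
    stego = list(coefficients)
    for i, c in enumerate(stego):
        if c == 0:
            continue                      # zero coefficients are not used as carriers
        b = next(bits, None)
        if b is None:
            break                         # message exhausted
        sign = 1 if c > 0 else -1
        mag = (abs(c) & ~1) | b           # force the LSB of the magnitude to the bit
        # note: abs(c) == 1 and b == 0 yields 0 here; this is exactly the
        # ambiguity that JStego avoids by skipping 0 and +/-1 altogether
        stego[i] = sign * mag
    return stego

print(lsb_substitute([5, -3, 0, 2], [0, 1, 1]))   # -> [4, -3, 0, 3]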
In addition to modifying the coefficients, some researchers employ the characteristics of popular video compression standards. Fang and Chang [4] proposed to embed the data into the phase angles of the motion vectors in inter frames. Wang et al. [10] utilized the motion vectors in P- and B-pictures as the data carriers for hiding copyright information: a motion vector is selected based on its magnitude, and its angle guides the modification operation. Yang et al. [11] employed the intra-prediction modes and matrix coding; they map every two secret message bits onto three intra 4×4 blocks via matrix coding. Kim et al. [1] proposed an entropy-coding-based watermarking algorithm to balance the capacity of the watermark bits against the fidelity of the video: one bit of information is embedded in the sign bit of the trailing ones in the context-adaptive variable length coding (CAVLC) of the H.264/AVC stream. The transcoding process may thus be avoided, but drift errors resulting from the different reference frame content may appear.

In this paper, we try to make good use of the coding features in the video and audio data streams to maximize the capacity of the embedded data. Our work can be applied to any container format that can de-multiplex an H.264/AVC video stream and an AAC audio stream.

III. THE PROPOSED SCHEME

A. System Overview

The block diagram of our proposed scheme is shown in Fig. 1. A video file is parsed first to extract the video and audio data streams. As transcoding will be applied, the video and audio decoders decode the compressed bit-streams into raw data. After the reconstructed video and audio signals are obtained, the H.264/AVC and AAC encoders encode the raw data, and the embedding procedure is triggered during this encoding. If the input video file is already coded with H.264/AVC, a mode-copying procedure that records features of the original video stream may be applied to speed up the whole process. The hidden information can be extracted efficiently in the decoder.

Figure 1. The flowchart of the proposed scheme.
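All of the embedding hooks described below, in both the H.264/AVC and the AAC encoder, consume the secret message one bit at a time (the GetNextBit() of Algorithm 1 and of the mode and MVD embedding). A minimal, self-contained sketch of such a shared bit supply is given below; the class name and methods are hypothetical illustrations of ours, not part of the actual implementation.

class BitSource:
    """Serves the secret payload bit by bit to the video and audio embedders."""
    def __init__(self, payload: bytes):
        # LSB-first expansion of the payload into individual bits
        self._bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
        self._pos = 0

    def next_bit(self):
        """Return the next bit, or None once the message is exhausted."""
        if self._pos >= len(self._bits):
            return None
        bit = self._bits[self._pos]
        self._pos += 1
        return bit

    def bits_left(self):
        return len(self._bits) - self._pos

# one source can be shared by the H.264/AVC and AAC embedding hooks
source = BitSource(b"secret")
print(source.next_bit(), source.bits_left())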
B. Steganography in H.264/AVC

H.264/AVC offers much better compression performance than earlier video codecs due to its various encoding tools. Like previous video coding standards, H.264/AVC is based on motion-compensated, DCT-like transform coding. Each picture is compressed by partitioning it into one or more slices; each slice consists of macroblocks, which are blocks of 16 × 16 luma samples with the corresponding chroma samples. Each macroblock may also be divided into sub-macroblock partitions for motion prediction. The prediction partitions can have seven different sizes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4. The large variety of partition shapes and the quarter-sample motion compensation provide enhanced prediction accuracy. In intra-coded slices, 4 × 4 or 16 × 16 intra spatial prediction based on neighboring decoded pixels in the same slice is applied. The 4 × 4 spatial transform, which is an approximate DCT and can be implemented with integer operations using a few additions and shifts, is calculated for the residual data. The point-by-point multiplication of the transform step is combined with the quantization step and implemented by simple shifting operations for efficiency. CAVLC or CABAC is used for lossless coding. Our video embedding scheme is integrated into the H.264/AVC encoding process: the quantization, intra prediction and motion estimation procedures in the encoder are modified.

1) Employing Intra Prediction Modes: In H.264/AVC, the intra prediction of the luma and chroma of a frame is quite important for reducing the coding redundancy, since a coding block is usually related to its neighbors. Four 16×16 or nine 4×4 intra prediction modes can be applied to the luma, while four 8 × 8 prediction modes are available for the chroma. Fig. 2(a) shows the 4×4 intra prediction. Since the samples above and to the left (labeled A to M) of the current block have been encoded and reconstructed previously and are available to both the encoder and the decoder, nine prediction modes, i.e., eight directional modes and one DC prediction, can be calculated. It should be noted that, if the neighboring upper or left block of the current block is not available, the number of usable modes is reduced. For instance, if the upper block is available but the left block is not, only the "horizontal", "DC" and "horizontal up" modes can be chosen.

Figure 2. (a) Labeling of prediction samples (4×4) and (b) the directions of 4×4 intra prediction modes.

Our proposed scheme only utilizes the nine 4 × 4 intra prediction modes, since the content of such blocks is usually complicated and the blocks are therefore suitable for information hiding. Compared with the 16 × 16 luma and 8 × 8 chroma prediction modes, the 4 × 4 intra prediction modes offer finer prediction, so modifying them affects the coding performance less. To embed the information, one may think of grouping the nine modes into pairs so that one bit can be cast into each 4×4 subblock. However, the resulting bit-stream length would be considerably increased. Take the "Container" video compressed with a fixed Quantization Parameter (QP) of 30 as an example: by doing so, the payload reaches 2.07% of the total bit-stream size, but the bit-stream is inappropriately enlarged by 6.72%. The reason is that the correlation between the intra prediction modes of adjacent blocks is not taken into account. In H.264/AVC, the mode of the current block is first predicted as the minimum of the prediction modes of its two neighbors, i.e., the upper and left blocks. If the actual mode matches the predicted one, only one flag bit, the "Most Probable Mode" (MPM) flag, is asserted and sent. Otherwise, this flag bit is set to "0" and three extra bits are sent to signal which of the remaining eight modes is used. We only modify the modes when the flag bit is "0", since such a block tends to differ more from its neighbors and is thus suitable for embedding. Besides, modes coded via the MPM also appear in normal videos, so we have to keep this situation in our "stego" video. Our scheme divides the eight remaining modes into two groups to represent the binary secret information; the division or classification follows Fig. 2. If the DC mode is not the MPM, we replace the direction of the MPM by the DC mode and then assign "0" and "1" to the prediction directions, which are known to both the embedder and the detector. Rate-Distortion Optimization (RDO) is then employed to determine a better prediction mode. Although the payload becomes 0.79% of the compressed bit-stream size, the increment of the file size is less than 3%. If the input video is already an H.264/AVC video stream, we may reference the prediction modes in the original video and determine the pairs of modes representing "0" and "1" directly. If the execution-time constraint is not that strict, we still suggest applying RDO to find a better mode, since it is not easy to predict a good selection based only on the mode in the incoming/original video. Besides, the computational load is not increased much by RDO, since we only have four candidate selections to embed one bit.
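The following sketch illustrates how such a grouping could be realized; it is our own simplified reading of the description above, with a hypothetical partition of the eight directional modes and a toy stand-in for the rate-distortion cost, not the exact tables used in the implementation.

DC = 2  # mode 2 is the DC prediction among the nine 4x4 intra modes

def candidate_modes(mpm, secret_bit, groups):
    """Four candidate modes that represent secret_bit. If DC is not the MPM,
    DC stands in for the MPM's direction, so the MPM itself is never chosen
    (choosing it would assert the MPM flag and hide the 3-bit mode field)."""
    group = list(groups[secret_bit])
    if mpm != DC:
        group = [DC if m == mpm else m for m in group]
    return group

def embed_bit_in_block(block, mpm, secret_bit, groups, rd_cost):
    """Pick the candidate with the lowest rate-distortion cost (the RDO step)."""
    return min(candidate_modes(mpm, secret_bit, groups), key=lambda m: rd_cost(block, m))

def extract_bit(coded_mode, mpm, groups):
    """Detector side: a coded DC mode stands for the MPM's original direction."""
    m = mpm if coded_mode == DC else coded_mode
    return 0 if m in groups[0] else 1

groups = ([0, 3, 5, 7], [1, 4, 6, 8])        # hypothetical fixed partition
toy_cost = lambda blk, m: (m * 7 + 3) % 11   # stand-in for the real RD cost
mode = embed_bit_in_block(None, mpm=3, secret_bit=0, groups=groups, rd_cost=toy_cost)
print(mode, extract_bit(mode, mpm=3, groups=groups))   # recovered bit: 0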
2) Employing Inter Prediction: The inter prediction provides a reference from one or more previously encoded video frames for effective encoding. In order to acquire precise motion vectors, H.264/AVC adopts quarter-pixel precision for motion compensation; the last two bits of a motion vector therefore represent its fractional (sub-pixel) position. We basically make use of the last bit of the motion vectors for effective information hiding without severely affecting the coding performance. Since transcoding is applied, the Sum of Absolute Differences (SAD) of the candidate motion vectors whose Least Significant Bit (LSB) equals the hidden bit is available, so we can examine these candidates to find good motion vectors. Again, as with the intra prediction, the motion vectors of neighboring partitions are often highly correlated. After determining the motion vectors by motion estimation, H.264/AVC first predicts the motion vector from the nearby, previously coded partitions. After obtaining a predicted motion vector MV_predicted, the difference between the current motion vector MV_current and MV_predicted is calculated and encoded. This motion vector difference is termed MVD and is formed in the same way at the decoder. In our scheme, we actually modify the MVD data instead of the motion vectors themselves, so that the detector can extract the hidden information efficiently. Furthermore, we skip the partitions whose MVD equals 0 and avoid generating new zero MVD's. This strategy limits the file-size increment and keeps the statistics of the motion vectors looking normal. If the only motion vector with a reasonably small SAD is one whose MVD becomes zero after the information embedding, we choose this motion vector and embed the bit once again in the following partition.
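A minimal sketch of this selection, under the simplifying assumption that one bit is carried in the horizontal MVD component only (the exact component handling is not spelled out above), could look as follows; the candidate motion vectors and their SADs are assumed to come from the motion estimation.

def embed_bit_in_mvd(candidates, mv_predicted, secret_bit):
    """Choose, among the motion-estimation candidates, the vector with the
    smallest SAD whose (nonzero) horizontal MVD already carries secret_bit
    in its LSB; returns None if the bit must be retried in the next partition."""
    best = None
    for (mvx, mvy), sad in candidates:
        mvd = mvx - mv_predicted[0]           # horizontal MVD, quarter-pel units
        if mvd == 0:
            continue                          # zero MVDs never act as carriers
        if (mvd & 1) == secret_bit and (best is None or sad < best[1]):
            best = ((mvx, mvy), sad)
    return best

def extract_bit_from_mvd(mvd):
    """Detector side: nonzero MVDs carry the hidden bit in their LSB."""
    return None if mvd == 0 else (mvd & 1)

cands = [((6, 0), 120), ((7, 0), 118), ((9, 1), 150)]    # ((mvx, mvy), SAD)
sel = embed_bit_in_mvd(cands, mv_predicted=(4, 0), secret_bit=1)
print(sel, extract_bit_from_mvd(sel[0][0] - 4))          # recovered bit: 1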
It should be noted that the effect of the MVD embedding is more obvious if a fixed QP is used to compress a video. We illustrate this by compressing "Container" with QP equal to 30. Fig. 3 shows the proportion that each coding feature occupies in the compressed bit-stream. Fig. 3(a) shows that the proportion of luma components from inter blocks is 39% when nothing is embedded, while this proportion grows to 49% after the MVD embedding, as shown in Fig. 3(b). That is, a large increment appears in the residuals of the inter blocks. If the bit-stream size increment has to be strictly limited, we should avoid embedding information in the motion vectors. However, the increased residuals may be helpful for the information embedding in the quantization indices, which is discussed later.

Figure 3. Pie chart of "Container" with (a) nothing being embedded and (b) MVD being modified.

Here, we test the videos "Garden" and "Container", coded with a target bit-rate of 2 Mbps, to explain our strategy. Table I compares modifying the motion vectors (MV) directly, without considering the MVD, against the MVD embedding; the payload is given in bits per frame (bpf). In our view, the MVD embedding is still the better choice, as the video quality is less affected, although the payload decreases due to the skipping of the MVD's equal to zero, especially in a static video such as "Container".

Table I
THE PERFORMANCE COMPARISON OF MV AND MVD EMBEDDING (BIT-RATE: 2 MBPS)

  Video Name  |  MV: PSNR (dB)  |  MV: Payload (bpf)  |  MVD: PSNR (dB)  |  MVD: Payload (bpf)
  Garden      |      29.78      |        8461         |      30.56       |        5148
  Container   |      40.70      |       29372         |      43.41       |        8024

3) Quantized Coefficients Embedding: After the intra and inter predictions and the compensation, the prediction residuals are generated and occupy a large portion of the video stream. It is advantageous to utilize these residuals to achieve high-volume information hiding. In our scheme, both the luma and chroma residuals are embedded. As mentioned before, popular methodologies to achieve high-volume steganography in samples without affecting the perceptual quality of videos include the JStego, F3, F4 and F5 algorithms. In our scheme, the F4 algorithm is adopted to achieve effective information hiding in the residuals. Algorithm 1 shows the pseudo code of the F4 embedding loop. For each non-zero AC coefficient coe, if it is a positive number and its LSB is not equal to BitToEmbed, its absolute value is decreased. On the other hand, for a negative coefficient whose LSB is equal to BitToEmbed, the modification also has to be applied and the coefficient is increased by 1, so that the LSB of this negative number equals the inverted target bit. After the embedding operation, it is required to check whether the index has become 0; if so, the coefficient would be skipped by the decoder, and the bit has to be embedded again.

Algorithm 1 F4 Algorithm
Input: BitToEmbed ∈ {0, 1}
 1: for all AC values coe in a block after quantization do
 2:   if coe > 0 ∧ LSB(coe) ≠ BitToEmbed then   /* positive number */
 3:     coe ← coe − 1
 4:   else if coe < 0 ∧ LSB(coe) = BitToEmbed then   /* negative number */
 5:     coe ← coe + 1
 6:   else if coe = 0 then   /* skip zero values */
 7:     continue
 8:   end if
 9:   if coe ≠ 0 then   /* successfully embedded */
10:     BitToEmbed ← GetNextBit()
11:   end if
12: end for
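For reference, a direct Python rendering of this embedding loop, together with the matching extraction, is given below; it is an illustrative sketch with names of our own choosing rather than the encoder-integrated implementation.

def f4_embed(coefficients, bits):
    """F4 embedding over one block of quantized AC coefficients (Algorithm 1)."""
    out = list(coefficients)
    bits = iter(bits)
    bit = next(bits, None)
    for i, coe in enumerate(out):
        if bit is None:
            break                          # message exhausted
        if coe == 0:
            continue                       # zeros carry no information
        if coe > 0 and (coe & 1) != bit:
            coe -= 1                       # shrink magnitude so the LSB matches the bit
        elif coe < 0 and (coe & 1) == bit:
            coe += 1                       # shrink magnitude so the inverted LSB matches
        out[i] = coe
        if coe != 0:
            bit = next(bits, None)         # successfully embedded, fetch the next bit
        # if coe became 0 the detector will skip it, so the same bit is retried
    return out

def f4_extract(coefficients):
    """Detector: positive coefficients carry their LSB, negative ones the inverse."""
    return [coe & 1 if coe > 0 else 1 - (coe & 1) for coe in coefficients if coe != 0]

print(f4_extract(f4_embed([3, -2, 1, 0, -5], [1, 1, 0])))   # -> [1, 1, 0]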
The reason for choosing F4 is as follows. It has been reported that the existence of hidden information embedded with JStego or F3 can be revealed by checking the statistics of the samples, since these methods change the histogram of the coefficient values. Besides, the message induced by an unchanged carrier may contain more steganographic ones than zeros due to the prevalence of ±1 coefficients, and we had better keep this natural situation. F5 is assumed to be a better approach because it uses matrix coding, so that less data need to be modified. However, as we would like to embed the information during the encoding process for efficiency, we have to finish coding the data in one subblock before we proceed to encode the next subblock. In matrix encoding, we need to collect 2^m − 1 samples to embed m bits by changing only one sample. Since the prediction mechanism of H.264 performs well, many zero indices exist, and several subblocks may be required to collect 2^m − 1 nonzero samples for the information hiding. As efficient modification is one of our major objectives, F5 is not suitable. In F4, if a higher degree of safety is required, some coefficients may be skipped during embedding, provided that both the embedder and the detector know the rule. As described earlier, the information hiding by F4 has a positive side effect on the video coding, since the magnitudes of the resulting coefficients tend to become smaller. When a fixed QP is used to encode a video, the video size may even be reduced after the information embedding, and this may offset the negative effects of the information hiding in the intra and inter predictions.
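To make the matrix-coding argument concrete, the smallest instance (m = 2, i.e., two message bits hidden in 2^2 − 1 = 3 carrier LSBs with at most one change) can be sketched as follows; this is only a worked example of the idea that makes F5 efficient in principle, not part of our scheme, which deliberately avoids F5.

def matrix_encode(lsbs, message_bits):
    """Hide two bits (x1, x2) in three carrier LSBs by flipping at most one."""
    a1, a2, a3 = lsbs
    x1, x2 = message_bits
    s1 = (a1 ^ a3) ^ x1                      # syndrome: which message bit disagrees
    s2 = (a2 ^ a3) ^ x2
    flip = {(0, 0): None, (1, 0): 0, (0, 1): 1, (1, 1): 2}[(s1, s2)]
    out = list(lsbs)
    if flip is not None:
        out[flip] ^= 1                       # a single change fixes both bits
    return out

def matrix_decode(lsbs):
    a1, a2, a3 = lsbs
    return (a1 ^ a3, a2 ^ a3)                # the receiver recomputes the two bits

print(matrix_decode(matrix_encode([0, 0, 0], (1, 1))))   # -> (1, 1), one LSB flipped

The drawback noted above is visible even in this toy case: three nonzero carriers must be available before anything can be embedded, which conflicts with embedding on the fly while each subblock is being coded.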
It should be noted that, when the rate control mechanism is enabled, the QP is adjusted along with the encoding process, and the F4 algorithm may help to save some bits in the current frame so that a smaller QP can be assigned to the following frames. In addition, if the bit-stream length is not the major concern, we may try to generate more nonzero indices. As some indices may be quantized to zero because a large QP is used, we can rescue the coefficients that only barely fail to survive by using a smaller QP. For example, if QP = 28 is adopted in a block, we may try the smaller QP = 27 to see whether some zero coefficients/indices survive under this smaller QP. If so, such an index can be changed to ±1 so that more nonzero indices are available for embedding.

4) Mode-Copy Procedure: The embedding methods described above are based on a transcoding process. If the input video is already an H.264/AVC bit-stream, we may record the coding modes during the decoding process, so that the time-consuming mode decision can be made efficient by referencing the modes of the input H.264 video, as long as the settings of the video, including the GOP structure, the bit-rate, etc., are the same. We therefore implement a mode-copy procedure to skip some time-consuming mode decision steps in the encoding process. In our implementation, the recorded coding information consists of the frame type, the macroblock type, the intra- and inter-prediction modes, and the motion vectors in quarter-pixel units. After decoding a frame, the video encoder assigns those features directly to speed up the whole transcoding process. We compare the typical transcoding and the mode-copy encoding in Table II, where frames per second (FPS) is used as the performance measure. No information embedding is applied in either case. It can be seen that the mode-copy encoding is competitive with the typical transcoding.

Table II
THE COMPARISON OF THE TYPICAL TRANSCODING AND MODE-COPY ENCODING (BIT-RATE: 500 KBPS)

  Video Name  |  Typical transcoding: PSNR (dB) / FPS  |  Mode-Copy: PSNR (dB) / FPS
  Garden      |            23.62 / 14.54               |         23.53 / 21.15
  Container   |            37.77 / 11.99               |         37.51 / 18.42

For the information embedding, we may use both the intra- and inter-prediction modes of the input video as references and try to modify the modes directly without using RDO. If speed is the major concern, this approach is feasible. For the information hiding in the intra-prediction modes, we can replace the prediction direction with DC (if the DC mode is not the MPM) and then group the modes into known pairs according to the prediction directions. For the information hiding in the MVD, we may simply change the bits according to the incoming motion vectors, or use a refined method that calculates the SAD's of the adjacent locations to pick a better motion vector. However, in our opinion, if the coding performance is the major concern, we may prefer to run RDO for the intra mode modification, as described before, and omit the motion vector embedding.
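As an illustration of the side information involved, a per-macroblock record for the mode-copy procedure might look like the following; the field list follows the description above, while the class itself and its use are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MacroblockRecord:
    """Decisions recorded while decoding the input H.264/AVC stream and handed
    back to the encoder so that the corresponding mode decision can be skipped."""
    frame_type: str                               # 'I', 'P' or 'B'
    mb_type: int                                  # macroblock / partition type
    intra_modes: List[int] = field(default_factory=list)               # 4x4 intra modes, if intra
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)  # quarter-pel MVs

def copy_modes(records, apply_to_encoder):
    """Reuse the recorded decisions instead of searching; only sensible when the
    GOP structure, bit-rate and other settings match the input stream."""
    for rec in records:
        apply_to_encoder(rec)

copy_modes([MacroblockRecord('P', 1, motion_vectors=[(4, 0)])], print)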

C. Information Hiding in Advanced Audio Coding

Advanced Audio Coding (AAC) is a standardized compression scheme for digital audio, designed as the successor of the MP3 format. AAC makes use of many advanced coding techniques available at the time of its development to provide high-quality multi-channel audio, and it has become the kernel algorithm of audio compression standards. At the beginning of encoding, a filter bank is employed to transform the time-domain signal into the frequency domain. The time-frequency conversion is followed by a series of prediction mechanisms; those stages attempt to further reduce the redundancy with respect to previously encoded signals or the joint stereo channel. After the predictions, an iteration loop is applied to quantize the spectral coefficients. The scalefactors of the subbands are obtained and applied to all of the coefficients in the corresponding scalefactor bands, and the number of required bits and the related information are determined to control the trade-off between the audio distortion and the bits spent. Huffman coding follows, according to the 12 pre-defined Huffman tables. Since the scalefactors and the spectral coefficients occupy a significant part of the coded audio stream, we make use of both to embed the information.

1) Embedding in the Scale Factors: Scalefactors have been used for effective information embedding before [12]. In our implementation, scalefactors equal to zero are skipped by the embedding scheme, and the scalefactor bands that use the pseudo codebooks of intensity stereo are also skipped. For each remaining nonzero scalefactor, a secret message bit is embedded in its LSB. The payload of the scalefactor embedding is shown in Table III, in which two audio clips with different characteristics are employed and two target bit-rates are used, i.e., High: 264 Kbps and Low: 132 Kbps. We can see that the payload of the scalefactor embedding reaches around 1 to 3% of the audio stream size.

Table III
THE PAYLOAD OF INFORMATION EMBEDDING IN AUDIO

   Music    |  Scalefactor: High / Low  |  Quantization index: High / Low
  A (5:06)  |      1.55% / 2.86%        |        7.10% / 6.79%
  B (3:51)  |      1.26% / 2.52%        |       11.90% / 7.72%
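A minimal sketch of this scalefactor embedding is shown below; the skip_band flag marking intensity-stereo bands is an assumed helper provided by the encoder, and the handling of a scalefactor that would be driven to zero is deliberately left open, as it is not detailed above.

def embed_in_scalefactors(scalefactors, bits, skip_band):
    """Write one secret bit into the LSB of every usable (nonzero, non-skipped)
    scalefactor; zero scalefactors and intensity-stereo bands are left alone."""
    out = list(scalefactors)
    bit_iter = iter(bits)
    for i, sf in enumerate(out):
        if sf == 0 or skip_band[i]:
            continue
        b = next(bit_iter, None)
        if b is None:
            break                       # message exhausted
        # caution: sf == 1 with b == 0 would yield a zero scalefactor, which the
        # detector would then skip; a real implementation needs an extra guard here
        out[i] = (sf & ~1) | b
    return out

print(embed_in_scalefactors([60, 0, 57, 58], [1, 0, 1], [False, False, False, True]))
# -> [61, 0, 56, 58]; the last band is skipped as an intensity-stereo band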
2) Embedding in the Quantization Indices: In order to maximize the payload in the audio stream, the spectral coefficients after quantization are also employed; again, we apply the F4 algorithm to these quantization indices. Table III also shows the payload of the quantization-index embedding; the average payload is around 6 to 12% of the embedded audio file size. It can be seen that "Music B" yields a larger payload than "Music A" because Music B contains more transient signals and therefore more non-zero coefficients.
IV. EXPERIMENTAL RESULTS

Our results are presented in two parts, i.e., the information embedding in the video stream and in the audio stream. In the video embedding part, we evaluate the performance on various videos, first using a fixed QP value and then enabling the rate control mechanism. In both cases, the payload, the fidelity and the increment of the bit-stream size are the major concerns. Six common test videos, namely "Container", "Hall Monitor", "Foreman", "Football", "Garden" and "Mobile", are utilized to verify the proposed video embedding method. The details of the test videos are given in Table IV. The proposed scheme is integrated with Intel Integrated Performance Primitives (IPP) version 5.2, a highly optimized run-time library supporting fast H.264/AVC coding.

Table IV
DETAILS OF TEST VIDEOS

      Name      | Num. of frames | Frame resolution | No. of MB per frame
  Container     |      300       |     352×288      |        396
  Football      |      125       |     352×240      |        330
  Foreman       |      300       |     352×288      |        396
  Garden        |      115       |     352×240      |        330
  Hall Monitor  |      300       |     352×288      |        396
  Mobile        |      140       |     352×240      |        330

First, we set a fixed QP value of 30 for all frames of the test videos to observe the effects of the different embedding methods. When a fixed QP is employed, the trade-off between the hidden-information payload and the increment of the bit-stream size is the major consideration. For each video, we record the payload in bits per frame; the payloads and the resulting bit-stream increments of the various embedding methods are listed in Table V. We find that the "Football" video provides the largest payload for the modification of the quantization indices of intra blocks and for the intra mode prediction (IMP), since the high motion in this video leads to more intra blocks. For the quantization-index embedding in inter-predicted blocks, the payload of the "Garden" video is even higher than that of "Football", because it not only has high variation within each frame but also high similarity among frames, so more inter blocks exist. In the other embedding modes, high-motion videos always yield higher payloads.

Table V
THE AVERAGE PAYLOAD (BPF) AND THE CORRESPONDING BIT-STREAM SIZE INCREMENT (%) OF EACH EMBEDDING METHOD

                 | Intra4x4,16x16  |      Inter      |      IMP       |      MVD       |   All (MVD)
  File Name      | Payload |  Size | Payload |  Size | Payload | Size | Payload | Size | Payload | Size
  Container      |   409   |  0.14 |   339   | -7.30 |    81   | 2.96 |   128   | 25.89|  1191   |  7.49
  Hall Monitor   |   262   | -3.38 |   341   | -8.77 |    95   | 4.13 |    60   |  6.14|   880   | -5.16
  Foreman        |   349   | -1.41 |   586   | -7.60 |   246   | 4.44 |   482   | 12.72|  1887   |  3.44
  Football       |  1768   | -4.34 |  2411   | -8.15 |   615   | 3.00 |   407   |  5.59|  6095   | -8.38
  Garden         |  1460   |  0.70 |  5039   | -8.81 |   206   | 0.81 |   440   | 19.37|  9164   |  2.03
  Mobile         |  1592   |  2.04 |  4325   | -9.24 |   222   | 1.19 |   470   | 30.94|  8996   |  1.99

Using a fixed QP value is only an experiment to observe the trade-off among the various embedding modes. We should enable the rate control to simulate the scenario of real applications. Under a given target bit-rate, the issue we discuss is the trade-off between the payload and the fidelity of the video. Unlike the fixed-QP mode, we combine all the embedding modes to directly observe the trade-off between the payload and the PSNR under the same target bit-rate, as shown in Fig. 6, in which four videos are tested and their payloads are drawn as the solid lines. We can see that "Garden" achieves the best payload performance at all bit-rates. Again, this demonstrates that high-motion videos usually perform better. In fact, under various bit-rates, the payloads of IMP and MVD embedding tend to be independent of the target bit-rate, since they are usually more related to the frame size.

Figure 4. The payload of information embedding in videos under various bit-rates.

Figure 5. PSNR of embedded videos and transcoded videos under various bit-rates.

Then, we consider the fidelity of the embedded videos under various target bit-rates. We present the PSNR values of the embedded videos and of the merely transcoded videos; the fidelity decreases of four videos are shown in Fig. 5. We can see that the fidelity of a transcoded high-motion video is not as high as that of the more static videos under the same bit-rate, because high-motion videos produce a significant amount of inter-block residuals to compensate for the large variation between video frames. It can also be observed that the PSNR of high-motion videos such as "Garden" drops considerably after the information embedding. The reason could be that modifying the motion vectors of high-motion videos causes serious inaccuracies in the motion compensation. At lower bit-rates, the difference between the transcoded and the embedded video is not as large as at higher bit-rates. Despite this, our scheme still achieves a payload of around 10% of the embedded video size on average, as shown in Fig. 4.

As mentioned in Section III-B4, a mode-copy procedure was introduced to speed up the encoding process. It should be noted that the mode-copy procedure is reasonable only when the rate-control mode is enabled. Our mode-copy procedure skips the most time-consuming stages of the video encoder, including the motion estimation, to increase the efficiency.
First, we record the execution time of the embedding process, with and without the mode-copy procedure. Table VI shows the ratio by which the execution time decreases. It can be observed that the efficiency is improved by more than 28% when the mode-copy procedure is employed, and the lower the bit-rate, the larger the improvement. Figure 6 shows the embedding payload when the mode-copy procedure is applied; note that the MVD embedding is disabled in this case. The dotted lines in Fig. 6 show the payload of the hidden information with mode-copy, and the solid lines come from the complete transcoding scheme. "Garden" still has the best embedding performance. Even with the MVD embedding disabled, the payload can still reach around 10% of the encoded video size on average.

Table VI
THE RATIO OF EXECUTION TIME DECREASE AFTER EMPLOYING THE MODE-COPY PROCEDURE

  Video Name    | 500 Kbps | 1 Mbps | 2 Mbps
  Container     |   53%    |  39%   |  30%
  Hall Monitor  |   48%    |  40%   |  35%
  Foreman       |   52%    |  44%   |  33%
  Football      |   35%    |  32%   |  30%
  Garden        |   55%    |  35%   |  31%
  Mobile        |   46%    |  35%   |  27%

Figure 6. The embedding payload under various bit-rates with the mode-copy procedure.

The mode-copy procedure skips many searching steps, so it may degrade the fidelity of the video frames. Figure 7 shows the PSNR values of four videos with and without the mode-copy procedure; the dotted lines represent the PSNR values with the mode-copy procedure. We can see that the fidelity of the videos is not affected much, and when a higher bit-rate is set, the mode-copy procedure even performs better.

Figure 7. The fidelity of video under various bit-rates, with and without the mode-copy procedure.

A. The Evaluation of Audio Embedding

For the information embedding in audio, we select several audio clips from the EBU SQAM (Sound Quality Assessment Material) CD, including "abba", "speech", "baird" and "bach". All the audio clips from the EBU SQAM CD are encoded in a lossless compression format (FLAC), and we transcode those clips into FLV as the input to our scheme. The audio encoder parameters are set to retain the fidelity of the audio as much as possible; we employ the "target quality mode" of the Nero AAC encoder to preserve the fidelity of the original clips. In addition, we also select two clips, "classic" and "electronic", from YouTube videos, one being classical music and the other a remixed pop track.

We first investigate the payload of embedding with both embedding modes enabled. The payload unit is again bits per frame (bpf), where a "frame" is the basic unit over which the audio sampling points are collected. Table VII shows the ratio of short-window appearance, the payloads of the scalefactor embedding (SF) and the quantization-index embedding (QC), and the ratio of the total payload to the encoded audio size. All the audio clips are encoded at 192 Kbps. The ratio of short-window appearance characterizes the audio content: a transient is a short-duration signal that contains a high degree of non-periodic components and a higher magnitude of high frequencies than the harmonic content of the sound. It can be seen that "abba", "speech" and "electronic"
have more transient signals. Be comparing the ratio of                                                [2] C. Xu, X. Ping, and T. Zhang, “Steganography in com-
short window appearance and payload, we can find that the                                                  pressed video stream,” 2006.
payload obtained by scalefactors embedding is irrelative to
                                                                                                      [3] S. Kapotas, E. Varsaki, and A. Skodras, “Data Hiding in H.
the ratio of short windows because each subband has only                                                  264 Encoded Video Sequences,” in IEEE 9th Workshop on
one scalefactor so it is not related to the bitrate or number                                             Multimedia Signal Processing, 2007. MMSP 2007, 2007,
of short windows. Therefore, the ratio of short windows                                                   pp. 373–376.
appearance is proportional to the payload obtained by the
embedding in the quantization indices.                                                                [4] D. Fang and L. Chang, “Data hiding for digital video
                                                                                                          with phase of motion vector,” in 2006 IEEE International
                                Table VII
                  THE PAYLOAD OF THE AUDIO EMBEDDING

               Audio       Short            Payload [bpf]
               name        window [%]     SF      QC     Ratio [%]
               abba          14.78        76      341      10.55
               speech        11.42        77      339      10.61
               baird          0.11        92      271       8.20
               bach           2.60        72      284       9.42
               classic        0.04        95      319       9.19
               electro.       5.28        79      496      12.80
   Fig. 8 shows the bit-rate and the payload of quantization indices embedding. We can see that the higher the target bit-rate is set, the larger the payload that quantization indices embedding can achieve, because the number of coefficients in the subbands increases. On average, the payload of the audio embedding reaches around 10% of the audio stream size, as it does for the video information embedding.
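   As a back-of-the-envelope illustration of how such an embedding ratio relates to the per-frame payload, the sketch below computes the ratio of embedded bits to encoded stream size. This is not the authors' code; the frame rate, the helper name, and the example numbers are assumptions made purely for illustration.

    # Minimal sketch (not the authors' code): estimate the embedding ratio of
    # payload to encoded audio size from a per-frame payload figure.

    def embedding_ratio(payload_bits_per_frame: float,
                        frames_per_second: float,
                        bitrate_bps: float) -> float:
        """Embedded bits as a fraction of the encoded stream size."""
        payload_bps = payload_bits_per_frame * frames_per_second
        return payload_bps / bitrate_bps

    if __name__ == "__main__":
        # Assumed AAC-style framing: 1024 samples per frame at 44.1 kHz.
        fps = 44100.0 / 1024.0
        # Hypothetical clip: 400 bits embedded per frame, encoded at 192 Kbps.
        print(f"ratio = {embedding_ratio(400.0, fps, 192_000.0):.2%}")

   With per-frame payloads in the range listed in Table VII, this kind of estimate lands in the same ballpark as the reported ratios; the exact values depend on the clip's sampling rate and frame count.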
   [Plot: payload (bpf) versus target bit-rate from 120 to 260 Kbps for the six clips abba, speech, baird, bach, classic, and electronic.]

Figure 8.   The payload of quantization indices embedding at various bit-rates.


                            V. CONCLUSION
   We have developed a high-capacity steganographic scheme for FLV files. Both the video and audio streams are employed, and several coding features are taken into account. Users can select suitable features according to their application, with payload, perceptual quality, file-size increase, and security as the major concerns. Experimental results demonstrate that the payload can reach more than 10% of the total file size when a good tradeoff among these factors is achieved.

MULTI-SCALE IMAGE CONTRAST ENHANCEMENT: USING ADAPTIVE
           INVERSE HYPERBOLIC TANGENT ALGORITHM
            Cheng-Yi Yu 1,2, Yen-Chieh Ouyang 1, Tzu-Wei Yu 3, Chein-I Chang 1,4
    1 Dept. of Electrical Engineering, National Chung Hsing University, Taichung, ROC
    2 Dept. of Computer Science and Information Engineering, National Chin Yi University of Technology, Taichung, ROC
    3 Dept. of Electronic Engineering, National Chin Yi University of Technology, Taichung, ROC
    4 Remote Sensing Signal and Image Processing Laboratory, Dept. of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
                               E-mail: youjy@ncut.edu.tw


                              ABSTRACT

      This paper presents a fast and effective method for image contrast enhancement based on multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT). Sub-band coefficients are developed based on the Adaptive Inverse Hyperbolic Tangent algorithm. In the proposed method, the image contrast is calculated from the local mean and local variance before the further processing of the Adaptive Inverse Hyperbolic Tangent (AIHT) algorithm. We show that this approach provides a convenient and effective way to handle various types of images. Applications of the proposed method to real-time imagery are also discussed. Experimental results show that the proposed algorithm is capable of adaptively enhancing the local contrast of the original image while bringing out the details of objects at the same time.

Keywords — Multi-Scale, Adaptive Inverse Hyperbolic Tangent, Contrast Enhancement, Image Processing
Topic area — Multi Processing, Image Post-Processing

                          1. INTRODUCTION

      Light is the electromagnetic radiation that stimulates our visual response. In real-world situations, light intensities cover a large range: the illumination range over which the human visual system can operate is roughly 1 to 10^10, or ten orders of magnitude.
      The retina of the human eye contains about 100 million rods and 6.5 million cones. The rods are sensitive and provide vision over the lower several orders of magnitude of illumination. The cones are less sensitive and provide the visual response at the higher 5 to 6 orders of magnitude of illumination. Figure 1 shows the Human Visual System mapping curve [1,2].

        Figure 1. Human Visual System mapping curve

      According to its contrast, an image is generally categorized into one of five groups: dark image, bright image, back-lighted image, low-contrast image, and high-contrast image. A dark image has particularly low gray levels in intensity, while a bright image has very high gray levels in intensity. The gray levels of a back-lighted image are usually distributed at the two ends of the dark and bright regions. On the other hand, the gray levels of a low-contrast image are generally centralized in the middle region, while the gray levels of a high-contrast image are scattered across the whole spectrum (Fig. 2) [3,4].

        Figure 2. Five kinds of contrast types

      Five categories of commonly used gray level transfer functions, shown in Fig. 3, are generally used to perform contrast enhancement so as to achieve the different types of contrast [3,4]. For example, for dark images with mean < 0.5, the function in Fig. 3(a) is used,
whereas the function in Fig. 3(b) is used for a bright image with mean > 0.5 for the same purpose. For images whose gray levels are centralized in the middle region with mean near 0.5, the function in Fig. 3(c) is used. For images whose gray levels are distributed at the two ends of the dark and bright regions, the function in Fig. 3(d) is used. For images whose gray levels are uniformly scattered across the whole spectrum, the function in Fig. 3(e) is used.

      Figure 3. Five categories of classical gray level transfer functions: (a) dark image (bias=0.37, gain=0.35), (b) bright image (bias=0.37, gain=3.0), (c) back-lighted image (bias=0.97, gain=1.0), (d) low-contrast image (bias=0.97, gain=1.0), (e) high-contrast image (bias=0.37, gain=1.0).

      Contrast enhancement techniques are widely used to increase visual image quality. Image enhancement serves two purposes. First, it makes an image clearer and more detailed so that the human eye can recognize its content more easily. Second, it makes the image data easier for a computer to analyze and identify, giving the machine visual perception capabilities closer to those of humans. In our previous work [3,4] we proposed the Adaptive Inverse Hyperbolic Tangent algorithm; however, that approach suffers from the following drawbacks. First, it lacks a mechanism to adjust the degree of enhancement, so AIHT-based contrast enhancement cannot retain the detailed brightness distribution of the original image, which leads to distortion. Second, the algorithm only performs global contrast enhancement and cannot achieve local contrast enhancement; it fails to follow the Human Visual System mapping curve and may produce non-smooth or distorted images.
      To address these shortcomings, this paper proposes a multi-scale image enhancement method based on the Adaptive Inverse Hyperbolic Tangent algorithm. The method has two main features: (1) a sub-band processing scheme that achieves local contrast enhancement; (2) the ability to process various types of images while enhancing and retaining the original image details. The enhanced results also facilitate subsequent image analysis.
      The multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT) for image contrast enhancement is suitable for interactive applications. It can automatically produce contrast-enhanced images of good quality while using a spatially uniform mapping function based on a simple brightness perception model to achieve better efficiency. In addition, MSAIHT provides users with a tool for tuning the image appearance on the fly in terms of brightness and contrast, which makes it suitable for interactive use. The AIHT-processed images can be reproduced within the capabilities of the display medium to give more detailed and faithful representations of the original scenes.
      The remainder of this paper is organized as follows: Section 2 reviews previous work in the literature. Section 3 develops the MSAIHT contrast enhancement algorithm along with its parameters and usage. Section 4 conducts experiments, including simulations. Finally, Section 5 provides directions for further research.

            2. CONTRAST ENHANCEMENT FOR AN IMAGE

      There are two categories of contrast enhancement techniques: global methods and local methods. Global contrast enhancement techniques remedy problems that manifest themselves in a global fashion, such as excessive or poor lighting conditions in the source environment. Local contrast enhancement, on the other hand, tries to enhance the visibility of local details in the image. Locally enhanced images look more attractive than the originals because of their higher contrast [5].
      The advantages of a global method are its high efficiency and low computational load. The drawback of a global operator is its inability to reveal image details arising from local luminance variation. On the contrary, the advantage of a local operator is its capability to reveal the details of luminance level information in an image, at the expense of a very high computational cost that may be unsuitable for video applications without hardware realization [3,4]. Two types of contrast enhancement techniques, linear and nonlinear, are discussed as follows.
      Linear contrast enhancement is also referred to as contrast stretching. It linearly expands the original digital luminance values of an image to a new distribution. Expanding the original input values of the image makes it possible to use the entire sensitivity range of the display device. Linear contrast enhancement also highlights subtle variations within the data.
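      As a concrete reference point, the following is a minimal linear contrast-stretching sketch. It is a generic textbook formulation rather than part of the proposed MSAIHT method, and the percentile clipping limits are illustrative assumptions.

    import numpy as np

    def linear_stretch(img: np.ndarray, low_pct: float = 1.0, high_pct: float = 99.0) -> np.ndarray:
        """Linearly remap gray levels so the chosen percentile range fills [0, 255]."""
        lo, hi = np.percentile(img, [low_pct, high_pct])
        out = (img.astype(np.float64) - lo) / max(float(hi - lo), 1e-6)
        return np.clip(out * 255.0, 0.0, 255.0).astype(np.uint8)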
      Nonlinear contrast enhancement often involves histogram equalization, which requires an algorithm to accomplish the task. One major disadvantage of a nonlinear contrast stretch is that each value in the input image can map to several values in the output image, so that objects in the original scene lose their correct relative brightness values.
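      For comparison, a plain global histogram-equalization sketch is shown below; again, this is a generic illustration and not the method proposed in this paper.

    import numpy as np

    def histogram_equalize(img: np.ndarray) -> np.ndarray:
        """Global histogram equalization for an 8-bit grayscale image."""
        hist = np.bincount(img.ravel(), minlength=256)     # gray-level histogram
        cdf = hist.cumsum().astype(np.float64)
        cdf = (cdf - cdf[0]) / max(cdf[-1] - cdf[0], 1.0)   # normalized cumulative distribution
        lut = np.round(cdf * 255.0).astype(np.uint8)        # gray-level mapping table
        return lut[img]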
      Under such a circumstance, contrast enhancement is generally performed to expand the gray level range to mitigate the problem. One popular technique to accomplish this task is histogram equalization (Gonzalez and Woods [6]). A disadvantage of the method is that it is indiscriminate and produces unrealistic effects in photographs. It may increase the contrast of background noise while decreasing the usable signal. In scientific imaging, where spatial correlation is more important than signal intensity, the small signal-to-noise ratio usually hampers visual detection.

          3. MULTI-SCALE PARAMETER ADJUSTMENT OF ADAPTIVE
           INVERSE HYPERBOLIC TANGENT (MSAIHT) ALGORITHM

3.1. Adaptive Inverse Hyperbolic Tangent (AIHT) Algorithm
      Figure 4 is a block diagram of the AIHT algorithm. The input data are converted from their original format to a floating-point representation of RGB values. The principal characteristic of the proposed enhancement function is an adaptive adjustment of the inverse hyperbolic tangent (IHT) function determined by each pixel's radiance. After the image file is read, bias(x) and gain(x) are computed; these parameters control the shape of the IHT function. Figure 5 shows a block diagram of the AIHT parameter evaluation, including the bias(x) and gain(x) parameters [3,4].
      The Adaptive Inverse Hyperbolic Tangent algorithm has several desirable properties. For very small and very large luminance values, its logarithmic form enhances the contrast in both the dark and the bright areas of an image. Because the function is asymptotic, the output mapping is always bounded between 0 and 1. Another advantage of this function is that it provides an approximately inverse hyperbolic tangent mapping for intermediate luminance, i.e., luminance distributed between the dark and bright values. Figure 6 shows an example in which the middle section of the curve is approximately linear.
      The form of the AIHT fits data obtained from measuring the electrical response of photo-receptors to flashes of light in various species [7]. It has also provided a good fit to other electro-physiological and psychophysical measurements of human visual function [8]-[10].
      The contrast of an image can be enhanced using the adaptive inverse hyperbolic function. The enhanced pixel x_ij is defined as follows:

      Enhance(x_ij) = log[ (1 + x_ij^bias(x)) / (1 - x_ij^bias(x)) ] * (1 / gain(x))          (1)

where x_ij is the image gray level at the i-th row and j-th column. The bias(x) is a power applied to x_ij to speed up the change. The gain function is a weighting function used to determine the steepness of the AIHT curve: a steeper slope maps a smaller range of input values to the display range. The gain function helps shape how fast the mid-range of objects in a soft region goes from 0 to 1; a higher gain value means a higher rate of change. Therefore, the steepness of the inverse hyperbolic tangent curve can be dynamically adjusted. The following section describes the method we use, which is similar to the proposed algorithm.
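      A compact sketch of the re-mapping in Eq. (1) is given below. It assumes the gray levels have already been normalized to (0, 1); the clipping constant and the final rescaling to [0, 1] are illustrative assumptions rather than part of the published algorithm.

    import numpy as np

    def aiht_enhance(x: np.ndarray, bias: float, gain: float) -> np.ndarray:
        """Apply the inverse-hyperbolic-tangent re-mapping of Eq. (1).

        x: grayscale image with values in (0, 1); bias and gain stand in for
        bias(x) and gain(x). Clipping and output rescaling are assumptions.
        """
        eps = 1e-6
        xp = np.clip(x, eps, 1.0 - eps) ** bias            # x_ij^bias(x)
        enhanced = np.log((1.0 + xp) / (1.0 - xp)) / gain  # Eq. (1)
        # Rescale to [0, 1] for display (assumed normalization).
        return (enhanced - enhanced.min()) / max(float(enhanced.max() - enhanced.min()), eps)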




      Figure 4. A flowchart of the AIHT algorithm.

      Figure 5. A flowchart of the AIHT parameter evaluation.

      [Plot: the AIHT curve for gain = 1.0 with bias = 0.2, 1.0, and 4.0, together with a linear reference.]
      Figure 6. AIHT is approximately linear over the middle range of values, where the choice of a semi-saturation constant determines how input values are mapped to display values.

3.2. Bias and Gain Parameters
      The bias function is a power function defined over the unit interval which remaps x according to the bias transfer function. It is used to bend the density function either upwards or downwards over the [0,1] interval.
The bias power function is defined by:

      bias(x) = ( mean(x) / 0.5 )^0.25,   where  mean(x) = (1/(m*n)) * sum_{i=1..m} sum_{j=1..n} x_ij          (2)

      The gain function determines the steepness of the AIHT curve: a steeper slope maps a smaller range of input values to the display range. The gain function is used to help reshape how quickly an object's mid-range goes from 0 to 1 within its soft region.
      The gain function is defined by:

      gain(x) = ( 0.1 + variance(x) )^0.5,   where  variance(x) = (1/(m*n)) * sum_{i=1..m} sum_{j=1..n} (x_ij - mean(x))^2          (3)

and mean(x) is the mean gray level defined in Eq. (2).
      Decreasing the gain(x) value increases the contrast of the re-mapped image. Shifting the distribution toward lower levels of light (i.e., decreasing bias(x)) decreases the highlights. By adjusting bias(x) and gain(x), it is possible to tailor a re-mapping function with appropriate amounts of image contrast enhancement, highlight, and shadow lightness, as shown in Fig. 7.
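      Putting Eqs. (2) and (3) together, the adaptive parameters can be sketched as follows; the squared deviation in the variance term follows the usual definition of variance and is an assumption made while reconstructing Eq. (3) from the text.

    import numpy as np

    def aiht_parameters(x: np.ndarray):
        """Compute bias(x) and gain(x) from the image statistics of Eqs. (2) and (3).

        x: grayscale image normalized to [0, 1].
        """
        mean = float(x.mean())                          # mean(x) in Eq. (2)
        variance = float(((x - mean) ** 2).mean())      # variance(x) in Eq. (3), assumed squared deviation
        bias = (mean / 0.5) ** 0.25                     # Eq. (2)
        gain = (0.1 + variance) ** 0.5                  # Eq. (3)
        return bias, gain

      Combined with the aiht_enhance sketch above, aiht_enhance(x, *aiht_parameters(x)) would apply the adaptive mapping globally; computing the same statistics over local windows or sub-bands would give the local, multi-scale behaviour described in the abstract.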
[Figure 7 consists of nine panels of mapping curves (Output Level versus Input Level), one panel per bias value in {0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8}, each panel varying the gain over {1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37}.]
Figure 7. Inverse hyperbolic tangent mapping curves produced by varying the gain and bias values.
Figure 8. Processed images for (a) the bias parameter fixed (bias = 1) with eight different gain values, and (b) the gain parameter fixed (gain = 0.85) with nine different bias values.

The gain function determines the steepness of the curve. Steeper slopes map a smaller range of input values to the display range. The value of bias controls the centering of the inverse hyperbolic tangent. Figure 8 shows the processed images for different gain and bias values. There are a total of eight gain values (1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37) with the bias parameter fixed (bias = 1); the corresponding results are shown in Fig. 8(a). There are a total of nine bias values (0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8) with the gain parameter fixed (gain = 0.85); the corresponding results are shown in Fig. 8(b).

3.3. Multi-Scale Parameter Adjustment of Adaptive Inverse Hyperbolic Tangent (MSAIHT) Algorithm

Figure 9 shows a block diagram of the MSAIHT algorithm. The input data is converted from its original format to a floating-point representation of RGB values. The principal characteristic of our proposed enhancement
function is a multi-scale adaptive adjustment of the Inverse Hyperbolic Tangent (MSAIHT) function determined by each pixel's radiance. After reading the image file, the bias(x) and gain(x) values are computed. These parameters control the shape of the AIHT function. Figure 10 shows a block diagram of the MSAIHT parameter evaluation, including the multi-scale bias(x) and gain(x) parameters.

Figure 9. A flowchart of the MSAIHT algorithm.

There are two important design goals for the multi-scale approach: avoiding noise visibility, especially in smooth regions, and preventing intensity saturation at the minimum and maximum possible intensity values (e.g., 0 and 255 for a 1-byte-per-channel source format).

The enhanced output image resulting from the multi-scale approach for processing an input image x is described by:

Enhance_MSAIHT = Σ_{k=1}^{K} AIHT(bias(k), gain(k))        (4)

where K is the number of bands used, with low-gain images in the low levels and high-gain images in the high levels. An additional problem that is potentially solved by this approach is the compression property of the display (the so-called gamma curve). This transfer function has a high suppression rate for the higher luminance range and a low expansion rate for the lower luminance regions.

4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A variety of video sequences and still images were tested using the proposed method. There are four types of extreme-case images: dark, bright, back-lighted, and low-contrast images. Images with different types of histogram distributions, including daily-life images with poor contrast, were used in the experiments to demonstrate the enhanced results. Figure 11 shows various types of images with bad contrast and the results of enhancement by histogram equalization, AIHT, and the proposed MSAIHT method. Figure 12 compares the Adaptive Inverse Hyperbolic Tangent and Multi-Scale Adaptive Inverse Hyperbolic Tangent methods on local detail; MSAIHT preserves local detail better than AIHT.

The comparative analysis shows that the proposed methods can display more detail, in the sense of contrast, than the currently used methods. The MSAIHT technique keeps the sharpness of defect edges and local detail well. Therefore, AIHT and MSAIHT can greatly enhance poor images and will be helpful for defect recognition.

Finally, Figure 13 shows the MSAIHT system interface in manual and automatic mode. The automatic mode adjusts the best parameters (multi-scale gain and bias) based on automatically computed image characteristics (piecewise mean and variance). In manual mode, users can select the multi-scale gain and bias parameters themselves.
  • 1.
    3-D Environment ModelConstruction and Adaptive Foreground Detection for Multi-Camera Surveillance System Yi-Yuan Chen1† , Hung-I Pai2† , Yung-Huang Huang∗ , Yung-Cheng Cheng∗ , Yong-Sheng Chen∗ Jian-Ren Chen† , Shang-Chih Hung† , Yueh-Hsun Hsieh† , Shen-Zheng Wang† , San-Lung Zhao† † Industrial Technology Research Institute, Taiwan 310, ROC ∗ Department of Computer and Information Science, National Chiao-Tung University, Taiwan 30010, ROC E-mail:1 [email protected], 2 [email protected] Abstract— Conventional surveillance systems usually use multiple screens to display acquired video streams and may cause trouble to keep track of targets due to the lack of spatial relationship among the screens. This paper presents an effective and efficient surveillance system that can integrate multiple video contents into one single comprehensive view. To visualize the monitored area, the proposed system uses planar patches to approximate the 3-D model of the monitored environment and displays the video contents of cameras by applying dynamic texture mapping on the model. Moreover, a pixel-based shadow detection scheme for surveillance system is proposed. After an offline training phase, our method exploits the threshold which determines whether a pixel is in a shadow part of the Fig. 1. A conventional surveillance system with multiple screens. frame. The thresholds of pixels would be automatically adjusted and updated according to received video streams. The moving objects are extracted accurately with removing cast shadows and then visualized through axis-aligned billboarding. The system direction of cameras, and locations of billboards indicate provides security guards a better situational awareness of the the positions of cameras, but the billboard contents will be monitored site, including the activities of the tracking targets. hard to perceive if the angles between viewing direction and Index Terms— Video surveillance system, planar patch mod- normal directions of billboards are too large. However, in eling, axis-aligned billboarding, cast shadow removal rotating billboard method, when the billboard rotates and faces to the viewpoint of user, neither camera orientations I. I NTRODUCTION nor capturing areas will be preserved. In outdoor surveillance Recently, video surveillance has experienced accelerated system, an aerial or satellite photograph can be used as growth because of continuously decreasing price and better a reference map and some measurement equipments are capability of cameras [1] and has become an important used to build the 3-D environment [3]–[5]. Neumann, et al. research topic in the general field of security. Since the utilized an airborne LiDAR (Light Detection and Ranging) monitored regions are often wide and the field of views sensor system to collect 3-D geometry samples of a specific of cameras are limited, multiple cameras are required to environment [6]. In [3], image registration seams the video on cover the whole area. In the conventional surveillance system, the 3-D model. Furthermore, video projection, such as video security guards in the control center monitor the security flashlight or virtual projector, is another way to display video area through a screen wall (Figure 1). It is difficult for the in the 3-D model [4], [7]. guards to keep track of targets because the spatial relationship However, the multi-camera surveillance system still has between adjacent screens is not intuitively known. 
Also, it is many open problems to be solved, such as object tracking tiresome to simultaneously gaze between many screens over across cameras and object re-identification. The detection of a long period of time. Therefore, it is beneficial to develop a moving objects in video sequences is the first relevant step in surveillance system that can integrate all the videos acquired the extraction of information in vision-based applications. In by the monitoring cameras into a single comprehensive view. general, the quality of object segmentation is very important. Many researches on integrated video surveillance systems The more accurate positions and shapes of objects are, the are proposed in the literature. Video billboards and video more reliable identification and tracking will be. Cast shadow on fixed planes project camera views including foreground detection is an issue for precise object segmentation or objects onto individual vertical planes in a reference map to tracking. The characteristics of shadow are quite different visualize the monitored area [2]. In fixed billboard method, in outdoor and indoor environment. The main difficulties in billboards face to specified directions to indicate the capturing separating the shadow from an interesting object are due to 988
  • 2.
    the physical propertyof floor, directions of light sources and System configuration On-line monitoring additive noise in indoor environment. Based on brightness Manual operation video and chromaticity, some works are proposed to decide thresh- streams olds of these features to roughly detect the shadow from objects [8]–[10]. However, current local threshold methods 2D 3D patterns model Background modeling couple blob-level processing with pixel-level detection. It causes the performance of these methods to be limited due to the averaging effect of considering a big image region. Registration with Segmentation and corresponding points refinement Two works to remove shadow are proposed to update the threshold with time and detect cast shadow in different Axis-aligned scenes. Carmona et al [11] propose a method to detect Lookup billboarding shadow by using the properties of shadow in Angle-Module tables space. Blob-level knowledge is used to identify shadow, 3-D model refection and ghost. This work also proposes a method to construction update the thresholds to remove shadow in different positions of the scene. However there are many undetermined param- Fig. 2. The flowchart and components of the proposed 3-D surveillance system. eters to update the thresholds and the optimal parameters are hard to find in practice. Martel-Brisson et al [12] propose a method, called GMSM, which initially uses Gaissian of Mixture Model (GMM) to define the most stable Gaussian distributions as the shadow and background distributions. Since a background model is included in this method, more computation is needed for object segmentation if a more com- plex background model is included in the system. Besides, because that each pixel has to be updated no matter how many objects moving, it cost more computation in few objects. In this paper, we develop a 3-D surveillance system based on multiple cameras integration. We use planar patches to build the 3-D environment model firstly and then visualize videos by using dynamic texture mapping on the 3-D model. To obtain the relationship between the camera contents and the 3-D model, homography transformations are estimated for every pair of image regions in the video contents and the corresponding areas in the 3-D model. Before texture Fig. 3. Planar patch modeling for 3-D model construction. Red patches mapping, patches are automatically divided into smaller ones (top-left), green patches (top-right), and blue patches (bottom-left) represent with appropriate sizes according to the environment. Lookup the mapping textures in three cameras. The yellow point is the origin of the 3-D model. The 3-D environment model (bottom-right) is composed of tables for the homography transformations are also built for horizontal and vertical patches from these three cameras. accelerating the coordinate mapping in the video visual- ization processing. Furthermore, a novel method to detect moving shadow is also proposed. It consists of two phases. quired from IP cameras deployed in the scene to the 3-D The first phase is an off-line training phase which determines model by specifying corresponding points between the 3-D the threshold of every pixel by judging whether the pixel is model and the 2-D images. Since the cameras are fixed, this in the shadow part. In the second phase, the statistic data configuration procedure can be done only once beforehand. 
of every pixel is updated with time, and the threshold is Then in the on-line monitoring stage, based on the 3-D adjusted accordingly. By this way, a fixed parameters setting model, all videos will be integrated and visualized in a single for detecting shadow can be avoided. The moving objects are view in which the foreground objects extracted from images segmented accurately from the background and are displayed are displayed through billboards. via axis-aligned billboarding for better 3-D visual effects. A. Image registration II. S YSTEM CONFIGURATION For a point on a planar object, its coordinates on the plane Figure 2 illustrates the flowchart of constructing the pro- can be mapped to 2-D image through homography citeho- posed surveillance system. First, we construct lookup tables mography, which is a transformation between two planar for the coordinate transformation from the 2-D images ac- coordinate systems. A homography matrix H represents the 989
  • 3.
    relationship between pointson two planes: sct = Hcs , (1) where s is a scalar factor and cs and ct are a pair of corre- sponding points in the source and target patches, respectively. If there are at least four correspondences where no three correspondences in each patch are collinear, we can estimate H through the least-squares approach. We regard cs as points of 3-D environment model and ct as points of 2-D image and then calculate the matrix H to map points from the 3-D model to the images. In the reverse order, we can also map points from the images to the 3-D model. B. Planar patch modeling Precise camera calibration is not an easy job [13]. In the Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for virtual projector methods [4], [7], the texture image will be rendering (left). More patches make the rendering much better (right). miss-aligned to the model if the camera calibration or the 3-D model reconstruction has large error. Alternatively, we develop a method that approximates the 3-D environment where Iij is the intensity of the point obtained from homog- model through multiple yet individual planar patches and ˜ raphy transformation, Iij is the intensity of the point obtained then renders the image content of every patches to generate from texture mapping, i and j are the coordinates of row and a synthesized and integrated view of the monitored scene. In column in the image, respectively, and m × n represents the this way we can easily construct a surveillance system with dimension of the patch in the 2-D image. In order to have 3-D view of the environment. an reference scale to quantify the distortion amount, a peak Mostly we can model the environment with two basic signal-to-noise ratio is calculated by building components, horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually MAX2 I surrounded by doors and walls, which are modeled as the PSNR = 10 log10 , (3) MSE vertical planes. Both two kinds of planes are further divided into several patches according to the geometry of the scenes where MAXI is the maximum pixel value of the image. (Figure 3). If the scene consists of simple structures, a few Typical values for the PSNR are between 30 and 50 dB and large patches can well represent the scene with less rendering an acceptable value is considered to be about 20 dB to 25 dB costs. On the other hand, more and smaller patches are in this work. We set a threshold T to determine the quality required to accurately render a complex environment, at the of texture mapping by expense of more computational costs. In the proposed system, the 3-D rendering platform is PSNR ≥ T . (4) developed on OpenGL and each patch is divided into tri- angles before rendering. Since linear interpolation is used If the PSNR of the patch is lower than T , the procedure to fill triangles with texture in OpenGL and not suitable divides it into smaller patches and repeats the process until for the perspective projection, distortion will appear in the the PSNR values of every patches are greater than the given rendering result. One can use a lot of triangles to reduce this threshold T . kind of distortion, as shown in Figure 4, it will enlarge the computational burden and therefore not feasible for real-time III. O N - LINE MONITORING surveillance systems. To make a compromise between visualization accuracy and The proposed system displays the videos on the 3-D model. 
rendering cost, we propose a procedure that automatically However, the 3-D foreground objects such as pedestrians are divides each patch into smaller ones and decides suitable projected to image frame and become 2-D objects. They will sizes of patches for accurate rendering (Figure 4). We use the appear flattened on the floor or wall since the system displays following mean-squared error method to estimate the amount them on planar patches. Furthermore, there might be ghosting of distortion when rendering image patches: effects when 3-D objects are in the overlapping areas of m−1 n−1 different camera views. We need to tackle this problem by 1 ˜ MSE = (Iij − Iij )2 , (2) separating and rendering 3-D foreground objects in addition m×n i=0 j=0 to the background environment. 990
  • 4.
    our method suchthat the background doesn’t have to be determined again. In the indoor environment, we assume the color in eq.(7) is similar between shadow and background in a pixel although it is not evidently in sunshine in outdoor. Only the case of indoor environment is considered in this paper. Fig. 5. The tracking results obtained by using different shadow thresholds B. Collecting samples while people stand on different positions of the floor. (a) Tr = 0.8 (b) Samples I(x, y, t) in some frames are collected to decide Tr = 0.3. The threshold value Tθ = 6o is the same for both. the shadow area, where t is the time. In [12] all samples are collected including the classification of background, shadow A. Underlying assumption and foreground by the pixel value changed with time. But if a good background model has already built and some Shadow is a type of foreground noise. It appears in any initial foreground objects were segmented, the background zone of the camera scene. In [8], each pixel belongs to samples are not necessary. Only foreground and shadow a shadow blob is detected by two properties. First, the samples If (x, y, t) were needed to consider. Besides, since color vector of a pixel in shadow blob has similar direction background pixels are dropped from the samples list, this can to that of the background pixel in the same position of save the computer and memory especially in a scene with image. Second, the magnitude of the color vector in the T few objects. Future, If θ (x, y, t) is obtained by dropping the shadow is slightly less than the corresponding color vector of samples which not satisfy inequality eq.(7) from If (x, y, t). background. Similar to [11], RGB or other color space can be Obviously, the samples data composed of more shadows transformed into two dimensional space (called angle-module samples and less foreground samples. This also leads to that space). The color vector of a pixel in position (x, y) of current the threshold r(x, y, t) can be derived more easily than the frame, Ic (x, y), θ(x, y) is the angle between background threshold derived from samples of If (x, y, t). vector Ib (x, y) and Ic (x, y), and the magnitude ratio r(x, y) are defined as C. Deciding module ratio threshold arccos(Ic (x, y) · Ib (x, y)) The initial threshold Tθ (x, y, 0) is set according to the θ(x, y) = (5) experiment. In this case, Tθ (x, y, 0) = cos(6◦ ) is set as |Ic (x, y)||Ib (x, y)| + the initial value. After collecting enough samples, the ini- |Ic (x, y)| r(x, y) = (6) tial module ratio threshold Tr (x, y, 0) can be decided by |Ib (x, y)| this method, Fast step minimum searching (FSMS). FSMS where is a small number to avoid zero denominator. In [11], can fast separate the shadow from foreground distribution the shadow of a pixels have to satisfy which collected samples are described above. The detail of this method is described below. The whole distribution is Tθ < cos θ(x, y) < 1 (7) separated by a window size w. The height of each window Tr < r(x, y) < 1 (8) is the sum of the samples. Besides the background peak, two peaks were found. The threshold Tr is used to search the where Tθ is the angle threshold and Tr is the module peak which is closest to the average background value and ratio threshold. 
According to the demonstration showed in smaller than the background value, the shadow threshold can Figure 5, the best shadow thresholds are highly depends on be found by searching the minimum value or the value close positions (pixels) in the scene, because of the complexity to zero. of environment, the light sources and objects positions. Therefore, we propose a method to automatically adjust the D. Updating angle threshold thresholds for detecting shadow for each pixel. The threshold When a pixel satisfies both conditions in inequality eq.(7, for a pixel to be classified as shadow or not is determined by 8) at the same time, the pixel is classified as shadow. In other the necessary samples (data) collected with time. Only one words, if the pixel Is (x, y) is actually a shadow pixel, and c is classified as one of candidate of shadow by FSMS, the parameter has to be manually initialized. It is Tθ (0), where 0 means the initial time. Then the method can update the property of the pixel is require to satisfy the below equation thresholds automatically and fast. Our method is faster than at the same time the similar idea, GMSM method [12], when a background model has built up. There are two major advantages of 0 ≤ cos θ(x, y, t) < Tθ (x, y, t) (9) the computation time for our method. First, only necessary samples are collected. Second, compared with method [12], Tθ (x, y, t) can be decided by searching the minimum any background or foreground results can combine with cos(θ) of pixels in Is which is obtained by FSMS. However 991
  • 5.
    Fig. 7. Orientationdetermination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle. Fig. 6. A flowchart to illustrate the whole method. The purple part is based on pixel. are always moving on the floor, the billboards can be aligned to be perpendicular to the floor in the 3-D model. The 3-D we propose another method to find out Tθ (x, y, t) more fast. location of the billboard is estimated by mapping the bottom- The number of samples which are classified as shadow or middle point of the foreground bounding box in the 2-D background at time t is ATr (x, y, t) by using FSMS. We {b,s} image through the lookup tables. The ratio between the height define a ratio R(Tr ) = ATr /A{b,s,f } where A{b,s,f } is all {b,s} of the bounding box and the 3-D model determines the height samples in position x, y, where b, s, f represent the back- of the billboard in the 3-D model. The relationship between ground, shadow and foreground respectively. The threshold the direction of a billboard and the viewpoint is defined as Tθ (x, y, t) can be updating to Tθ (x, y, t) by R(Tr ). The shown in Figure 7. number of samples whose cos(θ(x, y)) values are larger than The following equations are used to calculate the rotation the Tθ (x, y, t) is equal to A{b,s} and is required angle of the billboard: R(Tθ (x, y, t)) = R(Tr ) (10) Y = (n × v) , (12) Besides, we add a perturbation δTθ to the Tθ (x, y, t). T Since FSMS only finds out a threshold in If θ (x, y, t), if the φ = cos−1 (v · n) , (13) initial threshold Tθ (x, y, 0) is set larger than true threshold, the best updating threshold is equal to threshold Tθ not where v is the vector from the location of the billboard, L, to smaller than threshold Tθ . Therefore the true angle threshold the location E projected vertically from the viewpoint to the will never be found with time. To solve this problem, a per- floor, n is the normal vector of the billboard, Y is the rotation turbation of the updating threshold is added to the updating axis, and φ is the estimated rotation angle. The normal vector threshold of the billboard is parallel to the vector v and the billboard is always facing toward the viewpoint of the operator. Tθ (x, y, t) = Tθ (x, y, t) − δTθ (11) F. Video content integration Since the new threshold Tθ (x, y, t) has smaller value to cover more samples, it can approach the true threshold If the fields of views of cameras are overlapped, objects in with time. This perturbation can also make the method more these overlapping areas are seen by multiple cameras. In this adaptable to the change of environment. Here is a flowchart case, there might be ghosting effects when we simultaneously Figure 6 to illustrate the whole method. display videos from these cameras. To deal with this problem, we use 3-D locations of moving objects to identify the cor- E. Axis-aligned billboarding respondence of objects in different views. 
When the operator In visualization, axis-aligned billboarding [14] constructs chooses a viewpoint, the rotation angles of the corresponding billboards in the 3-D model for moving objects, such as billboards are estimated by the method presented above and pedestrians, and the billboard always faces to the viewpoint of the system only render the billboard whose rotation angle is the user. The billboard has three properties: location, height, the smallest among all of the corresponding billboards, as and direction. By assuming that all the foreground objects shown in Figure 8. 992
  • 6.
    C1 C3 C2 C1 Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause the ghosting effect (bottom-left). Static background images without Fig. 9. Determination of viewpoint switch. We divide the floor area foreground objects are used to fill the area of the foreground objects (top). depending on the fields of view of the cameras and associate each area to one Ghosting effects are removed and static background images can be update of the viewpoint close to a camera. The viewpoint is switched automatically by background modeling. to the predefined viewpoint of the area containing more foreground objects. G. Automatic change of viewpoint The experimental results shown in Figure 12 demonstrate that the viewpoint can be able to be chosen arbitrarily in The proposed surveillance system provides target tracking the system and operators can track targets with a closer feature by determining and automatic switching the view- view or any viewing direction by moving the virtual camera. points. Before rendering, several viewpoints are specified in Moreover, the moving objects are always facing the virtual advance to be close to the locations of the cameras. During camera by billboarding and the operators can easily perceive the viewpoint switching from one to another, the parameters the spatial information of the foreground objects from any of the viewpoints are gradually changed from the starting viewpoint. point to the destination point for smooth view transition. The switching criterion is defined as the number of blobs V. C ONCLUSIONS found in the specific areas. First, we divide the floor area into In this work we have developed an integrated video surveil- several parts and associate them to each camera, as shown lance system that can provide a single comprehensive view in Figure 9. When people move in the scene, the viewpoint for the monitored areas to facilitate tracking moving targets is switched automatically to the predefined viewpoint of the through its interactive control and immersive visualization. area containing more foreground objects. We also make the We utilize planar patches for 3-D environment model con- billboard transparent by setting the alpha value of textures, so struction. The scenes from cameras are divided into several the foreground objects appear with fitting shapes, as shown patches according to their structures and the numbers and in Figure 10. sizes of patches are automatically determined for compromis- ing between the rendering effects and efficiency. To integrate IV. E XPERIMENT RESULTS video contents, homography transformations are estimated for relationships between image regions of the video contents We developed the proposed surveillance system on a PC and the corresponding areas of the 3D model. Moreover, with Intel Core Quad Q9550 processor, 2GB RAM, and one the proposed method to remove moving cast shadow can nVidia GeForce 9800GT graphic card. Three IP cameras with automatically decide thresholds by on-line learning. In this 352 × 240 pixels resolution are connected to the PC through way, the manual setting can be avoided. Compared with the Internet. The frame rate of the system is about 25 frames per work based on frames, our method increases the accuracy to second. remove shadow. In visualization, the foreground objects are In the monitored area, automated doors and elevators are segmented accurately and displayed on billboards. 
specified as background objects, albeit their image do change when the doors open or close. These areas will be modeled in R EFERENCES background construction and not be visualized by billboards, [1] R. Sizemore, “Internet protocol/networked video surveillance market: the system use a ground mask to indicate the region of Equipment, technology and semiconductors,” Tech. Rep., 2008. interesting. Only the moving objects located in the indicated [2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, “Contextualized videos: Combining videos with environment models to support situa- areas are considered as moving foreground objects, as shown tional understanding,” IEEE Transactions on Visualization and Com- in Figure 11. puter Graphics, 2007. 993
  • 7.
    Fig. 11. Dynamic background removal by ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top- right). A mask covered the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right). Fig. 10. Automatic switching the viewpoint for tracking targets. People Fig. 12. Immersive monitoring at arbitary viewpoint. We can zoom out the walk in the lobby and the viewpoint of the operator automatically switches viewpoint to monitor the whole surveillance area or zoom in the viewpoint to keep track of the targets. to focus on a particular place. [3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, “Accurate transactions on Geosci. and remote sens., 2009. planar image registration for an integrated video surveillance system,” [10] J. Kim and H. Kim, “Efficient regionbased motion segmentation for a Computational Intelligence for Visual Intelligence, 2009. video monitoring system,” Pattern Recognition Letters, 2003. [4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, [11] E. J. Carmona, J. Mart´nez-Cantos, and J. Mira, “A new video seg- ı S. Hsu, D. Nister, and K. Hanna, “Video flashlights: real time ren- mentation method of moving objects based on blob-level knowledge,” dering of multiple videos for immersive model visualization,” in 13th Pattern Recognition Letters, 2008. Eurographics workshop on Rendering, 2002. [12] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast [5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, “Augmented virtual shadows through a multidistribution approach,” IEEE transactions on environments (ave): dynamic fusion of imagery and 3-d models,” IEEE pattern analysis and machine intelligence, 2007. Virtual Reality, 2003. [13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and [6] S. You, J. Hu, U. Neumann, and P. Fox, “Urban site modeling from N. Master, “Calibrated, registered images of an extended urban area,” lidar,” Lecture Notes in Computer Science, 2003. International Journal of Computer Vision, 2003. [7] I. Sebe, J. Hu, S. You, and U. Neumann, “3-d video surveillance [14] A. Fernandes, “Billboarding tutorial,” 2005. with augmented virtual environments,” in International Multimedia Conference, 2003. [8] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-time robust background subtraction and shadow detection,” IEEE ICCV. (1999). [9] K. Chung, Y. Lin, and Y. Huang, “Efficient shadow detection of color aerial images based on successive thresholding scheme,” IEEE 994
  • 8.
    Morphing And TexturingBased On The Transformation Between Triangle Mesh And Point Wei-Chih Hsu Wu-Huang Cheng Department of Computer and Communication Institute of Engineering Science and Technology, Engineering, National Kaohsiung First University of National Kaohsiung First University of Science and Science and Technology. Kaohsiung, Taiwan Technology. Kaohsiung, Taiwan [email protected] Abstract—This research proposes a methodology of [1] has proposed a method to represent multi scale surface. transforming triangle mesh object into point-based object and M. Müller et al. The [2] has developed a method for the applications. Considering the cost and program functions, modeling and animation to show that point-based has the experiments of this paper adopt C++ instead of 3D flexible property. computer graphic software to create the point cloud from Morphing can base on geometric, shape, or other features. meshes. The method employs mesh bounded area and planar Mesh-based morphing sometimes involves geometry, mesh dilation to construct the point cloud of triangle mesh. Two structure, and other feature analysis. The [3] has point-based applications are addressed in this research. 3D demonstrated a method to edit free form surface based on model generation can use point-based object morphing to geometric. The method applies complex computing to deal simplify computing structure. Another application for texture mapping is using the relation of 2D image pixel and 3D planar. with topology, curve face property, and triangulation. The [4] The experiment results illustrate some properties of point- not only has divided objects into components, but also used based modeling. Flexibility and scalability are the biggest components in local-level and global-level morphing. The [5] advantages among the properties of point-based modeling. The has adopted two model morphing with mesh comparison and main idea of this research is to detect more sophisticated merging to generate new model. The methods involved methods of 3D object modeling from point-based object. complicate data structure and computing. This research has illustrated simple and less feature analysis to create new Keywords-point-based modeling; triangle mesh; texturing; model by using regular point to morph two or more objects. morphing Texturing is essential in rendering 3D model. In virtual reality, the goal of texture mapping is try to be as similar to I. INTRODUCTION the real object as possible. In special effect, exaggeration texturing is more suitable for demand. The [6] has built a In recent computer graphic related researches, form•Z, mesh atlas for texturing. The texture atlases' coordinates, Maya, 3DS, Max, Blender, Lightwave, Modo, solidThinking considered with triangle mesh structure, were mapped to 3D and other 3D computer graphics software are frequently model. The [7] has used the conformal equivalence of adopted tools. For example, Maya is a very popular software, triangle meshes to find the flat mesh for texture mapping. and it includes many powerful and efficient functions for This method is more comprehensible and easy to implement. producing results. The diverse functions of software can The rest arrangements are described as followings: increase the working efficiency, but the methodology design Transforming triangle mesh into point set for modeling are must follow the specific rules and the cost is usually high. 
addressed in Section II and III, and that introduce point- Using C++ as the research tool has many advantages, based morphing for model creating. The point-based texture especially in data information collection. Powerful functions mapping is addressed in Section IV, and followed by the can be created by C language instructions, parameters and conclusion of Section V. C++ oriented object. More complete the data of 3D be abstracted, more unlimited analysis can be produced. II. TRANSFORMING TRIANGLE MESH INTO POINT SET The polygon mesh is widely used to represent 3D models In order to implement the advantages of point-based and has some drawbacks in modeling. Unsmooth surface of model, transforming triangle mesh into point is the first step. combined meshes is one of them. Estimating vertices of The point set can be estimated by using three normal bound objects and constructing each vertex set of mesh are the lines of triangle mesh. The normal denoted by n can be factors of modeling inefficiency. Point-based modeling is the calculated by three triangle vertices. The point in the triangle solution to conquer some disadvantages of mesh modeling. area is denoted by B in , A denotes the triangle mesh area, the Point-based modeling is based on point primitives. No structure of each point to another is needed. To simplify the 3D space planar can be presented by p with coordinate point based data can employ marching cube and Delaunay ( x, y , z ) , vi =1, 2,3 denotes three triangle vertices of triangle triangulation to transform point-based model into polygon mesh. Mark Pauly has published many related researches mesh, v denotes the mean of three triangle vertices. The about point-based in international journals as followings: the formula that presents the triangle area is described below. 995
  • 9.
    A = {p ( x, y , z ) | pn T − v i n T = 0 , i ∈ (1,2,3), p ∈ Bin } The experiments use some objects file which is the wave front file format (.obj) from NTU 3D model database ver.1 Bin = { p( x, y, z ) | f (i , j ) ( p) × f (i , j ) (v) > 0} of National Taiwan University. The process of transforming f (i , j ) ( p) = r × a − b + s triangle mesh into point-based is shown in Figure 1. It is clear to see that some areas with uncompleted the whole b j − bi point set shown in red rectangle of figure 1. The planar r= , s = bi - r × ai dilation process is employed to refine fail areas. a j − ai Planar dilation process uses 26-connected planar to refine i, j = 1,2,3 a , b = x, y , z i < j a<b the spots leaved in the area. The first half portion of Figure 2 shows 26 positions of connected planar. If any planar and its 26 neighbor positions are the object planar is the condition. The main purpose to estimate the object planar is to verify the condition is true. The result in second half portion of Figure 2 reveals the efficiency of planar dilation process. III. POINT-BASED MORPHING FOR MODEL CREATING The more flexible in objects combining is one of property of point-based. No matter what the shape or category of the objects, the method of this study can put them into morphing process to create new objects. The morphing process includes 3 steps. Step one is to Figure 1. The process of transforming triangle mesh into point-based equalize the objects. Step two is to calculate each normal point of objects in morphing process. Step three is to estimate each point of target object by using the same normal point of two objects with the formula as described below. n −1 ot = p r1o1 + p r 2 o2 + ⋅ ⋅ ⋅ + (1 − ∑ p ri )o n i =1 n 0 ≤ p r1 , p r 2 ,⋅ ⋅ ⋅, p r ( n −1) ≤ 1 , ∑ p ri = 1 i =1 ot presents each target object point of morphing, and oi is the object for morphing process. p ri donates the object effect weight in morphing process, and i indicates the number of object. The new model appearance generated from morphing is depended on which objects were chosen and the value of each object weight as well. The research experiments use two objects, therefore i = 1 or 2, n = 2 . The results are shown in Figure 3. First row is a simple flat board and a character morphing. The second row shows the object selecting free in point-based modeling, because two totally different objects can be put into morphing and produced the satisfactory results. The models the created by objects morphing with different weights can be seen in figure 4. IV. POINT-BASED TEXTURE MAPPING Texturing mapping is a very plain in this research method. It uses a texture metric to map the 3D model to the 2D image pixel by using the concept of 2D image transformation into 3D. Assuming 3D spaces is divided into α × β blocks, α is the number of row, and β is the number of column. Hence the length, the width, and the height of 3D space is h × h × h ; afterwards the ( X , Y ) and ( x. y, z ) will denote the image coordination and 3D model respectively. Figure 2. Planar dilation process. The texture of each block is assigned by texture cube, and it 996
  • 10.
    is made by2D image as shown in the middle image of first confirmed by the scalability and flexibility of proposed raw in figure 5. The process can be expressed by a formula methodologies. as below. At T = c T REFERENCES h h h [1] MARK PAULY, “Point-Based Multiscale Surface t = [ x mod , y mod , z mod ] , c = [ X,Y ] α β β Representation,” ACM Transactions on Graphics, Vol. 25, No. 2, pp. 177–193, April 2006. ⎡α 0 0⎤ A=⎢ β (h − z ) ⎥ [2] M. Müller1, R. Keiser1, A. Nealen2, M. Pauly3, M. Gross1 ⎢ 0 0 ⎥ and M. Alexa2, “Point Based Animation of Elastic, Plastic ⎣ y ⎦ and Melting Objects,” Eurographics/ACM SIGGRAPH A denotes the texture transforming metric, t denotes the 3D Symposium on Computer Animation, pp. 141-151, 2004. model current position, c denotes the image pixel content in [3] Theodoris Athanasiadis, Ioannis Fudos, Christophoros Nikou, the current position. “Feature-based 3D Morphing based on Geometrically The experiment results are shown in the second row of Constrained Sphere Mapping Optimization,” SAC’10 Sierre, figure 5 and 6. The setting results α = β = 2 are shown in Switzerland, pp. 1258-1265, March 22-26, 2010. second row of figure 5. The setting results α = β = 4 create [4] Yonghong Zhao, Hong-Yang Ong, Tiow-Seng Tan and Yongguan Xiao, “Interactive Control of Component-based the images are shown in the first row of figure 6. The last Morphing,” Eurographics/SIGGRAPH Symposium on row images of figure 6 indicate the proposed texture Computer Animation , pp. 340-385, 2003. mapping method can be applied into any point-based model. [5] Kosuke Kaneko, Yoshihiro Okada and Koichi Niijima, “3D V. CONCLUSION Model Generation by Morphing,” IEEE Computer Graphics, Imaging and Visualisation, 2006. In sum, the research focuses on point-based modeling [6] Boris Springborn, Peter Schröder, Ulrich Pinkall, “Conformal applications by using C++ instead of convenient facilities or Equivalence of Triangle Meshes,” ACM Transactions on other computer graphic software. The methodologies that Graphics, Vol. 27, No. 3, Article 77, August 2008. developed by point-based include the simple data structure properties and less complex computing. Moreover, the [7] NATHAN A. CARR and JOHN C. HART, “Meshed Atlases for Real-Time Procedural Solid Texturing,” ACM methods can be compiled with two applications morphing Transactions on Graphics, Vol. 21,No. 2, pp. 106–131, April and texture mapping. The experiment results have been 2002. Figure 3. The results of point-based modeling using different objects morphing. 997
Figure 4. The models created by object morphing with different weights.
Figure 5. The process of 3D model texturing with a 2D image (first row) and the results (second row).
Figure 6. The results of point-based texture mapping with α = β = 4 and different objects.
    LAYERED LAYOUTS OFDIRECTED GRAPHS USING A GENETIC ALGORITHM Chun-Cheng Lin1,∗, Yi-Ting Lin2 , Hsu-Chun Yen2,† , Chia-Chen Yu3 1 Dept. of Computer Science, Taipei Municipal University of Education, Taipei, Taiwan 100, ROC 2 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 106, ROC 3 Emerging Smart Technology Institute, Institute for Information Industry, Taipei, Taiwan, ROC ABSTRACT charts, maps, posters, scheduler, UML diagrams, etc. It is important that a graph be drawn “clear”, By layered layouts of graphs (in which nodes are such that users can understand and get information distributed over several layers and all edges are di- from the graph easily. This paper focuses on lay- rected downward as much as possible), users can ered layouts of directed graphs, in which nodes are easily understand the hierarchical relation of di- distributed on several layers and in general edges rected graphs. The well-known method for generat- should point downward as shown in Figure 1(b). ing layered layouts proposed by Sugiyama includes By this layout, users can easily trace each edge from four steps, each of which is associated with an NP- top to bottom and understand the priority or order hard optimization problem. It is observed that the information of these nodes clearly. four optimization problems are not independent, in the sense that each respective aesthetic criterion may contradict each other. That is, it is impossi- ble to obtain an optimal solution to satisfy all aes- thetic criteria at the same time. Hence, the choice for each criterion becomes a very important prob- lem. In this paper, we propose a genetic algorithm to model the first three steps of the Sugiyama’s al- gorithm, in hope of simultaneously considering the Figure 1: The layered layout of a directed graph. first three aesthetic criteria. Our experimental re- sults show that this proposed algorithm could make Specifically, we use the following criteria to es- layered layouts satisfy human’s aesthetic viewpoint. timate the quality of a directed graph layout: to minimize the total length of all edges; to mini- Keywords: Visualization, genetic algorithm, mize the number of edge crossings; to minimize the graph drawing. number of edges pointing upward; to draw edges as straight as possible. Sugiyama [9] proposed a 1. INTRODUCTION classical algorithm for producing layered layouts of directed graphs, consisting of four steps: cycle Drawings of directed graphs have many applica- removal, layer assignment, crossing reduction, and tions in our daily lives, including manuals, flow assignment of horizontal coordinates, each of which ∗ Research supported in part by National Science Council addresses a problem of achieving one of the above under grant NSC 98-2218-E-151-004-MY3 criteria, respectively. Unfortunately, the first three † Research supported in part by National Science Council problems have been proven to be NP-hard when the under grant NSC 97-2221-E-002-094-MY3 width of the layout is restricted. There has been 1000
    a great dealof work with respect to each step of is quite different between drawing layered layouts Sugiyama’s algorithm in the literature. of acyclic and cyclic directed graphs. In acyclic Drawing layered layouts by four independent graphs, one would not need to solve problems on steps could be executed efficiently, but it may not cyclic removal. If the algorithm does not restrict always obtain nice layouts because preceding steps the layer by a fixed width, one also would not need may restrain the results of subsequent steps. For to solve the limited layer assignment problem. Note example, four nodes assigned at two levels after the that the unlimited-width layer assignment is not an layer assignment step lead to an edge crossing in NP-hard problem, because the layers of nodes can Figure 2(a), so that the edge crossing cannot be be assigned by a topological logic ordering. The removed during the subsequent crossing reduction algorithm in [10] only focuses on minimizing the step, which only moves each node’s relative posi- number of edge crossings and making the edges as tion on each layer, but in fact the edge crossing straight as possible. Although it also combined can be removed as drawn in Figure 2(b). Namely, three steps of Sugiyama’s algorithm, but it only the crossing reduction step is restricted by the layer contained one NP-hard problem. Oppositely, our assignment step. Such a negative effect exists ex- algorithm combined three NP-hard problems, in- clusively not only for these two particular steps but cluding cycle removal, limited-width layer assign- also for every other preceding/subsequent step pair. ment, and crossing reduction. In addition, our algorithm has the following ad- vantages. More customized restrictions on layered layouts are allowed to be added in our algorithm. For example, some nodes should be placed to the (a) (b) left of some other nodes, the maximal layer number should be less than or equal to a certain number, Figure 2: Different layouts of the same graph. etc. Moreover, the weighting ratio of each optimal criterion can be adjusted for different applications. Even if one could obtain the optimal solution for According to our experimental results, our genetic each step, those “optimal solutions” may not be the algorithm may effectively adjust the ratio between real optimal solution, because those locally optimal edge crossings number and total edge length. That solutions are restricted by their respective preced- is, our algorithm may make layered layouts more ing steps. Since we cannot obtain the optimal solu- appealing to human’s aesthetic viewpoint. tion satisfying all criteria at the same time, we have to make a choice in a trade-off among all criteria. For the above reasons, the basic idea of our 2. PRELIMINARIES method for drawing layered layouts is to combine the first three steps together to avoid the restric- tions due to criterion trade-offs. Then we use the The frameworks of three different algorithms for genetic algorithm to implement our idea. In the layered layouts of directed graphs (i.e., Sugiyama’s literature, there has existed some work on produc- algorithm, the cyclic leveling algorithm, and our ing layered layouts of directed graphs using ge- algorithm) are illustrated in Figure 2(a)–2(c), re- netic algorithm, e.g., using genetic algorithm to re- spectively. See Figure 2. 
Sugiyama’s algorithm duce edge crossings in bipartite graphs [7] or entire consists of four steps, as mentioned previously; the acyclic layered layouts [6], modifying nodes in a other two algorithms are based on Sugiyama’s algo- subgraph of the original graph on a layered graph rithm, in which the cyclic leveling algorithm com- layout [2], drawing common layouts of directed or bines the first two steps, while our genetic algo- undirected graphs [3] [11], and drawing layered lay- rithm combines the first three steps. Furthermore, outs of acyclic directed graphs [10]. a barycenter algorithm is applied to the crossing re- Note that the algorithm for drawing layered lay- duction step of the cyclic leveling and our genetic outs of acyclic directed graphs in [10] also com- algorithms, and the priority layout method is ap- bined three steps of Sugiyama’s algorithm, but it plied to the x-coordinate assignment step. 1001
    Sugiyama’s Algorithm Cyclic Leveling Genetic Algorithm Cycle Removel Cycle Removel Cycle Removel edge-node crossing Layer Assignment Layer Assignment Layer Assignment edge crossing Crossing Reduction (a) An edge crossing. (b) An edge-node crossing Crossing Reduction Crossing Reduction Barycenter Algorithm x-Coordinte Assignment x-Coordinte Assignment Priority Layout Method x-Coordinte Assignment Figure 4: Two kinds of crossings. (a) Sugiyama (b) Cyclic Leveling (c) Our we reverse as few edges as possible such that the Figure 3: Comparison among different algorithms. input graph becomes acyclic. This problem can be stated as the maximum acyclic subgraph prob- 2.1. Basic Definitions lem, which is NP-hard. (2) Layer assignment: Each node is assigned to a layer so that the total vertical A directed graph is denoted by G(V, E), where V is length of all edges is minimized. If an edge spans the set of nodes and E is the set of edges. An edge across at least two layers, then dummy nodes should e is denoted by e = (v1 , v2 ) ∈ E, where v1 , v2 ∈ V ; be introduced to each crossed layer. If the maxi- edge e is directed from v1 to v2 . A so-called layered mum width is bounded greater or equal to three, layout is defined by the following conditions: (1) the problem of finding a layered layout with min- Let the number of layers in this layout denoted by imum height is NP-compete. (3) Crossings reduc- n, where n ∈ N and n ≥ 2. Moreover, the n-layer tion: The relative positions of nodes on each layer layout is denoted by G(V, E, n). (2) V is parti- are reordered to reduce edges crossings. Even if we tioned into n subsets: V = V1 ∪ V2 ∪ V3 ∪ · · · ∪ Vn , restrict the problem to bipartite (two-layer) graphs, where Vi ∩ Vj = ∅, ∀i ̸= j; nodes in Vk are assigned it is also an NP-hard problem. (4) x-coordinate as- to layer k, 1 ≤ k ≤ n. (3) A sequence ordering, signment: The x-coordinates of nodes and dummy σi , of Vi is given for each i ( σi = v1 v2 v3 · · · v|Vi | nodes are modified, such that all the edges on the with x(v1 ) < x(v2 ) < · · · < x(v|Vi | )). The n- original graph structure are as straight as possi- layer layout is denoted by G(V, E, n, σ), where σ = ble. This step includes two objective functions: to (σ1 , σ2 , · · · , σn ) with y(σ1 ) < y(σ2 ) < · · · < y(σn ). make all edges as close to vertical lines as possible; An n-layer layout is called “proper” when it fur- to make all edge-paths as straight as possible. ther satisfies the following condition: E is parti- tioned into n − 1 subsets: E = E1 ∪ E2 ∪ E3 ∪ 2.3. Cyclic Leveling Algorithm · · · ∪ En−1 , where Ei ∩ Ej = ∅, ∀i ̸= j, and Ek ⊂ Vk × Vk+1 , 1 ≤ k ≤ n − 1. The cyclic leveling algorithm (CLA) [1] combines the first two steps of Sugiyama’s algorithm, i.e., it An edge crossing (assuming that the layout is focuses on minimizing the number of edges point- proper) is defined as follows. Consider two edges ing upward and total vertical length of all edges. e1 = (v11 , v12 ), e2 = (v21 , v22 ) Ei, in which v11 It introduces a number called span that represents and v21 are the j1 -th and the j2 -th nodes in σi , the number of edges pointing upward and the total respectively; v12 and v22 are the k1 -th and the k2 - vertical length of all edges at the same time. th nodes in σi+1 , respectively. If either j1 < j2 & k1 > k2 or j1 > j2 & k1 < k2 , there is an edge The span number is defined as follows. Consider crossing between e1 and e2 (see Figure 4(a)). a directed graph G = (V, E). 
Given k ∈ N, define a layer assignment function ϕ : V → {1, 2, · · · , k}. An edge-node crossing is defined as follows. Con- Let span(u, v) = ϕ(v) − ϕ(u), if ϕ(u) < ϕ(v); sider an edge e = (v1 , v2 ), where v1 , v2 ∈ V i; v1 span(u, v) = ϕ(v) − ϕ(u) + k, otherwise. For each and v2 are the j-th and the k-th nodes in σi , re- edge e = (u, v) ∈ E, denote span(e) = span(u, v) ∑ spectively. W.l.o.g., assuming that j > k, there are and span(G) = e∈E span(e). In brief, span (k − j − 1) edge-node crossings (see Figure 4(b)). means the sum of vertical length of all edges and the penalty of edges pointing upward or horizontal, 2.2. Sugiyama’s Algorithm provided maximum height of this layout is given. Sugiyama’s algorithm [9] consists of four steps: (1) The main idea of the CLA is: if a node causes Cycle removal: If the input directed graph is cyclic, a high increase in span, then the layer position of 1002
    the node wouldbe determined later. In the algo- then priority(v) = B − (|k − m/2|), in which B is a rithm, the distance function is defined to decide big given number, and m is the number of nodes in which nodes should be assigned first and is ap- layer k; if down procedures (resp., up procedures), plied. There are four such functions as follows, then priority(v) = connected nodes of node v on but only one can be chosen to be applied to all layer p − 1 (resp., p + 1). the nodes: (1) Minimum Increase in Span Moreover, the x-coordinate position of each node = minϕ(v)∈{1,··· ,k} span(E(v, V ′ )); (2) Minimum v is defined as the average x-coordinate position of Average Increase in Span (MST MIN AVG) connected nodes of node v on layer k − 1 (resp., = minϕ(v)∈{1,··· ,k} span(E(v, V ′ ))/E(v, V ′ ); (3) k + 1), if down procedures (resp., up procedures). Maximum Increase in Span = 1/δM IN (v); (4) Maximum Average Increase in Span = 2.6. Genetic Algorithm 1/δM IN AV G (v). From the experimental results in [1], using “MST MIN AVG” as the distance The Genetic algorithm (GA) [5] is a stochastic function yields the best result. Therefore, our global search method that has proved to be success- algorithm will be compared with the CLA using ful for many kinds of optimization problems. GA MST MIN AVG in the experimental section. is categorized as a global search heuristic. It works with a population of candidate solutions and tries 2.4. Barycenter Algorithm to optimize the answer by using three basic princi- The barycenter algorithm is a heuristic for solv- ples, including selection, crossover, and mutation. ing the edge crossing problem between two lay- For more details on GA, readers are referred to [5]. ers. The main idea is to order nodes on each layer by its barycentric ordering. Assuming that 3. OUR METHOD node u is located on the layer i (u ∈ Vi ), the The major issue for drawing layered layouts of di- barycentric∑ value of node u is defined as bary(u) = rected graphs is that the result of the preceding step (1/|N (u)|) v∈N (u) π(v), where N (u) is the set may restrict that of the subsequent step on the first consisting of u’s connected nodes on u’s below or three steps of Sugiyama’s algorithm. To solve it, we above layer (Vi−1 or Vi+1 ); π(v) is the order of v design a GA that combines the first three steps of in σi−1 or σi+1 . The process in this algorithm is Sugiyama’s algorithm. Figure 5 is the flow chart reordering the relative positions of all nodes accord- of our GA. That is, our method consists of a GA ing to the ordering: layer 2 to layer n and then layer and an x-coordinate assignment step. Note that n − 1 to layer 1 by barycentric values. the barycenter algorithm and the priority method are also used in our method, in which the former is 2.5. Priority Layout Method used in our GA to reduce the edge crossing, while The priority layout method solves the x-coordinate the latter is applied to the x-coordinate assignment assignment problem. Its idea is similar to the step of our method. barycenter algorithm. It assigns the x-coordinate position of each node layer to layer according to the Initialization priority value of each node. At first, these nodes’ x-coordinate positions in each layer are given by xi = x0 + k, where x0 is k Assign dummy nodes i a given integer, and xk is the k-th element of σi . Draw the best Chromosome Terminate? 
Barycenter Next, nodes’ x-coordinate positions are adjusted Fine tune Selection according to the order from layer 2 to layer n, layer n − 1 to layer 1, and layer n/2 to layer n. The im- Mutation Remove dummy nodes provements of the positions of nodes from layer 2 to Crossover layer n are called down procedures, while those from layer n−1 to layer 1 are called up procedures. Based on the above, the priority value of a k-th node v on Figure 5: The flow chart of our genetic algorithm. layer p is defined as: if node v is a dummy node, 1003
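As a rough sketch of the barycenter heuristic described in Section 2.4, the following reorders one layer by the average position of each node's neighbors on the adjacent layer; the data layout (a list of node ids plus a map of neighbor positions) is an assumption made for the example, not the paper's data structure.

```python
def barycenter_order(layer, adjacent_positions):
    """Reorder one layer by barycentric values.

    layer              : node ids of this layer in their current order.
    adjacent_positions : dict mapping each node u to the positions pi(v) of
                         its neighbors on the adjacent layer (above or below).
    """
    def bary(item):
        index, node = item
        positions = adjacent_positions.get(node, [])
        if not positions:
            return float(index)                 # isolated nodes keep their place
        return sum(positions) / len(positions)  # bary(u) = (1/|N(u)|) * sum pi(v)

    return [node for _, node in sorted(enumerate(layer), key=bary)]
```

A full crossing-reduction pass applies this reordering from layer 2 down to layer n and then from layer n − 1 back up to layer 1, recomputing the neighbor positions from the layer just reordered.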
    3.1. Definitions 4. MAIN COMPONENTS OF OUR GA For arranging nodes on layers, if the relative hori- Initialization: For each chromosome, we ran- √ √ zontal positions of nodes are determined, then the domly assign nodes to a ⌈ |V |⌉ × ⌈ |V |⌉ grid. exact x-coordinate positions of nodes are also de- Selection: To evaluate the fitness value of each termined according to the priority layout method. chromosome, we have to compute the number of Hence, in the following, we only consider the rela- edge crossings, which however cannot be computed tive horizontal positions of nodes, and each node is at this step, because the routing of each edge is arranged on a grid. We use GA to model the lay- not determined yet. Hence, some dummy nodes ered layout problem, so define some basic elements: should be introduced to determine the routing of Population: A population (generation) includes edges. In general, these dummy nodes are placed many chromosomes, and the number of chromo- on the best relative position with the optimal edge somes depends on setting of initial population size. crossings between two adjacent layers. Neverthe- Chromosome: One chromosome represents one less, permuting these nodes on each layer for the graph layout, where the absolute position of each fewest edge crossings is an NP-hard problem [4]. (dummy) node on the grid is recorded. Since the Hence, the barycenter algorithm (which is also used adjacencies of nodes and the directions of edges by the CLA) is applied to reducing edge crossings will not be altered after our GA, we do not need on each chromosome before selection. Next, the record the information on chromosomes. On this selection step is implemented by the truncation se- grid, one row represents one layer; a column rep- lection, which duplicates the best (selection rate × resents the order of nodes on the same layer, and population size) chromosomes (1/selection rate) these nodes on the same layer are always placed times to fill the entire population. In addition, we successively. The best-chromosome window reserves use a best-chromosome window to reserve some of the best several chromosomes during all antecedent the best chromosomes in the previous generations generations; the best-chromosome window size ra- as shown in Figure 6. tio is the ratio of the best-chromosome window size Best-Chromosome Window to the population size. Best-Chromosome Window Fitness Function: The ‘fitness’ value in our def- duplicate inition is abused to be defined as the penalty for the bad quality of chromosome. That is, larger ‘fit- Parent Population Child Population Child Population ness’ value implies worse chromosome. Hence, our GA aims to find the chromosome with minimal ‘fit- ness’ value. Some aesthetical criteria to determine Figure 6: The selection process of our GA. the quality of chromosomes (layouts) are given as follows (noticing that these criteria are referred Crossover: Some main steps of our crossover pro- ∑7 from [8] and [9]): f itness value = i=1 Ci × Fi cess are detailed as follows: (1) Two ordered par- where Ci are constants, 1 ≤ i ≤ 7, ∀i; F1 is the to- ent chromosomes are called the 1st and 2nd parent tal edge vertical length; F2 is the number of edges chromosome. W.l.o.g., we only introduce how to pointing upward; F3 is the number of edges point- generate the first child chromosome from the two ing horizontally; F4 is the number of edge crossing; parent chromosomes, and the other child is similar. 
F5 is the number of edge-node crossing; F6 is the (2) Remove all dummy nodes from these two par- degree of layout height over limited height; F7 is ent chromosomes. (3) Choose a half of the nodes the degree of layout width over limited width. from each layer of the 1st parent chromosome and In order to experimentally compare our GA place them on the same relative layers of child chro- with the CLA in [1], the fitness function of our mosome in the same horizontal ordering. (4) The GA is tailored to satisfy the CLA as follows: information on the relative positions of the remain- f itness value = span + weight × edge crossing + ing nodes all depends on the 2nd chromosomes. C6 × F6 + C7 × F7 where we will adjust the weight Specifically, we choose a node adjacent to the small- of edge crossing number in our experiment to rep- est number of unplaced nodes until all nodes are resent the major issue which we want to discuss. placed. If there are many candidate nodes, we ran- 1004
    domly choose one.The layer of the chosen node is Note that the x-coordinate assignment problem equal to base layer plus relative layer, where base (step 4) is solved by the priority layout method layer is the average of its placed connected nodes’ in our experiment. In fact, this step would not layers in the child chromosome and relative layer is affect the span number or edge crossing number. In the relative layer position of its placed connected addition, the second step of Sugiyama’s algorithm nodes’ layers in the 2nd parent chromosome. (5) (layer assignment) is an NP-hard problem when the The layers of this new child chromosome are mod- width of the layered layout is restricted. Hence, ified such that layers start from layer 1. we will respectively investigate the cases when the Mutation: In the mutated chromosome, a node width of the layered layout is limited or not. is chosen randomly, and then the position of the chosen node is altered randomly. 5.1. Experimental Environment Termination: If the difference of average fitness All experiments run on a 2.0 GHz dual core lap- values between successive generations in the latest top with 2GB memory under Java 6.0 platform ten generations is ≤ 1% of the average fitness value from Sun Microsystems, Inc. The parameters of of these ten generations, then our GA algorithm our GA are given as follows: Population size: stops. Then, the best chromosome from the latest 100; Max Generation: 100; Selection Rate: 0.7; population is chosen, and its corresponding graph Best-Chromosome Window Size Ratio: 0.2; Mutate layout (including dummy nodes at barycenter po- Probability: 0.2; C6 : 500; C7 : 500; f itness value = sitions) is drawn. span + weight × edgecrossing + C6 × F6 + C7 × F7 . Fine Tune: Before the selection step or after the termination step, we could tune better chromo- 5.2. Unlimited Layout Width somes according to the fitness function. For ex- ample, we remove all layers which contain only Because it is necessary to limit the layout width dummy nodes but no normal nodes, called dummy and height for L M algorithm, we set both limits layers. Such a process does not necessarily worsen for width and height to be 30. It implies that there the edge crossing number but it would improve are at most 30 nodes (dummy nodes excluded) on the span number. In addition, some unnecessary each layer and at most 30 layers in each layout. If dummy nodes on each edge can also be removed we let the maximal node number to be 30 in our after the termination step, in which the so-called experiment, then the range for node distribution unnecessary dummy node is a dummy node that is equivalently unlimited. In our experiments, we is removed without causing new edge crossings or consider a graph with 30 nodes under three differ- worsening the fitness value. ent densities (2%, 5%, 10%), in which the density is the ratio of edge number to all possible edges, 5. EXPERIMENTAL RESULTS i.e. density = edge number/(|V |(|V | − 1)/2). Let the weight ratio of edge crossing to span be de- To evaluate the performance of our algorithm, our noted by α. In our experiments, we consider five algorithm is experimentally compared with the different α values 1, 3, 5, 7, 9. The statistics for CLA (combing the first two steps of Sugiyama’s the experimental results is given in Table 1. algorithm) using MST MIN AVG as the distance Consider an example of a 30-node graph with function [1], as mentioned in the previous sections. 5% density. 
The layered layout by the LM B algo- For convenience, the CLA using MST MIN AVG rithm, our algorithm under α = 1 and α = 9 are distance function is called as the L M algorithm shown in Figure 7, Figure 8(a) and Figure 8(b), re- (Leveling with MST MIN AVG). The L M algo- spectively. Obviously, our algorithm performs bet- rithm (for step 1 + step 2) and barycenter algo- ter than the LM B. rithm (for step 3) can replace the first three steps in Sugiyama’s algorithm. In order to be compared 5.3. Limited Layout Width with our GA (for step 1 + step 2 + step 3), we con- sider the algorithm combining the L M algorithm The input graph used in this subsection is the same and barycenter algorithm, which is called LM B al- as the previous subsection (i.e., a 30-node graph). gorithm through the rest of this paper. The limited width is set to be 5, which is smaller 1005
    Table 1: Theresult after redrawing random graphs with 30 nodes and unlimited layout width. method measure density =2%density=5%density=10% span 30.00 226.70 798.64 LM B crossing 4.45 57.90 367.00 running time 61.2ms 151.4ms 376.8ms α =1 span 30.27 253.93 977.56 crossing 0.65 38.96 301.75 α =3 span 31.05 277.65 1338.84 crossing 0.67 32.00 272.80 our α =5 span 30.78 305.62 1280.51 GA crossing 0.67 29.89 218.45 α =7 span 32.24 329.82 1359.46 crossing 0.75 26.18 202.53 (a) α = 1 (b) α = 9 α =9 span 31.65 351.36 1444.27 crossing 0.53 24.89 200.62 (span: 188, crossing: 30)(span: 238, crossing: 14) running time 3.73s 17.32s 108.04s Figure 8: Layered layouts by our GA. Table 2: The result after redrawing random graphs with 30 nodes and limited layout width 5. method measure density =2%density=5%density=10% span 28.82 271.55 808.36 LM B crossing 5.64 59.09 383.82 running time 73.0ms 147.6ms 456.2ms Figure 7: Layered layout by LM B (span:262, α =1 span 32.29 271.45 1019.56 crossing:38). crossing 0.96 39.36 292.69 α =3 span 31.76 294.09 1153.60 crossing 0.80 33.16 232.76 our α =5 span 31.82 322.69 1282.24 GA crossing 0.82 30.62 202.31 than the square root of the node number (30), be- α =7 span 32.20 351.00 1369.73 cause we hope the results under limit and unlimited crossing 0.69 27.16 198.20 α =9 span 33.55 380.20 1420.31 conditions have obvious differences. The statistics crossing 0.89 24.95 189.25 for the experimental results under the same settings running time 3.731s 3.71s 18.07s in the previous subsection is given in Table 2. Consider an example of a 30-node graph with 5% density. The layered layout for this graph by the our GA may produce simultaneously span and edge LM B algorithm, our algorithm under α = 1 and crossing numbers both smaller than that by LM B. α = 9 are shown in Figure 9, Figure 10(a) and Fig- ure 10(b), respectively. Obviously, our algorithm Moreover, we discovered that under any condi- also performs better than the LM B. tions the edge crossing number gets smaller and the span number gets greater when increasing the weight of edge crossing. It implies that we may ef- 5.4. Discussion fectively adjust the weight between edge crossings Due to page limitation, only the case of 30-node and spans. That is, we could reduce the edge cross- graphs is included in this paper. In fact, we con- ing by increasing the span number. ducted many experiments for various graphs. Be- Under limited width condition, because the re- sides those results, those tables and figures show sults of L M are restricted, its span number should that under any conditions (node number, edge den- be larger than that under unlimited condition. sity, and limited width or not) the crossing number However, there are some unusual situations in our by our GA is smaller than that by LM B. How- GA. Although the results of our GA are also re- ever, the span number by our GA is not neces- stricted under limited width condition, its span sarily larger than that by LM B. When the layout number is smaller than that under unlimited width width is limited and the node number is sufficiently condition. Our reason is that the limited width small (about 20 from our experimental evaluation), condition may reduce the possible dimension. In 1006
this problem, the dimension represents the positions at which nodes could be placed. Furthermore, if the dimension is smaller, then our GA can converge to a better result more easily.

Figure 9: Layered layout by the LM B algorithm (span: 288, crossing: 29) with limited layout width = 5.
Figure 10: Layered layouts by our GA. (a) α = 1 (span: 252, crossing: 29); (b) α = 9 (span: 295, crossing: 14).

6. CONCLUSIONS

This paper has proposed an approach for producing layered layouts of directed graphs, which uses a GA to simultaneously consider the first three steps of the classical Sugiyama algorithm (consisting of four steps) and applies the priority layout method for the fourth step. Our experimental results revealed that our GA may efficiently adjust the weighting ratios among all aesthetic criteria.

ACKNOWLEDGEMENT

This study is conducted under the "Next Generation Telematics System and Innovative Applications/Services Technologies Project" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.

REFERENCES

[1] C. Bachmaier, F. Brandenburg, W. Brunner, and G. Lovász. Cyclic leveling of directed graphs. In Proc. of GD 2008, volume 5417 of LNCS, pages 348–359, 2008.
[2] H. do Nascimento and P. Eades. A focus and constraint-based genetic algorithm for interactive directed graph drawing. Technical Report 533, University of Sydney, 2002.
[3] T. Eloranta and E. Mäkinen. TimGA: A genetic algorithm for drawing undirected graphs. Divulgaciones Matemáticas, 9(2):55–171, 2001.
[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4(3):312–316, 1983.
[5] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
[6] P. Kuntz, B. Pinaud, and R. Lehn. Minimizing crossings in hierarchical digraphs with a hybridized genetic algorithm. Journal of Heuristics, 12(1-2):23–36, 2006.
[7] E. Mäkinen and M. Sieranta. Genetic algorithms for drawing bipartite graphs. International Journal of Computer Mathematics, 53:157–166, 1994.
[8] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501–516, 2002.
[9] K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109–125, 1981.
[10] J. Utech, J. Branke, H. Schmeck, and P. Eades. An evolutionary algorithm for drawing directed graphs. In Proc. of CISST'98, pages 154–160. CSREA Press, 1998.
[11] Q.-G. Zhang, H.-Y. Liu, W. Zhang, and Y.-J. Guo. Drawing undirected graphs with genetic algorithms. In Proc. of ICNC 2005, volume 3612 of LNCS, pages 28–36, 2005.
    Structured Local BinaryHaar Pattern for Graphics Retrieval Song-Zhi Su, Shu-Yuan Chen*, Shang-An Li Cognitive Science Department of Xiamen University, Department of Computer Science and Engineering of Fujian Key Laboratory of the Brain-like Intelligent Yuan Ze University, Taiwan Systems (Xiamen University), Xiamen, China *correspondence author, [email protected] [email protected] Der-Jyh Duh Shao-Zi Li Department of Computer Science and Information Cognitive Science Department of Xiamen University, Engineering, Ching Yun University, Taiwan Fujian Key Laboratory of the Brain-like Intelligent [email protected] Systems (Xiamen University), Xiamen, China [email protected] Abstract—Feature extraction is an important issue in graphics histogram indexing structure to addresses two issues in shape retrieval. Local feature based descriptors are currently the retrieval problem: perceptual similarity measure on partial predominate method used in image retrieval and object query and overcoming dimensionality curse and adverse recognition. Inspired by the success of Haar feature and Local environment. Chalechale et al. [6] proposed a sketch-based Binary Pattern (LBP), a novel feature named structured local image retrieval system, in which feature extraction for binary Haar pattern (SLBHP) is proposed for graphics matching purpose is based on angular partitioning of two retrieval in this paper. SLBHP encodes the polarity instead of abstract images which are obtained from the model image the magnitude of the difference between accumulated gray and from the query image. The angular-spatial distribution of values of adjacent rectangles. Experimental results on graphics pixels in the abstract images is scale and rotation invariant retrieval show that the discriminative power of SLBHP is better than those of using edge points (EP), Haar feature, and and robust against translation by using the Fourier transform. LBP even in noisy condition. Most existing graphics retrieval adopting contour-based [4] [5] rather than pixel-based approaches [6]. Since the Keywords-graphics retrieval; structured local binary haar contour-based method is concerned with a lot of curves and pattern; Haar; local binary pattern; lines, it is computational intensive. Thus it is the goal of this paper to propose a pixel-based graphics retrieval using novel structured local binary Haar pattern. I. INTRODUCTION This paper is organized as follows. The original Haar With the advent of computing technology, media and LBP feature is described in Section 2. Proposed SLBHP acquisition/storage devices, and multimedia compression feature is described in Section 3. Experimental results and standards, more and more digital data are generated and performance comparison are given in Section 4. Finally, available to user all over the world. Nowadays, it is easy to conclusions are given in Section 5. access electronic books, electronic journals, web portals, and video streams. Hence, it will be convenient for users to II. LOCAL BINARY PATTEN AND HAAR FEATURE provide an image retrieval system for browsing, searching and retrieving images from a large database of digital images. A. Local Binary Pattern Traditional systems add some metadata such as caption, Local feature based approaches have got great success keywords, descriptions or annotation of images so that in object detection and recognition in recent years. The retrieval can be converted into a text retrieval problem. original LBP descriptor was proposed by Ojala et al. 
[7], However, manual annotation is time-consuming, and was proved a powerful means for texture analysis. LBP laborious and expensive. There are a lot of works on content- encode local primitives including different types of curved based image retrieval (CBIR) [1] [2] [3], which is also called edges, spots, flat areas, etc. The advantage of LBP was query by image content. “Content-based” means that the invariant to monotonic changes in gray scale. So LBP is retrieval process will utilize and analyze the actual contents of the image, which might refer to colors, shapes, textures, or widely used in face recognition [8], pedestrian detection [9], any other information that can be derived from the images and many other computer vision applications. themselves. The basic LBP operator assigns a label to every pixel Unfortunately, although there are many content-based of an image by thresholding the 3 × 3-neighborhood and retrieval methods for image databases, few of them are considering the results as a binary number. Then the specifically designed for graphics. Huet et al. [4] exploit both histogram of labels can be used as descriptor of local geometric attributes and structural information to construct a regions. See Figure 1(a) for an illustration of the basic LBP shape histogram for retrieving line-patterns from large operator. databases. Chi et al. [5] proposed an approach to combine a local-structure-based shape representation and a new 1008
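As a concrete illustration of the basic operator just described, the sketch below labels a pixel by thresholding its 3×3 neighborhood against the center value and collects the labels of a region into a histogram; the helper names and the use of NumPy are assumptions made for the example, not the code of [7].

```python
import numpy as np

# Offsets of the 8 neighbors, enumerated clockwise from the top-left.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_label(gray, y, x):
    """Basic LBP label of pixel (y, x): threshold the 3x3 neighborhood
    against the center and read the 8 bits as a binary number."""
    center = gray[y, x]
    code = 0
    for i, (dy, dx) in enumerate(NEIGHBORS):
        if gray[y + dy, x + dx] >= center:   # b_i = 1 when neighbor >= center
            code |= 1 << i                   # weight w_i = 2^i
    return code

def lbp_histogram(gray):
    """Histogram of LBP labels over an image region (the local descriptor)."""
    h, w = gray.shape
    hist = np.zeros(256, dtype=np.int64)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            hist[lbp_label(gray, y, x)] += 1
    return hist
```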
Figure 1. Illustration of LBP and Haar. (a) The basic LBP operator; (b) four types of Haar feature.

B. Haar Feature

A simple rectangular Haar feature can be defined as the difference between the accumulated sums of pixels of the areas inside the rectangles, which can be at any position and scale within the given image. Oren et al. [10] first used 2-rectangle features in pedestrian classification. Viola and Jones [11] extended them to 3-rectangle and 4-rectangle features in the Viola–Jones object detection framework for faces and pedestrians. The difference values indicate certain characteristics of a particular area of the image. The Haar feature encodes low-frequency information, and each feature type can indicate the existence of certain characteristics in the image, such as vertical or horizontal edges or changes in texture. Haar features can be computed quickly using the integral image [11]. The integral image is an intermediate representation of the image with which all rectangular two-dimensional image features can be computed rapidly. Each element of the integral image contains the sum of all pixels located in the up-left region of the original image. Given the integral image, any rectangular sum of pixel values aligned with the coordinate axes can be computed with four array references.

C. A New Sight into LBP, Haar, and Gradient

The decimal form of the resulting 8-bit LBP code can be expressed as follows:

LBP(x, y) = Σ_{i=0}^{7} w_i b_i(x, y)

where w_i = 2^i and b_i(x, y) = 1 if Haar_i(x, y) > T, and 0 otherwise. As shown in Figure 2, each component of LBP is actually a binary 2-rectangle Haar feature with rectangle size 1 × 1. Even the gradient can be seen as a combination of Haar features. For example,

I_x = Haar_0 + Haar_4,   I_y = Haar_2 + Haar_6

where I_x and I_y are the gradients along the x axis and the y axis with filters [1, −2, 1] and [1, −2, 1]^T, respectively.

Figure 2. LBP can be seen as a weighted combination of binary Haar features.
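The four-reference rectangle sum over the integral image mentioned in Section II-B can be sketched as follows; this is a minimal illustration under the usual convention of padding the integral image with a leading row and column of zeros, not the paper's implementation.

```python
import numpy as np

def integral_image(gray):
    """ii[y, x] = sum of gray[0:y, 0:x]; the extra row/column of zeros
    keeps the border arithmetic simple."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(gray.astype(np.int64), axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside an axis-aligned rectangle, obtained from
    four references into the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def haar_2rect_horizontal(ii, top, left, height, width):
    """A simple 2-rectangle Haar response: left block minus right block."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```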
III. STRUCTURED LOCAL BINARY HAAR PATTERN

A. SLBHP

In this paper, based on an idea similar to the multi-block local binary pattern features [12, 13], a descriptor named Structured Local Binary Haar Pattern (SLBHP) is derived from LBP with Haar features. The proposed SLBHP adopts four types of Haar features, which capture the changes of gray values along the horizontal direction, the vertical direction, and the diagonals, as shown in Figure 3(a). However, only the polarity of the Haar feature is used in SLBHP, while the magnitude is discarded. Note that the number of encoding patterns is thereby reduced from 256 for LBP to 16 for SLBHP. Moreover, SLBHP encodes the spatial structure of two adjacent rectangle regions in four directions. Thus, compared to LBP, SLBHP has more compact encoding patterns and incorporates more semantic structure information.

Figure 3. An example of SLBHP. (a) Four Haar features; (b) corresponding Haar features with overlapping; (c) an example of computing SLBHP values.

Let a_i, i = 0, 1, …, 8, denote the corresponding gray values of a 3×3 window with a_0 at the center pixel (x, y), as shown in Figure 3(a). The value of the SLBHP code of a pixel (x, y) is given by the following equation:

SLBHP(x, y) = Σ_{p=1}^{4} B( H_p ⊗ N(x, y) ) × 2^p

where

N(x, y) = [ a_1  a_2  a_3
            a_8  a_0  a_4
            a_7  a_6  a_5 ]

H_1 = [  1  1  0     H_2 = [  0  1  1     H_3 = [  1  1  1     H_4 = [ −1  0  1
         1  0 −1             −1  0  1              0  0  0             −1  0  1
         0 −1 −1 ]           −1 −1  0 ]            −1 −1 −1 ]           −1  0  1 ]

and B(x) = 1 if |x| > T and 0 otherwise, with T a threshold (15 in our experiments). By this binary operation the feature becomes more robust to global lighting changes. Note that H_p denotes a Haar-like basis function and H_p ⊗ N(x, y) denotes the difference between the accumulated gray values of the black and red rectangles shown in Figure 3(c). Unlike the traditional Haar feature, here the rectangles overlap by one pixel. Inspired by LBP and by the fact that a single binary Haar feature might not have enough discriminative power, we combine these binary features just as LBP does. Figure 3(c) shows an example of the SLBHP feature. The SLBHP feature extends the merits of both the Haar feature and LBP, and it encodes the most common structure information of graphics. Moreover, SLBHP has dimension 16, smaller than the 256 dimensions of LBP, while being more immune to noise, since the Haar feature uses more pixels at a time.

B. SLBHP for Graphics Retrieval

After the SLBHP value is computed, the histogram of SLBHP over a region R is computed by the following equation:

H(i) = Σ_{(x,y)∈R} I{ SLBHP(x, y) = i },   where I{A} = 1 if A is true and 0 if A is false.

The histogram H contains information about the distribution of the local patterns, such as edges, spots, and flat areas, over the image region R. In order to make SLBHP robust to slight translation, a graphics photo is divided into several small spatial regions ("blocks"); an SLBHP histogram is computed for each block, and the histograms are then concatenated to form the representation of the graphics, as shown in Figure 4. For better invariance to illumination, it is useful to contrast-normalize the local responses in each block before using them; experiments showed that L2NORM gives better results than L1NORM and L1SQRT. Similar to other popular local-feature-based object detection methods, the detection windows are tiled with a dense (overlapping) grid of SLBHP descriptors, and the overlap is half of the whole block.

Figure 4. An example of SLBHP histograms for graphics retrieval.
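A direct per-pixel sketch of the SLBHP code, its block histogram, and a chi-square distance for histogram matching, following the equations above, might look as follows; the naive loops, the 32-bin layout, and the helper names are assumptions for clarity, not the authors' optimized implementation.

```python
import numpy as np

T = 15  # polarity threshold used in the paper's experiments

# The four Haar-like basis functions H_1..H_4 from Section III-A.
H = [np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
     np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
     np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
     np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]])]

def slbhp_code(gray, y, x):
    """SLBHP(x, y) = sum_p B(H_p (.) N(x, y)) * 2^p over the 3x3 window N."""
    n = gray[y - 1:y + 2, x - 1:x + 2].astype(int)
    code = 0
    for p, h in enumerate(H, start=1):
        response = int(np.sum(h * n))   # H_p (.) N(x, y)
        if abs(response) > T:           # B(.) binarizes the response against T
            code += 2 ** p
    return code

def slbhp_histogram(gray, region):
    """Histogram of SLBHP codes over one block (region must stay one pixel
    inside the image border); block histograms are later concatenated and
    contrast-normalized to describe a whole graphics photo."""
    top, left, height, width = region
    hist = np.zeros(32, dtype=np.int64)  # codes are even values 0..30
    for y in range(top, top + height):
        for x in range(left, left + width):
            hist[slbhp_code(gray, y, x)] += 1
    return hist

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms, used for matching."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```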
Figure 5. Some query results for the graphics database. (a) Query graphics; (b) the three most similar graphics ordered by similarity value; the one with the red rectangle is the ground-truth match.

IV. EXPERIMENTAL RESULTS

479 electronic files of graphics were collected to construct the database for the retrieval experiments. The test images comprise 479 graphics photos taken by a digital camera, to which noise is then added to obtain noisy test images. The performance of graphics retrieval is measured by the retrieval accuracy, computed as the ratio of the number of graphics correctly retrieved to the total number of queries. Moreover, not only the retrieval accuracy with respect to the first rank but also that with respect to the second and third ranks is considered in our experiments. The retrieval accuracies for the different approaches are listed in Tables 1 through 4 with block sizes from 8×8 to 32×32, and the retrieval accuracy for the non-overlapping case is also listed in Table 4. By comparing Tables 1 and 4, we found that overlapping yields higher retrieval accuracy. It is noted that the proposed method and the approaches using EP [6] and LBP all adopt histogram-based matching. However, for the Haar feature, the four computed Haar values for each block are normalized and then concatenated to form the representation, and the Chi-square distance is also adopted as the similarity measure for the Haar feature.
    In our experiment,we found that chi-square is a better distance. Some retrieval results are shown in Figure 5. similarity for histogram-based matching than Euclidean TABLE I. RETRIEVAL ACCURACIES OF EDGE POINTS (EP), LBP, HAAR, AND SLBHP WITH HALF-OVERLAPPING BLOCKS. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 85.2 70.4 83.7 88.3 91.6 79.5 90.6 95.6 93.3 82.5 92.5 96.5 32x16 83.3 62.3 68.9 88.5 91.4 74.9 76.0 94.6 93.5 78.3 78.5 95.7 16x32 86.8 66.6 60.8 90.2 92.9 76.0 68.7 95.6 94.2 80.0 72.2 96.7 16x16 85.0 58.2 62.4 89.4 92.3 66.8 70.1 94.4 94.4 69.3 73.3 95.8 16x8 81.2 42.0 37.4 86.6 89.8 51.8 43.4 91.9 91.2 55.5 45.9 93.7 8x16 83.3 45.3 29.0 86.6 90.6 55.1 36.5 92.5 92.9 57.8 40.7 94.8 8x8 79.3 30.5 29.2 82.7 86.8 39.5 34.9 89.3 89.8 44.5 39.2 91.2 TABLE II. RETRIEVAL ACCURACIES UNDER GAUSSIAN NOISE WITH VARIANCE 50 AND PERTURBATION 1%. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 63.88 71.19 83.09 82.46 74.53 78.91 90.40 90.81 77.87 84.13 92.48 93.95 32x16 71.61 65.76 68.48 85.18 79.54 75.16 75.78 93.53 84.76 79.54 78.71 94.57 16x32 72.44 67.22 60.96 87.06 79.54 76.41 68.27 93.53 83.72 81.21 72.03 94.99 16x16 78.08 59.92 62.42 88.31 85.39 68.27 69.52 93.74 89.14 72.65 73.70 94.99 16x8 79.96 43.63 37.58 86.01 87.89 52.61 43.42 92.07 89.98 55.74 45.72 93.95 8x16 79.12 47.60 29.02 86.85 88.31 54.91 36.33 93.11 91.44 59.08 41.34 94.99 8x8 81.00 31.52 29.23 83.72 87.68 40.08 34.24 90.40 90.61 44.89 38.62 92.48 TABLE III. RETRIEVAL ACCURACIES UNDER SALT AND PEPPER NOISES WITH PERTURBATION 0.5%. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 15.24 70.77 83.51 84.76 19.83 79.33 91.02 92.48 25.05 82.88 92.48 94.78 32x16 20.46 64.93 68.48 86.01 27.97 75.57 75.79 94.15 39.25 79.33 78.50 95.62 16x32 22.55 67.43 60.96 88.10 27.35 76.20 68.48 94.15 34.66 80.17 72.44 95.62 16x16 37.37 59.71 61.59 88.52 46.97 67.85 68.89 93.95 61.38 71.19 73.28 95.41 16x8 55.95 42.80 36.74 86.22 67.22 52.40 43.01 92.28 73.90 55.95 45.30 93.32 8x16 54.28 47.39 28.60 87.27 67.43 54.90 36.12 93.32 78.08 58.87 40.71 94.57 8x8 70.35 31.11 29.02 83.51 81.00 40.29 34.86 89.77 84.55 45.09 39.25 92.28 TABLE IV. RETRIEVAL ACCURACIES WITH NON-OVERLAPPING BLOCKS. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 82.04 70.98 69.73 87.68 90.40 77.66 78.29 94.57 92.49 81.42 82.88 95.82 32x16 79.75 64.09 61.80 87.27 88.10 72.65 68.89 93.53 89.98 75.57 71.61 94.78 16x32 82.25 66.18 53.44 89.14 89.77 74.95 61.17 94.78 90.81 78.50 64.30 96.24 16x16 81.00 57.20 57.83 88.94 88.94 66.18 65.76 93.11 91.23 69.31 69.10 94.36 16x8 78.50 41.34 29.23 84.97 87.06 51.57 36.33 91.23 89.35 54.90 40.08 92.48 8x16 79.96 43.01 23.38 86.01 88.52 51.77 29.44 91.23 91.65 55.11 32.57 93.53 8x8 75.99 29..22 27.97 83.30 84.97 39.25 33.83 88.31 88.31 42.17 36.12 90.61 V. CONCLUSION ACKNOWLEDGMENT A novel local feature SLBHP, combining the merits of This work was partially supported by National Science Haar and LBP, is proposed in this paper. The effectiveness Council of Taiwan, under Grants NSC 99-2221-E-155-072, of SLBHP has been proven by various experimental results. National Nature Science Foundation of China under Grants Moreover, compared to the other approaches using EP, Haar 60873179, Shenzhen Technology Fundamental Research and LBP descriptors, SLBHP is superior even in the noisy Project under Grants JC200903180630A, and Doctoral conditions. 
Further research can be directed to extend the Program Foundation of Institutions of Higher Education of proposed graphics retrieval for slide retrieval or e-learning China under Grants 20090121110032. video retrieval using graphics as query keywords. REFERENCES 1011
    [1] R. Datta, D. Joshi, J. Lia, and J. Z. Wang, “Image retrieval: ideas, influences, and trends of the new age,” ACM Computing Surveys, 2008, vol. 40, no.2, Atricle. 5, pp. 1–60. [2] J. Deng, W. Dong, R. Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of Computer Vision and Pattern Recognition, 2009. [3] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: a large dataset for non-parametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, vol. 30, no.11, pp. 1958- 1970. [4] B. Huet and E. R. Hancock, “Line pattern retrieval using rational histograms,” IEEE Transaction on Pattern Analysis and Machine Intelligence, 1999, vol.12, no.12, pp. 1363-1370. [5] Y. Chi and M.K.H. Leung, “ALSBIR: A local-structure-based image retrieval,” Pattern Recognition, 2007, vol. 40, pp. 244-261. [6] A. Chalechale, G. Naghdy and A. Mertins, “Sketch-based image matching using angular partitioning,” IEEE Transactions on Systems, Man, and Cybernetics –Part A: Systems and Humans, 2005, vol. 35, no. 1, pp.28-41. [7] T. Ojala, M. Pietikainen, and D. Harwood, “A comparative study of texture measures with classification based on featured distribution,” Pattern Recognition, 1996, vol. 29, no. 1, pp.51-59. [8] T. Ahonen, A. Hadid, and M. Pietikinen. “Face description with local binary patterns, application to face recognition,” IEEE Transaction on Pattern Analysis and Machine Intelligence, 2006, vol.28, no. 12, pp. 2037-2041. [9] X. Wang, T. X. Han, and S. Yan, “An HOG-LBP human detector with partial occlusion handling,” In: Proceedings of Internation Conference on Computer Vision, 2009. [10] M. Oren, C. Papageorion, P. Sinha, et al, “Pedestrian detection using wavelet templates,” In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 1997. [11] P. Viola and M Jones, “Robust real-time face detection,” International Journal of Computer Vision, 2004, vol. 57, no. 2, pp. 137-154. [12] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, ‘Face detection based on multi-block LBP representation’, Proc. Int. Conf. on Biometrics, 2007. [13] S. Yan, S. Shan, X. Chen, and W. Gao, ‘Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection’, Proc. Int. Conf. Computer Vision and Pattern Recognition, 2008. 1012
    IMAGE-BASED INTELLIGENT ATTENDANCELOGGING SYSTEM Hary Oktavianto1, Gee-Sern Hsu2, Sheng-Luen Chung1 1 Department of Electrical Engineering 2 Department of Mechanical Engineering National Taiwan University of Science and Technology, Taipei, Taiwan E-mail: [email protected] Abstract— This paper proposes an extension of the surveillance camera’s function as an intelligent attendance logging system. The system works like a time recorder. Based on sitting and standing up events, the system was designed with learning phase and monitoring phase. The learning phase learns the environment to locate the sitting areas. After a defined time, the system switches to the monitoring phase which monitors the incoming occupants. When the occupant sits at the same Fig. 1. Occupant’s working room (left) and a map consists of occupants’ location with the sitting area found by the learning sitting areas (right). phase, the monitoring phase will generate a sitting-time report. A leaving-time report is also generated when an working area of the occupant does and when the occupant occupant stands up from his/her seat. This system works. employs one static camera. The camera is placed 6.2 The diagram flow of the proposed system is shown in Fig. 2. The system consists of an object segmentation unit, a meters far, 2.6 meters high, and facing down 21° from tracking unit, learning phase, and monitoring phase. A fixed horizontal. The camera’s view is perpendicular with the static camera is placed inside the occupants’ working room. working location. The experimental result shows that the The images taken by the camera are pre-processed by the system can achieve a good result. object segmentation unit to extract the foreground objects. The connected foreground object is called as blob. These Keywords Activity Map; Attendance; Logging system; blobs are processed further in the tracking unit. Once the Learning phase; Monitoring phase; Surveillance Camera; system detected the blob as an occupant, the system keeps tracking the occupant in the scene using centroid, ground I. INTRODUCTION position, color, and size of the occupant as the simple Intelligent buildings have increased as a research topic tracking features. The learning phase has responsibility to recently [1], [2], [3]. Many buildings are installed with learn the environment and constructs a map as the output. surveillance cameras for security reasons. This paper extends The monitoring phase uses the map to monitor whether the the function of existing surveillance cameras as an intelligent occupants are present in their working desk or not. The attendance logging system. The purpose is to report the report on the presence or the absence of the occupants is the occupant’s attendance. The system works like a time final output of the system for further analysis. The system is recorder or time clock. A time recorder is a mechanical or implemented by taking the advantages of the existing open electronics timepiece that is used to assist in tracking the source library for computer vision, OpenCV [5] and cvBlob hours an employee of a company worked [4]. Instead of [6]. spending more budgets to apply those timepieces, the The contributions of this paper are: surveillance camera can be used to do the same function. The (1) Learning mechanism that locates seats in an unknown system is so called intelligent since it learns from a given environment. environment automatically to build a map. 
A map consists of (2) Monitoring mechanism that detects in entering and sitting areas of the occupants. Sitting area is the space leaving events of occupants. information about where are the locations of the occupant’s (3) Integrating system with real-time performance up to 16 working desk. So, there is no need to select the area of fps, ready for context-aware applications. occupant’s working area manually. Fig. 1 shows an example This paper is organized with the following sections. The scenario. Naturally, the occupant enters into the room and problem definition and the previous researches as related sits to start working. Afterward, the occupant stands up from works are reviewed in Section II. Section III describes the his/her seat and leaves the room. The sitting and standing up technical overview of the proposed solution. Section IV events will be used by the system to decide where the explains about the tracking that is used to keep tracking of the occupants during their appearance in a scene based on the 1013
    information from theprevious frame. The learning phase and the monitoring phase are explained in Section V. Section VI explains the experiments’ setup, result, and discussion. Finally, the conclusions are summarized in Section VII. II. PROBLEM DEFINITION AND RELATED WORK This section describes the problem definition and the previous works related to the intelligent attendance logging system. A. Problem Definition The goal of this paper is to design an image-based intelligent attendance logging system. Given a fixed static camera as an input device inside an unknown working environment with a number of fixed seats, each of them belong to a particular user or occupant. Occupant enters and leaves not necessarily at the same time. We are to design a camera-equipped intelligent attendance logging system, such Fig. 2. Diagram flow of the system. that, the system can report in real-time each occupant’s entering and leaving events to and from his/her particular provide the vocabulary to categorize past and present seat. activity, predict future behavior, and detect abnormalities. The system is designed based on two assumptions. The The researches above detect occupants and build a map first assumption is the environment is unknown, in that, the consists of locations where those people mostly occupy. This number of seat and the location of these seats are not known paper extends the advantages of the surveillance cameras to before the system monitors. The second assumption is each monitor the occupant’s presence. A static camera is used by occupant has his/her own seat, as such, detecting the the system as in [2], [8]. Morris and Trivedi applied presence/absence of a particular seat amounts to answering omnidirectional camera [9] to their system. The other the presence/absence of that corresponding occupant. researchers [1], [3], [7] used stereo camera to reduce the There are two performance criteria to evaluate the system effect of lighting intensity and occlusion. It is intended that regarding to the main functions of the system. The main the system in this paper works in real time and has the functions of the system are to find the sitting area and to capability to learn the environment automatically from report the monitoring result. The first criterion is the system observed behavior. should find the sitting areas given by the ground truth. The second criterion is the system should be able to monitor the occupants during their appearance in the scene to generate III. TECHNICAL OVERVIEW the accurate report. As shown in Fig. 2 and the detail in Fig. 3, the input B. Related Work images acquired from the camera are fed into the object segmentation unit to extract the foreground object. During the past decades, intelligent building has been Foreground object is the moving object in a scene. The developed. Zhou et al [3] developed the video-based human foreground object is obtained by subtracting the current indoor activity monitoring that is aimed to assist elderly. image with the background image. To model the background Demirdjian et al [7] presented a method for automatically estimating activity zone based on observed user behaviors in image, Gaussian Mixture Model (GMM) is used. GMM represents each background pixel variation with a set of an office room using 3D person tracking technique. They weighted Gaussian distributions [10], [11], [12], [13]. The used simple position, motion, and shape features for first frame will be used to initialize the mean. A pixel is tracking. 
This activity zone is used at run time to contextualize user preferences, e.g., allowing “location- decided as the background if it falls into a deviation around the mean of any of the Gaussians that model it. The update sticky” settings for messaging, environmental controls, process, which is performed in the current frame, will and/or media delivery. Girgensohn, Shipman, and Wilcox [8] increase the weight of the Gaussian model that is matched to thought that retail establishments want to know about traffic the pixel. By taking the difference between the current image flow in order to better arrange goods and staff placement. and the background image, the foreground object is obtained. They visualized the results as heat maps to show activity and object counts and average velocities overlaid on the map of After that, the foreground object is converted from RGB the space. Morris and Trivedi [9] extracted the human color image to gray color image [13]. The edges of the activity. They presented an adaptive framework for live objects in the gray color image are extracted by applying video analysis based on trajectory learning. A surveillance edge detector. The edge detector uses moving frame scene is described by a map which is learned in unsupervised algorithm. Moving frame algorithm has four steps. Step one, fashion to indicate interesting image regions and the way the gray color image (I) is shifted to eight directions using objects move between these places. These descriptors the fixed distance in pixel unit (dx and dy), resulting eight 1014
    images with anoffset to right, left, up, down, up right, up left, down right, and down left, respectively. Those eight shifted images with offset are called moving frame images (Fi ). Fi ( x , y )  I ( x  dxi , y  dyi ) (1) Step two, each of moving frames image is updated (F*) by making subtraction to the image frame (I) to get the extended edges. Fi ( x , y )  I ( x , y )  Fi ( x , y ) (2) Step three, each of moving frames is converted to binary by applying a threshold value (TF). Fig. 3. The detail of the object segmentation unit and the tracking unit. FiT ( x , y )  f T ( Fi ) (3) where Biy and Bjy are the y-coordinate of of blob-i and blob- j, respectively, ci and cj are the centroid of each blob. If 1 if ( Fi* ( x , y ))  TF those three conditions satisfy (6) then the broken blobs are fT   0 otherwise grouped. Finally, all of moving frame images are added together. As BI    TC   Bdy  TD  B A  TA   1 the result, the edges of the image (E) are obtained. G (6) 0 otherwise E( x , y )   FiT ( x , y ) (4) i TC, TD, and TA are the threshold values for the intersection distance, the nearest vertical distance of blobs, and the angle Edge detector extracts the object while removes the weak of blobs, respectively. In the experiments, TC is 0 pixel, TD is shadows at the same time since weak shadows do not have 50 pixels, and TA is 30°. edges. However, strong shadows may happen and create After the broken blobs are grouped into one, the motion some edges. Strong edges appear between legs can still be detector will test the blob whether it is an occupant or not. tolerated since the system does not consider about occupant’s The blob is an occupant if the size of the blob looks like a contour. human and the blob has movement. A minimum size of The result from the edge detection process is refined by human is an approximation relative to the image size. X-axis using morphology filters [13]. Dilation filter is applied twice displacement and optical flow [13] are used to detect the to join the edges and erosion filter is used once to remove the movement of the blob. If a blob is detected as an occupant noises. The last step in the object segmentation unit is then the tracking unit gives a unique identification (ID) connected component labeling. The connected component number and a track the occupant. A track is an indicator that labeling is used to detect the connected region. The a blob is an occupant, and it is represented by a bounding connected region is so called as blob. In the object box. Tracking rules are implemented as states to handle each segmentation unit, the GMM, the gray color conversion, the event. There are five basic states; entering state, people state, edge detector, and the morphology filters are implemented sitting state, standing up state, and leaving state. During the using OpenCV library while the connected component tracking, the occlusion problem may happen. Two more labeling is implemented using cvBlob library. states are added. They are merge state and split state. In the The blob that represents the foreground object may be tracking unit, optical flow implements OpenCV library while broken due to the occlusion with furniture or having the the tracking rules employ cvBlob library. same color with the background image. Some rules to group The learning phase is activated if the map has not the broken blob are provided. There are three conditions to constructed yet. The sitting state in the tracking unit triggers examine the broken blobs. 
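A compact sketch of the moving-frame edge extraction described by Eqs. (1)-(4), together with the dilate-twice/erode-once refinement, is given below, assuming an OpenCV implementation. The shift distance d and the threshold TF are illustrative values, since their exact settings are not quoted in the text.

```cpp
#include <opencv2/opencv.hpp>

// Eqs. (1)-(4): shift the gray image in eight directions, subtract from the
// original, binarize each difference with threshold TF, and accumulate.
cv::Mat movingFrameEdges(const cv::Mat& gray, int d = 2, double TF = 20.0) {
    const int dx[8] = { d, -d, 0,  0,  d, -d,  d, -d };
    const int dy[8] = { 0,  0, d, -d, -d, -d,  d,  d };
    cv::Mat edges = cv::Mat::zeros(gray.size(), CV_8U);

    for (int i = 0; i < 8; ++i) {
        // Eq. (1): F_i(x, y) = I(x - dx_i, y - dy_i), a pure translation.
        cv::Mat M = (cv::Mat_<double>(2, 3) << 1, 0, dx[i], 0, 1, dy[i]);
        cv::Mat shifted, diff, binary;
        cv::warpAffine(gray, shifted, M, gray.size());

        // Eq. (2): the (absolute) difference keeps the extended edges.
        cv::absdiff(gray, shifted, diff);

        // Eq. (3): binarize the moving-frame image with threshold TF.
        cv::threshold(diff, binary, TF, 1, cv::THRESH_BINARY);

        // Eq. (4): sum the eight binary moving frames.
        edges += binary;
    }
    cv::threshold(edges, edges, 0, 255, cv::THRESH_BINARY);

    // Refinement described in the text: dilate twice to join the edges,
    // erode once to remove isolated noise.
    cv::Mat k = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::dilate(edges, edges, k, cv::Point(-1, -1), 2);
    cv::erode(edges, edges, k, cv::Point(-1, -1), 1);
    return edges;
}
```

Connected-component labeling (e.g. with cvBlob, as in the system) can then be run on the returned edge map to obtain the blobs.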
The first is the intersection the learning phase to locate the occupant’s sitting area. After distance of blobs (BI). The second is the nearest vertical a defined time, the learning phase finished its job and the monitoring phase is activated. In this phase, the sitting state distance of blobs (Bdy). The third is the angle of blobs (BA) and the standing up state in the tracking unit trigger the from their centroids. Bdy and BA are calculated using (5) monitoring phase to generate reports. The reports tell when while BI is explained in [14]. the occupant sit and left. Bdy  min( Bi . y , B j . y ) The system will be evaluated by testing it with some (5) video clips. There are two scenarios in the video clips. Five B A  ( ci ,c j ) occupants are asked to enter the scene. They sit, stand up, leave the scene and sometimes cross each other. 1015
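The grouping test of Eqs. (5)-(6) can be sketched as follows, using the thresholds TC = 0 pixels, TD = 50 pixels and TA = 30° quoted in the text. The Blob structure, the simplified intersection-distance measure and the angle convention are illustrative stand-ins for the definitions given in the text and in [14].

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cstdlib>
#include <cmath>

struct Blob { cv::Rect box; cv::Point2f centroid; };

// Returns true when two broken blobs should be grouped into one occupant.
bool shouldGroup(const Blob& a, const Blob& b,
                 double TC = 0.0, double TD = 50.0, double TA = 30.0) {
    // Intersection distance B_I (simplified): zero when the bounding boxes
    // overlap horizontally, otherwise the horizontal gap between them.
    double gap = std::max(0, std::max(b.box.x - (a.box.x + a.box.width),
                                      a.box.x - (b.box.x + b.box.width)));

    // Eq. (5): B_dy is the nearest vertical distance between the blobs and
    // B_A the angle of the line joining their centroids, measured here from
    // the vertical axis (broken parts of one person stack roughly vertically).
    double bdy = std::min(std::abs(a.box.y - (b.box.y + b.box.height)),
                          std::abs(b.box.y - (a.box.y + a.box.height)));
    double ba  = std::abs(std::atan2(double(b.centroid.x - a.centroid.x),
                                     double(b.centroid.y - a.centroid.y)))
                 * 180.0 / CV_PI;

    // Eq. (6): group only if all three conditions hold.
    return gap <= TC && bdy <= TD && ba <= TA;
}
```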
    IV. TRACKING This section describes about the tracking rules in the tracking unit (Fig. 3). Tracking rules will keep tracking the occupants during their appearance in the scene based on the information (features) from the previous frame. The Fig. 4. Basic tracking states. tracking rules are represented by states. The basic tracking states are shown in Fig. 4. There are five states:  Entering state (ES), an incoming blob that appears in the scene for the first time will be marked as entering state. This state also receives information from the motion detector to detect whether the incoming blob is an occupant or a noise. If the incoming blob is considered as noise and it remains there for more than 100 frames then the system will delete it, for instance, the size is too small because of shadows. To erase the noise from the scene, the system re-initializes the Fig. 5. An occupant in the scene and the features. Gaussian model to the noise region so that the noise will be absorbed as a background image. An incoming blob is classified as an occupant if the incoming blob has motion at least for 20 frames continuously and the height of the blob is more than 60 pixels.  Person state (PS), if the incoming blob is detected as an occupant, a unique identification (ID) number and a bounding box are attached to this blob. The blob that is detected as an occupant is called as a track. The system adds this track in the tracking list. Fig. 6. Centroid feature to check the distance in 2D.  Sitting state (IS), detects if the occupant is sitting. Sitting occupant can be assumed if there is no surrounding by a bounding box), size (number of blob movement from the occupant for a defined time. In the pixels or area density), centroid (center gravity of mass), experiments, an occupant is sitting when the x-axis and ground position (foot position of occupant). displacement is zero for 20 frames and the velocity The first feature is centroid. Centroid is used to associate vectors from the optical flow’s result are zero for 100 the object’s location in the 2D image between two frames, continuously. consecutive frames by measuring the centroids distance.  Standing-up state (US), detects when the sitting Fig. 6 shows the two objects being associated. One object is occupant starts to move to leave his/her desk. In the already defined as track in the previous frame (t-1) and experiments, a standing up occupant is detected when another object is appearing in the current frame (t) as a blob. the sitting occupant produces movements, the height Each object has centroid (c). These two objects are increases above 75%, and the size changes to 80%- measured [14] in the following way. If one of centroid is 140% comparing to the size of the current bounding inside another object (the boundary of each object is defined box. as a rectangle) the returned distance value is zero. If the centroids are lying outside the boundary of each object then  Leaving state (LS), deletes the occupant from the list. the returned distance value is the nearest centroid to the A leaving occupant is detected when occupant moves opponent boundary. A threshold value (TC) is set. When the to the edge of the scene and occupant’s track loses its distance is below TC meaning that those two objects are the blob for 5 frames. same object, the track position will be updated to the blob A. Tracking Features position. 
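The state machine above can be outlined in code. The sketch below is a simplified illustration of the five basic states using the frame counts quoted in the text (20 frames of continuous motion and a 60-pixel height for a person, 20 frames without x-axis displacement for sitting, 5 lost frames near the image border for leaving); the Track fields and the update logic are assumptions, and the standing-up size/height tests and the 100-frame noise re-initialization are omitted for brevity.

```cpp
enum class State { Entering, Person, Sitting, StandingUp, Leaving };

struct Track {
    State state = State::Entering;
    int   framesWithMotion    = 0;   // consecutive frames with x-displacement
    int   framesWithoutMotion = 0;
    int   framesWithoutBlob   = 0;
    int   heightPx            = 0;

    void update(bool hasMotion, bool blobVisible, bool nearImageEdge) {
        framesWithMotion    = hasMotion   ? framesWithMotion + 1 : 0;
        framesWithoutMotion = hasMotion   ? 0 : framesWithoutMotion + 1;
        framesWithoutBlob   = blobVisible ? 0 : framesWithoutBlob + 1;

        switch (state) {
        case State::Entering:      // sustained motion and plausible size -> person
            if (framesWithMotion >= 20 && heightPx > 60) state = State::Person;
            break;
        case State::Person:        // no movement for a while -> sitting
            if (framesWithoutMotion >= 20) state = State::Sitting;
            break;
        case State::Sitting:       // renewed movement -> standing up
            if (hasMotion) state = State::StandingUp;
            break;
        case State::StandingUp:    // lost near the image border -> leaving
            if (nearImageEdge && framesWithoutBlob >= 5) state = State::Leaving;
            break;
        case State::Leaving:
            break;                 // the track is removed from the list elsewhere
        }
    }
};
```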
If the distance is not satisfied then it means these The system is tried to match every detected occupant in two objects is not correlated each other. It could be the the scene from frame to frame. This can be done by previous track loses the object in the next frame and a new matching the features of occupant. Four features (centroid, object appears at the same time. A track that missed the ground position, color, and size) are used for tracking tracking is defined in the leaving state (LS) and a new object purpose. Fig. 5 shows the illustration of blob (the connected that appears in the scene is handled in the blob state (BS). region of occupant object in the current frame), track (a The second feature is ground position. It is possible that connected blob that considers as an individual occupant, two objects are not the same object but their centroids are 1016
    Fig. 10. Extendedtracking states. Fig. 7. Ground position feature to check the distance in 3D. Blob and track in the processing stage (left). View in the real image (right). categories; n is the total bin number; the histogram HR,G,B of occupant-i meets the following conditions: n H iR ,G ,B   bk (7) k 1 The histogram HR,G,B are calculated using the masked image and then normalized. The masked image, shown in Fig. 8, is obtained from the occupant’s object and the blob with and- operation. The method for matching the occupant’s Fig. 8. Color feature is calculated on masked image. histogram is correlation method. In the experiments, 10-bins for each color are chosen. The histogram matching procedure uses a threshold value of 0.8 to indicate that the comparing histogram is sufficient matched. The fourth feature is size. The size feature is used to match the object between two consecutive frames based on the pixel density. The pixel density is the blob itself, shown at Fig. 9. Allowable changing size at the next frame is set ± 20% from the previous size. Let p(x’,y’) be the pixel Fig. 9. Size feature of occupant. location of an occupant in binary image. The size feature of object-i is calculated as follow: lying inside each other boundary. Fig. 7 shows this problem. There are two occupants in the scene. One occupant is si   p( x' , y' ) (8) x' , y' sitting while the other is walking through behind. In the 2D image (left), the two objects are overlapped each other. However, it is clear that the walking occupant should not be B. Merge-split Problem confused with the sitting occupant. To solve this problem, A challenging situation may happen. While the occupants ground position is used to associate the object’s location in are walking in the scene, they are crossing each other and the 3D image between two consecutive frames. Ground making occlusion. Since the system keeps tracking each position feature will eliminate the error that an object to be occupant in the scene, it is necessary to extend the tracking updated with another object even thought they overlap each states from Fig. 4. Two states are added for this purpose; other. Occupant’s foot location is used as ground position. merge state (MS) and split state (SS). Fig. 10 shows the A fixed uniform ellipse boundary (25 pixels and 20 pixels extended tracking states. Merge and split can be detected by for major axis and minor axis, respectively) around the using proximity matrix [14]. Objects are merged when ground position is set to indicate the maximum allowable multiple tracks (in the previous frame) are associated to one range of the same person to move. In the real scene, this blob (in the current frame). Objects are split when multiple pixel area is equal to 40 centimeters square for the nearest blobs (in the current frame) are created from a track (in the object from the camera until 85 centimeters square for the previous frame). In the merge condition, only centroid furthest object from the camera. This wide range is caused feature is used to track the next possible position since the by the using of uniform ellipse distance for all locations in other three features are not useful when the objects merge. the image. After a group of occupants split, their color will be matched The third feature is color. Color feature is used to to their color just before they merged together. 
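A sketch of the colour and size matching follows, assuming OpenCV's histogram routines: a 10-bin-per-channel RGB histogram computed on the masked occupant region and compared with the correlation method against the 0.8 threshold (Eq. (7)), and the pixel-count size feature with the ±20% tolerance (Eq. (8)). The function names are illustrative.

```cpp
#include <opencv2/opencv.hpp>

// Eq. (7): 10-bin-per-channel colour histogram on the masked occupant region.
cv::Mat rgbHistogram(const cv::Mat& bgrImage, const cv::Mat& mask) {
    int histSize[] = {10, 10, 10};
    float range[] = {0, 256};
    const float* ranges[] = {range, range, range};
    int channels[] = {0, 1, 2};
    cv::Mat hist;
    cv::calcHist(&bgrImage, 1, channels, mask, hist, 3, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);   // normalized histogram
    return hist;
}

bool sameOccupant(const cv::Mat& histTrack, const cv::Mat& histBlob,
                  int sizeTrack, int sizeBlob) {
    // Colour match: correlation above 0.8 counts as sufficiently matched.
    double corr = cv::compareHist(histTrack, histBlob, cv::HISTCMP_CORREL);

    // Eq. (8) size feature: the blob area may change by at most +/-20%.
    bool sizeOk = sizeBlob >= 0.8 * sizeTrack && sizeBlob <= 1.2 * sizeTrack;

    return corr >= 0.8 && sizeOk;
}
```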
indicate color information of occupant’s clothing or wearing In experiments, when more than two occupants split, and help to separate the objects in term of occlusion. Three sometimes an occupant remains occluded. Later, the dimension of RGB color histogram is used. Let b be the bin occluded occupant splits. When the occluded occupant that counts the number of pixel that fall into the same splits, the system will re-identify each occupant and correct 1017
    Sitting area number Event Time stamp 1 Sitting 09:02:09 Wed 2 June 2010 2 Sitting 09:07:54 Wed 2 June 2010 3 Sitting 09:12:16 Wed 2 June 2010 2 Leaving 10:46:38 Wed 2 June 2010 2 Sitting 10:49:54 Wed 2 June 2010 3 Leaving 12:46:38 Wed 2 June 2010 Fig. 12. A report example. B. Monitoring Phase The monitoring phase is derived from the sitting state and the standing up state in the tracking rules. The monitoring phase generates the reports of the occupant’s attendance. It Fig. 11. Merge-split algorithm with occlusion handling. uses the map that has been constructed by the learning phase. From Fig. 4, the sitting into state (IS) and standing up their previous ID number just before they have merged. Fig. state (US) trigger the monitoring phase. When the occupant 11 shows the algorithm to handle the occluded problem. sits, the system will try to match the current occupant’s sitting location with the sitting area in the map. If the V. LEARNING AND MONITORING PHASES positions are the same then the system will generate a time This section introduces about how the learning phase stamp of sitting time for the particular sitting area. A time and the monitoring phase work. These phases are derived stamp of leaving time is also generated by the system when from the tracking unit, which are the sitting state and the the occupant moves out from the sitting area. Fig. 12 shows standing up state in the tracking rules. At the beginning, the the example of the report. system activates the learning phase. Triggering by the sitting VI. APPLICATION TO INTELLIGENT ATTENDANCE event, the learning phase starts to construct the map. When LOGGING SYSTEM the given time interval is passed, the learning phase is stopped. A map has been constructed. The system switches This paper demonstrates the usage of the surveillance to the monitoring phase to report the occupants’ attendance camera as an intelligent attendance logging system. It based on when they sit into and stand from their seat. mentioned earlier that the system works like a time recorder. The system assists for tracking the hours of occupant A. Learning Phase attendance. Using this system, the occupants no need to The learning phase is derived from the sitting state in the bring special tag or badge. In this section, the environment tracking rules. The output of the learning phase is a map. setup, result, and discussion are described. The map consists of occupants’ sitting areas. From Fig. 4, A. Environment Setup the information about when the occupant sits is extracted from the sitting into state (IS). When an occupant is detected A static network camera is used to capture the images as sitting, the system will start counting. After a certain from the scene. It is a HLC-83M, a network camera period of counting, the location where the occupant sits is produced by Hunt Electronic. The image size taken from the determined as sitting area. The counting period is used as a camera is 320 x 240 pixels. The test room is in our delay. The delay makes sure that the occupant sits for laboratory. The camera is placed about 6.24 meters far, 2.63 enough time. In the experiments, the delay is defined for meters high, and 21° facing down from the horizontal line. 200 frames. Ideally, the learning phase is considered to be The occupant desks and the camera view are orthogonal to finished after all of the sitting areas are found. In this paper, get the best view. There are 5 desks as the ground truths. 
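The report generation of the monitoring phase can be sketched as follows: when the sitting (IS) or standing-up (US) state fires, the occupant's ground position is matched against the learned sitting areas and a time-stamped record like those in Fig. 12 is written. The SittingArea structure, the circular matching region and the function names are assumptions made for illustration.

```cpp
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

struct SittingArea { int id; float x, y, radius; };   // one learned map entry

void logEvent(const std::vector<SittingArea>& map,
              float occX, float occY, const std::string& event) {
    for (const auto& area : map) {
        float dx = occX - area.x, dy = occY - area.y;
        if (dx * dx + dy * dy <= area.radius * area.radius) {
            // Emit a time-stamped line, e.g. "Sitting area 2  Sitting  Wed Jun  2 ..."
            std::time_t now = std::time(nullptr);
            std::cout << "Sitting area " << area.id << "  " << event << "  "
                      << std::ctime(&now);
            return;
        }
    }
}
```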
to show that the learning phase does its job, the occupants The room has inner lighting from fluorescent lamps and enter into the scene and sit one by one without making the windows are covered so the sunlight cannot come into occlusion. The scenario for this demonstration is arranged the room during the test. so that after 10 minutes, the map is expected to be B. Result and Discussion completely constructed. Thus, the learning phase is finished Visual C++ and OpenCV platform on Intel® Core™2 its job. The system will be switched to the monitoring Quad CPU at 2.33GHz with 4 GB RAMs is used to phase. In the real situation, the delay and how long the implement the system. Both offline and online methods are learning phase will be finished can be adjusted. allowed. In the scene without any detected objects, the system ran at 16 frames per second (fps). When the number 1018
    of incoming objectsis increasing, the lowest speed can be Table 1. Test results of scene type 1. The number of detected seat by the achieved is 8 fps. system for 10 times experiments. The algorithm was tested with 2 types of scenarios. The Sitting Desk number first scenario is sitting occupants with no occlusion (Fig. area #1 #2 #3 #4 #5 13). This scenario demonstrated the working of learning Detected 7 10 10 10 8 phase. The second scenario is the same as the first scenario Missed 3 0 0 0 2 but the occupants are allowed crossing each other to make an occlusion (Fig. 14). This scenario demonstrated the Table 2. Test results of scene type 2. The number shows the success rate of merge-split handling. monitoring without occlusion for 10 times experiments. Table 1 shows the test result of scenario type 1. There Desk number are 5 desks as ground truth (Fig. 1). Five occupants enter Occupant #1 #2 #3 #4 #5 into the scene. They sit, stand up, and left the scene one by Sitting 9 10 10 10 9 one without making any occlusion. The order or the Leaving 0 9 10 10 9 occupants enter and leave are arranged. The occupant started to occupy the desk number 5 (the right most desk), Table 3. Test results of scene type 2. The number of occupant mistakenly until the desk number 1 (the left most desk). When they left, assigned in merge-split case for 10 times merged. the occupant started to leave from the desk number 1, until Number of Sitting Walking Split the desk number 5. This order is made to make sure that Occupant occupant occupant Merge Succeeded Failed there is no occupant walks through behind the sitting 2 0 2 10 9 1 occupant. This scenario was repeated 10 times. The result 2 1 1 10 9 1 shows that there is no problem for the desk number 2, 3, and 4. However, there are some errors that the system failed to 3 0 3 10 8 2 locate the occupants’ sitting areas. In the case of the desk 3 1 2 10 9 1 number 1, sometimes the occupant’s blob merges with 3 2 1 10 9 1 his/her neighbor occupant. So, the system cannot detect or track the occupant that sits into desk number 1. In the case which is which after they split. The error happened because of the desk number 5, the occupant’s color was similar to of the occupant’s color and the sitting occupant. If the the color of the background image. This caused the occupants have a similar color then the system may get occupant produced small blob. The system cannot track the confuse to differentiate them. Another time, when the sitting occupant because his/her blob’ size becomes too small. occupant makes a movement, it creates a blob. However, the Table 2 shows the test result of scene type 2. The system system still does not have enough evidence to determine that monitored the occupants based on the map that has been this blob will change the status of sitting occupant become found. The experiments were done 10 times without standing up occupant. Another occupant walked closer and occlusion. There are some errors that the system failed to merged with this blob. After they split, the system confused recognize the sitting occupant. The system failed to detect since the blob had no previous information. As the result, the occupant because of the same problems in the previous the system missed count the previous track being merged. discussion; the system lost to track the occupant because the The ID number of occupant is restored incorrectly. occupant has the similar color to the background image so that the occupant suddenly has small blob. The system also VII. 
CONCLUSIONS failed to recognize the leaving event from desk number 1. We have already designed an intelligent attendance The system detects a leaving occupant when the occupant logging system by integrating the open source with split with his/her seat. Since the desk number 1 does not additional algorithm. The system works in two phases; have enough space for the system to detect the splitting, the learning phase and monitoring phase. The system can system still detected that the desk number 1 is always achieve real-time performance up to 16 fps. We also occupied even the corresponding occupant has left that demonstrate that the system can handle the occlusion up to location. three occupants considering that the scene seems become Table 3 shows the test result of scene type 2. The too crowded for more than three occupants. While the experiments were done 10 times with occlusion. The system regular time recorder only reports the time stamp of the should be able to keep tracking the occupants. To test the beginning and the ending of the occupant’s working hour, system, three occupants enter to the scene to make the this system provides more detail about the timing scenario as shown in Table 3. Some occupants walk through information. Some unexpected behavior may cause an error. behind the sitting occupant or the occupants just walk and For instance, the occupant has the color similar to the cross each other. Most of the cases, the system can detect background, the desk position, or the occupant moves while 1019
    sitting. In the future, the events generated by this system can be used to deliver a message to another system. It is possible to control the environment automatically such as adjust the lighting, playing a relaxation music, setting the air conditioner when an occupant enters or leaves the room. The summary report of the occupant’s attendance also can be used for activity analysis. The current system does not include the recognition capability since it only detect whether the working desk is occupied or not. However, if occupant recognition is needed then there are two ways. After the map of sitting areas are found, user may label each sitting area manually or a recognition system can be added. REFERENCES [1] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafers, “EasyLiving Technologies for Intelligent Environments,” Lecture Notes in Computer Science, Volume 1927/2000, pp. 97-119, 2000. [2] S. -L. Chung and W. –Y. Chen, “MyHome: A Residential Server for Smart Homes”, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4693 LNAI (PART 2), pp. 664-670, 2007. [3] Z. Zhou, X. Chen, Y. –C. Chung, Z. He, T. X. Man, and J. M. Keller, “Activity analysis, summarization, and visualization for indoor human activity monitoring,” IEEE Transactions on Circuits and Systems for Video Technology 18 (11), art. no. 4633633, pp. 1489- 1498, 2008. Fig. 13. Scenario type-1. It shows how the system builds a map. The current [4] Wikipedia, “Time Clock,” http://en.wikipedia.org/wiki/Time_clock images (left) and a map is shown as filled rectangles (right images). (June 24, 2010). [5] OpenCV. Available: http://sourceforge.net/projects/opencvlibrary/ [6] cvBlob. Available : http://code.google.com/p/cvblob/ [7] D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell, “Activity maps for location-aware computing,” Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision (WACV), pp. 70-75, 2002. [8] A. Girgensohn, F. Shipman, and L. Wilcox, “Determining Activity Patterns in Retail Spaces through Video Analysis,” MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops , pp. 889- 892, 2008. [9] B. Morris and M. Trivedi, “An Adaptive Scene Description for Activity Analysis in Surveillance Video,” 2008 19th International Conference on Pattern Recognition, ICPR 2008 , art. no. 4761228, 2008. [10] A. Bayona, J.C. SanMiguel, and J.M. Martínez, “Comparative evaluation of stationary foreground object detection algorithms based on background subtraction techniques,” 6th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2009 , art. no. 5279450, pp. 25-30, 2009. [11] S. Herrero and J. Bescós, “Background subtraction techniques: Systematic evaluation and comparative analysis” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5807 LNCS, pp. 33- 42, 2009. [12] P. KaewTraKulPong and R. Bowden, “An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection,” Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01, 2001 [13] G. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision with the OpenCV Library,” Sebastopol, CA: O'Reilly Media, 2008. [14] A. Senior, A. Hampapur, Y.-L. 
Tian, L. Brown, S. Pankanti, and R. Bolle, "Appearance models for occlusion handling," Image and Vision Computing 24 (11), pp. 1233-1243, 2006.
Fig. 14. Scenario type-2. The map of 3 desks has been completed. The occupants cross each other and the system can handle this situation.
    i-m-Walk : InteractiveMultimedia Walking-Aware System 1 Meng-Chieh Yu(余孟杰), 2Cheng-Chih Tsai(蔡承志), 1Ying-Chieh Tseng(曾映傑), 1Hao-Tien Chiang(姜昊天), 1Shih-Ta Liu(劉士達), 1Wei-Ting Chen(陳威廷), 1Wan-Wei Teo(張菀薇), 2Mike Y. Chen(陳彥仰), 1,2Ming-Sui Lee(李明穗), and 1,2Yi-Ping Hung(洪一平) 1 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan 2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan Abstract calories burned [16]. adidas used a accelerometer to detect the footsteps of the runner, and it will let you know running i-m-Walk is a mobile application that uses pressure information audibly [31]. Wii fit used balance boards to sensors in shoes to visualize phases of footsteps on a mobile detect user's center of gravity and designed several games, device in order to raise the awareness for the user´s walking such as yoga, gymnastics, aerobics, and balancing [18]. In behaviour and to help him improve it. As an example addition, walking is an important factor of our health. For application in slow technology, we used i-m-Walk to help example, it is one of the earliest rehabilitation exercises and beginners learn “walking meditation,” a type of meditation an essential exercise for elders [5]. Improper foot pressure where users aim to be as slow as possible in taking pace, and distribution can also contribute to various types of foot to land every footstep with toes first. In our experiment, we injuries. In recent years, the ambient light and the asked 30 participants to learn walking meditation over a biofeedback were widely used in rehabilitation and healing, period of 5 days; the experimental group used i-m-Walk and the concept of “slow technology” was proposed. Slow from day 2 to day 4, and the control group did not use it at all. technology aimed to use slowness in learning, understanding The results showed that i-m-Walk effectively assisted and presence to give people time to think and reflect [30]. beginners in slowing down their pace and decreasing the Meditation is one kind of the example in slow technology. error rate of pace during walking meditation. To conclude, Also, “walking meditation” is an important form of this study may be of importance in providing a mechanism to meditation. Although many research projects have focused assist users in better understanding of his pace and on meditation, showing benefits such as enhancing the improving the walking habit. In the future, i-m-Walk could synchronization of neuronal excitation [11] and increasing be used in other application, such as walking rehabilitation. the concentration of antibodies in blood after vaccination [3], most projects have focused on meditation while sitting. In Keywords: Smart Shoes, Walking Meditation, Visual Feedback, order to better understand how users walk in a portable way, Slow Technology we have designed i-m-Walk, which uses multiple force sensitive resistor sensors embedded in the soles of shoes to monitor users’ pressure distribution while walking. The 1. INTRODUCTION sensor data are wirelessly transmitted over ZigBee, and then Walking is an integral part of our daily lives in terms of relayed over Bluetooth to be analyzed in real-time on transportation as well as exercise, and it is a basic exercise smartphones. Interactive visual feedback can then be can be done everywhere. In recent years, many research provided via the smartphones (see Figure 1). 
projects have studied walking-related human-computer In this paper, in order to develop a system that can help interfaces on mobile phones with the rapid growth of users in improving the walking habit, we use the training of smartphones. For example, there is research evaluated the walking meditation as an example application to evaluate the walking user interfaces for mobile devices [9], and proposed effectiveness of i-m-Walk. Traditional training of walking minimal attention user interfaces to support ecologists in the meditation demands one-on-one instruction, and there is no field [21]. In addition, there are several walking-related standardized evaluation after training. It is challenging for systems developed to help people in walking and running. beginners to self-learn walking meditation without feedback Nike+ used footstep sensors attached to users’ shoes to from the trainers. adjust the playback speed of music while running and to track running related statistics like time, distance, pace, and 1021
    2.2 Multimedia-Assisted WalkingApplication There are some studies using multimedia feedback and walking detection technique to help people in monitoring or training application in daily life. In the application of dancing training, there was an intelligent shoe that can detect the timing of footsteps, and play the music to help beginners in learning of ballroom dancing. If it detected missed footsteps while dancing, it would show warning messages to the user. The device emphasizes the acoustic element of the music to help the dancing couple stay in sync with the music [4]. The other application of dance performance could detect dancers’ pace and applied them in interactive music for dance performance [20]. In the application of musical tempo and rhythm training for children, there was a system which can write out the music on a timeline along the ground, and Figure 1. A participant is using i-m-Walk during walking each footstep activates the next note in the song [13]. meditation. Besides, visual information was be used to adjust foot trajectory during the swing phase of a step when stepping We have designed experiments to test the effect of onto a stationary target [23]. training by using i-m-Walk during walking meditation. In the application of psychological, there are some Participants were asked to do a 15-minute practice of experiments related to walking perceptive system. In the walking meditation for five consecutive days. During the application in walking assisting of stroke patients, lighted experiment, participants using i-m-Walk will be shown real- target was used to load onto left side and right side of time pace information on the screen. We would like to test walkway, and stroke patients can follow the lighted target to whether it could help participants to raise the awareness for carry on their step. The result pointed out that stroke patient their walking behaviour and to improve it. We proposed two might effectively get help by using vision and hearing as hypotheses: (a) i-m-Walk could help users to walk slower guidance [14]. An fMRI study of multimedia-assisted walking during meditation; (b) i-m-Walk could help users to walking showed that increased activation during visually walk correctly in the method of walking meditation. guided self-generated ankle movements, and proved that This paper is structured as follows: The first section deals multimedia-assisted walking is profound to people [1]. In the with the introduction of walking system. The second section related application of walking in entertainment, Personal of the article is a review of walking detection and Trainer – Walking [17] detects users’ footsteps trough multimedia-assisted walking applications. This is followed accelerometer, and encourage users to walk through by some introduction of walking meditation. The forth interesting and interactive games. In the healthcare section describes the system design. After which application, there was a system applied the concept of experimental design is presented. The results for the various intelligence shoes on the healthcare field, such as to detect analyses are presented following each of these descriptive the walking stability of elderly and thus to prevent falling sections. Finally, the discussion and conclusion are presented down [19]. The system monitored walking behaviours and and suggestions are made for further research. used a fall risk estimation model to predict the future risk of a fall. 
Another application used electromyography biofeedback system for stroke and rehabilitation patients, and 2. RELATED WORKS the results showed that there was recovery of foot-drop in the swing phase after training [8]. 2.1 Methods of Walking Detection In the past decade, there were many researches on 3. WALKING MEDITATION intelligent shoes. The first concept of wearable computing and smart clothing systems included an intelligence cloth, The practice of meditation has several different ways and glasses, and an intelligence shoes. The intelligence shoes postures, such as meditation in standing, meditation in sitting, could detect the walking condition [12]. Then, a research meditation in walking, or meditation in lying down on back. used pressure sensor and gyro sensor to detect feet posture, Compared to sitting meditation, people tend to feel less dull, such as heel-off, swing, and heel-strike[22], and a research tense, or easily distract in walking meditation. In this paper, embedded pressure sensor in the shoes to detect the walking we focus on the meditation in walking, which is also named cycle, and a vibrator was equipped to assist when walking walking meditation. Walking meditation is a way to align the [26]. Besides, there were many other methods on walking feeling inside and outside of the body, and it would helps detection, such as use bend sensor [15], accelerometer [2], people to focus and concentrate on his mind and body. ultrasonic [29], and computer vision technology [24] to Furthermore, it can also deeply investigate our knowledge analyze footsteps. and wisdom. 1022
    4.1 i-m-Walk Architecture The shoe module is based on Atmel's high-performance, low-power 8-bit AVR ATMega328 microcontroller, and transmits sensing values through a 2.4GHz XBee 1mW Chip Antenna module wirelessly. The module size is 3.9 cm x 5.3 cm x 0.8 cm with an overall weight of 185g (Figure 4), Figure 2. Six phases of each footstep in walking meditation [25]. including an 1800mAh Lithium battery can continuous use for 24 hours. We kept the hardware small and lightweight in The methods of walking meditation aim to be as slow as order not to affect users while walking. possible in taking pace, and landing each pace with toes first. We use force sensitive resistor sensors to detect the The participants could focus on the movement of walking, pressure distribution of feet while walking. The sensing area from raising, lifting, pushing, lowering, stepping, to pressing of sensor is 0.5 inch in diameter. The sensor changes its (Figure 2). Also, the participants should aware of the resistance depending on how much pressure is applied to the movement of the feet in each stage. It is important to stay sensing area. In our system, the intelligent shoes would aware of the feet sensation. As a result, keep on practicing of detect the walking speed and the walking method in walking walking meditation is an effective way to develop meditation. According to the recommendations of concentration and maintain tranquillity in participants’ daily orthopaedic surgery, we use three force sensitive resistor lives. Furthermore, it can also help participants become sensors fixed underneath the shoe insole, and the three main calmer and their minds can be still and peaceful. With the sustain areas located at structural bunion, Tailor’s bunion, long-term practice of walking meditation, it benefits people and heel, seperately. (see Figure 4). The shoe module is put by increasing patience, enhancing attention, overcoming on the outside of the shoes (see Figure 5). With a fully drowsiness, and leading to healthy body [6]. In order to help charged battery, the pressure sensing modules can beginners in learning the walking methods in walking continuous use for 24 hours. A power button can switch the meditation, i-m-Walk system was developed. module off when it is not being used. 4. SYSTEM DESIGN i-m-Walk includes a pair of intelligent shoes for detecting pace, an ZigBee-to-Bluetooth relay, and a smartphone for walking analysis and visual feedback. There are three force sensitive resistor sensors fixed underneath each shoe insole that send pressure data through the relay. We implemented an analysis and visual feedback application on HTC HD2 smartphone running Window Mobile 6.5 and which has a 4.3-inch LCD screen. The overview of the system is shown in Figure 3. Figure 4. Sensing module: the micro-controller and wireless Force Relay module (right), and one of the insole with three force sensitive Sensor Right Shoe resistor sensors (left). Xbee (Receiver) Force Microcontroller Sensor Bluetooth Force Xbee (Transfer) Sensor Smart Phone Force Bluetooth Sensor Left Shoe Footstep Detection Force Microcontroller Sensor Stability Analysis Force Xbee (Transfer) Visual feedback Sensor Figure 3. System structure of i-m-Walk. Figure 5. Sensing shoes: attached the sensing module into the shoes. 1023
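As a rough illustration of the shoe-module firmware, the Arduino-style sketch below samples the three force-sensitive resistors at about 30 Hz and streams the raw values to the XBee radio over the UART. Pin assignments, baud rate and message format are assumptions, not the actual firmware.

```cpp
// Minimal sketch for an ATmega328-class board: three FSRs are sampled and
// one comma-separated line per sample, e.g. "512,430,88", is sent to the
// XBee module wired to the serial port.
const int FSR_PINS[3] = {A0, A1, A2};   // structural bunion, Tailor's bunion, heel

void setup() {
  Serial.begin(9600);                   // XBee on the UART
}

void loop() {
  for (int i = 0; i < 3; ++i) {
    Serial.print(analogRead(FSR_PINS[i]));
    Serial.print(i < 2 ? ',' : '\n');
  }
  delay(33);                            // roughly 30 samples per second
}
```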
    4.2 Walking detection 4.3.1 Pace awareness There were many methods in walking detection, and the The function of pace awareness is to help user aware of his methods were different according to the applications. In our walking phases and whether he use correct footstep during system, we use three pressure sensors in each shoe, and walking meditation. A feet pattern shows on the smartphone, totally will sense six sensing values at the sample rat 30 and the color block shows where the position of foot’s times per second. In order to detect whether the user lands center of gravity is and how much is forced on the foot in each pace with toes first or heel first, we divide each shoe real-time. The transparency of the block will decrease while into two parts, toe part and heel part. The sensing value of user land his feet. On the contrary, the transparency of the toe part is the average of two force sensors, which block will increase while user raise his foot. Besides, the underneath at structural bunion and Tailor’s bunion. The color block moves top-down while the participants land sensing value of heel part is the force sensor underneath at with toes first. The color blocks move bottom-up means that heel. Therefore, the system divides the sensing area into toe the participants land with heel first. If forward of the foot part and heel part in each shoe, and totally four parts in each lands first, then the colour block moves forward to indicate person. Then, we use threshold method to detect the moment the landing position. In addition, if the user land pace with while the sensing part less than the threshold value, and toe first, the system would defined that he is using correct activate that part. We define the beginning of the each gait walking methods in walking meditation, and the colour cycle while heel part is lifting. The end of the gait cycle is at block would display as the color of green. On the contrary, the moment while the another foot’s heel part is rising. The while the user lands pace with heel first, the system would previous cycle stops in one foot and another foot begins a recognize that he is using wrong walking methods, and the new cycle of gait. Figure 6 shows an example. In this case, colour block would change the color from green to red. when the heel in left foot rose in 5 seconds, the sensing value was less than the threshold, and our system detected left foot 4.3.2 Walking Speed and Warning Message rise in this moment. In the mean time, it means that user’s During walking meditation, people should stabilize his right foot is stepping down. On the contrary, when the heel walking paces at a lower speed. By this way, the user in right foot rose in 10.7 seconds, the sensing value was less interface should provide the information of walking speed in than the threshold, and our system detected right foot rise in real time, and remind the user while the walking speed is this moment. too fast. Walking speed and wrong pace can be measured after the processing of walking signals. Then, the walking speed is visualized as a speedometer. The indicator point of the speedometer will point to the value of walking speed. For example, if the indicator points to the value “30”, it means that the user walk thirty paces per three minutes. Therefore, the speedometer provides the function to remind the user while he is walking too fast. According to the pilot study, we defined the lower-bound of the walking speed as 40 paces per three minutes. 
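The toe/heel decision can be sketched in a few lines: the toe value is the mean of the two forefoot sensors, the heel value is the heel sensor, and whichever part first crosses the threshold when the foot lands determines whether the footstep counts as toes-first (correct) or heel-first (wrong). The threshold value and structure names below are illustrative.

```cpp
struct ShoeSample { float toe1, toe2, heel; };   // one 30 Hz sample from a shoe

class LandingDetector {
public:
    explicit LandingDetector(float threshold) : th_(threshold) {}

    // Returns +1 for a toes-first landing, -1 for heel-first, 0 otherwise.
    int update(const ShoeSample& s) {
        float toe      = 0.5f * (s.toe1 + s.toe2);   // average of forefoot sensors
        bool  toeDown  = toe    > th_;
        bool  heelDown = s.heel > th_;

        int result = 0;
        if (!footDown_ && (toeDown || heelDown))
            result = toeDown ? +1 : -1;              // which part touched first
        footDown_ = toeDown || heelDown;             // foot lifted when both below
        return result;
    }

private:
    float th_;
    bool  footDown_ = false;
};
```

Counting the detected landings over a sliding three-minute window then gives the pace value shown on the speedometer, with the "too fast" warning raised above 40 paces per three minutes.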
While the walking speed exceeds the speed, the indicator will point to the red area, and the screen will show a warning message “too fast” on the top of the screen. The warning message would disappear while the walking speed is less than 40 paces per three minutes. Figure 6. Signal processing of walking signals. Blue line indicates Warning the sensed weight (kg) of heel and green line indicates the sensed message weight of toe. Red line means the threshold in detecting the landing event. Gray block means that which foot is landing. Footstep Awareness 4.3 User interface Multimedia feedback can be effectively applied in preventive medicine [7], and it can also assist rehabilitation Walking patients in walking effectively [5, 27]. i-m-Walk is developed speed to assist user in learning the walking methods during walking meditation. The user interface of i-m-Walk includes three components: warning message, pace awareness, and walking speed (see Figure 7). In this section, we describe the user interface and the design principles of our system. Figure 7: User interface of i-m-Walk. The user interface shows three events: warning message, condition of each footstep, and 1024
    walking speed. Picture(a) shows that the user used incorrect meditation. 83.3% of the participants carry mobile phone all walking method in right foot and the colour block changed to red the time, and 63.3% of the participants have the experience on the right foot. Picture (b) shows that the walking speed is too of using smartphone. There were fifteen participants (eleven fast (46 steps per three minutes). male and four female) in experiment group (with visual feedback), and fifteen participants (eleven male and four female) in control group (without visual feedback). Because 5. EXPERIMENT DESIGN the feet size was different in each participant, we prepared Two experiments was designed to test the effects of i-m- two pair os shoes with different sizes, and participants could Walk during walking meditation. The first experiment was a choose comfortable one to wear. pilot study. In this study, we evaluated the effect of visual feedback which showed six sensing curves which projected 5.2.2 Location on the wall. The second experiment evaluated the effects of i-m-Walk in improving user´s walking behaviour. Meditating in a quiet and enclosed area would be easier to bring mind inward into ourselves and may reach in calm 5.1 Pilot study and peace situation. In this experiment, we selected a Before the test of i-m-Walk, we designed a preliminary corridor in the faculty building as the experimental place for study to test the effects of visual feedback displayed on the walking meditation. The corridor is a public place at an wall. Eight master students volunteered to participate in this enclosed area and few people would conduct their daily pilot study. Participants’ average age is 26.3 (SD=0.52). All activities like standing, walking, and interacting with one participants have the experience of sitting meditation before, another there. The surrounding of corridor is quiet and but all of them do not have the experience of walking comfortable for user to reach their mind in calm. The length meditation. There were four participants (three male and one of the corridor is thirty meters, and the width is three meters. female) in experiment group (with visual feedback), and four The temperature is 21~23 Celsius degree. participants (four male) in control group (without visual feedback). Participants would take ten minutes each day and 5.2.3 Procedure and analysis for three consecutive days in the experiment. The experiment took place at a seminar room in the faculty building. Before Before the experiment, participants were asked to walk the experiment, participants were taught the methods and alone the corridor in usual walking speed, and we recorded it. principles of walking meditation. The experiment was a 4 × Then, participants were taught the methods of walking 2 between-participants design. In the experiment group, meditation. The guideline of the walking meditation which participants were asked to watch the curves which showed we provided to the participants was as follows: “Walking feet’s sessing value projected on the wall. In the control meditation is a way to align the feeling inside and outside of group, participants were asked to walk themselves without the body. You should focus on the movement of walking, visual feedback. Participants walked straight in the seminar from raising, lifting, pushing, lowering, stepping, to pressing. room. 
The results showed that there was a significant main You have to land every footstep with toes first and then effect that experimental group had lower walking speed than slowly land your heel down. During walking meditation, you control group (p<0.05) during the three days. The average should stabilize your walking paces at a lower speed as number of wrong pace in experimental group was less than possible. You have to relax your body from head to toes.” control group, too. As the results from this pilot study, we The experiment was a 15 × 2 between-participants concluded two preliminary conclusion that (a) visual design. Participants would take fifteen minutes each day and biofeedback could help users in slowing down the walking for five consecutive days in this experiment. Table 1 shows speed during walking meditation; (b) Multimedia guidance the procedure of this experiment. In the experiment group, could usefully help user in aware of his pace during walking participants were asked to use i-m-Walk from day 2 to day 4. meditation, and could decrease the number of wrong pace. In the control group, participants were asked to walk However, we also observed some issues in pilot study, one is themselves without any feedback during walking meditation. that the perspective would change over time while walking DAY 1 DAY 2 DAY 3 DAY 4 DAY 5 to different location, and it might influence the effect of learning. Based on the results and recommands, we designed Experimental ○ ● ● ● ○ a experiment to evaluate the effect of i-m-Walk. Group 5.2 User study Control ○ ○ ○ ○ ○ 5.2.1 Participants Group Thirty master and PhD students in the Department of Table 1: Experimental procedure: ● means that participants have Computer Science volunteered to participate in this to use i-m-Walk during walking meditation, and ○ means that experiment. Participants’ average age is 25.2 (SD=3.71). participant do not have to use i-m-Walk during walking meditation. Twenty-seven participants have the experience of sitting meditation, and three participants do not. However, all While learning of walking mediation, all participants participants do not have the experience of walking were asked to walk clockwise around the corridor and hold 1025
    the smartphone. Incontrol group, there was no visual minute learning of walking meditation on experimental feedback on smartphone although they still need to hold the group and control group from day 1 to day 5. smartphone. In experimental group, participants were In experimental group, the median value of the wrong informed that they can choice not to look at the visual pace decreased from eight wrong pace in day 1 to one wrong feedback while aware of their pace well. The participants of pace in day 5, and the value of wrong pace decreased over experimental group were asked to complete a questionnaire day. In control group, the median value of the wrong pace after experiments from day 2 to day 4. Besides, we asked all decreased from 7 paces in day 1 to 5 paces in day 5, but the participants the feelings and impressions after the experiment wrong pace decreased just in first three days. The results in day 5. However, all participants could write down any showed that i-m-Walk could effectively reduce wrong pace recommends and feelings after the experiment, and we will during walking meditation. discuss the issues in the discussion section. 5.2.4 Results We analyzed the average walking speed and the wrong pace both in experimental group and control group. In the results of the average walking speed, figure 8 shows the average time of each pace on experimental group and control group from day 1 to day 5. In day 1 and day 5, all participants learned walking meditation without using i-m- Walk. Following t-tests revealed significant difference (p < 0.005) in the average walking time per footstep for the experimental group and control group from day 2 to day 4. In experimental group, the average walking time per footstep increased from 4.5 seconds in day 1 to 10.9 seconds in day 5. Figure 9: The median value of error footsteps for experimental In control group, the average walking time per footstep group and control group from day 1 to day 5. increased from 3.2 seconds (day 1) to 5.1 seconds (day 5). The results showed that the participants in experimental The experimental group was asked to complete a group had significant main effect (p < .005) in slowing down questionnaire after using i-m-Walk system from day 2 to day the walking speed after the learning of walking meditation. 4. The content of the questionnaire was the same in each day. On the contrary, the participants in control group had no Figure 10 shows the results of questionnaires which were significant main effect (p > .1) in slowing down the walking completed after walking meditation. We asked two questions: speed after the learning of walking meditation. The results (1) what is the degree of i-m-Walk to help you in aware of showed that i-m-Walk could help participants in slowing your pace? (2) what is the degree of i-m-Walk to help you in down the walking speed during walking meditation. slowing down the footstep? There were five options for answers, including “1: serious interference”, “2: a little interference”, “3: no interference and no help”, “4: a little help”, and “5: very helpful”. The results showed that all participants in experimental group gave positive feedback both in question 1 and question 2, and the score of questionnaire were between “a little helpful” and “very helpful”. Figure 8: The average time of one footstep for experimental group and control group from day 1 to day 5. Error bars show ±1 SE. 
In our experiment, the rule of correct walking method was that participants should land every footstep with toes Figure 10: The questionnaire results filled by experimental group first during walking meditation. If they landed footstep with from day 2 to day 4. The red line shows the baseline of the heel first, it was a wrong pace defined in our experiment. satisfaction. Error bars show ±1 SE. Figure 9 shows the median values of total wrong pace in 15 1026
6. DISCUSSION

The aim of this section is to summarize, analyze, and discuss the results of this study, and to give guidelines for the future development of applications.

6.1 User Interface

The user interface of i-m-Walk provides information about the pace, including walking speed, wrong paces, and the center of the feet. The results on walking speed showed significantly that i-m-Walk could help beginners decrease their walking speed during walking meditation. Some comments from participants of the experimental group:

User E6 on day 3: "I always walked fast before, but when I saw the dashboard and the warning message 'too fast,' it was helpful in reminding me to slow down my walking speed."

We list two design principles of the user interface: (a) we used the form of a dashboard to represent the walking speed; the value of the walking speed is easy to watch, and users may become aware of the change in walking speed while they slow down or speed up; (b) i-m-Walk provided an additional alarm mechanism, a warning message "too fast," shown while walking too fast; this mechanism can remind users when they are distracted. The results on wrong paces showed that i-m-Walk could effectively reduce wrong paces for beginners during walking meditation. One of the participants from the experimental group said:

User E1 on day 2: "When I saw the color of the block on the screen change from green to red, I knew that I had made a wrong pace. Then I would concentrate on my pace deliberately for the next footstep."

6.2 Human Perception

Human beings receive messages by means of the five modalities: vision, sound, smell, taste, and touch. The modalities most used in the field of human-computer interaction are the visual and auditory modalities. There was a comment from an experimental participant:

User E3 on day 2: "If I could listen to my pace during walking meditation, I would not need to hold the smartphone."

In cross-modal research, the visual modality is usually considered superior to the auditory modality in the spatial domain. In our case, we need to show the footstep phases accurately, and we also need to show the walking speed and wrong paces at the same time. Therefore, we selected visual feedback as the user interface. The advantage is that users can decide whether to watch the information or not, but the shortcoming is that users fail to receive the information when they do not look at it. Therefore, it is possible to provide further interaction methods, such as tactile and acoustic perception, to remind users.

On the other hand, multimedia feedback mechanisms may attract the user's attention in some cases, and too many inappropriate and redundant events may disturb users. In our system, we provided visual feedback all the time during walking meditation because we did not know whether the user needed the guidance or not, but we informed participants that they could decide not to look at the visual feedback once they were sufficiently aware of their pace. In this way, the interference during use could be minimized. The questionnaire showed that the participants felt that there was no interference while using i-m-Walk and that the system was helpful in use.

6.3 Beginners vs. Masters

In recent years, the concept of "slow technology" has been applied in many mediated systems. The design philosophy of "slow technology" is that we should use slowness in learning, understanding, and presence to give people time to think and reflect. In our case, walking meditation is one kind of conception within slow technology. There are two main parts in walking meditation: the inside condition and the outside condition. The inside condition means the meditation of the mind, and the outside condition means the meditation of the walking posture. All participants were beginners in our experiment, because we focused on the training of the outside condition, the walking posture. The difference between beginners and masters in walking meditation is that beginners are not familiar with walking meditation and need to pay more attention to the control of pace, whereas masters are familiar with it and can focus on the meditation of aligning the inside and the outside of the body. Walking meditation is a way to align the feelings inside and outside of the body, and beginners should become familiar with the walking posture before the spiritual development. In this paper, the goal of our experiment is to evaluate the learning effects of the i-m-Walk system. The experimental results showed that the participants of the experimental group could slow down their walking speed and decrease their wrong paces after five days of training.

Six participants in the experimental group felt that the experimental time on day four was shorter than on the first day, although the experimental time was the same. On the contrary, there was no such comment from the participants in the control group. These results suggest that i-m-Walk could help users train the walking posture of walking meditation.

6.4 Reaction Time

Reaction time is an important issue in human-computer interaction design. If the reaction delay is too long, users cannot control the system well and cannot easily be aware of the interaction. According to our observation, the delay time of i-m-Walk is 0.2 seconds. However, this delay does not affect users, because the application in this experiment does not need a fast reaction time; the average pace took 10.9 seconds in the experimental group on day five. The results of the questionnaires also showed that participants felt that the visual feedback could reflect the walking status immediately. However, the somatosensory feedback from one's own feet is the most intuitive, and i-m-Walk only provides guidance for beginners when they need it.
    7. CONCLUSIONS AND FUTURE WORK [9] Kane, S.K., Wobbrock, J.O., and Smith, I.E., 2008. Getting off the treadmill: evaluating walking user interfaces for mobile devices in In this paper, we present a mobile application that uses public spaces. In Proc. MobileHCI '08, 109-118. pressure sensors in shoes to visualize phases of footsteps on [10] Kong, K. and Tomizuka, M., 2008. Smooth and Continuous Human a mobile device in order to raise the awareness for the user´s gait Phase Detection based on foot pressure patterns. In Proc. ICRA ’08, 3678-3683. walking behaviour and to help him improve it. Our study [11] Lutz, A., 2004. Long-term meditators self-induce high-amplitude showed that i-m-Walk could effectively assisted beginners in gamma synchrony during mental practice. PNAS 101, 16369-16373. slowing down their pace and decreasing the error rate of pace [12] Mann, S., 1997. Smart clothing: The wearable computer and during walking meditation. Therefore, the conception of i-m- WearCam. Journal of Personal Technologies 1(1), 21-27 Walk could be used in other applications, such as walking [13] Mann, S., 2006. The andante phone: a musical instrument that you rehabilitation. play by simply walking. In Proc. ACM Multimedia 14th, 181-184. Despite the encouraging results of this study as to the [14] Montoya, R., Dupui, P.H., Pagès, B., and Bessou, P., 1994. Step- positive effect of i-m-Walk, future research is required in a length biofeedback device for walk rehabilitation. Journal of Medical number of directions. In the part of intelligient shoes, we will and Biological Engineering and Computing 32(4), 416-420. analyze user’s walking method, such as pigeon toe gait and [15] Morris, S.J., and Paradiso, J.A., 2002. Shoe-integrated sensor system out toe gait while walking. In the part of biofeedback for wireless gait analysis and real-time feedback. In Proc. Joint IEEE EMBS (Engineering in Medicine and Biology Society) and BMES mechanisms, we will try to design more interaction methods, (the Biomedical Engineering Society) 2nd, 2468-2469. such as tactile perception and acoustic perception. Besides, [16] Nike, INC., 2009. Nike+, Retrieved October 26, 2009, from we will record and analyze user’s learning status while www.nikeplus.com/ walking, and provide appropriate and personalized guidance [17] Nintendo, 2009. Personal Trainer - Walking, Retrieved November 26, according to his condition. Currently, we are using additional 2009, from http://www.personaltrainerwalking.com/ sensing devices, such as Breath-Aware Garment and Sensing [18] Nintendo, 2009. Wii Fit Plus, Retrieved January 8, 2010, from ring, to detect user’s biosignal and activities, and integrate to http://www.nintendo.co.jp/wii/rfpj/index.html i-m-Walk to analyze the breathing status and heartbeat rate [19] Noshadi, H., Ahmadian, S., Dabiri, F., Nahapetian, A., Stathopoulus, while walking and running. T., Batalin, M., Kaiser, W., Sarrafzadeh, M., 2008. Smart Shoe for Balance, Fall Risk Assessment and Applications in Wireless Health. In Proc. Microsoft eScience Workshop. 8. ACKNOWLEDGMENT [20] Paradiso, J., 2002. FootNotes: Personal Reflections on the This work was supported in part by the Technology Development of Instrumented Dance Shoes and their Musical Development Program for Academia, Ministry of Economic Applications. In Quinz, E., ed., Digital Performance, Anomalie, Affairs, Taiwan, under grant 98-EC-17-A-19-S2-0133. digital_arts Vol. 2, 34 - 49. 
[21] Pascoe, J., Ryan, N. and Morse, D., 2000. Using while moving: HCI 9. REFERENCES issues in fieldwork environments. ACM Transactions on Computer- Human Interaction, 7 (3), 417- 437. [22] Pappas, P. I., Keller, T., Mangold, S., Popovic, M.R., Dietz, V., and [1] Christensen, M.S., Lundbye-Jensen, J., Petersen, N., Geertsen, S.S., Morari, M., 2004. A reliable, insole-embedded gait phase detection Paulson, O.B., and Nielsen, J.B., 2007. Watching Your Foot Move-- sensor for FES-assisted walking. Journal of IEEE Sensors 4 (2), 268- An fMRI Study of Visuomotor Interactions during Foot Movement. 274. Journal of Cereb Cortex 17 (8), 1906-1917. [23] Reynolds, R.F., Day, B.L., 2005. Visual guidance of the human foot [2] Crossan, A., Murray-Smith, R., Brewster, S., Kelly, J., and Musizza, during a step. Journal of Physiology 569 (2), 677-684. B., 2005. Gait phase effects in mobile interaction, In Proc. CHI '05 [24] Quek, F., Ehrich, R., Lockhart, T., 2008. As go the feet...: on the extended abstracts on Human factors in computing systems, 1312- estimation of attentional focus from stance. In Proc. ICMI 10th, 97- 1315. 104. [3] Davidson, R.J., 2003. Alterations in brain and immune function [25] Thera, S., 1998. The first step' to Insight Meditation. Buddhist produced by mindfulness meditation. Psychosom Med. 65, 564-570. Cultural Centre. [4] Drobny, D., Weiss, M., and Borchers, J. 2009. Saltate! -– A Sensor- [26] Watanabe, J., Ando, H., and Maeda, T., 2005. Shoe-shaped Interface Based System to Support Dance Beginners. In Proc. CHI '09 for Inducing a Walking Cycle. In Proc. ICAT 15th, 30-34. Extended Abstracts on Human Factors in Computing Systems, 3943- 3948. [27] Woodbridge, J., Nahapetian, A., Noshadi, H., Kaiser, W. and Sarrafzadeh, M., 2009. Wireless Health and the Smart Phone [5] Femery, V.G., Moretto, P.G., Hespel, J-MG, Thévenon, A., and Conundrum. HCMDSS/ MDPnP. Lensel, G., 2004. A real-time plantar pressure feedback device for foot unloading. Journal of Arch Phys Med Rehabi 85(10), 1724-1728. [28] Ikemoto, L., Arikan, O., and Forsyth, D., 2006. Knowing when to put your foot down. In Proc. Interactive 3D graphics and games 06’, 49- [6] Hanh, T.N., Nquyen, A.H., 2006. Walking Meditation. Sounds True 53. Ltd. [29] Yeh, S.Y., Wu, C.I., Chu, H.H., and Hsu, Y.J., 2007. GETA sandals: [7] Hu, M.H., and Woollacott, M.H., 1994. Multisensory training of a footstep location tracking system. Personal and Ubiquitous standing balance in older adults: I. Postural stability and one-leg Computing 11(6): 451-463. stance balance. Journal of Gerontology: MEDICAL SCIENCES 49(2), M52-M61. [30] Hallnäs, L., Redström, J., 2001. Slow Technology: Designing for Reflection. Personal and Ubiquitous Computing, Vol. 5(3). pp. 201- [8] Intiso D., Santilli V., Grasso M.G., Rossi R., and Caruso I., 1994. 212. Rehabilitation of walking with electromyographic biofeedback in foot-drop after stroke. Journal of Stroke 25(6), 1189-1192. [31] adidas, INC., 2010. miCoach, Retrieved March 5, 2010, from www.micoach.com 1028
    Object of InterestDetection Using Edge Contrast Analysis Ding-Horng Chen FangDe Yao Department of Computer Science and Information Department of Computer Science and Information Engineering Engineering Southern Taiwan University Southern Taiwan University Yong Kang City, Tainan County Yong Kang City, Tainan County [email protected] [email protected] Abstract— This study presents a novel method to detect the the separation of variations in illumination from the focused object-of-interest (OOI) from a defocused low depth- reflectance of the objects (also known as intrinsic image of-field (DOF) image. The proposed method divides into three extraction) and in-focus areas (foreground) or out-of-focus steps. First, we utilized three different operators, saturation (background) areas in an image. contrast, morphological functions and color gradient to The DOF is the portion of a scene that appears compute the object's edges. Second, the hill climbing color acceptably sharp in the image. Although lens can precisely segmentation is used to search the color distribution of an focus at one specific distance, the sharpness decreases image. Finally, we combine the edge detection and color segmentation to detect the object of interest in an image. The gradually on each side of the focused distance. A low (small) proposed method utilizes the edge analysis and color DOF can be more effective to emphasize the photo subject. segmentation, which takes both advantages of two features The OOI is thus obtained via the photography technique by space. The experimental results show that our method works using low DOF to separate the interested object in a photo. satisfactorily on many challenging image data. Fig. 1 shows a typical OOI image with low DOF. Keywords-component; Object of Interest (OOI); Depth of Field (DOF); Object Detection; Edge Detection; Blur Detection. I. INTRODUCTION The market for digital single-lens reflex cameras, or so- called DSLR, has expanded tremendously for its price become more acceptable. For a professional photographer, the DSLR owns the advantages for the excellent image quality, the interchangeable lenses, and the accurate, large, and bright optical viewfinder. The DSLR camera has bigger sensor unit that can create more obvious depth-of-field (DOF) photos, and that is the most significant features of DSLR. Figure 1. A typical OOI image According to market reports [1][2][3], the DSLR market share will grows very fast in the near future. Table 1 shows The OOI detection problem can be viewed as an the growth rate of the digital camera market. extension of the blurred detection problem. In Chung’s Table 1. Market Estimate of the Digital Cameras method [6], they compute x and y direction derivative and gradient map to measure the blurred level ,by obtaining the Year 2006 2011 Growth Rate edge points which is computed by a weighted average of the World Market 81 82.2 108% standard deviation of the magnitude profile around the edge point. DSLR 4.8 8.3 173% Renting Liu et al. [7] have proposed a method could DSC 76.8 79.9 104% determine blurred type of an image, using the pre-de ned Unit: Million US$ blur features, the method train a blur classifier to The extraction of the local region of interested in an discriminate different regions. This classifier is based on image is one of the most important research topics for some features such as local power spectrum slope, gradient computer vision and image processing [4][5]. The detection histogram span, and maximum saturation. 
Then they of object of interest (OOI) in a low DOF images can be detected the blurry regions that are measured by local applied in many fields such as content-based image retrieval. autocorrelation congruency to recognize the blur types. To measure the sharpness or blurriness edges in an image is The above methods determine the blur level and regions, also important for many image processing applications. For but they still cannot extract OOI object from an image. If the instance, checking the focus of a camera lens, identifying background is complex or edges are blurred, the described shadows (which edges are often less sharp than object edges), methods are unable to find OOI [6][7]. N. Santh and K.Ramar have proposed two approaches, i.e. the edge-based 1029
and region-based approach, to segment the low-DOF images [8]. They transformed the low-DOF pixels into an appropriate feature space called the higher-order statistics (HOS) map. The OOI is then extracted from a low-DOF image by region-merging and thresholding techniques as the final decision.

But if the object's shape is complex or the edges are not fully connected, it is still hard to find the object. The OOI may not be a compact region with a perfectly sharp boundary, so one cannot simply use edge detection to find a complete object in a low-DOF image. In some cases, such as macro photography or close-up photography, the depth-of-field is very low and some parts of the subject may be out of focus, which also causes a partial blur on the subject. To acquire a satisfactory result in OOI detection, not only the blurred part but also the sharp part needs to be taken into consideration. Finding a good OOI subject in the image is the challenge in this issue.

II. THE PROPOSED METHOD

In this paper, we propose a novel method to extract the OOI from a low-DOF image. The proposed algorithm contains three steps. First, we find the object boundaries based on computing the sharpness of edges. Second, hill-climbing color segmentation is used to find the color distribution and its edges. Finally, we integrate the above results to get the OOI location.

The first step is divided into three parts and is illustrated in Fig. 2. We calculate the feature parameters including the maximum saturation, the color gradient, and the local range image. The image is converted into the CIE Lab color space and edge detection is performed on it. In the noise-reduction part, we use a median filter to reduce fragmentary values. Then all the feature images are multiplied together to extract the exact position of the OOI.

Figure 2. Edge detection flowchart

A. Saturation Edge Power Mean

Fig. 3 shows the original image in which we want to detect the OOI. The background is out of focus and thus is smoother than the object we want to detect. The color saturation and edge sharpness are the major differences between the objects and the background. Color information is very important in blur detection. It is observed that blurred pixels tend to have less vivid colors than un-blurred pixels because of the smoothing effect of the blurring process. Focused (or un-blurred) objects are likely to have more vivid colors than blurred parts. The maximum saturation value in blurred regions is expected to be smaller than in un-blurred regions. Based on this observation, we use the following equation to compute pixel saturation:

    S_P = 1 − 3·min(R, G, B) / (R + G + B),   (1)

where S_P denotes the saturation value of the pixel. Equation (1) transforms the original image into the saturation feature space to find the higher-saturation parts of the image.

In low-DOF images, the saturation does not change dramatically in the background because it is smoother; on the contrary, the color saturation changes sharply along the edges. Therefore, we define the edge contrast CA, which is computed in a 3×3 window, as follows:

    CA = (1/n) Σ_{n∈M, n≠A} (n − A)²,   (2)

where M is the 3×3 window, A is the saturation value at the window center, and n is the value of a neighboring pixel in this window.

Equation (2) calculates the saturation intensity. Here we show the result images to demonstrate the processing steps. Fig. 4 is the resulting saturation image, and Fig. 5 shows the result after performing the edge contrast computation.

Figure 3. Original image

Figure 4. Saturation image
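The two measures above are simple to prototype. The following NumPy sketch (function names are ours, not the paper's) implements Eq. (1) and a straightforward reading of Eq. (2) on a 3×3 neighborhood, with the center pixel excluded from the average.

    import numpy as np

    def saturation_map(rgb):
        """Eq. (1): S_P = 1 - 3*min(R,G,B)/(R+G+B), computed per pixel."""
        rgb = rgb.astype(np.float64)
        total = rgb.sum(axis=2) + 1e-6          # avoid division by zero
        return 1.0 - 3.0 * rgb.min(axis=2) / total

    def edge_contrast(sat):
        """Eq. (2), read as the mean squared difference between each pixel
        and its eight 3x3 neighbours in the saturation map."""
        h, w = sat.shape
        pad = np.pad(sat, 1, mode='edge')
        acc = np.zeros_like(sat)
        count = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                acc += (pad[1 + dy:h + 1 + dy, 1 + dx:w + 1 + dx] - sat) ** 2
                count += 1
        return acc / count

    # usage: ca = edge_contrast(saturation_map(image)); high values mark sharp, saturated edges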
Figure 5. Saturation edge image

B. Color Gradient

The gradient of a scalar field is a vector field that points in the direction of the greatest rate of increase of the scalar field and whose magnitude is the greatest rate of change. It is very useful in many typical edge-detection problems. To calculate the gradient of the color intensity, we first use the Sobel operator to separate vertical and horizontal edges:

    Gx = [ −1 0 1; −2 0 2; −1 0 1 ] * A,   Gy = [ −1 −2 −1; 0 0 0; 1 2 1 ] * A,   (3)

    G = (Gx² + Gy²)^(1/2),   (4)

    θ = arctan(Gy / Gx).

Equations (3) and (4) show a traditional way to compute the gradient. Here θ is the edge angle, and θ = 0 for a vertical edge which is darker on the left side. We modify the above equations to be more accurate in our case with the following equations:

    Gx² = Rx² + Gx² + Bx²,
    Gy² = Ry² + Gy² + By²,
    Gxy = Rx·Ry + Gx·Gy + Bx·By,
    A = 0.5 · arctan( 2·Gxy / (Gx² − Gy²) ),
    G1 = { 0.5 · [ (Gx² + Gy²) + (Gx² − Gy²)·cos(2A) + 2·Gxy·sin(2A) ] }^(1/2),

where Rx, Gx, and Bx are the RGB layers filtered by the horizontal Sobel operator, and Ry, Gy, and By are the RGB layers filtered by the vertical Sobel operator. A is the angle of Gxy, and G1 is the color gradient of the image at angle A. The definition of G2 is quite similar to G1, but the term A is replaced by

    A' = A + π/2.

Therefore, G2 is computed by

    G2 = { 0.5 · [ (Gx² + Gy²) + (Gx² − Gy²)·cos(2A') + 2·Gxy·sin(2A') ] }^(1/2).

The value of the color gradient CG is obtained by choosing the maximum of G1 and G2, i.e.,

    CG = max(G1, G2).

This CG value represents the color intensity along the edge gradient; the CG value increases if the color at an edge point changes dramatically. Fig. 6 shows the result after the color gradient computation.

Figure 6. Color vector image

C. Local Range Image

In this study, we adopt the morphological functions dilation and erosion to find the local maximum and minimum values in a specified neighborhood. First, we convert the original image from the RGB color space to the CIE Lab color space. Because the luminance of an object is not always flat, we compute the local range value for the a and b layers without the L (luminance) component; we restrain the color diversification on the object in order to prevent this situation. The dilation, erosion, and local range computations are defined by the following equations:

    Dilation:  A ⊕ B = { z | (B̂)_z ∩ A ≠ ∅ },
    Erosion:   A ⊖ B = { z | (B)_z ⊆ A },
    Local Range Image = (A ⊕ B) − (A ⊖ B).

Fig. 7 shows the result after the local range operation.

Figure 7. A local range image
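As a rough illustration of the modified color gradient and the local range image, here is a NumPy/SciPy sketch. It uses arctan2 for numerical robustness and SciPy's grey-scale morphology; both are our implementation choices under the equations above, not the paper's exact code.

    import numpy as np
    from scipy import ndimage

    def color_gradient(rgb):
        """Colour gradient CG = max(G1, G2) built from per-channel Sobel
        responses, following the reconstructed equations above."""
        rgb = rgb.astype(np.float64)
        gx = np.stack([ndimage.sobel(rgb[..., c], axis=1) for c in range(3)])
        gy = np.stack([ndimage.sobel(rgb[..., c], axis=0) for c in range(3)])
        gxx = (gx ** 2).sum(axis=0)              # Rx^2 + Gx^2 + Bx^2
        gyy = (gy ** 2).sum(axis=0)
        gxy = (gx * gy).sum(axis=0)              # Rx*Ry + Gx*Gy + Bx*By
        a = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy)

        def mag(theta):
            return np.sqrt(np.clip(0.5 * ((gxx + gyy)
                                          + (gxx - gyy) * np.cos(2 * theta)
                                          + 2 * gxy * np.sin(2 * theta)), 0, None))

        return np.maximum(mag(a), mag(a + np.pi / 2))

    def local_range(channel):
        """Local range image for one 2-D chroma channel (a or b of CIE Lab):
        grey-level dilation minus erosion in a 3x3 window."""
        dil = ndimage.grey_dilation(channel, size=(3, 3))
        ero = ndimage.grey_erosion(channel, size=(3, 3))
        return dil - ero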
    D. Median Filter ImageColorSegmention Median filter is a nonlinear digital filtering technique, which is often used to remove noise. Such noise reduction is a typical pre-processing step to improve the results for later processing. The process of edge detection will cease some fragmentary values. If the values are low or the fragment edges are not connected, it could as seen as noise. Therefore, we adopt the median filter to reduce the fragmentary pixels. E. Hill Climbing Color Segmentation Edge detection can find most edges of OOI object, but the boundaries are not usually closed completely. The Figure 9. A color segmentation result morphological operators cannot link all the disconnected edges to obtain a complete boundary. Most OOI edges can F. Edge Combination be detected after the previous procedures, but some edges are still unconnected. To make the OOI boundary be a regular The OOI edges are obtained by two methods. First, we closure, we adopt color segmentation to connect the isolated use morphological close operation, which is a dilation edges. followed by an erosion, to connect the isolated points. The The color segmentation method is illustrated in Fig. 8. close operation will make the gaps between unconnected This method is based on T. Ohashi et al. [9] and R. Achanta edges become smaller and make the outer edges become et al. [10] .The hill-climbing algorithm detects local maxima smoother. Second, we adopt edge detection on color of clusters in the global three-dimensional color histogram of segmentation map to find the color distribution, and merge it an image. Then, the algorithm associates the pixels of an with pre-edge detection result. image with the detected local maxima; as a result, several After the above procedures, we can get most of the edge visually coherent segments are generated. clues, and then we want to integrate these clues to a complete OOI boundary. Let the result of the boundary detection be IE, the result from the color segmentation be IC. The edge is extended by counting the pixels in IC and the neighboring points of IE. To determine whether a pixel at the end of the IE = to be extended or not, here we reassign a value P at point (i,j) ( , ) (, ) as an “edge extension” value as follow: (, ) , where n=-1, m=1, is sliding in a 3x3 window, IE is the pre- edge detection image value of the neighborhood in this window. Equation (16) will remove the un-necessary pixels and let the OOI mask be closed by extending the boundaries. The value is shown in Fig. 10. The result image that merges the edge extension image with the color segmentation edge is shown in Fig. 11. Figure 8. Color segmentation and egde detection flow chart The detailed algorithm is described as follows: (a) 1. Convert image to CIE Lab color space. 2. Build CIE Lab color histogram. 3. Follow color histogram to find local maximum value. 4. Apply local maximum color to be initial centroid of k-means classification. 5. Re-train the classifier until the cluster centers are stable. 6. Apply K-means clustering and remap the original (b) pixels to each cluster. Figure 10. (a) The result before the edge extension (b) The result after the Fig.9 shows the result of color segmentation. edge extension 1032
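The six listed steps of the hill-climbing color segmentation can be prototyped roughly as below. The histogram bin count and the use of SciPy's k-means are our assumptions for illustration, not parameters given by the authors.

    import numpy as np
    from scipy import ndimage
    from scipy.cluster.vq import kmeans2

    def hill_climbing_segmentation(lab_image, bins=16):
        """Sketch of the listed steps: build a 3-D Lab histogram, take its
        local maxima as initial centroids, refine with k-means, and map
        every pixel to a cluster label."""
        pts = lab_image.reshape(-1, 3).astype(np.float64)
        hist, edges = np.histogramdd(pts, bins=bins)
        # local maxima of the colour histogram = candidate cluster centres
        peaks = (hist == ndimage.maximum_filter(hist, size=3)) & (hist > 0)
        idx = np.argwhere(peaks)
        centers = np.array([[(edges[d][i[d]] + edges[d][i[d] + 1]) / 2.0
                             for d in range(3)] for i in idx])
        centroids, labels = kmeans2(pts, centers, minit='matrix')
        return labels.reshape(lab_image.shape[:2]), centroids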
    Plus ColorSeg Figure 13. Five examples with different aperture values The DOF is smaller as the aperture value gets lower, and the OOI would be blurred as well. The higher aperture value will increase the edge sharpness; that will cause the difficulty to separate the background and the OOI. From Fig. 14 to Fig. 17, we show the OOI detection results. By experiment, the object boundaries become irregular while the aperture value gets higher. In our experiment, the proper aperture value to obtain the best segmentation results is about f2.8 to f5.6. Figure 11. The result image that merged the edge extention image and color segmentation image We integrate the above edge pieces into a complete OOI mask. If the boundaries are closed, we will add this region into the final OOI mask. The edge combination of the final OOI mask is shown in Fig.12. Figure 12. Edge combination result III. THE EXPERIMENTAL RESULTS The aperture stop of a photographic lens companion with shutter speed can adjust the amount of light reaching to the film or image sensor. In this study, we use a digital camera Pantax istDL and a prime lens “Helois M44-2 60mm F2.0” Figure 14. The experimental results (sample 1) to perform the experiment. We choose a prime lens to be our test lens in order to reduce the instability parameters. To insure all of the exposures are the same, we have controlled the shutter speed and aperture parameter manually. To test the propose method, we select 5 test photos in a 50 photos album randomly. They are all prepared in a same condition and camera parameter. Fig. 13 shows the proposed OOI detection results in different aperture value. Figure 15. The experimental results (sample 2) 1033
Figure 16. The experimental results (sample 3)

Figure 17. The experimental results (sample 4)

A convincing definition of a "good OOI" is hard to give; it depends on human cognition. In this paper, we refer to N. Santh and K. Ramar's experiment [8] to verify the proposed method. First, five user-defined OOI boundaries are drawn; then we compare them with the boundaries detected by the proposed method. Equation (17) computes the overlapped region between the reference and the detected OOI boundaries, i.e.,

    Accuracy = 1 − Σ_{(x,y)} | I_est(x, y) − I_ref(x, y) | / Σ_{(x,y)} I_ref(x, y),   (17)

where I_est is the OOI mask from the proposed method and I_ref is the mask drawn by the user as the ground truth. Fig. 18 (a) shows the user-drawn OOI boundaries and (b) shows the detected OOI boundaries.

Figure 18. Comparison results: (a) user-drawn OOI boundary, (b) the proposed method's result

The detection accuracy decreases when the OOI has a complex texture such as a shirt, cloth, or artificial structures, and the accuracy is higher when the background is simple. However, even if the image is not correctly focused on the target, the proposed method can still find a complete object. The correctness becomes lower if there is more than one OOI in an image, as shown in sample 2 in Fig. 18. Table 2 shows the accuracy computed by Equation (17).

Table 2. The comparison result between the reference images and the proposed method

    Sample    1      2      3      4    5
    Accuracy  98.2%  94.6%  96.1%  98%  91%
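For completeness, the accuracy measure reads directly as a few lines of NumPy; the sketch below assumes both masks are binary (0/1) arrays of the same shape.

    import numpy as np

    def ooi_accuracy(est_mask, ref_mask):
        """Eq. (17) as reconstructed above: 1 - sum|I_est - I_ref| / sum(I_ref)."""
        est = est_mask.astype(np.float64)
        ref = ref_mask.astype(np.float64)
        return 1.0 - np.abs(est - ref).sum() / ref.sum()

    # e.g. ooi_accuracy(detected, user_drawn) returned values around 0.91-0.98 in Table 2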
    IV. CONCLUSION [6] Yun-Chung Chung, Jung-Ming Wang, Robert R. Bailey, Sei-Wang Chen, “A Non-Parametric Blur Measure Based on Edge Analysis for In this paper we propose a method to extract the OOI Image Processing Applications,” IEEE Conference on Cybernetics objects form a low DOF image based on edge and color and Intelligent Systems Singapore, 1-3 December, 2004. information. The method needs no user-defined parameters [7] Renting Liu ,Zhaorong Li ,Jiaya Jia, “Image Partial Blur Detection like shapes and positions of objects, or extra scene and Classi cation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8. information. We integrate the color saturation, morphological functions and color gradient to detect the [8] N. Santh, K.Ramar, “Image Segmentation Using Morphological Filters and Region Merging,” Asian Journal of Information rough OOI. Final we utilize color segmentation to make the Technology vol. 6(3) 2007,pp. 274-279. OOI boundaries close and compact. Our method takes both [9] D. Kornack and P. Rakic, “Cell Proliferation without Neurogenesis in advantages of edge detection and color segmentation. Adult Primate Neocortex,” Science, vol. 294, Dec. 2001, pp. 2127- The experiments show that our method works 2130. satisfactorily on many different kinds of image data. This [10] T.Ohashi, Z.Aghbari, and A.Makinouchi. “Hill-climbing Algorithm method can apply in image processing or computer vision for Efficient Color-based Image Segmentation,” IASTED tasks such as object indexing or content-based image International Conference On Signal Processing, Pattern Recognition, and Applications (SPPRA 2003), June 2003. P.200. retrieval as a pre-processing. [11] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk1. “Salient Region REFERENCES Detection and Segmentation,” International Conference on Computer Vision Systems (ICVS 2008), May 2008. PP.66-75 [1] InfoTrend ,”The Consumer Digital SLR Marketplace: Identifying & [12] Martin Ru i, Davide Scaramuzza, and Roland Siegwart. “Automatic Profiling Emerging Segments,” Digital Photography Trends. Detection of Checkerboards on Blurred and Distorted Images,” September,2008 International Conference on Intelligent Robots and Systems 2008, http://www.capv.com/public/Content/Multiclients/DSLR.html Sept, 2008. PP.22-26 [2] Dudubird, “Chinese Photographic Equipment Industry Market [13] Hanghang Tong, Mingjing Li, Hongjiang Zhang, and Chanshui Zang. Research Report,” December,2009. http://www.cnmarketdata.com “Blur Detection for Digital Images Using Wavelet Transform,” /Article_84/2009127175051902-1.html International Conference on Multimedia and Expo 2004, PP.17-20 [3] .” [14] Gang Cao, Yao Zhao and Rongrong Ni. “Edge-based Blur Metric for ,” Tamper Detection,” Journal of Information Hiding and Multimedia September,2007. https://www.fuji-keizai.co.jp/market/06074.html Signal Processing, Volume 1, Number 1, January 2009. pp. 20-27 [4] Khalid Idrissi, Guillaume Lavou e, Julien Ricard , and Atilla Baskurt, [15] Rong-bing Gan, Jian-guo Wang. “Minimum Total Variation “Object of interest-based visual navigation, retrieval, and semantic Autofocus Algorithm for SAR Imaging,” Journal of Electronics & content identi cation system” Computer Vision and Image Information Technology, Volume 29, Number 1, January 2007. pp. Understanding vol. 94 ,2004 , pp. 271-294. 12-14 [5] James Z. Wang, Jia Li, Robert M. 
Gray, Gio Wiederhold , [16] Ri-Hua XIANG, Run-Sheng WANG, “A Range Image Segmentation “Unsupervised Multiresolution Segmentation for Images with Low Algorithm Based on Gaussian Mixture Model,” Journal of Software Depth of Field” IEEE TRANSACTIONS ON PATTERN 2003, Volume 14, Number 7, pp. 1250-1257 ANALYSIS AND MACHINE INTELLIGENCE vol.23 no.1, January 2001, pp. 85-90. 1035
    Efficient Multi-Layer BackgroundModel on Complex Environment for Foreground Object Detection 1 Wen-kai Tsai(蔡文凱),2Chung-chi Lin(林正基), 1Ming-hwa Sheu(許明華), 1Siang-min Siao(蕭翔民), 1 Kai-min Lin(林凱名) 1 Graduate School of Engineering Science and Technology National Yunlin University of Science & Technology 2 Department of Computer Science Tung Hai University E-mail:[email protected] Abstract—This paper proposes an establishment of multi-layer has the advantages of updating model parameters background model, which can be used in a complex automatically, it is necessary to take a very long period of environment scene. In general, the surveillance system focuses time to learn the background model. In addition, it also faces on detecting the moving object, but in the real scenes there are strenuous limitations such as memory space and processing many moving background, such as dynamic leaves, falling rain speed in embedded system. Next, Codebook background etc. In order to detect the object in the moving background model [3] establishes a rational and adaptive capability environment, we use exponential distribution function to which is able to improve the detection accuracy of moving update background model and combine background background and lighting changes. However, the Codebook subtraction with homogeneous region analysis to find out background model still requires higher computational cost, foreground object. The system uses the TI TMS320DM6446 larger memory space for saving background data. Davinci development platform, and it can achieve 20 frames per second for benchmark images of size 160×120. From the Subsequently, Gaussian model [4] is presented by updating the threshold value for each pixel, but its disadvantages experimental results, our approach has better performance in includes large amount of computing and lots of memory terms of detection accuracy and similarity measure, when comparing with other modeling techniques methods. space used to record the background model. In order to reduce the usage of memory, [5] and [6] are to calculate the Keywords-background modeling; object detection weight value for each pixel to establish background model. According to the weight value, the updating mechanism determines whether the pixel is replaced or not. So it uses a I. INTRODUCTION less amount of memory space to establish moving Foreground object detection is a very important background,. technology in the image surveillance system since the system The above works all use multi-layer background model performance highly dependents on whether the foreground to store background information, but this is still inadequate object detection is right or not. Furthermore, it needs to to deal with moving background issues. They need to take detect the foreground object accurately and quickly, such into account the dependency between adjacent pixels to that the follow-up works such as tracking, identification can inspect whether the neighbor region possesses the be easy to perform correctly and reliably. Conceptually, the homogeneous characteristics or not. This paper proposes an technology of foreground object detection is based on efficient 4-layer background model and homogeneous region background substation mostly. This approach seems simple analysis to feature the background pixels. and low computational cost; however, it is difficult to obtain good results without reliable background model. To manage II. 
BUILDING MULTI-LAYER BACKGROUND MODELS these complex background scenarios, the skill of how to First, the input image pixel xi,j(t) consists of R, G and B construct a suitable background model has become the most elements as shown in Eq. (1). The pixels of moving crucial one. background are inevitably appeared in some region Generally speaking, most of the algorithms only regard repeatedly, so we have to learn these appearance behaviors non-moving objects as background, but in real environment, when constructing multi-layer background model. The first many moving objects may also belong to a part of the layer background model (BGM1) is used to store the first background, in which we named the moving background input frame. For the 2nd frame, we record on the difference such as waving trees. However, it is a difficult task to of the 1st and 2nd frames for the second layer background construct the moving background model. The general model (BGM2). Similarly, the difference of the consecutive 3 practice is to use algorithms to conduct the learning and grams is saved for the third layer (BGM3), etc. We use the establish of background model. After building up the model, first 4 frame and their differences as the initial background the system starts to carry on the foreground object detection. model. Besides, Eq. (2) is used to record the numbers of Therefore, in recent years a number of background models occurrence each pixel in the learning frame. have been proposed. The most popular approach is the Mixture of Gaussians Model (MoG) [1- 2]. Although MoG xi , j (t ) = ( xiR j (t ), xiGj (t ), xiB j (t )) , , , (1) 1036
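A minimal sketch of the initialization and occurrence counting described here, assuming grayscale frames for brevity; storing each further layer as the absolute difference of consecutive frames is one reading of the text, and th and N are the paper's symbols with illustrative values.

    import numpy as np

    def init_layers(frames):
        """Simplified 4-layer initialisation: layer 0 keeps the first frame,
        layers 1-3 keep differences of consecutive frames."""
        f = [frm.astype(np.float64) for frm in frames[:4]]
        return np.stack([f[0]] + [np.abs(f[k] - f[k - 1]) for k in (1, 2, 3)])

    def update_match(match, layers, frame, th=20.0):
        """Eq. (2): increment the per-layer counter wherever the input pixel
        lies within th of that layer; leave it unchanged otherwise."""
        close = np.abs(frame.astype(np.float64)[None, ...] - layers) <= th
        return match + close.astype(np.int32)

    # occurrence frequency, Eq. (3): lam = match / N after N learning frames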
    MATCH^u_{i,j}(t) = { MATCH^u_{i,j}(t−1),       if |x_{i,j}(t) − BGM^u_{i,j}(t)| > th
                       { MATCH^u_{i,j}(t−1) + 1,   else                                     (2)

where u = 1…4 and th is the threshold value for comparing similarity. From the 5th learning frame on, we start to calculate the repetition numbers of occurrence of all pixels in each layer of the background model, and Eq. (3) is used to obtain the frequency of occurrence:

    λ^u_{i,j} = MATCH^u_{i,j}(t) / N,   (3)

where N is the total number of learning frames. A larger λ^u_{i,j} indicates that the corresponding pixel has a higher occurrence during the learning period and must be preserved within the 4 layers. Conversely, a pixel with lower occurrence will be removed.

III. BACKGROUND UPDATE

After building up the multi-layer background model, we must update the content of BGM_{i,j} along with time, to replace inadequate background information. The background update mechanism is therefore very important for the subsequent object detection. The proposed background update method uses an exponential distribution model to calculate the weight value for each pixel, as shown in Eq. (4); it captures the repetition condition of occurrence for each pixel in the background model. A lower weight expresses that the corresponding pixel has not appeared for a long time and should be replaced by a higher-weight input pixel.

    weight^u_{i,j}(t) = λ^u_{i,j} · exp(−λ^u_{i,j} · t),  t > 0,   (4)

where t is the number of non-matching frames.

Fig. 1 shows the distribution of the weight values. If a pixel in the background model is not matched for a period of time, its weight value decreases exponentially. If the weight value is less than a threshold, the background should be replaced based on Eq. (5):

Figure 1. Exponential distribution of weight (weight versus t)

    BGM^u_{i,j}(t) = { removed,                                           if weight^u_{i,j}(t) < Te
                     { α × BGM^u_{i,j}(t) + (1 − α) × BGM^u_{i,j}(t−1),   else                     (5)

where Te is a threshold for the weight, and α is a constant with α < 1.

Based on the above-mentioned approach, Fig. 2 demonstrates a 4-layer background model constructed after learning 100 frames.

Figure 2. Multi-layer Background Model: (a) BGM1, (b) BGM2, (c) BGM3, (d) BGM4

IV. OBJECT DETECTION

After establishing an accurate background model, background subtraction can be used to obtain the foreground object. From practical observation, the moving background has a homogeneous characteristic. Therefore, the object detection method carries out the subtraction on both the 4-layer background and its homogeneous regions. As shown in Fig. 2, the information stored in the background model is the scene of the moving background, which has important homogeneity features. In Eqs. (6) and (7), TI(t) is the total matching index between the input pixel and the homogeneous region of the 4-layer background, and D^u_{i+k,j+p} is the individual matching index between the input pixel and one background datum BGM^u_{i+k,j+p}. The homogeneous region is defined as (2r+1) × (2r+1) for the background data at location (i, j).

    TI(t) = Σ_{u=1}^{4} Σ_{k=−r}^{r} Σ_{p=−r}^{r} D^u_{i+k,j+p}(t),   (6)

    D^u_{i+k,j+p}(t) = { 1, if |x_{i,j}(t) − BGM^u_{i+k,j+p}(t)| ≤ th
                       { 0, else                                          (7)

where th is a threshold value to determine whether they are similar.
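Continuing the sketch above with the same illustrative names, the weight of Eq. (4) and the stale-pixel replacement implied by Eq. (5) can be expressed as follows; the numeric values of Te and the blending branch of Eq. (5) are simplified assumptions.

    import numpy as np

    def layer_weights(lam, t_nomatch):
        """Eq. (4): weight = lambda * exp(-lambda * t); t counts the frames
        since a layer pixel last matched the input."""
        return lam * np.exp(-lam * t_nomatch)

    def refresh_layers(layers, lam, t_nomatch, frame, te=0.01):
        """One reading of Eq. (5) and the surrounding text: layer pixels whose
        weight has decayed below Te are treated as stale and overwritten by
        the corresponding input pixel."""
        w = layer_weights(lam, t_nomatch)                  # shape (4, H, W)
        stale = w < te
        out = layers.copy()
        out[stale] = np.broadcast_to(frame.astype(np.float64), layers.shape)[stale]
        return out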
If TI(t) is greater than a threshold τ, the input x_{i,j}(t) is similar to much of the background information and it is not an object pixel. Eq. (8) is used to find the foreground object (FO):

    FO_{i,j}(t) = { 0, if TI(t) ≥ τ
                  { 1, else            (8)

When FO_{i,j}(t) = 1, the input pixel belongs to a foreground object pixel. On the other hand, if FO_{i,j}(t) = 0, the input pixel belongs to the background.

V. EXPERIMENTAL RESULTS OF PROTOTYPING SYSTEM

Based on our proposed approach, the object detection is implemented on the TMS320DM6446 Davinci platform as shown in Fig. 3. The input image resolution is 160×120 per frame. On average, our approach can process 20 frames per second for object detection on the prototyping platform.

Figure 3. TI TMS320DM6446 Davinci development kit

Next, by using the presented research methods, the foreground objects with binary-valued results are displayed in Fig. 4. The ground truth, in which the objects are segmented manually from the original image frame, is regarded as the perfect result. It can be found that our result gives the better object detection. In order to make a fair comparison, we adopt the similarity and total-error-pixel measures of [7] to assess the results of the algorithms. Eq. (9) is used to get the total error pixel number and Eq. (10) is used to evaluate the similarity value.

Figure 4. Foreground Object Detection Result

    total error pixels = fn + fp,   (9)

    Similarity = tp / (tp + fn + fp),   (10)

where fp is the total number of false positives, fn is the total number of false negatives, and tp indicates the total number of true positives. Fig. 5 depicts the number of error pixels for a video sequence. We can see that the numbers of error pixels produced by our proposed method are less than those of the other algorithms. Fig. 6 shows the similarity for the video sequence. Our proposed approach achieves the highest similarity value, i.e., our results are close to those of the ground truth.

Figure 5. Error pixels by different methods (error-pixel counts of Wu [2], Chien [5], Tsai [6], and the proposed method over frames 240-280)
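To make Eqs. (6)-(10) concrete, here is a NumPy sketch of the neighborhood-matching detector and the two evaluation measures; the parameter values (r, th, τ) are illustrative, not the ones used on the DM6446 platform.

    import numpy as np

    def detect_foreground(frame, layers, r=1, th=20.0, tau=3):
        """Eqs. (6)-(8): count, over all 4 layers and a (2r+1)x(2r+1)
        neighbourhood, how many background samples the input pixel matches;
        pixels with fewer than tau matches are declared foreground."""
        frame = frame.astype(np.float64)
        h, w = frame.shape
        pad = np.pad(layers.astype(np.float64), ((0, 0), (r, r), (r, r)), mode='edge')
        ti = np.zeros((h, w), dtype=np.int32)
        for k in range(-r, r + 1):
            for p in range(-r, r + 1):
                block = pad[:, r + k:r + k + h, r + p:r + p + w]
                ti += (np.abs(frame[None, ...] - block) <= th).sum(axis=0)
        return (ti < tau).astype(np.uint8)          # 1 = foreground (Eq. (8))

    def total_error_and_similarity(mask, truth):
        """Eqs. (9)-(10): total error pixels = fn + fp; similarity = tp/(tp+fn+fp)."""
        tp = int(np.logical_and(mask == 1, truth == 1).sum())
        fp = int(np.logical_and(mask == 1, truth == 0).sum())
        fn = int(np.logical_and(mask == 0, truth == 1).sum())
        return fn + fp, tp / float(tp + fn + fp)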
    Wu[2] [6] Wen-Kai Tsai, Ming-Hwa Sheu, Ching-Lung Su, Jun-Jie Lin and Similarity Shau-Yin Tseng, “Image Object Detection and Tracking Chien[5] 1 Tsai[6] Implementation for Outdoor Scenes on an Embedded SoC 0.9 Our proposed Platform,” International Conference on Intelligent Information 0.8 Hiding and Multimedia Signal Processing, pp.386-389, September, 0.7 2009. 0.6 [7] Lucia Maddalean, Alfredo Petrosino, “A Self-Organizing Approach Sim ilarity to Background Subtraction for Visual Surveillance Applications,” 0.5 IEEE Trans. on Image Processing, vol. 17, No.7, July, 2008. 0.4 0.3 0.2 0.1 0 242 245 248 251 254 257 260 Frame Number Figure 6. Similarity by different methods VI. Conclusion In this paper, we propose an effective and robust multi- layer background modeling algorithm. The foreground object detection will encounter the problem of moving background, because there are outdoor scenes of fluttering leaves, rain, and indoor scenes of fans etc. Therefore, we construct the moving background into multi-layer background model through calculating weight value and analyzing the characteristics of regional homogeneous. In this way, our approach can be suitable to a variety of scenes. Finally, we present the result of foreground detection by using data-oriented form of similarity and total error pixels, furthermore through explicit data and graph to show the benefit of our algorithms. REFERENCES [1] C. Stauffer, W. Eric L. Grimson, “Learning Patterns of Activity Using Real-Time Tracking,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol.22, No. 8, pp.747-757, 2000. [2] H. H. P. Wu, J. H. Chang, P. K. Weng, and Y. Y. Wu, “Improved Moving Object Segmentation by Multi-Resolution and Variable Thresholding, ” Optical Engineering. vol. 45, No. 11, 117003, 2006. [3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, “Real-Time Foreground-Background Segmentation using Codebook Model, ” Real-Time Imaging, pp.172-185, 2005. [4] Hanzi Wang, and David Suter, “ A Consensus-Based Method for Tracking Modelling Background Scenario and Foreground Appearance,” Pattern Recognition, pp.1091-1105, 2006. [5] Wei-Kai Chan, Shao-Yi Chien,”Real-Time Memory-Efficient Video Object Segmentation in Dynamic Background with Multi- Background Registration Technique,” International Workshop on Multimedia Signal Processing, pp.219-222, 2002. 1039
    CLEARER 3D ENVIRONMENTCONSTRUCTION USING IMPROVED DM BASED ON GAZE TECHNOLOGY APPLIED TO AUTONOMOUS LAND VEHICLES 1 2 Kuei-Chang Yang (楊桂彰), Rong-Chin Lo (駱榮欽) 1 Dept. of Electronic Engineer & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei 2 Dept. of Electronic Engineer & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei E-mail: [email protected] ABSTRACT to obtain meaningful information. There are a lot of manpower and resources devoted to the binocular stereo In this paper, we propose a gaze approach that sets vision [4] research for many countries. As applied to the binocular cameras in different baseline distances to robots and ALV, the advantage of binocular stereo obtain better resolution of three dimensions (3D) vision is to obtain the depth of the environment, and this environment construction. The method being capable depth can be used for obstacle avoidance, environment of obtain more accurate distance of an object and learning, and path planning. In such applications, the clearer environment construction that can be applied to disparity is used as the vision system based on image the Autonomous Land Vehicles (ALV) navigation. In recognition and image-signal analysis. Besides, two the study, the ALV is equipped with parallel binocular cameras need to be set in parallel and to be fixed cameras to simulate human eye to have the binocular accurately, this disparity method still requires a high- stereo vision. Using the information of binocular stereo speed computer to store and analyze images. However, vision to build a disparity map (DM), the 3D setting the binocular cameras of ALV with fixed environment can be reconstructed. Owing to the baseline can only obtain the better DM of environment baseline of the binocular cameras usually being fixed, images in a specific region. the DM, shown as an image, only has a better resolution in a specific distance range, that is, only partial specific In this paper, we try to propose an approach that sets the region of the reconstructed 3D environment is clearer. binocular cameras with different baseline to obtain the However, it cannot provide a complete navigation depths of DM corresponding to different measuring environment. Therefore, the study proposes the multiple distances; In the future, this method can obtain the baselines to obtain the clearer DMs according to the environment image from near to far range, such that it near, middle and far distances of the environment. will help the ALV in path planning. Several experimental results, showing the feasibility of the proposed approach, are also included. 2. STEREO VISION Keywords binocular stereo vision; disparity map In recent years, because the computing speed of the computer is much faster and its hardware 1. INTRODUCTION performance also becomes better, therefore a lot of researches relating to the computer vision are proposed In recent years, the machine vision is the most for image processing. The computer vision system with important sensing system for intelligent robots. The depth sensing ability is called the stereo vision system, vision image captured from camera has a large number and the stereo vision is the core of computer vision of object information including shape, color, shading, technologies. However, one camera can only obtain two shadow, etc. 
Unlike other used sensors can only obtain dimensions (2D) information of environment image that one of measurement information, such as ultrasonic is unable to reconstruct the 3D coordinate, To improve sensors [1], infrared sensors [2], laser sensors [3], etc. the shortage of one camera, in the study, two cameras In other words, the visual sensor can achieve a lot of are used to calculate 3D coordinate. The details are environmental information, but this information is with described in the following sub-sections. each other. Therefore, various image processing techniques are necessary for separating them one by one 1040
    2.1. Projective Transform Nowadays the cost of camera becomes very The projective transform model of one camera is cheaper, therefore, in the study, we chose two cameras to project the real objects or scene to the image plane. fixed in parallel to solve the problem of depth and As shown in Fig. 1, assume that the coordinate of object height. The usage of parallel cameras can reduce the P in the real world is (X, Y, Z) relative to the origin (0, complexity of the corresponding problem. In Fig. 3, we 0, 0) at the camera center. After transform, the easily derive the Xl and Xr by using similar triangles, and coordinate of P' projected by P on the image plane is (x, we have: y, f) relative to the image origin (0, 0, f), where f is the Zx   l X  l distance from the camera center to image plane. Using f similar triangle geometry theory to find the relationship between the actual object P and its projected point P' on Zx r the image plane, the relationship between two points is Xr  f   as follows: X   Assuming that the optical axes of two cameras are x f parallel to each other, where b is the distance between Z Y  two camera centers, and b= Xl - Xr. C and G are the y f projected points of P to left image plane and right image Z plane, respectively. The disparity d is defined as d = xl - Therefore, even if P'(x, y, f) captured from the xr . From (3) and (4), we have: image plane is the known condition, we still cannot b  X l  X r  x l  x r   d  Z Z calculate the depth Z of P point and determine its coordinate P(X, Y, Z) according to (1) and (2) unless f f we know one of X or Y (height) or Z (depth). Therefore, the image depth Z can be given by: f  b  P (X ,Y ,Z ) Z d Y y P' ( x , y , f ) Xl P(Object) X x r X Z Z Z Camera center C (0,0,0) Image plane xl xr G f f f Figure 1. Perspective projection of one camera. Ol b Or 2.2. Image Depth From the previous discussion, we have known that Figure 3. Projection transform of two cameras and disparity. it is impossible to calculate accurately the depth or height of object or scene from the information of one As shown in Fig. 4. The height image of an object can camera, even if we have a lot of known conditions in be derived from the height of the object image based on advance. Therefore, several studies use the overlapping the assumption of a pinhole camera and the image- view’s information of two [5] or more cameras to forming geometry. calculate the depth or height of object or scene, shown  Y  y  Z  in Fig. 2. f Right image plane Y y Pinhole Left image plane r r (x ,y ) x Optical axis y y x b l l (x ,y ) f Z P (X ,Y ,Z ) Figure 4. Image-forming geometry. Figure 2. The relationship between depth and disparity for two cameras. 1041
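The depth relation derived above, Z = f·b/d, applies per pixel once a disparity map is available. The sketch below assumes f is given in pixels and b in metres; the calibration values quoted later in the paper (f = 874 pixels, b = 20 cm) are used only in the usage comment.

    import numpy as np

    def depth_from_disparity(disparity, f_pixels, baseline_m, min_disp=1e-3):
        """Z = f * b / d, applied to a whole disparity map; pixels with
        (near-)zero disparity are returned as +inf (no reliable depth)."""
        d = disparity.astype(np.float64)
        z = np.full_like(d, np.inf)
        valid = d > min_disp
        z[valid] = f_pixels * baseline_m / d[valid]
        return z

    # usage: z = depth_from_disparity(dmap, 874.0, 0.20)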
    Due to rapidcorresponding on two cameras, the the middle region is from 5m to 10m, and far region is method has high efficiency on calculating the depth and over 10m. height of the objects, and is suitable for the application Acquisition of the best baseline b of ALV navigating. This method can find the disparity d from two corresponding points (for instance, C and G To acquire a best baseline b means that to find the shown in Fig. 3.) respective to left image and right appropriate cameras baseline b on the basis of different image. Here, the accuracies of two corresponding points depths of the region. Table 1 and Table 2 show the are very important. Regard the value of disparity d as relationship between depth Z and the two-camera image intensity shown by gray values (0 to 255), then, distance baseline b. We set the d = 30 as the threshold whole disparities form an image, called disparity map value dth, and region of d less than dth as background. (DM) or DM image. The DM construction proposed by Therefore, when the depth Z is known, the disparity d Birchfield and Tomasi [6] is employed in this paper. can be obtained from Table 1 and Table 2in the The advantage of this constructing method is faster to different baseline, and then find the most appropriate obtain all depths that also include the depths of value of b makes the value of d closest or greater than discontinuous, covered, and mismatch points. Otherwise, the dth. For example: 20cm is the best b for short-range the disadvantage is to lack the accuracy of the obtained region (0 m~ 5m), and 40cm for medium-range region disparity map. Fig. 5 shows that the disparity map is (5m ~ 10m). generated from left and right images. Calculation of the depth and height The cameras are calibrated [8] in advance, then, we can obtain the focus value f =874 pixels. Substituting the obtained d for object into (6), we find the distance Z between the camera and the object, and Z is then substituted into (7) to calculate the object height Y [9] that usually can be used to decide whether the object is an obstacle. (a) Distance (b) Figure 5. The disparity map (a) left image and right image (b) disparity map Disparity Camera 3. PROPOSED METHOD Figure 6. The relationship between distance and disparity. From (6) [7], we know that the object is far from TABLE I. DISPARITY VALUES d (PIXELS) VS. DEPTH Z=1M~5M two cameras, the disparity value will become small, and AND BASELINE b =10CM~150CM. vice versa. In Fig. 6, there is obviously a nonlinear Z(m) relationship between these two terms. The disadvantage b(cm) 1 2 3 4 5 of DM is that the farther distance between objects and 10 87 44 29 22 17 two cameras makes the smaller disparity value, and it 20 175 87 58 44 35* begets the difficulty of separation between the object 30 262 131 87 66 52 40 350 175 117 87 70 and the background becoming difficult. Therefore, how 50 437 219 146 109 87 to find the suitable baseline b for obtaining the clearer 60 524 262 175 131 105 DM for each region in different depth region of two 70 612 306 204 153 122 cameras is required. The processing steps are described 80 699 350 233 175 140 in the following sub-sections: 90 787 393 262 197 157 100 874 437 291 219 175 Region segmentation 110 961 481 320 240 192 120 1049 524 350 262 210 We partition the region segmentation into three 130 1136 568 379 284 227 levels by near, middle and far, and obtain the best DM 140 1224 612 408 306 245 of the depth in the different regions. 
In the paper, we 150 1311 656 437 328 262 define the near region is the distance from 0m to 5m, *: The best disparity for short-range region. 1042
  • 56.
    TABLE II. DISPARITY VALUES d (PIXELS) VS. DEPTH Z=6M~10M AND BASELINE b =10CM~150CM. Z(m) b(cm) 6 7 8 9 10 10 15 12 11 10 9 20 29 25 22 19 17 30 44 37 33 29 26 40 58 50 44 39 35* 50 73 62 55 49 44 60 87 75 66 58 52 70 102 87 76 68 61 80 117 100 87 78 70 90 131 112 98 87 79 100 146 125 109 97 87 110 160 137 120 107 96 120 175 150 131 117 105 130 189 162 142 126 114 140 204 175 153 136 122 150 219 187 164 146 131 *: The best disparity for medium-range region. 4. EXPERIMENTAL RESULTS Figure 8. The disparity map (a) left image and right image (b) The proposed methods have been implemented disparity map (Z=400CM、800cm,b=20cm). and tested on the 2.8GHz Pentium IV PC. Fig. 7 shows two cameras are fixed on a sliding way and can be pulled apart to change the baseline distance. In Section III, we know that the best b for the short-range region 0m ~ 5m is 20cm, and 40cm for medium-range region 5m ~ 10m. Therefore, we set two persons standing in the distance from the two-camera of 4m and 8m, two- camera distance b = 20cm, shown in Figure 8. Because the person standing at 4m is in the short-range region, so it can be seen clearly. However, another person standing at 8m is in medium-range region, it's difficult to separate it from background. Figure 9. The disparity map (a) left image and right image (b) disparity map (Z=800CM,b=20cm). Figure 7. Experiment platform of stereo vision. To compare Fig. 9 and Fig. 10 with the distance from a person to the baseline is 8m (medium-range region) and the baseline is changed from b = 20cm to b = 40cm, so the results show that as b = 40cm, the person (object) becomes clearer as shown in Fig. 10. 1043
  • 57.
    [6] S. Birchfieldand C. Tomasi, ”Depth Discontinuities by Pixel-to-Pixel Stereo,” International Journal of Computer Vision, pp. 269-293, Aug 1999. [7] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Press, 2008. [8] http://www.vision.caltech.edu/bouguetj/calib_doc/ [9] L. Zhao and C. Thorpe, “Stereo- and Neural Network- Based Pedestrian Detection,” IEEE Trans, Intelligent Transportation System, Vol. 3, No. 3, pp. 148-154, Sep 2000. Figure 10. The disparity map (a) left image and right image (b) disparity map (Z=800cm,b=40cm). 5. CONCLUSION From the experimental results, we have found that the suitable baseline of two cameras can help us to obtain the better disparity. However, if the object is far from two cameras, its disparity value will become small, then the disparity value of the object is near to that of the background, and not easily detected. Using the proposed method, to change the baseline of two cameras, the object becomes clearer and easier detected, and 3D object information is obtained more. The results can be used to a lot of applications, for example, ALV navigation. In the future, we plan to solve the DM noise of horizontal stripe inside, so DM can be shown better. REFERENCES [1] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer Magazine, pp. 46- 57, June 1989. [2] J. Hancock, M. Hebert and C. Thorpe, “Laser intensity- based obstacle detection Intelligent Robots and Systems,” 1998 IEEE/RSJ International Conference on Intelligent Robotic Systems, Vol. 3, pp. 1541-1546, 1998. [3] E. Elkonyaly, F. Areed, Y. Enab, and F. Zada, “Range sensory=based navigation in unknown terrains,” in Proc. SPIE, Vol. 2591, pp.76-85. [4] 陳禹旗,使用 3D 視覺資訊偵測道路和障礙物應用於人 工智慧策略之室外自動車導航,碩士論文,國立台北 科技大學電腦與通訊研究所,台北,2003。 [5] 張煜青,以雙眼立體電腦視覺配合人工智慧策略做室 外自動車導航之研究,碩士論文,國立台北科技大學 自動化科技研究所,台北,2003。 1044
    A MULTI-LAYER GMMBASED ON COLOR-TEXTURE COMBINATION FEATURE FOR MOVING OBJECT DETECTION Tai-Hwei Hwang (黃泰惠), Chuang-Hsien Huang (黃鐘賢), Wen-Hao Wang (王文豪) Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute, Chutung, HsinChu, Taiwan ROC 310 E-mail: {hthwei, DavidCHHuang, devin}@itri.org.tw ABSTRACT background scene. The background scene contains the images of static or quasi-periodically dynamic objects, Foreground detection generally plays an important role in the for instance, sea tides, a fountain, or an escalator. The intelligent video surveillance systems. The detection is based representation of background scene is basically a on the characteristic similarity of pixels between the input collection of statistics of pixel-wise features such as image and the background scene. To improve the color intensities or spatial textures. The color feature characteristic representation of pixel, a color and texture combination scheme for background scene modeling is can be the RGB components or other features derived proposed in this paper. The color-texture feature is applied from the RGB, such as HSI, or YUV expression. The into a four-layer structured GMM, which can classify a pixel texture accounts for information of intensity variation in into one of states of background, moving foreground, static a small region centered by the input pixel, which can be foreground and shadow. The proposed method is evaluated computed by the conventional edge or gradient with three in-door videos and the performance is verified by extraction algorithm, local binary pattern [1], etc. The pixel detection accuracy, false positive and false negative rate statistical background models of pixel color and textures based on ground truth data. The experimental results are respectively efficient when the moving objects are demonstrate it can eliminate shadow significantly but without with different colors from background objects and are many apertures in foreground object. full of textures for either background or foreground moving objects. For example, it is hard to detect a 1. INTRODUCTION walking man in green from a green bush using the color feature only. In this case, since the bush is full of Wide range deployment of video surveillance system is different textures from the green cloth, the man can be getting more and more importance to security easily detected by background subtraction with texture maintenance in a modern city as the criminal issue is feature. However, this will not be the case when only strongly concerned by the public today. However, using the texture difference to detect the man walking in conventional video surveillance systems need heavy the front of flat white wall because of the lack of texture human monitoring and attention. The more cameras for both the cloth and the wall. Therefore, some studies deployed, the more inspection personnel employed. In are conducted to combine the color and the texture addition, attention of inspection personnel is decreased information together as a pixel representation for over time, resulting in lower effectiveness at recognizing background scene modeling [2][3][4]. In addition to the events while monitoring real-time surveillance videos. 
different modeling abilities of color and texture, texture To minimize the involved man power, research in the feature is much more robust than color under the field of intelligent video surveillance is blooming in illumination change and is less sensitive to slight cast recent years. shadow of moving object. Among the studies, background subtraction is a Though the combination of color and texture can fundamental element and is commonly used for moving provide a better modeling ability and robustness for object detection or human behavior analysis in the background scene under illumination change, it is not intelligent visual surveillance systems. The basic idea enough to eliminate a slightly dark cast shadow or to behind the background subtraction is to build a keep an invariant scene under stronger illumination background scene representation so that moving objects change or automatic white balance of camera. To in the monitored scene can be detected by a distance improve the robustness of background modeling further, comparison between the input image and the 1045
    a simple butefficient way to eliminate shadows is to waving leaves. In this study, we propose a four-layer filter pixels casted by shadows according to the scene model which classes each pixel into four states, i.e. chromatic and illuminative changes. In the illuminative background, static foreground, moving foreground, and component the value of the shadow pixel is lower than shadow. We improve Gallego’s work by modeling that in background model; while in the chromatic background with mixture Gaussians of color and texture component, it shows slightly different from that in the combined feature and design related mechanisms for background model. Therefore, shadows can be detected state transition. In addition, we also bring the concept of by using thresholding technique to obtain the pixels shadow learning, based on the work [7], into the which are satisfied with these physical characteristics. proposed scene model. The structure and the Cucchiara et al. [5] transformed video frames from RGB mechanisms of our background scene model are space to Hue-Saturation-Intensity (HSI) space to described in section 2. Section 3 reveals applicable highlight these physical characteristics. In the work of scenarios and experimental results. Section 4 presents Shan et al. [6], they evaluated the performance of the conclusions and our future works thresholding-based shadow detection approach on different color spaces such as HSI, YCrCb, c1c2c3, L*a*b. To sum up, conventional approaches are based 2. MULTI-LAYER SCENE MODEL on transforming the RGB features to other color domains or features, which have better characteristics to Figure 1 illustrates the flowchart of the multi-layer scene represent shadows. But no matter what kind of color model. In the first stage, the color and texture spaces or features is adopted, users usually need to set representation have to be obtained for all pixels in the one or more threshold values to filter shadows out. input image. Four layers which represent the states of background, shadow, static foreground and moving Recently, Nicolas et al. [7] proposed an online- foreground, are modeled separately. For each pixel i learning approach named Gaussian Mixture Shadow belonging to the current frame, if it is fit to the Model (GMSM) for shadow detection. The GMSM background model, the background model is updated utilities two Gaussian mixture models (GMM) [8] to and the pixel is then labeled as the state of background. model the static background and casting shadows, Otherwise, the pixel is passed to the shadow layer. respectively. Afterward, Tanaka et al. [9] used the same idea but modeled the distributions of background and In the shadow layer, i is examined whether it is shadows non-parametrically by Parzon windows. It is satisfied to be a shadow candidate by a weak shadow faster than GMSM but costs more storage space. Both of classifier, which was designed according to the shadow them are based on statistical analysis and have better physical characteristics such as the mentioned chromatic discriminative power on shadows, especially when the and illuminative changes. If i is determined as a shadow color of moving object shows similar attribute to the candidate, the shadow layer is updated by the pixel’s pixels covered by shadows. color features. If i shows strong fitness to the dominant Gaussian of the updated shadow model, its state is then On the other hand, maintenance of static foreground labeled as shadow. 
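The weak shadow classifier described above (and formalized later in Eqs. (8)–(10)) can be sketched as follows: a pixel is accepted as a shadow candidate when its luminance is attenuated by a bounded ratio with respect to the background mean while its chrominance stays close to the background. The threshold values in the signature are illustrative placeholders of the kind a user would set through the GUI; they are not taken from the paper.

    def is_shadow_candidate(y, u, v, mu_y, mu_u, mu_v,
                            r_min=0.4, r_max=0.9, lam_u=10.0, lam_v=10.0):
        """Weak shadow test in YUV (cf. Eqs. (8)-(10)); thresholds are illustrative."""
        darker = r_min < y / max(mu_y, 1e-6) < r_max   # luminance attenuated but not black
        similar_u = abs(u - mu_u) < lam_u              # chrominance close to background
        similar_v = abs(v - mu_v) < lam_v
        return darker and similar_u and similar_v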
For the pixel which is not satisfied to objects is also an important issue for background being a shadow candidate or does not fit to the shadow modeling. The static foreground objects are those model, we pass it to the static foreground layer. objects that, after entering into the surveillance scene, reach a position and then stop their motion. Examples Consequentially, if i dose not fit the static foreground are such as cars waiting for traffic lights, browsing model if it exists, i is passed to the moving foreground people in shops, or abandoned luggage in train stations. layer. When i fits the moving foreground model, it is the In traditional GMM-based background models [8], the circumstance that the state of the moving object is from static foreground objects are usually absorbed into moving to staying at the current position. As a result, we background after a given time period, which usually update the moving foreground model by i’s color proportional to the learning rate of the background features. A counter named CountMF corresponding to the model. The current state-of-the-art technique to moving background model is increased as well. When distinguish the static foreground objects from static CountMF reaches another user-defined threshold T2, we background and moving objects is to maintain a multi- replace the static foreground model by the moving layer model representing background, moving foreground model and CountSF is set to zero. Otherwise, foreground and static foreground separately [10,11]. if i does not fit the moving foreground model, we use it to reinitialize the moving foreground model, i.e. set In the work of Gallego et al. [11], they proposed a three- CountMF to zero, and then update the background model layer model which comprises moving foreground, static by the past moving foreground model. The reason of foreground and background layer. However, they using the moving foreground model to update the modeled background by using a single Gaussian, which background is to allow the background model having the can not cope with the multi-mode background such as ability to deal with the multi-mode background problem 1046
    such as wavingleaves, ocean waves or traffic lights. The background model is first initialized with a set of Details of feature extraction stage, the background and training data. For example, the training data could be the shadow layers are described in the following collected from the first L frames of the testing video. subsections. After that, each pixel at frame t can be determined whether it matches to the m-th Gaussian Nm by satisfying 2.1. Feature extraction stage the following inequality for all components {xC ,i , xT ,i } ∈ x : The color-texture feature is a vector including ( xC , i − µC ,i , m ) 2 ( xT ,i − µT , i, m ) 2 dC dT λ 1− λ B B components of RGB and local difference pattern (LDP) as the texture in a local region. The LDP is an edge-like dC ∑ i =1 k × (σ C ,i , m ) 2 B + dT ∑ i =1 k × (σT , i , m ) 2 B <1 (3) feature which is consisted of intensity differences between predefined pixel pairs. Each component of LDP where dC and dT denote vector dimension of color and is computed by texture, respectively, λ is the color-texture combination weight, k is a threshold factor and we set it to three LDPn(C)=I(Pn)-I(C), (1) according to the three-sigma rule (a.k.a. 68-95-99.7 rule) where C and Pn represent the pixel and thereof neighbor of normal distribution. The weights of Gaussian pixel n, respectively, and I(C) represents the gray level distribution are sorted in decreasing order. Therefore if intensity of pixel C. The gray level intensity can be the pixel matches to the first nB distributions, where nB is computed by the average of RGB components. Four obtained by Eq. (4), it is then classified as the types of pattern defining the neighbor pixels are background [13]. depicted in Figure 2 and are separately adopted to compare their performance of moving object detection  b  experimentally. b  ∑ n B = arg min  π m > 1 − p f   (4)  m=1  where pf is a measure of the maximum proportion of the data that belong to foreground objects without influencing the background model. When a pixel fits the background model, the background model is updated in order to adapt it to Fig. 2. Four types of pattern defining neighbor pixels progressive image variations. The update for each pixel for computation of LDP is as follows: 2.2. Background Layer π m ← π m + α (om − π m ) − αcL B B B (5) The GMM background subtraction approach presented by Stauffer and Grimson [8] is a widely used approach µ m ← µ m + om (α / π m )(x − µ m ) B B B B (6) for extracting moving objects. Basically, it uses couples of Gaussian distribution to model the reasonable (σ m ) 2 ← (σ m ) 2 + om (α / π m )((x − µ m )T (x − µ m ) − (σ m ) 2 ) B B B B B B (7) variation of the background pixels. Therefore, an unclassified pixel will be considered as foreground if the where α=1/L is a learning rate and cL is a constant value variation is larger than a threshold. We consider non- (set to 0.01 herein [14]). The ownership om is set to 1 for correlated feature components and model the the matched Gaussian, and set to 0 for the others. background distribution with a mixture of M Gaussian distributions for each pixel of input image: 2.3. Shadow Layer M p (x) = ∑π m=1 B B B m N m ( x; µ m , Iσ m ) (2) The problem of color space selection for shadow detection has been discussed in [6][12]. Their experimental results revealed that performing cast B where x represents the feature vector of a pixel, µ m is shadow detection in CIE L*u*v, YUV or HSV is more B efficient than in RGB color space. 
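To make the feature construction and the matching rule concrete, the sketch below computes the local difference pattern of Eq. (1) for one pixel and evaluates the color–texture match inequality of Eq. (3) against a single Gaussian component. The neighbour offsets stand in for one of the patterns of Fig. 2 and are illustrative; λ = 0.3 and k = 3 follow the values quoted in the text.

    import numpy as np

    def ldp(gray, y, x, offsets):
        """Local difference pattern of Eq. (1): I(P_n) - I(C) for predefined neighbours."""
        c = float(gray[y, x])
        return np.array([float(gray[y + dy, x + dx]) - c for dy, dx in offsets])

    def fits_component(x_c, x_t, mu_c, mu_t, var_c, var_t, lam=0.3, k=3.0):
        """Match test of Eq. (3): weighted, normalised squared distances must sum below 1."""
        term_c = (lam / len(x_c)) * np.sum((x_c - mu_c) ** 2 / (k * var_c))
        term_t = ((1.0 - lam) / len(x_t)) * np.sum((x_t - mu_t) ** 2 / (k * var_t))
        return term_c + term_t < 1.0

    # illustrative cross-shaped neighbour pattern (cf. Fig. 2)
    offsets = [(-2, 0), (2, 0), (0, -2), (0, 2)]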
Considering that the the estimated mean, σ m is the variance, and I represents RGB-to-CIE L*u*v transform is nonlinear and the Hue the identity matrix to keep the covariance matrix domain is circular statistics in HSV space, YUV color isotropic for computational efficiency. The estimated space shows more computing efficiency due to its mixing weights, denoted by π m , are non-negative and B linearity of transforming from RGB space. In addition, they add up to one. YUV is also for interfacing with analogy and digital television or photographic equipment. As a result, YUV 1047
    color features wereadopted in this study, i.e. the color frame. The color-texture representation of pixel is the components x mentioned in the previous subsection. It is vector concatenation of RGB and LDP. The second worth reminding that Y stands for the illuminative pattern in Figure 2 is adopted for the computation of component and U and V are the chromatic components. LDP in the experiments. The effect of shadow For the pixel which does not fit the background model, elimination are shown not only by background masks it is then passed to the shadow layer. First, it is but also with the pixel detection accuracy rate (Acc.), examined if it is qualified as a shadow candidate by a false positive rate (FPR), and false negative rate (FNR), weak shadow classifier according to the following rules if ground truth data is available. These quantitative [7] measures are defined as follows, xY rmin < < rmax (8) # TP µY B Acc. = (15) # TP + # FP + # FN | xU − µU |< Λ U B (9) # FP FPR = (16) | xV − µ V |< Λ V B (10) # TP + # FP + # FN model, respectively. The parameters, rmin, rmax, ΛU and # FN FNR = (17) Λmax, are user-defined thresholding values. Users just # TP + # FP + # FN need to set them roughly by a friendly graphical user interface (GUI) because the more precise shadow Where TP is short for true positive and #TP means pixel classification will further be made by the following number of TP in a frame. In general, false positives are shadow GMM. resulted from moving cast shadows and false negatives are apertures inside the foreground regions. In the Similar to the background layer, the shadow layer is also following figures of background mask, the pixels modeled by a GMM but only the color features of depicted with black, red, white and green represent the shadow candidate will be fed in. For initialization, rmin, background region, false negatives, moving foreground rmax, ΛU and Λmax are used to derive the first Gaussian of and shadows, respectively. The experiments are each color component and set its weight to one. The performed on a personal computer with a Pentium 4 3.0- corresponding means and variances of the first Gaussian GHz CPU and 2 GB RAM. The processing frame rate is are obtained by the following equations: about 15 frames/second. µ Y = µ Y (rmax + rmin ) / 2 S B (11) 3.2.Effect of color-texture combination σY S = ( µ Y rmax B − µY S )/3 (12) To check the effectiveness of using color-texture µU S = µU B (13) combination feature, an experiment of background subtraction using the feature but with single background σ U = ΛU / 3 S (14) layered GMM is conducted. The result of video 1 is demonstrated in Figure 3. The combination weights λ’s where superscripts B and S are related to the background are set to 1, 0.3, and 0 for experiments in column 2, 3, or shadow models, and µV and σ V are calculated in the S S and 4 in Figure.3, respectively. When λ=1, i.e., only same way as Eq. (13) and Eq. (14). In the circumstance color feature is effectively used, there are significant if a feature vector x is not matched to any Gaussian false positives caused by shadows and camera brightness distribution, the Gaussian which has the smallest weight control in the background masks of column 2. When λ=0, is replaced with µ = x and σ = [σ0 σ0 σ0]T, where σ0 is an i.e., only texture feature is effectively used, most of the initial variance. false positives disappear but many apertures show up in the foreground region at column 4 because of the lack of 3. 
EXPERIMENTS texture in both the road scene (background) and most of the surface of car (foreground). When λ=0.3, i.e., the 3.1. Experimental setting color-texture feature is used, the number and size of There are four videos used in the experiment, video 1 is aperture in the results at column 3 become smaller than collected from a road side camera of real surveillance the results at column 4. system by the police department of Taichung County (PDTC), and the others are recorded indoors. Video 2 is 3.3.Results of using multi-layer GMM collected by our colleagues at a porch with glossy wall that reflects object slightly, video 3 and video 4 are The experimental results of using the multi-layer GMM selected from the video dataset at on video 2, 3, and 4 are demonstrated in Figure 4, 5, and http://cvrr.ucsd.edu/aton/shadow, which are entitled by 6, respectively. The results at column 2, 3, and 4 of each intelligentroom_raw and Laboratory_raw, resoectively. figure are obtained by using single layered RGB, multi- The image size of these videos are 320x240 pixels per layered RGB, and multi-layered RGB+LDP background 1048
    models, respectively. Detectionrates, including color and gradient information”, In Proc. of IEEE detection accuracy, false positive rate, and false negative Workshop on Motion and Video Computing, 2002. rate of pixel, of video 2 and 3 are computed based on ground-truth data and are printed on each frame. In [3]K. Yokoi, “Illumination-robust Change Detection addition, the average of detection rates is tabulated for Using Texture Based Features”, In Proc. of IAPR each method in Table 1 and 2. As shown in these figures Conference on Machine Vision Applications, 2007. and tables, the multi-layer GMM of RGB+LDP feature outperforms the method without combining the LDP [4]J. Yao and J. Odobez, “Multi-Layer Background significantly. Subtraction Based on Color and Texture”, In Proc. of IEEE CVPR, 2007. Acc. FPR FNR (%) (%) (%) [5]Cucchiara, R., Grana, C., Piccardi, M., Prati, A., and RGB only 63.89 31.53 4.58 Sirotti, S.: Improving Shadow Suppression in RGB+shadow layer 70.05 4.91 25.04 Moving Object Detection with HSV Color RGB+LDP+shadow 80.27 14.43 5.30 Information. in Proceedings of 2001 IEEE Intelligent layer Transportation Systems Conference. pp. 334-339 Table 1. Average detection rates of moving objects in (2001). video 2. [6]Shan, Y., Yang, F., and Wang, R.: Color Space Acc. FPR FNR Selection for Moving Shadow Elimination. in (%) (%) (%) Proceedings of 4th International conference on RGB only 45.19 53.94 0.87 Image and Graphics. pp. 496-501 (2007). RGB+shadow layer 76.85 1.21 21.94 RGB+LDP+shadow 82.89 15.87 1.25 [7]Nicolas, M.-B., and Zaccarin, A.: Learning and layer Removing Cast Shadows through a Multidistribution Table 2. Average detection rates of moving objects in Approach. IEEE Transactions on Pattern Analysis video 3. and Machine Intelligent. vol. 29, no. 7, pp. 1133- 1146 (2007). 5.CONCLUSION [8]Stauffer, C., and Grimson, W. E. L.: Adaptive This study presents a multi-layer scene model for Background Mixture Models for Real-time Tracking. applications of video surveillance. The proposed scene in Proceedings of IEEE Computer Society model uses a RGB+LDP feature to represent each pixel Conference on Computer Vision and Pattern and classifies each pixel into four different states Recognition. vol. 2, pp. 246-252 (1999). comprising background, moving foreground, static foreground and shadow. As shown in the experimental [9]Tanaka, T., Shimada, A., Arita, D., and Taniguchi, R.: results, both the modeling ability and illumination Non-parametric Background and Shadow Modeling invariance are significantly improved by including the for Object Detection. Lecture Notes in Computer texture information. Science. no. 4843, pp. 159-168 (2007). ACKNOWLEDGEMENT [10]Herrero-Jaraba, E., Orrite-Urunuela, C., and Senar, J.: Detected Motion Classification with a Double- background and a neighborhood-based difference. This paper is a partial result of project 9365C51100 Pattern Recognition Letters. vol. 24, pp. 2079-2092 conducted by ITRI under sponsorship of the Ministry of (2003). Economic Affairs, Taiwan. [11]Gallego, J., Pardas, M., and Landabaso, J.-L.: Segmentation and Tracking of Static and Moving REFERENCES Objects in Video Surveillance Scenarios. in Proceedings of IEEE International Conference on [1] M. Heikkil¨a and M. Pietik¨ainen, “A texture-based Image Processing. pp. 2716-2719 (2008). method for modeling the background and detecting moving objects”, In Proc. of IEEE Transactions on [12]Benedek C., and Sziranyi, T.: Study on Color Space Pattern Analysis and Machine Intelligence, Vol. 
28, Selection for Detecting Cast Shadows in Video No. 4, pp. 657–662, April 2006. Surveillance. International Journal of Imaging Systems and Technology. vol. 17, pp. 190-201 [2]O. Javed, K. Shafique and M. Shah, “A hierarchical (2007). approach to robust background subtraction using 1049
    [13]Izadi, M., andParvaneh, S.: Robust Region-based [14]Zivkovic Z., and van der Heijden, F.: Recursive Background Subtraction and Shadow Removing Unsupervised Learning of Finite Mixture Models. using Color and Gradient Information. in IEEE Transactions on Pattern Analysis and Machine Proceedings of International Conference on Pattern Intelligent. vol. 26, no. 7, pp. 773-780 (2006). Recognition. pp. 1-5 (2008). Color-texture representation of Pixel i Fit Background No Model? Yes Update Is Shadow Background No Candidate? Model Yes Fit Static Background Update Shadow Foreground No Model Model? No Yes Update Static Fit Moving Fit Shadow Foreground No Foreground Model Model ? Model? and Count SF +1 Yes Yes Shadow If Count SF > T 1 Update Moving Foreground Model Transfer Moving and Count MF +1 Yes Foreground to Background, Reinitialize Moving Transfer Static If Count MF >T2 Foreground Model , Foreground Model to No and Set Count MF = 0 Background Model Yes Transfer Moving to Static Foreground Model No and Set Count SF =0 Foreground Model Static Moving Foreground Foreground Fig. 1. Flowchart of the proposed multi-layer scene model 1050
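The control flow of Fig. 1 can be summarised in the pseudocode below. It is a simplified reading of the flowchart and of the description in Section 2: the layer objects and their fits/update/absorb methods are placeholders, while the counters CountSF/CountMF and the thresholds T1/T2 follow the paper.

    def classify_pixel(pix, layers, counts, T1, T2):
        """One pass of the four-layer dispatch for a single pixel (cf. Fig. 1)."""
        if layers.background.fits(pix):
            layers.background.update(pix)
            return "background"
        if weak_shadow_test(pix, layers.background):     # weak classifier, cf. Eqs. (8)-(10)
            layers.shadow.update(pix)
            if layers.shadow.fits_dominant(pix):
                return "shadow"
        if layers.static_fg.exists() and layers.static_fg.fits(pix):
            counts.count_sf += 1
            if counts.count_sf > T1:                     # long-stopped object joins background
                layers.background.absorb(layers.static_fg)
            return "static_foreground"
        if layers.moving_fg.fits(pix):
            layers.moving_fg.update(pix)
            counts.count_mf += 1
            if counts.count_mf > T2:                     # object has stopped moving
                layers.static_fg.replace_with(layers.moving_fg)
                counts.count_sf = 0
            return "moving_foreground"
        # No model fits: fold the old moving-foreground model into the background
        # (multi-mode handling) and restart the moving layer with this pixel.
        layers.background.absorb(layers.moving_fg)
        layers.moving_fg.reinitialize(pix)
        counts.count_mf = 0
        return "moving_foreground"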
Fig. 3. Results of background subtraction controlled by combination weight of color and texture.
Fig. 4. Foreground detection results of video 2. 1051
Fig. 5. Foreground detection results of video 3, the intelligentroom_raw.
Fig. 6. Foreground detection results of video 4, the Laboratory_raw. 1052
    Adaptive Traffic SceneAnalysis by Using Implicit Shape Model Kai-Kai Hsu, Po-Chyi Su and Kai-Yi Cheng Dept. of Computer Science and Information Engineering National Central University Jhongli, Taiwan Email: [email protected] Abstract—This research presents a framework of analyz- research is to provide an approach to deal with the vehicle ing the traffic information in the surveillance videos from occlusion problem, in which multiple vehicles appear in the static roadside cameras to assist resolving the vehicle the video scene and certain parts of them overlap, in occlusion problem for more accurate traffic flow estimation and vehicle classification. The proposed scheme consists of the vehicle detection. The occlusions of vehicles occur two main parts. The first part is a model training mechanism, quite often in cameras set up at the streets and cause in which the traffic and vehicle information will be collected ambiguity in vehicle detecting and may lead to inaccurate and their statistics are employed to automatically establish measurement of traffic parameters, such as the traffic flow the model of the scene and the implicit shape model of volume. We adopt a so-called “Implicit Shape Model” vehicles. The second part adopts the flexibly trained models for vehicle recognition when possible occlusions of vehicles (ISM) to recognize the vehicle and reasonably help in are detected. Experimental results show the feasibility of the solving the occlusion problem. The proposed scheme will proposed scheme. have two parts, i.e. the self-training mechanism and the Keywords-Vehicle; traffic surveillance; occlusion; SIFT; construction of the implicit shape model for resolving vehicle occlusion. The organization of this paper is as follows. A review of the related works is described in I. I NTRODUCTION Section II. The proposed method is presented in Section Developing Intelligent Transportation System (ITS) has III. Preliminary results are shown in Section IV and the been a major investigation these years. Through the inte- conclusive remarks are given in Section V. gration of advanced computing facilities, electronics, com- munication and sensor technologies, ITS can provide the II. R ELATED W ORKS real-time information to help maintain the traffic order or There have been active research efforts on the automatic to ensure the safety of pedestrians and drivers. As there are vision-based traffic scene analysis in recent years [1]–[6]. more and more surveillance cameras deployed along the Levin et al. [1] proposed to collect the training examples local roads or highways, the visual information provided by a coarse detector and the training examples are used by these surveillance videos become an important part to build the final pedestrian detector. The classification of ITS. The traffic information obtained by the vision- criterion for the coarse detector has to be defined manually. based approach can assist the traffic flow control, vehicle Wu et al. [3] employed an online boosting method to counting and categorization, etc. In addition, the emergent enhance the performance of the system. A prior detector traffic events may be detected right after they happened by by off-line learning is employed to train the posterior the advanced visual processing so that the corresponding detector, which adopts unsupervised learning. Nair et al. processes can be applied in a more active way. [7] also employed a supervised way for the initial training. 
The vehicle detection/classification by using the vision- Hsieh et al. [2] adopts a different approach to detect based approach is a challenging issue and various methods the lanes of the surveillance video in the initial stage have been proposed in recent years. It should be noted automatically. Vehicle features such as size and linearity that appearances of vehicles in the surveillance videos are used to detect and classify vehicles, instead of using from different cameras are quite diverse because of the the large amount of labeled training data. The vehicle size different locations, heights, angles and views of cameras. information has to be pre-defined manually. Zhou et al. [4] In addition, the weather condition and the time of video proposed an example-based moving vehicle detection. The recording, e.g. morning or evening, may also affect the vehicle are detected according to the luminance changes vehicle detection process. It is quite difficult to establish by the background subtraction. The features are extracted a common model in advance for all the surveillance from those examples using PCA and trained as a detector videos. Nevertheless, if we choose to construct a model by SVM. Celik et al. [5] presented an unsupervised and for each individual surveillance video, a great deal of on-line approach. A coarse object detector is used to human efforts will be required, given that there are so extract the moving object by the background subtraction many roadside cameras. Therefore, the objective of our and then the obtained samples are refined by clustering research is to enable the procedures of model construction based on the similarity matrix. These extracted features in an automatic manner so that the customized model of are separated into good and bad positives for training a each scene can be established. The other objective of this final detector via SVM. Celik et al. [6] then addressed an 1053
    ¢c £¨¨¢` automatic classificationmethod for identifying pedestrians ¤ ¤ !© and vehicles by SIFT. ¤ ¤!© ¨¢©#¢!( ¤¢W¢¢% Regarding the issues of occlusion problem, various solu- CR3QPC I 97Q H B67 V A 2 9 tions have been proposed [8]–[16]. We roughly classify the ¨§£©¢ ¨¢¡ © ##¢` ¨©¤% ¤ ¤ ¥ ¤!© ¨¢¡#¨  ¥ ©!)¡  !¤0 ¤)¢ approaches into 3D model-based, feature-based methods b !¢¨a ©¢¥ and others. The 3D model is a popular solution to solve the @9876 5 321 4 E @C H 26 HG 3 F C 5 BA 2 U T CR3QPC I S vehicle occlusion. Pang et al. [8], [9] detect a vanishing ¤ ¤!© # ¨£¤ ¨§!©©$ ¤D ¨§ !©©$ 3Q VV 6 X 2 ¤£¢¡  ¨¢¨¤¡£¤¥ ¤¤¨  ¨©¤¤% £¨'!¤ T C H B262Y 6 C ¤©¨¤§¦¤¥ point first in the traffic scene. A 3D deformable model is used to estimate each viewpoint of vehicle occlusion and transform it into a 2D representation. The occlusion Figure 1. The proposed framework is detected by obtaining the curvature points on the shape of vehicle. Occluded vehicles are separated into individual vehicles. The vanishing point is also adopted by Yoneyama a single vehicle can be correctly located. Since the traffic et al. [10]. A hexagon is used to approximate the shape scenes from different cameras may vary significantly, this of vehicle for eliminating shadows. A multiple-camera training process has to be applied for each individual cam- system is utilized to detect the occlusion problem. Song era. If this process is carried out manually, its computation et al. [11] proposed to employ vehicle shape models, will require considerable amounts of human efforts. In camera calibration and ground plane knowledge to detect, order to provide a more feasible solution, we plan to track and classify the occlusion by estimating the related develop a “self-training” adaptive scheme so that these likelihood. Lou et al. [12] established a 3D model for models will be built in a more automatic manner without tracking vehicles and an improved extended Kalman filter involving a great deal of human efforts. Considering that was also presented to track and predict the vehicle motion. the settings of traffic surveillance cameras are usually fixed Most methods of 3D model require the precise camera cal- without rotation and the corresponding traffic scenes tend ibration and vehicle detection. In feature-based methods, to be static, i.e. the background of the traffic scene is the occlusion can be resolved by tracking partial visible invariant, we will extract a long video segment from the features of occluded vehicles. Kanhere et al. [13] proposed target camera for building models. It should be noted that to track vehicle in low-angle situation and estimate the 3D typical vehicles should appear in the extracted long video height of features on the vehicles. The feature points are segment and their related information can thus be collected detected and tracked throughout the image sequence. The as references for future usage. feature-based methods may be influenced by the similar Fig. 1 demonstrates our system framework. The back- shapes from the background. Zhang et al. [14] presented ground of the scene will be constructed from the traffic a multilevel framework, which consists of the intra-frame, surveillance video by an iteratively updating method so inter-frame and tracking level. In the intra-frame level, an that the background subtraction can be applied to extract occlusion is detected by evaluating convex compactness the vehicle masks. 
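A minimal sketch of the iterative background update mentioned here, mirroring the selective update rule B^{i+1}_{x,y} = (1 − αM^b_{x,y})·B^i_{x,y} + αM^b_{x,y}·F^i_{x,y} given later as Eq. (1); the learning-rate value is illustrative.

    import numpy as np

    def update_background(B, F, M, alpha=0.05):
        """Selective running-average update of the background image (cf. Eq. (1)).

        B: current background image, F: current frame, M: binary gating mask of
        Eq. (1) (where M = 1 the background is blended toward F, where M = 0 it
        is left unchanged); alpha is a small learning rate (illustrative value).
        """
        M = M.astype(np.float32)
        return (1.0 - alpha * M) * B + alpha * M * F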
Although the extracted vehicle masks ratio of vehicle shape and resolved by removing a “cutting may contain a single vehicle or colluded ones, it is as- region.” In the inter-frame level, an occlusion is detected sumed that the long video used for training should contain by the statistics of motion vectors of vehicles. During a large number of single vehicles and that the vehicles of the tracking level, the detected vehicles are tracked for the same types should exhibit a similar shape/size. Even resolving the full occlusion. Tsai et al. [15] detect the if many occlusions may happen, their shapes are usually vehicles by using color and edges. The vehicle color usu- quite different. Therefore, the majority-voting methodol- ally looks unique and can be used for searching possible ogy can be employed to determine such static information vehicle locations. Then the edge maps and coefficients in the target traffic video, including the traffic flow direc- of wavelet transform are used for examining the vehicle tions and the vehicle shape/size of different types at the candidates. Wang and Lien [16] proposed an automatic scene to construct our first model, i.e. the scene model. The vehicle detection based on significant subregions of ve- second model, i.e. the shape model, which will be used for hicles, which are transformed to PCA weighting vectors recognizing vehicles, especially the concluded vehicles, and ICA coefficient vectors. The position information is is said to be implicitly established since image features estimated by a likelihood probability evaluation process. are extracted and grouped without explicitly resorting to the exact shapes of vehicles. The scale-invariant feature III. T HE P ROPOSED S ELF -T RAINING S CHEME transform (SIFT) will be used to extract effective features A. System overview from the segmented vehicle masks of consecutive frames Our system is aimed at resolving the vehicle occlusion to indicate the pixels covering vehicles more precisely. problem for more accurate estimation of traffic flow at The statistics of vehicle size information obtained by the the scene captured by a static traffic surveillance camera. occlusion detection is analyzed and will be utilized to clas- Our scheme mainly relies on establishing two models, i.e. sify the vehicle types. By the results of statistics, the types the traffic scene model and the implicit shape model of of vehicles can be classified into motorcycles, sedan cars vehicles, for effective traffic scene analysis. The models and buses according to the vehicle size information. The should be trained in advance so that the pixels covering step of vehicle pattern extraction and classification will 1054
    (a) (a) (b) Figure 3. Convex hulls of (a) non-occlusion vehicle and (b) occluded vehicles. (b) Figure 2. Background image Construction where Vs and Vc represent the vehicle area from the background subtraction and the vehicle convex area, re- spectively. When the value of Γ is closer to one, the collect various types of vehicle masks. The classification vehicle area is similar to its convex hull area and it is implemented by the vehicle size information obtained indicates that the occlusion may not happen. In the training from the traffic information analysis. When the system process, our system tries to extract non-occluded vehicle runs after a period of time, there will be enough vehicle patterns so we set up a high threshold to ensure that most masks to establish the implicit shape model. We will detail of the extracted vehicle patterns contain single vehicles. the procedures of our proposed system as follows. D. Traffic Information Analysis B. Background Model Construction As mentioned before, we require that our system be A series of traffic surveillance frames will be utilized executed in an more automatic manner to reduce the to construct the background image of the traffic scene human efforts for tuning the parameters. Our scheme will captured by a static roadside camera so that the moving obtain the direction of traffic appearing in the scene and vehicles will be detected by the background subtraction. i the common vehicle size information by the statistics of Let Bx,y be the pixel at (x, y) of the background image, the surveillance videos in the training phase. For analyzing and the background updating function is given by the direction of traffic, the vehicle movements must be i+1 b i b i attained first. SIFT is employed to identify features on Bx,y = (1 − αMx,y )Bx,y + αMx,y Fx,y (1) vehicles. After the vehicle segmentation, the vehicles i in which Fx,y is the pixel at (x, y) in frame i; α is the are transformed into feature descriptors of SIFT. The b small learning rate; Mx,y is the binary mask of the current features of frames will be compared and the positions frame. If the pixel at (x, y) belongs to the foreground part, of movements are recorded. After a period of time, the b b Mx,y = 1 to turn on the updating. Otherwise, Mx,y is set main direction of traffic in the surveillance scene can as 0 to avoid updating the background with the moving be observed from the resultant movement histogram. In objects. An example of the scene with its constructed addition, the Region of Interest (ROI) can be identified to background is demonstrated in Fig. 2. facilitate the subsequent processing. The position of ROI is located in the area of the detected traffic flow and the area C. Occlusion Vehicle Detection near the bottom of the captured traffic scene for vehicles It has been observed that the shape of non-occluded of larger size, which can offer more information. vehicle should be close to its convex hull and that the After determining ROI, we can collect vehicle patterns shape of occluded vehicles will show certain concavity, or masks that appear in the ROI. In the training phase, ve- as illustrated in Fig. 3. This characteristic can be used to hicle patterns that are determined to contain single vehicles roughly extract the non-occluded vehicle. In our imple- based on the convex hull analysis will be archived. Then mentation, compactness, Γ, is used to evaluate how close we can check the size histogram of archived vehicles to set the vehicle’s shape and its convex hull are. 
That is, up the criterion for roughly classifying them. In our test Vs videos, the most common vehicles are motorcycles, sedan Γ= , (2) cars and buses. When we examine the histogram of the Vc 1055
    the position oftraining vectors where the codebook entry is found. The position of each feature is dependent on the object center. We match the features from the training ! ! images with the codebook entries. When the similarity of ¥©¨§¦¥¤ £¢¡  ©¥§¥ ©¥§¨ ¥¥©©¨ ¥©¥ features with any entry is above a threshold, the position relative to the object center is recorded along with the codebook entry. After matching the training images with Figure 4. The codebook training procedure. the codebook entries, we obtain the spatial probability distribution. 2) Recognition Approach: Given a target image, the sizes of the collected single vehicle patterns, there will be features are extracted by SIFT and matched to the obvious peaks. To be more specific, we basically make use codewords in the codebook. When the similarity be- of the peaks to determine the sizes of common motorcycles tween extracted features and the codebook entries is and sedan cars since they appear more often. We can then higher than a threshold, these matches are then collected. set up the upper and lower bounds of sedan cars and then According to the spatial probability distribution, these we can use them as the reference to assign a lower bound matched codebook entries cast votes to the object center. of the bus size. In the detection phase, if the vehicle mask When the features of target image that are extracted at is large and shows a convex hull, then the pattern may be (ximg , yimg , simg ), in which (x, y) is the location and determined as a bus. Otherwise, an occlusion may happen s means scale, are determined to have a match with a and this has to be solved by using ISM. In other words, codebook entry, the positions (xpos , ypos , spos ) recorded after the rough classification according to the vehicle sizes, in this codebook entry cast votes for the object center. we proceed to use the vehicle patterns to establish the The voting is applied by codebooks of ISM, which will then be used for resolving the vehicle occlusions. simg xvote = ximg − xpos ( ) (3) spos E. Implicit Shape Model simg yvote = yimg − ypos ( ) (4) Leibe et al. [17] proposed to use ISM for learning spos the shape representations in detecting the most possible simg svote = (5) locations of vehicles in images or frames. The object spos categorization is achieved by learning the appearance where (xvote , yvote , svote ) is a vote for the object center. variability of an object category in a codebook. The After all the matches that the codebook entries have voted, investigated image will be compared with the codewords we store these votes for a probability density estimation in the codebook that has a similar shape and then a mechanism, which is used to obtain the most possible weighted voting procedure will be applied to address the location of the object center. object detection. The steps of ISM are as follows. Next, we collect the votes in a binned 3D accumulator 1) Shape Model Establishment: In the visual object array and search the local maxima for speeding up the recognition, we have to determine the correspondence of computation. The local maxima are detected by comparing the image features with the structures of the object, even each member of the binned 3D accumulator array to its 26 under different conditions. To employ a flexible repre- neighbors in 3×3 regions. 
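The voting step of Eqs. (3)–(5) can be sketched as follows: every match between an image feature at (x_img, y_img, s_img) and a codebook entry casts votes for the object centre using the (x_pos, y_pos, s_pos) occurrences stored with that entry, and the votes are binned into a coarse 3-D accumulator as the speed-up described above. The bin sizes and the entry.occurrences attribute are illustrative assumptions.

    from collections import defaultdict

    def cast_votes(matches, bin_xy=8, bin_s=0.25):
        """Hough-style voting for the object centre (cf. Eqs. (3)-(5))."""
        accumulator = defaultdict(list)
        for x_img, y_img, s_img, entry in matches:
            for x_pos, y_pos, s_pos in entry.occurrences:
                scale = s_img / s_pos
                x_vote = x_img - x_pos * scale      # Eq. (3)
                y_vote = y_img - y_pos * scale      # Eq. (4)
                s_vote = scale                      # Eq. (5)
                key = (int(x_vote // bin_xy), int(y_vote // bin_xy), int(s_vote // bin_s))
                accumulator[key].append((x_vote, y_vote, s_vote))
        return accumulator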
Then, the Mean-Shift approach sentation for object recognition, a codebook is built for [18] is employed to refine the local maxima for more representing features that appear on training images quite accurate location. The Mean-Shift approach can locate often and similar features are clustered. A codeword in the maxima of a density function given the discrete data the codebook should be a compact representation of local sampled from that function. It will quickly converge to appearances of objects. Given an unknown image struc- more precise locations of the local maxima after several ture, we will try to match it with a possible representation iterations. or codeword in the codebook. Then, many such matches are collected and we can then infer the existence of that The refined local maxima can be regarded as candidates object. Again, the scale-invariant interest point detector of the object center. Thus, the following criterion is used is employed to detect the feature points on the training to estimate the existing probability of the object: images and the extracted image regions are then translated 1 lc − li to a representation by a local descriptor. Next, the visually score(lc ) = wi Ker( ), (6) V (sc ) i b(sc ) similar features will be grouped to construct a codebook for representing the local appearances of certain object. where Ker() is a kernel function; b(sc ) is the kernel The k-means algorithm is used to partition the features into bandwidth; V (sc ) is the volume of the kernel; wi and k clusters, in which each feature is assigned to the cluster li are the weighting factor and the location of the vote, center with the nearest distance. The codebook generation respectively; lc and sc are the location and scale of the process is shown in Fig. 4. local maximum. The kernel function Ker() can be treated After building the codebook, the spatial probability as a search window for the position of object center. If distribution is defined for each codebook entry. It records the vote location li is inside of the kernel, the Ker() 1056
    (a) (b) (a) (b) Figure 5. (a) Multi-type vehicle error detection and (b) the result after Figure 6. (a) Multiple hypotheses detected in one vehicle and (b) the the refining procedure. results from the refining procedure. function returns a value one. Otherwise, it returns zero. There exists another problem in the vehicle recognition For the 3D voting space, we use a spherical kernel and by using ISM. As shown in Fig 6, there are three bounding the radius is the bandwidth, b(sc ), which is adaptive to boxes on the same vehicle. It means that the recognition the local maximum scale sc . As the object scale increases, result includes some error detections that ISM has defined the kernel bandwidth should also increase for an accurate for multiple hypotheses on this vehicle. Since the multiple estimation. Therefore, we sum up all the weighting values definition problem comes from the fact that the ISM that are inside of the kernel and divide them by the volume searches the local maxima in the scale-space as shown V (sc ) to obtain an average weight density, which is called in Fig. 7, the scheme may find several local maxima in the score. After the score is derived, we define a thresh- different scale levels but at a similar location. In fact, old θ for determining whether the object exists. When these local maxima are generated by the same vehicle the score is above θ, the hypothesized object center is center. Therefore, the unnecessary hypotheses should be preserved. Finally, we back-project the votes that support eliminated. We deal with the problem by computing the this hypothesized object center to obtain an approximate overlapped area between the two bounding boxes. When shape of the object. the overlapped area between two bounding boxes is very large, we can claim that the bounding box that has a F. Occlusion Resolving weaker score is an error detection. For efficient compu- After detecting the existence of certain occluded vehi- tation, the rate of overlap is computed by finding the cles in the image, we need to classify them into different distance between the two bounding boxes’ central points types. In our scheme, we construct the codebooks of and use the longer diagonal line of the larger bounding different types of vehicles. Each type of vehicle codebook box as the criterion. The longer the distance is, the higher will be established automatically after we obtain enough the areas overlap. In other words, for every two bounding vehicle patterns collected by the process of vehicle ex- boxes, we need to check traction. However, as shown in Fig. 5, the performance of 1 recognition is not as good as expected since many errors distance(B1 , B2 ) D, (7) 3 happen on the bus image. Owing to the fact that the area where B1 and B2 denote two bounding boxes central of buses are much larger than sedan cars and that there points and D is the diagonal line of the larger one. In our are many similar local appearances in these two types, 1 implementation, when the distance is smaller than 3 D, the errors of this kind occur quite often. We provide a refining overlapped area of the bounding boxes is above 50% and procedure as follows. we will thus remove the bounding box that has a lower All the hypotheses are supported by the contributing score. The error detection from ISM can thus be reduced. votes that are cast by the matched features. Theoretically, every extracted feature should only support one hypothesis IV. 
E XPERIMENTAL R ESULTS since it is not possible that one feature belongs to two We have tested the proposed self-training mechanism vehicles. Thus, we will modify these hypotheses after on two different surveillance videos. The scenes of two executing multiple recognition procedures. We first store surveillance videos are displayed in Fig. 8. Scene 1 shown all the hypotheses whose scores are over a threshold. in Fig. 8(a) is a 15 minutes long video while Scene Then all the hypotheses are refined by checking each 2 shown in Fig. 8(b) is a 17 minutes long video. The contributing vote that appears in two hypotheses at the experimental results will be demonstrated in three parts, same time. The hypothesis with a higher score can retain i.e. the traffic information analysis, the vehicle pattern this vote while the vote from others will be eliminated. extraction/classification and the occlusion resolving. Next, the scores of these hypotheses are recalculated. When the new score is above the threshold, the hypothesis A. Traffic Information Analysis can be preserved. After this refining procedure, the number The directions of traffic flow analysis of two scenes of error detections can be reduced. are illustrated in Fig. 9. The red points represent forward 1057
    9 4 D D 8 3.5 Number of Occurences (per minute) Number of Occurences (per minute) 7 3 6 2.5 5 2 4 1.5 3 1 2 1 0.5 0 0 0 10 20 30 40 50 60 70 80 0 20 40 60 80 100 120 140 Size of Vehicles Unit: 100 pixels Size of Vehicles Unit: 100 pixels (a) (b) Figure 10. The vehicle size statistics for (a) Scene 1 and (b) Scene 2. Figure 7. If the distance between two bounding boxes’ centers is smaller, then the overlap area is larger so the distance will be employed to remove Table I the duplicated detections. V EHICLE PATTERN E XTRACTION Total Error Correct rate Scene 1 940 15 98.4% Scene 2 1251 31 97.5% B. Vehicle Pattern Extraction and Classification The various extracted vehicle patterns are demonstrated (a) (b) and they pass the occlusion detection process to ensure that it have no occlusion problem. In our experiment, we give Figure 8. The views of two surveillance videos. (a) Scene 1. (b) Scene Eq.(2) a threshold 0.9 for extracting the sedan car/bus and 2. 0.8 for motorcycles. We apply the shape analysis on sedan cars and buses but not on motorcycles since they cannot be approximated by a convex hull. The performance of vehicle extraction is summarized in Table I. These vehicle moving vehicles and blue points are backward moving patterns will be employed for training. It should be noted vehicles. We can see that the directions of traffic flows are that the errors usually come from some unstable envi- successfully obtained after training the video for a while. It ronmental conditions, which will affect the construction should be noted that the more traffic volume is, the lesser of background image. The vehicle classification result is time we will need. The vehicle size information statistics summarized in Table II. Some extracted patterns from for Scene 1 and Scene 2 are exhibited in Fig 10. There Scene 1 are illustrated in Figs. 11-13. We can see that exist two peaks in each scene as the left peak, which has the vehicle patterns can be effectively extracted and they a smaller vehicle size, represents a motorcycle, while the will be helpful in training a more accurate codebook or right one, which has a larger vehicle size, stands for a models. sedan car. In Scene 1, according to Fig. 10, we assign the lower bound 700 pixels and upper bound 1000 pixels for C. Occlusion Resolving motorcycle size. The upper and lower bounds of sedan Table III and Figs. 14-16 demonstrate the results of car size are 1700 pixels and 3300 pixels respectively. In occlusion resolving. We use the extracted vehicle patterns Scene 2, the motorcycle size is assigned with 1400 pixels to train the ISM codebooks for two different scenes. Table and 2100 pixels while the sedan car size is assigned with III is the performance of resolving occlusion on sedan the lower bound 4000 pixels and the upper bound 8500 cars and the occlusion part of Table III denotes the sedan pixels. We can see that the vehicle size information i.e. cars actually occlude with other vehicles while the non- the motorcycle and sedan car, for surveillance video can occlusion part stands for the sedan cars which are not be obtained by statistics successfully. occluded with other vehicles but pass the occlusion detec- tion. As shown in Figs. 14 and 15, there are several sedan cars that are partially occluded. We use the trained ISM to resolve the occlusions. The red points and bounding boxes Table II V EHICLE PATTERN C LASSIFICATION Motorcycle Sedan car (a) (b) Total Error Correct rate Total Error Correct rate Scene 1 135 3 97.8% 765 34 95.6% Figure 9. 
The directions of traffic flows for (a) Scene 1 and (b) Scene Scene 2 159 2 98.7% 826 46 94.4% 2. 1058
    Figure 11. The extracted motorcycle patterns from Scene 1. Figure 13. The extracted bus patterns from Scene 1. Figure 14. Occlusion resolving of sedan cars in Scene 1. Figure 12. The extracted sedan car patterns from Scene 1. represent vehicle’s central coordinate and its position that are detected by ISM. In Fig. 16, we resolve the problem of occlusion from the two types of vehicles i.e. bus and sedan car. By combining ISM and the proposed self-training mechanism, these occlusion problems can be reasonably resolved. Figure 15. Occlusion resolving of sedan cars in Scene 2. Table III S EDAN C AR O CCLUSION R ESOLVING R ATE Total Miss False alarm occlusion 177 35 46 Scene 1 non-occlusion 88 1 2 occlusion 92 16 21 Scene 2 non-occlusion 130 2 12 Figure 16. Resolving the partial occlusion of sedan car and bus. Recall Precision occlusion 80.2% 75.5% Scene 1 non-occlusion 98.9% 97.8% V. C ONCLUSION occlusion 82.6% 78.2% Scene 2 non-occlusion 98.4% 99.2% We have proposed a framework of analyzing the traffic information in the surveillance videos captured by the 1059
    static roadside cameras.The traffic and vehicle infor- [13] N. Kanhere, S. Birchfield, and W. Sarasua, “Vehicle seg- mation will be collected from the videos for training mentation and tracking in the presence of occlusions,” the related model automatically. For the vehicles without Transportation Research Record: Journal of the Trans- portation Research Board, vol. 1944, no. -1, pp. 89–97, occlusion, we can use the scene model to record and 2006. classify. If an occlusion happen, the implicit shape model will be employed. The experimental results demonstrate [14] W. Zhang, Q. Wu, X. Yang, and X. Fang, “Multilevel this potential solution of solving occlusion problems in Framework to Detect and Handle Vehicle Occlusion,” IEEE the traffic surveillance videos. Future work will be further Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 161–174, 2008. improving the accuracy and the speed of execution. [15] L. Tsai, J. Hsieh, and K. Fan, “Vehicle detection using R EFERENCES normalized color and edge map,” IEEE Transactions on [1] O. Javed, S. Ali, and M. Shah, “Online detection and clas- Image Processing, vol. 16, no. 3, pp. 850–864, 2007. sification of moving objects using progressively improv- ing detectors,” Computer Vision and Pattern Recognition, [16] C. Wang and J. Lien, “Automatic Vehicle Detection Using vol. 1, p. 696701, 2005. Local FeaturesA Statistical Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 83– [2] J. Hsieh, S. Yu, Y. Chen, and W. Hu, “Automatic traffic 96, 2008. surveillance system for vehicle tracking and classification,” IEEE Transactions on Intelligent Transportation Systems, [17] B. Leibe, A. Leonardis, and B. Schiele, “Robust object de- vol. 7, no. 2, pp. 175–187, 2006. tection with interleaved categorization and segmentation,” International Journal of Computer Vision, vol. 77, no. 1, [3] B. Wu and R. Nevatia, “Improving part based object pp. 259–289, 2008. detection by unsupervised, online boosting,” in IEEE Con- ference on Computer Vision and Pattern Recognition, 2007. [18] Y. Cheng, “Mean shift, mode seeking, and clustering,” CVPR’07, 2007, pp. 1–8. IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 17, no. 8, pp. 790–799, 1995. [4] J. Zhou, D. Gao, and D. Zhang, “Moving vehicle detection for automatic traffic monitoring,” IEEE transactions on vehicular technology, vol. 56, no. 1, pp. 51–59, 2007. [5] H. Celik, A. Hanjalic, E. Hendriks, and S. Boughor- bel, “Online training of object detectors from unlabeled surveillance video,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08, 2008, pp. 1–7. [6] H. Celik, A. Hanjalic, and E. Hendriks, “Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video,” Computer Vision and Image Understanding, vol. 113, no. 10, pp. 1076–1094, 2009. [7] V. Nair and J. Clark, “An unsupervised, online learning framework for moving object detection,” Computer Vision and Pattern Recognition, vol. 2, p. 317324, 2004. [8] C. Pang, W. Lam, and N. Yung, “A novel method for resolving vehicle occlusion in a monocular traffic-image sequence,” IEEE Transactions on Intelligent Transportation Systems, vol. 5, pp. 129–141, 2004. [9] ——, “A method for vehicle count in the presence of multiple-vehicle occlusions in traffic images,” IEEE Trans- actions on Intelligent Transportation Systems, vol. 8, no. 3, pp. 441–459, 2007. [10] A. Yoneyama, C. Yeh, and C. 
Kuo, “Robust vehicle and traffic information extraction for highway surveillance,” EURASIP Journal on Applied Signal Processing, vol. 2005, p. 2321, 2005. [11] X. Song and R. Nevatia, “A model-based vehicle segmen- tation method for tracking,” in Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005, 2005, pp. 1124–1131. [12] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, “3- D model-based vehicle tracking,” IEEE Transactions on image processing, vol. 14, no. 10, pp. 1561–1569, 2005. 1060
    An Augmented RealityBased Navigation System for Museum Guidance Jun-Ming Pan, Chi-Fa Chen Chia-Yen Chen, Bo-Sen Huang, Jun-Long Huang, Dept. of Electrical Engineering, Wen-Bin Hong, Hong-Cyuan Syu I-Shou University Vision and Graphics Lab. Dept. of Computer Science and Information Engineering Nat. University of Kaohsiung, Kaohsiung, Taiwan [email protected] Abstract— The paper describes the design and an interest about the exhibition. Therefore, the implemented implementation of an augmented reality based navigation system aims to provide more in depth visual and auditory system used for guidance through a museum. The aim of this information, as well interactive 3D viewing of the objects, work is to improve the level of interactions between a viewer which otherwise cannot be provide by a pamphlet alone. In and the system by means of augmented reality. In the addition, an interactive guidance system will have more implemented system, hand motions are captured via computer impact and create a more interesting experience for the vision based approaches and analyzed to extract representative visitors. actions which are used to interact with the system. In this The implemented system does not require the keyboard manner, tactile peripheral hardware such as keyboard and or the mouse for interaction. Instead, a camera and a mouse can be eliminated. In addition, the proposed system also pamphlet are used to provide the necessary input. To use the aims to reduce hardware related costs and avoid health risks associated with contaminations by contact in public areas. system, the user is first given a pamphlet, as often given out to visitors to the museum, he/she can then check out the Keywords- augmented reality; computer vision; human different objects by moving his/her finger across the paper computer interaction; multimedia interface; and point to the pictures of objects on the pamphlet. The location of the fingertip is captured by an overhead camera and the images are analyzed to determine the user's intended I. INTRODUCTION AND BACKGROUND actions. In this manner, the user does not need to come into The popularity of computers has induced a wide spread contact with anything other than the pamphlet that is given to usage of computers as information providers, in public him/her, thus eliminating health risks due to direct contact facilities such as museums or other tourist attractions. with harmful substances or contaminated surfaces. However, in most locations, the user is required to interact with the system via tactile means, for example, a mouse, a To implement the system, we make use of technologies keyboard, or a touch screen. With a large number of users in augmented reality. coming into contact with the hardware devices, it is hard to Augmented reality (AR) has received a lot of attentions keep the devices free from bacteria and other harmful due to its attractive characteristics including real time contaminates which may cause health concerns to immersive interactions and freedom from cumbersome subsequent users. In addition, constant handling increases the hardware [7]. There have been many applications designed risk of damage to the devices, incurring higher maintenance using AR technologies in areas such as medical applications, cost to the providing party. Thus, it is our aim to design and entertainment, military navigation, as well as many other implement an interactive system using computer vision new possibilities. 
approaches, such that the above mentioned negative effects An AR system usually incorporates technologies from may be eliminated. Moreover, we also intend to enhance the different fields. For example, technologies from computer efficiency of the interface by increasing the amount of graphics are required for the projection and embedding of interaction, which can be achieved by means of a multimedia, virtual objects; video processing is required to display the user augmented reality interface. virtual objects in real time; and computer vision technologies are required to analyse and interpret actions from input In this work, we realize the proposed idea by image frames. As such, an AR system is usually realized by implementing an interactive system that enables the user to a cross disciplinary combination of techniques. interact with a terminal via a pamphlet, which can easily be Existing AR systems or applications often use designated produced in a museum and distributed visitors. The pamphlet markers, such as the AR encyclopedia or other applications contains summarized information about the exhibition or written by ARToolkit [8]. The markers are often bi-coloured objects of interest. However, due to the size of the pamphlet, and without details to facilitate marker recognition. However, it is not possible to put in a lot of information, besides, too for the guidance application, we intend to have a system that much textual information tends to make the visitor lose is able to recognize colour and meaningful images of objects 1061
    or buildings asprinted on a brochure or guide book and use them for user interactions. The paper is organized as follows. Section 2 describes the design of the system; section 3 describes the different steps in the implementation of the system; section 4 discusses the operational navigation system; and section 5 provides the conclusion and discusses possible future research directions. II. S YSTEM DESIGN The section describes how the system is designed and implemented. Issues that arose during the implementation of the system, as well as the approaches taken to resolve the issues are also discussed in the following. To achieve the goals and ideas set out in the previous section, the system is designed with the following Figure 2. Concept of the navigation system. considerations. The system obtains input via a camera, located above and  Minimum direct contact: The need for a user to overlooking the pamphlet. The camera captures images of come into direct contact with hardware devices such the user's hand and the pamphlet. The images are processed as a keyboard, or a mouse, or a touch screen, should and analyzed to extract the motion and the location of the be minimized. fingertip. The extracted information is used to determine the  User friendliness: The system should be easy and multimedia data, including text, 2D pictures, 3D models, intuitive to use, with simple interface and concise sound files, and/or movie clips, to be displayed for the instructions. selected location on the pamphlet. Fig. 2 shows the concept of the proposed navigation system.  Adaptability: The system should be able to handle other different but similar operations with minimum modifications. III. SYSTEM IMPLEMENTATION  Cost effectiveness: We wish to implement the system using readily available hardware, to The main steps in our system are discussed in the demonstrate that the integration of simple hardware followings. can have fascinating performance. A. Build the system using ARToolKit  Simple and robust setup. Our goal is to have the We have selected ARToolKit to develop our system, system installed at various locations throughout the since it has many readily available high level functions that school, or other public facilities. By having a simple can be used for our purpose. It can also be easily integrated and robust setup, we reduce the chances of a system with other libraries to provide more advanced functions and failure. implement many creative applications. In accordance to the considerations listed above, the B. Create markers system is designed to have the input and out interfaces as The system will associate 2D markers on the pamphlet shown in Fig. 1. with 3D objects stored in the database, as well as actions to manipulate the objects. This is achieved by first scanning the marker patterns, storing them in the system and let the program learn to recognize the patterns. In the program, each marker is associated with a particular 3D model or action, such that when the marker has been selected by the user, the associated data or action will be displayed or executed. Fig. 3 shows examples of markers used for the system. Each marker is surrounded by a black border to facilitate recognition. The object markers, as indicated by the blue Figure 1. Diagram for the navigation interface. arrows, are designed to match the objects to be displayed. Bottom right shows a row of markers, as enclosed by the red oval, used to perform actions on the displayed 3D objects. 1062
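To make the marker-to-content association of Sections III.B–III.D concrete, the following Python sketch registers object markers against 3-D model files and action markers against manipulation callbacks. It is only an illustration: the real system is built on ARToolKit, and every identifier below (MarkerRegistry, the marker names, the callbacks) is a hypothetical stand-in rather than the authors' code.

```python
# Minimal sketch of the marker-to-content registry; all names are hypothetical.

class MarkerRegistry:
    def __init__(self):
        self.models = {}    # object markers -> 3-D model file (OpenGL / VRML)
        self.actions = {}   # action markers -> manipulation callbacks

    def register_object(self, marker_id, model_path):
        self.models[marker_id] = model_path

    def register_action(self, marker_id, callback):
        self.actions[marker_id] = callback

    def on_marker_selected(self, marker_id, viewer_state):
        """Called when a detected marker is judged 'selected' (cf. Section III.E)."""
        if marker_id in self.models:
            viewer_state["model"] = self.models[marker_id]   # show the associated 3-D model
        elif marker_id in self.actions:
            self.actions[marker_id](viewer_state)            # e.g. zoom, rotate, reset

registry = MarkerRegistry()
registry.register_object("artifact_01", "models/artifact_01.wrl")
registry.register_action("zoom_in",  lambda s: s.update(scale=s.get("scale", 1.0) * 1.2))
registry.register_action("zoom_out", lambda s: s.update(scale=s.get("scale", 1.0) / 1.2))
registry.register_action("reset",    lambda s: s.update(scale=1.0, angle=0.0))
```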
    Objects model. Note that the actions can be applied to any 3D models that can be displayed by the system. Actions Figure 3. Markers used by the system. C. Create 3D models Figure 5. The user selects the zoom in function to magnify the displayed 3D model. The 3D models that are associated with the markers are created using OpenGL or VRML format. These models can be displayed on top of the live-feed video, such that the user can interact with the 3D models in real time. The models are texture mapped to provide realistic appearances. The models are created in collaboration with the Kaohsiung Museum of History [9]. Fig. 4 shows examples of the 3D models used in the navigation system. The models are completely 3D with texture mapping, and can be viewed from any angle by the user. Figure 6. The user selects the zoom out function to shrink the displayed 3D model. Figure 4. Markers used by the system. D. Implement interactive functions In addition to displaying the 3D models when the user selects a marker, the system will also provide a set of actions that the user can use to manipulate the displayed 3D model in real time. For example, we have designed “+/-” markers for the user to magnify or shrink the displayed 3D model. The user simply places his/her finger on the markers and the 3D model will change size accordingly. There are also markers for the user to rotate the 3D model, as well as reset the model to its original size and position. Figs 5 to 7 show the system with its implemented actions in operation. In the Figure 7. The user use the rotation marker to rotate the 3D object. figures, the user simply puts a finger over the marker, and the selected actions will be performed on the displayed 3D 1063
    E. Determine selection V. CONCLUSION An USB camera is used to capture continuous images of A multimedia, augmented reality interactive navigation the scene. The program will automatically scan the field of system has been designed and implemented in this work. In view in real time for recognized markers. Once a marker has particular, the system is implemented for application in found to be selected, that is, it is partially obstructed by the providing museum guidance. hand, it is considered to be selected. The program will match The implemented system does not require the user to the selected marker with the associated 3D model or action operate hardware devices such as the keyboard, mouse, or in the database. Figs. 5 to 7 show the user selecting markers touch screen. Instead, computer vision approaches are used by pointing to the markers with a finger. From the figures, it to obtain input information from the user via an overhead can be seen that the selected 3D model is shown within the camera. As the user points to certain locations on the video window in real time. Also notice that the models are pamphlet with a finger, the selected markers are identified by placed on top of the corresponding marker’s position in the the system, and relevant data are shown or played, including video window. a texture mapped 3D model of the object, textual, audio, or other multimedia information. Actions to manipulate the displayed 3D model can also be selected in a similar manner. IV. NAVIGATION SYSTEM Hence, the user is able to operate the system without The proposed navigation system has been designed and contacting any hardware device except for the printout of the implemented according to the descriptions provided in the pamphlet. previous sections. The system does not have high memory The implementation of the system is hoped to reduce the requirements and runs effectively on usual PC or laptops. cost of providing and maintaining peripheral hardware The system also requires no expensive hardware, an USB devices at information terminals. At the same time, camera is sufficient to provide the input required. It is also eliminating health risks associated with contaminations by quite easy to set up and customized to various objects and contact in public areas. applications. Work to enhance the system is ongoing and it is hoped The system can be placed at various points in the that the system will be used widely in the future. museum on separate terminals to enable visitors to access additions museum information in an interactive manner. ACKNOWLEDGMENT This research is supported by National Science Council (NSC98-2815-C-390-026-E). We would also like to thank Kaohsiung Museum of History for providing cultural artifacts and kind assistance. REFERENCES [1] J.-Z. Jiang,Why can Wii Win ?,Awareness Publishing,2007 [2] D.-Y. Lai, M. Liou, Digital Image Processing Technical Manual, Kings Information Co., Ltd.,2007 [3] R. Jain, R. Kasturi B. G. Schunck, Machine Vision、McGraw-Hill, 1995 [4] R. Klette, K. Schluns K. Koschan, Computer vision: three- dimensional data from images, Springer; 1998. [5] R. C. Gonzalez and R. E. Woods, Prentice Hall,Digital Image Processing, Prentice Hall; 2nd edition, 2002. [6] HitLabNZ, http://www.hitlabnz.org/wiki/Home, 2008 Figure 8. The interface showning the 3D model and other multimedia information. [7] R. T. Azuma, A Survey of Augmented Reality. In Presence: Teleoperators and Virtual Environments 6, pp 355—385, (1997) Fig. 
8 shows the screen shot of the system in operation. [8] Augmented Reality Network, http://augmentedreality.ning.com, 2008 In Fig. 8, the left window is the live-feed video, with the [9] H.-J. Chien, C.-Y. Chen, C.-F. Chen, Reconstruction of Cultural selected 3D model shown on top of the corresponding Artifact using Structured Lighting with Densified Stereo Correspondence, ARTSIT, 2009. marker’s position in the video window. The window on the [10] C.-H. Liu, Hand Posture Recognition, Master thesis, Dept. of right hand side shows the multimedia information that will Computer Science Eng., Yuan Ze University, Taiwan, 2006. be shown along with the 3D model to provide more [11] C.-Y., Chen, Virtual Mouse:Vision-Based Gesture Recognition, information about the object. For example, when the 3D Master thesis, Dept. of Computer Science Eng., National Sun Yat- object is displayed, the window on the right might show sen University, Taiwan, 2003 additional textual information about the object, as well as [12] J. C., Lai, Research and Development of Interactive Physical Games audio files to describe the object or to suitable provide Based on Computer Vision, Master thesis, Department of Information background music. Communication, Yuan Ze University, Taiwan, 2005 1064
    [13] H.-C., Yeh,An Investigation of Web Interface Modal on Interaction Design - Based on the Project of Burg Ziesar in Germany and the Web of National Palace Museum in Taiwan, Master thesis, Dept. of Industrical Design Graduate Institute of Innovation Design, National Taipei University of Technology, Taiwan, 2007. [14] T. Brown and R. C. Thomas, Finger tracking for the digital desk. In First Australasian User Interface Conference, vol 22, number 5, pp 11--16, 2000 [15] P. Wellner, Interacting with papers on the DigitalDesk, Communications of the ACM, pp.28-35, 1993 1065
    Facial Expression RecognitionBased on Local Binary Pattern and Support Vector Machine 1 2 3 4 Ting-Wei Lee (李亭緯), Yu-shann Wu(吳玉善), Heng-Sung Liu(柳恆崧) and Shiao-Peng Huang(黃少鵬) Chunghwa Telecommunication Laboratories 12, Lane 551, Min-Tsu Road Sec.5 Yang-Mei, Taoyuan, Taiwan 32601, R.O.C. TEL:886 3 424-5095, FAX:886 3 424-4742 Email: [email protected], [email protected], [email protected], [email protected] Abstract—For a long time, facial expression Besides the PCA and LDA, Gabor filter method [3] recognition is an important issue to be full of challenge. In is also used in facial feature extraction. This method has this paper, we propose a method for facial expression both multi-scale and multi-orientation selection in recognition. Firstly we take the face detection method to choosing filters which can present some local features of detect the location of face. Then using the Local Binary facial expression effectively. However, the Gabor filter Patterns (LBP) extracts the facial features. When method suffers the same problem as PCA and LDA. It calculating the LBP features, we use an NxN window to be a statistical region and remove this window by certain would cost too much computation and high dimension of pixels. Finally, we adopt the Support Vector Machine feature space. (SVM) method to be a classifier and recognize the facial In this paper, we use the Local Binary Pattern expression. In the experimental process, we use the JAFFE (LBP) [4][5] as the facial feature extraction method. database and recognize seven kinds of expressions. The average correct rate achieves 93.24%. According to the LBP has low computation cost and efficiently encodes experimental results, we prove that this proposed method the texture features of micro-pattern information in the has the higher accuracy. face image. In the first step, we have to detect the face area to remove the background image. We extract the Keywords: facial expression, face detection, LBP, SVM Haar-like [6] features and use the Adaboost [7] classifier for face detection. The face detection module can be found in the Open Source Computer Vision Library I. INTRODUCTION (OpenCV). After adopting the face area, we calculate this area’s LBP features. Finally, using the Support To analyze facial expression can provide much Vector Machine (SVM) classifies the LBP feature and interesting information and used in several applications. recognizes the facial expression. Experimental results Take electronic board as example, we can realize demonstrate the effective performance of the proposed whether the commercials attract the customers or not by method. the facial expression recognition. In recent years, many researches had worked on this technique of human- The rest of this paper is organized as follows: In computer interaction. Section Ⅱ, we introduce our system flow chart and the The basic key point of any image processing is to face detection. In section Ⅲ, we explain the facial LBP extract the facial features from the original images. representation and SVM classifier. In Section Ⅳ , Principal Component Analysis (PCA) [1] and Linear experimental results are presented. Finally, we give brief Discriminant Analysis (LDA) [2] are two methods used discussion and conclusion in section Ⅴ. widely. PCA computes a set of eigenvalues and eigenvectors. By selecting several most significant II. THE PROPOSED METHOD eigenvectors, it produces the projection axes to let the images projected and minimizes the reconstruction error. 
The flow chart of the proposed facial expression The goal of LDA is to find a linear transformation by recognition method was shown in Fig.1. In the first step, minimizing the within-class variance and maximizing the face detection is performed on the original image to the between-class variance. In other words, PCA is locate the face area. In order to reduce the region of hair suitable for data analysis and reconstruction. LDA is image or the background image, we take a smaller area suitable for classification. But the dimension of image is from the face area after the face detection. In the second usually higher, the calculations require for the process of step, using the LBP method extracts the facial feature extraction would be significant. expression features. When calculating the histogram of LBP features, we use an NxN window to be a statistical 1066
    Original Image Figure 2. Haar-like features: the first row is for the edge Face Detection features and the second row is for the line features. The face detection module can be found in the Open Source Computer Vision Library (OpenCV) [10]. But if we use the original detection region, it may include some areas which are unnecessary, such as hair LBP Feature or background. For avoiding this situation, we cut the Extraction smaller area from the detection region and try to reduce the unnecessary areas but also keep the important features. This area’s width is 126 and the height is 147. Fig. 4 shows the final result of face area. SVM Classification Features Weak Pass Weak Pass Pass Weak Pass Classifier Classifier Classifier A face area 1 2 N Recognition Deny Deny Deny Result Not a face area Figure 1. The flow chart of the proposed method. Figure 3. The decision process of cascade Adaboost. region and move this window by certain pixels. In the last step, SVM classifier is used for the facial expression recognition. A. The Face Detection Viola and Jones [9] used the Haar-like feature for face detection. There are some Haar-like feature samples shown in Fig. 2. Haar-like features can highlight the differences between the black region and the white region. Each portion in facial area has different property, for example, the eye region is darker than the nose region. Hence, the Haar-like features may extract rich information to discriminate different regions. The cascade of classifiers trained by Adaboost technique is an optimal way to reduce the time for searching face area. In this cascade algorithm, the boosted classifier combines several weak classifiers to become a strong classifier. Different Haar-like features are selected and processed by different cascade weak classifiers. Fig. 3 shows the decision process of this algorithm. If the feature set passes through all of the weak classifiers, it is acknowledged as the face area. On the other hand, if the feature set is denied by any weak classifier, it is rejected. Figure 4. The first column is the original images; the second column is the final face areas. 1067
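As an illustration of this detection step, the sketch below runs OpenCV's pretrained frontal-face Haar cascade (the same Viola–Jones/Adaboost machinery described above) and then cuts a tighter 126×147 window from the detection. The crop size follows the dimensions given in the text; centring the crop on the detected face and the detector parameters are assumptions.

```python
import cv2

CASCADE_PATH = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"

def detect_face_region(image_path, crop_w=126, crop_h=147):
    """Detect the largest frontal face with the cascade and cut a tighter window."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cv2.CascadeClassifier(CASCADE_PATH).detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                        # no face found in this image
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # keep the largest detection
    cx, cy = x + w // 2, y + h // 2                        # centre of the detected face
    # Cut a smaller area (126x147 as in the text) to drop hair and background.
    return gray[max(cy - crop_h // 2, 0):cy + crop_h // 2,
                max(cx - crop_w // 2, 0):cx + crop_w // 2]
```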
    6 18 8 III. THE LBP METHOD AND SVM CLASSIFIER B. Local Binary Patterns 21 LBP was used in the texture analysis. This approach is defined as a gray-level invariant measurement method, derived from the texture in a local neighborhood. The LBP has been applied to many Figure 7. Representation of statistic way in width. different fields including the face recognition. By considering the 3x3-neighborhood, the operator C. Support Vector Machine assigns a label to every pixel around the central points in The SVM is a kind of learning machine whose an image. By thresholding each pixel with the center fundamental is statistics learning theory. It has been pixel value, the result is regarded as a binary number. widely applied in pattern recognition. Then, the histogram of the labels can be used as a The basic scheme of SVM is to try to create an texture descriptor. See Figure 5 for an illustration of the optimal hyper-plane as the decision plane, which basic LBP operator. maximizes the margin between the closest points of two Another extension version to the original LBP is classes. The points on the hyper-plane are called support called uniform patterns [11]. A Local Binary Pattern is vectors. In other words, those support vectors are used called uniform if it contains at most two bitwise to decide the hyper-plane. transitions from 0 to 1 or vice versa. For example, Assume we have a set of sample points from two 00011110 and 10000011 are uniform patterns. classes We utilized the above idea of LBP with uniform patterns in our facial expression representation. We {xi , yi }, i  1,, m xi  R N , yi  {1,1} (1) compute the uniform patterns using the (8, 2) neighborhood, which is shown in Fig.6. The (8, 2) stand the discrimination hyper-plane is defined as below: for finding eight neighborhoods in the radius of two. The black rectangle in the center means the threshold, m the other circle points around there mean the f ( x )   y i a i k ( x, xi )  b (2) neighborhoods. But we can see four neighborhoods are i 1 not located in the center of pixels, these neighborhoods’ values are calculated by interpolation method. After that, where f (x ) indicates the membership of x . ai and a sliding window with size 18x21 is used for uniform patterns statistic by shifting 6 pixels in width and 8 b are real constants. k ( x, xi )   ( x),  ( xi ) is a pixels in height. Fig.7 represents the statistic way in kernel function and  (x) is the nonlinear map from width. original space to the high dimensional space. The kernel function can be various types. For example, the linear   function is k ( x, xi )  x  xi , the radial basis function (RBF) kernel function is 1 k ( x, xi )  exp(  x  y ) and the polynomial 2 Figure 5. The basic idea of the LBP operator 2 2 kernel function is k ( x, xi )  ( x  xi  1) n . SVM can be designed for either two-classes classification or multi- classes classification. In this paper, we use the multi- classified SVM and polynomial kernel function [12]. IV. EXPERIMENTAL RESULTS In this paper, we use the JAFFE facial expression database [13]. The examples of this database are shown in the Table 1. The face database is composed of 213 gray scale images of 10 Japanese females. Each person has 7 kinds of expressions, and every expression Figure 6. LBP representation using the (8, 2) includes 3 or 4 copies. Those 7 expressions are Anger, neighborhood Disgust, Fear, Happiness, Neutral, Sadness and Surprise. 1068
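The feature-extraction and classification pipeline described above can be approximated with off-the-shelf libraries, as in the sketch below, which uses scikit-image's uniform LBP and scikit-learn's SVC as stand-ins for the authors' implementation. The (8, 2) neighbourhood, the 18×21 window, and the 6/8-pixel shifts follow the text; the polynomial degree and the per-window histogram normalisation are assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

P, R = 8, 2                     # (8, 2) neighbourhood from the paper
WIN_W, WIN_H = 18, 21           # sliding-window size
STEP_W, STEP_H = 6, 8           # shifts in width / height
N_BINS = P + 2                  # uniform patterns plus one "non-uniform" bin

def lbp_feature_vector(face):
    """Concatenate uniform-LBP histograms over overlapping windows of the face crop."""
    lbp = local_binary_pattern(face, P, R, method="uniform")
    hists = []
    for y in range(0, face.shape[0] - WIN_H + 1, STEP_H):
        for x in range(0, face.shape[1] - WIN_W + 1, STEP_W):
            patch = lbp[y:y + WIN_H, x:x + WIN_W]
            h, _ = np.histogram(patch, bins=N_BINS, range=(0, N_BINS))
            hists.append(h / (h.sum() + 1e-6))
    return np.concatenate(hists)

# Multi-class SVM with a polynomial kernel for the seven JAFFE expressions
# (the degree is an assumption; the paper only states that a polynomial kernel is used).
clf = SVC(kernel="poly", degree=2)
# clf.fit(np.vstack([lbp_feature_vector(f) for f in train_faces]), train_labels)
# predictions = clf.predict(np.vstack([lbp_feature_vector(f) for f in test_faces]))
```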
    Table I THE EXAMPLES OF JAFFE DATABASE Table III THE COMPARISON RESULTS Anger The Reference Reference proposed [14] [15] method Disgust Anger 95% 95.2% 90% Disgust 88% 95.2% 88.89% Fear Fear 100% 85.7% 92.3% Happiness Happiness 100% 84.9% 100% Neutral 75% 100% 100% Neutral Sadness 90% 90.4% 81.8% Surprise 100% 89.8% 100% Sadness Average 92.57% 91.6% 93.24% Surprise According to the Table 3, we can realize the proposed method has the better performance than the The size of each image is 256x256 pixels. Two images other two references obviously. Even though some of each expression for all of the people are used as recognition rates of expressions aren’t as good as the training samples and the rest are testing samples. Hence two reference methods, we still have the highest average the total number of training sample is 140, and the recognition rate. number of testing sample is 73. V. CONCLUSIONS The Table 2 shows that recognition rate of each facial expression which were experimented by the In this paper, we proposed a facial expression proposed method. The last row is the average recognition method by using the LBP features. For recognition rate of 7 expressions, which is 93.24%. The decreasing the computing efforts, we detect the face recognition time of each face image is 0.105 seconds. region before the LBP method. After we extract the We also compare our experimental results with facial features from the detected area, the SVM some references. In reference [14], the author used the classifier will recognize the facial expression finally. By Gabor features and NN fusion method. In another using the JAFFE be the experiment database, we can reference [15], the author took the face image into three prove the proposed method has the 93.24% correction parts and used the 2DPCA method. The training images rate and better than the two reference methods. and test images are the same as the proposed method. For the future work, we still have some aspects to Table 3 shows the comparison result. The average be studied hardly. Those experiments which we recognition rate of reference [14] is 92.57% and the discussed above have the same property. This property reference [15] is 91.6%. is that the training and testing samples are from the same person. In other word, if we want to recognize someone’s expression, we must have his images of Table II THE RECOGNITION RATE OF PROPOSED METHOD various expressions in database previously. But this property is not suitable for the real application. In the Anger 90% future, we want to overcome this problem. Perhaps we Disgust 88.89% can utilize the variations between the different expressions to become a model and use this model to Fear 92.3% recognize. There are other problems in the facial Happiness 100% recognition still have to be dealt with, such as the lighting variation and the pose changing. Those difficult Neutral 100% issues exist for a long time. We will try to find out a Sadness 81.8% better algorithm to enhance our method. Surprise 100% REFERENCES Average 93.24% [1] L.I. Smith, “A Tutorial on Principal Components Analysis”, 2002. [2] H. Yu and J. Yang, “A Direct LDA Algorithm for High- 1069
    Dimensional Data withApplication to Face Recognition”, Pattern Recognition, vol. 34, no. 10, pp.2067–2070, 2001. [3] Deng Hb, Jin Lw and Zhen Lx et al, “A New Facial Expression Recognition Method Based on Local Gabor Filter Bank and PCA plus LDA”, International Journal of Information Technology, vol.11, no. 11, pp.86-96, 2005. [4] Timo Ahonen, Abdenour Hadid and Matti Pietika¨ inen, “Face Description with Local Binary Patterns: Application to Face Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp.2037–2041, 2006. [5] Timo Ahonen, Abdenour Hadid and Matti Pietika¨ inen, “Face Recognition with Local Binary Patterns”, Springer-Verlag Berlin Heidelberg 2004, pp.469–481, 2004. [6] Pavlovic V. and Garg A. “Efficient Detection of Objects and Attributes using Boosting”, IEEE Conf. Computer Vision and Pattern Recognition, 2001. [7] Jerome Friedman, Trevor Hastie and Robert Tibshirani, “Additive Logistic Regression: A Statistical View of Boosting”, The Annals of Statistics, vol. 28, no. 2, pp.337–407, 2000. [8] C. Burges, Tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 955-974, 1998. [9] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, Proceedings of the 2001 IEEE Computer Society Conference, vol 1, 2001, pp. I-511-I-518. [10] Intel, “Open source computer vision library; http://sourceforge.net/projects/opencvlibrary/”, 2001. [11] T. Ojala, M. Pietika¨inen, and T. Ma¨enpa¨a¨, “Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, July 2002. [12] Dana Simian, “A model for a complex polynomial SVM kernel”, Mathematics And Computers in Science and Engineering, pp. 164-169, 2008. [13] M. Lyons, S. Akamatsu, etc. “Coding Facial Expressions with Gabor Wavelets”. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition,Nara Japan, 200-205, 1998. [14] WeiFeng Liu and ZengFu Wang “Facial Expression Recognition Based on Fusion of Multiple Gabor Features”, International Conference on Pattern Recognition, 2006 [15] Bin Hua and Ting Liu , “Facial expression recognition based on FB2DPCA and multi-classifier fusion”, International Conference on Information Technology and Computer Science, 2009. 1070
    MILLION-SCALE IMAGE OBJECTRETRIEVAL 1 1,2 Yin-Hsi Kuo (郭盈希) and Winston H. Hsu (徐宏民) 1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei 2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei ABSTRACT In this paper, we present a real-time system that addresses three essential issues of large-scale image object retrieval: 1) image object retrieval—facilitating pseudo-objects in inverted indexing and novel object- level pseudo-relevance feedback for retrieval accuracy; 2) time efficiency—boosting the time efficiency and memory usage of object-level image retrieval by a novel inverted indexing structure and efficient query evaluation; 3) recall rate improvement—mining semantically relevant auxiliary visual features through visual and textual clusters in an unsupervised and scalable (i.e., MapReduce) manner. We are able to search over one-million image collection in respond to a user query in 121ms, with significantly better accuracy (+99%) than the traditional bag-of-words model. Figure 1: With the proposed auxiliary visual feature Keywords Image Object Retrieval; Inverted File; Visual discovery, more accurate and diverse results of image Words; Query Expansion object retrieval can be obtained. The search quality is greatly improved. Regarding efficiency, because the 1. INTRODUCTION auxiliary visual words are discovered offline on a MapReduce platform, the proposed system takes less than one second searching over million-scale image Different from traditional content-based image retrieval collection to respond to a user query. (CBIR) techniques, the target images to match might only cover a small region in the database images. The needs raise a challenging problem of image object noisily quantized descriptors. Meanwhile, the target retrieval, which aims at finding images that contain a images generally have different visual appearances specific query object rather than images that are globally (lighting condition, occlusion, etc). To tackle these similar to the query (cf. Figure 1). To improve the issues, we propose to mine visual features semantically accuracy of image object retrieval and ensure retrieval relevant to the search targets (see the results in Figure 1) efficiency, in this paper, we consider several issues of and augment each image with such auxiliary visual image object retrieval and propose methods to tackle features. As illustrated in Figure 5, these features are them accordingly. discovered from visual and textual graphs (clusters) in an State-of-the-art object retrieval systems are mostly unsupervised manner by distributed computing (i.e., based on the bag-of-words (BoW) [6] representation and MapReduce [1]). Moreover, to facilitate object-level inverted-file indexing methods. However, unlike textual indexing and retrieval, we incorporate the idea of queries with few semantic keywords, image object pseudo-objects [4] to the inverted file paradigm and the queries are composed of hundreds (or few thousands) of pseudo-relevance feedback mechanism. A novel efficient 1071
    Figure 2: Thesystem diagram. Offline part: We extract visual and textual features from images. Textual and visual image graphs are constructed by an inverted list-based approach and clustered by an adapted affinity propagation algorithm by MapReduce (18 Hadoop servers). Based on the graphs, auxiliary visual features are mined by informative feature selection and propagation. Pseudo-objects are then generated by considering the spatial consistency of salient local features. A compact inverted structure is used over pseudo-objects for efficiency. Online part: To speed up image retrieval, we proposed an efficient query evaluation approach for inverted indexing. The retrieval process is then completed by relevance scoring and object-level pseudo-relevance feedback. It takes around 121ms to produce the final image ranking of image object retrieval over one-million image collections. query evaluation method is also developed to remove Inverted file is a popular way to index large-scale data in unreliable features and further improve accuracy and the information retrieval community [8]. Because of its efficiency. superiority of efficiency, many recent image retrieval Experiment shows that the automatically discovered systems adopt the concept to index visual features (i.e. auxiliary visual features are complementary to VWs). The intuitive way is to record each entry with conventional query expansion methods. Its performance image ID, VW frequency in the inverted file. is significantly superior to the BoW model. Moreover, However, to our best knowledge, most systems simply the proposed object-level indexing framework is adopt the conventional method to the visual domain, remarkably efficiency and takes only 121ms for without considering the differences between documents searching over the one million image collection. and images, where the image query is composed of thousands of (noisy) VWs and the object of interest may occupy small portions of the target images. 2. SYSTEM OVERVIEW 3.1. Pseudo-Objects Figure 2 shows a schematic plot of the proposed system, which consists of offline and online parts. In the offline Images often contain several objects so we cannot take part, visual features (VWs) and textual features (tfidf of the whole image features to represent each object. Each expanded tags) are extracted from the images. We then object has its distinctive VWs. Motivated by the novelty propagate semantically relevant VWs from the textual and promising retrieval accuracy in [4], we adopt the domain to the visual domain, and remove visually concept of pseudo-object—a subset of proximate feature irrelevant VWs in the visual domain (cf. Section 4). All points with its own feature vector to represent a local these operations are performed in an unsupervised area. An example shows in Figure 4 that the pseudo- manner on the MapReduce [1] platform, which is objects, efficiently discovered, can almost catch different famous of it scalability. Operations including image objects; however, advanced methods such as efficient graph construction, clustering, and mining over million- indexing or query expansion are not considered. We scale images can be performed efficiently. To further further propose a novel object-level inverted indexing. enhance efficiency, we index the VWs by the proposed object-level inverted indexing method (cf. Section 3). 3.2. Index Construction We incorporate the concept of pseudo-object and adopt compression methods to reduce memory usage. 
Unlike document words, VWs have a spatial dimension. In the online part, an efficient retrieval algorithm is Neighboring VWs often correspond to the same object in employed to speed up the query process without loss of an image, and an image consists of several objects. We retrieval accuracy. In the end, we apply object-level adopt pseudo-objects and store the object information in pseudo-relevance feedback to refine the search result and the inverted file to support object-level image retrieval. improve the recall rate. Unlike its conventional Specifically, we construct an inverted list for each VW t counterpart, the proposed object-level pseudo-relevance as follows, Image ID i, ft,i, RID1, ... ,RIDf, which feedback places more importance on local objects indicates the ID of the image i where the VW appears, instead of the whole image. the occurrence frequency (ft,i), and the associated object region ID (RIDf) in each image. The addition of the object ID to the inverted file makes it possible to search 3. OBJECT-LEVEL INVERTED INDEXING for a specific object even if the object only occupies a small region of an image. 1072
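A minimal sketch of this object-level posting structure is shown below. The in-memory dictionary layout and the input format are illustrative assumptions; the fields stored per posting (image ID, VW frequency, associated pseudo-object region IDs) follow the description above.

```python
from collections import defaultdict

def build_object_level_index(images):
    """images: {image_id: [(visual_word, region_id), ...]}, one pair per quantised feature.
    Each posting keeps the image ID, the VW frequency in that image, and the IDs of the
    pseudo-object regions (R0 = whole image, R1..Rn = discovered pseudo-objects)."""
    index = defaultdict(dict)            # visual_word -> {image_id: (freq, [region_ids])}
    for image_id, features in images.items():
        per_vw = defaultdict(list)
        for vw, region_id in features:
            per_vw[vw].append(region_id)
        for vw, regions in per_vw.items():
            index[vw][image_id] = (len(regions), regions)
    return index
```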
    Figure 3: Illustrationof efficient query evaluation (cf. Section 3). To achieve time efficiency, first, we rank a visual word by its salience to the query and then retrieve the designated number of candidate images (e.g., 7 images, A to G). After deciding the candidate images, we skip the irrelevant images and cut those non-salient VWs. 3.3. Index Compression Index compression is a common way to reduce memory usage in textual domain. First, we discard the top 5% frequent VWs as stop words to decrease the mismatch rate and reduces the size of inverted file. We then adopt different coding methods to compress data based on their Figure 4: Object-level retrieval results by pseudo- visual characteristics. Image IDs are ordinal numbers objects and object-level pseudo-relevance feedback. sorted in ascending order in the lists, thus we store the The letter below each image represents the region difference between adjacent image IDs instead of the (pseudo-object) with the highest relevance to query image ID itself which is called d-gap [8]. And for region object by (2). The region information is essential for IDs, we adopt a fixed length bit-level coding of three bits query expansion. Instead of using the whole image as to encode it (e.g., R2 010). On the other hand, we use the seed for retrieving other related images, we can a variant length bit-level coding to encode frequency easily identify those related objects (e.g., R0, R5, R0) and (e.g., 3 1110). Furthermore, we implement AND and mitigate the influence of noisy features. Note that the SHIFT operations to efficiently decode the frequency yellow dots in the background are detected feature and region IDs at query time. The memory space for points. indexing pseudo-objects can be saved about 54.1%. 3.4. Object-Level Scoring Method 3.5. Efficient Query Evaluation (EQE) We use the intersection of TFIDF, which performs the best for matching, to calculate the score of each region Conventional query evaluation in inverted indexing indexed by VW t. Besides the discovered pseudo-objects, needs to keep track of the scores of all images in the we also define a new object R0 to treat the whole image inverted lists. In fact, it is observed that most of the as another object. We first calculate the score of every scored images contain only a few matched VWs. We pseudo-object (R) to the query object (Q) as follows, propose an efficient query evaluation (EQE) algorithm that explores a small part of a large-scale database to score ( R , Q ) = ∑ IDFt × min( wt , R , wt ,Q ), (1) reduce the online retrieval time. The procedures of EQE t∈Q are described below and illustrated in Figure 3. where wt,R and wt,Q are the normalized VW frequency in 1. Query term ranking: The ranking score in (1) pseudo-object and in the query respectively. And then favors the query term with higher frequency and the pseudo-object with the highest score is regarded as IDFt; therefore, we sort the query terms according the most relevant object with respect to the query, as to its salience, which is calculated as wt,Q×IDFt for suggested in [4]: VW t. The following phases are then processed sequentially to deal with VWs ordered and score(i,Q) = max{score(R,Q) | R ∈ i}. (2) weighted by their visual significance to the query. 2. Collecting phase: In the retrieval process, user only cares about the images in the top ranks. 1073
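Equations (1) and (2) translate directly into a few lines of code. The sketch below assumes the per-region and query visual-word weights are already normalised frequencies and that IDF values have been precomputed offline.

```python
def score_region(region_w, query_w, idf):
    """Eq. (1): IDF-weighted histogram intersection between a pseudo-object R and the query Q."""
    return sum(idf.get(t, 0.0) * min(region_w.get(t, 0.0), w_q)
               for t, w_q in query_w.items())

def score_image(image_regions, query_w, idf):
    """Eq. (2): an image scores as its most relevant pseudo-object,
    with the whole image included as the extra region R0."""
    return max(score_region(r, query_w, idf) for r in image_regions)
```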
    (a)visual cluster example(b)representative VW selection (c)example results (d)auxiliary VW propagation (e)textual cluster example Figure 5: Image clustering results and mining auxiliary visual words. (a) and (e) show the sample visual and textual clusters; the former keeps visually similar images in the same cluster, while the latter favors semantic similarities. The former facilitates representative VW selection, while the latter facilitates semantic (auxiliary) VW propagation. (b) and (d) illustrate the selection and propagation operations based on the cluster histogram as detailed in Section 4. And a simple example shows in (c). Therefore, instead of calculating the score of each R0 in Figure 4), we can further remove irrelevant objects image, we score the top images of the inverted lists such as the toy in R4 of the second image. and add them to a set S until we have collected sufficient number of candidate images. 4. AUXILIARY VISUAL WORD (AVW) 3. Skipping phase: After deciding the candidate DISCOVERY images, we skip the images that do not appear in the collecting phase. For every image i in the inverted list, score the image i if i∈S , otherwise Due to the limitation of VWs, it is difficult to retrieve skip it. If the number of visited VWs reaches a images with different viewpoints, lighting conditions and predefined cut ratio, go on to the next phase. occlusions, etc. To improve recall rate, query expansion is the most adopted method; however, it is limited by the 4. Cutting phase: Simply remove the remaining VWs, quality of initial retrieval results. Instead, in an offline which usually have little influence on the results. stage, we augment each image with auxiliary visual And then the process stops here. features and consider representative (dominant) features This algorithm works remarkably well, bringing in its visual clusters and semantically related features in about almost the same retrieval quality with much less its textual graph respectively. Such auxiliary visual computational cost. As image queries are generally features can significantly improve the recall rate as composed of thousands or hundreds of (noisy) VWs, demonstrated in Figure 1. We can deploy all the rejecting those non-salient VWs significantly improves processes in a parallel way by MapReduce [1]. Besides, the efficiency and slightly improves the accuracy. the by-product of auxiliary visual word discovery is the reduction of the number indexed visual features for each 3.6. Object-Level Pseudo-Relevance Feedback image for better efficiency in time and memory. (OPRF) Moreover, it is easy to embed the auxiliary visual features in the proposed indexing framework by adding Conventional approach using whole images for pseudo- one new region for those discovered auxiliary visual relevance feedback (PRF) may not perform well when features not existing in the original VW set. only a part of retrieved images are relevant. In such a case, many irrelevant objects would be included in PRF, 4.1. Image Clustering by MapReduce resulting in too many query terms (or noises) and degrading the retrieval accuracy. To tackle this issue, a The image clustering is first based on a graph novel object-level pseudo-relevance feedback (OPRF) construction. The images are represented by 1M VWs algorithm is proposed. Rather than using the whole and 50K text tokens expanded by Google snippets from images, we select the most important objects from each their associated (noisy) tags. 
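The four EQE phases can be compressed into a single pass over the ranked query terms, as in the simplified sketch below. It scores candidates with plain visual-word weights rather than the full region-level score of Eq. (1), and the candidate-set size and cut ratio are tunable assumptions; the posting-list format matches the object-level index sketched earlier.

```python
def efficient_query_evaluation(query_w, index, idf, n_candidates=1000, cut_ratio=0.5):
    """Simplified sketch of the four EQE phases of Section 3.5."""
    # Phase 1 -- query term ranking: order VWs by salience w_{t,Q} * IDF_t.
    terms = sorted(query_w, key=lambda t: query_w[t] * idf.get(t, 0.0), reverse=True)
    scores, candidates = {}, set()
    for rank, t in enumerate(terms, start=1):
        for image_id, (freq, _regions) in index.get(t, {}).items():
            if len(candidates) < n_candidates:
                candidates.add(image_id)                     # Phase 2 -- collecting
            elif image_id not in candidates:
                continue                                     # Phase 3 -- skipping
            scores[image_id] = scores.get(image_id, 0.0) \
                + idf.get(t, 0.0) * min(freq, query_w[t])
        # Phase 4 -- cutting: drop the remaining, non-salient query VWs.
        if len(candidates) >= n_candidates and rank >= cut_ratio * len(terms):
            break
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```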
However, it is very of the top-ranked images and use them for PRF. The challenging to construct image graphs for million-scale importance of each object is estimated according to (2). images. To tackle the scalability problem, we construct By selecting relevant objects in each image (e.g., R0, R5, 1074
    image graphs usingMapReduce model [1], a scalable images in the same textual cluster are semantically close framework that simplifies distributed computations. but usually visually different. Therefore, these images We take the advantage of the sparseness and use provide a comprehensive view of the same object. cosine measure as the similarity measure. Our algorithm Propagating the VWs from the textual domain can extends the method proposed in [2] which uses a two- therefore enrich the visual descriptions of the images. As phase MapReduce model—indexing phase and the example shows in Figure 5(c), the bottom image can calculation phase—to calculate pairwise similarities. It obtain auxiliary VWs with the different lighting takes around 42 minutes to construct a graph of 550K condition of the Arc de Triomphe. The similarity score images on 18-node Hadoop servers. To cluster images can be weighted to decide the number of VWs to be on the image graph, we apply affinity propagation (AP) propagated. Specifically, we derive the VW histogram proposed in [3]. AP is a graph-based clustering from the images of each cluster and then propagate VWs algorithm. It passes and updates messages among nodes based on the cluster histogram weighted by its (semantic) on graph iteratively and locally—associating with the similarity to the canonical image of the textual cluster. sparse neighbors only. It takes around 20 minutes for each iteration and AP converges generally around 20 4.4. Combining Selection and Propagation iterations (~400 minutes) for 550K images by MapReduce model. The selection and propagation operations described The image clustering results are sampled in Figure above can be performed iteratively. The selection 5(a) and (e). Note that if an image is close to the operation removes visually irrelevant VWs and improves canonical image (center image), it has a higher AP score, memory usage and efficiency, whereas the propagation indicating that it is more strongly associated with the operation obtains semantically relevant VWs to improve cluster. Moreover, images in the same visual cluster are the recall rate. Though propagation may include too often visually similar to each other, whereas some of the many VWs and thus decrease the precision, we can images in the same textual cluster differ in view, lighting perform selection after propagation to mitigate this effect. condition, angle, etc., and are potential to bring A straightforward approach is to iterate the two complementary VWs for other images in the same operations until convergence. However, we find that it is textual cluster. enough to perform a selection first, a propagation next, and finally a selection because of the following reasons. 4.2. Representative Visual Word Selection First, only the propagation step updates the auxiliary visual feature and textual cluster images are fixed; each We first propose to remove irrelevant VWs in each image will obtain distinctive VWs at the first image to mitigate the effect of noise and quantization propagation step. The subsequent propagation steps will error to reduce memory usage in the inverted file system only modify the frequency of the VWs. As the objective and to speed up search efficiency. We observe that is to obtain distinctive VWs, frequency is less important images in the same visual cluster are visually similar to here. Second, binary feature vectors perform better or at each other (cf. Figure 5(a)). As illustrated in Figure 5(c), least comparable to the real-valued. 
the middle image can then have representative VWs from the visual cluster it belongs to. We accumulate the number of each VW from the images of a cluster to form 5. EXPERIMENTS a cluster histogram. As shown in Figure 5(b), each image donates the same weight to the cluster histogram. We 5.1. Experimental Setup can then select the VWs whose occurrence frequency is above a predefined threshold (e.g., in Figure 5(b) the We evaluate the proposed methods using a large-scale VWs in red rectangles are selected). photo retrieval benchmark—Flickr550 [7]. Besides, we randomly add Manhattan photos to Flickr550 to make it 4.3. Auxiliary Visual Word Propagation a 1 million dataset. As suggested by many literatures (e.g., [5]), we use the Hessian-affine detector to extract Due to variant capture conditions, some VWs that feature points in images. The feature points are described strongly characterize the query object may not appear in by SIFT and quantized into 1 million VWs for better the query image. It is also difficult to obtain these VWs performance. In addition, we use the average precision to through query expansion method such as PRF because of evaluate the retrieval accuracy. Since average precision the difference in visual appearance between the query only shows the performance for a single image query, we image and the retrieved. Mining semantically relevant compute the mean average precision (MAP) to represent VWs from other information source such as text is the system performance over all the queries. therefore essential to improve the retrieval accuracy. As illustrated in Figure 5(e), we propose to augment 5.2. Experimenal Results each image with VWs propagated from the textual cluster result. This is based on the observation that 1075
    Table 1: Thesummarization of the impacts in the features points. This result shows that the selection performance and query time comparing with the and propagation operations are effective in mining useful baseline methods. It can be found that our proposed features and remove the irrelevant one. In addition, the methods can achieve better retrieval accuracy and relative improvement of AVW (+44%) is orthogonal and respond to a user query in 121ms over one-million complement to OPRF (0.352 0.487, +38%). photo collections. The number in the parentheses indicates relative gain over baseline. And the symbol ‘%’ stands for relative improvement over BoW model 6. CONCLUSIONS [6]. (a) Image object retrieval In this paper, we cover four aspects of large-scale retrieval system: 1) image object retrieval over one- MAP Baseline PRF OPRF million image collections—responding to user queries in 0.290 0.324 121ms, 2) the impact of object-level pseudo-relevance Pseudo-objects [4] 0.251 (+15.5%) (+29.1%) feedback—boosting retrieval accuracy, 3) time (b) Time efficiency efficiency with efficient query evaluation in the inverted Flickr550 One-million file paradigm—comparing with the traditional inverted Pseudo-objects [4] +EQE +EQE file structure, and 4) image object retrieval based on effective auxiliary visual feature discovery—improving Query time 854 56 121 the recall rate. That is to say, the efficiency and (ms) effectiveness of the proposed methods are validated over (c) Recall rate improvement large-scale consumer photos. BoW model [6] AVW AVW+OPRF MAP 0.245 0.352 0.487 REFERENCES % - 43.7% 98.8% [1] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” OSDI, 2004. We first evaluate the performance of object-level PRF [2] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document (OPRF) in boosting the retrieval accuracy. As shown in similarity in large collections with mapreduce,” ACL, Table 1(a), OPRF outperforms PRF by a great margin 2008. (relative improvement 29.1% vs. 15.5%). The result shows that the pseudo-object paradigm is essential for [3] B. J. Frey and D. Dueck, “Clustering by passing PRF-based query expansion in object-level image messages between data points,” Science, 2007. retrieval since the targets of interest might only occupy a small portion of the images. [4] K.-H. Lin, K.-T. Chen, W. H. Hsu, C.-J. Lee, and T.-H. Li, “Boosting object retrieval by estimating pseudo- We then evaluate the query time of object-level objects,” ICIP, 2009. inverted indexing augmented with efficient query evaluation (EQE) to achieve time efficiency. The query [5] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. time is 15.2 times faster (854 56) after combining Zisserman, “Object retrieval with large vocabularies and with EQE method as shown in Table 1(b). The reasons fast spatial matching,” CVPR, 2007. attribute to the selection of salient VWs and ignoring those insignificant inverted lists. It is essential since [6] J. Sivic and A. Zisserman, “Video google: a text retrieval approach to object matching in videos,” ICCV, unlike textual queries with 2 or 3 query terms, an image 2003. query might contain thousands (or hundreds) of VWs. Therefore, we can respond to a user query in 121ms over [7] Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W. H. one-million photo collections. Hsu, and H. Chen, “ContextSeer: context search and Finally, to improve recall, we evaluate the recommendation at query time for shared consumer performance of auxiliary visual word (AVW) discovery. 
photos,” ACM MM, 2008. As shown in Table 1(c), the combination of selection, propagation and further OPRF brings 99% relative [8] J. Zobel and A. Moffat, “Inverted files for text search improvement over BoW model and reduces one-fifth of engines,” ACM Computing Surveys, 2006 1076
    Sport Video HighlightExtraction Based on Kernel Support Vector Machines Po-Yi Sung, Ruei-Yao Haung, and Chih-Hung Kuo, Department of Electrical Engineering National Cheng Kung University Tainan, Taiwan { n2895130 , n2697169 , chkuo }@mail.ncku.edu.tw Abstract—This paper presents a generalized highlight density of cuts, and audio energy, with a derived function to extraction method based on Kernel support vector machines detect highlights. In [5], Duan proposes a technique that (Kernel SVM) that can be applied to various types of sport searches shots with goalposts and excited voices to find video. The proposed method is utilized to extract highlights highlights for soccer programs. To locate scenes of the without any predefining rules of the highlights events. The goalposts in football games, the technique of Chang [6] framework is composed of the training mode and the analysis detects white lines in the field, and then verifies touch-down mode. In the training mode, the Kernel SVM is applied to train shots via audio features. Wan [7] detects voices in classification plane for a specific type of sport by shot features commentaries with high volume, combined with the of selected video sequences. And then the genetic algorithm frequency of shot change and other visual features to locate (GA) is adopted to optimize kernel parameters and select features for improving the classification accuracy. In the goal events. Huang [8] exploited color and motion analysis mode, we use the classification plane to generate the information to find logo objects in the replay of sport video. video highlights of sport video. Accordingly, viewers can access All these techniques have to depend on predefined rules for a important segments quickly without watching through the single specific type of sport video, and as a result may need entire sport video. lots of human efforts to analyze the video sequences and identify the proper objects for highlights in the particular Keywords-Highlight extraction; Sport analysis; Kernel type of sport. support vector machines; Genetic algorithm Many other techniques have employed probabilistic models, such as Hidden Markov Models (HMM), to look for I. INTRODUCTION the correlations of events and the temporal dependency of features [9]-[15]. The selected scene types are represented by Due to the rapid growth of multimedia storage hidden states, and the state transition probabilities can be technologies, such as Portable Multimedia Player (PMP), evaluated by the HMM. Highlights can be identified HD DVD and Blu-ray DVD, large amounts of video contents accurately by some specific transition rules. However, it is can be saved in a small piece of storage device. However, hard to include all types of highlight events in the same set of people may not have sufficient time to watch all the recorded rules, and the model may fail to detect highlights if the video programs. They may prefer skipping less important parts and features are different from the original ones. Cheng [16] only watch those remarkable segments, especially for sport proposed a likelihood model to extract audio and motion videos. Highlight extraction is a technique making use of features, and employed the HMM to detect the transition of video content analysis to index significant events in video the integrated representation for the highlight segments. This data, and thereby help viewers to access the desired parts of kind of methods all need to estimate the probabilities of state the content more efficiently. 
This technique can also be a transitions, which has to be set up through intense human help to the processes of summarization, retrieval, and observations. abstraction from large amounts of video database. Most of the previous researches have adopted rule-based In this paper, we focus on the highlight extraction methods, whereby the rules are heuristically set to describe techniques for sport videos. Many works have been proposed the dynamics among objects and scenes in the highlight that can identify objects that appear frequently in sport events of a specific sport. The rules set for one kind of sport highlights. Xiong [1] propose a technique that extracts audio video usually cannot be applied to the other kinds. In [17], and video objects that are frequently appearing in the we have proposed a more generalized technique based on highlight scenes, like applauses, the baseball catcher, the low-level semantic features. In this approach, we can soccer goalpost, and so on. Tong [2] characterized three generate highlight tempo curves without defining essential aspects for sport videos: focus ranges of the camera, complicated transitions among hidden states, and hence we object types, and video production techniques. Hanjalic et al. can apply this technique to various kinds of videos. [3]-[4] measured three factors, that is, motion activity, 1077
    In this paper,we extend our technique [17] and A. Shot Change Detection incorporate it with the framework of Kernel support vector The task in this stage is to detect the transition point from machines (Kernel SVM). For each type of sport video, a one scene to another. Histogram differences of two small amount of highlight shots are input so that some consecutive frames are calculated by (2) to detect the shot unified features can be extracted. Then apply the Kernel changes in video sequences. A shot change is said to be SVM system to train the classification plane and utilize the detected if the histogram difference is greater than a trained classification plane to analyze other input videos of predefined threshold. The pixel values that are employed to the same sport type, generating the highlight shots. calculate the histogram contains luminance only, since the The rest of this paper is organized as follows. Section II human visual system is more sensitive to luminance presents the overview of the proposed system. Section III (brightness) than to colors. The histogram difference is details the method for highlight shots classification and computed by the equation highlight shots generation. The highlight extraction 255 performance and experimental results are shown in Section  H (i)  H I I 1 (i) (2) Ⅳ. SectionⅤ is the conclusion. DI  i0 N II. PROPOSED HIGHLIGHT SHOT EXTRACTION SYSTEM OVERVIEW where N is the total pixel number in a frame, and HI (i) is the Fig. 1 shows four stages of the proposed scheme: (1) shot pixel number of level i for the I-th frame. Finally, the video change detection, (2) visual and audio features computation, sequence will be separated into several shots according to (3) Kernel SVM training and analysis, and (4) highlight the shot change detection results. shots generation. In the first stage, histogram differences are B. Visual and Audio Features Computation counted to detect the shot change points. In the second stage, the feature parameters of each shot are computed and taken Each shot may contain lots of frames. To reduce the as the input eigenvalues into the Kernel SVM training and computation complexity, we select a keyframe to represent analysis system. The shot eigenvalues include shot length (L), the shot. In this work, we simply define the 10th frame of color structure (C), shot frame difference (Ds), shot motion each shot as the keyframe, since it is usually more stable (Ms), keyframe difference (Dkey), keyframe motion (Mkey), Y- than the previous frames, which may contain mixing frames histogram difference (Yd), sound energy (Es), sound zero- during scene transition. Many of the following features are crossing rate (Zs) and short-time sound energy (Est). They are extracted from this keyframe. collected as a feature set for the i-th shot 1) Shot Length  Vi  L, C, Ds , M s , Dkey , M key , Yd , Es , Zs , Est (1)  We designate the frame number in each shot as the shot length (L). Experiments show that the highlight shot lengths are shorter in non-highlight shots, such as the shots with In the third stage, the Kernel SVM either trains the judges or scenes with special effects. A highlight shot is parameters or analyzes the input features, according to the often longer than a non-highlight shot. For example, mode of the system. Then, in the last stage, highlight shots pitching in baseball games and shooting goal in soccer are generated based on the output of Kernel SVM. We games are usually longer in shot length. 
Hence, the shot explain the first two stages in the following, and the other length is an important feature for the highlights and is two stages are explained in Section Ⅲ. included as one of the input eigenvalues. 2) MPEG-7 Color Structure Highlight shots The color structure descriptor (C) is defined in the generation Highlight Highlight shot shot MPEG-7 standard [18,19] to describe the structuring property of video contents. Unlike the simple statistic of Training histograms, it counts the color histograms based on a data moving window called the structuring element. The analysis system GA Training and Kernel SVM Kernel SVM Baseball parameters descriptor value of the corresponding bin in the color optimization analysis training Basketball histogram is increased by one if the specified color is within and features mode mode selection Soccer the structuring element. Compared to the simple statistic of one histogram, the color structure descriptor can better Visual and audio features reflect the grouping properties of a picture. A smaller C Visual features Audio features value means the image is more structured. For example, both of the two monochrome images in Fig. 2 have 85 black Shot Shot pixels, and hence their histograms are the same. The color structure descriptor C of the image in Fig. 2-(a) is 129, Shot Shot while the image in Fig 2-(b) is more scattered with the C value 508. Fig. 3 depicts the curve of the C values in the detection Shot Video Data video of a baseball game. It shows that pictures with a Audio Data scattered structure usually have higher C values. Figure 1. The proposed highlight shots extraction system 1078
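To make the shot-boundary test of Eq. (2) concrete, the sketch below computes the normalized luminance-histogram difference between consecutive frames and reports a shot change when it exceeds a threshold. This is a minimal illustration, not the authors' implementation; the function names, the 8-bit luminance assumption, and the 0.5 threshold are ours.

```python
import numpy as np

def luminance_histogram(frame_y):
    """256-bin histogram of an 8-bit luminance (Y) frame."""
    return np.bincount(frame_y.ravel(), minlength=256)

def histogram_difference(prev_y, curr_y):
    """D_I = sum_i |H_I(i) - H_{I-1}(i)| / N   (cf. Eq. 2)."""
    n_pixels = curr_y.size
    h_prev = luminance_histogram(prev_y)
    h_curr = luminance_histogram(curr_y)
    return np.abs(h_curr - h_prev).sum() / n_pixels

def detect_shot_changes(frames_y, threshold=0.5):
    """Return indices of frames assumed to start a new shot."""
    boundaries = []
    for i in range(1, len(frames_y)):
        if histogram_difference(frames_y[i - 1], frames_y[i]) > threshold:
            boundaries.append(i)
    return boundaries
```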
    where W andH are the block numbers in the horizontal and vertical directions respectively. MVx,n(i, j) and MVy,n(i, j) are the motion vectors in x and y directions respectively, of the block at i-th row and j-th column in the n-th frame of the shot. The motion vector of a block represents the displacement in the reference frame from the co-located block to the best matched square, and is searched by minimizing the sum of absolute error (SAE) [21]. (a) K 1 K 1 SAE   C(i, j)  R(i, j) (5) i 0 j 0 where C(i, j) is the pixel intensity of a current block at relative position (i, j), and R(i, j) is the pixel intensity of a reference block. 5) Keyframe Difference and Keyframe Motion (b) We calculate the frame difference and estimate the Figure 2. The MPEG-7 Color Structure: (a) a highly structured motion activity between the keyframe and its next frame. monochrome image; (b) a scattered monochrome image. Both have the Suppose the k-th frame is a keyframe. The keyframe same histogram difference Dkey of the shot is defined by W 1 H 1 Dkey   f k (i, j)  f k 1 (i, j) (6) i 0 j 0 where fk(i, j) represents the intensity of the pixel at position (i, j) in the k-th frame. Similarly, keyframe motion Mkey represents the average magnitude of motion vectors inside the key frame and is defined as Figure 3. MPEG-7 Color Structure Descriptor curve in a baseball game. W 1 H 1  MVx (i, j)2  MVy (i, j)2 1 M key  (7) In this paper, we perform edge detection before W  H / K 2 i0 j 0 calculating the color structure descriptors. The resultant C value of each keyframe is regarded as an eigenvalue and where MVx(i, j) and MVy(i, j) denote the components of the included in the input data set of Kernel SVM. motion vectors in x- and y- directions respectively. 3) Shot Frame Difference 6) Y-Histogram Difference The average shot frame difference (Ds) of each shot is The average Y-histogram difference is calculated by defined by 255 L1 W 1 H 1  H (i)  H n1 (i) (  f (i, j)  f n1 (i, j) ) 1 1 L1 n Ds  (3) Yd  1  i0 (8) L 1 n1 WH i0 j 0 n L  1 n1 W H where W and H are frame width and height respectively, where Hn (i) represents the number of pixels at level i, and fn(i, j) is the pixel intensity at position (i, j) in the n-th counted in the n-th frame. In general, the value of Yd is frame. This feature shows the frame activities in a shot. In higher in the highlight shots. general, highlight shots have higher Ds values than non- 7) Sound Energy highlight shots. The sound energy Es is defined as 4) Shot Motion To measure the motion activity, we first partition a M frame into square blocks of the size K-by-K pixels, and  S (n)  S (n) (9) perform motion estimation to find the motion vector of each Es  n 1 block [20]. The shot motion Ms is defined as the average M magnitude of motion vectors by where S(n) is the signal strength of the n-th audio sample in L1 W 1 H 1 a shot, M is the total number of audio samples in the  MVx,n (i, j)2  MVy,n (i, j)2 1 Ms  (4) duration of the corresponding shot. In the highlight shot, the (L 1) W  H / K 2 n1 i0 j 0 sound energy is usually higher than those in non-highlight shots. 1079
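The per-shot visual features of Eqs. (3), (4), and (7) reduce to averages of absolute frame differences and of motion-vector magnitudes. The sketch below assumes the luminance frames and the block motion vectors (produced by an external block-matching step such as the SAE search of Eq. (5)) are already available; the function names are illustrative, not the authors' code.

```python
import numpy as np

def shot_frame_difference(frames):
    """Average absolute frame difference over a shot (cf. Eq. 3).
    `frames` is a sequence of equally sized 2-D luminance arrays."""
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(frames[1:] - frames[:-1])   # |f_n - f_{n-1}| for every pixel
    return diffs.mean()                        # averaged over pixels and frame pairs

def average_motion_magnitude(mv_x, mv_y):
    """Average motion-vector magnitude, used both for the shot motion Ms
    (cf. Eq. 4) and for the keyframe motion Mkey (cf. Eq. 7).
    `mv_x`, `mv_y` hold block-wise motion components, e.g. arrays of shape
    (num_frames, H//K, W//K) for a shot or (H//K, W//K) for a keyframe."""
    return np.sqrt(np.square(mv_x) + np.square(mv_y)).mean()
```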
    8) Sound Zero-crossingRate section, we briefly explain the basic idea about constructing We also adopt the zero-crossing rate (Zs) of the audio the SVM decision functions. signals as one of the input features, since it is a simple a) Linear SVM indicator of the audio frequency information. Experiments Given a training set (x1, y1), (x2, y2),…, (xi, yi), xn  Rn , yn 1, 1, n  1 i , where i is the total indicate that the zero-crossing rate becomes higher in highlight shots. The zero-crossing rate is defined as number of training data, each training data point xn is M associated with one of two classes characterized by a value  signS (i) signS (i  1) 1 fs Z s (n)  (10) yn = ±1. In the linear SVM theory, the decision function is 2M i 1 supposed to be a linear function and defined as f  x  wTx  b where fs is the audio sampling rate, and the sign function is defined by (13)  1 , if S (i)  0, where w, x  Rn , b  R , w is the weighting vector of  signS (i)   0 , if S (i)  0, (11) hyperplane coefficients, x is the data vector in space and b  1 , otherwise. is the bias. The decision function lies half way between two  hyperplanes which referred to as support hyperplanes. SVM is expected to find the linear function f(x) = 0 such that 9) Short-time Sound Energy separates the two classes of data. Fig. 4-(a) shows the Since the crowd sounds always last for 1 or 2 seconds, decision function that separates two classes of data. For and therefore the sound energy can not represent the crowd separable data, there are many possible decision functions. sounds for video shot with longer shot length. Thus, we The basic idea is to determine the margin that separates the select short-time sound energy (Est) as one of the input two hyperplanes and maximize the margin in order to find eigenvalues. The short-time sound energy is defined as the optimal decision function. As shown in Fig. 4-(b), two hyperplanes consist of data points which satisfy wT  x  b  1 and wT  x  b  1 respectively. For example,  S (n)  S p (n)  24000 p the data point x1 of the positive class (yn = +1) lead to a e( p)  n 1 (12) positive value and the data points x2 of the negative class (yn 24000 = -1) are negative. The perpendicular distance between the Est  max e(1), e(2), e(3), , e(m) 2 two hyperplanes is . In order to find the maximized w where Sp(n) is the signal strength of the n-th audio sample at p-th second in the video shot, e(p) is the sound energy of margin and optimal hyperplanes, we must find the smallest the p-th second in the video shot, and m is the time of the distance w . Therefore, the data points have to satisfy the video shot. condition as one set of inequalities y j  wT x j  b  1, for j  1, 2, 3, , i III. HIGHLIGHT SHOT CLASSIFICATION METHOD (14) A. Kernel SVM Training and Analysis System In this work, the Kernel SVM is adopted to analyze the The problem for solving the w and b can be reduced to the input videos and generate the highlight shots. In the training following optimization problem mode, the selected shots for a specific sport type are fed into the system to train for the classification hyperplanes 1 2 Minimize w and we apply genetic algorithm (GA) to select features and 2 (15) subject to y j  wT x j  b  1, for j  1 i optimize kernel parameters for support vector machines. In the analysis mode, the system just loads these pre-stored parameters and generates highlight shots for the input sport video. 
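The audio features of Eqs. (9) through (12) can be computed directly from the shot's audio samples. The following sketch is one plausible reading of those equations, assuming a mono sample array; interpreting Eq. (9) as a mean squared amplitude and taking 24000 samples per second in Eq. (12) are assumptions that should be checked against the original formulation.

```python
import numpy as np

def sound_energy(samples):
    """Mean signal energy of a shot's audio samples (cf. Eq. 9)."""
    samples = np.asarray(samples, dtype=np.float64)
    return np.mean(samples ** 2)

def zero_crossing_rate(samples, sample_rate):
    """Zs = fs / (2M) * sum_i |sign(S(i)) - sign(S(i-1))|   (cf. Eqs. 10-11)."""
    signs = np.sign(np.asarray(samples, dtype=np.float64))
    crossings = np.abs(np.diff(signs)).sum()
    return sample_rate * crossings / (2.0 * len(samples))

def short_time_sound_energy(samples, samples_per_second=24000):
    """Maximum per-second energy over the shot (cf. Eq. 12)."""
    samples = np.asarray(samples, dtype=np.float64)
    n_seconds = len(samples) // samples_per_second
    energies = [np.mean(np.square(samples[p * samples_per_second:
                                          (p + 1) * samples_per_second]))
                for p in range(n_seconds)]
    return max(energies) if energies else 0.0
```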
We will explain the process in details in the This is a quadratic programming (QP) problem and can be following. solved by the following Lagrange function [24]: 1) Support Vector Machines α y w x  b  1, for j  1i SVM is a machine learning technique first suggests by 1 i Vapnik [22] and has widespread applications in L(w, b,  )  wT w  j j T j (16) classification, pattern recognition and bioinformatics. 2 j 1 Typical concepts of SVM are for solving binary classification problems [23]. The data may be where α j denotes the Lagrange multiplier. The w, b, and multidimensional and form several disjoint regions in the space. The feature of SVM is to find the decision functions α j at optimum to minimize (16) are obtained. Then, that optimally separate the data into two classes. In this following the Karush Kuhn-Tucker (KKT) conditions to 1080
    simplify this optimizationproblem. Since the optimization where C is the penalty parameter. This optimization problem have to satisfy the KKT conditions defined by problem also can be solved by the Lagrange function and transformed to dual problem as follows i w α y x j 1 j j j Maximize L( )  i  j  1 y i j yk  j k x j x k T j 1 2 j ,k 1 i α y (22) 0 (17) i j 0 j j Subject to  j1  j y j  0 ,0   j  C, j  1i αj 0 , for j  1i    j y j w x j  b  1  0 , for j  1i T   Similarity, we can solve this dual problem and find the optimal w and b. Substitute (17) into (16), then the Lagrange function is c) Non-linear SVM transformed to dual problem as follows The SVM can extended to the case of nonlinear conditions by projecting the original data sets to a higher dimensional i i  y y α α x x 1 space referred to as the feature space via a mapping MaximizeL( )  αj  j k j k T j k function φ which also called kernel function. The nonlinear 2 j 1 j ,k 1 (18) decision function is obtained by formulating the linear i Subject to α yj1 j j  0, α j  0, j  1i classification problem in the feature space. In nonlinear SVM, the inner products xT xk in (22) can be replaced by j the kernel function k(x j , xk )  φ(x j )T φ(xk ) . Therefore, the Solving for this dual problem and find the Lagrange dual problem in (22) can be replaced by the following multiplier α j . Substitute α j into (19) to find the optimal w equation and b. i i i Maximize L( )   j  1  y y   k (x , x ) j k j k j k  j 1 2 j ,k 1 w α j y j x j , for j  1i (23) i  j1 (19) Subject to jyj  0 ,0   j  C, j  1i 1  sv  1  N b  N sv    wT x S  y  s1  s    j1 According to (19), we also can solve above dual problem and find optimal w and b. The classification is then where xS are data points which Lagrange multiplier α j 0, ys obtained by the sign of is the class of xS and Nsv is the number of xS .  i  b) Linear Generalized SVM In the case where the data is not linearly separable as sign    j 0 y j j k(x, x j )  b     (24) shown in Fig. 4-(c), the optimization problem in (15) will be infeasible. The concepts of linear SVM can also be extended to the linearly nonseparable case. Rewrite (14) as d) Types of Kernels (20) by introducing a non-negative slack variable  . The most commonly used kernel functions are multivariate Gaussian radial basis function (MGRBF),  y j wTx j  b  1   j , for j  1i  (20) Gaussian radial basis function (GRBF), polynomial function and sigmoid function. MGRBF: The above inequality constraints are minimized through a penalized object function. Then the optimization problem n x jm xkm2 can be written as   2 m 2 (25) k(x j , x k )  φ(x j ) T φ(x k )  e m1  i   1 Maximize L( )  w 2  C  j  where  m  , x jm , xkm  , x j , x k  n , xjm is m-th 2    j 1  (21)  Subject to y j w T x j  b  1   j , for j  1i  element of xj, xkm is m-th element of xk,  m is the adjustable parameter of the Gaussian kernel, x j , xk are input data. 1081
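The paper does not give an implementation of the kernel SVM, but the multivariate Gaussian RBF kernel of Eq. (25) can be plugged into an off-the-shelf soft-margin SVM solver. The sketch below uses scikit-learn's support for callable kernels as a stand-in; the per-dimension bandwidths `sigmas` and the penalty `C` are placeholders for the values found by the GA search of Section III, and the variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def mgrbf_kernel(sigmas):
    """Multivariate Gaussian RBF kernel (cf. Eq. 25) with one bandwidth per
    feature dimension; returns a callable usable as SVC(kernel=...)."""
    inv_two_sigma_sq = 1.0 / (2.0 * np.square(np.asarray(sigmas, dtype=np.float64)))

    def kernel(X, Y):
        # per-dimension weighted squared distances between all sample pairs
        diff = X[:, None, :] - Y[None, :, :]
        return np.exp(-np.sum(np.square(diff) * inv_two_sigma_sq, axis=2))

    return kernel

# Hypothetical usage: X holds the per-shot feature sets V_i of Eq. (1) and
# y in {+1, -1} marks highlight / non-highlight shots.
# clf = SVC(C=10.0, kernel=mgrbf_kernel(np.full(X.shape[1], 5.0)))
# clf.fit(X, y)
# predictions = clf.predict(X_new)
```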
    GRBF: 1 n n where g1 ~ gSs , gC ~ gCc and g1 ~ g f f are parameters of n S f x j xk 2 kernel, penalty factor and features respectively. The ns, nc,  and nf are the bits to represent the above parameters. The 2 2 (26) k(x j , xk )  φ(x j )T φ(xk )  e parameters defined at the start process are bits of parameters and features, number of generations, crossover and mutation rate, and limitations of parameters. The next where   , x j , xk n ,  is the adjustable parameter of step is output parameters and features to Kernel SVM for the Gaussian kernel, x j , xk are input data. training. In the selection step, we keep two chromosomes Polynomial function: with maximum objective value (Of) obtained by (29) for next generation. These chromosomes will not change in the following crossover and mutation steps. Fig. 10 shows the k(x j , xk )  φ(x j )T φ(xk )  (1  xT xk )d j (27) crossover and mutation operations. As shown in Fig. 10-(a), two new offspring are obtained by randomly exchanging where d is positive integer, x j , xk n , d is the adjustable genes between two chromosomes using one point crossover. After crossover operation, as shown in Fig. 10-(b), the parameter of the polynomial kernel, x j , xk are input data. binary code genes are changed occasionally from 0 to 1 or 2) Kernel SVM Input Data Structure vice versa called mutation operation. Finally, a new In sport videos, a highlight event usually consists of generation is obtained and output parameters and features several consecutive shots. Fig. 5 shows an example of a again. These processes will be terminated until the home run in a baseball game. It includes three consecutive predefined numbers of generations satisfy. shots: pitching and hitting, ball flying, and the base running. In this paper, we adopt precision and recall rates to Unlike many other highlight extraction algorithms that have evaluate the performance of our system. The precision (P) to predefine the highlight events with specific constituting and recall (R) rates are defined as follows shots, we simply propose to collect the feature sets of several consecutive shots together as the input eigenvalues SNc SNc of the Kernel SVM. P ,R (28) SNe SNt 3) Kernel SVM Training Mode For the training mode, the data are processed in two steps: a) initialization, b) kernel parameters optimization where SNc, SNe, and SNt are the number of correctly and feature selection. extracted highlight shots, extracted highlight shots, and actual highlight shots repectively. a) Initialization of the Input Data In the objective function calculation step, we calculate The initialization process of the training mode is shown the objective value (Of) to evaluate the kernel parameters in Fig. 6. The video is partitioned into shots and divided and select features generated by GA. The objective value into two sets: highlight shots and non-highlight shots. The calculated by following equation eigenvalues of consecutive shots are collected as a data set. All data sets are composed as the input data vector. Then Of  0.5 P  0.5 R each eigenvalue is normalized into the range of [0, 100]. (29) The order of the data set in the input data vector is randomized. These steps will terminate when the predefined number of b) Kernel Parameters Optimization and Feature generations have achieved. And finally we select the kernel selection parameters and features which have maximum objective value. 
Since the parameters in kernel functions are adjustable, 4) Kernel SVM Analysis Mode and in order to improve the classification accuracy, these In the analysis mode, the user has to select a sport type. kernel parameters should be properly set. In this process, The Kernel SVM system directly loads the pre-trained we adopt the GA-based feature selection and parameters classification function corresponding to the sport type. The optimization method proposed by Huang [25] to select classification function is defined as (30), where Cx is the features and optimize kernel parameters for support vector classes of the video shots. Cx = +1 represents the shots machines. Fig. 7 shows the flowchart of the feature belong to highlight shot, and Cx = -1 are non-highlight shot. selection and parameters optimization method. This process can be performed very quickly, since these As shown in Fig. 7, we apply the GA to generate kernel kernel parameters and features do not need to be trained parameters and select features to train the hyperplanes of again. the Kernel SVM. The processes to generate kernel parameters and select features utilize the GA are shown in   i  Fig. 8. The GA start process include generate chromosome  i     1 , if  y j  j k(x, x j )  b   0   randomly and parameters setup. The chromosome is represented as binary coding format as shown in Fig. 9,   Cx  sign y j  j k(x, x j )  b       j 1  1 , if  y  k(x, x )  b   0  i  (30)  j 1      j 1 j j j    1082
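A minimal sketch of the GA fitness evaluation follows: the precision and recall of Eq. (28), the objective value of Eq. (29), and one plausible decoding of the binary chromosome of Fig. 9 into a kernel parameter, a penalty factor, and a feature mask. The bit widths and parameter ranges are assumptions, since the paper does not specify them.

```python
import numpy as np

def precision_recall(extracted, actual):
    """P = SNc / SNe, R = SNc / SNt (cf. Eq. 28); inputs are sets of shot indices."""
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

def objective_value(extracted, actual):
    """GA fitness: Of = 0.5 * P + 0.5 * R   (cf. Eq. 29)."""
    p, r = precision_recall(extracted, actual)
    return 0.5 * p + 0.5 * r

def decode_chromosome(bits, n_features, sigma_range=(0.1, 50.0), c_range=(1.0, 1000.0)):
    """Illustrative decoding of a binary chromosome (cf. Fig. 9) into
    (sigma, C, feature mask); the 8-bit fields and ranges are assumptions."""
    n_param_bits = 8

    def to_real(field, lo, hi):
        return lo + int("".join(map(str, field)), 2) / (2 ** len(field) - 1) * (hi - lo)

    sigma = to_real(bits[:n_param_bits], *sigma_range)
    c = to_real(bits[n_param_bits:2 * n_param_bits], *c_range)
    mask = np.array(bits[2 * n_param_bits:2 * n_param_bits + n_features], dtype=bool)
    return sigma, c, mask
```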
Figure 4. Linear decision function separating two classes: (a) decision function separating the positive class from the negative class; (b) the margin that separates the two hyperplanes; (c) the case of linearly non-separable data sets.
Figure 5. A home run event in a baseball game: (a) pitching and hitting; (b) ball flying; (c) base running.
Figure 6. The initialization of training data.
Figure 7. The flowchart of the feature selection and parameters optimization method.
Figure 8. Genetic algorithm to generate parameters and features.
Figure 9. Chromosome.
Figure 10. (a) Crossover operation; (b) Mutation operation.
IV. EXPERIMENTAL RESULTS

The experimental setups for the different sport types are listed in Table I. For the baseball game, we take hits, home runs, strikeouts, steals, and replays as highlight events. For the basketball game, the highlight events are dunks, three-point shots, jump shots, bank shots, and replays. For the soccer game, we set the highlight events as goals, long shots, close-range shots, free kicks, corner kicks, breakthroughs, and replays.

In this paper, we adopt three kernel functions: the multivariate Gaussian radial basis function, the Gaussian radial basis function, and the polynomial function. We then evaluate the highlight-shot extraction performance of these kernel functions. Table II shows the experimental results of the baseball game NYY vs. NYM, Table III shows the experimental results of the NBA game Celtics vs. Rockets, and Table IV shows the experimental results of the soccer game Arsenal vs. Hotspur. According to the experimental results, the SVM with the MGRBF kernel function has the best overall performance among these types of sport videos.

TABLE I. THE EXPERIMENTAL SETUP FOR DIFFERENT SPORT TYPES

  Sport type   Sequence              Total length   Number of shots
  Baseball     NYY vs. NYM           146 minutes    1097
  Basketball   Celtics vs. Rockets   32 minutes     180
  Soccer       Arsenal vs. Hotspur   48 minutes     280

TABLE II. THE EXPERIMENTAL RESULTS OF THE BASEBALL GAME (NYY VS. NYM)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   87%     89%    77%
  Recall      99%     81%    91%

TABLE III. THE EXPERIMENTAL RESULTS OF THE BASKETBALL GAME (CELTICS VS. ROCKETS)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   100%    86%    93%
  Recall      93%     100%   87%

TABLE IV. THE EXPERIMENTAL RESULTS OF THE SOCCER GAME (ARSENAL VS. HOTSPUR)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   100%    76%    100%
  Recall      88%     96%    73%

V. CONCLUSION

A Kernel SVM can be trained to classify shots by exploiting the information of a unified set of basic features. Experimental results show that the SVM with the multivariate Gaussian radial basis kernel achieves an average of 96% precision and 93% recall.

REFERENCES
[1] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Highlights extraction from sports video based on an audio-visual marker detection framework', Proc. IEEE ICME, July 2005, pp. 29-32.
[2] X. Tong, L. Duan, H. Lu, C. Xu, Q. Tian, and J. S. Jin, 'A mid-level visual concept generation framework for sports analysis', Proc. IEEE ICME, July 2005, pp. 646-649.
[3] A. Hanjalic, 'Multimodal approach to measuring excitement in video', Proc. IEEE ICME, July 2003, pp. 289-292.
[4] A. Hanjalic, 'Generic approach to highlights extraction from a sport video', Proc. IEEE ICIP, Sept. 2003, pp. I-1-4.
[5] L. Y. Duan, M. Xu, T. S. Chua, Q. Tian, and C. S. Xu, 'A mid-level representation framework for semantic sports video analysis', Proc. ACM Multimedia, Nov. 2003, pp. 33-44.
[6] Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, 'Integrated image and speech analysis for content-based video indexing', Proc. IEEE ICMCS, May 1996, pp. 306-313.
[7] K. Wan and C. Xu, 'Efficient multimodal features for automatic soccer highlight generation', Proc. IEEE ICPR, Aug. 2004, pp. 973-976.
[8] Q. Huang, J. Hu, W. Hu, T. Wang, H. Bai, and Y. Zhang, 'A reliable logo and replay detector for sports video', Proc. IEEE ICME, July 2007, pp. 1695-1698.
[9] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala, 'Soccer highlights detection and recognition using HMMs', Proc. IEEE ICME, Aug. 2002, pp. 825-828.
[10] G. Xu, Y. F. Ma, H. J. Zhang, and S. Yang, 'A HMM based semantic analysis framework for sports game event detection', Proc. IEEE ICIP, Sept. 2003, pp. I-25-8.
[11] J. Wang, C. Xu, E. Chng, and Q. Tian, 'Sports highlight detection from keyword sequences using HMM', Proc. IEEE ICME, June 2004, pp. 599-602.
[12] P. Chang, M. Han, and Y. Gong, 'Extract highlights from baseball game video with hidden Markov models', Proc. IEEE ICIP, Sept. 2002, pp. 609-612.
[13] N. H. Bach, K. Shinoda, and S. Furui, 'Robust highlight extraction using multi-stream hidden Markov models for baseball video', Proc. IEEE ICIP, Sept. 2005, pp. III-173-6.
[14] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework', Proc. IEEE ICME, July 2003, pp. III-401-4.
[15] B. Zhang, W. Chen, W. Dou, Y. J. Zhang, and L. Chen, 'Content-based table tennis games highlight detection utilizing audiovisual clues', Proc. IEEE ICIG, Aug. 2007, pp. 833-838.
[16] C. C. Cheng and C. T. Hsu, 'Fusion of audio and motion information on HMM-based highlight extraction for baseball games', IEEE Trans. Multimedia, pp. 585-599, June 2006.
[17] L. C. Chang, Y. S. Chen, R. W. Liou, C. H. Kuo, C. H. Yeh, and B. D. Liu, 'A real time and low cost hardware architecture for video abstraction system', Proc. IEEE ISCAS, May 2007, pp. 773-776.
[18] ISO/IEC JTC1/SC29/WG11 N6881, 'MPEG-7 Requirements Document V.18', January 2005.
[19] ISO/IEC JTC1/SC29/WG11, 'MPEG-7 Overview (version 10)', October 2004.
[20] C. H. Kuo, M. Shen, and C.-C. Jay Kuo, 'Fast motion search with efficient inter-prediction mode decision for H.264', Journal of Visual Communication and Image Representation, pp. 217-242, 2006.
[21] I. E. G. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, 2001.
[24] I. B. Vapnyarskii, 'Lagrange multipliers', in M. Hazewinkel (ed.), Encyclopedia of Mathematics, Kluwer Academic Publishers, 2001, ISBN 978-1556080104.
[25] C. L. Huang and C. J. Wei, 'GA-based feature selection and parameters optimization for support vector machine', Expert Systems with Applications, vol. 31, pp. 231-240, 2006.
    IMAGE INPAINTING USINGSTRUCTURE-GUIDED PRIORITY BELIEF PROPAGATION AND LABEL TRANSFORMATIONS Heng-Feng Hsin (辛恆豐), Jin-Jang Leou (柳金章), Hsuan-Ying Chen (陳軒盈) Department of Computer Science and Information Engineering National Chung Cheng University Chiayi, Taiwan 621, Republic of China E-mail: {hhf96m, jjleou, chenhy}@cs.ccu.edu.tw ABSTRACT problem with isophote constraint. They estimate the smoothness value given by the best chromosome of GA, In this study, an image inpainting approach using and project this value in the isophotes direction. Chan structure-guided priority belief propagation (BP) and and Shen [3] proposed a new diffusion method, called label transformations is proposed. The proposed curvature-driven diffusions (CDD), as compared to approach contains five stages, namely, Markov random other diffusion models. PDE-based approaches are field (MRF) node determination, structure map suitable for thin and elongated missing parts in an image. generation, label set enlargement by label For large and textured missing regions, the processed transformations, image inpainting by priority-BP results of PDE-based approaches are usually optimization, and overlapped region composition. Based oversmooth (i.e., blurring). on experimental results obtained in this study, as Exemplar-based approaches try to fill missing compared with three comparison approaches, the regions in an image by simply copying some available proposed approach provides the better image inpainting part in the image. Nie et al. [4] improved Criminisi et results. al.’s approach [5] by changing the filling order and overcame the problem that gradients of some pixels on Keywords Image Inpainting; Priority Brief Propagation; the source region contour are zeros. A major Label Transformation; Markov Random Field (MRF); shortcoming of exemplar-based approaches is the Structure Map. greedy way of filling an image, resulting in visual inconsistencies. To cope with this problem, Sun et al. [6] 1. INTRODUCTION proposed a new approach. However, in their approach, user intervention is required to specify the curves on Image inpainting is to remove unwanted objects or which the most salient missing structures reside. Jia and recover damaged parts in an image, which can be Tang [7] used image segmentation to abstract image employed in various applications, such as repairing structures. Note that natural image segmentation is a aged images and multimedia editing. Image inpainting difficult task. To cope with this problem, Komodaskis approaches can be classified into three categories, and Tziritas [8] proposed a new exemplar-based namely, statistical-based, partial differential equation approach, which treats image inpainting as a discrete (PDE) based, and exemplar-based approaches. global optimization problem. Statistical-based approaches are usually used for texture synthesis and suitable for highly-stochastic parts in an 2. PROPOSED APPROACH image. However, statistical-based approaches are hard to rebuild structure parts in an image. The proposed approach contains five stages, namely, PDE-based approaches try to fill target regions of Markov random field (MRF) node determination, an image through a diffusion process, i.e., diffuse structure map generation, label set enlargement by label available data from the source region boundary towards transformations, image inpainting by priority-BP the interior of the target region by PDE, which is optimization, and overlapped region composition. typically nonlinear. Bertalmio et al. 
[1] proposed a PDE-based image inpainting approach, which finds out 2.1. MRF node determination isophote directions and propagates image Laplacians to the target region along these directions. Kim et al. [2] As shown in Fig. 1 [8], an image I0 contains a target used genetic algorithms (GA) to solve the inpainting region T and a source region S with S=I0-T. Image 1085
    inapinting is tofill T in a visually plausible way by Vpq (xp , xq ) simply pasting various patches from S. In this study, image inpainting is treated as a discrete optimization = ∑Z(x dp, dq∈Ro p + dp, xq + dq)(I0 (xp + dp) − I0 (xq + dq))2 , (4) problem with a well-defined energy function. Here, where Ro is the overlapped region between two labels, xp discrete MRFs are employed. and xq. To define the nodes of an MRF, the image lattice is used with the horizontal and vertical spacings of gapx 2.3. Label set enlargement and gapy (pixels), respectively. For each lattice point, if its neighborhood of size (2gapx + 1) × (2gapy + 1) overlaps To completely use label informations in the original the target region, it will be an MRF node p. Each label image, three types of label transformations are used to of the label set L of an MRF consists of enlarge the label set. The first type of label (2gapx+1) × (2gapy+1) pixels from the source region S. transformation contains two different directions: the Based on the image lattice, each MRF node may have 2, vertical and horizontal flippings, which can find out 3, or 4 neighboring MRF nodes. labels (patches) that do not exist in the original source Assigning a label to an MRF node is equivalent to region, but have symmetric properties in the horizontal copying the label (patch) to the MRF node. To evaluate or vertical direction. The second type of label the goodness of a label (patch) for an MRF node, the transformation contains three different rotations: left energy (cost) function of an MRF will be defined, 90° rotation, right 90° rotation, and 180° rotation, which which includes the cost of the observed region of an can find out rotated labels (patches) of the above- MRF node. mentioned three degrees. The third type of label We will assign a label x p ∈ L to each MRF node p ˆ transformation is scaling. To keep the original size of horizontal and vertical spacings gapx and gapy, the so that the total energy F (x) of the MRFs is minimized. ˆ original image is directly up/down scaled so that new Here, labels (patches) can be obtained in the original image F ( x) = ∑ V p ( x p ) + ˆ ˆ ∑V pq ˆ ˆ ( x p , xq ), (1) with the same horizontal and vertical spacings. Here, p∈v ( p , q )∈ε both the up-sampling (double-resolution by bilinear where V p ( x p ) (called the label cost hereafter) denotes interpolation) image and the down-sampling (half- the single node potential for placing label xp over MRF resolution) image are used to generate extra candidate node p, i.e., how the label xp agrees with the source labels (patches). region around p. Vpq(xp,xq) represents the pairwise potential measuring how well node p agrees with the 2.4. Image inpainting by priority-BP optimization overlapped region ε between p and its neighboring node q when pasting xp at p and pasting xq at q. Belief propagation (BP) [10] treats an optimization problem by iteratively solving a finite set of equations 2.2. Structure map generation until the optimal solution is found. Ordinary BP is computationally expensive. For an MRF graph, each In this study, the Canny edge detector [9] is used to node sends “message” to all its neighboring nodes, extract the edge map of an image, which preserves the whereas the node receives messages from all its important structural properties of the source region in neighboring nodes. This process is iterated until all the the image. A binary mask E(p) to used to build the messages do not change any more. 
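As an illustration of the MRF node determination described above, the sketch below walks an image lattice with spacings gapx and gapy and keeps a lattice point as an MRF node whenever its (2*gapx+1) x (2*gapy+1) neighbourhood overlaps the target region. The lattice starting offset and the mask convention are our assumptions, not specified in the paper.

```python
import numpy as np

def determine_mrf_nodes(target_mask, gap_x, gap_y):
    """Place MRF nodes on an image lattice with spacings (gap_x, gap_y).
    `target_mask` is a boolean array that is True inside the target region T.
    A lattice point becomes a node if its (2*gap_x+1) x (2*gap_y+1)
    neighbourhood overlaps T."""
    height, width = target_mask.shape
    nodes = []
    for y in range(gap_y, height - gap_y, gap_y):
        for x in range(gap_x, width - gap_x, gap_x):
            patch = target_mask[y - gap_y:y + gap_y + 1,
                                x - gap_x:x + gap_x + 1]
            if patch.any():            # neighbourhood touches the target region
                nodes.append((y, x))
    return nodes
```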
structure map of the image, which is just the edge map The set of messages sent from node p to its with morphological dilation. If E(p) is non-zero, pixel p neighboring node q is denoted by {m pq ( xq )} . This xq ∈L is belonging to the structure part. Then, E(p) is used to message expresses the opinion of node p about formulate the structure weighting function Z(p,q): assigning label xq to node q. The message formulation is ⎧ 1, if E ( p) = 0 and E (q ) = 0, defined as: Z ( p, q ) = ⎨ (2) ⎩w, otherwise, ⎧ ⎫ where w is “the structure weighting coefficient.” The mpq (xq ) = min⎨Vpq (x p , xq ) +Vp (x p ) + ∑mrp (x p )⎬. (5) x p ∈L label cost Vp(xp) is defined as (the sum of weighted ⎩ r:r ≠q,( r , p)∈ε ⎭ squared differences, SWSD): That is, if node p wants to send message mpq to node q, Vp (xp ) node p must traverse its own label set and find the best label to support node q when label xq is assigning to = ∑[Z( p + dp,x +dp)M ( p + dp)(I ( p + dp) − I (x +dp)) , ] dp∈[− gapx , gapx ]× − gapy , gapy p 0 0 p 2 (3) node q. Each message is based on two factors: (1) the where M(p) denotes a binary mask, which is non-zero if compatibility between labels xp and xq, and (2) the pixel p lies inside the source region S. Thus, for an likelihood of assigning label xp to node p, which also MRF node p, if its neighborhood of size (2gapx+1) × contains two factors: (1) the label cost Vp(xp), and (2) (2gapy+1) does not intersect S, Vp(xp)=0. Vpq(xp,xq) for the opinion of its neighboring node about xp measured pasting labels xp and xq over p and q, respectively, can by the third term in Eq. (5). be similarly defined as: 1086
    Messages are iterativelyupdated by Eq. (5) until MRF edge can be bidirectionally traversed. In the they converge. Then, a set of beliefs, which represents forward pass, all the nodes are visited by the priority the probability of assigning label xp to p, is computed order, an MRF node having the highest priority will for each MRF node p as: pass message to its neighboring MRF nodes having the bp (x p ) = −Vp (x p ) − ∑m rp (x p ). (6) lower priorities, and the MRF node having the highest r:(r , p)∈ε priority will be marked as “committed,” which will not The second term in Eq. (6) means that to calculate a be visited again in this forward pass. For label pruning, node’s belief, it is required to gather all messages from the MRF node having the highest priority can transmit all its neighboring nodes. When the beliefs of all MRF its “cheap” message to all its neighboring MRF nodes nodes have been calculated, each node p is assigned the having not been committed. The priority of each best label having the maximum belief: neighboring MRF node having received a new message x p = arg maxbp ( x p ). ˆ (7) is updated. The above process is iterated until there are x p∈L no uncommitted MRF nodes. On the other hand, the To reduce the computational cost of BP, backward pass is performed in the reverse order of the Komodakis and Tziritas [8] proposed “priority-BP” to forward pass. Note that label pruning is not performed control the message passing order of MRF nodes and in the backward pass. “dynamic label pruning” to reduce the number of elements in the label set of each MRF node. In [8], the 2.5. Overlapped region composition priority of an MRF node p is related to the confidence of node p about the label should be assigned to it. The When the number of iterations reaches K, each MRF confidence depends on the current set of beliefs node p is assigned a label having maximum bp values. {bp(xp)} that has been calculated by BP. Here, the xp∈L All the MRF nodes are composed to produce the final priority of node p is designed as: image inpainting results, where label composition is 1 performed in a decreasing order of MRF node priorities. priority ( p ) = , Depending on whether the region contains a global {x p ∈ L : b p ( x p ) ≥ bconf } rel (8) structure or not, two strategies are used to compose each bp (xp ) = bp (xp ) − bp , rel max (9) overlapped region. If an overlapped region contains a rel global structure, graph cuts are used to seam it. where bp is the relative belief value and b p is the max Otherwise, each pixel value of the overlapped region is maximum belief among all labels in the label set of computed by weighted sum of two corresponding pixel node p. Here, the confidence of an MRF node is the values, where the weighting coefficient is proportional number of candidate labels whose relative belief values to the priority of an MRF node. exceed a certain threshold bconf. On the other hand, to traverse MRF nodes, the 3. EXPERIMENTAL RESULTS number of candidate labels for an MRF node can be pruned dynamically. To commit a node p, all labels with In this study, 21 test images are used to evaluate the relative beliefs being less than a threshold bprune for performance of the proposed approach. Three node p will not be considered as its candidate labels. comparison inpainting approaches, namely, the PDE- The remaining labels are called “active labels” for node based approach [1], the exemplar-based approach [5], p. 
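The message update of Eq. (5), the belief of Eq. (6), and the node priority of Eqs. (8)-(9) can be written compactly once the potentials are stored as arrays. The following sketch assumes each node keeps its label-cost vector, the pairwise potential matrices to its neighbours, and the incoming messages; it is a schematic fragment of priority-BP under those assumptions, not the authors' code.

```python
import numpy as np

def update_message(V_pq, V_p, incoming_to_p, exclude_q):
    """One message update m_pq(x_q) following Eq. (5).
    V_pq: (n_labels_p, n_labels_q) pairwise potentials between p and q.
    V_p:  (n_labels_p,) label costs at node p.
    incoming_to_p: dict mapping neighbour id r -> message vector m_rp.
    exclude_q: id of the receiving neighbour q (its message is left out)."""
    total = V_p.astype(np.float64).copy()
    for r, m_rp in incoming_to_p.items():
        if r != exclude_q:
            total = total + m_rp
    # minimise over x_p for every candidate label x_q of node q
    return np.min(V_pq + total[:, None], axis=0)

def node_beliefs(V_p, incoming_messages):
    """b_p(x_p) = -V_p(x_p) - sum_r m_rp(x_p)   (cf. Eq. 6)."""
    total = np.zeros_like(V_p, dtype=np.float64)
    for m_rp in incoming_messages:
        total += m_rp
    return -V_p - total

def node_priority(beliefs, b_conf):
    """priority(p) = 1 / |{x_p : b_rel(x_p) >= b_conf}|   (cf. Eqs. 8-9),
    where b_rel(x_p) = b_p(x_p) - max_x b_p(x) and b_conf is negative."""
    relative = beliefs - beliefs.max()
    confident_labels = int(np.sum(relative >= b_conf))
    return 1.0 / max(confident_labels, 1)
```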
In this study, the label set of an MRF node is sorted and the ordinary priority-BP-based approach [8], are by belief values, at least Lmin active labels are selected implemented in this study. Some image inpainting for an MRF node, and a similarity measure is used to results by the three comparison approaches and the check the remaining labels. If the similarity between proposed approach are shown in Figs. 2-6. two remaining labels is greater than a threshold Sdiff, one In Fig. 2, the image size is 256 × 170, gapx=9, of the two remaining labels will be pruned. This process gapy=9, bconf=-180000, bprune=-360000, Lmax=30, Lmin=5, will be iterated until the relative belief value of any and w=10. Blurring artifacts appear in Fig. 2(c). In Fig. remaining label is smaller than bprune or the number of 2(d), because the isophotes direction is too complex to active labels reaches a user-specified parameter Lmax. guide the inpainting process, the inpainting results are To apply priority-BP to image inpainting, the labels not good. Compared with the ordinary priority-BP- from the source region of an original image and the based approach (Fig. 2(e)), the proposed approach (Fig. labels by applying three types of label transformations 2(f)) can keep the global structure in the image by are obtained so that each MRF node maintains its label guiding the message passing process by the structure set. Then, the number of priority-BP iterations, K, is set, map. In Fig. 3, the image size is 206 × 308, gapx=5, the priorities of all MRF nodes are initialized only by gapy=5, bconf=-40000, bprune=-80000, Lmax=20, Lmin=3, their Vp(xp) values, and message passing is performed. and w=10. In Fig. 3(c), blurring artifacts appear in the Each priority-BP iteration consists of the forward and upper part of the image. In Fig. 3(d), the stone bridge backward passes. Message passing and dynamic label can not be well reconstructed, because there is no pruning are performed in the forward pass, and each suitable patch in the image. Furthermore, error 1087
    propagation appears inthe lake. In Fig. 3(e), because the priority of the bridge structure is low, the bridge structure is broken. In the proposed approach, the weighting coefficient is used to raise the priority of the bridge structure, resulting in the better inpainting results. In Fig. 4, the image size is 208×278, gapx=7, gapy=7, bconf= -150000, bprune=-300000, Lmax=30, Lmin=5 and w=2. For the image, the proposed approach can reconstruct the tower structure by label transformations, (a) (b) whereas the three comparison approaches contain error Fig. 1. (a) Nodes and edges of an MRF; (b) labels of an propagations, due to lack of suitable labels. In Fig. 5, MRF for image inpainting [8]. the image size is 287×216, gapx=10, gapy=10, bconf= -200000, bprune=-400000, Lmax=50, Lmin=5 and w=15. In Fig. 5(f), the proposed approach uses both the original labels and the flipped labels to reconstruct the region to be inpainted, resulting in the better inpainting image. In Fig. 6, the image size is 257 × 271, gapx=6, gapy=6, bconf=-200000, bprune=-400000, Lmax=50, Lmin=10 and (a) (b) w=5. Because the building in the original image has the symmetric property, label transformations can be employed in this case. Blurring artifacts appear in Fig. 6(c). In Fig. 6(d), the isophote direction is too complex so that the structures interfere each other. In Fig. 6(e), the inpainting results are poor, due to lack of valid labels. In Fig. 6(f), for the lower part of the image, the window structure is partially broken due to the building (c) (d) is not totally symmetric so that error propagation appears in some inpainting regions of the image. However, the inpainting image by the proposed approach is better than that by the three comparison methods. 4. CONCLUDING REMARKS (e) (f) Fig. 2. (a) The original image, “Lantern;” (b) the In this study, an image inpainting approach using masked image; (c)-(f) the image inpainting results by structure-guided priority BP and label transformations is the PDE-based approach [1], the exemplar-based proposed. In the proposed approach, to reconstruct the approach [5], the ordinary priority-BP-based approach global structures in an image, the structure map of the [8], and the proposed approach, respectively. image is generated, which guides the inpainting process by priority-BP optimization. Furthermore, three types of label transformations are employed to get more usable labels (patches) for inpainting. Based on the experimental results obtained in this study, as compared with three comparison approaches, the proposed approach provides the better image inpainting results. ACKNOWLEDGEMENT This work was supported in part by National Science Council, Taiwan, Republic of China under Grants NSC (a) (b) 96-2221-E-194-033-MY3 and NSC 98-2221-E-194- Fig. 3. (a) The original image, “Bungee jumping;” (b) 034-MY3. the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (to be continued). 1088
Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (continued).
Fig. 4. (a) The original image, "Tower;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 5. (a) The original image, "Picture frame;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 6. (a) The original image, "Building;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

REFERENCES
[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of ACM Int. Conf. on Computer Graphics and Interactive Techniques, 2000, pp. 417-424.
[2] J. B. Kim and H. J. Kim, "Region removal and restoration using a genetic algorithm with isophote constraint," Pattern Recognition Letters, Vol. 24, pp. 1303-1316, 2003.
[3] T. Chan and J. Shen, "Non-texture inpaintings by curvature-driven diffusions," Journal of Visual Comm. Image Rep., Vol. 12, pp. 436-449, 2001.
[4] D. Nie, L. Ma, and S. Xiao, "Similarity based image inpainting method," in Proc. of 2006 Multi-Media Modeling Conf., 2006, pp. 4-6.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003, pp. 721-728.
[6] J. Sun, L. Yuan, J. Jia, and H. Y. Shum, "Image completion with structure propagation," in Proc. of 2005 ACM SIGGRAPH on Computer Graphics, 2005, pp. 861-868.
[7] J. Jia and C. K. Tang, "Image repairing: Robust image synthesis by adaptive and tensor voting," in Proc. of 2003 IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 643-650.
[8] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Trans. on Image Processing, Vol. 16, pp. 2649-2661, 2007.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679-698, 1986.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA, 1988.
    CONTENT-BASED BUILDING IMAGERETRIEVAL Wen-Chao Chen(陳文昭), Chi-Min Huang (黃啟銘), Shu-Kuo Sun (孫樹國), Zen Chen (陳稔) Dept. of Computer Science, National Chiao Tung University E-mail:[email protected], toothbrush.cs97g@ nctu.edu.tw, [email protected], [email protected] Abstract—This paper addresses an image retrieval query image, the content-based image retrieval system system which searches the most similar building for a extracts the most similar images from a database by captured building image from an image database based either spatial information, such as color, texture and on an image feature extraction and matching method. shape, or frequency domain features, e.g. wavelet-based The system then can provide relevant information to methods [3]. users, such as text or video information regarding the Existing content-based image retrieval algorithms query building in augmented reality setting. However, can be categorized into (a) image classification methods, the main challenge is the inevitable geometric and and (b) object identification methods. The first approach photometric transformations encountered when a retrieves images which belong to the same category as a handheld camera operates at a varying viewpoint under query image. Jing et al. proposed region-based image various lighting environments. To deal with these retrieval architecture [6]. An image is segmented into transformations, the system measures the similarity regions by the JSEG method and every region is between the MSER features of the captured image and described with color moment. Every region is clustered database images using the Zernike Moment (ZM) to form a codebook by Generalized Lloyd algorithm. information. This paper also presents algorithms based The similarity of two images is then measured by Earth on feature selection by multi-view information and the Mover’s Distance (EMD). Willamowski et al. presented DBSCAN clustering method to retrieve the most generic visual categorization method by using support relevant image from database efficiently. The vector machine as a classifier [7]. Affine invariant experimental results indicate that the proposed system descriptor represents an image as a vector quantization. has excellent performance in terms of the accuracy and In the second approach Wu and Yang [8] detected processing time under the above inevitable imaging and recognized street landmarks from database images variations. by combining salient region detection and segmentation techniques. Obdrzalek and Matas [9] developed a Keywords Image recognition and retrieval; Geometric building image recognition system based on local affine and photometric transformations; Zernike moments; features that allows retrieval of objects in images taken Image indexing; from distinct viewpoints. Discrete cosine transform 1. INTRODUCTION (DCT) is then applied to the local representations to reduce the memory usage. Zhang and Kosecka [10] also In recent years, there have been an increasing proposed a system to recognize building by a number of applications in Location-Based Service hierarchical approach. They first index the model views (LBS). LBS is an service that can be accessed from by localized color histograms. After converting to mobile devices to provide information based on the YCbCr color space and indexing with the hue value, current geographical position, e.g. GPS information. SIFT descriptors [4, 5] are then applied to refine However, GPS position is only available in open spaces recognition results. 
since the GPS signal is often blocked by high-rise Most of related image retrieval algorithms detect buildings or overhead bridges. Magnetic compasses are local features of a query image and then compare with also disturbed by nearby magnetic materials. Vision- detected features of database images by feature based localization is therefore an alternative approach to descriptors. However, the feature detectors such as provide both accurate and robust navigation information. Harris corner detector and the SIFT detector, which is This paper addresses the aspects of a building based on the difference of Gaussians (DOG), utilize a image retrieval system. The building recognition is a circular window to search for a possible location of a content-based image retrieval technique that can be feature. The image content in the circular window is not extended to applications of object recognition and web robust to affine deformations. Furthermore, the feature image search via a cloud service combined with points may not be reliable and may not appear consumer-oriented augmented reality tools. Given a 1091
simultaneously across multiple views with wide baselines. Matas et al. [13] presented the maximally stable extremal region (MSER) detector. Mikolajczyk and Schmid [3] proposed the Harris-Affine and Hessian-Affine detectors. The performance of the existing region detectors was evaluated in [14], in which the MSER detector and the Hessian-Affine detector were ranked as the two best. Chen and Sun [2] compare various popular feature descriptors, e.g. SIFT, PCA-SIFT, GLOH, and steerable filters, with the phase-based Zernike Moment (ZM) descriptor. The ZM descriptor performs significantly better than the other descriptors under geometric and photometric transformations such as blur, illumination, noise, scale, and JPEG compression. To describe a building image under geometric and photometric transformations, this paper utilizes the MSER method as the feature detector. The Zernike Moment is then applied to describe each detected feature region.

In order to index a large number of feature descriptors, the KD-tree [12] is a fundamental method that recursively partitions the space into two subspaces to construct a binary tree.

We also introduce a building image dataset, the NCTU-Bud dataset, containing high-resolution images of 22 buildings located on the National Chiao Tung University campus, with a total of 190 database images. We capture at least one face of each building from 5 distinct viewing directions. Query images are captured under 12 different lighting conditions for performance evaluation.

Fig. 1 shows the overall system block diagram. Section 2 briefly describes the background of the feature detector and descriptor. Section 3 presents a feature selection method to remove unstable features and a clustering method to obtain representative features. In Section 4 the image indexing and retrieval method is described. In Section 5 experimental results on the NCTU-Bud dataset are described. The performance on the publicly available ZuBud dataset is evaluated as well. Finally, Section 6 concludes the paper.

Figure 1. System block diagram.

2. FEATURE DETECTOR AND DESCRIPTOR

2.1. MSER feature region detector

Recently, a number of local feature detectors using a local elliptical window have been investigated. The MSER detector is evaluated as one of the best region detectors [5]. The advantage of the MSER detector is its ability to resist geometric transformations. The MSER detector also performs well when images contain homogeneous regions with distinctive boundaries [1]. Because building images contain regions with such boundaries, e.g. windows and color bricks, the MSER detector can extract these regions stably. After detecting elliptical regions with the MSER method, we filter out unstable regions such as those with oversized area, large aspect ratio, duplicated regions, and high area variation, as shown in Fig. 2.

Figure 2. (a) Initial MSER results. (b) Results after removing unstable MSER feature regions.

2.2. Zernike Moment feature region descriptor

Once the feature regions are detected, every region is described as a feature vector for similarity measurement. This paper presents a method which applies the Zernike Moment (ZM) as the feature descriptor [2]. Zernike moments (ZMs) have been used in object recognition regardless of variations in position, size, and orientation. Essentially, Zernike moments are an extension of the geometric moments obtained by replacing the conventional transform kernel $x^m y^n$ with orthogonal Zernike polynomials.

The Zernike basis function $V_{nm}(\rho, \theta)$ is defined over a unit circle with order $n$ and repetition $m$ such that (a) $n - |m|$ is even and (b) $|m| \le n$, as given by

$$V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}, \quad \rho \le 1 \qquad (1)$$

where $R_{nm}(\rho)$ is a radial polynomial of the form

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s \frac{(n-s)!}{s!\,\left(\frac{n+|m|}{2}-s\right)!\,\left(\frac{n-|m|}{2}-s\right)!}\; \rho^{\,n-2s} \qquad (2)$$
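As a worked illustration of the two definitions just given, the following short Python sketch (not from the paper; the function names are ours) evaluates the radial polynomial of Eq. (2) and the basis function of Eq. (1) at arrays of polar samples:

    # Sketch: Zernike radial polynomial R_nm (Eq. 2) and basis V_nm (Eq. 1).
    from math import factorial
    import numpy as np

    def radial_poly(n, m, rho):
        """R_nm(rho); requires n - |m| even and |m| <= n."""
        m = abs(m)
        assert (n - m) % 2 == 0 and m <= n
        R = np.zeros_like(rho, dtype=float)
        for s in range((n - m) // 2 + 1):
            coeff = ((-1) ** s * factorial(n - s)
                     / (factorial(s)
                        * factorial((n + m) // 2 - s)
                        * factorial((n - m) // 2 - s)))
            R += coeff * rho ** (n - 2 * s)
        return R

    def zernike_basis(n, m, rho, theta):
        """V_nm(rho, theta) = R_nm(rho) * exp(j*m*theta), valid for rho <= 1."""
        return radial_poly(n, m, rho) * np.exp(1j * m * theta)

A moment $Z_{nm}$ of a normalized region is then obtained by correlating the region with $V^*_{nm}$ over the unit disk and scaling by $(n+1)/\pi$, as Eqs. (4) and (5) below state.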
The set of basis functions $\{V_{nm}(\rho,\theta)\}$ is orthogonal, i.e.

$$\int_0^{2\pi}\!\!\int_0^1 V^*_{nm}(\rho,\theta)\, V_{pq}(\rho,\theta)\, \rho\, d\rho\, d\theta = \frac{\pi}{n+1}\,\delta_{np}\,\delta_{mq}, \qquad \delta_{ab} = \begin{cases} 1 & a=b \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The two-dimensional ZMs of a continuous image function $f(\rho,\theta)$ are represented by

$$Z_{nm} = \frac{n+1}{\pi} \iint_{(\rho,\theta)\in \text{unit disk}} f(\rho,\theta)\, V^*_{nm}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}} \qquad (4)$$

For a digital image function the two-dimensional ZMs are given as

$$Z_{nm} = \frac{n+1}{\pi} \sum\sum_{(\rho,\theta)\in \text{unit disk}} f(\rho,\theta)\, V^*_{nm}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}} \qquad (5)$$

Define a region descriptor $P$ based on the sorted ZMs as follows:

$$P = [\,|Z_{11}|e^{i\phi_{11}},\ |Z_{31}|e^{i\phi_{31}},\ \ldots,\ |Z_{n_{\max}m_{\max}}|e^{i\phi_{n_{\max}m_{\max}}}\,]^T \qquad (6)$$

where $|Z_{nm}|$ is the ZM magnitude and $\phi_{nm}$ is the ZM phase. The Zernike Moment is derived after integrating the normalized region against the Zernike basis function. In this paper, the ZMs with $m = 0$ are not included, and both the maximum order $n$ and the maximum repetition $m$ equal 12, resulting in a feature vector of length 42. In this way, two feature vectors represent a feature region: $\mathrm{mag} = [\,|Z_{1,1}|, |Z_{3,1}|, \ldots, |Z_{12,12}|\,]^T$ and $\mathrm{phase} = [\,\phi_{1,1}, \phi_{3,1}, \ldots, \phi_{12,12}\,]^T$.

Figure 3. Normalization of an elliptical region.

2.3. A similarity measure

Let $P_q = (\mathrm{mag}_q, \mathrm{phase}_q)$ and $P_d = (\mathrm{mag}_d, \mathrm{phase}_d)$ be two ZM feature vectors, where $\mathrm{mag}_q = [\,|Z^q_{1,1}|, |Z^q_{3,1}|, \ldots, |Z^q_{12,12}|\,]^T$, $\mathrm{phase}_q = [\,\phi^q_{1,1}, \phi^q_{3,1}, \ldots, \phi^q_{12,12}\,]^T$, and $\mathrm{mag}_d$, $\mathrm{phase}_d$ are defined analogously.

The similarity of magnitudes $S_{mag}(P_q, P_d)$ is defined as the cosine between the two vectors:

$$S_{mag}(P_q, P_d) = \frac{\mathrm{mag}_q \cdot \mathrm{mag}_d}{\|\mathrm{mag}_q\|\,\|\mathrm{mag}_d\|} \qquad (7)$$

The value ranges between 0 and 1, and a higher value indicates that the two vectors are more similar. This is equivalent to the Euclidean distance between the two normalized unit vectors.

A similarity measure using the weighted ZM phase differences is expressed by

$$S_{phase}(P_q, P_d) = 1 - \frac{1}{\pi}\sum_n\sum_m w_{nm}\, \min\{\, |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi|,\ 2\pi - |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi| \,\} \qquad (8)$$

where $w_{nm} = (|Z^q_{nm}| + |Z^d_{nm}|)\,/\,\sum_{n,m}(|Z^q_{nm}| + |Z^d_{nm}|)$ and $\Phi_{nm} = (\phi^q_{nm} - \phi^d_{nm}) \bmod 2\pi$ is the actual phase difference. The rotation angle $\hat{\alpha}$ is determined by an iterative computation of $\hat{\alpha}_m = (\Phi_{nm} - \hat{\alpha}_{m-1}) \bmod 2\pi$, with the initial value $\hat{\alpha}_0 = 0$, using the entire information of the Zernike moments sorted by $m$. The value range of $S_{phase}(P_q, P_d)$ is the interval [0, 1], and a higher value indicates that the two vectors are more similar.

3. EFFICIENT BUILDING IMAGE DATABASE CONSTRUCTION

In building image retrieval applications, the scale of the database is typically large, with a considerable number of visual descriptors. In order to index and search rapidly, effective approaches for storing appropriate descriptors are proposed for constructing a large-scale building image database.

3.1. Feature selection from multiple images

Building databases in image retrieval applications normally contain multiple views of a single building. For example, the ZuBud dataset collects five images for each building in the database. We refine the detected MSER feature regions by verifying consistency between multiple images of a building captured from distinct viewpoints. The basic idea of the selection is to keep representative feature regions and remove discrepant features as outliers. Feature region selection reduces the storage space of feature descriptors in a database. Furthermore, this method remarkably improves the efficiency and accuracy of the image retrieval process.
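Since both the selection step that follows and the retrieval step of Section 4 rely on the two similarity measures of Section 2.3, a small numerical sketch of them may be useful. This is an assumption-laden illustration, not the paper's code: the phase similarity below takes the rotation angle alpha as an input instead of estimating it iteratively as Eq. (8) prescribes.

    # Sketch: magnitude similarity of Eq. (7) and a simplified phase
    # similarity in the spirit of Eq. (8), with alpha supplied by the caller.
    import numpy as np

    def s_mag(mag_q, mag_d):
        """Cosine similarity between two ZM magnitude vectors, Eq. (7)."""
        return float(np.dot(mag_q, mag_d)
                     / (np.linalg.norm(mag_q) * np.linalg.norm(mag_d)))

    def s_phase(mag_q, phase_q, mag_d, phase_d, m_list, alpha):
        """m_list[i] is the repetition m of the i-th entry; the weights are
        proportional to |Z_q| + |Z_d| and sum to one."""
        w = np.abs(mag_q) + np.abs(mag_d)
        w = w / w.sum()
        phi = np.mod(phase_q - phase_d, 2 * np.pi)      # actual phase difference
        d = np.mod(np.abs(phi - np.asarray(m_list) * alpha), 2 * np.pi)
        d = np.minimum(d, 2 * np.pi - d)                # circular distance in [0, pi]
        return 1.0 - float(np.sum(w * d) / np.pi)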
Figure 4. (a)-(c) Three images in a group of building images before feature selection. (d)-(f) The same three images after feature selection.

The discrepant feature regions come from non-building areas, such as trees, bicycles, and pedestrians, as shown in Fig. 4. Feature regions of non-building areas are not stable compared with regions on the building itself. Therefore, excluding these feature regions from the database is necessary to ensure uniform results.

This paper presents a method to select feature regions automatically by measuring similarity between multiple images of a building. The algorithm for feature region selection is given in Fig. 5; only feature regions that are similar across the views are preserved. Two regions are considered similar if $S_{mag}(P_q, P_d) > 0.7$ and $S_{phase}(P_q, P_d) > 0.7$. A comparison of feature regions before and after selection is shown in Fig. 4. Unstable feature regions in Figs. 4(a)-4(c), such as trees and pedestrians, are removed by the proposed algorithm; the results of the selection are shown in Figs. 4(d)-4(f).

    Input: A group of feature regions in multi-view images.
    Output: Selected feature regions.
    For each feature region
        If there are at least two similar regions in other views
            Preserve the feature region;
        Else
            Delete the feature region;

Figure 5. Feature region selection algorithm.

3.2. Feature clustering

After removing non-building feature regions, most of the remaining feature regions belong to the buildings. However, repeated patterns, e.g. windows and doors, are common in a building image. In order to reduce the storage space of the repeated feature descriptors in a database, clustering similar features into a representative feature descriptor is necessary.

In conventional clustering algorithms, e.g. the k-means and k-medoid algorithms, each cluster is represented by the gravity center or by one of the objects of the cluster located near its center. However, determining the number of clusters k is not straightforward. Moreover, the ability to distinguish different features is reduced because isolated feature regions are forced to merge into a nearby cluster that may have dissimilar region appearance. Consequently, the Density-Based Spatial Clustering algorithm (DBSCAN) [15] is used for clustering.

The DBSCAN algorithm relies on a density-based notion of clusters. Two input parameters, ε and MinPts, determine the clustering conditions in two steps. The first step chooses an arbitrary point from the database as a seed. The second step retrieves all points reachable from the seed. The parameter ε defines the size of the neighborhood, and for each point to be included in a cluster there must be at least a minimum number (MinPts) of points in an ε-neighborhood of a cluster point.
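A minimal sketch of this clustering step, assuming scikit-learn is available, is given below; it clusters the 42-dimensional ZM magnitude vectors of one building group with DBSCAN, replaces each cluster by its mean vector, and keeps isolated (noise) points, mirroring the representative-descriptor step described at the start of the next subsection. The function name and the choice of library are ours, not the paper's.

    # Sketch: DBSCAN clustering of selected ZM magnitude vectors, one group.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def representative_descriptors(mag_vectors, eps, min_pts):
        X = np.asarray(mag_vectors)                       # shape (N, 42)
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        reps = [X[labels == lab].mean(axis=0)             # mean per cluster
                for lab in set(labels) if lab != -1]
        reps.extend(X[labels == -1])                      # keep isolated points
        return np.vstack(reps) if reps else X[:0]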
Figure 6. (a)-(e) Feature regions in the same cluster. (f)-(j) Another cluster of feature regions after DBSCAN.

The input to the DBSCAN algorithm is the set of 42-dimensional selected ZM magnitude vectors of all images belonging to the same group or building. We calculate the mean of the feature vectors in a cluster as its representative, while preserving the isolated feature points. The elliptical regions in Figs. 6(a)-6(e) are feature vectors in the same cluster and are replaced by a representative feature vector. Figs. 6(f)-6(j) show another feature cluster in the same group of multi-view images.

4. IMAGE INDEXING AND RETRIEVAL

4.1. Descriptor indexing with a KD-tree

After the feature selection and clustering processes described above, all extracted building regions are indexed by a KD-tree according to their ZM magnitude vectors. The goal is to build an indexing structure so that the nearest neighbors of a query vector can be searched rapidly.

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space into two parts by a hyperplane perpendicular to a coordinate axis. The binary space partition is recursively executed until every leaf node contains a single data point. The algorithm for constructing a KD-tree is given in Fig. 7; it is initialized with dim = 1 and with Dataset set to the N database points.

    Input: N feature vectors in k dimensions
    Output: A KD-tree in which every leaf node contains a single feature vector
    kd_tree_build (Dataset, dim) {
        If Dataset contains only one point
            Mark a leaf node containing the point;
            Return;
        else
            1. Sort all points in Dataset according to feature dimension dim;
            2. Determine the median value of feature dimension dim in Dataset,
               make a new node and save the median value;
            3. Dataset_bigger  = the points in Dataset with dim >= median value;
            4. Dataset_smaller = the points in Dataset with dim <  median value;
            5. Set Dataset_bigger as the new node's right child and
               Dataset_smaller as the new node's left child;
            6. call kd_tree_build (Dataset_bigger,  (dim+1) % k);
            7. call kd_tree_build (Dataset_smaller, (dim+1) % k);
    }

Figure 7. The KD-tree construction algorithm.

4.2. Query by region vote counting

After establishing a KD-tree for organizing the ZM magnitude feature vectors in the database, the KD-tree is descended to find the leaf node into which the query point falls. After obtaining the first candidate nearest neighbor, we verify with the ZM phase feature vector whether the candidate point is qualified. In our experiments, two vectors are qualified as similar when their distance is as small as possible and their magnitude and phase similarity measures satisfy $S_{mag}(P_q, P_d) > 0.85$ in equation (7) and $S_{phase}(P_q, P_d) > 0.85$ in equation (8). Then, based on the current minimum distance between the query point and the single database point in the leaf node, the KD-tree is revisited to search for the next available neighbor within the current minimum distance. The tree backtracking is repeated until no further reduction of the minimum distance to the query point is found.

For each extracted region in the query building image, one vote is cast for the database building image that owns the region claimed as the nearest neighbor of the query region. After all extracted regions of the query image have voted, we count the number of votes each database image receives. The database image with the maximum votes is returned as the most similar building to the query.
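The indexing and vote-counting query can be sketched as follows. This is our own illustration, not the paper's implementation: it uses SciPy's KD-tree instead of the hand-built one of Fig. 7, and the magnitude/phase verification with the 0.85 thresholds is abstracted into a `qualified` callback that the caller supplies.

    # Sketch: KD-tree indexing of database ZM magnitude vectors and the
    # vote-counting query of Section 4.2.
    import numpy as np
    from collections import Counter
    from scipy.spatial import cKDTree

    def build_index(db_mags, db_image_ids):
        """db_mags: (N, 42) array; db_image_ids: length-N list of image ids."""
        return cKDTree(np.asarray(db_mags)), list(db_image_ids)

    def query_image(tree, db_image_ids, query_regions, qualified):
        """query_regions: list of (mag, phase) pairs from the query image."""
        votes = Counter()
        for mag, phase in query_regions:
            _, idx = tree.query(mag, k=1)        # nearest neighbor with backtracking
            if qualified(mag, phase, idx):        # S_mag > 0.85 and S_phase > 0.85
                votes[db_image_ids[idx]] += 1     # one vote per query region
        return votes.most_common(1)[0][0] if votes else None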
Figure 8. Examples of the database images in the NCTU-Bud dataset (rows: EC Building; ED Building Face 1, first side view; ED Building Face 2, second side view; columns: Views 1-5).

Figure 9. Examples of query images for the NCTU-Bud dataset (Classes A-C: correct, over, and under exposure; Classes D-F: correct, over, and under exposure with occlusion; captured on sunny and cloudy days).

5. EXPERIMENTAL RESULTS

In our experiments, the proposed algorithm is written in Matlab under the Windows environment and evaluated on a platform with a 2.83 GHz processor and 3 GB of RAM. We test the proposed indexing and retrieval system on two sets of building images: the NCTU-Bud dataset created by ourselves and the publicly available ZuBud dataset [11].

5.1. The NCTU-Bud Dataset

To evaluate the proposed approach and to establish a benchmark for future work, we introduce the NCTU-Bud dataset. The dataset contains high-resolution images of 22 buildings on the NCTU campus. For each building in the database we capture at least one facet of the building from five different viewing directions. All database images have a resolution of 1600x1200 pixels. The database contains a total of 190 building images. Some representative database images are shown in Fig. 8.

The query images are captured with a different camera at a resolution of 2352x1568 under two weather conditions, sunny and cloudy. For each weather condition, six images are collected, each with a different exposure setting and a different occlusion condition. In total, 12 classes of images constitute the query dataset, as shown in Fig. 9. Furthermore, five additional camera poses, with different rotations, focal lengths, and translations, are recorded for further testing. A total of 2280 query images is gathered.

5.2. Experimental results for the NCTU-Bud dataset

Table I shows the total number of different region feature vectors collected in the database and the recognition rate for the query images captured with normal exposure on cloudy days. From this table, feature selection using multiple images alone does not raise the query accuracy rate. However, we achieve 100% accuracy after applying both the feature selection and the DBSCAN clustering. In this case not only is the region storage space reduced, but also only the representative feature vectors are stored for the query search. Consequently, the image retrieval accuracy is raised to 100%.

The storage size (the number of nodes) is determined by the number of region feature vectors found in all images of the database. Approximately 50% of the space is
saved by applying feature selection and the DBSCAN clustering method.

The time for feature region detection and description depends on the resolution and the content of an image. If the scene of an image is complex, the number of extremal regions detected by MSER increases and the processing time increases as well. Table II shows the average processing time of feature detection and descriptor computation over 92 different images at different resolutions.

With the feature selection and DBSCAN clustering method, the average time for indexing the database is 22.4 seconds, and the average query time for an image at a resolution of 2352x1568 pixels is 40 seconds. The image query time comprises the time for feature region detection (MSER), descriptor computation (ZM), and the search for the nearest neighbor in the database.

Table III shows the query accuracy rate for the 12 different classes of images. Each class consists of 190 query images. The accuracy rate on cloudy days is generally higher than that on sunny days. The reason may be that strong shadows are cast by occluding objects on sunny days. In addition, the over-exposed images are harder to recognize than images under the other exposure conditions.

Comparing classes D-F with classes A-C, we find that the proposed method also performs well under occlusion. This shows that the proposed system is able to distinguish feature regions even when buildings are partially occluded.

5.3. Experimental results for the ZuBud dataset

The ZuBud dataset contains images of 201 different buildings taken in Zurich, Switzerland. Five different images are taken of each building. Fig. 10 shows some example images. The dataset also provides 115 query images, which are taken with a different camera under different weather conditions.

In the experiments on the ZuBud dataset, the query accuracy rate with feature selection and DBSCAN clustering is over 95%. The average query time is 3.1 seconds with a variation of 1.16 seconds. These results show that our system still performs well on this publicly available dataset.

TABLE I. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE NCTU-BUD DATASET

                               Without feature   With feature   With feature selection
                               selection         selection      and DBSCAN
    # region feature vectors   488,527           264,311        256,261
    Memory size of KD-tree     22 MB             12.9 MB        10.6 MB
    Query accuracy rate        94.7%             94.7%          100%

TABLE II. AVERAGE PROCESSING TIME OF FEATURE DETECTION AND DESCRIPTOR COMPUTATION AT DIFFERENT RESOLUTIONS.

    Resolution                          2352x1568    1600x1200    640x480
    Avg. / std. processing time (s)     13.8 / 4.3   5.8 / 1.58   1.8 / 0.7

TABLE III. QUERY ACCURACY RATE OF THE NCTU-BUD DATASET UNDER DIFFERENT WEATHER CONDITIONS.

                                              Sunny day   Cloudy day
    Class A  Correct exposure                 93.6%       100%
    Class B  Over exposure                    92.1%       92.1%
    Class C  Under exposure                   93.1%       96.3%
    Class D  Correct exposure, occlusion      93.6%       96.3%
    Class E  Over exposure, occlusion         92.1%       94.2%
    Class F  Under exposure, occlusion        92.6%       96.8%

Figure 10. Example images from the ZuBud dataset (query image and corresponding database image).

TABLE IV. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE ZUBUD DATASET

                               Without feature   With feature      With feature selection
                               selection         selection only    and DBSCAN
    # region feature vectors   113,194           68,036            56,089
    Recognition accuracy       89.57%            94.8%             95.6%
6. CONCLUSION

In this paper, we have presented a novel image retrieval system based on the MSER detector and the ZM descriptor, which is robust against geometric and photometric transformations. Experimental results illustrate that the KD-tree indexing and retrieval system with the magnitude and phase ZM feature vectors achieves a high query accuracy rate. The accuracy rates on our NCTU-Bud dataset and on the ZuBud dataset are 100% and 95%, respectively.

The success of our system is attributed to:
(a) Selecting MSER feature vectors using multiple images of the same building captured from different viewpoints, which removes the unreliable regions.
(b) The DBSCAN clustering technique, which groups similar feature vectors into a representative feature descriptor to tackle the problem of repeated feature patterns in the image.

In the future, we will consider optimizing the programs and porting them to mobile phones for mobile device applications. Furthermore, the query results may be verified using multi-view geometry constraints to eliminate outliers and lower the mis-recognition rate.

REFERENCES
[1] J. Wang, G. Wiederhold, O. Firschein, and S. Wei, "Content-Based Image Indexing and Searching Using Daubechies' Wavelets," Int'l J. Digital Libraries, vol. 1, pp. 311-328, 1998.
[2] Z. Chen and S. K. Sun, "A Zernike moment phase-based descriptor for local image representation and matching," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 205-219, 2009.
[3] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, pp. 43-72, 2005.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[6] F. Jing and M. Li, "An Efficient and Effective Region-Based Image Retrieval Framework," IEEE Trans. Image Processing, vol. 13, no. 5, May 2004.
[7] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan, "Categorizing nine visual classes using local appearance descriptors," ICPR Workshop on Learning for Adaptable Visual Systems, 2004.
[8] W. Wu and J. Yang, "Object Fingerprints for Content Analysis with Applications to Street Landmark Localization," Proc. ACM International Conference on Multimedia, 2008.
[9] S. Obdrzalek and J. Matas, "Image Retrieval Using Local Compact DCT-Based Representation," Pattern Recognition, 25th DAGM Symposium, vol. 2781 of Lecture Notes in Computer Science, Magdeburg, Germany: Springer Verlag, pp. 490-497, 2003.
[10] W. Zhang and J. Kosecka, "Hierarchical building recognition," Image and Vision Computing, 2007.
[11] H. Shao, T. Svoboda, and L. V. Gool, "ZuBuD - Zurich Buildings Database for Image Based Recognition," Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, 2003.
[12] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, pp. 761-767, 2004.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, and J. Matas, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, no. 1/2, pp. 43-72, 2005.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996.
Using Modified View-Based AAM to Reconstruct the Frontal Facial Image with Expression from Different Head Orientation

1 Po-Tsang Li (李柏蒼), 1 Sheng-Yu Wang (王勝毓), 1,2 Chung-Lin Huang (黃仲陵)
1 Dept. of Electrical Engineering, National Tsing Hua University, Hsin-Chu, Taiwan.
2 Dept. of Informatics, Fo-Guang University, I-Lan, Taiwan.
E-mail: [email protected]

Abstract - This paper develops a method to solve the unpredictable head orientation problem in 2D facial analysis. We extend the expression subspace of the view-based Active Appearance Model (AAM) so that it can be applied to multi-view face fitting and pose correction for facial images with any expression. Our multi-view model-based facial image fitting system can be applied to a 2D face image (with expression variation) at any pose, and the facial image in any view can be reconstructed in another view. We divide the facial image into an expression component and an identity component to increase the face identification accuracy. The experimental results demonstrate that the proposed algorithm can be applied to improve the facial identification process. We test our system on video sequences with a frame size of 320*240 pixels; it requires 30~45 ms to fit a face and 0.35~0.45 ms for warping.

Keywords - View-based AAM; facial expression

1. Introduction

Facial image analysis consists of face detection, facial feature extraction, face identification, and facial expression recognition. Currently, 2D face recognition technology is well developed, with high recognition accuracy. However, unpredictable head orientation often causes a big problem for 2D facial analysis. Most of the previous facial identity identification or facial expression recognition methods are limited to the frontal face and the profile face. They work only for faces in a single view with ±15 degrees of variation.

The best-known 3D model is the 3D Morphable Model (3DMM) proposed by Blanz and Vetter [11]. 3DMM and AAM are similar: they are model-based approaches consisting of a shape model and a texture model, and both use Principal Component Analysis (PCA) for dimension reduction. The two major differences between them are (1) the optimization algorithm used in fitting, and (2) the feature points in the shape model of 3DMM are 3D feature points, whereas in AAM they are 2D locations. In data collection, an AAM can be developed using 2D facial images, whereas 3DMM captures the depth information using a 3D face laser scanner. 3DMM can accurately reconstruct a 3D human face; however, it needs so much computation that its applications have been limited to academic research.

Blanz et al. [12] apply 3DMM to human identity recognition; however, the fitting process takes 4.5 minutes per frame on a workstation with a 2 GHz Pentium 4 processor. For facial expression recognition, the problem becomes more obvious. Due to the insufficiency of 3D face expression data, one can only rely on a single-expression (neutral) 3D face model for facial identity recognition. However, as more 3D face expression databases become available, researchers such as Wang et al. [16], Amor et al. [17], and Kakadiaris et al. [18] have developed methods to identify the human face in different views and with different expressions. However, because the facial expressions are complicated {surprise, sadness, happiness, disgust, anger, fear}, 3-D models for the different facial expressions are impractical. Lu et al. [20] only record the variations of the landmark points and then apply Thin-Plate-Spline warping to synthesize facial images of other expressions for fitting the face expression image. Chang et al. [15] also divide the training data into an identity space and an expression space, and use bilinear interpolation to synthesize human faces with other expressions. Ramanathan et al. [19] propose a method using 3DMM for facial expression recognition.

To capture 3D face information, we may use either a 3D laser scanner or multi-view 2D images. Recently, the 2D+3D active appearance model (AAM) method has been proposed by Xiao et al. [21], Koterba et al. [22], and Sung et al. [23]. Based on the known projection matrix of a certain view, the so-called 2D+3D AAM method trains a 2D AAM for a single view for later tracking and fitting of the landmark points in 2D images; it then uses the corresponding points to calculate the 3D positions of the landmark points. Xiao et al. [21] use only 900 image frames from a single camera to develop the 3D AAM model. Because of the precision error of 2D AAM landmark tracking, Lucey et al. [24] point out that the feature points tracked by the 2D+3D AAM are worse than the normalized shape obtained by 2D AAM fitting. Their argument is that the 2D+3D AAM cannot obtain the depth information precisely, which causes recognition errors.

In this paper, we apply the view-based AAM proposed by Cootes et al. [4] for model fitting of an input face with any
expression in any view angle; it can then be warped to regenerate the face at any target viewing angle. The view-based AAM consists of several 2D AAMs and can be further divided into an inter model and an intra model. The inter model describes the parameter transformation between any two 2D AAMs, whereas the intra model describes the relationship between the model parameters and the viewing angle for a single 2D AAM. The view-based AAM is generated by an off-line training process. Besides the identity subspace, this paper adds the expression subspace to the inter model so that the view-based AAM can be applied to multi-view face fitting and pose correction for an input face with any expression.

The flow diagram is shown in Fig. 1. For an input face image, based on the intra model, we find the relationship between the parameters and the viewing angle and then remove the angle effect from the parameters. We then divide the angle-independent model parameters into identity parameters and expression parameters, which can be transformed to the target 2D AAM model by using the inter model. Finally, based on the intra model, we add the influence of the angle parameters back onto the model parameters and synthesize the facial image at the target viewing angle.

Figure 1. The flowchart of our system (input image → facial region detection → pose classification → modified view-based AAM fitting using the i-th AAM → selection of the target model for target orientation θ → rotate model i → j → reconstruction with the j-th AAM → new view at angle θ).

2. Active Appearance Model

In the modified view-based AAM, the 2D AAMs play a crucial part. This chapter introduces the overall structure of the 2D AAM and the flow of the training and fitting algorithms. The major goal of the AAM, first proposed by Cootes et al. [2], is to find the model parameters that reduce the difference between the synthesized image (generated by the AAM model) and the target image. Based on the parameters and the AAM model, we may synthesize the corresponding face image.

2.1 Statistical Appearance Models

A statistical appearance model consists of two parts: the shape model, describing the shape of the object, and the texture model, describing the gray-level information of the object. It uses labeled face images to train the AAM. To train the AAM model, we must have a set of facial images annotated with landmark points. These landmark points are selected as salient points on the face that are identifiable on any human face. Figure 2 shows some annotated training face images from the data set.

Figure 2. Examples of the training set.

The number of landmark points is determined experimentally. More landmark points increase the accuracy of the model; however, they also increase the computation of the model fitting process. The distribution of landmark points depends on the characteristics of the face, such as the eyebrows, eyes, nose, and mouth. In these regions we need to put more landmark points, whereas in the other regions (such as the ears, the forehead, or other non-visible areas) we put no landmarks.

2.2 Shape Model

Here, we use triangular meshes to compose the human face. We define a shape $s_i$ as a vector containing the coordinates of $N_s$ landmark points in a face image $I_i$:

$$s_i = (x_1, y_1, x_2, y_2, \ldots, x_{N_s}, y_{N_s})^T \qquad (1)$$

The model is constructed from the coordinates of the labeled points of the training images. We align the locations of the corresponding points on different training faces by using Procrustes analysis as normalization. Given a set of aligned shapes, we then apply principal component analysis (PCA) to all the data. Any shape example can then be approximated by

$$s = \bar{s} + P_s b_s \qquad (2)$$

where $\bar{s}$ is the mean of all aligned shapes, calculated as $\bar{s} = \sum_{i=1}^{N} s_i / N$, $P_s = (p_{s1}, p_{s2}, \ldots, p_{st})$ is the matrix of the first $t$ eigenvectors, and $b_s$ is a set of shape parameters. $p_{st}$ is the $t$-th eigenvector of the shape covariance matrix. Figure 3 shows the effects of varying the first two shape model parameters by ±2 standard deviations.

Figure 3. First two modes of shape variation (±2 sd).
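A minimal sketch of this shape model, assuming the training shapes have already been Procrustes-aligned, is given below; the function names are ours, and the eigenvectors are obtained through an SVD of the centered data matrix rather than by forming the covariance matrix explicitly.

    # Sketch: linear shape model of Eq. (2) built by PCA on aligned shapes.
    import numpy as np

    def train_shape_model(aligned_shapes, t):
        """aligned_shapes: (N, 2*Ns) array, one row per training shape."""
        s_bar = aligned_shapes.mean(axis=0)
        X = aligned_shapes - s_bar
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Ps = Vt[:t].T                         # columns are ps1 ... pst
        return s_bar, Ps

    def approximate_shape(s, s_bar, Ps):
        """bs = Ps^T (s - s_bar);  s is approximated by s_bar + Ps bs."""
        bs = Ps.T @ (s - s_bar)
        return s_bar + Ps @ bs, bs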
2.3 Texture Model

The texture of the AAM is defined as the gray-level information at the pixels x = (x, y) that lie inside the mean shape $\bar{s}$. First, we align the control points and the mean shape $\bar{s}$ of every training face image by using affine warping. Then we sample the gray-level texture information $g_{im}$ of the warped images inside the mean shape region. Before applying PCA to the texture data, we first normalize $g_{im}$, to minimize the effect of lighting variation, by applying a scaling α and an offset β:

$$g = (g_{im} - \beta \cdot \mathbf{1}) / \alpha \qquad (3)$$

where $\mathbf{1}$ is a vector of ones. Let $\bar{g}$ be the mean of the normalized texture data, scaled and offset so that its sum is zero and its variance is unity. α and β are selected to normalize $g_{im}$ as

$$\beta = (g_{im} \cdot \mathbf{1}) / n \quad \text{and} \quad \alpha = g_{im} \cdot \bar{g} \qquad (4)$$

where n is the number of pixels in the mean shape. We iteratively use Equations (3) and (4) to estimate $\bar{g}$ until the estimation stabilizes. Then we apply PCA to the normalized texture data so that a texture example can be expressed as

$$g = \bar{g} + P_g b_g \qquad (5)$$

where $P_g$ is the matrix of eigenvectors and $b_g$ is the vector of texture parameters. Figure 4 shows the effects of varying the first two texture model parameters by ±2 standard deviations.

Figure 4. First two modes of texture variation (±2 sd).

2.4 Appearance Model

The shape and texture of any example in the training set can be summarized by $b_s$ and $b_g$. The appearance model combines the two parameter vectors into a single parameter vector $b_c$:

$$b_c = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (s - \bar{s}) \\ P_g^T (g - \bar{g}) \end{pmatrix} \qquad (6)$$

where $W_s$ is a diagonal matrix of weights for each shape parameter. A further PCA is applied to remove the possible correlations between the shape and texture variations:

$$b_c = Q_c c \qquad (7)$$

where $Q_c$ is the matrix of eigenvectors and c is the appearance parameter. Given an appearance parameter c, we can synthesize a face image by generating the gray levels g in the interior of the mean shape and warping the texture from the mean shape $\bar{s}$ to the model shape s, using

$$s = \bar{s} + P_s W_s^{-1} Q_s c, \qquad g = \bar{g} + P_g Q_g c \qquad (8)$$

where $Q_c = (Q_s, Q_g)^T$. Figure 5 shows the effects of varying the first two appearance model parameters by ±2 standard deviations.

Figure 5. First two modes of appearance variation (±2 sd).

2.5 Shape parameter weight

The shape parameters $b_s$ have units of distance and the texture parameters $b_g$ have units of intensity. Because they are of different nature and relevance, they cannot be compared directly. To estimate the correct value of $W_s$, we systematically displace each element of $b_s$ from its best-match value for every example in the training set and sample the corresponding texture difference. In addition, the active appearance model has a pose parameter vector describing the similarity transformation of the shape. The pose parameter vector t has four elements, $t = (k_x, k_y, t_x, t_y)^T$, where $(t_x, t_y)$ is the translation and $(k_x, k_y)$ represent the scaling k and the in-plane rotation angle θ: $k_x = k(\cos\theta - 1)$ and $k_y = k\sin\theta$.

2.6 Active Appearance Model Search

Here we introduce the kernel of the AAM. The ultimate goal of applying the AAM is, given an input facial image, to find the model parameters that, applied to the AAM model, synthesize an image similar to the input image. Given a new image, we have an initial estimate of the appearance parameter c and of the position, orientation, and scaling in the image. We need to minimize the difference E:

$$E = \| g_{image} - g_{model} \| \qquad (9)$$

where, based on the pre-estimated c, $g_{model} = \bar{g} + P_g Q_g c$ and $s_{model} = \bar{s} + P_s W_s^{-1} Q_s c$, and $g_{image}$ denotes the texture of the target image obtained by warping the image with $s_{model}$ onto the mean shape $\bar{s}$ and sampling the pixel intensities of the region. An algorithm is needed to adjust the parameters so that the input image and the image generated by the model match as closely as possible. Many optimization algorithms have been proposed for this parameter search. In this paper, we apply the so-called AAM-API method [8]. Rewrite (9) as

$$E(p) = \| g_{image} - g_{model} \| \qquad (10)$$

where p is the parameter vector of the model, $p = (c^T \mid t^T \mid u^T)^T$ with $u = (\alpha, \beta)^T$. A Taylor expansion of (10) gives

$$E(p + \nabla p) = E(p) + \frac{\partial E}{\partial p} \nabla p \qquad (11)$$

where the ij-th element of the matrix $\partial E / \partial p$ is $\partial E_i / \partial p_j$. Suppose E is our current matching error. We want to find $\nabla p$ that minimizes $\| E(p + \nabla p) \|^2$. By equating Equation (11) to zero, we obtain the RMS solution

$$\nabla p = -A\, E(p), \qquad A = \left( \frac{\partial E}{\partial p}^T \frac{\partial E}{\partial p} \right)^{-1} \frac{\partial E}{\partial p}^T$$

If we applied a conventional optimization process, we would need to recalculate $\partial E / \partial p$ after every match, which requires heavy computation. To simplify the optimization, Cootes et al. assume that A is approximately constant and that the relationship between E and $\nabla p$ is linear.
    Therefor we systemat re, tically displace the parameter from e r 3.1 Training Da 1 ata the optimal v value on the example image and record the d corresponding effect of texture dif fference. App plying multi-variance linear regressio on the displa on acements ∇p an the nd Sin we do not h nce have a large multiple expressio and multi-vie on ew corresponding difference textu E to find A. Therefore, we need ure e faci image datab ial base of for 2D AAM training process, we ne eed not recalculate matrix A, wh e hich can be computed off-lin and ne to o ning data by using six camer to capture t obtain the train ras the stored in the m memory for ref ference afterward. When we want e muultiple expressio and multi-v on view facial ima database. W age We to match a imag on-line, the step of procedu is as follow: ge ure : hav obtained mu ve ultiple expressio and multi-v on view facial ima age from 13 people (i.e., neutral, surprised, hap m ppiness, sadne ess, disggust, anger, fea There are t ar). totally 510 fac images in t cial the Initial estimate parameters p I e trai ining data set. F 6 shows som of the trainin samples of o Fig me ng our 1. Calculate the model shape smodel. and mode texture gmodel. e el trai ining data. 2. Warping the current image and sample tex xture gimage. ture E= gimage − gmodel. 3. Evaluate the difference text 4. update the mmodel paramete p→p+k∇p, ∇p=−AE(p), initial ers , k=1. 3.2 Intra-Model Rotate 2 5. Calculate the new model sh e hape smodel. and m model texture gmodel. 6. Sample the iimage from new shape gimage. w Coo et al. [4] s otes suggest that the model parame e eters c are relat ted 7. Calculate the new error E e to t view angle θ as the 2 8. if E ' E 2 , then accept the new estim mate ; otherwis try se, c = c0 + cc cos(θ ) + c s sin θ ) n( (12) k=0.5, k=0.225. whe c0, cc, and cs are vectors learned from training data. W ere t We The iteration o the preceding steps stop wh the E 2 ca not of g hen an can find the opt n timal value of parameters ci of the traini f ing be reduced, an we may as nd ssume that the iterative algo e orithm exaample and its c corresponding view angle θi. Cootes’ meth hod converge. doe not θi prec es cisely, it allow ±10 degree errors. In o ws e our expperiment, we fifixed the camer so that the viewing angle is ra knoown beforehan However, i creates an under-determina nd. it u ant pro oblem. We use facial images f from two views to generate oneo AAAM. There are only two inpu that can be used to estima uts ate (a) thre unknowns. So we rando ee omly increase θi by ±1. It is reassonable becaus the error mad during the im se de mage capturing is g unaavoidable, such as the human s subject slightly movement of h y his bod or head. Usin this method to add more in dy ng nput data, we may m esti imate c0, cc, an cs by applyin multiple lin nd ng near regression on quations of cs an (1, cos(θ ), sin(θ )) Τ . the relationship eq nd (b) Given an fa acial image, to f find the best fit tting parameter cj, we may use Equations (13) and (14) to estim d mate the viewi ing ang θj as gle ( x j , y j ) Τ = Rc−1 ( c j − c0 ) (13) (c) -1 Figure 6. Exxamples from th training set f the models. (a) he for whe Rc is the le pseudo-inve of Rc−1 (cc | cs ) = Ι 2 . ere eft erse Right profile F Face, 90° and75 (b) Right Ha Face 60° and 45°, 5°; alf d θ j = ta −1 ( y j / x j ) an (14) (c) Frontal Face 0° and -15°. Fig 7 shows the predicted angle compared with the actual ang g p h gle for the training set for each mode The results are worse than t t el. a the 3. 
Modified View-Based AAM d resu from Coote et al [4]. It is due to that ou model contai ults es ur ins muultiple expressio facial image data. on Cootes et al. [4 propose View 4] w-based AAM, based on sever 2D ral AAM for 3D M Model fitting 2 image. The model-based fitting 2D f 150 for model para ameter estimati can be div ion vided as intra-m model P red icted A n g le(d eg ree) 100 and inter-mode His method h been succes e. has ssfully applied to the 50 human face w without expressi ion. However, they have prob blems 0 fitting the face with expressio It is becaus in the human face e on. se n -40 -20 0 20 2 40 60 80 100 -50 parameter spac the expressi difference between the ch ce, ion b hanges for intra-person is much bigge than the cha n er anges in inter-peerson. -100 Actua Angle(degree) al The original liinear transformmation between the view-angl andle AAM paramete is no longer valid. Here w propose a m ers we method (a) ) b) (b to project the facial space to identity subsp o pace and expre ession subspace to sol the problem Here we divid the viewing angle lve m. de in five ranges: [-90, -75], [-60 -45], [-15, 15 [45, 60], [75 90] 0, 5], 5, gure 7. Predic angle vs ac Fig cted ctual angle across training set ( (a) from leftward to rightward. S Since the human face is symm n metric, resu of our data. (b) Cootes’ exp ult perimental resul at ‘view-bas lts sed in the experim ments, we only develop the 2D AAM for three y acti appearance mode’. ive different angles [-15, 0], [45, 60], and [75, 9 s: 90]. Giv a new pers image, we apply AAM f ven son fitting to find t the 1102
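The regression of Eq. (12) and the angle prediction of Eqs. (13)-(14) can be sketched as follows. This is our own illustration under the assumption that the fitted parameter vectors and their (perturbed) angles are already available; the function names are not the paper's.

    # Sketch: fit c = c0 + cc*cos(theta) + cs*sin(theta) and predict the angle.
    import numpy as np

    def fit_view_regression(C, thetas):
        """C: (N, d) fitted AAM parameter vectors; thetas: (N,) angles in radians."""
        X = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
        coeffs, _, _, _ = np.linalg.lstsq(X, C, rcond=None)   # rows: c0, cc, cs
        c0, cc, cs = coeffs
        R_inv = np.linalg.pinv(np.column_stack([cc, cs]))      # left pseudo-inverse
        return c0, cc, cs, R_inv

    def predict_angle(c, c0, R_inv):
        x, y = R_inv @ (c - c0)            # Eq. (13)
        return np.arctan2(y, x)            # Eq. (14), robust form of atan(y/x)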
Given a new person's image, we apply AAM fitting to find the best model parameters and to estimate the head angle as well. Then we can remove the angle effect by using

$$c_{residual} = c_j - c_0 - c_c \cos\theta_j - c_s \sin\theta_j \qquad (15)$$

The model parameters are thereby separated into two parts: one part that describes the variation due to rotation, and another part that describes the remaining variations (e.g. the variation of identity, expression, and illumination). We can use these parameters to reconstruct the face at a new angle φ as

$$c(\phi) = c_{residual} + c_0 + c_c \cos\phi + c_s \sin\phi \qquad (16)$$

This method can only perform small-angle rotations based on a single 2D AAM. Cootes et al. [7] and Huisman [25] have shown that the intra-model pose correction can be applied to human face recognition.

3.3 Identity and Expression Subspaces

To perform a large-angle warping, we must transform the parameters between the 2D models, and we intend to find a simple transformation between two models. However, the parameters in (15) consist of an identity component and an expression component, which makes the transformation non-trivial. Cootes used two different methods to remove the expression and project into the identity subspace; the parameters are then reduced to the variation of identity, for which the transformation is linear.

Let r be the residual parameter after (15). We divide the training data into $r_{neutral}$ and $r_{exp}$, where exp ∈ {happiness, sadness, fear, anger, disgust, surprise}, to compute the expression and identity covariance matrices. The identity component of $r_{exp}$ is removed by

$$e_{exp} = r_{exp} - r_{neutral} \qquad (17)$$

where $e_{exp}$ is defined as the expression component. Figure 8 shows training examples of $e_{exp}$, and Figure 9 shows training examples of $r_{neutral}$. By applying PCA to $r_{neutral}$ and $e_{exp}$, we can find the projection $P_{neutral}$ onto an identity subspace and $P_{exp}$ onto an expression subspace:

$$e_{exp} = \bar{e}_{exp} + P_{exp} b_{exp} \qquad (18)$$

$$r_{neutral} = \bar{r}_{neutral} + P_{neutral} b_{neutral} \qquad (19)$$

Figure 8. Some examples from the expression component training set.

Figure 9. The neutral images for training the identity subspace.

Figure 10. The facial space relation of identity and expression.

Costen et al. [26] suggested that expression changes are orthogonal to the changes due to identity in this framework. For a new image with parameter r, the expression parameter $b_{exp}$ can be calculated by

$$b_{exp} = P_{exp}^T (r - \bar{e}_{exp}) \qquad (20)$$

Then we can compute $r_{neutral}$ using

$$r_{neutral} = r - \bar{e}_{exp} - P_{exp} b_{exp} \qquad (21)$$

and obtain the projection onto the identity subspace

$$b_{neutral} = P_{neutral}^T (r_{neutral} - \bar{r}_{neutral}) \qquad (22)$$

3.4 Inter-Model Rotation

We may now use multiple linear regression to relate $b^{i}_{exp}$ and $r^{i}_{neutral}$ in the i-th AAM model to $b^{j}_{exp}$ and $r^{j}_{neutral}$ in the j-th AAM model, i.e. to find the relationships $R^{ij}_{neutral}$ and $R^{ij}_{exp}$ such that

$$r^{j}_{neutral} = e^{ij}_{neutral} + R^{ij}_{neutral}\, r^{i}_{neutral} \qquad (23)$$

$$b^{j}_{exp} = e^{ij}_{exp} + R^{ij}_{exp}\, b^{i}_{exp} \qquad (24)$$

where $e^{ij}_{neutral}$ and $e^{ij}_{exp}$ are constant offsets.

3.5 Reconstructing a New View

Given a fitted match of a new person in one view, we can reconstruct another view by the following steps (as shown in Fig. 11; a code sketch follows this list):
1. Remove the effects of orientation (Eq. 15).
2. Project into the identity and expression subspaces of the model (Eqs. 20, 21, 22).
3. Project into the subspaces of the target model (Eqs. 23, 24).
4. Project back into the residual space and combine the two vectors into one vector (inverse of Eqs. 20, 21, 22).
5. Add the assigned orientation (Eq. 16).
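The following sketch strings the five steps together. It assumes the per-view models and the inter-model mapping are packaged as simple attribute bundles (e.g. SimpleNamespace objects) with the fields named in the comments; these names are ours, not the paper's.

    # Sketch: reconstructing a new view from a fitted parameter vector c.
    # src, dst: per-view models with c0, cc, cs, R_inv, e_bar_exp, P_exp,
    #           r_bar_neutral, P_neutral.  inter: mapping i -> j with e_n, R_n,
    #           e_e, R_e (Eqs. 23-24).  phi: target viewing angle.
    import numpy as np

    def reconstruct_new_view(c, src, dst, inter, phi):
        # 1. remove the orientation effect (Eqs. 13-15)
        xy = src.R_inv @ (c - src.c0)
        theta = np.arctan2(xy[1], xy[0])
        r = c - src.c0 - src.cc * np.cos(theta) - src.cs * np.sin(theta)
        # 2. project into the expression and identity subspaces (Eqs. 20-22)
        b_exp = src.P_exp.T @ (r - src.e_bar_exp)
        r_neutral = r - src.e_bar_exp - src.P_exp @ b_exp
        b_neutral = src.P_neutral.T @ (r_neutral - src.r_bar_neutral)  # Eq. (22)
        # 3. map both components into the target model (Eqs. 23-24)
        r_neutral_j = inter.e_n + inter.R_n @ r_neutral
        b_exp_j = inter.e_e + inter.R_e @ b_exp
        # 4. back to the residual space of the target model (inverse of Eqs. 20-21)
        r_j = r_neutral_j + dst.e_bar_exp + dst.P_exp @ b_exp_j
        # 5. add the assigned orientation phi (Eq. 16)
        return r_j + dst.c0 + dst.cc * np.cos(phi) + dst.cs * np.sin(phi)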
Figure 11. The flowchart of the Rotate Model.

4. Experimental Results

Here we illustrate the results of our methods. We use six cameras to capture the expressions of each person. There are 13 persons in our experiments, each with 5 or 6 different expressions, and we select 510 pictures as the training data for the multi-pose 2D AAMs. In the testing phase, we apply model fitting to all pictures, and about 90% of the testing pictures are fitted successfully. Because we do not have enough training data, we apply leave-one-out to train and test our rotation model algorithm. Besides warping the input face to the pre-trained pose, we also try warping the face to other poses and compare the result with video captured at that specific pose. Although our system allows us to fit the model face and then warp the face to any pose, for some views the warping results are not as good as for the others.

To compare the results of the rotated model, we warp the input face image from the right-half view to the frontal pose and compare it with the ground truth pre-stored in our database, as shown in Figure 12.

Figure 12. Result of warping the right-half view to the frontal view versus the ground truth.

Then we illustrate the experimental results of warping the right-side view to the frontal view and compare them with the ground truth, as shown in Figure 13. Apparently, the performance is not as good as in the previous case.
Figure 13. The experimental results of warping the right-side view to the frontal view.

We use the distance similarity measure $x_1 \cdot x_2 / (\|x_1\|\,\|x_2\|)$ to evaluate whether the warped image helps to increase the recognition rate, where $x_1$ represents the pre-stored frontal neutral face image database and $x_2$ represents the testing facial image with any expression and in any viewing direction.

The purpose of warping the non-frontal face to the frontal view is to increase the face identification accuracy. Before the warping process, we have separated the identity component and the expression component from the model parameters. To analyze the warped facial image, we may use the identity parameters or the expression parameters independently to increase the recognition rate. In the following, we synthesize the face image by using only the identity component or only the expression component. The experimental results for the right-half-view and right-side-view facial images are shown in Figures 14 and 15.

In Figure 14, the lower-right image illustrates the facial image synthesized using only the identity component; the expression can hardly be found and it shows a neutral face. In Figure 15, the warped image using the identity component is worse than in Figure 14; however, the warped image using the expression parameters looks fine.

Figure 14. The experimental results of warping the right-half-view facial image to the front view.

Figure 15. The experimental results of warping the right-side-view facial image to the front view.

We use a PC equipped with an Intel C2D 6300 CPU and 2045 MB of memory to test our algorithm. For a video sequence (with frame resolution 320*240), the processing time is less than 45 ms/frame.

Table 1. The improvement of identity recognition, with ICO (identity component only) and PC (pose correction) within 15 degrees.

                         ICO    PC     PC+ICO
    Frontal intra-model  18%    3.7%   21.5%

In Table 2, the comparison is done with the expression parameters. We find that the identity component increases the similarity to the neutral face in the database. On the other hand, for the right-half-view faces with expression processed by PC + ICO (45-60 degrees), the average similarity is about 74.3%, which is only 4.6% lower than for the PC+ICO frontal expression face. However, the improvement for the right-view face with expression is very limited; the similarity is about 56.4%.

5. Conclusions

In this paper, we have demonstrated that the expression parameters can be transformed linearly between any two AAMs of the view-based AAM. This can be used to match an expression-variant face at any angle and to predict the appearance from new viewpoints given a single image of a person. We anticipate that this approach will be useful for making face recognition and expression recognition systems more invariant to the viewing angle. In the future, we may establish a wide-angle facial detection and recognition system with higher accuracy, less processing time, and more stability.

References

[1] T. F. Cootes, D. Cooper, C. J. Taylor and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59,
1995.
[2] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," Proc. European Conf. on Computer Vision, vol. 2, pp. 484-498, 1998.
[3] G. J. Edwards, C. J. Taylor and T. F. Cootes, "Interpreting Face Images using Active Appearance Models," Int. Conf. on Face and Gesture Recognition, 1998.
[4] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "View-Based Active Appearance Models," Image and Vision Computing, vol. 20, pp. 657-664, 2002.
[5] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "Coupled-View Active Appearance Models," British Machine Vision Conference, 2000.
[6] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," IEEE Trans. PAMI, vol. 23, no. 6, pp. 681-685, 2001.
[7] H. Kang, T. F. Cootes and C. J. Taylor, "A Comparison of Face Verification Algorithms using Appearance Models," Proc. BMVC 2002, vol. 2, pp. 477-486.
[8] M. B. Stegmann, B. K. Ersbøll and R. Larsen, "FAME - A Flexible Appearance Modelling Environment," IEEE Transactions on Medical Imaging, 2003.
[9] I. Matthews and S. Baker, "Active Appearance Models Revisited," IJCV, 2004, in press.
[10] 陳曉瑩, "Real-Time Multi-Angle Face Detection" (in Chinese), Master's thesis, Institute of Electrical Engineering, National Tsing Hua University, 2006.
[11] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," Proc. Computer Graphics SIGGRAPH '99, 1999.
[12] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. on PAMI, vol. 25, no. 9, September 2003.
[13] C. Christoudias, L. Morency and T. Darrell, "Light field appearance manifolds," European Conf. on Computer Vision, (4):482-493, 2004.
[14] R. Gross, I. Matthews and S. Baker, "Eigen light-fields and face recognition across pose," Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[15] J. Chang, Y. Zheng and Z. Wang, "Facial Expression Analysis and Synthesis: a Bilinear Approach," Int. Conf. on Information Acquisition (ICIA '07), 8-11 July 2007.
[16] Y. Wang, G. Pan and Z. Wu, "3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach," IEEE CVPR, 2007.
[17] B. B. Amor, M. Ardabilian and L. Chen, "New Experiments on ICP-Based 3D Face Recognition and Authentication," ICPR 2006, vol. 3, pp. 1195-1199.
[18] I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis and T. Theoharis, "Three-Dimensional Face Recognition in the Presence of Facial Expression: An Annotated Deformable Model Approach," IEEE Trans. on PAMI, vol. 29, no. 4, April 2007, pp. 640-649.
[19] S. Ramanathan, A. Kassim, Y. Venkatesh and S. W. Wu, "Human Facial Expression Recognition using a 3D Morphable Model," IEEE ICIP, Oct. 2006.
[20] X. Lu and A. Jain, "Deformation Modeling for Robust 3D Face Matching," IEEE Trans. on PAMI, 2007.
[21] J. Xiao, S. Baker, I. Matthews and T. Kanade, "Real-time combined 2D+3D active appearance models," CVPR 2004, pp. II-535-II-542.
[22] S. Koterba, S. Baker, I. Matthews, C. Hu, J. Xiao, J. Cohn and T. Kanade, "Multi-view AAM fitting and camera calibration," IEEE ICCV, vol. 1, 17-21 Oct. 2005, pp. 511-518.
[23] J. Sung and D. Kim, "STAAM: Fitting a 2D+3D AAM to Stereo Images," IEEE ICIP, 8-11 Oct. 2006.
[24] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. de la Torre and J. Cohn, "AAM derived face representations for robust action recognition," Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006, pp. 155-160.
[25] P. Huisman, R. van Munster, S. Moro-Ellenberger, R. Veldhuis and A. Bazen, "Making 2D face recognition more robust using AAMs for pose compensation," Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006.
[26] N. Costen, T. F. Cootes and C. J. Taylor, "Compensating for Ensemble-Specificity Effects when Building Facial Models," Proc. British Machine Vision Conference 2000, vol. 1, pp. 62-71.
Patch-Based Occupant Classification for Smart Airbag

Shih-Shinh Huang
Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology
Email: [email protected]

Er-Liang Jian and Chi-Liang Chien
Chung-Shan Institute of Science and Technology
Email: [email protected]

Abstract - This paper presents a vision-based approach for occupant classification. In order to circumvent the intra-class variance, we consider the empty class as a reference and describe an occupant class by its appearance difference rather than by its appearance itself, as traditional approaches do. Each class in this work is modeled by a set of representative parts called patches, each of which is represented by a Gaussian distribution. This alleviates the mis-classification resulting from severe lighting changes, which can make the image locally blooming or invisible. Instead of using maximum likelihood (ML) for patch selection and for estimating the parameters of the proposed generative models, we discriminatively learn the models through a boosting algorithm by minimizing the loss of the training error.

Keywords - patch-based model, discriminative learning

I. INTRODUCTION

Until now, the integration of airbags into automobiles has significantly improved occupant safety in vehicle crashes. However, inappropriate deployment of airbags in some situations may cause severe or even fatal injuries, for example deployment against a rear-facing infant seat or when a passenger is sitting too close to the airbag. According to the report of the American National Highway Transportation and Safety Administration (NHTSA), since 1990 more than 200 occupants have been killed by airbags deployed in low-speed crashes. To prevent occupants from this kind of injury, NHTSA defined the Federal Motor Vehicle Safety Standard (FMVSS) 208 in 2001. One of the fundamental issues of FMVSS 208 is to recognize the occupant class inside the vehicle in order to control the deployment of the airbag. The five basic classes defined in FMVSS 208 are (i) Empty, (ii) RFIS (Rear Facing Infant Seat), (iii) FFCS (Front Facing Child Seat), (iv) Child, and (v) Adult.

Some existing sensors, such as ultrasound, pressure sensors, or cameras, have been used to develop systems that aim at meeting the classification requirements of FMVSS 208. In this work, we choose the camera as the sensing device, since it can provide a rich representation of the occupant in front of the dashboard. This gives the proposed approach potentially higher classification accuracy. Solving the occupant classification problem with computer vision is challenging in the presence of severe lighting changes, large intra-class variance, and structure variance. Since the vehicle is moving, the observed image may have a considerably large dynamic range, from bright sunlight to dark shadow. In the extreme, this makes some regions of the image blooming or invisible and thus complicates the classification task. The intra-class variance denotes that the same occupant class may have different appearance: for instance, passengers may wear clothing with different colors, and baby seats may have different styles. The difference in the scene resulting from a configuration change of the objects inside the vehicle is referred to as the structure variance. Figure 1 shows some images exhibiting the lighting change and intra-class variance. Similar to the works in the literature, we assume that the monitored scene has no structure variance, and the objective of this paper is to achieve a high recognition rate against severe lighting change and intra-class variance.

Figure 1. Challenges: (a) Severe lighting change. The images have a considerably large dynamic range, and the observed images have significantly different appearance. (b) Intra-class variance. Persons wearing clothing with different styles or colors.

A. Related Work

Owechko et al. [1], who are the pioneers in this area, attempted to eliminate the illumination variance by first applying intensity normalization to the training images. The coefficients of the eigenvectors computed by principal component analysis (PCA) are used to represent the occupant class, and an unknown input image is then recognized as belonging to the class of its nearest-neighbor sample. In order to overcome lighting change, Haar wavelet filters, which describe the intensity difference among neighboring regions, have been used for occupant representation. An over-complete and dense application of Haar filters over thousands of rectangular regions is adopted in [2]; a Support Vector Machine (SVM) is then applied to determine the boundaries among different occupant classes for handling intra-
class variance. In [3], [4], the edge map of the passenger appearance is extracted through a background subtraction algorithm and further described by high-order Legendre moments; the classification is achieved using the k-nearest-neighbors strategy. The edge map of the occupant is described by higher-order Tchebichef moments in [5], and the Adaboost algorithm is then applied to select a set of discriminative moments for classification. To utilize more information for classification, multiple features including range [6], motion information, and the edge map are fused under a two-layer architecture [7], [8]; the classifiers in each layer are Non-linear Discriminant Analysis (NDA) classifiers.

The features used in the aforementioned works are all global descriptors, such as a dense edge map [7], [8], Legendre moments [3], [4], or Tchebichef moments [5]. The main limitation of this kind of approach is that the classification accuracy deteriorates in the two extreme cases (blooming and invisible) resulting from severe lighting change. To circumvent this, we present a patch-based model, commonly used in the recognition literature [9], [10] to handle occlusion effects, to describe the occupant class. Furthermore, the above works directly model the appearance of the occupant and thus suffer from the significant intra-class variance. The general way to solve or alleviate this problem is to introduce classification algorithms such as SVM or NDA. Based on the insight that the silhouettes of different occupant classes are distinct, we instead consider the empty class as the reference and model the appearance difference with respect to the empty class.

B. Approach Overview

The objective of occupant classification is to assign one of five classes C = {C_Empty, C_RFIS, C_FFCS, C_Child, C_Adult} to the currently observed image. The system consists of two phases: training and classification. In the training phase, we first recover a reflectance image of the empty class by removing the illumination effect; the obtained reflectance image is considered as the reference image for the subsequent feature representation. Each occupant class is described by a patch-based generative model in order to handle severe lighting change, which can make the image locally blooming or invisible. Traditionally, the parameters of a generative model are estimated with an ML strategy in which only the samples with the same label are considered and used for training the corresponding model. However, the models learned in this way suffer from having little discriminativity among the different classes. Instead, we adopt a discriminative boosting algorithm that estimates the model parameters by directly minimizing the training error.

In the classification phase, the appearances at the trained patches of a specific occupant model are taken into consideration for feature representation. The feature used in this work is the difference in appearance between the patch in the observed image and that in the reference image. This makes the proposed approach invariant to intra-class variance. Then, the likelihood ratios evaluating the existence confidence of the given image with respect to the five trained models are computed, and the classification result is the occupant class with the highest confidence.

The remainder of this paper is organized as follows. In Section II, we introduce the generative models for representing the occupant classes and the way occupant classification is performed. The boosting algorithm for estimating the parameters of the models in a discriminative manner is then described in Section III. Section IV demonstrates the effectiveness of the developed approach by providing experimental results on an abundant database. Finally, we conclude the paper in Section V with some discussion.

II. PATCH-BASED CLASSIFIER

For every class, we build a generative model consisting of several patches, each described by a Gaussian distribution. The observed image is classified by maximizing the likelihood probability. Here, the feature representation of a patch is the appearance difference with respect to a reference image. In order to eliminate the illumination factor, we recover the reflectance image of the empty class and consider it as the reference image. Negative normalized correlation is then introduced to measure the appearance difference and represent the feature of a patch.

A. Feature Representation

The images for training are captured under various lighting conditions. As in the foreground segmentation literature [11], a reference image suitable for difference measurement should be illumination invariant and contain no moving objects. As discussed in [12], an image is the product of two images: a reflectance image and an illumination image. The reflectance image of the scene is constant, while the illumination image changes with the lighting condition in the environment. Accordingly, the reflectance image of the empty class is recovered and considered as the reference image here.

Given a set of empty-class images in the training database, we apply the approach proposed in [13] to estimate the empty-class reflectance image Ir, based on the assumption that illumination images have lower contrast than the reflectance image. This implies that the derivative filter outputs on the illumination image will be sparse, and the reflectance recovery problem can be re-formulated as a maximum-likelihood estimation problem. Figure 2 shows the decomposition of three empty-class images into a constant reflectance image and its corresponding illumination images.

Let p be a patch whose configuration is θ(p) = (t, l, w, h), where (t, l) is the coordinate of the top-left corner and (w, h) is the patch size. To impose a locality property similar to histograms of oriented gradients (HOGs) [14], we divide
Let p be a patch whose configuration is θ(p) = (t, l, w, h), where (t, l) is the coordinate of the top-left corner and (w, h) is the patch size. To impose a locality property similar to that of histograms of oriented gradients (HOG) [14], we divide a patch p into four quadrants {q_1, q_2, q_3, q_4}. We denote the quadrant q_i in the observed image I_o and in the recovered reflectance image I_r as I_o(q_i) and I_r(q_i), respectively; the schematic form is shown in Figure 3. Inspired by the work in [15] on coping with severe lighting change, a matching function (MF) γ(.) is applied to measure the appearance difference between I_o(q_i) and I_r(q_i):

  γ(I_o(q_i), I_r(q_i)) = - ( Σ_{(x,y)∈q_i} N(x, y) ) / sqrt( Σ_{(x,y)∈q_i} D_o(x, y) · Σ_{(x,y)∈q_i} D_r(x, y) )    (1)

where

  N(x, y) = (I_o(x, y) - Ī_o(q_i)) (I_r(x, y) - Ī_r(q_i)),
  D_o(x, y) = (I_o(x, y) - Ī_o(q_i))^2,
  D_r(x, y) = (I_r(x, y) - Ī_r(q_i))^2.    (2)

Here, Ī_o(q_i) and Ī_r(q_i) denote the average intensities of the quadrant images I_o(q_i) and I_r(q_i). This function computes the negative normalized correlation between I_o(q_i) and I_r(q_i), so the range of γ(.) is [-1, 1]. The feature representation f(I(p)) of a patch p is thus a 4-D vector, one value per quadrant.

Figure 2. Examples of reflectance image recovery: the first row shows three empty-class images, the second row the recovered reflectance image, and the third row the three corresponding illumination images.

Figure 3. The definition of the quadrant images I_o(q_i) and I_r(q_i).

B. Classification Model

A generative occupant model M^c = {p^c_k : k = 1, ..., K^c} consisting of K^c patches is proposed to describe each class c ∈ C. Each patch p^c_k is modeled by a Gaussian distribution N^c_k = {µ^c_k, Σ^c_k} associated with the patch configuration θ(p^c_k), where µ^c_k and Σ^c_k are the mean and covariance matrix, respectively. Assuming independence among patches, the log-likelihood of an observed image I belonging to class c is defined as

  log Pr(I | z^c = 1) = log Pr(I | M^c) = Σ_{k=1}^{K^c} log Pr(f(I(p^c_k)) | N^c_k)    (3)

where z^c ∈ {+1, -1} is the membership label for class c and f(I(p^c_k)) is the aforementioned patch representation of the image I at patch p^c_k. Note that the proposed model, which learns the likelihood of a given observation, is a generative one.

Instead of solving occupant classification directly by maximum likelihood (ML), that is, c* = argmax_c log Pr(I | z^c = 1), we introduce an existence confidence and re-formulate the task as five one-against-others binary classification problems. The work in [9] argues that this allows both classification and training to be carried out in a discriminative manner and thus improves classification accuracy. Consequently, the existence confidence of a specific class c given an observed image I is defined as the log-likelihood ratio test (LRT)

  H(I, c) = log [ Pr(I | z^c = 1) / Pr(I | z^c = -1) ].    (4)

Without assuming any prior, we approximate the background hypothesis Pr(I | z^c = -1) by a constant Θ^c. The LRT statistic in (4) then becomes

  H(I, c) = log Pr(I | z^c = 1) - Θ^c = Σ_{k=1}^{K^c} log Pr(f(I(p^c_k)) | N^c_k) - Θ^c = Σ_{k=1}^{K^c} { log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k }    (5)

where Θ^c = Σ_{k=1}^{K^c} Θ^c_k. The classification result for an image I, given the five trained patch-based generative models {M^c : c ∈ C}, is therefore the class c* with the highest existence confidence, that is, c* = argmax_c H(I, c). We have not yet described how to estimate the model parameters Ω^c = {(θ^c_k, µ^c_k, Σ^c_k, Θ^c_k) : k = 1, ..., K^c}; in the next section, a boosting algorithm is proposed to train these parameters in a discriminative way.
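To make the feature computation concrete, the following is a minimal sketch of Eqs. (1)-(2): it splits a patch given by θ(p) = (t, l, w, h) into four quadrants and returns the 4-D vector of negative normalized correlations between the observed image and the reference reflectance image. The function names and the zero-denominator fallback are ours, not the paper's.

```python
import numpy as np

def quadrant_ncc(obs_q, ref_q):
    # Negative normalized correlation between one quadrant of the observed
    # image and the same quadrant of the reference image (Eqs. 1-2).
    do = obs_q - obs_q.mean()
    dr = ref_q - ref_q.mean()
    denom = np.sqrt((do * do).sum() * (dr * dr).sum())
    if denom == 0:                 # flat quadrant: treat as uncorrelated
        return 0.0
    return -float((do * dr).sum() / denom)

def patch_feature(I_obs, I_ref, theta):
    # 4-D feature f(I(p)) for a patch with configuration theta = (t, l, w, h):
    # one negative-correlation value per quadrant q1..q4.
    t, l, w, h = theta
    po = I_obs[t:t + h, l:l + w].astype(np.float64)
    pr = I_ref[t:t + h, l:l + w].astype(np.float64)
    hh, hw = h // 2, w // 2
    quads = [(slice(0, hh), slice(0, hw)), (slice(0, hh), slice(hw, w)),
             (slice(hh, h), slice(0, hw)), (slice(hh, h), slice(hw, w))]
    return np.array([quadrant_ncc(po[q], pr[q]) for q in quads])
```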
III. DISCRIMINATIVE LEARNING USING BOOSTING

In the learning literature [9], [10], several compelling arguments indicate that a model whose parameters are estimated in a discriminative manner is preferable in terms of classification accuracy. Inspired by this, the parameters are determined directly by minimizing the exponential loss of the margin over all training samples [16].

A. Cost Function Definition

Assume a set of labeled images D = {(I_i, t_i)}_{i=1}^N. The margin of a sample (I_i, t_i) with respect to a learned model (classifier) H(.) is defined as z^c_i H(I_i, c), where z^c_i ∈ {+1, -1} is the membership label of the i-th sample for class c: z^c_i = 1 if t_i equals c, and z^c_i = -1 otherwise. The cost function J(.) evaluating the training error of the training set D for class c is then

  J(D, Ω^c) = Σ_{i=1}^N exp{ -z^c_i H(I_i, c) }.    (6)

The smaller the training error of a model H(.) determined by the parameters Ω^c for class c, the smaller the cost J(.). In other words, training the classifier for each class c amounts to finding the set of model parameters in the Ω^c space that minimizes this cost function.

Equation (6) is minimized with a boosting algorithm, a popular way to approach the solution sequentially with a set of additive models. At each round m, the function H(.) is updated as H(.) + h_m(.) so as to decrease the cost; h_m(.) and H(.) are called the weak and strong classifier, respectively, in the boosting literature. Consequently, H(.) has the form

  H(x) = Σ_{m=1}^M h_m(x)    (7)

where M is the number of boosting rounds. By designing h_m(x) as the log-likelihood of a patch minus an offset and setting M = K^c, we have

  h_m(x, c) = log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k,   H(x) = Σ_{k=1}^{K^c} { log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k },    (8)

so that H(x) in (7) is equivalent to H(I, c) in (5). Estimating the model parameters is thus the same as boosting the strong classifier in a sequential manner.

B. Gradient Descent Optimization

Boosting that chooses a linear combination of weak classifiers to minimize the proposed cost function J(.) is shown to be a greedy gradient descent in [17]. The AnyBoost algorithm presented in [17] states that the weak hypothesis giving the greatest reduction in cost lies along the direction of the negative gradient -∇J(H)(x). Differentiating J(.) in (6) with respect to H(.) gives

  ∂J(D, Ω^c)/∂H(I, c) = ∂/∂H(I, c) Σ_{i=1}^N exp{ -z^c_i H(I_i, c) } = -z^c exp{ -z^c H(I, c) }.    (9)

Since it is generally not possible to choose h_m(I, c) = -∇J(D, Ω^c) exactly, the AnyBoost algorithm instead searches for the function with the greatest inner product with -∇J(D, Ω^c). The inner product between -∇J(D, Ω^c) and h_m(I, c) is defined as

  <-∇J(D, Ω^c), h_m(I, c)> = Σ_{i=1}^N z^c_i exp{ -z^c_i H(I_i, c) } h_m(I_i, c).    (10)

Denoting exp{ -z^c_i H(I_i, c) } as the weight w^c_i, the task at boosting round m is to find the weak hypothesis that maximizes Σ_{i=1}^N z^c_i w^c_i h_m(I_i, c).

IV. EXPERIMENT

In this section, we present experimental results on a large amount of video.

A. System Setup and Video Collection

The car used for the experiments is a Mitsubishi Sarvin; the appearance inside the vehicle is shown in Figure 4(a). We mount the camera at the center of the roof near the rear-view mirror (see Figure 4(b)) to provide a near-profile view of the occupant and to prevent the camera view from being blocked by the driver. The video sequences used for both training and validation are gathered from the deployed camera while the platform is moving on the road, and the camera grabs images at a rate of 30 frames per second.

In order to give the database abundant lighting change, we collected videos under different weather conditions, such as sunny and cloudy days, over a period of more than two months. In addition, we drove the vehicle through several different scenes, both indoor and outdoor, such as a basement, facing the sun, and streets shaded by trees. As for intra-class variance, several adults and children with different body types and clothing appear in the videos and were asked to exhibit various postures; some examples are shown in Figure 5. Our database contains 34 video sets, and each set consists of one video per occupant class, giving 34 × 5 = 170 videos in total. Each video is about 5 to 10 minutes long and consists of about 8,000 to 11,000 frames. The total number of frames in the database is 1,633,752, and the detailed statistics can be found in Table I.
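The round-by-round selection described in Section III can be sketched as follows. This is a simplified illustration of an AnyBoost-style round under the exponential loss, not the authors' implementation: `score_fn` is a hypothetical callable that evaluates a candidate weak hypothesis (the Gaussian log-likelihood of a patch minus its offset) on every training image, and `labels`/`weights` hold z_i and w_i for one class.

```python
import numpy as np

def boosting_round(labels, weights, candidate_patches, score_fn):
    # One discriminative boosting round (Sec. III): among the candidate
    # patches, pick the weak hypothesis h_m with the largest weighted
    # correlation sum_i z_i * w_i * h_m(I_i), then update the weights
    # under the exponential loss of Eq. (6).
    best_patch, best_h, best_score = None, None, -np.inf
    for patch in candidate_patches:
        h_vals = score_fn(patch)                    # h_m(I_i, c) for every i
        score = float(np.sum(labels * weights * h_vals))
        if score > best_score:
            best_patch, best_h, best_score = patch, h_vals, score
    new_weights = weights * np.exp(-labels * best_h)  # w_i <- w_i e^{-z_i h_m(I_i)}
    return best_patch, new_weights
```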
Table I. Database statistics (number of frames)

          Empty     RFIS      FFCS      Child     Adult
Fold 1    174,572   164,635   167,589   166,103   176,423
Fold 2    153,322   157,699   157,064   156,352   159,993
Total     327,894   322,334   324,653   322,455   336,416

Figure 4. Camera configuration: (a) the appearance inside the Mitsubishi Sarvin; (b) the deployment of the camera.

Figure 5. Various poses with different illuminations.

Figure 6. Selected patches for the five occupant classes.

B. Classification Results and Analysis

Our occupant classification is based on a set of discriminative patches. To save computation, the grabbed images are normalized to a resolution of 256 × 128. Four types of rectangles are used, including 32 × 32, 32 × 16, and 16 × 32; the scanning steps over the entire image in the horizontal and vertical directions are set to 1/2 of the width and height of the rectangles, respectively. For example, the 32 × 16 rectangle is shifted by 16 and 8 pixels. The number of patches selected for modeling is K^c = 50 for each occupant class. The CPU used is an Intel Core Duo at 2.4 GHz with 1.0 GB of working memory, and the Intel Open Source Computer Vision Library (OpenCV) and libsvm 2.89 [18] are adopted to support the implementation under Microsoft Windows XP.

We use 2-fold cross validation to evaluate the classification performance. The collected videos in our database are divided into two folds: one is used to learn the models and the other for validation, and vice versa (see Table I); each fold thus includes 17 sets. For training, we extract 50 frames from every video in the training fold, giving 50 × 85 = 4,250 training frames in total. The training frames of each video are selected by sampling one frame every 100 frames from the first 5,000 frames. Figure 6 shows the first 10 selected patches for the five occupant classes.

The confusion matrices for the classification results of fold 1 and fold 2 are shown in Table II and Table III, respectively. Our proposed approach is effective in both cases, because the patch-based model built on local features is more robust to severe lighting change than one using a global representation, and the use of appearance differences for feature representation makes the system invariant to intra-class variance. The classification accuracies of the four classes RFIS, FFCS, Child, and Adult all exceed 99.0%. The classification time of our method is about 16 ms; this efficiency results from the simplicity of computing the log-likelihood ratio, which uses 4-D Gaussian distributions. However, there are still a number of mis-classifications between the FFCS and Adult classes, which are hard to distinguish because they have similar appearance.

V. CONCLUSION

In this paper, we present a patch-based generative model for occupant classification. Each patch is divided into four quadrants, and the appearance difference measured by the proposed negative correlation is used to represent the patch. Instead of using ML for classification, the idea of existence confidence is introduced, so that the model parameters can be estimated in a discriminative manner; to achieve this, a boosting algorithm is applied to approach the solution by directly minimizing the training error. The robustness and effectiveness of the proposed method against severe lighting change and intra-class variance have been intensively validated on an abundant database with more than 1,600,000 frames. In the near future, we will introduce semantic cues, such as head or seat detection, to bring the classification accuracy closer to 100%. In addition, the assumption that there is no structure variance inside the vehicle due to user preference should be relaxed in ongoing work.

ACKNOWLEDGMENT

This research is sponsored by the Chung-Shan Institute of Science and Technology under the project XB98175P.
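As a companion to the results that follow, here is a small sketch of the decision rule of Eq. (5) that produces per-frame entries such as those in Tables II and III: for each class, the existence confidence sums, over its trained patches, the Gaussian log-likelihood of the 4-D patch feature minus the per-patch offset, and the frame is assigned to the class with the highest confidence. It assumes SciPy is available, and the dictionary layout of `models` and `feats_per_class` is ours.

```python
from scipy.stats import multivariate_normal

def existence_confidence(patch_feats, class_model):
    # H(I, c) from Eq. (5): sum over the class's K^c patches of
    # log Pr(f(I(p_k)) | N_k) - Theta_k.
    return sum(multivariate_normal.logpdf(f, mean=mu, cov=cov) - theta
               for f, (mu, cov, theta) in zip(patch_feats, class_model))

def classify(feats_per_class, models):
    # feats_per_class[c] holds the 4-D features extracted at class c's own
    # trained patch locations; the result is c* = argmax_c H(I, c).
    scores = {c: existence_confidence(feats_per_class[c], m)
              for c, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```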
Table II. Confusion matrix for fold 1, our approach (overall accuracy 99.50%)

          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     171,261   233       0         86        4         98.10%
RFIS      0         164,605   0         1         29        99.98%
FFCS      0         0         167,567   0         22        99.98%
Child     0         0         6         165,597   500       99.69%
Adult     0         116       276       2         176,029   99.77%

Table III. Confusion matrix for fold 2, our approach (average accuracy 99.59%)

          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     153,301   0         0         17        4         99.98%
RFIS      0         157,684   0         0         15        99.99%
FFCS      0         0         154,642   0         2,422     98.45%
Child     0         0         2         155,598   752       99.51%
Adult     0         0         0         0         159,993   100.0%

REFERENCES

[1] J. Krumm and G. Kirk, "Video Occupant Detection for Airbag Deployment," IEEE Workshop on Applications of Computer Vision, pp. 20-35, 1998.
[2] Y. Zhang, S. J. Kiselewich, and W. A. Bauson, "A Monocular Vision-Based Occupant Classification Approach for Smart Airbag," IEEE Proceedings on Intelligent Vehicle Symposium, pp. 632-637, 2005.
[3] M. E. Farmer and A. K. Jain, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 1, pp. 60-80, January 2007.
[4] M. E. Farmer and A. K. Jain, "Occupant Classification System for Automotive Airbag Suppression," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 756-761, 2003.
[5] S.-S. Huang and P.-Y. Hsiao, "Occupant Classification for Smart Airbag Using Bayesian Filtering," International Conference on Green Circuits and Systems, 2010.
[6] P. R. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, and B. Ottersten, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 4, pp. 1983-1993, July 2007.
[7] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "High Performance Sensor Fusion Architecture for Vision-Based Occupant Detection," IEEE Intl. Conference on Intelligent Transportation Systems, pp. 1128-1132, 2003.
[8] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "Vision-Based Fusion System for Smart Airbag Application," IEEE Proceedings on Intelligent Vehicle Symposium, pp. 245-250, 2002.
[9] A. B. Hillel, T. Hertz, and D. Weinshall, "Object Class Recognition by Boosting a Part-Based Model," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pp. 702-709, 2005.
[10] T. Deselaers, D. Keysers, and H. Ney, "Discriminative Training for Object Recognition Using Image Patches," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 20-25, 2005.
[11] S.-S. Huang, L.-C. Fu, and P.-Y. Hsiao, "Region-Level Motion-Based Foreground Segmentation under a Bayesian Network," IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 522-532, April 2009.
[12] H. Farid and E. H. Adelson, "Separating Reflections from Images by Use of Independent Components Analysis," Journal of the Optical Society of America, vol. 16, no. 9, pp. 2136-2145, 1999.
[13] Y. Weiss, "Deriving Intrinsic Images from Image Sequences," IEEE Intl. Conf. on Computer Vision, vol. 1, pp. 68-75, 2001.
[14] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.
[15] L. D. Stefano, F. Tombari, and S. Mattoccia, "Robust and Accurate Change Detection Under Sudden Illumination Variations," Asian Conference on Computer Vision, pp. 103-109, November 2007.
[16] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing Visual Features for Multiclass and Multiview Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.
[17] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting Algorithms as Gradient Descent," Neural Information Processing Systems (NIPS), pp. 512-518, 2000.
[18] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines," Journal of Machine Learning Research, no. 6, pp. 1889-1918, 2005.
    DISPLAY CHARACTERIZATION INVISUAL CRYPTOGRAPHY FOR COLOR IMAGES Chao-Hua Wen (溫照華) Color Imaging and Illumination Center, Graduate Institute of Engineering, National Taiwan University of Science and Technology Taipei, Taiwan E-mail: [email protected] ABSTRACT images can be reconstructed by stacking operation. This property makes VCS especially useful in condition of Visual cryptography can encrypt the visual information the system requirement of low computation load. and then decrypt the information by human visual Noar and Shamir proposed the (k, n) threshold system without complicated computation. There are scheme or k out of n threshold scheme which illustrated various measures on the performance of kinds of visual a new paradigm in image sharing [2]. In this scheme a cryptography schemes, but rare studies on exact color secrete image is divided into n share images. With any k reproduction for visual cryptography. This paper of the n shares, the secret can be perfectly reconstructed, proposes a new visual cryptography scheme with the while even complete knowledge of (k-1) shares reveals display characterization model which can render no information about the secret image. Consequently, decrypted color image accurately. In the experiments, Noar and Shamir’s method is restricted to a binary the processes of encryption and decryption were image due to the nature of the basic model. demonstrated from the source display to the destination Verheul and Van Tilborg (1997) proposed the display. For color secret images, this method only uses scheme that extended the basic visual cryptography two encryption share images and the decryption can be scheme from binary image to color image [3]. In this performed via a simple operation. scheme each pixel is expanded into m subpixels. Each subpixel may take one of the color from the set of color Keywords Visual Cryptography; Visual Secret Sharing; 0, 1,…, c-1, where c is the total number of the colors Color Visual Cryptography; Display Characterization used to represent the pixel. These subpixels are interrelated to each other such that after all shares are 1. INTRODUCTION stacked and the color is revealed if corresponding subpixels of all shares are of same color, otherwise the With the rapid deployment of network technology, level of black is revealed. In this scheme the size of the multimedia information is transmitted over the Internet decrypted image will increase by a factor of ck-1, when c conveniently. While transmitting secret images, security ≥ n for a (k, n) threshold scheme. shall be taken into consideration because hackers may Koga and Yamamoto (1998) proposed the lattice utilize weak link over communication network to based (k, n) VCS scheme for gray level and color image exposure the hidden information. There are various [4]. In that scheme, the pixels are treated as elements of image secret sharing schemes have been developed to finite lattice and the stacking up of pixels is defined as strengthen the security of the secret images. Information an operation on the finite lattice. In that scheme, (k, n) hiding and secrete sharing are two major approaches. VCS for color images is defined with c colors as a For instance, the watermarking method is widely used collection of c subsets in nth Cartesian product of the for information hidden [1] and the Visual Cryptogrphay finite lattice. (VC) is adopted for secret sharing [2]. Yang (2000) proposed a new VCS for the color VC is introduced first by Noar and Shamir (1994), images [5]. 
The scheme is implemented based on the which allows visual information (e.g. plain text, basic concept of a black and white VCS and gets much handwritten notes, graphs and pictures) to be encrypted better block length than the Verheul-Van Tilborg by producing random noise images that are used to scheme. Here each pixel is expanded into 2c-1 decrypt through the human visual system [2]. Visual subpixels, where c is the number of colors. Hou (2003) cryptography scheme (VCS) eliminates complex proposed a scheme of secret sharing for both gray-level computation in decryption process, and the secret and color images using halftone technique [6]. The 1118
    color secret imageis decomposed into individual 2. VISUAL CRYPTOGRAPHY SCHEME channels before the application of halftone technique. Then the traditional VC is applied to halftone image of Naor and Shamir proposed a (k, n) threshold visual each channel to accomplish the creation of shares. The secret sharing scheme to share a secret image [2]. A size of decrypted image is increased by a factor of nk-1 secret image is hidden into n share images and can be for (k, n) threshold VCS and the quality of decrypted decrypted by superimposing at least k share images but image is based on halftone technique used. any k-1 shares cannot reveal the secret. Cimato et al. (2003) proposed c-colors (k, n) threshold cryptography scheme that provides a 2.1. Visual Cryptography Scheme for binary images characterization of contrast optimal scheme with pixel expansion of 2c – 1 [7]. Yang and Chen (2008) The (2, 2) VCS is illustrated to introduce the basic proposed VCS for color image based on additive color concept of threshold visual secret sharing schemes. The mixing [8]. In the scheme, each pixel is expanded by a encryption process transforms each secret pixel into two factor of three. shares, and each share belongs to the corresponding In order to reduce the size and the distortion of share image. In the decryption process the two decrypted image, Dharwadkar et al. (2010) propose the corresponding shares are stacked together (using visual cryptography for color image using color error OR/AND operation) to recover the secret pixel. Two diffusion dithering technique [10][16]. This technique share of a white secret pixel are of the same while those improves the quality of decrypted image compared to of a black secret pixel are complementary as shown in other dithering techniques, such as Flyod-Steinberg Fig. 1:. Consequently a white secret pixel is recovered error diffusion [11] which is shown by the experimental by a share with the stacked result of half white sub- results obtained using Picture Quality Evaluation pixels and a black secret pixel is recovered by all black. metrics [12]. Meanwhile, Revenkar et al. (2010) Using this basic VCS, the contrast ratio of the decrypted provided the overview of various VCS and performance image is reduced results from halving intensity of the analysis on the basis of pixel expansion, number of white secret pixels. secret images, image format and type of shares generated [13]. Display is one of the most used media devices in Visual Cryptography. In most applications, the decryption side uses a different display model from the encryption side. Even though the same display model used, luminance and color of the displays are possibly different because of production variance. Color gamut is one of characteristics of the color reproduction media for reproducing color images play a major role in determining how a given secret image will perform in VCS. The display color gamut that we have been living for the past several decades is standardized as “Rec. Fig. 1: (2, 2) VCS for transforming a binary pixel into 709” in the video industry [14] or “sRGB” in the two shares. computer industry [15]. These systems share the same primaries. However, the advanced wide gamut displays are rapid deployment in specialized professional 2.2. Digital Halftoning applications and even in home theater now. 
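A minimal sketch of the (2, 2) construction just described, using a 2×2 sub-pixel expansion for illustration (the expansion shown in Fig. 1 may differ): a white secret pixel receives the same random balanced pattern in both shares, a black pixel receives complementary patterns, and stacking the transparencies corresponds to an AND on the white bits. All names are ours.

```python
import numpy as np

# Six balanced 2x2 patterns, each with two white (1) and two black (0) sub-pixels.
PATTERNS = [np.array(p, dtype=np.uint8).reshape(2, 2)
            for p in ([1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0],
                      [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0])]

def encrypt_2_2(secret, rng=None):
    # secret: 2-D array with 1 = white, 0 = black.  White pixels get the same
    # pattern in both shares, black pixels get complementary patterns.
    rng = rng or np.random.default_rng()
    h, w = secret.shape
    s1 = np.zeros((2 * h, 2 * w), np.uint8)
    s2 = np.zeros((2 * h, 2 * w), np.uint8)
    for y in range(h):
        for x in range(w):
            pat = PATTERNS[rng.integers(len(PATTERNS))]
            s1[2*y:2*y+2, 2*x:2*x+2] = pat
            s2[2*y:2*y+2, 2*x:2*x+2] = pat if secret[y, x] else 1 - pat
    return s1, s2

def decrypt_2_2(s1, s2):
    # Stacking the printed shares keeps a sub-pixel white only if it is white
    # in both shares, i.e. an AND on the white bits.
    return s1 & s2
```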
That makes display characterization more serious in terms of Halftone technique is one of the most important parts of accurate information communications between the the image reproduction process for devices with a source and destination. limited number of colors. According to the physical The rest of this paper is organized as follows: characteristics of different media uses the different ways Section 2 provides overview of black and white VCS, of representing the color level of images. The general digital halftoning, error diffusion, halftone-based VCS printer such as dot matrix printers and laser printers can for gray-scale images, and color visual cryptography only control a single pixel to be printed (black pixel) or scheme. Display characterization is elaborated in not be printed (white pixel). The halftone is applied to section 3. The proposed framework is introduced in the given image to render the illusion of the continuous Section 4. Results and discussion is given in Section 5. tone images on the devices that are capable of Finally the conclusion is given in Section 6. producing only binary image elements. This illusion is achieved because our eyes perform spatial integration. That is, if we view a very small area from sufficiently large viewing distance our eyes averages the fine detail 1119
    within the smallarea and record only the overall increase pixel expansion. Wei Qiao et al. also intensity of the area. introduced a VCS for color images based on halftone technique [21]. 2.3. Digital Halftoning 3. DISPLAY CHARACTERIZATION In the (k, n) threshold VCS for gray-level image [3]. The pixels have g gray levels ranging from 0 to g-1, Display is one of the most used media devices in Visual where each pixel is expanded to m subpixels of size m ≥ Cryptography. Flat panel displays have been become a gk-1. In this scheme the size of decoded image is larger common peripheral for desktop personal computers and than the secret image compared to Naor and Shamir workstations. In general VC tasks, we create an image VCS scheme. In order to reduce the size of decrypted on one display and take the data file to a second image, the gray-level halftone image is transformed into imaging system. When viewed on the second display, an approximate binary image. Then, the basic VCS the decrypted image is likely to have different color described in Section 2.1 can use to create shares. The reproduction. Here we address primarily users who will following steps are used to generate less distorted same be doing accurate imaging on a monitor. decrypted image. The traditional CRT techniques have been 1) Transform the gray-level image into a binary summarized by Berns [17] and can be described as image using halftone technique. application of the gain-offset-gamma (GOG) model to 2) Each black or white pixel in the halftone image is characterize the electro-optical transfer functions of the represented by m subpixels into different shares display and a 3x3 linear transform to go from RGB to selecting from the shares of black or white pixels. CIE XYZ tristimulus values. The accuracy of the GOG 3) Repeat step 2 until every pixel in the halftone characterization is probably adequate for most desktop image is decomposed into shares. color applications and color management systems [18]. The International Color Consortium (ICC) has 2.4. Error Diffusion published a standard file format for storing ‘‘profile’’ information about any imaging device In literature there are many mature error diffusion (http://www.color.org/). It has been become routine to techniques are exists, and because of its exceptionally use such profiles to achieve accurate imaging. The high image quality, it continues to be a popular choice widespread support for profiles allows most users to among digital halftoning algorithms [9]. Nagaraj V. achieve characterization and correction without needing Dharwadkar et al. have used Adaptive Order Dithering to understand the underlying characteristics of the (Cluster-dot dithering) [16], Floyd-Steinberg error imaging device. ICC monitor profiles use the standard diffusion technique [11] and color error diffusion CRT model presented in this article. technique and performed the computation of Picture Quality evaluation for decrypted images [12]. Those 3.1. Primary transform matrix and inverse experimental results revealed that the color error diffusion produces the superior quality of recovered The primary transform matrix for the colorimetric image compare to Adaptive Order Dithering and Floyd- characterization of the display was derived from the Steinberg error diffusion technique. direct colorimetric measurements of the three full-on primaries after black correction. The matrix and its 2.5. 
VCS for Color Images inverse are given in Equation (1) and Equation (2). First color VCS was developed by Verheul and Van ⎡X ⎤ ⎡X R XG X B ⎤⎡R⎤ ⎢ ⎥ ⎢ Tilborg [3]. Colored secret images can be shared with ⎢ Y ⎥ = ⎢ YR YG YB ⎥ ⎢G ⎥ ⎥⎢ ⎥ (1) the concept of arcs to construct a colored visual ⎢Z ⎥ ⎢ZR ⎣ ⎦ ⎣ ZG ZB ⎥⎢B⎥ ⎦⎣ ⎦ cryptography scheme. In c colorful VCS, one pixel is −1 transformed into m subpixels, and each subpixel is ⎡R⎤ ⎡ X R XG XB⎤ ⎡X ⎤ divided into c color regions. In each subpixel, there is ⎢G ⎥ = ⎢ Y YG YB ⎥ ⎢Y ⎥ (2) exactly one color region colored, and all the other color ⎢ ⎥ ⎢ R ⎥ ⎢ ⎥ regions are black. The color of one pixel depends on the ⎢B⎥ ⎢ ZR ⎣ ⎦ ⎣ ZG ZB ⎥ ⎦ ⎢Z ⎥ ⎣ ⎦ interrelations between the stacked subpixels. For a colored visual cryptography scheme with c colors, the 3.2. Electro-Optical Transfer Function (EOTF) pixel expansion m is c × 3. Yang and Laih [19] improved the pixel expansion to c × 2 of Verheul and EOTF is used to describe the relationship between the Van Tilborg [3]. Liu et al. developed a color VCS under signal used to drive a given display channel and the the visual cryptography model of Naor and Shamir with luminance produced by that channel. For displays, this no pixel expansion [20]. In this scheme the increase in function is sometimes referred to as gamma and it is the number of colors of recovered secret image does not the aspect of the display characterization described by 1120
    GOG portion ofthe display characterization model. [IR, IG, IB] [IRHT, IGHT, IBHT] EOTF, however, does not work in visual cryptography because VCS deals with fully on/off signal basically. (3) Creation of work-in-process shares: The method described in Section 2.1 is used for creating the 4. THE PROPOSED COLOR VCS work-in-process shares by (2, 2) VCS for each halftone images. For example, the red halftone image IRHT, (2, 2) The objective of our proposed scheme is to apply the VCS encodes the halftone image into two shares, IRSH1 VCS for color image and get better quality decrypted and IRSH2 respectively. Green and blue halftone images image with display characterization procedures. Fig. 9: is performed the same process as the red halftone image. illustrates the framework of the encryption algorithm and the simulated decryption image. In this encryption [IRHT, IGHT, IBHT] [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] algorithm the color image is decomposed into three channels and each channel is considered as a gray-level (4) Creation of encrypted shares: To combine the image. For each gray-level image dithering and VCS work-in-process shares of IRSH1, IGSH1, and IBSH1 into a schemes are applied independently to accomplish the color Share1 image, and to combine IRSH2, IGSH2 and creation of shares. We used color error diffusion for IBSH2 into a Share2 image. dithering technique. It reduces the color sets that render the halftone image and chooses the color from sets by [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] [Share1, Share2] which the desired color may be rendered and whose brightness variation is minimal. Fig. 2: shows how to (5) Display characterization: To apply display decompose a magenta pixel (R = 1, G = 0, B = 1) into model for color correction of Share1 and Share2 images. two sharing blocks and how to reconstruct the magenta block. We superimpose (using AND operation) the [Share1, Share2] [Share1’, Share2’] binary shares of each channel to get the decrypted color image. For delivery of accuracy color communication between the original secret image and the decrypted (R,G,B) = (1,0,1) image, two displays were used in this study. One is the laptop monitor of HP Pavilion dm3, and other is the R = 1 G = 0 B = 1 mobile phone display of hTC Diamond. The colorimetric measurements by Konica-Minolta CA-210 are shown in Table 1: and plotted in Fig. 3:. Fig. 3: illustrates the difference of chromaticity coordinates of Share1 Share2 HP monitor, hTC display and NTSC color space. The color gamut of hTC display is wider than HP monitor. Decrypted pixel The primary transform matrices of two displays were calculated and embedded into the ICC profiles. The Fig. 2: An example of the proposed VCS for a transform matrices of HP monitor and hTC display are magenta pixel. shown in Equation (3) and Equation (4) respectively. Table 1: Measured Luminance and chromaticities 4.1. Encryption Color Luminance and Chromaticity Display In the encryption algorithm, the two shares are R G B Y (cd/m2) x y generated from the color image. Based on Noar and 1 0 0 54.95 0.5894 0.3440 Shamir’s basic concept, the color image is decomposed 0 1 0 138.8 0.328 0.5852 into R, G and B channels. From these channels, six of the work-in-process shares are created. 
Next to combine HP 0 0 1 41.94 0.1466 0.1156 these six work-in-process shares into two encrypted 0 0 0 0.25 0.2009 0.1966 color images using following steps: (1) Color Decomposition: The color image I is 1 1 1 234.7 0.2964 0.3099 decomposed into IR, IG and IB monochrome gray-level 1 0 0 55.67 0.6340 0.3336 images for R, G and B color channels respectively. 0 1 0 192.9 0.3321 0.6273 [I] [IR, IG, IB] hTC 0 0 1 36.24 0.1423 0.0778 0 0 0 0.04 0.2329 0.2269 (2) Digital halftoning: To apply the halftone 1 1 1 284.80 0.2925 0.3044 technique for each color channel to obtain IRHT, IGHT, and IBHT halftone images respectively. 1121
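Steps (1) through (4) of the encryption can be sketched as follows. The halftoning here uses the classic Floyd-Steinberg error-diffusion weights (7/16, 3/16, 5/16, 1/16) rather than the color error-diffusion variant adopted in the paper, and the share construction re-uses the encrypt_2_2 helper sketched in Section 2; treat this as an illustrative pipeline under those assumptions, not the authors' exact code.

```python
import numpy as np

def floyd_steinberg(channel):
    # Binarize one gray-level channel (0..255) by error diffusion, pushing
    # the quantization error to the right / lower-left / lower / lower-right
    # neighbours.
    img = channel.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            out[y, x] = 1 if new else 0            # 1 = white, 0 = black
            err = old - new
            if x + 1 < w:               img[y, x + 1]     += err * 7 / 16
            if y + 1 < h and x > 0:     img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:               img[y + 1, x]     += err * 5 / 16
            if y + 1 < h and x + 1 < w: img[y + 1, x + 1] += err * 1 / 16
    return out

def encrypt_color(secret_rgb):
    # (1) decompose into R/G/B, (2) halftone each channel, (3) build a (2,2)
    # share pair per channel with encrypt_2_2 (sketched earlier), then
    # (4) stack the per-channel shares into two color share images.
    shares1, shares2 = [], []
    for ch in range(3):
        ht = floyd_steinberg(secret_rgb[:, :, ch])
        s1, s2 = encrypt_2_2(ht)
        shares1.append(s1)
        shares2.append(s2)
    share1 = (np.stack(shares1, axis=-1) * 255).astype(np.uint8)
    share2 = (np.stack(shares2, axis=-1) * 255).astype(np.uint8)
    return share1, share2
```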
    0.6 processed by the Floyd-Steinberg error diffusion algorithm as shown in Fig. 4:. 0.5 0.4 v' 0.3 0.2 (a) RGB secret image (b) Red halftone image 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 u'' (c) Green halftone image (d) Blue halftone image Fig. 3: Plot of CIE chromaticity coordinate values of HP monitor (purple line), hTC display (blue line) and Fig. 4: Color secret image and decomposited halftone NTSC color spcace (yellow line). images by Floyd-Steinberg error diffusion algorithm. ⎡X ⎤ ⎡0.4009 0.3294 0.2261⎤ ⎡ R ⎤ Fig. 5: shows the creation of the work-in-process ⎢ Y ⎥ = ⎢0.2340 0.5877 0.1783⎥ ⎢G ⎥ (3) shares. Fig. 5: (a) and Fig. 5: (b) are two work-in- ⎢ ⎥ ⎢ ⎥⎢ ⎥ process shares of red channel. Fig. 5: (c) and Fig. 5: (d) ⎢Z ⎥ ⎣ ⎦ HP ⎣⎢0.0453 0.0872 1.1379 ⎥ ⎢ B ⎥ ⎦ ⎣ ⎦ HP are the shares of green channel and Fig. 5: (e) and Fig. 5: (f) illustrates the work-in-process shares of blue channel. ⎡X ⎤ ⎡0.3714 0.3593 0.2301⎤ ⎡ R ⎤ ⎢Y ⎥ = ⎢0.1954 0.6787 0.1258⎥ ⎢G ⎥ (4) Overall, these six shares reveal no information about the ⎢ ⎥ ⎢ ⎥⎢ ⎥ secret image. ⎢Z ⎥ ⎣ ⎦ hTC ⎣⎢0.0190 0.0439 1.2613 ⎥ ⎢ B ⎥ ⎦ ⎣ ⎦ hTC As described in Section 3, two ICC profiles were first created. Here we assigned the source profile to HP monitor and the destination profile to hTC display. The new color transform was created based on the source (a) IRSH1 (b) IRSH2 (c) IGSH1 profile and the destination profile. The profile connect color space CIEXYZ was adapted. Therefore, the convert from HP image to hTC image are RGBhp XYZ RGBhTC shown in Equation (5). Consequently, we convert from Share1 and Share2 images to Share1’ and Share2’ using the equation. (d) IGSH2 (e) IBSH1 (f) IBSH2 −1 Fig. 5: Six of the work-in-process share images. ⎡R⎤ ⎡0.3714 0.3593 0.2301⎤ ⎡0.4009 0.3294 0.2261⎤ ⎡ R ⎤ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ (Actually, these are black and white images) ⎢G ⎥ = ⎢0.1954 0.6787 0.1258⎥ ⎢0.2340 0.5877 0.1783⎥ ⎢G ⎥ ⎢B⎦ ⎣ ⎥ ⎢0.0190 0.0439 1.2613⎥ ⎣ ⎦ ⎢0.0453 0.0872 1.1379⎥ ⎢ B ⎥ ⎣ ⎦ ⎣ ⎦ HP hTC (5) To reduce the number of the share images for portability and distribution, the proposed scheme 4.2. Decryption divides the work-in-process shares into two groups and creates two encrypted shares. Fig. 6: shows the In the decryption algorithm the color image channels combination of six work-in-process shares into two are reconstructed by stacking the shares of channels. encrypted color images. Both encrypted color images Our proposed scheme is straightforward to reconstruct show no information about the secret image at all. the decrypted image Idecrypt by stacking Share1’ and Share2’ images with AND operation for each color channel individually. [Share1’, Share2’] [Idecrypt] (a) Share1 color image (b) Share2 color image 5. RESULTS AND DISCUSSION Fig. 6: Two encrypted share color images. The color image is decomposed into R, G and B channel images, and then next those decomposition images are Here we assume that Alice uses the monitor of HP Pavilion dm3 to encrypt the secret image and then Bob 1122
  • 136.
    uses the displayof hTC Diamond One to decrypted the REFERENCES image by his eyes. The simulation encrypted share images are shown in Fig. 7: that Alice can preview the [1] H. Arafat Ali, “Qualitative spatial image data hiding for encryption results of the images on hTC display. As a secure data transmission,” ICGST, GVIP Journal, consequence, Share1’ and Share2’ were used rather Volume 7, Issue 2, August, pp-35-43, 2007. than Share1 and Share2. [2] M. Naor, A. Shamir, “Visual cryptography,” Advances in Cryptology, Eurocrypt 94, Lecture Notes in Computer Science, Vol. 950, pp. 1-12, 1995. [3] Verheul, Van Tilborg, “Construction and properties of k out of n visual secret sharing scheme Designs,” Codes and Cryptography, Vol. 11, pp. 179-196, 1997. [4] H. Koga, H. Yamamoto, “Proposal of a lattice based (a) Share1’ color image (b) Share2’ color image visual secret sharing scheme for color and gray-scale images,” IEICE Transactions on Fundamentals of Fig. 7: Color correction of the encrypted images Electronics, Communications and Computer Sciences, resulted from the display characterization of HP monitor vol. E81-A, no. 6, pp. 1262-1269, 1998. and hTC display. [5] C.N. Yang, C.S. Laih, “New colored visual secret sharing scheme,” Design, Codes and Cryptography, vol. 20, Finally, the decryption results are illustrated in Fig. pp.325-335, 2000. 8:. Fig. 8: (a) depicts the decrypted image without [6] Y.C. Hou, “Visual Cryptography for color images,” Pattern Recognition, vol. 36, pp. 1619-1629, 2003. display characterization. In contrast to the decrypted [7] S. Cimato, R.Prisco and A.De Santis, “Optimal colored image with display characterization in Fig. 8: (b), threshold visual cryptography schemes,” Design, Codes results revealed that there were color different between and Cryptography, vol. 35, pp. 311-335, 2003. (a) and (b). However, note that Alice can share the [8] Ching-Nung Yang and Tse-Shih Chen, “Colored Visual encrypted images to Bob, then he can decrypted the Cryptography Scheme based on additive color mixing,” secret image and see the same contents with same color Pattern Recognition, vol. 41, pp. 3114-3129, 2008. as Alice create. [9] Keith T. Knox, “Evolution of error diffusion,” Journal of Electronic Imaging, Vol. 8, pp. 422-429, 1999. [10] Shaked, N. Arad, A. Fitzhugh and I. Sobel, “Color diffusion: Error diffusion for color halftones,” H.P. laboratories Israel, HPL-96-128(R.1), 1999. [11] R. Floyd and L. Steinberg, “An adaptive algorithm for spatial gray scale,” Proceedings of the S.I.D. 17, (a) (b) 2(Second Quarter), 75-77, 1976. [12] Tomas Kratochvil and Pavel Simicek, “Utalization of Fig. 8: Decryption results. (a) shows the decrypted MATLAB for Picture Quality Evaluation,” Institute of image without display characterization and (b) Radio electronics, Brno University of Technology. demonstrated the decrpted image with display [13] P.S.Revenkar, Anisa Anjum and W.Z. Gandhare, characterization. “Survey of visual cryptography schemes,” International Journal of Security and Its Applications, Vol. 4, No. 2, pp. 49-56, April, 2010 6. CONCLUSIONS [14] ITU-R Recommendation BT.709-5, Basic parameter values for the HDTV standard for the studio and for international program exchange. In this paper, we proposed a new VCS for the color [15] IEC 61966-2-1, Multimedia systems and equipment – images with display characterization, which uses the Color measurement and management– Part 2-1: Color error diffusion dithering on primary color channel management –Default RGB color space – sRGB. 
directly. We also reduced the encrypted images down [16] Nagaraj V. Dharwadkar, B. B. Amberker, and Sushil Raj two share images for easily reconstruction and hidden Joshi, “Visual Cryptography for Color Image using Color information. Here we first applied the display Error Diffusion,” ICGST-GVIP Journal, Vol. 10, Issue 1, characterization into Visual cryptography. Results pp.1-8, February, 2010. revealed that we can accurately deliver color [17] R.S. Berns, “Methods for Characterizing CRT Displays,” Displays, Vol. 16, pp. 173-182, 1996. information and secret image as well. Further works can [18] Mark D. Fairchild and David R. Wyble, “Colorimetric be done to reduce the size of share image, improve the Characterization of the Apple Studio Display (Flat Panel quality of halftone shares, and use the model of display LCD),” Munsell Color Science Laboratory Technical characterization as an encryption key. Report, July, 1998. (www.cis.rit.edu/mcsl/ research/PDFs /LCD.pdf ) [19] C.N. Yang, “New visual secret sharing schemes using probabilistic method,” Pattern Recognition Letter 25, pp.481-494, 2004. 1123
    [20] F. Liu,C.K. Wu and X.J. Lin, “Color Visual On Halftone Technique,” International Conference on Cryptography Schemes,” IET Information Security, vol. Measuring Technology and Mechatronics Automation 2, No. 4, pp 151-165, 2008. 978-0-7695-3583-8/09, pp. 393-395, 2009. [21] Wei Qiao, Hongdong Yin and Huaqing Liang, “A Kind Of Visual Cryptography Scheme For Color Images Based Red channel SR0 Display model SG0 Share1 SB0 Green channel Decrypted image SR1 Original image Blue channel SG1 Share2 SB1 Display model Decomposition  Digital Halftone Visual Cryptography  Scheme Fig. 9: The proposed Framework of Color VCS 1124
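Closing out this paper's pipeline, the cross-display correction of Eq. (5) can be sketched directly from the primary transform matrices reported in Eq. (3) and Eq. (4): source RGB is mapped to the CIE XYZ connection space with the HP matrix and back to device RGB with the inverse of the hTC matrix. The EOTF is intentionally ignored, since the share images use only full on/off signals as noted in Section 3.2; the function name and the example value are ours.

```python
import numpy as np

# Primary transform matrices reported in Eq. (3) and Eq. (4):
# linearized RGB -> CIE XYZ for the HP Pavilion dm3 and hTC Diamond displays.
M_HP = np.array([[0.4009, 0.3294, 0.2261],
                 [0.2340, 0.5877, 0.1783],
                 [0.0453, 0.0872, 1.1379]])
M_HTC = np.array([[0.3714, 0.3593, 0.2301],
                  [0.1954, 0.6787, 0.1258],
                  [0.0190, 0.0439, 1.2613]])

def hp_to_htc(rgb_hp):
    # Eq. (5): source RGB -> XYZ connection space -> destination RGB.
    rgb_hp = np.asarray(rgb_hp, dtype=np.float64)
    xyz = rgb_hp @ M_HP.T
    return xyz @ np.linalg.inv(M_HTC).T

# Example: a full-on magenta sub-pixel (R, G, B) = (1, 0, 1) of a share image.
print(hp_to_htc([1.0, 0.0, 1.0]))
```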
    Data Hiding withRate-Distortion Optimization on H.264/AVC Video Yih-Chuan Lin and Jung-Hong Li Dept. of Computer Sciences and Information Engineering, National Formosa University, Yunlin, Taiwan. E-mail: [email protected] Abstract - This paper proposes a data hiding algorithm for the video quality caused by the watermark hiding can be H.264/AVC standard videos. The proposed video data hiding controlled at the bound less than 2 dB. scheme embeds information that is useful to some specific The remainder of this paper is organized as follows. applications into the symbols of context adaptive variable Section 2 describes the watermarking principles and related length coding (CAVLC) domain in H.264/AVC video streams. literatures. Section 3 explains our proposed scheme, including In order to minimize the changes on both the reproduced video the watermark embedding/extracting schemes and embedding quality and the output bit-rate, the algorithm selects DCT restriction rule. In Section 4, the performance of our proposed blocks using a coefficient energy difference (CED) rule and scheme is presented. Finally, some conclusions are given in then modifies the minor significant symbols, trailing one (T1) Section 5. symbols and the least significant bits (LSB) of non-zero quantized coefficient symbols, to hide data into the selected II. BACKGROUND blocks. Upon considering the joint optimization on rate and In general, most data hiding methods in H.264/AVC are distortion, the data hiding algorithm considers the data hiding based on entropy coding symbols or motion vectors (MV). task as a special quantization process and performs within the There are two kinds of entropy coding method in H.264/AVC: rate-distortion optimization loop of H.264/AVC encoder. The CAVLC and CABAC (Context-adaptive binary arithmetic experiment results have demonstrated that our scheme has coding). Many scholars choose CAVLC to develop because it good efficiency on hiding capacity, video quality and output is not complicated and is easy to operate for most situations. bit-rate. We can modify those nonzero coefficients in DCT blocks for Keywords: H.264/AVC, data hiding, CAVLC, reconstruction embedding, but it would affect the bit-rate and video quality loop, coefficient energy difference. seriously. Although the watermark hiding in the DCT blocks is easy to develop, we should consider avoiding unnecessary I. INTRODUCTION problems. Information hiding (or called data hiding interchangeably After transform and quantization, a DCT block usually hereafter) for video is a video process that adds some useful contains sparse zeros and nonzero coefficients. The nonzero data to the raw data or compressed formats of the video in a coefficients in high-frequency after the zig-zag reorder are manner such that the third parties or others can not discern the often sequences ±1, which are called trailing one and they are presence or contents of the hidden message in perception. limited only up to three at most in H.264/AVC. When the H.264/AVC can provide better compression efficiency number of trailing ones becomes more, the coding length is than other exiting standard at the cost of high computation shortest. So most researchers are focus on this part to develop complexity. Owing to the high popularity of this standard algorithm in data hiding. Consider changing the coefficients in format over many video applications, the hiding of useful data a DCT block. 
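To make the CAVLC terms used in this paper concrete, here is a small sketch that zig-zag scans a quantized 4×4 block and counts its trailing ones (at most three, as in H.264/AVC). The scan table is the standard frame-scan order; the function names and the example block are ours.

```python
# Standard H.264/AVC zig-zag (frame) scan order of a 4x4 block: (row, col).
ZIGZAG_4X4 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]

def scan_block(block):
    # Return the zig-zag coefficient sequence, the nonzero levels, and the
    # trailing-one count that CAVLC folds into the coeff_token symbol.
    seq = [block[r][c] for r, c in ZIGZAG_4X4]
    levels = [v for v in seq if v != 0]
    t1 = 0
    for v in reversed(levels):        # walk back from the highest frequency
        if abs(v) == 1 and t1 < 3:
            t1 += 1
        else:
            break
    return seq, levels, t1

# A block whose zig-zag sequence starts -2, 4, 3, -3, 0, 0, -1 (five nonzero
# levels, one trailing one), matching the example used later in Fig. 5.
example = [[-2, 4, 0, -1],
           [ 3, 0, 0,  0],
           [-3, 0, 0,  0],
           [ 0, 0, 0,  0]]
print(scan_block(example)[1:])   # -> ([-2, 4, 3, -3, -1], 1)
```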
Four symbols for the CAVLC are available: into this format attracts a great deal of attention for different coeff_token, trailing_ones_sign_flag, total_zero, and applications. Recently, many researchers are committed to run_before. The coeff_token is composed of nonzero develop watermark schemes in H.264/AVC [1-4], but in order coefficients and T1 in a DCT block. In the same case, if the to make a balance between video quality and bit-rate; they number of trailing one increases, the bit-rate will reduce. On usually offer only a small capacity to hide data. This paper the contrary, when the number of coefficient is raised, the proposes a data hiding (or called watermark interchangeably bit-rate will increase oppositely. hereafter) scheme that is based on the CAVLC in H.264/AVC In Wu et al. [4], their proposed method is emphasizing on encoder and decoder sides. In the proposed method, one robustness to the compression attacks for H.264/AVC with watermark bit is embedded by employing the relationship more than a 40:1 compression ratio in I frame. The data between all of the polarity of T1 symbols in a 4x4 luminance embedded to the predicted 4x4 DCT block is only one bit. In DCT block. If the DCT block has no any T1, the algorithm Tian et al. [5], this proposed method just modified the nonzero considers modifying the LSB of the last nonzero coefficient coefficients. Therefore, the bit-rate increase is about 0.1% and for embedding information. Experiment results have shown the PSNR degradation is less then 0.5dB. It is good at keeping that our proposed method provide more capacity and can low bit-rate and high quality. However the capacity is too low. enhance the rate-distortion efficiency. The degradation of In Liao et al. [6], this method embeds message into the trailing ones of 4x4 blocks during the CAVLC. The feature of this 1125
    method is toallow data hiding directly in the compressed is intra-mode, the encoder performs intra-prediction and the stream in real time and the capacity is more than others [5-6]. mode set contains only I4MB, I16MB and IPCM modes. In Shahid et al. [7], this proposed method also embeds watermark into DCT blocks. It modifies the LSB of coefficients in each inter- and intra-frames and provides a high capacity of data hiding. In Huang et al. [8], this method is a new steganography scheme with capacity variability and synchronization for secure transmission of acoustic data, In Wang et al. [9], the method has good efficiency, it are always higher than 45 dB at the hiding capacity of 1.99 bpp by embedding for all test images III. THE PROPOSED SCHEME A. OVERVIEW OF OUR METHOD Figure. 1 depicts the block diagram of our proposed method in the H.264/AVC encoder side. The watermark embedding method is inserted into H.264/AVC during the encoding process. Data is hided in DCT blocks before entropy coding. In our proposed method, the watermarking is done on luminance DCT blocks in both intra and inter modes, not considering the chrominance DCT blocks. Fig. 2. The proposed watermarking method at macro-block level. Fig. 1. Schematic illustration of our proposed watermarking /embedding procedure. When the encoder executes information hiding method, the rate-distortion must be considered. Because the marked result changes are reflected to the reconstruction frame, the encoding of next frame refers to this marked reconstruction frame. So we must consider the reconstruction loop [7]. In other words, the data hiding block should perform inside the reconstruction loop or inside the reconstruction loop with Fig. 3. The proposed watermarking integration with RDO RDO (Rate Distortion Optimization). Otherwise, the bit-rate procedure. and video quality would be affected seriously due to the prediction drift phenomenon between encoder and decoder As indicated in Fig. 2, our proposed method is also sides. integrated within the RDO procedure in the encoder side. In the H264/AVC encoding, RDO helps current frame to When the encoder performs the RDO procedure, it selects the select the best mode and get the best trade-off between best coding mode while watermarking is done at the same time. distortion of quality and bit-rate. Therefore, our method takes That mode might be different from that without watermarking. into account RDO in order to get better coding performance But the bit-rate and video quality are best among other modes while embedding the information into the video blocks. As in the mode set. Fig. 3 illustrates the detail of “RDCost with shown in Fig. 2, the embedding procedure at the macro-block watermarking” block shown in Fig. 2. As described previously, level is illustrated. When a macro-block enters the encoder we focused on both intra- and inter-blocks of luminance side, the encoder firstly determines its encoding mode. If the component for data hiding. As indicated in Fig. 3, the modes marco-block is inter-mode, the encoder performs both inter- IPCM and SKIP are not considered for embedding. and intra-prediction to select the best mode from the mode set. As previously described, our method can be done within The mode set includes PSKIP, P16x16, P16x8, P8x16, P8x8, RDO inside reconstruction loop. As shown in Fig. 2, the block I4MB, I16MB and IPCM modes. When the marco-block mode “Get best MB mode” selects the best mode to do the coding task. The performance of data hiding without RDO is not 1126
    better than thatof considering the RDO based on the results Fig. 5. Example illustration of proposed watermark restriction. shown in a later section. There is a 4x4 DCT block with five coefficients and the Fig. 4 illustrates the integration of the proposed method threshold is set 0.25. After zig-zag scanning all of the with the H.264/AVC decoder. An extracting algorithm is coefficients, the sequence is -2, 4, 3, -3, 0, 0, -1. The last inserted into H.264/AVC decoding phase. The extracting trailing one is -1. Before embedding phase, we must calculate phase can be done in DCT blocks after entropy decoding. In the CED firstly and compare the CED value with threshold. As our method, we embed the watermark on luminance DCT shown in Fig. 5, the block satisfies our restriction, in that the blocks in both intra- and inter-modes. So we need only to do CED is lower than the threshold. extract on the luminance part of DCT blocks. C. EMBEDDING ALGORITHM In this subsection, we will show the pseudo code for the embedding algorithm and explains the detailed. In Table I, we define the symbols and the functions in the pseudo code. These functions often refer to the DCT block or trailing one set to get the information of DCT block. Table I the symbol and function explanation Fig. 4. Schematic illustration of the proposed watermark Variable or Function Definition extracting procedure. DCTB A size 4x4 DCT block A size 4x4 DCT block by DCTB B. THE RESTRICTION OF OUR METHOD embedding In literatures, most methods usually utilize the quantized The trailing one set in a DCT T 1set block coefficients for embedding; they all have the common feature that only modifying the value but not changing the sign. The W Watermarking bit, W = {0,1} proposed algorithm utilizes the relation of the polarity of each Threshold Threshold value T1 to embedding. The polarity and the sign of coefficient are coeEnergy coefficient energy difference related. getT 1set ( DCTB) Get the T1 set from DCTB Based on experiments, we observe a phenomenon that when the number of coefficients is sparse in a DCT block, getT 1count (T 1set ) Get the number of trailing changing the sign of trailing one causes the bit-rate increasing one from T1 set significantly. In intra-prediction phase, the current block refers getLevcount (DCTB ) Get the number of nonzero to the upper and the left blocks to make prediction and encode level from DCTB the prediction residual. When changing the sign of trailing one getLastT 1Index (T 1set ) Get the last T1 index from T1 with sparse nonzero coefficients in the current block, the block set data in spatial domain would change greatly because the getLastLev Index (DCTB ) Get the last nonzero level energy changes by the sign flip is a greater proportion of the index from DCTB whole block coefficient energy. When the coded block is XorT 1Polarity (T 1set ) All of polarity doing the referenced by other uncoded blocks, this bad effect would be XOR operation in T1 set. propagated to other uncoded blocks due to the reconstruction ChangeSign (DCTB , Changing the sign of T1 on loop. Thus, we have to draw up a mechanism for preventing Index) index position in DCTB this effect. If the number of coefficients is not sparsely and the coefficient energy of trailing one to be changed the sign ChangeLSB (DCTB , Changing the LSB of T1 on Index) index position in DCTB occupies slightly proportion in the current block, we does not hide any watermark bits to the DCT block. 
getLSB( DCTB, Index) Getting the LSB of level on In our method, we set a threshold to decide whether the index position in DCTB DCT block is suitable to embedding data or not. At first, we getEnergy (DCTB ) Getting coefficient energy calculate the coefficient energy of the current DCT block and difference in DCTB the CED after changing the sign of one trailing one. If the change rate of CED is less than the prespecified threshold, the The Embedding algorithm can be divided into two parts, block is chosen to hide data. Otherwise, the block is kept intact. as shown in Table II. The first part, for blocks with at least one One simple example is shown in Fig. 5. trailing one and CED less than the threshold, utilizes all of the polarity values of trailing ones to hide data. If the sign of trailing one is negative, the polarity value is 0. On the contrary, the sign of trailing one is positive, the polarity value is 1. The polarity values of trailing ones are through an XOR operation. The result must be identical to the value of the watermark bit to be hided into the block; otherwise we should change the sign of last trailing one to satisfy the hiding condition. If the result 1127
    equals to thewatermarking bit, the process does not modify watermark bit when the number of trailing one is nonzero. If any thing for the block. The algorithm changes the sign of the the number of trailing one is zero and the last level existence, last trailing one because the last trailing one in the high we can get the LSB from the last level as watermark bit. If the frequency zone has lower energy than other trailing ones, not number of level and trailing one is zero, we do not do any causing significant degradation of quality and bit-rate. thing. Table II The pseudo code for Embedding Algorithm Table III The pseudo code for Extracting Algorithm Embedding Algorithm Extracting Algorithm Input: DCTB Input: DCTB Output: DCTB Output: W Initialization: Initialization: T 1set  getT 1set ( DCTB ) T 1set  getT 1set ( DCTB ) numT1  getT1count (T 1set ) numT1  getT1count (T 1set ) numLevel  getLevcount ( DCTB) numLevel  getLevcount ( DCTB) Begin Embedding() Begin Extracting() if( numT1  0 ) if( numT1  0 ) coeEngergy  getEnergy (DCTB ) coeEngergy  getEnergy ( DCTB ) if( coeEngergy  Threshold ) if( coeEngergy  Threshold ) W  XorT1Polarity(T1set ) W  XorT1Polarity(T1set ) if( W !  W ) output W LastT1  getLastT1Index( DCTB) end ChangeSign( DCTB, LastT 1) else if( numT 1  0 numlevel  0 ) output DCTB LastLevel  getLastLevIndex( DCTB ) end W  getLSB ( DCTB , LastLevel ) end output W else if( numT1  0 numlevel  0 ) end LastLevel  getLastLevIndex(DCTB ) End ChangeLSB ( DCTB , LastLevel ,W ) output DCTB IV. EXPERIMENTAL RESULTS end End A. THE EXPERIMENT ENVIRONMENT The second part, when the number of nonzero Table IV the experimental parameters for H.264/AVC codec. coefficients is nonzero and the number of trailing one is zero, Parameter Information utilizes the last level to change the LSB for hiding data. Profile IDC 66(baseline) Otherwise if the number of levels and trailing ones are zero, Intra period 15(I-P-P-P) we do not perform the embedding work. The advantage of the Slice mode 0 method in the first case is that the change of the sign does not Frames to be encoded 300 affect other symbols in the same block. According to the Motion Estimation scheme Fast Full Search CAVLC rule, the trailing_ones_sign_flag indicate the sign of Rate Control Disable trailing one, it is encoded as one bit in the NAL (Network Abstraction Layer). If the sign is negative, it will be encoded Table V the test video format parameters bit 1. On the contrary, if the sign of trailing one is positive, it Parameter Information will be encoded one bit 0. We change only the sign of last Video format QCIF trailing one so that the encoded block has the same length as YUV format 4:2:0 that prior to embedding process. Frame Size 176×144 D. EXTRACTING ALGORITHM Frame rate 30 fps The extracting phase as shown in Table III is easier than the embedding phase. The watermarking extracting algorithm We utilize the H.264/AVC JM Reference software [9] as is performed between the entropy decoding phase and the the platform to simulate our proposed method. This subsection inverse quantization phase. We find out all of the trailing ones presents that the experiment parameters for our method in JM in current DCT block firstly and calculate the CED value; if reference software. We use the version of JM software is 12.2, the CED is lower than threshold, we collect all of the polarity where the related environmental parameters are shown in values for each trailing one to do XOR operation to get the Table IV. 
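Before turning to the experiments, the embedding and extracting rules of Tables II and III can be restated compactly. The Python sketch below is an illustrative, non-normative rendering that reuses the trailing_ones and ced helpers from the earlier sketch: if a block has trailing ones and its CED is below the threshold, the XOR of the trailing-one polarities carries the bit and the sign of the last trailing one is flipped when the XOR does not already match; if there are no trailing ones but nonzero levels exist, the LSB of the last nonzero level is replaced. Corner cases (for example a level whose magnitude drops to one) and the exact LSB convention for negative levels are not spelled out in the text and are handled here only in the simplest way.

    def polarity_xor(coeffs, t1_indices):
        """XOR of T1 polarities: 1 for a positive trailing one, 0 for a negative one."""
        bit = 0
        for i in t1_indices:
            bit ^= 1 if coeffs[i] > 0 else 0
        return bit

    def embed_bit(coeffs, w, threshold=0.25):
        """Embed one watermark bit w (0 or 1) into zig-zag-ordered coefficients."""
        t1 = trailing_ones(coeffs)
        levels = [i for i, c in enumerate(coeffs) if c != 0]
        if t1 and ced(coeffs, t1) < threshold:
            if polarity_xor(coeffs, t1) != w:
                last = t1[-1]                    # last T1 lies in the high-frequency zone
                coeffs[last] = -coeffs[last]     # sign flip encodes the bit
            return True
        if not t1 and levels:
            last = levels[-1]
            coeffs[last] = (coeffs[last] & ~1) | w   # replace LSB of the last level
            return True
        return False                             # block left untouched

    def extract_bit(coeffs, threshold=0.25):
        t1 = trailing_ones(coeffs)
        levels = [i for i, c in enumerate(coeffs) if c != 0]
        if t1 and ced(coeffs, t1) < threshold:
            return polarity_xor(coeffs, t1)
        if not t1 and levels:
            return coeffs[levels[-1]] & 1
        return None                              # no bit hidden in this block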
In the experiment, four videos: “akiyo,” “foreman,”
    “mobile,” and “news”are used as test data set. Their format information is shown in Table V. The secret data to be hided Table VI Comparison the efficiency between the original’s into the test videos is a random bit stream. and proposed method for foreman in QP = 15 QP = 15 B. The EXPERIMENT RESULTS PSNR(dB) Bit-rate(kbit) Capacity(bit) In this subsection, we demonstrate the experiment results Original 47.32 969.62 and make an explanation about the results. Three methods are without ER 45.18 1070.09 337752 considered. The original method refers to the method without With ER data hiding; the “within RDO” method represents the method T=0.5 46.35 1023.22 165019 operated in the RDO loop while the “without RDO” method T=0.1 46.35 1024.11 164923 means that it executes after the RDO stage in the T=0.05 46.36 1025.11 165190 reconstruction loop of encoder. As shown in Figs. 6 and 7, the “within RDO” method is superior to the “without RDO” in Table VII Comparison the efficiency between the original and terms of the output video bit-rate and the reconstructed video embedding method for foreman in QP = 27 PSNR. QP = 27 PSNR(dB) Bit-rate(kbit) Capacity(bit) Original 37.5 196.26 without ER 36.62 228.05 80708 With ER T=0.5 37.33 205.92 22118 T=0.1 37.33 205.7 22216 T=0.05 37.32 205.59 22273 Table VIII Comparison the efficiency between the original’s and proposed method for foreman in QP = 31 QP = 31 PSNR(dB) Bit-rate(kbit) Capacity(bit) Fig. 6. Comparison of the video quality for video foreman encoded at varying QP values Original 34.86 74.92 without ER 34.1 140.93 48152 With ER T=0.5 34.65 127.25 11449 T=0.1 34.64 126.81 11289 T=0.05 34.63 126.85 11409 From the experiments, we can observe that the degradation of bit-rate and video quality caused by embedding can be controlled effectively by adding embedding restriction. But it also raises another question. When the threshold is small, the performance is improved to a saturation degree. In other words, the effectiveness of the embedding restriction rule has a limitation level for controlling the degradation. For other test videos, we illustrate their results in terms of video quality and Fig. 7. Comparison of output bit-rate for video foreman bit-rate in Figs. 8-15. encoded at varying QP values In Fig. 7, the bit-rate of the within RDO method is higher than that of the original. This is not a desired phenomenon for some applications. We use a threshold value of CED to select appropriate DCT blocks to embed data. The number of DCT blocks that can be embedded is decreasing with the restriction threshold. This mechanism helps us to control the degradation of marked video quality, bit-rate change, and the capacity of data hiding. In the experiments, we set different threshold values T of embedding restriction rule as 1, 0.5, 0.1 or 0.05 for the “within RDO” scheme. The results are shown in Tables VI to VIII. We can find that the degradation of quality is reduced from Fig. 8. Comparison of the video quality between our method 3dB to 1dB and that the bit-rate after embedding is not and the original for video foreman at varying QP values increasing significantly by setting the restriction rule. 1129
    Fig. 9. Comparisonof the bit-rate between our method and the original for video foreman at varying QP values Fig. 13. Comparison of the video quality between our method and the original for video mobile at varying QP values Fig. 10. Comparison of the video quality between our method and the original for video akiyo at varying QP values Fig. 14. Comparison of the video quality between our method and original for video news at varying QP values Fig. 11. Comparison of the bit-rate between our method and the original for video akiyo at varying QP values Fig. 15. Comparison the video quality between our method and the original for video news at varying QP values For smaller threshold values, most of the DCT blocks in the video are excluded to modify the T1 symbols. However, it doesn’t affect the scheme because in that case it modifies the LSB of the last coefficient in the block. Therefore, for smaller threshold values, the number of DCT blocks hided using the T1 symbols is less than that of using the LSB replacement. This means that the bit-rate and video quality will be kept saturation. Only changing the LSB of the last coefficient in the block would not affect the bit-rate and PSNR significantly. Fig. 12. Comparison of the video quality between our method The capacity for each test video is shown in Figs. 16-19 and the original for video mobile at varying QP values 1130
    According to Fig.3, our proposed method does not aim at the SKIP mode blocks for data hiding. When the cost of SKIP mode is lower than others, the mode decision phase selects the SKIP mode to be the block mode, the number of SKIP mode blocks is increasing with the QP value, as the results shown in Fig. 20. Fig. 16. Comparison of the capacity between our method and the original for video foreman at varying QP values Fig. 20. Comparison of the number of SKIP mode block for video foreman encoded at varying QP values In Figs. 21 to 23, our proposed method and Shahid’s [7] are compared in terms of bit-rate, PSNR and capacity. There are two variants of our proposed method; the one with threshold value of CED T=0.1 and the other with T=0.5, respectively. When the QP values are higher than 11, Shahid’s Fig. 17. Comparison of the capacity between our method and capacity is rapidly declined due to the number of coefficients the original for video akiyo at varying QP values in high QP values is sparse. The efficiency of our method with CED is close to Shahid’s regarding the bit-rate and video quality. Fig. 18. Comparison of the capacity between our method and the original for video mobile at varying QP values Fig. 21. Comparison video quality of our proposed and Shahid for video foreman encoded at varying QP values Fig. 22. Comparison bit-rate of the number of our proposed Fig. 19. Comparison of the capacity between our method and and Shahid for video foreman encoded at varying QP values the original for video news at varying QP values 1131
    [2] S.K. Kapotas,E.E. Varsaki, A.N. Skodras, “Data Hiding in H.264 Encoded Video Sequences”, IEEE 9th Workshop on Multimedia Signal Processing, October 1-3, 2007, Crete, pp. 373-376. [3] B.G. Mobasseri, Y.N. Raikar, “Authentication of H.264 Streams by Watermarking CAVLC blocks”, SPIE Conference on Security, Steganography and Watermarking of Multimedia Contents IX, San Jose, CA, January 28-February 2, 2007. [4] G.Z. Wu, Y.J. Wang, W.H. Hsu, “Robust watermark embedding detection algorithm for H.264 video”, Journal of Electronic Imaging 14(1), 013013, 2005 [5] L. Tian, N. Zheng, J. Xue and T. Xu, “A CAVLC-Based Fig. 23. Comparison capacity of our proposed and Shahid for Blind Watermarking Method for H.264/AVC Compressed video foreman encoded at varying QP values Video”, In: Asia-Pacific Services Computing Conference, 2008. APSCC 2008, pp. 1295–1299. IEEE, Los Alamitos In Table IX, we compare the capacity performance (2008) between Shahid’s scheme and our proposed algorithm. At the [6] K. Liao, D. Ye, S. Lian, Z. Guo, J. Wang, “Lightweight same QP, our method can provide higher capacity than that of Information Hiding in H.264/AVC Video Stream”, mines, Shahid’s, and the capacity of Shahid’s is decreasing seriously vol. 1, pp.578-582, 2009 International Conference on with the QP value decreased. Multimedia Information Networking and Security, 2009 [7] Z. Shahid, M. Chaumont, W. Puech, “Considering the Table IX Comparison capacity of our method and Shaid’s for Reconstruction Loop for Data Hiding of Intra and Inter foreman at varying QP Frames of H.264/AVC”, published in European Signal Proposed method Shahid[7] Processing Conference (EUSIPCO), 2009. QP T = 0.5 T = 0.1 [8] X. Huang, Y. Abe, and I. Echizen, “Capacity Adaptive Capacity (bit) Synchronized Acoustic Steganography Scheme”, Journal 11 281591 281497 280578 of Information Hiding and Multimedia Signal Processing, 15 165019 164923 139629 Vol. 1, No. 2, pp. 72-90, Apr. 2010 19 82915 83241 67582 [9] Z.H. Wang, T.D. Kieu, C.C. Chang, M.C. Li, A Novel 23 40620 40652 29851 Information Concealing Method Based on Exploiting 27 22118 22216 12108 Modification Direction Journal of Information Hiding 31 11449 11289 4357 and Multimedia Signal Processing, Vo1. 1, No. 1, pp. 1-9, Jan. 2010 V. CONCLUSIONS [10] K. Sühring, H.264/AVC Reference Software Group [On-line]. Available: http://iphome.hhi.de/suehring/tml/, In this paper, we propose a data hiding algorithm that has Joint Model 12.2 (JM12.2), Jan. 2009. considered the rate distortion performance for H.264/AVC standard. The algorithm can control the increase of bit-rate and decrease of PSNR after hiding secret data into the videos at the cost of reducing the capacity of data to be hided. The information is hided in the T1 symbols of CAVLC domain in H.264/AVC encoder. In order to reduce the propagation of hiding modification to the subsequent blocks, the proposed algorithm can selection those blocks with minor energy change to hide data. With the selection scheme, the proposed algorithm can control the threshold value to adjust adaptively the capacity for different application requirements. ACKNOWLEDGEMENT This research is supported in part by National Science Council, Taiwan under the grant NSC 98-2221-E-150-051 REFERENCES [1] G. Qiu, P. Marziliano, A. Ho, D. He, Q. Sun, “A Hybrid Watermarking Scheme for H.264 Video”, Processing of the 17th International Conference on Pattern Recognition, ICPR, vol.4, pp.865-868, Aug. 2004. 1132
    Secret-fragment-visible Mosaic —a New Image Art and Its Application to Information Hiding I-Jen Lai (賴怡臻) Wen-Hsiang Tsai (蔡文祥) Institute of Computer Science and Engineering Dept. of Computer Science National Chiao Tung University, Hsinchu, Taiwan National Chiao Tung University, Hsinchu, Taiwan Email: [email protected] Email: [email protected] Abstract—A new type of art image called secret-fragment- Dobashi et al. [3] improved the voronoi diagram to allow a visible mosaic image is created, which is composed of user to add various effects to the mosaic image, such as rectangular-shaped fragments yielded by division of a secret simulation of stained glasses. Kim and Pellacini [4] image. To create this kind of mosaic image, the 3D RGB color generated jigsaw image mosaic composed of many arbitrary space is transformed into a 1-dimensional h-colorscale based shapes of tiles selected from a database. Extending the on which a new image similarity measure is proposed; and the most similar candidate image from an image database is concept of [4], Blasi et al. [5] presented a new mosaic image selected accordingly as a target image. Then, a greedy called puzzle image mosaic. Lin and Tsai [6] embedded algorithm is adopted to fit every tile image in the secret image secret data in image mosaics by adjusting regions of into a properly-selected block in the target image, resulting in boundaries and altering pixels’ color values. Wang and Tsai an effect of embedding the secret image fragmentally and [7] hid data into image mosaics by utilizing overlapping visibly in the composed mosaic image. In addition to this type spaces of component images. Hung and Tsai [8] embedded of secret image hiding, secret message bits may be embedded data into stained-glass-like mosaic images by modifying the as well for the purpose of covert communication. Based on the tree structure used in the creation process. Hsu and Tsai [9] fact that tile images in an identical bin of the histogram of the presented a new type of art image, circular-dotted image, created mosaic image have similar colors, all the tile images in each histogram bin are reordered pairwisely and their relative and used the characteristics of its creation processes to hide positions are switched accordingly, to embed secret message secret messages in the generated art image. Chang and Tsai bits without creating noticeable changes in the resulting mosaic [10] proposed a new type of art image, called tetromino- image. The embedded message is protected by a secret key, and based mosaic, which is composed of tetrominoes appearing may be extracted from the stego-image using the key. in a video game. Data hiding is made possible by distinct Additional security measures are also discussed. Experimental combinations and color shifting of the tetromino elements. results show the feasibility of the proposed methods. A new type of art image, called secret-fragment-visible Keywords: secret-fragment-visible mosaic image, covert mosaic image, which contains small fragments of a secret communication, data hiding. source image is proposed in this study. Observing such a type of mosaic image, people can see all of the fragments of I. INTRODUCTION the secret image, but the fragments are so tiny in size and so Mosaics are artworks created from composing small random in position that people cannot figure out what the pieces of materials, such as stone, glass, tile, etc. 
Nowadays, source image look like, unless they have some way to they are used popularly for decorating houses and other rearrange the pieces back into their original positions, using constructions. Creation of mosaic images by computers is a a secret key from the image owner. Therefore, the source new research topic in recent years. Traditional mosaic image may be said to be secretly embedded in the resulting images are obtained by arranging a large number of small mosaic image, though the fragment pieces are all visible to images, called tile images, in a certain manner so that each an observer of the image. And this is just why we name the tile image represents a small piece of a source image, named resulting image as a secret-fragment-visible mosaic image. target image. Consequently, while we see a mosaic image In the remainder of this paper, the proposed mosaic image from a distance, as a whole it will look like its source creation process will be described in Section II, a covert image — an effect of a human vision property. Many communication method via secret-fragment-visible mosaic methods have been proposed to create different types of images will be proposed in Section III, and some mosaic images [1-8]. experimental results will be presented in Section IV, Haeberli [1] proposed a method for mosaic image followed by conclusions in Section V. creation using voronoi diagrams by placing the sites of II. PROPOSED MOSAIC IMAGE CREATION PROCESS blocks randomly and filling colors into the blocks based on the content of the original image. Hausner [2] created tile The proposed mosaic image creation process is composed mosaic images by using centroidal voronoi diagrams. of two major stages. The first is the construction of a 1133
    database which canbe used later to select similar target above defines a 1-D h-colorscale. The resulting image images for given secret images. The quality of a constructed created by our method is given in Fig. 1(b), which secret-fragment-visible mosaic image is related to the contrastively has less noise when compared with Fig. 1(a). similarity between the secret image and the target image; the selected target image should be as similar to the secret image as possible. An appropriate similarity measure for this purpose is proposed in this study and described later. The other stage is the creation of a desired mosaic image using the secret image and the target image as input. In this stage, the secret image is divided into fragment pieces as tile images, which then are used to create the mosaic image. The number of tile images is limited by the size of the secret (a) (b) image and that of the tile images. Note that this is not the Figure 1. Effects of mosaic image creation using different color case in traditional mosaic image creation where available similarity measures (a) Image created with similarity measure tile images for use to fit into the target image are unlimited of [12]. (b) Image created with proposed similarity measure. in number. In order to solve this problem of fitting a limited Furthermore, to compute the similarity measure between number of tile images into a target image, a greedy a tile image in the secret image and a target block in an algorithm is proposed, which is described later as well. image in a database for use in tile-image fitting in generating 2.1 Database Construction a mosaic image, we propose a new feature, called h-feature, for each block image C (either a tile image or a target block), The database plays an important role in the secret- denoted as hC, which is computed by the following steps: fragment-visible mosaic image creation process. If a target image is dissimilar to a secret image, the created image will 1. compute the average of the color values of all the be distinct from the target one. In order to generate a good pixels in C as (RC, GC, BC); result, the database so should be as large as possible. 2. re-quantize (RC, GC, BC) into (rC′, gC′, bC′) using the Searching a database for a target image with the highest new Nr, Ng, and Nb color levels; and similarity to the secret image is a problem of content-based 3. calculate the h-feature hC for C by Eq. (2) above, image retrieval. A technique to solve this problem is to base resulting in the following equation: the similarity on 1-D color histogram transformation [12] of hC(rC′, gC′, bC′) = bC′ + NbrC′ + NbNrgC′. (3) the color distribution of the image. The transformation maps With Nr, Ng, and Nb all set equal to 8, the range of the the three color channel values into a single value. computed values of the h-feature fC above may be figured out Specifically, each color channel is re-quantized first into to be from 0 to 584. The proposed algorithm for constructing fewer levels, yielding a new image I′ with a lower resolution a database of candidate images for use in generating secret- in color specified by (r′, g′, b′). Let Nr, Ng, and Nb denote the fragment-visible mosaic images is described in the following. numbers of levels of the new color values r′, g′, and b′, respectively. Then, for each pixel P′ in I′ with new colors (r′, Algorithm 1: construction of candidate image database. 
Input: a set S of images, a pre-selected tile image size Zt, g′, b′), the following 1-D function value f is computed: and a pre-selected candidate image size Zc. f(r′, g′, b′) = r′ + Nrg′ + NrNgb′.  Output: a database DB of candidate images with size Zc and However, according to our experimental experience using their corresponding h-colorscale histograms. this 1-D color function f, it is found inappropriate for our Steps: study here where the human’s visual feeling of image Step 1. For each input image I, perform the following steps. similarity must be emphasized, as shown by Fig. 1(a). 1.1 Resize and crop I to yield an image D of size Zc. Therefore, we propose a new function h as follows: 1.2 Divide D into blocks of size Zt. 1.3 For each block C of D, calculate and round off the  h(r′, g′, b′) = b′ + Nbr′ + NbNrg′  h-feature value hC described by Eq. (3). where the numbers of levels, Nr, Ng, and Nb, are all set to be 1.4 Generate a histogram H of the h-feature values of all 8. Differently from the case in (1), we set in (2) the largest the blocks in D. weight NbNr to the green channel value g′ and the smallest 1.5 Save H with D into the desired database DB. weight 1 to the blue channel value b′. The reason is that the Step 2. If the input images are not exhausted, go to Step 1; eyes of human beings are the most sensitive to the green otherwise, exit. color, and the least sensitive to the blue one. In addition, 2.2 Similarity Measure Computation with all of Nr, Ng, and Nb set to 8 in (2), an advantage of speeding up the process of mosaic image creation can be Before generating a mosaic image, we have to choose as obtained according to our experiments. Subsequently, we the target image the most similar candidate image from the will say that the new color feature function h we propose database based on the given secret image content. For this, we define a difference measure e between the 1-D histogram 1134
    HS of thesecret image S and that of a candidate image D in edge of the graph with its label taken to be that of the tile the database in the following way: image and its weight taken to be the average Euclidean 584 distance between the pixels’ colors of the selected tile image  e   Hs  m  HD  m   and those of the target block. Accordingly, we can build a m 0 tree structure as the graph for this problem, as shown by Fig. where m stands for a h-feature value. The smaller the value e 2. is, the more similar the candidate image D is to the secret image S. After calculating the errors of all the images in the database, we can select the one with the smallest error as the desired target image for use in mosaic image generation. The detail of selecting the most similar candidate image from a database is given as follows. Algorithm 2: selection of the most similar candidate image as a target image. Input: a secret image S, a database DB of candidate images, and the sizes Zt and Zc mentioned in Algorithm 1. Output: the target image T in DB which is the most similar to S. Figure 2. A tree structure of fitting tile images to target blocks. Steps: Step 1. Resize S to yield an image S′ of size Zc to become of In order to find the optimal solution, we may utilize the the same size as the candidate images in DB. Dijkstra algorithm whose the running time for getting an Step 2. Divide S′ into blocks of size Zt, and perform the optimal answer is O(|V|2), where V denotes for the number following steps. of vertices in the tree. Unfortunately, according to Fig. 2 the N 1 2.1 For each block C of S′, calculate its h-feature value number of vertices in this problem is   1)!/n!] where n1 hC by Eq. (3) and round off the result. 2.2 Generate a 1-D h-colorscale histogram HS′ for S′ N is the number of target blocks which is larger than 40,000 from the h-feature values of all the blocks in S′. for images used in this study, and so the computation time Step 3. For each candidate image D with 1-D h-colorscale for getting an optimal solution for such a large N is histogram HD in DB, perform the following steps. obviously too high to be practical! This means that we have 3.1 Compute the difference measure e between HS' and to find other feasible solutions to solve this problem. HD according to Eq. (4) described above. The solution we propose is to use a greedy algorithm. We 3.2 Record the value e. calculate the average Euclidean distance between the pixels’ Step 4. If the images in DB are not exhausted, go to Step 3; colors of a tile image T and those of a target block B as the otherwise, continue. similarity measure between T and B; and then use the Step 5. Select the image in DB which has the minimum measure as a selection function for the greedy algorithm to difference measure e and take it as the desired target select the most similar target block for tile image fitting. image T. However, as shown by the example of Fig. 4(a) which is the result of using such a greedy algorithm to fit the tile images 2.3 Algorithm for Secret-fragment-visible Mosaic Image of the secret image, Fig. 3(a), into the target image, Fig. 3(b), Creation the algorithm is found unsatisfactory, yielding often a result Before presenting the algorithm for creating the proposed with the lower part of the target image being filled with mosaic images, we discuss some problems which are some fragment pieces of inappropriate colors. 
This encountered in the creation process and present the solutions phenomenon comes from the situation that the number of we propose to solve them. tile images obtained from the secret image, Fig. 3(a), is limited by the secret image’s own size, so that the tile A. Problem of fitting tile images optimally and proposed images available for choice to fit the target blocks in Fig. solution 3(b) become less and less near the end of the fitting process. The first problem faced in the creation process is how to As a result, the similarity differences between the later-fitted find an optimal solution for fitting a tile image of the secret tile images and the chosen target blocks become bigger and image into an appropriate target block in a target image bigger than the earlier-fitted ones, yielding a poorly-fitted selected by Algorithm 2. For this, it seems that we can bottom part like that shown in Fig. 4(a). reduce it to a single-source shortest path problem. The A solution to this problem found in this study is to use the shortest path problem is one of finding a path in a graph previously-proposed h-feature to define the selection with the smallest sum of between-vertex edge weights. The function for the greedy algorithm. This feature takes the state of fitting a tile image may be represented by a vertex global color distribution of an image into consideration, of the graph. And the action of selecting the most similar which helps creation of a mosaic image with its content tile image for each target block may be represented by an resembling the target image more effectively, as shown by 1135
    the example ofFig. 4(b) which is an improvement of Fig. of proposed secret-fragment-visible mosaic images is 4(a). described in the following. (a) (b) (a) (b) Figure 3. Input images. (a) A secret image. (b) A selected target image. Figure 5. Input images. (a) A secret image. (b) A selected target image. (a) (b) Figure 6. Resulting images. (a) Image created without the proposed (a) (b) remedy method, which is four times as large as (b). (b) Figure 4. Resulting images using different similarity measures. (a) Image created with the proposed remedy method. Image created using Euclidean distance to define select function of greedy algorithm. (b) Image created using h- feature to define select function of greedy algorithm. Algorithm 3: mosaic image creation. Input: a secret image S, a database DB, and a selected size B. Problem of small-sized candidate image database and Zt of a tile image. proposed solution Output: A secret-fragment-visible mosaic image R. A second problem faced in the mosaic image creation Steps: process is how to deal with a database which is not large Stage 1  embedding secret image fragments into a enough. This problem will cause the selection of an selected target image. insufficiently similar image from the database as the target Step 1. Crop S to yield an image S′ which is divisible by image for a given secret image. As a result, the created size Zt. mosaic image will look unlike the target one, as shown by Step 2. Perform following steps to select a target image T the example of Fig. 6(a), a mosaic image created with Figs. from DB. 5(a) and 5(b) as the secret and target images, respectively. 2.1 Select a candidate image as T from DB by To solve this problem, during the candidate image Algorithm 2. selection process, after the difference measure between a 2.2 If the difference measure e computed in Step 3.1 of secret image and a candidate image is computed, if the Algorithm 2 is larger than a pre-selected threshold computed value is large, the selected target image is Th, then enlarge the size of T e Th  times.   regarded inappropriate for the creation process. In this case, Step 3. Obtain a block-label sequence L1 of S′ by we enlarge the size of the selected target image as a remedy. calculating and sorting the h-feature values of all The reason is that if the size of the target image is larger tile images in S. than that of the secret image, the number of target blocks, or Step 4. Obtain a block-label sequence L2 of T by equivalently, the number of possible positions, to fit each calculating and sorting the h-feature values of all tile image, will become larger, yielding in general a better target blocks in T. fitting result. In this way, the resulting mosaic image will Step 5. Fit the tile images of S′ to the target blocks of T become better visually than before, as shown by the based on the one-to-one mappings from the ordered example of Fig. 6(b). labels of L1 to those of L2, thus completing C. Algorithm for secret-fragment-visible mosaic image embedding of all the tile images in S′ into all the creation target blocks of T according to the greedy criterion. According to above discussions, an algorithm for creation Stage 2  dealing with unfilled target blocks. 1136
    Step 6. Performthe following steps to fill each remaining unfilled target blocks, B, in T if there is any. 6.1 Compute the difference e′ between the h-feature hB of B and the h-feature hA of each of the tile images, A, in S′ by the following equation: e′ = |hB  hA|. (5) 6.2 Pick out the tile image Ao with the smallest (a) (b) difference eo′ and compare eo′ with another pre- selected threshold Th′ to conduct either of the following two operations: A. if eo′ Th′, then fill the tile image Ao into the target block B; B. if eo′  Th′, then fill the averages of the R, G, and B values of all the pixels in B into B. Stage 3  generating the desired mosaic image. Step 7. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T. 2.4 Experimental Results of Mosaic Image Creation Some mosaic images generated by the above algorithm (c) are shown in Figs. 7 and 8. Note that in either figure, the secret image of (a) may be thought to have been embedded Figure 8. Another example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible into the target image of (b) to yield the stego-image of (c). mosaic image. The database used in running the algorithm includes 841 candidate images. The size of this database is regarded as large enough because the remedy measure of target image III. COVERT COMMUNICATION VIA SECRET-FRAGMENT- enlargement is rarely used in the mosaic image creation VISIBLE MOSAIC IMAGES process in our experiments. 3.1 Idea of Proposed Covert Communication Method In the proposed mosaic image creation process, tile images with the same h-feature values appear to have similar colors. Each tile image is fitted into a corresponding target block based on the one-to-one mappings established between the two label sequences of the secret image and the selected target image. Note that both sequences have been sorted according to the h-feature values of their image (a) (b) blocks. They are said to have been h-sorted. As a result of such h-sorting, every pair of neighboring labels in either sequence specify two image blocks with similar h-feature values, implying that the average colors of the two blocks are essentially visually similar. The main idea of secret embedding in the proposed covert communication method is to switch the orders of the target blocks in the h-sorted label sequence of the target image during the mosaic image creation process to embed message bits, thus achieving the goal of hiding a secret message into a secret-fragment-visible mosaic image imperceptibly. More specifically, after the label switching, if a leading label is smaller than the following one in the target block (c) label sequence, then a bit “0” is regarded to have been embedded in the two neighboring labels; otherwise, a bit Figure 7. An example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic “1” is regarded as embedded there. Furthermore, as shown image. by the example of Fig. 9, because the tile images which 1137
    correspond to thetarget blocks with switched labels have the Step 3. Perform Step 3 to 4 of Algorithm 3 to obtain the h- similar average colors as mentioned previously, after the sorted label sequences L1 and L2 of S′ and T, message is embedded, no visually perceptible difference respectively. will arise in the resulting mosaic image. Step 4. Group the labels of L1 and L2 by the following steps. 4.1 Group the labels of L1 based on the h-feature values of the tile images in S′, with each resulting 1 2 3 4 21 1 2 3 4 21 group including the labels of a set of tile images 5 6 7 8 5 6 7 8 having the same h-feature values. 59 59 4.2 Group the labels of L2 based on the grouping of L1 9 10 11 12 9 10 11 12 obtained in Step 4.1, resulting in groups of labels, G1, G2, …, Gm, with each group Gi including the Target blocks Tile images Target blocks Tile images labels of a set of target blocks whose corresponding tile images have the same h-feature values. (a) (b) Figure 9. Label switching and corresponding corresponding target block Stage 2  embedding the secret message M. exchange. (a) The original one. (b) After switching the Step 5. Generate the histogram H of the h-feature values of corresponding target blocks of tile images. all the tile images in the resized secret image S′. Step 6. Transform the message to be embedded, M, into a 3.2 Modified Secret-fragment-visible Mosaic Image bit string M′. Creation Process for Secret Message Embedding Step 7. Perform the following steps to embed the bits of M′ In the proposed covert communication method, the into L2. mappings of the labels of the tile images to those of their 7.1 Select the smallest unprocessed h-feature value hi corresponding target blocks is recorded in a recovery whose histogram value H(hi) is larger than or equal sequence LR for use in later data recovery. An illustration is to two. shown in Fig. 10. Embedding of LR is then accomplished by 7.2 Take out the group Gi of labels in L2 corresponding hiding the labels of LR into the tile images randomly by the to the h-feature value hi. lossless LSB-modification scheme [11] controlled by a 7.3 Take out the first two unprocessed labels l1 and l2 secret key. The detailed algorithm for secret message in Gi, and switch the order of l1 and l2 in L2 if the embedding is now given as follows, which is a modified following two conditions are satisfied, assuming version of Algorithm 3. that the first unembedded bit in M is denoted as b: A. b = 0 and l1 l2; B. b = 1 and l1 l2. 7.4 Repeat Step 7.3 until Gi includes at most one label, which is left untouched. 7.5 Repeat Steps 7.1 through 7.4 until the bits of M′ are exhausted. Step 8. Form an extra string M′ of 8 bits of “0” as the ending signal of the input message M, and embed it into L2 by Step 7 above. Step 9. Fit the tile images of S′ into the target blocks of T based on the one-to-one mappings from the labels Figure 10. An illustration of generation of a recovery sequence LR. of L1 to those of the re-ordered L2 obtained in Steps 7 and 8 (denoted as L2′ subsequently), and let the Algorithm 4: embedding a message into a secret-fragment- resulting image be denoted as T′. visible mosaic image. Stage 3  dealing with unfilled target blocks, generating Input: a secret image S, a secret key K, the size Zt of tile and embedding the recovery sequence, and images, a database DB, and a secret message M. generating the desired mosaic image. Output: a secret-fragment-visible mosaic image R into Step 10. 
Perform Step 6 of Algorithm 3 to fill each of the which M is embedded. remaining unfilled target blocks if there is any. Steps: Step 11. Sort all the labels in L1 by their corresponding h- Stage 1  embedding secret image fragments into a feature values, re-order accordingly the selected target image. corresponding labels in L2′, take the re-ordering Step 1. Crop S to yield S′ with a size divisible by Zt. result as a recovery sequence LR, and transform it Step 2. Select a target image T for S′ with histogram H into a binary string. from the database DB by Algorithm 2. Step 12. Embed the width and height of S′ as well as the 1138
    size Zt intothe first ten pixels of image T′ in a key K. raster-scan order by the LSB modification scheme. Step 3. Compose the desired secret image S based on the Step 13. Embed the data of LR by the same scheme into sequence LR by extracting the tile images fitted in R unprocessed tile images of T′ randomly selected by in order and placing them at correct relative the secret key K. positions. Step 14. Generate as output an image R obtained by Stage 2  regaining the h-sorted label sequences. composing all the tile images fitted at their Step 4. Get the h-sorted label sequence L1 of the tile respective positions in T′. images of the recovered secret image S, and group the labels of L1 based on the h-feature values of the 3.3 Secret Extraction Process tile images, with each resulting group including the In the proposed secret message extraction process, we labels of the tile images having the same h-feature extract the recovery sequence LR first and retrieve values. accordingly the original secret image S. Also, by calculating Step 5. Perform the following steps to get the h-sorted the h-feature values of the original secret image, we regain label sequence L2 of the target image T. the h-feature values of the tile images and sort them to get 5.1 Get the re-ordered block sequence QT of the target the h-sorted label sequence L1. blocks in T by one-to-one-mapping the labels in Next, as illustrated by Fig. 11, the recorded sequence LR, sequence L1 to those of LR. though including only the labels of L2, essentially specifies 5.2 Get a new h-sorted label sequence L2 from the one-to-one mappings between the tile images and the target labels of the re-ordered block sequence QT. blocks. Therefore, we may regain the h-sorted label 5.3 Group the labels of L2 based on the grouping of L1 sequence L2 of the target blocks from the corresponding conducted in Step 4 with each group Gi including mappings from L1 to LR. Then, with the histogram H of the the labels of the target blocks whose corresponding h-feature values of all the tile images, we may group the tile images have the same h-feature values, hi. labels of sequences L1 and L2, and then examine the orders Stage 3  extracting the embedded secret message M. of the labels of L2 to extract the embedded secret message, Step 6. Generate the histogram H of the h-feature values of in a way reverse to the message embedding process as all the tile images in the recovered secret image S. described in Algorithm 4. Step 7. Perform the following steps to extract the bits of secret message M. 7.1 Select the smallest h-feature value hi whose histogram value H(hi) is larger than or equal to two. 7.2 Take out the group Gi of labels in L2 corresponding to hi. 7.3 Take out the first two unprocessed labels l1 and l2 in Gi, extract a hidden message bit b by the following rule, and append it to the end of a bit version of the message, denoted as D: A. if l1  l2, then set b = 0; B. if l1 l2, then set b = 1. 7.4 Repeat Step 7.3 until Gi includes at most one label, which is then left untouched. 7.5 Repeat Steps 7.1 through 7.4 if the 8-bit end signal Figure 11. An illustration of the regaining of the label sequence L2. is not extracted (i.e., if the last extracted 8 bits are not a sequence of eight “0’s”). Algorithm 5: secret image recovery and secret message Step 8. Transform every 8 bits of D into characters as the extraction. desired secret message M. Input: a secret-fragment-visible mosaic image R, and a secret key K identical to that used in Algorithm 4. 
3.4 Experimental Results Output: a recovered secret image S, and the secret message An example of experimental results using Algorithms 4 M supposedly embedded in R. and 5 is given in Fig. 12. The average difference measure Steps: value at the block level between Fig. 8(c) and Fig. 12 Stage 1  retrieving the secret image S. (computed as the sum of all the Euclidean distances divided Step 1. Retrieve the width and height of S′ as well as the by the number of blocks) is 0.05, and the PSNR of Fig. 12 size Zt of tile images from the LSB’s of the first ten with respect to Fig. 8(c) is 66.6 which is quite satisfactory, pixels of image R. meaning that the proposed information hiding method Step 2. Extract the recovery sequence LR from the LSB’s (implemented by Algorithms 4 and 5) provides a good effect of blocks in R randomly selected using the secret on covert communication. 1139
    colorscale to representthe color distribution of an image more effectively, based on which a new h-feature is proposed for measuring image similarity. A greedy algorithm is proposed accordingly for fitting the tile images of the secret image into appropriate target blocks more efficiently. A remedy method has also been proposed to solve the problem of using a small-sized database, which enlarges a selected target image in proportion to the difference measure between the secret and the target images. For the proposed data hiding method used in covert communication via secret-fragment-visible mosaic images, it was observed that the tile images in an identical bin have (a) similar colors. By switching the relative positions of the target blocks corresponding to such tile images, we can embed secret message bits into a secret-fragment-visible mosaic image imperceptibly. Future works may be directed to allowing users to select target images freely to create secret-fragment-visible mosaic images. This seems achievable by applying a reversible color shifting technique to fit the color distribution of the secret image to a selected target image. REFERENCES [1] P. Haeberli, “Paint by numbers: abstract image representations,” Proc. SIGGRAPH 99, pp.207-214, Dallas, USA, 1990. [2] A. Hausner, “Simulating decorative mosaics,” Proceedings of 2001 International Conf. on Computer Graphics Interactive Techniques (SIGGRAPH 01), Los Angeles, USA, August 2001, pp. 573-580. [3] Y. Dobashi, T. Haga, H. Johan and T. Nishita, “A method for creating (b) mosaic image using voronoi diagrams,” Proc. 2002 European Association for Computer Graphics (Eurographics 02), Saarbrucken, Germany, September 2002, pp. 341-348. [4] J. Kim and F. Pellacini, “Jigsaw image mosaics,” Proc. 2002 International Conf. on Computer Graphics Interactive Techniques (SIGGRAPH 02), San Antonio, USA, July 2002, pp. 657-664. [5] G. D. Blasi, G. Gallo and M. Petralia, “Puzzle image mosaic,” Proc. 2005 Int’l Association of Science Technology for Development on Visualization, Imaging Image Processing (IASTED/VIIP 2005), Benidorm, Spain, Sept. 2005. [6] W. L. Lin and W. H. Tsai, “Data hiding in image mosaics by visible boundary regions and its copyright protection application against print-and-scan attacks,” Proc. 2004 Int’l Computer Symp. (ICS 2004), Taipei, Taiwan, Dec. 15-17, 2004. [7] C. C. Wang and W. H. Tsai, Creation of Tile-overlapping mosaic images for information hiding, Proc. 2007 Nat’l Computer Symp., Taichung, Taiwan, Dec. 20- 21, 2007, pp. 119-126. [8] S. C. Hung, D. C. Wu and W. H. Tsai, “Data hiding in stained glass images,” Proc. 2005 Int’l Symp. on Intelligent Signal Processing Communications Systems, June 2005, Hong Kong, pp. 129-132. (c) [9] C. Y. Hsu and W. H. Tsai, “Creation of a new type of image - circular Figure 12. An example of covert communication. (a) A mosaic image dotted image - for data hiding by a dot overlapping scheme,” Proc. into which messages are hidden. (b) Resulting image and 2006 Conf. on Computer Vision, Graphics Image Processing, extracted messages using a right key. (c) Resulting image and Taoyuan, Taiwan, Aug. 13-15, 2006. extracted messages using a wrong key. [10] C. P. Chang and W. H. Tsai, “Creation of a new type of art image  tetromino-based mosaic image  and protection of its copyright by losslessly-removable visible watermarking,” Proc. 2009 Nat’l Computer Symp., Taipei, Taiwan, Nov. 27-28, 2009, pp. 577-586. IV. CONCLUSIONS AND SUGGESTIONS [11] D. Coltuc and JM. 
Chassery, “Very fast watermarking by reversible A new type of art image  secret-fragment-visible mosaic contrast mapping,” IEEE Signal Processing Letters, vol. 14, no. 4, pp. 255-258, April 2007. image, and a data hiding technique have been proposed for [12] J. R. Smith and S. F. Chang, “Tools and techniques for color image secret image hiding and covert message communication, retrieval,” Proc. Society for Imaging Science Technology SPIE respectively. For the former, we have proposed a new 1-D h- (IS T/SPIE), vol. 2670, Feb. 1995, pp. 2-7. 1140
    A Practical Designof High-Volume Steganography in Digital Videos Ming-Tse Lu, Po-Chyi Su and Ying-Chang Wu Dept. of Computer Science and Information Engineering National Central University Jhongli, Taiwan Email: [email protected] Abstract—In this research, we consider to exploit the available to most of the people and their transmission is large volume of audio/video data streams in compressed increasingly popular. This research aims at developing a video clips/files for effective steganography. By observing steganographic scheme for popular digital video files. that most of the widely distributed video files employ H.264/AVC and MPEG AAC for video/audio compression, H.264/AVC is the state-of-the-art video codec and its we examine the coding features in these data streams to decent coding performance lends itself to become the determine good choices of data modifications for reliable major coding mechanism in various applications. The most and acceptable information hiding, in which the perceptual popular digital video formats/containers for file sharing quality, compressed bit-stream length, payload of embedding, nowadays, including FLV (Flash Video), MKV (Matroska effectiveness of extraction and efficiency of execution are taken into account. Experimental results demonstrate that Multimedia Container), AVI (Audio Video Interleave) and the payload of the selected features for achieving a good MP4, etc., support H.264/AVC so we choose H.264/AVC balance among several constraints can be more than 10% of as the host. In addition, since FLV has become very the compressed video file size. popular in file sharing these days, we wrap the resulting Keywords-Steganography; H.264/AVC; MPEG; AAC; in- H.264/AVC video bit-stream into a FLV file for the future formation hiding; usage. As FLV files contain both video and audio data streams for playback, we will make use of both video and I. I NTRODUCTION audio data streams to embed as much secret information as Digital videos are widely available nowadays thanks to possible. The chosen audio format is MPEG AAC, which the fast advances of increasingly cheaper yet powerful is usually adopted by FLV files. computer facilities and broadband internet technologies. Two embedding scenarios may be considered in this It is now possible to stream high-quality videos on the application. First, an user may acquire a compressed video Internet and such web sites as YouTube, Yahoo! Video, file that is not coded by H.264/AVC, e.g. an MPEG2 or DailyMotion, etc. offer free video viewing, sharing or MPEG4 related file. In order to embed the information, downloading services. Watching videos anytime and any- this video file will be transcoded into an H.264/AVC bit- where may become people’s daily activity as the portable stream so that the secret information can be embedded devices may become more and more popular. As a re- during the encoding process. The resultant H.264/AVC sult, digital videos are ubiquitous and will be the major bit-stream will then become the video stream of an FLV circulated multimedia content. Due to the large volume file. If the input video is already an FLV file compressed of digital videos, data compression is usually applied to by H.264/AVC, the information may be embedded more facilitate their transmission and storage. Since human’s efficiently since the existing coding parameters in the perceptual models are not perfect, lossy compression is original video file can be referenced. 
It should be noted usually preferred to increase the coding efficiency of that both of the embedding procedures will be carried out digital videos without affecting human’s perception. In in the encoding phase so the transcoding is always needed. other words, there exists certain redundancy in digital To achieve the high-volume information hiding and to video files. Nevertheless, in the viewpoint of communica- retain the fidelity of the audio/visual data, we investigate tion, this redundancy can serve as an “invisible” channel combinations of embedding methods to satisfy most of and, if one can make good use of it, the high-volume the requirements or restrictions. The paper is organized secret communication using digital videos as a camouflage as follows. Some previous works will be described in is achievable. The secret communication is also termed Sec. II and our proposed scheme is delineated in Sec. III. “steganography”, which means “cover writing”, and can be Experimental results will be shown in Sec. IV to validate applied to transmit sensitive information between trusted the trade-offs that we make among several different re- parties or when the encryption is not allowed or safe quirements. The conclusive remarks are given in Sec. V. in the normal communication channel. There are a few requirements in steganography, including the high payload II. R EVIEW OF THE R ELATED W ORKS of hidden information, unobtrusiveness of the distortion, Unlike digital watermarking, in which the embedded security and reliability. To achieve the secure, high-volume information should be able to withstand some common and reliable covert communication, digital videos can processing attacks such as re-compression at a different serve as a good host, especially when these files are bit-rate, random video frame dropping, resizing, etc., 1141
    the high-volume steganographyemphasizes more on the III. T HE P ROPOSED S CHEME payload, reliability and the difficulty of detection even A. System Overview with steganalysis [1], which is a process for revealing the The block diagram of our proposed scheme is shown in existence of certain hidden information in a suspicious Fig. 1. A video file is parsed first to extract the video and video. Of course, the quality should always be main- audio data streams. As the transcoding will be applied, the tained to avoid affecting its applications. Some data hiding video and audio decoder will extract the compressed bit- schemes in digital videos [2]–[7] have been proposed. As streams into raw data. After obtaining the reconstructed the video file consists of a large number of frames, the video and audio signals, H.264/AVC and AAC encoders similar data hiding techniques of still images may also will encode the raw data right after they are extracted and be applied on videos. The most widely used technique the embedding procedure is triggered. If the input video to hide the data in digital images or video data is the file is processed by H.264/AVC already, a mode copying usage of the Least-Significant Bit (LSB) modification [8], procedure that records features of the original video stream [9], in which the LSB’s of samples, usually coefficients or may be applied to speed up the whole process. The hidden quantization indices if the compressed data are used, are information can be extracted efficiently in the decoder. substituted by the secret message bits. JStego, F3, F4 and F5 described in [8] are popular approaches. In the original ¤£¢¡  JStego algorithm, the LSB’s of JPEG residual coefficients ©¨§¦¥ ¢ ¨ ¨ are overwritten with the binary secret message consisting ¨¡¦ of “0” and “1”. JStego skips the embedding operation CD 32©1¨§©0 EFG when it encounters 0 and ±1 to avoid generating zero ¢ ©¨§¦¥ P HI values, which will cause ambiguity to the hidden infor- mation extraction. Other values are grouped into pairs, i.e. ¨$©( ©¨§¦¥ ! ©¦§'% %% £¨$'¢¨ $¨§©#¨ $¨§©#¨ (±2, ±3), (±4, ±5)... In F3 algorithm, the LSB of non- ©¦¢)$©¦ ¦§§¨A)4 zero coefficients will be matched with the secret message after the information embedding, which decreases the ©¨§¦¥ ! ©¦§'% %% absolute values of coefficients. If the coefficient becomes ¦££%1¨§©0 $¨§©#4 $¨§©#4 ¨$#¨( zero after this modification operation, we will embed ¨¢££¨0 this bit once again in the next sample. F4 algorithm is ©¨§¦¥ ¤£¢¡  developed to complement the weakness of F3 algorithm. In $¨'0 986765 F4 algorithm, a negative coefficient is presented inversely. @ ¡©$© In F5 algorithm, permutating straddling process is adopted @ ¢¢ RQ B¨¡¦ © §4 to improve the perceptive characteristic and enhance the security level. In addition, the so-called matrix coding is Figure 1. The flowchart of the proposed scheme applied to avoid modifying too many data samples. In addition to modifying the coefficients, some re- searchers employ the characteristics of popular video B. Steganography in H.264/AVC compression standards. Fang i.e. [4] proposed to embed H.264/AVC offers a much better compression perfor- the data into motion vectors’ phase angle in the inter- mance than the existing video codecs due to its vari- frames. Wang et al. [10] utilized motion vectors in P and ous encoding tools. As previous video coding standards, B-pictures as the data carriers for hiding the copyright H.264/AVC is based on motion compensated, DCT-like information. 
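The coefficient-domain techniques summarized above are easy to illustrate. The sketch below follows the F3 behaviour as described here (not the original implementation): the LSB of each non-zero quantized coefficient is matched to the next message bit by decreasing the coefficient's absolute value, and a coefficient that shrinks to zero forces the same bit to be re-embedded in the next usable coefficient.

    def f3_embed(coefficients, message_bits):
        """F3-style LSB matching on quantized coefficients (illustrative only)."""
        out, k = list(coefficients), 0
        for i, c in enumerate(out):
            if k >= len(message_bits):
                break                              # message exhausted
            if c == 0:
                continue                           # zeros never carry data
            if (abs(c) & 1) == message_bits[k]:
                k += 1                             # LSB already matches the bit
            else:
                c = c - 1 if c > 0 else c + 1      # move the magnitude toward zero
                out[i] = c
                if c != 0:
                    k += 1                         # bit embedded
                # if c collapsed to zero, the same bit is retried on the next sample
        return out, k                              # k = number of bits embedded

    # Example: the first and third coefficients already match, the second is shrunk.
    print(f3_embed([3, -2, 5], [1, 1, 1]))         # ([3, -1, 5], 3)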
An motion vector is selected based on its transform coding. Each picture is compressed by partition- magnitude and its angle guides the modification opera- ing it as one or more slices; each slice consists of mac- tion. Yang G, et al. [11] employed the intra-prediction roblocks, which are blocks of 16 × 16 luma samples with mode and matrix coding. They mapped the two secret the corresponding chroma samples. Each macroblock may message bits to every three intra 4×4 blocks by the matrix also be divided into sub-macroblock partitions for motion coding. Kim et al. [1] proposed an entropy coding based prediction. The prediction partitions can have seven differ- watermarking algorithm to achieve the balance between ent sizes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4. The the capacity of watermark bits and the fidelity of the video. large variety of partition shapes and the quarter sample One bit of information is embedded in the sign bit of the compensation provide enhanced prediction accuracy. In trailing ones in context-adaptive variable length coding intra-coded slices, 4 × 4 or 16 × 16 intra spatial prediction (CAVLC) of the H.264/AVC stream. The transcoding based on neighboring decoded pixels in the same slice process may thus be avoided but the drift errors resulting will be applied. The 4 × 4 spatial transform, which is an from the different reference frame content may appear. approximate DCT and can be implemented with integer In this paper, we will try to make good use of the coding operations with a few additions/shifts, will be calculated features in video/audio data streams to maximize the ca- for the residual data. The point by point multiplication in pacity of embedded data. Our work can be applied to any the transform step will be combined with the quantization container format that can de-multiplex the video stream of step and implemented by simple shifting operations to H.264/AVC and audio stream of AAC compression. achieve efficiency. CAVLC or CABAC will be used for 1142
Our video embedding scheme is integrated into the H.264/AVC encoding process: the quantization, intra prediction and motion estimation procedures in the encoder are modified.

1) Employing Intra Prediction Modes: In H.264/AVC, the intra prediction of the luma and chroma of a frame is quite important for reducing the coding redundancy, since a coding block is usually related to its neighbors. Four 16×16 or nine 4×4 intra prediction modes can be applied to the luma, while four 8 × 8 prediction modes are used for the chroma. Fig. 2(a) shows the 4×4 intra prediction. Since the samples above and to the left (labeled A to M) of the current block have been encoded/reconstructed previously and are available to both the encoder and the decoder, nine prediction modes, including eight directions and one DC prediction, can be calculated. It should be noted that, if the neighboring upper or left block of the current block is not available, the number of available modes is reduced. For instance, if the upper block is available while the left block is not, only the "horizontal", "DC" and "horizontal up" modes can be chosen.

Figure 2. (a) Labeling of prediction samples (4×4) and (b) the directions of the 4×4 intra prediction modes.

Our proposed scheme only utilizes the nine 4 × 4 intra prediction modes, since the content of these blocks is usually complicated and the blocks are therefore suitable for information hiding. Compared with the 16 × 16 luma and 8 × 8 chroma prediction modes, the 4 × 4 intra prediction modes offer finer prediction, so modifying them affects the coding performance less. To embed the information, one may think of grouping the nine modes into pairs so that one bit can be cast in each 4×4 subblock. However, the resultant bit-stream length will be considerably increased. We take the "Container" video compressed with a fixed Quantization Parameter (QP) equal to 30 as an example. By doing so, the payload can reach 2.07% of the total bit-stream size, which is inappropriately enlarged by 6.72%. The reason is that the correlation between the intra prediction modes of adjacent blocks is not taken into account. In H.264/AVC, the mode of the current block is first predicted by the minimum of the prediction modes of its two neighbors, i.e. the upper and left blocks. If the mode matches the predicted one, only one flag bit called "Most Probable Mode" (MPM) is asserted and sent. Otherwise, this flag bit is set to "0" and three extra bits are also sent to signal which of the remaining eight modes is used. We only modify the modes when the flag bit is "0", since such a block tends to differ more from its neighbors and is suitable for embedding. Besides, using the MPM to encode a mode should appear in normal videos, so we have to keep this situation in our "stego" video. Our scheme divides the eight modes into two groups to represent the binary secret information, and the division (classification) is applied according to Fig. 2. If the DC mode is not the MPM, we replace the direction of the MPM by the DC mode and then assign "0" and "1" to the prediction directions, which are known by the embedder and the detector. Rate Distortion Optimization (RDO) is employed to determine a better prediction mode. Although the payload becomes 0.79% of the compressed bit-stream size, the increment of file size is less than 3%. If the input video is already an H.264/AVC video stream, we may reference the prediction mode in the original video and determine pairs of modes to represent "0" and "1" directly. If the execution-time constraint is not that strict, we still suggest applying RDO to find a better mode, since it is not easy to predict a good selection just based on the mode in the incoming/original video. Besides, the computational load is not increased much by RDO, since we only have four candidate selections to embed one bit.

2) Employing Inter Prediction: The inter prediction provides a reference from one or more previously encoded video frames for effective encoding. In order to acquire precise motion vectors, H.264/AVC adopts quarter-pixel precision for motion compensation. The last two bits of a motion vector indicate its more accurate location. We basically make use of the last bit of motion vectors for effective information hiding without affecting the coding performance severely. Since transcoding is applied, the Sum of Absolute Differences (SAD) of the investigated motion vectors whose Least Significant Bit (LSB) equals the hidden bit is available, so we can examine them to find good motion vectors. Again, as in the intra prediction, the motion vectors of neighboring partitions are often highly correlated. After determining the motion vectors by motion estimation, H.264/AVC first predicts the motion vector from the nearby, previously coded partitions. After obtaining a predicted motion vector MV_predicted, the difference between the current motion vector, MV_current, and MV_predicted is calculated and encoded. This motion vector difference is termed "MVD" and is formed in the same way at the decoder. In our scheme, we actually modify the data of the MVD instead of the motion vectors themselves, so the detector can extract the hidden information efficiently. Furthermore, we skip the partitions with MVD equal to 0 and avoid generating a new zero MVD. This strategy limits the file size increment and makes the statistics of motion vectors look normal. If the only motion vector with a reasonably small SAD is one whose MVD would become zero after the information embedding, we choose this motion vector anyway and embed the bit once again in the following partition.

It should be noted that the effect of MVD embedding is more obvious if we use a fixed QP to compress a video. We illustrate this by compressing "Container" with QP equal to 30.
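To make the two-group intra-mode embedding concrete, the following sketch shows one way it could be realized in code. It is a minimal illustration under stated assumptions rather than the authors' implementation: the particular split of the modes into the two groups and the rd_cost callback standing in for the encoder's RDO are placeholders, and the DC-for-MPM substitution described above is omitted for brevity.

# Sketch of embedding one secret bit by choosing a 4x4 intra prediction mode.
# Mode numbering follows H.264/AVC (mode 2 is DC); the split of the remaining
# modes into GROUP_0/GROUP_1 is an assumed convention shared by embedder and detector.

GROUP_0 = {0, 3, 4, 7}   # modes taken to carry secret bit "0" (assumption)
GROUP_1 = {1, 5, 6, 8}   # modes taken to carry secret bit "1" (assumption)

def embed_bit_in_intra_mode(bit, mpm, rd_cost):
    """Return a 4x4 intra mode encoding `bit`, or None to leave the block untouched.

    bit     : 0 or 1, the secret bit to embed.
    mpm     : Most Probable Mode predicted from the upper and left blocks.
    rd_cost : callable mode -> rate-distortion cost (stands in for the encoder's RDO).
    """
    candidates = GROUP_0 if bit == 0 else GROUP_1
    # Blocks coded with the MPM flag are not modified, so the MPM itself is
    # never selected as a carrier; this keeps the MPM statistics looking normal.
    candidates = [m for m in candidates if m != mpm]
    if not candidates:
        return None
    # Among the (at most four) candidate modes, let RDO pick the cheapest one.
    return min(candidates, key=rd_cost)

The detector would simply check which group the decoded mode belongs to whenever the MPM flag is zero.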
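The MVD-based embedding of the inter-prediction subsection can be sketched in the same spirit. The sketch assumes motion-estimation candidates are available as (motion vector, SAD) pairs and uses the LSB of the horizontal MVD component as the carrier, which is an assumption; the rules for skipping a zero MVD and for refusing to create a new zero MVD follow the description above.

def embed_bit_in_mvd(bit, candidates, mv_predicted):
    """Pick a motion vector whose MVD carries `bit` in its LSB.

    bit          : 0 or 1, the secret bit.
    candidates   : list of ((mvx, mvy), sad) pairs from motion estimation (quarter-pel).
    mv_predicted : predicted motion vector (mvx, mvy) in quarter-pel units.
    Returns (motion_vector, embedded); embedded is False when the partition is skipped
    and the same bit must be embedded in the following partition.
    """
    def mvd_of(mv):
        return (mv[0] - mv_predicted[0], mv[1] - mv_predicted[1])

    def carrier_lsb(mvd):
        return mvd[0] & 1          # horizontal component chosen as carrier (assumption)

    best_mv, _ = min(candidates, key=lambda c: c[1])
    best_mvd = mvd_of(best_mv)
    if best_mvd == (0, 0):
        return best_mv, False      # zero-MVD partitions are skipped entirely
    if carrier_lsb(best_mvd) == bit:
        return best_mv, True       # the best vector already carries the bit

    # Otherwise look for the lowest-SAD candidate that carries the bit without
    # turning the MVD into a new zero vector.
    usable = [(mv, sad) for mv, sad in candidates
              if mvd_of(mv) != (0, 0) and carrier_lsb(mvd_of(mv)) == bit]
    if not usable:
        return best_mv, False      # keep the good vector; re-embed the bit later
    mv, _ = min(usable, key=lambda c: c[1])
    return mv, True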
Fig. 3 shows the proportion that each coding feature occupies in the compressed bit-stream. Fig. 3(a) shows that the proportion of luma components from inter blocks is 39% if nothing is embedded, while the proportion of luma components from inter blocks is extended to 49% as shown in Fig. 3(b). That is, a large increment appears in the residuals of the inter blocks. If the bit-stream size increment has to be strictly limited, we should avoid embedding the information in motion vectors. However, the increased residuals may be helpful in the information embedding of quantization indices, which will be discussed later.

Figure 3. Pie chart of "Container" with (a) nothing being embedded and (b) MVD being modified.

Here, we test the videos "Garden" and "Container", coded with a target bit-rate of 2 Mbps, to explain our strategy. Table I shows the comparison of modifying the motion vectors (MV) directly without considering the MVD, and the MVD embedding. The payload is calculated in bits per frame (bpf). In our viewpoint, the MVD embedding is still the better choice as the quality of the video is less affected, although the payload is decreased due to the skipping of MVDs that are equal to zero, especially in such a static video as "Container".

Table I. The performance comparison of MV and MVD embedding (bit-rate: 2 Mbps)

  Video Name   MV PSNR   MV Payload (bpf)   MVD PSNR   MVD Payload (bpf)
  Garden       29.78     8461               30.56      5148
  Container    40.70     29372              43.41      8024

3) Quantized Coefficients Embedding: After the intra and inter predictions and compensation, the prediction residuals are generated and occupy a large portion of the video stream. It is advantageous to utilize these residuals to achieve high-volume information hiding. In our scheme, both luma and chroma residuals are embedded. As mentioned before, popular methodologies to achieve high-volume steganography in samples without affecting the perceptual quality of videos include the JStego, F3, F4 and F5 algorithms. In our scheme, the F4 algorithm is adopted to achieve effective information hiding in the residuals. Algorithm 1 shows the pseudo code of the embedding loop of F4. For each non-zero AC coefficient coe, if it is a positive number and its LSB is not equal to BitToEmbed, its absolute value is decreased. On the other hand, for a negative coefficient whose LSB is equal to BitToEmbed, the modification has to be applied and the coefficient is increased by 1, i.e. the LSB of this negative number equals the inverted target bit. After the embedding operation, it is required to check whether the index becomes 0. If yes, the bit would be skipped by the decoder, so it has to be embedded again.

Algorithm 1. F4 Algorithm
Input: BitToEmbed ∈ {0, 1}
 1: for all AC values coe in a block after quantization do
 2:   if coe > 0 ∧ LSB(coe) ≠ BitToEmbed then   /* positive number */
 3:     coe ← coe − 1
 4:   else if coe < 0 ∧ LSB(coe) = BitToEmbed then   /* negative */
 5:     coe ← coe + 1
 6:   else   /* skip zero value */
 7:     continue
 8:   end if
 9:   if coe ≠ 0 then   /* successfully embedded */
10:     /* get the next bit to embed */
11:     BitToEmbed ← GetNextBit()
12:   end if
13: end for

The reason for choosing F4 is as follows. It has been reported that the existence of hidden information can be revealed by checking the statistics of the samples in JStego and F3, since they change the histogram of coefficient frequencies after the information embedding. Besides, the original/natural message induced by the unchanged carrier media may have more steganographic ones than zeros due to the appearance of ±1, so we had better keep this situation. F5 is assumed to be a better approach by using matrix coding, so that less data need to be modified. However, as we would like to embed the information during the encoding process to achieve efficiency, we have to finish coding the data in a subblock before we proceed to encode the next subblock. In matrix encoding, we need to collect 2^m − 1 samples to embed m bits by changing only 1 sample. Since the prediction mechanism of H.264 performs well, a lot of zero indices may exist, and several subblocks may thus be required for collecting 2^m − 1 nonzero samples for the information hiding. As efficient modification is one of our major objectives, F5 is not suitable. In F4, if we require a higher degree of safety, some coefficients may be skipped for embedding, given that both the embedder and the detector know the rule. As described earlier, the information hiding by F4 has a positive side-effect in video coding, as the magnitudes of the resultant coefficients tend to become smaller. When using a fixed QP to encode a video, the video size may even be reduced after information embedding, and this may offset the negative effects from the information hiding by the intra and inter predictions.
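Algorithm 1 can be read as the following runnable sketch, which operates on the quantized AC coefficients of one block and consumes bits from an iterator. It is an interpretation rather than the authors' encoder-integrated code: a non-zero coefficient whose steganographic value already matches the secret bit is counted as embedded here, so that the companion extractor stays synchronized.

def f4_embed(coefficients, bits):
    """Embed secret bits into quantized AC coefficients with the F4 rule.

    coefficients : list of ints, quantized AC values of one block (modified in place).
    bits         : iterator yielding 0/1 secret bits.
    Returns the number of bits embedded in this block.
    """
    embedded = 0
    bit = next(bits, None)
    for i, coe in enumerate(coefficients):
        if bit is None:
            break
        if coe == 0:
            continue                       # zero values never carry information
        if coe > 0 and (coe & 1) != bit:
            coe -= 1                       # positive: LSB must equal the secret bit
        elif coe < 0 and ((-coe) & 1) == bit:
            coe += 1                       # negative: LSB is read inversely
        coefficients[i] = coe
        if coe != 0:
            embedded += 1                  # bit carried; fetch the next one
            bit = next(bits, None)
        # if the coefficient shrank to zero the decoder will skip it,
        # so the same bit is embedded again at the next nonzero coefficient
    return embedded

def f4_extract(coefficients, count):
    """Recover up to `count` bits from coefficients embedded with f4_embed."""
    out = []
    for coe in coefficients:
        if len(out) == count:
            break
        if coe == 0:
            continue
        out.append(coe & 1 if coe > 0 else 1 - ((-coe) & 1))
    return out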
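For completeness, the matrix coding that makes F5 attractive can be shown with a small self-contained sketch: m message bits are carried by 2^m − 1 nonzero samples and at most one steganographic value is flipped. The carriers below are plain LSBs rather than F5's coefficient mapping, and the example values are hypothetical; the sketch only illustrates why 2^m − 1 nonzero samples have to be gathered before anything can be written, which is exactly the property that makes F5 unattractive for the per-subblock encoding used here.

def matrix_embed(carrier_bits, message_bits):
    """F5-style matrix coding: embed m bits into 2**m - 1 carrier bits,
    changing at most one of them."""
    m = len(message_bits)
    assert len(carrier_bits) == (1 << m) - 1
    syndrome = 0
    for pos, b in enumerate(carrier_bits, start=1):
        if b:
            syndrome ^= pos            # XOR of 1-based positions holding a 1
    target = int("".join(str(b) for b in message_bits), 2)
    flip = syndrome ^ target
    out = list(carrier_bits)
    if flip:
        out[flip - 1] ^= 1             # flip exactly one carrier bit
    return out

def matrix_extract(carrier_bits, m):
    """Recover the m embedded bits."""
    syndrome = 0
    for pos, b in enumerate(carrier_bits, start=1):
        if b:
            syndrome ^= pos
    return [int(c) for c in format(syndrome, "0%db" % m)]

# Example with hypothetical values: embedding the two bits (1, 1) into the
# carriers [1, 0, 1] flips only position 1 and yields [0, 0, 1].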
It should be noted that, when the rate control mechanism is enabled, the QP is adjusted along with the encoding process, and the F4 algorithm may help to save some bits in the current frame so that a smaller QP can be assigned in the following frames. In addition, if the bit-stream length is not the major concern, we may try to generate more nonzero indices. As some indices may be quantized into zero values because a large QP is used, we may leverage the coefficients that barely survive by using a smaller QP. For example, if QP = 28 is adopted in a block, we may try a smaller QP = 27 to see whether some zero coefficients/indices become survivors under this smaller QP. If yes, such an index can be changed to ±1 so that we have more nonzero indices to consider.

4) Mode-Copy Procedure: Our embedding methods described above are based on a transcoding process. If the input video is already an H.264/AVC bit-stream, we may record the coding modes during the decoding process, so that the time-consuming mode decision process can be made efficient by referencing the modes in the input H.264 video, as long as the settings of the video, including the GOP structure, the bit-rate, etc., are the same. We thus implement a mode-copy procedure to skip some time-consuming mode decision steps in the encoding process. In our implementation, the coding information that we record includes the frame type, macroblock type, intra- and inter-prediction modes, and motion vectors in quarter-pel units. After decoding a frame, the video encoder assigns those features directly to speed up the whole transcoding process. We compare the typical transcoding and the mode-copy encoding in Table II, where frames per second (FPS) is used as the performance measurement. No information embedding method is applied in either case. It can be seen that employing the mode-copy procedure is competitive with the typical transcoding.

Table II. The comparison of the typical transcoding and mode-copy encoding (bit-rate: 500 Kbps)

  Video Name   Typical transcoding PSNR   FPS     Mode-Copy PSNR   FPS
  Garden       23.62                      14.54   23.53            21.15
  Container    37.77                      11.99   37.51            18.42

For the information embedding, we may use both the intra- and inter-prediction modes as references and try to modify the modes directly without using RDO. If speed is the major concern, this approach is feasible. For the information hiding in the intra-prediction modes, we can replace the prediction direction with DC (if the DC mode is not the MPM) and then group the modes into known pairs according to the prediction directions. For the information hiding in the MVD, we may simply change the bits according to the incoming motion vectors, or use a refined method which calculates the SADs of the adjacent locations to pick a better motion vector. However, in our opinion, we may prefer to run RDO in the intra mode modification, as described before, and omit the motion vector embedding if the coding performance is our major concern.

C. Information Hiding in Advanced Audio Coding

Advanced Audio Coding (AAC) is a standardized compression scheme for digital audio, designed as the successor of the MP3 format. AAC makes use of many advanced coding techniques available at the time of its development to provide high-quality multi-channel audio. Therefore, it has become the kernel algorithm of audio compression standards. At the beginning of encoding, a filter bank is employed to transform the time-domain signal into the frequency domain. Following the time-frequency conversion is a series of prediction mechanisms; those stages attempt to improve the redundancy reduction of previously encoded signals or the joint stereo channel. After the predictions, an iteration loop is applied to quantize the spectral coefficients. The scalefactors of the subbands are obtained and multiplied by all of the coefficients in the corresponding scalefactor band. The number of required bits and the related information are determined to control the trade-off between the audio distortion and the payload. Huffman coding follows, according to the 12 pre-defined Huffman tables. Since the scalefactors and the spectral coefficients occupy a significant part of the coded audio streams, we make use of them to embed the information.

1) Embedding in the Scale Factors: The scalefactors have been used for effective information embedding purposes [12]. In our implementation, scalefactors equal to zero are skipped in the embedding scheme, and the scalefactor bands that use pseudo codebooks in the intensity stereo are also skipped. For nonzero scalefactors, the secret message bit is embedded in the LSB of each scalefactor. The payload of the scalefactor embedding in bytes is shown in Table III, in which two audio clips with different characteristics are employed. Two target bit-rates are employed, i.e. High: 264 Kbps and Low: 132 Kbps. We can see that the payload of the scalefactor embedding reaches around 1 to 3% of the audio stream size.

Table III. The payload of information embedding in audio

  Music      Scalefactor (High)   Scalefactor (Low)   Quantization index (High)   Quantization index (Low)
  A (5:06)   1.55%                2.86%               7.10%                       6.79%
  B (3:51)   1.26%                2.52%               11.90%                      7.72%

2) Embedding in the Quantization Indices: In order to maximize the payload in the audio stream, the spectral coefficients after quantization are also employed. Again, we apply the F4 algorithm to the spectral coefficients. Table III also shows the payload using quantization index embedding; the average payload is around 6 to 12% of the embedded audio file size. It can be seen that "Music B" has a larger payload than "Music A" because Music B has more transient signals than A, and hence more non-zero coefficients exist.
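The scalefactor embedding of Sec. III-C.1 is plain LSB substitution with two skip rules, which a short sketch captures. It assumes the scalefactors of one frame are available as a list of non-negative integers and that bands using pseudo (intensity-stereo) codebooks have already been flagged; both assumptions stand in for the actual AAC encoder data structures.

def embed_in_scalefactors(scalefactors, is_pseudo_band, bits):
    """LSB-embed secret bits into AAC scalefactors.

    scalefactors   : list of non-negative integer scalefactors (modified in place).
    is_pseudo_band : list of bools, True for bands coded with pseudo codebooks
                     for intensity stereo (these bands are skipped).
    bits           : iterator yielding 0/1 secret bits.
    Returns the number of embedded bits.
    """
    embedded = 0
    for i, sf in enumerate(scalefactors):
        if sf == 0 or is_pseudo_band[i]:
            continue                        # skip zero scalefactors and pseudo bands
        bit = next(bits, None)
        if bit is None:
            break
        scalefactors[i] = (sf & ~1) | bit   # overwrite the LSB with the secret bit
        embedded += 1
    return embedded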
IV. EXPERIMENTAL RESULTS

Our results are demonstrated in two parts, i.e. the information embedding in video and in audio streams. In the video embedding part, we evaluate the performance on various videos by first using a fixed QP value and then by enabling the rate control mechanism. In both cases, the payload, the fidelity and the increment of bit-stream size are the major concerns. Six common test videos, namely "Container", "Hall Monitor", "Foreman", "Football", "Garden" and "Mobile", are utilized to verify the proposed video embedding method. The details of the test videos are shown in Table IV. The proposed scheme is integrated within Intel Integrated Performance Primitives (IPP) version 5.2, which is a highly optimized run-time library supporting fast H.264/AVC coding.

Table IV. Details of test videos

  Name           Num. of frames   Frame resolution   No. of MB per frame
  Container      300              352×288            396
  Football       125              352×240            330
  Foreman        300              352×288            396
  Garden         115              352×240            330
  Hall Monitor   300              352×288            396
  Mobile         140              352×240            330

First, we set a fixed QP value equal to 30 for all frames in the test videos to see the effects of the different embedding methods. When a fixed QP is employed, the trade-off between the hidden information payload and the increment of bit-stream size is the major consideration. For each video, we record the payload in bits per frame, and the increments of the bit-stream resulting from the various embedding methods are demonstrated in Table V. We find that the "Football" video provides the largest payload in the modifications of the quantization indices of intra blocks and of the intra mode prediction (IMP), since the high motion in the video leads to more intra blocks. In the quantization index embedding in inter-predicted blocks, the payload of the "Garden" video is even higher than that of "Football", because it not only has high variations within a frame but also high similarity among frames, so more inter blocks exist. In the other embedding modes, high-motion videos always have higher payloads.

Using a fixed QP value is only an experiment to observe the trade-off among the various embedding modes. We should enable the rate control to simulate the scenario of real applications. Under a given target bit-rate, the issue we discuss is the trade-off between the payload and the fidelity of the video. Unlike the fixed-QP mode, we combine all the embedding modes to directly observe the trade-off between the payload and the PSNR under the same target bit-rate, as shown in Fig. 6, in which four videos are tested and their payloads are shown as the solid lines. We can see that "Garden" achieves the best payload performance at all bit-rates. Again, this demonstrates that high-motion videos usually perform better. In fact, under various bit-rates, the payloads of IMP and MVD tend to be independent of the target bit-rate, since they are usually more related to the frame size.

Figure 4. The payload of information embedding in videos under various bit-rates.

Figure 5. PSNR of embedded videos and transcoded videos under various bit-rates.

Then, we consider the fidelity of the embedded videos under various target bit-rates. We present the PSNR values of the embedded videos and the transcoded videos. The fidelity decreases of four videos are shown in Fig. 5. We can see that the fidelity of a transcoded high-motion video is not as high as that of the other, static videos under the same bit-rate, because high-motion videos lead to a significant amount of inter-block residuals for compensating the high variations between video frames. It can be observed that the PSNR of high-motion videos such as "Garden" decreases a lot after the information embedding. The reason could be that modifying the motion vectors of high-motion videos causes serious inaccuracies in motion compensation. At lower bit-rates, the difference between the transcoded and the embedded video is not as large as at higher bit-rates. Despite that, our scheme still achieves around 10% payload of the embedded video size on average, as shown in Fig. 4.

As mentioned in Section III-B4, a mode-copy procedure was introduced for speeding up the encoding process. It should be noted that the mode-copy procedure is reasonable only when the rate-control mode is enabled. Our mode-copy procedure skips the most time-consuming stages in the video encoder, including the motion estimation, to increase the efficiency.
First, we record the time of the embedding process, with and without employing the mode-copy procedure. Table VI shows the ratio of execution-time improvement. It can be observed that the efficiency can be improved by more than 28% when the mode-copy procedure is employed. The lower the bit-rate, the more improvement is obtained. Figure 6 shows the payload of embedding when the mode-copy procedure is applied; note that the MVD embedding is disabled. The dotted lines in Fig. 6 show the payload of the hidden information, and the solid lines are from the complete transcoding scheme. "Garden" still has the best embedding performance. Even with the MVD embedding disabled, the payload can still reach around 10% of the encoded video size on average.

Table V. The average payload (bpf) and the corresponding increment of bit-stream size (%)

  File Name      Intra4x4,16x16     Inter              IMP                MVD                All (MVD)
                 Payload   Size     Payload   Size     Payload   Size     Payload   Size     Payload   Size
  Container      409       0.14     339       -7.30    81        2.96     128       25.89    1191      7.49
  Hall Monitor   262       -3.38    341       -8.77    95        4.13     60        6.14     880       -5.16
  Foreman        349       -1.41    586       -7.60    246       4.44     482       12.72    1887      3.44
  Football       1768      -4.34    2411      -8.15    615       3.00     407       5.59     6095      -8.38
  Garden         1460      0.70     5039      -8.81    206       0.81     440       19.37    9164      2.03
  Mobile         1592      2.04     4325      -9.24    222       1.19     470       30.94    8996      1.99

Table VI. The ratio of execution time decrease after employing the mode-copy procedure

  Video Name     500 Kbps   1 Mbps   2 Mbps
  Container      53%        39%      30%
  Hall Monitor   48%        40%      35%
  Foreman        52%        44%      33%
  Football       35%        32%      30%
  Garden         55%        35%      31%
  Mobile         46%        35%      27%

Figure 6. The embedding payload under various bit-rates with the mode-copy procedure.

The mode-copy procedure skips many searching steps, so it may degrade the fidelity of the video frames. Figure 7 shows the PSNR values of four videos with and without the mode-copy procedure. The dotted lines in the figure represent the PSNR values with the mode-copy procedure. We can see that the fidelity of the videos is not affected by much. When a higher bit-rate is set, the mode-copy procedure performs even better.

Figure 7. The fidelity of video under various bit-rates, with and without the mode-copy procedure.

A. The Evaluation of Audio Embedding

For the information embedding in audio, we select some audio clips from the EBU SQAM (Sound Quality Assessment Material) CD, including "abba", "speech", "baird" and "bach". All the audio clips from the EBU SQAM CD are encoded in a lossless compression format (FLAC), and we transcode those clips into FLV as the input to our scheme. The audio encoder parameters are set to retain the fidelity of the audio as much as possible; we employ the "target quality mode" in the Nero AAC encoder to preserve the fidelity of the original clips from the EBU SQAM CD. In addition, we also select two videos, "classic" and "electronic", from YouTube, as one is classical music and the other is remixed pop music.

We first investigate the payload of embedding with both embedding modes enabled. The payload unit we use is also bits per frame (bpf), in which a "frame" is the basic element used to collect the sampling points. Table VII shows the ratio of short window appearance, the payloads of the scalefactor embedding and of the quantization index embedding, and the ratio of payload to encoded audio size. All the audio clips are encoded at 192 Kbps. The ratio of short window appearance reflects the characteristics of the audio content. A transient signal is a short-duration signal that contains a high degree of non-periodic components and a higher magnitude of high frequencies than the harmonic content of that sound.
It can be seen that "abba", "speech" and "electronic" have more transient signals. By comparing the ratio of short window appearance with the payload, we find that the payload obtained by the scalefactor embedding is unrelated to the ratio of short windows, because each subband has only one scalefactor, so it is not related to the bit-rate or to the number of short windows. In contrast, the ratio of short window appearance is proportional to the payload obtained by the embedding in the quantization indices.

Table VII. The payload of the audio embedding

  Audio name   Short window [%]   SF payload [bpf]   QC payload [bpf]   Ratio [%]
  abba         14.78              76                 341                10.55
  speech       11.42              77                 339                10.61
  baird        0.11               92                 271                8.20
  bach         2.60               72                 284                9.42
  classic      0.04               95                 319                9.19
  electro.     5.28               79                 496                12.80

Fig. 8 shows the bit-rate and the payload of the quantization index embedding. We can see that the higher the target bit-rate, the more payload the quantization index embedding can achieve, because the number of coefficients in the subbands increases. The payload of the audio embedding reaches around 10% of the audio stream size on average, as for the video information embedding.

Figure 8. The payload of quantization index embedding at various bit-rates.

V. CONCLUSION

We developed a high-volume steganographic scheme for FLV files. Both video and audio streams are employed, and several coding features are taken into account. Users may take their applications into account to select suitable features; the payload, the perceptual quality, the file size increment and the security should be the major concerns. Experimental results demonstrate that the payload can reach more than 10% of the total file size when a good trade-off is achieved.

REFERENCES

[1] U. Budhia, D. Kundur, and T. Zourntos, "Digital video steganalysis exploiting statistical visibility in the temporal domain," IEEE Transactions on Information Forensics and Security, vol. 1, no. 4, pp. 502-516, 2006.
[2] C. Xu, X. Ping, and T. Zhang, "Steganography in compressed video stream," 2006.
[3] S. Kapotas, E. Varsaki, and A. Skodras, "Data Hiding in H.264 Encoded Video Sequences," in IEEE 9th Workshop on Multimedia Signal Processing (MMSP 2007), 2007, pp. 373-376.
[4] D. Fang and L. Chang, "Data hiding for digital video with phase of motion vector," in Proceedings of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), 2006, p. 4.
[5] Z. Liu, H. Lang, X. Niu, and Y. Yang, "A robust video watermarking in motion vectors," in International Conference on Signal Processing, 2004, pp. 2358-2361.
[6] M. Wu and B. Liu, "Data hiding in image and video: part I - fundamental issues and solutions," IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 685-695, 2003.
[7] M. Wu, H. Yu, and B. Liu, "Data hiding in image and video: part II - designs and applications," IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 696-705, 2003.
[8] A. Westfeld, "F5 - A Steganographic Algorithm," in Information Hiding, Springer, pp. 289-302.
[9] A. Bhaumik, M. Choi, R. Robles, and M. Balitanas, "Data Hiding in Video," 2009.
[10] H. Wang, Y. Li, Z. Lu, and S. Sun, "Compressed domain video watermarking in motion vector," in Knowledge-Based Intelligent Information and Engineering Systems, Springer, 2005, pp. 580-586.
[11] G. Yang, J. Li, Y. He, and Z. Kang, "An information hiding algorithm based on intra-prediction modes and matrix coding for H.264/AVC video stream," AEU - International Journal of Electronics and Communications, 2010.
[12] S. Kirbiz, A. Lemma, M. Celik, and S. Katzenbeisser, "Decode-Time Forensic Watermarking of AAC Bitstreams," IEEE Transactions on Information Forensics and Security, vol. 2, no. 4, pp. 683-696, 2007.
MULTI-SCALE IMAGE CONTRAST ENHANCEMENT: USING ADAPTIVE INVERSE HYPERBOLIC TANGENT ALGORITHM

Cheng-Yi Yu 1,2, Yen-Chieh Ouyang 1, Tzu-Wei Yu 3, Chein-I Chang 1,4
1 Dept. of Electrical Engineering, National Chung Hsing University, Taichung, ROC
2 Dept. of Computer Science and Information Engineering, National Chin Yi University of Technology, Taichung, ROC
3 Dept. of Electronic Engineering, National Chin Yi University of Technology, Taichung, ROC
4 Remote Sensing Signal and Image Processing Laboratory, Dept. of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
E-mail: [email protected]

ABSTRACT

This paper presents a fast and effective method for image contrast enhancement based on multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT). Sub-band coefficients were developed based on the Adaptive Inverse Hyperbolic Tangent algorithm. In the proposed method, the image contrast is calculated from the local mean and local variance before further processing by the Adaptive Inverse Hyperbolic Tangent algorithm (AIHT). We show that this approach provides a convenient and effective way to handle various types of images. Applications of the proposed method to real-time imaging are also discussed. Experimental results show that the proposed algorithm is capable of enhancing the local contrast of the original image adaptively while bringing out the details of objects at the same time.

Keywords - Multi-Scale, Adaptive Inverse Hyperbolic Tangent, Contrast Enhancement, Image Processing
Topic area - Multi Processing, Image Post-Processing

1. INTRODUCTION

Light is the electromagnetic radiation that stimulates our visual response. In real-world situations, light intensities have a large range. The illumination range over which the human visual system can operate is roughly 1 to 10^10, or ten orders of magnitude. The retina of the human eye contains about 100 million rods and 6.5 million cones. The rods are sensitive and provide vision over the lower several orders of magnitude of illumination. The cones are less sensitive and provide the visual response at the higher 5 to 6 orders of magnitude of illumination. Figure 1 shows the Human Visual System mapping curve [1,2].

Figure 1. Human Visual System mapping curve.

According to its contrast, an image is generally categorized into one of five groups: dark image, bright image, back-lighted image, low-contrast image, and high-contrast image. A dark image has particularly low gray levels in intensity, while a bright image has very high gray levels in intensity. The gray levels of a back-lighted image are usually distributed at the two ends of the dark and bright regions. On the other hand, the gray levels of a low-contrast image are generally concentrated in the middle region, while the gray levels of a high-contrast image are scattered across the whole spectrum (Fig. 2) [3,4].

Figure 2. Five kinds of contrast types.

Five categories of commonly used gray level transfer functions, shown in Fig. 3, are generally used to perform contrast enhancement so as to achieve different types of contrast [3,4]. For example, for dark images with mean < 0.5, the function in Fig. 3(a) is used,
whereas the function in Fig. 3(b) is used for a bright image with mean > 0.5 for the same purpose. For images whose gray levels are concentrated in the middle region, with mean near 0.5, the function in Fig. 3(c) is used. For images whose gray levels are distributed at the two ends of the dark and bright regions, the function in Fig. 3(d) is used. For images whose gray levels are uniformly scattered across the whole spectrum, the function in Fig. 3(e) is used.

Figure 3. Five categories of classical gray level transform functions: (a) dark image, (b) bright image, (c) back-lighted image, (d) low contrast image, (e) high contrast image.

Contrast enhancement techniques are widely used to increase the visual quality of images. The purpose of image enhancement is twofold: first, to let human eyes identify images more easily by making the image clear and detailed; second, to let computers analyze and identify image data more easily, approaching human visual perception capabilities. However, the Adaptive Inverse Hyperbolic Tangent algorithm that we proposed in our previous work [3,4] suffers from the following drawbacks. First, it lacks a mechanism to adjust the degree of enhancement; AIHT-based image contrast enhancement cannot retain the detailed brightness distribution of the original image and therefore leads to distortion. Second, the algorithm only performs global contrast enhancement and cannot achieve local contrast enhancement; it is unable to follow the Human Visual System mapping curve and may produce non-smooth or distorted images.

In this paper, to address the above-mentioned shortcomings of image contrast enhancement methods, we propose a multi-scale image enhancement method based on the Adaptive Inverse Hyperbolic Tangent algorithm. This method has two main features: (1) a sub-processing method to achieve local contrast enhancement; (2) the ability to process various types of images while enhancing and retaining the original image details. The enhanced results will contribute to image analysis.

The multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT) for image contrast enhancement is suitable for interactive applications. It can automatically produce contrast-enhanced images of good quality while using a spatially uniform mapping function based on a simple brightness perception model to achieve better efficiency. In addition, the MSAIHT also provides users with a tool for tuning the image appearance on the fly in terms of brightness and contrast, and is thus suitable for interactive applications. The AIHT-processed images can be reproduced within the capabilities of the display medium to give more detailed and faithful representations of the original scenes.

The remainder of this paper is organized as follows: Section 2 reviews previous work in the literature. Section 3 develops the MSAIHT contrast enhancement algorithm along with its parameters and usage. Section 4 conducts experiments including simulations. Finally, Section 5 provides future directions for further research.

2. CONTRAST ENHANCEMENT FOR AN IMAGE

There are two categories of contrast enhancement techniques: global methods and local methods. Global contrast enhancement techniques remedy problems that manifest themselves in a global fashion, such as excessive or poor lighting conditions in the source environment. On the other hand, local contrast enhancement tries to enhance the visibility of local details in the image. Locally enhanced images look more attractive than the originals because of the higher contrast [5].

The advantages of using a global method are its high efficiency and low computational load. The drawback of using a global operator is its inability to reveal image details of local luminance variation. On the contrary, the advantage of a local operator is its capability of revealing the details of luminance level information in an image, at the expense of a very high computational cost that may be unsuitable for video applications without hardware realization [3,4]. Two types of contrast enhancement techniques, linear and nonlinear, are discussed as follows.

Linear contrast enhancement is also referred to as contrast stretching. It linearly expands the original digital luminance values of an image to a new distribution. Expanding the original input values of the image makes it possible to use the entire sensitivity range of the display device. Linear contrast enhancement also highlights subtle variations within the data.

Nonlinear contrast enhancement often involves histogram equalization, which requires an algorithm to accomplish the task. One major disadvantage resulting from the nonlinear contrast stretch is that each value in the input image can have several values in the output image, so that objects in the original scene lose their correct relative brightness values. Under such a circumstance the contrast enhancement is generally performed to expand the gray level range to mitigate the problem.
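To make the two classical global techniques concrete, the NumPy sketch below applies a linear contrast stretch and a histogram equalization to an 8-bit grayscale image. It illustrates the standard textbook operations only; it is not the method proposed in this paper.

import numpy as np

def linear_stretch(img):
    """Linearly expand an 8-bit grayscale image to the full [0, 255] range."""
    lo, hi = int(img.min()), int(img.max())
    if hi == lo:
        return img.copy()
    return ((img.astype(np.float64) - lo) * 255.0 / (hi - lo)).astype(np.uint8)

def histogram_equalization(img):
    """Classical histogram equalization; input levels may merge or spread, so
    objects generally lose their original relative brightness."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf[-1] == cdf_min:                  # constant image, nothing to equalize
        return img.copy()
    lut = np.round((cdf - cdf_min) * 255.0 / (cdf[-1] - cdf_min))
    return np.clip(lut, 0, 255).astype(np.uint8)[img]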
One popular technique to accomplish this task is histogram equalization (Gonzalez and Woods [6]). A disadvantage of the method is that it is indiscriminate and produces unrealistic effects in photographs. It may increase the contrast of background noise while decreasing the usable signal. In scientific imaging, where spatial correlation is more important than the intensity of the signal, the small signal-to-noise ratio usually hampers visual detection.

3. MULTI-SCALE PARAMETER ADJUSTMENT OF ADAPTIVE INVERSE HYPERBOLIC TANGENT (MSAIHT) ALGORITHM

3.1. Adaptive Inverse Hyperbolic Tangent (AIHT) Algorithm

Figure 4 is a block diagram of the AIHT algorithm. The input data is converted from its original format to a floating point representation of RGB values. The principal characteristic of our proposed enhancement function is an adaptive adjustment of the Inverse Hyperbolic Tangent (AIHT) function determined by each pixel's radiance. After reading the image file, bias(x) and gain(x) are computed. These parameters control the shape of the IHT function. Figure 5 shows a block diagram of the AIHT parameter evaluation, including the bias(x) and gain(x) parameters [3,4].

Figure 4. A flowchart of the AIHT algorithm.

Figure 5. A flowchart of the AIHT parameter evaluation.

The Adaptive Inverse Hyperbolic Tangent algorithm has several desirable properties. For very small and very large luminance values, its logarithmic function enhances the contrast in both dark and bright areas of an image. Because this function approaches an asymptote, the output mapping is always bounded between 0 and 1. Another advantage of this function is that it supports an approximately inverse hyperbolic tangent mapping for intermediate luminance, i.e. luminance distributed between dark and bright values. Figure 6 shows an example where the middle section of the curve is approximately linear.

Figure 6. AIHT is approximately linear over the middle range of values, where the choice of a semi-saturation constant determines how input values are mapped to display values.

The form of the AIHT fits data obtained from measuring the electrical response of photo-receptors to flashes of light in various species [7]. It has also provided a good fit to other electro-physiological and psychophysical measurements of human visual function [8]-[10].

The contrast of an image can be enhanced using the adaptive inverse hyperbolic function. The enhanced pixel x_ij is defined as follows,

  Enhance(x_ij) = (1 / gain(x)) · log( (1 + x_ij^{bias(x)}) / (1 − x_ij^{bias(x)}) )    (1)

where x_ij is the image gray level of the i-th row and j-th column. The bias(x) is a power applied to x_ij to speed up the change. The gain function is a weighting function used to determine the steepness of the AIHT curve; a steeper slope maps a smaller range of input values to the display range. The gain function helps to shape how fast the mid-range of objects in a soft region goes from 0 to 1, and a higher gain value means a higher rate of change. Therefore the steepness of the inverse hyperbolic tangent curve can be adjusted dynamically. The following section describes the method we use, which is similar to the proposed algorithm.

3.2. Bias and Gain Parameters

The bias function is a power function defined over the unit interval, which remaps x according to the bias transfer function. The bias function is used to bend the density function either upwards or downwards over the [0,1] interval.
The bias power function is defined by:

  bias(x) = ( (mean(x) + 0.5) / 0.5 )^{0.25}    (2)

where mean(x) = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij.

The gain function determines the steepness of the AIHT curve. A steeper slope maps a smaller range of input values to the display range. The gain function is used to help reshape the object's mid-range, from 0 to 1, of its soft region. The gain function is defined by:

  gain(x) = 0.1 + ( (variance(x) + 0.1) / 0.5 )^{0.5}    (3)

where variance(x) = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} (x_ij − µ)^2 and µ = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij.

Decreasing the gain(x) value increases the contrast of the re-mapped image. Shifting the distribution toward lower levels of light (i.e., decreasing bias(x)) decreases the highlights. By adjusting bias(x) and gain(x), it is possible to tailor a re-mapping function with appropriate amounts of image contrast enhancement, highlights and shadow lightness, as shown in Fig. 7.

Figure 7. Inverse hyperbolic tangent curves produced by varying the gain and bias values of the mapping curves.

The gain function determines the steepness of the curve: steeper slopes map a smaller range of input values to the display range. The value of bias controls the centering of the inverse hyperbolic tangent. Figure 8 shows the processed images for different values of gain and bias. There are eight gain values (1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37) with the bias parameter fixed (bias = 1); the corresponding results are shown in Fig. 8(a). There are nine bias values (0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8) with the gain parameter fixed (gain = 0.85); the corresponding results are shown in Fig. 8(b).

Figure 8. (a) Bias parameter fixed (bias = 1) with eight different gain values; (b) gain parameter fixed (gain = 0.85) with nine different bias values.
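A compact NumPy sketch of the AIHT mapping follows. It mirrors Eqs. (1)-(3) as written above; the exact placement of the constants 0.25, 0.5 and 0.1 follows that reading and should be treated as an assumption rather than a definitive formulation, and the final normalization to [0, 1] is added only for display.

import numpy as np

def aiht_enhance(img):
    """Adaptive Inverse Hyperbolic Tangent mapping of a grayscale image in [0, 1].

    bias and gain are derived from the global mean and variance, following the
    Eqs. (2) and (3) as given above; the constants may deviate from the original paper.
    """
    x = np.clip(np.asarray(img, dtype=np.float64), 1e-6, 1.0 - 1e-6)
    mean = x.mean()
    var = x.var()
    bias = ((mean + 0.5) / 0.5) ** 0.25                 # Eq. (2), as read above
    gain = 0.1 + ((var + 0.1) / 0.5) ** 0.5             # Eq. (3), as read above
    xb = x ** bias
    enhanced = np.log((1.0 + xb) / (1.0 - xb)) / gain   # Eq. (1), as read above
    enhanced -= enhanced.min()                          # normalize for display
    if enhanced.max() > 0:
        enhanced /= enhanced.max()
    return enhanced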
3.3. Multi-Scale Parameter Adjustment of Adaptive Inverse Hyperbolic Tangent (MSAIHT) Algorithm

Figure 9 shows a block diagram of the MSAIHT algorithm. The input data is converted from its original format to a floating point representation of RGB values. The principal characteristic of our proposed enhancement function is a multi-scale adaptive adjustment of the Inverse Hyperbolic Tangent (MSAIHT) function determined by each pixel's radiance. After reading the image file, bias(x) and gain(x) are computed. These parameters control the shape of the AIHT function. Figure 10 shows a block diagram of the MSAIHT parameter evaluation, including the multi-scale bias(x) and gain(x) parameters.

Figure 9. A flowchart of the MSAIHT algorithm.

There are two important design goals for the multi-scale approach: avoiding noise visibility, especially in smooth regions, and preventing intensity saturation at the minimum and maximum possible intensity values (e.g. 0 and 255 for a 1-byte-per-channel source format).

The enhanced output image resulting from the multi-scale approach for processing an input image x is described by:

  Enhance_MSAIHT = Σ_{k=1}^{K} AIHT(bias(k), gain(k))    (4)

where K is the number of bands used. The multi-scale approach combines low gain images in the low level and high gain images in the high level. An additional problem that is potentially solved by this approach is the compression property of the display (the so-called gamma curve). This transfer function has a high compression rate for the higher luminance range and a low stretching rate for the lower luminance regions.

4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A variety of video sequences and still images were tested using the proposed method. There are four types of extreme-case images: dark images, bright images, back-lighted images, and low-contrast images. Images with different types of histogram distributions were taken for the experiments. These include some daily-life images with poor contrast, used to demonstrate the enhanced results. Figure 11 shows various types of images with bad contrast, and displays the results of the enhanced image processing by histogram equalization, AIHT and the proposed MSAIHT method. Figure 12 compares the Adaptive Inverse Hyperbolic Tangent and the Multi-scale Adaptive Inverse Hyperbolic Tangent on local detail. In terms of local detail, the enhancement of MSAIHT is better than that of AIHT.

The comparative analysis has shown that the proposed methods can display more detail, in the sense of contrast, than the currently used methods. The MSAIHT technique can keep the sharpness of defects' edges and local detail well. Therefore, AIHT and MSAIHT can greatly enhance poor images, and they will be helpful for defect recognition.

Finally, Figure 13 shows the MSAIHT system interface in manual and automatic mode. The automatic mode adjusts the best parameters (multi-scale gain and bias) based on the automatic calculation of image characteristics (piecewise mean and variance). In manual mode, users can select th