3-D Environment Model Construction and Adaptive
 Foreground Detection for Multi-Camera Surveillance System
          Yi-Yuan Chen1† , Hung-I Pai2† , Yung-Huang Huang∗ , Yung-Cheng Cheng∗ , Yong-Sheng Chen∗
           Jian-Ren Chen† , Shang-Chih Hung† , Yueh-Hsun Hsieh† , Shen-Zheng Wang† , San-Lung Zhao†
                    † Industrial Technology Research Institute, Taiwan 310, ROC
        ∗ Department of Computer and Information Science, National Chiao-Tung University, Taiwan 30010, ROC
                        E-mail: 1 yiyuan@itri.org.tw, 2 HIpai@itri.org.tw



   Abstract— Conventional surveillance systems usually display the acquired video streams on multiple screens, which makes it difficult to keep track of targets because the spatial relationship among the screens is not apparent. This paper presents an effective and efficient surveillance system that integrates multiple video contents into one single comprehensive view. To visualize the monitored area, the proposed system uses planar patches to approximate the 3-D model of the monitored environment and displays the video contents of the cameras by applying dynamic texture mapping on the model. Moreover, a pixel-based shadow detection scheme for surveillance systems is proposed. After an offline training phase, our method maintains, for every pixel, a threshold that determines whether the pixel lies in a shadow part of the frame; these thresholds are automatically adjusted and updated according to the received video streams. The moving objects are extracted accurately after cast shadows are removed and are then visualized through axis-aligned billboarding. The system provides security guards with better situational awareness of the monitored site, including the activities of the tracked targets.

   Index Terms— Video surveillance system, planar patch modeling, axis-aligned billboarding, cast shadow removal

Fig. 1. A conventional surveillance system with multiple screens.

                         I. INTRODUCTION

   Recently, video surveillance has experienced accelerated growth because of the continuously decreasing price and improving capability of cameras [1], and it has become an important research topic in the general field of security. Since the monitored regions are often wide and the fields of view of cameras are limited, multiple cameras are required to cover the whole area. In a conventional surveillance system, security guards in the control center monitor the secured area through a screen wall (Figure 1). It is difficult for the guards to keep track of targets because the spatial relationship between adjacent screens is not intuitively known. Also, it is tiresome to gaze at many screens simultaneously over a long period of time. Therefore, it is beneficial to develop a surveillance system that can integrate all the videos acquired by the monitoring cameras into a single comprehensive view.

   Many approaches to integrated video surveillance have been proposed in the literature. Video billboards and video on fixed planes project camera views, including foreground objects, onto individual vertical planes in a reference map to visualize the monitored area [2]. In the fixed billboard method, billboards face specified directions to indicate the capturing directions of the cameras, and the locations of the billboards indicate the positions of the cameras; however, the billboard contents are hard to perceive if the angles between the viewing direction and the normal directions of the billboards are too large. In the rotating billboard method, on the other hand, when a billboard rotates to face the viewpoint of the user, neither the camera orientations nor the capturing areas are preserved. In outdoor surveillance systems, an aerial or satellite photograph can be used as a reference map and measurement equipment is used to build the 3-D environment [3]–[5]. Neumann et al. utilized an airborne LiDAR (Light Detection and Ranging) sensor system to collect 3-D geometry samples of a specific environment [6]. In [3], image registration seams the videos onto the 3-D model. Furthermore, video projection, such as video flashlight or the virtual projector, is another way to display video on the 3-D model [4], [7].

   However, multi-camera surveillance still has many open problems to be solved, such as object tracking across cameras and object re-identification. The detection of moving objects in video sequences is the first relevant step in the extraction of information in vision-based applications. In general, the quality of object segmentation is very important: the more accurate the positions and shapes of the objects are, the more reliable identification and tracking will be. Cast shadow detection is an issue for precise object segmentation and tracking. The characteristics of shadows are quite different in outdoor and indoor environments. The main difficulties in separating the shadow from an object of interest are due to
the physical properties of the floor, the directions of the light sources, and additive noise in the indoor environment. Based on brightness and chromaticity, several works decide thresholds on these features to roughly separate the shadow from the objects [8]–[10]. However, current local-threshold methods couple blob-level processing with pixel-level detection, which limits their performance because of the averaging effect of considering a large image region.

   Two works remove shadows by updating the thresholds over time and detecting cast shadows in different scenes. Carmona et al. [11] propose a method that detects shadows by using the properties of shadows in angle-module space. Blob-level knowledge is used to distinguish shadows, reflections, and ghosts. This work also proposes a way to update the thresholds so that shadows can be removed at different positions of the scene. However, there are many undetermined parameters for updating the thresholds, and the optimal parameters are hard to find in practice. Martel-Brisson et al. [12] propose a method, called GMSM, which initially uses a Gaussian Mixture Model (GMM) to identify the most stable Gaussian distributions as the shadow and background distributions. Since a background model is part of this method, more computation is needed for object segmentation if a more complex background model is used in the system. Besides, because every pixel has to be updated no matter how many objects are moving, the method wastes computation when only a few objects are present.

   In this paper, we develop a 3-D surveillance system based on the integration of multiple cameras. We first use planar patches to build the 3-D environment model and then visualize the videos by dynamic texture mapping on the 3-D model. To obtain the relationship between the camera contents and the 3-D model, homography transformations are estimated for every pair of an image region in the video contents and the corresponding area in the 3-D model. Before texture mapping, patches are automatically divided into smaller ones with appropriate sizes according to the environment. Lookup tables for the homography transformations are also built to accelerate the coordinate mapping during video visualization. Furthermore, a novel method to detect moving shadows is proposed. It consists of two phases. The first phase is an off-line training phase which determines the threshold of every pixel for judging whether the pixel is in a shadow region. In the second phase, the statistics of every pixel are updated over time, and the threshold is adjusted accordingly. In this way, a fixed parameter setting for detecting shadows can be avoided. The moving objects are segmented accurately from the background and are displayed via axis-aligned billboarding for better 3-D visual effects.

                II. SYSTEM CONFIGURATION

   Figure 2 illustrates the flowchart of constructing the proposed surveillance system. First, we construct lookup tables for the coordinate transformation from the 2-D images acquired from the IP cameras deployed in the scene to the 3-D model by specifying corresponding points between the 3-D model and the 2-D images. Since the cameras are fixed, this configuration procedure needs to be done only once, beforehand. Then, in the on-line monitoring stage, based on the 3-D model, all videos are integrated and visualized in a single view in which the foreground objects extracted from the images are displayed through billboards.

Fig. 2. The flowchart and components of the proposed 3-D surveillance system.

Fig. 3. Planar patch modeling for 3-D model construction. Red patches (top-left), green patches (top-right), and blue patches (bottom-left) represent the mapping textures in three cameras. The yellow point is the origin of the 3-D model. The 3-D environment model (bottom-right) is composed of horizontal and vertical patches from these three cameras.

A. Image registration

   For a point on a planar object, its coordinates on the plane can be mapped to the 2-D image through a homography, which is a transformation between two planar coordinate systems. A homography matrix H represents the
relationship between points on two planes:

                      s c_t = H c_s ,                           (1)

where s is a scale factor and c_s and c_t are a pair of corresponding points in the source and target patches, respectively. If there are at least four correspondences, no three of which are collinear in either patch, we can estimate H through a least-squares approach.

   We regard c_s as points of the 3-D environment model and c_t as points of the 2-D image, and then calculate the matrix H that maps points from the 3-D model to the images. In the reverse direction, we can also map points from the images to the 3-D model.
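As a concrete illustration (not taken from the paper), the following C++ sketch shows how a 3-D model point lying on a patch plane could be projected to an image pixel with a 3×3 homography as in Eq. (1), and how a per-patch lookup table might be precomputed; the structure names and the table layout are our assumptions.

    #include <array>
    #include <vector>

    // 3x3 homography stored row-major; maps homogeneous patch coordinates
    // (u, v, 1) on the 3-D model plane to homogeneous image coordinates.
    struct Homography {
        std::array<double, 9> h;

        // Apply Eq. (1): s * c_t = H * c_s, then divide by the scale factor s.
        std::array<double, 2> map(double u, double v) const {
            double x = h[0] * u + h[1] * v + h[2];
            double y = h[3] * u + h[4] * v + h[5];
            double s = h[6] * u + h[7] * v + h[8];
            return { x / s, y / s };
        }
    };

    // Precompute a lookup table so that texture mapping at run time becomes a
    // simple array access instead of a per-pixel matrix multiplication.
    std::vector<std::array<double, 2>> buildLookupTable(const Homography& H,
                                                        int width, int height) {
        std::vector<std::array<double, 2>> table(width * height);
        for (int v = 0; v < height; ++v)
            for (int u = 0; u < width; ++u)
                table[v * width + u] = H.map(u, v);   // image pixel for patch cell (u, v)
        return table;
    }

The table trades memory for speed, which matches the role of the lookup tables in the on-line monitoring stage.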
B. Planar patch modeling

   Precise camera calibration is not an easy job [13]. In the virtual projector methods [4], [7], the texture image will be misaligned with the model if the camera calibration or the 3-D model reconstruction has a large error. Alternatively, we develop a method that approximates the 3-D environment model through multiple individual planar patches and then renders the image content of every patch to generate a synthesized and integrated view of the monitored scene. In this way we can easily construct a surveillance system with a 3-D view of the environment.

   Mostly we can model the environment with two basic building components, horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually surrounded by doors and walls, which are modeled as the vertical planes. Both kinds of planes are further divided into several patches according to the geometry of the scene (Figure 3). If the scene consists of simple structures, a few large patches can represent the scene well with low rendering cost. On the other hand, more and smaller patches are required to accurately render a complex environment, at the expense of more computation.

   In the proposed system, the 3-D rendering platform is developed on OpenGL and each patch is divided into triangles before rendering. Since OpenGL fills triangles with texture by linear interpolation, which is not suitable for perspective projection, distortion appears in the rendering result. One can use a large number of triangles to reduce this kind of distortion, as shown in Figure 4, but doing so enlarges the computational burden and is therefore not feasible for real-time surveillance systems.

Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for rendering (left). More patches make the rendering much better (right).

   To make a compromise between visualization accuracy and rendering cost, we propose a procedure that automatically divides each patch into smaller ones and decides suitable sizes of patches for accurate rendering (Figure 4). We use the following mean-squared error to estimate the amount of distortion when rendering image patches:

          MSE = \frac{1}{m \times n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left( I_{ij} - \tilde{I}_{ij} \right)^2 ,     (2)

where I_{ij} is the intensity of the point obtained from the homography transformation, \tilde{I}_{ij} is the intensity of the point obtained from texture mapping, i and j are the row and column coordinates in the image, respectively, and m × n is the dimension of the patch in the 2-D image. In order to have a reference scale to quantify the amount of distortion, a peak signal-to-noise ratio is calculated by

          PSNR = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) ,             (3)

where MAX_I is the maximum pixel value of the image. Typical values for the PSNR are between 30 and 50 dB, and an acceptable value is considered to be about 20 dB to 25 dB in this work. We set a threshold T to determine the quality of texture mapping by

          PSNR ≥ T .                                     (4)

If the PSNR of a patch is lower than T, the procedure divides it into smaller patches and repeats the process until the PSNR values of all patches are greater than the given threshold T.
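The quality test of Eqs. (2)–(4) can be sketched as follows; how a patch is actually partitioned and resampled is not specified in the paper, so the splitting step is left to the caller and the 8-bit MAX_I value is an assumption.

    #include <cmath>
    #include <vector>

    // A patch sampled as an m x n grid of intensities: one grid obtained through
    // the homography (reference) and one from the rendered texture mapping.
    struct PatchSamples {
        int m = 0, n = 0;
        std::vector<double> ref;   // I_ij from the homography transformation
        std::vector<double> tex;   // I~_ij from OpenGL texture mapping
    };

    // Eq. (2): mean-squared error between the two samplings of the patch.
    double meanSquaredError(const PatchSamples& p) {
        double sum = 0.0;
        for (std::size_t k = 0; k < p.ref.size(); ++k) {
            double d = p.ref[k] - p.tex[k];
            sum += d * d;
        }
        return sum / (p.m * p.n);
    }

    // Eq. (3): PSNR with MAX_I = 255 assumed for 8-bit images.
    double psnr(const PatchSamples& p) {
        return 10.0 * std::log10(255.0 * 255.0 / meanSquaredError(p));
    }

    // Eq. (4): a patch is accepted when PSNR >= T; otherwise the caller splits
    // the patch (for example into four quadrants), resamples, and tests again.
    bool patchIsAccurateEnough(const PatchSamples& p, double T) {
        return psnr(p) >= T;
    }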
                              III. ON-LINE MONITORING

   The proposed system displays the videos on the 3-D model. However, 3-D foreground objects such as pedestrians are projected to the image frame and become 2-D objects. They will appear flattened on the floor or walls since the system displays them on planar patches. Furthermore, there might be ghosting effects when 3-D objects are in the overlapping areas of different camera views. We tackle these problems by separating and rendering the 3-D foreground objects in addition to the background environment.
Fig. 5. The tracking results obtained by using different shadow thresholds while people stand on different positions of the floor. (a) Tr = 0.8 (b) Tr = 0.3. The threshold value Tθ = 6° is the same for both.

A. Underlying assumption

   Shadow is a type of foreground noise. It can appear in any zone of the camera scene. In [8], each pixel belonging to a shadow blob is detected by two properties. First, the color vector of a pixel in a shadow blob has a direction similar to that of the background pixel at the same position of the image. Second, the magnitude of the color vector in the shadow is slightly less than that of the corresponding background color vector. Similar to [11], RGB or another color space can be transformed into a two-dimensional space (called the angle-module space). For the color vector I_c(x, y) of a pixel at position (x, y) of the current frame, the angle θ(x, y) between the background vector I_b(x, y) and I_c(x, y), and the magnitude ratio r(x, y), are defined as

          θ(x, y) = \arccos\left( \frac{I_c(x, y) \cdot I_b(x, y)}{|I_c(x, y)|\,|I_b(x, y)| + \epsilon} \right) ,        (5)

          r(x, y) = \frac{|I_c(x, y)|}{|I_b(x, y)|} ,                                                                   (6)

where ε is a small number to avoid a zero denominator. In [11], the shadow pixels have to satisfy

                      Tθ < cos θ(x, y) < 1 ,                             (7)
                      Tr < r(x, y) < 1 ,                                 (8)

where Tθ is the angle threshold and Tr is the module ratio threshold. According to the demonstration shown in Figure 5, the best shadow thresholds highly depend on the positions (pixels) in the scene because of the complexity of the environment, the light sources, and the positions of the objects. Therefore, we propose a method that automatically adjusts the shadow detection thresholds for each pixel. The threshold for classifying a pixel as shadow or not is determined from the necessary samples (data) collected over time. Only one parameter has to be manually initialized, namely Tθ(0), where 0 denotes the initial time. The method can then update the thresholds automatically and quickly. Our method is faster than the similar idea, the GMSM method [12], once a background model has been built. There are two major advantages in computation time for our method. First, only the necessary samples are collected. Second, compared with the method in [12], any background or foreground results can be combined with our method such that the background does not have to be determined again.

   In the indoor environment, we assume that the color of a pixel in shadow is similar to that of the background, as expressed by inequality (7), although this is evidently not the case in sunshine outdoors. Only the indoor environment is considered in this paper.
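As a concrete illustration of Eqs. (5)–(8), the sketch below classifies one pixel as a shadow candidate from its current and background RGB vectors; the fixed thresholds passed in are placeholders, since in the proposed method Tθ and Tr are learned per pixel.

    #include <cmath>

    struct RGB { double r, g, b; };

    static double dot(const RGB& a, const RGB& b) { return a.r*b.r + a.g*b.g + a.b*b.b; }
    static double norm(const RGB& a) { return std::sqrt(dot(a, a)); }

    // Eqs. (5) and (6): angle between current and background color vectors and
    // the magnitude ratio; eps avoids a zero denominator.
    void angleModule(const RGB& Ic, const RGB& Ib, double& cosTheta, double& ratio,
                     double eps = 1e-6) {
        cosTheta = dot(Ic, Ib) / (norm(Ic) * norm(Ib) + eps);
        ratio    = norm(Ic) / (norm(Ib) + eps);
    }

    // Eqs. (7) and (8): a pixel is a shadow candidate when both conditions hold.
    bool isShadowCandidate(const RGB& Ic, const RGB& Ib, double Ttheta, double Tr) {
        double cosTheta, ratio;
        angleModule(Ic, Ib, cosTheta, ratio);
        return (Ttheta < cosTheta && cosTheta < 1.0) && (Tr < ratio && ratio < 1.0);
    }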
B. Collecting samples

   Samples I(x, y, t) from a number of frames are collected to decide the shadow area, where t is the time. In [12], all samples are collected, including the background, shadow, and foreground classes, according to how the pixel value changes over time. But if a good background model has already been built and some initial foreground objects have been segmented, the background samples are not necessary; only the foreground and shadow samples I_f(x, y, t) need to be considered. Besides, since background pixels are dropped from the sample list, this saves computation and memory, especially in a scene with few objects. Furthermore, I_f^{Tθ}(x, y, t) is obtained by dropping from I_f(x, y, t) the samples which do not satisfy inequality (7). The resulting sample data are then composed of more shadow samples and fewer foreground samples, which also means that the threshold r(x, y, t) can be derived more easily than a threshold derived from the samples of I_f(x, y, t).

C. Deciding the module ratio threshold

   The initial threshold Tθ(x, y, 0) is set according to experiment; in this case, Tθ(x, y, 0) = cos(6°) is used as the initial value. After enough samples have been collected, the initial module ratio threshold Tr(x, y, 0) can be decided by a method we call fast step minimum searching (FSMS). FSMS can quickly separate the shadow distribution from the foreground distribution in the collected samples described above. The details are as follows. The whole distribution is divided into bins of window size w, and the height of each bin is the number of samples falling in it. Besides the background peak, two other peaks are found. Starting from the peak that is closest to, and smaller than, the average background value, the shadow threshold Tr can be found by searching for the minimum bin value or a value close to zero.
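A minimal sketch of the FSMS idea, under the assumption (not spelled out in the paper) that the collected ratio samples have first been binned into a histogram with window size w: locate the peak just below the background ratio, then walk toward the background value, keeping the lowest bin seen and stopping early at an empty bin; that valley is used as Tr.

    #include <vector>

    // Histogram of magnitude-ratio samples r(x, y, t) for one pixel, bin width w
    // over [0, 1]. Returns the bin center chosen as the module ratio threshold Tr.
    double fsmsRatioThreshold(const std::vector<int>& hist, double w, double bgRatio = 1.0) {
        int bgBin = static_cast<int>(bgRatio / w);
        if (bgBin >= static_cast<int>(hist.size())) bgBin = static_cast<int>(hist.size()) - 1;

        // 1. Find the shadow peak: the highest bin below the background bin.
        int peak = 0;
        for (int b = 0; b < bgBin; ++b)
            if (hist[b] > hist[peak]) peak = b;

        // 2. From that peak, move toward the background bin, tracking the lowest
        //    bin and stopping at an empty one; this valley separates the shadow
        //    samples from the background/foreground samples.
        int valley = peak;
        for (int b = peak + 1; b < bgBin; ++b) {
            if (hist[b] <= hist[valley]) valley = b;
            if (hist[b] == 0) { valley = b; break; }
        }
        return (valley + 0.5) * w;   // bin center used as Tr for this pixel
    }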
D. Updating the angle threshold

   When a pixel satisfies both conditions in inequalities (7) and (8) at the same time, the pixel is classified as shadow. In other words, if the pixel I_s(x, y) is actually a shadow pixel and has been classified as a shadow candidate by FSMS, the pixel is additionally required to satisfy

                0 ≤ cos θ(x, y, t) < Tθ(x, y, t) .               (9)

   Tθ(x, y, t) can be decided by searching for the minimum cos θ of the pixels in I_s obtained by FSMS.
However, we propose another method to find Tθ(x, y, t) more quickly. Let A_{b,s}^{Tr}(x, y, t) be the number of samples classified as shadow or background at time t by FSMS. We define a ratio R(Tr) = A_{b,s}^{Tr} / A_{b,s,f}, where A_{b,s,f} is the number of all samples at position (x, y), and b, s, f denote the background, shadow, and foreground, respectively. The threshold Tθ(x, y, t) can then be updated to T'θ(x, y, t) through R(Tr): the number of samples whose cos θ(x, y) values are larger than T'θ(x, y, t) is required to equal A_{b,s}, that is,

                      R(T'θ(x, y, t)) = R(Tr) .                       (10)

   Besides, we add a perturbation δTθ to T'θ(x, y, t). Since FSMS only finds a threshold within I_f^{Tθ}(x, y, t), if the initial threshold Tθ(x, y, 0) is set larger than the true threshold, the best updated threshold can never become smaller than the current threshold Tθ, and therefore the true angle threshold would never be found over time. To solve this problem, a perturbation is subtracted from the updated threshold:

                  Tθ(x, y, t) = T'θ(x, y, t) − δTθ .                   (11)

   Since the new threshold Tθ(x, y, t) has a smaller value and covers more samples, it can approach the true threshold over time. This perturbation also makes the method more adaptable to changes of the environment. Figure 6 illustrates the whole method.

Fig. 6. A flowchart illustrating the whole method. The purple part is pixel-based.
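The angle-threshold update can be read as a quantile computation: choose T'θ so that the fraction of samples with cos θ above it matches R(Tr), then subtract the perturbation of Eq. (11). The sketch below assumes the per-pixel cos θ samples are available in a vector; the quantile interpretation of Eq. (10) is our reading, not a statement from the paper.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Update the per-pixel angle threshold following Eqs. (10) and (11).
    //   cosThetaSamples : cos(theta) of all collected samples at this pixel
    //   ratioR          : R(Tr) = A_{b,s}^{Tr} / A_{b,s,f} obtained from FSMS
    //   deltaTtheta     : small perturbation letting the threshold keep decreasing
    double updateAngleThreshold(std::vector<double> cosThetaSamples,
                                double ratioR, double deltaTtheta) {
        if (cosThetaSamples.empty()) return 1.0;
        std::sort(cosThetaSamples.begin(), cosThetaSamples.end());

        // Eq. (10): pick T'theta so that roughly a fraction ratioR of the samples
        // have cos(theta) larger than the threshold.
        std::size_t n = cosThetaSamples.size();
        std::size_t above = std::min<std::size_t>(
            static_cast<std::size_t>(ratioR * n + 0.5), n);
        double tPrime = (above == n) ? cosThetaSamples.front()
                                     : cosThetaSamples[n - above - 1];

        // Eq. (11): subtract the perturbation so the threshold can still move
        // below its current value and adapt to scene changes.
        return tPrime - deltaTtheta;
    }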

E. Axis-aligned billboarding

   For visualization, axis-aligned billboarding [14] constructs billboards in the 3-D model for moving objects, such as pedestrians, and a billboard always faces the viewpoint of the user. A billboard has three properties: location, height, and direction. By assuming that all the foreground objects are always moving on the floor, the billboards can be aligned to be perpendicular to the floor in the 3-D model. The 3-D location of the billboard is estimated by mapping the bottom-middle point of the foreground bounding box in the 2-D image through the lookup tables. The ratio between the height of the bounding box and the 3-D model determines the height of the billboard in the 3-D model. The relationship between the direction of a billboard and the viewpoint is defined as shown in Figure 7.

   The following equations are used to calculate the rotation angle of the billboard:

                                    Y = n × v ,                                 (12)

                                 φ = cos^{-1}(v · n) ,                              (13)

where v is the vector from the location of the billboard, L, to the location E projected vertically from the viewpoint onto the floor, n is the normal vector of the billboard, Y is the rotation axis, and φ is the estimated rotation angle. After the rotation, the normal vector of the billboard is parallel to the vector v and the billboard always faces toward the viewpoint of the operator.

Fig. 7. Orientation determination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle.
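A small sketch of Eqs. (12) and (13): given the billboard location L and the viewpoint projected onto the floor at E, compute the rotation axis and angle that turn the billboard normal n toward the viewer. Vectors are normalized here, which the equations implicitly assume; the axis and angle can then be handed to the renderer.

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 cross(const Vec3& a, const Vec3& b) {
        return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    }
    static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3 normalize(const Vec3& a) {
        double len = std::sqrt(dot(a, a));
        return { a.x/len, a.y/len, a.z/len };
    }

    // Eqs. (12)-(13): rotation that makes the billboard at L face the point E,
    // the viewpoint projected vertically onto the floor; n is the current normal.
    void billboardRotation(const Vec3& L, const Vec3& E, const Vec3& n,
                           Vec3& axisY, double& angleRad) {
        Vec3 v = normalize({E.x - L.x, E.y - L.y, E.z - L.z});   // vector from L to E
        Vec3 nn = normalize(n);
        axisY = cross(nn, v);                                    // Eq. (12)
        angleRad = std::acos(dot(v, nn));                        // Eq. (13)
    }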
F. Video content integration

   If the fields of view of the cameras overlap, objects in these overlapping areas are seen by multiple cameras. In this case, there might be ghosting effects when we simultaneously display the videos from these cameras. To deal with this problem, we use the 3-D locations of the moving objects to identify the correspondence of objects across different views. When the operator chooses a viewpoint, the rotation angles of the corresponding billboards are estimated by the method presented above, and the system renders only the billboard whose rotation angle is the smallest among all of the corresponding billboards, as shown in Figure 8.




Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause a ghosting effect (bottom-left). Static background images without foreground objects are used to fill the area of the foreground objects (top). Ghosting effects are removed, and the static background images can be updated by background modeling.

Fig. 9. Determination of the viewpoint switch. We divide the floor area depending on the fields of view of the cameras and associate each area with one of the viewpoints close to a camera. The viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects.

G. Automatic change of viewpoint

   The proposed surveillance system provides a target tracking feature by determining and automatically switching the viewpoints. Before rendering, several viewpoints are specified in advance to be close to the locations of the cameras. When switching from one viewpoint to another, the parameters of the viewpoints are gradually changed from the starting point to the destination point for a smooth view transition.

   The switching criterion is defined by the number of blobs found in specific areas. First, we divide the floor area into several parts and associate them with the cameras, as shown in Figure 9. When people move in the scene, the viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects; a sketch of this criterion is given below. We also make the billboard transparent by setting the alpha values of its texture, so the foreground objects appear with fitting shapes, as shown in Figure 10.
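To make the switching criterion concrete, the sketch below counts foreground blobs per predefined floor area and returns the viewpoint associated with the most populated area; the Blob and Area structures are illustrative assumptions rather than data types from the system.

    #include <vector>

    // Hypothetical structures: a foreground blob located on the floor plane, and
    // a rectangular floor area associated with one predefined viewpoint.
    struct Blob { double x, y; };                       // floor coordinates of a blob
    struct Area { double x0, y0, x1, y1; int viewpointId; };

    // Switching criterion: pick the viewpoint of the area that currently contains
    // the most foreground blobs. Returns currentViewpoint when no blob is visible.
    int selectViewpoint(const std::vector<Blob>& blobs, const std::vector<Area>& areas,
                        int currentViewpoint) {
        int best = currentViewpoint, bestCount = 0;
        for (const Area& a : areas) {
            int count = 0;
            for (const Blob& b : blobs)
                if (b.x >= a.x0 && b.x <= a.x1 && b.y >= a.y0 && b.y <= a.y1)
                    ++count;
            if (count > bestCount) { bestCount = count; best = a.viewpointId; }
        }
        return best;
    }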
                    IV. EXPERIMENTAL RESULTS

   We developed the proposed surveillance system on a PC with an Intel Core Quad Q9550 processor, 2 GB RAM, and one nVidia GeForce 9800GT graphics card. Three IP cameras with 352 × 240 pixel resolution are connected to the PC through the Internet. The frame rate of the system is about 25 frames per second.

   In the monitored area, automated doors and elevators are specified as background objects, albeit their images do change when the doors open or close. These areas are modeled during background construction and are not visualized by billboards; the system uses a ground mask to indicate the region of interest. Only the moving objects located in the indicated areas are considered as moving foreground objects, as shown in Figure 11.

   The experimental results shown in Figure 12 demonstrate that the viewpoint can be chosen arbitrarily in the system, and operators can track targets with a closer view or from any viewing direction by moving the virtual camera. Moreover, the moving objects always face the virtual camera because of billboarding, and the operators can easily perceive the spatial information of the foreground objects from any viewpoint.

Fig. 10. Automatic switching of the viewpoint for tracking targets. People walk in the lobby and the viewpoint of the operator automatically switches to keep track of the targets.

Fig. 11. Dynamic background removal by the ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top-right). A mask covering the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right).

Fig. 12. Immersive monitoring at an arbitrary viewpoint. We can zoom out the viewpoint to monitor the whole surveillance area or zoom in the viewpoint to focus on a particular place.

                         V. CONCLUSIONS

   In this work we have developed an integrated video surveillance system that provides a single comprehensive view of the monitored areas to facilitate tracking moving targets through its interactive control and immersive visualization. We utilize planar patches for 3-D environment model construction. The scenes from the cameras are divided into several patches according to their structures, and the numbers and sizes of the patches are automatically determined to compromise between rendering quality and efficiency. To integrate the video contents, homography transformations are estimated for the relationships between image regions of the video contents and the corresponding areas of the 3-D model. Moreover, the proposed method to remove moving cast shadows can automatically decide the thresholds by on-line learning. In this way, manual settings can be avoided. Compared with frame-based work, our method increases the accuracy of shadow removal. In visualization, the foreground objects are segmented accurately and displayed on billboards.

                           REFERENCES



 [1] R. Sizemore, “Internet protocol/networked video surveillance market: Equipment, technology and semiconductors,” Tech. Rep., 2008.
 [2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, “Contextualized videos: Combining videos with environment models to support situational understanding,” IEEE Transactions on Visualization and Computer Graphics, 2007.
 [3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, “Accurate planar image registration for an integrated video surveillance system,” Computational Intelligence for Visual Intelligence, 2009.
 [4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna, “Video flashlights: real time rendering of multiple videos for immersive model visualization,” in 13th Eurographics Workshop on Rendering, 2002.
 [5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, “Augmented virtual environments (AVE): dynamic fusion of imagery and 3-D models,” IEEE Virtual Reality, 2003.
 [6] S. You, J. Hu, U. Neumann, and P. Fox, “Urban site modeling from LiDAR,” Lecture Notes in Computer Science, 2003.
 [7] I. Sebe, J. Hu, S. You, and U. Neumann, “3-D video surveillance with augmented virtual environments,” in International Multimedia Conference, 2003.
 [8] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-time robust background subtraction and shadow detection,” IEEE ICCV, 1999.
 [9] K. Chung, Y. Lin, and Y. Huang, “Efficient shadow detection of color aerial images based on successive thresholding scheme,” IEEE Transactions on Geoscience and Remote Sensing, 2009.
[10] J. Kim and H. Kim, “Efficient region-based motion segmentation for a video monitoring system,” Pattern Recognition Letters, 2003.
[11] E. J. Carmona, J. Martínez-Cantos, and J. Mira, “A new video segmentation method of moving objects based on blob-level knowledge,” Pattern Recognition Letters, 2008.
[12] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast shadows through a multidistribution approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master, “Calibrated, registered images of an extended urban area,” International Journal of Computer Vision, 2003.
[14] A. Fernandes, “Billboarding tutorial,” 2005.



Morphing And Texturing Based On The Transformation Between Triangle Mesh And Point

                     Wei-Chih Hsu
   Department of Computer and Communication Engineering,
   National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan

                     Wu-Huang Cheng
   Institute of Engineering Science and Technology,
   National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
   u9715901@nkfust.edu.tw


Abstract—This research proposes a methodology for transforming a triangle-mesh object into a point-based object, along with its applications. Considering cost and program functionality, the experiments in this paper adopt C++ instead of 3D computer graphics software to create the point cloud from meshes. The method employs the mesh-bounded area and planar dilation to construct the point cloud of a triangle mesh. Two point-based applications are addressed in this research. 3D model generation can use point-based object morphing to simplify the computing structure. Another application, texture mapping, uses the relation between 2D image pixels and 3D planes. The experimental results illustrate several properties of point-based modeling, among which flexibility and scalability are the biggest advantages. The main goal of this research is to explore more sophisticated methods of 3D object modeling from point-based objects.

   Keywords-point-based modeling; triangle mesh; texturing; morphing

                      I.    INTRODUCTION

    In recent computer-graphics research, form•Z, Maya, 3DS Max, Blender, LightWave, Modo, solidThinking, and other 3D computer graphics packages are frequently adopted tools. For example, Maya is a very popular package that includes many powerful and efficient functions for producing results. The diverse functions of such software can increase working efficiency, but the methodology design must follow specific rules and the cost is usually high. Using C++ as the research tool has many advantages, especially in collecting data information: powerful functions can be created from C language instructions, parameters, and C++ object orientation, and the more complete the extracted 3D data are, the less restricted the analysis that can be produced.

    The polygon mesh is widely used to represent 3D models but has some drawbacks in modeling; an unsmooth surface of combined meshes is one of them. Estimating the vertices of objects and constructing the vertex set of each mesh are factors of modeling inefficiency. Point-based modeling is a solution that conquers some disadvantages of mesh modeling. It is based on point primitives, and no structure relating each point to another is needed. To simplify point-based data, marching cubes and Delaunay triangulation can be employed to transform a point-based model into a polygon mesh. Mark Pauly has published many related studies on point-based modeling in international journals: [1] proposed a method to represent multi-scale surfaces, and M. Müller et al. [2] developed a method for modeling and animation showing that the point-based representation is flexible.

    Morphing can be based on geometric, shape, or other features. Mesh-based morphing sometimes involves geometry, mesh structure, and other feature analysis. [3] demonstrated a method to edit free-form surfaces based on geometry; the method applies complex computation to deal with topology, curved-face properties, and triangulation. [4] not only divided objects into components but also used the components in local-level and global-level morphing. [5] adopted two-model morphing with mesh comparison and merging to generate a new model. These methods involve complicated data structures and computation. This research illustrates a simple approach with less feature analysis that creates new models by using regular points to morph two or more objects.

    Texturing is essential in rendering a 3D model. In virtual reality, the goal of texture mapping is to be as similar to the real object as possible; for special effects, exaggerated texturing is more suitable. [6] built a mesh atlas for texturing: the texture atlas coordinates, considered together with the triangle mesh structure, were mapped to the 3D model. [7] used the conformal equivalence of triangle meshes to find a flat mesh for texture mapping; this method is comprehensible and easy to implement.

    The rest of the paper is arranged as follows. Transforming a triangle mesh into a point set for modeling and point-based morphing for model creation are addressed in Sections II and III. Point-based texture mapping is addressed in Section IV, followed by the conclusion in Section V.

          II. TRANSFORMING TRIANGLE MESH INTO POINT SET

    In order to exploit the advantages of a point-based model, transforming the triangle mesh into points is the first step. The point set can be estimated by using the three normal bound lines of the triangle mesh. The normal, denoted by n, can be calculated from the three triangle vertices. The points inside the triangle area are denoted by B_in, A denotes the triangle mesh area, a point on the 3D space plane is presented by p with coordinates (x, y, z), v_i, i = 1, 2, 3, denotes the three vertices of the triangle mesh, and v̄ denotes the mean of the three triangle vertices. The formula that presents the triangle area is described below.




    A = \{\, p(x, y, z) \mid p\,n^T - v_i\,n^T = 0,\ i \in \{1, 2, 3\},\ p \in B_{in} \,\}

    B_{in} = \{\, p(x, y, z) \mid f_{(i,j)}(p) \times f_{(i,j)}(\bar{v}) > 0 \,\}

    f_{(i,j)}(p) = r \times a - b + s

    r = \frac{b_j - b_i}{a_j - a_i}, \qquad s = b_i - r \times a_i

    i, j = 1, 2, 3, \qquad a, b \in \{x, y, z\}, \qquad i < j, \qquad a < b

    The experiments use object files in the Wavefront file format (.obj) from the NTU 3D Model Database ver. 1 of National Taiwan University. The process of transforming a triangle mesh into a point-based object is shown in Figure 1. It is clear that some areas, marked by the red rectangles in Figure 1, do not obtain the complete point set. The planar dilation process is employed to refine these failed areas.

    The planar dilation process uses the 26-connected planar neighborhood to refine the spots left in the area. The first half of Figure 2 shows the 26 positions of the connected planar neighborhood. The condition to verify is whether a plane point and its 26 neighbor positions belong to the object plane, and the main purpose of estimating the object plane is to verify that this condition is true. The result in the second half of Figure 2 reveals the efficiency of the planar dilation process.

Figure 1. The process of transforming triangle mesh into point-based.

Figure 2. Planar dilation process.
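The following C++ sketch illustrates the mesh-to-point transformation under the definitions above: candidate grid points are kept when they lie on the same side of every edge line as the triangle centroid and are placed on the face plane. The x-y projection, the grid step parameter, and skipping near-vertical faces are simplifying assumptions of this illustration rather than details given in the paper.

    #include <cmath>
    #include <vector>

    struct P3 { double x, y, z; };

    static P3 sub(const P3& a, const P3& b) { return { a.x-b.x, a.y-b.y, a.z-b.z }; }
    static P3 cross(const P3& a, const P3& b) {
        return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    }
    static double dot(const P3& a, const P3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // Edge test f_{(i,j)} evaluated in the x-y projection: a point is kept when it
    // lies on the same side of every edge line as the triangle centroid v_bar.
    static double edgeF(double ax, double bx, double ay, double by, double px, double py) {
        double r = (by - ay) / (bx - ax + 1e-12);   // slope of the edge line
        double s = ay - r * ax;                     // intercept
        return r * px - py + s;                     // f(p) = r*a - b + s
    }

    // Fill one mesh face with points spaced by 'step', following the bounded-area
    // idea: points satisfy the plane equation p*n^T = v_i*n^T and the same-side test.
    std::vector<P3> triangleToPoints(const P3 v[3], double step) {
        P3 n = cross(sub(v[1], v[0]), sub(v[2], v[0]));   // plane normal
        P3 c = { (v[0].x+v[1].x+v[2].x)/3, (v[0].y+v[1].y+v[2].y)/3, (v[0].z+v[1].z+v[2].z)/3 };

        double xmin = std::fmin(v[0].x, std::fmin(v[1].x, v[2].x));
        double xmax = std::fmax(v[0].x, std::fmax(v[1].x, v[2].x));
        double ymin = std::fmin(v[0].y, std::fmin(v[1].y, v[2].y));
        double ymax = std::fmax(v[0].y, std::fmax(v[1].y, v[2].y));

        std::vector<P3> pts;
        for (double x = xmin; x <= xmax; x += step)
            for (double y = ymin; y <= ymax; y += step) {
                bool inside = true;
                int idx[3][2] = { {0, 1}, {0, 2}, {1, 2} };
                for (auto& e : idx) {
                    double fp = edgeF(v[e[0]].x, v[e[1]].x, v[e[0]].y, v[e[1]].y, x, y);
                    double fc = edgeF(v[e[0]].x, v[e[1]].x, v[e[0]].y, v[e[1]].y, c.x, c.y);
                    if (fp * fc <= 0.0) { inside = false; break; }
                }
                if (!inside) continue;
                if (std::fabs(n.z) < 1e-12) continue;          // near-vertical face: skipped here
                double z = (dot(v[0], n) - n.x * x - n.y * y) / n.z;   // plane equation
                pts.push_back({ x, y, z });
            }
        return pts;
    }

The planar dilation step described above would then fill any spots this sampling leaves behind.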
          III. POINT-BASED MORPHING FOR MODEL CREATING

    Greater flexibility in combining objects is one of the properties of point-based modeling. No matter what the shape or category of the objects, the method of this study can put them into the morphing process to create new objects.

    The morphing process includes three steps. Step one is to equalize the objects. Step two is to calculate each normal point of the objects in the morphing process. Step three is to estimate each point of the target object by using the same normal point of the two objects with the formula described below:

          o_t = p_{r1} o_1 + p_{r2} o_2 + \cdots + \Bigl( 1 - \sum_{i=1}^{n-1} p_{ri} \Bigr) o_n ,

          0 \le p_{r1}, p_{r2}, \ldots, p_{r(n-1)} \le 1 , \qquad \sum_{i=1}^{n} p_{ri} = 1 ,

where o_t presents each target object point of the morphing, o_i is an object in the morphing process, p_{ri} denotes the weight of the object's effect in the morphing process, and i indicates the index of the object. The appearance of the new model generated by morphing depends on which objects are chosen and on the value of each object weight. The experiments in this research use two objects, therefore i = 1 or 2 and n = 2.

    The results are shown in Figure 3. The first row is the morphing of a simple flat board and a character. The second row shows the freedom of object selection in point-based modeling, because two totally different objects can be put into the morphing and still produce satisfactory results. The models created by morphing objects with different weights can be seen in Figure 4.
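A minimal sketch of the morphing formula for the two-object case used in the experiments (n = 2); it assumes the two point sets have already been equalized to the same size and that corresponding points share the same index.

    #include <vector>

    struct P3 { double x, y, z; };

    // Weighted point-wise morphing, o_t = w * o_1 + (1 - w) * o_2, for two
    // equalized point-based objects with index-wise correspondence.
    std::vector<P3> morphTwoObjects(const std::vector<P3>& o1,
                                    const std::vector<P3>& o2, double w) {
        std::vector<P3> target(o1.size());
        for (std::size_t k = 0; k < o1.size() && k < o2.size(); ++k) {
            target[k].x = w * o1[k].x + (1.0 - w) * o2[k].x;
            target[k].y = w * o1[k].y + (1.0 - w) * o2[k].y;
            target[k].z = w * o1[k].z + (1.0 - w) * o2[k].z;
        }
        return target;
    }

Varying w between 0 and 1 produces the intermediate models with different weights shown in Figure 4.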
                 IV. POINT-BASED TEXTURE MAPPING

    Texture mapping is quite plain in this research method. It uses a texture matrix to map the 3D model to the 2D image pixels, based on the concept of transforming a 2D image into 3D. Assume the 3D space is divided into α × β blocks, where α is the number of rows and β is the number of columns, and the length, width, and height of the 3D space are h × h × h; hereafter (X, Y) and (x, y, z) denote the image coordinates and the 3D model coordinates, respectively. The texture of each block is assigned by a texture cube, which is made from the 2D image as shown in the middle image of the first row of Figure 5. The process can be expressed by the formula below.
    A t^T = c^T ,

    t = \Bigl[\, x \bmod \tfrac{h}{\alpha},\ \ y \bmod \tfrac{h}{\beta},\ \ z \bmod \tfrac{h}{\beta} \,\Bigr] , \qquad c = [\, X, Y \,] ,

    A = \begin{bmatrix} \alpha & 0 & 0 \\ 0 & \dfrac{\beta (h - z)}{y} & 0 \end{bmatrix} ,

where A denotes the texture transforming matrix, t denotes the current position in the 3D model, and c denotes the image pixel content at the current position.
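The sketch below is a literal transcription of the block texture-mapping formula as printed; the entry that divides by y follows the printed matrix and would need guarding against y = 0 in practice, and the function name is ours.

    #include <cmath>

    // Hypothetical illustration of the block texture-mapping formula A * t^T = c^T.
    // (x, y, z) is a point of the 3D model, h is the extent of the 3D space, and
    // alpha x beta is the number of texture blocks; (X, Y) is the looked-up pixel.
    void textureLookup(double x, double y, double z, double h,
                       double alpha, double beta, double& X, double& Y) {
        // t = [ x mod h/alpha, y mod h/beta, z mod h/beta ]
        double t0 = std::fmod(x, h / alpha);
        double t1 = std::fmod(y, h / beta);
        // The third component of t is multiplied by the zero column of A, so it
        // does not contribute to the result.

        // A = [ alpha   0              0 ]
        //     [ 0       beta*(h-z)/y   0 ]
        X = alpha * t0;
        Y = (beta * (h - z) / y) * t1;
    }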
The experimental results are shown in the second row of Figure 5 and in Figure 6. The results obtained with the setting α = β = 2 are shown in the second row of Figure 5, and the results obtained with α = β = 4 are shown in the first row of Figure 6. The last row of Figure 6 indicates that the proposed texture mapping method can be applied to any point-based model.
                                                                                     [5]   Kosuke Kaneko, Yoshihiro Okada and Koichi Niijima, “3D
                           V.         CONCLUSION                                           Model Generation by Morphing,” IEEE Computer Graphics,
                                                                                           Imaging and Visualisation, 2006.
In summary, this research focuses on point-based modeling applications implemented in C++ rather than with off-the-shelf tools or other computer graphics software. The point-based methodologies developed here rely on simple data structures and involve little computational complexity, and they support two applications: morphing and texture mapping. The experimental results confirm the scalability and flexibility of the proposed methodologies.




                                         Figure 3. The results of point-based modeling using different objects morphing.




Figure 4. The models created by objects morphing with different weights.




Figure 5. The process of 3D model texturing with 2D image shown in first row and the results shown in second row.




Figure 6. The results of point-based texture mapping with α = β = 4 and different objects.




LAYERED LAYOUTS OF DIRECTED GRAPHS USING A GENETIC
                    ALGORITHM

            Chun-Cheng Lin1,∗, Yi-Ting Lin2 , Hsu-Chun Yen2,† , Chia-Chen Yu3
                                  1
                                    Dept. of Computer Science,
            Taipei Municipal University of Education, Taipei, Taiwan 100, ROC
                               2
                                 Dept. of Electrical Engineering,
                   National Taiwan University, Taipei, Taiwan 106, ROC
                           3
                             Emerging Smart Technology Institute,
                  Institute for Information Industry, Taipei, Taiwan, ROC


                    ABSTRACT

By layered layouts of graphs (in which nodes are distributed over several layers and all edges are directed downward as much as possible), users can easily understand the hierarchical relation of directed graphs. The well-known method for generating layered layouts proposed by Sugiyama includes four steps, each of which is associated with an NP-hard optimization problem. It is observed that the four optimization problems are not independent, in the sense that their respective aesthetic criteria may contradict each other. That is, it is impossible to obtain an optimal solution satisfying all aesthetic criteria at the same time. Hence, the choice made for each criterion becomes a very important problem. In this paper, we propose a genetic algorithm that models the first three steps of Sugiyama's algorithm, in the hope of considering the first three aesthetic criteria simultaneously. Our experimental results show that the proposed algorithm can make layered layouts that satisfy human aesthetic viewpoints.

Keywords: Visualization, genetic algorithm, graph drawing.

                 1. INTRODUCTION

Drawings of directed graphs have many applications in our daily lives, including manuals, flow charts, maps, posters, schedules, UML diagrams, etc. It is important that a graph be drawn clearly, such that users can understand and get information from the graph easily. This paper focuses on layered layouts of directed graphs, in which nodes are distributed on several layers and, in general, edges should point downward, as shown in Figure 1(b). With this layout, users can easily trace each edge from top to bottom and understand the priority or ordering information of the nodes clearly.

Figure 1: The layered layout of a directed graph.

   Specifically, we use the following criteria to estimate the quality of a directed graph layout: to minimize the total length of all edges; to minimize the number of edge crossings; to minimize the number of edges pointing upward; and to draw edges as straight as possible. Sugiyama [9] proposed a classical algorithm for producing layered layouts of directed graphs, consisting of four steps: cycle removal, layer assignment, crossing reduction, and assignment of horizontal coordinates, each of which addresses the problem of achieving one of the above criteria. Unfortunately, the first three problems have been proven to be NP-hard when the width of the layout is restricted.

   ∗ Research supported in part by National Science Council under grant NSC 98-2218-E-151-004-MY3.
   † Research supported in part by National Science Council under grant NSC 97-2221-E-002-094-MY3.



There has been a great deal of work in the literature on each step of Sugiyama's algorithm.
   Drawing layered layouts by four independent steps can be executed efficiently, but it may not always yield nice layouts, because preceding steps may restrain the results of subsequent steps. For example, the four nodes assigned to two layers after the layer assignment step lead to an edge crossing in Figure 2(a); this crossing cannot be removed during the subsequent crossing reduction step, which only changes each node's relative position on its layer, yet in fact the crossing can be removed, as drawn in Figure 2(b). Namely, the crossing reduction step is restricted by the layer assignment step. Such a negative effect exists not only for these two particular steps but for every other preceding/subsequent step pair.

Figure 2: Different layouts of the same graph.

   Even if one could obtain the optimal solution for each step, those "optimal solutions" may not constitute the real optimal solution, because the locally optimal solutions are restricted by their respective preceding steps. Since we cannot obtain an optimal solution satisfying all criteria at the same time, we have to make a trade-off among all the criteria.
   For the above reasons, the basic idea of our method for drawing layered layouts is to combine the first three steps, so as to avoid the restrictions caused by the trade-offs between criteria. We then use a genetic algorithm to implement this idea. In the literature, there has been some work on producing layered layouts of directed graphs using genetic algorithms, e.g., using a genetic algorithm to reduce edge crossings in bipartite graphs [7] or in entire acyclic layered layouts [6], modifying nodes in a subgraph of the original graph on a layered layout [2], drawing general layouts of directed or undirected graphs [3] [11], and drawing layered layouts of acyclic directed graphs [10].
   Note that the algorithm for drawing layered layouts of acyclic directed graphs in [10] also combined three steps of Sugiyama's algorithm, but drawing layered layouts of acyclic and of cyclic directed graphs are quite different problems. For acyclic graphs, one does not need to solve the cycle removal problem, and if the algorithm does not restrict the layers to a fixed width, one does not need to solve the limited-width layer assignment problem either. Note that unlimited-width layer assignment is not NP-hard, because the layers of nodes can be assigned by a topological ordering. The algorithm in [10] only focuses on minimizing the number of edge crossings and making the edges as straight as possible; although it also combines three steps of Sugiyama's algorithm, it contains only one NP-hard problem. In contrast, our algorithm combines three NP-hard problems: cycle removal, limited-width layer assignment, and crossing reduction.
   In addition, our algorithm has the following advantages. More customized restrictions on layered layouts can be added to our algorithm; for example, some nodes should be placed to the left of some other nodes, the maximal layer number should be less than or equal to a certain number, etc. Moreover, the weighting ratio of each optimization criterion can be adjusted for different applications. According to our experimental results, our genetic algorithm can effectively adjust the ratio between the number of edge crossings and the total edge length. That is, our algorithm can make layered layouts more appealing to human aesthetic viewpoints.

                2. PRELIMINARIES

The frameworks of three different algorithms for layered layouts of directed graphs (i.e., Sugiyama's algorithm, the cyclic leveling algorithm, and our algorithm) are illustrated in Figure 3(a)-3(c), respectively. Sugiyama's algorithm consists of four steps, as mentioned previously; the other two algorithms are based on Sugiyama's algorithm, in which the cyclic leveling algorithm combines the first two steps, while our genetic algorithm combines the first three steps. Furthermore, a barycenter algorithm is applied to the crossing reduction step of both the cyclic leveling algorithm and our genetic algorithm, and the priority layout method is applied to the x-coordinate assignment step.
Figure 3: Comparison among the different algorithms. (a) Sugiyama's algorithm: cycle removal, layer assignment, crossing reduction, and x-coordinate assignment. (b) Cyclic leveling: cycle removal and layer assignment combined, followed by crossing reduction and x-coordinate assignment. (c) Our genetic algorithm: cycle removal, layer assignment, and crossing reduction combined (crossing reduction via the barycenter algorithm), followed by x-coordinate assignment via the priority layout method.

Figure 4: Two kinds of crossings. (a) An edge crossing. (b) An edge-node crossing.

                2.1. Basic Definitions

A directed graph is denoted by G(V, E), where V is the set of nodes and E is the set of edges. An edge e is denoted by e = (v1, v2) ∈ E, where v1, v2 ∈ V; edge e is directed from v1 to v2. A so-called layered layout is defined by the following conditions: (1) Let the number of layers in the layout be denoted by n, where n ∈ N and n ≥ 2; the n-layer layout is denoted by G(V, E, n). (2) V is partitioned into n subsets, V = V1 ∪ V2 ∪ · · · ∪ Vn, where Vi ∩ Vj = ∅ for all i ≠ j; the nodes in Vk are assigned to layer k, 1 ≤ k ≤ n. (3) A sequence ordering σi of Vi is given for each i (σi = v1 v2 · · · v|Vi| with x(v1) < x(v2) < · · · < x(v|Vi|)). The n-layer layout is then denoted by G(V, E, n, σ), where σ = (σ1, σ2, · · · , σn) with y(σ1) < y(σ2) < · · · < y(σn).
   An n-layer layout is called "proper" when it further satisfies the following condition: E is partitioned into n − 1 subsets, E = E1 ∪ E2 ∪ · · · ∪ En−1, where Ei ∩ Ej = ∅ for all i ≠ j, and Ek ⊂ Vk × Vk+1, 1 ≤ k ≤ n − 1.
   An edge crossing (assuming that the layout is proper) is defined as follows. Consider two edges e1 = (v11, v12), e2 = (v21, v22) ∈ Ei, in which v11 and v21 are the j1-th and the j2-th nodes in σi, respectively, and v12 and v22 are the k1-th and the k2-th nodes in σi+1, respectively. If either j1 < j2 and k1 > k2, or j1 > j2 and k1 < k2, there is an edge crossing between e1 and e2 (see Figure 4(a)).
   An edge-node crossing is defined as follows. Consider an edge e = (v1, v2), where v1, v2 ∈ Vi; v1 and v2 are the j-th and the k-th nodes in σi, respectively. W.l.o.g., assuming that j > k, there are (j − k − 1) edge-node crossings (see Figure 4(b)).

                2.2. Sugiyama's Algorithm

Sugiyama's algorithm [9] consists of four steps. (1) Cycle removal: If the input directed graph is cyclic, we reverse as few edges as possible such that the input graph becomes acyclic. This problem can be stated as the maximum acyclic subgraph problem, which is NP-hard. (2) Layer assignment: Each node is assigned to a layer so that the total vertical length of all edges is minimized. If an edge spans at least two layers, dummy nodes are introduced on each crossed layer. If the maximum width is bounded by a value greater than or equal to three, the problem of finding a layered layout with minimum height is NP-complete. (3) Crossing reduction: The relative positions of nodes on each layer are reordered to reduce edge crossings. Even when the problem is restricted to bipartite (two-layer) graphs, it is NP-hard. (4) x-coordinate assignment: The x-coordinates of nodes and dummy nodes are modified such that all the edges of the original graph structure are as straight as possible. This step involves two objectives: to make all edges as close to vertical lines as possible, and to make all edge paths as straight as possible.

                2.3. Cyclic Leveling Algorithm

The cyclic leveling algorithm (CLA) [1] combines the first two steps of Sugiyama's algorithm, i.e., it focuses on minimizing the number of edges pointing upward and the total vertical length of all edges. It introduces a number called span that represents the number of edges pointing upward and the total vertical length of all edges at the same time.
   The span number is defined as follows. Consider a directed graph G = (V, E). Given k ∈ N, define a layer assignment function ϕ : V → {1, 2, · · · , k}. Let span(u, v) = ϕ(v) − ϕ(u) if ϕ(u) < ϕ(v), and span(u, v) = ϕ(v) − ϕ(u) + k otherwise. For each edge e = (u, v) ∈ E, denote span(e) = span(u, v) and span(G) = Σ_{e∈E} span(e). In brief, span measures the sum of the vertical lengths of all edges plus a penalty for edges pointing upward or horizontally, provided the maximum height of the layout is given.
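To make the definition concrete, the following Python sketch computes span(e) and span(G) for a given layer assignment; the graph is assumed to be a plain list of directed edges and ϕ a dictionary from nodes to layers in {1, ..., k}. This is only an illustration of the definition above, not the implementation from [1].

```python
# Illustrative sketch of span(e) and span(G) as defined above.
# edges: list of (u, v) pairs; phi: dict mapping each node to a layer in {1, ..., k}.
def span_edge(u, v, phi, k):
    # Downward edges contribute their vertical length; upward or horizontal
    # edges are wrapped around the k layers, which acts as a penalty.
    if phi[u] < phi[v]:
        return phi[v] - phi[u]
    return phi[v] - phi[u] + k

def span_graph(edges, phi, k):
    return sum(span_edge(u, v, phi, k) for u, v in edges)
```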



The main idea of the CLA is that if a node causes a high increase in span, then the layer position of that node is determined later. In the algorithm, a distance function is defined to decide which nodes should be assigned first. There are four such functions, but only one can be chosen and applied to all the nodes: (1) Minimum Increase in Span = min_{ϕ(v)∈{1,··· ,k}} span(E(v, V′)); (2) Minimum Average Increase in Span (MST MIN AVG) = min_{ϕ(v)∈{1,··· ,k}} span(E(v, V′))/|E(v, V′)|; (3) Maximum Increase in Span = 1/δ_MIN(v); (4) Maximum Average Increase in Span = 1/δ_MIN_AVG(v). According to the experimental results in [1], using MST MIN AVG as the distance function yields the best results. Therefore, our algorithm is compared with the CLA using MST MIN AVG in the experimental section.
              2.4. Barycenter Algorithm

The barycenter algorithm is a heuristic for solving the edge crossing problem between two layers. The main idea is to order the nodes on each layer by their barycentric values. Assuming that node u is located on layer i (u ∈ Vi), the barycentric value of node u is defined as bary(u) = (1/|N(u)|) Σ_{v∈N(u)} π(v), where N(u) is the set of u's connected nodes on the layer below or above u (Vi−1 or Vi+1), and π(v) is the order of v in σi−1 or σi+1. The algorithm reorders the relative positions of all nodes according to their barycentric values, sweeping from layer 2 to layer n and then from layer n − 1 to layer 1.
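The following Python sketch illustrates one such barycenter sweep. The data layout (a list of per-layer orderings and a neighbor function) is assumed for the purpose of illustration and is not taken from the paper.

```python
# Illustrative sketch of the barycenter reordering described above.
# layers[i] is the left-to-right ordering sigma_i of layer i (0-based here);
# neighbors(u, j) returns the nodes adjacent to u that lie on layer j.
def barycenter_sweep(layers, neighbors):
    def reorder(i, ref):
        pos = {v: p for p, v in enumerate(layers[ref])}
        current = {v: p for p, v in enumerate(layers[i])}
        def bary(u):
            nbrs = [pos[v] for v in neighbors(u, ref) if v in pos]
            # Nodes without neighbors on the reference layer keep their place
            # (a common convention; an assumption made here).
            return sum(nbrs) / len(nbrs) if nbrs else current[u]
        layers[i] = sorted(layers[i], key=bary)

    n = len(layers)
    for i in range(1, n):              # layer 2 .. layer n
        reorder(i, i - 1)
    for i in range(n - 2, -1, -1):     # layer n-1 .. layer 1
        reorder(i, i + 1)
```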
              2.5. Priority Layout Method

The priority layout method solves the x-coordinate assignment problem. Its idea is similar to that of the barycenter algorithm: it assigns the x-coordinate position of each node, layer by layer, according to the priority value of each node.
   At first, the x-coordinate positions of the nodes in each layer are given by x_k^i = x_0 + k, where x_0 is a given integer and x_k^i is the x-coordinate of the k-th element of σ_i. Next, the x-coordinate positions of the nodes are adjusted in the order from layer 2 to layer n, from layer n − 1 to layer 1, and from layer n/2 to layer n. The adjustments of node positions from layer 2 to layer n are called down procedures, while those from layer n − 1 to layer 1 are called up procedures. Based on the above, the priority value of the k-th node v on layer p is defined as follows: if node v is a dummy node, then priority(v) = B − |k − m/2|, in which B is a big given number and m is the number of nodes on layer p; otherwise, for down procedures (resp., up procedures), priority(v) is the number of nodes connected to v on layer p − 1 (resp., p + 1).
   Moreover, the new x-coordinate position of each node v is defined as the average x-coordinate position of the nodes connected to v on layer p − 1 (resp., p + 1) for down procedures (resp., up procedures).
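The sketch below illustrates the priority values and the target x-coordinates used by one down procedure (layer p adjusted against layer p − 1). It is a simplified reading of the description above: the handling of collisions between neighboring nodes on the same layer is omitted, and the data layout is assumed.

```python
# Illustrative sketch of one down procedure of the priority layout method.
# layer_p: nodes of layer p in left-to-right order; x_above: dict mapping each
# node of layer p-1 to its current x-coordinate; neighbors(v): nodes adjacent
# to v; is_dummy(v): whether v is a dummy node; B: a big given number.
def down_procedure(layer_p, x_above, neighbors, is_dummy, B=10**6):
    m = len(layer_p)

    def priority(k, v):
        if is_dummy(v):
            return B - abs(k - m / 2)
        return sum(1 for u in neighbors(v) if u in x_above)

    # Nodes are processed in decreasing order of priority; each one is moved
    # toward the average x-coordinate of its neighbors on layer p-1.
    new_x = {v: k for k, v in enumerate(layer_p)}          # x = x0 + k with x0 = 0
    order = sorted(range(m), key=lambda k: priority(k, layer_p[k]), reverse=True)
    for k in order:
        v = layer_p[k]
        nbrs = [x_above[u] for u in neighbors(v) if u in x_above]
        if nbrs:
            new_x[v] = sum(nbrs) / len(nbrs)
    return new_x
```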
              2.6. Genetic Algorithm

The genetic algorithm (GA) [5] is a stochastic global search method that has proved successful for many kinds of optimization problems. GA is categorized as a global search heuristic: it works with a population of candidate solutions and tries to optimize the answer by using three basic principles, namely selection, crossover, and mutation. For more details on GA, readers are referred to [5].

                  3. OUR METHOD

The major issue in drawing layered layouts of directed graphs is that, within the first three steps of Sugiyama's algorithm, the result of a preceding step may restrict that of the subsequent step. To solve this, we design a GA that combines the first three steps of Sugiyama's algorithm. Figure 5 shows the flow chart of our GA. That is, our method consists of a GA and an x-coordinate assignment step. Note that the barycenter algorithm and the priority layout method are also used in our method: the former is used inside our GA to reduce edge crossings, while the latter is applied in the x-coordinate assignment step.

Figure 5: The flow chart of our genetic algorithm (blocks: Initialization, Assign dummy nodes, Barycenter, Selection, Remove dummy nodes, Crossover, Mutation, Fine tune, Terminate?, Draw the best chromosome).
                 3.1. Definitions

For arranging nodes on layers, if the relative horizontal positions of the nodes are determined, then the exact x-coordinate positions of the nodes are also determined according to the priority layout method. Hence, in the following we only consider the relative horizontal positions of nodes, and each node is arranged on a grid. We use a GA to model the layered layout problem, so we define some basic elements:
Population: A population (generation) includes many chromosomes, and the number of chromosomes depends on the setting of the initial population size.
Chromosome: One chromosome represents one graph layout, in which the absolute position of each (dummy) node on the grid is recorded. Since the adjacencies of nodes and the directions of edges are not altered by our GA, we do not need to record this information on the chromosomes. On the grid, one row represents one layer; a column represents the order of nodes on the same layer, and the nodes on the same layer are always placed successively. The best-chromosome window reserves the best several chromosomes over all preceding generations; the best-chromosome window size ratio is the ratio of the best-chromosome window size to the population size.
Fitness Function: The 'fitness' value in our definition is, with some abuse of the term, defined as a penalty for the bad quality of a chromosome. That is, a larger 'fitness' value implies a worse chromosome, and our GA aims to find the chromosome with the minimal 'fitness' value. The aesthetic criteria used to determine the quality of chromosomes (layouts) are given as follows (these criteria are taken from [8] and [9]): fitness value = Σ_{i=1}^{7} Ci × Fi, where the Ci, 1 ≤ i ≤ 7, are constants; F1 is the total edge vertical length; F2 is the number of edges pointing upward; F3 is the number of edges pointing horizontally; F4 is the number of edge crossings; F5 is the number of edge-node crossings; F6 is the degree by which the layout height exceeds the limited height; and F7 is the degree by which the layout width exceeds the limited width.
   In order to experimentally compare our GA with the CLA in [1], the fitness function of our GA is tailored to match the CLA as follows: fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7, where the weight of the edge crossing number is adjusted in our experiments to reflect the issue we want to examine.
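Both fitness functions above can be written directly as code. The following Python sketch is illustrative; the containers holding the constants C_i and the measured quantities F_i are assumptions, and the defaults C6 = C7 = 500 are the values listed later in Section 5.1.

```python
# Illustrative sketch of the penalty-style fitness value (smaller is better).
# C and F are dicts indexed 1..7 with the constants and measured quantities.
def fitness_value(C, F):
    return sum(C[i] * F[i] for i in range(1, 8))

# Variant tailored for the comparison with the CLA.
def fitness_value_cla(span, edge_crossings, F6, F7, weight, C6=500, C7=500):
    return span + weight * edge_crossings + C6 * F6 + C7 * F7
```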
        4. MAIN COMPONENTS OF OUR GA

Initialization: For each chromosome, we randomly assign the nodes to a ⌈√|V|⌉ × ⌈√|V|⌉ grid.
Selection: To evaluate the fitness value of each chromosome, we have to compute the number of edge crossings, which however cannot be computed at this point because the routing of each edge is not yet determined. Hence, some dummy nodes should be introduced to determine the routing of the edges. Ideally, these dummy nodes would be placed at the relative positions that yield the fewest edge crossings between two adjacent layers. Nevertheless, permuting the nodes on each layer for the fewest edge crossings is an NP-hard problem [4]. Hence, the barycenter algorithm (which is also used by the CLA) is applied to reduce edge crossings on each chromosome before selection. Next, the selection step is implemented by truncation selection, which duplicates the best (selection rate × population size) chromosomes (1/selection rate) times to fill the entire population. In addition, we use a best-chromosome window to reserve some of the best chromosomes of the previous generations, as shown in Figure 6.

Figure 6: The selection process of our GA.
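The truncation selection with a best-chromosome window can be sketched as follows. The representation of chromosomes and the exact bookkeeping of the window are assumptions made for illustration.

```python
# Illustrative sketch of truncation selection with a best-chromosome window.
# population: list of chromosomes; fitness(c) is a penalty, so smaller is better.
def truncation_selection(population, fitness, selection_rate=0.7, window_ratio=0.2):
    ranked = sorted(population, key=fitness)
    keep = max(1, int(round(selection_rate * len(population))))
    copies = max(1, int(round(1.0 / selection_rate)))
    # Duplicate the best chromosomes to refill the whole population.
    new_population = (ranked[:keep] * copies)[:len(population)]
    # Reserve the best chromosomes seen so far in the window.
    window_size = max(1, int(round(window_ratio * len(population))))
    window = ranked[:window_size]
    return new_population, window
```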
Crossover: The main steps of our crossover process are as follows. (1) The two ordered parent chromosomes are called the 1st and the 2nd parent chromosomes. W.l.o.g., we only describe how to generate the first child chromosome from the two parents; the other child is generated similarly. (2) Remove all dummy nodes from the two parent chromosomes. (3) Choose half of the nodes from each layer of the 1st parent chromosome and place them on the same relative layers of the child chromosome in the same horizontal ordering. (4) The relative positions of the remaining nodes are determined by the 2nd parent chromosome. Specifically, we repeatedly choose a node adjacent to the smallest number of unplaced nodes until all nodes are placed; if there are several candidate nodes, we
randomly choose one of them. The layer of the chosen node is equal to its base layer plus its relative layer, where the base layer is the average of the layers of its already-placed connected nodes in the child chromosome, and the relative layer is its layer position relative to those connected nodes in the 2nd parent chromosome. (5) The layers of the new child chromosome are shifted so that they start from layer 1.
Mutation: In the mutated chromosome, a node is chosen randomly, and the position of the chosen node is then altered randomly.
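The crossover steps (1)-(5) can be summarized in code as below. This is a simplified, illustrative reading of the description: a chromosome is assumed to be a dictionary mapping each node to a (layer, position) pair, collisions on the grid are ignored, and ties among candidate nodes are broken deterministically instead of randomly.

```python
# Illustrative sketch of the crossover described above. parent1/parent2 map
# node -> (layer, position); adjacency[v] is the set of nodes adjacent to v.
def crossover(parent1, parent2, adjacency):
    child = {}
    # (3) Take half of the nodes of each layer of the 1st parent, keeping
    #     their relative layers and horizontal ordering.
    by_layer = {}
    for v, (layer, pos) in parent1.items():
        by_layer.setdefault(layer, []).append((pos, v))
    for layer, nodes in by_layer.items():
        for pos, v in sorted(nodes)[: (len(nodes) + 1) // 2]:
            child[v] = (layer, pos)

    # (4) Place the remaining nodes: base layer from already-placed neighbors
    #     in the child, relative layer taken from the 2nd parent.
    remaining = [v for v in parent1 if v not in child]
    while remaining:
        v = min(remaining,
                key=lambda u: sum(1 for w in adjacency[u] if w not in child))
        placed = [w for w in adjacency[v] if w in child]
        if placed:
            base = sum(child[w][0] for w in placed) / len(placed)
            rel = sum(parent2[v][0] - parent2[w][0] for w in placed) / len(placed)
            layer = int(round(base + rel))
        else:
            layer = parent2[v][0]
        child[v] = (layer, parent2[v][1])
        remaining.remove(v)

    # (5) Shift layers so that they start from layer 1.
    lowest = min(layer for layer, _ in child.values())
    return {v: (layer - lowest + 1, pos) for v, (layer, pos) in child.items()}
```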
Termination: If the difference between the average fitness values of successive generations within the latest ten generations is at most 1% of the average fitness value over these ten generations, then our GA stops. The best chromosome of the latest population is then chosen, and its corresponding graph layout (including dummy nodes at barycenter positions) is drawn.
Fine Tune: Before the selection step or after the termination step, we can tune chromosomes for the better according to the fitness function. For example, we remove all layers that contain only dummy nodes and no normal nodes, called dummy layers. Such a process does not necessarily worsen the edge crossing number, but it improves the span number. In addition, some unnecessary dummy nodes on each edge can also be removed after the termination step, where a so-called unnecessary dummy node is a dummy node that can be removed without causing new edge crossings or worsening the fitness value.
            5. EXPERIMENTAL RESULTS

To evaluate the performance of our algorithm, it is experimentally compared with the CLA (combining the first two steps of Sugiyama's algorithm) using MST MIN AVG as the distance function [1], as mentioned in the previous sections. For convenience, the CLA using the MST MIN AVG distance function is called the L_M algorithm (Leveling with MST MIN AVG). The L_M algorithm (for steps 1 and 2) and the barycenter algorithm (for step 3) can replace the first three steps of Sugiyama's algorithm. In order to compare with our GA (for steps 1, 2, and 3), we consider the algorithm combining the L_M algorithm and the barycenter algorithm, which is called the LM_B algorithm throughout the rest of this paper.
   Note that the x-coordinate assignment problem (step 4) is solved by the priority layout method in our experiments. In fact, this step affects neither the span number nor the edge crossing number. In addition, the second step of Sugiyama's algorithm (layer assignment) is an NP-hard problem when the width of the layered layout is restricted. Hence, we investigate the cases of limited and unlimited layout width separately.

          5.1. Experimental Environment

All experiments were run on a 2.0 GHz dual-core laptop with 2 GB of memory under the Java 6.0 platform from Sun Microsystems, Inc. The parameters of our GA are as follows: population size: 100; max generation: 100; selection rate: 0.7; best-chromosome window size ratio: 0.2; mutation probability: 0.2; C6: 500; C7: 500; fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7.

            5.2. Unlimited Layout Width

Because it is necessary to limit the layout width and height for the L_M algorithm, we set both the width and height limits to 30. This implies that there are at most 30 nodes (dummy nodes excluded) on each layer and at most 30 layers in each layout. If we let the maximal node number be 30 in our experiment, then the range of node distribution is effectively unlimited. In our experiments, we consider a graph with 30 nodes under three different densities (2%, 5%, 10%), where the density is the ratio of the edge number to the number of all possible edges, i.e., density = edge number/(|V|(|V| − 1)/2). Let the weight ratio of edge crossing to span be denoted by α. In our experiments, we consider five different α values: 1, 3, 5, 7, and 9. The statistics of the experimental results are given in Table 1.
   Consider an example of a 30-node graph with 5% density. The layered layouts produced by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 7, Figure 8(a), and Figure 8(b), respectively. Obviously, our algorithm performs better than the LM_B algorithm.

             5.3. Limited Layout Width

The input graph used in this subsection is the same as in the previous subsection (i.e., a 30-node graph). The limited width is set to 5, which is smaller than the square root of the node number (30), because we hope that the results under the limited and unlimited conditions show obvious differences. The statistics of the experimental results under the same settings as in the previous subsection are given in Table 2.
Table 1: The results after redrawing random graphs with 30 nodes and unlimited layout width.

  method            measure         density = 2%   density = 5%   density = 10%
  LM_B              span                30.00          226.70          798.64
                    crossing             4.45           57.90          367.00
                    running time       61.2 ms        151.4 ms        376.8 ms
  our GA   α = 1    span                30.27          253.93          977.56
                    crossing             0.65           38.96          301.75
           α = 3    span                31.05          277.65         1338.84
                    crossing             0.67           32.00          272.80
           α = 5    span                30.78          305.62         1280.51
                    crossing             0.67           29.89          218.45
           α = 7    span                32.24          329.82         1359.46
                    crossing             0.75           26.18          202.53
           α = 9    span                31.65          351.36         1444.27
                    crossing             0.53           24.89          200.62
                    running time        3.73 s         17.32 s        108.04 s

Figure 7: Layered layout by LM_B (span: 262, crossing: 38).

Figure 8: Layered layouts by our GA. (a) α = 1 (span: 188, crossing: 30); (b) α = 9 (span: 238, crossing: 14).

Table 2: The results after redrawing random graphs with 30 nodes and limited layout width 5.

  method            measure         density = 2%   density = 5%   density = 10%
  LM_B              span                28.82          271.55          808.36
                    crossing             5.64           59.09          383.82
                    running time       73.0 ms        147.6 ms        456.2 ms
  our GA   α = 1    span                32.29          271.45         1019.56
                    crossing             0.96           39.36          292.69
           α = 3    span                31.76          294.09         1153.60
                    crossing             0.80           33.16          232.76
           α = 5    span                31.82          322.69         1282.24
                    crossing             0.82           30.62          202.31
           α = 7    span                32.20          351.00         1369.73
                    crossing             0.69           27.16          198.20
           α = 9    span                33.55          380.20         1420.31
                    crossing             0.89           24.95          189.25
                    running time        3.731 s         3.71 s         18.07 s

   Consider an example of a 30-node graph with 5% density. The layered layouts produced for this graph by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 9, Figure 10(a), and Figure 10(b), respectively. Obviously, our algorithm also performs better than the LM_B algorithm.

                 5.4. Discussion

Due to the page limitation, only the case of 30-node graphs is included in this paper; in fact, we conducted many experiments on various graphs. Together with those results, the tables and figures here show that under any condition (node number, edge density, and limited width or not) the crossing number obtained by our GA is smaller than that obtained by LM_B, whereas the span number obtained by our GA is not necessarily larger than that of LM_B. When the layout width is limited and the node number is sufficiently small (about 20, according to our experimental evaluation), our GA may produce span and edge crossing numbers that are both smaller than those of LM_B at the same time.
   Moreover, we observed that under any condition the edge crossing number becomes smaller and the span number becomes larger when the weight of edge crossing is increased. This implies that we can effectively adjust the trade-off between edge crossings and span; that is, we can reduce the number of edge crossings at the cost of increasing the span number.
   Under the limited-width condition, because the results of L_M are restricted, its span number should be larger than under the unlimited condition. However, there are some unusual situations with our GA: although its results are also restricted under the limited-width condition, its span number is smaller than under the unlimited-width condition. Our explanation is that the limited-width condition may reduce the possible dimension.
Figure 9: Layered layout by the LM_B algorithm (span: 288, crossing: 29) with limited layout width = 5.

Figure 10: Layered layouts by our GA. (a) α = 1 (span: 252, crossing: 29); (b) α = 9 (span: 295, crossing: 14).

In this problem, the dimension represents the set of positions at which nodes can be placed; if the dimension is smaller, our GA can converge to a better result more easily.

                6. CONCLUSIONS

This paper has proposed an approach for producing layered layouts of directed graphs, which uses a GA to simultaneously consider the first three steps of the classical Sugiyama algorithm (which consists of four steps) and applies the priority layout method for the fourth step. Our experimental results reveal that our GA can efficiently adjust the weighting ratios among all aesthetic criteria.

              ACKNOWLEDGEMENT

This study is conducted under the "Next Generation Telematics System and Innovative Applications/Services Technologies Project" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.
                 REFERENCES

[1] C. Bachmaier, F. Brandenburg, W. Brunner, and G. Lovász. Cyclic leveling of directed graphs. In Proc. of GD 2008, volume 5417 of LNCS, pages 348-359, 2008.
[2] H. do Nascimento and P. Eades. A focus and constraint-based genetic algorithm for interactive directed graph drawing. Technical Report 533, University of Sydney, 2002.
[3] T. Eloranta and E. Mäkinen. TimGA: A genetic algorithm for drawing undirected graphs. Divulgaciones Matematicas, 9(2):55-171, 2001.
[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4(3):312-316, 1983.
[5] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
[6] P. Kuntz, B. Pinaud, and R. Lehn. Minimizing crossings in hierarchical digraphs with a hybridized genetic algorithm. Journal of Heuristics, 12(1-2):23-36, 2006.
[7] E. Mäkinen and M. Sieranta. Genetic algorithms for drawing bipartite graphs. International Journal of Computer Mathematics, 53:157-166, 1994.
[8] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501-516, 2002.
[9] K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109-125, 1981.
[10] J. Utech, J. Branke, H. Schmeck, and P. Eades. An evolutionary algorithm for drawing directed graphs. In Proc. of CISST'98, pages 154-160. CSREA Press, 1998.
[11] Q.-G. Zhang, H.-Y. Liu, W. Zhang, and Y.-J. Guo. Drawing undirected graphs with genetic algorithms. In Proc. of ICNC 2005, volume 3612 of LNCS, pages 28-36, 2005.
Structured Local Binary Haar Pattern for Graphics Retrieval

                      Song-Zhi Su,                                                    Shu-Yuan Chen*, Shang-An Li
 Cognitive Science Department of Xiamen University,                         Department of Computer Science and Engineering of
  Fujian Key Laboratory of the Brain-like Intelligent                                  Yuan Ze University, Taiwan
    Systems (Xiamen University), Xiamen, China                              *corresponding author, cschen@saturn.yzu.edu.tw
               SUSONGZHI@163.com                                                                Der-Jyh Duh
                          Shao-Zi Li                                         Department of Computer Science and Information
 Cognitive Science Department of Xiamen University,                            Engineering, Ching Yun University, Taiwan
  Fujian Key Laboratory of the Brain-like Intelligent                                      djduh@cyu.edu.tw
    Systems (Xiamen University), Xiamen, China
                 szlig@xmu.edu.cn

Abstract—Feature extraction is an important issue in graphics retrieval. Local feature based descriptors are currently the predominant methods used in image retrieval and object recognition. Inspired by the success of the Haar feature and the Local Binary Pattern (LBP), a novel feature named structured local binary Haar pattern (SLBHP) is proposed for graphics retrieval in this paper. SLBHP encodes the polarity instead of the magnitude of the difference between the accumulated gray values of adjacent rectangles. Experimental results on graphics retrieval show that the discriminative power of SLBHP is better than that of edge points (EP), the Haar feature, and LBP, even under noisy conditions.

    Keywords-graphics retrieval; structured local binary Haar pattern; Haar; local binary pattern;

                      I.   INTRODUCTION

    With the advent of computing technology, media acquisition/storage devices, and multimedia compression standards, more and more digital data are generated and made available to users all over the world. Nowadays it is easy to access electronic books, electronic journals, web portals, and video streams. Hence, it is convenient to provide users with an image retrieval system for browsing, searching, and retrieving images from a large database of digital images. Traditional systems add metadata such as captions, keywords, descriptions, or annotations to the images so that retrieval can be converted into a text retrieval problem.
    However, manual annotation is time-consuming, laborious, and expensive. There is a large body of work on content-based image retrieval (CBIR) [1] [2] [3], which is also called query by image content. "Content-based" means that the retrieval process utilizes and analyzes the actual contents of the image, which might refer to colors, shapes, textures, or any other information that can be derived from the images themselves.
    Unfortunately, although there are many content-based retrieval methods for image databases, few of them are specifically designed for graphics. Huet et al. [4] exploit both geometric attributes and structural information to construct a shape histogram for retrieving line-patterns from large databases. Chi et al. [5] proposed an approach combining a local-structure-based shape representation and a new histogram indexing structure to address two issues in the shape retrieval problem: perceptual similarity measures on partial queries, and overcoming the dimensionality curse and adverse environments. Chalechale et al. [6] proposed a sketch-based image retrieval system in which feature extraction for matching is based on angular partitioning of two abstract images obtained from the model image and from the query image. The angular-spatial distribution of pixels in the abstract images is scale and rotation invariant and is made robust against translation by using the Fourier transform.
    Most existing graphics retrieval methods adopt contour-based [4] [5] rather than pixel-based approaches [6]. Since contour-based methods deal with many curves and lines, they are computationally intensive. It is therefore the goal of this paper to propose a pixel-based graphics retrieval method using the novel structured local binary Haar pattern.
    This paper is organized as follows. The original Haar and LBP features are described in Section II. The proposed SLBHP feature is described in Section III. Experimental results and performance comparisons are given in Section IV. Finally, conclusions are given in Section V.

         II.   LOCAL BINARY PATTERN AND HAAR FEATURE

A. Local Binary Pattern
    Local feature based approaches have achieved great success in object detection and recognition in recent years. The original LBP descriptor was proposed by Ojala et al. [7] and has proved to be a powerful means for texture analysis. LBP encodes local primitives including different types of curved edges, spots, flat areas, etc. An advantage of LBP is its invariance to monotonic changes in gray scale, so LBP is widely used in face recognition [8], pedestrian detection [9], and many other computer vision applications.
    The basic LBP operator assigns a label to every pixel of an image by thresholding its 3 × 3 neighborhood and interpreting the result as a binary number. The histogram of the labels can then be used as a descriptor of local regions. See Figure 1(a) for an illustration of the basic LBP operator.
Figure 1. Illustration of LBP and Haar. (a) The basic LBP operator; (b) four types of Haar feature.

B. Haar Feature
    A simple rectangular Haar feature can be defined as the difference between the accumulated sums of the pixels inside adjacent rectangles, which can be at any position and scale within the given image. Oren et al. [10] first used 2-rectangle features for pedestrian classification. Viola and Jones [11] extended them to 3-rectangle and 4-rectangle features in the Viola-Jones object detection framework for faces and pedestrians. The difference values indicate certain characteristics of a particular area of the image. Haar features encode low-frequency information, and each feature type can indicate the existence of certain characteristics in the image, such as vertical or horizontal edges or changes in texture.
    Haar features can be computed quickly using the integral image [11], an intermediate representation of the image with which all rectangular two-dimensional image features can be computed rapidly. Each element of the integral image contains the sum of all pixels located in the upper-left region of the original image. Given the integral image, any rectangular sum of pixel values aligned with the coordinate axes can be computed with four array references.
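The integral image and the four-reference rectangle sum can be sketched as follows; the list-of-lists image representation is assumed for illustration.

```python
# Illustrative sketch of an integral image and a rectangle sum computed with
# four array references, as described above.
def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle with top-left corner (x, y), width w and height h.
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

# A 2-rectangle Haar feature is then a difference of two such sums, e.g.
# rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h) for a horizontal pair.
```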
C. A New Sight into LBP, Haar, and Gradient

Figure 2. LBP can be seen as a weighted combination of binary Haar features.

    The decimal form of the resulting 8-bit LBP code can be expressed as follows:

        LBP(x, y) = Σ_{i=0}^{7} w_i b_i(x, y),

where w_i = 2^i and b_i(x, y) = 1 if Haar_i(x, y) > T, and 0 otherwise. As shown in Figure 2, each component of LBP is actually a binary 2-rectangle Haar feature with rectangle size 1 × 1. Even the gradient can be seen as a combination of Haar features; for example,

        I_x = Haar_0 + Haar_4,    I_y = Haar_2 + Haar_6,

where I_x and I_y are the gradients along the x axis and the y axis with filters [1, −2, 1] and [1, −2, 1]^T, respectively.
                                                                                  information. Let ai , i = 0,1,L,8 denote the corresponding
                                                                                  gray values for a 3×3 window with a0 at the center pixel




(x, y), as shown in Figure 3(a). The value of the SLBHP code of a pixel (x, y) is given by the following equation,

    SLBHP(x, y) = Σ_{p=1..4} B(H_p ⊗ N(x, y)) × 2^p,

where

    N(x, y) = ⎡ a1  a2  a3 ⎤        H1 = ⎡  1   1   0 ⎤        H2 = ⎡  0   1   1 ⎤
              ⎢ a8  a0  a4 ⎥             ⎢  1   0  −1 ⎥             ⎢ −1   0   1 ⎥
              ⎣ a7  a6  a5 ⎦             ⎣  0  −1  −1 ⎦             ⎣ −1  −1   0 ⎦

    H3 = ⎡  1   1   1 ⎤        H4 = ⎡ −1   0   1 ⎤
         ⎢  0   0   0 ⎥             ⎢ −1   0   1 ⎥
         ⎣ −1  −1  −1 ⎦             ⎣ −1   0   1 ⎦

and B(x) = 1 if |x| > T and 0 otherwise, with T a threshold (15 in our experiments). By this binary operation, the feature becomes more robust to global lighting changes. It is noted that H_p denotes a Haar-like basis function and H_p ⊗ N(x, y) denotes the difference between the accumulated gray values of the black and red rectangles, as shown in Figure 3(c). Unlike the traditional Haar feature, here the rectangles overlap by one pixel. Inspired by LBP and by the fact that a single binary Haar feature might not have enough discriminative power, we combine these binary features just as LBP does. Figure 3(c) shows an example of the SLBHP feature. SLBHP extends the merits of both the Haar feature and LBP, and it encodes the most common structure information of graphics. Moreover, SLBHP has a dimension of 16, smaller than the dimension of 256 for LBP, while being more immune to noise since each Haar feature uses more pixels at a time.
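A small sketch (Python/NumPy; ours, not part of the paper) of the per-pixel SLBHP code under the definitions above. The operator ⊗ is taken here as the element-wise product summed over the 3×3 window, which reproduces the black-minus-red rectangle difference; the threshold and the 2^p weighting follow the text, everything else is an illustrative assumption.

import numpy as np

# Haar-like 3x3 basis functions H_1..H_4 from the text.
H = [np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
     np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
     np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
     np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]])]

def slbhp(gray, x, y, T=15):
    # gray is a 2-D array indexed as gray[y, x]; (x, y) must not lie on
    # the image border so that the full 3x3 neighborhood N(x, y) exists.
    N = gray[y - 1:y + 2, x - 1:x + 2].astype(np.int32)
    code = 0
    for p, Hp in enumerate(H, start=1):
        response = np.sum(Hp * N)            # H_p (x) N(x, y)
        bit = 1 if abs(response) > T else 0  # B(.)
        code += bit * (2 ** p)
    return code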
B. SLBHP for Graphics Retrieval
    After the SLBHP value of each pixel is computed, the histogram of SLBHP over a region R is computed by the following equation:

    H(i) = Σ_{(x, y) ∈ R} I{SLBHP(x, y) = i},

where I{A} = 1 if A is true and 0 if A is false. The histogram H contains information about the distribution of local patterns, such as edges, spots, and flat areas, over the image region R. In order to make SLBHP robust to slight translation, a graphics photo is divided into several small spatial regions ("blocks"); an SLBHP histogram is computed for each block, and the histograms are then concatenated to form the representation of the graphics, as shown in Figure 4. For better invariance to illumination, it is useful to contrast-normalize the local responses in each block before using them. Experimental results showed that L2 normalization gives better results than L1 and L1-sqrt normalization. Similar to other popular local-feature-based object detection methods, the detection windows are tiled with a dense (overlapping) grid of SLBHP descriptors. The overlap size is half of the whole block.

    Figure 4. An example of SLBHP histograms for graphics retrieval.

                        IV. EXPERIMENTAL RESULTS

    Figure 5. Some query results for the graphics database. (a) Query graphics; (b) a list of the three most similar graphics ordered by similarity value. The one with the red rectangle is the ground-truth match.

    479 electronic files of graphics were collected to construct the database for the retrieval experiments. The test images comprise 479 graphics photos taken by a digital camera, to which noise was added to obtain noisy test images. The performance of graphics retrieval is measured by the retrieval accuracy, computed as the ratio of the number of graphics correctly retrieved to the total number of queries. Moreover, not only the retrieval accuracy with respect to the first rank but also that of the second and third ranks is considered in our experiments. The retrieval accuracies of the different approaches are listed in Tables 1 through 4 for block sizes from 8×8 to 32×32. The retrieval accuracy for the non-overlapping case is also listed in Table 4. By comparing Tables 1 and 4, we found that overlapping results in higher retrieval accuracy. It is noted that the proposed method and the approaches using EP [6] and LBP all adopt histogram-based matching. For the Haar feature, in contrast, the four Haar values computed for each block are normalized and then concatenated to form the representation; the chi-square distance is also adopted as the similarity measure for the Haar feature.
In our experiments, we found that the chi-square distance is a better similarity measure for histogram-based matching than the Euclidean distance. Some retrieval results are shown in Figure 5.
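For illustration, a sketch (Python/NumPy; ours, not the authors' implementation) of the block-wise SLBHP histogram representation with L2 normalization and the chi-square distance used for matching; the block size, overlap, and bin count below are assumptions consistent with the settings described above.

import numpy as np

def block_histograms(code_img, block=(16, 16), step=(8, 8), n_bins=31):
    # Half-overlapping blocks over the per-pixel SLBHP code image; each
    # block yields an L2-normalized histogram, and the histograms are
    # concatenated into one descriptor. n_bins covers the possible code
    # values (0..30 when the bits are weighted by 2^p, p = 1..4).
    h, w = code_img.shape
    feats = []
    for y in range(0, h - block[0] + 1, step[0]):
        for x in range(0, w - block[1] + 1, step[1]):
            patch = code_img[y:y + block[0], x:x + block[1]]
            hist = np.bincount(patch.ravel(), minlength=n_bins).astype(np.float64)
            hist /= (np.linalg.norm(hist) + 1e-12)   # L2 normalization
            feats.append(hist)
    return np.concatenate(feats)

def chi_square(p, q, eps=1e-12):
    # Chi-square distance between two concatenated descriptors;
    # the best retrieval match is the database entry with the
    # smallest distance to the query.
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))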


            TABLE I.           RETRIEVAL ACCURACIES OF EDGE POINTS (EP), LBP, HAAR, AND SLBHP WITH HALF-OVERLAPPING BLOCKS.
                                          1-best                                     2-best                                  3-best
                          EP      LBP       Haar      SLBHP       EP          LBP       Haar      SLBHP      EP       LBP       Haar      SLBHP
               32x32      85.2    70.4      83.7      88.3        91.6        79.5      90.6      95.6       93.3     82.5      92.5      96.5
               32x16      83.3    62.3      68.9      88.5        91.4        74.9      76.0      94.6       93.5     78.3      78.5      95.7
               16x32      86.8    66.6      60.8      90.2        92.9        76.0      68.7      95.6       94.2     80.0      72.2      96.7
               16x16      85.0    58.2      62.4      89.4        92.3        66.8      70.1      94.4       94.4     69.3      73.3      95.8
               16x8       81.2    42.0      37.4      86.6        89.8        51.8      43.4      91.9       91.2     55.5      45.9      93.7
               8x16       83.3    45.3      29.0      86.6        90.6        55.1      36.5      92.5       92.9     57.8      40.7      94.8
               8x8        79.3    30.5      29.2      82.7        86.8        39.5      34.9      89.3       89.8     44.5      39.2      91.2

                  TABLE II.          RETRIEVAL ACCURACIES UNDER GAUSSIAN NOISE WITH VARIANCE 50 AND PERTURBATION 1%.
                                     1-best                                     2-best                                   3-best
                   EP       LBP       Haar         SLBHP        EP          LBP      Haar        SLBHP      EP       LBP       Haar       SLBHP
          32x32   63.88     71.19     83.09        82.46       74.53        78.91    90.40       90.81     77.87     84.13     92.48      93.95
          32x16   71.61     65.76     68.48        85.18       79.54        75.16    75.78       93.53     84.76     79.54     78.71      94.57
          16x32   72.44     67.22     60.96        87.06       79.54        76.41    68.27       93.53     83.72     81.21     72.03      94.99
          16x16   78.08     59.92     62.42        88.31       85.39        68.27    69.52       93.74     89.14     72.65     73.70      94.99
          16x8    79.96     43.63     37.58        86.01       87.89        52.61    43.42       92.07     89.98     55.74     45.72      93.95
          8x16    79.12     47.60     29.02        86.85       88.31        54.91    36.33       93.11     91.44     59.08     41.34      94.99
          8x8     81.00     31.52     29.23        83.72       87.68        40.08    34.24       90.40     90.61     44.89     38.62      92.48

                       TABLE III.          RETRIEVAL ACCURACIES UNDER SALT AND PEPPER NOISES WITH PERTURBATION 0.5%.
                                      1-best                                         2-best                                    3-best
                        EP       LBP       Haar       SLBHP        EP          LBP       Haar      SLBHP       EP       LBP       Haar      SLBHP
             32x32     15.24     70.77     83.51      84.76       19.83        79.33     91.02     92.48      25.05     82.88     92.48     94.78
             32x16     20.46     64.93     68.48      86.01       27.97        75.57     75.79     94.15      39.25     79.33     78.50     95.62
             16x32     22.55     67.43     60.96      88.10       27.35        76.20     68.48     94.15      34.66     80.17     72.44     95.62
             16x16     37.37     59.71     61.59      88.52       46.97        67.85     68.89     93.95      61.38     71.19     73.28     95.41
             16x8      55.95     42.80     36.74      86.22       67.22        52.40     43.01     92.28      73.90     55.95     45.30     93.32
             8x16      54.28     47.39     28.60      87.27       67.43        54.90     36.12     93.32      78.08     58.87     40.71     94.57
             8x8       70.35     31.11     29.02      83.51       81.00        40.29     34.86     89.77      84.55     45.09     39.25     92.28

                                     TABLE IV.          RETRIEVAL ACCURACIES WITH NON-OVERLAPPING BLOCKS.
                                          1-best                                       2-best                                 3-best
                          EP        LBP       Haar     SLBHP           EP       LBP        Haar    SLBHP       EP       LBP       Haar     SLBHP
              32x32     82.04     70.98       69.73    87.68      90.40        77.66      78.29    94.57     92.49     81.42      82.88    95.82
              32x16     79.75     64.09       61.80    87.27      88.10        72.65      68.89    93.53     89.98     75.57      71.61    94.78
              16x32     82.25     66.18       53.44    89.14      89.77        74.95      61.17    94.78     90.81     78.50      64.30    96.24
              16x16     81.00     57.20       57.83    88.94      88.94        66.18      65.76    93.11     91.23     69.31      69.10    94.36
              16x8      78.50     41.34       29.23    84.97      87.06        51.57      36.33    91.23     89.35     54.90      40.08    92.48
              8x16      79.96     43.01       23.38    86.01      88.52        51.77      29.44    91.23     91.65     55.11      32.57    93.53
              8x8       75.99     29.22       27.97    83.30      84.97        39.25      33.83    88.31     88.31     42.17      36.12    90.61


                        V. CONCLUSION
    A novel local feature, SLBHP, which combines the merits of Haar and LBP, is proposed in this paper. The effectiveness of SLBHP has been demonstrated by various experimental results. Moreover, compared with the other approaches using EP, Haar, and LBP descriptors, SLBHP is superior even under noisy conditions. Further research can be directed to extending the proposed graphics retrieval to slide retrieval or e-learning video retrieval using graphics as query keywords.

                        ACKNOWLEDGMENT
    This work was partially supported by the National Science Council of Taiwan under Grant NSC 99-2221-E-155-072, the National Natural Science Foundation of China under Grant 60873179, the Shenzhen Technology Fundamental Research Project under Grant JC200903180630A, and the Doctoral Program Foundation of Institutions of Higher Education of China under Grant 20090121110032.

                        REFERENCES




[1]    R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: ideas,
       influences, and trends of the new age," ACM Computing Surveys,
       2008, vol. 40, no. 2, Article 5, pp. 1–60.
[2]    J. Deng, W. Dong, R. Socher, et al. ImageNet: A large-scale
       hierarchical image database. In: Proceedings of Computer Vision and
       Pattern Recognition, 2009.
[3]    A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images:
       a large dataset for non-parametric object and scene recognition,”
       IEEE Transactions on Pattern Analysis and Machine Intelligence,
       2008, vol. 30, no.11, pp. 1958- 1970.
[4]    B. Huet and E. R. Hancock, "Line pattern retrieval using relational
       histograms," IEEE Transactions on Pattern Analysis and Machine
       Intelligence, 1999, vol. 12, no. 12, pp. 1363-1370.
[5]    Y. Chi and M.K.H. Leung, “ALSBIR: A local-structure-based image
       retrieval,” Pattern Recognition, 2007, vol. 40, pp. 244-261.
[6]    A. Chalechale, G. Naghdy and A. Mertins, “Sketch-based image
       matching using angular partitioning,” IEEE Transactions on Systems,
       Man, and Cybernetics –Part A: Systems and Humans, 2005, vol. 35,
       no. 1, pp.28-41.
[7]    T. Ojala, M. Pietikainen, and D. Harwood, “A comparative study of
       texture measures with classification based on featured distribution,”
       Pattern Recognition, 1996, vol. 29, no. 1, pp.51-59.
[8]    T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local
       binary patterns: application to face recognition," IEEE Transactions on
       Pattern Analysis and Machine Intelligence, 2006, vol. 28, no. 12, pp.
       2037-2041.
[9]    X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector
       with partial occlusion handling," In: Proceedings of International
       Conference on Computer Vision, 2009.
[10]   M. Oren, C. Papageorgiou, P. Sinha, et al., "Pedestrian detection using
       wavelet templates," In: Proceedings of International Conference on
       Computer Vision and Pattern Recognition, 1997.
[11]   P. Viola and M. Jones, "Robust real-time face detection," International
       Journal of Computer Vision, 2004, vol. 57, no. 2, pp. 137-154.
[12]   L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, "Face detection
       based on multi-block LBP representation," In: Proceedings of
       International Conference on Biometrics, 2007.
[13]   S. Yan, S. Shan, X. Chen, and W. Gao, "Locally assembled binary
       (LAB) feature with feature-centric cascade for fast and accurate face
       detection," In: Proceedings of International Conference on Computer
       Vision and Pattern Recognition, 2008.




IMAGE-BASED INTELLIGENT ATTENDANCE LOGGING SYSTEM
                              Hary Oktavianto1, Gee-Sern Hsu2, Sheng-Luen Chung1
                              1 Department of Electrical Engineering
                              2 Department of Mechanical Engineering
                      National Taiwan University of Science and Technology, Taipei, Taiwan
                                         E-mail: hary35@yahoo.com

Abstract— This paper proposes an extension of the surveillance camera's function as an intelligent attendance logging system. The system works like a time recorder. Based on sitting and standing-up events, the system is designed with a learning phase and a monitoring phase. The learning phase learns the environment to locate the sitting areas. After a defined time, the system switches to the monitoring phase, which monitors the incoming occupants. When an occupant sits at the same location as a sitting area found by the learning phase, the monitoring phase generates a sitting-time report. A leaving-time report is also generated when an occupant stands up from his/her seat. The system employs one static camera, placed 6.2 meters away, 2.6 meters high, and facing down 21° from the horizontal; the camera's view is perpendicular to the working location. The experimental results show that the system achieves good performance.

Keywords— Activity map; Attendance; Logging system; Learning phase; Monitoring phase; Surveillance camera

                        I. INTRODUCTION
    Intelligent buildings have recently grown as a research topic [1], [2], [3]. Many buildings are installed with surveillance cameras for security reasons. This paper extends the function of existing surveillance cameras to an intelligent attendance logging system whose purpose is to report the occupants' attendance. The system works like a time recorder or time clock. A time recorder is a mechanical or electronic timepiece that is used to assist in tracking the hours an employee of a company has worked [4]. Instead of spending more budget on such timepieces, the surveillance camera can be used to perform the same function. The system is called intelligent because it learns from a given environment automatically to build a map. The map consists of the sitting areas of the occupants. A sitting area is the spatial information about where an occupant's working desk is located, so there is no need to select the occupants' working areas manually. Fig. 1 shows an example scenario. Naturally, an occupant enters the room and sits down to start working. Afterward, the occupant stands up from his/her seat and leaves the room. The sitting and standing-up events are used by the system to decide where the occupant's working area is and when the occupant works.

    Fig. 1. Occupant's working room (left) and a map consisting of the occupants' sitting areas (right).

    The flow diagram of the proposed system is shown in Fig. 2. The system consists of an object segmentation unit, a tracking unit, a learning phase, and a monitoring phase. A fixed static camera is placed inside the occupants' working room. The images taken by the camera are pre-processed by the object segmentation unit to extract the foreground objects. A connected foreground object is called a blob. These blobs are processed further in the tracking unit. Once the system detects a blob as an occupant, it keeps tracking the occupant in the scene using the centroid, ground position, color, and size of the occupant as simple tracking features. The learning phase is responsible for learning the environment and constructs a map as its output. The monitoring phase uses the map to monitor whether the occupants are present at their working desks or not. The report on the presence or absence of the occupants is the final output of the system for further analysis. The system is implemented by taking advantage of existing open-source computer vision libraries, OpenCV [5] and cvBlob [6].
    The contributions of this paper are:
(1) A learning mechanism that locates seats in an unknown environment.
(2) A monitoring mechanism that detects the entering and leaving events of occupants.
(3) An integrated system with real-time performance up to 16 fps, ready for context-aware applications.
    This paper is organized as follows. The problem definition and previous research as related work are reviewed in Section II. Section III describes the technical overview of the proposed solution. Section IV explains the tracking that is used to keep track of the occupants during their appearance in the scene based on the
information from the previous frame. The learning phase and the monitoring phase are explained in Section V. Section VI explains the experimental setup, results, and discussion. Finally, the conclusions are summarized in Section VII.

            II. PROBLEM DEFINITION AND RELATED WORK
    This section describes the problem definition and the previous work related to the intelligent attendance logging system.

A. Problem Definition
    The goal of this paper is to design an image-based intelligent attendance logging system. A fixed static camera is given as the input device inside an unknown working environment with a number of fixed seats, each of which belongs to a particular user or occupant. Occupants do not necessarily enter and leave at the same time. We are to design a camera-equipped intelligent attendance logging system such that the system can report, in real time, each occupant's entering and leaving events to and from his/her particular seat.
    The system is designed based on two assumptions. The first assumption is that the environment is unknown, in that the number of seats and their locations are not known before the system starts monitoring. The second assumption is that each occupant has his/her own seat; as such, detecting the presence/absence at a particular seat amounts to answering the presence/absence of the corresponding occupant.
    There are two performance criteria to evaluate the system with respect to its main functions, which are to find the sitting areas and to report the monitoring results. The first criterion is that the system should find the sitting areas given by the ground truth. The second criterion is that the system should be able to monitor the occupants during their appearance in the scene and generate accurate reports.

B. Related Work
    During the past decades, intelligent buildings have been developed. Zhou et al. [3] developed video-based human indoor activity monitoring aimed at assisting the elderly. Demirdjian et al. [7] presented a method for automatically estimating activity zones based on observed user behavior in an office room using a 3-D person tracking technique. They used simple position, motion, and shape features for tracking. The activity zones are used at run time to contextualize user preferences, e.g., allowing "location-sticky" settings for messaging, environmental controls, and/or media delivery. Girgensohn, Shipman, and Wilcox [8] observed that retail establishments want to know about traffic flow in order to better arrange goods and staff placement. They visualized the results as heat maps showing activity, object counts, and average velocities overlaid on the map of the space. Morris and Trivedi [9] extracted human activity. They presented an adaptive framework for live video analysis based on trajectory learning. A surveillance scene is described by a map which is learned in an unsupervised fashion to indicate interesting image regions and the way objects move between these places. These descriptors provide the vocabulary to categorize past and present activity, predict future behavior, and detect abnormalities.

    Fig. 2. Flow diagram of the system.

    The research above detects occupants and builds a map consisting of the locations that those people mostly occupy. This paper extends the advantages of surveillance cameras to monitor the occupants' presence. A static camera is used by the system, as in [2], [8]. Morris and Trivedi applied an omnidirectional camera [9] in their system. Other researchers [1], [3], [7] used stereo cameras to reduce the effect of lighting intensity and occlusion. The system in this paper is intended to work in real time and to have the capability to learn the environment automatically from observed behavior.

                     III. TECHNICAL OVERVIEW
    As shown in Fig. 2, with the details in Fig. 3, the input images acquired from the camera are fed into the object segmentation unit to extract the foreground objects. A foreground object is a moving object in the scene, obtained by subtracting the background image from the current image. To model the background image, a Gaussian Mixture Model (GMM) is used. A GMM represents the variation of each background pixel with a set of weighted Gaussian distributions [10], [11], [12], [13]. The first frame is used to initialize the means. A pixel is decided to be background if it falls within a deviation around the mean of any of the Gaussians that model it. The update process, which is performed on the current frame, increases the weight of the Gaussian model that is matched by the pixel. By taking the difference between the current image and the background image, the foreground object is obtained.
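The paper implements its own per-pixel GMM following [10]-[13]; purely as an illustrative stand-in, OpenCV's built-in MOG2 background subtractor produces an equivalent foreground mask. The sketch below (Python; ours) uses a hypothetical input clip and assumed parameter values.

import cv2

cap = cv2.VideoCapture("office.avi")          # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=False)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that do not fit any of their weighted Gaussians are
    # labeled foreground; the model weights are updated every frame.
    fg_mask = subtractor.apply(frame)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == 27:           # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()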
    After that, the foreground object is converted from an RGB color image to a gray-level image [13]. The edges of the objects in the gray-level image are extracted by applying an edge detector. The edge detector uses a moving-frame algorithm with four steps. In step one, the gray-level image (I) is shifted in eight directions using a fixed distance in pixel units (dx and dy), resulting in eight
images with offsets to the right, left, up, down, up-right, up-left, down-right, and down-left, respectively. These eight shifted images are called moving-frame images (F_i):

    F_i(x, y) = I(x + dx_i, y + dy_i).                    (1)

In step two, each moving-frame image is updated (F*_i) by subtracting it from the image frame (I) to get the extended edges:

    F*_i(x, y) = I(x, y) − F_i(x, y).                     (2)

In step three, each moving frame is converted to binary by applying a threshold value (T_F):

    F^T_i(x, y) = f_T(F*_i),                              (3)

    f_T = 1 if F*_i(x, y) ≥ T_F, and 0 otherwise.

Finally, all the moving-frame images are added together; as the result, the edge image (E) is obtained:

    E(x, y) = Σ_i F^T_i(x, y).                            (4)

The edge detector extracts the object while removing weak shadows at the same time, since weak shadows do not have edges. However, strong shadows may occur and create some edges. Strong edges appearing between the legs can still be tolerated since the system does not consider the occupant's contour.
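To make the four steps concrete, here is a small sketch (Python/NumPy; ours, not from the paper). The shift distance, the threshold value, and the wrap-around border handling of np.roll are simplifying assumptions, and the comparison in step three reconstructs an operator lost in the original typesetting.

import numpy as np

def moving_frame_edges(gray, d=1, t_f=30):
    # Step 1: shift the gray image in eight directions by d pixels
    # (np.roll wraps at the border, a simplification of eq. (1)).
    offsets = [(-d, -d), (-d, 0), (-d, d), (0, -d),
               (0, d), (d, -d), (d, 0), (d, d)]
    gray = gray.astype(np.int32)
    edges = np.zeros_like(gray)
    for dy, dx in offsets:
        shifted = np.roll(np.roll(gray, dy, axis=0), dx, axis=1)  # F_i, eq. (1)
        diff = gray - shifted                                     # F*_i, eq. (2)
        edges += (diff >= t_f).astype(np.int32)                   # f_T, eq. (3)
    return edges                                                  # E, eq. (4)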
    The result of the edge detection process is refined using morphology filters [13]. A dilation filter is applied twice to join the edges, and an erosion filter is applied once to remove noise. The last step in the object segmentation unit is connected-component labeling, which is used to detect connected regions. A connected region is called a blob. In the object segmentation unit, the GMM, the gray-level conversion, the edge detector, and the morphology filters are implemented with the OpenCV library, while the connected-component labeling is implemented with the cvBlob library.

    Fig. 3. The detail of the object segmentation unit and the tracking unit.

    The blob that represents the foreground object may be broken due to occlusion by furniture or because it has a similar color to the background image. Some rules are provided to group the broken blobs. There are three conditions for examining the broken blobs. The first is the intersection distance of the blobs (B_I). The second is the nearest vertical distance of the blobs (B_dy). The third is the angle of the blobs (B_A) from their centroids. B_dy and B_A are calculated using (5), while B_I is explained in [14]:

    B_dy = min(B_i.y, B_j.y),     B_A = ∠(c_i, c_j),      (5)

where B_i.y and B_j.y are the y-coordinates of blob i and blob j, respectively, and c_i and c_j are the centroids of the blobs. If the three conditions in (6) are all satisfied, the broken blobs are grouped:

    G = 1 if (B_I ≤ T_C) and (B_dy ≤ T_D) and (B_A ≤ T_A), and 0 otherwise.      (6)

T_C, T_D, and T_A are the threshold values for the intersection distance, the nearest vertical distance of the blobs, and the angle of the blobs, respectively. In the experiments, T_C is 0 pixels, T_D is 50 pixels, and T_A is 30°.
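A direct transcription of the grouping rule (Python; ours), with the thresholds from the text; the comparison operators in (6) were lost in extraction and are reconstructed here as "less than or equal to".

def should_group(b_i, b_dy, b_a, t_c=0, t_d=50, t_a=30):
    # Eq. (6): broken blobs are grouped only when the intersection
    # distance, vertical distance, and angle all stay within their
    # thresholds (0 px, 50 px, and 30 degrees in the experiments).
    return 1 if (b_i <= t_c and b_dy <= t_d and b_a <= t_a) else 0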
    After the broken blobs are grouped into one, a motion detector tests whether the blob is an occupant or not. The blob is an occupant if its size looks like a human and the blob has movement. The minimum size of a human is an approximation relative to the image size. X-axis displacement and optical flow [13] are used to detect the movement of the blob. If a blob is detected as an occupant, the tracking unit assigns it a unique identification (ID) number and a track. A track is an indicator that a blob is an occupant, and it is represented by a bounding box. Tracking rules are implemented as states to handle each event. There are five basic states: the entering state, person state, sitting state, standing-up state, and leaving state. During tracking, the occlusion problem may occur, so two more states are added: the merge state and the split state. In the tracking unit, the optical flow is implemented with the OpenCV library, while the tracking rules employ the cvBlob library.
    The learning phase is activated if the map has not been constructed yet. The sitting state in the tracking unit triggers the learning phase to locate the occupant's sitting area. After a defined time, the learning phase finishes its job and the monitoring phase is activated. In this phase, the sitting state and the standing-up state in the tracking unit trigger the monitoring phase to generate reports. The reports tell when the occupants sat and left.
    The system is evaluated by testing it with several video clips covering two scenarios. Five occupants are asked to enter the scene. They sit, stand up, leave the scene, and sometimes cross each other.
                         IV. TRACKING
    This section describes the tracking rules in the tracking unit (Fig. 3). The tracking rules keep track of the occupants during their appearance in the scene based on the information (features) from the previous frame. The tracking rules are represented by states. The basic tracking states are shown in Fig. 4. There are five states:

    Fig. 4. Basic tracking states.

•   Entering state (ES): an incoming blob that appears in the scene for the first time is marked with the entering state. This state also receives information from the motion detector to decide whether the incoming blob is an occupant or noise. If the incoming blob is considered noise and it remains there for more than 100 frames, the system deletes it, for instance when the size is too small because of shadows. To erase the noise from the scene, the system re-initializes the Gaussian model in the noise region so that the noise is absorbed into the background image. An incoming blob is classified as an occupant if it has motion for at least 20 consecutive frames and the height of the blob is more than 60 pixels.
•   Person state (PS): if the incoming blob is detected as an occupant, a unique identification (ID) number and a bounding box are attached to this blob. A blob that is detected as an occupant is called a track, and the system adds this track to the tracking list.
•   Sitting state (IS): detects if the occupant is sitting. A sitting occupant can be assumed if there is no movement from the occupant for a defined time. In the experiments, an occupant is sitting when the x-axis displacement is zero for 20 frames and the velocity vectors from the optical flow result are zero for 100 frames, continuously.
•   Standing-up state (US): detects when a sitting occupant starts to move to leave his/her desk. In the experiments, a standing-up occupant is detected when the sitting occupant produces movements, the height increases above 75%, and the size changes to 80%–140% of the size of the current bounding box.
•   Leaving state (LS): deletes the occupant from the list. A leaving occupant is detected when the occupant moves to the edge of the scene and the occupant's track loses its blob for 5 frames.

A. Tracking Features
    The system tries to match every detected occupant in the scene from frame to frame. This is done by matching the features of the occupant. Four features (centroid, ground position, color, and size) are used for tracking. Fig. 5 shows the illustration of a blob (the connected region of the occupant object in the current frame), a track (a connected blob that is considered an individual occupant, surrounded by a bounding box), the size (number of blob pixels, or area density), the centroid (center of mass), and the ground position (foot position of the occupant).

    Fig. 5. An occupant in the scene and the features.

    The first feature is the centroid. The centroid is used to associate an object's location in the 2-D image between two consecutive frames by measuring the distance between centroids. Fig. 6 shows two objects being associated: one object is already defined as a track in the previous frame (t−1), and another object appears in the current frame (t) as a blob. Each object has a centroid (c). The two objects are measured [14] in the following way. If one of the centroids is inside the other object (the boundary of each object is defined as a rectangle), the returned distance value is zero. If both centroids lie outside the boundary of the other object, the returned distance value is the distance from the nearer centroid to the opponent boundary. A threshold value (T_C) is set. When the distance is below T_C, the two objects are considered the same object, and the track position is updated to the blob position. If the distance criterion is not satisfied, the two objects are not correlated with each other; it could be that the previous track loses its object in the next frame while a new object appears at the same time. A track that misses the tracking is handled in the leaving state (LS), and a new object that appears in the scene is handled in the blob state (BS).

    Fig. 6. Centroid feature to check the distance in 2D.
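The rectangle-aware centroid distance described above can be sketched as follows (Python/NumPy; ours, not from the paper); the threshold value used for this feature is not given in the text, so the value below is only a placeholder.

import numpy as np

def point_rect_dist(c, rect):
    # Distance from point c = (x, y) to an axis-aligned rectangle
    # rect = (x0, y0, x1, y1); zero when the point lies inside.
    dx = max(rect[0] - c[0], 0, c[0] - rect[2])
    dy = max(rect[1] - c[1], 0, c[1] - rect[3])
    return float(np.hypot(dx, dy))

def centroid_distance(track_c, track_rect, blob_c, blob_rect):
    # Zero if either centroid falls inside the other object's
    # rectangle, otherwise the nearer centroid-to-boundary distance.
    return min(point_rect_dist(track_c, blob_rect),
               point_rect_dist(blob_c, track_rect))

def same_object(track_c, track_rect, blob_c, blob_rect, t_c=20):
    # The track is updated with the blob position when the distance
    # falls below the threshold (the value 20 px is an assumption).
    return centroid_distance(track_c, track_rect, blob_c, blob_rect) < t_c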
    The second feature is the ground position. It is possible that two objects are not the same object but their centroids are
lying inside each other's boundaries. Fig. 7 shows this problem. There are two occupants in the scene: one occupant is sitting while the other is walking through behind. In the 2-D image (left), the two objects overlap each other. However, it is clear that the walking occupant should not be confused with the sitting occupant. To solve this problem, the ground position is used to associate the object's location in 3-D between two consecutive frames. The ground position feature eliminates the error of an object being updated with another object even though they overlap each other. The occupant's foot location is used as the ground position. A fixed uniform elliptical boundary (25 pixels and 20 pixels for the major and minor axes, respectively) around the ground position is set to indicate the maximum allowable range for the same person to move. In the real scene, this pixel area is equal to 40 centimeters square for the object nearest the camera up to 85 centimeters square for the object furthest from the camera. This wide range is caused by using a uniform elliptical distance for all locations in the image.

    Fig. 7. Ground position feature to check the distance in 3D. Blob and track in the processing stage (left). View in the real image (right).

    The third feature is color. The color feature is used to represent the color information of the occupant's clothing and helps to separate the objects in case of occlusion. A three-dimensional RGB color histogram is used. Let b_k be the bin that counts the number of pixels that fall into the same category and n be the total number of bins; the histogram H^{R,G,B} of occupant i satisfies the following condition:

    H_i^{R,G,B} = Σ_{k=1..n} b_k.                         (7)

The histograms H^{R,G,B} are calculated on the masked image and then normalized. The masked image, shown in Fig. 8, is obtained from the occupant's object and the blob with an AND operation. The method for matching the occupants' histograms is the correlation method. In the experiments, 10 bins per color channel are chosen. The histogram matching procedure uses a threshold value of 0.8 to indicate that the compared histograms are sufficiently matched.

    Fig. 8. Color feature is calculated on the masked image.
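A sketch of the masked color histogram and its correlation matching (Python with OpenCV; ours, not the authors' code). OpenCV's calcHist and compareHist are used as stand-ins, OpenCV images are in BGR order rather than the RGB order of the text, and the mask is assumed to be an 8-bit single-channel image.

import cv2
import numpy as np

def masked_rgb_hist(frame_bgr, mask, bins=10):
    # Eq. (7): a 10-bin-per-channel color histogram computed only on
    # the masked (occupant AND blob) pixels, then normalized.
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], mask,
                        [bins, bins, bins],
                        [0, 256, 0, 256, 0, 256])
    return (hist / (hist.sum() + 1e-12)).astype(np.float32)

def same_appearance(hist_a, hist_b, threshold=0.8):
    # Correlation matching; 0.8 is the acceptance threshold from the text.
    score = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)
    return score >= threshold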
    The fourth feature is size. The size feature is used to match an object between two consecutive frames based on the pixel density, which is the blob itself, as shown in Fig. 9. The allowable change in size in the next frame is set to ±20% of the previous size. Let p(x', y') be a pixel location of an occupant in the binary image. The size feature of object i is calculated as follows:

    s_i = Σ_{x', y'} p(x', y').                           (8)

    Fig. 9. Size feature of occupant.

B. Merge-Split Problem
    A challenging situation may occur: while the occupants are walking in the scene, they cross each other and cause occlusion. Since the system keeps tracking each occupant in the scene, it is necessary to extend the tracking states of Fig. 4. Two states are added for this purpose: the merge state (MS) and the split state (SS). Fig. 10 shows the extended tracking states. Merges and splits can be detected using a proximity matrix [14]. Objects are merged when multiple tracks (in the previous frame) are associated with one blob (in the current frame). Objects are split when multiple blobs (in the current frame) are created from one track (in the previous frame). In the merge condition, only the centroid feature is used to track the next possible position, since the other three features are not useful when objects merge. After a group of occupants splits, their colors are matched to their colors just before they merged.

    Fig. 10. Extended tracking states.

    In the experiments, when more than two occupants split, sometimes an occupant remains occluded and splits off later. When the occluded occupant splits, the system re-identifies each occupant and restores
their previous ID numbers from just before they merged. Fig. 11 shows the algorithm used to handle the occlusion problem.

    Fig. 11. Merge-split algorithm with occlusion handling.
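A minimal sketch (Python/NumPy; ours) of merge and split detection with a proximity matrix built from bounding-box overlap; the paper follows [14], so the overlap test here is only an assumed proximity measure.

import numpy as np

def overlap_area(a, b):
    # Intersection area of two axis-aligned boxes (x0, y0, x1, y1).
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def merge_split_events(tracks, blobs):
    # Proximity matrix between previous-frame tracks and current-frame
    # blobs: a blob touched by several tracks signals a merge, and a
    # track touched by several blobs signals a split.
    if not tracks or not blobs:
        return [], []
    m = np.array([[overlap_area(t, b) > 0 for b in blobs] for t in tracks])
    merged_blobs = [j for j in range(len(blobs)) if m[:, j].sum() > 1]
    split_tracks = [i for i in range(len(tracks)) if m[i, :].sum() > 1]
    return merged_blobs, split_tracks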
                                                                           sitting location with the sitting area in the map. If the
             V. LEARNING AND MONITORING PHASES
    This section introduces how the learning phase and the monitoring phase work. These phases are derived from the tracking unit, namely from the sitting state and the standing-up state in the tracking rules. At the beginning, the system activates the learning phase. Triggered by the sitting event, the learning phase starts to construct the map. When the given time interval has passed, the learning phase is stopped and a map has been constructed. The system then switches to the monitoring phase to report the occupants' attendance based on when they sit down at and stand up from their seats.

A. Learning Phase
    The learning phase is derived from the sitting state in the tracking rules. The output of the learning phase is a map consisting of the occupants' sitting areas. From Fig. 4, the information about when an occupant sits is extracted from the sitting state (IS). When an occupant is detected as sitting, the system starts counting. After a certain counting period, the location where the occupant sits is determined to be a sitting area. The counting period is used as a delay that makes sure the occupant sits for long enough; in the experiments, the delay is defined as 200 frames. Ideally, the learning phase is considered finished after all of the sitting areas have been found. In this paper, to show that the learning phase does its job, the occupants enter the scene and sit one by one without causing occlusion. The scenario for this demonstration is arranged so that after 10 minutes the map is expected to be completely constructed and the learning phase has finished its job; the system is then switched to the monitoring phase. In a real situation, the delay and the duration of the learning phase can be adjusted.

B. Monitoring Phase
    The monitoring phase is derived from the sitting state and the standing-up state in the tracking rules. The monitoring phase generates the reports of the occupants' attendance. It uses the map that has been constructed by the learning phase. From Fig. 4, the sitting state (IS) and the standing-up state (US) trigger the monitoring phase. When an occupant sits, the system tries to match the occupant's current sitting location with a sitting area in the map. If the positions are the same, the system generates a time stamp of the sitting time for that particular sitting area. A time stamp of the leaving time is also generated when the occupant moves out of the sitting area. Fig. 12 shows an example of the report.

    Sitting area number    Event      Time stamp
    1                      Sitting    09:02:09 Wed 2 June 2010
    2                      Sitting    09:07:54 Wed 2 June 2010
    3                      Sitting    09:12:16 Wed 2 June 2010
    2                      Leaving    10:46:38 Wed 2 June 2010
    2                      Sitting    10:49:54 Wed 2 June 2010
    3                      Leaving    12:46:38 Wed 2 June 2010

    Fig. 12. A report example.
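To illustrate the report generation, a toy sketch (Python; ours, not the authors' implementation): the sitting-area coordinates, the matching radius, and the output format are assumptions loosely following Fig. 12.

import time

def log_event(sitting_areas, foot_xy, event, radius=25):
    # Match the occupant's sitting (or leaving) location against the
    # learned sitting areas and emit a time-stamped report line.
    for area_id, (ax, ay) in sitting_areas.items():
        if (foot_xy[0] - ax) ** 2 + (foot_xy[1] - ay) ** 2 <= radius ** 2:
            stamp = time.strftime("%H:%M:%S %a %d %B %Y")
            print(f"{area_id}\t{event}\t{stamp}")
            return area_id
    return None

# Example: areas learned by the learning phase (coordinates assumed).
areas = {1: (60, 180), 2: (140, 185), 3: (220, 190)}
log_event(areas, (142, 183), "Sitting")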
      VI. APPLICATION TO INTELLIGENT ATTENDANCE LOGGING SYSTEM
    This paper demonstrates the use of a surveillance camera as an intelligent attendance logging system. As mentioned earlier, the system works like a time recorder: it assists in tracking the hours of occupant attendance. With this system, the occupants do not need to carry a special tag or badge. In this section, the environment setup, results, and discussion are described.

A. Environment Setup
    A static network camera is used to capture the images of the scene. It is an HLC-83M, a network camera produced by Hunt Electronic. The image size taken from the camera is 320 × 240 pixels. The test room is in our laboratory. The camera is placed about 6.24 meters away, 2.63 meters high, and facing down 21° from the horizontal line. The occupants' desks and the camera view are orthogonal to get the best view. There are 5 desks as the ground truth.
    The room has indoor lighting from fluorescent lamps, and the windows are covered so that sunlight cannot come into the room during the test.

B. Results and Discussion
    A Visual C++ and OpenCV platform on an Intel® Core™2 Quad CPU at 2.33 GHz with 4 GB of RAM is used to implement the system. Both offline and online methods are allowed. In a scene without any detected objects, the system ran at 16 frames per second (fps). When the number
of incoming objects increases, the lowest speed achieved is 8 fps.
    The algorithm was tested with two types of scenarios. The first scenario consists of sitting occupants with no occlusion (Fig. 13) and demonstrates the operation of the learning phase. The second scenario is the same as the first, except that the occupants are allowed to cross each other and create occlusions (Fig. 14); it demonstrates the merge-split handling.

Fig. 13. Scenario type 1: the system builds the map. The current images (left) and the map shown as filled rectangles (right).

Fig. 14. Scenario type 2: the map of 3 desks has been completed. The occupants cross each other and the system can handle this situation.

    Table 1 shows the test results of scenario type 1. There are 5 desks as ground truth (Fig. 1). Five occupants enter the scene; they sit, stand up, and leave the scene one by one without causing any occlusion. The order in which the occupants enter and leave is arranged: the occupants occupy the desks from desk number 5 (the rightmost desk) to desk number 1 (the leftmost desk), and they leave from desk number 1 to desk number 5. This order ensures that no occupant walks behind a sitting occupant. The scenario was repeated 10 times. The results show no problem for desk numbers 2, 3, and 4. However, there are some cases in which the system failed to locate the occupants' sitting areas. For desk number 1, the occupant's blob sometimes merges with that of a neighboring occupant, so the system cannot detect or track the occupant who sits at desk number 1. For desk number 5, the occupant's color was similar to the color of the background image, which produced a small blob; the system cannot track the occupant because the blob becomes too small.

Table 1. Test results of scene type 1: the number of seats detected by the system over 10 experiments.
                               Desk number
  Sitting area      #1      #2      #3      #4      #5
  Detected           7      10      10      10       8
  Missed             3       0       0       0       2

    Table 2 shows the test results of scene type 2. The system monitored the occupants based on the map that had been found. The experiments were run 10 times without occlusion. There are some cases in which the system failed to recognize the sitting occupant, for the same reasons discussed above: the system lost track of an occupant whose color is similar to the background image, so that the occupant suddenly produces a small blob. The system also failed to recognize the leaving event at desk number 1. The system detects a leaving occupant when the occupant's blob splits from his/her seat; since desk number 1 does not provide enough space for the system to observe this split, the system kept reporting desk number 1 as occupied even after the corresponding occupant had left.

Table 2. Test results of scene type 2: the number of successful monitoring results (out of 10 experiments) without occlusion.
                               Desk number
  Occupant          #1      #2      #3      #4      #5
  Sitting            9      10      10      10       9
  Leaving            0       9      10      10       9

    Table 3 shows the test results of scene type 2 with occlusion. The experiments were run 10 times, and the system should be able to keep tracking the occupants. To test the system, three occupants enter the scene and act out the scenarios listed in Table 3: some occupants walk behind the sitting occupant, or the occupants simply walk and cross each other. In most cases, the system can detect which is which after they split. The errors happened because of the occupants' colors and the behavior of the sitting occupant. If the occupants have similar colors, the system may confuse them when telling them apart. In another case, the sitting occupant made a movement and created a blob, but the system did not yet have enough evidence to change that occupant's status from sitting to standing up; another occupant then walked closer and merged with this blob. After they split, the system was confused because the blob had no previous information. As a result, the system miscounted the previously merged track, and the ID number of the occupant was restored incorrectly.

Table 3. Test results of scene type 2: the number of occupants mistakenly assigned in the merge-split case over 10 merges.
  Number of    Sitting      Walking               Split
  occupants    occupants    occupants    Merge    Succeeded    Failed
      2            0            2         10          9           1
      2            1            1         10          9           1
      3            0            3         10          8           2
      3            1            2         10          9           1
      3            2            1         10          9           1

                        VII. CONCLUSIONS
    We have designed an intelligent attendance logging system by integrating open-source components with additional algorithms. The system works in two phases, a learning phase and a monitoring phase, and achieves real-time performance of up to 16 fps. We also demonstrate that the system can handle occlusion of up to three occupants; the scene becomes too crowded for more than three occupants. While a regular time recorder only reports the time stamps of the beginning and the end of an occupant's working hours, this system provides more detailed timing information. Some unexpected behavior may cause errors, for instance when the occupant's color is similar to the background, when the desk position is unfavorable, or when the occupant moves while
sitting.
    In the future, the events generated by this system can be used to deliver messages to other systems. It is possible to control the environment automatically, such as adjusting the lighting, playing relaxing music, or setting the air conditioner when an occupant enters or leaves the room. The summary report of the occupants' attendance can also be used for activity analysis. The current system does not include recognition capability, since it only detects whether a working desk is occupied or not. However, if occupant recognition is needed, there are two options: after the map of sitting areas is found, the user may label each sitting area manually, or a recognition system can be added.

                               REFERENCES
[1]    B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafers,
       “EasyLiving Technologies for Intelligent Environments,”        Lecture
       Notes in Computer Science, Volume 1927/2000, pp. 97-119, 2000.
[2]    S. -L. Chung and W. –Y. Chen, “MyHome: A Residential Server for
       Smart Homes”, Lecture Notes in Computer Science (including
       subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
       Bioinformatics) 4693 LNAI (PART 2), pp. 664-670, 2007.
[3]    Z. Zhou, X. Chen, Y. –C. Chung, Z. He, T. X. Man, and J. M. Keller,
       “Activity analysis, summarization, and visualization for indoor
       human activity monitoring,” IEEE Transactions on Circuits and
       Systems for Video Technology 18 (11), art. no. 4633633, pp. 1489-
       1498, 2008.
[4]    Wikipedia, "Time Clock," http://en.wikipedia.org/wiki/Time_clock
       (June 24, 2010).
[5]    OpenCV. Available: http://sourceforge.net/projects/opencvlibrary/
[6]    cvBlob. Available : http://code.google.com/p/cvblob/
[7]    D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell,
       “Activity maps for location-aware computing,” Proceedings of the
       Sixth IEEE Workshop on Applications of Computer Vision (WACV),
       pp. 70-75, 2002.
[8]    A. Girgensohn, F. Shipman, and L. Wilcox, “Determining Activity
       Patterns in Retail Spaces through Video Analysis,” MM'08 -
       Proceedings of the 2008 ACM International Conference on
       Multimedia, with co-located Symposium and Workshops , pp. 889-
       892, 2008.
[9]    B. Morris and M. Trivedi, “An Adaptive Scene Description for
       Activity Analysis in Surveillance Video,” 2008 19th International
       Conference on Pattern Recognition, ICPR 2008 , art. no. 4761228,
       2008.
[10]   A. Bayona, J.C. SanMiguel, and J.M. Martínez, “Comparative
       evaluation of stationary foreground object detection algorithms based
       on background subtraction techniques,” 6th IEEE International
       Conference on Advanced Video and Signal Based Surveillance, AVSS
       2009 , art. no. 5279450, pp. 25-30, 2009.
[11]   S. Herrero and J. Bescós, “Background subtraction techniques:
       Systematic evaluation and comparative analysis” Lecture Notes in
       Computer Science (including subseries Lecture Notes in Artificial
       Intelligence and Lecture Notes in Bioinformatics) 5807 LNCS, pp. 33-
       42, 2009.
[12]   P. KaewTraKulPong and R. Bowden, “An Improved Adaptive
       Background Mixture Model for Real-time Tracking with Shadow
       Detection,” Proc. 2nd European Workshop on Advanced Video Based
       Surveillance Systems, AVBS01, 2001
[13]   G. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision
       with the OpenCV Library,” Sebastopol, CA: O'Reilly Media, 2008.
[14]   A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, and R.
       Bolle, "Appearance models for occlusion handling," Image and
       Vision Computing 24 (11), pp. 1233-1243, 2006.




i-m-Walk: Interactive Multimedia Walking-Aware System


     1Meng-Chieh Yu(余孟杰), 2Cheng-Chih Tsai(蔡承志), 1Ying-Chieh Tseng(曾映傑), 1Hao-Tien
  Chiang(姜昊天), 1Shih-Ta Liu(劉士達), 1Wei-Ting Chen(陳威廷), 1Wan-Wei Teo(張菀薇), 2Mike Y.
              Chen(陳彥仰), 1,2Ming-Sui Lee(李明穗), and 1,2Yi-Ping Hung(洪一平)

                  1 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
                  2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan



                          Abstract
    i-m-Walk is a mobile application that uses pressure sensors in shoes to visualize the phases of footsteps on a mobile device, in order to raise the user's awareness of his or her walking behaviour and to help improve it. As an example application of slow technology, we used i-m-Walk to help beginners learn "walking meditation," a type of meditation in which users aim to take each pace as slowly as possible and to land every footstep with the toes first. In our experiment, we asked 30 participants to learn walking meditation over a period of 5 days; the experimental group used i-m-Walk from day 2 to day 4, and the control group did not use it at all. The results showed that i-m-Walk effectively assisted beginners in slowing down their pace and decreasing the error rate of paces during walking meditation. This study may therefore be of importance in providing a mechanism that helps users better understand their pace and improve their walking habits. In the future, i-m-Walk could be used in other applications, such as walking rehabilitation.

Keywords: Smart Shoes, Walking Meditation, Visual Feedback, Slow Technology

                     1.    INTRODUCTION
    Walking is an integral part of our daily lives in terms of transportation as well as exercise, and it is a basic exercise that can be done everywhere. In recent years, with the rapid growth of smartphones, many research projects have studied walking-related human-computer interfaces on mobile phones. For example, researchers have evaluated walking user interfaces for mobile devices [9] and proposed minimal attention user interfaces to support ecologists in the field [21]. In addition, several walking-related systems have been developed to help people walk and run. Nike+ uses footstep sensors attached to users' shoes to adjust the playback speed of music while running and to track running-related statistics such as time, distance, pace, and calories burned [16]. adidas used an accelerometer to detect the footsteps of the runner and provides running information audibly [31]. Wii Fit uses a balance board to detect the user's center of gravity and offers several games, such as yoga, gymnastics, aerobics, and balancing [18]. In addition, walking is an important factor in our health. For example, it is one of the earliest rehabilitation exercises and an essential exercise for elders [5], and improper foot pressure distribution can contribute to various types of foot injuries. In recent years, ambient light and biofeedback have been widely used in rehabilitation and healing, and the concept of "slow technology" was proposed. Slow technology aims to use slowness in learning, understanding, and presence to give people time to think and reflect [30]. Meditation is one example of slow technology, and "walking meditation" is an important form of meditation. Although many research projects have focused on meditation, showing benefits such as enhanced synchronization of neuronal excitation [11] and an increased concentration of antibodies in blood after vaccination [3], most projects have focused on meditation while sitting. In order to better understand how users walk in a portable way, we have designed i-m-Walk, which uses multiple force sensitive resistor sensors embedded in the soles of shoes to monitor users' pressure distribution while walking. The sensor data are wirelessly transmitted over ZigBee, and then relayed over Bluetooth to be analyzed in real time on smartphones. Interactive visual feedback can then be provided via the smartphones (see Figure 1).
    In this paper, in order to develop a system that can help users improve their walking habits, we use the training of walking meditation as an example application to evaluate the effectiveness of i-m-Walk. Traditional training of walking meditation demands one-on-one instruction, and there is no standardized evaluation after training. It is therefore challenging for beginners to self-learn walking meditation without feedback from trainers.
Figure 1. A participant using i-m-Walk during walking meditation.

    We designed experiments to test the effect of training with i-m-Walk during walking meditation. Participants were asked to do a 15-minute practice of walking meditation for five consecutive days. During the experiment, participants using i-m-Walk were shown real-time pace information on the screen. We wanted to test whether this could help participants raise the awareness of their walking behaviour and improve it. We proposed two hypotheses: (a) i-m-Walk could help users walk more slowly during walking meditation; (b) i-m-Walk could help users walk correctly according to the method of walking meditation.
    This paper is structured as follows. The first section introduces the walking system. The second section reviews walking detection methods and multimedia-assisted walking applications. This is followed by an introduction to walking meditation. The fourth section describes the system design, after which the experimental design is presented. The results of the various analyses are presented following each of these descriptive sections. Finally, the discussion and conclusion are presented and suggestions are made for further research.

                    2.   RELATED WORKS

2.1 Methods of Walking Detection
    In the past decade, there has been much research on intelligent shoes. The first concept of wearable computing and smart clothing systems included intelligent clothes, glasses, and intelligent shoes; the intelligent shoes could detect the walking condition [12]. Later work used pressure sensors and gyro sensors to detect foot posture, such as heel-off, swing, and heel-strike [22], and another study embedded pressure sensors in the shoes to detect the walking cycle, with a vibrator equipped to assist walking [26]. Besides, there are many other methods for walking detection, such as bend sensors [15], accelerometers [2], ultrasound [29], and computer vision technology [24] to analyze footsteps.

2.2 Multimedia-Assisted Walking Application
    Several studies have used multimedia feedback and walking detection techniques to help people in monitoring or training applications in daily life. In dance training, an intelligent shoe was built that detects the timing of footsteps and plays music to help beginners learn ballroom dancing; if it detected missed footsteps while dancing, it showed warning messages to the user. The device emphasizes the acoustic element of the music to help the dancing couple stay in sync with the music [4]. Another dance application detected dancers' paces and applied them in interactive music for dance performance [20]. For musical tempo and rhythm training for children, a system was built that can write out the music on a timeline along the ground, where each footstep activates the next note in the song [13]. Besides, visual information has been used to adjust foot trajectory during the swing phase of a step when stepping onto a stationary target [23].
    In psychological applications, there are experiments related to walking perception. In an application assisting the walking of stroke patients, lighted targets were projected onto the left and right sides of a walkway, and stroke patients could follow the lighted targets to carry out their steps. The results pointed out that stroke patients might effectively be helped by using vision and hearing as guidance [14]. An fMRI study of multimedia-assisted walking showed increased activation during visually guided, self-generated ankle movements, suggesting that multimedia-assisted walking has a profound effect on people [1]. In the related application of walking in entertainment, Personal Trainer - Walking [17] detects users' footsteps through an accelerometer and encourages users to walk through interesting and interactive games. In the healthcare field, a system applied the concept of intelligent shoes to detect the walking stability of the elderly and thus prevent falls [19]; the system monitored walking behaviours and used a fall-risk estimation model to predict the future risk of a fall. Another application used an electromyography biofeedback system for stroke and rehabilitation patients, and the results showed recovery of foot-drop in the swing phase after training [8].

                  3.   WALKING MEDITATION
    The practice of meditation has several different forms and postures, such as meditation while standing, sitting, walking, or lying down on the back. Compared to sitting meditation, people tend to feel less dull, tense, or easily distracted in walking meditation. In this paper, we focus on meditation while walking, which is also called walking meditation. Walking meditation is a way to align the feeling inside and outside of the body, and it helps people focus and concentrate on their mind and body. Furthermore, it can also deepen our knowledge and wisdom.
Figure 2. Six phases of each footstep in walking meditation [25].

    The method of walking meditation aims to take each pace as slowly as possible and to land each pace with the toes first. The participants can focus on the movement of walking, from raising, lifting, pushing, lowering, stepping, to pressing (Figure 2). The participants should also be aware of the movement of the feet in each stage; it is important to stay aware of the sensation of the feet. As a result, continued practice of walking meditation is an effective way to develop concentration and maintain tranquillity in participants' daily lives. Furthermore, it can also help participants become calmer, so that their minds can be still and peaceful. Long-term practice of walking meditation benefits people by increasing patience, enhancing attention, overcoming drowsiness, and leading to a healthy body [6]. In order to help beginners learn the walking method of walking meditation, the i-m-Walk system was developed.

                     4.    SYSTEM DESIGN
    i-m-Walk includes a pair of intelligent shoes for detecting paces, a ZigBee-to-Bluetooth relay, and a smartphone for walking analysis and visual feedback. Three force sensitive resistor sensors are fixed underneath each shoe insole and send pressure data through the relay. We implemented the analysis and visual feedback application on an HTC HD2 smartphone running Windows Mobile 6.5, which has a 4.3-inch LCD screen. An overview of the system is shown in Figure 3.

Figure 3. System structure of i-m-Walk: each shoe module (force sensors, microcontroller, and XBee transmitter) sends data to an XBee-to-Bluetooth relay, and the smartphone performs footstep detection, stability analysis, and visual feedback.

4.1 i-m-Walk Architecture
    The shoe module is based on Atmel's high-performance, low-power 8-bit AVR ATmega328 microcontroller and transmits sensing values wirelessly through a 2.4 GHz XBee 1 mW chip-antenna module. The module size is 3.9 cm x 5.3 cm x 0.8 cm with an overall weight of 185 g (Figure 4), including an 1800 mAh lithium battery that allows continuous use for 24 hours. We kept the hardware small and lightweight in order not to affect users while walking.
    We use force sensitive resistor sensors to detect the pressure distribution of the feet while walking. The sensing area of each sensor is 0.5 inch in diameter, and the sensor changes its resistance depending on how much pressure is applied to the sensing area. In our system, the intelligent shoes detect the walking speed and the walking method during walking meditation. According to the recommendations of orthopaedic surgery, we fixed three force sensitive resistor sensors underneath each shoe insole, at the three main weight-bearing areas: the structural bunion, the Tailor's bunion, and the heel (see Figure 4). The shoe module is mounted on the outside of the shoes (see Figure 5). With a fully charged battery, the pressure sensing modules can be used continuously for 24 hours, and a power button can switch the module off when the module is not being used.

Figure 4. Sensing module: the micro-controller and wireless module (right), and one of the insoles with three force sensitive resistor sensors (left).

Figure 5. Sensing shoes: the sensing module attached to the shoes.
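As a rough illustration of the data path described above (not the authors' implementation), the following Python sketch reads the six force-sensor values forwarded by the XBee-to-Bluetooth relay; the serial port name, baud rate, and the comma-separated packet format are assumptions made for the example.

    import serial   # pyserial

    PORT = "/dev/rfcomm0"   # assumed Bluetooth serial port exposed by the relay
    BAUD = 115200           # assumed baud rate

    def read_samples(port=PORT, baud=BAUD):
        # Yield one 6-tuple per packet:
        # (left_toe1, left_toe2, left_heel, right_toe1, right_toe2, right_heel)
        with serial.Serial(port, baud, timeout=1) as link:
            while True:
                line = link.readline().decode("ascii", errors="ignore").strip()
                fields = line.split(",")
                if len(fields) != 6:
                    continue            # skip malformed or partial packets
                try:
                    yield tuple(float(v) for v in fields)
                except ValueError:
                    continue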
4.2 Walking detection
    There are many methods for walking detection, and they differ according to the application. In our system, we use three pressure sensors in each shoe and thus obtain six sensing values at a sample rate of 30 times per second. In order to detect whether the user lands each pace with the toes first or the heel first, we divide each shoe into two parts, a toe part and a heel part. The sensing value of the toe part is the average of the two force sensors underneath the structural bunion and the Tailor's bunion, and the sensing value of the heel part is that of the force sensor underneath the heel. The system therefore divides the sensing area into a toe part and a heel part in each shoe, four parts in total per person. We then use a threshold method: at the moment a part's sensing value drops below the threshold value, that part is activated. We define the beginning of each gait cycle as the moment the heel part is lifted, and the end of the gait cycle as the moment the other foot's heel part rises; the previous cycle stops on one foot and the other foot begins a new gait cycle. Figure 6 shows an example. In this case, when the heel of the left foot rose at 5 seconds, its sensing value fell below the threshold and our system detected the left-foot rise at that moment; at the same time, the user's right foot was stepping down. Conversely, when the heel of the right foot rose at 10.7 seconds, its sensing value fell below the threshold and our system detected the right-foot rise at that moment.

Figure 6. Signal processing of the walking signals. The blue line indicates the sensed weight (kg) of the heel and the green line indicates the sensed weight of the toe. The red line marks the threshold for detecting the landing event, and the gray blocks indicate which foot is landing.
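The threshold rule above can be sketched as follows; this is an illustrative Python fragment, not the authors' code, and the threshold value and the ordering of the six sensor values are assumptions.

    SAMPLE_RATE = 30        # samples per second, as stated in Section 4.2
    THRESHOLD = 2.0         # assumed lift/land threshold on the sensed weight (kg)

    def foot_parts(sample):
        # Toe value = average of the two forefoot sensors; heel value = heel sensor.
        lt1, lt2, lh, rt1, rt2, rh = sample
        return (lt1 + lt2) / 2.0, lh, (rt1 + rt2) / 2.0, rh

    def heel_lift_events(samples):
        # Yield (time_in_seconds, foot) whenever a heel value drops below the
        # threshold; each such event marks the start of a new gait cycle.
        lifted = {"left": False, "right": False}
        for i, sample in enumerate(samples):
            _, left_heel, _, right_heel = foot_parts(sample)
            for foot, heel in (("left", left_heel), ("right", right_heel)):
                is_lifted = heel < THRESHOLD
                if is_lifted and not lifted[foot]:
                    yield i / SAMPLE_RATE, foot
                lifted[foot] = is_lifted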
4.3 User interface
    Multimedia feedback can be effectively applied in preventive medicine [7], and it can also effectively assist rehabilitation patients in walking [5, 27]. i-m-Walk is developed to assist users in learning the walking method during walking meditation. The user interface of i-m-Walk includes three components: a warning message, pace awareness, and walking speed (see Figure 7). In this section, we describe the user interface and the design principles of our system.

4.3.1 Pace awareness
    The function of pace awareness is to help users be aware of their walking phases and of whether they use correct footsteps during walking meditation. A feet pattern is shown on the smartphone, and a color block shows where the foot's center of gravity is and how much force is applied to the foot in real time. The transparency of the block decreases while the user lands the foot, and increases while the user raises the foot. Besides, the color block moves top-down when the participant lands with the toes first, and bottom-up when the participant lands with the heel first; if the front of the foot lands first, the colour block moves forward to indicate the landing position. In addition, if the user lands a pace with the toes first, the system regards this as the correct walking method for walking meditation and displays the colour block in green. Conversely, when the user lands a pace with the heel first, the system recognizes a wrong walking method and changes the colour block from green to red.
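One way the toe-first rule of Section 4.3.1 could be expressed for a single foot is sketched below; this is illustrative only, and the landing threshold and colour names are assumptions.

    LAND_THRESHOLD = 2.0   # assumed sensed weight (kg) above which a part counts as landed

    def classify_landing(toe_series, heel_series, threshold=LAND_THRESHOLD):
        # Scan one landing and return 'green' if the toe part crosses the
        # threshold before the heel part (correct pace), otherwise 'red'.
        for toe, heel in zip(toe_series, heel_series):
            toe_down = toe >= threshold
            heel_down = heel >= threshold
            if toe_down and not heel_down:
                return "green"   # toes touched down first: correct walking method
            if heel_down and not toe_down:
                return "red"     # heel touched down first: wrong pace
        return "red"             # no clear toe-first contact was observed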
4.3.2 Walking Speed and Warning Message
    During walking meditation, people should stabilize their walking paces at a low speed. The user interface should therefore provide walking speed information in real time and remind the user when the walking speed is too fast. Walking speed and wrong paces can be measured after the walking signals have been processed. The walking speed is then visualized as a speedometer whose indicator points to the value of the walking speed; for example, if the indicator points to the value "30", the user is walking thirty paces per three minutes. The speedometer thus provides a way to remind the user when he or she is walking too fast. According to the pilot study, we set the walking-speed threshold at 40 paces per three minutes. When the walking speed exceeds this threshold, the indicator points to the red area and the screen shows the warning message "too fast" at the top of the screen; the warning message disappears when the walking speed falls below 40 paces per three minutes.

Figure 7: User interface of i-m-Walk. The user interface shows three events: the warning message, the condition of each footstep, and the walking speed. Picture (a) shows that the user used an incorrect walking method with the right foot, and the colour block on the right foot changed to red. Picture (b) shows that the walking speed is too fast (46 steps per three minutes).
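The speedometer and the "too fast" warning of Section 4.3.2 can be sketched as follows; this is illustrative only, and the three-minute sliding window over footstep timestamps is an assumption about how the rate is computed.

    WINDOW_SECONDS = 180.0   # three-minute window
    TOO_FAST = 40            # warning threshold: paces per three minutes

    def walking_speed(step_times, now):
        # Number of detected paces whose timestamps fall within the last three minutes.
        return sum(1 for t in step_times if now - t <= WINDOW_SECONDS)

    def too_fast(step_times, now):
        # True when the warning message "too fast" should be shown on screen.
        return walking_speed(step_times, now) > TOO_FAST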
                  5.   EXPERIMENT DESIGN
    Two experiments were designed to test the effects of i-m-Walk during walking meditation. The first experiment was a pilot study in which we evaluated the effect of visual feedback showing six sensing curves projected on a wall. The second experiment evaluated the effects of i-m-Walk in improving users' walking behaviour.

5.1 Pilot study
    Before testing i-m-Walk, we designed a preliminary study to test the effects of visual feedback displayed on a wall. Eight master students volunteered to participate in this pilot study. The participants' average age was 26.3 (SD=0.52). All participants had experience of sitting meditation, but none of them had experience of walking meditation. There were four participants (three male and one female) in the experimental group (with visual feedback) and four participants (four male) in the control group (without visual feedback). Participants took part for ten minutes each day on three consecutive days. The experiment took place in a seminar room in the faculty building. Before the experiment, participants were taught the methods and principles of walking meditation. The experiment was a 4 × 2 between-participants design. In the experimental group, participants were asked to watch curves showing the feet's sensing values projected on the wall; in the control group, participants walked by themselves without visual feedback. Participants walked straight in the seminar room. The results showed a significant main effect: the experimental group had a lower walking speed than the control group (p<0.05) over the three days. The average number of wrong paces in the experimental group was also lower than in the control group. From this pilot study we drew two preliminary conclusions: (a) visual biofeedback could help users slow down their walking speed during walking meditation; (b) multimedia guidance could usefully help users be aware of their pace during walking meditation and could decrease the number of wrong paces. However, we also observed some issues in the pilot study; one is that the perspective of the projected display changes as the user walks to different locations, which might influence the effect of learning. Based on these results and recommendations, we designed an experiment to evaluate the effect of i-m-Walk.

5.2 User study

5.2.1 Participants
    Thirty master and PhD students in the Department of Computer Science volunteered to participate in this experiment. The participants' average age was 25.2 (SD=3.71). Twenty-seven participants had experience of sitting meditation and three did not; however, none of the participants had experience of walking meditation. 83.3% of the participants carry a mobile phone all the time, and 63.3% of the participants have experience of using a smartphone. There were fifteen participants (eleven male and four female) in the experimental group (with visual feedback) and fifteen participants (eleven male and four female) in the control group (without visual feedback). Because the participants' feet sizes differed, we prepared two pairs of shoes of different sizes, and participants could choose the more comfortable pair to wear.

5.2.2 Location
    Meditating in a quiet and enclosed area makes it easier to bring the mind inward and reach a calm and peaceful state. In this experiment, we selected a corridor in the faculty building as the experimental place for walking meditation. The corridor is a public place in an enclosed area where few people conduct daily activities such as standing, walking, and interacting with one another. The surroundings of the corridor are quiet and comfortable, allowing users to calm their minds. The corridor is thirty meters long and three meters wide, and the temperature was 21~23 degrees Celsius.

5.2.3 Procedure and analysis
    Before the experiment, participants were asked to walk along the corridor at their usual walking speed, and we recorded it. Then, participants were taught the method of walking meditation. The guideline of walking meditation which we provided to the participants was as follows: "Walking meditation is a way to align the feeling inside and outside of the body. You should focus on the movement of walking, from raising, lifting, pushing, lowering, stepping, to pressing. You have to land every footstep with the toes first and then slowly land your heel down. During walking meditation, you should stabilize your walking paces at as low a speed as possible. You have to relax your body from head to toes."
    The experiment was a 15 × 2 between-participants design. Participants took part for fifteen minutes each day on five consecutive days. Table 1 shows the procedure of this experiment. In the experimental group, participants were asked to use i-m-Walk from day 2 to day 4; in the control group, participants walked by themselves without any feedback during walking meditation.

                          DAY 1   DAY 2   DAY 3   DAY 4   DAY 5
  Experimental Group        ○       ●       ●       ●       ○
  Control Group             ○       ○       ○       ○       ○

Table 1: Experimental procedure: ● means that participants had to use i-m-Walk during walking meditation, and ○ means that participants did not have to use i-m-Walk during walking meditation.

    While learning walking meditation, all participants were asked to walk clockwise around the corridor and hold
the smartphone. In the control group, there was no visual feedback on the smartphone, although participants still needed to hold it. In the experimental group, participants were informed that they could choose not to look at the visual feedback when they were well aware of their pace. The participants of the experimental group were asked to complete a questionnaire after the experiments from day 2 to day 4. Besides, we asked all participants about their feelings and impressions after the experiment on day 5. All participants could also write down any recommendations and feelings after the experiment, and we discuss these issues in the discussion section.

5.2.4 Results
    We analyzed the average walking speed and the wrong paces in both the experimental group and the control group. For the average walking speed, Figure 8 shows the average time of each pace for the experimental group and the control group from day 1 to day 5. On day 1 and day 5, all participants practiced walking meditation without using i-m-Walk. T-tests revealed a significant difference (p < 0.005) in the average walking time per footstep between the experimental group and the control group from day 2 to day 4. In the experimental group, the average walking time per footstep increased from 4.5 seconds on day 1 to 10.9 seconds on day 5; in the control group, it increased from 3.2 seconds (day 1) to 5.1 seconds (day 5). The results showed that the participants in the experimental group had a significant main effect (p < .005) of slowing down the walking speed after learning walking meditation, whereas the participants in the control group had no significant main effect (p > .1). These results show that i-m-Walk could help participants slow down their walking speed during walking meditation.

Figure 8: The average time of one footstep for the experimental group and the control group from day 1 to day 5. Error bars show ±1 SE.

    In our experiment, the rule of the correct walking method was that participants should land every footstep with the toes first during walking meditation; if they landed a footstep with the heel first, it was counted as a wrong pace. Figure 9 shows the median values of total wrong paces in the 15-minute practice of walking meditation for the experimental group and the control group from day 1 to day 5.
    In the experimental group, the median value of wrong paces decreased from eight on day 1 to one on day 5, and the number of wrong paces decreased day by day. In the control group, the median value of wrong paces decreased from 7 on day 1 to 5 on day 5, but the wrong paces decreased only during the first three days. The results showed that i-m-Walk could effectively reduce wrong paces during walking meditation.

Figure 9: The median value of error footsteps for the experimental group and the control group from day 1 to day 5.

    The experimental group was asked to complete a questionnaire after using the i-m-Walk system from day 2 to day 4; the content of the questionnaire was the same each day. Figure 10 shows the results of the questionnaires, which were completed after walking meditation. We asked two questions: (1) to what degree does i-m-Walk help you be aware of your pace? (2) to what degree does i-m-Walk help you slow down your footsteps? There were five options for answers: "1: serious interference", "2: a little interference", "3: no interference and no help", "4: a little help", and "5: very helpful". The results showed that all participants in the experimental group gave positive feedback on both questions, and the questionnaire scores were between "a little helpful" and "very helpful".

Figure 10: The questionnaire results filled in by the experimental group from day 2 to day 4. The red line shows the baseline of satisfaction. Error bars show ±1 SE.
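The between-group comparison reported above is an independent-samples t-test on the average walking time per footstep; a minimal sketch is shown below. The two arrays are placeholders for illustration, not the recorded study data.

    from scipy import stats

    # Hypothetical per-participant mean footstep times (seconds) on one day.
    experimental = [10.2, 11.5, 9.8, 12.0, 10.9]
    control = [5.0, 4.8, 5.5, 5.2, 4.9]

    t_value, p_value = stats.ttest_ind(experimental, control)
    print("t = %.2f, p = %.4f" % (t_value, p_value))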
                      6.   DISCUSSION
    The aim of this section is to summarize, analyze, and discuss the results of this study and to give guidelines for the future development of applications.

6.1 User Interface
    The user interface of i-m-Walk provides pace information, including the walking speed, wrong paces, and the center of the feet. The walking-speed results showed significantly that i-m-Walk could help beginners decrease their walking speed during walking meditation. Some participants' comments from the experimental group follow:
    User E6 on day 3: "I always walked fast before, but when I saw the dashboard and the warning message 'too fast,' it was helpful to remind me to slow down my walking speed."
    We list two design principles of the user interface: (a) we used the form of a dashboard to represent the walking speed; the value of the walking speed is easy to watch, and the user may become aware of changes in walking speed while slowing down or speeding up; (b) i-m-Walk provided an additional alarm mechanism, the warning message "too fast", shown while walking too fast; this mechanism can remind the user when distracted. The wrong-pace results showed that i-m-Walk could effectively reduce wrong paces for beginners during walking meditation. One of the participants from the experimental group said:
    User E1 on day 2: "When I saw the color of the block on the screen change from green to red, I knew that I had made a wrong pace. Then, I would concentrate on my pace deliberately for the next footstep."

6.2 Human Perception
    Human beings receive messages by means of five modalities: vision, sound, smell, taste, and touch. The modalities most used in the field of human-computer interaction are the visual and auditory modalities. There was a comment from an experimental participant:
    User E3 on day 2: "If I could listen to my pace during walking meditation, I would not need to hold the smartphone."
    In cross-modal research, the visual modality is generally considered superior to the auditory modality in the spatial domain. In our case, we need to show the footstep phases accurately and also to show the walking speed and wrong paces at the same time; therefore, we selected visual feedback as the user interface. The advantage is that users can decide whether to watch the information or not, but the shortcoming is that users fail to receive the information when they do not look at it. Therefore, it is possible to provide more interaction methods, such as tactile and acoustic feedback, to remind users.
    On the other hand, the mechanisms of multimedia feedback might attract the user's attention in some cases, and too many inappropriate and redundant events might disturb users. In our system, we provided visual feedback all the time during walking meditation because we did not know whether the user needed the guidance or not, but we informed participants that they could decide not to look at the visual feedback when they were well aware of their pace. In this way, the interference during use could be minimized. The questionnaire showed that the participants felt there was no interference while using i-m-Walk and that the system was helpful.

6.3 Beginners vs. Masters
    In recent years, the concept of "slow technology" has been applied in many mediated systems. The design philosophy of slow technology is that we should use slowness in learning, understanding, and presence to give people time to think and reflect. In our case, walking meditation is one instance of this conception. There are two main parts in walking meditation: the inside condition and the outside condition. The inside condition means the meditation of the mind, and the outside condition means the meditation of the walking posture. All participants in our experiment were beginners, because we focused on the training of the outside condition, the walking posture. The difference between beginners and masters in walking meditation is that beginners are not familiar with walking meditation and need to pay more attention to the control of their pace, whereas masters are familiar with it and can focus on the meditation of aligning the inside and outside of the body. Walking meditation is a way to align the feeling inside and outside of the body, and a beginner should become familiar with the walking posture before the spiritual development. In this paper, the goal of our experiment is to evaluate the learning effects of the i-m-Walk system. The experimental results showed that the participants of the experimental group could slow down their walking speed and decrease their wrong paces after five days of training. Six participants in the experimental group felt that the experimental session on day four was shorter than on the first day, although the experimental time was the same; there was no such comment from the participants in the control group. The results showed that i-m-Walk could help users train the walking posture of walking meditation.

6.4 Reaction Time
    Reaction time is an important issue in human-computer interaction design: if the reaction delay is too long, users cannot control the system well and cannot easily be aware of the interaction. According to our observation, the delay time of i-m-Walk is 0.2 second. However, the delay does not affect users, because the application in this experiment does not need a fast reaction time; the average pace duration was 10.9 seconds in the experimental group on day five. The results of the questionnaires also showed that participants felt the visual feedback could reflect the walking status immediately. Nevertheless, the somatosensation of one's own feet is the most intuitive, and i-m-Walk can only provide assistance for beginners when they need it.
                  7.   CONCLUSIONS AND FUTURE WORK

    In this paper, we present a mobile application that uses
pressure sensors in shoes to visualize the phases of footsteps on
a mobile device, in order to raise the user's awareness of his
walking behaviour and to help him improve it. Our study
showed that i-m-Walk could effectively assist beginners in
slowing down their pace and decreasing the error rate of pace
during walking meditation. The concept behind i-m-Walk could
therefore be used in other applications, such as walking
rehabilitation.
    Despite the encouraging results of this study as to the
positive effect of i-m-Walk, future research is required in a
number of directions. For the intelligent shoes, we will analyze
the user's walking style, such as pigeon-toed and out-toed gait,
while walking. For the biofeedback mechanisms, we will design
further interaction methods, such as tactile and acoustic
feedback. Besides, we will record and analyze the user's
learning status while walking, and provide appropriate and
personalized guidance according to his condition. Currently, we
are adding sensing devices, such as the Breath-Aware Garment
and Sensing Ring, to detect the user's biosignals and activities,
and integrating them into i-m-Walk to analyze breathing status
and heart rate while walking and running.

                       8.   ACKNOWLEDGMENT

   This work was supported in part by the Technology
Development Program for Academia, Ministry of Economic
Affairs, Taiwan, under grant 98-EC-17-A-19-S2-0133.




Object of Interest Detection Using Edge Contrast Analysis


                     Ding-Horng Chen                                                          FangDe Yao
   Department of Computer Science and Information                           Department of Computer Science and Information
                    Engineering                                                              Engineering
            Southern Taiwan University                                               Southern Taiwan University
          Yong Kang City, Tainan County                                            Yong Kang City, Tainan County
              chendh@mail.stut.edu.tw                                              m97g0102@webmail.stut.edu.tw

Abstract— This study presents a novel method to detect the
focused object-of-interest (OOI) in a defocused, low depth-of-
field (DOF) image. The proposed method consists of three
steps. First, we use three different operators, namely saturation
contrast, morphological functions, and the color gradient, to
compute the object's edges. Second, hill-climbing color
segmentation is used to find the color distribution of the image.
Finally, we combine the edge detection and color segmentation
results to detect the object of interest. The proposed method
thus exploits the advantages of both the edge and the color
feature spaces. The experimental results show that our method
works satisfactorily on many challenging images.

    Keywords: Object of Interest (OOI); Depth of Field (DOF);
Object Detection; Edge Detection; Blur Detection.

                       I.    INTRODUCTION
     The market for digital single-lens reflex (DSLR) cameras
has expanded tremendously as prices have become more
affordable. For a professional photographer, a DSLR offers
excellent image quality, interchangeable lenses, and an
accurate, large, and bright optical viewfinder. A DSLR also has
a larger sensor, which produces a more pronounced depth-of-
field (DOF) effect, one of its most significant features.
According to market reports [1][2][3], the DSLR market share
will grow quickly in the near future. Table 1 shows the growth
of the digital camera market.

           Table 1.   Market Estimate of the Digital Cameras

       Year            2006        2011       Growth Rate
       World Market     81         82.2          108%
       DSLR             4.8         8.3          173%
       DSC             76.8        79.9          104%
                                          Unit: Million US$

     The extraction of a local region of interest in an image is
one of the most important research topics in computer vision
and image processing [4][5]. The detection of the object of
interest (OOI) in a low-DOF image can be applied in many
fields, such as content-based image retrieval. Measuring the
sharpness or blurriness of edges in an image is also important
for many image processing applications, for instance checking
the focus of a camera lens, identifying shadows (whose edges
are often less sharp than object edges), separating variations in
illumination from the reflectance of objects (also known as
intrinsic image extraction), and separating in-focus (foreground)
from out-of-focus (background) areas in an image.
     The DOF is the portion of a scene that appears acceptably
sharp in the image. Although a lens can precisely focus at only
one distance, the sharpness decreases gradually on each side of
the focused distance. A low (small) DOF is an effective way to
emphasize the photographic subject. The OOI is thus obtained
through the photographic technique of using a low DOF to
separate the object of interest in a photo. Fig. 1 shows a typical
OOI image with low DOF.

                    Figure 1. A typical OOI image

     The OOI detection problem can be viewed as an extension
of the blur detection problem. In Chung's method [6], the x- and
y-direction derivatives and a gradient map are computed to
measure the blur level, and the edge points are obtained from a
weighted average of the standard deviation of the magnitude
profile around each edge point.
     Renting Liu et al. [7] proposed a method that determines the
blur type of an image: using pre-defined blur features, it trains a
blur classifier to discriminate different regions. This classifier is
based on features such as the local power spectrum slope, the
gradient histogram span, and the maximum saturation. The
blurry regions are then measured by local autocorrelation
congruency to recognize the blur type.
     The above methods determine the blur level and the blurred
regions, but they still cannot extract the OOI from an image. If
the background is complex or the edges are blurred, these
methods are unable to find the OOI [6][7]. N. Santh and
K. Ramar proposed two approaches, i.e., the edge-based




and region-based approaches, to segment low-DOF images [8].
They transform the low-DOF pixels into an appropriate feature
space called the higher-order statistics (HOS) map. The OOI is
then extracted from the low-DOF image by region merging and
a thresholding technique as the final decision.
     However, if the object's shape is complex or its edges are
not fully connected, it is still hard to find the object. The OOI
may not be a compact region with a perfectly sharp boundary,
so edge detection alone cannot find a complete object in a
low-DOF image. In some cases, such as macro or close-up
photography, the depth of field is very low and some parts of
the subject may be out of focus, which causes a partial blur on
the subject. To acquire a satisfactory OOI detection result, not
only the blurred part but also the sharp part needs to be taken
into consideration. Finding a good OOI in such images is
therefore challenging.

                   II.   THE PROPOSED METHOD
     In this paper, we propose a novel method to extract the OOI
from a low-DOF image. The proposed algorithm consists of
three steps. First, we find the object boundaries by computing
the sharpness of edges. Second, hill-climbing color
segmentation is used to find the color distribution and its edges.
Finally, we integrate the above results to obtain the OOI
location.
     The first step is divided into three parts and is illustrated in
Fig. 2. We calculate the feature parameters, including the
maximum saturation, the color gradient, and the local range
image. The image is converted into the CIE Lab color space
and edge detection is performed. For noise reduction, we use a
median filter to remove fragmentary values. All the feature
images are then multiplied together to extract the exact position
of the OOI.

                  Figure 2. Edge detection flowchart

A. Saturation Edge Power Mean
     Fig. 3 shows the original image in which we want to detect
the OOI. The background is out of focus and thus smoother
than the object we want to detect. Color saturation and edge
sharpness are the major differences between the objects and the
background. Color information is very important in blur
detection: blurred pixels tend to have less vivid colors than
un-blurred pixels because of the smoothing effect of the
blurring process, so focused (un-blurred) objects are likely to
have more vivid colors than blurred parts, and the maximum
saturation value in blurred regions is expected to be smaller
than in un-blurred regions. Based on this observation, we use
the following equation to compute the pixel saturation:

    S_P = 1 - \frac{3}{R + G + B} \min(R, G, B)                         (1)

where S_P is the saturation of the pixel. Equation (1) transforms
the original image into the saturation feature space to find the
parts of the image with higher saturation.
     In low-DOF images, the saturation does not change
dramatically in the background because it is smoother; on the
contrary, the color saturation changes sharply along the edges.
Therefore, we define the edge contrast CA, computed in a 3x3
window, as follows:

    CA = \sum_{n \in M,\, n \neq A} \frac{(n - A)^2}{n}                 (2)

where M is the 3x3 window, A is the saturation value at the
window center, and n is a saturation value in the neighborhood
within this window.
     Equation (2) measures the saturation contrast. Here we
show the result images to demonstrate the processing steps:
Fig. 4 is the resulting saturation image, and Fig. 5 shows the
result after the edge contrast computation.

                      Figure 3. Original image

                     Figure 4. Saturation image
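
    As an illustration of Eqs. (1) and (2), the following sketch computes the
saturation map and the edge contrast in Python with NumPy. The function names
are ours, the 3x3 window simply skips the one-pixel image border, and the 1/n
normalization inside Eq. (2) reflects our reading of the partly garbled layout
of the original equation rather than a confirmed definition.

    import numpy as np

    def saturation_map(rgb):
        # Eq. (1): S_P = 1 - 3 * min(R, G, B) / (R + G + B)
        rgb = rgb.astype(np.float64)
        total = rgb.sum(axis=2) + 1e-6            # guard against division by zero
        return 1.0 - 3.0 * rgb.min(axis=2) / total

    def edge_contrast(sat, eps=1e-6):
        # Eq. (2): accumulate (n - A)^2 / n over the eight neighbours n of the
        # centre value A in each 3x3 window; border pixels are left at zero.
        h, w = sat.shape
        ca = np.zeros_like(sat)
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                a = sat[i, j]
                win = sat[i - 1:i + 2, j - 1:j + 2].ravel()
                n = np.delete(win, 4)             # drop the centre value A
                ca[i, j] = np.sum((n - a) ** 2 / (n + eps))
        return ca

Calling edge_contrast(saturation_map(img)) yields a map of the kind visualized
in Fig. 5.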




                   Figure 5. Saturation edge image

B. Color Gradient
    The gradient of a scalar field is a vector field that points in
the direction of the greatest rate of increase of the scalar field
and whose magnitude is that greatest rate of change. It is very
useful in typical edge detection problems.
    To calculate the gradient of the color intensity, we first use
the Sobel operator to separate the vertical and horizontal edges:

    G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A, \qquad
    G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A          (3)

    G = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)         (4)

    Equations (3) and (4) show the traditional way to compute
the gradient. Here \theta is the edge angle, and \theta = 0 for a vertical
edge which is darker on the left side. We modify the above
equations to be more accurate in our case as follows:

    G_x = R_x^2 + G_x^2 + B_x^2
    G_y = R_y^2 + G_y^2 + B_y^2
    G_{xy} = R_x R_y + G_x G_y + B_x B_y
    A = 0.5 \arctan\!\left(\frac{2 G_{xy}}{G_x - G_y}\right)
    G_1 = 0.5 \left[ (G_x + G_y) + (G_x - G_y)\cos(2A) + 2 G_{xy}\sin(2A) \right]

where R_x, G_x, and B_x are the RGB layers filtered by the
horizontal Sobel operator, and R_y, G_y, and B_y are the RGB
layers filtered by the vertical Sobel operator. A is the angle of
G_{xy}, and G_1 is the color gradient of the image at angle 0.
    The definition of G_2 is similar to that of G_1, but the term A
is replaced by A + \pi/2. Therefore, G_2 is computed as

    G_2 = 0.5 \left[ (G_x + G_y) + (G_x - G_y)\cos\!\left(2(A + \tfrac{\pi}{2})\right) + 2 G_{xy}\sin\!\left(2(A + \tfrac{\pi}{2})\right) \right]

    The color gradient CG is obtained by taking the maximum
of G_1 and G_2, i.e.,

    CG = \max(G_1, G_2)

    The CG value reflects the color intensity along the edge
gradient; it increases if the color at an edge point changes
dramatically. Fig. 6 shows the result of the color gradient
computation.

                     Figure 6. Color vector image

C. Local Range Image
    In this study, we adopt the morphological functions
DILATION and EROSION to find the local maximum and
minimum values in a specified neighborhood.
    First, we convert the original image from the RGB color
space to the CIE Lab color space. Because the luminance of an
object is not always flat, we compute the local range value only
for the a and b layers, without the L (luminance) component;
discarding the luminance prevents uneven lighting from being
mistaken for color diversification on the object. The dilation,
erosion, and local range computations are defined by the
following equations:

    Dilation:              A \oplus B = \{ z \mid (\hat{B})_z \cap A \neq \varnothing \}
    Erosion:               A \ominus B = \{ z \mid (\hat{B})_z \subseteq A \}
    Local Range Image:     (A \oplus B) - (A \ominus B)

    Fig. 7 shows the result of the local range operation.

                    Figure 7. A local range image
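
    A sketch of the color gradient of Section II-B, using scipy.ndimage.sobel
for the per-channel Sobel responses; the names and the use of arctan2 (for
numerical robustness) are our choices and are not taken from the paper.

    import numpy as np
    from scipy.ndimage import sobel

    def color_gradient(rgb):
        rgb = rgb.astype(np.float64)
        # Sobel responses of each channel along the two image axes.
        gx_c = [sobel(rgb[..., c], axis=1) for c in range(3)]   # horizontal derivative
        gy_c = [sobel(rgb[..., c], axis=0) for c in range(3)]   # vertical derivative

        # Squared-magnitude and cross terms of the three color channels.
        gxx = sum(g * g for g in gx_c)
        gyy = sum(g * g for g in gy_c)
        gxy = sum(gx * gy for gx, gy in zip(gx_c, gy_c))

        # Principal direction A of the color gradient.
        a = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy)

        def g(theta):
            # Gradient energy along direction theta (the bracketed term of G1/G2).
            return 0.5 * ((gxx + gyy) + (gxx - gyy) * np.cos(2.0 * theta)
                          + 2.0 * gxy * np.sin(2.0 * theta))

        g1 = g(a)                   # G1: evaluated at A
        g2 = g(a + np.pi / 2.0)     # G2: evaluated at A + pi/2
        return np.maximum(g1, g2)   # CG = max(G1, G2)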
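
    For the local range image of Section II-C, a companion sketch using
grey-scale morphology from SciPy; the RGB-to-Lab conversion is assumed to come
from scikit-image, and combining the a and b channel ranges with a maximum is
our own choice, since the paper does not state how the two channels are merged.

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion
    from skimage.color import rgb2lab   # assumed available for RGB -> CIE Lab

    def local_range_image(rgb, size=(3, 3)):
        # Local range = dilation - erosion, computed on the a and b channels
        # only, so that uneven luminance (the L channel) does not dominate.
        lab = rgb2lab(rgb)
        ranges = [grey_dilation(lab[..., c], size=size) -
                  grey_erosion(lab[..., c], size=size)
                  for c in (1, 2)]
        return np.maximum(ranges[0], ranges[1])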




D. Median Filter
    The median filter is a nonlinear digital filtering technique
that is often used to remove noise. Such noise reduction is a
typical pre-processing step to improve the results of later
processing. The edge detection process leaves some
fragmentary values; if these values are low or the fragmentary
edges are not connected, they can be regarded as noise.
Therefore, we adopt the median filter to reduce the fragmentary
pixels.

E. Hill-Climbing Color Segmentation
    Edge detection finds most edges of the OOI, but the
resulting boundaries are usually not completely closed, and
morphological operators cannot link all the disconnected edges
into a complete boundary. Most OOI edges can be detected by
the previous procedures, but some edges remain unconnected.
To turn the OOI boundary into a regular closure, we adopt color
segmentation to connect the isolated edges.
    The color segmentation method is illustrated in Fig. 8. It is
based on T. Ohashi et al. [10] and R. Achanta et al. [11]. The
hill-climbing algorithm detects local maxima of clusters in the
global three-dimensional color histogram of an image. The
algorithm then associates the pixels of the image with the
detected local maxima; as a result, several visually coherent
segments are generated.

     Figure 8. Color segmentation and edge detection flow chart

    The detailed algorithm is described as follows:
    1. Convert the image to the CIE Lab color space.
    2. Build the CIE Lab color histogram.
    3. Search the color histogram for local maximum values.
    4. Use the local maximum colors as the initial centroids of
       a k-means classification.
    5. Re-train the classifier until the cluster centers are stable.
    6. Apply the k-means clustering and remap the original
       pixels to their clusters.
    Fig. 9 shows the result of the color segmentation.

               Figure 9. A color segmentation result

F. Edge Combination
    The OOI edges are obtained by two methods. First, we use
the morphological close operation, which is a dilation followed
by an erosion, to connect isolated points; the close operation
makes the gaps between unconnected edges smaller and the
outer edges smoother. Second, we apply edge detection to the
color segmentation map to find the color distribution, and
merge it with the previous edge detection result.
    After the above procedures we have most of the edge clues,
and we then integrate these clues into a complete OOI
boundary. Let the result of the boundary detection be I_E and
the result of the color segmentation be I_C. The edges are
extended by counting the pixels of I_C and the neighboring
points of I_E. To determine whether a pixel at the end of I_E
should be extended, we assign an "edge extension" value P at
point (i, j) as follows:

    P(i, j) = I_C(i, j) \sum_{n=-1}^{1} \sum_{m=-1}^{1} I_E(i+n, j+m)          (16)

where (n, m) slides over a 3x3 window and I_E is the value of
the previous edge detection image in that neighborhood.
Equation (16) removes the unnecessary pixels and closes the
OOI mask by extending the boundaries. The result is shown in
Fig. 10, and the image that merges the edge extension result
with the color segmentation edges is shown in Fig. 11.

  Figure 10. (a) The result before the edge extension (b) The result after the
                             edge extension
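
    The edge-extension rule of Eq. (16) can be sketched as follows. This is
our reading of the formula: the color-segmentation edge value at (i, j) is
multiplied by the number of previously detected edge pixels in the surrounding
3x3 window, and only positive values are kept; the names and the zero padding
at the border are illustrative.

    import numpy as np

    def edge_extension(i_e, i_c):
        # Eq. (16): P(i, j) = I_C(i, j) * sum of I_E over the 3x3 neighbourhood.
        h, w = i_e.shape
        pad = np.pad(i_e.astype(np.int32), 1)
        neigh = sum(pad[1 + n:1 + n + h, 1 + m:1 + m + w]
                    for n in (-1, 0, 1) for m in (-1, 0, 1))
        p = i_c.astype(np.int32) * neigh
        # Keep a segmentation edge pixel only where it touches at least one
        # previously detected edge pixel, extending the OOI boundary.
        return (p > 0).astype(np.uint8)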
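
    The hill-climbing segmentation steps of Section II-E can be sketched in
the same spirit. Here the peak search is approximated with a 3-D maximum
filter over a coarse Lab histogram, and SciPy's kmeans2 stands in for the
re-training loop, so the bin count and helper names are our assumptions rather
than the authors' settings.

    import numpy as np
    from scipy.ndimage import maximum_filter
    from scipy.cluster.vq import kmeans2
    from skimage.color import rgb2lab          # assumed available

    def hill_climbing_segmentation(rgb, bins=16):
        # Steps 1-2: convert to CIE Lab and build the 3-D color histogram.
        lab = rgb2lab(rgb).reshape(-1, 3)
        hist, edges = np.histogramdd(lab, bins=bins)

        # Step 3: local maxima of the histogram act as color peaks.
        peaks = (hist == maximum_filter(hist, size=3)) & (hist > 0)
        centers = np.array([[0.5 * (e[i] + e[i + 1]) for e, i in zip(edges, idx)]
                            for idx in zip(*np.nonzero(peaks))])

        # Steps 4-6: k-means seeded with the peak colors, then remap pixels.
        _, labels = kmeans2(lab, centers, minit='matrix')
        return labels.reshape(rgb.shape[:2])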




   Figure 11. The result image that merges the edge extension image and the
                        color segmentation image

    We integrate the above edge pieces into a complete OOI
mask: if the boundaries are closed, we add the enclosed region
to the final OOI mask. The edge combination of the final OOI
mask is shown in Fig. 12.

                 Figure 12. Edge combination result

               III.   THE EXPERIMENTAL RESULTS
    The aperture stop of a photographic lens, in combination
with the shutter speed, controls the amount of light reaching the
film or image sensor. In this study, we use a Pentax ist DL
digital camera and a prime lens, the Helios M44-2 60mm F2.0,
to perform the experiments. We choose a prime lens as our test
lens in order to reduce the unstable parameters. To ensure that
all of the exposures are the same, we control the shutter speed
and aperture manually.
    To test the proposed method, we randomly select 5 test
photos from an album of 50 photos, all taken under the same
conditions and camera parameters. Fig. 13 shows the proposed
OOI detection results for different aperture values.

        Figure 13. Five examples with different aperture values

    The DOF becomes smaller as the aperture value gets lower,
and parts of the OOI may be blurred as well. A higher aperture
value increases the edge sharpness, which makes it more
difficult to separate the background from the OOI. Figs. 14 to
17 show the OOI detection results. In our experiments, the
object boundaries become irregular as the aperture value gets
higher, and the proper aperture value for obtaining the best
segmentation results is about f/2.8 to f/5.6.

          Figure 14. The experimental results (sample 1)

          Figure 15. The experimental results (sample 2)




          Figure 16. The experimental results (sample 3)

          Figure 17. The experimental results (sample 4)

    A convincing definition of a "good OOI" is hard to give,
since it depends on human cognition. In this paper, we follow
the experiment of N. Santh and K. Ramar [8] to verify the
proposed method. First, five user-defined OOI boundaries are
drawn; we then compare them with the boundaries detected by
the proposed method. Equation (17) computes the overlapped
region between the reference and the detected OOI boundaries,
i.e.,

    Accuracy = 1 - \frac{\sum_{(x,y)} \left| I_{est}(x,y) - I_{ref}(x,y) \right|}{\sum_{(x,y)} I_{ref}(x,y)}          (17)

where I_est is the OOI mask produced by the proposed method
and I_ref is the mask drawn by the user as the ground truth.
    Fig. 18(a) shows the user-drawn OOI boundaries and
Fig. 18(b) shows the detected OOI boundaries.

   Figure 18. Comparison results: (a) User-drawn OOI boundary (b) The
                        proposed method result

    The detection accuracy decreases when the OOI has
complex texture, such as shirts, cloth, or artificial structures,
and the accuracy is higher when the background is simple.
Even if the image is not correctly focused on the target, the
proposed method can still find a complete object. The accuracy
becomes lower if there is more than one OOI in an image, as
shown for sample 2 in Fig. 18. Table 2 shows the accuracy
computed by Equation (17).

 Table 2.   The comparison result between the reference images and the
                            proposed method

     Sample        1        2        3        4       5
     Accuracy    98.2%    94.6%    96.1%     98%     91%
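
    The accuracy measure of Eq. (17) is straightforward to compute from two
binary masks; a minimal sketch follows (the absolute difference reflects our
reading of the operator lost in this copy of the equation):

    import numpy as np

    def boundary_accuracy(i_est, i_ref):
        # Eq. (17): 1 - sum|I_est - I_ref| / sum(I_ref), masks given as 0/1.
        est = i_est.astype(np.float64)
        ref = i_ref.astype(np.float64)
        return 1.0 - np.abs(est - ref).sum() / ref.sum()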




                        IV.   CONCLUSION
    In this paper we propose a method to extract OOI objects
from a low-DOF image based on edge and color information.
The method needs no user-defined parameters, such as the
shapes and positions of objects, and no extra scene information.
We integrate the color saturation, morphological functions, and
color gradient to detect a rough OOI, and finally use color
segmentation to make the OOI boundaries closed and compact.
Our method thus takes advantage of both edge detection and
color segmentation.
    The experiments show that our method works satisfactorily
on many different kinds of images. The method can be applied
as a pre-processing step in image processing and computer
vision tasks such as object indexing and content-based image
retrieval.

                          REFERENCES
[1]  InfoTrends, "The Consumer Digital SLR Marketplace: Identifying &
     Profiling Emerging Segments," Digital Photography Trends, September
     2008. http://www.capv.com/public/Content/Multiclients/DSLR.html
[2]  Dudubird, "Chinese Photographic Equipment Industry Market Research
     Report," December 2009. http://www.cnmarketdata.com
     /Article_84/2009127175051902-1.html
[3]  Fuji Keizai market report, September 2007.
     https://www.fuji-keizai.co.jp/market/06074.html
[4]  Khalid Idrissi, Guillaume Lavoué, Julien Ricard, and Atilla Baskurt,
     "Object of interest-based visual navigation, retrieval, and semantic
     content identification system," Computer Vision and Image
     Understanding, vol. 94, 2004, pp. 271-294.
[5]  James Z. Wang, Jia Li, Robert M. Gray, and Gio Wiederhold,
     "Unsupervised Multiresolution Segmentation for Images with Low
     Depth of Field," IEEE Transactions on Pattern Analysis and Machine
     Intelligence, vol. 23, no. 1, January 2001, pp. 85-90.
[6]  Yun-Chung Chung, Jung-Ming Wang, Robert R. Bailey, and Sei-Wang
     Chen, "A Non-Parametric Blur Measure Based on Edge Analysis for
     Image Processing Applications," IEEE Conference on Cybernetics and
     Intelligent Systems, Singapore, 1-3 December 2004.
[7]  Renting Liu, Zhaorong Li, and Jiaya Jia, "Image Partial Blur Detection
     and Classification," IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2008, pp. 1-8.
[8]  N. Santh and K. Ramar, "Image Segmentation Using Morphological
     Filters and Region Merging," Asian Journal of Information Technology,
     vol. 6(3), 2007, pp. 274-279.
[9]  D. Kornack and P. Rakic, "Cell Proliferation without Neurogenesis in
     Adult Primate Neocortex," Science, vol. 294, Dec. 2001, pp. 2127-2130.
[10] T. Ohashi, Z. Aghbari, and A. Makinouchi, "Hill-climbing Algorithm
     for Efficient Color-based Image Segmentation," IASTED International
     Conference on Signal Processing, Pattern Recognition, and Applications
     (SPPRA 2003), June 2003, p. 200.
[11] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, "Salient Region
     Detection and Segmentation," International Conference on Computer
     Vision Systems (ICVS 2008), May 2008, pp. 66-75.
[12] Martin Rufli, Davide Scaramuzza, and Roland Siegwart, "Automatic
     Detection of Checkerboards on Blurred and Distorted Images,"
     International Conference on Intelligent Robots and Systems, September
     2008, pp. 22-26.
[13] Hanghang Tong, Mingjing Li, Hongjiang Zhang, and Chanshui Zang,
     "Blur Detection for Digital Images Using Wavelet Transform,"
     International Conference on Multimedia and Expo 2004, pp. 17-20.
[14] Gang Cao, Yao Zhao, and Rongrong Ni, "Edge-based Blur Metric for
     Tamper Detection," Journal of Information Hiding and Multimedia
     Signal Processing, vol. 1, no. 1, January 2009, pp. 20-27.
[15] Rong-bing Gan and Jian-guo Wang, "Minimum Total Variation
     Autofocus Algorithm for SAR Imaging," Journal of Electronics &
     Information Technology, vol. 29, no. 1, January 2007, pp. 12-14.
[16] Ri-Hua Xiang and Run-Sheng Wang, "A Range Image Segmentation
     Algorithm Based on Gaussian Mixture Model," Journal of Software,
     vol. 14, no. 7, 2003, pp. 1250-1257.




Efficient Multi-Layer Background Model on Complex Environment for
                                Foreground Object Detection
      1Wen-kai Tsai (蔡文凱), 2Chung-chi Lin (林正基), 1Ming-hwa Sheu (許明華), 1Siang-min Siao (蕭翔民), 1Kai-min Lin (林凱名)
               1 Graduate School of Engineering Science and Technology, National Yunlin University of Science & Technology
                                      2 Department of Computer Science, Tung Hai University
                                              E-mail: g9610804@yuntech.edu.tw


Abstract—This paper proposes the construction of a multi-layer
background model that can be used in complex environment
scenes. In general, a surveillance system focuses on detecting the
moving objects, but real scenes contain many moving background
elements, such as swaying leaves and falling rain. In order to
detect objects in such moving-background environments, we use
an exponential distribution function to update the background
model and combine background subtraction with homogeneous
region analysis to find the foreground objects. The system runs on
the TI TMS320DM6446 Davinci development platform and
achieves 20 frames per second on benchmark images of size
160x120. The experimental results show that our approach
performs better in terms of detection accuracy and similarity
measure than other modeling techniques.

   Keywords: background modeling; object detection

                     I.    INTRODUCTION
    Foreground object detection is a very important technology
in image surveillance systems, since the system performance
highly depends on whether the foreground objects are detected
correctly. Furthermore, the foreground objects need to be
detected accurately and quickly, so that follow-up work such as
tracking and identification can be performed correctly and
reliably. Conceptually, foreground object detection is mostly
based on background subtraction. This approach seems simple
and has a low computational cost; however, it is difficult to
obtain good results without a reliable background model. To
manage complex background scenarios, the skill of
constructing a suitable background model has become the most
crucial one.
    Generally speaking, most algorithms only regard
non-moving objects as background, but in real environments
many moving objects may also belong to the background; we
call this the moving background, for example waving trees.
However, constructing a moving background model is a
difficult task. The general practice is to use algorithms to learn
and establish the background model; after building up the
model, the system starts to carry out foreground object
detection. Therefore, a number of background models have
been proposed in recent years. The most popular approach is
the Mixture of Gaussians model (MoG) [1-2]. Although MoG
has the advantage of updating its model parameters
automatically, it needs a very long period of time to learn the
background model, and it also faces severe limitations in
memory space and processing speed on embedded systems.
The Codebook background model [3] establishes a rational and
adaptive capability which improves the detection accuracy
under moving background and lighting changes; however, it
still requires a high computational cost and a large memory
space for saving the background data. Subsequently, a
Gaussian model [4] was presented that updates the threshold
value for each pixel, but its disadvantages include a large
amount of computation and a large memory space for
recording the background model. In order to reduce memory
usage, [5] and [6] calculate a weight value for each pixel to
establish the background model; according to the weight value,
the updating mechanism determines whether the pixel is
replaced or not, so a smaller amount of memory space is used
to model the moving background.
    The above works all use multi-layer background models to
store background information, but this is still inadequate for
dealing with moving-background issues. They need to take into
account the dependency between adjacent pixels to inspect
whether the neighboring region possesses homogeneous
characteristics or not. This paper proposes an efficient 4-layer
background model and a homogeneous region analysis to
characterize the background pixels.

       II.   BUILDING MULTI-LAYER BACKGROUND MODELS
    First, the input image pixel x_{i,j}(t) consists of R, G, and B
elements, as shown in Eq. (1). The pixels of the moving
background inevitably appear repeatedly in some regions, so we
have to learn these appearance behaviors when constructing the
multi-layer background model. The first layer of the
background model (BGM1) stores the first input frame. For the
2nd frame, we record the difference between the 1st and 2nd
frames in the second layer (BGM2). Similarly, the difference of
the next pair of consecutive frames is saved in the third layer
(BGM3), and so on. We use the first 4 frames and their
differences as the initial background model. Besides, Eq. (2) is
used to record the number of occurrences of each pixel in the
learning frames.

    x_{i,j}(t) = \left( x_{i,j}^{R}(t),\; x_{i,j}^{G}(t),\; x_{i,j}^{B}(t) \right)          (1)
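
    A minimal sketch of the initialization described above, in Python with
NumPy. The phrase "record the difference" is read here as keeping, in each
further layer, the pixels of the newer frame that differ from the previous one
(copying the previous layer elsewhere); the threshold value and function names
are illustrative assumptions, not the authors' settings.

    import numpy as np

    def init_background_models(frames, th=20):
        # BGM1 stores the first frame; BGM2..BGM4 keep the pixels of frames
        # 2..4 that differ from the preceding frame, as one reading of Sec. II.
        f = [fr.astype(np.float64) for fr in frames[:4]]
        bgm = [f[0].copy()]
        for k in range(1, 4):
            changed = np.abs(f[k] - f[k - 1]).sum(axis=2) > th
            layer = bgm[-1].copy()
            layer[changed] = f[k][changed]
            bgm.append(layer)
        # MATCH counters of Eq. (2), one per layer, start at zero.
        match = [np.zeros(f[0].shape[:2], dtype=np.int32) for _ in range(4)]
        return bgm, match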




    MATCH_{i,j}^{u}(t) = \begin{cases} MATCH_{i,j}^{u}(t-1), & \text{if } \left| x_{i,j}(t) - BGM_{i,j}^{u}(t) \right| > th \\ MATCH_{i,j}^{u}(t-1) + 1, & \text{else} \end{cases}          (2)

where u = 1...4 and th is the threshold value for the similarity
comparison. From the 5th learning frame onward, we calculate
the repetition number of occurrences of every pixel in each
layer of the background model, and Eq. (3) gives its frequency
of occurrence:

    \lambda_{i,j}^{u} = \frac{MATCH_{i,j}^{u}(t)}{N}          (3)

where N is the total number of learning frames. A larger
\lambda_{i,j}^{u} indicates that the corresponding pixel occurred more
often during the learning period and must be preserved in the 4
layers. Conversely, pixels with lower occurrence will be
removed.

                   III.   BACKGROUND UPDATE
    After building up the multi-layer background model, we
must update the content of BGM_{i,j} over time to replace
inadequate background information, so the background update
mechanism is very important for the subsequent object
detection. The proposed background update method uses an
exponential distribution model to calculate the weight value of
each pixel, as shown in Eq. (4); it captures the repetition
condition of occurrence of each pixel in the background model.
A lower weight expresses that the corresponding pixel has not
appeared for a long time, and it should be replaced by a
higher-weight input pixel.

    weight_{i,j}^{u}(t) = \lambda_{i,j}^{u} \, e^{-\lambda_{i,j}^{u} t}, \quad t > 0          (4)

where t is the number of non-matching frames.
    Fig. 1 shows the distribution of the weight values. If a pixel
in the background model is not matched for a period of time, its
weight value decreases exponentially. If the weight value is less
than a threshold, the background pixel is replaced according to
Eq. (5).

          Figure 1. Exponential distribution of the weight

    BGM_{i,j}^{u}(t) = \begin{cases} \text{remove}, & \text{if } weight_{i,j}^{u}(t) < T_e \\ \alpha \times BGM_{i,j}^{u}(t) + (1-\alpha) \times BGM_{i,j}^{u}(t-1), & \text{else} \end{cases}          (5)

where T_e is a threshold for the weight and \alpha is a constant with
\alpha < 1.
    Based on the above approach, Fig. 2 demonstrates a 4-layer
background model constructed after learning 100 frames.

     Figure 2. Multi-layer background model: (a) BGM1 (b) BGM2
                      (c) BGM3 (d) BGM4

                    IV.   OBJECT DETECTION
    After establishing an accurate background model,
background subtraction can be used to obtain the foreground
objects. From practical observation, the moving background
has a homogeneous characteristic. Therefore, the object
detection method carries out the subtraction on both the 4-layer
background and its homogeneous regions.
    As shown in Fig. 2, the information stored in the
background model is the scene of the moving background,
which has the important feature of homogeneity. In Eqs. (6)
and (7), TI(t) is the total matching index between the input
pixel and the homogeneous region of the 4-layer background,
and D_{i+k,j+p}^{u} is the individual matching index between the
input pixel and one background datum BGM_{i+k,j+p}^{u}. The
homogeneous region is defined as (2r+1)x(2r+1) around the
background data at location (i, j).

    TI(t) = \sum_{u=1}^{4} \sum_{k=-r}^{r} \sum_{p=-r}^{r} D_{i+k,j+p}^{u}(t)          (6)

    D_{i+k,j+p}^{u}(t) = \begin{cases} 1, & \text{if } \left| x_{i,j}(t) - BGM_{i+k,j+p}^{u}(t) \right| \le th \\ 0, & \text{else} \end{cases}          (7)

where th is a threshold value that determines whether the pixels
are similar. If TI(t) is greater than a threshold \tau, the input
x_{i,j}(t) is similar to much of the background information and
is not an object pixel.
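
    A sketch of the per-layer match counting and the exponential-weight update
of Eqs. (2)-(5). The blending in Eq. (5) is written against the current input
frame, which is how we read the self-referential BGM term, and th, Te and
alpha are illustrative values, not ones reported by the authors.

    import numpy as np

    def update_match(match, x, bgm, th=20):
        # Eq. (2): MATCH of a layer grows by one whenever the input pixel is
        # within th of that layer's stored value.
        xf = x.astype(np.float64)
        for u in range(4):
            dist = np.abs(xf - bgm[u]).sum(axis=2)
            match[u] += (dist <= th)
        return match

    def layer_weight(lam, t):
        # Eqs. (3)-(4): lam = MATCH / N and weight = lam * exp(-lam * t),
        # where t counts the frames since the layer pixel last matched.
        return lam * np.exp(-lam * t)

    def update_layer(bgm_u, x, weight_u, te=0.05, alpha=0.7):
        # Eq. (5): stale pixels (weight below Te) are replaced by the input;
        # the rest are blended with the previous layer value.
        xf = x.astype(np.float64)
        out = alpha * xf + (1.0 - alpha) * bgm_u
        stale = weight_u < te
        out[stale] = xf[stale]
        return out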
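
    The homogeneous-region matching of Eqs. (6)-(7) and the thresholding
against \tau described above can then be sketched as follows; r, th and tau are
illustrative parameters, and edge padding is our choice for handling the image
border.

    import numpy as np

    def foreground_mask(x, bgm, r=1, th=20, tau=3):
        # Eqs. (6)-(7): TI counts, over all 4 layers, the background samples in
        # the (2r+1)x(2r+1) neighbourhood that lie within th of the input pixel.
        h, w = x.shape[:2]
        xf = x.astype(np.float64)
        ti = np.zeros((h, w), dtype=np.int32)
        for u in range(4):
            pad = np.pad(bgm[u], ((r, r), (r, r), (0, 0)), mode='edge')
            for k in range(-r, r + 1):
                for p in range(-r, r + 1):
                    shifted = pad[r + k:r + k + h, r + p:r + p + w]
                    dist = np.abs(xf - shifted).sum(axis=2)
                    ti += (dist <= th)
        # Pixels similar to enough background information are background;
        # the rest are reported as foreground.
        return (ti < tau).astype(np.uint8)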




and it is not a object pixel. Eq. (8) is used to find out                 sequence. Our proposed approach can achieve the highest
foreground object (FO).                                                   similarity value, i.e. our results are close to those of ground
                                                                          truth.
                ⎧ 0,   if TI (t ) ≥ τ
 FOi , j (t ) = ⎨                                    (8)
                ⎩1,      else

When FOi , j (t ) = 1 , the input pixel belongs to foreground
object pixel. On the other hand, If FOi , j (t ) = 0 , the input
pixel belongs to background pixel.


   V.     EXPERIMENTAL RESULTS OF PROTOTYING SYSTEM
    Based on our proposed approach, the object detection is
implemented by TMS320DM6446 Davinci as shown in
Fig.3. The input image resolution is 160*120 per flame.
Averagely, our approach can process 20 frames per second
for performing object detection on the prototyping platform.




  Figure 3. TI TMS320DM6446 Davinci development kit

Next, by using the presented research methods, the
foreground object with binary-value results are also
displayed in Fig.4. The result of ground truth, which is
segmented the objects manually from the original image
frame, is regarded as the perfect result. It can be found that
our result has the better object detection. In order to make a
fair comparison, we adopt [7] calculating similarity and
total error pixels method to assess these results of the
                                                                                           Figure 4. Foreground Object Detection Result
algorithms. Eq.(9) is used to get the total error pixel number
and Eq. (10) is used to evaluate similarity value.                                                                                                 Wu[2]
                                                                                                                 Total Error Pixels                Chien[5]
$$\text{total error pixels} = fn + fp \qquad (9)$$

$$\text{Similarity} = \frac{tp}{tp + fn + fp} \qquad (10)$$

where fp is the total number of false positives, fn is the total number of false negatives, and tp is the total number of true positives. Fig. 5 depicts the number of error pixels over a video sequence; the numbers of error pixels produced by our proposed method are lower than those of the other algorithms. Fig. 6 shows the similarity over the same video sequence. Our proposed approach achieves the highest similarity value, i.e., our results are the closest to the ground truth.

Figure 5. Error pixels by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method) versus frame number
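As a concrete illustration of Eqs. (9) and (10), both measures can be computed directly from a detected mask and a ground-truth mask. The Python sketch below assumes both masks are binary NumPy arrays of the same size; the function name is only illustrative.

```python
import numpy as np

def evaluate_mask(detected, ground_truth):
    """Return (total_error_pixels, similarity) per Eqs. (9)-(10)."""
    detected = detected.astype(bool)
    ground_truth = ground_truth.astype(bool)
    tp = np.count_nonzero(detected & ground_truth)    # correctly detected foreground
    fp = np.count_nonzero(detected & ~ground_truth)   # false alarms
    fn = np.count_nonzero(~detected & ground_truth)   # missed foreground
    total_error_pixels = fn + fp                      # Eq. (9)
    similarity = tp / float(tp + fn + fp)             # Eq. (10)
    return total_error_pixels, similarity
```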




Figure 6. Similarity by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method) versus frame number

                                        VI.      CONCLUSION
    In this paper, we propose an effective and robust multi-layer background modeling algorithm. Foreground object detection must cope with moving backgrounds, such as fluttering leaves and rain in outdoor scenes or rotating fans in indoor scenes. Therefore, we build the moving background into a multi-layer background model by calculating weight values and analyzing the characteristics of regional homogeneity. In this way, our approach is suitable for a variety of scenes. Finally, we present the foreground detection results in terms of similarity and total error pixels, and show the benefit of our algorithm with explicit data and graphs.



                                             REFERENCES
[1]  C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, 2000.
[2]  H. H. P. Wu, J. H. Chang, P. K. Weng, and Y. Y. Wu, "Improved Moving Object Segmentation by Multi-Resolution and Variable Thresholding," Optical Engineering, vol. 45, no. 11, 117003, 2006.
[3]  K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, "Real-Time Foreground-Background Segmentation using Codebook Model," Real-Time Imaging, pp. 172-185, 2005.
[4]  H. Wang and D. Suter, "A Consensus-Based Method for Tracking: Modelling Background Scenario and Foreground Appearance," Pattern Recognition, pp. 1091-1105, 2006.
[5]  W.-K. Chan and S.-Y. Chien, "Real-Time Memory-Efficient Video Object Segmentation in Dynamic Background with Multi-Background Registration Technique," International Workshop on Multimedia Signal Processing, pp. 219-222, 2002.
[6]  W.-K. Tsai, M.-H. Sheu, C.-L. Su, J.-J. Lin, and S.-Y. Tseng, "Image Object Detection and Tracking Implementation for Outdoor Scenes on an Embedded SoC Platform," International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 386-389, September 2009.
[7]  L. Maddalena and A. Petrosino, "A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications," IEEE Trans. on Image Processing, vol. 17, no. 7, July 2008.




CLEARER 3D ENVIRONMENT CONSTRUCTION USING IMPROVED DM BASED ON
     GAZE TECHNOLOGY APPLIED TO AUTONOMOUS LAND VEHICLES

                                 Kuei-Chang Yang (楊桂彰)1, Rong-Chin Lo (駱榮欽)2
    1 Dept. of Electronic Engineering & Graduate Institute of Computer and Communication Engineering,
                            National Taipei University of Technology, Taipei
    2 Dept. of Electronic Engineering & Graduate Institute of Computer and Communication Engineering,
                             National Taipei University of Technology, Taipei
                                    E-mail: t7418002@ntut.edu.tw



                      ABSTRACT

     In this paper, we propose a gaze approach that sets the binocular cameras at different baseline distances to obtain better resolution in three-dimensional (3D) environment construction. The method is capable of obtaining a more accurate distance to an object and a clearer environment construction, which can be applied to Autonomous Land Vehicle (ALV) navigation. In this study, the ALV is equipped with parallel binocular cameras to imitate human eyes and provide binocular stereo vision. Using the information of binocular stereo vision to build a disparity map (DM), the 3D environment can be reconstructed. Because the baseline of the binocular cameras is usually fixed, the DM, shown as an image, only has good resolution within a specific distance range; that is, only a partial region of the reconstructed 3D environment is clear, so it cannot provide a complete navigation environment. Therefore, this study proposes multiple baselines to obtain clearer DMs for the near, middle and far distances of the environment. Several experimental results, showing the feasibility of the proposed approach, are also included.

Keywords: binocular stereo vision; disparity map

                  1. INTRODUCTION

     In recent years, machine vision has become the most important sensing system for intelligent robots. The image captured from a camera carries a large amount of object information, including shape, color, shading, shadow, etc., unlike other commonly used sensors that can obtain only one kind of measurement, such as ultrasonic sensors [1], infrared sensors [2], or laser sensors [3]. In other words, the visual sensor can acquire a great deal of environmental information, but this information is mixed together; therefore, various image processing techniques are necessary to separate it and obtain meaningful information. A great deal of manpower and resources are devoted to binocular stereo vision [4] research in many countries. As applied to robots and ALVs, the advantage of binocular stereo vision is that it obtains the depth of the environment, and this depth can be used for obstacle avoidance, environment learning, and path planning. In such applications, the disparity is used by the vision system for image recognition and image-signal analysis. Besides requiring the two cameras to be set in parallel and fixed accurately, this disparity method still requires a high-speed computer to store and analyze images. Moreover, setting the binocular cameras of an ALV with a fixed baseline can only obtain a good DM of the environment images within a specific region.

In this paper, we propose an approach that sets the binocular cameras with different baselines to obtain the depths of the DM corresponding to different measuring distances. In the future, this method will be able to obtain the environment image from near to far range, which will help the ALV in path planning.

                  2. STEREO VISION

     In recent years, because the computing speed of computers has become much faster and their hardware performance has also improved, much research relating to computer vision has been proposed for image processing. A computer vision system with depth sensing ability is called a stereo vision system, and stereo vision is at the core of computer vision technologies. However, one camera can only obtain two-dimensional (2D) information of the environment image, which is unable to reconstruct 3D coordinates. To overcome this shortcoming of a single camera, in this study two cameras are used to calculate 3D coordinates. The details are described in the following sub-sections.




2.1. Projective Transform

     The projective transform model of one camera projects real objects or a scene onto the image plane. As shown in Fig. 1, assume that the coordinate of object P in the real world is (X, Y, Z) relative to the origin (0, 0, 0) at the camera center. After the transform, the coordinate of P' projected by P onto the image plane is (x, y, f) relative to the image origin (0, 0, f), where f is the distance from the camera center to the image plane. Using similar-triangle geometry to relate the actual object P and its projected point P' on the image plane, the relationship between the two points is as follows:

$$x = f\,\frac{X}{Z} \qquad (1)$$

$$y = f\,\frac{Y}{Z} \qquad (2)$$

     Therefore, even if P'(x, y, f) captured from the image plane is known, we still cannot calculate the depth Z of point P and determine its coordinate P(X, Y, Z) according to (1) and (2) unless we know one of X, Y (height), or Z (depth).

               Figure 1. Perspective projection of one camera.

2.2. Image Depth

     From the previous discussion, we know that it is impossible to accurately calculate the depth or height of an object or scene from the information of one camera, even if many conditions are known in advance. Therefore, several studies use the overlapping views of two [5] or more cameras to calculate the depth or height of an object or scene, as shown in Fig. 2.

    Figure 2. The relationship between depth and disparity for two cameras.

     Nowadays the cost of a camera has become very low; therefore, in this study, we chose two cameras fixed in parallel to solve the problem of depth and height. The usage of parallel cameras reduces the complexity of the correspondence problem. In Fig. 3, we easily derive Xl and Xr by using similar triangles, and we have:

$$X_l = \frac{Z\,x_l}{f} \qquad (3)$$

$$X_r = \frac{Z\,x_r}{f} \qquad (4)$$

Assume that the optical axes of the two cameras are parallel to each other, where b is the distance between the two camera centers and b = Xl − Xr. C and G are the projected points of P on the left and right image planes, respectively. The disparity d is defined as d = xl − xr. From (3) and (4), we have:

$$b = X_l - X_r = \frac{Z\,(x_l - x_r)}{f} = \frac{Z\,d}{f} \qquad (5)$$

Therefore, the image depth Z can be given by:

$$Z = \frac{f\,b}{d} \qquad (6)$$

          Figure 3. Projection transform of two cameras and disparity.

As shown in Fig. 4, the real-world height of an object can be derived from the height of the object image based on the assumption of a pinhole camera and the image-forming geometry:

$$Y = \frac{y\,Z}{f} \qquad (7)$$

                        Figure 4. Image-forming geometry.
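To make the geometry of Eqs. (1)–(7) concrete, depth and height recovery reduce to two one-line formulas. The Python sketch below is only an illustration of those relations, assuming f is the focal length in pixels, b the baseline in centimeters, d the disparity in pixels, and y the image height of the object in pixels; the function names are illustrative.

```python
def depth_from_disparity(f_pixels, baseline_cm, disparity_pixels):
    """Eq. (6): Z = f * b / d, with Z in the same unit as the baseline."""
    return f_pixels * baseline_cm / float(disparity_pixels)

def object_height(f_pixels, image_height_pixels, depth_cm):
    """Eq. (7): Y = y * Z / f, the real-world height of the object."""
    return image_height_pixels * depth_cm / float(f_pixels)

# Example with the calibrated focal length reported later in the paper
# (f = 874 pixels) and a 20 cm baseline: a disparity of 44 pixels maps
# to a depth of roughly 4 m.
if __name__ == "__main__":
    z = depth_from_disparity(874, 20, 44)
    print(round(z), "cm")   # ~397 cm
```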




Due to the rapid correspondence between the two cameras, the method is highly efficient in calculating the depth and height of objects and is suitable for ALV navigation. The method finds the disparity d from two corresponding points (for instance, C and G in Fig. 3) in the left and right images, respectively. Here, the accuracy of the two corresponding points is very important. Regarding the disparity value d as an image intensity shown by gray values (0 to 255), the whole set of disparities forms an image, called the disparity map (DM) or DM image. The DM construction proposed by Birchfield and Tomasi [6] is employed in this paper. The advantage of this construction method is that it quickly obtains all depths, including the depths of discontinuous, occluded, and mismatched points; the disadvantage is that the obtained disparity map lacks accuracy. Fig. 5 shows a disparity map generated from left and right images.

    Figure 5. The disparity map: (a) left and right images, (b) disparity map.

                 3. PROPOSED METHOD

      From (6) [7], we know that when an object is far from the two cameras, its disparity value becomes small, and vice versa. In Fig. 6, there is obviously a nonlinear relationship between these two terms. The disadvantage of the DM is that the farther the objects are from the two cameras, the smaller the disparity values become, which makes separating the object from the background difficult. Therefore, we need to find the suitable baseline b that yields a clearer DM for each depth region of the two cameras. The processing steps are described in the following sub-sections.

Region segmentation

     We partition the scene into three levels, near, middle and far, and obtain the best DM of the depth in each region. In this paper, we define the near region as the distance from 0 m to 5 m, the middle region as from 5 m to 10 m, and the far region as over 10 m.

Acquisition of the best baseline b

     Acquiring the best baseline b means finding the appropriate camera baseline b on the basis of the different depths of the region. Table I and Table II show the relationship between the depth Z and the two-camera baseline b. We set d = 30 as the threshold value dth, and regard regions with d less than dth as background. Therefore, when the depth Z is known, the disparity d can be obtained from Table I and Table II for the different baselines, and we then find the most appropriate value of b that makes the value of d closest to or greater than dth. For example, 20 cm is the best b for the short-range region (0 m ~ 5 m), and 40 cm for the medium-range region (5 m ~ 10 m).

Calculation of the depth and height

     The cameras are calibrated [8] in advance, and we then obtain the focal length f = 874 pixels. Substituting the obtained d for an object into (6), we find the distance Z between the camera and the object, and Z is then substituted into (7) to calculate the object height Y [9], which can usually be used to decide whether the object is an obstacle.

      Figure 6. The relationship between distance and disparity.

   TABLE I.  DISPARITY VALUES d (PIXELS) VS. DEPTH Z=1M~5M AND BASELINE b=10CM~150CM.

   b (cm) \ Z (m)      1      2      3      4      5
        10            87     44     29     22     17
        20           175     87     58     44    35*
        30           262    131     87     66     52
        40           350    175    117     87     70
        50           437    219    146    109     87
        60           524    262    175    131    105
        70           612    306    204    153    122
        80           699    350    233    175    140
        90           787    393    262    197    157
       100           874    437    291    219    175
       110           961    481    320    240    192
       120          1049    524    350    262    210
       130          1136    568    379    284    227
       140          1224    612    408    306    245
       150          1311    656    437    328    262

   *: The best disparity for short-range region.
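The baseline selection rule described above (choose the smallest b whose predicted disparity at the far edge of the region still reaches dth = 30 pixels) follows directly from Eq. (6). The Python fragment below is a minimal sketch of that rule under the stated assumptions (f = 874 pixels, candidate baselines of 10–150 cm); the function name is illustrative.

```python
def best_baseline(depth_cm, f_pixels=874, d_th=30,
                  candidates_cm=range(10, 151, 10)):
    """Pick the smallest baseline whose disparity at `depth_cm`
    reaches the threshold d_th (cf. Tables I and II)."""
    for b in candidates_cm:
        d = f_pixels * b / float(depth_cm)   # Eq. (6) rearranged: d = f*b/Z
        if d >= d_th:
            return b
    return max(candidates_cm)

# 500 cm (far edge of the near region)    -> 20 cm baseline
# 1000 cm (far edge of the middle region) -> 40 cm baseline
print(best_baseline(500), best_baseline(1000))
```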




   TABLE II.  DISPARITY VALUES d (PIXELS) VS. DEPTH Z=6M~10M AND BASELINE b=10CM~150CM.

   b (cm) \ Z (m)      6         7          8        9          10
       10          15          12         11       10           9
       20          29          25         22       19          17
       30          44          37         33       29          26
       40          58          50         44       39         35*
       50          73          62         55       49          44
       60          87          75         66       58          52
       70          102        87         76       68          61
       80          117        100         87       78          70
       90          131        112         98       87          79
      100          146        125        109      97          87
      110          160        137        120      107          96
      120          175        150        131      117         105
      130          189        162        142      126         114
      140          204        175        153      136         122
      150          219        187        164      146         131
 *: The best disparity for medium-range region.
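The paper relies on the Birchfield–Tomasi construction for the DM itself. For readers who want a quick, runnable stand-in, the OpenCV block-matching stereo matcher gives a comparable dense disparity image; this is only a substitute for experimentation, not the implementation used in the paper, and the file names are placeholders.

```python
import cv2
import numpy as np

# Hypothetical file names; any rectified left/right pair will do.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo; numDisparities must be a multiple of 16.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # pixels

# Scale to 0-255 gray values so the DM can be viewed as an image,
# as described in the text.
dm = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("disparity_map.png", dm)
```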



               4. EXPERIMENTAL RESULTS

      The proposed methods have been implemented and tested on a 2.8 GHz Pentium IV PC. Fig. 7 shows the two cameras mounted on a sliding rail so that they can be pulled apart to change the baseline distance. From Section 3, we know that the best b for the short-range region 0 m ~ 5 m is 20 cm, and 40 cm for the medium-range region 5 m ~ 10 m. Therefore, we set two persons standing at distances of 4 m and 8 m from the two cameras, with the two-camera baseline b = 20 cm, as shown in Fig. 8. Because the person standing at 4 m is in the short-range region, he can be seen clearly. However, the other person, standing at 8 m, is in the medium-range region and is difficult to separate from the background.

             Figure 7. Experiment platform of stereo vision.

    Figure 8. The disparity map: (a) left and right images, (b) disparity map (Z = 400 cm and 800 cm, b = 20 cm).

     Comparing Fig. 9 and Fig. 10, where the distance from the person to the cameras is 8 m (medium-range region) and the baseline is changed from b = 20 cm to b = 40 cm, the results show that with b = 40 cm the person (object) becomes clearer, as shown in Fig. 10.

    Figure 9. The disparity map: (a) left and right images, (b) disparity map (Z = 800 cm, b = 20 cm).








   Figure 10. The disparity map (a) left image and right image (b)
                disparity map (Z=800cm,b=40cm).



                      5. CONCLUSION

     From the experimental results, we have found that a suitable baseline for the two cameras helps us obtain a better disparity. However, if the object is far from the two cameras, its disparity value becomes small and close to that of the background, so the object is not easily detected. Using the proposed method of changing the baseline of the two cameras, the object becomes clearer and easier to detect, and more 3D object information is obtained. The results can be used in many applications, for example ALV navigation. In the future, we plan to remove the horizontal-stripe noise inside the DM so that the DM can be displayed better.

                        REFERENCES

[1] A. Elfes, "Using occupancy grids for mobile robot perception and navigation," Computer Magazine, pp. 46-57, June 1989.
[2] J. Hancock, M. Hebert, and C. Thorpe, "Laser intensity-based obstacle detection," 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 3, pp. 1541-1546, 1998.
[3] E. Elkonyaly, F. Areed, Y. Enab, and F. Zada, "Range sensory-based navigation in unknown terrains," in Proc. SPIE, Vol. 2591, pp. 76-85.
[4] 陳禹旗, Detection of Roads and Obstacles Using 3D Vision Information for Outdoor Autonomous Land Vehicle Navigation with Artificial Intelligence Strategies, Master's thesis, Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei, 2003.
[5] 張煜青, A Study of Outdoor Autonomous Land Vehicle Navigation Using Binocular Stereo Computer Vision with Artificial Intelligence Strategies, Master's thesis, Graduate Institute of Automation Technology, National Taipei University of Technology, Taipei, 2003.
[6] S. Birchfield and C. Tomasi, "Depth Discontinuities by Pixel-to-Pixel Stereo," International Journal of Computer Vision, pp. 269-293, Aug 1999.
[7] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Press, 2008.
[8] http://www.vision.caltech.edu/bouguetj/calib_doc/
[9] L. Zhao and C. Thorpe, "Stereo- and Neural Network-Based Pedestrian Detection," IEEE Trans. Intelligent Transportation Systems, Vol. 3, No. 3, pp. 148-154, Sep 2000.




A MULTI-LAYER GMM BASED ON COLOR-TEXTURE COMBINATION
            FEATURE FOR MOVING OBJECT DETECTION


          Tai-Hwei Hwang (黃泰惠), Chuang-Hsien Huang (黃鐘賢), Wen-Hao Wang (王文豪)

         Advanced Technology Center, Information and Communications Research Laboratories,
            Industrial Technology Research Institute, Chutung, HsinChu, Taiwan ROC 310
                         E-mail: {hthwei, DavidCHHuang, devin}@itri.org.tw



                        ABSTRACT                                          background scene. The background scene contains the
                                                                          images of static or quasi-periodically dynamic objects,
Foreground detection generally plays an important role in the             for instance, sea tides, a fountain, or an escalator. The
intelligent video surveillance systems. The detection is based            representation of background scene is basically a
on the characteristic similarity of pixels between the input              collection of statistics of pixel-wise features such as
image and the background scene. To improve the
                                                                          color intensities or spatial textures. The color feature
characteristic representation of pixel, a color and texture
combination scheme for background scene modeling is
                                                                          can be the RGB components or other features derived
proposed in this paper. The color-texture feature is applied              from the RGB, such as HSI, or YUV expression. The
into a four-layer structured GMM, which can classify a pixel              texture accounts for information of intensity variation in
into one of states of background, moving foreground, static               a small region centered by the input pixel, which can be
foreground and shadow. The proposed method is evaluated                   computed by the conventional edge or gradient
with three in-door videos and the performance is verified by              extraction algorithm, local binary pattern [1], etc. The
pixel detection accuracy, false positive and false negative rate          statistical background models of pixel color and textures
based on ground truth data. The experimental results                      are respectively efficient when the moving objects are
demonstrate it can eliminate shadow significantly but without
                                                                          with different colors from background objects and are
many apertures in foreground object.
                                                                          full of textures for either background or foreground
                                                                          moving objects. For example, it is hard to detect a
                  1.   INTRODUCTION
                                                                          walking man in green from a green bush using the color
                                                                          feature only. In this case, since the bush is full of
Wide range deployment of video surveillance system is
                                                                          different textures from the green cloth, the man can be
getting more and more importance to security
                                                                          easily detected by background subtraction with texture
maintenance in a modern city as the criminal issue is
                                                                          feature. However, this will not be the case when only
strongly concerned by the public today. However,
                                                                          using the texture difference to detect the man walking in
conventional video surveillance systems need heavy
                                                                          the front of flat white wall because of the lack of texture
human monitoring and attention. The more cameras
                                                                          for both the cloth and the wall. Therefore, some studies
deployed, the more inspection personnel employed. In
                                                                          are conducted to combine the color and the texture
addition, attention of inspection personnel is decreased
                                                                          information together as a pixel representation for
over time, resulting in lower effectiveness at recognizing
                                                                          background scene modeling [2][3][4]. In addition to the
events while monitoring real-time surveillance videos.
                                                                          different modeling abilities of color and texture, texture
To minimize the involved man power, research in the
                                                                          feature is much more robust than color under the
field of intelligent video surveillance is blooming in
                                                                          illumination change and is less sensitive to slight cast
recent years.
                                                                          shadow of moving object.
Among the studies, background subtraction is a
                                                                          Though the combination of color and texture can
fundamental element and is commonly used for moving
                                                                          provide a better modeling ability and robustness for
object detection or human behavior analysis in the
                                                                          background scene under illumination change, it is not
intelligent visual surveillance systems. The basic idea
                                                                          enough to eliminate a slightly dark cast shadow or to
behind the background subtraction is to build a
                                                                          keep an invariant scene under stronger illumination
background scene representation so that moving objects
                                                                          change or automatic white balance of camera. To
in the monitored scene can be detected by a distance
                                                                          improve the robustness of background modeling further,
comparison between the input image and the




a simple but efficient way to eliminate shadows is to               waving leaves. In this study, we propose a four-layer
filter pixels casted by shadows according to the                    scene model which classes each pixel into four states, i.e.
chromatic and illuminative changes. In the illuminative             background, static foreground, moving foreground, and
component the value of the shadow pixel is lower than               shadow. We improve Gallego’s work by modeling
that in background model; while in the chromatic                    background with mixture Gaussians of color and texture
component, it shows slightly different from that in the             combined feature and design related mechanisms for
background model. Therefore, shadows can be detected                state transition. In addition, we also bring the concept of
by using thresholding technique to obtain the pixels                shadow learning, based on the work [7], into the
which are satisfied with these physical characteristics.            proposed scene model. The structure and the
Cucchiara et al. [5] transformed video frames from RGB              mechanisms of our background scene model are
space to Hue-Saturation-Intensity (HSI) space to                    described in section 2. Section 3 reveals applicable
highlight these physical characteristics. In the work of            scenarios and experimental results. Section 4 presents
Shan et al. [6], they evaluated the performance of                  the conclusions and our future works
thresholding-based shadow detection approach on
different color spaces such as HSI, YCrCb, c1c2c3,
L*a*b. To sum up, conventional approaches are based                         2. MULTI-LAYER SCENE MODEL
on transforming the RGB features to other color
domains or features, which have better characteristics to           Figure 1 illustrates the flowchart of the multi-layer scene
represent shadows. But no matter what kind of color                 model. In the first stage, the color and texture
spaces or features is adopted, users usually need to set            representation have to be obtained for all pixels in the
one or more threshold values to filter shadows out.                 input image. Four layers which represent the states of
                                                                    background, shadow, static foreground and moving
   Recently, Nicolas et al. [7] proposed an online-                 foreground, are modeled separately. For each pixel i
learning approach named Gaussian Mixture Shadow                     belonging to the current frame, if it is fit to the
Model (GMSM) for shadow detection. The GMSM                         background model, the background model is updated
utilities two Gaussian mixture models (GMM) [8] to                  and the pixel is then labeled as the state of background.
model the static background and casting shadows,                    Otherwise, the pixel is passed to the shadow layer.
respectively. Afterward, Tanaka et al. [9] used the same
idea but modeled the distributions of background and                   In the shadow layer, i is examined whether it is
shadows non-parametrically by Parzon windows. It is                 satisfied to be a shadow candidate by a weak shadow
faster than GMSM but costs more storage space. Both of              classifier, which was designed according to the shadow
them are based on statistical analysis and have better              physical characteristics such as the mentioned chromatic
discriminative power on shadows, especially when the                and illuminative changes. If i is determined as a shadow
color of moving object shows similar attribute to the               candidate, the shadow layer is updated by the pixel’s
pixels covered by shadows.                                          color features. If i shows strong fitness to the dominant
                                                                    Gaussian of the updated shadow model, its state is then
   On the other hand, maintenance of static foreground              labeled as shadow. For the pixel which is not satisfied to
objects is also an important issue for background                   being a shadow candidate or does not fit to the shadow
modeling. The static foreground objects are those                   model, we pass it to the static foreground layer.
objects that, after entering into the surveillance scene,
reach a position and then stop their motion. Examples               Consequentially, if i dose not fit the static foreground
are such as cars waiting for traffic lights, browsing               model if it exists, i is passed to the moving foreground
people in shops, or abandoned luggage in train stations.            layer. When i fits the moving foreground model, it is the
In traditional GMM-based background models [8], the                 circumstance that the state of the moving object is from
static foreground objects are usually absorbed into                 moving to staying at the current position. As a result, we
background after a given time period, which usually                 update the moving foreground model by i’s color
proportional to the learning rate of the background                 features. A counter named CountMF corresponding to the
model. The current state-of-the-art technique to                    moving background model is increased as well. When
distinguish the static foreground objects from static               CountMF reaches another user-defined threshold T2, we
background and moving objects is to maintain a multi-               replace the static foreground model by the moving
layer model representing background, moving                         foreground model and CountSF is set to zero. Otherwise,
foreground and static foreground separately [10,11].                if i does not fit the moving foreground model, we use it
                                                                    to reinitialize the moving foreground model, i.e. set
In the work of Gallego et al. [11], they proposed a three-          CountMF to zero, and then update the background model
layer model which comprises moving foreground, static               by the past moving foreground model. The reason of
foreground and background layer. However, they                      using the moving foreground model to update the
modeled background by using a single Gaussian, which                background is to allow the background model having the
can not cope with the multi-mode background such as                 ability to deal with the multi-mode background problem




such as waving leaves, ocean waves or traffic lights.                   The background model is first initialized with a set of
Details of feature extraction stage, the background and              training data. For example, the training data could be
the shadow layers are described in the following                     collected from the first L frames of the testing video.
subsections.                                                         After that, each pixel at frame t can be determined
                                                                     whether it matches to the m-th Gaussian Nm by satisfying
2.1. Feature extraction stage                                        the following inequality for all components {xC ,i , xT ,i } ∈ x :

The color-texture feature is a vector including
                                                                                    ( xC , i − µC ,i , m ) 2                     ( xT ,i − µT , i, m ) 2
                                                                             dC                                           dT
                                                                         λ                                         1− λ
                                                                                                B                                           B
components of RGB and local difference pattern (LDP)
as the texture in a local region. The LDP is an edge-like               dC   ∑
                                                                             i =1     k × (σ C ,i , m ) 2
                                                                                             B
                                                                                                               +
                                                                                                                    dT    ∑
                                                                                                                          i =1     k × (σT , i , m ) 2
                                                                                                                                         B
                                                                                                                                                           <1   (3)

feature which is consisted of intensity differences
between predefined pixel pairs. Each component of LDP                where dC and dT denote vector dimension of color and
is computed by                                                       texture, respectively, λ is the color-texture combination
                                                                     weight, k is a threshold factor and we set it to three
                LDPn(C)=I(Pn)-I(C),                     (1)
                                                                     according to the three-sigma rule (a.k.a. 68-95-99.7 rule)
where C and Pn represent the pixel and thereof neighbor              of normal distribution. The weights of Gaussian
pixel n, respectively, and I(C) represents the gray level            distribution are sorted in decreasing order. Therefore if
intensity of pixel C. The gray level intensity can be                the pixel matches to the first nB distributions, where nB is
computed by the average of RGB components. Four                      obtained by Eq. (4), it is then classified as the
types of pattern defining the neighbor pixels are                    background [13].
depicted in Figure 2 and are separately adopted to
compare their performance of moving object detection                                                      b             
experimentally.                                                                                       b 
                                                                                                                   ∑
                                                                                           n B = arg min  π m > 1 − p f 
                                                                                                                         
                                                                                                                                                                (4)
                                                                                                          m=1           

                                                                     where pf is a measure of the maximum proportion of the
                                                                     data that belong to foreground objects without
                                                                     influencing the background model.

                                                                        When a pixel fits the background model, the
                                                                     background model is updated in order to adapt it to
Fig. 2. Four types of pattern defining neighbor pixels               progressive image variations. The update for each pixel
for computation of LDP                                               is as follows:
2.2. Background Layer
                                                                                           π m ← π m + α (om − π m ) − αcL
                                                                                             B     B             B
                                                                                                                                                                (5)
The GMM background subtraction approach presented
by Stauffer and Grimson [8] is a widely used approach                                      µ m ← µ m + om (α / π m )(x − µ m )
                                                                                             B     B             B         B
                                                                                                                                                                (6)
for extracting moving objects. Basically, it uses couples
of Gaussian distribution to model the reasonable                        (σ m ) 2 ← (σ m ) 2 + om (α / π m )((x − µ m )T (x − µ m ) − (σ m ) 2 )
                                                                           B          B                 B          B           B        B
                                                                                                                                                                (7)
variation of the background pixels. Therefore, an
unclassified pixel will be considered as foreground if the           where α=1/L is a learning rate and cL is a constant value
variation is larger than a threshold. We consider non-               (set to 0.01 herein [14]). The ownership om is set to 1 for
correlated feature components and model the                          the matched Gaussian, and set to 0 for the others.
background distribution with a mixture of M Gaussian
distributions for each pixel of input image:                         2.3. Shadow Layer
                        M
              p (x) =   ∑π
                        m=1
                              B            B      B
                              m N m ( x; µ m , Iσ m )
                                                        (2)          The problem of color space selection for shadow
                                                                     detection has been discussed in [6][12]. Their
                                                                     experimental results revealed that performing cast
                                                     B
where x represents the feature vector of a pixel, µ m is             shadow detection in CIE L*u*v, YUV or HSV is more
                      B
                                                                     efficient than in RGB color space. Considering that the
the estimated mean, σ m is the variance, and I represents
                                                                     RGB-to-CIE L*u*v transform is nonlinear and the Hue
the identity matrix to keep the covariance matrix                    domain is circular statistics in HSV space, YUV color
isotropic for computational efficiency. The estimated                space shows more computing efficiency due to its
mixing weights, denoted by π m , are non-negative and
                             B
                                                                     linearity of transforming from RGB space. In addition,
they add up to one.                                                  YUV is also for interfacing with analogy and digital
                                                                     television or photographic equipment. As a result, YUV




In addition, YUV is convenient for interfacing with analog and digital television or photographic equipment. As a result, YUV color features were adopted in this study, i.e. the color components x mentioned in the previous subsection. It is worth reminding that Y stands for the illuminative component and U and V are the chromatic components.

A pixel which does not fit the background model is then passed to the shadow layer. First, it is examined whether it qualifies as a shadow candidate by a weak shadow classifier according to the following rules [7]:

    r_min < x^Y / µ_B^Y < r_max                         (8)
    | x^U − µ_B^U | < Λ^U                               (9)
    | x^V − µ_B^V | < Λ^V                               (10)

where x^Y, x^U and x^V are the YUV components of the pixel and µ_B^Y, µ_B^U and µ_B^V are the corresponding means of the background model, respectively. The parameters r_min, r_max, Λ^U and Λ^V are user-defined thresholding values. Users just need to set them roughly through a friendly graphical user interface (GUI) because the more precise shadow classification will further be made by the following shadow GMM.

Similar to the background layer, the shadow layer is also modeled by a GMM, but only the color features of shadow candidates are fed in. For initialization, r_min, r_max, Λ^U and Λ^V are used to derive the first Gaussian of each color component, and its weight is set to one. The corresponding means and variances of the first Gaussian are obtained by the following equations:

    µ_S^Y = µ_B^Y (r_max + r_min) / 2                   (11)
    σ_S^Y = (µ_B^Y r_max − µ_S^Y) / 3                   (12)
    µ_S^U = µ_B^U                                       (13)
    σ_S^U = Λ^U / 3                                     (14)

where the subscripts B and S refer to the background and shadow models, and µ_S^V and σ_S^V are calculated in the same way as Eq. (13) and Eq. (14). If a feature vector x is not matched to any Gaussian distribution, the Gaussian which has the smallest weight is replaced with µ = x and σ = [σ_0 σ_0 σ_0]^T, where σ_0 is an initial variance.
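For illustration, the sketch below applies the weak shadow test of Eqs. (8)-(10) and the shadow-layer initialization of Eqs. (11)-(14) to a single YUV pixel; the default threshold values are placeholders rather than the values used in this work.

    import numpy as np

    def is_shadow_candidate(x, mu_B, r_min=0.4, r_max=0.9, lam_U=10.0, lam_V=10.0):
        """Weak shadow classifier of Eqs. (8)-(10).
        x, mu_B: YUV vectors [Y, U, V] of the pixel and of the matched
        background Gaussian mean. The threshold values here are placeholders."""
        ratio_ok = r_min < x[0] / mu_B[0] < r_max        # Eq. (8): darker, but not too dark
        u_ok = abs(x[1] - mu_B[1]) < lam_U               # Eq. (9): chroma U barely changes
        v_ok = abs(x[2] - mu_B[2]) < lam_V               # Eq. (10): chroma V barely changes
        return ratio_ok and u_ok and v_ok

    def init_shadow_gaussian(mu_B, r_min, r_max, lam_U, lam_V):
        """First Gaussian of the shadow layer, Eqs. (11)-(14)."""
        mu_S = np.empty(3)
        sigma_S = np.empty(3)
        mu_S[0] = mu_B[0] * (r_max + r_min) / 2.0        # Eq. (11)
        sigma_S[0] = (mu_B[0] * r_max - mu_S[0]) / 3.0   # Eq. (12)
        mu_S[1], mu_S[2] = mu_B[1], mu_B[2]              # Eq. (13) and its V analogue
        sigma_S[1] = lam_U / 3.0                         # Eq. (14)
        sigma_S[2] = lam_V / 3.0                         # V component treated the same way
        return mu_S, sigma_S, 1.0                        # weight of the first Gaussian is one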
3. EXPERIMENTS

3.1. Experimental setting

Four videos are used in the experiments. Video 1 is collected from a roadside camera of a real surveillance system operated by the police department of Taichung County (PDTC), and the others are recorded indoors. Video 2 is collected by our colleagues at a porch with a glossy wall that reflects objects slightly; video 3 and video 4 are selected from the video dataset at http://cvrr.ucsd.edu/aton/shadow, entitled intelligentroom_raw and Laboratory_raw, respectively. The image size of these videos is 320x240 pixels per frame. The color-texture representation of a pixel is the vector concatenation of RGB and LDP. The second pattern in Figure 2 is adopted for the computation of LDP in the experiments. The effect of shadow elimination is shown not only by background masks but also by the pixel detection accuracy rate (Acc.), false positive rate (FPR) and false negative rate (FNR) when ground truth data is available. These quantitative measures are defined as follows:

    Acc. = #TP / (#TP + #FP + #FN)                      (15)
    FPR  = #FP / (#TP + #FP + #FN)                      (16)
    FNR  = #FN / (#TP + #FP + #FN)                      (17)

where TP is short for true positive and #TP means the number of TP pixels in a frame. In general, false positives result from moving cast shadows and false negatives are apertures inside the foreground regions. In the following figures of background masks, the pixels depicted in black, red, white and green represent the background region, false negatives, moving foreground and shadows, respectively. The experiments are performed on a personal computer with a Pentium 4 3.0-GHz CPU and 2 GB of RAM. The processing frame rate is about 15 frames/second.
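A minimal sketch of Eqs. (15)-(17), assuming the detection result and the ground truth are given as boolean foreground masks of the same size:

    import numpy as np

    def detection_rates(detected, truth):
        """Pixel-wise Acc., FPR and FNR of Eqs. (15)-(17) for one frame.
        detected, truth: boolean foreground masks of identical shape."""
        tp = np.count_nonzero(detected & truth)          # correctly detected foreground
        fp = np.count_nonzero(detected & ~truth)         # e.g. cast shadows kept as foreground
        fn = np.count_nonzero(~detected & truth)         # apertures inside true foreground
        denom = tp + fp + fn
        if denom == 0:
            return 1.0, 0.0, 0.0                         # empty frame: nothing to miss or over-detect
        return tp / denom, fp / denom, fn / denom        # Acc., FPR, FNR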
3.2. Effect of color-texture combination

To check the effectiveness of the color-texture combination feature, an experiment of background subtraction using this feature but with only a single background-layer GMM is conducted. The result of video 1 is demonstrated in Figure 3. The combination weights λ are set to 1, 0.3 and 0 for the experiments in columns 2, 3 and 4 of Figure 3, respectively. When λ = 1, i.e., only the color feature is effectively used, there are significant false positives caused by shadows and camera brightness control in the background masks of column 2. When λ = 0, i.e., only the texture feature is effectively used, most of the false positives disappear but many apertures show up in the foreground regions of column 4 because of the lack of texture in both the road scene (background) and most of the surface of the car (foreground). When λ = 0.3, i.e., the combined color-texture feature is used, the number and size of the apertures in the results of column 3 become smaller than in the results of column 4.

3.3. Results of using multi-layer GMM

The experimental results of using the multi-layer GMM on videos 2, 3 and 4 are demonstrated in Figures 4, 5 and 6, respectively. The results in columns 2, 3 and 4 of each figure are obtained by using the single-layered RGB, multi-layered RGB and multi-layered RGB+LDP background models, respectively. Detection rates, including the detection accuracy, false positive rate and false negative rate of pixels, of videos 2 and 3 are computed based on ground-truth data and are printed on each frame. In addition, the average detection rates are tabulated for each method in Tables 1 and 2. As shown in these figures and tables, the multi-layer GMM with the RGB+LDP feature significantly outperforms the method without the LDP.

                         Acc. (%)   FPR (%)   FNR (%)
RGB only                  63.89      31.53      4.58
RGB+shadow layer          70.05       4.91     25.04
RGB+LDP+shadow layer      80.27      14.43      5.30
Table 1. Average detection rates of moving objects in video 2.

                         Acc. (%)   FPR (%)   FNR (%)
RGB only                  45.19      53.94      0.87
RGB+shadow layer          76.85       1.21     21.94
RGB+LDP+shadow layer      82.89      15.87      1.25
Table 2. Average detection rates of moving objects in video 3.

5. CONCLUSION

This study presents a multi-layer scene model for applications of video surveillance. The proposed scene model uses an RGB+LDP feature to represent each pixel and classifies each pixel into four different states comprising background, moving foreground, static foreground and shadow. As shown in the experimental results, both the modeling ability and the illumination invariance are significantly improved by including the texture information.

ACKNOWLEDGEMENT

This paper is a partial result of project 9365C51100 conducted by ITRI under the sponsorship of the Ministry of Economic Affairs, Taiwan.

REFERENCES

[1] M. Heikkilä and M. Pietikäinen, "A texture-based method for modeling the background and detecting moving objects", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 4, pp. 657-662, April 2006.

[2] O. Javed, K. Shafique and M. Shah, "A hierarchical approach to robust background subtraction using color and gradient information", In Proc. of IEEE Workshop on Motion and Video Computing, 2002.

[3] K. Yokoi, "Illumination-robust Change Detection Using Texture Based Features", In Proc. of IAPR Conference on Machine Vision Applications, 2007.

[4] J. Yao and J. Odobez, "Multi-Layer Background Subtraction Based on Color and Texture", In Proc. of IEEE CVPR, 2007.

[5] R. Cucchiara, C. Grana, M. Piccardi, A. Prati and S. Sirotti, "Improving Shadow Suppression in Moving Object Detection with HSV Color Information", In Proc. of IEEE Intelligent Transportation Systems Conference, pp. 334-339, 2001.

[6] Y. Shan, F. Yang and R. Wang, "Color Space Selection for Moving Shadow Elimination", In Proc. of 4th International Conference on Image and Graphics, pp. 496-501, 2007.

[7] N. Martel-Brisson and A. Zaccarin, "Learning and Removing Cast Shadows through a Multidistribution Approach", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 7, pp. 1133-1146, 2007.

[8] C. Stauffer and W. E. L. Grimson, "Adaptive Background Mixture Models for Real-time Tracking", In Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 246-252, 1999.

[9] T. Tanaka, A. Shimada, D. Arita and R. Taniguchi, "Non-parametric Background and Shadow Modeling for Object Detection", Lecture Notes in Computer Science, No. 4843, pp. 159-168, 2007.

[10] E. Herrero-Jaraba, C. Orrite-Urunuela and J. Senar, "Detected Motion Classification with a Double-background and a Neighborhood-based Difference", Pattern Recognition Letters, Vol. 24, pp. 2079-2092, 2003.

[11] J. Gallego, M. Pardas and J.-L. Landabaso, "Segmentation and Tracking of Static and Moving Objects in Video Surveillance Scenarios", In Proc. of IEEE International Conference on Image Processing, pp. 2716-2719, 2008.

[12] C. Benedek and T. Sziranyi, "Study on Color Space Selection for Detecting Cast Shadows in Video Surveillance", International Journal of Imaging Systems and Technology, Vol. 17, pp. 190-201, 2007.
[13] M. Izadi and P. Saeedi, "Robust Region-based Background Subtraction and Shadow Removing using Color and Gradient Information", In Proc. of International Conference on Pattern Recognition, pp. 1-5, 2008.

[14] Z. Zivkovic and F. van der Heijden, "Recursive Unsupervised Learning of Finite Mixture Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 7, pp. 773-780, 2006.
Fig. 1. Flowchart of the proposed multi-layer scene model.
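The per-pixel decision cascade of Fig. 1 can be paraphrased as the sketch below; the argument and action names are illustrative only, and the transfer conditions follow the flowchart loosely rather than reproducing the exact implementation.

    def classify_pixel(fits_background, shadow_candidate, fits_shadow,
                       fits_static_fg, fits_moving_fg, sf_count, mf_count, T1, T2):
        """Decision cascade paraphrasing Fig. 1. The boolean arguments are the
        outcomes of matching the pixel's color-texture feature against each
        layer's GMM (plus the weak shadow test); sf_count and mf_count are the
        dwell counters of the static and moving foreground layers. Returns the
        pixel state, the model action to take, and the updated counters."""
        if fits_background:
            return "background", "update background model", sf_count, mf_count
        if shadow_candidate and fits_shadow:
            return "shadow", "update shadow model", sf_count, mf_count
        if fits_static_fg:
            sf_count += 1
            action = ("transfer static foreground to background"  # dwelled longer than T1
                      if sf_count > T1 else "update static foreground model")
            return "static foreground", action, sf_count, mf_count
        if fits_moving_fg:
            mf_count += 1
            if mf_count > T2:                                      # stopped moving long enough
                return "static foreground", "transfer moving to static foreground", 0, 0
            return "moving foreground", "update moving foreground model", sf_count, mf_count
        return "moving foreground", "reinitialize moving foreground model", sf_count, 0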
Fig. 3. Results of background subtraction controlled by combination weight of color and texture




                        Fig. 4. Foreground detection results of video 2.




Fig. 5. Foreground detection results of video 3, the intelligentroom_raw.




 Fig. 6. Foreground detection results of video 4, the Laboratory_raw.




Adaptive Traffic Scene Analysis by Using Implicit Shape Model


                                       Kai-Kai Hsu, Po-Chyi Su and Kai-Yi Cheng
                                 Dept. of Computer Science and Information Engineering
                                               National Central University
                                                    Jhongli, Taiwan
                                            Email: pochyisu@csie.ncu.edu.tw


   Abstract—This research presents a framework for analyzing the traffic information in the surveillance videos from static roadside cameras to assist in resolving the vehicle occlusion problem for more accurate traffic flow estimation and vehicle classification. The proposed scheme consists of two main parts. The first part is a model training mechanism, in which the traffic and vehicle information will be collected and their statistics are employed to automatically establish the model of the scene and the implicit shape model of vehicles. The second part adopts the flexibly trained models for vehicle recognition when possible occlusions of vehicles are detected. Experimental results show the feasibility of the proposed scheme.

   Keywords-Vehicle; traffic surveillance; occlusion; SIFT

                     I. INTRODUCTION

   Developing Intelligent Transportation Systems (ITS) has been a major research effort in recent years. Through the integration of advanced computing facilities, electronics, communication and sensor technologies, ITS can provide real-time information to help maintain the traffic order or to ensure the safety of pedestrians and drivers. As more and more surveillance cameras are deployed along local roads and highways, the visual information provided by these surveillance videos becomes an important part of ITS. The traffic information obtained by the vision-based approach can assist traffic flow control, vehicle counting and categorization, etc. In addition, emergent traffic events may be detected right after they happen by advanced visual processing so that the corresponding processes can be applied in a more active way.

   Vehicle detection/classification by the vision-based approach is a challenging issue and various methods have been proposed in recent years. It should be noted that the appearances of vehicles in the surveillance videos from different cameras are quite diverse because of the different locations, heights, angles and views of the cameras. In addition, the weather condition and the time of video recording, e.g. morning or evening, may also affect the vehicle detection process. It is quite difficult to establish a common model in advance for all the surveillance videos. Nevertheless, if we choose to construct a model for each individual surveillance video, a great deal of human effort will be required, given that there are so many roadside cameras. Therefore, one objective of our research is to enable the procedures of model construction to run in an automatic manner so that the customized model of each scene can be established. The other objective of this research is to provide an approach to deal with the vehicle occlusion problem, in which multiple vehicles appear in the video scene and certain parts of them overlap, in the vehicle detection. The occlusions of vehicles occur quite often in cameras set up at the streets, cause ambiguity in vehicle detection and may lead to inaccurate measurement of traffic parameters, such as the traffic flow volume. We adopt a so-called "Implicit Shape Model" (ISM) to recognize the vehicles and reasonably help in solving the occlusion problem. The proposed scheme has two parts, i.e. the self-training mechanism and the construction of the implicit shape model for resolving vehicle occlusion. The organization of this paper is as follows. A review of the related works is described in Section II. The proposed method is presented in Section III. Preliminary results are shown in Section IV and the concluding remarks are given in Section V.

                     II. RELATED WORKS

   There have been active research efforts on automatic vision-based traffic scene analysis in recent years [1]–[6]. Levin et al. [1] proposed to collect the training examples by a coarse detector, and the training examples are used to build the final pedestrian detector. The classification criterion for the coarse detector has to be defined manually. Wu et al. [3] employed an online boosting method to enhance the performance of the system. A prior detector built by off-line learning is employed to train the posterior detector, which adopts unsupervised learning. Nair et al. [7] also employed a supervised way for the initial training. Hsieh et al. [2] adopted a different approach that detects the lanes of the surveillance video automatically in the initial stage. Vehicle features such as size and linearity are used to detect and classify vehicles, instead of using a large amount of labeled training data. The vehicle size information has to be pre-defined manually. Zhou et al. [4] proposed an example-based moving vehicle detection. The vehicles are detected according to the luminance changes by the background subtraction. The features are extracted from those examples using PCA and trained as a detector by SVM. Celik et al. [5] presented an unsupervised and on-line approach. A coarse object detector is used to extract the moving objects by the background subtraction and then the obtained samples are refined by clustering based on the similarity matrix. The extracted features are separated into good and bad positives for training a final detector via SVM.
Celik et al. [6] then addressed an automatic classification method for identifying pedestrians and vehicles by SIFT.

   Regarding the occlusion problem, various solutions have been proposed [8]–[16]. We roughly classify the approaches into 3D model-based methods, feature-based methods and others. The 3D model is a popular solution for the vehicle occlusion problem. Pang et al. [8], [9] detect a vanishing point first in the traffic scene. A 3D deformable model is used to estimate each viewpoint of the vehicle occlusion and transform it into a 2D representation. The occlusion is detected by obtaining the curvature points on the shape of the vehicles, and occluded vehicles are separated into individual vehicles. The vanishing point is also adopted by Yoneyama et al. [10]. A hexagon is used to approximate the shape of a vehicle for eliminating shadows, and a multiple-camera system is utilized to detect the occlusion problem. Song et al. [11] proposed to employ vehicle shape models, camera calibration and ground plane knowledge to detect, track and classify the occlusion by estimating the related likelihood. Lou et al. [12] established a 3D model for tracking vehicles, and an improved extended Kalman filter was also presented to track and predict the vehicle motion. Most 3D model methods require precise camera calibration and vehicle detection. In feature-based methods, the occlusion can be resolved by tracking partially visible features of the occluded vehicles. Kanhere et al. [13] proposed to track vehicles in low-angle situations and estimate the 3D height of features on the vehicles. The feature points are detected and tracked throughout the image sequence. The feature-based methods may be influenced by similar shapes from the background. Zhang et al. [14] presented a multilevel framework, which consists of the intra-frame, inter-frame and tracking levels. At the intra-frame level, an occlusion is detected by evaluating the convex compactness ratio of the vehicle shape and resolved by removing a "cutting region." At the inter-frame level, an occlusion is detected by the statistics of motion vectors of vehicles. At the tracking level, the detected vehicles are tracked for resolving the full occlusion. Tsai et al. [15] detect the vehicles by using color and edges. The vehicle color usually looks unique and can be used for searching possible vehicle locations. Then the edge maps and coefficients of the wavelet transform are used for examining the vehicle candidates. Wang and Lien [16] proposed an automatic vehicle detection based on significant subregions of vehicles, which are transformed to PCA weighting vectors and ICA coefficient vectors. The position information is estimated by a likelihood probability evaluation process.

            III. THE PROPOSED SELF-TRAINING SCHEME

A. System overview

   Our system is aimed at resolving the vehicle occlusion problem for more accurate estimation of the traffic flow at the scene captured by a static traffic surveillance camera. Our scheme mainly relies on establishing two models, i.e. the traffic scene model and the implicit shape model of vehicles, for effective traffic scene analysis. The models should be trained in advance so that the pixels covering a single vehicle can be correctly located. Since the traffic scenes from different cameras may vary significantly, this training process has to be applied for each individual camera. If this process were carried out manually, it would require considerable amounts of human effort. In order to provide a more feasible solution, we develop a "self-training" adaptive scheme so that these models will be built in a more automatic manner without involving a great deal of human effort. Considering that the settings of traffic surveillance cameras are usually fixed without rotation and the corresponding traffic scenes tend to be static, i.e. the background of the traffic scene is invariant, we extract a long video segment from the target camera for building the models. It should be noted that typical vehicles should appear in the extracted long video segment and their related information can thus be collected as references for future usage.

   Fig. 1 demonstrates our system framework. The background of the scene will be constructed from the traffic surveillance video by an iteratively updating method so that the background subtraction can be applied to extract the vehicle masks. Although an extracted vehicle mask may contain a single vehicle or occluded ones, it is assumed that the long video used for training should contain a large number of single vehicles and that the vehicles of the same type should exhibit a similar shape/size. Even if many occlusions happen, their shapes are usually quite different. Therefore, the majority-voting methodology can be employed to determine such static information in the target traffic video, including the traffic flow directions and the vehicle shape/size of the different types at the scene, to construct our first model, i.e. the scene model. The second model, i.e. the shape model, which will be used for recognizing vehicles, especially the occluded vehicles, is said to be implicitly established since image features are extracted and grouped without explicitly resorting to the exact shapes of vehicles. The scale-invariant feature transform (SIFT) will be used to extract effective features from the segmented vehicle masks of consecutive frames to indicate the pixels covering vehicles more precisely.

            Figure 1. The proposed framework.

   The statistics of the vehicle size information obtained by the occlusion detection is analyzed and will be utilized to classify the vehicle types. By the results of the statistics, the vehicles can be classified into motorcycles, sedan cars and buses according to the vehicle size information.
The step of vehicle pattern extraction and classification will collect various types of vehicle masks. The classification is implemented with the vehicle size information obtained from the traffic information analysis. After the system has run for a period of time, there will be enough vehicle masks to establish the implicit shape model. We detail the procedures of our proposed system as follows.

B. Background Model Construction

   A series of traffic surveillance frames is utilized to construct the background image of the traffic scene captured by a static roadside camera so that the moving vehicles can be detected by the background subtraction. Let B^i_{x,y} be the pixel at (x, y) of the background image; the background updating function is given by

    B^{i+1}_{x,y} = (1 − α M^b_{x,y}) B^i_{x,y} + α M^b_{x,y} F^i_{x,y}        (1)

in which F^i_{x,y} is the pixel at (x, y) in frame i, α is the small learning rate, and M^b_{x,y} is the binary mask of the current frame. If the pixel at (x, y) belongs to the background part, M^b_{x,y} = 1 to turn on the updating; otherwise, M^b_{x,y} is set to 0 to avoid updating the background with the moving objects. An example of a scene with its constructed background is demonstrated in Fig. 2.

            Figure 2. Background image construction.
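A minimal sketch of the update rule of Eq. (1), assuming the frame, the background image and the binary mask are NumPy arrays of compatible shapes:

    import numpy as np

    def update_background(background, frame, update_mask, alpha=0.05):
        """Iterative update of Eq. (1). update_mask is the binary mask M^b
        (1 where the background is allowed to absorb the current frame,
        0 on moving objects); alpha is the small learning rate."""
        m = update_mask.astype(frame.dtype)
        if m.ndim == 2 and frame.ndim == 3:
            m = m[..., None]                 # broadcast the mask over color channels
        return (1.0 - alpha * m) * background + alpha * m * frame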
C. Occlusion Vehicle Detection

   It has been observed that the shape of a non-occluded vehicle should be close to its convex hull and that the shape of occluded vehicles will show certain concavities, as illustrated in Fig. 3. This characteristic can be used to roughly extract the non-occluded vehicles. In our implementation, the compactness Γ is used to evaluate how close the vehicle's shape and its convex hull are. That is,

    Γ = V_s / V_c,                                      (2)

where V_s and V_c represent the vehicle area from the background subtraction and the vehicle convex-hull area, respectively. When the value of Γ is closer to one, the vehicle area is similar to its convex hull area, which indicates that an occlusion may not have happened. In the training process, our system tries to extract non-occluded vehicle patterns, so we set a high threshold to ensure that most of the extracted vehicle patterns contain single vehicles.

            Figure 3. Convex hulls of (a) a non-occluded vehicle and (b) occluded vehicles.

D. Traffic Information Analysis

   As mentioned before, we require that our system be executed in a more automatic manner to reduce the human effort for tuning the parameters. Our scheme obtains the direction of the traffic appearing in the scene and the common vehicle size information from the statistics of the surveillance videos in the training phase. For analyzing the direction of traffic, the vehicle movements must be attained first. SIFT is employed to identify features on vehicles. After the vehicle segmentation, the vehicles are transformed into SIFT feature descriptors. The features of consecutive frames are compared and the positions of the movements are recorded. After a period of time, the main direction of traffic in the surveillance scene can be observed from the resultant movement histogram. In addition, the Region of Interest (ROI) can be identified to facilitate the subsequent processing. The ROI is located in the area of the detected traffic flow and near the bottom of the captured traffic scene, where vehicles appear larger and can offer more information.

   After determining the ROI, we can collect the vehicle patterns or masks that appear in the ROI. In the training phase, vehicle patterns that are determined to contain single vehicles based on the convex hull analysis will be archived. Then we can check the size histogram of the archived vehicles to set up the criterion for roughly classifying them. In our test videos, the most common vehicles are motorcycles, sedan cars and buses.
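The convex-hull compactness test of Eq. (2), used above to archive only single-vehicle patterns, can be sketched with OpenCV as follows; the threshold value is a placeholder, not the one used in the experiments.

    import cv2
    import numpy as np

    def is_single_vehicle(mask, gamma_thresh=0.9):
        """Compactness test of Eq. (2): V_s / V_c, the ratio between the blob
        area from background subtraction and the area of its convex hull.
        A high threshold keeps only patterns that are likely single vehicles."""
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return False
        contour = max(contours, key=cv2.contourArea)   # largest blob in the mask
        v_s = cv2.contourArea(contour)
        v_c = cv2.contourArea(cv2.convexHull(contour))
        return v_c > 0 and v_s / v_c > gamma_thresh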
When we examine the histogram of the sizes of the collected single-vehicle patterns, there will be obvious peaks. To be more specific, we basically make use of the peaks to determine the sizes of common motorcycles and sedan cars since they appear more often. We can then set up the upper and lower bounds of sedan cars and use them as the reference to assign a lower bound of the bus size. In the detection phase, if a vehicle mask is large and close to its convex hull, the pattern may be determined as a bus. Otherwise, an occlusion may have happened and this has to be resolved by using ISM. In other words, after the rough classification according to the vehicle sizes, we proceed to use the vehicle patterns to establish the codebooks of ISM, which will then be used for resolving the vehicle occlusions.

E. Implicit Shape Model

   Leibe et al. [17] proposed to use ISM for learning the shape representations in detecting the most possible locations of vehicles in images or frames. The object categorization is achieved by learning the appearance variability of an object category in a codebook. The investigated image will be compared with the codewords in the codebook that have a similar shape and then a weighted voting procedure will be applied to address the object detection. The steps of ISM are as follows.

   1) Shape Model Establishment: In visual object recognition, we have to determine the correspondence of the image features with the structures of the object, even under different conditions. To employ a flexible representation for object recognition, a codebook is built for representing features that appear on the training images quite often, and similar features are clustered. A codeword in the codebook should be a compact representation of the local appearances of objects. Given an unknown image structure, we will try to match it with a possible representation or codeword in the codebook. Then, many such matches are collected and we can infer the existence of that object. Again, the scale-invariant interest point detector is employed to detect the feature points on the training images and the extracted image regions are then translated into a representation by a local descriptor. Next, the visually similar features are grouped to construct a codebook for representing the local appearances of a certain object. The k-means algorithm is used to partition the features into k clusters, in which each feature is assigned to the cluster center with the nearest distance. The codebook generation process is shown in Fig. 4.

            Figure 4. The codebook training procedure.

   After building the codebook, the spatial probability distribution is defined for each codebook entry. It records the positions of the training vectors where the codebook entry is found. The position of each feature is defined relative to the object center. We match the features from the training images with the codebook entries. When the similarity of a feature with any entry is above a threshold, the position relative to the object center is recorded along with the codebook entry. After matching the training images with the codebook entries, we obtain the spatial probability distribution.

   2) Recognition Approach: Given a target image, the features are extracted by SIFT and matched to the codewords in the codebook. When the similarity between the extracted features and the codebook entries is higher than a threshold, these matches are collected. According to the spatial probability distribution, these matched codebook entries cast votes for the object center. When a feature of the target image extracted at (x_img, y_img, s_img), in which (x, y) is the location and s means the scale, is determined to have a match with a codebook entry, the positions (x_pos, y_pos, s_pos) recorded in this codebook entry cast votes for the object center. The voting is applied by

    x_vote = x_img − x_pos (s_img / s_pos)              (3)
    y_vote = y_img − y_pos (s_img / s_pos)              (4)
    s_vote = s_img / s_pos                              (5)

where (x_vote, y_vote, s_vote) is a vote for the object center. After all the matched codebook entries have voted, we store these votes for a probability density estimation mechanism, which is used to obtain the most possible location of the object center.

   Next, we collect the votes in a binned 3D accumulator array and search for the local maxima to speed up the computation. The local maxima are detected by comparing each member of the binned 3D accumulator array to its 26 neighbors in the 3×3×3 region. Then, the Mean-Shift approach [18] is employed to refine the local maxima for a more accurate location. The Mean-Shift approach can locate the maxima of a density function given discrete data sampled from that function. It will quickly converge to more precise locations of the local maxima after several iterations.

   The refined local maxima can be regarded as candidates of the object center. Thus, the following criterion is used to estimate the existing probability of the object:

    score(l_c) = (1 / V(s_c)) Σ_i w_i Ker((l_c − l_i) / b(s_c)),        (6)

where Ker() is a kernel function; b(s_c) is the kernel bandwidth; V(s_c) is the volume of the kernel; w_i and l_i are the weighting factor and the location of the i-th vote, respectively; l_c and s_c are the location and scale of the local maximum. The kernel function Ker() can be treated as a search window for the position of the object center. If the vote location l_i is inside the kernel, the Ker() function returns a value of one; otherwise, it returns zero.
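The voting of Eqs. (3)-(5) and the binned 3D accumulator can be sketched as follows; the layout of the matched feature/offset pairs is assumed here for illustration.

    import numpy as np

    def cast_votes(matches):
        """Each match pairs an image feature (x_img, y_img, s_img) with one
        recorded offset (x_pos, y_pos, s_pos) of a codebook entry; Eqs. (3)-(5)
        turn it into a vote for the object center in (x, y, scale) space."""
        votes = []
        for (x_img, y_img, s_img), (x_pos, y_pos, s_pos) in matches:
            r = s_img / s_pos
            votes.append((x_img - x_pos * r,          # Eq. (3)
                          y_img - y_pos * r,          # Eq. (4)
                          r))                         # Eq. (5)
        return np.array(votes)

    def binned_accumulator(votes, bins=(64, 48, 8)):
        """Collect the votes in a binned 3D accumulator so that local maxima
        (candidate object centers) can be searched quickly, as described above."""
        hist, edges = np.histogramdd(votes, bins=bins)
        return hist, edges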
For the 3D voting space, we use a spherical kernel whose radius is the bandwidth b(s_c), which is adaptive to the scale s_c of the local maximum. As the object scale increases, the kernel bandwidth should also increase for an accurate estimation. Therefore, we sum up all the weighting values that are inside the kernel and divide them by the volume V(s_c) to obtain an average weight density, which is called the score. After the score is derived, we define a threshold θ for determining whether the object exists. When the score is above θ, the hypothesized object center is preserved. Finally, we back-project the votes that support this hypothesized object center to obtain an approximate shape of the object.

F. Occlusion Resolving

   After detecting the existence of certain occluded vehicles in the image, we need to classify them into different types. In our scheme, we construct the codebooks of the different types of vehicles. Each type of vehicle codebook will be established automatically after we obtain enough vehicle patterns collected by the process of vehicle extraction. However, as shown in Fig. 5, the performance of recognition is not as good as expected since many errors happen on the bus image. Owing to the fact that the area of a bus is much larger than that of a sedan car and that there are many similar local appearances in these two types, errors of this kind occur quite often. We provide a refining procedure as follows.

            Figure 5. (a) Multi-type vehicle error detection and (b) the result after the refining procedure.

   All the hypotheses are supported by the contributing votes that are cast by the matched features. Theoretically, every extracted feature should support only one hypothesis since it is not possible that one feature belongs to two vehicles. Thus, we modify these hypotheses after executing multiple recognition procedures. We first store all the hypotheses whose scores are over a threshold. Then all the hypotheses are refined by checking each contributing vote that appears in two hypotheses at the same time. The hypothesis with the higher score can retain this vote while the vote is eliminated from the others. Next, the scores of these hypotheses are recalculated. When the new score is above the threshold, the hypothesis is preserved. After this refining procedure, the number of error detections can be reduced.

   There exists another problem in the vehicle recognition by using ISM. As shown in Fig. 6, there are three bounding boxes on the same vehicle. It means that the recognition result includes some error detections in which ISM has produced multiple hypotheses on this vehicle. Since this multiple-detection problem comes from the fact that ISM searches the local maxima in the scale-space, as shown in Fig. 7, the scheme may find several local maxima in different scale levels but at a similar location. In fact, these local maxima are generated by the same vehicle center. Therefore, the unnecessary hypotheses should be eliminated. We deal with the problem by estimating the overlapped area between two bounding boxes. When the overlapped area between two bounding boxes is very large, we can claim that the bounding box that has the weaker score is an error detection. For efficient computation, the rate of overlap is evaluated by finding the distance between the central points of the two bounding boxes and using the diagonal line of the larger bounding box as the criterion: the shorter the distance is, the more the areas overlap. In other words, for every two bounding boxes, we need to check

    distance(B1, B2) < (1/3) D,                         (7)

where B1 and B2 denote the central points of the two bounding boxes and D is the diagonal line of the larger one. In our implementation, when the distance is smaller than (1/3) D, the overlapped area of the bounding boxes is above 50% and we thus remove the bounding box that has the lower score. The error detections from ISM can thus be reduced.

            Figure 6. (a) Multiple hypotheses detected in one vehicle and (b) the results from the refining procedure.
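A sketch of this duplicate-hypothesis removal based on Eq. (7); the bounding boxes are assumed to be (x, y, w, h) tuples paired with their detection scores, a layout chosen here only for illustration.

    import math

    def center(box):
        x, y, w, h = box
        return x + w / 2.0, y + h / 2.0

    def remove_duplicates(detections):
        """Keep the stronger hypothesis when two bounding boxes satisfy Eq. (7),
        i.e. their centers are closer than one third of the diagonal of the
        larger box. Each detection is ((x, y, w, h), score)."""
        kept = []
        for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
            cx, cy = center(box)
            is_duplicate = False
            for kept_box, _ in kept:
                kcx, kcy = center(kept_box)
                larger = max(box, kept_box, key=lambda b: b[2] * b[3])
                diag = math.hypot(larger[2], larger[3])          # diagonal of the larger box
                if math.hypot(cx - kcx, cy - kcy) < diag / 3.0:  # Eq. (7)
                    is_duplicate = True                          # overlap above roughly 50%
                    break
            if not is_duplicate:
                kept.append((box, score))
        return kept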
Figure 10. The vehicle size statistics for (a) Scene 1 and (b) Scene 2 (number of occurrences per minute versus vehicle size, in units of 100 pixels).
Figure 7. The smaller the distance between the centers of two bounding boxes, the larger their overlap; this distance is therefore employed to remove duplicated detections.

Table I. VEHICLE PATTERN EXTRACTION

           Total   Error   Correct rate
Scene 1     940      15       98.4%
Scene 2    1251      31       97.5%
Figure 8. The views of the two surveillance videos. (a) Scene 1. (b) Scene 2.

A. Traffic Information Analysis

The directions of the traffic flows of the two scenes are illustrated in Fig. 9. The red points represent forward-moving vehicles and the blue points represent backward-moving vehicles. We can see that the directions of the traffic flows are successfully obtained after training on the video for a while. It should be noted that the heavier the traffic volume is, the less time is needed. The vehicle size statistics for Scene 1 and Scene 2 are exhibited in Fig. 10. There are two peaks in each scene: the left peak, which corresponds to a smaller vehicle size, represents motorcycles, while the right one, which corresponds to a larger vehicle size, stands for sedan cars. In Scene 1, according to Fig. 10, we assign a lower bound of 700 pixels and an upper bound of 1000 pixels for the motorcycle size. The lower and upper bounds of the sedan car size are 1700 pixels and 3300 pixels, respectively. In Scene 2, the motorcycle size is assigned bounds of 1400 and 2100 pixels, while the sedan car size is assigned a lower bound of 4000 pixels and an upper bound of 8500 pixels. We can see that the vehicle size information, i.e., the motorcycle and sedan car sizes, can be obtained successfully from the statistics of the surveillance video.
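As a concrete illustration of how these size bounds can be used, the sketch below labels foreground blobs by area with the Scene 1 bounds quoted above; the blob extraction itself (background subtraction and connected-component analysis) is assumed and not shown.

```python
# Labeling foreground blobs by area with the Scene 1 bounds quoted above.
MOTORCYCLE_RANGE = (700, 1000)     # pixels, Scene 1
SEDAN_RANGE = (1700, 3300)         # pixels, Scene 1

def label_by_size(blob_area,
                  moto_range=MOTORCYCLE_RANGE,
                  sedan_range=SEDAN_RANGE):
    """Return a coarse vehicle label for one foreground blob."""
    if moto_range[0] <= blob_area <= moto_range[1]:
        return 'motorcycle'
    if sedan_range[0] <= blob_area <= sedan_range[1]:
        return 'sedan'
    return 'unknown'               # e.g. buses, occluded groups, noise

# Example: areas (in pixels) of three blobs detected in one frame.
print([label_by_size(a) for a in (850, 2400, 5200)])
# -> ['motorcycle', 'sedan', 'unknown']
```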
B. Vehicle Pattern Extraction and Classification

The various extracted vehicle patterns are demonstrated after they pass the occlusion detection process, which ensures that they have no occlusion problem. In our experiment, we set the threshold of Eq. (2) to 0.9 for extracting sedan cars and buses and to 0.8 for motorcycles. We apply the shape analysis to sedan cars and buses but not to motorcycles, since motorcycles cannot be well approximated by a convex hull. The performance of the vehicle extraction is summarized in Table I. These vehicle patterns will be employed for training. It should be noted that the errors usually come from unstable environmental conditions, which affect the construction of the background image. The vehicle classification results are summarized in Table II. Some extracted patterns from Scene 1 are illustrated in Figs. 11-13. We can see that the vehicle patterns can be effectively extracted, and they will be helpful in training a more accurate codebook or models.
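The shape analysis above relies on sedan cars and buses being well approximated by their convex hulls. A minimal sketch of one plausible convexity (solidity) test is given below; the exact form of Eq. (2) is defined earlier in the paper and may differ, so this measure and the 0.9/0.8 thresholds are used here only for illustration.

```python
import cv2

def solidity(mask):
    """Ratio of blob area to convex hull area for a binary (uint8) vehicle mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    contour = max(contours, key=cv2.contourArea)
    hull_area = cv2.contourArea(cv2.convexHull(contour))
    return cv2.contourArea(contour) / hull_area if hull_area > 0 else 0.0

def accept_pattern(mask, vehicle_type):
    """Keep the pattern only if it looks un-occluded (cf. the thresholds above)."""
    threshold = 0.9 if vehicle_type in ('sedan', 'bus') else 0.8
    return solidity(mask) >= threshold
```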

Figure 9. The directions of the traffic flows for (a) Scene 1 and (b) Scene 2.

Table II. VEHICLE PATTERN CLASSIFICATION

                   Motorcycle                      Sedan car
           Total   Error   Correct rate    Total   Error   Correct rate
Scene 1     135      3        97.8%         765      34       95.6%
Scene 2     159      2        98.7%         826      46       94.4%

C. Occlusion Resolving

Table III and Figs. 14-16 demonstrate the results of occlusion resolving. We use the extracted vehicle patterns to train the ISM codebooks for the two scenes. Table III reports the performance of resolving occlusions of sedan cars: the occlusion part of Table III denotes the sedan cars that actually occlude with other vehicles, while the non-occlusion part stands for the sedan cars that are not occluded by other vehicles but pass the occlusion detection. As shown in Figs. 14 and 15, several sedan cars are partially occluded. We use the trained ISM to resolve the occlusions.
Figure 11. The extracted motorcycle patterns from Scene 1.

Figure 12. The extracted sedan car patterns from Scene 1.

Figure 13. The extracted bus patterns from Scene 1.

Figure 14. Occlusion resolving of sedan cars in Scene 1.

Figure 15. Occlusion resolving of sedan cars in Scene 2.

The red points and bounding boxes represent the vehicles' central coordinates and positions detected by ISM. In Fig. 16, we resolve the occlusion between two types of vehicles, i.e., a bus and a sedan car. By combining ISM and the proposed self-training mechanism, these occlusion problems can be reasonably resolved.
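For reference, the sketch below illustrates the general ISM-style center voting scheme [17] that underlies this occlusion resolving: matched codebook entries cast weighted votes for the object center, and local maxima of the accumulated vote map become hypotheses. The feature matching and the learned vote offsets are assumed to be available; this is a schematic illustration rather than the exact procedure used here.

```python
import numpy as np

def vote_map_from_matches(votes, frame_shape, cell=4):
    """Accumulate weighted center votes on a coarse grid.

    votes: iterable of ((x, y), weight) pairs produced by matching image
    features against an ISM codebook (the matching itself is not shown).
    """
    h, w = frame_shape
    grid = np.zeros((h // cell + 1, w // cell + 1), dtype=np.float32)
    for (x, y), weight in votes:
        grid[int(y) // cell, int(x) // cell] += weight
    return grid

def center_hypotheses(grid, score_threshold, cell=4):
    """Return (x, y, score) for local maxima of the vote map above a threshold."""
    hyps = []
    for r in range(1, grid.shape[0] - 1):
        for c in range(1, grid.shape[1] - 1):
            s = grid[r, c]
            if s >= score_threshold and s == grid[r-1:r+2, c-1:c+2].max():
                hyps.append((c * cell, r * cell, float(s)))
    return sorted(hyps, key=lambda t: t[2], reverse=True)
```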


Figure 16. Resolving the partial occlusion of a sedan car and a bus.

Table III. SEDAN CAR OCCLUSION RESOLVING RATE

                          Total   Miss   False alarm   Recall   Precision
Scene 1   occlusion        177     35        46        80.2%     75.5%
Scene 1   non-occlusion     88      1         2        98.9%     97.8%
Scene 2   occlusion         92     16        21        82.6%     78.2%
Scene 2   non-occlusion    130      2        12        98.4%     99.2%

V. CONCLUSION
We have proposed a framework for analyzing the traffic information in surveillance videos captured by static roadside cameras. The traffic and vehicle information is collected from the videos to train the related models automatically. For vehicles without occlusion, we can use the scene model to record and classify them. If an occlusion happens, the implicit shape model is employed. The experimental results demonstrate a potential solution to the occlusion problems in traffic surveillance videos. Future work will further improve the accuracy and the execution speed.

                        REFERENCES

 [1] O. Javed, S. Ali, and M. Shah, “Online detection and classification of moving objects using progressively improving detectors,” Computer Vision and Pattern Recognition, vol. 1, pp. 696–701, 2005.

 [2] J. Hsieh, S. Yu, Y. Chen, and W. Hu, “Automatic traffic surveillance system for vehicle tracking and classification,” IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 2, pp. 175–187, 2006.

 [3] B. Wu and R. Nevatia, “Improving part based object detection by unsupervised, online boosting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07, 2007, pp. 1–8.
 [4] J. Zhou, D. Gao, and D. Zhang, “Moving vehicle detection
     for automatic traffic monitoring,” IEEE transactions on
     vehicular technology, vol. 56, no. 1, pp. 51–59, 2007.

 [5] H. Celik, A. Hanjalic, E. Hendriks, and S. Boughor-
     bel, “Online training of object detectors from unlabeled
     surveillance video,” in IEEE Computer Society Conference
     on Computer Vision and Pattern Recognition Workshops,
     2008. CVPRW’08, 2008, pp. 1–7.

 [6] H. Celik, A. Hanjalic, and E. Hendriks, “Unsupervised
     and simultaneous training of multiple object detectors from
     unlabeled surveillance video,” Computer Vision and Image
     Understanding, vol. 113, no. 10, pp. 1076–1094, 2009.

 [7] V. Nair and J. Clark, “An unsupervised, online learning framework for moving object detection,” Computer Vision
     and Pattern Recognition, vol. 2, pp. 317–324, 2004.

 [8] C. Pang, W. Lam, and N. Yung, “A novel method for
     resolving vehicle occlusion in a monocular traffic-image
     sequence,” IEEE Transactions on Intelligent Transportation
     Systems, vol. 5, pp. 129–141, 2004.

 [9] ——, “A method for vehicle count in the presence of
     multiple-vehicle occlusions in traffic images,” IEEE Trans-
     actions on Intelligent Transportation Systems, vol. 8, no. 3,
     pp. 441–459, 2007.

[10] A. Yoneyama, C. Yeh, and C. Kuo, “Robust vehicle and
     traffic information extraction for highway surveillance,”
     EURASIP Journal on Applied Signal Processing, vol. 2005,
     p. 2321, 2005.

[11] X. Song and R. Nevatia, “A model-based vehicle segmen-
     tation method for tracking,” in Tenth IEEE International
     Conference on Computer Vision, 2005. ICCV 2005, 2005,
     pp. 1124–1131.

[12] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, “3-D model-based vehicle tracking,” IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1561–1569, 2005.

[13] N. Kanhere, S. Birchfield, and W. Sarasua, “Vehicle segmentation and tracking in the presence of occlusions,” Transportation Research Record: Journal of the Transportation Research Board, vol. 1944, pp. 89–97, 2006.

[14] W. Zhang, Q. Wu, X. Yang, and X. Fang, “Multilevel Framework to Detect and Handle Vehicle Occlusion,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 161–174, 2008.

[15] L. Tsai, J. Hsieh, and K. Fan, “Vehicle detection using normalized color and edge map,” IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 850–864, 2007.

[16] C. Wang and J. Lien, “Automatic Vehicle Detection Using Local Features: A Statistical Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 83–96, 2008.

[17] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” International Journal of Computer Vision, vol. 77, no. 1, pp. 259–289, 2008.

[18] Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790–799, 1995.
An Augmented Reality Based Navigation System for Museum Guidance


             Jun-Ming Pan, Chi-Fa Chen                                   Chia-Yen Chen, Bo-Sen Huang, Jun-Long Huang,
            Dept. of Electrical Engineering,                                    Wen-Bin Hong, Hong-Cyuan Syu
                  I-Shou University                                                     Vision and Graphics Lab.
                                                                              Dept. of Computer Science and Information
                                                                                               Engineering
                                                                            Nat. University of Kaohsiung, Kaohsiung, Taiwan
                                                                                            ayen@nuk.edu.tw


Abstract— The paper describes the design and an implementation of an augmented reality based navigation system used for guidance through a museum. The aim of this work is to improve the level of interaction between a viewer and the system by means of augmented reality. In the implemented system, hand motions are captured via computer vision based approaches and analyzed to extract representative actions which are used to interact with the system. In this manner, tactile peripheral hardware such as the keyboard and mouse can be eliminated. In addition, the proposed system also aims to reduce hardware related costs and avoid health risks associated with contamination by contact in public areas.

   Keywords- augmented reality; computer vision; human computer interaction; multimedia interface;

             I.   INTRODUCTION AND BACKGROUND

    The popularity of computers has induced a widespread usage of computers as information providers in public facilities such as museums or other tourist attractions. However, in most locations, the user is required to interact with the system via tactile means, for example, a mouse, a keyboard, or a touch screen. With a large number of users coming into contact with the hardware devices, it is hard to keep the devices free from bacteria and other harmful contaminants which may cause health concerns to subsequent users. In addition, constant handling increases the risk of damage to the devices, incurring higher maintenance costs to the providing party. Thus, it is our aim to design and implement an interactive system using computer vision approaches, such that the above mentioned negative effects may be eliminated. Moreover, we also intend to enhance the efficiency of the interface by increasing the amount of interaction, which can be achieved by means of a multimedia, user augmented reality interface.

    In this work, we realize the proposed idea by implementing an interactive system that enables the user to interact with a terminal via a pamphlet, which can easily be produced in a museum and distributed to visitors. The pamphlet contains summarized information about the exhibition or objects of interest. However, due to the size of the pamphlet, it is not possible to put in a lot of information; besides, too much textual information tends to make the visitor lose interest in the exhibition. Therefore, the implemented system aims to provide more in-depth visual and auditory information, as well as interactive 3D viewing of the objects, which otherwise cannot be provided by a pamphlet alone. In addition, an interactive guidance system will have more impact and create a more interesting experience for the visitors.

    The implemented system does not require the keyboard or the mouse for interaction. Instead, a camera and a pamphlet are used to provide the necessary input. To use the system, the user is first given a pamphlet, as often given out to visitors to the museum; he/she can then check out the different objects by moving his/her finger across the paper and pointing to the pictures of objects on the pamphlet. The location of the fingertip is captured by an overhead camera and the images are analyzed to determine the user's intended actions. In this manner, the user does not need to come into contact with anything other than the pamphlet that is given to him/her, thus eliminating health risks due to direct contact with harmful substances or contaminated surfaces.

    To implement the system, we make use of technologies in augmented reality. Augmented reality (AR) has received a lot of attention due to its attractive characteristics, including real-time immersive interaction and freedom from cumbersome hardware [7]. There have been many applications designed using AR technologies in areas such as medical applications, entertainment, and military navigation, as well as many other new possibilities.

    An AR system usually incorporates technologies from different fields. For example, technologies from computer graphics are required for the projection and embedding of virtual objects; video processing is required to display the virtual objects in real time; and computer vision technologies are required to analyse and interpret actions from input image frames. As such, an AR system is usually realized by a cross-disciplinary combination of techniques.

    Existing AR systems or applications often use designated markers, such as the AR encyclopedia or other applications written with ARToolkit [8]. The markers are often bi-coloured and without details, to facilitate marker recognition. However, for the guidance application, we intend to have a system that is able to recognize colour and meaningful images of objects




or buildings as printed on a brochure or guide book and use
them for user interactions.
    The paper is organized as follows. Section 2 describes
the design of the system; section 3 describes the different
steps in the implementation of the system; section 4
discusses the operational navigation system; and section 5
provides the conclusion and discusses possible future
research directions.
                     II.    S YSTEM DESIGN
    The section describes how the system is designed and
implemented. Issues that arose during the implementation of
the system, as well as the approaches taken to resolve the
issues are also discussed in the following.
    To achieve the goals and ideas set out in the previous section, the system is designed with the following considerations.

    • Minimum direct contact: the need for a user to come into direct contact with hardware devices such as a keyboard, a mouse, or a touch screen should be minimized.

    • User friendliness: the system should be easy and intuitive to use, with a simple interface and concise instructions.

    • Adaptability: the system should be able to handle other different but similar operations with minimum modifications.

    • Cost effectiveness: we wish to implement the system using readily available hardware, to demonstrate that the integration of simple hardware can have fascinating performance.

    • Simple and robust setup: our goal is to have the system installed at various locations throughout the school or other public facilities. By having a simple and robust setup, we reduce the chances of a system failure.

    In accordance with the considerations listed above, the system is designed to have the input and output interfaces shown in Fig. 1.

           Figure 1. Diagram for the navigation interface.

           Figure 2. Concept of the navigation system.

    The system obtains input via a camera located above and overlooking the pamphlet. The camera captures images of the user's hand and the pamphlet. The images are processed and analyzed to extract the motion and the location of the fingertip. The extracted information is used to determine the multimedia data, including text, 2D pictures, 3D models, sound files, and/or movie clips, to be displayed for the selected location on the pamphlet. Fig. 2 shows the concept of the proposed navigation system.
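The paper does not fix a particular fingertip-localization technique; the sketch below shows one plausible realization of the "extract the location of the fingertip" step (skin-color segmentation plus the extreme point of the largest contour), purely as an assumed illustration.

```python
import cv2
import numpy as np

def find_fingertip(frame_bgr):
    """Return an (x, y) fingertip estimate, or None if no hand-like blob is found.

    Assumed approach: segment skin-colored pixels in HSV, take the largest
    contour, and use its topmost point (the camera looks down at the pamphlet,
    so the pointing finger tends to be an extreme point of the hand blob).
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 30, 60), (25, 180, 255))        # rough skin range
    skin = cv2.morphologyEx(skin, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(skin, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    if cv2.contourArea(hand) < 1000:                             # ignore noise
        return None
    x, y = hand[hand[:, :, 1].argmin()][0]                       # topmost point
    return int(x), int(y)
```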
                 III. SYSTEM IMPLEMENTATION

    The main steps in our system are discussed in the following.

A. Build the system using ARToolKit

    We have selected ARToolKit to develop our system, since it has many readily available high-level functions that can be used for our purpose. It can also be easily integrated with other libraries to provide more advanced functions and to implement many creative applications.

B. Create markers

    The system associates 2D markers on the pamphlet with 3D objects stored in the database, as well as with actions to manipulate the objects. This is achieved by first scanning the marker patterns, storing them in the system, and letting the program learn to recognize the patterns. In the program, each marker is associated with a particular 3D model or action, such that when the marker has been selected by the user, the associated data or action will be displayed or executed. Fig. 3 shows examples of markers used for the system. Each marker is surrounded by a black border to facilitate recognition. The object markers, as indicated by the blue arrows, are designed to match the objects to be displayed. The bottom right shows a row of markers, enclosed by the red oval, used to perform actions on the displayed 3D objects.
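A minimal, library-agnostic sketch of the marker-to-content association described above follows. The marker identifiers, model file names, and actions are illustrative placeholders; in the actual system the marker-tracking library reports which trained pattern is visible.

```python
# Illustrative association table: object markers map to 3D model files,
# action markers map to operations on the currently displayed model.
OBJECT_MARKERS = {
    'marker_gate':   'models/old_city_gate.wrl',   # hypothetical file names
    'marker_temple': 'models/temple.wrl',
}

ACTION_MARKERS = {
    'marker_plus':   ('zoom', +0.1),
    'marker_minus':  ('zoom', -0.1),
    'marker_rotate': ('rotate', 15.0),   # degrees per trigger
    'marker_reset':  ('reset', None),
}

def handle_marker(marker_id, state):
    """Update the display state when a marker is reported as selected."""
    if marker_id in OBJECT_MARKERS:
        state['model'] = OBJECT_MARKERS[marker_id]   # display this model
    elif marker_id in ACTION_MARKERS:
        action, amount = ACTION_MARKERS[marker_id]
        if action == 'zoom':
            state['scale'] = max(0.1, state['scale'] + amount)
        elif action == 'rotate':
            state['angle'] = (state['angle'] + amount) % 360.0
        elif action == 'reset':
            state.update(scale=1.0, angle=0.0)
    return state

state = {'model': None, 'scale': 1.0, 'angle': 0.0}
state = handle_marker('marker_gate', state)
state = handle_marker('marker_plus', state)
```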




                Figure 3. Markers used by the system: object markers and a row of action markers.


Figure 5. The user selects the zoom-in function to magnify the displayed 3D model.

C. Create 3D models
    The 3D models that are associated with the markers are
created using OpenGL or VRML format. These models can
be displayed on top of the live-feed video, such that the user
can interact with the 3D models in real time. The models are
texture mapped to provide realistic appearances. The models
are created in collaboration with the Kaohsiung Museum of
History [9]. Fig. 4 shows examples of the 3D models used in
the navigation system. The models are completely 3D with
texture mapping, and can be viewed from any angle by the
user.




                                                                          Figure 6. The user selects the zoom out function to shrink the displayed
                                                                                                         3D model.




                Figure 4. Examples of the 3D models used in the navigation system.


D. Implement interactive functions
    In addition to displaying the 3D models when the user
selects a marker, the system will also provide a set of actions
that the user can use to manipulate the displayed 3D model
in real time. For example, we have designed “+/-” markers
for the user to magnify or shrink the displayed 3D model.
The user simply places his/her finger on the markers and the
3D model will change size accordingly. There are also
markers for the user to rotate the 3D model, as well as reset
the model to its original size and position. Figs 5 to 7 show
the system with its implemented actions in operation. In the figures, the user simply puts a finger over the marker, and the selected actions will be performed on the displayed 3D model. Note that the actions can be applied to any 3D model that can be displayed by the system.

   Figure 7. The user uses the rotation marker to rotate the 3D object.




E. Determine selection

    A USB camera is used to capture continuous images of the scene. The program automatically scans the field of view in real time for recognized markers. Once a marker is partially obstructed by the hand, it is considered to be selected. The program then matches the selected marker with the associated 3D model or action in the database. Figs. 5 to 7 show the user selecting markers by pointing to them with a finger. From the figures, it can be seen that the selected 3D model is shown within the video window in real time. Also notice that the models are placed on top of the corresponding marker's position in the video window.
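The selection rule above can be phrased as: a marker that has been steadily visible and then disappears from the detections (because the finger covers it) is reported as selected. The sketch below illustrates this logic; the history lengths are illustrative parameters, not values from the paper.

```python
from collections import deque

class MarkerSelector:
    """Report a marker as selected when it becomes occluded after being visible."""

    def __init__(self, history=10, visible_min=6):
        self.history = {}              # marker id -> deque of recent visibility flags
        self.size = history
        self.visible_min = visible_min

    def update(self, detected_ids, known_ids):
        """detected_ids: markers found in the current frame; returns selected ids."""
        selected = []
        for marker in known_ids:
            flags = self.history.setdefault(marker, deque(maxlen=self.size))
            visible_before = sum(flags) >= self.visible_min
            now_visible = marker in detected_ids
            if visible_before and not now_visible:
                selected.append(marker)   # was steadily visible, now occluded
            flags.append(now_visible)
        return selected
```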
                  IV. NAVIGATION SYSTEM

    The proposed navigation system has been designed and implemented according to the descriptions provided in the previous sections. The system does not have high memory requirements and runs effectively on a typical PC or laptop. It also requires no expensive hardware; a USB camera is sufficient to provide the required input. It is also quite easy to set up and to customize for various objects and applications.

    The system can be placed at various points in the museum on separate terminals to enable visitors to access additional museum information in an interactive manner.

  Figure 8. The interface showing the 3D model and other multimedia information.

    Fig. 8 shows a screen shot of the system in operation. In Fig. 8, the left window is the live-feed video, with the selected 3D model shown on top of the corresponding marker's position in the video window. The window on the right-hand side shows the multimedia information that is displayed along with the 3D model to provide more information about the object. For example, when the 3D object is displayed, the window on the right might show additional textual information about the object, as well as audio files that describe the object or provide suitable background music.

                     V. CONCLUSION

    A multimedia, augmented reality interactive navigation system has been designed and implemented in this work. In particular, the system is implemented for application in providing museum guidance.

    The implemented system does not require the user to operate hardware devices such as the keyboard, mouse, or touch screen. Instead, computer vision approaches are used to obtain input information from the user via an overhead camera. As the user points to certain locations on the pamphlet with a finger, the selected markers are identified by the system, and relevant data are shown or played, including a texture-mapped 3D model of the object and textual, audio, or other multimedia information. Actions to manipulate the displayed 3D model can also be selected in a similar manner. Hence, the user is able to operate the system without contacting any hardware device except for the printout of the pamphlet.

    The implementation of the system is hoped to reduce the cost of providing and maintaining peripheral hardware devices at information terminals, while at the same time eliminating health risks associated with contamination by contact in public areas.

    Work to enhance the system is ongoing, and it is hoped that the system will be used widely in the future.

                    ACKNOWLEDGMENT

    This research is supported by the National Science Council (NSC98-2815-C-390-026-E). We would also like to thank the Kaohsiung Museum of History for providing cultural artifacts and kind assistance.

                       REFERENCES

[1]  J.-Z. Jiang, Why can Wii Win?, Awareness Publishing, 2007.
[2]  D.-Y. Lai and M. Liou, Digital Image Processing Technical Manual, Kings Information Co., Ltd., 2007.
[3]  R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, McGraw-Hill, 1995.
[4]  R. Klette, K. Schluns, and K. Koschan, Computer Vision: Three-Dimensional Data from Images, Springer, 1998.
[5]  R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd edition, Prentice Hall, 2002.
[6]  HitLabNZ, http://www.hitlabnz.org/wiki/Home, 2008.
[7]  R. T. Azuma, "A Survey of Augmented Reality," Presence: Teleoperators and Virtual Environments, vol. 6, pp. 355–385, 1997.
[8]  Augmented Reality Network, http://augmentedreality.ning.com, 2008.
[9]  H.-J. Chien, C.-Y. Chen, and C.-F. Chen, "Reconstruction of Cultural Artifact using Structured Lighting with Densified Stereo Correspondence," ARTSIT, 2009.
[10] C.-H. Liu, Hand Posture Recognition, Master thesis, Dept. of Computer Science and Eng., Yuan Ze University, Taiwan, 2006.
[11] C.-Y. Chen, Virtual Mouse: Vision-Based Gesture Recognition, Master thesis, Dept. of Computer Science and Eng., National Sun Yat-sen University, Taiwan, 2003.
[12] J. C. Lai, Research and Development of Interactive Physical Games Based on Computer Vision, Master thesis, Department of Information Communication, Yuan Ze University, Taiwan, 2005.
[13] H.-C. Yeh, An Investigation of Web Interface Modal on Interaction Design - Based on the Project of Burg Ziesar in Germany and the Web of National Palace Museum in Taiwan, Master thesis, Dept. of Industrial Design, Graduate Institute of Innovation and Design, National Taipei University of Technology, Taiwan, 2007.
[14] T. Brown and R. C. Thomas, "Finger tracking for the digital desk," in First Australasian User Interface Conference, vol. 22, no. 5, pp. 11–16, 2000.
[15] P. Wellner, "Interacting with paper on the DigitalDesk," Communications of the ACM, pp. 28–35, 1993.




Facial Expression Recognition Based on Local Binary Pattern and Support
                               Vector Machine
    Ting-Wei Lee (李亭緯)1, Yu-shann Wu (吳玉善)2, Heng-Sung Liu (柳恆崧)3 and Shiao-Peng Huang (黃少鵬)4

                            Chunghwa Telecommunication Laboratories
                               12, Lane 551, Min-Tsu Road Sec.5
                           Yang-Mei, Taoyuan, Taiwan 32601, R.O.C.
                           TEL:886 3 424-5095, FAX:886 3 424-4742
    Email: finas@cht.com.tw, yushanwu@cht.com.tw, lhs306@cht.com.tw, pone@cht.com.tw

      Abstract—Facial expression recognition has long been an important and challenging issue. In this paper, we propose a method for facial expression recognition. First, we apply a face detection method to locate the face. Then the Local Binary Pattern (LBP) operator is used to extract the facial features. When calculating the LBP features, we use an NxN window as a statistical region and move this window by a certain number of pixels. Finally, we adopt the Support Vector Machine (SVM) as a classifier to recognize the facial expression. In the experiments, we use the JAFFE database and recognize seven kinds of expressions. The average correct rate achieves 93.24%. The experimental results show that the proposed method has higher accuracy.

Keywords: facial expression, face detection, LBP, SVM

                    I.     INTRODUCTION

      Analyzing facial expressions can provide much interesting information and can be used in several applications. Taking electronic billboards as an example, we can tell whether the commercials attract the customers by facial expression recognition. In recent years, much research has worked on this technique of human-computer interaction.

      The basic key point of any image processing is to extract the facial features from the original images. Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are two widely used methods. PCA computes a set of eigenvalues and eigenvectors; by selecting the most significant eigenvectors, it produces the projection axes onto which the images are projected, minimizing the reconstruction error. The goal of LDA is to find a linear transformation that minimizes the within-class variance and maximizes the between-class variance. In other words, PCA is suitable for data analysis and reconstruction, while LDA is suitable for classification. However, the dimension of the images is usually high, so the calculations required for feature extraction are significant.

      Besides PCA and LDA, the Gabor filter method [3] is also used for facial feature extraction. This method offers both multi-scale and multi-orientation selection in choosing filters, which can present some local features of facial expression effectively. However, the Gabor filter method suffers from the same problem as PCA and LDA: it costs too much computation and yields a high-dimensional feature space.

      In this paper, we use the Local Binary Pattern (LBP) [4][5] as the facial feature extraction method. LBP has a low computation cost and efficiently encodes the texture of micro-pattern information in the face image. In the first step, we detect the face area to remove the background. We extract Haar-like [6] features and use the Adaboost [7] classifier for face detection; the face detection module can be found in the Open Source Computer Vision Library (OpenCV). After obtaining the face area, we calculate the LBP features of this area. Finally, the Support Vector Machine (SVM) classifies the LBP features and recognizes the facial expression. Experimental results demonstrate the effective performance of the proposed method.

      The rest of this paper is organized as follows: In Section II, we introduce our system flow chart and the face detection. In Section III, we explain the facial LBP representation and the SVM classifier. In Section IV, experimental results are presented. Finally, we give a brief discussion and conclusion in Section V.

                II.    THE PROPOSED METHOD

      The flow chart of the proposed facial expression recognition method is shown in Fig. 1. In the first step, face detection is performed on the original image to locate the face area. In order to reduce the region of hair or background, we take a smaller area from the face area after the face detection. In the second step, the LBP method extracts the facial expression features. When calculating the histogram of LBP features, we use an NxN window as a statistical region and move this window by a certain number of pixels. In the last step, the SVM classifier is used for the facial expression recognition.
   Figure 1. The flow chart of the proposed method: original image → face detection → LBP feature extraction → SVM classification → recognition result.

A. The Face Detection

      Viola and Jones [9] used Haar-like features for face detection. Some Haar-like feature samples are shown in Fig. 2. Haar-like features can highlight the differences between the black region and the white region. Each portion of the facial area has different properties; for example, the eye region is darker than the nose region. Hence, the Haar-like features can extract rich information to discriminate different regions.

   Figure 2. Haar-like features: the first row is for the edge features and the second row is for the line features.

      The cascade of classifiers trained by the Adaboost technique is an effective way to reduce the time for searching the face area. In this cascade algorithm, the boosted classifier combines several weak classifiers to become a strong classifier. Different Haar-like features are selected and processed by different cascaded weak classifiers. Fig. 3 shows the decision process of this algorithm. If the feature set passes through all of the weak classifiers, the region is acknowledged as a face area. On the other hand, if the feature set is denied by any weak classifier, it is rejected.

   Figure 3. The decision process of the cascaded Adaboost classifier: a candidate region passes through weak classifiers 1 to N; if it passes all of them it is accepted as a face area, otherwise it is rejected.

      The face detection module can be found in the Open Source Computer Vision Library (OpenCV) [10]. But if we use the original detection region, it may include some unnecessary areas, such as hair or background. To avoid this situation, we cut a smaller area from the detection region to reduce the unnecessary areas while keeping the important features. This area's width is 126 pixels and its height is 147 pixels. Fig. 4 shows the final result of the face area.

   Figure 4. The first column shows the original images; the second column shows the final face areas.
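As an illustration of this step, the sketch below runs OpenCV's pretrained frontal-face Haar cascade and then cuts a 126 x 147 window out of the detected region. Centering the window inside a resized face box is our assumption, since the paper does not specify the exact offsets.

```python
import cv2

# Haar-cascade face detection followed by a fixed-size crop (126 x 147).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def detect_and_crop(gray, crop_w=126, crop_h=147):
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])     # largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], (160, 160))  # normalize box size
    top = (160 - crop_h) // 2                               # assumed centering
    left = (160 - crop_w) // 2
    return face[top:top + crop_h, left:left + crop_w]
```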
   Figure 7. Representation of the statistics procedure in the width direction (the 18x21 window is shifted by 6 pixels in width and 8 pixels in height).

        III.   THE LBP METHOD AND SVM CLASSIFIER

B. Local Binary Patterns

      LBP was originally used in texture analysis. This approach is defined as a gray-level invariant measurement derived from the texture in a local neighborhood. LBP has been applied to many different fields, including face recognition.

      Considering a 3x3 neighborhood, the operator assigns a label to every pixel of an image by thresholding each neighbor with the center pixel value and regarding the result as a binary number. Then, the histogram of the labels can be used as a texture descriptor. See Figure 5 for an illustration of the basic LBP operator.

      Figure 5. The basic idea of the LBP operator.

      Another extension of the original LBP is called uniform patterns [11]. A Local Binary Pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa. For example, 00011110 and 10000011 are uniform patterns.

      We utilize the above idea of LBP with uniform patterns in our facial expression representation. We compute the uniform patterns using the (8, 2) neighborhood, which is shown in Fig. 6. Here (8, 2) stands for eight neighbors on a circle of radius two. The black rectangle in the center denotes the thresholding pixel, and the circle points around it denote the neighbors. Four of the neighbors are not located at pixel centers, so their values are calculated by interpolation. After that, a sliding window of size 18x21 is used for the uniform pattern statistics, shifting by 6 pixels in width and 8 pixels in height. Fig. 7 illustrates the statistics procedure in the width direction.

      Figure 6. LBP representation using the (8, 2) neighborhood.
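A small sketch of the feature extraction just described is given below, using scikit-image's uniform LBP as a stand-in implementation of the LBP(8, 2) operator and the 18 x 21 window shifted by 6 and 8 pixels; the concrete library choice and the histogram normalization are our assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature_vector(face, window=(21, 18), step=(8, 6), points=8, radius=2):
    """Uniform LBP(8, 2) histograms from overlapping windows, concatenated.

    window/step are (height, width), matching the 18x21 window shifted by
    6 pixels in width and 8 pixels in height described above.
    """
    codes = local_binary_pattern(face, points, radius, method='uniform')
    n_bins = points + 2                    # uniform patterns plus one catch-all bin
    win_h, win_w = window
    step_h, step_w = step
    feats = []
    for top in range(0, face.shape[0] - win_h + 1, step_h):
        for left in range(0, face.shape[1] - win_w + 1, step_w):
            block = codes[top:top + win_h, left:left + win_w]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))   # normalized histogram
    return np.concatenate(feats)
```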
C. Support Vector Machine

      The SVM is a kind of learning machine founded on statistical learning theory. It has been widely applied in pattern recognition.

      The basic scheme of the SVM is to create an optimal hyper-plane as the decision plane, which maximizes the margin between the closest points of the two classes. The points on the margin are called support vectors; in other words, those support vectors are used to decide the hyper-plane.

      Assume we have a set of sample points from two classes

$\{(x_i, y_i)\},\; i = 1, \dots, m,\; x_i \in \mathbb{R}^N,\; y_i \in \{-1, 1\}$    (1)

The discrimination hyper-plane is defined as below:

$f(x) = \sum_{i=1}^{m} y_i a_i k(x, x_i) + b$    (2)

where $f(x)$ indicates the membership of $x$, and $a_i$ and $b$ are real constants. $k(x, x_i) = \langle \phi(x), \phi(x_i) \rangle$ is a kernel function, where $\phi(x)$ is the nonlinear map from the original space to the high-dimensional space. The kernel function can be of various types. For example, the linear kernel is $k(x, x_i) = x \cdot x_i$, the radial basis function (RBF) kernel is $k(x, x_i) = \exp\!\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$, and the polynomial kernel is $k(x, x_i) = (x \cdot x_i + 1)^n$. The SVM can be designed for either two-class or multi-class classification. In this paper, we use the multi-class SVM with a polynomial kernel function [12].
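For illustration, this classification stage can be realized with an off-the-shelf multi-class SVM with a polynomial kernel, e.g. scikit-learn's SVC as sketched below; the kernel degree and regularization constant are placeholders, not values reported in the paper.

```python
from sklearn.svm import SVC

# Multi-class SVM with a polynomial kernel (one-vs-one under the hood).
# X_train / X_test are LBP feature vectors such as those produced by
# lbp_feature_vector() above, and y_train holds the seven expression labels.
def train_and_predict(X_train, y_train, X_test, degree=2):
    clf = SVC(kernel='poly', degree=degree, coef0=1.0, C=1.0)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```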


                IV.    EXPERIMENTAL RESULTS

      In this paper, we use the JAFFE facial expression database [13]. Examples from this database are shown in Table I. The database is composed of 213 gray-scale images of 10 Japanese females. Each person has 7 kinds of expressions, and every expression includes 3 or 4 copies. The 7 expressions are Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.
Table I     THE EXAMPLES OF THE JAFFE DATABASE
(example face images for each of the seven expressions: Anger, Disgust, Fear, Happiness, Neutral, Sadness, Surprise)

The size of each image is 256x256 pixels. Two images of each expression for every person are used as training samples, and the rest are used as testing samples. Hence the total number of training samples is 140, and the number of testing samples is 73.

     Table 2 shows the recognition rate of each facial expression obtained by the proposed method. The last row is the average recognition rate over the 7 expressions, which is 93.24%. The recognition time for each face image is 0.105 seconds.

     We also compare our experimental results with two reference methods. In reference [14], the authors used Gabor features and an NN fusion method. In reference [15], the authors divided the face image into three parts and used the 2DPCA method. The training and test images are the same as for the proposed method. Table 3 shows the comparison result. The average recognition rate of reference [14] is 92.57%, and that of reference [15] is 91.6%.

     Table II        THE RECOGNITION RATE OF THE PROPOSED METHOD

          Anger         90%
          Disgust       88.89%
          Fear          92.3%
          Happiness     100%
          Neutral       100%
          Sadness       81.8%
          Surprise      100%
          Average       93.24%

     Table III       THE COMPARISON RESULTS

                     Reference [14]   Reference [15]   The proposed method
          Anger          95%              95.2%              90%
          Disgust        88%              95.2%              88.89%
          Fear           100%             85.7%              92.3%
          Happiness      100%             84.9%              100%
          Neutral        75%              100%               100%
          Sadness        90%              90.4%              81.8%
          Surprise       100%             89.8%              100%
          Average        92.57%           91.6%              93.24%

     According to Table 3, the proposed method clearly performs better overall than the two reference methods. Even though the recognition rates for some expressions are not as good as those of the reference methods, we still achieve the highest average recognition rate.

                         V.  CONCLUSIONS

     In this paper, we proposed a facial expression recognition method based on LBP features. To reduce the computational effort, we detect the face region before applying the LBP method. After the facial features are extracted from the detected area, the SVM classifier recognizes the facial expression. Using JAFFE as the experimental database, the proposed method achieves a 93.24% recognition rate, which is better than the two reference methods.
     There are several aspects to be studied in future work. The experiments discussed above share one property: the training and testing samples come from the same person. In other words, to recognize someone's expression, we must already have images of that person's various expressions in the database. This assumption is not suitable for real applications, and we want to overcome this limitation in the future, perhaps by modeling the variations between different expressions and using this model for recognition. Other problems in facial expression recognition, such as lighting variation and pose change, still have to be dealt with. These are long-standing difficult issues, and we will try to find better algorithms to enhance our method.

                           REFERENCES
                                                                      [1] L.I. Smith, “A Tutorial on Principal Components Analysis”,
                                                                           2002.

[2] H. Yu and J. Yang, “A Direct LDA Algorithm for High-Dimensional Data with Application to Face Recognition”,
     Pattern Recognition, vol. 34, no. 10, pp. 2067–2070, 2001.

[3] Deng Hb, Jin Lw and Zhen Lx et al, “A New Facial
     Expression Recognition Method Based on Local Gabor Filter
     Bank and PCA plus LDA”, International Journal of
     Information Technology, vol.11, no. 11, pp.86-96, 2005.

[4] Timo Ahonen, Abdenour Hadid and Matti Pietikäinen,
     “Face Description with Local Binary Patterns: Application to
     Face Recognition”, IEEE Transactions on Pattern Analysis
     and Machine Intelligence, vol. 28, no. 12, pp.2037–2041,
     2006.

[5] Timo Ahonen, Abdenour Hadid and Matti Pietikäinen, “Face
     Recognition with Local Binary Patterns”, Springer-Verlag
     Berlin Heidelberg 2004, pp.469–481, 2004.

[6] Pavlovic V. and Garg A. “Efficient Detection of Objects and
     Attributes using Boosting”, IEEE Conf. Computer Vision and
     Pattern Recognition, 2001.

[7] Jerome Friedman, Trevor Hastie and Robert Tibshirani,
     “Additive Logistic Regression: A Statistical View of Boosting”,
     The Annals of Statistics, vol. 28, no. 2, pp.337–407, 2000.

[8] C. Burges, Tutorial on support vector machines for pattern
     recognition, Data Mining and Knowledge Discovery, vol. 2, no.
     2, pp. 955-974, 1998.

[9] P. Viola and M. Jones, “Rapid object detection using a boosted
     cascade of simple features”, Proceedings of the 2001 IEEE
     Computer Society Conference, vol 1, 2001, pp. I-511-I-518.

[10] Intel,     “Open      source      computer      vision   library;
     http://sourceforge.net/projects/opencvlibrary/”, 2001.

[11] T. Ojala, M. Pietikäinen, and T. Mäenpää,
     “Multiresolution Gray-Scale and Rotation Invariant Texture
     Classification with Local Binary Patterns,” IEEE Trans. on
     Pattern Analysis and Machine Intelligence, vol. 24, no. 7,
     pp. 971-987, July 2002.

[12] Dana Simian, “A model for a complex polynomial SVM kernel”,
     Mathematics And Computers in Science and Engineering, pp.
     164-169, 2008.


[13] M. Lyons, S. Akamatsu, et al., “Coding Facial Expressions with Gabor Wavelets”, Proceedings of the Third IEEE
     International Conference on Automatic Face and Gesture Recognition, Nara, Japan, pp. 200-205, 1998.


[14] WeiFeng Liu and ZengFu Wang, “Facial Expression Recognition Based on Fusion of Multiple Gabor Features”,
     International Conference on Pattern Recognition, 2006.

[15] Bin Hua and Ting Liu, “Facial expression recognition based on FB2DPCA and multi-classifier fusion”,
     International Conference on Information Technology and Computer Science, 2009.




MILLION-SCALE IMAGE OBJECT RETRIEVAL

                        1 Yin-Hsi Kuo (郭盈希) and 1,2 Winston H. Hsu (徐宏民)

               1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei
               2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei




                      ABSTRACT

In this paper, we present a real-time system that
addresses three essential issues of large-scale image
object retrieval: 1) image object retrieval—facilitating
pseudo-objects in inverted indexing and novel object-
level pseudo-relevance feedback for retrieval accuracy;
2) time efficiency—boosting the time efficiency and
memory usage of object-level image retrieval by a novel
inverted indexing structure and efficient query
evaluation; 3) recall rate improvement—mining
semantically relevant auxiliary visual features through
visual and textual clusters in an unsupervised and
scalable (i.e., MapReduce) manner. We are able to
search over a one-million image collection and respond to a
user query in 121ms, with significantly better accuracy
(+99%) than the traditional bag-of-words model.
Keywords— Image Object Retrieval; Inverted File; Visual Words; Query Expansion

Figure 1: With the proposed auxiliary visual feature discovery, more accurate and diverse results of image object retrieval can be obtained. The search quality is greatly improved. Regarding efficiency, because the auxiliary visual words are discovered offline on a MapReduce platform, the proposed system takes less than one second searching over a million-scale image collection to respond to a user query.

                 1.   INTRODUCTION

Different from traditional content-based image retrieval (CBIR) techniques, the target images to match might only cover a small region in the database images. This need raises a challenging problem of image object retrieval, which aims at finding images that contain a specific query object rather than images that are globally similar to the query (cf. Figure 1). To improve the accuracy of image object retrieval and ensure retrieval efficiency, in this paper we consider several issues of image object retrieval and propose methods to tackle them accordingly.
     State-of-the-art object retrieval systems are mostly based on the bag-of-words (BoW) [6] representation and inverted-file indexing methods. However, unlike textual queries with few semantic keywords, image object queries are composed of hundreds (or a few thousands) of noisily quantized descriptors. Meanwhile, the target images generally have different visual appearances (lighting condition, occlusion, etc.). To tackle these issues, we propose to mine visual features semantically relevant to the search targets (see the results in Figure 1) and augment each image with such auxiliary visual features. As illustrated in Figure 5, these features are discovered from visual and textual graphs (clusters) in an unsupervised manner by distributed computing (i.e., MapReduce [1]). Moreover, to facilitate object-level indexing and retrieval, we incorporate the idea of pseudo-objects [4] into the inverted file paradigm and the pseudo-relevance feedback mechanism. A novel efficient




Figure 2: The system diagram. Offline part: We extract visual and textual features from images. Textual and visual
image graphs are constructed by an inverted list-based approach and clustered by an adapted affinity propagation
algorithm by MapReduce (18 Hadoop servers). Based on the graphs, auxiliary visual features are mined by
informative feature selection and propagation. Pseudo-objects are then generated by considering the spatial
consistency of salient local features. A compact inverted structure is used over pseudo-objects for efficiency. Online
part: To speed up image retrieval, we propose an efficient query evaluation approach for inverted indexing. The
retrieval process is then completed by relevance scoring and object-level pseudo-relevance feedback. It takes around
121ms to produce the final image ranking of image object retrieval over one-million image collections.

query evaluation method is also developed to remove unreliable features and further improve accuracy and efficiency.
     Experiments show that the automatically discovered auxiliary visual features are complementary to conventional query expansion methods. Its performance is significantly superior to the BoW model. Moreover, the proposed object-level indexing framework is remarkably efficient and takes only 121ms for searching over the one-million image collection.

              2.    SYSTEM OVERVIEW

Figure 2 shows a schematic plot of the proposed system, which consists of offline and online parts. In the offline part, visual features (VWs) and textual features (tf-idf of expanded tags) are extracted from the images. We then propagate semantically relevant VWs from the textual domain to the visual domain, and remove visually irrelevant VWs in the visual domain (cf. Section 4). All these operations are performed in an unsupervised manner on the MapReduce [1] platform, which is famous for its scalability. Operations including image graph construction, clustering, and mining over million-scale images can be performed efficiently. To further enhance efficiency, we index the VWs by the proposed object-level inverted indexing method (cf. Section 3). We incorporate the concept of pseudo-object and adopt compression methods to reduce memory usage.
     In the online part, an efficient retrieval algorithm is employed to speed up the query process without loss of retrieval accuracy. In the end, we apply object-level pseudo-relevance feedback to refine the search result and improve the recall rate. Unlike its conventional counterpart, the proposed object-level pseudo-relevance feedback places more importance on local objects instead of the whole image.

       3.    OBJECT-LEVEL INVERTED INDEXING

The inverted file is a popular way to index large-scale data in the information retrieval community [8]. Because of its superior efficiency, many recent image retrieval systems adopt the concept to index visual features (i.e., VWs). The intuitive way is to record each entry with <image ID, VW frequency> in the inverted file. However, to our best knowledge, most systems simply adopt the conventional method to the visual domain, without considering the differences between documents and images, where the image query is composed of thousands of (noisy) VWs and the object of interest may occupy small portions of the target images.

3.1.   Pseudo-Objects

Images often contain several objects, so we cannot take the whole image's features to represent each object. Each object has its distinctive VWs. Motivated by the novelty and promising retrieval accuracy in [4], we adopt the concept of pseudo-object—a subset of proximate feature points with its own feature vector to represent a local area. The example in Figure 4 shows that the pseudo-objects, efficiently discovered, can almost catch different objects; however, advanced methods such as efficient indexing or query expansion are not considered there. We further propose a novel object-level inverted indexing.

3.2.   Index Construction

Unlike document words, VWs have a spatial dimension. Neighboring VWs often correspond to the same object in an image, and an image consists of several objects. We adopt pseudo-objects and store the object information in the inverted file to support object-level image retrieval. Specifically, we construct an inverted list for each VW t as follows: <Image ID i, ft,i, RID1, ..., RIDf>, which indicates the ID of the image i where the VW appears, the occurrence frequency (ft,i), and the associated object region IDs (RIDf) in each image. The addition of the object ID to the inverted file makes it possible to search for a specific object even if the object only occupies a small region of an image.
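As a rough illustration of the posting-list layout just described, the sketch below builds an in-memory inverted file whose entries carry the visual-word frequency and the pseudo-object region IDs per image. The data structures and field names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed structures, not the authors' code): an inverted file
# whose postings carry, per image, the VW frequency and the pseudo-object
# region IDs in which the VW occurs, i.e. <image ID, f_t_i, RID_1, ..., RID_f>.
from collections import defaultdict
from typing import Dict, List, Tuple

# posting list: visual word id -> list of (image_id, frequency, [region_ids])
InvertedFile = Dict[int, List[Tuple[int, int, List[int]]]]

def build_index(images: Dict[int, List[Tuple[int, int]]]) -> InvertedFile:
    """images maps image_id -> list of (visual_word_id, region_id) occurrences."""
    index: InvertedFile = defaultdict(list)
    for image_id, occurrences in images.items():
        per_word: Dict[int, List[int]] = defaultdict(list)
        for vw, rid in occurrences:
            per_word[vw].append(rid)
        for vw, rids in per_word.items():
            index[vw].append((image_id, len(rids), rids))
    return index

if __name__ == "__main__":
    # Two toy images; region 0 conventionally denotes the whole image.
    toy = {1: [(7, 0), (7, 2), (9, 1)], 2: [(7, 3)]}
    print(build_index(toy)[7])  # -> [(1, 2, [0, 2]), (2, 1, [3])]
```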




Figure 3: Illustration of efficient query evaluation (cf.
Section 3). To achieve time efficiency, first, we rank a
visual word by its salience to the query and then retrieve
the designated number of candidate images (e.g., 7
images, A to G). After deciding the candidate images,
we skip the irrelevant images and cut those non-salient
VWs.
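Figure 3 summarizes the efficient query evaluation that Section 3.5 details. The sketch below is a rough, simplified rendering of those phases under assumed in-memory data layouts (per-word posting lists of (image ID, weight) pairs); the candidate count and cut ratio are illustrative parameters only, not the paper's settings.

```python
# Rough sketch (assumed data layout, not the authors' code) of the efficient
# query evaluation in Figure 3: rank query VWs by salience, collect candidate
# images from the top postings, then score candidates only, skipping the rest
# and cutting the least salient VWs once a cut ratio is reached.
from collections import defaultdict
from typing import Dict, List, Tuple

def eqe_search(query: Dict[int, float], idf: Dict[int, float],
               index: Dict[int, List[Tuple[int, float]]],
               n_candidates: int = 100, cut_ratio: float = 0.8) -> List[Tuple[int, float]]:
    # 1) Query term ranking: order VWs by salience w_{t,Q} * IDF_t.
    terms = sorted(query, key=lambda t: query[t] * idf.get(t, 0.0), reverse=True)
    scores: Dict[int, float] = defaultdict(float)
    candidates: set = set()
    for k, t in enumerate(terms):
        postings = index.get(t, [])
        if len(candidates) < n_candidates:
            # 2) Collecting phase: top postings of salient VWs become candidates.
            candidates.update(img for img, _ in postings[:n_candidates])
        elif k > cut_ratio * len(terms):
            break  # 4) Cutting phase: drop the remaining non-salient VWs.
        # 3) Skipping phase: score candidate images only, skip all others.
        for img, w in postings:
            if img in candidates:
                scores[img] += idf.get(t, 0.0) * min(w, query[t])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```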



Figure 4: Object-level retrieval results by pseudo-objects and object-level pseudo-relevance feedback. The letter below each image represents the region (pseudo-object) with the highest relevance to the query object by (2). The region information is essential for query expansion. Instead of using the whole image as the seed for retrieving other related images, we can easily identify those related objects (e.g., R0, R5, R0) and mitigate the influence of noisy features. Note that the yellow dots in the background are detected feature points.

3.3.   Index Compression

Index compression is a common way to reduce memory usage in the textual domain. First, we discard the top 5% most frequent VWs as stop words to decrease the mismatch rate and reduce the size of the inverted file. We then adopt different coding methods to compress data based on their visual characteristics. Image IDs are ordinal numbers sorted in ascending order in the lists; thus we store the difference between adjacent image IDs instead of the image ID itself, which is called d-gap [8]. For region IDs, we adopt a fixed-length bit-level coding of three bits (e.g., R2 → 010). On the other hand, we use a variable-length bit-level coding to encode frequency (e.g., 3 → 1110). Furthermore, we implement AND and SHIFT operations to efficiently decode the frequency and region IDs at query time. The memory space for indexing pseudo-objects is reduced by about 54.1%.
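A rough sketch of this coding scheme is given below: image IDs are d-gapped, frequencies get a variable-length code, and region IDs are packed into fixed 3-bit fields. The unary-style frequency code and the plain binary gap encoding are assumptions made to keep the example short; they are not necessarily the exact codes used in the paper.

```python
# Minimal sketch (assumed coding details, not the authors' exact scheme):
# d-gap the sorted image IDs, encode each frequency in unary (f ones then a
# zero, e.g. 3 -> "1110"), and pack each region ID into a fixed 3-bit field.
from typing import List, Tuple

def encode_postings(postings: List[Tuple[int, int, List[int]]]) -> str:
    """postings: (image_id, freq, region_ids) sorted by image_id; returns a bit string."""
    bits, prev_id = [], 0
    for image_id, freq, rids in postings:
        gap = image_id - prev_id                     # d-gap instead of the raw image ID
        prev_id = image_id
        bits.append(format(gap, "b"))                # gap bits (a real system would use e.g. gamma coding)
        bits.append("1" * freq + "0")                # unary-style frequency code
        bits.extend(format(r, "03b") for r in rids)  # fixed 3-bit region IDs (R0..R7)
    return "".join(bits)

if __name__ == "__main__":
    print(encode_postings([(1, 2, [0, 2]), (2, 1, [3])]))
```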

3.4.   Object-Level Scoring Method

We use the intersection of TF-IDF, which performs the best for matching, to calculate the score of each region indexed by VW t. Besides the discovered pseudo-objects, we also define a new object R0 to treat the whole image as another object. We first calculate the score of every pseudo-object (R) to the query object (Q) as follows,

          score(R, Q) = ∑_{t∈Q} IDF_t × min( w_{t,R}, w_{t,Q} ),               (1)

where w_{t,R} and w_{t,Q} are the normalized VW frequencies in the pseudo-object and in the query respectively. The pseudo-object with the highest score is then regarded as the most relevant object with respect to the query, as suggested in [4]:

          score(i, Q) = max{ score(R, Q) | R ∈ i }.                            (2)

3.5.   Efficient Query Evaluation (EQE)

Conventional query evaluation in inverted indexing needs to keep track of the scores of all images in the inverted lists. In fact, it is observed that most of the scored images contain only a few matched VWs. We propose an efficient query evaluation (EQE) algorithm that explores only a small part of a large-scale database to reduce the online retrieval time. The procedures of EQE are described below and illustrated in Figure 3.
1.  Query term ranking: The ranking score in (1) favors query terms with higher frequency and IDF_t; therefore, we sort the query terms according to their salience, which is calculated as w_{t,Q} × IDF_t for VW t. The following phases then process the VWs sequentially, ordered and weighted by their visual significance to the query.
2.  Collecting phase: In the retrieval process, the user only cares about the images in the top ranks.




Figure 5: Image clustering results and mining auxiliary visual words: (a) visual cluster example, (b) representative VW selection, (c) example results, (d) auxiliary VW propagation, (e) textual cluster example. (a) and (e) show the sample visual and textual clusters; the former keeps visually similar images in the same cluster, while the latter favors semantic similarities. The former facilitates representative VW selection, while the latter facilitates semantic (auxiliary) VW propagation. (b) and (d) illustrate the selection and propagation operations based on the cluster histogram, as detailed in Section 4. A simple example is shown in (c).


     Therefore, instead of calculating the score of each image, we score the top images of the inverted lists and add them to a set S until we have collected a sufficient number of candidate images.
3.  Skipping phase: After deciding the candidate images, we skip the images that do not appear in the collecting phase. For every image i in the inverted list, score the image i if i ∈ S; otherwise skip it. If the number of visited VWs reaches a predefined cut ratio, go on to the next phase.
4.  Cutting phase: Simply remove the remaining VWs, which usually have little influence on the results. The process then stops here.
     This algorithm works remarkably well, bringing about almost the same retrieval quality with much less computational cost. As image queries are generally composed of thousands or hundreds of (noisy) VWs, rejecting those non-salient VWs significantly improves the efficiency and slightly improves the accuracy.

3.6.   Object-Level Pseudo-Relevance Feedback (OPRF)

The conventional approach of using whole images for pseudo-relevance feedback (PRF) may not perform well when only a part of the retrieved images are relevant. In such a case, many irrelevant objects would be included in PRF, resulting in too many query terms (or noises) and degrading the retrieval accuracy. To tackle this issue, a novel object-level pseudo-relevance feedback (OPRF) algorithm is proposed. Rather than using the whole images, we select the most important objects from each of the top-ranked images and use them for PRF. The importance of each object is estimated according to (2). By selecting relevant objects in each image (e.g., R0, R5, R0 in Figure 4), we can further remove irrelevant objects such as the toy in R4 of the second image.

       4.  AUXILIARY VISUAL WORD (AVW) DISCOVERY

Due to the limitation of VWs, it is difficult to retrieve images with different viewpoints, lighting conditions, occlusions, etc. To improve the recall rate, query expansion is the most adopted method; however, it is limited by the quality of the initial retrieval results. Instead, in an offline stage, we augment each image with auxiliary visual features, considering representative (dominant) features in its visual clusters and semantically related features in its textual graph respectively. Such auxiliary visual features can significantly improve the recall rate, as demonstrated in Figure 1. We can deploy all the processes in a parallel way by MapReduce [1]. Besides, a by-product of auxiliary visual word discovery is the reduction of the number of indexed visual features for each image, for better efficiency in time and memory. Moreover, it is easy to embed the auxiliary visual features in the proposed indexing framework by adding one new region for those discovered auxiliary visual features not existing in the original VW set.

4.1.   Image Clustering by MapReduce

The image clustering is first based on a graph construction. The images are represented by 1M VWs and 50K text tokens expanded by Google snippets from their associated (noisy) tags. However, it is very challenging to construct image graphs for million-scale images. To tackle the scalability problem, we construct
image graphs using the MapReduce model [1], a scalable framework that simplifies distributed computations.
     We take advantage of the sparseness and use the cosine measure as the similarity measure. Our algorithm extends the method proposed in [2], which uses a two-phase MapReduce model—an indexing phase and a calculation phase—to calculate pairwise similarities. It takes around 42 minutes to construct a graph of 550K images on 18-node Hadoop servers. To cluster images on the image graph, we apply affinity propagation (AP) proposed in [3]. AP is a graph-based clustering algorithm. It passes and updates messages among nodes on the graph iteratively and locally—associating with the sparse neighbors only. It takes around 20 minutes for each iteration, and AP generally converges in around 20 iterations (~400 minutes) for 550K images by the MapReduce model.
     The image clustering results are sampled in Figure 5(a) and (e). Note that if an image is close to the canonical image (center image), it has a higher AP score, indicating that it is more strongly associated with the cluster. Moreover, images in the same visual cluster are often visually similar to each other, whereas some of the images in the same textual cluster differ in view, lighting condition, angle, etc., and can potentially bring complementary VWs for other images in the same textual cluster.

4.2.   Representative Visual Word Selection

We first propose to remove irrelevant VWs in each image to mitigate the effect of noise and quantization error, to reduce memory usage in the inverted file system, and to speed up search efficiency. We observe that images in the same visual cluster are visually similar to each other (cf. Figure 5(a)). As illustrated in Figure 5(c), the middle image can then have representative VWs from the visual cluster it belongs to. We accumulate the number of each VW from the images of a cluster to form a cluster histogram. As shown in Figure 5(b), each image contributes the same weight to the cluster histogram. We can then select the VWs whose occurrence frequency is above a predefined threshold (e.g., in Figure 5(b) the VWs in red rectangles are selected).

4.3.   Auxiliary Visual Word Propagation

Due to varying capture conditions, some VWs that strongly characterize the query object may not appear in the query image. It is also difficult to obtain these VWs through query expansion methods such as PRF because of the difference in visual appearance between the query image and the retrieved ones. Mining semantically relevant VWs from other information sources such as text is therefore essential to improve the retrieval accuracy.
     As illustrated in Figure 5(e), we propose to augment each image with VWs propagated from the textual cluster result. This is based on the observation that images in the same textual cluster are semantically close but usually visually different. Therefore, these images provide a comprehensive view of the same object. Propagating the VWs from the textual domain can therefore enrich the visual descriptions of the images. As the example in Figure 5(c) shows, the bottom image can obtain auxiliary VWs with the different lighting condition of the Arc de Triomphe. The similarity score can be weighted to decide the number of VWs to be propagated. Specifically, we derive the VW histogram from the images of each cluster and then propagate VWs based on the cluster histogram weighted by its (semantic) similarity to the canonical image of the textual cluster.

4.4.   Combining Selection and Propagation

The selection and propagation operations described above can be performed iteratively. The selection operation removes visually irrelevant VWs and improves memory usage and efficiency, whereas the propagation operation obtains semantically relevant VWs to improve the recall rate. Though propagation may include too many VWs and thus decrease the precision, we can perform selection after propagation to mitigate this effect.
     A straightforward approach is to iterate the two operations until convergence. However, we find that it is enough to perform a selection first, a propagation next, and finally a selection, for the following reasons. First, only the propagation step updates the auxiliary visual features, and the textual cluster images are fixed; each image will obtain distinctive VWs at the first propagation step. The subsequent propagation steps will only modify the frequency of the VWs. As the objective is to obtain distinctive VWs, frequency is less important here. Second, binary feature vectors perform better than, or at least comparably to, real-valued ones.

                    5.  EXPERIMENTS

5.1.   Experimental Setup

We evaluate the proposed methods using a large-scale photo retrieval benchmark—Flickr550 [7]. Besides, we randomly add Manhattan photos to Flickr550 to make it a 1-million dataset. As suggested in much of the literature (e.g., [5]), we use the Hessian-affine detector to extract feature points in images. The feature points are described by SIFT and quantized into 1 million VWs for better performance. In addition, we use the average precision to evaluate the retrieval accuracy. Since average precision only shows the performance for a single image query, we compute the mean average precision (MAP) to represent the system performance over all the queries.
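For reference, the two evaluation measures mentioned above can be computed as in the short sketch below; it is a generic AP/MAP implementation and is not tied to the Flickr550 tooling or query set.

```python
# Minimal sketch (generic implementation, not the benchmark's scripts): average
# precision for one query and mean average precision (MAP) over all queries.
from typing import List, Set

def average_precision(ranked_ids: List[int], relevant: Set[int]) -> float:
    """AP of one ranked result list against the set of relevant image IDs."""
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this recall point
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(runs: List[List[int]], truths: List[Set[int]]) -> float:
    return sum(average_precision(r, t) for r, t in zip(runs, truths)) / len(runs)

if __name__ == "__main__":
    print(mean_average_precision([[3, 1, 7], [2, 9]], [{1, 3}, {5}]))  # -> 0.5
```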




5.2.   Experimental Results

Table 1: Summary of the impacts on performance and query time compared with the baseline methods. It can be found that our proposed methods achieve better retrieval accuracy and respond to a user query in 121ms over one-million photo collections. The number in parentheses indicates the relative gain over the baseline, and the symbol ‘%’ stands for the relative improvement over the BoW model [6].

               (a) Image object retrieval
     MAP                    Baseline    PRF              OPRF
     Pseudo-objects [4]     0.251       0.290 (+15.5%)   0.324 (+29.1%)

               (b) Time efficiency
                            Flickr550                       One-million
                            Pseudo-objects [4]    +EQE      +EQE
     Query time (ms)        854                   56        121

               (c) Recall rate improvement
                  BoW model [6]    AVW       AVW+OPRF
     MAP          0.245            0.352     0.487
     %            -                43.7%     98.8%

We first evaluate the performance of object-level PRF (OPRF) in boosting the retrieval accuracy. As shown in Table 1(a), OPRF outperforms PRF by a great margin (relative improvement 29.1% vs. 15.5%). The result shows that the pseudo-object paradigm is essential for PRF-based query expansion in object-level image retrieval, since the targets of interest might only occupy a small portion of the images.
     We then evaluate the query time of object-level inverted indexing augmented with efficient query evaluation (EQE) to achieve time efficiency. The query time is 15.2 times faster (854 → 56) after combining with the EQE method, as shown in Table 1(b). The reasons are attributed to the selection of salient VWs and the ignoring of insignificant inverted lists. This is essential since, unlike textual queries with 2 or 3 query terms, an image query might contain thousands (or hundreds) of VWs. Therefore, we can respond to a user query in 121ms over one-million photo collections.
     Finally, to improve recall, we evaluate the performance of auxiliary visual word (AVW) discovery. As shown in Table 1(c), the combination of selection, propagation and further OPRF brings a 99% relative improvement over the BoW model and reduces one-fifth of the feature points. This result shows that the selection and propagation operations are effective in mining useful features and removing the irrelevant ones. In addition, the relative improvement of AVW (+44%) is orthogonal and complementary to OPRF (0.352 → 0.487, +38%).

                    6.    CONCLUSIONS

In this paper, we cover four aspects of a large-scale retrieval system: 1) image object retrieval over one-million image collections—responding to user queries in 121ms, 2) the impact of object-level pseudo-relevance feedback—boosting retrieval accuracy, 3) time efficiency with efficient query evaluation in the inverted file paradigm—compared with the traditional inverted file structure, and 4) image object retrieval based on effective auxiliary visual feature discovery—improving the recall rate. That is to say, the efficiency and effectiveness of the proposed methods are validated over large-scale consumer photos.

                        REFERENCES

[1]   J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” OSDI, 2004.

[2]   T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” ACL, 2008.

[3]   B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, 2007.

[4]   K.-H. Lin, K.-T. Chen, W. H. Hsu, C.-J. Lee, and T.-H. Li, “Boosting object retrieval by estimating pseudo-objects,” ICIP, 2009.

[5]   J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” CVPR, 2007.

[6]   J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” ICCV, 2003.

[7]   Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W. H. Hsu, and H. Chen, “ContextSeer: context search and recommendation at query time for shared consumer photos,” ACM MM, 2008.

[8]   J. Zobel and A. Moffat, “Inverted files for text search engines,” ACM Computing Surveys, 2006.




Sport Video Highlight Extraction Based on
            Kernel Support Vector Machines
                                         Po-Yi Sung, Ruei-Yao Haung, and Chih-Hung Kuo,
                                                Department of Electrical Engineering
                                                 National Cheng Kung University
                                                          Tainan, Taiwan
                                        { n2895130 , n2697169 , chkuo }@mail.ncku.edu.tw



Abstract—This paper presents a generalized highlight extraction method based on kernel support vector machines (Kernel SVM) that can be applied to various types of sport video. The proposed method extracts highlights without any predefined rules for the highlight events. The framework is composed of a training mode and an analysis mode. In the training mode, the Kernel SVM is applied to train a classification plane for a specific type of sport using shot features of selected video sequences. A genetic algorithm (GA) is then adopted to optimize the kernel parameters and select features to improve the classification accuracy. In the analysis mode, we use the classification plane to generate the video highlights of the sport video. Accordingly, viewers can access important segments quickly without watching through the entire sport video.

   Keywords—Highlight extraction; Sport analysis; Kernel support vector machines; Genetic algorithm

                       I.    INTRODUCTION

    Due to the rapid growth of multimedia storage technologies, such as Portable Multimedia Player (PMP), HD DVD and Blu-ray DVD, large amounts of video content can be saved in a small piece of storage device. However, people may not have sufficient time to watch all the recorded programs. They may prefer skipping less important parts and only watching the remarkable segments, especially for sport videos. Highlight extraction is a technique that uses video content analysis to index significant events in video data, and thereby helps viewers access the desired parts of the content more efficiently. This technique can also assist the processes of summarization, retrieval, and abstraction from large video databases.
    In this paper, we focus on highlight extraction techniques for sport videos. Many works have been proposed that can identify objects that appear frequently in sport highlights. Xiong [1] proposes a technique that extracts audio and video objects that frequently appear in highlight scenes, like applause, the baseball catcher, the soccer goalpost, and so on. Tong [2] characterized three essential aspects of sport videos: focus ranges of the camera, object types, and video production techniques. Hanjalic et al. [3]-[4] measured three factors, that is, motion activity, density of cuts, and audio energy, with a derived function to detect highlights. In [5], Duan proposes a technique that searches shots with goalposts and excited voices to find highlights for soccer programs. To locate scenes of the goalposts in football games, the technique of Chang [6] detects white lines in the field, and then verifies touch-down shots via audio features. Wan [7] detects voices in commentaries with high volume, combined with the frequency of shot change and other visual features, to locate goal events. Huang [8] exploited color and motion information to find logo objects in the replays of sport video. All these techniques depend on predefined rules for a single specific type of sport video, and as a result may need lots of human effort to analyze the video sequences and identify the proper objects for highlights in the particular type of sport.
    Many other techniques have employed probabilistic models, such as Hidden Markov Models (HMM), to look for the correlations of events and the temporal dependency of features [9]-[15]. The selected scene types are represented by hidden states, and the state transition probabilities can be evaluated by the HMM. Highlights can be identified accurately by some specific transition rules. However, it is hard to include all types of highlight events in the same set of rules, and the model may fail to detect highlights if the video features are different from the original ones. Cheng [16] proposed a likelihood model to extract audio and motion features, and employed the HMM to detect the transition of the integrated representation for the highlight segments. These kinds of methods all need to estimate the probabilities of state transitions, which have to be set up through intense human observation.
    Most of the previous researches have adopted rule-based methods, whereby the rules are heuristically set to describe the dynamics among objects and scenes in the highlight events of a specific sport. The rules set for one kind of sport video usually cannot be applied to the other kinds. In [17], we have proposed a more generalized technique based on low-level semantic features. In this approach, we can generate highlight tempo curves without defining complicated transitions among hidden states, and hence we can apply this technique to various kinds of videos.



    In this paper, we extend our technique [17] and incorporate it into the framework of kernel support vector machines (Kernel SVM). For each type of sport video, a small number of highlight shots are input so that some unified features can be extracted. We then apply the Kernel SVM system to train the classification plane, and utilize the trained classification plane to analyze other input videos of the same sport type, generating the highlight shots.
    The rest of this paper is organized as follows. Section II presents the overview of the proposed system. Section III details the methods for highlight shot classification and highlight shot generation. The highlight extraction performance and experimental results are shown in Section IV. Section V is the conclusion.

          II.  PROPOSED HIGHLIGHT SHOT EXTRACTION SYSTEM OVERVIEW

    Fig. 1 shows the four stages of the proposed scheme: (1) shot change detection, (2) visual and audio feature computation, (3) Kernel SVM training and analysis, and (4) highlight shot generation. In the first stage, histogram differences are computed to detect the shot change points. In the second stage, the feature parameters of each shot are computed and taken as the input eigenvalues into the Kernel SVM training and analysis system. The shot eigenvalues include shot length (L), color structure (C), shot frame difference (Ds), shot motion (Ms), keyframe difference (Dkey), keyframe motion (Mkey), Y-histogram difference (Yd), sound energy (Es), sound zero-crossing rate (Zs) and short-time sound energy (Est). They are collected as a feature set for the i-th shot:

          Vi = { L, C, Ds, Ms, Dkey, Mkey, Yd, Es, Zs, Est }                       (1)

    In the third stage, the Kernel SVM either trains the parameters or analyzes the input features, according to the mode of the system. Then, in the last stage, highlight shots are generated based on the output of the Kernel SVM. We explain the first two stages in the following, and the other two stages are explained in Section III.

A. Shot Change Detection
    The task in this stage is to detect the transition point from one scene to another. Histogram differences of two consecutive frames are calculated by (2) to detect the shot changes in video sequences. A shot change is said to be detected if the histogram difference is greater than a predefined threshold. The pixel values that are employed to calculate the histogram contain luminance only, since the human visual system is more sensitive to luminance (brightness) than to colors. The histogram difference is computed by the equation

          D_I = ( Σ_{i=0}^{255} | H_I(i) − H_{I−1}(i) | ) / N                      (2)

where N is the total pixel number in a frame, and H_I(i) is the number of pixels of level i in the I-th frame. Finally, the video sequence is separated into several shots according to the shot change detection results.

B. Visual and Audio Features Computation
    Each shot may contain many frames. To reduce the computation complexity, we select a keyframe to represent the shot. In this work, we simply define the 10th frame of each shot as the keyframe, since it is usually more stable than the earlier frames, which may contain mixed frames during the scene transition. Many of the following features are extracted from this keyframe.
    1) Shot Length
    We designate the frame number in each shot as the shot length (L). Experiments show that shot lengths are shorter for non-highlight shots, such as shots of judges or scenes with special effects, and a highlight shot is often longer than a non-highlight shot. For example, pitching in baseball games and shooting a goal in soccer games usually have longer shot lengths. Hence, the shot length is an important feature for the highlights and is included as one of the input eigenvalues.
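A minimal sketch of the shot segmentation in Section II-A is shown below: frames are cut into shots wherever the normalized luminance-histogram difference of (2) exceeds a threshold, and each resulting shot's frame count directly gives the shot-length feature L used in (1). The threshold value and the frame representation are assumptions for illustration only.

```python
# Rough sketch (assumed I/O and threshold, not the authors' code) of the shot
# segmentation in Section II-A using the histogram difference D_I of (2).
import numpy as np
from typing import List

def histogram_difference(prev_y: np.ndarray, curr_y: np.ndarray) -> float:
    """D_I = sum_i |H_I(i) - H_{I-1}(i)| / N over 256 luminance levels."""
    h_prev, _ = np.histogram(prev_y, bins=256, range=(0, 256))
    h_curr, _ = np.histogram(curr_y, bins=256, range=(0, 256))
    return float(np.abs(h_curr - h_prev).sum()) / curr_y.size

def segment_shots(luma_frames: List[np.ndarray], threshold: float = 0.6) -> List[int]:
    """Return the frame indices where a new shot starts (shot-change points)."""
    cuts = [0]
    for i in range(1, len(luma_frames)):
        if histogram_difference(luma_frames[i - 1], luma_frames[i]) > threshold:
            cuts.append(i)
    return cuts  # shot length L of shot k is cuts[k+1] - cuts[k]
```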
Figure 1. The proposed highlight shots extraction system. (Block diagram: video and audio data → shot change detection → visual and audio feature computation → Kernel SVM training and analysis modes, with GA parameter optimization and feature selection and per-sport training data such as baseball, basketball, and soccer → highlight shots generation.)

    2) MPEG-7 Color Structure
    The color structure descriptor (C) is defined in the MPEG-7 standard [18,19] to describe the structuring property of video contents. Unlike the simple statistic of histograms, it counts the color histograms based on a moving window called the structuring element. The descriptor value of the corresponding bin in the color histogram is increased by one if the specified color is within the structuring element. Compared to the simple statistic of one histogram, the color structure descriptor can better reflect the grouping properties of a picture. A smaller C value means the image is more structured. For example, both of the two monochrome images in Fig. 2 have 85 black pixels, and hence their histograms are the same. The color structure descriptor C of the image in Fig. 2-(a) is 129, while the image in Fig. 2-(b) is more scattered with the C value 508. Fig. 3 depicts the curve of the C values in the video of a baseball game. It shows that pictures with a scattered structure usually have higher C values.




                                                                                                                       1078
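As a rough illustration of the idea behind the structuring-element counting (our own simplified binary sketch, not the MPEG-7 descriptor itself, which operates on quantized HMMD colors and scaled structuring elements; all function and variable names are ours), the following Python code slides an 8x8 window over a binary image and counts the windows that contain at least one foreground pixel. A compact arrangement of pixels yields a small count, a scattered one a large count, mirroring the contrast between Fig. 2-(a) and Fig. 2-(b).

    import numpy as np

    def color_structure_binary(mask, elem=8):
        """Count sliding structuring-element windows containing at least one
        foreground pixel.  mask: 2-D boolean array; elem: window side length.
        A lower count means the pixels are more grouped ("structured")."""
        h, w = mask.shape
        count = 0
        for y in range(h - elem + 1):
            for x in range(w - elem + 1):
                if mask[y:y + elem, x:x + elem].any():
                    count += 1
        return count

    # Toy example: the same number of foreground pixels, arranged compactly
    # versus scattered, gives very different descriptor values.
    compact = np.zeros((64, 64), dtype=bool)
    compact[10:20, 10:20] = True                     # 100 clustered pixels

    rng = np.random.default_rng(0)
    scattered = np.zeros((64, 64), dtype=bool)
    idx = rng.choice(64 * 64, size=100, replace=False)
    scattered.flat[idx] = True                       # 100 scattered pixels

    print(color_structure_binary(compact))    # small value (structured)
    print(color_structure_binary(scattered))  # much larger value (scattered)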
Figure 2. The MPEG-7 Color Structure: (a) a highly structured monochrome image; (b) a scattered monochrome image. Both have the same histogram.

Figure 3. MPEG-7 Color Structure Descriptor curve in a baseball game.

    In this paper, we perform edge detection before calculating the color structure descriptors. The resultant C value of each keyframe is regarded as an eigenvalue and included in the input data set of Kernel SVM.
  3) Shot Frame Difference
    The average shot frame difference (Ds) of each shot is defined by

    D_s = \frac{1}{L-1} \sum_{n=1}^{L-1} \left( \frac{1}{WH} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| f_n(i,j) - f_{n-1}(i,j) \right| \right)                (3)

where W and H are the frame width and height respectively, and f_n(i, j) is the pixel intensity at position (i, j) in the n-th frame. This feature shows the frame activities in a shot. In general, highlight shots have higher Ds values than non-highlight shots.
  4) Shot Motion
    To measure the motion activity, we first partition a frame into square blocks of the size K-by-K pixels, and perform motion estimation to find the motion vector of each block [20]. The shot motion Ms is defined as the average magnitude of motion vectors by

    M_s = \frac{1}{(L-1) \cdot WH/K^2} \sum_{n=1}^{L-1} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \sqrt{ MV_{x,n}(i,j)^2 + MV_{y,n}(i,j)^2 }                (4)

where W and H are the block numbers in the horizontal and vertical directions respectively. MV_{x,n}(i, j) and MV_{y,n}(i, j) are the motion vectors in the x and y directions respectively, of the block at the i-th row and j-th column in the n-th frame of the shot. The motion vector of a block represents the displacement in the reference frame from the co-located block to the best matched square, and is searched by minimizing the sum of absolute error (SAE) [21].

    SAE = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \left| C(i,j) - R(i,j) \right|                (5)

where C(i, j) is the pixel intensity of a current block at relative position (i, j), and R(i, j) is the pixel intensity of a reference block.
  5) Keyframe Difference and Keyframe Motion
    We calculate the frame difference and estimate the motion activity between the keyframe and its next frame. Suppose the k-th frame is a keyframe. The keyframe difference Dkey of the shot is defined by

    D_{key} = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| f_k(i,j) - f_{k+1}(i,j) \right|                (6)

where f_k(i, j) represents the intensity of the pixel at position (i, j) in the k-th frame. Similarly, the keyframe motion Mkey represents the average magnitude of the motion vectors inside the keyframe and is defined as

    M_{key} = \frac{1}{WH/K^2} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \sqrt{ MV_x(i,j)^2 + MV_y(i,j)^2 }                (7)

where MV_x(i, j) and MV_y(i, j) denote the components of the motion vectors in the x- and y-directions respectively.
  6) Y-Histogram Difference
    The average Y-histogram difference is calculated by

    Y_d = \frac{1}{L-1} \sum_{n=1}^{L-1} \frac{ \sum_{i=0}^{255} \left| H_n(i) - H_{n-1}(i) \right| }{ WH }                (8)

where H_n(i) represents the number of pixels at level i, counted in the n-th frame. In general, the value of Yd is higher in the highlight shots.
  7) Sound Energy
    The sound energy Es is defined as

    E_s = \frac{1}{M} \sum_{n=1}^{M} \left| S(n) \cdot S(n) \right|                (9)

where S(n) is the signal strength of the n-th audio sample in a shot, and M is the total number of audio samples in the duration of the corresponding shot. In highlight shots, the sound energy is usually higher than in non-highlight shots.
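As an informal illustration of how the per-shot features of Eqs. (3), (6), (8) and (9) could be computed (our own sketch, not the authors' code; array names and shapes are assumptions, and the motion features of Eqs. (4) and (7) are omitted because they require a block-matching motion estimator [20, 21]):

    import numpy as np

    def shot_features(frames, audio, key_idx=0, levels=256):
        """Per-shot features following Eqs. (3), (6), (8), (9).
        frames:  (L, H, W) uint8 luminance frames of one shot
        audio:   (M,) audio samples of the same shot
        key_idx: index of the keyframe inside the shot (key_idx + 1 must exist)"""
        f = frames.astype(np.float64)
        L, H, W = f.shape

        # Eq. (3): average shot frame difference Ds
        ds = np.abs(np.diff(f, axis=0)).sum(axis=(1, 2)).mean() / (W * H)

        # Eq. (6): keyframe difference Dkey (keyframe vs. its next frame)
        dkey = np.abs(f[key_idx] - f[key_idx + 1]).sum()

        # Eq. (8): average Y-histogram difference Yd
        hists = np.stack([np.bincount(fr.ravel(), minlength=levels)
                          for fr in frames])
        yd = np.abs(np.diff(hists, axis=0)).sum(axis=1).mean() / (W * H)

        # Eq. (9): sound energy Es (mean squared sample strength)
        es = np.mean(audio.astype(np.float64) ** 2)

        return ds, dkey, yd, es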
  8) Sound Zero-crossing Rate
    We also adopt the zero-crossing rate (Zs) of the audio signals as one of the input features, since it is a simple indicator of the audio frequency information. Experiments indicate that the zero-crossing rate becomes higher in highlight shots. The zero-crossing rate is defined as

    Z_s = \frac{f_s}{2M} \sum_{i=1}^{M} \left| \mathrm{sign}[S(i)] - \mathrm{sign}[S(i-1)] \right|                (10)

where f_s is the audio sampling rate, and the sign function is defined by

    \mathrm{sign}[S(i)] = \begin{cases} 1, & \text{if } S(i) > 0 \\ 0, & \text{if } S(i) = 0 \\ -1, & \text{otherwise} \end{cases}                (11)

  9) Short-time Sound Energy
    Since the crowd sounds usually last for only one or two seconds, the sound energy cannot represent the crowd sounds in video shots with longer shot lengths. Thus, we select the short-time sound energy (Est) as one of the input eigenvalues. The short-time sound energy is defined as

    e(p) = \frac{1}{24000} \sum_{n=1}^{24000} \left| S_p(n) \cdot S_p(n) \right|,  E_{st} = \max\{ e(1), e(2), e(3), \ldots, e(m) \}                (12)

where S_p(n) is the signal strength of the n-th audio sample in the p-th second of the video shot, e(p) is the sound energy of the p-th second of the video shot, and m is the length of the video shot in seconds.

              III. HIGHLIGHT SHOT CLASSIFICATION METHOD

A. Kernel SVM Training and Analysis System
    In this work, the Kernel SVM is adopted to analyze the input videos and generate the highlight shots. In the training mode, the selected shots of a specific sport type are fed into the system to train the classification hyperplanes, and we apply a genetic algorithm (GA) to select features and optimize kernel parameters for the support vector machines. In the analysis mode, the system simply loads these pre-stored parameters and generates highlight shots for the input sport video. We explain the process in detail in the following.
  1) Support Vector Machines
    SVM is a machine learning technique first suggested by Vapnik [22] and has widespread applications in classification, pattern recognition and bioinformatics. Typical concepts of SVM are for solving binary classification problems [23]. The data may be multidimensional and form several disjoint regions in the space. The feature of SVM is to find the decision functions that optimally separate the data into two classes. In this section, we briefly explain the basic idea of constructing the SVM decision functions.
      a) Linear SVM
    Given a training set (x_1, y_1), (x_2, y_2), ..., (x_i, y_i), x_n \in R^n, y_n \in \{1, -1\}, n = 1, ..., i, where i is the total number of training data, each training data point x_n is associated with one of two classes characterized by a value y_n = \pm 1. In the linear SVM theory, the decision function is supposed to be a linear function and defined as

    f(x) = w^T x + b                (13)

where w, x \in R^n and b \in R; w is the weighting vector of hyperplane coefficients, x is the data vector in the space and b is the bias. The decision function lies halfway between two hyperplanes which are referred to as support hyperplanes. SVM is expected to find the linear function f(x) = 0 that separates the two classes of data. Fig. 4-(a) shows a decision function that separates two classes of data. For separable data, there are many possible decision functions. The basic idea is to determine the margin that separates the two hyperplanes and maximize the margin in order to find the optimal decision function. As shown in Fig. 4-(b), the two hyperplanes consist of data points which satisfy w^T x + b = 1 and w^T x + b = -1 respectively. For example, the data points x_1 of the positive class (y_n = +1) lead to a positive value and the data points x_2 of the negative class (y_n = -1) lead to a negative value. The perpendicular distance between the two hyperplanes is 2/\|w\|. In order to find the maximal margin and the optimal hyperplanes, we must find the smallest \|w\|. Therefore, the data points have to satisfy the condition as one set of inequalities

    y_j (w^T x_j + b) \ge 1, for j = 1, 2, 3, ..., i                (14)

The problem of solving for w and b can be reduced to the following optimization problem

    Minimize  \frac{1}{2} \|w\|^2
    subject to  y_j (w^T x_j + b) \ge 1, for j = 1, ..., i                (15)

This is a quadratic programming (QP) problem and can be solved by the following Lagrange function [24]:

    L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{j=1}^{i} \alpha_j \left[ y_j (w^T x_j + b) - 1 \right]                (16)

where \alpha_j denotes the Lagrange multiplier. The w, b, and \alpha_j at the optimum that minimize (16) are obtained. Then, the Karush-Kuhn-Tucker (KKT) conditions are followed to
simplify this optimization problem. Since the optimization problem has to satisfy the KKT conditions defined by

    w = \sum_{j=1}^{i} \alpha_j y_j x_j
    \sum_{j=1}^{i} \alpha_j y_j = 0                (17)
    \alpha_j \ge 0, for j = 1, ..., i
    \alpha_j \left[ y_j (w^T x_j + b) - 1 \right] = 0, for j = 1, ..., i

Substituting (17) into (16), the Lagrange function is transformed to the dual problem as follows

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k x_j^T x_k                (18)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  \alpha_j \ge 0, j = 1, ..., i

Solve this dual problem to find the Lagrange multipliers \alpha_j. Substitute \alpha_j into (19) to find the optimal w and b.

    w = \sum_{j=1}^{i} \alpha_j y_j x_j
    b = \frac{1}{N_{sv}} \sum_{s=1}^{N_{sv}} \left( \frac{1}{y_s} - w^T x_s \right)                (19)

where x_s are the data points whose Lagrange multipliers \alpha_j > 0 (the support vectors), y_s is the class of x_s, and N_sv is the number of x_s.
      b) Linear Generalized SVM
    In the case where the data are not linearly separable, as shown in Fig. 4-(c), the optimization problem in (15) will be infeasible. The concepts of linear SVM can also be extended to the linearly nonseparable case. Rewrite (14) as (20) by introducing non-negative slack variables \xi_j.

    y_j (w^T x_j + b) \ge 1 - \xi_j, for j = 1, ..., i                (20)

The above inequality constraints are minimized through a penalized objective function. Then the optimization problem can be written as

    Minimize  L(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{j=1}^{i} \xi_j                (21)
    Subject to  y_j (w^T x_j + b) \ge 1 - \xi_j, for j = 1, ..., i

where C is the penalty parameter. This optimization problem can also be solved by the Lagrange function and transformed to the dual problem as follows

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k x_j^T x_k                (22)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  0 \le \alpha_j \le C, j = 1, ..., i

Similarly, we can solve this dual problem and find the optimal w and b.
      c) Non-linear SVM
    The SVM can be extended to the case of nonlinear conditions by projecting the original data sets to a higher dimensional space, referred to as the feature space, via a mapping function \varphi. The nonlinear decision function is obtained by formulating the linear classification problem in the feature space. In nonlinear SVM, the inner products x_j^T x_k in (22) can be replaced by the kernel function k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k). Therefore, the dual problem in (22) can be replaced by the following equation

    Maximize  L(\alpha) = \sum_{j=1}^{i} \alpha_j - \frac{1}{2} \sum_{j,k=1}^{i} y_j y_k \alpha_j \alpha_k k(x_j, x_k)                (23)
    Subject to  \sum_{j=1}^{i} \alpha_j y_j = 0,  0 \le \alpha_j \le C, j = 1, ..., i

According to (19), we can also solve the above dual problem and find the optimal w and b. The classification is then obtained by the sign of

    \mathrm{sign}\left( \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \right)                (24)

      d) Types of Kernels
    The most commonly used kernel functions are the multivariate Gaussian radial basis function (MGRBF), the Gaussian radial basis function (GRBF), the polynomial function and the sigmoid function.
MGRBF:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = \exp\left( - \sum_{m=1}^{n} \frac{ (x_{jm} - x_{km})^2 }{ 2\sigma_m^2 } \right)                (25)

where \sigma_m \in R, x_{jm}, x_{km} \in R, and x_j, x_k \in R^n; x_{jm} is the m-th element of x_j, x_{km} is the m-th element of x_k, \sigma_m is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.
GRBF:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = \exp\left( - \frac{ \|x_j - x_k\|^2 }{ 2\sigma^2 } \right)                (26)

where \sigma \in R and x_j, x_k \in R^n; \sigma is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.
Polynomial function:

    k(x_j, x_k) = \varphi(x_j)^T \varphi(x_k) = (1 + x_j^T x_k)^d                (27)

where d is a positive integer and x_j, x_k \in R^n; d is the adjustable parameter of the polynomial kernel, and x_j, x_k are input data.
  2) Kernel SVM Input Data Structure
    In sport videos, a highlight event usually consists of several consecutive shots. Fig. 5 shows an example of a home run in a baseball game. It includes three consecutive shots: pitching and hitting, ball flying, and base running. Unlike many other highlight extraction algorithms that have to predefine the highlight events with specific constituting shots, we simply propose to collect the feature sets of several consecutive shots together as the input eigenvalues of the Kernel SVM.
  3) Kernel SVM Training Mode
    For the training mode, the data are processed in two steps: a) initialization, b) kernel parameters optimization and feature selection.
      a) Initialization of the Input Data
    The initialization process of the training mode is shown in Fig. 6. The video is partitioned into shots and divided into two sets: highlight shots and non-highlight shots. The eigenvalues of consecutive shots are collected as a data set. All data sets are composed into the input data vector. Then each eigenvalue is normalized into the range of [0, 100]. The order of the data sets in the input data vector is randomized.
      b) Kernel Parameters Optimization and Feature Selection
    Since the parameters in the kernel functions are adjustable, and in order to improve the classification accuracy, these kernel parameters should be properly set. In this process, we adopt the GA-based feature selection and parameter optimization method proposed by Huang [25] to select features and optimize kernel parameters for the support vector machines. Fig. 7 shows the flowchart of the feature selection and parameter optimization method.
    As shown in Fig. 7, we apply the GA to generate kernel parameters and select features to train the hyperplanes of the Kernel SVM. The processes that generate kernel parameters and select features using the GA are shown in Fig. 8. The GA start process includes generating chromosomes randomly and setting up parameters. The chromosome is represented in a binary coding format as shown in Fig. 9, where g_S^1 ~ g_S^{n_s}, g_C^1 ~ g_C^{n_c} and g_f^1 ~ g_f^{n_f} are the parameters of the kernel, the penalty factor and the features respectively. Here n_s, n_c, and n_f are the numbers of bits used to represent the above parameters. The parameters defined in the start process are the bits of parameters and features, the number of generations, the crossover and mutation rates, and the limitations of the parameters. The next step is to output the parameters and features to the Kernel SVM for training. In the selection step, we keep the two chromosomes with the maximum objective value (Of) obtained by (29) for the next generation. These chromosomes will not change in the following crossover and mutation steps. Fig. 10 shows the crossover and mutation operations. As shown in Fig. 10-(a), two new offspring are obtained by randomly exchanging genes between two chromosomes using one-point crossover. After the crossover operation, as shown in Fig. 10-(b), the binary-code genes are occasionally changed from 0 to 1 or vice versa, which is called the mutation operation. Finally, a new generation is obtained and the parameters and features are output again. These processes terminate when the predefined number of generations is reached.
    In this paper, we adopt the precision and recall rates to evaluate the performance of our system. The precision (P) and recall (R) rates are defined as follows

    P = \frac{SN_c}{SN_e},  R = \frac{SN_c}{SN_t}                (28)

where SN_c, SN_e, and SN_t are the numbers of correctly extracted highlight shots, extracted highlight shots, and actual highlight shots respectively.
    In the objective function calculation step, we calculate the objective value (Of) to evaluate the kernel parameters and the selected features generated by the GA. The objective value is calculated by the following equation

    O_f = 0.5 P + 0.5 R                (29)

These steps terminate when the predefined number of generations has been reached, and finally we select the kernel parameters and features which have the maximum objective value.
  4) Kernel SVM Analysis Mode
    In the analysis mode, the user has to select a sport type. The Kernel SVM system directly loads the pre-trained classification function corresponding to the sport type. The classification function is defined as (30), where C_x is the class of a video shot: C_x = +1 represents shots that belong to the highlight shots, and C_x = -1 represents non-highlight shots. This process can be performed very quickly, since the kernel parameters and features do not need to be trained again.

    C_x = \mathrm{sign}\left( \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \right)
        = \begin{cases} +1, & \text{if } \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b \ge 0 \\ -1, & \text{if } \sum_{j=1}^{i} y_j \alpha_j k(x, x_j) + b < 0 \end{cases}                (30)
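To make the analysis mode concrete, the following Python sketch (our own illustration, not the authors' implementation) evaluates the decision function of (30) for a normalized shot feature vector using a pre-trained set of support vectors, multipliers and bias. The GRBF kernel of (26) is used here for simplicity, and all names and the toy parameter values are assumptions.

    import numpy as np

    def grbf_kernel(xj, xk, sigma=1.0):
        """Gaussian radial basis function kernel, Eq. (26)."""
        return np.exp(-np.sum((xj - xk) ** 2) / (2.0 * sigma ** 2))

    def classify_shot(x, support_vectors, alphas, labels, b, sigma=1.0):
        """Decision function of Eq. (30): +1 = highlight shot, -1 = non-highlight.
        support_vectors: (N, d) training vectors with alpha_j > 0
        alphas, labels:  (N,) Lagrange multipliers and class labels y_j
        b:               scalar bias obtained in the training mode"""
        s = sum(y * a * grbf_kernel(x, sv, sigma)
                for sv, a, y in zip(support_vectors, alphas, labels))
        return 1 if s + b >= 0 else -1

    # Toy usage with made-up pre-trained parameters (illustration only).
    sv = np.array([[0.2, 0.9], [0.8, 0.1]])
    alphas = np.array([0.7, 0.7])
    labels = np.array([+1, -1])
    x_new = np.array([0.25, 0.85])        # normalized feature vector of a shot
    print(classify_shot(x_new, sv, alphas, labels, b=0.0))   # -> +1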
Figure 4. Linear decision function separating two classes: (a) Decision function separating the positive class from the negative class; (b) The margin that separates the two hyperplanes; (c) The case of linearly non-separable data sets.

Figure 5. A home run event in a baseball game: (a) pitching and hitting; (b) ball flying; (c) base running.

Figure 6. The initialization of training data.

Figure 7. The flowchart of the feature selection and parameters optimization method.

Figure 8. Genetic algorithm to generate parameters and features.

Figure 9. Chromosome.

Figure 10. (a) Crossover operation; (b) Mutation operation.
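As a rough illustration of the one-point crossover and bit-flip mutation operations depicted in Fig. 10 (our own sketch, not the authors' code; the chromosomes here are arbitrary bit strings and the mutation rate is an assumed value):

    import random

    def one_point_crossover(parent1, parent2):
        """Exchange the tails of two bit-string chromosomes at a random
        point, as in Fig. 10-(a)."""
        point = random.randint(1, len(parent1) - 1)
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
        return child1, child2

    def mutate(chromosome, rate=0.05):
        """Flip each gene from 0 to 1 or vice versa with a small
        probability, as in Fig. 10-(b)."""
        return ''.join(('1' if g == '0' else '0') if random.random() < rate else g
                       for g in chromosome)

    # Toy usage on 8-bit chromosomes (values are illustrative only).
    random.seed(1)
    c1, c2 = one_point_crossover('01011111', '00010010')
    print(c1, c2)
    print(mutate(c1, rate=0.1))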
IV.     EXPERIMENTAL RESULTS                                     [2]    X. Tong, L. Duan, H. Lu, C. Xu, Q. Tian and J. S. Jin, „A mid-level
                                                                                          visual concept generation framework for sports analysis‟, Proc. IEEE
The experimental setup for different sport types are listed in                            ICME, July 2005, pp. 646–649.
Table I. For the baseball game, we take hits, home runs,                           [3]    A. Hanjalic, „Multimodal approach to measuring excitement in
strike out, steal, and replay as highlight events. For                                    video‟, Proc. IEEE ICME, July 2003, pp. 289–292.
basketball game, the highlight events are dunks, three-point                       [4]    A. Hanjalic, „Generic approach to highlights extraction from a sport
shots, jump shots, bank shots and replays. For soccer game,                               video‟, Proc. IEEE ICIP, Sept. 2003, pp. I - 1–4.
we set highlight events as goals, long shoots, close-range                         [5]    L. Y. Duan, M. Xu, T. S. Chua, Q. Tian, and C. S.Xu, „A mid-level
shoots, free kicks, corner kicks, break through, and replays.                             representation framework for semantic sports video analysis‟, Proc.
                                                                                          ACM Multimedia, Nov. 2003, pp. 33–44.
                                                                                   [6]    Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, „Integrated image
    In this paper, we adopt three kernel functions include                                and speech analysis for content-based video indexing‟, Proc. IEEE
multivariate Gaussian radial basis function, Gaussian radial                              ICMCS, May 1996, pp. 306–313.
basis function and polynomial function. Then we evaluate                           [7]    K. Wan and C. Xu, „Efficient multimodal features for automatic
the performance for extracting highlight shots of sport video                             soccer highlight generation‟, Proc. IEEE ICPR, Aug. 2004, pp. 973–
among these kernel functions. Table. II shows the                                         976.
experimental results of NYY vs. NYM, Table. III shows the                          [8]    Q. Huang, J. Hu, W. Hu, T. Wang, H. Bai and Y. Zhang, „A reliable
experimental results in the game NBA Celtics vs. Rockets,                                 logo and replay detector for sports video‟, Proc. IEEE ICME, July
and Table. IV shows the experimental results of the soccer                                2007, pp. 1695–1698.
game Arsenal vs. Hotspur. According to the experimental                            [9]    J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati and P. Pala,
                                                                                          „Soccer highlights detection and recognition using HMMs‟, Proc.
results, we find that the SVM with kernel function MGRBF                                  IEEE ICME, Aug. 2002, pp. 825–828.
have the best performance among these types of sport videos.                       [10]   G. Xu, Y. F. Ma, H. J. Zhang and S. Yang, „A HMM based semantic
                                                                                          analysis framework for sports game event detection‟, Proc. IEEE
 TABLE I.           THE EXPERIMENTAL SETUP FOR DIFFERENT SPORT TYPES                      ICIP, Sept. 2003, pp. I - 25–8.
      Sport type              Sequence         Total length   Shot length          [11]   J. Wang, C. Xu, E. Chng and Q. Tian, „Sports highlight detection
                                                                                          from keyword sequences using HMM‟, Proc. IEEE ICME, June 2004,
       Baseball           NYY vs. NYM          146 minutes       1097                     pp. 599–602.
      Basketball         Celtics vs. Rockets    32 minutes        180              [12]   P. Chang, M. Han and Y. Gong, „Extract highlights from baseball
        Soccer           Asenal vs. Hotspur    48 minutes        280                      game video with hidden Markov models‟, Proc. IEEE ICIP, Sept.
                                                                                          2002, pp. 609–612.
      TABLE II.          THE EXPERIMENTAL RESULTS OF BASEBALL GAME                 [13]   N. H. Bach, K. Shinoda and S. Furui, „Robust highlight extraction
                                                                                          using multi-stream hidden Markov models for baseball video‟, Proc.
                 Sequence               NYY vs. NYM                                       IEEE ICIP, Sept. 2005, pp. III - 173–6.
                  Kernel       MGRBF      GRBF    Polynomial                       [14]   Z. Xiong, R. Radhakrishnan, A. Divakaran and T. S. Huang, „Audio
                  Precision      87%        89%       77%
                   Recall        99%        81%       91%

   TABLE III.            THE EXPERIMENTAL RESULTS OF BASKETBALL GAME

                  Sequence            Celtics vs. Rockets
                   Kernel       MGRBF      GRBF      Polynomial
                  Precision      100%       86%       93%
                   Recall         93%      100%       87%

        TABLE IV.          THE EXPERIMENTAL RESULTS OF SOCCER GAME

                  Sequence            Arsenal vs. Hotspur
                   Kernel       MGRBF      GRBF      Polynomial
                  Precision      100%       76%      100%
                   Recall         88%       96%       73%

                              V.    CONCLUSION

   A kernel SVM can be trained to classify the shots by exploiting the information of a unified set of basic features. The experimental results show that the SVM with the multivariate Gaussian radial basis function (MGRBF) kernel achieves an average precision rate of 96% and an average recall rate of 93%.
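As a rough illustration only (not part of the paper): the sketch below assumes scikit-learn and stand-in shot feature vectors, trains an SVM with a custom multivariate Gaussian RBF kernel passed as a callable, and reports precision and recall as in the tables above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

def mgrbf_kernel(sigmas):
    """Multivariate Gaussian RBF with one bandwidth per feature (assumed form)."""
    inv2s2 = 1.0 / (2.0 * np.asarray(sigmas) ** 2)
    def gram(X, Y):
        # K[i, j] = exp(-sum_d (X[i,d] - Y[j,d])^2 / (2 * sigma_d^2))
        d2 = (((X[:, None, :] - Y[None, :, :]) ** 2) * inv2s2).sum(axis=2)
        return np.exp(-d2)
    return gram

# Stand-in data: each row is a shot's basic-feature vector, label 1 = highlight shot.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 6)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(60, 6)), rng.integers(0, 2, size=60)

clf = SVC(kernel=mgrbf_kernel(np.full(6, 1.0)), C=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("precision = %.2f, recall = %.2f"
      % (precision_score(y_test, pred), recall_score(y_test, pred)))
```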




IMAGE INPAINTING USING STRUCTURE-GUIDED PRIORITY BELIEF
           PROPAGATION AND LABEL TRANSFORMATIONS

            Heng-Feng Hsin (辛恆豐), Jin-Jang Leou (柳金章), Hsuan-Ying Chen (陳軒盈)

                         Department of Computer Science and Information Engineering
                                      National Chung Cheng University
                                  Chiayi, Taiwan 621, Republic of China
                             E-mail: {hhf96m, jjleou, chenhy}@cs.ccu.edu.tw



                     ABSTRACT                                      problem with isophote constraint. They estimate the
                                                                   smoothness value given by the best chromosome of GA,
In this study, an image inpainting approach using                  and project this value in the isophotes direction. Chan
structure-guided priority belief propagation (BP) and              and Shen [3] proposed a new diffusion method, called
label transformations is proposed. The proposed                    curvature-driven diffusions (CDD), as compared to
approach contains five stages, namely, Markov random               other diffusion models. PDE-based approaches are
field (MRF) node determination, structure map                      suitable for thin and elongated missing parts in an image.
generation, label set enlargement by label                         For large and textured missing regions, the processed
transformations, image inpainting by priority-BP                   results of PDE-based approaches are usually
optimization, and overlapped region composition. Based             oversmooth (i.e., blurring).
on experimental results obtained in this study, as                      Exemplar-based approaches try to fill missing
compared with three comparison approaches, the                     regions in an image by simply copying some available
proposed approach provides better image inpainting                        part in the image. Nie et al. [4] improved Criminisi et
results.                                                           al.’s approach [5] by changing the filling order and
                                                                   overcame the problem that gradients of some pixels on
Keywords Image Inpainting; Priority Belief Propagation;                   greedy way of filling an image, resulting in visual
Label Transformation; Markov Random Field (MRF);                   shortcoming of exemplar-based approaches is the
Structure Map.                                                     greedy way of filling an image, resulting in visual
                                                                   inconsistencies. To cope with this problem, Sun et al. [6]
                 1. INTRODUCTION                                   proposed a new approach. However, in their approach,
                                                                   user intervention is required to specify the curves on
Image inpainting is to remove unwanted objects or                  which the most salient missing structures reside. Jia and
recover damaged parts in an image, which can be                    Tang [7] used image segmentation to abstract image
employed in various applications, such as repairing                structures. Note that natural image segmentation is a
aged images and multimedia editing. Image inpainting               difficult task. To cope with this problem, Komodakis
approaches can be classified into three categories,                and Tziritas [8] proposed a new exemplar-based
namely, statistical-based, partial differential equation           approach, which treats image inpainting as a discrete
(PDE) based, and exemplar-based approaches.                        global optimization problem.
Statistical-based approaches are usually used for texture
synthesis and suitable for highly-stochastic parts in an                       2. PROPOSED APPROACH
image. However, statistical-based approaches are hard
to rebuild structure parts in an image.                            The proposed approach contains five stages, namely,
     PDE-based approaches try to fill target regions of            Markov random field (MRF) node determination,
an image through a diffusion process, i.e., diffuse                structure map generation, label set enlargement by label
available data from the source region boundary towards             transformations, image inpainting by priority-BP
the interior of the target region by PDE, which is                 optimization, and overlapped region composition.
typically nonlinear. Bertalmio et al. [1] proposed a
PDE-based image inpainting approach, which finds out               2.1. MRF node determination
isophote directions and propagates image Laplacians to
the target region along these directions. Kim et al. [2]           As shown in Fig. 1 [8], an image I0 contains a target
used genetic algorithms (GA) to solve the inpainting               region T and a source region S with S=I0-T. Image




inpainting is to fill T in a visually plausible way by simply pasting various patches from S. In this study, image inpainting is treated as a discrete optimization problem with a well-defined energy function. Here, discrete MRFs are employed.
     To define the nodes of an MRF, the image lattice is used with horizontal and vertical spacings of gapx and gapy (pixels), respectively. For each lattice point, if its neighborhood of size (2gapx+1) × (2gapy+1) overlaps the target region, it becomes an MRF node p. Each label of the label set L of an MRF consists of (2gapx+1) × (2gapy+1) pixels from the source region S. Based on the image lattice, each MRF node may have 2, 3, or 4 neighboring MRF nodes.
     Assigning a label to an MRF node is equivalent to copying the label (patch) to the MRF node. To evaluate the goodness of a label (patch) for an MRF node, the energy (cost) function of an MRF is defined, which includes the cost of the observed region of an MRF node.
     We will assign a label \hat{x}_p \in L to each MRF node p so that the total energy F(\hat{x}) of the MRF is minimized. Here,

    F(\hat{x}) = \sum_{p \in \nu} V_p(\hat{x}_p) + \sum_{(p,q) \in \varepsilon} V_{pq}(\hat{x}_p, \hat{x}_q),    (1)

where V_p(x_p) (called the label cost hereafter) denotes the single-node potential for placing label x_p over MRF node p, i.e., how well the label x_p agrees with the source region around p. V_{pq}(x_p, x_q) represents the pairwise potential measuring how well node p agrees with its neighboring node q over their overlapped region when pasting x_p at p and x_q at q; \varepsilon denotes the set of MRF edges.

2.2. Structure map generation

In this study, the Canny edge detector [9] is used to extract the edge map of an image, which preserves the important structural properties of the source region in the image. A binary mask E(p) is used to build the structure map of the image, which is simply the edge map after morphological dilation. If E(p) is non-zero, pixel p belongs to the structure part. Then, E(p) is used to formulate the structure weighting function Z(p,q):

    Z(p,q) = \begin{cases} 1, & \text{if } E(p)=0 \text{ and } E(q)=0, \\ w, & \text{otherwise,} \end{cases}    (2)

where w is the structure weighting coefficient. The label cost V_p(x_p) is defined as the sum of weighted squared differences (SWSD):

    V_p(x_p) = \sum_{dp \in [-gap_x,gap_x] \times [-gap_y,gap_y]} Z(p+dp,\, x_p+dp)\, M(p+dp)\, \big(I_0(p+dp) - I_0(x_p+dp)\big)^2,    (3)

where M(p) denotes a binary mask, which is non-zero if pixel p lies inside the source region S. Thus, for an MRF node p, if its neighborhood of size (2gapx+1) × (2gapy+1) does not intersect S, V_p(x_p)=0. V_{pq}(x_p,x_q) for pasting labels x_p and x_q over p and q, respectively, can be similarly defined as:

    V_{pq}(x_p,x_q) = \sum_{(dp,dq) \in R_o} Z(x_p+dp,\, x_q+dq)\, \big(I_0(x_p+dp) - I_0(x_q+dq)\big)^2,    (4)

where R_o is the overlapped region between the two labels x_p and x_q.

2.3. Label set enlargement

To make full use of the label information in the original image, three types of label transformations are used to enlarge the label set. The first type of label transformation contains two different directions: the vertical and horizontal flippings, which can find labels (patches) that do not exist in the original source region but have symmetric properties in the horizontal or vertical direction. The second type of label transformation contains three different rotations: left 90° rotation, right 90° rotation, and 180° rotation, which can find rotated labels (patches) of the above-mentioned three degrees. The third type of label transformation is scaling. To keep the original horizontal and vertical spacings gapx and gapy, the original image is directly up/down scaled so that new labels (patches) can be obtained from the scaled images with the same horizontal and vertical spacings. Here, both the up-sampled (double-resolution by bilinear interpolation) image and the down-sampled (half-resolution) image are used to generate extra candidate labels (patches).

2.4. Image inpainting by priority-BP optimization

Belief propagation (BP) [10] treats an optimization problem by iteratively solving a finite set of equations until the optimal solution is found. Ordinary BP is computationally expensive. For an MRF graph, each node sends "messages" to all its neighboring nodes, whereas the node receives messages from all its neighboring nodes. This process is iterated until the messages no longer change.
     The set of messages sent from node p to its neighboring node q is denoted by \{m_{pq}(x_q)\}_{x_q \in L}. This message expresses the opinion of node p about assigning label x_q to node q. The message formulation is defined as:

    m_{pq}(x_q) = \min_{x_p \in L} \Big\{ V_{pq}(x_p,x_q) + V_p(x_p) + \sum_{r:\, r \neq q,\, (r,p) \in \varepsilon} m_{rp}(x_p) \Big\}.    (5)

That is, if node p wants to send message m_{pq} to node q, node p must traverse its own label set and find the best label to support node q when label x_q is assigned to node q. Each message is based on two factors: (1) the compatibility between labels x_p and x_q, and (2) the likelihood of assigning label x_p to node p, which in turn contains two factors: (1) the label cost V_p(x_p), and (2) the opinions of the other neighboring nodes about x_p, measured by the third term in Eq. (5).
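A minimal sketch (illustrative, not the authors' code) of the label cost of Eq. (3) and the message update of Eq. (5), assuming NumPy, a float image I0, the source mask M, and the dilated edge map E; all helper names here are made up.

```python
import numpy as np

def patch(img, center, gx, gy):
    """The (2*gy+1) x (2*gx+1) window of img centered at (y, x)."""
    y, x = center
    return img[y - gy:y + gy + 1, x - gx:x + gx + 1]

def label_cost(I0, M, E, p, xp, gx, gy, w):
    """V_p(x_p) of Eq. (3): weighted SSD over the known pixels (M == 1) of
    node p's window, with structure weights of Eq. (2) taken from E."""
    diff2 = (patch(I0, p, gx, gy) - patch(I0, xp, gx, gy)) ** 2
    z = np.where((patch(E, p, gx, gy) == 0) & (patch(E, xp, gx, gy) == 0), 1.0, w)
    return float(np.sum(z * patch(M, p, gx, gy) * diff2))

def message(x_q, labels_p, V_p, V_pq, incoming):
    """m_pq(x_q) of Eq. (5): node p scans its own label set for the label that
    best supports assigning x_q to q; `incoming` lists the messages m_rp from
    p's other neighbors r != q, each given as a dict keyed by label."""
    best = np.inf
    for x_p in labels_p:
        cost = V_pq(x_p, x_q) + V_p(x_p) + sum(m[x_p] for m in incoming)
        best = min(best, cost)
    return best
```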




Messages are iteratively updated by Eq. (5) until they converge. Then, a set of beliefs, which represents the probability of assigning label x_p to p, is computed for each MRF node p as:

    b_p(x_p) = -V_p(x_p) - \sum_{r:\, (r,p) \in \varepsilon} m_{rp}(x_p).    (6)

The second term in Eq. (6) means that to calculate a node's belief, all messages from all its neighboring nodes must be gathered. When the beliefs of all MRF nodes have been calculated, each node p is assigned the best label, i.e., the one having the maximum belief:

    \hat{x}_p = \arg\max_{x_p \in L} b_p(x_p).    (7)

     To reduce the computational cost of BP, Komodakis and Tziritas [8] proposed "priority-BP" to control the message passing order of MRF nodes and "dynamic label pruning" to reduce the number of elements in the label set of each MRF node. In [8], the priority of an MRF node p is related to the confidence of node p about the label that should be assigned to it. The confidence depends on the current set of beliefs \{b_p(x_p)\}_{x_p \in L} that has been calculated by BP. Here, the priority of node p is designed as:

    priority(p) = \frac{1}{\left| \{ x_p \in L : b_p^{rel}(x_p) \geq b_{conf} \} \right|},    (8)

    b_p^{rel}(x_p) = b_p(x_p) - b_p^{max},    (9)

where b_p^{rel} is the relative belief value and b_p^{max} is the maximum belief among all labels in the label set of node p. Here, the confidence of an MRF node is determined by the number of candidate labels whose relative belief values exceed a certain threshold b_conf.
     On the other hand, to traverse the MRF nodes, the number of candidate labels for an MRF node can be pruned dynamically. To commit a node p, all labels with relative beliefs less than a threshold b_prune for node p will not be considered as its candidate labels. The remaining labels are called "active labels" for node p. In this study, the label set of an MRF node is sorted by belief values, at least Lmin active labels are selected for an MRF node, and a similarity measure is used to check the remaining labels. If the similarity between two remaining labels is greater than a threshold Sdiff, one of the two remaining labels will be pruned. This process is iterated until the relative belief value of any remaining label is smaller than b_prune or the number of active labels reaches a user-specified parameter Lmax.
     To apply priority-BP to image inpainting, the labels from the source region of the original image and the labels obtained by applying the three types of label transformations are collected so that each MRF node maintains its label set. Then, the number of priority-BP iterations, K, is set, the priorities of all MRF nodes are initialized only by their V_p(x_p) values, and message passing is performed. Each priority-BP iteration consists of a forward and a backward pass. Message passing and dynamic label pruning are performed in the forward pass, and each MRF edge can be bidirectionally traversed. In the forward pass, all the nodes are visited in priority order: the MRF node having the highest priority passes messages to its neighboring MRF nodes having lower priorities, and that node is then marked as "committed," so it will not be visited again in this forward pass. For label pruning, the MRF node having the highest priority can transmit its "cheap" message to all its neighboring MRF nodes that have not yet been committed. The priority of each neighboring MRF node that has received a new message is updated. The above process is iterated until there are no uncommitted MRF nodes. On the other hand, the backward pass is performed in the reverse order of the forward pass. Note that label pruning is not performed in the backward pass.

2.5. Overlapped region composition

When the number of iterations reaches K, each MRF node p is assigned the label having the maximum b_p value. All the MRF nodes are composed to produce the final image inpainting result, where label composition is performed in decreasing order of MRF node priorities. Depending on whether the region contains a global structure or not, two strategies are used to compose each overlapped region. If an overlapped region contains a global structure, graph cuts are used to seam it. Otherwise, each pixel value of the overlapped region is computed as a weighted sum of the two corresponding pixel values, where the weighting coefficient is proportional to the priority of an MRF node.
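The node priority of Eqs. (8)-(9) and the dynamic label pruning rule of Sec. 2.4 can be sketched as follows; this is an illustration under assumed data structures (a NumPy belief array per node and a caller-supplied patch-similarity test), not the implementation used in the paper.

```python
import numpy as np

def node_priority(beliefs, b_conf):
    """Eqs. (8)-(9): relative beliefs b - b_max; the fewer labels above b_conf,
    the more confident the node and the higher its priority."""
    rel = beliefs - beliefs.max()
    n_confident = int(np.count_nonzero(rel >= b_conf))
    return 1.0 / max(n_confident, 1)

def prune_labels(beliefs, labels, b_prune, L_min, L_max, too_similar):
    """Dynamic label pruning: scan labels in decreasing belief order, always keep
    at least L_min, skip near-duplicate labels, and stop once the relative belief
    drops below b_prune or L_max active labels have been collected."""
    order = np.argsort(-beliefs)
    rel = beliefs - beliefs.max()
    active = []
    for i in order:
        if len(active) >= L_min and (rel[i] < b_prune or len(active) >= L_max):
            break
        if any(too_similar(labels[i], labels[j]) for j in active):
            continue
        active.append(i)
    return [labels[i] for i in active]
```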




                  3. EXPERIMENTAL RESULTS

In this study, 21 test images are used to evaluate the performance of the proposed approach. Three comparison inpainting approaches, namely, the PDE-based approach [1], the exemplar-based approach [5], and the ordinary priority-BP-based approach [8], are implemented in this study. Some image inpainting results by the three comparison approaches and the proposed approach are shown in Figs. 2-6.
     In Fig. 2, the image size is 256 × 170, gapx=9, gapy=9, bconf=-180000, bprune=-360000, Lmax=30, Lmin=5, and w=10. Blurring artifacts appear in Fig. 2(c). In Fig. 2(d), because the isophote direction is too complex to guide the inpainting process, the inpainting results are not good. Compared with the ordinary priority-BP-based approach (Fig. 2(e)), the proposed approach (Fig. 2(f)) can keep the global structure in the image by guiding the message passing process with the structure map. In Fig. 3, the image size is 206 × 308, gapx=5, gapy=5, bconf=-40000, bprune=-80000, Lmax=20, Lmin=3, and w=10. In Fig. 3(c), blurring artifacts appear in the upper part of the image. In Fig. 3(d), the stone bridge cannot be well reconstructed because there is no suitable patch in the image. Furthermore, error propagation appears in the lake. In Fig. 3(e), because the priority of the bridge structure is low, the bridge structure is broken. In the proposed approach, the weighting coefficient is used to raise the priority of the bridge structure, resulting in better inpainting results. In Fig. 4, the image size is 208×278, gapx=7, gapy=7, bconf=-150000, bprune=-300000, Lmax=30, Lmin=5, and w=2. For this image, the proposed approach can reconstruct the tower structure by label transformations, whereas the three comparison approaches contain error propagation due to the lack of suitable labels. In Fig. 5, the image size is 287×216, gapx=10, gapy=10, bconf=-200000, bprune=-400000, Lmax=50, Lmin=5, and w=15. In Fig. 5(f), the proposed approach uses both the original labels and the flipped labels to reconstruct the region to be inpainted, resulting in a better inpainted image. In Fig. 6, the image size is 257 × 271, gapx=6, gapy=6, bconf=-200000, bprune=-400000, Lmax=50, Lmin=10, and w=5. Because the building in the original image has a symmetric property, label transformations can be employed in this case. Blurring artifacts appear in Fig. 6(c). In Fig. 6(d), the isophote direction is too complex, so the structures interfere with each other. In Fig. 6(e), the inpainting results are poor due to the lack of valid labels. In Fig. 6(f), for the lower part of the image, the window structure is partially broken because the building is not totally symmetric, so error propagation appears in some inpainted regions of the image. However, the inpainted image by the proposed approach is better than those by the three comparison methods.

             4. CONCLUDING REMARKS

In this study, an image inpainting approach using structure-guided priority BP and label transformations is proposed. In the proposed approach, to reconstruct the global structures in an image, the structure map of the image is generated, which guides the inpainting process by priority-BP optimization. Furthermore, three types of label transformations are employed to obtain more usable labels (patches) for inpainting. Based on the experimental results obtained in this study, the proposed approach provides better image inpainting results than the three comparison approaches.

               ACKNOWLEDGEMENT

This work was supported in part by the National Science Council, Taiwan, Republic of China under Grants NSC 96-2221-E-194-033-MY3 and NSC 98-2221-E-194-034-MY3.

[Figure panels omitted.]
Fig. 1. (a) Nodes and edges of an MRF; (b) labels of an MRF for image inpainting [8].
Fig. 2. (a) The original image, "Lantern;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.




[Figure panels omitted.]
Fig. 4. (a) The original image, "Tower;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 5. (a) The original image, "Picture frame;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.




[Figure panels omitted.]
Fig. 6. (a) The original image, "Building;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

                     REFERENCES

[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of ACM Int. Conf. on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[2] J. B. Kim and H. J. Kim, "Region removal and restoration using a genetic algorithm with isophote constraint," Pattern Recognition Letters, Vol. 24, pp. 1303–1316, 2003.
[3] T. Chan and J. Shen, "Non-texture inpaintings by curvature-driven diffusions," Journal of Visual Comm. Image Rep., Vol. 12, pp. 436–449, 2001.
[4] D. Nie, L. Ma, and S. Xiao, "Similarity based image inpainting method," in Proc. of 2006 Multi-Media Modeling Conf., 2006, pp. 4–6.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003, pp. 721–728.
[6] J. Sun, L. Yuan, J. Jia, and H. Y. Shum, "Image completion with structure propagation," in Proc. of 2005 ACM SIGGRAPH on Computer Graphics, 2005, pp. 861–868.
[7] J. Jia and C. K. Tang, "Image repairing: Robust image synthesis by adaptive and tensor voting," in Proc. of 2003 IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 643–650.
[8] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Trans. on Image Processing, Vol. 16, pp. 2649–2661, 2007.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679–698, 1986.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA, 1988.




CONTENT-BASED BUILDING IMAGE RETRIEVAL

    Wen-Chao Chen(陳文昭), Chi-Min Huang (黃啟銘), Shu-Kuo Sun (孫樹國), Zen Chen (陳稔)

                        Dept. of Computer Science, National Chiao Tung University
                   E-mail:Chaody.cs94g@nctu.edu.tw, toothbrush.cs97g@nctu.edu.tw,
                            sksun@csie.nctu.edu.tw, zchen@cs.nctu.edu.tw


Abstract—This paper addresses an image retrieval                query image, the content-based image retrieval system
system which searches the most similar building for a           extracts the most similar images from a database by
captured building image from an image database based            either spatial information, such as color, texture and
on an image feature extraction and matching method.             shape, or frequency domain features, e.g. wavelet-based
The system then can provide relevant information to             methods [3].
users, such as text or video information regarding the                Existing content-based image retrieval algorithms
query building in augmented reality setting. However,           can be categorized into (a) image classification methods,
the main challenge is the inevitable geometric and              and (b) object identification methods. The first approach
photometric transformations encountered when a                  retrieves images which belong to the same category as a
handheld camera operates at a varying viewpoint under           query image. Jing et al. proposed region-based image
various lighting environments. To deal with these               retrieval architecture [6]. An image is segmented into
transformations, the system measures the similarity             regions by the JSEG method and every region is
between the MSER features of the captured image and             described with color moment. Every region is clustered
database images using the Zernike Moment (ZM)                   to form a codebook by Generalized Lloyd algorithm.
information. This paper also presents algorithms based          The similarity of two images is then measured by Earth
on feature selection by multi-view information and the          Mover’s Distance (EMD). Willamowski et al. presented
DBSCAN clustering method to retrieve the most                   generic visual categorization method by using support
relevant image from database efficiently. The                   vector machine as a classifier [7]. Affine invariant
experimental results indicate that the proposed system          descriptor represents an image as a vector quantization.
has excellent performance in terms of the accuracy and               In the second approach Wu and Yang [8] detected
processing time under the above inevitable imaging              and recognized street landmarks from database images
variations.                                                     by combining salient region detection and segmentation
                                                                techniques. Obdrzalek and Matas [9] developed a
Keywords Image recognition and retrieval; Geometric
                                                                building image recognition system based on local affine
and photometric transformations; Zernike moments;
                                                                features that allows retrieval of objects in images taken
Image indexing;
                                                                from distinct viewpoints. Discrete cosine transform
                 1. INTRODUCTION                                (DCT) is then applied to the local representations to
                                                                reduce the memory usage. Zhang and Kosecka [10] also
     In recent years, there have been an increasing
                                                                proposed a system to recognize building by a
number of applications in Location-Based Service
                                                                hierarchical approach. They first index the model views
(LBS). LBS is a service that can be accessed from
                                                                by localized color histograms. After converting to
mobile devices to provide information based on the
                                                                YCbCr color space and indexing with the hue value,
current geographical position, e.g. GPS information.
                                                                SIFT descriptors [4, 5] are then applied to refine
However, GPS position is only available in open spaces
                                                                recognition results.
since the GPS signal is often blocked by high-rise
                                                                     Most of related image retrieval algorithms detect
buildings or overhead bridges. Magnetic compasses are
                                                                local features of a query image and then compare with
also disturbed by nearby magnetic materials. Vision-
                                                                detected features of database images by feature
based localization is therefore an alternative approach to
                                                                descriptors. However, the feature detectors such as
provide both accurate and robust navigation information.
                                                                Harris corner detector and the SIFT detector, which is
     This paper addresses the aspects of a building
                                                                based on the difference of Gaussians (DOG), utilize a
image retrieval system. The building recognition is a
                                                                circular window to search for a possible location of a
content-based image retrieval technique that can be
                                                                feature. The image content in the circular window is not
extended to applications of object recognition and web
                                                                robust to affine deformations. Furthermore, the feature
image search via a cloud service combined with
                                                                points may not be reliable and may not appear
consumer-oriented augmented reality tools. Given a




simultaneously across the multiple views with wide-                   2. FEATURE DETECTOR AND DESCRIPTOR
baselines.
     Matas et al. [13] presented a maximally stable                 2.1. MSER feature region detector
extremal region (MSER) detector. Mikolajczyk and
                                                                      Recently, a number of local feature detectors using a
Schmid [3] proposed Harris-Affine and Hessian-Affine
                                                                local elliptical window have been investigated. The
detectors. The performances of the existing region
                                                                MSER detector is evaluated as one of the best region
detectors were evaluated in [14] in which the MSER
                                                                detectors [5]. The advantage of MSER detector is the
detector and the Hessian-Affine detector were ranked as
                                                                ability to resist geometry transformation. The MSER
the two best. Chen and Sun [2] compare various popular
                                                                detector performs also well when images contain
feature descriptors, e.g. SIFT, PCA-SIFT, GLOH,
                                                                homogenous regions with distinctive boundaries [1].
steerable filter, with phase-based Zernike Moment (ZM)
                                                                Because building images contain regions with
descriptor. The ZM descriptor performs significantly
                                                                boundaries, such as windows and color bricks, the
better than other descriptors in geometric and
                                                                MSER detector can extract these regions stably.
photometric transformations, such as blur, illumination,
                                                                      After detecting elliptical regions by MSER method,
noise, scale, JPEG compression. To describe a building
                                                                we have to filter out unstable regions such as oversized
image in geometric and photometric transformations,
                                                                area, large aspect ratio, duplicated regions, and high
this paper utilizes the MSER method as the feature
                                                                area variation, as shown in fig. 2.
detector. The Zernike Moment is then applied to
describe each detected feature region.                              2.2. Zernike Moment feature region descriptor
     In order to index a large number of features
                                                                      Once the feature regions are detected, every region
descriptors, KD-tree [12] is a fundamental method to
                                                                is described as a feature vector for similarity
recursively partition the space into two subspaces to
                                                                measurement. This paper presents a method which
construct a binary tree.
                                                                applies Zernike Moment (ZM) as the feature descriptor
     We also introduce a building image dataset, the
                                                                [2].
NCTU-Bud dataset, containing the high resolution
                                                                      Zernike moments (ZMs) have been used in object
images of 22 buildings located on National Chiao Tung
                                                                recognition regardless of variations in position, size and
University campus with a total of 190 database images.
                                                                orientation. Essentially Zernike moments are the
We capture at least one face of each building from 5
                                                                extension of the geometric moments by replacing the
distinct viewing directions. Query images are captured
under 12 different lighting conditions for performance          conventional transform kernel x m y n with orthogonal
evaluation.                                                     Zernike polynomials.
     Fig. 1 shows the overall system block diagram. Section 2 briefly describes the background of the feature detector and descriptor. Section 3 presents a feature selection method to remove unstable features and a clustering method to obtain representative features. In Section 4 the image indexing and retrieval method is described. In Section 5 experimental results on the NCTU-Bud dataset are described. The performance on the publicly available ZuBud dataset is evaluated as well. Finally, Section 6 concludes the paper.
     The Zernike basis function V_{nm}(\rho, \theta) is defined over the unit circle with order n and repetition m such that (a) n - |m| is even and (b) |m| \leq n, as given by

    V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}, \quad \text{for } \rho \leq 1,    (1)

where R_{nm}(\rho) is a radial polynomial of the form

    R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s \frac{(n-s)!}{s!\, \big(\frac{n+|m|}{2}-s\big)!\, \big(\frac{n-|m|}{2}-s\big)!}\, \rho^{n-2s}.    (2)
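For reference, Eqs. (1)-(2) can be evaluated directly; the sketch below is a plain NumPy illustration (not code from the paper) of the radial polynomial and the basis function on the unit disk.

```python
import numpy as np
from math import factorial

def zernike_radial(n, m, rho):
    """Radial polynomial R_nm(rho) of Eq. (2); requires n - |m| even and |m| <= n."""
    m = abs(m)
    return sum((-1) ** s * factorial(n - s)
               / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s))
               * rho ** (n - 2 * s)
               for s in range((n - m) // 2 + 1))

def zernike_basis(n, m, rho, theta):
    """Zernike basis function V_nm(rho, theta) of Eq. (1), valid for rho <= 1."""
    return zernike_radial(n, m, rho) * np.exp(1j * m * theta)
```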




[Figures omitted.]
Figure 1. System block diagram.
Figure 2. (a) Initial MSER results. (b) Results after removing unstable MSER feature regions.
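A rough sketch of the detection and filtering stage of Sec. 2.1, assuming OpenCV; the file name and the filtering thresholds below are placeholders, not values from the paper.

```python
import cv2

# Detect MSER regions and discard obviously unstable ones (oversized regions,
# extreme aspect ratios); further checks such as duplicate removal would follow.
img = cv2.imread("building.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

kept = []
for pts, (x, y, w, h) in zip(regions, bboxes):
    area_ratio = len(pts) / float(gray.size)
    aspect = max(w, h) / float(max(min(w, h), 1))
    if area_ratio < 0.05 and aspect < 5.0:   # placeholder thresholds
        kept.append(pts)
```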




The set of basis functions \{V_{nm}(\rho, \theta)\} is orthogonal, i.e.,

    \int_0^{2\pi}\!\!\int_0^1 V_{nm}^{*}(\rho,\theta)\, V_{pq}(\rho,\theta)\, \rho\, d\rho\, d\theta = \frac{\pi}{n+1}\, \delta_{np}\, \delta_{mq}, \quad \text{with } \delta_{ab} = \begin{cases} 1, & a=b \\ 0, & \text{otherwise.} \end{cases}    (3)

The two-dimensional ZMs for a continuous image function f(\rho, \theta) are represented by

    Z_{nm} = \frac{n+1}{\pi} \int\!\!\int_{\rho \leq 1} f(\rho,\theta)\, V_{nm}^{*}(\rho,\theta)\, \rho\, d\rho\, d\theta = |Z_{nm}|\, e^{i\phi_{nm}}.    (4)

For a digital image function the two-dimensional ZMs are given as

    Z_{nm} = \frac{n+1}{\pi} \sum_{(\rho,\theta) \in \text{unit disk}} f(\rho,\theta)\, V_{nm}^{*}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}}.    (5)

     Define a region descriptor \vec{P} based on the sorted ZMs as follows:

    \vec{P} = [\, |Z_{11}| e^{i\phi_{11}},\ |Z_{31}| e^{i\phi_{31}},\ \ldots,\ |Z_{n_{max} m_{max}}| e^{i\phi_{n_{max} m_{max}}} \,]^T,    (6)

where |Z_{nm}| is the ZM magnitude and \phi_{nm} is the ZM phase.
     The Zernike moments are derived after integrating the normalized region with respect to the Zernike basis functions. In this paper, the ZMs with m = 0 are not included, and both the maximum order n and the maximum repetition m equal 12, so the length of the feature vector is 42. In this way, two feature vectors represent a feature region: mag = [\, |Z_{1,1}|, |Z_{3,1}|, \ldots, |Z_{12,12}| \,]^T and phase = [\, \phi_{1,1}, \phi_{3,1}, \ldots, \phi_{12,12} \,]^T.

[Figure omitted.]
Figure 3. Normalization of an elliptical region.

2.3. A similarity measure

Let \vec{P}_q = (mag_q, phase_q) and \vec{P}_d = (mag_d, phase_d) be two ZM feature vectors, where mag_q = [\, |Z^q_{1,1}|, |Z^q_{3,1}|, \ldots, |Z^q_{12,12}| \,]^T, phase_q = [\, \phi^q_{1,1}, \phi^q_{3,1}, \ldots, \phi^q_{12,12} \,]^T, mag_d = [\, |Z^d_{1,1}|, |Z^d_{3,1}|, \ldots, |Z^d_{12,12}| \,]^T, and phase_d = [\, \phi^d_{1,1}, \phi^d_{3,1}, \ldots, \phi^d_{12,12} \,]^T.
     The similarity of magnitude S_{mag}(\vec{P}_q, \vec{P}_d) is defined as the cosine of the angle between the two magnitude vectors:

    S_{mag}(\vec{P}_q, \vec{P}_d) = \frac{mag_q \cdot mag_d}{\|mag_q\|\, \|mag_d\|}.    (7)

The value ranges between 0 and 1, and a higher value indicates that the two vectors are more similar. This is equivalent to the Euclidean distance between the two normalized unit vectors.
     A similarity measure using the weighted ZM phase differences is expressed by

    S_{phase}(\vec{P}_q, \vec{P}_d) = 1 - \sum_{m}\sum_{n} w_{nm}\, \frac{\min\{\, |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi|,\ 2\pi - |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi| \,\}}{\pi},    (8)

where w_{nm} = \frac{|Z^q_{nm}| + |Z^d_{nm}|}{\sum_{n,m} (|Z^q_{nm}| + |Z^d_{nm}|)} and \Phi_{nm} = (\phi^q_{nm} - \phi^d_{nm}) \bmod 2\pi is the actual phase difference.
     The rotation angle \hat{\alpha} is determined by an iterative computation of \hat{\alpha}_m = (\Phi_{nm} - \hat{\alpha}_{m-1}) \bmod 2\pi, with the initial value \hat{\alpha}_0 = 0, using the entire information of the Zernike moments sorted by m. The value range of S_{phase}(\vec{P}_q, \vec{P}_d) is the interval [0, 1], and a higher value indicates that the two vectors are more similar.

          3. EFFICIENT BUILDING IMAGE DATABASE CONSTRUCTION

     In building image retrieval applications, the scale of the database is typically large, with a considerable number of visual descriptors. In order to index and search rapidly, effective approaches to storing appropriate descriptors are proposed for constructing a large-scale building image database.

3.1. Feature selection from multiple images

     Modern building databases in image retrieval applications normally contain multiple views of a single building. For example, the ZuBud dataset collects five images for each building in the database. We refine the detected MSER feature regions by verifying consistency between multiple images of a building that are captured from distinct viewpoints. The basic idea of the selection is to keep representative feature regions and remove discrepant features as outliers. Feature region selection reduces the storage space of feature descriptors in a database. Furthermore, this method remarkably improves the efficiency and accuracy of the image retrieval process.
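A small sketch of the similarity measures of Eqs. (7)-(8), which the selection step below relies on; it assumes NumPy arrays of ZM magnitudes and phases, the repetition index of each component, and an already-estimated rotation angle alpha (illustrative names only, not the paper's code).

```python
import numpy as np

def s_mag(mag_q, mag_d):
    """Eq. (7): cosine similarity between two ZM magnitude vectors, in [0, 1]."""
    return float(mag_q @ mag_d / (np.linalg.norm(mag_q) * np.linalg.norm(mag_d)))

def s_phase(phase_q, phase_d, mag_q, mag_d, m_orders, alpha):
    """Eq. (8): weighted, rotation-compensated ZM phase differences mapped to [0, 1];
    alpha is assumed to have been estimated beforehand (Sec. 2.3)."""
    w = (mag_q + mag_d) / np.sum(mag_q + mag_d)          # weights w_nm
    big_phi = np.mod(phase_q - phase_d, 2 * np.pi)       # actual phase differences
    d = np.abs(np.mod(big_phi - m_orders * alpha, 2 * np.pi))
    d = np.minimum(d, 2 * np.pi - d)                     # wrap differences to [0, pi]
    return float(1.0 - np.sum(w * d / np.pi))
```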




[Figure panels omitted.]
Figure 4. (a)-(c) Three different images in a group of a building image before feature selection. (d)-(f) Three different images in a group of a building image after feature selection.

The occurrence of discrepant feature regions comes from non-building areas, such as trees, bicycles, and pedestrians, as shown in Fig. 4. Feature regions in non-building areas are not stable compared with regions in building areas. Therefore, excluding these feature regions from the database is necessary to ensure consistent results.

This paper presents a method to select feature regions automatically by measuring the similarity between multiple images of a building. The algorithm for feature region selection is given in Fig. 5. Only feature regions that are similar across the views are preserved. Two regions are considered similar if $S_{mag}(P_q, P_d) \geq 0.7$ and $S_{phase}(P_q, P_d) \geq 0.7$. A comparison of feature regions before and after selection is shown in Fig. 4: unstable feature regions in Figs. 4(a)-4(c), such as trees and pedestrians, are removed by the proposed algorithm, and the results of the selection are shown in Figs. 4(d)-4(f).

  Input: A group of feature regions in multi-view images.
  Output: Selected feature regions.
  For each feature region
    If there are at least two similar regions in other views
        Preserve the feature region;
    Else
        Delete the feature region;

Figure 5. Feature region selection algorithm.

3.2. Feature clustering

After removing non-building feature regions, most of the remaining feature regions belong to the buildings. However, repeated patterns, e.g., windows and doors, are common in a building image. In order to reduce the storage space of the repeated feature descriptors in the database, clustering similar features into a representative feature descriptor is necessary.

In conventional clustering algorithms, e.g., the k-means and k-medoid algorithms, each cluster is represented by its gravity center or by one of the objects of the cluster located near its center. However, determining the number of clusters k is not straightforward. Moreover, the ability to distinguish different features is reduced because isolated feature regions are forced to merge into a nearby cluster whose region appearance may have dissimilar characteristics. Consequently, we adopt the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [15] for clustering.

The DBSCAN algorithm relies on a density-based notion of clusters. Two input parameters, ε and MinPts, determine the clustering conditions in two steps. The first step chooses an arbitrary point from the database as a seed; the second retrieves all points reachable from the seed. The parameter ε defines the size of the neighborhood, and for each point to be included in a cluster there must be at least a minimum number (MinPts) of points in an ε-neighborhood of a cluster point.
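Read as code, the selection rule of Fig. 5 amounts to a support count across views. The following is a minimal Python sketch, assuming callables s_mag and s_phase that implement Eqs. (7) and (8) (defined earlier in the paper and not reproduced here); the 0.7 thresholds and the requirement of at least two similar regions in other views come from the text above.

    # Hedged sketch of the feature region selection rule in Fig. 5.
    # Each region is a tuple (view_id, mag_vector, phase_vector).
    def select_regions(regions, s_mag, s_phase, thr=0.7, min_support=2):
        """Keep a region only if at least `min_support` regions from *other*
        views satisfy S_mag >= thr and S_phase >= thr with it."""
        selected = []
        for view_i, mag_i, phase_i in regions:
            support = sum(
                1
                for view_j, mag_j, phase_j in regions
                if view_j != view_i
                and s_mag(mag_i, mag_j) >= thr
                and s_phase(phase_i, phase_j) >= thr
            )
            if support >= min_support:
                selected.append((view_i, mag_i, phase_i))
        return selected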




Figure 6. (a)-(e) Feature regions in the same cluster. (f)-(j) Another cluster of feature regions after DBSCAN.
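To illustrate the clustering step described in the following paragraphs (DBSCAN over the 42-dimensional ZM magnitude vectors of one building group, with each cluster replaced by the mean of its members and isolated vectors preserved), here is a minimal Python sketch using scikit-learn's DBSCAN. The eps and min_samples values are placeholders only; the paper does not state them in this excerpt.

    # Hedged sketch of the clustering of Section 3.2 using DBSCAN [15].
    import numpy as np
    from sklearn.cluster import DBSCAN

    def compress_group(mag_vectors, eps=0.1, min_samples=3):
        """mag_vectors: (N, 42) array of ZM magnitude descriptors of one group.
        Returns the reduced set of representative descriptors."""
        X = np.asarray(mag_vectors)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        reps = [X[labels == k].mean(axis=0) for k in set(labels) if k != -1]
        reps.extend(X[labels == -1])   # isolated (noise) features are kept as-is
        return np.array(reps)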

The input to the DBSCAN algorithm is the set of 42-dimensional selected ZM magnitude vectors of all images belonging to the same group, i.e., the same building. We calculate the mean of the feature vectors as the representative vector of each cluster, while preserving the isolated feature points.

The elliptical regions in Figs. 6(a)-6(e) are feature vectors in the same cluster and are replaced by a representative feature vector. Figs. 6(f)-6(j) show another feature cluster in the same group of multi-view images.

4. IMAGE INDEXING AND RETRIEVAL

4.1. Descriptor indexing with a KD-tree

After the feature selection and clustering processes described above, all extracted building regions are indexed by a KD-tree according to their ZM magnitude vectors. The goal is to build an indexing structure so that the nearest neighbors of a query vector can be searched rapidly.

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space into two parts by a hyperplane perpendicular to a coordinate axis. The binary space partition is executed recursively until every leaf node contains a single data point. The algorithm for constructing a KD-tree is given in Fig. 7, initialized with dim = 1 and Dataset as the set of N database points.

  Input: N feature vectors in k dimensions.
  Output: A KD-tree in which every leaf node contains a single feature vector.
  kd_tree_build (Dataset, dim)
  {
    If Dataset contains only one point
        Mark a leaf node containing the point;
        Return;
    else
        1. Sort all points in Dataset according to feature dimension dim;
        2. Determine the median value of feature dimension dim in Dataset, make a new node, and save the median value;
        3. Dataset_bigger = the points in Dataset whose value in dimension dim is not less than the median value;
        4. Dataset_smaller = the points in Dataset whose value in dimension dim is less than the median value;
        5. Set Dataset_bigger as the new node's right child and Dataset_smaller as the new node's left child;
        6. Call kd_tree_build (Dataset_bigger, (dim+1) % k);
        7. Call kd_tree_build (Dataset_smaller, (dim+1) % k);
  }

Figure 7. The KD-tree construction algorithm.

4.2. Query by region vote counting

After establishing the KD-tree that organizes the ZM magnitude feature vectors of the database, the tree is descended to find the leaf node into which the query point falls. After obtaining the first candidate nearest neighbor, we verify with the ZM phase feature vector whether the candidate point is qualified. In our experiments, two vectors are qualified as similar when their distance is small and their magnitude and phase similarity measures satisfy $S_{mag}(P_q, P_d) \geq 0.85$ in equation (7) and $S_{phase}(P_q, P_d) \geq 0.85$ in equation (8). Then, based on the current minimum distance between the query point and the single database point in the leaf node, the KD-tree is revisited to search for the next available neighbor within the current minimum distance. This tree backtracking is repeated until no further reduction of the minimum distance to the query point is found.

For each extracted region in the query building image, one vote is cast for the database building image containing the region claimed as the nearest neighbor of the query region. After all extracted regions of the query image have voted, we count the number of votes each database image receives. The database image with the maximum number of votes is returned as the most similar building to the query building.
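As an illustration of the indexing and voting procedure, the sketch below uses SciPy's cKDTree instead of the custom KD-tree of Fig. 7, and abstracts the ZM phase verification with the 0.85 thresholds of Eqs. (7)-(8) into a caller-supplied verify function; it is an approximation of Sections 4.1-4.2 under those assumptions, not the authors' exact implementation.

    # Hedged sketch: KD-tree indexing of ZM magnitude vectors and vote counting.
    from collections import Counter
    import numpy as np
    from scipy.spatial import cKDTree

    def retrieve(query_regions, db_mags, db_labels, verify):
        """query_regions: list of (mag, phase) descriptors of the query image;
        db_mags: (N, 42) array of database magnitude vectors; db_labels[i] is
        the database image owning vector i; verify(i, region) performs the
        phase/similarity check. Returns the database image with most votes."""
        tree = cKDTree(np.asarray(db_mags))
        votes = Counter()
        for region in query_regions:
            _, idx = tree.query(region[0], k=1)   # nearest neighbour in magnitude space
            if verify(idx, region):               # ZM phase verification step
                votes[db_labels[idx]] += 1
        return votes.most_common(1)[0][0] if votes else None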




Figure 8. Examples of the database images in the NCTU-Bud dataset. Rows: EC Building; ED Building, Face 1 (first side view); ED Building, Face 2 (second side view). Columns: Views 1-5.



Figure 9. Examples of query images for the NCTU-Bud dataset. Columns: Class A (correct exposure), Class B (over exposure), Class C (under exposure), Class D (correct exposure with occlusion), Class E (over exposure with occlusion), Class F (under exposure with occlusion). Rows: sunny day and cloudy day.

5. EXPERIMENTAL RESULTS

In our experiments, the proposed algorithm is implemented in Matlab under the Windows environment and evaluated on a platform with a 2.83 GHz processor and 3 GB of RAM. We test the proposed indexing and retrieval system on two sets of building images: the NCTU-Bud dataset created by ourselves and the publicly available ZuBud dataset [11].

5.1. The NCTU-Bud Dataset

To evaluate the proposed approach and to establish a benchmark for future work, we introduce the NCTU-Bud dataset. The dataset contains high-resolution images of 22 buildings on the NCTU campus. For each building in the database we capture at least one facet of the building from five different viewing directions. All database images have a resolution of 1600x1200 pixels, and the database contains a total of 190 building images. Some representative database images are shown in Fig. 8.

For the query images, we capture with a different camera at a resolution of 2352x1568 pixels under two weather conditions: sunny and cloudy. For each weather condition, six images are collected, each with different exposure settings and different occlusion conditions, so that 12 classes of images constitute the query dataset, as shown in Fig. 9. Furthermore, five additional camera poses, with different rotations, focal lengths, and translations, are recorded for further testing. A total of 2280 query images is gathered.

5.2. Experimental results for the NCTU-Bud dataset

Table I shows the total number of region feature vectors stored in the database and the recognition rate for the query images captured with normal exposure on cloudy days. From this table, feature selection using multiple images alone does not raise the query accuracy rate. However, we achieve 100% accuracy after applying both feature selection and DBSCAN clustering: not only is the region storage space reduced, but only the representative feature vectors are stored for the query search, and consequently the image retrieval accuracy rises to 100%.

The storage size (the number of nodes) is determined by the number of region feature vectors obtained from all images in the database. Approximately 50% of the space is saved by applying feature selection and the DBSCAN clustering method.

The time for feature region detection and description depends on the resolution and the content of an image. If the scene is complex, the number of extremal regions detected by MSER increases and the processing time increases as well. Table II shows the average processing time of feature detection and descriptor computation over 92 different images at different resolutions.

With feature selection and DBSCAN clustering, the average time to index the database is 22.4 seconds, and the average query time for an image at a resolution of 2352x1568 pixels is 40 seconds. The query time comprises feature region detection (MSER), descriptor computation (ZM), and the nearest-neighbor search in the database.

Table III shows the query accuracy rate for the 12 classes of query images; each class consists of 190 query images. The accuracy rate on cloudy days is generally higher than on sunny days, possibly because strong shadows are cast by occluding objects on sunny days. Over-exposed images are also harder to recognize than images under the other exposure conditions.

Comparing classes D-F with classes A-C, the proposed method also performs well under occlusion, which shows that the proposed system can distinguish feature regions even when buildings are partially occluded.

5.3. Experimental results for the ZuBud dataset

The ZuBud dataset contains images of 201 different buildings taken in Zurich, Switzerland, with 5 different images per building. Fig. 10 shows some example images. The dataset provides 115 query images, taken with a different camera under different weather conditions.

For the ZuBud dataset, the query accuracy rate with feature selection and DBSCAN clustering is over 95%. The average query time is 3.1 seconds with a variation of 1.16 seconds. These results show that our system also performs well on this publicly available dataset.

TABLE I. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE NCTU-BUD DATASET

                                 Without feature   With feature      With feature
                                 selection         selection only    selection + DBSCAN
  # of region feature vectors    113,194           68,036            56,089
  Memory size of a KD-tree       22 MB             12.9 MB           10.6 MB
  Query accuracy rate            94.7%             94.7%             100%

TABLE II. AVERAGE PROCESSING TIME OF FEATURE DETECTION AND DESCRIPTOR COMPUTATION AT DIFFERENT RESOLUTIONS

  Resolution                           2352x1568    1600x1200    640x480
  Avg. / std. processing time (sec)    13.8 / 4.3   5.8 / 1.58   1.8 / 0.7

TABLE III. QUERY ACCURACY RATE OF THE NCTU-BUD DATASET UNDER DIFFERENT WEATHER CONDITIONS

                                            Sunny day    Cloudy day
  Class A: Correct exposure                 93.6%        100%
  Class B: Over exposure                    92.1%        92.1%
  Class C: Under exposure                   93.1%        96.3%
  Class D: Correct exposure with occlusion  93.6%        96.3%
  Class E: Over exposure with occlusion     92.1%        94.2%
  Class F: Under exposure with occlusion    92.6%        96.8%

Figure 10. Example images of the ZuBud dataset (a query image and its corresponding database image).

TABLE IV. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE ZUBUD DATASET

                               Without feature   With feature    With feature
                               selection         selection       selection + DBSCAN
  # Region feature vectors     488,527           264,311         256,261
  Recognition accuracy         89.57%            94.8%           95.6%




6. CONCLUSION

In this paper, we have presented a novel image retrieval system based on the MSER detector and the ZM descriptor, which is robust to geometric and photometric transformations. Experimental results illustrate that the KD-tree indexing and retrieval system with the magnitude and phase ZM feature vectors achieves a high query accuracy rate: the accuracy rates for our NCTU-Bud dataset and the ZuBud dataset are 100% and 95%, respectively.

The success of our system is attributed to:
(a) Selection of MSER feature vectors using multiple images of the same building captured from different viewpoints, which removes the unreliable regions.
(b) The DBSCAN clustering technique, which groups similar feature vectors into a representative feature descriptor to tackle the problem of repeated feature patterns in the image.

In the future, we will consider optimizing the programs and porting the system to mobile phones for mobile device applications. Furthermore, the query results may be verified using multi-view geometry constraints to eliminate outliers and lower the misrecognition rate.

REFERENCES

[1] J. Wang, G. Wiederhold, O. Firschein, and S. Wei, "Content-Based Image Indexing and Searching Using Daubechies' Wavelets," Int'l J. Digital Libraries, vol. 1, pp. 311-328, 1998.
[2] Z. Chen and S. K. Sun, "A Zernike moment phase-based descriptor for local image representation and matching," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 205-219, 2009.
[3] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, pp. 43-72, 2005.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[6] F. Jing and M. Li, "An efficient and effective region-based image retrieval framework," IEEE Trans. Image Processing, vol. 13, no. 5, 2004.
[7] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan, "Categorizing nine visual classes using local appearance descriptors," ICPR Workshop on Learning for Adaptable Visual Systems, 2004.
[8] W. Wu and J. Yang, "Object fingerprints for content analysis with applications to street landmark localization," Proc. ACM Int'l Conf. on Multimedia, 2008.
[9] S. Obdrzalek and J. Matas, "Image retrieval using local compact DCT-based representation," Pattern Recognition, 25th DAGM Symposium, vol. 2781 of Lecture Notes in Computer Science, Magdeburg, Germany: Springer-Verlag, pp. 490-497, 2003.
[10] W. Zhang and J. Kosecka, "Hierarchical building recognition," Image and Vision Computing, 2007.
[11] H. Shao, T. Svoboda, and L. Van Gool, "ZuBuD—Zurich Buildings Database for Image Based Recognition," Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, 2003.
[12] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, pp. 761-767, 2004.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, and J. Matas, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, no. 1/2, pp. 43-72, 2005.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996.
Using Modified View-Based AAM to Reconstruct the Frontal Facial Image with
                  Expression from Different Head Orientation
1 Po-Tsang Li (李柏蒼), 1 Sheng-Yu Wang (王勝毓), 1,2 Chung-Lin Huang (黃仲陵)

1 Dept. of Electrical Engineering, National Tsing Hua University, Hsin-Chu, Taiwan.
2 Dept. of Informatics, Fo-Guang University, I-Lan, Taiwan.
                                                E-mail: clhuang@ee.nthu.edu.tw

Abstract

This paper develops a method to solve the unpredictable head orientation problem in 2D facial analysis. We extend the expression subspace of the view-based Active Appearance Model (AAM) so that it can be applied to multi-view face fitting and pose correction for facial images with any expression. Our multi-view, model-based facial image fitting system can be applied to a 2D face image (with expression variation) at any pose, and the facial image in any view can be reconstructed in another view. We divide the facial image into an expression component and an identity component to increase the face identification accuracy. The experimental results demonstrate that the proposed algorithm can be applied to improve the facial identification process. We test our system on video sequences with a frame size of 320x240 pixels; it requires 30~45 ms to fit a face and 0.35~0.45 ms for warping.

Keywords: View-based AAM; Facial expression

1. Introduction

Facial image analysis consists of face detection, facial feature extraction, face identification, and facial expression recognition. Currently, 2D face recognition technology is well developed with high recognition accuracy. However, unpredictable head orientation often causes a big problem for 2D facial analysis. Most previous facial identification or facial expression recognition methods are limited to the frontal face and the profile face; they work only for faces in a single view with ±15 degrees of variation.

The best-known 3D model is the 3D Morphable Model (3DMM) proposed by Blanz and Vetter [11]. 3DMM and AAM are similar: both are model-based approaches consisting of a shape model and a texture model, and both use Principal Component Analysis (PCA) for dimension reduction. The two major differences between them are (1) the optimization algorithm used in fitting, and (2) the feature points of the shape model, which are 3D points for 3DMM, whereas for AAM they are 2D locations. In data collection, an AAM can be built from 2D facial images, whereas 3DMM captures the depth information using a 3D face laser scanner. 3DMM can accurately reconstruct a 3D human face; however, it requires so much computation that its applications are limited mainly to academic research.

Blanz et al. [12] apply 3DMM to human identity recognition; however, the fitting process takes 4.5 minutes per frame on a workstation with a 2 GHz Pentium 4 processor. For facial expression recognition, the problem becomes more obvious. Due to insufficient 3D face expression data, one could only rely on a single-expression (neutral) 3D face model for 3D facial identity recognition. However, as more 3D face expression databases become available, researchers such as Wang et al. [16], Amor et al. [17], and Kakadiaris et al. [18] have developed methods to identify the human face under different views and different expressions. Nevertheless, because facial expressions are complicated {surprise, sadness, happiness, disgust, anger, fear}, building 3-D models for the different facial expressions is impractical. Lu et al. [20] only record the variations of the landmark points and then apply Thin-Plate-Spline warping to synthesize facial images of other expressions for fitting face expression images. Chang et al. [15] also divide the training data into an identity space and an expression space and use bilinear interpolation to synthesize human faces with other expressions. Ramanathan et al. [19] propose a method using 3DMM for facial expression recognition.

To capture 3D face information, we may use either a 3D laser scanner or multi-view 2D images. Recently, the 2D+3D active appearance model (AAM) method has been proposed by Xiao et al. [21], Koterba et al. [22], and Sung et al. [23]. Based on the known projection matrix of a certain view, the so-called 2D+3D AAM method trains a 2D AAM for a single view for later tracking and fitting of the landmark points in 2D images; it then uses the corresponding points to calculate the 3D positions of the landmark points. Xiao et al. [21] use only 900 image frames from a single camera to develop the 3D AAM model. Because of the precision error of 2D AAM landmark-point tracking, Lucey et al. [24] point out that the feature points tracked by 2D+3D AAM are worse than the normalized shape obtained by 2D AAM fitting; their argument is that 2D+3D AAM cannot obtain the depth information precisely, which causes recognition errors.

In this paper, we apply the view-based AAM proposed by Cootes et al. [4] for model-fitting of an input face with any




expression and in any view angle; the fitted face can then be warped to any target viewing angle. The view-based AAM consists of several 2D AAMs, which can be further divided into an inter model and an intra model. The inter model describes the parameter transformation between any two 2D AAMs, whereas the intra model describes the relationship between the model parameters and the viewing angle for a single 2D AAM. The view-based AAM is generated by an off-line training process. Besides the identity subspace, this paper extends the expression subspace of the inter model so that the view-based AAM can be applied to multi-view face fitting and pose correction for an input face of any expression.

The flow diagram is shown in Fig. 1. For an input face image, based on the intra model, we find the relationship between the parameters and the viewing angle and then remove the angle effect from the parameters. We then divide the angle-independent model parameters into identity parameters and expression parameters, which can be transformed to the target 2D AAM model by using the inter model. Finally, based on the intra model, we add the influence of the angle parameters back onto the model parameters and synthesize the facial image at the target viewing angle.

Figure 1. The flowchart of our system (input image → facial region detection → pose classification → fitting with the i-th AAM of the modified view-based AAM → selection of the target model for target orientation θ → rotation of model i → j → reconstruction with the j-th AAM → new view at angle θ).

2. Active Appearance Model

In the modified view-based AAM, the 2D AAM plays a crucial part. This section introduces the overall structure of the 2D AAM and the flow of the training and fitting algorithms. The major goal of the AAM, first proposed by Cootes et al. [2], is to find the model parameters that reduce the difference between the synthesized image (generated by the AAM model) and the target image. Based on the parameters and the AAM model, we may regenerate the face.

2.1 Statistical Appearance Models

A statistical appearance model consists of two parts: the shape model, describing the shape of the object, and the texture model, describing the gray-level information of the object. Labeled face images are used to train the AAM. To train the AAM model, we must have an annotated set of facial images with so-called landmark points. These landmark points are selected as salient points on the face that are identifiable on any human face. Figure 2 shows some annotated training face images.

Figure 2. Examples of the training set.

The number of landmark points is determined experimentally. Although more landmark points increase the accuracy of the model, they also increase the computation of the model fitting process. The distribution of landmark points depends on the characteristics of the face, such as the eyebrows, eyes, nose, and mouth. In these regions we need to put more landmark points, whereas in the other regions (such as ears, forehead, or other non-visible areas) we put no landmarks.

2.2 Shape Model

Here, we use triangular meshes to compose the human face. We define a shape $s_i$ as a vector containing the coordinates of $N_s$ landmark points in a face image $I_i$:

$$s_i = (x_1, y_1, x_2, y_2, \ldots, x_{N_s}, y_{N_s})^T \qquad (1)$$

The model is constructed from the coordinates of the labeled points of the training images. We align the locations of the corresponding points on different training faces by using Procrustes analysis as normalization. Given the set of aligned shapes, we then apply Principal Component Analysis (PCA) to the data. Any shape example can then be approximated by

$$s = \bar{s} + P_s b_s \qquad (2)$$

where $\bar{s}$ is the mean shape of all aligned shapes, calculated as $\bar{s} = \sum_{i=1}^{N} s_i / N$, $P_s = (p_{s1}, p_{s2}, \ldots, p_{st})$ is the matrix of the first $t$ eigenvectors of the shape covariance matrix, and $b_s$ is the set of shape parameters. Figure 3 shows the effects of varying the first two shape model parameters by ±2 standard deviations.

Figure 3. First two modes of shape variation (±2 sd).

2.3 Texture Model

The texture of the AAM is defined as the gray-level information at the pixels $x = (x, y)$ that lie inside the mean shape $\bar{s}$. First, we align the control points and the mean shape $\bar{s}$ of every training face
              e                              M




image by using affine warping. Then we sample the gray-level information $g_{im}$ of the warped images over the mean shape region. Before applying PCA to the texture data, to minimize the effect of lighting variation, we first normalize $g_{im}$ by applying a scaling $\alpha$ and an offset $\beta$:

$$g = (g_{im} - \beta \cdot \mathbf{1}) / \alpha \qquad (3)$$

where $\mathbf{1}$ is a vector of ones. Let $\bar{g}$ be the mean of the normalized texture data, scaled and offset so that its sum is zero and its variance is unity. $\alpha$ and $\beta$ are selected to normalize $g_{im}$ as

$$\beta = (g_{im} \cdot \mathbf{1})/n \quad \text{and} \quad \alpha = g_{im} \cdot \bar{g}, \qquad (4)$$

where $n$ is the number of pixels in the mean shape. We iteratively apply Equations (3) and (4) to estimate $\bar{g}$ until the estimate stabilizes. Then, we apply PCA to the normalized texture data so that a texture example can be expressed as

$$g = \bar{g} + P_g b_g \qquad (5)$$

where $P_g$ contains the eigenvectors and $b_g$ is the vector of texture parameters. Figure 4 shows the effects of varying the first two texture model parameters through ±2 standard deviations.

Figure 4. First two modes of texture variation (±2 sd).

2.4 Appearance Model

The shape and texture of any example in the training set can be summarized by $b_s$ and $b_g$. The appearance model combines the two parameter vectors into a single parameter $b_c$ as

$$b_c = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (s - \bar{s}) \\ P_g^T (g - \bar{g}) \end{pmatrix} \qquad (6)$$

where $W_s$ is a diagonal matrix of weights for each shape parameter. A further PCA is applied to remove possible correlations between the shape and texture variations:

$$b_c = Q_c c \qquad (7)$$

where $Q_c$ contains the eigenvectors and $c$ is the appearance parameter.

Given an appearance parameter $c$, we can synthesize a face image by generating the gray levels $g$ in the interior of the mean shape and warping the texture from the mean shape $\bar{s}$ to the model shape $s$, using

$$s = \bar{s} + P_s W_s^{-1} Q_s c, \qquad g = \bar{g} + P_g Q_g c \qquad (8)$$

where $Q_c = (Q_s, Q_g)^T$. Figure 5 shows the effects of varying the first two appearance model parameters through ±2 standard deviations.

Figure 5. First two modes of appearance variation (±2 sd).

2.5 Shape parameter weight

The shape parameters $b_s$ have units of distance and the texture parameters $b_g$ have units of intensity. Because they are of different nature and different relevance, they cannot be compared directly. To estimate a suitable $W_s$, we systematically displace the elements of $b_s$ from each example's best-match parameters in the training set and sample the corresponding texture difference. In addition, the active appearance model has a pose parameter vector describing the similarity transformation of the shape. The pose parameter vector $t$ has four elements, $t = (k_x, k_y, t_x, t_y)^T$, where $(t_x, t_y)$ is the translation and $(k_x, k_y)$ represent the scaling $k$ and the in-plane rotation angle $\theta$, with $k_x = k(\cos\theta - 1)$ and $k_y = k\sin\theta$.

2.6 Active Appearance Model Search

Here, we introduce the kernel of the AAM. The ultimate goal of applying the AAM is that, given an input facial image, we find the model parameters that can be applied to the AAM model to synthesize an image similar to the input image. Given a new image, we have an initial estimate of the appearance parameter $c$ and of the position, orientation, and scaling of the face in the image. We need to minimize the difference

$$E = g_{image} - g_{model} \qquad (9)$$

where, based on the current estimate of $c$, we have $g_{model} = \bar{g} + P_g Q_g c$ and $s_{model} = \bar{s} + P_s W_s^{-1} Q_s c$. Here $g_{image}$ denotes the texture obtained from the target image by applying the warp function defined by $s_{model}$ and $\bar{s}$ and sampling the pixel intensities of the region. An algorithm is needed to adjust the parameters so that the input image and the image generated by the model become as close as possible. Many optimization algorithms have been proposed for this parameter search; in this paper, we apply the so-called AAM-API method [8]. Rewriting (9) as

$$E(p) = g_{image} - g_{model} \qquad (10)$$

where $p$ is the vector of model parameters, $p = (c^T \mid t^T \mid u^T)$ with $u = (\alpha\ \beta)^T$, a Taylor expansion of (10) gives

$$E(p + \nabla p) \approx E(p) + \frac{\partial E}{\partial p}\,\nabla p \qquad (11)$$

where the $ij$-th element of the matrix $\partial E / \partial p$ is $\partial E_i / \partial p_j$. Suppose $E$ is the current matching error. We want to find $\nabla p$ that minimizes $\|E(p + \nabla p)\|^2$. Setting Equation (11) to zero, we obtain the least-squares solution

$$\nabla p = -A\,E(p), \qquad A = \left( \frac{\partial E}{\partial p}^T \frac{\partial E}{\partial p} \right)^{-1} \frac{\partial E}{\partial p}^T.$$

If we applied a conventional optimization process, we would need to recalculate $\partial E / \partial p$ after every match, which requires heavy computation. To simplify the optimization, Cootes et al. assume that $A$ is approximately constant and that the relationship between $E$ and $\nabla p$ is linear.
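To make the search concrete, the following minimal Python sketch shows the on-line matching loop that the step-by-step procedure below summarizes: with the regression matrix A precomputed off-line, each iteration evaluates the texture residual E(p) = g_image − g_model and updates the parameters by ∇p = −A E(p). The functions model_texture and warp_and_sample are hypothetical placeholders for the AAM synthesis and shape-normalized sampling described above, and the stopping test is an assumption rather than the authors' exact criterion.

    # Hedged sketch of the AAM search loop of Section 2.6 with a constant matrix A.
    import numpy as np

    def aam_search(image, p0, A, model_texture, warp_and_sample,
                   max_iters=30, tol=1e-4):
        """Iteratively refine the parameter vector p = (c | t | u)."""
        p = np.asarray(p0, dtype=float)
        prev_err = np.inf
        for _ in range(max_iters):
            g_model = model_texture(p)            # texture synthesized from p
            g_image = warp_and_sample(image, p)   # image sampled in the model frame
            E = g_image - g_model                 # residual texture E(p)
            err = np.linalg.norm(E)
            if abs(prev_err - err) < tol:         # assumed convergence test
                break
            p = p + (-A @ E)                      # p -> p + k*dp, dp = -A E(p), k = 1
            prev_err = err
        return p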




Therefore, we systematically displace the parameters from their optimal values on the example images and record the corresponding effect on the texture difference. Applying multivariate linear regression to the displacements ∇p and the corresponding texture differences E yields A. We therefore need not recalculate the matrix A; it can be computed off-line and stored in memory for later reference. To match an image on-line, the procedure is as follows:

Initial estimate of the parameters p.
1. Calculate the model shape s_model and the model texture g_model.
2. Warp the current image and sample the texture g_image.
3. Evaluate the difference texture E = g_image − g_model.
4. Update the model parameters p → p + k∇p, with ∇p = −AE(p) and initial k = 1.
5. Calculate the new model shape s_model and model texture g_model.
6. Sample the image at the new shape to obtain g_image.
7. Calculate the new error E'.
8. If |E'|^2 < |E|^2, accept the new estimate; otherwise retry with k = 0.5 and then k = 0.25.

The iteration of the preceding steps stops when |E|^2 can no longer be reduced, and we may then assume that the iterative algorithm has converged.
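As a concrete illustration of the on-line matching loop above, the following minimal NumPy sketch assumes a precomputed update matrix A and two hypothetical helper functions, synthesize_texture and sample_texture, for steps 1, 2, 5, and 6; it is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def aam_search(image, p, A, synthesize_texture, sample_texture, max_iters=30):
        """Iterative AAM matching with a fixed, precomputed update matrix A."""
        g_model = synthesize_texture(p)                 # step 1: model texture
        g_image = sample_texture(image, p)              # step 2: image texture under p
        E = g_image - g_model                           # step 3: difference texture
        err = np.sum(E ** 2)
        for _ in range(max_iters):
            dp = -A @ E                                 # step 4: update direction
            for k in (1.0, 0.5, 0.25):                  # step 8: damped retries
                p_new = p + k * dp
                g_model_new = synthesize_texture(p_new)          # step 5
                g_image_new = sample_texture(image, p_new)       # step 6
                E_new = g_image_new - g_model_new
                err_new = np.sum(E_new ** 2)                     # step 7
                if err_new < err:                       # accept the improved estimate
                    p, E, err = p_new, E_new, err_new
                    break
            else:
                break                                   # no step size helped: converged
        return p, err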
   Figure 6. Examples from the training set for the models: (a) right profile face, 90° and 75°; (b) right half face, 60° and 45°; (c) frontal face, 0° and −15°.

3. Modified View-Based AAM

Cootes et al. [4] propose the view-based AAM, which uses several 2D AAMs to fit a 3D model to a 2D image. The model-based fitting for model parameter estimation can be divided into intra-model and inter-model fitting. Their method has been successfully applied to the human face without expression. However, it has problems fitting faces with expression, because in the face parameter space the intra-person changes due to expression are much larger than the inter-person changes. The original linear transformation between the view angle and the AAM parameters is then no longer valid. Here we propose a method that projects the facial space onto an identity subspace and an expression subspace to solve this problem. We divide the viewing angle into five ranges, [−90, −75], [−60, −45], [−15, 15], [45, 60], and [75, 90], from leftward to rightward. Since the human face is symmetric, in the experiments we only develop 2D AAMs for three ranges: [−15, 0], [45, 60], and [75, 90].

3.1 Training Data

Since we do not have a large multi-expression and multi-view facial image database for the 2D AAM training process, we obtained the training data by using six cameras to capture a multi-expression, multi-view facial image database. We collected multi-expression, multi-view facial images of 13 people with multiple expressions (neutral, surprised, happiness, sadness, disgust, anger, and fear). There are 510 facial images in total in the training data set. Figure 6 shows some of the training samples.

3.2 Intra-Model Rotate

Cootes et al. [4] suggest that the model parameters c are related to the view angle θ as

        c = c_0 + c_c cos(θ) + c_s sin(θ)                               (12)

where c_0, c_c, and c_s are vectors learned from the training data. We can find the optimal parameter value c_i of each training example and its corresponding view angle θ_i. Cootes' method does not determine θ_i precisely; it allows errors of about ±10 degrees. In our experiment, we fixed the cameras so that the viewing angle is known beforehand. However, this creates an under-determined problem: we use facial images from only two views to generate one AAM, so there are only two inputs available to estimate three unknowns. We therefore randomly perturb θ_i by ±1 degree. This is reasonable because errors during image capture are unavoidable, such as slight movements of the subject's body or head. Using this method to add more input data, we can estimate c_0, c_c, and c_s by applying multiple linear regression to the relationship between the c_i and (1, cos(θ), sin(θ))^T.

   Given a facial image, after finding the best fitting parameter c_j we may use Equations (13) and (14) to estimate the viewing angle θ_j:

        (x_j, y_j)^T = R_c^-1 (c_j − c_0)                               (13)

where R_c^-1 is the left pseudo-inverse of (c_c | c_s), i.e., R_c^-1 (c_c | c_s) = I_2, and

        θ_j = tan^-1(y_j / x_j)                                         (14)

Figure 7 shows the predicted angle compared with the actual angle over the training set for each model. The results are worse than those of Cootes et al. [4] because our model contains facial images with multiple expressions.

   Figure 7. Predicted angle vs. actual angle across the training set: (a) results on our data; (b) Cootes' experimental results for the view-based active appearance model.
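To make Equations (12)-(14) concrete, the following minimal NumPy sketch fits the angle model by least squares and then recovers the view angle from a fitted appearance parameter vector. The variable names (c0, cc, cs) mirror the text; the construction is our own illustration based on Equations (12)-(14), not the authors' code.

    import numpy as np

    def fit_angle_model(C, thetas):
        """Least-squares fit of c = c0 + cc*cos(theta) + cs*sin(theta) (Eq. 12).
        C: (n, d) array of fitted appearance parameters; thetas: (n,) in radians."""
        basis = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
        coeffs, *_ = np.linalg.lstsq(basis, C, rcond=None)   # rows: c0, cc, cs
        return coeffs[0], coeffs[1], coeffs[2]

    def estimate_view_angle(c_j, c0, cc, cs):
        """Recover theta_j from a fitted parameter vector c_j (Eqs. 13-14)."""
        Rc = np.column_stack([cc, cs])               # d x 2 matrix (cc | cs)
        x_j, y_j = np.linalg.pinv(Rc) @ (c_j - c0)   # left pseudo-inverse, Eq. (13)
        return np.arctan2(y_j, x_j)                  # Eq. (14)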
Given a new person's facial image, we apply AAM fitting to find the best model parameters and to estimate the head angle as well. Then we can remove the angle effect by using

        c_residual = c_j − c_0 − c_c cos(θ_j) − c_s sin(θ_j)            (15)

The model parameters are thereby separated into two parts: one part that describes the variation due to rotation, and another part that describes the remaining variations (e.g., the variation of identity, expression, and illumination). We can use these parameters to reconstruct the appearance at a new angle φ as

        c(φ) = c_residual + c_0 + c_c cos(φ) + c_s sin(φ)               (16)

This method can only perform small-angle rotations based on a single 2D AAM. Cootes et al. [7] and Huisman [25] have shown that this intra-model pose correction can be applied to human face recognition.
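A minimal sketch of Equations (15) and (16), assuming the c0, cc, cs vectors and the angle estimate from the previous sketch; it is illustrative only.

    import numpy as np

    def remove_rotation(c_j, theta_j, c0, cc, cs):
        """Eq. (15): strip the view-angle contribution from the parameters."""
        return c_j - c0 - cc * np.cos(theta_j) - cs * np.sin(theta_j)

    def reapply_rotation(c_residual, phi, c0, cc, cs):
        """Eq. (16): re-synthesize appearance parameters at a new angle phi."""
        return c_residual + c0 + cc * np.cos(phi) + cs * np.sin(phi)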

3.3 Identity and Expression Subspace

To perform a large-angle warping, we must transform the parameters between the 2D models. We intend to find a simple transformation between the two models. However, the parameters in (15) consist of an identity component and an expression component, which makes the transformation non-trivial. Cootes uses two different methods to remove the expression and project into an identity subspace; the parameters are then reduced to the variation of identity, for which the transformation is linear.

   Let r be defined as the residual parameter after (15). We divide the training data into r_neutral and r_exp, where exp ∈ {happiness, sadness, fear, anger, disgust, surprised}, to compute the expression and identity covariance matrices. We remove the identity component of r_exp by

        e_exp = r_exp − r_neutral                                       (17)

where e_exp is defined as the expression component. Figure 8 shows training examples of e_exp, and Figure 9 shows the training examples r_neutral. By applying PCA to r_neutral and e_exp, we can find the projection P_neutral into an identity subspace and P_exp into an expression subspace as

        e_exp = ē_exp + P_exp b_exp                                     (18)

and

        r_neutral = r̄_neutral + P_neutral b_neutral                    (19)

   Figure 8. Some examples from the expression-component training set.

   Figure 9. The neutral images used for training the identity subspace.

Costen et al. [26] suggested that, in this framework, expression changes are orthogonal to the changes due to identity. For a new image with parameter r, the expression parameter b_exp can be calculated by

        b_exp = P_exp^T (r − ē_exp)                                     (20)

Then we can compute r_neutral by

        r_neutral = r − ē_exp − P_exp b_exp                             (21)

and project it into the identity subspace,

        b_neutral = P_neutral^T (r_neutral − r̄_neutral)                (22)

   Figure 10. The relation of the facial space to identity and expression.

3.4 Inter-Model Rotate

We may now use multiple linear regression to relate b_exp and r_neutral in the ith AAM model to b_exp and r_neutral in the jth AAM model (through R_neutral^ij and R_exp^ij) as

        r_neutral^j = e_neutral^ij + R_neutral^ij r_neutral^i           (23)

and

        b_exp^j = e_exp^ij + R_exp^ij b_exp^i                           (24)

where e_neutral^ij and e_exp^ij are constant.
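The following minimal NumPy sketch illustrates the subspace projections of Equations (20)-(22). The matrices P_exp and P_neutral and the mean vectors are assumed to come from the PCA step above; the function name and argument names are ours.

    import numpy as np

    def split_identity_expression(r, P_exp, e_exp_mean, P_neutral, r_neutral_mean):
        """Project a residual parameter vector r into the expression and identity subspaces."""
        b_exp = P_exp.T @ (r - e_exp_mean)                       # Eq. (20)
        r_neutral = r - e_exp_mean - P_exp @ b_exp               # Eq. (21)
        b_neutral = P_neutral.T @ (r_neutral - r_neutral_mean)   # Eq. (22)
        return b_neutral, b_exp, r_neutral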


3.5 Reconstruct a New View

Given a match of a new person in one view, we can reconstruct another view by the following steps (as shown in Fig. 11):

1. Remove the effects of orientation (Eq. 15).
2. Project into the identity and expression subspaces of the model (Eqs. 20, 21, 22).
3. Project into the subspaces of the target model (Eqs. 23, 24).
4. Project back into the residual space and combine the two vectors into one vector (inverse of Eqs. 20, 21, 22).
5. Add the assigned orientation (Eq. 16).
Figure 11. The flowchart of Rotate Model.
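Putting the five steps together, a hedged end-to-end sketch of the rotation pipeline is given below. It reuses the helper functions from the earlier sketches and assumes that each AAM is described by a dict holding its c0, cc, cs vectors, PCA bases and means, and the learned inter-model regression terms of Eqs. (23)-(24); it is an illustration, not the authors' implementation.

    def rotate_to_new_view(c_j, theta_j, phi, src, dst):
        """Warp fitted parameters c_j (at angle theta_j, in model src) to angle phi in model dst."""
        # 1. Remove the orientation effect (Eq. 15).
        r = remove_rotation(c_j, theta_j, src["c0"], src["cc"], src["cs"])
        # 2. Project into the source model's identity/expression subspaces (Eqs. 20-22).
        _, b_exp, r_neutral = split_identity_expression(
            r, src["P_exp"], src["e_exp_mean"], src["P_neutral"], src["r_neutral_mean"])
        # 3. Map the identity residual and expression coefficients to the target model (Eqs. 23-24).
        r_neutral_dst = dst["e_neutral"] + dst["R_neutral"] @ r_neutral
        b_exp_dst = dst["e_exp"] + dst["R_exp"] @ b_exp
        # 4. Recombine the two components into one residual vector (inverse of Eq. 21).
        r_dst = r_neutral_dst + dst["e_exp_mean"] + dst["P_exp"] @ b_exp_dst
        # 5. Add the assigned orientation (Eq. 16).
        return reapply_rotation(r_dst, phi, dst["c0"], dst["cc"], dst["cs"])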


4. Experimental Results

Here we illustrate the results of our methods. We use six cameras to capture the expressions of each person; there are 13 persons in our experiments, each with 5 or 6 different expressions. We select 510 pictures as training data for the multi-pose 2D AAMs. In the testing phase, we apply model fitting to all pictures, and about 90% of the testing pictures are fitted successfully. Because we do not have enough training data, we apply leave-one-out evaluation to train and test our rotation-model algorithm. Besides warping the input face to the pre-trained pose, we also try warping the face to other poses and comparing the results with video captured at those specific poses.

   Although our system allows us to fit the model face and then warp the face to any pose, for some views the warping results are not as good as for others. To compare the results of the rotated model, we warp the input face image in the right-half view to the frontal pose and compare it with the ground truth pre-stored in our database, as shown in Figure 12.

   Figure 12. Result of warping the right-half view to the frontal view vs. the ground truth.

   Then we illustrate the experimental results of warping the right-side view to the frontal view and compare them with the ground truth, as shown in Figure 13. Apparently, the performance is not as good as in the previous case.




Figure 13. The experimental results of warping the right-side view to the frontal view.

   We use a PC equipped with an Intel Core 2 Duo 6300 CPU and 2045 MB of memory to test our algorithm. For a video sequence with frame resolution 320×240, the processing time is less than 45 ms/frame.

   The purpose of warping the non-frontal face to the frontal view is to increase the face identification accuracy. Before the warping process, we have separated the identity component and the expression component from the model parameters. To analyze the warped facial image, we may use the identity parameter or the expression parameter independently to increase the recognition rate. In the following, we synthesize the face image by using only the identity component or only the expression component. The experimental results for the right-half-view and right-side-view facial images are shown in Figures 14 and 15.

   In Figure 14, the lower-right image illustrates the facial image synthesized using only the identity component. The expression can hardly be found, and it shows a neutral face. In Figure 15, the warped image using the identity component is worse than in Figure 14; however, the warped image using the expression parameter looks fine.

   Figure 14. The experimental results of warping the right-half-view facial image to the front view.

   Figure 15. The experimental results of warping the right-side-view facial image to the front view.

   We use the cosine similarity measure x1·x2 / (|x1| |x2|) to evaluate whether the warped image helps increase the recognition rate, where x1 represents a pre-stored frontal neutral face image from the database and x2 represents a testing facial image with any expression and in any viewing direction.
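The similarity used here is the standard cosine similarity; a minimal NumPy version (ours, for illustration) is:

    import numpy as np

    def cosine_similarity(x1, x2):
        """Cosine similarity between two feature vectors."""
        return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))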
Table 1. The improvement in identity recognition with ICO (identity component only) and PC (pose correction) at 15 degrees.

                          ICO      PC       PC+ICO
    Frontal intra-model   18%      3.7%     21.5%

In Table 2, the comparison is done with the expression parameter. We find that the identity component increases the similarity to the neutral faces in the database. On the other hand, for the right-half-view faces with expression processed by PC + ICO (45-60 degrees), the average similarity is about 74.3%, which is only 4.6% lower than that of the PC + ICO frontal expression faces. However, the improvement for the right-side-view faces with expression is very limited; the similarity is about 56.4%.

5. Conclusions

In this paper, we have demonstrated that the expression parameters can be linearly transformed between any two AAMs of the view-based AAM. This can then be used to match an expression-variant face at any angle and to predict the appearance from new viewpoints given a single image of a person. We anticipate that this approach will make face recognition and expression recognition systems more invariant to the viewing angle. In the future, we aim to establish a wide-angle facial detection and recognition system with higher accuracy, less processing time, and more stability.

References

[1]  T.F. Cootes, D. Cooper, C.J. Taylor and J. Graham, Active Shape Models - Their Training and Application, Computer Vision and Image Understanding, Vol. 61, No. 1, pp. 38-59, 1995.
[2]  T.F. Cootes, G.J. Edwards and C.J. Taylor, Active Appearance Models, Proc. European Conf. on Computer Vision, Vol. 2, pp. 484-498, 1998.
[3]  G.J. Edwards, C.J. Taylor and T.F. Cootes, Interpreting Face Images using Active Appearance Models, Int. Conf. on Face and Gesture Recognition, 1998.
[4]  T.F. Cootes, G.V. Wheeler, K.N. Walker and C.J. Taylor, View-Based Active Appearance Models, Image and Vision Computing, Vol. 20, pp. 657-664, 2002.
[5]  T.F. Cootes, G.V. Wheeler, K.N. Walker and C.J. Taylor, Coupled-View Active Appearance Models, British Machine Vision Conference, 2000.
[6]  T.F. Cootes, G.J. Edwards and C.J. Taylor, Active Appearance Models, IEEE Trans. on PAMI, Vol. 23, No. 6, pp. 681-685, 2001.
[7]  H. Kang, T.F. Cootes and C.J. Taylor, A Comparison of Face Verification Algorithms using Appearance Models, Proc. BMVC 2002, Vol. 2, pp. 477-486.
[8]  M.B. Stegmann, B.K. Ersbøll and R. Larsen, FAME - A Flexible Appearance Modelling Environment, IEEE Transactions on Medical Imaging, 2003.
[9]  I. Matthews and S. Baker, Active Appearance Models Revisited, IJCV, 2004, in press.
[10] 陳曉瑩, Real-time Multi-angle Face Detection, Master's thesis, Institute of Electrical Engineering, National Tsing Hua University, 2006 (in Chinese).
[11] V. Blanz and T. Vetter, A Morphable Model for the Synthesis of 3D Faces, Proc. Computer Graphics SIGGRAPH '99, 1999.
[12] V. Blanz and T. Vetter, Face Recognition Based on Fitting a 3D Morphable Model, IEEE Trans. on PAMI, 25(9), September 2003.
[13] C. Christoudias, L. Morency and T. Darrell, Light Field Appearance Manifolds, European Conf. on Computer Vision, (4):482-493, 2004.
[14] R. Gross, I. Matthews and S. Baker, Eigen Light-Fields and Face Recognition Across Pose, Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[15] J. Chang, Y. Zheng and Z. Wang, Facial Expression Analysis and Synthesis: A Bilinear Approach, Int. Conf. on Information Acquisition (ICIA '07), 8-11 July 2007.
[16] Yueming Wang, Gang Pan and Zhaohui Wu, 3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach, IEEE CVPR, 2007.
[17] B.B. Amor, M. Ardabilian and L. Chen, New Experiments on ICP-Based 3D Face Recognition and Authentication, ICPR 2006, Vol. 3, pp. 1195-1199.
[18] I.A. Kakadiaris, G. Passalis, G. Toderici, M.N. Murtuza, Y. Lu, N. Karampatziakis and T. Theoharis, Three-Dimensional Face Recognition in the Presence of Facial Expression: An Annotated Deformable Model Approach, IEEE Trans. on PAMI, Vol. 29, Issue 4, pp. 640-649, April 2007.
[19] S. Ramanathan, A. Kassim, Y. Venkatesh and S.W. Wu, Human Facial Expression Recognition using a 3D Morphable Model, IEEE ICIP, Oct. 2006.
[20] X. Lu and A. Jain, Deformation Modeling for Robust 3D Face Matching, IEEE Trans. on PAMI, 2007.
[21] Jing Xiao, S. Baker, I. Matthews and T. Kanade, Real-Time Combined 2D+3D Active Appearance Models, CVPR 2004, pp. II-535-II-542.
[22] S. Koterba, S. Baker, I. Matthews, Changbo Hu, Jing Xiao, J. Cohn and T. Kanade, Multi-View AAM Fitting and Camera Calibration, IEEE ICCV, Vol. 1, pp. 511-518, 17-21 Oct. 2005.
[23] J. Sung and D. Kim, STAAM: Fitting a 2D+3D AAM to Stereo Images, IEEE ICIP, 8-11 Oct. 2006.
[24] S. Lucey, I. Matthews, Changbo Hu, Z. Ambadar, F. de la Torre and J. Cohn, AAM Derived Face Representations for Robust Action Recognition, Int. Conf. on Automatic Face and Gesture Recognition, pp. 155-160, 10-12 April 2006.
[25] P. Huisman, R. van Munster, S. Moro-Ellenberger, R. Veldhuis and A. Bazen, Making 2D Face Recognition More Robust Using AAMs for Pose Compensation, Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006.
[26] N. Costen, T.F. Cootes and C.J. Taylor, Compensating for Ensemble-Specificity Effects when Building Facial Models, Proc. British Machine Vision Conference 2000, Vol. 1, pp. 62-71.
Patch-Based Occupant Classification for Smart Airbag


                         Shih-Shinh Huang                                                 Er-Liang Jian and Chi-Liang Chien
         Dept. of Computer and Communication Engineering                             Chung-Shan Institute of Science and Technology
    National Kaohsiung First University of Science and Technology                           Email: jianerliang@gmail.com
                    Email: poww@nkfust.edu.tw



   Abstract—This paper presents a vision-based approach for occupant classification. In order to circumvent the intra-class variance, we consider the empty class as a reference and describe each occupant class by its appearance difference rather than by the appearance itself, as in traditional approaches. Each class in this work is modeled by a set of representative parts called patches, each of which is represented by a Gaussian distribution. This alleviates the misclassification resulting from severe lighting changes, which can make the image locally blooming or invisible. Instead of using maximum likelihood (ML) for patch selection and for estimating the parameters of the proposed generative models, we discriminatively learn the models through a boosting algorithm by directly minimizing the training error.
   Keywords-patch-based model, discriminative learning

                        I. INTRODUCTION

   Until now, the integration of airbags into automobiles has significantly improved occupant safety in vehicle crashes. However, inappropriate deployment of airbags in some situations may cause severe or even fatal injuries, for example, deployment against a rear-facing infant seat or when a passenger sits too close to the airbag. According to the report of the U.S. National Highway Traffic Safety Administration (NHTSA), since 1990 more than 200 occupants have been killed by airbags deployed in low-speed crashes. To protect occupants from this kind of injury, NHTSA defined the Federal Motor Vehicle Safety Standard (FMVSS) 208 in 2001. One of the fundamental issues of FMVSS 208 is to recognize the occupant class inside the vehicle for controlling the deployment of airbags. The five basic classes defined in FMVSS 208 are (i) Empty, (ii) RFIS (Rear Facing Infant Seat), (iii) FFCS (Front Facing Child Seat), (iv) Child, and (v) Adult.
   Some existing sensors, such as ultrasound, pressure, or camera, have been used to develop systems that aim at meeting the classification requirements of FMVSS 208. In this work, we choose the camera as the sensing device, since it can provide a rich representation of the occupant in front of the dashboard. This gives the proposed approach potentially higher classification accuracy. Occupant classification based on computer vision is challenging in the presence of severe lighting changes, large intra-class variance, and structure variance. Since the vehicle is moving, the observed image may have a considerably large dynamic range, from bright sunlight to dark shadow. In the extreme, this makes some regions of the image blooming or invisible and thus complicates the classification task. The intra-class variance denotes that the same occupant class may have different appearances. For instance, passengers may wear clothing with different colors, and baby seats may have different styles. The difference in the scene resulting from the configuration change of objects inside the vehicle is referred to as the structure variance. Figure 1 shows some images exhibiting lighting change and intra-class variance. Similar to the works in the literature, we assume that the monitored scene has no structure variance, and the objective of this paper is to achieve a high recognition rate against severe lighting change and intra-class variance.

   Figure 1. Challenges: (a) Severe lighting change. The images have a considerably large dynamic range, and the observed images have significantly different appearance. (b) Intra-class variance. Persons wearing clothing with different styles or colors.

A. Related Work

   Owechko et al. [1], who are pioneers in this area, attempted to eliminate the illumination variance by first applying intensity normalization to the training images. The coefficients of the eigenvectors computed by principal component analysis (PCA) are used to represent the occupant class. An unknown input image is then recognized as having the same class as its nearest-neighbor sample. In order to overcome lighting change, Haar wavelet filters, which describe the intensity difference among neighboring regions, have been used for occupant representation. An over-complete and dense set of Haar filters over thousands of rectangular regions is adopted in [2]. Then, a Support Vector Machine (SVM) is applied to determine the boundaries among different occupant classes for handling intra-class variance.
In [3], [4], the edge map of the passenger appearance is extracted through a background subtraction algorithm and further described by high-order Legendre moments. The classification is achieved using the k-nearest-neighbors strategy. The edge map of the occupant is described by higher-order Tchebichef moments in [5], and the Adaboost algorithm is then applied to select a set of discriminative moments for classification. To utilize more information for classification, multiple features, including range [6], motion information, and the edge map, are fused under a two-layer architecture [7], [8]. The classifiers for each layer are Non-linear Discriminant Analysis (NDA) classifiers.
   The features used in the aforementioned works are all global descriptors, such as the dense edge map [7], [8], Legendre moments [3], [4], or Tchebichef moments [5]. The main limitation of this kind of approach is that the classification accuracy deteriorates in the two extreme cases (blooming and invisible) resulting from severe lighting change. To circumvent this, we present a patch-based model, of the kind commonly used in the recognition literature [9], [10] for handling occlusion effects, to describe the occupant class. Furthermore, the above works directly model the appearance of the occupant and thus suffer from the significant intra-class variance. The general way to solve or alleviate this problem is by introducing classification algorithms such as SVM or NDA. Based on the insight that the silhouettes of different occupant classes are distinct, we consider the empty class as the reference and thus model the appearance difference with respect to the empty class.

B. Approach Overview

   The objective of occupant classification is to assign one of five classes C = {C_Empty, C_RFIS, C_FFCS, C_Child, C_Adult} to the currently observed image. The system mainly consists of two phases: training and classification. In the training phase, we first recover a reflectance image of the empty class by removing the illumination effect. The obtained reflectance image is considered as the reference image for further feature representation. In this work, each occupant class is described by a patch-based generative model in order to handle severe lighting changes which make the image locally blooming or invisible. Traditionally, the parameters of a generative model are estimated using the ML strategy, in which only samples with the same label are considered and used for training the corresponding model. However, models learned in this way suffer from having less discriminativity among different classes. Instead, we adopt a discriminative boosting algorithm to estimate the model parameters by directly minimizing the training error.
   In the classification phase, the appearances at the trained patches of a specific occupant model are taken into consideration for feature representation. The feature used in this work is the difference in appearance between the patch at the observed image and that at the reference image. This makes the proposed approach invariant to intra-class variance. Then, the likelihood ratios evaluating the existence confidence of the given image with respect to the five trained models are computed, and the classification result is the occupant class with the highest confidence.
   The remainder of this paper is organized as follows. In Section II, we introduce the generative models for representing the occupant classes and the way to perform occupant classification. The boosting algorithm for estimating the parameters of the models in a discriminative manner is then described in Section III. Section IV demonstrates the effectiveness of the developed approach by providing experimental results on an abundant database. Finally, we conclude the paper in Section V with some discussion.

                   II. PATCH-BASED CLASSIFIER

   We model every class by a generative model consisting of several patches, each described by a Gaussian distribution. The observed image is classified by maximizing the likelihood probability. Here, the feature representation of a patch is the appearance difference with respect to a reference image. In order to eliminate the illumination factor, we recover the reflectance image of the empty class and consider it as the reference image. Negative normalized correlation is then introduced to measure the appearance difference for representing the feature of a patch.

A. Feature Representation

   The images for training are captured under various lighting conditions. As in the foreground segmentation literature [11], a reference image suitable for difference measurement should be illumination invariant and contain no moving objects. As discussed in [12], an image is the product of two images: a reflectance image and an illumination image. The reflectance image of the scene is constant, while the illumination image changes with the lighting condition in the environment. Accordingly, the reflectance image of the empty class is recovered and considered as the reference image here.
   Given a set of empty-class images in the training database, we apply the approach proposed in [13] to estimate the empty-class reflectance image Ir, based on the assumption that illumination images have lower contrast than the reflectance image. This implies that the derivative filter outputs on the illumination image will be sparse, and the reflectance recovery problem can be re-formulated as a maximum-likelihood estimation problem. Figure 2 shows the decomposition of three empty-class images into a constant reflectance image and the corresponding illumination images.
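The reflectance recovery in [13] is, in essence, a maximum-likelihood estimate obtained by taking the median of derivative-filter outputs across the stack of input images. The following NumPy sketch shows that core idea under the stated sparseness (Laplacian-prior) assumption; the final integration of the gradient field back into a log-reflectance image is only indicated in a comment, and the code is an illustration rather than the method's full implementation.

    import numpy as np

    def reflectance_gradients(images):
        """Estimate the log-reflectance derivatives from a stack of images.
        images: array-like of shape (n, H, W) with positive pixel values."""
        logs = np.log(np.asarray(images, dtype=np.float64) + 1e-6)
        dx = np.diff(logs, axis=2)      # horizontal log-derivatives, per image
        dy = np.diff(logs, axis=1)      # vertical log-derivatives, per image
        # Median over the stack: the ML estimate when illumination derivatives
        # are sparse (Laplacian-distributed), as assumed in the text.
        rx = np.median(dx, axis=0)
        ry = np.median(dy, axis=0)
        # Recovering the log-reflectance image itself requires integrating
        # (rx, ry), e.g. by solving the corresponding Poisson equation (omitted).
        return rx, ry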



B. Classification Model
                                                                                          A generative occupant model Mc = {pc : k = 1, ..., K c}
                                                                                                                                 k
                                                                                       consisting of K c patches is proposed to describe the class
                                                                                       c ∈ C in this work. Each patch pc is modeled by a Gaussian
                                                                                                                        k
                                                                                       distribution Nk = {µc , Σc } associated with the patch
                                                                                                      c
                                                                                                                k   k
                                                                                       configuration θ(pk ), where µc and Σk are the mean and
                                                                                                         c
covariance matrix, respectively. By assuming independence among patches, the log-likelihood of an observed image I belonging to the class c is defined as:

    \log \Pr(I \mid z^c = 1) = \log \Pr(I \mid M^c) = \sum_{k=1}^{K^c} \log \Pr(f(I(p_k^c)) \mid N_k^c)    (3)

where z^c \in \{+1, -1\} is the membership label for the class c and f(I(p_k^c)) is the aforementioned patch representation of the image I at the patch p_k^c. Remarkably, the proposed model, which learns the likelihood of a given observation, is a generative one.

Figure 2. Examples of reflectance image recovery: the first row shows three input images, the second row shows the recovered reflectance images, and the third row shows the corresponding illumination images.

Figure 3. The definition of the quadrant images I_o(q_i) and I_r(q_i).

As illustrated in Figure 3, a patch p is divided into four quadrants \{q_1, q_2, q_3, q_4\}. We denote the quadrant q_i of the observed image I_o and of the recovered reflectance image I_r as I_o(q_i) and I_r(q_i), respectively. Inspired by the work [15] on change detection under sudden illumination variations, a matching function (MF) \gamma(\cdot) is applied to measure the appearance difference between I_o(q_i) and I_r(q_i); \gamma(I_o(q_i), I_r(q_i)) is defined as:

    \gamma(I_o(q_i), I_r(q_i)) = -\frac{\sum_{(x,y) \in q_i} N(x,y)}{\sqrt{\sum_{(x,y) \in q_i} D_o(x,y) \sum_{(x,y) \in q_i} D_r(x,y)}}    (1)

where

    N(x,y)   = (I_o(x,y) - \bar{I}_o(q_i)) \times (I_r(x,y) - \bar{I}_r(q_i))
    D_o(x,y) = (I_o(x,y) - \bar{I}_o(q_i)) \times (I_o(x,y) - \bar{I}_o(q_i))
    D_r(x,y) = (I_r(x,y) - \bar{I}_r(q_i)) \times (I_r(x,y) - \bar{I}_r(q_i))    (2)

Here, \bar{I}_o(q_i) and \bar{I}_r(q_i) denote the average intensities of the quadrant images I_o(q_i) and I_r(q_i), respectively. Remarkably, this function computes the negative normalized correlation between I_o(q_i) and I_r(q_i); hence, the range of \gamma(\cdot) is [-1, 1]. Thus, the feature representation of the patch p is defined as a 4-D vector (one \gamma value per quadrant).
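For concreteness, the following sketch computes this 4-D feature for one patch. It assumes the observed patch and its recovered reflectance counterpart are given as grayscale NumPy arrays of the same size; the function name is illustrative, not taken from the paper.

import numpy as np

def quadrant_feature(patch_obs: np.ndarray, patch_ref: np.ndarray) -> np.ndarray:
    """4-D patch feature of Eqs. (1)-(2): one negative normalized correlation
    value per quadrant between the observed patch and its recovered reflectance."""
    h, w = patch_obs.shape
    quadrants = [(slice(0, h // 2), slice(0, w // 2)),
                 (slice(0, h // 2), slice(w // 2, w)),
                 (slice(h // 2, h), slice(0, w // 2)),
                 (slice(h // 2, h), slice(w // 2, w))]
    feature = []
    for rows, cols in quadrants:
        io = patch_obs[rows, cols].astype(np.float64)
        ir = patch_ref[rows, cols].astype(np.float64)
        io_c = io - io.mean()                     # I_o(x, y) minus the quadrant mean
        ir_c = ir - ir.mean()                     # I_r(x, y) minus the quadrant mean
        num = np.sum(io_c * ir_c)                 # sum of N(x, y)
        den = np.sqrt(np.sum(io_c ** 2) * np.sum(ir_c ** 2))
        feature.append(-num / den if den > 0 else 0.0)   # gamma, in [-1, 1]
    return np.asarray(feature)

Each quadrant contributes one negative-correlation value, so a patch whose observed appearance matches its recovered reflectance yields values close to -1.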
   Instead of solving the occupant classification problem directly with maximum likelihood (ML), that is, c^* = \arg\max_c \log \Pr(I \mid z^c = 1), we introduce an existence confidence to re-formulate it as five one-against-others binary classification problems. The work in [9] argues that this allows both classification and training to be carried out in a discriminative manner and thus improves the classification accuracy. Consequently, we define the existence confidence of a specific class c given an observed image I as the log-likelihood ratio test (LRT):

    H(I, c) = \log \frac{\Pr(I \mid z^c = 1)}{\Pr(I \mid z^c = -1)}    (4)

Without assuming any prior, we approximate the background hypothesis \Pr(I \mid z^c = -1) by a constant \Theta^c. Accordingly, the functional form H(\cdot) of the LRT statistic in (4) becomes:

    H(I, c) = \log \Pr(I \mid z^c = 1) - \Theta^c
            = \sum_{k=1}^{K^c} \log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta^c
            = \sum_{k=1}^{K^c} \{\log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c\}    (5)

where \Theta^c = \sum_{k=1}^{K^c} \Theta_k^c. Therefore, given an image I and the five trained patch-based generative models \{M^c : c \in C\}, the classification result is the class c^* with the highest existence confidence, that is, c^* = \arg\max_c H(I, c). However, we have not yet described how to estimate the model parameters \Omega^c = \{(\theta_k^c, \mu_k^c, \Sigma_k^c, \Theta_k^c) : k = 1, \ldots, K^c\}. In the next section, a boosting algorithm is proposed to train these parameters in a discriminative way.
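As an illustration of this decision rule, the sketch below scores an image against each class model and returns the arg-max class. It assumes the 4-D patch features have already been extracted at each class's selected patch locations (for example with the quadrant function above); the data layout and function names are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def existence_confidence(patch_features, class_model):
    """H(I, c) of Eq. (5): for each selected patch, the 4-D Gaussian
    log-likelihood of its feature minus the per-patch offset Theta_k."""
    h = 0.0
    for feat, (mu, sigma, theta) in zip(patch_features, class_model):
        h += multivariate_normal.logpdf(feat, mean=mu, cov=sigma) - theta
    return h

def classify(per_class_features, models):
    """c* = argmax_c H(I, c) over the five occupant classes."""
    scores = {c: existence_confidence(per_class_features[c], models[c]) for c in models}
    return max(scores, key=scores.get)

Here models[c] would hold the (mu_k, Sigma_k, Theta_k) triplets learned for class c, and per_class_features[c] the corresponding 4-D features extracted from the input image.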
              III. DISCRIMINATIVE LEARNING USING BOOSTING

   In the learning literature [9], [10], several compelling arguments indicate that the model with the parameters
estimated in a discriminative manner is preferable in terms of classification accuracy. Inspired by this, the parameters are determined directly by minimizing the exponential loss of the margin over all training samples [16].

A. Cost Function Definition

   Assume that there is a set of labeled images D = \{(I_i, t_i)\}_{i=1}^{N}. The margin of a sample (I_i, t_i) with respect to a learned model (classifier) H(\cdot) is defined as z_i^c H(I_i, c), where z_i^c \in \{+1, -1\} is the membership label of the ith sample for the class c: z_i^c = 1 if t_i is equal to c; otherwise, z_i^c = -1. Then, the cost function J(\cdot) evaluating the training error of the training set D for the class c is defined as:

    J(D, \Omega^c) = \sum_{i=1}^{N} \exp\{-z_i^c H(I_i, c)\}    (6)

Notably, the smaller the training error of a model H(\cdot) determined by the parameters \Omega^c for the class c, the smaller the cost J(\cdot). In other words, the objective of training the classifier for each class c is to find a set of model parameters in the \Omega^c space so that the cost function is minimized.
   The minimization of (6) is performed with a boosting algorithm, which is a popular way to sequentially approach the solution with a set of additive models. At each round m, our function H(\cdot) is updated as H(\cdot) + h_m(\cdot) so as to decrease the cost; h_m(\cdot) and H(\cdot) are called the weak and strong classifier, respectively, in the boosting literature. Consequently, H(\cdot) has the form:

    H(x) = \sum_{m=1}^{M} h_m(x)    (7)

where M is the number of boosting rounds. By designing h_m(x) as the log-likelihood of a patch minus an offset and setting M = K^c, H(x) in (7) becomes equivalent to H(I, c) in (5). The problem of estimating the model parameters is thus the same as boosting the strong classifier in a sequential manner:

    M = K^c
    h_m(x, c) = \log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c    (8)
    H(x) = \sum_{k=1}^{K^c} \{\log \Pr(f(I(p_k^c)) \mid N_k^c) - \Theta_k^c\}

B. Gradient Descent Optimization

   Boosting, which chooses a linear combination of weak classifiers to minimize the proposed cost function J(\cdot), is shown to be a greedy gradient descent in [17]. The AnyBoost algorithm presented in [17] shows that the weak hypothesis resulting in the greatest reduction in cost lies in the direction of the negative functional gradient -\nabla J(H)(x). Differentiating J(\cdot) in (6) with respect to H(\cdot) gives:

    \nabla J(D, \Omega^c) = \frac{\partial}{\partial H(I, c)} \sum_{i=1}^{N} \exp\{-z_i^c H(I_i, c)\} = -z^c \exp\{-z^c H(I, c)\}    (9)

Since it is generally not possible to choose h_m(I, c) = -\nabla J(D, \Omega^c), the AnyBoost algorithm instead searches for the function with the greatest inner product with -\nabla J(D, \Omega^c). The inner product between \nabla J(D, \Omega^c) and h_m(I, c) is defined by:

    \langle \nabla J(D, \Omega^c), h_m(I, c) \rangle = \sum_{i=1}^{N} -z_i^c \exp\{-z_i^c H(I_i, c)\} \, h_m(I_i, c)    (10)

Denoting \exp\{-z_i^c H(I_i, c)\} as the weight w_i^c, the task at boosting round m is to find the weak hypothesis most aligned with the negative gradient, that is, the one maximizing \sum_{i=1}^{N} z_i^c w_i^c h_m(I_i, c).
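A minimal sketch of one way to run this greedy selection is given below. It assumes the weak responses have been precomputed for every candidate patch on every training image; the array names are illustrative. With the exponential loss of (6), the negative functional gradient at sample i is z_i^c w_i^c, which is what the weighted score below aligns the chosen weak hypothesis with.

import numpy as np

def select_patches(candidate_scores: np.ndarray, labels: np.ndarray, rounds: int):
    """Greedy AnyBoost-style selection: candidate_scores[j, i] holds the weak
    response h_j(I_i, c) of candidate patch j on training image i (its Gaussian
    log-likelihood minus the offset); labels[i] is z_i^c in {+1, -1}."""
    n_candidates, n_samples = candidate_scores.shape
    H = np.zeros(n_samples)                  # strong classifier scores H(I_i, c)
    chosen = []
    for _ in range(rounds):                  # rounds corresponds to K^c
        w = np.exp(-labels * H)              # w_i^c = exp{-z_i^c H(I_i, c)}
        # the negative functional gradient at sample i is z_i^c w_i^c, so the best
        # weak hypothesis maximizes sum_i z_i^c w_i^c h_j(I_i, c)
        gains = candidate_scores @ (labels * w)
        j = int(np.argmax(gains))
        chosen.append(j)
        H = H + candidate_scores[j]          # H <- H + h_m
    return chosen, H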
                         IV. EXPERIMENT

   In this section, we present experimental results on a large collection of videos.

A. System Setup and Video Collection

   The car used for the experiments is a Mitsubishi Sarvin, and the appearance inside the vehicle is shown in Figure 4(a). We mounted the camera at the center of the roof near the rear-view mirror (see Figure 4(b)) to provide a near-profile view of the occupant and to prevent the camera view from being blocked by the driver. The video sequences used for both training and validation were gathered from the deployed camera while the platform was moving on the road. The camera grabs images at a rate of 30 frames per second.
   In order to build a database with abundant lighting changes, we collected the videos under different weather conditions, such as sunny and cloudy days, over a period of more than two months. In addition, we drove the vehicle through several different scenes, including indoor and outdoor environments such as a basement, scenes facing the sun, and streets shaded by trees. As for intra-class variance, several adults and children with different body types and clothing appear in the videos and were asked to exhibit various postures. Some examples are shown in Figure 5. Our database contains 34 video sets, and each set consists of one video for each occupant class, giving a total of 34 x 5 = 170 videos. Each video is about 5 to 10 minutes long and consists of about 8,000 to 11,000 frames. The total number of frames in our database is 1,633,752, and the detailed statistics can be found in Table I.
Table I
DATABASE STATISTICS

           Empty      RFIS      FFCS      Child     Adult
Fold 1     174,572   164,635   167,589   166,103   176,423
Fold 2     153,322   157,699   157,064   156,352   159,993
Total      327,894   322,334   324,653   322,455   336,416

Figure 4. Camera configuration: (a) the appearance inside the Mitsubishi Sarvin; (b) the deployment of the camera.

Figure 5. Various poses under different illuminations.

B. Classification Results and Analysis

   Our work for occupant classification is based on a set of discriminative patches. To save computation, the grabbed images are normalized to a resolution of 256 x 128. Four types of rectangles are used in both approaches, including 32 x 32, 32 x 16, and 16 x 32. The steps for scanning the entire image in the horizontal and vertical directions are set to 1/2 of the width and height of the rectangles, respectively; for example, the 32 x 16 rectangle is shifted by 16 and 8 pixels, respectively. The number of patches selected for modeling is K^c = 50 for each occupant class. The CPU used is an Intel Core Duo at 2.4 GHz with 1.0 GB of working memory. The Intel Open Source Computer Vision Library (OpenCV) and the libsvm 2.89 library [18] are adopted to support the implementation under Microsoft Windows XP.
   Here, we use 2-fold cross validation to compare the classification performance of the two algorithms. The collected videos in our database are divided into two folds: one is used to learn the models and the other for validation, and vice versa (see Table I). Each fold thus includes 17 sets. For training, we extract 50 frames from every video in the training fold, giving a total of 50 x 85 = 4250 training frames. The training frames of each video are selected by sampling one frame every 100 frames from the first 5000 frames. Figure 6 shows the first 10 selected patches for the five occupant classes.

Figure 6. Selected patches for the five occupant classes.

   The confusion matrices for the classification results of fold 1 and fold 2 are shown in Table II and Table III, respectively. Our proposed approach is clearly effective in both cases. This is because the patch-based model, which relies on local features, is more robust to severe lighting changes than one using a global representation, and the use of appearance differences for the feature representation makes the system robust to intra-class variance. The classification accuracies for the four classes RFIS, FFCS, Child, and Adult all exceed 99.0%. The classification time of our method is about 16 ms; this efficiency results from the simplicity of computing the log-likelihood ratio, which uses 4-D Gaussian distributions. However, there are still some misclassifications between the FFCS and Adult classes; they are hard to distinguish because the two classes have similar appearance.
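As a quick illustration of how the Accuracy column in Tables II and III relates to the raw counts, the snippet below recomputes the per-class accuracy (the diagonal entry divided by the row total) for fold 2; up to rounding, it reproduces the reported values.

import numpy as np

def per_class_accuracy(confusion: np.ndarray) -> np.ndarray:
    """Accuracy of each ground-truth class: diagonal count over the row total."""
    return np.diag(confusion) / confusion.sum(axis=1)

# Fold 2 counts copied from Table III (rows and columns: Empty, RFIS, FFCS, Child, Adult).
fold2 = np.array([
    [153301,      0,      0,     17,      4],
    [     0, 157684,      0,      0,     15],
    [     0,      0, 154642,      0,   2422],
    [     0,      0,      2, 155598,    752],
    [     0,      0,      0,      0, 159993],
])
print(per_class_accuracy(fold2))  # matches the Accuracy column of Table III up to rounding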
                         V. CONCLUSION

   In this paper, we present a patch-based generative model for occupant classification. Each patch is divided into four quadrants, and the appearance difference measured by the proposed negative correlation is used to represent the patch. Instead of using ML for classification, the idea of existence confidence is introduced, so that the model parameters can be estimated in a discriminative manner. To achieve this, a boosting algorithm is applied to approach the solution by directly minimizing the training error. The robustness and effectiveness of our proposed method against severe lighting changes and intra-class variance have been extensively validated on a large database with more than 1,600,000 frames. In the near future, we will introduce semantic cues, such as head or seat detection, to bring the classification accuracy even closer to 100%. In addition, the assumption that there is no structural variation inside the vehicle due to user preference should be relaxed in ongoing work.

                        ACKNOWLEDGMENT

   This research is sponsored by the Chung-Shan Institute of Science and Technology under the project XB98175P.
Table II
CONFUSION MATRICES FOR FOLD 1

Our Approach (99.50%)
          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     171,261   233       0         86        4         98.10%
RFIS      0         164,605   0         1         29        99.98%
FFCS      0         0         167,567   0         22        99.98%
Child     0         0         6         165,597   500       99.69%
Adult     0         116       276       2         176,029   99.77%

Table III
CONFUSION MATRICES FOR FOLD 2

Our Approach (Average: 99.59%)
          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     153,301   0         0         17        4         99.98%
RFIS      0         157,684   0         0         15        99.99%
FFCS      0         0         154,642   0         2,422     98.45%
Child     0         0         2         155,598   752       99.51%
Adult     0         0         0         0         159,993   100.0%



                         REFERENCES

[1] J. Krumm and G. Kirk, "Video Occupant Detection for Airbag Deployment," IEEE Workshop on Applications of Computer Vision, pp. 20-35, 1998.
[2] Y. Zhang, S. J. Kiselewich, and W. A. Bauson, "A Monocular Vision-Based Occupant Classification Approach for Smart Airbag," IEEE Intelligent Vehicles Symposium, pp. 632-637, 2005.
[3] M. E. Farmer and A. K. Jain, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 1, pp. 60-80, January 2007.
[4] M. E. Farmer and A. K. Jain, "Occupant Classification System for Automotive Airbag Suppression," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 756-761, 2003.
[5] S.-S. Huang and P.-Y. Hsiao, "Occupant Classification for Smart Airbag Using Bayesian Filtering," International Conference on Green Circuits and Systems, 2010.
[6] P. R. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, and B. Ottersten, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 4, pp. 1983-1993, July 2007.
[7] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "High Performance Sensor Fusion Architecture for Vision-Based Occupant Detection," IEEE Intl. Conference on Intelligent Transportation Systems, pp. 1128-1132, 2003.
[8] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "Vision-Based Fusion System for Smart Airbag Application," IEEE Intelligent Vehicle Symposium, pp. 245-250, 2002.
[9] A. B. Hillel, T. Hertz, and D. Weinshall, "Object Class Recognition by Boosting a Part-Based Model," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pp. 702-709, 2005.
[10] T. Deselaers, D. Keysers, and H. Ney, "Discriminative Training for Object Recognition Using Image Patches," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 20-25, 2005.
[11] S.-S. Huang, L.-C. Fu, and P.-Y. Hsiao, "Region-Level Motion-Based Foreground Segmentation under a Bayesian Network," IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 522-532, April 2009.
[12] H. Farid and E. H. Adelson, "Separating Reflections from Images by Use of Independent Components Analysis," Journal of the Optical Society of America, vol. 16, no. 9, pp. 2136-2145, 1999.
[13] Y. Weiss, "Deriving Intrinsic Images from Image Sequences," IEEE Intl. Conf. on Computer Vision, vol. 1, pp. 68-75, 2001.
[14] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.
[15] L. D. Stefano, F. Tombari, and S. Mattoccia, "Robust and Accurate Change Detection Under Sudden Illumination Variations," Asian Conference on Computer Vision, pp. 103-109, November 2007.
[16] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing Visual Features for Multiclass and Multiview Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.
[17] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting Algorithms as Gradient Descent," Neural Information Processing Systems (NIPS), pp. 512-518, 2000.
[18] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines," Journal of Machine Learning Research, no. 6, pp. 1889-1918, 2005.
DISPLAY CHARACTERIZATION IN VISUAL CRYPTOGRAPHY FOR
                        COLOR IMAGES

                                             Chao-Hua Wen (溫照華)

                  Color Imaging and Illumination Center, Graduate Institute of Engineering,
                          National Taiwan University of Science and Technology
                                             Taipei, Taiwan
                                   E-mail: chwen@mail.ntust.edu.tw



                     ABSTRACT

Visual cryptography can encrypt visual information and then decrypt it through the human visual system without complicated computation. There are various measures of the performance of visual cryptography schemes, but few studies address exact color reproduction for visual cryptography. This paper proposes a new visual cryptography scheme with a display characterization model which can render the decrypted color image accurately. In the experiments, the processes of encryption and decryption were demonstrated from the source display to the destination display. For color secret images, this method only uses two encrypted share images, and the decryption can be performed via a simple operation.

Keywords: Visual Cryptography; Visual Secret Sharing; Color Visual Cryptography; Display Characterization

                 1. INTRODUCTION

With the rapid deployment of network technology, multimedia information is conveniently transmitted over the Internet. While transmitting secret images, security must be taken into consideration because hackers may exploit weak links in the communication network to expose the hidden information. Various image secret sharing schemes have been developed to strengthen the security of secret images. Information hiding and secret sharing are the two major approaches. For instance, the watermarking method is widely used for information hiding [1], and Visual Cryptography (VC) is adopted for secret sharing [2].
     VC was first introduced by Naor and Shamir (1994); it allows visual information (e.g., plain text, handwritten notes, graphs and pictures) to be encrypted by producing random noise images and decrypted through the human visual system [2]. A visual cryptography scheme (VCS) eliminates complex computation in the decryption process, and the secret images can be reconstructed by a stacking operation. This property makes VCS especially useful when the system requires a low computation load.
     Naor and Shamir proposed the (k, n) threshold scheme, or k out of n threshold scheme, which illustrated a new paradigm in image sharing [2]. In this scheme a secret image is divided into n share images. With any k of the n shares, the secret can be perfectly reconstructed, while even complete knowledge of (k-1) shares reveals no information about the secret image. Consequently, Naor and Shamir's method is restricted to binary images due to the nature of the basic model.
     Verheul and Van Tilborg (1997) proposed a scheme that extended the basic visual cryptography scheme from binary images to color images [3]. In this scheme each pixel is expanded into m subpixels, and each subpixel may take one color from the set of colors 0, 1, ..., c-1, where c is the total number of colors used to represent the pixel. These subpixels are interrelated such that, after all shares are stacked, the color is revealed if the corresponding subpixels of all shares are of the same color; otherwise a level of black is revealed. In this scheme the size of the decrypted image increases by a factor of c^(k-1) when c >= n for a (k, n) threshold scheme.
     Koga and Yamamoto (1998) proposed a lattice-based (k, n) VCS for gray-level and color images [4]. In that scheme, the pixels are treated as elements of a finite lattice and the stacking of pixels is defined as an operation on the finite lattice; a (k, n) VCS for color images with c colors is defined as a collection of c subsets in the nth Cartesian product of the finite lattice.
     Yang (2000) proposed a new VCS for color images [5]. The scheme is implemented based on the basic concept of a black-and-white VCS and achieves a much better block length than the Verheul-Van Tilborg scheme; here each pixel is expanded into 2c-1 subpixels, where c is the number of colors. Hou (2003) proposed a scheme of secret sharing for both gray-level and color images using a halftone technique [6]. The
color secret image is decomposed into individual channels before the halftone technique is applied. The traditional VC is then applied to the halftone image of each channel to create the shares. The size of the decrypted image is increased by a factor of n^(k-1) for a (k, n) threshold VCS, and the quality of the decrypted image depends on the halftone technique used.
     Cimato et al. (2003) proposed a c-color (k, n) threshold cryptography scheme that provides a characterization of contrast-optimal schemes with a pixel expansion of 2c - 1 [7]. Yang and Chen (2008) proposed a VCS for color images based on additive color mixing [8]; in this scheme, each pixel is expanded by a factor of three.
     In order to reduce the size and the distortion of the decrypted image, Dharwadkar et al. (2010) proposed visual cryptography for color images using a color error diffusion dithering technique [10][16]. This technique improves the quality of the decrypted image compared to other dithering techniques, such as Floyd-Steinberg error diffusion [11], as shown by experimental results obtained using Picture Quality Evaluation metrics [12]. Meanwhile, Revenkar et al. (2010) provided an overview of various VCSs and a performance analysis on the basis of pixel expansion, number of secret images, image format, and type of shares generated [13].
     The display is one of the most used media devices in Visual Cryptography. In most applications, the decryption side uses a different display model from the encryption side. Even when the same display model is used, the luminance and color of the displays may differ because of production variance. The color gamut, one of the characteristics of color reproduction media, plays a major role in determining how a given secret image will perform in VCS. The display color gamut that we have been living with for the past several decades is standardized as "Rec. 709" in the video industry [14] or "sRGB" in the computer industry [15]; these systems share the same primaries. However, advanced wide-gamut displays are now being rapidly deployed in specialized professional applications and even in home theaters. That makes display characterization more critical for accurate information communication between the source and the destination.
     The rest of this paper is organized as follows. Section 2 provides an overview of black-and-white VCS, digital halftoning, error diffusion, halftone-based VCS for gray-scale images, and color visual cryptography schemes. Display characterization is elaborated in Section 3. The proposed framework is introduced in Section 4. Results and discussion are given in Section 5. Finally, the conclusion is given in Section 6.

           2. VISUAL CRYPTOGRAPHY SCHEME

Naor and Shamir proposed a (k, n) threshold visual secret sharing scheme to share a secret image [2]. A secret image is hidden into n share images and can be decrypted by superimposing at least k share images, but any k-1 shares cannot reveal the secret.

2.1. Visual Cryptography Scheme for binary images

The (2, 2) VCS is illustrated to introduce the basic concept of threshold visual secret sharing schemes. The encryption process transforms each secret pixel into two shares, and each share belongs to the corresponding share image. In the decryption process the two corresponding shares are stacked together (using an OR/AND operation) to recover the secret pixel. The two shares of a white secret pixel are the same, while those of a black secret pixel are complementary, as shown in Fig. 1. Consequently, a white secret pixel is recovered as a stacked block with half of its subpixels white, and a black secret pixel is recovered as an all-black block. With this basic VCS, the contrast of the decrypted image is reduced because the intensity of the white secret pixels is halved.

Fig. 1: (2, 2) VCS for transforming a binary pixel into two shares.
are rapid deployment in specialized professional                  2.2. Digital Halftoning
applications and even in home theater now. That makes
display characterization more serious in terms of                 Halftone technique is one of the most important parts of
accurate information communications between the                   the image reproduction process for devices with a
source and destination.                                           limited number of colors. According to the physical
     The rest of this paper is organized as follows:              characteristics of different media uses the different ways
Section 2 provides overview of black and white VCS,               of representing the color level of images. The general
digital halftoning, error diffusion, halftone-based VCS           printer such as dot matrix printers and laser printers can
for gray-scale images, and color visual cryptography              only control a single pixel to be printed (black pixel) or
scheme. Display characterization is elaborated in                 not be printed (white pixel). The halftone is applied to
section 3. The proposed framework is introduced in                the given image to render the illusion of the continuous
Section 4. Results and discussion is given in Section 5.          tone images on the devices that are capable of
Finally the conclusion is given in Section 6.                     producing only binary image elements. This illusion is
                                                                  achieved because our eyes perform spatial integration.
                                                                  That is, if we view a very small area from sufficiently
                                                                  large viewing distance our eyes averages the fine detail




within the small area and record only the overall intensity of the area.

2.3. Halftone-based VCS for Gray-level Images

In the (k, n) threshold VCS for gray-level images [3], the pixels have g gray levels ranging from 0 to g-1, and each pixel is expanded to m subpixels with m >= g^(k-1). In this scheme the size of the decoded image is larger than that of the secret image compared to the Naor and Shamir VCS. In order to reduce the size of the decrypted image, the gray-level image is first transformed into an approximate binary (halftone) image; then the basic VCS described in Section 2.1 can be used to create the shares. The following steps are used to generate a less distorted decrypted image:
    1) Transform the gray-level image into a binary image using a halftone technique.
    2) Represent each black or white pixel in the halftone image by m subpixels in the different shares, selected from the shares of black or white pixels.
    3) Repeat step 2 until every pixel in the halftone image is decomposed into shares.

2.4. Error Diffusion

Many mature error diffusion techniques exist in the literature, and because of its exceptionally high image quality, error diffusion continues to be a popular choice among digital halftoning algorithms [9]. Nagaraj V. Dharwadkar et al. used Adaptive Order Dithering (cluster-dot dithering) [16], the Floyd-Steinberg error diffusion technique [11], and a color error diffusion technique, and computed Picture Quality Evaluation metrics for the decrypted images [12]. Their experimental results revealed that color error diffusion produces recovered images of superior quality compared to Adaptive Order Dithering and Floyd-Steinberg error diffusion.
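For reference, the Floyd-Steinberg kernel mentioned above distributes the quantization error of each pixel to its unprocessed neighbours with weights 7/16, 3/16, 5/16 and 1/16; the following is a minimal single-channel sketch (the function name is illustrative).

import numpy as np

def floyd_steinberg(channel: np.ndarray) -> np.ndarray:
    """Floyd-Steinberg error diffusion for one gray-level channel in [0, 255];
    returns a binary halftone image (1 = light/dot present, 0 = dark)."""
    img = channel.astype(np.float64) / 255.0
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            out[y, x] = int(new)
            err = old - new
            # distribute the quantization error to unprocessed neighbours
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out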
2.5. VCS for Color Images

The first color VCS was developed by Verheul and Van Tilborg [3]. Colored secret images can be shared using the concept of arcs to construct a colored visual cryptography scheme. In a c-color VCS, one pixel is transformed into m subpixels, and each subpixel is divided into c color regions. In each subpixel, exactly one color region is colored, and all the other color regions are black. The color of one pixel depends on the interrelations between the stacked subpixels. For a colored visual cryptography scheme with c colors, the pixel expansion m is c x 3. Yang and Laih [19] improved the pixel expansion of Verheul and Van Tilborg [3] to c x 2. Liu et al. developed a color VCS under the visual cryptography model of Naor and Shamir with no pixel expansion [20]; in this scheme, increasing the number of colors of the recovered secret image does not increase the pixel expansion. Wei Qiao et al. also introduced a VCS for color images based on the halftone technique [21].

           3. DISPLAY CHARACTERIZATION

The display is one of the most used media devices in Visual Cryptography. Flat panel displays have become a common peripheral for desktop personal computers and workstations. In general VC tasks, we create an image on one display and take the data file to a second imaging system. When viewed on the second display, the decrypted image is likely to have different color reproduction. Here we address primarily users who need accurate imaging on a monitor.
     The traditional CRT techniques have been summarized by Berns [17] and can be described as the application of the gain-offset-gamma (GOG) model to characterize the electro-optical transfer functions of the display, together with a 3x3 linear transform to go from RGB to CIE XYZ tristimulus values. The accuracy of the GOG characterization is probably adequate for most desktop color applications and color management systems [18].
     The International Color Consortium (ICC) has published a standard file format for storing "profile" information about any imaging device (http://www.color.org/). It has become routine to use such profiles to achieve accurate imaging. The widespread support for profiles allows most users to achieve characterization and correction without needing to understand the underlying characteristics of the imaging device. ICC monitor profiles use the standard CRT model presented in this article.

3.1. Primary transform matrix and inverse

The primary transform matrix for the colorimetric characterization of the display was derived from direct colorimetric measurements of the three full-on primaries after black correction. The matrix and its inverse are given in Equation (1) and Equation (2).

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
    \begin{bmatrix} X_R & X_G & X_B \\ Y_R & Y_G & Y_B \\ Z_R & Z_G & Z_B \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}    (1)

    \begin{bmatrix} R \\ G \\ B \end{bmatrix} =
    \begin{bmatrix} X_R & X_G & X_B \\ Y_R & Y_G & Y_B \\ Z_R & Z_G & Z_B \end{bmatrix}^{-1}
    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}    (2)
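In code, the forward and inverse transforms of Equations (1) and (2) are simply a 3x3 matrix and its inverse; the sketch below uses the HP-monitor matrix that is reported later in Equation (3) purely as a concrete example, and the function names are illustrative.

import numpy as np

# Columns are the measured XYZ of the full-on R, G, B primaries (HP monitor, Eq. (3)).
M = np.array([[0.4009, 0.3294, 0.2261],
              [0.2340, 0.5877, 0.1783],
              [0.0453, 0.0872, 1.1379]])

def rgb_to_xyz(rgb):
    """Equation (1): XYZ = M * RGB for a linear RGB triplet."""
    return M @ np.asarray(rgb, dtype=float)

def xyz_to_rgb(xyz):
    """Equation (2): RGB = M^{-1} * XYZ."""
    return np.linalg.inv(M) @ np.asarray(xyz, dtype=float)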
3.2. Electro-Optical Transfer Function (EOTF)

The EOTF describes the relationship between the signal used to drive a given display channel and the luminance produced by that channel. For displays, this function is sometimes referred to as gamma, and it is the aspect of the display characterization described by the
GOG portion of the display characterization model. The EOTF, however, does not apply in visual cryptography, because a VCS basically deals with fully on/off signals.

           4. THE PROPOSED COLOR VCS

The objective of our proposed scheme is to apply VCS to color images and obtain a better-quality decrypted image through display characterization procedures. Fig. 9 illustrates the framework of the encryption algorithm and the simulated decryption image. In this encryption algorithm the color image is decomposed into three channels, and each channel is considered a gray-level image. For each gray-level image, dithering and VCS are applied independently to create the shares. We use color error diffusion as the dithering technique: it reduces the color sets that render the halftone image and chooses, from these sets, the color by which the desired color can be rendered with minimal brightness variation. Fig. 2 shows how to decompose a magenta pixel (R = 1, G = 0, B = 1) into two sharing blocks and how to reconstruct the magenta block. We superimpose (using an AND operation) the binary shares of each channel to get the decrypted color image.

Fig. 2: An example of the proposed VCS for a magenta pixel.

4.1. Encryption

In the encryption algorithm, the two shares are generated from the color image. Based on Naor and Shamir's basic concept, the color image is decomposed into R, G and B channels. From these channels, six work-in-process shares are created. These six work-in-process shares are then combined into two encrypted color images using the following steps; a code sketch of steps (1)-(4) is given after the list.
     (1) Color decomposition: The color image I is decomposed into IR, IG and IB monochrome gray-level images for the R, G and B color channels respectively.

          [I] -> [IR, IG, IB]

     (2) Digital halftoning: The halftone technique is applied to each color channel to obtain the IRHT, IGHT, and IBHT halftone images respectively.

          [IR, IG, IB] -> [IRHT, IGHT, IBHT]

     (3) Creation of work-in-process shares: The method described in Section 2.1 is used to create the work-in-process shares by (2, 2) VCS for each halftone image. For example, for the red halftone image IRHT, the (2, 2) VCS encodes the halftone image into two shares, IRSH1 and IRSH2 respectively. The green and blue halftone images are processed in the same way as the red halftone image.

          [IRHT, IGHT, IBHT] -> [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2]

     (4) Creation of encrypted shares: The work-in-process shares IRSH1, IGSH1 and IBSH1 are combined into a color Share1 image, and IRSH2, IGSH2 and IBSH2 are combined into a Share2 image.

          [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] -> [Share1, Share2]

     (5) Display characterization: The display model is applied for color correction of the Share1 and Share2 images.

          [Share1, Share2] -> [Share1', Share2']

                                                                           Table 1: Measured Luminance and chromaticities
4.1. Encryption
                                                                                             Color                Luminance and Chromaticity
                                                                    Display
In the encryption algorithm, the two shares are                                      R         G         B       Y (cd/m2)        x      y
generated from the color image. Based on Noar and                                    1         0         0         54.95      0.5894   0.3440
Shamir’s basic concept, the color image is decomposed
                                                                                     0         1         0         138.8      0.328    0.5852
into R, G and B channels. From these channels, six of
the work-in-process shares are created. Next to combine                   HP         0         0         1         41.94      0.1466   0.1156
these six work-in-process shares into two encrypted                                  0         0         0         0.25       0.2009   0.1966
color images using following steps:
     (1) Color Decomposition: The color image I is                                   1         1         1         234.7      0.2964   0.3099
decomposed into IR, IG and IB monochrome gray-level                                  1         0         0         55.67      0.6340   0.3336
images for R, G and B color channels respectively.
                                                                                     0         1         0         192.9      0.3321   0.6273

                            [I]   [IR, IG, IB]                           hTC         0         0         1         36.24      0.1423   0.0778

                                                                                     0         0         0         0.04       0.2329   0.2269
    (2) Digital halftoning: To apply the halftone
                                                                                     1         1         1        284.80      0.2925   0.3044
technique for each color channel to obtain IRHT, IGHT,
and IBHT halftone images respectively.




Fig. 3: Plot of CIE chromaticity coordinate values (u', v') of the HP monitor (purple line), hTC display (blue line) and NTSC color space (yellow line).

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{HP} =
    \begin{bmatrix} 0.4009 & 0.3294 & 0.2261 \\ 0.2340 & 0.5877 & 0.1783 \\ 0.0453 & 0.0872 & 1.1379 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{HP}    (3)

    \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{hTC} =
    \begin{bmatrix} 0.3714 & 0.3593 & 0.2301 \\ 0.1954 & 0.6787 & 0.1258 \\ 0.0190 & 0.0439 & 1.2613 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{hTC}    (4)

     As described in Section 3, two ICC profiles were first created. Here we assigned the source profile to the HP monitor and the destination profile to the hTC display. The new color transform was created from the source profile and the destination profile, with CIE XYZ adopted as the profile connection space. Therefore, the conversion from an HP image to an hTC image is RGB_HP -> XYZ -> RGB_hTC, as shown in Equation (5). Consequently, we convert the Share1 and Share2 images to Share1' and Share2' using this equation.

    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{hTC} =
    \begin{bmatrix} 0.3714 & 0.3593 & 0.2301 \\ 0.1954 & 0.6787 & 0.1258 \\ 0.0190 & 0.0439 & 1.2613 \end{bmatrix}^{-1}
    \begin{bmatrix} 0.4009 & 0.3294 & 0.2261 \\ 0.2340 & 0.5877 & 0.1783 \\ 0.0453 & 0.0872 & 1.1379 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix}_{HP}    (5)
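Equation (5) is straightforward to apply in code. The sketch below uses the measured matrices of Equations (3) and (4) and assumes linear RGB values; the function name is illustrative.

import numpy as np

M_HP = np.array([[0.4009, 0.3294, 0.2261],
                 [0.2340, 0.5877, 0.1783],
                 [0.0453, 0.0872, 1.1379]])
M_HTC = np.array([[0.3714, 0.3593, 0.2301],
                  [0.1954, 0.6787, 0.1258],
                  [0.0190, 0.0439, 1.2613]])

def hp_to_htc(rgb_hp: np.ndarray) -> np.ndarray:
    """Eq. (5): HP-monitor RGB -> CIE XYZ -> hTC-display RGB.
    Accepts a single triplet or an N x 3 array of linear RGB values."""
    return np.asarray(rgb_hp, dtype=float) @ (np.linalg.inv(M_HTC) @ M_HP).T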
4.2. Decryption

In the decryption algorithm, the color image channels are reconstructed by stacking the shares of the corresponding channels. Our proposed scheme reconstructs the decrypted image Idecrypt in a straightforward way by stacking the Share1' and Share2' images with an AND operation for each color channel individually:

                  [Share1', Share2'] → [Idecrypt]
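As a minimal sketch of this stacking step (assuming the shares are stored as binary 0/1 arrays of shape H x W x 3; the names are illustrative):

```python
import numpy as np

def stack_shares(share1, share2):
    """Decrypt by AND-stacking two binary shares channel by channel (values in {0, 1})."""
    return np.logical_and(share1, share2).astype(np.uint8)

# I_decrypt = stack_shares(share1_prime, share2_prime)
```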
                 5. RESULTS AND DISCUSSION

The color image is decomposed into R, G and B channel images, and those channel images are then halftoned to produce the work-in-process shares. Fig. 5 shows the creation of the work-in-process shares: Fig. 5(a) and Fig. 5(b) are the two work-in-process shares of the red channel, Fig. 5(c) and Fig. 5(d) are the shares of the green channel, and Fig. 5(e) and Fig. 5(f) illustrate the work-in-process shares of the blue channel. Overall, these six shares reveal no information about the secret image.

     (a) IRSH1   (b) IRSH2   (c) IGSH1   (d) IGSH2   (e) IBSH1   (f) IBSH2
   Fig. 5: Six of the work-in-process share images (these are black-and-white images).

To reduce the number of share images for portability and distribution, the proposed scheme divides the six work-in-process shares into two groups and creates two encrypted shares. Fig. 6 shows the combination of the six work-in-process shares into two encrypted color images. Neither encrypted color image reveals any information about the secret image.

     (a) Share1 color image   (b) Share2 color image
           Fig. 6: Two encrypted share color images.

Here we assume that Alice uses the monitor of an HP Pavilion dm3 to encrypt the secret image and then Bob
uses the display of an hTC Diamond One to decrypt the image with his eyes. The simulated encrypted share images are shown in Fig. 7, so Alice can preview the encryption results of the images on the hTC display. As a consequence, Share1' and Share2' were used rather than Share1 and Share2.

     (a) Share1' color image   (b) Share2' color image
   Fig. 7: Color correction of the encrypted images resulting from the display characterization of the HP monitor and the hTC display.
                                                                          scheme,” Design, Codes and Cryptography, vol. 20,
     Finally, the decryption results are illustrated in Fig. 8. Fig. 8(a) depicts the decrypted image without display characterization. Compared with the decrypted image with display characterization in Fig. 8(b), the results reveal a color difference between (a) and (b). Note, however, that Alice can share the encrypted images with Bob, who can then decrypt the secret image and see the same contents, with the same colors, as Alice created.

                 (a)                    (b)
   Fig. 8: Decryption results. (a) The decrypted image without display characterization; (b) the decrypted image with display characterization.
                  6. CONCLUSIONS

In this paper, we proposed a new VCS for color images with display characterization, which applies error-diffusion dithering directly to each primary color channel. We also reduced the encrypted images to two share images for easier reconstruction and distribution of the hidden information. In this work, display characterization is applied to visual cryptography for the first time. The results show that both the color information and the secret image can be delivered accurately. Further work can be done to reduce the size of the share images, improve the quality of the halftone shares, and use the display characterization model as an encryption key.
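The error-diffusion dithering mentioned above can be illustrated with the classic Floyd–Steinberg scheme applied to one primary channel; this is only a sketch of one common choice, and the exact halftoning configuration used in the scheme is described in the paper's earlier sections.

```python
import numpy as np

def error_diffusion_halftone(channel):
    """Floyd-Steinberg error diffusion on one color channel (array of values in [0, 255])."""
    img = channel.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255 if old >= 128 else 0
            out[y, x] = new
            err = old - new
            # Distribute the quantization error to the unprocessed neighbours.
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:
                img[y + 1, x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                img[y + 1, x + 1] += err * 1 / 16
    return out
```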
   Fig. 9: The proposed framework of the color VCS (the original image is decomposed into red, green and blue channels, digitally halftoned, and encrypted by the visual cryptography scheme into Share1 and Share2, which pass through the display models before the decrypted image is obtained).
Data Hiding with Rate-Distortion Optimization on
                                            H.264/AVC Video
                                                   Yih-Chuan Lin and Jung-Hong Li
             Dept. of Computer Sciences and Information Engineering, National Formosa University, Yunlin, Taiwan.
                                                        E-mail: lyc@nfu.edu.tw
Abstract - This paper proposes a data hiding algorithm for H.264/AVC standard videos. The proposed video data hiding scheme embeds information that is useful to specific applications into the symbols of the context adaptive variable length coding (CAVLC) domain in H.264/AVC video streams. In order to minimize the changes to both the reproduced video quality and the output bit-rate, the algorithm selects DCT blocks using a coefficient energy difference (CED) rule and then modifies the minor significant symbols, namely the trailing one (T1) symbols and the least significant bits (LSB) of non-zero quantized coefficient symbols, to hide data in the selected blocks. Considering the joint optimization of rate and distortion, the algorithm treats the data hiding task as a special quantization process and performs it within the rate-distortion optimization loop of the H.264/AVC encoder. Experimental results demonstrate that our scheme achieves good efficiency in hiding capacity, video quality and output bit-rate.
Keywords: H.264/AVC, data hiding, CAVLC, reconstruction loop, coefficient energy difference.

                      I. INTRODUCTION
     Information hiding (also called data hiding interchangeably hereafter) for video is a process that adds useful data to the raw or compressed formats of a video in such a manner that third parties cannot perceive the presence or contents of the hidden message.
     H.264/AVC provides better compression efficiency than other existing standards at the cost of high computational complexity. Owing to the high popularity of this format in many video applications, hiding useful data in it has attracted a great deal of attention. Recently, many researchers have developed watermarking schemes for H.264/AVC [1-4], but in order to balance video quality and bit-rate they usually offer only a small capacity for hiding data. This paper proposes a data hiding (also called watermarking interchangeably hereafter) scheme that is based on the CAVLC at the H.264/AVC encoder and decoder sides. In the proposed method, one watermark bit is embedded by exploiting the relationship among the polarities of all T1 symbols in a 4x4 luminance DCT block. If the DCT block has no T1, the algorithm considers modifying the LSB of the last nonzero coefficient to embed the information. Experimental results show that our proposed method provides more capacity and can enhance the rate-distortion efficiency. The degradation of video quality caused by the watermark hiding can be kept within a bound of less than 2 dB.
     The remainder of this paper is organized as follows. Section 2 describes the watermarking principles and related literature. Section 3 explains our proposed scheme, including the watermark embedding/extracting schemes and the embedding restriction rule. In Section 4, the performance of our proposed scheme is presented. Finally, some conclusions are given in Section 5.

                      II. BACKGROUND
     In general, most data hiding methods for H.264/AVC are based on entropy coding symbols or motion vectors (MV). There are two kinds of entropy coding methods in H.264/AVC: CAVLC and CABAC (context-adaptive binary arithmetic coding). Many researchers choose CAVLC because it is less complicated and easy to work with in most situations. The nonzero coefficients in DCT blocks can be modified for embedding, but doing so carelessly would seriously affect the bit-rate and video quality. Although hiding a watermark in the DCT blocks is easy to implement, unnecessary degradation should be avoided.
     After transform and quantization, a DCT block usually contains many zeros and only sparse nonzero coefficients. The high-frequency nonzero coefficients after the zig-zag reordering are often sequences of ±1; these are called trailing ones (T1) and are limited to at most three in H.264/AVC. The more trailing ones there are, the shorter the coding length becomes, so most researchers focus on this part when developing data hiding algorithms. Consider changing the coefficients in a DCT block. Four symbols are available in CAVLC: coeff_token, trailing_ones_sign_flag, total_zero, and run_before. The coeff_token encodes the numbers of nonzero coefficients and T1 symbols in a DCT block. For the same block, if the number of trailing ones increases, the bit-rate decreases; on the contrary, when the number of nonzero coefficients rises, the bit-rate increases.
hereafter) scheme that is based on the CAVLC in H.264/AVC                        In Wu et al. [4], their proposed method is emphasizing on
encoder and decoder sides. In the proposed method, one                     robustness to the compression attacks for H.264/AVC with
watermark bit is embedded by employing the relationship                    more than a 40:1 compression ratio in I frame. The data
between all of the polarity of T1 symbols in a 4x4 luminance               embedded to the predicted 4x4 DCT block is only one bit. In
DCT block. If the DCT block has no any T1, the algorithm                   Tian et al. [5], this proposed method just modified the nonzero
considers modifying the LSB of the last nonzero coefficient                coefficients. Therefore, the bit-rate increase is about 0.1% and
for embedding information. Experiment results have shown                   the PSNR degradation is less then 0.5dB. It is good at keeping
that our proposed method provide more capacity and can                     low bit-rate and high quality. However the capacity is too low.
enhance the rate-distortion efficiency. The degradation of                 In Liao et al. [6], this method embeds message into the trailing
                                                                           ones of 4x4 blocks during the CAVLC. The feature of this

                                                                    1125
method is that it allows data hiding directly in the compressed stream in real time, and its capacity is higher than that of the others [5-6]. The method of Shahid et al. [7] also embeds the watermark into DCT blocks; it modifies the LSB of coefficients in both inter- and intra-frames and provides a high data hiding capacity. Huang et al. [8] proposed a new steganography scheme with capacity variability and synchronization for the secure transmission of acoustic data. The method of Wang et al. [9] has good efficiency: the PSNR is always higher than 45 dB at a hiding capacity of 1.99 bpp for all test images.

                 III. THE PROPOSED SCHEME

A. OVERVIEW OF OUR METHOD
     Figure 1 depicts the block diagram of our proposed method at the H.264/AVC encoder side. The watermark embedding method is inserted into H.264/AVC during the encoding process, and data is hidden in DCT blocks before entropy coding. In our proposed method, the watermarking is done on luminance DCT blocks in both intra and inter modes; the chrominance DCT blocks are not considered.

Fig. 1. Schematic illustration of our proposed watermarking/embedding procedure.

     When the encoder executes the information hiding method, the rate-distortion trade-off must be considered. Because the changes made by marking are reflected in the reconstructed frame, and the encoding of the next frame refers to this marked reconstructed frame, we must take the reconstruction loop into account [7]. In other words, the data hiding block should operate inside the reconstruction loop, or inside the reconstruction loop together with RDO (Rate-Distortion Optimization); otherwise, the bit-rate and video quality would be seriously affected by the prediction drift between the encoder and decoder sides.
     In H.264/AVC encoding, RDO helps the current frame select the best mode and obtain the best trade-off between quality distortion and bit-rate. Therefore, our method takes RDO into account in order to obtain better coding performance while embedding the information into the video blocks. Fig. 2 illustrates the embedding procedure at the macro-block level. When a macro-block enters the encoder side, the encoder first determines its encoding mode. If the macro-block is inter-mode, the encoder performs both inter- and intra-prediction to select the best mode from the mode set, which includes the PSKIP, P16x16, P16x8, P8x16, P8x8, I4MB, I16MB and IPCM modes. When the macro-block is intra-mode, the encoder performs intra-prediction and the mode set contains only the I4MB, I16MB and IPCM modes.

Fig. 2. The proposed watermarking method at macro-block level.

Fig. 3. The proposed watermarking integrated with the RDO procedure.

     As indicated in Fig. 2, our proposed method is also integrated within the RDO procedure at the encoder side. When the encoder performs the RDO procedure, it selects the best coding mode while watermarking is done at the same time. That mode might differ from the mode chosen without watermarking, but its bit-rate and video quality are the best among the modes in the mode set. Fig. 3 illustrates the detail of the "RDCost with watermarking" block shown in Fig. 2. As described previously, we focus on both intra- and inter-blocks of the luminance component for data hiding. As indicated in Fig. 3, the IPCM and SKIP modes are not considered for embedding.
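As a rough, implementation-agnostic sketch of how embedding can sit inside the mode-decision loop (this is our reading of Figs. 2 and 3, not the JM reference code; all names and signatures are illustrative, and the actual transform, embedding and cost functions are passed in as callables):

```python
from typing import Any, Callable, Iterable, Tuple

def choose_mode_with_watermark(mb: Any,
                               mode_set: Iterable[str],
                               quantize: Callable[[Any, str], Any],
                               embed: Callable[[Any], Any],
                               rd_cost: Callable[[Any, Any, str], float]) -> Tuple[str, Any]:
    """Pick the macro-block mode whose rate-distortion cost, measured after embedding,
    is lowest -- mirroring the 'RDCost with watermarking' block of Figs. 2 and 3."""
    best_mode, best_coeffs, best_cost = None, None, float("inf")
    for mode in mode_set:
        coeffs = quantize(mb, mode)            # transform + quantization for this candidate mode
        if mode not in ("IPCM", "SKIP"):       # Fig. 3: IPCM and SKIP are never used for embedding
            coeffs = embed(coeffs)             # hide the watermark bits in the quantized block
        cost = rd_cost(mb, coeffs, mode)       # distortion + lambda * rate of the marked block
        if cost < best_cost:
            best_mode, best_coeffs, best_cost = mode, coeffs, cost
    return best_mode, best_coeffs
```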
     As previously described, our method can be performed within RDO inside the reconstruction loop. As shown in Fig. 2, the block "Get best MB mode" selects the best mode to carry out the coding task. The performance of data hiding without RDO is not
better than that of the method that considers RDO, based on the results shown in a later section.
     Fig. 4 illustrates the integration of the proposed method with the H.264/AVC decoder. An extracting algorithm is inserted into the H.264/AVC decoding phase, and the extraction is performed on DCT blocks after entropy decoding. In our method, the watermark is embedded in luminance DCT blocks in both intra- and inter-modes, so the extraction needs to be performed only on the luminance DCT blocks.

Fig. 4. Schematic illustration of the proposed watermark extracting procedure.

B. THE RESTRICTION OF OUR METHOD
     In the literature, most methods utilize the quantized coefficients for embedding, and they share the common feature of modifying only the magnitude of a coefficient, not its sign. The proposed algorithm instead utilizes the relation among the polarities of the T1 symbols for embedding; the polarity and the sign of a coefficient are related.
     Based on experiments, we observe that when the nonzero coefficients of a DCT block are sparse, changing the sign of a trailing one causes the bit-rate to increase significantly. In the intra-prediction phase, the current block refers to the upper and left blocks to make its prediction and encodes the prediction residual. When the sign of a trailing one is changed in a block with sparse nonzero coefficients, the block data in the spatial domain changes greatly, because the energy changed by the sign flip is a larger proportion of the whole block's coefficient energy. When the coded block is then referenced by other, not-yet-coded blocks, this adverse effect propagates to them through the reconstruction loop. Thus, we have to devise a mechanism to prevent this effect: if the nonzero coefficients are sparse and the energy of the trailing one whose sign would be changed occupies a significant proportion of the block's coefficient energy, we do not hide any watermark bits in that DCT block.
     In our method, we set a threshold to decide whether a DCT block is suitable for embedding data. First, we calculate the coefficient energy of the current DCT block and the CED after changing the sign of one trailing one. If the change rate of the CED is less than the prespecified threshold, the block is chosen to hide data; otherwise, the block is kept intact. A simple example is shown in Fig. 5.

Fig. 5. Example illustration of the proposed watermark restriction.
     In this example there is a 4x4 DCT block with five nonzero coefficients and the threshold is set to 0.25. After zig-zag scanning, the coefficient sequence is -2, 4, 3, -3, 0, 0, -1, and the last trailing one is -1. Before the embedding phase, we first calculate the CED and compare its value with the threshold. As shown in Fig. 5, the block satisfies our restriction because the CED is lower than the threshold.
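The restriction test can be sketched as follows. The exact CED formula is defined earlier in the paper; here we assume it is the energy of the sign-flipped trailing one taken as a fraction of the total coefficient energy of the block, which is our own simplification, and the function and variable names are illustrative.

```python
def ced_allows_embedding(levels, threshold=0.25):
    """Return True when the assumed CED (|last T1|^2 / total block energy) is below the threshold."""
    nonzero = [c for c in levels if c != 0]            # nonzero levels in zig-zag order
    if not nonzero or abs(nonzero[-1]) != 1:
        return False                                   # no trailing one available to flip
    total_energy = float(sum(c * c for c in nonzero))
    ced = (nonzero[-1] ** 2) / total_energy
    return ced < threshold

# Worked example from Fig. 5: sequence -2, 4, 3, -3, 0, 0, -1
# total energy = 4 + 16 + 9 + 9 + 1 = 39, assumed CED = 1/39 ~ 0.026 < 0.25, so the block is used.
print(ced_allows_embedding([-2, 4, 3, -3, 0, 0, -1]))  # True
```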
                                                                            polarity values of trailing ones to hide data. If the sign of
                                                                            trailing one is negative, the polarity value is 0. On the contrary,
                                                                            the sign of trailing one is positive, the polarity value is 1. The
                                                                            polarity values of trailing ones are through an XOR operation.
                                                                            The result must be identical to the value of the watermark bit to
                                                                            be hided into the block; otherwise we should change the sign
                                                                            of last trailing one to satisfy the hiding condition. If the result


                                                                     1127
equals the watermark bit, the process does not modify anything in the block. The algorithm changes the sign of the last trailing one because the last trailing one, lying in the high-frequency zone, has lower energy than the other trailing ones, so the change does not cause significant degradation of quality or bit-rate.

       Table II. Pseudo code of the embedding algorithm
  Embedding Algorithm
  Input:  DCTB
  Output: DCTB'
  Initialization:
      T1set    <- getT1set(DCTB)
      numT1    <- getT1count(T1set)
      numLevel <- getLevcount(DCTB)
  Begin Embedding()
      if (numT1 != 0)
          coeEnergy <- getEnergy(DCTB)
          if (coeEnergy < Threshold)
              W' <- XorT1Polarity(T1set)
              if (W' != W)
                  LastT1 <- getLastT1Index(DCTB)
                  ChangeSign(DCTB, LastT1)
                  output DCTB'
              end
          end
      else if (numT1 == 0 and numLevel != 0)
          LastLevel <- getLastLevIndex(DCTB)
          ChangeLSB(DCTB, LastLevel, W)
          output DCTB'
      end
  End

     The second part, used when the number of nonzero coefficients is nonzero but the number of trailing ones is zero, changes the LSB of the last level to hide data. Otherwise, if the numbers of levels and trailing ones are both zero, we do not perform any embedding. The advantage of the method in the first case is that the sign change does not affect other symbols in the same block. According to the CAVLC rule, the trailing_ones_sign_flag indicates the sign of a trailing one and is encoded as one bit in the NAL (Network Abstraction Layer): if the sign is negative, it is encoded as bit 1; if it is positive, it is encoded as bit 0. We change only the sign of the last trailing one, so the encoded block has the same length as before the embedding process.

D. EXTRACTING ALGORITHM
     The extracting phase, shown in Table III, is simpler than the embedding phase. The watermark extracting algorithm is performed between the entropy decoding phase and the inverse quantization phase. We first find all of the trailing ones in the current DCT block and calculate the CED value; if the CED is lower than the threshold and the number of trailing ones is nonzero, we collect the polarity values of all the trailing ones and XOR them to obtain the watermark bit. If the number of trailing ones is zero and a last nonzero level exists, we take the LSB of the last level as the watermark bit. If the numbers of levels and trailing ones are both zero, we do nothing.

       Table III. Pseudo code of the extracting algorithm
  Extracting Algorithm
  Input:  DCTB'
  Output: W
  Initialization:
      T1set    <- getT1set(DCTB')
      numT1    <- getT1count(T1set)
      numLevel <- getLevcount(DCTB')
  Begin Extracting()
      if (numT1 != 0)
          coeEnergy <- getEnergy(DCTB')
          if (coeEnergy < Threshold)
              W <- XorT1Polarity(T1set)
              output W
          end
      else if (numT1 == 0 and numLevel != 0)
          LastLevel <- getLastLevIndex(DCTB')
          W <- getLSB(DCTB', LastLevel)
          output W
      end
  End
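A compact, self-contained sketch of the two-part embed/extract logic on a zig-zag-ordered coefficient list is given below. It mirrors Tables II and III but is our own illustrative code, not the JM implementation, and it assumes the CED restriction of Section III-B has already been checked.

```python
def t1_positions(levels):
    """Positions (in zig-zag order) of the trailing +/-1 levels, at most three, counted from the end."""
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    t1 = []
    for i in reversed(nonzero):
        if abs(levels[i]) == 1 and len(t1) < 3:
            t1.append(i)            # t1[0] ends up being the last (highest-frequency) trailing one
        else:
            break
    return t1

def embed_bit(levels, w):
    """Embed one watermark bit w (0 or 1) into a list of quantized levels in zig-zag order."""
    levels = list(levels)
    t1 = t1_positions(levels)
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    if t1:                                              # part 1: XOR of T1 polarities must equal w
        polarity = 0
        for i in t1:
            polarity ^= 1 if levels[i] > 0 else 0
        if polarity != w:
            levels[t1[0]] = -levels[t1[0]]              # flip the sign of the last trailing one
    elif nonzero:                                       # part 2: no T1 -> set LSB of the last level
        i = nonzero[-1]                                 # with no T1, this level has magnitude >= 2,
        sign = 1 if levels[i] > 0 else -1               # so writing the LSB can never zero it out
        levels[i] = sign * ((abs(levels[i]) & ~1) | w)
    return levels

def extract_bit(levels):
    """Recover the hidden bit, or None if the block carries no data."""
    t1 = t1_positions(levels)
    nonzero = [i for i, c in enumerate(levels) if c != 0]
    if t1:
        w = 0
        for i in t1:
            w ^= 1 if levels[i] > 0 else 0
        return w
    if nonzero:
        return abs(levels[nonzero[-1]]) & 1
    return None

# Example: hide bit 1 in the Fig. 5 block and read it back.
marked = embed_bit([-2, 4, 3, -3, 0, 0, -1], 1)
print(marked, extract_bit(marked))
```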
                IV. EXPERIMENTAL RESULTS

A. THE EXPERIMENT ENVIRONMENT

  Table IV. Experimental parameters for the H.264/AVC codec
  Parameter                    Information
  Profile IDC                  66 (baseline)
  Intra period                 15 (I-P-P-P)
  Slice mode                   0
  Frames to be encoded         300
  Motion estimation scheme     Fast Full Search
  Rate control                 Disabled

  Table V. Test video format parameters
  Parameter                    Information
  Video format                 QCIF
  YUV format                   4:2:0
  Frame size                   176×144
  Frame rate                   30 fps

     We utilize the H.264/AVC JM reference software [10] as the platform to simulate our proposed method. This subsection presents the experimental parameters for our method in the JM reference software. We use version 12.2 of the JM software; the related environmental parameters are shown in Table IV. In the experiments, four videos, "akiyo," "foreman,"
"mobile," and "news," are used as the test data set. Their format information is shown in Table V. The secret data to be hidden in the test videos is a random bit stream.

B. THE EXPERIMENT RESULTS
     In this subsection we present the experimental results and explain them. Three methods are considered. The original method refers to the method without data hiding; the "within RDO" method represents the method operated in the RDO loop, while the "without RDO" method executes after the RDO stage in the reconstruction loop of the encoder. As shown in Figs. 6 and 7, the "within RDO" method is superior to the "without RDO" method in terms of the output video bit-rate and the reconstructed video PSNR.

Fig. 6. Comparison of the video quality for video foreman encoded at varying QP values.

Fig. 7. Comparison of output bit-rate for video foreman encoded at varying QP values.

     In Fig. 7, the bit-rate of the "within RDO" method is higher than that of the original, which is not desirable for some applications. We use a threshold value of the CED to select appropriate DCT blocks for embedding data; the number of DCT blocks that can carry data decreases with the restriction threshold. This mechanism helps us control the degradation of the marked video quality, the bit-rate change, and the capacity of data hiding.
     In the experiments, we set the threshold value T of the embedding restriction rule to 1, 0.5, 0.1 or 0.05 for the "within RDO" scheme. The results are shown in Tables VI to VIII. We find that the degradation of quality is reduced from 3 dB to 1 dB and that the bit-rate after embedding does not increase significantly when the restriction rule is applied.

  Table VI. Efficiency of the original and proposed methods for foreman at QP = 15
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 47.32        969.62
  Without ER               45.18       1070.09             337752
  With ER, T = 0.5         46.35       1023.22             165019
  With ER, T = 0.1         46.35       1024.11             164923
  With ER, T = 0.05        46.36       1025.11             165190

  Table VII. Efficiency of the original and embedding methods for foreman at QP = 27
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 37.5         196.26
  Without ER               36.62        228.05              80708
  With ER, T = 0.5         37.33        205.92              22118
  With ER, T = 0.1         37.33        205.7               22216
  With ER, T = 0.05        37.32        205.59              22273

  Table VIII. Efficiency of the original and proposed methods for foreman at QP = 31
                         PSNR (dB)   Bit-rate (kbit)   Capacity (bit)
  Original                 34.86         74.92
  Without ER               34.1          140.93             48152
  With ER, T = 0.5         34.65         127.25             11449
  With ER, T = 0.1         34.64         126.81             11289
  With ER, T = 0.05        34.63         126.85             11409

     From the experiments, we can observe that the degradation of bit-rate and video quality caused by embedding can be controlled effectively by adding the embedding restriction. This also raises another issue: when the threshold is small, the performance improves only up to a saturation level. In other words, the effectiveness of the embedding restriction rule has a limit in controlling the degradation. For the other test videos, we illustrate the results in terms of video quality and bit-rate in Figs. 8-15.

Fig. 8. Comparison of the video quality between our method and the original for video foreman at varying QP values.
Fig. 9. Comparison of the bit-rate between our method and the
original for video foreman at varying QP values                        Fig. 13. Comparison of the video quality between our method
                                                                       and the original for video mobile at varying QP values




Fig. 10. Comparison of the video quality between our method
and the original for video akiyo at varying QP values
                                                                       Fig. 14. Comparison of the video quality between our method
                                                                       and original for video news at varying QP values




Fig. 11. Comparison of the bit-rate between our method and
the original for video akiyo at varying QP values

Fig. 15. Comparison of the video quality between our method and the original for video news at varying QP values

     For smaller threshold values, most of the DCT blocks in the video are excluded from T1-symbol modification. However, this does not break the scheme, because in that case the LSB of the last coefficient in the block is modified instead. Therefore, for smaller threshold values, the number of DCT blocks hidden using the T1 symbols is smaller than the number using LSB replacement. This means that the bit-rate and video quality remain saturated, since changing only the LSB of the last coefficient in a block does not affect the bit-rate and PSNR significantly.
Fig. 12. Comparison of the video quality between our method            The capacity for each test video is shown in Figs. 16-19
and the original for video mobile at varying QP values


     According to Fig. 3, our proposed method does not target SKIP-mode blocks for data hiding. When the cost of the SKIP mode is lower than that of the other modes, the mode decision phase selects SKIP as the block mode; the number of SKIP-mode blocks increases with the QP value, as shown by the results in Fig. 20.




Fig. 16. Comparison of the capacity between our method and
the original for video foreman at varying QP values




Fig. 20. Comparison of the number of SKIP-mode blocks for video foreman encoded at varying QP values

     In Figs. 21 to 23, our proposed method and Shahid's [7] are compared in terms of bit-rate, PSNR and capacity. Two variants of our proposed method are shown, one with the CED threshold T = 0.1 and the other with T = 0.5. When the QP value is higher than 11, Shahid's capacity declines rapidly because the number of coefficients at high QP values is sparse. The efficiency of our method with the CED is close to Shahid's in terms of bit-rate and video quality.

Fig. 17. Comparison of the capacity between our method and the original for video akiyo at varying QP values




Fig. 18. Comparison of the capacity between our method and
the original for video mobile at varying QP values
Fig. 21. Comparison of video quality between our proposed method and Shahid's for video foreman encoded at varying QP values




Fig. 19. Comparison of the capacity between our method and the original for video news at varying QP values
Fig. 22. Comparison of bit-rate between our proposed method and Shahid's for video foreman encoded at varying QP values

Fig. 23. Comparison of capacity between our proposed method and Shahid's for video foreman encoded at varying QP values

     In Table IX, we compare the capacity performance of Shahid's scheme and our proposed algorithm. At the same QP, our method provides a higher capacity than Shahid's, and the capacity of Shahid's method decreases severely as the QP value increases.

  Table IX. Capacity (bit) of our method and Shahid's [7] for foreman at varying QP
  QP     Proposed, T = 0.5    Proposed, T = 0.1    Shahid [7]
  11          281591               281497             280578
  15          165019               164923             139629
  19           82915                83241              67582
  23           40620                40652              29851
  27           22118                22216              12108
  31           11449                11289               4357

                    V. CONCLUSIONS

     In this paper, we propose a data hiding algorithm that takes the rate-distortion performance of the H.264/AVC standard into account. The algorithm can control the increase of bit-rate and the decrease of PSNR after hiding secret data in the videos, at the cost of reducing the capacity of the data to be hidden. The information is hidden in the T1 symbols of the CAVLC domain in the H.264/AVC encoder. In order to reduce the propagation of the hiding modification to subsequent blocks, the proposed algorithm selects blocks with minor energy change to hide data. With this selection scheme, the proposed algorithm can adjust the threshold value to adapt the capacity to different application requirements.

                    ACKNOWLEDGEMENT
This research is supported in part by the National Science Council, Taiwan, under grant NSC 98-2221-E-150-051.

                        REFERENCES
[1] G. Qiu, P. Marziliano, A. Ho, D. He, Q. Sun, "A Hybrid Watermarking Scheme for H.264 Video," Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 865-868, Aug. 2004.
[2] S.K. Kapotas, E.E. Varsaki, A.N. Skodras, "Data Hiding in H.264 Encoded Video Sequences," IEEE 9th Workshop on Multimedia Signal Processing, Crete, pp. 373-376, October 1-3, 2007.
[3] B.G. Mobasseri, Y.N. Raikar, "Authentication of H.264 Streams by Watermarking CAVLC Blocks," SPIE Conference on Security, Steganography and Watermarking of Multimedia Contents IX, San Jose, CA, January 28-February 2, 2007.
[4] G.Z. Wu, Y.J. Wang, W.H. Hsu, "Robust watermark embedding detection algorithm for H.264 video," Journal of Electronic Imaging, 14(1), 013013, 2005.
[5] L. Tian, N. Zheng, J. Xue and T. Xu, "A CAVLC-Based Blind Watermarking Method for H.264/AVC Compressed Video," Asia-Pacific Services Computing Conference (APSCC 2008), pp. 1295-1299, IEEE, 2008.
[6] K. Liao, D. Ye, S. Lian, Z. Guo, J. Wang, "Lightweight Information Hiding in H.264/AVC Video Stream," 2009 International Conference on Multimedia Information Networking and Security, vol. 1, pp. 578-582, 2009.
[7] Z. Shahid, M. Chaumont, W. Puech, "Considering the Reconstruction Loop for Data Hiding of Intra and Inter Frames of H.264/AVC," European Signal Processing Conference (EUSIPCO), 2009.
[8] X. Huang, Y. Abe, and I. Echizen, "Capacity Adaptive Synchronized Acoustic Steganography Scheme," Journal of Information Hiding and Multimedia Signal Processing, Vol. 1, No. 2, pp. 72-90, Apr. 2010.
[9] Z.H. Wang, T.D. Kieu, C.C. Chang, M.C. Li, "A Novel Information Concealing Method Based on Exploiting Modification Direction," Journal of Information Hiding and Multimedia Signal Processing, Vol. 1, No. 1, pp. 1-9, Jan. 2010.
[10] K. Sühring, H.264/AVC Reference Software [Online]. Available: http://iphome.hhi.de/suehring/tml/, Joint Model 12.2 (JM12.2), Jan. 2009.
Secret-fragment-visible Mosaic — a New Image Art and Its Application to
                                  Information Hiding


                   I-Jen Lai (賴怡臻)                                                      Wen-Hsiang Tsai (蔡文祥)
    Institute of Computer Science and Engineering                                        Dept. of Computer Science
   National Chiao Tung University, Hsinchu, Taiwan                            National Chiao Tung University, Hsinchu, Taiwan
         Email: nekolai.cs97g@g2.nctu.edu.tw                                            Email: whtsai@cs.nctu.edu.tw


Abstract—A new type of art image called the secret-fragment-visible mosaic image is created, which is composed of rectangular-shaped fragments yielded by division of a secret image. To create this kind of mosaic image, the 3-D RGB color space is transformed into a 1-dimensional h-colorscale, based on which a new image similarity measure is proposed; the most similar candidate image from an image database is selected accordingly as the target image. Then, a greedy algorithm is adopted to fit every tile image in the secret image into a properly selected block in the target image, with the effect of embedding the secret image fragmentally and visibly in the composed mosaic image. In addition to this type of secret image hiding, secret message bits may be embedded as well for the purpose of covert communication. Based on the fact that tile images in an identical bin of the histogram of the created mosaic image have similar colors, the tile images in each histogram bin are reordered pairwise and their relative positions are switched accordingly, so that secret message bits are embedded without creating noticeable changes in the resulting mosaic image. The embedded message is protected by a secret key and may be extracted from the stego-image using the key. Additional security measures are also discussed. Experimental results show the feasibility of the proposed methods.
   Keywords: secret-fragment-visible mosaic image, covert communication, data hiding.

                      I. INTRODUCTION
   Mosaics are artworks created by composing small pieces of materials, such as stone, glass and tile. Nowadays, they are popularly used for decorating houses and other constructions. The creation of mosaic images by computers has become a new research topic in recent years. Traditional mosaic images are obtained by arranging a large number of small images, called tile images, in a certain manner so that each tile image represents a small piece of a source image, named the target image. Consequently, when we see a mosaic image from a distance, as a whole it looks like its source image — an effect of a human vision property. Many methods have been proposed to create different types of mosaic images [1-8].
   Haeberli [1] proposed a method for mosaic image creation using voronoi diagrams, placing the sites of blocks randomly and filling colors into the blocks based on the content of the original image. Hausner [2] created tile mosaic images by using centroidal voronoi diagrams. Dobashi et al. [3] improved the voronoi diagram to allow a user to add various effects to the mosaic image, such as the simulation of stained glass. Kim and Pellacini [4] generated jigsaw image mosaics composed of tiles of arbitrary shapes selected from a database. Extending the concept of [4], Blasi et al. [5] presented a new mosaic image called the puzzle image mosaic. Lin and Tsai [6] embedded secret data in image mosaics by adjusting boundary regions and altering pixel color values. Wang and Tsai [7] hid data in image mosaics by utilizing the overlapping spaces of component images. Hung and Tsai [8] embedded data in stained-glass-like mosaic images by modifying the tree structure used in the creation process. Hsu and Tsai [9] presented a new type of art image, the circular-dotted image, and used the characteristics of its creation process to hide secret messages in the generated art image. Chang and Tsai [10] proposed a new type of art image, called the tetromino-based mosaic, which is composed of tetrominoes appearing in a video game; data hiding is made possible by distinct combinations and color shifting of the tetromino elements.
   A new type of art image, called the secret-fragment-visible mosaic image, which contains small fragments of a secret source image, is proposed in this study. Observing such a mosaic image, people can see all of the fragments of the secret image, but the fragments are so tiny in size and so random in position that people cannot figure out what the source image looks like, unless they have some way to rearrange the pieces back into their original positions using a secret key from the image owner. Therefore, the source image may be said to be secretly embedded in the resulting mosaic image, even though the fragment pieces are all visible to an observer of the image. This is why we name the resulting image a secret-fragment-visible mosaic image.
   In the remainder of this paper, the proposed mosaic image creation process is described in Section II, a covert communication method via secret-fragment-visible mosaic images is proposed in Section III, and some experimental results are presented in Section IV, followed by conclusions in Section V.

     II. PROPOSED MOSAIC IMAGE CREATION PROCESS
   The proposed mosaic image creation process is composed of two major stages. The first is the construction of a
database which can be used later to select similar target images for given secret images. The quality of a constructed secret-fragment-visible mosaic image is related to the similarity between the secret image and the target image; the selected target image should be as similar to the secret image as possible. An appropriate similarity measure for this purpose is proposed in this study and described later. The other stage is the creation of the desired mosaic image using the secret image and the target image as input. In this stage, the secret image is divided into fragment pieces taken as tile images, which are then used to create the mosaic image. The number of tile images is limited by the size of the secret image and that of the tile images. Note that this is not the case in traditional mosaic image creation, where the tile images available for fitting into the target image are unlimited in number. In order to solve this problem of fitting a limited number of tile images into a target image, a greedy algorithm is proposed, which is described later as well.

2.1 Database Construction

The database plays an important role in the secret-fragment-visible mosaic image creation process. If a target image is dissimilar to a secret image, the created image will be distinct from the target one. In order to generate good results, the database should be as large as possible.

Searching a database for a target image with the highest similarity to the secret image is a problem of content-based image retrieval. A technique to solve this problem is to base the similarity on a 1-D color histogram transformation [12] of the color distribution of the image. The transformation maps the three color channel values into a single value. Specifically, each color channel is first re-quantized into fewer levels, yielding a new image I′ with a lower resolution in color, specified by (r′, g′, b′). Let Nr, Ng, and Nb denote the numbers of levels of the new color values r′, g′, and b′, respectively. Then, for each pixel P′ in I′ with new colors (r′, g′, b′), the following 1-D function value f is computed:

    f(r′, g′, b′) = r′ + Nr g′ + Nr Ng b′.    (1)

However, according to our experimental experience, this 1-D color function f is found inappropriate for our study, where the human's visual feeling of image similarity must be emphasized, as shown by Fig. 1(a). Therefore, we propose a new function h as follows:

    h(r′, g′, b′) = b′ + Nb r′ + Nb Nr g′,    (2)

where the numbers of levels, Nr, Ng, and Nb, are all set to 8. Differently from the case in (1), in (2) we assign the largest weight NbNr to the green channel value g′ and the smallest weight 1 to the blue channel value b′. The reason is that the eyes of human beings are the most sensitive to the green color and the least sensitive to the blue one. In addition, with all of Nr, Ng, and Nb set to 8 in (2), an advantage of speeding up the mosaic image creation process is obtained according to our experiments. Subsequently, we say that the new color feature function h proposed above defines a 1-D h-colorscale. The resulting image created by our method is shown in Fig. 1(b), which has noticeably less noise than Fig. 1(a).

Figure 1. Effects of mosaic image creation using different color similarity measures. (a) Image created with the similarity measure of [12]. (b) Image created with the proposed similarity measure.

Furthermore, to compute the similarity measure between a tile image in the secret image and a target block in a database image for use in tile-image fitting during mosaic image generation, we propose a new feature, called the h-feature, for each block image C (either a tile image or a target block), denoted as hC, which is computed by the following steps:

1. compute the average of the color values of all the pixels in C as (RC, GC, BC);
2. re-quantize (RC, GC, BC) into (rC′, gC′, bC′) using the new Nr, Ng, and Nb color levels; and
3. calculate the h-feature hC for C by Eq. (2) above, resulting in the following equation:

    hC(rC′, gC′, bC′) = bC′ + Nb rC′ + Nb Nr gC′.    (3)

With Nr, Ng, and Nb all set equal to 8, the computed values of the h-feature hC above range from 0 to 584. The proposed algorithm for constructing a database of candidate images for use in generating secret-fragment-visible mosaic images is described in the following.

Algorithm 1: construction of candidate image database.
Input: a set S of images, a pre-selected tile image size Zt, and a pre-selected candidate image size Zc.
Output: a database DB of candidate images with size Zc and their corresponding h-colorscale histograms.
Steps:
Step 1. For each input image I, perform the following steps.
    1.1 Resize and crop I to yield an image D of size Zc.
    1.2 Divide D into blocks of size Zt.
    1.3 For each block C of D, calculate and round off the h-feature value hC described by Eq. (3).
    1.4 Generate a histogram H of the h-feature values of all the blocks in D.
    1.5 Save H with D into the desired database DB.
Step 2. If the input images are not exhausted, go to Step 1; otherwise, exit.
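As a concrete illustration of Eq. (3) and Steps 1.3-1.4 of Algorithm 1, the following Python sketch computes the h-feature of one block and builds the h-colorscale histogram of an image. The function names, the 0-based quantization, and the use of NumPy are our own assumptions for illustration; the paper itself only specifies the formulas.

    import numpy as np

    N_R = N_G = N_B = 8      # numbers of re-quantized color levels (all set to 8 in the paper)
    H_BINS = 585             # the paper quotes h-feature values in the range 0..584

    def h_feature(block):
        # h-feature of one block (Eq. (3)): average the block's color,
        # re-quantize each channel into 8 levels, and map to a single value
        # with the largest weight on green and the smallest on blue.
        # `block` is an H x W x 3 uint8 array in (R, G, B) order; the 0-based
        # quantization below is our assumption.
        avg = block.reshape(-1, 3).mean(axis=0)          # (R_C, G_C, B_C)
        r, g, b = (avg // (256 // N_R)).astype(int)      # re-quantized (r', g', b')
        return int(b + N_B * r + N_B * N_R * g)

    def h_histogram(image, tile_size):
        # h-colorscale histogram of an image, as in Steps 1.2-1.4 of Algorithm 1:
        # divide the image into tile_size x tile_size blocks and count h-features.
        rows, cols = image.shape[:2]
        hist = np.zeros(H_BINS, dtype=int)
        for y in range(0, rows - tile_size + 1, tile_size):
            for x in range(0, cols - tile_size + 1, tile_size):
                hist[h_feature(image[y:y + tile_size, x:x + tile_size])] += 1
        return hist

Because every block is reduced to one average color before quantization, the histogram remains cheap to compute even for a large candidate database.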
2.2 Similarity Measure Computation

Before generating a mosaic image, we have to choose as the target image the most similar candidate image from the database, based on the content of the given secret image. For this, we define a difference measure e between the 1-D histogram
HS of the secret image S and that of a candidate image D in the database in the following way:

    e = Σ_{m=0}^{584} |HS(m) − HD(m)|,    (4)

where m stands for an h-feature value. The smaller the value e is, the more similar the candidate image D is to the secret image S. After calculating the errors of all the images in the database, we can select the one with the smallest error as the desired target image for use in mosaic image generation. The details of selecting the most similar candidate image from a database are given as follows.

Algorithm 2: selection of the most similar candidate image as a target image.
Input: a secret image S, a database DB of candidate images, and the sizes Zt and Zc mentioned in Algorithm 1.
Output: the target image T in DB which is the most similar to S.
Steps:
Step 1. Resize S to yield an image S′ of size Zc so that it becomes of the same size as the candidate images in DB.
Step 2. Divide S′ into blocks of size Zt, and perform the following steps.
    2.1 For each block C of S′, calculate its h-feature value hC by Eq. (3) and round off the result.
    2.2 Generate a 1-D h-colorscale histogram HS′ for S′ from the h-feature values of all the blocks in S′.
Step 3. For each candidate image D with 1-D h-colorscale histogram HD in DB, perform the following steps.
    3.1 Compute the difference measure e between HS′ and HD according to Eq. (4) described above.
    3.2 Record the value e.
Step 4. If the images in DB are not exhausted, go to Step 3; otherwise, continue.
Step 5. Select the image in DB which has the minimum difference measure e and take it as the desired target image T.
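A minimal sketch of Eq. (4) and the selection loop of Algorithm 2 follows, reusing the h_histogram helper sketched after Algorithm 1. Treating the database as a list of (image, histogram) pairs and passing in the already-resized secret image are our own simplifications.

    def histogram_difference(hist_s, hist_d):
        # Difference measure e of Eq. (4): sum of absolute bin-wise differences.
        return int(np.abs(hist_s - hist_d).sum())

    def select_target_image(secret_resized, database, tile_size):
        # Core of Algorithm 2: `secret_resized` is the secret image already resized
        # to the candidate size Zc (Step 1); `database` is assumed to be a list of
        # (candidate_image, candidate_histogram) pairs produced by Algorithm 1.
        hist_s = h_histogram(secret_resized, tile_size)
        best_image, _ = min(database, key=lambda entry: histogram_difference(hist_s, entry[1]))
        return best_image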
                                                                         result of using such a greedy algorithm to fit the tile images
2.3 Algorithm for Secret-fragment-visible Mosaic Image Creation

Before presenting the algorithm for creating the proposed mosaic images, we discuss some problems encountered in the creation process and present the solutions we propose for them.

A. Problem of fitting tile images optimally and proposed solution

The first problem faced in the creation process is how to find an optimal solution for fitting a tile image of the secret image into an appropriate target block in a target image selected by Algorithm 2. At first sight, it seems that we can reduce it to a single-source shortest path problem. The shortest path problem is one of finding a path in a graph with the smallest sum of between-vertex edge weights. The state of fitting a tile image may be represented by a vertex of the graph, and the action of selecting the most similar tile image for each target block may be represented by an edge of the graph, with its label taken to be that of the tile image and its weight taken to be the average Euclidean distance between the pixels' colors of the selected tile image and those of the target block. Accordingly, we can build a tree structure as the graph for this problem, as shown by Fig. 2.

Figure 2. A tree structure of fitting tile images to target blocks.

In order to find the optimal solution, we may utilize Dijkstra's algorithm, whose running time for obtaining an optimal answer is O(|V|²), where |V| denotes the number of vertices in the tree. Unfortunately, according to Fig. 2, the number of vertices in this problem is Σ_{n=1}^{N−1} [(N−1)!/n!], where N is the number of target blocks, which is larger than 40,000 for the images used in this study; the computation time for obtaining an optimal solution for such a large N is obviously too high to be practical. This means that we have to find other feasible solutions to this problem.

The solution we propose is to use a greedy algorithm. We calculate the average Euclidean distance between the pixels' colors of a tile image T and those of a target block B as the similarity measure between T and B, and then use this measure as the selection function of the greedy algorithm to select the most similar target block for tile-image fitting. However, as shown by the example of Fig. 4(a), which is the result of using such a greedy algorithm to fit the tile images of the secret image, Fig. 3(a), into the target image, Fig. 3(b), the algorithm is found unsatisfactory, often yielding a result with the lower part of the target image filled with fragment pieces of inappropriate colors. This phenomenon comes from the fact that the number of tile images obtained from the secret image, Fig. 3(a), is limited by the secret image's own size, so that the tile images available for fitting the target blocks in Fig. 3(b) become fewer and fewer near the end of the fitting process. As a result, the similarity differences between the later-fitted tile images and the chosen target blocks become much bigger than those of the earlier-fitted ones, yielding a poorly-fitted bottom part like that shown in Fig. 4(a).

A solution to this problem found in this study is to use the previously proposed h-feature to define the selection function of the greedy algorithm. This feature takes the global color distribution of an image into consideration, which helps create a mosaic image with its content resembling the target image more effectively, as shown by
the example of Fig. 4(b), which is an improvement over Fig. 4(a).

Figure 3. Input images. (a) A secret image. (b) A selected target image.

Figure 4. Resulting images using different similarity measures. (a) Image created using the Euclidean distance to define the selection function of the greedy algorithm. (b) Image created using the h-feature to define the selection function of the greedy algorithm.

B. Problem of small-sized candidate image database and proposed solution

A second problem faced in the mosaic image creation process is how to deal with a database which is not large enough. This problem will cause an insufficiently similar image to be selected from the database as the target image for a given secret image. As a result, the created mosaic image will look unlike the target one, as shown by the example of Fig. 6(a), a mosaic image created with Figs. 5(a) and 5(b) as the secret and target images, respectively.

To solve this problem, during the candidate image selection process, after the difference measure between a secret image and a candidate image is computed, if the computed value is large, the selected target image is regarded as inappropriate for the creation process. In this case, we enlarge the size of the selected target image as a remedy. The reason is that if the size of the target image is larger than that of the secret image, the number of target blocks, or equivalently, the number of possible positions to fit each tile image, becomes larger, yielding in general a better fitting result. In this way, the resulting mosaic image becomes visually better than before, as shown by the example of Fig. 6(b).

Figure 5. Input images. (a) A secret image. (b) A selected target image.

Figure 6. Resulting images. (a) Image created without the proposed remedy method, which is four times as large as (b). (b) Image created with the proposed remedy method.

C. Algorithm for secret-fragment-visible mosaic image creation

According to the above discussions, an algorithm for the creation of the proposed secret-fragment-visible mosaic images is described in the following.

Algorithm 3: mosaic image creation.
Input: a secret image S, a database DB, and a selected size Zt of a tile image.
Output: a secret-fragment-visible mosaic image R.
Steps:
Stage 1: embedding secret image fragments into a selected target image.
Step 1. Crop S to yield an image S′ whose size is divisible by Zt.
Step 2. Perform the following steps to select a target image T from DB.
    2.1 Select a candidate image as T from DB by Algorithm 2.
    2.2 If the difference measure e computed in Step 3.1 of Algorithm 2 is larger than a pre-selected threshold Th, then enlarge the size of T by ⌈e/Th⌉ times.
Step 3. Obtain a block-label sequence L1 of S′ by calculating and sorting the h-feature values of all the tile images in S′.
Step 4. Obtain a block-label sequence L2 of T by calculating and sorting the h-feature values of all the target blocks in T.
Step 5. Fit the tile images of S′ to the target blocks of T based on the one-to-one mappings from the ordered labels of L1 to those of L2, thus completing the embedding of all the tile images in S′ into the target blocks of T according to the greedy criterion.
Stage 2: dealing with unfilled target blocks.
Step 6. Perform the following steps to fill each remaining unfilled target block, B, in T, if there is any.
    6.1 Compute the difference e′ between the h-feature hB of B and the h-feature hA of each of the tile images, A, in S′ by the following equation:

        e′ = |hB − hA|.    (5)

    6.2 Pick out the tile image Ao with the smallest difference eo′ and compare eo′ with another pre-selected threshold Th′ to conduct either of the following two operations:
        A. if eo′ ≤ Th′, then fill the tile image Ao into the target block B;
        B. if eo′ > Th′, then fill the averages of the R, G, and B values of all the pixels in B into B.
Stage 3: generating the desired mosaic image.
Step 7. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T.
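The heart of Steps 3-5 above is to sort both block sets by their h-feature values and pair them off in order. The sketch below shows this one-to-one fitting; the list-of-blocks layout and the h_feature helper from the earlier sketch are our own conventions, not the paper's notation.

    def fit_tiles_by_h_feature(tile_blocks, target_blocks):
        # Steps 3-5 of Algorithm 3: h-sort the labels of both block sets and map
        # them one-to-one in sorted order. Returns `mapping` where mapping[j] = i
        # means tile image i is placed at target block j; target blocks left as
        # None are the "unfilled" blocks handled by Step 6.
        l1 = sorted(range(len(tile_blocks)), key=lambda i: h_feature(tile_blocks[i]))
        l2 = sorted(range(len(target_blocks)), key=lambda j: h_feature(target_blocks[j]))
        mapping = [None] * len(target_blocks)
        for tile_label, block_label in zip(l1, l2):
            mapping[block_label] = tile_label
        return mapping

Target blocks left unmapped by the pairing correspond to the unfilled blocks treated in Step 6, each of which is then either filled with the tile image of the closest h-feature (Eq. (5)) or painted with its own average color, depending on the threshold Th′.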

2.4 Experimental Results of Mosaic Image Creation

Some mosaic images generated by the above algorithm are shown in Figs. 7 and 8. Note that in either figure, the secret image of (a) may be thought of as having been embedded into the target image of (b) to yield the stego-image of (c). The database used in running the algorithm includes 841 candidate images. The size of this database is regarded as large enough because the remedy measure of target image enlargement was rarely needed in the mosaic image creation process in our experiments.

Figure 7. An example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic image.

Figure 8. Another example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic image.

III. COVERT COMMUNICATION VIA SECRET-FRAGMENT-VISIBLE MOSAIC IMAGES

3.1 Idea of Proposed Covert Communication Method

In the proposed mosaic image creation process, tile images with the same h-feature values appear to have similar colors. Each tile image is fitted into a corresponding target block based on the one-to-one mappings established between the two label sequences of the secret image and the selected target image. Note that both sequences have been sorted according to the h-feature values of their image blocks; they are said to have been h-sorted. As a result of such h-sorting, every pair of neighboring labels in either sequence specifies two image blocks with similar h-feature values, implying that the average colors of the two blocks are essentially visually similar.

The main idea of secret embedding in the proposed covert communication method is to switch the orders of the target blocks in the h-sorted label sequence of the target image during the mosaic image creation process to embed message bits, thus achieving the goal of hiding a secret message in a secret-fragment-visible mosaic image imperceptibly.

More specifically, after the label switching, if a leading label is smaller than the following one in the target block label sequence, then a bit "0" is regarded to have been embedded in the two neighboring labels; otherwise, a bit "1" is regarded as embedded there. Furthermore, as shown by the example of Fig. 9, because the tile images which
correspond to the target blocks with switched labels have similar average colors, as mentioned previously, no visually perceptible difference will arise in the resulting mosaic image after the message is embedded.

Figure 9. Label switching and the corresponding target block exchange. (a) The original one. (b) After switching the corresponding target blocks of tile images.
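To make the bit convention of Fig. 9 concrete, a pair of neighboring labels carries a "0" when the leading label is the smaller of the two and a "1" otherwise, so embedding reduces to an optional swap. A two-function sketch of this rule (our own illustration, not the paper's notation) is given below.

    def embed_bit_in_pair(l1, l2, bit):
        # Order the pair so that it encodes `bit`: ascending for "0", descending for "1".
        return (min(l1, l2), max(l1, l2)) if bit == 0 else (max(l1, l2), min(l1, l2))

    def extract_bit_from_pair(l1, l2):
        # Read the bit back from the pair order.
        return 0 if l1 < l2 else 1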
                                                                                             Step 6. Transform the message to be embedded, M, into a
3.2 Modified Secret-fragment-visible Mosaic Image Creation Process for Secret Message Embedding

In the proposed covert communication method, the mapping of the labels of the tile images to those of their corresponding target blocks is recorded in a recovery sequence LR for use in later data recovery. An illustration is shown in Fig. 10. Embedding of LR is then accomplished by hiding the labels of LR into randomly chosen tile images by the lossless LSB-modification scheme [11] controlled by a secret key. The detailed algorithm for secret message embedding, a modified version of Algorithm 3, is given as follows.

Figure 10. An illustration of the generation of a recovery sequence LR.

Algorithm 4: embedding a message into a secret-fragment-visible mosaic image.
Input: a secret image S, a secret key K, the size Zt of tile images, a database DB, and a secret message M.
Output: a secret-fragment-visible mosaic image R into which M is embedded.
Steps:
Stage 1: embedding secret image fragments into a selected target image.
Step 1. Crop S to yield S′ with a size divisible by Zt.
Step 2. Select a target image T for S′ with histogram H from the database DB by Algorithm 2.
Step 3. Perform Steps 3 to 4 of Algorithm 3 to obtain the h-sorted label sequences L1 and L2 of S′ and T, respectively.
Step 4. Group the labels of L1 and L2 by the following steps.
    4.1 Group the labels of L1 based on the h-feature values of the tile images in S′, with each resulting group including the labels of a set of tile images having the same h-feature value.
    4.2 Group the labels of L2 based on the grouping of L1 obtained in Step 4.1, resulting in groups of labels G1, G2, …, Gm, with each group Gi including the labels of a set of target blocks whose corresponding tile images have the same h-feature value.
Stage 2: embedding the secret message M.
Step 5. Generate the histogram H of the h-feature values of all the tile images in the resized secret image S′.
Step 6. Transform the message to be embedded, M, into a bit string M′.
Step 7. Perform the following steps to embed the bits of M′ into L2.
    7.1 Select the smallest unprocessed h-feature value hi whose histogram value H(hi) is larger than or equal to two.
    7.2 Take out the group Gi of labels in L2 corresponding to the h-feature value hi.
    7.3 Take out the first two unprocessed labels l1 and l2 in Gi, and switch the order of l1 and l2 in L2 if either of the following two conditions is satisfied, assuming that the first unembedded bit in M′ is denoted as b:
        A. b = 0 and l1 > l2;
        B. b = 1 and l1 < l2.
    7.4 Repeat Step 7.3 until Gi includes at most one unprocessed label, which is left untouched.
    7.5 Repeat Steps 7.1 through 7.4 until the bits of M′ are exhausted.
Step 8. Form an extra string of eight "0" bits as the ending signal of the input message M, and embed it into L2 by Step 7 above.
Step 9. Fit the tile images of S′ into the target blocks of T based on the one-to-one mappings from the labels of L1 to those of the re-ordered L2 obtained in Steps 7 and 8 (denoted as L2′ subsequently), and let the resulting image be denoted as T′.
Stage 3: dealing with unfilled target blocks, generating and embedding the recovery sequence, and generating the desired mosaic image.
Step 10. Perform Step 6 of Algorithm 3 to fill each of the remaining unfilled target blocks, if there is any.
Step 11. Sort all the labels in L1 by their corresponding h-feature values, re-order accordingly the corresponding labels in L2′, take the re-ordering result as a recovery sequence LR, and transform it into a binary string.
Step 12. Embed the width and height of S′ as well as the
size Zt into the first ten pixels of image T′ in a raster-scan order by the LSB modification scheme.
Step 13. Embed the data of LR by the same scheme into unprocessed tile images of T′ randomly selected by the secret key K.
Step 14. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T′.
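Step 7 of Algorithm 4 can be summarized in a few lines of Python: walk over the label groups in increasing order of their h-feature value and, inside each group, re-order consecutive label pairs so that each pair encodes one bit. The group layout (a list of lists) and the embed_bit_in_pair helper from the previous sketch are our own conventions.

    def embed_message_in_groups(l2_groups, message_bits):
        # Step 7 of Algorithm 4: embed the bits by pairwise label switching inside
        # each group, processing groups in increasing order of their h-feature value.
        # `l2_groups` is a list of label lists G1, G2, ... and is modified in place;
        # `message_bits` already ends with the eight-"0" end signal of Step 8.
        bits = iter(message_bits)
        for group in l2_groups:                       # Steps 7.1/7.2: next usable group
            for k in range(0, len(group) - 1, 2):     # Step 7.3: consecutive label pairs
                try:
                    b = next(bits)
                except StopIteration:
                    return                            # Step 7.5: all bits embedded
                group[k], group[k + 1] = embed_bit_in_pair(group[k], group[k + 1], b)
            # Step 7.4: a group of odd size leaves its last label untouched

Groups holding fewer than two labels simply contribute nothing, which mirrors the requirement H(hi) ≥ 2 in Step 7.1.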
                                                                                         the labels of L1 based on the h-feature values of the
3.3 Secret Extraction Process

In the proposed secret message extraction process, we extract the recovery sequence LR first and accordingly retrieve the original secret image S. Also, by calculating the h-feature values of the original secret image, we regain the h-feature values of the tile images and sort them to get the h-sorted label sequence L1.

Next, as illustrated by Fig. 11, the recorded sequence LR, though including only the labels of L2, essentially specifies the one-to-one mappings between the tile images and the target blocks. Therefore, we may regain the h-sorted label sequence L2 of the target blocks from the corresponding mappings from L1 to LR. Then, with the histogram H of the h-feature values of all the tile images, we may group the labels of sequences L1 and L2, and then examine the orders of the labels of L2 to extract the embedded secret message, in a way reverse to the message embedding process described in Algorithm 4.

Figure 11. An illustration of the regaining of the label sequence L2.

Algorithm 5: secret image recovery and secret message extraction.
Input: a secret-fragment-visible mosaic image R, and a secret key K identical to that used in Algorithm 4.
Output: a recovered secret image S, and the secret message M supposedly embedded in R.
Steps:
Stage 1: retrieving the secret image S.
Step 1. Retrieve the width and height of S′ as well as the size Zt of the tile images from the LSBs of the first ten pixels of image R.
Step 2. Extract the recovery sequence LR from the LSBs of blocks in R randomly selected using the secret key K.
Step 3. Compose the desired secret image S based on the sequence LR by extracting the tile images fitted in R in order and placing them at the correct relative positions.
Stage 2: regaining the h-sorted label sequences.
Step 4. Get the h-sorted label sequence L1 of the tile images of the recovered secret image S, and group the labels of L1 based on the h-feature values of the tile images, with each resulting group including the labels of the tile images having the same h-feature value.
Step 5. Perform the following steps to get the h-sorted label sequence L2 of the target image T.
    5.1 Get the re-ordered block sequence QT of the target blocks in T by one-to-one mapping of the labels in sequence L1 to those of LR.
    5.2 Get a new h-sorted label sequence L2 from the labels of the re-ordered block sequence QT.
    5.3 Group the labels of L2 based on the grouping of L1 conducted in Step 4, with each group Gi including the labels of the target blocks whose corresponding tile images have the same h-feature value hi.
Stage 3: extracting the embedded secret message M.
Step 6. Generate the histogram H of the h-feature values of all the tile images in the recovered secret image S.
Step 7. Perform the following steps to extract the bits of the secret message M.
    7.1 Select the smallest h-feature value hi whose histogram value H(hi) is larger than or equal to two.
    7.2 Take out the group Gi of labels in L2 corresponding to hi.
    7.3 Take out the first two unprocessed labels l1 and l2 in Gi, extract a hidden message bit b by the following rule, and append it to the end of a bit version of the message, denoted as D:
        A. if l1 < l2, then set b = 0;
        B. if l1 > l2, then set b = 1.
    7.4 Repeat Step 7.3 until Gi includes at most one unprocessed label, which is then left untouched.
    7.5 Repeat Steps 7.1 through 7.4 if the 8-bit end signal has not been extracted (i.e., if the last extracted 8 bits are not a sequence of eight "0"s).
Step 8. Transform every 8 bits of D into characters to obtain the desired secret message M.
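The bit-extraction loop of Step 7 mirrors the embedding loop; a sketch using the same list-of-groups layout and the extract_bit_from_pair helper introduced earlier (both our own conventions) is shown below.

    def extract_message_bits(l2_groups):
        # Step 7 of Algorithm 5: read one bit from each consecutive label pair,
        # group by group, until the eight-"0" end signal appears.
        bits = []
        for group in l2_groups:
            for k in range(0, len(group) - 1, 2):
                bits.append(extract_bit_from_pair(group[k], group[k + 1]))
                if len(bits) >= 8 and bits[-8:] == [0] * 8:
                    return bits[:-8]                  # drop the end signal (Step 7.5)
        return bits                                    # end signal not found

Every eight recovered bits are then packed into one character, as in Step 8.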
3.4 Experimental Results

An example of experimental results obtained using Algorithms 4 and 5 is given in Fig. 12. The average difference measure value at the block level between Fig. 8(c) and Fig. 12 (computed as the sum of all the Euclidean distances divided by the number of blocks) is 0.05, and the PSNR of Fig. 12 with respect to Fig. 8(c) is 66.6 dB, which is quite satisfactory, meaning that the proposed information hiding method (implemented by Algorithms 4 and 5) provides a good effect for covert communication.
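The PSNR figure quoted above follows the standard definition for 8-bit images; for completeness, a short sketch (assuming 8-bit RGB inputs of equal size) is given below.

    import numpy as np

    def psnr(img_a, img_b):
        # Peak signal-to-noise ratio (in dB) between two 8-bit images of equal size.
        mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)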
Figure 12. An example of covert communication. (a) A mosaic image into which messages are hidden. (b) Resulting image and extracted messages using the right key. (c) Resulting image and extracted messages using a wrong key.

IV. CONCLUSIONS AND SUGGESTIONS

A new type of art image, the secret-fragment-visible mosaic image, and a data hiding technique have been proposed for secret image hiding and covert message communication, respectively. For the former, we have proposed a new 1-D h-colorscale to represent the color distribution of an image more effectively, based on which a new h-feature is proposed for measuring image similarity. A greedy algorithm is proposed accordingly for fitting the tile images of the secret image into appropriate target blocks more efficiently. A remedy method has also been proposed to solve the problem of using a small-sized database, which enlarges a selected target image in proportion to the difference measure between the secret and the target images. For the proposed data hiding method used in covert communication via secret-fragment-visible mosaic images, it was observed that the tile images in an identical histogram bin have similar colors. By switching the relative positions of the target blocks corresponding to such tile images, we can embed secret message bits into a secret-fragment-visible mosaic image imperceptibly.

Future works may be directed to allowing users to select target images freely to create secret-fragment-visible mosaic images. This seems achievable by applying a reversible color shifting technique to fit the color distribution of the secret image to that of a selected target image.

REFERENCES

[1] P. Haeberli, "Paint by numbers: abstract image representations," Proc. SIGGRAPH '90, Dallas, USA, 1990, pp. 207-214.
[2] A. Hausner, "Simulating decorative mosaics," Proc. 2001 International Conf. on Computer Graphics & Interactive Techniques (SIGGRAPH 01), Los Angeles, USA, August 2001, pp. 573-580.
[3] Y. Dobashi, T. Haga, H. Johan and T. Nishita, "A method for creating mosaic image using voronoi diagrams," Proc. 2002 European Association for Computer Graphics (Eurographics 02), Saarbrucken, Germany, September 2002, pp. 341-348.
[4] J. Kim and F. Pellacini, "Jigsaw image mosaics," Proc. 2002 International Conf. on Computer Graphics & Interactive Techniques (SIGGRAPH 02), San Antonio, USA, July 2002, pp. 657-664.
[5] G. D. Blasi, G. Gallo and M. Petralia, "Puzzle image mosaic," Proc. 2005 Int'l Association of Science & Technology for Development on Visualization, Imaging & Image Processing (IASTED/VIIP 2005), Benidorm, Spain, Sept. 2005.
[6] W. L. Lin and W. H. Tsai, "Data hiding in image mosaics by visible boundary regions and its copyright protection application against print-and-scan attacks," Proc. 2004 Int'l Computer Symp. (ICS 2004), Taipei, Taiwan, Dec. 15-17, 2004.
[7] C. C. Wang and W. H. Tsai, "Creation of tile-overlapping mosaic images for information hiding," Proc. 2007 Nat'l Computer Symp., Taichung, Taiwan, Dec. 20-21, 2007, pp. 119-126.
[8] S. C. Hung, D. C. Wu and W. H. Tsai, "Data hiding in stained glass images," Proc. 2005 Int'l Symp. on Intelligent Signal Processing & Communications Systems, Hong Kong, June 2005, pp. 129-132.
[9] C. Y. Hsu and W. H. Tsai, "Creation of a new type of image - circular dotted image - for data hiding by a dot overlapping scheme," Proc. 2006 Conf. on Computer Vision, Graphics & Image Processing, Taoyuan, Taiwan, Aug. 13-15, 2006.
[10] C. P. Chang and W. H. Tsai, "Creation of a new type of art image - tetromino-based mosaic image - and protection of its copyright by losslessly-removable visible watermarking," Proc. 2009 Nat'l Computer Symp., Taipei, Taiwan, Nov. 27-28, 2009, pp. 577-586.
[11] D. Coltuc and J.-M. Chassery, "Very fast watermarking by reversible contrast mapping," IEEE Signal Processing Letters, vol. 14, no. 4, pp. 255-258, April 2007.
[12] J. R. Smith and S. F. Chang, "Tools and techniques for color image retrieval," Proc. Society for Imaging Science & Technology / SPIE (IS&T/SPIE), vol. 2670, Feb. 1995, pp. 2-7.
A Practical Design of High-Volume Steganography in Digital Videos


                                    Ming-Tse Lu, Po-Chyi Su and Ying-Chang Wu
                                Dept. of Computer Science and Information Engineering
                                             National Central University
                                                   Jhongli, Taiwan
                                           Email: pochyisu@csie.ncu.edu.tw


Abstract—In this research, we exploit the large volume of audio/video data streams in compressed video clips/files for effective steganography. Observing that most widely distributed video files employ H.264/AVC and MPEG AAC for video/audio compression, we examine the coding features in these data streams to determine good choices of data modifications for reliable and acceptable information hiding, in which the perceptual quality, compressed bit-stream length, payload of embedding, effectiveness of extraction and efficiency of execution are taken into account. Experimental results demonstrate that the payload of the selected features, chosen to achieve a good balance among several constraints, can be more than 10% of the compressed video file size.

Keywords-Steganography; H.264/AVC; MPEG; AAC; information hiding;

I. INTRODUCTION

Digital videos are widely available nowadays thanks to the fast advances of increasingly cheaper yet powerful computer facilities and broadband Internet technologies. It is now possible to stream high-quality videos on the Internet, and web sites such as YouTube, Yahoo! Video and DailyMotion offer free video viewing, sharing or downloading services. Watching videos anytime and anywhere is becoming a daily activity as portable devices grow more and more popular. As a result, digital videos are ubiquitous and will be the major circulated multimedia content. Due to the large volume of digital videos, data compression is usually applied to facilitate their transmission and storage. Since human perceptual models are not perfect, lossy compression is usually preferred to increase the coding efficiency of digital videos without affecting human perception. In other words, there exists certain redundancy in digital video files. Nevertheless, from the viewpoint of communication, this redundancy can serve as an "invisible" channel and, if one can make good use of it, high-volume secret communication using digital videos as a camouflage is achievable. Such secret communication is also termed "steganography", which means "covered writing", and can be applied to transmit sensitive information between trusted parties or when encryption is not allowed or not safe in the normal communication channel. There are a few requirements in steganography, including a high payload of hidden information, unobtrusiveness of the distortion, security and reliability. To achieve secure, high-volume and reliable covert communication, digital videos can serve as a good host, especially since these files are available to most people and their transmission is increasingly popular. This research aims at developing a steganographic scheme for popular digital video files.

H.264/AVC is the state-of-the-art video codec, and its decent coding performance has made it the major coding mechanism in various applications. The most popular digital video formats/containers for file sharing nowadays, including FLV (Flash Video), MKV (Matroska Multimedia Container), AVI (Audio Video Interleave), MP4, etc., support H.264/AVC, so we choose H.264/AVC as the host. In addition, since FLV has become very popular in file sharing these days, we wrap the resulting H.264/AVC video bit-stream into an FLV file for future usage. As FLV files contain both video and audio data streams for playback, we will make use of both video and audio data streams to embed as much secret information as possible. The chosen audio format is MPEG AAC, which is usually adopted by FLV files.

Two embedding scenarios may be considered in this application. First, a user may acquire a compressed video file that is not coded by H.264/AVC, e.g. an MPEG-2 or MPEG-4 related file. In order to embed the information, this video file will be transcoded into an H.264/AVC bit-stream so that the secret information can be embedded during the encoding process. The resultant H.264/AVC bit-stream will then become the video stream of an FLV file. If the input video is already an FLV file compressed by H.264/AVC, the information may be embedded more efficiently since the existing coding parameters in the original video file can be referenced. It should be noted that both embedding procedures are carried out in the encoding phase, so transcoding is always needed. To achieve high-volume information hiding and to retain the fidelity of the audio/visual data, we investigate combinations of embedding methods to satisfy most of the requirements or restrictions. The paper is organized as follows. Some previous works are described in Sec. II and our proposed scheme is delineated in Sec. III. Experimental results are shown in Sec. IV to validate the trade-offs that we make among several different requirements. The conclusive remarks are given in Sec. V.

II. REVIEW OF THE RELATED WORKS

Unlike digital watermarking, in which the embedded information should be able to withstand some common processing attacks such as re-compression at a different bit-rate, random video frame dropping, resizing, etc.,
the high-volume steganography emphasizes more on the                                       III. T HE P ROPOSED S CHEME
payload, reliability and the difficulty of detection even              A. System Overview
with steganalysis [1], which is a process for revealing the
                                                                         The block diagram of our proposed scheme is shown in
existence of certain hidden information in a suspicious
                                                                      Fig. 1. A video file is parsed first to extract the video and
video. Of course, the quality should always be main-
                                                                      audio data streams. As the transcoding will be applied, the
tained to avoid affecting its applications. Some data hiding
                                                                      video and audio decoder will extract the compressed bit-
schemes in digital videos [2]–[7] have been proposed. As
                                                                      streams into raw data. After obtaining the reconstructed
the video file consists of a large number of frames, the
                                                                      video and audio signals, H.264/AVC and AAC encoders
similar data hiding techniques of still images may also
                                                                      will encode the raw data right after they are extracted and
be applied on videos. The most widely used technique
                                                                      the embedding procedure is triggered. If the input video
to hide the data in digital images or video data is the
                                                                      file is processed by H.264/AVC already, a mode copying
usage of the Least-Significant Bit (LSB) modification [8],
                                                                      procedure that records features of the original video stream
[9], in which the LSB’s of samples, usually coefficients or
                                                                      may be applied to speed up the whole process. The hidden
quantization indices if the compressed data are used, are
                                                                      information can be extracted efficiently in the decoder.
substituted by the secret message bits. JStego, F3, F4 and
F5 described in [8] are popular approaches. In the original                                            ¤£¢¡ 
JStego algorithm, the LSB’s of JPEG residual coefficients                                           ©¨§¦¥             ¢ ¨ ¨

are overwritten with the binary secret message consisting                                           ¨¡¦


of “0” and “1”. JStego skips the embedding operation
                                                                                                                                        CD
                                                                                           32©1¨§©0                                    EFG

when it encounters 0 and ±1 to avoid generating zero                                                           ¢ ©¨§¦¥
                                                                                                                                    P
                                                                                                                                        HI


values, which will cause ambiguity to the hidden infor-
mation extraction. Other values are grouped into pairs, i.e.
                                                                                 ¨$©(                 ©¨§¦¥ !        ©¦§'% %%
                                                                                £¨$'¢¨                $¨§©#¨           $¨§©#¨
(±2, ±3), (±4, ±5)... In F3 algorithm, the LSB of non-                         ©¦¢)$©¦
                                                                                                                                                   ¦§§¨A)4


zero coefficients will be matched with the secret message
after the information embedding, which decreases the                                                   ©¨§¦¥ !        ©¦§'% %%
absolute values of coefficients. If the coefficient becomes                           ¦££%1¨§©0          $¨§©#4          $¨§©#4                   ¨$#¨(

zero after this modification operation, we will embed                                                                                               ¨¢££¨0


this bit once again in the next sample. F4 algorithm is                                                        ©¨§¦¥ ¤£¢¡ 
developed to complement the weakness of F3 algorithm. In                                                         $¨'0
                                                                                 986765
F4 algorithm, a negative coefficient is presented inversely.                    @ ¡©$©

In F5 algorithm, permutating straddling process is adopted                     @    ¢¢                                                      RQ
                                                                                                               B¨¡¦ © §4

to improve the perceptive characteristic and enhance the
security level. In addition, the so-called matrix coding is
                                                                               Figure 1.        The flowchart of the proposed scheme
applied to avoid modifying too many data samples.
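To make the baseline concrete, the following minimal Python sketch (our own illustration, not code taken from [8] or [9]) overwrites the LSB of the magnitude of each non-zero sample with a message bit; it also shows why plain substitution is problematic when a magnitude of 1 is driven to 0, which is exactly the ambiguity that JStego sidesteps by skipping 0 and ±1.

def lsb_substitute(coefficients, message_bits):
    """Baseline LSB substitution on quantized coefficients (illustrative only)."""
    bits = iter(message_bits)
    stego = list(coefficients)
    for i, c in enumerate(stego):
        if c == 0:
            continue                      # zero coefficients are not used as carriers
        b = next(bits, None)
        if b is None:
            break                         # message exhausted
        sign = 1 if c > 0 else -1
        mag = (abs(c) & ~1) | b           # force the LSB of the magnitude to the bit
        # note: abs(c) == 1 and b == 0 yields 0 here; this is exactly the
        # ambiguity that JStego avoids by skipping 0 and +/-1 altogether
        stego[i] = sign * mag
    return stego

print(lsb_substitute([5, -3, 0, 2], [0, 1, 1]))   # -> [4, -3, 0, 3]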
In addition to modifying the coefficients, some researchers employ the characteristics of popular video compression standards. Fang and Chang [4] proposed to embed the data into the phase angles of the motion vectors in inter frames. Wang et al. [10] utilized the motion vectors in P- and B-pictures as the data carriers for hiding copyright information: a motion vector is selected based on its magnitude, and its angle guides the modification operation. Yang et al. [11] employed the intra-prediction modes and matrix coding; they map every two secret message bits onto three intra 4×4 blocks via matrix coding. Kim et al. [1] proposed an entropy-coding-based watermarking algorithm to balance the capacity of the watermark bits against the fidelity of the video: one bit of information is embedded in the sign bit of the trailing ones in the context-adaptive variable length coding (CAVLC) of the H.264/AVC stream. The transcoding process may thus be avoided, but drift errors resulting from the different reference frame content may appear.

In this paper, we try to make good use of the coding features in the video and audio data streams to maximize the capacity of the embedded data. Our work can be applied to any container format that can de-multiplex an H.264/AVC video stream and an AAC audio stream.

III. THE PROPOSED SCHEME

A. System Overview

The block diagram of our proposed scheme is shown in Fig. 1. A video file is parsed first to extract the video and audio data streams. As transcoding will be applied, the video and audio decoders decode the compressed bit-streams into raw data. After the reconstructed video and audio signals are obtained, the H.264/AVC and AAC encoders encode the raw data, and the embedding procedure is triggered during this encoding. If the input video file is already coded with H.264/AVC, a mode-copying procedure that records features of the original video stream may be applied to speed up the whole process. The hidden information can be extracted efficiently in the decoder.

Figure 1. The flowchart of the proposed scheme.
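All of the embedding hooks described below, in both the H.264/AVC and the AAC encoder, consume the secret message one bit at a time (the GetNextBit() of Algorithm 1 and of the mode and MVD embedding). A minimal, self-contained sketch of such a shared bit supply is given below; the class name and methods are hypothetical illustrations of ours, not part of the actual implementation.

class BitSource:
    """Serves the secret payload bit by bit to the video and audio embedders."""
    def __init__(self, payload: bytes):
        # LSB-first expansion of the payload into individual bits
        self._bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
        self._pos = 0

    def next_bit(self):
        """Return the next bit, or None once the message is exhausted."""
        if self._pos >= len(self._bits):
            return None
        bit = self._bits[self._pos]
        self._pos += 1
        return bit

    def bits_left(self):
        return len(self._bits) - self._pos

# one source can be shared by the H.264/AVC and AAC embedding hooks
source = BitSource(b"secret")
print(source.next_bit(), source.bits_left())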
B. Steganography in H.264/AVC

H.264/AVC offers much better compression performance than earlier video codecs due to its various encoding tools. Like previous video coding standards, H.264/AVC is based on motion-compensated, DCT-like transform coding. Each picture is compressed by partitioning it into one or more slices; each slice consists of macroblocks, which are blocks of 16 × 16 luma samples with the corresponding chroma samples. Each macroblock may also be divided into sub-macroblock partitions for motion prediction. The prediction partitions can have seven different sizes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4. The large variety of partition shapes and the quarter-sample motion compensation provide enhanced prediction accuracy. In intra-coded slices, 4 × 4 or 16 × 16 intra spatial prediction based on neighboring decoded pixels in the same slice is applied. The 4 × 4 spatial transform, which is an approximate DCT and can be implemented with integer operations using a few additions and shifts, is calculated for the residual data. The point-by-point multiplication of the transform step is combined with the quantization step and implemented by simple shifting operations for efficiency. CAVLC or CABAC is used for lossless coding. Our video embedding scheme is integrated into the H.264/AVC encoding process: the quantization, intra prediction and motion estimation procedures in the encoder are modified.

1) Employing Intra Prediction Modes: In H.264/AVC, the intra prediction of the luma and chroma of a frame is quite important for reducing the coding redundancy, since a coding block is usually related to its neighbors. Four 16×16 or nine 4×4 intra prediction modes can be applied to the luma, while four 8 × 8 prediction modes are available for the chroma. Fig. 2(a) shows the 4×4 intra prediction. Since the samples above and to the left (labeled A to M) of the current block have been encoded and reconstructed previously and are available to both the encoder and the decoder, nine prediction modes, i.e., eight directional modes and one DC prediction, can be calculated. It should be noted that, if the neighboring upper or left block of the current block is not available, the number of usable modes is reduced. For instance, if the upper block is available but the left block is not, only the "horizontal", "DC" and "horizontal up" modes can be chosen.

Figure 2. (a) Labeling of prediction samples (4×4) and (b) the directions of 4×4 intra prediction modes.

Our proposed scheme only utilizes the nine 4 × 4 intra prediction modes, since the content of such blocks is usually complicated and the blocks are therefore suitable for information hiding. Compared with the 16 × 16 luma and 8 × 8 chroma prediction modes, the 4 × 4 intra prediction modes offer finer prediction, so modifying them affects the coding performance less. To embed the information, one may think of grouping the nine modes into pairs so that one bit can be cast into each 4×4 subblock. However, the resulting bit-stream length would be considerably increased. Take the "Container" video compressed with a fixed Quantization Parameter (QP) of 30 as an example: by doing so, the payload reaches 2.07% of the total bit-stream size, but the bit-stream is inappropriately enlarged by 6.72%. The reason is that the correlation between the intra prediction modes of adjacent blocks is not taken into account. In H.264/AVC, the mode of the current block is first predicted as the minimum of the prediction modes of its two neighbors, i.e., the upper and left blocks. If the actual mode matches the predicted one, only one flag bit, the "Most Probable Mode" (MPM) flag, is asserted and sent. Otherwise, this flag bit is set to "0" and three extra bits are sent to signal which of the remaining eight modes is used. We only modify the modes when the flag bit is "0", since such a block tends to differ more from its neighbors and is thus suitable for embedding. Besides, modes coded via the MPM also appear in normal videos, so we have to keep this situation in our "stego" video. Our scheme divides the eight remaining modes into two groups to represent the binary secret information; the division or classification follows Fig. 2. If the DC mode is not the MPM, we replace the direction of the MPM by the DC mode and then assign "0" and "1" to the prediction directions, which are known to both the embedder and the detector. Rate-Distortion Optimization (RDO) is then employed to determine a better prediction mode. Although the payload becomes 0.79% of the compressed bit-stream size, the increment of the file size is less than 3%. If the input video is already an H.264/AVC video stream, we may reference the prediction modes in the original video and determine the pairs of modes representing "0" and "1" directly. If the execution-time constraint is not that strict, we still suggest applying RDO to find a better mode, since it is not easy to predict a good selection based only on the mode in the incoming/original video. Besides, the computational load is not increased much by RDO, since we only have four candidate selections to embed one bit.
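The following sketch illustrates how such a grouping could be realized; it is our own simplified reading of the description above, with a hypothetical partition of the eight directional modes and a toy stand-in for the rate-distortion cost, not the exact tables used in the implementation.

DC = 2  # mode 2 is the DC prediction among the nine 4x4 intra modes

def candidate_modes(mpm, secret_bit, groups):
    """Four candidate modes that represent secret_bit. If DC is not the MPM,
    DC stands in for the MPM's direction, so the MPM itself is never chosen
    (choosing it would assert the MPM flag and hide the 3-bit mode field)."""
    group = list(groups[secret_bit])
    if mpm != DC:
        group = [DC if m == mpm else m for m in group]
    return group

def embed_bit_in_block(block, mpm, secret_bit, groups, rd_cost):
    """Pick the candidate with the lowest rate-distortion cost (the RDO step)."""
    return min(candidate_modes(mpm, secret_bit, groups), key=lambda m: rd_cost(block, m))

def extract_bit(coded_mode, mpm, groups):
    """Detector side: a coded DC mode stands for the MPM's original direction."""
    m = mpm if coded_mode == DC else coded_mode
    return 0 if m in groups[0] else 1

groups = ([0, 3, 5, 7], [1, 4, 6, 8])        # hypothetical fixed partition
toy_cost = lambda blk, m: (m * 7 + 3) % 11   # stand-in for the real RD cost
mode = embed_bit_in_block(None, mpm=3, secret_bit=0, groups=groups, rd_cost=toy_cost)
print(mode, extract_bit(mode, mpm=3, groups=groups))   # recovered bit: 0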
2) Employing Inter Prediction: The inter prediction provides a reference from one or more previously encoded video frames for effective encoding. In order to acquire precise motion vectors, H.264/AVC adopts quarter-pixel precision for motion compensation; the last two bits of a motion vector therefore represent its fractional (sub-pixel) position. We basically make use of the last bit of the motion vectors for effective information hiding without severely affecting the coding performance. Since transcoding is applied, the Sum of Absolute Differences (SAD) of the candidate motion vectors whose Least Significant Bit (LSB) equals the hidden bit is available, so we can examine these candidates to find good motion vectors. Again, as with the intra prediction, the motion vectors of neighboring partitions are often highly correlated. After determining the motion vectors by motion estimation, H.264/AVC first predicts the motion vector from the nearby, previously coded partitions. After obtaining a predicted motion vector MV_predicted, the difference between the current motion vector MV_current and MV_predicted is calculated and encoded. This motion vector difference is termed MVD and is formed in the same way at the decoder. In our scheme, we actually modify the MVD data instead of the motion vectors themselves, so that the detector can extract the hidden information efficiently. Furthermore, we skip the partitions whose MVD equals 0 and avoid generating new zero MVD's. This strategy limits the file-size increment and keeps the statistics of the motion vectors looking normal. If the only motion vector with a reasonably small SAD is one whose MVD becomes zero after the information embedding, we choose this motion vector and embed the bit once again in the following partition.
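A minimal sketch of this selection, under the simplifying assumption that one bit is carried in the horizontal MVD component only (the exact component handling is not spelled out above), could look as follows; the candidate motion vectors and their SADs are assumed to come from the motion estimation.

def embed_bit_in_mvd(candidates, mv_predicted, secret_bit):
    """Choose, among the motion-estimation candidates, the vector with the
    smallest SAD whose (nonzero) horizontal MVD already carries secret_bit
    in its LSB; returns None if the bit must be retried in the next partition."""
    best = None
    for (mvx, mvy), sad in candidates:
        mvd = mvx - mv_predicted[0]           # horizontal MVD, quarter-pel units
        if mvd == 0:
            continue                          # zero MVDs never act as carriers
        if (mvd & 1) == secret_bit and (best is None or sad < best[1]):
            best = ((mvx, mvy), sad)
    return best

def extract_bit_from_mvd(mvd):
    """Detector side: nonzero MVDs carry the hidden bit in their LSB."""
    return None if mvd == 0 else (mvd & 1)

cands = [((6, 0), 120), ((7, 0), 118), ((9, 1), 150)]    # ((mvx, mvy), SAD)
sel = embed_bit_in_mvd(cands, mv_predicted=(4, 0), secret_bit=1)
print(sel, extract_bit_from_mvd(sel[0][0] - 4))          # recovered bit: 1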
It should be noted that the effect of the MVD embedding is more obvious if a fixed QP is used to compress a video. We illustrate this by compressing "Container" with QP equal to 30. Fig. 3 shows the proportion that each coding feature occupies in the compressed bit-stream. Fig. 3(a) shows that the proportion of luma components from inter blocks is 39% when nothing is embedded, while this proportion grows to 49% after the MVD embedding, as shown in Fig. 3(b). That is, a large increment appears in the residuals of the inter blocks. If the bit-stream size increment has to be strictly limited, we should avoid embedding information in the motion vectors. However, the increased residuals may be helpful for the information embedding in the quantization indices, which is discussed later.

Figure 3. Pie chart of "Container" with (a) nothing being embedded and (b) MVD being modified.

Here, we test the videos "Garden" and "Container", coded with a target bit-rate of 2 Mbps, to explain our strategy. Table I compares modifying the motion vectors (MV) directly, without considering the MVD, against the MVD embedding; the payload is given in bits per frame (bpf). In our view, the MVD embedding is still the better choice, as the video quality is less affected, although the payload decreases due to the skipping of the MVD's equal to zero, especially in a static video such as "Container".

Table I
THE PERFORMANCE COMPARISON OF MV AND MVD EMBEDDING (BIT-RATE: 2 MBPS)

  Video Name  |  MV: PSNR (dB)  |  MV: Payload (bpf)  |  MVD: PSNR (dB)  |  MVD: Payload (bpf)
  Garden      |      29.78      |        8461         |      30.56       |        5148
  Container   |      40.70      |       29372         |      43.41       |        8024

3) Quantized Coefficients Embedding: After the intra and inter predictions and the compensation, the prediction residuals are generated and occupy a large portion of the video stream. It is advantageous to utilize these residuals to achieve high-volume information hiding. In our scheme, both the luma and chroma residuals are embedded. As mentioned before, popular methodologies to achieve high-volume steganography in samples without affecting the perceptual quality of videos include the JStego, F3, F4 and F5 algorithms. In our scheme, the F4 algorithm is adopted to achieve effective information hiding in the residuals. Algorithm 1 shows the pseudo code of the F4 embedding loop. For each non-zero AC coefficient coe, if it is a positive number and its LSB is not equal to BitToEmbed, its absolute value is decreased. On the other hand, for a negative coefficient whose LSB is equal to BitToEmbed, the modification also has to be applied and the coefficient is increased by 1, so that the LSB of this negative number equals the inverted target bit. After the embedding operation, it is required to check whether the index has become 0; if so, the coefficient would be skipped by the decoder, and the bit has to be embedded again.

Algorithm 1 F4 Algorithm
Input: BitToEmbed ∈ {0, 1}
 1: for all AC values coe in a block after quantization do
 2:   if coe > 0 ∧ LSB(coe) ≠ BitToEmbed then   /* positive number */
 3:     coe ← coe − 1
 4:   else if coe < 0 ∧ LSB(coe) = BitToEmbed then   /* negative number */
 5:     coe ← coe + 1
 6:   else if coe = 0 then   /* skip zero values */
 7:     continue
 8:   end if
 9:   if coe ≠ 0 then   /* successfully embedded */
10:     BitToEmbed ← GetNextBit()
11:   end if
12: end for
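For reference, a direct Python rendering of this embedding loop, together with the matching extraction, is given below; it is an illustrative sketch with names of our own choosing rather than the encoder-integrated implementation.

def f4_embed(coefficients, bits):
    """F4 embedding over one block of quantized AC coefficients (Algorithm 1)."""
    out = list(coefficients)
    bits = iter(bits)
    bit = next(bits, None)
    for i, coe in enumerate(out):
        if bit is None:
            break                          # message exhausted
        if coe == 0:
            continue                       # zeros carry no information
        if coe > 0 and (coe & 1) != bit:
            coe -= 1                       # shrink magnitude so the LSB matches the bit
        elif coe < 0 and (coe & 1) == bit:
            coe += 1                       # shrink magnitude so the inverted LSB matches
        out[i] = coe
        if coe != 0:
            bit = next(bits, None)         # successfully embedded, fetch the next bit
        # if coe became 0 the detector will skip it, so the same bit is retried
    return out

def f4_extract(coefficients):
    """Detector: positive coefficients carry their LSB, negative ones the inverse."""
    return [coe & 1 if coe > 0 else 1 - (coe & 1) for coe in coefficients if coe != 0]

print(f4_extract(f4_embed([3, -2, 1, 0, -5], [1, 1, 0])))   # -> [1, 1, 0]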
The reason for choosing F4 is as follows. It has been reported that the existence of hidden information embedded with JStego or F3 can be revealed by checking the statistics of the samples, since these methods change the histogram of the coefficient values. Besides, the message induced by an unchanged carrier may contain more steganographic ones than zeros due to the prevalence of ±1 coefficients, and we had better keep this natural situation. F5 is assumed to be a better approach because it uses matrix coding, so that less data need to be modified. However, as we would like to embed the information during the encoding process for efficiency, we have to finish coding the data in one subblock before we proceed to encode the next subblock. In matrix encoding, we need to collect 2^m − 1 samples to embed m bits by changing only one sample. Since the prediction mechanism of H.264 performs well, many zero indices exist, and several subblocks may be required to collect 2^m − 1 nonzero samples for the information hiding. As efficient modification is one of our major objectives, F5 is not suitable. In F4, if a higher degree of safety is required, some coefficients may be skipped during embedding, provided that both the embedder and the detector know the rule. As described earlier, the information hiding by F4 has a positive side effect on the video coding, since the magnitudes of the resulting coefficients tend to become smaller. When a fixed QP is used to encode a video, the video size may even be reduced after the information embedding, and this may offset the negative effects of the information hiding in the intra and inter predictions.
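To make the matrix-coding argument concrete, the smallest instance (m = 2, i.e., two message bits hidden in 2^2 − 1 = 3 carrier LSBs with at most one change) can be sketched as follows; this is only a worked example of the idea that makes F5 efficient in principle, not part of our scheme, which deliberately avoids F5.

def matrix_encode(lsbs, message_bits):
    """Hide two bits (x1, x2) in three carrier LSBs by flipping at most one."""
    a1, a2, a3 = lsbs
    x1, x2 = message_bits
    s1 = (a1 ^ a3) ^ x1                      # syndrome: which message bit disagrees
    s2 = (a2 ^ a3) ^ x2
    flip = {(0, 0): None, (1, 0): 0, (0, 1): 1, (1, 1): 2}[(s1, s2)]
    out = list(lsbs)
    if flip is not None:
        out[flip] ^= 1                       # a single change fixes both bits
    return out

def matrix_decode(lsbs):
    a1, a2, a3 = lsbs
    return (a1 ^ a3, a2 ^ a3)                # the receiver recomputes the two bits

print(matrix_decode(matrix_encode([0, 0, 0], (1, 1))))   # -> (1, 1), one LSB flipped

The drawback noted above is visible even in this toy case: three nonzero carriers must be available before anything can be embedded, which conflicts with embedding on the fly while each subblock is being coded.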
It should be noted that, when the rate control mechanism is enabled, the QP is adjusted along with the encoding process, and the F4 algorithm may help to save some bits in the current frame so that a smaller QP can be assigned to the following frames. In addition, if the bit-stream length is not the major concern, we may try to generate more nonzero indices. As some indices may be quantized to zero because a large QP is used, we can rescue the coefficients that only barely fail to survive by using a smaller QP. For example, if QP = 28 is adopted in a block, we may try the smaller QP = 27 to see whether some zero coefficients/indices survive under this smaller QP. If so, such an index can be changed to ±1 so that more nonzero indices are available for embedding.

4) Mode-Copy Procedure: The embedding methods described above are based on a transcoding process. If the input video is already an H.264/AVC bit-stream, we may record the coding modes during the decoding process, so that the time-consuming mode decision can be made efficient by referencing the modes of the input H.264 video, as long as the settings of the video, including the GOP structure, the bit-rate, etc., are the same. We therefore implement a mode-copy procedure to skip some time-consuming mode decision steps in the encoding process. In our implementation, the recorded coding information consists of the frame type, the macroblock type, the intra- and inter-prediction modes, and the motion vectors in quarter-pixel units. After decoding a frame, the video encoder assigns those features directly to speed up the whole transcoding process. We compare the typical transcoding and the mode-copy encoding in Table II, where frames per second (FPS) is used as the performance measure. No information embedding is applied in either case. It can be seen that the mode-copy encoding is competitive with the typical transcoding.

Table II
THE COMPARISON OF THE TYPICAL TRANSCODING AND MODE-COPY ENCODING (BIT-RATE: 500 KBPS)

  Video Name  |  Typical transcoding: PSNR (dB) / FPS  |  Mode-Copy: PSNR (dB) / FPS
  Garden      |            23.62 / 14.54               |         23.53 / 21.15
  Container   |            37.77 / 11.99               |         37.51 / 18.42

For the information embedding, we may use both the intra- and inter-prediction modes of the input video as references and try to modify the modes directly without using RDO. If speed is the major concern, this approach is feasible. For the information hiding in the intra-prediction modes, we can replace the prediction direction with DC (if the DC mode is not the MPM) and then group the modes into known pairs according to the prediction directions. For the information hiding in the MVD, we may simply change the bits according to the incoming motion vectors, or use a refined method that calculates the SAD's of the adjacent locations to pick a better motion vector. However, in our opinion, if the coding performance is the major concern, we may prefer to run RDO for the intra mode modification, as described before, and omit the motion vector embedding.
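As an illustration of the side information involved, a per-macroblock record for the mode-copy procedure might look like the following; the field list follows the description above, while the class itself and its use are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MacroblockRecord:
    """Decisions recorded while decoding the input H.264/AVC stream and handed
    back to the encoder so that the corresponding mode decision can be skipped."""
    frame_type: str                               # 'I', 'P' or 'B'
    mb_type: int                                  # macroblock / partition type
    intra_modes: List[int] = field(default_factory=list)               # 4x4 intra modes, if intra
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)  # quarter-pel MVs

def copy_modes(records, apply_to_encoder):
    """Reuse the recorded decisions instead of searching; only sensible when the
    GOP structure, bit-rate and other settings match the input stream."""
    for rec in records:
        apply_to_encoder(rec)

copy_modes([MacroblockRecord('P', 1, motion_vectors=[(4, 0)])], print)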

C. Information Hiding in Advanced Audio Coding

Advanced Audio Coding (AAC) is a standardized compression scheme for digital audio, designed as the successor of the MP3 format. AAC makes use of many advanced coding techniques available at the time of its development to provide high-quality multi-channel audio, and it has become the kernel algorithm of audio compression standards. At the beginning of encoding, a filter bank is employed to transform the time-domain signal into the frequency domain. The time-frequency conversion is followed by a series of prediction mechanisms; those stages attempt to further reduce the redundancy with respect to previously encoded signals or the joint stereo channel. After the predictions, an iteration loop is applied to quantize the spectral coefficients. The scalefactors of the subbands are obtained and applied to all of the coefficients in the corresponding scalefactor bands, and the number of required bits and the related information are determined to control the trade-off between the audio distortion and the bits spent. Huffman coding follows, according to the 12 pre-defined Huffman tables. Since the scalefactors and the spectral coefficients occupy a significant part of the coded audio stream, we make use of both to embed the information.

1) Embedding in the Scale Factors: Scalefactors have been used for effective information embedding before [12]. In our implementation, scalefactors equal to zero are skipped by the embedding scheme, and the scalefactor bands that use the pseudo codebooks of intensity stereo are also skipped. For each remaining nonzero scalefactor, a secret message bit is embedded in its LSB. The payload of the scalefactor embedding is shown in Table III, in which two audio clips with different characteristics are employed and two target bit-rates are used, i.e., High: 264 Kbps and Low: 132 Kbps. We can see that the payload of the scalefactor embedding reaches around 1 to 3% of the audio stream size.

Table III
THE PAYLOAD OF INFORMATION EMBEDDING IN AUDIO

   Music    |  Scalefactor: High / Low  |  Quantization index: High / Low
  A (5:06)  |      1.55% / 2.86%        |        7.10% / 6.79%
  B (3:51)  |      1.26% / 2.52%        |       11.90% / 7.72%
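A minimal sketch of this scalefactor embedding is shown below; the skip_band flag marking intensity-stereo bands is an assumed helper provided by the encoder, and the handling of a scalefactor that would be driven to zero is deliberately left open, as it is not detailed above.

def embed_in_scalefactors(scalefactors, bits, skip_band):
    """Write one secret bit into the LSB of every usable (nonzero, non-skipped)
    scalefactor; zero scalefactors and intensity-stereo bands are left alone."""
    out = list(scalefactors)
    bit_iter = iter(bits)
    for i, sf in enumerate(out):
        if sf == 0 or skip_band[i]:
            continue
        b = next(bit_iter, None)
        if b is None:
            break                       # message exhausted
        # caution: sf == 1 with b == 0 would yield a zero scalefactor, which the
        # detector would then skip; a real implementation needs an extra guard here
        out[i] = (sf & ~1) | b
    return out

print(embed_in_scalefactors([60, 0, 57, 58], [1, 0, 1], [False, False, False, True]))
# -> [61, 0, 56, 58]; the last band is skipped as an intensity-stereo band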
2) Embedding in the Quantization Indices: In order to maximize the payload in the audio stream, the spectral coefficients after quantization are also employed; again, we apply the F4 algorithm to these quantization indices. Table III also shows the payload of the quantization-index embedding; the average payload is around 6 to 12% of the embedded audio file size. It can be seen that "Music B" yields a larger payload than "Music A" because Music B contains more transient signals and therefore more non-zero coefficients.
IV. EXPERIMENTAL RESULTS

Our results are presented in two parts, i.e., the information embedding in the video stream and in the audio stream. In the video embedding part, we evaluate the performance on various videos, first using a fixed QP value and then enabling the rate control mechanism. In both cases, the payload, the fidelity and the increment of the bit-stream size are the major concerns. Six common test videos, namely "Container", "Hall Monitor", "Foreman", "Football", "Garden" and "Mobile", are utilized to verify the proposed video embedding method. The details of the test videos are given in Table IV. The proposed scheme is integrated with Intel Integrated Performance Primitives (IPP) version 5.2, a highly optimized run-time library supporting fast H.264/AVC coding.

Table IV
DETAILS OF TEST VIDEOS

      Name      | Num. of frames | Frame resolution | No. of MB per frame
  Container     |      300       |     352×288      |        396
  Football      |      125       |     352×240      |        330
  Foreman       |      300       |     352×288      |        396
  Garden        |      115       |     352×240      |        330
  Hall Monitor  |      300       |     352×288      |        396
  Mobile        |      140       |     352×240      |        330

First, we set a fixed QP value of 30 for all frames of the test videos to observe the effects of the different embedding methods. When a fixed QP is employed, the trade-off between the hidden-information payload and the increment of the bit-stream size is the major consideration. For each video, we record the payload in bits per frame; the payloads and the resulting bit-stream increments of the various embedding methods are listed in Table V. We find that the "Football" video provides the largest payload for the modification of the quantization indices of intra blocks and for the intra mode prediction (IMP), since the high motion in this video leads to more intra blocks. For the quantization-index embedding in inter-predicted blocks, the payload of the "Garden" video is even higher than that of "Football", because it not only has high variation within each frame but also high similarity among frames, so more inter blocks exist. In the other embedding modes, high-motion videos always yield higher payloads.

Table V
THE AVERAGE PAYLOAD (BPF) AND THE CORRESPONDING BIT-STREAM SIZE INCREMENT (%) OF EACH EMBEDDING METHOD

                 | Intra4x4,16x16  |      Inter      |      IMP       |      MVD       |   All (MVD)
  File Name      | Payload |  Size | Payload |  Size | Payload | Size | Payload | Size | Payload | Size
  Container      |   409   |  0.14 |   339   | -7.30 |    81   | 2.96 |   128   | 25.89|  1191   |  7.49
  Hall Monitor   |   262   | -3.38 |   341   | -8.77 |    95   | 4.13 |    60   |  6.14|   880   | -5.16
  Foreman        |   349   | -1.41 |   586   | -7.60 |   246   | 4.44 |   482   | 12.72|  1887   |  3.44
  Football       |  1768   | -4.34 |  2411   | -8.15 |   615   | 3.00 |   407   |  5.59|  6095   | -8.38
  Garden         |  1460   |  0.70 |  5039   | -8.81 |   206   | 0.81 |   440   | 19.37|  9164   |  2.03
  Mobile         |  1592   |  2.04 |  4325   | -9.24 |   222   | 1.19 |   470   | 30.94|  8996   |  1.99

Using a fixed QP value is only an experiment to observe the trade-off among the various embedding modes. We should enable the rate control to simulate the scenario of real applications. Under a given target bit-rate, the issue we discuss is the trade-off between the payload and the fidelity of the video. Unlike the fixed-QP mode, we combine all the embedding modes to directly observe the trade-off between the payload and the PSNR under the same target bit-rate, as shown in Fig. 6, in which four videos are tested and their payloads are drawn as the solid lines. We can see that "Garden" achieves the best payload performance at all bit-rates. Again, this demonstrates that high-motion videos usually perform better. In fact, under various bit-rates, the payloads of IMP and MVD embedding tend to be independent of the target bit-rate, since they are usually more related to the frame size.

Figure 4. The payload of information embedding in videos under various bit-rates.

Figure 5. PSNR of embedded videos and transcoded videos under various bit-rates.

Then, we consider the fidelity of the embedded videos under various target bit-rates. We present the PSNR values of the embedded videos and of the merely transcoded videos; the fidelity decreases of four videos are shown in Fig. 5. We can see that the fidelity of a transcoded high-motion video is not as high as that of the more static videos under the same bit-rate, because high-motion videos produce a significant amount of inter-block residuals to compensate for the large variation between video frames. It can also be observed that the PSNR of high-motion videos such as "Garden" drops considerably after the information embedding. The reason could be that modifying the motion vectors of high-motion videos causes serious inaccuracies in the motion compensation. At lower bit-rates, the difference between the transcoded and the embedded video is not as large as at higher bit-rates. Despite this, our scheme still achieves a payload of around 10% of the embedded video size on average, as shown in Fig. 4.

As mentioned in Section III-B4, a mode-copy procedure was introduced to speed up the encoding process. It should be noted that the mode-copy procedure is reasonable only when the rate-control mode is enabled. Our mode-copy procedure skips the most time-consuming stages of the video encoder, including the motion estimation, to increase the efficiency.
First, we record the execution time of the embedding process, with and without the mode-copy procedure. Table VI shows the ratio by which the execution time decreases. It can be observed that the efficiency is improved by more than 28% when the mode-copy procedure is employed, and the lower the bit-rate, the larger the improvement. Figure 6 shows the embedding payload when the mode-copy procedure is applied; note that the MVD embedding is disabled in this case. The dotted lines in Fig. 6 show the payload of the hidden information with mode-copy, and the solid lines come from the complete transcoding scheme. "Garden" still has the best embedding performance. Even with the MVD embedding disabled, the payload can still reach around 10% of the encoded video size on average.

Table VI
THE RATIO OF EXECUTION TIME DECREASE AFTER EMPLOYING THE MODE-COPY PROCEDURE

  Video Name    | 500 Kbps | 1 Mbps | 2 Mbps
  Container     |   53%    |  39%   |  30%
  Hall Monitor  |   48%    |  40%   |  35%
  Foreman       |   52%    |  44%   |  33%
  Football      |   35%    |  32%   |  30%
  Garden        |   55%    |  35%   |  31%
  Mobile        |   46%    |  35%   |  27%

Figure 6. The embedding payload under various bit-rates with the mode-copy procedure.

The mode-copy procedure skips many searching steps, so it may degrade the fidelity of the video frames. Figure 7 shows the PSNR values of four videos with and without the mode-copy procedure; the dotted lines represent the PSNR values with the mode-copy procedure. We can see that the fidelity of the videos is not affected much, and when a higher bit-rate is set, the mode-copy procedure even performs better.

Figure 7. The fidelity of video under various bit-rates, with and without the mode-copy procedure.

A. The Evaluation of Audio Embedding

For the information embedding in audio, we select several audio clips from the EBU SQAM (Sound Quality Assessment Material) CD, including "abba", "speech", "baird" and "bach". All the audio clips from the EBU SQAM CD are encoded in a lossless compression format (FLAC), and we transcode those clips into FLV as the input to our scheme. The audio encoder parameters are set to retain the fidelity of the audio as much as possible; we employ the "target quality mode" of the Nero AAC encoder to preserve the fidelity of the original clips. In addition, we also select two clips, "classic" and "electronic", from YouTube videos, one being classical music and the other a remixed pop track.

We first investigate the payload of embedding with both embedding modes enabled. The payload unit is again bits per frame (bpf), where a "frame" is the basic unit over which the audio sampling points are collected. Table VII shows the ratio of short-window appearance, the payloads of the scalefactor embedding (SF) and the quantization-index embedding (QC), and the ratio of the total payload to the encoded audio size. All the audio clips are encoded at 192 Kbps. The ratio of short-window appearance characterizes the audio content: a transient is a short-duration signal that contains a high degree of non-periodic components and a higher magnitude of high frequencies than the harmonic content of the sound. It can be seen that "abba", "speech" and "electronic"
have more transient signals. Be comparing the ratio of                                                [2] C. Xu, X. Ping, and T. Zhang, “Steganography in com-
short window appearance and payload, we can find that the                                                  pressed video stream,” 2006.
payload obtained by scalefactors embedding is irrelative to
                                                                                                      [3] S. Kapotas, E. Varsaki, and A. Skodras, “Data Hiding in H.
the ratio of short windows because each subband has only                                                  264 Encoded Video Sequences,” in IEEE 9th Workshop on
one scalefactor so it is not related to the bitrate or number                                             Multimedia Signal Processing, 2007. MMSP 2007, 2007,
of short windows. Therefore, the ratio of short windows                                                   pp. 373–376.
appearance is proportional to the payload obtained by the
embedding in the quantization indices.                                                                [4] D. Fang and L. Chang, “Data hiding for digital video
                                                                                                          with phase of motion vector,” in 2006 IEEE International
                                Table VII
                  THE PAYLOAD OF THE AUDIO EMBEDDING

               Audio       Short            Payload [bpf]
               name        window [%]     SF      QC     Ratio [%]
               abba          14.78        76      341      10.55
               speech        11.42        77      339      10.61
               baird          0.11        92      271       8.20
               bach           2.60        72      284       9.42
               classic        0.04        95      319       9.19
               electro.       5.28        79      496      12.80
   Fig. 8 shows the bit-rate and the payload of quantization indices embedding. We can see that the higher the target bit-rate is set, the larger the payload that quantization indices embedding can achieve, because the number of coefficients in the subbands increases. On average, the payload of the audio embedding reaches around 10% of the audio stream size, as it does for the video information embedding.
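   As a back-of-the-envelope illustration of how such an embedding ratio relates to the per-frame payload, the sketch below computes the ratio of embedded bits to encoded stream size. This is not the authors' code; the frame rate, the helper name, and the example numbers are assumptions made purely for illustration.

    # Minimal sketch (not the authors' code): estimate the embedding ratio of
    # payload to encoded audio size from a per-frame payload figure.

    def embedding_ratio(payload_bits_per_frame: float,
                        frames_per_second: float,
                        bitrate_bps: float) -> float:
        """Embedded bits as a fraction of the encoded stream size."""
        payload_bps = payload_bits_per_frame * frames_per_second
        return payload_bps / bitrate_bps

    if __name__ == "__main__":
        # Assumed AAC-style framing: 1024 samples per frame at 44.1 kHz.
        fps = 44100.0 / 1024.0
        # Hypothetical clip: 400 bits embedded per frame, encoded at 192 Kbps.
        print(f"ratio = {embedding_ratio(400.0, fps, 192_000.0):.2%}")

   With per-frame payloads in the range listed in Table VII, this kind of estimate lands in the same ballpark as the reported ratios; the exact values depend on the clip's sampling rate and frame count.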
   [Plot: payload (bpf) versus target bit-rate from 120 to 260 Kbps for the six clips abba, speech, baird, bach, classic, and electronic.]

Figure 8.   The payload of quantization indices embedding at various bit-rates.


                            V. CONCLUSION
   We have developed a high-capacity steganographic scheme for FLV files. Both the video and audio streams are employed, and several coding features are taken into account. Users can select suitable features according to their application, with payload, perceptual quality, file-size increase, and security as the major concerns. Experimental results demonstrate that the payload can reach more than 10% of the total file size when a good tradeoff among these factors is achieved.

MULTI-SCALE IMAGE CONTRAST ENHANCEMENT: USING ADAPTIVE
           INVERSE HYPERBOLIC TANGENT ALGORITHM
            Cheng-Yi Yu 1,2, Yen-Chieh Ouyang 1, Tzu-Wei Yu 3, Chein-I Chang 1,4
    1 Dept. of Electrical Engineering, National Chung Hsing University, Taichung, ROC
    2 Dept. of Computer Science and Information Engineering, National Chin Yi University of Technology, Taichung, ROC
    3 Dept. of Electronic Engineering, National Chin Yi University of Technology, Taichung, ROC
    4 Remote Sensing Signal and Image Processing Laboratory, Dept. of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
                               E-mail: youjy@ncut.edu.tw


                              ABSTRACT

      This paper presents a fast and effective method for image contrast enhancement based on multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT). Sub-band coefficients are developed based on the Adaptive Inverse Hyperbolic Tangent algorithm. In the proposed method, the image contrast is calculated from the local mean and local variance before the further processing of the Adaptive Inverse Hyperbolic Tangent (AIHT) algorithm. We show that this approach provides a convenient and effective way to handle various types of images. Applications of the proposed method to real-time imagery are also discussed. Experimental results show that the proposed algorithm is capable of adaptively enhancing the local contrast of the original image while bringing out the details of objects at the same time.

Keywords — Multi-Scale, Adaptive Inverse Hyperbolic Tangent, Contrast Enhancement, Image Processing
Topic area — Multi Processing, Image Post-Processing

                          1. INTRODUCTION

      Light is the electromagnetic radiation that stimulates our visual response. In real-world situations, light intensities cover a large range: the illumination range over which the human visual system can operate is roughly 1 to 10^10, or ten orders of magnitude.
      The retina of the human eye contains about 100 million rods and 6.5 million cones. The rods are sensitive and provide vision over the lower several orders of magnitude of illumination. The cones are less sensitive and provide the visual response at the higher 5 to 6 orders of magnitude of illumination. Figure 1 shows the Human Visual System mapping curve [1,2].

        Figure 1. Human Visual System mapping curve

      According to its contrast, an image is generally categorized into one of five groups: dark image, bright image, back-lighted image, low-contrast image, and high-contrast image. A dark image has particularly low gray levels in intensity, while a bright image has very high gray levels in intensity. The gray levels of a back-lighted image are usually distributed at the two ends of the dark and bright regions. On the other hand, the gray levels of a low-contrast image are generally centralized in the middle region, while the gray levels of a high-contrast image are scattered across the whole spectrum (Fig. 2) [3,4].

        Figure 2. Five kinds of contrast types

      Five categories of commonly used gray level transfer functions, shown in Fig. 3, are generally used to perform contrast enhancement so as to achieve the different types of contrast [3,4]. For example, for dark images with mean < 0.5, the function in Fig. 3(a) is used,
whereas the function in Fig. 3(b) is used for a bright image with mean > 0.5 for the same purpose. For images whose gray levels are centralized in the middle region with mean near 0.5, the function in Fig. 3(c) is used. For images whose gray levels are distributed at the two ends of the dark and bright regions, the function in Fig. 3(d) is used. For images whose gray levels are uniformly scattered across the whole spectrum, the function in Fig. 3(e) is used.

      Figure 3. Five categories of classical gray level transfer functions: (a) dark image (bias=0.37, gain=0.35), (b) bright image (bias=0.37, gain=3.0), (c) back-lighted image (bias=0.97, gain=1.0), (d) low-contrast image (bias=0.97, gain=1.0), (e) high-contrast image (bias=0.37, gain=1.0).

      Contrast enhancement techniques are widely used to increase visual image quality. Image enhancement serves two purposes. First, it makes an image clearer and more detailed so that the human eye can recognize its content more easily. Second, it makes the image data easier for a computer to analyze and identify, giving the machine visual perception capabilities closer to those of humans. In our previous work [3,4] we proposed the Adaptive Inverse Hyperbolic Tangent algorithm; however, that approach suffers from the following drawbacks. First, it lacks a mechanism to adjust the degree of enhancement, so AIHT-based contrast enhancement cannot retain the detailed brightness distribution of the original image, which leads to distortion. Second, the algorithm only performs global contrast enhancement and cannot achieve local contrast enhancement; it fails to follow the Human Visual System mapping curve and may produce non-smooth or distorted images.
      To address these shortcomings, this paper proposes a multi-scale image enhancement method based on the Adaptive Inverse Hyperbolic Tangent algorithm. The method has two main features: (1) a sub-band processing scheme that achieves local contrast enhancement; (2) the ability to process various types of images while enhancing and retaining the original image details. The enhanced results also facilitate subsequent image analysis.
      The multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT) for image contrast enhancement is suitable for interactive applications. It can automatically produce contrast-enhanced images of good quality while using a spatially uniform mapping function based on a simple brightness perception model to achieve better efficiency. In addition, MSAIHT provides users with a tool for tuning the image appearance on the fly in terms of brightness and contrast, which makes it suitable for interactive use. The AIHT-processed images can be reproduced within the capabilities of the display medium to give more detailed and faithful representations of the original scenes.
      The remainder of this paper is organized as follows: Section 2 reviews previous work in the literature. Section 3 develops the MSAIHT contrast enhancement algorithm along with its parameters and usage. Section 4 conducts experiments, including simulations. Finally, Section 5 provides directions for further research.

            2. CONTRAST ENHANCEMENT FOR AN IMAGE

      There are two categories of contrast enhancement techniques: global methods and local methods. Global contrast enhancement techniques remedy problems that manifest themselves in a global fashion, such as excessive or poor lighting conditions in the source environment. Local contrast enhancement, on the other hand, tries to enhance the visibility of local details in the image. Locally enhanced images look more attractive than the originals because of their higher contrast [5].
      The advantages of a global method are its high efficiency and low computational load. The drawback of a global operator is its inability to reveal image details arising from local luminance variation. On the contrary, the advantage of a local operator is its capability to reveal the details of luminance level information in an image, at the expense of a very high computational cost that may be unsuitable for video applications without hardware realization [3,4]. Two types of contrast enhancement techniques, linear and nonlinear, are discussed as follows.
      Linear contrast enhancement is also referred to as contrast stretching. It linearly expands the original digital luminance values of an image to a new distribution. Expanding the original input values of the image makes it possible to use the entire sensitivity range of the display device. Linear contrast enhancement also highlights subtle variations within the data.
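      As a concrete reference point, the following is a minimal linear contrast-stretching sketch. It is a generic textbook formulation rather than part of the proposed MSAIHT method, and the percentile clipping limits are illustrative assumptions.

    import numpy as np

    def linear_stretch(img: np.ndarray, low_pct: float = 1.0, high_pct: float = 99.0) -> np.ndarray:
        """Linearly remap gray levels so the chosen percentile range fills [0, 255]."""
        lo, hi = np.percentile(img, [low_pct, high_pct])
        out = (img.astype(np.float64) - lo) / max(float(hi - lo), 1e-6)
        return np.clip(out * 255.0, 0.0, 255.0).astype(np.uint8)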
      Nonlinear contrast enhancement often involves histogram equalization, which requires an algorithm to accomplish the task. One major disadvantage of a nonlinear contrast stretch is that each value in the input image can map to several values in the output image, so that objects in the original scene lose their correct relative brightness values.
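      For comparison, a plain global histogram-equalization sketch is shown below; again, this is a generic illustration and not the method proposed in this paper.

    import numpy as np

    def histogram_equalize(img: np.ndarray) -> np.ndarray:
        """Global histogram equalization for an 8-bit grayscale image."""
        hist = np.bincount(img.ravel(), minlength=256)     # gray-level histogram
        cdf = hist.cumsum().astype(np.float64)
        cdf = (cdf - cdf[0]) / max(cdf[-1] - cdf[0], 1.0)   # normalized cumulative distribution
        lut = np.round(cdf * 255.0).astype(np.uint8)        # gray-level mapping table
        return lut[img]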
      Under such a circumstance, contrast enhancement is generally performed to expand the gray level range to mitigate the problem. One popular technique to accomplish this task is histogram equalization (Gonzalez and Woods [6]). A disadvantage of the method is that it is indiscriminate and produces unrealistic effects in photographs. It may increase the contrast of background noise while decreasing the usable signal. In scientific imaging, where spatial correlation is more important than signal intensity, the small signal-to-noise ratio usually hampers visual detection.

          3. MULTI-SCALE PARAMETER ADJUSTMENT OF ADAPTIVE
           INVERSE HYPERBOLIC TANGENT (MSAIHT) ALGORITHM

3.1. Adaptive Inverse Hyperbolic Tangent (AIHT) Algorithm
      Figure 4 is a block diagram of the AIHT algorithm. The input data are converted from their original format to a floating-point representation of RGB values. The principal characteristic of the proposed enhancement function is an adaptive adjustment of the inverse hyperbolic tangent (IHT) function determined by each pixel's radiance. After the image file is read, bias(x) and gain(x) are computed; these parameters control the shape of the IHT function. Figure 5 shows a block diagram of the AIHT parameter evaluation, including the bias(x) and gain(x) parameters [3,4].
      The Adaptive Inverse Hyperbolic Tangent algorithm has several desirable properties. For very small and very large luminance values, its logarithmic form enhances the contrast in both the dark and the bright areas of an image. Because the function is asymptotic, the output mapping is always bounded between 0 and 1. Another advantage of this function is that it provides an approximately inverse hyperbolic tangent mapping for intermediate luminance, i.e., luminance distributed between the dark and bright values. Figure 6 shows an example in which the middle section of the curve is approximately linear.
      The form of the AIHT fits data obtained from measuring the electrical response of photo-receptors to flashes of light in various species [7]. It has also provided a good fit to other electro-physiological and psychophysical measurements of human visual function [8]-[10].
      The contrast of an image can be enhanced using the adaptive inverse hyperbolic function. The enhanced pixel x_ij is defined as follows:

      Enhance(x_ij) = log[ (1 + x_ij^bias(x)) / (1 - x_ij^bias(x)) ] * (1 / gain(x))          (1)

where x_ij is the image gray level at the i-th row and j-th column. The bias(x) is a power applied to x_ij to speed up the change. The gain function is a weighting function used to determine the steepness of the AIHT curve: a steeper slope maps a smaller range of input values to the display range. The gain function helps shape how fast the mid-range of objects in a soft region goes from 0 to 1; a higher gain value means a higher rate of change. Therefore, the steepness of the inverse hyperbolic tangent curve can be dynamically adjusted. The following section describes the method we use, which is similar to the proposed algorithm.
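      A compact sketch of the re-mapping in Eq. (1) is given below. It assumes the gray levels have already been normalized to (0, 1); the clipping constant and the final rescaling to [0, 1] are illustrative assumptions rather than part of the published algorithm.

    import numpy as np

    def aiht_enhance(x: np.ndarray, bias: float, gain: float) -> np.ndarray:
        """Apply the inverse-hyperbolic-tangent re-mapping of Eq. (1).

        x: grayscale image with values in (0, 1); bias and gain stand in for
        bias(x) and gain(x). Clipping and output rescaling are assumptions.
        """
        eps = 1e-6
        xp = np.clip(x, eps, 1.0 - eps) ** bias            # x_ij^bias(x)
        enhanced = np.log((1.0 + xp) / (1.0 - xp)) / gain  # Eq. (1)
        # Rescale to [0, 1] for display (assumed normalization).
        return (enhanced - enhanced.min()) / max(float(enhanced.max() - enhanced.min()), eps)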




      Figure 4. A flowchart of the AIHT algorithm.

      Figure 5. A flowchart of the AIHT parameter evaluation.

      [Plot: the AIHT curve for gain = 1.0 with bias = 0.2, 1.0, and 4.0, together with a linear reference.]
      Figure 6. AIHT is approximately linear over the middle range of values, where the choice of a semi-saturation constant determines how input values are mapped to display values.

3.2. Bias and Gain Parameters
      The bias function is a power function defined over the unit interval which remaps x according to the bias transfer function. It is used to bend the density function either upwards or downwards over the [0,1] interval.
The bias power function is defined by:

      bias(x) = ( mean(x) / 0.5 )^0.25,   where  mean(x) = (1/(m*n)) * sum_{i=1..m} sum_{j=1..n} x_ij          (2)

      The gain function determines the steepness of the AIHT curve: a steeper slope maps a smaller range of input values to the display range. The gain function is used to help reshape how quickly an object's mid-range goes from 0 to 1 within its soft region.
      The gain function is defined by:

      gain(x) = ( 0.1 + variance(x) )^0.5,   where  variance(x) = (1/(m*n)) * sum_{i=1..m} sum_{j=1..n} (x_ij - mean(x))^2          (3)

and mean(x) is the mean gray level defined in Eq. (2).
      Decreasing the gain(x) value increases the contrast of the re-mapped image. Shifting the distribution toward lower levels of light (i.e., decreasing bias(x)) decreases the highlights. By adjusting bias(x) and gain(x), it is possible to tailor a re-mapping function with appropriate amounts of image contrast enhancement, highlight, and shadow lightness, as shown in Fig. 7.
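      Putting Eqs. (2) and (3) together, the adaptive parameters can be sketched as follows; the squared deviation in the variance term follows the usual definition of variance and is an assumption made while reconstructing Eq. (3) from the text.

    import numpy as np

    def aiht_parameters(x: np.ndarray):
        """Compute bias(x) and gain(x) from the image statistics of Eqs. (2) and (3).

        x: grayscale image normalized to [0, 1].
        """
        mean = float(x.mean())                          # mean(x) in Eq. (2)
        variance = float(((x - mean) ** 2).mean())      # variance(x) in Eq. (3), assumed squared deviation
        bias = (mean / 0.5) ** 0.25                     # Eq. (2)
        gain = (0.1 + variance) ** 0.5                  # Eq. (3)
        return bias, gain

      Combined with the aiht_enhance sketch above, aiht_enhance(x, *aiht_parameters(x)) would apply the adaptive mapping globally; computing the same statistics over local windows or sub-bands would give the local, multi-scale behaviour described in the abstract.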
[Figure 7 consists of nine panels of mapping curves (Output Level versus Input Level), one panel per bias value in {0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8}, each panel varying the gain over {1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37}.]
Figure 7. Inverse hyperbolic tangent mapping curves produced by varying the gain and bias values.
Figure 8. Processed images for (a) the bias parameter fixed (bias = 1) with eight different gain values, and (b) the gain parameter fixed (gain = 0.85) with nine different bias values.

The gain function determines the steepness of the curve. Steeper slopes map a smaller range of input values to the display range. The value of bias controls the centering of the inverse hyperbolic tangent. Figure 8 shows the processed images for different gain and bias values. There are a total of eight gain values (1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37) with the bias parameter fixed (bias = 1); the corresponding results are shown in Fig. 8(a). There are a total of nine bias values (0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8) with the gain parameter fixed (gain = 0.85); the corresponding results are shown in Fig. 8(b).

3.3. Multi-Scale Parameter Adjustment of Adaptive Inverse Hyperbolic Tangent (MSAIHT) Algorithm

Figure 9 shows a block diagram of the MSAIHT algorithm. The input data is converted from its original format to a floating-point representation of RGB values. The principal characteristic of our proposed enhancement
function is a multi-scale adaptive adjustment of the Inverse Hyperbolic Tangent (MSAIHT) function determined by each pixel's radiance. After reading the image file, the bias(x) and gain(x) values are computed. These parameters control the shape of the AIHT function. Figure 10 shows a block diagram of the MSAIHT parameter evaluation, including the multi-scale bias(x) and gain(x) parameters.

Figure 9. A flowchart of the MSAIHT algorithm.

There are two important design goals for the multi-scale approach: avoiding noise visibility, especially in smooth regions, and preventing intensity saturation at the minimum and maximum possible intensity values (e.g., 0 and 255 for a 1-byte-per-channel source format).

The enhanced output image resulting from the multi-scale approach for processing an input image x is described by:

Enhance_MSAIHT = Σ_{k=1}^{K} AIHT(bias(k), gain(k))        (4)

where K is the number of bands used, with low-gain images in the low levels and high-gain images in the high levels. An additional problem that is potentially solved by this approach is the compression property of the display (the so-called gamma curve). This transfer function has a high suppression rate for the higher luminance range and a low expansion rate for the lower luminance regions.

4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A variety of video sequences and still images were tested using the proposed method. There are four types of extreme-case images: dark, bright, back-lighted, and low-contrast images. Images with different types of histogram distributions, including daily-life images with poor contrast, were used in the experiments to demonstrate the enhanced results. Figure 11 shows various types of images with bad contrast and the results of enhancement by histogram equalization, AIHT, and the proposed MSAIHT method. Figure 12 compares the Adaptive Inverse Hyperbolic Tangent and Multi-Scale Adaptive Inverse Hyperbolic Tangent methods on local detail; MSAIHT preserves local detail better than AIHT.

The comparative analysis shows that the proposed methods can display more detail, in the sense of contrast, than the currently used methods. The MSAIHT technique keeps the sharpness of defect edges and local detail well. Therefore, AIHT and MSAIHT can greatly enhance poor images and will be helpful for defect recognition.

Finally, Figure 13 shows the MSAIHT system interface in manual and automatic mode. The automatic mode adjusts the best parameters (multi-scale gain and bias) based on automatically computed image characteristics (piecewise mean and variance). In manual mode, users can select the multi-scale gain and bias parameters themselves.
  • 1.
    3-D Environment ModelConstruction and Adaptive Foreground Detection for Multi-Camera Surveillance System Yi-Yuan Chen1† , Hung-I Pai2† , Yung-Huang Huang∗ , Yung-Cheng Cheng∗ , Yong-Sheng Chen∗ Jian-Ren Chen† , Shang-Chih Hung† , Yueh-Hsun Hsieh† , Shen-Zheng Wang† , San-Lung Zhao† † Industrial Technology Research Institute, Taiwan 310, ROC ∗ Department of Computer and Information Science, National Chiao-Tung University, Taiwan 30010, ROC E-mail:1 [email protected], 2 [email protected] Abstract— Conventional surveillance systems usually use multiple screens to display acquired video streams and may cause trouble to keep track of targets due to the lack of spatial relationship among the screens. This paper presents an effective and efficient surveillance system that can integrate multiple video contents into one single comprehensive view. To visualize the monitored area, the proposed system uses planar patches to approximate the 3-D model of the monitored environment and displays the video contents of cameras by applying dynamic texture mapping on the model. Moreover, a pixel-based shadow detection scheme for surveillance system is proposed. After an offline training phase, our method exploits the threshold which determines whether a pixel is in a shadow part of the Fig. 1. A conventional surveillance system with multiple screens. frame. The thresholds of pixels would be automatically adjusted and updated according to received video streams. The moving objects are extracted accurately with removing cast shadows and then visualized through axis-aligned billboarding. The system direction of cameras, and locations of billboards indicate provides security guards a better situational awareness of the the positions of cameras, but the billboard contents will be monitored site, including the activities of the tracking targets. hard to perceive if the angles between viewing direction and Index Terms— Video surveillance system, planar patch mod- normal directions of billboards are too large. However, in eling, axis-aligned billboarding, cast shadow removal rotating billboard method, when the billboard rotates and faces to the viewpoint of user, neither camera orientations I. I NTRODUCTION nor capturing areas will be preserved. In outdoor surveillance Recently, video surveillance has experienced accelerated system, an aerial or satellite photograph can be used as growth because of continuously decreasing price and better a reference map and some measurement equipments are capability of cameras [1] and has become an important used to build the 3-D environment [3]–[5]. Neumann, et al. research topic in the general field of security. Since the utilized an airborne LiDAR (Light Detection and Ranging) monitored regions are often wide and the field of views sensor system to collect 3-D geometry samples of a specific of cameras are limited, multiple cameras are required to environment [6]. In [3], image registration seams the video on cover the whole area. In the conventional surveillance system, the 3-D model. Furthermore, video projection, such as video security guards in the control center monitor the security flashlight or virtual projector, is another way to display video area through a screen wall (Figure 1). It is difficult for the in the 3-D model [4], [7]. guards to keep track of targets because the spatial relationship However, the multi-camera surveillance system still has between adjacent screens is not intuitively known. 
Also, it is many open problems to be solved, such as object tracking tiresome to simultaneously gaze between many screens over across cameras and object re-identification. The detection of a long period of time. Therefore, it is beneficial to develop a moving objects in video sequences is the first relevant step in surveillance system that can integrate all the videos acquired the extraction of information in vision-based applications. In by the monitoring cameras into a single comprehensive view. general, the quality of object segmentation is very important. Many researches on integrated video surveillance systems The more accurate positions and shapes of objects are, the are proposed in the literature. Video billboards and video more reliable identification and tracking will be. Cast shadow on fixed planes project camera views including foreground detection is an issue for precise object segmentation or objects onto individual vertical planes in a reference map to tracking. The characteristics of shadow are quite different visualize the monitored area [2]. In fixed billboard method, in outdoor and indoor environment. The main difficulties in billboards face to specified directions to indicate the capturing separating the shadow from an interesting object are due to 988
  • 2.
    the physical propertyof floor, directions of light sources and System configuration On-line monitoring additive noise in indoor environment. Based on brightness Manual operation video and chromaticity, some works are proposed to decide thresh- streams olds of these features to roughly detect the shadow from objects [8]–[10]. However, current local threshold methods 2D 3D patterns model Background modeling couple blob-level processing with pixel-level detection. It causes the performance of these methods to be limited due to the averaging effect of considering a big image region. Registration with Segmentation and corresponding points refinement Two works to remove shadow are proposed to update the threshold with time and detect cast shadow in different Axis-aligned scenes. Carmona et al [11] propose a method to detect Lookup billboarding shadow by using the properties of shadow in Angle-Module tables space. Blob-level knowledge is used to identify shadow, 3-D model refection and ghost. This work also proposes a method to construction update the thresholds to remove shadow in different positions of the scene. However there are many undetermined param- Fig. 2. The flowchart and components of the proposed 3-D surveillance system. eters to update the thresholds and the optimal parameters are hard to find in practice. Martel-Brisson et al [12] propose a method, called GMSM, which initially uses Gaissian of Mixture Model (GMM) to define the most stable Gaussian distributions as the shadow and background distributions. Since a background model is included in this method, more computation is needed for object segmentation if a more com- plex background model is included in the system. Besides, because that each pixel has to be updated no matter how many objects moving, it cost more computation in few objects. In this paper, we develop a 3-D surveillance system based on multiple cameras integration. We use planar patches to build the 3-D environment model firstly and then visualize videos by using dynamic texture mapping on the 3-D model. To obtain the relationship between the camera contents and the 3-D model, homography transformations are estimated for every pair of image regions in the video contents and the corresponding areas in the 3-D model. Before texture Fig. 3. Planar patch modeling for 3-D model construction. Red patches mapping, patches are automatically divided into smaller ones (top-left), green patches (top-right), and blue patches (bottom-left) represent with appropriate sizes according to the environment. Lookup the mapping textures in three cameras. The yellow point is the origin of the 3-D model. The 3-D environment model (bottom-right) is composed of tables for the homography transformations are also built for horizontal and vertical patches from these three cameras. accelerating the coordinate mapping in the video visual- ization processing. Furthermore, a novel method to detect moving shadow is also proposed. It consists of two phases. quired from IP cameras deployed in the scene to the 3-D The first phase is an off-line training phase which determines model by specifying corresponding points between the 3-D the threshold of every pixel by judging whether the pixel is model and the 2-D images. Since the cameras are fixed, this in the shadow part. In the second phase, the statistic data configuration procedure can be done only once beforehand. 
of every pixel is updated with time, and the threshold is Then in the on-line monitoring stage, based on the 3-D adjusted accordingly. By this way, a fixed parameters setting model, all videos will be integrated and visualized in a single for detecting shadow can be avoided. The moving objects are view in which the foreground objects extracted from images segmented accurately from the background and are displayed are displayed through billboards. via axis-aligned billboarding for better 3-D visual effects. A. Image registration II. S YSTEM CONFIGURATION For a point on a planar object, its coordinates on the plane Figure 2 illustrates the flowchart of constructing the pro- can be mapped to 2-D image through homography citeho- posed surveillance system. First, we construct lookup tables mography, which is a transformation between two planar for the coordinate transformation from the 2-D images ac- coordinate systems. A homography matrix H represents the 989
  • 3.
    relationship between pointson two planes: sct = Hcs , (1) where s is a scalar factor and cs and ct are a pair of corre- sponding points in the source and target patches, respectively. If there are at least four correspondences where no three correspondences in each patch are collinear, we can estimate H through the least-squares approach. We regard cs as points of 3-D environment model and ct as points of 2-D image and then calculate the matrix H to map points from the 3-D model to the images. In the reverse order, we can also map points from the images to the 3-D model. B. Planar patch modeling Precise camera calibration is not an easy job [13]. In the Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for virtual projector methods [4], [7], the texture image will be rendering (left). More patches make the rendering much better (right). miss-aligned to the model if the camera calibration or the 3-D model reconstruction has large error. Alternatively, we develop a method that approximates the 3-D environment where Iij is the intensity of the point obtained from homog- model through multiple yet individual planar patches and ˜ raphy transformation, Iij is the intensity of the point obtained then renders the image content of every patches to generate from texture mapping, i and j are the coordinates of row and a synthesized and integrated view of the monitored scene. In column in the image, respectively, and m × n represents the this way we can easily construct a surveillance system with dimension of the patch in the 2-D image. In order to have 3-D view of the environment. an reference scale to quantify the distortion amount, a peak Mostly we can model the environment with two basic signal-to-noise ratio is calculated by building components, horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually MAX2 I surrounded by doors and walls, which are modeled as the PSNR = 10 log10 , (3) MSE vertical planes. Both two kinds of planes are further divided into several patches according to the geometry of the scenes where MAXI is the maximum pixel value of the image. (Figure 3). If the scene consists of simple structures, a few Typical values for the PSNR are between 30 and 50 dB and large patches can well represent the scene with less rendering an acceptable value is considered to be about 20 dB to 25 dB costs. On the other hand, more and smaller patches are in this work. We set a threshold T to determine the quality required to accurately render a complex environment, at the of texture mapping by expense of more computational costs. In the proposed system, the 3-D rendering platform is PSNR ≥ T . (4) developed on OpenGL and each patch is divided into tri- angles before rendering. Since linear interpolation is used If the PSNR of the patch is lower than T , the procedure to fill triangles with texture in OpenGL and not suitable divides it into smaller patches and repeats the process until for the perspective projection, distortion will appear in the the PSNR values of every patches are greater than the given rendering result. One can use a lot of triangles to reduce this threshold T . kind of distortion, as shown in Figure 4, it will enlarge the computational burden and therefore not feasible for real-time III. O N - LINE MONITORING surveillance systems. To make a compromise between visualization accuracy and The proposed system displays the videos on the 3-D model. 
rendering cost, we propose a procedure that automatically However, the 3-D foreground objects such as pedestrians are divides each patch into smaller ones and decides suitable projected to image frame and become 2-D objects. They will sizes of patches for accurate rendering (Figure 4). We use the appear flattened on the floor or wall since the system displays following mean-squared error method to estimate the amount them on planar patches. Furthermore, there might be ghosting of distortion when rendering image patches: effects when 3-D objects are in the overlapping areas of m−1 n−1 different camera views. We need to tackle this problem by 1 ˜ MSE = (Iij − Iij )2 , (2) separating and rendering 3-D foreground objects in addition m×n i=0 j=0 to the background environment. 990
  • 4.
    our method suchthat the background doesn’t have to be determined again. In the indoor environment, we assume the color in eq.(7) is similar between shadow and background in a pixel although it is not evidently in sunshine in outdoor. Only the case of indoor environment is considered in this paper. Fig. 5. The tracking results obtained by using different shadow thresholds B. Collecting samples while people stand on different positions of the floor. (a) Tr = 0.8 (b) Samples I(x, y, t) in some frames are collected to decide Tr = 0.3. The threshold value Tθ = 6o is the same for both. the shadow area, where t is the time. In [12] all samples are collected including the classification of background, shadow A. Underlying assumption and foreground by the pixel value changed with time. But if a good background model has already built and some Shadow is a type of foreground noise. It appears in any initial foreground objects were segmented, the background zone of the camera scene. In [8], each pixel belongs to samples are not necessary. Only foreground and shadow a shadow blob is detected by two properties. First, the samples If (x, y, t) were needed to consider. Besides, since color vector of a pixel in shadow blob has similar direction background pixels are dropped from the samples list, this can to that of the background pixel in the same position of save the computer and memory especially in a scene with image. Second, the magnitude of the color vector in the T few objects. Future, If θ (x, y, t) is obtained by dropping the shadow is slightly less than the corresponding color vector of samples which not satisfy inequality eq.(7) from If (x, y, t). background. Similar to [11], RGB or other color space can be Obviously, the samples data composed of more shadows transformed into two dimensional space (called angle-module samples and less foreground samples. This also leads to that space). The color vector of a pixel in position (x, y) of current the threshold r(x, y, t) can be derived more easily than the frame, Ic (x, y), θ(x, y) is the angle between background threshold derived from samples of If (x, y, t). vector Ib (x, y) and Ic (x, y), and the magnitude ratio r(x, y) are defined as C. Deciding module ratio threshold arccos(Ic (x, y) · Ib (x, y)) The initial threshold Tθ (x, y, 0) is set according to the θ(x, y) = (5) experiment. In this case, Tθ (x, y, 0) = cos(6◦ ) is set as |Ic (x, y)||Ib (x, y)| + the initial value. After collecting enough samples, the ini- |Ic (x, y)| r(x, y) = (6) tial module ratio threshold Tr (x, y, 0) can be decided by |Ib (x, y)| this method, Fast step minimum searching (FSMS). FSMS where is a small number to avoid zero denominator. In [11], can fast separate the shadow from foreground distribution the shadow of a pixels have to satisfy which collected samples are described above. The detail of this method is described below. The whole distribution is Tθ < cos θ(x, y) < 1 (7) separated by a window size w. The height of each window Tr < r(x, y) < 1 (8) is the sum of the samples. Besides the background peak, two peaks were found. The threshold Tr is used to search the where Tθ is the angle threshold and Tr is the module peak which is closest to the average background value and ratio threshold. 
According to the demonstration showed in smaller than the background value, the shadow threshold can Figure 5, the best shadow thresholds are highly depends on be found by searching the minimum value or the value close positions (pixels) in the scene, because of the complexity to zero. of environment, the light sources and objects positions. Therefore, we propose a method to automatically adjust the D. Updating angle threshold thresholds for detecting shadow for each pixel. The threshold When a pixel satisfies both conditions in inequality eq.(7, for a pixel to be classified as shadow or not is determined by 8) at the same time, the pixel is classified as shadow. In other the necessary samples (data) collected with time. Only one words, if the pixel Is (x, y) is actually a shadow pixel, and c is classified as one of candidate of shadow by FSMS, the parameter has to be manually initialized. It is Tθ (0), where 0 means the initial time. Then the method can update the property of the pixel is require to satisfy the below equation thresholds automatically and fast. Our method is faster than at the same time the similar idea, GMSM method [12], when a background model has built up. There are two major advantages of 0 ≤ cos θ(x, y, t) < Tθ (x, y, t) (9) the computation time for our method. First, only necessary samples are collected. Second, compared with method [12], Tθ (x, y, t) can be decided by searching the minimum any background or foreground results can combine with cos(θ) of pixels in Is which is obtained by FSMS. However 991
  • 5.
    Fig. 7. Orientationdetermination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle. Fig. 6. A flowchart to illustrate the whole method. The purple part is based on pixel. are always moving on the floor, the billboards can be aligned to be perpendicular to the floor in the 3-D model. The 3-D we propose another method to find out Tθ (x, y, t) more fast. location of the billboard is estimated by mapping the bottom- The number of samples which are classified as shadow or middle point of the foreground bounding box in the 2-D background at time t is ATr (x, y, t) by using FSMS. We {b,s} image through the lookup tables. The ratio between the height define a ratio R(Tr ) = ATr /A{b,s,f } where A{b,s,f } is all {b,s} of the bounding box and the 3-D model determines the height samples in position x, y, where b, s, f represent the back- of the billboard in the 3-D model. The relationship between ground, shadow and foreground respectively. The threshold the direction of a billboard and the viewpoint is defined as Tθ (x, y, t) can be updating to Tθ (x, y, t) by R(Tr ). The shown in Figure 7. number of samples whose cos(θ(x, y)) values are larger than The following equations are used to calculate the rotation the Tθ (x, y, t) is equal to A{b,s} and is required angle of the billboard: R(Tθ (x, y, t)) = R(Tr ) (10) Y = (n × v) , (12) Besides, we add a perturbation δTθ to the Tθ (x, y, t). T Since FSMS only finds out a threshold in If θ (x, y, t), if the φ = cos−1 (v · n) , (13) initial threshold Tθ (x, y, 0) is set larger than true threshold, the best updating threshold is equal to threshold Tθ not where v is the vector from the location of the billboard, L, to smaller than threshold Tθ . Therefore the true angle threshold the location E projected vertically from the viewpoint to the will never be found with time. To solve this problem, a per- floor, n is the normal vector of the billboard, Y is the rotation turbation of the updating threshold is added to the updating axis, and φ is the estimated rotation angle. The normal vector threshold of the billboard is parallel to the vector v and the billboard is always facing toward the viewpoint of the operator. Tθ (x, y, t) = Tθ (x, y, t) − δTθ (11) F. Video content integration Since the new threshold Tθ (x, y, t) has smaller value to cover more samples, it can approach the true threshold If the fields of views of cameras are overlapped, objects in with time. This perturbation can also make the method more these overlapping areas are seen by multiple cameras. In this adaptable to the change of environment. Here is a flowchart case, there might be ghosting effects when we simultaneously Figure 6 to illustrate the whole method. display videos from these cameras. To deal with this problem, we use 3-D locations of moving objects to identify the cor- E. Axis-aligned billboarding respondence of objects in different views. 
When the operator In visualization, axis-aligned billboarding [14] constructs chooses a viewpoint, the rotation angles of the corresponding billboards in the 3-D model for moving objects, such as billboards are estimated by the method presented above and pedestrians, and the billboard always faces to the viewpoint of the system only render the billboard whose rotation angle is the user. The billboard has three properties: location, height, the smallest among all of the corresponding billboards, as and direction. By assuming that all the foreground objects shown in Figure 8. 992
  • 6.
    C1 C3 C2 C1 Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause the ghosting effect (bottom-left). Static background images without Fig. 9. Determination of viewpoint switch. We divide the floor area foreground objects are used to fill the area of the foreground objects (top). depending on the fields of view of the cameras and associate each area to one Ghosting effects are removed and static background images can be update of the viewpoint close to a camera. The viewpoint is switched automatically by background modeling. to the predefined viewpoint of the area containing more foreground objects. G. Automatic change of viewpoint The experimental results shown in Figure 12 demonstrate that the viewpoint can be able to be chosen arbitrarily in The proposed surveillance system provides target tracking the system and operators can track targets with a closer feature by determining and automatic switching the view- view or any viewing direction by moving the virtual camera. points. Before rendering, several viewpoints are specified in Moreover, the moving objects are always facing the virtual advance to be close to the locations of the cameras. During camera by billboarding and the operators can easily perceive the viewpoint switching from one to another, the parameters the spatial information of the foreground objects from any of the viewpoints are gradually changed from the starting viewpoint. point to the destination point for smooth view transition. The switching criterion is defined as the number of blobs V. C ONCLUSIONS found in the specific areas. First, we divide the floor area into In this work we have developed an integrated video surveil- several parts and associate them to each camera, as shown lance system that can provide a single comprehensive view in Figure 9. When people move in the scene, the viewpoint for the monitored areas to facilitate tracking moving targets is switched automatically to the predefined viewpoint of the through its interactive control and immersive visualization. area containing more foreground objects. We also make the We utilize planar patches for 3-D environment model con- billboard transparent by setting the alpha value of textures, so struction. The scenes from cameras are divided into several the foreground objects appear with fitting shapes, as shown patches according to their structures and the numbers and in Figure 10. sizes of patches are automatically determined for compromis- ing between the rendering effects and efficiency. To integrate IV. E XPERIMENT RESULTS video contents, homography transformations are estimated for relationships between image regions of the video contents We developed the proposed surveillance system on a PC and the corresponding areas of the 3D model. Moreover, with Intel Core Quad Q9550 processor, 2GB RAM, and one the proposed method to remove moving cast shadow can nVidia GeForce 9800GT graphic card. Three IP cameras with automatically decide thresholds by on-line learning. In this 352 × 240 pixels resolution are connected to the PC through way, the manual setting can be avoided. Compared with the Internet. The frame rate of the system is about 25 frames per work based on frames, our method increases the accuracy to second. remove shadow. In visualization, the foreground objects are In the monitored area, automated doors and elevators are segmented accurately and displayed on billboards. 
specified as background objects, albeit their image do change when the doors open or close. These areas will be modeled in R EFERENCES background construction and not be visualized by billboards, [1] R. Sizemore, “Internet protocol/networked video surveillance market: the system use a ground mask to indicate the region of Equipment, technology and semiconductors,” Tech. Rep., 2008. interesting. Only the moving objects located in the indicated [2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, “Contextualized videos: Combining videos with environment models to support situa- areas are considered as moving foreground objects, as shown tional understanding,” IEEE Transactions on Visualization and Com- in Figure 11. puter Graphics, 2007. 993
  • 7.
    Fig. 11. Dynamic background removal by ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top- right). A mask covered the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right). Fig. 10. Automatic switching the viewpoint for tracking targets. People Fig. 12. Immersive monitoring at arbitary viewpoint. We can zoom out the walk in the lobby and the viewpoint of the operator automatically switches viewpoint to monitor the whole surveillance area or zoom in the viewpoint to keep track of the targets. to focus on a particular place. [3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, “Accurate transactions on Geosci. and remote sens., 2009. planar image registration for an integrated video surveillance system,” [10] J. Kim and H. Kim, “Efficient regionbased motion segmentation for a Computational Intelligence for Visual Intelligence, 2009. video monitoring system,” Pattern Recognition Letters, 2003. [4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, [11] E. J. Carmona, J. Mart´nez-Cantos, and J. Mira, “A new video seg- ı S. Hsu, D. Nister, and K. Hanna, “Video flashlights: real time ren- mentation method of moving objects based on blob-level knowledge,” dering of multiple videos for immersive model visualization,” in 13th Pattern Recognition Letters, 2008. Eurographics workshop on Rendering, 2002. [12] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast [5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, “Augmented virtual shadows through a multidistribution approach,” IEEE transactions on environments (ave): dynamic fusion of imagery and 3-d models,” IEEE pattern analysis and machine intelligence, 2007. Virtual Reality, 2003. [13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and [6] S. You, J. Hu, U. Neumann, and P. Fox, “Urban site modeling from N. Master, “Calibrated, registered images of an extended urban area,” lidar,” Lecture Notes in Computer Science, 2003. International Journal of Computer Vision, 2003. [7] I. Sebe, J. Hu, S. You, and U. Neumann, “3-d video surveillance [14] A. Fernandes, “Billboarding tutorial,” 2005. with augmented virtual environments,” in International Multimedia Conference, 2003. [8] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-time robust background subtraction and shadow detection,” IEEE ICCV. (1999). [9] K. Chung, Y. Lin, and Y. Huang, “Efficient shadow detection of color aerial images based on successive thresholding scheme,” IEEE 994
  • 8.
    Morphing And TexturingBased On The Transformation Between Triangle Mesh And Point Wei-Chih Hsu Wu-Huang Cheng Department of Computer and Communication Institute of Engineering Science and Technology, Engineering, National Kaohsiung First University of National Kaohsiung First University of Science and Science and Technology. Kaohsiung, Taiwan Technology. Kaohsiung, Taiwan [email protected] Abstract—This research proposes a methodology of [1] has proposed a method to represent multi scale surface. transforming triangle mesh object into point-based object and M. Müller et al. The [2] has developed a method for the applications. Considering the cost and program functions, modeling and animation to show that point-based has the experiments of this paper adopt C++ instead of 3D flexible property. computer graphic software to create the point cloud from Morphing can base on geometric, shape, or other features. meshes. The method employs mesh bounded area and planar Mesh-based morphing sometimes involves geometry, mesh dilation to construct the point cloud of triangle mesh. Two structure, and other feature analysis. The [3] has point-based applications are addressed in this research. 3D demonstrated a method to edit free form surface based on model generation can use point-based object morphing to geometric. The method applies complex computing to deal simplify computing structure. Another application for texture mapping is using the relation of 2D image pixel and 3D planar. with topology, curve face property, and triangulation. The [4] The experiment results illustrate some properties of point- not only has divided objects into components, but also used based modeling. Flexibility and scalability are the biggest components in local-level and global-level morphing. The [5] advantages among the properties of point-based modeling. The has adopted two model morphing with mesh comparison and main idea of this research is to detect more sophisticated merging to generate new model. The methods involved methods of 3D object modeling from point-based object. complicate data structure and computing. This research has illustrated simple and less feature analysis to create new Keywords-point-based modeling; triangle mesh; texturing; model by using regular point to morph two or more objects. morphing Texturing is essential in rendering 3D model. In virtual reality, the goal of texture mapping is try to be as similar to I. INTRODUCTION the real object as possible. In special effect, exaggeration texturing is more suitable for demand. The [6] has built a In recent computer graphic related researches, form•Z, mesh atlas for texturing. The texture atlases' coordinates, Maya, 3DS, Max, Blender, Lightwave, Modo, solidThinking considered with triangle mesh structure, were mapped to 3D and other 3D computer graphics software are frequently model. The [7] has used the conformal equivalence of adopted tools. For example, Maya is a very popular software, triangle meshes to find the flat mesh for texture mapping. and it includes many powerful and efficient functions for This method is more comprehensible and easy to implement. producing results. The diverse functions of software can The rest arrangements are described as followings: increase the working efficiency, but the methodology design Transforming triangle mesh into point set for modeling are must follow the specific rules and the cost is usually high. 
addressed in Section II and III, and that introduce point- Using C++ as the research tool has many advantages, based morphing for model creating. The point-based texture especially in data information collection. Powerful functions mapping is addressed in Section IV, and followed by the can be created by C language instructions, parameters and conclusion of Section V. C++ oriented object. More complete the data of 3D be abstracted, more unlimited analysis can be produced. II. TRANSFORMING TRIANGLE MESH INTO POINT SET The polygon mesh is widely used to represent 3D models In order to implement the advantages of point-based and has some drawbacks in modeling. Unsmooth surface of model, transforming triangle mesh into point is the first step. combined meshes is one of them. Estimating vertices of The point set can be estimated by using three normal bound objects and constructing each vertex set of mesh are the lines of triangle mesh. The normal denoted by n can be factors of modeling inefficiency. Point-based modeling is the calculated by three triangle vertices. The point in the triangle solution to conquer some disadvantages of mesh modeling. area is denoted by B in , A denotes the triangle mesh area, the Point-based modeling is based on point primitives. No structure of each point to another is needed. To simplify the 3D space planar can be presented by p with coordinate point based data can employ marching cube and Delaunay ( x, y , z ) , vi =1, 2,3 denotes three triangle vertices of triangle triangulation to transform point-based model into polygon mesh. Mark Pauly has published many related researches mesh, v denotes the mean of three triangle vertices. The about point-based in international journals as followings: the formula that presents the triangle area is described below. 995
  • 9.
    A = {p ( x, y , z ) | pn T − v i n T = 0 , i ∈ (1,2,3), p ∈ Bin } The experiments use some objects file which is the wave front file format (.obj) from NTU 3D model database ver.1 Bin = { p( x, y, z ) | f (i , j ) ( p) × f (i , j ) (v) > 0} of National Taiwan University. The process of transforming f (i , j ) ( p) = r × a − b + s triangle mesh into point-based is shown in Figure 1. It is clear to see that some areas with uncompleted the whole b j − bi point set shown in red rectangle of figure 1. The planar r= , s = bi - r × ai dilation process is employed to refine fail areas. a j − ai Planar dilation process uses 26-connected planar to refine i, j = 1,2,3 a , b = x, y , z i < j a<b the spots leaved in the area. The first half portion of Figure 2 shows 26 positions of connected planar. If any planar and its 26 neighbor positions are the object planar is the condition. The main purpose to estimate the object planar is to verify the condition is true. The result in second half portion of Figure 2 reveals the efficiency of planar dilation process. III. POINT-BASED MORPHING FOR MODEL CREATING The more flexible in objects combining is one of property of point-based. No matter what the shape or category of the objects, the method of this study can put them into morphing process to create new objects. The morphing process includes 3 steps. Step one is to Figure 1. The process of transforming triangle mesh into point-based equalize the objects. Step two is to calculate each normal point of objects in morphing process. Step three is to estimate each point of target object by using the same normal point of two objects with the formula as described below. n −1 ot = p r1o1 + p r 2 o2 + ⋅ ⋅ ⋅ + (1 − ∑ p ri )o n i =1 n 0 ≤ p r1 , p r 2 ,⋅ ⋅ ⋅, p r ( n −1) ≤ 1 , ∑ p ri = 1 i =1 ot presents each target object point of morphing, and oi is the object for morphing process. p ri donates the object effect weight in morphing process, and i indicates the number of object. The new model appearance generated from morphing is depended on which objects were chosen and the value of each object weight as well. The research experiments use two objects, therefore i = 1 or 2, n = 2 . The results are shown in Figure 3. First row is a simple flat board and a character morphing. The second row shows the object selecting free in point-based modeling, because two totally different objects can be put into morphing and produced the satisfactory results. The models the created by objects morphing with different weights can be seen in figure 4. IV. POINT-BASED TEXTURE MAPPING Texturing mapping is a very plain in this research method. It uses a texture metric to map the 3D model to the 2D image pixel by using the concept of 2D image transformation into 3D. Assuming 3D spaces is divided into α × β blocks, α is the number of row, and β is the number of column. Hence the length, the width, and the height of 3D space is h × h × h ; afterwards the ( X , Y ) and ( x. y, z ) will denote the image coordination and 3D model respectively. Figure 2. Planar dilation process. The texture of each block is assigned by texture cube, and it 996
  • 10.
    is made by2D image as shown in the middle image of first confirmed by the scalability and flexibility of proposed raw in figure 5. The process can be expressed by a formula methodologies. as below. At T = c T REFERENCES h h h [1] MARK PAULY, “Point-Based Multiscale Surface t = [ x mod , y mod , z mod ] , c = [ X,Y ] α β β Representation,” ACM Transactions on Graphics, Vol. 25, No. 2, pp. 177–193, April 2006. ⎡α 0 0⎤ A=⎢ β (h − z ) ⎥ [2] M. Müller1, R. Keiser1, A. Nealen2, M. Pauly3, M. Gross1 ⎢ 0 0 ⎥ and M. Alexa2, “Point Based Animation of Elastic, Plastic ⎣ y ⎦ and Melting Objects,” Eurographics/ACM SIGGRAPH A denotes the texture transforming metric, t denotes the 3D Symposium on Computer Animation, pp. 141-151, 2004. model current position, c denotes the image pixel content in [3] Theodoris Athanasiadis, Ioannis Fudos, Christophoros Nikou, the current position. “Feature-based 3D Morphing based on Geometrically The experiment results are shown in the second row of Constrained Sphere Mapping Optimization,” SAC’10 Sierre, figure 5 and 6. The setting results α = β = 2 are shown in Switzerland, pp. 1258-1265, March 22-26, 2010. second row of figure 5. The setting results α = β = 4 create [4] Yonghong Zhao, Hong-Yang Ong, Tiow-Seng Tan and Yongguan Xiao, “Interactive Control of Component-based the images are shown in the first row of figure 6. The last Morphing,” Eurographics/SIGGRAPH Symposium on row images of figure 6 indicate the proposed texture Computer Animation , pp. 340-385, 2003. mapping method can be applied into any point-based model. [5] Kosuke Kaneko, Yoshihiro Okada and Koichi Niijima, “3D V. CONCLUSION Model Generation by Morphing,” IEEE Computer Graphics, Imaging and Visualisation, 2006. In sum, the research focuses on point-based modeling [6] Boris Springborn, Peter Schröder, Ulrich Pinkall, “Conformal applications by using C++ instead of convenient facilities or Equivalence of Triangle Meshes,” ACM Transactions on other computer graphic software. The methodologies that Graphics, Vol. 27, No. 3, Article 77, August 2008. developed by point-based include the simple data structure properties and less complex computing. Moreover, the [7] NATHAN A. CARR and JOHN C. HART, “Meshed Atlases for Real-Time Procedural Solid Texturing,” ACM methods can be compiled with two applications morphing Transactions on Graphics, Vol. 21,No. 2, pp. 106–131, April and texture mapping. The experiment results have been 2002. Figure 3. The results of point-based modeling using different objects morphing. 997
Figure 4. The models created by object morphing with different weights.
Figure 5. The process of 3D model texturing with a 2D image (first row) and the results (second row).
Figure 6. The results of point-based texture mapping with α = β = 4 and different objects.
    LAYERED LAYOUTS OFDIRECTED GRAPHS USING A GENETIC ALGORITHM Chun-Cheng Lin1,∗, Yi-Ting Lin2 , Hsu-Chun Yen2,† , Chia-Chen Yu3 1 Dept. of Computer Science, Taipei Municipal University of Education, Taipei, Taiwan 100, ROC 2 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 106, ROC 3 Emerging Smart Technology Institute, Institute for Information Industry, Taipei, Taiwan, ROC ABSTRACT charts, maps, posters, scheduler, UML diagrams, etc. It is important that a graph be drawn “clear”, By layered layouts of graphs (in which nodes are such that users can understand and get information distributed over several layers and all edges are di- from the graph easily. This paper focuses on lay- rected downward as much as possible), users can ered layouts of directed graphs, in which nodes are easily understand the hierarchical relation of di- distributed on several layers and in general edges rected graphs. The well-known method for generat- should point downward as shown in Figure 1(b). ing layered layouts proposed by Sugiyama includes By this layout, users can easily trace each edge from four steps, each of which is associated with an NP- top to bottom and understand the priority or order hard optimization problem. It is observed that the information of these nodes clearly. four optimization problems are not independent, in the sense that each respective aesthetic criterion may contradict each other. That is, it is impossi- ble to obtain an optimal solution to satisfy all aes- thetic criteria at the same time. Hence, the choice for each criterion becomes a very important prob- lem. In this paper, we propose a genetic algorithm to model the first three steps of the Sugiyama’s al- gorithm, in hope of simultaneously considering the Figure 1: The layered layout of a directed graph. first three aesthetic criteria. Our experimental re- sults show that this proposed algorithm could make Specifically, we use the following criteria to es- layered layouts satisfy human’s aesthetic viewpoint. timate the quality of a directed graph layout: to minimize the total length of all edges; to mini- Keywords: Visualization, genetic algorithm, mize the number of edge crossings; to minimize the graph drawing. number of edges pointing upward; to draw edges as straight as possible. Sugiyama [9] proposed a 1. INTRODUCTION classical algorithm for producing layered layouts of directed graphs, consisting of four steps: cycle Drawings of directed graphs have many applica- removal, layer assignment, crossing reduction, and tions in our daily lives, including manuals, flow assignment of horizontal coordinates, each of which ∗ Research supported in part by National Science Council addresses a problem of achieving one of the above under grant NSC 98-2218-E-151-004-MY3 criteria, respectively. Unfortunately, the first three † Research supported in part by National Science Council problems have been proven to be NP-hard when the under grant NSC 97-2221-E-002-094-MY3 width of the layout is restricted. There has been 1000
    a great dealof work with respect to each step of is quite different between drawing layered layouts Sugiyama’s algorithm in the literature. of acyclic and cyclic directed graphs. In acyclic Drawing layered layouts by four independent graphs, one would not need to solve problems on steps could be executed efficiently, but it may not cyclic removal. If the algorithm does not restrict always obtain nice layouts because preceding steps the layer by a fixed width, one also would not need may restrain the results of subsequent steps. For to solve the limited layer assignment problem. Note example, four nodes assigned at two levels after the that the unlimited-width layer assignment is not an layer assignment step lead to an edge crossing in NP-hard problem, because the layers of nodes can Figure 2(a), so that the edge crossing cannot be be assigned by a topological logic ordering. The removed during the subsequent crossing reduction algorithm in [10] only focuses on minimizing the step, which only moves each node’s relative posi- number of edge crossings and making the edges as tion on each layer, but in fact the edge crossing straight as possible. Although it also combined can be removed as drawn in Figure 2(b). Namely, three steps of Sugiyama’s algorithm, but it only the crossing reduction step is restricted by the layer contained one NP-hard problem. Oppositely, our assignment step. Such a negative effect exists ex- algorithm combined three NP-hard problems, in- clusively not only for these two particular steps but cluding cycle removal, limited-width layer assign- also for every other preceding/subsequent step pair. ment, and crossing reduction. In addition, our algorithm has the following ad- vantages. More customized restrictions on layered layouts are allowed to be added in our algorithm. For example, some nodes should be placed to the (a) (b) left of some other nodes, the maximal layer number should be less than or equal to a certain number, Figure 2: Different layouts of the same graph. etc. Moreover, the weighting ratio of each optimal criterion can be adjusted for different applications. Even if one could obtain the optimal solution for According to our experimental results, our genetic each step, those “optimal solutions” may not be the algorithm may effectively adjust the ratio between real optimal solution, because those locally optimal edge crossings number and total edge length. That solutions are restricted by their respective preced- is, our algorithm may make layered layouts more ing steps. Since we cannot obtain the optimal solu- appealing to human’s aesthetic viewpoint. tion satisfying all criteria at the same time, we have to make a choice in a trade-off among all criteria. For the above reasons, the basic idea of our 2. PRELIMINARIES method for drawing layered layouts is to combine the first three steps together to avoid the restric- tions due to criterion trade-offs. Then we use the The frameworks of three different algorithms for genetic algorithm to implement our idea. In the layered layouts of directed graphs (i.e., Sugiyama’s literature, there has existed some work on produc- algorithm, the cyclic leveling algorithm, and our ing layered layouts of directed graphs using ge- algorithm) are illustrated in Figure 2(a)–2(c), re- netic algorithm, e.g., using genetic algorithm to re- spectively. See Figure 2. 
Sugiyama’s algorithm duce edge crossings in bipartite graphs [7] or entire consists of four steps, as mentioned previously; the acyclic layered layouts [6], modifying nodes in a other two algorithms are based on Sugiyama’s algo- subgraph of the original graph on a layered graph rithm, in which the cyclic leveling algorithm com- layout [2], drawing common layouts of directed or bines the first two steps, while our genetic algo- undirected graphs [3] [11], and drawing layered lay- rithm combines the first three steps. Furthermore, outs of acyclic directed graphs [10]. a barycenter algorithm is applied to the crossing re- Note that the algorithm for drawing layered lay- duction step of the cyclic leveling and our genetic outs of acyclic directed graphs in [10] also com- algorithms, and the priority layout method is ap- bined three steps of Sugiyama’s algorithm, but it plied to the x-coordinate assignment step. 1001
    Sugiyama’s Algorithm Cyclic Leveling Genetic Algorithm Cycle Removel Cycle Removel Cycle Removel edge-node crossing Layer Assignment Layer Assignment Layer Assignment edge crossing Crossing Reduction (a) An edge crossing. (b) An edge-node crossing Crossing Reduction Crossing Reduction Barycenter Algorithm x-Coordinte Assignment x-Coordinte Assignment Priority Layout Method x-Coordinte Assignment Figure 4: Two kinds of crossings. (a) Sugiyama (b) Cyclic Leveling (c) Our we reverse as few edges as possible such that the Figure 3: Comparison among different algorithms. input graph becomes acyclic. This problem can be stated as the maximum acyclic subgraph prob- 2.1. Basic Definitions lem, which is NP-hard. (2) Layer assignment: Each node is assigned to a layer so that the total vertical A directed graph is denoted by G(V, E), where V is length of all edges is minimized. If an edge spans the set of nodes and E is the set of edges. An edge across at least two layers, then dummy nodes should e is denoted by e = (v1 , v2 ) ∈ E, where v1 , v2 ∈ V ; be introduced to each crossed layer. If the maxi- edge e is directed from v1 to v2 . A so-called layered mum width is bounded greater or equal to three, layout is defined by the following conditions: (1) the problem of finding a layered layout with min- Let the number of layers in this layout denoted by imum height is NP-compete. (3) Crossings reduc- n, where n ∈ N and n ≥ 2. Moreover, the n-layer tion: The relative positions of nodes on each layer layout is denoted by G(V, E, n). (2) V is parti- are reordered to reduce edges crossings. Even if we tioned into n subsets: V = V1 ∪ V2 ∪ V3 ∪ · · · ∪ Vn , restrict the problem to bipartite (two-layer) graphs, where Vi ∩ Vj = ∅, ∀i ̸= j; nodes in Vk are assigned it is also an NP-hard problem. (4) x-coordinate as- to layer k, 1 ≤ k ≤ n. (3) A sequence ordering, signment: The x-coordinates of nodes and dummy σi , of Vi is given for each i ( σi = v1 v2 v3 · · · v|Vi | nodes are modified, such that all the edges on the with x(v1 ) < x(v2 ) < · · · < x(v|Vi | )). The n- original graph structure are as straight as possi- layer layout is denoted by G(V, E, n, σ), where σ = ble. This step includes two objective functions: to (σ1 , σ2 , · · · , σn ) with y(σ1 ) < y(σ2 ) < · · · < y(σn ). make all edges as close to vertical lines as possible; An n-layer layout is called “proper” when it fur- to make all edge-paths as straight as possible. ther satisfies the following condition: E is parti- tioned into n − 1 subsets: E = E1 ∪ E2 ∪ E3 ∪ 2.3. Cyclic Leveling Algorithm · · · ∪ En−1 , where Ei ∩ Ej = ∅, ∀i ̸= j, and Ek ⊂ Vk × Vk+1 , 1 ≤ k ≤ n − 1. The cyclic leveling algorithm (CLA) [1] combines the first two steps of Sugiyama’s algorithm, i.e., it An edge crossing (assuming that the layout is focuses on minimizing the number of edges point- proper) is defined as follows. Consider two edges ing upward and total vertical length of all edges. e1 = (v11 , v12 ), e2 = (v21 , v22 ) Ei, in which v11 It introduces a number called span that represents and v21 are the j1 -th and the j2 -th nodes in σi , the number of edges pointing upward and the total respectively; v12 and v22 are the k1 -th and the k2 - vertical length of all edges at the same time. th nodes in σi+1 , respectively. If either j1 < j2 & k1 > k2 or j1 > j2 & k1 < k2 , there is an edge The span number is defined as follows. Consider crossing between e1 and e2 (see Figure 4(a)). a directed graph G = (V, E). 
Given k ∈ N, define a layer assignment function ϕ : V → {1, 2, · · · , k}. An edge-node crossing is defined as follows. Con- Let span(u, v) = ϕ(v) − ϕ(u), if ϕ(u) < ϕ(v); sider an edge e = (v1 , v2 ), where v1 , v2 ∈ V i; v1 span(u, v) = ϕ(v) − ϕ(u) + k, otherwise. For each and v2 are the j-th and the k-th nodes in σi , re- edge e = (u, v) ∈ E, denote span(e) = span(u, v) ∑ spectively. W.l.o.g., assuming that j > k, there are and span(G) = e∈E span(e). In brief, span (k − j − 1) edge-node crossings (see Figure 4(b)). means the sum of vertical length of all edges and the penalty of edges pointing upward or horizontal, 2.2. Sugiyama’s Algorithm provided maximum height of this layout is given. Sugiyama’s algorithm [9] consists of four steps: (1) The main idea of the CLA is: if a node causes Cycle removal: If the input directed graph is cyclic, a high increase in span, then the layer position of 1002
    the node wouldbe determined later. In the algo- then priority(v) = B − (|k − m/2|), in which B is a rithm, the distance function is defined to decide big given number, and m is the number of nodes in which nodes should be assigned first and is ap- layer k; if down procedures (resp., up procedures), plied. There are four such functions as follows, then priority(v) = connected nodes of node v on but only one can be chosen to be applied to all layer p − 1 (resp., p + 1). the nodes: (1) Minimum Increase in Span Moreover, the x-coordinate position of each node = minϕ(v)∈{1,··· ,k} span(E(v, V ′ )); (2) Minimum v is defined as the average x-coordinate position of Average Increase in Span (MST MIN AVG) connected nodes of node v on layer k − 1 (resp., = minϕ(v)∈{1,··· ,k} span(E(v, V ′ ))/E(v, V ′ ); (3) k + 1), if down procedures (resp., up procedures). Maximum Increase in Span = 1/δM IN (v); (4) Maximum Average Increase in Span = 2.6. Genetic Algorithm 1/δM IN AV G (v). From the experimental results in [1], using “MST MIN AVG” as the distance The Genetic algorithm (GA) [5] is a stochastic function yields the best result. Therefore, our global search method that has proved to be success- algorithm will be compared with the CLA using ful for many kinds of optimization problems. GA MST MIN AVG in the experimental section. is categorized as a global search heuristic. It works with a population of candidate solutions and tries 2.4. Barycenter Algorithm to optimize the answer by using three basic princi- The barycenter algorithm is a heuristic for solv- ples, including selection, crossover, and mutation. ing the edge crossing problem between two lay- For more details on GA, readers are referred to [5]. ers. The main idea is to order nodes on each layer by its barycentric ordering. Assuming that 3. OUR METHOD node u is located on the layer i (u ∈ Vi ), the The major issue for drawing layered layouts of di- barycentric∑ value of node u is defined as bary(u) = rected graphs is that the result of the preceding step (1/|N (u)|) v∈N (u) π(v), where N (u) is the set may restrict that of the subsequent step on the first consisting of u’s connected nodes on u’s below or three steps of Sugiyama’s algorithm. To solve it, we above layer (Vi−1 or Vi+1 ); π(v) is the order of v design a GA that combines the first three steps of in σi−1 or σi+1 . The process in this algorithm is Sugiyama’s algorithm. Figure 5 is the flow chart reordering the relative positions of all nodes accord- of our GA. That is, our method consists of a GA ing to the ordering: layer 2 to layer n and then layer and an x-coordinate assignment step. Note that n − 1 to layer 1 by barycentric values. the barycenter algorithm and the priority method are also used in our method, in which the former is 2.5. Priority Layout Method used in our GA to reduce the edge crossing, while The priority layout method solves the x-coordinate the latter is applied to the x-coordinate assignment assignment problem. Its idea is similar to the step of our method. barycenter algorithm. It assigns the x-coordinate position of each node layer to layer according to the Initialization priority value of each node. At first, these nodes’ x-coordinate positions in each layer are given by xi = x0 + k, where x0 is k Assign dummy nodes i a given integer, and xk is the k-th element of σi . Draw the best Chromosome Terminate? 
Barycenter Next, nodes’ x-coordinate positions are adjusted Fine tune Selection according to the order from layer 2 to layer n, layer n − 1 to layer 1, and layer n/2 to layer n. The im- Mutation Remove dummy nodes provements of the positions of nodes from layer 2 to Crossover layer n are called down procedures, while those from layer n−1 to layer 1 are called up procedures. Based on the above, the priority value of a k-th node v on Figure 5: The flow chart of our genetic algorithm. layer p is defined as: if node v is a dummy node, 1003
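As a rough sketch of the barycenter heuristic described in Section 2.4, the following reorders one layer by the average position of each node's neighbors on the adjacent layer; the data layout (a list of node ids plus a map of neighbor positions) is an assumption made for the example, not the paper's data structure.

```python
def barycenter_order(layer, adjacent_positions):
    """Reorder one layer by barycentric values.

    layer              : node ids of this layer in their current order.
    adjacent_positions : dict mapping each node u to the positions pi(v) of
                         its neighbors on the adjacent layer (above or below).
    """
    def bary(item):
        index, node = item
        positions = adjacent_positions.get(node, [])
        if not positions:
            return float(index)                 # isolated nodes keep their place
        return sum(positions) / len(positions)  # bary(u) = (1/|N(u)|) * sum pi(v)

    return [node for _, node in sorted(enumerate(layer), key=bary)]
```

A full crossing-reduction pass applies this reordering from layer 2 down to layer n and then from layer n − 1 back up to layer 1, recomputing the neighbor positions from the layer just reordered.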
    3.1. Definitions 4. MAIN COMPONENTS OF OUR GA For arranging nodes on layers, if the relative hori- Initialization: For each chromosome, we ran- √ √ zontal positions of nodes are determined, then the domly assign nodes to a ⌈ |V |⌉ × ⌈ |V |⌉ grid. exact x-coordinate positions of nodes are also de- Selection: To evaluate the fitness value of each termined according to the priority layout method. chromosome, we have to compute the number of Hence, in the following, we only consider the rela- edge crossings, which however cannot be computed tive horizontal positions of nodes, and each node is at this step, because the routing of each edge is arranged on a grid. We use GA to model the lay- not determined yet. Hence, some dummy nodes ered layout problem, so define some basic elements: should be introduced to determine the routing of Population: A population (generation) includes edges. In general, these dummy nodes are placed many chromosomes, and the number of chromo- on the best relative position with the optimal edge somes depends on setting of initial population size. crossings between two adjacent layers. Neverthe- Chromosome: One chromosome represents one less, permuting these nodes on each layer for the graph layout, where the absolute position of each fewest edge crossings is an NP-hard problem [4]. (dummy) node on the grid is recorded. Since the Hence, the barycenter algorithm (which is also used adjacencies of nodes and the directions of edges by the CLA) is applied to reducing edge crossings will not be altered after our GA, we do not need on each chromosome before selection. Next, the record the information on chromosomes. On this selection step is implemented by the truncation se- grid, one row represents one layer; a column rep- lection, which duplicates the best (selection rate × resents the order of nodes on the same layer, and population size) chromosomes (1/selection rate) these nodes on the same layer are always placed times to fill the entire population. In addition, we successively. The best-chromosome window reserves use a best-chromosome window to reserve some of the best several chromosomes during all antecedent the best chromosomes in the previous generations generations; the best-chromosome window size ra- as shown in Figure 6. tio is the ratio of the best-chromosome window size Best-Chromosome Window to the population size. Best-Chromosome Window Fitness Function: The ‘fitness’ value in our def- duplicate inition is abused to be defined as the penalty for the bad quality of chromosome. That is, larger ‘fit- Parent Population Child Population Child Population ness’ value implies worse chromosome. Hence, our GA aims to find the chromosome with minimal ‘fit- ness’ value. Some aesthetical criteria to determine Figure 6: The selection process of our GA. the quality of chromosomes (layouts) are given as follows (noticing that these criteria are referred Crossover: Some main steps of our crossover pro- ∑7 from [8] and [9]): f itness value = i=1 Ci × Fi cess are detailed as follows: (1) Two ordered par- where Ci are constants, 1 ≤ i ≤ 7, ∀i; F1 is the to- ent chromosomes are called the 1st and 2nd parent tal edge vertical length; F2 is the number of edges chromosome. W.l.o.g., we only introduce how to pointing upward; F3 is the number of edges point- generate the first child chromosome from the two ing horizontally; F4 is the number of edge crossing; parent chromosomes, and the other child is similar. 
F5 is the number of edge-node crossing; F6 is the (2) Remove all dummy nodes from these two par- degree of layout height over limited height; F7 is ent chromosomes. (3) Choose a half of the nodes the degree of layout width over limited width. from each layer of the 1st parent chromosome and In order to experimentally compare our GA place them on the same relative layers of child chro- with the CLA in [1], the fitness function of our mosome in the same horizontal ordering. (4) The GA is tailored to satisfy the CLA as follows: information on the relative positions of the remain- f itness value = span + weight × edge crossing + ing nodes all depends on the 2nd chromosomes. C6 × F6 + C7 × F7 where we will adjust the weight Specifically, we choose a node adjacent to the small- of edge crossing number in our experiment to rep- est number of unplaced nodes until all nodes are resent the major issue which we want to discuss. placed. If there are many candidate nodes, we ran- 1004
    domly choose one.The layer of the chosen node is Note that the x-coordinate assignment problem equal to base layer plus relative layer, where base (step 4) is solved by the priority layout method layer is the average of its placed connected nodes’ in our experiment. In fact, this step would not layers in the child chromosome and relative layer is affect the span number or edge crossing number. In the relative layer position of its placed connected addition, the second step of Sugiyama’s algorithm nodes’ layers in the 2nd parent chromosome. (5) (layer assignment) is an NP-hard problem when the The layers of this new child chromosome are mod- width of the layered layout is restricted. Hence, ified such that layers start from layer 1. we will respectively investigate the cases when the Mutation: In the mutated chromosome, a node width of the layered layout is limited or not. is chosen randomly, and then the position of the chosen node is altered randomly. 5.1. Experimental Environment Termination: If the difference of average fitness All experiments run on a 2.0 GHz dual core lap- values between successive generations in the latest top with 2GB memory under Java 6.0 platform ten generations is ≤ 1% of the average fitness value from Sun Microsystems, Inc. The parameters of of these ten generations, then our GA algorithm our GA are given as follows: Population size: stops. Then, the best chromosome from the latest 100; Max Generation: 100; Selection Rate: 0.7; population is chosen, and its corresponding graph Best-Chromosome Window Size Ratio: 0.2; Mutate layout (including dummy nodes at barycenter po- Probability: 0.2; C6 : 500; C7 : 500; f itness value = sitions) is drawn. span + weight × edgecrossing + C6 × F6 + C7 × F7 . Fine Tune: Before the selection step or after the termination step, we could tune better chromo- 5.2. Unlimited Layout Width somes according to the fitness function. For ex- ample, we remove all layers which contain only Because it is necessary to limit the layout width dummy nodes but no normal nodes, called dummy and height for L M algorithm, we set both limits layers. Such a process does not necessarily worsen for width and height to be 30. It implies that there the edge crossing number but it would improve are at most 30 nodes (dummy nodes excluded) on the span number. In addition, some unnecessary each layer and at most 30 layers in each layout. If dummy nodes on each edge can also be removed we let the maximal node number to be 30 in our after the termination step, in which the so-called experiment, then the range for node distribution unnecessary dummy node is a dummy node that is equivalently unlimited. In our experiments, we is removed without causing new edge crossings or consider a graph with 30 nodes under three differ- worsening the fitness value. ent densities (2%, 5%, 10%), in which the density is the ratio of edge number to all possible edges, 5. EXPERIMENTAL RESULTS i.e. density = edge number/(|V |(|V | − 1)/2). Let the weight ratio of edge crossing to span be de- To evaluate the performance of our algorithm, our noted by α. In our experiments, we consider five algorithm is experimentally compared with the different α values 1, 3, 5, 7, 9. The statistics for CLA (combing the first two steps of Sugiyama’s the experimental results is given in Table 1. algorithm) using MST MIN AVG as the distance Consider an example of a 30-node graph with function [1], as mentioned in the previous sections. 5% density. 
The layered layout by the LM B algo- For convenience, the CLA using MST MIN AVG rithm, our algorithm under α = 1 and α = 9 are distance function is called as the L M algorithm shown in Figure 7, Figure 8(a) and Figure 8(b), re- (Leveling with MST MIN AVG). The L M algo- spectively. Obviously, our algorithm performs bet- rithm (for step 1 + step 2) and barycenter algo- ter than the LM B. rithm (for step 3) can replace the first three steps in Sugiyama’s algorithm. In order to be compared 5.3. Limited Layout Width with our GA (for step 1 + step 2 + step 3), we con- sider the algorithm combining the L M algorithm The input graph used in this subsection is the same and barycenter algorithm, which is called LM B al- as the previous subsection (i.e., a 30-node graph). gorithm through the rest of this paper. The limited width is set to be 5, which is smaller 1005
    Table 1: Theresult after redrawing random graphs with 30 nodes and unlimited layout width. method measure density =2%density=5%density=10% span 30.00 226.70 798.64 LM B crossing 4.45 57.90 367.00 running time 61.2ms 151.4ms 376.8ms α =1 span 30.27 253.93 977.56 crossing 0.65 38.96 301.75 α =3 span 31.05 277.65 1338.84 crossing 0.67 32.00 272.80 our α =5 span 30.78 305.62 1280.51 GA crossing 0.67 29.89 218.45 α =7 span 32.24 329.82 1359.46 crossing 0.75 26.18 202.53 (a) α = 1 (b) α = 9 α =9 span 31.65 351.36 1444.27 crossing 0.53 24.89 200.62 (span: 188, crossing: 30)(span: 238, crossing: 14) running time 3.73s 17.32s 108.04s Figure 8: Layered layouts by our GA. Table 2: The result after redrawing random graphs with 30 nodes and limited layout width 5. method measure density =2%density=5%density=10% span 28.82 271.55 808.36 LM B crossing 5.64 59.09 383.82 running time 73.0ms 147.6ms 456.2ms Figure 7: Layered layout by LM B (span:262, α =1 span 32.29 271.45 1019.56 crossing:38). crossing 0.96 39.36 292.69 α =3 span 31.76 294.09 1153.60 crossing 0.80 33.16 232.76 our α =5 span 31.82 322.69 1282.24 GA crossing 0.82 30.62 202.31 than the square root of the node number (30), be- α =7 span 32.20 351.00 1369.73 cause we hope the results under limit and unlimited crossing 0.69 27.16 198.20 α =9 span 33.55 380.20 1420.31 conditions have obvious differences. The statistics crossing 0.89 24.95 189.25 for the experimental results under the same settings running time 3.731s 3.71s 18.07s in the previous subsection is given in Table 2. Consider an example of a 30-node graph with 5% density. The layered layout for this graph by the our GA may produce simultaneously span and edge LM B algorithm, our algorithm under α = 1 and crossing numbers both smaller than that by LM B. α = 9 are shown in Figure 9, Figure 10(a) and Fig- ure 10(b), respectively. Obviously, our algorithm Moreover, we discovered that under any condi- also performs better than the LM B. tions the edge crossing number gets smaller and the span number gets greater when increasing the weight of edge crossing. It implies that we may ef- 5.4. Discussion fectively adjust the weight between edge crossings Due to page limitation, only the case of 30-node and spans. That is, we could reduce the edge cross- graphs is included in this paper. In fact, we con- ing by increasing the span number. ducted many experiments for various graphs. Be- Under limited width condition, because the re- sides those results, those tables and figures show sults of L M are restricted, its span number should that under any conditions (node number, edge den- be larger than that under unlimited condition. sity, and limited width or not) the crossing number However, there are some unusual situations in our by our GA is smaller than that by LM B. How- GA. Although the results of our GA are also re- ever, the span number by our GA is not neces- stricted under limited width condition, its span sarily larger than that by LM B. When the layout number is smaller than that under unlimited width width is limited and the node number is sufficiently condition. Our reason is that the limited width small (about 20 from our experimental evaluation), condition may reduce the possible dimension. In 1006
this problem, the dimension represents the positions at which nodes could be placed. Furthermore, if the dimension is smaller, then our GA can converge to a better result more easily.

Figure 9: Layered layout by the LM B algorithm (span: 288, crossing: 29) with limited layout width = 5.
Figure 10: Layered layouts by our GA. (a) α = 1 (span: 252, crossing: 29); (b) α = 9 (span: 295, crossing: 14).

6. CONCLUSIONS

This paper has proposed an approach for producing layered layouts of directed graphs, which uses a GA to simultaneously consider the first three steps of the classical Sugiyama algorithm (consisting of four steps) and applies the priority layout method for the fourth step. Our experimental results revealed that our GA may efficiently adjust the weighting ratios among all aesthetic criteria.

ACKNOWLEDGEMENT

This study is conducted under the "Next Generation Telematics System and Innovative Applications/Services Technologies Project" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.

REFERENCES

[1] C. Bachmaier, F. Brandenburg, W. Brunner, and G. Lovász. Cyclic leveling of directed graphs. In Proc. of GD 2008, volume 5417 of LNCS, pages 348–359, 2008.
[2] H. do Nascimento and P. Eades. A focus and constraint-based genetic algorithm for interactive directed graph drawing. Technical Report 533, University of Sydney, 2002.
[3] T. Eloranta and E. Mäkinen. TimGA: A genetic algorithm for drawing undirected graphs. Divulgaciones Matemáticas, 9(2):55–171, 2001.
[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4(3):312–316, 1983.
[5] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
[6] P. Kuntz, B. Pinaud, and R. Lehn. Minimizing crossings in hierarchical digraphs with a hybridized genetic algorithm. Journal of Heuristics, 12(1-2):23–36, 2006.
[7] E. Mäkinen and M. Sieranta. Genetic algorithms for drawing bipartite graphs. International Journal of Computer Mathematics, 53:157–166, 1994.
[8] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501–516, 2002.
[9] K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109–125, 1981.
[10] J. Utech, J. Branke, H. Schmeck, and P. Eades. An evolutionary algorithm for drawing directed graphs. In Proc. of CISST'98, pages 154–160. CSREA Press, 1998.
[11] Q.-G. Zhang, H.-Y. Liu, W. Zhang, and Y.-J. Guo. Drawing undirected graphs with genetic algorithms. In Proc. of ICNC 2005, volume 3612 of LNCS, pages 28–36, 2005.
    Structured Local BinaryHaar Pattern for Graphics Retrieval Song-Zhi Su, Shu-Yuan Chen*, Shang-An Li Cognitive Science Department of Xiamen University, Department of Computer Science and Engineering of Fujian Key Laboratory of the Brain-like Intelligent Yuan Ze University, Taiwan Systems (Xiamen University), Xiamen, China *correspondence author, [email protected] [email protected] Der-Jyh Duh Shao-Zi Li Department of Computer Science and Information Cognitive Science Department of Xiamen University, Engineering, Ching Yun University, Taiwan Fujian Key Laboratory of the Brain-like Intelligent [email protected] Systems (Xiamen University), Xiamen, China [email protected] Abstract—Feature extraction is an important issue in graphics histogram indexing structure to addresses two issues in shape retrieval. Local feature based descriptors are currently the retrieval problem: perceptual similarity measure on partial predominate method used in image retrieval and object query and overcoming dimensionality curse and adverse recognition. Inspired by the success of Haar feature and Local environment. Chalechale et al. [6] proposed a sketch-based Binary Pattern (LBP), a novel feature named structured local image retrieval system, in which feature extraction for binary Haar pattern (SLBHP) is proposed for graphics matching purpose is based on angular partitioning of two retrieval in this paper. SLBHP encodes the polarity instead of abstract images which are obtained from the model image the magnitude of the difference between accumulated gray and from the query image. The angular-spatial distribution of values of adjacent rectangles. Experimental results on graphics pixels in the abstract images is scale and rotation invariant retrieval show that the discriminative power of SLBHP is better than those of using edge points (EP), Haar feature, and and robust against translation by using the Fourier transform. LBP even in noisy condition. Most existing graphics retrieval adopting contour-based [4] [5] rather than pixel-based approaches [6]. Since the Keywords-graphics retrieval; structured local binary haar contour-based method is concerned with a lot of curves and pattern; Haar; local binary pattern; lines, it is computational intensive. Thus it is the goal of this paper to propose a pixel-based graphics retrieval using novel structured local binary Haar pattern. I. INTRODUCTION This paper is organized as follows. The original Haar With the advent of computing technology, media and LBP feature is described in Section 2. Proposed SLBHP acquisition/storage devices, and multimedia compression feature is described in Section 3. Experimental results and standards, more and more digital data are generated and performance comparison are given in Section 4. Finally, available to user all over the world. Nowadays, it is easy to conclusions are given in Section 5. access electronic books, electronic journals, web portals, and video streams. Hence, it will be convenient for users to II. LOCAL BINARY PATTEN AND HAAR FEATURE provide an image retrieval system for browsing, searching and retrieving images from a large database of digital images. A. Local Binary Pattern Traditional systems add some metadata such as caption, Local feature based approaches have got great success keywords, descriptions or annotation of images so that in object detection and recognition in recent years. The retrieval can be converted into a text retrieval problem. original LBP descriptor was proposed by Ojala et al. 
[7], However, manual annotation is time-consuming, and was proved a powerful means for texture analysis. LBP laborious and expensive. There are a lot of works on content- encode local primitives including different types of curved based image retrieval (CBIR) [1] [2] [3], which is also called edges, spots, flat areas, etc. The advantage of LBP was query by image content. “Content-based” means that the invariant to monotonic changes in gray scale. So LBP is retrieval process will utilize and analyze the actual contents of the image, which might refer to colors, shapes, textures, or widely used in face recognition [8], pedestrian detection [9], any other information that can be derived from the images and many other computer vision applications. themselves. The basic LBP operator assigns a label to every pixel Unfortunately, although there are many content-based of an image by thresholding the 3 × 3-neighborhood and retrieval methods for image databases, few of them are considering the results as a binary number. Then the specifically designed for graphics. Huet et al. [4] exploit both histogram of labels can be used as descriptor of local geometric attributes and structural information to construct a regions. See Figure 1(a) for an illustration of the basic LBP shape histogram for retrieving line-patterns from large operator. databases. Chi et al. [5] proposed an approach to combine a local-structure-based shape representation and a new 1008
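As a concrete illustration of the basic operator just described, the sketch below labels a pixel by thresholding its 3×3 neighborhood against the center value and collects the labels of a region into a histogram; the helper names and the use of NumPy are assumptions made for the example, not the code of [7].

```python
import numpy as np

# Offsets of the 8 neighbors, enumerated clockwise from the top-left.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_label(gray, y, x):
    """Basic LBP label of pixel (y, x): threshold the 3x3 neighborhood
    against the center and read the 8 bits as a binary number."""
    center = gray[y, x]
    code = 0
    for i, (dy, dx) in enumerate(NEIGHBORS):
        if gray[y + dy, x + dx] >= center:   # b_i = 1 when neighbor >= center
            code |= 1 << i                   # weight w_i = 2^i
    return code

def lbp_histogram(gray):
    """Histogram of LBP labels over an image region (the local descriptor)."""
    h, w = gray.shape
    hist = np.zeros(256, dtype=np.int64)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            hist[lbp_label(gray, y, x)] += 1
    return hist
```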
Figure 1. Illustration of LBP and Haar. (a) The basic LBP operator; (b) four types of Haar feature.

B. Haar Feature

A simple rectangular Haar feature can be defined as the difference between the accumulated sums of pixels of the areas inside the rectangles, which can be at any position and scale within the given image. Oren et al. [10] first used 2-rectangle features in pedestrian classification. Viola and Jones [11] extended them to 3-rectangle and 4-rectangle features in the Viola–Jones object detection framework for faces and pedestrians. The difference values indicate certain characteristics of a particular area of the image. The Haar feature encodes low-frequency information, and each feature type can indicate the existence of certain characteristics in the image, such as vertical or horizontal edges or changes in texture. Haar features can be computed quickly using the integral image [11]. The integral image is an intermediate representation of the image with which all rectangular two-dimensional image features can be computed rapidly. Each element of the integral image contains the sum of all pixels located in the up-left region of the original image. Given the integral image, any rectangular sum of pixel values aligned with the coordinate axes can be computed with four array references.

C. A New Sight into LBP, Haar, and Gradient

The decimal form of the resulting 8-bit LBP code can be expressed as follows:

LBP(x, y) = Σ_{i=0}^{7} w_i b_i(x, y)

where w_i = 2^i and b_i(x, y) = 1 if Haar_i(x, y) > T, and 0 otherwise. As shown in Figure 2, each component of LBP is actually a binary 2-rectangle Haar feature with rectangle size 1 × 1. Even the gradient can be seen as a combination of Haar features. For example,

I_x = Haar_0 + Haar_4,   I_y = Haar_2 + Haar_6

where I_x and I_y are the gradients along the x axis and the y axis with filters [1, −2, 1] and [1, −2, 1]^T, respectively.

Figure 2. LBP can be seen as a weighted combination of binary Haar features.
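The four-reference rectangle sum over the integral image mentioned in Section II-B can be sketched as follows; this is a minimal illustration under the usual convention of padding the integral image with a leading row and column of zeros, not the paper's implementation.

```python
import numpy as np

def integral_image(gray):
    """ii[y, x] = sum of gray[0:y, 0:x]; the extra row/column of zeros
    keeps the border arithmetic simple."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(gray.astype(np.int64), axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside an axis-aligned rectangle, obtained from
    four references into the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def haar_2rect_horizontal(ii, top, left, height, width):
    """A simple 2-rectangle Haar response: left block minus right block."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```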
III. STRUCTURED LOCAL BINARY HAAR PATTERN

A. SLBHP

In this paper, based on an idea similar to the multi-block local binary pattern features [12, 13], a descriptor named Structured Local Binary Haar Pattern (SLBHP) is derived from LBP with Haar features. The proposed SLBHP adopts four types of Haar features, which capture the changes of gray values along the horizontal direction, the vertical direction, and the diagonals, as shown in Figure 3(a). However, only the polarity of the Haar feature is used in SLBHP, while the magnitude is discarded. Note that the number of encoding patterns is thereby reduced from 256 for LBP to 16 for SLBHP. Moreover, SLBHP encodes the spatial structure of two adjacent rectangle regions in four directions. Thus, compared to LBP, SLBHP has more compact encoding patterns and incorporates more semantic structure information.

Figure 3. An example of SLBHP. (a) Four Haar features; (b) corresponding Haar features with overlapping; (c) an example of computing SLBHP values.

Let a_i, i = 0, 1, …, 8, denote the corresponding gray values of a 3×3 window with a_0 at the center pixel (x, y), as shown in Figure 3(a). The value of the SLBHP code of a pixel (x, y) is given by the following equation:

SLBHP(x, y) = Σ_{p=1}^{4} B( H_p ⊗ N(x, y) ) × 2^p

where

N(x, y) = [ a_1  a_2  a_3
            a_8  a_0  a_4
            a_7  a_6  a_5 ]

H_1 = [  1  1  0     H_2 = [  0  1  1     H_3 = [  1  1  1     H_4 = [ −1  0  1
         1  0 −1             −1  0  1              0  0  0             −1  0  1
         0 −1 −1 ]           −1 −1  0 ]            −1 −1 −1 ]           −1  0  1 ]

and B(x) = 1 if |x| > T and 0 otherwise, with T a threshold (15 in our experiments). By this binary operation the feature becomes more robust to global lighting changes. Note that H_p denotes a Haar-like basis function and H_p ⊗ N(x, y) denotes the difference between the accumulated gray values of the black and red rectangles shown in Figure 3(c). Unlike the traditional Haar feature, here the rectangles overlap by one pixel. Inspired by LBP and by the fact that a single binary Haar feature might not have enough discriminative power, we combine these binary features just as LBP does. Figure 3(c) shows an example of the SLBHP feature. The SLBHP feature extends the merits of both the Haar feature and LBP, and it encodes the most common structure information of graphics. Moreover, SLBHP has dimension 16, smaller than the 256 dimensions of LBP, while being more immune to noise, since the Haar feature uses more pixels at a time.

B. SLBHP for Graphics Retrieval

After the SLBHP value is computed, the histogram of SLBHP over a region R is computed by the following equation:

H(i) = Σ_{(x,y)∈R} I{ SLBHP(x, y) = i },   where I{A} = 1 if A is true and 0 if A is false.

The histogram H contains information about the distribution of the local patterns, such as edges, spots, and flat areas, over the image region R. In order to make SLBHP robust to slight translation, a graphics photo is divided into several small spatial regions ("blocks"); an SLBHP histogram is computed for each block, and the histograms are then concatenated to form the representation of the graphics, as shown in Figure 4. For better invariance to illumination, it is useful to contrast-normalize the local responses in each block before using them; experiments showed that L2NORM gives better results than L1NORM and L1SQRT. Similar to other popular local-feature-based object detection methods, the detection windows are tiled with a dense (overlapping) grid of SLBHP descriptors, and the overlap is half of the whole block.

Figure 4. An example of SLBHP histograms for graphics retrieval.
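A direct per-pixel sketch of the SLBHP code, its block histogram, and a chi-square distance for histogram matching, following the equations above, might look as follows; the naive loops, the 32-bin layout, and the helper names are assumptions for clarity, not the authors' optimized implementation.

```python
import numpy as np

T = 15  # polarity threshold used in the paper's experiments

# The four Haar-like basis functions H_1..H_4 from Section III-A.
H = [np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
     np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
     np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
     np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]])]

def slbhp_code(gray, y, x):
    """SLBHP(x, y) = sum_p B(H_p (.) N(x, y)) * 2^p over the 3x3 window N."""
    n = gray[y - 1:y + 2, x - 1:x + 2].astype(int)
    code = 0
    for p, h in enumerate(H, start=1):
        response = int(np.sum(h * n))   # H_p (.) N(x, y)
        if abs(response) > T:           # B(.) binarizes the response against T
            code += 2 ** p
    return code

def slbhp_histogram(gray, region):
    """Histogram of SLBHP codes over one block (region must stay one pixel
    inside the image border); block histograms are later concatenated and
    contrast-normalized to describe a whole graphics photo."""
    top, left, height, width = region
    hist = np.zeros(32, dtype=np.int64)  # codes are even values 0..30
    for y in range(top, top + height):
        for x in range(left, left + width):
            hist[slbhp_code(gray, y, x)] += 1
    return hist

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms, used for matching."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```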
Figure 5. Some query results for the graphics database. (a) Query graphics; (b) the three most similar graphics ordered by similarity value; the one with the red rectangle is the ground-truth match.

IV. EXPERIMENTAL RESULTS

479 electronic files of graphics were collected to construct the database for the retrieval experiments. The test images comprise 479 graphics photos taken by a digital camera, to which noise is then added to obtain noisy test images. The performance of graphics retrieval is measured by the retrieval accuracy, computed as the ratio of the number of graphics correctly retrieved to the total number of queries. Moreover, not only the retrieval accuracy with respect to the first rank but also that with respect to the second and third ranks is considered in our experiments. The retrieval accuracies for the different approaches are listed in Tables 1 through 4 with block sizes from 8×8 to 32×32, and the retrieval accuracy for the non-overlapping case is also listed in Table 4. By comparing Tables 1 and 4, we found that overlapping yields higher retrieval accuracy. It is noted that the proposed method and the approaches using EP [6] and LBP all adopt histogram-based matching. However, for the Haar feature, the four computed Haar values for each block are normalized and then concatenated to form the representation, and the Chi-square distance is also adopted as the similarity measure for the Haar feature.
    In our experiment,we found that chi-square is a better distance. Some retrieval results are shown in Figure 5. similarity for histogram-based matching than Euclidean TABLE I. RETRIEVAL ACCURACIES OF EDGE POINTS (EP), LBP, HAAR, AND SLBHP WITH HALF-OVERLAPPING BLOCKS. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 85.2 70.4 83.7 88.3 91.6 79.5 90.6 95.6 93.3 82.5 92.5 96.5 32x16 83.3 62.3 68.9 88.5 91.4 74.9 76.0 94.6 93.5 78.3 78.5 95.7 16x32 86.8 66.6 60.8 90.2 92.9 76.0 68.7 95.6 94.2 80.0 72.2 96.7 16x16 85.0 58.2 62.4 89.4 92.3 66.8 70.1 94.4 94.4 69.3 73.3 95.8 16x8 81.2 42.0 37.4 86.6 89.8 51.8 43.4 91.9 91.2 55.5 45.9 93.7 8x16 83.3 45.3 29.0 86.6 90.6 55.1 36.5 92.5 92.9 57.8 40.7 94.8 8x8 79.3 30.5 29.2 82.7 86.8 39.5 34.9 89.3 89.8 44.5 39.2 91.2 TABLE II. RETRIEVAL ACCURACIES UNDER GAUSSIAN NOISE WITH VARIANCE 50 AND PERTURBATION 1%. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 63.88 71.19 83.09 82.46 74.53 78.91 90.40 90.81 77.87 84.13 92.48 93.95 32x16 71.61 65.76 68.48 85.18 79.54 75.16 75.78 93.53 84.76 79.54 78.71 94.57 16x32 72.44 67.22 60.96 87.06 79.54 76.41 68.27 93.53 83.72 81.21 72.03 94.99 16x16 78.08 59.92 62.42 88.31 85.39 68.27 69.52 93.74 89.14 72.65 73.70 94.99 16x8 79.96 43.63 37.58 86.01 87.89 52.61 43.42 92.07 89.98 55.74 45.72 93.95 8x16 79.12 47.60 29.02 86.85 88.31 54.91 36.33 93.11 91.44 59.08 41.34 94.99 8x8 81.00 31.52 29.23 83.72 87.68 40.08 34.24 90.40 90.61 44.89 38.62 92.48 TABLE III. RETRIEVAL ACCURACIES UNDER SALT AND PEPPER NOISES WITH PERTURBATION 0.5%. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 15.24 70.77 83.51 84.76 19.83 79.33 91.02 92.48 25.05 82.88 92.48 94.78 32x16 20.46 64.93 68.48 86.01 27.97 75.57 75.79 94.15 39.25 79.33 78.50 95.62 16x32 22.55 67.43 60.96 88.10 27.35 76.20 68.48 94.15 34.66 80.17 72.44 95.62 16x16 37.37 59.71 61.59 88.52 46.97 67.85 68.89 93.95 61.38 71.19 73.28 95.41 16x8 55.95 42.80 36.74 86.22 67.22 52.40 43.01 92.28 73.90 55.95 45.30 93.32 8x16 54.28 47.39 28.60 87.27 67.43 54.90 36.12 93.32 78.08 58.87 40.71 94.57 8x8 70.35 31.11 29.02 83.51 81.00 40.29 34.86 89.77 84.55 45.09 39.25 92.28 TABLE IV. RETRIEVAL ACCURACIES WITH NON-OVERLAPPING BLOCKS. 1-best 2-best 3-best EP LBP Haar SLBHP EP LBP Haar SLBHP EP LBP Haar SLBHP 32x32 82.04 70.98 69.73 87.68 90.40 77.66 78.29 94.57 92.49 81.42 82.88 95.82 32x16 79.75 64.09 61.80 87.27 88.10 72.65 68.89 93.53 89.98 75.57 71.61 94.78 16x32 82.25 66.18 53.44 89.14 89.77 74.95 61.17 94.78 90.81 78.50 64.30 96.24 16x16 81.00 57.20 57.83 88.94 88.94 66.18 65.76 93.11 91.23 69.31 69.10 94.36 16x8 78.50 41.34 29.23 84.97 87.06 51.57 36.33 91.23 89.35 54.90 40.08 92.48 8x16 79.96 43.01 23.38 86.01 88.52 51.77 29.44 91.23 91.65 55.11 32.57 93.53 8x8 75.99 29..22 27.97 83.30 84.97 39.25 33.83 88.31 88.31 42.17 36.12 90.61 V. CONCLUSION ACKNOWLEDGMENT A novel local feature SLBHP, combining the merits of This work was partially supported by National Science Haar and LBP, is proposed in this paper. The effectiveness Council of Taiwan, under Grants NSC 99-2221-E-155-072, of SLBHP has been proven by various experimental results. National Nature Science Foundation of China under Grants Moreover, compared to the other approaches using EP, Haar 60873179, Shenzhen Technology Fundamental Research and LBP descriptors, SLBHP is superior even in the noisy Project under Grants JC200903180630A, and Doctoral conditions. 
Further research can be directed to extend the Program Foundation of Institutions of Higher Education of proposed graphics retrieval for slide retrieval or e-learning China under Grants 20090121110032. video retrieval using graphics as query keywords. REFERENCES 1011
    [1] R. Datta, D. Joshi, J. Lia, and J. Z. Wang, “Image retrieval: ideas, influences, and trends of the new age,” ACM Computing Surveys, 2008, vol. 40, no.2, Atricle. 5, pp. 1–60. [2] J. Deng, W. Dong, R. Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of Computer Vision and Pattern Recognition, 2009. [3] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: a large dataset for non-parametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, vol. 30, no.11, pp. 1958- 1970. [4] B. Huet and E. R. Hancock, “Line pattern retrieval using rational histograms,” IEEE Transaction on Pattern Analysis and Machine Intelligence, 1999, vol.12, no.12, pp. 1363-1370. [5] Y. Chi and M.K.H. Leung, “ALSBIR: A local-structure-based image retrieval,” Pattern Recognition, 2007, vol. 40, pp. 244-261. [6] A. Chalechale, G. Naghdy and A. Mertins, “Sketch-based image matching using angular partitioning,” IEEE Transactions on Systems, Man, and Cybernetics –Part A: Systems and Humans, 2005, vol. 35, no. 1, pp.28-41. [7] T. Ojala, M. Pietikainen, and D. Harwood, “A comparative study of texture measures with classification based on featured distribution,” Pattern Recognition, 1996, vol. 29, no. 1, pp.51-59. [8] T. Ahonen, A. Hadid, and M. Pietikinen. “Face description with local binary patterns, application to face recognition,” IEEE Transaction on Pattern Analysis and Machine Intelligence, 2006, vol.28, no. 12, pp. 2037-2041. [9] X. Wang, T. X. Han, and S. Yan, “An HOG-LBP human detector with partial occlusion handling,” In: Proceedings of Internation Conference on Computer Vision, 2009. [10] M. Oren, C. Papageorion, P. Sinha, et al, “Pedestrian detection using wavelet templates,” In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 1997. [11] P. Viola and M Jones, “Robust real-time face detection,” International Journal of Computer Vision, 2004, vol. 57, no. 2, pp. 137-154. [12] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, ‘Face detection based on multi-block LBP representation’, Proc. Int. Conf. on Biometrics, 2007. [13] S. Yan, S. Shan, X. Chen, and W. Gao, ‘Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection’, Proc. Int. Conf. Computer Vision and Pattern Recognition, 2008. 1012
    IMAGE-BASED INTELLIGENT ATTENDANCELOGGING SYSTEM Hary Oktavianto1, Gee-Sern Hsu2, Sheng-Luen Chung1 1 Department of Electrical Engineering 2 Department of Mechanical Engineering National Taiwan University of Science and Technology, Taipei, Taiwan E-mail: [email protected] Abstract— This paper proposes an extension of the surveillance camera’s function as an intelligent attendance logging system. The system works like a time recorder. Based on sitting and standing up events, the system was designed with learning phase and monitoring phase. The learning phase learns the environment to locate the sitting areas. After a defined time, the system switches to the monitoring phase which monitors the incoming occupants. When the occupant sits at the same Fig. 1. Occupant’s working room (left) and a map consists of occupants’ location with the sitting area found by the learning sitting areas (right). phase, the monitoring phase will generate a sitting-time report. A leaving-time report is also generated when an working area of the occupant does and when the occupant occupant stands up from his/her seat. This system works. employs one static camera. The camera is placed 6.2 The diagram flow of the proposed system is shown in Fig. 2. The system consists of an object segmentation unit, a meters far, 2.6 meters high, and facing down 21° from tracking unit, learning phase, and monitoring phase. A fixed horizontal. The camera’s view is perpendicular with the static camera is placed inside the occupants’ working room. working location. The experimental result shows that the The images taken by the camera are pre-processed by the system can achieve a good result. object segmentation unit to extract the foreground objects. The connected foreground object is called as blob. These Keywords Activity Map; Attendance; Logging system; blobs are processed further in the tracking unit. Once the Learning phase; Monitoring phase; Surveillance Camera; system detected the blob as an occupant, the system keeps tracking the occupant in the scene using centroid, ground I. INTRODUCTION position, color, and size of the occupant as the simple Intelligent buildings have increased as a research topic tracking features. The learning phase has responsibility to recently [1], [2], [3]. Many buildings are installed with learn the environment and constructs a map as the output. surveillance cameras for security reasons. This paper extends The monitoring phase uses the map to monitor whether the the function of existing surveillance cameras as an intelligent occupants are present in their working desk or not. The attendance logging system. The purpose is to report the report on the presence or the absence of the occupants is the occupant’s attendance. The system works like a time final output of the system for further analysis. The system is recorder or time clock. A time recorder is a mechanical or implemented by taking the advantages of the existing open electronics timepiece that is used to assist in tracking the source library for computer vision, OpenCV [5] and cvBlob hours an employee of a company worked [4]. Instead of [6]. spending more budgets to apply those timepieces, the The contributions of this paper are: surveillance camera can be used to do the same function. The (1) Learning mechanism that locates seats in an unknown system is so called intelligent since it learns from a given environment. environment automatically to build a map. 
A map consists of (2) Monitoring mechanism that detects in entering and sitting areas of the occupants. Sitting area is the space leaving events of occupants. information about where are the locations of the occupant’s (3) Integrating system with real-time performance up to 16 working desk. So, there is no need to select the area of fps, ready for context-aware applications. occupant’s working area manually. Fig. 1 shows an example This paper is organized with the following sections. The scenario. Naturally, the occupant enters into the room and problem definition and the previous researches as related sits to start working. Afterward, the occupant stands up from works are reviewed in Section II. Section III describes the his/her seat and leaves the room. The sitting and standing up technical overview of the proposed solution. Section IV events will be used by the system to decide where the explains about the tracking that is used to keep tracking of the occupants during their appearance in a scene based on the 1013
    information from theprevious frame. The learning phase and the monitoring phase are explained in Section V. Section VI explains the experiments’ setup, result, and discussion. Finally, the conclusions are summarized in Section VII. II. PROBLEM DEFINITION AND RELATED WORK This section describes the problem definition and the previous works related to the intelligent attendance logging system. A. Problem Definition The goal of this paper is to design an image-based intelligent attendance logging system. Given a fixed static camera as an input device inside an unknown working environment with a number of fixed seats, each of them belong to a particular user or occupant. Occupant enters and leaves not necessarily at the same time. We are to design a camera-equipped intelligent attendance logging system, such Fig. 2. Diagram flow of the system. that, the system can report in real-time each occupant’s entering and leaving events to and from his/her particular provide the vocabulary to categorize past and present seat. activity, predict future behavior, and detect abnormalities. The system is designed based on two assumptions. The The researches above detect occupants and build a map first assumption is the environment is unknown, in that, the consists of locations where those people mostly occupy. This number of seat and the location of these seats are not known paper extends the advantages of the surveillance cameras to before the system monitors. The second assumption is each monitor the occupant’s presence. A static camera is used by occupant has his/her own seat, as such, detecting the the system as in [2], [8]. Morris and Trivedi applied presence/absence of a particular seat amounts to answering omnidirectional camera [9] to their system. The other the presence/absence of that corresponding occupant. researchers [1], [3], [7] used stereo camera to reduce the There are two performance criteria to evaluate the system effect of lighting intensity and occlusion. It is intended that regarding to the main functions of the system. The main the system in this paper works in real time and has the functions of the system are to find the sitting area and to capability to learn the environment automatically from report the monitoring result. The first criterion is the system observed behavior. should find the sitting areas given by the ground truth. The second criterion is the system should be able to monitor the occupants during their appearance in the scene to generate III. TECHNICAL OVERVIEW the accurate report. As shown in Fig. 2 and the detail in Fig. 3, the input B. Related Work images acquired from the camera are fed into the object segmentation unit to extract the foreground object. During the past decades, intelligent building has been Foreground object is the moving object in a scene. The developed. Zhou et al [3] developed the video-based human foreground object is obtained by subtracting the current indoor activity monitoring that is aimed to assist elderly. image with the background image. To model the background Demirdjian et al [7] presented a method for automatically estimating activity zone based on observed user behaviors in image, Gaussian Mixture Model (GMM) is used. GMM represents each background pixel variation with a set of an office room using 3D person tracking technique. They weighted Gaussian distributions [10], [11], [12], [13]. The used simple position, motion, and shape features for first frame will be used to initialize the mean. A pixel is tracking. 
This activity zone is used at run time to contextualize user preferences, e.g., allowing “location- decided as the background if it falls into a deviation around the mean of any of the Gaussians that model it. The update sticky” settings for messaging, environmental controls, process, which is performed in the current frame, will and/or media delivery. Girgensohn, Shipman, and Wilcox [8] increase the weight of the Gaussian model that is matched to thought that retail establishments want to know about traffic the pixel. By taking the difference between the current image flow in order to better arrange goods and staff placement. and the background image, the foreground object is obtained. They visualized the results as heat maps to show activity and object counts and average velocities overlaid on the map of After that, the foreground object is converted from RGB the space. Morris and Trivedi [9] extracted the human color image to gray color image [13]. The edges of the activity. They presented an adaptive framework for live objects in the gray color image are extracted by applying video analysis based on trajectory learning. A surveillance edge detector. The edge detector uses moving frame scene is described by a map which is learned in unsupervised algorithm. Moving frame algorithm has four steps. Step one, fashion to indicate interesting image regions and the way the gray color image (I) is shifted to eight directions using objects move between these places. These descriptors the fixed distance in pixel unit (dx and dy), resulting eight 1014
    images with anoffset to right, left, up, down, up right, up left, down right, and down left, respectively. Those eight shifted images with offset are called moving frame images (Fi ). Fi ( x , y )  I ( x  dxi , y  dyi ) (1) Step two, each of moving frames image is updated (F*) by making subtraction to the image frame (I) to get the extended edges. Fi ( x , y )  I ( x , y )  Fi ( x , y ) (2) Step three, each of moving frames is converted to binary by applying a threshold value (TF). Fig. 3. The detail of the object segmentation unit and the tracking unit. FiT ( x , y )  f T ( Fi ) (3) where Biy and Bjy are the y-coordinate of of blob-i and blob- j, respectively, ci and cj are the centroid of each blob. If 1 if ( Fi* ( x , y ))  TF those three conditions satisfy (6) then the broken blobs are fT   0 otherwise grouped. Finally, all of moving frame images are added together. As BI    TC   Bdy  TD  B A  TA   1 the result, the edges of the image (E) are obtained. G (6) 0 otherwise E( x , y )   FiT ( x , y ) (4) i TC, TD, and TA are the threshold values for the intersection distance, the nearest vertical distance of blobs, and the angle Edge detector extracts the object while removes the weak of blobs, respectively. In the experiments, TC is 0 pixel, TD is shadows at the same time since weak shadows do not have 50 pixels, and TA is 30°. edges. However, strong shadows may happen and create After the broken blobs are grouped into one, the motion some edges. Strong edges appear between legs can still be detector will test the blob whether it is an occupant or not. tolerated since the system does not consider about occupant’s The blob is an occupant if the size of the blob looks like a contour. human and the blob has movement. A minimum size of The result from the edge detection process is refined by human is an approximation relative to the image size. X-axis using morphology filters [13]. Dilation filter is applied twice displacement and optical flow [13] are used to detect the to join the edges and erosion filter is used once to remove the movement of the blob. If a blob is detected as an occupant noises. The last step in the object segmentation unit is then the tracking unit gives a unique identification (ID) connected component labeling. The connected component number and a track the occupant. A track is an indicator that labeling is used to detect the connected region. The a blob is an occupant, and it is represented by a bounding connected region is so called as blob. In the object box. Tracking rules are implemented as states to handle each segmentation unit, the GMM, the gray color conversion, the event. There are five basic states; entering state, people state, edge detector, and the morphology filters are implemented sitting state, standing up state, and leaving state. During the using OpenCV library while the connected component tracking, the occlusion problem may happen. Two more labeling is implemented using cvBlob library. states are added. They are merge state and split state. In the The blob that represents the foreground object may be tracking unit, optical flow implements OpenCV library while broken due to the occlusion with furniture or having the the tracking rules employ cvBlob library. same color with the background image. Some rules to group The learning phase is activated if the map has not the broken blob are provided. There are three conditions to constructed yet. The sitting state in the tracking unit triggers examine the broken blobs. 
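A compact sketch of the moving-frame edge extraction described by Eqs. (1)-(4), together with the dilate-twice/erode-once refinement, is given below, assuming an OpenCV implementation. The shift distance d and the threshold TF are illustrative values, since their exact settings are not quoted in the text.

```cpp
#include <opencv2/opencv.hpp>

// Eqs. (1)-(4): shift the gray image in eight directions, subtract from the
// original, binarize each difference with threshold TF, and accumulate.
cv::Mat movingFrameEdges(const cv::Mat& gray, int d = 2, double TF = 20.0) {
    const int dx[8] = { d, -d, 0,  0,  d, -d,  d, -d };
    const int dy[8] = { 0,  0, d, -d, -d, -d,  d,  d };
    cv::Mat edges = cv::Mat::zeros(gray.size(), CV_8U);

    for (int i = 0; i < 8; ++i) {
        // Eq. (1): F_i(x, y) = I(x - dx_i, y - dy_i), a pure translation.
        cv::Mat M = (cv::Mat_<double>(2, 3) << 1, 0, dx[i], 0, 1, dy[i]);
        cv::Mat shifted, diff, binary;
        cv::warpAffine(gray, shifted, M, gray.size());

        // Eq. (2): the (absolute) difference keeps the extended edges.
        cv::absdiff(gray, shifted, diff);

        // Eq. (3): binarize the moving-frame image with threshold TF.
        cv::threshold(diff, binary, TF, 1, cv::THRESH_BINARY);

        // Eq. (4): sum the eight binary moving frames.
        edges += binary;
    }
    cv::threshold(edges, edges, 0, 255, cv::THRESH_BINARY);

    // Refinement described in the text: dilate twice to join the edges,
    // erode once to remove isolated noise.
    cv::Mat k = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::dilate(edges, edges, k, cv::Point(-1, -1), 2);
    cv::erode(edges, edges, k, cv::Point(-1, -1), 1);
    return edges;
}
```

Connected-component labeling (e.g. with cvBlob, as in the system) can then be run on the returned edge map to obtain the blobs.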
The first is the intersection the learning phase to locate the occupant’s sitting area. After distance of blobs (BI). The second is the nearest vertical a defined time, the learning phase finished its job and the monitoring phase is activated. In this phase, the sitting state distance of blobs (Bdy). The third is the angle of blobs (BA) and the standing up state in the tracking unit trigger the from their centroids. Bdy and BA are calculated using (5) monitoring phase to generate reports. The reports tell when while BI is explained in [14]. the occupant sit and left. Bdy  min( Bi . y , B j . y ) The system will be evaluated by testing it with some (5) video clips. There are two scenarios in the video clips. Five B A  ( ci ,c j ) occupants are asked to enter the scene. They sit, stand up, leave the scene and sometimes cross each other. 1015
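The grouping test of Eqs. (5)-(6) can be sketched as follows, using the thresholds TC = 0 pixels, TD = 50 pixels and TA = 30° quoted in the text. The Blob structure, the simplified intersection-distance measure and the angle convention are illustrative stand-ins for the definitions given in the text and in [14].

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cstdlib>
#include <cmath>

struct Blob { cv::Rect box; cv::Point2f centroid; };

// Returns true when two broken blobs should be grouped into one occupant.
bool shouldGroup(const Blob& a, const Blob& b,
                 double TC = 0.0, double TD = 50.0, double TA = 30.0) {
    // Intersection distance B_I (simplified): zero when the bounding boxes
    // overlap horizontally, otherwise the horizontal gap between them.
    double gap = std::max(0, std::max(b.box.x - (a.box.x + a.box.width),
                                      a.box.x - (b.box.x + b.box.width)));

    // Eq. (5): B_dy is the nearest vertical distance between the blobs and
    // B_A the angle of the line joining their centroids, measured here from
    // the vertical axis (broken parts of one person stack roughly vertically).
    double bdy = std::min(std::abs(a.box.y - (b.box.y + b.box.height)),
                          std::abs(b.box.y - (a.box.y + a.box.height)));
    double ba  = std::abs(std::atan2(double(b.centroid.x - a.centroid.x),
                                     double(b.centroid.y - a.centroid.y)))
                 * 180.0 / CV_PI;

    // Eq. (6): group only if all three conditions hold.
    return gap <= TC && bdy <= TD && ba <= TA;
}
```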
    IV. TRACKING This section describes about the tracking rules in the tracking unit (Fig. 3). Tracking rules will keep tracking the occupants during their appearance in the scene based on the information (features) from the previous frame. The Fig. 4. Basic tracking states. tracking rules are represented by states. The basic tracking states are shown in Fig. 4. There are five states:  Entering state (ES), an incoming blob that appears in the scene for the first time will be marked as entering state. This state also receives information from the motion detector to detect whether the incoming blob is an occupant or a noise. If the incoming blob is considered as noise and it remains there for more than 100 frames then the system will delete it, for instance, the size is too small because of shadows. To erase the noise from the scene, the system re-initializes the Fig. 5. An occupant in the scene and the features. Gaussian model to the noise region so that the noise will be absorbed as a background image. An incoming blob is classified as an occupant if the incoming blob has motion at least for 20 frames continuously and the height of the blob is more than 60 pixels.  Person state (PS), if the incoming blob is detected as an occupant, a unique identification (ID) number and a bounding box are attached to this blob. The blob that is detected as an occupant is called as a track. The system adds this track in the tracking list. Fig. 6. Centroid feature to check the distance in 2D.  Sitting state (IS), detects if the occupant is sitting. Sitting occupant can be assumed if there is no surrounding by a bounding box), size (number of blob movement from the occupant for a defined time. In the pixels or area density), centroid (center gravity of mass), experiments, an occupant is sitting when the x-axis and ground position (foot position of occupant). displacement is zero for 20 frames and the velocity The first feature is centroid. Centroid is used to associate vectors from the optical flow’s result are zero for 100 the object’s location in the 2D image between two frames, continuously. consecutive frames by measuring the centroids distance.  Standing-up state (US), detects when the sitting Fig. 6 shows the two objects being associated. One object is occupant starts to move to leave his/her desk. In the already defined as track in the previous frame (t-1) and experiments, a standing up occupant is detected when another object is appearing in the current frame (t) as a blob. the sitting occupant produces movements, the height Each object has centroid (c). These two objects are increases above 75%, and the size changes to 80%- measured [14] in the following way. If one of centroid is 140% comparing to the size of the current bounding inside another object (the boundary of each object is defined box. as a rectangle) the returned distance value is zero. If the centroids are lying outside the boundary of each object then  Leaving state (LS), deletes the occupant from the list. the returned distance value is the nearest centroid to the A leaving occupant is detected when occupant moves opponent boundary. A threshold value (TC) is set. When the to the edge of the scene and occupant’s track loses its distance is below TC meaning that those two objects are the blob for 5 frames. same object, the track position will be updated to the blob A. Tracking Features position. 
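The state machine above can be outlined in code. The sketch below is a simplified illustration of the five basic states using the frame counts quoted in the text (20 frames of continuous motion and a 60-pixel height for a person, 20 frames without x-axis displacement for sitting, 5 lost frames near the image border for leaving); the Track fields and the update logic are assumptions, and the standing-up size/height tests and the 100-frame noise re-initialization are omitted for brevity.

```cpp
enum class State { Entering, Person, Sitting, StandingUp, Leaving };

struct Track {
    State state = State::Entering;
    int   framesWithMotion    = 0;   // consecutive frames with x-displacement
    int   framesWithoutMotion = 0;
    int   framesWithoutBlob   = 0;
    int   heightPx            = 0;

    void update(bool hasMotion, bool blobVisible, bool nearImageEdge) {
        framesWithMotion    = hasMotion   ? framesWithMotion + 1 : 0;
        framesWithoutMotion = hasMotion   ? 0 : framesWithoutMotion + 1;
        framesWithoutBlob   = blobVisible ? 0 : framesWithoutBlob + 1;

        switch (state) {
        case State::Entering:      // sustained motion and plausible size -> person
            if (framesWithMotion >= 20 && heightPx > 60) state = State::Person;
            break;
        case State::Person:        // no movement for a while -> sitting
            if (framesWithoutMotion >= 20) state = State::Sitting;
            break;
        case State::Sitting:       // renewed movement -> standing up
            if (hasMotion) state = State::StandingUp;
            break;
        case State::StandingUp:    // lost near the image border -> leaving
            if (nearImageEdge && framesWithoutBlob >= 5) state = State::Leaving;
            break;
        case State::Leaving:
            break;                 // the track is removed from the list elsewhere
        }
    }
};
```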
If the distance is not satisfied then it means these The system is tried to match every detected occupant in two objects is not correlated each other. It could be the the scene from frame to frame. This can be done by previous track loses the object in the next frame and a new matching the features of occupant. Four features (centroid, object appears at the same time. A track that missed the ground position, color, and size) are used for tracking tracking is defined in the leaving state (LS) and a new object purpose. Fig. 5 shows the illustration of blob (the connected that appears in the scene is handled in the blob state (BS). region of occupant object in the current frame), track (a The second feature is ground position. It is possible that connected blob that considers as an individual occupant, two objects are not the same object but their centroids are 1016
    Fig. 10. Extendedtracking states. Fig. 7. Ground position feature to check the distance in 3D. Blob and track in the processing stage (left). View in the real image (right). categories; n is the total bin number; the histogram HR,G,B of occupant-i meets the following conditions: n H iR ,G ,B   bk (7) k 1 The histogram HR,G,B are calculated using the masked image and then normalized. The masked image, shown in Fig. 8, is obtained from the occupant’s object and the blob with and- operation. The method for matching the occupant’s Fig. 8. Color feature is calculated on masked image. histogram is correlation method. In the experiments, 10-bins for each color are chosen. The histogram matching procedure uses a threshold value of 0.8 to indicate that the comparing histogram is sufficient matched. The fourth feature is size. The size feature is used to match the object between two consecutive frames based on the pixel density. The pixel density is the blob itself, shown at Fig. 9. Allowable changing size at the next frame is set ± 20% from the previous size. Let p(x’,y’) be the pixel Fig. 9. Size feature of occupant. location of an occupant in binary image. The size feature of object-i is calculated as follow: lying inside each other boundary. Fig. 7 shows this problem. There are two occupants in the scene. One occupant is si   p( x' , y' ) (8) x' , y' sitting while the other is walking through behind. In the 2D image (left), the two objects are overlapped each other. However, it is clear that the walking occupant should not be B. Merge-split Problem confused with the sitting occupant. To solve this problem, A challenging situation may happen. While the occupants ground position is used to associate the object’s location in are walking in the scene, they are crossing each other and the 3D image between two consecutive frames. Ground making occlusion. Since the system keeps tracking each position feature will eliminate the error that an object to be occupant in the scene, it is necessary to extend the tracking updated with another object even thought they overlap each states from Fig. 4. Two states are added for this purpose; other. Occupant’s foot location is used as ground position. merge state (MS) and split state (SS). Fig. 10 shows the A fixed uniform ellipse boundary (25 pixels and 20 pixels extended tracking states. Merge and split can be detected by for major axis and minor axis, respectively) around the using proximity matrix [14]. Objects are merged when ground position is set to indicate the maximum allowable multiple tracks (in the previous frame) are associated to one range of the same person to move. In the real scene, this blob (in the current frame). Objects are split when multiple pixel area is equal to 40 centimeters square for the nearest blobs (in the current frame) are created from a track (in the object from the camera until 85 centimeters square for the previous frame). In the merge condition, only centroid furthest object from the camera. This wide range is caused feature is used to track the next possible position since the by the using of uniform ellipse distance for all locations in other three features are not useful when the objects merge. the image. After a group of occupants split, their color will be matched The third feature is color. Color feature is used to to their color just before they merged together. 
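A sketch of the colour and size matching follows, assuming OpenCV's histogram routines: a 10-bin-per-channel RGB histogram computed on the masked occupant region and compared with the correlation method against the 0.8 threshold (Eq. (7)), and the pixel-count size feature with the ±20% tolerance (Eq. (8)). The function names are illustrative.

```cpp
#include <opencv2/opencv.hpp>

// Eq. (7): 10-bin-per-channel colour histogram on the masked occupant region.
cv::Mat rgbHistogram(const cv::Mat& bgrImage, const cv::Mat& mask) {
    int histSize[] = {10, 10, 10};
    float range[] = {0, 256};
    const float* ranges[] = {range, range, range};
    int channels[] = {0, 1, 2};
    cv::Mat hist;
    cv::calcHist(&bgrImage, 1, channels, mask, hist, 3, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);   // normalized histogram
    return hist;
}

bool sameOccupant(const cv::Mat& histTrack, const cv::Mat& histBlob,
                  int sizeTrack, int sizeBlob) {
    // Colour match: correlation above 0.8 counts as sufficiently matched.
    double corr = cv::compareHist(histTrack, histBlob, cv::HISTCMP_CORREL);

    // Eq. (8) size feature: the blob area may change by at most +/-20%.
    bool sizeOk = sizeBlob >= 0.8 * sizeTrack && sizeBlob <= 1.2 * sizeTrack;

    return corr >= 0.8 && sizeOk;
}
```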
indicate color information of occupant’s clothing or wearing In experiments, when more than two occupants split, and help to separate the objects in term of occlusion. Three sometimes an occupant remains occluded. Later, the dimension of RGB color histogram is used. Let b be the bin occluded occupant splits. When the occluded occupant that counts the number of pixel that fall into the same splits, the system will re-identify each occupant and correct 1017
    Sitting area number Event Time stamp 1 Sitting 09:02:09 Wed 2 June 2010 2 Sitting 09:07:54 Wed 2 June 2010 3 Sitting 09:12:16 Wed 2 June 2010 2 Leaving 10:46:38 Wed 2 June 2010 2 Sitting 10:49:54 Wed 2 June 2010 3 Leaving 12:46:38 Wed 2 June 2010 Fig. 12. A report example. B. Monitoring Phase The monitoring phase is derived from the sitting state and the standing up state in the tracking rules. The monitoring phase generates the reports of the occupant’s attendance. It Fig. 11. Merge-split algorithm with occlusion handling. uses the map that has been constructed by the learning phase. From Fig. 4, the sitting into state (IS) and standing up their previous ID number just before they have merged. Fig. state (US) trigger the monitoring phase. When the occupant 11 shows the algorithm to handle the occluded problem. sits, the system will try to match the current occupant’s sitting location with the sitting area in the map. If the V. LEARNING AND MONITORING PHASES positions are the same then the system will generate a time This section introduces about how the learning phase stamp of sitting time for the particular sitting area. A time and the monitoring phase work. These phases are derived stamp of leaving time is also generated by the system when from the tracking unit, which are the sitting state and the the occupant moves out from the sitting area. Fig. 12 shows standing up state in the tracking rules. At the beginning, the the example of the report. system activates the learning phase. Triggering by the sitting VI. APPLICATION TO INTELLIGENT ATTENDANCE event, the learning phase starts to construct the map. When LOGGING SYSTEM the given time interval is passed, the learning phase is stopped. A map has been constructed. The system switches This paper demonstrates the usage of the surveillance to the monitoring phase to report the occupants’ attendance camera as an intelligent attendance logging system. It based on when they sit into and stand from their seat. mentioned earlier that the system works like a time recorder. The system assists for tracking the hours of occupant A. Learning Phase attendance. Using this system, the occupants no need to The learning phase is derived from the sitting state in the bring special tag or badge. In this section, the environment tracking rules. The output of the learning phase is a map. setup, result, and discussion are described. The map consists of occupants’ sitting areas. From Fig. 4, A. Environment Setup the information about when the occupant sits is extracted from the sitting into state (IS). When an occupant is detected A static network camera is used to capture the images as sitting, the system will start counting. After a certain from the scene. It is a HLC-83M, a network camera period of counting, the location where the occupant sits is produced by Hunt Electronic. The image size taken from the determined as sitting area. The counting period is used as a camera is 320 x 240 pixels. The test room is in our delay. The delay makes sure that the occupant sits for laboratory. The camera is placed about 6.24 meters far, 2.63 enough time. In the experiments, the delay is defined for meters high, and 21° facing down from the horizontal line. 200 frames. Ideally, the learning phase is considered to be The occupant desks and the camera view are orthogonal to finished after all of the sitting areas are found. In this paper, get the best view. There are 5 desks as the ground truths. 
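The report generation of the monitoring phase can be sketched as follows: when the sitting (IS) or standing-up (US) state fires, the occupant's ground position is matched against the learned sitting areas and a time-stamped record like those in Fig. 12 is written. The SittingArea structure, the circular matching region and the function names are assumptions made for illustration.

```cpp
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

struct SittingArea { int id; float x, y, radius; };   // one learned map entry

void logEvent(const std::vector<SittingArea>& map,
              float occX, float occY, const std::string& event) {
    for (const auto& area : map) {
        float dx = occX - area.x, dy = occY - area.y;
        if (dx * dx + dy * dy <= area.radius * area.radius) {
            // Emit a time-stamped line, e.g. "Sitting area 2  Sitting  Wed Jun  2 ..."
            std::time_t now = std::time(nullptr);
            std::cout << "Sitting area " << area.id << "  " << event << "  "
                      << std::ctime(&now);
            return;
        }
    }
}
```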
to show that the learning phase does its job, the occupants The room has inner lighting from fluorescent lamps and enter into the scene and sit one by one without making the windows are covered so the sunlight cannot come into occlusion. The scenario for this demonstration is arranged the room during the test. so that after 10 minutes, the map is expected to be B. Result and Discussion completely constructed. Thus, the learning phase is finished Visual C++ and OpenCV platform on Intel® Core™2 its job. The system will be switched to the monitoring Quad CPU at 2.33GHz with 4 GB RAMs is used to phase. In the real situation, the delay and how long the implement the system. Both offline and online methods are learning phase will be finished can be adjusted. allowed. In the scene without any detected objects, the system ran at 16 frames per second (fps). When the number 1018
    of incoming objectsis increasing, the lowest speed can be Table 1. Test results of scene type 1. The number of detected seat by the achieved is 8 fps. system for 10 times experiments. The algorithm was tested with 2 types of scenarios. The Sitting Desk number first scenario is sitting occupants with no occlusion (Fig. area #1 #2 #3 #4 #5 13). This scenario demonstrated the working of learning Detected 7 10 10 10 8 phase. The second scenario is the same as the first scenario Missed 3 0 0 0 2 but the occupants are allowed crossing each other to make an occlusion (Fig. 14). This scenario demonstrated the Table 2. Test results of scene type 2. The number shows the success rate of merge-split handling. monitoring without occlusion for 10 times experiments. Table 1 shows the test result of scenario type 1. There Desk number are 5 desks as ground truth (Fig. 1). Five occupants enter Occupant #1 #2 #3 #4 #5 into the scene. They sit, stand up, and left the scene one by Sitting 9 10 10 10 9 one without making any occlusion. The order or the Leaving 0 9 10 10 9 occupants enter and leave are arranged. The occupant started to occupy the desk number 5 (the right most desk), Table 3. Test results of scene type 2. The number of occupant mistakenly until the desk number 1 (the left most desk). When they left, assigned in merge-split case for 10 times merged. the occupant started to leave from the desk number 1, until Number of Sitting Walking Split the desk number 5. This order is made to make sure that Occupant occupant occupant Merge Succeeded Failed there is no occupant walks through behind the sitting 2 0 2 10 9 1 occupant. This scenario was repeated 10 times. The result 2 1 1 10 9 1 shows that there is no problem for the desk number 2, 3, and 4. However, there are some errors that the system failed to 3 0 3 10 8 2 locate the occupants’ sitting areas. In the case of the desk 3 1 2 10 9 1 number 1, sometimes the occupant’s blob merges with 3 2 1 10 9 1 his/her neighbor occupant. So, the system cannot detect or track the occupant that sits into desk number 1. In the case which is which after they split. The error happened because of the desk number 5, the occupant’s color was similar to of the occupant’s color and the sitting occupant. If the the color of the background image. This caused the occupants have a similar color then the system may get occupant produced small blob. The system cannot track the confuse to differentiate them. Another time, when the sitting occupant because his/her blob’ size becomes too small. occupant makes a movement, it creates a blob. However, the Table 2 shows the test result of scene type 2. The system system still does not have enough evidence to determine that monitored the occupants based on the map that has been this blob will change the status of sitting occupant become found. The experiments were done 10 times without standing up occupant. Another occupant walked closer and occlusion. There are some errors that the system failed to merged with this blob. After they split, the system confused recognize the sitting occupant. The system failed to detect since the blob had no previous information. As the result, the occupant because of the same problems in the previous the system missed count the previous track being merged. discussion; the system lost to track the occupant because the The ID number of occupant is restored incorrectly. occupant has the similar color to the background image so that the occupant suddenly has small blob. The system also VII. 
CONCLUSIONS failed to recognize the leaving event from desk number 1. We have already designed an intelligent attendance The system detects a leaving occupant when the occupant logging system by integrating the open source with split with his/her seat. Since the desk number 1 does not additional algorithm. The system works in two phases; have enough space for the system to detect the splitting, the learning phase and monitoring phase. The system can system still detected that the desk number 1 is always achieve real-time performance up to 16 fps. We also occupied even the corresponding occupant has left that demonstrate that the system can handle the occlusion up to location. three occupants considering that the scene seems become Table 3 shows the test result of scene type 2. The too crowded for more than three occupants. While the experiments were done 10 times with occlusion. The system regular time recorder only reports the time stamp of the should be able to keep tracking the occupants. To test the beginning and the ending of the occupant’s working hour, system, three occupants enter to the scene to make the this system provides more detail about the timing scenario as shown in Table 3. Some occupants walk through information. Some unexpected behavior may cause an error. behind the sitting occupant or the occupants just walk and For instance, the occupant has the color similar to the cross each other. Most of the cases, the system can detect background, the desk position, or the occupant moves while 1019
    sitting. In the future, the events generated by this system can be used to deliver a message to another system. It is possible to control the environment automatically such as adjust the lighting, playing a relaxation music, setting the air conditioner when an occupant enters or leaves the room. The summary report of the occupant’s attendance also can be used for activity analysis. The current system does not include the recognition capability since it only detect whether the working desk is occupied or not. However, if occupant recognition is needed then there are two ways. After the map of sitting areas are found, user may label each sitting area manually or a recognition system can be added. REFERENCES [1] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafers, “EasyLiving Technologies for Intelligent Environments,” Lecture Notes in Computer Science, Volume 1927/2000, pp. 97-119, 2000. [2] S. -L. Chung and W. –Y. Chen, “MyHome: A Residential Server for Smart Homes”, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4693 LNAI (PART 2), pp. 664-670, 2007. [3] Z. Zhou, X. Chen, Y. –C. Chung, Z. He, T. X. Man, and J. M. Keller, “Activity analysis, summarization, and visualization for indoor human activity monitoring,” IEEE Transactions on Circuits and Systems for Video Technology 18 (11), art. no. 4633633, pp. 1489- 1498, 2008. Fig. 13. Scenario type-1. It shows how the system builds a map. The current [4] Wikipedia, “Time Clock,” http://en.wikipedia.org/wiki/Time_clock images (left) and a map is shown as filled rectangles (right images). (June 24, 2010). [5] OpenCV. Available: http://sourceforge.net/projects/opencvlibrary/ [6] cvBlob. Available : http://code.google.com/p/cvblob/ [7] D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell, “Activity maps for location-aware computing,” Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision (WACV), pp. 70-75, 2002. [8] A. Girgensohn, F. Shipman, and L. Wilcox, “Determining Activity Patterns in Retail Spaces through Video Analysis,” MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops , pp. 889- 892, 2008. [9] B. Morris and M. Trivedi, “An Adaptive Scene Description for Activity Analysis in Surveillance Video,” 2008 19th International Conference on Pattern Recognition, ICPR 2008 , art. no. 4761228, 2008. [10] A. Bayona, J.C. SanMiguel, and J.M. Martínez, “Comparative evaluation of stationary foreground object detection algorithms based on background subtraction techniques,” 6th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2009 , art. no. 5279450, pp. 25-30, 2009. [11] S. Herrero and J. Bescós, “Background subtraction techniques: Systematic evaluation and comparative analysis” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5807 LNCS, pp. 33- 42, 2009. [12] P. KaewTraKulPong and R. Bowden, “An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection,” Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01, 2001 [13] G. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision with the OpenCV Library,” Sebastopol, CA: O'Reilly Media, 2008. [14] A. Senior, A. Hampapur, Y.-L. 
Tian, L. Brown, S. Pankanti, and R. Bolle, "Appearance models for occlusion handling," Image and Vision Computing 24 (11), pp. 1233-1243, 2006.
Fig. 14. Scenario type-2. The map of 3 desks has been completed. The occupants cross each other and the system can handle this situation.
    i-m-Walk : InteractiveMultimedia Walking-Aware System 1 Meng-Chieh Yu(余孟杰), 2Cheng-Chih Tsai(蔡承志), 1Ying-Chieh Tseng(曾映傑), 1Hao-Tien Chiang(姜昊天), 1Shih-Ta Liu(劉士達), 1Wei-Ting Chen(陳威廷), 1Wan-Wei Teo(張菀薇), 2Mike Y. Chen(陳彥仰), 1,2Ming-Sui Lee(李明穗), and 1,2Yi-Ping Hung(洪一平) 1 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan 2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan Abstract calories burned [16]. adidas used a accelerometer to detect the footsteps of the runner, and it will let you know running i-m-Walk is a mobile application that uses pressure information audibly [31]. Wii fit used balance boards to sensors in shoes to visualize phases of footsteps on a mobile detect user's center of gravity and designed several games, device in order to raise the awareness for the user´s walking such as yoga, gymnastics, aerobics, and balancing [18]. In behaviour and to help him improve it. As an example addition, walking is an important factor of our health. For application in slow technology, we used i-m-Walk to help example, it is one of the earliest rehabilitation exercises and beginners learn “walking meditation,” a type of meditation an essential exercise for elders [5]. Improper foot pressure where users aim to be as slow as possible in taking pace, and distribution can also contribute to various types of foot to land every footstep with toes first. In our experiment, we injuries. In recent years, the ambient light and the asked 30 participants to learn walking meditation over a biofeedback were widely used in rehabilitation and healing, period of 5 days; the experimental group used i-m-Walk and the concept of “slow technology” was proposed. Slow from day 2 to day 4, and the control group did not use it at all. technology aimed to use slowness in learning, understanding The results showed that i-m-Walk effectively assisted and presence to give people time to think and reflect [30]. beginners in slowing down their pace and decreasing the Meditation is one kind of the example in slow technology. error rate of pace during walking meditation. To conclude, Also, “walking meditation” is an important form of this study may be of importance in providing a mechanism to meditation. Although many research projects have focused assist users in better understanding of his pace and on meditation, showing benefits such as enhancing the improving the walking habit. In the future, i-m-Walk could synchronization of neuronal excitation [11] and increasing be used in other application, such as walking rehabilitation. the concentration of antibodies in blood after vaccination [3], most projects have focused on meditation while sitting. In Keywords: Smart Shoes, Walking Meditation, Visual Feedback, order to better understand how users walk in a portable way, Slow Technology we have designed i-m-Walk, which uses multiple force sensitive resistor sensors embedded in the soles of shoes to monitor users’ pressure distribution while walking. The 1. INTRODUCTION sensor data are wirelessly transmitted over ZigBee, and then Walking is an integral part of our daily lives in terms of relayed over Bluetooth to be analyzed in real-time on transportation as well as exercise, and it is a basic exercise smartphones. Interactive visual feedback can then be can be done everywhere. In recent years, many research provided via the smartphones (see Figure 1). 
projects have studied walking-related human-computer In this paper, in order to develop a system that can help interfaces on mobile phones with the rapid growth of users in improving the walking habit, we use the training of smartphones. For example, there is research evaluated the walking meditation as an example application to evaluate the walking user interfaces for mobile devices [9], and proposed effectiveness of i-m-Walk. Traditional training of walking minimal attention user interfaces to support ecologists in the meditation demands one-on-one instruction, and there is no field [21]. In addition, there are several walking-related standardized evaluation after training. It is challenging for systems developed to help people in walking and running. beginners to self-learn walking meditation without feedback Nike+ used footstep sensors attached to users’ shoes to from the trainers. adjust the playback speed of music while running and to track running related statistics like time, distance, pace, and 1021
    2.2 Multimedia-Assisted WalkingApplication There are some studies using multimedia feedback and walking detection technique to help people in monitoring or training application in daily life. In the application of dancing training, there was an intelligent shoe that can detect the timing of footsteps, and play the music to help beginners in learning of ballroom dancing. If it detected missed footsteps while dancing, it would show warning messages to the user. The device emphasizes the acoustic element of the music to help the dancing couple stay in sync with the music [4]. The other application of dance performance could detect dancers’ pace and applied them in interactive music for dance performance [20]. In the application of musical tempo and rhythm training for children, there was a system which can write out the music on a timeline along the ground, and Figure 1. A participant is using i-m-Walk during walking each footstep activates the next note in the song [13]. meditation. Besides, visual information was be used to adjust foot trajectory during the swing phase of a step when stepping We have designed experiments to test the effect of onto a stationary target [23]. training by using i-m-Walk during walking meditation. In the application of psychological, there are some Participants were asked to do a 15-minute practice of experiments related to walking perceptive system. In the walking meditation for five consecutive days. During the application in walking assisting of stroke patients, lighted experiment, participants using i-m-Walk will be shown real- target was used to load onto left side and right side of time pace information on the screen. We would like to test walkway, and stroke patients can follow the lighted target to whether it could help participants to raise the awareness for carry on their step. The result pointed out that stroke patient their walking behaviour and to improve it. We proposed two might effectively get help by using vision and hearing as hypotheses: (a) i-m-Walk could help users to walk slower guidance [14]. An fMRI study of multimedia-assisted walking during meditation; (b) i-m-Walk could help users to walking showed that increased activation during visually walk correctly in the method of walking meditation. guided self-generated ankle movements, and proved that This paper is structured as follows: The first section deals multimedia-assisted walking is profound to people [1]. In the with the introduction of walking system. The second section related application of walking in entertainment, Personal of the article is a review of walking detection and Trainer – Walking [17] detects users’ footsteps trough multimedia-assisted walking applications. This is followed accelerometer, and encourage users to walk through by some introduction of walking meditation. The forth interesting and interactive games. In the healthcare section describes the system design. After which application, there was a system applied the concept of experimental design is presented. The results for the various intelligence shoes on the healthcare field, such as to detect analyses are presented following each of these descriptive the walking stability of elderly and thus to prevent falling sections. Finally, the discussion and conclusion are presented down [19]. The system monitored walking behaviours and and suggestions are made for further research. used a fall risk estimation model to predict the future risk of a fall. 
Another application used electromyography biofeedback system for stroke and rehabilitation patients, and 2. RELATED WORKS the results showed that there was recovery of foot-drop in the swing phase after training [8]. 2.1 Methods of Walking Detection In the past decade, there were many researches on 3. WALKING MEDITATION intelligent shoes. The first concept of wearable computing and smart clothing systems included an intelligence cloth, The practice of meditation has several different ways and glasses, and an intelligence shoes. The intelligence shoes postures, such as meditation in standing, meditation in sitting, could detect the walking condition [12]. Then, a research meditation in walking, or meditation in lying down on back. used pressure sensor and gyro sensor to detect feet posture, Compared to sitting meditation, people tend to feel less dull, such as heel-off, swing, and heel-strike[22], and a research tense, or easily distract in walking meditation. In this paper, embedded pressure sensor in the shoes to detect the walking we focus on the meditation in walking, which is also named cycle, and a vibrator was equipped to assist when walking walking meditation. Walking meditation is a way to align the [26]. Besides, there were many other methods on walking feeling inside and outside of the body, and it would helps detection, such as use bend sensor [15], accelerometer [2], people to focus and concentrate on his mind and body. ultrasonic [29], and computer vision technology [24] to Furthermore, it can also deeply investigate our knowledge analyze footsteps. and wisdom. 1022
    4.1 i-m-Walk Architecture The shoe module is based on Atmel's high-performance, low-power 8-bit AVR ATMega328 microcontroller, and transmits sensing values through a 2.4GHz XBee 1mW Chip Antenna module wirelessly. The module size is 3.9 cm x 5.3 cm x 0.8 cm with an overall weight of 185g (Figure 4), Figure 2. Six phases of each footstep in walking meditation [25]. including an 1800mAh Lithium battery can continuous use for 24 hours. We kept the hardware small and lightweight in The methods of walking meditation aim to be as slow as order not to affect users while walking. possible in taking pace, and landing each pace with toes first. We use force sensitive resistor sensors to detect the The participants could focus on the movement of walking, pressure distribution of feet while walking. The sensing area from raising, lifting, pushing, lowering, stepping, to pressing of sensor is 0.5 inch in diameter. The sensor changes its (Figure 2). Also, the participants should aware of the resistance depending on how much pressure is applied to the movement of the feet in each stage. It is important to stay sensing area. In our system, the intelligent shoes would aware of the feet sensation. As a result, keep on practicing of detect the walking speed and the walking method in walking walking meditation is an effective way to develop meditation. According to the recommendations of concentration and maintain tranquillity in participants’ daily orthopaedic surgery, we use three force sensitive resistor lives. Furthermore, it can also help participants become sensors fixed underneath the shoe insole, and the three main calmer and their minds can be still and peaceful. With the sustain areas located at structural bunion, Tailor’s bunion, long-term practice of walking meditation, it benefits people and heel, seperately. (see Figure 4). The shoe module is put by increasing patience, enhancing attention, overcoming on the outside of the shoes (see Figure 5). With a fully drowsiness, and leading to healthy body [6]. In order to help charged battery, the pressure sensing modules can beginners in learning the walking methods in walking continuous use for 24 hours. A power button can switch the meditation, i-m-Walk system was developed. module off when it is not being used. 4. SYSTEM DESIGN i-m-Walk includes a pair of intelligent shoes for detecting pace, an ZigBee-to-Bluetooth relay, and a smartphone for walking analysis and visual feedback. There are three force sensitive resistor sensors fixed underneath each shoe insole that send pressure data through the relay. We implemented an analysis and visual feedback application on HTC HD2 smartphone running Window Mobile 6.5 and which has a 4.3-inch LCD screen. The overview of the system is shown in Figure 3. Figure 4. Sensing module: the micro-controller and wireless Force Relay module (right), and one of the insole with three force sensitive Sensor Right Shoe resistor sensors (left). Xbee (Receiver) Force Microcontroller Sensor Bluetooth Force Xbee (Transfer) Sensor Smart Phone Force Bluetooth Sensor Left Shoe Footstep Detection Force Microcontroller Sensor Stability Analysis Force Xbee (Transfer) Visual feedback Sensor Figure 3. System structure of i-m-Walk. Figure 5. Sensing shoes: attached the sensing module into the shoes. 1023
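As a rough illustration of the shoe-module firmware, the Arduino-style sketch below samples the three force-sensitive resistors at about 30 Hz and streams the raw values to the XBee radio over the UART. Pin assignments, baud rate and message format are assumptions, not the actual firmware.

```cpp
// Minimal sketch for an ATmega328-class board: three FSRs are sampled and
// one comma-separated line per sample, e.g. "512,430,88", is sent to the
// XBee module wired to the serial port.
const int FSR_PINS[3] = {A0, A1, A2};   // structural bunion, Tailor's bunion, heel

void setup() {
  Serial.begin(9600);                   // XBee on the UART
}

void loop() {
  for (int i = 0; i < 3; ++i) {
    Serial.print(analogRead(FSR_PINS[i]));
    Serial.print(i < 2 ? ',' : '\n');
  }
  delay(33);                            // roughly 30 samples per second
}
```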
    4.2 Walking detection 4.3.1 Pace awareness There were many methods in walking detection, and the The function of pace awareness is to help user aware of his methods were different according to the applications. In our walking phases and whether he use correct footstep during system, we use three pressure sensors in each shoe, and walking meditation. A feet pattern shows on the smartphone, totally will sense six sensing values at the sample rat 30 and the color block shows where the position of foot’s times per second. In order to detect whether the user lands center of gravity is and how much is forced on the foot in each pace with toes first or heel first, we divide each shoe real-time. The transparency of the block will decrease while into two parts, toe part and heel part. The sensing value of user land his feet. On the contrary, the transparency of the toe part is the average of two force sensors, which block will increase while user raise his foot. Besides, the underneath at structural bunion and Tailor’s bunion. The color block moves top-down while the participants land sensing value of heel part is the force sensor underneath at with toes first. The color blocks move bottom-up means that heel. Therefore, the system divides the sensing area into toe the participants land with heel first. If forward of the foot part and heel part in each shoe, and totally four parts in each lands first, then the colour block moves forward to indicate person. Then, we use threshold method to detect the moment the landing position. In addition, if the user land pace with while the sensing part less than the threshold value, and toe first, the system would defined that he is using correct activate that part. We define the beginning of the each gait walking methods in walking meditation, and the colour cycle while heel part is lifting. The end of the gait cycle is at block would display as the color of green. On the contrary, the moment while the another foot’s heel part is rising. The while the user lands pace with heel first, the system would previous cycle stops in one foot and another foot begins a recognize that he is using wrong walking methods, and the new cycle of gait. Figure 6 shows an example. In this case, colour block would change the color from green to red. when the heel in left foot rose in 5 seconds, the sensing value was less than the threshold, and our system detected left foot 4.3.2 Walking Speed and Warning Message rise in this moment. In the mean time, it means that user’s During walking meditation, people should stabilize his right foot is stepping down. On the contrary, when the heel walking paces at a lower speed. By this way, the user in right foot rose in 10.7 seconds, the sensing value was less interface should provide the information of walking speed in than the threshold, and our system detected right foot rise in real time, and remind the user while the walking speed is this moment. too fast. Walking speed and wrong pace can be measured after the processing of walking signals. Then, the walking speed is visualized as a speedometer. The indicator point of the speedometer will point to the value of walking speed. For example, if the indicator points to the value “30”, it means that the user walk thirty paces per three minutes. Therefore, the speedometer provides the function to remind the user while he is walking too fast. According to the pilot study, we defined the lower-bound of the walking speed as 40 paces per three minutes. 
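The toe/heel decision can be sketched in a few lines: the toe value is the mean of the two forefoot sensors, the heel value is the heel sensor, and whichever part first crosses the threshold when the foot lands determines whether the footstep counts as toes-first (correct) or heel-first (wrong). The threshold value and structure names below are illustrative.

```cpp
struct ShoeSample { float toe1, toe2, heel; };   // one 30 Hz sample from a shoe

class LandingDetector {
public:
    explicit LandingDetector(float threshold) : th_(threshold) {}

    // Returns +1 for a toes-first landing, -1 for heel-first, 0 otherwise.
    int update(const ShoeSample& s) {
        float toe      = 0.5f * (s.toe1 + s.toe2);   // average of forefoot sensors
        bool  toeDown  = toe    > th_;
        bool  heelDown = s.heel > th_;

        int result = 0;
        if (!footDown_ && (toeDown || heelDown))
            result = toeDown ? +1 : -1;              // which part touched first
        footDown_ = toeDown || heelDown;             // foot lifted when both below
        return result;
    }

private:
    float th_;
    bool  footDown_ = false;
};
```

Counting the detected landings over a sliding three-minute window then gives the pace value shown on the speedometer, with the "too fast" warning raised above 40 paces per three minutes.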
While the walking speed exceeds the speed, the indicator will point to the red area, and the screen will show a warning message “too fast” on the top of the screen. The warning message would disappear while the walking speed is less than 40 paces per three minutes. Figure 6. Signal processing of walking signals. Blue line indicates Warning the sensed weight (kg) of heel and green line indicates the sensed message weight of toe. Red line means the threshold in detecting the landing event. Gray block means that which foot is landing. Footstep Awareness 4.3 User interface Multimedia feedback can be effectively applied in preventive medicine [7], and it can also assist rehabilitation Walking patients in walking effectively [5, 27]. i-m-Walk is developed speed to assist user in learning the walking methods during walking meditation. The user interface of i-m-Walk includes three components: warning message, pace awareness, and walking speed (see Figure 7). In this section, we describe the user interface and the design principles of our system. Figure 7: User interface of i-m-Walk. The user interface shows three events: warning message, condition of each footstep, and 1024
    walking speed. Picture(a) shows that the user used incorrect meditation. 83.3% of the participants carry mobile phone all walking method in right foot and the colour block changed to red the time, and 63.3% of the participants have the experience on the right foot. Picture (b) shows that the walking speed is too of using smartphone. There were fifteen participants (eleven fast (46 steps per three minutes). male and four female) in experiment group (with visual feedback), and fifteen participants (eleven male and four female) in control group (without visual feedback). Because 5. EXPERIMENT DESIGN the feet size was different in each participant, we prepared Two experiments was designed to test the effects of i-m- two pair os shoes with different sizes, and participants could Walk during walking meditation. The first experiment was a choose comfortable one to wear. pilot study. In this study, we evaluated the effect of visual feedback which showed six sensing curves which projected 5.2.2 Location on the wall. The second experiment evaluated the effects of i-m-Walk in improving user´s walking behaviour. Meditating in a quiet and enclosed area would be easier to bring mind inward into ourselves and may reach in calm 5.1 Pilot study and peace situation. In this experiment, we selected a Before the test of i-m-Walk, we designed a preliminary corridor in the faculty building as the experimental place for study to test the effects of visual feedback displayed on the walking meditation. The corridor is a public place at an wall. Eight master students volunteered to participate in this enclosed area and few people would conduct their daily pilot study. Participants’ average age is 26.3 (SD=0.52). All activities like standing, walking, and interacting with one participants have the experience of sitting meditation before, another there. The surrounding of corridor is quiet and but all of them do not have the experience of walking comfortable for user to reach their mind in calm. The length meditation. There were four participants (three male and one of the corridor is thirty meters, and the width is three meters. female) in experiment group (with visual feedback), and four The temperature is 21~23 Celsius degree. participants (four male) in control group (without visual feedback). Participants would take ten minutes each day and 5.2.3 Procedure and analysis for three consecutive days in the experiment. The experiment took place at a seminar room in the faculty building. Before Before the experiment, participants were asked to walk the experiment, participants were taught the methods and alone the corridor in usual walking speed, and we recorded it. principles of walking meditation. The experiment was a 4 × Then, participants were taught the methods of walking 2 between-participants design. In the experiment group, meditation. The guideline of the walking meditation which participants were asked to watch the curves which showed we provided to the participants was as follows: “Walking feet’s sessing value projected on the wall. In the control meditation is a way to align the feeling inside and outside of group, participants were asked to walk themselves without the body. You should focus on the movement of walking, visual feedback. Participants walked straight in the seminar from raising, lifting, pushing, lowering, stepping, to pressing. room. 
The results showed that there was a significant main You have to land every footstep with toes first and then effect that experimental group had lower walking speed than slowly land your heel down. During walking meditation, you control group (p<0.05) during the three days. The average should stabilize your walking paces at a lower speed as number of wrong pace in experimental group was less than possible. You have to relax your body from head to toes.” control group, too. As the results from this pilot study, we The experiment was a 15 × 2 between-participants concluded two preliminary conclusion that (a) visual design. Participants would take fifteen minutes each day and biofeedback could help users in slowing down the walking for five consecutive days in this experiment. Table 1 shows speed during walking meditation; (b) Multimedia guidance the procedure of this experiment. In the experiment group, could usefully help user in aware of his pace during walking participants were asked to use i-m-Walk from day 2 to day 4. meditation, and could decrease the number of wrong pace. In the control group, participants were asked to walk However, we also observed some issues in pilot study, one is themselves without any feedback during walking meditation. that the perspective would change over time while walking DAY 1 DAY 2 DAY 3 DAY 4 DAY 5 to different location, and it might influence the effect of learning. Based on the results and recommands, we designed Experimental ○ ● ● ● ○ a experiment to evaluate the effect of i-m-Walk. Group 5.2 User study Control ○ ○ ○ ○ ○ 5.2.1 Participants Group Thirty master and PhD students in the Department of Table 1: Experimental procedure: ● means that participants have Computer Science volunteered to participate in this to use i-m-Walk during walking meditation, and ○ means that experiment. Participants’ average age is 25.2 (SD=3.71). participant do not have to use i-m-Walk during walking meditation. Twenty-seven participants have the experience of sitting meditation, and three participants do not. However, all While learning of walking mediation, all participants participants do not have the experience of walking were asked to walk clockwise around the corridor and hold 1025
    the smartphone. Incontrol group, there was no visual minute learning of walking meditation on experimental feedback on smartphone although they still need to hold the group and control group from day 1 to day 5. smartphone. In experimental group, participants were In experimental group, the median value of the wrong informed that they can choice not to look at the visual pace decreased from eight wrong pace in day 1 to one wrong feedback while aware of their pace well. The participants of pace in day 5, and the value of wrong pace decreased over experimental group were asked to complete a questionnaire day. In control group, the median value of the wrong pace after experiments from day 2 to day 4. Besides, we asked all decreased from 7 paces in day 1 to 5 paces in day 5, but the participants the feelings and impressions after the experiment wrong pace decreased just in first three days. The results in day 5. However, all participants could write down any showed that i-m-Walk could effectively reduce wrong pace recommends and feelings after the experiment, and we will during walking meditation. discuss the issues in the discussion section. 5.2.4 Results We analyzed the average walking speed and the wrong pace both in experimental group and control group. In the results of the average walking speed, figure 8 shows the average time of each pace on experimental group and control group from day 1 to day 5. In day 1 and day 5, all participants learned walking meditation without using i-m- Walk. Following t-tests revealed significant difference (p < 0.005) in the average walking time per footstep for the experimental group and control group from day 2 to day 4. In experimental group, the average walking time per footstep increased from 4.5 seconds in day 1 to 10.9 seconds in day 5. Figure 9: The median value of error footsteps for experimental In control group, the average walking time per footstep group and control group from day 1 to day 5. increased from 3.2 seconds (day 1) to 5.1 seconds (day 5). The results showed that the participants in experimental The experimental group was asked to complete a group had significant main effect (p < .005) in slowing down questionnaire after using i-m-Walk system from day 2 to day the walking speed after the learning of walking meditation. 4. The content of the questionnaire was the same in each day. On the contrary, the participants in control group had no Figure 10 shows the results of questionnaires which were significant main effect (p > .1) in slowing down the walking completed after walking meditation. We asked two questions: speed after the learning of walking meditation. The results (1) what is the degree of i-m-Walk to help you in aware of showed that i-m-Walk could help participants in slowing your pace? (2) what is the degree of i-m-Walk to help you in down the walking speed during walking meditation. slowing down the footstep? There were five options for answers, including “1: serious interference”, “2: a little interference”, “3: no interference and no help”, “4: a little help”, and “5: very helpful”. The results showed that all participants in experimental group gave positive feedback both in question 1 and question 2, and the score of questionnaire were between “a little helpful” and “very helpful”. Figure 8: The average time of one footstep for experimental group and control group from day 1 to day 5. Error bars show ±1 SE. 
In our experiment, the rule of correct walking method was that participants should land every footstep with toes Figure 10: The questionnaire results filled by experimental group first during walking meditation. If they landed footstep with from day 2 to day 4. The red line shows the baseline of the heel first, it was a wrong pace defined in our experiment. satisfaction. Error bars show ±1 SE. Figure 9 shows the median values of total wrong pace in 15 1026
6. DISCUSSION

The aim of this section is to summarize, analyze, and discuss the results of this study, and to give guidelines for the future development of applications.

6.1 User Interface

The user interface of i-m-Walk provides information about the pace, including walking speed, wrong paces, and the center of the feet. The results on walking speed showed significantly that i-m-Walk could help beginners decrease their walking speed during walking meditation. Some comments from participants of the experimental group:

User E6 on day 3: "I always walked fast before, but when I saw the dashboard and the warning message 'too fast,' it was helpful in reminding me to slow down my walking speed."

We list two design principles of the user interface: (a) we used the form of a dashboard to represent the walking speed; the value of the walking speed is easy to watch, and users may become aware of the change in walking speed while they slow down or speed up; (b) i-m-Walk provided an additional alarm mechanism, a warning message "too fast," shown while walking too fast; this mechanism can remind users when they are distracted. The results on wrong paces showed that i-m-Walk could effectively reduce wrong paces for beginners during walking meditation. One of the participants from the experimental group said:

User E1 on day 2: "When I saw the color of the block on the screen change from green to red, I knew that I had made a wrong pace. Then I would concentrate on my pace deliberately for the next footstep."

6.2 Human Perception

Human beings receive messages by means of the five modalities: vision, sound, smell, taste, and touch. The modalities most used in the field of human-computer interaction are the visual and auditory modalities. There was a comment from an experimental participant:

User E3 on day 2: "If I could listen to my pace during walking meditation, I would not need to hold the smartphone."

In cross-modal research, the visual modality is usually considered superior to the auditory modality in the spatial domain. In our case, we need to show the footstep phases accurately, and we also need to show the walking speed and wrong paces at the same time. Therefore, we selected visual feedback as the user interface. The advantage is that users can decide whether to watch the information or not, but the shortcoming is that users fail to receive the information when they do not look at it. Therefore, it is possible to provide further interaction methods, such as tactile and acoustic perception, to remind users.

On the other hand, multimedia feedback mechanisms may attract the user's attention in some cases, and too many inappropriate and redundant events may disturb users. In our system, we provided visual feedback all the time during walking meditation because we did not know whether the user needed the guidance or not, but we informed participants that they could decide not to look at the visual feedback once they were sufficiently aware of their pace. In this way, the interference during use could be minimized. The questionnaire showed that the participants felt that there was no interference while using i-m-Walk and that the system was helpful in use.

6.3 Beginners vs. Masters

In recent years, the concept of "slow technology" has been applied in many mediated systems. The design philosophy of "slow technology" is that we should use slowness in learning, understanding, and presence to give people time to think and reflect. In our case, walking meditation is one kind of conception within slow technology. There are two main parts in walking meditation: the inside condition and the outside condition. The inside condition means the meditation of the mind, and the outside condition means the meditation of the walking posture. All participants were beginners in our experiment, because we focused on the training of the outside condition, the walking posture. The difference between beginners and masters in walking meditation is that beginners are not familiar with walking meditation and need to pay more attention to the control of pace, whereas masters are familiar with it and can focus on the meditation of aligning the inside and the outside of the body. Walking meditation is a way to align the feelings inside and outside of the body, and beginners should become familiar with the walking posture before the spiritual development. In this paper, the goal of our experiment is to evaluate the learning effects of the i-m-Walk system. The experimental results showed that the participants of the experimental group could slow down their walking speed and decrease their wrong paces after five days of training.

Six participants in the experimental group felt that the experimental time on day four was shorter than on the first day, although the experimental time was the same. On the contrary, there was no such comment from the participants in the control group. These results suggest that i-m-Walk could help users train the walking posture of walking meditation.

6.4 Reaction Time

Reaction time is an important issue in human-computer interaction design. If the reaction delay is too long, users cannot control the system well and cannot easily be aware of the interaction. According to our observation, the delay time of i-m-Walk is 0.2 seconds. However, this delay does not affect users, because the application in this experiment does not need a fast reaction time; the average pace took 10.9 seconds in the experimental group on day five. The results of the questionnaires also showed that participants felt that the visual feedback could reflect the walking status immediately. However, the somatosensory feedback from one's own feet is the most intuitive, and i-m-Walk only provides guidance for beginners when they need it.
    7. CONCLUSIONS AND FUTURE WORK [9] Kane, S.K., Wobbrock, J.O., and Smith, I.E., 2008. Getting off the treadmill: evaluating walking user interfaces for mobile devices in In this paper, we present a mobile application that uses public spaces. In Proc. MobileHCI '08, 109-118. pressure sensors in shoes to visualize phases of footsteps on [10] Kong, K. and Tomizuka, M., 2008. Smooth and Continuous Human a mobile device in order to raise the awareness for the user´s gait Phase Detection based on foot pressure patterns. In Proc. ICRA ’08, 3678-3683. walking behaviour and to help him improve it. Our study [11] Lutz, A., 2004. Long-term meditators self-induce high-amplitude showed that i-m-Walk could effectively assisted beginners in gamma synchrony during mental practice. PNAS 101, 16369-16373. slowing down their pace and decreasing the error rate of pace [12] Mann, S., 1997. Smart clothing: The wearable computer and during walking meditation. Therefore, the conception of i-m- WearCam. Journal of Personal Technologies 1(1), 21-27 Walk could be used in other applications, such as walking [13] Mann, S., 2006. The andante phone: a musical instrument that you rehabilitation. play by simply walking. In Proc. ACM Multimedia 14th, 181-184. Despite the encouraging results of this study as to the [14] Montoya, R., Dupui, P.H., Pagès, B., and Bessou, P., 1994. Step- positive effect of i-m-Walk, future research is required in a length biofeedback device for walk rehabilitation. Journal of Medical number of directions. In the part of intelligient shoes, we will and Biological Engineering and Computing 32(4), 416-420. analyze user’s walking method, such as pigeon toe gait and [15] Morris, S.J., and Paradiso, J.A., 2002. Shoe-integrated sensor system out toe gait while walking. In the part of biofeedback for wireless gait analysis and real-time feedback. In Proc. Joint IEEE EMBS (Engineering in Medicine and Biology Society) and BMES mechanisms, we will try to design more interaction methods, (the Biomedical Engineering Society) 2nd, 2468-2469. such as tactile perception and acoustic perception. Besides, [16] Nike, INC., 2009. Nike+, Retrieved October 26, 2009, from we will record and analyze user’s learning status while www.nikeplus.com/ walking, and provide appropriate and personalized guidance [17] Nintendo, 2009. Personal Trainer - Walking, Retrieved November 26, according to his condition. Currently, we are using additional 2009, from http://www.personaltrainerwalking.com/ sensing devices, such as Breath-Aware Garment and Sensing [18] Nintendo, 2009. Wii Fit Plus, Retrieved January 8, 2010, from ring, to detect user’s biosignal and activities, and integrate to http://www.nintendo.co.jp/wii/rfpj/index.html i-m-Walk to analyze the breathing status and heartbeat rate [19] Noshadi, H., Ahmadian, S., Dabiri, F., Nahapetian, A., Stathopoulus, while walking and running. T., Batalin, M., Kaiser, W., Sarrafzadeh, M., 2008. Smart Shoe for Balance, Fall Risk Assessment and Applications in Wireless Health. In Proc. Microsoft eScience Workshop. 8. ACKNOWLEDGMENT [20] Paradiso, J., 2002. FootNotes: Personal Reflections on the This work was supported in part by the Technology Development of Instrumented Dance Shoes and their Musical Development Program for Academia, Ministry of Economic Applications. In Quinz, E., ed., Digital Performance, Anomalie, Affairs, Taiwan, under grant 98-EC-17-A-19-S2-0133. digital_arts Vol. 2, 34 - 49. 
[21] Pascoe, J., Ryan, N. and Morse, D., 2000. Using while moving: HCI 9. REFERENCES issues in fieldwork environments. ACM Transactions on Computer- Human Interaction, 7 (3), 417- 437. [22] Pappas, P. I., Keller, T., Mangold, S., Popovic, M.R., Dietz, V., and [1] Christensen, M.S., Lundbye-Jensen, J., Petersen, N., Geertsen, S.S., Morari, M., 2004. A reliable, insole-embedded gait phase detection Paulson, O.B., and Nielsen, J.B., 2007. Watching Your Foot Move-- sensor for FES-assisted walking. Journal of IEEE Sensors 4 (2), 268- An fMRI Study of Visuomotor Interactions during Foot Movement. 274. Journal of Cereb Cortex 17 (8), 1906-1917. [23] Reynolds, R.F., Day, B.L., 2005. Visual guidance of the human foot [2] Crossan, A., Murray-Smith, R., Brewster, S., Kelly, J., and Musizza, during a step. Journal of Physiology 569 (2), 677-684. B., 2005. Gait phase effects in mobile interaction, In Proc. CHI '05 [24] Quek, F., Ehrich, R., Lockhart, T., 2008. As go the feet...: on the extended abstracts on Human factors in computing systems, 1312- estimation of attentional focus from stance. In Proc. ICMI 10th, 97- 1315. 104. [3] Davidson, R.J., 2003. Alterations in brain and immune function [25] Thera, S., 1998. The first step' to Insight Meditation. Buddhist produced by mindfulness meditation. Psychosom Med. 65, 564-570. Cultural Centre. [4] Drobny, D., Weiss, M., and Borchers, J. 2009. Saltate! -– A Sensor- [26] Watanabe, J., Ando, H., and Maeda, T., 2005. Shoe-shaped Interface Based System to Support Dance Beginners. In Proc. CHI '09 for Inducing a Walking Cycle. In Proc. ICAT 15th, 30-34. Extended Abstracts on Human Factors in Computing Systems, 3943- 3948. [27] Woodbridge, J., Nahapetian, A., Noshadi, H., Kaiser, W. and Sarrafzadeh, M., 2009. Wireless Health and the Smart Phone [5] Femery, V.G., Moretto, P.G., Hespel, J-MG, Thévenon, A., and Conundrum. HCMDSS/ MDPnP. Lensel, G., 2004. A real-time plantar pressure feedback device for foot unloading. Journal of Arch Phys Med Rehabi 85(10), 1724-1728. [28] Ikemoto, L., Arikan, O., and Forsyth, D., 2006. Knowing when to put your foot down. In Proc. Interactive 3D graphics and games 06’, 49- [6] Hanh, T.N., Nquyen, A.H., 2006. Walking Meditation. Sounds True 53. Ltd. [29] Yeh, S.Y., Wu, C.I., Chu, H.H., and Hsu, Y.J., 2007. GETA sandals: [7] Hu, M.H., and Woollacott, M.H., 1994. Multisensory training of a footstep location tracking system. Personal and Ubiquitous standing balance in older adults: I. Postural stability and one-leg Computing 11(6): 451-463. stance balance. Journal of Gerontology: MEDICAL SCIENCES 49(2), M52-M61. [30] Hallnäs, L., Redström, J., 2001. Slow Technology: Designing for Reflection. Personal and Ubiquitous Computing, Vol. 5(3). pp. 201- [8] Intiso D., Santilli V., Grasso M.G., Rossi R., and Caruso I., 1994. 212. Rehabilitation of walking with electromyographic biofeedback in foot-drop after stroke. Journal of Stroke 25(6), 1189-1192. [31] adidas, INC., 2010. miCoach, Retrieved March 5, 2010, from www.micoach.com 1028
    Object of InterestDetection Using Edge Contrast Analysis Ding-Horng Chen FangDe Yao Department of Computer Science and Information Department of Computer Science and Information Engineering Engineering Southern Taiwan University Southern Taiwan University Yong Kang City, Tainan County Yong Kang City, Tainan County [email protected] [email protected] Abstract— This study presents a novel method to detect the the separation of variations in illumination from the focused object-of-interest (OOI) from a defocused low depth- reflectance of the objects (also known as intrinsic image of-field (DOF) image. The proposed method divides into three extraction) and in-focus areas (foreground) or out-of-focus steps. First, we utilized three different operators, saturation (background) areas in an image. contrast, morphological functions and color gradient to The DOF is the portion of a scene that appears compute the object's edges. Second, the hill climbing color acceptably sharp in the image. Although lens can precisely segmentation is used to search the color distribution of an focus at one specific distance, the sharpness decreases image. Finally, we combine the edge detection and color segmentation to detect the object of interest in an image. The gradually on each side of the focused distance. A low (small) proposed method utilizes the edge analysis and color DOF can be more effective to emphasize the photo subject. segmentation, which takes both advantages of two features The OOI is thus obtained via the photography technique by space. The experimental results show that our method works using low DOF to separate the interested object in a photo. satisfactorily on many challenging image data. Fig. 1 shows a typical OOI image with low DOF. Keywords-component; Object of Interest (OOI); Depth of Field (DOF); Object Detection; Edge Detection; Blur Detection. I. INTRODUCTION The market for digital single-lens reflex cameras, or so- called DSLR, has expanded tremendously for its price become more acceptable. For a professional photographer, the DSLR owns the advantages for the excellent image quality, the interchangeable lenses, and the accurate, large, and bright optical viewfinder. The DSLR camera has bigger sensor unit that can create more obvious depth-of-field (DOF) photos, and that is the most significant features of DSLR. Figure 1. A typical OOI image According to market reports [1][2][3], the DSLR market share will grows very fast in the near future. Table 1 shows The OOI detection problem can be viewed as an the growth rate of the digital camera market. extension of the blurred detection problem. In Chung’s Table 1. Market Estimate of the Digital Cameras method [6], they compute x and y direction derivative and gradient map to measure the blurred level ,by obtaining the Year 2006 2011 Growth Rate edge points which is computed by a weighted average of the World Market 81 82.2 108% standard deviation of the magnitude profile around the edge point. DSLR 4.8 8.3 173% Renting Liu et al. [7] have proposed a method could DSC 76.8 79.9 104% determine blurred type of an image, using the pre-de ned Unit: Million US$ blur features, the method train a blur classifier to The extraction of the local region of interested in an discriminate different regions. This classifier is based on image is one of the most important research topics for some features such as local power spectrum slope, gradient computer vision and image processing [4][5]. The detection histogram span, and maximum saturation. 
Then they of object of interest (OOI) in a low DOF images can be detected the blurry regions that are measured by local applied in many fields such as content-based image retrieval. autocorrelation congruency to recognize the blur types. To measure the sharpness or blurriness edges in an image is The above methods determine the blur level and regions, also important for many image processing applications. For but they still cannot extract OOI object from an image. If the instance, checking the focus of a camera lens, identifying background is complex or edges are blurred, the described shadows (which edges are often less sharp than object edges), methods are unable to find OOI [6][7]. N. Santh and K.Ramar have proposed two approaches, i.e. the edge-based 1029
and region-based approach, to segment the low-DOF images [8]. They transformed the low-DOF pixels into an appropriate feature space called the higher-order statistics (HOS) map. The OOI is then extracted from a low-DOF image by region-merging and thresholding techniques as the final decision.

But if the object's shape is complex or the edges are not fully connected, it is still hard to find the object. The OOI may not be a compact region with a perfectly sharp boundary, so one cannot simply use edge detection to find a complete object in a low-DOF image. In some cases, such as macro photography or close-up photography, the depth-of-field is very low and some parts of the subject may be out of focus, which also causes a partial blur on the subject. To acquire a satisfactory result in OOI detection, not only the blurred part but also the sharp part needs to be taken into consideration. Finding a good OOI subject in the image is the challenge in this issue.

II. THE PROPOSED METHOD

In this paper, we propose a novel method to extract the OOI from a low-DOF image. The proposed algorithm contains three steps. First, we find the object boundaries based on computing the sharpness of edges. Second, hill-climbing color segmentation is used to find the color distribution and its edges. Finally, we integrate the above results to get the OOI location.

The first step is divided into three parts and is illustrated in Fig. 2. We calculate the feature parameters including the maximum saturation, the color gradient, and the local range image. The image is converted into the CIE Lab color space and edge detection is performed on it. In the noise-reduction part, we use a median filter to reduce fragmentary values. Then all the feature images are multiplied together to extract the exact position of the OOI.

Figure 2. Edge detection flowchart

A. Saturation Edge Power Mean

Fig. 3 shows the original image in which we want to detect the OOI. The background is out of focus and thus is smoother than the object we want to detect. The color saturation and edge sharpness are the major differences between the objects and the background. Color information is very important in blur detection. It is observed that blurred pixels tend to have less vivid colors than un-blurred pixels because of the smoothing effect of the blurring process. Focused (or un-blurred) objects are likely to have more vivid colors than blurred parts. The maximum saturation value in blurred regions is expected to be smaller than in un-blurred regions. Based on this observation, we use the following equation to compute pixel saturation:

    S_P = 1 − 3·min(R, G, B) / (R + G + B),   (1)

where S_P denotes the saturation value of the pixel. Equation (1) transforms the original image into the saturation feature space to find the higher-saturation parts of the image.

In low-DOF images, the saturation does not change dramatically in the background because it is smoother; on the contrary, the color saturation changes sharply along the edges. Therefore, we define the edge contrast CA, which is computed in a 3×3 window, as follows:

    CA = (1/n) Σ_{n∈M, n≠A} (n − A)²,   (2)

where M is the 3×3 window, A is the saturation value at the window center, and n is the value of a neighboring pixel in this window.

Equation (2) calculates the saturation intensity. Here we show the result images to demonstrate the processing steps. Fig. 4 is the resulting saturation image, and Fig. 5 shows the result after performing the edge contrast computation.

Figure 3. Original image

Figure 4. Saturation image
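The two measures above are simple to prototype. The following NumPy sketch (function names are ours, not the paper's) implements Eq. (1) and a straightforward reading of Eq. (2) on a 3×3 neighborhood, with the center pixel excluded from the average.

    import numpy as np

    def saturation_map(rgb):
        """Eq. (1): S_P = 1 - 3*min(R,G,B)/(R+G+B), computed per pixel."""
        rgb = rgb.astype(np.float64)
        total = rgb.sum(axis=2) + 1e-6          # avoid division by zero
        return 1.0 - 3.0 * rgb.min(axis=2) / total

    def edge_contrast(sat):
        """Eq. (2), read as the mean squared difference between each pixel
        and its eight 3x3 neighbours in the saturation map."""
        h, w = sat.shape
        pad = np.pad(sat, 1, mode='edge')
        acc = np.zeros_like(sat)
        count = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                acc += (pad[1 + dy:h + 1 + dy, 1 + dx:w + 1 + dx] - sat) ** 2
                count += 1
        return acc / count

    # usage: ca = edge_contrast(saturation_map(image)); high values mark sharp, saturated edges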
Figure 5. Saturation edge image

B. Color Gradient

The gradient of a scalar field is a vector field that points in the direction of the greatest rate of increase of the scalar field and whose magnitude is the greatest rate of change. It is very useful in many typical edge-detection problems. To calculate the gradient of the color intensity, we first use the Sobel operator to separate vertical and horizontal edges:

    Gx = [ −1 0 1; −2 0 2; −1 0 1 ] * A,   Gy = [ −1 −2 −1; 0 0 0; 1 2 1 ] * A,   (3)

    G = (Gx² + Gy²)^(1/2),   (4)

    θ = arctan(Gy / Gx).

Equations (3) and (4) show a traditional way to compute the gradient. Here θ is the edge angle, and θ = 0 for a vertical edge which is darker on the left side. We modify the above equations to be more accurate in our case with the following equations:

    Gx² = Rx² + Gx² + Bx²,
    Gy² = Ry² + Gy² + By²,
    Gxy = Rx·Ry + Gx·Gy + Bx·By,
    A = 0.5 · arctan( 2·Gxy / (Gx² − Gy²) ),
    G1 = { 0.5 · [ (Gx² + Gy²) + (Gx² − Gy²)·cos(2A) + 2·Gxy·sin(2A) ] }^(1/2),

where Rx, Gx, and Bx are the RGB layers filtered by the horizontal Sobel operator, and Ry, Gy, and By are the RGB layers filtered by the vertical Sobel operator. A is the angle of Gxy, and G1 is the color gradient of the image at angle A. The definition of G2 is quite similar to G1, but the term A is replaced by

    A' = A + π/2.

Therefore, G2 is computed by

    G2 = { 0.5 · [ (Gx² + Gy²) + (Gx² − Gy²)·cos(2A') + 2·Gxy·sin(2A') ] }^(1/2).

The value of the color gradient CG is obtained by choosing the maximum of G1 and G2, i.e.,

    CG = max(G1, G2).

This CG value represents the color intensity along the edge gradient; the CG value increases if the color at an edge point changes dramatically. Fig. 6 shows the result after the color gradient computation.

Figure 6. Color vector image

C. Local Range Image

In this study, we adopt the morphological functions dilation and erosion to find the local maximum and minimum values in a specified neighborhood. First, we convert the original image from the RGB color space to the CIE Lab color space. Because the luminance of an object is not always flat, we compute the local range value for the a and b layers without the L (luminance) component; we restrain the color diversification on the object in order to prevent this situation. The dilation, erosion, and local range computations are defined by the following equations:

    Dilation:  A ⊕ B = { z | (B̂)_z ∩ A ≠ ∅ },
    Erosion:   A ⊖ B = { z | (B)_z ⊆ A },
    Local Range Image = (A ⊕ B) − (A ⊖ B).

Fig. 7 shows the result after the local range operation.

Figure 7. A local range image
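As a rough illustration of the modified color gradient and the local range image, here is a NumPy/SciPy sketch. It uses arctan2 for numerical robustness and SciPy's grey-scale morphology; both are our implementation choices under the equations above, not the paper's exact code.

    import numpy as np
    from scipy import ndimage

    def color_gradient(rgb):
        """Colour gradient CG = max(G1, G2) built from per-channel Sobel
        responses, following the reconstructed equations above."""
        rgb = rgb.astype(np.float64)
        gx = np.stack([ndimage.sobel(rgb[..., c], axis=1) for c in range(3)])
        gy = np.stack([ndimage.sobel(rgb[..., c], axis=0) for c in range(3)])
        gxx = (gx ** 2).sum(axis=0)              # Rx^2 + Gx^2 + Bx^2
        gyy = (gy ** 2).sum(axis=0)
        gxy = (gx * gy).sum(axis=0)              # Rx*Ry + Gx*Gy + Bx*By
        a = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy)

        def mag(theta):
            return np.sqrt(np.clip(0.5 * ((gxx + gyy)
                                          + (gxx - gyy) * np.cos(2 * theta)
                                          + 2 * gxy * np.sin(2 * theta)), 0, None))

        return np.maximum(mag(a), mag(a + np.pi / 2))

    def local_range(channel):
        """Local range image for one 2-D chroma channel (a or b of CIE Lab):
        grey-level dilation minus erosion in a 3x3 window."""
        dil = ndimage.grey_dilation(channel, size=(3, 3))
        ero = ndimage.grey_erosion(channel, size=(3, 3))
        return dil - ero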
    D. Median Filter ImageColorSegmention Median filter is a nonlinear digital filtering technique, which is often used to remove noise. Such noise reduction is a typical pre-processing step to improve the results for later processing. The process of edge detection will cease some fragmentary values. If the values are low or the fragment edges are not connected, it could as seen as noise. Therefore, we adopt the median filter to reduce the fragmentary pixels. E. Hill Climbing Color Segmentation Edge detection can find most edges of OOI object, but the boundaries are not usually closed completely. The Figure 9. A color segmentation result morphological operators cannot link all the disconnected edges to obtain a complete boundary. Most OOI edges can F. Edge Combination be detected after the previous procedures, but some edges are still unconnected. To make the OOI boundary be a regular The OOI edges are obtained by two methods. First, we closure, we adopt color segmentation to connect the isolated use morphological close operation, which is a dilation edges. followed by an erosion, to connect the isolated points. The The color segmentation method is illustrated in Fig. 8. close operation will make the gaps between unconnected This method is based on T. Ohashi et al. [9] and R. Achanta edges become smaller and make the outer edges become et al. [10] .The hill-climbing algorithm detects local maxima smoother. Second, we adopt edge detection on color of clusters in the global three-dimensional color histogram of segmentation map to find the color distribution, and merge it an image. Then, the algorithm associates the pixels of an with pre-edge detection result. image with the detected local maxima; as a result, several After the above procedures, we can get most of the edge visually coherent segments are generated. clues, and then we want to integrate these clues to a complete OOI boundary. Let the result of the boundary detection be IE, the result from the color segmentation be IC. The edge is extended by counting the pixels in IC and the neighboring points of IE. To determine whether a pixel at the end of the IE = to be extended or not, here we reassign a value P at point (i,j) ( , ) (, ) as an “edge extension” value as follow: (, ) , where n=-1, m=1, is sliding in a 3x3 window, IE is the pre- edge detection image value of the neighborhood in this window. Equation (16) will remove the un-necessary pixels and let the OOI mask be closed by extending the boundaries. The value is shown in Fig. 10. The result image that merges the edge extension image with the color segmentation edge is shown in Fig. 11. Figure 8. Color segmentation and egde detection flow chart The detailed algorithm is described as follows: (a) 1. Convert image to CIE Lab color space. 2. Build CIE Lab color histogram. 3. Follow color histogram to find local maximum value. 4. Apply local maximum color to be initial centroid of k-means classification. 5. Re-train the classifier until the cluster centers are stable. 6. Apply K-means clustering and remap the original (b) pixels to each cluster. Figure 10. (a) The result before the edge extension (b) The result after the Fig.9 shows the result of color segmentation. edge extension 1032
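The six listed steps of the hill-climbing color segmentation can be prototyped roughly as below. The histogram bin count and the use of SciPy's k-means are our assumptions for illustration, not parameters given by the authors.

    import numpy as np
    from scipy import ndimage
    from scipy.cluster.vq import kmeans2

    def hill_climbing_segmentation(lab_image, bins=16):
        """Sketch of the listed steps: build a 3-D Lab histogram, take its
        local maxima as initial centroids, refine with k-means, and map
        every pixel to a cluster label."""
        pts = lab_image.reshape(-1, 3).astype(np.float64)
        hist, edges = np.histogramdd(pts, bins=bins)
        # local maxima of the colour histogram = candidate cluster centres
        peaks = (hist == ndimage.maximum_filter(hist, size=3)) & (hist > 0)
        idx = np.argwhere(peaks)
        centers = np.array([[(edges[d][i[d]] + edges[d][i[d] + 1]) / 2.0
                             for d in range(3)] for i in idx])
        centroids, labels = kmeans2(pts, centers, minit='matrix')
        return labels.reshape(lab_image.shape[:2]), centroids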
    Plus ColorSeg Figure 13. Five examples with different aperture values The DOF is smaller as the aperture value gets lower, and the OOI would be blurred as well. The higher aperture value will increase the edge sharpness; that will cause the difficulty to separate the background and the OOI. From Fig. 14 to Fig. 17, we show the OOI detection results. By experiment, the object boundaries become irregular while the aperture value gets higher. In our experiment, the proper aperture value to obtain the best segmentation results is about f2.8 to f5.6. Figure 11. The result image that merged the edge extention image and color segmentation image We integrate the above edge pieces into a complete OOI mask. If the boundaries are closed, we will add this region into the final OOI mask. The edge combination of the final OOI mask is shown in Fig.12. Figure 12. Edge combination result III. THE EXPERIMENTAL RESULTS The aperture stop of a photographic lens companion with shutter speed can adjust the amount of light reaching to the film or image sensor. In this study, we use a digital camera Pantax istDL and a prime lens “Helois M44-2 60mm F2.0” Figure 14. The experimental results (sample 1) to perform the experiment. We choose a prime lens to be our test lens in order to reduce the instability parameters. To insure all of the exposures are the same, we have controlled the shutter speed and aperture parameter manually. To test the propose method, we select 5 test photos in a 50 photos album randomly. They are all prepared in a same condition and camera parameter. Fig. 13 shows the proposed OOI detection results in different aperture value. Figure 15. The experimental results (sample 2) 1033
Figure 16. The experimental results (sample 3)

Figure 17. The experimental results (sample 4)

A convincing definition of a "good OOI" is hard to give; it depends on human cognition. In this paper, we refer to N. Santh and K. Ramar's experiment [8] to verify the proposed method. First, five user-defined OOI boundaries are drawn; then we compare them with the boundaries detected by the proposed method. Equation (17) computes the overlapped region between the reference and the detected OOI boundaries, i.e.,

    Accuracy = 1 − Σ_{(x,y)} | I_est(x, y) − I_ref(x, y) | / Σ_{(x,y)} I_ref(x, y),   (17)

where I_est is the OOI mask from the proposed method and I_ref is the mask drawn by the user as the ground truth. Fig. 18 (a) shows the user-drawn OOI boundaries and (b) shows the detected OOI boundaries.

Figure 18. Comparison results: (a) user-drawn OOI boundary, (b) the proposed method's result

The detection accuracy decreases when the OOI has a complex texture such as a shirt, cloth, or artificial structures, and the accuracy is higher when the background is simple. However, even if the image is not correctly focused on the target, the proposed method can still find a complete object. The correctness becomes lower if there is more than one OOI in an image, as shown in sample 2 in Fig. 18. Table 2 shows the accuracy computed by Equation (17).

Table 2. The comparison result between the reference images and the proposed method

    Sample    1      2      3      4    5
    Accuracy  98.2%  94.6%  96.1%  98%  91%
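For completeness, the accuracy measure reads directly as a few lines of NumPy; the sketch below assumes both masks are binary (0/1) arrays of the same shape.

    import numpy as np

    def ooi_accuracy(est_mask, ref_mask):
        """Eq. (17) as reconstructed above: 1 - sum|I_est - I_ref| / sum(I_ref)."""
        est = est_mask.astype(np.float64)
        ref = ref_mask.astype(np.float64)
        return 1.0 - np.abs(est - ref).sum() / ref.sum()

    # e.g. ooi_accuracy(detected, user_drawn) returned values around 0.91-0.98 in Table 2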
    IV. CONCLUSION [6] Yun-Chung Chung, Jung-Ming Wang, Robert R. Bailey, Sei-Wang Chen, “A Non-Parametric Blur Measure Based on Edge Analysis for In this paper we propose a method to extract the OOI Image Processing Applications,” IEEE Conference on Cybernetics objects form a low DOF image based on edge and color and Intelligent Systems Singapore, 1-3 December, 2004. information. The method needs no user-defined parameters [7] Renting Liu ,Zhaorong Li ,Jiaya Jia, “Image Partial Blur Detection like shapes and positions of objects, or extra scene and Classi cation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8. information. We integrate the color saturation, morphological functions and color gradient to detect the [8] N. Santh, K.Ramar, “Image Segmentation Using Morphological Filters and Region Merging,” Asian Journal of Information rough OOI. Final we utilize color segmentation to make the Technology vol. 6(3) 2007,pp. 274-279. OOI boundaries close and compact. Our method takes both [9] D. Kornack and P. Rakic, “Cell Proliferation without Neurogenesis in advantages of edge detection and color segmentation. Adult Primate Neocortex,” Science, vol. 294, Dec. 2001, pp. 2127- The experiments show that our method works 2130. satisfactorily on many different kinds of image data. This [10] T.Ohashi, Z.Aghbari, and A.Makinouchi. “Hill-climbing Algorithm method can apply in image processing or computer vision for Efficient Color-based Image Segmentation,” IASTED tasks such as object indexing or content-based image International Conference On Signal Processing, Pattern Recognition, and Applications (SPPRA 2003), June 2003. P.200. retrieval as a pre-processing. [11] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk1. “Salient Region REFERENCES Detection and Segmentation,” International Conference on Computer Vision Systems (ICVS 2008), May 2008. PP.66-75 [1] InfoTrend ,”The Consumer Digital SLR Marketplace: Identifying & [12] Martin Ru i, Davide Scaramuzza, and Roland Siegwart. “Automatic Profiling Emerging Segments,” Digital Photography Trends. Detection of Checkerboards on Blurred and Distorted Images,” September,2008 International Conference on Intelligent Robots and Systems 2008, http://www.capv.com/public/Content/Multiclients/DSLR.html Sept, 2008. PP.22-26 [2] Dudubird, “Chinese Photographic Equipment Industry Market [13] Hanghang Tong, Mingjing Li, Hongjiang Zhang, and Chanshui Zang. Research Report,” December,2009. http://www.cnmarketdata.com “Blur Detection for Digital Images Using Wavelet Transform,” /Article_84/2009127175051902-1.html International Conference on Multimedia and Expo 2004, PP.17-20 [3] .” [14] Gang Cao, Yao Zhao and Rongrong Ni. “Edge-based Blur Metric for ,” Tamper Detection,” Journal of Information Hiding and Multimedia September,2007. https://www.fuji-keizai.co.jp/market/06074.html Signal Processing, Volume 1, Number 1, January 2009. pp. 20-27 [4] Khalid Idrissi, Guillaume Lavou e, Julien Ricard , and Atilla Baskurt, [15] Rong-bing Gan, Jian-guo Wang. “Minimum Total Variation “Object of interest-based visual navigation, retrieval, and semantic Autofocus Algorithm for SAR Imaging,” Journal of Electronics & content identi cation system” Computer Vision and Image Information Technology, Volume 29, Number 1, January 2007. pp. Understanding vol. 94 ,2004 , pp. 271-294. 12-14 [5] James Z. Wang, Jia Li, Robert M. 
Gray, Gio Wiederhold , [16] Ri-Hua XIANG, Run-Sheng WANG, “A Range Image Segmentation “Unsupervised Multiresolution Segmentation for Images with Low Algorithm Based on Gaussian Mixture Model,” Journal of Software Depth of Field” IEEE TRANSACTIONS ON PATTERN 2003, Volume 14, Number 7, pp. 1250-1257 ANALYSIS AND MACHINE INTELLIGENCE vol.23 no.1, January 2001, pp. 85-90. 1035
    Efficient Multi-Layer BackgroundModel on Complex Environment for Foreground Object Detection 1 Wen-kai Tsai(蔡文凱),2Chung-chi Lin(林正基), 1Ming-hwa Sheu(許明華), 1Siang-min Siao(蕭翔民), 1 Kai-min Lin(林凱名) 1 Graduate School of Engineering Science and Technology National Yunlin University of Science & Technology 2 Department of Computer Science Tung Hai University E-mail:[email protected] Abstract—This paper proposes an establishment of multi-layer has the advantages of updating model parameters background model, which can be used in a complex automatically, it is necessary to take a very long period of environment scene. In general, the surveillance system focuses time to learn the background model. In addition, it also faces on detecting the moving object, but in the real scenes there are strenuous limitations such as memory space and processing many moving background, such as dynamic leaves, falling rain speed in embedded system. Next, Codebook background etc. In order to detect the object in the moving background model [3] establishes a rational and adaptive capability environment, we use exponential distribution function to which is able to improve the detection accuracy of moving update background model and combine background background and lighting changes. However, the Codebook subtraction with homogeneous region analysis to find out background model still requires higher computational cost, foreground object. The system uses the TI TMS320DM6446 larger memory space for saving background data. Davinci development platform, and it can achieve 20 frames per second for benchmark images of size 160×120. From the Subsequently, Gaussian model [4] is presented by updating the threshold value for each pixel, but its disadvantages experimental results, our approach has better performance in includes large amount of computing and lots of memory terms of detection accuracy and similarity measure, when comparing with other modeling techniques methods. space used to record the background model. In order to reduce the usage of memory, [5] and [6] are to calculate the Keywords-background modeling; object detection weight value for each pixel to establish background model. According to the weight value, the updating mechanism determines whether the pixel is replaced or not. So it uses a I. INTRODUCTION less amount of memory space to establish moving Foreground object detection is a very important background,. technology in the image surveillance system since the system The above works all use multi-layer background model performance highly dependents on whether the foreground to store background information, but this is still inadequate object detection is right or not. Furthermore, it needs to to deal with moving background issues. They need to take detect the foreground object accurately and quickly, such into account the dependency between adjacent pixels to that the follow-up works such as tracking, identification can inspect whether the neighbor region possesses the be easy to perform correctly and reliably. Conceptually, the homogeneous characteristics or not. This paper proposes an technology of foreground object detection is based on efficient 4-layer background model and homogeneous region background substation mostly. This approach seems simple analysis to feature the background pixels. and low computational cost; however, it is difficult to obtain good results without reliable background model. To manage II. 
BUILDING MULTI-LAYER BACKGROUND MODELS these complex background scenarios, the skill of how to First, the input image pixel xi,j(t) consists of R, G and B construct a suitable background model has become the most elements as shown in Eq. (1). The pixels of moving crucial one. background are inevitably appeared in some region Generally speaking, most of the algorithms only regard repeatedly, so we have to learn these appearance behaviors non-moving objects as background, but in real environment, when constructing multi-layer background model. The first many moving objects may also belong to a part of the layer background model (BGM1) is used to store the first background, in which we named the moving background input frame. For the 2nd frame, we record on the difference such as waving trees. However, it is a difficult task to of the 1st and 2nd frames for the second layer background construct the moving background model. The general model (BGM2). Similarly, the difference of the consecutive 3 practice is to use algorithms to conduct the learning and grams is saved for the third layer (BGM3), etc. We use the establish of background model. After building up the model, first 4 frame and their differences as the initial background the system starts to carry on the foreground object detection. model. Besides, Eq. (2) is used to record the numbers of Therefore, in recent years a number of background models occurrence each pixel in the learning frame. have been proposed. The most popular approach is the Mixture of Gaussians Model (MoG) [1- 2]. Although MoG xi , j (t ) = ( xiR j (t ), xiGj (t ), xiB j (t )) , , , (1) 1036
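A minimal sketch of the initialization and occurrence counting described here, assuming grayscale frames for brevity; storing each further layer as the absolute difference of consecutive frames is one reading of the text, and th and N are the paper's symbols with illustrative values.

    import numpy as np

    def init_layers(frames):
        """Simplified 4-layer initialisation: layer 0 keeps the first frame,
        layers 1-3 keep differences of consecutive frames."""
        f = [frm.astype(np.float64) for frm in frames[:4]]
        return np.stack([f[0]] + [np.abs(f[k] - f[k - 1]) for k in (1, 2, 3)])

    def update_match(match, layers, frame, th=20.0):
        """Eq. (2): increment the per-layer counter wherever the input pixel
        lies within th of that layer; leave it unchanged otherwise."""
        close = np.abs(frame.astype(np.float64)[None, ...] - layers) <= th
        return match + close.astype(np.int32)

    # occurrence frequency, Eq. (3): lam = match / N after N learning frames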
    MATCH^u_{i,j}(t) = { MATCH^u_{i,j}(t−1),       if |x_{i,j}(t) − BGM^u_{i,j}(t)| > th
                       { MATCH^u_{i,j}(t−1) + 1,   else                                     (2)

where u = 1…4 and th is the threshold value for comparing similarity. From the 5th learning frame on, we start to calculate the repetition numbers of occurrence of all pixels in each layer of the background model, and Eq. (3) is used to obtain the frequency of occurrence:

    λ^u_{i,j} = MATCH^u_{i,j}(t) / N,   (3)

where N is the total number of learning frames. A larger λ^u_{i,j} indicates that the corresponding pixel has a higher occurrence during the learning period and must be preserved within the 4 layers. Conversely, a pixel with lower occurrence will be removed.

III. BACKGROUND UPDATE

After building up the multi-layer background model, we must update the content of BGM_{i,j} along with time, to replace inadequate background information. The background update mechanism is therefore very important for the subsequent object detection. The proposed background update method uses an exponential distribution model to calculate the weight value for each pixel, as shown in Eq. (4); it captures the repetition condition of occurrence for each pixel in the background model. A lower weight expresses that the corresponding pixel has not appeared for a long time and should be replaced by a higher-weight input pixel.

    weight^u_{i,j}(t) = λ^u_{i,j} · exp(−λ^u_{i,j} · t),  t > 0,   (4)

where t is the number of non-matching frames.

Fig. 1 shows the distribution of the weight values. If a pixel in the background model is not matched for a period of time, its weight value decreases exponentially. If the weight value is less than a threshold, the background should be replaced based on Eq. (5):

Figure 1. Exponential distribution of weight (weight versus t)

    BGM^u_{i,j}(t) = { removed,                                           if weight^u_{i,j}(t) < Te
                     { α × BGM^u_{i,j}(t) + (1 − α) × BGM^u_{i,j}(t−1),   else                     (5)

where Te is a threshold for the weight, and α is a constant with α < 1.

Based on the above-mentioned approach, Fig. 2 demonstrates a 4-layer background model constructed after learning 100 frames.

Figure 2. Multi-layer Background Model: (a) BGM1, (b) BGM2, (c) BGM3, (d) BGM4

IV. OBJECT DETECTION

After establishing an accurate background model, background subtraction can be used to obtain the foreground object. From practical observation, the moving background has a homogeneous characteristic. Therefore, the object detection method carries out the subtraction on both the 4-layer background and its homogeneous regions. As shown in Fig. 2, the information stored in the background model is the scene of the moving background, which has important homogeneity features. In Eqs. (6) and (7), TI(t) is the total matching index between the input pixel and the homogeneous region of the 4-layer background, and D^u_{i+k,j+p} is the individual matching index between the input pixel and one background datum BGM^u_{i+k,j+p}. The homogeneous region is defined as (2r+1) × (2r+1) for the background data at location (i, j).

    TI(t) = Σ_{u=1}^{4} Σ_{k=−r}^{r} Σ_{p=−r}^{r} D^u_{i+k,j+p}(t),   (6)

    D^u_{i+k,j+p}(t) = { 1, if |x_{i,j}(t) − BGM^u_{i+k,j+p}(t)| ≤ th
                       { 0, else                                          (7)

where th is a threshold value to determine whether they are similar.
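Continuing the sketch above with the same illustrative names, the weight of Eq. (4) and the stale-pixel replacement implied by Eq. (5) can be expressed as follows; the numeric values of Te and the blending branch of Eq. (5) are simplified assumptions.

    import numpy as np

    def layer_weights(lam, t_nomatch):
        """Eq. (4): weight = lambda * exp(-lambda * t); t counts the frames
        since a layer pixel last matched the input."""
        return lam * np.exp(-lam * t_nomatch)

    def refresh_layers(layers, lam, t_nomatch, frame, te=0.01):
        """One reading of Eq. (5) and the surrounding text: layer pixels whose
        weight has decayed below Te are treated as stale and overwritten by
        the corresponding input pixel."""
        w = layer_weights(lam, t_nomatch)                  # shape (4, H, W)
        stale = w < te
        out = layers.copy()
        out[stale] = np.broadcast_to(frame.astype(np.float64), layers.shape)[stale]
        return out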
If TI(t) is greater than a threshold τ, the input x_{i,j}(t) is similar to much of the background information and it is not an object pixel. Eq. (8) is used to find the foreground object (FO):

    FO_{i,j}(t) = { 0, if TI(t) ≥ τ
                  { 1, else            (8)

When FO_{i,j}(t) = 1, the input pixel belongs to a foreground object pixel. On the other hand, if FO_{i,j}(t) = 0, the input pixel belongs to the background.

V. EXPERIMENTAL RESULTS OF PROTOTYPING SYSTEM

Based on our proposed approach, the object detection is implemented on the TMS320DM6446 Davinci platform as shown in Fig. 3. The input image resolution is 160×120 per frame. On average, our approach can process 20 frames per second for object detection on the prototyping platform.

Figure 3. TI TMS320DM6446 Davinci development kit

Next, by using the presented research methods, the foreground objects with binary-valued results are displayed in Fig. 4. The ground truth, in which the objects are segmented manually from the original image frame, is regarded as the perfect result. It can be found that our result gives the better object detection. In order to make a fair comparison, we adopt the similarity and total-error-pixel measures of [7] to assess the results of the algorithms. Eq. (9) is used to get the total error pixel number and Eq. (10) is used to evaluate the similarity value.

Figure 4. Foreground Object Detection Result

    total error pixels = fn + fp,   (9)

    Similarity = tp / (tp + fn + fp),   (10)

where fp is the total number of false positives, fn is the total number of false negatives, and tp indicates the total number of true positives. Fig. 5 depicts the number of error pixels for a video sequence. We can see that the numbers of error pixels produced by our proposed method are less than those of the other algorithms. Fig. 6 shows the similarity for the video sequence. Our proposed approach achieves the highest similarity value, i.e., our results are close to those of the ground truth.

Figure 5. Error pixels by different methods (error-pixel counts of Wu [2], Chien [5], Tsai [6], and the proposed method over frames 240-280)
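To make Eqs. (6)-(10) concrete, here is a NumPy sketch of the neighborhood-matching detector and the two evaluation measures; the parameter values (r, th, τ) are illustrative, not the ones used on the DM6446 platform.

    import numpy as np

    def detect_foreground(frame, layers, r=1, th=20.0, tau=3):
        """Eqs. (6)-(8): count, over all 4 layers and a (2r+1)x(2r+1)
        neighbourhood, how many background samples the input pixel matches;
        pixels with fewer than tau matches are declared foreground."""
        frame = frame.astype(np.float64)
        h, w = frame.shape
        pad = np.pad(layers.astype(np.float64), ((0, 0), (r, r), (r, r)), mode='edge')
        ti = np.zeros((h, w), dtype=np.int32)
        for k in range(-r, r + 1):
            for p in range(-r, r + 1):
                block = pad[:, r + k:r + k + h, r + p:r + p + w]
                ti += (np.abs(frame[None, ...] - block) <= th).sum(axis=0)
        return (ti < tau).astype(np.uint8)          # 1 = foreground (Eq. (8))

    def total_error_and_similarity(mask, truth):
        """Eqs. (9)-(10): total error pixels = fn + fp; similarity = tp/(tp+fn+fp)."""
        tp = int(np.logical_and(mask == 1, truth == 1).sum())
        fp = int(np.logical_and(mask == 1, truth == 0).sum())
        fn = int(np.logical_and(mask == 0, truth == 1).sum())
        return fn + fp, tp / float(tp + fn + fp)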
    Wu[2] [6] Wen-Kai Tsai, Ming-Hwa Sheu, Ching-Lung Su, Jun-Jie Lin and Similarity Shau-Yin Tseng, “Image Object Detection and Tracking Chien[5] 1 Tsai[6] Implementation for Outdoor Scenes on an Embedded SoC 0.9 Our proposed Platform,” International Conference on Intelligent Information 0.8 Hiding and Multimedia Signal Processing, pp.386-389, September, 0.7 2009. 0.6 [7] Lucia Maddalean, Alfredo Petrosino, “A Self-Organizing Approach Sim ilarity to Background Subtraction for Visual Surveillance Applications,” 0.5 IEEE Trans. on Image Processing, vol. 17, No.7, July, 2008. 0.4 0.3 0.2 0.1 0 242 245 248 251 254 257 260 Frame Number Figure 6. Similarity by different methods VI. Conclusion In this paper, we propose an effective and robust multi- layer background modeling algorithm. The foreground object detection will encounter the problem of moving background, because there are outdoor scenes of fluttering leaves, rain, and indoor scenes of fans etc. Therefore, we construct the moving background into multi-layer background model through calculating weight value and analyzing the characteristics of regional homogeneous. In this way, our approach can be suitable to a variety of scenes. Finally, we present the result of foreground detection by using data-oriented form of similarity and total error pixels, furthermore through explicit data and graph to show the benefit of our algorithms. REFERENCES [1] C. Stauffer, W. Eric L. Grimson, “Learning Patterns of Activity Using Real-Time Tracking,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol.22, No. 8, pp.747-757, 2000. [2] H. H. P. Wu, J. H. Chang, P. K. Weng, and Y. Y. Wu, “Improved Moving Object Segmentation by Multi-Resolution and Variable Thresholding, ” Optical Engineering. vol. 45, No. 11, 117003, 2006. [3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, “Real-Time Foreground-Background Segmentation using Codebook Model, ” Real-Time Imaging, pp.172-185, 2005. [4] Hanzi Wang, and David Suter, “ A Consensus-Based Method for Tracking Modelling Background Scenario and Foreground Appearance,” Pattern Recognition, pp.1091-1105, 2006. [5] Wei-Kai Chan, Shao-Yi Chien,”Real-Time Memory-Efficient Video Object Segmentation in Dynamic Background with Multi- Background Registration Technique,” International Workshop on Multimedia Signal Processing, pp.219-222, 2002. 1039
    CLEARER 3D ENVIRONMENTCONSTRUCTION USING IMPROVED DM BASED ON GAZE TECHNOLOGY APPLIED TO AUTONOMOUS LAND VEHICLES 1 2 Kuei-Chang Yang (楊桂彰), Rong-Chin Lo (駱榮欽) 1 Dept. of Electronic Engineer & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei 2 Dept. of Electronic Engineer & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei E-mail: [email protected] ABSTRACT to obtain meaningful information. There are a lot of manpower and resources devoted to the binocular stereo In this paper, we propose a gaze approach that sets vision [4] research for many countries. As applied to the binocular cameras in different baseline distances to robots and ALV, the advantage of binocular stereo obtain better resolution of three dimensions (3D) vision is to obtain the depth of the environment, and this environment construction. The method being capable depth can be used for obstacle avoidance, environment of obtain more accurate distance of an object and learning, and path planning. In such applications, the clearer environment construction that can be applied to disparity is used as the vision system based on image the Autonomous Land Vehicles (ALV) navigation. In recognition and image-signal analysis. Besides, two the study, the ALV is equipped with parallel binocular cameras need to be set in parallel and to be fixed cameras to simulate human eye to have the binocular accurately, this disparity method still requires a high- stereo vision. Using the information of binocular stereo speed computer to store and analyze images. However, vision to build a disparity map (DM), the 3D setting the binocular cameras of ALV with fixed environment can be reconstructed. Owing to the baseline can only obtain the better DM of environment baseline of the binocular cameras usually being fixed, images in a specific region. the DM, shown as an image, only has a better resolution in a specific distance range, that is, only partial specific In this paper, we try to propose an approach that sets the region of the reconstructed 3D environment is clearer. binocular cameras with different baseline to obtain the However, it cannot provide a complete navigation depths of DM corresponding to different measuring environment. Therefore, the study proposes the multiple distances; In the future, this method can obtain the baselines to obtain the clearer DMs according to the environment image from near to far range, such that it near, middle and far distances of the environment. will help the ALV in path planning. Several experimental results, showing the feasibility of the proposed approach, are also included. 2. STEREO VISION Keywords binocular stereo vision; disparity map In recent years, because the computing speed of the computer is much faster and its hardware 1. INTRODUCTION performance also becomes better, therefore a lot of researches relating to the computer vision are proposed In recent years, the machine vision is the most for image processing. The computer vision system with important sensing system for intelligent robots. The depth sensing ability is called the stereo vision system, vision image captured from camera has a large number and the stereo vision is the core of computer vision of object information including shape, color, shading, technologies. However, one camera can only obtain two shadow, etc. 
Unlike other used sensors can only obtain dimensions (2D) information of environment image that one of measurement information, such as ultrasonic is unable to reconstruct the 3D coordinate, To improve sensors [1], infrared sensors [2], laser sensors [3], etc. the shortage of one camera, in the study, two cameras In other words, the visual sensor can achieve a lot of are used to calculate 3D coordinate. The details are environmental information, but this information is with described in the following sub-sections. each other. Therefore, various image processing techniques are necessary for separating them one by one 1040
    2.1. Projective Transform Nowadays the cost of camera becomes very The projective transform model of one camera is cheaper, therefore, in the study, we chose two cameras to project the real objects or scene to the image plane. fixed in parallel to solve the problem of depth and As shown in Fig. 1, assume that the coordinate of object height. The usage of parallel cameras can reduce the P in the real world is (X, Y, Z) relative to the origin (0, complexity of the corresponding problem. In Fig. 3, we 0, 0) at the camera center. After transform, the easily derive the Xl and Xr by using similar triangles, and coordinate of P' projected by P on the image plane is (x, we have: y, f) relative to the image origin (0, 0, f), where f is the Zx   l X  l distance from the camera center to image plane. Using f similar triangle geometry theory to find the relationship between the actual object P and its projected point P' on Zx r the image plane, the relationship between two points is Xr  f   as follows: X   Assuming that the optical axes of two cameras are x f parallel to each other, where b is the distance between Z Y  two camera centers, and b= Xl - Xr. C and G are the y f projected points of P to left image plane and right image Z plane, respectively. The disparity d is defined as d = xl - Therefore, even if P'(x, y, f) captured from the xr . From (3) and (4), we have: image plane is the known condition, we still cannot b  X l  X r  x l  x r   d  Z Z calculate the depth Z of P point and determine its coordinate P(X, Y, Z) according to (1) and (2) unless f f we know one of X or Y (height) or Z (depth). Therefore, the image depth Z can be given by: f  b  P (X ,Y ,Z ) Z d Y y P' ( x , y , f ) Xl P(Object) X x r X Z Z Z Camera center C (0,0,0) Image plane xl xr G f f f Figure 1. Perspective projection of one camera. Ol b Or 2.2. Image Depth From the previous discussion, we have known that Figure 3. Projection transform of two cameras and disparity. it is impossible to calculate accurately the depth or height of object or scene from the information of one As shown in Fig. 4. The height image of an object can camera, even if we have a lot of known conditions in be derived from the height of the object image based on advance. Therefore, several studies use the overlapping the assumption of a pinhole camera and the image- view’s information of two [5] or more cameras to forming geometry. calculate the depth or height of object or scene, shown  Y  y  Z  in Fig. 2. f Right image plane Y y Pinhole Left image plane r r (x ,y ) x Optical axis y y x b l l (x ,y ) f Z P (X ,Y ,Z ) Figure 4. Image-forming geometry. Figure 2. The relationship between depth and disparity for two cameras. 1041
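The depth relation derived above, Z = f·b/d, applies per pixel once a disparity map is available. The sketch below assumes f is given in pixels and b in metres; the calibration values quoted later in the paper (f = 874 pixels, b = 20 cm) are used only in the usage comment.

    import numpy as np

    def depth_from_disparity(disparity, f_pixels, baseline_m, min_disp=1e-3):
        """Z = f * b / d, applied to a whole disparity map; pixels with
        (near-)zero disparity are returned as +inf (no reliable depth)."""
        d = disparity.astype(np.float64)
        z = np.full_like(d, np.inf)
        valid = d > min_disp
        z[valid] = f_pixels * baseline_m / d[valid]
        return z

    # usage: z = depth_from_disparity(dmap, 874.0, 0.20)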
    Due to rapidcorresponding on two cameras, the the middle region is from 5m to 10m, and far region is method has high efficiency on calculating the depth and over 10m. height of the objects, and is suitable for the application Acquisition of the best baseline b of ALV navigating. This method can find the disparity d from two corresponding points (for instance, C and G To acquire a best baseline b means that to find the shown in Fig. 3.) respective to left image and right appropriate cameras baseline b on the basis of different image. Here, the accuracies of two corresponding points depths of the region. Table 1 and Table 2 show the are very important. Regard the value of disparity d as relationship between depth Z and the two-camera image intensity shown by gray values (0 to 255), then, distance baseline b. We set the d = 30 as the threshold whole disparities form an image, called disparity map value dth, and region of d less than dth as background. (DM) or DM image. The DM construction proposed by Therefore, when the depth Z is known, the disparity d Birchfield and Tomasi [6] is employed in this paper. can be obtained from Table 1 and Table 2in the The advantage of this constructing method is faster to different baseline, and then find the most appropriate obtain all depths that also include the depths of value of b makes the value of d closest or greater than discontinuous, covered, and mismatch points. Otherwise, the dth. For example: 20cm is the best b for short-range the disadvantage is to lack the accuracy of the obtained region (0 m~ 5m), and 40cm for medium-range region disparity map. Fig. 5 shows that the disparity map is (5m ~ 10m). generated from left and right images. Calculation of the depth and height The cameras are calibrated [8] in advance, then, we can obtain the focus value f =874 pixels. Substituting the obtained d for object into (6), we find the distance Z between the camera and the object, and Z is then substituted into (7) to calculate the object height Y [9] that usually can be used to decide whether the object is an obstacle. (a) Distance (b) Figure 5. The disparity map (a) left image and right image (b) disparity map Disparity Camera 3. PROPOSED METHOD Figure 6. The relationship between distance and disparity. From (6) [7], we know that the object is far from TABLE I. DISPARITY VALUES d (PIXELS) VS. DEPTH Z=1M~5M two cameras, the disparity value will become small, and AND BASELINE b =10CM~150CM. vice versa. In Fig. 6, there is obviously a nonlinear Z(m) relationship between these two terms. The disadvantage b(cm) 1 2 3 4 5 of DM is that the farther distance between objects and 10 87 44 29 22 17 two cameras makes the smaller disparity value, and it 20 175 87 58 44 35* begets the difficulty of separation between the object 30 262 131 87 66 52 40 350 175 117 87 70 and the background becoming difficult. Therefore, how 50 437 219 146 109 87 to find the suitable baseline b for obtaining the clearer 60 524 262 175 131 105 DM for each region in different depth region of two 70 612 306 204 153 122 cameras is required. The processing steps are described 80 699 350 233 175 140 in the following sub-sections: 90 787 393 262 197 157 100 874 437 291 219 175 Region segmentation 110 961 481 320 240 192 120 1049 524 350 262 210 We partition the region segmentation into three 130 1136 568 379 284 227 levels by near, middle and far, and obtain the best DM 140 1224 612 408 306 245 of the depth in the different regions. 
In the paper, we 150 1311 656 437 328 262 define the near region is the distance from 0m to 5m, *: The best disparity for short-range region. 1042
  • 56.
    TABLE II. DISPARITY VALUES d (PIXELS) VS. DEPTH Z=6M~10M AND BASELINE b =10CM~150CM. Z(m) b(cm) 6 7 8 9 10 10 15 12 11 10 9 20 29 25 22 19 17 30 44 37 33 29 26 40 58 50 44 39 35* 50 73 62 55 49 44 60 87 75 66 58 52 70 102 87 76 68 61 80 117 100 87 78 70 90 131 112 98 87 79 100 146 125 109 97 87 110 160 137 120 107 96 120 175 150 131 117 105 130 189 162 142 126 114 140 204 175 153 136 122 150 219 187 164 146 131 *: The best disparity for medium-range region. 4. EXPERIMENTAL RESULTS Figure 8. The disparity map (a) left image and right image (b) The proposed methods have been implemented disparity map (Z=400CM、800cm,b=20cm). and tested on the 2.8GHz Pentium IV PC. Fig. 7 shows two cameras are fixed on a sliding way and can be pulled apart to change the baseline distance. In Section III, we know that the best b for the short-range region 0m ~ 5m is 20cm, and 40cm for medium-range region 5m ~ 10m. Therefore, we set two persons standing in the distance from the two-camera of 4m and 8m, two- camera distance b = 20cm, shown in Figure 8. Because the person standing at 4m is in the short-range region, so it can be seen clearly. However, another person standing at 8m is in medium-range region, it's difficult to separate it from background. Figure 9. The disparity map (a) left image and right image (b) disparity map (Z=800CM,b=20cm). Figure 7. Experiment platform of stereo vision. To compare Fig. 9 and Fig. 10 with the distance from a person to the baseline is 8m (medium-range region) and the baseline is changed from b = 20cm to b = 40cm, so the results show that as b = 40cm, the person (object) becomes clearer as shown in Fig. 10. 1043
  • 57.
    [6] S. Birchfieldand C. Tomasi, ”Depth Discontinuities by Pixel-to-Pixel Stereo,” International Journal of Computer Vision, pp. 269-293, Aug 1999. [7] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Press, 2008. [8] http://www.vision.caltech.edu/bouguetj/calib_doc/ [9] L. Zhao and C. Thorpe, “Stereo- and Neural Network- Based Pedestrian Detection,” IEEE Trans, Intelligent Transportation System, Vol. 3, No. 3, pp. 148-154, Sep 2000. Figure 10. The disparity map (a) left image and right image (b) disparity map (Z=800cm,b=40cm). 5. CONCLUSION From the experimental results, we have found that the suitable baseline of two cameras can help us to obtain the better disparity. However, if the object is far from two cameras, its disparity value will become small, then the disparity value of the object is near to that of the background, and not easily detected. Using the proposed method, to change the baseline of two cameras, the object becomes clearer and easier detected, and 3D object information is obtained more. The results can be used to a lot of applications, for example, ALV navigation. In the future, we plan to solve the DM noise of horizontal stripe inside, so DM can be shown better. REFERENCES [1] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer Magazine, pp. 46- 57, June 1989. [2] J. Hancock, M. Hebert and C. Thorpe, “Laser intensity- based obstacle detection Intelligent Robots and Systems,” 1998 IEEE/RSJ International Conference on Intelligent Robotic Systems, Vol. 3, pp. 1541-1546, 1998. [3] E. Elkonyaly, F. Areed, Y. Enab, and F. Zada, “Range sensory=based navigation in unknown terrains,” in Proc. SPIE, Vol. 2591, pp.76-85. [4] 陳禹旗,使用 3D 視覺資訊偵測道路和障礙物應用於人 工智慧策略之室外自動車導航,碩士論文,國立台北 科技大學電腦與通訊研究所,台北,2003。 [5] 張煜青,以雙眼立體電腦視覺配合人工智慧策略做室 外自動車導航之研究,碩士論文,國立台北科技大學 自動化科技研究所,台北,2003。 1044
    A MULTI-LAYER GMMBASED ON COLOR-TEXTURE COMBINATION FEATURE FOR MOVING OBJECT DETECTION Tai-Hwei Hwang (黃泰惠), Chuang-Hsien Huang (黃鐘賢), Wen-Hao Wang (王文豪) Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute, Chutung, HsinChu, Taiwan ROC 310 E-mail: {hthwei, DavidCHHuang, devin}@itri.org.tw ABSTRACT background scene. The background scene contains the images of static or quasi-periodically dynamic objects, Foreground detection generally plays an important role in the for instance, sea tides, a fountain, or an escalator. The intelligent video surveillance systems. The detection is based representation of background scene is basically a on the characteristic similarity of pixels between the input collection of statistics of pixel-wise features such as image and the background scene. To improve the color intensities or spatial textures. The color feature characteristic representation of pixel, a color and texture combination scheme for background scene modeling is can be the RGB components or other features derived proposed in this paper. The color-texture feature is applied from the RGB, such as HSI, or YUV expression. The into a four-layer structured GMM, which can classify a pixel texture accounts for information of intensity variation in into one of states of background, moving foreground, static a small region centered by the input pixel, which can be foreground and shadow. The proposed method is evaluated computed by the conventional edge or gradient with three in-door videos and the performance is verified by extraction algorithm, local binary pattern [1], etc. The pixel detection accuracy, false positive and false negative rate statistical background models of pixel color and textures based on ground truth data. The experimental results are respectively efficient when the moving objects are demonstrate it can eliminate shadow significantly but without with different colors from background objects and are many apertures in foreground object. full of textures for either background or foreground moving objects. For example, it is hard to detect a 1. INTRODUCTION walking man in green from a green bush using the color feature only. In this case, since the bush is full of Wide range deployment of video surveillance system is different textures from the green cloth, the man can be getting more and more importance to security easily detected by background subtraction with texture maintenance in a modern city as the criminal issue is feature. However, this will not be the case when only strongly concerned by the public today. However, using the texture difference to detect the man walking in conventional video surveillance systems need heavy the front of flat white wall because of the lack of texture human monitoring and attention. The more cameras for both the cloth and the wall. Therefore, some studies deployed, the more inspection personnel employed. In are conducted to combine the color and the texture addition, attention of inspection personnel is decreased information together as a pixel representation for over time, resulting in lower effectiveness at recognizing background scene modeling [2][3][4]. In addition to the events while monitoring real-time surveillance videos. 
different modeling abilities of color and texture, texture To minimize the involved man power, research in the feature is much more robust than color under the field of intelligent video surveillance is blooming in illumination change and is less sensitive to slight cast recent years. shadow of moving object. Among the studies, background subtraction is a Though the combination of color and texture can fundamental element and is commonly used for moving provide a better modeling ability and robustness for object detection or human behavior analysis in the background scene under illumination change, it is not intelligent visual surveillance systems. The basic idea enough to eliminate a slightly dark cast shadow or to behind the background subtraction is to build a keep an invariant scene under stronger illumination background scene representation so that moving objects change or automatic white balance of camera. To in the monitored scene can be detected by a distance improve the robustness of background modeling further, comparison between the input image and the 1045
    a simple butefficient way to eliminate shadows is to waving leaves. In this study, we propose a four-layer filter pixels casted by shadows according to the scene model which classes each pixel into four states, i.e. chromatic and illuminative changes. In the illuminative background, static foreground, moving foreground, and component the value of the shadow pixel is lower than shadow. We improve Gallego’s work by modeling that in background model; while in the chromatic background with mixture Gaussians of color and texture component, it shows slightly different from that in the combined feature and design related mechanisms for background model. Therefore, shadows can be detected state transition. In addition, we also bring the concept of by using thresholding technique to obtain the pixels shadow learning, based on the work [7], into the which are satisfied with these physical characteristics. proposed scene model. The structure and the Cucchiara et al. [5] transformed video frames from RGB mechanisms of our background scene model are space to Hue-Saturation-Intensity (HSI) space to described in section 2. Section 3 reveals applicable highlight these physical characteristics. In the work of scenarios and experimental results. Section 4 presents Shan et al. [6], they evaluated the performance of the conclusions and our future works thresholding-based shadow detection approach on different color spaces such as HSI, YCrCb, c1c2c3, L*a*b. To sum up, conventional approaches are based 2. MULTI-LAYER SCENE MODEL on transforming the RGB features to other color domains or features, which have better characteristics to Figure 1 illustrates the flowchart of the multi-layer scene represent shadows. But no matter what kind of color model. In the first stage, the color and texture spaces or features is adopted, users usually need to set representation have to be obtained for all pixels in the one or more threshold values to filter shadows out. input image. Four layers which represent the states of background, shadow, static foreground and moving Recently, Nicolas et al. [7] proposed an online- foreground, are modeled separately. For each pixel i learning approach named Gaussian Mixture Shadow belonging to the current frame, if it is fit to the Model (GMSM) for shadow detection. The GMSM background model, the background model is updated utilities two Gaussian mixture models (GMM) [8] to and the pixel is then labeled as the state of background. model the static background and casting shadows, Otherwise, the pixel is passed to the shadow layer. respectively. Afterward, Tanaka et al. [9] used the same idea but modeled the distributions of background and In the shadow layer, i is examined whether it is shadows non-parametrically by Parzon windows. It is satisfied to be a shadow candidate by a weak shadow faster than GMSM but costs more storage space. Both of classifier, which was designed according to the shadow them are based on statistical analysis and have better physical characteristics such as the mentioned chromatic discriminative power on shadows, especially when the and illuminative changes. If i is determined as a shadow color of moving object shows similar attribute to the candidate, the shadow layer is updated by the pixel’s pixels covered by shadows. color features. If i shows strong fitness to the dominant Gaussian of the updated shadow model, its state is then On the other hand, maintenance of static foreground labeled as shadow. 
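The weak shadow classifier described above (and formalized later in Eqs. (8)–(10)) can be sketched as follows: a pixel is accepted as a shadow candidate when its luminance is attenuated by a bounded ratio with respect to the background mean while its chrominance stays close to the background. The threshold values in the signature are illustrative placeholders of the kind a user would set through the GUI; they are not taken from the paper.

    def is_shadow_candidate(y, u, v, mu_y, mu_u, mu_v,
                            r_min=0.4, r_max=0.9, lam_u=10.0, lam_v=10.0):
        """Weak shadow test in YUV (cf. Eqs. (8)-(10)); thresholds are illustrative."""
        darker = r_min < y / max(mu_y, 1e-6) < r_max   # luminance attenuated but not black
        similar_u = abs(u - mu_u) < lam_u              # chrominance close to background
        similar_v = abs(v - mu_v) < lam_v
        return darker and similar_u and similar_v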
For the pixel which is not satisfied to objects is also an important issue for background being a shadow candidate or does not fit to the shadow modeling. The static foreground objects are those model, we pass it to the static foreground layer. objects that, after entering into the surveillance scene, reach a position and then stop their motion. Examples Consequentially, if i dose not fit the static foreground are such as cars waiting for traffic lights, browsing model if it exists, i is passed to the moving foreground people in shops, or abandoned luggage in train stations. layer. When i fits the moving foreground model, it is the In traditional GMM-based background models [8], the circumstance that the state of the moving object is from static foreground objects are usually absorbed into moving to staying at the current position. As a result, we background after a given time period, which usually update the moving foreground model by i’s color proportional to the learning rate of the background features. A counter named CountMF corresponding to the model. The current state-of-the-art technique to moving background model is increased as well. When distinguish the static foreground objects from static CountMF reaches another user-defined threshold T2, we background and moving objects is to maintain a multi- replace the static foreground model by the moving layer model representing background, moving foreground model and CountSF is set to zero. Otherwise, foreground and static foreground separately [10,11]. if i does not fit the moving foreground model, we use it to reinitialize the moving foreground model, i.e. set In the work of Gallego et al. [11], they proposed a three- CountMF to zero, and then update the background model layer model which comprises moving foreground, static by the past moving foreground model. The reason of foreground and background layer. However, they using the moving foreground model to update the modeled background by using a single Gaussian, which background is to allow the background model having the can not cope with the multi-mode background such as ability to deal with the multi-mode background problem 1046
    such as wavingleaves, ocean waves or traffic lights. The background model is first initialized with a set of Details of feature extraction stage, the background and training data. For example, the training data could be the shadow layers are described in the following collected from the first L frames of the testing video. subsections. After that, each pixel at frame t can be determined whether it matches to the m-th Gaussian Nm by satisfying 2.1. Feature extraction stage the following inequality for all components {xC ,i , xT ,i } ∈ x : The color-texture feature is a vector including ( xC , i − µC ,i , m ) 2 ( xT ,i − µT , i, m ) 2 dC dT λ 1− λ B B components of RGB and local difference pattern (LDP) as the texture in a local region. The LDP is an edge-like dC ∑ i =1 k × (σ C ,i , m ) 2 B + dT ∑ i =1 k × (σT , i , m ) 2 B <1 (3) feature which is consisted of intensity differences between predefined pixel pairs. Each component of LDP where dC and dT denote vector dimension of color and is computed by texture, respectively, λ is the color-texture combination weight, k is a threshold factor and we set it to three LDPn(C)=I(Pn)-I(C), (1) according to the three-sigma rule (a.k.a. 68-95-99.7 rule) where C and Pn represent the pixel and thereof neighbor of normal distribution. The weights of Gaussian pixel n, respectively, and I(C) represents the gray level distribution are sorted in decreasing order. Therefore if intensity of pixel C. The gray level intensity can be the pixel matches to the first nB distributions, where nB is computed by the average of RGB components. Four obtained by Eq. (4), it is then classified as the types of pattern defining the neighbor pixels are background [13]. depicted in Figure 2 and are separately adopted to compare their performance of moving object detection  b  experimentally. b  ∑ n B = arg min  π m > 1 − p f   (4)  m=1  where pf is a measure of the maximum proportion of the data that belong to foreground objects without influencing the background model. When a pixel fits the background model, the background model is updated in order to adapt it to Fig. 2. Four types of pattern defining neighbor pixels progressive image variations. The update for each pixel for computation of LDP is as follows: 2.2. Background Layer π m ← π m + α (om − π m ) − αcL B B B (5) The GMM background subtraction approach presented by Stauffer and Grimson [8] is a widely used approach µ m ← µ m + om (α / π m )(x − µ m ) B B B B (6) for extracting moving objects. Basically, it uses couples of Gaussian distribution to model the reasonable (σ m ) 2 ← (σ m ) 2 + om (α / π m )((x − µ m )T (x − µ m ) − (σ m ) 2 ) B B B B B B (7) variation of the background pixels. Therefore, an unclassified pixel will be considered as foreground if the where α=1/L is a learning rate and cL is a constant value variation is larger than a threshold. We consider non- (set to 0.01 herein [14]). The ownership om is set to 1 for correlated feature components and model the the matched Gaussian, and set to 0 for the others. background distribution with a mixture of M Gaussian distributions for each pixel of input image: 2.3. Shadow Layer M p (x) = ∑π m=1 B B B m N m ( x; µ m , Iσ m ) (2) The problem of color space selection for shadow detection has been discussed in [6][12]. Their experimental results revealed that performing cast B where x represents the feature vector of a pixel, µ m is shadow detection in CIE L*u*v, YUV or HSV is more B efficient than in RGB color space. 
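To make the feature construction and the matching rule concrete, the sketch below computes the local difference pattern of Eq. (1) for one pixel and evaluates the color–texture match inequality of Eq. (3) against a single Gaussian component. The neighbour offsets stand in for one of the patterns of Fig. 2 and are illustrative; λ = 0.3 and k = 3 follow the values quoted in the text.

    import numpy as np

    def ldp(gray, y, x, offsets):
        """Local difference pattern of Eq. (1): I(P_n) - I(C) for predefined neighbours."""
        c = float(gray[y, x])
        return np.array([float(gray[y + dy, x + dx]) - c for dy, dx in offsets])

    def fits_component(x_c, x_t, mu_c, mu_t, var_c, var_t, lam=0.3, k=3.0):
        """Match test of Eq. (3): weighted, normalised squared distances must sum below 1."""
        term_c = (lam / len(x_c)) * np.sum((x_c - mu_c) ** 2 / (k * var_c))
        term_t = ((1.0 - lam) / len(x_t)) * np.sum((x_t - mu_t) ** 2 / (k * var_t))
        return term_c + term_t < 1.0

    # illustrative cross-shaped neighbour pattern (cf. Fig. 2)
    offsets = [(-2, 0), (2, 0), (0, -2), (0, 2)]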
Considering that the the estimated mean, σ m is the variance, and I represents RGB-to-CIE L*u*v transform is nonlinear and the Hue the identity matrix to keep the covariance matrix domain is circular statistics in HSV space, YUV color isotropic for computational efficiency. The estimated space shows more computing efficiency due to its mixing weights, denoted by π m , are non-negative and B linearity of transforming from RGB space. In addition, they add up to one. YUV is also for interfacing with analogy and digital television or photographic equipment. As a result, YUV 1047
    color features wereadopted in this study, i.e. the color frame. The color-texture representation of pixel is the components x mentioned in the previous subsection. It is vector concatenation of RGB and LDP. The second worth reminding that Y stands for the illuminative pattern in Figure 2 is adopted for the computation of component and U and V are the chromatic components. LDP in the experiments. The effect of shadow For the pixel which does not fit the background model, elimination are shown not only by background masks it is then passed to the shadow layer. First, it is but also with the pixel detection accuracy rate (Acc.), examined if it is qualified as a shadow candidate by a false positive rate (FPR), and false negative rate (FNR), weak shadow classifier according to the following rules if ground truth data is available. These quantitative [7] measures are defined as follows, xY rmin < < rmax (8) # TP µY B Acc. = (15) # TP + # FP + # FN | xU − µU |< Λ U B (9) # FP FPR = (16) | xV − µ V |< Λ V B (10) # TP + # FP + # FN model, respectively. The parameters, rmin, rmax, ΛU and # FN FNR = (17) Λmax, are user-defined thresholding values. Users just # TP + # FP + # FN need to set them roughly by a friendly graphical user interface (GUI) because the more precise shadow Where TP is short for true positive and #TP means pixel classification will further be made by the following number of TP in a frame. In general, false positives are shadow GMM. resulted from moving cast shadows and false negatives are apertures inside the foreground regions. In the Similar to the background layer, the shadow layer is also following figures of background mask, the pixels modeled by a GMM but only the color features of depicted with black, red, white and green represent the shadow candidate will be fed in. For initialization, rmin, background region, false negatives, moving foreground rmax, ΛU and Λmax are used to derive the first Gaussian of and shadows, respectively. The experiments are each color component and set its weight to one. The performed on a personal computer with a Pentium 4 3.0- corresponding means and variances of the first Gaussian GHz CPU and 2 GB RAM. The processing frame rate is are obtained by the following equations: about 15 frames/second. µ Y = µ Y (rmax + rmin ) / 2 S B (11) 3.2.Effect of color-texture combination σY S = ( µ Y rmax B − µY S )/3 (12) To check the effectiveness of using color-texture µU S = µU B (13) combination feature, an experiment of background subtraction using the feature but with single background σ U = ΛU / 3 S (14) layered GMM is conducted. The result of video 1 is demonstrated in Figure 3. The combination weights λ’s where superscripts B and S are related to the background are set to 1, 0.3, and 0 for experiments in column 2, 3, or shadow models, and µV and σ V are calculated in the S S and 4 in Figure.3, respectively. When λ=1, i.e., only same way as Eq. (13) and Eq. (14). In the circumstance color feature is effectively used, there are significant if a feature vector x is not matched to any Gaussian false positives caused by shadows and camera brightness distribution, the Gaussian which has the smallest weight control in the background masks of column 2. When λ=0, is replaced with µ = x and σ = [σ0 σ0 σ0]T, where σ0 is an i.e., only texture feature is effectively used, most of the initial variance. false positives disappear but many apertures show up in the foreground region at column 4 because of the lack of 3. 
EXPERIMENTS texture in both the road scene (background) and most of the surface of car (foreground). When λ=0.3, i.e., the 3.1. Experimental setting color-texture feature is used, the number and size of There are four videos used in the experiment, video 1 is aperture in the results at column 3 become smaller than collected from a road side camera of real surveillance the results at column 4. system by the police department of Taichung County (PDTC), and the others are recorded indoors. Video 2 is 3.3.Results of using multi-layer GMM collected by our colleagues at a porch with glossy wall that reflects object slightly, video 3 and video 4 are The experimental results of using the multi-layer GMM selected from the video dataset at on video 2, 3, and 4 are demonstrated in Figure 4, 5, and http://cvrr.ucsd.edu/aton/shadow, which are entitled by 6, respectively. The results at column 2, 3, and 4 of each intelligentroom_raw and Laboratory_raw, resoectively. figure are obtained by using single layered RGB, multi- The image size of these videos are 320x240 pixels per layered RGB, and multi-layered RGB+LDP background 1048
    models, respectively. Detectionrates, including color and gradient information”, In Proc. of IEEE detection accuracy, false positive rate, and false negative Workshop on Motion and Video Computing, 2002. rate of pixel, of video 2 and 3 are computed based on ground-truth data and are printed on each frame. In [3]K. Yokoi, “Illumination-robust Change Detection addition, the average of detection rates is tabulated for Using Texture Based Features”, In Proc. of IAPR each method in Table 1 and 2. As shown in these figures Conference on Machine Vision Applications, 2007. and tables, the multi-layer GMM of RGB+LDP feature outperforms the method without combining the LDP [4]J. Yao and J. Odobez, “Multi-Layer Background significantly. Subtraction Based on Color and Texture”, In Proc. of IEEE CVPR, 2007. Acc. FPR FNR (%) (%) (%) [5]Cucchiara, R., Grana, C., Piccardi, M., Prati, A., and RGB only 63.89 31.53 4.58 Sirotti, S.: Improving Shadow Suppression in RGB+shadow layer 70.05 4.91 25.04 Moving Object Detection with HSV Color RGB+LDP+shadow 80.27 14.43 5.30 Information. in Proceedings of 2001 IEEE Intelligent layer Transportation Systems Conference. pp. 334-339 Table 1. Average detection rates of moving objects in (2001). video 2. [6]Shan, Y., Yang, F., and Wang, R.: Color Space Acc. FPR FNR Selection for Moving Shadow Elimination. in (%) (%) (%) Proceedings of 4th International conference on RGB only 45.19 53.94 0.87 Image and Graphics. pp. 496-501 (2007). RGB+shadow layer 76.85 1.21 21.94 RGB+LDP+shadow 82.89 15.87 1.25 [7]Nicolas, M.-B., and Zaccarin, A.: Learning and layer Removing Cast Shadows through a Multidistribution Table 2. Average detection rates of moving objects in Approach. IEEE Transactions on Pattern Analysis video 3. and Machine Intelligent. vol. 29, no. 7, pp. 1133- 1146 (2007). 5.CONCLUSION [8]Stauffer, C., and Grimson, W. E. L.: Adaptive This study presents a multi-layer scene model for Background Mixture Models for Real-time Tracking. applications of video surveillance. The proposed scene in Proceedings of IEEE Computer Society model uses a RGB+LDP feature to represent each pixel Conference on Computer Vision and Pattern and classifies each pixel into four different states Recognition. vol. 2, pp. 246-252 (1999). comprising background, moving foreground, static foreground and shadow. As shown in the experimental [9]Tanaka, T., Shimada, A., Arita, D., and Taniguchi, R.: results, both the modeling ability and illumination Non-parametric Background and Shadow Modeling invariance are significantly improved by including the for Object Detection. Lecture Notes in Computer texture information. Science. no. 4843, pp. 159-168 (2007). ACKNOWLEDGEMENT [10]Herrero-Jaraba, E., Orrite-Urunuela, C., and Senar, J.: Detected Motion Classification with a Double- background and a neighborhood-based difference. This paper is a partial result of project 9365C51100 Pattern Recognition Letters. vol. 24, pp. 2079-2092 conducted by ITRI under sponsorship of the Ministry of (2003). Economic Affairs, Taiwan. [11]Gallego, J., Pardas, M., and Landabaso, J.-L.: Segmentation and Tracking of Static and Moving REFERENCES Objects in Video Surveillance Scenarios. in Proceedings of IEEE International Conference on [1] M. Heikkil¨a and M. Pietik¨ainen, “A texture-based Image Processing. pp. 2716-2719 (2008). method for modeling the background and detecting moving objects”, In Proc. of IEEE Transactions on [12]Benedek C., and Sziranyi, T.: Study on Color Space Pattern Analysis and Machine Intelligence, Vol. 
28, Selection for Detecting Cast Shadows in Video No. 4, pp. 657–662, April 2006. Surveillance. International Journal of Imaging Systems and Technology. vol. 17, pp. 190-201 [2]O. Javed, K. Shafique and M. Shah, “A hierarchical (2007). approach to robust background subtraction using 1049
    [13]Izadi, M., andParvaneh, S.: Robust Region-based [14]Zivkovic Z., and van der Heijden, F.: Recursive Background Subtraction and Shadow Removing Unsupervised Learning of Finite Mixture Models. using Color and Gradient Information. in IEEE Transactions on Pattern Analysis and Machine Proceedings of International Conference on Pattern Intelligent. vol. 26, no. 7, pp. 773-780 (2006). Recognition. pp. 1-5 (2008). Color-texture representation of Pixel i Fit Background No Model? Yes Update Is Shadow Background No Candidate? Model Yes Fit Static Background Update Shadow Foreground No Model Model? No Yes Update Static Fit Moving Fit Shadow Foreground No Foreground Model Model ? Model? and Count SF +1 Yes Yes Shadow If Count SF > T 1 Update Moving Foreground Model Transfer Moving and Count MF +1 Yes Foreground to Background, Reinitialize Moving Transfer Static If Count MF >T2 Foreground Model , Foreground Model to No and Set Count MF = 0 Background Model Yes Transfer Moving to Static Foreground Model No and Set Count SF =0 Foreground Model Static Moving Foreground Foreground Fig. 1. Flowchart of the proposed multi-layer scene model 1050
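The control flow of Fig. 1 can be summarised in the pseudocode below. It is a simplified reading of the flowchart and of the description in Section 2: the layer objects and their fits/update/absorb methods are placeholders, while the counters CountSF/CountMF and the thresholds T1/T2 follow the paper.

    def classify_pixel(pix, layers, counts, T1, T2):
        """One pass of the four-layer dispatch for a single pixel (cf. Fig. 1)."""
        if layers.background.fits(pix):
            layers.background.update(pix)
            return "background"
        if weak_shadow_test(pix, layers.background):     # weak classifier, cf. Eqs. (8)-(10)
            layers.shadow.update(pix)
            if layers.shadow.fits_dominant(pix):
                return "shadow"
        if layers.static_fg.exists() and layers.static_fg.fits(pix):
            counts.count_sf += 1
            if counts.count_sf > T1:                     # long-stopped object joins background
                layers.background.absorb(layers.static_fg)
            return "static_foreground"
        if layers.moving_fg.fits(pix):
            layers.moving_fg.update(pix)
            counts.count_mf += 1
            if counts.count_mf > T2:                     # object has stopped moving
                layers.static_fg.replace_with(layers.moving_fg)
                counts.count_sf = 0
            return "moving_foreground"
        # No model fits: fold the old moving-foreground model into the background
        # (multi-mode handling) and restart the moving layer with this pixel.
        layers.background.absorb(layers.moving_fg)
        layers.moving_fg.reinitialize(pix)
        counts.count_mf = 0
        return "moving_foreground"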
Fig. 3. Results of background subtraction controlled by combination weight of color and texture.
Fig. 4. Foreground detection results of video 2. 1051
Fig. 5. Foreground detection results of video 3, the intelligentroom_raw.
Fig. 6. Foreground detection results of video 4, the Laboratory_raw. 1052
    Adaptive Traffic SceneAnalysis by Using Implicit Shape Model Kai-Kai Hsu, Po-Chyi Su and Kai-Yi Cheng Dept. of Computer Science and Information Engineering National Central University Jhongli, Taiwan Email: [email protected] Abstract—This research presents a framework of analyz- research is to provide an approach to deal with the vehicle ing the traffic information in the surveillance videos from occlusion problem, in which multiple vehicles appear in the static roadside cameras to assist resolving the vehicle the video scene and certain parts of them overlap, in occlusion problem for more accurate traffic flow estimation and vehicle classification. The proposed scheme consists of the vehicle detection. The occlusions of vehicles occur two main parts. The first part is a model training mechanism, quite often in cameras set up at the streets and cause in which the traffic and vehicle information will be collected ambiguity in vehicle detecting and may lead to inaccurate and their statistics are employed to automatically establish measurement of traffic parameters, such as the traffic flow the model of the scene and the implicit shape model of volume. We adopt a so-called “Implicit Shape Model” vehicles. The second part adopts the flexibly trained models for vehicle recognition when possible occlusions of vehicles (ISM) to recognize the vehicle and reasonably help in are detected. Experimental results show the feasibility of the solving the occlusion problem. The proposed scheme will proposed scheme. have two parts, i.e. the self-training mechanism and the Keywords-Vehicle; traffic surveillance; occlusion; SIFT; construction of the implicit shape model for resolving vehicle occlusion. The organization of this paper is as follows. A review of the related works is described in I. I NTRODUCTION Section II. The proposed method is presented in Section Developing Intelligent Transportation System (ITS) has III. Preliminary results are shown in Section IV and the been a major investigation these years. Through the inte- conclusive remarks are given in Section V. gration of advanced computing facilities, electronics, com- munication and sensor technologies, ITS can provide the II. R ELATED W ORKS real-time information to help maintain the traffic order or There have been active research efforts on the automatic to ensure the safety of pedestrians and drivers. As there are vision-based traffic scene analysis in recent years [1]–[6]. more and more surveillance cameras deployed along the Levin et al. [1] proposed to collect the training examples local roads or highways, the visual information provided by a coarse detector and the training examples are used by these surveillance videos become an important part to build the final pedestrian detector. The classification of ITS. The traffic information obtained by the vision- criterion for the coarse detector has to be defined manually. based approach can assist the traffic flow control, vehicle Wu et al. [3] employed an online boosting method to counting and categorization, etc. In addition, the emergent enhance the performance of the system. A prior detector traffic events may be detected right after they happened by by off-line learning is employed to train the posterior the advanced visual processing so that the corresponding detector, which adopts unsupervised learning. Nair et al. processes can be applied in a more active way. [7] also employed a supervised way for the initial training. 
The vehicle detection/classification by using the vision- Hsieh et al. [2] adopts a different approach to detect based approach is a challenging issue and various methods the lanes of the surveillance video in the initial stage have been proposed in recent years. It should be noted automatically. Vehicle features such as size and linearity that appearances of vehicles in the surveillance videos are used to detect and classify vehicles, instead of using from different cameras are quite diverse because of the the large amount of labeled training data. The vehicle size different locations, heights, angles and views of cameras. information has to be pre-defined manually. Zhou et al. [4] In addition, the weather condition and the time of video proposed an example-based moving vehicle detection. The recording, e.g. morning or evening, may also affect the vehicle are detected according to the luminance changes vehicle detection process. It is quite difficult to establish by the background subtraction. The features are extracted a common model in advance for all the surveillance from those examples using PCA and trained as a detector videos. Nevertheless, if we choose to construct a model by SVM. Celik et al. [5] presented an unsupervised and for each individual surveillance video, a great deal of on-line approach. A coarse object detector is used to human efforts will be required, given that there are so extract the moving object by the background subtraction many roadside cameras. Therefore, the objective of our and then the obtained samples are refined by clustering research is to enable the procedures of model construction based on the similarity matrix. These extracted features in an automatic manner so that the customized model of are separated into good and bad positives for training a each scene can be established. The other objective of this final detector via SVM. Celik et al. [6] then addressed an 1053
    ¢c £¨¨¢` automatic classificationmethod for identifying pedestrians ¤ ¤ !© and vehicles by SIFT. ¤ ¤!© ¨¢©#¢!( ¤¢W¢¢% Regarding the issues of occlusion problem, various solu- CR3QPC I 97Q H B67 V A 2 9 tions have been proposed [8]–[16]. We roughly classify the ¨§£©¢ ¨¢¡ © ##¢` ¨©¤% ¤ ¤ ¥ ¤!© ¨¢¡#¨  ¥ ©!)¡  !¤0 ¤)¢ approaches into 3D model-based, feature-based methods b !¢¨a ©¢¥ and others. The 3D model is a popular solution to solve the @9876 5 321 4 E @C H 26 HG 3 F C 5 BA 2 U T CR3QPC I S vehicle occlusion. Pang et al. [8], [9] detect a vanishing ¤ ¤!© # ¨£¤ ¨§!©©$ ¤D ¨§ !©©$ 3Q VV 6 X 2 ¤£¢¡  ¨¢¨¤¡£¤¥ ¤¤¨  ¨©¤¤% £¨'!¤ T C H B262Y 6 C ¤©¨¤§¦¤¥ point first in the traffic scene. A 3D deformable model is used to estimate each viewpoint of vehicle occlusion and transform it into a 2D representation. The occlusion Figure 1. The proposed framework is detected by obtaining the curvature points on the shape of vehicle. Occluded vehicles are separated into individual vehicles. The vanishing point is also adopted by Yoneyama a single vehicle can be correctly located. Since the traffic et al. [10]. A hexagon is used to approximate the shape scenes from different cameras may vary significantly, this of vehicle for eliminating shadows. A multiple-camera training process has to be applied for each individual cam- system is utilized to detect the occlusion problem. Song era. If this process is carried out manually, its computation et al. [11] proposed to employ vehicle shape models, will require considerable amounts of human efforts. In camera calibration and ground plane knowledge to detect, order to provide a more feasible solution, we plan to track and classify the occlusion by estimating the related develop a “self-training” adaptive scheme so that these likelihood. Lou et al. [12] established a 3D model for models will be built in a more automatic manner without tracking vehicles and an improved extended Kalman filter involving a great deal of human efforts. Considering that was also presented to track and predict the vehicle motion. the settings of traffic surveillance cameras are usually fixed Most methods of 3D model require the precise camera cal- without rotation and the corresponding traffic scenes tend ibration and vehicle detection. In feature-based methods, to be static, i.e. the background of the traffic scene is the occlusion can be resolved by tracking partial visible invariant, we will extract a long video segment from the features of occluded vehicles. Kanhere et al. [13] proposed target camera for building models. It should be noted that to track vehicle in low-angle situation and estimate the 3D typical vehicles should appear in the extracted long video height of features on the vehicles. The feature points are segment and their related information can thus be collected detected and tracked throughout the image sequence. The as references for future usage. feature-based methods may be influenced by the similar Fig. 1 demonstrates our system framework. The back- shapes from the background. Zhang et al. [14] presented ground of the scene will be constructed from the traffic a multilevel framework, which consists of the intra-frame, surveillance video by an iteratively updating method so inter-frame and tracking level. In the intra-frame level, an that the background subtraction can be applied to extract occlusion is detected by evaluating convex compactness the vehicle masks. 
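A minimal sketch of the iterative background update mentioned here, mirroring the selective update rule B^{i+1}_{x,y} = (1 − αM^b_{x,y})·B^i_{x,y} + αM^b_{x,y}·F^i_{x,y} given later as Eq. (1); the learning-rate value is illustrative.

    import numpy as np

    def update_background(B, F, M, alpha=0.05):
        """Selective running-average update of the background image (cf. Eq. (1)).

        B: current background image, F: current frame, M: binary gating mask of
        Eq. (1) (where M = 1 the background is blended toward F, where M = 0 it
        is left unchanged); alpha is a small learning rate (illustrative value).
        """
        M = M.astype(np.float32)
        return (1.0 - alpha * M) * B + alpha * M * F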
Although the extracted vehicle masks ratio of vehicle shape and resolved by removing a “cutting may contain a single vehicle or colluded ones, it is as- region.” In the inter-frame level, an occlusion is detected sumed that the long video used for training should contain by the statistics of motion vectors of vehicles. During a large number of single vehicles and that the vehicles of the tracking level, the detected vehicles are tracked for the same types should exhibit a similar shape/size. Even resolving the full occlusion. Tsai et al. [15] detect the if many occlusions may happen, their shapes are usually vehicles by using color and edges. The vehicle color usu- quite different. Therefore, the majority-voting methodol- ally looks unique and can be used for searching possible ogy can be employed to determine such static information vehicle locations. Then the edge maps and coefficients in the target traffic video, including the traffic flow direc- of wavelet transform are used for examining the vehicle tions and the vehicle shape/size of different types at the candidates. Wang and Lien [16] proposed an automatic scene to construct our first model, i.e. the scene model. The vehicle detection based on significant subregions of ve- second model, i.e. the shape model, which will be used for hicles, which are transformed to PCA weighting vectors recognizing vehicles, especially the concluded vehicles, and ICA coefficient vectors. The position information is is said to be implicitly established since image features estimated by a likelihood probability evaluation process. are extracted and grouped without explicitly resorting to the exact shapes of vehicles. The scale-invariant feature III. T HE P ROPOSED S ELF -T RAINING S CHEME transform (SIFT) will be used to extract effective features A. System overview from the segmented vehicle masks of consecutive frames Our system is aimed at resolving the vehicle occlusion to indicate the pixels covering vehicles more precisely. problem for more accurate estimation of traffic flow at The statistics of vehicle size information obtained by the the scene captured by a static traffic surveillance camera. occlusion detection is analyzed and will be utilized to clas- Our scheme mainly relies on establishing two models, i.e. sify the vehicle types. By the results of statistics, the types the traffic scene model and the implicit shape model of of vehicles can be classified into motorcycles, sedan cars vehicles, for effective traffic scene analysis. The models and buses according to the vehicle size information. The should be trained in advance so that the pixels covering step of vehicle pattern extraction and classification will 1054
    (a) (a) (b) Figure 3. Convex hulls of (a) non-occlusion vehicle and (b) occluded vehicles. (b) Figure 2. Background image Construction where Vs and Vc represent the vehicle area from the background subtraction and the vehicle convex area, re- spectively. When the value of Γ is closer to one, the collect various types of vehicle masks. The classification vehicle area is similar to its convex hull area and it is implemented by the vehicle size information obtained indicates that the occlusion may not happen. In the training from the traffic information analysis. When the system process, our system tries to extract non-occluded vehicle runs after a period of time, there will be enough vehicle patterns so we set up a high threshold to ensure that most masks to establish the implicit shape model. We will detail of the extracted vehicle patterns contain single vehicles. the procedures of our proposed system as follows. D. Traffic Information Analysis B. Background Model Construction As mentioned before, we require that our system be A series of traffic surveillance frames will be utilized executed in an more automatic manner to reduce the to construct the background image of the traffic scene human efforts for tuning the parameters. Our scheme will captured by a static roadside camera so that the moving obtain the direction of traffic appearing in the scene and vehicles will be detected by the background subtraction. i the common vehicle size information by the statistics of Let Bx,y be the pixel at (x, y) of the background image, the surveillance videos in the training phase. For analyzing and the background updating function is given by the direction of traffic, the vehicle movements must be i+1 b i b i attained first. SIFT is employed to identify features on Bx,y = (1 − αMx,y )Bx,y + αMx,y Fx,y (1) vehicles. After the vehicle segmentation, the vehicles i in which Fx,y is the pixel at (x, y) in frame i; α is the are transformed into feature descriptors of SIFT. The b small learning rate; Mx,y is the binary mask of the current features of frames will be compared and the positions frame. If the pixel at (x, y) belongs to the foreground part, of movements are recorded. After a period of time, the b b Mx,y = 1 to turn on the updating. Otherwise, Mx,y is set main direction of traffic in the surveillance scene can as 0 to avoid updating the background with the moving be observed from the resultant movement histogram. In objects. An example of the scene with its constructed addition, the Region of Interest (ROI) can be identified to background is demonstrated in Fig. 2. facilitate the subsequent processing. The position of ROI is located in the area of the detected traffic flow and the area C. Occlusion Vehicle Detection near the bottom of the captured traffic scene for vehicles It has been observed that the shape of non-occluded of larger size, which can offer more information. vehicle should be close to its convex hull and that the After determining ROI, we can collect vehicle patterns shape of occluded vehicles will show certain concavity, or masks that appear in the ROI. In the training phase, ve- as illustrated in Fig. 3. This characteristic can be used to hicle patterns that are determined to contain single vehicles roughly extract the non-occluded vehicle. In our imple- based on the convex hull analysis will be archived. Then mentation, compactness, Γ, is used to evaluate how close we can check the size histogram of archived vehicles to set the vehicle’s shape and its convex hull are. 
That is, up the criterion for roughly classifying them. In our test Vs videos, the most common vehicles are motorcycles, sedan Γ= , (2) cars and buses. When we examine the histogram of the Vc 1055
    the position oftraining vectors where the codebook entry is found. The position of each feature is dependent on the object center. We match the features from the training ! ! images with the codebook entries. When the similarity of ¥©¨§¦¥¤ £¢¡  ©¥§¥ ©¥§¨ ¥¥©©¨ ¥©¥ features with any entry is above a threshold, the position relative to the object center is recorded along with the codebook entry. After matching the training images with Figure 4. The codebook training procedure. the codebook entries, we obtain the spatial probability distribution. 2) Recognition Approach: Given a target image, the sizes of the collected single vehicle patterns, there will be features are extracted by SIFT and matched to the obvious peaks. To be more specific, we basically make use codewords in the codebook. When the similarity be- of the peaks to determine the sizes of common motorcycles tween extracted features and the codebook entries is and sedan cars since they appear more often. We can then higher than a threshold, these matches are then collected. set up the upper and lower bounds of sedan cars and then According to the spatial probability distribution, these we can use them as the reference to assign a lower bound matched codebook entries cast votes to the object center. of the bus size. In the detection phase, if the vehicle mask When the features of target image that are extracted at is large and shows a convex hull, then the pattern may be (ximg , yimg , simg ), in which (x, y) is the location and determined as a bus. Otherwise, an occlusion may happen s means scale, are determined to have a match with a and this has to be solved by using ISM. In other words, codebook entry, the positions (xpos , ypos , spos ) recorded after the rough classification according to the vehicle sizes, in this codebook entry cast votes for the object center. we proceed to use the vehicle patterns to establish the The voting is applied by codebooks of ISM, which will then be used for resolving the vehicle occlusions. simg xvote = ximg − xpos ( ) (3) spos E. Implicit Shape Model simg yvote = yimg − ypos ( ) (4) Leibe et al. [17] proposed to use ISM for learning spos the shape representations in detecting the most possible simg svote = (5) locations of vehicles in images or frames. The object spos categorization is achieved by learning the appearance where (xvote , yvote , svote ) is a vote for the object center. variability of an object category in a codebook. The After all the matches that the codebook entries have voted, investigated image will be compared with the codewords we store these votes for a probability density estimation in the codebook that has a similar shape and then a mechanism, which is used to obtain the most possible weighted voting procedure will be applied to address the location of the object center. object detection. The steps of ISM are as follows. Next, we collect the votes in a binned 3D accumulator 1) Shape Model Establishment: In the visual object array and search the local maxima for speeding up the recognition, we have to determine the correspondence of computation. The local maxima are detected by comparing the image features with the structures of the object, even each member of the binned 3D accumulator array to its 26 under different conditions. To employ a flexible repre- neighbors in 3×3 regions. 
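The voting step of Eqs. (3)–(5) can be sketched as follows: every match between an image feature at (x_img, y_img, s_img) and a codebook entry casts votes for the object centre using the (x_pos, y_pos, s_pos) occurrences stored with that entry, and the votes are binned into a coarse 3-D accumulator as the speed-up described above. The bin sizes and the entry.occurrences attribute are illustrative assumptions.

    from collections import defaultdict

    def cast_votes(matches, bin_xy=8, bin_s=0.25):
        """Hough-style voting for the object centre (cf. Eqs. (3)-(5))."""
        accumulator = defaultdict(list)
        for x_img, y_img, s_img, entry in matches:
            for x_pos, y_pos, s_pos in entry.occurrences:
                scale = s_img / s_pos
                x_vote = x_img - x_pos * scale      # Eq. (3)
                y_vote = y_img - y_pos * scale      # Eq. (4)
                s_vote = scale                      # Eq. (5)
                key = (int(x_vote // bin_xy), int(y_vote // bin_xy), int(s_vote // bin_s))
                accumulator[key].append((x_vote, y_vote, s_vote))
        return accumulator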
Then, the Mean-Shift approach sentation for object recognition, a codebook is built for [18] is employed to refine the local maxima for more representing features that appear on training images quite accurate location. The Mean-Shift approach can locate often and similar features are clustered. A codeword in the maxima of a density function given the discrete data the codebook should be a compact representation of local sampled from that function. It will quickly converge to appearances of objects. Given an unknown image struc- more precise locations of the local maxima after several ture, we will try to match it with a possible representation iterations. or codeword in the codebook. Then, many such matches are collected and we can then infer the existence of that The refined local maxima can be regarded as candidates object. Again, the scale-invariant interest point detector of the object center. Thus, the following criterion is used is employed to detect the feature points on the training to estimate the existing probability of the object: images and the extracted image regions are then translated 1 lc − li to a representation by a local descriptor. Next, the visually score(lc ) = wi Ker( ), (6) V (sc ) i b(sc ) similar features will be grouped to construct a codebook for representing the local appearances of certain object. where Ker() is a kernel function; b(sc ) is the kernel The k-means algorithm is used to partition the features into bandwidth; V (sc ) is the volume of the kernel; wi and k clusters, in which each feature is assigned to the cluster li are the weighting factor and the location of the vote, center with the nearest distance. The codebook generation respectively; lc and sc are the location and scale of the process is shown in Fig. 4. local maximum. The kernel function Ker() can be treated After building the codebook, the spatial probability as a search window for the position of object center. If distribution is defined for each codebook entry. It records the vote location li is inside of the kernel, the Ker() 1056
    (a) (b) (a) (b) Figure 5. (a) Multi-type vehicle error detection and (b) the result after Figure 6. (a) Multiple hypotheses detected in one vehicle and (b) the the refining procedure. results from the refining procedure. function returns a value one. Otherwise, it returns zero. There exists another problem in the vehicle recognition For the 3D voting space, we use a spherical kernel and by using ISM. As shown in Fig 6, there are three bounding the radius is the bandwidth, b(sc ), which is adaptive to boxes on the same vehicle. It means that the recognition the local maximum scale sc . As the object scale increases, result includes some error detections that ISM has defined the kernel bandwidth should also increase for an accurate for multiple hypotheses on this vehicle. Since the multiple estimation. Therefore, we sum up all the weighting values definition problem comes from the fact that the ISM that are inside of the kernel and divide them by the volume searches the local maxima in the scale-space as shown V (sc ) to obtain an average weight density, which is called in Fig. 7, the scheme may find several local maxima in the score. After the score is derived, we define a thresh- different scale levels but at a similar location. In fact, old θ for determining whether the object exists. When these local maxima are generated by the same vehicle the score is above θ, the hypothesized object center is center. Therefore, the unnecessary hypotheses should be preserved. Finally, we back-project the votes that support eliminated. We deal with the problem by computing the this hypothesized object center to obtain an approximate overlapped area between the two bounding boxes. When shape of the object. the overlapped area between two bounding boxes is very large, we can claim that the bounding box that has a F. Occlusion Resolving weaker score is an error detection. For efficient compu- After detecting the existence of certain occluded vehi- tation, the rate of overlap is computed by finding the cles in the image, we need to classify them into different distance between the two bounding boxes’ central points types. In our scheme, we construct the codebooks of and use the longer diagonal line of the larger bounding different types of vehicles. Each type of vehicle codebook box as the criterion. The longer the distance is, the higher will be established automatically after we obtain enough the areas overlap. In other words, for every two bounding vehicle patterns collected by the process of vehicle ex- boxes, we need to check traction. However, as shown in Fig. 5, the performance of 1 recognition is not as good as expected since many errors distance(B1 , B2 ) D, (7) 3 happen on the bus image. Owing to the fact that the area where B1 and B2 denote two bounding boxes central of buses are much larger than sedan cars and that there points and D is the diagonal line of the larger one. In our are many similar local appearances in these two types, 1 implementation, when the distance is smaller than 3 D, the errors of this kind occur quite often. We provide a refining overlapped area of the bounding boxes is above 50% and procedure as follows. we will thus remove the bounding box that has a lower All the hypotheses are supported by the contributing score. The error detection from ISM can thus be reduced. votes that are cast by the matched features. Theoretically, every extracted feature should only support one hypothesis IV. 
E XPERIMENTAL R ESULTS since it is not possible that one feature belongs to two We have tested the proposed self-training mechanism vehicles. Thus, we will modify these hypotheses after on two different surveillance videos. The scenes of two executing multiple recognition procedures. We first store surveillance videos are displayed in Fig. 8. Scene 1 shown all the hypotheses whose scores are over a threshold. in Fig. 8(a) is a 15 minutes long video while Scene Then all the hypotheses are refined by checking each 2 shown in Fig. 8(b) is a 17 minutes long video. The contributing vote that appears in two hypotheses at the experimental results will be demonstrated in three parts, same time. The hypothesis with a higher score can retain i.e. the traffic information analysis, the vehicle pattern this vote while the vote from others will be eliminated. extraction/classification and the occlusion resolving. Next, the scores of these hypotheses are recalculated. When the new score is above the threshold, the hypothesis A. Traffic Information Analysis can be preserved. After this refining procedure, the number The directions of traffic flow analysis of two scenes of error detections can be reduced. are illustrated in Fig. 9. The red points represent forward 1057
    9 4 D D 8 3.5 Number of Occurences (per minute) Number of Occurences (per minute) 7 3 6 2.5 5 2 4 1.5 3 1 2 1 0.5 0 0 0 10 20 30 40 50 60 70 80 0 20 40 60 80 100 120 140 Size of Vehicles Unit: 100 pixels Size of Vehicles Unit: 100 pixels (a) (b) Figure 10. The vehicle size statistics for (a) Scene 1 and (b) Scene 2. Figure 7. If the distance between two bounding boxes’ centers is smaller, then the overlap area is larger so the distance will be employed to remove Table I the duplicated detections. V EHICLE PATTERN E XTRACTION Total Error Correct rate Scene 1 940 15 98.4% Scene 2 1251 31 97.5% B. Vehicle Pattern Extraction and Classification The various extracted vehicle patterns are demonstrated (a) (b) and they pass the occlusion detection process to ensure that it have no occlusion problem. In our experiment, we give Figure 8. The views of two surveillance videos. (a) Scene 1. (b) Scene Eq.(2) a threshold 0.9 for extracting the sedan car/bus and 2. 0.8 for motorcycles. We apply the shape analysis on sedan cars and buses but not on motorcycles since they cannot be approximated by a convex hull. The performance of vehicle extraction is summarized in Table I. These vehicle moving vehicles and blue points are backward moving patterns will be employed for training. It should be noted vehicles. We can see that the directions of traffic flows are that the errors usually come from some unstable envi- successfully obtained after training the video for a while. It ronmental conditions, which will affect the construction should be noted that the more traffic volume is, the lesser of background image. The vehicle classification result is time we will need. The vehicle size information statistics summarized in Table II. Some extracted patterns from for Scene 1 and Scene 2 are exhibited in Fig 10. There Scene 1 are illustrated in Figs. 11-13. We can see that exist two peaks in each scene as the left peak, which has the vehicle patterns can be effectively extracted and they a smaller vehicle size, represents a motorcycle, while the will be helpful in training a more accurate codebook or right one, which has a larger vehicle size, stands for a models. sedan car. In Scene 1, according to Fig. 10, we assign the lower bound 700 pixels and upper bound 1000 pixels for C. Occlusion Resolving motorcycle size. The upper and lower bounds of sedan Table III and Figs. 14-16 demonstrate the results of car size are 1700 pixels and 3300 pixels respectively. In occlusion resolving. We use the extracted vehicle patterns Scene 2, the motorcycle size is assigned with 1400 pixels to train the ISM codebooks for two different scenes. Table and 2100 pixels while the sedan car size is assigned with III is the performance of resolving occlusion on sedan the lower bound 4000 pixels and the upper bound 8500 cars and the occlusion part of Table III denotes the sedan pixels. We can see that the vehicle size information i.e. cars actually occlude with other vehicles while the non- the motorcycle and sedan car, for surveillance video can occlusion part stands for the sedan cars which are not be obtained by statistics successfully. occluded with other vehicles but pass the occlusion detec- tion. As shown in Figs. 14 and 15, there are several sedan cars that are partially occluded. We use the trained ISM to resolve the occlusions. The red points and bounding boxes Table II V EHICLE PATTERN C LASSIFICATION Motorcycle Sedan car (a) (b) Total Error Correct rate Total Error Correct rate Scene 1 135 3 97.8% 765 34 95.6% Figure 9. 
The directions of traffic flows for (a) Scene 1 and (b) Scene Scene 2 159 2 98.7% 826 46 94.4% 2. 1058
    Figure 11. The extracted motorcycle patterns from Scene 1. Figure 13. The extracted bus patterns from Scene 1. Figure 14. Occlusion resolving of sedan cars in Scene 1. Figure 12. The extracted sedan car patterns from Scene 1. represent vehicle’s central coordinate and its position that are detected by ISM. In Fig. 16, we resolve the problem of occlusion from the two types of vehicles i.e. bus and sedan car. By combining ISM and the proposed self-training mechanism, these occlusion problems can be reasonably resolved. Figure 15. Occlusion resolving of sedan cars in Scene 2. Table III S EDAN C AR O CCLUSION R ESOLVING R ATE Total Miss False alarm occlusion 177 35 46 Scene 1 non-occlusion 88 1 2 occlusion 92 16 21 Scene 2 non-occlusion 130 2 12 Figure 16. Resolving the partial occlusion of sedan car and bus. Recall Precision occlusion 80.2% 75.5% Scene 1 non-occlusion 98.9% 97.8% V. C ONCLUSION occlusion 82.6% 78.2% Scene 2 non-occlusion 98.4% 99.2% We have proposed a framework of analyzing the traffic information in the surveillance videos captured by the 1059
    static roadside cameras.The traffic and vehicle infor- [13] N. Kanhere, S. Birchfield, and W. Sarasua, “Vehicle seg- mation will be collected from the videos for training mentation and tracking in the presence of occlusions,” the related model automatically. For the vehicles without Transportation Research Record: Journal of the Trans- portation Research Board, vol. 1944, no. -1, pp. 89–97, occlusion, we can use the scene model to record and 2006. classify. If an occlusion happen, the implicit shape model will be employed. The experimental results demonstrate [14] W. Zhang, Q. Wu, X. Yang, and X. Fang, “Multilevel this potential solution of solving occlusion problems in Framework to Detect and Handle Vehicle Occlusion,” IEEE the traffic surveillance videos. Future work will be further Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 161–174, 2008. improving the accuracy and the speed of execution. [15] L. Tsai, J. Hsieh, and K. Fan, “Vehicle detection using R EFERENCES normalized color and edge map,” IEEE Transactions on [1] O. Javed, S. Ali, and M. Shah, “Online detection and clas- Image Processing, vol. 16, no. 3, pp. 850–864, 2007. sification of moving objects using progressively improv- ing detectors,” Computer Vision and Pattern Recognition, [16] C. Wang and J. Lien, “Automatic Vehicle Detection Using vol. 1, p. 696701, 2005. Local FeaturesA Statistical Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 83– [2] J. Hsieh, S. Yu, Y. Chen, and W. Hu, “Automatic traffic 96, 2008. surveillance system for vehicle tracking and classification,” IEEE Transactions on Intelligent Transportation Systems, [17] B. Leibe, A. Leonardis, and B. Schiele, “Robust object de- vol. 7, no. 2, pp. 175–187, 2006. tection with interleaved categorization and segmentation,” International Journal of Computer Vision, vol. 77, no. 1, [3] B. Wu and R. Nevatia, “Improving part based object pp. 259–289, 2008. detection by unsupervised, online boosting,” in IEEE Con- ference on Computer Vision and Pattern Recognition, 2007. [18] Y. Cheng, “Mean shift, mode seeking, and clustering,” CVPR’07, 2007, pp. 1–8. IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 17, no. 8, pp. 790–799, 1995. [4] J. Zhou, D. Gao, and D. Zhang, “Moving vehicle detection for automatic traffic monitoring,” IEEE transactions on vehicular technology, vol. 56, no. 1, pp. 51–59, 2007. [5] H. Celik, A. Hanjalic, E. Hendriks, and S. Boughor- bel, “Online training of object detectors from unlabeled surveillance video,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08, 2008, pp. 1–7. [6] H. Celik, A. Hanjalic, and E. Hendriks, “Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video,” Computer Vision and Image Understanding, vol. 113, no. 10, pp. 1076–1094, 2009. [7] V. Nair and J. Clark, “An unsupervised, online learning framework for moving object detection,” Computer Vision and Pattern Recognition, vol. 2, p. 317324, 2004. [8] C. Pang, W. Lam, and N. Yung, “A novel method for resolving vehicle occlusion in a monocular traffic-image sequence,” IEEE Transactions on Intelligent Transportation Systems, vol. 5, pp. 129–141, 2004. [9] ——, “A method for vehicle count in the presence of multiple-vehicle occlusions in traffic images,” IEEE Trans- actions on Intelligent Transportation Systems, vol. 8, no. 3, pp. 441–459, 2007. [10] A. Yoneyama, C. Yeh, and C. 
Kuo, “Robust vehicle and traffic information extraction for highway surveillance,” EURASIP Journal on Applied Signal Processing, vol. 2005, p. 2321, 2005. [11] X. Song and R. Nevatia, “A model-based vehicle segmen- tation method for tracking,” in Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005, 2005, pp. 1124–1131. [12] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, “3- D model-based vehicle tracking,” IEEE Transactions on image processing, vol. 14, no. 10, pp. 1561–1569, 2005. 1060
    An Augmented RealityBased Navigation System for Museum Guidance Jun-Ming Pan, Chi-Fa Chen Chia-Yen Chen, Bo-Sen Huang, Jun-Long Huang, Dept. of Electrical Engineering, Wen-Bin Hong, Hong-Cyuan Syu I-Shou University Vision and Graphics Lab. Dept. of Computer Science and Information Engineering Nat. University of Kaohsiung, Kaohsiung, Taiwan [email protected] Abstract— The paper describes the design and an interest about the exhibition. Therefore, the implemented implementation of an augmented reality based navigation system aims to provide more in depth visual and auditory system used for guidance through a museum. The aim of this information, as well interactive 3D viewing of the objects, work is to improve the level of interactions between a viewer which otherwise cannot be provide by a pamphlet alone. In and the system by means of augmented reality. In the addition, an interactive guidance system will have more implemented system, hand motions are captured via computer impact and create a more interesting experience for the vision based approaches and analyzed to extract representative visitors. actions which are used to interact with the system. In this The implemented system does not require the keyboard manner, tactile peripheral hardware such as keyboard and or the mouse for interaction. Instead, a camera and a mouse can be eliminated. In addition, the proposed system also pamphlet are used to provide the necessary input. To use the aims to reduce hardware related costs and avoid health risks associated with contaminations by contact in public areas. system, the user is first given a pamphlet, as often given out to visitors to the museum, he/she can then check out the Keywords- augmented reality; computer vision; human different objects by moving his/her finger across the paper computer interaction; multimedia interface; and point to the pictures of objects on the pamphlet. The location of the fingertip is captured by an overhead camera and the images are analyzed to determine the user's intended I. INTRODUCTION AND BACKGROUND actions. In this manner, the user does not need to come into The popularity of computers has induced a wide spread contact with anything other than the pamphlet that is given to usage of computers as information providers, in public him/her, thus eliminating health risks due to direct contact facilities such as museums or other tourist attractions. with harmful substances or contaminated surfaces. However, in most locations, the user is required to interact with the system via tactile means, for example, a mouse, a To implement the system, we make use of technologies keyboard, or a touch screen. With a large number of users in augmented reality. coming into contact with the hardware devices, it is hard to Augmented reality (AR) has received a lot of attentions keep the devices free from bacteria and other harmful due to its attractive characteristics including real time contaminates which may cause health concerns to immersive interactions and freedom from cumbersome subsequent users. In addition, constant handling increases the hardware [7]. There have been many applications designed risk of damage to the devices, incurring higher maintenance using AR technologies in areas such as medical applications, cost to the providing party. Thus, it is our aim to design and entertainment, military navigation, as well as many other implement an interactive system using computer vision new possibilities. 
approaches, such that the above mentioned negative effects An AR system usually incorporates technologies from may be eliminated. Moreover, we also intend to enhance the different fields. For example, technologies from computer efficiency of the interface by increasing the amount of graphics are required for the projection and embedding of interaction, which can be achieved by means of a multimedia, virtual objects; video processing is required to display the user augmented reality interface. virtual objects in real time; and computer vision technologies are required to analyse and interpret actions from input In this work, we realize the proposed idea by image frames. As such, an AR system is usually realized by implementing an interactive system that enables the user to a cross disciplinary combination of techniques. interact with a terminal via a pamphlet, which can easily be Existing AR systems or applications often use designated produced in a museum and distributed visitors. The pamphlet markers, such as the AR encyclopedia or other applications contains summarized information about the exhibition or written by ARToolkit [8]. The markers are often bi-coloured objects of interest. However, due to the size of the pamphlet, and without details to facilitate marker recognition. However, it is not possible to put in a lot of information, besides, too for the guidance application, we intend to have a system that much textual information tends to make the visitor lose is able to recognize colour and meaningful images of objects 1061
    or buildings asprinted on a brochure or guide book and use them for user interactions. The paper is organized as follows. Section 2 describes the design of the system; section 3 describes the different steps in the implementation of the system; section 4 discusses the operational navigation system; and section 5 provides the conclusion and discusses possible future research directions. II. S YSTEM DESIGN The section describes how the system is designed and implemented. Issues that arose during the implementation of the system, as well as the approaches taken to resolve the issues are also discussed in the following. To achieve the goals and ideas set out in the previous section, the system is designed with the following Figure 2. Concept of the navigation system. considerations. The system obtains input via a camera, located above and  Minimum direct contact: The need for a user to overlooking the pamphlet. The camera captures images of come into direct contact with hardware devices such the user's hand and the pamphlet. The images are processed as a keyboard, or a mouse, or a touch screen, should and analyzed to extract the motion and the location of the be minimized. fingertip. The extracted information is used to determine the  User friendliness: The system should be easy and multimedia data, including text, 2D pictures, 3D models, intuitive to use, with simple interface and concise sound files, and/or movie clips, to be displayed for the instructions. selected location on the pamphlet. Fig. 2 shows the concept of the proposed navigation system.  Adaptability: The system should be able to handle other different but similar operations with minimum modifications. III. SYSTEM IMPLEMENTATION  Cost effectiveness: We wish to implement the system using readily available hardware, to The main steps in our system are discussed in the demonstrate that the integration of simple hardware followings. can have fascinating performance. A. Build the system using ARToolKit  Simple and robust setup. Our goal is to have the We have selected ARToolKit to develop our system, system installed at various locations throughout the since it has many readily available high level functions that school, or other public facilities. By having a simple can be used for our purpose. It can also be easily integrated and robust setup, we reduce the chances of a system with other libraries to provide more advanced functions and failure. implement many creative applications. In accordance to the considerations listed above, the B. Create markers system is designed to have the input and out interfaces as The system will associate 2D markers on the pamphlet shown in Fig. 1. with 3D objects stored in the database, as well as actions to manipulate the objects. This is achieved by first scanning the marker patterns, storing them in the system and let the program learn to recognize the patterns. In the program, each marker is associated with a particular 3D model or action, such that when the marker has been selected by the user, the associated data or action will be displayed or executed. Fig. 3 shows examples of markers used for the system. Each marker is surrounded by a black border to facilitate recognition. The object markers, as indicated by the blue Figure 1. Diagram for the navigation interface. arrows, are designed to match the objects to be displayed. Bottom right shows a row of markers, as enclosed by the red oval, used to perform actions on the displayed 3D objects. 1062
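To make the marker-to-content association of Sections III.B–III.D concrete, the following Python sketch registers object markers against 3-D model files and action markers against manipulation callbacks. It is only an illustration: the real system is built on ARToolKit, and every identifier below (MarkerRegistry, the marker names, the callbacks) is a hypothetical stand-in rather than the authors' code.

```python
# Minimal sketch of the marker-to-content registry; all names are hypothetical.

class MarkerRegistry:
    def __init__(self):
        self.models = {}    # object markers -> 3-D model file (OpenGL / VRML)
        self.actions = {}   # action markers -> manipulation callbacks

    def register_object(self, marker_id, model_path):
        self.models[marker_id] = model_path

    def register_action(self, marker_id, callback):
        self.actions[marker_id] = callback

    def on_marker_selected(self, marker_id, viewer_state):
        """Called when a detected marker is judged 'selected' (cf. Section III.E)."""
        if marker_id in self.models:
            viewer_state["model"] = self.models[marker_id]   # show the associated 3-D model
        elif marker_id in self.actions:
            self.actions[marker_id](viewer_state)            # e.g. zoom, rotate, reset

registry = MarkerRegistry()
registry.register_object("artifact_01", "models/artifact_01.wrl")
registry.register_action("zoom_in",  lambda s: s.update(scale=s.get("scale", 1.0) * 1.2))
registry.register_action("zoom_out", lambda s: s.update(scale=s.get("scale", 1.0) / 1.2))
registry.register_action("reset",    lambda s: s.update(scale=1.0, angle=0.0))
```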
    Objects model. Note that the actions can be applied to any 3D models that can be displayed by the system. Actions Figure 3. Markers used by the system. C. Create 3D models Figure 5. The user selects the zoom in function to magnify the displayed 3D model. The 3D models that are associated with the markers are created using OpenGL or VRML format. These models can be displayed on top of the live-feed video, such that the user can interact with the 3D models in real time. The models are texture mapped to provide realistic appearances. The models are created in collaboration with the Kaohsiung Museum of History [9]. Fig. 4 shows examples of the 3D models used in the navigation system. The models are completely 3D with texture mapping, and can be viewed from any angle by the user. Figure 6. The user selects the zoom out function to shrink the displayed 3D model. Figure 4. Markers used by the system. D. Implement interactive functions In addition to displaying the 3D models when the user selects a marker, the system will also provide a set of actions that the user can use to manipulate the displayed 3D model in real time. For example, we have designed “+/-” markers for the user to magnify or shrink the displayed 3D model. The user simply places his/her finger on the markers and the 3D model will change size accordingly. There are also markers for the user to rotate the 3D model, as well as reset the model to its original size and position. Figs 5 to 7 show the system with its implemented actions in operation. In the Figure 7. The user use the rotation marker to rotate the 3D object. figures, the user simply puts a finger over the marker, and the selected actions will be performed on the displayed 3D 1063
    E. Determine selection V. CONCLUSION An USB camera is used to capture continuous images of A multimedia, augmented reality interactive navigation the scene. The program will automatically scan the field of system has been designed and implemented in this work. In view in real time for recognized markers. Once a marker has particular, the system is implemented for application in found to be selected, that is, it is partially obstructed by the providing museum guidance. hand, it is considered to be selected. The program will match The implemented system does not require the user to the selected marker with the associated 3D model or action operate hardware devices such as the keyboard, mouse, or in the database. Figs. 5 to 7 show the user selecting markers touch screen. Instead, computer vision approaches are used by pointing to the markers with a finger. From the figures, it to obtain input information from the user via an overhead can be seen that the selected 3D model is shown within the camera. As the user points to certain locations on the video window in real time. Also notice that the models are pamphlet with a finger, the selected markers are identified by placed on top of the corresponding marker’s position in the the system, and relevant data are shown or played, including video window. a texture mapped 3D model of the object, textual, audio, or other multimedia information. Actions to manipulate the displayed 3D model can also be selected in a similar manner. IV. NAVIGATION SYSTEM Hence, the user is able to operate the system without The proposed navigation system has been designed and contacting any hardware device except for the printout of the implemented according to the descriptions provided in the pamphlet. previous sections. The system does not have high memory The implementation of the system is hoped to reduce the requirements and runs effectively on usual PC or laptops. cost of providing and maintaining peripheral hardware The system also requires no expensive hardware, an USB devices at information terminals. At the same time, camera is sufficient to provide the input required. It is also eliminating health risks associated with contaminations by quite easy to set up and customized to various objects and contact in public areas. applications. Work to enhance the system is ongoing and it is hoped The system can be placed at various points in the that the system will be used widely in the future. museum on separate terminals to enable visitors to access additions museum information in an interactive manner. ACKNOWLEDGMENT This research is supported by National Science Council (NSC98-2815-C-390-026-E). We would also like to thank Kaohsiung Museum of History for providing cultural artifacts and kind assistance. REFERENCES [1] J.-Z. Jiang,Why can Wii Win ?,Awareness Publishing,2007 [2] D.-Y. Lai, M. Liou, Digital Image Processing Technical Manual, Kings Information Co., Ltd.,2007 [3] R. Jain, R. Kasturi B. G. Schunck, Machine Vision、McGraw-Hill, 1995 [4] R. Klette, K. Schluns K. Koschan, Computer vision: three- dimensional data from images, Springer; 1998. [5] R. C. Gonzalez and R. E. Woods, Prentice Hall,Digital Image Processing, Prentice Hall; 2nd edition, 2002. [6] HitLabNZ, http://www.hitlabnz.org/wiki/Home, 2008 Figure 8. The interface showning the 3D model and other multimedia information. [7] R. T. Azuma, A Survey of Augmented Reality. In Presence: Teleoperators and Virtual Environments 6, pp 355—385, (1997) Fig. 
8 shows the screen shot of the system in operation. [8] Augmented Reality Network, http://augmentedreality.ning.com, 2008 In Fig. 8, the left window is the live-feed video, with the [9] H.-J. Chien, C.-Y. Chen, C.-F. Chen, Reconstruction of Cultural selected 3D model shown on top of the corresponding Artifact using Structured Lighting with Densified Stereo Correspondence, ARTSIT, 2009. marker’s position in the video window. The window on the [10] C.-H. Liu, Hand Posture Recognition, Master thesis, Dept. of right hand side shows the multimedia information that will Computer Science Eng., Yuan Ze University, Taiwan, 2006. be shown along with the 3D model to provide more [11] C.-Y., Chen, Virtual Mouse:Vision-Based Gesture Recognition, information about the object. For example, when the 3D Master thesis, Dept. of Computer Science Eng., National Sun Yat- object is displayed, the window on the right might show sen University, Taiwan, 2003 additional textual information about the object, as well as [12] J. C., Lai, Research and Development of Interactive Physical Games audio files to describe the object or to suitable provide Based on Computer Vision, Master thesis, Department of Information background music. Communication, Yuan Ze University, Taiwan, 2005 1064
    [13] H.-C., Yeh,An Investigation of Web Interface Modal on Interaction Design - Based on the Project of Burg Ziesar in Germany and the Web of National Palace Museum in Taiwan, Master thesis, Dept. of Industrical Design Graduate Institute of Innovation Design, National Taipei University of Technology, Taiwan, 2007. [14] T. Brown and R. C. Thomas, Finger tracking for the digital desk. In First Australasian User Interface Conference, vol 22, number 5, pp 11--16, 2000 [15] P. Wellner, Interacting with papers on the DigitalDesk, Communications of the ACM, pp.28-35, 1993 1065
    Facial Expression RecognitionBased on Local Binary Pattern and Support Vector Machine 1 2 3 4 Ting-Wei Lee (李亭緯), Yu-shann Wu(吳玉善), Heng-Sung Liu(柳恆崧) and Shiao-Peng Huang(黃少鵬) Chunghwa Telecommunication Laboratories 12, Lane 551, Min-Tsu Road Sec.5 Yang-Mei, Taoyuan, Taiwan 32601, R.O.C. TEL:886 3 424-5095, FAX:886 3 424-4742 Email: [email protected], [email protected], [email protected], [email protected] Abstract—For a long time, facial expression Besides the PCA and LDA, Gabor filter method [3] recognition is an important issue to be full of challenge. In is also used in facial feature extraction. This method has this paper, we propose a method for facial expression both multi-scale and multi-orientation selection in recognition. Firstly we take the face detection method to choosing filters which can present some local features of detect the location of face. Then using the Local Binary facial expression effectively. However, the Gabor filter Patterns (LBP) extracts the facial features. When method suffers the same problem as PCA and LDA. It calculating the LBP features, we use an NxN window to be a statistical region and remove this window by certain would cost too much computation and high dimension of pixels. Finally, we adopt the Support Vector Machine feature space. (SVM) method to be a classifier and recognize the facial In this paper, we use the Local Binary Pattern expression. In the experimental process, we use the JAFFE (LBP) [4][5] as the facial feature extraction method. database and recognize seven kinds of expressions. The average correct rate achieves 93.24%. According to the LBP has low computation cost and efficiently encodes experimental results, we prove that this proposed method the texture features of micro-pattern information in the has the higher accuracy. face image. In the first step, we have to detect the face area to remove the background image. We extract the Keywords: facial expression, face detection, LBP, SVM Haar-like [6] features and use the Adaboost [7] classifier for face detection. The face detection module can be found in the Open Source Computer Vision Library I. INTRODUCTION (OpenCV). After adopting the face area, we calculate this area’s LBP features. Finally, using the Support To analyze facial expression can provide much Vector Machine (SVM) classifies the LBP feature and interesting information and used in several applications. recognizes the facial expression. Experimental results Take electronic board as example, we can realize demonstrate the effective performance of the proposed whether the commercials attract the customers or not by method. the facial expression recognition. In recent years, many researches had worked on this technique of human- The rest of this paper is organized as follows: In computer interaction. Section Ⅱ, we introduce our system flow chart and the The basic key point of any image processing is to face detection. In section Ⅲ, we explain the facial LBP extract the facial features from the original images. representation and SVM classifier. In Section Ⅳ , Principal Component Analysis (PCA) [1] and Linear experimental results are presented. Finally, we give brief Discriminant Analysis (LDA) [2] are two methods used discussion and conclusion in section Ⅴ. widely. PCA computes a set of eigenvalues and eigenvectors. By selecting several most significant II. THE PROPOSED METHOD eigenvectors, it produces the projection axes to let the images projected and minimizes the reconstruction error. 
The flow chart of the proposed facial expression The goal of LDA is to find a linear transformation by recognition method was shown in Fig.1. In the first step, minimizing the within-class variance and maximizing the face detection is performed on the original image to the between-class variance. In other words, PCA is locate the face area. In order to reduce the region of hair suitable for data analysis and reconstruction. LDA is image or the background image, we take a smaller area suitable for classification. But the dimension of image is from the face area after the face detection. In the second usually higher, the calculations require for the process of step, using the LBP method extracts the facial feature extraction would be significant. expression features. When calculating the histogram of LBP features, we use an NxN window to be a statistical 1066
    Original Image Figure 2. Haar-like features: the first row is for the edge Face Detection features and the second row is for the line features. The face detection module can be found in the Open Source Computer Vision Library (OpenCV) [10]. But if we use the original detection region, it may include some areas which are unnecessary, such as hair LBP Feature or background. For avoiding this situation, we cut the Extraction smaller area from the detection region and try to reduce the unnecessary areas but also keep the important features. This area’s width is 126 and the height is 147. Fig. 4 shows the final result of face area. SVM Classification Features Weak Pass Weak Pass Pass Weak Pass Classifier Classifier Classifier A face area 1 2 N Recognition Deny Deny Deny Result Not a face area Figure 1. The flow chart of the proposed method. Figure 3. The decision process of cascade Adaboost. region and move this window by certain pixels. In the last step, SVM classifier is used for the facial expression recognition. A. The Face Detection Viola and Jones [9] used the Haar-like feature for face detection. There are some Haar-like feature samples shown in Fig. 2. Haar-like features can highlight the differences between the black region and the white region. Each portion in facial area has different property, for example, the eye region is darker than the nose region. Hence, the Haar-like features may extract rich information to discriminate different regions. The cascade of classifiers trained by Adaboost technique is an optimal way to reduce the time for searching face area. In this cascade algorithm, the boosted classifier combines several weak classifiers to become a strong classifier. Different Haar-like features are selected and processed by different cascade weak classifiers. Fig. 3 shows the decision process of this algorithm. If the feature set passes through all of the weak classifiers, it is acknowledged as the face area. On the other hand, if the feature set is denied by any weak classifier, it is rejected. Figure 4. The first column is the original images; the second column is the final face areas. 1067
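As an illustration of this detection step, the sketch below runs OpenCV's pretrained frontal-face Haar cascade (the same Viola–Jones/Adaboost machinery described above) and then cuts a tighter 126×147 window from the detection. The crop size follows the dimensions given in the text; centring the crop on the detected face and the detector parameters are assumptions.

```python
import cv2

CASCADE_PATH = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"

def detect_face_region(image_path, crop_w=126, crop_h=147):
    """Detect the largest frontal face with the cascade and cut a tighter window."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cv2.CascadeClassifier(CASCADE_PATH).detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                        # no face found in this image
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # keep the largest detection
    cx, cy = x + w // 2, y + h // 2                        # centre of the detected face
    # Cut a smaller area (126x147 as in the text) to drop hair and background.
    return gray[max(cy - crop_h // 2, 0):cy + crop_h // 2,
                max(cx - crop_w // 2, 0):cx + crop_w // 2]
```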
    6 18 8 III. THE LBP METHOD AND SVM CLASSIFIER B. Local Binary Patterns 21 LBP was used in the texture analysis. This approach is defined as a gray-level invariant measurement method, derived from the texture in a local neighborhood. The LBP has been applied to many Figure 7. Representation of statistic way in width. different fields including the face recognition. By considering the 3x3-neighborhood, the operator C. Support Vector Machine assigns a label to every pixel around the central points in The SVM is a kind of learning machine whose an image. By thresholding each pixel with the center fundamental is statistics learning theory. It has been pixel value, the result is regarded as a binary number. widely applied in pattern recognition. Then, the histogram of the labels can be used as a The basic scheme of SVM is to try to create an texture descriptor. See Figure 5 for an illustration of the optimal hyper-plane as the decision plane, which basic LBP operator. maximizes the margin between the closest points of two Another extension version to the original LBP is classes. The points on the hyper-plane are called support called uniform patterns [11]. A Local Binary Pattern is vectors. In other words, those support vectors are used called uniform if it contains at most two bitwise to decide the hyper-plane. transitions from 0 to 1 or vice versa. For example, Assume we have a set of sample points from two 00011110 and 10000011 are uniform patterns. classes We utilized the above idea of LBP with uniform patterns in our facial expression representation. We {xi , yi }, i  1,, m xi  R N , yi  {1,1} (1) compute the uniform patterns using the (8, 2) neighborhood, which is shown in Fig.6. The (8, 2) stand the discrimination hyper-plane is defined as below: for finding eight neighborhoods in the radius of two. The black rectangle in the center means the threshold, m the other circle points around there mean the f ( x )   y i a i k ( x, xi )  b (2) neighborhoods. But we can see four neighborhoods are i 1 not located in the center of pixels, these neighborhoods’ values are calculated by interpolation method. After that, where f (x ) indicates the membership of x . ai and a sliding window with size 18x21 is used for uniform patterns statistic by shifting 6 pixels in width and 8 b are real constants. k ( x, xi )   ( x),  ( xi ) is a pixels in height. Fig.7 represents the statistic way in kernel function and  (x) is the nonlinear map from width. original space to the high dimensional space. The kernel function can be various types. For example, the linear   function is k ( x, xi )  x  xi , the radial basis function (RBF) kernel function is 1 k ( x, xi )  exp(  x  y ) and the polynomial 2 Figure 5. The basic idea of the LBP operator 2 2 kernel function is k ( x, xi )  ( x  xi  1) n . SVM can be designed for either two-classes classification or multi- classes classification. In this paper, we use the multi- classified SVM and polynomial kernel function [12]. IV. EXPERIMENTAL RESULTS In this paper, we use the JAFFE facial expression database [13]. The examples of this database are shown in the Table 1. The face database is composed of 213 gray scale images of 10 Japanese females. Each person has 7 kinds of expressions, and every expression Figure 6. LBP representation using the (8, 2) includes 3 or 4 copies. Those 7 expressions are Anger, neighborhood Disgust, Fear, Happiness, Neutral, Sadness and Surprise. 1068
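The feature-extraction and classification pipeline described above can be approximated with off-the-shelf libraries, as in the sketch below, which uses scikit-image's uniform LBP and scikit-learn's SVC as stand-ins for the authors' implementation. The (8, 2) neighbourhood, the 18×21 window, and the 6/8-pixel shifts follow the text; the polynomial degree and the per-window histogram normalisation are assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

P, R = 8, 2                     # (8, 2) neighbourhood from the paper
WIN_W, WIN_H = 18, 21           # sliding-window size
STEP_W, STEP_H = 6, 8           # shifts in width / height
N_BINS = P + 2                  # uniform patterns plus one "non-uniform" bin

def lbp_feature_vector(face):
    """Concatenate uniform-LBP histograms over overlapping windows of the face crop."""
    lbp = local_binary_pattern(face, P, R, method="uniform")
    hists = []
    for y in range(0, face.shape[0] - WIN_H + 1, STEP_H):
        for x in range(0, face.shape[1] - WIN_W + 1, STEP_W):
            patch = lbp[y:y + WIN_H, x:x + WIN_W]
            h, _ = np.histogram(patch, bins=N_BINS, range=(0, N_BINS))
            hists.append(h / (h.sum() + 1e-6))
    return np.concatenate(hists)

# Multi-class SVM with a polynomial kernel for the seven JAFFE expressions
# (the degree is an assumption; the paper only states that a polynomial kernel is used).
clf = SVC(kernel="poly", degree=2)
# clf.fit(np.vstack([lbp_feature_vector(f) for f in train_faces]), train_labels)
# predictions = clf.predict(np.vstack([lbp_feature_vector(f) for f in test_faces]))
```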
    Table I THE EXAMPLES OF JAFFE DATABASE Table III THE COMPARISON RESULTS Anger The Reference Reference proposed [14] [15] method Disgust Anger 95% 95.2% 90% Disgust 88% 95.2% 88.89% Fear Fear 100% 85.7% 92.3% Happiness Happiness 100% 84.9% 100% Neutral 75% 100% 100% Neutral Sadness 90% 90.4% 81.8% Surprise 100% 89.8% 100% Sadness Average 92.57% 91.6% 93.24% Surprise According to the Table 3, we can realize the proposed method has the better performance than the The size of each image is 256x256 pixels. Two images other two references obviously. Even though some of each expression for all of the people are used as recognition rates of expressions aren’t as good as the training samples and the rest are testing samples. Hence two reference methods, we still have the highest average the total number of training sample is 140, and the recognition rate. number of testing sample is 73. V. CONCLUSIONS The Table 2 shows that recognition rate of each facial expression which were experimented by the In this paper, we proposed a facial expression proposed method. The last row is the average recognition method by using the LBP features. For recognition rate of 7 expressions, which is 93.24%. The decreasing the computing efforts, we detect the face recognition time of each face image is 0.105 seconds. region before the LBP method. After we extract the We also compare our experimental results with facial features from the detected area, the SVM some references. In reference [14], the author used the classifier will recognize the facial expression finally. By Gabor features and NN fusion method. In another using the JAFFE be the experiment database, we can reference [15], the author took the face image into three prove the proposed method has the 93.24% correction parts and used the 2DPCA method. The training images rate and better than the two reference methods. and test images are the same as the proposed method. For the future work, we still have some aspects to Table 3 shows the comparison result. The average be studied hardly. Those experiments which we recognition rate of reference [14] is 92.57% and the discussed above have the same property. This property reference [15] is 91.6%. is that the training and testing samples are from the same person. In other word, if we want to recognize someone’s expression, we must have his images of Table II THE RECOGNITION RATE OF PROPOSED METHOD various expressions in database previously. But this property is not suitable for the real application. In the Anger 90% future, we want to overcome this problem. Perhaps we Disgust 88.89% can utilize the variations between the different expressions to become a model and use this model to Fear 92.3% recognize. There are other problems in the facial Happiness 100% recognition still have to be dealt with, such as the lighting variation and the pose changing. Those difficult Neutral 100% issues exist for a long time. We will try to find out a Sadness 81.8% better algorithm to enhance our method. Surprise 100% REFERENCES Average 93.24% [1] L.I. Smith, “A Tutorial on Principal Components Analysis”, 2002. [2] H. Yu and J. Yang, “A Direct LDA Algorithm for High- 1069
    Dimensional Data withApplication to Face Recognition”, Pattern Recognition, vol. 34, no. 10, pp.2067–2070, 2001. [3] Deng Hb, Jin Lw and Zhen Lx et al, “A New Facial Expression Recognition Method Based on Local Gabor Filter Bank and PCA plus LDA”, International Journal of Information Technology, vol.11, no. 11, pp.86-96, 2005. [4] Timo Ahonen, Abdenour Hadid and Matti Pietika¨ inen, “Face Description with Local Binary Patterns: Application to Face Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp.2037–2041, 2006. [5] Timo Ahonen, Abdenour Hadid and Matti Pietika¨ inen, “Face Recognition with Local Binary Patterns”, Springer-Verlag Berlin Heidelberg 2004, pp.469–481, 2004. [6] Pavlovic V. and Garg A. “Efficient Detection of Objects and Attributes using Boosting”, IEEE Conf. Computer Vision and Pattern Recognition, 2001. [7] Jerome Friedman, Trevor Hastie and Robert Tibshirani, “Additive Logistic Regression: A Statistical View of Boosting”, The Annals of Statistics, vol. 28, no. 2, pp.337–407, 2000. [8] C. Burges, Tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 955-974, 1998. [9] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, Proceedings of the 2001 IEEE Computer Society Conference, vol 1, 2001, pp. I-511-I-518. [10] Intel, “Open source computer vision library; http://sourceforge.net/projects/opencvlibrary/”, 2001. [11] T. Ojala, M. Pietika¨inen, and T. Ma¨enpa¨a¨, “Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, July 2002. [12] Dana Simian, “A model for a complex polynomial SVM kernel”, Mathematics And Computers in Science and Engineering, pp. 164-169, 2008. [13] M. Lyons, S. Akamatsu, etc. “Coding Facial Expressions with Gabor Wavelets”. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition,Nara Japan, 200-205, 1998. [14] WeiFeng Liu and ZengFu Wang “Facial Expression Recognition Based on Fusion of Multiple Gabor Features”, International Conference on Pattern Recognition, 2006 [15] Bin Hua and Ting Liu , “Facial expression recognition based on FB2DPCA and multi-classifier fusion”, International Conference on Information Technology and Computer Science, 2009. 1070
    MILLION-SCALE IMAGE OBJECTRETRIEVAL 1 1,2 Yin-Hsi Kuo (郭盈希) and Winston H. Hsu (徐宏民) 1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei 2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei ABSTRACT In this paper, we present a real-time system that addresses three essential issues of large-scale image object retrieval: 1) image object retrieval—facilitating pseudo-objects in inverted indexing and novel object- level pseudo-relevance feedback for retrieval accuracy; 2) time efficiency—boosting the time efficiency and memory usage of object-level image retrieval by a novel inverted indexing structure and efficient query evaluation; 3) recall rate improvement—mining semantically relevant auxiliary visual features through visual and textual clusters in an unsupervised and scalable (i.e., MapReduce) manner. We are able to search over one-million image collection in respond to a user query in 121ms, with significantly better accuracy (+99%) than the traditional bag-of-words model. Figure 1: With the proposed auxiliary visual feature Keywords Image Object Retrieval; Inverted File; Visual discovery, more accurate and diverse results of image Words; Query Expansion object retrieval can be obtained. The search quality is greatly improved. Regarding efficiency, because the 1. INTRODUCTION auxiliary visual words are discovered offline on a MapReduce platform, the proposed system takes less than one second searching over million-scale image Different from traditional content-based image retrieval collection to respond to a user query. (CBIR) techniques, the target images to match might only cover a small region in the database images. The needs raise a challenging problem of image object noisily quantized descriptors. Meanwhile, the target retrieval, which aims at finding images that contain a images generally have different visual appearances specific query object rather than images that are globally (lighting condition, occlusion, etc). To tackle these similar to the query (cf. Figure 1). To improve the issues, we propose to mine visual features semantically accuracy of image object retrieval and ensure retrieval relevant to the search targets (see the results in Figure 1) efficiency, in this paper, we consider several issues of and augment each image with such auxiliary visual image object retrieval and propose methods to tackle features. As illustrated in Figure 5, these features are them accordingly. discovered from visual and textual graphs (clusters) in an State-of-the-art object retrieval systems are mostly unsupervised manner by distributed computing (i.e., based on the bag-of-words (BoW) [6] representation and MapReduce [1]). Moreover, to facilitate object-level inverted-file indexing methods. However, unlike textual indexing and retrieval, we incorporate the idea of queries with few semantic keywords, image object pseudo-objects [4] to the inverted file paradigm and the queries are composed of hundreds (or few thousands) of pseudo-relevance feedback mechanism. A novel efficient 1071
    Figure 2: Thesystem diagram. Offline part: We extract visual and textual features from images. Textual and visual image graphs are constructed by an inverted list-based approach and clustered by an adapted affinity propagation algorithm by MapReduce (18 Hadoop servers). Based on the graphs, auxiliary visual features are mined by informative feature selection and propagation. Pseudo-objects are then generated by considering the spatial consistency of salient local features. A compact inverted structure is used over pseudo-objects for efficiency. Online part: To speed up image retrieval, we proposed an efficient query evaluation approach for inverted indexing. The retrieval process is then completed by relevance scoring and object-level pseudo-relevance feedback. It takes around 121ms to produce the final image ranking of image object retrieval over one-million image collections. query evaluation method is also developed to remove Inverted file is a popular way to index large-scale data in unreliable features and further improve accuracy and the information retrieval community [8]. Because of its efficiency. superiority of efficiency, many recent image retrieval Experiment shows that the automatically discovered systems adopt the concept to index visual features (i.e. auxiliary visual features are complementary to VWs). The intuitive way is to record each entry with conventional query expansion methods. Its performance image ID, VW frequency in the inverted file. is significantly superior to the BoW model. Moreover, However, to our best knowledge, most systems simply the proposed object-level indexing framework is adopt the conventional method to the visual domain, remarkably efficiency and takes only 121ms for without considering the differences between documents searching over the one million image collection. and images, where the image query is composed of thousands of (noisy) VWs and the object of interest may occupy small portions of the target images. 2. SYSTEM OVERVIEW 3.1. Pseudo-Objects Figure 2 shows a schematic plot of the proposed system, which consists of offline and online parts. In the offline Images often contain several objects so we cannot take part, visual features (VWs) and textual features (tfidf of the whole image features to represent each object. Each expanded tags) are extracted from the images. We then object has its distinctive VWs. Motivated by the novelty propagate semantically relevant VWs from the textual and promising retrieval accuracy in [4], we adopt the domain to the visual domain, and remove visually concept of pseudo-object—a subset of proximate feature irrelevant VWs in the visual domain (cf. Section 4). All points with its own feature vector to represent a local these operations are performed in an unsupervised area. An example shows in Figure 4 that the pseudo- manner on the MapReduce [1] platform, which is objects, efficiently discovered, can almost catch different famous of it scalability. Operations including image objects; however, advanced methods such as efficient graph construction, clustering, and mining over million- indexing or query expansion are not considered. We scale images can be performed efficiently. To further further propose a novel object-level inverted indexing. enhance efficiency, we index the VWs by the proposed object-level inverted indexing method (cf. Section 3). 3.2. Index Construction We incorporate the concept of pseudo-object and adopt compression methods to reduce memory usage. 
Unlike document words, VWs have a spatial dimension. In the online part, an efficient retrieval algorithm is Neighboring VWs often correspond to the same object in employed to speed up the query process without loss of an image, and an image consists of several objects. We retrieval accuracy. In the end, we apply object-level adopt pseudo-objects and store the object information in pseudo-relevance feedback to refine the search result and the inverted file to support object-level image retrieval. improve the recall rate. Unlike its conventional Specifically, we construct an inverted list for each VW t counterpart, the proposed object-level pseudo-relevance as follows, Image ID i, ft,i, RID1, ... ,RIDf, which feedback places more importance on local objects indicates the ID of the image i where the VW appears, instead of the whole image. the occurrence frequency (ft,i), and the associated object region ID (RIDf) in each image. The addition of the object ID to the inverted file makes it possible to search 3. OBJECT-LEVEL INVERTED INDEXING for a specific object even if the object only occupies a small region of an image. 1072
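A minimal sketch of this object-level posting structure is shown below. The in-memory dictionary layout and the input format are illustrative assumptions; the fields stored per posting (image ID, VW frequency, associated pseudo-object region IDs) follow the description above.

```python
from collections import defaultdict

def build_object_level_index(images):
    """images: {image_id: [(visual_word, region_id), ...]}, one pair per quantised feature.
    Each posting keeps the image ID, the VW frequency in that image, and the IDs of the
    pseudo-object regions (R0 = whole image, R1..Rn = discovered pseudo-objects)."""
    index = defaultdict(dict)            # visual_word -> {image_id: (freq, [region_ids])}
    for image_id, features in images.items():
        per_vw = defaultdict(list)
        for vw, region_id in features:
            per_vw[vw].append(region_id)
        for vw, regions in per_vw.items():
            index[vw][image_id] = (len(regions), regions)
    return index
```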
    Figure 3: Illustrationof efficient query evaluation (cf. Section 3). To achieve time efficiency, first, we rank a visual word by its salience to the query and then retrieve the designated number of candidate images (e.g., 7 images, A to G). After deciding the candidate images, we skip the irrelevant images and cut those non-salient VWs. 3.3. Index Compression Index compression is a common way to reduce memory usage in textual domain. First, we discard the top 5% frequent VWs as stop words to decrease the mismatch rate and reduces the size of inverted file. We then adopt different coding methods to compress data based on their Figure 4: Object-level retrieval results by pseudo- visual characteristics. Image IDs are ordinal numbers objects and object-level pseudo-relevance feedback. sorted in ascending order in the lists, thus we store the The letter below each image represents the region difference between adjacent image IDs instead of the (pseudo-object) with the highest relevance to query image ID itself which is called d-gap [8]. And for region object by (2). The region information is essential for IDs, we adopt a fixed length bit-level coding of three bits query expansion. Instead of using the whole image as to encode it (e.g., R2 010). On the other hand, we use the seed for retrieving other related images, we can a variant length bit-level coding to encode frequency easily identify those related objects (e.g., R0, R5, R0) and (e.g., 3 1110). Furthermore, we implement AND and mitigate the influence of noisy features. Note that the SHIFT operations to efficiently decode the frequency yellow dots in the background are detected feature and region IDs at query time. The memory space for points. indexing pseudo-objects can be saved about 54.1%. 3.4. Object-Level Scoring Method 3.5. Efficient Query Evaluation (EQE) We use the intersection of TFIDF, which performs the best for matching, to calculate the score of each region Conventional query evaluation in inverted indexing indexed by VW t. Besides the discovered pseudo-objects, needs to keep track of the scores of all images in the we also define a new object R0 to treat the whole image inverted lists. In fact, it is observed that most of the as another object. We first calculate the score of every scored images contain only a few matched VWs. We pseudo-object (R) to the query object (Q) as follows, propose an efficient query evaluation (EQE) algorithm that explores a small part of a large-scale database to score ( R , Q ) = ∑ IDFt × min( wt , R , wt ,Q ), (1) reduce the online retrieval time. The procedures of EQE t∈Q are described below and illustrated in Figure 3. where wt,R and wt,Q are the normalized VW frequency in 1. Query term ranking: The ranking score in (1) pseudo-object and in the query respectively. And then favors the query term with higher frequency and the pseudo-object with the highest score is regarded as IDFt; therefore, we sort the query terms according the most relevant object with respect to the query, as to its salience, which is calculated as wt,Q×IDFt for suggested in [4]: VW t. The following phases are then processed sequentially to deal with VWs ordered and score(i,Q) = max{score(R,Q) | R ∈ i}. (2) weighted by their visual significance to the query. 2. Collecting phase: In the retrieval process, user only cares about the images in the top ranks. 1073
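Equations (1) and (2) translate directly into a few lines of code. The sketch below assumes the per-region and query visual-word weights are already normalised frequencies and that IDF values have been precomputed offline.

```python
def score_region(region_w, query_w, idf):
    """Eq. (1): IDF-weighted histogram intersection between a pseudo-object R and the query Q."""
    return sum(idf.get(t, 0.0) * min(region_w.get(t, 0.0), w_q)
               for t, w_q in query_w.items())

def score_image(image_regions, query_w, idf):
    """Eq. (2): an image scores as its most relevant pseudo-object,
    with the whole image included as the extra region R0."""
    return max(score_region(r, query_w, idf) for r in image_regions)
```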
    (a)visual cluster example(b)representative VW selection (c)example results (d)auxiliary VW propagation (e)textual cluster example Figure 5: Image clustering results and mining auxiliary visual words. (a) and (e) show the sample visual and textual clusters; the former keeps visually similar images in the same cluster, while the latter favors semantic similarities. The former facilitates representative VW selection, while the latter facilitates semantic (auxiliary) VW propagation. (b) and (d) illustrate the selection and propagation operations based on the cluster histogram as detailed in Section 4. And a simple example shows in (c). Therefore, instead of calculating the score of each R0 in Figure 4), we can further remove irrelevant objects image, we score the top images of the inverted lists such as the toy in R4 of the second image. and add them to a set S until we have collected sufficient number of candidate images. 4. AUXILIARY VISUAL WORD (AVW) 3. Skipping phase: After deciding the candidate DISCOVERY images, we skip the images that do not appear in the collecting phase. For every image i in the inverted list, score the image i if i∈S , otherwise Due to the limitation of VWs, it is difficult to retrieve skip it. If the number of visited VWs reaches a images with different viewpoints, lighting conditions and predefined cut ratio, go on to the next phase. occlusions, etc. To improve recall rate, query expansion is the most adopted method; however, it is limited by the 4. Cutting phase: Simply remove the remaining VWs, quality of initial retrieval results. Instead, in an offline which usually have little influence on the results. stage, we augment each image with auxiliary visual And then the process stops here. features and consider representative (dominant) features This algorithm works remarkably well, bringing in its visual clusters and semantically related features in about almost the same retrieval quality with much less its textual graph respectively. Such auxiliary visual computational cost. As image queries are generally features can significantly improve the recall rate as composed of thousands or hundreds of (noisy) VWs, demonstrated in Figure 1. We can deploy all the rejecting those non-salient VWs significantly improves processes in a parallel way by MapReduce [1]. Besides, the efficiency and slightly improves the accuracy. the by-product of auxiliary visual word discovery is the reduction of the number indexed visual features for each 3.6. Object-Level Pseudo-Relevance Feedback image for better efficiency in time and memory. (OPRF) Moreover, it is easy to embed the auxiliary visual features in the proposed indexing framework by adding Conventional approach using whole images for pseudo- one new region for those discovered auxiliary visual relevance feedback (PRF) may not perform well when features not existing in the original VW set. only a part of retrieved images are relevant. In such a case, many irrelevant objects would be included in PRF, 4.1. Image Clustering by MapReduce resulting in too many query terms (or noises) and degrading the retrieval accuracy. To tackle this issue, a The image clustering is first based on a graph novel object-level pseudo-relevance feedback (OPRF) construction. The images are represented by 1M VWs algorithm is proposed. Rather than using the whole and 50K text tokens expanded by Google snippets from images, we select the most important objects from each their associated (noisy) tags. 
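The four EQE phases can be compressed into a single pass over the ranked query terms, as in the simplified sketch below. It scores candidates with plain visual-word weights rather than the full region-level score of Eq. (1), and the candidate-set size and cut ratio are tunable assumptions; the posting-list format matches the object-level index sketched earlier.

```python
def efficient_query_evaluation(query_w, index, idf, n_candidates=1000, cut_ratio=0.5):
    """Simplified sketch of the four EQE phases of Section 3.5."""
    # Phase 1 -- query term ranking: order VWs by salience w_{t,Q} * IDF_t.
    terms = sorted(query_w, key=lambda t: query_w[t] * idf.get(t, 0.0), reverse=True)
    scores, candidates = {}, set()
    for rank, t in enumerate(terms, start=1):
        for image_id, (freq, _regions) in index.get(t, {}).items():
            if len(candidates) < n_candidates:
                candidates.add(image_id)                     # Phase 2 -- collecting
            elif image_id not in candidates:
                continue                                     # Phase 3 -- skipping
            scores[image_id] = scores.get(image_id, 0.0) \
                + idf.get(t, 0.0) * min(freq, query_w[t])
        # Phase 4 -- cutting: drop the remaining, non-salient query VWs.
        if len(candidates) >= n_candidates and rank >= cut_ratio * len(terms):
            break
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```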
However, it is very of the top-ranked images and use them for PRF. The challenging to construct image graphs for million-scale importance of each object is estimated according to (2). images. To tackle the scalability problem, we construct By selecting relevant objects in each image (e.g., R0, R5, 1074
    image graphs usingMapReduce model [1], a scalable images in the same textual cluster are semantically close framework that simplifies distributed computations. but usually visually different. Therefore, these images We take the advantage of the sparseness and use provide a comprehensive view of the same object. cosine measure as the similarity measure. Our algorithm Propagating the VWs from the textual domain can extends the method proposed in [2] which uses a two- therefore enrich the visual descriptions of the images. As phase MapReduce model—indexing phase and the example shows in Figure 5(c), the bottom image can calculation phase—to calculate pairwise similarities. It obtain auxiliary VWs with the different lighting takes around 42 minutes to construct a graph of 550K condition of the Arc de Triomphe. The similarity score images on 18-node Hadoop servers. To cluster images can be weighted to decide the number of VWs to be on the image graph, we apply affinity propagation (AP) propagated. Specifically, we derive the VW histogram proposed in [3]. AP is a graph-based clustering from the images of each cluster and then propagate VWs algorithm. It passes and updates messages among nodes based on the cluster histogram weighted by its (semantic) on graph iteratively and locally—associating with the similarity to the canonical image of the textual cluster. sparse neighbors only. It takes around 20 minutes for each iteration and AP converges generally around 20 4.4. Combining Selection and Propagation iterations (~400 minutes) for 550K images by MapReduce model. The selection and propagation operations described The image clustering results are sampled in Figure above can be performed iteratively. The selection 5(a) and (e). Note that if an image is close to the operation removes visually irrelevant VWs and improves canonical image (center image), it has a higher AP score, memory usage and efficiency, whereas the propagation indicating that it is more strongly associated with the operation obtains semantically relevant VWs to improve cluster. Moreover, images in the same visual cluster are the recall rate. Though propagation may include too often visually similar to each other, whereas some of the many VWs and thus decrease the precision, we can images in the same textual cluster differ in view, lighting perform selection after propagation to mitigate this effect. condition, angle, etc., and are potential to bring A straightforward approach is to iterate the two complementary VWs for other images in the same operations until convergence. However, we find that it is textual cluster. enough to perform a selection first, a propagation next, and finally a selection because of the following reasons. 4.2. Representative Visual Word Selection First, only the propagation step updates the auxiliary visual feature and textual cluster images are fixed; each We first propose to remove irrelevant VWs in each image will obtain distinctive VWs at the first image to mitigate the effect of noise and quantization propagation step. The subsequent propagation steps will error to reduce memory usage in the inverted file system only modify the frequency of the VWs. As the objective and to speed up search efficiency. We observe that is to obtain distinctive VWs, frequency is less important images in the same visual cluster are visually similar to here. Second, binary feature vectors perform better or at each other (cf. Figure 5(a)). As illustrated in Figure 5(c), least comparable to the real-valued. 
the middle image can then have representative VWs from the visual cluster it belongs to. We accumulate the number of each VW from the images of a cluster to form 5. EXPERIMENTS a cluster histogram. As shown in Figure 5(b), each image donates the same weight to the cluster histogram. We 5.1. Experimental Setup can then select the VWs whose occurrence frequency is above a predefined threshold (e.g., in Figure 5(b) the We evaluate the proposed methods using a large-scale VWs in red rectangles are selected). photo retrieval benchmark—Flickr550 [7]. Besides, we randomly add Manhattan photos to Flickr550 to make it 4.3. Auxiliary Visual Word Propagation a 1 million dataset. As suggested by many literatures (e.g., [5]), we use the Hessian-affine detector to extract Due to variant capture conditions, some VWs that feature points in images. The feature points are described strongly characterize the query object may not appear in by SIFT and quantized into 1 million VWs for better the query image. It is also difficult to obtain these VWs performance. In addition, we use the average precision to through query expansion method such as PRF because of evaluate the retrieval accuracy. Since average precision the difference in visual appearance between the query only shows the performance for a single image query, we image and the retrieved. Mining semantically relevant compute the mean average precision (MAP) to represent VWs from other information source such as text is the system performance over all the queries. therefore essential to improve the retrieval accuracy. As illustrated in Figure 5(e), we propose to augment 5.2. Experimenal Results each image with VWs propagated from the textual cluster result. This is based on the observation that 1075
    Table 1: Thesummarization of the impacts in the features points. This result shows that the selection performance and query time comparing with the and propagation operations are effective in mining useful baseline methods. It can be found that our proposed features and remove the irrelevant one. In addition, the methods can achieve better retrieval accuracy and relative improvement of AVW (+44%) is orthogonal and respond to a user query in 121ms over one-million complement to OPRF (0.352 0.487, +38%). photo collections. The number in the parentheses indicates relative gain over baseline. And the symbol ‘%’ stands for relative improvement over BoW model 6. CONCLUSIONS [6]. (a) Image object retrieval In this paper, we cover four aspects of large-scale retrieval system: 1) image object retrieval over one- MAP Baseline PRF OPRF million image collections—responding to user queries in 0.290 0.324 121ms, 2) the impact of object-level pseudo-relevance Pseudo-objects [4] 0.251 (+15.5%) (+29.1%) feedback—boosting retrieval accuracy, 3) time (b) Time efficiency efficiency with efficient query evaluation in the inverted Flickr550 One-million file paradigm—comparing with the traditional inverted Pseudo-objects [4] +EQE +EQE file structure, and 4) image object retrieval based on effective auxiliary visual feature discovery—improving Query time 854 56 121 the recall rate. That is to say, the efficiency and (ms) effectiveness of the proposed methods are validated over (c) Recall rate improvement large-scale consumer photos. BoW model [6] AVW AVW+OPRF MAP 0.245 0.352 0.487 REFERENCES % - 43.7% 98.8% [1] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” OSDI, 2004. We first evaluate the performance of object-level PRF [2] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document (OPRF) in boosting the retrieval accuracy. As shown in similarity in large collections with mapreduce,” ACL, Table 1(a), OPRF outperforms PRF by a great margin 2008. (relative improvement 29.1% vs. 15.5%). The result shows that the pseudo-object paradigm is essential for [3] B. J. Frey and D. Dueck, “Clustering by passing PRF-based query expansion in object-level image messages between data points,” Science, 2007. retrieval since the targets of interest might only occupy a small portion of the images. [4] K.-H. Lin, K.-T. Chen, W. H. Hsu, C.-J. Lee, and T.-H. Li, “Boosting object retrieval by estimating pseudo- We then evaluate the query time of object-level objects,” ICIP, 2009. inverted indexing augmented with efficient query evaluation (EQE) to achieve time efficiency. The query [5] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. time is 15.2 times faster (854 56) after combining Zisserman, “Object retrieval with large vocabularies and with EQE method as shown in Table 1(b). The reasons fast spatial matching,” CVPR, 2007. attribute to the selection of salient VWs and ignoring those insignificant inverted lists. It is essential since [6] J. Sivic and A. Zisserman, “Video google: a text retrieval approach to object matching in videos,” ICCV, unlike textual queries with 2 or 3 query terms, an image 2003. query might contain thousands (or hundreds) of VWs. Therefore, we can respond to a user query in 121ms over [7] Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W. H. one-million photo collections. Hsu, and H. Chen, “ContextSeer: context search and Finally, to improve recall, we evaluate the recommendation at query time for shared consumer performance of auxiliary visual word (AVW) discovery. 
photos,” ACM MM, 2008. As shown in Table 1(c), the combination of selection, propagation and further OPRF brings 99% relative [8] J. Zobel and A. Moffat, “Inverted files for text search improvement over BoW model and reduces one-fifth of engines,” ACM Computing Surveys, 2006 1076
    Sport Video HighlightExtraction Based on Kernel Support Vector Machines Po-Yi Sung, Ruei-Yao Haung, and Chih-Hung Kuo, Department of Electrical Engineering National Cheng Kung University Tainan, Taiwan { n2895130 , n2697169 , chkuo }@mail.ncku.edu.tw Abstract—This paper presents a generalized highlight density of cuts, and audio energy, with a derived function to extraction method based on Kernel support vector machines detect highlights. In [5], Duan proposes a technique that (Kernel SVM) that can be applied to various types of sport searches shots with goalposts and excited voices to find video. The proposed method is utilized to extract highlights highlights for soccer programs. To locate scenes of the without any predefining rules of the highlights events. The goalposts in football games, the technique of Chang [6] framework is composed of the training mode and the analysis detects white lines in the field, and then verifies touch-down mode. In the training mode, the Kernel SVM is applied to train shots via audio features. Wan [7] detects voices in classification plane for a specific type of sport by shot features commentaries with high volume, combined with the of selected video sequences. And then the genetic algorithm frequency of shot change and other visual features to locate (GA) is adopted to optimize kernel parameters and select features for improving the classification accuracy. In the goal events. Huang [8] exploited color and motion analysis mode, we use the classification plane to generate the information to find logo objects in the replay of sport video. video highlights of sport video. Accordingly, viewers can access All these techniques have to depend on predefined rules for a important segments quickly without watching through the single specific type of sport video, and as a result may need entire sport video. lots of human efforts to analyze the video sequences and identify the proper objects for highlights in the particular Keywords-Highlight extraction; Sport analysis; Kernel type of sport. support vector machines; Genetic algorithm Many other techniques have employed probabilistic models, such as Hidden Markov Models (HMM), to look for I. INTRODUCTION the correlations of events and the temporal dependency of features [9]-[15]. The selected scene types are represented by Due to the rapid growth of multimedia storage hidden states, and the state transition probabilities can be technologies, such as Portable Multimedia Player (PMP), evaluated by the HMM. Highlights can be identified HD DVD and Blu-ray DVD, large amounts of video contents accurately by some specific transition rules. However, it is can be saved in a small piece of storage device. However, hard to include all types of highlight events in the same set of people may not have sufficient time to watch all the recorded rules, and the model may fail to detect highlights if the video programs. They may prefer skipping less important parts and features are different from the original ones. Cheng [16] only watch those remarkable segments, especially for sport proposed a likelihood model to extract audio and motion videos. Highlight extraction is a technique making use of features, and employed the HMM to detect the transition of video content analysis to index significant events in video the integrated representation for the highlight segments. This data, and thereby help viewers to access the desired parts of kind of methods all need to estimate the probabilities of state the content more efficiently. 
This technique can also be a transitions, which has to be set up through intense human help to the processes of summarization, retrieval, and observations. abstraction from large amounts of video database. Most of the previous researches have adopted rule-based In this paper, we focus on the highlight extraction methods, whereby the rules are heuristically set to describe techniques for sport videos. Many works have been proposed the dynamics among objects and scenes in the highlight that can identify objects that appear frequently in sport events of a specific sport. The rules set for one kind of sport highlights. Xiong [1] propose a technique that extracts audio video usually cannot be applied to the other kinds. In [17], and video objects that are frequently appearing in the we have proposed a more generalized technique based on highlight scenes, like applauses, the baseball catcher, the low-level semantic features. In this approach, we can soccer goalpost, and so on. Tong [2] characterized three generate highlight tempo curves without defining essential aspects for sport videos: focus ranges of the camera, complicated transitions among hidden states, and hence we object types, and video production techniques. Hanjalic et al. can apply this technique to various kinds of videos. [3]-[4] measured three factors, that is, motion activity, 1077
    In this paper,we extend our technique [17] and A. Shot Change Detection incorporate it with the framework of Kernel support vector The task in this stage is to detect the transition point from machines (Kernel SVM). For each type of sport video, a one scene to another. Histogram differences of two small amount of highlight shots are input so that some consecutive frames are calculated by (2) to detect the shot unified features can be extracted. Then apply the Kernel changes in video sequences. A shot change is said to be SVM system to train the classification plane and utilize the detected if the histogram difference is greater than a trained classification plane to analyze other input videos of predefined threshold. The pixel values that are employed to the same sport type, generating the highlight shots. calculate the histogram contains luminance only, since the The rest of this paper is organized as follows. Section II human visual system is more sensitive to luminance presents the overview of the proposed system. Section III (brightness) than to colors. The histogram difference is details the method for highlight shots classification and computed by the equation highlight shots generation. The highlight extraction 255 performance and experimental results are shown in Section  H (i)  H I I 1 (i) (2) Ⅳ. SectionⅤ is the conclusion. DI  i0 N II. PROPOSED HIGHLIGHT SHOT EXTRACTION SYSTEM OVERVIEW where N is the total pixel number in a frame, and HI (i) is the Fig. 1 shows four stages of the proposed scheme: (1) shot pixel number of level i for the I-th frame. Finally, the video change detection, (2) visual and audio features computation, sequence will be separated into several shots according to (3) Kernel SVM training and analysis, and (4) highlight the shot change detection results. shots generation. In the first stage, histogram differences are B. Visual and Audio Features Computation counted to detect the shot change points. In the second stage, the feature parameters of each shot are computed and taken Each shot may contain lots of frames. To reduce the as the input eigenvalues into the Kernel SVM training and computation complexity, we select a keyframe to represent analysis system. The shot eigenvalues include shot length (L), the shot. In this work, we simply define the 10th frame of color structure (C), shot frame difference (Ds), shot motion each shot as the keyframe, since it is usually more stable (Ms), keyframe difference (Dkey), keyframe motion (Mkey), Y- than the previous frames, which may contain mixing frames histogram difference (Yd), sound energy (Es), sound zero- during scene transition. Many of the following features are crossing rate (Zs) and short-time sound energy (Est). They are extracted from this keyframe. collected as a feature set for the i-th shot 1) Shot Length  Vi  L, C, Ds , M s , Dkey , M key , Yd , Es , Zs , Est (1)  We designate the frame number in each shot as the shot length (L). Experiments show that the highlight shot lengths are shorter in non-highlight shots, such as the shots with In the third stage, the Kernel SVM either trains the judges or scenes with special effects. A highlight shot is parameters or analyzes the input features, according to the often longer than a non-highlight shot. For example, mode of the system. Then, in the last stage, highlight shots pitching in baseball games and shooting goal in soccer are generated based on the output of Kernel SVM. We games are usually longer in shot length. 
Hence, the shot explain the first two stages in the following, and the other length is an important feature for the highlights and is two stages are explained in Section Ⅲ. included as one of the input eigenvalues. 2) MPEG-7 Color Structure Highlight shots The color structure descriptor (C) is defined in the generation Highlight Highlight shot shot MPEG-7 standard [18,19] to describe the structuring property of video contents. Unlike the simple statistic of Training histograms, it counts the color histograms based on a data moving window called the structuring element. The analysis system GA Training and Kernel SVM Kernel SVM Baseball parameters descriptor value of the corresponding bin in the color optimization analysis training Basketball histogram is increased by one if the specified color is within and features mode mode selection Soccer the structuring element. Compared to the simple statistic of one histogram, the color structure descriptor can better Visual and audio features reflect the grouping properties of a picture. A smaller C Visual features Audio features value means the image is more structured. For example, both of the two monochrome images in Fig. 2 have 85 black Shot Shot pixels, and hence their histograms are the same. The color structure descriptor C of the image in Fig. 2-(a) is 129, Shot Shot while the image in Fig 2-(b) is more scattered with the C value 508. Fig. 3 depicts the curve of the C values in the detection Shot Video Data video of a baseball game. It shows that pictures with a Audio Data scattered structure usually have higher C values. Figure 1. The proposed highlight shots extraction system 1078
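To make the shot-boundary test of Eq. (2) concrete, the sketch below computes the normalized luminance-histogram difference between consecutive frames and reports a shot change when it exceeds a threshold. This is a minimal illustration, not the authors' implementation; the function names, the 8-bit luminance assumption, and the 0.5 threshold are ours.

```python
import numpy as np

def luminance_histogram(frame_y):
    """256-bin histogram of an 8-bit luminance (Y) frame."""
    return np.bincount(frame_y.ravel(), minlength=256)

def histogram_difference(prev_y, curr_y):
    """D_I = sum_i |H_I(i) - H_{I-1}(i)| / N   (cf. Eq. 2)."""
    n_pixels = curr_y.size
    h_prev = luminance_histogram(prev_y)
    h_curr = luminance_histogram(curr_y)
    return np.abs(h_curr - h_prev).sum() / n_pixels

def detect_shot_changes(frames_y, threshold=0.5):
    """Return indices of frames assumed to start a new shot."""
    boundaries = []
    for i in range(1, len(frames_y)):
        if histogram_difference(frames_y[i - 1], frames_y[i]) > threshold:
            boundaries.append(i)
    return boundaries
```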
    where W andH are the block numbers in the horizontal and vertical directions respectively. MVx,n(i, j) and MVy,n(i, j) are the motion vectors in x and y directions respectively, of the block at i-th row and j-th column in the n-th frame of the shot. The motion vector of a block represents the displacement in the reference frame from the co-located block to the best matched square, and is searched by minimizing the sum of absolute error (SAE) [21]. (a) K 1 K 1 SAE   C(i, j)  R(i, j) (5) i 0 j 0 where C(i, j) is the pixel intensity of a current block at relative position (i, j), and R(i, j) is the pixel intensity of a reference block. 5) Keyframe Difference and Keyframe Motion (b) We calculate the frame difference and estimate the Figure 2. The MPEG-7 Color Structure: (a) a highly structured motion activity between the keyframe and its next frame. monochrome image; (b) a scattered monochrome image. Both have the Suppose the k-th frame is a keyframe. The keyframe same histogram difference Dkey of the shot is defined by W 1 H 1 Dkey   f k (i, j)  f k 1 (i, j) (6) i 0 j 0 where fk(i, j) represents the intensity of the pixel at position (i, j) in the k-th frame. Similarly, keyframe motion Mkey represents the average magnitude of motion vectors inside the key frame and is defined as Figure 3. MPEG-7 Color Structure Descriptor curve in a baseball game. W 1 H 1  MVx (i, j)2  MVy (i, j)2 1 M key  (7) In this paper, we perform edge detection before W  H / K 2 i0 j 0 calculating the color structure descriptors. The resultant C value of each keyframe is regarded as an eigenvalue and where MVx(i, j) and MVy(i, j) denote the components of the included in the input data set of Kernel SVM. motion vectors in x- and y- directions respectively. 3) Shot Frame Difference 6) Y-Histogram Difference The average shot frame difference (Ds) of each shot is The average Y-histogram difference is calculated by defined by 255 L1 W 1 H 1  H (i)  H n1 (i) (  f (i, j)  f n1 (i, j) ) 1 1 L1 n Ds  (3) Yd  1  i0 (8) L 1 n1 WH i0 j 0 n L  1 n1 W H where W and H are frame width and height respectively, where Hn (i) represents the number of pixels at level i, and fn(i, j) is the pixel intensity at position (i, j) in the n-th counted in the n-th frame. In general, the value of Yd is frame. This feature shows the frame activities in a shot. In higher in the highlight shots. general, highlight shots have higher Ds values than non- 7) Sound Energy highlight shots. The sound energy Es is defined as 4) Shot Motion To measure the motion activity, we first partition a M frame into square blocks of the size K-by-K pixels, and  S (n)  S (n) (9) perform motion estimation to find the motion vector of each Es  n 1 block [20]. The shot motion Ms is defined as the average M magnitude of motion vectors by where S(n) is the signal strength of the n-th audio sample in L1 W 1 H 1 a shot, M is the total number of audio samples in the  MVx,n (i, j)2  MVy,n (i, j)2 1 Ms  (4) duration of the corresponding shot. In the highlight shot, the (L 1) W  H / K 2 n1 i0 j 0 sound energy is usually higher than those in non-highlight shots. 1079
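The per-shot visual features of Eqs. (3), (4), and (7) reduce to averages of absolute frame differences and of motion-vector magnitudes. The sketch below assumes the luminance frames and the block motion vectors (produced by an external block-matching step such as the SAE search of Eq. (5)) are already available; the function names are illustrative, not the authors' code.

```python
import numpy as np

def shot_frame_difference(frames):
    """Average absolute frame difference over a shot (cf. Eq. 3).
    `frames` is a sequence of equally sized 2-D luminance arrays."""
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(frames[1:] - frames[:-1])   # |f_n - f_{n-1}| for every pixel
    return diffs.mean()                        # averaged over pixels and frame pairs

def average_motion_magnitude(mv_x, mv_y):
    """Average motion-vector magnitude, used both for the shot motion Ms
    (cf. Eq. 4) and for the keyframe motion Mkey (cf. Eq. 7).
    `mv_x`, `mv_y` hold block-wise motion components, e.g. arrays of shape
    (num_frames, H//K, W//K) for a shot or (H//K, W//K) for a keyframe."""
    return np.sqrt(np.square(mv_x) + np.square(mv_y)).mean()
```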
    8) Sound Zero-crossingRate section, we briefly explain the basic idea about constructing We also adopt the zero-crossing rate (Zs) of the audio the SVM decision functions. signals as one of the input features, since it is a simple a) Linear SVM indicator of the audio frequency information. Experiments Given a training set (x1, y1), (x2, y2),…, (xi, yi), xn  Rn , yn 1, 1, n  1 i , where i is the total indicate that the zero-crossing rate becomes higher in highlight shots. The zero-crossing rate is defined as number of training data, each training data point xn is M associated with one of two classes characterized by a value  signS (i) signS (i  1) 1 fs Z s (n)  (10) yn = ±1. In the linear SVM theory, the decision function is 2M i 1 supposed to be a linear function and defined as f  x  wTx  b where fs is the audio sampling rate, and the sign function is defined by (13)  1 , if S (i)  0, where w, x  Rn , b  R , w is the weighting vector of  signS (i)   0 , if S (i)  0, (11) hyperplane coefficients, x is the data vector in space and b  1 , otherwise. is the bias. The decision function lies half way between two  hyperplanes which referred to as support hyperplanes. SVM is expected to find the linear function f(x) = 0 such that 9) Short-time Sound Energy separates the two classes of data. Fig. 4-(a) shows the Since the crowd sounds always last for 1 or 2 seconds, decision function that separates two classes of data. For and therefore the sound energy can not represent the crowd separable data, there are many possible decision functions. sounds for video shot with longer shot length. Thus, we The basic idea is to determine the margin that separates the select short-time sound energy (Est) as one of the input two hyperplanes and maximize the margin in order to find eigenvalues. The short-time sound energy is defined as the optimal decision function. As shown in Fig. 4-(b), two hyperplanes consist of data points which satisfy wT  x  b  1 and wT  x  b  1 respectively. For example,  S (n)  S p (n)  24000 p the data point x1 of the positive class (yn = +1) lead to a e( p)  n 1 (12) positive value and the data points x2 of the negative class (yn 24000 = -1) are negative. The perpendicular distance between the Est  max e(1), e(2), e(3), , e(m) 2 two hyperplanes is . In order to find the maximized w where Sp(n) is the signal strength of the n-th audio sample at p-th second in the video shot, e(p) is the sound energy of margin and optimal hyperplanes, we must find the smallest the p-th second in the video shot, and m is the time of the distance w . Therefore, the data points have to satisfy the video shot. condition as one set of inequalities y j  wT x j  b  1, for j  1, 2, 3, , i III. HIGHLIGHT SHOT CLASSIFICATION METHOD (14) A. Kernel SVM Training and Analysis System In this work, the Kernel SVM is adopted to analyze the The problem for solving the w and b can be reduced to the input videos and generate the highlight shots. In the training following optimization problem mode, the selected shots for a specific sport type are fed into the system to train for the classification hyperplanes 1 2 Minimize w and we apply genetic algorithm (GA) to select features and 2 (15) subject to y j  wT x j  b  1, for j  1 i optimize kernel parameters for support vector machines. In the analysis mode, the system just loads these pre-stored parameters and generates highlight shots for the input sport video. 
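The audio features of Eqs. (9) through (12) can be computed directly from the shot's audio samples. The following sketch is one plausible reading of those equations, assuming a mono sample array; interpreting Eq. (9) as a mean squared amplitude and taking 24000 samples per second in Eq. (12) are assumptions that should be checked against the original formulation.

```python
import numpy as np

def sound_energy(samples):
    """Mean signal energy of a shot's audio samples (cf. Eq. 9)."""
    samples = np.asarray(samples, dtype=np.float64)
    return np.mean(samples ** 2)

def zero_crossing_rate(samples, sample_rate):
    """Zs = fs / (2M) * sum_i |sign(S(i)) - sign(S(i-1))|   (cf. Eqs. 10-11)."""
    signs = np.sign(np.asarray(samples, dtype=np.float64))
    crossings = np.abs(np.diff(signs)).sum()
    return sample_rate * crossings / (2.0 * len(samples))

def short_time_sound_energy(samples, samples_per_second=24000):
    """Maximum per-second energy over the shot (cf. Eq. 12)."""
    samples = np.asarray(samples, dtype=np.float64)
    n_seconds = len(samples) // samples_per_second
    energies = [np.mean(np.square(samples[p * samples_per_second:
                                          (p + 1) * samples_per_second]))
                for p in range(n_seconds)]
    return max(energies) if energies else 0.0
```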
We will explain the process in details in the This is a quadratic programming (QP) problem and can be following. solved by the following Lagrange function [24]: 1) Support Vector Machines α y w x  b  1, for j  1i SVM is a machine learning technique first suggests by 1 i Vapnik [22] and has widespread applications in L(w, b,  )  wT w  j j T j (16) classification, pattern recognition and bioinformatics. 2 j 1 Typical concepts of SVM are for solving binary classification problems [23]. The data may be where α j denotes the Lagrange multiplier. The w, b, and multidimensional and form several disjoint regions in the space. The feature of SVM is to find the decision functions α j at optimum to minimize (16) are obtained. Then, that optimally separate the data into two classes. In this following the Karush Kuhn-Tucker (KKT) conditions to 1080
    simplify this optimizationproblem. Since the optimization where C is the penalty parameter. This optimization problem have to satisfy the KKT conditions defined by problem also can be solved by the Lagrange function and transformed to dual problem as follows i w α y x j 1 j j j Maximize L( )  i  j  1 y i j yk  j k x j x k T j 1 2 j ,k 1 i α y (22) 0 (17) i j 0 j j Subject to  j1  j y j  0 ,0   j  C, j  1i αj 0 , for j  1i    j y j w x j  b  1  0 , for j  1i T   Similarity, we can solve this dual problem and find the optimal w and b. Substitute (17) into (16), then the Lagrange function is c) Non-linear SVM transformed to dual problem as follows The SVM can extended to the case of nonlinear conditions by projecting the original data sets to a higher dimensional i i  y y α α x x 1 space referred to as the feature space via a mapping MaximizeL( )  αj  j k j k T j k function φ which also called kernel function. The nonlinear 2 j 1 j ,k 1 (18) decision function is obtained by formulating the linear i Subject to α yj1 j j  0, α j  0, j  1i classification problem in the feature space. In nonlinear SVM, the inner products xT xk in (22) can be replaced by j the kernel function k(x j , xk )  φ(x j )T φ(xk ) . Therefore, the Solving for this dual problem and find the Lagrange dual problem in (22) can be replaced by the following multiplier α j . Substitute α j into (19) to find the optimal w equation and b. i i i Maximize L( )   j  1  y y   k (x , x ) j k j k j k  j 1 2 j ,k 1 w α j y j x j , for j  1i (23) i  j1 (19) Subject to jyj  0 ,0   j  C, j  1i 1  sv  1  N b  N sv    wT x S  y  s1  s    j1 According to (19), we also can solve above dual problem and find optimal w and b. The classification is then where xS are data points which Lagrange multiplier α j 0, ys obtained by the sign of is the class of xS and Nsv is the number of xS .  i  b) Linear Generalized SVM In the case where the data is not linearly separable as sign    j 0 y j j k(x, x j )  b     (24) shown in Fig. 4-(c), the optimization problem in (15) will be infeasible. The concepts of linear SVM can also be extended to the linearly nonseparable case. Rewrite (14) as d) Types of Kernels (20) by introducing a non-negative slack variable  . The most commonly used kernel functions are multivariate Gaussian radial basis function (MGRBF),  y j wTx j  b  1   j , for j  1i  (20) Gaussian radial basis function (GRBF), polynomial function and sigmoid function. MGRBF: The above inequality constraints are minimized through a penalized object function. Then the optimization problem n x jm xkm2 can be written as   2 m 2 (25) k(x j , x k )  φ(x j ) T φ(x k )  e m1  i   1 Maximize L( )  w 2  C  j  where  m  , x jm , xkm  , x j , x k  n , xjm is m-th 2    j 1  (21)  Subject to y j w T x j  b  1   j , for j  1i  element of xj, xkm is m-th element of xk,  m is the adjustable parameter of the Gaussian kernel, x j , xk are input data. 1081
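The paper does not give an implementation of the kernel SVM, but the multivariate Gaussian RBF kernel of Eq. (25) can be plugged into an off-the-shelf soft-margin SVM solver. The sketch below uses scikit-learn's support for callable kernels as a stand-in; the per-dimension bandwidths `sigmas` and the penalty `C` are placeholders for the values found by the GA search of Section III, and the variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def mgrbf_kernel(sigmas):
    """Multivariate Gaussian RBF kernel (cf. Eq. 25) with one bandwidth per
    feature dimension; returns a callable usable as SVC(kernel=...)."""
    inv_two_sigma_sq = 1.0 / (2.0 * np.square(np.asarray(sigmas, dtype=np.float64)))

    def kernel(X, Y):
        # per-dimension weighted squared distances between all sample pairs
        diff = X[:, None, :] - Y[None, :, :]
        return np.exp(-np.sum(np.square(diff) * inv_two_sigma_sq, axis=2))

    return kernel

# Hypothetical usage: X holds the per-shot feature sets V_i of Eq. (1) and
# y in {+1, -1} marks highlight / non-highlight shots.
# clf = SVC(C=10.0, kernel=mgrbf_kernel(np.full(X.shape[1], 5.0)))
# clf.fit(X, y)
# predictions = clf.predict(X_new)
```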
    GRBF: 1 n n where g1 ~ gSs , gC ~ gCc and g1 ~ g f f are parameters of n S f x j xk 2 kernel, penalty factor and features respectively. The ns, nc,  and nf are the bits to represent the above parameters. The 2 2 (26) k(x j , xk )  φ(x j )T φ(xk )  e parameters defined at the start process are bits of parameters and features, number of generations, crossover and mutation rate, and limitations of parameters. The next where   , x j , xk n ,  is the adjustable parameter of step is output parameters and features to Kernel SVM for the Gaussian kernel, x j , xk are input data. training. In the selection step, we keep two chromosomes Polynomial function: with maximum objective value (Of) obtained by (29) for next generation. These chromosomes will not change in the following crossover and mutation steps. Fig. 10 shows the k(x j , xk )  φ(x j )T φ(xk )  (1  xT xk )d j (27) crossover and mutation operations. As shown in Fig. 10-(a), two new offspring are obtained by randomly exchanging where d is positive integer, x j , xk n , d is the adjustable genes between two chromosomes using one point crossover. After crossover operation, as shown in Fig. 10-(b), the parameter of the polynomial kernel, x j , xk are input data. binary code genes are changed occasionally from 0 to 1 or 2) Kernel SVM Input Data Structure vice versa called mutation operation. Finally, a new In sport videos, a highlight event usually consists of generation is obtained and output parameters and features several consecutive shots. Fig. 5 shows an example of a again. These processes will be terminated until the home run in a baseball game. It includes three consecutive predefined numbers of generations satisfy. shots: pitching and hitting, ball flying, and the base running. In this paper, we adopt precision and recall rates to Unlike many other highlight extraction algorithms that have evaluate the performance of our system. The precision (P) to predefine the highlight events with specific constituting and recall (R) rates are defined as follows shots, we simply propose to collect the feature sets of several consecutive shots together as the input eigenvalues SNc SNc of the Kernel SVM. P ,R (28) SNe SNt 3) Kernel SVM Training Mode For the training mode, the data are processed in two steps: a) initialization, b) kernel parameters optimization where SNc, SNe, and SNt are the number of correctly and feature selection. extracted highlight shots, extracted highlight shots, and actual highlight shots repectively. a) Initialization of the Input Data In the objective function calculation step, we calculate The initialization process of the training mode is shown the objective value (Of) to evaluate the kernel parameters in Fig. 6. The video is partitioned into shots and divided and select features generated by GA. The objective value into two sets: highlight shots and non-highlight shots. The calculated by following equation eigenvalues of consecutive shots are collected as a data set. All data sets are composed as the input data vector. Then Of  0.5 P  0.5 R each eigenvalue is normalized into the range of [0, 100]. (29) The order of the data set in the input data vector is randomized. These steps will terminate when the predefined number of b) Kernel Parameters Optimization and Feature generations have achieved. And finally we select the kernel selection parameters and features which have maximum objective value. 
Since the parameters in kernel functions are adjustable, 4) Kernel SVM Analysis Mode and in order to improve the classification accuracy, these In the analysis mode, the user has to select a sport type. kernel parameters should be properly set. In this process, The Kernel SVM system directly loads the pre-trained we adopt the GA-based feature selection and parameters classification function corresponding to the sport type. The optimization method proposed by Huang [25] to select classification function is defined as (30), where Cx is the features and optimize kernel parameters for support vector classes of the video shots. Cx = +1 represents the shots machines. Fig. 7 shows the flowchart of the feature belong to highlight shot, and Cx = -1 are non-highlight shot. selection and parameters optimization method. This process can be performed very quickly, since these As shown in Fig. 7, we apply the GA to generate kernel kernel parameters and features do not need to be trained parameters and select features to train the hyperplanes of again. the Kernel SVM. The processes to generate kernel parameters and select features utilize the GA are shown in   i  Fig. 8. The GA start process include generate chromosome  i     1 , if  y j  j k(x, x j )  b   0   randomly and parameters setup. The chromosome is represented as binary coding format as shown in Fig. 9,   Cx  sign y j  j k(x, x j )  b       j 1  1 , if  y  k(x, x )  b   0  i  (30)  j 1      j 1 j j j    1082
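A minimal sketch of the GA fitness evaluation follows: the precision and recall of Eq. (28), the objective value of Eq. (29), and one plausible decoding of the binary chromosome of Fig. 9 into a kernel parameter, a penalty factor, and a feature mask. The bit widths and parameter ranges are assumptions, since the paper does not specify them.

```python
import numpy as np

def precision_recall(extracted, actual):
    """P = SNc / SNe, R = SNc / SNt (cf. Eq. 28); inputs are sets of shot indices."""
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

def objective_value(extracted, actual):
    """GA fitness: Of = 0.5 * P + 0.5 * R   (cf. Eq. 29)."""
    p, r = precision_recall(extracted, actual)
    return 0.5 * p + 0.5 * r

def decode_chromosome(bits, n_features, sigma_range=(0.1, 50.0), c_range=(1.0, 1000.0)):
    """Illustrative decoding of a binary chromosome (cf. Fig. 9) into
    (sigma, C, feature mask); the 8-bit fields and ranges are assumptions."""
    n_param_bits = 8

    def to_real(field, lo, hi):
        return lo + int("".join(map(str, field)), 2) / (2 ** len(field) - 1) * (hi - lo)

    sigma = to_real(bits[:n_param_bits], *sigma_range)
    c = to_real(bits[n_param_bits:2 * n_param_bits], *c_range)
    mask = np.array(bits[2 * n_param_bits:2 * n_param_bits + n_features], dtype=bool)
    return sigma, c, mask
```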
Figure 4. Linear decision function separating two classes: (a) decision function separating the positive class from the negative class; (b) the margin that separates the two hyperplanes; (c) the case of linearly non-separable data sets.
Figure 5. A home run event in a baseball game: (a) pitching and hitting; (b) ball flying; (c) base running.
Figure 6. The initialization of training data.
Figure 7. The flowchart of the feature selection and parameters optimization method.
Figure 8. Genetic algorithm to generate parameters and features.
Figure 9. Chromosome.
Figure 10. (a) Crossover operation; (b) Mutation operation.
IV. EXPERIMENTAL RESULTS

The experimental setups for the different sport types are listed in Table I. For the baseball game, we take hits, home runs, strikeouts, steals, and replays as highlight events. For the basketball game, the highlight events are dunks, three-point shots, jump shots, bank shots, and replays. For the soccer game, we set the highlight events as goals, long shots, close-range shots, free kicks, corner kicks, breakthroughs, and replays.

In this paper, we adopt three kernel functions: the multivariate Gaussian radial basis function, the Gaussian radial basis function, and the polynomial function. We then evaluate the highlight-shot extraction performance of these kernel functions. Table II shows the experimental results of the baseball game NYY vs. NYM, Table III shows the experimental results of the NBA game Celtics vs. Rockets, and Table IV shows the experimental results of the soccer game Arsenal vs. Hotspur. According to the experimental results, the SVM with the MGRBF kernel function has the best overall performance among these types of sport videos.

TABLE I. THE EXPERIMENTAL SETUP FOR DIFFERENT SPORT TYPES

  Sport type   Sequence              Total length   Number of shots
  Baseball     NYY vs. NYM           146 minutes    1097
  Basketball   Celtics vs. Rockets   32 minutes     180
  Soccer       Arsenal vs. Hotspur   48 minutes     280

TABLE II. THE EXPERIMENTAL RESULTS OF THE BASEBALL GAME (NYY VS. NYM)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   87%     89%    77%
  Recall      99%     81%    91%

TABLE III. THE EXPERIMENTAL RESULTS OF THE BASKETBALL GAME (CELTICS VS. ROCKETS)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   100%    86%    93%
  Recall      93%     100%   87%

TABLE IV. THE EXPERIMENTAL RESULTS OF THE SOCCER GAME (ARSENAL VS. HOTSPUR)

  Kernel      MGRBF   GRBF   Polynomial
  Precision   100%    76%    100%
  Recall      88%     96%    73%

V. CONCLUSION

A Kernel SVM can be trained to classify shots by exploiting the information of a unified set of basic features. Experimental results show that the SVM with the multivariate Gaussian radial basis kernel achieves an average of 96% precision and 93% recall.

REFERENCES
[1] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Highlights extraction from sports video based on an audio-visual marker detection framework', Proc. IEEE ICME, July 2005, pp. 29-32.
[2] X. Tong, L. Duan, H. Lu, C. Xu, Q. Tian, and J. S. Jin, 'A mid-level visual concept generation framework for sports analysis', Proc. IEEE ICME, July 2005, pp. 646-649.
[3] A. Hanjalic, 'Multimodal approach to measuring excitement in video', Proc. IEEE ICME, July 2003, pp. 289-292.
[4] A. Hanjalic, 'Generic approach to highlights extraction from a sport video', Proc. IEEE ICIP, Sept. 2003, pp. I-1-4.
[5] L. Y. Duan, M. Xu, T. S. Chua, Q. Tian, and C. S. Xu, 'A mid-level representation framework for semantic sports video analysis', Proc. ACM Multimedia, Nov. 2003, pp. 33-44.
[6] Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, 'Integrated image and speech analysis for content-based video indexing', Proc. IEEE ICMCS, May 1996, pp. 306-313.
[7] K. Wan and C. Xu, 'Efficient multimodal features for automatic soccer highlight generation', Proc. IEEE ICPR, Aug. 2004, pp. 973-976.
[8] Q. Huang, J. Hu, W. Hu, T. Wang, H. Bai, and Y. Zhang, 'A reliable logo and replay detector for sports video', Proc. IEEE ICME, July 2007, pp. 1695-1698.
[9] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala, 'Soccer highlights detection and recognition using HMMs', Proc. IEEE ICME, Aug. 2002, pp. 825-828.
[10] G. Xu, Y. F. Ma, H. J. Zhang, and S. Yang, 'A HMM based semantic analysis framework for sports game event detection', Proc. IEEE ICIP, Sept. 2003, pp. I-25-8.
[11] J. Wang, C. Xu, E. Chng, and Q. Tian, 'Sports highlight detection from keyword sequences using HMM', Proc. IEEE ICME, June 2004, pp. 599-602.
[12] P. Chang, M. Han, and Y. Gong, 'Extract highlights from baseball game video with hidden Markov models', Proc. IEEE ICIP, Sept. 2002, pp. 609-612.
[13] N. H. Bach, K. Shinoda, and S. Furui, 'Robust highlight extraction using multi-stream hidden Markov models for baseball video', Proc. IEEE ICIP, Sept. 2005, pp. III-173-6.
[14] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework', Proc. IEEE ICME, July 2003, pp. III-401-4.
[15] B. Zhang, W. Chen, W. Dou, Y. J. Zhang, and L. Chen, 'Content-based table tennis games highlight detection utilizing audiovisual clues', Proc. IEEE ICIG, Aug. 2007, pp. 833-838.
[16] C. C. Cheng and C. T. Hsu, 'Fusion of audio and motion information on HMM-based highlight extraction for baseball games', IEEE Trans. Multimedia, pp. 585-599, June 2006.
[17] L. C. Chang, Y. S. Chen, R. W. Liou, C. H. Kuo, C. H. Yeh, and B. D. Liu, 'A real time and low cost hardware architecture for video abstraction system', Proc. IEEE ISCAS, May 2007, pp. 773-776.
[18] ISO/IEC JTC1/SC29/WG11 N6881, 'MPEG-7 Requirements Document V.18', January 2005.
[19] ISO/IEC JTC1/SC29/WG11, 'MPEG-7 Overview (version 10)', October 2004.
[20] C. H. Kuo, M. Shen, and C.-C. Jay Kuo, 'Fast motion search with efficient inter-prediction mode decision for H.264', Journal of Visual Communication and Image Representation, pp. 217-242, 2006.
[21] I. E. G. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, 2001.
[24] I. B. Vapnyarskii, 'Lagrange multipliers', in M. Hazewinkel (ed.), Encyclopedia of Mathematics, Kluwer Academic Publishers, 2001, ISBN 978-1556080104.
[25] C. L. Huang and C. J. Wei, 'GA-based feature selection and parameters optimization for support vector machine', Expert Systems with Applications, vol. 31, pp. 231-240, 2006.
    IMAGE INPAINTING USINGSTRUCTURE-GUIDED PRIORITY BELIEF PROPAGATION AND LABEL TRANSFORMATIONS Heng-Feng Hsin (辛恆豐), Jin-Jang Leou (柳金章), Hsuan-Ying Chen (陳軒盈) Department of Computer Science and Information Engineering National Chung Cheng University Chiayi, Taiwan 621, Republic of China E-mail: {hhf96m, jjleou, chenhy}@cs.ccu.edu.tw ABSTRACT problem with isophote constraint. They estimate the smoothness value given by the best chromosome of GA, In this study, an image inpainting approach using and project this value in the isophotes direction. Chan structure-guided priority belief propagation (BP) and and Shen [3] proposed a new diffusion method, called label transformations is proposed. The proposed curvature-driven diffusions (CDD), as compared to approach contains five stages, namely, Markov random other diffusion models. PDE-based approaches are field (MRF) node determination, structure map suitable for thin and elongated missing parts in an image. generation, label set enlargement by label For large and textured missing regions, the processed transformations, image inpainting by priority-BP results of PDE-based approaches are usually optimization, and overlapped region composition. Based oversmooth (i.e., blurring). on experimental results obtained in this study, as Exemplar-based approaches try to fill missing compared with three comparison approaches, the regions in an image by simply copying some available proposed approach provides the better image inpainting part in the image. Nie et al. [4] improved Criminisi et results. al.’s approach [5] by changing the filling order and overcame the problem that gradients of some pixels on Keywords Image Inpainting; Priority Brief Propagation; the source region contour are zeros. A major Label Transformation; Markov Random Field (MRF); shortcoming of exemplar-based approaches is the Structure Map. greedy way of filling an image, resulting in visual inconsistencies. To cope with this problem, Sun et al. [6] 1. INTRODUCTION proposed a new approach. However, in their approach, user intervention is required to specify the curves on Image inpainting is to remove unwanted objects or which the most salient missing structures reside. Jia and recover damaged parts in an image, which can be Tang [7] used image segmentation to abstract image employed in various applications, such as repairing structures. Note that natural image segmentation is a aged images and multimedia editing. Image inpainting difficult task. To cope with this problem, Komodaskis approaches can be classified into three categories, and Tziritas [8] proposed a new exemplar-based namely, statistical-based, partial differential equation approach, which treats image inpainting as a discrete (PDE) based, and exemplar-based approaches. global optimization problem. Statistical-based approaches are usually used for texture synthesis and suitable for highly-stochastic parts in an 2. PROPOSED APPROACH image. However, statistical-based approaches are hard to rebuild structure parts in an image. The proposed approach contains five stages, namely, PDE-based approaches try to fill target regions of Markov random field (MRF) node determination, an image through a diffusion process, i.e., diffuse structure map generation, label set enlargement by label available data from the source region boundary towards transformations, image inpainting by priority-BP the interior of the target region by PDE, which is optimization, and overlapped region composition. typically nonlinear. Bertalmio et al. 
[1] proposed a PDE-based image inpainting approach, which finds out 2.1. MRF node determination isophote directions and propagates image Laplacians to the target region along these directions. Kim et al. [2] As shown in Fig. 1 [8], an image I0 contains a target used genetic algorithms (GA) to solve the inpainting region T and a source region S with S=I0-T. Image 1085
    inapinting is tofill T in a visually plausible way by Vpq (xp , xq ) simply pasting various patches from S. In this study, image inpainting is treated as a discrete optimization = ∑Z(x dp, dq∈Ro p + dp, xq + dq)(I0 (xp + dp) − I0 (xq + dq))2 , (4) problem with a well-defined energy function. Here, where Ro is the overlapped region between two labels, xp discrete MRFs are employed. and xq. To define the nodes of an MRF, the image lattice is used with the horizontal and vertical spacings of gapx 2.3. Label set enlargement and gapy (pixels), respectively. For each lattice point, if its neighborhood of size (2gapx + 1) × (2gapy + 1) overlaps To completely use label informations in the original the target region, it will be an MRF node p. Each label image, three types of label transformations are used to of the label set L of an MRF consists of enlarge the label set. The first type of label (2gapx+1) × (2gapy+1) pixels from the source region S. transformation contains two different directions: the Based on the image lattice, each MRF node may have 2, vertical and horizontal flippings, which can find out 3, or 4 neighboring MRF nodes. labels (patches) that do not exist in the original source Assigning a label to an MRF node is equivalent to region, but have symmetric properties in the horizontal copying the label (patch) to the MRF node. To evaluate or vertical direction. The second type of label the goodness of a label (patch) for an MRF node, the transformation contains three different rotations: left energy (cost) function of an MRF will be defined, 90° rotation, right 90° rotation, and 180° rotation, which which includes the cost of the observed region of an can find out rotated labels (patches) of the above- MRF node. mentioned three degrees. The third type of label We will assign a label x p ∈ L to each MRF node p ˆ transformation is scaling. To keep the original size of horizontal and vertical spacings gapx and gapy, the so that the total energy F (x) of the MRFs is minimized. ˆ original image is directly up/down scaled so that new Here, labels (patches) can be obtained in the original image F ( x) = ∑ V p ( x p ) + ˆ ˆ ∑V pq ˆ ˆ ( x p , xq ), (1) with the same horizontal and vertical spacings. Here, p∈v ( p , q )∈ε both the up-sampling (double-resolution by bilinear where V p ( x p ) (called the label cost hereafter) denotes interpolation) image and the down-sampling (half- the single node potential for placing label xp over MRF resolution) image are used to generate extra candidate node p, i.e., how the label xp agrees with the source labels (patches). region around p. Vpq(xp,xq) represents the pairwise potential measuring how well node p agrees with the 2.4. Image inpainting by priority-BP optimization overlapped region ε between p and its neighboring node q when pasting xp at p and pasting xq at q. Belief propagation (BP) [10] treats an optimization problem by iteratively solving a finite set of equations 2.2. Structure map generation until the optimal solution is found. Ordinary BP is computationally expensive. For an MRF graph, each In this study, the Canny edge detector [9] is used to node sends “message” to all its neighboring nodes, extract the edge map of an image, which preserves the whereas the node receives messages from all its important structural properties of the source region in neighboring nodes. This process is iterated until all the the image. A binary mask E(p) to used to build the messages do not change any more. 
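As an illustration of the MRF node determination described above, the sketch below walks an image lattice with spacings gapx and gapy and keeps a lattice point as an MRF node whenever its (2*gapx+1) x (2*gapy+1) neighbourhood overlaps the target region. The lattice starting offset and the mask convention are our assumptions, not specified in the paper.

```python
import numpy as np

def determine_mrf_nodes(target_mask, gap_x, gap_y):
    """Place MRF nodes on an image lattice with spacings (gap_x, gap_y).
    `target_mask` is a boolean array that is True inside the target region T.
    A lattice point becomes a node if its (2*gap_x+1) x (2*gap_y+1)
    neighbourhood overlaps T."""
    height, width = target_mask.shape
    nodes = []
    for y in range(gap_y, height - gap_y, gap_y):
        for x in range(gap_x, width - gap_x, gap_x):
            patch = target_mask[y - gap_y:y + gap_y + 1,
                                x - gap_x:x + gap_x + 1]
            if patch.any():            # neighbourhood touches the target region
                nodes.append((y, x))
    return nodes
```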
structure map of the image, which is just the edge map The set of messages sent from node p to its with morphological dilation. If E(p) is non-zero, pixel p neighboring node q is denoted by {m pq ( xq )} . This xq ∈L is belonging to the structure part. Then, E(p) is used to message expresses the opinion of node p about formulate the structure weighting function Z(p,q): assigning label xq to node q. The message formulation is ⎧ 1, if E ( p) = 0 and E (q ) = 0, defined as: Z ( p, q ) = ⎨ (2) ⎩w, otherwise, ⎧ ⎫ where w is “the structure weighting coefficient.” The mpq (xq ) = min⎨Vpq (x p , xq ) +Vp (x p ) + ∑mrp (x p )⎬. (5) x p ∈L label cost Vp(xp) is defined as (the sum of weighted ⎩ r:r ≠q,( r , p)∈ε ⎭ squared differences, SWSD): That is, if node p wants to send message mpq to node q, Vp (xp ) node p must traverse its own label set and find the best label to support node q when label xq is assigning to = ∑[Z( p + dp,x +dp)M ( p + dp)(I ( p + dp) − I (x +dp)) , ] dp∈[− gapx , gapx ]× − gapy , gapy p 0 0 p 2 (3) node q. Each message is based on two factors: (1) the where M(p) denotes a binary mask, which is non-zero if compatibility between labels xp and xq, and (2) the pixel p lies inside the source region S. Thus, for an likelihood of assigning label xp to node p, which also MRF node p, if its neighborhood of size (2gapx+1) × contains two factors: (1) the label cost Vp(xp), and (2) (2gapy+1) does not intersect S, Vp(xp)=0. Vpq(xp,xq) for the opinion of its neighboring node about xp measured pasting labels xp and xq over p and q, respectively, can by the third term in Eq. (5). be similarly defined as: 1086
    Messages are iterativelyupdated by Eq. (5) until MRF edge can be bidirectionally traversed. In the they converge. Then, a set of beliefs, which represents forward pass, all the nodes are visited by the priority the probability of assigning label xp to p, is computed order, an MRF node having the highest priority will for each MRF node p as: pass message to its neighboring MRF nodes having the bp (x p ) = −Vp (x p ) − ∑m rp (x p ). (6) lower priorities, and the MRF node having the highest r:(r , p)∈ε priority will be marked as “committed,” which will not The second term in Eq. (6) means that to calculate a be visited again in this forward pass. For label pruning, node’s belief, it is required to gather all messages from the MRF node having the highest priority can transmit all its neighboring nodes. When the beliefs of all MRF its “cheap” message to all its neighboring MRF nodes nodes have been calculated, each node p is assigned the having not been committed. The priority of each best label having the maximum belief: neighboring MRF node having received a new message x p = arg maxbp ( x p ). ˆ (7) is updated. The above process is iterated until there are x p∈L no uncommitted MRF nodes. On the other hand, the To reduce the computational cost of BP, backward pass is performed in the reverse order of the Komodakis and Tziritas [8] proposed “priority-BP” to forward pass. Note that label pruning is not performed control the message passing order of MRF nodes and in the backward pass. “dynamic label pruning” to reduce the number of elements in the label set of each MRF node. In [8], the 2.5. Overlapped region composition priority of an MRF node p is related to the confidence of node p about the label should be assigned to it. The When the number of iterations reaches K, each MRF confidence depends on the current set of beliefs node p is assigned a label having maximum bp values. {bp(xp)} that has been calculated by BP. Here, the xp∈L All the MRF nodes are composed to produce the final priority of node p is designed as: image inpainting results, where label composition is 1 performed in a decreasing order of MRF node priorities. priority ( p ) = , Depending on whether the region contains a global {x p ∈ L : b p ( x p ) ≥ bconf } rel (8) structure or not, two strategies are used to compose each bp (xp ) = bp (xp ) − bp , rel max (9) overlapped region. If an overlapped region contains a rel global structure, graph cuts are used to seam it. where bp is the relative belief value and b p is the max Otherwise, each pixel value of the overlapped region is maximum belief among all labels in the label set of computed by weighted sum of two corresponding pixel node p. Here, the confidence of an MRF node is the values, where the weighting coefficient is proportional number of candidate labels whose relative belief values to the priority of an MRF node. exceed a certain threshold bconf. On the other hand, to traverse MRF nodes, the 3. EXPERIMENTAL RESULTS number of candidate labels for an MRF node can be pruned dynamically. To commit a node p, all labels with In this study, 21 test images are used to evaluate the relative beliefs being less than a threshold bprune for performance of the proposed approach. Three node p will not be considered as its candidate labels. comparison inpainting approaches, namely, the PDE- The remaining labels are called “active labels” for node based approach [1], the exemplar-based approach [5], p. 
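The message update of Eq. (5), the belief of Eq. (6), and the node priority of Eqs. (8)-(9) can be written compactly once the potentials are stored as arrays. The following sketch assumes each node keeps its label-cost vector, the pairwise potential matrices to its neighbours, and the incoming messages; it is a schematic fragment of priority-BP under those assumptions, not the authors' code.

```python
import numpy as np

def update_message(V_pq, V_p, incoming_to_p, exclude_q):
    """One message update m_pq(x_q) following Eq. (5).
    V_pq: (n_labels_p, n_labels_q) pairwise potentials between p and q.
    V_p:  (n_labels_p,) label costs at node p.
    incoming_to_p: dict mapping neighbour id r -> message vector m_rp.
    exclude_q: id of the receiving neighbour q (its message is left out)."""
    total = V_p.astype(np.float64).copy()
    for r, m_rp in incoming_to_p.items():
        if r != exclude_q:
            total = total + m_rp
    # minimise over x_p for every candidate label x_q of node q
    return np.min(V_pq + total[:, None], axis=0)

def node_beliefs(V_p, incoming_messages):
    """b_p(x_p) = -V_p(x_p) - sum_r m_rp(x_p)   (cf. Eq. 6)."""
    total = np.zeros_like(V_p, dtype=np.float64)
    for m_rp in incoming_messages:
        total += m_rp
    return -V_p - total

def node_priority(beliefs, b_conf):
    """priority(p) = 1 / |{x_p : b_rel(x_p) >= b_conf}|   (cf. Eqs. 8-9),
    where b_rel(x_p) = b_p(x_p) - max_x b_p(x) and b_conf is negative."""
    relative = beliefs - beliefs.max()
    confident_labels = int(np.sum(relative >= b_conf))
    return 1.0 / max(confident_labels, 1)
```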
In this study, the label set of an MRF node is sorted and the ordinary priority-BP-based approach [8], are by belief values, at least Lmin active labels are selected implemented in this study. Some image inpainting for an MRF node, and a similarity measure is used to results by the three comparison approaches and the check the remaining labels. If the similarity between proposed approach are shown in Figs. 2-6. two remaining labels is greater than a threshold Sdiff, one In Fig. 2, the image size is 256 × 170, gapx=9, of the two remaining labels will be pruned. This process gapy=9, bconf=-180000, bprune=-360000, Lmax=30, Lmin=5, will be iterated until the relative belief value of any and w=10. Blurring artifacts appear in Fig. 2(c). In Fig. remaining label is smaller than bprune or the number of 2(d), because the isophotes direction is too complex to active labels reaches a user-specified parameter Lmax. guide the inpainting process, the inpainting results are To apply priority-BP to image inpainting, the labels not good. Compared with the ordinary priority-BP- from the source region of an original image and the based approach (Fig. 2(e)), the proposed approach (Fig. labels by applying three types of label transformations 2(f)) can keep the global structure in the image by are obtained so that each MRF node maintains its label guiding the message passing process by the structure set. Then, the number of priority-BP iterations, K, is set, map. In Fig. 3, the image size is 206 × 308, gapx=5, the priorities of all MRF nodes are initialized only by gapy=5, bconf=-40000, bprune=-80000, Lmax=20, Lmin=3, their Vp(xp) values, and message passing is performed. and w=10. In Fig. 3(c), blurring artifacts appear in the Each priority-BP iteration consists of the forward and upper part of the image. In Fig. 3(d), the stone bridge backward passes. Message passing and dynamic label can not be well reconstructed, because there is no pruning are performed in the forward pass, and each suitable patch in the image. Furthermore, error 1087
    propagation appears inthe lake. In Fig. 3(e), because the priority of the bridge structure is low, the bridge structure is broken. In the proposed approach, the weighting coefficient is used to raise the priority of the bridge structure, resulting in the better inpainting results. In Fig. 4, the image size is 208×278, gapx=7, gapy=7, bconf= -150000, bprune=-300000, Lmax=30, Lmin=5 and w=2. For the image, the proposed approach can reconstruct the tower structure by label transformations, (a) (b) whereas the three comparison approaches contain error Fig. 1. (a) Nodes and edges of an MRF; (b) labels of an propagations, due to lack of suitable labels. In Fig. 5, MRF for image inpainting [8]. the image size is 287×216, gapx=10, gapy=10, bconf= -200000, bprune=-400000, Lmax=50, Lmin=5 and w=15. In Fig. 5(f), the proposed approach uses both the original labels and the flipped labels to reconstruct the region to be inpainted, resulting in the better inpainting image. In Fig. 6, the image size is 257 × 271, gapx=6, gapy=6, bconf=-200000, bprune=-400000, Lmax=50, Lmin=10 and (a) (b) w=5. Because the building in the original image has the symmetric property, label transformations can be employed in this case. Blurring artifacts appear in Fig. 6(c). In Fig. 6(d), the isophote direction is too complex so that the structures interfere each other. In Fig. 6(e), the inpainting results are poor, due to lack of valid labels. In Fig. 6(f), for the lower part of the image, the window structure is partially broken due to the building (c) (d) is not totally symmetric so that error propagation appears in some inpainting regions of the image. However, the inpainting image by the proposed approach is better than that by the three comparison methods. 4. CONCLUDING REMARKS (e) (f) Fig. 2. (a) The original image, “Lantern;” (b) the In this study, an image inpainting approach using masked image; (c)-(f) the image inpainting results by structure-guided priority BP and label transformations is the PDE-based approach [1], the exemplar-based proposed. In the proposed approach, to reconstruct the approach [5], the ordinary priority-BP-based approach global structures in an image, the structure map of the [8], and the proposed approach, respectively. image is generated, which guides the inpainting process by priority-BP optimization. Furthermore, three types of label transformations are employed to get more usable labels (patches) for inpainting. Based on the experimental results obtained in this study, as compared with three comparison approaches, the proposed approach provides the better image inpainting results. ACKNOWLEDGEMENT This work was supported in part by National Science Council, Taiwan, Republic of China under Grants NSC (a) (b) 96-2221-E-194-033-MY3 and NSC 98-2221-E-194- Fig. 3. (a) The original image, “Bungee jumping;” (b) 034-MY3. the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (to be continued). 1088
Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (continued).
Fig. 4. (a) The original image, "Tower;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 5. (a) The original image, "Picture frame;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 6. (a) The original image, "Building;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

REFERENCES
[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of ACM Int. Conf. on Computer Graphics and Interactive Techniques, 2000, pp. 417-424.
[2] J. B. Kim and H. J. Kim, "Region removal and restoration using a genetic algorithm with isophote constraint," Pattern Recognition Letters, Vol. 24, pp. 1303-1316, 2003.
[3] T. Chan and J. Shen, "Non-texture inpaintings by curvature-driven diffusions," Journal of Visual Comm. Image Rep., Vol. 12, pp. 436-449, 2001.
[4] D. Nie, L. Ma, and S. Xiao, "Similarity based image inpainting method," in Proc. of 2006 Multi-Media Modeling Conf., 2006, pp. 4-6.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003, pp. 721-728.
[6] J. Sun, L. Yuan, J. Jia, and H. Y. Shum, "Image completion with structure propagation," in Proc. of 2005 ACM SIGGRAPH on Computer Graphics, 2005, pp. 861-868.
[7] J. Jia and C. K. Tang, "Image repairing: Robust image synthesis by adaptive and tensor voting," in Proc. of 2003 IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 643-650.
[8] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Trans. on Image Processing, Vol. 16, pp. 2649-2661, 2007.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679-698, 1986.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA, 1988.
    CONTENT-BASED BUILDING IMAGERETRIEVAL Wen-Chao Chen(陳文昭), Chi-Min Huang (黃啟銘), Shu-Kuo Sun (孫樹國), Zen Chen (陳稔) Dept. of Computer Science, National Chiao Tung University E-mail:[email protected], toothbrush.cs97g@ nctu.edu.tw, [email protected], [email protected] Abstract—This paper addresses an image retrieval query image, the content-based image retrieval system system which searches the most similar building for a extracts the most similar images from a database by captured building image from an image database based either spatial information, such as color, texture and on an image feature extraction and matching method. shape, or frequency domain features, e.g. wavelet-based The system then can provide relevant information to methods [3]. users, such as text or video information regarding the Existing content-based image retrieval algorithms query building in augmented reality setting. However, can be categorized into (a) image classification methods, the main challenge is the inevitable geometric and and (b) object identification methods. The first approach photometric transformations encountered when a retrieves images which belong to the same category as a handheld camera operates at a varying viewpoint under query image. Jing et al. proposed region-based image various lighting environments. To deal with these retrieval architecture [6]. An image is segmented into transformations, the system measures the similarity regions by the JSEG method and every region is between the MSER features of the captured image and described with color moment. Every region is clustered database images using the Zernike Moment (ZM) to form a codebook by Generalized Lloyd algorithm. information. This paper also presents algorithms based The similarity of two images is then measured by Earth on feature selection by multi-view information and the Mover’s Distance (EMD). Willamowski et al. presented DBSCAN clustering method to retrieve the most generic visual categorization method by using support relevant image from database efficiently. The vector machine as a classifier [7]. Affine invariant experimental results indicate that the proposed system descriptor represents an image as a vector quantization. has excellent performance in terms of the accuracy and In the second approach Wu and Yang [8] detected processing time under the above inevitable imaging and recognized street landmarks from database images variations. by combining salient region detection and segmentation techniques. Obdrzalek and Matas [9] developed a Keywords Image recognition and retrieval; Geometric building image recognition system based on local affine and photometric transformations; Zernike moments; features that allows retrieval of objects in images taken Image indexing; from distinct viewpoints. Discrete cosine transform 1. INTRODUCTION (DCT) is then applied to the local representations to reduce the memory usage. Zhang and Kosecka [10] also In recent years, there have been an increasing proposed a system to recognize building by a number of applications in Location-Based Service hierarchical approach. They first index the model views (LBS). LBS is an service that can be accessed from by localized color histograms. After converting to mobile devices to provide information based on the YCbCr color space and indexing with the hue value, current geographical position, e.g. GPS information. SIFT descriptors [4, 5] are then applied to refine However, GPS position is only available in open spaces recognition results. 
since the GPS signal is often blocked by high-rise Most of related image retrieval algorithms detect buildings or overhead bridges. Magnetic compasses are local features of a query image and then compare with also disturbed by nearby magnetic materials. Vision- detected features of database images by feature based localization is therefore an alternative approach to descriptors. However, the feature detectors such as provide both accurate and robust navigation information. Harris corner detector and the SIFT detector, which is This paper addresses the aspects of a building based on the difference of Gaussians (DOG), utilize a image retrieval system. The building recognition is a circular window to search for a possible location of a content-based image retrieval technique that can be feature. The image content in the circular window is not extended to applications of object recognition and web robust to affine deformations. Furthermore, the feature image search via a cloud service combined with points may not be reliable and may not appear consumer-oriented augmented reality tools. Given a 1091
simultaneously across multiple views with wide baselines. Matas et al. [13] presented the maximally stable extremal region (MSER) detector. Mikolajczyk and Schmid [3] proposed the Harris-Affine and Hessian-Affine detectors. The performance of the existing region detectors was evaluated in [14], in which the MSER detector and the Hessian-Affine detector were ranked as the two best. Chen and Sun [2] compare various popular feature descriptors, e.g. SIFT, PCA-SIFT, GLOH, and steerable filters, with the phase-based Zernike Moment (ZM) descriptor. The ZM descriptor performs significantly better than the other descriptors under geometric and photometric transformations such as blur, illumination, noise, scale, and JPEG compression. To describe a building image under geometric and photometric transformations, this paper utilizes the MSER method as the feature detector. The Zernike Moment is then applied to describe each detected feature region.

In order to index a large number of feature descriptors, the KD-tree [12] is a fundamental method that recursively partitions the space into two subspaces to construct a binary tree.

We also introduce a building image dataset, the NCTU-Bud dataset, containing high-resolution images of 22 buildings located on the National Chiao Tung University campus, with a total of 190 database images. We capture at least one face of each building from 5 distinct viewing directions. Query images are captured under 12 different lighting conditions for performance evaluation.

Fig. 1 shows the overall system block diagram. Section 2 briefly describes the background of the feature detector and descriptor. Section 3 presents a feature selection method to remove unstable features and a clustering method to obtain representative features. In Section 4 the image indexing and retrieval method is described. In Section 5 experimental results on the NCTU-Bud dataset are described. The performance on the publicly available ZuBud dataset is evaluated as well. Finally, Section 6 concludes the paper.

Figure 1. System block diagram.

2. FEATURE DETECTOR AND DESCRIPTOR

2.1. MSER feature region detector

Recently, a number of local feature detectors using a local elliptical window have been investigated. The MSER detector is evaluated as one of the best region detectors [5]. The advantage of the MSER detector is its ability to resist geometric transformations. The MSER detector also performs well when images contain homogeneous regions with distinctive boundaries [1]. Because building images contain regions with such boundaries, e.g. windows and color bricks, the MSER detector can extract these regions stably. After detecting elliptical regions with the MSER method, we filter out unstable regions such as those with oversized area, large aspect ratio, duplicated regions, and high area variation, as shown in Fig. 2.

Figure 2. (a) Initial MSER results. (b) Results after removing unstable MSER feature regions.

2.2. Zernike Moment feature region descriptor

Once the feature regions are detected, every region is described as a feature vector for similarity measurement. This paper presents a method which applies the Zernike Moment (ZM) as the feature descriptor [2]. Zernike moments (ZMs) have been used in object recognition regardless of variations in position, size, and orientation. Essentially, Zernike moments are an extension of the geometric moments obtained by replacing the conventional transform kernel $x^m y^n$ with orthogonal Zernike polynomials.

The Zernike basis function $V_{nm}(\rho, \theta)$ is defined over a unit circle with order $n$ and repetition $m$ such that (a) $n - |m|$ is even and (b) $|m| \le n$, as given by

$$V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}, \quad \rho \le 1 \qquad (1)$$

where $R_{nm}(\rho)$ is a radial polynomial of the form

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s \frac{(n-s)!}{s!\,\left(\frac{n+|m|}{2}-s\right)!\,\left(\frac{n-|m|}{2}-s\right)!}\; \rho^{\,n-2s} \qquad (2)$$
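As a worked illustration of the two definitions just given, the following short Python sketch (not from the paper; the function names are ours) evaluates the radial polynomial of Eq. (2) and the basis function of Eq. (1) at arrays of polar samples:

    # Sketch: Zernike radial polynomial R_nm (Eq. 2) and basis V_nm (Eq. 1).
    from math import factorial
    import numpy as np

    def radial_poly(n, m, rho):
        """R_nm(rho); requires n - |m| even and |m| <= n."""
        m = abs(m)
        assert (n - m) % 2 == 0 and m <= n
        R = np.zeros_like(rho, dtype=float)
        for s in range((n - m) // 2 + 1):
            coeff = ((-1) ** s * factorial(n - s)
                     / (factorial(s)
                        * factorial((n + m) // 2 - s)
                        * factorial((n - m) // 2 - s)))
            R += coeff * rho ** (n - 2 * s)
        return R

    def zernike_basis(n, m, rho, theta):
        """V_nm(rho, theta) = R_nm(rho) * exp(j*m*theta), valid for rho <= 1."""
        return radial_poly(n, m, rho) * np.exp(1j * m * theta)

A moment $Z_{nm}$ of a normalized region is then obtained by correlating the region with $V^*_{nm}$ over the unit disk and scaling by $(n+1)/\pi$, as Eqs. (4) and (5) below state.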
The set of basis functions $\{V_{nm}(\rho,\theta)\}$ is orthogonal, i.e.

$$\int_0^{2\pi}\!\!\int_0^1 V^*_{nm}(\rho,\theta)\, V_{pq}(\rho,\theta)\, \rho\, d\rho\, d\theta = \frac{\pi}{n+1}\,\delta_{np}\,\delta_{mq}, \qquad \delta_{ab} = \begin{cases} 1 & a=b \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The two-dimensional ZMs of a continuous image function $f(\rho,\theta)$ are represented by

$$Z_{nm} = \frac{n+1}{\pi} \iint_{(\rho,\theta)\in \text{unit disk}} f(\rho,\theta)\, V^*_{nm}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}} \qquad (4)$$

For a digital image function the two-dimensional ZMs are given as

$$Z_{nm} = \frac{n+1}{\pi} \sum\sum_{(\rho,\theta)\in \text{unit disk}} f(\rho,\theta)\, V^*_{nm}(\rho,\theta) = |Z_{nm}|\, e^{i\phi_{nm}} \qquad (5)$$

Define a region descriptor $P$ based on the sorted ZMs as follows:

$$P = [\,|Z_{11}|e^{i\phi_{11}},\ |Z_{31}|e^{i\phi_{31}},\ \ldots,\ |Z_{n_{\max}m_{\max}}|e^{i\phi_{n_{\max}m_{\max}}}\,]^T \qquad (6)$$

where $|Z_{nm}|$ is the ZM magnitude and $\phi_{nm}$ is the ZM phase. The Zernike Moment is derived after integrating the normalized region against the Zernike basis function. In this paper, the ZMs with $m = 0$ are not included, and both the maximum order $n$ and the maximum repetition $m$ equal 12, resulting in a feature vector of length 42. In this way, two feature vectors represent a feature region: $\mathrm{mag} = [\,|Z_{1,1}|, |Z_{3,1}|, \ldots, |Z_{12,12}|\,]^T$ and $\mathrm{phase} = [\,\phi_{1,1}, \phi_{3,1}, \ldots, \phi_{12,12}\,]^T$.

Figure 3. Normalization of an elliptical region.

2.3. A similarity measure

Let $P_q = (\mathrm{mag}_q, \mathrm{phase}_q)$ and $P_d = (\mathrm{mag}_d, \mathrm{phase}_d)$ be two ZM feature vectors, where $\mathrm{mag}_q = [\,|Z^q_{1,1}|, |Z^q_{3,1}|, \ldots, |Z^q_{12,12}|\,]^T$, $\mathrm{phase}_q = [\,\phi^q_{1,1}, \phi^q_{3,1}, \ldots, \phi^q_{12,12}\,]^T$, and $\mathrm{mag}_d$, $\mathrm{phase}_d$ are defined analogously.

The similarity of magnitudes $S_{mag}(P_q, P_d)$ is defined as the cosine between the two vectors:

$$S_{mag}(P_q, P_d) = \frac{\mathrm{mag}_q \cdot \mathrm{mag}_d}{\|\mathrm{mag}_q\|\,\|\mathrm{mag}_d\|} \qquad (7)$$

The value ranges between 0 and 1, and a higher value indicates that the two vectors are more similar. This is equivalent to the Euclidean distance between the two normalized unit vectors.

A similarity measure using the weighted ZM phase differences is expressed by

$$S_{phase}(P_q, P_d) = 1 - \frac{1}{\pi}\sum_n\sum_m w_{nm}\, \min\{\, |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi|,\ 2\pi - |\Phi_{nm} - (m\hat{\alpha}) \bmod 2\pi| \,\} \qquad (8)$$

where $w_{nm} = (|Z^q_{nm}| + |Z^d_{nm}|)\,/\,\sum_{n,m}(|Z^q_{nm}| + |Z^d_{nm}|)$ and $\Phi_{nm} = (\phi^q_{nm} - \phi^d_{nm}) \bmod 2\pi$ is the actual phase difference. The rotation angle $\hat{\alpha}$ is determined by an iterative computation of $\hat{\alpha}_m = (\Phi_{nm} - \hat{\alpha}_{m-1}) \bmod 2\pi$, with the initial value $\hat{\alpha}_0 = 0$, using the entire information of the Zernike moments sorted by $m$. The value range of $S_{phase}(P_q, P_d)$ is the interval [0, 1], and a higher value indicates that the two vectors are more similar.

3. EFFICIENT BUILDING IMAGE DATABASE CONSTRUCTION

In building image retrieval applications, the scale of the database is typically large, with a considerable number of visual descriptors. In order to index and search rapidly, effective approaches for storing appropriate descriptors are proposed for constructing a large-scale building image database.

3.1. Feature selection from multiple images

Building databases in image retrieval applications normally contain multiple views of a single building. For example, the ZuBud dataset collects five images for each building in the database. We refine the detected MSER feature regions by verifying consistency between multiple images of a building captured from distinct viewpoints. The basic idea of the selection is to keep representative feature regions and remove discrepant features as outliers. Feature region selection reduces the storage space of feature descriptors in a database. Furthermore, this method remarkably improves the efficiency and accuracy of the image retrieval process.
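Since both the selection step that follows and the retrieval step of Section 4 rely on the two similarity measures of Section 2.3, a small numerical sketch of them may be useful. This is an assumption-laden illustration, not the paper's code: the phase similarity below takes the rotation angle alpha as an input instead of estimating it iteratively as Eq. (8) prescribes.

    # Sketch: magnitude similarity of Eq. (7) and a simplified phase
    # similarity in the spirit of Eq. (8), with alpha supplied by the caller.
    import numpy as np

    def s_mag(mag_q, mag_d):
        """Cosine similarity between two ZM magnitude vectors, Eq. (7)."""
        return float(np.dot(mag_q, mag_d)
                     / (np.linalg.norm(mag_q) * np.linalg.norm(mag_d)))

    def s_phase(mag_q, phase_q, mag_d, phase_d, m_list, alpha):
        """m_list[i] is the repetition m of the i-th entry; the weights are
        proportional to |Z_q| + |Z_d| and sum to one."""
        w = np.abs(mag_q) + np.abs(mag_d)
        w = w / w.sum()
        phi = np.mod(phase_q - phase_d, 2 * np.pi)      # actual phase difference
        d = np.mod(np.abs(phi - np.asarray(m_list) * alpha), 2 * np.pi)
        d = np.minimum(d, 2 * np.pi - d)                # circular distance in [0, pi]
        return 1.0 - float(np.sum(w * d) / np.pi)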
Figure 4. (a)-(c) Three images in a group of building images before feature selection. (d)-(f) The same three images after feature selection.

The discrepant feature regions come from non-building areas, such as trees, bicycles, and pedestrians, as shown in Fig. 4. Feature regions of non-building areas are not stable compared with regions on the building itself. Therefore, excluding these feature regions from the database is necessary to ensure uniform results.

This paper presents a method to select feature regions automatically by measuring similarity between multiple images of a building. The algorithm for feature region selection is given in Fig. 5; only feature regions that are similar across the views are preserved. Two regions are considered similar if $S_{mag}(P_q, P_d) > 0.7$ and $S_{phase}(P_q, P_d) > 0.7$. A comparison of feature regions before and after selection is shown in Fig. 4. Unstable feature regions in Figs. 4(a)-4(c), such as trees and pedestrians, are removed by the proposed algorithm; the results of the selection are shown in Figs. 4(d)-4(f).

    Input: A group of feature regions in multi-view images.
    Output: Selected feature regions.
    For each feature region
        If there are at least two similar regions in other views
            Preserve the feature region;
        Else
            Delete the feature region;

Figure 5. Feature region selection algorithm.

3.2. Feature clustering

After removing non-building feature regions, most of the remaining feature regions belong to the buildings. However, repeated patterns, e.g. windows and doors, are common in a building image. In order to reduce the storage space of the repeated feature descriptors in a database, clustering similar features into a representative feature descriptor is necessary.

In conventional clustering algorithms, e.g. the k-means and k-medoid algorithms, each cluster is represented by the gravity center or by one of the objects of the cluster located near its center. However, determining the number of clusters k is not straightforward. Moreover, the ability to distinguish different features is reduced because isolated feature regions are forced to merge into a nearby cluster that may have dissimilar region appearance. Consequently, the Density-Based Spatial Clustering algorithm (DBSCAN) [15] is used for clustering.

The DBSCAN algorithm relies on a density-based notion of clusters. Two input parameters, ε and MinPts, determine the clustering conditions in two steps. The first step chooses an arbitrary point from the database as a seed. The second step retrieves all points reachable from the seed. The parameter ε defines the size of the neighborhood, and for each point to be included in a cluster there must be at least a minimum number (MinPts) of points in an ε-neighborhood of a cluster point.
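A minimal sketch of this clustering step, assuming scikit-learn is available, is given below; it clusters the 42-dimensional ZM magnitude vectors of one building group with DBSCAN, replaces each cluster by its mean vector, and keeps isolated (noise) points, mirroring the representative-descriptor step described at the start of the next subsection. The function name and the choice of library are ours, not the paper's.

    # Sketch: DBSCAN clustering of selected ZM magnitude vectors, one group.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def representative_descriptors(mag_vectors, eps, min_pts):
        X = np.asarray(mag_vectors)                       # shape (N, 42)
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        reps = [X[labels == lab].mean(axis=0)             # mean per cluster
                for lab in set(labels) if lab != -1]
        reps.extend(X[labels == -1])                      # keep isolated points
        return np.vstack(reps) if reps else X[:0]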
Figure 6. (a)-(e) Feature regions in the same cluster. (f)-(j) Another cluster of feature regions after DBSCAN.

The input to the DBSCAN algorithm is the set of 42-dimensional selected ZM magnitude vectors of all images belonging to the same group or building. We calculate the mean of the feature vectors in a cluster as its representative, while preserving the isolated feature points. The elliptical regions in Figs. 6(a)-6(e) are feature vectors in the same cluster and are replaced by a representative feature vector. Figs. 6(f)-6(j) show another feature cluster in the same group of multi-view images.

4. IMAGE INDEXING AND RETRIEVAL

4.1. Descriptor indexing with a KD-tree

After the feature selection and clustering processes described above, all extracted building regions are indexed by a KD-tree according to their ZM magnitude vectors. The goal is to build an indexing structure so that the nearest neighbors of a query vector can be searched rapidly.

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space into two parts by a hyperplane perpendicular to a coordinate axis. The binary space partition is recursively executed until every leaf node contains a single data point. The algorithm for constructing a KD-tree is given in Fig. 7; it is initialized with dim = 1 and with Dataset set to the N database points.

    Input: N feature vectors in k dimensions
    Output: A KD-tree in which every leaf node contains a single feature vector
    kd_tree_build (Dataset, dim) {
        If Dataset contains only one point
            Mark a leaf node containing the point;
            Return;
        else
            1. Sort all points in Dataset according to feature dimension dim;
            2. Determine the median value of feature dimension dim in Dataset,
               make a new node and save the median value;
            3. Dataset_bigger  = the points in Dataset with dim >= median value;
            4. Dataset_smaller = the points in Dataset with dim <  median value;
            5. Set Dataset_bigger as the new node's right child and
               Dataset_smaller as the new node's left child;
            6. call kd_tree_build (Dataset_bigger,  (dim+1) % k);
            7. call kd_tree_build (Dataset_smaller, (dim+1) % k);
    }

Figure 7. The KD-tree construction algorithm.

4.2. Query by region vote counting

After establishing a KD-tree for organizing the ZM magnitude feature vectors in the database, the KD-tree is descended to find the leaf node into which the query point falls. After obtaining the first candidate nearest neighbor, we verify with the ZM phase feature vector whether the candidate point is qualified. In our experiments, two vectors are qualified as similar when their distance is as small as possible and their magnitude and phase similarity measures satisfy $S_{mag}(P_q, P_d) > 0.85$ in equation (7) and $S_{phase}(P_q, P_d) > 0.85$ in equation (8). Then, based on the current minimum distance between the query point and the single database point in the leaf node, the KD-tree is revisited to search for the next available neighbor within the current minimum distance. The tree backtracking is repeated until no further reduction of the minimum distance to the query point is found.

For each extracted region in the query building image, one vote is cast for the database building image that owns the region claimed as the nearest neighbor of the query region. After all extracted regions of the query image have voted, we count the number of votes each database image receives. The database image with the maximum votes is returned as the most similar building to the query.
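The indexing and vote-counting query can be sketched as follows. This is our own illustration, not the paper's implementation: it uses SciPy's KD-tree instead of the hand-built one of Fig. 7, and the magnitude/phase verification with the 0.85 thresholds is abstracted into a `qualified` callback that the caller supplies.

    # Sketch: KD-tree indexing of database ZM magnitude vectors and the
    # vote-counting query of Section 4.2.
    import numpy as np
    from collections import Counter
    from scipy.spatial import cKDTree

    def build_index(db_mags, db_image_ids):
        """db_mags: (N, 42) array; db_image_ids: length-N list of image ids."""
        return cKDTree(np.asarray(db_mags)), list(db_image_ids)

    def query_image(tree, db_image_ids, query_regions, qualified):
        """query_regions: list of (mag, phase) pairs from the query image."""
        votes = Counter()
        for mag, phase in query_regions:
            _, idx = tree.query(mag, k=1)        # nearest neighbor with backtracking
            if qualified(mag, phase, idx):        # S_mag > 0.85 and S_phase > 0.85
                votes[db_image_ids[idx]] += 1     # one vote per query region
        return votes.most_common(1)[0][0] if votes else None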
Figure 8. Examples of the database images in the NCTU-Bud dataset (rows: EC Building; ED Building Face 1, first side view; ED Building Face 2, second side view; columns: Views 1-5).

Figure 9. Examples of query images for the NCTU-Bud dataset (Classes A-C: correct, over, and under exposure; Classes D-F: correct, over, and under exposure with occlusion; captured on sunny and cloudy days).

5. EXPERIMENTAL RESULTS

In our experiments, the proposed algorithm is written in Matlab under the Windows environment and evaluated on a platform with a 2.83 GHz processor and 3 GB of RAM. We test the proposed indexing and retrieval system on two sets of building images: the NCTU-Bud dataset created by ourselves and the publicly available ZuBud dataset [11].

5.1. The NCTU-Bud Dataset

To evaluate the proposed approach and to establish a benchmark for future work, we introduce the NCTU-Bud dataset. The dataset contains high-resolution images of 22 buildings on the NCTU campus. For each building in the database we capture at least one facet of the building from five different viewing directions. All database images have a resolution of 1600x1200 pixels. The database contains a total of 190 building images. Some representative database images are shown in Fig. 8.

The query images are captured with a different camera at a resolution of 2352x1568 under two weather conditions, sunny and cloudy. For each weather condition, six images are collected, each with a different exposure setting and a different occlusion condition. In total, 12 classes of images constitute the query dataset, as shown in Fig. 9. Furthermore, five additional camera poses, with different rotations, focal lengths, and translations, are recorded for further testing. A total of 2280 query images is gathered.

5.2. Experimental results for the NCTU-Bud dataset

Table I shows the total number of different region feature vectors collected in the database and the recognition rate for the query images captured with normal exposure on cloudy days. From this table, feature selection using multiple images alone does not raise the query accuracy rate. However, we achieve 100% accuracy after applying both the feature selection and the DBSCAN clustering. In this case not only is the region storage space reduced, but also only the representative feature vectors are stored for the query search. Consequently, the image retrieval accuracy is raised to 100%.

The storage size (the number of nodes) is determined by the number of region feature vectors found in all images of the database. Approximately 50% of the space is
saved by applying feature selection and the DBSCAN clustering method.

The time for feature region detection and description depends on the resolution and the content of an image. If the scene of an image is complex, the number of extremal regions detected by MSER increases and the processing time increases as well. Table II shows the average processing time of feature detection and descriptor computation over 92 different images at different resolutions.

With the feature selection and DBSCAN clustering method, the average time for indexing the database is 22.4 seconds, and the average query time for an image at a resolution of 2352x1568 pixels is 40 seconds. The image query time comprises the time for feature region detection (MSER), descriptor computation (ZM), and the search for the nearest neighbor in the database.

Table III shows the query accuracy rate for the 12 different classes of images. Each class consists of 190 query images. The accuracy rate on cloudy days is generally higher than that on sunny days. The reason may be that strong shadows are cast by occluding objects on sunny days. In addition, the over-exposed images are harder to recognize than images under the other exposure conditions.

Comparing classes D-F with classes A-C, we find that the proposed method also performs well under occlusion. This shows that the proposed system is able to distinguish feature regions even when buildings are partially occluded.

5.3. Experimental results for the ZuBud dataset

The ZuBud dataset contains images of 201 different buildings taken in Zurich, Switzerland. Five different images are taken of each building. Fig. 10 shows some example images. The dataset also provides 115 query images, which are taken with a different camera under different weather conditions.

In the experiments on the ZuBud dataset, the query accuracy rate with feature selection and DBSCAN clustering is over 95%. The average query time is 3.1 seconds with a variation of 1.16 seconds. These results show that our system still performs well on this publicly available dataset.

TABLE I. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE NCTU-BUD DATASET

                               Without feature   With feature   With feature selection
                               selection         selection      and DBSCAN
    # region feature vectors   488,527           264,311        256,261
    Memory size of KD-tree     22 MB             12.9 MB        10.6 MB
    Query accuracy rate        94.7%             94.7%          100%

TABLE II. AVERAGE PROCESSING TIME OF FEATURE DETECTION AND DESCRIPTOR COMPUTATION AT DIFFERENT RESOLUTIONS.

    Resolution                          2352x1568    1600x1200    640x480
    Avg. / std. processing time (s)     13.8 / 4.3   5.8 / 1.58   1.8 / 0.7

TABLE III. QUERY ACCURACY RATE OF THE NCTU-BUD DATASET UNDER DIFFERENT WEATHER CONDITIONS.

                                              Sunny day   Cloudy day
    Class A  Correct exposure                 93.6%       100%
    Class B  Over exposure                    92.1%       92.1%
    Class C  Under exposure                   93.1%       96.3%
    Class D  Correct exposure, occlusion      93.6%       96.3%
    Class E  Over exposure, occlusion         92.1%       94.2%
    Class F  Under exposure, occlusion        92.6%       96.8%

Figure 10. Example images from the ZuBud dataset (query image and corresponding database image).

TABLE IV. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE ZUBUD DATASET

                               Without feature   With feature      With feature selection
                               selection         selection only    and DBSCAN
    # region feature vectors   113,194           68,036            56,089
    Recognition accuracy       89.57%            94.8%             95.6%
6. CONCLUSION

In this paper, we have presented a novel image retrieval system based on the MSER detector and the ZM descriptor, which is robust against geometric and photometric transformations. Experimental results illustrate that the KD-tree indexing and retrieval system with the magnitude and phase ZM feature vectors achieves a high query accuracy rate. The accuracy rates on our NCTU-Bud dataset and on the ZuBud dataset are 100% and 95%, respectively.

The success of our system is attributed to:
(a) Selecting MSER feature vectors using multiple images of the same building captured from different viewpoints, which removes the unreliable regions.
(b) The DBSCAN clustering technique, which groups similar feature vectors into a representative feature descriptor to tackle the problem of repeated feature patterns in the image.

In the future, we will consider optimizing the programs and porting them to mobile phones for mobile device applications. Furthermore, the query results may be verified using multi-view geometry constraints to eliminate outliers and lower the mis-recognition rate.

REFERENCES
[1] J. Wang, G. Wiederhold, O. Firschein, and S. Wei, "Content-Based Image Indexing and Searching Using Daubechies' Wavelets," Int'l J. Digital Libraries, vol. 1, pp. 311-328, 1998.
[2] Z. Chen and S. K. Sun, "A Zernike moment phase-based descriptor for local image representation and matching," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 205-219, 2009.
[3] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, pp. 43-72, 2005.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[6] F. Jing and M. Li, "An Efficient and Effective Region-Based Image Retrieval Framework," IEEE Trans. Image Processing, vol. 13, no. 5, May 2004.
[7] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan, "Categorizing nine visual classes using local appearance descriptors," ICPR Workshop on Learning for Adaptable Visual Systems, 2004.
[8] W. Wu and J. Yang, "Object Fingerprints for Content Analysis with Applications to Street Landmark Localization," Proc. ACM International Conference on Multimedia, 2008.
[9] S. Obdrzalek and J. Matas, "Image Retrieval Using Local Compact DCT-Based Representation," Pattern Recognition, 25th DAGM Symposium, vol. 2781 of Lecture Notes in Computer Science, Magdeburg, Germany: Springer Verlag, pp. 490-497, 2003.
[10] W. Zhang and J. Kosecka, "Hierarchical building recognition," Image and Vision Computing, 2007.
[11] H. Shao, T. Svoboda, and L. V. Gool, "ZuBuD - Zurich Buildings Database for Image Based Recognition," Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, 2003.
[12] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, pp. 761-767, 2004.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, and J. Matas, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, no. 1/2, pp. 43-72, 2005.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996.
Using Modified View-Based AAM to Reconstruct the Frontal Facial Image with Expression from Different Head Orientation

1 Po-Tsang Li (李柏蒼), 1 Sheng-Yu Wang (王勝毓), 1,2 Chung-Lin Huang (黃仲陵)
1 Dept. of Electrical Engineering, National Tsing Hua University, Hsin-Chu, Taiwan.
2 Dept. of Informatics, Fo-Guang University, I-Lan, Taiwan.
E-mail: [email protected]

Abstract - This paper develops a method to solve the unpredictable head orientation problem in 2D facial analysis. We extend the expression subspace of the view-based Active Appearance Model (AAM) so that it can be applied to multi-view face fitting and pose correction for facial images with any expression. Our multi-view model-based facial image fitting system can be applied to a 2D face image (with expression variation) at any pose, and the facial image in any view can be reconstructed in another view. We divide the facial image into an expression component and an identity component to increase the face identification accuracy. The experimental results demonstrate that the proposed algorithm can be applied to improve the facial identification process. We test our system on video sequences with a frame size of 320*240 pixels; it requires 30~45 ms to fit a face and 0.35~0.45 ms for warping.

Keywords - View-based AAM; facial expression

1. Introduction

Facial image analysis consists of face detection, facial feature extraction, face identification, and facial expression recognition. Currently, 2D face recognition technology is well developed, with high recognition accuracy. However, unpredictable head orientation often causes a big problem for 2D facial analysis. Most of the previous facial identity identification or facial expression recognition methods are limited to the frontal face and the profile face. They work only for faces in a single view with ±15 degrees of variation.

The best-known 3D model is the 3D Morphable Model (3DMM) proposed by Blanz and Vetter [11]. 3DMM and AAM are similar: they are model-based approaches consisting of a shape model and a texture model, and both use Principal Component Analysis (PCA) for dimension reduction. The two major differences between them are (1) the optimization algorithm used in fitting, and (2) the feature points in the shape model of 3DMM are 3D feature points, whereas in AAM they are 2D locations. In data collection, an AAM can be developed using 2D facial images, whereas 3DMM captures the depth information using a 3D face laser scanner. 3DMM can accurately reconstruct a 3D human face; however, it needs so much computation that its applications have been limited to academic research.

Blanz et al. [12] apply 3DMM to human identity recognition; however, the fitting process takes 4.5 minutes per frame on a workstation with a 2 GHz Pentium 4 processor. For facial expression recognition, the problem becomes more obvious. Due to the insufficiency of 3D face expression data, one can only rely on a single-expression (neutral) 3D face model for facial identity recognition. However, as more 3D face expression databases become available, researchers such as Wang et al. [16], Amor et al. [17], and Kakadiaris et al. [18] have developed methods to identify the human face in different views and with different expressions. However, because the facial expressions are complicated {surprise, sadness, happiness, disgust, anger, fear}, 3-D models for the different facial expressions are impractical. Lu et al. [20] only record the variations of the landmark points and then apply Thin-Plate-Spline warping to synthesize facial images of other expressions for fitting the face expression image. Chang et al. [15] also divide the training data into an identity space and an expression space, and use bilinear interpolation to synthesize human faces with other expressions. Ramanathan et al. [19] propose a method using 3DMM for facial expression recognition.

To capture 3D face information, we may use either a 3D laser scanner or multi-view 2D images. Recently, the 2D+3D active appearance model (AAM) method has been proposed by Xiao et al. [21], Koterba et al. [22], and Sung et al. [23]. Based on the known projection matrix of a certain view, the so-called 2D+3D AAM method trains a 2D AAM for a single view for later tracking and fitting of the landmark points in 2D images; it then uses the corresponding points to calculate the 3D positions of the landmark points. Xiao et al. [21] use only 900 image frames from a single camera to develop the 3D AAM model. Because of the precision error of 2D AAM landmark tracking, Lucey et al. [24] point out that the feature points tracked by the 2D+3D AAM are worse than the normalized shape obtained by 2D AAM fitting. Their argument is that the 2D+3D AAM cannot obtain the depth information precisely, which causes recognition errors.

In this paper, we apply the view-based AAM proposed by Cootes et al. [4] for model fitting of an input face with any
expression in any view angle; it can then be warped to regenerate the face at any target viewing angle. The view-based AAM consists of several 2D AAMs and can be further divided into an inter model and an intra model. The inter model describes the parameter transformation between any two 2D AAMs, whereas the intra model describes the relationship between the model parameters and the viewing angle for a single 2D AAM. The view-based AAM is generated by an off-line training process. Besides the identity subspace, this paper adds the expression subspace to the inter model so that the view-based AAM can be applied to multi-view face fitting and pose correction for an input face with any expression.

The flow diagram is shown in Fig. 1. For an input face image, based on the intra model, we find the relationship between the parameters and the viewing angle and then remove the angle effect from the parameters. We then divide the angle-independent model parameters into identity parameters and expression parameters, which can be transformed to the target 2D AAM model by using the inter model. Finally, based on the intra model, we add the influence of the angle parameters back onto the model parameters and synthesize the facial image at the target viewing angle.

Figure 1. The flowchart of our system (input image → facial region detection → pose classification → modified view-based AAM fitting using the i-th AAM → selection of the target model for target orientation θ → rotate model i → j → reconstruction with the j-th AAM → new view at angle θ).

2. Active Appearance Model

In the modified view-based AAM, the 2D AAMs play a crucial part. This chapter introduces the overall structure of the 2D AAM and the flow of the training and fitting algorithms. The major goal of the AAM, first proposed by Cootes et al. [2], is to find the model parameters that reduce the difference between the synthesized image (generated by the AAM model) and the target image. Based on the parameters and the AAM model, we may synthesize the corresponding face image.

2.1 Statistical Appearance Models

A statistical appearance model consists of two parts: the shape model, describing the shape of the object, and the texture model, describing the gray-level information of the object. It uses labeled face images to train the AAM. To train the AAM model, we must have a set of facial images annotated with landmark points. These landmark points are selected as salient points on the face that are identifiable on any human face. Figure 2 shows some annotated training face images from the data set.

Figure 2. Examples of the training set.

The number of landmark points is determined experimentally. More landmark points increase the accuracy of the model; however, they also increase the computation of the model fitting process. The distribution of landmark points depends on the characteristics of the face, such as the eyebrows, eyes, nose, and mouth. In these regions we need to put more landmark points, whereas in the other regions (such as the ears, the forehead, or other non-visible areas) we put no landmarks.

2.2 Shape Model

Here, we use triangular meshes to compose the human face. We define a shape $s_i$ as a vector containing the coordinates of $N_s$ landmark points in a face image $I_i$:

$$s_i = (x_1, y_1, x_2, y_2, \ldots, x_{N_s}, y_{N_s})^T \qquad (1)$$

The model is constructed from the coordinates of the labeled points of the training images. We align the locations of the corresponding points on different training faces by using Procrustes analysis as normalization. Given a set of aligned shapes, we then apply principal component analysis (PCA) to all the data. Any shape example can then be approximated by

$$s = \bar{s} + P_s b_s \qquad (2)$$

where $\bar{s}$ is the mean of all aligned shapes, calculated as $\bar{s} = \sum_{i=1}^{N} s_i / N$, $P_s = (p_{s1}, p_{s2}, \ldots, p_{st})$ is the matrix of the first $t$ eigenvectors, and $b_s$ is a set of shape parameters. $p_{st}$ is the $t$-th eigenvector of the shape covariance matrix. Figure 3 shows the effects of varying the first two shape model parameters by ±2 standard deviations.

Figure 3. First two modes of shape variation (±2 sd).
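A minimal sketch of this shape model, assuming the training shapes have already been Procrustes-aligned, is given below; the function names are ours, and the eigenvectors are obtained through an SVD of the centered data matrix rather than by forming the covariance matrix explicitly.

    # Sketch: linear shape model of Eq. (2) built by PCA on aligned shapes.
    import numpy as np

    def train_shape_model(aligned_shapes, t):
        """aligned_shapes: (N, 2*Ns) array, one row per training shape."""
        s_bar = aligned_shapes.mean(axis=0)
        X = aligned_shapes - s_bar
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Ps = Vt[:t].T                         # columns are ps1 ... pst
        return s_bar, Ps

    def approximate_shape(s, s_bar, Ps):
        """bs = Ps^T (s - s_bar);  s is approximated by s_bar + Ps bs."""
        bs = Ps.T @ (s - s_bar)
        return s_bar + Ps @ bs, bs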
2.3 Texture Model

The texture of the AAM is defined as the gray-level information at the pixels x = (x, y) that lie inside the mean shape $\bar{s}$. First, we align the control points and the mean shape $\bar{s}$ of every training face image by using affine warping. Then we sample the gray-level texture information $g_{im}$ of the warped images inside the mean shape region. Before applying PCA to the texture data, we first normalize $g_{im}$, to minimize the effect of lighting variation, by applying a scaling α and an offset β:

$$g = (g_{im} - \beta \cdot \mathbf{1}) / \alpha \qquad (3)$$

where $\mathbf{1}$ is a vector of ones. Let $\bar{g}$ be the mean of the normalized texture data, scaled and offset so that its sum is zero and its variance is unity. α and β are selected to normalize $g_{im}$ as

$$\beta = (g_{im} \cdot \mathbf{1}) / n \quad \text{and} \quad \alpha = g_{im} \cdot \bar{g} \qquad (4)$$

where n is the number of pixels in the mean shape. We iteratively use Equations (3) and (4) to estimate $\bar{g}$ until the estimation stabilizes. Then we apply PCA to the normalized texture data so that a texture example can be expressed as

$$g = \bar{g} + P_g b_g \qquad (5)$$

where $P_g$ is the matrix of eigenvectors and $b_g$ is the vector of texture parameters. Figure 4 shows the effects of varying the first two texture model parameters by ±2 standard deviations.

Figure 4. First two modes of texture variation (±2 sd).

2.4 Appearance Model

The shape and texture of any example in the training set can be summarized by $b_s$ and $b_g$. The appearance model combines the two parameter vectors into a single parameter vector $b_c$:

$$b_c = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (s - \bar{s}) \\ P_g^T (g - \bar{g}) \end{pmatrix} \qquad (6)$$

where $W_s$ is a diagonal matrix of weights for each shape parameter. A further PCA is applied to remove the possible correlations between the shape and texture variations:

$$b_c = Q_c c \qquad (7)$$

where $Q_c$ is the matrix of eigenvectors and c is the appearance parameter. Given an appearance parameter c, we can synthesize a face image by generating the gray levels g in the interior of the mean shape and warping the texture from the mean shape $\bar{s}$ to the model shape s, using

$$s = \bar{s} + P_s W_s^{-1} Q_s c, \qquad g = \bar{g} + P_g Q_g c \qquad (8)$$

where $Q_c = (Q_s, Q_g)^T$. Figure 5 shows the effects of varying the first two appearance model parameters by ±2 standard deviations.

Figure 5. First two modes of appearance variation (±2 sd).

2.5 Shape parameter weight

The shape parameters $b_s$ have units of distance and the texture parameters $b_g$ have units of intensity. Because they are of different nature and relevance, they cannot be compared directly. To estimate the correct value of $W_s$, we systematically displace each element of $b_s$ from its best-match value for every example in the training set and sample the corresponding texture difference. In addition, the active appearance model has a pose parameter vector describing the similarity transformation of the shape. The pose parameter vector t has four elements, $t = (k_x, k_y, t_x, t_y)^T$, where $(t_x, t_y)$ is the translation and $(k_x, k_y)$ represent the scaling k and the in-plane rotation angle θ: $k_x = k(\cos\theta - 1)$ and $k_y = k\sin\theta$.

2.6 Active Appearance Model Search

Here we introduce the kernel of the AAM. The ultimate goal of applying the AAM is, given an input facial image, to find the model parameters that, applied to the AAM model, synthesize an image similar to the input image. Given a new image, we have an initial estimate of the appearance parameter c and of the position, orientation, and scaling in the image. We need to minimize the difference E:

$$E = \| g_{image} - g_{model} \| \qquad (9)$$

where, based on the pre-estimated c, $g_{model} = \bar{g} + P_g Q_g c$ and $s_{model} = \bar{s} + P_s W_s^{-1} Q_s c$, and $g_{image}$ denotes the texture of the target image obtained by warping the image with $s_{model}$ onto the mean shape $\bar{s}$ and sampling the pixel intensities of the region. An algorithm is needed to adjust the parameters so that the input image and the image generated by the model match as closely as possible. Many optimization algorithms have been proposed for this parameter search. In this paper, we apply the so-called AAM-API method [8]. Rewrite (9) as

$$E(p) = \| g_{image} - g_{model} \| \qquad (10)$$

where p is the parameter vector of the model, $p = (c^T \mid t^T \mid u^T)^T$ with $u = (\alpha, \beta)^T$. A Taylor expansion of (10) gives

$$E(p + \nabla p) = E(p) + \frac{\partial E}{\partial p} \nabla p \qquad (11)$$

where the ij-th element of the matrix $\partial E / \partial p$ is $\partial E_i / \partial p_j$. Suppose E is our current matching error. We want to find $\nabla p$ that minimizes $\| E(p + \nabla p) \|^2$. By equating Equation (11) to zero, we obtain the RMS solution

$$\nabla p = -A\, E(p), \qquad A = \left( \frac{\partial E}{\partial p}^T \frac{\partial E}{\partial p} \right)^{-1} \frac{\partial E}{\partial p}^T$$

If we applied a conventional optimization process, we would need to recalculate $\partial E / \partial p$ after every match, which requires heavy computation. To simplify the optimization, Cootes et al. assume that A is approximately constant and that the relationship between E and $\nabla p$ is linear.
    Therefor we systemat re, tically displace the parameter from e r 3.1 Training Da 1 ata the optimal v value on the example image and record the d corresponding effect of texture dif fference. App plying multi-variance linear regressio on the displa on acements ∇p an the nd Sin we do not h nce have a large multiple expressio and multi-vie on ew corresponding difference textu E to find A. Therefore, we need ure e faci image datab ial base of for 2D AAM training process, we ne eed not recalculate matrix A, wh e hich can be computed off-lin and ne to o ning data by using six camer to capture t obtain the train ras the stored in the m memory for ref ference afterward. When we want e muultiple expressio and multi-v on view facial ima database. W age We to match a imag on-line, the step of procedu is as follow: ge ure : hav obtained mu ve ultiple expressio and multi-v on view facial ima age from 13 people (i.e., neutral, surprised, hap m ppiness, sadne ess, disggust, anger, fea There are t ar). totally 510 fac images in t cial the Initial estimate parameters p I e trai ining data set. F 6 shows som of the trainin samples of o Fig me ng our 1. Calculate the model shape smodel. and mode texture gmodel. e el trai ining data. 2. Warping the current image and sample tex xture gimage. ture E= gimage − gmodel. 3. Evaluate the difference text 4. update the mmodel paramete p→p+k∇p, ∇p=−AE(p), initial ers , k=1. 3.2 Intra-Model Rotate 2 5. Calculate the new model sh e hape smodel. and m model texture gmodel. 6. Sample the iimage from new shape gimage. w Coo et al. [4] s otes suggest that the model parame e eters c are relat ted 7. Calculate the new error E e to t view angle θ as the 2 8. if E ' E 2 , then accept the new estim mate ; otherwis try se, c = c0 + cc cos(θ ) + c s sin θ ) n( (12) k=0.5, k=0.225. whe c0, cc, and cs are vectors learned from training data. W ere t We The iteration o the preceding steps stop wh the E 2 ca not of g hen an can find the opt n timal value of parameters ci of the traini f ing be reduced, an we may as nd ssume that the iterative algo e orithm exaample and its c corresponding view angle θi. Cootes’ meth hod converge. doe not θi prec es cisely, it allow ±10 degree errors. In o ws e our expperiment, we fifixed the camer so that the viewing angle is ra knoown beforehan However, i creates an under-determina nd. it u ant pro oblem. We use facial images f from two views to generate oneo AAAM. There are only two inpu that can be used to estima uts ate (a) thre unknowns. So we rando ee omly increase θi by ±1. It is reassonable becaus the error mad during the im se de mage capturing is g unaavoidable, such as the human s subject slightly movement of h y his bod or head. Usin this method to add more in dy ng nput data, we may m esti imate c0, cc, an cs by applyin multiple lin nd ng near regression on quations of cs an (1, cos(θ ), sin(θ )) Τ . the relationship eq nd (b) Given an fa acial image, to f find the best fit tting parameter cj, we may use Equations (13) and (14) to estim d mate the viewi ing ang θj as gle ( x j , y j ) Τ = Rc−1 ( c j − c0 ) (13) (c) -1 Figure 6. Exxamples from th training set f the models. (a) he for whe Rc is the le pseudo-inve of Rc−1 (cc | cs ) = Ι 2 . ere eft erse Right profile F Face, 90° and75 (b) Right Ha Face 60° and 45°, 5°; alf d θ j = ta −1 ( y j / x j ) an (14) (c) Frontal Face 0° and -15°. Fig 7 shows the predicted angle compared with the actual ang g p h gle for the training set for each mode The results are worse than t t el. a the 3. 
Modified View-Based AAM d resu from Coote et al [4]. It is due to that ou model contai ults es ur ins muultiple expressio facial image data. on Cootes et al. [4 propose View 4] w-based AAM, based on sever 2D ral AAM for 3D M Model fitting 2 image. The model-based fitting 2D f 150 for model para ameter estimati can be div ion vided as intra-m model P red icted A n g le(d eg ree) 100 and inter-mode His method h been succes e. has ssfully applied to the 50 human face w without expressi ion. However, they have prob blems 0 fitting the face with expressio It is becaus in the human face e on. se n -40 -20 0 20 2 40 60 80 100 -50 parameter spac the expressi difference between the ch ce, ion b hanges for intra-person is much bigge than the cha n er anges in inter-peerson. -100 Actua Angle(degree) al The original liinear transformmation between the view-angl andle AAM paramete is no longer valid. Here w propose a m ers we method (a) ) b) (b to project the facial space to identity subsp o pace and expre ession subspace to sol the problem Here we divid the viewing angle lve m. de in five ranges: [-90, -75], [-60 -45], [-15, 15 [45, 60], [75 90] 0, 5], 5, gure 7. Predic angle vs ac Fig cted ctual angle across training set ( (a) from leftward to rightward. S Since the human face is symm n metric, resu of our data. (b) Cootes’ exp ult perimental resul at ‘view-bas lts sed in the experim ments, we only develop the 2D AAM for three y acti appearance mode’. ive different angles [-15, 0], [45, 60], and [75, 9 s: 90]. Giv a new pers image, we apply AAM f ven son fitting to find t the 1102
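The regression of Eq. (12) and the angle prediction of Eqs. (13)-(14) can be sketched as follows. This is our own illustration under the assumption that the fitted parameter vectors and their (perturbed) angles are already available; the function names are not the paper's.

    # Sketch: fit c = c0 + cc*cos(theta) + cs*sin(theta) and predict the angle.
    import numpy as np

    def fit_view_regression(C, thetas):
        """C: (N, d) fitted AAM parameter vectors; thetas: (N,) angles in radians."""
        X = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
        coeffs, _, _, _ = np.linalg.lstsq(X, C, rcond=None)   # rows: c0, cc, cs
        c0, cc, cs = coeffs
        R_inv = np.linalg.pinv(np.column_stack([cc, cs]))      # left pseudo-inverse
        return c0, cc, cs, R_inv

    def predict_angle(c, c0, R_inv):
        x, y = R_inv @ (c - c0)            # Eq. (13)
        return np.arctan2(y, x)            # Eq. (14), robust form of atan(y/x)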
Given a new person's image, we apply AAM fitting to find the best model parameters and to estimate the head angle as well. Then we can remove the angle effect by using

$$c_{residual} = c_j - c_0 - c_c \cos\theta_j - c_s \sin\theta_j \qquad (15)$$

The model parameters are thereby separated into two parts: one part that describes the variation due to rotation, and another part that describes the remaining variations (e.g. the variation of identity, expression, and illumination). We can use these parameters to reconstruct the face at a new angle φ as

$$c(\phi) = c_{residual} + c_0 + c_c \cos\phi + c_s \sin\phi \qquad (16)$$

This method can only perform small-angle rotations based on a single 2D AAM. Cootes et al. [7] and Huisman [25] have shown that the intra-model pose correction can be applied to human face recognition.

3.3 Identity and Expression Subspaces

To perform a large-angle warping, we must transform the parameters between the 2D models, and we intend to find a simple transformation between two models. However, the parameters in (15) consist of an identity component and an expression component, which makes the transformation non-trivial. Cootes used two different methods to remove the expression and project into the identity subspace; the parameters are then reduced to the variation of identity, for which the transformation is linear.

Let r be the residual parameter after (15). We divide the training data into $r_{neutral}$ and $r_{exp}$, where exp ∈ {happiness, sadness, fear, anger, disgust, surprise}, to compute the expression and identity covariance matrices. The identity component of $r_{exp}$ is removed by

$$e_{exp} = r_{exp} - r_{neutral} \qquad (17)$$

where $e_{exp}$ is defined as the expression component. Figure 8 shows training examples of $e_{exp}$, and Figure 9 shows training examples of $r_{neutral}$. By applying PCA to $r_{neutral}$ and $e_{exp}$, we can find the projection $P_{neutral}$ onto an identity subspace and $P_{exp}$ onto an expression subspace:

$$e_{exp} = \bar{e}_{exp} + P_{exp} b_{exp} \qquad (18)$$

$$r_{neutral} = \bar{r}_{neutral} + P_{neutral} b_{neutral} \qquad (19)$$

Figure 8. Some examples from the expression component training set.

Figure 9. The neutral images for training the identity subspace.

Figure 10. The facial space relation of identity and expression.

Costen et al. [26] suggested that expression changes are orthogonal to the changes due to identity in this framework. For a new image with parameter r, the expression parameter $b_{exp}$ can be calculated by

$$b_{exp} = P_{exp}^T (r - \bar{e}_{exp}) \qquad (20)$$

Then we can compute $r_{neutral}$ using

$$r_{neutral} = r - \bar{e}_{exp} - P_{exp} b_{exp} \qquad (21)$$

and obtain the projection onto the identity subspace

$$b_{neutral} = P_{neutral}^T (r_{neutral} - \bar{r}_{neutral}) \qquad (22)$$

3.4 Inter-Model Rotation

We may now use multiple linear regression to relate $b^{i}_{exp}$ and $r^{i}_{neutral}$ in the i-th AAM model to $b^{j}_{exp}$ and $r^{j}_{neutral}$ in the j-th AAM model, i.e. to find the relationships $R^{ij}_{neutral}$ and $R^{ij}_{exp}$ such that

$$r^{j}_{neutral} = e^{ij}_{neutral} + R^{ij}_{neutral}\, r^{i}_{neutral} \qquad (23)$$

$$b^{j}_{exp} = e^{ij}_{exp} + R^{ij}_{exp}\, b^{i}_{exp} \qquad (24)$$

where $e^{ij}_{neutral}$ and $e^{ij}_{exp}$ are constant offsets.

3.5 Reconstructing a New View

Given a fitted match of a new person in one view, we can reconstruct another view by the following steps (as shown in Fig. 11; a code sketch follows this list):
1. Remove the effects of orientation (Eq. 15).
2. Project into the identity and expression subspaces of the model (Eqs. 20, 21, 22).
3. Project into the subspaces of the target model (Eqs. 23, 24).
4. Project back into the residual space and combine the two vectors into one vector (inverse of Eqs. 20, 21, 22).
5. Add the assigned orientation (Eq. 16).
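The following sketch strings the five steps together. It assumes the per-view models and the inter-model mapping are packaged as simple attribute bundles (e.g. SimpleNamespace objects) with the fields named in the comments; these names are ours, not the paper's.

    # Sketch: reconstructing a new view from a fitted parameter vector c.
    # src, dst: per-view models with c0, cc, cs, R_inv, e_bar_exp, P_exp,
    #           r_bar_neutral, P_neutral.  inter: mapping i -> j with e_n, R_n,
    #           e_e, R_e (Eqs. 23-24).  phi: target viewing angle.
    import numpy as np

    def reconstruct_new_view(c, src, dst, inter, phi):
        # 1. remove the orientation effect (Eqs. 13-15)
        xy = src.R_inv @ (c - src.c0)
        theta = np.arctan2(xy[1], xy[0])
        r = c - src.c0 - src.cc * np.cos(theta) - src.cs * np.sin(theta)
        # 2. project into the expression and identity subspaces (Eqs. 20-22)
        b_exp = src.P_exp.T @ (r - src.e_bar_exp)
        r_neutral = r - src.e_bar_exp - src.P_exp @ b_exp
        b_neutral = src.P_neutral.T @ (r_neutral - src.r_bar_neutral)  # Eq. (22)
        # 3. map both components into the target model (Eqs. 23-24)
        r_neutral_j = inter.e_n + inter.R_n @ r_neutral
        b_exp_j = inter.e_e + inter.R_e @ b_exp
        # 4. back to the residual space of the target model (inverse of Eqs. 20-21)
        r_j = r_neutral_j + dst.e_bar_exp + dst.P_exp @ b_exp_j
        # 5. add the assigned orientation phi (Eq. 16)
        return r_j + dst.c0 + dst.cc * np.cos(phi) + dst.cs * np.sin(phi)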
Figure 11. The flowchart of the Rotate Model.

4. Experimental Results

Here we illustrate the results of our methods. We use six cameras to capture the expressions of each person. There are 13 persons in our experiments, each with 5 or 6 different expressions, and we select 510 pictures as the training data for the multi-pose 2D AAMs. In the testing phase, we apply model fitting to all pictures, and about 90% of the testing pictures are fitted successfully. Because we do not have enough training data, we apply leave-one-out to train and test our rotation model algorithm. Besides warping the input face to the pre-trained pose, we also try warping the face to other poses and compare the result with video captured at that specific pose. Although our system allows us to fit the model face and then warp the face to any pose, for some views the warping results are not as good as for the others.

To compare the results of the rotated model, we warp the input face image from the right-half view to the frontal pose and compare it with the ground truth pre-stored in our database, as shown in Figure 12.

Figure 12. Result of warping the right-half view to the frontal view versus the ground truth.

Then we illustrate the experimental results of warping the right-side view to the frontal view and compare them with the ground truth, as shown in Figure 13. Apparently, the performance is not as good as in the previous case.
Figure 13. The experimental results of warping the right-side view to the frontal view.

We use the distance similarity measure $x_1 \cdot x_2 / (\|x_1\|\,\|x_2\|)$ to evaluate whether the warped image helps to increase the recognition rate, where $x_1$ represents the pre-stored frontal neutral face image database and $x_2$ represents the testing facial image with any expression and in any viewing direction.

The purpose of warping the non-frontal face to the frontal view is to increase the face identification accuracy. Before the warping process, we have separated the identity component and the expression component from the model parameters. To analyze the warped facial image, we may use the identity parameters or the expression parameters independently to increase the recognition rate. In the following, we synthesize the face image by using only the identity component or only the expression component. The experimental results for the right-half-view and right-side-view facial images are shown in Figures 14 and 15.

In Figure 14, the lower-right image illustrates the facial image synthesized using only the identity component; the expression can hardly be found and it shows a neutral face. In Figure 15, the warped image using the identity component is worse than in Figure 14; however, the warped image using the expression parameters looks fine.

Figure 14. The experimental results of warping the right-half-view facial image to the front view.

Figure 15. The experimental results of warping the right-side-view facial image to the front view.

We use a PC equipped with an Intel C2D 6300 CPU and 2045 MB of memory to test our algorithm. For a video sequence (with frame resolution 320*240), the processing time is less than 45 ms/frame.

Table 1. The improvement of identity recognition, with ICO (identity component only) and PC (pose correction) within 15 degrees.

                         ICO    PC     PC+ICO
    Frontal intra-model  18%    3.7%   21.5%

In Table 2, the comparison is done with the expression parameters. We find that the identity component increases the similarity to the neutral face in the database. On the other hand, for the right-half-view faces with expression processed by PC + ICO (45-60 degrees), the average similarity is about 74.3%, which is only 4.6% lower than for the PC+ICO frontal expression face. However, the improvement for the right-view face with expression is very limited; the similarity is about 56.4%.

5. Conclusions

In this paper, we have demonstrated that the expression parameters can be transformed linearly between any two AAMs of the view-based AAM. This can be used to match an expression-variant face at any angle and to predict the appearance from new viewpoints given a single image of a person. We anticipate that this approach will be useful for making face recognition and expression recognition systems more invariant to the viewing angle. In the future, we may establish a wide-angle facial detection and recognition system with higher accuracy, less processing time, and more stability.

References

[1] T. F. Cootes, D. Cooper, C. J. Taylor and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59,
1995.
[2] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," Proc. European Conf. on Computer Vision, vol. 2, pp. 484-498, 1998.
[3] G. J. Edwards, C. J. Taylor and T. F. Cootes, "Interpreting Face Images using Active Appearance Models," Int. Conf. on Face and Gesture Recognition, 1998.
[4] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "View-Based Active Appearance Models," Image and Vision Computing, vol. 20, pp. 657-664, 2002.
[5] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "Coupled-View Active Appearance Models," British Machine Vision Conference, 2000.
[6] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," IEEE Trans. PAMI, vol. 23, no. 6, pp. 681-685, 2001.
[7] H. Kang, T. F. Cootes and C. J. Taylor, "A Comparison of Face Verification Algorithms using Appearance Models," Proc. BMVC 2002, vol. 2, pp. 477-486.
[8] M. B. Stegmann, B. K. Ersbøll and R. Larsen, "FAME - A Flexible Appearance Modelling Environment," IEEE Transactions on Medical Imaging, 2003.
[9] I. Matthews and S. Baker, "Active Appearance Models Revisited," IJCV, 2004, in press.
[10] 陳曉瑩, "Real-Time Multi-Angle Face Detection" (in Chinese), Master's thesis, Institute of Electrical Engineering, National Tsing Hua University, 2006.
[11] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," Proc. Computer Graphics SIGGRAPH '99, 1999.
[12] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. on PAMI, vol. 25, no. 9, September 2003.
[13] C. Christoudias, L. Morency and T. Darrell, "Light field appearance manifolds," European Conf. on Computer Vision, (4):482-493, 2004.
[14] R. Gross, I. Matthews and S. Baker, "Eigen light-fields and face recognition across pose," Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[15] J. Chang, Y. Zheng and Z. Wang, "Facial Expression Analysis and Synthesis: a Bilinear Approach," Int. Conf. on Information Acquisition (ICIA '07), 8-11 July 2007.
[16] Y. Wang, G. Pan and Z. Wu, "3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach," IEEE CVPR, 2007.
[17] B. B. Amor, M. Ardabilian and L. Chen, "New Experiments on ICP-Based 3D Face Recognition and Authentication," ICPR 2006, vol. 3, pp. 1195-1199.
[18] I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis and T. Theoharis, "Three-Dimensional Face Recognition in the Presence of Facial Expression: An Annotated Deformable Model Approach," IEEE Trans. on PAMI, vol. 29, no. 4, April 2007, pp. 640-649.
[19] S. Ramanathan, A. Kassim, Y. Venkatesh and S. W. Wu, "Human Facial Expression Recognition using a 3D Morphable Model," IEEE ICIP, Oct. 2006.
[20] X. Lu and A. Jain, "Deformation Modeling for Robust 3D Face Matching," IEEE Trans. on PAMI, 2007.
[21] J. Xiao, S. Baker, I. Matthews and T. Kanade, "Real-time combined 2D+3D active appearance models," CVPR 2004, pp. II-535-II-542.
[22] S. Koterba, S. Baker, I. Matthews, C. Hu, J. Xiao, J. Cohn and T. Kanade, "Multi-view AAM fitting and camera calibration," IEEE ICCV, vol. 1, 17-21 Oct. 2005, pp. 511-518.
[23] J. Sung and D. Kim, "STAAM: Fitting a 2D+3D AAM to Stereo Images," IEEE ICIP, 8-11 Oct. 2006.
[24] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. de la Torre and J. Cohn, "AAM derived face representations for robust action recognition," Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006, pp. 155-160.
[25] P. Huisman, R. van Munster, S. Moro-Ellenberger, R. Veldhuis and A. Bazen, "Making 2D face recognition more robust using AAMs for pose compensation," Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006.
[26] N. Costen, T. F. Cootes and C. J. Taylor, "Compensating for Ensemble-Specificity Effects when Building Facial Models," Proc. British Machine Vision Conference 2000, vol. 1, pp. 62-71.
Patch-Based Occupant Classification for Smart Airbag

Shih-Shinh Huang
Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology
Email: [email protected]

Er-Liang Jian and Chi-Liang Chien
Chung-Shan Institute of Science and Technology
Email: [email protected]

Abstract - This paper presents a vision-based approach for occupant classification. In order to circumvent the intra-class variance, we consider the empty class as a reference and describe an occupant class by its appearance difference rather than by its appearance itself, as traditional approaches do. Each class in this work is modeled by a set of representative parts called patches, each of which is represented by a Gaussian distribution. This alleviates the mis-classification resulting from severe lighting changes, which can make the image locally blooming or invisible. Instead of using maximum likelihood (ML) for patch selection and for estimating the parameters of the proposed generative models, we discriminatively learn the models through a boosting algorithm by minimizing the loss of the training error.

Keywords - patch-based model, discriminative learning

I. INTRODUCTION

Until now, the integration of airbags into automobiles has significantly improved occupant safety in vehicle crashes. However, inappropriate deployment of airbags in some situations may cause severe or even fatal injuries, for example deployment against a rear-facing infant seat or when a passenger is sitting too close to the airbag. According to the report of the American National Highway Transportation and Safety Administration (NHTSA), since 1990 more than 200 occupants have been killed by airbags deployed in low-speed crashes. To prevent occupants from this kind of injury, NHTSA defined the Federal Motor Vehicle Safety Standard (FMVSS) 208 in 2001. One of the fundamental issues of FMVSS 208 is to recognize the occupant class inside the vehicle in order to control the deployment of the airbag. The five basic classes defined in FMVSS 208 are (i) Empty, (ii) RFIS (Rear Facing Infant Seat), (iii) FFCS (Front Facing Child Seat), (iv) Child, and (v) Adult.

Some existing sensors, such as ultrasound, pressure sensors, or cameras, have been used to develop systems that aim at meeting the classification requirements of FMVSS 208. In this work, we choose the camera as the sensing device, since it can provide a rich representation of the occupant in front of the dashboard. This gives the proposed approach potentially higher classification accuracy. Solving the occupant classification problem with computer vision is challenging in the presence of severe lighting changes, large intra-class variance, and structure variance. Since the vehicle is moving, the observed image may have a considerably large dynamic range, from bright sunlight to dark shadow. In the extreme, this makes some regions of the image blooming or invisible and thus complicates the classification task. The intra-class variance denotes that the same occupant class may have different appearance: for instance, passengers may wear clothing with different colors, and baby seats may have different styles. The difference in the scene resulting from a configuration change of the objects inside the vehicle is referred to as the structure variance. Figure 1 shows some images exhibiting the lighting change and intra-class variance. Similar to the works in the literature, we assume that the monitored scene has no structure variance, and the objective of this paper is to achieve a high recognition rate against severe lighting change and intra-class variance.

Figure 1. Challenges: (a) Severe lighting change. The images have a considerably large dynamic range, and the observed images have significantly different appearance. (b) Intra-class variance. Persons wearing clothing with different styles or colors.

A. Related Work

Owechko et al. [1], who are the pioneers in this area, attempted to eliminate the illumination variance by first applying intensity normalization to the training images. The coefficients of the eigenvectors computed by principal component analysis (PCA) are used to represent the occupant class, and an unknown input image is then recognized as belonging to the class of its nearest-neighbor sample. In order to overcome lighting change, Haar wavelet filters, which describe the intensity difference among neighboring regions, have been used for occupant representation. An over-complete and dense application of Haar filters over thousands of rectangular regions is adopted in [2]; a Support Vector Machine (SVM) is then applied to determine the boundaries among different occupant classes for handling intra-
class variance. In [3], [4], the edge map of the passenger appearance is extracted through a background subtraction algorithm and further described by high-order Legendre moments; the classification is achieved using the k-nearest-neighbors strategy. The edge map of the occupant is described by higher-order Tchebichef moments in [5], and the Adaboost algorithm is then applied to select a set of discriminative moments for classification. To utilize more information for classification, multiple features including range [6], motion information, and the edge map are fused under a two-layer architecture [7], [8]; the classifiers in each layer are Non-linear Discriminant Analysis (NDA) classifiers.

The features used in the aforementioned works are all global descriptors, such as a dense edge map [7], [8], Legendre moments [3], [4], or Tchebichef moments [5]. The main limitation of this kind of approach is that the classification accuracy deteriorates in the two extreme cases (blooming and invisible) resulting from severe lighting change. To circumvent this, we present a patch-based model, commonly used in the recognition literature [9], [10] to handle occlusion effects, to describe the occupant class. Furthermore, the above works directly model the appearance of the occupant and thus suffer from the significant intra-class variance. The general way to solve or alleviate this problem is to introduce classification algorithms such as SVM or NDA. Based on the insight that the silhouettes of different occupant classes are distinct, we instead consider the empty class as the reference and model the appearance difference with respect to the empty class.

B. Approach Overview

The objective of occupant classification is to assign one of five classes C = {C_Empty, C_RFIS, C_FFCS, C_Child, C_Adult} to the currently observed image. The system consists of two phases: training and classification. In the training phase, we first recover a reflectance image of the empty class by removing the illumination effect; the obtained reflectance image is considered as the reference image for the subsequent feature representation. Each occupant class is described by a patch-based generative model in order to handle severe lighting change, which can make the image locally blooming or invisible. Traditionally, the parameters of a generative model are estimated with an ML strategy in which only the samples with the same label are considered and used for training the corresponding model. However, the models learned in this way suffer from having little discriminativity among the different classes. Instead, we adopt a discriminative boosting algorithm that estimates the model parameters by directly minimizing the training error.

In the classification phase, the appearances at the trained patches of a specific occupant model are taken into consideration for feature representation. The feature used in this work is the difference in appearance between the patch in the observed image and that in the reference image. This makes the proposed approach invariant to intra-class variance. Then, the likelihood ratios evaluating the existence confidence of the given image with respect to the five trained models are computed, and the classification result is the occupant class with the highest confidence.

The remainder of this paper is organized as follows. In Section II, we introduce the generative models for representing the occupant classes and the way occupant classification is performed. The boosting algorithm for estimating the parameters of the models in a discriminative manner is then described in Section III. Section IV demonstrates the effectiveness of the developed approach by providing experimental results on an abundant database. Finally, we conclude the paper in Section V with some discussion.

II. PATCH-BASED CLASSIFIER

For every class, we build a generative model consisting of several patches, each described by a Gaussian distribution. The observed image is classified by maximizing the likelihood probability. Here, the feature representation of a patch is the appearance difference with respect to a reference image. In order to eliminate the illumination factor, we recover the reflectance image of the empty class and consider it as the reference image. Negative normalized correlation is then introduced to measure the appearance difference and represent the feature of a patch.

A. Feature Representation

The images for training are captured under various lighting conditions. As in the foreground segmentation literature [11], a reference image suitable for difference measurement should be illumination invariant and contain no moving objects. As discussed in [12], an image is the product of two images: a reflectance image and an illumination image. The reflectance image of the scene is constant, while the illumination image changes with the lighting condition in the environment. Accordingly, the reflectance image of the empty class is recovered and considered as the reference image here.

Given a set of empty-class images in the training database, we apply the approach proposed in [13] to estimate the empty-class reflectance image Ir, based on the assumption that illumination images have lower contrast than the reflectance image. This implies that the derivative filter outputs on the illumination image will be sparse, and the reflectance recovery problem can be re-formulated as a maximum-likelihood estimation problem. Figure 2 shows the decomposition of three empty-class images into a constant reflectance image and its corresponding illumination images.

Let p be a patch whose configuration is θ(p) = (t, l, w, h), where (t, l) is the coordinate of the top-left corner and (w, h) is the patch size. To impose a locality property similar to histograms of oriented gradients (HOGs) [14], we divide
Let p be a patch whose configuration is θ(p) = (t, l, w, h), where (t, l) is the coordinate of the top-left corner and (w, h) is the patch size. To impose a locality property similar to that of histograms of oriented gradients (HOG) [14], we divide a patch p into four quadrants {q_1, q_2, q_3, q_4}. We denote the quadrant q_i in the observed image I_o and in the recovered reflectance image I_r as I_o(q_i) and I_r(q_i), respectively; the schematic form is shown in Figure 3. Inspired by the work in [15] on coping with severe lighting change, a matching function (MF) γ(.) is applied to measure the appearance difference between I_o(q_i) and I_r(q_i):

  γ(I_o(q_i), I_r(q_i)) = - ( Σ_{(x,y)∈q_i} N(x, y) ) / sqrt( Σ_{(x,y)∈q_i} D_o(x, y) · Σ_{(x,y)∈q_i} D_r(x, y) )    (1)

where

  N(x, y) = (I_o(x, y) - Ī_o(q_i)) (I_r(x, y) - Ī_r(q_i)),
  D_o(x, y) = (I_o(x, y) - Ī_o(q_i))^2,
  D_r(x, y) = (I_r(x, y) - Ī_r(q_i))^2.    (2)

Here, Ī_o(q_i) and Ī_r(q_i) denote the average intensities of the quadrant images I_o(q_i) and I_r(q_i). This function computes the negative normalized correlation between I_o(q_i) and I_r(q_i), so the range of γ(.) is [-1, 1]. The feature representation f(I(p)) of a patch p is thus a 4-D vector, one value per quadrant.

Figure 2. Examples of reflectance image recovery: the first row shows three empty-class images, the second row the recovered reflectance image, and the third row the three corresponding illumination images.

Figure 3. The definition of the quadrant images I_o(q_i) and I_r(q_i).

B. Classification Model

A generative occupant model M^c = {p^c_k : k = 1, ..., K^c} consisting of K^c patches is proposed to describe each class c ∈ C. Each patch p^c_k is modeled by a Gaussian distribution N^c_k = {µ^c_k, Σ^c_k} associated with the patch configuration θ(p^c_k), where µ^c_k and Σ^c_k are the mean and covariance matrix, respectively. Assuming independence among patches, the log-likelihood of an observed image I belonging to class c is defined as

  log Pr(I | z^c = 1) = log Pr(I | M^c) = Σ_{k=1}^{K^c} log Pr(f(I(p^c_k)) | N^c_k)    (3)

where z^c ∈ {+1, -1} is the membership label for class c and f(I(p^c_k)) is the aforementioned patch representation of the image I at patch p^c_k. Note that the proposed model, which learns the likelihood of a given observation, is a generative one.

Instead of solving occupant classification directly by maximum likelihood (ML), that is, c* = argmax_c log Pr(I | z^c = 1), we introduce an existence confidence and re-formulate the task as five one-against-others binary classification problems. The work in [9] argues that this allows both classification and training to be carried out in a discriminative manner and thus improves classification accuracy. Consequently, the existence confidence of a specific class c given an observed image I is defined as the log-likelihood ratio test (LRT)

  H(I, c) = log [ Pr(I | z^c = 1) / Pr(I | z^c = -1) ].    (4)

Without assuming any prior, we approximate the background hypothesis Pr(I | z^c = -1) by a constant Θ^c. The LRT statistic in (4) then becomes

  H(I, c) = log Pr(I | z^c = 1) - Θ^c = Σ_{k=1}^{K^c} log Pr(f(I(p^c_k)) | N^c_k) - Θ^c = Σ_{k=1}^{K^c} { log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k }    (5)

where Θ^c = Σ_{k=1}^{K^c} Θ^c_k. The classification result for an image I, given the five trained patch-based generative models {M^c : c ∈ C}, is therefore the class c* with the highest existence confidence, that is, c* = argmax_c H(I, c). We have not yet described how to estimate the model parameters Ω^c = {(θ^c_k, µ^c_k, Σ^c_k, Θ^c_k) : k = 1, ..., K^c}; in the next section, a boosting algorithm is proposed to train these parameters in a discriminative way.
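To make the feature computation concrete, the following is a minimal sketch of Eqs. (1)-(2): it splits a patch given by θ(p) = (t, l, w, h) into four quadrants and returns the 4-D vector of negative normalized correlations between the observed image and the reference reflectance image. The function names and the zero-denominator fallback are ours, not the paper's.

```python
import numpy as np

def quadrant_ncc(obs_q, ref_q):
    # Negative normalized correlation between one quadrant of the observed
    # image and the same quadrant of the reference image (Eqs. 1-2).
    do = obs_q - obs_q.mean()
    dr = ref_q - ref_q.mean()
    denom = np.sqrt((do * do).sum() * (dr * dr).sum())
    if denom == 0:                 # flat quadrant: treat as uncorrelated
        return 0.0
    return -float((do * dr).sum() / denom)

def patch_feature(I_obs, I_ref, theta):
    # 4-D feature f(I(p)) for a patch with configuration theta = (t, l, w, h):
    # one negative-correlation value per quadrant q1..q4.
    t, l, w, h = theta
    po = I_obs[t:t + h, l:l + w].astype(np.float64)
    pr = I_ref[t:t + h, l:l + w].astype(np.float64)
    hh, hw = h // 2, w // 2
    quads = [(slice(0, hh), slice(0, hw)), (slice(0, hh), slice(hw, w)),
             (slice(hh, h), slice(0, hw)), (slice(hh, h), slice(hw, w))]
    return np.array([quadrant_ncc(po[q], pr[q]) for q in quads])
```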
III. DISCRIMINATIVE LEARNING USING BOOSTING

In the learning literature [9], [10], several compelling arguments indicate that a model whose parameters are estimated in a discriminative manner is preferable in terms of classification accuracy. Inspired by this, the parameters are determined directly by minimizing the exponential loss of the margin over all training samples [16].

A. Cost Function Definition

Assume a set of labeled images D = {(I_i, t_i)}_{i=1}^N. The margin of a sample (I_i, t_i) with respect to a learned model (classifier) H(.) is defined as z^c_i H(I_i, c), where z^c_i ∈ {+1, -1} is the membership label of the i-th sample for class c: z^c_i = 1 if t_i equals c, and z^c_i = -1 otherwise. The cost function J(.) evaluating the training error of the training set D for class c is then

  J(D, Ω^c) = Σ_{i=1}^N exp{ -z^c_i H(I_i, c) }.    (6)

The smaller the training error of a model H(.) determined by the parameters Ω^c for class c, the smaller the cost J(.). In other words, training the classifier for each class c amounts to finding the set of model parameters in the Ω^c space that minimizes this cost function.

Equation (6) is minimized with a boosting algorithm, a popular way to approach the solution sequentially with a set of additive models. At each round m, the function H(.) is updated as H(.) + h_m(.) so as to decrease the cost; h_m(.) and H(.) are called the weak and strong classifier, respectively, in the boosting literature. Consequently, H(.) has the form

  H(x) = Σ_{m=1}^M h_m(x)    (7)

where M is the number of boosting rounds. By designing h_m(x) as the log-likelihood of a patch minus an offset and setting M = K^c, we have

  h_m(x, c) = log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k,   H(x) = Σ_{k=1}^{K^c} { log Pr(f(I(p^c_k)) | N^c_k) - Θ^c_k },    (8)

so that H(x) in (7) is equivalent to H(I, c) in (5). Estimating the model parameters is thus the same as boosting the strong classifier in a sequential manner.

B. Gradient Descent Optimization

Boosting that chooses a linear combination of weak classifiers to minimize the proposed cost function J(.) is shown to be a greedy gradient descent in [17]. The AnyBoost algorithm presented in [17] states that the weak hypothesis giving the greatest reduction in cost lies along the direction of the negative gradient -∇J(H)(x). Differentiating J(.) in (6) with respect to H(.) gives

  ∂J(D, Ω^c)/∂H(I, c) = ∂/∂H(I, c) Σ_{i=1}^N exp{ -z^c_i H(I_i, c) } = -z^c exp{ -z^c H(I, c) }.    (9)

Since it is generally not possible to choose h_m(I, c) = -∇J(D, Ω^c) exactly, the AnyBoost algorithm instead searches for the function with the greatest inner product with -∇J(D, Ω^c). The inner product between -∇J(D, Ω^c) and h_m(I, c) is defined as

  <-∇J(D, Ω^c), h_m(I, c)> = Σ_{i=1}^N z^c_i exp{ -z^c_i H(I_i, c) } h_m(I_i, c).    (10)

Denoting exp{ -z^c_i H(I_i, c) } as the weight w^c_i, the task at boosting round m is to find the weak hypothesis that maximizes Σ_{i=1}^N z^c_i w^c_i h_m(I_i, c).

IV. EXPERIMENT

In this section, we present experimental results on a large amount of video.

A. System Setup and Video Collection

The car used for the experiments is a Mitsubishi Sarvin; the appearance inside the vehicle is shown in Figure 4(a). We mount the camera at the center of the roof near the rear-view mirror (see Figure 4(b)) to provide a near-profile view of the occupant and to prevent the camera view from being blocked by the driver. The video sequences used for both training and validation are gathered from the deployed camera while the platform is moving on the road, and the camera grabs images at a rate of 30 frames per second.

In order to give the database abundant lighting change, we collected videos under different weather conditions, such as sunny and cloudy days, over a period of more than two months. In addition, we drove the vehicle through several different scenes, both indoor and outdoor, such as a basement, facing the sun, and streets shaded by trees. As for intra-class variance, several adults and children with different body types and clothing appear in the videos and were asked to exhibit various postures; some examples are shown in Figure 5. Our database contains 34 video sets, and each set consists of one video per occupant class, giving 34 × 5 = 170 videos in total. Each video is about 5 to 10 minutes long and consists of about 8,000 to 11,000 frames. The total number of frames in the database is 1,633,752, and the detailed statistics can be found in Table I.
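The round-by-round selection described in Section III can be sketched as follows. This is a simplified illustration of an AnyBoost-style round under the exponential loss, not the authors' implementation: `score_fn` is a hypothetical callable that evaluates a candidate weak hypothesis (the Gaussian log-likelihood of a patch minus its offset) on every training image, and `labels`/`weights` hold z_i and w_i for one class.

```python
import numpy as np

def boosting_round(labels, weights, candidate_patches, score_fn):
    # One discriminative boosting round (Sec. III): among the candidate
    # patches, pick the weak hypothesis h_m with the largest weighted
    # correlation sum_i z_i * w_i * h_m(I_i), then update the weights
    # under the exponential loss of Eq. (6).
    best_patch, best_h, best_score = None, None, -np.inf
    for patch in candidate_patches:
        h_vals = score_fn(patch)                    # h_m(I_i, c) for every i
        score = float(np.sum(labels * weights * h_vals))
        if score > best_score:
            best_patch, best_h, best_score = patch, h_vals, score
    new_weights = weights * np.exp(-labels * best_h)  # w_i <- w_i e^{-z_i h_m(I_i)}
    return best_patch, new_weights
```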
Table I. Database statistics (number of frames)

          Empty     RFIS      FFCS      Child     Adult
Fold 1    174,572   164,635   167,589   166,103   176,423
Fold 2    153,322   157,699   157,064   156,352   159,993
Total     327,894   322,334   324,653   322,455   336,416

Figure 4. Camera configuration: (a) the appearance inside the Mitsubishi Sarvin; (b) the deployment of the camera.

Figure 5. Various poses with different illuminations.

Figure 6. Selected patches for the five occupant classes.

B. Classification Results and Analysis

Our occupant classification is based on a set of discriminative patches. To save computation, the grabbed images are normalized to a resolution of 256 × 128. Four types of rectangles are used, including 32 × 32, 32 × 16, and 16 × 32; the scanning steps over the entire image in the horizontal and vertical directions are set to 1/2 of the width and height of the rectangles, respectively. For example, the 32 × 16 rectangle is shifted by 16 and 8 pixels. The number of patches selected for modeling is K^c = 50 for each occupant class. The CPU used is an Intel Core Duo at 2.4 GHz with 1.0 GB of working memory, and the Intel Open Source Computer Vision Library (OpenCV) and libsvm 2.89 [18] are adopted to support the implementation under Microsoft Windows XP.

We use 2-fold cross validation to evaluate the classification performance. The collected videos in our database are divided into two folds: one is used to learn the models and the other for validation, and vice versa (see Table I); each fold thus includes 17 sets. For training, we extract 50 frames from every video in the training fold, giving 50 × 85 = 4,250 training frames in total. The training frames of each video are selected by sampling one frame every 100 frames from the first 5,000 frames. Figure 6 shows the first 10 selected patches for the five occupant classes.

The confusion matrices for the classification results of fold 1 and fold 2 are shown in Table II and Table III, respectively. Our proposed approach is effective in both cases, because the patch-based model built on local features is more robust to severe lighting change than one using a global representation, and the use of appearance differences for feature representation makes the system invariant to intra-class variance. The classification accuracies of the four classes RFIS, FFCS, Child, and Adult all exceed 99.0%. The classification time of our method is about 16 ms; this efficiency results from the simplicity of computing the log-likelihood ratio, which uses 4-D Gaussian distributions. However, there are still a number of mis-classifications between the FFCS and Adult classes, which are hard to distinguish because they have similar appearance.

V. CONCLUSION

In this paper, we present a patch-based generative model for occupant classification. Each patch is divided into four quadrants, and the appearance difference measured by the proposed negative correlation is used to represent the patch. Instead of using ML for classification, the idea of existence confidence is introduced, so that the model parameters can be estimated in a discriminative manner; to achieve this, a boosting algorithm is applied to approach the solution by directly minimizing the training error. The robustness and effectiveness of the proposed method against severe lighting change and intra-class variance have been intensively validated on an abundant database with more than 1,600,000 frames. In the near future, we will introduce semantic cues, such as head or seat detection, to bring the classification accuracy closer to 100%. In addition, the assumption that there is no structure variance inside the vehicle due to user preference should be relaxed in ongoing work.

ACKNOWLEDGMENT

This research is sponsored by the Chung-Shan Institute of Science and Technology under the project XB98175P.
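As a companion to the results that follow, here is a small sketch of the decision rule of Eq. (5) that produces per-frame entries such as those in Tables II and III: for each class, the existence confidence sums, over its trained patches, the Gaussian log-likelihood of the 4-D patch feature minus the per-patch offset, and the frame is assigned to the class with the highest confidence. It assumes SciPy is available, and the dictionary layout of `models` and `feats_per_class` is ours.

```python
from scipy.stats import multivariate_normal

def existence_confidence(patch_feats, class_model):
    # H(I, c) from Eq. (5): sum over the class's K^c patches of
    # log Pr(f(I(p_k)) | N_k) - Theta_k.
    return sum(multivariate_normal.logpdf(f, mean=mu, cov=cov) - theta
               for f, (mu, cov, theta) in zip(patch_feats, class_model))

def classify(feats_per_class, models):
    # feats_per_class[c] holds the 4-D features extracted at class c's own
    # trained patch locations; the result is c* = argmax_c H(I, c).
    scores = {c: existence_confidence(feats_per_class[c], m)
              for c, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```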
Table II. Confusion matrix for fold 1, our approach (overall accuracy 99.50%)

          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     171,261   233       0         86        4         98.10%
RFIS      0         164,605   0         1         29        99.98%
FFCS      0         0         167,567   0         22        99.98%
Child     0         0         6         165,597   500       99.69%
Adult     0         116       276       2         176,029   99.77%

Table III. Confusion matrix for fold 2, our approach (average accuracy 99.59%)

          Empty     RFIS      FFCS      Child     Adult     Accuracy
Empty     153,301   0         0         17        4         99.98%
RFIS      0         157,684   0         0         15        99.99%
FFCS      0         0         154,642   0         2,422     98.45%
Child     0         0         2         155,598   752       99.51%
Adult     0         0         0         0         159,993   100.0%

REFERENCES

[1] J. Krumm and G. Kirk, "Video Occupant Detection for Airbag Deployment," IEEE Workshop on Applications of Computer Vision, pp. 20-35, 1998.
[2] Y. Zhang, S. J. Kiselewich, and W. A. Bauson, "A Monocular Vision-Based Occupant Classification Approach for Smart Airbag," IEEE Proceedings on Intelligent Vehicle Symposium, pp. 632-637, 2005.
[3] M. E. Farmer and A. K. Jain, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 1, pp. 60-80, January 2007.
[4] M. E. Farmer and A. K. Jain, "Occupant Classification System for Automotive Airbag Suppression," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 756-761, 2003.
[5] S.-S. Huang and P.-Y. Hsiao, "Occupant Classification for Smart Airbag Using Bayesian Filtering," International Conference on Green Circuits and Systems, 2010.
[6] P. R. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, and B. Ottersten, "Smart Automotive Airbags: Occupant Classification and Tracking," IEEE Trans. on Vehicular Technology, vol. 56, no. 4, pp. 1983-1993, July 2007.
[7] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "High Performance Sensor Fusion Architecture for Vision-Based Occupant Detection," IEEE Intl. Conference on Intelligent Transportation Systems, pp. 1128-1132, 2003.
[8] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "Vision-Based Fusion System for Smart Airbag Application," IEEE Proceedings on Intelligent Vehicle Symposium, pp. 245-250, 2002.
[9] A. B. Hillel, T. Hertz, and D. Weinshall, "Object Class Recognition by Boosting a Part-Based Model," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pp. 702-709, 2005.
[10] T. Deselaers, D. Keysers, and H. Ney, "Discriminative Training for Object Recognition Using Image Patches," IEEE Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 20-25, 2005.
[11] S.-S. Huang, L.-C. Fu, and P.-Y. Hsiao, "Region-Level Motion-Based Foreground Segmentation under a Bayesian Network," IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 522-532, April 2009.
[12] H. Farid and E. H. Adelson, "Separating Reflections from Images by Use of Independent Components Analysis," Journal of the Optical Society of America, vol. 16, no. 9, pp. 2136-2145, 1999.
[13] Y. Weiss, "Deriving Intrinsic Images from Image Sequences," IEEE Intl. Conf. on Computer Vision, vol. 1, pp. 68-75, 2001.
[14] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.
[15] L. D. Stefano, F. Tombari, and S. Mattoccia, "Robust and Accurate Change Detection Under Sudden Illumination Variations," Asian Conference on Computer Vision, pp. 103-109, November 2007.
[16] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing Visual Features for Multiclass and Multiview Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.
[17] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting Algorithms as Gradient Descent," Neural Information Processing Systems (NIPS), pp. 512-518, 2000.
[18] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines," Journal of Machine Learning Research, no. 6, pp. 1889-1918, 2005.
    DISPLAY CHARACTERIZATION INVISUAL CRYPTOGRAPHY FOR COLOR IMAGES Chao-Hua Wen (溫照華) Color Imaging and Illumination Center, Graduate Institute of Engineering, National Taiwan University of Science and Technology Taipei, Taiwan E-mail: [email protected] ABSTRACT images can be reconstructed by stacking operation. This property makes VCS especially useful in condition of Visual cryptography can encrypt the visual information the system requirement of low computation load. and then decrypt the information by human visual Noar and Shamir proposed the (k, n) threshold system without complicated computation. There are scheme or k out of n threshold scheme which illustrated various measures on the performance of kinds of visual a new paradigm in image sharing [2]. In this scheme a cryptography schemes, but rare studies on exact color secrete image is divided into n share images. With any k reproduction for visual cryptography. This paper of the n shares, the secret can be perfectly reconstructed, proposes a new visual cryptography scheme with the while even complete knowledge of (k-1) shares reveals display characterization model which can render no information about the secret image. Consequently, decrypted color image accurately. In the experiments, Noar and Shamir’s method is restricted to a binary the processes of encryption and decryption were image due to the nature of the basic model. demonstrated from the source display to the destination Verheul and Van Tilborg (1997) proposed the display. For color secret images, this method only uses scheme that extended the basic visual cryptography two encryption share images and the decryption can be scheme from binary image to color image [3]. In this performed via a simple operation. scheme each pixel is expanded into m subpixels. Each subpixel may take one of the color from the set of color Keywords Visual Cryptography; Visual Secret Sharing; 0, 1,…, c-1, where c is the total number of the colors Color Visual Cryptography; Display Characterization used to represent the pixel. These subpixels are interrelated to each other such that after all shares are 1. INTRODUCTION stacked and the color is revealed if corresponding subpixels of all shares are of same color, otherwise the With the rapid deployment of network technology, level of black is revealed. In this scheme the size of the multimedia information is transmitted over the Internet decrypted image will increase by a factor of ck-1, when c conveniently. While transmitting secret images, security ≥ n for a (k, n) threshold scheme. shall be taken into consideration because hackers may Koga and Yamamoto (1998) proposed the lattice utilize weak link over communication network to based (k, n) VCS scheme for gray level and color image exposure the hidden information. There are various [4]. In that scheme, the pixels are treated as elements of image secret sharing schemes have been developed to finite lattice and the stacking up of pixels is defined as strengthen the security of the secret images. Information an operation on the finite lattice. In that scheme, (k, n) hiding and secrete sharing are two major approaches. VCS for color images is defined with c colors as a For instance, the watermarking method is widely used collection of c subsets in nth Cartesian product of the for information hidden [1] and the Visual Cryptogrphay finite lattice. (VC) is adopted for secret sharing [2]. Yang (2000) proposed a new VCS for the color VC is introduced first by Noar and Shamir (1994), images [5]. 
The scheme is implemented based on the which allows visual information (e.g. plain text, basic concept of a black and white VCS and gets much handwritten notes, graphs and pictures) to be encrypted better block length than the Verheul-Van Tilborg by producing random noise images that are used to scheme. Here each pixel is expanded into 2c-1 decrypt through the human visual system [2]. Visual subpixels, where c is the number of colors. Hou (2003) cryptography scheme (VCS) eliminates complex proposed a scheme of secret sharing for both gray-level computation in decryption process, and the secret and color images using halftone technique [6]. The 1118
    color secret imageis decomposed into individual 2. VISUAL CRYPTOGRAPHY SCHEME channels before the application of halftone technique. Then the traditional VC is applied to halftone image of Naor and Shamir proposed a (k, n) threshold visual each channel to accomplish the creation of shares. The secret sharing scheme to share a secret image [2]. A size of decrypted image is increased by a factor of nk-1 secret image is hidden into n share images and can be for (k, n) threshold VCS and the quality of decrypted decrypted by superimposing at least k share images but image is based on halftone technique used. any k-1 shares cannot reveal the secret. Cimato et al. (2003) proposed c-colors (k, n) threshold cryptography scheme that provides a 2.1. Visual Cryptography Scheme for binary images characterization of contrast optimal scheme with pixel expansion of 2c – 1 [7]. Yang and Chen (2008) The (2, 2) VCS is illustrated to introduce the basic proposed VCS for color image based on additive color concept of threshold visual secret sharing schemes. The mixing [8]. In the scheme, each pixel is expanded by a encryption process transforms each secret pixel into two factor of three. shares, and each share belongs to the corresponding In order to reduce the size and the distortion of share image. In the decryption process the two decrypted image, Dharwadkar et al. (2010) propose the corresponding shares are stacked together (using visual cryptography for color image using color error OR/AND operation) to recover the secret pixel. Two diffusion dithering technique [10][16]. This technique share of a white secret pixel are of the same while those improves the quality of decrypted image compared to of a black secret pixel are complementary as shown in other dithering techniques, such as Flyod-Steinberg Fig. 1:. Consequently a white secret pixel is recovered error diffusion [11] which is shown by the experimental by a share with the stacked result of half white sub- results obtained using Picture Quality Evaluation pixels and a black secret pixel is recovered by all black. metrics [12]. Meanwhile, Revenkar et al. (2010) Using this basic VCS, the contrast ratio of the decrypted provided the overview of various VCS and performance image is reduced results from halving intensity of the analysis on the basis of pixel expansion, number of white secret pixels. secret images, image format and type of shares generated [13]. Display is one of the most used media devices in Visual Cryptography. In most applications, the decryption side uses a different display model from the encryption side. Even though the same display model used, luminance and color of the displays are possibly different because of production variance. Color gamut is one of characteristics of the color reproduction media for reproducing color images play a major role in determining how a given secret image will perform in VCS. The display color gamut that we have been living for the past several decades is standardized as “Rec. Fig. 1: (2, 2) VCS for transforming a binary pixel into 709” in the video industry [14] or “sRGB” in the two shares. computer industry [15]. These systems share the same primaries. However, the advanced wide gamut displays are rapid deployment in specialized professional 2.2. Digital Halftoning applications and even in home theater now. 
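A minimal sketch of the (2, 2) construction just described, using a 2×2 sub-pixel expansion for illustration (the expansion shown in Fig. 1 may differ): a white secret pixel receives the same random balanced pattern in both shares, a black pixel receives complementary patterns, and stacking the transparencies corresponds to an AND on the white bits. All names are ours.

```python
import numpy as np

# Six balanced 2x2 patterns, each with two white (1) and two black (0) sub-pixels.
PATTERNS = [np.array(p, dtype=np.uint8).reshape(2, 2)
            for p in ([1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0],
                      [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0])]

def encrypt_2_2(secret, rng=None):
    # secret: 2-D array with 1 = white, 0 = black.  White pixels get the same
    # pattern in both shares, black pixels get complementary patterns.
    rng = rng or np.random.default_rng()
    h, w = secret.shape
    s1 = np.zeros((2 * h, 2 * w), np.uint8)
    s2 = np.zeros((2 * h, 2 * w), np.uint8)
    for y in range(h):
        for x in range(w):
            pat = PATTERNS[rng.integers(len(PATTERNS))]
            s1[2*y:2*y+2, 2*x:2*x+2] = pat
            s2[2*y:2*y+2, 2*x:2*x+2] = pat if secret[y, x] else 1 - pat
    return s1, s2

def decrypt_2_2(s1, s2):
    # Stacking the printed shares keeps a sub-pixel white only if it is white
    # in both shares, i.e. an AND on the white bits.
    return s1 & s2
```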
That makes display characterization more serious in terms of Halftone technique is one of the most important parts of accurate information communications between the the image reproduction process for devices with a source and destination. limited number of colors. According to the physical The rest of this paper is organized as follows: characteristics of different media uses the different ways Section 2 provides overview of black and white VCS, of representing the color level of images. The general digital halftoning, error diffusion, halftone-based VCS printer such as dot matrix printers and laser printers can for gray-scale images, and color visual cryptography only control a single pixel to be printed (black pixel) or scheme. Display characterization is elaborated in not be printed (white pixel). The halftone is applied to section 3. The proposed framework is introduced in the given image to render the illusion of the continuous Section 4. Results and discussion is given in Section 5. tone images on the devices that are capable of Finally the conclusion is given in Section 6. producing only binary image elements. This illusion is achieved because our eyes perform spatial integration. That is, if we view a very small area from sufficiently large viewing distance our eyes averages the fine detail 1119
    within the smallarea and record only the overall increase pixel expansion. Wei Qiao et al. also intensity of the area. introduced a VCS for color images based on halftone technique [21]. 2.3. Digital Halftoning 3. DISPLAY CHARACTERIZATION In the (k, n) threshold VCS for gray-level image [3]. The pixels have g gray levels ranging from 0 to g-1, Display is one of the most used media devices in Visual where each pixel is expanded to m subpixels of size m ≥ Cryptography. Flat panel displays have been become a gk-1. In this scheme the size of decoded image is larger common peripheral for desktop personal computers and than the secret image compared to Naor and Shamir workstations. In general VC tasks, we create an image VCS scheme. In order to reduce the size of decrypted on one display and take the data file to a second image, the gray-level halftone image is transformed into imaging system. When viewed on the second display, an approximate binary image. Then, the basic VCS the decrypted image is likely to have different color described in Section 2.1 can use to create shares. The reproduction. Here we address primarily users who will following steps are used to generate less distorted same be doing accurate imaging on a monitor. decrypted image. The traditional CRT techniques have been 1) Transform the gray-level image into a binary summarized by Berns [17] and can be described as image using halftone technique. application of the gain-offset-gamma (GOG) model to 2) Each black or white pixel in the halftone image is characterize the electro-optical transfer functions of the represented by m subpixels into different shares display and a 3x3 linear transform to go from RGB to selecting from the shares of black or white pixels. CIE XYZ tristimulus values. The accuracy of the GOG 3) Repeat step 2 until every pixel in the halftone characterization is probably adequate for most desktop image is decomposed into shares. color applications and color management systems [18]. The International Color Consortium (ICC) has 2.4. Error Diffusion published a standard file format for storing ‘‘profile’’ information about any imaging device In literature there are many mature error diffusion (http://www.color.org/). It has been become routine to techniques are exists, and because of its exceptionally use such profiles to achieve accurate imaging. The high image quality, it continues to be a popular choice widespread support for profiles allows most users to among digital halftoning algorithms [9]. Nagaraj V. achieve characterization and correction without needing Dharwadkar et al. have used Adaptive Order Dithering to understand the underlying characteristics of the (Cluster-dot dithering) [16], Floyd-Steinberg error imaging device. ICC monitor profiles use the standard diffusion technique [11] and color error diffusion CRT model presented in this article. technique and performed the computation of Picture Quality evaluation for decrypted images [12]. Those 3.1. Primary transform matrix and inverse experimental results revealed that the color error diffusion produces the superior quality of recovered The primary transform matrix for the colorimetric image compare to Adaptive Order Dithering and Floyd- characterization of the display was derived from the Steinberg error diffusion technique. direct colorimetric measurements of the three full-on primaries after black correction. The matrix and its 2.5. 
VCS for Color Images inverse are given in Equation (1) and Equation (2). First color VCS was developed by Verheul and Van ⎡X ⎤ ⎡X R XG X B ⎤⎡R⎤ ⎢ ⎥ ⎢ Tilborg [3]. Colored secret images can be shared with ⎢ Y ⎥ = ⎢ YR YG YB ⎥ ⎢G ⎥ ⎥⎢ ⎥ (1) the concept of arcs to construct a colored visual ⎢Z ⎥ ⎢ZR ⎣ ⎦ ⎣ ZG ZB ⎥⎢B⎥ ⎦⎣ ⎦ cryptography scheme. In c colorful VCS, one pixel is −1 transformed into m subpixels, and each subpixel is ⎡R⎤ ⎡ X R XG XB⎤ ⎡X ⎤ divided into c color regions. In each subpixel, there is ⎢G ⎥ = ⎢ Y YG YB ⎥ ⎢Y ⎥ (2) exactly one color region colored, and all the other color ⎢ ⎥ ⎢ R ⎥ ⎢ ⎥ regions are black. The color of one pixel depends on the ⎢B⎥ ⎢ ZR ⎣ ⎦ ⎣ ZG ZB ⎥ ⎦ ⎢Z ⎥ ⎣ ⎦ interrelations between the stacked subpixels. For a colored visual cryptography scheme with c colors, the 3.2. Electro-Optical Transfer Function (EOTF) pixel expansion m is c × 3. Yang and Laih [19] improved the pixel expansion to c × 2 of Verheul and EOTF is used to describe the relationship between the Van Tilborg [3]. Liu et al. developed a color VCS under signal used to drive a given display channel and the the visual cryptography model of Naor and Shamir with luminance produced by that channel. For displays, this no pixel expansion [20]. In this scheme the increase in function is sometimes referred to as gamma and it is the number of colors of recovered secret image does not the aspect of the display characterization described by 1120
    GOG portion ofthe display characterization model. [IR, IG, IB] [IRHT, IGHT, IBHT] EOTF, however, does not work in visual cryptography because VCS deals with fully on/off signal basically. (3) Creation of work-in-process shares: The method described in Section 2.1 is used for creating the 4. THE PROPOSED COLOR VCS work-in-process shares by (2, 2) VCS for each halftone images. For example, the red halftone image IRHT, (2, 2) The objective of our proposed scheme is to apply the VCS encodes the halftone image into two shares, IRSH1 VCS for color image and get better quality decrypted and IRSH2 respectively. Green and blue halftone images image with display characterization procedures. Fig. 9: is performed the same process as the red halftone image. illustrates the framework of the encryption algorithm and the simulated decryption image. In this encryption [IRHT, IGHT, IBHT] [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] algorithm the color image is decomposed into three channels and each channel is considered as a gray-level (4) Creation of encrypted shares: To combine the image. For each gray-level image dithering and VCS work-in-process shares of IRSH1, IGSH1, and IBSH1 into a schemes are applied independently to accomplish the color Share1 image, and to combine IRSH2, IGSH2 and creation of shares. We used color error diffusion for IBSH2 into a Share2 image. dithering technique. It reduces the color sets that render the halftone image and chooses the color from sets by [IRSH1, IRSH2, IGSH1, IGSH2, IBSH1, IBSH2] [Share1, Share2] which the desired color may be rendered and whose brightness variation is minimal. Fig. 2: shows how to (5) Display characterization: To apply display decompose a magenta pixel (R = 1, G = 0, B = 1) into model for color correction of Share1 and Share2 images. two sharing blocks and how to reconstruct the magenta block. We superimpose (using AND operation) the [Share1, Share2] [Share1’, Share2’] binary shares of each channel to get the decrypted color image. For delivery of accuracy color communication between the original secret image and the decrypted (R,G,B) = (1,0,1) image, two displays were used in this study. One is the laptop monitor of HP Pavilion dm3, and other is the R = 1 G = 0 B = 1 mobile phone display of hTC Diamond. The colorimetric measurements by Konica-Minolta CA-210 are shown in Table 1: and plotted in Fig. 3:. Fig. 3: illustrates the difference of chromaticity coordinates of Share1 Share2 HP monitor, hTC display and NTSC color space. The color gamut of hTC display is wider than HP monitor. Decrypted pixel The primary transform matrices of two displays were calculated and embedded into the ICC profiles. The Fig. 2: An example of the proposed VCS for a transform matrices of HP monitor and hTC display are magenta pixel. shown in Equation (3) and Equation (4) respectively. Table 1: Measured Luminance and chromaticities 4.1. Encryption Color Luminance and Chromaticity Display In the encryption algorithm, the two shares are R G B Y (cd/m2) x y generated from the color image. Based on Noar and 1 0 0 54.95 0.5894 0.3440 Shamir’s basic concept, the color image is decomposed 0 1 0 138.8 0.328 0.5852 into R, G and B channels. From these channels, six of the work-in-process shares are created. 
Next to combine HP 0 0 1 41.94 0.1466 0.1156 these six work-in-process shares into two encrypted 0 0 0 0.25 0.2009 0.1966 color images using following steps: (1) Color Decomposition: The color image I is 1 1 1 234.7 0.2964 0.3099 decomposed into IR, IG and IB monochrome gray-level 1 0 0 55.67 0.6340 0.3336 images for R, G and B color channels respectively. 0 1 0 192.9 0.3321 0.6273 [I] [IR, IG, IB] hTC 0 0 1 36.24 0.1423 0.0778 0 0 0 0.04 0.2329 0.2269 (2) Digital halftoning: To apply the halftone 1 1 1 284.80 0.2925 0.3044 technique for each color channel to obtain IRHT, IGHT, and IBHT halftone images respectively. 1121
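Steps (1) through (4) of the encryption can be sketched as follows. The halftoning here uses the classic Floyd-Steinberg error-diffusion weights (7/16, 3/16, 5/16, 1/16) rather than the color error-diffusion variant adopted in the paper, and the share construction re-uses the encrypt_2_2 helper sketched in Section 2; treat this as an illustrative pipeline under those assumptions, not the authors' exact code.

```python
import numpy as np

def floyd_steinberg(channel):
    # Binarize one gray-level channel (0..255) by error diffusion, pushing
    # the quantization error to the right / lower-left / lower / lower-right
    # neighbours.
    img = channel.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), np.uint8)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            out[y, x] = 1 if new else 0            # 1 = white, 0 = black
            err = old - new
            if x + 1 < w:               img[y, x + 1]     += err * 7 / 16
            if y + 1 < h and x > 0:     img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:               img[y + 1, x]     += err * 5 / 16
            if y + 1 < h and x + 1 < w: img[y + 1, x + 1] += err * 1 / 16
    return out

def encrypt_color(secret_rgb):
    # (1) decompose into R/G/B, (2) halftone each channel, (3) build a (2,2)
    # share pair per channel with encrypt_2_2 (sketched earlier), then
    # (4) stack the per-channel shares into two color share images.
    shares1, shares2 = [], []
    for ch in range(3):
        ht = floyd_steinberg(secret_rgb[:, :, ch])
        s1, s2 = encrypt_2_2(ht)
        shares1.append(s1)
        shares2.append(s2)
    share1 = (np.stack(shares1, axis=-1) * 255).astype(np.uint8)
    share2 = (np.stack(shares2, axis=-1) * 255).astype(np.uint8)
    return share1, share2
```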
    0.6 processed by the Floyd-Steinberg error diffusion algorithm as shown in Fig. 4:. 0.5 0.4 v' 0.3 0.2 (a) RGB secret image (b) Red halftone image 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 u'' (c) Green halftone image (d) Blue halftone image Fig. 3: Plot of CIE chromaticity coordinate values of HP monitor (purple line), hTC display (blue line) and Fig. 4: Color secret image and decomposited halftone NTSC color spcace (yellow line). images by Floyd-Steinberg error diffusion algorithm. ⎡X ⎤ ⎡0.4009 0.3294 0.2261⎤ ⎡ R ⎤ Fig. 5: shows the creation of the work-in-process ⎢ Y ⎥ = ⎢0.2340 0.5877 0.1783⎥ ⎢G ⎥ (3) shares. Fig. 5: (a) and Fig. 5: (b) are two work-in- ⎢ ⎥ ⎢ ⎥⎢ ⎥ process shares of red channel. Fig. 5: (c) and Fig. 5: (d) ⎢Z ⎥ ⎣ ⎦ HP ⎣⎢0.0453 0.0872 1.1379 ⎥ ⎢ B ⎥ ⎦ ⎣ ⎦ HP are the shares of green channel and Fig. 5: (e) and Fig. 5: (f) illustrates the work-in-process shares of blue channel. ⎡X ⎤ ⎡0.3714 0.3593 0.2301⎤ ⎡ R ⎤ ⎢Y ⎥ = ⎢0.1954 0.6787 0.1258⎥ ⎢G ⎥ (4) Overall, these six shares reveal no information about the ⎢ ⎥ ⎢ ⎥⎢ ⎥ secret image. ⎢Z ⎥ ⎣ ⎦ hTC ⎣⎢0.0190 0.0439 1.2613 ⎥ ⎢ B ⎥ ⎦ ⎣ ⎦ hTC As described in Section 3, two ICC profiles were first created. Here we assigned the source profile to HP monitor and the destination profile to hTC display. The new color transform was created based on the source (a) IRSH1 (b) IRSH2 (c) IGSH1 profile and the destination profile. The profile connect color space CIEXYZ was adapted. Therefore, the convert from HP image to hTC image are RGBhp XYZ RGBhTC shown in Equation (5). Consequently, we convert from Share1 and Share2 images to Share1’ and Share2’ using the equation. (d) IGSH2 (e) IBSH1 (f) IBSH2 −1 Fig. 5: Six of the work-in-process share images. ⎡R⎤ ⎡0.3714 0.3593 0.2301⎤ ⎡0.4009 0.3294 0.2261⎤ ⎡ R ⎤ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ (Actually, these are black and white images) ⎢G ⎥ = ⎢0.1954 0.6787 0.1258⎥ ⎢0.2340 0.5877 0.1783⎥ ⎢G ⎥ ⎢B⎦ ⎣ ⎥ ⎢0.0190 0.0439 1.2613⎥ ⎣ ⎦ ⎢0.0453 0.0872 1.1379⎥ ⎢ B ⎥ ⎣ ⎦ ⎣ ⎦ HP hTC (5) To reduce the number of the share images for portability and distribution, the proposed scheme 4.2. Decryption divides the work-in-process shares into two groups and creates two encrypted shares. Fig. 6: shows the In the decryption algorithm the color image channels combination of six work-in-process shares into two are reconstructed by stacking the shares of channels. encrypted color images. Both encrypted color images Our proposed scheme is straightforward to reconstruct show no information about the secret image at all. the decrypted image Idecrypt by stacking Share1’ and Share2’ images with AND operation for each color channel individually. [Share1’, Share2’] [Idecrypt] (a) Share1 color image (b) Share2 color image 5. RESULTS AND DISCUSSION Fig. 6: Two encrypted share color images. The color image is decomposed into R, G and B channel images, and then next those decomposition images are Here we assume that Alice uses the monitor of HP Pavilion dm3 to encrypt the secret image and then Bob 1122
  • 136.
    uses the displayof hTC Diamond One to decrypted the REFERENCES image by his eyes. The simulation encrypted share images are shown in Fig. 7: that Alice can preview the [1] H. Arafat Ali, “Qualitative spatial image data hiding for encryption results of the images on hTC display. As a secure data transmission,” ICGST, GVIP Journal, consequence, Share1’ and Share2’ were used rather Volume 7, Issue 2, August, pp-35-43, 2007. than Share1 and Share2. [2] M. Naor, A. Shamir, “Visual cryptography,” Advances in Cryptology, Eurocrypt 94, Lecture Notes in Computer Science, Vol. 950, pp. 1-12, 1995. [3] Verheul, Van Tilborg, “Construction and properties of k out of n visual secret sharing scheme Designs,” Codes and Cryptography, Vol. 11, pp. 179-196, 1997. [4] H. Koga, H. Yamamoto, “Proposal of a lattice based (a) Share1’ color image (b) Share2’ color image visual secret sharing scheme for color and gray-scale images,” IEICE Transactions on Fundamentals of Fig. 7: Color correction of the encrypted images Electronics, Communications and Computer Sciences, resulted from the display characterization of HP monitor vol. E81-A, no. 6, pp. 1262-1269, 1998. and hTC display. [5] C.N. Yang, C.S. Laih, “New colored visual secret sharing scheme,” Design, Codes and Cryptography, vol. 20, Finally, the decryption results are illustrated in Fig. pp.325-335, 2000. 8:. Fig. 8: (a) depicts the decrypted image without [6] Y.C. Hou, “Visual Cryptography for color images,” Pattern Recognition, vol. 36, pp. 1619-1629, 2003. display characterization. In contrast to the decrypted [7] S. Cimato, R.Prisco and A.De Santis, “Optimal colored image with display characterization in Fig. 8: (b), threshold visual cryptography schemes,” Design, Codes results revealed that there were color different between and Cryptography, vol. 35, pp. 311-335, 2003. (a) and (b). However, note that Alice can share the [8] Ching-Nung Yang and Tse-Shih Chen, “Colored Visual encrypted images to Bob, then he can decrypted the Cryptography Scheme based on additive color mixing,” secret image and see the same contents with same color Pattern Recognition, vol. 41, pp. 3114-3129, 2008. as Alice create. [9] Keith T. Knox, “Evolution of error diffusion,” Journal of Electronic Imaging, Vol. 8, pp. 422-429, 1999. [10] Shaked, N. Arad, A. Fitzhugh and I. Sobel, “Color diffusion: Error diffusion for color halftones,” H.P. laboratories Israel, HPL-96-128(R.1), 1999. [11] R. Floyd and L. Steinberg, “An adaptive algorithm for spatial gray scale,” Proceedings of the S.I.D. 17, (a) (b) 2(Second Quarter), 75-77, 1976. [12] Tomas Kratochvil and Pavel Simicek, “Utalization of Fig. 8: Decryption results. (a) shows the decrypted MATLAB for Picture Quality Evaluation,” Institute of image without display characterization and (b) Radio electronics, Brno University of Technology. demonstrated the decrpted image with display [13] P.S.Revenkar, Anisa Anjum and W.Z. Gandhare, characterization. “Survey of visual cryptography schemes,” International Journal of Security and Its Applications, Vol. 4, No. 2, pp. 49-56, April, 2010 6. CONCLUSIONS [14] ITU-R Recommendation BT.709-5, Basic parameter values for the HDTV standard for the studio and for international program exchange. In this paper, we proposed a new VCS for the color [15] IEC 61966-2-1, Multimedia systems and equipment – images with display characterization, which uses the Color measurement and management– Part 2-1: Color error diffusion dithering on primary color channel management –Default RGB color space – sRGB. 
directly. We also reduced the encrypted images down [16] Nagaraj V. Dharwadkar, B. B. Amberker, and Sushil Raj two share images for easily reconstruction and hidden Joshi, “Visual Cryptography for Color Image using Color information. Here we first applied the display Error Diffusion,” ICGST-GVIP Journal, Vol. 10, Issue 1, characterization into Visual cryptography. Results pp.1-8, February, 2010. revealed that we can accurately deliver color [17] R.S. Berns, “Methods for Characterizing CRT Displays,” Displays, Vol. 16, pp. 173-182, 1996. information and secret image as well. Further works can [18] Mark D. Fairchild and David R. Wyble, “Colorimetric be done to reduce the size of share image, improve the Characterization of the Apple Studio Display (Flat Panel quality of halftone shares, and use the model of display LCD),” Munsell Color Science Laboratory Technical characterization as an encryption key. Report, July, 1998. (www.cis.rit.edu/mcsl/ research/PDFs /LCD.pdf ) [19] C.N. Yang, “New visual secret sharing schemes using probabilistic method,” Pattern Recognition Letter 25, pp.481-494, 2004. 1123
    [20] F. Liu,C.K. Wu and X.J. Lin, “Color Visual On Halftone Technique,” International Conference on Cryptography Schemes,” IET Information Security, vol. Measuring Technology and Mechatronics Automation 2, No. 4, pp 151-165, 2008. 978-0-7695-3583-8/09, pp. 393-395, 2009. [21] Wei Qiao, Hongdong Yin and Huaqing Liang, “A Kind Of Visual Cryptography Scheme For Color Images Based Red channel SR0 Display model SG0 Share1 SB0 Green channel Decrypted image SR1 Original image Blue channel SG1 Share2 SB1 Display model Decomposition  Digital Halftone Visual Cryptography  Scheme Fig. 9: The proposed Framework of Color VCS 1124
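Closing out this paper's pipeline, the cross-display correction of Eq. (5) can be sketched directly from the primary transform matrices reported in Eq. (3) and Eq. (4): source RGB is mapped to the CIE XYZ connection space with the HP matrix and back to device RGB with the inverse of the hTC matrix. The EOTF is intentionally ignored, since the share images use only full on/off signals as noted in Section 3.2; the function name and the example value are ours.

```python
import numpy as np

# Primary transform matrices reported in Eq. (3) and Eq. (4):
# linearized RGB -> CIE XYZ for the HP Pavilion dm3 and hTC Diamond displays.
M_HP = np.array([[0.4009, 0.3294, 0.2261],
                 [0.2340, 0.5877, 0.1783],
                 [0.0453, 0.0872, 1.1379]])
M_HTC = np.array([[0.3714, 0.3593, 0.2301],
                  [0.1954, 0.6787, 0.1258],
                  [0.0190, 0.0439, 1.2613]])

def hp_to_htc(rgb_hp):
    # Eq. (5): source RGB -> XYZ connection space -> destination RGB.
    rgb_hp = np.asarray(rgb_hp, dtype=np.float64)
    xyz = rgb_hp @ M_HP.T
    return xyz @ np.linalg.inv(M_HTC).T

# Example: a full-on magenta sub-pixel (R, G, B) = (1, 0, 1) of a share image.
print(hp_to_htc([1.0, 0.0, 1.0]))
```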
    Data Hiding withRate-Distortion Optimization on H.264/AVC Video Yih-Chuan Lin and Jung-Hong Li Dept. of Computer Sciences and Information Engineering, National Formosa University, Yunlin, Taiwan. E-mail: [email protected] Abstract - This paper proposes a data hiding algorithm for the video quality caused by the watermark hiding can be H.264/AVC standard videos. The proposed video data hiding controlled at the bound less than 2 dB. scheme embeds information that is useful to some specific The remainder of this paper is organized as follows. applications into the symbols of context adaptive variable Section 2 describes the watermarking principles and related length coding (CAVLC) domain in H.264/AVC video streams. literatures. Section 3 explains our proposed scheme, including In order to minimize the changes on both the reproduced video the watermark embedding/extracting schemes and embedding quality and the output bit-rate, the algorithm selects DCT restriction rule. In Section 4, the performance of our proposed blocks using a coefficient energy difference (CED) rule and scheme is presented. Finally, some conclusions are given in then modifies the minor significant symbols, trailing one (T1) Section 5. symbols and the least significant bits (LSB) of non-zero quantized coefficient symbols, to hide data into the selected II. BACKGROUND blocks. Upon considering the joint optimization on rate and In general, most data hiding methods in H.264/AVC are distortion, the data hiding algorithm considers the data hiding based on entropy coding symbols or motion vectors (MV). task as a special quantization process and performs within the There are two kinds of entropy coding method in H.264/AVC: rate-distortion optimization loop of H.264/AVC encoder. The CAVLC and CABAC (Context-adaptive binary arithmetic experiment results have demonstrated that our scheme has coding). Many scholars choose CAVLC to develop because it good efficiency on hiding capacity, video quality and output is not complicated and is easy to operate for most situations. bit-rate. We can modify those nonzero coefficients in DCT blocks for Keywords: H.264/AVC, data hiding, CAVLC, reconstruction embedding, but it would affect the bit-rate and video quality loop, coefficient energy difference. seriously. Although the watermark hiding in the DCT blocks is easy to develop, we should consider avoiding unnecessary I. INTRODUCTION problems. Information hiding (or called data hiding interchangeably After transform and quantization, a DCT block usually hereafter) for video is a video process that adds some useful contains sparse zeros and nonzero coefficients. The nonzero data to the raw data or compressed formats of the video in a coefficients in high-frequency after the zig-zag reorder are manner such that the third parties or others can not discern the often sequences ±1, which are called trailing one and they are presence or contents of the hidden message in perception. limited only up to three at most in H.264/AVC. When the H.264/AVC can provide better compression efficiency number of trailing ones becomes more, the coding length is than other exiting standard at the cost of high computation shortest. So most researchers are focus on this part to develop complexity. Owing to the high popularity of this standard algorithm in data hiding. Consider changing the coefficients in format over many video applications, the hiding of useful data a DCT block. 
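To make the CAVLC terms used in this paper concrete, here is a small sketch that zig-zag scans a quantized 4×4 block and counts its trailing ones (at most three, as in H.264/AVC). The scan table is the standard frame-scan order; the function names and the example block are ours.

```python
# Standard H.264/AVC zig-zag (frame) scan order of a 4x4 block: (row, col).
ZIGZAG_4X4 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]

def scan_block(block):
    # Return the zig-zag coefficient sequence, the nonzero levels, and the
    # trailing-one count that CAVLC folds into the coeff_token symbol.
    seq = [block[r][c] for r, c in ZIGZAG_4X4]
    levels = [v for v in seq if v != 0]
    t1 = 0
    for v in reversed(levels):        # walk back from the highest frequency
        if abs(v) == 1 and t1 < 3:
            t1 += 1
        else:
            break
    return seq, levels, t1

# A block whose zig-zag sequence starts -2, 4, 3, -3, 0, 0, -1 (five nonzero
# levels, one trailing one), matching the example used later in Fig. 5.
example = [[-2, 4, 0, -1],
           [ 3, 0, 0,  0],
           [-3, 0, 0,  0],
           [ 0, 0, 0,  0]]
print(scan_block(example)[1:])   # -> ([-2, 4, 3, -3, -1], 1)
```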
Four symbols for the CAVLC are available: into this format attracts a great deal of attention for different coeff_token, trailing_ones_sign_flag, total_zero, and applications. Recently, many researchers are committed to run_before. The coeff_token is composed of nonzero develop watermark schemes in H.264/AVC [1-4], but in order coefficients and T1 in a DCT block. In the same case, if the to make a balance between video quality and bit-rate; they number of trailing one increases, the bit-rate will reduce. On usually offer only a small capacity to hide data. This paper the contrary, when the number of coefficient is raised, the proposes a data hiding (or called watermark interchangeably bit-rate will increase oppositely. hereafter) scheme that is based on the CAVLC in H.264/AVC In Wu et al. [4], their proposed method is emphasizing on encoder and decoder sides. In the proposed method, one robustness to the compression attacks for H.264/AVC with watermark bit is embedded by employing the relationship more than a 40:1 compression ratio in I frame. The data between all of the polarity of T1 symbols in a 4x4 luminance embedded to the predicted 4x4 DCT block is only one bit. In DCT block. If the DCT block has no any T1, the algorithm Tian et al. [5], this proposed method just modified the nonzero considers modifying the LSB of the last nonzero coefficient coefficients. Therefore, the bit-rate increase is about 0.1% and for embedding information. Experiment results have shown the PSNR degradation is less then 0.5dB. It is good at keeping that our proposed method provide more capacity and can low bit-rate and high quality. However the capacity is too low. enhance the rate-distortion efficiency. The degradation of In Liao et al. [6], this method embeds message into the trailing ones of 4x4 blocks during the CAVLC. The feature of this 1125
    method is toallow data hiding directly in the compressed is intra-mode, the encoder performs intra-prediction and the stream in real time and the capacity is more than others [5-6]. mode set contains only I4MB, I16MB and IPCM modes. In Shahid et al. [7], this proposed method also embeds watermark into DCT blocks. It modifies the LSB of coefficients in each inter- and intra-frames and provides a high capacity of data hiding. In Huang et al. [8], this method is a new steganography scheme with capacity variability and synchronization for secure transmission of acoustic data, In Wang et al. [9], the method has good efficiency, it are always higher than 45 dB at the hiding capacity of 1.99 bpp by embedding for all test images III. THE PROPOSED SCHEME A. OVERVIEW OF OUR METHOD Figure. 1 depicts the block diagram of our proposed method in the H.264/AVC encoder side. The watermark embedding method is inserted into H.264/AVC during the encoding process. Data is hided in DCT blocks before entropy coding. In our proposed method, the watermarking is done on luminance DCT blocks in both intra and inter modes, not considering the chrominance DCT blocks. Fig. 2. The proposed watermarking method at macro-block level. Fig. 1. Schematic illustration of our proposed watermarking /embedding procedure. When the encoder executes information hiding method, the rate-distortion must be considered. Because the marked result changes are reflected to the reconstruction frame, the encoding of next frame refers to this marked reconstruction frame. So we must consider the reconstruction loop [7]. In other words, the data hiding block should perform inside the reconstruction loop or inside the reconstruction loop with Fig. 3. The proposed watermarking integration with RDO RDO (Rate Distortion Optimization). Otherwise, the bit-rate procedure. and video quality would be affected seriously due to the prediction drift phenomenon between encoder and decoder As indicated in Fig. 2, our proposed method is also sides. integrated within the RDO procedure in the encoder side. In the H264/AVC encoding, RDO helps current frame to When the encoder performs the RDO procedure, it selects the select the best mode and get the best trade-off between best coding mode while watermarking is done at the same time. distortion of quality and bit-rate. Therefore, our method takes That mode might be different from that without watermarking. into account RDO in order to get better coding performance But the bit-rate and video quality are best among other modes while embedding the information into the video blocks. As in the mode set. Fig. 3 illustrates the detail of “RDCost with shown in Fig. 2, the embedding procedure at the macro-block watermarking” block shown in Fig. 2. As described previously, level is illustrated. When a macro-block enters the encoder we focused on both intra- and inter-blocks of luminance side, the encoder firstly determines its encoding mode. If the component for data hiding. As indicated in Fig. 3, the modes marco-block is inter-mode, the encoder performs both inter- IPCM and SKIP are not considered for embedding. and intra-prediction to select the best mode from the mode set. As previously described, our method can be done within The mode set includes PSKIP, P16x16, P16x8, P8x16, P8x8, RDO inside reconstruction loop. As shown in Fig. 2, the block I4MB, I16MB and IPCM modes. When the marco-block mode “Get best MB mode” selects the best mode to do the coding task. The performance of data hiding without RDO is not 1126
    better than thatof considering the RDO based on the results Fig. 5. Example illustration of proposed watermark restriction. shown in a later section. There is a 4x4 DCT block with five coefficients and the Fig. 4 illustrates the integration of the proposed method threshold is set 0.25. After zig-zag scanning all of the with the H.264/AVC decoder. An extracting algorithm is coefficients, the sequence is -2, 4, 3, -3, 0, 0, -1. The last inserted into H.264/AVC decoding phase. The extracting trailing one is -1. Before embedding phase, we must calculate phase can be done in DCT blocks after entropy decoding. In the CED firstly and compare the CED value with threshold. As our method, we embed the watermark on luminance DCT shown in Fig. 5, the block satisfies our restriction, in that the blocks in both intra- and inter-modes. So we need only to do CED is lower than the threshold. extract on the luminance part of DCT blocks. C. EMBEDDING ALGORITHM In this subsection, we will show the pseudo code for the embedding algorithm and explains the detailed. In Table I, we define the symbols and the functions in the pseudo code. These functions often refer to the DCT block or trailing one set to get the information of DCT block. Table I the symbol and function explanation Fig. 4. Schematic illustration of the proposed watermark Variable or Function Definition extracting procedure. DCTB A size 4x4 DCT block A size 4x4 DCT block by DCTB B. THE RESTRICTION OF OUR METHOD embedding In literatures, most methods usually utilize the quantized The trailing one set in a DCT T 1set block coefficients for embedding; they all have the common feature that only modifying the value but not changing the sign. The W Watermarking bit, W = {0,1} proposed algorithm utilizes the relation of the polarity of each Threshold Threshold value T1 to embedding. The polarity and the sign of coefficient are coeEnergy coefficient energy difference related. getT 1set ( DCTB) Get the T1 set from DCTB Based on experiments, we observe a phenomenon that when the number of coefficients is sparse in a DCT block, getT 1count (T 1set ) Get the number of trailing changing the sign of trailing one causes the bit-rate increasing one from T1 set significantly. In intra-prediction phase, the current block refers getLevcount (DCTB ) Get the number of nonzero to the upper and the left blocks to make prediction and encode level from DCTB the prediction residual. When changing the sign of trailing one getLastT 1Index (T 1set ) Get the last T1 index from T1 with sparse nonzero coefficients in the current block, the block set data in spatial domain would change greatly because the getLastLev Index (DCTB ) Get the last nonzero level energy changes by the sign flip is a greater proportion of the index from DCTB whole block coefficient energy. When the coded block is XorT 1Polarity (T 1set ) All of polarity doing the referenced by other uncoded blocks, this bad effect would be XOR operation in T1 set. propagated to other uncoded blocks due to the reconstruction ChangeSign (DCTB , Changing the sign of T1 on loop. Thus, we have to draw up a mechanism for preventing Index) index position in DCTB this effect. If the number of coefficients is not sparsely and the coefficient energy of trailing one to be changed the sign ChangeLSB (DCTB , Changing the LSB of T1 on Index) index position in DCTB occupies slightly proportion in the current block, we does not hide any watermark bits to the DCT block. 
getLSB( DCTB, Index) Getting the LSB of level on In our method, we set a threshold to decide whether the index position in DCTB DCT block is suitable to embedding data or not. At first, we getEnergy (DCTB ) Getting coefficient energy calculate the coefficient energy of the current DCT block and difference in DCTB the CED after changing the sign of one trailing one. If the change rate of CED is less than the prespecified threshold, the The Embedding algorithm can be divided into two parts, block is chosen to hide data. Otherwise, the block is kept intact. as shown in Table II. The first part, for blocks with at least one One simple example is shown in Fig. 5. trailing one and CED less than the threshold, utilizes all of the polarity values of trailing ones to hide data. If the sign of trailing one is negative, the polarity value is 0. On the contrary, the sign of trailing one is positive, the polarity value is 1. The polarity values of trailing ones are through an XOR operation. The result must be identical to the value of the watermark bit to be hided into the block; otherwise we should change the sign of last trailing one to satisfy the hiding condition. If the result 1127
    equals to thewatermarking bit, the process does not modify watermark bit when the number of trailing one is nonzero. If any thing for the block. The algorithm changes the sign of the the number of trailing one is zero and the last level existence, last trailing one because the last trailing one in the high we can get the LSB from the last level as watermark bit. If the frequency zone has lower energy than other trailing ones, not number of level and trailing one is zero, we do not do any causing significant degradation of quality and bit-rate. thing. Table II The pseudo code for Embedding Algorithm Table III The pseudo code for Extracting Algorithm Embedding Algorithm Extracting Algorithm Input: DCTB Input: DCTB Output: DCTB Output: W Initialization: Initialization: T 1set  getT 1set ( DCTB ) T 1set  getT 1set ( DCTB ) numT1  getT1count (T 1set ) numT1  getT1count (T 1set ) numLevel  getLevcount ( DCTB) numLevel  getLevcount ( DCTB) Begin Embedding() Begin Extracting() if( numT1  0 ) if( numT1  0 ) coeEngergy  getEnergy (DCTB ) coeEngergy  getEnergy ( DCTB ) if( coeEngergy  Threshold ) if( coeEngergy  Threshold ) W  XorT1Polarity(T1set ) W  XorT1Polarity(T1set ) if( W !  W ) output W LastT1  getLastT1Index( DCTB) end ChangeSign( DCTB, LastT 1) else if( numT 1  0 numlevel  0 ) output DCTB LastLevel  getLastLevIndex( DCTB ) end W  getLSB ( DCTB , LastLevel ) end output W else if( numT1  0 numlevel  0 ) end LastLevel  getLastLevIndex(DCTB ) End ChangeLSB ( DCTB , LastLevel ,W ) output DCTB IV. EXPERIMENTAL RESULTS end End A. THE EXPERIMENT ENVIRONMENT The second part, when the number of nonzero Table IV the experimental parameters for H.264/AVC codec. coefficients is nonzero and the number of trailing one is zero, Parameter Information utilizes the last level to change the LSB for hiding data. Profile IDC 66(baseline) Otherwise if the number of levels and trailing ones are zero, Intra period 15(I-P-P-P) we do not perform the embedding work. The advantage of the Slice mode 0 method in the first case is that the change of the sign does not Frames to be encoded 300 affect other symbols in the same block. According to the Motion Estimation scheme Fast Full Search CAVLC rule, the trailing_ones_sign_flag indicate the sign of Rate Control Disable trailing one, it is encoded as one bit in the NAL (Network Abstraction Layer). If the sign is negative, it will be encoded Table V the test video format parameters bit 1. On the contrary, if the sign of trailing one is positive, it Parameter Information will be encoded one bit 0. We change only the sign of last Video format QCIF trailing one so that the encoded block has the same length as YUV format 4:2:0 that prior to embedding process. Frame Size 176×144 D. EXTRACTING ALGORITHM Frame rate 30 fps The extracting phase as shown in Table III is easier than the embedding phase. The watermarking extracting algorithm We utilize the H.264/AVC JM Reference software [9] as is performed between the entropy decoding phase and the the platform to simulate our proposed method. This subsection inverse quantization phase. We find out all of the trailing ones presents that the experiment parameters for our method in JM in current DCT block firstly and calculate the CED value; if reference software. We use the version of JM software is 12.2, the CED is lower than threshold, we collect all of the polarity where the related environmental parameters are shown in values for each trailing one to do XOR operation to get the Table IV. 
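Before turning to the experiments, the embedding and extracting rules of Tables II and III can be restated compactly. The Python sketch below is an illustrative, non-normative rendering that reuses the trailing_ones and ced helpers from the earlier sketch: if a block has trailing ones and its CED is below the threshold, the XOR of the trailing-one polarities carries the bit and the sign of the last trailing one is flipped when the XOR does not already match; if there are no trailing ones but nonzero levels exist, the LSB of the last nonzero level is replaced. Corner cases (for example a level whose magnitude drops to one) and the exact LSB convention for negative levels are not spelled out in the text and are handled here only in the simplest way.

    def polarity_xor(coeffs, t1_indices):
        """XOR of T1 polarities: 1 for a positive trailing one, 0 for a negative one."""
        bit = 0
        for i in t1_indices:
            bit ^= 1 if coeffs[i] > 0 else 0
        return bit

    def embed_bit(coeffs, w, threshold=0.25):
        """Embed one watermark bit w (0 or 1) into zig-zag-ordered coefficients."""
        t1 = trailing_ones(coeffs)
        levels = [i for i, c in enumerate(coeffs) if c != 0]
        if t1 and ced(coeffs, t1) < threshold:
            if polarity_xor(coeffs, t1) != w:
                last = t1[-1]                    # last T1 lies in the high-frequency zone
                coeffs[last] = -coeffs[last]     # sign flip encodes the bit
            return True
        if not t1 and levels:
            last = levels[-1]
            coeffs[last] = (coeffs[last] & ~1) | w   # replace LSB of the last level
            return True
        return False                             # block left untouched

    def extract_bit(coeffs, threshold=0.25):
        t1 = trailing_ones(coeffs)
        levels = [i for i, c in enumerate(coeffs) if c != 0]
        if t1 and ced(coeffs, t1) < threshold:
            return polarity_xor(coeffs, t1)
        if not t1 and levels:
            return coeffs[levels[-1]] & 1
        return None                              # no bit hidden in this block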
In the experiment, four videos: “akiyo,” “foreman,”
    “mobile,” and “news”are used as test data set. Their format information is shown in Table V. The secret data to be hided Table VI Comparison the efficiency between the original’s into the test videos is a random bit stream. and proposed method for foreman in QP = 15 QP = 15 B. The EXPERIMENT RESULTS PSNR(dB) Bit-rate(kbit) Capacity(bit) In this subsection, we demonstrate the experiment results Original 47.32 969.62 and make an explanation about the results. Three methods are without ER 45.18 1070.09 337752 considered. The original method refers to the method without With ER data hiding; the “within RDO” method represents the method T=0.5 46.35 1023.22 165019 operated in the RDO loop while the “without RDO” method T=0.1 46.35 1024.11 164923 means that it executes after the RDO stage in the T=0.05 46.36 1025.11 165190 reconstruction loop of encoder. As shown in Figs. 6 and 7, the “within RDO” method is superior to the “without RDO” in Table VII Comparison the efficiency between the original and terms of the output video bit-rate and the reconstructed video embedding method for foreman in QP = 27 PSNR. QP = 27 PSNR(dB) Bit-rate(kbit) Capacity(bit) Original 37.5 196.26 without ER 36.62 228.05 80708 With ER T=0.5 37.33 205.92 22118 T=0.1 37.33 205.7 22216 T=0.05 37.32 205.59 22273 Table VIII Comparison the efficiency between the original’s and proposed method for foreman in QP = 31 QP = 31 PSNR(dB) Bit-rate(kbit) Capacity(bit) Fig. 6. Comparison of the video quality for video foreman encoded at varying QP values Original 34.86 74.92 without ER 34.1 140.93 48152 With ER T=0.5 34.65 127.25 11449 T=0.1 34.64 126.81 11289 T=0.05 34.63 126.85 11409 From the experiments, we can observe that the degradation of bit-rate and video quality caused by embedding can be controlled effectively by adding embedding restriction. But it also raises another question. When the threshold is small, the performance is improved to a saturation degree. In other words, the effectiveness of the embedding restriction rule has a limitation level for controlling the degradation. For other test videos, we illustrate their results in terms of video quality and Fig. 7. Comparison of output bit-rate for video foreman bit-rate in Figs. 8-15. encoded at varying QP values In Fig. 7, the bit-rate of the within RDO method is higher than that of the original. This is not a desired phenomenon for some applications. We use a threshold value of CED to select appropriate DCT blocks to embed data. The number of DCT blocks that can be embedded is decreasing with the restriction threshold. This mechanism helps us to control the degradation of marked video quality, bit-rate change, and the capacity of data hiding. In the experiments, we set different threshold values T of embedding restriction rule as 1, 0.5, 0.1 or 0.05 for the “within RDO” scheme. The results are shown in Tables VI to VIII. We can find that the degradation of quality is reduced from Fig. 8. Comparison of the video quality between our method 3dB to 1dB and that the bit-rate after embedding is not and the original for video foreman at varying QP values increasing significantly by setting the restriction rule. 1129
    Fig. 9. Comparisonof the bit-rate between our method and the original for video foreman at varying QP values Fig. 13. Comparison of the video quality between our method and the original for video mobile at varying QP values Fig. 10. Comparison of the video quality between our method and the original for video akiyo at varying QP values Fig. 14. Comparison of the video quality between our method and original for video news at varying QP values Fig. 11. Comparison of the bit-rate between our method and the original for video akiyo at varying QP values Fig. 15. Comparison the video quality between our method and the original for video news at varying QP values For smaller threshold values, most of the DCT blocks in the video are excluded to modify the T1 symbols. However, it doesn’t affect the scheme because in that case it modifies the LSB of the last coefficient in the block. Therefore, for smaller threshold values, the number of DCT blocks hided using the T1 symbols is less than that of using the LSB replacement. This means that the bit-rate and video quality will be kept saturation. Only changing the LSB of the last coefficient in the block would not affect the bit-rate and PSNR significantly. Fig. 12. Comparison of the video quality between our method The capacity for each test video is shown in Figs. 16-19 and the original for video mobile at varying QP values 1130
    According to Fig.3, our proposed method does not aim at the SKIP mode blocks for data hiding. When the cost of SKIP mode is lower than others, the mode decision phase selects the SKIP mode to be the block mode, the number of SKIP mode blocks is increasing with the QP value, as the results shown in Fig. 20. Fig. 16. Comparison of the capacity between our method and the original for video foreman at varying QP values Fig. 20. Comparison of the number of SKIP mode block for video foreman encoded at varying QP values In Figs. 21 to 23, our proposed method and Shahid’s [7] are compared in terms of bit-rate, PSNR and capacity. There are two variants of our proposed method; the one with threshold value of CED T=0.1 and the other with T=0.5, respectively. When the QP values are higher than 11, Shahid’s Fig. 17. Comparison of the capacity between our method and capacity is rapidly declined due to the number of coefficients the original for video akiyo at varying QP values in high QP values is sparse. The efficiency of our method with CED is close to Shahid’s regarding the bit-rate and video quality. Fig. 18. Comparison of the capacity between our method and the original for video mobile at varying QP values Fig. 21. Comparison video quality of our proposed and Shahid for video foreman encoded at varying QP values Fig. 22. Comparison bit-rate of the number of our proposed Fig. 19. Comparison of the capacity between our method and and Shahid for video foreman encoded at varying QP values the original for video news at varying QP values 1131
    [2] S.K. Kapotas,E.E. Varsaki, A.N. Skodras, “Data Hiding in H.264 Encoded Video Sequences”, IEEE 9th Workshop on Multimedia Signal Processing, October 1-3, 2007, Crete, pp. 373-376. [3] B.G. Mobasseri, Y.N. Raikar, “Authentication of H.264 Streams by Watermarking CAVLC blocks”, SPIE Conference on Security, Steganography and Watermarking of Multimedia Contents IX, San Jose, CA, January 28-February 2, 2007. [4] G.Z. Wu, Y.J. Wang, W.H. Hsu, “Robust watermark embedding detection algorithm for H.264 video”, Journal of Electronic Imaging 14(1), 013013, 2005 [5] L. Tian, N. Zheng, J. Xue and T. Xu, “A CAVLC-Based Fig. 23. Comparison capacity of our proposed and Shahid for Blind Watermarking Method for H.264/AVC Compressed video foreman encoded at varying QP values Video”, In: Asia-Pacific Services Computing Conference, 2008. APSCC 2008, pp. 1295–1299. IEEE, Los Alamitos In Table IX, we compare the capacity performance (2008) between Shahid’s scheme and our proposed algorithm. At the [6] K. Liao, D. Ye, S. Lian, Z. Guo, J. Wang, “Lightweight same QP, our method can provide higher capacity than that of Information Hiding in H.264/AVC Video Stream”, mines, Shahid’s, and the capacity of Shahid’s is decreasing seriously vol. 1, pp.578-582, 2009 International Conference on with the QP value decreased. Multimedia Information Networking and Security, 2009 [7] Z. Shahid, M. Chaumont, W. Puech, “Considering the Table IX Comparison capacity of our method and Shaid’s for Reconstruction Loop for Data Hiding of Intra and Inter foreman at varying QP Frames of H.264/AVC”, published in European Signal Proposed method Shahid[7] Processing Conference (EUSIPCO), 2009. QP T = 0.5 T = 0.1 [8] X. Huang, Y. Abe, and I. Echizen, “Capacity Adaptive Capacity (bit) Synchronized Acoustic Steganography Scheme”, Journal 11 281591 281497 280578 of Information Hiding and Multimedia Signal Processing, 15 165019 164923 139629 Vol. 1, No. 2, pp. 72-90, Apr. 2010 19 82915 83241 67582 [9] Z.H. Wang, T.D. Kieu, C.C. Chang, M.C. Li, A Novel 23 40620 40652 29851 Information Concealing Method Based on Exploiting 27 22118 22216 12108 Modification Direction Journal of Information Hiding 31 11449 11289 4357 and Multimedia Signal Processing, Vo1. 1, No. 1, pp. 1-9, Jan. 2010 V. CONCLUSIONS [10] K. Sühring, H.264/AVC Reference Software Group [On-line]. Available: http://iphome.hhi.de/suehring/tml/, In this paper, we propose a data hiding algorithm that has Joint Model 12.2 (JM12.2), Jan. 2009. considered the rate distortion performance for H.264/AVC standard. The algorithm can control the increase of bit-rate and decrease of PSNR after hiding secret data into the videos at the cost of reducing the capacity of data to be hided. The information is hided in the T1 symbols of CAVLC domain in H.264/AVC encoder. In order to reduce the propagation of hiding modification to the subsequent blocks, the proposed algorithm can selection those blocks with minor energy change to hide data. With the selection scheme, the proposed algorithm can control the threshold value to adjust adaptively the capacity for different application requirements. ACKNOWLEDGEMENT This research is supported in part by National Science Council, Taiwan under the grant NSC 98-2221-E-150-051 REFERENCES [1] G. Qiu, P. Marziliano, A. Ho, D. He, Q. Sun, “A Hybrid Watermarking Scheme for H.264 Video”, Processing of the 17th International Conference on Pattern Recognition, ICPR, vol.4, pp.865-868, Aug. 2004. 1132
    Secret-fragment-visible Mosaic —a New Image Art and Its Application to Information Hiding I-Jen Lai (賴怡臻) Wen-Hsiang Tsai (蔡文祥) Institute of Computer Science and Engineering Dept. of Computer Science National Chiao Tung University, Hsinchu, Taiwan National Chiao Tung University, Hsinchu, Taiwan Email: [email protected] Email: [email protected] Abstract—A new type of art image called secret-fragment- Dobashi et al. [3] improved the voronoi diagram to allow a visible mosaic image is created, which is composed of user to add various effects to the mosaic image, such as rectangular-shaped fragments yielded by division of a secret simulation of stained glasses. Kim and Pellacini [4] image. To create this kind of mosaic image, the 3D RGB color generated jigsaw image mosaic composed of many arbitrary space is transformed into a 1-dimensional h-colorscale based shapes of tiles selected from a database. Extending the on which a new image similarity measure is proposed; and the most similar candidate image from an image database is concept of [4], Blasi et al. [5] presented a new mosaic image selected accordingly as a target image. Then, a greedy called puzzle image mosaic. Lin and Tsai [6] embedded algorithm is adopted to fit every tile image in the secret image secret data in image mosaics by adjusting regions of into a properly-selected block in the target image, resulting in boundaries and altering pixels’ color values. Wang and Tsai an effect of embedding the secret image fragmentally and [7] hid data into image mosaics by utilizing overlapping visibly in the composed mosaic image. In addition to this type spaces of component images. Hung and Tsai [8] embedded of secret image hiding, secret message bits may be embedded data into stained-glass-like mosaic images by modifying the as well for the purpose of covert communication. Based on the tree structure used in the creation process. Hsu and Tsai [9] fact that tile images in an identical bin of the histogram of the presented a new type of art image, circular-dotted image, created mosaic image have similar colors, all the tile images in each histogram bin are reordered pairwisely and their relative and used the characteristics of its creation processes to hide positions are switched accordingly, to embed secret message secret messages in the generated art image. Chang and Tsai bits without creating noticeable changes in the resulting mosaic [10] proposed a new type of art image, called tetromino- image. The embedded message is protected by a secret key, and based mosaic, which is composed of tetrominoes appearing may be extracted from the stego-image using the key. in a video game. Data hiding is made possible by distinct Additional security measures are also discussed. Experimental combinations and color shifting of the tetromino elements. results show the feasibility of the proposed methods. A new type of art image, called secret-fragment-visible Keywords: secret-fragment-visible mosaic image, covert mosaic image, which contains small fragments of a secret communication, data hiding. source image is proposed in this study. Observing such a type of mosaic image, people can see all of the fragments of I. INTRODUCTION the secret image, but the fragments are so tiny in size and so Mosaics are artworks created from composing small random in position that people cannot figure out what the pieces of materials, such as stone, glass, tile, etc. 
Nowadays, source image look like, unless they have some way to they are used popularly for decorating houses and other rearrange the pieces back into their original positions, using constructions. Creation of mosaic images by computers is a a secret key from the image owner. Therefore, the source new research topic in recent years. Traditional mosaic image may be said to be secretly embedded in the resulting images are obtained by arranging a large number of small mosaic image, though the fragment pieces are all visible to images, called tile images, in a certain manner so that each an observer of the image. And this is just why we name the tile image represents a small piece of a source image, named resulting image as a secret-fragment-visible mosaic image. target image. Consequently, while we see a mosaic image In the remainder of this paper, the proposed mosaic image from a distance, as a whole it will look like its source creation process will be described in Section II, a covert image — an effect of a human vision property. Many communication method via secret-fragment-visible mosaic methods have been proposed to create different types of images will be proposed in Section III, and some mosaic images [1-8]. experimental results will be presented in Section IV, Haeberli [1] proposed a method for mosaic image followed by conclusions in Section V. creation using voronoi diagrams by placing the sites of II. PROPOSED MOSAIC IMAGE CREATION PROCESS blocks randomly and filling colors into the blocks based on the content of the original image. Hausner [2] created tile The proposed mosaic image creation process is composed mosaic images by using centroidal voronoi diagrams. of two major stages. The first is the construction of a 1133
    database which canbe used later to select similar target above defines a 1-D h-colorscale. The resulting image images for given secret images. The quality of a constructed created by our method is given in Fig. 1(b), which secret-fragment-visible mosaic image is related to the contrastively has less noise when compared with Fig. 1(a). similarity between the secret image and the target image; the selected target image should be as similar to the secret image as possible. An appropriate similarity measure for this purpose is proposed in this study and described later. The other stage is the creation of a desired mosaic image using the secret image and the target image as input. In this stage, the secret image is divided into fragment pieces as tile images, which then are used to create the mosaic image. The number of tile images is limited by the size of the secret (a) (b) image and that of the tile images. Note that this is not the Figure 1. Effects of mosaic image creation using different color case in traditional mosaic image creation where available similarity measures (a) Image created with similarity measure tile images for use to fit into the target image are unlimited of [12]. (b) Image created with proposed similarity measure. in number. In order to solve this problem of fitting a limited Furthermore, to compute the similarity measure between number of tile images into a target image, a greedy a tile image in the secret image and a target block in an algorithm is proposed, which is described later as well. image in a database for use in tile-image fitting in generating 2.1 Database Construction a mosaic image, we propose a new feature, called h-feature, for each block image C (either a tile image or a target block), The database plays an important role in the secret- denoted as hC, which is computed by the following steps: fragment-visible mosaic image creation process. If a target image is dissimilar to a secret image, the created image will 1. compute the average of the color values of all the be distinct from the target one. In order to generate a good pixels in C as (RC, GC, BC); result, the database so should be as large as possible. 2. re-quantize (RC, GC, BC) into (rC′, gC′, bC′) using the Searching a database for a target image with the highest new Nr, Ng, and Nb color levels; and similarity to the secret image is a problem of content-based 3. calculate the h-feature hC for C by Eq. (2) above, image retrieval. A technique to solve this problem is to base resulting in the following equation: the similarity on 1-D color histogram transformation [12] of hC(rC′, gC′, bC′) = bC′ + NbrC′ + NbNrgC′. (3) the color distribution of the image. The transformation maps With Nr, Ng, and Nb all set equal to 8, the range of the the three color channel values into a single value. computed values of the h-feature fC above may be figured out Specifically, each color channel is re-quantized first into to be from 0 to 584. The proposed algorithm for constructing fewer levels, yielding a new image I′ with a lower resolution a database of candidate images for use in generating secret- in color specified by (r′, g′, b′). Let Nr, Ng, and Nb denote the fragment-visible mosaic images is described in the following. numbers of levels of the new color values r′, g′, and b′, respectively. Then, for each pixel P′ in I′ with new colors (r′, Algorithm 1: construction of candidate image database. 
Input: a set S of images, a pre-selected tile image size Zt, g′, b′), the following 1-D function value f is computed: and a pre-selected candidate image size Zc. f(r′, g′, b′) = r′ + Nrg′ + NrNgb′.  Output: a database DB of candidate images with size Zc and However, according to our experimental experience using their corresponding h-colorscale histograms. this 1-D color function f, it is found inappropriate for our Steps: study here where the human’s visual feeling of image Step 1. For each input image I, perform the following steps. similarity must be emphasized, as shown by Fig. 1(a). 1.1 Resize and crop I to yield an image D of size Zc. Therefore, we propose a new function h as follows: 1.2 Divide D into blocks of size Zt. 1.3 For each block C of D, calculate and round off the  h(r′, g′, b′) = b′ + Nbr′ + NbNrg′  h-feature value hC described by Eq. (3). where the numbers of levels, Nr, Ng, and Nb, are all set to be 1.4 Generate a histogram H of the h-feature values of all 8. Differently from the case in (1), we set in (2) the largest the blocks in D. weight NbNr to the green channel value g′ and the smallest 1.5 Save H with D into the desired database DB. weight 1 to the blue channel value b′. The reason is that the Step 2. If the input images are not exhausted, go to Step 1; eyes of human beings are the most sensitive to the green otherwise, exit. color, and the least sensitive to the blue one. In addition, 2.2 Similarity Measure Computation with all of Nr, Ng, and Nb set to 8 in (2), an advantage of speeding up the process of mosaic image creation can be Before generating a mosaic image, we have to choose as obtained according to our experiments. Subsequently, we the target image the most similar candidate image from the will say that the new color feature function h we propose database based on the given secret image content. For this, we define a difference measure e between the 1-D histogram 1134
    HS of thesecret image S and that of a candidate image D in edge of the graph with its label taken to be that of the tile the database in the following way: image and its weight taken to be the average Euclidean 584 distance between the pixels’ colors of the selected tile image  e   Hs  m  HD  m   and those of the target block. Accordingly, we can build a m 0 tree structure as the graph for this problem, as shown by Fig. where m stands for a h-feature value. The smaller the value e 2. is, the more similar the candidate image D is to the secret image S. After calculating the errors of all the images in the database, we can select the one with the smallest error as the desired target image for use in mosaic image generation. The detail of selecting the most similar candidate image from a database is given as follows. Algorithm 2: selection of the most similar candidate image as a target image. Input: a secret image S, a database DB of candidate images, and the sizes Zt and Zc mentioned in Algorithm 1. Output: the target image T in DB which is the most similar to S. Figure 2. A tree structure of fitting tile images to target blocks. Steps: Step 1. Resize S to yield an image S′ of size Zc to become of In order to find the optimal solution, we may utilize the the same size as the candidate images in DB. Dijkstra algorithm whose the running time for getting an Step 2. Divide S′ into blocks of size Zt, and perform the optimal answer is O(|V|2), where V denotes for the number following steps. of vertices in the tree. Unfortunately, according to Fig. 2 the N 1 2.1 For each block C of S′, calculate its h-feature value number of vertices in this problem is   1)!/n!] where n1 hC by Eq. (3) and round off the result. 2.2 Generate a 1-D h-colorscale histogram HS′ for S′ N is the number of target blocks which is larger than 40,000 from the h-feature values of all the blocks in S′. for images used in this study, and so the computation time Step 3. For each candidate image D with 1-D h-colorscale for getting an optimal solution for such a large N is histogram HD in DB, perform the following steps. obviously too high to be practical! This means that we have 3.1 Compute the difference measure e between HS' and to find other feasible solutions to solve this problem. HD according to Eq. (4) described above. The solution we propose is to use a greedy algorithm. We 3.2 Record the value e. calculate the average Euclidean distance between the pixels’ Step 4. If the images in DB are not exhausted, go to Step 3; colors of a tile image T and those of a target block B as the otherwise, continue. similarity measure between T and B; and then use the Step 5. Select the image in DB which has the minimum measure as a selection function for the greedy algorithm to difference measure e and take it as the desired target select the most similar target block for tile image fitting. image T. However, as shown by the example of Fig. 4(a) which is the result of using such a greedy algorithm to fit the tile images 2.3 Algorithm for Secret-fragment-visible Mosaic Image of the secret image, Fig. 3(a), into the target image, Fig. 3(b), Creation the algorithm is found unsatisfactory, yielding often a result Before presenting the algorithm for creating the proposed with the lower part of the target image being filled with mosaic images, we discuss some problems which are some fragment pieces of inappropriate colors. 
This encountered in the creation process and present the solutions phenomenon comes from the situation that the number of we propose to solve them. tile images obtained from the secret image, Fig. 3(a), is limited by the secret image’s own size, so that the tile A. Problem of fitting tile images optimally and proposed images available for choice to fit the target blocks in Fig. solution 3(b) become less and less near the end of the fitting process. The first problem faced in the creation process is how to As a result, the similarity differences between the later-fitted find an optimal solution for fitting a tile image of the secret tile images and the chosen target blocks become bigger and image into an appropriate target block in a target image bigger than the earlier-fitted ones, yielding a poorly-fitted selected by Algorithm 2. For this, it seems that we can bottom part like that shown in Fig. 4(a). reduce it to a single-source shortest path problem. The A solution to this problem found in this study is to use the shortest path problem is one of finding a path in a graph previously-proposed h-feature to define the selection with the smallest sum of between-vertex edge weights. The function for the greedy algorithm. This feature takes the state of fitting a tile image may be represented by a vertex global color distribution of an image into consideration, of the graph. And the action of selecting the most similar which helps creation of a mosaic image with its content tile image for each target block may be represented by an resembling the target image more effectively, as shown by 1135
    the example ofFig. 4(b) which is an improvement of Fig. of proposed secret-fragment-visible mosaic images is 4(a). described in the following. (a) (b) (a) (b) Figure 3. Input images. (a) A secret image. (b) A selected target image. Figure 5. Input images. (a) A secret image. (b) A selected target image. (a) (b) Figure 6. Resulting images. (a) Image created without the proposed (a) (b) remedy method, which is four times as large as (b). (b) Figure 4. Resulting images using different similarity measures. (a) Image created with the proposed remedy method. Image created using Euclidean distance to define select function of greedy algorithm. (b) Image created using h- feature to define select function of greedy algorithm. Algorithm 3: mosaic image creation. Input: a secret image S, a database DB, and a selected size B. Problem of small-sized candidate image database and Zt of a tile image. proposed solution Output: A secret-fragment-visible mosaic image R. A second problem faced in the mosaic image creation Steps: process is how to deal with a database which is not large Stage 1  embedding secret image fragments into a enough. This problem will cause the selection of an selected target image. insufficiently similar image from the database as the target Step 1. Crop S to yield an image S′ which is divisible by image for a given secret image. As a result, the created size Zt. mosaic image will look unlike the target one, as shown by Step 2. Perform following steps to select a target image T the example of Fig. 6(a), a mosaic image created with Figs. from DB. 5(a) and 5(b) as the secret and target images, respectively. 2.1 Select a candidate image as T from DB by To solve this problem, during the candidate image Algorithm 2. selection process, after the difference measure between a 2.2 If the difference measure e computed in Step 3.1 of secret image and a candidate image is computed, if the Algorithm 2 is larger than a pre-selected threshold computed value is large, the selected target image is Th, then enlarge the size of T e Th  times.   regarded inappropriate for the creation process. In this case, Step 3. Obtain a block-label sequence L1 of S′ by we enlarge the size of the selected target image as a remedy. calculating and sorting the h-feature values of all The reason is that if the size of the target image is larger tile images in S. than that of the secret image, the number of target blocks, or Step 4. Obtain a block-label sequence L2 of T by equivalently, the number of possible positions, to fit each calculating and sorting the h-feature values of all tile image, will become larger, yielding in general a better target blocks in T. fitting result. In this way, the resulting mosaic image will Step 5. Fit the tile images of S′ to the target blocks of T become better visually than before, as shown by the based on the one-to-one mappings from the ordered example of Fig. 6(b). labels of L1 to those of L2, thus completing C. Algorithm for secret-fragment-visible mosaic image embedding of all the tile images in S′ into all the creation target blocks of T according to the greedy criterion. According to above discussions, an algorithm for creation Stage 2  dealing with unfilled target blocks. 1136
    Step 6. Performthe following steps to fill each remaining unfilled target blocks, B, in T if there is any. 6.1 Compute the difference e′ between the h-feature hB of B and the h-feature hA of each of the tile images, A, in S′ by the following equation: e′ = |hB  hA|. (5) 6.2 Pick out the tile image Ao with the smallest (a) (b) difference eo′ and compare eo′ with another pre- selected threshold Th′ to conduct either of the following two operations: A. if eo′ Th′, then fill the tile image Ao into the target block B; B. if eo′  Th′, then fill the averages of the R, G, and B values of all the pixels in B into B. Stage 3  generating the desired mosaic image. Step 7. Generate as output an image R obtained by composing all the tile images fitted at their respective positions in T. 2.4 Experimental Results of Mosaic Image Creation Some mosaic images generated by the above algorithm (c) are shown in Figs. 7 and 8. Note that in either figure, the secret image of (a) may be thought to have been embedded Figure 8. Another example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible into the target image of (b) to yield the stego-image of (c). mosaic image. The database used in running the algorithm includes 841 candidate images. The size of this database is regarded as large enough because the remedy measure of target image III. COVERT COMMUNICATION VIA SECRET-FRAGMENT- enlargement is rarely used in the mosaic image creation VISIBLE MOSAIC IMAGES process in our experiments. 3.1 Idea of Proposed Covert Communication Method In the proposed mosaic image creation process, tile images with the same h-feature values appear to have similar colors. Each tile image is fitted into a corresponding target block based on the one-to-one mappings established between the two label sequences of the secret image and the selected target image. Note that both sequences have been sorted according to the h-feature values of their image (a) (b) blocks. They are said to have been h-sorted. As a result of such h-sorting, every pair of neighboring labels in either sequence specify two image blocks with similar h-feature values, implying that the average colors of the two blocks are essentially visually similar. The main idea of secret embedding in the proposed covert communication method is to switch the orders of the target blocks in the h-sorted label sequence of the target image during the mosaic image creation process to embed message bits, thus achieving the goal of hiding a secret message into a secret-fragment-visible mosaic image imperceptibly. More specifically, after the label switching, if a leading label is smaller than the following one in the target block (c) label sequence, then a bit “0” is regarded to have been embedded in the two neighboring labels; otherwise, a bit Figure 7. An example of mosaic image creation. (a) Secret image. (b) Target image. (c) Generated secret-fragment-visible mosaic “1” is regarded as embedded there. Furthermore, as shown image. by the example of Fig. 9, because the tile images which 1137
    correspond to thetarget blocks with switched labels have the Step 3. Perform Step 3 to 4 of Algorithm 3 to obtain the h- similar average colors as mentioned previously, after the sorted label sequences L1 and L2 of S′ and T, message is embedded, no visually perceptible difference respectively. will arise in the resulting mosaic image. Step 4. Group the labels of L1 and L2 by the following steps. 4.1 Group the labels of L1 based on the h-feature values of the tile images in S′, with each resulting 1 2 3 4 21 1 2 3 4 21 group including the labels of a set of tile images 5 6 7 8 5 6 7 8 having the same h-feature values. 59 59 4.2 Group the labels of L2 based on the grouping of L1 9 10 11 12 9 10 11 12 obtained in Step 4.1, resulting in groups of labels, G1, G2, …, Gm, with each group Gi including the Target blocks Tile images Target blocks Tile images labels of a set of target blocks whose corresponding tile images have the same h-feature values. (a) (b) Figure 9. Label switching and corresponding corresponding target block Stage 2  embedding the secret message M. exchange. (a) The original one. (b) After switching the Step 5. Generate the histogram H of the h-feature values of corresponding target blocks of tile images. all the tile images in the resized secret image S′. Step 6. Transform the message to be embedded, M, into a 3.2 Modified Secret-fragment-visible Mosaic Image bit string M′. Creation Process for Secret Message Embedding Step 7. Perform the following steps to embed the bits of M′ In the proposed covert communication method, the into L2. mappings of the labels of the tile images to those of their 7.1 Select the smallest unprocessed h-feature value hi corresponding target blocks is recorded in a recovery whose histogram value H(hi) is larger than or equal sequence LR for use in later data recovery. An illustration is to two. shown in Fig. 10. Embedding of LR is then accomplished by 7.2 Take out the group Gi of labels in L2 corresponding hiding the labels of LR into the tile images randomly by the to the h-feature value hi. lossless LSB-modification scheme [11] controlled by a 7.3 Take out the first two unprocessed labels l1 and l2 secret key. The detailed algorithm for secret message in Gi, and switch the order of l1 and l2 in L2 if the embedding is now given as follows, which is a modified following two conditions are satisfied, assuming version of Algorithm 3. that the first unembedded bit in M is denoted as b: A. b = 0 and l1 l2; B. b = 1 and l1 l2. 7.4 Repeat Step 7.3 until Gi includes at most one label, which is left untouched. 7.5 Repeat Steps 7.1 through 7.4 until the bits of M′ are exhausted. Step 8. Form an extra string M′ of 8 bits of “0” as the ending signal of the input message M, and embed it into L2 by Step 7 above. Step 9. Fit the tile images of S′ into the target blocks of T based on the one-to-one mappings from the labels Figure 10. An illustration of generation of a recovery sequence LR. of L1 to those of the re-ordered L2 obtained in Steps 7 and 8 (denoted as L2′ subsequently), and let the Algorithm 4: embedding a message into a secret-fragment- resulting image be denoted as T′. visible mosaic image. Stage 3  dealing with unfilled target blocks, generating Input: a secret image S, a secret key K, the size Zt of tile and embedding the recovery sequence, and images, a database DB, and a secret message M. generating the desired mosaic image. Output: a secret-fragment-visible mosaic image R into Step 10. 
Perform Step 6 of Algorithm 3 to fill each of the which M is embedded. remaining unfilled target blocks if there is any. Steps: Step 11. Sort all the labels in L1 by their corresponding h- Stage 1  embedding secret image fragments into a feature values, re-order accordingly the selected target image. corresponding labels in L2′, take the re-ordering Step 1. Crop S to yield S′ with a size divisible by Zt. result as a recovery sequence LR, and transform it Step 2. Select a target image T for S′ with histogram H into a binary string. from the database DB by Algorithm 2. Step 12. Embed the width and height of S′ as well as the 1138
    size Zt intothe first ten pixels of image T′ in a key K. raster-scan order by the LSB modification scheme. Step 3. Compose the desired secret image S based on the Step 13. Embed the data of LR by the same scheme into sequence LR by extracting the tile images fitted in R unprocessed tile images of T′ randomly selected by in order and placing them at correct relative the secret key K. positions. Step 14. Generate as output an image R obtained by Stage 2  regaining the h-sorted label sequences. composing all the tile images fitted at their Step 4. Get the h-sorted label sequence L1 of the tile respective positions in T′. images of the recovered secret image S, and group the labels of L1 based on the h-feature values of the 3.3 Secret Extraction Process tile images, with each resulting group including the In the proposed secret message extraction process, we labels of the tile images having the same h-feature extract the recovery sequence LR first and retrieve values. accordingly the original secret image S. Also, by calculating Step 5. Perform the following steps to get the h-sorted the h-feature values of the original secret image, we regain label sequence L2 of the target image T. the h-feature values of the tile images and sort them to get 5.1 Get the re-ordered block sequence QT of the target the h-sorted label sequence L1. blocks in T by one-to-one-mapping the labels in Next, as illustrated by Fig. 11, the recorded sequence LR, sequence L1 to those of LR. though including only the labels of L2, essentially specifies 5.2 Get a new h-sorted label sequence L2 from the one-to-one mappings between the tile images and the target labels of the re-ordered block sequence QT. blocks. Therefore, we may regain the h-sorted label 5.3 Group the labels of L2 based on the grouping of L1 sequence L2 of the target blocks from the corresponding conducted in Step 4 with each group Gi including mappings from L1 to LR. Then, with the histogram H of the the labels of the target blocks whose corresponding h-feature values of all the tile images, we may group the tile images have the same h-feature values, hi. labels of sequences L1 and L2, and then examine the orders Stage 3  extracting the embedded secret message M. of the labels of L2 to extract the embedded secret message, Step 6. Generate the histogram H of the h-feature values of in a way reverse to the message embedding process as all the tile images in the recovered secret image S. described in Algorithm 4. Step 7. Perform the following steps to extract the bits of secret message M. 7.1 Select the smallest h-feature value hi whose histogram value H(hi) is larger than or equal to two. 7.2 Take out the group Gi of labels in L2 corresponding to hi. 7.3 Take out the first two unprocessed labels l1 and l2 in Gi, extract a hidden message bit b by the following rule, and append it to the end of a bit version of the message, denoted as D: A. if l1  l2, then set b = 0; B. if l1 l2, then set b = 1. 7.4 Repeat Step 7.3 until Gi includes at most one label, which is then left untouched. 7.5 Repeat Steps 7.1 through 7.4 if the 8-bit end signal Figure 11. An illustration of the regaining of the label sequence L2. is not extracted (i.e., if the last extracted 8 bits are not a sequence of eight “0’s”). Algorithm 5: secret image recovery and secret message Step 8. Transform every 8 bits of D into characters as the extraction. desired secret message M. Input: a secret-fragment-visible mosaic image R, and a secret key K identical to that used in Algorithm 4. 
3.4 Experimental Results Output: a recovered secret image S, and the secret message An example of experimental results using Algorithms 4 M supposedly embedded in R. and 5 is given in Fig. 12. The average difference measure Steps: value at the block level between Fig. 8(c) and Fig. 12 Stage 1  retrieving the secret image S. (computed as the sum of all the Euclidean distances divided Step 1. Retrieve the width and height of S′ as well as the by the number of blocks) is 0.05, and the PSNR of Fig. 12 size Zt of tile images from the LSB’s of the first ten with respect to Fig. 8(c) is 66.6 which is quite satisfactory, pixels of image R. meaning that the proposed information hiding method Step 2. Extract the recovery sequence LR from the LSB’s (implemented by Algorithms 4 and 5) provides a good effect of blocks in R randomly selected using the secret on covert communication. 1139
    colorscale to representthe color distribution of an image more effectively, based on which a new h-feature is proposed for measuring image similarity. A greedy algorithm is proposed accordingly for fitting the tile images of the secret image into appropriate target blocks more efficiently. A remedy method has also been proposed to solve the problem of using a small-sized database, which enlarges a selected target image in proportion to the difference measure between the secret and the target images. For the proposed data hiding method used in covert communication via secret-fragment-visible mosaic images, it was observed that the tile images in an identical bin have (a) similar colors. By switching the relative positions of the target blocks corresponding to such tile images, we can embed secret message bits into a secret-fragment-visible mosaic image imperceptibly. Future works may be directed to allowing users to select target images freely to create secret-fragment-visible mosaic images. This seems achievable by applying a reversible color shifting technique to fit the color distribution of the secret image to a selected target image. REFERENCES [1] P. Haeberli, “Paint by numbers: abstract image representations,” Proc. SIGGRAPH 99, pp.207-214, Dallas, USA, 1990. [2] A. Hausner, “Simulating decorative mosaics,” Proceedings of 2001 International Conf. on Computer Graphics Interactive Techniques (SIGGRAPH 01), Los Angeles, USA, August 2001, pp. 573-580. [3] Y. Dobashi, T. Haga, H. Johan and T. Nishita, “A method for creating (b) mosaic image using voronoi diagrams,” Proc. 2002 European Association for Computer Graphics (Eurographics 02), Saarbrucken, Germany, September 2002, pp. 341-348. [4] J. Kim and F. Pellacini, “Jigsaw image mosaics,” Proc. 2002 International Conf. on Computer Graphics Interactive Techniques (SIGGRAPH 02), San Antonio, USA, July 2002, pp. 657-664. [5] G. D. Blasi, G. Gallo and M. Petralia, “Puzzle image mosaic,” Proc. 2005 Int’l Association of Science Technology for Development on Visualization, Imaging Image Processing (IASTED/VIIP 2005), Benidorm, Spain, Sept. 2005. [6] W. L. Lin and W. H. Tsai, “Data hiding in image mosaics by visible boundary regions and its copyright protection application against print-and-scan attacks,” Proc. 2004 Int’l Computer Symp. (ICS 2004), Taipei, Taiwan, Dec. 15-17, 2004. [7] C. C. Wang and W. H. Tsai, Creation of Tile-overlapping mosaic images for information hiding, Proc. 2007 Nat’l Computer Symp., Taichung, Taiwan, Dec. 20- 21, 2007, pp. 119-126. [8] S. C. Hung, D. C. Wu and W. H. Tsai, “Data hiding in stained glass images,” Proc. 2005 Int’l Symp. on Intelligent Signal Processing Communications Systems, June 2005, Hong Kong, pp. 129-132. (c) [9] C. Y. Hsu and W. H. Tsai, “Creation of a new type of image - circular Figure 12. An example of covert communication. (a) A mosaic image dotted image - for data hiding by a dot overlapping scheme,” Proc. into which messages are hidden. (b) Resulting image and 2006 Conf. on Computer Vision, Graphics Image Processing, extracted messages using a right key. (c) Resulting image and Taoyuan, Taiwan, Aug. 13-15, 2006. extracted messages using a wrong key. [10] C. P. Chang and W. H. Tsai, “Creation of a new type of art image  tetromino-based mosaic image  and protection of its copyright by losslessly-removable visible watermarking,” Proc. 2009 Nat’l Computer Symp., Taipei, Taiwan, Nov. 27-28, 2009, pp. 577-586. IV. CONCLUSIONS AND SUGGESTIONS [11] D. Coltuc and JM. 
Chassery, “Very fast watermarking by reversible A new type of art image  secret-fragment-visible mosaic contrast mapping,” IEEE Signal Processing Letters, vol. 14, no. 4, pp. 255-258, April 2007. image, and a data hiding technique have been proposed for [12] J. R. Smith and S. F. Chang, “Tools and techniques for color image secret image hiding and covert message communication, retrieval,” Proc. Society for Imaging Science Technology SPIE respectively. For the former, we have proposed a new 1-D h- (IS T/SPIE), vol. 2670, Feb. 1995, pp. 2-7. 1140
    A Practical Designof High-Volume Steganography in Digital Videos Ming-Tse Lu, Po-Chyi Su and Ying-Chang Wu Dept. of Computer Science and Information Engineering National Central University Jhongli, Taiwan Email: [email protected] Abstract—In this research, we consider to exploit the available to most of the people and their transmission is large volume of audio/video data streams in compressed increasingly popular. This research aims at developing a video clips/files for effective steganography. By observing steganographic scheme for popular digital video files. that most of the widely distributed video files employ H.264/AVC and MPEG AAC for video/audio compression, H.264/AVC is the state-of-the-art video codec and its we examine the coding features in these data streams to decent coding performance lends itself to become the determine good choices of data modifications for reliable major coding mechanism in various applications. The most and acceptable information hiding, in which the perceptual popular digital video formats/containers for file sharing quality, compressed bit-stream length, payload of embedding, nowadays, including FLV (Flash Video), MKV (Matroska effectiveness of extraction and efficiency of execution are taken into account. Experimental results demonstrate that Multimedia Container), AVI (Audio Video Interleave) and the payload of the selected features for achieving a good MP4, etc., support H.264/AVC so we choose H.264/AVC balance among several constraints can be more than 10% of as the host. In addition, since FLV has become very the compressed video file size. popular in file sharing these days, we wrap the resulting Keywords-Steganography; H.264/AVC; MPEG; AAC; in- H.264/AVC video bit-stream into a FLV file for the future formation hiding; usage. As FLV files contain both video and audio data streams for playback, we will make use of both video and I. I NTRODUCTION audio data streams to embed as much secret information as Digital videos are widely available nowadays thanks to possible. The chosen audio format is MPEG AAC, which the fast advances of increasingly cheaper yet powerful is usually adopted by FLV files. computer facilities and broadband internet technologies. Two embedding scenarios may be considered in this It is now possible to stream high-quality videos on the application. First, an user may acquire a compressed video Internet and such web sites as YouTube, Yahoo! Video, file that is not coded by H.264/AVC, e.g. an MPEG2 or DailyMotion, etc. offer free video viewing, sharing or MPEG4 related file. In order to embed the information, downloading services. Watching videos anytime and any- this video file will be transcoded into an H.264/AVC bit- where may become people’s daily activity as the portable stream so that the secret information can be embedded devices may become more and more popular. As a re- during the encoding process. The resultant H.264/AVC sult, digital videos are ubiquitous and will be the major bit-stream will then become the video stream of an FLV circulated multimedia content. Due to the large volume file. If the input video is already an FLV file compressed of digital videos, data compression is usually applied to by H.264/AVC, the information may be embedded more facilitate their transmission and storage. Since human’s efficiently since the existing coding parameters in the perceptual models are not perfect, lossy compression is original video file can be referenced. 
It should be noted usually preferred to increase the coding efficiency of that both of the embedding procedures will be carried out digital videos without affecting human’s perception. In in the encoding phase so the transcoding is always needed. other words, there exists certain redundancy in digital To achieve the high-volume information hiding and to video files. Nevertheless, in the viewpoint of communica- retain the fidelity of the audio/visual data, we investigate tion, this redundancy can serve as an “invisible” channel combinations of embedding methods to satisfy most of and, if one can make good use of it, the high-volume the requirements or restrictions. The paper is organized secret communication using digital videos as a camouflage as follows. Some previous works will be described in is achievable. The secret communication is also termed Sec. II and our proposed scheme is delineated in Sec. III. “steganography”, which means “cover writing”, and can be Experimental results will be shown in Sec. IV to validate applied to transmit sensitive information between trusted the trade-offs that we make among several different re- parties or when the encryption is not allowed or safe quirements. The conclusive remarks are given in Sec. V. in the normal communication channel. There are a few requirements in steganography, including the high payload II. R EVIEW OF THE R ELATED W ORKS of hidden information, unobtrusiveness of the distortion, Unlike digital watermarking, in which the embedded security and reliability. To achieve the secure, high-volume information should be able to withstand some common and reliable covert communication, digital videos can processing attacks such as re-compression at a different serve as a good host, especially when these files are bit-rate, random video frame dropping, resizing, etc., 1141
    the high-volume steganographyemphasizes more on the III. T HE P ROPOSED S CHEME payload, reliability and the difficulty of detection even A. System Overview with steganalysis [1], which is a process for revealing the The block diagram of our proposed scheme is shown in existence of certain hidden information in a suspicious Fig. 1. A video file is parsed first to extract the video and video. Of course, the quality should always be main- audio data streams. As the transcoding will be applied, the tained to avoid affecting its applications. Some data hiding video and audio decoder will extract the compressed bit- schemes in digital videos [2]–[7] have been proposed. As streams into raw data. After obtaining the reconstructed the video file consists of a large number of frames, the video and audio signals, H.264/AVC and AAC encoders similar data hiding techniques of still images may also will encode the raw data right after they are extracted and be applied on videos. The most widely used technique the embedding procedure is triggered. If the input video to hide the data in digital images or video data is the file is processed by H.264/AVC already, a mode copying usage of the Least-Significant Bit (LSB) modification [8], procedure that records features of the original video stream [9], in which the LSB’s of samples, usually coefficients or may be applied to speed up the whole process. The hidden quantization indices if the compressed data are used, are information can be extracted efficiently in the decoder. substituted by the secret message bits. JStego, F3, F4 and F5 described in [8] are popular approaches. In the original ¤£¢¡  JStego algorithm, the LSB’s of JPEG residual coefficients ©¨§¦¥ ¢ ¨ ¨ are overwritten with the binary secret message consisting ¨¡¦ of “0” and “1”. JStego skips the embedding operation CD 32©1¨§©0 EFG when it encounters 0 and ±1 to avoid generating zero ¢ ©¨§¦¥ P HI values, which will cause ambiguity to the hidden infor- mation extraction. Other values are grouped into pairs, i.e. ¨$©( ©¨§¦¥ ! ©¦§'% %% £¨$'¢¨ $¨§©#¨ $¨§©#¨ (±2, ±3), (±4, ±5)... In F3 algorithm, the LSB of non- ©¦¢)$©¦ ¦§§¨A)4 zero coefficients will be matched with the secret message after the information embedding, which decreases the ©¨§¦¥ ! ©¦§'% %% absolute values of coefficients. If the coefficient becomes ¦££%1¨§©0 $¨§©#4 $¨§©#4 ¨$#¨( zero after this modification operation, we will embed ¨¢££¨0 this bit once again in the next sample. F4 algorithm is ©¨§¦¥ ¤£¢¡  developed to complement the weakness of F3 algorithm. In $¨'0 986765 F4 algorithm, a negative coefficient is presented inversely. @ ¡©$© In F5 algorithm, permutating straddling process is adopted @ ¢¢ RQ B¨¡¦ © §4 to improve the perceptive characteristic and enhance the security level. In addition, the so-called matrix coding is Figure 1. The flowchart of the proposed scheme applied to avoid modifying too many data samples. In addition to modifying the coefficients, some re- searchers employ the characteristics of popular video B. Steganography in H.264/AVC compression standards. Fang i.e. [4] proposed to embed H.264/AVC offers a much better compression perfor- the data into motion vectors’ phase angle in the inter- mance than the existing video codecs due to its vari- frames. Wang et al. [10] utilized motion vectors in P and ous encoding tools. As previous video coding standards, B-pictures as the data carriers for hiding the copyright H.264/AVC is based on motion compensated, DCT-like information. 
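The coefficient-domain techniques summarized above are easy to illustrate. The sketch below follows the F3 behaviour as described here (not the original implementation): the LSB of each non-zero quantized coefficient is matched to the next message bit by decreasing the coefficient's absolute value, and a coefficient that shrinks to zero forces the same bit to be re-embedded in the next usable coefficient.

    def f3_embed(coefficients, message_bits):
        """F3-style LSB matching on quantized coefficients (illustrative only)."""
        out, k = list(coefficients), 0
        for i, c in enumerate(out):
            if k >= len(message_bits):
                break                              # message exhausted
            if c == 0:
                continue                           # zeros never carry data
            if (abs(c) & 1) == message_bits[k]:
                k += 1                             # LSB already matches the bit
            else:
                c = c - 1 if c > 0 else c + 1      # move the magnitude toward zero
                out[i] = c
                if c != 0:
                    k += 1                         # bit embedded
                # if c collapsed to zero, the same bit is retried on the next sample
        return out, k                              # k = number of bits embedded

    # Example: the first and third coefficients already match, the second is shrunk.
    print(f3_embed([3, -2, 5], [1, 1, 1]))         # ([3, -1, 5], 3)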
An motion vector is selected based on its transform coding. Each picture is compressed by partition- magnitude and its angle guides the modification opera- ing it as one or more slices; each slice consists of mac- tion. Yang G, et al. [11] employed the intra-prediction roblocks, which are blocks of 16 × 16 luma samples with mode and matrix coding. They mapped the two secret the corresponding chroma samples. Each macroblock may message bits to every three intra 4×4 blocks by the matrix also be divided into sub-macroblock partitions for motion coding. Kim et al. [1] proposed an entropy coding based prediction. The prediction partitions can have seven differ- watermarking algorithm to achieve the balance between ent sizes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4. The the capacity of watermark bits and the fidelity of the video. large variety of partition shapes and the quarter sample One bit of information is embedded in the sign bit of the compensation provide enhanced prediction accuracy. In trailing ones in context-adaptive variable length coding intra-coded slices, 4 × 4 or 16 × 16 intra spatial prediction (CAVLC) of the H.264/AVC stream. The transcoding based on neighboring decoded pixels in the same slice process may thus be avoided but the drift errors resulting will be applied. The 4 × 4 spatial transform, which is an from the different reference frame content may appear. approximate DCT and can be implemented with integer In this paper, we will try to make good use of the coding operations with a few additions/shifts, will be calculated features in video/audio data streams to maximize the ca- for the residual data. The point by point multiplication in pacity of embedded data. Our work can be applied to any the transform step will be combined with the quantization container format that can de-multiplex the video stream of step and implemented by simple shifting operations to H.264/AVC and audio stream of AAC compression. achieve efficiency. CAVLC or CABAC will be used for 1142
Our video embedding scheme is integrated into the H.264/AVC encoding process: the quantization, intra prediction and motion estimation procedures in the encoder are modified.

1) Employing Intra Prediction Modes: In H.264/AVC, the intra prediction of the luma and chroma of a frame is quite important for reducing the coding redundancy, since a coding block is usually related to its neighbors. Four 16×16 or nine 4×4 intra prediction modes can be applied to the luma, while four 8 × 8 prediction modes are used for the chroma. Fig. 2(a) shows the 4×4 intra prediction. Since the samples above and to the left (labeled A to M) of the current block have been encoded/reconstructed previously and are available to both the encoder and the decoder, nine prediction modes, including eight directions and one DC prediction, can be calculated. It should be noted that, if the neighboring upper or left block of the current block is not available, the number of available modes is reduced. For instance, if the upper block is available while the left block is not, only the "horizontal", "DC" and "horizontal up" modes can be chosen.

Figure 2. (a) Labeling of prediction samples (4×4) and (b) the directions of the 4×4 intra prediction modes.

Our proposed scheme only utilizes the nine 4 × 4 intra prediction modes, since the content of these blocks is usually complicated and the blocks are therefore suitable for information hiding. Compared with the 16 × 16 luma and 8 × 8 chroma prediction modes, the 4 × 4 intra prediction modes offer finer prediction, so modifying them affects the coding performance less. To embed the information, one may think of grouping the nine modes into pairs so that one bit can be cast in each 4×4 subblock. However, the resultant bit-stream length will be considerably increased. We take the "Container" video compressed with a fixed Quantization Parameter (QP) equal to 30 as an example. By doing so, the payload can reach 2.07% of the total bit-stream size, which is inappropriately enlarged by 6.72%. The reason is that the correlation between the intra prediction modes of adjacent blocks is not taken into account. In H.264/AVC, the mode of the current block is first predicted by the minimum of the prediction modes of its two neighbors, i.e. the upper and left blocks. If the mode matches the predicted one, only one flag bit called "Most Probable Mode" (MPM) is asserted and sent. Otherwise, this flag bit is set to "0" and three extra bits are also sent to signal which of the remaining eight modes is used. We only modify the modes when the flag bit is "0", since such a block tends to differ more from its neighbors and is suitable for embedding. Besides, using the MPM to encode a mode should appear in normal videos, so we have to keep this situation in our "stego" video. Our scheme divides the eight modes into two groups to represent the binary secret information, and the division (classification) is applied according to Fig. 2. If the DC mode is not the MPM, we replace the direction of the MPM by the DC mode and then assign "0" and "1" to the prediction directions, which are known by the embedder and the detector. Rate Distortion Optimization (RDO) is employed to determine a better prediction mode. Although the payload becomes 0.79% of the compressed bit-stream size, the increment of file size is less than 3%. If the input video is already an H.264/AVC video stream, we may reference the prediction mode in the original video and determine pairs of modes to represent "0" and "1" directly. If the execution-time constraint is not that strict, we still suggest applying RDO to find a better mode, since it is not easy to predict a good selection just based on the mode in the incoming/original video. Besides, the computational load is not increased much by RDO, since we only have four candidate selections to embed one bit.

2) Employing Inter Prediction: The inter prediction provides a reference from one or more previously encoded video frames for effective encoding. In order to acquire precise motion vectors, H.264/AVC adopts quarter-pixel precision for motion compensation. The last two bits of a motion vector indicate its more accurate location. We basically make use of the last bit of motion vectors for effective information hiding without affecting the coding performance severely. Since transcoding is applied, the Sum of Absolute Differences (SAD) of the investigated motion vectors whose Least Significant Bit (LSB) equals the hidden bit is available, so we can examine them to find good motion vectors. Again, as in the intra prediction, the motion vectors of neighboring partitions are often highly correlated. After determining the motion vectors by motion estimation, H.264/AVC first predicts the motion vector from the nearby, previously coded partitions. After obtaining a predicted motion vector MV_predicted, the difference between the current motion vector, MV_current, and MV_predicted is calculated and encoded. This motion vector difference is termed "MVD" and is formed in the same way at the decoder. In our scheme, we actually modify the data of the MVD instead of the motion vectors themselves, so the detector can extract the hidden information efficiently. Furthermore, we skip the partitions with MVD equal to 0 and avoid generating a new zero MVD. This strategy limits the file size increment and makes the statistics of motion vectors look normal. If the only motion vector with a reasonably small SAD is one whose MVD would become zero after the information embedding, we choose this motion vector anyway and embed the bit once again in the following partition.

It should be noted that the effect of MVD embedding is more obvious if we use a fixed QP to compress a video. We illustrate this by compressing "Container" with QP equal to 30.
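To make the two-group intra-mode embedding concrete, the following sketch shows one way it could be realized in code. It is a minimal illustration under stated assumptions rather than the authors' implementation: the particular split of the modes into the two groups and the rd_cost callback standing in for the encoder's RDO are placeholders, and the DC-for-MPM substitution described above is omitted for brevity.

# Sketch of embedding one secret bit by choosing a 4x4 intra prediction mode.
# Mode numbering follows H.264/AVC (mode 2 is DC); the split of the remaining
# modes into GROUP_0/GROUP_1 is an assumed convention shared by embedder and detector.

GROUP_0 = {0, 3, 4, 7}   # modes taken to carry secret bit "0" (assumption)
GROUP_1 = {1, 5, 6, 8}   # modes taken to carry secret bit "1" (assumption)

def embed_bit_in_intra_mode(bit, mpm, rd_cost):
    """Return a 4x4 intra mode encoding `bit`, or None to leave the block untouched.

    bit     : 0 or 1, the secret bit to embed.
    mpm     : Most Probable Mode predicted from the upper and left blocks.
    rd_cost : callable mode -> rate-distortion cost (stands in for the encoder's RDO).
    """
    candidates = GROUP_0 if bit == 0 else GROUP_1
    # Blocks coded with the MPM flag are not modified, so the MPM itself is
    # never selected as a carrier; this keeps the MPM statistics looking normal.
    candidates = [m for m in candidates if m != mpm]
    if not candidates:
        return None
    # Among the (at most four) candidate modes, let RDO pick the cheapest one.
    return min(candidates, key=rd_cost)

The detector would simply check which group the decoded mode belongs to whenever the MPM flag is zero.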
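The MVD-based embedding of the inter-prediction subsection can be sketched in the same spirit. The sketch assumes motion-estimation candidates are available as (motion vector, SAD) pairs and uses the LSB of the horizontal MVD component as the carrier, which is an assumption; the rules for skipping a zero MVD and for refusing to create a new zero MVD follow the description above.

def embed_bit_in_mvd(bit, candidates, mv_predicted):
    """Pick a motion vector whose MVD carries `bit` in its LSB.

    bit          : 0 or 1, the secret bit.
    candidates   : list of ((mvx, mvy), sad) pairs from motion estimation (quarter-pel).
    mv_predicted : predicted motion vector (mvx, mvy) in quarter-pel units.
    Returns (motion_vector, embedded); embedded is False when the partition is skipped
    and the same bit must be embedded in the following partition.
    """
    def mvd_of(mv):
        return (mv[0] - mv_predicted[0], mv[1] - mv_predicted[1])

    def carrier_lsb(mvd):
        return mvd[0] & 1          # horizontal component chosen as carrier (assumption)

    best_mv, _ = min(candidates, key=lambda c: c[1])
    best_mvd = mvd_of(best_mv)
    if best_mvd == (0, 0):
        return best_mv, False      # zero-MVD partitions are skipped entirely
    if carrier_lsb(best_mvd) == bit:
        return best_mv, True       # the best vector already carries the bit

    # Otherwise look for the lowest-SAD candidate that carries the bit without
    # turning the MVD into a new zero vector.
    usable = [(mv, sad) for mv, sad in candidates
              if mvd_of(mv) != (0, 0) and carrier_lsb(mvd_of(mv)) == bit]
    if not usable:
        return best_mv, False      # keep the good vector; re-embed the bit later
    mv, _ = min(usable, key=lambda c: c[1])
    return mv, True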
Fig. 3 shows the proportion that each coding feature occupies in the compressed bit-stream. Fig. 3(a) shows that the proportion of luma components from inter blocks is 39% if nothing is embedded, while the proportion of luma components from inter blocks is extended to 49% as shown in Fig. 3(b). That is, a large increment appears in the residuals of the inter blocks. If the bit-stream size increment has to be strictly limited, we should avoid embedding the information in motion vectors. However, the increased residuals may be helpful in the information embedding of quantization indices, which will be discussed later.

Figure 3. Pie chart of "Container" with (a) nothing being embedded and (b) MVD being modified.

Here, we test the videos "Garden" and "Container", coded with a target bit-rate of 2 Mbps, to explain our strategy. Table I shows the comparison of modifying the motion vectors (MV) directly without considering the MVD, and the MVD embedding. The payload is calculated in bits per frame (bpf). In our viewpoint, the MVD embedding is still the better choice as the quality of the video is less affected, although the payload is decreased due to the skipping of MVDs that are equal to zero, especially in such a static video as "Container".

Table I. The performance comparison of MV and MVD embedding (bit-rate: 2 Mbps)

  Video Name   MV PSNR   MV Payload (bpf)   MVD PSNR   MVD Payload (bpf)
  Garden       29.78     8461               30.56      5148
  Container    40.70     29372              43.41      8024

3) Quantized Coefficients Embedding: After the intra and inter predictions and compensation, the prediction residuals are generated and occupy a large portion of the video stream. It is advantageous to utilize these residuals to achieve high-volume information hiding. In our scheme, both luma and chroma residuals are embedded. As mentioned before, popular methodologies to achieve high-volume steganography in samples without affecting the perceptual quality of videos include the JStego, F3, F4 and F5 algorithms. In our scheme, the F4 algorithm is adopted to achieve effective information hiding in the residuals. Algorithm 1 shows the pseudo code of the embedding loop of F4. For each non-zero AC coefficient coe, if it is a positive number and its LSB is not equal to BitToEmbed, its absolute value is decreased. On the other hand, for a negative coefficient whose LSB is equal to BitToEmbed, the modification has to be applied and the coefficient is increased by 1, i.e. the LSB of this negative number equals the inverted target bit. After the embedding operation, it is required to check whether the index becomes 0. If yes, the bit would be skipped by the decoder, so it has to be embedded again.

Algorithm 1. F4 Algorithm
Input: BitToEmbed ∈ {0, 1}
 1: for all AC values coe in a block after quantization do
 2:   if coe > 0 ∧ LSB(coe) ≠ BitToEmbed then   /* positive number */
 3:     coe ← coe − 1
 4:   else if coe < 0 ∧ LSB(coe) = BitToEmbed then   /* negative */
 5:     coe ← coe + 1
 6:   else   /* skip zero value */
 7:     continue
 8:   end if
 9:   if coe ≠ 0 then   /* successfully embedded */
10:     /* get the next bit to embed */
11:     BitToEmbed ← GetNextBit()
12:   end if
13: end for

The reason for choosing F4 is as follows. It has been reported that the existence of hidden information can be revealed by checking the statistics of the samples in JStego and F3, since they change the histogram of coefficient frequencies after the information embedding. Besides, the original/natural message induced by the unchanged carrier media may have more steganographic ones than zeros due to the appearance of ±1, so we had better keep this situation. F5 is assumed to be a better approach by using matrix coding, so that less data need to be modified. However, as we would like to embed the information during the encoding process to achieve efficiency, we have to finish coding the data in a subblock before we proceed to encode the next subblock. In matrix encoding, we need to collect 2^m − 1 samples to embed m bits by changing only 1 sample. Since the prediction mechanism of H.264 performs well, a lot of zero indices may exist, and several subblocks may thus be required for collecting 2^m − 1 nonzero samples for the information hiding. As efficient modification is one of our major objectives, F5 is not suitable. In F4, if we require a higher degree of safety, some coefficients may be skipped for embedding, given that both the embedder and the detector know the rule. As described earlier, the information hiding by F4 has a positive side-effect in video coding, as the magnitudes of the resultant coefficients tend to become smaller. When using a fixed QP to encode a video, the video size may even be reduced after information embedding, and this may offset the negative effects from the information hiding by the intra and inter predictions.
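Algorithm 1 can be read as the following runnable sketch, which operates on the quantized AC coefficients of one block and consumes bits from an iterator. It is an interpretation rather than the authors' encoder-integrated code: a non-zero coefficient whose steganographic value already matches the secret bit is counted as embedded here, so that the companion extractor stays synchronized.

def f4_embed(coefficients, bits):
    """Embed secret bits into quantized AC coefficients with the F4 rule.

    coefficients : list of ints, quantized AC values of one block (modified in place).
    bits         : iterator yielding 0/1 secret bits.
    Returns the number of bits embedded in this block.
    """
    embedded = 0
    bit = next(bits, None)
    for i, coe in enumerate(coefficients):
        if bit is None:
            break
        if coe == 0:
            continue                       # zero values never carry information
        if coe > 0 and (coe & 1) != bit:
            coe -= 1                       # positive: LSB must equal the secret bit
        elif coe < 0 and ((-coe) & 1) == bit:
            coe += 1                       # negative: LSB is read inversely
        coefficients[i] = coe
        if coe != 0:
            embedded += 1                  # bit carried; fetch the next one
            bit = next(bits, None)
        # if the coefficient shrank to zero the decoder will skip it,
        # so the same bit is embedded again at the next nonzero coefficient
    return embedded

def f4_extract(coefficients, count):
    """Recover up to `count` bits from coefficients embedded with f4_embed."""
    out = []
    for coe in coefficients:
        if len(out) == count:
            break
        if coe == 0:
            continue
        out.append(coe & 1 if coe > 0 else 1 - ((-coe) & 1))
    return out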
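For completeness, the matrix coding that makes F5 attractive can be shown with a small self-contained sketch: m message bits are carried by 2^m − 1 nonzero samples and at most one steganographic value is flipped. The carriers below are plain LSBs rather than F5's coefficient mapping, and the example values are hypothetical; the sketch only illustrates why 2^m − 1 nonzero samples have to be gathered before anything can be written, which is exactly the property that makes F5 unattractive for the per-subblock encoding used here.

def matrix_embed(carrier_bits, message_bits):
    """F5-style matrix coding: embed m bits into 2**m - 1 carrier bits,
    changing at most one of them."""
    m = len(message_bits)
    assert len(carrier_bits) == (1 << m) - 1
    syndrome = 0
    for pos, b in enumerate(carrier_bits, start=1):
        if b:
            syndrome ^= pos            # XOR of 1-based positions holding a 1
    target = int("".join(str(b) for b in message_bits), 2)
    flip = syndrome ^ target
    out = list(carrier_bits)
    if flip:
        out[flip - 1] ^= 1             # flip exactly one carrier bit
    return out

def matrix_extract(carrier_bits, m):
    """Recover the m embedded bits."""
    syndrome = 0
    for pos, b in enumerate(carrier_bits, start=1):
        if b:
            syndrome ^= pos
    return [int(c) for c in format(syndrome, "0%db" % m)]

# Example with hypothetical values: embedding the two bits (1, 1) into the
# carriers [1, 0, 1] flips only position 1 and yields [0, 0, 1].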
It should be noted that, when the rate control mechanism is enabled, the QP is adjusted along with the encoding process, and the F4 algorithm may help to save some bits in the current frame so that a smaller QP can be assigned in the following frames. In addition, if the bit-stream length is not the major concern, we may try to generate more nonzero indices. As some indices may be quantized into zero values because a large QP is used, we may leverage the coefficients that barely survive by using a smaller QP. For example, if QP = 28 is adopted in a block, we may try a smaller QP = 27 to see whether some zero coefficients/indices become survivors under this smaller QP. If yes, such an index can be changed to ±1 so that we have more nonzero indices to consider.

4) Mode-Copy Procedure: Our embedding methods described above are based on a transcoding process. If the input video is already an H.264/AVC bit-stream, we may record the coding modes during the decoding process, so that the time-consuming mode decision process can be made efficient by referencing the modes in the input H.264 video, as long as the settings of the video, including the GOP structure, the bit-rate, etc., are the same. We thus implement a mode-copy procedure to skip some time-consuming mode decision steps in the encoding process. In our implementation, the coding information that we record includes the frame type, macroblock type, intra- and inter-prediction modes, and motion vectors in quarter-pel units. After decoding a frame, the video encoder assigns those features directly to speed up the whole transcoding process. We compare the typical transcoding and the mode-copy encoding in Table II, where frames per second (FPS) is used as the performance measurement. No information embedding method is applied in either case. It can be seen that employing the mode-copy procedure is competitive with the typical transcoding.

Table II. The comparison of the typical transcoding and mode-copy encoding (bit-rate: 500 Kbps)

  Video Name   Typical transcoding PSNR   FPS     Mode-Copy PSNR   FPS
  Garden       23.62                      14.54   23.53            21.15
  Container    37.77                      11.99   37.51            18.42

For the information embedding, we may use both the intra- and inter-prediction modes as references and try to modify the modes directly without using RDO. If speed is the major concern, this approach is feasible. For the information hiding in the intra-prediction modes, we can replace the prediction direction with DC (if the DC mode is not the MPM) and then group the modes into known pairs according to the prediction directions. For the information hiding in the MVD, we may simply change the bits according to the incoming motion vectors, or use a refined method which calculates the SADs of the adjacent locations to pick a better motion vector. However, in our opinion, we may prefer to run RDO in the intra mode modification, as described before, and omit the motion vector embedding if the coding performance is our major concern.

C. Information Hiding in Advanced Audio Coding

Advanced Audio Coding (AAC) is a standardized compression scheme for digital audio, designed as the successor of the MP3 format. AAC makes use of many advanced coding techniques available at the time of its development to provide high-quality multi-channel audio. Therefore, it has become the kernel algorithm of audio compression standards. At the beginning of encoding, a filter bank is employed to transform the time-domain signal into the frequency domain. Following the time-frequency conversion is a series of prediction mechanisms; those stages attempt to improve the redundancy reduction of previously encoded signals or the joint stereo channel. After the predictions, an iteration loop is applied to quantize the spectral coefficients. The scalefactors of the subbands are obtained and multiplied by all of the coefficients in the corresponding scalefactor band. The number of required bits and the related information are determined to control the trade-off between the audio distortion and the payload. Huffman coding follows, according to the 12 pre-defined Huffman tables. Since the scalefactors and the spectral coefficients occupy a significant part of the coded audio streams, we make use of them to embed the information.

1) Embedding in the Scale Factors: The scalefactors have been used for effective information embedding purposes [12]. In our implementation, scalefactors equal to zero are skipped in the embedding scheme, and the scalefactor bands that use pseudo codebooks in the intensity stereo are also skipped. For nonzero scalefactors, the secret message bit is embedded in the LSB of each scalefactor. The payload of the scalefactor embedding in bytes is shown in Table III, in which two audio clips with different characteristics are employed. Two target bit-rates are employed, i.e. High: 264 Kbps and Low: 132 Kbps. We can see that the payload of the scalefactor embedding reaches around 1 to 3% of the audio stream size.

Table III. The payload of information embedding in audio

  Music      Scalefactor (High)   Scalefactor (Low)   Quantization index (High)   Quantization index (Low)
  A (5:06)   1.55%                2.86%               7.10%                       6.79%
  B (3:51)   1.26%                2.52%               11.90%                      7.72%

2) Embedding in the Quantization Indices: In order to maximize the payload in the audio stream, the spectral coefficients after quantization are also employed. Again, we apply the F4 algorithm to the spectral coefficients. Table III also shows the payload using quantization index embedding; the average payload is around 6 to 12% of the embedded audio file size. It can be seen that "Music B" has a larger payload than "Music A" because Music B has more transient signals than A, and hence more non-zero coefficients exist.
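The scalefactor embedding of Sec. III-C.1 is plain LSB substitution with two skip rules, which a short sketch captures. It assumes the scalefactors of one frame are available as a list of non-negative integers and that bands using pseudo (intensity-stereo) codebooks have already been flagged; both assumptions stand in for the actual AAC encoder data structures.

def embed_in_scalefactors(scalefactors, is_pseudo_band, bits):
    """LSB-embed secret bits into AAC scalefactors.

    scalefactors   : list of non-negative integer scalefactors (modified in place).
    is_pseudo_band : list of bools, True for bands coded with pseudo codebooks
                     for intensity stereo (these bands are skipped).
    bits           : iterator yielding 0/1 secret bits.
    Returns the number of embedded bits.
    """
    embedded = 0
    for i, sf in enumerate(scalefactors):
        if sf == 0 or is_pseudo_band[i]:
            continue                        # skip zero scalefactors and pseudo bands
        bit = next(bits, None)
        if bit is None:
            break
        scalefactors[i] = (sf & ~1) | bit   # overwrite the LSB with the secret bit
        embedded += 1
    return embedded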
IV. EXPERIMENTAL RESULTS

Our results are demonstrated in two parts, i.e. the information embedding in video and in audio streams. In the video embedding part, we evaluate the performance on various videos by first using a fixed QP value and then by enabling the rate control mechanism. In both cases, the payload, the fidelity and the increment of bit-stream size are the major concerns. Six common test videos, namely "Container", "Hall Monitor", "Foreman", "Football", "Garden" and "Mobile", are utilized to verify the proposed video embedding method. The details of the test videos are shown in Table IV. The proposed scheme is integrated within Intel Integrated Performance Primitives (IPP) version 5.2, which is a highly optimized run-time library supporting fast H.264/AVC coding.

Table IV. Details of test videos

  Name           Num. of frames   Frame resolution   No. of MB per frame
  Container      300              352×288            396
  Football       125              352×240            330
  Foreman        300              352×288            396
  Garden         115              352×240            330
  Hall Monitor   300              352×288            396
  Mobile         140              352×240            330

First, we set a fixed QP value equal to 30 for all frames in the test videos to see the effects of the different embedding methods. When a fixed QP is employed, the trade-off between the hidden information payload and the increment of bit-stream size is the major consideration. For each video, we record the payload in bits per frame, and the increments of the bit-stream resulting from the various embedding methods are demonstrated in Table V. We find that the "Football" video provides the largest payload in the modifications of the quantization indices of intra blocks and of the intra mode prediction (IMP), since the high motion in the video leads to more intra blocks. In the quantization index embedding in inter-predicted blocks, the payload of the "Garden" video is even higher than that of "Football", because it not only has high variations within a frame but also high similarity among frames, so more inter blocks exist. In the other embedding modes, high-motion videos always have higher payloads.

Using a fixed QP value is only an experiment to observe the trade-off among the various embedding modes. We should enable the rate control to simulate the scenario of real applications. Under a given target bit-rate, the issue we discuss is the trade-off between the payload and the fidelity of the video. Unlike the fixed-QP mode, we combine all the embedding modes to directly observe the trade-off between the payload and the PSNR under the same target bit-rate, as shown in Fig. 6, in which four videos are tested and their payloads are shown as the solid lines. We can see that "Garden" achieves the best payload performance at all bit-rates. Again, this demonstrates that high-motion videos usually perform better. In fact, under various bit-rates, the payloads of IMP and MVD tend to be independent of the target bit-rate, since they are usually more related to the frame size.

Figure 4. The payload of information embedding in videos under various bit-rates.

Figure 5. PSNR of embedded videos and transcoded videos under various bit-rates.

Then, we consider the fidelity of the embedded videos under various target bit-rates. We present the PSNR values of the embedded videos and the transcoded videos. The fidelity decreases of four videos are shown in Fig. 5. We can see that the fidelity of a transcoded high-motion video is not as high as that of the other, static videos under the same bit-rate, because high-motion videos lead to a significant amount of inter-block residuals for compensating the high variations between video frames. It can be observed that the PSNR of high-motion videos such as "Garden" decreases a lot after the information embedding. The reason could be that modifying the motion vectors of high-motion videos causes serious inaccuracies in motion compensation. At lower bit-rates, the difference between the transcoded and the embedded video is not as large as at higher bit-rates. Despite that, our scheme still achieves around 10% payload of the embedded video size on average, as shown in Fig. 4.

As mentioned in Section III-B4, a mode-copy procedure was introduced for speeding up the encoding process. It should be noted that the mode-copy procedure is reasonable only when the rate-control mode is enabled. Our mode-copy procedure skips the most time-consuming stages in the video encoder, including the motion estimation, to increase the efficiency.
First, we record the time of the embedding process, with and without employing the mode-copy procedure. Table VI shows the ratio of execution-time improvement. It can be observed that the efficiency can be improved by more than 28% when the mode-copy procedure is employed. The lower the bit-rate, the more improvement is obtained. Figure 6 shows the payload of embedding when the mode-copy procedure is applied; note that the MVD embedding is disabled. The dotted lines in Fig. 6 show the payload of the hidden information, and the solid lines are from the complete transcoding scheme. "Garden" still has the best embedding performance. Even with the MVD embedding disabled, the payload can still reach around 10% of the encoded video size on average.

Table V. The average payload (bpf) and the corresponding increment of bit-stream size (%)

  File Name      Intra4x4,16x16     Inter              IMP                MVD                All (MVD)
                 Payload   Size     Payload   Size     Payload   Size     Payload   Size     Payload   Size
  Container      409       0.14     339       -7.30    81        2.96     128       25.89    1191      7.49
  Hall Monitor   262       -3.38    341       -8.77    95        4.13     60        6.14     880       -5.16
  Foreman        349       -1.41    586       -7.60    246       4.44     482       12.72    1887      3.44
  Football       1768      -4.34    2411      -8.15    615       3.00     407       5.59     6095      -8.38
  Garden         1460      0.70     5039      -8.81    206       0.81     440       19.37    9164      2.03
  Mobile         1592      2.04     4325      -9.24    222       1.19     470       30.94    8996      1.99

Table VI. The ratio of execution time decrease after employing the mode-copy procedure

  Video Name     500 Kbps   1 Mbps   2 Mbps
  Container      53%        39%      30%
  Hall Monitor   48%        40%      35%
  Foreman        52%        44%      33%
  Football       35%        32%      30%
  Garden         55%        35%      31%
  Mobile         46%        35%      27%

Figure 6. The embedding payload under various bit-rates with the mode-copy procedure.

The mode-copy procedure skips many searching steps, so it may degrade the fidelity of the video frames. Figure 7 shows the PSNR values of four videos with and without the mode-copy procedure. The dotted lines in the figure represent the PSNR values with the mode-copy procedure. We can see that the fidelity of the videos is not affected by much. When a higher bit-rate is set, the mode-copy procedure performs even better.

Figure 7. The fidelity of video under various bit-rates, with and without the mode-copy procedure.

A. The Evaluation of Audio Embedding

For the information embedding in audio, we select some audio clips from the EBU SQAM (Sound Quality Assessment Material) CD, including "abba", "speech", "baird" and "bach". All the audio clips from the EBU SQAM CD are encoded in a lossless compression format (FLAC), and we transcode those clips into FLV as the input to our scheme. The audio encoder parameters are set to retain the fidelity of the audio as much as possible; we employ the "target quality mode" in the Nero AAC encoder to preserve the fidelity of the original clips from the EBU SQAM CD. In addition, we also select two videos, "classic" and "electronic", from YouTube, as one is classical music and the other is remixed pop music.

We first investigate the payload of embedding with both embedding modes enabled. The payload unit we use is also bits per frame (bpf), in which a "frame" is the basic element used to collect the sampling points. Table VII shows the ratio of short window appearance, the payloads of the scalefactor embedding and of the quantization index embedding, and the ratio of payload to encoded audio size. All the audio clips are encoded at 192 Kbps. The ratio of short window appearance reflects the characteristics of the audio content. A transient signal is a short-duration signal that contains a high degree of non-periodic components and a higher magnitude of high frequencies than the harmonic content of that sound.
It can be seen that "abba", "speech" and "electronic" have more transient signals. By comparing the ratio of short window appearance with the payload, we find that the payload obtained by the scalefactor embedding is unrelated to the ratio of short windows, because each subband has only one scalefactor, so it is not related to the bit-rate or to the number of short windows. In contrast, the ratio of short window appearance is proportional to the payload obtained by the embedding in the quantization indices.

Table VII. The payload of the audio embedding

  Audio name   Short window [%]   SF payload [bpf]   QC payload [bpf]   Ratio [%]
  abba         14.78              76                 341                10.55
  speech       11.42              77                 339                10.61
  baird        0.11               92                 271                8.20
  bach         2.60               72                 284                9.42
  classic      0.04               95                 319                9.19
  electro.     5.28               79                 496                12.80

Fig. 8 shows the bit-rate and the payload of the quantization index embedding. We can see that the higher the target bit-rate, the more payload the quantization index embedding can achieve, because the number of coefficients in the subbands increases. The payload of the audio embedding reaches around 10% of the audio stream size on average, as for the video information embedding.

Figure 8. The payload of quantization index embedding at various bit-rates.

V. CONCLUSION

We developed a high-volume steganographic scheme for FLV files. Both video and audio streams are employed, and several coding features are taken into account. Users may take their applications into account to select suitable features; the payload, the perceptual quality, the file size increment and the security should be the major concerns. Experimental results demonstrate that the payload can reach more than 10% of the total file size when a good trade-off is achieved.

REFERENCES

[1] U. Budhia, D. Kundur, and T. Zourntos, "Digital video steganalysis exploiting statistical visibility in the temporal domain," IEEE Transactions on Information Forensics and Security, vol. 1, no. 4, pp. 502-516, 2006.
[2] C. Xu, X. Ping, and T. Zhang, "Steganography in compressed video stream," 2006.
[3] S. Kapotas, E. Varsaki, and A. Skodras, "Data Hiding in H.264 Encoded Video Sequences," in IEEE 9th Workshop on Multimedia Signal Processing (MMSP 2007), 2007, pp. 373-376.
[4] D. Fang and L. Chang, "Data hiding for digital video with phase of motion vector," in Proceedings of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), 2006, p. 4.
[5] Z. Liu, H. Lang, X. Niu, and Y. Yang, "A robust video watermarking in motion vectors," in International Conference on Signal Processing, 2004, pp. 2358-2361.
[6] M. Wu and B. Liu, "Data hiding in image and video: part I - fundamental issues and solutions," IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 685-695, 2003.
[7] M. Wu, H. Yu, and B. Liu, "Data hiding in image and video: part II - designs and applications," IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 696-705, 2003.
[8] A. Westfeld, "F5 - A Steganographic Algorithm," in Information Hiding, Springer, pp. 289-302.
[9] A. Bhaumik, M. Choi, R. Robles, and M. Balitanas, "Data Hiding in Video," 2009.
[10] H. Wang, Y. Li, Z. Lu, and S. Sun, "Compressed domain video watermarking in motion vector," in Knowledge-Based Intelligent Information and Engineering Systems, Springer, 2005, pp. 580-586.
[11] G. Yang, J. Li, Y. He, and Z. Kang, "An information hiding algorithm based on intra-prediction modes and matrix coding for H.264/AVC video stream," AEU - International Journal of Electronics and Communications, 2010.
[12] S. Kirbiz, A. Lemma, M. Celik, and S. Katzenbeisser, "Decode-Time Forensic Watermarking of AAC Bitstreams," IEEE Transactions on Information Forensics and Security, vol. 2, no. 4, pp. 683-696, 2007.
MULTI-SCALE IMAGE CONTRAST ENHANCEMENT: USING ADAPTIVE INVERSE HYPERBOLIC TANGENT ALGORITHM

Cheng-Yi Yu 1,2, Yen-Chieh Ouyang 1, Tzu-Wei Yu 3, Chein-I Chang 1,4
1 Dept. of Electrical Engineering, National Chung Hsing University, Taichung, ROC
2 Dept. of Computer Science and Information Engineering, National Chin Yi University of Technology, Taichung, ROC
3 Dept. of Electronic Engineering, National Chin Yi University of Technology, Taichung, ROC
4 Remote Sensing Signal and Image Processing Laboratory, Dept. of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
E-mail: [email protected]

ABSTRACT

This paper presents a fast and effective method for image contrast enhancement based on multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT). Sub-band coefficients were developed based on the Adaptive Inverse Hyperbolic Tangent algorithm. In the proposed method, the image contrast is calculated from the local mean and local variance before further processing by the Adaptive Inverse Hyperbolic Tangent algorithm (AIHT). We show that this approach provides a convenient and effective way to handle various types of images. Applications of the proposed method to real-time imaging are also discussed. Experimental results show that the proposed algorithm is capable of enhancing the local contrast of the original image adaptively while bringing out the details of objects at the same time.

Keywords - Multi-Scale, Adaptive Inverse Hyperbolic Tangent, Contrast Enhancement, Image Processing
Topic area - Multi Processing, Image Post-Processing

1. INTRODUCTION

Light is the electromagnetic radiation that stimulates our visual response. In real-world situations, light intensities have a large range. The illumination range over which the human visual system can operate is roughly 1 to 10^10, or ten orders of magnitude. The retina of the human eye contains about 100 million rods and 6.5 million cones. The rods are sensitive and provide vision over the lower several orders of magnitude of illumination. The cones are less sensitive and provide the visual response at the higher 5 to 6 orders of magnitude of illumination. Figure 1 shows the Human Visual System mapping curve [1,2].

Figure 1. Human Visual System mapping curve.

According to its contrast, an image is generally categorized into one of five groups: dark image, bright image, back-lighted image, low-contrast image, and high-contrast image. A dark image has particularly low gray levels in intensity, while a bright image has very high gray levels in intensity. The gray levels of a back-lighted image are usually distributed at the two ends of the dark and bright regions. On the other hand, the gray levels of a low-contrast image are generally concentrated in the middle region, while the gray levels of a high-contrast image are scattered across the whole spectrum (Fig. 2) [3,4].

Figure 2. Five kinds of contrast types.

Five categories of commonly used gray level transfer functions, shown in Fig. 3, are generally used to perform contrast enhancement so as to achieve different types of contrast [3,4]. For example, for dark images with mean < 0.5, the function in Fig. 3(a) is used,
whereas the function in Fig. 3(b) is used for a bright image with mean > 0.5 for the same purpose. For images whose gray levels are concentrated in the middle region, with mean near 0.5, the function in Fig. 3(c) is used. For images whose gray levels are distributed at the two ends of the dark and bright regions, the function in Fig. 3(d) is used. For images whose gray levels are uniformly scattered across the whole spectrum, the function in Fig. 3(e) is used.

Figure 3. Five categories of classical gray level transform functions: (a) dark image, (b) bright image, (c) back-lighted image, (d) low contrast image, (e) high contrast image.

Contrast enhancement techniques are widely used to increase the visual quality of images. The purpose of image enhancement is twofold: first, to let human eyes identify images more easily by making the image clear and detailed; second, to let computers analyze and identify image data more easily, approaching human visual perception capabilities. However, the Adaptive Inverse Hyperbolic Tangent algorithm that we proposed in our previous work [3,4] suffers from the following drawbacks. First, it lacks a mechanism to adjust the degree of enhancement; AIHT-based image contrast enhancement cannot retain the detailed brightness distribution of the original image and therefore leads to distortion. Second, the algorithm only performs global contrast enhancement and cannot achieve local contrast enhancement; it is unable to follow the Human Visual System mapping curve and may produce non-smooth or distorted images.

In this paper, to address the above-mentioned shortcomings of image contrast enhancement methods, we propose a multi-scale image enhancement method based on the Adaptive Inverse Hyperbolic Tangent algorithm. This method has two main features: (1) a sub-processing method to achieve local contrast enhancement; (2) the ability to process various types of images while enhancing and retaining the original image details. The enhanced results will contribute to image analysis.

The multi-scale parameter adjustment of the Adaptive Inverse Hyperbolic Tangent algorithm (MSAIHT) for image contrast enhancement is suitable for interactive applications. It can automatically produce contrast-enhanced images of good quality while using a spatially uniform mapping function based on a simple brightness perception model to achieve better efficiency. In addition, the MSAIHT also provides users with a tool for tuning the image appearance on the fly in terms of brightness and contrast, and is thus suitable for interactive applications. The AIHT-processed images can be reproduced within the capabilities of the display medium to give more detailed and faithful representations of the original scenes.

The remainder of this paper is organized as follows: Section 2 reviews previous work in the literature. Section 3 develops the MSAIHT contrast enhancement algorithm along with its parameters and usage. Section 4 conducts experiments including simulations. Finally, Section 5 provides future directions for further research.

2. CONTRAST ENHANCEMENT FOR AN IMAGE

There are two categories of contrast enhancement techniques: global methods and local methods. Global contrast enhancement techniques remedy problems that manifest themselves in a global fashion, such as excessive or poor lighting conditions in the source environment. On the other hand, local contrast enhancement tries to enhance the visibility of local details in the image. Locally enhanced images look more attractive than the originals because of the higher contrast [5].

The advantages of using a global method are its high efficiency and low computational load. The drawback of using a global operator is its inability to reveal image details of local luminance variation. On the contrary, the advantage of a local operator is its capability of revealing the details of luminance level information in an image, at the expense of a very high computational cost that may be unsuitable for video applications without hardware realization [3,4]. Two types of contrast enhancement techniques, linear and nonlinear, are discussed as follows.

Linear contrast enhancement is also referred to as contrast stretching. It linearly expands the original digital luminance values of an image to a new distribution. Expanding the original input values of the image makes it possible to use the entire sensitivity range of the display device. Linear contrast enhancement also highlights subtle variations within the data.

Nonlinear contrast enhancement often involves histogram equalization, which requires an algorithm to accomplish the task. One major disadvantage resulting from the nonlinear contrast stretch is that each value in the input image can have several values in the output image, so that objects in the original scene lose their correct relative brightness values. Under such a circumstance the contrast enhancement is generally performed to expand the gray level range to mitigate the problem.
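To make the two classical global techniques concrete, the NumPy sketch below applies a linear contrast stretch and a histogram equalization to an 8-bit grayscale image. It illustrates the standard textbook operations only; it is not the method proposed in this paper.

import numpy as np

def linear_stretch(img):
    """Linearly expand an 8-bit grayscale image to the full [0, 255] range."""
    lo, hi = int(img.min()), int(img.max())
    if hi == lo:
        return img.copy()
    return ((img.astype(np.float64) - lo) * 255.0 / (hi - lo)).astype(np.uint8)

def histogram_equalization(img):
    """Classical histogram equalization; input levels may merge or spread, so
    objects generally lose their original relative brightness."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf[-1] == cdf_min:                  # constant image, nothing to equalize
        return img.copy()
    lut = np.round((cdf - cdf_min) * 255.0 / (cdf[-1] - cdf_min))
    return np.clip(lut, 0, 255).astype(np.uint8)[img]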
One popular technique to accomplish this task is histogram equalization (Gonzalez and Woods [6]). A disadvantage of the method is that it is indiscriminate and produces unrealistic effects in photographs. It may increase the contrast of background noise while decreasing the usable signal. In scientific imaging, where spatial correlation is more important than the intensity of the signal, the small signal-to-noise ratio usually hampers visual detection.

3. MULTI-SCALE PARAMETER ADJUSTMENT OF ADAPTIVE INVERSE HYPERBOLIC TANGENT (MSAIHT) ALGORITHM

3.1. Adaptive Inverse Hyperbolic Tangent (AIHT) Algorithm

Figure 4 is a block diagram of the AIHT algorithm. The input data is converted from its original format to a floating point representation of RGB values. The principal characteristic of our proposed enhancement function is an adaptive adjustment of the Inverse Hyperbolic Tangent (AIHT) function determined by each pixel's radiance. After reading the image file, bias(x) and gain(x) are computed. These parameters control the shape of the IHT function. Figure 5 shows a block diagram of the AIHT parameter evaluation, including the bias(x) and gain(x) parameters [3,4].

Figure 4. A flowchart of the AIHT algorithm.

Figure 5. A flowchart of the AIHT parameter evaluation.

The Adaptive Inverse Hyperbolic Tangent algorithm has several desirable properties. For very small and very large luminance values, its logarithmic function enhances the contrast in both dark and bright areas of an image. Because this function approaches an asymptote, the output mapping is always bounded between 0 and 1. Another advantage of this function is that it supports an approximately inverse hyperbolic tangent mapping for intermediate luminance, i.e. luminance distributed between dark and bright values. Figure 6 shows an example where the middle section of the curve is approximately linear.

Figure 6. AIHT is approximately linear over the middle range of values, where the choice of a semi-saturation constant determines how input values are mapped to display values.

The form of the AIHT fits data obtained from measuring the electrical response of photo-receptors to flashes of light in various species [7]. It has also provided a good fit to other electro-physiological and psychophysical measurements of human visual function [8]-[10].

The contrast of an image can be enhanced using the adaptive inverse hyperbolic function. The enhanced pixel x_ij is defined as follows,

  Enhance(x_ij) = (1 / gain(x)) · log( (1 + x_ij^{bias(x)}) / (1 − x_ij^{bias(x)}) )    (1)

where x_ij is the image gray level of the i-th row and j-th column. The bias(x) is a power applied to x_ij to speed up the change. The gain function is a weighting function used to determine the steepness of the AIHT curve; a steeper slope maps a smaller range of input values to the display range. The gain function helps to shape how fast the mid-range of objects in a soft region goes from 0 to 1, and a higher gain value means a higher rate of change. Therefore the steepness of the inverse hyperbolic tangent curve can be adjusted dynamically. The following section describes the method we use, which is similar to the proposed algorithm.

3.2. Bias and Gain Parameters

The bias function is a power function defined over the unit interval, which remaps x according to the bias transfer function. The bias function is used to bend the density function either upwards or downwards over the [0,1] interval.
The bias power function is defined by:

  bias(x) = ( (mean(x) + 0.5) / 0.5 )^{0.25}    (2)

where mean(x) = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij.

The gain function determines the steepness of the AIHT curve. A steeper slope maps a smaller range of input values to the display range. The gain function is used to help reshape the object's mid-range, from 0 to 1, of its soft region. The gain function is defined by:

  gain(x) = 0.1 + ( (variance(x) + 0.1) / 0.5 )^{0.5}    (3)

where variance(x) = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} (x_ij − µ)^2 and µ = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij.

Decreasing the gain(x) value increases the contrast of the re-mapped image. Shifting the distribution toward lower levels of light (i.e., decreasing bias(x)) decreases the highlights. By adjusting bias(x) and gain(x), it is possible to tailor a re-mapping function with appropriate amounts of image contrast enhancement, highlights and shadow lightness, as shown in Fig. 7.

Figure 7. Inverse hyperbolic tangent curves produced by varying the gain and bias values of the mapping curves.

The gain function determines the steepness of the curve: steeper slopes map a smaller range of input values to the display range. The value of bias controls the centering of the inverse hyperbolic tangent. Figure 8 shows the processed images for different values of gain and bias. There are eight gain values (1, 0.99, 0.97, 0.93, 0.85, 0.69, 0.37) with the bias parameter fixed (bias = 1); the corresponding results are shown in Fig. 8(a). There are nine bias values (0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, 2.8) with the gain parameter fixed (gain = 0.85); the corresponding results are shown in Fig. 8(b).

Figure 8. (a) Bias parameter fixed (bias = 1) with eight different gain values; (b) gain parameter fixed (gain = 0.85) with nine different bias values.
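A compact NumPy sketch of the AIHT mapping follows. It mirrors Eqs. (1)-(3) as written above; the exact placement of the constants 0.25, 0.5 and 0.1 follows that reading and should be treated as an assumption rather than a definitive formulation, and the final normalization to [0, 1] is added only for display.

import numpy as np

def aiht_enhance(img):
    """Adaptive Inverse Hyperbolic Tangent mapping of a grayscale image in [0, 1].

    bias and gain are derived from the global mean and variance, following the
    Eqs. (2) and (3) as given above; the constants may deviate from the original paper.
    """
    x = np.clip(np.asarray(img, dtype=np.float64), 1e-6, 1.0 - 1e-6)
    mean = x.mean()
    var = x.var()
    bias = ((mean + 0.5) / 0.5) ** 0.25                 # Eq. (2), as read above
    gain = 0.1 + ((var + 0.1) / 0.5) ** 0.5             # Eq. (3), as read above
    xb = x ** bias
    enhanced = np.log((1.0 + xb) / (1.0 - xb)) / gain   # Eq. (1), as read above
    enhanced -= enhanced.min()                          # normalize for display
    if enhanced.max() > 0:
        enhanced /= enhanced.max()
    return enhanced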
3.3. Multi-Scale Parameter Adjustment of Adaptive Inverse Hyperbolic Tangent (MSAIHT) Algorithm

Figure 9 shows a block diagram of the MSAIHT algorithm. The input data is converted from its original format to a floating point representation of RGB values. The principal characteristic of our proposed enhancement function is a multi-scale adaptive adjustment of the Inverse Hyperbolic Tangent (MSAIHT) function determined by each pixel's radiance. After reading the image file, bias(x) and gain(x) are computed. These parameters control the shape of the AIHT function. Figure 10 shows a block diagram of the MSAIHT parameter evaluation, including the multi-scale bias(x) and gain(x) parameters.

Figure 9. A flowchart of the MSAIHT algorithm.

There are two important design goals for the multi-scale approach: avoiding noise visibility, especially in smooth regions, and preventing intensity saturation at the minimum and maximum possible intensity values (e.g. 0 and 255 for a 1-byte-per-channel source format).

The enhanced output image resulting from the multi-scale approach for processing an input image x is described by:

  Enhance_MSAIHT = Σ_{k=1}^{K} AIHT(bias(k), gain(k))    (4)

where K is the number of bands used. The multi-scale approach combines low gain images in the low level and high gain images in the high level. An additional problem that is potentially solved by this approach is the compression property of the display (the so-called gamma curve). This transfer function has a high compression rate for the higher luminance range and a low stretching rate for the lower luminance regions.

4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A variety of video sequences and still images were tested using the proposed method. There are four types of extreme-case images: dark images, bright images, back-lighted images, and low-contrast images. Images with different types of histogram distributions were taken for the experiments. These include some daily-life images with poor contrast, used to demonstrate the enhanced results. Figure 11 shows various types of images with bad contrast, and displays the results of the enhanced image processing by histogram equalization, AIHT and the proposed MSAIHT method. Figure 12 compares the Adaptive Inverse Hyperbolic Tangent and the Multi-scale Adaptive Inverse Hyperbolic Tangent on local detail. In terms of local detail, the enhancement of MSAIHT is better than that of AIHT.

The comparative analysis has shown that the proposed methods can display more detail, in the sense of contrast, than the currently used methods. The MSAIHT technique can keep the sharpness of defects' edges and local detail well. Therefore, AIHT and MSAIHT can greatly enhance poor images, and they will be helpful for defect recognition.

Finally, Figure 13 shows the MSAIHT system interface in manual and automatic mode. The automatic mode adjusts the best parameters (multi-scale gain and bias) based on the automatic calculation of image characteristics (piecewise mean and variance). In manual mode, users can select th