Video object tracking
Andrea Cavallaro
Queen Mary, University of London
[email protected]
http://www.elec.qmul.ac.uk/staffinfo/andrea
20/07/2006
Outline
• Introduction
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection
Framework
[Block diagram: input video → pre-filtering → change/object detection (with background update) → 3D analysis → tracking → classification → event detection → post-processing and output symbols; a priori information and information from other cameras feed the detection, tracking and event-detection modules]
Why is object detection not enough?
[Figure: detections in frame n and frame n+m; without tracking, the correspondence between them is unknown]

Object tracking
[Figure: each object in frame n linked to the same object in frame n+m]
Object tracking: examples
Outline
• Introduction
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection
Problem statement
• Objective
• To predict the target state over time → position, shape
• Problems
• Changes in pose and illumination
• Partial and total occlusions
• Clutter and targets with similar appearance
• Steps
• Target representation → normalised colour histogram
• Likelihood of a candidate → based on the Bhattacharyya coefficient
• Tracking algorithms
• Mean shift (MS)
• Particle filter (PF)
Likelihood
• Likelihood
• Colour → RGB space → 3D colour histograms (10×10×10)
• Candidate histogram h compared with the reference h_ref via the Bhattacharyya distance d

p(C | X_t) = exp( −d²(h, h_ref) / σ )
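A minimal sketch of this likelihood in Python (NumPy only); the 10×10×10 binning follows the slides, while the helper names and the σ default are illustrative:

```python
import numpy as np

def colour_histogram(patch, bins=10):
    """Normalised 3D RGB histogram of an image patch (H x W x 3, 0-255)."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist / hist.sum()

def bhattacharyya_distance(h, h_ref):
    """d(h, h_ref) = sqrt(1 - sum_i sqrt(h_i * h_ref,i))."""
    bc = np.sum(np.sqrt(h * h_ref))        # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

def colour_likelihood(patch, h_ref, sigma=0.1):
    """p(C | X_t) = exp(-d^2(h, h_ref) / sigma), as on the slide."""
    d = bhattacharyya_distance(colour_histogram(patch), h_ref)
    return np.exp(-d ** 2 / sigma)
```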
Mean shift: description
• Mean shift
• Deterministic non-parametric approach
• Iterative procedure
• Kernel-based
• Gradient-based approach
• If the distance function is smooth (kernel) → effective
[Figure: candidate window shifted iteratively from the previous frame position towards the local similarity maximum]
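A sketch of one mean-shift step over the colour similarity, assuming a uniform kernel and the flat 10-bin histograms above; it uses the standard sqrt(h_ref/h) pixel weights rather than any specific implementation from the slides:

```python
import numpy as np

def bin_index(pixels, bins=10):
    """Map RGB pixels (N x 3, 0-255) to flat 3D-histogram bin indices."""
    idx = np.clip(pixels // (256 // bins), 0, bins - 1).astype(int)
    return idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]

def mean_shift_step(frame, cx, cy, h_ref, half=16, bins=10):
    """One step: each pixel in the window is weighted by
    sqrt(h_ref / h_candidate) for its colour bin, and the window centre
    moves to the weighted centroid. h_ref is a flat (bins**3,) histogram."""
    y0, x0 = int(cy) - half, int(cx) - half
    patch = frame[y0:y0 + 2 * half, x0:x0 + 2 * half].reshape(-1, 3)
    flat = bin_index(patch, bins)
    h = np.bincount(flat, minlength=bins ** 3).astype(float)
    h /= h.sum()
    w = np.sqrt(h_ref[flat] / np.maximum(h[flat], 1e-12))
    ys, xs = np.mgrid[y0:y0 + 2 * half, x0:x0 + 2 * half]
    return np.average(xs.ravel(), weights=w), np.average(ys.ravel(), weights=w)
```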
Mean shift: example
Particle filter: description
• State: x_k = f_k(x_{k−1}, u_k)
• Observation: z_k = h_k(x_k, n_k)
• Objective
• to estimate the unknown state x_k based on a sequence of observations z_k, k = 0, 1, …
• find the posterior distribution

p(x_k | z_{1:k}) ≈ Σ_{i=1}^{N} w_k^i δ(x_k − x_k^i)

• Solution (Bayesian)
• Prediction step
• Based on the state equation
• Update step
• Based on the likelihood function
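A minimal bootstrap particle-filter step under these equations; the zero-order prediction and σ = 14 follow the slides' transition model and parameters, the function shape is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, sigma=14.0):
    """Prediction: x_k = x_{k-1} + u_k (zero-order model).
    Update: re-weight each particle x_k^i by the likelihood p(z_k | x_k^i),
    e.g. the colour likelihood above. Returns the weighted mean estimate."""
    particles = particles + rng.normal(0.0, sigma, particles.shape)
    weights = weights * np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    estimate = np.average(particles, axis=0, weights=weights)   # E[x_k]
    return particles, weights, estimate
```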
State transition model
• Typically
• Zero-order model: x_k = x_{k−1} + u_k
• Limitation: random positioning of the particles
• First-order model: x_k = x_{k−1} + θ_{k−1} + u_k
• Limitation: highly manoeuvring targets
• Adaptive state transition model
• Zero-order model with adaptive noise variances: x_k = x_{k−1} + C_k u_k

C_k ∝ (1/n) Σ_{t=k−n−1}^{k−1} |x_t − x_{t−1}|

(average state velocity over the previous n frames)
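The adaptive scale C_k in code form (a sketch; the proportionality constant and window handling are assumptions):

```python
import numpy as np

def adaptive_noise_scale(history, n=10):
    """C_k proportional to the average state velocity over the last n
    frames, for the transition x_k = x_{k-1} + C_k * u_k.
    `history` is an array of past state vectors, one row per frame."""
    recent = np.asarray(history[-(n + 1):])   # n+1 states -> n velocities
    return np.abs(np.diff(recent, axis=0)).mean(axis=0)
```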
Re-sampling
• Problem
• weight degeneration
• Solution
• re-sampling (eliminates particles with small weights)
[Figure: particles under the posterior before and after re-sampling]
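One common re-sampling scheme (systematic re-sampling) as a sketch; the slides motivate re-sampling but do not prescribe a particular algorithm:

```python
import numpy as np

def systematic_resample(particles, weights, rng=np.random.default_rng()):
    """Duplicate heavy particles and drop light ones; weights reset to 1/N."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[np.clip(idx, 0, n - 1)], np.full(n, 1.0 / n)
```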
Particle filter: example
Hybrid tracker
• re-sample particles
• apply state transition: x_k = x_{k−1} + C_k u_k
  (zero-order model with adaptive noise variances,
  C_k ∝ (1/n) Σ_{t=k−n−1}^{k−1} |x_t − x_{t−1}|,
  the average state velocity over the previous n frames)
• MS for each particle (see the sketch below)
• re-weighting
• E[.]

The mean-shift operator acts on the 2D position state space only:

p(x_k | z_{1:k}) ≈ Σ_{i=1}^{N} w_k^i δ(x_k − MS(x_k^i))
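A sketch of the hybrid step, assuming the mean_shift_step and likelihood helpers above; taking the first two state components as the 2D position is an assumption:

```python
import numpy as np

def hybrid_tracker_step(particles, weights, frame, h_ref, likelihood):
    """Refine each particle's 2D position with mean shift (MS(x_k^i)),
    then re-weight; every particle ends near a local maximum of the
    likelihood, so fewer particles are needed."""
    for p in particles:
        p[0], p[1] = mean_shift_step(frame, p[0], p[1], h_ref)
    weights = weights * np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    return particles, weights, np.average(particles, axis=0, weights=weights)
```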
Hybrid tracker
• Advantages
• After MS → each particle is near a local maximum of the filtered posterior (2D position sub-space)
• The efficiency of the particles is increased
• Multi-modality of the posterior is maintained
• Extra computation is compensated by fewer particles
[Figure: weighting and mean-shift steps moving particles towards E[x]]
Results
• Initialisation
• Ground-truth initialisation of the target
• Parameters
• Histograms: 10x10x10 (RGB)
• MS: 5 times with different kernel sizes (+/- 10%)
• PF, HT: 3D state model (to compare with MS): position; target size
• Transition model σx = σy = 14; σh = 0.013; ks = 5; kp= 10
• PF: 150 samples; HT: 30 samples
• Presentation of results
• Videos
• Sample frames & objective measure
Evaluation
• Subjective evaluation
• Side-by-side visual comparison of tracking results
• Objective evaluation
• Deviation from the ground truth
• APE: average position error (pe: distance between predicted and ground-truth target centres)
• ASE: average size error

ASE = √(W² + H²)

W: width error
H: height error
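These measures in code (a sketch; the slide's ASE formula is reconstructed here as the root of the squared width and height errors):

```python
import numpy as np

def ape_ase(pred, gt):
    """APE/ASE over a sequence; pred and gt are (T, 4) arrays of
    (cx, cy, w, h) per frame."""
    pe = np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1)          # position
    se = np.hypot(pred[:, 2] - gt[:, 2], pred[:, 3] - gt[:, 3])   # size
    return pe.mean(), se.mean()
```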
Results: highway
[Videos: MS, PF, Proposed]

Evaluation: highway
      MS     PF     Proposed
APE   0.95   12.8*  0.88
ASE   2.74   22.3*  3.58
[Error plots over time for MS, PF, Proposed]
Results: soccer
[Videos: MS, PF, Proposed]

Evaluation: soccer
      MS     PF    Proposed
APE   242*   3.9   3.2
ASE   18.2*  10.8  9.8
[Error plots over time for MS, PF, Proposed]
Results: table tennis
[Videos: MS, PF, Proposed]

Evaluation: table tennis
      MS     PF     Proposed
APE   43.2*  24.1*  2.0
ASE   6.7*   3.3*   2.8
[Error plots over time for MS, PF, Proposed]
Results: emilio
[Videos: MS, PF, Proposed]
Single vs. multiple target tracking
• Single target tracking
• Hybrid mean shift / particle filter tracker
• faster and more accurate than particle filter
• more reliable than mean shift with fast targets
• Adaptive transition model
• Deal with highly manoeuvring targets
• Cope with camera motion
• What about multiple targets?
• Need to consider target 'interactions'
• NP-hard problem
• Complexity grows exponentially with the number of targets (PF)
Multiple object tracking
• Graph matching using weighted features
• Data association verified over several frames to validate the correctness of the tracks
• Supports track recovery in occlusion scenarios
• Features
• centre of mass
• velocity
• bounding box
• colour

State: X = [x, y, ẋ, ẏ, w, h, H]
(position x, y; velocity ẋ, ẏ; size w, h; appearance H)

Weight of the edge linking detection X_i^a in frame i to detection X_j^b in frame j (the four terms score position, velocity, size and appearance similarity; τ penalises the frame gap):

g(X_i^a, X_j^b) = α·g₁(X_i^a, X_j^b) + β·g₂(X_i^a, X_j^b) + γ·g₃(X_i^a, X_j^b) + δ·g₄(X_i^a, X_j^b) − (j − i − 1)·τ
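A sketch of this edge weight; the individual similarity functions are illustrative placeholders, since the slides only name the features they compare:

```python
def edge_weight(Xi, Xj, i, j, sims, coeffs, tau):
    """g(X_i^a, X_j^b): weighted sum of feature similarities minus a
    penalty proportional to the skipped frames (j - i - 1).
    `sims` is a tuple of four functions scoring position, velocity,
    size and appearance similarity (placeholders, not from the slides);
    `coeffs` holds (alpha, beta, gamma, delta)."""
    return (sum(c * s(Xi, Xj) for c, s in zip(coeffs, sims))
            - (j - i - 1) * tau)
```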
Graph matching: full graph
[Figure: candidate vertices v(x_i^j) for up to four detections in each of three frames V1, V2, V3, with edges between all pairs of vertices in different frames]
Graph matching: max path cover
[Figure: the same vertices, keeping only the maximum-weight path cover, i.e. the best set of vertex-disjoint paths (tracks) through V1, V2, V3]
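A simplified two-frame special case of the association, solved with the Hungarian algorithm via SciPy; the slides' method covers several frames with a max path cover, so this only illustrates the matching idea:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(weights):
    """weights[i, j]: edge weight g(.) between track i in the previous
    frame and detection j in the current frame. Returns pairs whose
    weight is positive; the rest stay unmatched (track end / new track)."""
    rows, cols = linear_sum_assignment(-weights)   # maximise total weight
    return [(i, j) for i, j in zip(rows, cols) if weights[i, j] > 0]
```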
Detection vs. tracking
• Detection
• Usually frame-based
• Can be improved with temporal features (e.g., pedestrians)
• Trained classifier
• Choice of training set
• Choice of negative examples
• Choice of poses covered in the training set
• Tracking
• Propagates the initialisation information
• Model: template, statistical representation, parts, …
• Should update the model
• Should self-initialise
→ Integration!
Outline
• Introduction
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection
Integration of detection and tracking
• Problem
• Detecting objects (e.g., faces) in clutter
• Tracking multiple objects (e.g., faces) under occlusions
→ Integration of an AdaBoost face detector and a Bayesian tracker
Face classifier
• Approach
• Cascade of classifiers
• Integral image
• Haar features for face detection
• Training
• Set of scales
• Output
• Few false negatives
• Many false positives …
→ Need additional evidence
→ Fusion of colour analysis (chromaticity segmentation) and face classification
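The integral image that makes the cascade fast, as a short sketch; any rectangular Haar feature then reduces to a handful of table look-ups:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended:
    ii[y, x] = sum of img[:y, :x]."""
    return np.pad(img.astype(np.int64).cumsum(0).cumsum(1),
                  ((1, 0), (1, 0)))

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) using four look-ups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```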
Filtering through chromaticity segmentation
[Figures: detections from the face detector only vs. the face detector combined with chromaticity segmentation]
Detection and tracking
• Use particle filtering to track between detections (track management sketched below)
• Initialisation
• detection away from current particles → candidate track
• candidate track → activated after successive detections (confidence)
• Filtering
• if two tracks overlap → keep the one with the highest confidence score
• number of tracked frames
• frequency of detections
• Termination
• segmentation cue (skin)
• detection cue (classifier)
• size cue (ratio and area)
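A minimal track-management sketch following these rules; the Track fields, thresholds, and the centre-distance overlap test are all illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Track:
    x: float
    y: float
    confidence: int = 0        # successive supporting detections
    active: bool = False

def update_tracks(tracks, detections, far=50.0, activate_after=3):
    """Detections far from every track spawn candidates; candidates
    activate after `activate_after` supporting detections; of two
    overlapping tracks only the more confident survives."""
    for dx, dy in detections:
        near = [t for t in tracks
                if (t.x - dx) ** 2 + (t.y - dy) ** 2 < far ** 2]
        if not near:                           # initialisation
            tracks.append(Track(dx, dy))
        for t in near:                         # support existing tracks
            t.confidence += 1
            if t.confidence >= activate_after:
                t.active = True                # activation
    kept = []                                  # filtering of overlaps
    for t in sorted(tracks, key=lambda t: -t.confidence):
        if all((t.x - k.x) ** 2 + (t.y - k.y) ** 2 >= far ** 2 for k in kept):
            kept.append(t)
    return kept
```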
Detection and tracking
[Video: removal of overlapping tracks]
Particle (temporal) filtering for face tracking
• Particle filtering integrated with the face detector
• Link candidates from prediction (particles) with candidates from detection → connected detection (CD)
• Particle spread (temporal prediction)
• If no CD → zero-order motion model
• If CD → particles are partially spread in the detection area
• Object model (colour histogram)
• If no CD → no update
• If CD → partial update (e.g., by 25%)
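The model-update rule in code; the 25% blend follows the slide's example, everything else is a sketch:

```python
def update_colour_model(h_model, h_detection, connected, rate=0.25):
    """Blend the reference histogram towards the detected face only when
    a connected detection (CD) exists; otherwise keep the model frozen."""
    if not connected:
        return h_model
    return (1.0 - rate) * h_model + rate * h_detection
```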
Integrated detection and tracking
[Videos: without particle spread around detections; without model update; with full integration]
Detection and tracking
[Videos: particle spread around detections; colour model update]
Face detection and tracking
Automatic tracking with a PTZ camera
Outline
• Introduction
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection
Multi-modal data fusion using a particle filter (PF)

State: X_t = {x, y, width, height}

[Block diagram: audio → reverberation filtering and onset detection → multi-band GCCF analysis → p(A | X_t); video → colour feature (histogram) → p(C | X_t); video → change detection → motion feature (multivariate Gaussian) → p(M | X_t); the three likelihoods feed the PF, which outputs the state estimate X]

• Overall likelihood

p(O | X_t) = p(M | X_t) · p(C | X_t) · p(A | X_t)
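The fusion itself is a product of per-modality likelihoods, e.g.:

```python
def fused_likelihood(x_t, lik_motion, lik_colour, lik_audio):
    """p(O | X_t) = p(M | X_t) * p(C | X_t) * p(A | X_t); each lik_* maps a
    state hypothesis to its modality likelihood (illustrative signatures)."""
    return lik_motion(x_t) * lik_colour(x_t) * lik_audio(x_t)
```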
Audio likelihood
• Time delay of arrival (TDOA)

s₁(t) = v(t) + n₁(t)
s₂(t) = λ·v(t + τ) + n₂(t)

(v: speaker signal; n₁, n₂: noise; λ: attenuation; the delay τ corresponds to the path difference d·sinθ between microphones M1 and M2 at distance d, where θ is the speaker direction)

• Reverberation filtering
• Onset detection based on the precedence effect
• Multi-band analysis

[Block diagram: s₁(t), s₂(t) → onset detection → band-pass filters ω₁, ω₂, ω₃ → GCC-PHAT per band → Σ → R̂_{s1s2}(f)]

p(A | X_t) = (1 / (σ_A √(2π))) · exp( −(ς̂(A(R̂_{s1s2})) − x_t)² / (2σ_A²) )
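A sketch of the core GCC-PHAT delay estimate (single band; the multi-band sum and the mapping ς̂ from delay to image coordinate are omitted):

```python
import numpy as np

def gcc_phat_tdoa(s1, s2, fs):
    """TDOA estimate from the peak of the phase-transform-whitened
    cross-correlation of two microphone signals."""
    n = len(s1) + len(s2)
    S1, S2 = np.fft.rfft(s1, n), np.fft.rfft(s2, n)
    cross = S1 * np.conj(S2)
    r = np.fft.irfft(cross / np.maximum(np.abs(cross), 1e-12), n)
    r = np.concatenate((r[-n // 2:], r[:n // 2]))    # centre zero lag
    return (np.argmax(np.abs(r)) - n // 2) / fs      # delay in seconds
```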
Comparison
[Videos: audio only; video only; audio-visual]
Results – speaker detection
[Sample frames 50, 313, 425]
Results – scene dynamics for teleconferencing
[Videos: original video; abstract representation]
Outline
• Introduction
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection
Event detection

Contextual information
• Scene modelling
• A Gaussian for each area of interest
• outside zone → modelled with multiple Gaussians
[Figures: scene maps with labelled zones (outside_zone, enters_zone, inside_zone, opens, go_up_stairs, go_down_stairs) for Building entrance - Camera 1 and Airport - Camera 4]
Object information
• Object detection and tracking
• Observations O
• Model parameters
• States ω = {ω₁, ω₂, ω₃, …, ω_N}
• State transition probabilities A = {a_ij}
• Emission probabilities B = {b_j(O)}
• Initial state ω(0)
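These are the standard parameters of a hidden Markov model; a minimal Viterbi decoder, assuming discretised observations (the slides do not specify the inference method), recovers the most likely event sequence:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state (event) sequence given discrete observations.
    A[i, j]: transition probability, B[j, o]: emission probability,
    pi[i]: initial state probability."""
    T = len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, len(pi)), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```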
Results
Results
Summary
• Tracking algorithms
• Mean-shift
• Particle filtering
• Graph-matching
• Integration of detection and tracking
• Integration of audio and video
• Event detection

Acknowledgements
Emilio Maggio
Murtaza Taj
Matteo Bregonzio
Huiyu Zhou
Stefan Karlsson
http://www.elec.qmul.ac.uk/staffinfo/andrea
EU FP7 project APIDIS (2008 – 2010)