Multi-view Vehicle Detection and Tracking in
Crossroads
Liwei Liu, Junliang Xing, Haizhou Ai
Computer Science and Technology Department,
Tsinghua University, Beijing 100084, China
Email: ahz@mail.tsinghua.edu.cn
Abstract—Multi-view vehicle detection and tracking in cross-
roads is of fundamental importance in traffic surveillance yet
still remains a very challenging task. The view changes of
different vehicles and their occlusions in crossroads are two
main difficulties that often fail many existing methods. To handle
these difficulties, we propose a new method for multi-view vehicle
detection and tracking that innovates mainly on two aspects: the
two-stage view selection and the dual-layer occlusion handling.
For the two-stage view selection, a Multi-Modal Particle Filter
(MMPF) is proposed to track vehicles in explicit views, i.e. frontal
(rear) view or side view. In the second stage, for the vehicles in
inexplicit views, i.e. intermediate views between frontal and side
view, spatial-temporal analysis is employed to further decide their
views so as to maintain the consistency of view transition. For the
dual-layer occlusion handling, a cluster based dedicated vehicle
model for partial occlusion and a backward retracking procedure
for full occlusion are integrated complementarily to deal with
occlusion problems. The two-stage view selection is efficient
for fusing multiple detectors, while the dual-layer occlusion
handling improves tracking performance effectively. Extensive
experiments under different weather conditions, including snowy,
sunny and cloudy, demonstrate the effectiveness and efficiency
of our method.
I. INTRODUCTION
Detection and tracking of vehicles in traffic scenes is of fundamental importance for surveillance systems and has clear commercial value, providing great potential for many high-level computer vision applications such as traffic analysis, intelligent scheduling and abnormal activity detection. The difficulties behind this problem, however, are considerable: vehicle view and type changes, partial and full vehicle occlusions, and gradual and sudden illumination changes. These difficulties are inevitable in practical applications and thus noticeably aggravate the problem.
Vehicle detection and tracking has been researched for many years, and significant advances have been achieved. Traditional methods try to detect vehicles based on background subtraction [1][2][3] and track them using techniques like the Kalman Filter [3] and the Spatial-Temporal Markov Random Field [2] with different observations such as contour [1] and appearance [3]. Since these methods are sensitive to foreground noise, particular cases such as camera adjustment, rain, snow and shadow cause them to fail. Moreover, they all require that vehicles be identified separately before occlusion happens, which is a strong constraint that limits their practical application in crowded scenarios.
Fig. 1. The flow chart of our approach.
In the last decade, the fast development of object detection techniques has resulted in many promising methods for detecting particular object classes, e.g., faces [4][5], pedestrians [6][7], and vehicles [8]. These object detectors provide good observation models for detection based tracking algorithms. All the detection based methods can be categorized into three classes according to the types of detectors: single view detector [4][7], integration of multiple view detectors [6], and single multi-view detector [8]. Obviously, a single view detector is unsuitable for scenarios that contain multi-view targets, e.g. crossroads. In consideration of the connections and distinctions among multiple view detectors, a tracking algorithm based on multiple detectors must have a sophisticated integration strategy. A single multi-view detector (often used in onboard systems) requires high affinity of targets in each view and a uniform aspect ratio of vehicles, so this approach does not work in our problem. In addition, Data-Driven MCMC [9] has been used to recover trajectories of targets of interest over time, but this method requires all the videos in advance and uses optimization algorithms to solve the problem, which conflicts with the requirements of online and real-time processing in our problem. As far as we know, there are very few works on multi-view vehicle detection and tracking in crossroads based on detection techniques that can process online and in real-time. Our approach is motivated by this practical requirement.
In this work, we focus on videos taken by a single camera at a height above the ground, as is common in surveillance applications. The vehicle videos are acquired in crossroads where occlusions among vehicles and viewpoint changes are rather severe.
Fig. 2. Results of view confidence weights (from left to right: side view (red), intermediate view, and frontal view (green); the histogram in the bottom-right corner of each figure gives a quantitative comparison of the weights).
Owing to its detection based techniques, our approach is much more robust to shadow and illumination changes than background subtraction based ones. The main contributions of our approach are: (1) a real-time and online processing system that can deal with view changes and occlusions effectively; (2) a two-stage view selection technique that can efficiently fuse multiple detectors; (3) a dual-layer occlusion handling technique that can deal with partial and full occlusions integrally.
The rest of this paper is organized as follows. The details of the proposed method are elaborated in Section II. Experimental results are demonstrated in Section III. Conclusions and discussions are given in Section IV.
II. THE PROPOSED APPROACH
The flow chart of our multi-view vehicle detection and
tracking system is shown in Fig.1. Multiple view detectors are
not only employed to search for new targets but also coupled
together in MMPF to guide the tracking process and perform
view selection of targets in explicit views. For those targets
in inexplicit views, spatial-temporal analysis is explored to
smooth their view transitions and maintain the consistency of traffic flow. To handle occlusions, we devise a cluster based dedicated vehicle model and a backward retracking procedure for partial occlusion and full occlusion,
respectively. In the following, after briefly introducing multiple
view detectors, we will focus our illustration on the two-
stage view selection and dual-layer occlusion handling, which
mainly differentiate our approach from previous methods.
A. Multiple View Detectors
For vehicle surveillance videos in crossroads, it is very
difficult to train one detector that covers all views due to
the large variance of the vehicle appearance. So we train
detectors that cover typical views like frontal (rear) view and
side view. The two detectors are offline trained in the boosting
framework with Joint Sparse Granular Features (JSGF)†, which
has been proven to be effective for object detection and robust
to illumination variation. They provide very discriminative and
steady observation models for multi-view vehicle tracking.
B. Two-Stage View Selection
Having frontal and side view detectors is far from enough
for multi-view vehicle tracking due to response conflict of
the two detectors. In other words, if an unreliable observation is chosen to track a target with conflicting responses, the target may be lost when it cannot obtain enough supporting observations. So we propose the two-stage view selection to integrate the two independent detectors for multi-view vehicle tracking. The two-stage view selection consists of the Multi-Modal Particle Filter and spatial-temporal analysis, which are introduced below.
†Specified object detection apparatus, Chinese Patent 200710305499.7, inventors: Haizhou Ai, Chang Huang, Shihong Lao, Takayoshi Yamashita.
TABLE I
THE FRAMEWORK OF TWO-STAGE VIEW SELECTION
Given: each object $s_{t-1}$ has its supporting multi-view particle set $\{s^n_{t-1,v}, \pi^n_{t-1,v}\}_{n=1,v=1}^{N,V}$, where $N$ is the number of particles per view, $V$ is the number of views, $t-1$ is the frame number and $\pi^n_{t-1,v}$ is the weight of particle $s^n_{t-1,v}$:
• For the particles of the dominant view $dv_{t-1} \in V$:
  + Predict, resample and update as in a traditional particle filter;
  + Obtain the weighted mean state $s_{t,dv_{t-1}}$ of the dominant view;
• For each other view $\{v' \mid v' \in V,\ v' \neq dv_{t-1}\}$:
  + If a new target matches this view:
    - Reinitialize the particles with the new target;
    - Use the detector to evaluate the particles;
  + Else if $\sum_{n=1}^{N} \pi^n_{t-1,v'} < T_S$, or the distance between the centers of $s_{t-1,v'}$ and $s_{t-1,dv_{t-1}}$ satisfies $Dis(v', dv_{t-1}) > T_{Dis}$:
    - Reset all the particles according to $s_{t,dv_{t-1}}$;
    - Update with the reset particles;
  + Else:
    - Predict, resample and update as in a traditional particle filter;
  + Obtain the weighted mean state $s_{t,v'}$ of $v'$;
• If there is a view $v$ with $\sum_{n=1}^{N} \pi^n_{t,v} - \sum_{n=1}^{N} \pi^n_{t,v'} > T_W$ for all $\{v' \mid v' \in V,\ v' \neq v\}$:
  + $dv_t = v$;
• Else:
  + Perform spatial-temporal analysis;
1) Multi-Modal Particle Filter: Multi-Modal Particle Filter
(MMPF) is devised to track multi-view targets. As the name
suggests, a target has two possible views (frontal and side
views) as its two modes, but at any given time it reveals only one view; MMPF is employed to integrate the two view detectors to track it and perform the first-stage view selection.
Different from traditional particle filter or CONDENSA-
TION [10], MMPF maintains two groups of particles for a
target, one for frontal view and the other for side view, not
only to track the target but also to acquire its view transition. In
the MMPF framework (Table 1), each particle is evaluated by
a confidence reflecting the likelihood of the target belonging
to the corresponding view. To select the dominant view, the
total confidence of its particles is calculated for each view. If
the difference between two views’ total confidences is bigger
than a threshold TW (equation (1)), then the bigger one (as
Fig.2(a) and Fig.2(c)) will be treated as the dominant view,
otherwise (as Fig.2(b)) a second stage view selection will be
adopted. Denoting $N$ as the number of particles and $\pi^n_{t,v}$ as the $n$th particle's confidence for view $v$ in frame $t$:

$$\sum_{n=1}^{N} \pi^n_{t,v} - \sum_{n=1}^{N} \pi^n_{t,v'} > T_W \quad (1)$$
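To make this first-stage test concrete, here is a minimal sketch in Python; the function and variable names are ours, and only the thresholded comparison of per-view total confidences comes from the paper, with $T_W = 5$ as in the experimental settings of Section III-A:

```python
import numpy as np

def select_dominant_view(weights_by_view, T_W=5.0):
    """First-stage view selection, Eq. (1).

    weights_by_view: dict mapping a view name to the array of its
    particle confidences pi^n_{t,v}. Returns the dominant view if its
    total confidence exceeds every other view's by more than T_W,
    otherwise None (the view is inexplicit and the second stage,
    spatial-temporal analysis, must decide).
    """
    totals = {v: float(np.sum(w)) for v, w in weights_by_view.items()}
    best = max(totals, key=totals.get)
    if all(totals[best] - c > T_W for v, c in totals.items() if v != best):
        return best
    return None
```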
Since the two groups of particles are not independent, the
traditional procedures of predict, resample and update [10]
Fig. 3. (a) Tracking result (a green box represents frontal view and a red box denotes side view). (b) The predefined confidences of particles (brightness indicates confidence; the position is the center of a particle).
for particles are unsuitable for our framework. So MMPF
needs redesigned procedures to deal with the special cases in which the observation of the minor view (the view other than the dominant view) becomes unreliable or drifts. The redesigned predict, resample and update procedures can be formalized as equation (2) (following the framework in Table I):
Predict by $p(s_{t,dv} \mid s_{t-1,dv})$: $\{s^{(i)}_{t,v}, \pi^{(i)}_{t-1,v}\} \sim p(s_{t,v} \mid O_{t-1,v})$

Resample:
  $\{s^{(i)}_{t,dv},\ 1/N_{dv}\} \sim p(s_{t,dv} \mid O_{t-1,dv})$
  $\{N(s_{new}, \delta^2),\ 1\} \sim p(s_{t,mv} \mid O_{t-1,mv})$
  $\{T(s^{(i)}_{t,dv}),\ 1\} \sim p(s_{t,mv} \mid O_{t-1,mv})$
  $\{s^{(i)}_{t,mv},\ 1/N_{mv}\} \sim p(s_{t,mv} \mid O_{t-1,mv})$

Update: $\pi^{(n)}_{t,v} \propto p(o_{t,v} \mid s_{t,v})$, $\{s^{(i)}_{t,v}, \pi^{(i)}_{t,v}\} \sim p(s_{t,v} \mid O_{t,v})$ \quad (2)
where $dv$ is the dominant view and $mv$ is the minor view, $dv \cup mv = V$. The tracking algorithm first predicts all the particles according to a motion model $p(s_{t,dv} \mid s_{t-1,dv})$ of $dv$. In the resample stage, the particles of $dv$ are resampled according to their weights. For the minor view $mv$, however, different measures are adopted depending on the circumstances: when a new target ($s_{new}$) matches the minor view, $N(s_{new}, \delta^2)$ is used to generate new particles through Gaussian sampling; when the observation becomes unreliable (the total confidence is too small) or the particles drift to another target, $T(s^{(i)}_{t,dv})$ resets the particles according to the particles of the dominant view (with the same center and scale). Apart from these two situations, the particles of the minor view are resampled like the dominant view's. Finally, the tracking algorithm updates the states of both views with the weighted mean of all the resampled particles.
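The resample-stage branching for the minor view might be sketched as follows. This is a simplification under our own assumptions: states are [x, y, w, h] arrays, the drift test uses a track-center distance, and the threshold values are illustrative rather than the paper's:

```python
import numpy as np

def resample_minor_view(particles, weights, dominant_state,
                        new_target=None, T_S=1.0, T_dis=30.0, delta=2.0):
    """Resample-stage branching of Eq. (2) for the minor view mv.

    particles: (N, 4) array of states [x, y, w, h]; weights: (N,)
    particle confidences; dominant_state: weighted mean state s_{t,dv}.
    T_S, T_dis and delta are illustrative values, not the paper's.
    """
    N = len(particles)
    if new_target is not None:
        # A new detection matched this view: reinitialize by Gaussian
        # sampling around it, N(s_new, delta^2).
        return new_target + delta * np.random.randn(N, 4)
    center = particles[:, :2].mean(axis=0)
    drifted = np.linalg.norm(center - dominant_state[:2]) > T_dis
    if weights.sum() < T_S or drifted:
        # Unreliable or drifting observation: T(s_t,dv) resets the
        # minor view to the dominant view's center and scale.
        return np.tile(dominant_state, (N, 1))
    # Otherwise resample by weight, as in a traditional particle filter.
    idx = np.random.choice(N, size=N, p=weights / weights.sum())
    return particles[idx]
```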
In the MMPF framework, the observation models need to give a confidence reflecting the likelihood of the target belonging to the corresponding view. The outputs of each view detector can in principle provide particle confidences that yield the corresponding view confidence, but they are inaccurate and differ from view to view, so they cannot be used directly without post-processing. We therefore utilize the number of layers a particle passes, $l$, and the output of its last layer, $conf_{det}$, to predefine the confidence of a particle.
$$x_l = \exp(a \times (conf_{det} - T^l_{det})) \quad (3)$$

$$\bar{p}_l = c^{\,l - l_{max}} \quad (4)$$

$$p_l = p_{l-1} + \frac{x_l}{1 + x_l} \times (\bar{p}_l - p_{l-1}) \quad (5)$$
Fig. 4. (a) Vehicles in transition views produce responses from different view detectors. (b) Some kinds of vehicles have appearances similar to those of vehicles in other views. (c) Different positions cause appearances similar to those of vehicles in other views.
where $x_l$ is the exponential amplification of the difference between $conf_{det}$ and $T^l_{det}$, $T^l_{det}$ is the confidence threshold of the detector at layer $l$, and $a$ is a constant (set to 5 in experiments). In (4), $\bar{p}_l$ is the basis confidence of layer $l$, $c$ is also a constant (set to 1.1), and $l_{max}$ is the total number of layers of the corresponding detector. $p_l$ in (5) is the redefined confidence.
The frontal view and side view detectors are trained in the same way, with the same number of layers and the same per-layer detection rate, so their pass rates of positive samples in each layer are the same; hence the number of layers a particle passes is an important cue for evaluating the particle. Since layer numbers are discrete and the raw outputs of the detectors are inaccurate, integrating the two metrics to redefine the confidence is more appropriate than using either of them alone. After our redefinition, the confidence is normalized to [0, 1): the higher the layer a particle passes, the bigger its confidence. Figure 3(b) shows the predefined confidences.
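As an illustration, one step of the confidence redefinition could be written as below. This is a sketch under the assumption that $p_{l-1}$ is the confidence carried over from the previous layer; the paper does not specify how $p_0$ is initialized:

```python
import numpy as np

def redefine_confidence(l, conf_det, T_det_l, l_max, p_prev, a=5.0, c=1.1):
    """One step of Eqs. (3)-(5).

    l: number of cascade layers the particle passed; conf_det: detector
    output at that layer; T_det_l: that layer's confidence threshold
    T^l_det; l_max: total layers of the detector; p_prev: p_{l-1}.
    a = 5 and c = 1.1 follow the paper's experimental settings.
    """
    x_l = np.exp(a * (conf_det - T_det_l))  # Eq. (3): amplified margin over the threshold
    p_bar = c ** (l - l_max)                # Eq. (4): basis confidence of layer l, in (0, 1]
    sig = x_l / (1.0 + x_l)                 # logistic squashing into (0, 1)
    return p_prev + sig * (p_bar - p_prev)  # Eq. (5): blend previous confidence toward basis
```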
2) Spatial-Temporal Analysis: Although MMPF is effective in most cases, it is likely to fail when targets reveal inexplicit views (Fig. 4(a)), which may lead to frequent view switches. What is more, some targets may confuse MMPF when their appearances are ambiguous due to their types (Fig. 4(b)) or their distance to the camera (Fig. 4(c)). To address these problems, spatial-temporal analysis is employed to perform the second-stage view selection, which smooths the view switching procedure so that the selected view coincides not only with the traffic flow but also with the view variation tendency.
During the spatial-temporal analysis, four different types of energy terms are explored to vote for the correct view: primary particles, velocity difference, historical views and neighboring targets.

Primary Particles. This is the number of confident particles, which reflects the likelihood of a target belonging to a view from another perspective:

$$|P|, \quad P = \{p \mid Conf_p > T_c\} \quad (6)$$
Velocity Difference. Since vehicles in different views have different moving directions in crossroads, the velocity difference can be used as an energy term. Taking the side view as an example, the velocity along the x-direction is larger than that along the y-direction. We adopt the mean velocity over the most recent 10 frames as a target's velocity, because the velocity between two contiguous frames is inaccurate.

$$V_{Side} = |V_x| - |V_y| \quad (7)$$

$$V_{Frontal} = |V_y| - |V_x| \quad (8)$$
TABLE II
COEFFICIENTS OF THE ENERGY TERMS
Energy Term:  Primary Particles   Velocity Difference   Historical Views   Neighboring Targets
Coefficient:  α = 1/200           β = 1                 γ = 1/10           δ = 1/4
Historical Views. As temporal information, historical views are utilized to smooth the view variation tendency. In our experiments, we record each target's view over the last n frames (n = 10), and use the number of side view occurrences $H_{Side}$ and the number of frontal view occurrences $H_{Frontal}$ as the temporal energy terms.
Neighboring Targets. Since the traffic flow is consistent at a given time, a target's view is always the same as its neighbors'. So the numbers of nearby targets with the same view, $N_{Side}$ and $N_{Frontal}$, are introduced into the spatial-temporal analysis as spatial information.
The composite energy function is defined as equation (9), with the coefficients shown in Table II. Maximum likelihood estimation is used to select the dominant view.

$$U_v = \alpha \times P_v + \beta \times V_v + \gamma \times H_v + \delta \times N_v \quad (9)$$
As the second stage of view selection, spatial-temporal analysis uses spatial and temporal information to help targets in inexplicit views obtain reliable views. The efficient fusion of MMPF and spatial-temporal analysis makes it possible to track multi-view vehicles by seizing their primary observations.
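A sketch of how this second stage might be computed, with the Table II coefficients; the per-target record `TrackInfo` and the confident-particle threshold $T_c$ are our assumptions, since the paper does not report a value for $T_c$:

```python
from dataclasses import dataclass
import numpy as np

ALPHA, BETA, GAMMA, DELTA = 1 / 200, 1.0, 1 / 10, 1 / 4  # Table II

@dataclass
class TrackInfo:
    """Hypothetical per-target record used by the second stage."""
    particle_conf: dict   # view name -> array of particle confidences
    mean_velocity: tuple  # (vx, vy), averaged over ~10 recent frames
    view_history: list    # views selected in the last frames
    neighbor_views: list  # current views of nearby targets

def spatial_temporal_view(track: TrackInfo, T_c: float = 0.5) -> str:
    """Second-stage view selection via the composite energy of Eq. (9).

    T_c (confident-particle threshold) is a placeholder value; the
    paper does not report it.
    """
    vx, vy = track.mean_velocity
    energy = {}
    for view in ("side", "frontal"):
        P = int(np.sum(track.particle_conf[view] > T_c))                 # Eq. (6)
        V = abs(vx) - abs(vy) if view == "side" else abs(vy) - abs(vx)   # Eqs. (7)-(8)
        H = sum(v == view for v in track.view_history[-10:])             # historical views
        N = sum(v == view for v in track.neighbor_views)                 # neighboring targets
        energy[view] = ALPHA * P + BETA * V + GAMMA * H + DELTA * N      # Eq. (9)
    return max(energy, key=energy.get)  # view with the maximal energy
```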
C. Dual-layer Complementary Occlusion Handling
Besides the difficulty of selecting the corresponding vehicle view, occlusion between multiple vehicles is another tough problem. Occlusion can be divided into two types: partial occlusion and full occlusion. Under partial occlusion, the detectors tend to drift due to their congenital deficiency in distinguishing different targets. To solve this problem, a dedicated vehicle model based on clustering is proposed to prevent responses from drifting. As for full occlusion, and for those partial occlusions whose observations become unreliable or lost, a backward smoothing [10] process is adopted to handle them.
Taking advantage of the traffic scene, we propose a dedicated vehicle model based on clustering to handle partial occlusion effectively. The model fuses multiple cues, including position, size and moving trend, to label particles in order to prevent them from drifting. When one target is partially occluded by another target, its particle filter may fail because of response drifting: in the resample stage, some randomly resampled particles cover the other target and receive high confidences, so the merged result gradually drifts to the other target and ultimately breaks the particle filter. It is therefore necessary to label the high-confidence particles in an occlusion cluster before merging, to prevent responses from drifting. For this purpose, we adopt K-Means to cluster the confident particles, exploring the features of position, size and moving-trend difference. We denote the feature vector of a particle as $(x_n, y_n, w_n, h_n, dv^i_{n,x}, dv^i_{n,y})$, where $x_n, y_n, w_n, h_n$ indicate its location and size, $dv^i_{n,x}, dv^i_{n,y}$ are the moving-trend differences in the x and y directions, and $i$ is the target id in the occlusion cluster. For example, for a particle belonging to object 1, $dv^1_{n,x}$ and $dv^1_{n,y}$ represent the differences between the velocity from the target's position in the last frame ($t-1$) to the position of the particle and the velocity of the target. These differences are formalized in equations (10) and (11); the smaller they are, the more likely the particle belongs to the target.

$$dv^i_{n,x} = x_n - x^i_{t-1} - v_{t-1,x} \quad (10)$$

$$dv^i_{n,y} = y_n - y^i_{t-1} - v_{t-1,y} \quad (11)$$
To accelerate convergence and increase accuracy, we use the states of the targets in the occlusion cluster at the last frame ($t-1$) as the initial clustering centers ($dv^i_{n,x} = 0$, $dv^i_{n,y} = 0$). After K-Means clustering, the obtained cluster centers are deemed the states of the targets and are used to reinitialize the corresponding particle sets. If the overlap ratio of two merged targets is bigger than a threshold $T_{overlap}$, which indicates that one target tends to be fully occluded by the other, the second layer of occlusion handling is performed.
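A minimal sketch of this clustering step, assuming the confident particles of one occlusion group have been pooled; assigning each particle its nearest target's moving-trend difference is our simplification of the per-target id $i$:

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_occluded_particles(particles, prev_states, prev_velocities):
    """Cluster the confident particles of one occlusion group.

    particles: (M, 4) array of [x, y, w, h] pooled from the occluding
    targets' confident particles; prev_states: (K, 4) target states at
    frame t-1; prev_velocities: (K, 2) target velocities. Features are
    (x, y, w, h, dv_x, dv_y), with the differences of Eqs. (10)-(11).
    """
    K = len(prev_states)
    # Moving-trend differences of every particle w.r.t. every target.
    dvx = particles[:, 0:1] - prev_states[:, 0] - prev_velocities[:, 0]  # (M, K)
    dvy = particles[:, 1:2] - prev_states[:, 1] - prev_velocities[:, 1]  # (M, K)
    rows = np.arange(len(particles))
    nearest = np.argmin(np.hypot(dvx, dvy), axis=1)
    feats = np.hstack([particles,
                       dvx[rows, nearest][:, None],
                       dvy[rows, nearest][:, None]])
    # Initialize the centers at the previous target states (dv = 0),
    # which accelerates convergence as the paper suggests.
    init = np.hstack([prev_states, np.zeros((K, 2))])
    km = KMeans(n_clusters=K, init=init, n_init=1).fit(feats)
    # Cluster centers are deemed the new target states.
    return km.labels_, km.cluster_centers_[:, :4]
```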
To surmount full occlusion, a backward smoothing [10] process is adopted to retrack lost targets. When a target cannot get enough supporting particles, its track is buffered for future backward smoothing with newly collected observations. The process first scores matches between the new targets and the buffered targets by their affinity (the overlap rate), and then the Hungarian algorithm is employed to obtain the optimal match between the two sets.
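This association step might look as follows, using the overlap rate as affinity and SciPy's Hungarian solver; the gating threshold `min_affinity` is our own addition, as the paper does not report one:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Overlap rate of two boxes (x, y, w, h): the affinity measure."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def match_buffered_tracks(new_boxes, buffered_boxes, min_affinity=0.3):
    """Associate new targets with buffered (occluded) tracks so that
    backward smoothing can retrack them."""
    cost = np.array([[1.0 - iou(n, b) for b in buffered_boxes]
                     for n in new_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= min_affinity]
```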
III. EXPERIMENTS
Experiments are carried out on videos collected from traffic surveillance cameras in crossroads (with camera adjustment, rain, snow and shadow) and on some real-world video data collected with a hand-held camera. The system runs at more than 22 fps on VGA-size (640×480) video on an Intel Core Quad 2.66 GHz CPU with 4 GB RAM.
A. Experiment Settings
In our experiments, the offline frontal view detector is trained from 18567 samples with normalized size 24×24, while the side view detector is trained from 9814 samples with normalized size 48×24. The total confidence threshold $T_W$, used to check whether a target's view is explicit, is set to 5. For the dual-layer occlusion handling, if the overlap between two merged targets after particle clustering is bigger than 0.9 ($T_{overlap}$), the backward smoothing procedure is adopted.
B. Detection Performance Evaluation
The evaluation set contains 3334 images with manually labeled ground truth, covering 24800 multi-view vehicles captured from traffic surveillance videos under different weather conditions. We evaluate the detection performance on the tracking results, which reflects the precision of the tracking algorithm (detectors
Fig. 5. ROC curve of multi-view vehicle tracking results.
are used to search for new targets in a part of the image pyramid and to offer confidences for particles). We compare our work (red
curve) with two other baseline methods: 1) Simple Integration: detect (same detectors) and track (particle filter) targets in frontal view and side view respectively; when a target reveals both views (overlap), the one with the larger confidence is selected and the corresponding detector is used to track it afterwards; 2) Frontal + Side View: detect and track targets in frontal view and side view respectively, with no post-processing. In Fig. 5, we can see that our method achieves a higher detection rate than the other two methods at the same false alarm level while using the same detectors. We attribute this to the two-stage view selection, since it makes the MMPF seize the primary observations of targets and track them effectively.
C. Tracking Performance Evaluation
We adopt the same metrics for evaluating tracking performance as in [6][10]. These metrics are defined as follows. MT: number of Mostly Tracked trajectories; ML: number of Mostly Lost trajectories; Frmt: number of trajectory Fragments; FAT: number of False Alarm Trajectories; IDS: frequency of Identity Switches.
The video we use to evaluate tracking performance consists of 10002 frames at 640×480 resolution and contains frequent partial occlusions and intensive full occlusions. To evaluate the performance of the dual-layer occlusion handling, we compare our algorithm with a variant without occlusion handling. From Table III, which gives the comparison results, we can see that the dual-layer occlusion handling achieves an improvement on almost all the metrics, especially on Frmt. We attribute this significant improvement to the dual-layer occlusion handling, since it provides progressive association for tracking occluded targets, which overcomes most of the fragments. The improvement in Frmt further increases the MT of our method. Fig. 6 gives some typical tracking results.
TABLE III
TRACKING COMPARISON
Algorithm                     GT    MT    ML   Frmt   FAT   IDS
Our method                    215   187   6    41     3     5
Without occlusion handling    215   167   6    129    5     7
Fig. 6. Typical tracking results (first row: complex background; second row:
shadow and occlusion; third row: pedestrian disturbance).
IV. CONCLUSION
In this paper, we present a robust multi-view vehicle detection and tracking algorithm for crossroads. It is a real-time and
online processing system that can deal with view changes and
occlusions effectively. The two-stage view selection is efficient
in fusing multiple detectors while the dual-layer occlusion
handling technique can tackle both partial and full occlusions.
Experiments under different weather conditions (snowy, sunny
and cloudy) demonstrate the effectiveness and efficiency of our
method.
ACKNOWLEDGMENT
This work is supported by Beijing Educational Committee
Program (YB20081000303).
REFERENCES
[1] D. Koller, J. Weber, and J. Malik, “Robust multiple car tracking with
occlusion reasoning,” in Eur. Conf. Comput. Vis., 1994.
[2] S. Kamijo, Y. Matsushita, K. Ikeuchi, and M. Sakauchi, “Occlusion
robust tracking utilizing spatio-temporal markov random field model,”
in IEEE Int. Conf. Pattern Recognition, 2000.
[3] B. T. Morris and M. M. Trivedi, “Learning, modeling, and classification
of vehicle track patterns from live video,” IEEE Trans. Intell. Transp.
Syst., vol. 9, pp. 425–437, 2008.
[4] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J.
Comput. Vis., vol. 57, pp. 137–154, 2004.
[5] C. Huang, H. Ai, Y. Li, and S. Lao, “High performance rotation invariant
multiview face detection,” IEEE Trans. Pattern Anal. Mach. Intel.,
vol. 29, pp. 671–686, 2007.
[6] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially
occluded humans by bayesian combination of edgelet based part detec-
tors,” Int. J. Comput. Vis., vol. 75, pp. 247–266, 2007.
[7] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in IEEE Int. Conf. Comput. Vis. Pattern Recognition, 2005.
[8] C.-H. Kuo and R. Nevatia, “Robust multi-view car detection using
unsupervised sub-categorization,” in Appl. of Comput. Vis., 2009.
[9] Q. Yu and G. Medioni, “Integrated detection and tracking for multiple
moving objects using data-driven mcmc data association,” in IEEE
Motion and Video Computing, 2008.
[10] J. Xing, H. Ai, L. Liu, and S. Lao, “Multiple player tracking in sports
video: a dual-mode two-way bayesian inference approach with progres-
sive observation modeling," IEEE Trans. Image Processing, vol. 20, pp.
1652–1667, 2011.