Abstract— A framework for robust foreground detection that works under difficult conditions such as dynamic background and a moderately moving camera is presented in this paper. The proposed method includes two main components: coarse scene representation as the union of pixel layers, and foreground detection in video by propagating these layers using a maximum-likelihood assignment. We first cluster into "layers" those pixels that share similar statistics. The entire scene is then modeled as the union of such non-parametric layer-models. An incoming pixel is detected as foreground if it does not adhere to these adaptive models of the background. A principled way of computing thresholds is used to achieve robust detection performance with a pre-specified number of false alarms. Correlation between pixels in the spatial vicinity is exploited to deal with camera motion without precise registration or optical flow. The proposed technique adapts to changes in the scene, and allows persistent foreground objects to be automatically converted to background and re-converted to foreground when they become interesting. This simple framework addresses the important problem of robust foreground and unusual-region detection, at about 10 frames per second on a standard laptop computer. The presentation of the proposed approach is complemented by results on challenging real data and comparisons with other standard techniques.

Index Terms— Video analysis, scene analysis, foreground detection, surveillance, background subtraction, layer tracking.

I. INTRODUCTION AND PREVIOUS WORK

ROBUST detection of foreground and "interesting" or "unusual" events is an important pre-cursor for many image and video applications, such as tracking, identification and surveillance. Although there is often no prior information available about the foreground object to be detected, in many situations the background scene is available in all frames of the video. This allows for detecting the foreground by "subtracting" the background from the scene, instead of explicitly modeling the foreground. Important factors that make detection very challenging are: (a) dynamic background with water ripples, swaying trees, etc.; (b) camera motion due to support vibration, wind, etc. (which we will henceforth denote as "nominal" motion); and (c) real-time or quasi-real-time detection requirements for most applications. These factors make detection of the foreground by simple frame differencing (as in [1]) very difficult. Hence, in general, scene modeling must be robust enough to accommodate pixel feature variation and pixel position uncertainty, and to minimize or completely avoid pixel registration. In our proposed detection framework, the spatio-temporal correlation between pixels is exploited to provide priors on the set of pixel-clusters or layers that any pixel can belong to. This, in turn, allows us to handle camera motion, without any explicit registration, while robustly detecting the foreground embedded in dynamic background, as illustrated by our results in Section IV and in [2].

Prior approaches toward background modeling used single (e.g. [3]) or multiple (e.g. [4], [5], and more recently [6]) mixtures of Gaussians (MoG). We will later show that our proposed approach compares favorably against such models as well as against codebook-type approaches as in [7]. The problem of dynamic background has recently been addressed by Mittal and Paragios [8], who propose an adaptive Kernel Density Estimation technique (first used in [9]) for modeling individual pixels; it works well for a stationary camera but is difficult to generalize to a moving camera. Recently, Sheikh and Shah [10] have proposed exploiting the spatial correlation among pixels by using the position of a pixel along with its color. They model the entire scene using a single 5-D pdf.1 The authors in [11] describe a three-level algorithm where region-level and frame-level information is used to make decisions at the pixel level. In [12], Zhong et al. used a Kalman filter to model image regions as an autoregressive moving average (ARMA) process.

The concept of a layered representation of a scene was first introduced by Adelson in [13], making a very good case for modeling a video scene as a group of layers instead of single pixels. There has been a large amount of notable work in this direction since, e.g., [14]–[18].

In our work, the scene is coarsely modeled as a group of layers in order to robustly detect the foreground under static or dynamic background and in the presence of nominal camera motion. The training phase determines the (initial) number of layers and helps to automatically compute the bandwidths and thresholds required for the non-parametric modeling of layers and robust foreground detection. Though coarse layer propagation is a by-product of the foreground detection step, it is not the main focus of this work, and the reader should refer to the above-mentioned works on layer tracking for the state of the art in that regard. Thus the main contributions of our work are: (a) a principled way of extracting and automatically computing the number of scene layers,2 and (b) different detection thresholds for different "layers," using a principled approach for threshold computation that allows for less than a pre-specified number of false alarms (NFA),3

1 Our proposed approach generates comparable results using a 3-D feature space (see Figure 6).
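Contribution (b) above fixes a per-layer detection threshold so that the expected number of false alarms over all tests stays below a target. A minimal a-contrario-style sketch under that reading (the function name, the empirical-quantile construction, and the toy inputs are ours, not the paper's estimator):

```python
def nfa_threshold(scores, n_tests, max_nfa=1.0):
    """Pick a threshold so that n_tests * P(background score > threshold)
    stays below max_nfa, using the empirical background score distribution
    as the null. Hypothetical helper for illustration only."""
    # Sort null-hypothesis (background) scores ascending.
    ranked = sorted(scores)
    # Per-test tail probability allowed so that n_tests * p <= max_nfa.
    p = max_nfa / n_tests
    # Index of the (1 - p) empirical quantile, clamped to the last sample.
    k = min(int((1.0 - p) * len(ranked)), len(ranked) - 1)
    return ranked[k]
```

With 1000 background scores and 100 tests per frame, the threshold lands at the empirical 99th percentile, so on average fewer than one background sample per frame exceeds it.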
§ Electrical and Computer Engineering and IMA, University of Minnesota.
2 The model is sufficiently simple and accurate for the task of foreground […] properties with the goal of detection and not for accurate image segmentation.

[Figure: overview of the proposed algorithm. Offline training step (LILO stack of the first M frames): decompose the scene into layers S = {Li}, i = 1, …, N; compute the bandwidth Hi corresponding to the pixels in Li; compute the threshold τi for Li such that the "Number of False Alarms" < 1. Online step (repeated for all frames > M).]

A. Extracting a layer: Initial guess

We first generate an initial guess of a layer that we wish to extract from the scene. We compute a local maximum of the image histogram (hmax at the gray value gmax 5), and a radius ρ, which is the square root of the trace of the global covariance matrix. These computations are similar to those in [19]. Thus, all pixels with gray values between gmax − ρ and gmax + ρ form our initial guess or "layer-candidate" (LC) of the layer to be extracted in the image domain.6
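For a single gray-value feature, the initial-guess step above reduces to a histogram peak plus a covariance-based radius; a minimal sketch (pure Python on a flat list of gray values; the names and the toy 1-D setting are ours, not the authors' code):

```python
import math

def layer_candidate(gray_values, n_bins=256):
    """Initial layer guess: pixels whose gray value lies within
    [gmax - rho, gmax + rho] of the histogram peak gmax."""
    # Histogram of integer gray values in [0, n_bins).
    hist = [0] * n_bins
    for g in gray_values:
        hist[g] += 1
    # hmax is the peak count, attained at gray value gmax.
    gmax = max(range(n_bins), key=lambda b: hist[b])
    # rho = sqrt(trace(covariance)); for a 1-D feature this is the std dev.
    mean = sum(gray_values) / len(gray_values)
    var = sum((g - mean) ** 2 for g in gray_values) / len(gray_values)
    rho = math.sqrt(var)
    # Layer candidate LC: indices of pixels within [gmax - rho, gmax + rho].
    lc = [i for i, g in enumerate(gray_values) if gmax - rho <= g <= gmax + rho]
    return lc, gmax, rho
```

In the paper the feature is vector-valued, so the trace runs over the full covariance matrix; the 1-D version above is only meant to make the geometry of the candidate set concrete.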
IEEE PAMI

…propose the use of a Kullback-Leibler (KL) divergence. First (before beginning the layer extraction process), assume that the entire image is the candidate layer (i.e., ρ → ∞). After a few refinement steps, compute the KL divergence between the foreground and background.

Figure 5 shows an example of a temporally persistent object in the scene being converted to a new background layer (which is still continuously tracked as a layer) and then re-converted to a foreground object when it becomes interesting again.

Fig. 5. Example of a video where detections are based on temporal persistence. (a) Original frames of a video sequence with multiple foreground objects. (b) Detection results; the foreground is colored in red. The first person is detected as foreground initially but is converted to a background layer (indicated by green) after he becomes temporally persistent (un-interesting), fourth figure. The same person is converted to foreground again when he becomes interesting, last two figures. Temporally persistent (un-interesting) foreground is converted to background (which is tracked nonetheless as a layer, shown in green) and then converted back to foreground when it becomes interesting again. Please see the complete video at http://www.tc.umn.edu/~patw0007/videolayers/. (This is a color figure.)

In order to provide a quantitative perspective on the quality of foreground detection with our approach, we have used a test sequence and the corresponding ground-truth segmentation from [10].14 Figure 6 first shows a qualitative illustration of the results as compared to the ground truth and a Mixture-of-Gaussians (MoG) model, and the corresponding quantitative comparison is shown in Figure 7. It can be seen that the proposed approach has almost no false detections in spite of the camera motion. The sequence contains average nominal motion of about 14.66 pixels. Figure 7 (top) shows a comparative plot of the total number of pixels detected as foreground. It can be seen that the MoG model produces a large number of false alarms even when we use a mixture of 5 Gaussians. Figure 7 (middle and bottom) also compares the proposed approach with a 5-component MoG model …

Fig. 6. Foreground detection with average camera motion of about 14.66 pixels in arbitrary directions. The top row shows the original frames. The second row shows foreground detection results using a Mixture-of-Gaussians (MoG) model. The third row shows results from the proposed approach. The fourth row shows the exact image pixels that are detected as foreground with our approach, while the last row shows the ground-truth segmentation from [10]. (This is a color figure.)

Fig. 7. Quantitative comparison. The plot on the top indicates the number of pixels detected as foreground using various learning rates of the MoG model (0.5, 0.05, 0.005), as well as the proposed approach, against the ground-truth manually segmented video. The plot in the middle compares the Precision, and the bottom plot compares the Recall rates, with various learning rates of the MoG model against our approach. (This is a color figure.)

…a particular minimum size, which avoids inconsistent assignment of pixels by force.
10 Even though we use labels from previous frames as the initial guess, in the end we also check for the appearance of completely new layers which may not have been previously seen.
14 Note again that this comparison is mainly for completeness, since our proposed framework attempts only detection and not full segmentation.
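The KL divergence referred to in the refinement step compares the candidate foreground distribution against the background one. For discrete (histogram) distributions it can be sketched as follows; this is the generic definition, while the exact densities compared in the paper are its non-parametric layer models:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) = sum_i p_i * log(p_i / q_i)
    for two normalized histograms p and q over the same bins.
    eps guards against taking log of zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

D(p || q) is zero when the two histograms coincide and grows as the candidate layer's statistics separate from the background's, which is what makes it usable as a stopping criterion for the refinement.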
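The Precision and Recall rates plotted in Figure 7 follow the standard per-frame definitions over foreground masks; a minimal sketch (our helper on flat 0/1 masks, not the authors' evaluation code):

```python
def precision_recall(detected, ground_truth):
    """Precision and Recall of a detected foreground mask against a
    ground-truth mask, both given as flat sequences of 0/1 labels."""
    tp = sum(1 for d, g in zip(detected, ground_truth) if d and g)      # true positives
    fp = sum(1 for d, g in zip(detected, ground_truth) if d and not g)  # false alarms
    fn = sum(1 for d, g in zip(detected, ground_truth) if not d and g)  # misses
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

A high MoG learning rate absorbs the moving foreground into the background model (lowering recall), while a low rate leaves camera-motion residue detected as foreground (lowering precision), which is the trade-off the Figure 7 curves illustrate.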
[3] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[4] N. Friedman and S. J. Russell, "Image segmentation in video sequences: A probabilistic approach," in Thirteenth Conference on Uncertainty in Artificial Intelligence, Providence, Rhode Island, USA, 1997, pp. 175–181.
[5] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, 2000.
[6] V. Morellas, I. Pavlidis, and P. Tsiamyrtzis, "DETER: Detection of events for threat evaluation and recognition," Machine Vision and Applications, vol. 15, no. 1, pp. 29–45, 2003.
[7] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, "Real-time foreground-background segmentation using codebook model," Real-Time Imaging, vol. 11, no. 3, pp. 172–185, 2005.
[8] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Intl. Conf. on Computer Vision and Pattern Recognition, Washington, DC, 2004, pp. 302–309.
[9] A. M. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proceedings of the European Conference on Computer Vision. London, UK: Springer-Verlag, 2000, pp. 751–767.
[10] Y. Sheikh and M. Shah, "Bayesian modeling of dynamic scenes for object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1778–1792, November 2005.
[11] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in IEEE Intl. Conf. on Computer Vision, Kerkyra, Corfu, Greece, 1999, pp. 255–261.
[12] J. Zhong and S. Sclaroff, "Segmenting foreground objects from a dynamic textured background via a robust Kalman filter," in IEEE Intl. Conf. on Computer Vision, Nice, France, 2003, pp. 44–50.
[13] E. H. Adelson, "Layered representation for vision and video," in Representation of Visual Scenes, in conjunction with IEEE Intl. Conf. on Computer Vision, Boston, MA, 1995, p. 3.
[14] H. Tao, H. S. Sawhney, and R. Kumar, "Object tracking with Bayesian estimation of dynamic layer representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 75–89, 2002.
[15] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov, "Bilayer segmentation of live video," in Intl. Conf. on Computer Vision and Pattern Recognition, New York, NY, 2006, pp. 53–60.
[16] Y. Weiss, "Smoothness in layers: Motion segmentation using nonparametric mixture estimation," in Intl. Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997, p. 520.
[17] S. Khan and M. Shah, "Object based segmentation of video using color, motion and spatial information," in Intl. Conf. on Computer Vision and Pattern Recognition, Hawaii, vol. 2, 2001, pp. 746–751.
[18] Y. Zhou and H. Tao, "A background layer model for object tracking through occlusion," in IEEE Intl. Conf. on Computer Vision, Nice, France, 2003, p. 1079.
[19] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: Color image segmentation," in Intl. Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997, p. 750.
[20] G. Wyszecki and W. S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.
[21] L. Zhao and L. S. Davis, "Iterative figure-ground discrimination," in International Conference on Pattern Recognition, Cambridge, UK, 2004, pp. 67–70.
[22] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, UK: Chapman and Hall, 1986.
[23] D. W. Scott, Multivariate Density Estimation. Wiley Inter-Science, 1992.
[24] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis, "Improved fast Gauss transform and efficient kernel density estimation," in IEEE Intl. Conf. on Computer Vision, Nice, France, 2003, pp. 464–471.
[25] F. Cao, Y. Gousseau, P. Muse, F. Sur, and J.-M. Morel, "Accurate estimates of false alarm number in shape recognition," Cachan, France (http://www.cmla.ens-cachan.fr/Cmla/), Tech. Rep., 2004.
[26] T. Veit, F. Cao, and P. Bouthemy, "An a contrario framework for motion detection," INRIA, Tech. Rep. 5313, 2004 (http://www.irisa.fr/vista/Publis/Auteur/Frederic.Cao.english.html).
[27] K. Kim, D. Harwood, and L. S. Davis, "Background updating for visual surveillance," in International Symposium on Visual Computing (ISVC), LNCS 3804, 2005.