IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 2013

Stopped Object Detection by Learning Foreground Model in Videos
Lucia Maddalena and Alfredo Petrosino
Abstract: The automatic detection of objects that are abandoned or removed in a video scene is an interesting area of computer vision, with key applications in video surveillance. Forgotten or stolen luggage in train and airport stations and irregularly parked vehicles are examples that raise significant issues, such as the fight against terrorism and crime, and public safety. Both issues involve the basic task of detecting static regions in the scene. We address this problem by introducing a model-based framework to segment static foreground objects against moving foreground objects in single-view sequences taken from stationary cameras. An image sequence model, obtained by learning image sequence variations, seen as trajectories of pixels in time, in a self-organizing neural network, is adopted within the model-based framework. Experimental results on real video sequences and comparisons with existing approaches show the accuracy of the proposed stopped object detection approach.
Index Terms: Artificial neural network, image sequence modeling, stopped foreground detection, video surveillance.
I. INTRODUCTION
A. Related Work
Manuscript received April 10, 2012; revised January 15, 2013; accepted
January 16, 2013. Date of publication February 8, 2013; date of current version
March 8, 2013.
L. Maddalena is with the National Research Council of Italy, Institute for
High-Performance Computing and Networking, Naples 80131, Italy (e-mail:
lucia.maddalena@cnr.it).
A. Petrosino is with the Department of Applied Science,
University of Naples Parthenope, Naples 80143, Italy (e-mail:
alfredo.petrosino@uniparthenope.it).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2013.2242092
The value of τ is chosen depending on the desired responsiveness of the system, and is application-dependent.
In order to classify stopped pixels, for each pixel x of sequence frame I_t, we compute a function C_t of consecutive occurrences of the pixel feature in the foreground model F_t as follows:

  C_t(x) = max(0, C_{t-1}(x) - k_A),   if x ∈ B_t
           min(τ, C_{t-1}(x) + k_B),   if x ∉ B_t ∧ x ∈ F_t        (1)
           max(1, C_{t-1}(x) - k_C),   if x ∉ B_t ∧ x ∉ F_t

where ∧ indicates the logical AND operation, and C_0(x) = 0.
Basically, the count value is increased if x is modeled by
the foreground model Ft , and decreased otherwise. The three
different cases of (1) can be explained as follows.
If, at time t, x is a background pixel [first case of (1)], then, at the previous time t - 1, it was in one of three possible states: 1) it was a background pixel; 2) it was a foreground pixel belonging to a moving object that continues to move at time t; or 3) it was a foreground pixel belonging to a stopped object that starts moving again at time t. In each case, the count of consecutive occurrences of x in F_t should be re-initialized by setting C_t(x) = 0. Instead of just zeroing C_t(x), decreasing its value by the decay factor k_A in (1) allows us to control how fast the system should recognize that a stopped pixel (as in case 3) has moved again. To set the alarm flag off immediately after the removal of the stopped pixel, the decay factor should be large, possibly equal to τ.
If x is modeled by the foreground model F_t [second case of (1)], then it retains the same features it held in the previous frame, and therefore C_t(x) is incremented. This case includes situations where, in past frames, x was a moving pixel belonging to an object that has not moved since then; therefore, the growth factor k_B in (1) determines how fast the system should recognize that a moving pixel is going to stop.
Finally, if pixel x is not modeled by either the background or the moving foreground model [third case of (1)], then it must be a new moving foreground pixel, and therefore C_t(x) should be set to 1. Instead, we just decrease it by the decay factor k_C in (1), in order to enhance robustness to false negatives in the background or in the moving foreground.
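The three-case counter update of (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the boolean membership flags are assumptions, standing in for the actual tests x ∈ B_t and x ∈ F_t.

```python
def update_counter(c_prev, in_B, in_F, tau, k_A, k_B, k_C):
    """Per-pixel update of the consecutive-occurrence counter C_t(x) as in (1).

    c_prev       : C_{t-1}(x)
    in_B, in_F   : whether x is modeled by the background model B_t
                   and by the moving foreground model F_t
    tau          : stationarity threshold
    k_A, k_B, k_C: decay/growth factors
    """
    if in_B:
        # case 1: background pixel -> decay toward 0
        return max(0, c_prev - k_A)
    elif in_F:
        # case 2: moving foreground pixel -> grow, capped at tau
        return min(tau, c_prev + k_B)
    else:
        # case 3: new moving foreground pixel -> decay, floored at 1
        return max(1, c_prev - k_C)
```

Choosing k_A = τ makes a stopped pixel reset in a single frame, as suggested above for an immediate alarm turn-off.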
D. Stopped Foreground Model: Construction
According to Definition 1, a pixel x in the sequence frame I_t is classified as stopped if C_t(x), as defined in (1), reaches the stationary threshold value τ. In order to retain the memory of objects that have stopped and to distinguish them from moving objects possibly passing in front of them in subsequent frames, pixels x for which C_t(x) = τ are moved from the moving foreground model F_t to a new stopped foreground model S_t.
Given a pixel x of the current sequence frame I_t, the stopped foreground model S_t is constructed and updated according to the following rules:
1) S_t is initialized as soon as x has been modeled by F_t for at least τ consecutive frames: if x ∈ F_t and C_t(x) = τ, then insert(S_t, x);
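The promotion rule above might read as follows in code. This is an illustrative sketch only: the dict-based F_model/S_model containers and the semantics chosen for insert are assumptions, not the paper's actual data structures.

```python
def promote_if_stopped(x, C, tau, F_model, S_model):
    """If pixel x has been modeled by the moving foreground for tau
    consecutive frames (C[x] == tau), move its entry from F_model to
    S_model, i.e., perform insert(S_t, x) by reusing the learned model."""
    if C.get(x, 0) == tau and x in F_model:
        S_model[x] = F_model.pop(x)  # transfer, do not relearn from scratch
        return True
    return False
```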
Fig. 2. (a) Example of sequence frame I_t (each circle represents a pixel, connected with its neighbors), (b) corresponding neural map M_t with n = 5 layers, and (c) inter-layer and intra-layer weight decay during the update procedure for M_t.
The best matching weight vector m^b_t(x), computed according to (3) and belonging to layer b of model M_t, is used as the pixel encoding approximation. Therefore, the best matching weight vector m^b_t(x) and its neighboring weight vectors of the b-th layer of model M_t are updated according to the weighted running average

  m^b_t(y) = (1 - α(x, y)) m^b_{t-1}(y) + α(x, y) I_t(y),  ∀ y ∈ N_x.   (4)

Here, N_x = {y : |x - y| ≤ w_2D} is a 2-D spatial neighborhood of x of size (2 w_2D + 1) × (2 w_2D + 1) including x. Moreover, α(x, y) = γ G_2D(y - x), where γ represents the learning rate that depends on scene variability, while G_2D(·) = N(·; 0, σ²_2D I) is a 2-D Gaussian low-pass filter [43] with zero mean and σ²_2D I variance. The α(x, y) values are weights that allow us to smoothly take into account the spatial relationship between the current pixel x and its neighboring pixels y ∈ N_x, thus preserving topological properties of the input (close inputs correspond to close outputs). The 2-D intra-layer update of
(4) is schematically shown in Fig. 2. Let us suppose that the current incoming pixel x of the t-th sequence frame I_t is the center black-colored circle in Fig. 2(a) and that the best matching weight vector m^b_t(x), computed according to (3), belongs to layer b = 3 of model M_t, and is indicated by the black-colored circle in the third layer of Fig. 2(b). Then, the weighted running average of (4) in a neighborhood N_x of size 3 × 3 (choosing w_2D = 1) involves all black-colored circles shown in the third layer of Fig. 2(c), using Gaussian weights represented by the 2-D Gaussian function over these nodes.
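The intra-layer update of (4) can be sketched as below. To keep the sketch self-contained, it assumes a single scalar feature per pixel (rather than full weight vectors) and an unnormalized Gaussian; function names are illustrative, not the authors'.

```python
import math

def gaussian_weight(dr, dc, sigma2=0.75):
    """Unnormalized 2-D Gaussian weight G_2D at spatial offset (dr, dc)."""
    return math.exp(-(dr * dr + dc * dc) / (2.0 * sigma2))

def update_layer(m_b, frame, x, gamma, w2d=1, sigma2=0.75):
    """Weighted running average of (4) on layer b around pixel x = (row, col).

    m_b and frame are lists of lists of floats (one scalar feature per pixel,
    for simplicity); alpha(x, y) = gamma * G_2D(y - x).
    """
    r, c = x
    rows, cols = len(m_b), len(m_b[0])
    for dr in range(-w2d, w2d + 1):
        for dc in range(-w2d, w2d + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:  # clip at image borders
                a = gamma * gaussian_weight(dr, dc, sigma2)
                # closer neighbors get larger weights, preserving topology
                m_b[rr][cc] = (1 - a) * m_b[rr][cc] + a * frame[rr][cc]
    return m_b
```

With gamma = 1 the best-matching position is overwritten by the incoming value, while its 3 × 3 neighbors (w2d = 1) move only part of the way toward it.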
The 2-D update of (4) involves only model weight vectors lying in the same layer b as the best matching weight vector m^b_t(x). In order to further enable the reinforcement of m^b_t(x) in the model for pixel x, the weight vectors of x belonging to layers close to layer b are also updated. This further update is achieved by the weighted running average

  m^i_t(x) = (1 - δ(x)) m^i_{t-1}(x) + δ(x) I_t(x).   (5)
  γ = γ1 - ((γ1 - γ2) / K) t,   if 0 < t < K        (8)
      γ2,                       if t ≥ K

where γ1 and γ2 are predefined constants such that γ2 ≤ γ1. Indeed, in order to ensure neural network convergence during the training phase (0 < t < K), the learning factor in (8) is chosen as a monotonically decreasing function of time t, while, during the subsequent adaptation phase (t ≥ K), the learning factor γ2 in (8) is chosen as a constant value that depends on the scene
variability. Large values enable the network to learn changes corresponding to the background faster, but can also lead to false negatives, that is, to the inclusion into the background model of pixels belonging to foreground moving objects. On the contrary, lower learning rates make the network slower to adapt to rapid background changes, but make the model more tolerant to errors due to false negatives through self-organization. Indeed, weight vectors of false negative pixels are readily smoothed out by the learning process itself.

TABLE I
ADOPTED PARAMETER VALUES FOR 3D_SOBS AND SFS ALGORITHMS SPECIFIC FOR EACH IMAGE SEQUENCE

Sequence  | τ    |   2   |  2   |  2
Dog       | 60   | 0.02  | 0.05 | 0.05
Binders   | 30   | 0.02  | 0.05 | 0.05
PVEasy    | 1500 | 0.02  | 0.05 | 0.05
PVMedium  | 1500 | 0.01  | 0.08 | 0.08
PVHard    | 1500 | 0.005 | 0.05 | 0.05
ABEasy    | 2500 | 0.02  | 0.05 | 0.05
ABMedium  | 2500 | 0.02  | 0.05 | 0.05
ABHard    | 2500 | 0.02  | 0.05 | 0.05
The same mathematical ground is behind the choice of the learning rate δ in (5), whose values can be chosen as

  δ = δ1 - ((δ1 - δ2) / K) t,   if 0 < t < K        (9)
      δ2,                       if t ≥ K

where δ1 and δ2 are predefined constants such that δ2 ≤ δ1.
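The schedules (8) and (9) share one shape: a linear decay from a large initial rate to a small constant one over the first K frames. A minimal sketch (the function name is an assumption):

```python
def learning_rate(t, g1, g2, K):
    """Piecewise learning-rate schedule of (8) and (9): linear decay from g1
    down to g2 during the training phase (0 < t < K), then the constant
    adaptation-phase value g2 for t >= K. Requires g2 <= g1."""
    if 0 < t < K:
        return g1 - (g1 - g2) * t / K
    return g2
```

Note the schedule is continuous at t = K, since the linear branch reaches exactly g2 there.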
The moving foreground model F_t is initialized as in (2) using the pixel value I_t(x), as soon as x is detected as not belonging to the background model B_t, and it is updated according to (4) and (5) if x is detected as belonging to the moving foreground model F_t. The latter condition can be checked by thresholding as in (6), where this time m^b_t(x) should be interpreted as the best matching weight vector of F_t for the current pixel x.
Finally, the stopped foreground model S_t is initialized as in (2), as soon as x is detected as belonging to the moving foreground model F_t for at least τ consecutive frames. Here, instead of using the incoming pixel value I_t(x), we can make use of the already established moving foreground model F_t, by just moving the weight vectors of F_t into the stopped foreground model. The stopped foreground model S_t is updated according to (4) and (5) if x is detected as belonging to the stopped foreground model S_t. As for the moving foreground model, the latter condition can be checked by the thresholding of (6), where m^b_t(x) is interpreted as the best matching weight vector of S_t for the current pixel x.
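The per-pixel lifecycle among the three models described above can be summarized as a small state machine. This is a deliberately simplified sketch under stated assumptions: the state names, the matches_* predicates (standing in for the thresholding tests of (6)), and the unit counter increments are all illustrative, not the paper's exact procedure.

```python
def pixel_step(state, c, matches_B, matches_F, matches_S, tau):
    """One per-pixel step of the background / moving / stopped lifecycle.

    state     : one of "background", "moving", "stopped"
    c         : C_{t-1}(x), with counter changes simplified to +/- 1
    matches_* : whether the pixel value is explained by B_t, F_t, or S_t
    Returns (new_state, new_c).
    """
    if state == "stopped" and matches_S:
        return "stopped", c                  # keep updating S_t via (4), (5)
    if matches_B:
        return "background", max(0, c - 1)   # back to background, count decays
    if matches_F:
        c = min(tau, c + 1)
        if c == tau:
            return "stopped", c              # promote: move weights F_t -> S_t
        return "moving", c
    return "moving", max(1, c - 1)           # new moving foreground pixel
```

Running this for τ consecutive foreground matches walks a pixel from background through moving to stopped, mirroring the model construction described above.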
IV. EXPERIMENTAL RESULTS
Several experiments have been conducted to validate our approach to SOD and to compare its results with those achieved by other state-of-the-art methods. In the following, the choice of parameter values and qualitative and quantitative results are described for several publicly available sequences.
A. Parameter Values
For most of the parameters of the 3D_SOBS and SFS algorithms, we could choose values common to all the considered sequences. Specifically, we fixed n = 5 model layers for the background, the moving foreground, and the stopped foreground models; halfwidths w_2D = 1 and w_1D = 1 of the neighborhoods for the model updates in (4) and (5), respectively; and variances σ²_2D = 0.75 and σ²_1D = 0.75 of the 2-D and 1-D Gaussian low-pass filters specifying the weights for the model updates.
Fig. 3. Results of the moving and stopped object segmentation for the Dog sequence. (a) Original frame I_270. (b) MOD mask. (c) Representation of the background model B_270, computed by the 3D_SOBS algorithm. (d) Representation of the moving foreground model F_270 and (e) of the stopped foreground model S_270, computed by the SFS algorithm. (f) Original frame with moving (green) and stopped (red) foreground objects.
Fig. 4. Results of moving and stopped object segmentation for the frames 1280 (first row), 1321 (second row), and 1395 (third row) of the Binders sequence. (a) Original frames. (b) MOD masks computed by the 3D_SOBS algorithm. (c) Original frames with the first (red) and second (blue) stopped layers. Representations of (d) the first stopped layer model, (e) the second stopped layer model, and (f) the moving foreground model.
TABLE II
COMPARISON OF GROUND TRUTH (GT) STOPPED OBJECT EVENT START AND END TIMES (IN MIN) FOR THE i-LIDS SEQUENCES WITH THOSE COMPUTED BY THE PROPOSED APPROACH AND BY OTHER APPROACHES
TABLE III
COMPARISON OF THE NUMBER OF TRUE DETECTIONS (TD) AND FALSE DETECTIONS (FD) FOR THE i-LIDS SEQUENCES ACHIEVED BY THE PROPOSED APPROACH (SFS + 3D_SOBS) AND BY OTHER APPROACHES (Albiol et al. [28], Evangelio et al. [20], Pan et al. [17], and Tian et al. [4])
Fig. 5. Average segmentation accuracy on the Dog and Binders sequences, varying the number n of model layers.

Fig. 6. Execution times (in ms/frame) for the proposed MOD and SOD algorithms on color image sequences with different sizes (S, M, H), varying the number n of model layers.
V. CONCLUSION
This paper presented a neural-based contribution to stopped object detection in digital image sequences taken from stationary cameras. A 3-D neural model for image sequences, which automatically adapts to scene changes in a self-organizing manner, was targeted at modeling the background and the foreground, aimed at the detection of stopped objects. Coupled with the proposed model-based framework for stopped object detection, it enables the segmentation of stopped foreground objects against moving foreground objects, robustly handling occlusion and restart problems.
Experimental results on real video sequences and comparisons with existing approaches showed that the proposed 3-D neural model-based framework compares favorably to other tracking- and non-tracking-based approaches to stopped object detection. The proposed approach is shown to be an inexpensive by-product of background subtraction that provides an initial segmentation of scene objects, useful for any subsequent video analysis task, such as abandoned and removed object classification, people counting, and human activity recognition.
REFERENCES
[1] Fourth IEEE International Conference on Advanced Video and Signal Based Surveillance. Piscataway, NJ: IEEE Computer Society, Sep. 2007.
[2] Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE Computer Society, Jun. 2006.
[3] Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE Computer Society, Oct. 2007.
[4] Y. Tian, R. Feris, H. Liu, A. Hampapur, and M.-T. Sun, "Robust detection of abandoned and removed objects in complex surveillance videos," IEEE Trans. Syst., Man, Cybern. C, vol. 41, no. 5, pp. 565–576, Sep. 2011.
[5] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," Proc. SPIE, vol. 5308, pp. 881–892, 2004, doi: 10.1117/12.526886.
[6] S. Elhabian, K. El Sayed, and S. Ahmed, "Moving object detection in spatial domain using background removal techniques: State-of-art," Recent Patents Comput. Sci., vol. 1, no. 1, pp. 32–54, Jan. 2008.
[7] M. Piccardi, "Background subtraction techniques: A review," in Proc. IEEE Int. Conf. Syst. Man Cybern., vol. 4, Oct. 2004, pp. 3099–3104.
[8] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection algorithms: A systematic survey," IEEE Trans. Image Process., vol. 14, no. 3, pp. 294–307, Mar. 2005.
[9] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. Int. Conf. Comput. Vis., vol. 1, 1999, pp. 255–261.
[10] M. Bhargava, C.-C. Chen, M. Ryoo, and J. Aggarwal, "Detection of object abandonment using temporal logic," Mach. Vis. Appl., vol. 20, no. 5, pp. 271–281, 2009.
[11] S. Boragno, B. Boghossian, J. Black, D. Makris, and S. Velastin, "A DSP-based system for the detection of vehicles parked in prohibited areas," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 260–265.
[12] S. Guler, J. A. Silverstein, and I. H. Pushee, "Stationary objects in multiple object tracking," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 248–253.
[13] J. Lee, M. Ryoo, M. Riley, and J. Aggarwal, "Real-time illegal parking detection in outdoor environments using 1-D transformation," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp. 1014–1024, Jul. 2009.
[14] S. Lu, J. Zhang, and D. Dagan Feng, "Detecting unattended packages through human activity recognition and object association," Pattern Recognit., vol. 40, no. 8, pp. 2173–2184, Aug. 2007.