
Stopped Object Detection by Learning Foreground Model in Videos
Lucia Maddalena, Member, IEEE, and Alfredo Petrosino, Senior Member, IEEE

Abstract: The automatic detection of objects that are abandoned or removed in a video scene is an interesting area of computer vision, with key applications in video surveillance. Forgotten or stolen luggage in train and airport stations and irregularly parked vehicles are examples that concern significant issues, such as the fight against terrorism and crime, and public safety. Both issues involve the basic task of detecting static regions in the scene. We address this problem by introducing a model-based framework to segment static foreground objects against moving foreground objects in single view sequences taken from stationary cameras. An image sequence model, obtained by learning image sequence variations, seen as trajectories of pixels in time, in a self-organizing neural network, is adopted within the model-based framework. Experimental results on real video sequences and comparisons with existing approaches show the accuracy of the proposed stopped object detection approach.

Index Terms: Artificial neural network, image sequence modeling, stopped foreground detection, video surveillance.

I. INTRODUCTION

Abandoned and removed object detection in image sequences is a task common to many video surveillance applications, such as the detection of irregularly parked vehicles [1] and the detection of unattended luggage in public areas [1]–[3]. Many definitions of the problem have been given; the most widely accepted can be stated as in [4], where an abandoned object is defined as a stationary object that has not been in the scene before, and a removed object is defined as a stationary object that has been in the scene before, but is not there anymore. Both problems have the common basic task of detecting stationary regions in the scene, that is, changes of the scene that stay in the same position for a relatively long time. As an example, a moving object that stops for a while in the scene could be an abandoned object; if the object later starts moving again, it could be a removed object. In both cases, the basic task is to detect the object as stationary in order to later classify it as either abandoned or removed. This basic task, hereafter referred to as stopped object detection (SOD), is the objective of the present research. Depending on the video surveillance application to be solved, SOD is followed by higher level scene analysis tasks, such as discrimination between abandoned and removed objects, as well as between humans and non-humans.

The typical SOD process is articulated into two steps: 1) detection of foreground objects not belonging to the scene background, usually termed moving object detection (MOD) [5]–[8], and 2) detection, among moving objects, of those objects that are static. Accordingly, key challenges for SOD include the following.
1) Well-known issues in MOD (see [9]), such as gradual or abrupt light changes, moving background, and cast shadows (basic challenges).
2) Further specific challenges in complex real-world scenes, arising from the need to determine when stopped objects resume motion (restart challenge) or are occluded by other objects or by people passing in front of them (occlusion challenge).

Manuscript received April 10, 2012; revised January 15, 2013; accepted January 16, 2013. Date of publication February 8, 2013; date of current version March 8, 2013.
L. Maddalena is with the National Research Council of Italy, Institute for High-Performance Computing and Networking, Naples 80131, Italy (e-mail: lucia.maddalena@cnr.it).
A. Petrosino is with the Department of Applied Science, University of Naples Parthenope, Naples 80143, Italy (e-mail: alfredo.petrosino@uniparthenope.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2013.2242092

A. Related Work

Several single view approaches have been proposed to tackle the SOD key challenges, and they can be broadly grouped into tracking-based and background subtraction-based approaches.

In tracking-based approaches, SOD is obtained on the basis of the analysis of object trajectories through an application-dependent event detection phase [10]–[16]. While these methods allow us to directly extract trajectory-based semantics (e.g., who left the item or how long the item has been left before the person moved away), they are often unreliable in complex surveillance videos, due to occlusions and light changes, as also pointed out in [4], [17], and [18].

Background subtraction-based approaches rely on background modeling and foreground analysis for SOD. They have generally proven to achieve superior robustness to key challenges in complex real-world scenarios [4], [17]–[24] and can be used as a preprocessing stage to improve tracking-based video analysis (e.g., [25]). A further sub-classification can be obtained depending on the adoption of pixel-level or region-level analysis [17]. Pixel-based methods assume that the time series observations are independent at each pixel, while region-based methods take advantage of inter-pixel relations: the images are segmented into regions, or the low-level classification obtained at the pixel level is refined. While pixel-wise background model adaptation is very useful for handling multimodal backgrounds (e.g., waving trees), it lacks higher level information about the object shape [4].


This does not greatly influence the restart challenge, but it could have an influence on the occlusion challenge. Indeed, region-level analysis can play a major role in identifying false foreground and false stationary objects, leading to higher robustness to occlusion [4], [17], at the price of increased overall complexity.
One of the earliest approaches to SOD based on background
subtraction is reported in [19] and [21]. Two processes are
distinguished: pixel analysis, which determines whether a
pixel is stationary, transient, or background by observing its
intensity profile over time, and region analysis, which deals
with the agglomeration of groups of pixels into moving and
stopped regions. Stationary regions are added as a layer over
the background, through a layer management process used to help solve the occlusion and restart challenges.
In [22], a SOD method based on double-background subtraction, robust to illumination changes and noise, is reported. The long-term background represents a clean scene; all
objects located in this background are not considered static
objects, but as part of the scene. The short-term background
shows static objects abandoned in the scene; this image is
made up of the last background image, updated with pieces
of the current image. Static foreground objects are detected
by thresholding the difference between the short-term and the
long-term backgrounds, while dynamic foreground objects are
detected by thresholding the difference between the current
image and the long-term background.
Dynamic background changes, such as parked cars, left packages, or displaced chairs, are explicitly handled in [23], highlighting the importance of avoiding the mere absorption of these changes into the background model, in the light of intelligent visual surveillance. Dynamic background changes are inserted into short-term background layers, and, in order to update a long-term background model, the layer modeling technique is embedded into a codebook-based background subtraction algorithm that is robust to local and global illumination changes.
Other approaches are based on pixel layer-based foreground detection [24], where temporally persistent objects are introduced. To detect persistent foreground objects, the authors use a color persistence criterion, continually monitoring the color histogram of the foreground object and using correlation to decide if the color is persistent. After a user-defined time threshold, the persistent foreground object is converted to a new background layer. To reconvert persistent objects back to foreground layers when they become interesting again (restart challenge), the authors use higher level features (i.e., globally modeling the region as a whole, based on the use of region-level features).
A pixel-wise method that employs dual foregrounds to extract temporally static image regions is reported in [18]. The authors construct separate long- and short-term backgrounds, modeled as pixel-wise multivariate Gaussian models [26], whose parameters are adapted online using a Bayesian update mechanism imposed at different learning rates. By comparing each frame with these models, they estimate two foreground masks: the long-term foreground mask shows color variations in the scene, as well as moving cast shadows and illumination changes; the short-term foreground mask contains the moving objects, noise, etc. An evidence score at each pixel is inferred by applying a set of hypotheses on the foreground masks, and then aggregated in time to provide temporal consistency.
A complete system for the detection of abandoned and
removed objects is presented in [4], where the mixture of Gaussians method [26], suitably adapted, is employed to
detect both the moving and the static objects in the scene.
Several improvements allow the system to properly handle
shadows and light changes, and higher level modules enable
the discrimination of abandoned and removed objects, also
under occlusion.
In [17], region-level analysis is exploited both for background maintenance, where region-level information is fed back to adaptively control the learning rate, and for SOD, where it helps validate candidate stationary objects. The resulting method is robust against illumination changes and occlusions.
A system for the detection of static objects based on a dual background model that classifies pixels by means of a finite-state machine is presented in [20]. The state machine provides the means for interpreting the results obtained from background subtraction.
It should be observed that other methods exist that do not fall in the above classification. Examples include the approach in [27], which addresses the occlusion problem by using occlusion reasoning in a multiview setting, and the method in [28] for parked vehicle detection, based on the analysis of spatio-temporal maps of static image corners, which neither relies on background subtraction nor performs object tracking.
B. Overview of Our Approach

We approach SOD with a background subtraction-based method that relies on modeling not only the background, but also the stopped foreground. Two main novelties are introduced as compared with previous approaches.

The first contribution is the proposal of a general framework for SOD, which we name the stopped foreground subtraction (SFS) algorithm. The basic idea consists of maintaining an up-to-date model of the stopped foreground and discriminating moving objects as those that deviate from this model, using a mechanism that is robust to the occlusion and restart challenges. The proposed SFS algorithm is quite general: indeed, it is independent of the model chosen for the scene background and foreground, and therefore it can be used in conjunction with whichever model is preferred.
The other main contribution concerns a 3-D neural model for image sequences that automatically adapts to scene changes in a self-organizing manner. Neural network-based solutions to MOD have already been considered, due to the fact that these methods are usually more effective and efficient than traditional ones [29]–[36]. The proposed 3-D neural network behaves as a competitive neural network implementing a winner-take-all function with an associated mechanism that modifies the local synaptic plasticity of the neurons, allowing learning to be spatially restricted to the local neighborhood of the most active neurons. Therefore, the neural image sequence
model adapts well to scene changes and can capture the most persistent features of the image sequence, making it suitable to model both the scene background and foreground. The 3-D neural model differs from the 2-D model proposed in [33] in terms of layered network structure and of inter- and intra-layer weight update. The rationale is to produce a 3-D topographic map that is more consistent with the image sequence, where every layer is spatially consistent with the pixel locations (i.e., each neuron corresponds to a single pixel) and their image neighborhoods, while different neurons at different layers (corresponding to the same pixel location) are more or less responsive to the change detected at that pixel.

Overall, the reported research extends our previous work [37], now including a complete description of the SFS algorithm, a constructive description of the needed background and foreground models, a detailed description of the neural image sequence model, as well as extensive and reproducible experimental results.
This paper is organized as follows. In Section II, we propose a model-based framework for SOD that is independent of the chosen model. In Section III, we describe the adopted self-organizing model for image sequences and show how this model is used for background and foreground modeling. Section IV presents results of moving and stopped object segmentation obtained by adopting the self-organizing model within the model-based framework, while Section V includes concluding remarks.
II. MODEL-BASED FRAMEWORK FOR SOD
In this section, we propose a model-based approach to the classification of foreground objects into stopped and moving objects. The basic idea consists of keeping a model of foreground objects and classifying as stopped those objects whose model holds the same features for several consecutive frames; the remaining foreground objects are consequently classified as moving objects.

Specifically, in the following sections we describe in a constructive way three models related to an image sequence {I_t}: B_t for the background, F_t for the moving foreground, and S_t for the stopped foreground. Such models allow us to achieve a segmentation of sequence frame I_t at time t, classifying each pixel of I_t as either background, moving, or stopped, thus providing a solution to the considered problem.
A. Background Model: Assumptions

In order to highlight the general applicability of the proposed approach, independently of the specific model adopted for the background, for the moving foreground, and for the stopped foreground, in this section we assume we are somehow able to perform background subtraction in order to detect, for each sequence frame I_t, image elements that do not belong to the scene background. Specifically:

Assumption 1: Given an image sequence {I_t}, we suppose to have devised:
1) a background modeling technique, allowing us to initialize and update a model B_t of the background appearance at time t, which gives a statistical description of the scene background for sequence {I_t};
2) a foreground detection technique, allowing us to discriminate whether an incoming image pixel x of the current sequence frame I_t is modeled by the background model.
The following notation will be adopted:
1) the case of x being a background pixel, modeled by B_t, will be denoted as x ∈ B_t;
2) the case of x being a foreground pixel, not modeled by B_t, will be denoted as x ∉ B_t;
3) the initialization of the background model B_t, for t = 0, is achieved, for every image pixel x of the sequence frame I_0, through the procedure insert(B_0, x), which returns the initial background model B_0;
4) the update of the background model B_{t−1}, for t > 0, is achieved, for every image pixel x of the sequence frame I_t, through the procedure update(B_t, B_{t−1}, x), which returns the updated background model B_t.
Both the insert and the update procedures depend upon the choice of the model representation; therefore, in this section they are left unspecified.
B. Moving Foreground Model: Initial Construction

For all foreground pixels, we construct a model F_t of the sequence foreground, in order to classify as stopped those foreground pixels that hold the same features for several consecutive frames, i.e., those that are modeled by F_t for several consecutive frames. The sequence foreground model F_t is iteratively constructed for each pixel x of the current sequence frame I_t as follows.
1) F_t is initialized as soon as x is detected as a foreground pixel: if x ∈ B_{t−1} and x ∉ B_t, then insert(F_t, x).
2) F_t is updated if it continues to hold the same features held in past frames: if x ∈ F_{t−1}, then update(F_t, F_{t−1}, x).
3) F_t is erased if x is a background pixel: if x ∈ F_{t−1} and x ∈ B_t, then delete(F_t, F_{t−1}, x).

Contrary to background modeling, foreground modeling allows us to model pixels whose behavior is usually only temporary. This implies the above described further issue of erasing the model whenever it is no longer representative; this is achieved through the procedure delete, which returns the model F_t updated in such a way that x ∉ F_t.
C. Moving Foreground Model: Counting of Consecutive Occurrences

The availability of the constructed foreground model F_t allows us to discriminate, among foreground pixels, those that are moving and those that are stationary. Specifically:

Definition 1: Given a foreground model F_t for the image sequence {I_t}, an image pixel x of sequence frame I_t at time t is said to be a stopped pixel if it has been modeled by F_t for at least τ consecutive frames: x ∈ F_i, i = t − τ, ..., t.

Definition 2: The minimum number τ of consecutive frames after which a foreground pixel assuming constant features is classified as stopped is called the stationary threshold.


The value of τ is chosen depending on the desired responsiveness of the system, and is application-dependent.

In order to classify stopped pixels, for each pixel x of sequence frame I_t, we compute a function C_t of pixel feature consecutive occurrences in the foreground model F_t as follows:

$$
C_t(x) = \begin{cases}
\max(0,\ C_{t-1}(x) - k_A), & \text{if } x \in B_t\\
\min(\tau,\ C_{t-1}(x) + k_B), & \text{if } x \notin B_t \wedge x \in F_t\\
\max(1,\ C_{t-1}(x) - k_C), & \text{if } x \notin B_t \wedge x \notin F_t
\end{cases}
\tag{1}
$$

where ∧ indicates the logical AND operation, and C_0(x) = 0. Basically, the count value is increased if x is modeled by the foreground model F_t, and decreased otherwise. The three different cases of (1) can be explained as follows.

If, at time t, x is a background pixel [first case of (1)], then, at the previous time t − 1, it was in one of three possible states: 1) it was a background pixel; 2) it was a foreground pixel belonging to a moving object that continues to move at time t; or 3) it was a foreground pixel belonging to a stopped object that starts moving again at time t. In each case, the counting of consecutive occurrences of x in F_t should be re-initialized, by setting C_t(x) = 0. Instead of just zeroing C_t(x), decreasing its value by the decay value k_A in (1) allows us to control how fast the system should recognize that a stopped pixel, as in case 3, has moved again. To set the alarm flag off immediately after the removal of the stopped pixel, the decay value should be large, possibly equal to τ.

If x is modeled by the foreground model F_t [second case of (1)], then it holds the same features held in the previous frame, and therefore C_t(x) is incremented. This case includes situations where in past frames x was a moving pixel belonging to an object that has not moved since then; therefore, the growth factor k_B in (1) determines how fast the system should recognize that a moving pixel is going to stop.

Finally, if pixel x is not modeled by either the background or the moving foreground model [third case of (1)], then it must be a new moving foreground pixel, and therefore C_t(x) should be set to 1. Instead, we just decrease it by the decay factor k_C in (1), in order to enhance robustness to false negatives in the background or in the moving foreground.
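As a minimal vectorized sketch of (1), assuming Boolean maps in_B and in_F that encode x ∈ B_t and x ∈ F_t for every pixel (the array and function names are our own):

    import numpy as np

    def update_counts(C_prev, in_B, in_F, tau, k_A=1, k_B=1, k_C=1):
        # Per-pixel update of the consecutive-occurrence count of (1).
        # C_prev holds C_{t-1}; in_B/in_F tell whether each pixel is
        # modeled by B_t / F_t. Returns the updated count C_t.
        C = np.empty_like(C_prev)
        bg = in_B                # first case of (1): background pixel
        fg = ~in_B & in_F        # second case: same foreground features held
        new = ~in_B & ~in_F      # third case: new moving foreground pixel
        C[bg] = np.maximum(0, C_prev[bg] - k_A)
        C[fg] = np.minimum(tau, C_prev[fg] + k_B)
        C[new] = np.maximum(1, C_prev[new] - k_C)
        return C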
D. Stopped Foreground Model: Construction

According to Definition 1, a pixel x in the sequence frame I_t is classified as stopped if C_t(x), as defined in (1), reaches the stationary threshold value τ. In order to retain the memory of objects that have stopped and to distinguish them from moving objects possibly passing in front of them in subsequent frames, pixels x for which C_t(x) = τ are moved from the moving foreground model F_t to a new stopped foreground model S_t.

Given a pixel x of the current sequence frame I_t, the stopped foreground model S_t is constructed and updated according to:
1) S_t is initialized as soon as x has been modeled by F_t for at least τ consecutive frames: if x ∈ F_t and C_t(x) = τ, then insert(S_t, x);

2) S_t is updated if it continues to hold the same features held in past frames: if x ∈ S_{t−1}, then update(S_t, S_{t−1}, x);
3) S_t is erased if x is a background pixel for at least τ/k_A frames: if x ∈ S_{t−1}, x ∈ B_t, and C_t(x) = 0, then delete(S_t, S_{t−1}, x);
where k_A and C_t(x) are those of (1).

As previously mentioned, as soon as the stopped foreground model S_t is initialized for pixel x of the current sequence frame I_t, the corresponding foreground model F_t is re-initialized [delete(F_t, F_t, x)], together with the corresponding counting function value C_t, by setting C_t(x) = 0. This refinement allows us to adopt the erased model F_t for new moving foreground pixels that in subsequent frames possibly pass in front of the stopped pixels.
E. Classification of a Pixel

A simplified flowchart describing how the state of an incoming pixel x in sequence frame I_t is classified, at time t, as either background (BG), moving (MOV), or stopped (ST), is reported in Fig. 1. Here, the input background model B_t, the moving foreground model F_{t−1}, and the stopped foreground model S_{t−1} are obtained as described in Sections II-A–II-D, respectively. Moreover, the input function C_{t−1} is computed as in (1), where we have assumed for simplicity k_A = τ, k_B = 1, and k_C = τ. While the flowchart concerns only the update of the counting function C_t and the initialization of the two foreground models F_t and S_t, a complete and constructive description is given in Algorithm 1.

Fig. 1. Classification at time t of the state of a pixel x of I_t as either background (BG), moving foreground (MOV), or stopped foreground (ST).
F. Stopped Foreground Model: Layering

When several overlapping stopped objects are present in the scene, we cannot consider a single stopped foreground model S_t if we want to consider all these objects separately and keep memory of all of them. To this end, we introduce L different stopped foreground layers S_t^1, ..., S_t^L, each of which contains stopped foreground objects that do not overlap with stopped foreground objects of the other layers, i.e., that do not occupy the same image spatial region. The layer S_t^1 contains the model for the first detected stopped foreground pixels; subsequently detected stopped foreground pixels that do not belong to this layer, but overlap with it, will be included in the next layer S_t^2, and so on.

G. SFS Algorithm

The stopped foreground subtraction algorithm, in the following referred to as the SFS algorithm, for an incoming pixel x in sequence frame I_t, t ∈ [0, T], adopting the stopped layering mechanism with L layers described in Section II-F, is detailed as Algorithm 1.

Algorithm 1: Stopped Foreground Subtraction (SFS)

Most of the algorithm is a direct consequence of the procedures for constructing the moving foreground model F_t and the stopped foreground model S_t, and of (1) for computing the function C_t of pixel feature occurrences in the foreground model, as described in Sections II-B–II-D, respectively. Instead, the layering mechanism for the stopped foreground models S_t^1, ..., S_t^L introduced in Section II-F needs further clarification; a sketch of the resulting control flow is given after this list.
1) When a pixel is detected as background for at least τ/k_A consecutive frames, all the stopped foreground layers are erased, because it is assumed that there are no more stopped objects containing this pixel (lines 4–5).
2) When a pixel is detected as an old stationary pixel belonging to the stopped layer S_t^l, then this layer is updated (line 9); in this case, all successive stopped layers S_t^{l+1}, ..., S_t^L must be erased, because it means that they have been removed from the scene (line 10).
3) Nothing should be done in any stopped layer in the case that pixel x is an old or new moving foreground pixel (lines 11–16). Indeed, the moving object it belongs to could be passing in front of stopped objects, and we want to keep the stopped object models as they are; this ensures the robustness of the SFS algorithm against stopped object occlusion.
4) Finally, when a pixel is detected as a new stopped pixel, it must be inserted into one of the stopped foreground layers (lines 20–21). The choice of the suitable layer can be made on a per-pixel basis (e.g., insert it into the first empty stopped layer, as is done in the SFS algorithm) or on a per-region basis (e.g., insert it into the empty stopped layer that already contains adjacent stopped pixels).
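Since the listing of Algorithm 1 is not reproduced here, the following per-pixel sketch reconstructs its control flow from Sections II-B–II-D and the clarifications above; all names are our own, the models follow the interface sketched in Section II-A, and an is_empty helper is assumed for selecting the first empty stopped layer.

    def sfs_pixel_step(value, B, F, S_layers, C, tau, k_A=1, k_B=1, k_C=1):
        # One SFS step for a single pixel; returns (updated count, state label).
        # This is a sketch of the control flow, not the paper's exact listing.
        if B.matches(value):                    # background pixel
            C = max(0, C - k_A)
            if C == 0:                          # background for about tau/k_A frames:
                for S in S_layers:              # erase all stopped layers
                    S.delete()
            F.delete()                          # rule 3) of Section II-B
            B.update(value)                     # selective update (see Section III-B)
            return C, "BG"
        for l, S in enumerate(S_layers):        # old stationary pixel
            if S.matches(value):
                S.update(value)
                for S_next in S_layers[l + 1:]: # successive layers were removed
                    S_next.delete()
                return C, "ST"
        if F.matches(value):                    # old moving foreground pixel
            C = min(tau, C + k_B)               # second case of (1)
            if C == tau:                        # new stopped pixel:
                for S in S_layers:              # move it into the first empty layer
                    if S.is_empty():
                        S.insert(value)
                        break
                F.delete()                      # re-initialize F_t and C_t(x)
                return 0, "ST"
            F.update(value)
            return C, "MOV"
        C = max(1, C - k_C)                     # third case of (1)
        F.insert(value)                         # new moving foreground pixel
        return C, "MOV"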
H. Considerations on the Proposed Framework

The modeling of the foreground and the stopped foreground layering strategy allow us to keep an up-to-date representation of all stopped pixels and an initial classification of scene objects, which can be used to support subsequent tracking and classification phases. Moreover, stopped foreground modeling allows us to distinguish moving and stopped foreground objects when they appear overlapped (occlusion challenge). Indeed, in this case, the availability of an up-to-date model of all stopped objects allows us to discern whether pixels that are not modeled as background belong to one of the stopped objects or to moving objects. Finally, separate background and stopped foreground modeling allows us to determine when stopped objects resume their motion (restart challenge).

The described stopped foreground subtraction algorithm is completely independent of the model adopted for the background, for the moving foreground, and for the stopped foreground layers. Once a specific model has been chosen, SOD can be accomplished through the described SFS algorithm, provided that suitable procedures for model initialization, update, and deletion are specified.
III. NEURAL MODEL FOR SOD

In this section, we give a description of a self-organizing model for image sequences and show how the model is adopted for both background and foreground modeling.

A. Image Sequence Modeling

Relying on recent research in this area [33], [38], the idea is to build the image sequence model by learning, in a self-organizing manner, image sequence variations, seen

Fig. 2. (a) Example of sequence frame I_t (each circle represents a pixel, connected with its neighbors). (b) Corresponding neural map M_t with n = 5 layers. (c) Inter-layer and intra-layer weight decaying during the update procedure for M_t.

as trajectories of pixels in time. A neural network mapping method is proposed, where a whole trajectory, incrementally fed in time, is used as input to the network. Each neuron computes a function of the weighted linear combination of incoming inputs, and therefore can be represented by a weight vector obtained by collecting the weights related to incoming links. An incoming pattern is mapped to the neuron whose set of weight vectors is most similar to the pattern, and the weight vectors in a neighborhood of this node are updated. Unlike [33] and [38], the obtained self-organizing neural network is organized as a 3-D grid of neurons, producing a representation of training samples with lower dimensionality, at the same time preserving the topological neighborhood relations of the input patterns.
1) Neural Model Representation: Given an image sequence {I_t}, for each pixel x in the image domain D, we build a neural map consisting of n weight vectors m_t^i(x), i = 1, ..., n, which will be called a model for pixel x. If every sequence frame has N rows and P columns, the complete set of models M_t(x) = (m_t^1(x), ..., m_t^n(x)) for all pixels x of the t-th sequence frame I_t is organized as a 3-D neural map M_t with N rows, P columns, and n layers. An example of this neural map is given in Fig. 2, where for each pixel x, identified by one of the colored circles in the sequence frame represented in Fig. 2(a), we have a model M_t(x) = (m_t^1(x), ..., m_t^n(x)) that is identified by identically colored circles in the model layers shown in Fig. 2(b).

Put in other terms, the neural model M_t consists of n images L_t^i, i = 1, ..., n, of the same size as image I_t, which we call layers. Each layer L_t^i contains, for each pixel x, the i-th weight vector m_t^i(x):

$$
L_t^i = \left\{ m_t^i(x),\ x \in D \right\}, \quad i = 1, \ldots, n.
$$

2) Neural Model Initialization: All weight vectors related to a pixel x are initialized with the pixel brightness value at time 0, that is,

$$
m_0^i(x) = I_0(x), \quad i = 1, \ldots, n. \tag{2}
$$

The resulting neural map M_0 consists of n layers, each of which is a copy of the first sequence frame I_0. The idea behind this is that the initial guess for the model M_t, for t = 0, is exactly the first frame of the sequence, which can be considered a good initial approximation of the image sequence.
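In array terms, the initialization (2) simply stacks n copies of the first frame; a one-line NumPy sketch, with a (layers, rows, columns, channels) layout of our own choosing:

    import numpy as np

    def init_neural_map(I0, n=5):
        # Build M_0 as n layers, each a copy of the first frame I_0, as in (2).
        return np.repeat(I0[np.newaxis].astype(float), n, axis=0)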
3) Neural Model Update: Subsequent learning of the neural map allows us to adapt the image sequence model to scene modifications. The learning process consists of updating the model by changing the neural weights, according to a visual attention mechanism of reinforcement. Specifically, temporally subsequent samples are fed to the network. At time t, the value I_t(x) of each incoming pixel x of the t-th sequence frame I_t is compared to the current pixel model M_t(x) = (m_t^1(x), ..., m_t^n(x)), to determine the weight vector m_t^b(x) that best matches it:

$$
d\left(m_t^b(x), I_t(x)\right) = \min_{i=1,\ldots,n} d\left(m_t^i(x), I_t(x)\right) \tag{3}
$$

where the metric d(·, ·) is suitably chosen according to the specific color space being considered. Example metrics could be the Euclidean distance in RGB color space, or the Euclidean distance of vectors in the HSV color hexcone, as suggested in [39]. The latter is the one adopted for the experiments reported in Section IV. Indeed, the HSV color space allows us to specify colors in a way that is close to the human experience of colors, relying on the hue, saturation, and value properties of each color. Moreover, hue stability against illumination changes is known to be important for both cast shadow suppression [40] and motion analysis [41], [42].


The best matching weight vector m_t^b(x), computed according to (3) and belonging to layer b of model M_t, is used as the pixel encoding approximation. Therefore, the best matching weight vector m_t^b(x) and its neighboring weight vectors of the b-th layer of model M_t are updated according to a weighted running average:

$$
m_t^b(y) = (1 - \alpha(x, y))\, m_{t-1}^b(y) + \alpha(x, y)\, I_t(y), \quad \forall y \in N_x. \tag{4}
$$

Here, N_x = {y : |x − y| ≤ w_{2D}} is a 2-D spatial neighborhood of x of size (2 w_{2D} + 1) × (2 w_{2D} + 1) including x. Moreover, α(x, y) = γ G_{2D}(y − x), where γ represents the learning rate that depends on scene variability, while G_{2D}(·) = N(·; 0, σ_{2D}² I) is a 2-D Gaussian low-pass filter [43] with zero mean and σ_{2D}² I variance.¹ The α(x, y) values are weights that allow us to smoothly take into account the spatial relationship between the current pixel x and its neighboring pixels y ∈ N_x, thus preserving topological properties of the input (close inputs correspond to close outputs). The 2-D intra-layer update of (4) is schematically shown in Fig. 2. Let us suppose that the current incoming pixel x of the t-th sequence frame I_t is the center black-colored circle in Fig. 2(a) and that the best matching weight vector m_t^b(x), computed according to (3), belongs to layer b = 3 of model M_t, and is indicated by the black-colored circle in the third layer of Fig. 2(b). Then, the weighted running average of (4) in a neighborhood N_x of size 3 × 3 (choosing w_{2D} = 1) involves all black-colored circles shown in the third layer of Fig. 2(c), using Gaussian weights represented by the 2-D Gaussian function over these nodes.
The 2-D update of (4) involves only model weight vectors lying in the same layer b as the best matching weight vector m_t^b(x). In order to further enable the reinforcement of m_t^b(x) in the model for pixel x, the weight vectors of x belonging to layers close to layer b are also updated. Such a further update is achieved by the weighted running average

$$
m_t^i(x) = (1 - \beta_i(x))\, m_{t-1}^i(x) + \beta_i(x)\, I_t(x) \tag{5}
$$

that involves the weight vectors m_t^i(x) of x such that |i − b| ≤ w_{1D}; that is, it involves the weight vectors that belong to a 1-D inter-layer neighborhood of m_t^b(x) having size 2 w_{1D} + 1. In (5), β_i(x) = λ G_{1D}(i − b), where λ is the learning rate and G_{1D}(·) = N(·; 0, σ_{1D}²) is a 1-D Gaussian low-pass filter with zero mean and σ_{1D}² variance in the 1-D inter-layer neighborhood. The 1-D inter-layer update of (5) is schematically shown in Fig. 2(c) with w_{1D} = 2; it involves all black-colored circles in the center column of circles, using Gaussian weights represented by the 1-D Gaussian function over these nodes.
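The following single-pixel sketch puts (3)-(5) together. It uses plain Euclidean distance in place of the HSV hexcone metric and unnormalized Gaussian weights for simplicity; the map M is assumed to be a float array of shape (n, N, P, channels), and all names are our own.

    import numpy as np

    def update_pixel(M, I, x, gamma, lam, w2D=1, w1D=1, var2D=0.75, var1D=0.75):
        # Best-match selection (3), intra-layer update (4), inter-layer update (5)
        # for pixel x = (row, col) of frame I against the neural map M.
        n, N, P, _ = M.shape
        r, c = x
        v = I[r, c].astype(float)

        # (3): layer b whose weight vector best matches the incoming pixel value
        dists = np.linalg.norm(M[:, r, c] - v, axis=1)
        b = int(np.argmin(dists))

        # (4): weighted running average in the 2-D neighborhood N_x of layer b
        for dr in range(-w2D, w2D + 1):
            for dc in range(-w2D, w2D + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < N and 0 <= cc < P:
                    g = np.exp(-(dr * dr + dc * dc) / (2 * var2D))  # 2-D Gaussian
                    a = gamma * g                                   # alpha(x, y)
                    M[b, rr, cc] = (1 - a) * M[b, rr, cc] + a * I[rr, cc]

        # (5): reinforcement of the layers i of pixel x with |i - b| <= w1D
        for i in range(max(0, b - w1D), min(n, b + w1D + 1)):
            g = np.exp(-((i - b) ** 2) / (2 * var1D))               # 1-D Gaussian
            bta = lam * g                                           # beta_i(x)
            M[i, r, c] = (1 - bta) * M[i, r, c] + bta * v
        return b, float(dists[b])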

B. Modeling the Background, the Moving Foreground, and the Stopped Foreground

The previously described image sequence model can be readily adapted for modeling the background, the moving foreground, and the stopped foreground for SOD, to be accomplished by the general model-based framework described in Section II.

¹Contrary to the Gaussian mixture model [26], we do not assume any distribution of the input data. A Gaussian function is only adopted in order to smooth the contribution of the best matching weight vector to the model update.


Specifically, the background model B_t is initialized as in (2), while updating is performed according to a selective weighted running average, in order to adapt the model to slight scene modifications, without introducing the contribution of pixels that do not belong to the background scene. For each incoming pixel x of the t-th sequence frame I_t, (4) and (5) are applied only if the best matching weight vector m_t^b(x) of B_t, computed as in (3), is close enough to the pixel value I_t(x), that is, only if

$$
d\left(m_t^b(x), I_t(x)\right) = \min_{i=1,\ldots,n} d\left(m_t^i(x), I_t(x)\right) \le \varepsilon \tag{6}
$$

where ε is a threshold allowing us to distinguish between foreground and background pixels. Otherwise, if no acceptable best matching weight vector exists, then x is detected as a foreground pixel. In the following, we denote this procedure as the 3D_SOBS algorithm (3-D self-organizing background subtraction). In the usual case that a set of K initial sequence frames is available for training, the above described initialization and update procedures are adopted for training the neural network background model, to be used for detection and adaptation in subsequent sequence frames. Specifically, after the described initialization of B_0, the background model B_t, t = 1, ..., K − 1, is updated by selective weighted running average on the remaining K − 1 training sequence frames with the described update procedure. Therefore, what differentiates the training and the adaptation phases is the choice of parameters in (4)–(6).
The threshold ε in (6) is chosen as

$$
\varepsilon = \begin{cases} \varepsilon_1, & \text{if } 0 < t < K\\ \varepsilon_2, & \text{if } t \ge K \end{cases} \tag{7}
$$

with ε₁ and ε₂ small constants. The mathematical ground behind the choice of ε₁ and ε₂ is as follows. To obtain a (possibly rough) initial background model that includes several observed pixel intensity variations, the value for ε [ε₁ in (7)] within the initial training sequence frames should be high. On the other side, to obtain a more accurate background model in the detection and adaptation phase, the value for ε [ε₂ in (7)] should be lower within subsequent frames. Therefore, it should be ε₂ ≤ ε₁.
Moreover, the learning rate γ in (4) is chosen as

$$
\gamma = \begin{cases} \gamma_1 - t\,\frac{\gamma_1 - \gamma_2}{K}, & \text{if } 0 < t < K\\ \gamma_2, & \text{if } t \ge K \end{cases} \tag{8}
$$

where γ₁ and γ₂ are predefined constants such that γ₂ ≤ γ₁. Indeed, in order to ensure neural network convergence during the training phase, the learning factor [γ₁ in (8)] is chosen as a monotonically decreasing function of time t, while, during the subsequent adaptation phase, the learning factor [γ₂ in (8)] is chosen as a constant value that depends on the scene variability. Large values enable the network to learn changes corresponding to the background faster, but they also lead to false negatives, that is, the inclusion into the background model of pixels belonging to foreground moving objects. On the contrary, lower learning rates make the network slower to adapt to rapid background changes, but make the model more tolerant to errors due to false negatives through self-organization. Indeed, weight vectors of false negative pixels are readily smoothed out by the learning process itself.

TABLE I
ADOPTED PARAMETER VALUES FOR 3D_SOBS AND SFS ALGORITHMS SPECIFIC FOR EACH IMAGE SEQUENCE

        Dog    Binders  PVEasy  PVMedium  PVHard  ABEasy  ABMedium  ABHard
τ        60      30      1500     1500     1500    2500     2500     2500
ε₂      0.02    0.02     0.02     0.01     0.005   0.02     0.02     0.02
γ₂      0.05    0.05     0.05     0.08     0.05    0.05     0.05     0.05
λ₂      0.05    0.05     0.05     0.08     0.05    0.05     0.05     0.05
The same mathematical ground is behind the choice of the learning rate λ in (5), whose values can be chosen as

$$
\lambda = \begin{cases} \lambda_1 - t\,\frac{\lambda_1 - \lambda_2}{K}, & \text{if } 0 < t < K\\ \lambda_2, & \text{if } t \ge K \end{cases} \tag{9}
$$

where λ₁ and λ₂ are predefined constants such that λ₂ ≤ λ₁.
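The training-versus-adaptation schedules of (7)-(9) can be summarized in a single small helper; a sketch with our own names, where the linear decay during training follows (8) and (9):

    def schedules(t, K, eps1, eps2, gamma1, gamma2, lam1, lam2):
        # Return (epsilon, gamma, lambda) at time t according to (7)-(9).
        if 0 < t < K:                                    # training phase
            return (eps1,
                    gamma1 - t * (gamma1 - gamma2) / K,  # monotonic decay, (8)
                    lam1 - t * (lam1 - lam2) / K)        # monotonic decay, (9)
        return eps2, gamma2, lam2                        # detection and adaptation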
The moving foreground model F_t is initialized as in (2) using the pixel value I_t(x), as soon as x is detected as not belonging to the background model B_t, and it is updated according to (4) and (5) if x is detected as belonging to the moving foreground model F_t. The latter condition can be checked by thresholding as in (6), where this time m_t^b(x) should be interpreted as the best matching weight vector of F_t for the current pixel x.

Finally, the stopped foreground model S_t is initialized as in (2), as soon as x is detected as belonging to the moving foreground model F_t for at least τ consecutive frames. Here, instead of using the incoming pixel value I_t(x), we can make use of the already established moving foreground model F_t, by just moving the weight vectors of F_t into the stopped foreground model. The stopped foreground model S_t is updated according to (4) and (5) if x is detected as belonging to the stopped foreground model S_t. As for the case of the moving foreground model, the latter condition can be checked by the thresholding of (6), where m_t^b(x) is interpreted as the best matching weight vector of S_t for the current pixel x.
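Combining (3) and (6), the per-pixel foreground test with selective update of the 3D_SOBS algorithm can be sketched as follows, reusing the hypothetical update_pixel above; Euclidean distance again stands in for the HSV hexcone metric:

    import numpy as np

    def is_foreground(B, I, x, eps, gamma, lam):
        # Test (6) on pixel x of frame I against the float neural background map B.
        # The model is updated selectively: only pixels recognized as background
        # contribute to the background model via (4) and (5).
        r, c = x
        dists = np.linalg.norm(B[:, r, c] - I[r, c].astype(float), axis=1)
        if dists.min() <= eps:
            update_pixel(B, I, x, gamma, lam)  # selective background update
            return False                       # x is a background pixel
        return True                            # no acceptable best match: foreground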
IV. EXPERIMENTAL RESULTS

Several experiments have been conducted to validate our approach to SOD and to compare its results with those achieved by other state-of-the-art methods. In the following, the choice of parameter values and qualitative and quantitative results are described for several publicly available sequences.
A. Parameter Values

For most of the parameters of the 3D_SOBS and SFS algorithms, we could choose values common to all the considered sequences. Specifically, we fixed n = 5 model layers for the background, the moving foreground, and the stopped foreground models; halfwidths w_{2D} = 1 and w_{1D} = 1 of the neighborhoods for the model updates in (4) and (5), respectively; variances σ_{2D}² = 0.75 and σ_{1D}² = 0.75 of the 2-D and 1-D Gaussian low-pass filters specifying the weights for the running averages in (4) and (5), respectively; the parameters for training the neural background model [including K = 30 training frames, the training segmentation threshold ε₁ = 0.1 in (7), and the training learning rates γ₁ = 1 in (8) and λ₁ = 1 in (9)]; and, finally, the growth/decay factors k_A = 1, k_B = 1, and k_C = 1 adopted in (1). Such choices have been driven by the conducted experiments, where, varying the parameters, we observed almost constant accuracy.

Fig. 3. Results of the moving and stopped object segmentation for the Dog sequence. (a) Original frame I_270. (b) MOD mask. (c) Representation of the background model B_270, computed by the 3D_SOBS algorithm. (d) Representation of the moving foreground model F_270 and (e) of the stopped foreground model S_270, computed by the SFS algorithm. (f) Original frame with moving (green) and stopped (red) foreground objects.
The remaining parameters have been experimentally chosen for each image sequence as reported in Table I. As clarified in Section II-C, the value of the stationary threshold τ used in (1) is application-dependent; the values here chosen for the Dog and the Binders sequences (see Section IV-B) are linked to the duration of the exemplified stopped object events, while those for the Parked Vehicle and the Abandoned Bag sequences (see Section IV-C) derive from the corresponding task definition given by the AVSS 2007 contest [1]. The segmentation threshold ε₂ adopted in (7), described in Section III-B, strictly depends on the color similarity between the foreground objects and the background; suitable values are in the order of 10⁻². The learning rate γ₂ for the intra-layer update of (8) has always been chosen equal to the learning rate λ₂ for the inter-layer update of (9), with values in the order of 10⁻². As clarified in Section III-B, these learning rates depend on the scene variability; because the considered sequences do not exhibit rapid background changes, the chosen low learning rates allow the neural network to slowly adapt to these changes, at the same time allowing the self-organizing model to better tolerate errors due to a few false negatives. We remark that, according to the mathematical ground related to (7)–(9) reported in Section III-B, the values for the last three parameters for the testing phase should be chosen smaller than their training counterparts.
B. Qualitative Evaluation

The Dog sequence is an outdoor sequence consisting of 532 frames of 320 × 240 spatial resolution, available on the web at http://www.openvisor.org. The scene consists of a garden, where a man sits on a chair, while a dog passes in front of him. One representative frame and the related results are reported in

MADDALENA AND PETROSINO: STOPPED OBJECT DETECTION BY LEARNING FOREGROUND MODEL IN VIDEOS

(a)

(b)

(c)

(d)

731

(e)

(f)

Fig. 4. Results of moving and stopped object segmentation for the frames 1280 (first row), 1321 (second row), and 1395 (third row) of the Binders
sequence. (a) Original frames. (b) MOD masks computed by the 3D_SOBS algorithm. (c) Original frames with the first (red) and second (blue) stopped
layers. Representations of (d) the first stopped layer model, (e) the second stopped layer model, and (f) the moving foreground model.

Fig. 3. Here, we can observe that, although quite accurate, the


MOD mask computed by the 3D_SOBS algorithm [Fig. 3(b)] for the original sequence frame I_270 [Fig. 3(a)] does not allow us to distinguish the man sitting on the chair from the dog passing in front of him. Representations of the moving and the stopped foreground models constructed by the SFS algorithm are reported in Fig. 3(d) and (e), respectively. Such model representations are computed as the best approximations to the current sequence frame I_270 obtainable by the related 3-D neural model F_t or S_t, for t = 270 (i.e., choosing, for each pixel of I_270, the best matching weight vector of the related model). Small unsteadiness of the sitting man can be observed in the moving foreground model [Fig. 3(d)], which contains not only the moving dog, but also the contour and the hair of the man. Nonetheless, the availability of the moving and the stopped foreground models, together with the background model obtained by the 3D_SOBS algorithm [whose representation is reported in Fig. 3(c)], allowed us to isolate moving and stopped foreground pixels, thus proving the robustness of the SFS algorithm to the partial occlusion of the stopped person. The final segmentation is reported in Fig. 3(f), showing the original frame with stopped foreground pixels (in red) and moving foreground pixels (in green).
The Binders sequence is an indoor sequence consisting of 638 frames of 320 × 240 spatial resolution (publicly available in the download section of http://cvprlab.uniparthenope.it), specifically designed for showing the results achieved by the SFS algorithm. The scene consists of an office, where a person first leaves a green binder on a desk, then leaves a red binder, and, after that, leaves a blue book in front of the red binder. After a while, the person picks up the green binder, passing it in front of the two stopped items, then picks up in turn also the blue book and the red binder. Representative frames together with related results are reported in Fig. 4.

Fig. 4. Results of moving and stopped object segmentation for frames 1280 (first row), 1321 (second row), and 1395 (third row) of the Binders sequence. (a) Original frames. (b) MOD masks computed by the 3D_SOBS algorithm. (c) Original frames with the first (red) and second (blue) stopped layers. Representations of (d) the first stopped layer model, (e) the second stopped layer model, and (f) the moving foreground model.

Here, we report the original sequence frames 1280, 1321, and 1395 [column (a)] and the corresponding MOD masks computed by the 3D_SOBS algorithm [column (b)]. As for the Dog sequence, the MOD masks, although quite accurate,
do not allow us to distinguish between moving and stopped objects. Specifically, in frame 1321 the blue book is stopped in front of the red binder, which is itself stopped. However, the MOD mask only allows us to detect the entire image area including the two stopped objects; it does not allow us to distinguish them. Likewise, in frame 1395 the green binder is passing in front of the other two stopped objects; the MOD mask includes the image area of the three objects extraneous to the background, giving no indication of whether they are moving or stopped. In column (c), we report the original frames with the first and second stopped layers computed by the SFS algorithm superimposed, identified by red and blue pixels, respectively; moving pixels are not shown. Apart from a few pixels belonging to the contours of the objects, the achieved classification of stopped objects, as well as their separation into two different layers, is perfectly consistent with the scene in the three frames, also proving the robustness of the SFS algorithm against the stopped object occlusion and restart challenges. Representations of the first and the second stopped layer models, as well as of the moving foreground model, are reported in columns (d)–(f), respectively.

Besides robustness against the stopped object occlusion and restart challenges, the reported qualitative results highlight that the SFS algorithm can be readily adopted for the detection of abandoned objects, as defined in Section I. Also the case of removed objects can be readily handled (e.g., the removed green binder of Fig. 4), provided that the removed object does not appear as stationary from the beginning of the sequence. Indeed, in this case it is included into the background and, due to the selective update of the background model, it will be detected as foreground only when it starts to move.
C. Quantitative Evaluation

To compare the results of our approach to SOD with those of other existing approaches, we further consider the i-LIDS dataset (publicly available at ftp://motinas.elec.qmul.ac.uk/pub/iLids/), which also includes annotated ground truths provided for the AVSS 2007 contest [1]. Two scenarios are considered.

TABLE II
COMPARISON OF GROUND TRUTH (GT) STOPPED OBJECT EVENT START AND END TIMES (IN MIN) FOR THE i-LIDS SEQUENCES WITH THOSE COMPUTED BY THE PROPOSED APPROACH AND BY OTHER APPROACHES

                        Parked Vehicle Sequences                                 Abandoned Bag Sequences
                        PV-Easy      PV-Medium    PV-Hard      Mean   Median     AB-Easy      AB-Medium    AB-Hard      Mean   Median
                        Start End    Start End    Start End    Error  Error      Start End    Start End    Start End    Error  Error
GT                      02:48 03:15  01:28 01:47  02:12 02:33                    03:00 03:12  02:42 03:00  02:42 03:06
SFS + 3D_SOBS           02:45 03:19  01:28 01:51  02:12 02:34  4.00   4.00       02:50 03:17  02:35 03:01  02:42 03:07  8.00   8.00
SFS + SOBS [33]         02:45 03:20  01:28 01:51  02:12 02:35  4.67   4.00       02:50 03:17  02:35 03:02  02:42 03:08  8.67   9.00
SFS + MOG [26]          02:44 03:20  01:27 01:50  02:12 02:35  5.00   4.00       02:51 03:18  02:34 03:02  02:43 03:09  9.67   10.00
Bhargava et al. [10]    N/A          N/A          N/A          N/A    N/A        02:59 03:12  02:46 03:00  02:43 03:07  2.33   2.00
Boragno et al. [11]     02:48 03:19  01:28 01:55  02:12 02:36  5.00   4.00       N/A          N/A          N/A          N/A    N/A
Guler et al. [12]       02:46 03:18  01:28 01:54  02:13 02:36  5.33   5.00       02:23 03:18  02:42 03:06  02:14 03:16  29.00  38.00
Lee et al. [13]         02:51 03:18  01:33 01:52  02:16 02:34  7.00   6.00       N/A          N/A          N/A          N/A    N/A
Porikli et al. [18]     N/A          01:39 01:47  N/A          11.00  11.00      N/A          N/A          N/A          N/A    N/A
Venetianer et al. [16]  02:52 03:16  01:43 01:47  02:19 02:34  9.33   8.00       02:54 03:17  02:54 03:01  N/A   03:13  10.33  11.00

The Parked
Vehicle sequences represent typical situations critical for MOD in outdoor sequences, presenting strong shadows cast by objects on the ground, mild positional instability caused by small movements of the camera due to the wind, and strong and long-lasting illumination variations due to clouds covering and uncovering the sun. They are devoted to detecting vehicles in no parking areas, where the street under control is more or less crowded with cars, depending on the hour of the day the scene refers to. The no parking area adopted for the contest is defined as the main street borders, and the stationary threshold is defined as τ = 1500; this means that an object is considered irregularly parked if it stops in the no parking area for more than 60 s (scenes are captured at 25 frames/s). The Abandoned Bag sequences are devoted to detecting abandoned objects in a train station, and the detection area is restricted to the train platform. Different crowd densities in the various sequences determine different levels of occlusion problems. Here, the event task for the AVSS 2007 contest [1] is defined as detecting bags left unattended by their owners for more than 60 s, thus requiring the association of the owner with the bag. For this scenario, we then set the stationary threshold τ = 2500 (instead of τ = 1500), in order to approximately take into account the time the owners take to abandon the bag.
We compared the results on the i-LIDS sequences obtained by our approach with those obtained by several other approaches. Specifically, we consider the results obtained by the SFS algorithm using three different models for the background and the foreground (3D_SOBS, SOBS [33], and MOG [26]), and those obtained by the following.
1) Bhargava et al. [10], who search for unattended objects that are separated from nearby blobs and perform reverse traversal to search for the candidate owner, continuously monitoring the scene for the departure/return of the owner.
2) Boragno et al. [11], who employ a DSP-based system for automatic visual surveillance, where block matching motion detection is coupled with MOG-based foreground extraction.
3) Guler et al. [12], who extend a tracking system, inspired by the human visual cognition system, introducing a stationary object model where each region represents a hypothesized stationary object whose associated probability measures the endurance of the region.
4) Lee et al. [13], who present a detection and tracking system operating on a 1-D projection of images.
5) Porikli et al. [18], who employ dual foregrounds to extract temporally static image regions, as already described in Section I.
6) Venetianer et al. [16], who employ an object-based video analysis system, featuring detection, tracking, and classification of objects.
Table II reports results on the i-LIDS sequences in terms of stopped object event start and end times (in min) for each sequence, as well as the mean and median errors computed over the absolute errors on the two scenarios. We can observe that stopped object events detected by the SFS algorithm, whichever pixel-based model is adopted (3D_SOBS, SOBS [33], or MOG [26]), generally start as soon as a single pixel is detected as stopped, and end as soon as no more stopped pixels are detected; this unavoidably leads to a small anticipation of the event start time and to a slight delay of the event end time, as compared to object-based approaches. However, from the results we can conclude that the SFS algorithm generally compares favorably to the other SOD approaches, independently of scene traffic or crowd level. Indeed, differences between the ground truth and the computed stopped object event times are generally smaller for SFS coupled with 3D_SOBS than for the other approaches, and this is even more evident if we consider the mean and median errors for the two scenarios. It is worth pointing out that, in the case of the Abandoned Bag sequences, lower accuracy can be observed for most of the compared methods. This is due to the need for higher level knowledge of the scene content for the exact detection of the Abandoned Bag event. Although this is beyond our objective, higher level information concerning the bag ownership could be exploited in order to more accurately detect the start of the abandoned bag event (e.g., [12], [16]).

The results reported in Table II also reveal the robustness of all the compared methods against the occlusion challenge


(all of them detect the beginning of the stopped object event, despite the occlusions) and the restart challenge (all the compared methods are able to detect the end of the stopped object event).
More recent studies [4], [17], [20], [28] report accuracy results in terms of event-based detections, instead of stopped object event start and end times. These results are compared to those of the proposed approach in Table III, where we report the actual number of correctly detected events (TD, true detections) and the number of erroneously detected events (FD, false detections). The results show that all the compared methods are robust against the occlusion challenge (despite the occlusions, all the compared methods detect the expected number of stopped object events), but nothing can be said concerning the restart challenge. Some of the methods, including the proposed one, report false detections (mainly in the AB-hard sequence) that are due to static people. These false detections could be avoided using a people detector (as suggested in [4]) or exploiting region-level information (as in [17]). We can conclude that the proposed approach achieves results that are consistent with the state of the art, and that the few false detections can be easily eliminated by higher level processing.

TABLE III
COMPARISON OF THE NUMBER OF TRUE DETECTIONS (TD) AND FALSE DETECTIONS (FD) FOR THE i-LIDS SEQUENCES (PV-easy, PV-medium, PV-hard, AB-easy, AB-medium, AB-hard) ACHIEVED BY THE PROPOSED APPROACH (SFS + 3D_SOBS) AND BY ALBIOL et al. [28], EVANGELIO et al. [20], PAN et al. [17], AND TIAN et al. [4]
To give an idea of the segmentation accuracy achieved by our neural-based approach to SOD, in Fig. 5 we analyze the results achieved on the Dog and Binders sequences in terms of the F1 measure, as adopted in [33]. The reported F1 values have been obtained by comparing manually labeled ground truth masks for the first stopped layer of 20 frames of the Dog sequence, and for the first and the second stopped layers of 20 frames of the Binders sequence, with the corresponding segmentation results, varying the number n of model layers. The high F1 values, slightly increasing with n, are achieved because most of the stopped pixels are indeed detected as stationary and only a few pixels detected as stopped are instead moving. Further experimental results showing the accuracy of the 3D_SOBS algorithm for MOD can be found in [44].
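For reference, the F1 measure is, as is standard, the harmonic mean of precision and recall computed on the labeled masks:

$$
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
$$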
The computational complexity of the proposed MOD and
SOD algorithm, both in terms of space and time, is basically proportional to the computational complexity inherent
to the adopted model, because it entails analogous models for
background, moving, and stopped foreground. Therefore, it is
O(n · N · P) for each sequence image, where n is the number of model layers, N × P is the image size, and we omit other small multiplicative constants.

Fig. 5. Average segmentation accuracy on the Dog and Binders sequences, varying the number n of model layers.

Fig. 6. Execution times (in msecs/frame) for the proposed MOD and SOD algorithm on color image sequences with different sizes (S, M, H), varying the number n of model layers.
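As a rough worked example of the complexity bound above, with the medium resolution and the default number of layers used in the experiments below (N × P = 360 × 288, n = 5), the per-frame cost is proportional to

    n \cdot N \cdot P = 5 \cdot 360 \cdot 288 \approx 5.2 \times 10^{5}

elementary model updates, up to the small multiplicative constants noted above (including a factor for the three analogous background, moving, and stopped foreground models).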
To complete our analysis, in Fig. 6, we report execution
times (in msecs/frame) of the proposed SFS algorithm using
the 3D_SOBS model, varying the spatial resolution of the
color image sequences (S = 180 × 144, M = 360 × 288, H = 720 × 576) and the number of model layers (n = 3, 5, 7, 9).
Timings have been obtained with a prototype implementation in the C programming language on a 2.40-GHz Pentium 4 with 512-MB RAM, running the Windows XP operating system, and do not include I/O. Image sequences with different resolutions have been obtained by subsampling the PV-medium sequence. As expected from the analysis of the computational
complexity, the sequence resolution is the predominant factor
influencing execution times. For a fixed resolution, execution
times moderately increase when augmenting the number of
model layers. Having already observed in Fig. 5 that the segmentation accuracy slightly increases with n, the value n = 5 chosen for all the reported experimental results represents a good compromise between high accuracy and computational complexity. Moreover, the plot shows that only for high-resolution sequences is the frame rate insufficient for real-time processing (about 40 msecs/frame). Nonetheless, we can observe that MOD (i.e., 3D_SOBS) accounts for most of the total execution time, while SOD (i.e., SFS) represents only a small percentage of it (around 15%). Therefore, stopped
foreground subtraction can be considered as a useful and
inexpensive by-product of background subtraction.
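To make such a timing breakdown concrete, a simple harness along the following lines could be used to measure the per-frame split between the MOD and SOD stages; process_mod and process_sod are hypothetical placeholders for the 3D_SOBS and SFS stages, not an actual API of our prototype.

    import time

    def time_pipeline(frames, process_mod, process_sod):
        """Average per-frame time (ms) spent in the MOD and SOD stages."""
        mod_ms = sod_ms = 0.0
        for frame in frames:
            t0 = time.perf_counter()
            moving = process_mod(frame)    # background subtraction (dominant cost)
            t1 = time.perf_counter()
            process_sod(moving)            # stopped foreground subtraction (~15%)
            t2 = time.perf_counter()
            mod_ms += (t1 - t0) * 1e3
            sod_ms += (t2 - t1) * 1e3
        return mod_ms / len(frames), sod_ms / len(frames)

    # Dummy stand-ins only, to show the call pattern:
    mod_ms, sod_ms = time_pipeline(range(100), lambda f: f, lambda m: m)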

V. CONCLUSION
This paper presented a neural-based contribution to stopped
object detection in digital image sequences taken from
stationary cameras. A 3-D neural model for image sequences that automatically adapts to scene changes in a self-organizing manner was adopted for modeling both the background and the foreground, with the aim of detecting stopped objects. Coupled with the proposed model-based framework for stopped object detection, it enables the segmentation of stopped foreground objects against moving foreground objects, robustly handling occlusion and restart problems.
Experimental results on real video sequences and comparisons with existing approaches showed that the proposed
3-D neural model-based framework favorably compares to
other tracking- and non-tracking-based approaches to stopped object detection. The proposed approach is shown to be an inexpensive by-product of background subtraction that provides an initial segmentation of scene objects, useful for subsequent video analysis tasks, such as abandoned and
removed object classification, people counting, and human
activity recognition.
REFERENCES
[1] Fourth IEEE International Conference on Advanced Video and Signal
Based Surveillance. Piscataway, NJ: IEEE Computer Society, Sep. 2007.
[2] Ninth IEEE International Workshop on Performance Evaluation of
Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE
Computer Society, Jun. 2006.
[3] Tenth IEEE International Workshop on Performance Evaluation of
Tracking and Surveillance, J. M. Ferryman, Ed. Piscataway, NJ: IEEE
Computer Society, Oct. 2007.
[4] Y. Tian, R. Feris, H. Liu, A. Hampapur, and M.-T. Sun, "Robust detection of abandoned and removed objects in complex surveillance videos," IEEE Trans. Syst., Man, Cybern. C, vol. 41, no. 5, pp. 565–576, Sep. 2011.
[5] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," Proc. SPIE, vol. 5308, pp. 881–892, 2004, doi:10.1117/12.526886.
[6] S. Elhabian, K. El Sayed, and S. Ahmed, "Moving object detection in spatial domain using background removal techniques: State-of-art," Recent Patents Comput. Sci., vol. 1, no. 1, pp. 32–54, Jan. 2008.
[7] M. Piccardi, "Background subtraction techniques: A review," in Proc. IEEE Int. Conf. Syst. Man Cybern., vol. 4, Oct. 2004, pp. 3099–3104.
[8] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection algorithms: A systematic survey," IEEE Trans. Image Process., vol. 14, no. 3, pp. 294–307, Mar. 2005.
[9] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. Int. Conf. Comput. Vis., vol. 1, 1999, pp. 255–261.
[10] M. Bhargava, C.-C. Chen, M. Ryoo, and J. Aggarwal, "Detection of object abandonment using temporal logic," Mach. Vis. Appl., vol. 20, no. 5, pp. 271–281, 2009.
[11] S. Boragno, B. Boghossian, J. Black, D. Makris, and S. Velastin, "A DSP-based system for the detection of vehicles parked in prohibited areas," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 260–265.
[12] S. Guler, J. A. Silverstein, and I. H. Pushee, "Stationary objects in multiple object tracking," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 248–253.
[13] J. Lee, M. Ryoo, M. Riley, and J. Aggarwal, "Real-time illegal parking detection in outdoor environments using 1-D transformation," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp. 1014–1024, Jul. 2009.
[14] S. Lu, J. Zhang, and D. Dagan Feng, "Detecting unattended packages through human activity recognition and object association," Pattern Recognit., vol. 40, no. 8, pp. 2173–2184, Aug. 2007.
[15] A. Singh, S. Sawan, M. Hanmandlu, V. Madasu, and B. Lovell, "An abandoned object detection system based on dual background segmentation," in Proc. 6th IEEE Int. Conf. Adv. Video Signal Based Surveill., Sep. 2009, pp. 352–357.
[16] P. Venetianer, Z. Zhang, W. Yin, and A. Lipton, "Stationary target detection using the ObjectVideo surveillance system," in Proc. IEEE Conf. Adv. Video Signal Based Surveill., Sep. 2007, pp. 242–247.
[17] J. Pan, Q. Fan, and S. Pankanti, "Robust abandoned object detection using region-level analysis," in Proc. Int. Conf. Image Process., Sep. 2011, pp. 3597–3600.
[18] F. Porikli, Y. Ivanov, and T. Haga, "Robust abandoned object detection using dual foregrounds," EURASIP J. Adv. Signal Process., Jan. 2008, p. 30.
[19] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, "A system for video surveillance and monitoring," Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-RI-TR-00-12, 2000.
[20] R. Evangelio, M. Patzold, and T. Sikora, "A system for automatic and interactive detection of static objects," in Proc. IEEE Workshop Pers.-Oriented Vis., Jan. 2011, pp. 27–32.
[21] H. Fujiyoshi and T. Kanade, "Layered detection for multiple overlapping objects," in Proc. 16th Int. Conf. Pattern Recognit., vol. 4, Aug. 2002, pp. 156–161.
[22] E. Herrero-Jaraba, C. Orrite-Urunuela, and J. Senar, "Detected motion classification with a double-background and a neighborhood-based difference," Pattern Recognit. Lett., vol. 24, no. 12, pp. 2079–2092, 2003.
[23] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, "Real-time foreground-background segmentation using codebook model," Real-Time Imag., vol. 11, no. 3, pp. 172–185, 2005.
[24] K. Patwardhan, G. Sapiro, and V. Morellas, "Robust foreground detection in video using pixel layers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 746–751, Apr. 2008.
[25] N. Papadakis and A. Bugeau, "Tracking with occlusions via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 144–157, Jan. 2011.
[26] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. Comput. Vis. Pattern Recognit., vol. 2, Jun. 1999, p. 252.
[27] Q. Zhang and K. Ngan, "Segmentation and tracking multiple objects under occlusion from multiview video," IEEE Trans. Image Process., vol. 20, no. 11, pp. 3308–3313, Nov. 2011.
[28] A. Albiol, L. Sanchis, A. Albiol, and J. Mossi, "Detection of parked vehicles using spatiotemporal maps," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1277–1291, Dec. 2011.
[29] D. Culibrk, O. Marques, D. Socek, H. Kalva, and B. Furht, "Neural network approach to background modeling for video object segmentation," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1614–1627, Nov. 2007.
[30] L. Duan, D. Xu, and I. Tsang, "Domain adaptation from multiple sources: A domain-dependent regularization approach," IEEE Trans. Neural Netw., vol. 23, no. 3, pp. 504–518, Mar. 2012.
[31] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Trans. Neural Netw., vol. 23, no. 3, pp. 412–424, Mar. 2012.
[32] E. Lopez-Rubio, R. M. L. Baena, and E. Domínguez, "Foreground detection in video sequences with probabilistic self-organizing maps," Int. J. Neural Syst., vol. 21, no. 3, pp. 225–246, 2011.
[33] L. Maddalena and A. Petrosino, "A self-organizing approach to background subtraction for visual surveillance applications," IEEE Trans. Image Process., vol. 17, no. 7, pp. 1168–1177, Jul. 2008.
[34] G. Pajares, "A Hopfield neural network for image change detection," IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1250–1264, Sep. 2006.
[35] S. Pal, A. Petrosino, and L. Maddalena, Handbook on Soft Computing for Video Surveillance. London, U.K.: Chapman & Hall, 2012.
[36] P. Wang, C. Shen, N. Barnes, and H. Zheng, "Fast and robust object detection using asymmetric totally corrective boosting," IEEE Trans. Neural Netw., vol. 23, no. 1, pp. 33–46, Jan. 2012.
[37] L. Maddalena and A. Petrosino, "3D neural model-based stopped object detection," in Proc. 15th Int. Conf. Image Anal. Process. (LNCS 5716), 2009, pp. 585–593.
[38] L. Maddalena and A. Petrosino, "Object motion detection and tracking by an artificial intelligence approach," Int. J. Pattern Recognit. Artif. Intell., vol. 22, no. 5, pp. 915–928, Jan. 2008.
[39] R. Fisher. (1999). Change Detection in Color Images [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/PAPERS/iccv99.pdf

[40] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1337–1342, Oct. 2003.
[41] J. Barron and R. Klette, "Experience with optical flow in colour video image sequences," in Proc. Inter-Vehicle Commun., 2001, pp. 195–200.
[42] Y. Mileva, A. Bruhn, and J. Weickert, "Illumination-robust variational optical flow with photometric invariants," in Proc. 29th DAGM Conf. Pattern Recognit., 2007, pp. 152–162.
[43] P. Burt, "Fast filter transform for image processing," Comput. Graph. Image Process., vol. 16, no. 1, pp. 20–51, 1981.
[44] L. Maddalena and A. Petrosino, "Further experimental results with 3D_SOBS algorithm for moving object detection," Dept. Applied Sci., Univ. Naples Parthenope, Naples, Italy, Tech. Rep. RT-DSA-UNIPARTHENOPE-12-01, 2012.
Lucia Maddalena (M'08) received the Laurea
degree (cum laude) in mathematics and the Ph.D.
degree in applied mathematics and computer science
from the University of Naples Federico II, Naples,
Italy.
She is currently a Researcher with the Institute
for High-Performance Computing and Networking,
National Research Council, Naples, Italy. She was
involved in research on parallel computing algorithms, methodologies, and techniques, and their
applications to computer graphics. She is currently
involved in research on methods, algorithms, and software for image processing and multimedia systems in high-performance computational environments,
with applications to real-world problems, particularly digital film restoration
and video surveillance. She has taught with the University of Naples Federico
II and the University of Naples Parthenope, Naples. She has co-edited one
book on soft computing for video surveillance.
Dr. Maddalena is a member of the International Association for Pattern
Recognition, an Associate Editor of the International Journal of Biomedical
Data Mining, and a reviewer of several international journals.

Alfredo Petrosino (SM'02) received the Laurea degree (cum laude) in computer science from the
University of Salerno, Salerno, Italy, under the
supervision of E. R. Caianiello.
He is currently a Full Professor of computer
science with the University of Naples Parthenope,
Naples, Italy, where he is the Head of the CVPRLab,
a research laboratory in computer vision and pattern
recognition (cvprlab.uniparthenope.it). He was with
the University of Salerno, the International Institute
of Advanced Scientific Studies, the Institute for the
Physics of Matter, and the National Research Council. He has taught with
the University of Salerno, the University of Siena, the University of Naples
Federico II, and the University of Naples Parthenope. He has authored or
co-authored more than 100 papers in journals and conferences, and has co-edited six books. His current research interests include computer vision, image
and video analysis, pattern recognition, neural networks, and fuzzy and rough
sets.
Prof. Petrosino is also a member of the International Association for
Pattern Recognition and of the International Neural Networks Society. He
has been the General Chair of the International Workshop on Fuzzy Logic and Applications since 1995 and of the 17th International Conference on Image
Analysis and Processing in 2013. He is an Associate Editor of Pattern
Recognition, a member of the Editorial Board of Pattern Recognition Letters and the International Journal of Knowledge Engineering and Soft Data Paradigms, and a Guest Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, Information Sciences, Fuzzy Sets and Systems, Image and Vision Computing, and Parallel Computing.
