1 Introduction
Computer vision has entered an exciting phase of development and use in recent years and has benefited from rapid advances in related fields such as machine learning, network infrastructure, computational power, and camera technology. Present applications go far beyond the simple security camera
of a decade ago and now include such fields as assembly line monitoring, sports
medicine, robotics, and medical tele-assistance. Indeed, developing a system that
accomplishes these complex tasks requires coordinated techniques of image anal-
ysis, statistical classification, segmentation, inference algorithms, and state space
estimation algorithms.
Irrespective of the problem domain, the computer vision problem is approximately the same for every application: first, we must detect what we consider to be the foreground object (or objects); we must then track these objects in time (over several video frames); and finally we must discern something about what these objects are doing. This is summarized as follows:
Fig. 1. Age and sex distribution for the year 2010 and the predicted for the year 2050
[37].
The main window of the application is shown in Figure 2. The figure shows the detection, extraction, and characterization of moving foreground objects.
Open Source software has also left its mark on the computer vision field, with a wide array of libraries freely available to developers to try their hand at the three fundamental problems of computer vision. A large listing of available computer vision software libraries (both open source and commercial) can be found at CMU [39].
Detecting moving objects in a video scene and separating them from the background is a challenging problem due to complications from illumination, shadows, dynamic backgrounds, etc. [30]. Motion detection is concerned with noticing objects that move within a scene, thereby separating foreground objects from the background [1,11]. The techniques that have been applied to motion detection can be grouped into three broad categories: environmental modeling, motion segmentation, and object classification.
Environmental modeling: requires constructing a set of images of the back-
ground. While trivial for stationary backgrounds, this technique is funda-
mentally limited for most real world problems, especially for outdoor scenes
that contain moving objects and dramatic changes in lighting [6].
Motion Segmentation: can be further divided into three different types: back-
ground subtraction, temporal differencing, and optical flow.
Optical flow: creates complex contours of the entire velocity field. It is the most robust method; however, it requires considerable computational resources and cannot at present be used in real-time applications [27,22].
Active contour based: the idea is to create bounding contours that represent each object’s outline. The problem is that the contours need to be updated from frame to frame, and the method suffers from limited precision [10,34].
Feature based: feature based tracking methods are very successful techniques
for tracking objects from frame to frame. No assumptions need to be made
about the objects other than an intelligent choice of feature vector. A typical
feature vector contains the following quantitative information of each object:
the object size, position, velocity, ratio of major axis orientation, coordinates
of the bounding box, the centroid, and the histogram of color pixels. The
advantage of using a feature based method is that statistical clustering can
be used to classify the objects from frame to frame [5,20].
Model based: these methods track the objects of interest with detailed models; they are very accurate, but computationally very expensive. For example, human figures have been modeled with stick figures and with complex gait algorithms [2,28].
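As a concrete illustration of the feature-based approach, the sketch below matches blobs between frames by nearest neighbour in a small feature space. All names are hypothetical and the field set is reduced for brevity; the thesis feature vector contains more components (bounding box, velocity, color histogram).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Minimal blob feature vector: size, centroid, and mean colour.
// Field choice is illustrative; the text lists more components.
struct BlobFeature {
    double size;       // number of pixels
    double cx, cy;     // centroid coordinates
    double r, g, b;    // mean colour per channel
};

// Euclidean distance between two feature vectors.
double featureDistance(const BlobFeature& a, const BlobFeature& b) {
    double d[] = {a.size - b.size, a.cx - b.cx, a.cy - b.cy,
                  a.r - b.r, a.g - b.g, a.b - b.b};
    double s = 0.0;
    for (double v : d) s += v * v;
    return std::sqrt(s);
}

// For each blob in the current frame, return the index of the
// closest blob from the previous frame (greedy nearest neighbour).
std::vector<std::size_t> matchBlobs(const std::vector<BlobFeature>& prev,
                                    const std::vector<BlobFeature>& curr) {
    std::vector<std::size_t> match(curr.size(), 0);
    for (std::size_t i = 0; i < curr.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < prev.size(); ++j) {
            double dist = featureDistance(curr[i], prev[j]);
            if (dist < best) { best = dist; match[i] = j; }
        }
    }
    return match;
}
```

In practice the components would be normalized or weighted before taking distances, since size and color live on very different scales.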
Fig. 3. Scene with many moving objects, such as tree leaves, dogs, or birds.
Mean background: this method overcomes the deficiencies of the pairwise difference by performing a cumulative average over n frames. This has the advantage that objects do not need to be in continuous movement. The method [18,35], however, suffers from artifacts that arise when objects move at different velocities. Another problem is that it needs to be hand-tuned. This technique was used for the initial studies in this thesis, so a more comprehensive discussion is provided later.
Codebook model: the codebook method is a powerful technique for some sit-
uations since it stores a model background based upon previously obtained
backgrounds. This is an interesting choice for indoor environments, as in our
Telecare system, where the background remains relatively static [19].
The software application, TELECARE, developed for this thesis has been writ-
ten in C++ and uses the OpenCV library, which is an open-source cross-platform
standard library, originally developed by Intel, for developing a wide range of
real-time computer vision applications. OpenCV implements low level image pro-
cessing as well as high level machine learning algorithms so that developers can
rapidly leverage well-known techniques for building new algorithms and robust
computer vision applications.
For the graphical interface, the Qt library is used since it provides excellent cross-platform performance. The Qt development environment is an open industry standard and is currently owned by Nokia. Existing applications not only run on other desktop platforms, but can also be extended for inclusion in web-enabled applications as well as mobile and embedded operating systems without the need for rewriting source code.
Since the TELECARE software is an experimental application, the graphical
interface is designed to provide maximum information about feature vector pa-
rameters as well as easy comparison of different segmentation algorithms. Thus,
the system is not meant for end-users at the moment. Instead, the architecture of the system provides a plugin framework for incorporating new ideas. A high-level schema of our software system is shown in Figure 4. It consists of the basic components described in the previous section, namely foreground extraction,
object segmentation, and finally tracking of individual objects. Tracking of mul-
tiple objects is performed by first clustering individual objects, or “blobs”, from
feature vectors formed by their dimensions, color space, and velocity vectors.
In order to experiment with different segmentation schemes for discriminating
events, we have implemented histogram based algorithms.
The first phase of extracting information from videos consists of performing basic
image processing functions. These include: loading the video, capturing individ-
ual frames, and applying various smoothing filters. Next, blobs are identified
based upon movement between frames. For static cameras, this amounts to sim-
ply taking note of pixel locations that change value from frame to frame within
a specified threshold. There are several sophisticated background subtraction methods for eliminating shadows and other artifacts, which we shall describe in the next section; however, the basic background subtraction rests on two ideas: (a) finding an appropriate mask for subtracting moving objects in the foreground and (b) updating the background. Two methods are typically employed: a running average or a Gaussian mixture model, in which each pixel is classified as belonging to a particular class.
The background is accumulated with a running average:

Acc(x, y) ← α · I(x, y) + (1 − α) · Acc(x, y)   (1)

where Acc is the accumulated pixel matrix, I(x, y) is the current image, and α is the weighting parameter.
For a constant value of α, the running average is not equivalent to summing all of the values for each pixel across a large set of images and then dividing by the total number of images to obtain the mean. To see this, simply consider adding three numbers (2, 3, and 4) with α set to 0.5. If we were to accumulate them with a “normal mean”, the sum would be 9 and the average 3. If we were to accumulate them with a “running average”, the first update would give 0.5·2 + 0.5·3 = 2.5, and adding the third term would give 0.5·2.5 + 0.5·4 = 3.25.
The reason that the second number is larger is that the most recent contributions
are given more weight than those from farther in the past. The parameter α
essentially sets the amount of time necessary for the influence of a previous
frame to fade. Figure 5 shows the results of different executions with α values of 0.05, 0.4, and 0.8.
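The worked example above can be checked directly. The sketch below implements a single running-average update, Acc ← α·I + (1 − α)·Acc, for one value; OpenCV applies the same update per pixel via cv::accumulateWeighted. The function name is illustrative, not the thesis code.

```cpp
#include <cassert>
#include <cmath>

// One running-average update: acc <- alpha * value + (1 - alpha) * acc.
// With alpha = 0.5, accumulating 2, 3, 4 yields 2.5 and then 3.25,
// reproducing the worked example in the text.
double runningAverage(double acc, double value, double alpha) {
    return alpha * value + (1.0 - alpha) * acc;
}
```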
Thus, the background is subtracted by calculating the running average of
previous background images and the current frame with the mask applied. An
example using this algorithm is given in Figure 6. The left image is the gener-
ated mask. The second image is the resulting background image obtained from
multiple calls to the running average routine. The third image is the result of
applying the mask to the current image. Finally, the fourth image is the orig-
inal image with rectangular contours defining moving foreground objects, after
applying image segmentation.
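The mask-generation idea can be sketched as thresholding the absolute difference between the current frame and the background model. The plain-array version below uses hypothetical names; in OpenCV the equivalent per-pixel operations are provided by cv::absdiff and cv::threshold.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Foreground mask by thresholded absolute difference between the
// current frame and the background model (one grey value per pixel).
// Pixels whose change exceeds the threshold are marked foreground (255).
std::vector<unsigned char> foregroundMask(const std::vector<double>& background,
                                          const std::vector<double>& frame,
                                          double threshold) {
    std::vector<unsigned char> mask(frame.size(), 0);
    for (std::size_t i = 0; i < frame.size(); ++i)
        if (std::fabs(frame[i] - background[i]) > threshold)
            mask[i] = 255;
    return mask;
}
```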
p(x_N) = ∑_{j=1}^{K} w_j η(x_N; θ_j)   (2)

where w_j is the weight parameter of the j-th Gaussian component and η(x_N; θ_j) is the Normal distribution of the j-th component, given by

η(x; θ_k) = η(x; µ_k, Σ_k) = (2π)^{−D/2} |Σ_k|^{−1/2} e^{−(1/2)(x − µ_k)^T Σ_k^{−1}(x − µ_k)}   (3)

In the above equation, µ_k is the mean and Σ_k = σ_k² I is the covariance of the k-th component. The K distributions are ordered by the fitness value w_k/σ_k, and the first B distributions are used as a model of the background of the scene, where B is estimated as

B = argmin_b ( ∑_{j=1}^{b} w_j > T )   (4)
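The ordering-and-selection step of the Gaussian mixture background model can be sketched as follows: components are sorted by the fitness value w/σ, and the smallest prefix whose cumulative weight exceeds the threshold T is taken as the background. Names are hypothetical; this is a sketch of the selection rule, not the thesis code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct GaussComponent {
    double weight; // w_k
    double sigma;  // square root of the (isotropic) variance
};

// Sort components by the fitness value w/sigma (descending) and
// return B, the size of the smallest prefix whose cumulative
// weight exceeds the threshold T.
std::size_t backgroundComponents(std::vector<GaussComponent> comps, double T) {
    std::sort(comps.begin(), comps.end(),
              [](const GaussComponent& a, const GaussComponent& b) {
                  return a.weight / a.sigma > b.weight / b.sigma;
              });
    double cumulative = 0.0;
    for (std::size_t b = 0; b < comps.size(); ++b) {
        cumulative += comps[b].weight;
        if (cumulative > T) return b + 1;
    }
    return comps.size();
}
```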
The algorithms Running Average (RunAvg) and Gaussian Mixture Model (GMM) have been tested with our computer vision system. The data consists of one video sequence with a resolution of 640×480 pixels, a duration of 22 seconds, and a rate of 25 frames per second. We selected frames 205, 210, 215, 220, 225, 230, 235, and 240. For each frame, foreground objects were segmented manually in order to have a ground truth for quantitative comparison. Table 1 shows the number of foreground pixels labeled as background (false negatives, FN), the number of background pixels labeled as foreground (false positives, FP), and the total percentage of wrongly labeled pixels, (FN + FP)/(640·480). Comparisons are complicated by the fact that sudden illumination changes produce shadows that increase the object size, and that modern webcams constantly adapt to illumination changes.
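The error measure used in Table 1 is simply (FN + FP)/(640·480), expressed as a percentage; for instance, the RunAvg (α = 0.05) entry at frame 205 follows from its FN and FP counts. A minimal sketch (hypothetical function name):

```cpp
#include <cassert>
#include <cmath>

// Percentage of wrongly labelled pixels: (FN + FP) / (width * height) * 100.
double errorPercentage(int falseNegatives, int falsePositives,
                       int width, int height) {
    return 100.0 * (falseNegatives + falsePositives)
           / (static_cast<double>(width) * height);
}
```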
Algorithm Errors fr.205 fr.210 fr.215 fr.220 fr.225 fr.230 fr.235 fr.240
RunAvg FN 11285 16636 20365 17657 10951 12353 14226 18882
(α = 0.05) FP 2907 4077 690 100 2158 5054 2006 903
% Total 4.62 6.74 6.85 5.78 4.27 5.67 5.28 6.44
RunAvg FN 32432 37071 42517 41929 35758 32765 39102 43144
(α = 0.4) FP 1536 2686 36 14 264 928 669 68
% Total 11.06 12.94 13.85 13.65 11.73 10.97 12.95 14.07
RunAvg FN 37318 41152 47173 45566 40125 38653 45742 52165
(α = 0.8) FP 758 1359 0 3 97 733 618 0
% Total 12.39 13.84 15.36 14.83 13.09 12.82 15.09 16.98
GMM FN 5175 2532 4091 7802 3515 3080 5168 4775
(σ = 1) FP 20275 28982 13942 9941 31695 27896 11158 6191
% Total 8.28 10.26 5.87 5.78 11.46 10.08 5.31 3.57
GMM FN 9246 9954 12168 10338 5551 6283 8347 12389
(σ = 2.5) FP 1072 1637 77 44 1050 3678 2686 1
% Total 3.34 3.77 3.99 3.38 2.15 3.24 3.59 4.03
GMM FN 39430 43029 48709 43144 37922 36189 44012 50461
(σ = 5) FP 0 0 0 0 0 0 0 0
% Total 12.84 14.01 15.86 14.04 12.34 11.78 14.33 16.43
Table 1. Results of different configurations of running average and Gaussian mixture
model.
Fig. 9. Percentage error of the best configuration of the Running Average model and the best configuration of the Gaussian mixture model.
Foreground objects are identified in each frame as rectangular blobs, which internally are separate images that can be manipulated and analyzed. Only blobs above a minimum size are considered real objects, thereby eliminating possible artifacts. For each blob, the background mask is subtracted and a simple image erosion operation is performed in order to eliminate any background color contributions. This erosion helps in feature classification, since common background pixels do not contribute to the object color histograms.
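The erosion step can be sketched as a standard 3×3 binary erosion, which strips one-pixel-wide fringes from a foreground mask; OpenCV provides this as cv::erode. The plain-array version below is illustrative, not the thesis code.

```cpp
#include <cassert>
#include <vector>

// 3x3 binary erosion: an output pixel stays foreground (1) only if
// all pixels in its 3x3 neighbourhood are foreground; border pixels
// are cleared. This removes one-pixel-wide background fringes.
std::vector<int> erode3x3(const std::vector<int>& img, int w, int h) {
    std::vector<int> out(img.size(), 0);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int all = 1;
            for (int dy = -1; dy <= 1 && all; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (!img[(y + dy) * w + (x + dx)]) { all = 0; break; }
            out[y * w + x] = all;
        }
    return out;
}
```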
In order to classify each blob uniquely, we define the following feature vector
parameters (Figure 10): (a) the size of the blob, (b) the Gaussian fitted values
of RGB components (Figure 11), (c) the coordinates of the blob center, and (d)
the motion vector. Because of random fluctuations in luminosity, smaller blobs appear, but we discard these blobs. The size of a blob is simply its total number of pixels. Histograms are obtained by considering bin sizes of 10 pixels. We also normalize the feature vectors by the number of pixels.
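A histogram with bins 10 values wide, normalised by the pixel count so that its bins sum to one, can be sketched as below (single channel shown; names are illustrative).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Histogram of one colour channel with bins 10 values wide
// (intensities 0..255 -> 26 bins), normalised by the number of
// pixels so that the bins sum to 1.
std::vector<double> normalizedHistogram(const std::vector<int>& channel) {
    std::vector<double> hist(26, 0.0);
    for (int v : channel) hist[v / 10] += 1.0;
    if (!channel.empty())
        for (double& bin : hist) bin /= channel.size();
    return hist;
}
```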
Fig. 11. Discrimination between blobs taken from different frames using color-space histograms. The x-axis is the norm of the red histogram difference, while the y-axis is the norm of the green histogram difference.
For a particular video sequence, the results shown in Figure 12 demonstrate that
two separate blobs are easily classified.
The tracking algorithm used is similar to other systems described in the liter-
ature. Once segmented foreground objects have been separated, and we have
formed a rectangular blob region, we characterize the blob by its feature vector.
Tracking is performed by matching features of the rectangular regions. Thus,
given N rectangular blobs, we match all these rectangles with the previous
frames. The frame to frame information allows us to extract a motion vector
of the blob centroid, which we use to predict the updated position of each blob.
For cases where there is ambiguity, such as track merging due to crossing, we
recalculate the clusters. Thus, the situations in which we explicitly recalculate the entire feature vector are the following: (a) objects move in and out of the view-port, (b) objects are occluded by scene elements or by each other, and (c) complex depth information must be obtained.
An example of a velocity vector tracked over time is given in Figure 13, where the
tracking algorithm must consider the crossing of two objects. The simple predic-
tion algorithm easily handles this situation by first extracting the instantaneous
velocity vector, and then performing a simple update based on the equations of
motion.
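The prediction step described above is a constant-velocity update: the motion vector estimated from previous frames is added to the current centroid. A minimal sketch (hypothetical names):

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Constant-velocity prediction of a blob centroid: the motion vector
// estimated over previous frames is scaled by the elapsed number of
// frames and added to the current centroid.
Vec2 predictCentroid(Vec2 centroid, Vec2 velocity, double dtFrames) {
    return {centroid.x + velocity.x * dtFrames,
            centroid.y + velocity.y * dtFrames};
}
```

During a crossing, each blob is matched to the predicted position rather than the last observed one, which is what resolves the ambiguity.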
For our initial work, we have considered a limited domain of events to detect, namely arm gestures and body positions (upright or horizontal), in order to detect falls. These two cases are used to address anomalous behavior or simple help signals for the elderly in their home environments.
Our analysis is based upon comparing histogram moment distributions through
the normalized difference of the histograms as well as the normalized difference
of the moments of each histogram.
Hist(H_i, H_j) = ∑_k |H_i(k) − H_j(k)|   (5)

MHist(H_i, H_j) = ∑_k |M_i(k) − M_j(k)|   (6)
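Both distances above are L1 (sum of absolute differences) norms: one over histogram bins, one over moment vectors. A single sketch covers both (illustrative name):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L1 distance between two vectors: used for Hist when the inputs are
// histogram bins, and for MHist when the inputs are moment vectors.
double l1Distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        sum += std::fabs(a[i] - b[i]);
    return sum;
}
```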
Fig. 14. Simple histogram results for obtaining moments for detecting arm gestures.
For each of the histograms obtained in Figure 14, and for the histograms of
Figure 15, statistical moments are calculated and then normalized histograms
(normalized both by bins and total number of points) are obtained. Clustering
can then be performed (similar to that of Figure 11) by calculating MHist(H_i, H_j), the normed difference.
Fig. 15. Basic histogram technique used for discriminating body position. The inset image shows the color space normalized to unity.
The histograms are obtained by summing all the pixels in the vertical direc-
tion. This is performed by dividing the image into equal-sized vertical stripes,
and summing over all nonzero pixels within each slice for each color channel.
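The vertical-projection histogram described above can be sketched as follows: divide the image into equal-width vertical stripes and count the nonzero pixels in each (one channel shown; names are illustrative).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Vertical-projection histogram: the image (row-major, one value per
// pixel) is divided into `stripes` equal-width vertical stripes and
// the nonzero pixels in each stripe are counted.
std::vector<int> verticalProjection(const std::vector<int>& img,
                                    int w, int h, int stripes) {
    std::vector<int> hist(stripes, 0);
    int stripeWidth = w / stripes;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (img[y * w + x] != 0)
                ++hist[std::min(x / stripeWidth, stripes - 1)];
    return hist;
}
```

An upright body concentrates its counts in a few stripes, while a horizontal body spreads them across many, which is what makes this projection useful for fall detection.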
Video sequences were recorded with typical actions/events that we would like to detect, such as a person falling to the floor. Figure 16 shows three video frames, taken from our application, where we have detected and tracked a foreground object, in this case a person as he falls to the floor.
Fig. 17. Comparison of different histogram moments obtained from the video frames
studied in Figure 8 and Figure 9.
Fig 14 (a) Fig 14 (b) Fig 14 (c) Fig 15 (a) Fig 15 (b)
Mean 0.52 0.33 0.45 0.57 0.49
Standard Deviation 0.21 0.16 0.2 0.18 0.24
Skewness 0.18 3.85 3.12 0.05 1.5
Table 2. Results of different moments for frames shown in figure 14 and figure 15.
discriminating between Figure 8a and 8c, where the histograms are similar yet the body position is quite different. We are presently improving the basic idea by using the information from both Hx and Hy with a Fisher criterion.
Thus, we have found that although our simple histogram technique for human body position works well for some cases of interest and is easy to implement, it is not sufficiently robust. Because of its simplicity, however, we are presently improving the technique while also investigating other model-based algorithms described in the literature, such as matching geometric shapes to body parts [15]. This approach has advantages both for the foreground segmentation problem and for determining body position.
5 Conclusions
In this master’s thesis we have described preliminary work and algorithms for a software system that will allow us to automatically track people and discriminate basic human motion events. This system is part of a more complete tele-monitoring system under development by our group. The complete telecare system will include additional information from sensors, providing complete information about a patient in their home. In this thesis, however, we have restricted the study to video algorithms that allow us to identify body positions (standing, lying, and bending), in order to ultimately translate this information from a low-level signal to a higher semantic level.
The thesis provides encouraging results and opens many possibilities for future study. In particular, in the field of foreground/background segmentation, the quantitative comparison we described is an effective methodology that can be used to optimize parameters in each model. While the feature-based tracking used in this thesis is rudimentary, a future study could combine this information with modern sequential Monte Carlo methods in order to obtain more robust tracking. Finally, while the histogram model developed in this thesis provides detection for a limited set of actions and events, it is a fast real-time method, similar to motion detection systems presently employed commercially, that should have utility in real systems.
References
1. A. Albiol, C. Sandoval, V. Naranjo, J.M. Mossi, Robust motion detector for video
surveillance applications. Departamento de comunicaciones, Universidad Politécnica
de Valencia (2009)
2. Alexandru O. Bălan and Leonid Sigal and Michael J. Black, A Quantitative Evalu-
ation of Video-based 3D Person Tracking, International Workshop on Performance
Evaluation of Tracking and Surveillance, pp. 349-356 (2005)
3. Botsis, G. Hartvigsen, Current status and future perspectives in telecare for elderly people suffering from chronic diseases. Journal of Telemedicine and Telecare, Vol
14, pp. 195-203 (2008)
4. G. Bradski, A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV
Library. O’Reilly Media, (2008)
5. Robert T. Collins, Yanxi Liu, Marius Leordeanu, Online Selection of Discriminative
Tracking Features. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, pp. 1631-1643 (2005)
6. Ahmed Elgammal and Ramani Duraiswami and David Harwood and Larry S. Davis
and R. Duraiswami and D. Harwood, Background and Foreground Modeling Using
Nonparametric Kernel Density for Visual Surveillance. Proceedings of the IEEE,
pp. 1151-1163 (2002)
7. Markus Enzweiler, Dariu M. Gavrila, Monocular pedestrian detection: survey and
experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), Vol 31(12), pp. 2179–2195 (2009)
8. P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan, Object detection with
discriminatively trained part based models. IEEE Transactions on Pattern Analysis
and Machine Intelligence, (2009)
9. Pedro F. Felzenszwalb and Daniel P. Huttenlocher, Efficient Matching of Pictorial
Structures. Proc. IEEE Computer Vision and Pattern Recognition Conf. Vol 2, pp.
66-73 (2000)
10. Daniel Freedman, Active Contours for Tracking Distributions. IEEE Trans. Image
Processing, Vol 13, pp. 518-526 (2004)
11. G.L. Foresti, L. Marcenaro, C.S. Regazzoni, Automatic detection and indexing of
video-event shots for surveillance applications. IEEE Transactions on multimedia,
Vol 4, No 4 (2002)
12. David A. Forsyth and Okan Arikan and Deva Ramanan, Computational Studies
of Human Motion: Part 1, Tracking and Motion Synthesis. Foundations and Trends
in Computer Graphics and Vision, (2006)
13. Tarak Gandhi, Mohan M. Trivedi, Pedestrian protection systems: issues, survey,
and challenges, IEEE Transactions On Intelligent Transportation Systems, Vol 8(3),
pp. 413–430 (2007)
14. I. Gómez-Conde, D. N. Olivieri, X.A. Vila, L. Rodrı́guez-Liñares, Smart Telecare
Video Monitoring for Anomalous Event Detection. Proceedings of ”5th Iberian con-
ference of systems and information technology” (CISTI 2010), Vol 1, pp. 384-389,
(Jun. 2010)
15. F. van der Heijden, R.P.W. Duin, D. de Ridder, D.M.J., Tax, Classification, Pa-
rameter estimation, and State Estimation, An engineering approach using Matlab.
John Wiley and Sons Ltd, West Sussex, England (2004)
16. Chih-Chiang Chen, Jun-Wei Hsieh, Yung-Tai Hsu, and Chuan-Yu Huang, Segmen-
tation of Human Body Parts Using Deformable Triangulation. In: IEEE Interna-
tional Conference on Pattern Recognition, Vol 1, pp. 355–358 (2006)
17. W. Hu, T. Tan, L. Wang, S. Maybank, A Survey on Visual Surveillance of Object
Motion and Behavior. IEEE Trans. Systems, Man, and cybernetics - Part C. Vol
34, No 3, pp. 334-352 (2004)
18. P. KaewTraKulPong and R. Bowden, An improved adaptive background mixture
model for real-time tracking with shadow detection. In Proceedings of the 2nd Eu-
ropean Workshop on Advanced Video-Based Surveillance Systems (2001)
37. United Nations, Department of Economic and Social Affairs Population Division,
http://www.un.org/esa/population/
38. NationMaster, a massive central demographic data source,
http://www.nationmaster.com/index.php
39. Listing of computer vision software, CMU site:
http://www.cs.cmu.edu/~cil/v-source.html