
Smart Telecare Video Monitoring for Anomalous

Event Detection

Iván Gómez Conde

Advisors: David Olivieri Cecchi and Xosé Antón Vila Sobrino


Master Thesis (Intelligent and Adaptable Software Systems Master)
University of Vigo
ivangconde@gmail.com

Abstract. Behavior determination and multiple object tracking for video surveillance are two of the most active fields of computer vision, largely because of their many application areas. This thesis describes work in developing software algorithms for tele-assistance for the elderly, which could be used as an early-warning monitor for anomalous events. The thesis treats algorithms both for the multiple object tracking problem and for simple behavior detectors based on human body positions. There are several original contributions proposed by this thesis. First, a method for comparing foreground-background segmentation algorithms is proposed. Second, a feature-vector-based tracking algorithm is developed for discriminating multiple objects. Finally, a simple real-time histogram-based algorithm is described for discriminating movements and body positions.

1 Introduction

Computer vision has entered an exciting phase of development and use in re-
cent years and has benefitted from rapid developments in related fields such as
machine learning, improved Internet networks, computational power, and cam-
era technology. Present applications go far beyond the simple security camera
of a decade ago and now include such fields as assembly line monitoring, sports
medicine, robotics, and medical tele-assistance. Indeed, developing a system that
accomplishes these complex tasks requires coordinated techniques of image anal-
ysis, statistical classification, segmentation, inference algorithms, and state space
estimation algorithms.
Irrespective of the problem domain, the computer vision problem is approximately the same for every application: first we must somehow detect what we consider to be the foreground object (or objects), then track these objects in time (over several video frames), and finally discern something about what these objects are doing. This is summarized as follows:

1. Foreground object segmentation
2. Tracking over time
3. Determination of actions and behavior

Fig. 1. Age and sex distribution for the year 2010 and the prediction for the year 2050 [37].

1.1 Problem Domain of this Thesis: Tele-assistance

The motivation for this thesis is the development of a tele-assistance application, which represents a useful and very relevant problem domain. To understand the importance of this problem area, it is interesting to appreciate some recent statistics. Life expectancy worldwide has risen sharply in recent years. By 2050, the number of people aged 65 and over will exceed the number of youth under 15 years, according to recent demographic studies [37,38]. Combined with sociological factors, there is thus a growing number of elderly people (Figure 1) who live alone or with their partners. These people need medical care, which raises two big problems: there are not enough people to care for the elderly population, and governments cannot cope with the enormous social spending involved.
Thus, Computer Vision (CV) can provide a strong economic savings by elim-
inating the need for 24 hour in-house assistance by medical staff. A potential
tele-assistance application could combine vital signs monitoring while at the
same time use video to detect anomalous behavior, such as falls or excessively
long periods of inactivity. Therefore, present systems are being developed that
collect a wide array of relevant medical information of patients in their homes,
and send this information to a central server over the Internet. Thus, these
systems provide clear economic cost reduction of primary care as well as early
warning monitoring patients before they enter a critical phase, necessitating
costly emergency care.
A general schematic of the application developed for this master's project is given in Figure 2. It has a number of parameters that can be hand-tuned for investigating optimal parameters for background subtraction. The system, based upon the OpenCV computer vision library, captures real-time video from a webcam, performs foreground object segmentation, tracks these objects, and subsequently determines a limited set of human body actions. A screen capture of the main window of the application is shown in Figure 2; the figure shows detection, extraction, and characterization of moving foreground objects.

Fig. 2. A screenshot of the main window of the tele-assistance application.

1.2 Objectives and Organization


This master thesis describes a computer vision application that can be used for Telecare. There are three contributions made by this thesis: (1) a comparison of different foreground/background segmentation algorithms, (2) a feature-based model for tracking multiple objects, and (3) an online method for detecting basic human body positions, which could eventually be used in a behavior detection algorithm.
This thesis is organized as follows. First, we describe the state of the art of video surveillance and especially how it relates to the domain of remote Telecare. In Section 3, we describe the architecture of our software system, as well as details of motion detection, segmentation of objects, and the methods we have developed for detecting anomalous events. Finally, Sections 4 and 5 present the performance results and conclusions of this work.
Part of this master thesis was presented as a long article at the 5th Iberian Conference on Information Systems and Technologies (CISTI 2010) [14].

2 State of the art


For nearly 30 years, researchers have been seeking the best ways of carrying out the concerted tasks of computer vision. As such, the quantity of techniques published in the literature is daunting. Due to the maturity of the field, several seminal reviews are now available. For segmentation and tracking, for example, Hu [17] provides a useful review and taxonomy of algorithms used in multi-object detection. For human body behavior determination from video sequences, the recent review by Poppe [31] provides an impressive whirlwind tour of algorithms and techniques.

Open Source software has also left its mark on the computer vision field, with a wide array of libraries freely available to developers who want to try their hand at the three fundamental problems of computer vision. A large listing of available computer vision software libraries (both open source and commercial) can be found at CMU [39].

2.1 Surveillance System Evolution

The technological evolution of video monitoring and surveillance systems has been ongoing over recent years, and now includes the latest advances in computer vision. In their review, Hu [17] and co-workers divide video surveillance systems into three generations. Starting from first-generation systems, which consisted of analog monitors requiring intensive human intervention, up to the present with real-time digital embedded processing, a rapid evolution is underway. Present systems can incorporate real-time compression, and networks are fully integrated, with the possibility of interconnection with the Internet and access from mobile devices. The increased computational capacity of computers today permits real-time segmentation and tracking of multiple objects.
Next-generation video surveillance systems are presently the subject of intense research, and the fundamental problem being addressed is how to monitor and infer behavior from the video sequences. Van der Heijden [15] and Olson [29] describe event graphs, which provide decision making by connecting a series of sequential events. Other fundamental issues in present systems are the need for multi-object tracking as well as multiple synchronized cameras. The most up-to-date review relevant to this thesis, however, is that of Poppe [31] on human action recognition, containing 180 references. Poppe clearly demonstrates that the main thrust of work over the past ten years shows that successful systems must rely heavily upon statistical machine learning and modern kernel methods.

Remote Telecare Systems: There are several research groups investigating remote assistance technologies for the elderly in their homes, some incorporating smart video camera technology, such as that treated in this thesis. Many systems under development also incorporate vital signs monitoring and other sensor technologies. Examples of these projects are MobiAlaram, TeleADM, Soprano, PLAMSI, Dreaming, and so on [3,25,33,36]. These systems translate low-level signal information, whether from video monitoring, vital signs monitoring, etc., into a higher-level representation that can be used to make decisions or, ultimately, to provide more timely care.

2.2 Motion Detection

Detecting moving objects in a video scene and separating them from the background is a challenging problem due to complications from illumination, shadows, dynamic backgrounds, etc. [30]. Motion detection is concerned with being able to notice objects that move within a scene, thereby separating foreground objects from the background [1,11]. The several techniques that have been applied to motion detection can be grouped into three broad categories: environmental modeling, motion segmentation, and object classification.
Environmental modeling: requires constructing a set of images of the back-
ground. While trivial for stationary backgrounds, this technique is funda-
mentally limited for most real world problems, especially for outdoor scenes
that contain moving objects and dramatic changes in lighting [6].

Motion Segmentation: can be further divided into three different types: back-
ground subtraction, temporal differencing, and optical flow.

Static background subtraction: uses a pixel-by-pixel comparison of the current image to a static reference image in order to detect moving objects. While simple and valid for static backgrounds, the method places stringent constraints on the background image [24,21].

Temporal differencing: uses pixel-wise differencing of a sequence of frames. The method is highly dependent upon the time interval chosen between frames [32].

Optical flow: computes the velocity field of the entire image. It is the most robust method, but it requires considerable computational resources and cannot at present be used in real-time applications [27,22].

Object Classification: these techniques are applied to situations where there is more than one foreground object moving in a scene. The solutions to this class of problem can be further divided into two main categories: shape-based classifiers, which discern objects by their shape, and motion-based classifiers, which classify based upon the type of motion [23].

2.3 Object Tracking


Once objects have been separated from the background, we are interested in tracking them individually. The goal is to track particular objects during subsequent frames of the video sequence based upon their unique characteristics and locations. An important research field is tracking multiple people in cluttered environments. This is a challenging problem due to several factors: people may have similar shapes or colors, complex events such as running and walking may occur within the same scene, and real-world cameras introduce depth perspective. Another challenging problem that must be treated by the software is occlusion by objects or other people. We shall briefly consider four prominent strategies for implementing object tracking: region based, active contour based, feature based, and model based.
Region based: region tracking consists of arbitrarily defining static regions within the image frame to monitor. The method tracks objects within each region, but fails when objects move between regions [34].

Active contour based: the idea is to create bounding contours which represent each object's outline. The problem is that the contours need to be updated from frame to frame, and the method suffers from poor precision [10,34].

Feature based: feature-based tracking methods are very successful techniques for tracking objects from frame to frame. No assumptions need to be made about the objects other than an intelligent choice of feature vector. A typical feature vector contains the following quantitative information for each object: the object size, position, velocity, major-axis orientation, coordinates of the bounding box, the centroid, and the histogram of color pixels. The advantage of using a feature-based method is that statistical clustering can be used to classify the objects from frame to frame [5,20].

Model based: these methods are very accurate, but computationally very expensive. They involve modeling the objects to be tracked with detailed models; for example, human figures have been modeled with stick figures and with complex gait algorithms [2,28].

2.4 Object Segmentation


Several problems arise when implementing background subtraction techniques. The most obvious is noise, which has its origin in the fact that the backgrounds of consecutive frames may not be exactly the same due to changes in luminosity, temporal occlusions, and the number and velocity of different objects in the scene. Although often associated with interior video scenes, all or a subset of these problems can also occur in exterior scenes.
Thus, exterior scenes share similar problems with interior scenes, yet other more complex effects may also arise. An example is an oscillating background in the case of a forest, where the wind generates tree leaf movement (Figure 3). Models may be complicated by other problems, such as how to determine whether an object is part of the background or not. A typical textbook example that illustrates the foreground-background problem is the following: imagine a video scene with parked cars. At the moment that a car enters the parking lot, it is a foreground object. However, at what point should our algorithm consider this car to be a background object?
The principal methods used for determining the background in a frame are the following: differences between frames, which determines the background by taking the logical difference between two frames; the mean background, which is based upon the mean background of n frames; and the codebook model, which analyzes blocks of frames in order to estimate the background model.
Differences between frames: determines the background by taking the logi-
cal difference between two frames separated by a time t. It is easy to imple-
ment and is not computationally intensive, but it is not sufficiently robust,
and plagued with artifacts and noise [24].

Fig. 3. Scene with many moving objects, such as tree leaves, dogs, or birds.

Mean background: this method overcomes the deficiencies of the pairwise difference by performing a cumulative average over n frames. It has the advantage that objects do not need to be in continuous movement. The method [18,35], however, suffers from artifacts that arise when objects move at different velocities. Another problem is that it needs to be hand-tuned. For the initial studies in this thesis, this technique was utilized, and a more comprehensive discussion is provided later.

Codebook model: the codebook method is a powerful technique for some sit-
uations since it stores a model background based upon previously obtained
backgrounds. This is an interesting choice for indoor environments, as in our
Telecare system, where the background remains relatively static [19].

2.5 Human Action Recognition

There is a large body of literature in the area of human action recognition. Recent comprehensive surveys of human action recognition in videos are given by Poppe [31] and Forsyth [12]. Related work for locating people in static images has been reviewed in [7,13].
Many techniques have been described in the literature for distinguishing body part positions for the purpose of action recognition. As an example of early approaches to this complex problem, a template matching technique was reported by Felzenszwalb [9], where body parts are represented by a deformable skeleton model which is mapped to points on the body from frame to frame. More recent models rely upon sophisticated statistical machine learning techniques, such as that of Meeds et al. [26], which uses a fully Bayesian probabilistic model to construct tree models of foreground objects. Other researchers use kernel classifiers [8] for constructing systems based on mixtures of multi-scale deformable part models. Hsieh et al. [16] have shown that, using a triangulation method, one can distinguish human body parts over sequences of video scenes.

3 The software system

The software application, TELECARE, developed for this thesis has been written in C++ and uses the OpenCV library, an open-source cross-platform standard library, originally developed by Intel, for developing a wide range of real-time computer vision applications. OpenCV implements low-level image processing as well as high-level machine learning algorithms, so that developers can rapidly leverage well-known techniques for building new algorithms and robust computer vision applications.
For the graphical interface, the Qt library is used, since it provides excellent cross-platform performance. The Qt development environment is an open industry standard and is now owned by Nokia. Existing applications not only run on other desktop platforms, but can be extended for inclusion in web-enabled applications as well as mobile and embedded operating systems without the need for rewriting source code.
Since the TELECARE software is an experimental application, the graphical interface is designed to provide maximum information about feature vector parameters as well as easy comparison of different segmentation algorithms. Thus, the system is not meant for end-users at the moment. Instead, the architecture of the system provides a plugin framework for including new ideas. A high-level schema of our software system is shown in Figure 4. It consists of the basic components described in the previous section, namely foreground extraction, object segmentation, and finally tracking of individual objects. Tracking of multiple objects is performed by first clustering individual objects, or "blobs", from feature vectors formed by their dimensions, color space, and velocity vectors. In order to experiment with different segmentation schemes for discriminating events, we have implemented histogram-based algorithms. A skeleton of this processing loop is sketched below.
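As a concrete illustration, the following minimal sketch shows the shape of such a processing loop using the modern OpenCV C++ API (the application itself was built against the OpenCV version current in 2010); the choice of the MOG2 subtractor, the blob-size threshold, and the names used here are illustrative assumptions rather than the actual TELECARE source.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::VideoCapture cam(0);                      // real-time webcam capture
    if (!cam.isOpened()) return 1;

    cv::Ptr<cv::BackgroundSubtractor> bg = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, fgMask;

    while (cam.read(frame)) {
        bg->apply(frame, fgMask);                 // foreground segmentation
        cv::erode(fgMask, fgMask, cv::Mat());     // suppress small artifacts

        std::vector<std::vector<cv::Point>> contours;
        cv::Mat work = fgMask.clone();
        cv::findContours(work, contours, cv::RETR_EXTERNAL,
                         cv::CHAIN_APPROX_SIMPLE);

        for (const auto& c : contours) {
            cv::Rect blob = cv::boundingRect(c);  // rectangular blob region
            if (blob.area() < 500) continue;      // assumed minimum blob size
            cv::rectangle(frame, blob, cv::Scalar(0, 255, 0), 2);
            // feature extraction and tracking of each blob would follow here
        }
        cv::imshow("TELECARE", frame);
        if (cv::waitKey(30) == 27) break;         // ESC quits
    }
    return 0;
}
```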

3.1 Foreground Segmentation

The first phase of extracting information from videos consists of performing basic image processing functions. These include loading the video, capturing individual frames, and applying various smoothing filters. Next, blobs are identified
Fig. 4. Computer Vision system for video surveillance.



based upon movement between frames. For static cameras, this amounts to simply taking note of pixel locations whose values change from frame to frame beyond a specified threshold. There are several sophisticated background subtraction methods for eliminating shadows and other artifacts, which we shall describe in the next section; however, basic background subtraction is based upon two ideas: (a) finding an appropriate mask for subtracting moving objects in the foreground and (b) updating the background. Two methods are typically employed: a running average or a Gaussian mixture model, where each pixel is classified as belonging to a particular class.

Running average [4]: A running average technique is by far the easiest to comprehend. Each point of the background is calculated by taking the mean of accumulated points over some pre-specified time interval, ∆t. In order to control the influence of previous frames, a weighting parameter α is used as a multiplying constant in the following way:

Acc_t(x, y) = (1 − α) · Acc_{t−1}(x, y) + α · I_t(x, y)    (1)

where Acc is the accumulated pixel matrix, I_t(x, y) is the current image, and α is the weighting parameter.
For a constant value of α, the running average is not equivalent to summing all of the values for each pixel across a large set of images and then dividing by the total number of images to obtain the mean. To see this, simply consider adding three numbers (2, 3, and 4) with α set to 0.5. If we were to accumulate them with the normal mean, the sum would be 9 and the average 3. If we were to accumulate them with the running average, the first step would give 0.5 · 2 + 0.5 · 3 = 2.5, and adding the third term would give 0.5 · 2.5 + 0.5 · 4 = 3.25. The reason that the second number is larger is that the most recent contributions are given more weight than those from farther in the past. The parameter α essentially sets the amount of time necessary for the influence of a previous frame to fade. Figure 5 shows the results of different executions with α values of 0.05, 0.4, and 0.8.
Thus, the background is subtracted by calculating the running average of
previous background images and the current frame with the mask applied. An
example using this algorithm is given in Figure 6. The left image is the gener-
ated mask. The second image is the resulting background image obtained from
multiple calls to the running average routine. The third image is the result of
applying the mask to the current image. Finally, the fourth image is the orig-
inal image with rectangular contours defining moving foreground objects, after
applying image segmentation.
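For reference, Eq. (1) corresponds directly to OpenCV's cv::accumulateWeighted. The sketch below is a minimal running-average background subtractor built on that call; the difference threshold of 30 is an assumed value, and in the real application α is one of the hand-tuned parameters.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cam(0);
    if (!cam.isOpened()) return 1;

    const double alpha = 0.05;   // weighting parameter (best value in Table 1)
    cv::Mat frame, gray, acc, bg, diff, mask;

    while (cam.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (acc.empty()) gray.convertTo(acc, CV_32F);   // initialize accumulator

        // Acc_t = (1 - alpha) * Acc_{t-1} + alpha * I_t, i.e. Eq. (1)
        cv::accumulateWeighted(gray, acc, alpha);

        acc.convertTo(bg, CV_8U);                 // current background estimate
        cv::absdiff(gray, bg, diff);              // pixel-wise difference
        cv::threshold(diff, mask, 30, 255, cv::THRESH_BINARY); // assumed threshold

        cv::imshow("foreground mask", mask);
        if (cv::waitKey(30) == 27) break;         // ESC quits
    }
    return 0;
}
```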

Gaussian Mixture Model [18]: This model is able to eliminate many of the artifacts that the running average method is unable to treat. It models each background pixel as a mixture of K Gaussian distributions (where K is typically a small number, from 3 to 5).

Fig. 5. Different executions of the running average.

Fig. 6. Foreground subtraction results for a particular frame in a video sequence.

Different Gaussians are assumed to represent different colors. The weight parameters of the mixture represent the proportions of time that those colors stay in the scene. The background components are determined by assuming that the background contains the B most probable colors, namely those which stay longer and are more static. Static single-color objects tend to form tight clusters in color space, while moving objects form wider clusters due to the different reflecting surfaces encountered during the movement. The measure of this is called the fitness value. To allow the model to adapt to changes in illumination and to run in real time, an update scheme based upon selective updating is applied. Every new pixel value is checked against the existing model components to determine its fitness. The first matched model component is updated. If no match is found, a new Gaussian component is added with its mean at that point, a large covariance matrix, and a small weighting parameter.
Each pixel in the scene is modelled by a mixture of K Gaussian distributions. The probability that a certain pixel has value x_N at time N can be written as:
Smart Telecare Video Monitoring for Anomalous Event Detection 11

p(x_N) = Σ_{j=1}^{K} w_j · η(x_N; θ_j)    (2)

where w_j is the weight parameter of the j-th Gaussian component and η(x; θ_k) is the normal distribution of the k-th component, given by
η(x; θ_k) = η(x; µ_k, Σ_k) = 1 / ((2π)^{D/2} |Σ_k|^{1/2}) · e^{−(1/2)(x−µ_k)^T Σ_k^{−1} (x−µ_k)}    (3)

In the above equation, µ_k is the mean and Σ_k = σ_k² I is the covariance of the k-th component (D is the dimension of x). The K distributions are ordered based on the fitness value w_k/σ_k, and the first B distributions are used as the model of the background of the scene, where B is estimated as:

B = arg min_b ( Σ_{j=1}^{b} w_j > T )    (4)

The threshold T is the minimum fraction of the background model; in other words, it is the minimum prior probability that the background is in the scene. Figure 7 shows the results of three executions with different values for the σ of the Gaussians: 1, 2.5, and 5.

Fig. 7. Different executions of the Gaussian mixture model.

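OpenCV provides an adaptive Gaussian-mixture background subtractor derived from this family of models ([18,35]). The sketch below shows how such an experiment could be set up with it; mapping the σ studied above onto the library's varThreshold parameter (a squared-distance threshold) is a loose assumption, as are the remaining parameter values.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cam(0);
    if (!cam.isOpened()) return 1;

    // history of 500 frames; varThreshold set loosely from sigma = 2.5 (the
    // best configuration in Table 1); shadow detection enabled
    cv::Ptr<cv::BackgroundSubtractorMOG2> mog2 =
        cv::createBackgroundSubtractorMOG2(500, 2.5 * 2.5, true);
    mog2->setNMixtures(5);                   // K Gaussians per pixel (3 to 5)

    cv::Mat frame, fgMask;
    while (cam.read(frame)) {
        mog2->apply(frame, fgMask);          // update model, get foreground mask
        // shadows are labeled 127 in the mask; keep only definite foreground
        cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);
        cv::imshow("GMM foreground", fgMask);
        if (cv::waitKey(30) == 27) break;    // ESC quits
    }
    return 0;
}
```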


Quantitative comparison between the running average and the Gaussian mixture model

The Running Average (RunAvg) and Gaussian Mixture Model (GMM) algorithms have been tested with our computer vision system. The data consist of one video sequence with a resolution of 640x480 pixels, a duration of 22 seconds, and 25 frames per second. We selected frames 205, 210, 215, 220, 225, 230, 235, and 240. For each frame, it was necessary to manually segment the foreground objects in order to have a ground truth for quantitative comparison. Table 1 shows the number of foreground pixels labeled as background (false negatives, FN), the number of background pixels labeled as foreground (false positives, FP), and the total percentage of wrongly labeled pixels, (FN + FP)/(640 · 480). Comparisons are complicated by the fact that sudden illumination changes produce shadows that increase the object size, and by the fact that modern webcams constantly adapt to illumination changes.
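The error counts reported in Table 1 reduce to a few mask operations. The following hypothetical helper sketches the comparison, assuming the ground-truth and algorithm masks are single-channel 8-bit images of the same size with foreground marked as 255.

```cpp
#include <opencv2/opencv.hpp>
#include <cstdio>

// Compare an algorithm's binary foreground mask against a manually segmented
// ground-truth mask (both 8-bit, 0 = background, 255 = foreground).
void evaluateMask(const cv::Mat& truth, const cv::Mat& predicted) {
    cv::Mat notTruth, notPred, fnMask, fpMask;
    cv::bitwise_not(truth, notTruth);
    cv::bitwise_not(predicted, notPred);
    cv::bitwise_and(truth, notPred, fnMask);      // foreground labeled background
    cv::bitwise_and(notTruth, predicted, fpMask); // background labeled foreground
    int fn = cv::countNonZero(fnMask);
    int fp = cv::countNonZero(fpMask);
    double pct = 100.0 * (fn + fp) / (truth.rows * truth.cols);
    std::printf("FN = %d  FP = %d  error = %.2f %%\n", fn, fp, pct);
}
```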

Algorithm Errors fr.205 fr.210 fr.215 fr.220 fr.225 fr.230 fr.235 fr.240
RunAvg FN 11285 16636 20365 17657 10951 12353 14226 18882
(α = 0.05) FP 2907 4077 690 100 2158 5054 2006 903
% Total 4.62 6.74 6.85 5.78 4.27 5.67 5.28 6.44
RunAvg FN 32432 37071 42517 41929 35758 32765 39102 43144
(α = 0.4) FP 1536 2686 36 14 264 928 669 68
% Total 11.06 12.94 13.85 13.65 11.73 10.97 12.95 14.07
RunAvg FN 37318 41152 47173 45566 40125 38653 45742 52165
(α = 0.8) FP 758 1359 0 3 97 733 618 0
% Total 12.39 13.84 15.36 14.83 13.09 12.82 15.09 16.98
GMM FN 5175 2532 4091 7802 3515 3080 5168 4775
(σ = 1) FP 20275 28982 13942 9941 31695 27896 11158 6191
% Total 8.28 10.26 5.87 5.78 11.46 10.08 5.31 3.57
GMM FN 9246 9954 12168 10338 5551 6283 8347 12389
(σ = 2.5) FP 1072 1637 77 44 1050 3678 2686 1
% Total 3.34 3.77 3.99 3.38 2.15 3.24 3.59 4.03
GMM FN 39430 43029 48709 43144 37922 36189 44012 50461
(σ = 5) FP 0 0 0 0 0 0 0 0
% Total 12.84 14.01 15.86 14.04 12.34 11.78 14.33 16.43
Table 1. Results of different configurations of the running average and Gaussian mixture models.

Previously, we showed three configurations each for the running average and the Gaussian mixture model. Figure 8 shows the percentages of false positive and false negative pixels for the different executions. The best results (Figure 9) are obtained with a small value of alpha (α = 0.05) for the running average and with σ = 2.5 for the Gaussian mixture model.

Fig. 8. % of FN and FP based on manually segmented image.

Fig. 9. % error of the best configuration with the Running average model and the best
configuration with the Gaussian mixture model.

3.2 Finding and Matching Blob Features

Foreground objects are identified in each frame as rectangular blobs, which internally are separate images that can be manipulated and analyzed. Only blobs above a minimum size are considered real objects, thereby eliminating possible artifacts. For each blob, the background mask is subtracted and a simple image erosion operation is performed in order to eliminate any background color contributions. This erosion process helps in feature classification, since common background pixels do not contribute to the object color histograms.
In order to classify each blob uniquely, we define the following feature vector parameters (Figure 10): (a) the size of the blob, (b) the Gaussian fitted values of the RGB components (Figure 11), (c) the coordinates of the blob center, and (d) the motion vector. Because of random fluctuations in luminosity, smaller spurious blobs appear, but we discard them. The size of a blob is simply its total number of pixels. Histograms are obtained using bin sizes of 10 pixels, and we normalize the feature vectors by the number of pixels. A sketch of this feature extraction is given below.
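A minimal sketch of this per-blob feature extraction follows, assuming the modern OpenCV C++ API; the structure layout, the 26-bin histogram (approximating the 10-pixel bin width), and the helper name are illustrative, not the application's actual code.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>

// Feature vector for one blob: (a) size, (b) color histograms, (c) centroid,
// (d) motion vector (filled in later by the tracker).
struct BlobFeatures {
    int size;
    cv::Mat histBGR[3];
    cv::Point2f centroid;
    cv::Point2f velocity;
};

BlobFeatures extractFeatures(const cv::Mat& frame, const cv::Rect& box,
                             const cv::Mat& fgMask) {
    BlobFeatures f;
    cv::Mat roiMask = fgMask(box).clone();
    cv::erode(roiMask, roiMask, cv::Mat());      // strip background fringe pixels
    f.size = cv::countNonZero(roiMask);          // blob size = number of pixels
    f.centroid = 0.5f * (cv::Point2f(box.tl()) + cv::Point2f(box.br()));
    f.velocity = cv::Point2f(0.f, 0.f);

    cv::Mat roi = frame(box);
    int histSize = 26;                           // ~10-value-wide bins over 0..255
    float range[] = {0.f, 256.f};
    const float* ranges[] = {range};
    for (int ch = 0; ch < 3; ++ch) {             // B, G, R channels
        cv::calcHist(&roi, 1, &ch, roiMask, f.histBGR[ch], 1, &histSize, ranges);
        // normalize by the number of blob pixels so blobs of different sizes
        // can be compared with a simple norm difference
        f.histBGR[ch] = f.histBGR[ch] / std::max(f.size, 1);
    }
    return f;
}
```

Two blobs can then be compared with, for example, cv::norm(a.histBGR[0] - b.histBGR[0], cv::NORM_L1), which is the kind of norm difference plotted in Figures 11 and 12.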

Fig. 10. The feature vector parameters for classifying blobs.

Fig. 11. Discrimination of the color space histograms between blobs taken from different frames. The x-axis is the norm difference for red, while the y-axis is the norm difference of the histograms for green.

In order to match blobs from frame to frame, we perform clustering. Since this can be expensive to calculate for each frame, we only recalculate the full clustering when blobs intersect. Figure 12 shows the excellent discrimination obtained by using the norm of the histogram differences between blobs for each color channel. In the plot, the x-axis is the norm of the difference of the red channel histograms between blobs, while the y-axis is the norm of the difference of the green channel histograms. For a particular video sequence, the results shown in Figure 12 demonstrate that two separate blobs are easily classified.

Fig. 12. The Histograms of the BGR color space.

3.3 Tracking Individual Blobs

The tracking algorithm used is similar to those of other systems described in the literature. Once segmented foreground objects have been separated and a rectangular blob region has been formed, we characterize the blob by its feature vector. Tracking is performed by matching the features of the rectangular regions. Thus, given N rectangular blobs, we match all of these rectangles with those of the previous frames. The frame-to-frame information allows us to extract a motion vector of the blob centroid, which we use to predict the updated position of each blob. For cases where there is ambiguity, such as track merging due to crossing, we recalculate the clusters. Thus, the situations where we explicitly recalculate the entire feature vector are the following: (a) objects move in and out of the view-port, (b) objects are occluded by other objects or by each other, and (c) complex depth information must be obtained.
An example of the velocity vector taken over time is given in Figure 13, where the tracking algorithm must consider the crossing of two objects. The simple prediction algorithm easily handles this situation by first extracting the instantaneous velocity vector and then performing a simple update based on the equations of motion, as sketched below.
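The following sketch captures the essence of this predict-and-match step under a constant-velocity assumption (dt of one frame); the combination of centroid distance and L1 histogram distance into a single score, and all names, are simplifying assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <limits>
#include <vector>

struct Track {
    cv::Point2f centroid;   // blob center in the previous frame
    cv::Point2f velocity;   // motion vector from frame-to-frame matches
    cv::Mat hist;           // normalized color histogram (same binning everywhere)
};

void updateTracks(std::vector<Track>& tracks, const std::vector<Track>& detections) {
    for (Track& t : tracks) {
        // Equations of motion with dt = 1 frame: predicted position = x + v.
        cv::Point2f predicted = t.centroid + t.velocity;

        double best = std::numeric_limits<double>::max();
        const Track* match = nullptr;
        for (const Track& d : detections) {
            double dPos  = std::hypot(predicted.x - d.centroid.x,
                                      predicted.y - d.centroid.y);
            double dHist = cv::norm(t.hist - d.hist, cv::NORM_L1); // feature term
            double score = dPos + dHist;         // naive combined distance
            if (score < best) { best = score; match = &d; }
        }
        if (match) {                             // update track with best detection
            t.velocity = match->centroid - t.centroid;
            t.centroid = match->centroid;
            t.hist     = match->hist.clone();
        }
    }
}
```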

3.4 Detecting Events and Behavior for Telecare

For our initial work, we have considered a limited domain of events to detect, namely arm gestures and body positions (upright or horizontal), in order to detect falls. These two cases are used to address anomalous behavior or simple help signals for the elderly in their home environments.

Fig. 13. Frames over time from a video sequence.

Our analysis is based upon comparing histogram moment distributions, through the normalized difference of the histograms as well as the normalized difference of the moments of each histogram:

Hist(H_i, H_j) = Σ |H_i − H_j|    (5)

MHist(H_i, H_j) = Σ |M_i − M_j|    (6)

In order to test our histogram discrimination technique, video events were recorded with very simple arm gestures, as shown in Figure 14. The foreground object was subtracted from the background by the methods previously described. The blob image is then eroded in order to obtain the principal color space histograms. Figure 14 shows the detected rectangular blobs after subtracting the background mask.

Fig. 14. Simple histogram results for obtaining moments for detecting arm gestures.

For each of the histograms obtained in Figure 14, and for the histograms of Figure 15, statistical moments are calculated and normalized histograms (normalized both by bins and by the total number of points) are obtained. Clustering can then be performed (similar to that of Figure 11) by calculating the normed difference MHist(H_i, H_j).

Fig. 15. Basic histogram technique used for discriminating body position. The inset image shows the color space normalized to unity.

The histograms are obtained by summing all the pixels in the vertical direction. This is performed by dividing the image into equal-sized vertical stripes and summing over all nonzero pixels within each stripe for each color channel, as sketched below.
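A minimal sketch of this vertical projection and of the moment computation used in the next subsection follows; for simplicity it operates on the binary blob mask rather than on each color channel, and the stripe count is an assumed parameter.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Sum the nonzero pixels of a binary blob mask inside equal-width vertical
// stripes, then normalize the bins so they sum to one.
std::vector<double> verticalHistogram(const cv::Mat& blobMask, int stripes) {
    std::vector<double> h(stripes, 0.0);
    int stripeWidth = std::max(blobMask.cols / stripes, 1);
    double total = 0.0;
    for (int s = 0; s < stripes; ++s) {
        int x0 = s * stripeWidth;
        if (x0 >= blobMask.cols) break;          // mask narrower than stripe count
        int w = std::min(stripeWidth, blobMask.cols - x0);
        h[s] = cv::countNonZero(blobMask(cv::Rect(x0, 0, w, blobMask.rows)));
        total += h[s];
    }
    for (double& v : h) v /= std::max(total, 1.0);
    return h;
}

// Mean, standard deviation, and skewness of a normalized histogram, with the
// bin position rescaled to [0, 1]; these are the moments compared in Table 2.
void histogramMoments(const std::vector<double>& h,
                      double& mean, double& stddev, double& skew) {
    const int n = static_cast<int>(h.size());
    mean = stddev = skew = 0.0;
    for (int i = 0; i < n; ++i) mean += h[i] * (i / double(n));
    for (int i = 0; i < n; ++i) {
        double d = i / double(n) - mean;
        stddev += h[i] * d * d;                  // variance accumulator
        skew   += h[i] * d * d * d;              // third central moment
    }
    stddev = std::sqrt(stddev);
    skew = (stddev > 0.0) ? skew / (stddev * stddev * stddev) : 0.0;
}
```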

3.5 Event Discrimination

Video sequences were recorded with typical actions and events that we would like to detect, such as a person falling to the floor. Figure 16 shows three video frames taken from our application, where we have detected and tracked a foreground object, in this case a person as he falls to the floor.

Fig. 16. Video sequence of person falling to the floor.



From each frame in the sequence, we calculated the histogram moments in the horizontal and vertical directions, H_x and H_y respectively, in order to characterize body position. The results of comparing H_y (the vertical histogram) for two body positions, representing the extremes of the video sequence, are shown in Figure 15. The inset of Figure 15 shows the histograms normalized along the x-axis to the total number of pixels.
In order to automatically discriminate the different body positions, we can compare the moments of the obtained histograms. Figure 17 and Table 2 show the results of different moments for the frames shown in Figures 14 and 15. Figure 14b (the central histogram) demonstrates highly peaked first and second moments, which are therefore useful for discriminating it from the other two cases, although cases 14a and 14c are not discriminated by any moment. For Figure 15, the difference in the distributions of the first and second moments is highly pronounced, and thus discrimination of the two cases is easily obtained from the simple moment analysis.

Fig. 17. Comparison of different histogram moments obtained from the video frames studied in Figure 14 and Figure 15.

Fig 14 (a) Fig 14 (b) Fig 14 (c) Fig 15 (a) Fig 15 (b)
Mean 0.52 0.33 0.45 0.57 0.49
Standard Deviation 0.21 0.16 0.2 0.18 0.24
Skewness 0.18 3.85 3.12 0.05 1.5
Table 2. Results of different moments for the frames shown in Figures 14 and 15.

4 Experimental Results and Discussion


All the experimental tests and development were performed on a standard PC with an Intel Pentium D CPU at 2.80 GHz and 2 GB of RAM, running the Ubuntu 9.10 Linux operating system. Videos and images were obtained from a webcam with a 1.3-megapixel resolution. For testing the system, many different video sequences were recorded using different lighting situations and object characteristics, of which a subset of the results has been shown in the examples of the previous section.
The performance of the algorithm is highly dependent upon the amount of processing that can be performed from frame to frame. In all cases, we execute the foreground segmentation step on every frame with no reduction in the original frame rate of 30 fps. We do not perform blob clustering on every frame, since it would be costly, but instead recalculate clusters only when there are ambiguities (such as track crossings). Our histogram method for determining human body position is also costly, and we are therefore experimenting with the optimal frame interval (such as every 25 frames) at which we can still detect relevant body positions in real time.
Figure 18 shows, on a logarithmic scale, the timing results of the algorithms on a 30 fps video of 12 seconds duration. The blue line represents normal video reproduction, the magenta line is the video playing in our system without processing, the red line represents the foreground segmentation, and the green line adds the time for processing blob clustering between each frame. The quantitative times are shown in Table 3.

Fig. 18. Time of the different video reproductions.

As shown in the previous section, the results of Figure 17 demonstrate that we can use statistical moment comparisons of histograms in order to discriminate between simple body positions.

          Bg-Fg Segmentation  Blob Detection  Normal Video  Video with Qt
Frame 1         28.3 ms          168.5 ms        33.2 ms        2.5 ms
Frame 30       847.5 ms         5065.4 ms       997.2 ms       75.8 ms
Frame 360    10198.2 ms        60954.1 ms      12000 ms       912.4 ms
Table 3. Timing results of the algorithms.

However, this method is not robust for discriminating between Figures 14a and 14c, where the histograms are similar yet the body positions are quite different. We are presently improving the basic idea by using the information from both H_x and H_y with a Fisher criterion.
Thus, we have found that although our simple histogram technique for human body position works well for some cases of interest and is easy to implement, it is not sufficiently robust. Because of its simplicity, however, we are presently improving the technique while at the same time investigating other model-based algorithms described in the literature, such as matching geometric shapes to body parts, as described in [15]. This has advantages both for the foreground segmentation problem and for determining body position.

5 Conclusions

In this master thesis we have described preliminary work and algorithms for a software system which shall allow us to automatically track people and discriminate basic human motion events. This system is currently part of a more complete tele-monitoring system under development by our group. The complete telecare system shall include additional information from sensors, providing complete information about a patient in their home. In this thesis, however, we have restricted the study to video algorithms that shall allow us to identify body positions (standing, lying, and bending), in order to ultimately translate this information from a low-level signal to a higher semantic level.
The thesis provides encouraging results and opens many possibilities for future study. In particular, in the field of foreground/background segmentation, the quantitative comparison we described is an effective methodology which can be used to optimize the parameters of each model. While the feature-based tracking used in this thesis is rudimentary, a future study could combine this information with modern sequential Monte Carlo methods in order to obtain more robust tracking. Finally, while the histogram model developed in this thesis provides detection for a limited set of actions and events, it is a fast real-time method, similar to motion detection systems presently employed commercially, and it should have utility in real systems.

References
1. A. Albiol, C. Sandoval, V. Naranjo, J.M. Mossi, Robust motion detector for video
surveillance applications. Departamento de comunicaciones, Universidad Politécnica
de Valencia (2009)

2. Alexandru O. Bălan and Leonid Sigal and Michael J. Black, A Quantitative Evalu-
ation of Video-based 3D Person Tracking, International Workshop on Performance
Evaluation of Tracking and Surveillance, pp. 349-356 (2005)
3. T. Botsis, G. Hartvigsen, Current status and future perspectives in telecare for elderly people suffering from chronic diseases. Journal of Telemedicine and Telecare, Vol 14, pp. 195-203 (2008)
4. G. Bradski, A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV
Library. O’Reilly Media, (2008)
5. Robert T. Collins, Yanxi Liu, Marius Leordeanu, Online Selection of Discriminative
Tracking Features. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, pp. 1631-1643 (2005)
6. Ahmed Elgammal and Ramani Duraiswami and David Harwood and Larry S. Davis
and R. Duraiswami and D. Harwood, Background and Foreground Modeling Using
Nonparametric Kernel Density for Visual Surveillance. Proceedings of the IEEE,
pp. 1151-1163 (2002)
7. Markus Enzweiler, Dariu M. Gavrila, Monocular pedestrian detection: survey and
experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), Vol 31(12), pp. 2179–2195 (2009)
8. P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan, Object detection with
discriminatively trained part based models. IEEE Transactions on Pattern Analysis
and Machine Intelligence, (2009)
9. Pedro F. Felzenszwalb and Daniel P. Huttenlocher, Efficient Matching of Pictorial
Structures. Proc. IEEE Computer Vision and Pattern Recognition Conf. Vol 2, pp.
66-73 (2000)
10. Daniel Freedman, Active Contours for Tracking Distributions. IEEE Trans. Image
Processing, Vol 13, pp. 518-526 (2004)
11. G.L. Foresti, L. Marcenaro, C.S. Regazzoni, Automatic detection and indexing of
video-event shots for surveillance applications. IEEE Transactions on multimedia,
Vol 4, No 4 (2002)
12. David A. Forsyth and Okan Arikan and Deva Ramanan, Computational Studies
of Human Motion: Part 1, Tracking and Motion Synthesis. Foundations and Trends
in Computer Graphics and Vision, (2006)
13. Tarak Gandhi, Mohan M. Trivedi, Pedestrian protection systems: issues, survey,
and challenges, IEEE Transactions On Intelligent Transportation Systems, Vol 8(3),
pp. 413–430 (2007)
14. I. Gómez-Conde, D. N. Olivieri, X.A. Vila, L. Rodríguez-Liñares, Smart Telecare Video Monitoring for Anomalous Event Detection. Proceedings of the 5th Iberian Conference on Information Systems and Technologies (CISTI 2010), Vol 1, pp. 384-389 (Jun. 2010)
15. F. van der Heijden, R.P.W. Duin, D. de Ridder, D.M.J., Tax, Classification, Pa-
rameter estimation, and State Estimation, An engineering approach using Matlab.
John Wiley and Sons Ltd, West Sussex, England (2004)
16. Chih-Chiang Chen, Jun-Wei Hsieh, Yung-Tai Hsu, and Chuan-Yu Huang, Segmen-
tation of Human Body Parts Using Deformable Triangulation. In: IEEE Interna-
tional Conference on Pattern Recognition, Vol 1, pp. 355–358 (2006)
17. W. Hu, T. Tan, L. Wang, S. Maybank, A Survey on Visual Surveillance of Object
Motion and Behavior. IEEE Trans. Systems, Man, and cybernetics - Part C. Vol
34, No 3, pp. 334-352 (2004)
18. P. KaewTraKulPong and R. Bowden, An improved adaptive background mixture
model for real-time tracking with shadow detection. In Proceedings of the 2nd Eu-
ropean Workshop on Advanced Video-Based Surveillance Systems (2001)

19. Thanarat H. Chalidabhongse, David Harwood, Larry Davis, Real-time foreground-


background segmentation using codebook model. Real-Time Imaging, Vol 11, Issue
3, Special Issue on Video Object Processing, pp 172-185 (2005)
20. T. Lee and T. Höllerer, Hybrid Feature Tracking and User Interaction for Mark-
erless Augmented Reality, Proc. IEEE Conf. Virtual Reality (VR ’08), pp. 145-152,
(2008)
21. M. K. Leung and Y. Yang, First Sight: A Human Body Outline Labeling System.
IEEE Trans. Pattern Anal. Mach. Intell. Vol 17, No 4, pp. 359-377, (Apr. 1995)
22. Arno Lepisk, The use of Optic Flow within Background Subtraction. Master’s
Degree Project, Royal Institute of Technology of Stockholm, Sweden, (2005)
23. A. J. Lipton, H. Fujiyoshi, R.S. Patil, Moving target classification and tracking
from real-time video. Proc. IEEE Workshop Applications of Computer Vision, pp
8-14 (1998)
24. W. Long and Y. Yang, Stationary background generation: an alternative to the
difference of two images. Pattern Recogn. Vol 23, No 12, pp. 1351-1359, (Nov. 1990)
25. T. Martin, Fuzzy Ambient Intelligence in Home Telecare. Lecture Notes in Com-
puter Science, Vol 4031, pp. 12-13 (2006)
26. E. Meeds, D. A. Ross, R. S. Zemel, and S. T. Roweis, Learning stick-figure models using nonparametric Bayesian priors over trees. In IEEE Conference on Computer Vision and Pattern Recognition (2008)
27. Anurag Mittal, Nikos Paragios, Motion-Based Background Subtraction Using
Adaptive Kernel Density Estimation. Computer Vision and Pattern Recognition,
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’04), pp. 302-309 (2004)
28. Thomas B. Moeslund, Adrian Hilton, Volker Kruger, A survey of advances in
vision-based human motion capture and analysis, Computer Vision and Image Un-
derstanding, Vol 104, Issues 2-3, Special Issue on Modeling People: Vision-based
understanding of a person’s shape, appearance, movement and behaviour, pp. 90-
126 (2006)
29. T. Olson, F. Brill, Moving object detection and event recognition algorithms for
smart cameras. Proc DARPA Image Understanding Workshop, pp. 159-175 (1997)
30. J. Pons, J. Prades-Nebot, A. Albiol, J. Molina, Fast motion detection in the com-
pressed domain for video surveillance. IEE Electronics Letters, Vol 38, pp. 409-411
(2002)
31. Ronald Poppe, A Survey on Vision-based Human Action Recognition. Image and Vision Computing, accepted and pending publication (2010)
32. Chris Stauffer and W. Eric and W. Eric L. Grimson, Learning Patterns of Activity
Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol 22, pp.747-757 (2000)
33. M.A. Valero, An Implementation Framework for Smart Home Telecare Services.
Future Generation Communication and Networking (FGCN), Vol 2, pp. 60-65 (2007)
34. Zifu Wei, Duyan Bi, Shan Gao, Jianjun Xu, Contour Tracking Based on Online
Feature Selection and Dynamic Neighbor Region Fast Level Set. Fifth International
Conference on Image and Graphics, pp. 238-243 (2009)
35. Zoran Zivkovic, Ferdinand van der Heijden, Efficient adaptive density estimation
per image pixel for the task of background subtraction, Pattern Recognition Letters,
Vol 27, Issue 7, pp. 773-780 (2006)
36. N. Zouba, Multi-sensor Analysis for Everyday Activity Monitoring. 4th Int. Con-
ference: Sciences of Electronic, Technologies of Information and Telecommunications
(SETIT) (2007)

37. United Nations, Department of Economic and Social Affairs Population Division,
http://www.un.org/esa/population/
38. NationMaster, a massive central demographic data source,
http://www.nationmaster.com/index.php
39. Listing of computer vision software, CMU site: http://www.cs.cmu.edu/~cil/v-source.html
