
Pattern Recognition, Vol. 29, No. 8, pp. 1421-1428, 1996
Elsevier Science Ltd
Copyright © 1996 Pattern Recognition Society
Printed in Great Britain. All rights reserved
0031-3203/96 $15.00 + .00

Pergamon

0031-3203(95)00163-8

A SYSTEM FOR COUNTING PEOPLE IN VIDEO IMAGES
USING NEURAL NETWORKS TO IDENTIFY
THE BACKGROUND SCENE
A. J. SCHOFIELD,* P. A. MEHTA and T. J. STONHAM
Department of Electrical Engineering and Electronics, Brunel University, Uxbridge UB8 3PH, U.K.

(Received 26 January 1995; in revised form 7 November 1995; received for publication 24 November 1995)
Abstract--A method for counting the number of people in any pre-defined scene is described. The method
has three distinct stages: image pre-processing, background identification and object search. The method was
designed to provide accurate counts, even when the background scene was allowed to vary. This tolerance to
changes in the background scene was achieved using RAM-based neural network classifiers to identify
sections of the background scene in each test image. The system was implemented on relatively low cost
hardware and was found to give good results at moderately high frame rates. Copyright © 1996 Pattern
Recognition Society. Published by Elsevier Science Ltd.
People counting    Neural network    RAM-based classifier    Background identification    Serial search

1. INTRODUCTION

1.1. The task

The system described in this paper was designed to
address the task of automatically counting people in
complex and variable scenes. Images were captured
using a CCD video camera and then analysed to
determine the number of people present. The background scene (that is, the set of all images of the scene
containing no people) was free to vary in a number of
ways, including variations in lighting levels and patterns, and the occasional movement of objects, such as
doors, that might appear in the scene. The background
scene was also complex in that it could contain a
number of objects with a variety of reflectance properties which might compound the effects of variations in
lighting.
Figures 1(a) and (b) show two example images of the
same background scene under different lighting conditions. This type of background scene is typical of the
scenes that the system was designed to deal with. These
images were taken from a ceiling mounted camera,
fitted with a wide angle lens, looking down on the
scene. In one image the room lights are off while in the
other they are on, resulting in both a general improvement in contrast and strong reflections on the floor.
Figure 1(c) shows the same scene with some people
present, the task of the people counting system being to
count the number of people in such images.

*Address for correspondence: Dr A. J. Schofield, School
of Psychology, University of Birmingham, Edgbaston,
Birmingham, B15 2TT.

1.2. Applications

The principal application of the system described
here is to count the number of passengers using and
waiting to use the lifts in large buildings so as to
improve the service provided by the lifts by reducing
the waiting and travelling times of the passengers. For
this application, plan views, like those used in this
report, are quite appropriate since the space available
in lift cars and lift lobby areas is limited and the
overhead view maximizes the field of view and minimizes the number of occlusions. So, Chan, Kuok and
Liu(1) describe a people-counting system for use with
lift systems that is somewhat similar to that described
here. Their system relies on image subtraction alone at
the background removal stage and hence is less likely
to be able to deal with changes in the background than
the neural network approach described here.
Other potential application areas include pedestrian flow surveys, crowd control, security access control and building usage surveys. It is clear, however,
that some of these applications might require cameras
to be sited such that they give a panoramic view, and
the method has not been tested in such situations.
1.3. Object search versus background identification
The problem of counting the number of people in an
image is an instance of the more general problem of
counting objects. In order to count objects their presence must be registered or detected. This does not
imply that the objects need to be recognized as individual items or that their locations be specified, rather the
pattern recognition system is simply required to know
that some object(s) are present that do not normally
occur in the empty background scene and be able to

Fig. 1. (a) An example background scene with room lights on.
(b) The same scene with room lights off. Notice the strong
reflections on the floor in (b) and the general improvement in
contrast when the lights are on. (c) The same scene with seven
occupants.

estimate the number of individual objects. This problem might be tackled in two ways: object search, and
background identification.
Object search involves searching the image looking
for shapes that correspond to some archetypal (or model)
object shape (for example, see the paper by Sullivan(2)).
Although objects need not match the archetype exactly
to be counted, such methods perform best when the

objects of interest have a relatively stable shape, that is,


when objects of the same type are similar and individual objects are non-deformable. People are not easily
modelled because they represent a heterogeneous
class. Moreover, people are not rigid objects and hence
consecutive images of the same person, taken from the
same relative camera position, can differ considerably.
Another approach for detecting objects involves
searching for key features that are common to objects
in the target class. For example, movement has been
used as a feature for detecting people.(3-5) However, the
search for moving objects may not be appropriate
when the people to be counted are standing still, such
as when they are in a queue.
Background identification involves detecting all
parts of the image that conform to the patterns found
in the normal background scene when no people are
present. Those parts of the image that are not part of
the background are either assumed to be objects of
interest (in the current case people) and can be counted
or may be compared with some object description
prior to being added to the count. The identification of
background regions facilitates both the segmentation
of images into background and foreground regions
(figure/ground segmentation), and the removal of the
background image (background removal). For the
purposes of the system described here background
removal and figure/ground segmentation would be
just as useful to the counting process as background
identification alone and hence solutions to these tasks
must be considered.
The two methods for detecting objects (object search
and background identification) are best suited to
different situations. Object search is most effective
when the objects are less variable than the background
image. Background identification works best when the
background is less variable than the objects. In the case
of people counting, while the background is free to vary,
it is less variable than the people and hence background identification was chosen as the more appropriate method. Various approaches to background
identification are considered in the next section.
1.4. Background identification methods
One method for background removal is to subtract
a typical (previously stored) background image from
each image under test. This method has been applied
with some success to scenes with a simple, uniform and
fixed background; (`*)here an estimate of the number of
people was obtained by totalling the number of pixels
that were not part of the background scene. If, however, the background scene varies considerably over
time (see for example Fig. 1), then it is possible for an
image to contain no objects (people) and yet be very
different from the stored background image. Image
subtraction using a single background image (without
any further processing) cannot deal with the sort of
variations found in background scenes like that of
Fig. 1.
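The limitation can be illustrated with a minimal sketch in Python with NumPy. This is not code from any of the cited systems; the function name `count_changed_pixels`, the `tolerance` parameter and the synthetic 96 x 72 images are assumptions made for the example.

```python
import numpy as np

def count_changed_pixels(test_image, background, tolerance=0):
    """Count pixels whose grey level differs from the stored background.

    `tolerance` is a hypothetical parameter: differences up to this size
    are treated as unchanged.
    """
    diff = np.abs(test_image.astype(np.int32) - background.astype(np.int32))
    return int(np.count_nonzero(diff > tolerance))

# Synthetic stored background: a uniform grey scene at 96 x 72 pixels.
background = np.full((72, 96), 100, dtype=np.uint8)

# Case 1: same lighting, one small bright object (4 x 4 pixels).
with_object = background.copy()
with_object[10:14, 20:24] = 200

# Case 2: no objects at all, but every pixel is 30 grey levels brighter.
lights_changed = (background.astype(np.int32) + 30).astype(np.uint8)

object_pixels = count_changed_pixels(with_object, background)
lighting_pixels = count_changed_pixels(lights_changed, background)
```

With a fixed background the object is isolated correctly, but the empty scene under different lighting differs from the stored image at every pixel, so a count based on changed pixels alone would be badly wrong.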


Background removal methods that can deal with
variations in the background scene have been proposed. Many such systems attempt to generate an
average background image based on a number of
example images. Such systems have the disadvantage
that relatively uncommon versions of the background
can become 'buried' in the average image. Also, the
average background image will be slightly different
from each of the contributing images and hence there is
scope for detecting spurious objects in individual
examples of the background scene. Kilger and Dietl,(6)
among others,(7-9) have proposed adaptive methods
for generating a background template from a continuous stream of input images including images containing non-background objects. While these systems are
very effective they still result in a single background
template which may differ from the actual background
images in some unspecified way; this is especially true if
the background scene is subject to rapid variations.
Most of these methods involve a time constant for
updating the background images which must be sufficiently low to allow changes in the background scene
to be incorporated into the average background yet
high enough to ensure that people are not included in
the background even if they remain stationary in the
scene for some time. Further, such systems always
result in a single background image which may underrepresent certain background features which might
then be erroneously counted.
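The trade-off in choosing the time constant can be sketched with a generic exponential running average. This is an illustrative stand-in, not the specific update rule of any of the cited systems; the names `update_background` and `absorb`, the rate `alpha` (roughly the reciprocal of the time constant, in frames) and the grey levels are assumptions.

```python
import numpy as np

def update_background(background, frame, alpha):
    """Blend the new frame into the background template at rate alpha (0..1)."""
    return (1.0 - alpha) * background + alpha * frame

def absorb(alpha, n_frames=10):
    """Template value under a stationary person after n_frames updates."""
    template = np.full((72, 96), 100.0)   # empty scene at grey level 100
    frame = template.copy()
    frame[30:40, 40:50] = 200.0           # a person standing still
    for _ in range(n_frames):
        template = update_background(template, frame, alpha)
    return float(template[35, 45])

fast = absorb(alpha=0.5)    # short time constant
slow = absorb(alpha=0.01)   # long time constant
```

After ten frames the fast template has almost completely absorbed the stationary person (its value is close to 200), whereas the slow template has barely moved from 100 and would be correspondingly slow to follow a genuine change in the background.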
In the system described here RAM (random access
memory)-based neural networks(10,11) were used to
create a generalized representation of the background
scene that incorporated features from many background images. The networks operated on the difference images obtained from a subtraction process and
were designed to counter the limitations of image
subtraction as a method for background removal.
RAM-based networks are somewhat like look-up tables
that can be used to record the occurrence of certain
patterns during an initial training phase, but unlike
look-up tables, they are able to generalize such that
a test pattern need not be exactly like any of the
training examples to be recognized by a network once
training is complete. Further, these networks will recognize new combinations of the original training patterns.
Unlike an average image, the representation formed
by RAM-based networks gives equal weight to all of
the contributing images and is thus able to simultaneously represent many features at each location of
the image. When test images are compared with the
generalized representation, areas of background scene
can be identified. The representation preserves all of
the relevant information from the many background
images reducing the risk of detecting spurious objects.
The major drawback of the RAM-based approach
used here is that it requires binary images. This implies
a thresholding operation which introduces the risk of
information loss early in the counting process. While
methods for inputting grey-scale images into RAM-based
networks have been proposed,(11) they were not
deemed appropriate for use in this instance.
1.5. Overview
The people-counting algorithm can be split into
three distinct stages: pre-processing, background
identification and object counting. The pre-processing
stage was concerned with thresholding and resolution
reduction; the operation of this stage is discussed in
Section 2. The background identification stage is discussed in Section 3 and the object counting stage in
Section 4. Issues relating to the implementation of
a prototype system are discussed in Section 5 and the
performance of the resulting system is discussed in
Section 6. Comments regarding further work are made
in Section 7.
2. IMAGE PRE-PROCESSING

2.1. Image sub-sampling


It is often assumed that high-resolution images are
required for pattern analysis and recognition systems.
However, it is frequently the case that sufficient information can be obtained from images with surprisingly low
resolution. Initially the people-counting method described below was applied to medium resolution images (362 x 268). Once the algorithm had been tested
and found adequate for the people counting task it was
tested on low-resolution images and was found to
work equally well when the resolution was reduced to
just 96 x 72 pixels. This reduction in resolution was
achieved by sub-sampling the input images. This subsampling took place before any image processing operations and resulted in a significant reduction in the
overall processing time.
2.2. Threshold
It was noted earlier that the RAM-based neural
networks used in this project required binary images as
input. Thresholding is a process that requires some
care because it removes much information from the
image. A simple fixed threshold at (say) grey level 128
was not considered appropriate since this method
takes no account of variations in ambient lighting
levels. An adaptive global threshold, where the threshold level is a function of the mean grey level of the
image, would have dealt with global changes in light
levels, but would not have been able to deal with
non-uniform changes in lighting.
Non-uniform changes in illumination can be dealt
with using an adaptive local threshold. Here the grey
level of each pixel is compared with the average grey
level of the pixels in its immediate neighbourhood. For
small neighbourhoods local adaptive thresholding is
similar to edge detection. A system using this type of
adaptive local thresholding has already been described.(12) Due to the differential nature of edge detection, this system tended to increase the noise level of the


RAM-based neural networks were originally conceived of and constructed as hardware devices comprising arrays of hardwired random access memories.
Perhaps as a consequence of this history RAM-based
networks are most easily described in terms of a hardware implementation, as is the case below. However,
the emergence of fast serial computers has allowed
software implementation of these networks. Despite
the hardware description given here the neural network classifiers used in this project were implemented
entirely in software.
3.2. Construction and training
Figure 3 shows a schematic diagram of the background removal stage. The pre-processed images were
divided into 4 x 4 pixel sections. The pixel values from
each section were applied to a single RAM-based
neural network classifier. The structure of each classifier
was identical to that shown inside the upper insert of
Fig. 3. Each unit in each classifier can be thought of as
a small 1 bit deep RAM with four address lines, containing 16 (1-bit) memory locations. Unlike conventional
memory systems the RAMs were not used to store the
image data. Instead the pixels of the input image were
used as addresses for the RAMs. Groups of 4 pixels
were selected, at random, from the image sections
and together formed the address for one unit. This
random mapping of pixels to address lines was a one-to-one
mapping and remained constant throughout
the lifetime of the classifier.
The classifiers were trained as follows. Background
images were applied to the address lines of the RAMs.
A logic "1" was written into each RAM at the addressed location. Another image was applied, generating a new set of addresses, and another "1" written into
each RAM. This process was repeated for a number of
training examples. Training images were only shown
once because once a "1" had been written into a location it could not be overwritten by a "0".
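The construction and training procedure can be sketched in software as follows. This is an illustrative sketch, not the original implementation: the class name `RamClassifier`, its methods and the use of NumPy are assumptions, and the figure of four units per section is inferred from a one-to-one mapping of the 16 pixels of a 4 x 4 section onto units with four address lines each.

```python
import numpy as np

class RamClassifier:
    """One RAM-based classifier for a single 4 x 4 binary image section."""

    def __init__(self, n_pixels=16, n_inputs=4, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random one-to-one mapping of pixels onto address lines:
        # row u lists the pixel indices wired to unit u.
        self.mapping = rng.permutation(n_pixels).reshape(-1, n_inputs)
        n_units = self.mapping.shape[0]
        # Each unit is a 1-bit-deep RAM with 2**n_inputs locations, all "0".
        self.rams = np.zeros((n_units, 2 ** n_inputs), dtype=np.uint8)
        # Bit weights used to turn four pixel values into one address.
        self._weights = 1 << np.arange(n_inputs)

    def addresses(self, section):
        """Return the address presented to each unit by a binary section."""
        bits = np.asarray(section, dtype=np.int64).ravel()[self.mapping]
        return bits @ self._weights

    def train(self, section):
        """Write a "1" into every unit at the location addressed by section."""
        self.rams[np.arange(len(self.rams)), self.addresses(section)] = 1

clf = RamClassifier()
blank = np.zeros((4, 4), dtype=np.uint8)
clf.train(blank)
after_first = int(clf.rams.sum())   # one "1" per unit
clf.train(blank)                    # repeating an example changes nothing
after_repeat = int(clf.rams.sum())
clf.train(np.ones((4, 4), dtype=np.uint8))
after_second_pattern = int(clf.rams.sum())
```

Because a stored "1" is never cleared, presenting the same example twice leaves the memories unchanged, which is why a single pass through the training set is sufficient.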
3.3. Image testing/background detection
Once training was complete images were tested as
follows. Each test image was applied to the address
lines of the RAMs and the contents of each memory
read out on the data-out lines. Whenever an individual
RAM-unit was presented with an address that it had
seen during training then it output a logic "1"; if the
address had not been seen during training the RAM
output a "0". Most image sections produced a mixture
of 1s and 0s on the data-out lines, but when an image
section was similar to the corresponding section in the
background scene there were more 1s than 0s. Adding
together all of the outputs for each image section
generated section scores that were high for image
sections that were similar to the background and low
for sections that were part of a non-background object.
The output scores from the classifiers were then inverted such that the sections of the image that were
least like the background produced the highest scores.
The adjusted scores were then used to build up a section scores image in which sections corresponding to
the background were shaded dark and those corresponding to part of a non-background object were
shaded light.
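The testing procedure can be sketched for a whole 96 x 72 binary image at once. The code below is an illustration under stated assumptions rather than the original software: the function names, the array layout (one bank of RAMs per section), the use of the same random mapping for every section and the count of four units per section are all choices made for the example.

```python
import numpy as np

SECTION = 4                                  # section side in pixels
N_INPUTS = 4                                 # address lines per RAM unit
N_UNITS = SECTION * SECTION // N_INPUTS      # units per section (assumed)

def section_addresses(binary_image, mapping):
    """Return an array (rows, cols, N_UNITS) of RAM addresses."""
    h, w = binary_image.shape
    rows, cols = h // SECTION, w // SECTION
    # Cut the image into flattened 4 x 4 sections: (rows, cols, 16).
    sections = (binary_image[:rows * SECTION, :cols * SECTION]
                .reshape(rows, SECTION, cols, SECTION)
                .transpose(0, 2, 1, 3)
                .reshape(rows, cols, SECTION * SECTION)
                .astype(np.int64))
    bits = sections[:, :, mapping]           # (rows, cols, N_UNITS, N_INPUTS)
    return bits @ (1 << np.arange(N_INPUTS))

def train(rams, binary_image, mapping):
    """Write a 1 at every addressed location (in place)."""
    addr = section_addresses(binary_image, mapping)
    r, c, u = np.indices(addr.shape)
    rams[r, c, u, addr] = 1

def section_scores(rams, binary_image, mapping):
    """Inverted scores: high where a section is unlike the background."""
    addr = section_addresses(binary_image, mapping)
    r, c, u = np.indices(addr.shape)
    recognised = rams[r, c, u, addr].sum(axis=2, dtype=np.int64)
    return N_UNITS - recognised

rng = np.random.default_rng(0)
mapping = rng.permutation(SECTION * SECTION).reshape(N_UNITS, N_INPUTS)
rams = np.zeros((18, 24, N_UNITS, 2 ** N_INPUTS), dtype=np.uint8)

empty = np.zeros((72, 96), dtype=np.uint8)   # thresholded empty scene
train(rams, empty, mapping)

test_image = empty.copy()
test_image[20:28, 40:48] = 1                 # an 8 x 8 non-background blob
scores = section_scores(rams, test_image, mapping)
```

In this toy case the four sections covered by the blob receive the maximum inverted score and every other section scores zero, giving a small section scores image with one bright region.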
Figure 4 shows an example section scores image.
The neural networks had previously been trained on
a number of images of the same scene under a variety of
lighting conditions, but with no people present. The
test image, which contained seven people, is shown in
Fig. 1(c).


Fig. 3. Schematic diagram of the background identification stage.


images. Further, the method tended to remove all but


the edges of objects within the scene and this tended to
discourage the subsequent object counting stage from
locating the true centres of the people being counted,
causing it to occasionally count the gap between two
people.
The current system used an alternative thresholding
method that was less prone to noise and which preserved information from the centres of objects as well as
their edges. A single example background image was
stored. This image was then subtracted from each test
image and the absolute values found (this step is
similar to background removal by subtraction; however, it was not responsible for detecting background
objects, but was intended only as a pre-processor for
the RAM-based networks). The resulting difference
image was a grey-scale image. The mean grey level of
this image was estimated from a random sample of
2000 pixels. The difference image was then binarized
according to a threshold grey level corresponding to
the estimated average plus an offset. 2000 samples
corresponds to ca 1/3 of the total number of pixels in
the difference image and the resulting estimated mean
values were considered good estimates of the true
mean but required less time to calculate. Linking the
threshold level to the mean grey level (as opposed to
using a fixed level) gave the system a certain immunity
to changes in overall illumination. The offset applied
to the threshold helped to increase the overall noise
immunity of the system by causing it to ignore small
fluctuations in pixel values due to noise on the video
signal. The offset was determined by selecting the
minimum value that would result in a blank thresholded image for input images that were identical to the
stored image except for noise. In the case of the
example sequence discussed here an offset of 10 was
found to be appropriate.
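The thresholding step described above can be sketched as follows. This is a hedged illustration rather than the original code: the function name `binarize_difference`, sampling without replacement, the strict "greater than" comparison and the synthetic images are assumptions; the sample size of 2000 and the offset of 10 are the values quoted in the text.

```python
import numpy as np

def binarize_difference(test_image, stored_background, offset=10,
                        n_samples=2000, rng=None):
    """Binarize |test - stored| at (sampled mean grey level + offset)."""
    rng = np.random.default_rng() if rng is None else rng
    diff = np.abs(test_image.astype(np.int32)
                  - stored_background.astype(np.int32))
    # Estimate the mean grey level from a random sample of pixels.
    sample = rng.choice(diff.ravel(), size=min(n_samples, diff.size),
                        replace=False)
    threshold = sample.mean() + offset
    return (diff > threshold).astype(np.uint8)

rng = np.random.default_rng(1)
stored = np.full((72, 96), 100, dtype=np.int32)

# Same scene plus small video noise (at most 3 grey levels).
noisy = stored + rng.integers(-3, 4, size=stored.shape)

# Uniformly brighter scene: every pixel raised by 40 grey levels.
brighter = stored + 40

# Noisy scene containing a 10 x 10 object 80 grey levels brighter.
occupied = noisy.copy()
occupied[30:40, 40:50] += 80

blank_for_noise = int(binarize_difference(noisy, stored, rng=rng).sum())
blank_for_lighting = int(binarize_difference(brighter, stored, rng=rng).sum())
object_pixels = int(binarize_difference(occupied, stored, rng=rng).sum())
```

The noise-only and uniformly brighter images both produce blank binary images, while the object survives thresholding intact. Non-uniform lighting changes would still appear as apparent objects, which is why the binary image is passed on to the neural networks rather than counted directly.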
Whenever the image under test was similar to the
background template a completely blank image was
presented to the background identification system,
greatly easing its task. When the lighting conditions
changed the resulting thresholded image contained
apparent objects corresponding to the differences in
the two background images [see, for example,
Fig. 2(a)]. When people were present in the scene they
also appeared as bright blobs in the thresholded image
[see Fig. 2(b)].
As was noted in Section 1, image subtraction cannot
in itself provide a basis for people counting because
it cannot distinguish between changes due to people
and changes due to lighting and normal background
variations. Due to this limitation the thresholded
image was not used as the basis for the count. Rather,
it was further processed using the neural network
technique described in Section 3. The purpose of
the neural networks was to discriminate between
normal background variations (including lighting
changes) and variations due to non-background objects, thus the limitations of the subtraction method
were overcome.

Fig. 2. The effects of resolution reduction and thresholding.
(a) A thresholded image of the background scene with the
lights off when the image used in the subtraction process had
the lights on. (b) The result of thresholding the image of
Fig. 1(c).

3. BACKGROUND IDENTIFICATION
3.1. Introduction to RAM-based neural network classifiers
The background removal method described here
utilized RAM-based neural networks that were conceptually similar in structure to the neural network
classifiers used in the WISARD pattern recognition
system.(10,11) RAM-based networks have two advantages over other neural network architectures: they can
be trained using examples of a single training class and
they are fully trained after a single pass through the
training set. In the current application the RAM-based
networks were trained on examples of the background
images only (that is no examples of the scene including
people were presented to the network during training),
yet the system was able to discriminate between parts
of the background scene and non-background objects.
This is important since it would not have been practically possible to train a neural network on a representative set of all non-background scenes.


for crowded images than it is for images containing


only a few people the minimum distance was reduced
as the count for each image increased. For the scene
used throughout this paper the minimum distance was
set to 5.5 and reduced toward 4.5 for crowded images.
The brightness threshold for these images was ca 12%
of the maximum possible section score. For plan views,
the only calibration required was to determine base
values for the minimum distance and brightness
thresholds.
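The serial search for peaks described in Section 4 can be sketched as below. This is an illustration only: the function names are hypothetical, a 3 x 3 binomial filter stands in for the Gaussian filter (whose width is not given in the text), and the minimum distance is held fixed rather than reduced as the count rises.

```python
import numpy as np

def smooth(img):
    """3 x 3 binomial filter, a small stand-in for the Gaussian filter."""
    k = (0.25, 0.5, 0.25)
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    h = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * h[:-2, :] + k[1] * h[1:-1, :] + k[2] * h[2:, :]

def serial_peak_search(scores, threshold, min_dist):
    """Return the locations of peaks that are bright and well separated."""
    img = smooth(scores)
    peaks = []
    for _ in range(img.size):                # at most one pass per section
        r, c = np.unravel_index(np.argmax(img), img.shape)
        if img[r, c] < threshold:            # brightest remaining is too dim
            break
        img[r, c] = -np.inf                  # "set to black": never re-picked
        # Count the spot only if it is far enough from every recorded peak
        # (Euclidean distance, measured in sections).
        if all(np.hypot(r - pr, c - pc) >= min_dist for pr, pc in peaks):
            peaks.append((int(r), int(c)))
    return peaks

# Toy section scores image (18 x 24 sections) with two bright regions.
scores = np.zeros((18, 24))
scores[4:6, 5:7] = 4
scores[12:14, 15:17] = 4

# Threshold at 12% of the maximum score, minimum distance of 5.5 sections.
peaks = serial_peak_search(scores, threshold=0.12 * 4, min_dist=5.5)
count = len(peaks)
```

Each bright region contributes several above-threshold sections, but only the first one found in each region is counted; the rest are blacked out because they lie within the minimum distance of a recorded peak.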
5. IMPLEMENTATION ISSUES
Fig. 4. Example section score image. The image was produced by processing the image of Fig. 1(c) using the preprocessing outlined in Section 2 and the background
identification method outlined in this Section. The neural
network classifiers had previously been trained on multiple
examples of the background scene under a variety of lighting
conditions. The room lights were on in the background image
used for the thresholding stage.

4. PEOPLE COUNTING BY SERIAL SEARCH FOR PEAKS IN THE SECTION SCORES IMAGE

Having produced the section score images described


above these images were then inspected to determine
the number of people in each image. Three methods
were considered: finding the total score for each image,
grouping sections into person-sized regions and detecting bright spots or peaks in the section score images
(peak detection). The total score method was found to
be unreliable because the score was not linearly related
to the number of people in the scene and the nature of
the non-linearity varied from scene to scene. The
grouping method has been implemented with some
success,(12) but was found to be difficult to calibrate for
changes in camera viewpoint and lens type. The peak
detection method described here was no more accurate
than the grouping method, but required less effort
during calibration.
The section score images were first smoothed using
a Gaussian filter. Next the brightest section in the
image was found and, provided that it was above
a detection threshold, its location was noted and the
count incremented to 1. This section was then set to
black and a search made for the next brightest section.
If this section was above the detection threshold it was
set to black. If it was also sufficiently far from the first
section then its location was recorded and the count
incremented. Searching continued until the brightest
remaining section was below threshold. At each step
the count was only incremented if the new bright spot
was sufficiently far from all the previously recorded
points. The distance between sections was calculated
according to the Euclidean distance measure. The
minimum acceptable distance between peaks was set
to equal the average distance between the centres of
people in the image. Since this distance tends to be less

The people counting method outlined in Sections 2,
3 and 4 was implemented on a dedicated image processing card fitted with a single Inmos T805 transputer. The input resolution of this device was
768 x 576 when grabbing images from a PAL format
source. The input images were sub-sampled at a rate of
1 pixel (line) in 8 to produce images at the operating
resolution of 96 x 72. All parts of the algorithm were
implemented in software on the T805, which was also
responsible for controlling the frame grabber hardware and a graphical user interface through which an
operator could pass instructions to the system. The
time taken to process each image was found to be
dependent on the number of people present, but was
normally in the range 0.7-1.0 s. Once training and
calibration were complete the system was capable of
continuous, unsupervised, operation. The accuracy of
the system is discussed in Section 6.
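The sub-sampling step amounts to keeping one pixel in eight along each axis. A minimal sketch follows; the function name `subsample` and the choice of which pixel within each group of eight is kept (here the first) are assumptions, since the text does not specify them.

```python
import numpy as np

def subsample(frame, step=8):
    """Keep every `step`-th pixel on every `step`-th line."""
    return frame[::step, ::step]

# A PAL-resolution frame stored as (lines, pixels per line).
frame = np.zeros((576, 768), dtype=np.uint8)
small = subsample(frame)
operating_resolution = small.shape   # (lines, pixels): 72 lines of 96 pixels
```

Because the result is obtained by strided indexing no arithmetic is performed on the pixel values, which is consistent with the low cost of this stage relative to the rest of the processing.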
6. RESULTS

Figure 5 shows the results of the people-counting


process. These results were obtained from the demonstration system processing a previously recorded video
tape. The system had previously been instructed to
acquire a background image for the thresholding stage
and trained to recognize variations in the background
scene. The images used in this report were extracted
from this video footage. The mean estimated number
of people was calculated for each actual number of
people and is plotted in Fig. 5. The standard deviations about these mean values are also shown.
The actual number of people was supplied by the
operator who used a "mouse" connected to the transputer hardware to increment (decrement) the count
whenever a person entered (left) the scene. To aid the
operator in this task the people taking part in the
video were instructed to enter (leave) the scene one at
a time.
The mean estimated values were all close to the
actual values and the standard deviations indicate that
the estimated count was normally within 1 person of
the actual count. For images containing few people the
system made very few errors. It must also be remembered that there is some scope for operator errors in
the supply of actual counts, although the operator (the
first author) was well practised at this task.


Fig. 5. Mean estimated number of people versus actual number of people.

7. FURTHER WORK AND CONCLUDING REMARKS

7.1. Tracking movements


The peak detection method for counting described
above automatically provided an indication of the
location of each peak in the section scores image and,
therefore, each person in the scene. Although the current system cannot track individuals, the positional
information could be used to aid the tracking of the
movements of individuals around the scene. This could
in turn be used to determine the actions of people
moving through the scene. For example, it might also
be used to discriminate between those boarding a lift
and those alighting. People tracking would also increase the utility of the system as a surveillance tool
and data gathering instrument.
7.2. Training and calibration
The algorithm described here was designed to be
generally applicable to a variety of environments, provided the camera is sited so as to give a plan view. The
number of adjustable parameters has been minimized
so as to reduce the effort required to calibrate the
system for each situation. The main requirements of
the system are that it should be allowed to view the
background scene so that background images can be
acquired and the neural networks trained. Since RAM-based neural networks can be trained on a single pass
through the training set, this initial training period
need not be very long.
The minimum distance and brightness thresholds
(see Section 4) must also be adjusted in relation to the
height of the camera from the floor and the type of lens

in use. In the on-line prototype the user is able to adjust


these parameters, via a keyboard, so as to tune the
system for best results.
8. CONCLUSIONS
In conclusion, RAM-based neural networks can be
used as part of a general-purpose background identification method that can deal with poorly constrained
background scenes. This algorithm can be used to
facilitate the location and counting of objects in the
scene. The method does not require high resolution
imagery and hence processing is fast even using modest
hardware. RAM-based networks are particularly
suited to this task, because they can be trained on
examples of the background scene only. Training is
fast and does not require multiple iterations through
the training set. When applied to the task of people
counting the method was found to produce reasonably
accurate results, even when large variations in the
background scene were allowed.

REFERENCES

1. A. T. P. So, W. L. Chan, H. S. Kuok and S. K. Liu, A computer vision based supervisory control system, Elevator Technology 4, Proc. Elevcon 249-258. IAEE, Stockport, U.K. (1992).
2. G. D. Sullivan, Visual interpretation of known objects in constrained scenes, Phil. Trans. R. Soc. London B 337, 361-370 (1992).
3. A. Rouke and M. G. H. Bell, Video image-processing techniques and their application to pedestrian data collection, Research report No. 83. Transport Operations Research Group, University of Newcastle upon Tyne (1992).
4. S. A. Velastin, J. H. Yin, A. C. Davies, M. A. Vicencio-Silva, R. E. Allsop and A. Penn, Analysis of crowd movements and densities in built-up environments using image processing, Proc. IEE Colloquium Image Process. Transport Appl. Digest No: 1993/236 (1993).
5. A. Del Bimbo and P. Nesi, A vision system for estimating people flow, Technical report DSI-RT 15/93. Department of Systems and Informatics, University of Florence, Italy (1993).
6. M. Kilger and T. Dietl, Interpretation-driven low-level parameter adaption in scene analysis, Comput. Aided Syst. Theory, EUROCAST '93, F. Pichler and R. Moreno Diaz, eds, pp. 380-387. Springer-Verlag, Berlin (1993).
7. S. Brofferio and L. Carnimeo, A background updating algorithm for moving object scenes, Time Varying Image Process. Moving Object Recognition 2, 297-307 (1990).
8. K. Karmann and A. von Brandt, Moving object recognition using an adaptive background memory, Time Varying Image Process. Moving Object Recognition 2, 289-297 (1990).
9. N. L. Seed and A. D. Houghton, Background updating for real-time image processing at TV rates, Proc. SPIE 901, 172-180 (1988).
10. I. Aleksander, W. Thomas and P. Bowden, Wisard, a radical step forward in image recognition, Sensor Rev. 4(3), 120-124 (1984).
11. I. Aleksander and T. J. Stonham, Guide to pattern recognition using random-access memories, IEE J. Comput. Digital Techn. 2, 29-40 (1979).
12. A. J. Schofield, T. J. Stonham and P. A. Mehta, Automated people counting using image processing and neural network techniques, Proc. 3rd Int. Conf. Automat. Robot Comput. Vis. Vol. 2, pp. 903-906, Singapore, 9-11 November (1994). Pub. Nanyang Technological University, Singapore.

About the Author--ANDREW SCHOFIELD received his B.Eng. degree in Electronics from Brunel

University in 1990 and his Ph.D. in Neuroscience from Keele University, Staffordshire, U.K., in 1994. His
research interests are in image processing and neural network applications.

About the Author--PRATAP MEHTA is a Reader in the Department of Electrical Engineering and

Electronics at Brunel University, West London, U.K. His research interests include power electronics and
intelligent buildings.

About the Author--JOHN STONHAM is Professor of Neural Systems Engineering in the Department of

Electrical Engineering and Electronics at Brunel University, West London, U.K. He received his B.Sc. degree
in Electronics (1970) M.Sc. in Hybrid Computing (1972) and Ph.D. in Pattern Recognition (1974) from Kent
University, U.K. His research interests are the theory and applications of neural networks, pattern
recognition, image processing and image database management.
