
978-1-4244-2503-7/08/$20.00 © 2008 IEEE

Proceedings of the IEEE
International Conference on Automation and Logistics
Qingdao, China September 2008
Automated People Counting at a Mass Site
Ya-li Hou and Grantham K.H. Pang
Department of Electrical and Electronic Engineering,
Industrial Automation Research Laboratory,
The University of Hong Kong, Pokfulam Road, Hong Kong
{ylhou, gpang }@eee.hku.hk
Abstract - Reliable estimation of the number of people in public
areas is an important problem in visual surveillance. Although much
research on people counting has appeared in recent years, most of it
considers a small crowd without many serious occlusions. Some
methods also impose particular requirements, for example that people
are moving, that the background is smooth, or that the image
resolution is high. This paper aims to estimate the number of people
in a complicated scenario with around one hundred persons at an
outdoor event. Several people counting methods based on crowd
density are considered in order to find the relationship between the
foreground pixels and the number of people in a large crowd. The
best estimation result comes from the method that considers two
types of foreground pixels: those from a relatively stationary
crowd, and those from moving people. In an evaluation of three
developed methods over 51 test cases, the best average error is
around 10%. None of the proposed methods has any special requirement
on the resolution of the input video.
Index Terms - Automated surveillance, people counting, crowd
density, neural network.
I. INTRODUCTION
People counting is a crucial and challenging problem in
visual surveillance. An accurate, real-time estimate of the number
of people in a shopping mall can provide valuable information
for managers making decisions. Automatic monitoring of the
crowd density in public areas is also important for safety
control and urban planning.
In most situations the scene is quite complicated, as
shown in Fig. 1(a). The resolution of the video camera may
not be high enough. People may walk, stand, or just sit and
move only a little once in a while. When the crowd density is
high, there are many serious occlusions. Although there is a
lot of research on people counting, there are few results for
this kind of common event.
This paper aims to find an effective people counting
method to estimate the number of people in a complicated
situation like Fig. 1(a). The estimation is mainly based on the
foreground pixels and some morphological operations. A neural
network is trained to learn the relationship. The methods proposed
in this paper make no particular assumptions about the scene and
can work well even for relatively low-resolution images. It is also
found that the approach is not sensitive to the size of the
structuring element used in the morphological operation.
Related research on people counting is introduced in
Section II. The methods are described in detail in Section III.
Section IV shows the estimation results and analyzes the sources
of estimation error, and Section V concludes the paper.
Fig. 1 There are always foreground pixels even for stationary people.
(a) A typical original scene to be considered; the red rectangle
indicates the region of interest. (b) The foreground pixels extracted
by adaptive background subtraction.
II. RELATED WORK
To date, people counting methods can be classified mainly
into two categories: detection-based methods and map-based
methods.
The first category tries to detect and track individuals in a
video sequence using some prior knowledge of the human body, so
that the total number of people in the scene can then be counted.
Refs. [1-6] present the most recent developments in this field.
Zhao et al. [1, 6] and Rittscher et al. [3] segment the
extracted foreground blobs using prior knowledge of
human shape. These methods assume that there is enough
evidence for each individual in the foreground contour, which
usually requires the crowd density to be low. Since standing or
sitting people show only a few scattered foreground pixels,
these methods can only detect moving people. They also employ
sophisticated algorithms such as MCMC (Markov Chain Monte
Carlo) and EM (Expectation-Maximization), which are
computationally intensive when the number of people in the crowd
is large. Wu et al. [2] rely on edge information to detect
individual human body parts and then combine these parts using
prior knowledge of human shape. This method is applicable to both
moving and stationary people [5]. However, edge extraction is a
time-consuming process, and the edges become quite messy
when the background is complicated and the textures of
human clothes are not smooth. For videos taken by low-resolution
cameras, the method breaks down. Brostow and Cipolla [4] appear to
perform well even in very crowded situations. Their method tracks
feature points in the video and assumes that features from the
same person show similar trajectories. However, the method is
inherently based on motion, so it must be combined with other
methods to count stationary people. In addition, methods in the
first category are rarely used in real-time applications because
of their high computational complexity when the number of people
increases.
Map-based methods try to learn, by training, a statistical
mapping between the number of people and foreground pixels or
some other features. In [7, 8], texture features appear to give an
effective estimate of the crowd density level. However, both
methods require the background to be quite smooth, and neither
was used to estimate the number of people. Davies et al.
[9] argue that there is a linear relationship between
foreground pixels and the number of people when perspective
distortion is trivial and there are few overlaps
among the crowd. Refs. [10] and [11] investigate strategies
for perspective correction in detail. Furthermore, [11] tries to
deal with occlusions using a confidence value and a weighted
median filter. A crowd with serious occlusion is given a low
confidence value for the counted number. By exploiting the
temporal consistency of video, the median filter helps improve the
reliability of the estimated results. Kilambi et al. [12]
avoid occlusion analysis by using accurate
camera calibration, which allows the number of people in denser
groups to be estimated.
Overall, Refs. [10-12] show that methods which relate
foreground pixels to the number of people are quite simple
and can provide a relatively reliable estimate
even under very poor conditions. For groups of up to
10 people, a per-frame accuracy of more than 75% is
achieved, and the shape-based method can perform even better
[12]. However, all the studies so far consider moving
people only, and the largest crowd reported in any of them
appears to consist of about 11 people [12].
III. METHODS OF ESTIMATION
In the situation shown in Fig. 1, the crowd density
is very high and there are many serious occlusions. In addition,
some people walk, some stand, and others sit. The resolution of
the video camera is not high: the images provided to the people
counting algorithms presented in this paper are only 320x240
pixels. It is almost impossible to detect individuals reliably in
the video sequences.
By observing the foreground pixels extracted by
adaptive background subtraction, we find that there are always
some foreground pixels even for stationary (sitting or standing)
people because of their occasional movement, as shown in
Fig. 1(b). Therefore, we try to find the relationship between
the foreground pixels and the number of people, and use it
to estimate the number of people.
An approach to estimating the number of people in
complicated situations is proposed. A block diagram is shown
in Fig. 2. It is usually hard to obtain a good fixed background
because of changes in illumination and movement of the camera,
especially in outdoor environments. To obtain a good foreground
segmentation, an adaptive background estimation method is
adopted [13].
Fig. 2 Block diagram of the approach.
In this section, several methods are developed for
people counting. These methods try to exploit the relationship
between foreground pixels and the number of people. In all
of them, the data is divided into a training set and a test
set.
Method 1) Based on Foreground Pixels
Suppose the number of foreground pixels is denoted by
$X$ and the number of people by $M$. Let $f_1$ express the
relationship between $M$ and $X$, as in (1).

$M = f_1(X)$    (1)

The training set is used for building a neural network to learn
this relationship $f_1$, and the neural network is then used to
estimate the number of people.
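
As an illustration of learning a mapping such as $f_1$, the
following is a minimal sketch of a two-layer feedforward network
(one tanh hidden layer, linear output) trained by plain batch
gradient descent with NumPy. It is not the Matlab network used in
the experiments: the activation, learning rate, number of epochs
and the synthetic training data are all assumptions made for the
example, and the name fit_two_layer_net is hypothetical.

```python
import numpy as np

def fit_two_layer_net(x, y, hidden=10, epochs=3000, lr=0.05, seed=0):
    """Fit y ~ f(x) with one tanh hidden layer and a linear output layer."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).reshape(len(x), -1)   # shape (n, d)
    y = np.asarray(y, dtype=float).reshape(-1, 1)        # shape (n, 1)
    # Standardise inputs and targets so one learning rate suits both layers.
    x_mu, x_sd = x.mean(axis=0), x.std(axis=0) + 1e-12
    y_mu, y_sd = y.mean(), y.std() + 1e-12
    xn, yn = (x - x_mu) / x_sd, (y - y_mu) / y_sd
    w1 = rng.normal(0.0, 0.5, (xn.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, (hidden, 1))
    b2 = np.zeros(1)
    n = len(xn)
    for _ in range(epochs):
        h = np.tanh(xn @ w1 + b1)                 # hidden activations
        err = (h @ w2 + b2) - yn                  # output error
        grad_h = (err @ w2.T) * (1.0 - h ** 2)    # back-propagated error
        w2 -= lr * (h.T @ err) / n
        b2 -= lr * err.mean(axis=0)
        w1 -= lr * (xn.T @ grad_h) / n
        b1 -= lr * grad_h.mean(axis=0)

    def predict(x_new):
        x_new = np.asarray(x_new, dtype=float).reshape(len(x_new), -1)
        h = np.tanh(((x_new - x_mu) / x_sd) @ w1 + b1)
        return ((h @ w2 + b2) * y_sd + y_mu).ravel()

    return predict

# Synthetic data (NOT the data of this paper): pixel counts X, people counts M.
rng = np.random.default_rng(1)
X = rng.uniform(4000, 18000, size=102)
M = 0.01 * X + rng.normal(0.0, 5.0, size=102)

predict = fit_two_layer_net(X, M, hidden=10)
mean_abs_error = float(np.mean(np.abs(predict(X) - M)))
```

The same function accepts a two-column input array, so it could
also be used for a two-feature relationship of the kind considered
in method 3).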
Method 2) Based on Closed Foreground Pixels
It can be noticed that after a closing operation, most of
the areas occupied by people are covered with white pixels, while
other parts are black. An example of the foreground pixel
image and its corresponding closed image is shown in Fig. 3.
Suppose the number of foreground pixels after closing is
denoted by $C$ and the number of people is $M$; then
$M$ and $C$ satisfy (2).

$M = f_2(C)$    (2)

[Fig. 2 block labels: Video input; Adaptive background learning;
Foreground extraction; Processing of the foreground pixels; NN
Training for the relationship between the foreground and number of
people; Estimating the number of people]
Fig. 3 (a) The foreground pixels extracted by adaptive
background subtraction. (b) The image after applying a closing
operation to (a).
The number of pixels in the closed image is input into a neural
network, and the output of the network is compared with the
ground-truth number of people in each scene. The test
samples are then evaluated with the trained neural network.
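
The closing step can be sketched as follows: a binary dilation
followed by a binary erosion with the same disk-shaped structuring
element, after which the white pixels are counted to give $C$. This
is a plain NumPy sketch under stated assumptions: the helper names
(disk, dilate, erode, closing) are hypothetical, the disk is a
simple Euclidean disk that may not match exactly the element used
in the original Matlab experiments, and the toy image uses radius 1
only so that the result is easy to check by hand.

```python
import numpy as np

def disk(radius):
    """Flat disk-shaped structuring element as a boolean array."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return xx ** 2 + yy ** 2 <= radius ** 2

def _shifted_views(mask, selem, pad_value):
    """Yield one shifted copy of the mask per set pixel of the element."""
    before_y, before_x = selem.shape[0] // 2, selem.shape[1] // 2
    after_y = selem.shape[0] - 1 - before_y
    after_x = selem.shape[1] - 1 - before_x
    padded = np.pad(mask, ((before_y, after_y), (before_x, after_x)),
                    constant_values=pad_value)
    h, w = mask.shape
    for dy, dx in zip(*np.nonzero(selem)):
        yield padded[dy:dy + h, dx:dx + w]

def dilate(mask, selem):
    """Binary dilation: set a pixel if the element touches any white pixel."""
    mask = np.asarray(mask, dtype=bool)
    out = np.zeros(mask.shape, dtype=bool)
    for view in _shifted_views(mask, selem, False):
        out |= view
    return out

def erode(mask, selem):
    """Binary erosion: keep a pixel only if the element fits inside the white area."""
    mask = np.asarray(mask, dtype=bool)
    out = np.ones(mask.shape, dtype=bool)
    for view in _shifted_views(mask, selem, True):
        out &= view
    return out

def closing(mask, selem):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask, selem), selem)

# Toy foreground image: a 5x5 blob with a one-pixel hole in the middle.
fg = np.zeros((15, 15), dtype=bool)
fg[5:10, 5:10] = True
fg[7, 7] = False

closed = closing(fg, disk(1))
C = int(closed.sum())          # number of foreground pixels after closing
print(int(fg.sum()), C)        # 24 25: the hole has been filled
```

For a 320x240 foreground mask, the same call with disk(4) or
disk(5) would give the kind of pixel count that is used as the
network input in this method.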
Method 3) Based on Closed Foreground Pixels and Ratio of
Eroded Pixels over Foreground Pixels
From Fig. 4(a), it can be observed that some people
show solid foreground blobs, while others show only some
scattered pixels in the foreground image. The solid blobs
mainly come from moving people, and the scattered pixels
come from the relatively stationary crowd. In fact, the image
after two erosion operations retains most of the solid
blobs plus only a few scattered points (see Fig. 4(b)). Thus,
the ratio of eroded pixels to foreground pixels, referred to as
the ratio $S$ in the text and written $S/X$ in (3), may be used
as a feature of the image. After
analyzing the results of method 2), we find that the estimation
errors could be related to this feature. To get a more accurate
estimate, the ratio $S$ and the closed foreground pixels
$C$ are both considered. As before, a two-layer feedforward
neural network is used to find the relationship $f_3$ between the
number of people and these two features, as expressed in (3).
Both $C$ and the ratio are input to the neural network.

$M = f_3(C, S/X)$    (3)
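
The second feature can be sketched as follows: erode the
foreground mask twice and divide the number of surviving pixels by
the original number of foreground pixels. This is an illustrative
NumPy sketch, not the authors' code. The 2x2 square element follows
the experimental section; the function names, the border handling
(pixels outside the image are treated as white) and the toy image
are assumptions.

```python
import numpy as np

def erode(mask, selem):
    """Binary erosion: keep a pixel only if the element fits inside the white area."""
    mask = np.asarray(mask, dtype=bool)
    selem = np.asarray(selem, dtype=bool)
    before_y, before_x = selem.shape[0] // 2, selem.shape[1] // 2
    after_y = selem.shape[0] - 1 - before_y
    after_x = selem.shape[1] - 1 - before_x
    # Assumption: outside the image counts as white, so borders are not shrunk.
    padded = np.pad(mask, ((before_y, after_y), (before_x, after_x)),
                    constant_values=True)
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    for dy, dx in zip(*np.nonzero(selem)):
        out &= padded[dy:dy + h, dx:dx + w]
    return out

def eroded_ratio(mask, selem, times=2):
    """Return (ratio, eroded pixel count, foreground pixel count)."""
    mask = np.asarray(mask, dtype=bool)
    x_count = int(mask.sum())
    eroded = mask
    for _ in range(times):
        eroded = erode(eroded, selem)
    s_count = int(eroded.sum())
    ratio = s_count / x_count if x_count else 0.0
    return ratio, s_count, x_count

# Toy image: one solid 6x6 blob (a "moving person") plus four
# isolated pixels (a "stationary crowd").
fg = np.zeros((20, 20), dtype=bool)
fg[2:8, 2:8] = True
for y, x in [(12, 3), (15, 10), (10, 16), (17, 17)]:
    fg[y, x] = True

ratio, s_count, x_count = eroded_ratio(fg, np.ones((2, 2), dtype=bool))
print(s_count, x_count, ratio)   # 16 40 0.4
```

The isolated pixels disappear after the first erosion while the
core of the solid blob survives both, which is why the ratio
separates the two kinds of foreground.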
Fig. 4 (a) The foreground pixels extracted by adaptive
background subtraction. (b) The image after applying two erosion
operations to (a).
IV. EXPERIMENTS AND ANALYSIS
Experiments were performed on a 4-hour video taken at a
public event. The frame rate of the video is 10 fps and the
image resolution is 320x240. One frame is taken for
evaluation every 100 seconds, and a total of 153 images were
extracted from the original four hours of video. The ground
truth, which is the actual number of people in the scene, for
each image was obtained manually. In the set of images, the
number of people in the scene ranges from 36 to 222. The
training set consists of 102 images, which were formed by
taking the first two images out of every three consecutive
images. The test set is composed of the remaining 51 images.
Thus, the number of people in both the training set and test set
has a wide range.
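
The sampling and the train/test split described above can be
sketched as follows. The counts (10 fps, one image every 100
seconds, 153 images, two of every three images for training) come
from the text; taking frame 0 as the first sampled image and the
function name are assumptions.

```python
FPS = 10                 # frame rate of the video
SAMPLE_PERIOD_S = 100    # one image every 100 seconds
NUM_IMAGES = 153         # images extracted from the video

# Assumption: the first sampled image is frame 0.
frame_stride = FPS * SAMPLE_PERIOD_S          # 1000 frames between samples
sampled_frames = [i * frame_stride for i in range(NUM_IMAGES)]

def split_two_of_three(items):
    """First two of every three consecutive items go to training, the third to test."""
    train = [x for i, x in enumerate(items) if i % 3 < 2]
    test = [x for i, x in enumerate(items) if i % 3 == 2]
    return train, test

train_frames, test_frames = split_two_of_three(sampled_frames)
print(len(train_frames), len(test_frames))    # 102 51
```

Interleaving the split in this way keeps both sets spread over the
whole recording, which is why both cover a wide range of crowd
sizes.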
In all the experiments, an adaptive background image is
used for foreground segmentation. In each image, the pixels
whose intensities are sufficiently different from the
background are denoted as foreground pixels. The threshold is
30 in the experiments below. All the experiments were
performed in Matlab 7.5.0.
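
The thresholding step can be illustrated with a short NumPy
sketch. The threshold of 30 is the value stated above. Everything
else is an assumption: the images are taken to be grey-level
arrays, and a simple running-average background stands in for the
adaptive background method of [13], which is more elaborate. The
names update_background, foreground_mask and alpha are
hypothetical.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background update (a stand-in for the method of [13])."""
    return (1.0 - alpha) * background + alpha * frame.astype(float)

def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose intensity differs from the background by more than threshold."""
    return np.abs(frame.astype(float) - background) > threshold

# Toy 320x240 example: flat background with one bright 10x10 patch.
background = np.full((240, 320), 100.0)
frame = np.full((240, 320), 100, dtype=np.uint8)
frame[50:60, 80:90] = 200

mask = foreground_mask(frame, background)
X = int(mask.sum())                       # number of foreground pixels
background = update_background(background, frame)
print(X)                                  # 100
```

With a slow update rate, an object that stays still is absorbed
into the background only gradually, and a region that is vacated
keeps producing foreground pixels until the background catches up.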
A. Experiments and Results
All the methods in Section III were tested on the video.
Both training and test results are shown below.
Method 1) Based on Foreground Pixels
A two-layer neural network with 10 neurons in the hidden
layer was trained to find the relationship between the number
of foreground pixels and the number of people. From the
ground truth shown in Fig. 5, it can be seen that it is hard to
find an accurate approximation of $f_1$. Different initial values
for the neural network may lead to different trained functions;
Fig. 5 shows one of them. Using the trained relationship
shown in Fig. 5, the test results are given in Fig. 6. Many
samples have estimation errors of more than 30%.
[Plot: estimated count and ground truth; x-axis: Number of Foreground Pixels (x 10^4); y-axis: Number of People]
Fig. 5 The relationship between number of foreground pixels and
number of people in the training set.
[Plots: (above) ground truth and estimated count, Number of People vs. Sample Sequence; (below) Error Percentage (%) vs. Sample Sequence]
Fig. 6 The test results of method 1). The figure above shows the estimated
count vs. the ground truth, the figure below is the error percentage.
Method 2) Based on Closed Foreground Pixels
Flat disk-shaped structuring elements were used for the closing
operation on the foreground pixels in the experiment. To assess
the sensitivity to the size of the structuring element, elements
with radii of both 4 and 5 pixels were tried in the
closing operation. These structuring elements are called
disk-4 and disk-5 respectively for convenience. A two-layer
neural network with 10 neurons in the hidden layer was
used to approximate the relationship $f_2$. From the training
results, it is found that after a closing operation on the
foreground pixels using either disk-4 or disk-5, the
number of people is almost linearly related to the number of
closed foreground pixels. This is demonstrated in Fig. 7 and
Fig. 9.
Fig. 8 and Fig. 10 show the test results using disk-4 and
disk-5 respectively. Since the relationship between the
number of foreground pixels after closing and the number of
people is much easier to approximate accurately, the estimation
errors are relatively small. Most samples show an error
percentage below 20% for both disk-4 and disk-5.
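
Because the relationship is reported to be almost linear, a
least-squares straight line gives a simple way to picture it. The
sketch below is only an illustration and replaces the neural
network of the experiments with a line fit. The data points are
made-up, noise-free numbers chosen to lie on a line, not
measurements from this paper, and the name estimate_people is
hypothetical.

```python
import numpy as np

# Made-up illustrative data: closed foreground pixel counts and people counts.
closed_pixels = np.array([5000.0, 10000.0, 20000.0, 30000.0, 40000.0])
people = 0.005 * closed_pixels + 10.0

# Least-squares fit of a degree-1 polynomial: people ~ slope * C + intercept.
slope, intercept = np.polyfit(closed_pixels, people, 1)

def estimate_people(c):
    """Estimate the number of people from a closed foreground pixel count."""
    return slope * c + intercept

estimate = float(estimate_people(25000.0))
```

On real data the points scatter around the line, so the residuals
of such a fit give a quick check of how close to linear the
relationship actually is.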
[Plot: estimated count and ground truth; x-axis: Number of Foreground Pixels after Closing Operation (x 10^4); y-axis: Number of People]
Fig. 7 The relationship between number of foreground pixels after closing
(disk-4) and number of people in the training set.
[Plots: (above) ground truth and estimated count, Number of People vs. Sample Sequence; (below) Error Percentage (%) vs. Sample Sequence]
Fig. 8 The test results of method 2) (disk-4). The figure above shows the
estimated count vs. the ground truth, the figure below is the error percentage.
[Plot: estimated count and ground truth; x-axis: Number of Foreground Pixels after Closing Operation (x 10^4); y-axis: Number of People]
Fig. 9 The relationship between number of pixels after closing (disk-5) and
number of people in the training set.
[Plots: (above) ground truth and estimated count, Number of People vs. Sample Sequence; (below) Error Percentage (%) vs. Sample Sequence]
Fig. 10 The test results of method 2) (disk-5). The figure above shows the
estimated count vs. the ground truth, the figure below is the error percentage.
Method 3) Based on Closed Foreground Pixels and Ratio of
Eroded Pixels over Foreground Pixels
In this method, a flat disk-shaped structuring element with
a radius of 4 pixels was used for the closing operation. A square
of 2x2 pixels was used for the erosion operation. Since the
relationship is more complicated, a two-layer neural
network with 50 neurons in the hidden layer is adopted.
The relationship $f_3$ between $C$ (the number of
foreground pixels after closing), $S$ (the ratio of eroded
foreground pixels to foreground pixels) and $M$ (the number
of people) in the training set is shown in Fig. 11. To show the
relationship more clearly, both $C$ and $S$ are sorted in
ascending order. It can be observed that the number of people
is still mainly linearly related to the number of closed
foreground pixels, but is affected slightly by $S$. The estimated
count and the error percentage for each sample in the test set
are shown in Fig. 12.
[Plots: estimated count and ground truth, Number of People vs. (first) Number of Foreground Pixels after Closing Operation (x 10^4) and (second) S/X (Foreground Pixels after Erosion / Total Foreground Pixels)]
Fig. 11 The relationship between $C$ (the number of foreground pixels after
closing), $S/X$ (the ratio of foreground pixels after erosion to total
foreground pixels) and $M$ (the number of people) in the training set.
[Plots: (above) ground truth and estimated count, Number of People vs. Sample Sequence; (below) Error Percent (%) vs. Sample Sequence]
Fig. 12 The test results of method 3). The figure above shows the estimated
count vs. the ground truth, the figure below is the error percentage.

Since the initial values for the neural network are random in
Matlab, each method is tested 10 times. The average
results of all the methods are compared in Table I.
After the closing operation, the relationship between the
foreground pixels and the number of people becomes quite
simple, and the estimation performance improves significantly:
the average error drops from 16.68% for method 1) to 10.55%
and 10.33% for method 2).
Different disk sizes show similar performance. That
is, the linear relationship is not very sensitive to the size
of the disk. However, when the disk is too large, many
areas without people will also be identified as foreground
pixels, which can result in false estimates.
When the two types of foreground pixels are also considered
in method 3), the average estimation error percentage decreases
slightly. The percentage of estimated counts with an error of less
than 15% also increases a little. In fact, after the closing
operation, even scattered foreground pixels form large solid
areas, which reduces the effect of the nature of the crowd.
TABLE I
THE COMPARISON OF RESULTS
Methods              Average      Accuracy (%)        Accuracy (%)
                     error (%)    (estimation within  (estimation within
                                  10% error)          15% error)
Method 1)            16.68        42.55               59.41
Method 2) (disk-4)   10.55        66.47               79.02
Method 2) (disk-5)   10.33        64.12               78.04
Method 3)            10.05        64.9                82.94
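
The three quantities reported in Table I can be computed along
the following lines. This is a sketch under assumptions: the error
percentage of a sample is taken to be the absolute difference
between estimated and true count divided by the true count, and
"within 10%" is read as an error strictly below 10%. The paper
does not spell out either detail, and the sample numbers below are
made up for the example.

```python
def error_percentages(estimated, truth):
    """Per-sample error percentage relative to the ground truth."""
    return [100.0 * abs(e - t) / t for e, t in zip(estimated, truth)]

def summarize(estimated, truth):
    """Average error and share of samples under the 10% and 15% error levels."""
    errors = error_percentages(estimated, truth)
    n = len(errors)
    return {
        "average_error": sum(errors) / n,
        "within_10": 100.0 * sum(e < 10.0 for e in errors) / n,
        "within_15": 100.0 * sum(e < 15.0 for e in errors) / n,
    }

# Made-up example values (not results from this paper).
truth = [100, 100, 200, 50]
estimated = [105, 88, 228, 50]

summary = summarize(estimated, truth)
print(summary)   # {'average_error': 7.75, 'within_10': 50.0, 'within_15': 100.0}
```

Averaging such summaries over repeated training runs would give
figures comparable in form to the rows of the table.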
B. Analysis
One cause of errors in the test set is a sudden decrease
in the number of people. Take the 30th sample in the test set
(see Fig. 13(a)) as an example; the scene 100 seconds earlier is
shown in Fig. 13(b). A large number of people who had been sitting
in the chairs for a long time were learned as background. After
they suddenly moved away, the background had not yet adapted
completely. The resulting false foreground pixels caused large
errors in the estimation. This can be seen at the back of the
sitting area in Fig. 13(a).
The movement of non-human objects also results in
false foreground pixels. At the end of the 4-hour video, the
ground-truth number of people is relatively low, so
the movement of some boxes causes a large error percentage.
Fig. 13 A sudden decrease in the number of people results in false foreground
pixels. (a) The 30th sample in the test set. (b) The scene 100 seconds before (a).
V. CONCLUSIONS
In this experiment, the video camera is far from
the crowd, and therefore the size of people at different
locations is more or less the same. It is not necessary to
correct perspective distortion for this video. If perspective
distortion is obvious, a geometric correction strategy
[10, 11] should be applied before estimation. In addition, our
methods estimate the number of people statistically,
and hence no occlusion analysis is needed.
The scenario analyzed here is quite common in
public areas, yet little research has been carried out on such
scenes. In this paper, the method with the best estimation is
based on two types of foreground pixels: those from
the relatively stationary crowd and those from moving
people. The average estimation error is around 10%, and it is
believed that this can be improved with further investigation.
In the future, the sensitivity of the approach to the
difference threshold for foreground extraction will also be
tested. In addition, prior knowledge about human shape may
be used to remove some non-human objects from the
foreground images.
REFERENCES
[1] T. Zhao, R. Nevatia, and B. Wu, "Segmentation and Tracking of
Multiple Humans in Crowded Environment," accepted by IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2007.
[2] B. Wu and R. Nevatia, "Detection and tracking of multiple,
partially occluded humans by bayesian combination of edgelet
based part detectors," International Journal of Computer Vision,
vol. 75, pp. 247-266, 2007.
[3] J. Rittscher, P. H. Tu, and N. Krahnstoever, "Simultaneous
estimation of segmentation and shape," in Computer Vision and
Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, 2005, pp. 486-493 vol. 2.
[4] G. J. Brostow and R. Cipolla, "Unsupervised Bayesian Detection
of Independent Motion in Crowds," in Computer Vision and
Pattern Recognition, 2006 IEEE Computer Society Conference on,
2006, pp. 594-601.
[5] B. Wu and R. Nevatia, "Tracking of Multiple Humans in
Meetings," in Computer Vision and Pattern Recognition Workshop,
2006 Conference on, 2006, pp. 143-143.
[6] T. Zhao and R. Nevatia, "Tracking multiple humans in complex
situations," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 26, pp. 1208-1221, 2004.
[7] A. N. Marana, L. da F. Costa, R. A. Lotufo, and S. A. Velastin,
"Estimating crowd density with Minkowski fractal dimension," in
Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 3521-3524.
[8] H. Rahmalan, M. S. Nixon, and J. N. Carter, "On crowd density
estimation for surveillance," in The Institution of Engineering and
Technology Conference on Crime and Security, 2006.
[9] A. C. Davies, Y. Jia Hong, and S. A. Velastin, "Crowd monitoring
using image processing," Electronics & Communication
Engineering Journal, vol. 7, pp. 37-47, 1995.
[10] R. Ma, L. Li, W. Huang, and Q. Tian, "On pixel count based crowd
density estimation for visual surveillance," in IEEE Conference
Cybernetics and Intelligent Systems. vol. 1, 2004, pp. 170-173.
[11] H. Çelik, A. Hanjalic, and E. A. Hendriks, "Towards a robust
solution to people counting," in IEEE International Conference on
Image Processing, 2006.
[12] P. Kilambi, O. Masoud, and N. Papanikolopoulos, "Crowd analysis
at mass transit site," in IEEE Intelligent Transportation Systems
Conference, 2006.
[13] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity
using real-time tracking," IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 22, pp. 747-757, 2000.