

Counting Crowds and Lines


19 Nov 2017

Updated with video footage of the CUHK Mall Dataset:


The ML and site for this post can be found at countingcompany.com.

In Union Square, NYC, there’s an untoppable burger joint named Shake Shack that’s always crowded. A group of us would obsessively check the Shake Cam around lunch to figure out if the trip was worth it.

14-person line, not bad


Rather than do this manually (come on, it’s nearly 2018), it would be great if this could be done for us. Then, taking that idea further, imagine being able to measure foot traffic on a month-to-month basis, or to measure the impact of a new promotional campaign.

Object detection has received a lot of attention in the deep learning space, but it’s ill-suited for highly congested scenes like crowds. In this post, I’ll talk about how I implemented a multi-scale convolutional neural network (CNN) for crowd and line counting.

Why not object detection?


Region-based CNNs (R-CNNs) slide a window across the image to find objects. High-density crowds are ill-suited to sliding windows due to heavy occlusion:

Failed attempt with an off-the-shelf (no retraining) TensorFlow R-CNN
Further exploration of this approach led me to TensorBox, but it too had issues with high congestion and large crowd counts.

Density Maps to the rescue


Rather than sliding a window, density maps (aka heat maps) estimate the likelihood of a head being at each location:
Crowd photo from the UCF Dataset
3406 vs 3408? Pretty close!

What’s happening here?

Based on the paper “Multi-scale Convolutional Neural Network for Crowd Counting”, the ground truth is generated by taking the head annotations, setting those pixel values to one, and then Gaussian blurring the image. The model is then trained to output these blurred images, or density maps. The sum of all the pixels in the map then gives the crowd count prediction. Read the paper for more insight.
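Here’s a minimal sketch of that ground-truth generation, assuming head annotations come as (x, y) pixel coordinates; the fixed blur radius is my assumption (the exact kernel width is an implementation detail):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, height, width, sigma=4.0):
    """Turn head annotations into a ground-truth density map.

    head_points: (x, y) pixel coordinates, one per head.
    sigma: blur radius in pixels; a fixed value is an assumption here.
    """
    canvas = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:
        canvas[int(y), int(x)] = 1.0  # one unit of mass per head
    # Gaussian blurring spreads each head out but preserves total mass,
    # so the map still sums (approximately) to the crowd count.
    return gaussian_filter(canvas, sigma=sigma)


heads = [(120, 80), (130, 85), (300, 200)]  # toy annotations
dmap = density_map(heads, height=240, width=320)
print(dmap.sum())  # ~3.0
```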

Let’s look at density maps applied to the Shake Cam. Don’t worry about the color switch from blue to white for the density maps.

The sum of the pixel values is the size of the crowd


As you can see above, we have:

1. The annotated image, courtesy of AWS Mechanical Turk.
2. The calculated ground truth, made by setting head locations to one and then Gaussian blurring.
3. The model’s prediction after being trained on the ground truths.

How to get the images?


From your neighborhood Shake Shack Cam, of course.
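Under the hood that’s just an HTTP fetch of the latest camera frame. A sketch, with a placeholder URL since the real endpoint isn’t given here:

```python
from io import BytesIO

import requests
from PIL import Image

SHAKE_CAM_URL = "https://example.com/shakecam.jpg"  # placeholder, not the real endpoint

def fetch_frame():
    """Download the latest Shake Cam frame as an RGB image."""
    resp = requests.get(SHAKE_CAM_URL, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")
```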

How to annotate the data?


The tried-and-true AWS Mechanical Turk, with a twist: a single mouse click annotates a head, as shown below.

I went ahead and modified the bbox-annotator to be a single-click head annotator.

How to count the line?


Lines aren’t merely people in a certain space; they are people standing next to each other to form a contiguous collection. As of now, I simply feed the density map into a three-layer fully connected (FC) network that outputs a single number: the line count.
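The post doesn’t give the exact layer sizes, so the ones below are made up; this is only a sketch of the idea in Keras: flatten the density map and regress a single scalar.

```python
import tensorflow as tf

# Hypothetical input size: density maps resized to 60x80 before this head.
line_counter = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(60, 80)),  # density map in
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                       # line count out
])
line_counter.compile(optimizer="adam", loss="mse")
# line_counter.fit(density_maps, line_counts, ...) with the Turk-labeled data
```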

Gathering data for that also ended up being a task on AWS Mechanical Turk.

Here are some examples where lines aren’t immediately obvious:

Making a product out of data science
This is all good fun on your development box, but how do you host it? That will be a topic for another blog post, but the short story is:

1. Make sure it doesn’t look bad! Thanks to the design work done by Steve @ thoughtmerchants.com.
2. Use Vue.js and d3 to visualize the line count.
3. Create a Docker image with your static assets and Conda dependencies.
4. Deploy to GCP with Kubernetes on Google Container Engine.
5. Periodically run a background job to scrape the Shake Cam image and run a prediction (sketched below).
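A bare-bones version of that background job might look like this; the scheduling loop and the predict stub are assumptions (the real thing runs as a job on Container Engine and calls the trained model), and it reuses fetch_frame from the earlier snippet:

```python
import time

def predict_line_count(frame):
    """Stand-in for the real pipeline: density-map CNN, then the FC line counter."""
    return 0  # replace with actual model inference

def run_forever(interval_seconds=60):
    while True:
        frame = fetch_frame()  # from the fetch sketch above
        count = predict_line_count(frame)
        print("predicted line count:", count)
        time.sleep(interval_seconds)
```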
I did the extra credit step of having a Rails application
interact with the ML service via gRPC, while integration
testing with PyCall. Not necessary, but I’m very happy
with the setup.

Unexpected Challenges
The following challenges have contributed to erroneous line predictions:

1. Umbrellas. Not a head, but still a person.
2. Shadows. Around noon there can be some strong shadows resembling people.
3. Winter Darkness. It gets much darker much sooner in November and December, yet the model was trained predominantly on images of people in daylight.
4. Winter Snow. The training data never had snow, and now we have mistakes like this:

As I discover more of these scenarios, I’ll know what data to gather for retraining the model.
