
Convolutional Neural Network based Face detection

Subham Mukherjee
Institute of Engineering & Management, Kolkata
E-mail: subhammukherjee61196@gmail.com

Ayan Das
Institute of Engineering & Management, Kolkata
E-mail: das.ayan.iem@gmail.com

Sumalya Saha
Institute of Engineering & Management, Kolkata
E-mail: saha.papu.sumalya@gmail.com

Ayan Kumar Bhunia
Institute of Engineering & Management, Kolkata
E-mail: ayanbhunia007@gmail.com

Sounak Lahiri
Institute of Engineering & Management, Kolkata
E-mail: sounaklahiri01@gmail.com

Aishik Konwer
Institute of Engineering & Management, Kolkata
E-mail: konweraishik@gmail.com

Arindam Chakraborty
Institute of Engineering & Management, Kolkata
E-mail: arindam.chakraborty@iemcal.com

Abstract—In this paper, we discuss and analyze two different paradigms of techniques for face detection in images containing humans. We discuss the formulation of both methods, i.e., using hand-crafted features followed by training a simple classifier, and an entirely modern approach of learning features from data using a neural network. We discuss the theoretical advantages of the special kind of neural network we used, i.e., the Convolutional Neural Network. Lastly, we ran both methods on the FDDB face detection dataset and provide some qualitative results.

Keywords—Hand-crafted feature; Neural Network; Face detection; Convolutional

I. INTRODUCTION

From the very moment when digitization of data became possible, the need to process digitized data increased as well. The amount of data acquired by various devices has kept increasing over time. One of the most important categories of data is images and videos, which became quite popular because of their expressive power. But, being one of the most complex types of data, images turned their processing into a huge field of research. The application from which the field received most of its attention is 'security'. For security applications, researchers started to investigate methods by which human presence in images can be determined efficiently. 'Face detection', which is presented in this paper, is one of the components of a rigorous security system. In the early days of 'Image Processing', manual techniques were used to detect faces in a given image. Hand-crafted features were extracted and classified with a so-called 'sliding window classifier', which used SVMs or decision trees as classifiers. As the 'Computer Vision' field started to emerge, more modern statistical techniques with 'Advanced Machine Learning' took over from the old methodologies for face detection. In this paper, we present a qualitative analysis of both the old and modern methods for face detection.

II. RELATED WORK

In classical computer vision, detection or localization was a challenging task. In [1], geometric and colour invariance has been implemented for the detection of both regular and irregular objects with the help of C-SIFT descriptors. Here, points of interest are detected at the extrema of a difference-of-Gaussian pyramid of the input image to construct the feature descriptors. In [5] and [11], normalized histograms of oriented gradients (R-HOG & C-HOG) along with a linear SVM as a filter have been used to detect human beings. In [6], deep learning has been utilised for generic visual recognition using a Python framework to train the models easily. Here an open-source convolutional model is implemented for extracting features and producing feature visualisations, with comparisons to LLC and GIST. In [8], convolutional networks are used with a different approach where non-saturating neurons are used to decrease training time. In order to reduce over-fitting, regularization is applied through data augmentation and dropout. In [4], a joint cascade method is used for face detection, where face alignment is combined with detection in the same cascade network. In [10], an AdaBoost-based algorithm is used which selects critical features from large sets by combining increasingly complex classifiers in a cascade.

The paper is organized as follows. Section I introduces the problem and the corresponding issues. Section II discusses an important portion of the literature regarding face detection. Section III elaborates the two paradigms in detail, i.e., section III-A describes the old method whereas III-B describes the modern method of solving the problem. Section IV presents the dataset used and qualitative results, and lastly a conclusion is drawn.

III. METHODOLOGIES

Detection of objects, humans or parts of the human body was considered a challenging problem in the field of Computer Vision (CV) in its initial stage. But with the help of modern computer vision algorithms and deep learning

978-1-5386-1703-8/17/$31.00 © 2017 IEEE
models, significant improvements in performance have been reported.

Classical computer vision (CV) algorithms used to follow a relatively restricted pipeline. In classical CV, fixed features were extracted from images and fed into a relatively simple (mostly linear) classifier such as a linear SVM or a decision tree. Tasks like localization or detection were carried out similarly with an additional sliding-window concept, where an image is divided into smaller regions and the regular pipeline is then followed on every such region. Usually these sliding windows are square or rectangular, and a stride is used between consecutive windows. An illustration of the sliding-window approach is shown in figure 1.

Fig. 1. Sliding-window approach in the context of Face detection/localization

In modern CV, a more flexible pipeline is followed for face detection. The modern pipeline is used not only for face detection but for any localization task. For processing a raw image, neural networks are used. Neural networks are complex signal processing devices which can be tuned to behave as the application requires. Neural networks are loosely inspired by the human brain. Neural networks, if properly structured, can handle complex high-dimensional signals. In this paper, we discuss one neural network architecture suitable for face localization tasks.

A. Using Hand-crafted features

Features are mathematical representations of objects to be considered for use by Machine learning algorithms (classifiers/regressors). Features are high-dimensional vectors (column matrices) which represent an object as a point in a complex high-dimensional space. They are usually denoted as $X \in \mathbb{R}^d$, where $\mathbb{R}$ is the set of real numbers and the vector is $d$-dimensional. The choice of feature is up to the application at hand. Usually, to select a proper feature, we need domain knowledge of the application.

The entire framework of hand-crafted features is as follows: An image $I$ is divided into small fixed-size (usually 32x32 or 64x64) overlapping patches $\{I_i\}_{i=1}^{N}$ with a fixed stride. A previously chosen feature extractor function (say $f(\cdot)$) is applied to extract a feature from a single image patch $I_i$, i.e., the extracted feature is $f_i = f(I_i)$. The next step is to train a classifier to decide whether a given feature $f_i$ is a face or not. A dataset with proper labels is to be used for training. Formally, the dataset has to be a set of two-tuples: an image patch ($I_i$) and the corresponding binary label ($L_i$) indicating whether it is a face or not. The labels are not annotated patch by patch; rather, a binary mask ($I_b$) is created out of an image, with white color at the positions where a face is present and black elsewhere. Then the label $L_i$ is obtained as

$$L_i = \begin{cases} 1, & \text{if } I(\mathrm{position}(I_{b_i})) \text{ contains a face} \\ 0, & \text{otherwise} \end{cases}$$

There is a practical problem with the sliding-window approach, which we discuss below:

In natural images with humans and faces, the size of the humans/faces can vary largely. The size of the object creates a problem in possibly all major CV detection/localization problems. A standard way of handling the problem is to use image pyramids [3], which are basically different zoom levels of an image. We obtain $k$ pyramid levels and treat each of them separately in the described framework. This simple method provides scale invariance.

A crucial point of the entire framework is the choice of the feature extractor. There are some features which have shown performance benefits on benchmark datasets and algorithms. Some of them are discussed below:

1) HOG [5]: The main idea behind the Histogram of Oriented Gradients descriptor is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. The image is divided into small connected regions called cells, and for the pixels within each cell, a histogram of gradient directions is compiled. The descriptor is the concatenation of these histograms. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing.

2) LBP [2]: In the LBP descriptor for texture classification, the occurrences of the LBP codes in an image are collected into a histogram. The classification is then performed by computing simple histogram similarities. However, a similar approach for facial image representation results in a loss of spatial information, and therefore one should codify the texture information while also retaining its location.
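The LBP histogram described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [2]: the 3x3, 8-neighbour variant, the bit ordering, the `>=` comparison, the non-overlapping cell grid and all function names (`lbp_codes`, `lbp_histogram`, `regional_lbp`) are assumptions made here. Plain Python lists are used to keep the sketch dependency-free.

```python
def lbp_codes(gray):
    """Return 8-neighbour LBP codes for the interior pixels of a 2-D list."""
    # Neighbour offsets (dy, dx), clockwise from top-left; one bit each.
    # The ordering is an arbitrary choice made for this sketch.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(gray), len(gray[0])
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if gray[y + dy][x + dx] >= gray[y][x]:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_histogram(gray):
    """Normalised 256-bin histogram of the LBP codes of one patch."""
    hist = [0.0] * 256
    count = 0
    for row in lbp_codes(gray):
        for code in row:
            hist[code] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist


def regional_lbp(gray, cell):
    """Concatenate per-cell histograms so that coarse location is kept."""
    descriptor = []
    for top in range(0, len(gray) - cell + 1, cell):
        for left in range(0, len(gray[0]) - cell + 1, cell):
            patch = [r[left:left + cell] for r in gray[top:top + cell]]
            descriptor.extend(lbp_histogram(patch))
    return descriptor
```

A single `lbp_histogram` over the whole image discards where each code occurred, whereas `regional_lbp` keeps one histogram per cell, so the position of a cell inside the descriptor encodes its location in the image.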
One way to retain this spatial information is to use the LBP texture descriptors to build several local descriptions of the face and combine them into a global description. Such local descriptions have been gaining interest lately, which is understandable given the limitations of holistic representations. These local-feature-based methods are more robust against variations in pose or illumination than holistic methods.

B. Using Deep Learning

In the setup of modern computer vision, machine learning is used as an integral part of the framework. The neural network, described earlier, is the most vital component of deep learning based systems. The modern era of deep learning revolves entirely around neural networks. Neural networks, in a rough sense, are mathematical devices that represent data in a hierarchical organization. In deep learning terminology, these hierarchies are called 'Layers'. The number of such layers is called the depth of a network, hence the name 'Deep' learning. The literature reports that a larger number of layers increases representation power but comes at the cost of a more difficult training phase. Research is ongoing in this field to understand this trade-off.

1) Neural Network: Let's assume the data to be classified is $X \in \mathbb{R}^d$, where $\mathbb{R}$ is the set of real numbers and $d$ is the dimension. Let's also assume the dataset comprising data and labels ($y \in C$) is

$$D \triangleq \{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$$

where $N$ is the total number of samples in the dataset. The basic principle of a neural network is to transform the input $X$ linearly and apply a proper non-linearity to produce the desired output. Mathematically, the non-linear transform of the input is as follows:

$$\hat{y}^{(i)} = f_{l_3}(f_{l_2}(f_{l_1}(X^{(i)}; W_{l_1}); W_{l_2}); W_{l_3})$$

where $f_{l_i}$ is a transfer function for layer $l_i$, which in its most general form can be further decomposed as

$$f_{l_i}(X^{(i)}; W_{l_i}) = f_{non\text{-}lin}({X^{(i)}}^{T} W_{l_i})$$

where $f_{non\text{-}lin}$ is a non-linear function like sigmoid(), tanh() etc. The number of layers is a hyper-parameter which can be chosen by guess or by cross-validation.

Fig. 2. General neural network

For training the network with our dataset $D$, we use the well-known gradient-descent (GD) algorithm. It is the simplest algorithm in terms of theoretical analysis. The idea behind GD is to reduce a predefined loss function by moving the model weights/parameters in the direction in which the loss decreases most steeply. The loss is defined as

$$loss(D) \triangleq \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)}, y^{(i)})$$

where

$$l(X^{(i)}, y^{(i)}) = \| y^{(i)} - \hat{y}^{(i)} \|_2^2$$

Usually, the squared-error loss is used as the per-sample loss $l(X^{(i)}, y^{(i)})$.

2) Conv-Net for Face detection: The convolutional neural network is a special family of neural networks designed [9] specifically to handle image or 2D spatial data. The architecture (refer to figure 4) of a convolutional network is inspired by the human visual perception system. A ConvNet is composed mainly of 3 different kinds of layers, namely the most important 'Convolutional layers', optional and configurable 'Pooling layers' and a 'Fully Connected layer'. The fully connected layer can potentially be an entire multi-layer perceptron (MLP).

Fig. 4. Convolutional Neural Network

But the architecture described above is not the only architecture that can be used. As per application requirements, a different structure is used for face detection or any object localization purpose. As described in section III-A, the network is composed only of convolutional layers and optional pooling layers. The target of the network should be to produce the output shown in figure 3. As the target output of the network has to be a binary mask, the last layer is proposed to consist of a sigmoid non-linearity because

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + \exp(-x)} \in (0, 1)$$

In theory, if the stated architecture is trained on data like that in figure 3, the different layers (mainly the higher convolutional layers) will be trained as face detectors.

Fig. 3. Input-Output of the ConvNet

IV. DATASET AND RESULTS

The comparative studies described in this paper have been evaluated qualitatively. Below are the details of the dataset and some results of our experimentation.

A. Dataset

The dataset used is the "Face Detection Dataset and Benchmark (FDDB)" [7], which comprises 5171 faces in a set of 2845 images taken from the Faces in the Wild (FTW) data set. FDDB images have accurate annotations by human experts, which play a great role in supervised learning. We used the annotations to produce binary face masks (as shown in figure 3), which have been used in the training of the specific architecture described in section III-B2.
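A minimal sketch of how such a binary face mask could be rasterised from an elliptical annotation is given below. It assumes each face is annotated as an ellipse given by (major-axis radius, minor-axis radius, angle in radians, centre x, centre y), with the angle measured from the horizontal axis; this tuple layout and the function names `ellipse_mask` and `face_mask` are assumptions made for illustration, and the paper does not state the exact procedure used.

```python
import math


def ellipse_mask(height, width, major, minor, angle, cx, cy):
    """Rasterise one ellipse into a binary mask (1 = face, 0 = background)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate the pixel offset into the ellipse's own axes.
            u = dx * cos_a + dy * sin_a
            v = -dx * sin_a + dy * cos_a
            if (u / major) ** 2 + (v / minor) ** 2 <= 1.0:
                mask[y][x] = 1
    return mask


def face_mask(height, width, ellipses):
    """Union of all annotated ellipses of one image."""
    mask = [[0] * width for _ in range(height)]
    for major, minor, angle, cx, cy in ellipses:
        single = ellipse_mask(height, width, major, minor, angle, cx, cy)
        for y in range(height):
            for x in range(width):
                mask[y][x] |= single[y][x]
    return mask
```

For example, `face_mask(9, 9, [(3.0, 2.0, 0.0, 4.0, 4.0)])` marks an axis-aligned ellipse centred on the middle pixel. A mask of this kind takes values in {0, 1}, which matches the (0, 1) range of the sigmoid output layer.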

B. Results

In the testing phase, test data from the FDDB dataset have been used against the trained model. As the convolutional network architecture has a sigmoid layer as the terminating layer (refer to section III-B2), it can produce a grayscale image as output for a specific input. But to produce a deterministic face boundary, some post-processing has to be done. The grayscale output of the network is treated as a density plot and a Gaussian distribution is fitted to it. Then the boundary ellipse is drawn up to one standard deviation of the Gaussian. The process is depicted in figure 6.

Fig. 5. Sample data from FDDB dataset

CONCLUSION

In this paper, we analyzed both old and modern techniques for face detection. We described the method of extracting hand-crafted features and compared it with a more advanced neural network based methodology. We presented some qualitative results to illustrate the effectiveness of the modern method over the old one. In future work we may present quantitative results and a more rigorous analysis of the work.
REFERENCES
[1] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1978–1983, 2006. doi: 10.1109/CVPR.2006.95.
[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
[3] C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing, 1984.
[4] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint Cascade Face Detection and Alignment. Springer International Publishing, 2014.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893, 2005.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, 2013.
[7] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.
[11] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In The IEEE International Conference on Computer Vision (ICCV), 2013.

Fig. 6. The output of the network is the second image when the input was the first one. The detection density is fitted by a Gaussian and the boundary has been drawn up to one s.d. of the distribution.
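The post-processing step in the caption above can be sketched as a moment-based Gaussian fit. This is only one plausible reading: the paper does not say how the Gaussian is fitted, so the use of intensity-weighted mean and covariance, the closed-form 2x2 eigen-decomposition, and the function name `fit_gaussian_ellipse` are all assumptions made for illustration.

```python
import math


def fit_gaussian_ellipse(density):
    """Fit a 2-D Gaussian to a non-negative map using weighted moments.

    Returns (cx, cy, semi_major, semi_minor, angle): the one-standard-deviation
    ellipse, with angle (radians) measured from the x-axis to the major axis.
    """
    total = sum(sum(row) for row in density)
    if total <= 0:
        raise ValueError("density map has no mass")
    # Intensity-weighted mean gives the ellipse centre.
    cx = sum(x * v for row in density for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(density) for v in row) / total
    # Intensity-weighted covariance of the pixel coordinates.
    sxx = syy = sxy = 0.0
    for y, row in enumerate(density):
        for x, v in enumerate(row):
            sxx += v * (x - cx) ** 2
            syy += v * (y - cy) ** 2
            sxy += v * (x - cx) * (y - cy)
    sxx, syy, sxy = sxx / total, syy / total, sxy / total
    # Closed-form eigen-decomposition of the 2x2 covariance matrix.
    mean_var = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    semi_major = math.sqrt(mean_var + spread)
    semi_minor = math.sqrt(max(mean_var - spread, 0.0))
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return cx, cy, semi_major, semi_minor, angle
```

The square roots of the two eigenvalues are the standard deviations along the principal axes, so they serve directly as the semi-axes of the one-s.d. boundary. A map containing several faces would first need to be split into connected regions, which this sketch does not attempt.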
