
A Face Detection System Using Shunting Inhibitory Convolutional Neural Networks

F. H. C. Tivive and A. Bouzerdoum, Senior Member, IEEE


School of Engineering and Mathematics, Edith Cowan University, 100 Joondalup Drive, Joondalup WA 6027, Perth, Australia. E-mail: f.tivive@ecu.edu.au, a.bouzerdoum@ieee.org

Abstract - In this paper, we present a face detection system based on a class of convolutional neural networks, namely Shunting Inhibitory Convolutional Neural Networks (SICoNNets). The networks were trained, using a hybrid method based on Rprop, Quickprop and least squares, to discriminate between face and non-face patterns. All three connection schemes achieve 99 % detection accuracy at a 5 % false alarm rate, based on a test set of 7000 face and non-face patterns. Furthermore, a toeplitz-connected network was trained on a larger training set and has a 99 % correct classification rate with only a 1 % false alarm rate on the same test set. A face detection system is built based on the trained convolutional neural networks. The system accepts an input image of arbitrary size and localizes the face patterns in the image. To localize faces of different sizes, the convolutional network is applied as a face detection filter at different scales. The detection scores from the different scales are aggregated to form the final decision.

I. INTRODUCTION

Over the past two decades, a huge research effort has been devoted to the application of artificial neural networks to pattern recognition problems such as face detection and recognition, and phoneme and speech recognition, to name a few. The advantage of this type of approach is that pattern recognition systems can be trained to capture the complex class conditional densities of the patterns. The basic neural network approach for face detection is to apply a feature extractor that extracts low dimensional feature vectors from the input patterns, and then use a neural network to classify the extracted features into one of the two possible classes: face and non-face. However, the design of a feature extractor requires a lot of attention as it is often task specific and hand-crafted. Another approach is to use a class of hierarchical neural networks, known as convolutional neural networks (CoNNs), to perform feature extraction as well as classification. The key characteristic of CoNNs is that they possess local receptive fields and translation invariant connection matrices. As the intensities of the pixels are fed directly into the network, the spatial topology of the input pattern is well captured by the network structure. Furthermore, the architecture of a CoNN integrates the feature extraction stage with the classification stage, and both are generated by the learning process.

Recently, we have proposed a new class of convolutional neural networks in which the processing cells are shunting inhibitory neurons [1]. Shunting inhibitory neurons are more powerful than perceptrons; i.e., they can approximate complex decision surfaces much more readily than MLPs. The main characteristic of the proposed convolutional network structure is a flexible "do-it-yourself" architecture, where the user only specifies the input size, the receptive field size, the number of layers and/or feature maps, the number of output units, and the connection scheme between layers. In this work, we demonstrate that the proposed networks can be trained to discriminate between face and non-face patterns. The trained networks are used to build a face detection system which locates the face patterns in an input image of arbitrary size.

The paper is organized as follows. In the next section the convolutional neural network architecture is introduced. Section 3 describes the face detection and localization system. Experimental results are presented in Section 4, followed by concluding remarks in Section 5.

II. SICONNET ARCHITECTURE

The proposed architecture is a hierarchical structure based on the same architectural concepts proposed by LeCun et al. [4], such as local receptive fields, weight sharing and sub-sampling. The input layer of the network is a 2D square array of arbitrary size. Each hidden layer consists of several planes of shunting inhibitory neurons. Each plane, also known as a feature map, has a unique set of N x N incoming weights, where N is an odd integer; all the neurons in a feature map share the same set of weights connecting them to different locations in the input image (receptive field), and each feature map has a unique receptive field. This process allows the neurons in a plane to extract elementary visual features from the previous layer. The same receptive field size is used from one layer to the next throughout the network architecture. Sub-sampling is performed within each hidden layer by shifting the centres of the receptive fields by two positions, horizontally and vertically, as shown in Fig. 1a. Three possible connection strategies have been developed: full-connection, toeplitz-connection, and binary-connection. In a full-connection scheme, each feature map is fully connected

0-7803-8359-1/04/$20.00 ©2004 IEEE


to the feature maps in the succeeding layer, and each layer has an arbitrary number of feature maps. For the toeplitz- and binary-connection schemes, the number of feature maps in each layer is constrained to an integer power of 2: the number of feature maps in the Lth hidden layer is equal to 2^L. In the toeplitz-connection scheme, a feature map may have one-to-one or one-to-many links with feature maps in the preceding layer, and the connection matrix is a toeplitz matrix. In the binary-connection scheme, each feature map branches out to two feature maps in the succeeding layer, forming a binary tree. More details about the connection strategies can be found in [1]. The output layer is a set of linear or sigmoid neurons (perceptrons). As there is a transition from planes of neurons in the last hidden layer to a column of neurons in the output layer, a local averaging is performed on all feature maps in the last hidden layer; that is, small (2 x 2) non-overlapping regions are averaged and the resulting signals are fed into the output layer (Fig. 1b). In cases where the feature maps in the last hidden layer consist of single neurons only, the outputs of these neurons serve as inputs to the output layer. The response of an output neuron is a weighted sum of the input signals added to a bias term, and the result is passed through an appropriate activation function (linear or sigmoidal). Mathematically, the response of an output neuron is given by

y = h( ∑_u w_u x_u + b )        (1)

where h is the output activation function, the w_u's are the connection weights, the x_u's are the inputs to the neuron, and b is the bias term.

Fig. 1. (a) The receptive fields of convolutional neurons: adjacent receptive fields have their centres offset by two positions horizontally and vertically. (b) Local averaging in the last convolutional layer: the signals from 2 x 2 regions are averaged to form the input signals to the neurons of the output layer.
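To make the local averaging and the output-neuron response of (1) concrete, the NumPy sketch below (our own illustration, not the authors' code; the array sizes, weights and bias are arbitrary) averages the 2 x 2 non-overlapping regions of eight small feature maps and feeds the results to a single linear output neuron:

```python
import numpy as np

def local_average_2x2(fmap):
    """Average non-overlapping 2 x 2 regions of a feature map (Fig. 1b)."""
    h, w = fmap.shape
    # reshape groups each 2 x 2 block; mean over the block axes collapses it
    return fmap[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def output_neuron(inputs, weights, bias, h=lambda s: s):
    """Weighted sum plus bias, passed through an activation h (linear by default)."""
    return h(np.dot(weights, inputs) + bias)

# Eight 4 x 4 feature maps from the last hidden layer -> 8 * 2 * 2 = 32 inputs
maps = [np.arange(16, dtype=float).reshape(4, 4) for _ in range(8)]
x = np.concatenate([local_average_2x2(m).ravel() for m in maps])
y = output_neuron(x, weights=np.full(32, 0.01), bias=-0.5)
print(x.size, round(y, 4))  # -> 32 1.9
```

The reshape trick avoids explicit loops: each 2 x 2 block becomes its own pair of axes, so a single `mean` call performs the local averaging.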

A. Shunting neuron
Shunting inhibition is a powerful computational mechanism that plays an important role in sensory information processing. Neurons based on this shunting mechanism have been used to build cellular neural networks which have been applied to various vision and image processing tasks [5]. Recently, shunting inhibition has been used in a feedforward neural network architecture for supervised pattern classification and regression [3], [6]. In this architecture, the activation of the hidden neurons is governed by the steady-state response of the feedforward shunting inhibitory neuron model [2], [3], which has since been generalized to the following [6]:

x_j = f( ∑_i w_ji I_i + b_j ) / ( a_j + g( ∑_i c_ji I_i + d_j ) )        (2)

where x_j is the activity of the jth neuron, the I_i's are the external inputs, a_j is the passive decay rate, w_ji and c_ji are the connection weights from the ith neuron to the jth neuron, b_j and d_j are constant biases, and f and g are activation functions. In this paper, the neuron model (2) is used to build the 2D feature maps of the proposed CoNN architecture. The output of the shunting neuron at location (i, j) in feature map {L, k} (the kth feature map in the Lth layer) is

X_L,k(i, j) = ∑_{u=1}^{S_{L-1}} [C_L,u * Z_{L-1},u](i, j) + b_L,k        (3a)

Y_L,k(i, j) = ∑_{u=1}^{S_{L-1}} [D_L,u * Z_{L-1},u](i, j) + d_L,k        (3b)

Z_L,k(i, j) = f( X_L,k(i, j) ) / ( a_L,k + g( Y_L,k(i, j) ) )        (3c)

where Z_{L-1},u represents the output of the uth feature map of the (L-1)th layer; S_{L-1} denotes the number of feature maps in the (L-1)th layer; C_L,u and D_L,u are the sets of weights (convolution masks); and "*" is the 2D convolution operator.
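As a concrete illustration of (3a)-(3c), the sketch below (our own, not the authors' implementation) computes one shunting-inhibitory feature map with the stride-2 sub-sampling of Fig. 1a; the choices f = tanh and g = exp, and all sizes and weights, are illustrative assumptions:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D 'valid' convolution (kernel flipped, as the * operator requires)."""
    kh, kw = kernel.shape
    k = kernel[::-1, ::-1]  # flip for true convolution
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

def shunting_feature_map(prev_maps, C, D, b, d, a, f=np.tanh, g=np.exp):
    """One feature map Z_{L,k} from eqs. (3a)-(3c), with stride-2 sub-sampling.
    prev_maps: list of (H, W) arrays Z_{L-1,u}; C, D: lists of N x N masks."""
    X = sum(conv2d_valid(z, c) for z, c in zip(prev_maps, C)) + b  # (3a)
    Y = sum(conv2d_valid(z, w) for z, w in zip(prev_maps, D)) + d  # (3b)
    Z = f(X) / (a + g(Y))                                          # (3c)
    return Z[::2, ::2]  # receptive-field centres shifted by two positions

rng = np.random.default_rng(0)
prev = [rng.standard_normal((32, 32)) for _ in range(2)]   # S_{L-1} = 2 maps
C = [rng.standard_normal((5, 5)) * 0.1 for _ in range(2)]
D = [rng.standard_normal((5, 5)) * 0.1 for _ in range(2)]
z = shunting_feature_map(prev, C, D, b=0.1, d=0.1, a=1.0)
print(z.shape)  # (14, 14): 28 valid positions per axis, sub-sampled by 2
```

With a > 0 and g = exp, the denominator stays strictly positive, which is the practical appeal of the steady-state shunting model.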

III. THE FACE DETECTION SYSTEM


The proposed face detection system consists of two stages.

In the first stage, SICoNNets are trained to discriminate


TABLE I
CLASSIFICATION ACCURACY FOR THE THREE CONNECTION STRATEGIES

TABLE II
FACE DETECTION RESULTS ON THE FOUR TEST SETS

Test set                                  Number of faces   Correct detections   False dismissals   False detections
A - Single person (500 images)            500               477 (95.4 %)         23 (4.6 %)         279
B - Multiple people (200 images)          504               463 (91.9 %)         41 (8.1 %)         265
C - Spectacles/sunglasses (150 images)    164               130 (79.3 %)         34 (20.7 %)        117
D - Beard/moustaches (100 images)         101               82 (81.2 %)          19 (18.8 %)        88

between face and non-face patterns. Face patterns are obtained from the ECU face detection database¹, where faces are cropped from images gathered from various sources over the internet. Non-face examples are collected using a bootstrap strategy: a network is first trained on a small training set, and the trained network is then used to scan several highly textured images which contain no faces. This strategy is an efficient way of coping with the problem of finding a representative set of non-face patterns. Some samples of ECU face patterns and of non-face patterns collected by the bootstrap strategy are presented in Fig. 2. In the second stage, the trained SICoNNet is applied to a given 2D input image of arbitrary size to determine whether it contains any human faces; if it does, the face detection system returns the location of each face in the image. To localize faces of different sizes, the input image is processed at different scales by sub-sampling it. Most face detection systems [7], [8], [9] use a sub-sampling ratio of 1.2. However, to reduce the computational cost of processing a large number of scaled images, our face detector uses a ratio of 1.5. Furthermore, the trained network is applied as a filter every 4 pixels in both dimensions of the scaled image to detect face candidates, and their locations are mapped back to the input image. Often, detected candidates that are background have a higher network response than a true face candidate. Since the human face is symmetric, each detected candidate is mirrored and passed back to the network, and the average of the two network responses is taken as the final score of the candidate. Experimental results show that this reduces the number of false detections and at the same time increases the confidence scores of the face candidates.
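The multi-scale scan can be sketched as follows; the 32 x 32 window (the network's input size), the sub-sampling ratio of 1.5 and the 4-pixel step come from the text, while the helper names and the placeholder classifier are our own:

```python
import numpy as np

WINDOW, STEP, RATIO = 32, 4, 1.5

def scan_positions(height, width):
    """Top-left corners of 32 x 32 windows, stepped every 4 pixels."""
    return [(r, c) for r in range(0, height - WINDOW + 1, STEP)
                   for c in range(0, width - WINDOW + 1, STEP)]

def pyramid_scales(height, width):
    """Scale factors (ratio 1.5) until the image can no longer hold one window."""
    scales, s = [], 1.0
    while min(height, width) / s >= WINDOW:
        scales.append(s)
        s *= RATIO
    return scales

def mirror_score(window, net):
    """Average the network response over the window and its mirror image."""
    return 0.5 * (net(window) + net(window[:, ::-1]))

img = np.zeros((100, 150))
print(pyramid_scales(*img.shape))       # scales until the window no longer fits
print(len(scan_positions(*img.shape)))  # number of windows at the original scale
```

Here `net` stands for the trained SICoNNet; `mirror_score` implements the symmetry trick of averaging the responses to a candidate and its mirrored version.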
Then a grouping algorithm, similar to that proposed by Delakis and Garcia [10], is applied to the scaled images in order to group face candidates into clusters according to their proximity. At each scale, the detected face candidates are sorted in
¹The ECU face database was prepared by S. L. Phung for his PhD thesis on human face detection in colour images at Edith Cowan University.

descending order according to their confidence scores. The top face candidate in the list has the highest score, and those face candidates found within a region of one fourth of the size of the top candidate around its centre are grouped into a cluster. This process continues until all face candidates are clustered. In each cluster, the centroid of the representative face is taken as the average of the centres of the candidate faces, and its confidence score is computed as the product of the highest candidate score and the total number of face candidates in the cluster. To locate the face position and its size precisely, two fine searches are performed. First, a search is performed in a region of 4 pixels around the centre of the representative face. All centres of the detections are averaged to give the final position of the representative face; this removes the tendency towards a non-face candidate position. Once its location is computed, the representative face is tested at 6 scales, ranging from 0.7 to 1.3. According to [10], and from our experiments, a face pattern gives a high true response at 2 or more consecutive scales, whereas a non-face pattern does not. Therefore, the number of face detections at the six scales is counted, and if it is greater than a threshold (Tr = 2), the detected face candidate is considered a true face. These fine searches are applied to all scales. When collapsing all scales into one, a representative face can in some cases appear at several scales. To combine them, the network responses in the final scale are sorted in descending order. Then a search is performed in a (10 x 10) region around the centre of the highest network response. Those network responses that fall within that region are considered overlapping detections and are rejected.
Finally, all the remaining face candidates and their mirrored versions are further processed by the network, and if the average score of a candidate and its mirrored version is less than a threshold of 0.3, it is rejected.
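A simplified rendering of the grouping step reads as follows (our own shorthand, not the authors' code: candidates are (row, col, size, score) tuples, the square proximity test stands in for the "one fourth of the size" region, and the example values are invented):

```python
def group_candidates(candidates):
    """Greedy clustering of (row, col, size, score) detections.

    Candidates are visited in descending score order; every candidate whose
    centre lies within a quarter of the top candidate's size joins its
    cluster. The cluster centroid is the mean of the member centres, and the
    cluster score is the top score times the number of members.
    """
    pending = sorted(candidates, key=lambda c: -c[3])
    clusters = []
    while pending:
        top = pending[0]
        radius = top[2] / 4.0
        members = [c for c in pending
                   if abs(c[0] - top[0]) <= radius and abs(c[1] - top[1]) <= radius]
        pending = [c for c in pending if c not in members]
        row = sum(c[0] for c in members) / len(members)
        col = sum(c[1] for c in members) / len(members)
        clusters.append((row, col, top[2], top[3] * len(members)))
    return clusters

dets = [(50, 50, 32, 0.9), (52, 49, 32, 0.7), (120, 80, 32, 0.4)]
print(group_candidates(dets))  # two clusters: one merged pair, one singleton
```

Scoring a cluster by (top score x member count) rewards locations where many overlapping windows fire, which is characteristic of true faces.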


Fig. 2. Examples of face patterns from the ECU face database and non-face patterns collected by the bootstrap procedure.

IV. EXPERIMENTAL RESULTS

A set of 4000 face and non-face patterns of size 32 x 32 has been used to train the convolutional neural network architecture; all images are pre-processed by applying histogram equalization and range normalization. The target values for face and non-face patterns were set to 1 and -1, respectively. The network topology used for the face detector is a four-layer network with two feature maps in the first hidden layer, four feature maps in the second hidden layer, eight feature maps in the last hidden layer, and one output neuron with a linear activation function. Receptive fields of size 5 x 5 are used throughout the network. All shunting neurons in the feature maps share the same bias parameters, including the passive decay rate constant; this results in a network with 775 trainable parameters. A hybrid training method based on the principles of Rprop, Quickprop and least squares has been used to train the networks. To determine which connection strategy has the best performance, all three connection schemes were used in conjunction with the aforementioned network topology. All networks were trained on the same data set, which contains 8000 patterns, and they were tested on a test set containing 7000 face and non-face patterns. The results presented in Table I show that all connection schemes achieve over 97.9 % total classification accuracy, with the best performance of 98.36 % obtained with the toeplitz-connection scheme. The receiver operating characteristic (ROC) curves given in Fig. 3 indicate clearly that all the connection schemes can reach 99 % detection accuracy with only a 5 % to 7 % false alarm rate. The ROC curves also show that the toeplitz-connected topology slightly outperforms the fully-connected and binary-connected network topologies. Taking the toeplitz-connected network as the best classifier, the network was retrained on a larger training set.

The first stage of the retraining process is based on a training set of 15000 patterns with equal numbers of face and non-face patterns. The mirrored face patterns are also included in the training set; however, all the training patterns are only range normalized, the histogram equalization step having been removed to save some computational cost. The training is stopped based on the performance of a cross-validation set. In the second stage, the network continues to train while scanning some scene images to collect non-face patterns. In each training loop, the trained network is used to filter a certain number of scanned windows. Those windows detected as faces are included in the training set; in addition, the same number of face patterns is also added, which prevents the network from becoming biased towards either class. In the next iteration, the number of scanned windows is doubled until it reaches a threshold value (Tr = 5000). The whole training process is stopped when either 100 scene images have been scanned or the number of patterns in the training set has reached 20000. In every training loop, a cross-validation set is used to terminate the learning of the network.
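The bootstrap schedule described above amounts to the following control loop (a schematic sketch in which stub functions stand in for network training and window scanning; all names, the initial window budget, and the stub values are ours):

```python
def bootstrap(train, scan_for_false_positives, draw_faces,
              max_windows=5000, max_images=100, max_patterns=20000):
    """Grow the training set with hard non-face examples (schematic).

    Each iteration scans `n_windows` windows; the false positives join the
    training set together with an equal number of face patterns, and the
    window budget doubles until it reaches `max_windows`.
    """
    training_set, images_scanned, n_windows = [], 0, 100
    while images_scanned < max_images and len(training_set) < max_patterns:
        train(training_set)  # retrain (with cross-validation stopping)
        non_faces, images_used = scan_for_false_positives(n_windows)
        training_set += non_faces + draw_faces(len(non_faces))
        images_scanned += images_used
        n_windows = min(2 * n_windows, max_windows)
    return training_set

# Stub run: every scan yields 40 false positives from 2 scene images.
result = bootstrap(train=lambda ts: None,
                   scan_for_false_positives=lambda n: (['nf'] * 40, 2),
                   draw_faces=lambda n: ['f'] * n)
print(len(result))  # -> 4000 (stops after 100 stub images are scanned)
```

Adding one face pattern per collected non-face keeps the two classes balanced throughout the bootstrap, as the text requires.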

In order to conduct a comprehensive investigation of the performance of the face detection and localization system, four large sets of images have been tested. Test set A comprises 500 images, each containing a single face; the images in set B, consisting of 200 images, contain at least two people; test set C comprises 150 images, in which a person wears spectacles or sunglasses; test set D has 100 images in which a person has facial hair such as a beard and/or moustache. These images were collected from the internet and have sizes ranging from 100 to 1000 pixels. Most of the images were taken with different digital cameras and contain people of different ethnic backgrounds and ages. To evaluate the system, the number of faces that have been correctly detected is counted; false detections and false dismissals are also recorded for statistical analysis. Table II summarizes the results, and Fig. 5 presents some of the results obtained from the proposed face detection system on those test sets.
With test set A, our face detection system has achieved a correct detection rate of 95.4 % with only a 4.6 % false dismissal rate. However, when an image contains more individuals, the performance of the face detector drops to 91.9 %. This degradation in accuracy is due to the orientation of the faces and the illumination conditions under which the picture was taken. In some cases, the faces are rotated by more than 30 degrees, and at the same time part of the face is so dark that it loses its facial features entirely. Our face detector also fails when the face pattern is quite small, the person is wearing large dark sunglasses, or the face is very hairy; usually, the network outputs a low positive response for these faces, but they are rejected during post-processing. This is demonstrated by the poorer results obtained on test sets C and D, with correct detection rates of 79.3 % and 81.2 %, respectively. There is a trade-off between the number of faces that can be detected and the number of false detections. In our experiments, two threshold values have been used (0 and 0.3). By lowering these values, more faces can be detected at the expense of more false detections. Figure 4 shows the classification performance as a function of the threshold value, using the same test set as in Table I. For instance, a threshold of -0.25 yields a 99 % correct classification rate with only a 1 % false alarm rate. Since most of the false detections are background sub-images, the false detection rate can be reduced further by applying a skin colour filter, or other feature detectors such as an eye or mouth detector.
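The threshold trade-off can be illustrated with a small sweep over synthetic scores (the score distributions below are invented for illustration and do not reproduce the paper's figures; only the mechanism, detection rate rising together with false alarm rate as the threshold drops, is the point):

```python
import numpy as np

def rates(face_scores, nonface_scores, threshold):
    """Detection and false-alarm rates at a given decision threshold."""
    detection = np.mean(face_scores >= threshold)
    false_alarm = np.mean(nonface_scores >= threshold)
    return detection, false_alarm

rng = np.random.default_rng(2)
faces = rng.normal(0.8, 0.5, 3500)      # synthetic scores near the +1 target
nonfaces = rng.normal(-0.8, 0.5, 3500)  # synthetic scores near the -1 target
for t in (-0.25, 0.0, 0.3):
    d, f = rates(faces, nonfaces, t)
    print(f"threshold {t:+.2f}: detection {d:.3f}, false alarm {f:.3f}")
```

Sweeping the threshold over the two score populations traces exactly the kind of curve reported in Fig. 4.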


V. CONCLUSION AND FUTURE WORK

In this paper, a face detection and localization system that uses shunting inhibitory convolutional neural networks as a face detection filter has been proposed and successfully demonstrated. Based on four large test sets with 1269 persons, this system correctly detects and localizes 90.8 % of the faces. The convolutional network that was used as a face detector consists of four layers with a toeplitz-connection scheme and has 775 trainable parameters. When tested on segmented images, the convolutional network achieves an overall accuracy of 98.36 %, and has a 99 % correct classification rate at 5 % false alarm rate. However, by training the network on a larger set, containing 20000 patterns, the false alarm rate was reduced to 1 %. To improve the performance of the entire system, rotated face patterns should be used to train the network. This may result in a rotation invariant network that has the ability to detect face candidates not in the upright position, and hence reduce the number of missed detections at such rotations. Some post-processing techniques, such as a skin colour filter and eye detection, can also be used to reduce the number of false detections.

Fig. 4. Classification rates versus the threshold value after training the toeplitz-connected network on a training set of 20000 patterns.

Fig. 5. Some examples of the output of the proposed neural-based face detection system; the sample images are taken from test set D. Some of the images are taken from [11].

REFERENCES

[1] F. Tivive and A. Bouzerdoum, "A new class of convolutional neural networks (SICoNNets) and their application to face detection," Proc. of the International Joint Conference on Neural Networks, vol. 3, pp. 2157-2162, 2003.
[2] A. Bouzerdoum, "A new class of high-order neural networks with nonlinear decision boundaries," Proc. of the Sixth International Conference on Neural Information Processing, vol. 3, pp. 1004-1009, 1999.
[3] A. Bouzerdoum, "Classification and function approximation using feed-forward shunting inhibitory artificial neural networks," Proc. of the International Joint Conference on Neural Networks, vol. 6, pp. 613-618, 2000.
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 396-404.
[5] A. Bouzerdoum and R. B. Pinter, "Shunting inhibitory cellular neural networks: derivation and stability analysis," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 40, pp. 215-221, 1993.
[6] G. Arulampalam and A. Bouzerdoum, "A generalized feedforward neural network architecture for classification and regression," Neural Networks, vol. 16, pp. 561-568, 2003.
[7] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, 1998.
[8] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 39-51, 1998.
[9] B. Fasel, "Robust face analysis using convolutional neural networks," Proc. of the Sixteenth International Conference on Pattern Recognition, vol. 2, pp. 11-15, 2002.
[10] M. Delakis and C. Garcia, "Robust face detection based on convolutional neural networks," Proc. of the 2nd Hellenic Conference on Artificial Intelligence (SETN'02), pp. 367-378, 2002.
[11] G. Mark and K. Jonathan. (2004) Gettyimages. [Online]. Available: http://www.gettyimages.com/

