
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE LABORATORY

and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL
LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1687

May, 2000

C.B.C.L. Paper No. 187

Face Detection in Still Gray Images
Bernd Heisele, Tomaso Poggio, Massimiliano Pontil

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The pathname for this publication is: ai-publications/1500-1999/AIM-1687.ps.Z

Abstract

We present a trainable system for detecting frontal and near-frontal views of faces in still gray images using Support Vector Machines (SVMs). We first consider the problem of detecting the whole face pattern by a single SVM classifier. In this context we compare different types of image features, present and evaluate a new method for reducing the number of features, and discuss practical issues concerning the parameterization of SVMs and the selection of training data. The second part of the paper describes a component-based method for face detection consisting of a two-level hierarchy of SVM classifiers. On the first level, component classifiers independently detect components of a face, such as the eyes, the nose, and the mouth. On the second level, a single classifier checks if the geometrical configuration of the detected components in the image matches a geometrical model of a face.
Copyright © Massachusetts Institute of Technology, 2000

This report describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and in the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. This research is sponsored by a grant from the Office of Naval Research under Contract No. N00014-93-1-3085 and the Office of Naval Research under Contract No. N00014-95-1-0600. Additional support is provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Benz AG, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.

1 Introduction

Over the past ten years face detection has been thoroughly studied in computer vision research for mainly two reasons. First, face detection has a number of interesting applications: it can be part of a face recognition system, a surveillance system, or a video-based computer/machine interface. Second, faces form a class of visually similar objects which simplifies the generally difficult task of object detection. In this context, detecting chairs is often mentioned as an example where the high variation within the object class leads to a nearly unsolvable detection problem. Besides the variability between individual objects of the same class, detection algorithms have to cope with variations in the appearance of a single object due to pose and illumination changes. Most past research work on face detection focused on detecting frontal faces, thus leaving out the problem of pose invariance. Although there is still some room for improvement in frontal face detection, the key issue of current and future research seems to be pose invariance.
In the following we give a brief overview of face detection techniques. One category of systems relies on detecting skin parts in color images [Wu et al. 99, Saber & Tekalp 96]. Common techniques for skin color detection estimate the distribution of skin color in the color space using labeled training data [Jebara & Pentland 97, Jones & Rehg 99]. A major problem of skin color detection is its sensitivity to changes in the spectral composition of the lighting and to changes in the characteristics of the camera. Therefore, most systems generate hypotheses by the skin color detector and verify them by a front-end pattern classification module. Depending on the application there are other efficient ways of generating object hypotheses. In case of a static video camera and a static background scenery, background subtraction [Ivanov et al. 98, Toyama et al. 99] is commonly used to detect objects.
Another category of algorithms performs face detection in still gray images. Since there are no color and motion cues available, face detection boils down to a pure pattern recognition task. One of the first systems for detecting faces in gray images combines clustering techniques with neural networks [Sung 96]. It generates face and non-face prototypes by clustering the training data consisting of 19×19 histogram normalized face images. The distances between an input pattern and the prototypes are classified by a Multi-Layer Perceptron. In [Osuna 98] frontal faces are detected by an SVM with polynomial kernel. A system able to deal with rotations in the image plane was proposed by [Rowley et al. 97]. It consists of two neural networks, one for estimating the orientation of the face, and another for detecting the derotated faces. The recognition step was improved [Rowley et al. 98] by arbitrating between independently trained networks of identical structure. The techniques described above have in common classifiers which were trained on patterns of the whole face. A naive Bayesian approach was taken in [Schneiderman & Kanade 98]. The method determines the empirical probabilities of the occurrence of 16×16 intensity patterns within 64×64 face images. Assuming statistical independence between the small patterns, the probability for the whole pattern being a face is calculated as the product of the probabilities for the small patterns. Another probabilistic approach which detects small parts of faces is proposed in [Leung et al. 95]. Local feature extractors are used to detect the eyes, the corners of the mouth, and the tip of the nose. Assuming that the position of the eyes is properly determined, the geometrical configuration of the detected parts in the image is matched with a model configuration by conditional search. A related method using statistical models is published in [Rikert et al. 99]. Local features are extracted by applying multi-scale and multi-orientation filters to the input image. The responses of the filters on the training set are modeled as Gaussian distributions. In contrast to [Leung et al. 95], the configuration of the local filter responses is not matched with a geometrical model. Instead, the global consistency of the pattern is verified by analyzing features at a coarse resolution. Detecting components has also been applied to face recognition. In [Wiskott 95] local features are computed on the nodes of an elastic grid. Separate templates for the eyes, the nose and the mouth are matched in [Beymer 93, Brunelli & Poggio 93].
There are two interesting ideas behind part- or component-based detection of objects. First, some object classes can be described well by a few characteristic object parts¹ and their geometrical relation. Second, the patterns of some object parts might vary less under pose changes than the pattern belonging to the whole object. The two main problems of a component-based approach are how to choose the set of discriminatory object parts and how to model their geometrical configuration. The approaches mentioned above either manually define a set of components and model their geometrical configuration, or uniformly partition the image into components and assume statistical independence between the components. In our system we started with a manually defined set of facial components and a simple geometrical model acquired from the training set. In a further step we developed a technique for automatically extracting discriminatory object parts using a database of 3-D head models.
The outline of the paper is as follows: In Chapter 2 we compare different types of image features for face detection. Chapter 3 is about feature reduction. Chapter 4 contains some experimental results on the parameterization of an SVM for face detection. Different techniques for generating training sets are discussed in Chapter 5. The first part of the paper, about face detection using a single SVM classifier, concludes in Chapter 6 with experimental results on standard test sets. Chapter 7 describes a component-based system and compares it to a whole face detector. Chapter 8 concludes the paper.
¹ In this paper we use the expression object part both for the 3-D part of an object and the 2-D image of a 3-D object part.

2 Extracting image features

Regarding learning, the goal of image feature extraction is to process the raw pixel data such that variations between objects of the same class (within-class variations) are reduced, while variations relevant for separating between objects of different classes (between-class variations) are kept. Sources of within-class variations are changes in the illumination, changes in the background, and different properties of the camera. In [Sung 96] three preprocessing steps were applied to the gray images to reduce within-class image variations. First, pixels close to the boundary of the 19×19 images were removed in order to eliminate parts belonging to the background. Then a best-fit intensity plane was subtracted from the gray values to compensate for cast shadows. Histogram equalization was finally applied to remove variations in the image brightness and contrast. The resulting pixel values were used as input features to the classifier. We compared these gray value features to gray value gradients and Haar wavelets. The gradients were computed from the histogram equalized 19×19 image using 3×3 x- and y-Sobel filters. Three orientation-tuned masks (see Fig. 1) at two different scales were convolved with the 19×19 image to compute the Haar wavelets. This led to a 1,740-dimensional feature vector. Examples of the three types of features are shown in Fig. 2.
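To make the wavelet computation concrete, the following sketch (Python with NumPy/SciPy; our own illustration, not the authors' code) convolves a 19×19 image with vertical, horizontal, and diagonal Haar masks at two scales and concatenates the responses. With 2×2 and 4×4 masks and "valid" convolution this choice happens to reproduce the 1,740 dimensions mentioned above, although the exact mask sizes are not specified in the text and are our assumption.

    import numpy as np
    from scipy.signal import convolve2d

    def haar_masks(s):
        # Vertical, horizontal and diagonal Haar masks of size 2s x 2s,
        # following the +1/-1 layout of Fig. 1.
        v = np.ones((2 * s, 2 * s)); v[:, :s] = -1                    # vertical
        h = np.ones((2 * s, 2 * s)); h[:s, :] = -1                    # horizontal
        d = np.ones((2 * s, 2 * s)); d[:s, s:] = -1; d[s:, :s] = -1   # diagonal
        return v, h, d

    def haar_features(img, scales=(1, 2)):
        # Convolve with every mask at every scale and concatenate the
        # responses into a single feature vector.
        feats = [convolve2d(img, m, mode='valid').ravel()
                 for s in scales for m in haar_masks(s)]
        return np.concatenate(feats)

    x = haar_features(np.random.rand(19, 19))   # toy 19x19 input: 1,740 features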
Figure 1: Convolution masks (vertical, horizontal, and diagonal) for calculating Haar wavelets.

Figure 2: Examples of extracted features. The original gray image is shown in a), the histogram equalized image in b), the gray value gradients in c), and Haar wavelets generated by a single convolution mask at two different scales in d).

Gray, gray gradient, and Haar wavelet features were rescaled to lie in a range between 0 and 1 before they were used for training an SVM with 2nd-degree polynomial kernel. The training data consisted of 2,429 face and 19,932 non-face images. The classification performance was determined on a test set of 118 gray images with 479 frontal faces². Each image was rescaled 14 times by factors between 0.1 and 1.2 to detect faces at different scales. A 19×19 window was shifted pixel-by-pixel over each image. Overall, about 57,000,000 windows were processed. The Receiver Operating Characteristic (ROC) curves, shown in Fig. 3, were generated by stepwise variation of the classification threshold of the SVM. Histogram normalized gray values were the best choice: for a fixed FP rate the detection rate for gray values was about 10% higher than for Haar wavelets and about 20% higher than for gray gradients. We trained an SVM with linear kernel on the outputs of the gray/gradient and gray/wavelet classifiers to find out whether the combination of two feature sets improves the performance. For both combinations the results were about the same as for the single gray classifier.
Figure 3: ROC curves for SVMs with 2nd-degree polynomial kernel trained on different types of image features: gray values, gradients, and Haar wavelets (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows). The detection rate is plotted against the number of false positives per inspected window.

² The test set is a subset of the CMU test set 1 [Rowley et al. 97], which consists of 130 images and 507 faces. We excluded 12 images containing line-drawn faces and non-frontal faces.

3 Feature Reduction

The goal of feature reduction is to improve the detection rate and to speed up the classification process by removing class-irrelevant features. We investigated two ways of feature reduction: a) generating a new set of features by linearly combining the original features, and b) selecting a subset of the original features.

3.1 Linear combination of features

We evaluated two techniques which generate new feature sets by linearly combining the original features:

- Principal Component Analysis (PCA) is a standard technique for generating a space of orthogonal, uncorrelated features.

- Iterative Linear Classification (ILC) determines the most class-discriminant, orthogonal features by iteratively training a linear classifier on the labeled training samples. The algorithm consists of two steps:
  a) Determine the direction for separating the two classes by training a linear classifier on the current training samples.
  b) Generate a new sample set by projecting the samples into a subspace that is orthogonal to the direction calculated in a), and continue with step a).
  The new N-dimensional feature space is spanned by the N first directions calculated in step a). In the following experiments we used an SVM as the linear classifier; a minimal code sketch is given below.
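The following is a minimal sketch of ILC (Python; scikit-learn's LinearSVC stands in for the linear SVM, which is our substitution). Each iteration trains a linear classifier, stores its normalized weight vector as a new feature direction, and projects the samples onto the orthogonal complement of that direction.

    import numpy as np
    from sklearn.svm import LinearSVC

    def ilc_directions(X, y, n_dirs=3):
        # X: samples as rows, y: labels in {-1, +1}.
        dirs, Xp = [], X.copy()
        for _ in range(n_dirs):
            w = LinearSVC(C=1.0).fit(Xp, y).coef_.ravel()   # step a)
            w /= np.linalg.norm(w)
            dirs.append(w)
            Xp = Xp - np.outer(Xp @ w, w)                   # step b): project out w
        return np.stack(dirs)

    # The reduced representation is the projection of each sample
    # onto the learned directions:
    # Z = X @ ilc_directions(X, y, n_dirs=10).T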
Both techniques were applied to the 283 gray value features described in Chapter 2. We downsized the previously used training and test sets in order to perform a large number of tests. The new negative training set included 4,550 samples randomly selected from the original negative training set. The positive training data remained unchanged. The new test set included all face patterns and 23,570 non-face patterns of the CMU test set 1. The non-face patterns were selected by the classifier described in Chapter 2 as the 23,570 non-face patterns which were most similar to faces. An SVM with a 2nd-degree polynomial kernel was trained on the reduced feature sets. The ROC curves are shown in Figs. 4 and 5 for PCA and ILC, respectively. The first 3 ILC features were superior to the first 3 PCA features. However, increasing the number of ILC features up to 10 did not improve the performance. This is because ILC does not generate uncorrelated features. Indeed, the 10 ILC features were highly correlated, with an average correlation of about 0.7. Increasing the number of PCA features up to 20, on the other hand, steadily improved the classification performance until it equaled the performance of the system trained on the original 283 features. Reducing the number of features from 283 to 20 sped up the classification by a factor of 14² = 196 for a 2nd-degree polynomial SVM, since the number of input features dropped by a factor of about 14 and the dimension of the transformed feature space of a 2nd-degree polynomial kernel grows quadratically with the number of input features.
Figure 4: ROC curves for SVMs with 2nd-degree polynomial kernel trained on PCA features (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for all 283 features and for the first 20, 10, and 3 PCA features. The PCA has been calculated on the whole training set.

Figure 5: ROC curves for SVMs with 2nd-degree polynomial kernel trained on feature sets generated by ILC (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for all features and for the first 10 and first 3 ILC features.

3.2 Selecting features

We developed a technique for selecting class-relevant features based on the decision function f(x) of an SVM:

    f(x) = Σᵢ αᵢ yᵢ K(x, xᵢ) + b,    (1)

where the xᵢ are the Support Vectors, the αᵢ the Lagrange multipliers, the yᵢ the labels of the Support Vectors (-1 or 1), K(·,·) the kernel function, and b a constant. A point x is assigned to class 1 if f(x) > 0, otherwise to class -1. The kernel function K(·,·) defines the dot product in some feature space F*. If we denote the transformation from the original feature space F to F* by Φ(x), Eq. (1) can be rewritten as:

    f(x) = w · Φ(x) + b,    (2)

where w = Σᵢ αᵢ yᵢ Φ(xᵢ). Note that the decision function in Eq. (2) is linear in the transformed features x* = Φ(x). For a 2nd-degree polynomial kernel with K(x, y) = (1 + x · y)², the transformed feature space F* has dimension N* = (N+3)N/2 and is given by:

    x* = (√2 x₁, √2 x₂, ..., √2 x_N, x₁², x₂², ..., x_N², √2 x₁x₂, √2 x₁x₃, ..., √2 x_{N−1}x_N).

The contribution of a feature x*ₙ to the decision function in Eq. (2) depends on wₙ. A straightforward way to order the features is by decreasing |wₙ|. Alternatively, we weighted wₙ by the Support Vectors to account for different distributions of the features in the training data. In this case the features were ordered by decreasing |wₙ Σᵢ yᵢ x*ᵢ,ₙ|, where x*ᵢ,ₙ denotes the n-th component of Support Vector i in feature space F*. Both ways of feature ranking were applied to an SVM with 2nd-degree polynomial kernel trained on 20 PCA features, corresponding to 230 features in F*. In a first evaluation of the rankings we calculated (1/M) Σᵢ |f(xᵢ) − f_S(xᵢ)| over all M Support Vectors, where f_S(x) is the decision function using only the S first features according to the ranking. Note that we did not retrain the SVM on the reduced feature set. The results in Fig. 6 show that ranking by the weighted components of w leads to a faster convergence of the error towards 0. The final evaluation was done on the test set. Fig. 7 shows the ROC curves for 50, 100, and 150 features for both ways of ranking. The results confirm that ranking by the weighted components of w is superior. The ROC curve for 100 features on the test set was about the same as for the complete feature set.

By combining PCA with the above described feature selection we could reduce the original (283+3)·283/2 = 40,469 features in F* to 100 features without loss in classification performance on the test set.

Figure 6: Classifying Support Vectors with a reduced number of features. The x-axis shows the number of features S (up to all 230); the y-axis is the mean absolute difference (1/M) Σᵢ |f(xᵢ) − f_S(xᵢ)| between the output of the SVM using all features and the same SVM using only the S first features. The features were ranked according to the components and the weighted components of the normal vector w of the separating hyperplane.
Figure 7: ROC curves for reduced feature sets (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for 50, 100, and 150 features ranked by the components of w, for the same numbers of features ranked by the weighted components of w, and for all 230 features.

4 Parameterization of SVMs

The choice of the classifier and its parameterization play an important role in the overall performance of a learning-based system. We chose the SVM as classifier since it is well founded in statistical learning theory [Vapnik 98] and has been successfully applied to various object detection tasks in computer vision [Oren et al. 97, Osuna 98]. An SVM is parameterized by its kernel function and by the C value, which controls the penalty for constraint violations during the training process. For more detailed information about SVMs refer to [Vapnik 98].

4.1 Kernel function

Three common types of kernel functions were evaluated in our experiments:

- Linear kernel: K(x, y) = x · y
- Polynomial kernel: K(x, y) = (1 + x · y)ⁿ, with n set to 2 and 3
- Gaussian kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)), with σ² set to 5 and 10

All experiments were carried out on the training and test sets described in Chapter 3. The ROC curves are shown in Fig. 8. The 2nd-degree polynomial kernel seems a good compromise between computational complexity and classification performance. The SVM with Gaussian kernel (σ² = 5) was slightly better but required about 1.5 times more Support Vectors (738 versus 458) than the polynomial SVM.
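For reference, the following sketch sets up the five kernels above with scikit-learn (our choice of library; the paper does not name an implementation). Note that sklearn writes the Gaussian kernel as exp(−γ‖x − y‖²), so σ² = 5 corresponds to γ = 1/(2·5), and its polynomial kernel (γ x·y + c₀)ⁿ matches (1 + x·y)ⁿ with γ = 1 and c₀ = 1.

    from sklearn.svm import SVC

    kernels = {
        'linear':       SVC(kernel='linear'),
        'poly, n=2':    SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0),
        'poly, n=3':    SVC(kernel='poly', degree=3, gamma=1.0, coef0=1.0),
        'gauss, s2=5':  SVC(kernel='rbf', gamma=1.0 / (2 * 5.0)),
        'gauss, s2=10': SVC(kernel='rbf', gamma=1.0 / (2 * 10.0)),
    }
    # for name, clf in kernels.items():
    #     clf.fit(X_train, y_train)           # X_train, y_train: hypothetical data
    #     print(name, clf.n_support_.sum())   # number of Support Vectors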
4.2 C-parameter

We varied C between 0.1 and 100 for an SVM with 2nd-degree polynomial kernel. Some results are shown in Fig. 9. The detection performance slightly increases with C until C = 1. For C ≥ 1 the error rate on the training data was 0 and the decision boundary did not change any more.

Figure 8: ROC curves for different kernel functions: linear, 2nd- and 3rd-degree polynomial, and Gaussian with σ² = 5 and σ² = 10 (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).

Figure 9: ROC curves for different values of C (0.1, 0.5, and 1; training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).

5 Training Data

Besides selecting the input features and the classifier, choosing the training data is the third important step in developing a classification system.

5.1 Positive training data

Extracting face patterns is usually tedious and time-consuming work that has to be done manually. An interesting alternative is to generate artificial samples for training the classifier [Niyogi et al. 98]. In [Rowley et al. 97, Schneiderman & Kanade 98] the training set was enlarged by applying various image transformations to the original face images. We went a step further and generated a completely synthetic set of images by rendering 3-D head models [Vetter 98]. Using 3-D models for training has two interesting aspects: first, illumination and pose of the head are fully controllable, and second, images can be generated automatically in large numbers by rendering the 3-D models. To create a large variety of synthetic face patterns we morphed between different head models and modified the pose and the illumination. Originally we had 7 textured head models acquired by a 3-D scanner. Additional head models were generated by 3-D morphing between all pairs of the original models. The heads were rotated between −15° and 15° in azimuth and between −8° and 8° in the image plane. The faces were illuminated by ambient light and a single directional light pointing towards the center of the face. The position of the light varied between −30° and 30° in azimuth and between 30° and 60° in elevation. Overall, we generated about 5,000 face images. The negative training set was the same as in Chapter 3. Some examples of real and synthetic faces from our training sets are shown in Fig. 10. The ROC curves for SVMs trained on real and synthetic data are shown in Fig. 11. The significant difference in performance indicates that the image variations captured in the synthetic data do not cover the variations present in real face images, most likely because our face models were too uniform: no people with beards, no differences in facial expression, no differences in skin color.


Figure 10: Examples of real and synthetic face images. The synthetic faces were generated by rendering 3-D head models under varying pose and illumination. The resolution of the synthetic faces was 50×50 pixels after rendering. For training the face detector we rescaled the images to 19×19 pixels.

Figure 11: ROC curves for classifiers trained on real and synthetic faces (training: 2,429 real faces, 4,536 synthetic faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows).

5.2 Negative training data

Non-face patterns are abundant and can be automatically extracted from images that do not contain faces. However, it would require a huge number of randomly selected samples to fully cover the variety of non-face patterns. Iterative bootstrapping of the system with false positives (FPs) is a way to keep the training set reasonably small by specifically picking non-face patterns that are useful for learning. Fig. 12 shows the ROC curves for an SVM trained on the 19,932 randomly selected non-face patterns and an SVM trained on an additional 7,065 non-face patterns determined in three bootstrapping iterations. At an 80% detection rate the FP rate for the bootstrapped system was about 2×10⁻⁶ per classified pattern, which corresponds to 1 FP per image. Without bootstrapping, the FP rate was about 3 times higher.
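A minimal sketch of this bootstrapping loop follows (Python; extract_windows is a hypothetical helper that yields candidate windows as feature vectors, and the face-free image set is assumed given):

    import numpy as np
    from sklearn.svm import SVC

    def bootstrap(faces, nonfaces, face_free_images, iterations=3):
        X = np.vstack([faces, nonfaces])
        y = np.r_[np.ones(len(faces)), -np.ones(len(nonfaces))]
        clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        for _ in range(iterations):
            # Collect windows the current classifier wrongly accepts
            # as faces (false positives).
            fps = [w for img in face_free_images
                     for w in extract_windows(img)        # hypothetical helper
                     if clf.decision_function(w[None, :])[0] > 0]
            if not fps:
                break
            X = np.vstack([X, np.array(fps)])
            y = np.r_[y, -np.ones(len(fps))]
            clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        return clf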
Figure 12: ROC curves for a classifier which was trained on 19,932 randomly selected non-face patterns and for a classifier which was bootstrapped with 7,065 additional non-face patterns (training: 2,429 faces; test: 118 images, 479 faces, 56,774,966 windows).


6 Results and comparison to other face detection systems

There are two sets of gray images provided by CMU [Rowley et al. 98] which are commonly used for evaluating face detection systems [Sung 96, Yang et al. 99, Osuna 98, Rowley et al. 98, Schneiderman & Kanade 98]. These test sets provide a good basis for comparisons between face detection systems. However, the use of different training data and different heuristics for suppressing false positives complicates comparisons.

To achieve competitive detection results we further enlarged the previously used positive and negative training sets and also implemented heuristics for suppressing multiple detections at nearby image locations. An SVM with 2nd-degree polynomial kernel was trained on histogram equalized 19×19 images of 10,038 faces and 36,220 non-faces. The positive training set consisted of 5,813 real faces and 4,225 synthetic faces. The synthetic faces were generated from a subset of the real faces by rotating them between −2° and 2° and changing their aspect ratio between 0.9 and 1.1. The negative training set was generated from an initial set of 19,932 randomly selected non-face patterns and an additional 16,288 non-face patterns determined in six bootstrapping iterations. For testing, each test image was rescaled 14 times by factors between 0.1 and 1.2. A 19×19 window was shifted pixel-by-pixel over each image. We applied two heuristics to remove multiple detections at nearby image locations. First, a detection was suppressed if there was at least one detection with a higher SVM output value in its neighborhood. The neighborhood in the image plane was defined as a 19×19 box around the center of the detection; the neighborhood in scale space was set to [0.5, 2]. Second, we counted the number of detections within the neighborhood; if there were fewer than three detections, the detection was suppressed.
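The two heuristics can be written compactly as follows (a sketch; a detection is represented here as an (x, y, scale, output) tuple, and the neighborhood test mirrors the 19×19 box and the [0.5, 2] scale range described above):

    def neighbors(d, e):
        # Same neighborhood: centers within a 19x19 box and
        # scale ratio within [0.5, 2].
        return (abs(d[0] - e[0]) <= 9 and abs(d[1] - e[1]) <= 9
                and 0.5 <= d[2] / e[2] <= 2.0)

    def suppress_multiple_detections(detections, min_count=3):
        kept = []
        for d in detections:
            nb = [e for e in detections if neighbors(d, e)]
            if any(e[3] > d[3] for e in nb):    # heuristic 1: stronger neighbor
                continue
            if len(nb) < min_count:             # heuristic 2: too few detections nearby
                continue
            kept.append(d)
        return kept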
The results of our classifier are shown in Fig. 13 and compared to other results in Table 1. Our system outperforms a previous SVM-based face detector [Osuna 98] due to a larger training set and improvements in suppressing multiple detections. The results achieved by the naive Bayes classifier [Schneiderman & Kanade 98] and the SNoW-based face detector [Yang et al. 99] are better than our results. However, it is not clear which heuristics were used in these systems to suppress multiple detections and how these heuristics affected the results.


System                                      | Subset of test set 1     | Test set 1
                                            | (23 images, 155 faces)   | (130 images, 507 faces)
                                            | Det. Rate   FPs          | Det. Rate   FPs
--------------------------------------------+--------------------------+------------------------
[Sung 96] Neural Network                    | 84.6%       13           | N/A         N/A
[Osuna 98] SVM                              | 74.2%       20           | N/A         N/A
[Rowley et al. 98] Single neural network    | N/A         N/A          | 90.9%       738
[Rowley et al. 98] Multiple neural networks | 84.5%                    | 84.4%       79
[Schneiderman & Kanade 98]³ Naive Bayes     | 91.1%       12           | 90.5%       33
[Yang et al. 99]⁴ SNoW, multi-scale         | 94.1%                    | 94.8%       78
Our system⁵                                 | 84.7%       11           | 85.6%       9
                                            | 90.4%       26           | 89.9%       75

Table 1: Comparison between face detection systems.


Figure 13: ROC curves for the bootstrapped classifier with heuristics for suppressing multiple detections (training: 10,038 faces, 36,220 non-faces). Curves are shown for the subset of test set 1 (23 images, 157 faces, 7,628,845 windows) and for test set 1 (118 images, 479 faces, 56,774,966 windows); the detection rate is plotted against the number of false positives.
³ Five images of hand-drawn faces were excluded from test set 1.
⁴ Images of hand-drawn faces and cartoon faces were excluded from test set 1.
⁵ Twelve images containing line-drawn faces, cartoon faces and non-frontal faces were excluded from test set 1.


7 Component-based face detection

7.1 Motivation

Until now we considered systems where the whole face pattern was classified by a single SVM. Such a global approach is highly sensitive to changes in the pose of an object. Fig. 14 illustrates the problem for the simple case of linear classification. The result of training a linear classifier on frontal faces can be represented as a single face template, schematically drawn in Fig. 14 a). Even for small rotations the template clearly deviates from the rotated faces, as shown in Fig. 14 b) and c). The component-based approach tries to avoid this problem by independently detecting parts of the face. In Fig. 15 the eyes, the nose, and the mouth are represented as single templates. For small rotations the changes in the components are small compared to the changes in the whole face pattern. Slightly shifting the components is sufficient to achieve a reasonable match with the rotated faces.

Figure 14: Matching with a single template. The schematic template of a frontal face is shown in a). Slight rotations of the face in the image plane b) and in depth c) lead to considerable discrepancies between template and face.

Figure 15: Matching with a set of component templates. The schematic component templates for a frontal face are shown in a). Shifting the component templates can compensate for slight rotations of the face in the image plane b) and in depth c).


7.2 Component-based classifier

An overview of our two-level component-based classifier is shown in Fig. 16. A similar architecture was used for people detection [Mohan 99]. On the first level, component classifiers independently detect the eyes (9×7 pixels), the nose (9×11 pixels) and the mouth (13×7 pixels). Each component classifier was trained on a set of manually extracted facial components and a set of randomly selected non-face patterns. The components were extracted from the same set of 2,429 real face images as used in previous experiments.

On the second level, the geometrical configuration classifier performs the final face detection by combining the results of the component classifiers. Given a 19×19 image, the maximum outputs of the eyes, nose, and mouth classifiers within rectangular search regions⁶ around the expected positions of the components are used as inputs to the geometrical configuration classifier. The search regions were calculated from the mean and standard deviation of the components' locations in the training images.
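A sketch of the second level follows (Python; component_clfs, search_regions, patches_in, and geometrical_clf are hypothetical placeholders): for each component, the maximum first-level SVM output inside that component's search region is computed, and the resulting vector of maxima is classified by the linear geometrical configuration SVM.

    import numpy as np

    def second_level_features(window, component_clfs, search_regions):
        feats = []
        for clf, region in zip(component_clfs, search_regions):
            # Maximum component-classifier output over all positions of
            # the component window inside its search region.
            outputs = [clf.decision_function(p.ravel()[None, :])[0]
                       for p in patches_in(window, region)]   # hypothetical helper
            feats.append(max(outputs))
        return np.array(feats)

    # is_face = geometrical_clf.decision_function(
    #     second_level_features(window, component_clfs, search_regions)[None, :]) > 0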
Figure 16: System overview of the component-based classifier. On the first level, windows of the size of the components (solid boxes) are shifted over the face image and classified by the eye, nose, and mouth component classifiers. On the second level, the maximum outputs of the component classifiers within predefined search regions (dotted boxes) are fed into the geometrical configuration classifier.

⁶ To account for changes in the size of the components, the outputs were determined over multiple scales of the input image. In our tests, we set the range of scales to [0.75, 1.2].


The ROC curves for CMU test set 1 are shown in Fig. 17. The component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM⁷. Up to about a 90% recognition rate, the four-component system performs worse than the whole face classifier, probably due to class-relevant parts of the face that are not covered by the four components. We therefore added the whole face as a fifth component, similar to the template-based face recognition system proposed in [Brunelli & Poggio 93]. As shown in Fig. 17, the five-component classifier performs similarly to the whole face classifier. This indicates that the whole face is the most dominant of the five components.

To check the robustness of the classifiers against object rotations we performed tests on synthetic faces generated from 3-D head models. The synthetic test set consisted of two groups of 19×19 face images: 4,574 faces rotated in the image plane, and 15,865 faces rotated in depth. At each rotation angle we determined the FP rate at a 90% detection rate based on the ROC curves in Fig. 17. The results in Figs. 18 and 19 show that for small rotations the best performance was achieved by the five-component system. However, it deteriorated much faster with increasing rotation than the four-component system. This is not surprising since the whole face pattern changes more under rotation than the patterns of the other components.
Figure 17: ROC curves for frontal faces: whole face classifier versus component-based classifiers with components eyes, nose, mouth and with components eyes, nose, mouth, face (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows).

⁷ Alternatively, we tried linear classifiers for the components and a polynomial kernel for the geometrical classifier, but the results were clearly worse.


Figure 18: Classification results for synthetic faces rotated in the image plane (training: 2,429 faces, 19,932 non-faces; test: 4,574 synthetic images). The false positive rate per window at a 90% detection rate is plotted against the rotation angle for the whole face classifier and the two component-based classifiers.

Figure 19: Classification results for synthetic faces rotated in depth (training: 2,429 faces, 19,932 non-faces; test: 15,865 synthetic images). The false positive rate per window at a 90% detection rate is plotted against the rotation angle (up to 30°) for the whole face classifier and the two component-based classifiers.



7.3 Determining the components: preliminary results

In our previous experiments we manually selected the eyes, the nose and the mouth as characteristic components of a face. Although this choice is somewhat obvious, it would be more sensible to choose the components automatically based on their discriminative power and their robustness against pose changes. Moreover, for objects other than faces, it might be difficult to manually define a set of meaningful components. In the following we present two methods for learning components from examples.

The first method arbitrarily defines components and lets the geometrical configuration classifier learn to weight the components according to their relevance. We carried out an experiment with 16 non-overlapping components of size 5×5 evenly distributed over the 19×19 face pattern (see Fig. 20). As in previous experiments, the component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM. The training errors of the component classifiers give information about the discriminative power of each component (see Fig. 21). Components 5, 8, 9, and 12 are located on the cheeks of the face. They contain only little gray value structure, which is reflected in the comparatively high error rates. Surprisingly, components 14 and 15 around the mouth also show high error rates. This might be due to variations in the facial expression and slight misalignments of the faces in the training set.
Figure 20: Partitioning the face pattern into 16 non-overlapping components.


An alternative to using a large set of arbitrary components is to specifically generate discriminative components. Following this idea, we developed a second method that automatically determines rectangular components in a set of synthetic face images. The algorithm starts with a small rectangular component located around a pre-selected point in the face (e.g. the center of the left eye)⁸. The component is extracted from all synthetic face images to build a training set of positive examples.

⁸ We could locate the same facial point in all face images since we knew the point-by-point correspondences between the 3-D head models.


Figure 21: Training results for the 16 component classifiers: the recognition rate on the training data is shown for faces and for non-faces, per component.


We also generate a training set of non-face patterns that have the same rectangular shape as the component. After training an SVM on the component data we estimate the performance of the SVM according to its leave-one-out error [Vapnik 98]:

    ρ = R² ‖w‖²,    (3)

where R is the radius of the smallest sphere in the feature space F* containing the Support Vectors, and ‖w‖² is the square norm of the coefficients of the SVM (see Eq. (2)). After determining ρ we enlarge the component by expanding the rectangle by one pixel into one of the four directions (up, down, left, right). Again, we generate training data, train an SVM and determine ρ. We keep the expansion if it leads to a decrease in ρ; otherwise it is rejected and an expansion into one of the remaining directions is tried. This process is continued until the expansions into all four directions lead to an increase of ρ; a minimal code sketch is given below. In a preliminary experiment we applied the algorithm to three 3×3 regions located at the center of the eye, the tip of the nose and the center of the mouth. The final components, shown in Fig. 22, were determined on about 4,500 synthetic faces (65×85 pixels, rotation in depth between −45° and 45°)⁹. The eyes (24×8 pixels) and the mouth (30×12 pixels) are similar to the manually selected components. The component located at the tip of the nose (6×4 pixels), however, is small. This indicates that the pattern around the tip of the nose strongly varies under rotation.

⁹ In our experiments we replaced R² in Eq. (3) by the dimensionality N* of the space F*. This is because our data points lay within an N*-dimensional cube of side length 1, so the smallest sphere containing the data had radius equal to √N*/2. The approximation was mainly for computational reasons, since in order to compute R we need to solve an optimization problem [Osuna 98].
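The following sketch illustrates the growing procedure (Python; make_training_set and expand are hypothetical helpers, and R² is replaced by the dimensionality N* of F* as in footnote 9):

    import numpy as np
    from sklearn.svm import SVC

    def rho(rect, make_training_set):
        # Train on the current rectangle and evaluate rho ~ N* * ||w||^2.
        X, y = make_training_set(rect)          # hypothetical helper
        clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        n = X.shape[1]
        n_star = (n + 3) * n / 2                # dimension of F*, see Section 3.2
        a = clf.dual_coef_.ravel()              # alpha_i * y_i
        K = (1 + clf.support_vectors_ @ clf.support_vectors_.T) ** 2
        return n_star * (a @ K @ a)             # ||w||^2 = a^T K a

    def grow_component(rect, make_training_set):
        best = rho(rect, make_training_set)
        improved = True
        while improved:
            improved = False
            for direction in ('up', 'down', 'left', 'right'):
                cand = expand(rect, direction)  # hypothetical: one pixel outward
                r = rho(cand, make_training_set)
                if r < best:                    # keep expansions that decrease rho
                    rect, best, improved = cand, r, True
        return rect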

Figure 22: Automatically generated components.

8 Conclusion and future work

We presented and compared two systems for frontal and near-frontal face detection: a whole face detection system and a component-based detection system. Both systems are trained from examples and use SVMs as classifiers. The first system detects the whole face pattern with a single SVM. In contrast, the component-based system performs the detection by means of a two-level hierarchy of classifiers. On the first level, the component classifiers independently detect parts of the face, such as the eyes, the nose, and the mouth. On the second level, the geometrical configuration classifier combines the results of the component classifiers and performs the final detection step. In addition to the whole face and component-based face detection approaches we presented a number of experiments on image feature selection, feature reduction and selection of training data. The main points of the paper are as follows:

- Gray values are better input features for a face detector than Haar wavelets and gradient values.
- By combining PCA- with SVM-based feature selection we sped up the detection system by two orders of magnitude without loss in classification performance.
- Bootstrapping the classifier with non-face patterns increased the detection rate by more than 5%.
- We developed a component-based face detector which is more robust against face rotations than a comparable whole face detector.
- We proposed a technique for learning characteristic components from examples.

We have shown that a component-based classifier trained on frontal faces can deal with slight rotations in depth. The next logical step is to cover a larger range of pose changes by training the component classifiers on rotated faces. Another promising topic for further research is learning a geometrical model of the face by adding the image locations of the detected components to the input features of the geometrical configuration classifier.

References

[Beymer 93] D. J. Beymer. Face recognition under varying pose. A.I. Memo 1461, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1993.

[Brunelli & Poggio 93] R. Brunelli, T. Poggio. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (1993) 1042–1052.

[Ivanov et al. 98] Y. Ivanov, A. Bobick, J. Liu. Fast lighting independent background subtraction. Proc. IEEE Workshop on Visual Surveillance, 1998, 49–55.

[Jebara & Pentland 97] T. Jebara, A. Pentland. Parametrized structure from motion for 3D adaptive feedback tracking of faces. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 144–150.

[Jones & Rehg 99] M. J. Jones, J. M. Rehg. Statistical color models with application to skin detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, 1999, 274–280.

[Leung et al. 95] T. K. Leung, M. C. Burl, P. Perona. Finding faces in cluttered scenes using random labeled graph matching. Proc. International Conference on Computer Vision, 1995, 637–644.

[Mohan 99] A. Mohan. Object detection in images by components. A.I. Memo 1664, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1999.

[Niyogi et al. 98] P. Niyogi, F. Girosi, T. Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86 (1998) 2196–2209.

[Oren et al. 97] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio. Pedestrian detection using wavelet templates. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 193–199.

[Osuna 98] E. Osuna. Support Vector Machines: Training and Applications. Ph.D. thesis, MIT, Department of Electrical Engineering and Computer Science, Cambridge, MA, 1998.

[Rikert et al. 99] T. D. Rikert, M. J. Jones, P. Viola. A cluster-based statistical model for object detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 1046–1053.

[Rowley et al. 97] H. A. Rowley, S. Baluja, T. Kanade. Rotation Invariant Neural Network-Based Face Detection. Computer Science Technical Report CMU-CS-97-201, CMU, Pittsburgh, 1997.

[Rowley et al. 98] H. A. Rowley, S. Baluja, T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 23–38.

[Saber & Tekalp 96] E. Saber, A. Tekalp. Face detection and facial feature extraction using color, shape and symmetry based cost functions. Proc. International Conference on Pattern Recognition, Vol. 1, Vienna, 1996, 654–658.

[Schneiderman & Kanade 98] H. Schneiderman, T. Kanade. Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, 1998, 45–51.

[Sung 96] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. Ph.D. thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[Toyama et al. 99] K. Toyama, J. Krumm, B. Brumitt, B. Meyers. Wallflower: principles and practice of background maintenance. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 255–261.

[Vapnik 98] V. Vapnik. Statistical learning theory. New York: John Wiley and Sons, 1998.

[Vetter 98] T. Vetter. Synthesis of novel views from a single face. International Journal of Computer Vision 28 (1998) 103–116.

[Wiskott 95] L. Wiskott. Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis. Ph.D. thesis, Ruhr-Universität Bochum, Bochum, Germany, 1995.

[Wu et al. 99] H. Wu, Q. Chen, M. Yachida. Face detection from color images using a fuzzy pattern matching method. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 557–563.

[Yang et al. 99] M.-H. Yang, D. Roth, N. Ahuja. A SNoW-based face detector. Advances in Neural Information Processing Systems 12, 1999.