
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE LABORATORY

and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL
LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1687

May, 2000

C.B.C.L. Paper No. 187

Face Detection in Still Gray Images
Bernd Heisele, Tomaso Poggio, Massimiliano Pontil

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The pathname for this publication is: ai-publications/1500-1999/AIM-1687.ps.Z

Abstract

We present a trainable system for detecting frontal and near-frontal views of faces in still gray images using Support Vector Machines (SVMs). We first consider the problem of detecting the whole face pattern by a single SVM classifier. In this context we compare different types of image features, present and evaluate a new method for reducing the number of features, and discuss practical issues concerning the parameterization of SVMs and the selection of training data. The second part of the paper describes a component-based method for face detection consisting of a two-level hierarchy of SVM classifiers. On the first level, component classifiers independently detect components of a face, such as the eyes, the nose, and the mouth. On the second level, a single classifier checks if the geometrical configuration of the detected components in the image matches a geometrical model of a face.
Copyright © Massachusetts Institute of Technology, 2000

This report describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and in the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. This research is sponsored by a grant from the Office of Naval Research under Contract No. N00014-93-1-3085 and the Office of Naval Research under Contract No. N00014-95-1-0600. Additional support is provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Benz AG, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.

1 Introduction

Over the past ten years face detection has been thoroughly studied in computer vision research for mainly two reasons. First, face detection has a number of interesting applications: it can be part of a face recognition system, a surveillance system, or a video-based computer/machine interface. Second, faces form a class of visually similar objects which simplifies the generally difficult task of object detection. In this context, detecting chairs is often mentioned as an example where the high variation within the object class leads to a nearly unsolvable detection problem. Besides the variability between individual objects of the same class, detection algorithms have to cope with variations in the appearance of a single object due to pose and illumination changes. Most past research work on face detection focused on detecting frontal faces, thus leaving out the problem of pose invariance. Although there is still some room for improvement in frontal face detection, the key issue of current and future research seems to be pose invariance.
In the following we give a brief overview of face detection techniques. One category of systems relies on detecting skin parts in color images [Wu et al. 99, Saber & Tekalp 96]. Common techniques for skin color detection estimate the distribution of skin color in the color space using labeled training data [Jebara & Pentland 97, Jones & Rehg 99]. A major problem of skin color detection is its sensitivity to changes in the spectral composition of the lighting and to changes in the characteristics of the camera. Therefore, most systems generate hypotheses by the skin color detector and verify them by a front-end pattern classification module. Depending on the application there are other efficient ways of generating object hypotheses. In case of a static video camera and a static background scenery, background subtraction [Ivanov et al. 98, Toyama et al. 99] is commonly used to detect objects.
Another category of algorithms performs face detection in still gray images. Since there are no color and motion cues available, face detection boils down to a pure pattern recognition task. One of the first systems for detecting faces in gray images combines clustering techniques with neural networks [Sung 96]. It generates face and non-face prototypes by clustering the training data consisting of 19×19 histogram normalized face images. The distances between an input pattern and the prototypes are classified by a Multi-Layer Perceptron. In [Osuna 98] frontal faces are detected by an SVM with polynomial kernel. A system able to deal with rotations in the image plane was proposed by [Rowley et al. 97]. It consists of two neural networks, one for estimating the orientation of the face, and another for detecting the derotated faces. The recognition step was improved [Rowley et al. 98] by arbitrating between independently trained networks of identical structure. The techniques described above have in common classifiers which were trained on patterns of the whole face. A naive Bayesian approach was taken in [Schneiderman & Kanade 98]. The method determines the empirical probabilities of the occurrence of 16×16 intensity patterns within 64×64 face images. Assuming statistical independence between the small patterns, the probability for the whole pattern being a face is calculated as the product of the probabilities for the small patterns. Another probabilistic approach which detects small parts of faces is proposed in [Leung et al. 95]. Local feature extractors are used to detect the eyes, the corners of the mouth, and the tip of the nose. Assuming that the position of the eyes is properly determined, the geometrical configuration of the detected parts in the image is matched with a model configuration by conditional search. A related method using statistical models is published in [Rikert et al. 99]. Local features are extracted by applying multi-scale and multi-orientation filters to the input image. The responses of the filters on the training set are modeled as Gaussian distributions. In contrast to [Leung et al. 95], the configuration of the local filter responses is not matched with a geometrical model. Instead, the global consistency of the pattern is verified by analyzing features at a coarse resolution. Detecting components has also been applied to face recognition. In [Wiskott 95] local features are computed on the nodes of an elastic grid. Separate templates for the eyes, the nose and the mouth are matched in [Beymer 93, Brunelli & Poggio 93].
There are two interesting ideas behind part- or component-based detection of objects. First, some object classes can be described well by a few characteristic object parts¹ and their geometrical relation. Second, the patterns of some object parts might vary less under pose changes than the pattern belonging to the whole object. The two main problems of a component-based approach are how to choose the set of discriminatory object parts and how to model their geometrical configuration. The approaches mentioned above either manually define a set of components and model their geometrical configuration, or uniformly partition the image into components and assume statistical independence between the components. In our system we started with a manually defined set of facial components and a simple geometrical model acquired from the training set. In a further step we developed a technique for automatically extracting discriminatory object parts using a database of 3-D head models.
The outline of the paper is as follows: In Chapter 2 we compare different types of image features for face detection. Chapter 3 is about feature reduction. Chapter 4 contains some experimental results on the parameterization of an SVM for face detection. Different techniques for generating training sets are discussed in Chapter 5. The first part of the paper, about face detection using a single SVM classifier, concludes in Chapter 6 with experimental results on standard test sets. Chapter 7 describes a component-based system and compares it to a whole face detector. Chapter 8 concludes the paper.
¹ In this paper we use the expression object part both for the 3-D part of an object and the 2-D image of a 3-D object part.

2 Extracting image features

Regarding learning, the goal of image feature extraction is to process the raw pixel data such that variations between objects of the same class (within-class variations) are reduced, while variations relevant for separating between objects of different classes (between-class variations) are kept. Sources of within-class variations are changes in the illumination, changes in the background, and different properties of the camera. In [Sung 96] three preprocessing steps were applied to the gray images to reduce within-class image variations. First, pixels close to the boundary of the 19×19 images were removed in order to eliminate parts belonging to the background. Then a best-fit intensity plane was subtracted from the gray values to compensate for cast shadows. Histogram equalization was finally applied to remove variations in the image brightness and contrast. The resulting pixel values were used as input features to the classifier. We compared these gray value features to gray value gradients and Haar wavelets. The gradients were computed from the histogram equalized 19×19 image using 3×3 x- and y-Sobel filters. Three orientation-tuned masks (see Fig. 1) at two different scales were convolved with the 19×19 image to compute the Haar wavelets. This led to a 1,740-dimensional feature vector. Examples of the three types of features are shown in Fig. 2.
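To make the wavelet computation concrete, the following sketch (Python with NumPy/SciPy; our own illustration, not the authors' code) convolves a 19×19 image with vertical, horizontal, and diagonal Haar masks at two scales and concatenates the responses. With 2×2 and 4×4 masks and "valid" convolution this choice happens to reproduce the 1,740 dimensions mentioned above, although the exact mask sizes are not specified in the text and are our assumption.

    import numpy as np
    from scipy.signal import convolve2d

    def haar_masks(s):
        # Vertical, horizontal and diagonal Haar masks of size 2s x 2s,
        # following the +1/-1 layout of Fig. 1.
        v = np.ones((2 * s, 2 * s)); v[:, :s] = -1                    # vertical
        h = np.ones((2 * s, 2 * s)); h[:s, :] = -1                    # horizontal
        d = np.ones((2 * s, 2 * s)); d[:s, s:] = -1; d[s:, :s] = -1   # diagonal
        return v, h, d

    def haar_features(img, scales=(1, 2)):
        # Convolve with every mask at every scale and concatenate the
        # responses into a single feature vector.
        feats = [convolve2d(img, m, mode='valid').ravel()
                 for s in scales for m in haar_masks(s)]
        return np.concatenate(feats)

    x = haar_features(np.random.rand(19, 19))   # toy 19x19 input: 1,740 features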
Figure 1: Convolution masks (vertical, horizontal, and diagonal) for calculating Haar wavelets.

Figure 2: Examples of extracted features. The original gray image is shown in a), the histogram equalized image in b), the gray value gradients in c), and Haar wavelets generated by a single convolution mask at two different scales in d).

Gray, gray gradient, and Haar wavelet features were rescaled to lie in a range between 0 and 1 before they were used for training an SVM with 2nd-degree polynomial kernel. The training data consisted of 2,429 face and 19,932 non-face images. The classification performance was determined on a test set of 118 gray images with 479 frontal faces². Each image was rescaled 14 times by factors between 0.1 and 1.2 to detect faces at different scales. A 19×19 window was shifted pixel-by-pixel over each image. Overall, about 57,000,000 windows were processed. The Receiver Operating Characteristic (ROC) curves, shown in Fig. 3, were generated by stepwise variation of the classification threshold of the SVM. Histogram normalized gray values were the best choice: for a fixed FP rate the detection rate for gray values was about 10% higher than for Haar wavelets and about 20% higher than for gray gradients. We trained an SVM with linear kernel on the outputs of the gray/gradient and gray/wavelet classifiers to find out whether the combination of two feature sets improves the performance. For both combinations the results were about the same as for the single gray classifier.
Figure 3: ROC curves for SVMs with 2nd-degree polynomial kernel trained on different types of image features: gray values, gradients, and Haar wavelets (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows). The detection rate is plotted against the number of false positives per inspected window.

² The test set is a subset of the CMU test set 1 [Rowley et al. 97], which consists of 130 images and 507 faces. We excluded 12 images containing line-drawn faces and non-frontal faces.

3 Feature Reduction

The goal of feature reduction is to improve the detection rate and to speed up the classification process by removing class-irrelevant features. We investigated two ways of feature reduction: a) generating a new set of features by linearly combining the original features, and b) selecting a subset of the original features.

3.1 Linear combination of features

We evaluated two techniques which generate new feature sets by linearly combining the original features:

- Principal Component Analysis (PCA) is a standard technique for generating a space of orthogonal, uncorrelated features.

- Iterative Linear Classification (ILC) determines the most class-discriminant, orthogonal features by iteratively training a linear classifier on the labeled training samples. The algorithm consists of two steps:
  a) Determine the direction for separating the two classes by training a linear classifier on the current training samples.
  b) Generate a new sample set by projecting the samples into a subspace that is orthogonal to the direction calculated in a), and continue with step a).
  The new N-dimensional feature space is spanned by the N first directions calculated in step a). In the following experiments we used an SVM as the linear classifier; a minimal code sketch is given below.
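The following is a minimal sketch of ILC (Python; scikit-learn's LinearSVC stands in for the linear SVM, which is our substitution). Each iteration trains a linear classifier, stores its normalized weight vector as a new feature direction, and projects the samples onto the orthogonal complement of that direction.

    import numpy as np
    from sklearn.svm import LinearSVC

    def ilc_directions(X, y, n_dirs=3):
        # X: samples as rows, y: labels in {-1, +1}.
        dirs, Xp = [], X.copy()
        for _ in range(n_dirs):
            w = LinearSVC(C=1.0).fit(Xp, y).coef_.ravel()   # step a)
            w /= np.linalg.norm(w)
            dirs.append(w)
            Xp = Xp - np.outer(Xp @ w, w)                   # step b): project out w
        return np.stack(dirs)

    # The reduced representation is the projection of each sample
    # onto the learned directions:
    # Z = X @ ilc_directions(X, y, n_dirs=10).T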
Both techniques were applied to the 283 gray value features described in Chapter 2. We downsized the previously used training and test sets in order to perform a large number of tests. The new negative training set included 4,550 samples randomly selected from the original negative training set. The positive training data remained unchanged. The new test set included all face patterns and 23,570 non-face patterns of the CMU test set 1. The non-face patterns were selected by the classifier described in Chapter 2 as the 23,570 non-face patterns which were most similar to faces. An SVM with a 2nd-degree polynomial kernel was trained on the reduced feature sets. The ROC curves are shown in Figs. 4 and 5 for PCA and ILC, respectively. The first 3 ILC features were superior to the first 3 PCA features. However, increasing the number of ILC features up to 10 did not improve the performance. This is because ILC does not generate uncorrelated features. Indeed, the 10 ILC features were highly correlated, with an average correlation of about 0.7. Increasing the number of PCA features up to 20, on the other hand, steadily improved the classification performance until it equaled the performance of the system trained on the original 283 features. Reducing the number of features from 283 to 20 sped up the classification by a factor of 14² = 196 for a 2nd-degree polynomial SVM, since the number of input features dropped by a factor of about 14 and the dimension of the transformed feature space of a 2nd-degree polynomial kernel grows quadratically with the number of input features.
Figure 4: ROC curves for SVMs with 2nd-degree polynomial kernel trained on PCA features (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for all 283 features and for the first 20, 10, and 3 PCA features. The PCA has been calculated on the whole training set.

Figure 5: ROC curves for SVMs with 2nd-degree polynomial kernel trained on feature sets generated by ILC (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for all features and for the first 10 and first 3 ILC features.

3.2 Selecting features

We developed a technique for selecting class-relevant features based on the decision function f(x) of an SVM:

    f(x) = Σᵢ αᵢ yᵢ K(x, xᵢ) + b,    (1)

where the xᵢ are the Support Vectors, the αᵢ the Lagrange multipliers, the yᵢ the labels of the Support Vectors (-1 or 1), K(·,·) the kernel function, and b a constant. A point x is assigned to class 1 if f(x) > 0, otherwise to class -1. The kernel function K(·,·) defines the dot product in some feature space F*. If we denote the transformation from the original feature space F to F* by Φ(x), Eq. (1) can be rewritten as:

    f(x) = w · Φ(x) + b,    (2)

where w = Σᵢ αᵢ yᵢ Φ(xᵢ). Note that the decision function in Eq. (2) is linear in the transformed features x* = Φ(x). For a 2nd-degree polynomial kernel with K(x, y) = (1 + x · y)², the transformed feature space F* has dimension N* = (N+3)N/2 and is given by:

    x* = (√2 x₁, √2 x₂, ..., √2 x_N, x₁², x₂², ..., x_N², √2 x₁x₂, √2 x₁x₃, ..., √2 x_{N−1}x_N).

The contribution of a feature x*ₙ to the decision function in Eq. (2) depends on wₙ. A straightforward way to order the features is by decreasing |wₙ|. Alternatively, we weighted wₙ by the Support Vectors to account for different distributions of the features in the training data. In this case the features were ordered by decreasing |wₙ Σᵢ yᵢ x*ᵢ,ₙ|, where x*ᵢ,ₙ denotes the n-th component of Support Vector i in feature space F*. Both ways of feature ranking were applied to an SVM with 2nd-degree polynomial kernel trained on 20 PCA features, corresponding to 230 features in F*. In a first evaluation of the rankings we calculated (1/M) Σᵢ |f(xᵢ) − f_S(xᵢ)| over all M Support Vectors, where f_S(x) is the decision function using only the S first features according to the ranking. Note that we did not retrain the SVM on the reduced feature set. The results in Fig. 6 show that ranking by the weighted components of w leads to a faster convergence of the error towards 0. The final evaluation was done on the test set. Fig. 7 shows the ROC curves for 50, 100, and 150 features for both ways of ranking. The results confirm that ranking by the weighted components of w is superior. The ROC curve for 100 features on the test set was about the same as for the complete feature set.

By combining PCA with the above described feature selection we could reduce the original (283+3)·283/2 = 40,469 features in F* to 100 features without loss in classification performance on the test set.

Figure 6: Classifying Support Vectors with a reduced number of features. The x-axis shows the number of features S (up to all 230); the y-axis is the mean absolute difference (1/M) Σᵢ |f(xᵢ) − f_S(xᵢ)| between the output of the SVM using all features and the same SVM using only the S first features. The features were ranked according to the components and the weighted components of the normal vector w of the separating hyperplane.
Figure 7: ROC curves for reduced feature sets (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces). Curves are shown for 50, 100, and 150 features ranked by the components of w, for the same numbers of features ranked by the weighted components of w, and for all 230 features.

4 Parameterization of SVMs

The choice of the classifier and its parameterization play an important role in the overall performance of a learning-based system. We chose the SVM as classifier since it is well founded in statistical learning theory [Vapnik 98] and has been successfully applied to various object detection tasks in computer vision [Oren et al. 97, Osuna 98]. An SVM is parameterized by its kernel function and by the C value, which controls the penalty for constraint violations during the training process. For more detailed information about SVMs refer to [Vapnik 98].

4.1 Kernel function

Three common types of kernel functions were evaluated in our experiments:

- Linear kernel: K(x, y) = x · y
- Polynomial kernel: K(x, y) = (1 + x · y)ⁿ, with n set to 2 and 3
- Gaussian kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)), with σ² set to 5 and 10

All experiments were carried out on the training and test sets described in Chapter 3. The ROC curves are shown in Fig. 8. The 2nd-degree polynomial kernel seems a good compromise between computational complexity and classification performance. The SVM with Gaussian kernel (σ² = 5) was slightly better but required about 1.5 times more Support Vectors (738 versus 458) than the polynomial SVM.
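For reference, the following sketch sets up the five kernels above with scikit-learn (our choice of library; the paper does not name an implementation). Note that sklearn writes the Gaussian kernel as exp(−γ‖x − y‖²), so σ² = 5 corresponds to γ = 1/(2·5), and its polynomial kernel (γ x·y + c₀)ⁿ matches (1 + x·y)ⁿ with γ = 1 and c₀ = 1.

    from sklearn.svm import SVC

    kernels = {
        'linear':       SVC(kernel='linear'),
        'poly, n=2':    SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0),
        'poly, n=3':    SVC(kernel='poly', degree=3, gamma=1.0, coef0=1.0),
        'gauss, s2=5':  SVC(kernel='rbf', gamma=1.0 / (2 * 5.0)),
        'gauss, s2=10': SVC(kernel='rbf', gamma=1.0 / (2 * 10.0)),
    }
    # for name, clf in kernels.items():
    #     clf.fit(X_train, y_train)           # X_train, y_train: hypothetical data
    #     print(name, clf.n_support_.sum())   # number of Support Vectors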
4.2 C-parameter

We varied C between 0.1 and 100 for an SVM with 2nd-degree polynomial kernel. Some results are shown in Fig. 9. The detection performance slightly increases with C until C = 1. For C ≥ 1 the error rate on the training data was 0 and the decision boundary did not change any more.

Figure 8: ROC curves for different kernel functions: linear, 2nd- and 3rd-degree polynomial, and Gaussian with σ² = 5 and σ² = 10 (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).

Figure 9: ROC curves for different values of C (0.1, 0.5, and 1; training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).

5 Training Data

Besides selecting the input features and the classifier, choosing the training data is the third important step in developing a classification system.

5.1 Positive training data

Extracting face patterns is usually tedious and time-consuming work that has to be done manually. An interesting alternative is to generate artificial samples for training the classifier [Niyogi et al. 98]. In [Rowley et al. 97, Schneiderman & Kanade 98] the training set was enlarged by applying various image transformations to the original face images. We went a step further and generated a completely synthetic set of images by rendering 3-D head models [Vetter 98]. Using 3-D models for training has two interesting aspects: first, illumination and pose of the head are fully controllable, and second, images can be generated automatically in large numbers by rendering the 3-D models. To create a large variety of synthetic face patterns we morphed between different head models and modified the pose and the illumination. Originally we had 7 textured head models acquired by a 3-D scanner. Additional head models were generated by 3-D morphing between all pairs of the original models. The heads were rotated between −15° and 15° in azimuth and between −8° and 8° in the image plane. The faces were illuminated by ambient light and a single directional light pointing towards the center of the face. The position of the light varied between −30° and 30° in azimuth and between 30° and 60° in elevation. Overall, we generated about 5,000 face images. The negative training set was the same as in Chapter 3. Some examples of real and synthetic faces from our training sets are shown in Fig. 10. The ROC curves for SVMs trained on real and synthetic data are shown in Fig. 11. The significant difference in performance indicates that the image variations captured in the synthetic data do not cover the variations present in real face images, most likely because our face models were too uniform: no people with beards, no differences in facial expression, no differences in skin color.


Figure 10: Examples of real and synthetic face images. The synthetic faces were generated by rendering 3-D head models under varying pose and illumination. The resolution of the synthetic faces was 50×50 pixels after rendering. For training the face detector we rescaled the images to 19×19 pixels.

Figure 11: ROC curves for classifiers trained on real and synthetic faces (training: 2,429 real faces, 4,536 synthetic faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows).

5.2 Negative training data

Non-face patterns are abundant and can be automatically extracted from images that do not contain faces. However, it would require a huge number of randomly selected samples to fully cover the variety of non-face patterns. Iterative bootstrapping of the system with false positives (FPs) is a way to keep the training set reasonably small by specifically picking non-face patterns that are useful for learning. Fig. 12 shows the ROC curves for an SVM trained on the 19,932 randomly selected non-face patterns and an SVM trained on an additional 7,065 non-face patterns determined in three bootstrapping iterations. At an 80% detection rate the FP rate for the bootstrapped system was about 2×10⁻⁶ per classified pattern, which corresponds to 1 FP per image. Without bootstrapping, the FP rate was about 3 times higher.
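A minimal sketch of this bootstrapping loop follows (Python; extract_windows is a hypothetical helper that yields candidate windows as feature vectors, and the face-free image set is assumed given):

    import numpy as np
    from sklearn.svm import SVC

    def bootstrap(faces, nonfaces, face_free_images, iterations=3):
        X = np.vstack([faces, nonfaces])
        y = np.r_[np.ones(len(faces)), -np.ones(len(nonfaces))]
        clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        for _ in range(iterations):
            # Collect windows the current classifier wrongly accepts
            # as faces (false positives).
            fps = [w for img in face_free_images
                     for w in extract_windows(img)        # hypothetical helper
                     if clf.decision_function(w[None, :])[0] > 0]
            if not fps:
                break
            X = np.vstack([X, np.array(fps)])
            y = np.r_[y, -np.ones(len(fps))]
            clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        return clf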
Figure 12: ROC curves for a classifier which was trained on 19,932 randomly selected non-face patterns and for a classifier which was bootstrapped with 7,065 additional non-face patterns (training: 2,429 faces; test: 118 images, 479 faces, 56,774,966 windows).


6 Results and comparison to other face detection systems

There are two sets of gray images provided by CMU [Rowley et al. 98] which are commonly used for evaluating face detection systems [Sung 96, Yang et al. 99, Osuna 98, Rowley et al. 98, Schneiderman & Kanade 98]. These test sets provide a good basis for comparisons between face detection systems. However, the use of different training data and different heuristics for suppressing false positives complicates comparisons.

To achieve competitive detection results we further enlarged the previously used positive and negative training sets and also implemented heuristics for suppressing multiple detections at nearby image locations. An SVM with 2nd-degree polynomial kernel was trained on histogram equalized 19×19 images of 10,038 faces and 36,220 non-faces. The positive training set consisted of 5,813 real faces and 4,225 synthetic faces. The synthetic faces were generated from a subset of the real faces by rotating them between −2° and 2° and changing their aspect ratio between 0.9 and 1.1. The negative training set was generated from an initial set of 19,932 randomly selected non-face patterns and an additional 16,288 non-face patterns determined in six bootstrapping iterations. For testing, each test image was rescaled 14 times by factors between 0.1 and 1.2. A 19×19 window was shifted pixel-by-pixel over each image. We applied two heuristics to remove multiple detections at nearby image locations. First, a detection was suppressed if there was at least one detection with a higher SVM output value in its neighborhood. The neighborhood in the image plane was defined as a 19×19 box around the center of the detection; the neighborhood in scale space was set to [0.5, 2]. Second, we counted the number of detections within the neighborhood; if there were fewer than three detections, the detection was suppressed.
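The two heuristics can be written compactly as follows (a sketch; a detection is represented here as an (x, y, scale, output) tuple, and the neighborhood test mirrors the 19×19 box and the [0.5, 2] scale range described above):

    def neighbors(d, e):
        # Same neighborhood: centers within a 19x19 box and
        # scale ratio within [0.5, 2].
        return (abs(d[0] - e[0]) <= 9 and abs(d[1] - e[1]) <= 9
                and 0.5 <= d[2] / e[2] <= 2.0)

    def suppress_multiple_detections(detections, min_count=3):
        kept = []
        for d in detections:
            nb = [e for e in detections if neighbors(d, e)]
            if any(e[3] > d[3] for e in nb):    # heuristic 1: stronger neighbor
                continue
            if len(nb) < min_count:             # heuristic 2: too few detections nearby
                continue
            kept.append(d)
        return kept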
The results of our classifier are shown in Fig. 13 and compared to other results in Table 1. Our system outperforms a previous SVM-based face detector [Osuna 98] due to a larger training set and improvements in suppressing multiple detections. The results achieved by the naive Bayes classifier [Schneiderman & Kanade 98] and the SNoW-based face detector [Yang et al. 99] are better than our results. However, it is not clear which heuristics were used in these systems to suppress multiple detections and how these heuristics affected the results.


System                                      | Subset of test set 1     | Test set 1
                                            | (23 images, 155 faces)   | (130 images, 507 faces)
                                            | Det. Rate   FPs          | Det. Rate   FPs
--------------------------------------------+--------------------------+------------------------
[Sung 96] Neural Network                    | 84.6%       13           | N/A         N/A
[Osuna 98] SVM                              | 74.2%       20           | N/A         N/A
[Rowley et al. 98] Single neural network    | N/A         N/A          | 90.9%       738
[Rowley et al. 98] Multiple neural networks | 84.5%                    | 84.4%       79
[Schneiderman & Kanade 98]³ Naive Bayes     | 91.1%       12           | 90.5%       33
[Yang et al. 99]⁴ SNoW, multi-scale         | 94.1%                    | 94.8%       78
Our system⁵                                 | 84.7%       11           | 85.6%       9
                                            | 90.4%       26           | 89.9%       75

Table 1: Comparison between face detection systems.


Figure 13: ROC curves for the bootstrapped classifier with heuristics for suppressing multiple detections (training: 10,038 faces, 36,220 non-faces). Curves are shown for the subset of test set 1 (23 images, 157 faces, 7,628,845 windows) and for test set 1 (118 images, 479 faces, 56,774,966 windows); the detection rate is plotted against the number of false positives.
³ Five images of hand-drawn faces were excluded from test set 1.
⁴ Images of hand-drawn faces and cartoon faces were excluded from test set 1.
⁵ Twelve images containing line-drawn faces, cartoon faces and non-frontal faces were excluded from test set 1.


7 Component-based face detection

7.1 Motivation

Until now we considered systems where the whole face pattern was classified by a single SVM. Such a global approach is highly sensitive to changes in the pose of an object. Fig. 14 illustrates the problem for the simple case of linear classification. The result of training a linear classifier on frontal faces can be represented as a single face template, schematically drawn in Fig. 14 a). Even for small rotations the template clearly deviates from the rotated faces, as shown in Fig. 14 b) and c). The component-based approach tries to avoid this problem by independently detecting parts of the face. In Fig. 15 the eyes, the nose, and the mouth are represented as single templates. For small rotations the changes in the components are small compared to the changes in the whole face pattern. Slightly shifting the components is sufficient to achieve a reasonable match with the rotated faces.

Figure 14: Matching with a single template. The schematic template of a frontal face is shown in a). Slight rotations of the face in the image plane b) and in depth c) lead to considerable discrepancies between template and face.

Figure 15: Matching with a set of component templates. The schematic component templates for a frontal face are shown in a). Shifting the component templates can compensate for slight rotations of the face in the image plane b) and in depth c).


7.2 Component-based classifier

An overview of our two-level component-based classifier is shown in Fig. 16. A similar architecture was used for people detection [Mohan 99]. On the first level, component classifiers independently detect the eyes (9×7 pixels), the nose (9×11 pixels) and the mouth (13×7 pixels). Each component classifier was trained on a set of manually extracted facial components and a set of randomly selected non-face patterns. The components were extracted from the same set of 2,429 real face images as used in previous experiments.

On the second level, the geometrical configuration classifier performs the final face detection by combining the results of the component classifiers. Given a 19×19 image, the maximum outputs of the eyes, nose, and mouth classifiers within rectangular search regions⁶ around the expected positions of the components are used as inputs to the geometrical configuration classifier. The search regions were calculated from the mean and standard deviation of the components' locations in the training images.
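A sketch of the second level follows (Python; component_clfs, search_regions, patches_in, and geometrical_clf are hypothetical placeholders): for each component, the maximum first-level SVM output inside that component's search region is computed, and the resulting vector of maxima is classified by the linear geometrical configuration SVM.

    import numpy as np

    def second_level_features(window, component_clfs, search_regions):
        feats = []
        for clf, region in zip(component_clfs, search_regions):
            # Maximum component-classifier output over all positions of
            # the component window inside its search region.
            outputs = [clf.decision_function(p.ravel()[None, :])[0]
                       for p in patches_in(window, region)]   # hypothetical helper
            feats.append(max(outputs))
        return np.array(feats)

    # is_face = geometrical_clf.decision_function(
    #     second_level_features(window, component_clfs, search_regions)[None, :]) > 0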
Figure 16: System overview of the component-based classifier. On the first level, windows of the size of the components (solid boxes) are shifted over the face image and classified by the eye, nose, and mouth component classifiers. On the second level, the maximum outputs of the component classifiers within predefined search regions (dotted boxes) are fed into the geometrical configuration classifier.

⁶ To account for changes in the size of the components, the outputs were determined over multiple scales of the input image. In our tests, we set the range of scales to [0.75, 1.2].


The ROC curves for CMU test set 1 are shown in Fig. 17. The component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM⁷. Up to about a 90% recognition rate, the four-component system performs worse than the whole face classifier, probably due to class-relevant parts of the face that are not covered by the four components. We therefore added the whole face as a fifth component, similar to the template-based face recognition system proposed in [Brunelli & Poggio 93]. As shown in Fig. 17, the five-component classifier performs similarly to the whole face classifier. This indicates that the whole face is the most dominant of the five components.

To check the robustness of the classifiers against object rotations we performed tests on synthetic faces generated from 3-D head models. The synthetic test set consisted of two groups of 19×19 face images: 4,574 faces rotated in the image plane, and 15,865 faces rotated in depth. At each rotation angle we determined the FP rate at a 90% detection rate based on the ROC curves in Fig. 17. The results in Figs. 18 and 19 show that for small rotations the best performance was achieved by the five-component system. However, it deteriorated much faster with increasing rotation than the four-component system. This is not surprising since the whole face pattern changes more under rotation than the patterns of the other components.
Figure 17: ROC curves for frontal faces: whole face classifier versus component-based classifiers with components eyes, nose, mouth and with components eyes, nose, mouth, face (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows).

⁷ Alternatively, we tried linear classifiers for the components and a polynomial kernel for the geometrical classifier, but the results were clearly worse.


Figure 18: Classification results for synthetic faces rotated in the image plane (training: 2,429 faces, 19,932 non-faces; test: 4,574 synthetic images). The false positive rate per window at a 90% detection rate is plotted against the rotation angle for the whole face classifier and the two component-based classifiers.

Figure 19: Classification results for synthetic faces rotated in depth (training: 2,429 faces, 19,932 non-faces; test: 15,865 synthetic images). The false positive rate per window at a 90% detection rate is plotted against the rotation angle (up to 30°) for the whole face classifier and the two component-based classifiers.



7.3 Determining the components: preliminary results

In our previous experiments we manually selected the eyes, the nose and the mouth as characteristic components of a face. Although this choice is somewhat obvious, it would be more sensible to choose the components automatically based on their discriminative power and their robustness against pose changes. Moreover, for objects other than faces, it might be difficult to manually define a set of meaningful components. In the following we present two methods for learning components from examples.

The first method arbitrarily defines components and lets the geometrical configuration classifier learn to weight the components according to their relevance. We carried out an experiment with 16 non-overlapping components of size 5×5 evenly distributed over the 19×19 face pattern (see Fig. 20). As in previous experiments, the component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM. The training errors of the component classifiers give information about the discriminative power of each component (see Fig. 21). Components 5, 8, 9, and 12 are located on the cheeks of the face. They contain only little gray value structure, which is reflected in the comparatively high error rates. Surprisingly, components 14 and 15 around the mouth also show high error rates. This might be due to variations in the facial expression and slight misalignments of the faces in the training set.
Figure 20: Partitioning the face pattern into 16 non-overlapping components.


An alternative to using a large set of arbitrary components is to specifically generate discriminative components. Following this idea, we developed a second method that automatically determines rectangular components in a set of synthetic face images. The algorithm starts with a small rectangular component located around a pre-selected point in the face (e.g. the center of the left eye)⁸. The component is extracted from all synthetic face images to build a training set of positive examples.

⁸ We could locate the same facial point in all face images since we knew the point-by-point correspondences between the 3-D head models.


Figure 21: Training results for the 16 component classifiers: the recognition rate on the training data is shown for faces and for non-faces, per component.


We also generate a training set of non-face patterns that have the same rectangular shape as the component. After training an SVM on the component data we estimate the performance of the SVM according to its leave-one-out error [Vapnik 98]:

    ρ = R² ‖w‖²,    (3)

where R is the radius of the smallest sphere in the feature space F* containing the Support Vectors, and ‖w‖² is the square norm of the coefficients of the SVM (see Eq. (2)). After determining ρ we enlarge the component by expanding the rectangle by one pixel into one of the four directions (up, down, left, right). Again, we generate training data, train an SVM and determine ρ. We keep the expansion if it leads to a decrease in ρ; otherwise it is rejected and an expansion into one of the remaining directions is tried. This process is continued until the expansions into all four directions lead to an increase of ρ; a minimal code sketch is given below. In a preliminary experiment we applied the algorithm to three 3×3 regions located at the center of the eye, the tip of the nose and the center of the mouth. The final components, shown in Fig. 22, were determined on about 4,500 synthetic faces (65×85 pixels, rotation in depth between −45° and 45°)⁹. The eyes (24×8 pixels) and the mouth (30×12 pixels) are similar to the manually selected components. The component located at the tip of the nose (6×4 pixels), however, is small. This indicates that the pattern around the tip of the nose strongly varies under rotation.

⁹ In our experiments we replaced R² in Eq. (3) by the dimensionality N* of the space F*. This is because our data points lay within an N*-dimensional cube of side length 1, so the smallest sphere containing the data had radius equal to √N*/2. The approximation was mainly for computational reasons, since in order to compute R we need to solve an optimization problem [Osuna 98].
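The following sketch illustrates the growing procedure (Python; make_training_set and expand are hypothetical helpers, and R² is replaced by the dimensionality N* of F* as in footnote 9):

    import numpy as np
    from sklearn.svm import SVC

    def rho(rect, make_training_set):
        # Train on the current rectangle and evaluate rho ~ N* * ||w||^2.
        X, y = make_training_set(rect)          # hypothetical helper
        clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0).fit(X, y)
        n = X.shape[1]
        n_star = (n + 3) * n / 2                # dimension of F*, see Section 3.2
        a = clf.dual_coef_.ravel()              # alpha_i * y_i
        K = (1 + clf.support_vectors_ @ clf.support_vectors_.T) ** 2
        return n_star * (a @ K @ a)             # ||w||^2 = a^T K a

    def grow_component(rect, make_training_set):
        best = rho(rect, make_training_set)
        improved = True
        while improved:
            improved = False
            for direction in ('up', 'down', 'left', 'right'):
                cand = expand(rect, direction)  # hypothetical: one pixel outward
                r = rho(cand, make_training_set)
                if r < best:                    # keep expansions that decrease rho
                    rect, best, improved = cand, r, True
        return rect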

Figure 22: Automatically generated components.

8 Conclusion and future work

We presented and compared two systems for frontal and near-frontal face detection: a whole face detection system and a component-based detection system. Both systems are trained from examples and use SVMs as classifiers. The first system detects the whole face pattern with a single SVM. In contrast, the component-based system performs the detection by means of a two-level hierarchy of classifiers. On the first level, the component classifiers independently detect parts of the face, such as the eyes, the nose, and the mouth. On the second level, the geometrical configuration classifier combines the results of the component classifiers and performs the final detection step. In addition to the whole face and component-based face detection approaches we presented a number of experiments on image feature selection, feature reduction and selection of training data. The main points of the paper are as follows:

- Gray values are better input features for a face detector than Haar wavelets and gradient values.
- By combining PCA- with SVM-based feature selection we sped up the detection system by two orders of magnitude without loss in classification performance.
- Bootstrapping the classifier with non-face patterns increased the detection rate by more than 5%.
- We developed a component-based face detector which is more robust against face rotations than a comparable whole face detector.
- We proposed a technique for learning characteristic components from examples.

We have shown that a component-based classifier trained on frontal faces can deal with slight rotations in depth. The next logical step is to cover a larger range of pose changes by training the component classifiers on rotated faces. Another promising topic for further research is learning a geometrical model of the face by adding the image locations of the detected components to the input features of the geometrical configuration classifier.

References

[Beymer 93] D. J. Beymer. Face recognition under varying pose. A.I. Memo 1461, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1993.

[Brunelli & Poggio 93] R. Brunelli, T. Poggio. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (1993) 1042–1052.

[Ivanov et al. 98] Y. Ivanov, A. Bobick, J. Liu. Fast lighting independent background subtraction. Proc. IEEE Workshop on Visual Surveillance, 1998, 49–55.

[Jebara & Pentland 97] T. Jebara, A. Pentland. Parametrized structure from motion for 3D adaptive feedback tracking of faces. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 144–150.

[Jones & Rehg 99] M. J. Jones, J. M. Rehg. Statistical color models with application to skin detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, 1999, 274–280.

[Leung et al. 95] T. K. Leung, M. C. Burl, P. Perona. Finding faces in cluttered scenes using random labeled graph matching. Proc. International Conference on Computer Vision, 1995, 637–644.

[Mohan 99] A. Mohan. Object detection in images by components. A.I. Memo 1664, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1999.

[Niyogi et al. 98] P. Niyogi, F. Girosi, T. Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86 (1998) 2196–2209.

[Oren et al. 97] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio. Pedestrian detection using wavelet templates. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 193–199.

[Osuna 98] E. Osuna. Support Vector Machines: Training and Applications. Ph.D. thesis, MIT, Department of Electrical Engineering and Computer Science, Cambridge, MA, 1998.

[Rikert et al. 99] T. D. Rikert, M. J. Jones, P. Viola. A cluster-based statistical model for object detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 1046–1053.

[Rowley et al. 97] H. A. Rowley, S. Baluja, T. Kanade. Rotation Invariant Neural Network-Based Face Detection. Computer Science Technical Report CMU-CS-97-201, CMU, Pittsburgh, 1997.

[Rowley et al. 98] H. A. Rowley, S. Baluja, T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 23–38.

[Saber & Tekalp 96] E. Saber, A. Tekalp. Face detection and facial feature extraction using color, shape and symmetry based cost functions. Proc. International Conference on Pattern Recognition, Vol. 1, Vienna, 1996, 654–658.

[Schneiderman & Kanade 98] H. Schneiderman, T. Kanade. Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, 1998, 45–51.

[Sung 96] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. Ph.D. thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[Toyama et al. 99] K. Toyama, J. Krumm, B. Brumitt, B. Meyers. Wallflower: principles and practice of background maintenance. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 255–261.

[Vapnik 98] V. Vapnik. Statistical learning theory. New York: John Wiley and Sons, 1998.

[Vetter 98] T. Vetter. Synthesis of novel views from a single face. International Journal of Computer Vision 28 (1998) 103–116.

[Wiskott 95] L. Wiskott. Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis. Ph.D. thesis, Ruhr-Universität Bochum, Bochum, Germany, 1995.

[Wu et al. 99] H. Wu, Q. Chen, M. Yachida. Face detection from color images using a fuzzy pattern matching method. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 557–563.

[Yang et al. 99] M.-H. Yang, D. Roth, N. Ahuja. A SNoW-based face detector. Advances in Neural Information Processing Systems 12, 1999.