CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES
May, 2000
Abstract

We present a trainable system for detecting frontal and near-frontal views of faces in still gray images using Support Vector Machines (SVMs). We first consider the problem of detecting the whole face pattern by a single SVM classifier. In this context we compare different types of image features, present and evaluate a new method for reducing the number of features, and discuss practical issues concerning the parameterization of SVMs and the selection of training data. The second part of the paper describes a component-based method for face detection consisting of a two-level hierarchy of SVM classifiers. On the first level, component classifiers independently detect components of a face, such as the eyes, the nose, and the mouth. On the second level, a single classifier checks if the geometrical configuration of the detected components in the image matches a geometrical model of a face.
Copyright

This report describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and in the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. This research is sponsored by a grant from the Office of Naval Research under Contract No. N00014-93-1-3085 and the Office of Naval Research under Contract No. N00014-95-1-0600. Additional support is provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Benz AG, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.
1 Introduction

Over the past ten years face detection has been thoroughly studied in computer vision research for mainly two reasons. First, face detection has a number of interesting applications: it can be part of a face recognition system, a surveillance system, or a video-based computer/machine interface. Second, faces form a class of visually similar objects which simplifies the generally difficult task of object detection. In this context, detecting chairs is often mentioned as an example where the high variation within the object class leads to a nearly unsolvable detection problem. Besides the variability between individual objects of the same class, detection algorithms have to cope with variations in the appearance of a single object due to pose and illumination changes.

Most past research work on face detection focused on detecting frontal faces, thus leaving out the problem of pose invariance. Although there is still some room for improvement in frontal face detection, the key issue of current and future research seems to be pose invariance.
In the following we give a brief overview of face detection techniques. One category of systems relies on detecting skin parts in color images [Wu et al. 99, Saber & Tekalp 96]. Common techniques for skin color detection estimate the distribution of skin color in the color space using labeled training data [Jebara & Pentland 97, Jones & Rehg 99]. A major problem of skin color detection is its sensitivity to changes in the spectral composition of the lighting and to changes in the characteristics of the camera. Therefore, most systems generate hypotheses by the skin color detector and verify them by a front-end pattern classification module. Depending on the application there are other efficient ways of generating object hypotheses. In case of a static video camera and static background scenery, background subtraction [Ivanov et al. 98, Toyama et al. 99] is commonly used to detect objects.
Another category of algorithms performs face detection in still gray images. Since there are no color and motion cues available, face detection boils down to a pure pattern recognition task. One of the first systems for detecting faces in gray images combines clustering techniques with neural networks [Sung 96]. It generates face and non-face prototypes by clustering the training data, consisting of 19×19 histogram-normalized face images. The distances between an input pattern and the prototypes are classified by a Multi-Layer Perceptron. In [Osuna 98] frontal faces are detected by an SVM with polynomial kernel. A system able to deal with rotations in the image plane was proposed by [Rowley et al. 97]. It consists of two neural networks, one for estimating the orientation of the face, and another for detecting the derotated faces. The recognition step was improved [Rowley et al. 98] by arbitrating between independently trained networks of identical structure. The techniques described above have in common that their classifiers were trained on patterns of the whole face. A naive Bayesian approach was taken in [Schneiderman & Kanade 98]. The method determines the empirical probabilities of the occurrence of 16×16 intensity patterns within 64×64 face images. Assuming statistical independence between the small patterns, the probability for the whole pattern being a face is calculated as the product of the probabilities for the small patterns. Another probabilistic approach which detects small parts of faces is proposed in [Leung et al. 95]. Local feature extractors are used to detect the eyes, the corners of the mouth, and the tip of the nose. Assuming that the position of the eyes is properly determined, the geometrical configuration of the detected parts in the image is matched with a model configuration by conditional search. A related method using statistical models is published in [Rikert et al. 99]. Local features are extracted by applying multi-scale and multi-orientation filters to the input image. The responses of the filters on the training set are modeled as Gaussian distributions. In contrast to [Leung et al. 95], the configuration of the local filter responses is not matched with a geometrical model. Instead, the global consistency of the pattern is verified by analyzing features at a coarse resolution. Detecting components has also been applied to face recognition. In [Wiskott 95] local features are computed on the nodes of an elastic grid. Separate templates for the eyes, the nose and the mouth are matched in [Beymer 93, Brunelli & Poggio 93].
There are two interesting ideas behind part- or component-based detection of objects. First, some object classes can be described well by a few characteristic object parts¹ and their geometrical relation. Second, the patterns of some object parts might vary less under pose changes than the pattern belonging to the whole object. The two main problems of a component-based approach are how to choose the set of discriminatory object parts and how to model their geometrical configuration. The above mentioned approaches either manually define a set of components and model their geometrical configuration or uniformly partition the image into components and assume statistical independence between the components. In our system we started with a manually defined set of facial components and a simple geometrical model acquired from the training set. In a further step we developed a technique for automatically extracting discriminatory object parts using a database of 3-D head models.
The outline of the paper is as follows: In Chapter 2 we compare different types of image features for face detection. Chapter 3 is about feature reduction. Chapter 4 contains some experimental results on the parameterization of an SVM for face detection. Different techniques for generating training sets are discussed in Chapter 5. The first part of the paper, about face detection using a single SVM classifier, concludes in Chapter 6 with experimental results on standard test sets. Chapter 7 describes a component-based system and compares it to a whole-face detector. Chapter 8 concludes the paper.

¹ In this paper we use the expression object part both for the 3-D part of an object and the 2-D image of a 3-D object part.
2 Image Features

Figure 2: Examples of extracted features. The original gray image is shown in a), the histogram equalized image in b), the gray value gradients in c), and Haar wavelets (vertical, horizontal, and diagonal masks) generated by a single convolution mask at two different scales in d).
Gray, gray gradient and Haar wavelet features were rescaled to be in a range between 0 and 1 before they were used for training an SVM with 2nd-degree polynomial kernel. The training data consisted of 2,429 face and 19,932 non-face images. The classification performance was determined on a test set of 118 gray images with 479 frontal faces². Each image was rescaled 14 times by factors between 0.1 and 1.2 to detect faces at different scales. A 19×19 window was shifted pixel-by-pixel over each image. Overall, about 57,000,000 windows were processed. The Receiver Operator Characteristic (ROC) curves are shown in Fig. 3; they were generated by stepwise variation of the classification threshold of the SVM. Histogram-normalized gray values are the best choice. For a fixed FP rate the detection rate for gray values was about 10% higher than for Haar wavelets and about 20% higher than for gray gradients. We trained an SVM with linear kernel on the outputs of the gray/gradient and gray/wavelet classifiers to find out whether the combination of two feature sets improves the performance. For both combinations the results were about the same as for the single gray classifier.
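The multi-scale scanning procedure described above can be sketched as follows; the `classify` callable stands in for the trained SVM, and the NumPy-based nearest-neighbor rescaling is an assumption of this sketch, not the paper's implementation.

```python
import numpy as np

WINDOW = 19  # side length of the classification window in pixels

def scan_image(image, classify, scales):
    """Slide a WINDOW x WINDOW window over rescaled copies of `image`.

    `classify` maps a flattened window to a real-valued SVM output;
    windows with positive output are reported as face candidates.
    """
    detections = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        if h < WINDOW or w < WINDOW:
            continue
        # nearest-neighbor rescaling keeps the sketch dependency-free
        rows = (np.arange(h) / s).astype(int).clip(0, image.shape[0] - 1)
        cols = (np.arange(w) / s).astype(int).clip(0, image.shape[1] - 1)
        scaled = image[np.ix_(rows, cols)]
        for y in range(h - WINDOW + 1):
            for x in range(w - WINDOW + 1):
                patch = scaled[y:y + WINDOW, x:x + WINDOW]
                if classify(patch.ravel()) > 0:
                    # map the hit back to coordinates of the original image
                    detections.append((int(y / s), int(x / s), int(WINDOW / s)))
    return detections

# toy usage: a "classifier" that fires on bright windows
img = np.zeros((40, 40)); img[5:30, 5:30] = 1.0
hits = scan_image(img, lambda v: v.mean() - 0.9, scales=[0.5, 1.0])
```

Shifting the window pixel-by-pixel at every scale is what produces the roughly 57,000,000 windows reported for the 118 test images.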
Figure 3: ROC curves for SVMs with 2nd-degree polynomial kernel trained on different types of image features (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows).
² The test set is a subset of the CMU test set 1 [Rowley et al. 97] which consists of 130 images and 507 faces. We excluded 12 images containing line-drawn faces and non-frontal faces.
3 Feature Reduction

We evaluated two techniques which generate new feature sets by linearly combining the original features:

Principal Component Analysis (PCA) is a standard technique for generating a set of uncorrelated features ordered by decreasing variance. The new N-dimensional feature space is spanned by the first N principal components.

Iterative Linear Classification (ILC) determines the most class-discriminant directions in the feature space: a) a linear classifier is trained on the data and its normal vector is taken as a new direction; b) the data are projected into the subspace orthogonal to this direction and step a) is repeated. The new N-dimensional feature space is spanned by the N first directions calculated in step a). In the following experiments we used an SVM as linear classifier.
Both techniques were applied to the 283 gray value features described in Chapter 2. We downsized the previously used training and test sets in order to perform a large number of tests. The new negative training set included 4,550 samples randomly selected from the original negative training set. The positive training data remained unchanged. The new test set included all face patterns and 23,570 non-face patterns of the CMU test set 1. The non-face patterns were selected by the classifier described in Chapter 2 as the 23,570 non-face patterns which were most similar to faces. An SVM with a 2nd-degree polynomial kernel was trained on the reduced feature sets. The ROC curves are shown in Figs. 4 and 5 for PCA and ILC, respectively. The first 3 ILC features were superior to the first 3 PCA features. However, increasing the number of ILC features up to 10 did not improve the performance. This is because ILC does not generate uncorrelated features. Indeed, the 10 ILC features were highly correlated, with an average correlation of about 0.7. Increasing the number of PCA features up to 20, on the other hand, steadily improved the classification performance until it equaled the performance of the system trained on the original 283 features. Reducing the number of features to 20 sped up the classification by a factor of 14² = 196 for a 2nd-degree polynomial SVM.
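The PCA projection can be sketched with NumPy; the 283 input dimensions and the 20 retained components follow the text, while the random data and the function name are stand-ins.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the first n_components principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: right singular vectors = principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T          # (n_features, n_components)
    return (Xc @ W), W, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 283))      # stand-in for the gray-value patterns
Z, W, mean = pca_reduce(X, 20)

# a new pattern is projected with the stored directions before classification
z_new = (rng.normal(size=283) - mean) @ W
```

Reducing 283 inputs to 20 shrinks the implicit feature space of a 2nd-degree polynomial kernel roughly quadratically, which is where the reported factor of 14² = 196 comes from.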
Figure 4: ROC curves for SVMs with 2nd-degree polynomial kernel trained on PCA features (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces). The PCA has been calculated on the whole training set.
Figure 5: ROC curves for SVMs with 2nd-degree polynomial kernel trained on feature sets generated by ILC (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces).
3.2 Feature Selection

We developed a technique for selecting class-relevant features based on the decision function f(x) of an SVM:

f(x) = Σᵢ αᵢ yᵢ K(x, xᵢ) + b,   (1)

where the xᵢ are the Support Vectors, the αᵢ the Lagrange multipliers, the yᵢ the labels of the Support Vectors (-1 or 1), K(·,·) the kernel function, and b a constant. A point x is assigned to class 1 if f(x) > 0, otherwise to class -1. The kernel function K(·,·) defines the dot product in some feature space F*. If we denote the transformation from the original feature space F to F* by Φ(x), Eq. (1) can be rewritten as:

f(x) = w · Φ(x) + b,   (2)

where w = Σᵢ αᵢ yᵢ Φ(xᵢ). Note that the decision function in Eq. (2) is linear in the transformed features x* = Φ(x). For a 2nd-degree polynomial kernel with K(x, y) = (1 + x·y)², the transformed feature space F* with dimension N* = (N+3)N/2 is given by x* = (√2 x₁, √2 x₂, ..., √2 x_N, x₁², x₂², ..., x_N², √2 x₁x₂, √2 x₁x₃, ..., √2 x_{N-1}x_N).

The contribution of a feature x*_n to the decision function in Eq. (2) depends on w_n. A straightforward way to order the features is by decreasing |w_n|. Alternatively, we weighted w_n by the Support Vectors to account for different distributions of the features in the training data. The features were then ordered by decreasing |w_n Σᵢ αᵢ yᵢ x*_{i,n}|, where x*_{i,n} denotes the n-th component of Support Vector i in feature space F*. Both ways of feature ranking were applied to an SVM with 2nd-degree polynomial kernel trained on 20 PCA features, corresponding to (20+3)·20/2 = 230 features in F*. In a first evaluation of the rankings we calculated (1/M) Σᵢ |f(xᵢ) - f_S(xᵢ)| over all M Support Vectors, where f_S(x) is the decision function using the S first features according to the ranking. Note that we did not retrain the SVM on the reduced feature set. The results in Fig. 6 show that ranking by the weighted components of w leads to a faster convergence of the error towards 0. The final evaluation was done on the test set. Fig. 7 shows the ROC curves for 50, 100, and 150 features for both ways of ranking. The results confirm that ranking by the weighted components of w is superior. The ROC curve for 100 features on the test set was about the same as for the complete feature set. By combining PCA with the above described feature selection we could reduce the original (283+3)·283/2 = 40,469 features in F* to 100 features without loss in classification performance on the test set.
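The explicit feature map of the 2nd-degree polynomial kernel and the simpler of the two rankings (by decreasing |w_n|) can be sketched as follows; the toy weight vector is an assumption.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit feature map of K(x, y) = (1 + x.y)^2, constant term dropped.

    Dimension is (N + 3) * N / 2: N linear terms, N squares,
    and N * (N - 1) / 2 cross terms.
    """
    x = np.asarray(x, dtype=float)
    linear = np.sqrt(2.0) * x
    squares = x ** 2
    cross = np.array([np.sqrt(2.0) * x[i] * x[j]
                      for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([linear, squares, cross])

N = 20
assert phi(np.ones(N)).size == (N + 3) * N // 2     # 230, as in the text

# rank the transformed features by decreasing |w_n|
w = np.random.default_rng(1).normal(size=(N + 3) * N // 2)  # toy weights
ranking = np.argsort(-np.abs(w))

def f_S(x, w, b, S):
    """Decision value using only the S highest-ranked features."""
    keep = ranking[:S]
    return phi(x)[keep] @ w[keep] + b
```

Dropping the constant term of the expansion only shifts the bias b, so the map satisfies phi(x)·phi(y) = (1 + x·y)² - 1.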
Figure 6: Classifying Support Vectors with a reduced number of features. The x-axis shows the number of features; the y-axis is the mean absolute difference (1/M) Σᵢ |f(xᵢ) - f_S(xᵢ)| between the output of the SVM using all features and the same SVM using the S first features only. The features were ranked according to the components and the weighted components of the normal vector of the separating hyperplane.
Figure 7: ROC curves for feature sets of 50, 100, and 150 features, ranked by the components of w and by the weighted components of w (training: 2,429 faces, 4,550 non-faces; test: 479 faces, 23,570 non-faces).
4 Parameterization of SVMs

The choice of the classifier and its parameterization play an important role in the overall performance of a learning-based system. We chose the SVM as classifier since it is well founded in statistical learning theory [Vapnik 98] and has been successfully applied to various object detection tasks in computer vision [Oren et al. 97, Osuna 98]. An SVM is parameterized by its kernel function and by the C value which determines the penalty on constraint violations during the training process. For more detailed information about SVMs refer to [Vapnik 98].
4.1 Kernel Functions

Three common types of kernel functions were evaluated in our experiments: linear, 2nd-degree polynomial, and Gaussian kernels. All experiments were carried out on the training and test sets described in Chapter 3. The ROC curves are shown in Fig. 8. The 2nd-degree polynomial kernel seems a good compromise between computational complexity and classification performance. The SVM with Gaussian kernel (σ = 5) was slightly better but required about 1.5 times more Support Vectors (738 versus 458) than the polynomial SVM.
4.2 C-Parameter

We varied C between 0.1 and 100 for an SVM with 2nd-degree polynomial kernel. Some results are shown in Fig. 9. The detection performance slightly increases with C until C = 1. For C ≥ 1 the error rate on the training data was 0 and the decision boundary did not change any more.
Figure 8: ROC curves for SVMs with linear, 2nd-degree polynomial, and Gaussian (σ = 5) kernels (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).
Figure 9: ROC curves for SVMs with 2nd-degree polynomial kernel trained with C = 0.1, C = 0.5, and C = 1 (training: 2,429 faces, 4,450 non-faces; test: 479 faces, 23,570 non-faces).
5 Training Data

Besides selecting the input features and the classifier, choosing the training data is the third important step in developing a classification system.
5.1 Face Patterns

Extracting face patterns is usually a tedious and time-consuming task that has to be done manually. An interesting alternative is to generate artificial samples for training the classifier [Niyogi et al. 98]. In [Rowley et al. 97, Schneiderman & Kanade 98] the training set was enlarged by applying various image transformations to the original face images. We went a step further and generated a completely synthetic set of images by rendering 3-D head models [Vetter 98]. Using 3-D models for training has two interesting aspects: First, illumination and pose of the head are fully controllable and second, images can be generated automatically in large numbers by rendering the 3-D models. To create a large variety of synthetic face patterns we morphed between different head models and modified the pose and the illumination. Originally we had 7 textured head models acquired by a 3-D scanner. Additional head models were generated by 3-D morphing between all pairs of the original models. The heads were rotated between -15° and 15° in azimuth and between -8° and 8° in the image plane. The faces were illuminated by ambient light and a single directional light pointing towards the center of the face. The position of the light varied between -30° and 30° in azimuth and between 30° and 60° in elevation. Overall, we generated about 5,000 face images. The negative training set was the same as in Chapter 3. Some examples of real and synthetic faces from our training sets are shown in Fig. 10. The ROC curves for SVMs trained on real and synthetic data are shown in Fig. 11. The significant difference in performance indicates that the image variations captured in the synthetic data do not cover the variations present in real face images, most likely because our face models were too uniform: no people with beards, no differences in facial expression, no differences in skin color.
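The pose and illumination ranges above define a sampling grid for the renderer; only the ranges come from the text, while the step sizes and the enumeration below are assumptions.

```python
from itertools import product

# ranges from the text; the step sizes are an assumed sampling density
azimuths = range(-15, 16, 5)          # head rotation in depth (degrees)
in_plane = range(-8, 9, 4)            # rotation in the image plane
light_az = range(-30, 31, 15)         # light source azimuth
light_el = range(30, 61, 15)          # light source elevation

def render_params(models):
    """Enumerate (model, pose, light) tuples for synthetic face generation."""
    return [(m, a, r, la, le)
            for m, a, r, la, le in product(models, azimuths, in_plane,
                                           light_az, light_el)]

params = render_params(models=range(7))   # the 7 scanned head models
```

With these assumed steps the grid yields 3,675 parameter combinations for the 7 original models alone, on the order of the roughly 5,000 images reported once the morphed models are included.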
Figure 10: Examples of real and synthetic face images. The synthetic faces were generated by rendering 3-D head models under varying pose and illumination. The resolution of the synthetic faces was 50×50 pixels after rendering. For training the face detector we rescaled the images to 19×19 pixels.
Figure 11: ROC curves for classifiers trained on real and synthetic faces.
5.2 Non-Face Patterns

Non-face patterns are abundant and can be automatically extracted from images that do not contain faces. However, it would require a huge number of randomly selected samples to fully cover the variety of non-face patterns. Iterative bootstrapping of the system with false positives (FPs) is a way to keep the training set reasonably small by specifically picking non-face patterns that are useful for learning. Fig. 12 shows the ROC curves for an SVM trained on the 19,932 randomly selected non-face patterns and an SVM trained on an additional 7,065 non-face patterns determined in three bootstrapping iterations. At 80% detection rate the FP rate for the bootstrapped system was about 2·10⁻⁶ per classified pattern, which corresponds to 1 FP per image. Without bootstrapping, the FP rate was about 3 times higher.
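The bootstrapping procedure can be sketched as a loop; `train` and `scan` stand in for SVM training and window scanning, and the toy threshold classifier below is purely illustrative.

```python
def bootstrap(train, scan, faces, nonfaces, images, iterations=3):
    """Iteratively grow the negative set with false positives.

    train(pos, neg) -> classifier; scan(clf, image) -> list of windows
    the classifier accepted. Windows found in face-free images are
    false positives and are added to the negative training set.
    """
    neg = list(nonfaces)
    clf = train(faces, neg)
    for _ in range(iterations):
        false_positives = [w for img in images for w in scan(clf, img)]
        if not false_positives:
            break
        neg.extend(false_positives)          # keep only informative non-faces
        clf = train(faces, neg)              # retrain on the enlarged set
    return clf, neg

# toy run: "classifier" = threshold, windows = numbers, faces are large values
train = lambda pos, neg: (max(neg) if neg else 0)          # threshold
scan = lambda thr, img: [w for w in img if w > thr]        # FPs above threshold
clf, neg = bootstrap(train, scan, faces=[9, 9], nonfaces=[1, 2],
                     images=[[3, 4], [5]], iterations=3)
```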
Figure 12: ROC curves for a classifier which was trained on 19,932 randomly selected non-face patterns and for a classifier which was bootstrapped with 7,065 additional non-face patterns (training: 2,429 faces; test: 118 images, 479 faces, 56,774,966 windows).
6 Experimental Results

Table 1: Detection rates and numbers of false positives (FPs) of face detection systems on the CMU test set 1 (130 images, 507 faces): the neural network of [Sung 96], the SVM of [Osuna 98], the single and multiple neural network systems of [Rowley et al. 98], the naive Bayes classifier of [Schneiderman & Kanade 98], the multi-scale SNoW system of [Yang et al. 99], and our system. The reported detection rates range from 74.2% to 94.8% at between 9 and 738 FPs; our system reached detection rates of 85.6% at 9 FPs and 89.9% at 75 FPs.
Figure 13: ROC curve on CMU test set 1 for the bootstrapped classifier with heuristics for suppressing multiple detections.
7 Component-Based Face Detection
7.1
Until now we considered systems where the whole face pattern was classified by a single SVM. Such a global approach is highly sensitive to changes in the pose of an object. Fig. 14 illustrates the problem for the simple case of linear classification. The result of training a linear classifier on frontal faces can be represented as a single face template, schematically drawn in Fig. 14 a). Even for small rotations the template clearly deviates from the rotated faces, as shown in Fig. 14 b) and c). The component-based approach tries to avoid this problem by independently detecting parts of the face. In Fig. 15 the eyes, nose, and the mouth are represented as single templates. For small rotations the changes in the components are small compared to the changes in the whole face pattern. Slightly shifting the components is sufficient to achieve a reasonable match with the rotated faces.
Figure 14: Matching with a single template. The schematic template of a frontal face is shown in a). Slight rotations of the face in the image plane b) and in depth c) lead to considerable discrepancies between template and face.
Figure 15: Matching with a set of component templates. The schematic component templates for a frontal face are shown in a). Shifting the component templates can compensate for slight rotations of the face in the image plane b) and in depth c).
Figure 16: System overview of the component-based classifier. On the first level, windows of the size of the components (solid lined boxes) are shifted over the face image and classified by the component classifiers. On the second level, the maximum outputs of the component classifiers within predefined search regions (dotted lined boxes) are fed into the geometrical configuration classifier.

⁶ To account for changes in the size of the components, the outputs were determined over multiple scales of the input image. In our tests, we set the range of scales to [0.75, 1.2].
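The second level described in the caption reduces each component's response map to its maximum within a search region and feeds the resulting vector to a linear classifier; a minimal sketch, with the maps, regions, and weights as assumptions.

```python
import numpy as np

def second_level_features(response_maps, regions):
    """Maximum component-classifier output within each search region.

    response_maps: dict name -> 2-D array of first-level SVM outputs.
    regions: dict name -> (y0, y1, x0, x1) search rectangle per component.
    """
    return np.array([response_maps[name][y0:y1, x0:x1].max()
                     for name, (y0, y1, x0, x1) in regions.items()])

def geometric_classifier(features, w, b):
    """Linear SVM on the vector of maximum outputs (second level)."""
    return features @ w + b

# toy example with two components
maps = {"eyes": np.array([[0.1, 0.9], [0.0, 0.2]]),
        "mouth": np.array([[0.3, 0.1], [0.7, 0.0]])}
regions = {"eyes": (0, 2, 0, 2), "mouth": (0, 2, 0, 2)}
feats = second_level_features(maps, regions)
score = geometric_classifier(feats, w=np.array([1.0, 1.0]), b=-1.0)
```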
The ROC curves for CMU test set 1 are shown in Fig. 17. The component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM⁷. Up to about 90% recognition rate, the four-component system performs worse than the whole-face classifier, probably due to class-relevant parts of the face that were not covered by the four components. Therefore, we added the whole face as a fifth component, similar to the template-based face recognition system proposed in [Brunelli & Poggio 93]. As shown in Fig. 17, the five-component classifier performs similarly to the whole-face classifier. This indicates that the whole face is the most dominant of the five components.

To check the robustness of the classifiers against object rotations we performed tests on synthetic faces generated from 3-D head models. The synthetic test set consisted of two groups of 19×19 face images: 4,574 faces rotated in the image plane, and 15,865 faces rotated in depth. At each rotation angle we determined the FP rate for 90% detection rate based on the ROC curves in Fig. 17. The results in Figs. 18 and 19 show that the best performance was achieved by the five-component system. However, it deteriorated much faster with increasing rotation than the four-component system. This is not surprising since the whole face pattern changes more under rotation than the patterns of the other components.
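Determining the FP rate at a fixed 90% detection rate, as done for the rotation experiments, amounts to reading a threshold off the face scores; a sketch with toy scores, where the rounding convention is an assumption.

```python
import numpy as np

def fp_rate_at_detection(face_scores, nonface_scores, det_rate=0.9):
    """Pick the threshold achieving `det_rate` on faces; report the FP rate."""
    face_scores = np.sort(face_scores)
    # discard the lowest (1 - det_rate) fraction of face scores
    idx = int(round((1.0 - det_rate) * len(face_scores)))
    thr = face_scores[idx]
    return np.mean(np.asarray(nonface_scores) >= thr)

faces = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
nonfaces = [0.25, 0.15, 0.05, 0.0]
rate = fp_rate_at_detection(faces, nonfaces, det_rate=0.9)
```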
Figure 17: ROC curves for the whole-face classifier and the component-based classifiers on CMU test set 1 (training: 2,429 faces, 19,932 non-faces; test: 118 images, 479 faces, 56,774,966 windows). The FP rate is given per inspected window.
⁷ Alternatively we tried linear classifiers for the components and a polynomial kernel for the geometrical classifier, but the results were clearly worse.
Figure 18: Classification results for synthetic faces rotated in the image plane. The FP rate at 90% detection rate is plotted against the rotation angle for the whole-face classifier and the two component-based classifiers.
Figure 19: Classification results for synthetic faces rotated in depth.
7.3 Learning Components

In our previous experiments we manually selected the eyes, the nose and the mouth as characteristic components of a face. Although this choice is somewhat obvious, it would be more sensible to choose the components automatically based on their discriminative power and their robustness against pose changes. Moreover, for objects other than faces, it might be difficult to manually define a set of meaningful components. In the following we present two methods for learning components from examples.
The first method arbitrarily defines components and lets the geometrical configuration classifier learn to weight the components according to their relevancy. We carried out an experiment with 16 non-overlapping components of size 5×5 evenly distributed on the 19×19 face pattern (see Fig. 20). As in previous experiments the component classifiers were SVMs with 2nd-degree polynomial kernels and the geometrical configuration classifier was a linear SVM. The training errors of the component classifiers give information about the discriminative power of each component (see Fig. 21). The components 5, 8, 9, and 12 are located on the cheeks of the face. They contain only few gray value structures, which is reflected in the comparatively high error rates. Surprisingly, the components 14 and 15 around the mouth also show high error rates. This might be due to variations in the facial expression and slight misalignments of the faces in the training set.
Figure 20: The 16 non-overlapping 5×5 components evenly distributed over the 19×19 face pattern.

Figure 21: Training accuracies of the 16 component classifiers on faces and non-faces.
The second method grows components automatically, guided by a bound on the expected generalization error of the SVM:

θ = R² ‖w‖²,   (3)

where R is the radius of the smallest sphere in the feature space F* containing the Support Vectors, and ‖w‖² is the square norm of the coefficients of the SVM (see Eq. (2)). After determining θ we enlarge the component by expanding the rectangle by one pixel into one of four directions (up, down, left, right). Again, we generate training data, train an SVM and determine θ. We keep the expansion if it leads to a decrease in θ; otherwise it is rejected and an expansion into one of the remaining directions is tried. This process is continued until the expansions into all four directions lead to an increase of θ. In a preliminary experiment we applied the algorithm to three 3×3 regions located at the center of the eye, the tip of the nose and the center of the mouth. The final components are shown in Fig. 22; they were determined on about 4,500 synthetic faces (65×85 pixels, rotation in depth between -45° and 45°). The eyes (24×8 pixels) and the mouth (30×12 pixels) are similar to the manually selected components. The component located at the tip of the nose (6×4 pixels), however, is small. This indicates that the pattern around the tip of the nose strongly varies under rotation.
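The growing procedure can be sketched as a greedy loop; `train_and_bound` stands in for cropping the training images to the current rectangle, training an SVM, and computing the bound on the expected error, and the toy bound below merely mimics a preference for an eye-like 24×8 rectangle.

```python
def grow_component(rect, train_and_bound, max_steps=100):
    """Greedily expand a rectangle while the bound theta decreases.

    rect: (top, bottom, left, right); train_and_bound(rect) -> theta.
    Each step tries the four one-pixel expansions and keeps the first
    one that decreases the bound; stop when all four increase it.
    """
    theta = train_and_bound(rect)
    for _ in range(max_steps):
        top, bottom, left, right = rect
        candidates = [(top - 1, bottom, left, right),   # up
                      (top, bottom + 1, left, right),   # down
                      (top, bottom, left - 1, right),   # left
                      (top, bottom, left, right + 1)]   # right
        for cand in candidates:
            t = train_and_bound(cand)
            if t < theta:
                rect, theta = cand, t
                break
        else:
            return rect, theta          # no expansion helps: done
    return rect, theta

# toy bound: prefers rectangles of height 8 and width 24 (eye-like)
bound = lambda r: abs((r[1] - r[0]) - 8) + abs((r[3] - r[2]) - 24)
rect, theta = grow_component((10, 13, 20, 23), bound)
```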
8 Conclusion

Gray values are better input features for a face detector than Haar wavelets and gradient values. By combining PCA- with SVM-based feature selection we sped up the detection system by two orders of magnitude without loss in classification performance. Bootstrapping the classifier with non-face patterns increased the detection rate by more than 5%.
References

[Beymer 93] D. J. Beymer. Face recognition under varying pose. A.I. Memo 1461, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1993.

[Brunelli & Poggio 93] R. Brunelli, T. Poggio. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (1993) 1042-1052.

[Ivanov et al. 98] Y. Ivanov, A. Bobick, J. Liu. Fast lighting independent background subtraction. Proc. IEEE Workshop on Visual Surveillance, 1998, 49-55.

[Jebara & Pentland 97] T. Jebara, A. Pentland. Parametrized structure from motion for 3D adaptive feedback tracking of faces. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 144-150.

[Jones & Rehg 99] M. J. Jones, J. M. Rehg. Statistical color models with application to skin detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, 1999, 274-280.

[Leung et al. 95] T. K. Leung, M. C. Burl, P. Perona. Finding faces in cluttered scenes using random labeled graph matching. Proc. International Conference on Computer Vision, 1995, 637-644.

[Mohan 99] A. Mohan. Object detection in images by components. A.I. Memo 1664, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1999.

[Niyogi et al. 98] P. Niyogi, F. Girosi, T. Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86 (1998) 2196-2209.

[Oren et al. 97] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio. Pedestrian detection using wavelet templates. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 1997, 193-199.

[Osuna 98] E. Osuna. Support Vector Machines: Training and Applications. Ph.D. thesis, MIT, Department of Electrical Engineering and Computer Science, Cambridge, MA, 1998.

[Rikert et al. 99] T. D. Rikert, M. J. Jones, P. Viola. A cluster-based statistical model for object detection. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 1046-1053.

[Rowley et al. 97] H. A. Rowley, S. Baluja, T. Kanade. Rotation Invariant Neural Network-Based Face Detection. Computer Science Technical Report CMU-CS-97-201, CMU, Pittsburgh, 1997.

[Rowley et al. 98] H. A. Rowley, S. Baluja, T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 23-38.

[Saber & Tekalp 96] E. Saber, A. Tekalp. Face detection and facial feature extraction using color, shape and symmetry based cost functions. Proc. International Conference on Pattern Recognition, Vol. 1, Vienna, 1996, 654-658.

[Schneiderman & Kanade 98] H. Schneiderman, T. Kanade. Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, 1998, 45-51.

[Sung 96] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. Ph.D. thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[Toyama et al. 99] K. Toyama, J. Krumm, B. Brumitt, B. Meyers. Wallflower: principles and practice of background maintenance. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999, 255-261.

[Vapnik 98] V. Vapnik. Statistical learning theory. New York: John Wiley and Sons, 1998.

[Vetter 98] T. Vetter. Synthesis of novel views from a single face. International Journal of Computer Vision 28 (1998) 103-116.

[Wiskott 95] L. Wiskott. Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis. Ph.D. thesis, Ruhr-Universität Bochum, Bochum, Germany, 1995.

[Wu et al. 99] H. Wu, Q. Chen, M. Yachida. Face detection from color images using a fuzzy pattern matching method.

[Yang et al. 99] M.-H. Yang, D. Roth, N. Ahuja. A SNoW-based face detector. Advances in Neural Information Processing Systems 12, 1999.