
21st International Conference on Pattern Recognition (ICPR 2012)

November 11-15, 2012. Tsukuba, Japan

Facial Expression Classification on Web Images

Matthias Richter¹, Tobias Gehrig¹, Hazm Kemal Ekenel¹,²

¹ Karlsruhe Institute of Technology, Institute for Anthropomatics, Karlsruhe, Germany
² Istanbul Technical University, Faculty of Computer and Informatics, Istanbul, Turkey
matthias.richter@student.kit.edu, tobias.gehrig@kit.edu, ekenel@kit.edu

This study is funded by the Concept for the Future of Karlsruhe Institute of Technology within the framework of the German Excellence Initiative.

Abstract

In this paper, we present a novel database which is obtained from the web. It contains 4761 manually labeled images of seven basic expressions performed by a large number of subjects of different gender, age and ethnicity. Furthermore, we develop feature descriptors based on the discrete cosine transform (DCT), local binary patterns (LBP), and Gabor filters, which share a uniform formulation in terms of regions around key points. We explore several strategies to find an optimal selection of these key points. The system achieves 86.2%, 85.9% and 84.4% accuracy on the web image database using the Gabor, LBP, and DCT descriptors, respectively.

1 Introduction

Facial expression is one of the most important channels in human-to-human communication. At a low level, it conveys internal affective states, which are often too complex to verbalize. Higher, more conscious levels enable facial expression to emphasize and even change the meaning of the spoken word. Computerized access to this channel of communication opens up a vast number of applications in health care [2], education [10], human-computer interaction [3], market research [11] and entertainment [9]. Unsurprisingly, human facial expression has attracted a great deal of attention from the scientific community. Darwin [4] proposed that natural selection leads to the evolution of emotion and deduced the existence of universal emotions. Ekman [7] confirmed this hypothesis by identifying six basic emotions (anger, disgust, fear, joy, sadness and surprise) that show the same facial expression across different cultures. In this paper, we present a face database that contains facial expressions that correspond to these universal emotions. The images are collected from the web and include a large number of subjects of different gender, age and ethnicity.

Two frequently used datasets for facial expression analysis are the Cohn-Kanade [8] and GEMEP-FERA [1] databases. Both contain image sequences showing one or more facial expressions performed by several subjects with varying intensities. However, they lack in certain aspects. [8] only includes subjects aged 18 to 30 years, while in [1] there are just 10 subjects in total. Furthermore, both databases exhibit laboratory conditions, which may result in models that represent real world use cases poorly. Recently, Dhall et al. [5] addressed this issue by sampling static images from 37 different movies using subtitle information and a manual verification process. Unfortunately, with only 700 images the database is relatively small. Due to the extraction from a video stream, some of the facial images are very blurred or of low resolution. Others do not seem to show the target expression at all. Filtering these images reduces the number of samples further.

The importance of automatic classification of facial expressions has led to several studies. Littlewort et al. [9] and Shan et al. [12] showed that AdaBoost-selected features for expression recognition can significantly boost recognition performance. In [9], facial descriptors are built by convolving face images with a Gabor filter family. Using AdaBoost, the most discriminative of all available filter responses are selected and then classified using support vector machines (SVM). This AdaSVM approach outperforms classifying the complete filter response both in discriminative quality and speed. Because of the relatively high computational cost of Gabor filters, [12] proposes feature descriptors based on local binary patterns (LBP). Instead of selecting sub-features, AdaBoost is used to select significant regions for LBP histogram extraction. While requiring less computation time and being more compact, the resulting descriptors outperform traditional grid-based histogram extraction in different classification schemes such as template matching and SVM classification.



Motivated by these results, we develop feature descriptors based on the discrete cosine transform (DCT), LBP, and Gabor filters. We provide a unifying formulation in terms of regions around key points. Several key point selection strategies are compared using our web image database. We show that the LBP and DCT descriptors benefit from key point selection, while the Gabor descriptor does not. Furthermore, selecting too many regions has a deteriorating effect on recognition performance regardless of the feature extraction method.

Table 1. Number of samples per expression in the database.

Expression   Samples      Expression   Samples
Anger        648          Disgust      368
Fear         288          Joy          2185
Sadness      327          Surprise     557
Neutral      388

2 Web Image Database

We have taken several points into consideration while building the dataset. The corpus should contain a large number of images of all basic expressions and the neutral face. It should, furthermore, contain a large number of male and female subjects from different age groups and ethnicities. Since personal web sites, stock photography agencies and services like Facebook and flickr offer an enormous amount of partially tagged pictures, the dataset can be built using web images. Google Images offers a searchable index of these resources and was used to prepare an initial selection of images.

We used two-word combinations as search terms. One word was selected from a list of up to eight synonyms of the adjectives describing the target expression, e.g. angry and aggravated. The other was the term face or person, or identified the subject's age and gender. This produced 440 possible combinations, such as frightened man, which were used to register more than 80000 images and corresponding meta data, e.g. URL, containing document and matched text. This initial selection contained a large number of near-duplicate images and suffered from other defects: While the first few results of a given query were accurate, the number of false positives generally increased when more items were taken into account. In addition, some images contained watermarks that covered the face in a way that prevented recognition of the facial expression. Less frequent defects included images of only partial faces, heavily distorted images and pictures that did not contain human faces at all.
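The query generation can be sketched as a simple cross product of synonym lists and subject terms. The lists below are illustrative placeholders, not the authors' actual search vocabulary; only the two-word combination scheme itself is taken from the text.

```python
from itertools import product

# Illustrative (not the authors' actual) synonym and subject-descriptor lists.
EXPRESSION_SYNONYMS = {
    "anger": ["angry", "aggravated", "furious"],
    "fear":  ["frightened", "scared", "terrified"],
    "joy":   ["happy", "joyful", "smiling"],
    # ... remaining expressions and synonyms omitted
}
SUBJECT_TERMS = ["face", "person", "man", "woman", "boy", "girl", "old man", "old woman"]

def build_queries():
    """Combine one expression synonym with one subject term, e.g. 'frightened man'."""
    queries = []
    for expression, synonyms in EXPRESSION_SYNONYMS.items():
        for adjective, subject in product(synonyms, SUBJECT_TERMS):
            queries.append((expression, f"{adjective} {subject}"))
    return queries

if __name__ == "__main__":
    for label, query in build_queries()[:5]:
        print(label, "->", query)
```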
To discard these images, a four-step filtering process was employed. First, images were flagged according to whether they show the desired expression or not. False positives were removed after all images had been labeled. In the second step, faces were marked using rotated bounding boxes, and previously undetected false positives were discarded. Third, using the bounding boxes and the DCT feature described in Sec. 3, near-duplicate images were detected by computing the L1-distance between the feature vector of the current face and the feature vectors of all other faces in the same class. Images to discard had to be selected manually. This allowed keeping the picture with the best image quality and prevented discarding false duplicates. In the last step, an eye detector was used to semi-automatically mark the eye center positions. User intervention was only prompted when the deviation between the marked bounding box rotation and the rotation computed from the eye centers exceeded a threshold.
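A minimal sketch of the near-duplicate check, assuming the per-face DCT descriptors of Sec. 3 have already been computed; the distance threshold is a hypothetical value, since the paper states that discards were ultimately confirmed by hand.

```python
import numpy as np

def find_near_duplicates(features, threshold=0.5):
    """Flag candidate near-duplicate pairs within one expression class.

    `features` is an (N, D) array of per-face DCT descriptors; pairs whose
    L1 distance falls below `threshold` (a hypothetical value) are returned
    for manual review, mirroring the manual selection step described above.
    """
    candidates = []
    for i in range(len(features)):
        # L1 distance between face i and every other face in the class
        dists = np.abs(features - features[i]).sum(axis=1)
        for j in np.flatnonzero(dists < threshold):
            if j > i:
                candidates.append((i, int(j), float(dists[j])))
    return candidates
```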

The filtering steps reduced more than 80000 pictures to a database of 4761 labeled faces with associated meta data. The images show all basic expressions as well as the neutral expression, performed by a large number of subjects of different gender, age and ethnicity under varying lighting conditions. While the head pose is not standardized, most of the images show full or near full profile faces. Since a large portion of the images originated from stock photo services, most of the expressions are artificial and not spontaneous. In some cases watermarks partially occlude the facial region, but not in a way that prevents humans from recognizing the expression. As shown in Table 1, the database is not balanced, which has to be taken into account when training a classifier on the dataset. The database information, i.e. the image URLs, marked eye positions and other meta data, can be obtained from:
http://face.cs.kit.edu/datasets/web-expressions.

3 Facial Expression Classification

The web image database was used to evaluate different feature description and classification methods. We randomly selected a subset of images, so that there was an equal number of samples for each expression, and divided it into three equally sized parts. In six-fold experiments, each part was used for either key point selection, classifier training or evaluation.

3.1 Facial Descriptors

The facial descriptors are formulated in terms of square pixel regions with side length 2r around key points $p_i = (x_i, y_i)$, i.e.

$R_i = \{(x, y) : -r \le x - x_i < r,\; -r \le y - y_i < r\}$.

Input images were converted to gray scale and registered, so that the distance between the eye centers became 48 pixels (px). A $(96 + 2r) \times (96 + 2r)$ px facial region was extracted, so that the distance between the left eye center and the top and left image borders was $(24 + r)$ px. The image histogram was equalized to account for changes in lighting conditions.
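A possible registration and region-extraction routine, assuming OpenCV and manually marked eye centers; the helper names, interpolation setting and kernel of the warp are our own choices, not taken from the paper.

```python
import cv2
import numpy as np

def register_face(gray, left_eye, right_eye, r):
    """Rotate/scale so the eye distance is 48 px and crop a (96+2r) x (96+2r) region.

    `left_eye` and `right_eye` are (x, y) pixel coordinates; the left eye ends
    up at (24 + r, 24 + r), as described in the text. Helper written by us.
    """
    size = 96 + 2 * r
    dx, dy = np.subtract(right_eye, left_eye)
    angle = np.degrees(np.arctan2(dy, dx))      # in-plane rotation of the eye line
    scale = 48.0 / np.hypot(dx, dy)             # eye distance -> 48 px
    M = cv2.getRotationMatrix2D(tuple(map(float, left_eye)), angle, scale)
    # translate so the left eye lands at (24 + r, 24 + r)
    M[0, 2] += (24 + r) - left_eye[0]
    M[1, 2] += (24 + r) - left_eye[1]
    face = cv2.warpAffine(gray, M, (size, size), flags=cv2.INTER_LINEAR)
    return cv2.equalizeHist(face)               # lighting normalization

def extract_region(face, key_point, r):
    """Return the 2r x 2r square region R_i around key point p_i = (x_i, y_i)."""
    x, y = key_point
    return face[y - r:y + r, x - r:x + r]
```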
Formulating the facial feature descriptor of [6] in terms of $R_i$ yields the DCT descriptor. The frequency components in each region $R_i$ were computed using the DCT and ordered by applying a zig-zag scan. The lower and higher frequency components were discarded by dropping the first $k_1$ coefficients and collecting the next $k_2$ coefficients into the block feature $f_i = (f_i^{k_1+1}, \ldots, f_i^{k_1+k_2})$. To balance the impact of the individual regions on the overall classification result, the block vectors were normalized to $\|f_i\|_2 = 1$. The feature vector $F_{DCT}$ was formed by concatenating the block features $f_i$. Throughout the experiments we used $k_1 = 1$ and $k_2 = 10$.
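A compact sketch of the DCT block feature, assuming SciPy's `dctn`; the zig-zag ordering below is a generic implementation, not the authors' exact code.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(n):
    """(row, col) index pairs of an n x n block, ordered along anti-diagonals
    with alternating direction (a zig-zag scan)."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda ij: (ij[0] + ij[1],
                                  ij[1] if (ij[0] + ij[1]) % 2 else ij[0]))

def dct_block_feature(region, k1=1, k2=10):
    """Block feature f_i: drop the first k1 zig-zag coefficients, keep the next
    k2, and normalize to unit L2 norm (a sketch of the descriptor above)."""
    coeffs = dctn(region.astype(np.float64), norm="ortho")
    scan = np.array([coeffs[i, j] for i, j in zigzag_indices(region.shape[0])])
    block = scan[k1:k1 + k2]
    norm = np.linalg.norm(block)
    return block / norm if norm > 0 else block

def dct_descriptor(face, key_points, r=6, k1=1, k2=10):
    """Concatenate the block features of all regions R_i into F_DCT."""
    blocks = [dct_block_feature(face[y - r:y + r, x - r:x + r], k1, k2)
              for x, y in key_points]
    return np.concatenate(blocks)
```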
Following [12], the second feature descriptor employs uniform LBP. Input images were processed by the $LBP^{u2}$ operator with $R = 1$ and $P = 8$. The feature vector $F_{LBP}$ is the concatenation of the histograms of labels in the regions $R_i$. Note that [12] uses an $LBP^{u2}$ operator with $P = 8$ and $R = 2$, which focuses on larger-scale structures. However, since the facial images in [12] measure $110 \times 150$ px, this constitutes only a minor deviation. The use of variable-sized, non-square regions in [12] establishes a more substantial difference.
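A sketch of the LBP descriptor using scikit-image's 'nri_uniform' mapping as a stand-in for the $LBP^{u2}$ operator with $P = 8$, $R = 1$; normalizing each per-region histogram is our assumption rather than something the paper specifies.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_descriptor(face, key_points, r=6, P=8, R=1):
    """F_LBP: concatenated uniform-LBP label histograms of the regions R_i.

    'nri_uniform' gives the non-rotation-invariant uniform patterns
    (59 labels for P = 8); a sketch, not the authors' implementation.
    """
    labels = local_binary_pattern(face, P, R, method="nri_uniform")
    n_bins = P * (P - 1) + 3                      # 59 uniform labels for P = 8
    blocks = []
    for x, y in key_points:
        patch = labels[y - r:y + r, x - r:x + r]
        hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins))
        blocks.append(hist / max(hist.sum(), 1))  # normalize each histogram
    return np.concatenate(blocks)
```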
To obtain the Gabor descriptor, input images were processed by a filter bank of 8 orientations and 5 spatial frequencies. For each of the complex response images $G_{mn}$, the region's energy content was computed as $E_i^{mn} = \sum_{(x,y) \in R_i} \|G_{mn}(x, y)\|$ and then collected into block feature vectors $f_i$. Similar to the DCT descriptor, the feature vector $F_{Gabor}$ was obtained by concatenating the block features after normalisation. This descriptor is fundamentally different from the one used in [9]: Instead of considering the energy content of pixel regions, their descriptor is built by selecting individual filter responses using AdaBoost. Given a $p_i$, their resulting feature vector may contain a single filter response at that point, whereas our descriptor always includes all responses at that location.
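A sketch of the Gabor descriptor; the wavelengths, kernel size and bandwidth below are illustrative assumptions, since the paper only fixes the bank to 8 orientations and 5 spatial frequencies.

```python
import cv2
import numpy as np

def gabor_bank(n_orientations=8, wavelengths=(4, 6, 8, 11, 16), ksize=31):
    """Build (even, odd) Gabor kernel pairs: psi = 0 and psi = pi/2 give the
    real and imaginary parts of the complex response G_mn."""
    bank = []
    for lambd in wavelengths:                      # 5 assumed spatial frequencies
        for k in range(n_orientations):            # 8 orientations
            theta = k * np.pi / n_orientations
            even = cv2.getGaborKernel((ksize, ksize), 0.56 * lambd, theta, lambd, 1.0, 0)
            odd = cv2.getGaborKernel((ksize, ksize), 0.56 * lambd, theta, lambd, 1.0, np.pi / 2)
            bank.append((even, odd))
    return bank

def gabor_descriptor(face, key_points, bank, r=4):
    """F_Gabor: per-region energies E_i^{mn} of every filter response, concatenated."""
    face = face.astype(np.float32)
    magnitudes = []
    for even, odd in bank:
        re = cv2.filter2D(face, cv2.CV_32F, even)
        im = cv2.filter2D(face, cv2.CV_32F, odd)
        magnitudes.append(np.hypot(re, im))        # |G_mn(x, y)|
    blocks = []
    for x, y in key_points:
        energies = np.array([m[y - r:y + r, x - r:x + r].sum() for m in magnitudes])
        norm = np.linalg.norm(energies)
        blocks.append(energies / norm if norm > 0 else energies)
    return np.concatenate(blocks)
```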
3.2 Key Point Selection

We employed several strategies to select the regions $R_i$. A trivial solution was to place the $p_i$ on a regular grid covering the whole facial area in such a way that the regions do not overlap. Doing so resulted in the well-known block-based approaches used as baselines in [12] and [9]. A second approach utilized AdaBoost to select a list of discriminative regions. Initially, every pixel of the input image was considered to be a region center for feature extraction. The descriptor $F$ consisted of $W \cdot H$ sub-descriptors $f_i$. With $h^j$ denoting linear SVM classifiers, we associated classifiers $h^j_i(f) = h^j(f_i)$ with the $f_i$. AdaBoost was used to select the $n$ most discriminative $h^j_t$, which were mapped back to pixel locations to obtain a list of key points.

We trained AdaBoost in two separate ways: In per-class selection, positive samples were chosen from the target expression, while negative samples were randomly selected from every other class. This emphasises features that discriminate one expression from all others. Expressive selection drew negative samples from the Neutral class and selected positives from all remaining classes. Doing so emphasises features that are useful to discriminate expressive from non-expressive faces. We expected per-class selection to perform better than expressive selection.
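The boosted selection can be sketched as discrete AdaBoost over per-region linear SVM weak learners. This is a simplified stand-in for the procedure described above (re-training a weak learner for every candidate region in every round is expensive in practice); the interface and the C value are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_key_points(block_features, labels, candidate_points, n=96):
    """Greedy AdaBoost-style selection of the n most discriminative regions.

    `block_features[k]` holds the block feature f_i of candidate region k for
    every training sample (shape: n_samples x dim); `labels` is a +1/-1 array
    (e.g. target expression vs. rest for per-class selection).
    """
    n_samples = len(labels)
    weights = np.full(n_samples, 1.0 / n_samples)
    selected = []
    for _ in range(n):
        best = None
        for k, feats in enumerate(block_features):
            if any(k == s for s, _ in selected):
                continue                                     # region already chosen
            clf = LinearSVC(C=1.0).fit(feats, labels, sample_weight=weights)
            err = weights[clf.predict(feats) != labels].sum()
            if best is None or err < best[1]:
                best = (k, err, clf)
        k, err, clf = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)                # AdaBoost weight update
        weights *= np.exp(-alpha * labels * clf.predict(block_features[k]))
        weights /= weights.sum()
        selected.append((k, alpha))
    return [candidate_points[k] for k, _ in selected]
```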

3.3 Classification

We used a 7-way forced choice to classify samples. For each expression, a third-degree polynomial kernel SVM discriminated that expression from every other category. The class of a sample was determined by choosing the classifier that produced the largest confidence. The SVM parameters were found by performing a grid search and choosing the $C$ and $\gamma$ with the highest accuracy in a 5-fold cross validation on the training set. Each class and feature selection method was allowed to produce different parameters.
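A sketch of the one-vs-rest classification stage with scikit-learn; the candidate C and gamma grids are illustrative assumptions, as the paper does not report the searched values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

EXPRESSIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def train_one_vs_rest(features, labels):
    """Train one third-degree polynomial-kernel SVM per expression.

    `labels` is an array of expression names; parameters are picked by grid
    search with 5-fold cross validation over assumed candidate values.
    """
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    classifiers = {}
    for expression in EXPRESSIONS:
        binary = (labels == expression).astype(int)          # one vs. rest
        search = GridSearchCV(SVC(kernel="poly", degree=3), param_grid, cv=5)
        classifiers[expression] = search.fit(features, binary).best_estimator_
    return classifiers

def classify(classifiers, feature_vector):
    """7-way forced choice: pick the expression whose SVM is most confident."""
    scores = {e: clf.decision_function(feature_vector.reshape(1, -1))[0]
              for e, clf in classifiers.items()}
    return max(scores, key=scores.get)
```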
4 Experimental Results

We evaluated the different key point selection methods using regions $R_i$ with $r = 4$ up to $r = 12$. The number of regions varied between $n = 5$ and $n = 144$ for expressive and per-class selection, and was defined by $r$ when using the grid-based approach. The other feature descriptor parameters were not varied. Table 2 lists the best performing configuration for each feature extraction method. It can be seen that the Gabor features show the best recognition performance, while the DCT descriptor performs worst, with a 1.8% difference in mean accuracy. However, the DCT descriptor has a much lower dimensionality and is faster to compute than both the Gabor and LBP descriptors.

Table 2. Comparison of best performing configurations by feature descriptor.

Feature   Strategy    n     r   Accuracy
DCT       per-class   96    6   0.844
LBP       per-class   96    6   0.859
Gabor     grid        144   4   0.862

As expected, per-class key point selection is superior to expressive selection. Consistent with [12], boosted LBP features outperform the grid-based approach, although not by as large a margin as observed in their study. This might be attributed to the differences in the feature extraction methods. The Gabor descriptor outperforms both the DCT and the LBP descriptor, though surprisingly the best result is not achieved using boosted features but by the grid-based approach. Because we observed that the Gabor descriptor yielded higher accuracy when using smaller regions, we extended the experiments to include regions with $r = 1$ and $r = 0$ (i.e. only the filter responses). Still, grid-based key point selection yielded the highest mean accuracy. This directly contradicts the findings in [9], where AdaSVMs performed considerably better than a traditional Gabor descriptor. This result may be attributed to the different approach to feature selection: In [9] individual feature responses were selected, while we considered all responses of a given region $R_i$.

Table 3. Confusion matrix of the Gabor feature descriptor.

        Ang.   Dis.   Fear   Joy    Neu.   Sad.   Sur.
Ang.    0.44   0.16   0.06   0.03   0.05   0.08   0.06
Dis.    0.19   0.41   0.08   0.06   0.06   0.16   0.05
Fear    0.05   0.08   0.43   0.03   0.07   0.10   0.19
Joy     0.02   0.06   0.03   0.72   0.03   0.04   0.02
Neu.    0.12   0.07   0.06   0.05   0.64   0.13   0.07
Sad.    0.11   0.18   0.13   0.08   0.09   0.42   0.05
Sur.    0.06   0.04   0.21   0.03   0.07   0.06   0.56

Table 3 shows the confusion matrix for the Gabor feature descriptor. The worst recognition performance is observed for disgust, sadness, fear and anger. Discriminating anger from disgust, disgust from sadness and fear from surprise produces the most mistakes. Joy and neutral, on the other hand, are recognized with high confidence. Figure 1 shows commonly misclassified images. From left to right, the images are tagged anger, fear, anger and neutral, and were classified as neutral, sadness, surprise and sadness, respectively. Some of these images, e.g. the second from the left, are difficult to classify even for humans. Others show features shared between expressions, such as wide opened eyes and raised inner eye brows.

Figure 1. Commonly misclassified images.

5 Conclusion

We presented a novel database compiled from web images containing 4761 labeled faces of male and female subjects of different ethnicities and age groups. Variations in expression intensity, head pose, and lighting conditions pose a new challenge for facial expression recognition systems.

We furthermore developed a modular system for facial expression recognition. Feature descriptors based on the DCT, LBP, and Gabor filters were collectively formulated in terms of regions around key points. Several strategies to find an optimal selection of key points have been explored. The DCT and LBP based descriptors have been shown to benefit from key point selection, while the Gabor feature performed best when the regions were uniformly distributed.

References

[1] T. Bänziger and K. R. Scherer. Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) Corpus. Blueprint for affective computing: A sourcebook, pages 271-294, 2010.
[2] J. Cockburn, M. Bartlett, J. Tanaka, J. Movellan, M. Pierce, and R. Schultz. SmileMaze: A Tutoring System in Real-Time Facial Expression Perception and Production in Children with Autism Spectrum Disorder. In FGR08, pages 678-986, 2008.
[3] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. Emotion recognition in human-computer interaction. Signal Processing Magazine, 18(1):32-80, January 2001.
[4] C. Darwin. The Expression of the Emotions in Man and Animals. Harper Perennial, anniversary edition, 1872/2009.
[5] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In ICCV Workshops, pages 2106-2112, Nov. 2011.
[6] H. K. Ekenel. A Robust Face Recognition Algorithm for Real-World Applications. PhD thesis, Universität Karlsruhe (TH), Fakultät für Informatik, February 2009.
[7] P. Ekman. Basic emotions. In Handbook of cognition and emotion, volume 98, chapter 3, pages 45-60. John Wiley & Sons, 1999.
[8] T. Kanade, J. Cohn, and Y. Tian. Comprehensive Database for Facial Expression Analysis. In FGR00, pages 46-53, 2000.
[9] G. Littlewort, M. S. Bartlett, I. R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, and J. R. Movellan. Towards social robots: Automatic evaluation of human-robot interaction by face detection and expression classification. In NIPS, 2003.
[10] G. C. Littlewort, M. S. Bartlett, L. P. Salamanca, and J. Reilly. Automated measurement of children's facial expressions during problem solving tasks. In FGR11, pages 30-35, 2011.
[11] R. W. Picard. Measuring affect in the wild. In ACII11, pages 3-3, Berlin, Heidelberg, 2011. Springer-Verlag.
[12] C. Shan, S. Gong, and P. W. McOwan. Facial expression recognition based on Local Binary Patterns: A comprehensive study. Image and Vision Computing, 27(6):803-816, 2009.

