
FACIAL EXPRESSION ANALYSIS

IN MPEG-4 SEQUENCES
Amaryllis Raouzaiou, Kostas Karpouzis and Stefanos Kollias
Image, Video and Multimedia Systems Lab - Dept of Electrical and Computer Engineering
National Technical University of Athens
Heroon Polytechniou 9, 157 73 Zographou, GREECE
Tel.: (301) 7722491 Fax: (301) 7722492
email: araouz@softlab.ece.ntua.gr
Abstract
While the previous MPEG standards focus primarily on video coding and transmission issues, MPEG-4
concentrates on hybrid coding of natural and synthetic data streams. In this framework, possible applications
include teleconferencing and entertainment applications, where an adaptable synthetic agent substitutes the
actual user. Such agents can interact with each other, receiving input from multi-sensor data, and utilize high-level information, such as detected emotions and expressions. This greatly enhances human-computer
interaction, by replacing single media representations with dynamic renderings, while providing feedback on
the user's emotional status and reactions. Educational environments, virtual collaboration environments and
online shopping and entertainment applications are expected to profit from this concept. Facial expression
synthesis and animation, in particular, are given much attention within the MPEG-4 framework, where higher-level, explicit Facial Animation Parameters (FAPs) have been dedicated to this purpose. In this work, we employ general-purpose FAPs so as to reduce the definition of facial expressions for synthesis purposes to the estimation of the actual expression as a combination of universal ones. In addition, we provide explicit features, as well as possible values for the implementation of the FAPs, and form a relation between FAPs and the activation parameter proposed in classic psychological studies.
1. INTRODUCTION
The establishment of the MPEG-4 standard facilitates an alternative way of analyzing and modeling facial
expressions and related emotions. Facial Animation Parameters (FAPs) are utilized in the framework of
MPEG-4 for facial animation purposes, so as to enable efficient hybrid coding of synthetic objects with
natural video. This enables animators to focus on local or global actions on the face, by means of scripting
an animation sequence. For example, the animator can instruct the synthetic model of a human face to open the mouth or lower an eyebrow (see Figure 1); in essence, this instruction is passed to the MPEG-4
decoder, which in turn deforms the model by translating the vertices that correspond to the area in question
(see Figure 2). While the standard does cater for the abstract definition of expressions and emotions as a collection of FAPs and their subsequent interpolation into intermediate expressions, this does not necessarily mean that all possible expressions and emotions can be modeled this way [1]. In general, facial expression analysis has mainly concentrated on six expressions, termed universal. This term means that
humans across different cultures can easily recognize expressions such as joy or disgust [2]. One can
combine different universal expressions to provide intermediate ones, such as fake joy or upset (see Figure
3), or a number of emotional states, such as pain.
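
To make this combination step concrete, the following Python sketch blends the FAP amplitude profiles of two universal expressions with chosen weights to obtain an intermediate profile. The FAP numbers, amplitudes and weights used here are hypothetical placeholders for illustration, not values taken from the standard or from this work.

def blend_profiles(profile_a, profile_b, weight_a, weight_b):
    """Combine two expression profiles (dicts mapping FAP number to amplitude)."""
    faps = set(profile_a) | set(profile_b)
    return {fap: weight_a * profile_a.get(fap, 0.0) + weight_b * profile_b.get(fap, 0.0)
            for fap in faps}

# Hypothetical profiles: a handful of FAP numbers and amplitudes, for illustration only.
joy = {6: 80, 7: 80, 12: 60}
sadness = {31: 70, 32: 70}

# A candidate intermediate profile, weighted more heavily towards sadness.
intermediate = blend_profiles(joy, sadness, 0.3, 0.7)
print(intermediate)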
The reverse problem, that is, the identification of the universal expressions that must be combined to produce a given intermediate expression, is not always clear-cut. In the quest to form a low-dimensional space in which distance measures can be defined, notions such as the Feeltrace plane (see Figure 4), defined by the activation and evaluation axes, may be used to diversify the process of synthesizing an intermediate expression. These notions originate from psychological studies [3] and can be exploited so as to move from
features comprehensible by humans to quantitative measurements, such as FAPs. This can be accomplished by reversing the description of the six universal emotions in terms of MPEG-4 FAPs and by using a priori knowledge embedded within a fuzzy rule system. Because FAPs do not correspond to specific
models or polygonal topologies, this scheme can be extended to other models or characters, different from
the one that was analyzed.
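
A minimal sketch of this reverse step, under assumed coordinates: the (evaluation, activation) placements of the six universal expressions below are illustrative only, and the intermediate expression is estimated from its nearest universal neighbours, weighted inversely to distance.

import math

# Hypothetical (evaluation, activation) coordinates for the six universal expressions.
UNIVERSAL = {
    "joy": (0.8, 0.5), "sadness": (-0.6, -0.4), "anger": (-0.5, 0.7),
    "fear": (-0.4, 0.9), "disgust": (-0.7, 0.2), "surprise": (0.1, 0.8),
}

def nearest_universal(point, k=2):
    """Return the k universal expressions closest to 'point' on the plane,
    weighted inversely to their distance (weights normalised to sum to 1)."""
    nearest = sorted((math.dist(point, coords), name)
                     for name, coords in UNIVERSAL.items())[:k]
    inv = [(1.0 / (d + 1e-6), name) for d, name in nearest]
    total = sum(w for w, _ in inv)
    return [(name, w / total) for w, name in inv]

# An intermediate expression located between sadness and disgust on the plane.
print(nearest_universal((-0.65, -0.1)))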

Figure 1: A face model in its neutral state
Figure 2: Deformation along the eyebrow area
Figure 3: The complete definition of the emotion upset
Figure 4: The Feeltrace plane

2. MPEG-4 AND FACIAL ANIMATION
The definition parameters specified by the MPEG group allow for a detailed description of body/face shape, size and texture, while the animation parameters facilitate the definition of facial expressions and body postures [4]. These parameters are designed to accommodate all possible natural expressions and postures, as well as exaggerated expressions and motions, thus covering not only representation purposes but entertainment as well.
As far as decoding is concerned, FAPs manipulate control points on a 3D model of the face so as to produce animation of the head and of facial features like the mouth, the eyes or the eyebrows. All the FAPs that involve translation are expressed in terms of Facial Animation Parameter Units (FAPUs). These units are defined with respect to standard facial features, such as the eye distance, and allow the interpretation of FAPs on any facial model in a consistent, reasonable way. This parameter set also contains two high-level parameters: the viseme parameter, which allows rendering of the visual aspect of the lower part of the face without the need to express it in terms of other parameters, and the expression parameter, which allows the definition of six high-level facial expressions.
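
The FAPU mechanism can be illustrated with a short sketch that converts an encoded FAP amplitude into a model-space displacement; it assumes the common convention that a distance-based FAPU equals the corresponding facial distance divided by 1024, and the model measurements are hypothetical.

def fapu_es(eye_separation):
    """Eye-separation FAPU (ES) for a given model, assuming the distance/1024 convention."""
    return eye_separation / 1024.0

def fap_displacement(fap_amplitude, fapu):
    """Translate an encoded FAP amplitude into a model-space displacement."""
    return fap_amplitude * fapu

es = fapu_es(64.0)                    # hypothetical model with eyes 64 units apart
print(fap_displacement(200, es))      # displacement for a FAP amplitude of 200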
All these animation instructions are fed to an MPEG-4 decoder. This part of the system may animate a
specific 3D model transmitted once as part of the communication or use a generic facial model capable of
interpreting FAPs. If the application aims to replicate a given human head, it may choose to modify the shape
and appearance of the face accordingly. In this case, the definition of its geometry and topology is necessary.
This definition is encoded in FDPs, normally transmitted once per session, followed by a stream of
compressed FAPs. The distinct FDP fields are:
FeaturePointsCoord - the actual 3D feature points for the calibration of the face model
TextureCoords - texture coordinates for the feature points
TextureType - hints to the decoder on the type of texture image
FaceDefTables - describe the behavior of FAPs w.r.t. the geometry deformation
FaceSceneGraph - contains the texture image or group nodes for the model hierarchy
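
For illustration, the FDP fields listed above can be mirrored by a simple container such as the following Python sketch; the field names follow the standard's terminology, but the class is only a book-keeping aid, not an MPEG-4 bitstream parser.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FDP:
    """Book-keeping container for the FDP fields listed above (not a bitstream parser)."""
    feature_points_coord: list                 # 3D feature points calibrating the model
    texture_coords: Optional[list] = None      # texture coordinates of the feature points
    texture_type: Optional[str] = None         # hint on the type of texture image
    face_def_tables: Optional[dict] = None     # FAP -> geometry deformation behaviour
    face_scene_graph: Optional[Any] = None     # texture image / model hierarchy nodes

# Transmitted once per session, followed by a stream of compressed FAPs.
session_fdp = FDP(feature_points_coord=[(0.0, 0.0, 0.0)])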
3. GOING FROM INTERMEDIATE EXPRESSIONS TO UNIVERSAL
Grading of FAPs is strongly related to the activation parameter proposed by Whissel [5]. Since this relation is
expressed in a different way for each particular expression, a fuzzy rule system seems appropriate for
mapping FAPs to the activation axis. As a general rule, one can define six general categories, each characterized by a fundamental universal emotion; within each of these categories, intermediate expressions are described by different emotional and optical intensities, as well as minor variations in expression details. From the synthetic point of view, expressions and emotions that belong to the same category can be rendered by animating the same FAPs at different intensities. For example, the emotion group fear also contains worry and terror; these two emotions can be synthesized by reducing or increasing the intensities of the employed FAPs, respectively. The same rationale can also be applied to the group of disgust, which also contains disdain and repulsion; the fuzziness introduced by the varying scale of the FAP intensity changes also helps to mildly differentiate the output in similar situations. This ensures that the synthesis does not render robot-like animation, but produces considerably more realistic results.
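
The following sketch illustrates this within-group synthesis: the FAPs of the group's fundamental expression are reused at scaled intensities, with a small random perturbation standing in for the fuzziness mentioned above. The FAP numbers, amplitudes and scale factors are hypothetical.

import random

def scale_profile(profile, factor, jitter=0.05):
    """Scale every FAP amplitude in a profile, with a mild random variation
    standing in for the fuzziness discussed above."""
    return {fap: amp * factor * (1.0 + random.uniform(-jitter, jitter))
            for fap, amp in profile.items()}

fear = {3: 120, 31: 90, 32: 90}     # hypothetical fear profile (FAP number -> amplitude)
worry = scale_profile(fear, 0.4)    # milder member of the same group
terror = scale_profile(fear, 1.5)   # more intense member of the same group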
During this process, one can utilize fuzzy rules to go from the transmitted FAPs to universal expressions;
these rules stem from observation [6] and are fortified with results from the analysis of actual video
sequences. The initial rules have the form shown in Table 1, where the symbol + denotes the presence of a
particular atomic action in the human face. These actions are converted to groups of FAPs that map to the
facial area in question; the values of these FAPs are initialized with measurements from video sequences
that correspond to the analyzed expression [7]. As a result, the representation of an expression as a collection of FAPs may be reduced to the estimation of the intensities of two universal expressions, an estimation that can be carried out at the client decoder.
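
As a hedged illustration of this rule-based reduction, the sketch below evaluates triangular membership functions over measured FAP values and aggregates them per universal expression; the FAP groups and membership parameters are assumptions chosen for illustration, not the rules of Table 1.

def triangular(x, a, b, c):
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical rules: each universal expression is tied to a group of FAPs and
# to membership parameters over their measured values.
RULES = {
    "sadness": [(31, (20, 80, 140)), (32, (20, 80, 140))],
    "anger":   [(33, (30, 100, 170)), (34, (30, 100, 170))],
}

def expression_intensities(measured_faps):
    """Fire each rule with min-aggregation over the memberships of its FAPs."""
    return {expr: min(triangular(measured_faps.get(f, 0.0), *abc) for f, abc in terms)
            for expr, terms in RULES.items()}

print(expression_intensities({31: 70, 32: 90, 33: 10}))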

Table 1: Rules for the mapping of eyebrow FAPs to intermediate expressions. The table relates the expressions Relaxed, Cry, Almost crying, Depressed and Sadness to eyebrow actions (lowered, raised, inner arch, inner part), marking the presence of an action with "+" and grading its intensity with "more" or "less".


4. REFERENCES
[1] K. Karpouzis, N. Tsapatsoulis, S. Kollias, Moving to continuous facial expression space using the
MPEG-4 facial definition parameter set, SPIE Electronic Imaging 2000, January 2000, San Jose, CA,
USA
[2] F. Parke and K. Waters, Computer Facial Animation, A K Peters, 1996
[3] R. Cowie and E. Douglas-Cowie, Automatic statistical analysis of the signal and prosodic signs of
emotion in speech, Proc. of Intern. Conference on Spoken Language Processing, Philadelphia, 1996
[4] W. S. Lee, M. Escher, G. Sannier, N. Magnenat-Thalmann, MPEG-4 Compatible Faces from Orthogonal
Photos, Computer Animation 1999, Geneva, Switzerland, 1999
[5] C. M. Whissel, The dictionary of affect in language, in R. Plutchnik and H. Kellerman (Eds.), Emotion:
Theory, research and experience: vol 4, The measurement of emotions, Academic Press, New York,
1989
[6] G. Faigin, The Artist's Complete Guide to Facial Expressions, Watson-Guptill, New York, 1990

[7] N. Tsapatsoulis, K. Karpouzis, G. Stamou, F. Piat and S. Kollias, A Fuzzy System for Emotion
Classification based on the MPEG-4 Facial Definition Parameter Set, EUSIPCO 2000, September
2000, Tampere, Finland
