
Fiction database for emotion detection in abnormal situations

Chloe Clavel, Ioana Vasilescu*, Laurence Devillers**, Thibaut Ehrette


Thales Research and Technology France, Domaine de Corbeville, 91404 Orsay Cedex, France
*ENST-TSI, 46 rue Barrault, 75634 Paris Cedex 13, France
**LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France
chloe.clavel@thalesgroup.com, vasilesc@tsi.enst.fr, devil@limsi.fr

Abstract
The present research focuses on the acquisition and annotation of vocal resources for emotion detection. We
are interested in detecting emotions occurring in abnormal situations, and particularly in detecting fear. The
present study considers a preliminary database of audiovisual sequences extracted from fiction movies. The selected sequences provide various manifestations of the target
emotions and are described with a multimodal annotation
tool. We focus on audio cues in the annotation strategy
and use the video as support for validating the audio labels. The present article describes the methodology of data acquisition and annotation.
The annotation is validated via two perceptual paradigms in which the +/-video condition
of stimulus presentation varies. We show the perceptual significance
of the audio cues and the presence of the target emotions.

1. Introduction
Emotion detection in spontaneous speech is an
interesting topic for a wide set of potential applications,
such as human-machine interaction (human-machine dialog systems) and surveillance applications (homeland
security). Classical security systems are based only on the detection of visual cues, without any recording of the audio signal.
Our aim is to show the relevance of acoustic cues in illustrative abnormal situations. We define an abnormal situation as an unplanned event, resulting from a present or imminent human
action or from a natural disaster, which
threatens human life and requires prompt action to protect life or to limit the damage. We assume
that in these situations human beings experience fear or
other related negative emotions. Among abnormal
situations we can mention natural disasters such as fires,
earthquakes and floods, as well as physical or psychological threats and aggression against human beings (kidnapping,
hostage-taking, etc.). Emotion detection in speech for security
applications requires appropriate real-life databases.
The work presented here describes the database we
are building with the aim of developing a system able
to detect emotions occurring during abnormal situations.
We are hence interested in vocal manifestations in situations in which human life could be in danger. The
potential applications deal with the security of public places, for example bank or subway surveillance.
The development of an appropriate database of natural
speech and of a robust annotation strategy is not straightforward. The literature on emotions shows that the detection of robust vocal cues carrying information about the
speaker's emotional state is strongly correlated with the naturalness
of the corpus employed for this purpose. We can notice
an increasing interest in collecting and analyzing natural and context-dependent corpora. As real-life corpora
for emotion detection, we can mention a conversational
corpus containing everyday interactions between a target speaker and his environment [3], the Belfast database
[1] containing an interview corpus and a talk-show-based
corpus, and call center corpora consisting of interactions
between agents and clients [4]. However, despite the naturalness of the data, all the mentioned corpora illustrate
life contexts in which emotions are shaded, mixed and
moderated, as politeness rules and social conventions
require. Natural corpora with extreme emotional manifestations for surveillance applications are not available
because of the private character of the data. However,
those emotions are present in everyday life, even though they are
rare and unpredictable. Indeed, broadcast news nowadays
provides strong examples of such extreme emotional manifestations, but generally via short excerpts from a wide
variety of contexts. As a consequence, there is a lack
of studies focusing on strong natural emotions.
Given the difficulty of collecting abundant material
with those target emotions, we are building a corpus
based on fiction sequences extracted from a collection of
selected recent movies in the English language. Even though the
emotions are acted, they are expressed by skilled actors in
realistic interpersonal interactions within the overall context of
the movie scenario. Moreover, fiction provides an interesting range of emergence contexts and of types of speakers that would have been very difficult to collect in real
life. Our annotation methodology consists of three main
phases: naming the emotions, annotating them, and validating
them via appropriate perceptual tests.
In the following sections we describe the methodology of extraction (section 2) and annotation (section 3)
of the corpus sequences, and the validation of the annotation via perceptual paradigms (section 4) carried out in
two experimental conditions on a subset of the corpus.
Finally, in section 5 we present conclusions and future work.

2. Corpus
2.1. Description
A corpus is currently being acquired in order to provide
about 500 audiovisual sequences in English, corresponding to manifestations of emotional states in abnormal situations, involving either individuals, groups or crowds.
Recent thrillers, action movies and psychological dramas are good candidates. Our selection criteria rely on
the actors' performance and the naturalness of the situation, on audio quality
(predominance of voice over noise and music), and on the type
of abnormal situation, which we prefer to be realistic and close
to the situations defined by our application field. In addition, other emotions occurring in normal situations are
also considered, in order to verify the robustness of acoustic cues for emotion detection in target situations. Both
verbal and non-verbal manifestations (shots, explosions,
cries, etc.) are considered. Verbal manifestations illustrate the neutral (reference) state, manifestations of the target emotions, and emotions other than neutral occurring in normal situations; they are annotated at several levels (section
3).
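
To make these selection criteria concrete, the following is a minimal Python sketch of the kind of metadata record one might keep for each selected sequence. The field names are hypothetical and are not taken from the paper; only the criteria they encode are.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SequenceRecord:
    """Hypothetical metadata for one audiovisual sequence (20-100 s)."""
    movie: str                 # source movie title
    start_s: float             # start time within the chapter, in seconds
    end_s: float               # end time within the chapter, in seconds
    situation: str             # e.g. "hostage", "kidnapping", "normal"
    abnormal: bool             # True if the sequence shows an abnormal situation
    verbal: bool               # False for purely non-verbal sequences (shots, cries, ...)
    voice_predominant: bool    # audio-quality criterion: voice dominates noise/music
    speaker_turns: List[str] = field(default_factory=list)  # ids of segmented turns

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

# Example: one candidate sequence retained because the voice dominates the mix
seq = SequenceRecord(movie="movie_A", start_s=125.0, end_s=180.0,
                     situation="hostage", abnormal=True,
                     verbal=True, voice_predominant=True)
print(seq.duration_s)  # 55.0 seconds, within the 20-100 s range
```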

2.2. The extraction and selection method
Chapters of interest are chosen from previously selected
DVDs and stored as MPEG files, before being segmented
into short sequences of 20 to 100 seconds. The audio data are
extracted into .wav files. The variability in sequence duration is a consequence of the topic-based segmentation criterion. Each sequence illustrates a particular topic
and a verbal or non-verbal context. Verbal sequences contain dialogs and/or monologues. Different types of situations are illustrated: hostage-taking, individuals and groups
lost in a threatening environment, kidnapping, etc. Sequences are segmented into speaker turns. For the study
presented here, we considered 20 preliminary sequences
extracted from six different movies, containing 152
speaker turns and 21 overlaps, from 28 speakers (14 male, 12 female and 2
children). From the 152 speaker turns we
selected a subset of 40 speaker turns for the perceptual test (section 4).

3. Annotation strategies
3.1. A task-dependent annotation strategy
We adopt a task-dependent annotation strategy which
considers two main factors: the context of emotion emergence and the temporal evolution of the emotion manifestations. For this purpose, each sequence, which provides a particular context, is segmented into speaker turns,
which represent the basic annotation unit. The annotation strategies correspond to the first two phases of
the adopted methodology, naming and annotating. Emotion perception by human beings is strongly multimodal: both audio and video information help us to understand speakers' emotional states. Although our annotation scheme relies essentially on audio description, our
choices are supported by the video, through the use of the multimodal annotation tool ANVIL [5]. ANVIL allows us to create
and adapt different annotation schemes. A specification
file is thus provided in XML format. The output of the ANVIL
tool is an interpretation file in which the selected description is stored. ANVIL also provides the possibility to import
data from PRAAT, such as pitch contour and intensity
variation analyses.

3.2. Annotation scheme
We consider the following elements as relevant for the annotation: the
emotional content (category labels and dimensional description), the context (here the threat), and the acoustic
properties recovered by PRAAT [7].

Figure 1: Annotation scheme in ANVIL (tracks for audio descriptors, context descriptors, and emotion descriptors, both categorical and dimensional).
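
ANVIL reads the annotation scheme from an XML specification file; its exact schema is defined by the tool itself, so the following Python sketch uses hypothetical element and attribute names purely to illustrate how the tracks of figure 1 and their value sets could be declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical track layout mirroring figure 1; this is NOT ANVIL's real schema,
# only an illustration of declaring the annotation tracks as an XML specification.
TRACKS = {
    "emotion.category":   ["fear", "other_negative", "neutral", "other_normal"],
    "emotion.intensity":  [str(v) for v in range(0, 4)],     # 0 .. 3
    "emotion.evaluation": [str(v) for v in range(-3, 4)],    # -3 .. +3, 0 = neutral
    "emotion.reactivity": [str(v) for v in range(-3, 4)],    # -3 .. +3, 0 = neutral
    "context.threat":     ["immediate", "imminent", "potential"],
    "audio.quality":      ["+noise", "-noise", "+music", "-music"],
}

def build_spec() -> ET.ElementTree:
    root = ET.Element("annotation-spec")
    for name, values in TRACKS.items():
        track = ET.SubElement(root, "track", {"name": name})
        for v in values:
            ET.SubElement(track, "value").text = v
    return ET.ElementTree(root)

build_spec().write("emotion_spec.xml", encoding="utf-8", xml_declaration=True)
```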

3.2.1. Emotion descriptors


The interpretation files incorporate two types of emotional
content for each speaker turn, a dimensional one and a
categorical one. The dimensional description is based
on the following three abstract dimensions: activation,
evaluation and control, which have been suggested as salient for
emotion description [8]. The third dimension is renamed
and adapted as a reactivity dimension, which is more perceptually
intuitive for distinguishing emotions occurring in abnormal situations. Our final three dimensions are: intensity, which indicates how intense the emotion is (for example, terror is
more intense than fear); evaluation, which gives a global
indication of the positive or negative feelings associated
with the emotional state (for example, happiness is a positive emotion and anger a negative one); and reactivity, which
enables us to distinguish different types of fear. Indeed,
we do not experience the same fear depending on whether or not we cope with the
threat [6]. The reactivity value indicates whether
the speaker seems to be subjected to the situation (passive) or to react to it (active). For example, one reaction
to fear can be inhibition (very passive) or anger (very active). For the evaluation and reactivity dimensions, the axes
cover discrete values from wholly negative (-3, -2, -1)
to wholly positive (+1, +2, +3). The intensity axis provides four levels from 0 to 3. Level 0 corresponds to the neutral state for the three dimensions. Each dimension is
stored in a track of ANVIL's annotation file, as shown in
figure 1.
We also employ categorical labels for the emotional
content of each speaker turn. So far we have selected two
groups of labels, corresponding to emotions in abnormal
situations (the global class fear and other negative emotions)
and to other emotions (neutral and other emotions in normal
situations).
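
As a compact illustration of this two-level description, here is a small Python sketch that encodes the categorical label groups and the three dimensional scales with the value ranges defined above; the names are ours, not part of the annotation tool.

```python
from dataclasses import dataclass

# Categorical labels, grouped as in section 3.2.1
ABNORMAL_CLASSES = {"fear", "other_negative"}   # emotions in abnormal situations
NORMAL_CLASSES   = {"neutral", "other_normal"}  # neutral and other emotions

@dataclass
class DimensionalLabel:
    """One speaker turn described on the three abstract dimensions."""
    intensity: int    # 0 (neutral) .. 3 (very intense)
    evaluation: int   # -3 (wholly negative) .. +3 (wholly positive), 0 = neutral
    reactivity: int   # -3 (very passive) .. +3 (very active), 0 = neutral

    def __post_init__(self):
        if not 0 <= self.intensity <= 3:
            raise ValueError("intensity must be in 0..3")
        for name in ("evaluation", "reactivity"):
            if not -3 <= getattr(self, name) <= 3:
                raise ValueError(f"{name} must be in -3..+3")

# Example: a terrorised victim actively trying to escape
label = DimensionalLabel(intensity=3, evaluation=-3, reactivity=+2)
```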
3.2.2. Context and meta descriptors
We consider here several factors that allow us to better
describe the context of emotion emergence. The factors
concern both the environment and the relation between
speakers. Consequently, a threat track provides the description of the threat intensity and of its incidence (immediate, imminent or potential). The speaker track gives
the gender of the current speaker and its position in the interaction (victim or aggressor). Judgements about audio
qualities (i.e. +/- noise, +/- music) of each speak turns are
stored. They will be employed at testing the robustness
of detection methods.
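
For illustration, the context and meta descriptors of a single speaker turn could be grouped as follows; this is a sketch with hypothetical field names, the actual tracks live in the ANVIL annotation file.

```python
# Hypothetical grouping of the context and meta descriptors of one speaker turn.
turn_context = {
    "threat":  {"intensity": 2, "incidence": "imminent"},   # threat track
    "speaker": {"gender": "female", "position": "victim"},  # speaker track
    "audio":   {"noise": False, "music": True},             # +/- noise, +/- music judgements
}

# Flag used later when testing detection robustness on clean vs. degraded audio.
has_clean_audio = not turn_context["audio"]["noise"] and not turn_context["audio"]["music"]
```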
3.2.3. Transcription and acoustic descriptors
With the aim of studying salient lexical cues we also transcribe the verbal content, with the help of subtitle support provided by DVDs. Breathing, shots and shouts are
transcribed as non verbal events. PRAAT input allows to
extract a set of statistics from the speech signal, such as
intensity and pitch contours which will be correlated with
other descriptors.
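
The paper relies on PRAAT itself for this analysis; if one wanted to compute comparable per-turn statistics from Python, a hedged sketch using the parselmouth wrapper around PRAAT (an assumption on our part, not part of the described setup; the file name is hypothetical) could look like this:

```python
import numpy as np
import parselmouth  # Python wrapper around the PRAAT engine (praat-parselmouth)

def turn_statistics(wav_path: str) -> dict:
    """Extract simple pitch and intensity statistics for one speaker turn."""
    snd = parselmouth.Sound(wav_path)

    pitch = snd.to_pitch()                  # PRAAT pitch analysis
    f0 = pitch.selected_array["frequency"]  # Hz, 0.0 for unvoiced frames
    f0 = f0[f0 > 0]                         # keep voiced frames only

    intensity = snd.to_intensity()          # dB contour
    db = intensity.values.flatten()

    return {
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_range": float(np.ptp(f0)) if f0.size else 0.0,
        "intensity_mean": float(np.mean(db)),
        "intensity_max": float(np.max(db)),
    }

# Example (hypothetical file): stats = turn_statistics("turn_017.wav")
```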

4. Perceptual validation
4.1. Protocol and subjects
The perceptual test is conducted to answer several questions. Firstly, we evaluate the presence of emotions corresponding to an abnormal situation. We also estimate whether the
basic unit selected, the speaker turn, is salient for carrying
a particular emotion. We selected 40 speaker turns illustrating the previously described emotion categories. Each class
is illustrated by 10 stimuli, pronounced by 5 male and 5
female speakers. The speaker turns employed as stimuli vary
in length (from 3 to 43 words) and are presented in random order. We consider two experimental conditions, +/-video, in order to verify the role of the audio cues in perceiving the target emotions. The +video condition provides both the video and audio recordings for each speaker
turn. The -video condition provides only the audio file
corresponding to each speaker turn. 22 subjects participated in the perceptual test, 11 for the -video condition and
11 for the +video condition. They were previously instructed
on the purpose of the test, i.e. to judge short audio and/or audio-video sequences in terms of the emotion they perceive and to evaluate the emotion using the three abstract
dimensions. We expect the test to allow us to evaluate
the two emotion description schemes, i.e. categorical
vs. dimensional. Subjects are asked to name the emotion emerging from each stimulus. Concerning the use of the
abstract dimensions, they are described and examples are
given in order to illustrate them. A familiarization phase
consisting of 5 stimuli precedes the test. The 5 training
stimuli are not included in the results and are provided to
help subjects practice and build a personal evaluation scale. All the subjects understand English without
difficulty, and in order to avoid misunderstandings
the transcription of the stimuli is also provided. The test
phase consists of watching and/or listening to the 40
stimuli and answering the questions described in the instructions. Subjects are allowed to listen to or watch
each stimulus as many times as they wish.
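
A rough sketch of the test administration logic described above, assuming simple identifiers for subjects and stimuli (all names and the seeding scheme are hypothetical):

```python
import random

STIMULI = [f"turn_{i:02d}" for i in range(1, 41)]    # 40 test speaker turns
TRAINING = [f"train_{i}" for i in range(1, 6)]       # 5 familiarization stimuli
SUBJECTS = [f"subj_{i:02d}" for i in range(1, 23)]   # 22 subjects

def build_session(subject_id: str, with_video: bool, seed: int) -> dict:
    """Familiarization items first, then the 40 test stimuli in random order."""
    rng = random.Random(seed)
    order = STIMULI[:]
    rng.shuffle(order)
    return {"subject": subject_id,
            "condition": "+video" if with_video else "-video",
            "playlist": TRAINING + order}

# First 11 subjects get the -video (audio only) condition, the other 11 get +video.
sessions = [build_session(s, with_video=(i >= 11), seed=i)
            for i, s in enumerate(SUBJECTS)]
```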
4.2. Results
The figures below show the percentages of intensity (figure 2), evaluation (figure 3) and reactivity (figure 4) labels for emotions occurring in abnormal situations and
for other emotions, in both test conditions (audio and audiovisual). We present here the main results. We focus
both on the evaluation of emotions with the abstract dimensions
according to the experimental condition, i.e. +/-video,
and on the categorical labeling of the stimuli. The results obtained with the three-dimensional emotion description show a
differentiation of emotions in abnormal situations (the global
class fear and other negative emotions) from other emotions (neutral and other emotions in normal situations).
Indeed, the speaker turns initially labeled as emotions in
abnormal situations are perceived as more intense emotions (figure 2). Concerning the evaluation axis, emotions
in normal situations are globally perceived as corresponding to the zero level, which means they are perceived as neither positive
nor negative (figure 3). As expected, stimuli labeled as corresponding to abnormal situations are evaluated as negative. For the reactivity axis, emotions in normal situations
are subject to the same observation as for evaluation
(figure 4), whereas emotions in abnormal situations are
more frequently considered as active. Moreover, emotions are perceived as more intense with the help of the
video support. However, for the three dimensions the
two curves are close, which suggests that audio cues may
be sufficient to detect such emotional states.
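
As a sketch of how the curves in figures 2-4 can be derived from the collected judgements (the data layout is our assumption, not the authors' actual format): for each condition and stimulus group, count how often every scale level was chosen and normalise to a percentage.

```python
from collections import Counter
from typing import Dict, List, Tuple

# ratings[(condition, group)] = all values chosen by the subjects on one dimension,
# e.g. the intensity ratings given to "abnormal" stimuli in the "-video" condition.
Ratings = Dict[Tuple[str, str], List[int]]

def label_percentages(ratings: Ratings, scale: List[int]) -> Dict[Tuple[str, str], Dict[int, float]]:
    """Percentage of ratings at each scale level, per (condition, stimulus group)."""
    out = {}
    for key, values in ratings.items():
        counts = Counter(values)
        total = len(values) or 1
        out[key] = {level: 100.0 * counts.get(level, 0) / total for level in scale}
    return out

# Toy example on the intensity scale (0..3)
example: Ratings = {("-video", "abnormal"): [3, 2, 3, 2, 3],
                    ("-video", "normal"):   [0, 1, 0, 1, 0]}
print(label_percentages(example, scale=[0, 1, 2, 3]))
```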
Figure 2: Percentage of intensity label ratings, for emotions in abnormal vs. normal situations, with audio support only and with video support.

Figure 3: Percentage of evaluation label ratings, for emotions in abnormal vs. normal situations, with audio support only and with video support.

Figure 4: Percentage of reactivity label ratings, for emotions in abnormal vs. normal situations, with audio support only and with video support.

Finally, we correlate the labels provided by the subjects
with the four emotion classes. Table 1 shows a global
correlation and allows us to assess the usefulness of
the video when annotating the emotions. We notice that the
audio condition is sufficient to correctly annotate neutral
stimuli as well as stimuli initially labeled as other emotions in normal situations (emot. norm.) and other negative emotions (neg. emot.). In those cases, providing
the video seems to complicate the task. Indeed, neutral
stimuli are correctly rated in 100% of cases in the audio condition, but when the video is added they are perceived as neutral in only 70%
of cases. However, for the global class fear, the video
seems to be necessary in order to obtain the same judgement as the initial annotation. While the neutral state does not
need additional video information for correct annotation,
emotions, and especially extreme negative emotions such
as fear in abnormal situations, have complex multimodal
manifestations, and the video represents a real help for annotation. This finding is also illustrated in table 2, which presents
the number of stimuli that received higher reactivity ratings in the +video condition than
in the -video condition. As in table 1, it shows that images
help in perceiving more marked emotions in the case of
stimuli initially categorized as fear.

Table 1: Annotated vs. perceived emotion classes.
%             neutral   emot. norm.   neg. emot.   fear
audio           100        80            80         50
audio+video      70        70            70         80

Table 2: Number of stimuli judged as more marked in the +video condition (majority vote) on the reactivity axis, for the four emotion classes (10 stimuli per class).
Nb. st.       neutral   emot. norm.   neg. emot.   fear
reactivity      0/10      3/10          4/10        5/10

5. Conclusion and future work
This paper presented preliminary work on the acquisition and annotation of a fiction database for emotion
detection in abnormal situations, and particularly for fear. We
have shown the perceptual significance of the audio cues and the
role of the video, the presence of the target emotions in our
data, and the interest of the emotion annotation strategy,
i.e. categorical and with abstract dimensions. Ongoing
work focuses on a finer discrimination inside the global
emotion classes and on the correlation between the context
descriptors and the vocal manifestations of emotions.

6. References
[1] Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P., 2003. Emotional speech: Towards a new generation of databases. In Speech Communication.
[3] Campbell, N., Mokhtari, P., 2003. Voice quality: the 4th Prosodic Dimension. In 15th ICPhS, Barcelona.
[4] Devillers, L., Vasilescu, I., 2004. Reliability of Lexical and Prosodic Cues in two Real-life Spoken Dialog Corpora. In 4th International Conference on Language Resources and Evaluation (LREC).
[5] Kipp, M., 2001. Anvil - a generic annotation tool for multimodal dialogue. In 7th European Conference on Speech Communication and Technology (Eurospeech).
[6] Ekman, P., 2003. Emotions Revealed: Recognizing Faces and Feelings to Improve Communication and Emotional Life. Times Books.
[7] PRAAT, www.praat.org
[8] Osgood, C., May, W. H., Miron, M. S., 1975. Cross-cultural Universals of Affective Meaning. University of Illinois Press, Urbana.
