Abstract
The present research focuses on the acquisition and annotation of vocal resources for emotion detection. We are interested in detecting emotions occurring in abnormal situations, and particularly in detecting fear. The present study considers a preliminary database of audiovisual sequences extracted from fiction movies. The selected sequences provide various manifestations of the target emotions and are described with a multimodal annotation tool. Our annotation strategy focuses on audio cues, and we use the video as support for validating the audio labels. This article describes the methodology of data acquisition and annotation. The annotation is validated via two perceptual paradigms that differ in the presence or absence of video (+/-video) in the stimuli presented. We show the perceptual significance of the audio cues and the presence of the target emotions.
1. Introduction
Emotion detection in spontaneous speech is an interesting topic for a wide range of potential applications, such as human-machine interaction (human-machine dialog systems) and surveillance applications (homeland security). Classical security systems rely only on the detection of visual cues, without any recording of the audio signal. Our aim is to show the relevance of acoustic cues in illustrative abnormal situations. We define an abnormal situation as an unplanned event, resulting from a human action (present or imminent) or from a natural disaster, which threatens human life and requires prompt action to protect life or limit damage. We assume that in these situations human beings experience fear or other related negative emotions. Abnormal situations include natural disasters such as fires, earthquakes, and floods, as well as physical or psychological threats and aggression against human beings (kidnapping, hostage taking, etc.). Emotion detection in speech for security applications requires appropriate real-life databases.
The work presented here describes the database we are building with the aim of developing a system able to detect emotions occurring during abnormal situations. We are hence interested in the vocal manifestations occurring in such situations.
2. Corpus
2.1. Description
A corpus is being acquired to provide about 500 audiovisual sequences in English, corresponding to manifestations of emotional states in abnormal situations, whether individual, group, or crowd situations. Recent thrillers, action movies, and psychological dramas are good candidates. Our selection criteria rely on the actors' play and the naturalness of the situation, on audio quality (predominance of voice over noise and music), and on the type of abnormal situation, which we prefer realistic and close to the situations defined by our application field. In addition, other emotions occurring in normal situations are also considered, in order to verify the robustness of acoustic cues for emotion detection in the target situations. Both verbal and non-verbal manifestations (shots, explosions, cries, etc.) are considered. Verbal manifestations illustrate the neutral (reference) state, target-emotion manifestations, and non-neutral emotions occurring in normal situations; they are annotated at several levels (section 3).
3. Annotation strategies
3.1. A task-dependent annotation strategy
We adopt a task-dependent annotation strategy that considers two main factors: the context of emotion emergence and the temporal evolution of the emotion manifestations. To this end, each sequence, which provides a particular context, is segmented into speaker turns, which constitute the basic annotation unit. The annotation strategies correspond to the first two phases of the adopted methodology: naming and annotating. Emotion perception by human beings is strongly multimodal: both audio and video information help us understand a speaker's emotional state. Although our annotation scheme relies essentially on audio description, our choices are supported by the video, through the multimodal annotation tool ANVIL [5]. ANVIL allows us to create and adapt different annotation schemes: a specification file is provided in XML format, and the output is an interpretation file in which the selected description is stored. ANVIL can also import data from PRAAT, such as pitch contours and intensity variation analyses.
(Figure: the annotation scheme combines audio descriptors, context descriptors, and emotion descriptors, both categorical and dimensional.)
3.2. Annotation scheme
We consider as relevant elements for the annotation: the emotional content (category labels and a dimensional description), the context (here, the threat), and the acoustic properties extracted by PRAAT [7].
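To make the scheme concrete, the elements above could be gathered into one record per speaker turn. The following is a minimal Python sketch with purely illustrative field names and scale values (the actual annotation is stored via ANVIL's XML specification and interpretation files, not this structure):

```python
from dataclasses import dataclass, field

# Hypothetical record for one speaker turn, combining the three descriptor
# families of the scheme: emotion (categorical + dimensional), context,
# and acoustic properties. All names and scales are illustrative.

@dataclass
class SpeakerTurnAnnotation:
    turn_id: str
    category: str            # e.g. "fear", "other_negative", "neutral", "other"
    intensity: int           # dimensional rating, e.g. weak .. strong
    evaluation: int          # negative .. positive
    reactivity: int          # passive .. active
    context: str             # here, the type of threat
    acoustics: dict = field(default_factory=dict)  # e.g. mean F0, intensity (dB)

turn = SpeakerTurnAnnotation(
    turn_id="seq012_turn03",
    category="fear",
    intensity=2,
    evaluation=-2,
    reactivity=2,
    context="physical threat",
    acoustics={"mean_f0_hz": 310.5, "mean_intensity_db": 72.1},
)
```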
4. Perceptual validation
4.1. Protocol and subjects
The perceptual test is conducted to answer several questions. First, we evaluate the presence of emotions corresponding to abnormal situations. We also estimate whether the selected basic unit, the speaker turn, is salient for carrying a particular emotion. We select 40 speaker turns illustrating the previously described emotion categories. Each class is illustrated by 10 stimuli, pronounced by 5 male and 5 female speakers. The speaker turns employed as stimuli vary in length (from 3 to 43 words) and are presented in random order. We consider two experimental conditions, +/-video, in order to verify the role of the audio cues in perceiving the target emotions. The +video condition provides both the video and audio recordings of each speaker turn; the -video condition provides only the audio file. 22 subjects participated in the perceptual test, 11 in the -video condition and 11 in the +video condition. They are instructed beforehand on the purpose of the test, i.e. to judge short audio and/or audiovisual sequences in terms of the emotion they perceive and to evaluate that emotion along the three abstract dimensions. We expect the test to allow us to compare the two description schemes, i.e. categorical vs. dimensional. Subjects are asked to name the emotion emerging from each stimulus. The abstract dimensions are described and illustrated with examples. A familiarization phase of 5 stimuli precedes the test; these training stimuli are not included in the results and serve only to help subjects practice and build a personal evaluation scale. All subjects understand English without difficulty, and the transcription of each stimulus is also provided in order to avoid misunderstandings. The test phase consists in listening to (and, in the +video condition, watching) the 40 stimuli and answering the questions given in the instructions. Subjects may play each stimulus as many times as they wish.
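As a rough sketch of the test material, the 4 classes x 10 stimuli design (5 male, 5 female per class) with randomized presentation order could be generated as follows; class names and stimulus identifiers are hypothetical:

```python
import random

# Illustrative construction of the 40-stimulus test list: 4 emotion classes,
# 10 speaker turns each (5 male, 5 female speakers), shuffled into a random
# presentation order. Class names and IDs are invented for the sketch.
CLASSES = ["fear", "other_negative", "neutral", "other_normal"]

def build_stimulus_list(seed=0):
    stimuli = [
        {"stimulus_id": f"{cls}_{sex}{i}", "cls": cls, "sex": sex}
        for cls in CLASSES
        for sex, i in [("m", k) for k in range(5)] + [("f", k) for k in range(5)]
    ]
    rng = random.Random(seed)     # fixed seed so the order is reproducible
    rng.shuffle(stimuli)          # random presentation order
    return stimuli

order = build_stimulus_list()
```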
4.2. Results
The figures below show the percentage of intensity (figure 2), evaluation (figure 3), and reactivity (figure 4) labels for emotions occurring in abnormal situations and for other emotions, in both test conditions (audio and audiovisual). We present here the main results, focusing both on the evaluation of the stimuli along the abstract dimensions according to the experimental condition (+/-video) and on their categorical labeling. The results obtained with the three-dimensional emotion description show a differentiation between emotions in abnormal situations (the global fear class and other negative emotions) and emotions in normal situations (neutral and other emotions). Indeed, the speaker turns initially labeled as emotions in abnormal situations are perceived as more intense (figure 2). On the evaluation axis, emotions in normal situations are globally perceived at the zero level, i.e. as neither negative nor positive (figure 3); as expected, stimuli corresponding to abnormal situations are evaluated as negative. On the reactivity axis, emotions in normal situations show the same pattern as for evaluation (figure 4), whereas emotions in abnormal situations are more frequently judged active. Moreover, emotions are perceived as more intense with the support of the video. However, for all three dimensions the two curves are close, which suggests that audio cues may be sufficient to detect such emotional states.
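The percentage curves of figures 2-4 amount to, for each condition, the share of subjects' ratings at each level of a dimensional scale. A minimal sketch, with invented toy ratings:

```python
from collections import Counter

def label_percentages(ratings):
    """Map each scale level to its percentage among the given ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    return {level: 100.0 * n / total for level, n in sorted(counts.items())}

# Toy intensity ratings for one stimulus class in the two conditions;
# the real data are the 11 subjects' answers per condition.
plus_video  = [2, 3, 3, 2, 1, 3, 2, 3, 3, 2]
minus_video = [1, 2, 2, 3, 1, 2, 2, 3, 2, 1]

print(label_percentages(plus_video))   # → {1: 10.0, 2: 40.0, 3: 50.0}
```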
(Figures 2-4: percentage of intensity, evaluation, and reactivity labels for emotions in abnormal vs. normal situations, in the audio and audiovisual conditions.)
5. Conclusion
This paper presented preliminary work on the acquisition and annotation of a fictional database for the detection of emotions in abnormal situations, and particularly fear. We showed the perceptual significance of the audio cues and the role of the video, the presence of the target emotions in our data, and the interest of the emotion annotation strategy, i.e. categorical labels combined with abstract dimensions. Ongoing work focuses on a finer discrimination inside the global emotion classes and on the correlation between the context descriptors and the vocal manifestations of emotions.
6. References