
2008 IEEE/RSJ International Conference on Intelligent Robots and Systems

Acropolis Convention Center


Nice, France, Sept. 22-26, 2008

Emotional speech in the context of entertainment robots.


The effect of different emotions on users' perceptions.
Olaf Kroll-Peters, Simon Rauterberg, Ugur Surucu, Andreas Unterstein and Mathias Wilhelm*

*The authors are listed in alphabetical order. All authors are members of the Chair of Agent Technologie (AOT), Technische Universität Berlin, Germany. olaf.kroll-peters@dai-labor.de, floorice@cs.tu-berlin.de, ugurs@cs.tu-berlin.de, stein-uni@web.de, mathias.wilhelm@dai-labor.de.

Abstract: We examined whether entertainment robots should use emotions. In an experiment we presented jokes to participants to find out whether different emotions have different effects on their pleasure. We found that emotions do have an impact on users' perceptions when using entertainment robots.

I. INTRODUCTION

Robots today have a growing importance for many aspects of our lives. They perform tasks of great diversity, being employed as lawn-mowing robots, toy robots, or recently even as maintenance and monitoring robots. Robots interact with their users during many of these activities, and this interaction and communication should be as effortless as possible, since this helps to increase acceptability and the diffusion process. Interaction should be human-like, because every user already knows how to interact in this way and can therefore make use of it easily. Human interaction is often based on speech. Still the question remains: will we be able to communicate with a computer on the same level as we communicate with other humans in the future? It is a goal that remains of great importance, and advancing the production of speech in robotic systems remains an important research area. As it is not yet possible to produce convincingly natural-sounding language for a non-strict speech domain, there are only a few approaches to studying the influence of different emotions on the production of speech. During a student project on the integration of robots into everyday life, a web-service-based system for the entertainment of the inhabitants was developed. The robot, a Sony AIBO [1], was used to deliver a broad variety of entertainment services. Many of these services use speech to communicate with the user.

The usage of synthesized speech and the influence of emotions in synthesized speech on the listener have already been analyzed in several studies. In [2], the authors come to the result that the influence of emotions is less remarkable with synthetic speech than with recorded speech. Nevertheless, [2] points out that a synthetic voice with happy emotions leads to a positive perception of the spoken text, whereas the usage of sad emotions makes the text more uninteresting. Appropriate emotions engender a feeling of attraction to the text, and emotions that are in opposition to the text support its credibility. Another aspect is examined in [3]: the influence of the consistency of a human-computer interface controlled by speech. The authors conclude that consistency is an important factor and that synthetic speech and real speech should not be mixed. Although mixing felt pleasant to the participants, their efficiency when working on their specific task decreased. Another study [4] examines the influence of the consistency of speech (recorded plus synthetic vs. only synthetic). The influence on jokes is considered in [3] and [5]. Like [3], the authors conclude that consistency of the produced speech is preferred: the tested participants preferred a joke system with a purely synthetic voice over a system with changing acoustic output. A study about the influence of emotions on computer-generated speech is [6]. It describes an experiment with reviews being read with computer-generated speech that either matches or differs from the personality of the test person. The result is that a matching personality has a positive impact on the reception of a review.

This paper presents the results of our experiment on the influence of emotions in computer-generated speech. In the experiment, we used the AIBO robot to communicate with users. The AIBO reads e-mails and tells stories or jokes on demand; these stories and jokes are retrieved from the WWW via a free Web service. Speech production is accomplished by the open-source text-to-speech (TTS) system MARY [7]. The general possibilities of TTS systems are described and discussed in [8]. The TTS system MARY offers the possibility to configure parameters controlling the emotion of the speech production, and our interest in the project was therefore to control these emotion parameters. A summary of multiple parameters usable for the emotional enhancement of synthetic speech is given in [9]. In the following section we describe our experiment and procedures in detail. This is followed by a presentation of the collected data, which is then discussed, and we conclude this paper with an outlook on future developments.

II. EXPERIMENT

We wanted to study the influence of emotions on the perception of synthetic speech. Therefore, we let AIBO robots communicate with diverse users. No participant had had any real contact with the AIBO robot system before. The experiment was conducted with each participant individually. Each participant sat at a table on which the AIBO was placed. Additionally, two supervisors were present who guided the experiment and answered questions. The participants were told that the experiment was done in the context of a student project examining the interaction between humans and robots.

Prior to each experiment, the demographic data of each participant (age, gender, native speaker) were recorded.



To ensure that every user had the same communication with the AIBO, we made the AIBO tell jokes. We used randomly chosen jokes in the German language. The jokes were enriched with different emotions and narrated to the listener. Every participant was told four jokes. These four jokes were generated and recorded with the TTS system MARY [7] and by a real human voice. Through its configuration interface, MARY enables the production of synthetic speech with various emotions. We decided to use a synthetic male voice with the three emotions sad, neutral and happy for the experiment. The sad voice was slow and deep with less power. The happy voice sounded higher, had more power and was spoken faster. For the neutral voice, all controls were in the middle.
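As an illustration of the kind of configuration involved, the following sketch maps the three emotions to prosody settings in the spirit described above (slower, deeper and softer for sad; faster, higher and louder for happy). The attribute names, the markup and the concrete values are our own assumptions for illustration; the paper does not list the exact settings used with MARY.

```python
# Hypothetical illustration: mapping the three experiment emotions to
# prosody settings of the kind a TTS system such as MARY can be given.
# The attribute names and values are assumptions, not the settings
# actually used in the study.

EMOTION_PROSODY = {
    "sad":     {"rate": "-30%", "pitch": "-20%", "volume": "soft"},    # slow, deep, less power
    "neutral": {"rate": "+0%",  "pitch": "+0%",  "volume": "medium"},  # all controls in the middle
    "happy":   {"rate": "+30%", "pitch": "+20%", "volume": "loud"},    # faster, higher, more power
}

def joke_to_prosody_markup(text: str, emotion: str) -> str:
    """Wrap a joke in SSML-style prosody markup for the chosen emotion."""
    p = EMOTION_PROSODY[emotion]
    return (f'<prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
            f'volume="{p["volume"]}">{text}</prosody>')

if __name__ == "__main__":
    print(joke_to_prosody_markup("Treffen sich zwei Jaeger ...", "sad"))
```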
For us, an adaptation of the voice to the looks of the robot was not possible (for this, a dog's voice using human words, similar to several animated movies, would have been necessary). The real human voice was used as a reference, so that we would have a contrast to the synthetic voice. Every synthetic joke was played back 18 times during the test (six times sad, six times neutral and six times happy). The reference set, the four jokes narrated by a human voice, was played six times per joke. To be able to assess the jocularity of every single joke, every participant received a different permutation of jokes. Fig. 1 shows the experiment setup.

[Fig. 1. Experiment Setup]
The 24 participants of our experiment each heard the four jokes in an arbitrary permutation of conditions. For example, joke A was narrated by a human voice, joke B by a synthetic sad voice, joke C by a synthetic neutral voice and joke D by a synthetic happy voice. Additionally, we decided to let the Sony AIBO perform some random moves (pointing a paw towards the participant, wagging its tail) or a sound (barking once) before a joke was narrated. By doing so, we wanted to ensure a comfortable ambience.
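A compact way to obtain the balance described above (24 participants, four jokes, four voice conditions, and every joke/condition pairing occurring six times) is to assign each participant one of the 24 possible permutations of the conditions. The sketch below is our own illustration of such a counterbalanced assignment, not necessarily the authors' actual procedure.

```python
# Illustrative sketch of a counterbalanced joke/voice assignment that
# reproduces the counts reported in the paper; the authors' actual
# assignment procedure is not described in detail.
from itertools import permutations
from collections import Counter

JOKES = ["A", "B", "C", "D"]
CONDITIONS = ["human", "sad", "neutral", "happy"]

# All 4! = 24 orderings of the conditions, one per participant.
assignments = [dict(zip(JOKES, order)) for order in permutations(CONDITIONS)]

# Every (joke, condition) pairing occurs exactly six times across participants.
counts = Counter((joke, cond) for a in assignments for joke, cond in a.items())
assert all(n == 6 for n in counts.values())

print(assignments[0])  # e.g. {'A': 'human', 'B': 'sad', 'C': 'neutral', 'D': 'happy'}
```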
After each joke, a questionnaire had to be answered by the participant. The questions referred to the acoustical intelligibility and the semantic understanding of the jokes as well as to the humorous aspect. We provided a small software tool to fill out the questionnaire. Furthermore, the participants were deliberately asked to make suggestions regarding the experiment, so that we could gather additional, possibly important data. A joke could be narrated repeatedly if it was not acoustically intelligible at once; we allowed this in order to at least test the reception of the joke's content. This option was used especially often in combination with the happy synthetic voice. At the conclusion of the experiment, every participant filled in a questionnaire about the use of entertainment robots in the household.

The participants came from the university environment of the authors: 20% female, 80% male, 80% native speakers and 20% non-native speakers. The age of the participants was for the most part in the range of 25-35.

III. EXPERIMENTAL RESULTS

The results of the experiment revealed that the human voice was perceived as acoustically intelligible at 100%. As soon as a synthetic voice was used, one could notice a significant increase in the effort of listening. Nevertheless, the synthetic voice with a sad emotion was almost on par with the human voice regarding intelligibility: the sad synthetic voice was intelligible at above 75%. The happy voice was significantly less intelligible; it was intelligible at a rate of only 17%. Therefore, the happy emotion of the voice is unusable in the used manner and/or for the given task. The problem is that the emotional filter uses a higher word rate for a happy emotion. This might be intentional, but it also naturally leads to a worse reception of the spoken word. The emotional filter is not yet applicable to every corpus of words. Given a better corpus, a more subtle setting, or even another TTS system, the results for a happy voice may improve significantly. On the other hand, a lowered word rate is part of the parameters for the sad voice, which leads to a very high intelligibility of the sad voice. Table I and Fig. 2 show these results.

TABLE I
ACOUSTICAL UNDERSTANDING

Acoustically intelligible?   No     Yes
Neutral emotion              0.33   0.67
Happy emotion                0.83   0.17
Sad emotion                  0.25   0.75
Human voice                  0      1

We think that the better understandability of the sad voice in our experiment was due to the slow narration, which gave the listener more time to adapt to the unknown voice.

Furthermore, we examined whether the emotion expressed in the speech actually triggered the same emotion: did the participants actually recognise the sad voice as such? To this end, the participants were asked after listening to the jokes to assign an emotion to the voice (happy, sad, neutral). All participants found the synthetic voice to be too neutral. However, they were still able to assign the correct emotion.

Following the question about the acoustical intelligibility, the participants had to judge their understanding of the joke. As already remarked, a joke could be listened to more than once, so one can expect the results for the understanding of the joke to be better than the results for acoustical intelligibility.
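The proportions in Tables I to III are consistent with simple fractions over the 24 trials per voice condition (one joke per condition per participant); for example, 0.83 corresponds to 20 of 24 and 0.17 to 4 of 24. The sketch below shows this kind of aggregation on made-up placeholder records; it is an illustration of how such tables can be computed, not the authors' analysis code or data.

```python
# Illustration only: aggregating per-trial questionnaire answers into the
# per-condition proportions reported in Table I. The records below are
# invented placeholders, not data from the study.
from collections import defaultdict

trials = [
    # one record per (participant, joke): which voice condition was used and
    # whether the joke was judged acoustically intelligible
    {"participant": 1, "condition": "human",   "intelligible": True},
    {"participant": 1, "condition": "sad",     "intelligible": True},
    {"participant": 1, "condition": "neutral", "intelligible": True},
    {"participant": 1, "condition": "happy",   "intelligible": False},
    # ... further participants (24 in the experiment)
]

def proportion_yes(records, key="intelligible"):
    """Fraction of 'yes' answers per voice condition."""
    yes, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["condition"]] += 1
        yes[r["condition"]] += int(r[key])
    return {cond: yes[cond] / total[cond] for cond in total}

print(proportion_yes(trials))
```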

TABLE II
WAS CONTENT UNDERSTANDABLE?

Joke understandable?   No     Yes
Neutral emotion        0.29   0.71
Happy emotion          0.42   0.58
Sad emotion            0.17   0.83
Human voice            0.04   0.96

The results in Table II and Fig. 3 show that the results for understanding have slightly increased compared to the intelligibility. The result for the happy emotion indicates that the meaning of a joke is seldom transmitted if it is hard to hear. Strangely enough, this is the case even if the joke is presented multiple times.

[Fig. 3. Was content understandable?]
The last part of the analysis relates to the humour quality of the jokes. Not every human laughs about the same joke, and whether a listener considers a joke to be funny surely depends not solely on the narrator, but also on how it is presented to him or her. The participants were asked to judge the humorous quality of the jokes. The results in Table III and Fig. 4 reveal that the jokes presented with a synthetic happy voice were considered less funny or even not funny at all. This is not really astonishing, for those jokes were probably not heard well or not understood at all. A noteworthy fact is that the neutral synthetic voice was not perceived as well as the sad voice. Comparing the real voice with the sad voice, it is remarkable that the results are almost identical. As judged by the participants, the synthetic voice with the sad emotion had a similar reception to the real human voice regarding intelligibility and understanding. If one of the given sets of speech parameters were to be implemented in an entertainment robot, the sad emotion would be a good starting point to mimic human speech.

[Fig. 4. Was content funny?]

One of the questions concerning the comprehension of the joke asked about the perceived emotion (see Fig. 5). The participants struggled to associate emotions with the speech samples; we registered a strong tendency towards neutral. An exception was the real voice, but even here the opinions differed. One can assume that the emotions were not perceived very well. It is conspicuous that our sad emotion was more often assigned as happy than as sad. This question remains to be examined.

[Fig. 5. Perceived emotions]

An improvement of the quality of the built-in speaker of the AIBO will probably lead to better results. The output was regarded as too soft in most cases. Even though the speaker was at maximum volume and the audio files had been processed for an additional increase in loudness, participants often leaned with their ear towards the AIBO. The experiment would have a different outcome with loudspeakers of a higher standard. There was also no introduction to the joke, so the participants had no way to prepare for it. Some participants suggested that a few introductory words from the AIBO before reading the joke would have been helpful to get used to the sound of the voice.

TABLE III
WAS CONTENT FUNNY?

Funny?    very   fairly   average   a little   not at all
Neutral   0.04   0.13     0.38      0.13       0.33
Happy     0      0.04     0.17      0.33       0.46
Sad       0.13   0.29     0.25      0.13       0.21
Human     0.08   0.29     0.38      0.17       0.08
[Fig. 2. Acoustical Understanding]

IV. QUESTIONNAIRE

Every participant filled out a questionnaire in addition to the experiment described above. This was intended to record the thoughts of the participants in relation to the experiment and their judgement about the use of entertainment robots. Overall, one could notice that the participants had fun during the experiment. At the beginning the looks on their faces were rather sceptical, but this diminished during the process, and everybody smiled at least at some point during the experiment. The question regarding the idea of robots with emotionalized speech was answered with "good" by most of the participants; "interesting" was given multiple times as well. Just two of the twenty-four participants rated the idea as irrelevant. The reaction of the participants to the movements of the robot prior to the narration of a joke was overall positive. Just three of the participants rated the movements as irritating and distracting; on the other hand, there were multiple praises for the movements and suggestions for additional moves.

V. CONCLUSION AND FUTURE WORK

Our experiment was set up to answer the question: "How important is emotional speech in the context of entertainment robots?" Our participants stated that emotional speech is preferred for communication with a robot, and a change in the parameters that control the emotions led to a partly better comprehension. The answer to our question is therefore: very important. At the same time, our tests revealed that there are still a lot of improvements to be made before emotions can be reproduced with synthetic speech in a correct and sufficient manner. Because of this, we will put additional effort into this issue. More experiments will be set up to improve the users' perception of the configured emotions. In upcoming experiments we will evaluate emotional speech in an environment with children, as we want to find out whether there is a difference between children and adults in perceiving emotional speech.

VI. ACKNOWLEDGMENTS

The authors gratefully acknowledge the infrastructure, assistance and comments of the DAI-Labor at the Technische Universität Berlin.

VII. REFERENCES

[1] Sony AIBO, http://support.sony-europe.com/aibo (21.02.2008).
[2] Nass, C., Foehr, U., Brave, S. and Somoza, M.: The effects of emotion of voice in synthesized and recorded speech. Proc. AAAI Symposium "Emotional and Intelligent II: The Tangled Knot of Social Cognition", 2001.
[3] Gong, L. and Lai, J.: Shall We Mix Synthetic Speech and Human Speech? The Impact on Users' Task Performance and Attitude. Proc. Human Factors in Computing Systems (ACM CHI), pp. 158-165, 2001.
[4] Nass, C., Simard, C. and Takhteyev, Y.: Should Recorded and Synthesized Speech Be Mixed? www.stanford.edu/nass/comm369/pdf/MixingTTSandRecordedSpeech.pdf (21.02.2008).
[5] Mihalcea, R. and Strapparava, C.: Making Computers Laugh: Investigations in Automatic Humor Recognition. Proc. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 531-538, 2005.
[6] Nass, C. and Lee, K. M.: Does Computer-Generated Speech Manifest Personality? An Experimental Test of Similarity-Attraction. Proc. Conference on Human Factors in Computing Systems, The Hague, pp. 329-336, 2000.
[7] The MARY Text-to-Speech System, http://mary.dfki.de (21.02.2008).
[8] Mohasi, E. and Mashao, D.: Text-to-Speech Technology in Human Computer Interaction. 5th Conference on Human Computer Interaction in Southern Africa (CHISA), pp. 79-84, 2006.
[9] Schröder, M.: Emotional Speech Synthesis: A Review. Proc. Eurospeech, vol. 1, pp. 561-564, 2001.

