cues
Alessandra Cervone¹, Catherine Lai¹, Silvia Pareti¹,², Peter Bell¹
¹School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
²Google Inc., 8002 Zurich, Switzerland
s1466724@sms.ed.ac.uk, clai@inf.ed.ac.uk, spareti@google.com, Peter.Bell@ed.ac.uk
Abstract

The phenomenon of reported speech, whereby we quote the words, thoughts and opinions of others or recount past dialogue, is widespread in conversational speech. Detecting such quotations automatically has numerous applications: for example, enhancing automatic transcription or spoken language understanding. However, the task is challenging, not least because lexical cues of quotation are frequently ambiguous or absent in spoken language. The aim of this paper is to identify potential prosodic cues of reported speech which could be used, along with lexical ones, to automatically detect quotations and ascribe them to their rightful source, that is, to reconstruct their attribution relations. To this end we analyse SARC, a small corpus of telephone conversations that we have annotated with attribution relations. The results of the statistical analysis performed on the data show how variations in pitch, intensity, and timing features can be exploited as cues of quotation. Furthermore, we build an SVM classifier which integrates lexical and prosodic cues to automatically detect quotations in speech, and which performs significantly better than chance.

Index Terms: speech recognition, quotations, prosody, attribution relations, discourse, conversational speech

1. Introduction

Given the pervasiveness of reported speech in language, automatically identifying the presence of quotations has the potential to improve many automatic speech processing tasks. However, detecting quotations in speech is harder than in written text: we cannot rely purely on lexical cues (he said..., she told me..., etc.), since these are often absent in natural dialogue, and there is, of course, no punctuation marking quotation boundaries. As an example of the complexity of this task, consider the following transcribed extract from the Speech Attribution Relations Corpus (SARC) of spoken dialogues analysed in this article:

(1) I said to him when you left do you remember I told you I said to him don't forget Dave if you ever get in trouble give us a call you never know your luck

The presence of the verbs said and told suggests that the extract may contain reported speech. It is, however, much harder to determine the number of quotations and their exact boundaries. Possible interpretations of the example include:

(2) I said to him: When you left, (do you remember? I told you) I said to him: Don't forget Dave if you ever get in trouble give us a call. You never know your luck.

(3) I said to him when you left: Do you remember? I told you I said to him: Don't forget Dave if you ever get in trouble give us a call. You never know your luck.

(4) I said to him when you left, (do you remember? I told you) I said to him: Don't forget Dave if you ever get in trouble give us a call. You never know your luck.

If we rely only on the transcription of Example 1, without listening to the speech itself, even a human would find it very hard to decide which interpretation is correct.

Given this complex interaction between lexical and acoustic information, automatically extracting quotations from speech is very challenging. Nonetheless, being able to detect reported speech would be useful for many applications. It could improve sentence boundary detection, since quotations may carry prosodic marking similar to that of intonational phrases (which could mislead a system into detecting them as separate sentences). Quotation detection could also help speaker identification, since a system could attribute the quoted material to different speakers based on changes in prosodic encoding. Automatically extracting quotations and ascribing them to their source would further benefit spoken language processing tasks such as information extraction, named entity recognition, and coreference resolution, as well as providing a better understanding of dialogue structure. Finally, being able to rely on prosodic marking to detect the presence of quotations would be a great advantage for the automatic transcription of speech.

Previous work on reported speech has studied the phenomenon mainly from a purely linguistic perspective [1, 2, 3, 4, 5, 6, 7], or has considered the prosodic level only qualitatively [8, 9]. While the speech processing literature has investigated prosodic features for automatic punctuation detection, these studies concentrate primarily on full stops, commas and question marks, e.g. [10, 11, 12, 13]. In this work we connect the two approaches by studying the potential acoustic correlates of the punctuation devices that mark the presence and boundaries of reported speech, such as quotation marks in the case of direct quotations.

This research direction was suggested by the results of our previous study [14], in which we performed a linguistic analysis of the quotations in SARC, a small corpus we created to study attribution in speech. SARC is annotated with attribution relations (ARs), i.e. relations holding between a source and the text span containing the reported speech [15]. The annotation followed a scheme adapted from the one developed by [16] for the annotation of PARC, the first large corpus annotated for ARs in written text. The analysis of the ARs in SARC showed that to detect quotations in speech we cannot rely on lexical information alone; we also need prosodic information, given that without punctuation the number of potential interpretations of the words grows exponentially (as in Example 1). We argued that, in order to build a system able to automatically extract reported speech from spoken language, the lexical cues must be integrated with prosodic ones. In particular, a pilot acoustic analysis of the ARs in SARC suggested that the quotation span (difficult to detect given its fuzzy boundaries) could be prosodically marked.

In this paper, we test the hypotheses put forward in our previous study by performing an acoustic analysis of the reported speech found in SARC; that is, we test whether prosodic information can be used to automatically detect reported speech. In Section 2, we describe the challenge of detecting ARs in speech and the potential applications of a system able to automatically detect reported speech from spoken language; we also discuss the findings reported in the previous literature regarding the prosodic correlates of reported speech. In Section 3, we provide further details on the construction and annotation of the corpus, and we describe the setup of our experiments. In Section 4, we report the results of the statistical analysis performed on SARC, which show that prosodic correlates of reported speech do exist. We use these features, along with lexical ones, to train classifiers for detecting reported speech segments and their boundaries; the results suggest that prosodic features can be used successfully for this task.

...considered as correlates of quotations. In particular, these studies found that shifts in pitch [9, 17, 18, 19], intensity [9, 19], and timing features, in particular pause durations [19], could act as markers of reported speech. However, the literature does not offer a comprehensive insight into the phenomenon of attribution. To our knowledge, the first paper suggesting the possibility of studying the prosodic encoding of reported speech is [9], which presented only a qualitative analysis. Other studies analysing prosodic correlates of quotations [17, 18, 19] report statistically significant results, but focus on only one aspect of the phenomenon, for example considering only direct reported speech [18, 19] or only variations in pitch [17, 18], with some relying on rather small corpora [17]. Moreover, these studies provide only descriptive analyses, and have not extended their work to building classifiers to test their findings in a practical setting. To the best of our knowledge, there has been no attempt in the literature to integrate lexical and acoustic features to automatically detect quotations in speech.

In this study we aim to offer a more comprehensive insight into the prosodic dimension of attribution. Using a larger corpus, we examine both direct and indirect reported speech, considering an expanded set of prosodic and timing features based on those investigated in past studies. We also explore the possibility of integrating lexical cues (identified in our previous study [14]) with acoustic features to automatically classify reported speech.
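The proposed combination of lexical and prosodic cues can be illustrated in code. The following is a minimal, hypothetical sketch, not the system built in this study: the synthetic data, the specific feature set, and the assumption that quoted spans show an upward pitch shift and higher intensity are all illustrative assumptions. It computes summary prosodic statistics plus a reporting-verb indicator per segment and trains an SVM on the combined vector:

```python
# Hypothetical sketch: lexical + prosodic features fed to an SVM.
# The pitch/intensity effect on quoted spans is assumed for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

REPORTING_VERBS = {"said", "told", "says", "goes"}

def prosodic_features(f0, rms):
    # Summary statistics of the pitch (Hz) and intensity (RMS) contours.
    return [float(np.mean(f0)), float(np.std(f0)),
            float(np.mean(rms)), float(np.std(rms))]

def lexical_cue(words):
    # 1 if the segment contains a common reporting verb, else 0.
    return int(any(w in REPORTING_VERBS for w in words))

def synthetic_segment(is_quote):
    # Assumed effect, for the synthetic data only: quotations shift
    # pitch upward and are slightly louder than surrounding speech.
    f0 = (160.0 if is_quote else 120.0) + rng.normal(0.0, 5.0, 50)
    rms = (0.13 if is_quote else 0.10) + rng.normal(0.0, 0.005, 50)
    words = ["i", "said", "forget"] if is_quote else ["you", "never", "know"]
    return prosodic_features(f0, rms) + [lexical_cue(words)], int(is_quote)

segments = [synthetic_segment(i % 2 == 0) for i in range(200)]
X = np.array([feats for feats, _ in segments])
y = np.array([label for _, label in segments])

# Train on the first 150 segments, evaluate on the held-out 50.
clf = SVC(kernel="rbf", gamma="scale").fit(X[:150], y[:150])
accuracy = clf.score(X[150:], y[150:])
```

In the actual study the prosodic measurements come from recorded speech and the lexical cues from the transcripts; the sketch only shows how the two feature types can share a single classifier input.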
[4] E. Holt and R. Clift, Reporting talk: Reported speech in interaction. Cambridge University Press, 2007.

[5] E. Holt, "Reported speech," The Pragmatics of Interaction: Handbook of Pragmatics Highlights, vol. 4, pp. 190–205, 2009.

[6] G. Bolden, "The quote and beyond: defining boundaries of reported speech in conversational Russian," Journal of Pragmatics, vol. 36, no. 6, pp. 1071–1118, 2004.

[7] J. Sams, "Quoting the unspoken: An analysis of quotations in spoken discourse," Journal of Pragmatics, vol. 42, no. 11, pp. 3147–3160, 2010.

[8] S. Günthner, "Polyphony and the layering of voices in reported dialogues: An analysis of the use of prosodic devices in everyday reported speech," Journal of Pragmatics, vol. 31, no. 5, pp. 685–708, 1999.

[9] G. Klewitz and E. Couper-Kuhlen, "Quote-unquote? The role of prosody in the contextualization of reported speech sequences," Universität Konstanz, Philosophische Fakultät, Fachgruppe Sprachwissenschaft, 1999.

[10] D. Baron, E. Shriberg, and A. Stolcke, "Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues," Channels, vol. 20, no. 61, p. 41, 2002.

[11] J.-H. Kim and P. C. Woodland, "A combined punctuation generation and speech recognition system and its performance enhancement using prosody," Speech Communication, vol. 41, no. 4, pp. 563–577, Nov. 2003.

[12] F. Batista, H. Moniz, I. Trancoso, and N. Mamede, "Bilingual experiments on automatic recovery of capitalization and punctuation of automatic speech transcripts," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 474–485, Feb. 2012.

[13] J. Kolář and L. Lamel, "Development and evaluation of automatic punctuation for French and English speech-to-text," in INTERSPEECH, 2012.

[14] A. Cervone, S. Pareti, P. Bell, I. Prodanof, and T. Caselli, "Detecting attribution relations in speech: a corpus study," in Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & of the Fourth International Workshop EVALITA 2014, 9-11 December 2014, Pisa, 2014, pp. 103–107.

[15] R. Prasad, N. Dinesh, A. Lee, A. Joshi, and B. Webber, "Attribution and its annotation in the Penn Discourse Treebank," Traitement Automatique des Langues, Special Issue on Computational Approaches to Document and Discourse, vol. 47, no. 2, pp. 43–64, 2007.

[16] S. Pareti, "A database of attribution relations," in LREC, 2012, pp. 3213–3217.

[17] W. Jansen, M. L. Gregory, and J. M. Brenier, "Prosodic correlates of directly reported speech: Evidence from conversational speech," in ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding, 2001.

[18] R. Bertrand, R. Espesser et al., "Voice diversity in conversation: a case study," in Proceedings of Speech Prosody 2002, 2002.

[19] M. Oliveira Jr and D. A. Cunha, "Prosody as marker of direct reported speech boundary," in Proceedings of Speech Prosody 2004, 2004.

[20] A. Cervone, "Attribution relations extraction in speech: A lexical-prosodic approach," Master's thesis, University of Pavia, Italy, 2014.

[24] K. Evanini and C. Lai, "The importance of optimal parameter setting for pitch extraction," The Journal of the Acoustical Society of America, vol. 128, no. 4, p. 2291, Oct. 2010.

[25] D. Bates, M. Maechler, B. Bolker, and S. Walker, "lme4: Linear mixed-effects models using Eigen and S4," 2014, R package version 1.1-7. [Online]. Available: http://CRAN.R-project.org/package=lme4

[26] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[27] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, "ROCR: visualizing classifier performance in R," Bioinformatics (Oxford, England), vol. 21, no. 20, pp. 3940–3941, Oct. 2005.

[28] I. Buchstaller, "He goes and I'm like: The new quotatives revisited," in The 30th Annual Meeting on New Ways of Analyzing Variation (NWAV 30), 2001, pp. 11–14.

[29] T. O'Keefe, S. Pareti, J. R. Curran, I. Koprinska, and M. Honnibal, "A sequence labelling approach to quote attribution," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, ser. EMNLP-CoNLL '12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 790–799.