Wonkyu Park1, Sung H. Han1, Yong S. Park1, Jungchul Park1, and Huichul Yang2

1 Department of Industrial and Management Engineering, POSTECH,
San 31, Hyoja, Pohang, 790-784, South Korea
{p09plus1,shan,drastle,mozart}@postech.ac.kr
2 Samsung Electronics, Seoul, South Korea
huicul.yang@samsung.com
1 Introduction
During the last two decades or so, many studies have been conducted to improve the
performance of spoken language dialogue systems (SLDSs). However, most studies
focused on recognition performance, while only a few studies investigated human
factors issues such as user models, linguistic behavior, user satisfaction, etc. [1].
Human factors issues play an important role in an SLDS because enhanced usability
can partially compensate for imperfect recognition accuracy. When it comes to
natural dialogues between users and an SLDS, it is obvious that the value of the
SLDS depends on its usability, which is critical to making the system commercially
successful [2].
Usability evaluation during the development process is essential because it
establishes the current usability level and reveals potential usability problems.
Although a variety of studies have conducted usability evaluations [3, 4, 5, 6, 7],
only a few proposed systematic
evaluation frameworks or methodologies for SLDSs [1, 8]. Walker et al. developed a
framework for evaluating SLDSs, PARADISE (Paradigm for Dialogue System
Evaluation) [8]. It provides a quantitative usability index (i.e. user satisfaction)
considering task success and costs. Dybkjær and Bernsen developed an evaluation
template for SLDSs that consisted of 10 entries such as ‘what is being evaluated’,
‘system part evaluated’, ‘type of evaluation’, ‘symptoms to look for’, etc. [1].
However, these studies are not easy for practitioners to apply to usability
evaluation because they do not specify concrete data collection methods.

N. Aykin (Ed.): Usability and Internationalization, Part I, HCII 2007, LNCS 4559, pp. 398–404, 2007.
© Springer-Verlag Berlin Heidelberg 2007
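For reference, PARADISE computes a single performance score from a normalized task-success measure (the kappa statistic) and a weighted sum of normalized cost measures: Performance = α·N(κ) − Σᵢ wᵢ·N(cᵢ), where N is z-score normalization. A minimal sketch of that computation, using invented data and illustrative equal weights (in PARADISE, α and the wᵢ are obtained by regressing on user satisfaction ratings):

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Z-score normalization N(x) = (x - mean) / st.dev., as in PARADISE."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def paradise_performance(kappas, costs, alpha=0.5, weights=None):
    """Per-dialogue performance = alpha * N(kappa) - sum_i w_i * N(cost_i).

    kappas  -- task-success (kappa) score of each dialogue
    costs   -- dict: cost-measure name -> per-dialogue values
    weights -- illustrative equal weights by default; PARADISE derives
               them by regressing on user satisfaction ratings
    """
    weights = weights or {name: 1.0 / len(costs) for name in costs}
    n_kappa = z_normalize(kappas)
    n_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    return [
        alpha * n_kappa[d] - sum(weights[m] * n_costs[m][d] for m in costs)
        for d in range(len(kappas))
    ]

# Invented data: three dialogues, two cost measures
scores = paradise_performance(
    kappas=[0.9, 0.6, 0.8],
    costs={"elapsed_time": [120, 200, 150], "num_errors": [1, 4, 2]},
)
```

Under this scheme, a dialogue scores highly when its task success is above average and its costs are below average.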
This paper aims to propose an evaluation framework for SLDSs. The framework
identifies usability measures for various aspects of SLDSs. Also, it proposes scenario-
based methods to effectively evaluate usability in terms of both performance and
satisfaction. In addition, a case study is conducted to validate the proposed
framework. An SLDS that provides the user with information about schedules,
contacts, weather, etc. in a home environment was developed for the case study.
and interaction. Various aspects of SLDSs can be evaluated by considering the three
components, while previous studies mainly evaluated the system only [1].
Criteria are developed to evaluate each component. The criteria are derived by
elaborating the processes shown in a modified job process chart. The job process
chart reported in [12] is a specific type of partitioned operational sequence
diagram. With the modified job process chart, practitioners can identify
system-user interaction processes and the information transmitted between them. An
example of the chart is shown in Fig. 2, from which the evaluation criteria for the
case study are derived.
For example, an evaluation criterion of ‘recognition performance’ is elaborated from
‘recognize input’. Another example is ‘user behavior’ that comes from ‘construct
utterance’ when the system fails to provide information that the users request.
[Flowchart content omitted. The chart traces user steps (construct utterance, speak at a microphone, read feedback message, read output message, judge whether the information is adequate) and system steps (recognize input, display feedback message, test output relevance, generate response), together with the messages transmitted between the two.]

Fig. 2. A modified job process chart of an SLDS used in the case study
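The derivation of criteria from chart steps can be encoded directly, e.g. as a lookup from each process step to the component it belongs to and the criterion it yields. The two mappings below marked as from the text (‘recognize input’ → recognition performance, ‘construct utterance’ → user behavior) are the paper’s own examples; the remaining step-criterion pairs are hypothetical illustrations:

```python
# Mapping from job-process-chart steps to evaluation criteria.
# 'recognize input' and 'construct utterance' follow the paper's examples;
# the other criteria names are hypothetical.
CHART_STEPS = {
    # system-side steps
    "recognize input":     {"component": "system", "criterion": "recognition performance"},
    "generate response":   {"component": "system", "criterion": "response quality"},
    "display feedback":    {"component": "system", "criterion": "feedback clarity"},
    # user-side steps
    "construct utterance": {"component": "user", "criterion": "user behavior"},
    "read output message": {"component": "user", "criterion": "comprehension"},
}

def criteria_for(component):
    """Collect the evaluation criteria derived from one component's steps."""
    return sorted(
        info["criterion"] for info in CHART_STEPS.values()
        if info["component"] == component
    )

system_criteria = criteria_for("system")
user_criteria = criteria_for("user")
```

Walking the chart in this way ensures every transmitted message and process step is covered by at least one criterion.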
A variety of usability measures were collected from previous studies [2, 3, 4, 5, 6, 7].
Some measures (e.g. number of barge-ins and number of SLDS help requests) were
SLDS-specific, while others (e.g. task completion time and number of errors) could
be used in general usability evaluations. The latter measures may need to be
modified to fit SLDSs. For example, the number of errors is modified into the
number of unrecognized words/utterances and the number of utterance construction
errors. Usability measures appropriate to each criterion were selected by ease of
measurement and relevance to usability. For example, word recognition rate was
selected to evaluate the ‘recognition performance’ criterion.
Table 1. Evaluation criteria and measures of each component for the case study
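A measure such as the word recognition rate can be computed by aligning the recognizer’s output with a reference transcript via word-level edit distance. A sketch, assuming rate = 1 − (word errors / reference words), which is one common definition (the paper does not state the exact formula it used):

```python
def word_recognition_rate(reference, hypothesis):
    """Fraction of reference words correctly recognized, via word-level
    Levenshtein alignment. One common definition; the paper does not
    specify its exact formula."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    # word error rate = edits / reference length; recognition rate = 1 - WER
    return max(0.0, 1.0 - dp[len(ref)][len(hyp)] / max(len(ref), 1))

rate = word_recognition_rate("show me the weather for today",
                             "show me weather for monday")
```

Sentence recognition rate follows the same idea at the whole-utterance level: a sentence counts as recognized only if every word matches.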
hypothesized by the developer and the user’s actual utterance pattern can be analyzed.
Realistic dialogues can be developed using step 1 stated above. Table 2 shows
examples of the two types of dialogues when the user conducts the same task.
The effects of previous experience with the pre-determined dialogues on the user’s
utterance pattern are also investigated by comparing two user groups (one group of
users who conducts the realistic dialogues only, and another group of users who
conducts realistic dialogues after experiencing the pre-determined ones).
Table 2. Examples of pre-determined dialogues and realistic dialogues for the same task (the realistic dialogues were performed through two transactions) [table content omitted]
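The between-groups comparison described above can be tested with a standard two-sample statistic on any usability measure. A sketch using Welch’s t statistic on invented task-completion times (the paper does not report which statistical test it used):

```python
from statistics import mean, variance
from math import sqrt

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    ma, mb = mean(sample_a), mean(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variance (n - 1)
    return (ma - mb) / sqrt(va / len(sample_a) + vb / len(sample_b))

# Invented task-completion times (seconds):
# group A ran the realistic dialogues only,
# group B ran realistic dialogues after the pre-determined ones.
group_a = [48.0, 55.0, 60.0, 52.0, 58.0]
group_b = [40.0, 44.0, 47.0, 42.0, 45.0]
t = welch_t(group_a, group_b)
```

A large positive t here would suggest that prior exposure to the pre-determined dialogues shortened completion times, i.e. that experience shapes the utterance pattern.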
Finally, the developers should improve the reasoning function that enables the
system to identify the user's intention from what it has recognized. This is
especially important when, as in this case, the system's recognition performance is
poor. Note that the correct response rates are higher than the sentence recognition
rates in all three experiments. The reasoning function can be improved by refining
the dialogue model based on the utterance patterns that users employ in their daily
lives.
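Such a reasoning function can be approximated, at its simplest, by scoring keyword overlap between the recognized words and each intent in the dialogue model, so that a correct response remains possible when part of the utterance is misrecognized. The intents and keywords below are hypothetical, not taken from the case-study system:

```python
# Hypothetical intent lexicon; in the paper's terms, a dialogue model
# refined from utterance patterns users employ in daily life.
INTENT_KEYWORDS = {
    "check_weather":  {"weather", "rain", "sunny", "temperature"},
    "check_schedule": {"schedule", "meeting", "appointment", "calendar"},
    "find_contact":   {"phone", "number", "contact", "call"},
}

def infer_intent(recognized_words, threshold=1):
    """Pick the intent whose keywords best overlap the recognizer output.
    Tolerates misrecognized words as long as at least `threshold`
    keywords survive recognition."""
    words = set(recognized_words)
    best, score = None, 0
    for intent, keys in INTENT_KEYWORDS.items():
        overlap = len(words & keys)
        if overlap > score:
            best, score = intent, overlap
    return best if score >= threshold else None

# Two words are misrecognized, but one surviving keyword still suffices:
intent = infer_intent(["wat", "is", "the", "weather", "tomorow"])
```

This illustrates why correct response rates can exceed sentence recognition rates: the whole sentence need not be recognized for the right response to be generated.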
[Bar chart omitted: percentage values for sentence recognition rate and correct response rate across Exp. 1–3.]

Fig. 3. Correct sentence recognition rates and correct response rates for three experiments
4 Conclusion
A usability evaluation framework for SLDSs was proposed. It focuses on both what to
evaluate and how to evaluate. Usability measures are systematically defined to
evaluate SLDSs. Evaluation criteria that could affect the usability of SLDSs are
identified from a modified job process chart. In addition, the study also proposes two
types of scenario-based evaluation methods. Each evaluation method can be used for
a different purpose. In a case study, an SLDS was evaluated using the proposed
framework. The case study revealed the usability level, usability problems, and design
implications for better usability. The framework described in the study can be
practically applied to evaluating the usability of SLDSs.
References
1. Dybkjær, L., Bernsen, N.O.: Usability Issues in Spoken Language Dialogue Systems.
Natural Language Processing 6, 243–272 (2000)
2. Kwahk, J.: A methodology for evaluating the usability of audiovisual consumer electronic
products. Unpublished Ph.D. dissertation, Pohang University of Science and Technology,
Pohang, South Korea (1999)
3. Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language
system. In: The 1995 AAAI spring symposium on empirical methods in discourse
interpretation and generation, pp. 34–39 (1995)
4. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Evaluation of spoken dialogues: user test with a
simulated speech recogniser. CPK - Center for PersonKommunikation, Aalborg University
9a & 9b (1996)
5. Litman, D.J., Pan, S.: Designing and Evaluating an Adaptive Spoken Dialogue System.
User Modeling and User-Adapted Interaction 12, 111–137 (2002)
6. Polifroni, J., Hirschman, L., Seneff, S., Zue, V.: Experiments in evaluating interactive
spoken language systems. In: The DARPA Speech and Natural Language Workshop, pp.
28–33 (1992)
7. Simpson, A., Fraser, N.A.: Black Box and Glass Box Evaluation of the SUNDIAL
System. In: The EUROSPEECH: European Conference on Speech Processing, Berlin, pp.
1423–1426 (1993)
8. Walker, M.A., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: A Framework for
Evaluating Spoken Dialogue Agents. In: The 35th annual meeting of the association for
computational linguistics (ACL-97), Madrid, Spain, pp. 271–280 (1997)
9. Han, S.H., Yun, M.H., Kwahk, J., Hong, S.W.: Usability of consumer electronic products.
International Journal of Industrial Ergonomics 28, 143–151 (2001)
10. Dybkjær, L., Bernsen, N.O.: Usability evaluation in spoken language dialogue systems. In:
The Proceedings of the workshop on evaluation for language and dialogue systems,
Toulouse, France (2001)
11. Park, Y.S., Han, S.H., Yang, H., Park, W.: Usability evaluation of conversational interface
using scenario-based approach. In: The 2005 ESK spring conference (2005)
12. Tainsh, M.A.: Job process charts and man-computer interaction within naval command
systems. Ergonomics 28, 555–565 (1985)
13. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Scenario design for spoken language dialogue
systems development. In: the ESCA workshop on spoken dialogue systems, pp. 93–96
(1995)
14. Park, W., Han, S.H., Yang, H., Park, Y.S., Cho, Y.: A methodology of analyzing user
input scenarios for a conversational interface. In: The 2005 ESK spring conference (2005)