
A Framework for Evaluating the Usability of Spoken Language Dialog Systems (SLDSs)

Wonkyu Park1, Sung H. Han1, Yong S. Park1, Jungchul Park1, and Huichul Yang2
1 Department of Industrial and Management Engineering, POSTECH,
San 31, Hyoja, Pohang, 790-784, South Korea
{p09plus1,shan,drastle,mozart}@postech.ac.kr
2 Samsung Electronics, Seoul, South Korea
huicul.yang@samsung.com

Abstract. Usability evaluation is now considered an essential procedure in
developing a spoken language dialogue system (SLDS). This paper proposes a
systematic framework for evaluating the usability of SLDSs. The framework
consists of what to evaluate and how to evaluate. What to evaluate includes
components, evaluation criteria, and usability measures covering various
aspects of SLDSs. With respect to how to evaluate, a procedure for developing
scenarios and scenario-based evaluation methods are introduced. In addition, a
case study in which the usability of an SLDS was evaluated was conducted to
validate the proposed framework. The case study successfully revealed the
usability level, usability problems, and design implications for further
development. The framework proposed in this study can be practically applied
to the usability evaluation of SLDSs.

1 Introduction
During the last two decades or so, many studies have been conducted to improve the
performance of spoken language dialogue systems (SLDSs). However, most studies
focused on recognition performance, while only a few studies investigated human
factors issues such as user models, linguistic behavior, user satisfaction, etc. [1].
Human factors issues play an important role in an SLDS because enhanced usability
can partially compensate for imperfect recognition accuracy. When it comes to
natural dialogues between users and an SLDS, the value of the SLDS clearly
depends on its usability, which is critical to making the system commercially
successful [2].
Usability evaluation in the development process is essential because it provides
current usability levels and reveals potential usability problems. Although a variety of
studies have conducted usability evaluations [3, 4, 5, 6, 7], only a few have proposed
systematic evaluation frameworks or methodologies for SLDSs [1, 8]. Walker et al. developed a
framework for evaluating SLDSs, PARADISE (Paradigm for Dialogue System
Evaluation) [8]. It provides a quantitative usability index (i.e. user satisfaction)
considering task success and costs. Dybkjær and Bernsen developed an evaluation
template for SLDSs that consisted of 10 entries such as ‘what is being evaluated’,


‘system part evaluated’, ‘type of evaluation’, ‘symptoms to look for’, etc. [1].
However, these studies are not easy for practitioners to apply to usability evaluation
because they do not deal with specific data collection methods.
This paper aims to propose an evaluation framework for SLDSs. The framework
identifies usability measures for various aspects of SLDSs. Also, it proposes scenario-
based methods to effectively evaluate usability in terms of both performance and
satisfaction. In addition, a case study is conducted to validate the proposed
framework. An SLDS that provides the user with information about schedules, contacts,
weather, etc. in a home environment was developed for the case study.

2 Usability Evaluation Framework for SLDSs


The usability evaluation framework proposed in this study consists of what to
evaluate and how to evaluate. Fig.1 depicts details of the proposed framework.

[Figure 1 depicts the framework as a diagram. Under 'What to evaluate?', the components (system, user, interaction) are linked to evaluation criteria through process elaboration, and to usability measures through measure gathering, modification, and selection. Under 'How to evaluate?', scenario generation (a five-step procedure and task analysis) feeds the development of evaluation methods: pre-determined dialogues, realistic dialogues only, and realistic dialogues after pre-determined dialogues.]
Fig. 1. A framework for evaluating the usability of SLDSs

What to evaluate includes components, evaluation criteria, and usability measures.


The framework has three components, i.e. user, system, and the interaction between
these two. Evaluation criteria are functions and characteristics of each component that
affect the usability of an SLDS. Usability experts identify them using a task analysis
technique. Usability of a system can be quantitatively measured by employing
performance and satisfaction measures [9]. Relevant measures were surveyed from
the existing literature. From these measures, usability experts selected those appropriate
to each criterion by considering ease of measurement and relevance to usability.
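
To make the component-criterion-measure hierarchy concrete, the following is a minimal sketch of how it could be represented in code; the class names and the example instance are illustrative only and are not part of the framework itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    """An evaluation criterion derived from task analysis of one component."""
    name: str
    measures: List[str] = field(default_factory=list)

@dataclass
class Component:
    """One of the three framework components: system, user, or interaction."""
    name: str
    criteria: List[Criterion] = field(default_factory=list)

# The 'system' component with one criterion and its measures (see Table 1).
system = Component(
    name="system",
    criteria=[
        Criterion(
            name="recognition performance",
            measures=["sentence recognition rate",
                      "word recognition rate",
                      "recognition error frequency"],
        )
    ],
)
```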
How to evaluate introduces a procedure for creating SLDS evaluation scenarios
and two methods for collecting usability measures. The framework proposes scenario-
based evaluation by real users. It enables researchers to find usability problems that
originate from mismatches between what the user needs and what the system provides
[10, 11].

2.1 'What to Evaluate'


The framework has components, evaluation criteria, and usability measures with
respect to what to evaluate. Components in this study are classified into user, system,

and interaction. Various aspects of SLDSs can be evaluated by considering all three
components, whereas previous studies mainly evaluated only the system [1].
Criteria are developed to evaluate each component. The criteria are derived by
elaborating the processes shown in a modified job process chart. A job process chart,
reported in [12], is a specific type of partitioned operational sequence diagram. With
the modified job process chart, practitioners are able to identify system-user
interaction processes and the information transmitted between them. An example of the
chart is shown in Fig. 2, from which the evaluation criteria for the case study were derived.
For example, the evaluation criterion 'recognition performance' is elaborated from
'recognize input'. Another example is 'user behavior', which comes from 'construct
utterance' when the system fails to provide the information that the user requests.
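
As a small illustration, this elaboration can be thought of as a mapping from chart steps to criteria. The sketch below is hypothetical and covers only the two examples just mentioned; the remaining entries would come from the task analysis.

```python
# Hypothetical mapping from steps of the modified job process chart (Fig. 2)
# to the evaluation criteria elaborated from them.
step_to_criterion = {
    "recognize input": "recognition performance",  # system component
    "construct utterance": "user behavior",        # user component
}

def criteria_for_steps(steps):
    """Return the evaluation criteria elaborated from the given chart steps."""
    return [step_to_criterion[s] for s in steps if s in step_to_criterion]

print(criteria_for_steps(["recognize input", "display feedback"]))
# -> ['recognition performance']
```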

[Figure 2 shows the chart as three columns: System, Interaction, and User. The user constructs an utterance and speaks into a microphone; the system recognizes the input and displays a feedback message, which the user reads and tests for relevance; the system then generates a response (output message, error, or additional information); the user reads the output and, if the information is not adequate, constructs another utterance.]
Fig. 2. A modified job process chart of an SLDS used in the case study

A variety of usability measures were collected from previous studies [2, 3, 4, 5, 6, 7].
Some measures (e.g., number of barge-ins and number of SLDS help requests) were SLDS-
specific, while others (e.g., task completion time and number of errors) could be used
for general usability evaluations. The latter measures may need to be modified to fit SLDSs.
For example, the number of errors is modified into the number of unrecognized
words per utterance and the number of utterance construction errors. Usability measures
appropriate to each criterion were selected according to ease of measurement and
relevance to usability. For example, word recognition rate was selected to evaluate the
'recognition performance' criterion. Table 1 shows the components, evaluation criteria,
and usability measures developed for the case study.

Table 1. Evaluation criteria and measures of each component for the case study

Components   Criteria                 Measures

System       Recognition performance  – Sentence recognition rate
                                      – Word recognition rate
                                      – Recognition error frequency
             Dialogue model           – Adequacy of reasoning function
                                      – Utterance construction error (frequency and types)
                                      – Correct response rate
             System output            – User satisfaction on system response

Interaction  Task                     – Task completion time
                                      – Frequency of failed tasks
             User satisfaction        – Overall user satisfaction on the SLDS

User         User behavior            – Users' response pattern to various system errors
                                      – Patterns of utterance construction
             Learning                 – Utterance variation
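
As a rough illustration of how a general measure such as the number of errors is adapted to the SLDS-specific measures in Table 1, the sketch below computes a simplified word recognition rate and a count of unrecognized words for a single utterance. The example sentences are hypothetical, and a real evaluation would align reference and recognized words with an edit-distance procedure rather than by position.

```python
def word_recognition_rate(reference, recognized):
    """Simplified rate: share of reference words matched at the same position
    in the recognizer output (a full evaluation would use word alignment)."""
    if not reference:
        return 0.0
    hits = sum(1 for i, w in enumerate(reference)
               if i < len(recognized) and recognized[i] == w)
    return hits / len(reference)

def unrecognized_words(reference, recognized):
    """Number-of-errors measure adapted to SLDSs: reference words that the
    recognizer failed to produce for this utterance."""
    return len(set(reference) - set(recognized))

# Hypothetical utterance from a pre-determined dialogue.
ref = "any e-mail from mom this afternoon".split()
hyp = "any mail from mom this afternoon".split()
print(word_recognition_rate(ref, hyp))  # 5 of 6 words matched -> 0.833...
print(unrecognized_words(ref, hyp))     # 'e-mail' was missed -> 1
```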

2.2 'How to Evaluate'

A scenario-based method can be used to evaluate an SLDS in a realistic
situation [13]. A variety of situations should be considered in evaluation scenarios.
However, few studies, except [14], have systematically developed scenarios
reflecting various situations. Park et al. proposed a scenario development procedure
that consists of five steps [14]: 1) identifying functions and information that the system can
provide, 2) analyzing sentence structures appropriate to the system functions and
information, 3) analyzing proper words for the sentence structures, 4) creating scenario
structures by mapping words into the sentence structures, and 5) developing detailed
scenarios. This study follows this procedure when creating scenarios.
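
The sketch below illustrates steps 2-4 of this procedure under assumed slot templates and word lists; the templates, slots, and example phrases are hypothetical and are not those used in the case study (step 5 would elaborate the generated structures into full scenarios).

```python
from itertools import product

# Steps 2-3: hypothetical sentence structures (slot templates) and the
# candidate words analyzed for each slot.
sentence_structures = [
    "Any {item} from {person} this {time}?",
    "Tell me the {item} {person} sent this {time}.",
]
slot_words = {
    "item": ["e-mail", "message"],
    "person": ["mom", "my sister"],
    "time": ["afternoon", "morning"],
}

def generate_scenarios(structures, words):
    """Step 4: create scenario structures by mapping every word combination
    into every sentence structure."""
    keys = sorted(words)
    for template in structures:
        for combo in product(*(words[k] for k in keys)):
            yield template.format(**dict(zip(keys, combo)))

scenarios = list(generate_scenarios(sentence_structures, slot_words))
print(len(scenarios))  # 2 structures x 2 x 2 x 2 words = 16 candidates
print(scenarios[0])    # 'Any e-mail from mom this afternoon?'
```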
The framework proposes two scenario-based evaluation methods. The first one
uses pre-determined dialogues. It provides utterances that the system can handle, so the
system can always come up with an answer unless it fails to recognize the speech
pronounced by the user. This method is mainly appropriate for measuring the recognition
performance of an SLDS. Pre-determined dialogues are developed through all
five steps explained above.
The second method uses scenarios to evaluate an SLDS's overall usability in
realistic situations. Given a situation and the information to be queried, the user asks the
system using his/her own expressions. In addition to the recognition performance
measured by the first method, the discrepancy between the dialogue model
hypothesized by the developer and the user's actual utterance patterns can be analyzed.
Realistic dialogues are developed using only step 1 described above. Table 2 shows
examples of the two dialogue types when the user conducts the same task.
examples of the two types of dialogues when the user conducts the same task.
The effects of previous experience with the pre-determined dialogues on the user's
utterance patterns are also investigated by comparing two user groups: one group
that conducts the realistic dialogues only, and another that conducts the realistic
dialogues after experiencing the pre-determined ones.

Table 2. Examples of two dialogue types

Pre-determined dialogues                      Realistic dialogues
(performed through two transactions)

1: Any e-mail from mom this afternoon?        I heard mom sent me an e-mail this afternoon.
2: Contents of the e-mail?                    So I would like to know contents of the e-mail.
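
One way the discrepancy between the hypothesized dialogue model and users' actual utterances (Section 2.2) might be quantified is sketched below. The regular-expression patterns stand in for the developer's expected sentence structures and are purely illustrative.

```python
import re

# Hypothetical stand-ins for the sentence structures the developers expect;
# in practice these would come from the hypothesized dialogue model.
expected_patterns = [
    re.compile(r"^any e-?mail from \w+ this \w+\??$", re.IGNORECASE),
    re.compile(r"^contents of the e-?mail\??$", re.IGNORECASE),
]

def unanticipated(utterances):
    """Return realistic utterances that the dialogue model does not cover."""
    return [u for u in utterances
            if not any(p.match(u.strip()) for p in expected_patterns)]

realistic = [
    "Any e-mail from mom this afternoon?",
    "I heard mom sent me an e-mail this afternoon.",
]
print(unanticipated(realistic))
# -> ['I heard mom sent me an e-mail this afternoon.']
```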

3 Validation of the Proposed Framework


A total of 84 subjects who speak Korean participated in the case study. The
participants were randomly assigned to one of three different experiments: 60 subjects
for conducting pre-determined dialogues (experiment 1), 12 for realistic dialogues
(experiment 2), and the other 12 for realistic dialogues after the pre-determined
dialogues (experiment 3). A larger number of participants were assigned to
experiment 1 because the SLDS was at an early stage of development, in which
the developers needed to focus on recognition performance. A total of 24
scenarios were developed for the experiments. Twelve scenarios were pre-determined
dialogues, while the other twelve were realistic dialogues.
Evaluation criteria and usability measures for the case study were developed
according to the proposed framework (see Section 2.1) and are shown in Table 1.
The evaluation results revealed design problems, usability levels, and valuable
design implications for the SLDS. This paper describes sentence recognition rates and
correct response rates only. The average values of these measures for the three
experiments are depicted in Fig. 3. Based on the results of the usability evaluation,
design implications for further development were made. Firstly, the recognition
algorithm needs improvement to process users' utterances effectively. The sentence
recognition rate of 50% might be too low for a commercial SLDS. If significant
improvement is difficult to achieve, introducing auxiliary input devices such as a
keyboard and mouse would be a good way to support better usability.
Secondly, help documents or training programs should be provided with the SLDS.
Experiment 3, which included short training before the main experiment, showed better
performance than Experiment 2 in both measures. This implies that system help may
make it easier for users to use the SLDS. Information describing what and how to
interact with the system should be provided for better usability.

Finally, the developers should improve the reasoning function that enables the
system to identify the user's intention from what it has recognized. This is important when,
as in this case, the system's recognition performance is poor. Note that the correct
response rates are higher than the sentence recognition rates in all three
experiments. The reasoning function can be improved by refining the dialogue model
based on the utterance patterns that users employ in their daily lives.
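
A reasoning function of this kind could be as simple as matching recognized keywords against candidate intentions. The sketch below is a deliberately minimal, hypothetical illustration, not the dialogue model used in the case study.

```python
# Hypothetical keyword sets per candidate intention; a real reasoning
# function would be driven by the refined dialogue model.
INTENT_KEYWORDS = {
    "check_email": {"e-mail", "mail", "inbox"},
    "check_weather": {"weather", "rain", "temperature"},
    "check_schedule": {"schedule", "meeting", "appointment"},
}

def infer_intent(recognized_words):
    """Pick the intention whose keywords overlap most with what was recognized."""
    words = {w.lower() for w in recognized_words}
    best, best_overlap = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = intent, overlap
    return best  # None if nothing matches

# Only part of the sentence was recognized, yet the intention is recovered.
print(infer_intent(["any", "e-mail", "afternoon"]))  # -> 'check_email'
```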

[Figure 3 shows a bar chart of the sentence recognition rates and correct response rates (percentage) for Experiments 1-3.]
Fig. 3. Correct sentence recognition rates and correct response rates for three experiments
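
The two measures plotted in Fig. 3 can be computed directly from per-utterance dialogue logs. The sketch below assumes hypothetical log fields (sentence_recognized, response_correct) that are not taken from the original study, and also shows how the correct response rate can exceed the sentence recognition rate when the reasoning function recovers from a recognition failure.

```python
# Hypothetical per-utterance log records for one experiment.
log = [
    {"sentence_recognized": True,  "response_correct": True},
    {"sentence_recognized": False, "response_correct": True},   # reasoning recovered
    {"sentence_recognized": False, "response_correct": False},
    {"sentence_recognized": True,  "response_correct": True},
]

def rate(records, key):
    """Percentage of utterances for which the given outcome was observed."""
    return 100.0 * sum(r[key] for r in records) / len(records)

print(rate(log, "sentence_recognized"))  # sentence recognition rate: 50.0
print(rate(log, "response_correct"))     # correct response rate: 75.0
```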

4 Conclusion
A usability evaluation framework for SLDSs was proposed. It focuses on both what to
evaluate and how to evaluate. Usability measures are systematically defined to
evaluate SLDSs. Evaluation criteria that could affect the usability of SLDSs are
identified from a modified job process chart. In addition, the study proposes two
types of scenario-based evaluation methods. Each evaluation method can be used for
a different purpose. In a case study, an SLDS was evaluated using the proposed
framework. The case study revealed the usability level, usability problems, and design
implications for better usability. The framework described in the study can be
practically applied to evaluating the usability of SLDSs.

References
1. Dybkjær, L., Bernsen, N.O.: Usability Issues in Spoken Language Dialogue Systems.
Natural Language Processing 6, 243–272 (2000)
2. Kwahk, J.: A methodology for evaluating the usability of audiovisual consumer electronic
products. Unpublished Ph. D. dissertation, Pohang University of Science and Technology,
Pohang, South Korea (1999)
3. Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language
system. In: The 1995 AAAI spring symposium on empirical methods in discourse
interpretation and generation, pp. 34–39 (1995)

4. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Evaluation of spoken dialogues: user test with a
simulated speech recogniser. CPK - Center for PersonKommunikation, Aalborg University
9a & 9b (1996)
5. Litman, D.J., Pan, S.: Designing and Evaluating an Adaptive Spoken Dialogue System.
User Modeling and User-Adapted Interaction 12, 111–137 (2002)
6. Polifroni, J., Hirschman, L., Seneff, S., Zue, V.: Experiments in evaluating interactive
spoken language systems. In: The DARPA Speech and Natural Language Workshop, pp.
28–33 (1992)
7. Simpson, A., Fraser, N.A.: Black Box and Glass Box Evaluation of the SUNDIAL
System. In: The EUROSPEECH: European Conference on Speech Processing, Berlin, pp.
1423–1426 (1993)
8. Walker, M.A., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: A Framework for
Evaluating Spoken Dialogue Agents. In: The 35th annual meeting of the association for
computational linguistics (ACL-97), Madrid, Spain, pp. 271–280 (1997)
9. Han, S.H., Yun, M.H., Kwahk, J., Hong, S.W.: Usability of consumer electronic products.
International Journal of Industrial Ergonomics 28, 143–151 (2001)
10. Dybkjær, L., Bernsen, N.O.: Usability evaluation in spoken language dialogue systems. In:
The Proceedings of the workshop on evaluation for language and dialogue systems,
Toulouse, France (2001)
11. Park, Y.S., Han, S.H., Yang, H., Park, W.: Usability evaluation of conversational interface
using scenario-based approach. In: The 2005 ESK spring conference (2005)
12. Tainsh, M.A.: Job process charts and man-computer interaction within naval command
systems. Ergonomics 28, 555–565 (1985)
13. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Scenario design for spoken language dialogue
systems development. In: the ESCA workshop on spoken dialogue systems, pp. 93–96
(1995)
14. Park, W., Han, S.H., Yang, H., Park, Y.S., Cho, Y.: A methodology of analyzing user
input scenarios for a conversational interface. In: The 2005 ESK spring conference (2005)
