
Comparative study of DTMF and ASR IVR systems

Omkar Yadav, T.E. Comps, S.P.I.T., Andheri, omkarr27@gmail.com
Mahesh Singh, T.E. Comps, S.P.I.T., Andheri, Mahesh_spit@rediffmail.com
Pralhad Surve, T.E. Comps, S.P.I.T., Andheri, survepralhad@gmail.com
Prashant Vishe, T.E. Comps, S.P.I.T., Andheri, pvishe@gmail.com

ABSTRACT:

This is a review of the paper ‘A Comparison between DTMF and ASR IVR Services through Objective and Subjective Evaluation’ by Cristina Delogu, Andrea Di Carlo, Paolo Rotundi and Danilo Sartori, which compares several types of ASR service with DTMF in the context of the IVR applications used in modern call centres. Users of such a service rely on it to activate a particular scheme or to get other information from their network service provider on their mobile devices. Three ASR prototypes are reviewed, and they are compared with a DTMF prototype in which users enter commands through the telephone keypad. The important findings of the study are highlighted in this paper, and some of its limitations and their effects are discussed.

KEY WORDS:
Interactive Voice Response (IVR), Automatic Speech Recognition (ASR) and Dual Tone Multi-Frequency (DTMF).

INTRODUCTION:

In technical terms, ASR can be defined as a process by which spoken words are converted to text. In an ASR-based (that is, voice-enabled) IVRS, the caller chooses the required menu option by speaking a word or a phrase, as if talking to a human. The IVR is expected to recognise the speech and provide the requested information. Earlier systems used digit-only speech recognisers. This limitation has now been overcome, and there is a variety of IVR interaction designs and a greater variety of telephone applications.

DTMF is the signal sent to the phone company when you press the touch keys of an ordinary telephone. With DTMF, each key you press generates two tones of specific frequencies. So that a voice cannot imitate the tones, one tone is taken from a high-frequency group and the other from a low-frequency group. Before speech recognition software and development toolkits became common, most telephone-based interfaces relied on touch-tone input using the telephone keypad, or Dual Tone Multi-Frequency (DTMF). Voice processing systems, which first appeared in the 1980s, used DTMF for input and recorded human speech as output. The fixed set of 12 keys (the 10 digits plus the # and * keys) on the keypad lends itself to applications that present the caller with a list of options (e.g., “Press 1 for Sales”), commonly referred to as menus.
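The two-tone scheme described above can be made concrete with the standard DTMF keypad layout, in which every row of keys shares one low-group frequency and every column shares one high-group frequency. The sketch below is illustrative only: the function names `dtmf_frequencies` and `dtmf_samples` and the 8000 Hz sample rate are assumptions made for this example and do not come from the reviewed paper.

```python
import math

# Standard DTMF frequency groups in Hz (rows = low group, columns = high group).
LOW_GROUP = (697, 770, 852, 941)
HIGH_GROUP = (1209, 1336, 1477)

# The 12-key telephone keypad, laid out row by row.
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair in Hz for one keypad key.

    Hypothetical helper written for this illustration.
    """
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key)
        if len(key) == 1 and col != -1:
            return LOW_GROUP[row], HIGH_GROUP[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration=0.1, sample_rate=8000):
    """Synthesise the tone for a key as a list of floats in [-1, 1].

    The signal is the sum of two equal-amplitude sine waves, one per group.
    The duration and sample rate defaults are assumptions for this sketch.
    """
    low, high = dtmf_frequencies(key)
    count = round(duration * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]
```

For instance, `dtmf_frequencies("5")` looks up the second row and second column of the keypad, so a menu prompt such as “Press 1 for Sales” only needs the receiving side to detect which pair of tones is present.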

The paper discusses the design of the ASR services. Interface design matters most with regard to the distinction between prompts and recovery systems. The system prompt gathers information or responses from the users. Current ASR systems impose certain restrictive guidelines: users have to speak at the appointed moment, when the ASR is ready to record or accept input (generally "after the beep"), and have to avoid extraneous vocal sounds, such as throat clearing, and superfluous speech such as "um-seven". Prompts have to be easy to understand, because the prevailing belief is that if a user has difficulty completing a transaction with an automated system, the system is at fault and not the user. Instead of training users, which came across as a forceful or unpleasant way of treating them, powerful error recovery systems have now been developed.

The paper also discusses evaluation methods. There are two: performance evaluation, the conventional testing method used by vendors, and the more realistic usability evaluation. Many speech recognition systems are available in the marketplace, and manufacturers generally carry out the performance evaluation on their own product. Accuracy is typically quoted as about 98-99%, but the conditions under which the testing was carried out are not stated, and these conditions matter greatly when judging the efficiency of a speech recognition system. This is why usability testing is needed. Just as performance evaluation has been crucial for the development of speech technology systems, usability evaluation will be crucial for the development of services that incorporate them. Usability evaluation is concerned with the design of the application rather than the ability of a system to perform within that design.

The experiment conducted by the authors uses three ASR prototypes:
1. VOICE-1 allows users to interact with the service in a digit-only scenario, following a menu prompted by the system.
2. VOICE-2 allows users to say command words instead of digits, following the same menu prompted by the system.
3. VOICE-3 allows users to say more natural commands in response to more natural prompts.

Furthermore, a DTMF prototype that allows users to enter commands only through the telephone keypad has been developed, in order to compare touch-tone and speech recognition technologies for telephony services. It is the same as VOICE-1, differing only in that users enter the digit commands on the telephone keypad instead of saying them.

A general overview of the advantages and disadvantages of ASR and DTMF follows:

ASR:

Advantages:
1. It is much easier simply to speak the menu option, since mobile phone buttons are small to press. Otherwise you need to listen to the menu, bring the phone in front of you to look at the screen, press the button, and then put the phone back to your ear.
2. Menu options can be unlimited and user friendly. Callers can also browse menus easily and jump conveniently from one menu to another.
3. The caller can choose a menu option quickly, without having to listen to a long, boring list of menus. The IVR can become truly interactive with its caller through the human voice.
4. Using a keypad-driven IVR was always a problem, as the caller has to remove the phone from the ear to look at the keypad and press a key. This can be quite irritating for a voice portal user. ASR removes this problem completely.

Disadvantages:
1. Recognition accuracy can be thought of as how accurately the words spoken by a user are converted to the corresponding text. According to the information available (from web searches and the websites of ASR engine providers), almost all ASR engines have high accuracy for detecting two words, YES and NO. Other words, such as numbers and dates of birth, are recognised less accurately. For example, the recognition accuracy for the numbers 1, 2, 3, etc. varies from 87 to 91% depending on the ASR engine. Accuracy for other natural English fares much worse: as per one report, recognition of department names in a company reaches 85% at most. Accuracy can be improved by training the ASR, but this is not practical and does not suit a voice portal.
2. When ASR is used in a big country where people speak different languages, or the same
language with different accents, speech recognition accuracy is bound to fare worse.
3. An IVR can be designed intelligently to confirm any doubtful word recognition with a YES or NO question, but this still irritates many people and slows down the fetching of information. In a voice portal where the caller pays per minute of usage, this may not work in the interest of the users.
4. ASR does not work successfully in every condition. It may fail in noisy surroundings, with a weak signal, or with disturbance on the line.
5. Having to talk to a machine may irritate many people. It is also not easy to speak some phrases or words as in a conversation.

DTMF:

Advantages:
1. Ease of use on a landline, and on a mobile when a hands-free set is used.
2. Very reliable, as DTMF signals do not get distorted even when the caller is speaking at the same time.
3. Easy to develop and commission, with faster response from the IVRS.

Disadvantages:
1. Finding the key named in a menu prompt may be difficult, especially on new mobile phones that have only touch-screen keys.
2. When using DTMF on a mobile device, the caller has to remove the phone from the ear to look at the keypad and press a key. This can be quite irritating for a voice portal user.
3. Many people do not understand pulse dialling, tone dialling, the hash button, the star button, etc.

COMPARISON AND STUDY:

The experiment conducted by Delogu, Di Carlo, Rotundi, & Sartori (1998) had some crucial findings: (a) speech was preferred to DTMF by a majority of users; (b) the error recovery procedure used in VOICE-3 helped the system resolve most of the problematic situations encountered during the interaction; (c) speech was judged as being more satisfying, more entertaining, and easier to use than DTMF; and (d) user preference for a particular modality was better predicted by user performance in nonlinear tasks than in linear ones.

As an input mechanism, DTMF has the advantage of being both instantaneous and 100% accurate. This is in contrast to speech recognition, which is neither. Processing of speech can cause delays for a caller, and the variability of spoken inputs will prevent speech from being 100% accurate in the foreseeable future. However, given the limitation of using only 12 keys for input, many users of IVR systems feel frustrated and hang up. Given the trade-off between (a) highly accurate yet severely constrained DTMF input and (b) highly unconstrained yet often misrecognized speech input, it is important to determine which input modality users of a telephone-based information system would prefer.

Delogu et al. (1998) found no difference in terms of task completion time and the number of turns per task when comparing DTMF input for an IVR system with three different types of speech input. The three types they experimented with were simple digit recognition, simple command recognition, and a full-sentence recognition system. The digit system only recognized single digits one through nine, whereas the simple command system recognized a short list of words such as next, skip, yes, or no. The full-sentence recognition system recognized relatively complete sentences, such as “Please skip to next,” or “Yes, I do.” The analysis of user attitudes toward the different systems indicated that users preferred the full-sentence recognition system to the DTMF input system. The DTMF input system, however, was preferred to both the simple digit and the simple command recognition systems.

Users preferred connected word (CW) speech input (in a CW-based system, users say a string of words after a system prompt, without any pause required between the words) and DTMF input systems to an isolated word (IW) speech input system (in an IW system, users say only a single word after a prompt). In addition, they reported an interaction effect between cognitive abilities (spatial and verbal abilities) of users and their attitudes toward the different modalities tested. Users with high spatial abilities significantly preferred DTMF to CW and IW input, probably due to the positive effect of high
spatial ability on mental mapping of the DTMF options. Of interest, users with high verbal abilities also preferred DTMF to CW and IW input. The positive effect of verbal skills on the DTMF preference remains unclear.

Based on this study, the following conclusions can be drawn. First, in terms of user preference, DTMF systems are preferred to IW or simple digit recognition-based voice systems. When compared to full-sentence or CW recognition systems, previous studies show inconsistent results: DTMF systems were rated either better or worse. Second, in terms of task or system performance, no general difference has been found between DTMF and voice systems. For high spatial cognitive users, a voice system may be better than a DTMF system in terms of the speed of task performance. Finally, task or system performance does not necessarily predict user preference for a particular system.

LIMITATIONS:

The experimental findings of Delogu et al. (1998) laid a foundation for understanding user needs and requirements. However, the study had certain drawbacks. The voice systems used did not involve any true speech recognition technology. The authors used the Wizard of Oz methodology to simulate speech recognition and thus compensated for any technological barriers. In a Wizard of Oz study, participants either naively believe or are asked to pretend that they are interacting with a machine, even though in reality a human interacts with them. Therefore, the applicability of the findings to the design and evaluation of an actual natural language (NL) system is limited. In addition, no statistical analysis of the attitudinal survey data was reported in their paper. Instead, they only provided the proportions for three questions:
(a) Which of the three prototypes will be better accepted in the marketplace,
(b) Which is the most enjoyable system, and
(c) Which system will you prefer to use in a real service?

The previous studies mentioned have the following shortcomings. First, there has been no systematic examination of the interaction between the nature of the task and the modality of interaction. Certain types of tasks may be more easily accomplished with one modality than with another. For example, DTMF interaction may be preferable for a simple linear task (e.g., hearing messages in the order they are received), whereas speech interaction may be preferable for complicated and nonlinear tasks (e.g., hearing a message from a particular sender in random order). Inattention to this possible interaction between task and modality might be responsible for the inconsistent findings on user preference for full-sentence recognition systems over DTMF ones. The lack of general difference in task performance between DTMF and voice systems can also be explained by this inattention. Had previous studies tested and analyzed different types of tasks separately (e.g., linear vs. nonlinear), they might have come to a different conclusion.

Second, most previous studies involving full-sentence recognition systems relied on the Wizard of Oz methodology. Consequently, they do not allow any strong conclusions on user preference for and performance with an actual NL system. In our study, a state-of-the-art NL system was used to increase the generalizability of this study. Finally, user evaluation has been measured in somewhat naïve ways, simply by asking one or two direct questions such as “How do you evaluate this system?” or “How enjoyable was the system?”

CONCLUSION:

First, with regard to task performance as measured by system effectiveness (task success rate) and system efficiency (average task completion time), different modalities favour different types of tasks. For simple and linear tasks, the DTMF modality was more effective and efficient, whereas for complicated and nonlinear tasks, the speech modality was better.

It should be noted that a function mapping was provided in the DTMF condition, which probably favoured this condition. Nevertheless, speech was better than DTMF for nonlinear tasks. Second, participants evaluated the speech interface significantly more positively than the DTMF interface. Participants of the experiment evaluated the speech modality as being more satisfying, entertaining, and natural to use than the DTMF modality. The speech system was also evaluated as being easier to use.

Finally, user performance with a modality in simple linear tasks does not predict user preference for that modality well. Even though linear tasks could be completed more quickly with DTMF than with speech, users were generally more satisfied with the speech modality and more often chose it as their final preference. Given the high effectiveness and efficiency of the speech modality for complicated and nonlinear tasks, user preference might be better predicted by user performance in nonlinear tasks.
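The two performance measures named at the start of this section, system effectiveness (task success rate) and system efficiency (average task completion time), can be computed per modality and task type as in the sketch below. The `Trial` record, the `summarise` function and the sample values are hypothetical and exist only to show the arithmetic; they are not data from the study. Averaging completion time over all trials, rather than over successful ones only, is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One participant attempt at one task (hypothetical record layout)."""
    modality: str     # e.g. "DTMF" or "speech"
    task_type: str    # e.g. "linear" or "nonlinear"
    succeeded: bool   # did the participant complete the task?
    seconds: float    # time spent on the task


def summarise(trials):
    """Group trials by (modality, task_type) and return, for each group,
    a tuple (task success rate, average completion time in seconds)."""
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial.modality, trial.task_type)].append(trial)
    summary = {}
    for key, group in groups.items():
        success_rate = sum(t.succeeded for t in group) / len(group)
        mean_time = sum(t.seconds for t in group) / len(group)
        summary[key] = (success_rate, mean_time)
    return summary


# Made-up trials, only to exercise the function (not results from any study).
example = [
    Trial("DTMF", "linear", True, 20.0),
    Trial("DTMF", "linear", True, 30.0),
    Trial("speech", "nonlinear", True, 40.0),
    Trial("speech", "nonlinear", False, 60.0),
]
for key, (rate, mean_time) in summarise(example).items():
    print(key, rate, mean_time)
```

Keeping the task type in the grouping key is the point of the design: pooling linear and nonlinear tasks into one average would hide exactly the task-by-modality interaction discussed above.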

Another important finding from this study is that users preferred speech navigation in spite of the fairly high recognition error rates experienced with the speech modality. Part of the reason might be the “cool factor” of the speech modality, which came across as more sophisticated and friendlier. Another reason might be the trade-off that users make between the advantages of a speech-based system (e.g., the ability to control a device in a hands-free mode) and the disadvantage of dealing with recognition errors. In the domain of messaging, the need for hands-free control is clear when attempting to access messages from a cellular phone. It is also important to consider how entertaining and engaging users find their interaction with a system, in other words the “fun factor.”

REFERENCES:

Delogu, C., Di Carlo, A., Rotundi, P., & Sartori, D. (1998). A comparison between DTMF and ASR IVR services through objective and subjective evaluation.

Delogu, C., Di Carlo, A., Rotundi, P., & Sartori, D. (1998). Usability Evaluation of IVR systems with DTMF and ASR.

http://deepblue.lib.umich.edu/bitstream/2027.42/174/2/71952.0001.001.pdf

http://www.ivrsworld.com/ivr/ivr-menu-system/

http://www.ivrsworld.com/general/the-big-question-voice-enabled-menu-or-dtmf-pressed-menu

http://www.ivrsworld.com/advanced-ivrs/usability-guidelines-of-ivr-systems
