
EduSpeak : A speech recognition and pronunciation scoring toolkit for computer-aided language learning applications

Horacio Franco, Harry Bratt, Romain Rossier, Venkata Rao Gadde, Elizabeth Shriberg, Victor Abrash, and Kristin Precoda
SRI International, USA

Language Testing 27(3) 401–418. © The Author(s) 2010. Reprints and permission: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/0265532210364408. http://ltj.sagepub.com

Abstract
SRI International's EduSpeak system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to specific production problems. We review our approach to pronunciation scoring, where our aim is to estimate the grade that a human expert would assign to the pronunciation quality of a paragraph or a phrase. Using databases of nonnative speech and corresponding human ratings at the sentence level, we evaluate different machine scores that can be used as predictor variables to estimate pronunciation quality. For more specific feedback on pronunciation, the EduSpeak toolkit supports a phone-level mispronunciation detection functionality that automatically flags specific phone segments that have been mispronounced. Phone-level information makes it possible to provide the student with feedback about specific pronunciation mistakes. Two approaches to mispronunciation detection were evaluated in a phonetically transcribed database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers. Results show that the classification error of the best system, for the phones that can be reliably transcribed, is only slightly higher than the average pairwise disagreement between the human transcribers.

Keywords
automatic pronunciation scoring, computer-aided language learning, mispronunciation detection

Using computers to help students learn and practice a new language has long been a cherished dream. Automatic speech recognition (ASR) technology holds forth the promise
Corresponding author: Horacio Franco, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493, USA. Email: hef@speech.sri.com


of allowing spoken language to be used in many ways in language-learning activities, for example by supporting different types of oral practice and enabling feedback on various dimensions of language proficiency, including language use and pronunciation quality. The area of automatic pronunciation scoring has been considered one of the most promising for computer-aided language learning (CALL). In most language-learning settings, a human teacher does not have enough time to provide detailed pronunciation feedback to individual students. Additionally, improving one's pronunciation takes significant time and requires frequent feedback from a source other than the language learner's own perceptions, making automatic pronunciation scoring a suitable arena for a tireless computer.

Nevertheless, current speech recognition technology has accuracy limitations, which may give the impression of a certain degree of randomness on the part of the ASR system. For instance, an utterance repeated twice in a similar way by a nonnative speaker could have some minor, imperceptible changes in articulation, leading to two different speech recognition results. Consequently, designers of CALL systems that use ASR must design the applications to minimize the impact of such limitations. In particular, in the area of pronunciation scoring, the smaller the unit to be scored, the higher the uncertainty in the associated score (Kim et al., 1997). Currently, the most reliable estimates of pronunciation quality are overall levels obtained from a paragraph composed of several sentences that can be used to characterize the speaker's overall pronunciation proficiency. At this level, it has been shown that automatic scoring performs as well as human scoring (Bernstein et al., 1990).

For CALL we would like to score smaller units, to allow the student to focus on specific aspects of production. For instance, overall pronunciation scoring can be obtained at the sentence level (Franco et al., 1997; Neumeyer et al., 1997) with a level of accuracy that, while lower than that of human scoring, can nonetheless provide valuable feedback for language learning (Precoda et al., 2000). However, an overall score is only part of the desired feedback for pronunciation training. More detailed feedback, at the level of individual phones, directs attention to specific phones that are mispronounced (Eskenazi, 1996; Kim et al., 1997; Ronen et al., 1997; Witt & Young, 1997). At this level of detail, typically only a binary decision between correct and mispronounced is provided. Automatic pronunciation assessment at the sentence and phone levels remains an active area of research.

A speech recognition system that is oriented toward use in CALL applications must have the capability to recognize accented and often-mispronounced speech produced by language learners. Additionally, the capability to provide different types of measures of pronunciation quality allows application developers to provide additional feedback on pronunciation. We describe some aspects of SRI's EduSpeak system, a software development toolkit for CALL application developers. The EduSpeak recognition engine is based on the same technology as Decipher, SRI's state-of-the-art, large-vocabulary, speaker-independent continuous speech recognition system (Digalakis, Monaco & Murveit, 1996). Leveraging Decipher technology has allowed us to develop a high-performance real-time recognition system that is specifically adapted for CALL applications.
In particular, acoustic models tailored for nonnative speech recognition, pronunciation scoring algorithms, and a number of system features were developed especially for the CALL domain.


An important component of our toolkit is the automatic evaluation of pronunciation quality. We have developed algorithms to grade the pronunciation quality of nonnative speakers in a text-independent fashion; that is, these algorithms do not rely on statistics of specific words or sentences (Franco et al., 1997; Franco et al., 1999a; Neumeyer et al., 1996). Text-independent scoring allows developers to use arbitrary content without the need to tune the pronunciation scoring models. Our pronunciation grading method is based on the computation of a number of text-independent scores (spectral, durational, and prosodic) that correlate well with human judgments of pronunciation quality. The individual scores are nonlinearly combined to estimate the pronunciation grade that a human expert would have given. The mappings from machine scores to human grades are calibrated by using a large database of nonnative speech with overall human pronunciation quality scores on individual sentences (Bernstein, 1990; Franco et al., 1999b). This approach results in pronunciation-quality grades for individual sentences, or groups of sentences, with a consistency approaching that of human raters. We introduce a new word duration score that is based on a technique previously developed to model duration in large-vocabulary conversational speech recognition (LVCSR) (Gadde, 2000) and evaluate its use to estimate pronunciation quality, either alone or in combination with previously developed scores.

More detailed pronunciation feedback, in the form of information about specific phone mispronunciations, is also provided in our toolkit, allowing a CALL system to provide feedback about specific pronunciation mistakes (Franco et al., 1999c; Ronen, Neumeyer & Franco, 1997). We compare two approaches for phone mispronunciation detection. The first is based on using native models as a reference and on obtaining a measure of the phone-level degree of match to the native model, an approach that is an extension of the one used to score overall pronunciation quality. The second, newer, approach is based on using explicit acoustic models of the different nonnative pronunciations. In this approach we attempt to capture both the acceptable nonnative pronunciations and the mispronounced nonnative pronunciations. We expect that explicit modeling of the different qualities of nonnative speech could help to better pinpoint the mispronunciations. We review some of the features of the EduSpeak toolkit and present advances on the use of the new duration feature for overall pronunciation scoring and on phone mispronunciation detection based on explicit acoustic modeling of the mispronunciations.

Nonnative Speech Recognition for CALL


Speech recognition in the EduSpeak system is based on the current state-of-the-art approach that uses hidden Markov models (HMMs) as the underlying statistical model. The speech signal is converted into a sequence of feature vectors that represent the instantaneous spectral characteristics of the speech signal every 10 milliseconds. The HMM approach uses a Markov chain to model the changing statistical characteristics of the different types of sounds that exist in the spoken language. Every state in such a Markov chain corresponds to a particular segment of a particular sound or phone in the language, where a probability distribution associated with each Markov state is used to
model the acoustic features of such a sound segment as well as its natural variability when produced repeatedly, in different contexts, and by different speakers. Typically, each state has a corresponding Gaussian mixture model (GMM) to model statistically the acoustic features associated with that phone segment. HMMs also allow a hierarchical representation of speech whereby a sentence is modeled as a sequence of words; a word, in turn, is modeled as a sequence of phones; and a phone is modeled as a sequence of HMM states. Large-vocabulary state-of-the-art systems use context-dependent phones; that is, a different HMM is defined for each phone in each different phonetic context. For instance, a context-dependent phone for the sound /a/ in the context of the sound /c/ to the left and the sound /s/ to the right would be the so-called triphone /c[a]s/. Each triphone is typically represented by a three-state HMM with a GMM per state. As the acoustic realization of phones changes substantially with the phonetic context, the use of triphones allows us to model more invariant units with less chance of confusion.

Speech recognition with models adapted to nonnative speech offers better nonnative performance than with models trained with only native speech, and thus allows the use of more complex speech recognition grammars that in turn enable the development of more challenging activities for students. It is also desirable to carry out this acoustic adaptation in such a way that good recognition performance is maintained for native speakers, such as teachers, who may also interact with the system. To achieve both of these goals, our acoustic model training approach is based on model adaptation techniques (Digalakis & Neumeyer, 1996; Gauvain & Lee, 1994) that enable us to combine native and nonnative training data, so both types of speakers can be handled with the same model with good recognition performance. Typically, we use several thousand sentences of nonnative data, balanced over gender and language proficiency. The nonnative data should include a large enough number of speakers to represent the variability found in the target population; one hundred speakers is a typical number for this purpose. The original native model has typically been trained with even larger amounts of native data. The adaptation approach allows the combination of native and nonnative data with different weights, which are tuned to achieve a balance of performance between native and nonnative speakers. Typical recognition results after model adaptation show relative error reductions for the nonnative speakers of more than 50%, while native speaker recognition performance is maintained at levels similar to those before adaptation. For instance, in a typical recognition application the word error rate was reduced from 6.9% to 3.1%.
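The weighted combination of native and nonnative training data can be pictured with a minimal sketch. This is not SRI's actual adaptation algorithm (which builds on Digalakis & Neumeyer, 1996, and Gauvain & Lee, 1994); it only illustrates, under simplified assumptions, how count-weighted statistics from the two corpora might be pooled to re-estimate a Gaussian mean, with the corpus weights acting as the tuning knobs mentioned above.

```python
import numpy as np

def pool_gaussian_means(native_mean, native_count,
                        nonnative_mean, nonnative_count,
                        w_native=1.0, w_nonnative=1.0):
    """Hypothetical count-weighted pooling of per-Gaussian mean statistics.

    native_mean, nonnative_mean: (dim,) per-corpus sample means for one Gaussian
    native_count, nonnative_count: effective frame counts (state occupancies)
    w_native, w_nonnative: corpus weights tuned to balance native/nonnative accuracy
    """
    num = (w_native * native_count * native_mean +
           w_nonnative * nonnative_count * nonnative_mean)
    den = w_native * native_count + w_nonnative * nonnative_count
    return num / den

# Toy example: a single Gaussian mean pulled toward the (up-weighted) nonnative data.
adapted = pool_gaussian_means(np.array([1.0, 2.0]), 5000.0,
                              np.array([1.4, 1.6]), 1000.0,
                              w_native=1.0, w_nonnative=3.0)
```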

Pronunciation Scoring
The pronunciation scoring approach initially developed by Bernstein et al. (1990) and Digalakis (1992) was based on fixed text prompts. Knowledge of the text can be used to build robust pronunciation scoring algorithms, but this approach limits generalizability, since new lessons require additional data collection. We refer to this class of algorithms as text dependent, because they rely on statistics related to specific words, phrases, or sentences.


For a system geared toward training or evaluating foreign-language students, the system typically elicits speech through various language instruction activities designed to ensure that the recognizer produces a correct transcription of the recordings most of the time. This transcription and its corresponding phonetic segmentation are used by the system to produce pronunciation scores. To allow the application using the pronunciation scoring software to be flexible and extensible enough for language instructors to be able to modify and design lessons without expert knowledge in speech recognition technology, the original text-dependent scoring algorithms were extended and generalized, and new algorithms were devised so that text-independent pronunciation scoring was possible (Franco et al., 1997; Neumeyer et al., 1996).

The pronunciation scoring paradigm uses HMMs (Digalakis, Monaco & Murveit, 1996) to recognize the text read by the learner and to generate phonetic segmentations of the learner's speech. With these segmentations, spectral match and prosodic scores can be derived by comparing the learner's speech to the speech of native speakers. The generation and calibration of machine-generated pronunciation scores follows three main steps (illustrated in the code sketch below):

1. Generation of a phonetic segmentation, using an HMM-based speech recognizer. The recognizer models can be trained on data from both native and nonnative speakers of the language.
2. Creation of different machine pronunciation scores for the different phonetic segments by comparing features of the learner's speech to those of native speakers by using statistical models.
3. Calibration of the scores, which includes combining several automatic measures and mapping them to estimate the judgment of human listeners as well as possible.

No statistics for specific sentences are used. Sentence and speaker scores are in most cases the average of phone-level scores, with the exception of word duration, which is a word-level score; consequently, the algorithms do not need to be trained with the actual sentences that are going to be used in the language learning applications. Because of this property, we refer to the scoring algorithms as text independent.

Our pronunciation scoring paradigm assumes that the phonetic segmentation is accurate. Therefore, the task for which pronunciation scoring is desired must be designed to ensure a high recognition rate. Reading aloud and multiple-choice exercises are examples of tasks well suited to pronunciation scoring. Table 1 summarizes the types of machine scores used and the features of nonnative speech that are used in each as indicators of pronunciation quality. The machine scores used in the EduSpeak system are described in more detail below.
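The three steps can be read as a small pipeline. The sketch below is only a schematic view of that flow; the function and object names are illustrative assumptions, not the EduSpeak API.

```python
from typing import Callable, Dict

def score_utterance(waveform,
                    prompt_text: str,
                    recognizer,                      # HMM recognizer (assumed interface)
                    scorers: Dict[str, Callable],    # e.g. posterior, duration, speech-rate scorers
                    calibrator) -> float:
    """Sketch of the three-step flow: segment, score, calibrate (hypothetical interfaces)."""
    # 1. Phonetic segmentation via forced alignment against the prompt text.
    segments = recognizer.align(waveform, prompt_text)

    # 2. Text-independent machine scores computed from the segments.
    machine_scores = {name: fn(segments, waveform) for name, fn in scorers.items()}

    # 3. Calibration: map the vector of machine scores to an estimated human grade.
    return calibrator.predict(machine_scores)
```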

Spectral match scores (Posterior scores)


We use a set of context-independent models together with the HMM phone alignment to compute an average log-posterior probability for each phone.


Table 1. Machine scores and features of nonnative speech indicating pronunciation quality

Spectral match: Estimates how close the spectral features of the nonnative sounds are to the corresponding spectral features of native sounds. Uses a normalized probability that the nonnative sounds were produced with a native model. The closer the nonnative sound is to the native realizations, the higher the probability. An average of the phone-level log probabilities over the phones of a sentence is used to represent the overall spectral match to native realizations.

Phone duration: Estimates how close the durations of the nonnative phones are to the durations of the corresponding native phones. Uses phone-specific statistical models of normalized duration, estimated from segmentations of native speech, to compute probabilities of durations of nonnative phones. An average of log probabilities over the phones of a sentence represents the overall phone duration score for the sentence.

Word duration: Estimates how close the durations of the nonnative words are to the durations of the native words. The word duration is represented by a duration feature, that is, a vector composed of the durations of the individual phones in the word. Statistical models of the vector of phone durations for each word are estimated using native speech. An average of the log probabilities of the words in a nonnative sentence represents the overall word duration score for the sentence.

Speech rate: Native speakers typically speak faster than second language learners. Rate of speech (ROS), defined as the average number of phones per unit of time in a sentence, can be used as a predictor of overall language proficiency.

For each frame belonging to a segment corresponding to the phone $q_i$ we compute the frame-based posterior probability $P(q_i \mid y_t)$ of the phone $q_i$ given the observed spectral vector $y_t$:

$$
P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\,P(q_i)}{\sum_{j=1}^{M} p(y_t \mid q_j)\,P(q_j)} \qquad (1)
$$

The sum over $j$ runs over a set of context-independent models for all phone classes. $P(q_i)$ represents the prior probability of the phone class $q_i$, and $p(y_t \mid q_i)$ is the probability density of the current observation under the model corresponding to phone $q_i$. The posterior score for each phone is the average of the frame-based log-posterior probability over all frames of the phone. In turn, the posterior-based score for a whole sentence is defined as the average of the individual posterior scores over the $N$ phones in the sentence. Since acoustic channel variations and speaker characteristics affect the numerator and denominator of the posterior probability similarly, this score is robust to changes in acoustic conditions when providing an estimate of pronunciation quality.
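A minimal sketch of how such a score could be computed is shown below, assuming per-frame log-likelihoods $\log p(y_t \mid q_j)$ for all $M$ context-independent phone models are already available (e.g., from GMM evaluations). It illustrates Eq. (1) in the log domain; it is not the EduSpeak implementation.

```python
import numpy as np
from scipy.special import logsumexp

def phone_posterior_score(frame_loglik: np.ndarray,
                          log_priors: np.ndarray,
                          phone_index: int) -> float:
    """Average frame-level log-posterior of one phone over its segment.

    frame_loglik: (T, M) array of log p(y_t | q_j) for the T frames of the segment
                  and all M context-independent phone models.
    log_priors:   (M,) array of log P(q_j).
    phone_index:  index of the phone the segment is aligned to.
    """
    joint = frame_loglik + log_priors                             # log p(y_t | q_j) + log P(q_j)
    log_post = joint[:, phone_index] - logsumexp(joint, axis=1)   # Eq. (1), in the log domain
    return float(np.mean(log_post))

def sentence_posterior_score(segment_scores) -> float:
    """Sentence score: average of the per-phone posterior scores."""
    return float(np.mean(segment_scores))
```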

Phone duration scores


To calculate phone duration scores, we first obtain the duration in frames of the i-th phone from the Viterbi alignment. The corresponding phone duration score is the log probability of that duration, computed using a duration distribution for that phone; the duration distributions have been previously trained from alignments generated for the native training data. We normalize phone durations by a measure of the ROS, which is the average number of phones per unit of time in a sentence or over all utterances by a single speaker. The phone duration score for a whole sentence is defined as the average of the individual duration scores over all the phones in the sentence.
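The following sketch shows one plausible way to compute this score. The per-phone Gaussian over log, ROS-normalized durations and the particular normalization used are assumptions for illustration; the paper does not specify the distribution family.

```python
import numpy as np
from scipy.stats import norm

def phone_duration_score(durations_frames, phone_labels, duration_models, ros):
    """Average log-probability of ROS-normalized phone durations for one sentence.

    durations_frames: list of phone durations (in frames) from the Viterbi alignment
    phone_labels:     list of phone identities, one per segment
    duration_models:  dict phone -> (mean, std) of log(normalized duration), trained on native data
    ros:              rate of speech (phones per second) used for normalization
    """
    scores = []
    for dur, phone in zip(durations_frames, phone_labels):
        mean, std = duration_models[phone]
        norm_dur = np.log(dur * ros)              # one plausible ROS normalization (assumption)
        scores.append(norm.logpdf(norm_dur, mean, std))
    return float(np.mean(scores))                 # sentence score = mean over phones
```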

Word duration scores


We describe a new method to produce duration scores based on word-level models. The duration models are trained from the native acoustic training data. In word duration modeling, each word is represented by a duration feature, that is, a vector composed of the durations of the individual phones in the word. For example, the word 'that' has the corresponding phone sequence dh + ae + t and is represented by a duration feature (100, 80, 40), where the three values, 100, 80, and 40, represent the durations in milliseconds of the three phones dh, ae, and t, respectively. Thus, the duration feature captures the durations of the phones within the context of the given word. As the phones within the word are modeled jointly, the model can also capture correlations across the phone durations within the word. These duration features can be computed from the phone segmentations for all the utterances in the acoustic training data. Given sufficient instances of a word, we can train statistical models to represent the word duration. In our experiments, we found it convenient to use GMMs, although other models are possible.

A number of issues were addressed in developing the word duration models. First, we addressed the issue of unseen words. Since the word duration models were trained from the acoustic training data, the models were limited to the words in the training vocabulary. In addition, we trained a model for a word only if there were a minimum number of occurrences of the word in the training data. Therefore it was possible, during recognition, to encounter words for which we did not have a model. To handle these cases, we trained duration models of generic individual triphones and phones along with those for word-specific phones. During recognition, we scored unseen words using a simple backoff scheme, similar to the one used for acoustic modeling for speech recognition, in which the generic triphone models were used to score the unseen word. If a triphone model did not exist, we backed off to the corresponding phone model.

Another issue that we considered was the duration effect known as prepausal lengthening, which refers to the lengthening of the syllables of a word preceding a pause. To incorporate this effect into the models, we trained separate models for words followed by another word and for words followed by a pause.

The third issue we considered relates to normalization of rate of speech across different speakers. We computed the rate of speech as the average number of phones per second over an utterance. Pause and noise phones were excluded in computing the rate of speech. The rate of speech was then used to normalize the duration of the phones. We used normalization at the sentence level and compared the performance with the unnormalized performance. The word duration score for a whole sentence is defined as the average of the word duration scores over all the words in the sentence.
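A sketch of the scoring side of this scheme is given below. The duration-feature GMM and the triphone/phone backoff follow the description above, but the data structures and the ROS normalization shown are illustrative assumptions, not the toolkit's implementation.

```python
import numpy as np

def word_duration_score(words, word_gmms, triphone_gmms, phone_gmms, ros):
    """Average per-word log-probability of the phone-duration feature vector.

    words:         list of (word, [(phone, duration_ms, triphone), ...]) from the alignment
    word_gmms:     dict word -> GMM over that word's phone-duration vector
                   (e.g. a fitted sklearn GaussianMixture)
    triphone_gmms: dict triphone -> 1-D duration GMM (backoff level 1)
    phone_gmms:    dict phone -> 1-D duration GMM (backoff level 2)
    ros:           rate of speech used to normalize durations (assumption)
    """
    scores = []
    for word, segs in words:
        feature = np.array([[d * ros for (_, d, _) in segs]])     # normalized duration feature
        if word in word_gmms:                                     # seen word: word-level model
            scores.append(word_gmms[word].score(feature))
        else:                                                     # unseen word: back off
            per_phone = []
            for phone, d, tri in segs:
                model = triphone_gmms.get(tri, phone_gmms[phone])
                per_phone.append(model.score(np.array([[d * ros]])))
            scores.append(float(np.mean(per_phone)))
    return float(np.mean(scores))                                 # sentence score = mean over words
```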


Speech rate
Native speakers and advanced learners often speak faster than do beginning learners. Thus, ROS alone or in combination with other scores can be used as a predictor of overall language proficiency. The toolkit uses the sentential speech rate, defined as the average number of phones per unit of time in a sentence.

Combination of scores and calibration


In the previous section we have seen how, based on speech segmentations and probabilistic models, we can produce different pronunciation scores for individual sentences that can be used as predictors of the pronunciation quality. Different types of these machine scores can be combined to obtain a better prediction of the overall pronunciation quality. Furthermore, we want to use the pronunciation scores to obtain an estimate of the pronunciation grade that a human rater would have produced. A principled approach is used to obtain the mappings from machine scores to human pronunciation quality ratings. Each sentence is classified as belonging to one of N classes, where the classes are the discrete pronunciation grades assigned by human raters. In the EduSpeak toolkit the combination of machine scores and pronunciation grade estimation is achieved using a decision tree (Breiman et al., 1984). Decision trees are classifiers that represent the classification knowledge in a tree form. A tree can be used to classify a vector of machine scores into one of several possible classes, each class representing a final node (a leaf) of the tree. By using the labeled training data composed of machine scores and the corresponding human grades, we can build the tree and obtain the rule that assigns the predicted human grade to each leaf with the aid of available tree construction algorithms (Buntine & Caruana, 1992).
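A sketch using an off-the-shelf decision tree is shown below. The paper's system builds CART-style trees (Breiman et al., 1984) with the IND package (Buntine & Caruana, 1992), so the scikit-learn classifier here is only a stand-in, and the feature values are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [posterior, phone_duration, word_duration, speech_rate] machine scores
# for one sentence; labels are the human grades (1-5). Values are illustrative.
X_train = np.array([[-2.1, -1.3, -1.0,  8.2],
                    [-1.2, -0.8, -0.7, 10.5],
                    [-0.6, -0.5, -0.4, 12.1]])
y_train = np.array([2, 3, 5])

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X_train, y_train)

x_new = np.array([[-1.0, -0.7, -0.6, 11.0]])
grade = tree.predict(x_new)[0]                       # predicted human grade for a new sentence

# Expected grade given the leaf's class distribution (used later for correlation).
probs = tree.predict_proba(x_new)[0]
expected_grade = float(np.dot(probs, tree.classes_))
```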

Experimental results for nonnative Spanish


To compute and evaluate machine scores for automatic pronunciation assessment we need databases containing both native and nonnative speech. The native corpus is used to train recognizer models that are used to provide the reference native models against which the nonnative speech is scored. We also use the native corpus to estimate probability distributions of measures like segment duration or syllabic timing. The nonnative corpus is used to compute the different machine scores and to evaluate their performance. Since machine score performance is measured by correlating machine scores for a given set of sentences with the corresponding human grades, the nonnative corpus needs to have been previously rated by human experts. In addition, to be able to properly interpret correlation results, we evaluate inter- and intra-rater consistency. To this end, the nonnative database is divided into different sets. A different expert rates each set, and one of the sets is rated by all the experts. Intra-rater correlation is evaluated by making each rater grade the same utterance several times, on different days and in different contexts.


The native corpus used in this work consisted of 26,571 newspaper sentences recorded by a total of 142 native Latin American Spanish speakers from selected countries (Mexico, Colombia, Ecuador, Peru, Venezuela) who did not have a strong regional accent. This corpus was used to train the speech recognizer. To compute the duration mapping we used a separate, smaller set of 2939 sentences. The nonnative corpus contained 14,000 newspaper sentences, read by 206 adult American English speakers with some knowledge of Spanish.

While in previous studies (Neumeyer et al., 1996) utterance-level ratings had been collected from expert language teachers, in this study our approach was to find native speakers with no necessary language-related expertise and to select the best correlated through a pilot study. Starting with 11 raters in the pilot, we chose the five best-correlated. The raters were calibrated to each other by scoring a small common set of data. The ratings were then discussed within the group, trying to get all raters to agree with a majority vote. This process was iterated several times, with different sets of data, for the raters to converge on their ratings. The selected raters rated the overall pronunciation of each nonnative sentence on a scale of 1 to 5, with grade 1 corresponding to 'strongly nonnative' and 5 corresponding to 'almost native', and the categories 'significantly nonnative', 'moderately nonnative', and 'slightly nonnative' corresponding to the grades in between. Speaker-level grades were also computed as the average of sentence grades for each speaker. A common pool of 4116 sentences was rated by all raters, with the remaining sentences randomly assigned to each rater regardless of the speaker. A subset of that pool, 820 utterances, was presented twice to each rater to estimate intra-rater reliability.

We computed the correlations between a rater's grade and the average of the grades given by all other raters. The average correlation across all raters was r = 0.78 at the sentence level, which can be considered an upper bound on the correlation that can be expected between machine scores and human grades. The average intra-rater correlation computed across repeated judgments of the same utterance by the same rater was r = 0.79. An additional set of native data, on the order of 10% of the nonnative data size, was used to optionally extend the scoring levels with a 'native' category, to which was assigned a grade of 6.

The nonnative and native speech databases were divided into two equal-size sets with no speakers in common. We estimated the parameters of the classification tree in one set and evaluated the correlation of the predicted scores and the human grades in the other set. Then we repeated the procedure with the sets swapped and averaged the two correlation coefficients. We computed and evaluated the classification tree once with the additional native speakers and once without native speakers. To examine the effects of the different machine scores, we developed classification trees to estimate human grades by mapping and combining different sets of machine scores. We have already shown that when sufficient speech data is available for a speaker, it is possible to achieve a correlation with human grades that is comparable to the correlation between humans (Franco et al., 1997). The current challenge is to improve the human–machine correlation by using only speech features extracted from a single sentence.
The human–machine correlation was computed between the human grade and the conditional expected value of the classification tree's grade given the machine scores, for each sentence. Table 2 shows sentence-level human–machine correlations with mapped and combined scores, with and without the set of native speakers.
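Concretely, the figure reported for each condition can be computed as a Pearson correlation between the human grades and the tree's expected grades. A minimal sketch follows; the numbers are made up, not the paper's data.

```python
import numpy as np

# Human grades per sentence and the tree's conditional expected grade,
# i.e. sum_k k * P(grade = k | machine scores). Values are illustrative.
human_grades    = np.array([2, 3, 3, 4, 5, 1, 2, 4])
expected_grades = np.array([2.4, 2.9, 3.3, 3.8, 4.6, 1.5, 2.2, 4.1])

r = np.corrcoef(human_grades, expected_grades)[0, 1]   # human-machine correlation
print(round(r, 3))
```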


Table 2. Human–machine correlations at the sentence level with mapped and combined machine scores. For each score combination, the first value is the correlation for native and nonnative Spanish speakers together; the second is for nonnative Spanish speakers only.

Posterior scores: 0.642 / 0.585
Phone duration scores: 0.548 / 0.414
Word duration scores: 0.606 / 0.449
Phone duration + word duration scores: 0.611 / 0.455
Posterior + phone duration scores: 0.680 / 0.594
Posterior + word duration scores: 0.709 / 0.611
Posterior scores + phone duration scores + speech rate: 0.780 / 0.607
Posterior scores + word duration scores + speech rate: 0.785 / 0.623

We observe that the word duration score was consistently better than the phone duration score across all comparable conditions. In the set that includes some native data, using only the word duration score improved the correlation by about 10% with respect to the phone duration score; for the set with only nonnative data the gain was about 8%. Combining both duration scores produced a minor gain of about 1% in correlation. The word duration score was still less effective than the posterior score, but provided significant gains when combined with it. When combining the duration scores with the posterior scores, there was an increase of more than 10% in correlation for the word duration score and 6% for the phone duration score in the set with native data, versus a gain of 4.4% for the word duration score and 1.5% for the phone duration score in the set with nonnatives only.

For the dataset with both native and nonnative speakers, the further addition of scores based on speech rate to the posterior and word duration scores increased the correlation by 10.7%. There was only a small advantage of the word duration scores over the phone duration scores in this combination, while for the dataset with only nonnative speakers there was a similar gain of about 2% when adding rate to both word and phone duration scores combined with posterior scores. The highest correlation was obtained with the combination of posterior, word duration, and speech rate scores, producing a correlation of 0.623. This sentence-level correlation was only 9.7% lower than the pairwise human–human sentence-level correlation of 0.690.

Phone-level Mispronunciation Detection


The techniques described in the previous sections allow the calculation of overall pronunciation quality ratings for a sentence. To provide useful automatic feedback on individual phones, we need to reliably detect whether a phone is native-like or nonnative in quality and, ideally, to evaluate how close it is to a native phone production along different phonetic features. In earlier work, posterior scores have been used to evaluate the pronunciation quality of specific phones (Kim, Franco & Neumeyer, 1997) as well as to detect phone mispronunciations (Witt & Young, 1997; Witt & Young, 1998). An alternative approach (Ronen, Neumeyer & Franco, 1997) uses HMMs with two alternative pronunciation models per phone, one trained on native speech and the other on strongly accented nonnative speech; mispronunciations are detected from the phone backtrace when the nonnative phone alternative is chosen.

The problem of discriminating native-like pronunciations from mispronunciations for a single phone has proven to be difficult, both from the point of view of the effectiveness of the detection and scoring algorithms, and from the point of view of the human experts who must label the mispronunciations of nonnative speech. Earlier modeling approaches (Kim, Franco & Neumeyer, 1998; Witt & Young, 1997) basically use a native model to produce a measure of goodness for the nonnative speech. While this measure correlates very well with human judgments for longer segments (i.e., paragraphs or sentences), the correlation decreases for shorter segments, such as phones (Kim, Franco & Neumeyer, 1997). One reason is that human judgments of pronunciation are less consistent on shorter segments. Another possibility is that measures derived from the native model are not accurate enough to capture consistently the differences in nonnative pronunciations at the phone level. Our goal is to model the more subtle differences between the nonnative speech realizations that are considered acceptable and the nonnative speech realizations that are considered mispronounced.

To compare these different approaches we evaluated two mispronunciation detection schemes. The first approach is based on phone posterior scores (Kim, Franco & Neumeyer, 1997), while the second is based on explicit acoustic modeling of the nonnative productions of correct and mispronounced phones. A log-likelihood ratio (LLR) of mispronounced and correct phone models (Franco et al., 1999c) was used as the measure of pronunciation quality in the second method.

Human detection of mispronunciation


To evaluate this modeling approach we need a nonnative speech database transcribed at the phone level with respect to the pronunciation variants of interest. With this goal in mind we phonetically transcribed a subset of the nonnative Spanish database (Bratt et al., 1998) described in the subsection 'Experimental results for nonnative Spanish', above. Four native Spanish-speaking expert phoneticians transcribed 2550 sentences, totaling 130,000 phones, of nonnative speech data. Those sentences, randomly divided among the transcribers, were produced by 206 nonnative speakers whose native language was American English. Their levels of proficiency varied, and an attempt was made to balance the number of speakers by level of proficiency as well as by gender. An additional set of sentences (one newspaper sentence from each of the 206 speakers), the common pool, was transcribed by all four phoneticians to assess human–human consistency.


A first step in acquiring the phonetic transcriptions was to define the transcription conventions. Choosing too narrow a phonetic transcription would take far too much time for the amount of transcription data, but too broad a transcription would not give enough detail to pinpoint pronunciation problems. One factor that made this task more tractable was that the native language of all the nonnative speakers was the same (American English). Therefore, we could expect to observe a relatively small set of common pronunciation problems. Also, we were interested only in nonnative phones; phones that the transcribers perceived as natively produced did not need to be described in any detail.

Given these issues, our approach was to define two sets of phones plus a set of diacritics. The first set of phones consisted of all the native phones in the targeted dialect of Spanish. The second set of phones consisted of phones of American English, such as some reduced vowels and the labio-dental fricative [v], which we expected to see carry over into nonnative pronunciations of Spanish. The diacritics were allowed to modify appropriate native phones. The transcribers were instructed that using a diacritic on a phone implied that the phone was not perceived as native, and the diacritic explained the way in which it was nonnative. Diacritics included aspiration for the voiceless stops, gliding for the non-low vowels, and length (i.e., nonnatively long). A catch-all diacritic, *, was included to represent a sound that was perceived as a nonnative rendition of a phone but for which no more specific method of indicating its nonnativeness was available. In this way, we reduced the transcription problem to a simpler one in terms of cognitive effort for the transcriber and ease of information entry, while still encoding the most important piece of information in all the transcriptions: the judgment of the nativeness of any given phone. Furthermore, for this study, the detailed phone-level transcriptions were collapsed into two categories: native-like and nonnative pronunciations.

Phone transcription
The main use of the phone-level human transcriptions is to train automatic systems to detect mispronunciations by nonnatives. We used the 206 common sentences to assess how reliably the transcribers agreed on the transcription of each of the 28 native phones, using the kappa coefficient statistic (Siegel & Castellan, 1988). On 10 of the phones, all four transcribers showed at least a moderate level of agreement (taking kappa > 0.41 to mean moderate agreement). In Table 3 we show the value of kappa for the phones with enough data in the common pool. We also show the percentage of times that a phone was labeled as mispronounced by any of the transcribers working on the common sentences, indicating how commonly a given phone is mispronounced.

The most reliable phones to transcribe were the approximants /b/, /d/, and /g/, which were also among the most frequently mispronounced. Phone /b/ was the most consistently transcribed, while flapped /r/, /w/, and // had moderate agreement and also were frequently mispronounced. Phones /m/ and /s/ had high consistency among transcribers, but were not that frequently mispronounced. Vowel /i/ was the only vowel to have good consistency. Some of the phones that were expected to be good predictors of nonnativeness, such as voiceless stops, most vowels, and /l/ and /rr/, did not have good consistency across all the transcribers.
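One simple way to compute such agreement figures is pairwise Cohen's kappa averaged over transcriber pairs, as sketched below with toy labels (not the study's data); the exact kappa variant used in the study follows Siegel & Castellan (1988).

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Per-transcriber labels (1 = mispronounced, 0 = native-like) for the same phone tokens
# of one phone class in the common pool. Illustrative values only.
labels = {
    "T1": [0, 1, 1, 0, 0, 1, 0, 1],
    "T2": [0, 1, 0, 0, 0, 1, 0, 1],
    "T3": [0, 1, 1, 0, 1, 1, 0, 0],
    "T4": [0, 1, 1, 0, 0, 1, 0, 1],
}

kappas = [cohen_kappa_score(labels[a], labels[b]) for a, b in combinations(labels, 2)]
mean_kappa = sum(kappas) / len(kappas)   # average pairwise agreement for this phone class
```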

Table 3. Consistency among the transcribers in labeling the phones (columns: Phone, Kappa, % nonnative)

Phone: a b b d e i k l m n o p r rr s t u w y z
Kappa: 0.29 0.90 0.70 0.74 0.18 0.68 0.48 0.32 0.26 0.85 0.15 0.46 0.27 0.36 0.42 0.29 0.57 0.37 0.23 0.43 0.39 0.35
% nonnative: 17 42 73 68 24 73 20 47 28 17 6 33 20 38 42 77 6 34 18 40 18 84

Note: The consistency is evaluated using kappa for phones in the common pool. The percentage of times that a phone is labeled as mispronounced is shown in the last row.

In another approach to assess consistency across transcribers we aligned the transcription from each phonetician with the expected transcription known from the prompt text. From these alignments we derived the sequence of correct and mispronounced labels for the phones of each sentence. We then compared the labels between each rater pair by counting the percentage of the time that they disagreed. The average across all the rater pairs was an estimate of the mean disagreement. The resulting value of 19.8% can be considered as a lower bound to the average detection error that an automatic detection system may achieve.

Mispronunciation detection methods


Two approaches were evaluated and compared. Both assume that a phonetic segmentation of the utterance has been obtained in a first step by using the speech recognition engine and the known transcription. In the first approach, the log-posterior probability-based scores defined by Eq. (1) are computed for each phone segment. The class-conditional phone distributions used to compute the posterior probabilities are GMMs that have been trained with a large database of native speech. A mispronunciation is detected when the phone posterior score falls below a threshold predetermined for each phone class.

The second approach uses the phonetically labeled nonnative database to train two different GMMs for each phone class: one model is trained with the correct, native-like pronunciations of a phone, while the other model is trained with the mispronounced or nonnative pronunciations of the same phone. A four-way jackknifing procedure was used to train and evaluate this approach on the same phonetically transcribed nonnative database; there were no common speakers across the evaluation and training sets. In the evaluation phase, for each phone segment $q_i$ a length-normalized log-likelihood ratio score $LLR(q_i)$ was computed using the mispronounced, $\lambda_M$, and correct, $\lambda_C$, pronunciation models, respectively:

$$
LLR(q_i) = \frac{1}{d_i} \sum_{t=t_i}^{t_i + d_i - 1} \left[ \log p(y_t \mid q_i, \lambda_M) - \log p(y_t \mid q_i, \lambda_C) \right] \qquad (2)
$$

where $t_i$ is the initial frame of the phone segment and $d_i$ its duration in frames.
For each phone segment we calculated the log of the ratio of the probability that the segment was produced by the mispronounced model to the probability that it was produced by the correct model. If a phone was mispronounced, the first term would be larger than the second and the log ratio would be positive, and conversely for a correctly pronounced phone. The normalization by the phone duration $d_i$ allows the definition of a unique threshold for the LLR of each phone class, independent of the lengths of the segments. A mispronunciation is detected when the LLR is above a predetermined threshold, specific to each phone. For both detection approaches, and for each phone class, different types of performance measures were computed for a wide range of thresholds, receiver operating characteristic (ROC) curves were obtained, and optimal thresholds were determined at the points of equal error rate (EER).
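A sketch of the LLR detector is given below, assuming one 'correct' and one 'mispronounced' GMM per phone class (for example, scikit-learn GaussianMixture objects trained on the transcribed nonnative data) and a per-phone threshold. It mirrors Eq. (2) but is not the toolkit code.

```python
import numpy as np

def llr_score(frames: np.ndarray, gmm_mispronounced, gmm_correct) -> float:
    """Length-normalized log-likelihood ratio for one phone segment, as in Eq. (2).

    frames: (d_i, dim) spectral feature vectors of the segment.
    The GMMs are assumed to expose score_samples(), returning per-frame log-likelihoods.
    """
    llr_per_frame = gmm_mispronounced.score_samples(frames) - gmm_correct.score_samples(frames)
    return float(np.mean(llr_per_frame))      # mean over frames = (1 / d_i) * sum over the segment

def detect_mispronunciation(frames, phone, models_m, models_c, thresholds) -> bool:
    """Flag the segment as mispronounced if its LLR exceeds the phone-specific threshold."""
    return llr_score(frames, models_m[phone], models_c[phone]) > thresholds[phone]
```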

Experimental results
We generated phonetic alignments and produced posterior scores using the EduSpeak speech recognizer. Given the alignments, the detection of mispronunciation is reduced to a binary decision problem as the phone class is given by the alignments. Consequently, the mispronunciation detection performance can be studied for each phone class independently. For each threshold we obtained the machine-produced labels correct (C) or mispronounced (M) for each phone utterance. Then, we compared the machine labels with the labels obtained from the human phoneticians. Two types of performance measure were computed for each threshold: error measures and correlation measures. The error measures were the probability of a false positive, estimated as the percentage of cases where a phone utterance is labeled by the machine as incorrect when it was in fact correct, and the probability of a false negative, that is, the probability that the machine labeled a phone utterance as correct when it was in fact incorrect.


To compute the correlation measures we first converted the C and M labels to numeric values 0 and 1, respectively. Then the 0–1 strings from machine and human judgments for each phone were correlated. In comparing detection performance across phone classes, the measure should not be affected by the priors of the labels. Consequently, we evaluated the mispronunciation detection performance by computing the ROC curve and finding the point of EER, where the probability of a false positive is equal to the probability of a false negative. This error measure is independent of the actual priors for the C or M labels, but results in a higher total error rate than the possible minimum when the priors are skewed.

In Table 4 we show the EER and the correlation coefficient for each phone class and for both detection methods. The phones whose nativeness or mispronunciation were detected most reliably were the approximants /b/, /d/, and /g/, the voiced stops /b/ and /d/, the fricative /x/, and the semivowel /w/. These phone classes agree well with those found to be the most consistent across different transcribers in Table 3. The LLR method performed better than the posterior-based method for almost all phone classes. The lower performance for the nasals /m/ and /ɲ/ may be attributed to the very few training examples for the mispronounced phones. The reduction in error rate was not uniform across phone classes. The advantage of the LLR method over the posterior method is more significant for the phone classes transcribed with the highest consistency. The exception was the flap /r/ sound: despite the good consistency among transcribers and a significant amount of training data, we did not find comparably good performance for this phone. A possible reason may be that the spectral features used for classification do not capture the main distinctive features that distinguish the native-like and the mispronounced variants of the /r/ sound.

On average, the EER had a relative reduction of 33% for the seven most reliably detected phone classes referred to above, when going from posterior-based to LLR-based detection. Acceptable levels of the correlation coefficients were also found for that set of phones. The overall weighted average of the phone mispronunciation detection EER was 35.5% when log-posterior scores were used, while 32.3% EER was obtained when the LLR method was used. If instead of the EER we take the minimum total detection error for each phone class, and compute the average error weighted by the number of examples in each class, the resulting minimum average error is 21.3% for the posterior-based method and 19.4% for the LLR-based method. This minimum average error can be compared with the transcribers' percentage of pairwise disagreement reported in the subsection 'Phone transcription', above, as both take into account the actual priors of the evaluation data. The closeness of the human–machine and the human–human average errors suggests that the accuracy of the LLR-based detection method is bounded by the consistency of the human transcriptions.

In an actual application of the mispronunciation detection system there may be criteria other than the EER for defining an operating point along the ROC curve. For instance, for pedagogical reasons we may want to impose a maximum acceptable level of false positives; in that case the probability of false negatives could increase significantly.
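The EER for one phone class can be computed from the detection scores and the human labels as sketched below, using scikit-learn's ROC utilities; the score and label arrays are illustrative, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# LLR scores (or negated posterior scores) and human labels (1 = mispronounced)
# for the tokens of one phone class. Illustrative values only.
scores = np.array([0.8, -0.2, 1.5, 0.1, -0.9, 0.6, -0.4, 1.1])
labels = np.array([1,    0,   1,   0,    0,   1,    0,   1])

fpr, tpr, thr = roc_curve(labels, scores)
fnr = 1.0 - tpr
idx = np.argmin(np.abs(fpr - fnr))            # operating point closest to equal error rate
eer = (fpr[idx] + fnr[idx]) / 2.0
threshold_at_eer = thr[idx]                   # phone-specific detection threshold
```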



Table 4. Equal error rate and human–machine correlation at the phone level for the two detection methods (rows: Phone; Posterior EER and Corr.; LLR EER and Corr.)

Phone: a b b c d d e f g i k l m n o p r rr s t u w x y z
Posterior EER (%): 35.0 29.8 29.8 44.8 34.1 26.6 37.9 27.9 41.8 29.1 28.2 38.3 29.3 25.0 41.8 33.3 38.2 41.9 38.2 35.3 36.7 37.8 33.8 23.5 16.7 32.5 44.6
Posterior Corr.: 0.28 0.41 0.39 0.34 0.35 0.48 0.29 0.20 0.18 0.42 0.42 0.26 0.38 0.26 0.15 0.54 0.22 0.17 0.24 0.26 0.19 0.25 0.36 0.53 0.61 0.34 0.13
LLR EER (%): 35.6 15.5 21.5 39.6 20.3 19.0 37.4 27.9 28.8 20.1 26.2 32.4 28.9 28.2 35.4 45.1 39.4 32.9 34.5 33.7 27.8 31.1 35.3 14.9 15.1 32.3 30.7
LLR Corr.: 0.29 0.72 0.57 0.28 0.61 0.63 0.29 0.27 0.40 0.59 0.45 0.37 0.43 0.29 0.23 0.18 0.20 0.35 0.33 0.31 0.35 0.41 0.32 0.74 0.75 0.33 0.48

Conclusion
We have presented some of the features of the EduSpeak software development kit for the development of voice-interactive language education software. In so doing we have highlighted such characteristics as the availability of speaker-independent recognition models for nonnative speakers, which show over 50% relative error reduction for nonnative recognition with no degradation for native speakers. We have also reviewed some of the fundamentals of the pronunciation scoring algorithms embedded in the system and introduced a new pronunciation score based on word duration modeling.

The paper has reported new results of pronunciation scoring on English-accented Spanish using combinations of spectral and duration scores, revealing that the new word duration score was superior to the previous phone-level duration score in most cases. When used as the only score, it produced a significant gain in correlation compared with the previous phone duration score. On data from only nonnative speakers, correlations were always improved by the use of the word duration score rather than the phone duration score, in combination with spectral and rate-of-speech scores. When using evaluation data that contains a portion of native speakers, the addition of a rate-of-speech score produced a large increase in correlation that masked most of the gain from word duration scores, as rate-of-speech scores seemed to be very effective in discriminating native speakers from nonnative speakers in this dataset. These results suggest that the word duration score is useful in producing independent scores for duration only, as well as in helping to better discriminate between different degrees of proficiency for nonnative speakers.

Two mispronunciation detection algorithms have also been investigated. One algorithm is based on posterior probability scores computed using models of the native speech, and the other is based on models trained on actual nonnative speech, including both correct and mispronounced phone utterances. An important advantage of the posterior-based method is that the native models can be applied to detect mispronunciation errors for any type of nonnative accent. The LLR-based method, instead, needs to be trained with specific examples of the target nonnative user population. Experimental results show that the LLR-based system has better overall performance than the posterior-based method. The improvement is particularly significant for the phone classes with the highest consistency across transcribers.

The results also suggest that the reported performance of the system might have been limited by the accuracy and consistency of the transcriptions. This is suggested by (1) the agreement between the phone classes that were most consistent for humans and those best recognized by the machine, (2) the similar level of average error rate between pairs of humans on one hand and between machine and humans on the other hand, and (3) the fact that the level of improvement, when using the LLR method, is more marked on the most consistently labeled phone classes. The results showed the set of phones for which mispronunciation can be detected reliably; they mostly coincide with the phones that the phoneticians were able to transcribe more consistently. The overall error rate of the best system was 19.4%, which was similar to an estimate of pairwise human disagreement on the same task. However, the impact of this level of accuracy on student learning remains an open question and warrants further investigation.

References
Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. Proceedings of ICSLP 90, 1185–1188. Kobe, Japan.
Bernstein, J., De Jong, J. H. A. L., Pisoni, D., & Townshend, B. (2000). Two experiments on automatic scoring of spoken language proficiency. Proceedings of InSTILL 2000: Integrating Speech Technology in Learning, 57–61. University of Abertay Dundee, UK.
Bialystok, E., & Hakuta, K. (1994). In other words: The science and psychology of second-language acquisition. New York: Basic Books.
Bratt, H., Neumeyer, L., Shriberg, E., & Franco, H. (1998). Collection and detailed transcription of a speech database for development of language learning technologies. Proceedings of ICSLP 98, 1539–1542. Sydney, Australia.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. The Wadsworth & Brooks/Cole Statistics/Probability Series. Belmont, CA: Wadsworth International.
Buntine, W., & Caruana, R. (1992). Introduction to IND V. 2.1 and recursive partitioning. Moffett Field, CA: NASA Ames Research Center.
Digalakis, V. (1992). Algorithm development in the Autograder project. SRI International internal communication.
Digalakis, V., & Neumeyer, L. (1996). Speaker adaptation using combined transformation and Bayesian methods. IEEE Transactions on Speech and Audio Processing, 4(4), 294–300.
Digalakis, V., Monaco, P., & Murveit, H. (1996). Genones: Generalised mixture tying in continuous hidden Markov model-based speech recognizers. IEEE Transactions on Speech and Audio Processing, 4(4), 281–289.
Franco, H., & Neumeyer, L. (1998). Calibration of machine scores for pronunciation grading. Proceedings of ICSLP 98, 2631–2634. Sydney, Australia.
Franco, H., Neumeyer, L., Kim, Y., & Ronen, O. (1997). Automatic pronunciation scoring for language instruction. Proceedings of ICASSP 97, 1471–1474. Munich, Germany.
Franco, H., Neumeyer, L., Digalakis, V., & Weintraub, M. (1999a). Automatic scoring of pronunciation quality. Speech Communication, 30, 83–93.
Franco, H., Neumeyer, L., Digalakis, V., & Ronen, O. (1999b). Combination of machine scores for automatic grading of pronunciation quality. Speech Communication, 30, 121–130.
Franco, H., Neumeyer, L., Ramos, M., & Bratt, H. (1999c). Automatic detection of phone-level mispronunciations for language learning. Proceedings of Eurospeech 99, 2, 851–854. Budapest, Hungary.
Gauvain, J. L., & Lee, C. H. (1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2), 291–298.
Kim, Y., Franco, H., & Neumeyer, L. (1997). Automatic pronunciation scoring of specific phone segments for language instruction. Proceedings of EUROSPEECH 97, 649–652. Rhodes, Greece.
Neumeyer, L., Franco, H., Weintraub, M., & Price, P. (1996). Automatic text-independent pronunciation scoring of foreign language student speech. Proceedings of ICSLP 96, 1457–1460. Philadelphia, PA.
Precoda, K., Halverson, C., & Franco, H. (2000). Effect of speech recognition-based pronunciation feedback on second language pronunciation ability. Proceedings of InSTILL 2000: Integrating Speech Technology in Learning, 102–105. University of Abertay Dundee, UK.
Rao Gadde, V. R. (2000). Modeling word duration. Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP) 2000, 1, 601–604.
Ronen, O., Neumeyer, L., & Franco, H. (1997). Automatic detection of mispronunciation for language instruction. Proceedings of EUROSPEECH 97, 645–648. Rhodes, Greece.
Siegel, S., & Castellan, N. J., Jr. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
Witt, S., & Young, S. (1997). Language learning based on non-native speech recognition. Proceedings of EUROSPEECH 97, 633–636. Rhodes, Greece.
Witt, S., & Young, S. (1998). Performance measures for phone-level pronunciation teaching in CALL. Proceedings of the Workshop on Speech Technology in Language Learning, 99–102. Marholmen, Sweden.
