
Pronunciation Modeling in Spelling Correction

for Writers of English as a Foreign Language

Adriane Boyd
Department of Linguistics
The Ohio State University
1712 Neil Avenue
Columbus, Ohio 43210, USA
adriane@ling.osu.edu

Abstract

We propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English. Spell checkers, typically developed for native speakers, fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by differences in phonology. Our model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes models for both orthography and pronunciation. The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English.

1 Introduction

Spell checkers identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the likely intended word is high in the list. Since traditional spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by non-native writers and they often fail to suggest the appropriate corrections. Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; Mitton and Okada, 2007).

Okada (2004) and Mitton and Okada (2007) investigate spelling errors made by Japanese writers of English as a foreign language (JWEFL). Okada (2004) identifies two main sources of errors for JWEFL: differences between English and Japanese phonology, and differences between the English alphabet and the Japanese romazi writing system, which uses a subset of English letters. Phonological differences result in a number of distinctions in English that are not present in Japanese, and romazi causes difficulties for JWEFL because the Latin letters correspond to very different sounds in Japanese.

We propose a method for creating a model of pronunciation variation from a phonetically untranscribed corpus of read speech recorded by non-native speakers. The pronunciation variation model is used to generate multiple pronunciations for each canonical pronunciation in a pronouncing dictionary, and these variations are used in the spelling correction approach developed by Toutanova and Moore (2002), which uses statistical models of spelling errors that consider both orthography and pronunciation.

Several conventions are used throughout this paper: a word is a sequence of characters from the given alphabet found in the word list. A word list is a list of words. A misspelling, marked with *, is a sequence of characters not found in the word list. A candidate correction is a word from the word list proposed as a potential correction.

2 Background

Research in spell checking (see Kukich, 1992, for a survey of spell checking research) has focused on three main problems: non-word error detection, isolated-word error correction, and context-dependent word correction. We focus on the first two tasks. A non-word is a sequence of letters that

Proceedings of the NAACL HLT Student Research Workshop and Doctoral Consortium, pages 31-36, Boulder, Colorado, June 2009. (c) 2009 Association for Computational Linguistics
is not a possible word in the language in any context, e.g., English *thier. Once a sequence of letters has been determined to be a non-word, isolated-word error correction is the process of determining the appropriate word to substitute for the non-word.

Given a sequence of letters, there are thus two main subtasks: 1) determine whether this is a non-word, 2) if so, select and rank candidate words as potential corrections to present to the writer. The first subtask can be accomplished by searching for the sequence of letters in a word list. The second subtask can be stated as follows (Brill and Moore, 2000): Given an alphabet Σ, a word list D of strings ∈ Σ*, and a string r ∉ D and ∈ Σ*, find w ∈ D such that w is the most likely correction. Minimum edit distance is used to select the most likely candidate corrections. The general idea is that a minimum number of edit operations such as insertion and substitution are needed to convert the misspelling into a word. Words requiring the smallest numbers of edit operations are selected as the candidate corrections.

2.1 Edit Operations and Edit Weights

In recent spelling correction approaches, edit operations have been extended beyond single character edits and the methods for calculating edit operation weights have become more sophisticated. The spelling error model proposed by Brill and Moore (2000) allows generic string edit operations up to a certain length. Each edit operation also has an associated probability that improves the ranking of candidate corrections by modeling how likely particular edits are. Brill and Moore (2000) estimate the probability of each edit from a corpus of spelling errors. Toutanova and Moore (2002) extend Brill and Moore (2000) to consider edits over both letter sequences and sequences of phones in the pronunciations of the word and misspelling. They show that including pronunciation information improves performance as compared to Brill and Moore (2000).

2.2 Noisy Channel Spelling Correction

The spelling correction models from Brill and Moore (2000) and Toutanova and Moore (2002) use the noisy channel model approach to determine the types and weights of edit operations. The idea behind this approach is that a writer starts out with the intended word w in mind, but as it is being written the word passes through a noisy channel resulting in the observed non-word r. In order to determine how likely a candidate correction is, the spelling correction model determines the probability that the word w was the intended word given the misspelling r: P(w|r). To find the best correction, the word w is found for which P(w|r) is maximized: argmax_w P(w|r). Applying Bayes' Rule and discarding the normalizing constant P(r) gives the correction model:

argmax_w P(w|r) = argmax_w P(w)P(r|w)

P(w), how probable the word w is overall, and P(r|w), how probable it is for a writer intending to write w to output r, can be estimated from corpora containing misspellings. In the following experiments, P(w) is assumed to be equal for all words to focus this work on estimating the error model P(r|w) for JWEFL.[1]

Brill and Moore (2000) allow all edit operations α → β where Σ is the alphabet and α, β ∈ Σ*, with a constraint on the length of α and β. In order to consider all ways that a word w may generate r with the possibility that any, possibly empty, substring α of w becomes any, possibly empty, substring β of r, it is necessary to consider all ways that w and r may be partitioned into substrings. This error model over letters, called PL, is approximated by Brill and Moore (2000) as shown in Figure 1 by considering only the pair of partitions of w and r with the maximum product of the probabilities of individual substitutions. Part(w) is all possible partitions of w, |R| is the number of segments in a particular partition, and R_i is the ith segment of the partition.

The parameters for PL(r|w) are estimated from a corpus of pairs of misspellings and target words. The method, which is described in detail in Brill and Moore (2000), involves aligning the letters in pairs of words and misspellings, expanding each alignment with up to N neighboring alignments, and calculating the probability of each α → β alignment. Since we will be using a training corpus that consists solely of pairs of misspellings and words (see section 3), we would have lower probabilities for

[1] Of course, P(w) is not equal for all words, but it is not possible to estimate it from the available training corpus, the Atsuo-Henry Corpus (Okada, 2004), because it contains only pairs of words and misspellings for around 1,000 target words.

PL(r|w) ≈ max_{R ∈ Part(r), T ∈ Part(w)} ∏_{i=1}^{|R|} P(R_i → T_i)

PPHL(r|w) ≈ max_{pron_w} (1/|pron_w|) Σ_{pron_r} PPH(pron_w|pron_r) P(pron_r|r)

Figure 1: Approximations of PL from Brill and Moore (2000) and PPHL from Toutanova and Moore (2002)
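The max-product approximation of PL in Figure 1 can be computed with a Viterbi-style dynamic program that implicitly searches over all partitions rather than enumerating them. The sketch below is a hedged illustration, not the authors' implementation, and the substitution-probability table is a toy stand-in:

```python
# Dynamic program for the P_L approximation: best[i][j] holds the
# maximum product of substitution probabilities over any way of
# partitioning r[:i] and w[:j] into aligned substrings (length <= max_len).
def p_l(r, w, sub_prob, max_len=2):
    best = [[0.0] * (len(w) + 1) for _ in range(len(r) + 1)]
    best[0][0] = 1.0
    for i in range(len(r) + 1):
        for j in range(len(w) + 1):
            if best[i][j] == 0.0:
                continue
            for di in range(max_len + 1):
                for dj in range(max_len + 1):
                    if di == dj == 0 or i + di > len(r) or j + dj > len(w):
                        continue
                    # alpha: substring of w; beta: substring of r
                    p = sub_prob.get((w[j:j + dj], r[i:i + di]), 0.0)
                    if p > 0.0 and best[i][j] * p > best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] * p
    return best[len(r)][len(w)]

# Toy table: identity edits are likely, the 'ei' -> 'ie' swap less so.
probs = {(c, c): 0.9 for c in "their"}
probs[("ei", "ie")] = 0.1
print(p_l("thier", "their", probs))  # product 0.9 * 0.9 * 0.1 * 0.9
```

The best-scoring alignment for *thier given their uses the multi-character edit ei → ie, which is exactly the kind of generic substring operation the Brill and Moore model adds over single-character edit distance.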

α → α than would be found in a corpus with misspellings observed in context with correct words. To compensate, we approximate P(α → α) by assigning it a minimum probability m:

P(α → β) = m + (1 − m) · count(α → β)/count(α)   if α = β
P(α → β) = (1 − m) · count(α → β)/count(α)       if α ≠ β

2.2.1 Extending to Pronunciation

Toutanova and Moore (2002) describe an extension to Brill and Moore (2000) where the same noisy channel error model is used to model phone sequences instead of letter sequences. Instead of the word w and the non-word r, the error model considers the pronunciation of the non-word r, pron_r, and the pronunciation of the word w, pron_w. The error model over phone sequences, called PPH, is just like PL shown in Figure 1 except that r and w are replaced with their pronunciations. The model is trained like PL using alignments between phones. Since a spelling correction model needs to rank candidate words rather than candidate pronunciations, Toutanova and Moore (2002) derive an error model that determines the probability that a word w was spelled as the non-word r based on their pronunciations. Their approximation of this model, called PPHL, is also shown in Figure 1. PPH(pron_w|pron_r) is the phone error model described above and P(pron_r|r) is provided by the letter-to-phone model described below.

2.3 Letter-To-Phone Model

A letter-to-phone (LTP) model is needed to predict the pronunciation of misspellings for PPHL, since they are not found in a pronouncing dictionary. Like Toutanova and Moore (2002), we use the n-gram LTP model from Fisher (1999) to predict these pronunciations. The n-gram LTP model predicts the pronunciation of each letter in a word considering up to four letters of context to the left and right. The most specific context found for each letter and its context in the training data is used to predict the pronunciation of a word. We extended the prediction step to consider the most probable phone for the top M most specific contexts.

We implemented the LTP algorithm and trained and evaluated it using pronunciations from CMUDICT. A training corpus was created by pairing the words from the size 70 CMUDICT-filtered SCOWL word list (see section 3) with their pronunciations. This list of approximately 62,000 words was split into a training set with 80% of the entries and a test set with the remaining 20%. We found that the best performance is seen when M = 3, giving 95.5% phone accuracy and 74.9% word accuracy.

2.4 Calculating Final Scores

For a misspelling r and a candidate correction w, the letter model PL gives the probability that w was written as r due to the noisy channel, taking into account only the orthography. PPH does the same for the pronunciations of r and w, giving the probability that pron_w was output as pron_r. The pronunciation model PPHL relates the pronunciations modeled by PPH to the orthography in order to give the probability that w was written as r based on pronunciation. PL and PPHL are then combined as follows to calculate a score for each candidate correction:

SCMB(r|w) = log PL(r|w) + λ log PPHL(r|w)

3 Resources and Data Preparation

Our spelling correction approach, which includes error models for both orthography and pronunciation (see section 2.2) and which considers pronunciation variation for JWEFL, requires a number of resources: 1) spoken corpora of American English (TIMIT, TIMIT 1991) and Japanese English (ERJ, see below) are used to model pronunciation variation, 2) a pronunciation dictionary (CMUDICT, CMUDICT 1998) provides American English pronunciations for the target words, 3) a corpus of
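The smoothed estimate of P(α → β) given in section 2.2 is straightforward to compute from alignment counts. This is a sketch under stated assumptions: the counts are invented, and m = 0.8 follows the value of 80% reported later for the tuned parameters:

```python
# Sketch of the smoothed edit-probability estimate from section 2.2:
# identity edits (alpha == beta) receive a minimum probability m,
# with the remaining mass scaled by relative counts. Counts here
# are invented for illustration.
def edit_probability(alpha, beta, counts, m=0.8):
    total = sum(n for (a, _), n in counts.items() if a == alpha)
    rel = counts.get((alpha, beta), 0) / total if total else 0.0
    return m + (1 - m) * rel if alpha == beta else (1 - m) * rel

counts = {("e", "e"): 6, ("e", "a"): 4}
print(edit_probability("e", "e", counts))  # m + (1 - m) * 6/10
print(edit_probability("e", "a", counts))  # (1 - m) * 4/10
```

Reserving the mass m for identity edits compensates for a training corpus that contains only misspellings, where α → α alignments are rarer than in running text.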

spelling errors made by JWEFL (Atsuo-Henry Corpus, see below) is used to train spelling error models and test the spell checker's performance, and 4) Spell Checker Oriented Word Lists (SCOWL, see below) are adapted for our use.

The English Read by Japanese Corpus (Minematsu et al., 2002) consists of 70,000 prompts containing phonemic and prosodic cues recorded by 200 native Japanese speakers with varying English competence. See Minematsu et al. (2002) for details on the construction of the corpus.

The Atsuo-Henry Corpus (Okada, 2004) includes a corpus of spelling errors made by JWEFL that consists of a collection of spelling errors from multiple corpora.[2] For use with our spell checker, the corpus has been cleaned up and modified to fit our task, resulting in 4,769 unique misspellings of 1,046 target words. The data is divided into training (80%), development (10%), and test (10%) sets.

For our word lists, we use adapted versions of the Spell Checker Oriented Word Lists.[3] The size 50 word lists are used in order to create a general purpose word list that covers all the target words from the Atsuo-Henry Corpus. Since the target pronunciation of each item is needed for the pronunciation model, the word list was filtered to remove words whose pronunciation is not in CMUDICT. After filtering, the word list contains 54,001 words.

4 Method

This section presents our method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech. The pronunciation-based spelling correction approach developed in Toutanova and Moore (2002) requires a list of possible pronunciations in order to compare the pronunciation of the misspelling to the pronunciation of correct words. To account for target pronunciations specific to Japanese speakers, we observe the pronunciation variation in the ERJ and generate additional pronunciations for each word in the word list. Since the ERJ is not transcribed, we begin by adapting a recognizer trained on native English speech. First, the ERJ is recognized using a monophone recognizer trained on TIMIT. Next, the most frequent variations between the canonical and recognized pronunciations are used to adapt the recognizer. The adapted recognizer is then used to recognize the ERJ in forced alignment with the canonical pronunciations. Finally, the variations from the previous step are used to create models of pronunciation variation for each phone, which are used to generate multiple pronunciations for each word.

4.1 Initial Recognizer

A monophone speech recognizer was trained on all TIMIT data using the Hidden Markov Model Toolkit (HTK).[4] This recognizer is used to generate a phone string for each utterance in the ERJ. Each recognized phone string is then aligned with the canonical pronunciation provided to the speakers. Correct alignments and substitutions are considered with no context, and insertions are conditioned on the previous phone. Due to restrictions in HTK, deletions are currently ignored.

The frequencies of phone alignments for all utterances in the ERJ are calculated. Because of the low phone accuracy of monophone recognizers, especially on non-native speech, alignments are observed between nearly all pairs of phones. In order to focus on the most frequent alignments common to multiple speakers and utterances, any alignment observed less than 20% as often as the most frequent alignment for that canonical phone is discarded, which results in an average of three variants of each phone.[5]

4.2 Adapting the Recognizer

Now that we have probability distributions over observed phones, the HMMs trained on TIMIT are modified as follows to allow the observed variation. To allow, for instance, variation between p and th, the states for th from the original recognizer are inserted into the model for p as a separate path. The resulting phone model is shown in Figure 2. The transition probabilities into the first states

[2] Some of the spelling errors come from an elicitation task, so the distribution of target words is not representative of typical JWEFL productions, e.g., the corpus contains 102 different misspellings of albatross.
[3] SCOWL is available at http://wordlist.sourceforge.net.
[4] HTK is available at http://htk.eng.cam.ac.uk.
[5] There are 119 variants of 39 phones. The cutoff of 20% was chosen to allow a few variations for most phones. A small number of phones have no variants (e.g., iy, w) while a few have over nine variants (e.g., ah, l). It is not surprising that phones that are well-known to be difficult for Japanese speakers (cf. Minematsu et al., 2002) are the ones with the most variation.
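The variant-pruning step in section 4.1 can be sketched as follows. This is a hedged illustration with invented counts; the function name and data layout are assumptions, not the paper's code:

```python
# Sketch of variant pruning: for each canonical phone, keep only
# observed variants seen at least 20% as often as that phone's most
# frequent alignment. The counts below are invented for illustration.
def prune_variants(alignment_counts, ratio=0.2):
    """alignment_counts: {canonical: {observed: count}} -> pruned copy."""
    pruned = {}
    for canon, obs in alignment_counts.items():
        top = max(obs.values())
        pruned[canon] = {o: n for o, n in obs.items() if n >= ratio * top}
    return pruned

counts = {"p": {"p": 90, "th": 30, "f": 10}}
print(prune_variants(counts))  # keeps p and th; drops f (10 < 0.2 * 90)
```

The relative threshold adapts to each phone's overall frequency, which is why some phones end up with no variants and others with many.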

of the phones come from the probability distribution observed in the initial recognition step. The transition probabilities between the three states for each variant phone remain unchanged. All HMMs are adapted in this manner using the probability distributions from the initial recognition step.

[Figure 2: Adapted phone model for p accounting for variation between p and th. The original states p-1 → p-2 → p-3 (entry probability .6) are joined by a parallel path th-1 → th-2 → th-3 (entry probability .4).]

The adapted HMMs are used to recognize the ERJ Corpus for a second time, this time in forced alignment with the canonical pronunciations. The state transitions indicate which variant of each phone was recognized, and the correspondences between the canonical phones and recognized phones are used to generate a new probability distribution over observed phones for each canonical phone. These are used to find the most probable pronunciation variations for a native-speaker pronouncing dictionary.

4.3 Generating Pronunciations

The observed phone variation is used to generate multiple pronunciations for each pronunciation in the word list. The OpenFst Library[6] is used to find the most probable pronunciations in each case. First, FSTs are created for each phone using the probability distributions from the previous section. Next, an FST is created for the entire word by concatenating the FSTs for the pronunciation from CMUDICT. The pronunciations corresponding to the best n paths through the FST and the original canonical pronunciation become possible pronunciations in the extended pronouncing dictionary. The size 50 word list contains 54,001 words and, when expanded to contain the top five variations of each pronunciation, there are 255,827 unique pronunciations.

5 Results

In order to evaluate the effect of pronunciation variation in Toutanova and Moore (2002)'s spelling correction approach, we compare the performance of the pronunciation model and the combined model with and without pronunciation variation.

We implemented the letter and pronunciation spelling correction models as described in section 2.2. The letter error model PL and the phone error model PPH are trained on the training set. The development set is used to tune the parameters introduced in previous sections.[7] In order to rank the words as candidate corrections for a misspelling r, PL(r|w) and PPHL(r|w) are calculated for each word in the word list using the algorithm described in Brill and Moore (2000). Finally, PL and PPHL are combined using SCMB to rank each word.

5.1 Baseline

The open source spell checker GNU Aspell[8] is used to determine the baseline performance of a traditional spell checker using the same word list. An Aspell dictionary was created with the word list described in section 3. Aspell's performance is shown in Table 1. The 1-Best performance is the percentage of test items for which the target word was the first candidate correction, 2-Best is the percentage for which the target was in the top two, etc.

5.2 Evaluation of Pronunciation Variation

The effect of introducing pronunciation variation using the method described in section 4 can be evaluated by examining the performance on the test set for PPHL with and without the additional variations. The results in Table 1 show that the addition of pronunciation variations does indeed improve the performance of PPHL across the board. The 1-Best, 3-Best, and 4-Best cases for PPHL with variation show significant improvement (p<0.05) over PPHL without variation.

5.3 Evaluation of the Combined Model

We evaluated the effect of including pronunciation variation in the combined model by comparing the performance of the combined model with and without pronunciation variation; see results in Table 1. Despite the improvements seen in PPHL with pronunciation variation, there are no significant differences between the results for the combined model with and without variation. The combined model

[6] OpenFst is available at http://www.openfst.org/.
[7] The values are: N = 3 for the letter model, N = 4 for the phone model, m = 80%, and λ = 0.15 in SCMB.
[8] GNU Aspell is available at http://aspell.net.

Model 1-Best 2-Best 3-Best 4-Best 5-Best 6-Best
Aspell 44.1 54.0 64.1 68.3 70.0 72.5
Letter (L) 64.7 74.6 79.6 83.2 84.0 85.3
Pronunciation (PHL) without Pron. Var. 47.9 60.7 67.9 70.8 75.0 77.3
Pronunciation (PHL) with Pron. Var. 50.6 62.2 70.4 73.1 76.7 78.2
Combined (CMB) without Pron. Var. 64.9 75.2 78.6 81.1 82.6 83.2
Combined (CMB) with Pron. Var. 65.5 75.0 78.4 80.7 82.6 84.0
Table 1: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set for All Models
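As a sketch (not from the paper), the k-Best figures in Table 1 can be computed as the share of test items whose target word appears among the top k suggestions:

```python
# Illustrative k-Best metric: percentage of test items whose target
# word is among the top k ranked suggestions. The items below are toys.
def k_best_accuracy(results, k):
    """results: list of (target_word, ranked_suggestions)."""
    hits = sum(1 for target, ranked in results if target in ranked[:k])
    return 100.0 * hits / len(results)

items = [("any", ["enemy", "any"]), ("their", ["their", "the"])]
print(k_best_accuracy(items, 1))  # 50.0
print(k_best_accuracy(items, 2))  # 100.0
```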

with variation is also not significantly different from the letter model PL except for the drop in the 4-Best case.

To illustrate the performance of each model, the ranked lists in Table 2 give an example of the candidate corrections for the misspelling of any as *eney. Aspell preserves the initial letter of the misspelling and vowels in many of its candidates. PL's top candidates also overlap a great deal in orthography, but there is more initial letter and vowel variation. As we would predict, PPHL ranks any as the top correction, but some of the lower-ranked candidates for PPHL differ greatly in length.

Rank Aspell L      PHL      CMB
1    enemy  enemy  any      enemy
2    envy   envy   Emmy     envy
3    energy money  Ne       any
4    eye    emery  gunny    deny
5    teeny  deny   ebony    money
6    Ne     any    anything emery
7    deny   nay    senna    nay
8    any    ivy    journey  ivy

Table 2: Misspelling *eney, Intended Word any

5.4 Summary of Results

The noisy channel spelling correction approach developed by Brill and Moore (2000) and Toutanova and Moore (2002) appears well-suited for writers of English as a foreign language. The letter and combined models outperform the traditional spell checker Aspell by a wide margin. Although including pronunciation variation does not improve the combined model, it leads to significant improvements in the pronunciation-based model PPHL.

6 Conclusion

We have presented a method for modeling pronunciation variation from a phonetically untranscribed corpus of read non-native speech by adapting a monophone recognizer initially trained on native speech. This model allows a native pronouncing dictionary to be extended to include non-native pronunciation variations. We incorporated a pronouncing dictionary extended for Japanese writers of English into the spelling correction model developed by Toutanova and Moore (2002), which combines orthography-based and pronunciation-based models. Although the extended pronunciation dictionary does not lead to improvement in the combined model, it does lead to significant improvement in the pronunciation-based model.

Acknowledgments

I would like to thank Eric Fosler-Lussier, the Ohio State computational linguistics discussion group, and anonymous reviewers for their helpful feedback.

References

Brill, Eric and Robert C. Moore (2000). An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of ACL 2000.

CMUDICT (1998). CMU Pronouncing Dictionary, version 0.6. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Fisher, William (1999). A statistical text-to-phone function using ngrams and rules. In Proceedings of ICASSP 1999.

Kukich, Karen (1992). Techniques for automatically correcting words in text. ACM Computing Surveys 24(4).

Minematsu, N., Y. Tomiyama, K. Yoshimoto, K. Shimizu, S. Nakagawa, M. Dantsuji, and S. Makino (2002). English Speech Database Read by Japanese Learners for CALL System Development. In Proceedings of LREC 2002.

Mitton, Roger and Takeshi Okada (2007). The adaptation of an English spellchecker for Japanese writers. In Symposium on Second Language Writing.

Okada, Takeshi (2004). A Corpus Analysis of Spelling Errors Made by Japanese EFL Writers. Yamagata English Studies 9.

TIMIT (1991). TIMIT Acoustic-Phonetic Continuous Speech Corpus. NIST Speech Disc CD1-1.1.

Toutanova, Kristina and Robert Moore (2002). Pronunciation Modeling for Improved Spelling Correction. In Proceedings of ACL 2002.
