A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
By
Jacob Cockcroft
Bachelor of Arts, Hendrix College, 2000
2017
The University of Arkansas for Medical Sciences
MEASURING ORTHOGRAPHIC PREDICTABILITY iii
Acknowledgements
The author would like first and foremost to acknowledge the long-distance efforts of Dr.
Robert J. Drost in the creation of the analysis software, along with his many hours spent
resolving the numerous technical and theoretical issues involved in this analysis. Without
his contributions, this research and thesis would not have been possible. Also, many
thanks to Dr. Greg Robinson for his guidance, support and encouragement throughout the
evolution of this study. Many thanks as well to the rest of my committee members—Dr.
Chenell Loudermill, Dr. Dana Moser, and Ms. Stacey Mahurin—whose invaluable
feedback has helped shape this project into (hopefully) a streamlined, accessible product
for teachers, educators, therapists, and researchers. Gratitude is likewise owed to all the
faculty for Speech-Language Pathology & Audiology that I have had the pleasure to meet and
learn from during my recent academic journey: Dr. Donna Kelly, Mrs. Connie Bracy, Dr.
Ashlen Thomason, Dr. Betholyn Gentry, Dr. Tom Guyette, and Mrs. Shanna Williamson,
just to name a few. Thanks to my indispensable proofreaders and parents, Jean Cazort and
David Cockcroft. Thanks to Anna Salzer for realizing how much I would love Speech-
Language Pathology before I did, and Eli Wakefield for the blessing of watching child
language development in action. Finally, thanks to my eighteen cohort-mates for making these years memorable.
Table of Contents
Acknowledgements ............................................................................................................ iv
Introduction ......................................................................................................................... 1
Rationale.......................................................................................................................... 4
Literature Review................................................................................................................ 7
Conclusion..................................................................................................................... 14
Methodology ..................................................................................................................... 16
Corpus ........................................................................................................................... 16
Reliability ...................................................................................................................... 22
MEASURING ORTHOGRAPHIC PREDICTABILITY vi
Calculating Entropy....................................................................................................... 23
Results ............................................................................................................................... 25
Graphemes ..................................................................................................................... 25
Phonemes ...................................................................................................................... 32
Discussion ......................................................................................................................... 35
Limitations .................................................................................................................... 42
Conclusion ........................................................................................................................ 47
References ......................................................................................................................... 49
Appendix B: List of Phonemes Used by the English Lexicon Project (ELP) .................. 62
Introduction
“…the English alphabet is pure insanity…, it can hardly spell any word in the
language with any degree of certainty.” --Mark Twain
Exactly how predictable is written English? The above quip by Mark Twain echoes
the exasperation many literate English speakers have felt at one time or another, when
considering how difficult and unruly the English writing system appears to be. Twain’s
trademark humor concerning what he called our “drunken old alphabet” reflects a larger
attitude of his time that English spelling needed to be reformed (Twain, 2016, p. 111).
English orthography features complex rules with many exceptions that appear to defy
logical explanation. Imagine what would happen if one day the auto-correct feature
disappeared from our cell phones and word processors. How confidently could we spell on our own?
Rather confidently, according to the seminal work in the 1960s of Hanna, Hanna,
Rudorf, and Hodges (referred to as “HHRH” hereafter). According to their research, half
of all English words are predictably spelled from their sound, and another 34% would
have just a single error if spelled on the basis of sound (Moats, 2005). Their linguistic
analysis of over 17,000 of the highest frequency words in English resulted in the
conclusion that English orthography is predictable over 80% of the time (Hanna, 1966).
Despite this assertion, in recent years a growing area of cross-linguistic research once
again throws English predictability into question. Orthographic depth is a term used to
indicate the relative degree to which a language’s orthography diverges from its spoken
form (Katz & Frost, 1992). The traditional view of orthographic depth states that
languages fall along a continuum, with shallow (transparent) orthographies on one end
and deep (opaque) orthographies on the other (Seymour, Aro, & Erskine, 2003). Shallow
orthographies, such as those of Hungarian and Greek, feature relatively phonetic spelling
systems, in which word spellings map closely onto pronunciations. Deep orthographies,
such as those of English and French, align in complex and less predictable ways to the
oral languages they symbolize.
English is generally considered to have the most opaque orthography of all European
languages, and English speakers appear to struggle with reading and spelling in ways
that most other speakers of European languages do not. David Share dubbed English an
“outlier orthography,” and questioned why it still governs the vast majority of current
research on the behavioral study of reading (Share, 2008).
It appears that two contrasting perspectives exist in the literature regarding the
predictability of English orthography. If English spelling is as predictable as claimed
by the HHRH study, then why are English speakers performing consistently worse when
compared to speakers of other languages which utilize the same Latin alphabet? One way
to answer this question is to perform a new linguistic analysis of English orthography
using modern computational methods.
The current study sought to measure the degree of predictability of the sounds and
spelling patterns of Standard American English (SAE). Using a custom built software
program, a corpus of over 131 million words was deconstructed into its constituent
sounds and spelling patterns, and then the frequencies of these sound-symbol
correspondences were tallied. From these frequencies, the probabilities of all
sound-symbol correspondences in the corpus were determined, which could then be used to
calculate entropy. Entropy measures how unpredictable a random sampling from a
distribution is, given the amount of information that is available. This
method has been used in several past studies to measure sound-symbol predictability for
a variety of languages (Borgwaldt, Hellwig, & De Groot, 2005; Borgwaldt, Hellwig, De
Groot, & Licht, 2006).
The basic units of spoken language that will be measured by this study are called
phonemes. Phonemes are the building blocks of spoken syllables, e.g. vowels and
consonants. Phonemes are written using the International Phonetic Alphabet (IPA)
standard format, with each unique sound represented by a specific symbol between two
slashes. The spoken word bat has three phonemes, written /b/, /æ/, and /t/.
The basic units of written language are called graphemes; a grapheme is the written symbol that corresponds to a single phoneme. Graphemes are written between quotes. The written word “bat” has three
graphemes: “b,” “a,” and “t.” A grapheme might be a single written letter, but graphemes
in English may contain up to four letters, e.g. “ough” as in though.
In this study, the term orthographic correspondence (OC) is used to refer to the
pairing of a grapheme with a phoneme, without reference to a particular “direction” of
correspondence. Thus the word bat contains three OCs: “b” = /b/, “a” = /æ/, and “t” =
/t/. When analyzing OCs, researchers may choose to study grapheme-phoneme
correspondences (GPCs) or phoneme-grapheme correspondences (PGCs). GPCs relate to
decoding words, the process of translating a string of graphemes into a string of
phonemes (i.e. “sounding out” a written word). PGCs relate to encoding words, the
process of translating a string of phonemes into a string of graphemes (i.e. spelling a
word from its sound).
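For illustration, the two “directions” of correspondence can be sketched in a few lines of Python (the study’s own software was written in MATLAB; the word representation here is invented for the example):

```python
# Representing the word "bat" as a list of orthographic correspondences
# (OCs): (grapheme, IPA phoneme) pairs.
bat_ocs = [("b", "b"), ("a", "æ"), ("t", "t")]

# Decoding reads each OC grapheme-to-phoneme (GPCs): "sounding out".
decoded = "".join(phoneme for grapheme, phoneme in bat_ocs)

# Encoding reads each OC phoneme-to-grapheme (PGCs): spelling.
encoded = "".join(grapheme for grapheme, phoneme in bat_ocs)

print(decoded)  # bæt
print(encoded)  # bat
```

The same OC list supports both processes; only the direction of the mapping differs.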
Rationale
Phonemic awareness, or the understanding that words are composed of phonemes that
must be blended and segmented, was established by the National Reading Panel as one of
the most critical skills that must be learned in order for a student to achieve successful
literacy outcomes (National Reading Panel (US), National Institute of Child Health, &
Human Development (US), 2000). Research has shown that in all phases of development,
language learners employ decoding skills (Sharp, Sinatra, & Reynolds, 2008).
Sight word reading, according to Linnea Ehri, is the process of building the lexicon
through memorizing words so that they may be read “on sight,” without the need to
decode the individual phonemic units of a word, which is a slower process that requires
an extra cognitive load (Ehri, 2005). OCs form the basis for children learning sight
words, providing a “powerful mnemonic system” (p. 172). It is the OCs that provide the
“glue” to cement words in memory. The more unpredictable these correspondences are,
the more difficult decoding and learning sight words should be. For example, evidence
has shown that OC predictability affects naming latencies, i.e., words with
inconsistent sound-spelling features are read aloud more slowly than words with consistent ones.
OCs are equally important to the process of encoding, or spelling words. Decoding
and encoding are separate but highly interrelated processes, because students are
continually learning to read and spell words simultaneously. Students are provided visual
access to a word as a reference while they learn to spell it, so decoding and encoding
skills are learned together and mutually reinforce one another. As Ehri emphasizes,
learning to spell words supports the ability to later recognize them in print, just as
multiple exposures to reading a word improve the probability one will later be able to spell it.
A systematic measure of predictability for English phonemes and graphemes has the
potential to impact our current perspectives on how difficult it is to acquire and master
both spelling and literacy in the United States. Currently, the plethora of approaches on
how best to teach literacy and spelling often leaves educators and teachers overwhelmed
and confused about how best to approach the subject (Johnston, 2000; Schlagal, 2002).
A systematic measure could help ground spelling and literacy instruction for both
typical students and outliers. Considering the
entropy of OCs, for example, might provide a convenient method for organizing words for
instruction, and could supply a baseline for future cross-linguistic studies comparing
orthographic depth and its role in literacy acquisition. Measuring entropy values
(weighted by frequency) for all phonemes and graphemes in the corpus allows both a
total phonemic entropy and total orthographic entropy value to be computed. The first
number theoretically indicates how unpredictable spoken SAE is to spell; the second
indicates how unpredictable written English is to pronounce in SAE. These numbers could
be used in quantitative comparisons with other languages where entropy has likewise
been measured, provided it can be shown that the corpora are comparable.
Literature Review
English orthography did not develop according to any singular, overarching structural
model. Linguist David Crystal’s book “Spell it Out: The History of English Spelling”
illuminates the myriad sources that have contributed
over time to the evolution of the modern English writing system (Crystal, 2012). Anglo-
Saxon, Welsh, Norman-French, Old Norse, Latin and Greek have all contributed their
own arbitrary writing conventions and stylistic forms to the recipe of English. For
example, there is no logical reason why an “–e” must be added after a “v” at the ends of
words like have, give, love, etc., other than the fact that Anglo-Norman scribes of the period established the convention that words should not end in a bare “v.”
Additionally, word spellings tend to be more stable than word pronunciations, which
change at a faster rate. Many English words retain spellings from an earlier time, when
the words were pronounced differently. Furthermore, English spellings are often used to
preserve morphological relationships, e.g. keeping the letter “g” in sign, even though
it is not pronounced, to preserve its visible link to related words such as signal and signature.
The notion that English spelling was chaotic and in need of reform percolated among
American, British, and Irish literary circles at least as far back as the 18th century.
Benjamin Franklin, Mark Twain, Noah Webster, and Andrew Carnegie are among the notable
figures who sought to reform English spelling conventions to some extent. Franklin, for
example, proposed his own streamlined
alphabet, featuring a reduction of extraneous letters and the addition of new letters to
better capture the sounds of English (Webster & Franklin, 1789). Interestingly, the one
lasting outcome from Dr. Franklin’s attempt at spelling reform was the eventual adoption
of his invented symbol for /ŋ/ by the IPA. The fact that he experienced far more success
in helping to establish a new country that broke violently away from British rule than
he ever did at reforming English orthography gives some indication as to just how
difficult such reform is.
Early spelling books arranged words alphabetically and sometimes by number of syllables
as well, but neither arrangement necessarily organized words in terms of how difficult
they were to learn (Schlagal, 2002).
Word frequency eventually came to serve as a proxy for learning difficulty. It was reasoned that more frequent words would be seen more often
and so were easier to memorize. Resources such as Thorndike’s & Lorge’s “The
Teacher’s Book of 30,000 Words” began to offer educators lists of words that were
arranged by frequency, so that children could be taught the easier, high-frequency words
first (Thorndike & Lorge, 1944). This practice continues today, with wordlists such as
Edward Fry’s “instant words” being ordered by frequency (E. B. Fry & Kress, 2012; E.
Fry, 1980).
A major advance came with the systematic analysis of the correspondences between graphemes and phonemes. The HHRH study published in 1966 was an extension of more
than a decade’s previous research on PGC frequency. It was the largest project of its kind
funded by the U.S. Department of Education, with the final publication exceeding 17,000
pages.
This was the first time such a study utilized computers to analyze large corpora of
words. The HHRH study drew upon 17,310 words from Thorndike & Lorge’s frequency
lists, and analyzed every PGC in a variety of contexts of position and stress. For
position, they listed the probabilities of each OC occurring in the initial, medial, and final
position of syllables. For stress, they listed probabilities for primary, secondary, or
unstressed syllables.
The authors concluded that English PGCs are predictable and consistent “80 percent
of the time” when both position and stress were also considered (Hodges & Rudorf,
1965). This claim is based on the probability that any given phoneme will align to its
“main” grapheme about 80% of the time. The HHRH researchers used an “80-percent
criterion” to judge whether a phoneme was predictably represented by its orthography.
This empirically derived conclusion caused a shift from
the perspective that English spelling was a broken system in need of reform to a position
that acknowledged the overall predictability of English orthography. The principle that
English spelling is mostly predictable has guided the methodology of reading and
Many researchers since have drawn upon and refined the data from the HHRH study.
Subsequent studies updated and reworked the Hanna et al. data (E. Fry, 2004), measuring
GPC probabilities for both “American” (Berndt, Reggia, & Mitchum, 1987) and “British”
(Gontijo, Gontijo, & Shillcock, 2003) English. However, there are several reasons why
the HHRH findings warrant re-examination.
First, computer technology of the era was in its infancy. Given that a cell phone in
2012 had more computational power than all of NASA during the 1969 Apollo moon
landing (Kaku, 2012), the study might have returned different outcomes had it been
undertaken more recently. The researchers admit throughout the study that linguistic
decisions were frequently constrained by the technology of the day.
Another issue is that the corpus size of ~17,000 words is relatively small by today’s
standards, where the internet and digital technology have allowed for millions and even
billions of words to be analyzed, reducing sampling error. Many studies have discussed
how large a corpus needs to be for the reliable generalization of results, and
somewhere between 16-30 million words seems optimal (Brysbaert & New, 2009).
Also contestable is the composition of the grapheme list used by the HHRH study.
There is no authoritative list of English graphemes, so the researchers had to create their
own. This is not a straightforward process, however, and requires a certain degree of
subjective judgment. One central difficulty is deciding how to handle silent letters.
Since silent letters have no phoneme with which to correspond, they are attached to an
adjacent letter which does correspond to a phoneme, i.e. the
formation of compound graphemes. The word asthma, for example, can be parsed in more
than one way, depending on which neighboring letters absorb its silent letters. There
is no obvious reason to prefer one parsing over the other. The authors of the HHRH study
were forced to make these choices, and in doing so, they effectively established research
conventions regarding how to best classify English graphemes. In several cases, these
decisions were dictated by technological constraints of the era. Thus Berndt et al., twenty
years later, noted that even while they used the HHRH classification system of
graphemes, it is short of ideal, stating that “other divisions of printed words than those
employed here would have resulted in a different set of circumstances” (Berndt et al.,
1987, p. 5). They further suggest that the entire treatment of silent letters by HHRH
may not reflect the correspondences that readers actually use.
A growing body of cross-linguistic evidence suggests that English speakers may in fact
be handicapped by the depth of their orthography. Since the concept
of orthographic depth was introduced over 25 years ago (Katz & Frost, 1992), a large
body of research has accumulated that centers on analyzing and understanding its
effects. One of the most robust findings concerns the rate of literacy acquisition: the
deeper the orthography, the longer it takes to acquire basic
literacy (Seymour et al., 2003). Whereas the average learner of English requires three
years to attain basic reading proficiency (Chall, 1967), languages with a more transparent
orthography require less time. One striking example is that children learning a highly
transparent, phonetic spelling system can, on average, reach this same level of basic
reading proficiency in six months.
In 2003, a large experiment tested decoding skills across fourteen European
orthographies (Seymour et al., 2003). Figure 1 displays the accuracy
percentages from two tasks of that study, which tested the accuracy of word-reading for
both real (content and function words) and non-real (mono and bisyllabic) words. As the
graph shows, on reading tasks where the majority of participants scored between 90-
100% for accuracy, the English-speaking children (the final bar in each category) failed
to reach 50%.
[Figure 1 appears here: a bar chart of percentage correct (0-100) for content words, function words, monosyllabic nonwords, and bisyllabic nonwords.]
Figure 1. Accuracy data from “Foundation Literacy Acquisition in European Orthographies” (Seymour et
al., 2003). English speakers performed consistently worse than peers speaking 13 other languages.
The English-speaking children in this study were drawn from schools in Scotland,
and caution is advised in assuming all English-speaking children are a homogenous group
that can be fairly represented by this small sample of the global population. Nonetheless,
similar findings have been repeated in smaller studies comparing English to various
orthographies; for example, two separate studies of Welsh (transparent) and English
(opaque), both with rigorous cross-linguistic controls, reported similar outcomes: the
Welsh groups of 6-7 year olds were able to read twice as many words as their English
matched peers (Ellis & Hooper, 2001; Hanley, Masterson, Spencer, & Evans, 2004).
Welsh and English are well-suited for linguistic comparisons, because participants can
often be drawn from the same schools or areas, where both languages are taught
simultaneously to the same age groups; this design helps minimize the effects of
extraneous demographic and educational variables.
While not all results are as dramatic, English-speaking children do appear to score
consistently lower than their age-matched peers who have learned more transparent
orthographies. Richlan (2014) provides a succinct list of such studies that involve
learners with both typically developing (Aro & Wimmer, 2003; Bergmann & Wimmer,
2008; Cossu, Gugliotta, & Marshall, 1995; Frith, Wimmer, & Landerl, 1998; Georgiou,
Torppa, Manolitsis, Lyytinen, & Parrila, 2012; Seymour et al., 2003; Wimmer &
Goswami, 1994; Zoccolotti, De Luca, Di Filippo, Judica, & Martelli, 2009) as well as
dyslexic reading acquisition (Barca, Burani, Di Filippo, & Zoccolotti, 2006; Davies,
Cuetos, & Glez-Seijas, 2007; Landerl, Wimmer, & Frith, 1997; Landerl & Wimmer,
2000; Landerl et al., 2013; Richlan, 2014; Wimmer, 1993; Wimmer & Schurz, 2010).
Conclusion
English orthography has never been viewed as being a consistently phonetic system.
Centuries of diverse cultural influences have brought their own unique aesthetics and
stylistic preferences to the English writing system. As the spoken language continued to
evolve away from its conservative spelling system, there have been sporadic but
unsuccessful attempts at reform. Organizing words by frequency, and to a lesser extent
word length, continues to be a widely used means of teaching English spelling
effectively (Schlagal, 2002). In the 1960s, the HHRH study
analyzed approximately 17,000 high-frequency words and concluded that English OCs
are “predictable 80% of the time.” This perspective has informed spelling practices in
the decades since.
Several concerns can be raised with these findings, however. Half a century has now
passed since the HHRH study was published, and modern computing technology may yield
different results today. Meanwhile, cross-linguistic findings on orthographic depth
have once again called into question the predictability of English orthography.
Measuring the entropy of English graphemes and phonemes may provide an answer to the
question of just how predictable English orthography really is.
Methodology
Corpus
This study utilized The English Lexicon Project (ELP) database (Balota et al., 2007).
This is a free, online resource that can be found at elexicon.wustl.edu, based on the
Hyperspace Analogue to Language (HAL) corpus (Burgess & Livesay, 1998). A type
count is the number of different words in a text; the ELP’s type count is 40,481. A token
count is the number of times each word appears in the corpus. Approximately 131 million word
tokens comprise the entire corpus, gathered from Usenet Internet news groups in 1995,
though more recent estimates place the number at close to 400 million words.1
Regardless, it has been shown that reliable frequency norms are achieved for both high
and low-frequency words from corpora of 16 million words, with diminishing returns
beginning at around 30 million (Brysbaert & New, 2009), so even the lower size estimate should be more than adequate for this analysis.
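The type/token distinction described above can be illustrated with a short Python sketch (the toy text is invented; the actual corpus contains roughly 131 million tokens):

```python
from collections import Counter

# A toy "corpus" for illustrating type vs. token counts.
text = "the cat sat on the mat the end"
tokens = text.split()
counts = Counter(tokens)

token_count = len(tokens)  # every occurrence of every word counts
type_count = len(counts)   # each distinct word counts only once

print(token_count)  # 8
print(type_count)   # 6
```

Here “the” contributes three tokens but only one type, mirroring the ELP’s 40,481 types versus ~131 million tokens.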
Appendix B provides a list of the phonemes utilized by the ELP. The ELP provided a
pronunciation for each word entry, based on the Standard American English dialect. It is
important to note that the resulting OCs analyzed in this study are ultimately dependent
upon the accuracy of this input. Some transcription errors and inconsistencies in the
ELP’s pronunciation listings were observed and documented below. Finer linguistic
distinctions not recognized by the ELP are beyond the scope of this study.
1. Estimate obtained from the ELP website “Database News & Update: 10/20/14” at http://elexicon.wustl.edu/
The number and shape of English graphemes vary, depending on how silent letters
are handled. In a purely phonetic alphabet, each letter would represent a single grapheme
and correspond to a single phoneme, but English orthography contains many silent
letters. These are annexed by neighboring letters to form compound graphemes. The
process of assigning silent letters to form compound graphemes, while often arbitrary,
is unavoidable.
By contrast, the number and type of phonemes in SAE are well defined, so it seems
reasonable to first identify the phonemes of a given word, and then parse the written
word into a matching sequence of graphemes. Ideally, there is a one-to-one
alignment between grapheme and phoneme, i.e. graphemes should not outnumber
phonemes in a word, though there are some cases where this occurs (see Table 4, below).
These graphemes must align with their phonemes in the same serial order, except for one special case involving the final, silent –e.
Those learning to read English are frequently taught that the final –e “lengthens” the
preceding vowel, as in mat vs. mate. Acknowledging this rule, the authors of the HHRH
study decided to treat –e as being part of a compound grapheme that includes the
preceding vowel. This was written as V_e, where V is any vowel or vowel combination.
These special graphemes will be referred to in this study as split graphemes. Thus the word mate contains the split grapheme “a_e.”
In this study, an “e” was considered as part of a split grapheme when it was directly
preceded by one or more graphemic consonants, “gu-”, or “qu-”. The associated vowel had
to be the first vowel encountered prior to those directly preceding consonants, and it
could consist of any number of consecutive vowels, scanning leftward until a consonant,
another split grapheme, or the beginning of the word was encountered. This rule flagged
the “e” as a potential candidate for being part of a split grapheme, but the parsing
algorithm only confirmed the assignment when one of two further rules applied.
The first rule followed the premise that if the vowel is not “lengthened” by the “e”,
then the “e” does not alter the vowel, and therefore must alter the adjacent consonant.
The second rule followed the premise that –e often indicates an alternative pronunciation
of the preceding consonant. Kessler & Treiman (2001), for example, provide five cases
in which the final –e modifies the consonant and not the vowel. Furthermore, they argue
that –e can perform more than one function simultaneously.
This interpretation of –e diverges from the conventions of the HHRH study, which
considered final, silent –e as part of a split grapheme in all cases. The authors of the
HHRH study, however, acknowledged that their decision regarding –e was a pragmatic
simplification that did not necessarily conform to orthographic reality (Hanna,
1966). For the current study, a middle ground was chosen between the two extremes.
While it can be argued that –e can serve more than one function simultaneously, in this
study –e can only be assigned one role, attaching either to the consonant(s) or to the
preceding vowel.
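As a rough illustration only, the candidate-flagging step described above might be sketched as follows in Python (the pattern and function name are invented, and the sketch simplifies the study’s actual rule):

```python
import re

# Flag a final "e" as a potential split-grapheme candidate when a vowel
# group is followed by consonants (or "gu"/"qu") and then the final "e",
# as in mate -> "a_e". This is a simplified stand-in for the study's rule.
SPLIT_CANDIDATE = re.compile(r"([aeiouy]+)(qu|gu|[bcdfghjklmnpqrstvwxz]+)e$")

def flag_split_grapheme(word):
    """Return the candidate split grapheme (e.g. 'a_e'), or None."""
    m = SPLIT_CANDIDATE.search(word)
    if m:
        return m.group(1) + "_e"
    return None

print(flag_split_grapheme("mate"))   # a_e
print(flag_split_grapheme("vague"))  # a_e  (the "gu-" case)
print(flag_split_grapheme("cat"))    # None
```

A flagged candidate would still need to pass the two confirmation rules described above before being parsed as a split grapheme.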
Exclusions a priori. As many words as possible were analyzed from the ELP
database of 40,481 words, but certain entries in the ELP were initially deleted.
Removing these entries left a dataset of 39,905 unique words. Words beginning with a
capital letter, i.e. proper nouns, were included in the dataset. The remaining
alterations were performed during the post-processing stage, after the OCs had been
tallied and counted.
Allophones. Allophones are sounds that speakers of a language can use interchangeably
without affecting the meaning
of words. For example, the phoneme /ɾ/, called a “flap,” which is found in the middle of
words like butter and ladder, is in allophonic variation with /t/ and /d/ in SAE. Indeed,
MEASURING ORTHOGRAPHIC PREDICTABILITY 20
the average speaker is likely unaware that this linguistic distinction even exists. However,
the entropy values for the graphemes “t” and “d” were significantly affected when this
distinction was considered, ranking them as the second and third most ambiguous
consonantal graphemes. Since it appears English speakers do not typically struggle over
this distinction, the decision was made to combine /ɾ/ with either /t/ or /d/ as appropriate.
Another minor change was the combination of /ks/ and its voiced counterpart /gz/ into
a “single phoneme” when they correspond to the grapheme “x.” Sometimes this voicing
distinction impacts meaning, as in the words “box” and “bogs.” However, the grapheme
“x” is never employed in such cases, and this voicing distinction appears to be purely
allophonic whenever “x” is used. Again, over-inflated entropy values appeared when this
distinction was made, so the decision was made to ignore this allophonic distinction.
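The post-processing merge described above amounts to folding one phoneme’s tallies into another before probabilities are computed; a toy Python sketch with invented counts:

```python
# Hypothetical token tallies for the grapheme "t": the flap /ɾ/ counts
# are folded into /t/ before entropy is calculated, as described above.
grapheme_t_counts = {"t": 5000, "ɾ": 800}
grapheme_t_counts["t"] += grapheme_t_counts.pop("ɾ")  # merge flap into /t/

print(grapheme_t_counts)  # {'t': 5800}
```

The same kind of merge handles the /ks/–/gz/ voicing distinction for the grapheme “x.”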
Apostrophes. For ease of parsing, apostrophes were treated like other letters, resulting
in a number of apostrophe-containing graphemes: seven graphemes featuring an initial
apostrophe (‘d, ‘ll, ‘m, ‘re, ‘s, ‘t, ‘ve), two graphemes with final apostrophes (n’
and o’), one low-frequency grapheme with a medial apostrophe (a’a), and a simple
grapheme composed of the apostrophe by itself. All but the latter were then combined
with the identical graphemes that did not include an apostrophe, e.g. totals for “ ‘ve”
and “ve” were combined. The singular apostrophe itself could be parsed as a simple,
silent grapheme.
A custom program was constructed using MATLAB software in order to count the
phonemes and graphemes of the corpus. First, a mapping file was created, containing a
list of all phonemes and all possible graphemes that could potentially map to each
phoneme. The program then processed each word in a systematic fashion. It first parsed
a word into separate phonemes. The term parsing is used in this study to mean dividing
a word into its constituent units; for example, the orthographic parsing of
English is “E-ng-l-i-sh,” and the phonemic parsing is /i-ŋ-l-ɪ-ʃ/. The program then
considered which graphemes in the word could legitimately correspond to each phoneme,
according to the mapping file. If a legal mapping could not be found for all OCs in a
word, the word could not be parsed and was flagged for review.
Some words could be parsed only one way, while others had multiple possible
parsings. For the latter, the algorithm then had to choose the most optimal parsing. It
accomplished this in a step-wise fashion: it compared the first two possibilities to
see which was preferable, kept that one and discarded the other, and then compared the
next possible candidate to the incumbent parsing. It proceeded in this manner until all
possible parsings had been considered and a single winner remained.
The result is that final phonemic and orthographic parsings were ultimately derived
for each word. Appendix D provides the specific guidelines and decision-making
processes the algorithm utilized. The type and token counts for all OCs could then be
automatically counted by the program. From this frequency data, probabilities for each
OC and entropy values for all phonemes and graphemes were then calculated.
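The step-wise “incumbent” comparison described above can be sketched in Python; the preference rule here is invented for illustration (the study’s actual guidelines appear in Appendix D):

```python
# Toy preference rule: a parsing with fewer graphemes wins; ties keep
# the incumbent. The real decision rules are those of Appendix D.
def prefer(parsing_a, parsing_b):
    return parsing_b if len(parsing_b) < len(parsing_a) else parsing_a

def select_parsing(candidates):
    """Step-wise tournament: challengers are compared to the incumbent."""
    incumbent = candidates[0]
    for challenger in candidates[1:]:
        incumbent = prefer(incumbent, challenger)
    return incumbent

# Two hypothetical parsings of "edge": ["e", "d", "ge"] vs. ["e", "dge"].
print(select_parsing([["e", "d", "ge"], ["e", "dge"]]))  # ['e', 'dge']
```

Only the winner of each pairwise comparison survives, so a single parsing remains after one pass through the candidates.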
Reliability
The algorithm underwent a number of trial runs so that unanticipated errors could be
identified and corrected. This process was repeated until the algorithm returned a
complete set of legal parsings (following the guidelines detailed in
Appendix D). The algorithm was then run a final time, allowing type and token counts
for all graphemes and phonemes to be tallied by the program. Appendix D contains a
sample list of 200 randomly selected words with their corresponding parsings.
A random sample of 1,000 parsed words was then submitted to an
independent rater for reliability. The rater, a graduate student familiar with IPA
symbols, was first acquainted with the grapheme shapes delineated for this study. The
rater was then instructed to highlight any words that might contain an illegal parsing.
The rater returned a list of 22 highlighted
words. It was then determined that two of the words contained an unusual but acceptable
pronunciation in the ELP, and one contained a phonemic combination which, for ease of
parsing, was parsed correctly during a later step by the algorithm. These three
questionable parsings, along with the remaining 19, were judged to all be legitimate
parsings as outlined by the grapheme definition process of this study. With 0 errors in
a sample size of 1,000, therefore, a 98% upper confidence bound of .39% can be placed
on the rate of parsing errors in the full dataset.
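The quoted bound follows from the standard binomial argument: if the true error rate were p, the probability of observing 0 errors in 1,000 independent parsings would be (1 − p)^1000. Setting that probability to 0.02 gives the 98% upper bound. A short Python check (variable names are our own):

```python
# With 0 errors in n = 1000 parsings, the 98% upper confidence bound p
# satisfies (1 - p)**n = alpha, i.e. p = 1 - alpha**(1/n).
n = 1000
alpha = 0.02
p_upper = 1 - alpha ** (1 / n)

print(round(100 * p_upper, 2))  # 0.39  (percent)
```

This matches the .39% figure quoted above.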
Calculating Entropy
Once the algorithm returned the frequency data for all phonemes and graphemes, probabilities and entropy values could then be calculated. Throughout this study, X refers to a random variable from the set of graphemes {x₁, x₂, …, xₘ} that we define based on our parsing of the ELP, where m is the number of such graphemes. Similarly, we define Y as a random variable from the set of phonemes {y₁, y₂, …, yₙ} based on our parsing of the ELP, where n is the number of such phonemes. For the random variable X (with an analogous definition for Y), the entropy of X is defined as

H(X) = −∑ p(xᵢ) log₂ p(xᵢ), where the sum runs over i = 1, …, m,

where p(xᵢ) is the probability that X takes the value xᵢ. This entropy formula underlies all of the predictability measurements in this study.
An additional calculation will yield conditional entropy, which quantifies the amount of information needed to predict the outcome of one random variable Y, given that the value of another random variable X is known.² The conditional entropy of Y given X is defined as

H(Y|X) = ∑ p(xᵢ) H(Y|X = xᵢ) = −∑ p(xᵢ) ∑ p(yⱼ|xᵢ) log₂ p(yⱼ|xᵢ)

² Throughout this study, it is also assumed that p log₂ p = 0 when p = 0.

The first expression indicates this will be a weighted average of the conditional entropy of Y given X = xᵢ over all xᵢ, where the weighting will be the probability that X takes the value xᵢ. In other words, this will be the average (with frequency weighting) over all graphemes of the difficulty in predicting how each grapheme should be pronounced. The result will be a metric which quantifies how difficult it is to predict the pronunciation of graphemes in the corpus. Analogous definitions apply for H(X|Y = yⱼ) and H(X|Y), which quantify the difficulty of predicting how each phoneme should be spelled.
An entropy value of zero indicates complete predictability, the grapheme ever corresponding to one unique phoneme. The grapheme "dge," for example, was
found to have zero entropy, because it only corresponded to the phoneme /ʤ/ as in edge,
judge, ridge, lodge, etc. Entropy values larger than zero indicate relatively increasing
unpredictability. For example, the graphemes “se” and “ce” were found to have entropy
values of .957 and .001, respectively. Therefore, “se” is more unpredictable than “ce.”
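Per-grapheme entropy is computed directly from token counts. The counts below are hypothetical, chosen so that "dge" has a single pronunciation (zero entropy) while "se" has an entropy near the .957 reported above:

```python
import math
from collections import Counter

# Hypothetical phoneme token counts for two graphemes
counts = {
    "dge": Counter({"/dʒ/": 500}),          # one pronunciation only
    "se": Counter({"/s/": 62, "/z/": 38}),  # two competing pronunciations
}

def entropy(counter):
    # H = -sum p * log2(p) over the grapheme's pronunciation probabilities
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values() if n)

h = {g: entropy(c) for g, c in counts.items()}
```

A grapheme with a single correspondence always yields exactly zero, matching the "completely predictable" category.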
Results
Graphemes
Classification. The complete list of all 322 graphemes discovered in the corpus is given in Appendix A. The total type count for all graphemes in the corpus was 264,618, and the total token count was 1,603,490,234. Each grapheme's probability is also listed, which is its frequency divided by the total number of all graphemes in the entire corpus. Graphemes were also classified as consonants or vowels and analyzed separately. A grapheme was classed as either vowel or consonant based on whether the
phoneme it corresponded to was consonantal or vocalic: for example, the grapheme “et”
in ballet was classed as a vowel, because it corresponded to the vocalic phoneme /e/.
Some ambiguous OCs could reasonably be called either consonant or vowel, and so a judgment call was made in each such case.
Length. The average grapheme length found in the corpus was 2.4 letters. Of the 322
graphemes in the corpus, 1 had six letters ("ailles"), 2 had five letters ("cques" and "tzsch"), 27 had four letters, 96 had three letters, 169 had two letters, and 27 had one letter (this includes the 26 alphabetic letters plus the apostrophe). The three graphemes longer than
four letters were found only in French or German loanwords, but were still considered for
analysis. Entropy values for each length (from six letters to one letter) were respectively
Entropy. A useful first step in interpreting the data was to remove low-frequency
graphemes with a type count of ten or less. This helped clean the data of idiosyncratic
OCs that were found in only a few related words, such as “bt” = /t/ in debt and “olo” =
/ɚ/ in colonel. There were 146 low-frequency graphemes, 124 of which also had zero entropy.
202 of the 322 graphemes had zero entropy, meaning they corresponded consistently
to the same phoneme in all instances in the corpus, and can be considered completely
predictable. 124 of these were also considered low-frequency, having a type count of 10
or less. Removing these two groups from consideration resulted in a “refined list” of 99
graphemes that had some degree of entropy, comprised of 45 consonants and 54 vowels.
Table 1. Graphemes ranked by entropy. Classed by consonant or vowel, with low-frequency and zero-
entropy graphemes removed.
Consonants generally had less entropy than vowels, with only the most entropic
consonant, “s,” having an entropy value over 1.0. An informal observation of the ranking
shows that the group with the highest general entropy were singleton vowels. Split
graphemes tended to have relatively lower entropy, clustering towards the bottom end of
the list, indicating their pronunciations to be more stable than their counterparts without a
final, silent –e. It is worth noting that this could be a consequence of how graphemes
were defined for this study. For example, if a word contained both a “short vowel” and a
final, silent –e, the –e was combined with the preceding consonant, resulting in a non-
split grapheme. A word like love, even though it might appear to the novice reader to
contain a split grapheme, is not a split grapheme according to the criteria defined in the
methodology and therefore, the entropy for the split grapheme “o_e” is unaffected by this
pronunciation.
Cluster analysis. Using the "refined" list of graphemes from Table 1, where low-frequency and zero-entropy graphemes were removed, a hierarchical cluster analysis using complete linkage was run
separately for both consonants and vowels. This is a useful method for grouping the
graphemes together in terms of similar entropy values and characterizing them in terms
of predictability. Table 2 lists the graphemes divided into groups of predictability for both consonants and vowels. Graphemes with zero entropy form a "completely predictable" group; these are listed alphabetically. The remaining categories are arranged in order of increasing entropy. This provides a useful reference for educators who wish to know which spelling patterns are most and least reliable.
Table 2. Graphemes grouped by predictability. The results of a hierarchical cluster analysis using complete
linkage, where complete predictability indicates zero-entropy.
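Complete-linkage clustering merges, at each step, the two groups whose maximum within-pair entropy difference is smallest. A self-contained pure-Python sketch on hypothetical one-dimensional entropy values (the study's own analysis presumably used standard statistical software):

```python
def complete_linkage_clusters(values, k):
    # Agglomerative clustering: start with singletons, then repeatedly merge
    # the two clusters whose complete-linkage distance (the maximum pairwise
    # difference between their members) is smallest, until k clusters remain.
    clusters = [[v] for v in values]

    def dist(a, b):
        return max(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

# Hypothetical grapheme entropy values split into three predictability bands
groups = complete_linkage_clusters([0.01, 0.05, 0.9, 1.0, 2.5], 3)
```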
Figure 2. Vowel Cluster Analysis. Dendrogram and scatter plot of vocalic graphemes with complete
linkage, grouped into five categories of decreasing entropy.
Figure 3. Consonant Cluster Analysis. Dendrogram and scatter plot of consonantal graphemes with
complete linkage, grouped into five categories of decreasing entropy.
Total entropy. Weighted entropy values (each grapheme's entropy multiplied by the probability of that grapheme appearing in the corpus) were summed together to calculate the total orthographic entropy value of .889. This number indicates the relative predictability of English orthography (as represented by the corpus). If similar analyses are conducted for other languages, these values can then be compared to show relative predictability between languages (see "Measuring Orthographic Depth" in the next section for an example).
Phonemes
Classification. Appendix C lists all phoneme-grapheme correspondences (PGCs) found in the corpus. The probability of each phoneme is listed, calculated as the frequency (token count) of that phoneme divided by all occurrences of all phonemes in the corpus. Next to each phoneme is then listed all graphemes found to correspond to that phoneme in the corpus. Next to each grapheme is the probability for that particular PGC, which is calculated by dividing the frequency (token count) of that PGC by the total number of times that phoneme occurs in the corpus. For each phoneme, the graphemes are listed in order of decreasing probability. Therefore, the first grapheme listed is the "main correspondence," the spelling pattern most often associated with a phoneme. Finally, underneath all the probabilities for each phoneme is a number in bold, which is the entropy value for that phoneme, derived using the probabilities listed.
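The layout of Appendix C can be reproduced from raw token counts. The counts below are hypothetical, scaled to echo the 0.933 probability reported in the Discussion for the /t/ → "t" correspondence:

```python
import math
from collections import Counter

# Hypothetical token counts of graphemes spelling the phoneme /t/
pgc_counts = Counter({"t": 933, "tt": 40, "ed": 27})

def pgc_summary(counts):
    # Probability of each PGC: its token count divided by the phoneme's total.
    total = sum(counts.values())
    probs = {g: c / total for g, c in counts.items()}
    main = max(probs, key=probs.get)  # the "main correspondence"
    # Phoneme entropy, derived from the PGC probabilities
    h = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return probs, main, h

probs, main, h = pgc_summary(pgc_counts)
```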
Entropy. Table 3 lists the phonemes of SAE, ranked by entropy, including the
frequency (token count) data, probability, and entropy of each phoneme in the corpus.
Multiple phonemes. Not included in Table 3 above were instances in the corpus
where multiple phonemes corresponded to single graphemes. These are listed in Table 4.
They are all relatively low-frequency, with type counts less than 1000:
Table 4. Multiple phonemes corresponding to single letters. Instances where a singleton grapheme
corresponded to more than one phoneme, which were removed from final analysis.
Entropy values weighted by probability (each phoneme's entropy multiplied by its probability of occurrence in the corpus) were summed together to calculate the total phonemic entropy value of 1.017. This number indicates the relative predictability involved in spelling spoken English. Comparing this number to the total orthographic entropy of .889 appears to indicate that encoding, overall, involves a higher degree of uncertainty than decoding. Another way to say this is that reading print is a more predictable process than spelling, which seems intuitive.
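Both system-level totals are probability-weighted sums of per-unit entropies. A minimal sketch with made-up values (the study's actual totals of .889 and 1.017 come from the full grapheme and phoneme inventories):

```python
def total_entropy(units):
    # units maps each grapheme or phoneme to a pair:
    # (probability of occurrence in the corpus, entropy of that unit)
    return sum(p * h for p, h in units.values())

# Hypothetical grapheme inventory: (probability, entropy) per grapheme
graphemes = {"t": (0.08, 0.35), "o": (0.05, 2.682), "dge": (0.001, 0.0)}
orthographic_total = total_entropy(graphemes)
```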
Discussion
The HHRH study is the most comprehensive source of evidence for the contemporary claim that English orthography is predictable "an average of approximately 80 percent" (Hanna et al., 1966, p. 33). Using their 52-phoneme classification system, the HHRH researchers found that phonemes corresponded to their main grapheme 73.13% of the time, which would have been considered near but still short of "predictable." When the researchers then considered the additional linguistic factors of syllable position and stress, predictability rose above the 80% criterion.
This current study did not classify the data in terms of syllable position and stress, but
probabilities for all “main” PGCs were determined in the process of calculating
probabilities for all OCs. This data can be found in Appendix C. For each phoneme, the
first grapheme listed is the “main spelling” with the highest probability. So, for example,
the phoneme /t/’s main spelling is “t,” and /t/ was found in the corpus to be written as “t”
93% of the time (a probability of 0.933). Taking an average for all of these main spellings
resulted in the overall probability of 0.7326 that any given phoneme would correspond to
its main grapheme. In other words, English phonemes in the corpus are written as their main spellings approximately 73% of the time. This number falls within roughly a tenth of a percentage point of the number derived by the
HHRH authors. Such close agreement reinforces the findings of both the HHRH study
and the current study, at least as far as the highest probability PGCs are concerned. It also
indicates that increasing the corpus size from ~17,000 highest frequency words to ~131
The above calculation was then performed for GPC probabilities, analyzing the probability that each grapheme corresponds to its "main pronunciation." Summing the weighted probabilities across all graphemes yielded an overall value of 74.49%. In the terminology of the HHRH study, then, English appears predictable roughly three-fourths of the time. This is about 5 percentage points below the 80-percent criterion used as the benchmark for predictability, but factoring in additional linguistic parameters like stress and position would most likely lead to an increase in predictability, as evidenced by the HHRH study.
We argue that considering only the "main correspondence," however, does not adequately describe predictability. Entropy, by contrast, takes into account the probability of every, not just the main, correspondence. Consider the phonemes X and Y, where both have their main spelling correspondence 80% of the time, but phoneme X has only one alternate spelling for the remaining 20% of the time, while Y has 4 alternate spellings splitting that remainder. Y is clearly harder to predict, and entropy captures this difference where a main-correspondence count does not. This is how effectively entropy captures the notion of orthographic depth (Schmalz et al., 2015).
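The hypothetical phonemes X and Y make the point concrete: with the remaining 20% split one way versus four ways, entropy separates two phonemes that a main-correspondence count treats as identical (the 0.8/0.2 and 0.8 plus four-times-0.05 distributions are the illustrative values from the text):

```python
import math

def H(probs):
    # Shannon entropy over a probability distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_x = H([0.8, 0.2])                     # one alternate spelling
h_y = H([0.8, 0.05, 0.05, 0.05, 0.05])  # four alternate spellings
# h_y exceeds h_x even though both main correspondences sit at 80%
```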
Orthographic depth has been shown to relate to rates of foundation literacy acquisition (Seymour et al., 2003), naming accuracy and latency (Ellis et al., 2004), and word-initial sound-spelling consistency effects (Borgwaldt et al., 2005). Since phonemes and graphemes are the universal building blocks of alphabetic languages, entropy values computed over them can be compared across writing systems. One prior study calculated phoneme-grapheme entropy values for modern Greek (Protopapas & Vlahou, 2009). Despite having separate phonologies and alphabets, English and Greek could be
compared in terms of entropy. Greek was reported to have lower total orthographic and phonemic entropy values than the comparative values of .889 and 1.017 found in this current study for Standard American English. Standard American English, therefore, is more unpredictable for both decoding and encoding than Greek. The same comparisons could be carried out for consonants, vowels, initial letters, initial phonemes, and even morphological endings, provided comparably defined data exist.
The problem is that the process of defining graphemes ultimately affects their entropy values. The choice as to whether a final, silent –e attaches to the preceding vowel or consonant, for example, determines which OCs (and which letter combinations) are to be analyzed, and this decision will impact how predictable the
overall orthography is judged to be. The first step towards having a valid cross-linguistic
entropy measure, therefore, is for general agreement to be reached within each language
community as to how their own graphemes are to be defined. English speakers face a more difficult challenge in this respect than speakers of many other European languages with more transparent orthographies. This current study offers one definitive list of graphemes, but this list was constructed through a series of decisions regarding the shape of graphemes that other researchers might reasonably have made differently. Until a consensus within the research community is reached that answers the question, "What are the graphemes of English?", entropy may not offer a truly absolute scale for cross-linguistic comparisons.
Word-Level Entropy
One potential application for the data involves formulating word entropies. Entropy
values for graphemes or phonemes of a word could be summed to calculate the entropy
value of that word, with the caveat that much more research is needed to determine how
precisely this metric would quantify a word’s difficulty in decoding or encoding. Such a
scheme ignores additional linguistic parameters, such as the impact of stress and meaning
or the effects of rime stability, for example, but it may still provide a convenient tool for gauging the relative difficulty of whole words.
A word’s orthographic entropy could simply be considered the sum of the entropy
values of its graphemes. For example, the orthographic entropy for the word sock,
written as “sock,” would equal “s” + “o” + “ck” = 1.027 + 2.682 + 0 = 3.709. The longer
the word, generally the greater the entropy (though this depends on the entropy of each
grapheme), which can account for how word complexity increases with length.
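The sock computation above is a simple lookup-and-sum, assuming a table of the grapheme entropy values reported in this study:

```python
# Entropy values for the graphemes of "sock", as reported in this study
GRAPHEME_ENTROPY = {"s": 1.027, "o": 2.682, "ck": 0.0}

def word_orthographic_entropy(graphemes, table=GRAPHEME_ENTROPY):
    # A word's orthographic entropy is the sum of its graphemes' entropies.
    return sum(table[g] for g in graphemes)

e = word_orthographic_entropy(["s", "o", "ck"])  # 3.709
```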
A word’s phonemic entropy, conversely, would be the sum of entropy values of its
phonemes. The word sock, spoken as /sɑk/, would have a phonemic entropy of /s/ + /ɑ/ +
/k/ = 1.303 + 1.097 + 1.593 = 3.993. Both orthographic and phonemic entropy values
describe the degree of predictability between the spoken and written forms of the word
sock. Due to the intricate relationship between encoding and decoding, it remains unclear
which calculation provides a better descriptor. Indeed, both numbers may be required to
accurately represent the word’s predictability. It is also possible to sum orthographic and
phonemic entropy values of a word into a single "total entropy" value, yet doing so may obscure whether decoding or encoding drives a word's difficulty.

Traditional method. Using entropy as a measurement for how difficult words are to learn can lead to the creation of grade-leveled wordlists and basal spellers which are ordered by predictability, a potential resource for educators who are unsure of how to choose appropriate words to be taught for spelling instruction. Entropy-based word selection might help to remedy the fact that basal spellers do not generally cover the depth of orthography that English has (Foorman & Petscher, 2010).
Spelling instruction practices vary widely across the United States, and the reality of the classroom is that often a single common list is used
for all children in a classroom, even if the instructor believes this is not the most effective
method for every child. A survey (Fresch, 2003) that compared teacher beliefs and practices in a nationwide sample of elementary schools found that while only 45% of
interviewed teachers agreed that one common spelling list for a whole classroom was the
most effective method for teaching spelling, 72% put this into practice. Of those that
agreed, 88% put this into practice. This was the highest level of practice and theoretical
agreement among all 355 participants. Notably, many survey respondents expressed frustration with the unpredictability of English orthography, citing "words that do not 'follow the rules'" and the "many exceptions" in English spelling. A developmental approach, by contrast, emphasizes the individual learning needs of a student as indicated by their current stage
of development, which describes the common process of how children typically acquire
orthographic knowledge (Ehri, 2005; Henderson & Beers, 1980). Ehri describes literacy
acquisition in four stages, which must occur in sequence, though they do not necessarily
correspond with age. An older child with a language impairment, for example, might not
have advanced at the same rate as his peers and would not benefit from studying the same wordlist as his typically developing classmates.
Entropy-based word selection might have the greatest benefit for learners in the
second stage of literacy acquisition, termed “partial alphabetic.” At this stage, according
to Ehri, a child is working to learn sight words by cementing OCs in memory. The child
has a rudimentary but incomplete knowledge of OCs. They will often spell words
incorrectly but phonetically, based on the rules they have learned to that point. Exposing the child to words which begin predictably but successively increase in uncertainty could provide an appropriate scaffolding technique, particularly for readers who are struggling with the inconsistencies of English orthography.
In the third stage of development, the “full alphabetic stage,” English learners have
acquired the “major” OCs and the ability to segment words phonologically, based on the
graphemes they read. Here again, knowledge of the entropy of specific graphemes could
aid instruction. The student can be exposed to more challenging, entropic spellings as
their knowledge grows. Students experiencing difficulty could receive words with
decreased entropy compared to the rest of the class; outliers doing particularly well could receive words with increased entropy. Entropy need not be the only factor to play a role in word selection for spelling instruction, which traditionally relies on word frequency and developmental stage. Indeed, traditional intuitions about the reading/spelling difficulty of words may even run contrary to entropy analyses. It would
seem awkward, for example, to teach graphemes simply on the basis of entropy, as some
of the most entropic graphemes are in fact single letters, e.g., a, o, u, s, and f. Common
practice dictates that a student’s literacy journey begins with memorizing the alphabet,
where letters are ranked with no regard for predictability, e.g. “r” and “s,” the most and
least predictable consonants in Table 1, are alphabetic neighbors and learned together.
Along the same lines, spelling instruction customarily introduces “simpler” (i.e.
shorter) spelling patterns first. When considering vowel phonemes, spelling patterns for
“short” vowels are usually taught before “long” vowels because they are more frequent
and feature shorter spelling patterns. Yet Table 3 indicates the majority of these “short”
(technically lax) vowels feature higher entropy than their “long” (technically tense)
counterparts (e.g., /ɪ/ vs. /i/, /ʌ/ vs. /u/, etc.). According to their entropy values, long/tense
vowels are more predictably spelled. Therefore, students should theoretically be able to learn these correspondences more easily than the more uncertain short vowels.
Limitations
The validity of this analysis depends first on the accuracy of the transcription data of the ELP. Observed errors are documented in
Appendix D. The functionality of this analysis also relies on the ability of the results to
be generalized from the corpus to SAE as a whole. The HAL database, on which the ELP
is based, is both sufficiently sized and suitably descriptive of oral language, having been
described as “conversational and noisy, much like spoken language” (Burgess & Livesay,
1998), but it should be noted this is solely an adult-based lexicon. If the results of this study are to be applied to children in the process of literacy acquisition, an analysis drawn from children's literature may be more suitable. The current results, on the other hand, may more easily generalize to adult readers. A further limitation concerns how larger linguistic units influence the complex cognitive processes underlying reading
and writing.
For example, measuring the entropy of individual graphemes assumes that the grapheme is the basic unit of written language processing, but studies have shown evidence to the contrary. The rime, a larger linguistic unit consisting of a vowel and any consonants following it in a syllable, has been shown to be a more stable unit in English than the grapheme, at least for monosyllabic words (Treiman et al., 1995).
For example, the grapheme “o” may be pronounced numerous ways, but only one
pronunciation is ever found when combined with the grapheme “ck” to produce the
orthographic rime “-ock,” found in words like sock, lock, dock, etc. This rime
corresponds highly consistently to the phonemic rime /ɑk/. The "ck" grapheme, in effect, signals the reader that the preceding "o" is to correspond to a specific phoneme, /ɑ/, regardless of how unpredictable "o" is elsewhere. Even so, Kessler and Treiman (2001), who have written
extensively on the subject of the rime-analysis of English, concluded that despite a higher
degree of consistency at the rime level, English rimes “are not processed as individual
units. Rather, the basic processing seems to occur at a phonemic-graphemic level that
takes into account the context in which each phoneme-grapheme is found” (Protopapas &
Vlahou, 2009, p. 992, referencing Treiman et al., 1995). In other words, rime-level consistency does not necessarily mean that readers process rimes as units. While the entropy model cannot perfectly imitate the complex process of reading, a number of decisions in processing the data were made in order to more closely
approximate natural reading, as described in the methodology section. It should be noted that such decisions do not have a specific evidentiary basis, and this analysis would be strengthened by data showing that students actually process the graphemes defined by this study. Until such data is collected, defining the shape of graphemes remains partly a matter of convention: it is unclear, for example, whether the final –e in words like love and edge is processed as part of a split grapheme or as attached to the preceding consonant. Moving forward, research which answers these questions may help solidify a general consensus.
Future Research
Research that aims to discover how closely orthographic entropy tracks the actual difficulty of literacy acquisition would be especially useful, and so future experimental research could determine what correlation might
exist between the mathematical predictability of a word’s composition and how difficult
it is to read or spell. The ELP provides experimental behavioral data which could
possibly be used in such future studies. Naming and lexical decision data can be accessed
for the 40,481 words listed in the restricted ELP database. If indeed a word’s entropy can
be considered the sum of the entropies of its individual graphemes (as discussed above), a
conceivable application of this data would be to calculate entropy values for the words of
the ELP database and then analyze the results for any correlations between entropy
values and the naming and lexical decision data provided. This might provide some
insight into the relationship between a word's entropy and how long it takes to decode. If such a correlation were found, it would suggest that students may learn new vocabulary more quickly and easily if exposed to words of gradually increasing entropy.
Just as the HHRH study analyzed the additional linguistic parameters of stress and
syllable position, a future analysis might include these factors when considering the entropy of OCs. The ELP contains information regarding both stress and syllable position, and therefore it would be feasible to construct a more elaborate software program that would account for these parameters using the same corpus. The results of such an undertaking would most likely provide a more accurate picture of orthographic predictability.
It would also be interesting to see how the structure of current reading programs compares with an entropy-based ordering of OCs, particularly programs such as Orton-Gillingham or the Association Method, which have unique methods for teaching sound-spelling correspondences. Would a reading program designed around entropy show improved outcomes compared to these established programs? One further avenue of future research would involve database analyses similar to this current
study for other languages. The entropy values from such studies could then be compared to one another, and the relative complexity of various languages assessed on a common scale.
Theoretically, the type of analysis conducted in this study could be extended not just
to other languages, but to various dialects within English as well. Now that we have a measure of how predictably SAE corresponds to its orthography, one can ask: is it
more or less entropic than other dialects of English? For example, to what degree does
SAE diverge from its orthography compared to African-American English (AAE)? Does
this degree of distance between a spoken dialect and the orthography, as expressed by
entropy, correlate at all with literacy outcomes, i.e. does having an accent further
removed from standard written English result in more difficulty in learning to read?
Unfortunately, the ELP provides only pronunciation for Standard American English. This
means that before any such cross-dialectal comparisons can be made, researchers would
need access to a sufficiently sized corpus with a pronunciation guide representing other dialects, ideally one that also codes the features of stress and syllabic position not addressed in this study. Certainly if these parameters were included, the resulting analysis would be richer; as it stands, grapheme-level entropy is a conservative measurement of the reading experience, and one limitation of this study is that the interplay between graphemes is not adequately captured in this current analysis.
Examining stress and position would provide a picture of orthographic predictability that
is most likely closer to the reading experience than simply analyzing individual
graphemes. Graphemes do not exist isolated in space, after all, and prior studies such as
HHRH have shown how predictability is increased when stress and position are
considered.
Conclusion
After formulating a definitive list of English graphemes, the current study calculated
probabilities of correspondence and entropy values for all phonemes and graphemes in
Standard American English as well as a total system entropy for both orthographic and
phonemic predictability. Phonemes and graphemes were then ranked in terms of their
predictability, a scheme that provides a convenient resource for educators and researchers
who wish to identify the most and least predictable sounds and spelling patterns of the
language. The study confirmed the prior findings of the HHRH study, which asserted that
"English is 80% predictable," but suggested this is a simplistic measurement which may obscure important differences in predictability among individual correspondences.
Entropy values for phonemes and graphemes may provide a foundation for further
research into the predictability of English orthography. Like word frequency, spelling
predictability may be a useful measurement for how difficult words are to learn, offering a new basis for selecting words for spelling instruction. Children in various stages of literacy development could also benefit from entropy-guided word selection, though only future studies can address how the entropy of graphemes and phonemes impacts the cognitive processes underlying reading and spelling.
The complex nature of these psycholinguistic processes may never be fully distillable into a single value, and further research is needed to determine which formulation of entropy best describes the predictability of words. Also, before entropy can be used for
comparisons across languages and/or dialects, a consensus must first be reached on the
composition and structure of graphemes, particularly for English. Once these technical
issues have been addressed, entropy might provide a useful tool for future cross-linguistic
research. Its mathematical foundation highlights the fact that, ultimately, languages and their orthographies are quantifiable, comparable systems.
References
Aro, M., & Wimmer, H. (2003). Learning to read: English in comparison to six more
regular orthographies. Applied Psycholinguistics, 24(04), 621-635.
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., . . .
Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3),
445-459.
Barca, L., Burani, C., Di Filippo, G., & Zoccolotti, P. (2006). Italian developmental
dyslexic and proficient readers: Where are the differences? Brain and Language,
98(3), 347-351.
Berndt, R. S., Reggia, J. A., & Mitchum, C. C. (1987). Empirically derived probabilities
for grapheme-to-phoneme correspondences in English. Behavior Research Methods,
Instruments, & Computers, 19(1), 1-9.
Borgwaldt, S. R., Hellwig, F. M., De Groot, A., & Licht, R. (2006). Word-initial sound-
spelling patterns: Cross-linguistic analyses and empirical validations of phoneme-
letter feedback consistency effects. UofA Working Papers in Linguistics, 1
Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. (2004). Word-initial entropy in five
languages: Letter to sound, and sound to letter. Written Language & Literacy, 7(2),
165-184.
Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. M. (2005). Onset entropy matters–
Letter-to-phoneme mappings in seven languages. Reading and Writing, 18(3), 211-
229.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research
Methods, 41(4), 977-990.
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in
a basic word recognition task: Moving on from Kučera and Francis. Behavior
Research Methods, Instruments, & Computers, 30(2), 272-277.
Cossu, G., Gugliotta, M., & Marshall, J. C. (1995). Acquisition of reading and written
spelling in a transparent orthography: Two non parallel processes? Reading and
Writing, 7(1), 9-22.
Crystal, D. (2012). Spell it out: The singular story of English spelling. Profile Books.
Davies, R., Cuetos, F., & Glez-Seijas, R. M. (2007). Reading development and dyslexia
in a transparent orthography: A survey of Spanish children. Annals of Dyslexia,
57(2), 179-198.
Delattre, M., Bonin, P., & Barry, C. (2006). Written spelling to dictation: Sound-to-
spelling regularity affects both writing latencies and durations. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 32(6), 1330.
Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. Scientific
Studies of Reading, 9(2), 167-188.
Ellis, N. C., & Hooper, A. M. (2001). Why learning to read is easier in Welsh than in English: Orthographic transparency effects evinced with frequency-matched tests.
Applied Psycholinguistics, 22(04), 571-599.
Ellis, N. C., Natsume, M., Stavropoulou, K., Hoxhallari, L., van Daal, V. H., Polyzoe, N., .
. . Petalas, M. (2004). The effects of orthographic depth on learning to read
alphabetic, syllabic, and logographic scripts. Reading Research Quarterly, 39(4),
438-468.
Foorman, B. R., & Petscher, Y. (2010). Development of spelling and differential relations
to text reading in grades 3-12. Assessment for Effective Intervention, 36(1), 7-20.
Frith, U., Wimmer, H., & Landerl, K. (1998). Differences in phonological recoding in
German- and English-speaking children. Scientific Studies of Reading, 2(1), 31-54.
Fry, E. (1980). The new instant word list. The Reading Teacher, 34(3), 284-289.
Fry, E. B., & Kress, J. E. (2012). The reading teacher's book of lists. John Wiley & Sons.
Georgiou, G. K., Torppa, M., Manolitsis, G., Lyytinen, H., & Parrila, R. (2012).
Longitudinal predictors of reading and spelling across languages varying in
orthographic consistency. Reading and Writing, 25(2), 321-346.
Hanley, R., Masterson, J., Spencer, L., & Evans, D. (2004). How long do the advantages
of learning to read a transparent orthography last? An investigation of the reading skills and reading impairment of Welsh children at 10 years of age. The Quarterly
Journal of Experimental Psychology: Section A, 57(8), 1393-1410.
Hodges, R. E., & Rudorf, E. H. (1965). Searching linguistics for cues for the teaching of
spelling. Elementary English, 42(5), 527-533.
Kaku, M. (2012). Physics of the future: How science will shape human destiny and our daily lives by the year 2100. Anchor.
Katz, L., & Frost, R. (1992). The reading process is different for different orthographies: The orthographic depth hypothesis. Advances in Psychology, 94, 67-
84. doi:http://dx.doi.org/10.1016/S0166-4115(08)62789-2
Kessler, B., & Treiman, R. (2001). Relationships between sounds and letters in English
monosyllables. Journal of Memory and Language, 44(4), 592-617.
Landerl, K., & Wimmer, H. (2000). Deficits in phoneme segmentation are not the core problem of dyslexia: Evidence from German and English children. Applied
Psycholinguistics, 21(02), 243-262.
Landerl, K., Wimmer, H., & Frith, U. (1997). The impact of orthographic consistency on
dyslexia: A German-English comparison. Cognition, 63(3), 315-334.
Landerl, K., Ramus, F., Moll, K., Lyytinen, H., Leppänen, P. H. T., Lohvansuu, K., . . .
Schulte-Körne, G. (2013). Predictors of developmental dyslexia in European orthographies with varying complexity.
Moats, L. C. (2005). How spelling supports reading. American Educator, 29(4), 12-22,
42-43.
National Reading Panel (US), & National Institute of Child Health and Human
Development (US). (2000). Report of the National Reading Panel: Teaching children
to read: An evidence-based assessment of the scientific research literature on
reading and its implications for reading instruction: Reports of the subgroups.
National Institute of Child Health and Human Development, National Institutes of
Health.
Schmalz, X., Marinus, E., Coltheart, M., & Castles, A. (2015). Getting to the bottom of
orthographic depth. Psychonomic Bulletin & Review, 22(6), 1614-1629.
doi:10.3758/s13423-015-0835-2
Seymour, P. H. K., Aro, M., & Erskine, J. M., in collaboration with COST Action A8
network. (2003). Foundation literacy acquisition in European orthographies. British
Journal of Psychology, 94(2), 143-174. doi:10.1348/000712603321661859
Sharp, A. C., Sinatra, G. M., & Reynolds, R. E. (2008). The development of children's
orthographic knowledge: A microgenetic perspective. Reading Research Quarterly,
43(3), 206-226.
Thorndike, E. L., & Lorge, I. (1944). The teacher's word book of 30,000 words. New
York: Columbia University, Teachers College.
Treiman, R., Mullennix, J., Bijeljac-Babic, R., & Richmond-Welty, E. D. (1995). The
special role of rimes in the description, use, and acquisition of English orthography.
Journal of Experimental Psychology: General, 124(2), 107.
Webster, N., & Franklin, B. (1789). Dissertations on the English language: With notes,
historical and critical, to which is added, by way of appendix, an essay on a
reformed mode of spelling, with Dr. Franklin's arguments on that subject. By Noah
Webster, Jun. Esquire. [Two lines in Latin from Tacitus]. For the author, by Isaiah
Thomas and Company.
Wimmer, H., & Schurz, M. (2010). Dyslexia in regular orthographies: Manifestation and
causation. Dyslexia, 16(4), 283-299.
Zoccolotti, P., De Luca, M., Di Filippo, G., Judica, A., & Martelli, M. (2009). Reading
development in an orthographically regular language: Effects of length, frequency,
lexicality and global processing ability. Reading and Writing, 22(9), 1053-1079.
Zoccolotti, P., De Luca, M., Di Pace, E., Gasperini, F., Judica, A., & Spinelli, D. (2005).
Word length effect in early reading and in developmental dyslexia. Brain and
Language, 93(3), 369-373.
Appendix A Continued
No. Grapheme Type Count Token Count Probability Entropy
37 bt 18 87,665 0.01% 0.000
38 c 9,418 40,017,797 2.49% 0.580
39 cc 150 434,614 0.03% 0.000
40 cch 5 839 <0.01% 0.000
41 ce 844 5,659,217 0.35% 0.001
42 ces 2 2,361 <0.01% 0.000
43 ch 1,255 7,122,450 0.44% 0.933
44 che 9 36,800 <0.01% 0.883
45 chsi 1 28 <0.01% 0.000
46 cht 6 1,338 <0.01% 0.000
47 ci 203 804,733 0.05% 0.388
48 ck 923 2,919,639 0.18% 0.000
49 cq 20 25,274 <0.01% 0.000
50 cqu 4 1,900 <0.01% 0.000
51 cques 1 4,567 <0.01% 0.000
52 cs 1 3,578 <0.01% 0.000
53 ct 8 8,422 <0.01% 0.000
54 cu 3 6,548 <0.01% 0.000
55 cz 4 4,878 <0.01% 0.450
56 d 10,566 60,894,576 3.79% 0.187
57 dd 199 968,904 0.06% 0.000
58 de 18 20,029 <0.01% 0.000
59 dg 53 96,498 0.01% 0.000
60 dge 76 268,631 0.02% 0.000
61 dh 1 1,485 <0.01% 0.000
62 di 7 21,233 <0.01% 0.000
63 dj 35 33,852 <0.01% 0.000
64 dn 3 14,826 <0.01% 0.000
65 dth 2 153 <0.01% 0.000
66 e 15,410 103,894,768 6.47% 1.845
67 e_e 137 1,804,972 0.11% 0.014
68 ea 1,309 7,545,614 0.47% 1.705
69 ea_e 60 978,406 0.06% 0.000
70 ear 70 773,428 0.05% 0.000
71 eau 22 65,434 <0.01% 1.032
72 eaux 1 518 <0.01% 0.000
73 ed 1,563 4,059,494 0.25% 0.909
74 ee 790 5,913,902 0.37% 0.168
75 ee_e 20 33,338 <0.01% 0.000
76 eh 2 13,918 <0.01% 0.096
77 ei 97 1,215,778 0.08% 1.278
78 ei_e 4 7,924 <0.01% 0.053
79 eigh 56 126,170 0.01% 0.644
80 eii 1 127 <0.01% 0.000
81 el 133 585,208 0.04% 0.000
82 ell 20 42,750 <0.01% 0.000
83 em 1 291 <0.01% 0.000
84 en 356 1,297,055 0.08% 0.000
85 eo 19 807,520 0.05% 0.093
86 eou 3 8,587 <0.01% 0.000
87 er 5,258 23,048,328 1.44% 1.042
88 ere 3 684,551 0.04% 0.003
89 err 57 122,344 0.01% 0.962
90 erwr 6 1,656 <0.01% 0.000
91 es 413 2,147,378 0.13% 0.004
92 et 27 9,834 <0.01% 0.000
93 eu 86 135,096 0.01% 1.632
94 eur 18 21,762 <0.01% 0.845
95 ew 165 1,606,405 0.10% 0.727
96 ewe 1 418 <0.01% 0.000
97 ey 177 2,517,956 0.16% 1.020
98 eye 32 107,802 0.01% 0.000
99 ez 1 771 <0.01% 0.000
100 f 3,623 37,433,336 2.33% 0.874
101 fe 1 63 <0.01% 0.000
102 ff 391 2,045,640 0.13% 0.000
103 ffe 1 379 <0.01% 0.000
104 ft 8 120,561 0.01% 0.000
105 g 3,704 16,164,105 1.01% 0.732
106 ge 407 2,372,177 0.15% 0.089
107 gg 225 220,517 0.01% 0.132
108 gh 63 317,754 0.02% 0.591
109 gi 42 185,015 0.01% 0.000
110 gm 3 619 <0.01% 0.000
111 gn 96 430,057 0.03% 0.007
112 gu 105 572,154 0.04% 0.000
113 gue 19 22,728 <0.01% 0.000
114 h 1,960 16,493,014 1.03% 0.000
115 ha 7 12,312 <0.01% 0.150
116 he 4 2,428 <0.01% 1.000
117 hei 4 2,686 <0.01% 0.000
118 her 4 8,713 <0.01% 0.000
119 hi 14 35,987 <0.01% 0.631
120 ho 16 61,294 <0.01% 0.000
121 hoe 1 124 <0.01% 0.000
122 hoi 1 80 <0.01% 0.000
123 hou 5 125,350 0.01% 0.000
124 hu 2 1,236 <0.01% 0.000
125 i 20,260 113,079,072 7.04% 1.209
126 i_e 1,520 9,587,698 0.60% 0.356
127 ia 57 314,693 0.02% 1.083
128 iar 12 36,091 <0.01% 0.000
129 ie 400 1,265,508 0.08% 1.697
130 ie_e 26 288,504 0.02% 0.000
131 ier 2 186 <0.01% 0.000
132 ieu 6 8,940 <0.01% 0.310
133 iew 25 263,882 0.02% 0.000
134 ig 2 5,467 <0.01% 0.000
135 igh 241 1,871,680 0.12% 0.000
136 ign 1 2,072 <0.01% 0.000
137 il 75 147,455 0.01% 0.000
138 ile 25 40,227 <0.01% 0.000
139 ill 10 2,201 <0.01% 0.567
140 in 98 350,753 0.02% 1.000
141 ing 9 21,079 <0.01% 0.000
142 io 53 311,539 0.02% 0.000
143 ior 11 79,637 <0.01% 0.000
144 iou 2 618 <0.01% 0.000
145 ioux 1 1,316 <0.01% 0.000
146 ir 250 1,138,712 0.07% 0.261
147 irr 7 5,253 <0.01% 0.000
148 is 11 32,953 <0.01% 0.670
149 it 2 1,175 <0.01% 0.000
150 iu 3 8,351 <0.01% 0.000
151 j 574 3,701,888 0.23% 0.071
152 ju 5 17,056 <0.01% 1.065
153 k 1,740 10,866,474 0.68% 0.000
154 ke 1 11,934 <0.01% 0.000
155 kg 1 230 <0.01% 0.000
156 kh 5 8,364 <0.01% 0.546
157 kk 2 56 <0.01% 0.000
158 kn 73 1,235,975 0.08% 0.000
159 l 10,943 46,569,549 2.90% 0.042
160 ld 12 2,606,500 0.16% 0.000
161 le 1,276 5,375,598 0.33% 0.036
162 lf 11 85,946 0.01% 0.000
163 lh 4 5,271 <0.01% 0.000
164 lk 61 367,405 0.02% 0.000
165 ll 1,278 10,953,011 0.68% 0.047
166 lle 14 16,323 <0.01% 0.000
167 lm 24 32,248 <0.01% 0.000
168 ln 2 8,735 <0.01% 0.000
169 lv 3 2,154 <0.01% 0.000
170 lve 2 244 <0.01% 0.000
171 m 7,912 44,200,165 2.75% 0.054
172 mb 67 99,871 0.01% 0.000
173 me 34 2,456,437 0.15% 0.000
174 mm 366 1,396,853 0.09% 0.000
175 mme 2 2,908 <0.01% 0.000
176 mmes 1 483 <0.01% 0.000
177 mn 17 79,355 <0.01% 0.055
178 mp 1 222 <0.01% 0.000
179 n 15,884 102,493,345 6.38% 0.322
180 nd 13 10,643 <0.01% 0.596
181 ne 100 3,060,629 0.19% 0.059
182 ng 3,343 14,242,619 0.89% 0.000
183 ngue 6 17,467 <0.01% 0.000
184 nh 1 654 <0.01% 0.000
185 nm 11 297,371 0.02% 0.000
186 nn 328 1,281,013 0.08% 0.000
187 nne 6 16,230 <0.01% 0.000
188 nt 2 170 <0.01% 0.000
189 o 10,917 101,159,136 6.30% 2.682
190 o_e 669 4,582,654 0.29% 0.481
191 oa 373 737,843 0.05% 0.897
192 oa_e 2 533 <0.01% 0.000
193 oar 2 706 <0.01% 0.000
194 oe 56 122,058 0.01% 1.206
195 oh 8 338,392 0.02% 0.938
196 oi 246 976,408 0.06% 0.040
197 oi_e 9 23,095 <0.01% 0.000
198 ois 4 17,436 <0.01% 0.310
199 ol 35 79,933 <0.01% 0.894
200 olo 3 3,628 <0.01% 0.000
201 om 15 4,872 <0.01% 0.000
202 ome 28 19,566 <0.01% 0.000
203 on 2,097 8,818,144 0.55% 0.230
204 onn 2 19,009 <0.01% 0.000
205 oo 858 3,895,895 0.24% 1.289
206 oo_e 28 94,585 0.01% 0.000
207 ooh 1 1,428 <0.01% 0.000
208 oor 2 1,998 <0.01% 0.000
209 or 1,052 5,535,041 0.34% 0.964
210 orr 38 211,821 0.01% 0.736
211 os 2 446 <0.01% 0.000
212 ot 4 2,867 <0.01% 0.000
213 ou 1,400 18,438,447 1.15% 2.177
214 ou_e 111 216,372 0.01% 0.505
215 ough 50 1,008,396 0.06% 1.601
216 oui 2 134 <0.01% 0.860
217 oup 2 2,482 <0.01% 0.000
218 our 55 129,262 0.01% 0.315
219 ous 1 771 <0.01% 0.000
220 ow 656 6,051,793 0.38% 1.105
221 owe 3 8,188 <0.01% 0.000
222 oy 131 510,271 0.03% 0.038
223 oy_e 1 2,933 <0.01% 0.000
224 p 7,765 35,274,646 2.20% 0.000
225 pb 3 1,380 <0.01% 0.000
226 pe 3 43,189 <0.01% 0.000
227 ph 530 1,066,337 0.07% 0.220
228 pn 2 1,163 <0.01% 0.000
229 pp 496 2,402,664 0.15% 0.000
230 ppe 2 315 <0.01% 0.000
231 pph 1 2,832 <0.01% 0.000
232 ps 40 56,592 <0.01% 0.000
233 pt 8 8,521 <0.01% 0.000
234 q 443 1,966,725 0.12% 0.000
235 qu 47 75,978 <0.01% 0.000
236 que 22 43,577 <0.01% 0.000
237 r 13,293 67,773,671 4.22% 0.000
238 re 489 9,600,092 0.60% 0.035
239 rh 28 20,260 <0.01% 0.000
240 ro 9 22,769 <0.01% 0.000
241 rps 1 9,467 <0.01% 0.000
242 rr 334 879,006 0.05% 0.000
243 rre 2 10,479 <0.01% 0.000
244 rrh 1 149 <0.01% 0.000
245 rror 3 37,225 <0.01% 0.000
246 rt 3 6,210 <0.01% 0.000
247 s 18,696 105,633,904 6.58% 1.027
248 sc 129 371,066 0.02% 0.160
249 sce 48 12,087 <0.01% 0.000
250 sch 8 6,518 <0.01% 0.000
251 sci 15 27,403 <0.01% 0.000
252 se 200 1,696,607 0.11% 0.957
253 sh 1,220 4,198,020 0.26% 0.001
254 shi 19 42,067 <0.01% 0.000
255 si 190 790,039 0.05% 0.540
256 sl 6 57,284 <0.01% 0.000
257 ss 1,405 4,825,294 0.30% 0.387
258 sse 9 3,871 <0.01% 0.000
259 ssi 121 572,191 0.04% 0.003
260 st 84 140,492 0.01% 0.000
261 sth 2 1,858 <0.01% 0.000
262 sw 15 204,666 0.01% 0.000
263 t 19,481 113,483,774 7.21% 0.112
264 tch 196 490,222 0.03% 0.000
265 te 205 892,866 0.06% 0.000
266 tes 1 717 <0.01% 0.000
267 th 1,204 51,995,883 3.24% 0.711
268 the 31 11,857 <0.01% 0.000
269 thes 8 15,037 <0.01% 0.000
270 ti 1,781 7,056,797 0.44% 0.301
271 ts 1 551 <0.01% 0.000
272 tsch 1 37 <0.01% 0.000
273 tsh 10 2,949 <0.01% 0.000
274 tt 618 2,613,709 0.16% 0.000
275 tte 34 33,275 <0.01% 0.000
276 tth 1 17,697 <0.01% 0.000
277 tw 7 457,084 0.03% 0.000
278 tzsch 1 1,825 <0.01% 0.000
279 u 6,587 27,685,856 1.72% 2.307
280 u_e 311 2,464,579 0.15% 0.982
281 ua 2 17,469 <0.01% 0.000
282 ual 2 15 <0.01% 0.000
283 ue 96 855,526 0.05% 0.912
284 ugh 1 5,867 <0.01% 0.000
285 uh 2 14,565 <0.01% 0.139
286 ui 86 309,495 0.02% 1.315
287 ui_e 4 8,687 <0.01% 0.000
288 ul 138 430,388 0.03% 0.000
289 ule 2 3,321 <0.01% 0.000
290 ull 42 81,135 0.01% 0.000
291 um 3 2,190 <0.01% 0.000
292 uo 9 3,304 <0.01% 0.479
293 uoy 3 313 <0.01% 0.000
294 ur 577 1,620,707 0.10% 1.031
295 ure 193 1,020,272 0.06% 1.098
296 urr 83 350,854 0.02% 0.472
297 ut 3 3,171 <0.01% 0.000
298 uy 5 194,926 0.01% 0.000
299 v 2,880 15,036,744 0.94% 0.000
300 ve 540 5,877,003 0.37% 0.000
301 vv 3 1,259 <0.01% 0.000
302 w 1,601 20,746,360 1.29% 0.006
303 we 5 17,254 <0.01% 0.000
304 wh 215 6,321,562 0.39% 0.695
305 wi 1 1,020 <0.01% 0.000
306 wr 102 2,176,730 0.14% 0.000
307 x 869 3,887,725 0.24% 0.407
308 xe 3 7,822 <0.01% 0.000
309 xh 23 20,691 <0.01% 0.000
310 xi 6 7,758 <0.01% 0.000
311 y 4,493 32,632,117 2.03% 1.680
312 y_e 57 304,957 0.02% 0.000
313 ye 9 16,561 <0.01% 0.000
314 yl 6 19,654 <0.01% 0.000
315 yll 1 36 <0.01% 0.000
316 yr 4 7,350 <0.01% 0.000
317 yrrh 1 105 <0.01% 0.000
318 z 818 1,215,898 0.08% 0.301
319 ze 4 5,758 <0.01% 0.000
320 zi 1 59 <0.01% 0.000
321 zz 74 60,756 <0.01% 0.458
322 ' 54 29,904 <0.01% 0.000
Totals: 264,618 1,603,490,234
This appendix lists the phonemes distinguished by the ELP and therefore used in
this study. In addition to the 44 phonemes generally accepted for SAE, there are four
additional phonemic distinctions included for analysis: the cluster /ju/ in contrast to /u/
(as heard in moot vs. mute), and the syllabic variants of /m/, /n/, and /l/ (when these
consonants function as the nucleus of a syllable instead of a vowel). Phonemes are listed
in the table in both IPA and SAMPA formats, with a provided example. Like the IPA,
the Speech Assessment Methods Phonetic Alphabet (SAMPA) is a symbol system used
for phonetic transcription, with a specific focus on using only characters available on
standard computer keyboards. Note that while some phonemes have different symbols
between the IPA and SAMPA formats, many share the same symbol in both
typographies. For example, in the
word teething, the initial consonant is represented by /t/ in both SAMPA and IPA
formats, the vowel featured in both syllables is represented by /i/ in both formats, the
middle consonant is represented by /D/ in SAMPA and /ð/ in IPA, and the final consonant
is represented by /N/ in SAMPA and /ŋ/ in IPA. The analysis was conducted with
phonemes in SAMPA format, and then in post-processing these symbols were converted
to IPA format, which may be more familiar to readers. Appendix D, however, discusses
technical aspects of the algorithm and so retains use of SAMPA symbols for the sake of
consistency.
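Because the SAMPA-to-IPA conversion described above is a one-to-one symbol substitution, the post-processing step can be sketched as a simple lookup table. The sketch below is hypothetical and restricted to the symbols from the teething example:

```python
# Hypothetical sketch of the SAMPA-to-IPA post-processing substitution;
# only the symbols appearing in the 'teething' example are included.
SAMPA_TO_IPA = {"t": "t", "i": "i", "D": "ð", "N": "ŋ"}

def to_ipa(sampa_phonemes):
    """Replace each SAMPA symbol with its IPA equivalent."""
    return [SAMPA_TO_IPA[p] for p in sampa_phonemes]

# 'teething' in SAMPA, converted to IPA:
print(to_ipa(["t", "i", "D", "i", "N"]))  # ['t', 'i', 'ð', 'i', 'ŋ']
```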
Appendix C Continued
P G Prob. P G Prob. P G Prob.
hi 0.000066 rt 0.000075 sce 0.000151
aa 0.000051 ir 0.000015 sse 0.000048
oi 0.000037 rrh 0.000002 cs 0.000045
eau 0.000037 0.98177 ces 0.000029
er 0.00003 1.30285
ea 0.000027 /l/= l 0.804507
o' 0.000023 0.03588 ll 0.183583 /æ/= a 0.999126
ae 0.000016 'll 0.005524 0.03339 au 0.000773
he 0.000014 all 0.004237 e 0.000043
eo 0.000008 sl 0.000994 ah 0.00002
on 0.000006 ol 0.000431 ai 0.000015
c 0.000003 le 0.00035 a'a 0.000014
eu 0.000002 lle 0.000283 i 0.000007
oo 0 lh 0.000091 a_e 0.000002
2.01254 0.79956 0.01081
kh 0.000137 j 0.000006 ea 0.012055
cu 0.000123 ai 0.000003 e 0.003686
cques 0.000086 hoe 0.000002 eigh 0.002844
che 0.000053 ois 0.000002 ai_e 0.001988
gh 0.000036 2.74292 aigh 0.001309
cqu 0.000036 ei 0.000615
cch 0.000016 /ɛ/= e 0.80297 ay_e 0.000381
kg 0.000004 0.02784 a 0.103993 eh 0.000371
kk 0.000001 ea 0.034405 ae 0.000288
1.5928 ai 0.02727 et 0.000265
ei 0.019017 au 0.000136
/z/= s 0.893611 ie 0.004512 ee 0.000135
0.02856 es 0.0468 ay 0.004462 e_e 0.000055
z 0.02544 ey 0.001868 es 0.000019
's 0.017539 u 0.000366 er 0.000016
se 0.014002 ae 0.000321 eii 0.000003
zz 0.001196 aye 0.000305 ais 0.000002
ss 0.000729 aa 0.000242 ei_e 0.000001
thes 0.000328 eo 0.000177 1.96794
x 0.000154 hei 0.00006
ze 0.000126 ee 0.000023 /u/= o 0.535481
sth 0.000041 oe 0.000005 0.01916 ou 0.187264
is 0.000013 eh 0.000004 u 0.093988
ts 0.000012 1.11633 oo 0.052143
cz 0.00001 ew 0.042228
0.70278 /p/= p 0.93503 u_e 0.033706
0.02349 pp 0.063688 ue 0.018703
/ʌ/= o 0.515433 pe 0.001145 o_e 0.015449
0.02463 u 0.356184 ph 0.000115 ough 0.011045
a 0.09341 bp 0.000014 ui 0.003879
ou 0.020194 ppe 0.000008 oo_e 0.003073
au 0.012732 0.3567 eu 0.000807
oo 0.002047 ou_e 0.000773
1.55491 /v/= v 0.470449 oe 0.000673
0.01990 f 0.344445 ui_e 0.000282
/aɪ/= i 0.473221 ve 0.16932 ieu 0.000274
0.02044 i_e 0.272489 've 0.014556 oup 0.000081
y 0.165759 ph 0.00092 ooh 0.000046
igh 0.057033 w 0.000196 ioux 0.000043
y_e 0.009293 lv 0.000067 w 0.000026
ie 0.007419 vv 0.000039 ous 0.000025
uy 0.00594 lve 0.000008 uo 0.000011
eye 0.003285 1.57742 oui 0.000001
ai 0.001922 2.19773
ei 0.000852 /f/= f 0.881126
ia 0.000703 0.01868 ff 0.068213 /b/= b 0.995284
eigh 0.000631 ph 0.034432 0.01867 bb 0.00466
ye 0.000505 gh 0.009224 pb 0.000046
ay 0.000241 ft 0.00402 bh 0.00001
a 0.000189 lf 0.002866 0.04371
ig 0.000167 pph 0.000094
is 0.000125 ffe 0.000013 /o/= o 0.634053
oy 0.000063 v 0.00001 0.01442 o_e 0.177392
ey 0.000043 fe 0.000002 ow 0.138887
ais 0.000042 0.7127 oa 0.021885
aye 0.000028 ough 0.014182
ae 0.000025 /ɚ/= er 0.700952 oh 0.005187
ailles 0.000023 0.01526 or 0.148072 oe 0.003659
1.93148 ar 0.084384 ou 0.002998
ure 0.030605 eau 0.000638
/ɑ/= o 0.627161 ur 0.014244 owe 0.000354
0.01872 a 0.355241 orr 0.006852 ew 0.000218
oh 0.007262 arr 0.004151 ot 0.000124
ow 0.003507 err 0.001924 u 0.000117
ea 0.002166 ir 0.001921 oo 0.000111
ho 0.002039 rror 0.001519 au 0.000066
ah 0.001387 urr 0.001446 au_e 0.000023
e 0.000779 re 0.001409 oa_e 0.000023
i 0.000236 ro 0.000929 eaux 0.000022
au 0.000095 eur 0.000646 aoh 0.000021
aa 0.00008 our 0.0003 os 0.000019
ou 0.000022 yr 0.0003 eo 0.000009
as 0.000016 aur 0.000099 o' 0.000007
at 0.000006 oor 0.000082 ou_e 0.000004
oi 0.000002 erwr 0.000068 1.5778
aw 0.000001 're 0.000045
1.09733 oar 0.000029 /h/= h 0.932069
awr 0.000018 0.01102 wh 0.066675
/ɔ/= o 0.669859 ere 0.000006 j 0.001238
0.01621 a 0.211565 1.4957 x 0.000018
au 0.039668 0.36731
ou 0.028863 /ɾ/= t 0.57741
aw 0.018842 0.01357 d 0.318461 /ʃ/= ti 0.494173
ough 0.012995 tt 0.087569 0.00844 sh 0.309672
oa 0.008879 dd 0.015204 ci 0.055121
oo 0.004693 bt 0.000692 ssi 0.0422
augh 0.002952 ld 0.000308 s 0.027479
ea 0.000488 ct 0.000293 ss 0.02142
awe 0.000441 th 0.000059 ch 0.019937
ah 0.000343 cht 0.000004 c 0.010283
as 0.000211 1.39791 si 0.007219
ao 0.000188 t 0.004024
u 0.000007 /g/= g 0.938245 shi 0.003103
al 0.000005 0.00858 gu 0.041541 che 0.002202
eo 0 gg 0.015718 sci 0.002021
1.52398 gh 0.002846 sc 0.000639
gue 0.00165 sch 0.000481
/w/= w 0.901524 0.41042 ce 0.000024
0.01432 u 0.096813 chsi 0.000002
we 0.00075 /ɝ/= er 0.40125 2.05046
ju 0.000468 0.00645 or 0.181697
o 0.000371 ur 0.115057 /l/= le 0.521185
hu 0.000054 ir 0.105264 0.00640 al 0.269161
wh 0.000016 ear 0.074666 el 0.056952
r 0.000003 ere 0.066072 ul 0.041885
0.47923 urr 0.030449 all 0.03633
our 0.011768 l 0.020892
err 0.007258 il 0.01435
/ŋ/= ng 0.841945 orr 0.004238 ael 0.009953
0.01053 n 0.155757 her 0.000841 ull 0.007896
ing 0.001246 eur 0.000572 ol 0.005363
ngue 0.001033 irr 0.000507 ell 0.00416
nd 0.000019 olo 0.00035 ile 0.003915
0.64938 yrrh 0.00001 ll 0.003757
2.55326 yl 0.001913
/n/= on 0.649846 'll 0.001774
0.00814 n 0.169372 /θ/= th 0.998036 ule 0.000323
en 0.099272 0.00606 tth 0.00182 ill 0.000186
an 0.047163 t 0.000129 yll 0.000004
ain 0.017626 dth 0.000016 ual 0.000001
in 0.0135 0.02131 2.10091
ne 0.001607
onn 0.001455 /ʊ/= ou 0.567379 /j/= y 0.99984
ign 0.000159 0.00521 oo 0.249375 0.00540 j 0.000085
1.59378 u 0.167699 i 0.000071
o 0.013537 ll 0.000004
/ʤ/= j 0.358992 eu 0.001442 0.00243
0.00637 g 0.315193 uo 0.000354
ge 0.229182 or 0.000202 /ʧ/= ch 0.689814
d 0.036773 oui 0.000011 0.00494 t 0.202527
dge 0.026248 1.49991 tch 0.061743
gi 0.018078 ti 0.043437
dg 0.009429 /hw/= wh 0.998552 cz 0.000557
dj 0.003308 0.00321 ju 0.001153 che 0.00052
di 0.002075 w 0.000296 c 0.000505
gg 0.000393 0.01681 tsh 0.000371
ch 0.000328 ci 0.000274
2.07789 /jɛ/= e 1 tzsch 0.00023
0.00000 0 th 0.000019
/aʊ/= ou 0.639641 tsch 0.000005
0.00531 ow 0.32056 /jə/= u 0.631539 1.30858
ou_e 0.022595 0.00108 io 0.1803
hou 0.014715 ia 0.142676 /ks/= x 0.997381
au 0.001437 ie 0.029076 0.00186 xe 0.002619
ao 0.000833 ua 0.01011 0.02624
ough 0.000219 iu 0.004833
1.17636 a 0.001109 /jɑ/= o 1
iou 0.000358 0.00000 0
/ju/= u 0.571625 1.53274
0.00343 u_e 0.258882 /oɪ/= oi 0.628629
ew 0.05474 /gz/= x 0.963063 0.00096 oy 0.328469
ue 0.050779 0.00035 xh 0.036937 oi_e 0.014927
iew 0.047863 0.22807 aw 0.013624
eau 0.008614 ois 0.010712
ou 0.003263 /jɚ/= ure 0.517928 oy_e 0.001896
eu 0.002428 0.00016 ior 0.320364 eu 0.001395
ugh 0.001064 iar 0.145187 uoy 0.000202
ut 0.000575 ur 0.015327 o 0.000094
ieu 0.00009 ier 0.000748 hoi 0.000052
ewe 0.000076 or 0.000447 1.22852
1.74966 1.52704
/jʊ/= u 0.725133
/wʌ/= o 1 /wɑ/= ois 0.597656 0.00019 eu 0.273925
0.00157 0 0.00000 oi 0.402344 uh 0.000943
0.97231 0.85744
/ʒ/= si 0.643986
0.00067 s 0.293708 /gʒ/= x 1 /kʃ/= x 0.907282
ge 0.02481 0.00000 0 0.00005 xi 0.092718
g 0.014577 0.44548
ti 0.012087 /ts/= z 0.773716
z 0.005907 0.00002 zz 0.226284 /kə/= kh 1
j 0.0046 0.77148 0.00000 0
sh 0.000143
ssi 0.000127 /nj/= n 0.663968 /əw/= ju 1
zi 0.000055 0.00000 gn 0.336032 0.00000 0
1.30993 0.92097
/m/= m 0.905322
0.00018 ome 0.066394
om 0.016532
um 0.007431
am 0.003332
em 0.000987
0.57738
Note: P=phoneme, G=grapheme. Below each phoneme, the probability for that
phoneme is listed. The probabilities for each PGC are listed to the right of each
grapheme. At the bottom of each list, the phoneme’s entropy is given in bold. The
graphemes are listed in order of decreasing probability, with the first grapheme being
the “main spelling” or most predictable correspondence for that phoneme.
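The entropy value reported for each phoneme is the Shannon entropy (base 2) of its grapheme probability distribution. A minimal sketch (the function name is illustrative), using /b/'s PGC probabilities from the table above:

```python
import math

def pgc_entropy(probabilities):
    """Shannon entropy (base 2) of a phoneme's grapheme distribution;
    zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# /b/: b 0.995284, bb 0.00466, pb 0.000046, bh 0.00001 (values from the
# table above); this reproduces the listed entropy of 0.04371.
print(round(pgc_entropy([0.995284, 0.00466, 0.000046, 0.00001]), 5))  # 0.04371
```

A phoneme with a single spelling has entropy 0; additional spellings, or a flatter distribution over them, drive the entropy up.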
The following describes the decision-making processes used by the algorithm in
choosing how to parse words.
ELP Characteristics. For each entry, the ELP provides a number of characteristics
that may be used in analysis. The following characteristics were utilized by this study:
(a) Frequency Norms (Freq_HAL), (b) Pronunciation (Pron.), and (c) Morphological
Spelling (MorphSp).
Freq_HAL are the frequency norms provided by the HAL database. For each entry,
the number of times that word is found throughout the corpus is indicated. These data
were necessary to calculate the probability for each word.
Pron. is the listed pronunciation for each entry, based on the Standard American
English (SAE) dialect. This information is recorded in the SAMPA format, and the
analysis was therefore conducted using SAMPA symbols. Once the final data were
analyzed, the SAMPA symbols were replaced with their more familiar IPA equivalents.
MorphSp provides a morphological breakdown of each entry, giving the root(s) and
any affixes for a given word. This feature was utilized in creating an algorithm which
can apply various procedures to determine how best to parse ambiguous mappings.
Parsing the ELP algorithmically. After the problematic cases documented above
were addressed, an algorithm was constructed which accurately parsed as many words in
the ELP as possible. A parsing was considered “accurate” if it conformed to the desired
parsing conventions established for this study.
Details of how the algorithm made parsing decisions will be discussed below.
Broadly speaking, there were three steps undertaken by the algorithm in an attempt to
achieve the desired parsing of a word. First, the program consulted a mapping file that
defined the acceptable phoneme-grapheme correspondences. Second, it expressed all
acceptable parsings for a given word type. Third, it chose the most accurate parsing
from among those candidates.
Constructing the mapping file. The mapping file listed all phonemes used by the
ELP and all potential graphemes that might correspond to each phoneme. This is the
information used by the algorithm to parse the wordlist. Only the phoneme-grapheme
matches defined in the mapping file are considered acceptable correspondences by the
algorithm.
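The mapping file can be pictured as a lookup from each phoneme to its allowable graphemes. The sketch below is hypothetical; the entries shown are a small illustrative fragment, not the study's actual file:

```python
# Illustrative fragment of a phoneme-to-grapheme mapping (SAMPA keys);
# the study's actual mapping file covers the full phoneme inventory.
MAPPING = {
    "k": ["c", "k", "ck", "cc", "ch", "que"],
    "n": ["n", "nn", "kn", "gn"],
}

def is_acceptable(phoneme, grapheme):
    """A correspondence counts as acceptable only if the mapping file
    lists the grapheme under that phoneme."""
    return grapheme in MAPPING.get(phoneme, [])

print(is_acceptable("k", "ck"))  # True
print(is_acceptable("n", "c"))   # False
```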
The primary goal of the allowed correspondences was not to minimize the number of
possible parsings returned for each word, but rather to be conservative in eliminating
possibilities, so as to increase the chance that one of the potential parsings is the desired
one. In other words, the net was cast wide. This was guided by the belief that it is easier
to discard unwanted parsings later than to recover parsings that were never generated.
With this strategy in mind, the initial mapping file included all graphemic options
culled from prior studies (Berndt et al., 1987; Gontijo et al., 2003; Hanna, 1966); in
addition, a number of online sources were consulted, including Wikipedia. From there,
idiosyncratic graphemes were iteratively added, until the algorithm was able to yield a
suggested parsing for every word in the dataset. This procedure was done to decrease the
number of words that remained unparseable to the algorithm.
Expressing all possible parsings. For each entry in the ELP, the algorithm took as
input the grapheme string comprising the written word and the phoneme string
comprising its phonetic transcription, with all stress and syllable marks removed. The
program moved in a serial fashion from left to right. It identified each phoneme
encountered by matching it to the identical phoneme listed in the mapping file, choosing
the longest phoneme string consistent with the position at hand. In this manner, the
pronunciation was segmented into its component phonemes.
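The left-to-right, longest-match identification of phonemes can be sketched as follows (a hypothetical illustration; the inventory passed in is only a fragment of the SAMPA set):

```python
def split_phonemes(pron, phoneme_set):
    """Greedily split a SAMPA pronunciation string into phonemes,
    always taking the longest match at the current position."""
    result, i = [], 0
    while i < len(pron):
        match = None
        for length in range(len(pron) - i, 0, -1):  # try longest first
            if pron[i:i + length] in phoneme_set:
                match = pron[i:i + length]
                break
        if match is None:
            raise ValueError(f"unknown phoneme at position {i} in {pron!r}")
        result.append(match)
        i += len(match)
    return result

# "tS" (the affricate in 'chin') must win over plain "t" at position 0:
print(split_phonemes("tSIn", {"t", "tS", "I", "n"}))  # ['tS', 'I', 'n']
```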
The algorithm next examined every possible acceptable alignment between phoneme
and grapheme strings. No limitations were placed on what character(s) could represent
any phoneme. The only constraints were that 1) complex (multi-letter) graphemes must
contain only consecutive letters (except for split graphemes containing FS–e, handled
separately), 2) all letters of the word must be used, and 3) the number of graphemes must
equal the number of phonemes. For split graphemes, the FS–e (the silent e following the
vowel) must be directly preceded by either a graphemic consonant(s), gu-, or qu-. The
associated vowel had to be the first vowel encountered prior to the directly preceding
consonant(s), gu-, qu-, and any number of consecutive preceding vowels until a
consonant, another split grapheme, or the beginning of the word was encountered. This
rule flagged the “e” as a potential candidate for being part of a split grapheme, but only
when these conditions were met.
Although it can be argued that FS–e can serve many functions simultaneously, for
simplicity, in this study a FS–e cannot be assigned to more than one grapheme. It will
either attach to the vowel(s) or to the consonant(s), as per the guidelines above.
Once the algorithm determined the set of minimally constrained parsings (as
described above) of a given word, it consulted the mapping file to determine if each
parsing was legitimate, and returned an error otherwise. In this fashion, the
algorithm returned as output all legitimate parsings, with no limit placed on the number
of parsings per word. As expected, some words returned only one parsing, while many
returned multiple candidates.
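Under the constraints above, and setting aside split graphemes for brevity, the exhaustive generation of legitimate parsings can be sketched as a recursion over the mapping file (a hypothetical sketch; names and entries are illustrative):

```python
def all_parsings(word, phonemes, mapping):
    """Return every alignment of `word`'s letters to its phoneme list in
    which each phoneme is spelled by a grapheme allowed by `mapping`.
    Split graphemes (FS-e) are omitted from this sketch for brevity."""
    if not phonemes:
        return [[]] if not word else []   # all letters must be consumed
    results = []
    for g in mapping.get(phonemes[0], []):
        if word.startswith(g):            # graphemes use consecutive letters
            for rest in all_parsings(word[len(g):], phonemes[1:], mapping):
                results.append([g] + rest)
    return results

mapping = {"n": ["n", "kn"], "i": ["ee", "e_e"], "t": ["t"]}
# 'knee' -> /ni/ has exactly one legitimate parsing under this fragment:
print(all_parsings("knee", ["n", "i"], mapping))  # [['kn', 'ee']]
```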
Determining the preferred parsing. Once the above algorithm could generate at
least one legitimate parsing for every word in the list, the next step was to sift through all
generated parsings of an entry and choose the one considered to be the most accurate. To
this end, the MorphSp attribute of the ELP was utilized, which breaks words down into
their morphological components.
For any word with multiple acceptable parsings, this algorithm first broke each
entry down into its morphological components. The words were divided into three
classes and addressed separately: root, non-compound, and compound words. Root
words were words with a single root and no prefixes or suffixes. Non-compound words
had one root with one or more prefixes and/or suffixes. Compound words contained
more than one root and could contain prefixes and/or suffixes.
Root words. Root words were addressed first. In the event of multiple acceptable
parsings, the algorithm initially assumed that the first parsing was the correct one. The
algorithm then compared this incumbent parsing to each alternate parsing in turn. With
each comparison the program determined whether to replace the incumbent parsing with
the new one under consideration, or to reject the new parsing and retain the incumbent.
It proceeded in this manner until every alternate parsing had been considered.
The algorithm compared the incumbent and the alternate parsing under consideration
by performing up to four passes over the parsings. In each pass, the parsings were
compared in a serial, left-to-right order, ignoring all pairs of identical graphemes between
the two candidates. When the algorithm encountered two graphemes that were not
identical, it invoked a tie-breaking subroutine. The particular subroutine that was
employed depended on which pass the algorithm was executing, as described below. If
at any time a subroutine was able to determine, based on a particular grapheme
discrepancy, that one of the parsings was preferred, then the comparison ended and the
preferred parsing became the incumbent.
The first pass expressed a preference for longer graphemes over shorter graphemes
when the phoneme in question could be represented by both. For example, consider a
word such as hackney, which might be parsed h-a-ck-n-ey or h-a-c-kn-ey. According to
the mapping file, both options are legitimate; the grapheme ck often represents /k/, as in
back, sack, and hack, and the grapheme kn often represents /n/, as in knee, knight, know,
etc. Since words like hack feature the ck = /k/ grapheme, and words ending in -ney
feature the n = /n/ grapheme (kn representing /n/ appears in word-initial position only),
the first of the two parsings is judged to be the correct option. The simplest way to
choose this option was through the preference of ck over c: whenever c was the correct
choice, it was never the case that the contesting grapheme was ck. This decision was not
based upon any theoretical framework, but rather was determined simply from
observation of patterns in the ELP. Other preferences declared by the first pass were
established in the same manner.
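The first pass can be sketched as a table-driven comparison in which preference pairs, such as ck over c, were harvested from patterns in the ELP (a hypothetical sketch; the pairs and parsings shown are illustrative):

```python
def pass_one(incumbent, challenger, preferences):
    """First tie-breaking pass: walk both parsings left to right, skip
    identical graphemes, and consult a preference table at the first
    discrepancy. Returns the winner, or None if this pass cannot
    break the tie. The preference pairs are illustrative."""
    for a, b in zip(incumbent, challenger):
        if a == b:
            continue
        if (a, b) in preferences:
            return incumbent
        if (b, a) in preferences:
            return challenger
    return None

prefs = {("ck", "c"), ("kn", "n")}  # longer grapheme preferred over shorter
# h-a-ck-n-ey vs h-a-c-kn-ey: 'ck' beats 'c', so the first parsing wins
print(pass_one(["h", "a", "ck", "n", "ey"],
               ["h", "a", "c", "kn", "ey"], prefs))
```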
If the first pass failed to “break the tie,” a second pass was initiated. Execution of this
code handled discrepancies involving doubled letters. The subroutine preferred doubled
letters over single letters to represent a given phoneme, when both options were
legitimate.
If this code did not break the tie, a third pass was initiated. This pass handled
If the third pass failed to break the tie, a fourth and final pass was initiated. This
pass concerned certain vowels that were preferred in certain contexts.
Non-compound words. Non-compound words were handled similarly to
root words, but with additional steps to handle affixes. The MorphSp column indicated
the root word and affixes separately for any entry. The algorithm first checked to see if
the root word existed as its own separate entry in the ELP. If the associated root did not
exist as its own separate entry, the entry under consideration was parsed using the four-
pass procedure described above for root words.
However, frequently the associated root word existed as its own separate entry in
the ELP. If the phonemic string of the non-compound word contained the phonemic
string of the root word, this indicated the affixes did not alter the pronunciation of the
word root. The algorithm thus mirrored the parsing accepted for the root word, rejecting
inconsistent candidates. If no candidate was consistent, the algorithm did not immediately
reject any candidate, but instead skipped this step and proceeded to the next step.
If multiple acceptable parsings remained after examining the root, additional steps
were next applied. First, certain prefixes containing e, such as re-, might parse as part of
a split grapheme, which would inevitably be in error. Therefore, explicit rules were
outlined whereby, in the presence of these prefixes, split graphemes could not occur in
that position. Next, the algorithm checked if a preference between incumbent and
challenger could be made based on the suffix.
Consider the words lie and lady. Pluralizing these words results in lies and ladies. Though
both plural forms end with -ies, they must be parsed differently. The word lies should be
parsed as l-ie-s, not li-es, since the latter would indicate the word root is li-. Ladies, on
the other hand, should be parsed as l-a-d-i-es, not l-a-d-ie-s, because the penultimate “e”
is not found in the singular form of lady. The “e” must be part of the ending, and
therefore -es is the correct parsing of the suffix (with the final -y converting to an -i). So
for a word ending in –y + -s (as indicated by the MorphSp entry) and two candidate
parsings, one ending in –i-es and the other ending in –ie-s, the algorithm immediately
selected the parsing ending in –i-es as the next incumbent. Rules of this kind were
expressed as:
Root ending in -e + suffix -s, resulting in word ending in –es: prefer parsings
ending with –s to parsings ending with –es
Suffix –ives: prefer parsings ending with –s to parsings ending with –es
Suffix –itives: prefer parsings ending with –s to parsings ending with –es
Suffix –tures: prefer parsings ending with –s to parsings ending with –es
Suffix –eer: prefer parsings ending with –r to parsings ending with –er
Suffix –eer + additional suffix: prefer parsings ending with –r-? to parsings
endings with –er-?
Suffix –ered: prefer parsings ending with –ed to parsings ending with –d
Root ending in -e + suffix –ed: prefer parsings ending with –d to parsings
ending with –ed
Root not ending in -e + suffix –ed: prefer parsings ending with –ed to parsings
ending with –d
Suffixes –y + ed = –ied: prefer parsings ending in –ed to parsings ending in –d
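The suffix-preference rules above can be sketched in Python. This is an illustrative sketch, not the thesis's actual software; the rule table, function names, and the representation of a parsing as a list of grapheme units (e.g. "ladies" as ["l", "a", "d", "i", "es"]) are all hypothetical.

```python
def ends_with(parsing, units):
    """True if the parsing's final grapheme units equal `units`."""
    return parsing[-len(units):] == list(units)

# (suffix pattern from the MorphSp entry, preferred ending, dispreferred ending)
SUFFIX_RULES = [
    ("y+s",  ("i", "es"), ("ie", "s")),  # lady -> ladies: prefer -i-es
    ("e+s",  ("s",),      ("es",)),      # root in -e + -s: prefer -s
    ("ives", ("s",),      ("es",)),
    ("e+ed", ("d",),      ("ed",)),      # root in -e + -ed: prefer -d
    ("ed",   ("ed",),     ("d",)),       # root not in -e + -ed: prefer -ed
]

def prefer(suffix_pattern, incumbent, challenger):
    """Return whichever parsing the first applicable rule favors, else None."""
    for pattern, good, bad in SUFFIX_RULES:
        if pattern != suffix_pattern:
            continue
        if ends_with(incumbent, good) and ends_with(challenger, bad):
            return incumbent
        if ends_with(challenger, good) and ends_with(incumbent, bad):
            return challenger
    return None  # no suffix rule decides; fall through to later steps

# ladies: l-a-d-i-es beats l-a-d-ie-s under the y+s rule
print(prefer("y+s", ["l", "a", "d", "i", "es"], ["l", "a", "d", "ie", "s"]))
# -> ['l', 'a', 'd', 'i', 'es']
```

Returning None when no rule applies mirrors the algorithm's behavior of deferring to later steps rather than forcing a decision.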
Again, these rules were not decided beforehand, but rather resulted from examining
cases in which the algorithm was unable to select a single, desired parsing. If execution of all “suffix
rules” did not result in a determination of the correct parsing, the algorithm continued on to the next steps.
Compound words. Lastly, compound words containing multiple roots and/or affixes
were addressed. These were handled similarly to non-compound words, only with
additional initial steps. First, if there were no affixes, the algorithm checked if the compound
word, in both graphemic and phonemic strings, was identical to the combination of the
associated roots, and if so, chose the concatenation of the parsings that were selected for
the root words in previous steps of the algorithm. For example, when the program
encountered a word like wholesale, it searched for the associated roots whole and sale,
combined the roots, and then checked if the result was identical to wholesale, which it
was. Similarly, it confirmed that the pronunciation of /holsel/ (wholesale) was the
concatenation of /hol/ + /sel/. Finally, the algorithm posited the concatenation of the
selected parsings for whole and sale, yielding wh-o_e-l-s-a_e-l. As this is indeed one of the candidate parsings of wholesale, it was selected as the preferred parsing.
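The affix-free compound check can be sketched as follows. The function name and data representation are hypothetical (not the thesis's actual software): a compound is accepted when its spelling and pronunciation are the exact concatenations of its roots', and its parsing is then the concatenation of the roots' previously selected parsings.

```python
def parse_simple_compound(word, pron, roots):
    """roots: list of (spelling, pronunciation, selected_parsing) triples."""
    if word != "".join(spelling for spelling, _, _ in roots):
        return None  # graphemic strings do not concatenate to the compound
    if pron != "".join(p for _, p, _ in roots):
        return None  # phonemic strings do not concatenate either
    parsing = []
    for _, _, root_parsing in roots:
        parsing.extend(root_parsing)
    return parsing

whole = ("whole", "hol", ["wh", "o_e", "l"])
sale = ("sale", "sel", ["s", "a_e", "l"])
print(parse_simple_compound("wholesale", "holsel", [whole, sale]))
# -> ['wh', 'o_e', 'l', 's', 'a_e', 'l']
```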
For more complicated compound words, particularly ones containing affixes, the
algorithm first checked whether the combination of the associated roots, in both graphemic and
phonemic strings, was a subset of the compound word being analyzed. If so, and if there
was at least one possible parsing of the compound word that contained this simple
combination of the previously selected parsings of the associated roots, then all other
parsings were eliminated. For example, if the algorithm encountered the word
wholesalers, it checked if this word contained whole + sale and its pronunciation
contained /holsel/. When this proved true, only parsings containing wh-o_e-l-s-a_e-l
were retained. In this case, the only such parsing is wh-o_e-l-s-a_e-l-r-z, and so this parsing was selected.
The algorithm then tested, in a similar fashion, if the compound word being
analyzed began with the first associated root word, and if the pronunciation began with
the combination of the first associated root and the first phoneme of the second associated
root. If this was the case, then all parsings that did not begin with the previously selected
parsing of the first associated root word were eliminated. If there were still multiple
candidate parsings, then the algorithm proceeded as it did with non-compound words,
first analyzing the suffixes to see if it could determine a correct parsing by suffix alone,
and then performing pairwise comparisons of the remaining candidates. The full procedure is illustrated below for the
word whiskey:
I. Phoneme String
From left to right, find largest phonemes in mapping file at current location of string:
A. Both /h/ and /hw/ are phonemes in mapping file – take /hw/ as first phoneme
B. Current location is third character of phoneme string: /I/. /I/ is a phoneme in
mapping file, and no phoneme in mapping file starts with /Is/. So, /I/ is second
phoneme
C. Similarly, /s/ is next phoneme, followed by /k/ and /i/.
Phoneme string: /hw-I-s-k-i/
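Step I, the greedy left-to-right, longest-match segmentation of the phoneme string, can be sketched as below. PHONEMES stands in for the inventory in the mapping file (a hypothetical subset sufficient for /hwIski/); the function name is likewise illustrative.

```python
PHONEMES = {"h", "hw", "I", "s", "k", "i"}
MAX_LEN = max(len(p) for p in PHONEMES)

def segment_phonemes(s):
    """At each position, take the longest phoneme the inventory allows."""
    out, pos = [], 0
    while pos < len(s):
        for n in range(MAX_LEN, 0, -1):  # try longest candidate first
            if s[pos:pos + n] in PHONEMES:
                out.append(s[pos:pos + n])
                pos += n
                break
        else:
            raise ValueError(f"no phoneme match at position {pos} of {s!r}")
    return out

print(segment_phonemes("hwIski"))  # -> ['hw', 'I', 's', 'k', 'i']
```

Trying the longest candidate first is what makes /hw/ win over /h/ at the start of the string, exactly as in step A above.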
1. w-h-i_e-s-ky,
2. w-h-i_e-sk-y,
3. w-h-i-s-key,
4. w-h-i-sk-ey,
5. w-h-i-ske-y,
6. w-h-is-k-ey,
7. w-h-is-ke-y,
8. w-h-isk-e-y,
9. w-hi-s-k-ey,
10. w-hi-s-ke-y,
11. w-hi-sk-e-y,
12. w-his-k-e-y,
13. wh-i_e-s-k-y,
14. wh-i-s-k-ey,
15. wh-i-s-ke-y,
16. wh-i-sk-e-y,
17. wh-is-k-e-y,
18. whi-s-k-e-y
A. w-h-i_e-s-ky (#1): this implies that “w” maps to /hw/, “h” maps to /I/. Since the
mapping file does not allow /I/ to map to “h”, this option is eliminated. The same
rule also eliminates candidates #2-8 when encountered, which all begin with w-h-.
B. wh-i_e-s-k-y (#13): this implies “wh” maps to /hw/, “i_e” maps to /I/, “s” maps
to /s/, “k” maps to /k/, and “y” maps to /i/. The mapping file contains all of these
correspondences, and so this option is retained.
C. w-hi-sk-e-y (#11) and wh-i-sk-e-y (#16) are eliminated because the mappings
sk=/s/ and e=/k/ are not allowed by the mapping file.
D. #12 is eliminated because his=/I/ is not a legitimate mapping.
E. #17 is eliminated because k=/s/ and e=/k/ are not legitimate mappings.
F. #18 is eliminated because only its final mapping, y=/i/, is an allowed mapping.
The remaining candidates are: w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i_e-s-k-y (#13),
wh-i-s-k-ey (#14), and wh-i-s-ke-y (#15).
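The elimination step just performed can be sketched as a filter over the candidates. ALLOWED stands in for the grapheme-phoneme correspondences licensed by the mapping file (a hypothetical subset); a candidate survives only if it aligns one-to-one with the phoneme string and every implied pair is licensed.

```python
ALLOWED = {("wh", "hw"), ("w", "hw"), ("hi", "I"), ("i", "I"), ("i_e", "I"),
           ("s", "s"), ("k", "k"), ("ey", "i"), ("ke", "k"), ("y", "i")}

def is_licensed(graphemes, phonemes):
    """Every grapheme-phoneme pair implied by the alignment must be allowed."""
    return (len(graphemes) == len(phonemes) and
            all((g, p) in ALLOWED for g, p in zip(graphemes, phonemes)))

phons = ["hw", "I", "s", "k", "i"]
candidates = [
    ["w", "h", "i_e", "s", "ky"],  # #1: "h"=/I/ unlicensed -> eliminated
    ["wh", "i_e", "s", "k", "y"],  # #13: all pairs licensed -> retained
    ["wh", "i", "s", "k", "ey"],   # #14: all pairs licensed -> retained
]
survivors = [c for c in candidates if is_licensed(c, phons)]
print(survivors)  # #13 and #14 remain
```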
A. At least one candidate has a split grapheme (/I/ as i_e), and at least one candidate
interprets that –e as not part of a split grapheme. Since /I/ is not a long vowel, it
is determined that there is no split grapheme in this word, and all choices
containing the split grapheme i_e are eliminated. This eliminates one option, wh-
i_e-s-k-y (#13), leaving w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i-s-k-ey (#14),
and wh-i-s-ke-y (#15).
B. We now perform pairwise comparisons of the candidates.
a. The initial incumbent is w-hi-s-k-ey (#9) and the initial challenger is w-hi-
s-ke-y (#10).
i. First pass of grapheme comparisons from left-to-right:
The only remaining candidate, wh-i-s-k-ey (#14), is selected as the preferred parsing for
this word.
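The split-grapheme rule applied in step A above can be sketched as follows: a split grapheme such as i_e may only spell a long vowel, so candidates pairing one with a short vowel like /I/ are dropped. LONG_VOWELS is a hypothetical stand-in for the vowel classification assumed by the algorithm.

```python
LONG_VOWELS = {"e", "aI", "o", "ju", "i"}  # illustrative set; /I/ is short

def drop_bad_split_graphemes(candidates, phonemes):
    """Remove candidates pairing a split grapheme with a non-long vowel."""
    kept = []
    for graphemes in candidates:
        ok = all("_" not in g or p in LONG_VOWELS
                 for g, p in zip(graphemes, phonemes))
        if ok:
            kept.append(graphemes)
    return kept

phons = ["hw", "I", "s", "k", "i"]
print(drop_bad_split_graphemes(
    [["wh", "i_e", "s", "k", "y"], ["wh", "i", "s", "k", "ey"]], phons))
# the i_e candidate is dropped; wh-i-s-k-ey remains
```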
Manual Parsings. A small group of words defied typical patterns seen in English. Rather
than attempt to alter the algorithm to accommodate these words, which might in turn
cause unforeseen errors in other cases, these entries were separated and parsed manually:
After all the above procedures, parsing of the ELP wordlist was complete. All
40,481 words had either been (a) removed, (b) assigned one parsing
algorithmically, or (c) parsed manually. Every instance of a phoneme in the