A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
By
Jacob Cockcroft
Bachelor of Arts, Hendrix College, 2000
2017
The University of Arkansas for Medical Sciences
MEASURING ORTHOGRAPHIC PREDICTABILITY iii
Acknowledgements
The author would like first and foremost to acknowledge the long-distance efforts of Dr.
Robert J. Drost in the creation of the analysis software, along with his many hours spent
resolving the numerous technical and theoretical issues involved in this analysis. Without
his contributions, this research and thesis would not have been possible. Also, many
thanks to Dr. Greg Robinson for his guidance, support and encouragement throughout the
evolution of this study. Many thanks as well to the rest of my committee members—Dr.
Chenell Loudermill, Dr. Dana Moser, and Ms. Stacey Mahurin—whose invaluable
feedback has helped shape this project into (hopefully) a streamlined, accessible product
for teachers, educators, therapists, and researchers. Gratitude is likewise owed to all the
faculty for Speech-Language Pathology & Audiology that I have had the pleasure to meet and
learn from during my recent academic journey: Dr. Donna Kelly, Mrs. Connie Bracy, Dr.
Ashlen Thomason, Dr. Betholyn Gentry, Dr. Tom Guyette, and Mrs. Shanna Williamson,
just to name a few. Thanks to my indispensable proofreaders and parents, Jean Cazort and
David Cockcroft. Thanks to Anna Salzer for realizing how much I would love Speech-
Language Pathology before I did, and Eli Wakefield for the blessing of watching child
language development in action. Finally, thanks to my eighteen cohort-mates for making these years memorable.
Table of Contents
Acknowledgements ............................................................................................................ iv
Introduction ......................................................................................................................... 1
Rationale.......................................................................................................................... 4
Literature Review................................................................................................................ 7
Conclusion..................................................................................................................... 14
Methodology ..................................................................................................................... 16
Corpus ........................................................................................................................... 16
Reliability ...................................................................................................................... 22
MEASURING ORTHOGRAPHIC PREDICTABILITY vi
Calculating Entropy....................................................................................................... 23
Results ............................................................................................................................... 25
Graphemes ..................................................................................................................... 25
Phonemes ...................................................................................................................... 32
Discussion ......................................................................................................................... 35
Limitations .................................................................................................................... 42
Conclusion ........................................................................................................................ 47
References ......................................................................................................................... 49
Appendix B: List of Phonemes Used by the English Lexicon Project (ELP) .................. 62
Introduction
“…the English alphabet is pure insanity…, it can hardly spell any word in the
language with any degree of certainty.” --Mark Twain
Exactly how predictable is written English? The above quip by Mark Twain echoes
the exasperation many literate English speakers have felt at one time or another, when
considering how difficult and unruly the English writing system appears to be. Twain’s
trademark humor concerning what he called our “drunken old alphabet” reflects a larger
attitude of his time that English spelling needed to be reformed (Twain, 2016, p. 111).
English orthography features complex rules with many exceptions that appear to defy
logical explanation. Imagine what would happen if one day the auto-correct feature
disappeared from our cell phones and word processors. How confidently could we spell on our own?
Rather confidently, according to the seminal work in the 1960s of Hanna, Hanna,
Rudorf, and Hodges (referred to as “HHRH” hereafter). According to their research, half
of all English words are predictably spelled from their sound, and another 34% would
have just a single error if spelled on the basis of sound (Moats, 2005). Their linguistic
analysis of over 17,000 of the highest frequency words in English resulted in the
conclusion that English orthography is predictable over 80% of the time (Hanna, 1966).
Despite this assertion, in recent years a growing area of cross-linguistic research once
again throws English predictability into question. Orthographic depth is a term used to
indicate the relative degree to which a language’s orthography diverges from its spoken
form (Katz & Frost, 1992). The traditional view of orthographic depth states that
languages fall along a continuum, with shallow (transparent) orthographies on one end
and deep (opaque) orthographies on the other (Seymour, Aro, & Erskine, 2003). Shallow
orthographies, such as those of Hungarian and Greek, feature relatively phonetic spelling
systems, in which word spellings map closely onto pronunciations. Deep orthographies,
such as those of English and French, align in complex and less predictable ways to the
oral languages they symbolize.
English is generally considered to have the most opaque orthography of all European
languages, and English speakers appear to struggle with reading and spelling in ways
that most other speakers of European languages do not. David Share dubbed English an
“outlier orthography,” and questioned why it still governs the vast majority of current
research on the behavioral study of reading (Share, 2008).
It appears that two contrasting perspectives exist in the literature regarding the
predictability of English orthography. If English spelling is as predictable as claimed
by the HHRH study, then why are English speakers performing consistently worse when
compared to speakers of other languages which utilize the same Latin alphabet? One way
to answer this question is to perform a new linguistic analysis of English orthography
using modern computational methods.
The current study sought to measure the degree of predictability of the sounds and
spelling patterns of Standard American English (SAE). Using a custom built software
program, a corpus of over 131 million words was deconstructed into its constituent
sounds and spelling patterns, and then the frequencies of these sound-symbol
correspondences were tallied. From these frequencies, the probabilities of all
sound-symbol correspondences in the corpus were determined, which could then be used to
calculate entropy. Entropy measures how unpredictable a random sampling from a
distribution is, given the amount of information that is available. This
method has been used in several past studies to measure sound-symbol predictability for
a variety of languages (Borgwaldt, Hellwig, & De Groot, 2005; Borgwaldt, Hellwig, De
Groot, & Licht, 2006).
The basic units of spoken language that will be measured by this study are called
phonemes. Phonemes are the building blocks of spoken syllables, e.g. vowels and
consonants. Phonemes are written using the International Phonetic Alphabet (IPA)
standard format, with each unique sound represented by a specific symbol between two
slashes. The spoken word bat has three phonemes, written /b/, /æ/, and /t/.
The basic units of written language are called graphemes; a grapheme is the written symbol that corresponds to a single phoneme. Graphemes are written between quotes. The written word “bat” has three
graphemes: “b,” “a,” and “t.” A grapheme might be a single written letter, but graphemes
in English may contain up to four letters, e.g. “ough” as in though.
In this study, the term orthographic correspondence (OC) is used to refer to the
pairing of a grapheme with a phoneme, without reference to a particular “direction” of
correspondence. Thus the word bat contains three OCs: “b” = /b/, “a” = /æ/, and “t” =
/t/. When analyzing OCs, researchers may choose to study grapheme-phoneme
correspondences (GPCs) or phoneme-grapheme correspondences (PGCs). GPCs relate to
decoding words, the process of translating a string of graphemes into a string of
phonemes (i.e. “sounding out” a written word). PGCs relate to encoding words, the
process of translating a string of phonemes into a string of graphemes (i.e. spelling a
word from its sound).
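For illustration, the two “directions” of correspondence can be sketched in a few lines of Python (the study’s own software was written in MATLAB; the word representation here is invented for the example):

```python
# Representing the word "bat" as a list of orthographic correspondences
# (OCs): (grapheme, IPA phoneme) pairs.
bat_ocs = [("b", "b"), ("a", "æ"), ("t", "t")]

# Decoding reads each OC grapheme-to-phoneme (GPCs): "sounding out".
decoded = "".join(phoneme for grapheme, phoneme in bat_ocs)

# Encoding reads each OC phoneme-to-grapheme (PGCs): spelling.
encoded = "".join(grapheme for grapheme, phoneme in bat_ocs)

print(decoded)  # bæt
print(encoded)  # bat
```

The same OC list supports both processes; only the direction of the mapping differs.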
Rationale
Phonemic awareness, or the understanding that words are composed of phonemes that
must be blended and segmented, was established by the National Reading Panel as one of
the most critical skills that must be learned in order for a student to achieve successful
literacy outcomes (National Reading Panel (US), National Institute of Child Health, &
Human Development (US), 2000). Research has shown that in all phases of development,
language learners employ decoding skills (Sharp, Sinatra, & Reynolds, 2008).
Sight word reading, according to Linnea Ehri, is the process of building the lexicon
through memorizing words so that they may be read “on sight,” without the need to
decode the individual phonemic units of a word, which is a slower process that requires
an extra cognitive load (Ehri, 2005). OCs form the basis for children learning sight
words, providing a “powerful mnemonic system” (p. 172). It is the OCs that provide the
“glue” to cement words in memory. The more unpredictable these correspondences are,
the more difficult decoding and learning sight words should be. For example, evidence
has shown that OC predictability affects naming latencies, i.e., words with
inconsistent sound-spelling features are read aloud more slowly than words with consistent ones.
OCs are equally important to the process of encoding, or spelling words. Decoding
and encoding are separate but highly interrelated processes, because students are
continually learning to read and spell words simultaneously. Students are provided visual
access to a word as a reference while they learn to spell it, so decoding and encoding
skills are learned together and mutually reinforce one another. As Ehri emphasizes,
learning to spell words supports the ability to later recognize them in print, just as
multiple exposures to reading a word improve the probability one will later be able to spell it.
A systematic measure of predictability for English phonemes and graphemes has the
potential to impact our current perspectives on how difficult it is to acquire and master
both spelling and literacy in the United States. Currently, the plethora of approaches on
how best to teach literacy and spelling often leaves educators and teachers overwhelmed
and confused about how best to approach the subject (Johnston, 2000; Schlagal, 2002).
A systematic measure could help ground spelling and literacy instruction for both
typical students and outliers. Considering the
entropy of OCs, for example, might provide a convenient method for organizing words for
instruction, and could supply a baseline for future cross-linguistic studies comparing
orthographic depth and its role in literacy acquisition. Measuring entropy values
(weighted by frequency) for all phonemes and graphemes in the corpus allows both a
total phonemic entropy and total orthographic entropy value to be computed. The first
number theoretically indicates how unpredictable spoken SAE is to spell; the second
indicates how unpredictable written English is to pronounce in SAE. These numbers could
be used in quantitative comparisons with other languages where entropy has likewise
been measured, provided it can be shown that the corpora are comparable.
Literature Review
English orthography did not develop according to any singular, overarching structural
model. Linguist David Crystal’s book “Spell it Out: The History of English Spelling”
illuminates the myriad sources that have contributed
over time to the evolution of the modern English writing system (Crystal, 2012). Anglo-
Saxon, Welsh, Norman-French, Old Norse, Latin and Greek have all contributed their
own arbitrary writing conventions and stylistic forms to the recipe of English. For
example, there is no logical reason why an “–e” must be added after a “v” at the ends of
words like have, give, love, etc., other than the fact that Anglo-Norman scribes of the period established the convention that words should not end in a bare “v.”
Additionally, word spellings tend to be more stable than word pronunciations, which
change at a faster rate. Many English words retain spellings from an earlier time, when
the words were pronounced differently. Furthermore, English spellings are often used to
preserve morphological relationships, e.g. keeping the letter “g” in sign, even though
it is not pronounced, to preserve its visible link to related words such as signal and signature.
The notion that English spelling was chaotic and in need of reform percolated among
American, British, and Irish literary circles at least as far back as the 18th century.
Benjamin Franklin, Mark Twain, Noah Webster, and Andrew Carnegie are among the notable
figures who sought to reform English spelling conventions to some extent. Franklin, for
example, proposed his own streamlined
alphabet, featuring a reduction of extraneous letters and the addition of new letters to
better capture the sounds of English (Webster & Franklin, 1789). Interestingly, the one
lasting outcome from Dr. Franklin’s attempt at spelling reform was the eventual adoption
of his invented symbol for /ŋ/ by the IPA. The fact that he experienced far more success
in helping to establish a new country that broke violently away from British rule than
he ever did at reforming English orthography gives some indication as to just how
difficult such reform is.
Early spelling books arranged words alphabetically and sometimes by number of syllables
as well, but neither arrangement necessarily organized words in terms of how difficult
they were to learn (Schlagal, 2002).
Word frequency eventually came to serve as a proxy for learning difficulty. It was reasoned that more frequent words would be seen more often
and so were easier to memorize. Resources such as Thorndike’s & Lorge’s “The
Teacher’s Book of 30,000 Words” began to offer educators lists of words that were
arranged by frequency, so that children could be taught the easier, high-frequency words
first (Thorndike & Lorge, 1944). This practice continues today, with wordlists such as
Edward Fry’s “instant words” being ordered by frequency (E. B. Fry & Kress, 2012; E.
Fry, 1980).
A major advance came with the systematic analysis of the correspondences between graphemes and phonemes. The HHRH study published in 1966 was an extension of more
than a decade’s previous research on PGC frequency. It was the largest project of its kind
funded by the U.S. Department of Education, with the final publication exceeding 17,000
pages.
This was the first time such a study utilized computers to analyze large corpora of
words. The HHRH study drew upon 17,310 words from Thorndike & Lorge’s frequency
lists, and analyzed every PGC in a variety of contexts of position and stress. For
position, they listed the probabilities of each OC occurring in the initial, medial, and final
position of syllables. For stress, they listed probabilities for primary, secondary, or
unstressed syllables.
The authors concluded that English PGCs are predictable and consistent “80 percent
of the time” when both position and stress were also considered (Hodges & Rudorf,
1965). This claim is based on the probability that any given phoneme will align to its
“main” grapheme about 80% of the time. The HHRH researchers used an “80-percent
criterion” to judge whether a phoneme was predictably represented by its orthography.
This empirically derived conclusion caused a shift from
the perspective that English spelling was a broken system in need of reform to a position
that acknowledged the overall predictability of English orthography. The principle that
English spelling is mostly predictable has guided the methodology of reading and
Many researchers since have drawn upon and refined the data from the HHRH study.
Subsequent studies updated and reworked the Hanna et al. data (E. Fry, 2004), measuring
GPC probabilities for both “American” (Berndt, Reggia, & Mitchum, 1987) and “British”
(Gontijo, Gontijo, & Shillcock, 2003) English. However, there are several reasons why
the HHRH findings warrant re-examination.
First, computer technology of the era was in its infancy. Given that a cell phone in
2012 had more computational power than all of NASA during the 1969 Apollo moon
landing (Kaku, 2012), the study might have returned different outcomes had it been
undertaken more recently. The researchers admit throughout the study that linguistic
decisions were frequently constrained by the technology of the day.
Another issue is that the corpus size of ~17,000 words is relatively small by today’s
standards, where the internet and digital technology have allowed for millions and even
billions of words to be analyzed, reducing sampling error. Many studies have discussed
how large a corpus needs to be for the reliable generalization of results, and
somewhere between 16-30 million words seems optimal (Brysbaert & New, 2009).
Also contestable is the composition of the grapheme list used by the HHRH study.
There is no authoritative list of English graphemes, so the researchers had to create their
own. This is not a straightforward process, however, and requires a certain degree of
subjective judgment. One central difficulty is deciding how to handle silent letters.
Since silent letters have no phoneme with which to correspond, they are attached to an
adjacent letter which does correspond to a phoneme, i.e. the
formation of compound graphemes. The word asthma, for example, can be parsed in more
than one way, depending on which neighboring letters absorb its silent letters. There
is no obvious reason to prefer one parsing over the other. The authors of the HHRH study
were forced to make these choices, and in doing so, they effectively established research
conventions regarding how to best classify English graphemes. In several cases, these
decisions were dictated by technological constraints of the era. Thus Berndt et al., twenty
years later, noted that even while they used the HHRH classification system of
graphemes, it is short of ideal, stating that “other divisions of printed words than those
employed here would have resulted in a different set of circumstances” (Berndt et al.,
1987, p. 5). They further suggest that the entire treatment of silent letters by HHRH
may not reflect the correspondences that readers actually use.
A growing body of cross-linguistic evidence suggests that English speakers may in fact
be handicapped by the depth of their orthography. Since the concept
of orthographic depth was introduced over 25 years ago (Katz & Frost, 1992), a large
body of research has accumulated that centers on analyzing and understanding its
effects. One of the most robust findings concerns the rate of literacy acquisition: the
deeper the orthography, the longer it takes to acquire basic
literacy (Seymour et al., 2003). Whereas the average learner of English requires three
years to attain basic reading proficiency (Chall, 1967), languages with a more transparent
orthography require less time. One striking example is that children learning a highly
transparent, phonetic spelling system can, on average, reach this same level of basic
reading proficiency in six months.
In 2003, a large experiment tested decoding skills across fourteen European
orthographies (Seymour et al., 2003). Figure 1 displays the accuracy
percentages from two tasks of that study, which tested the accuracy of word-reading for
both real (content and function words) and non-real (mono and bisyllabic) words. As the
graph shows, on reading tasks where the majority of participants scored between 90-
100% for accuracy, the English-speaking children (the final bar in each category) failed
to reach 50%.
[Figure 1 appears here: a bar chart of percentage correct (0-100) for content words, function words, monosyllabic nonwords, and bisyllabic nonwords.]
Figure 1. Accuracy data from “Foundation Literacy Acquisition in European Orthographies” (Seymour et
al., 2003). English speakers performed consistently worse than peers speaking 13 other languages.
The English-speaking children in this study were drawn from schools in Scotland,
and caution is advised in assuming all English-speaking children are a homogenous group
that can be fairly represented by this small sample of the global population. Nonetheless,
similar findings have been repeated in smaller studies comparing English to various
orthographies; for example, two separate studies of Welsh (transparent) and English
(opaque), both with rigorous cross-linguistic controls, reported similar outcomes: the
Welsh groups of 6-7 year olds were able to read twice as many words as their English
matched peers (Ellis & Hooper, 2001; Hanley, Masterson, Spencer, & Evans, 2004).
Welsh and English are well-suited for linguistic comparisons, because participants can
often be drawn from the same schools or areas, where both languages are taught
simultaneously to the same age groups; this design helps minimize the effects of
extraneous demographic and educational variables.
While not all results are as dramatic, English-speaking children do appear to score
consistently lower than their age-matched peers who have learned more transparent
orthographies. Richlan (2014) provides a succinct list of such studies that involve
learners with both typically developing (Aro & Wimmer, 2003; Bergmann & Wimmer,
2008; Cossu, Gugliotta, & Marshall, 1995; Frith, Wimmer, & Landerl, 1998; Georgiou,
Torppa, Manolitsis, Lyytinen, & Parrila, 2012; Seymour et al., 2003; Wimmer &
Goswami, 1994; Zoccolotti, De Luca, Di Filippo, Judica, & Martelli, 2009) as well as
dyslexic reading acquisition (Barca, Burani, Di Filippo, & Zoccolotti, 2006; Davies,
Cuetos, & Glez-Seijas, 2007; Landerl, Wimmer, & Frith, 1997; Landerl & Wimmer,
2000; Landerl et al., 2013; Richlan, 2014; Wimmer, 1993; Wimmer & Schurz, 2010).
Conclusion
English orthography has never been viewed as being a consistently phonetic system.
Centuries of diverse cultural influences have brought their own unique aesthetics and
stylistic preferences to the English writing system. As the spoken language continued to
evolve away from its conservative spelling system, there have been sporadic but
unsuccessful attempts at reform. Organizing words by frequency, and to a lesser extent
word length, continues to be a widely used means of teaching English spelling
effectively (Schlagal, 2002). In the 1960s, the HHRH study
analyzed approximately 17,000 high-frequency words and concluded that English OCs
are “predictable 80% of the time.” This perspective has informed spelling practices in
the decades since.
Several concerns can be raised with these findings, however. Half a century has now
passed since the HHRH study was published, and modern computing technology may yield
different results today. Meanwhile, cross-linguistic findings on orthographic depth
have once again called into question the predictability of English orthography.
Measuring the entropy of English graphemes and phonemes may provide an answer to the
question of just how predictable English orthography really is.
Methodology
Corpus
This study utilized The English Lexicon Project (ELP) database (Balota et al., 2007).
This is a free, online resource that can be found at elexicon.wustl.edu, based on the
Hyperspace Analogue to Language (HAL) corpus (Burgess & Livesay, 1998). A type
count is the number of different words in a text; the ELP’s type count is 40,481. A token
count is the number of times each word appears in the corpus. Approximately 131 million word
tokens comprise the entire corpus, gathered from Usenet Internet news groups in 1995,
though more recent estimates place the number at close to 400 million words.1
Regardless, it has been shown that reliable frequency norms are achieved for both high
and low-frequency words from corpora of 16 million words, with diminishing returns
beginning at around 30 million (Brysbaert & New, 2009), so even the lower size estimate should be more than adequate for this analysis.
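The type/token distinction described above can be illustrated with a short Python sketch (the toy text is invented; the actual corpus contains roughly 131 million tokens):

```python
from collections import Counter

# A toy "corpus" for illustrating type vs. token counts.
text = "the cat sat on the mat the end"
tokens = text.split()
counts = Counter(tokens)

token_count = len(tokens)  # every occurrence of every word counts
type_count = len(counts)   # each distinct word counts only once

print(token_count)  # 8
print(type_count)   # 6
```

Here “the” contributes three tokens but only one type, mirroring the ELP’s 40,481 types versus ~131 million tokens.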
Appendix B provides a list of the phonemes utilized by the ELP. The ELP provided a
pronunciation for each word entry, based on the Standard American English dialect. It is
important to note that the resulting OCs analyzed in this study are ultimately dependent
upon the accuracy of this input. Some transcription errors and inconsistencies in the
ELP’s pronunciation listings were observed and documented below. Finer linguistic
distinctions not recognized by the ELP are beyond the scope of this study.
1. Estimate obtained from the ELP website “Database News & Update: 10/20/14” at http://elexicon.wustl.edu/
The number and shape of English graphemes vary, depending on how silent letters
are handled. In a purely phonetic alphabet, each letter would represent a single grapheme
and correspond to a single phoneme, but English orthography contains many silent
letters. These are annexed by neighboring letters to form compound graphemes. The
process of assigning silent letters to form compound graphemes, while often arbitrary,
is unavoidable.
By contrast, the number and type of phonemes in SAE are well defined, so it seems
reasonable to first identify the phonemes of a given word, and then parse the written
word into a matching sequence of graphemes. Ideally, there is a one-to-one
alignment between grapheme and phoneme, i.e. graphemes should not outnumber
phonemes in a word, though there are some cases where this occurs (see Table 4, below).
These graphemes must align with their phonemes in the same serial order, except for one special case involving the final, silent –e.
Those learning to read English are frequently taught that the final –e “lengthens” the
preceding vowel, as in mat vs. mate. Acknowledging this rule, the authors of the HHRH
study decided to treat –e as being part of a compound grapheme that includes the
preceding vowel. This was written as V_e, where V is any vowel or vowel combination.
These special graphemes will be referred to in this study as split graphemes. Thus the word mate contains the split grapheme “a_e.”
In this study, an “e” was considered as part of a split grapheme when it was directly
preceded by one or more graphemic consonants, “gu-”, or “qu-”. The associated vowel had
to be the first vowel encountered prior to those directly preceding consonants, and it
could consist of any number of consecutive vowels, scanning leftward until a consonant,
another split grapheme, or the beginning of the word was encountered. This rule flagged
the “e” as a potential candidate for being part of a split grapheme, but the parsing
algorithm only confirmed the assignment when one of two further rules applied.
The first rule followed the premise that if the vowel is not “lengthened” by the “e”,
then the “e” does not alter the vowel, and therefore must alter the adjacent consonant.
The second rule followed the premise that –e often indicates an alternative pronunciation
of the preceding consonant. Kessler & Treiman (2001), for example, provide five cases
in which the final –e modifies the consonant and not the vowel. Furthermore, they argue
that –e can perform more than one function simultaneously.
This interpretation of –e diverges from the conventions of the HHRH study, which
considered final, silent –e as part of a split grapheme in all cases. The authors of the
HHRH study, however, acknowledged that their decision regarding –e was a pragmatic
simplification that did not necessarily conform to orthographic reality (Hanna,
1966). For the current study, a middle ground was chosen between the two extremes.
While it can be argued that –e can serve more than one function simultaneously, in this
study –e can only be assigned one role, attaching either to the consonant(s) or to the
preceding vowel.
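As a rough illustration only, the candidate-flagging step described above might be sketched as follows in Python (the pattern and function name are invented, and the sketch simplifies the study’s actual rule):

```python
import re

# Flag a final "e" as a potential split-grapheme candidate when a vowel
# group is followed by consonants (or "gu"/"qu") and then the final "e",
# as in mate -> "a_e". This is a simplified stand-in for the study's rule.
SPLIT_CANDIDATE = re.compile(r"([aeiouy]+)(qu|gu|[bcdfghjklmnpqrstvwxz]+)e$")

def flag_split_grapheme(word):
    """Return the candidate split grapheme (e.g. 'a_e'), or None."""
    m = SPLIT_CANDIDATE.search(word)
    if m:
        return m.group(1) + "_e"
    return None

print(flag_split_grapheme("mate"))   # a_e
print(flag_split_grapheme("vague"))  # a_e  (the "gu-" case)
print(flag_split_grapheme("cat"))    # None
```

A flagged candidate would still need to pass the two confirmation rules described above before being parsed as a split grapheme.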
Exclusions a priori. As many words as possible were analyzed from the ELP
database of 40,481 words, but certain entries in the ELP were initially deleted.
Removing these entries left a dataset of 39,905 unique words. Words beginning with a
capital letter, i.e. proper nouns, were included in the dataset. The remaining
alterations were performed during the post-processing stage, after the OCs had been
tallied and counted.
Allophones. Allophones are sounds that speakers of a language can use interchangeably
without affecting the meaning
of words. For example, the phoneme /ɾ/, called a “flap,” which is found in the middle of
words like butter and ladder, is in allophonic variation with /t/ and /d/ in SAE. Indeed,
MEASURING ORTHOGRAPHIC PREDICTABILITY 20
the average speaker is likely unaware that this linguistic distinction even exists. However,
the entropy values for the graphemes “t” and “d” were significantly affected when this
distinction was considered, ranking them as the second and third most ambiguous
consonantal graphemes. Since it appears English speakers do not typically struggle over
this distinction, the decision was made to combine /ɾ/ with either /t/ or /d/ as appropriate.
Another minor change was the combination of /ks/ and its voiced counterpart /gz/ into
a “single phoneme” when they correspond to the grapheme “x.” Sometimes this voicing
distinction impacts meaning, as in the words “box” and “bogs.” However, the grapheme
“x” is never employed in such cases, and this voicing distinction appears to be purely
allophonic whenever “x” is used. Again, over-inflated entropy values appeared when this
distinction was made, so the decision was made to ignore this allophonic distinction.
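The post-processing merge described above amounts to folding one phoneme’s tallies into another before probabilities are computed; a toy Python sketch with invented counts:

```python
# Hypothetical token tallies for the grapheme "t": the flap /ɾ/ counts
# are folded into /t/ before entropy is calculated, as described above.
grapheme_t_counts = {"t": 5000, "ɾ": 800}
grapheme_t_counts["t"] += grapheme_t_counts.pop("ɾ")  # merge flap into /t/

print(grapheme_t_counts)  # {'t': 5800}
```

The same kind of merge handles the /ks/–/gz/ voicing distinction for the grapheme “x.”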
Apostrophes. For ease of parsing, apostrophes were treated like other letters, resulting
in a number of apostrophe-containing graphemes: seven graphemes featuring an initial
apostrophe (‘d, ‘ll, ‘m, ‘re, ‘s, ‘t, ‘ve), two graphemes with final apostrophes (n’
and o’), one low-frequency grapheme with a medial apostrophe (a’a), and a simple
grapheme composed of the apostrophe by itself. All but the latter were then combined
with the identical graphemes that did not include an apostrophe, e.g. totals for “ ‘ve”
and “ve” were combined. The singular apostrophe itself could be parsed as a simple,
silent grapheme.
A custom program was constructed using MATLAB software in order to count the
phonemes and graphemes of the corpus. First, a mapping file was created, containing a
list of all phonemes and all possible graphemes that could potentially map to each
phoneme. The program then processed each word in a systematic fashion. It first parsed
a word into separate phonemes. The term parsing is used in this study to mean dividing
a word into its constituent units; for example, the orthographic parsing of
English is “E-ng-l-i-sh,” and the phonemic parsing is /i-ŋ-l-ɪ-ʃ/. The program then
considered which graphemes in the word could legitimately correspond to each phoneme,
according to the mapping file. If a legal mapping could not be found for all OCs in a
word, the word could not be parsed and was flagged for review.
Some words could be parsed only one way, while others had multiple possible
parsings. For the latter, the algorithm then had to choose the most optimal parsing. It
accomplished this in a step-wise fashion: it compared the first two possibilities to
see which was preferable, kept that one and discarded the other, and then compared the
next possible candidate to the incumbent parsing. It proceeded in this manner until all
possible parsings had been considered and a single winner remained.
The result is that final phonemic and orthographic parsings were ultimately derived
for each word. Appendix D provides the specific guidelines and decision-making
processes the algorithm utilized. The type and token counts for all OCs could then be
automatically counted by the program. From this frequency data, probabilities for each
OC and entropy values for all phonemes and graphemes were then calculated.
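The step-wise “incumbent” comparison described above can be sketched in Python; the preference rule here is invented for illustration (the study’s actual guidelines appear in Appendix D):

```python
# Toy preference rule: a parsing with fewer graphemes wins; ties keep
# the incumbent. The real decision rules are those of Appendix D.
def prefer(parsing_a, parsing_b):
    return parsing_b if len(parsing_b) < len(parsing_a) else parsing_a

def select_parsing(candidates):
    """Step-wise tournament: challengers are compared to the incumbent."""
    incumbent = candidates[0]
    for challenger in candidates[1:]:
        incumbent = prefer(incumbent, challenger)
    return incumbent

# Two hypothetical parsings of "edge": ["e", "d", "ge"] vs. ["e", "dge"].
print(select_parsing([["e", "d", "ge"], ["e", "dge"]]))  # ['e', 'dge']
```

Only the winner of each pairwise comparison survives, so a single parsing remains after one pass through the candidates.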
Reliability
The algorithm underwent a number of trial runs so that unanticipated errors could be
identified and corrected. This process was repeated until the algorithm returned a
complete set of legal parsings (following the guidelines detailed in
Appendix D). The algorithm was then run a final time, allowing type and token counts
for all graphemes and phonemes to be tallied by the program. Appendix D contains a
sample list of 200 randomly selected words with their corresponding parsings.
A random sample of 1,000 parsed words was then submitted to an
independent rater for reliability. The rater, a graduate student familiar with IPA
symbols, was first acquainted with the grapheme shapes delineated for this study. The
rater was then instructed to highlight any words that might contain an illegal parsing.
The rater returned a list of 22 highlighted
words. It was then determined that two of the words contained an unusual but acceptable
pronunciation in the ELP, and one contained a phonemic combination which, for ease of
parsing, was parsed correctly during a later step by the algorithm. These three
questionable parsings, along with the remaining 19, were judged to all be legitimate
parsings as outlined by the grapheme definition process of this study. With 0 errors in
a sample size of 1,000, therefore, a 98% upper confidence bound of .39% can be placed
on the rate of parsing errors in the full dataset.
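The quoted bound follows from the standard binomial argument: if the true error rate were p, the probability of observing 0 errors in 1,000 independent parsings would be (1 − p)^1000. Setting that probability to 0.02 gives the 98% upper bound. A short Python check (variable names are our own):

```python
# With 0 errors in n = 1000 parsings, the 98% upper confidence bound p
# satisfies (1 - p)**n = alpha, i.e. p = 1 - alpha**(1/n).
n = 1000
alpha = 0.02
p_upper = 1 - alpha ** (1 / n)

print(round(100 * p_upper, 2))  # 0.39  (percent)
```

This matches the .39% figure quoted above.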
Calculating Entropy
Once the algorithm returned the frequency data for all phonemes and graphemes, probabilities and entropy values could then be calculated. Throughout this study, X refers to a random variable from the set of graphemes {x₁, x₂, …, xₘ} that we define based on our parsing of the ELP, where m is the number of such graphemes. Similarly, we define Y as a random variable from the set of phonemes {y₁, y₂, …, yₙ} based on our parsing of the ELP, where n is the number of such phonemes. For the random variable X (with an analogous definition for Y), the entropy of X is defined as

H(X) = −∑ p(xᵢ) log₂ p(xᵢ), where the sum runs over i = 1, …, m,

where p(xᵢ) is the probability that X takes the value xᵢ. This entropy formula underlies all of the predictability measurements in this study.
An additional calculation will yield conditional entropy, which quantifies the amount of information needed to predict the outcome of one random variable Y, given that the value of another random variable X is known.² The conditional entropy of Y given X is defined as

H(Y|X) = ∑ p(xᵢ) H(Y|X = xᵢ) = −∑ p(xᵢ) ∑ p(yⱼ|xᵢ) log₂ p(yⱼ|xᵢ)

² Throughout this study, it is also assumed that p log₂ p = 0 when p = 0.

The first expression indicates this will be a weighted average of the conditional entropy of Y given X = xᵢ over all xᵢ, where the weighting will be the probability that X takes the value xᵢ. In other words, this will be the average (with frequency weighting) over all graphemes of the difficulty in predicting how each grapheme should be pronounced. The result will be a metric which quantifies how difficult it is to predict the pronunciation of graphemes in the corpus. Analogous definitions apply for H(X|Y = yⱼ) and H(X|Y), which quantify the difficulty of predicting how each phoneme should be spelled.
An entropy value of zero indicates complete predictability, the grapheme ever corresponding to one unique phoneme. The grapheme "dge," for example, was
found to have zero entropy, because it only corresponded to the phoneme /ʤ/ as in edge,
judge, ridge, lodge, etc. Entropy values larger than zero indicate relatively increasing
unpredictability. For example, the graphemes “se” and “ce” were found to have entropy
values of .957 and .001, respectively. Therefore, “se” is more unpredictable than “ce.”
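Per-grapheme entropy is computed directly from token counts. The counts below are hypothetical, chosen so that "dge" has a single pronunciation (zero entropy) while "se" has an entropy near the .957 reported above:

```python
import math
from collections import Counter

# Hypothetical phoneme token counts for two graphemes
counts = {
    "dge": Counter({"/dʒ/": 500}),          # one pronunciation only
    "se": Counter({"/s/": 62, "/z/": 38}),  # two competing pronunciations
}

def entropy(counter):
    # H = -sum p * log2(p) over the grapheme's pronunciation probabilities
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values() if n)

h = {g: entropy(c) for g, c in counts.items()}
```

A grapheme with a single correspondence always yields exactly zero, matching the "completely predictable" category.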
Results
Graphemes
Classification. The complete list of all 322 graphemes discovered in the corpus is given in Appendix A. The total type count for all graphemes in the corpus was 264,618, and the total token count was 1,603,490,234. Each grapheme's probability is also listed, which is its frequency divided by the total number of all graphemes in the entire corpus. Graphemes were also classified as consonants or vowels and analyzed separately. A grapheme was classed as either vowel or consonant based on whether the
phoneme it corresponded to was consonantal or vocalic: for example, the grapheme “et”
in ballet was classed as a vowel, because it corresponded to the vocalic phoneme /e/.
Some ambiguous OCs could reasonably be called either consonant or vowel, and so a judgment call was made in each such case.
Length. The average grapheme length found in the corpus was 2.4 letters. Of the 322
graphemes in the corpus, 1 had six letters ("ailles"), 2 had five letters ("cques" and "tzsch"), 27 had four letters, 96 had three letters, 169 had two letters, and 27 had one letter (this includes the 26 alphabetic letters plus the apostrophe). The three graphemes longer than
four letters were found only in French or German loanwords, but were still considered for
analysis. Entropy values for each length (from six letters to one letter) were respectively
Entropy. A useful first step in interpreting the data was to remove low-frequency
graphemes with a type count of ten or less. This helped clean the data of idiosyncratic
OCs that were found in only a few related words, such as “bt” = /t/ in debt and “olo” =
/ɚ/ in colonel. There were 146 low-frequency graphemes, 124 of which also had zero entropy.
202 of the 322 graphemes had zero entropy, meaning they corresponded consistently
to the same phoneme in all instances in the corpus, and can be considered completely
predictable. 124 of these were also considered low-frequency, having a type count of 10
or less. Removing these two groups from consideration resulted in a “refined list” of 99
graphemes that had some degree of entropy, comprised of 45 consonants and 54 vowels.
Table 1. Graphemes ranked by entropy. Classed by consonant or vowel, with low-frequency and zero-
entropy graphemes removed.
Consonants generally had less entropy than vowels, with only the most entropic
consonant, “s,” having an entropy value over 1.0. An informal observation of the ranking
shows that the group with the highest general entropy were singleton vowels. Split
graphemes tended to have relatively lower entropy, clustering towards the bottom end of
the list, indicating their pronunciations to be more stable than their counterparts without a
final, silent –e. It is worth noting that this could be a consequence of how graphemes
were defined for this study. For example, if a word contained both a “short vowel” and a
final, silent –e, the –e was combined with the preceding consonant, resulting in a non-
split grapheme. A word like love, even though it might appear to the novice reader to
contain a split grapheme, is not a split grapheme according to the criteria defined in the
methodology and therefore, the entropy for the split grapheme “o_e” is unaffected by this
pronunciation.
Cluster analysis. Using the "refined" list of graphemes from Table 1, where low-frequency and zero-entropy graphemes were removed, a hierarchical cluster analysis using complete linkage was run
separately for both consonants and vowels. This is a useful method for grouping the
graphemes together in terms of similar entropy values and characterizing them in terms
of predictability. Table 2 lists the graphemes divided into groups of predictability for both consonants and vowels. Graphemes with zero entropy form a "completely predictable" group; these are listed alphabetically. The remaining categories are arranged in order of increasing entropy. This provides a useful reference for educators who wish to know which spelling patterns are most and least reliable.
Table 2. Graphemes grouped by predictability. The results of a hierarchical cluster analysis using complete
linkage, where complete predictability indicates zero-entropy.
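Complete-linkage clustering merges, at each step, the two groups whose maximum within-pair entropy difference is smallest. A self-contained pure-Python sketch on hypothetical one-dimensional entropy values (the study's own analysis presumably used standard statistical software):

```python
def complete_linkage_clusters(values, k):
    # Agglomerative clustering: start with singletons, then repeatedly merge
    # the two clusters whose complete-linkage distance (the maximum pairwise
    # difference between their members) is smallest, until k clusters remain.
    clusters = [[v] for v in values]

    def dist(a, b):
        return max(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

# Hypothetical grapheme entropy values split into three predictability bands
groups = complete_linkage_clusters([0.01, 0.05, 0.9, 1.0, 2.5], 3)
```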
Figure 2. Vowel Cluster Analysis. Dendrogram and scatter plot of vocalic graphemes with complete
linkage, grouped into five categories of decreasing entropy.
Figure 3. Consonant Cluster Analysis. Dendrogram and scatter plot of consonantal graphemes with
complete linkage, grouped into five categories of decreasing entropy.
Total entropy. Weighted entropy values (each grapheme's entropy multiplied by the probability of that grapheme appearing in the corpus) were summed together to calculate the total orthographic entropy value of .889. This number indicates the relative predictability of English orthography (as represented by the corpus). If similar analyses are conducted for other languages, these values can then be compared to show relative predictability between languages (see "Measuring Orthographic Depth" in the next section for an example).
Phonemes
Classification. Appendix C lists all phoneme-grapheme correspondences (PGCs) found in the corpus. The probability of each phoneme is listed, calculated as the frequency (token count) of that phoneme divided by all occurrences of all phonemes in the corpus. Next to each phoneme is then listed all graphemes found to correspond to that phoneme in the corpus. Next to each grapheme is the probability for that particular PGC, which is calculated by dividing the frequency (token count) of that PGC by the total number of times that phoneme occurs in the corpus. For each phoneme, the graphemes are listed in order of decreasing probability. Therefore, the first grapheme listed is the "main correspondence," the spelling pattern most often associated with a phoneme. Finally, underneath all the probabilities for each phoneme is a number in bold, which is the entropy value for that phoneme, derived using the probabilities listed.
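The layout of Appendix C can be reproduced from raw token counts. The counts below are hypothetical, scaled to echo the 0.933 probability reported in the Discussion for the /t/ → "t" correspondence:

```python
import math
from collections import Counter

# Hypothetical token counts of graphemes spelling the phoneme /t/
pgc_counts = Counter({"t": 933, "tt": 40, "ed": 27})

def pgc_summary(counts):
    # Probability of each PGC: its token count divided by the phoneme's total.
    total = sum(counts.values())
    probs = {g: c / total for g, c in counts.items()}
    main = max(probs, key=probs.get)  # the "main correspondence"
    # Phoneme entropy, derived from the PGC probabilities
    h = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return probs, main, h

probs, main, h = pgc_summary(pgc_counts)
```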
Entropy. Table 3 lists the phonemes of SAE, ranked by entropy, including the
frequency (token count) data, probability, and entropy of each phoneme in the corpus.
Multiple phonemes. Not included in Table 3 above were instances in the corpus
where multiple phonemes corresponded to single graphemes. These are listed in Table 4.
They are all relatively low-frequency, with type counts less than 1000:
Table 4. Multiple phonemes corresponding to single letters. Instances where a singleton grapheme
corresponded to more than one phoneme, which were removed from final analysis.
Entropy values weighted by probability (each phoneme's entropy multiplied by its probability of occurrence in the corpus) were summed together to calculate the total phonemic entropy value of 1.017. This number indicates the relative predictability involved in spelling spoken English. Comparing this number to the total orthographic entropy of .889 appears to indicate that encoding, overall, involves a higher degree of uncertainty than decoding. Another way to say this is that reading print is a more predictable process than spelling, which seems intuitive.
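Both system-level totals are probability-weighted sums of per-unit entropies. A minimal sketch with made-up values (the study's actual totals of .889 and 1.017 come from the full grapheme and phoneme inventories):

```python
def total_entropy(units):
    # units maps each grapheme or phoneme to a pair:
    # (probability of occurrence in the corpus, entropy of that unit)
    return sum(p * h for p, h in units.values())

# Hypothetical grapheme inventory: (probability, entropy) per grapheme
graphemes = {"t": (0.08, 0.35), "o": (0.05, 2.682), "dge": (0.001, 0.0)}
orthographic_total = total_entropy(graphemes)
```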
Discussion
The HHRH study is the most comprehensive source of evidence for the contemporary claim that English orthography is predictable "an average of approximately 80 percent" (Hanna et al., 1966, p. 33). Using their 52-phoneme classification system, the HHRH researchers found that phonemes corresponded to their main grapheme 73.13% of the time, which would have been considered near but still short of "predictable." When the researchers then considered the additional linguistic factors of syllable position and stress, predictability rose above the 80% criterion.
This current study did not classify the data in terms of syllable position and stress, but
probabilities for all “main” PGCs were determined in the process of calculating
probabilities for all OCs. This data can be found in Appendix C. For each phoneme, the
first grapheme listed is the “main spelling” with the highest probability. So, for example,
the phoneme /t/’s main spelling is “t,” and /t/ was found in the corpus to be written as “t”
93% of the time (a probability of 0.933). Taking an average for all of these main spellings
resulted in the overall probability of 0.7326 that any given phoneme would correspond to
its main grapheme. In other words, English phonemes in the corpus are written as their main spellings approximately 73% of the time. This number falls within roughly a tenth of a percentage point of the number derived by the
HHRH authors. Such close agreement reinforces the findings of both the HHRH study
and the current study, at least as far as the highest probability PGCs are concerned. It also
indicates that increasing the corpus size from ~17,000 highest frequency words to ~131
The above calculation was then performed for GPC probabilities, analyzing the probability that each grapheme corresponds to its "main pronunciation." Summing the weighted probabilities across all graphemes yielded an overall value of 74.49%. In the terminology of the HHRH study, then, English appears predictable roughly three-fourths of the time. This is about 5 percentage points below the 80-percent criterion used as the benchmark for predictability, but factoring in additional linguistic parameters like stress and position would most likely lead to an increase in predictability, as evidenced by the HHRH study.
We argue that considering only the "main correspondence," however, does not adequately describe predictability. Entropy, by contrast, takes into account the probability of every, not just the main, correspondence. Consider the phonemes X and Y, where both have their main spelling correspondence 80% of the time, but phoneme X has only one alternate spelling for the remaining 20% of the time, while Y has 4 alternate spellings splitting that remainder. Y is clearly harder to predict, and entropy captures this difference where a main-correspondence count does not. This is how effectively entropy captures the notion of orthographic depth (Schmalz et al., 2015).
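The hypothetical phonemes X and Y make the point concrete: with the remaining 20% split one way versus four ways, entropy separates two phonemes that a main-correspondence count treats as identical (the 0.8/0.2 and 0.8 plus four-times-0.05 distributions are the illustrative values from the text):

```python
import math

def H(probs):
    # Shannon entropy over a probability distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_x = H([0.8, 0.2])                     # one alternate spelling
h_y = H([0.8, 0.05, 0.05, 0.05, 0.05])  # four alternate spellings
# h_y exceeds h_x even though both main correspondences sit at 80%
```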
Orthographic depth has been shown to relate to rates of foundation literacy acquisition (Seymour et al., 2003), naming accuracy and latency (Ellis et al., 2004), and word-initial sound-spelling consistency effects (Borgwaldt et al., 2005). Since phonemes and graphemes are the universal building blocks of alphabetic languages, entropy values computed over them can be compared across writing systems. One prior study calculated phoneme-grapheme entropy values for modern Greek (Protopapas & Vlahou, 2009). Despite having separate phonologies and alphabets, English and Greek could be
compared in terms of entropy. Greek was reported to have lower total orthographic and phonemic entropy values than the comparative values of .889 and 1.017 found in this current study for Standard American English. Standard American English, therefore, is more unpredictable for both decoding and encoding than Greek. The same comparisons could be carried out for consonants, vowels, initial letters, initial phonemes, and even morphological endings, provided comparably defined data exist.
The problem is that the process of defining graphemes ultimately affects their entropy values. The choice as to whether a final, silent –e attaches to the preceding vowel or consonant, for example, determines which OCs (and which letter combinations) are to be analyzed, and this decision will impact how predictable the
overall orthography is judged to be. The first step towards having a valid cross-linguistic
entropy measure, therefore, is for general agreement to be reached within each language
community as to how their own graphemes are to be defined. English speakers face a more difficult challenge in this respect than speakers of many other European languages with more transparent orthographies. This current study offers one definitive list of graphemes, but this list was constructed through a series of decisions regarding the shape of graphemes that other researchers might reasonably have made differently. Until a consensus within the research community is reached that answers the question, "What are the graphemes of English?", entropy may not offer a truly absolute scale for cross-linguistic comparisons.
Word-Level Entropy
One potential application for the data involves formulating word entropies. Entropy
values for graphemes or phonemes of a word could be summed to calculate the entropy
value of that word, with the caveat that much more research is needed to determine how
precisely this metric would quantify a word’s difficulty in decoding or encoding. Such a
scheme ignores additional linguistic parameters, such as the impact of stress and meaning
or the effects of rime stability, for example, but it may still provide a convenient tool for gauging the relative difficulty of whole words.
A word’s orthographic entropy could simply be considered the sum of the entropy
values of its graphemes. For example, the orthographic entropy for the word sock,
written as “sock,” would equal “s” + “o” + “ck” = 1.027 + 2.682 + 0 = 3.709. The longer
the word, generally the greater the entropy (though this depends on the entropy of each
grapheme), which can account for how word complexity increases with length.
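The sock computation above is a simple lookup-and-sum, assuming a table of the grapheme entropy values reported in this study:

```python
# Entropy values for the graphemes of "sock", as reported in this study
GRAPHEME_ENTROPY = {"s": 1.027, "o": 2.682, "ck": 0.0}

def word_orthographic_entropy(graphemes, table=GRAPHEME_ENTROPY):
    # A word's orthographic entropy is the sum of its graphemes' entropies.
    return sum(table[g] for g in graphemes)

e = word_orthographic_entropy(["s", "o", "ck"])  # 3.709
```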
A word’s phonemic entropy, conversely, would be the sum of entropy values of its
phonemes. The word sock, spoken as /sɑk/, would have a phonemic entropy of /s/ + /ɑ/ +
/k/ = 1.303 + 1.097 + 1.593 = 3.993. Both orthographic and phonemic entropy values
describe the degree of predictability between the spoken and written forms of the word
sock. Due to the intricate relationship between encoding and decoding, it remains unclear
which calculation provides a better descriptor. Indeed, both numbers may be required to
accurately represent the word’s predictability. It is also possible to sum orthographic and
phonemic entropy values of a word into a single "total entropy" value, yet doing so may obscure whether decoding or encoding drives a word's difficulty.

Traditional method. Using entropy as a measurement for how difficult words are to learn can lead to the creation of grade-leveled wordlists and basal spellers which are ordered by predictability, a potential resource for educators who are unsure of how to choose appropriate words to be taught for spelling instruction. Entropy-based word selection might help to remedy the fact that basal spellers do not generally cover the depth of orthography that English has (Foorman & Petscher, 2010).
Spelling instruction practices vary widely across the United States, and the reality of the classroom is that often a single common list is used
for all children in a classroom, even if the instructor believes this is not the most effective
method for every child. A survey (Fresch, 2003) that compared teacher beliefs and practices in a nationwide sample of elementary schools found that while only 45% of
interviewed teachers agreed that one common spelling list for a whole classroom was the
most effective method for teaching spelling, 72% put this into practice. Of those that
agreed, 88% put this into practice. This was the highest level of practice and theoretical
agreement among all 355 participants. Notably, many survey respondents expressed frustration with the unpredictability of English orthography, citing "words that do not 'follow the rules'" and the "many exceptions" in English spelling. A developmental approach, by contrast, emphasizes the individual learning needs of a student as indicated by their current stage
of development, which describes the common process of how children typically acquire
orthographic knowledge (Ehri, 2005; Henderson & Beers, 1980). Ehri describes literacy
acquisition in four stages, which must occur in sequence, though they do not necessarily
correspond with age. An older child with a language impairment, for example, might not
have advanced at the same rate as his peers and would not benefit from studying the same wordlist as his typically developing classmates.
Entropy-based word selection might have the greatest benefit for learners in the
second stage of literacy acquisition, termed “partial alphabetic.” At this stage, according
to Ehri, a child is working to learn sight words by cementing OCs in memory. The child
has a rudimentary but incomplete knowledge of OCs. They will often spell words
incorrectly but phonetically, based on the rules they have learned to that point. Exposing the child to words which begin predictably but successively increase in uncertainty could provide an appropriate scaffolding technique, particularly for readers who are struggling with the inconsistencies of English orthography.
In the third stage of development, the “full alphabetic stage,” English learners have
acquired the “major” OCs and the ability to segment words phonologically, based on the
graphemes they read. Here again, knowledge of the entropy of specific graphemes could
aid instruction. The student can be exposed to more challenging, entropic spellings as
their knowledge grows. Students experiencing difficulty could receive words with
decreased entropy compared to the rest of the class; outliers doing particularly well could receive words with increased entropy. Entropy need not be the only factor to play a role in word selection for spelling instruction, which traditionally relies on word frequency and developmental stage. Indeed, traditional intuitions about the reading/spelling difficulty of words may even run contrary to entropy analyses. It would
seem awkward, for example, to teach graphemes simply on the basis of entropy, as some
of the most entropic graphemes are in fact single letters, e.g., a, o, u, s, and f. Common
practice dictates that a student’s literacy journey begins with memorizing the alphabet,
where letters are ranked with no regard for predictability, e.g. “r” and “s,” the most and
least predictable consonants in Table 1, are alphabetic neighbors and learned together.
Along the same lines, spelling instruction customarily introduces “simpler” (i.e.
shorter) spelling patterns first. When considering vowel phonemes, spelling patterns for
“short” vowels are usually taught before “long” vowels because they are more frequent
and feature shorter spelling patterns. Yet Table 3 indicates the majority of these “short”
(technically lax) vowels feature higher entropy than their “long” (technically tense)
counterparts (e.g., /ɪ/ vs. /i/, /ʌ/ vs. /u/, etc.). According to their entropy values, long/tense
vowels are more predictably spelled. Therefore, students should theoretically be able to learn these correspondences more easily than the more uncertain short vowels.
Limitations
The validity of this analysis depends first on the accuracy of the transcription data of the ELP. Observed errors are documented in
Appendix D. The functionality of this analysis also relies on the ability of the results to
be generalized from the corpus to SAE as a whole. The HAL database, on which the ELP
is based, is both sufficiently sized and suitably descriptive of oral language, having been
described as “conversational and noisy, much like spoken language” (Burgess & Livesay,
1998), but it should be noted this is solely an adult-based lexicon. If the results of this study are to be applied to children in the process of literacy acquisition, an analysis drawn from children's literature may be more suitable. The current results, on the other hand, may more easily generalize to adult readers. A further limitation concerns how larger linguistic units influence the complex cognitive processes underlying reading
and writing.
For example, measuring the entropy of individual graphemes assumes that the grapheme is the basic unit of written language processing, but studies have shown evidence to the contrary. The rime, a larger linguistic unit consisting of a vowel and any consonants following it in a syllable, has been shown to be a more stable unit in English than the grapheme, at least for monosyllabic words (Treiman et al., 1995).
For example, the grapheme “o” may be pronounced numerous ways, but only one
pronunciation is ever found when combined with the grapheme “ck” to produce the
orthographic rime “-ock,” found in words like sock, lock, dock, etc. This rime
corresponds highly consistently to the phonemic rime /ɑk/. The "ck" grapheme, in effect, signals the reader that the preceding "o" is to correspond to a specific phoneme, /ɑ/, regardless of how unpredictable "o" is elsewhere. Even so, Kessler and Treiman (2001), who have written
extensively on the subject of the rime-analysis of English, concluded that despite a higher
degree of consistency at the rime level, English rimes “are not processed as individual
units. Rather, the basic processing seems to occur at a phonemic-graphemic level that
takes into account the context in which each phoneme-grapheme is found” (Protopapas &
Vlahou, 2009, p. 992, referencing Treiman et al., 1995). In other words, rime-level consistency does not necessarily mean that readers process rimes as units. While the entropy model cannot perfectly imitate the complex process of reading, a number of decisions in processing the data were made in order to more closely
approximate natural reading, as described in the methodology section. It should be noted that such decisions do not have a specific evidentiary basis, and this analysis would be strengthened by data showing that students actually process the graphemes defined by this study. Until such data is collected, defining the shape of graphemes remains partly a matter of convention: it is unclear, for example, whether the final –e in words like love and edge is processed as part of a split grapheme or as attached to the preceding consonant. Moving forward, research which answers these questions may help solidify a general consensus.
Future Research
Research that aims to discover how closely orthographic entropy tracks the actual difficulty of literacy acquisition would be especially useful, and so future experimental research could determine what correlation might
exist between the mathematical predictability of a word’s composition and how difficult
it is to read or spell. The ELP provides experimental behavioral data which could
possibly be used in such future studies. Naming and lexical decision data can be accessed
for the 40,481 words listed in the restricted ELP database. If indeed a word’s entropy can
be considered the sum of the entropies of its individual graphemes (as discussed above), a
conceivable application of this data would be to calculate entropy values for the words of
the ELP database and then analyze the results for any correlations between entropy
values and the naming and lexical decision data provided. This might provide some
insight into the relationship between a word's entropy and how long it takes to decode. If such a correlation were found, it would suggest that students may learn new vocabulary more quickly and easily if exposed to words of gradually increasing entropy.
Just as the HHRH study analyzed the additional linguistic parameters of stress and
syllable position, a future analysis might include these factors when considering the entropy of OCs. The ELP contains information regarding both stress and syllable position, and therefore it would be feasible to construct a more elaborate software program that would account for these parameters using the same corpus. The results of such an undertaking would most likely provide a more accurate picture of orthographic predictability.
It would also be interesting to see how the structure of current reading programs compares with an entropy-based ordering of OCs, particularly programs such as Orton-Gillingham or the Association Method, which have unique methods for teaching sound-spelling correspondences. Would a reading program designed around entropy show improved outcomes compared to these established programs? One further avenue of future research would involve database analyses similar to this current
study for other languages. The entropy values from such studies could then be compared to one another, and the relative complexity of various languages assessed on a common scale.
Theoretically, the type of analysis conducted in this study could be extended not just
to other languages, but to various dialects within English as well. Now that we have a measure of how predictably SAE corresponds to its orthography, one can ask: is it
more or less entropic than other dialects of English? For example, to what degree does
SAE diverge from its orthography compared to African-American English (AAE)? Does
this degree of distance between a spoken dialect and the orthography, as expressed by
entropy, correlate at all with literacy outcomes, i.e. does having an accent further
removed from standard written English result in more difficulty in learning to read?
Unfortunately, the ELP provides only pronunciation for Standard American English. This
means that before any such cross-dialectal comparisons can be made, researchers would
need access to a sufficiently sized corpus with a pronunciation guide representing other dialects, ideally one that also codes the features of stress and syllabic position not addressed in this study. Certainly if these parameters were included, the resulting analysis would be richer; as it stands, grapheme-level entropy is a conservative measurement of the reading experience, and one limitation of this study is that the interplay between graphemes is not adequately captured in this current analysis.
Examining stress and position would provide a picture of orthographic predictability that
is most likely closer to the reading experience than simply analyzing individual
graphemes. Graphemes do not exist isolated in space, after all, and prior studies such as
HHRH have shown how predictability is increased when stress and position are
considered.
Conclusion
After formulating a definitive list of English graphemes, the current study calculated
probabilities of correspondence and entropy values for all phonemes and graphemes in
Standard American English as well as a total system entropy for both orthographic and
phonemic predictability. Phonemes and graphemes were then ranked in terms of their
predictability, a scheme that provides a convenient resource for educators and researchers
who wish to identify the most and least predictable sounds and spelling patterns of the
language. The study confirmed the prior findings of the HHRH study, which asserted that
"English is 80% predictable," but suggested this is a simplistic measurement which may obscure important differences in predictability among individual correspondences.
Entropy values for phonemes and graphemes may provide a foundation for further
research into the predictability of English orthography. Like word frequency, spelling
predictability may be a useful measurement for how difficult words are to learn, offering a new basis for selecting words for spelling instruction. Children in various stages of literacy development could also benefit from entropy-guided word selection, though only future studies can address how the entropy of graphemes and phonemes impacts the cognitive processes underlying reading and spelling.
The complex nature of these psycholinguistic processes may never be fully distillable into a single value, and further research is needed to determine which formulation of entropy best describes the predictability of words. Also, before entropy can be used for
comparisons across languages and/or dialects, a consensus must first be reached on the
composition and structure of graphemes, particularly for English. Once these technical
issues have been addressed, entropy might provide a useful tool for future cross-linguistic
research. Its mathematical foundation highlights the fact that, ultimately, languages and their orthographies are quantifiable, comparable systems.
References
Aro, M., & Wimmer, H. (2003). Learning to read: English in comparison to six more
regular orthographies. Applied Psycholinguistics, 24(04), 621-635.
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., . . .
Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3),
445-459.
Barca, L., Burani, C., Di Filippo, G., & Zoccolotti, P. (2006). Italian developmental
dyslexic and proficient readers: Where are the differences? Brain and Language,
98(3), 347-351.
Berndt, R. S., Reggia, J. A., & Mitchum, C. C. (1987). Empirically derived probabilities
for grapheme-to-phoneme correspondences in English. Behavior Research Methods,
Instruments, & Computers, 19(1), 1-9.
Borgwaldt, S. R., Hellwig, F. M., De Groot, A., & Licht, R. (2006). Word-initial sound-
spelling patterns: Cross-linguistic analyses and empirical validations of phoneme-
letter feedback consistency effects. UofA Working Papers in Linguistics, 1
Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. (2004). Word-initial entropy in five
languages: Letter to sound, and sound to letter. Written Language & Literacy, 7(2),
165-184.
Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. M. (2005). Onset entropy matters–
Letter-to-phoneme mappings in seven languages. Reading and Writing, 18(3), 211-
229.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research
Methods, 41(4), 977-990.
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in
a basic word recognition task: Moving on from Kučera and Francis. Behavior
Research Methods, Instruments, & Computers, 30(2), 272-277.
Cossu, G., Gugliotta, M., & Marshall, J. C. (1995). Acquisition of reading and written
spelling in a transparent orthography: Two non parallel processes? Reading and
Writing, 7(1), 9-22.
Crystal, D. (2012). Spell it out: The singular story of English spelling. Profile Books.
Davies, R., Cuetos, F., & Glez-Seijas, R. M. (2007). Reading development and dyslexia
in a transparent orthography: A survey of Spanish children. Annals of Dyslexia,
57(2), 179-198.
Delattre, M., Bonin, P., & Barry, C. (2006). Written spelling to dictation: Sound-to-
spelling regularity affects both writing latencies and durations. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 32(6), 1330.
Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. Scientific
Studies of Reading, 9(2), 167-188.
Ellis, N. C., & Hooper, A. M. (2001). Why learning to read is easier in Welsh than in English: Orthographic transparency effects evinced with frequency-matched tests.
Applied Psycholinguistics, 22(04), 571-599.
Ellis, N. C., Natsume, M., Stavropoulou, K., Hoxhallari, L., van Daal, V. H., Polyzoe, N., .
. . Petalas, M. (2004). The effects of orthographic depth on learning to read
alphabetic, syllabic, and logographic scripts. Reading Research Quarterly, 39(4),
438-468.
Foorman, B. R., & Petscher, Y. (2010). Development of spelling and differential relations
to text reading in grades 3-12. Assessment for Effective Intervention, 36(1), 7-20.
Frith, U., Wimmer, H., & Landerl, K. (1998). Differences in phonological recoding in
German- and English-speaking children. Scientific Studies of Reading, 2(1), 31-54.
Fry, E. (1980). The new instant word list. The Reading Teacher, 34(3), 284-289.
Fry, E. B., & Kress, J. E. (2012). The reading teacher's book of lists. John Wiley & Sons.
Georgiou, G. K., Torppa, M., Manolitsis, G., Lyytinen, H., & Parrila, R. (2012).
Longitudinal predictors of reading and spelling across languages varying in
orthographic consistency. Reading and Writing, 25(2), 321-346.
Hanley, R., Masterson, J., Spencer, L., & Evans, D. (2004). How long do the advantages
of learning to read a transparent orthography last? An investigation of the reading skills and reading impairment of Welsh children at 10 years of age. The Quarterly
Journal of Experimental Psychology: Section A, 57(8), 1393-1410.
Hodges, R. E., & Rudorf, E. H. (1965). Searching linguistics for cues for the teaching of
spelling. Elementary English, 42(5), 527-533.
Kaku, M. (2012). Physics of the future: How science will shape human destiny and our daily lives by the year 2100. Anchor.
Katz, L., & Frost, R. (1992). The reading process is different for different orthographies: The orthographic depth hypothesis. Advances in Psychology, 94, 67-
84. doi:http://dx.doi.org/10.1016/S0166-4115(08)62789-2
Kessler, B., & Treiman, R. (2001). Relationships between sounds and letters in English
monosyllables. Journal of Memory and Language, 44(4), 592-617.
Landerl, K., & Wimmer, H. (2000). Deficits in phoneme segmentation are not the core problem of dyslexia: Evidence from German and English children. Applied
Psycholinguistics, 21(02), 243-262.
Landerl, K., Wimmer, H., & Frith, U. (1997). The impact of orthographic consistency on
dyslexia: A German-English comparison. Cognition, 63(3), 315-334.
Landerl, K., Ramus, F., Moll, K., Lyytinen, H., Leppänen, P. H. T., Lohvansuu, K., . . .
Schulte-Körne, G. (2013). Predictors of developmental dyslexia in European orthographies with varying complexity.
Moats, L. C. (2005). How spelling supports reading. American Educator, 29(4), 12-22,
42-43.
National Reading Panel (US), & National Institute of Child Health and Human
Development (US). (2000). Report of the National Reading Panel: Teaching children
to read: An evidence-based assessment of the scientific research literature on
reading and its implications for reading instruction: Reports of the subgroups.
National Institute of Child Health and Human Development, National Institutes of
Health.
Schmalz, X., Marinus, E., Coltheart, M., & Castles, A. (2015). Getting to the bottom of
orthographic depth. Psychonomic Bulletin & Review, 22(6), 1614-1629.
doi:10.3758/s13423-015-0835-2
Seymour, P. H. K., Aro, M., & Erskine, J. M., in collaboration with COST Action A8
network. (2003). Foundation literacy acquisition in European orthographies. British
Journal of Psychology, 94(2), 143-174. doi:10.1348/000712603321661859
Sharp, A. C., Sinatra, G. M., & Reynolds, R. E. (2008). The development of children's
orthographic knowledge: A microgenetic perspective. Reading Research Quarterly,
43(3), 206-226.
Thorndike, E. L., & Lorge, I. (1944). The teacher's word book of 30,000 words. New
York: Columbia University, Teachers College.
Treiman, R., Mullennix, J., Bijeljac-Babic, R., & Richmond-Welty, E. D. (1995). The
special role of rimes in the description, use, and acquisition of English orthography.
Journal of Experimental Psychology: General, 124(2), 107.
Webster, N., & Franklin, B. (1789). Dissertations on the English language: With notes,
historical and critical, to which is added, by way of appendix, an essay on a
reformed mode of spelling, with Dr. Franklin's arguments on that subject. By Noah
Webster, Jun. Esquire. [Two lines in Latin from Tacitus]. For the author, by Isaiah
Thomas and Company.
Wimmer, H., & Schurz, M. (2010). Dyslexia in regular orthographies: Manifestation and
causation. Dyslexia, 16(4), 283-299.
Zoccolotti, P., De Luca, M., Di Filippo, G., Judica, A., & Martelli, M. (2009). Reading
development in an orthographically regular language: Effects of length, frequency,
lexicality and global processing ability. Reading and Writing, 22(9), 1053-1079.
Zoccolotti, P., De Luca, M., Di Pace, E., Gasperini, F., Judica, A., & Spinelli, D. (2005).
Word length effect in early reading and in developmental dyslexia. Brain and
Language, 93(3), 369-373.
Appendix A Continued
No. Grapheme Type Count Token Count Probability Entropy
37 bt 18 87,665 0.01% 0.000
38 c 9,418 40,017,797 2.49% 0.580
39 cc 150 434,614 0.03% 0.000
40 cch 5 839 <0.01% 0.000
41 ce 844 5,659,217 0.35% 0.001
42 ces 2 2,361 <0.01% 0.000
43 ch 1,255 7,122,450 0.44% 0.933
44 che 9 36,800 <0.01% 0.883
45 chsi 1 28 <0.01% 0.000
46 cht 6 1,338 <0.01% 0.000
47 ci 203 804,733 0.05% 0.388
48 ck 923 2,919,639 0.18% 0.000
49 cq 20 25,274 <0.01% 0.000
50 cqu 4 1,900 <0.01% 0.000
51 cques 1 4,567 <0.01% 0.000
52 cs 1 3,578 <0.01% 0.000
53 ct 8 8,422 <0.01% 0.000
54 cu 3 6,548 <0.01% 0.000
55 cz 4 4,878 <0.01% 0.450
56 d 10,566 60,894,576 3.79% 0.187
57 dd 199 968,904 0.06% 0.000
58 de 18 20,029 <0.01% 0.000
59 dg 53 96,498 0.01% 0.000
60 dge 76 268,631 0.02% 0.000
61 dh 1 1,485 <0.01% 0.000
62 di 7 21,233 <0.01% 0.000
63 dj 35 33,852 <0.01% 0.000
64 dn 3 14,826 <0.01% 0.000
65 dth 2 153 <0.01% 0.000
66 e 15,410 103,894,768 6.47% 1.845
67 e_e 137 1,804,972 0.11% 0.014
68 ea 1,309 7,545,614 0.47% 1.705
69 ea_e 60 978,406 0.06% 0.000
70 ear 70 773,428 0.05% 0.000
71 eau 22 65,434 <0.01% 1.032
72 eaux 1 518 <0.01% 0.000
73 ed 1,563 4,059,494 0.25% 0.909
74 ee 790 5,913,902 0.37% 0.168
75 ee_e 20 33,338 <0.01% 0.000
76 eh 2 13,918 <0.01% 0.096
77 ei 97 1,215,778 0.08% 1.278
78 ei_e 4 7,924 <0.01% 0.053
79 eigh 56 126,170 0.01% 0.644
80 eii 1 127 <0.01% 0.000
81 el 133 585,208 0.04% 0.000
82 ell 20 42,750 <0.01% 0.000
83 em 1 291 <0.01% 0.000
84 en 356 1,297,055 0.08% 0.000
85 eo 19 807,520 0.05% 0.093
86 eou 3 8,587 <0.01% 0.000
87 er 5,258 23,048,328 1.44% 1.042
88 ere 3 684,551 0.04% 0.003
89 err 57 122,344 0.01% 0.962
90 erwr 6 1,656 <0.01% 0.000
91 es 413 2,147,378 0.13% 0.004
92 et 27 9,834 <0.01% 0.000
93 eu 86 135,096 0.01% 1.632
94 eur 18 21,762 <0.01% 0.845
95 ew 165 1,606,405 0.10% 0.727
96 ewe 1 418 <0.01% 0.000
97 ey 177 2,517,956 0.16% 1.020
98 eye 32 107,802 0.01% 0.000
99 ez 1 771 <0.01% 0.000
100 f 3,623 37,433,336 2.33% 0.874
101 fe 1 63 <0.01% 0.000
102 ff 391 2,045,640 0.13% 0.000
103 ffe 1 379 <0.01% 0.000
104 ft 8 120,561 0.01% 0.000
105 g 3,704 16,164,105 1.01% 0.732
106 ge 407 2,372,177 0.15% 0.089
107 gg 225 220,517 0.01% 0.132
108 gh 63 317,754 0.02% 0.591
109 gi 42 185,015 0.01% 0.000
110 gm 3 619 <0.01% 0.000
111 gn 96 430,057 0.03% 0.007
112 gu 105 572,154 0.04% 0.000
113 gue 19 22,728 <0.01% 0.000
114 h 1,960 16,493,014 1.03% 0.000
115 ha 7 12,312 <0.01% 0.150
116 he 4 2,428 <0.01% 1.000
117 hei 4 2,686 <0.01% 0.000
118 her 4 8,713 <0.01% 0.000
119 hi 14 35,987 <0.01% 0.631
120 ho 16 61,294 <0.01% 0.000
121 hoe 1 124 <0.01% 0.000
122 hoi 1 80 <0.01% 0.000
123 hou 5 125,350 0.01% 0.000
124 hu 2 1,236 <0.01% 0.000
125 i 20,260 113,079,072 7.04% 1.209
126 i_e 1,520 9,587,698 0.60% 0.356
127 ia 57 314,693 0.02% 1.083
128 iar 12 36,091 <0.01% 0.000
129 ie 400 1,265,508 0.08% 1.697
130 ie_e 26 288,504 0.02% 0.000
131 ier 2 186 <0.01% 0.000
132 ieu 6 8,940 <0.01% 0.310
133 iew 25 263,882 0.02% 0.000
134 ig 2 5,467 <0.01% 0.000
135 igh 241 1,871,680 0.12% 0.000
136 ign 1 2,072 <0.01% 0.000
137 il 75 147,455 0.01% 0.000
138 ile 25 40,227 <0.01% 0.000
139 ill 10 2,201 <0.01% 0.567
140 in 98 350,753 0.02% 1.000
141 ing 9 21,079 <0.01% 0.000
142 io 53 311,539 0.02% 0.000
143 ior 11 79,637 <0.01% 0.000
144 iou 2 618 <0.01% 0.000
145 ioux 1 1,316 <0.01% 0.000
146 ir 250 1,138,712 0.07% 0.261
147 irr 7 5,253 <0.01% 0.000
148 is 11 32,953 <0.01% 0.670
149 it 2 1,175 <0.01% 0.000
150 iu 3 8,351 <0.01% 0.000
151 j 574 3,701,888 0.23% 0.071
152 ju 5 17,056 <0.01% 1.065
153 k 1,740 10,866,474 0.68% 0.000
154 ke 1 11,934 <0.01% 0.000
155 kg 1 230 <0.01% 0.000
156 kh 5 8,364 <0.01% 0.546
157 kk 2 56 <0.01% 0.000
158 kn 73 1,235,975 0.08% 0.000
159 l 10,943 46,569,549 2.90% 0.042
160 ld 12 2,606,500 0.16% 0.000
161 le 1,276 5,375,598 0.33% 0.036
162 lf 11 85,946 0.01% 0.000
163 lh 4 5,271 <0.01% 0.000
164 lk 61 367,405 0.02% 0.000
165 ll 1,278 10,953,011 0.68% 0.047
166 lle 14 16,323 <0.01% 0.000
167 lm 24 32,248 <0.01% 0.000
168 ln 2 8,735 <0.01% 0.000
169 lv 3 2,154 <0.01% 0.000
170 lve 2 244 <0.01% 0.000
171 m 7,912 44,200,165 2.75% 0.054
172 mb 67 99,871 0.01% 0.000
173 me 34 2,456,437 0.15% 0.000
174 mm 366 1,396,853 0.09% 0.000
175 mme 2 2,908 <0.01% 0.000
176 mmes 1 483 <0.01% 0.000
177 mn 17 79,355 <0.01% 0.055
178 mp 1 222 <0.01% 0.000
179 n 15,884 102,493,345 6.38% 0.322
180 nd 13 10,643 <0.01% 0.596
181 ne 100 3,060,629 0.19% 0.059
182 ng 3,343 14,242,619 0.89% 0.000
183 ngue 6 17,467 <0.01% 0.000
184 nh 1 654 <0.01% 0.000
185 nm 11 297,371 0.02% 0.000
186 nn 328 1,281,013 0.08% 0.000
187 nne 6 16,230 <0.01% 0.000
188 nt 2 170 <0.01% 0.000
189 o 10,917 101,159,136 6.30% 2.682
190 o_e 669 4,582,654 0.29% 0.481
191 oa 373 737,843 0.05% 0.897
192 oa_e 2 533 <0.01% 0.000
193 oar 2 706 <0.01% 0.000
194 oe 56 122,058 0.01% 1.206
195 oh 8 338,392 0.02% 0.938
196 oi 246 976,408 0.06% 0.040
197 oi_e 9 23,095 <0.01% 0.000
198 ois 4 17,436 <0.01% 0.310
199 ol 35 79,933 <0.01% 0.894
200 olo 3 3,628 <0.01% 0.000
201 om 15 4,872 <0.01% 0.000
202 ome 28 19,566 <0.01% 0.000
203 on 2,097 8,818,144 0.55% 0.230
204 onn 2 19,009 <0.01% 0.000
205 oo 858 3,895,895 0.24% 1.289
206 oo_e 28 94,585 0.01% 0.000
207 ooh 1 1,428 <0.01% 0.000
208 oor 2 1,998 <0.01% 0.000
209 or 1,052 5,535,041 0.34% 0.964
210 orr 38 211,821 0.01% 0.736
211 os 2 446 <0.01% 0.000
212 ot 4 2,867 <0.01% 0.000
213 ou 1,400 18,438,447 1.15% 2.177
214 ou_e 111 216,372 0.01% 0.505
215 ough 50 1,008,396 0.06% 1.601
216 oui 2 134 <0.01% 0.860
217 oup 2 2,482 <0.01% 0.000
218 our 55 129,262 0.01% 0.315
219 ous 1 771 <0.01% 0.000
220 ow 656 6,051,793 0.38% 1.105
221 owe 3 8,188 <0.01% 0.000
222 oy 131 510,271 0.03% 0.038
223 oy_e 1 2,933 <0.01% 0.000
224 p 7,765 35,274,646 2.20% 0.000
225 pb 3 1,380 <0.01% 0.000
226 pe 3 43,189 <0.01% 0.000
227 ph 530 1,066,337 0.07% 0.220
228 pn 2 1,163 <0.01% 0.000
229 pp 496 2,402,664 0.15% 0.000
230 ppe 2 315 <0.01% 0.000
231 pph 1 2,832 <0.01% 0.000
232 ps 40 56,592 <0.01% 0.000
233 pt 8 8,521 <0.01% 0.000
234 q 443 1,966,725 0.12% 0.000
235 qu 47 75,978 <0.01% 0.000
236 que 22 43,577 <0.01% 0.000
237 r 13,293 67,773,671 4.22% 0.000
238 re 489 9,600,092 0.60% 0.035
239 rh 28 20,260 <0.01% 0.000
240 ro 9 22,769 <0.01% 0.000
241 rps 1 9,467 <0.01% 0.000
242 rr 334 879,006 0.05% 0.000
243 rre 2 10,479 <0.01% 0.000
244 rrh 1 149 <0.01% 0.000
245 rror 3 37,225 <0.01% 0.000
246 rt 3 6,210 <0.01% 0.000
247 s 18,696 105,633,904 6.58% 1.027
248 sc 129 371,066 0.02% 0.160
249 sce 48 12,087 <0.01% 0.000
250 sch 8 6,518 <0.01% 0.000
251 sci 15 27,403 <0.01% 0.000
252 se 200 1,696,607 0.11% 0.957
253 sh 1,220 4,198,020 0.26% 0.001
254 shi 19 42,067 <0.01% 0.000
255 si 190 790,039 0.05% 0.540
256 sl 6 57,284 <0.01% 0.000
257 ss 1,405 4,825,294 0.30% 0.387
258 sse 9 3,871 <0.01% 0.000
259 ssi 121 572,191 0.04% 0.003
260 st 84 140,492 0.01% 0.000
261 sth 2 1,858 <0.01% 0.000
262 sw 15 204,666 0.01% 0.000
263 t 19,481 113,483,774 7.21% 0.112
264 tch 196 490,222 0.03% 0.000
265 te 205 892,866 0.06% 0.000
266 tes 1 717 <0.01% 0.000
267 th 1,204 51,995,883 3.24% 0.711
268 the 31 11,857 <0.01% 0.000
269 thes 8 15,037 <0.01% 0.000
270 ti 1,781 7,056,797 0.44% 0.301
271 ts 1 551 <0.01% 0.000
272 tsch 1 37 <0.01% 0.000
273 tsh 10 2,949 <0.01% 0.000
274 tt 618 2,613,709 0.16% 0.000
275 tte 34 33,275 <0.01% 0.000
276 tth 1 17,697 <0.01% 0.000
277 tw 7 457,084 0.03% 0.000
278 tzsch 1 1,825 <0.01% 0.000
279 u 6,587 27,685,856 1.72% 2.307
280 u_e 311 2,464,579 0.15% 0.982
281 ua 2 17,469 <0.01% 0.000
282 ual 2 15 <0.01% 0.000
283 ue 96 855,526 0.05% 0.912
284 ugh 1 5,867 <0.01% 0.000
285 uh 2 14,565 <0.01% 0.139
286 ui 86 309,495 0.02% 1.315
287 ui_e 4 8,687 <0.01% 0.000
288 ul 138 430,388 0.03% 0.000
289 ule 2 3,321 <0.01% 0.000
290 ull 42 81,135 0.01% 0.000
291 um 3 2,190 <0.01% 0.000
292 uo 9 3,304 <0.01% 0.479
293 uoy 3 313 <0.01% 0.000
294 ur 577 1,620,707 0.10% 1.031
295 ure 193 1,020,272 0.06% 1.098
296 urr 83 350,854 0.02% 0.472
297 ut 3 3,171 <0.01% 0.000
298 uy 5 194,926 0.01% 0.000
299 v 2,880 15,036,744 0.94% 0.000
300 ve 540 5,877,003 0.37% 0.000
301 vv 3 1,259 <0.01% 0.000
302 w 1,601 20,746,360 1.29% 0.006
303 we 5 17,254 <0.01% 0.000
304 wh 215 6,321,562 0.39% 0.695
305 wi 1 1,020 <0.01% 0.000
306 wr 102 2,176,730 0.14% 0.000
307 x 869 3,887,725 0.24% 0.407
308 xe 3 7,822 <0.01% 0.000
309 xh 23 20,691 <0.01% 0.000
310 xi 6 7,758 <0.01% 0.000
311 y 4,493 32,632,117 2.03% 1.680
312 y_e 57 304,957 0.02% 0.000
313 ye 9 16,561 <0.01% 0.000
314 yl 6 19,654 <0.01% 0.000
315 yll 1 36 <0.01% 0.000
316 yr 4 7,350 <0.01% 0.000
317 yrrh 1 105 <0.01% 0.000
318 z 818 1,215,898 0.08% 0.301
319 ze 4 5,758 <0.01% 0.000
320 zi 1 59 <0.01% 0.000
321 zz 74 60,756 <0.01% 0.458
322 ' 54 29,904 <0.01% 0.000
Totals: 264,618 1,603,490,234
This appendix lists the phonemes distinguished by the ELP and therefore used in
this study. In addition to the 44 phonemes generally accepted for SAE, there are four
additional phonemic distinctions included for analysis: the cluster /ju/ in contrast to /u/
(as heard in moot vs. mute), and the syllabic variants of /m/, /n/, and /l/ (when these
consonants function as the nucleus of a syllable instead of a vowel). Phonemes are listed
in the table in both IPA and SAMPA formats, with a provided example. Like the IPA,
the Speech Assessment Methods Phonetic Alphabet (SAMPA) is a symbol system used
for phonetic transcription, with a specific focus on using only characters available on
standard computer keyboards. Note that while some phonemes have different symbols
between the IPA and SAMPA formats, many share the same symbol in both
typographies. For example, in the
word teething, the initial consonant is represented by /t/ in both SAMPA and IPA
formats, the vowel featured in both syllables is represented by /i/ in both formats, the
middle consonant is represented by /D/ in SAMPA and /ð/ in IPA, and the final consonant
is represented by /N/ in SAMPA and /ŋ/ in IPA. The analysis was conducted with
phonemes in SAMPA format, and then in post-processing these symbols were converted
to IPA format, which may be more familiar to readers. Appendix D, however, discusses
technical aspects of the algorithm and so retains use of SAMPA symbols for the sake of
consistency.
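Because the SAMPA-to-IPA conversion described above is a one-to-one symbol substitution, the post-processing step can be sketched as a simple lookup table. The sketch below is hypothetical and restricted to the symbols from the teething example:

```python
# Hypothetical sketch of the SAMPA-to-IPA post-processing substitution;
# only the symbols appearing in the 'teething' example are included.
SAMPA_TO_IPA = {"t": "t", "i": "i", "D": "ð", "N": "ŋ"}

def to_ipa(sampa_phonemes):
    """Replace each SAMPA symbol with its IPA equivalent."""
    return [SAMPA_TO_IPA[p] for p in sampa_phonemes]

# 'teething' in SAMPA, converted to IPA:
print(to_ipa(["t", "i", "D", "i", "N"]))  # ['t', 'i', 'ð', 'i', 'ŋ']
```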
Appendix C Continued
P G Prob. P G Prob. P G Prob.
hi 0.000066 rt 0.000075 sce 0.000151
aa 0.000051 ir 0.000015 sse 0.000048
oi 0.000037 rrh 0.000002 cs 0.000045
eau 0.000037 0.98177 ces 0.000029
er 0.00003 1.30285
ea 0.000027 /l/= l 0.804507
o' 0.000023 0.03588 ll 0.183583 /æ/= a 0.999126
ae 0.000016 'll 0.005524 0.03339 au 0.000773
he 0.000014 all 0.004237 e 0.000043
eo 0.000008 sl 0.000994 ah 0.00002
on 0.000006 ol 0.000431 ai 0.000015
c 0.000003 le 0.00035 a'a 0.000014
eu 0.000002 lle 0.000283 i 0.000007
oo 0 lh 0.000091 a_e 0.000002
2.01254 0.79956 0.01081
kh 0.000137 j 0.000006 ea 0.012055
cu 0.000123 ai 0.000003 e 0.003686
cques 0.000086 hoe 0.000002 eigh 0.002844
che 0.000053 ois 0.000002 ai_e 0.001988
gh 0.000036 2.74292 aigh 0.001309
cqu 0.000036 ei 0.000615
cch 0.000016 /ɛ/= e 0.80297 ay_e 0.000381
kg 0.000004 0.02784 a 0.103993 eh 0.000371
kk 0.000001 ea 0.034405 ae 0.000288
1.5928 ai 0.02727 et 0.000265
ei 0.019017 au 0.000136
/z/= s 0.893611 ie 0.004512 ee 0.000135
0.02856 es 0.0468 ay 0.004462 e_e 0.000055
z 0.02544 ey 0.001868 es 0.000019
's 0.017539 u 0.000366 er 0.000016
se 0.014002 ae 0.000321 eii 0.000003
zz 0.001196 aye 0.000305 ais 0.000002
ss 0.000729 aa 0.000242 ei_e 0.000001
thes 0.000328 eo 0.000177 1.96794
x 0.000154 hei 0.00006
ze 0.000126 ee 0.000023 /u/= o 0.535481
sth 0.000041 oe 0.000005 0.01916 ou 0.187264
is 0.000013 eh 0.000004 u 0.093988
ts 0.000012 1.11633 oo 0.052143
cz 0.00001 ew 0.042228
0.70278 /p/= p 0.93503 u_e 0.033706
0.02349 pp 0.063688 ue 0.018703
/ʌ/= o 0.515433 pe 0.001145 o_e 0.015449
0.02463 u 0.356184 ph 0.000115 ough 0.011045
a 0.09341 bp 0.000014 ui 0.003879
ou 0.020194 ppe 0.000008 oo_e 0.003073
au 0.012732 0.3567 eu 0.000807
oo 0.002047 ou_e 0.000773
1.55491 /v/= v 0.470449 oe 0.000673
0.01990 f 0.344445 ui_e 0.000282
/aɪ/= i 0.473221 ve 0.16932 ieu 0.000274
0.02044 i_e 0.272489 've 0.014556 oup 0.000081
y 0.165759 ph 0.00092 ooh 0.000046
igh 0.057033 w 0.000196 ioux 0.000043
y_e 0.009293 lv 0.000067 w 0.000026
ie 0.007419 vv 0.000039 ous 0.000025
uy 0.00594 lve 0.000008 uo 0.000011
eye 0.003285 1.57742 oui 0.000001
ai 0.001922 2.19773
ei 0.000852 /f/= f 0.881126
ia 0.000703 0.01868 ff 0.068213 /b/= b 0.995284
eigh 0.000631 ph 0.034432 0.01867 bb 0.00466
ye 0.000505 gh 0.009224 pb 0.000046
ay 0.000241 ft 0.00402 bh 0.00001
a 0.000189 lf 0.002866 0.04371
ig 0.000167 pph 0.000094
is 0.000125 ffe 0.000013 /o/= o 0.634053
oy 0.000063 v 0.00001 0.01442 o_e 0.177392
ey 0.000043 fe 0.000002 ow 0.138887
ais 0.000042 0.7127 oa 0.021885
aye 0.000028 ough 0.014182
ae 0.000025 /ɚ/= er 0.700952 oh 0.005187
ailles 0.000023 0.01526 or 0.148072 oe 0.003659
1.93148 ar 0.084384 ou 0.002998
ure 0.030605 eau 0.000638
/ɑ/= o 0.627161 ur 0.014244 owe 0.000354
0.01872 a 0.355241 orr 0.006852 ew 0.000218
oh 0.007262 arr 0.004151 ot 0.000124
ow 0.003507 err 0.001924 u 0.000117
ea 0.002166 ir 0.001921 oo 0.000111
ho 0.002039 rror 0.001519 au 0.000066
ah 0.001387 urr 0.001446 au_e 0.000023
e 0.000779 re 0.001409 oa_e 0.000023
i 0.000236 ro 0.000929 eaux 0.000022
au 0.000095 eur 0.000646 aoh 0.000021
aa 0.00008 our 0.0003 os 0.000019
ou 0.000022 yr 0.0003 eo 0.000009
as 0.000016 aur 0.000099 o' 0.000007
at 0.000006 oor 0.000082 ou_e 0.000004
oi 0.000002 erwr 0.000068 1.5778
aw 0.000001 're 0.000045
1.09733 oar 0.000029 /h/= h 0.932069
awr 0.000018 0.01102 wh 0.066675
/ɔ/= o 0.669859 ere 0.000006 j 0.001238
0.01621 a 0.211565 1.4957 x 0.000018
au 0.039668 0.36731
ou 0.028863 /ɾ/= t 0.57741
aw 0.018842 0.01357 d 0.318461 /ʃ/= ti 0.494173
ough 0.012995 tt 0.087569 0.00844 sh 0.309672
oa 0.008879 dd 0.015204 ci 0.055121
oo 0.004693 bt 0.000692 ssi 0.0422
augh 0.002952 ld 0.000308 s 0.027479
ea 0.000488 ct 0.000293 ss 0.02142
awe 0.000441 th 0.000059 ch 0.019937
ah 0.000343 cht 0.000004 c 0.010283
as 0.000211 1.39791 si 0.007219
ao 0.000188 t 0.004024
u 0.000007 /g/= g 0.938245 shi 0.003103
al 0.000005 0.00858 gu 0.041541 che 0.002202
eo 0 gg 0.015718 sci 0.002021
1.52398 gh 0.002846 sc 0.000639
gue 0.00165 sch 0.000481
/w/= w 0.901524 0.41042 ce 0.000024
0.01432 u 0.096813 chsi 0.000002
we 0.00075 /ɝ/= er 0.40125 2.05046
ju 0.000468 0.00645 or 0.181697
o 0.000371 ur 0.115057 /l/= le 0.521185
hu 0.000054 ir 0.105264 0.00640 al 0.269161
wh 0.000016 ear 0.074666 el 0.056952
r 0.000003 ere 0.066072 ul 0.041885
0.47923 urr 0.030449 all 0.03633
our 0.011768 l 0.020892
err 0.007258 il 0.01435
/ŋ/= ng 0.841945 orr 0.004238 ael 0.009953
0.01053 n 0.155757 her 0.000841 ull 0.007896
ing 0.001246 eur 0.000572 ol 0.005363
ngue 0.001033 irr 0.000507 ell 0.00416
nd 0.000019 olo 0.00035 ile 0.003915
0.64938 yrrh 0.00001 ll 0.003757
2.55326 yl 0.001913
/n/= on 0.649846 'll 0.001774
0.00814 n 0.169372 /θ/= th 0.998036 ule 0.000323
en 0.099272 0.00606 tth 0.00182 ill 0.000186
an 0.047163 t 0.000129 yll 0.000004
ain 0.017626 dth 0.000016 ual 0.000001
in 0.0135 0.02131 2.10091
ne 0.001607
onn 0.001455 /ʊ/= ou 0.567379 /j/= y 0.99984
ign 0.000159 0.00521 oo 0.249375 0.00540 j 0.000085
1.59378 u 0.167699 i 0.000071
o 0.013537 ll 0.000004
/ʤ/= j 0.358992 eu 0.001442 0.00243
0.00637 g 0.315193 uo 0.000354
ge 0.229182 or 0.000202 /ʧ/= ch 0.689814
d 0.036773 oui 0.000011 0.00494 t 0.202527
dge 0.026248 1.49991 tch 0.061743
gi 0.018078 ti 0.043437
dg 0.009429 /hw/= wh 0.998552 cz 0.000557
dj 0.003308 0.00321 ju 0.001153 che 0.00052
di 0.002075 w 0.000296 c 0.000505
gg 0.000393 0.01681 tsh 0.000371
ch 0.000328 ci 0.000274
2.07789 /jɛ/= e 1 tzsch 0.00023
0.00000 0 th 0.000019
/aʊ/= ou 0.639641 tsch 0.000005
0.00531 ow 0.32056 /jə/= u 0.631539 1.30858
ou_e 0.022595 0.00108 io 0.1803
hou 0.014715 ia 0.142676 /ks/= x 0.997381
au 0.001437 ie 0.029076 0.00186 xe 0.002619
ao 0.000833 ua 0.01011 0.02624
ough 0.000219 iu 0.004833
1.17636 a 0.001109 /jɑ/= o 1
iou 0.000358 0.00000 0
/ju/= u 0.571625 1.53274
0.00343 u_e 0.258882 /oɪ/= oi 0.628629
ew 0.05474 /gz/= x 0.963063 0.00096 oy 0.328469
ue 0.050779 0.00035 xh 0.036937 oi_e 0.014927
iew 0.047863 0.22807 aw 0.013624
eau 0.008614 ois 0.010712
ou 0.003263 /jɚ/= ure 0.517928 oy_e 0.001896
eu 0.002428 0.00016 ior 0.320364 eu 0.001395
ugh 0.001064 iar 0.145187 uoy 0.000202
ut 0.000575 ur 0.015327 o 0.000094
ieu 0.00009 ier 0.000748 hoi 0.000052
ewe 0.000076 or 0.000447 1.22852
1.74966 1.52704
/jʊ/= u 0.725133
/wʌ/= o 1 /wɑ/= ois 0.597656 0.00019 eu 0.273925
0.00157 0 0.00000 oi 0.402344 uh 0.000943
0.97231 0.85744
/ʒ/= si 0.643986
0.00067 s 0.293708 /gʒ/= x 1 /kʃ/= x 0.907282
ge 0.02481 0.00000 0 0.00005 xi 0.092718
g 0.014577 0.44548
ti 0.012087 /ts/= z 0.773716
z 0.005907 0.00002 zz 0.226284 /kə/= kh 1
j 0.0046 0.77148 0.00000 0
sh 0.000143
ssi 0.000127 /nj/= n 0.663968 /əw/= ju 1
zi 0.000055 0.00000 gn 0.336032 0.00000 0
1.30993 0.92097
/m/= m 0.905322
0.00018 ome 0.066394
om 0.016532
um 0.007431
am 0.003332
em 0.000987
0.57738
Note: P=phoneme, G=grapheme. Below each phoneme, the probability for that
phoneme is listed. The probabilities for each PGC are listed to the right of each
grapheme. At the bottom of each list, the phoneme’s entropy is given in bold. The
graphemes are listed in order of decreasing probability, with the first grapheme being
the “main spelling” or most predictable correspondence for that phoneme.
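The entropy value reported for each phoneme is the Shannon entropy (base 2) of its grapheme probability distribution. A minimal sketch (the function name is illustrative), using /b/'s PGC probabilities from the table above:

```python
import math

def pgc_entropy(probabilities):
    """Shannon entropy (base 2) of a phoneme's grapheme distribution;
    zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# /b/: b 0.995284, bb 0.00466, pb 0.000046, bh 0.00001 (values from the
# table above); this reproduces the listed entropy of 0.04371.
print(round(pgc_entropy([0.995284, 0.00466, 0.000046, 0.00001]), 5))  # 0.04371
```

A phoneme with a single spelling has entropy 0; additional spellings, or a flatter distribution over them, drive the entropy up.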
The following describes the decision-making processes used by the algorithm in
choosing how to parse words.
ELP Characteristics. For each entry, the ELP provides a number of characteristics
that may be used in analysis. The following characteristics were utilized by this study:
(a) Frequency Norms (Freq_HAL), (b) Pronunciation (Pron.), and (c) Morphological
Spelling (MorphSp).
Freq_HAL are the frequency norms provided by the HAL database. For each entry,
the number of times that word is found throughout the corpus is indicated. These data
were necessary to calculate the probability for each word.
Pron. is the listed pronunciation for each entry, based on the Standard American
English (SAE) dialect. This information is recorded in the SAMPA format, and the
analysis was therefore conducted using SAMPA symbols. Once the final data were
analyzed, the SAMPA symbols were replaced with their more familiar IPA equivalents.
MorphSp provides a morphological breakdown of each entry, giving the root(s) and
any affixes for a given word. This feature was utilized in creating an algorithm which
can apply various procedures to determine how best to parse ambiguous mappings.
Parsing the ELP algorithmically. After the problematic cases documented above
were addressed, an algorithm was constructed which accurately parsed as many words in
the ELP as possible. A parsing was considered “accurate” if it conformed to the desired
parsing conventions established for this study.
Details of how the algorithm made parsing decisions will be discussed below.
Broadly speaking, there were three steps undertaken by the algorithm in an attempt to
achieve the desired parsing of a word. First, the program consulted a mapping file that
defined the acceptable phoneme-grapheme correspondences. Second, it expressed all
acceptable parsings for a given word type. Third, it chose the most accurate parsing
from among those candidates.
Constructing the mapping file. The mapping file listed all phonemes used by the
ELP and all potential graphemes that might correspond to each phoneme. This is the
information used by the algorithm to parse the wordlist. Only the phoneme-grapheme
matches defined in the mapping file are considered acceptable correspondences by the
algorithm.
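The mapping file can be pictured as a lookup from each phoneme to its allowable graphemes. The sketch below is hypothetical; the entries shown are a small illustrative fragment, not the study's actual file:

```python
# Illustrative fragment of a phoneme-to-grapheme mapping (SAMPA keys);
# the study's actual mapping file covers the full phoneme inventory.
MAPPING = {
    "k": ["c", "k", "ck", "cc", "ch", "que"],
    "n": ["n", "nn", "kn", "gn"],
}

def is_acceptable(phoneme, grapheme):
    """A correspondence counts as acceptable only if the mapping file
    lists the grapheme under that phoneme."""
    return grapheme in MAPPING.get(phoneme, [])

print(is_acceptable("k", "ck"))  # True
print(is_acceptable("n", "c"))   # False
```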
The primary goal of the allowed correspondences was not to minimize the number of
possible parsings returned for each word, but rather to be conservative in eliminating
possibilities, so as to increase the chance that one of the potential parsings is the desired
one. In other words, the net was cast wide. This was guided by the belief that it is easier
to discard unwanted parsings later than to recover parsings that were never generated.
With this strategy in mind, the initial mapping file included all graphemic options
culled from prior studies (Berndt et al., 1987; Gontijo et al., 2003; Hanna, 1966); in
addition, a number of online sources were consulted, including Wikipedia. From there,
idiosyncratic graphemes were iteratively added, until the algorithm was able to yield a
suggested parsing for every word in the dataset. This procedure was done to decrease the
number of words that remained unparseable to the algorithm.
Expressing all possible parsings. For each entry in the ELP, the algorithm took as
input the grapheme string comprising the written word and the phoneme string
comprising its phonetic transcription, with all stress and syllable marks removed. The
program moved in a serial fashion from left to right. It identified each phoneme
encountered by matching it to the identical phoneme listed in the mapping file, choosing
the longest phoneme string consistent with the position at hand. In this manner, the
pronunciation was segmented into its component phonemes.
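The left-to-right, longest-match identification of phonemes can be sketched as follows (a hypothetical illustration; the inventory passed in is only a fragment of the SAMPA set):

```python
def split_phonemes(pron, phoneme_set):
    """Greedily split a SAMPA pronunciation string into phonemes,
    always taking the longest match at the current position."""
    result, i = [], 0
    while i < len(pron):
        match = None
        for length in range(len(pron) - i, 0, -1):  # try longest first
            if pron[i:i + length] in phoneme_set:
                match = pron[i:i + length]
                break
        if match is None:
            raise ValueError(f"unknown phoneme at position {i} in {pron!r}")
        result.append(match)
        i += len(match)
    return result

# "tS" (the affricate in 'chin') must win over plain "t" at position 0:
print(split_phonemes("tSIn", {"t", "tS", "I", "n"}))  # ['tS', 'I', 'n']
```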
The algorithm next examined every possible acceptable alignment between phoneme
and grapheme strings. No limitations were placed on what character(s) could represent
any phoneme. The only constraints were that 1) complex (multi-letter) graphemes must
contain only consecutive letters (except for split graphemes containing FS–e, handled
separately), 2) all letters of the word must be used, and 3) the number of graphemes must
equal the number of phonemes. For split graphemes, the FS–e (the silent e following the
vowel) must be directly preceded by either a graphemic consonant(s), gu-, or qu-. The
associated vowel had to be the first vowel encountered prior to the directly preceding
consonant(s), gu-, qu-, and any number of consecutive preceding vowels until a
consonant, another split grapheme, or the beginning of the word was encountered. This
rule flagged the “e” as a potential candidate for being part of a split grapheme, but only
when these conditions were met.
Although it can be argued that FS–e can serve many functions simultaneously, for
simplicity, in this study a FS–e cannot be assigned to more than one grapheme. It will
either attach to the vowel(s) or to the consonant(s), as per the guidelines above.
Once the algorithm determined the set of minimally constrained parsings (as
described above) of a given word, it consulted the mapping file to determine if each
parsing was legitimate, and returned an error otherwise. In this fashion, the
algorithm returned as output all legitimate parsings, with no limit placed on the number
of parsings per word. As expected, some words returned only one parsing, while many
returned multiple candidates.
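Under the constraints above, and setting aside split graphemes for brevity, the exhaustive generation of legitimate parsings can be sketched as a recursion over the mapping file (a hypothetical sketch; names and entries are illustrative):

```python
def all_parsings(word, phonemes, mapping):
    """Return every alignment of `word`'s letters to its phoneme list in
    which each phoneme is spelled by a grapheme allowed by `mapping`.
    Split graphemes (FS-e) are omitted from this sketch for brevity."""
    if not phonemes:
        return [[]] if not word else []   # all letters must be consumed
    results = []
    for g in mapping.get(phonemes[0], []):
        if word.startswith(g):            # graphemes use consecutive letters
            for rest in all_parsings(word[len(g):], phonemes[1:], mapping):
                results.append([g] + rest)
    return results

mapping = {"n": ["n", "kn"], "i": ["ee", "e_e"], "t": ["t"]}
# 'knee' -> /ni/ has exactly one legitimate parsing under this fragment:
print(all_parsings("knee", ["n", "i"], mapping))  # [['kn', 'ee']]
```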
Determining the preferred parsing. Once the above algorithm could generate at
least one legitimate parsing for every word in the list, the next step was to sift through all
generated parsings of an entry and choose the one considered to be the most accurate. To
this end, the MorphSp attribute of the ELP was utilized, which breaks words down into
their morphological components.
For any word with multiple acceptable parsings, this algorithm first broke each
entry down into its morphological components. The words were divided into three
classes and addressed separately: root, non-compound, and compound words. Root
words were words with a single root and no prefixes or suffixes. Non-compound words
had one root with one or more prefixes and/or suffixes. Compound words contained
more than one root and could contain prefixes and/or suffixes.
Root words. Root words were addressed first. In the event of multiple acceptable
parsings, the algorithm initially assumed that the first parsing was the correct one. The
algorithm then compared this incumbent parsing to each alternate parsing in turn. With
each comparison the program determined whether to replace the incumbent parsing with
the new one under consideration, or to reject the new parsing and retain the incumbent.
It proceeded in this manner until every alternate parsing had been considered.
The algorithm compared the incumbent and the alternate parsing under consideration
by performing up to four passes over the parsings. In each pass, the parsings were
compared in a serial, left-to-right order, ignoring all pairs of identical graphemes between
the two candidates. When the algorithm encountered two graphemes that were not
identical, it invoked a tie-breaking subroutine. The particular subroutine that was
employed depended on which pass the algorithm was executing, as described below. If
at any time a subroutine was able to determine, based on a particular grapheme
discrepancy, that one of the parsings was preferred, then the comparison ended and the
preferred parsing became the incumbent.
The first pass expressed a preference for longer graphemes over shorter graphemes
when the phoneme in question could be represented by both. For example, consider a
word such as hackney, which might be parsed h-a-ck-n-ey or h-a-c-kn-ey. According to
the mapping file, both options are legitimate; the grapheme ck often represents /k/, as in
back, sack, and hack, and the grapheme kn often represents /n/, as in knee, knight, know,
etc. Since words like hack feature the ck = /k/ grapheme, and words ending in -ney
feature the n = /n/ grapheme (kn representing /n/ appears in word-initial position only),
the first of the two parsings is judged to be the correct option. The simplest way to
choose this option was through the preference of ck over c: whenever c was the correct
choice, it was never the case that the contesting grapheme was ck. This decision was not
based upon any theoretical framework, but rather was determined simply from
observation of patterns in the ELP. Other preferences declared by the first pass were
established in the same manner.
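The first pass can be sketched as a table-driven comparison in which preference pairs, such as ck over c, were harvested from patterns in the ELP (a hypothetical sketch; the pairs and parsings shown are illustrative):

```python
def pass_one(incumbent, challenger, preferences):
    """First tie-breaking pass: walk both parsings left to right, skip
    identical graphemes, and consult a preference table at the first
    discrepancy. Returns the winner, or None if this pass cannot
    break the tie. The preference pairs are illustrative."""
    for a, b in zip(incumbent, challenger):
        if a == b:
            continue
        if (a, b) in preferences:
            return incumbent
        if (b, a) in preferences:
            return challenger
    return None

prefs = {("ck", "c"), ("kn", "n")}  # longer grapheme preferred over shorter
# h-a-ck-n-ey vs h-a-c-kn-ey: 'ck' beats 'c', so the first parsing wins
print(pass_one(["h", "a", "ck", "n", "ey"],
               ["h", "a", "c", "kn", "ey"], prefs))
```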
If the first pass failed to “break the tie,” a second pass was initiated. Execution of this
code handled discrepancies involving doubled letters. The subroutine preferred doubled
letters over single letters to represent a given phoneme, when both options were
legitimate.
If this code did not break the tie, a third pass was initiated. This pass handled
If the third pass failed to break the tie, a fourth and final pass was initiated. This
pass concerned certain vowels that were preferred in certain contexts.
Non-compound words. Non-compound words were handled similarly to
root words, but with additional steps to handle affixes. The MorphSp column indicated
the root word and affixes separately for any entry. The algorithm first checked to see if
the root word existed as its own separate entry in the ELP. If the associated root did not
exist as its own separate entry, the entry under consideration was parsed using the four-
pass procedure described above for root words.
However, frequently the associated root word existed as its own separate entry in
the ELP. If the phonemic string of the non-compound word contained the phonemic
string of the root word, this indicated the affixes did not alter the pronunciation of the
word root. The algorithm thus mirrored the parsing accepted for the root word, rejecting
inconsistent candidates. If no candidate was consistent, the algorithm did not immediately
reject any candidate, but instead skipped this step and proceeded to the next step.
If multiple acceptable parsings remained after examining the root, additional steps
were next applied. First, certain prefixes containing e, such as re-, might parse as part of
a split grapheme, which would inevitably be in error. Therefore, explicit rules were
outlined whereby, in the presence of these prefixes, split graphemes could not occur in
that position. Next, the algorithm checked if a preference between incumbent and
challenger could be made based on the suffix.
Consider the words lie and lady. Pluralizing these words results in lies and ladies. Though
both plural forms end with -ies, they must be parsed differently. The word lies should be
parsed as l-ie-s, not li-es, since the latter would indicate the word root is li-. Ladies, on
the other hand, should be parsed as l-a-d-i-es, not l-a-d-ie-s, because the penultimate “e”
is not found in the singular form of lady. The “e” must be part of the ending, and
therefore -es is the correct parsing of the suffix (with the final -y converting to an -i). So
for a word ending in –y + -s (as indicated by the MorphSp entry) and two candidate
parsings, one ending in –i-es and the other ending in –ie-s, the algorithm immediately
selected the parsing ending in –i-es as the next incumbent. Rules of this kind were
expressed as:
Root ending in -e + suffix -s, resulting in word ending in –es: prefer parsings
ending with –s to parsings ending with –es
Suffix –ives: prefer parsings ending with –s to parsings ending with –es
Suffix –itives: prefer parsings ending with –s to parsings ending with –es
Suffix –tures: prefer parsings ending with –s to parsings ending with –es
Suffix –eer: prefer parsings ending with –r to parsings ending with –er
Suffix –eer + additional suffix: prefer parsings ending with –r-? to parsings
endings with –er-?
Suffix –ered: prefer parsings ending with –ed to parsings ending with –d
Root ending in -e + suffix –ed: prefer parsings ending with –d to parsings
ending with –ed
Root not ending in -e + suffix –ed: prefer parsings ending with –ed to parsings
ending with –d
Suffixes –y + ed = –ied: prefer parsings ending in –ed to parsings ending in –d
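The suffix-preference rules above can be sketched in Python. This is an illustrative sketch, not the thesis's actual software; the rule table, function names, and the representation of a parsing as a list of grapheme units (e.g. "ladies" as ["l", "a", "d", "i", "es"]) are all hypothetical.

```python
def ends_with(parsing, units):
    """True if the parsing's final grapheme units equal `units`."""
    return parsing[-len(units):] == list(units)

# (suffix pattern from the MorphSp entry, preferred ending, dispreferred ending)
SUFFIX_RULES = [
    ("y+s",  ("i", "es"), ("ie", "s")),  # lady -> ladies: prefer -i-es
    ("e+s",  ("s",),      ("es",)),      # root in -e + -s: prefer -s
    ("ives", ("s",),      ("es",)),
    ("e+ed", ("d",),      ("ed",)),      # root in -e + -ed: prefer -d
    ("ed",   ("ed",),     ("d",)),       # root not in -e + -ed: prefer -ed
]

def prefer(suffix_pattern, incumbent, challenger):
    """Return whichever parsing the first applicable rule favors, else None."""
    for pattern, good, bad in SUFFIX_RULES:
        if pattern != suffix_pattern:
            continue
        if ends_with(incumbent, good) and ends_with(challenger, bad):
            return incumbent
        if ends_with(challenger, good) and ends_with(incumbent, bad):
            return challenger
    return None  # no suffix rule decides; fall through to later steps

# ladies: l-a-d-i-es beats l-a-d-ie-s under the y+s rule
print(prefer("y+s", ["l", "a", "d", "i", "es"], ["l", "a", "d", "ie", "s"]))
# -> ['l', 'a', 'd', 'i', 'es']
```

Returning None when no rule applies mirrors the algorithm's behavior of deferring to later steps rather than forcing a decision.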
Again, these rules were not decided beforehand, but rather resulted from examining
cases in which the algorithm was unable to select a single, desired parsing. If execution of all “suffix
rules” did not result in a determination of the correct parsing, the algorithm continued on to the next steps.
Compound words. Lastly, compound words containing multiple roots and/or affixes
were addressed. These were handled similarly to non-compound words, only with
additional initial steps. First, if there were no affixes, the algorithm checked if the compound
word, in both graphemic and phonemic strings, was identical to the combination of the
associated roots, and if so, chose the concatenation of the parsings that were selected for
the root words in previous steps of the algorithm. For example, when the program
encountered a word like wholesale, it searched for the associated roots whole and sale,
combined the roots, and then checked if the result was identical to wholesale, which it
was. Similarly, it confirmed that the pronunciation of /holsel/ (wholesale) was the
concatenation of /hol/ + /sel/. Finally, the algorithm posited the concatenation of the
selected parsings for whole and sale, yielding wh-o_e-l-s-a_e-l. As this is indeed one of the candidate parsings of wholesale, it was selected as the preferred parsing.
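The affix-free compound check can be sketched as follows. The function name and data representation are hypothetical (not the thesis's actual software): a compound is accepted when its spelling and pronunciation are the exact concatenations of its roots', and its parsing is then the concatenation of the roots' previously selected parsings.

```python
def parse_simple_compound(word, pron, roots):
    """roots: list of (spelling, pronunciation, selected_parsing) triples."""
    if word != "".join(spelling for spelling, _, _ in roots):
        return None  # graphemic strings do not concatenate to the compound
    if pron != "".join(p for _, p, _ in roots):
        return None  # phonemic strings do not concatenate either
    parsing = []
    for _, _, root_parsing in roots:
        parsing.extend(root_parsing)
    return parsing

whole = ("whole", "hol", ["wh", "o_e", "l"])
sale = ("sale", "sel", ["s", "a_e", "l"])
print(parse_simple_compound("wholesale", "holsel", [whole, sale]))
# -> ['wh', 'o_e', 'l', 's', 'a_e', 'l']
```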
For more complicated compound words, particularly ones containing affixes, the
algorithm first checked whether the combination of the associated roots, in both graphemic and
phonemic strings, was a subset of the compound word being analyzed. If so, and if there
was at least one possible parsing of the compound word that contained this simple
combination of the previously selected parsings of the associated roots, then all other
parsings were eliminated. For example, if the algorithm encountered the word
wholesalers, it checked if this word contained whole + sale and its pronunciation
contained /holsel/. When this proved true, only parsings containing wh-o_e-l-s-a_e-l
were retained. In this case, the only such parsing is wh-o_e-l-s-a_e-l-r-z, and so this parsing was selected.
The algorithm then tested, in a similar fashion, if the compound word being
analyzed began with the first associated root word, and if the pronunciation began with
the combination of the first associated root and the first phoneme of the second associated
root. If this was the case, then all parsings that did not begin with the previously selected
parsing of the first associated root word were eliminated. If there were still multiple
candidate parsings, then the algorithm proceeded as it did with non-compound words,
first analyzing the suffixes to see if it could determine a correct parsing by suffix alone,
and then performing pairwise comparisons of the remaining candidates. The full procedure is illustrated below for the
word whiskey:
I. Phoneme String
From left to right, find largest phonemes in mapping file at current location of string:
A. Both /h/ and /hw/ are phonemes in mapping file – take /hw/ as first phoneme
B. Current location is third character of phoneme string: /I/. /I/ is a phoneme in
mapping file, and no phoneme in mapping file starts with /Is/. So, /I/ is second
phoneme
C. Similarly, /s/ is next phoneme, followed by /k/ and /i/.
Phoneme string: /hw-I-s-k-i/
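Step I, the greedy left-to-right, longest-match segmentation of the phoneme string, can be sketched as below. PHONEMES stands in for the inventory in the mapping file (a hypothetical subset sufficient for /hwIski/); the function name is likewise illustrative.

```python
PHONEMES = {"h", "hw", "I", "s", "k", "i"}
MAX_LEN = max(len(p) for p in PHONEMES)

def segment_phonemes(s):
    """At each position, take the longest phoneme the inventory allows."""
    out, pos = [], 0
    while pos < len(s):
        for n in range(MAX_LEN, 0, -1):  # try longest candidate first
            if s[pos:pos + n] in PHONEMES:
                out.append(s[pos:pos + n])
                pos += n
                break
        else:
            raise ValueError(f"no phoneme match at position {pos} of {s!r}")
    return out

print(segment_phonemes("hwIski"))  # -> ['hw', 'I', 's', 'k', 'i']
```

Trying the longest candidate first is what makes /hw/ win over /h/ at the start of the string, exactly as in step A above.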
1. w-h-i_e-s-ky,
2. w-h-i_e-sk-y,
3. w-h-i-s-key,
4. w-h-i-sk-ey,
5. w-h-i-ske-y,
6. w-h-is-k-ey,
7. w-h-is-ke-y,
8. w-h-isk-e-y,
9. w-hi-s-k-ey,
10. w-hi-s-ke-y,
11. w-hi-sk-e-y,
12. w-his-k-e-y,
13. wh-i_e-s-k-y,
14. wh-i-s-k-ey,
15. wh-i-s-ke-y,
16. wh-i-sk-e-y,
17. wh-is-k-e-y,
18. whi-s-k-e-y
A. w-h-i_e-s-ky (#1): this implies that “w” maps to /hw/, “h” maps to /I/. Since the
mapping file does not allow /I/ to map to “h”, this option is eliminated. The same
rule also eliminates candidates #2-8 when encountered, which all begin with w-h-.
B. wh-i_e-s-k-y (#13): this implies “wh” maps to /hw/, “i_e” maps to /I/, “s” maps
to /s/, “k” maps to /k/, and “y” maps to /i/. The mapping file contains all of these
correspondences, and so this option is retained.
C. w-hi-sk-e-y (#11) and wh-i-sk-e-y (#16) are eliminated because the mappings
sk=/s/ and e=/k/ are not allowed by the mapping file.
D. #12 is eliminated because his=/I/ is not a legitimate mapping.
E. #17 is eliminated because k=/s/ and e=/k/ are not legitimate mappings.
F. #18 is eliminated because only its final mapping, y=/i/, is an allowed mapping.
The remaining candidates are: w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i_e-s-k-y (#13),
wh-i-s-k-ey (#14), and wh-i-s-ke-y (#15).
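The elimination step just performed can be sketched as a filter over the candidates. ALLOWED stands in for the grapheme-phoneme correspondences licensed by the mapping file (a hypothetical subset); a candidate survives only if it aligns one-to-one with the phoneme string and every implied pair is licensed.

```python
ALLOWED = {("wh", "hw"), ("w", "hw"), ("hi", "I"), ("i", "I"), ("i_e", "I"),
           ("s", "s"), ("k", "k"), ("ey", "i"), ("ke", "k"), ("y", "i")}

def is_licensed(graphemes, phonemes):
    """Every grapheme-phoneme pair implied by the alignment must be allowed."""
    return (len(graphemes) == len(phonemes) and
            all((g, p) in ALLOWED for g, p in zip(graphemes, phonemes)))

phons = ["hw", "I", "s", "k", "i"]
candidates = [
    ["w", "h", "i_e", "s", "ky"],  # #1: "h"=/I/ unlicensed -> eliminated
    ["wh", "i_e", "s", "k", "y"],  # #13: all pairs licensed -> retained
    ["wh", "i", "s", "k", "ey"],   # #14: all pairs licensed -> retained
]
survivors = [c for c in candidates if is_licensed(c, phons)]
print(survivors)  # #13 and #14 remain
```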
A. At least one candidate has a split grapheme (/I/ as i_e), and at least one candidate
interprets that –e as not part of a split grapheme. Since /I/ is not a long vowel, it
is determined that there is no split grapheme in this word, and all choices
containing the split grapheme i_e are eliminated. This eliminates one option, wh-
i_e-s-k-y (#13), leaving w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i-s-k-ey (#14),
and wh-i-s-ke-y (#15).
B. We now perform pairwise comparisons of the candidates.
a. The initial incumbent is w-hi-s-k-ey (#9) and the initial challenger is w-hi-
s-ke-y (#10).
i. First pass of grapheme comparisons from left-to-right:
The only remaining candidate, wh-i-s-k-ey (#14), is selected as the preferred parsing for
this word.
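The split-grapheme rule applied in step A above can be sketched as follows: a split grapheme such as i_e may only spell a long vowel, so candidates pairing one with a short vowel like /I/ are dropped. LONG_VOWELS is a hypothetical stand-in for the vowel classification assumed by the algorithm.

```python
LONG_VOWELS = {"e", "aI", "o", "ju", "i"}  # illustrative set; /I/ is short

def drop_bad_split_graphemes(candidates, phonemes):
    """Remove candidates pairing a split grapheme with a non-long vowel."""
    kept = []
    for graphemes in candidates:
        ok = all("_" not in g or p in LONG_VOWELS
                 for g, p in zip(graphemes, phonemes))
        if ok:
            kept.append(graphemes)
    return kept

phons = ["hw", "I", "s", "k", "i"]
print(drop_bad_split_graphemes(
    [["wh", "i_e", "s", "k", "y"], ["wh", "i", "s", "k", "ey"]], phons))
# the i_e candidate is dropped; wh-i-s-k-ey remains
```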
Manual Parsings. A small group of words defied typical patterns seen in English. Rather
than attempt to alter the algorithm to accommodate these words, which might in turn
cause unforeseen errors in other cases, these entries were separated and parsed manually:
After all the above procedures, parsing of the ELP wordlist was complete. All
40,481 words had either been (a) removed, (b) assigned one parsing
algorithmically, or (c) parsed manually. Every instance of a phoneme in the