Thesis · August 2017
DOI: 10.13140/RG.2.2.32359.04000
Author: Jacob C. Cockcroft


Running Head: MEASURING ORTHOGRAPHIC PREDICTABILITY

Measuring Orthographic Predictability: Calculating Entropy for Phonemes and Graphemes of Standard American English

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

By

Jacob Cockcroft
Bachelor of Arts, Hendrix College, 2000

2017
The University of Arkansas for Medical Sciences

Acknowledgements

The author would like first and foremost to acknowledge the long-distance efforts of Dr.

Robert J. Drost in the creation of the analysis software, along with his many hours spent

resolving the numerous technical and theoretical issues involved in this analysis. Without

his contributions, this research and thesis would not have been possible. Also, many

thanks to Dr. Greg Robinson for his guidance, support and encouragement throughout the

evolution of this study. Many thanks as well to the rest of my committee members—Dr.

Chenell Loudermill, Dr. Dana Moser, and Ms. Stacey Mahurin—whose invaluable

feedback has helped shape this project into (hopefully) a streamlined, accessible product

for teachers, educators, therapists, and researchers. Gratitude is likewise owed to all the

professors, instructors, and clinical supervisors of the UAMS/UALR graduate program

for Speech-Language Pathology & Audiology that I have had the pleasure to meet and

learn from during my recent academic journey: Dr. Donna Kelly, Mrs. Connie Bracy, Dr.

Ashlen Thomason, Dr. Betholyn Gentry, Dr. Tom Guyette, and Mrs. Shanna Williamson,

just to name a few. Thanks to my indispensable proofreaders and parents, Jean Cazort and

David Cockcroft. Thanks to Anna Salzer for realizing how much I would love Speech-

Language Pathology before I did, and Eli Wakefield for the blessing of watching child

language development in action. Finally, thanks to my eighteen cohorts for making these

past few years a wonderful and enriching experience.



Table of Contents

Acknowledgements ............................................................................................................ iv

List of Figures ................................................................................................................... vii

List of Tables ................................................................................................................... viii

Introduction ......................................................................................................................... 1

Statement of the Problem ................................................................................................ 1

Purpose of This Study ..................................................................................................... 3

Rationale.......................................................................................................................... 4

Literature Review................................................................................................................ 7

Historical Perspectives of English Orthography ............................................................. 7

Current Research on Orthographic Depth ..................................................................... 11

Conclusion..................................................................................................................... 14

Methodology ..................................................................................................................... 16

Corpus ........................................................................................................................... 16

Defining English Graphemes ........................................................................................ 17

Alterations to the Corpus............................................................................................... 19

Summary of the Computing Algorithm ........................................................................ 21

Reliability ...................................................................................................................... 22

Calculating Entropy....................................................................................................... 23

Results ............................................................................................................................... 25

Graphemes ..................................................................................................................... 25

Phonemes ...................................................................................................................... 32

Discussion ......................................................................................................................... 35

Is English Predictably Spelled? ..................................................................................... 35

Measuring Orthographic Depth ..................................................................................... 37

Word-Level Entropy ..................................................................................................... 38

Entropy-Based Spelling Instruction .............................................................................. 39

Limitations .................................................................................................................... 42

Future Research ............................................................................................................. 44

Conclusion ........................................................................................................................ 47

References ......................................................................................................................... 49

Appendix A: Alphabetical List of All English Graphemes .............................................. 54

Appendix B: List of Phonemes Used by the English Lexicon Project (ELP) .................. 62

Appendix C: Phoneme-to-Grapheme Correspondence (PGC) Probabilities .................... 64

Appendix D: Algorithmic Procedures for Corpus Analysis ............................................. 71



List of Figures

Figure 1. Accuracy data from “Foundation Literacy Acquisition in

European Orthographies” (Seymour et al., 2003). English

speakers performed consistently worse than peers speaking

thirteen other languages……………………………………….………….14

Figure 2. Vowel cluster analysis. Dendrogram and scatter plot of

vocalic graphemes with complete linkage, grouped into five

categories of decreasing entropy……………………………...…………31

Figure 3. Consonant cluster analysis. Dendrogram and scatter plot

of consonantal graphemes with complete linkage, grouped

into five categories of decreasing entropy...…………………………….32



List of Tables

Table 1. Graphemes ranked by entropy. Classed by consonant or

vowel, with low-frequency and zero-entropy graphemes

removed………………………………………………………………….28

Table 2. Graphemes grouped by predictability. The results of

a hierarchical cluster analysis using complete linkage,

where complete predictability indicates zero-entropy…………………...30

Table 3. Phonemes ranked by entropy. SAE phonemes

discovered in the corpus, including token frequency,

probability expressed as a percentage, and unweighted

entropy values……………………………………………………………34

Table 4. Multiple phonemes corresponding to single letters.

Instances where a singleton grapheme corresponded to

more than one phoneme, which were removed from

final analysis……………………………………………………………..35

Introduction

“…the English alphabet is pure insanity…, it can hardly spell any word in the
language with any degree of certainty.” --Mark Twain

Statement of the Problem

Exactly how predictable is written English? The above quip by Mark Twain echoes

the exasperation many literate English speakers have felt at one time or another, when

considering how difficult and unruly the English writing system appears to be. Twain’s

trademark humor concerning what he called our “drunken old alphabet” reflects a larger

attitude of his time that English spelling needed to be reformed (Twain, 2016, p. 111).

English orthography features complex rules with many exceptions that appear to defy

logical explanation. Imagine what would happen if one day the auto-correct feature

disappeared from our cell phones and word processors. How confidently could we spell

our own language?

Rather confidently, according to the seminal work in the 1960s of Hanna, Hanna,

Rudorf, and Hodges (referred to as “HHRH” hereafter). According to their research, half

of all English words are predictably spelled from their sound, and another 34% would

have just a single error if spelled on the basis of sound (Moats, 2005). Their linguistic

analysis of over 17,000 of the highest frequency words in English resulted in the

conclusion that English orthography is predictable over 80% of the time (Hanna, 1966).

Despite this assertion, in recent years a growing area of cross-linguistic research once

again throws English predictability into question. Orthographic depth is a term used to

indicate the relative degree to which a language’s orthography diverges from its spoken

form (Katz & Frost, 1992). The traditional view of orthographic depth states that

languages exist on a continuum, from “transparent” orthographies at one end to “opaque”

orthographies on the other (Seymour, Aro, Erskine, & collaboration with COST Action

A8 network, 2003). Transparent orthographies, such as those of Italian, Spanish, Finnish,

Hungarian, and Greek, feature relatively phonetic spelling systems, in which word

spellings predictably represent their pronunciation. Opaque orthographies, such as those

of English and French, align in complex and less predictable ways to the oral languages

they symbolize.

English is generally considered to have the most opaque orthography of all European

languages. In multiple cross-linguistic experiments, learners of English have been shown

to struggle with reading and spelling in ways that most other speakers of European

languages do not. David Share dubbed English an “outlier orthography,” and questioned

why it still governs the vast majority of current research on the behavioral,

psycholinguistic, and cognitive processes linked to skilled reading and reading

acquisition (Share, 2008).

It appears that two contrasting perspectives exist in the literature regarding the

predictability of English orthography. If English is orderly and predictable, as concluded

by the HHRH study, then why are English speakers performing consistently worse when

compared to other languages which utilize the same Latin alphabet? One way to answer

this question is to perform a new linguistic analysis of English orthography using modern

computing technology to determine if the sound-symbol relationships are as predictable

as the HHRH study discovered.



Purpose of This Study

The current study sought to measure the degree of predictability of the sounds and

spelling patterns of Standard American English (SAE). Using a custom-built software

program, a corpus of over 131 million words was deconstructed into its constituent

sounds and spelling patterns, and then the frequencies of these sound-symbol

correspondences were tallied. From these frequencies, the probabilities of all sound-

symbol correspondences in the corpus were determined, which could be used to calculate

entropy values for all sounds and spelling patterns of SAE.

Entropy, as defined by information theory, measures the degree of uncertainty in a

frequency distribution. It describes how likely or predictable the outcome of a random

sampling is for that distribution, given the amount of information that is available. This

method has been used in several past studies to measure sound-symbol predictability for

various languages, including English (Borgwaldt, Hellwig, & De Groot, 2004;

Borgwaldt, Hellwig, & De Groot, 2005; Borgwaldt, Hellwig, De Groot, & Licht, 2006;

Protopapas & Vlahou, 2009).
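As a concrete sketch of the computation (this is not the study's own software, and the counts are invented for illustration), the entropy of a frequency distribution follows directly from Shannon's formula:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution.

    counts: raw frequencies, e.g. how often each candidate grapheme
    was observed spelling a given phoneme.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A sound spelled the same way every time is perfectly predictable:
assert shannon_entropy([100]) == 0.0
# Two equally likely spellings yield maximal two-way uncertainty:
assert shannon_entropy([50, 50]) == 1.0
```

Higher values indicate a less predictable sound-symbol mapping; zero entropy means a single, exceptionless correspondence.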

The basic units of spoken language that will be measured by this study are called

phonemes. Phonemes are the building blocks of spoken syllables, e.g. vowels and

consonants. Phonemes are written using the International Phonetic Alphabet (IPA)

standard format, with each unique sound represented by a specific symbol between two

slashes. The spoken word bat has three phonemes, written /b/, /æ/, and /t/.

A grapheme is then defined as a written symbol that corresponds to a single

phoneme. Graphemes are written between quotes. The written word “bat” has three

graphemes: “b,” “a,” and “t.” A grapheme might be a single written letter, but graphemes

in English may contain up to four letters, e.g. “ough” as in though. In this study, a

grapheme containing multiple letters is referred to as a compound grapheme.

In this study, the term orthographic correspondence (OC) is used to refer to the

relationship between a specific grapheme and phoneme pair, without reference to a

particular “direction” of correspondence. Thus the word bat contains three OCs: “b” =

/b/, “a” = /æ/, and “t” = /t/. When analyzing OCs, researchers may choose to study

correspondences in either the grapheme-to-phoneme direction (GPCs) or the phoneme-to-

grapheme direction (PGCs). The direction of correspondence determines whether the

cognitive process of decoding or encoding is being emphasized. GPCs relate to decoding

words, the process of translating a string of graphemes into a string of phonemes (i.e.

“sounding out” a written word). PGCs relate to encoding words, the process of translating

a string of phonemes into a string of graphemes (i.e. spelling).
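The distinction between the two directions can be made concrete with a toy tally for one phoneme; the counts for /f/ below are invented for illustration and are not drawn from the corpus:

```python
# Hypothetical phoneme-to-grapheme (PGC) tallies for /f/: how often each
# candidate grapheme was observed spelling the sound (invented counts).
pgc_counts = {"f": 900, "ff": 60, "ph": 35, "gh": 5}

total = sum(pgc_counts.values())
pgc_probs = {g: n / total for g, n in pgc_counts.items()}

# Encoding question: given the sound /f/, which spelling is most likely?
assert max(pgc_probs, key=pgc_probs.get) == "f"

# The GPC (decoding) direction would instead tally, for each written
# grapheme, the phonemes it can represent -- the same observations,
# conditioned the other way around.
```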

Rationale

The impact of OC predictability on both reading and writing cannot be overstated.

Phonemic awareness, or the understanding that words are composed of phonemes that

must be blended and segmented, was established by the National Reading Panel as one of

the most critical skills that must be learned in order for a student to achieve successful

literacy outcomes (National Reading Panel (US), National Institute of Child Health, &

Human Development (US), 2000). Research has shown that in all phases of development,

language learners employ decoding skills (Sharp, Sinatra, & Reynolds, 2008).

Sight word reading, according to Linnea Ehri, is the process of building the lexicon

through memorizing words so that they may be read “on sight,” without the need to

decode the individual phonemic units of a word, which is a slower process that requires

an extra cognitive load (Ehri, 2005). OCs form the basis for children learning sight

words, providing a “powerful mnemonic system” (p. 172). It is the OCs that provide the

“glue” to cement words in memory. The more unpredictable these correspondences are,

the more difficult decoding and learning sight words should be. For example, evidence

has shown that the OC predictability affects naming latencies, i.e., words with

inconsistent sound-spelling features are read aloud slower than words with consistent

sound-spelling features (Delattre, Bonin, & Barry, 2006).

OCs are equally important to the process of encoding, or spelling words. Decoding

and encoding are separate but highly interrelated processes, because students are

continually learning to read and spell words simultaneously. Students are provided visual

access to a word as a reference while they learn to spell it, so decoding and encoding

skills are learned together and mutually reinforce one another. As Ehri emphasizes,

learning to spell words supports the ability to later recognize them in print, just as

multiple exposures to reading a word improves the probability one will be able to

successfully spell it (Ehri, 2005).

A systematic measure of predictability for English phonemes and graphemes has the

potential to impact our current perspectives on how difficult it is to acquire and master

both spelling and literacy in the United States. Currently, the plethora of approaches to

teaching literacy and spelling often leaves educators and teachers overwhelmed

and confused about how best to approach the subject (Johnston, 2000; Schlagal, 2002;

Fresch, 2003). An analysis of orthographic entropy could inform best practices of

spelling and literacy instruction for both typical students and outliers. Considering the

entropy of OCs, for example, might provide a convenient method for organizing

wordlists in terms of difficulty.
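For instance, under the hypothetical per-grapheme entropy values below (invented for illustration, not taken from the study's results), a wordlist could be ordered by summing the entropy of each word's graphemes:

```python
# Invented per-grapheme entropy values, for illustration only.
grapheme_entropy = {"b": 0.1, "a": 1.2, "t": 0.1, "th": 0.4, "ough": 2.5}

def word_difficulty(graphemes):
    """Score a word by summing the entropy of its graphemes."""
    return sum(grapheme_entropy[g] for g in graphemes)

# Pre-segmented spellings: "bat" = b-a-t, "though" = th-ough.
wordlist = {"bat": ["b", "a", "t"], "though": ["th", "ough"]}
ranked = sorted(wordlist, key=lambda w: word_difficulty(wordlist[w]))
assert ranked == ["bat", "though"]  # lower total entropy = easier
```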

A quantification method of orthographic predictability could provide a useful metric

for future cross-linguistic studies comparing orthographic depth and its role in literacy

acquisition. Measuring entropy values (weighted by frequency) for all phonemes and

graphemes in the corpus allows both a total phonemic entropy and total orthographic

entropy value to be computed. The first number theoretically indicates how unpredictable

spoken SAE is to spell; the second indicates how unpredictable written English is to

pronounce in SAE. These numbers could be used in quantitative comparisons with other

languages where the entropy has likewise been measured, if it can be shown the corpus

being analyzed is a sufficient representation of the language as a whole.
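A frequency-weighted total of the kind described above can be sketched as follows; every number here is invented and serves only to show the shape of the computation:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical per-phoneme data: (relative token frequency of the
# phoneme, probabilities of its candidate spellings).
phonemes = {
    "/t/": (0.07, [0.97, 0.02, 0.01]),
    "/i/": (0.05, [0.60, 0.25, 0.10, 0.05]),
}

# Total phonemic entropy: each phoneme's spelling entropy weighted by
# how often the phoneme occurs in the corpus, then summed.
total_phonemic_entropy = sum(freq * entropy(spelling_probs)
                             for freq, spelling_probs in phonemes.values())
```

The symmetric calculation over graphemes and their candidate pronunciations would yield the total orthographic entropy.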



Literature Review

Historical Perspectives of English Orthography

English orthography is the product of centuries of diverse cultural influences without

any singular, overarching structural model. Linguist David Crystal’s book “Spell it Out:

The History of English Spelling” illuminates the myriad sources that have contributed

over time to the evolution of the modern English writing system (Crystal, 2012). Anglo-

Saxon, Welsh, Norman-French, Old Norse, Latin and Greek have all contributed their

own arbitrary writing conventions and stylistic forms to the recipe of English. For

example, there is no logical reason why an “–e” must be added after a “v” at the ends of

words like have, give, love, etc., other than the fact that Anglo-Norman scribes in the

medieval ages preferred this form.

Additionally, word spellings tend to be more stable than word pronunciations, which

change at a faster rate. Many English words retain spellings from an earlier time, when

the words were pronounced differently. Furthermore, English spellings are used for

purposes other than to phonetically represent a word, such as indicating etymological

relationships, e.g. keeping the letter “g” in sign, even though it is not pronounced,

signifying a relationship to the words signature, signet, signify, etc.

The notion that English spelling was chaotic and in need of reform percolated among

American, British, and Irish literary circles at least as far back as the 18th century.

Benjamin Franklin, Mark Twain, Noah Webster, and Andrew Carnegie are among the

notable American figures who attempted to address problematic English spelling



conventions to some extent. Franklin, for example, proposed his own streamlined

alphabet, featuring a reduction of extraneous letters and the addition of new letters to

better capture the sounds of English (Webster & Franklin, 1789). Interestingly, the one

lasting outcome from Dr. Franklin’s attempt at spelling reform was the eventual adoption

of his invented symbol for /ŋ/ by the IPA. The fact that he experienced far more success

in helping to establish a new country that broke violently away from British rule than he

ever did at reforming English orthography gives some indication as to just how difficult

such an endeavor must be.

Since English orthography was considered to have a high degree of irregularity,

traditional spelling instruction emphasized the rote memorization of wordlists, which

were arranged alphabetically and sometimes by number of syllables as well, but neither

arrangement necessarily organized words in terms of how difficult they were to learn

(Schlagal, 2002).

In the 1930s, word frequency began to be recognized as a significant indicator of

learning difficulty. It was reasoned that more frequent words would be seen more often

and so were easier to memorize. Resources such as Thorndike’s & Lorge’s “The

Teacher’s Book of 30,000 Words” began to offer educators lists of words that were

arranged by frequency, so that children could be taught the easier, high-frequency words

first (Thorndike & Lorge, 1944). This practice continues today, with wordlists such as

Edward Fry’s “instant words” being ordered by frequency (E. B. Fry & Kress, 2012; E.

Fry, 1980).

In the 1950s, linguistic research began to focus on the relationships between

graphemes and phonemes. The HHRH study published in 1966 was an extension of more

than a decade’s previous research on PGC frequency. It was the largest project of its kind

funded by the U.S. Department of Education, with the final publication exceeding 17,000

pages.

This was the first time such a study utilized computers to analyze large corpora of

words. The HHRH study drew upon 17,310 words from Thorndike & Lorge’s frequency

lists, and analyzed every PGC in a variety of contexts of position and stress. For

position, they listed the probabilities of each OC occurring in the initial, medial, and final

position of syllables. For stress, they listed probabilities for primary, secondary, or

unstressed syllables.

The authors concluded that English PGCs are predictable and consistent “80 percent

of the time” when both position and stress were also considered (Hodges & Rudorf,

1965). This claim is based on the probability that any given phoneme will align to its

“main” grapheme about 80% of the time. The HHRH researchers used an “80-percent

criterion” rule as a benchmark for whether a language could be considered predictably

represented by its orthography. This empirically derived conclusion caused a shift from

the perspective that English spelling was a broken system in need of reform to a position

that acknowledged the overall predictability of English orthography. The principle that

English spelling is mostly predictable has guided the methodology of reading and

spelling instruction since that time (Schlagal, 2002; Fresch, 2003).
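The 80-percent criterion amounts to a simple threshold test on each phoneme's most common grapheme; the counts below are invented for illustration:

```python
def meets_criterion(grapheme_counts, threshold=0.80):
    """HHRH-style check: does the phoneme's main (most frequent)
    grapheme account for at least `threshold` of its tokens?"""
    total = sum(grapheme_counts.values())
    return max(grapheme_counts.values()) / total >= threshold

# A consonant with one dominant spelling passes (950/1000 = 95%):
assert meets_criterion({"t": 950, "tt": 40, "ed": 10})
# A vowel sound split across many spellings fails (40/100 = 40%):
assert not meets_criterion({"ee": 40, "ea": 30, "e": 20, "ie": 10})
```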



Many researchers since have drawn upon and refined the data from the HHRH study.

Subsequent studies updated and reworked the Hanna et al. data (E. Fry, 2004), measuring

GPC probabilities for both “American” (Berndt, Reggia, & Mitchum, 1987) and “British”

(Gontijo, Gontijo, & Shillcock, 2003) English. However, there are several reasons why it

may be time to re-think the conclusions of the HHRH study.

First, computer technology of the era was in its infancy. Given that a cell phone in

2012 had more computational power than all of NASA during the 1969 Apollo moon

landing (Kaku, 2012), the study might have returned different outcomes had it been

undertaken more recently. The researchers admit throughout the study that linguistic

accuracy was sometimes compromised due to technological limitations (Hanna, 1966).

Another issue is that the corpus size of ~17,000 words is relatively small by today’s

standards, where the internet and digital technology have allowed for millions and even

billions of words to be analyzed rapidly with relatively small degrees of computational

error. Many studies have discussed how large a corpus size needs to be for the reliable

generalization of results, and somewhere between 16 and 30 million words seems optimal

(Brysbaert & New, 2009).

Also contestable is the composition of the grapheme list used by the HHRH study.

There is no authoritative list of English graphemes, so the researchers had to create their

own. This is not a straightforward process, however, and requires a certain degree of

arbitrary decision-making, something the researchers acknowledged. One example is how

to handle silent letters. Since silent letters have no phoneme with which to correspond,

they are attached to an adjacent letter which does correspond to a phoneme, i.e. the

formation of compound graphemes. The word asthma, for example, can be considered

either as

 “a”=/æ/ + “s” = /z/ + “thm” = /m/ + “a” = /ə/, or


 “a”=/æ/ + “sth” = /z/ + “m” = /m/ + “a” = /ə/

There is no obvious reason to prefer one over the other. The authors of HHRH study

were forced to make these choices, and in doing so, they effectively established research

conventions regarding how to best classify English graphemes. In several cases, these

decisions were dictated by technological constraints of the era. Thus Berndt et al., twenty

years later, noted that even though they used the HHRH classification system of

graphemes, it was short of ideal, stating that “other divisions of printed words than those

employed here would have resulted in a different set of circumstances” (Berndt et al.,

1987, p. 5). They further suggested that “the entire treatment of silent letters by

[HHRH]…might be handled differently if the goal is to describe segments that readers

actually use.”
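The asthma example above can be made explicit as data: two parses that recover the same phoneme string but tally different graphemes (the segmentations follow the text; the representation itself is a sketch):

```python
# Two defensible grapheme segmentations of "asthma", differing only in
# which sounding letter the silent "th" is attached to.
parses = [
    [("a", "/æ/"), ("s", "/z/"), ("thm", "/m/"), ("a", "/ə/")],
    [("a", "/æ/"), ("sth", "/z/"), ("m", "/m/"), ("a", "/ə/")],
]

# Both parses recover the same pronunciation...
assert [p for _, p in parses[0]] == [p for _, p in parses[1]]
# ...but tally different graphemes, so frequency counts (and therefore
# entropy values) depend on which convention the analyst adopts.
assert parses[0] != parses[1]
```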

Current Research on Orthographic Depth

In addition to these issues, current cross-linguistic research is discovering that English

speakers may in fact be handicapped by the depth of their orthography. Since the concept

of orthographic depth was introduced over 25 years ago (Katz & Frost, 1992), a large

body of research has accumulated that centers on analyzing and understanding its

ramifications. It has been shown to impact reading development, developmental and

acquired reading disorders, and theoretical accounts of reading (Schmalz, Marinus,

Coltheart, & Castles, 2015, p. 1614).



Extensive research has linked the depth of a language’s orthography to rates of

literacy acquisition: the deeper the orthography, the longer it takes to acquire basic

literacy (Seymour et al., 2003). Whereas the average learner of English requires three

years to attain basic reading proficiency (Chall, 1967), languages with a more transparent

orthography require less time. One striking example is that children learning the

extremely transparent orthography of the Finnish language, which features essentially a

phonetic spelling system, on average reach this same level of basic reading proficiency in

six months.

In 2003, a large experiment that tested decoding skills across fourteen orthographies

revealed that the English-speaking participants performed unexpectedly low when

compared to non-English speaking peers (Seymour et al., 2003). Figure 1 reports

percentages from two tasks of that study, which tested the accuracy of word-reading for

both real (content and function words) and non-real (mono and bisyllabic) words. As the

graph shows, on reading tasks where the majority of participants scored between 90-

100% for accuracy, the English-speaking children (the final bar in each category) failed

to reach 50%.

[Figure 1 appears here: a bar chart of Percentage Correct (0–100) on four word-reading tasks (content words, function words, monosyllabic nonwords, bisyllabic nonwords), with one bar per language group: Finnish, Greek, Italian, Spanish, Portuguese, French, Austrian, German, Norwegian, Icelandic, Swedish, Dutch, Danish, and English.]

Figure 1. Accuracy data from “Foundation Literacy Acquisition in European Orthographies” (Seymour et
al., 2003). English speakers performed consistently worse than peers speaking 13 other languages.

The English-speaking children in this study were drawn from schools in Scotland,

and caution is advised against assuming all English-speaking children are a homogeneous group

that can be fairly represented by this small sample of the global population. Nonetheless,

similar findings have been repeated in smaller studies comparing English to various

orthographies; for example, two separate studies of Welsh (transparent) and English

(opaque), both with rigorous cross-linguistic controls, reported similar outcomes: the

Welsh groups of 6-7 year olds were able to read twice as many words as their English

matched peers (Ellis & Hooper, 2001; Hanley, Masterson, Spencer, & Evans, 2004).

Welsh and English are well-suited for linguistic comparisons, because participants can

often be drawn from the same schools or areas, where both languages are taught

simultaneously to the same age groups, and this design helps minimize the effects of

confounding variables that may occur in cross-linguistic studies.



While not all results are as dramatic, English-speaking children do appear to score

consistently lower than their age-matched peers who have learned more transparent

orthographies. Richlan (2014) provides a succinct list of such studies that involve

learners with both typically developing (Aro & Wimmer, 2003; Bergmann & Wimmer,

2008; Cossu, Gugliotta, & Marshall, 1995; Frith, Wimmer, & Landerl, 1998; Georgiou,

Torppa, Manolitsis, Lyytinen, & Parrila, 2012; Seymour et al., 2003; Wimmer &

Goswami, 1994; Zoccolotti, De Luca, Di Filippo, Judica, & Martelli, 2009) as well as

dyslexic reading acquisition (Barca, Burani, Di Filippo, & Zoccolotti, 2006; Davies,

Cuetos, & Glez-Seijas, 2007; Landerl, Wimmer, & Frith, 1997; Landerl & Wimmer,

2000; Landerl et al., 2013; Richlan, 2014; Wimmer, 1993; Wimmer & Schurz, 2010;

Zoccolotti et al., 2005).

Conclusion

English orthography has never been viewed as a consistently phonetic system.

Centuries of diverse cultural influences have brought their own unique aesthetics and

stylistic preferences to the English writing system. As the spoken language continued to

evolve away from its conservative spelling system, there have been sporadic but

ultimately ineffective attempts at wholesale spelling reform. Instead, educators and

students have adapted to a complicated orthography by placing less emphasis on

decoding and more emphasis on memorization.

Spelling instruction traditionally involved memorizing word lists. Ranking words by

frequency, and to a lesser extent word length, continues to be a widely used means of

teaching English spelling effectively (Schlagal, 2002). In the 1960s, the HHRH study

analyzed approximately 17,000 high-frequency words and concluded that English OCs

are “predictable 80% of the time.” This perspective has informed spelling practices in the

U.S. since that time.

Several concerns can be raised with these findings, however. Half a century has now

passed since the HHRH study was published, and modern computing technology may

provide a clearer picture as to whether or not English orthography can accurately be

considered a predictable system. Furthermore, cross-linguistic studies on orthographic

depth have once again called into question the predictability of English orthography.

Using the mathematical concept of entropy to measure the predictability of graphemes

and phonemes may provide an answer to the question of just how predictable English

orthography really is.


MEASURING ORTHOGRAPHIC PREDICTABILITY 16

Methodology

Corpus

This study utilized The English Lexicon Project (ELP) database (Balota et al., 2007).

This is a free, online resource that can be found at elexicon.wustl.edu., based on the

Hyperspace Analogue to Language (HAL) corpus (Burgess & Livesay, 1998). A type

count is the number of different words in a text; the ELP’s type count is 40,481. A token

count is the number of times it appears in the corpus. Approximately 131 million word

tokens comprise the entire corpus, gathered from Usenet Internet news groups in 1995,

though more recent estimates place the number at close to 400 million words.1

Regardless, it has been shown that reliable frequency norms are achieved for both high

and low-frequency words from corpora of 16 million words, with diminishing returns

beginning at around 30 million (Brysbaert & New, 2009), so even the lower size estimate

should be more than adequate for generalization of results.

Appendix B provides a list of the phonemes utilized by the ELP. The ELP provided a

pronunciation for each word entry, based on the Standard American English dialect. It is

important to note that the resulting OCs analyzed in this study are ultimately dependent

upon the accuracy of this input. Some transcription errors and inconsistencies in the

ELP’s pronunciation listings were observed and documented below. Finer linguistic

distinctions not recognized by the ELP are beyond the scope of this study.

1
Estimate obtained from the ELP website “Database News & Update: 10/20/14” at
http://elexicon.wustl.edu/
MEASURING ORTHOGRAPHIC PREDICTABILITY 17

Defining English Graphemes

The number and shape of English graphemes vary, depending on how silent letters

are handled. In a purely phonetic alphabet, each letter would represent a single grapheme

and correspond to a single phoneme, but English orthography contains many silent

letters. These are annexed by neighboring letters to form compound graphemes. The

process of assigning silent letters to form compound graphemes, while often arbitrary, is

necessary if a definitive list of English graphemes is to be established.

By contrast, the number and type of phonemes in SAE are well defined, so it seems

reasonable to first identify the phonemes of a given word, and then parse the written word

into an identical number of graphemes. By definition, there should be a one-to-one

alignment between grapheme and phoneme, i.e. graphemes should not outnumber

phonemes in a word, though there are some cases where this occurs (see Table 4, below).

These graphemes must align with their phonemes in the same serial order, except for

words containing a final, silent “—e.”

This ubiquitous feature of English orthography required additional consideration.

Those learning to read English are frequently taught that the final –e “lengthens” the

preceding vowel, as in mat vs. mate. Acknowledging this rule, the authors of the HHRH

study decided to treat –e as being part of a compound grapheme that includes the

preceding vowel. This was written as V_e, where V is any vowel or vowel combination.

These special graphemes will be referred to in this study as split graphemes. Thus the

word mane is composed of the graphemes “m” + “a_e” + “n”.


MEASURING ORTHOGRAPHIC PREDICTABILITY 18

In this study, an “e” is considered as part of a split grapheme when it was directly

preceded by either a graphemic consonant(s), “gu-” or “qu-”. The associated vowel had

to be the first vowel encountered prior to the directly preceding consonant(s), “gu-”,

“qu-”, and any number of consecutive preceding vowels until a consonant, another split

grapheme, or the beginning of the word was encountered. This rule flagged the “e” as a

potential candidate for being part a split grapheme, but the parsing algorithm only

declared it to be if both of the following criteria were false:

1. The preceding vowel corresponded to a phoneme that was considered


a traditionally “short” vowel: /ɑ/, /ʊ/, /ə/, /ɛ/, /ɪ/, /ɔ/, /ʌ/, /ɚ/, or /ɝ/
2. The consonant preceding the “e” consisted of one of the following
OCs: “le” = /l/, “ge” = /ʒ/ or /ʤ/, “dg” = /ʤ/, “th” = /ð/, “ce” = /s/,
“sle” = /l/

The first rule followed the premise that if the vowel is not “lengthened” by the “e”,

then the “e” does not alter the vowel, and therefore must alter the adjacent consonant.

The second rule followed the premise that –e often indicates an alternative pronunciation

of the preceding consonant. Kessler & Treiman (2001), for example, provide five

alternative situations where –e could more accurately be judged to “belong” to the

consonant and not the vowel. Furthermore, they argue that –e can perform more than one

of these functions simultaneously.

This interpretation of –e diverges from the conventions of the HHRH study, which

considered final, silent –e as part of a split grapheme in all cases. The authors of the

HHRH study, however, observed that their decision regarding –e was a pragmatic,

simplistic decision that did not necessarily conform to orthographic reality (Hanna,

1966). For the current study, a middle ground was chosen between the two extremes.
MEASURING ORTHOGRAPHIC PREDICTABILITY 19

While it can be argued that –e can serve more than one function simultaneously, in this

study –e can only be assigned one role: as attaching to either the consonant(s) or the

vowel(s), but not both.

Alterations to the Corpus

Exclusions a priori. As many words as possible were analyzed from the ELP

database of 40,481 words, but certain entries in the ELP were initially deleted:

 70 entries with no phonetic transcription provided by the ELP,


leaving 40,411 entries.
 497 of the remaining 40,411 entries with a frequency of 0, leaving
39,914 entries.
 5 entries with invalid (e.g., non-English or in error) phonemes:
wiretap, Pachelbel, fille, firebox, and ecru. This leaves 39,909
entries.
 4 entries with “unusual” pronunciations: gunwale, halfpenny,
forecastle and ok. The first two words are sailing jargon, the third
is a “Briticism,” and the fourth a logogram.

Removing the above entries left a dataset of unique words containing 39,905 entries

to be automatically parsed by the program. 2,324 words types containing an initial

capital letter, i.e. proper nouns, were included in the dataset. The remaining alternations

were performed during the post-processing stage, after the OCs had been tallied and

entropy values calculated.

Combining Allophones. Certain linguistic sounds called allophones are different

sounds that speakers of a language can use interchangeably without affecting the meaning

of words. For example, the phoneme /ɾ/, called a “flap,” which is found in the middle of

words like butter and ladder, is in allophonic variation with /t/ and /d/ in SAE. Indeed,
MEASURING ORTHOGRAPHIC PREDICTABILITY 20

the average speaker is likely unaware that this linguistic distinction even exists. However,

the entropy values for the graphemes “t” and “d” were significantly affected when this

distinction was considered, ranking them as the second and third most ambiguous

consonantal graphemes. Since it appears English speakers do not typically struggle over

this distinction, the decision was made to combine /ɾ/ with either /t/ or /d/ as appropriate.

Another minor change was the combination of /ks/ and its voiced counterpart /gz/ into

a “single phoneme” when they correspond to the grapheme “x.” Sometimes this voicing

distinction impacts meaning, as in the words “box” and “bogs.” However, the grapheme

“x” is never employed in such cases, and this voicing distinction appears to be purely

allophonic whenever “x” is used. Again, over-inflated entropy values appeared when this

distinction was made, so the decision was made to ignore this allophonic distinction.

Apostrophes. For ease of parsing, apostrophes were treated as other letters, resulting

in a handful of graphemes which included apostrophes. These included seven graphemes

featuring an initial apostrophe (‘d, ‘ll, ‘m, ‘re, ‘s, ‘t, ‘ve), two graphemes with final

apostrophes (n’ and o’), one low-frequency grapheme with a medial apostrophe (a’a), and

a simple grapheme composed of the apostrophe by itself. All but the latter were then

combined with the identical graphemes that did not include an apostrophe, e.g. totals for

“ ‘ve” and “ve” were combined. The singular apostrophe could be parsed as a simple

grapheme corresponding to schwa (as in could’ve). This was retained as a unique

grapheme, occurring in 54 words in 29,904 instances in the corpus.


MEASURING ORTHOGRAPHIC PREDICTABILITY 21

Summary of the Computing Algorithm

A custom program was constructed using MATLAB software in order to count the

phonemes and graphemes of the corpus. First, a mapping file was created, containing a

list of all phonemes and all possible graphemes that could potentially map to each

phoneme. When analyzing a word, the algorithm proceeded in a serial, left-to-right

fashion. It first parsed a word into separate phonemes. The term parsing is used in this

study to refer to the graphic representation of a word, where individual

phonemes/graphemes are separated by a dash. The graphemic parsing of the word

English is “E-ng-l-i-sh,” and the phonemic parsing is /i-ŋ-l-ɪ-ʃ/. The program then

considered which graphemes in the word could legitimately correspond to each phoneme,

according to the mapping file. If a legal mapping could not be found for all OCs in a

word, an error was returned.

Some words could be parsed only one way, while some had multiple possible

parsings. For the latter, the algorithm then had to choose the most optimal parsing. It

accomplished this in a step-wise fashion, comparing the first two possibilities to see

which it “preferred.” Based on a series of predetermined preferences, it chose one parsing

and discarded the other, and then compared the next possible candidate to the incumbent

parsing. It proceeded in this manner until all possible parsings had been considered and

only one remained.

The result is that final phonemic and orthographic parsings were ultimately derived

for each word. Appendix D provides the specific guidelines and decision-making

processes the algorithm utilized. The type and token counts for all OCs could then be
MEASURING ORTHOGRAPHIC PREDICTABILITY 22

automatically counted by the program. From this frequency data, probabilities for each

OC and entropy values for all phonemes and graphemes were then calculated.

Reliability

The algorithm underwent a number of trial runs so that unanticipated errors could be

identified and corrected. This process was repeated until the algorithm returned a

manageable number of words to be manually parsed (85 word types, described in

Appendix D). The algorithm was then run a final time, allowing type and token counts

for all graphemes and phonemes to be tallied by the program. Appendix D contains a

sample list of 200 randomly selected words with their corresponding parsings.

Additionally, a random sample of 1000 parsed words were checked by an

independent rater for reliability. The rater, a graduate student familiar with IPA symbols

and phonetic transcription, underwent an hour-long training session on the acceptable

grapheme shapes delineated for this study. The rater was then instructed to highlight any

words that might contain an illegal parsing. The rater returned a list of 22 highlighted

words. It was then determined that two of the words contained an unusual but acceptable

pronunciation in the ELP, and one contained a phonemic combination which, for ease of

parsing, was parsed correctly during a later step by the algorithm. These three

questionable parsings, along with the remaining 19, were judged to all be legitimate

parsings as outlined by the grapheme definition process of this study. With 0 errors in a

sample size of 1000, therefore, a 98% upper confidence bound of .39% parsing errors in

the whole population of 39,905 is estimated.


MEASURING ORTHOGRAPHIC PREDICTABILITY 23

Calculating Entropy

Once the algorithm returned the frequency data for all phonemes and graphemes,

probabilities and entropy values could then be calculated. Throughout this study, X refers

to a random variable from the set of graphemes that we will define based

on our parsing of the ELP, where m is the number of such graphemes. Similarly, we

denote Y as a random variable from the set of phonemes that we will

define based on our parsing of the ELP, where n is the number of such phonemes.

For the random variable X (with an analogous definition for Y), the entropy of X is

defined as

where is the probability that X takes the value . This entropy formula will

essentially quantify how difficult it would be to predict which grapheme might be

randomly chosen out of the entire corpus.2

An additional calculation will yield conditional entropy, which quantifies the amount

of information needed to predict the outcome of one random variable Y, given that

another random variable X is known (where “Y given X” is notated . The

conditional entropy of Y given X = is defined by

Finally, the conditional entropy of X given Y is defined as:

2
Throughout this article, it is also assumed that = 0 when = 0.
MEASURING ORTHOGRAPHIC PREDICTABILITY 24

The first line indicates this will be a weighted average of the conditional entropy of Y

given over all , where the weighting will be the probability that X takes the

value . In other words, this will be the average (with frequency weighting) over all

graphemes of the difficulty in predicting how each grapheme should be pronounced. The

result will be a metric which quantifies how difficult it is to predict the pronunciation of

graphemes in the corpus. Analogous definitions will apply for H(X|Y=yj) and H(X|Y),

which will calculate the entropy of phonemes.

An entropy value of zero indicates that a grapheme is completely predictable, only

ever corresponding to one unique phoneme. The grapheme “dge,” for example, was

found to have zero entropy, because it only corresponded to the phoneme /ʤ/ as in edge,

judge, ridge, lodge, etc. Entropy values larger than zero indicate relatively increasing

unpredictability. For example, the graphemes “se” and “ce” were found to have entropy

values of .957 and .001, respectively. Therefore, “se” is more unpredictable than “ce.”
MEASURING ORTHOGRAPHIC PREDICTABILITY 25

Results

Graphemes

Classification. The complete list of all 322 graphemes discovered in the corpus are

listed in Appendix A. The total type count for all graphemes in the corpus was 264,618,

and the total token count was 1,603,490,234. Each grapheme’s probability is also listed,

which is the frequency divided by the total number of all graphemes in the entire corpus.

Both unweighted and frequency-weighted entropies are provided.

Unless otherwise specified, all calculations are based on token frequencies.

Graphemes which corresponded to phonetic consonants and vowels were analyzed

separately. A grapheme was classed as either vowel or consonant based on whether the

phoneme it corresponded to was consonantal or vocalic: for example, the grapheme “et”

in ballet was classed as a vowel, because it corresponded to the vocalic phoneme /e/.

Some ambiguous OCs could reasonably be called either consonant or vowel, and so it

was necessary to establish a few further parameters to classify these:

a. Graphemes corresponding to syllabic consonants were classed as


consonants, e.g. “le” = /l/ as in castle.
b. The graphemes “r,” “rr,” and “rrh” were considered consonants, while
graphemes which contained r but also a vowel (“er,” “ur,” “ear,” etc.)
were considered as vowels.
c. The grapheme “w” was classed as a consonant, because of the
overwhelming probability (.999) of it being pronounced as the semi-vowel
/w/ as opposed to the vowels /ʌ/ or /u/. The grapheme “y,” on the other
hand, represented the semi-vowel /j/ less than a fourth of the time (.266),
with vocalic variants (/ɪ/, /i/, /ə/, etc.) being on the whole more frequent,
and thus “y” was classed as a vowel.
MEASURING ORTHOGRAPHIC PREDICTABILITY 26

Length. The average grapheme length found in the corpus was 2.4 letters. Of the 322

graphemes in the corpus, 1 had six letters (“ailles”), 2 had five letters (“cques” and

“tzsch”), 27 had four letters, 96 had 3 letters, 169 had 2 letters, and 27 had one letter (this

includes the 26 alphabetic letters plus the apostrophe). The three graphemes longer than

four letters were found only in French or German loanwords, but were still considered for

analysis. Entropy values for each length (from six letters to one letter) were respectively

0, 0, 0.104, 0.123, 0.280, and 0.620.

Entropy. A useful first step in interpreting the data was to remove low-frequency

graphemes with a type count of ten or less. This helped clean the data of idiosyncratic

OCs that were found in only a few related words, such as “bt” = /t/ in debt and “olo” =

/ɚ/ in colonel. There were 146 low-frequency graphemes. 124 of these also had zero

entropy.

202 of the 322 graphemes had zero entropy, meaning they corresponded consistently

to the same phoneme in all instances in the corpus, and can be considered completely

predictable. 124 of these were also considered low-frequency, having a type count of 10

or less. Removing these two groups from consideration resulted in a “refined list” of 99

graphemes that had some degree of entropy, comprised of 45 consonants and 54 vowels.

These are listed in Table 1 below, ranked by entropy:


MEASURING ORTHOGRAPHIC PREDICTABILITY 27

Table 1. Graphemes ranked by entropy. Classed by consonant or vowel, with low-frequency and zero-
entropy graphemes removed.

No. C Frequency Entropy No. C Frequency Entropy No. V Frequency Entropy


1 s 105,633,904 1.027 38 es 2,147,378 0.004 28 or 5,535,041 0.964
2 in 350,753 1.000 39 ssi 572,191 0.003 29 err 122,344 0.962
3 all 617,422 0.968 40 ce 5,659,217 0.001 30 ue 855,526 0.912
4 se 1,696,607 0.957 41 an 616,250 0.001 31 oa 737,843 0.897
5 ch 7,122,450 0.933 42 al 2,765,896 0.001 32 ol 79,933 0.894
6 ed 4,059,494 0.909 43 sh 4,198,020 0.001 33 eur 21,762 0.845
7 f 37,433,336 0.874 44 v 15,036,744 <0.001 34 orr 211,821 0.736
8 g 16,164,105 0.732 45 r 67,773,671 <0.001 35 ew 1,606,405 0.727
9 th 51,995,883 0.711 No. V Frequency Entropy 36 is 32,953 0.670
10 wh 6,321,562 0.695 1 o 101,159,136 2.682 37 eigh 126,170 0.644
11 nd 10,643 0.596 2 u 27,685,856 2.307 38 hi 35,987 0.631
12 gh 317,754 0.591 3 a 121,387,814 2.294 39 ou_e 216,372 0.505
13 c 40,017,797 0.580 4 ou 18,438,447 2.177 40 o_e 4,582,654 0.481
14 si 790,039 0.540 5 ae 41,136 2.015 41 urr 350,854 0.472
15 zz 60,756 0.458 6 e 103,894,768 1.845 42 i_e 9,587,698 0.356
16 x 3,887,725 0.407 7 ea 7,545,614 1.705 43 our 129,262 0.315
17 ci 804,733 0.388 8 ie 1,265,508 1.697 44 ay 4,432,624 0.306
18 ss 4,825,294 0.387 9 y 32,632,117 1.680 45 ir 1,138,712 0.261
19 n 102,493,345 0.322 10 eu 135,096 1.632 46 aw 511,565 0.248
20 z 1,215,898 0.301 11 ough 1,008,396 1.601 47 on 8,818,144 0.230
21 ti 7,056,797 0.301 12 au 1,652,786 1.343 48 ee 5,913,902 0.168
22 ph 1,066,337 0.220 13 ui 309,495 1.315 49 eo 807,520 0.093
23 d 60,894,576 0.187 14 oo 3,895,895 1.289 50 ar 2,078,630 0.046
24 sc 371,066 0.160 15 ei 1,215,778 1.278 51 oi 976,408 0.040
25 gg 220,517 0.132 16 ah 143,359 1.232 52 oy 510,271 0.038
26 t 113,483,774 0.112 17 i 113,079,072 1.209 53 e_e 1,804,972 0.014
27 ge 2,372,177 0.089 18 oe 122,058 1.206 54 a_e 6,796,965 <0.001
28 j 3,701,888 0.071 19 ow 6,051,793 1.105
29 ne 3,060,629 0.059 20 ure 1,020,272 1.098
30 mn 79,355 0.055 21 ia 314,693 1.083
31 m 44,200,165 0.054 22 ai 4,269,932 1.050
32 ll 10,953,011 0.047 23 er 23,048,328 1.042
33 l 46,569,549 0.042 24 eau 65,434 1.032
34 le 5,375,598 0.036 25 ur 1,620,707 1.031
35 re 9,600,092 0.035 26 ey 2,517,956 1.020
36 gn 430,057 0.007 27 u_e 2,464,579 0.982
37 w 20,746,360 0.006
Note: C = consonantal grapheme, V = vocalic grapheme
MEASURING ORTHOGRAPHIC PREDICTABILITY 28

Consonants generally had less entropy than vowels, with only the most entropic

consonant, “s,” having an entropy value over 1.0. An informal observation of the ranking

shows that the group with the highest general entropy were singleton vowels. Split

graphemes tended to have relatively lower entropy, clustering towards the bottom end of

the list, indicating their pronunciations to be more stable than their counterparts without a

final, silent –e. It is worth noting that this could be a consequence of how graphemes

were defined for this study. For example, if a word contained both a “short vowel” and a

final, silent –e, the –e was combined with the preceding consonant, resulting in a non-

split grapheme. A word like love, even though it might appear to the novice reader to

contain a split grapheme, is not a split grapheme according to the criteria defined in the

methodology and therefore, the entropy for the split grapheme “o_e” is unaffected by this

pronunciation.

Cluster analysis. Using the “refined” list of graphemes from Table 1, where low-

frequency were removed, a hierarchical cluster analysis using complete linkage was run

separately for both consonants and vowels. This is a useful method for grouping the

graphemes together in terms of similar entropy values and characterizing them in terms

of predictability. Table 2 lists the graphemes divided into groups of predictability for both

consonants and vowels. “Complete” predictability refers to graphemes with zero-entropy.

These are listed alphabetically. The remaining categories are arranged in order of

increasing entropy. This provides a useful reference for educators who wish to know

which spelling patterns are the most predictable.


MEASURING ORTHOGRAPHIC PREDICTABILITY 29

Table 2. Graphemes grouped by predictability. The results of a hierarchical cluster analysis using complete
linkage, where complete predictability indicates zero-entropy.

Predictability Vocalic Graphemes Consonantal Graphemes


ai_e, arr, augh, ea_e, ear, ain, b, bb, bt, cc, mb, me, mm, ng, nm,
ee_e, et, eye, iar, ie_e, ck, cq, dd, de, dg, nn, om, ome, p, pp,
Complete iew, igh, io, ior, oo_e, dge, dj, el, ell, en, ps, q, qu, que, rh,
y_e, ', ff, gi, gu, gue, h, rr, sce, sci, shi, st,
ho, il, ile, k, kn, sw, tch, te, the, tt,
ld, lf, lk, lle, lm, tte, ul, ull, ve, wr, xh
a_e, e_e, oy, oi, ar, r, v, sh, al, an, mn, ne, j, ge, t,
Very High eo, ee, on, aw, ir, ce, ssi, es, w, gn,
ay, our, i_e, re, le, l, ll, gg, m,
High urr, o_e, ou_e, hi, eigh, sc, d, ph,
is, ew, orr,
eur, ol, oa, ue, err, ti, z, n, ss, ci,
err, or, u_e, ey, ur, x, zz,
Moderate eau, er, ai, ia, ure,
ow, oe, i, ah, ei,
oo, ui, au,
Low ough, eu, y, ie, ea, si, c, gh, nd, wh,
e, ae, th, g,
Very Low ou, a, u, o, f, ed, ch, se, all,
all, in, s,
Note: Graphemes featuring complete predictability are listed alphabetically. Otherwise,
graphemes are listed in order of increasing entropy (i.e. decreasing predictability).
MEASURING ORTHOGRAPHIC PREDICTABILITY 30

Figure 2. Vowel Cluster Analysis. Dendrogram and scatter plot of vocalic graphemes with complete
linkage, grouped into five categories of decreasing entropy.
MEASURING ORTHOGRAPHIC PREDICTABILITY 31

Figure 3. Consonant Cluster Analysis. Dendrogram and scatter plot of consonantal graphemes with
complete linkage, grouped into five categories of decreasing entropy.
MEASURING ORTHOGRAPHIC PREDICTABILITY 32

Total orthographic entropy. To derive the total orthographic entropy, the

frequency-weighted probabilities of all graphemes (i.e., the entropy value of each

grapheme multiplied by the probability of that grapheme appearing in the corpus) were

summed together, resulting in a total orthographic entropy of 0.889. This number

indicates the relative predictability of English orthography (as represented by the corpus)

in terms of pronunciation. If total entropy is likewise calculated for the orthographies of

other languages, these values can then be compared to show relative predictability

between languages (see “Measuring Orthographic Depth” in the next section for an

entropy comparison between English and modern Greek).

Phonemes

PGC probability. Appendix C lists the probabilities for all phoneme-to-grapheme

correspondences (PGCs) found in the corpus. The probability of each phoneme is listed

immediately below that phoneme; a phoneme’s probability is simply the frequency

(token count) of that phoneme divided by all occurrences of all phonemes in the corpus.

Next to each phoneme is then listed all graphemes found to correspond to that phoneme

in the corpus. Next to each grapheme is the probability for that particular PGC, which is

calculated by dividing the frequency (token count) of that PGC by the total number of

times that phoneme occurs in the corpus. For each phoneme, the graphemes are listed in

order of decreasing probability. Therefore, the first grapheme listed is the “main

correspondence,” the spelling pattern most often associated with a phoneme. Finally,

underneath all the probabilities for each phoneme is a number in bold, which is the

entropy value for that phoneme, derived using the probabilities listed.
MEASURING ORTHOGRAPHIC PREDICTABILITY 33

Table 3. Phonemes ranked by entropy.

P Ex. Frequency Prob. Entropy P Ex. Frequency Prob. Entropy


/ɪ/ sit 102,681,934 3.11% 2.743 /e/ day 37,085,249 2.31% 1.116
/ɝ/ sir 10,358,558 0.65% 2.553 /ɑ/ father 30,058,526 1.87% 1.097
/ʊ/ put 8,362,921 0.52% 2.198 /ɹ/ ram 82,400,979 5.13% 0.982
/l/ castle 10,275,525 0.64% 2.101 /i/ see 49,948,119 6.39% 0.947
/ʤ/ edge 10,234,263 0.64% 2.078 /l/ log 57,618,967 3.59% 0.800
/ʃ/ shun 13,555,847 0.84% 2.050 /f/ fee 29,989,182 1.87% 0.713
/ə/ above 86,526,629 5.39% 2.013 /z/ buzz 45,868,357 2.86% 0.703
/ɛ/ set 44,704,282 2.78% 1.968 /m/ mat 48,299,874 3.01% 0.674
/aɪ/ life 32,817,345 2.04% 1.931 /ð/ the 42,218,181 2.63% 0.664
/n/ button 13,065,713 0.81% 1.594 /ŋ/ song 16,916,321 1.05% 0.649
/k/ cat 53,242,102 3.32% 1.593 /m/ chasm 294,694 0.02% 0.577
/ɔ/ law 26,030,429 1.62% 1.578 /θ/ thin 9,724,927 0.61% 0.523
/ʌ/ hug 39,552,208 2.46% 1.577 /w/ wall 23,003,005 1.43% 0.479
/v/ vote 31,961,862 1.99% 1.555 /n/ night 104,184,757 6.49% 0.471
/o/ low 23,153,335 1.44% 1.524 /g/ go 13,773,233 0.86% 0.410
/u/ moot 30,774,839 1.92% 1.500 /h/ hat 17,695,054 1.10% 0.367
/ɚ/ waiter 24,508,665 1.53% 1.496 /p/ pat 37,725,681 2.35% 0.357
/ɾ/ middle 21,787,539 1.36% 1.398 /b/ bit 29,987,163 1.87% 0.044
/ʒ/ mirage 1,074,831 0.07% 1.310 /t/ to 106,307,958 6.62% 0.021
/ʧ/ church 7,939,684 0.49% 1.309 /hw/ whale 5,148,834 0.32% 0.017
/s/ sun 80,258,782 5.00% 1.303 /æ/ bat 53,613,522 3.34% 0.011
/oɪ/ boy 1,547,154 0.10% 1.229 /d/ do 58,449,083 3.64% 0.004
/aʊ/ cow 8,518,489 0.53% 1.176 /j/ yoke 8,670,071 0.54% 0.002
Note: The two English diphthongs / and / as in day and low are merged with their allophonic
counterparts /e/ and /o/, as these linguistic distinctions do not affect word meanings.

Entropy. Table 3 lists the phonemes of SAE, ranked by entropy, including the

frequency (token count) data, probability, and entropy of each phoneme in the corpus.

Unlike graphemes, there appeared to be no obvious relationship between entropy and

whether phonemes were consonantal or vocalic.


MEASURING ORTHOGRAPHIC PREDICTABILITY 34

Multiple phonemes. Not included in Table 3 above were instances in the corpus

where multiple phonemes corresponded to single graphemes. These are listed in Table 4.

They are all relatively low-frequency, with type counts less than 1000:

Table 4. Multiple phonemes corresponding to single letters. Instances where a singleton grapheme
corresponded to more than one phoneme, which were removed from final analysis.

Phonemes Type Count Token Count Entropy


jʊ 116 301,207 1.750
jə 519 1,727,893 1.533
jɚ 43 248,583 1.527
wɑ 2 1,280 0.972
nj 7 741 0.921
ju 743 5,513,303 0.857
ts 13 25,963 0.771
kʃ 21 83,673 0.445
gʒ 2 577 0.228
ks 714 2,986,740 0.026
əw 1 353 0
jɑ 1 96 0
jɛ 1 115 0
kə 1 1,054 0
wʌ 13 2,516,492 0
gz 117 560,170 0

Total Phonemic Entropy. The frequency-weighted probabilities of all phonemes(i.e.,

each phoneme’s entropy multiplied by it’s probability of occurrence in the corpus) were

summed together to calculate the total phonemic entropy value of 1.017. This number

indicates the relative predictability involved in spelling spoken English. Comparing this

number to the total orthographic entropy appears to indicate that encoding, overall,

involves a higher degree of uncertainty than decoding. Another way to say this is that

reading print is a more predictable process than spelling, which seems intuitive.
MEASURING ORTHOGRAPHIC PREDICTABILITY 35

Discussion

Is English Predictably Spelled?

The HHRH study is the most comprehensive source of evidence for the contemporary

perspective that English “orthography is alphabetic at least four-fifths of the time, or an

average of approximately 80 percent” (Hanna et al. 1966, p. 33). Using their 52-phoneme

classification system, the HHRH researchers found that phonemes corresponded to their

main grapheme 73.13% of the time, which would have been considered near but still

short of “predictable.” When the researchers then considered the additional linguistic

information of syllable position and stress, the percentage increased to 84.15%,

surpassing the 80-percent criterion.

This current study did not classify the data in terms of syllable position and stress, but

probabilities for all “main” PGCs were determined in the process of calculating

probabilities for all OCs. This data can be found in Appendix C. For each phoneme, the

first grapheme listed is the “main spelling” with the highest probability. So, for example,

the phoneme /t/’s main spelling is “t,” and /t/ was found in the corpus to be written as “t”

93% of the time (a probability of 0.933). Taking an average for all of these main spellings

resulted in the overall probability of 0.7326 that any given phoneme would correspond to

its main grapheme. In other words, English phonemes in the corpus are written as their

main spelling 73.26% of the time.

This number falls within a tenth of a percentage to the number derived by the

HHRH authors. Such close agreement reinforces the findings of both the HHRH study
MEASURING ORTHOGRAPHIC PREDICTABILITY 36

and the current study, at least as far as the highest probability PGCs are concerned. It also

indicates that increasing the corpus size from ~17,000 highest frequency words to ~131

million words (with no frequency limitations) had no significant impact on the

predictability of the main phoneme-to-grapheme correspondences.

The above calculation was then performed for GPC probabilities, analyzing the

probability that each grapheme corresponds to its “main pronunciation.” Summing the

frequency-weighted probabilities of all main pronunciations resulted in a percentage of

74.49%.

According to these calculations, the corpus features main correspondences in both

phoneme-grapheme and grapheme-phoneme directions a little under 75% of the time. In

the terminology of the HHRH study, then, English appears predictable roughly three-

fourths of the time. This is 5% less below the 80-percent criterion used as the benchmark

for predictability, but factoring in additional linguistic parameters like stress and position

would most likely lead to an increase in predictability, as evidenced by the HHRH study.

We argue that considering only the “main correspondence,” however, does not

provide a full picture of the complexities of English orthography. Entropy is a more

accurate measure of a phoneme or grapheme’s predictability, because it accounts for the

probability of every, not just the main, correspondence. Consider the phonemes X and Y,

where both have their main spelling correspondence 80% of the time, but phoneme X has

only one alternate spelling for the remaining 20% of the time, while Y has four

alternate spellings occurring 5% of the time each. Y is more uncertain, a distinction not

reflected in the observation that both are 80% predictable.
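The difference between X and Y can be made concrete with Shannon's entropy formula, H = sum of -p·log2(p), which is assumed here to be the entropy measure used throughout this study; the two distributions below correspond to the hypothetical phonemes just described:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum of -p * log2(p) over nonzero probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical phoneme X: main correspondence 80%, one alternate at 20%.
h_x = entropy([0.80, 0.20])
# Hypothetical phoneme Y: main correspondence 80%, four alternates at 5% each.
h_y = entropy([0.80, 0.05, 0.05, 0.05, 0.05])

print(round(h_x, 3), round(h_y, 3))  # → 0.722 1.122
```

Although both distributions are "80% predictable" by the main-correspondence criterion, Y's entropy is roughly half again as large as X's, reflecting its greater uncertainty.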



Measuring Orthographic Depth

Entropy has been proposed as a common metric by which differing orthographies

might be compared, though there is by no means universal agreement as to how

effectively entropy captures the notion of orthographic depth (Schmalz et al., 2015).

Cross-linguistic studies of orthographic depth have compared orthographies using a

variety of methods, such as expert agreement

(Seymour et al., 2003), naming accuracy and latency (Ellis et al., 2004), and word-initial

phoneme-grapheme entropy (Borgwaldt et al., 2004; 2005; 2006).

Since phonemes and graphemes are the universal building blocks of languages and

alphabetic orthographies, a common mathematical measurement of their correspondence

would be a boon to cross-linguistic scientific inquiry. For example, a study in 2009

calculated phoneme-grapheme entropy values for modern Greek (Protopapas & Vlahou,

2009). Despite having separate phonologies and alphabets, English and Greek could be

compared in terms of entropy. Greek was reported to have a total orthographic entropy

(grapheme-phoneme) of .167 and a total phonemic (phoneme-grapheme) entropy of .645,

while this current study reported comparative values for Standard American English of

.889 and 1.017. Standard American English, therefore, is more unpredictable for both

decoding and encoding than Greek. The same comparisons could be carried out for

consonants, vowels, initial letters, initial phonemes, and even morphological endings,

word classes like nouns and verbs, etc.

The problem is that the process of defining graphemes ultimately affects their entropy

values. The choice as to whether a final, silent –e attaches to the preceding vowel or

consonant, for example, determines which orthographic structures (i.e. letter

combinations) are to be analyzed, and this decision will impact how predictable the

overall orthography is judged to be. The first step towards having a valid cross-linguistic

entropy measure, therefore, is for general agreement to be reached within each language

community as to how their own graphemes are to be defined. English speakers face a

more difficult challenge in this respect than speakers of many other European languages,

whose orthographies are more transparent. This current study offers one definitive list of graphemes, but

this list was constructed through a series of decisions regarding the shape of graphemes

which are open to either verification or reinterpretation by others. Until the research

community reaches a consensus that answers the question, “What are the graphemes of

English?” entropy may not offer a truly absolute scale for cross-linguistic comparisons.

Word-Level Entropy

One potential application for the data involves formulating word entropies. Entropy

values for graphemes or phonemes of a word could be summed to calculate the entropy

value of that word, with the caveat that much more research is needed to determine how

precisely this metric would quantify a word’s difficulty in decoding or encoding. Such a

scheme ignores additional linguistic parameters, such as the impact of stress and meaning

or the effects of rime stability, for example, but it may still provide a convenient tool for

comparing relative predictability between words when, for example, choosing

appropriate words for spelling instruction.

A word’s orthographic entropy could simply be considered the sum of the entropy

values of its graphemes. For example, the orthographic entropy for the word sock,

written as “sock,” would equal “s” + “o” + “ck” = 1.027 + 2.682 + 0 = 3.709. The longer

the word, generally the greater the entropy (though this depends on the entropy of each

grapheme), which can account for how word complexity increases with length.

A word’s phonemic entropy, conversely, would be the sum of entropy values of its

phonemes. The word sock, spoken as /sɑk/, would have a phonemic entropy of /s/ + /ɑ/ +

/k/ = 1.303 + 1.097 + 1.593 = 3.993. Both orthographic and phonemic entropy values

describe the degree of predictability between the spoken and written forms of the word

sock. Due to the intricate relationship between encoding and decoding, it remains unclear

which calculation provides a better descriptor. Indeed, both numbers may be required to

accurately represent the word’s predictability. It is also possible to sum orthographic and

phonemic entropy values of a word into a single “total entropy” value, yet doing so may

bury necessary linguistic detail.
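A minimal sketch of these word-level sums, using the sock values reported above (the helper function and data layout are hypothetical, but the entropy values are those cited in the text):

```python
# Entropy values for sock's graphemes and phonemes, as cited in the text.
ORTHOGRAPHIC = {"s": 1.027, "o": 2.682, "ck": 0.0}
PHONEMIC = {"/s/": 1.303, "/ɑ/": 1.097, "/k/": 1.593}

def word_entropy(units, table):
    """Sum the entropy of each grapheme (or phoneme) in a word."""
    return sum(table[u] for u in units)

ortho = word_entropy(["s", "o", "ck"], ORTHOGRAPHIC)
phon = word_entropy(["/s/", "/ɑ/", "/k/"], PHONEMIC)
print(round(ortho, 3), round(phon, 3))  # → 3.709 3.993
total = ortho + phon  # a possible "total entropy", at the risk of burying detail
```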

Entropy-Based Spelling Instruction

Traditional method. Using entropy as a measurement for how difficult words are to

learn can lead to the creation of grade-leveled wordlists and basal spellers which are

stratified in terms of predictability. Entropy offers a potentially valuable resource for

educators who are unsure of how to choose appropriate words to be taught for spelling

instruction. Entropy-based word selection might help to remedy the fact that basal

spellers do not generally cover the depth of orthography that English has (Foorman &

Petscher, 2010).

Convenience is often a large factor in how spelling instruction is implemented in the

United States, and the reality of the classroom is that often a single common list is used

for all children in a classroom, even if the instructor believes this is not the most effective

method for every child. A nationwide survey comparing teacher beliefs and

practices in elementary schools (Fresch, 2003) found that while only 45% of

interviewed teachers agreed that one common spelling list for a whole classroom was the

most effective method for teaching spelling, 72% put this into practice. Of those that

agreed, 88% put this into practice. This was the highest level of practice and theoretical

agreement among all 355 participants. Notably, many survey respondents expressed

concerns regarding how to successfully teach the unpredictability found in English

orthography, such as “words that do not ‘follow the rules’” and the “many exceptions in

the English language” (Fresch 2003, p. 834).

Developmental method. The developmental approach to spelling instruction

emphasizes the individual learning needs of a student as indicated by their current stage

of development, which describes the common process of how children typically acquire

orthographic knowledge (Ehri, 2005; Henderson & Beers, 1980). Ehri describes literacy

acquisition in four stages, which must occur in sequence, though they do not necessarily

correspond with age. An older child with a language impairment, for example, might not

have advanced at the same rate as his peers and would not benefit from studying the same

list of words as his classmates.

Entropy-based word selection might have the greatest benefit for learners in the

second stage of literacy acquisition, termed “partial alphabetic.” At this stage, according

to Ehri, a child is working to learn sight words by cementing OCs in memory. The child

has a rudimentary but incomplete knowledge of OCs. They will often spell words

incorrectly but phonetically, based on the rules they have learned to that point. Exposing

the child to words which begin predictably but successively increase in uncertainty could

provide an appropriate scaffolding technique, particularly for readers who are struggling

to transition to the next stage of development.

In the third stage of development, the “full alphabetic stage,” English learners have

acquired the “major” OCs and the ability to segment words phonologically, based on the

graphemes they read. Here again, knowledge of the entropy of specific graphemes could

aid instruction. The student can be exposed to more challenging, entropic spellings as

their knowledge grows. Students experiencing difficulty could receive words with

decreased entropy compared to the rest of the class; outliers doing particularly well could

receive words with increased entropy.
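One way such differentiation might be implemented is to rank candidate words by summed entropy and split the ranking into tiers; the word list and entropy values below are invented placeholders, not values from this study:

```python
# Hypothetical summed word entropies (word -> entropy); invented values.
word_entropies = {
    "cat": 2.1, "dog": 2.4, "ship": 2.8, "tree": 3.1, "sock": 3.7,
    "said": 4.9, "was": 5.0, "one": 5.6, "laugh": 6.2,
}

ranked = sorted(word_entropies, key=word_entropies.get)  # easiest first
third = len(ranked) // 3
tiers = {
    "low": ranked[:third],           # for students needing more support
    "mid": ranked[third:2 * third],  # the whole-class list
    "high": ranked[2 * third:],      # for students ready for a challenge
}
print(tiers["low"], tiers["high"])  # → ['cat', 'dog', 'ship'] ['was', 'one', 'laugh']
```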

Concerns with entropy-based instruction. Currently, orthographic entropy does not

play a role in word selection for spelling instruction, which relies on word frequency and

length to determine levels of difficulty. In fact, current attitudes regarding the

reading/spelling difficulty of words may even run contrary to entropy analyses. It would

seem awkward, for example, to teach graphemes simply on the basis of entropy, as some

of the most entropic graphemes are in fact single letters, e.g., a, o, u, s, and f. Common

practice dictates that a student’s literacy journey begins with memorizing the alphabet,

where letters are ranked with no regard for predictability, e.g. “r” and “s,” the most and

least predictable consonants in Table 1, are alphabetic neighbors and learned together.

Along the same lines, spelling instruction customarily introduces “simpler” (i.e.

shorter) spelling patterns first. When considering vowel phonemes, spelling patterns for

“short” vowels are usually taught before “long” vowels because they are more frequent

and feature shorter spelling patterns. Yet Table 3 indicates the majority of these “short”

(technically lax) vowels feature higher entropy than their “long” (technically tense)

counterparts (e.g., /ɪ/ vs. /i/, /ʌ/ vs. /u/, etc.). According to their entropy values, long/tense

vowels are more predictably spelled. Therefore, students should theoretically be able to

learn these correspondences more easily than the more uncertain short vowels.

Limitations

As mentioned briefly in the methodology, this analysis is ultimately dependent on the

accuracy of the transcription data of the ELP. Observed errors are documented in

Appendix D. The usefulness of this analysis also depends on whether the results

generalize from the corpus to SAE as a whole. The HAL database, on which the ELP

is based, is both sufficiently sized and suitably descriptive of oral language, having been

described as “conversational and noisy, much like spoken language” (Burgess & Livesay,

1998), but it should be noted this is solely an adult-based lexicon. If the results of this

quantitative analysis of orthographic correspondences are to be applied to school-age

children in the process of literacy acquisition, an analysis drawn from children’s literature

may be more suitable. The current results, on the other hand, may more easily generalize

to a population such as adult stroke patients with acquired alexia.

Furthermore, it should be emphasized that entropy is a purely mathematical model

relying on probabilities and not, for example, influenced by psycholinguistic models of

reading. Therefore, these numbers are only meant as a conservative estimate, an

approximation of how relatively difficult written words might be to either decode

(pronounce) or encode (spell). Focusing solely on the grapheme-phoneme level ignores



how larger linguistic units influence the complex cognitive processes underlying reading

and writing.

For example, measuring the entropy of individual graphemes assumes that the

predictability of one grapheme is not influenced by its neighbors. In fact, experimental

studies have shown evidence to the contrary. The rime, a larger linguistic unit consisting

of a vowel and any consonants following it in a syllable, has been shown to be a more

stable unit in English than the grapheme, at least for monosyllabic words (Treiman,

Mullennix, Bijeljac-Babic, & Richmond-Welty, 1995).

For example, the grapheme “o” may be pronounced numerous ways, but only one

pronunciation is ever found when combined with the grapheme “ck” to produce the

orthographic rime “-ock,” found in words like sock, lock, dock, etc. This rime

corresponds highly consistently to the phonemic rime /ɑk/. The “ck” grapheme, in effect,

signals the reader that the preceding “o” is to correspond to a specific phoneme, /ɑ/,

thereby removing uncertainty at the rime level.
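This context effect can be framed as conditional entropy: uncertainty about a grapheme's pronunciation shrinks once the distribution is conditioned on its neighbors. The probability values below are invented for illustration and are not the study's figures:

```python
import math

def entropy(probs):
    """Shannon entropy in bits over nonzero probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Invented pronunciation distributions for the grapheme "o".
o_alone = {"/ɑ/": 0.4, "/oʊ/": 0.3, "/ʌ/": 0.2, "/u/": 0.1}
o_before_ck = {"/ɑ/": 1.0}  # in the rime "-ock", one pronunciation occurs

print(round(entropy(o_alone.values()), 3))  # → 1.846 (substantial uncertainty)
print(entropy(o_before_ck.values()))        # → 0.0 (fully predictable)
```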

Even so, Kessler and Treiman (2001), who have written

extensively on the subject of the rime-analysis of English, concluded that despite a higher

degree of consistency at the rime level, English rimes “are not processed as individual

units. Rather, the basic processing seems to occur at a phonemic-graphemic level that

takes into account the context in which each phoneme-grapheme is found” (Protopapas &

Vlahou, 2009, p. 992, referencing Treiman et al., 1995). In other words, rime-level

processing may be consulted to resolve ambiguity when it manifests in sufficient quantity

at the grapheme-phoneme level.



While the entropy model cannot perfectly imitate the complex process of reading,

a number of decisions in processing the data were made in order to more closely

approximate “a reader’s experience.” These steps were described in the methodology

section. It should be noted that such decisions do not have a specific evidentiary basis. It

remains to be empirically shown, therefore, whether English learners indeed experience

decoding/encoding in a manner consistent with this analysis. Specifically, the results of

this analysis would be strengthened by data showing that students actually process the

graphemes defined by this study. Until such data is collected, defining the shape of

English graphemes could remain a philosophical endeavor. For example, it remains

unclear whether the final –e in words like love and edge is processed as part of a split

grapheme or as part of a compound grapheme combined with the preceding consonant.

Moving forward, research which answers these questions may help solidify a general

consensus regarding the shape of English graphemes.

Future Research

Research that aims to discover how closely orthographic predictability actually

reflects the psycholinguistic experience of reading would be useful. It remains

unclear to what extent orthographic/phonemic entropy actually impacts reading

acquisition, and so future experimental research could determine what correlation might

exist between the mathematical predictability of a word’s composition and how difficult

it is to read or spell. The ELP provides experimental behavioral data which could

possibly be used in such future studies. Naming and lexical decision data can be accessed

for the 40,481 words listed in the restricted ELP database. If indeed a word’s entropy can

be considered the sum of the entropies of its individual graphemes (as discussed above), a

conceivable application of this data would be to calculate entropy values for the words of

the ELP database and then analyze the results for any correlations between entropy

values and the naming and lexical decision data provided. This might provide some

insight into the relationship between a word’s entropy and how long it takes to decode. If

there is indeed a link between orthographic predictability and ease of acquisition,

students may learn new vocabulary more quickly and easily if exposed to words of gradually

increasing entropy.

Just as the HHRH study analyzed the additional linguistic parameters of stress and

syllable position, a future analysis might include these factors when considering the

predictability of orthographic correspondences. The ELP does provide information

regarding both stress and syllable position, and therefore it would be feasible to construct

a more elaborate software program that would account for these parameters using the

same corpus. The results of such an undertaking would most likely provide a more

complete, precise picture of the predictability of orthographic correspondences.

It would also be interesting to see how the structure of current reading programs

addresses orthographic predictability. Do well established curricula such as Orton-

Gillingham or the Association Method, which have unique methods for teaching

grapheme-phoneme correspondences to students, consider orthographic entropy at all?

Would a reading program designed around entropy show improved outcomes compared

to a program that disregards orthographic predictability?

Using entropy as a measurement of orthographic depth has already been discussed;

future research along this avenue would involve database analyses similar to this current

study for other languages. The entropy values from such studies could then be

compared, establishing the relative complexity of different orthographies against one

another.

Theoretically, the type of analysis conducted in this study could be extended not just

to other languages, but to various dialects within English as well. Now that we have a

measure of how predictably SAE corresponds to its orthography, one can ask: is it

more or less entropic than other dialects of English? For example, to what degree does

SAE diverge from its orthography compared to African-American English (AAE)? Does

this degree of distance between a spoken dialect and the orthography, as expressed by

entropy, correlate at all with literacy outcomes, i.e. does having an accent further

removed from standard written English result in more difficulty in learning to read?

Unfortunately, the ELP provides pronunciations only for Standard American English. This

means that before any such cross-dialectal comparisons can be made, researchers would

need access to a sufficiently sized corpus with a pronunciation guide representing other

English dialects, such as AAE.

A final suggestion for future research concerns those suprasegmental linguistic

features of stress and syllabic position not addressed in this study. Certainly if these

parameters could be accounted for in the entropy calculation, a more finely-tuned

measure of predictability would result. As stated, entropy should be viewed as a

conservative measurement of the reading experience, and one limitation of this study is

that the interplay between graphemes is not adequately captured in this current analysis.

Examining stress and position would provide a picture of orthographic predictability that

is most likely closer to the reading experience than simply analyzing individual

graphemes. Graphemes do not exist isolated in space, after all, and prior studies such as

HHRH have shown how predictability is increased when stress and position are

considered.

Conclusion

After formulating a definitive list of English graphemes, the current study calculated

probabilities of correspondence and entropy values for all phonemes and graphemes in

Standard American English as well as a total system entropy for both orthographic and

phonemic predictability. Phonemes and graphemes were then ranked in terms of their

predictability, a scheme that provides a convenient resource for educators and researchers

who wish to identify the most and least predictable sounds and spelling patterns of the

language. The study confirmed the prior findings of the HHRH study, which asserted that

“English is 80% predictable,” but suggested this is a simplistic measurement which may

not fully account for the complexities of English orthography.

Entropy values for phonemes and graphemes may provide a foundation for further

research into the predictability of English orthography. Like word frequency, spelling

predictability may be a useful measurement for how difficult words are to learn, offering

a systematic guideline for constructing wordlists of increasing difficulty for spelling

instruction. Children in various stages of literacy development could also benefit from

vocabulary that is tailored in terms of predictability to their developmental needs. Future

studies can address how the entropy of graphemes and phonemes impacts the cognitive

processes which underlie reading and writing.



The complex nature of these psycholinguistic processes may never be fully distillable

to a single measurement. Considering the deep connection between encoding and

decoding, it is ultimately unclear as to whether orthographic entropy or phonemic

entropy best describes the predictability of words. Also, before entropy can be used for

comparisons across languages and/or dialects, a consensus must first be reached on the

composition and structure of graphemes, particularly for English. Once these technical

issues have been addressed, entropy might provide a useful tool for future cross-linguistic

research. Its mathematical foundation highlights the fact that, ultimately, languages and

orthographies have more commonalities than differences.



References

Aro, M., & Wimmer, H. (2003). Learning to read: English in comparison to six more regular orthographies. Applied Psycholinguistics, 24(4), 621-635.

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., . . . Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445-459.

Barca, L., Burani, C., Di Filippo, G., & Zoccolotti, P. (2006). Italian developmental dyslexic and proficient readers: Where are the differences? Brain and Language, 98(3), 347-351.

Bergmann, J., & Wimmer, H. (2008). A dual-route perspective on poor reading in a regular orthography: Evidence from phonological and orthographic lexical decisions. Cognitive Neuropsychology, 25(5), 653-676.

Berndt, R. S., Reggia, J. A., & Mitchum, C. C. (1987). Empirically derived probabilities for grapheme-to-phoneme correspondences in English. Behavior Research Methods, Instruments, & Computers, 19(1), 1-9.

Borgwaldt, S. R., Hellwig, F. M., De Groot, A., & Licht, R. (2006). Word-initial sound-spelling patterns: Cross-linguistic analyses and empirical validations of phoneme-letter feedback consistency effects. UofA Working Papers in Linguistics, 1.

Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. (2004). Word-initial entropy in five languages: Letter to sound, and sound to letter. Written Language & Literacy, 7(2), 165-184.

Borgwaldt, S. R., Hellwig, F. M., & De Groot, A. M. (2005). Onset entropy matters: Letter-to-phoneme mappings in seven languages. Reading and Writing, 18(3), 211-229.

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, Instruments, & Computers, 30(2), 272-277.

Chall, J. (1967). Learning to read: The great debate. New York: McGraw-Hill.

Cossu, G., Gugliotta, M., & Marshall, J. C. (1995). Acquisition of reading and written spelling in a transparent orthography: Two non-parallel processes? Reading and Writing, 7(1), 9-22.

Crystal, D. (2012). Spell it out: The singular story of English spelling. Profile Books.

Davies, R., Cuetos, F., & Glez-Seijas, R. M. (2007). Reading development and dyslexia in a transparent orthography: A survey of Spanish children. Annals of Dyslexia, 57(2), 179-198.

Delattre, M., Bonin, P., & Barry, C. (2006). Written spelling to dictation: Sound-to-spelling regularity affects both writing latencies and durations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(6), 1330.

Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. Scientific Studies of Reading, 9(2), 167-188.

Ellis, N. C., & Hooper, A. M. (2001). Why learning to read is easier in Welsh than in English: Orthographic transparency effects evinced with frequency-matched tests. Applied Psycholinguistics, 22(4), 571-599.

Ellis, N. C., Natsume, M., Stavropoulou, K., Hoxhallari, L., Van Daal, V. H., Polyzoe, N., . . . Petalas, M. (2004). The effects of orthographic depth on learning to read alphabetic, syllabic, and logographic scripts. Reading Research Quarterly, 39(4), 438-468.

Foorman, B. R., & Petscher, Y. (2010). Development of spelling and differential relations to text reading in grades 3-12. Assessment for Effective Intervention, 36(1), 7-20.

Fresch, M. J. (2003). A national survey of spelling instruction: Investigating teachers' beliefs and practice. Journal of Literacy Research, 35(3), 819-848.

Frith, U., Wimmer, H., & Landerl, K. (1998). Differences in phonological recoding in German- and English-speaking children. Scientific Studies of Reading, 2(1), 31-54.

Fry, E. (1980). The new instant word list. The Reading Teacher, 34(3), 284-289.

Fry, E. (2004). Phonics: A large phoneme-grapheme frequency count revised. Journal of Literacy Research, 36(1), 85-98.

Fry, E. B., & Kress, J. E. (2012). The reading teacher's book of lists. John Wiley & Sons.

Georgiou, G. K., Torppa, M., Manolitsis, G., Lyytinen, H., & Parrila, R. (2012). Longitudinal predictors of reading and spelling across languages varying in orthographic consistency. Reading and Writing, 25(2), 321-346.

Gontijo, P. F., Gontijo, I., & Shillcock, R. (2003). Grapheme-phoneme probabilities in British English. Behavior Research Methods, Instruments, & Computers, 35(1), 136-157.

Hanley, R., Masterson, J., Spencer, L., & Evans, D. (2004). How long do the advantages of learning to read a transparent orthography last? An investigation of the reading skills and reading impairment of Welsh children at 10 years of age. The Quarterly Journal of Experimental Psychology: Section A, 57(8), 1393-1410.

Hanna, P. R. (1966). Phoneme-grapheme correspondences as cues to spelling improvement.

Henderson, E. H., & Beers, J. W. (1980). Developmental and cognitive aspects of learning to spell: A reflection of word knowledge. ERIC.

Hodges, R. E., & Rudorf, E. H. (1965). Searching linguistics for cues for the teaching of spelling. Elementary English, 42(5), 527-533.

Johnston, F. R. (2000). Exploring classroom teachers' spelling practices and beliefs. Literacy Research and Instruction, 40(2), 143-155.

Kaku, M. (2012). Physics of the future: How science will shape human destiny and our daily lives by the year 2100. Anchor.

Katz, L., & Frost, R. (1992). The reading process is different for different orthographies: The orthographic depth hypothesis. Advances in Psychology, 94, 67-84. doi:10.1016/S0166-4115(08)62789-2

Kessler, B., & Treiman, R. (2001). Relationships between sounds and letters in English monosyllables. Journal of Memory and Language, 44(4), 592-617.

Landerl, K., & Wimmer, H. (2000). Deficits in phoneme segmentation are not the core problem of dyslexia: Evidence from German and English children. Applied Psycholinguistics, 21(2), 243-262.

Landerl, K., Wimmer, H., & Frith, U. (1997). The impact of orthographic consistency on dyslexia: A German-English comparison. Cognition, 63(3), 315-334.

Landerl, K., Ramus, F., Moll, K., Lyytinen, H., Leppänen, P. H. T., Lohvansuu, K., . . . Schulte-Körne, G. (2013). Predictors of developmental dyslexia in European orthographies with varying complexity. Journal of Child Psychology and Psychiatry, 54(6), 686-694. doi:10.1111/jcpp.12029

Moats, L. C. (2005). How spelling supports reading. American Educator, 6(12-22), 42-43.

National Reading Panel (US), National Institute of Child Health, & Human Development (US). (2000). Report of the National Reading Panel: Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups. National Institute of Child Health and Human Development, National Institutes of Health.

Protopapas, A., & Vlahou, E. L. (2009). A comparative quantitative analysis of Greek orthographic transparency. Behavior Research Methods, 41(4), 991-1008.

Richlan, F. (2014). Functional neuroanatomy of developmental dyslexia: The role of orthographic depth. Frontiers in Human Neuroscience, 8, 347.

Schlagal, B. (2002). Classroom spelling instruction: History, research, and practice. Literacy Research and Instruction, 42(1), 44-57.

Schmalz, X., Marinus, E., Coltheart, M., & Castles, A. (2015). Getting to the bottom of orthographic depth. Psychonomic Bulletin & Review, 22(6), 1614-1629. doi:10.3758/s13423-015-0835-2

Seymour, P. H. K., Aro, M., Erskine, J. M., & collaboration with COST Action A8 network. (2003). Foundation literacy acquisition in European orthographies. British Journal of Psychology, 94(2), 143-174. doi:10.1348/000712603321661859

Share, D. L. (2008). On the Anglocentricities of current reading research and practice: The perils of overreliance on an "outlier" orthography. Psychological Bulletin, 134(4), 584-615. doi:10.1037/0033-2909.134.4.584

Sharp, A. C., Sinatra, G. M., & Reynolds, R. E. (2008). The development of children's orthographic knowledge: A microgenetic perspective. Reading Research Quarterly, 43(3), 206-226.

Thorndike, E. L., & Lorge, I. (1944). The teacher's wordbook of 30,000 words. New York: Columbia University, Teachers College.

Treiman, R., Mullennix, J., Bijeljac-Babic, R., & Richmond-Welty, E. D. (1995). The special role of rimes in the description, use, and acquisition of English orthography. Journal of Experimental Psychology: General, 124(2), 107.

Twain, M. (2016). What is man? And other essays. Xist Publishing.

Webster, N., & Franklin, B. (1789). Dissertations on the English language: With notes, historical and critical, to which is added, by way of appendix, an essay on a reformed mode of spelling, with Dr. Franklin's arguments on that subject. Printed for the author by Isaiah Thomas and Company.

Wimmer, H. (1993). Characteristics of developmental dyslexia in a regular writing system. Applied Psycholinguistics, 14(1), 1-33.

Wimmer, H., & Goswami, U. (1994). The influence of orthographic consistency on reading development: Word recognition in English and German children. Cognition, 51(1), 91-103.

Wimmer, H., & Schurz, M. (2010). Dyslexia in regular orthographies: Manifestation and causation. Dyslexia, 16(4), 283-299.

Zoccolotti, P., De Luca, M., Di Filippo, G., Judica, A., & Martelli, M. (2009). Reading development in an orthographically regular language: Effects of length, frequency, lexicality and global processing ability. Reading and Writing, 22(9), 1053-1079.

Zoccolotti, P., De Luca, M., Di Pace, E., Gasperini, F., Judica, A., & Spinelli, D. (2005). Word length effect in early reading and in developmental dyslexia. Brain and Language, 93(3), 369-373.

Appendix A: Alphabetical List of All English Graphemes

No. Grapheme Type Count Token Count Probability Entropy


1 a 17,930 121,387,814 7.56% 2.294
2 a_e 1,261 6,796,965 0.42% 0.000
3 aa 5 18,404 <0.01% 1.517
4 ae 52 41,136 <0.01% 2.015
5 ael 2 102,268 0.01% 0.000
6 ah 37 143,359 0.01% 1.232
7 ai 789 4,269,932 0.27% 1.050
8 ai_e 28 73,717 <0.01% 0.000
9 aigh 9 48,534 <0.01% 0.000
10 ailles 1 768 <0.01% 0.000
11 ain 24 230,301 0.01% 0.000
12 ais 2 1,459 <0.01% 0.281
13 al 710 2,765,896 0.17% 0.001
14 all 205 617,422 0.04% 0.968
15 am 5 982 <0.01% 0.000
16 an 197 616,250 0.04% 0.001
17 ao 8 11,988 <0.01% 0.975
18 aoh 1 477 <0.01% 0.000
19 ar 537 2,078,630 0.13% 0.046
20 arr 36 101,737 0.01% 0.000
21 as 3 5,973 <0.01% 0.408
22 at 2 172 <0.01% 0.000
23 au 412 1,652,786 0.10% 1.343
24 au_e 4 538 <0.01% 0.000
25 augh 24 76,831 <0.01% 0.000
26 aur 2 2,417 <0.01% 0.000
27 aw 163 511,565 0.03% 0.248
28 awe 5 11,483 <0.01% 0.000
29 awr 1 431 <0.01% 0.000
30 ay 412 4,432,624 0.28% 0.306
31 ay_e 1 14,115 <0.01% 0.000
32 aye 7 14,581 <0.01% 0.342
33 b 5,480 29,845,744 1.86% 0.000
34 bb 168 139,751 0.01% 0.000
35 bh 1 288 <0.01% 0.000
36 bp 1 515 <0.01% 0.000
37 bt 18 87,665 0.01% 0.000
38 c 9,418 40,017,797 2.49% 0.580
39 cc 150 434,614 0.03% 0.000
40 cch 5 839 <0.01% 0.000
41 ce 844 5,659,217 0.35% 0.001
42 ces 2 2,361 <0.01% 0.000
43 ch 1,255 7,122,450 0.44% 0.933
44 che 9 36,800 <0.01% 0.883
45 chsi 1 28 <0.01% 0.000
46 cht 6 1,338 <0.01% 0.000
47 ci 203 804,733 0.05% 0.388
48 ck 923 2,919,639 0.18% 0.000
49 cq 20 25,274 <0.01% 0.000
50 cqu 4 1,900 <0.01% 0.000
51 cques 1 4,567 <0.01% 0.000
52 cs 1 3,578 <0.01% 0.000
53 ct 8 8,422 <0.01% 0.000
54 cu 3 6,548 <0.01% 0.000
55 cz 4 4,878 <0.01% 0.450
56 d 10,566 60,894,576 3.79% 0.187
57 dd 199 968,904 0.06% 0.000
58 de 18 20,029 <0.01% 0.000
59 dg 53 96,498 0.01% 0.000
60 dge 76 268,631 0.02% 0.000
61 dh 1 1,485 <0.01% 0.000
62 di 7 21,233 <0.01% 0.000
63 dj 35 33,852 <0.01% 0.000
64 dn 3 14,826 <0.01% 0.000
65 dth 2 153 <0.01% 0.000
66 e 15,410 103,894,768 6.47% 1.845
67 e_e 137 1,804,972 0.11% 0.014
68 ea 1,309 7,545,614 0.47% 1.705
69 ea_e 60 978,406 0.06% 0.000
70 ear 70 773,428 0.05% 0.000
71 eau 22 65,434 <0.01% 1.032
72 eaux 1 518 <0.01% 0.000
73 ed 1,563 4,059,494 0.25% 0.909
74 ee 790 5,913,902 0.37% 0.168
75 ee_e 20 33,338 <0.01% 0.000
76 eh 2 13,918 <0.01% 0.096
77 ei 97 1,215,778 0.08% 1.278
78 ei_e 4 7,924 <0.01% 0.053
79 eigh 56 126,170 0.01% 0.644
80 eii 1 127 <0.01% 0.000
81 el 133 585,208 0.04% 0.000
82 ell 20 42,750 <0.01% 0.000
83 em 1 291 <0.01% 0.000
84 en 356 1,297,055 0.08% 0.000
85 eo 19 807,520 0.05% 0.093
86 eou 3 8,587 <0.01% 0.000
87 er 5,258 23,048,328 1.44% 1.042
88 ere 3 684,551 0.04% 0.003
89 err 57 122,344 0.01% 0.962
90 erwr 6 1,656 <0.01% 0.000
91 es 413 2,147,378 0.13% 0.004
92 et 27 9,834 <0.01% 0.000
93 eu 86 135,096 0.01% 1.632
94 eur 18 21,762 <0.01% 0.845
95 ew 165 1,606,405 0.10% 0.727
96 ewe 1 418 <0.01% 0.000
97 ey 177 2,517,956 0.16% 1.020
98 eye 32 107,802 0.01% 0.000
99 ez 1 771 <0.01% 0.000
100 f 3,623 37,433,336 2.33% 0.874
101 fe 1 63 <0.01% 0.000
102 ff 391 2,045,640 0.13% 0.000
103 ffe 1 379 <0.01% 0.000
104 ft 8 120,561 0.01% 0.000
105 g 3,704 16,164,105 1.01% 0.732
106 ge 407 2,372,177 0.15% 0.089
107 gg 225 220,517 0.01% 0.132
108 gh 63 317,754 0.02% 0.591
109 gi 42 185,015 0.01% 0.000
110 gm 3 619 <0.01% 0.000
111 gn 96 430,057 0.03% 0.007
112 gu 105 572,154 0.04% 0.000
113 gue 19 22,728 <0.01% 0.000
114 h 1,960 16,493,014 1.03% 0.000
115 ha 7 12,312 <0.01% 0.150
116 he 4 2,428 <0.01% 1.000
117 hei 4 2,686 <0.01% 0.000
118 her 4 8,713 <0.01% 0.000
119 hi 14 35,987 <0.01% 0.631
120 ho 16 61,294 <0.01% 0.000
121 hoe 1 124 <0.01% 0.000
122 hoi 1 80 <0.01% 0.000
123 hou 5 125,350 0.01% 0.000
124 hu 2 1,236 <0.01% 0.000
125 i 20,260 113,079,072 7.04% 1.209
126 i_e 1,520 9,587,698 0.60% 0.356
127 ia 57 314,693 0.02% 1.083
128 iar 12 36,091 <0.01% 0.000
129 ie 400 1,265,508 0.08% 1.697
130 ie_e 26 288,504 0.02% 0.000
131 ier 2 186 <0.01% 0.000
132 ieu 6 8,940 <0.01% 0.310
133 iew 25 263,882 0.02% 0.000
134 ig 2 5,467 <0.01% 0.000
135 igh 241 1,871,680 0.12% 0.000
136 ign 1 2,072 <0.01% 0.000
137 il 75 147,455 0.01% 0.000
138 ile 25 40,227 <0.01% 0.000
139 ill 10 2,201 <0.01% 0.567
140 in 98 350,753 0.02% 1.000
141 ing 9 21,079 <0.01% 0.000
142 io 53 311,539 0.02% 0.000
143 ior 11 79,637 <0.01% 0.000
144 iou 2 618 <0.01% 0.000
145 ioux 1 1,316 <0.01% 0.000
146 ir 250 1,138,712 0.07% 0.261
147 irr 7 5,253 <0.01% 0.000
148 is 11 32,953 <0.01% 0.670
149 it 2 1,175 <0.01% 0.000
150 iu 3 8,351 <0.01% 0.000
151 j 574 3,701,888 0.23% 0.071
152 ju 5 17,056 <0.01% 1.065
153 k 1,740 10,866,474 0.68% 0.000
154 ke 1 11,934 <0.01% 0.000
155 kg 1 230 <0.01% 0.000
156 kh 5 8,364 <0.01% 0.546
157 kk 2 56 <0.01% 0.000
158 kn 73 1,235,975 0.08% 0.000
159 l 10,943 46,569,549 2.90% 0.042
160 ld 12 2,606,500 0.16% 0.000
161 le 1,276 5,375,598 0.33% 0.036
162 lf 11 85,946 0.01% 0.000
163 lh 4 5,271 <0.01% 0.000
164 lk 61 367,405 0.02% 0.000
165 ll 1,278 10,953,011 0.68% 0.047
166 lle 14 16,323 <0.01% 0.000
167 lm 24 32,248 <0.01% 0.000
168 ln 2 8,735 <0.01% 0.000
169 lv 3 2,154 <0.01% 0.000
170 lve 2 244 <0.01% 0.000
171 m 7,912 44,200,165 2.75% 0.054
172 mb 67 99,871 0.01% 0.000
173 me 34 2,456,437 0.15% 0.000
174 mm 366 1,396,853 0.09% 0.000
175 mme 2 2,908 <0.01% 0.000
176 mmes 1 483 <0.01% 0.000
177 mn 17 79,355 <0.01% 0.055
178 mp 1 222 <0.01% 0.000
179 n 15,884 102,493,345 6.38% 0.322
180 nd 13 10,643 <0.01% 0.596
181 ne 100 3,060,629 0.19% 0.059
182 ng 3,343 14,242,619 0.89% 0.000
183 ngue 6 17,467 <0.01% 0.000
184 nh 1 654 <0.01% 0.000
185 nm 11 297,371 0.02% 0.000
186 nn 328 1,281,013 0.08% 0.000
187 nne 6 16,230 <0.01% 0.000
188 nt 2 170 <0.01% 0.000
189 o 10,917 101,159,136 6.30% 2.682
190 o_e 669 4,582,654 0.29% 0.481
191 oa 373 737,843 0.05% 0.897
192 oa_e 2 533 <0.01% 0.000
193 oar 2 706 <0.01% 0.000
194 oe 56 122,058 0.01% 1.206
195 oh 8 338,392 0.02% 0.938
196 oi 246 976,408 0.06% 0.040
197 oi_e 9 23,095 <0.01% 0.000
198 ois 4 17,436 <0.01% 0.310
199 ol 35 79,933 <0.01% 0.894
200 olo 3 3,628 <0.01% 0.000
201 om 15 4,872 <0.01% 0.000
202 ome 28 19,566 <0.01% 0.000
203 on 2,097 8,818,144 0.55% 0.230
204 onn 2 19,009 <0.01% 0.000
205 oo 858 3,895,895 0.24% 1.289
206 oo_e 28 94,585 0.01% 0.000
207 ooh 1 1,428 <0.01% 0.000
208 oor 2 1,998 <0.01% 0.000
209 or 1,052 5,535,041 0.34% 0.964
210 orr 38 211,821 0.01% 0.736
211 os 2 446 <0.01% 0.000
212 ot 4 2,867 <0.01% 0.000
213 ou 1,400 18,438,447 1.15% 2.177
214 ou_e 111 216,372 0.01% 0.505
215 ough 50 1,008,396 0.06% 1.601
216 oui 2 134 <0.01% 0.860
217 oup 2 2,482 <0.01% 0.000
218 our 55 129,262 0.01% 0.315
219 ous 1 771 <0.01% 0.000
220 ow 656 6,051,793 0.38% 1.105
221 owe 3 8,188 <0.01% 0.000
222 oy 131 510,271 0.03% 0.038
223 oy_e 1 2,933 <0.01% 0.000
224 p 7,765 35,274,646 2.20% 0.000
225 pb 3 1,380 <0.01% 0.000
226 pe 3 43,189 <0.01% 0.000
227 ph 530 1,066,337 0.07% 0.220
228 pn 2 1,163 <0.01% 0.000
229 pp 496 2,402,664 0.15% 0.000
230 ppe 2 315 <0.01% 0.000
231 pph 1 2,832 <0.01% 0.000
232 ps 40 56,592 <0.01% 0.000
233 pt 8 8,521 <0.01% 0.000
234 q 443 1,966,725 0.12% 0.000
235 qu 47 75,978 <0.01% 0.000
236 que 22 43,577 <0.01% 0.000
237 r 13,293 67,773,671 4.22% 0.000
238 re 489 9,600,092 0.60% 0.035
239 rh 28 20,260 <0.01% 0.000
240 ro 9 22,769 <0.01% 0.000
241 rps 1 9,467 <0.01% 0.000
242 rr 334 879,006 0.05% 0.000
243 rre 2 10,479 <0.01% 0.000
244 rrh 1 149 <0.01% 0.000
245 rror 3 37,225 <0.01% 0.000
246 rt 3 6,210 <0.01% 0.000
247 s 18,696 105,633,904 6.58% 1.027
248 sc 129 371,066 0.02% 0.160
249 sce 48 12,087 <0.01% 0.000
250 sch 8 6,518 <0.01% 0.000
251 sci 15 27,403 <0.01% 0.000
252 se 200 1,696,607 0.11% 0.957
253 sh 1,220 4,198,020 0.26% 0.001
254 shi 19 42,067 <0.01% 0.000
255 si 190 790,039 0.05% 0.540
256 sl 6 57,284 <0.01% 0.000
257 ss 1,405 4,825,294 0.30% 0.387
258 sse 9 3,871 <0.01% 0.000
259 ssi 121 572,191 0.04% 0.003
260 st 84 140,492 0.01% 0.000
261 sth 2 1,858 <0.01% 0.000
262 sw 15 204,666 0.01% 0.000
263 t 19,481 113,483,774 7.21% 0.112
264 tch 196 490,222 0.03% 0.000
265 te 205 892,866 0.06% 0.000
266 tes 1 717 <0.01% 0.000
267 th 1,204 51,995,883 3.24% 0.711
268 the 31 11,857 <0.01% 0.000
269 thes 8 15,037 <0.01% 0.000
270 ti 1,781 7,056,797 0.44% 0.301
271 ts 1 551 <0.01% 0.000
272 tsch 1 37 <0.01% 0.000
273 tsh 10 2,949 <0.01% 0.000
274 tt 618 2,613,709 0.16% 0.000
275 tte 34 33,275 <0.01% 0.000
276 tth 1 17,697 <0.01% 0.000
277 tw 7 457,084 0.03% 0.000
278 tzsch 1 1,825 <0.01% 0.000
279 u 6,587 27,685,856 1.72% 2.307
280 u_e 311 2,464,579 0.15% 0.982
281 ua 2 17,469 <0.01% 0.000
282 ual 2 15 <0.01% 0.000
283 ue 96 855,526 0.05% 0.912
284 ugh 1 5,867 <0.01% 0.000
285 uh 2 14,565 <0.01% 0.139
286 ui 86 309,495 0.02% 1.315
287 ui_e 4 8,687 <0.01% 0.000
288 ul 138 430,388 0.03% 0.000
289 ule 2 3,321 <0.01% 0.000
290 ull 42 81,135 0.01% 0.000
291 um 3 2,190 <0.01% 0.000
292 uo 9 3,304 <0.01% 0.479
293 uoy 3 313 <0.01% 0.000
294 ur 577 1,620,707 0.10% 1.031
295 ure 193 1,020,272 0.06% 1.098
296 urr 83 350,854 0.02% 0.472
297 ut 3 3,171 <0.01% 0.000
298 uy 5 194,926 0.01% 0.000
299 v 2,880 15,036,744 0.94% 0.000
300 ve 540 5,877,003 0.37% 0.000
301 vv 3 1,259 <0.01% 0.000
302 w 1,601 20,746,360 1.29% 0.006
303 we 5 17,254 <0.01% 0.000
304 wh 215 6,321,562 0.39% 0.695
305 wi 1 1,020 <0.01% 0.000
306 wr 102 2,176,730 0.14% 0.000
307 x 869 3,887,725 0.24% 0.407
308 xe 3 7,822 <0.01% 0.000
309 xh 23 20,691 <0.01% 0.000
310 xi 6 7,758 <0.01% 0.000
311 y 4,493 32,632,117 2.03% 1.680
312 y_e 57 304,957 0.02% 0.000
313 ye 9 16,561 <0.01% 0.000
314 yl 6 19,654 <0.01% 0.000
315 yll 1 36 <0.01% 0.000
316 yr 4 7,350 <0.01% 0.000
317 yrrh 1 105 <0.01% 0.000
318 z 818 1,215,898 0.08% 0.301
319 ze 4 5,758 <0.01% 0.000
320 zi 1 59 <0.01% 0.000
321 zz 74 60,756 <0.01% 0.458
322 ' 54 29,904 <0.01% 0.000
Totals: 264,618 1,603,490,234

Appendix B: List of Phonemes Used by the English Lexicon Project (ELP)

No. SAMPA IPA Example No. SAMPA IPA Example


1 e e, eɪ day 25 k k cat
2 a æ bat 26 dZ ʤ judge
3 A ɑ spa 27 ks / gz ks / gz box / eggs
4 i i see 28 kw kw quit
5 E ɛ set 29 l l log
6 @` ɚ waiter 30 l= l castle
7 aI aɪ life 31 m m mat
8 I ɪ sit 32 m= m chasm
9 o o, oʊ low 33 n n night
10 O ɔ law 34 n= n button
11 u u moot 35 N ŋ song
12 U ʊ put 36 p p pat
13 OI oɪ boy 37 r ɹ ram
14 aU aʊ cow 38 s s sun
15 @ ə above 39 S ʃ shun
16 ju ju mute 40 t t to
17 3` ɝ sir 41 T θ thin
18 V ʌ hug 42 D ð the
19 b b but 43 v v vote
20 tS ʧ church 44 w w / hw witch / which
21 d d do 45 j j yoke
22 f f fee 46 z z buzz
23 g g go 47 Z ʒ mirage
24 h h hat 48 4 ɾ middle

This appendix lists the phonemes distinguished by the ELP and therefore used in this study. In addition to the 44 phonemes generally accepted for SAE, four additional phonemic distinctions are included for analysis: the cluster /ju/ in contrast to /u/ (as heard in moot vs. mute), and the syllabic variants of /m/, /n/, and /l/ (when these consonants function as the nucleus of a syllable instead of a vowel). Phonemes are listed in the table in both IPA and SAMPA formats, with an example provided. Like the International Phonetic Alphabet (IPA), the Speech Assessment Methods Phonetic Alphabet (SAMPA) is a symbol system used for phonetic transcription, with a specific character representing each phoneme. It is designed to be readable by most computer programs and is composed of common characters found on standard English-language keyboards. Note that while some phonemes have different symbols in the IPA and SAMPA formats, many share the same symbol in both typographies. For example, in the word teething, the initial consonant is represented by /t/ in both SAMPA and IPA formats, the vowel featured in both syllables is represented by /i/ in both formats, the middle consonant is represented by /D/ in SAMPA and /ð/ in IPA, and the final consonant is represented by /N/ in SAMPA and /ŋ/ in IPA. The analysis was conducted with phonemes in SAMPA format, and then in post-processing these symbols were converted to IPA format, which may be more familiar to readers. Appendix D, however, discusses technical aspects of the algorithm and so retains use of SAMPA symbols for the sake of consistency.

Appendix C: Phoneme-to-Grapheme Correspondence (PGC) Probabilities

P G Prob. P G Prob. P G Prob.


/t/= t 0.933489 /n/= n 0.937224 /ɪ/= i 0.81118
0.06620 't 0.022247 0.06488 ne 0.029175 0.06394 e 0.145567
ed 0.012392 nn 0.012296 y 0.01587
d 0.010646 kn 0.011863 ea 0.013045
te 0.008399 gn 0.004125 a 0.007226
tt 0.006639 on 0.003138 u 0.001901
tw 0.0043 in 0.001674 ui 0.001599
th 0.000774 nne 0.000156 ee 0.001306
bt 0.000683 dn 0.000142 o 0.001202
tte 0.000313 nd 0.000091 ie 0.00044
pt 0.00008 ln 0.000084 ia 0.000301
ct 0.000019 pn 0.000011 hi 0.000295
cht 0.000012 nh 0.000006 ai 0.000023
tes 0.000007 n' 0.000006 ae 0.000015
0.52337 mn 0.000005 ei 0.00001
mp 0.000002 wi 0.00001
/ə/= e 0.466744 nt 0.000002 ez 0.000008
0.05388 a 0.255068 an 0 ha 0.000003
o 0.12205 0.47126 e_e 0.000001
i 0.112831 0.94675
u 0.027743 /ɹ/= r 0.822485
ou 0.009749 0.05131 re 0.111542 /s/= s 0.773004
y 0.001499 wr 0.026416 0.04998 ce 0.070508
ah 0.00106 er 0.020745 c 0.062638
ei 0.000624 rr 0.010667 ss 0.056087
au 0.000619 're 0.00453 's 0.01386
ai 0.000482 ure 0.001716 se 0.013137
' 0.000346 ur 0.000922 sc 0.004516
ui 0.0003 or 0.000268 sw 0.00255
uh 0.000165 rh 0.000246 st 0.00175
ia 0.000164 ar 0.000127 ps 0.000705
ha 0.000139 rre 0.000127 ci 0.00069
eou 0.000099 rps 0.000115 z 0.000281
hi 0.000066 rt 0.000075 sce 0.000151
aa 0.000051 ir 0.000015 sse 0.000048
oi 0.000037 rrh 0.000002 cs 0.000045
eau 0.000037 0.98177 ces 0.000029
er 0.00003 1.30285
ea 0.000027 /l/= l 0.804507
o' 0.000023 0.03588 ll 0.183583 /æ/= a 0.999126
ae 0.000016 'll 0.005524 0.03339 au 0.000773
he 0.000014 all 0.004237 e 0.000043
eo 0.000008 sl 0.000994 ah 0.00002
on 0.000006 ol 0.000431 ai 0.000015
c 0.000003 le 0.00035 a'a 0.000014
eu 0.000002 lle 0.000283 i 0.000007
oo 0 lh 0.000091 a_e 0.000002
2.01254 0.79956 0.01081

/d/= d 0.89278 /i/= y 0.335638 /m/= m 0.897029


0.03640 ed 0.046915 0.03110 e 0.250323 0.03008 me 0.050858
ld 0.04448 ee 0.115595 mm 0.02892
dd 0.010909 i 0.089788 'm 0.012567
'd 0.004548 ea 0.082904 nm 0.006157
de 0.000343 e_e 0.036094 mb 0.002068
dh 0.000025 ea_e 0.019588 mn 0.001633
0.66373 eo 0.01599 lm 0.000668
ie 0.014512 mme 0.00006
i_e 0.01292 nd 0.000018
ey 0.012813 gm 0.000013
/k/= c 0.654498 ie_e 0.005776 mmes 0.00001
0.03315 k 0.204096 ei 0.005202 0.67368
ck 0.054837 ee_e 0.000667
q 0.036939 a 0.000627 /ð/= th 0.999719
ch 0.025768 is 0.000566 0.02629 the 0.000281
cc 0.008163 oe 0.000329 0.00372
lk 0.006901 ae 0.000246
x 0.005362 ay 0.000198 /e/= a 0.551189
qu 0.001427 ei_e 0.000158 0.02309 a_e 0.183277
que 0.000818 he 0.000024 ay 0.113667
cq 0.000475 it 0.000024 ai 0.07935
ke 0.000224 ill 0.000006 ey 0.048349
kh 0.000137 j 0.000006 ea 0.012055
cu 0.000123 ai 0.000003 e 0.003686
cques 0.000086 hoe 0.000002 eigh 0.002844
che 0.000053 ois 0.000002 ai_e 0.001988
gh 0.000036 2.74292 aigh 0.001309
cqu 0.000036 ei 0.000615
cch 0.000016 /ɛ/= e 0.80297 ay_e 0.000381
kg 0.000004 0.02784 a 0.103993 eh 0.000371
kk 0.000001 ea 0.034405 ae 0.000288
1.5928 ai 0.02727 et 0.000265
ei 0.019017 au 0.000136
/z/= s 0.893611 ie 0.004512 ee 0.000135
0.02856 es 0.0468 ay 0.004462 e_e 0.000055
z 0.02544 ey 0.001868 es 0.000019
's 0.017539 u 0.000366 er 0.000016
se 0.014002 ae 0.000321 eii 0.000003
zz 0.001196 aye 0.000305 ais 0.000002
ss 0.000729 aa 0.000242 ei_e 0.000001
thes 0.000328 eo 0.000177 1.96794
x 0.000154 hei 0.00006
ze 0.000126 ee 0.000023 /u/= o 0.535481
sth 0.000041 oe 0.000005 0.01916 ou 0.187264
is 0.000013 eh 0.000004 u 0.093988
ts 0.000012 1.11633 oo 0.052143
cz 0.00001 ew 0.042228
0.70278 /p/= p 0.93503 u_e 0.033706
0.02349 pp 0.063688 ue 0.018703
/ʌ/= o 0.515433 pe 0.001145 o_e 0.015449
0.02463 u 0.356184 ph 0.000115 ough 0.011045
a 0.09341 bp 0.000014 ui 0.003879
ou 0.020194 ppe 0.000008 oo_e 0.003073
au 0.012732 0.3567 eu 0.000807
oo 0.002047 ou_e 0.000773
1.55491 /v/= v 0.470449 oe 0.000673
0.01990 f 0.344445 ui_e 0.000282
/aɪ/= i 0.473221 ve 0.16932 ieu 0.000274
0.02044 i_e 0.272489 've 0.014556 oup 0.000081
y 0.165759 ph 0.00092 ooh 0.000046
igh 0.057033 w 0.000196 ioux 0.000043
y_e 0.009293 lv 0.000067 w 0.000026
ie 0.007419 vv 0.000039 ous 0.000025
uy 0.00594 lve 0.000008 uo 0.000011
eye 0.003285 1.57742 oui 0.000001
ai 0.001922 2.19773
ei 0.000852 /f/= f 0.881126
ia 0.000703 0.01868 ff 0.068213 /b/= b 0.995284
eigh 0.000631 ph 0.034432 0.01867 bb 0.00466
ye 0.000505 gh 0.009224 pb 0.000046
ay 0.000241 ft 0.00402 bh 0.00001
a 0.000189 lf 0.002866 0.04371
ig 0.000167 pph 0.000094
is 0.000125 ffe 0.000013 /o/= o 0.634053
oy 0.000063 v 0.00001 0.01442 o_e 0.177392
ey 0.000043 fe 0.000002 ow 0.138887
ais 0.000042 0.7127 oa 0.021885
aye 0.000028 ough 0.014182
ae 0.000025 /ɚ/= er 0.700952 oh 0.005187
ailles 0.000023 0.01526 or 0.148072 oe 0.003659
1.93148 ar 0.084384 ou 0.002998
ure 0.030605 eau 0.000638
/ɑ/= o 0.627161 ur 0.014244 owe 0.000354
0.01872 a 0.355241 orr 0.006852 ew 0.000218
oh 0.007262 arr 0.004151 ot 0.000124
ow 0.003507 err 0.001924 u 0.000117
ea 0.002166 ir 0.001921 oo 0.000111
ho 0.002039 rror 0.001519 au 0.000066
ah 0.001387 urr 0.001446 au_e 0.000023
e 0.000779 re 0.001409 oa_e 0.000023
i 0.000236 ro 0.000929 eaux 0.000022
au 0.000095 eur 0.000646 aoh 0.000021
aa 0.00008 our 0.0003 os 0.000019
ou 0.000022 yr 0.0003 eo 0.000009
as 0.000016 aur 0.000099 o' 0.000007
at 0.000006 oor 0.000082 ou_e 0.000004
oi 0.000002 erwr 0.000068 1.5778
aw 0.000001 're 0.000045
1.09733 oar 0.000029 /h/= h 0.932069
awr 0.000018 0.01102 wh 0.066675
/ɔ/= o 0.669859 ere 0.000006 j 0.001238
0.01621 a 0.211565 1.4957 x 0.000018
au 0.039668 0.36731
ou 0.028863 /ɾ/= t 0.57741
aw 0.018842 0.01357 d 0.318461 /ʃ/= ti 0.494173
ough 0.012995 tt 0.087569 0.00844 sh 0.309672
oa 0.008879 dd 0.015204 ci 0.055121
oo 0.004693 bt 0.000692 ssi 0.0422
augh 0.002952 ld 0.000308 s 0.027479
ea 0.000488 ct 0.000293 ss 0.02142
awe 0.000441 th 0.000059 ch 0.019937
ah 0.000343 cht 0.000004 c 0.010283
as 0.000211 1.39791 si 0.007219
ao 0.000188 t 0.004024
u 0.000007 /g/= g 0.938245 shi 0.003103
al 0.000005 0.00858 gu 0.041541 che 0.002202
eo 0 gg 0.015718 sci 0.002021
1.52398 gh 0.002846 sc 0.000639
gue 0.00165 sch 0.000481
/w/= w 0.901524 0.41042 ce 0.000024
0.01432 u 0.096813 chsi 0.000002
we 0.00075 /ɝ/= er 0.40125 2.05046
ju 0.000468 0.00645 or 0.181697
o 0.000371 ur 0.115057 /l/= le 0.521185
hu 0.000054 ir 0.105264 0.00640 al 0.269161
wh 0.000016 ear 0.074666 el 0.056952
r 0.000003 ere 0.066072 ul 0.041885
0.47923 urr 0.030449 all 0.03633
our 0.011768 l 0.020892
err 0.007258 il 0.01435
/ŋ/= ng 0.841945 orr 0.004238 ael 0.009953
0.01053 n 0.155757 her 0.000841 ull 0.007896
ing 0.001246 eur 0.000572 ol 0.005363
ngue 0.001033 irr 0.000507 ell 0.00416
nd 0.000019 olo 0.00035 ile 0.003915
0.64938 yrrh 0.00001 ll 0.003757
2.55326 yl 0.001913
/n/= on 0.649846 'll 0.001774
0.00814 n 0.169372 /θ/= th 0.998036 ule 0.000323
en 0.099272 0.00606 tth 0.00182 ill 0.000186
an 0.047163 t 0.000129 yll 0.000004
ain 0.017626 dth 0.000016 ual 0.000001
in 0.0135 0.02131 2.10091
ne 0.001607
onn 0.001455 /ʊ/= ou 0.567379 /j/= y 0.99984
ign 0.000159 0.00521 oo 0.249375 0.00540 j 0.000085
1.59378 u 0.167699 i 0.000071
o 0.013537 ll 0.000004
/ʤ/= j 0.358992 eu 0.001442 0.00243
0.00637 g 0.315193 uo 0.000354
ge 0.229182 or 0.000202 /ʧ/= ch 0.689814
d 0.036773 oui 0.000011 0.00494 t 0.202527
dge 0.026248 1.49991 tch 0.061743
gi 0.018078 ti 0.043437
dg 0.009429 /hw/= wh 0.998552 cz 0.000557
dj 0.003308 0.00321 ju 0.001153 che 0.00052
di 0.002075 w 0.000296 c 0.000505
gg 0.000393 0.01681 tsh 0.000371
ch 0.000328 ci 0.000274
2.07789 /jɛ/= e 1 tzsch 0.00023
0.00000 0 th 0.000019
/aʊ/= ou 0.639641 tsch 0.000005
0.00531 ow 0.32056 /jə/= u 0.631539 1.30858
ou_e 0.022595 0.00108 io 0.1803
hou 0.014715 ia 0.142676 /ks/= x 0.997381
au 0.001437 ie 0.029076 0.00186 xe 0.002619
ao 0.000833 ua 0.01011 0.02624
ough 0.000219 iu 0.004833
1.17636 a 0.001109 /jɑ/= o 1
iou 0.000358 0.00000 0
/ju/= u 0.571625 1.53274
0.00343 u_e 0.258882 /oɪ/= oi 0.628629
ew 0.05474 /gz/= x 0.963063 0.00096 oy 0.328469
ue 0.050779 0.00035 xh 0.036937 oi_e 0.014927
iew 0.047863 0.22807 aw 0.013624
eau 0.008614 ois 0.010712
ou 0.003263 /jɚ/= ure 0.517928 oy_e 0.001896
eu 0.002428 0.00016 ior 0.320364 eu 0.001395
ugh 0.001064 iar 0.145187 uoy 0.000202
ut 0.000575 ur 0.015327 o 0.000094
ieu 0.00009 ier 0.000748 hoi 0.000052
ewe 0.000076 or 0.000447 1.22852
1.74966 1.52704
/jʊ/= u 0.725133
/wʌ/= o 1 /wɑ/= ois 0.597656 0.00019 eu 0.273925
0.00157 0 0.00000 oi 0.402344 uh 0.000943
0.97231 0.85744
/ʒ/= si 0.643986
0.00067 s 0.293708 /gʒ/= x 1 /kʃ/= x 0.907282
ge 0.02481 0.00000 0 0.00005 xi 0.092718
g 0.014577 0.44548
ti 0.012087 /ts/= z 0.773716
z 0.005907 0.00002 zz 0.226284 /kə/= kh 1
j 0.0046 0.77148 0.00000 0
sh 0.000143
ssi 0.000127 /nj/= n 0.663968 /əw/= ju 1
zi 0.000055 0.00000 gn 0.336032 0.00000 0
1.30993 0.92097

/m/= m 0.905322
0.00018 ome 0.066394
om 0.016532
um 0.007431
am 0.003332
em 0.000987
0.57738
Note: P=phoneme, G=grapheme. Below each phoneme, the probability for that
phoneme is listed. The probabilities for each PGC are listed to the right of each
grapheme. At the bottom of each list, the phoneme’s entropy is given in bold. The
graphemes are listed in order of decreasing probability, with the first grapheme being
the “main spelling” or most predictable correspondence for that phoneme.
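The entropy values tabulated above can be reproduced directly from the listed PGC probabilities using the standard Shannon formula, H = -Σ p·log₂(p). A minimal sketch (the function name is ours; base-2 logarithms are assumed, which matches the tabulated values):

```python
import math

def pgc_entropy(probabilities):
    """Shannon entropy (base 2) over one phoneme's grapheme probabilities.
    Zero-probability correspondences contribute nothing to the sum."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# /ks/ maps to x (0.997381) and xe (0.002619); the table lists entropy 0.02624.
print(round(pgc_entropy([0.997381, 0.002619]), 5))  # 0.02624
```

The same function reproduces the other bolded values, e.g. pgc_entropy([0.963063, 0.036937]) ≈ 0.22807 for /gz/; a phoneme with a single spelling (probability 1) has entropy 0.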

Appendix D: Algorithmic Procedures for Corpus Analysis

The following describes the decision-making process used by the algorithm to parse words.

ELP Characteristics. For each entry, the ELP provides a number of characteristics that may be used in analysis. The following characteristics were utilized by this study: (a) Frequency Norms (Freq_HAL), (b) Pronunciation (Pron.), and (c) Morpheme Parse – Letters (MorphSp).

Freq_HAL gives the frequency norms provided by the HAL database: for each entry, the number of times that word is found throughout the corpus. These data were necessary to calculate the probability for each word, and this information was applied separately to each grapheme-phoneme correspondence that comprises the word.
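That frequency weighting can be sketched as follows; the two-word list and counts here are invented for illustration, not values from HAL or the ELP:

```python
# Each word contributes its corpus frequency to every correspondence it
# contains, so a grapheme's probability is its token count over all tokens.
words = {                      # word -> (corpus frequency, parsed graphemes)
    "hack": (10, ["h", "a", "ck"]),
    "back": (30, ["b", "a", "ck"]),
}

tokens = {}
for freq, graphemes in words.values():
    for g in graphemes:
        tokens[g] = tokens.get(g, 0) + freq

total = sum(tokens.values())                   # 120 grapheme tokens
probability = {g: n / total for g, n in tokens.items()}
print(probability["ck"])                       # 40 of 120 tokens
```

This mirrors the Token Count and Probability columns of Appendix A, where each grapheme's share is taken over the full 1,603,490,234 grapheme tokens.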

Pron. is the listed pronunciation for each entry, based on the Standard American English (SAE) dialect. This information is recorded in the SAMPA format, and so for consistency, phonemes are represented by their SAMPA symbols in this appendix. Once the final data were analyzed, the SAMPA symbols were replaced with their more widely recognized IPA counterparts.

MorphSp provides a morphological breakdown of each entry, giving the root(s) and any affixes for a given word. This feature was utilized in creating an algorithm that can apply various procedures to determine how best to parse ambiguous mappings.

Parsing the ELP algorithmically. After the problematic cases documented above were addressed, an algorithm was constructed that accurately parsed as many words in the ELP as possible. A parsing was considered “accurate” if it conformed to the desired grapheme-phoneme relationships as outlined. It was imperative, for example, that instances of FS-e be correctly assigned to the appropriate preceding consonant or vowel, depending on its apparent function.

Details of how the algorithm made parsing decisions are discussed below. Broadly speaking, the algorithm undertook three steps in an attempt to achieve the desired parsing of a word. First, the program consulted a mapping file that defined all legitimate phoneme-grapheme correspondences. Second, it discovered all acceptable parsings for a given word type. Third, it chose the most accurate parsing through a process of elimination, guided by a series of rules or “preferences” that selected certain graphemes over others in various contexts.

Constructing the mapping file. The mapping file listed all phonemes used by the ELP and all potential graphemes that might correspond to each phoneme. This is the information used by the algorithm to parse the wordlist. Only the phoneme-grapheme matches defined in the mapping file are considered acceptable correspondences by the algorithm.

The primary goal of the allowed correspondences was not to minimize the number of possible parsings returned for each word, but rather to be conservative in eliminating possibilities, so as to increase the chance that one of the potential parsings is the desired one. In other words, the net was cast wide. This approach was guided by the belief that it is easier algorithmically to choose an acceptable parsing from a list of potential parsings than it is to carefully construct only the correct parsing.



With this strategy in mind, the initial mapping file included all graphemes (according to our definition) that could conceivably correspond to each phoneme. In addition to graphemic options culled from prior studies (Berndt et al., 1987; Gontijo et al., 2003; Hanna, 1966), a number of online sources were consulted, including Wikipedia. This inclusive approach had the potential to return a plethora of idiosyncratic graphemes covering many extremely low-frequency, exotic spellings. Furthermore, additional idiosyncratic graphemes were iteratively added until the algorithm was able to yield a suggested parsing for every word in the dataset. This procedure decreased the risk of missing significant correspondences through a failure to provide those possibilities to the algorithm.

Expressing all possible parsings. For each entry in the ELP, the algorithm took as input the grapheme string comprising the written word and the phoneme string comprising its phonetic transcription, with all stress and syllable marks removed. The program moved in a serial fashion from left to right. It identified each phoneme encountered by matching it to the identical phoneme listed in the mapping file, choosing the longest phoneme string consistent with the position at hand. In this manner, the algorithm split the string into individual phonemes.

The algorithm next examined every possible acceptable alignment between the phoneme and grapheme strings. No limitations were placed on what character(s) could represent any phoneme. The only constraints were that 1) complex (multi-letter) graphemes must contain only consecutive letters (except for split graphemes containing FS–e, handled separately), 2) all letters of the word must be used, and 3) the number of graphemes must equal the number of phonemes.
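Under those constraints (split graphemes aside), the search can be sketched as a recursive split of the letter string into exactly as many contiguous graphemes as there are phonemes, each checked against the mapping file. The mapping dictionary below is a tiny invented fragment, not the study's actual file:

```python
def parsings(letters, phonemes, mapping):
    """Yield every split of `letters` into len(phonemes) contiguous graphemes
    such that each grapheme is an acceptable spelling of its phoneme."""
    if not phonemes:
        if not letters:              # constraint 2: all letters must be used
            yield []
        return
    head, rest = phonemes[0], phonemes[1:]
    for i in range(1, len(letters) + 1):
        grapheme = letters[:i]       # constraint 1: consecutive letters only
        if grapheme in mapping.get(head, ()):
            for tail in parsings(letters[i:], rest, mapping):
                yield [grapheme] + tail

# Toy fragment of a mapping file (phonemes in SAMPA).
mapping = {"h": {"h"}, "a": {"a"}, "k": {"k", "c", "ck"},
           "n": {"n", "kn"}, "i": {"y", "ey"}}
result = list(parsings("hackney", ["h", "a", "k", "n", "i"], mapping))
print(result)  # the two parsings h-a-c-kn-ey and h-a-ck-n-ey
```

Constraint 3 holds automatically, since the recursion consumes one phoneme per grapheme.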



The complexity of split graphemes required additional parameters. Any “e” encountered that is to be considered part of a split grapheme (attached to the preceding vowel) must be directly preceded by a graphemic consonant(s), gu-, or qu-. The associated vowel had to be the first vowel encountered prior to the directly preceding consonant(s), gu-, or qu-, along with any number of consecutive preceding vowels until a consonant, another split grapheme, or the beginning of the word was encountered. This rule flagged the “e” as a potential candidate for being part of a split grapheme, but only declared it to be one if both of the following criteria were false:

1. The preceding vowel, which could potentially be part of the
split grapheme, had as its corresponding phoneme one of the
following “short” vowels: /a/, /A/, /@/, /E/, /I/, /U/, /V/, /3`/, or
/@`/.
2. The consonantal correspondence directly preceding the “e”
consisted of one of the following grapheme-phoneme pairs:
a) le = /l=/
b) ge = /Z/ or /dZ/
c) dg = /dZ/
d) th = /D/
e) ce = /s/
f) sle = /l/

Although it can be argued that FS–e can serve many functions simultaneously, for simplicity, in this study a FS–e cannot be assigned to more than one grapheme. It will either attach to the vowel(s) or to the consonant(s), as per the guidelines above.
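A condensed sketch of that decision follows; the function and set names are ours, and the sets simply transcribe the two criteria above (SAMPA symbols):

```python
# Criterion 1: SAMPA "short" vowels that block a split grapheme.
SHORT_VOWELS = {"a", "A", "@", "E", "I", "U", "V", "3`", "@`"}

# Criterion 2: consonant correspondences that claim the final e themselves.
E_CLAIMING_PAIRS = {("le", "l="), ("ge", "Z"), ("ge", "dZ"), ("dg", "dZ"),
                    ("th", "D"), ("ce", "s"), ("sle", "l")}

def e_joins_vowel(vowel_phoneme, consonant_grapheme, consonant_phoneme):
    """True if a candidate final e attaches to the preceding vowel (forming
    a split grapheme) rather than to the intervening consonant."""
    if vowel_phoneme in SHORT_VOWELS:                          # criterion 1
        return False
    if (consonant_grapheme, consonant_phoneme) in E_CLAIMING_PAIRS:  # crit. 2
        return False
    return True

# In "late", /e/ is not short and t = /t/ claims no e, so a_e results.
print(e_joins_vowel("e", "t", "t"))
```

This omits the positional checks described in the text (locating the candidate e and its associated vowel); it covers only the two final criteria.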

Once the algorithm determined the set of minimally constrained parsings (as described above) of a given word, it consulted the mapping file to determine whether each phoneme-grapheme correspondence comprising the word was acceptable. If so, the parsing was deemed legitimate; otherwise an error was returned. In this fashion, the algorithm returned as output all legitimate parsings, with no limit placed on the number of parsings per word. As expected, some words returned only one parsing, while many others returned multiple parsings.

Determining the preferred parsing. Once the above algorithm could generate at least one legitimate parsing for every word in the list, the next step was to sift through all generated parsings of an entry and choose the one considered to be the most accurate. To this end, the MorphSp attribute of the ELP was utilized, which breaks words down into roots and affixes.

For any word with multiple acceptable parsings, this algorithm first broke each entry down into its morphological components. The words were divided into three classes and addressed separately: root, non-compound, and compound words. Root words were words with a single root and no prefixes or suffixes. Non-compound words had one root with one or more prefixes and/or suffixes. Compound words contained more than one root and could also contain prefixes and/or suffixes.

Root words. Root words were addressed first. In the event of multiple acceptable parsings, the algorithm initially assumed that the first parsing was the correct one. The algorithm then compared this incumbent parsing to each alternate parsing in turn. With each comparison, the program determined whether to replace the original parsing with the new one under consideration or to reject the new parsing and retain the incumbent. It proceeded through all alternate parsings in this manner.

The algorithm compared the incumbent and the alternate parsing under consideration

by performing up to four passes over the parsings. In each pass, the parsings were
MEASURING ORTHOGRAPHIC PREDICTABILITY 76

compared in a serial, left-to-right order, ignoring all pairs of identical graphemes between

the two candidates. When the algorithm encountered two graphemes that were not

identical, it initiated a subroutine that attempted to resolve the discrepancy. The

particular subroutine that was employed depended on which pass the algorithm was

executing, as described below. If at any time a subroutine was able to determine, based

on a particular grapheme discrepancy, that one of the parsings was preferred, then the

algorithm selected that parsing and terminated.
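The incumbent-versus-challenger procedure just described can be sketched in Python as follows. This is an illustrative reconstruction rather than the analysis software itself; choose_parsing and prefer_wh are hypothetical names, and the toy comparator stands in for the full four-pass discrepancy resolution.

```python
def choose_parsing(candidates, compare):
    """Tournament over candidate parsings: the first candidate is the
    initial incumbent; each challenger is compared against it in turn,
    and the incumbent is replaced only when `compare` prefers the
    challenger.  `compare` returns the preferred parsing, or None when
    it cannot break the tie (the incumbent is then retained)."""
    incumbent = candidates[0]
    for challenger in candidates[1:]:
        preferred = compare(incumbent, challenger)
        if preferred is not None:
            incumbent = preferred
    return incumbent

# Toy comparator: prefer a parsing whose first grapheme is "wh"
# (cf. the first-pass rule preferring "wh" to "w" for /hw/).
def prefer_wh(a, b):
    if a[0] == "wh" and b[0] != "wh":
        return a
    if b[0] == "wh" and a[0] != "wh":
        return b
    return None

winner = choose_parsing([["w", "hi", "s", "k", "ey"],
                         ["wh", "i", "s", "k", "ey"]], prefer_wh)
print(winner)  # ['wh', 'i', 's', 'k', 'ey']
```

Note that when no rule ever fires, the first candidate simply survives, which matches the algorithm's initial assumption that the first parsing is correct.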

The first pass expressed a preference for longer graphemes over shorter graphemes,

when the phoneme in question could be represented by both. For example, consider the

word hackney, which may be parsed as either h-a-ck-n-ey or h-a-c-kn-ey. According to

the mapping file, both options are legitimate; the grapheme ck often represents /k/ as in

back, sack, and hack, and the grapheme kn often represents /n/, as in knee, knight, know,

etc. Since words like hack feature the ck = /k/ correspondence, words ending in -ney (e.g., journey) feature the n = /n/ correspondence, and the grapheme kn appears predominantly in word-initial position, the first of the two parsings is judged to be the correct option.

The simplest way to select this option was to declare a preference for ck over c: whenever c was the correct choice, the contesting grapheme was never ck. This preference was not based upon any theoretical framework, but rather was determined simply from observation of patterns in the ELP. Examples of preferences declared by the first pass included:

 /n=/ = "en" not "n" (absence)


 /k/ = "ck" not "c" (acknowledge)
 /k/ = "qu" not "q" (briquette)

 /k/ = "cqu" not "cq" (acquire)


 /k/ = "ch" not "c" (chemise)
 /T/ = "th" not "t" (hypothesize)
 /g/ = "gu" not "g" (daguerreotype)
 /gz/ = "xh" not "x" (exhilarate)
 /dZ/ = "ge" not "g" (George)
 /hw/ = "wh" not "w" (whistle)
 /S/ = "ti" not "t" (initiative)
 /S/ = "ci" not "c" (sociable)
 /S/ = "sh" not "s" (shingle)
 /S/ = "sci" not "sc" (omniscience)

If the first pass failed to “break the tie,” a second pass was initiated. This pass handled discrepancies involving doubled letters. The subroutine preferred doubled

letters over single letters to represent a given phoneme, when both options were

presented. Therefore, such preferences were declared as:

 /k/ = "cc" not "c" (accolade)


 /l/ = "ll" not "l" (fallible)
 /i/ = "ee" not "e" (feeble)
 /e/ = "ee" not "e" (toupee)
 /n/ = "nn" not "n" (colonnade)
 /u/ = "oo" not "o" (Coolidge)

If the second pass did not break the tie, a third pass was initiated. This pass handled

decisions concerning certain graphemes that preferred not to be combined with

extraneous letters. Preferences declared by the third pass included:

 /tS/ = "ch" not "chi" (achieve)


 /p/ = "pp" not "ppe" (appease)
 /p/ = "p" not "pe" (Chesapeake)
 /d/ = "d" not "de" (credence)
 /r/ = "rr" not "rre" (daguerreotype)

 /s/ = "ss" not "sse" (essence)


 /s/ = "s" not "se" (absence)
 /s/ = "sc" not "sce" (adolescence)
 /f/ = "ff" not "ffe" (indifference)
 /l/ = "l" not "le" (jubilee)
 /n/ = "n" not "ne" (needle)
 /n/ = "n" not "in" (plaintive)
 /z/ = "s" not "se" (presence)
 /a/ = "a" not "al" (allocate)

If the third pass failed to break the tie, a fourth and final pass was initiated. This

pass concerned certain vowels that were preferred in certain contexts, and included:

 /OI/ = "oy" not "o" (flamboyance)


 /e/ = "ey" not "e" (abeyance)
 /E/ = "ae" not "a" (aerate)
 /O/ = "au" not "a" (aureole)
 /i/ = "ea" not "a" (treatise)
 /E/ = "ai" not "a" (clairvoyance)
 /U/ = "ou" not "o" (entourage)
 /I/ = "ie" not "i" (pierce)
 /U/ = "uo" not "u" (fluoresce)
 /u/ = "ou" not "o" (pirouette)
 /O/ = "ou" not "o" (fourpence)
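Taken together, the four passes amount to a data-driven comparison of two aligned candidates. Below is a minimal sketch in Python; because every parsing assigns exactly one grapheme per phoneme, the two candidates can be compared position by position. The rule tables are small illustrative subsets of the preference lists above (phonemes in SAMPA), and compare is a hypothetical name.

```python
# (phoneme, (preferred grapheme, dispreferred grapheme)) per pass;
# illustrative subsets of the four preference lists above.
PASS_RULES = [
    {("k", ("ck", "c")), ("hw", ("wh", "w")), ("S", ("sh", "s"))},  # pass 1
    {("k", ("cc", "c")), ("l", ("ll", "l")), ("i", ("ee", "e"))},   # pass 2
    {("k", ("k", "ke")), ("s", ("s", "se")), ("l", ("l", "le"))},   # pass 3
    {("e", ("ey", "e")), ("O", ("au", "a")), ("i", ("ea", "a"))},   # pass 4
]

def compare(incumbent, challenger, phonemes):
    """Up to four left-to-right passes over the aligned parsings:
    identical grapheme pairs are ignored, and the first applicable
    preference rule at a discrepancy decides.  None means the tie
    stands."""
    for rules in PASS_RULES:
        for ph, g_inc, g_chl in zip(phonemes, incumbent, challenger):
            if g_inc == g_chl:
                continue  # ignore identical pairs
            if (ph, (g_inc, g_chl)) in rules:
                return incumbent
            if (ph, (g_chl, g_inc)) in rules:
                return challenger
    return None

phonemes = ["hw", "I", "s", "k", "i"]
print(compare(["w", "hi", "s", "k", "ey"],
              ["wh", "i", "s", "k", "ey"], phonemes))
# ['wh', 'i', 's', 'k', 'ey'] -- pass 1: "wh" preferred to "w" for /hw/
```

Encoding the preferences as per-pass tables keeps the control flow fixed while the observed ELP patterns accumulate as data, which mirrors how the rules were grown iteratively.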

Non-compound words. Non-compound words were handled in a fashion similar to

root words, but with additional steps to handle affixes. The MorphSp column indicated

the root word and affixes separately for any entry. The algorithm first checked to see if

the root word existed as its own separate entry in the ELP. If the associated root did not

exist as its own separate entry, the entry under consideration was parsed using the four-

pass subroutine described above.



However, frequently the associated root word existed as its own separate entry in

the ELP. If the phonemic string of the non-compound word contained the phonemic

string of the root word, this indicated the affixes did not alter the pronunciation of the

word root. The algorithm thus mirrored the parsing accepted for the root word, rejecting

inconsistent candidates. If no candidate was consistent, the algorithm did not immediately

reject any candidate, but instead skipped this step and proceeded to the next step.
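This root-mirroring filter might be sketched as follows. The function and variable names are hypothetical, and the example uses barely = bare + -ly (parsed b-a-re-l-y in the sampling table later in this chapter), where the candidate embedding the root's selected parsing b-a-re is retained.

```python
def contains(seq, sub):
    """True if `sub` occurs as a contiguous run inside `seq`."""
    return any(list(seq[i:i + len(sub)]) == list(sub)
               for i in range(len(seq) - len(sub) + 1))

def mirror_root(candidates, word_phonemes, root_phonemes, root_parsing):
    """If the word's phoneme string contains the root's (i.e., the
    affixes did not alter the root's pronunciation), keep only the
    candidates that embed the root's previously selected grapheme
    parsing.  If no candidate is consistent, skip the filter rather
    than reject everything."""
    if contains(word_phonemes, root_phonemes):
        kept = [c for c in candidates if contains(c, root_parsing)]
        if kept:
            return kept
    return candidates

survivors = mirror_root(
    candidates=[["b", "a", "re", "l", "y"], ["b", "a", "r", "el", "y"]],
    word_phonemes=["b", "E", "r", "l", "i"],   # barely (SAMPA)
    root_phonemes=["b", "E", "r"],             # bare
    root_parsing=["b", "a", "re"])             # selected parsing of bare
print(survivors)  # [['b', 'a', 're', 'l', 'y']]
```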

If multiple acceptable parsings remained after examining the root, additional steps

were next applied. First, certain prefixes containing e, such as re-, might parse as part of

a split grapheme, which would inevitably be in error. Therefore, explicit rules were outlined whereby, in the presence of these prefixes, split graphemes could not occur in combination with the vowel of the prefix.

Next, if multiple candidates remained, suffixes were examined to determine

if a preference between incumbent and challenger could be made based on the suffix.

Consider the words lie and lady. Pluralizing these words results in lies and ladies. Though

both plural forms end with -ies, they must be parsed differently. The word lies should be

parsed as l-ie-s, not li-es, since the latter would indicate the word root is li-. Ladies, on

the other hand, should be parsed as l-a-d-i-es, not l-a-d-ie-s, because the penultimate “e”

is not found in the singular form of lady. The “e” must be part of the ending, and

therefore -es is the correct parsing of the suffix (with the final -y converting to an -i). So

for a word ending in –y + -s (as indicated by the MorphSp entry) and two candidate

parsings, one ending in –i-es and the other ending in –ie-s, the algorithm immediately selected the parsing ending in –i-es as the next incumbent. The rule was expressed as:

 Suffixes –y + –s = –ies: prefer parsings ending in –es to parsings ending in –s



Similarly, other such preferences governed by suffixes include:

 Root ending in -e + suffix -s, resulting in word ending in –es: prefer parsings
ending with –s to parsings ending with –es
 Suffix –ives: prefer parsings ending with –s to parsings ending with –es
 Suffix –itives: prefer parsings ending with –s to parsings ending with –es
 Suffix –tures: prefer parsings ending with –s to parsings ending with –es
 Suffix –eer: prefer parsings ending with –r to parsings ending with –er
 Suffix –eer + additional suffix: prefer parsings ending with –r-? to parsings
ending with –er-?
 Suffix –ered: prefer parsings ending with –ed to parsings ending with –d
 Root ending in -e + suffix –ed: prefer parsings ending with –d to parsings
ending with –ed
 Root not ending in -e + suffix –ed: prefer parsings ending with –ed to parsings
ending with –d
 Suffixes –y + ed = –ied: prefer parsings ending in –ed to parsings ending in –d
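The first of these suffix rules (–y + –s = –ies) might be sketched as follows; apply_ies_rule is a hypothetical name, and the sketch covers only this one rule out of the list above:

```python
def apply_ies_rule(candidates, root, suffix):
    """Sketch of the -y + -s rule: when a root ending in -y takes the
    plural suffix -s (spelled -ies, as in lady -> ladies), parsings
    whose final grapheme is 'es' are preferred to parsings ending in
    's'.  If the rule does not apply, all candidates survive."""
    if suffix == "s" and root.endswith("y"):
        preferred = [c for c in candidates if c[-1] == "es"]
        if preferred:
            return preferred
    return candidates

candidates = [["l", "a", "d", "i", "es"], ["l", "a", "d", "ie", "s"]]
print(apply_ies_rule(candidates, root="lady", suffix="s"))
# [['l', 'a', 'd', 'i', 'es']]
```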

Again, these rules were not decided beforehand, but rather resulted from

observations of patterns during iterative attempts to continually decrease instances where

the algorithm was unable to select a single, desired parsing. If execution of all “suffix

rules” did not result in a determination of the correct parsing, the algorithm continued on

to the four-pass subroutine described above for root words.

Compound words. Lastly, compound words containing multiple roots and/or affixes

were addressed. These were handled similarly to non-compound words, but with additional initial steps. First, if there were no affixes, the algorithm checked if the compound

word, in both graphemic and phonemic strings, was identical to the combination of the

associated roots, and if so, chose the parsing that combined those selected for

the root words in previous steps of the algorithm. For example, when the program

encountered a word like wholesale, it searched for the associated roots whole and sale,

combined the roots, and then checked if the result was identical to wholesale, which it

was. Similarly, it confirmed that the pronunciation of /holsel/ (wholesale) was the

concatenation of /hol/ + /sel/. Finally, the algorithm posited the concatenation of the

selected parsings for whole and sale, yielding wh-o_e-l-s-a_e-l. As this is indeed one of

the candidate parsings, the algorithm selected it.

For more complicated compound words, particularly ones containing affixes, the

algorithm checked if the combination of associated roots, in both graphemic and

phonemic strings, was a subset of the compound word being analyzed. If so, and if there

was at least one possible parsing of the compound word that contained this simple

combination of the previously selected parsings of the associated roots, then all other

parsings were eliminated. For example, if the algorithm encountered the word

wholesalers, it checked if this word contained whole + sale and its pronunciation

contained /holsel/. When this proved true, only parsings containing wh-o_e-l-s-a_e-l

were retained. In this case, the only such parsing is wh-o_e-l-s-a_e-l-r-s, and so this

parsing was selected.

The algorithm then tested, in a similar fashion, if the compound word being

analyzed began with the first associated root word, and if the pronunciation began with

the combination of the first associated root and the first phoneme of the second associated

root. If this was the case, then all parsings that did not begin with the previously selected

parsing of the first associated root word were eliminated. If there were still multiple

candidate parsings, then the algorithm proceeded as it did with non-compound words,

first analyzing the suffixes to see if it could determine a correct parsing by suffix alone,

before finally executing the four-pass subroutine.



Below is a walkthrough of the algorithm’s decision process in parsing the sample

word whiskey:

Word: whiskey Freq_Hal: 1364 Pron.: hw"I.ski (SAMPA)


MorphSp:{whiskey}

I. Phoneme String

Remove accents/syllable notations: hwIski

From left to right, find largest phonemes in mapping file at current location of string:

A. Both /h/ and /hw/ are phonemes in mapping file – take /hw/ as first phoneme
B. Current location is third character of phoneme string: /I/. /I/ is a phoneme in
mapping file, and no phoneme in mapping file starts with /Is/. So, /I/ is second
phoneme
C. Similarly, /s/ is next phoneme, followed by /k/ and /i/.
Phoneme string: /hw-I-s-k-i/
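Step I's greedy, longest-match segmentation might be sketched as follows; the inventory here is a tiny illustrative subset of the mapping file's phoneme set, and segment_phonemes is a hypothetical name:

```python
# Greedy longest-match segmentation of a SAMPA string into known
# phonemes.  PHONEMES is an illustrative subset of the mapping file.
PHONEMES = {"h", "hw", "I", "s", "k", "i"}

def segment_phonemes(sampa, inventory=PHONEMES):
    """From left to right, take the longest phoneme in the inventory
    that matches at the current position of the string."""
    longest = max(len(p) for p in inventory)
    result, pos = [], 0
    while pos < len(sampa):
        for size in range(longest, 0, -1):  # try the longest match first
            if sampa[pos:pos + size] in inventory:
                result.append(sampa[pos:pos + size])
                pos += size
                break
        else:
            raise ValueError(f"no phoneme matches at position {pos}")
    return result

print(segment_phonemes("hwIski"))  # ['hw', 'I', 's', 'k', 'i']
```

Trying the longest match first is what makes /hw/ win over /h/ at the start of the string, exactly as in step I.A above.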

II. Minimally constrained grapheme parsings

A. The five phonemes in the phoneme string must correspond to five graphemes.
B. Additionally, “i_e” is a potential split grapheme for this word, according to the
rule that a consonant (or gu/qu) precedes the “e”, and that the longest string of
vowels immediately preceding the consonants before the “–e” is the single vowel
“i.” Rules II.A and II.B imply the following list of 18 minimally constrained
parsings:

1. w-h-i_e-s-ky,
2. w-h-i_e-sk-y,
3. w-h-i-s-key,
4. w-h-i-sk-ey,
5. w-h-i-ske-y,
6. w-h-is-k-ey,
7. w-h-is-ke-y,
8. w-h-isk-e-y,
9. w-hi-s-k-ey,
10. w-hi-s-ke-y,
11. w-hi-sk-e-y,

12. w-his-k-e-y,
13. wh-i_e-s-k-y,
14. wh-i-s-k-ey,
15. wh-i-s-ke-y,
16. wh-i-sk-e-y,
17. wh-is-k-e-y,
18. whi-s-k-e-y
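Setting aside the split-grapheme candidates of rule II.B, the remaining entries in this list are simply every way of cutting whiskey into five contiguous, nonempty pieces. A sketch (contiguous_parsings is a hypothetical name); note that C(6, 4) = 15 contiguous splits plus the three i_e candidates (#1, #2, #13) give the 18 parsings listed:

```python
from itertools import combinations

def contiguous_parsings(word, n):
    """All ways of splitting `word` into exactly n contiguous, nonempty
    graphemes.  Split graphemes such as i_e (rule II.B) are generated
    separately and are not covered by this sketch."""
    cuts = range(1, len(word))  # cut positions between letters
    for points in combinations(cuts, n - 1):
        bounds = [0, *points, len(word)]
        yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

candidates = list(contiguous_parsings("whiskey", 5))
print(len(candidates))  # 15 contiguous candidates
print(candidates[0])    # ['w', 'h', 'i', 's', 'key']
```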

III. Reduction to legitimate parsings based on list of allowed grapheme-phoneme


correspondences

Each minimally constrained grapheme parsing is tested in turn:

A. w-h-i_e-s-ky (#1): this implies that “w” maps to /hw/, “h” maps to /I/. Since the
mapping file does not allow /I/ to map to “h”, this option is eliminated. The same
rule also eliminates candidates #2-8 when encountered, which all begin with w-h-.
B. wh-i_e-s-k-y (#13): this implies “wh” maps to /hw/, “i_e” maps to /I/, “s” maps
to /s/, “k” maps to /k/, and “y” maps to /i/. The mapping file contains all of these
correspondences, and so this option is retained.
C. w-hi-sk-e-y (#11) and wh-i-sk-e-y (#16) are eliminated because the mappings
sk=/s/ and e=/k/ are not allowed by the mapping file.
D. #12 is eliminated because his=/I/ is not a legitimate mapping.
E. #17 is eliminated because s=/k/ and k=/e/ are not legitimate mappings.
F. #18 is eliminated because only the final mapping, y=/i/, is an allowed mapping.
The remaining candidates are: w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i_e-s-k-y (#13),
wh-i-s-k-ey (#14), and wh-i-s-ke-y (#15).
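The step III filter can be sketched as a membership test against the mapping file. The ALLOWED set below is an illustrative subset containing just the correspondences relevant to whiskey, and is_legitimate is a hypothetical name:

```python
# Illustrative subset of the mapping file's allowed grapheme-phoneme
# correspondences (phonemes in SAMPA).
ALLOWED = {
    ("wh", "hw"), ("w", "hw"), ("hi", "I"), ("i", "I"), ("i_e", "I"),
    ("s", "s"), ("k", "k"), ("ke", "k"), ("ey", "i"), ("y", "i"),
}

def is_legitimate(graphemes, phonemes, allowed=ALLOWED):
    """A candidate survives step III only if every aligned
    grapheme-phoneme pair appears in the mapping file."""
    return all(pair in allowed for pair in zip(graphemes, phonemes))

phonemes = ["hw", "I", "s", "k", "i"]
print(is_legitimate(["wh", "i", "s", "k", "ey"], phonemes))  # True  (#14)
print(is_legitimate(["w", "h", "i", "s", "key"], phonemes))  # False (#3)
```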

IV. Based on MorphSp, whiskey is a root word

A. At least one candidate has a split grapheme (/I/ as i_e), and at least one candidate
interprets that –e as not part of a split grapheme. Since /I/ is not a long vowel, it
is determined that there is no split grapheme in this word, and all choices
containing the split grapheme i_e are eliminated. This eliminates one option, wh-
i_e-s-k-y (#13), leaving w-hi-s-k-ey (#9), w-hi-s-ke-y (#10), wh-i-s-k-ey (#14),
and wh-i-s-ke-y (#15).
B. We now perform pairwise comparisons of the candidates.
a. The initial incumbent is w-hi-s-k-ey (#9) and the initial challenger is w-hi-
s-ke-y (#10).
i. First pass of grapheme comparisons from left-to-right:

1. The first grapheme discrepancy encountered is /k/ as “k” or


“ke”. For phoneme /k/, we have the rules “ck” preferred to
“c”, “qu” preferred to “q”, “cqu” preferred to “cq”, “ch”
preferred to “c”, and “cch” preferred to “cc”. None of
these apply.
2. The second grapheme discrepancy encountered is /i/ as
“ey” or “y”. No preferences are assigned for phoneme /i/ in
the first pass, so no decision is made.
ii. Second pass of comparisons from left-to-right:
1. The first grapheme discrepancy encountered is /k/ as “k” or
“ke”. For phoneme /k/, we have the rule “cc” preferred to
“c”. This does not apply.
2. The second grapheme discrepancy encountered is /i/ as
“ey” or “y”. No preferences are assigned for phoneme /i/ in
the second pass, so no decision is made.
iii. Third pass of comparisons from left-to-right:
1. The first grapheme discrepancy encountered is /k/ as “k” or
“ke”. For phoneme /k/, we have the rules “c” preferred to
“sc”, “c” preferred to “cu”, and “k” preferred to “ke”. This
rule applies, and so w-hi-s-k-ey (#9) is kept and w-hi-s-ke-
y (#10) is eliminated.
b. The incumbent is w-hi-s-k-ey (#9) and the challenger is wh-i-s-k-ey (#14).
i. First pass of grapheme comparisons from left-to-right:
1. The first grapheme discrepancy encountered is /hw/ as “w”
or “wh”. For phoneme /hw/, we have the rule “wh”
preferred to “w”. This rule applies, and so we keep wh-i-s-
k-ey (#14) and eliminate w-hi-s-k-ey (#9).
c. The incumbent is wh-i-s-k-ey (#14) and the challenger is wh-i-s-ke-y
(#15). The algorithm proceeds as in step a. above. In particular, in the
third pass, phoneme /k/ has the rule “k” preferred to “ke”, resulting in wh-
i-s-k-ey (#14) being retained and wh-i-s-ke-y (#15) being eliminated.

The only remaining candidate, wh-i-s-k-ey (#14), is selected as the preferred parsing for
this word.

Manual Parsings. A small group of words defied typical patterns seen in English

orthography and presented a unique challenge for grapheme-phoneme alignment. Rather

than attempt to alter the algorithm to accommodate these words, which might in turn

cause unforeseen errors in other cases, these entries were separated and parsed manually:

 52 entries involved words such as wakes, phonemically transcribed as /w-


e-ks/, where the phonemes /k/ and /s/ are separated orthographically by a
FS–e. The algorithm initially handles all cases of consecutive /k/ and /s/
as one phoneme, /ks/, because this “complex phoneme” often corresponds
to the single letter x. To avoid complexities involved in altering the
algorithm to accommodate the interloping FS–e, these words were parsed
manually, and wakes is ultimately parsed as w-a_e-ks.
 25 cases of “loan words” were manually parsed, where it was necessary
algorithmically to consider modified groupings of phonemes in order to
enable reasonable grapheme parsings: Tijuana, signor, signora, signore,
señor, señora, señorita, schizoid, schizophrenia, schizophrenic, pizza,
pizzeria, pizzicato, Mozart’s, Nazi, Naziism, Nazis, Nazism, Mozart,
mezzo, Khmer, cognac, bouillon, bourgeois, and bourgeoisie. For
example, señor, originally parsed as /s-e-n-j-O-r/, was modified to /s-e-nj-
O-r/, since a valid grapheme-phoneme parsing cannot be found if there are
more phonemes than letters.
 8 words presenting special idiosyncratic difficulties were simply parsed
manually, to avoid potential errors caused by altering the algorithm:
bouillabaisse, does, environment, Hebrew, molten, maiden, maidenhead,
and tortilla.

After all the above procedures, parsing of the ELP wordlist was complete. All 40,481 words had either been (a) removed, (b) assigned one parsing algorithmically, or (c) parsed manually. Every instance of a phoneme in the wordlist now corresponded to a grapheme, and every grapheme to a phoneme. Below is a random sampling of 200 words parsed by the algorithm:



Word Phoneme (SAMPA) Grapheme

clavicle k-l-a-v-@-k-l= c-l-a-v-i-c-le


scaremonger s-k-E-r-m-V-N-g-@` s-c-a-re-m-o-n-g-er
salutation s-a-l-j@-t-e-S-n= s-a-l-u-t-a-ti-on
trademarks t-r-e-d-m-A-r-ks t-r-a_e-d-m-a-r-ks
disseminating d-I-s-E-m-@-n-e-4-I-N d-i-ss-e-m-i-n-a-t-i-ng
nigh n-aI n-igh
foreman's f-O-r-m-@-n-z f-o-re-m-a-n-'s
handsomely h-a-n-s-m=-l-i h-a-nd-s-ome-l-y
chuckle tS-V-k-l= ch-u-ck-le
senile s-i-n-aI-l s-e-n-i_e-l
explores I-ks-p-l-O-r-z e-x-p-l-o-re-s
dogmatize d-O-g-m-@-t-aI-z d-o-g-m-a-t-i_e-z
purveyor p-@`-v-e-@` p-ur-v-ey-or
imperceptibly I-m-p-@`-s-E-p-t-@-b-l-i i-m-p-er-c-e-p-t-i-b-l-y
Indies I-n-4-i-z i-n-d-ie-s
ministers m-I-n-I-s-t-@`-z m-i-n-i-s-t-er-s
goggles g-A-g-l=-z g-o-gg-le-s
confidentially k-A-n-f-@-d-E-n-S-l=-i c-o-n-f-i-d-e-n-ti-all-y
unison ju-n-I-s-n= u-n-i-s-on
desecration d-E-s-@-k-r-e-S-n= d-e-s-e-c-r-a-ti-on
malign m-@-l-aI-n m-a-l-i-gn
sacredness s-e-k-r-@-d-n-@-s s-a-c-r-e-d-n-e-ss
thrones T-r-o-n-z th-r-o_e-n-s
coward k-aU-@`-d c-ow-ar-d
watershed w-O-4-@`-S-E-d w-a-t-er-sh-e-d
iceman aI-s-m-a-n i-ce-m-a-n
trapped t-r-a-p-t t-r-a-pp-ed
underfoot V-n-4-@`-f-U-t u-n-d-er-f-oo-t
response r-I-s-p-A-n-s r-e-s-p-o-n-se
solitaire s-A-l-@-t-E-r s-o-l-i-t-ai-re
overturn o-v-@`-t-3`-n o-v-er-t-ur-n
nickels n-I-k-l=-z n-i-ck-el-s
traveling t-r-a-v-l=-I-N t-r-a-ve-l-i-ng
amorous a-m-@`r-@-s a-m-or-ou-s
contractor's k-@-n-t-r-a-k-t-@`-z c-o-n-t-r-a-c-t-or-'s
apologetically @-p-A-l-@-dZ-E-4-I-k-l-i a-p-o-l-o-g-e-t-i-c-all-y
identifiable aI-d-E-n-4-@-f-aI-@-b-l= i-d-e-n-t-i-f-i-a-b-le
railhead r-e-l-h-E-d r-ai-l-h-ea-d
chloroform k-l-O-r-@-f-O-r-m ch-l-o-r-o-f-o-r-m
discs d-I-s-ks d-i-s-cs
nonfood n-A-n-f-u-d n-o-n-f-oo-d

Wilson w-I-l-s-n= w-i-l-s-on


commissioner's k-@-m-I-S-n=-@`-z c-o-mm-i-ssi-on-er-'s
staffing s-t-a-f-I-N s-t-a-ff-i-ng
Massachusetts m-a-s-@-tS-u-s-@-t-s m-a-ss-a-ch-u-s-e-tt-s
stealthily s-t-E-l-T-@-l-i s-t-ea-l-th-i-l-y
sneaky s-n-i-k-i s-n-ea-k-y
vendor v-E-n-4-@` v-e-n-d-or
agitator a-dZ-@-t-e-4-@` a-g-i-t-a-t-or
attacking @-t-a-k-I-N a-tt-a-ck-i-ng
barely b-E-r-l-i b-a-re-l-y
tracings t-r-e-s-I-N-z t-r-a-c-i-ng-s
decompression d-i-k-@-m-p-r-E-S-n= d-e-c-o-m-p-r-e-ssi-on
corked k-O-r-k-t c-o-r-k-ed
ripping r-I-p-I-N r-i-pp-i-ng
heights h-aI-t-s h-eigh-t-s
admirals a-d-m-@`r-@-l-z a-d-m-ir-a-l-s
hungrier h-V-N-g-r-i-@` h-u-n-g-r-i-er
inmate I-n-m-e-t i-n-m-a_e-t
chemist k-E-m-I-s-t ch-e-m-i-s-t
darkened d-A-r-k-@-n-d d-a-r-k-e-n-ed
magnate m-a-g-n-@-t m-a-g-n-a-te
twopence t-V-p-@-n-s tw-o-p-e-n-ce
sternum s-t-3`-n-@-m s-t-er-n-u-m
coerce k-o-3`-s c-o-er-ce
flier f-l-aI-@` f-l-i-er
umbrage V-m-b-r-I-dZ u-m-b-r-a-ge
excavations E-ks-k-@-v-e-S-n=-z e-x-c-a-v-a-ti-on-s
homicide h-A-m-I-s-aI-d h-o-m-i-c-i_e-d
ineptitude I-n-E-p-t-I-t-u-d i-n-e-p-t-i-t-u_e-d
terminology t-3`-m-@-n-A-l-@-dZ-i t-er-m-i-n-o-l-o-g-y
Prussian p-r-V-S-n= p-r-u-ssi-an
oxidised A-ks-@-d-aI-z-d o-x-i-d-i_e-s-d
killable k-I-l-e-b-l= k-i-ll-a-b-le
exceptions I-ks-E-p-S-n=-z e-xc-e-p-ti-on-s
bang b-a-N b-a-ng
plea p-l-i p-l-ea
footnote f-U-t-n-o-t f-oo-t-n-o_e-t
embezzling I-m-b-E-z-l=-I-N e-m-b-e-zz-l-i-ng
personnel p-3`-s-n=-E-l p-er-s-onn-e-l
fester f-E-s-t-@` f-e-s-t-er
thermostat T-3`-m-@-s-t-a-t th-er-m-o-s-t-a-t
running r-V-n-I-N r-u-nn-i-ng
goodwill g-U-d-w-I-l g-oo-d-w-i-ll

raping r-e-p-I-N r-a-p-i-ng


cupped k-V-p-t c-u-pp-ed
listless l-I-s-t-l-@-s l-i-s-t-l-e-ss
hero's h-I-r-o-z h-e-r-o-'s
hemlock h-E-m-l-A-k h-e-m-l-o-ck
cooled k-u-l-d c-oo-l-ed
flouted f-l-aU-4-@-d f-l-ou-t-e-d
adopting @-d-A-p-t-I-N a-d-o-p-t-i-ng
flywheel f-l-aI-hw-i-l f-l-y-wh-ee-l
townsfolk t-aU-n-z-f-o-k t-ow-n-s-f-o-lk
Stephanie s-t-E-f-@-n-i s-t-e-ph-a-n-ie
wrathful r-a-T-f-l= wr-a-th-f-ul
beauty's b-ju-4-i-z b-eau-t-y-'s
hedgerow h-E-dZ-r-o h-e-dge-r-ow
fingerboard f-I-N-g-@`-b-O-r-d f-i-n-g-er-b-oa-r-d
absence a-b-s-n=-s a-b-s-en-ce
troopship t-r-u-p-S-I-p t-r-oo-p-sh-i-p
useable ju-z-e-b-l= u_e-s-a-b-le
aboveground @-b-V-v-g-r-aU-n-d a-b-o-ve-g-r-ou-n-d
throne T-r-o-n th-r-o_e-n
facades f-@-s-A-d-z f-a-c-a-de-s
suck s-V-k s-u-ck
mannerly m-a-n-@`-l-i m-a-nn-er-l-y
jimmied dZ-I-m-i-d j-i-mm-i-ed
decomposition d-i-k-A-m-p-@-z-I-S-n= d-e-c-o-m-p-o-s-i-ti-on
Dictaphone d-I-k-t-@-f-o-n d-i-c-t-a-ph-o_e-n
rosaries r-o-z-@`r-i-z r-o-s-ar-ie-s
rearrange r-i-@`r-e-n-dZ r-e-arr-a-n-ge
forsythia f-O-r-s-I-T-i-@ f-o-r-s-y-th-i-a
weakly w-i-k-l-i w-ea-k-l-y
volley v-A-l-i v-o-ll-ey
seamstress s-i-m-s-t-r-@-s s-ea-m-s-t-r-e-ss
readjustment r-i-@-dZ-V-s-t-m-@-n-t r-e-a-dj-u-s-t-m-e-n-t
astonishingly @-s-t-A-n-I-S-I-N-l-i a-s-t-o-n-i-sh-i-ng-l-y
opponent's @-p-o-n-@-n-t-s o-pp-o-n-e-n-t-'s
finals f-aI-n-l=-z f-i-n-al-s
sterilized s-t-E-r-@-l-aI-z-d s-t-e-r-i-l-i_e-z-d
ineffectiveness I-n-I-f-E-k-t-I-v-n-@-s i-n-e-ff-e-c-t-i-ve-n-e-ss
reiterating r-i-I-4-@`r-e-4-I-N r-e-i-t-er-a-t-i-ng
meticulous m-@-t-I-k-j@-l-@-s m-e-t-i-c-u-l-ou-s
Australian O-s-t-r-e-l-j@-n au-s-t-r-a-l-ia-n
phosphide f-A-s-f-aI-d ph-o-s-ph-i_e-d
stenographer s-t-E-n-A-g-r-@-f-@` s-t-e-n-o-g-r-a-ph-er

lounged l-aU-n-dZ-d l-ou-n-ge-d


give g-I-v g-i-ve
descending d-I-s-E-n-4-I-N d-e-sc-e-n-d-i-ng
newsreader n-u-z-r-i-4-@` n-ew-s-r-ea-d-er
wobble w-A-b-l= w-o-bb-le
transpiring t-r-a-n-s-p-aI-r-I-N t-r-a-n-s-p-i-r-i-ng
impair I-m-p-E-r i-m-p-ai-r
junctures dZ-V-N-k-tS-@`-z j-u-n-c-t-ure-s
conditioners k-@-n-d-I-S-n=-@`-z c-o-n-d-i-ti-on-er-s
staircase s-t-E-r-k-e-s s-t-ai-r-c-a_e-s
tumultuous t-u-m-V-l-tS-u-@-s t-u-m-u-l-t-u-ou-s
Curie k-jU-r-i c-u-r-ie
conversational k-A-n-v-@`-s-e-S-n=-@-l c-o-n-v-er-s-a-ti-on-a-l
rehearsal r-I-h-3`-s-l= r-e-h-ear-s-al
lesion l-i-Z-n= l-e-si-on
introductions I-n-t-r-@-d-V-kS-n=-z i-n-t-r-o-d-u-cti-on-s
associates @-s-o-s-i-e-t-s a-ss-o-c-i-a_e-t-s
fused f-ju-z-d f-u_e-s-d
impute I-m-p-ju-t i-m-p-u_e-t
counterespionage k-aU-n-4-@`-E-s-p-i-@-n-A-Z c-ou-n-t-er-e-s-p-i-o-n-a-ge
Andy's a-n-d-i-z a-n-d-y-'s
ruthless r-u-T-l-@-s r-u-th-l-e-ss
cavities k-a-v-@-4-i-z c-a-v-i-t-i-es
lunchtime l-V-n-tS-t-aI-m l-u-n-ch-t-i_e-m
pronunciation p-r-@-n-V-n-s-i-e-S-n= p-r-o-n-u-n-c-i-a-ti-on
ketchup k-E-tS-V-p k-e-tch-u-p
quivering kw-I-v-@`-I-N qu-i-v-er-i-ng
exhaustingly I-gz-O-s-t-I-N-l-i e-xh-au-s-t-i-ng-l-y
toothpick t-u-T-p-I-k t-oo-th-p-i-ck
brushes b-r-V-S-@-z b-r-u-sh-e-s
additives a-4-@-4-I-v-z a-dd-i-t-i-ve-s
honey h-V-n-i h-o-n-ey
fright f-r-aI-t f-r-igh-t
spoon s-p-u-n s-p-oo-n
ministrations m-I-n-@-s-t-r-e-S-n=-z m-i-n-i-s-t-r-a-ti-on-s
rehearsals r-I-h-3`-s-l=-z r-e-h-ear-s-al-s
picnicked p-I-k-n-I-k-t p-i-c-n-i-ck-ed
officiated @-f-I-S-i-e-4-@-d o-ff-i-c-i-a-t-e-d
indiscreet I-n-d-I-s-k-r-i-t i-n-d-i-s-c-r-ee-t
gambler g-a-m-b-l-@` g-a-m-b-l-er
hydrophilic h-aI-d-r-@-f-I-l-I-k h-y-d-r-o-ph-i-l-i-c
glides g-l-aI-d-z g-l-i_e-d-s
wakes w-e-ks w-a_e-ks

acquit @-kw-I-t a-cqu-i-t


millions m-I-l-j@-n-z m-i-ll-io-n-s
needles n-i-4-l=-z n-ee-d-le-s
lethal l-i-T-@-l l-e-th-a-l
macaw m-@-k-O m-a-c-aw
fain f-e-n f-ai-n
tuning t-u-n-I-N t-u-n-i-ng
painfully p-e-n-f-l=-i p-ai-n-f-ull-y
conference k-A-n-f-@`-@-n-s c-o-n-f-er-e-n-ce
trick t-r-I-k t-r-i-ck
obtuse @-b-t-u-s o-b-t-u_e-s
shrank S-r-a-N-k sh-r-a-n-k
chilly tS-I-l-i ch-i-ll-y
dissipate d-I-s-@-p-e-t d-i-ss-i-p-a_e-t
psi s-aI ps-i
hairbrush h-E-r-b-r-V-S h-ai-r-b-r-u-sh
dash d-a-S d-a-sh
dilating d-aI-l-e-4-I-N d-i-l-a-t-i-ng
mutable m-ju-4-@-b-l= m-u-t-a-b-le
header h-E-4-@` h-ea-d-er
praising p-r-e-z-I-N p-r-ai-s-i-ng
inventor I-n-v-E-n-4-@` i-n-v-e-n-t-or
fading f-e-4-I-N f-a-d-i-ng
abdicate a-b-d-@-k-e-t a-b-d-i-c-a_e-t
draper d-r-e-p-@` d-r-a-p-er
overdoing o-v-@`-d-u-I-N o-v-er-d-o-i-ng
indecipherable I-n-d-I-s-aI-f-@`-@-b-l= i-n-d-e-c-i-ph-er-a-b-le
urethane jU-r-@-T-e-n u-r-e-th-a_e-n
clubfoot k-l-V-b-f-U-t c-l-u-b-f-oo-t