
Lee / Defining Core Vocabulary, JEngL 29.3 (September 2001)

Defining Core Vocabulary and Tracking Its Distribution across Spoken and Written Genres
Evidence of a Gradience of Variation from the British National Corpus
DAVID Y. W. LEE

Lancaster University

Journal of English Linguistics, Vol. 29, No. 3, September 2001, 250-278. © 2001 Sage Publications

The notion of a core vocabulary of English is one that would seem to be intuitively plausible and uncontroversial. For many linguists and language teachers, the idea of the existence of a common, central core of lexical items is taken almost for granted, even if an explicit inventory of such words cannot be agreed on so readily. Indeed, there is no universally accepted, ready-made list of the core words of the English language to be found, partly due to the very concept being amenable to different definitions and interpretations (see below). For one thing, whichever definition or approach is taken, would it be useful to specify one set of core words for one component of the language (spoken English) and another set for the other (written English)? To do so would seem to militate against the very notion of coreness as something that cuts across all genres and divisions in the language as a whole. Even though it is well known that the lexis of spoken language is rather different from that of written language, for example, it might be argued that these differences are, by definition, noncore since core lexis is precisely that which is somehow central to the language as a whole and thus not specific to any lect or register.

Vocabulary is an area that is quite heavily researched and discussed, especially by applied linguists and language teachers. Anyone interested in the field should consult the established books on the topic, such as Carter (1987), Carter and McCarthy (1988), and the more recent Schmitt and McCarthy (1997) and Schmitt (2000). As far as I know, however, there have been few (if any) empirical, multiple-genre-based studies done to establish whether, in fact, part of the difference between spoken and written language in some way lies in a differential use of core vocabulary items and how this differential use is distributed among spoken and written genres.

Biber (1988) looked at the features "mean word length" and "type-token ratio" in his study of the Lancaster-Oslo/Bergen (LOB) and London-Lund (LLC) corpora, but these two features are very different from a consideration of core vocabulary: mean word length gives only an extremely rough guide to the difficulty of the words used, while type-token ratio measures how varied the vocabulary is among the words in a text, saying nothing about the level of difficulty or rareness. Comparing a list of core words against a corpus to calculate to what extent the words in the corpus are composed of core words, as opposed to noncore ones, however, is a different enterprise, and the results tell us, for each text or genre, what percentage of the words used belong to the common core of the language. This study is intended to fill that gap in the literature. A study of the frequency distribution of core vocabulary will probably shed some empirical light on our intuitive feelings about some of the lexical differences among text genres, both written and spoken.

Therefore, as part of broader research being done on grammatical and other differences between speech and writing, I conducted just such an investigation into the possible differential use of core vocabulary across spoken and written genres of English. This task was seen to be highly amenable to a computational, corpus-based approach, and the Sampler Corpus of the British National Corpus (BNC), a balanced subcorpus of two million words (half spoken, half written), was used for this study.1 In addition, another subset of the BNC, a four-million-word corpus compiled to be more representative of the huge range of genres available in the full BNC, with texts manually assigned to fine genre categories, was also examined for the spread of core vocabulary items. In this article, I describe how an operationalization of core vocabulary was carried out and report on the interesting results obtained from these two subcorpora of the BNC.

Different Conceptions of Core Vocabulary


So far, I have not defined clearly what I mean by core vocabulary, and it is likely that different readers of this article will have had differing conceptions of this as they read. When the notion of coreness is examined more carefully, it becomes clear that very many different approaches can be taken, and while some of them overlap, others make conflicting demands. Worse still, it could be argued that there are several core vocabularies rather than a completely unitary and discrete core vocabulary (Carter 1987, 33). I shall return to this last point later on, but first, some of the possible conceptions of core vocabulary will be explored, followed in the next section by a description of the approach that I settled on for the purposes of this investigation. The many different possible understandings of core vocabulary reflect the essential vagueness of the concept or, perhaps more precisely, its multifaceted nature. Designers of a core vocabulary list may have different interests and purposes and

may therefore come up with very different lists. The following are some possible working definitions of core vocabulary, each of which may be operationalized independently:

(i) the most frequent words in the language as a whole

Until recently, this has usually meant most frequent in the written language by default, since the written word has always been easier to count and study (West's famous 1953 word list was based on a corpus of two to five million words of written English). Core vocabulary here amounts to a raw statistical ranking of the most common words, with a cutoff point (e.g., the first 850, or 1,000, or 2,000, or 3,500). In some uses of core vocabulary (e.g., a pedagogical word list), only content words are included, so prepositions, pronouns, and so on are assumed to be already known and do not add to the count. (In reality, however, most of the so-called function or grammatical words are not only among the most frequent but are also often the most difficult to teach and master.)

With the advent of corpus-based studies of language, many frequency-based word lists are now available and have informed new approaches to vocabulary and language teaching, such as Willis (1990). The interesting new facts derived from corpora include the following: the 700 most frequent words of English account for 70 percent of all English text, the most frequent 1,500 words account for around 76 percent of text, and the most frequent 2,500 words for 80 percent. Or, to put it another way, between 50 and 100 English words will account for about half of the total number of tokens in any text or corpus of English: they are the most frequently used (and reused) words in English (cf. Kennedy 1998, 96-97). Also, in a general, balanced corpus (as opposed to a specialized or restricted-genre corpus), practically all of these 50 words will be function words. Even taking the 100 most frequent words, almost 80 percent of them can be expected to be function words.
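Coverage figures of this kind are straightforward to compute once a corpus is in hand. The following is a minimal sketch of the calculation (the toy word list here is purely illustrative; the published percentages, of course, require a large balanced corpus such as the BNC):

```python
from collections import Counter

def coverage(tokens, top_n):
    """Fraction of all running tokens accounted for by the top_n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    return sum(freq for _, freq in top) / total

# Toy example only: 13 tokens, of which the top 3 types cover 8.
text = "the cat sat on the mat and the dog sat on the rug".split()
print(round(coverage(text, 3), 2))   # -> 0.62
```

Run over a balanced corpus with top_n set to 700, 1,500, or 2,500, the same function yields the kind of figures cited above.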
However, these figures are for something called "general English," a notional, general-purpose, non-genre-specific aggregate of a whole range of spoken and written genres. We know, however, that the more specialized or homogeneous the corpus, the more content words there will be among the top most frequent words. Kennedy (1998, 102), for example, found that 36 percent of the top 50 words of an economics corpus consisted of content words, while the figure was only 6 percent for academic English as a whole and 2 percent for the English language as a whole. The notion "generally most frequent in the language" thus employs several hidden theoretical assumptions. In this article, I will show how a particular set of core vocabulary items is distributed throughout several spoken and written domains and genres, rather than give average figures for large or vague divisions of language.

Another problem with operationalizing "the English language as a whole" is that of regional variety: which English language do we mean? Perhaps we could find something common to all the "old-variety" Englishes (American, Canadian, British, Australian, New Zealand, etc.) and call this our core vocabulary. Or should we include the "New Englishes" as well (those widely used as second languages in postcolonial countries such as India, Singapore, and Nigeria)? Peyawary (1999) provides an interesting first attempt at compiling just such a core nonregional vocabulary: he looked at the most frequent words common to American, British, and Indian English. This approach only looked at frequency, however, and the other approaches to core vocabulary listed below also need to be considered.

(ii) the most frequent in terms of a particular medium

Should there be different lists for spoken and written language? In the latest (1995) edition of the Longman Dictionary of Contemporary English (LDOCE), the differences between the two media in the use of particular words are emphasized to a degree that has never before been seen in dictionaries: there are bar graphs for some word entries showing relative frequencies of usage in speech and writing, and some word entries have the labels S1, S2, S3 or W1, W2, W3 to indicate that the particular word is one of the thousand most frequently used words in spoken (S1) or written (W1) English, or one of the first two thousand most frequent words (S2/W2) in spoken or written English, and so on. In addition, phrases that are typically used in speech rather than writing have the label "spoken" put next to them in the dictionary entries. With this in mind, whether there is a core of words common to both spoken and written language is certainly one way of looking at coreness. Since speaking, writing, listening, and reading are quite different processes, the two channels of language use may be expected to be quite different, even in the use of the most core or essential words.
Psycholinguists, writing on the structure of the mental lexicon, observe the following:

The neuropsychological data strongly suggest that there are four different lexicons, one each for speaking, writing, and spoken and visual word recognition . . . [alternatively] we have one lexicon, but four different methods of gaining access to it. (Harley 1995, 286)

Chafe (1986, 32) sees the differences in vocabulary mainly in terms of the noncore words, writing that

to a certain extent, spoken language and written language have their own vocabularies. Most words are neutral between the two, but some that are natural to speaking are out of place in writing, whereas for others the reverse is true. The reason is not that speakers must choose words more quickly than writers, but that each kind of language has its own divergent history. . . . The separate history of English writing has created a partially different vocabulary, largely restricted to that medium.

Actually, one should question whether such broad and monolithic categories as "spoken" and "written" are really useful. This study will show that the frequencies of core words are very much clinal (rather than dichotomous) in nature, so "typical spoken" and "typical written" would be more appropriate labels.

(iii) the most frequent words for a particular demographic grouping

In this usage, age, socioeconomic status, educational level, region, and so on are the relevant criteria in the constitution of a core vocabulary, whether individually or jointly; thus, we could talk about the core vocabulary of upper-class, educated, middle-aged people, for example. This approach, if taken to its logical limits, makes coreness quite meaningless because narrowly defined vocabularies can hardly be considered core to the language. It is for this reason that slang and markedly colloquial items are generally not considered part of any core vocabulary. In certain demographic parts of a spoken corpus, for example, we might expect the more colloquial terms bobby/cop/bill/pig rather than police/police officer. But to include such terms would seem to be stepping away from a common core vocabulary and moving toward the concept of several different genre-specific vocabularies. However, all conceptions of core vocabulary must necessarily fix some limits for socioeconomic and educational parameters, whether consciously or unconsciously. West's (1953) list, for instance, was criticized for including some rather literary or unexpected words, such as mannerism, vessel, ornament, mere, stock, footman, a consequence of the list being based on written texts (cf. Carter 1987, 165).
(iv) words that are most general, or unmarked, or central to the language

Core words may be thought of as those in terms of which other less core words can be described (i.e., words that are hypernyms or superordinate in some way and so can substitute for other words): this is exemplified in the Basic English (Ogden 1968) word list, where we find, for example, bird but not robin and flower but not rose. In this sense, core vocabulary can be seen as related to the idea of semantic primes or primitives in componential semantics. This overlaps somewhat with sense (vii) below (words most useful for dictionary definitions).

(v) words that are cognitively basic or most salient


This is based on the idea of prototypical words for various colors (Berlin and Kay 1969), animals, and objects. In this sense, core vocabulary would be linked to the notion of prototype categories as developed within cognitive linguistics (e.g., Rosch 1973; Rosch and Lloyd 1978). In this view, core words are prototypical ones, those that would fall toward the lower end of a hierarchy of words from the most general (e.g., animate) to the most specific (e.g., cocker spaniel). Such words would be cognitively more salient or prototypical to users of a language and are generally more easily pictured in the mind (e.g., table would be more core in this sense than furniture). Note, however, that this conflicts somewhat with sense (iv) above (core vocabulary as being the most general words) since prototypical words tend to be hyponyms or specific words like cat and dog rather than more general words such as animal or mammal (and, in a reversal of the situation in sense (iv) above, robin and rose would come in as core words, rather than bird or flower). This also conflicts with the lexicographic sense of core vocabulary (see (vii) below), since while color, reptile, and furniture are useful for defining other words, they tend to be less picturable or psychologically salient or accessible than, say, yellow, snake, or table.

(vi) words that, in their most general sense, have the most widespread usage across a wide range of genres

In statistical terms, these would be those that have a high index of dispersion among text categories (see, e.g., Carroll, Davies, and Richman 1971 for a discussion of this). Coreness in this sense is decided on statistical grounds and relates to those words that are frequently used across a spread of genres.
This conception of core vocabulary is a measure of the usefulness or value of words, their "disposability" (French: disponible) across a wide range of situations, and can be thought of as showing the words we use all the time rather than the words we use only some of the time, for specific settings, purposes, and so on.

(vii) words useful for dictionary definitions

In practice, this combines elements of most of the above and represents a kind of practical or applied amalgam of several of the above criteria. For example, the editors of the Oxford Advanced Learner's Dictionary (Hornby 1995, 1417) declare that the words in their defining vocabulary list were "carefully chosen according to their frequency in the language and their value to students as a core vocabulary of English" (italics added). Presumably, "value to students" is established separately and independently, and it is not acknowledged that this may conflict with other goals or interests. One important additional element in this lexicographic approach is that words which may not fall under the above categories but are nonetheless practical or useful for lexicographic or word-definition purposes also may be included. This fact is not always clearly acknowledged (e.g., the inclusion in the earlier LDOCE 1978 word list of the words adverb and adjective was motivated by lexicographic necessity rather than by any notion of coreness).

Several dictionaries now have such a controlled defining vocabulary. The latest edition of the LDOCE (1995), for instance, has a word list chosen on the following two criteria: (1) frequency in the Longman Corpus Network (a collection of computerized English corpora) and (2) correct use by learners in the Longman Learners' Corpus (a five-million-word collection of learners' written English from over seventy countries, used by Longman to determine frequently misused words and other errors). This second criterion is difficult to apply since the simplest or most frequent and essential words of a language (e.g., the definite article the in English) are often among those most often used incorrectly by learners. Obviously, the cannot be left out of any defining vocabulary list!

The above are by no means the only possible ways of thinking about core vocabulary. Other possibilities include the following: first words to be acquired in the first language, easily recalled words in memorization and association tests by psycholinguists, culturally and cross-culturally salient words ("nuclear" words), words with the highest transferability to a second language (i.e., language-neutral, or non-language-specific words), and so on (cf. Carter 1987, 45).
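Of the senses above, (vi) is the only purely statistical one, and it is worth seeing how such a dispersion index works in practice. One widely used measure (not necessarily the index used by Carroll, Davies, and Richman, who developed their own) is Juilland's D; the sketch below computes it under the simplifying assumption of equal-sized corpus parts:

```python
import statistics

def juilland_d(freqs):
    """Juilland's D for a word's frequencies across n equal-sized corpus parts.
    D is 0 when all occurrences fall in one part and approaches 1 when the
    word is spread perfectly evenly across parts."""
    n = len(freqs)
    mean = sum(freqs) / n
    if mean == 0:
        return 0.0
    sd = statistics.pstdev(freqs)   # population standard deviation
    v = sd / mean                   # coefficient of variation
    return 1 - v / (n - 1) ** 0.5

# A function word spread evenly across four genres scores near 1;
# a topic word concentrated in one genre scores 0.
print(round(juilland_d([50, 48, 52, 50]), 3))   # -> 0.984
print(round(juilland_d([200, 0, 0, 0]), 3))     # -> 0.0
```

A high-D word is "core" in sense (vi) regardless of its absolute frequency, which is precisely why dispersion and raw frequency can pick out different word lists.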

Data, Method, and Tools


Before proceeding any further, a short note should be made here on the slippery notion of "word." There are many occasions where this ordinary term serves us well. Linguists, however, have various technical expressions that can be used in place of word when a more precise meaning is required (e.g., lexeme, lemma, morpheme, word form, orthographic word, lexical item, multiword, idiom) and other terms to describe the problems that arise from not using these more precise terms: homophony, homonymy, polysemy, and so on. Some of the difficulties with computationally implementing the concept of a word (or any of the other terms above) and with using a computer to count words can be illustrated by the following:
Upper case and lower case words: are they always the same word (cp. Polish and polish)?

Hyphenated words: hyphens inserted purely for line breaks should be ignored, but hyphenated compound words can be treated separately or as a whole depending on one's purpose.

Abbreviations: these complicate the task of segmentation from a computational point of view (e.g., Mr. vs. Mr, M.B.E. vs. MBE; for the latter, is it one word, or does it stand for five?).

Apostrophes, quote marks, contractions, and negatives: these also complicate segmentation (e.g., John's, Thomas', 1950's, I'll, I've, she's, don't).

Numbers: for example, one versus 1 (same word?), 1964 (one word?), nineteen sixty-four (one, two, or three words?).

Homographs (e.g., make as verb or noun), polysemic words, and homonyms: for counting vocabulary items, these should be treated separately, but this is often difficult to do computationally.

Lemmas and word families: should we count word forms or lemmas or word families? Lemmas group inflectional variants of a word together (thus, work, works, worked, and working share the same lemma WORK). A word family groups together all inflectional variants (lemmas) plus the common derivatives (cf. Nation and Waring 1997). Sinclair (1991, 7-8, 28) has argued that lemmas are not often useful and should be used only if the different word forms share a certain amount and type of similarity in their environments of use. However, his remarks were made in the context of collocational analysis. For the purposes of measuring vocabulary, people in the field seem to agree that the word family is the most meaningful unit to work with and pedagogically the most useful (Schmitt 2000, 2). My own feeling is that as long as we realize we are not dealing with the collocational frameworks or multiword constructions that the individual word forms within a lemma or word family enter into, vocabulary counts can be given in terms of words, lemmas, word families, or any other grouping, as long as this is made clear.

Fixed phrases/multiwords and phrasal verbs: for example, of course, as well as, in addition, take up, and so on. Counting each as equivalent to one word has implications for word frequency lists.
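The practical effect of several of these decisions shows up in even a trivial tokenizer. The sketch below is illustrative only (the regular expressions are my own choices, not those used in this study): it shows how decisions about apostrophes and case-folding change both token and type counts.

```python
import re

# Two naive tokenization policies: one splits at every non-letter,
# the other keeps word-internal apostrophes (contractions, genitives).
NAIVE = re.compile(r"[A-Za-z]+")
WITH_APOSTROPHES = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

sample = "I've seen John's 'polish' and Polish isn't the same."

print(NAIVE.findall(sample))             # don't -> "don", "t", etc.
print(WITH_APOSTROPHES.findall(sample))  # contractions kept whole
# Case-folding decides whether Polish and polish count as one type:
print(len({w.lower() for w in NAIVE.findall(sample)}))   # -> 11
```

Every such policy choice propagates directly into any vocabulary count built on top of it, which is why the segmentation decisions above have to be made explicit.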

When dealing with vocabulary lists, most of the time we are dealing with units best described as lexemes since we want to keep polysemous word forms apart as far as possible while conflating inflectional forms. However, in the absence of semantic tags and with only the help of part-of-speech tags, computers usually end up counting only a close approximation to lexemes since not all different senses of word forms can be disambiguated by part of speech alone (e.g., can as verb or noun is easy to disambiguate, but the different senses of the noun bank require more sophisticated contextual disambiguation rules). Also, not all taggers treat multiwords (e.g., in spite of) as one unit or lexeme. The specific implementation of vocabulary/word counting in this study is given in a later section.

As outlined earlier, core vocabulary can be defined from a variety of perspectives, and the whole notion seems quite broad. Many (or perhaps all) of the approaches intersect one another and describe phenomena which overlap to some degree. Because of time and other practical considerations, it was decided that for this investigation, I would use a ready-made, easily available, and linguistically motivated list, one that has been carefully selected for a specific purpose. My choice was to use the defining vocabulary word list included at the back of the LDOCE (1987):2 this is called the Longman Defining Vocabulary by the editors (and hereafter referred to as the LDV). In terms of the above definitions, therefore, I chose to operationalize approach (vii) above, which is itself a mixed approach. A manual inspection of the words in the list showed that, apart from a few odd choices, the LDV list could be considered to contain core vocabulary in several senses.

I thus designed a project to measure the percentage of these words in each domain of the BNC Sampler Corpus3 (hereafter BNC:S), which is a subcorpus of two million words intended to represent an equally balanced selection from the spoken and written components of the BNC. However, I later discovered some problems with this sampler corpus: many important written and spoken genres were missing, and moreover, the texts were only divided into broad domains (subject matter), not genres, hence the results were not as illuminating as they could have been. I therefore repeated the core vocabulary count using a larger, four-million-word subset of the BNC. This consisted of texts individually categorized according to a genre scheme designed to capture all the major spoken and written genres of English (see Lee forthcoming for a fuller discussion of the scheme and of the shortcomings of the sampler BNC).4 This article concentrates, however, on the analysis of the sampler BNC, due to space and time constraints.
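The core of the counting task itself is simple to state, whatever tool actually performs it: for each text, count the proportion of running-word tokens that appear in the core list. The following is a rough sketch only (the six-word core set is a stand-in for the 2,000+ word expanded LDV, and the tokenizer is deliberately naive):

```python
import re

def core_percentage(text, core_words):
    """Percentage of running-word tokens found in the core word set.
    Assumes the core list already spells out all inflected forms."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in core_words)
    return 100 * hits / len(tokens)

# Hypothetical miniature core list; 6 of the 7 tokens below are core.
core = {"the", "cat", "sat", "on", "a", "mat"}
print(round(core_percentage("The cat sat on a periwinkle mat", core), 1))   # -> 85.7
```

Computing this figure per text and then averaging within each domain or genre yields the kind of distributional profile reported later in the article.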

Expanding the Longman Defining Vocabulary List


At first, it would seem that all that is needed is simply to scan the words from the Longman list into a machine-readable format to obtain a ready-made list of core words. However, given that the Longman list consisted of only the base forms of words, followed by a list of prefixes and suffixes, which are said in principle to be available for combination with any of the base forms, a whole set of problems presented itself. A lack of access to the full list of words actually used in the definitions of words in the dictionary (as opposed to those theoretically available for use) meant that simply scanning in the Longman list would not be enough. To facilitate the generation of a more fully specified word list, I used a piece of commercial software called WordFind,5 which could generate lists of English words based on a sequence of characters preceded or followed by wildcards. The output was then edited by hand to include only those words that followed the principles of the LDV. The following is a discussion of some of the issues faced in expanding the LDV to create a fully explicit list of core vocabulary items.

First, any count of core vocabulary should count inflections of the base word as well (i.e., not just cry but also cries, cried, and crying). At the time of investigation, a lemmatized version of the BNC:S was not available, so searching could not be done by lemma. All inflectional forms, regular and irregular, thus had to be fully specified and incorporated into the word list.

Second, the affixes listed by Longman in the LDV are said to be potentially combinable with any of the base words, but even a cursory look will reveal that not all words so derived can still be considered part of an expanded core vocabulary (and were unlikely to have been used in definitions by Longman). For example, move (core word) + -ment (LDV-allowed suffix) gives movement, which could probably be considered core; but obtain (core word) + -ment gives obtainment (certainly not a core word); pronounce (v) + -ment = pronouncement (probably not core); and so on.

Third, quite a few words on the list were polysemous, and it was not always immediately clear which meaning was intended as the core sense since Longman's stated policy was that only one core sense of each key word was used in the dictionary. Although the present study did not aim to count word meanings, only word forms, I needed to know the basic sense of each base word before I could manually add its inflections and derivative forms. For example, afford is listed in the LDV without any part-of-speech tag or further information about its core sense, so I had to assume that only the nonformal meaning of afford (i.e., have enough money to buy) was intended here, giving me affordable and afforded as possible words to manually add to the list, but not affording and affords, because these come from a different sense of the word and are not core. I thus manually checked all the words carefully to ensure that any affixes or inflectional endings added did not change the part of speech (if this was specified in the LDV) or the core sense of the base words. Thus, if a word was specified as core only as a noun (e.g., box), then any inflected forms had also to be nouns (thus excluding boxing and boxed). A particular theoretical and methodological approach also had to be taken with respect to LDV prefixes and the question of compound words.
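The inflection-expansion step just described can be illustrated in miniature. The sketch below generates only regular noun plurals and is a hypothetical stand-in for the manual procedure, in which irregular forms were listed by hand and every generated form was checked against the base word's core sense and part of speech:

```python
def expand_noun(base):
    """Regular English noun plurals only; irregulars (child/children, etc.)
    would have to be listed by hand, as in the study itself."""
    if base.endswith(("s", "x", "z", "ch", "sh")):
        return [base, base + "es"]
    if base.endswith("y") and base[-2] not in "aeiou":
        return [base, base[:-1] + "ies"]
    return [base, base + "s"]

# box is core only as a noun, so boxing and boxed are excluded by construction:
print(expand_noun("box"))   # -> ['box', 'boxes']
print(expand_noun("cry"))   # -> ['cry', 'cries']
```

Restricting the generator to one part of speech is what enforces the constraint described above: forms belonging to a different word class or sense simply never enter the list.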
LDV-allowed prefixes were ignored for this study, as there was no way to automatically generate a list of core words with the allowed prefixes without at the same time introducing (and having to arbitrate on) a potentially infinite list of non-core-like words (e.g., dis+position, dis+association, re+double, mis+spend, un+dead, etc.). This means that no attempt was made to guess the many possible compound words that the LDOCE may have used in its definitions. The Longman policy states that definitions occasionally include compound words formed from words in the Defining Vocabulary, but this is only done if the meaning is completely clear. For example, the word businessman (formed from business and man) is used in some definitions. Where I have found similar constructions in my WordFind-generated list, I have included them, but no attempt was made to be exhaustive. The Longman policy, however, does not deal with all possible cases, as not all such transparently clear compounds are also simultaneously core in the sense of being commonly used. So, for example, in the case of a word generated by WordFind, praiseworthy, the meaning seems clear, and both elements (praise and worthy) are part of the Defining Vocabulary. However, intuitively and corpus-empirically, this word is not at all common and, furthermore, is probably marked for formality. For this reason, and to err on the safe side, I made conservative subjective judgments, leaving in only the most uncontroversial, core-like compounds (e.g., rainstorm and salesroom).

Another issue faced was that of multiple affixation. Should words formed using more than one affix (derivational or inflectional) be considered part of the core vocabulary list? For example, from the LDV word joy, we can derive joyful (which is presumably still a core word) but also joyfulness (formed by adding a further suffix, -ness). As there was no stated LDOCE policy on words having more than one affix, a conservative decision was made to allow only words that have one derivational affix added to the base word; inflectional affixes I have allowed to be applied twice where appropriate.

A further issue had to do with regional and variant spellings. A quick check in the BNC revealed (none too surprisingly) that both British and American spellings of many words were to be found in this British English corpus (e.g., honor, humor, behavior, labor, neighbor, counselor). Thus, both sets of spellings needed to be listed explicitly for counting purposes.

After careful consideration, I also decided to add all known comparative and superlative forms that are derived regularly and in a completely predictable way (e.g., sad, sadder, saddest). Only the base form sad may in fact appear in vocabulary and frequency lists, but my operational procedure was for core words, not the most frequent words in the language, and since the rules for comparative and superlative formation are completely predictable (to the extent that even children can overgeneralize/overuse the rule, as in best, *bestest), I decided that these forms belonged in the list.6

In the process of expanding the LDV, I also had the chance to scrutinize the choice or selection of words in the list.
For example, the following (all listed as main entries in the LDV) do not really seem to be particularly core: adverb, adjective (as pointed out earlier, these, along with noun and verb, are certainly useful for dictionary definitions, but are they core?); admittance, furnish, nobleman, nylon, provisions, spite (perhaps this last was included only to allow the phrase in spite of in definitions, excluding the perhaps less core meaning associated with spite used as a noun?); infectious (why not contagious as well?); and spacecraft (was this chosen because it is more general or generic than spaceship and hence more useful for dictionary definitions?). A check using the two-million-word BNC Sampler failed to find, for example, the words adverb and adjective. We thus see a clash here between Longman's need to include words that are useful for defining purposes and their stated aim to include only the most frequent and commonly understood words.7

Political correctness also seemed to play a role in the LDV: one of the included words was chairperson. This term may well be more gender neutral, but one wonders if it is really more common or core than chairman. If such a political correctness policy was being followed in the selection of words for the LDV, it should perhaps have been stated explicitly. Ironically, however, prince is part of the list, but not princess, and if one looks up princess in the LDOCE, one of the definitions given is "wife of a prince"! (This recalls questions in componential semantics of whether it makes sense to define entities as [-male] rather than [+female].)

Using Template Tagger (TT) to Count Core Vocabulary Usage


Once the issues raised in the above section were sorted out (and an arbitrary decision made in some cases), the next step was to choose a suitable way to count core vocabulary. To efficiently measure the use of core vocabulary in a number of individual texts, or groups of texts (e.g., genres), it was necessary to employ a method that would work more or less automatically, due to the sheer volume of texts involved, as well as the unwieldy size of the vocabulary list: each of the 2,000+ words in the vocabulary list had to be checked, along with part-of-speech specifications, against each of the 391 files of the BNC:S and the 658 files of the BNC:SW. The tool I found most suitable for the job was Template Tagger (TT), a powerful and flexible tool for processing/postprocessing tagged corpora developed at Lancaster.8 It has been applied to tasks ranging from automatic postediting of tagged corpora to disambiguation, lemmatization, and even parsing.9 TT essentially allows the user to specify a pattern or template of words/phrases, parts of speech, semantic tags, discontinuous dependencies, and so on to some degree of abstraction (using regular expressions or wild cards) and then have an action performed on all or part of the matched stretch of text (e.g., to replace one part of speech with another in automatic tag corrections). Different levels of patterns can be combined (e.g., a word can be specified to be matched only when its part of speech is a verb). The action stage can be left out, in which case the number of matches or hits for each template rule is merely noted and reported when the program terminates, in the form of a count file that lists the results. For my purposes, this rule-counting feature10 of TT was used to provide a convenient way of computing the number of occurrences of core words (my expanded vocabulary list) for each text in the corpus.
Initially, each vocabulary item to be counted was treated as a separate rule (technically, it is the number of times a particular rule is fired that is counted; a rule can correspond to a word, a list of words, a phrase, a syntactic sequence, etc.). However, for reasons of conceptual clarity and economy, words or word + tag pairs on the list were later grouped together to form very rough families or types (some, but not all, forming virtual lemma groups). This step was taken only as a rough-and-ready way to group related words as closely together as possible to facilitate counting and evaluation of certain results: it has no further theoretical implications, and the groups are not word families in the sense of Bauer and Nation (1993) or Nation and Waring (1997).


To illustrate how my counts were done, here is a sample taken from the vocabulary rule file used by TT for this study:

/* wound/wounds
<RULE>
  <NAME> rule1
  <CELL>
    <PATTERN>
      <LEVEL> WRD
      <POSITION> 1
      <VALUE> wound/wounds
    </PATTERN>
  </CELL>
</RULE>

/* wounded/wounding [V*]
<RULE>
  <NAME> rule2
  <CELL>
    <PATTERN>
      <LEVEL> WRD
      <POSITION> 1
      <VALUE> wounded/wounding
      <LEVEL> POS
      <POSITION> 1
      <VALUE> V*
    </PATTERN>
  </CELL>
</RULE>
N.B.: WRD stands for word(s) and POS for part(s) of speech.

The above are the fully specified rules that the program uses, and they may seem slightly complex and tedious to type. A translator program exists that allows the user to input rules in a greatly simplified format, automatically expanding them to the full form shown above. As can be seen, lexeme groups (i.e., groupings of words sharing the same stem or, in some cases, root) may be split between different rules and thus counted separately. This is unavoidable, as the necessity of specifying part of speech for some words and not others meant that not all related words could be put under the same TT rule. The forms wound and wounds had no part-of-speech restriction in the LDV, so occurrences of these forms as verbs or nouns are all to be counted, whereas the inflected forms wounded and wounding (which were not explicitly specified in the original Longman list but were inserted following regular inflectional rules) are, in accordance with principles set out in earlier sections, to be counted only if they are used as verbs (thus the adjectival and noun uses of wounded would be excluded).

Once TT has finished generating count files for the individual texts in the corpus, the counts can then be conflated and the usage of core vocabulary for each larger text category (whether domain or genre) can be worked out. In the next section, I present the results and discuss some of the additional counts and steps that it was necessary to undertake before an actual core vocabulary index for each text category could be computed accurately.
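Neither TT itself nor its count files are reproduced here, but the overall counting-and-conflating logic can be sketched. The code below is a hypothetical illustration (the rule set, token streams, and figures are invented, and this is not the actual TT implementation): each rule pairs a word list with an optional wildcard part-of-speech restriction, rules are fired over tagged text, and per-text hits are then summed across a category.

```python
from fnmatch import fnmatch

# Hypothetical mini rule set in the spirit of rule1/rule2 above:
# a set of word forms plus an optional POS pattern ("V*" = any verb tag).
RULES = [
    ({"wound", "wounds"}, None),      # any part of speech
    ({"wounded", "wounding"}, "V*"),  # verbal uses only
]

def core_hits(tagged_tokens):
    """Count tokens matched by any rule in a list of (word, POS) pairs."""
    hits = 0
    for word, pos in tagged_tokens:
        for words, pattern in RULES:
            if word.lower() in words and (pattern is None or fnmatch(pos, pattern)):
                hits += 1
                break  # a token fires at most one rule in this sketch
    return hits

# Invented per-text token streams (POS tags are CLAWS-style).
texts = {
    "text_A": [("wound", "NN1"), ("wounded", "AJ0"), ("wounding", "VVG")],
    "text_B": [("wounds", "VVZ"), ("the", "AT0")],
}

# Conflation step: sum hits and tokens over all texts in a category.
total_hits = sum(core_hits(toks) for toks in texts.values())
total_tokens = sum(len(toks) for toks in texts.values())
print(total_hits, total_tokens)  # 3 5
```

Note how the adjectival use of wounded (tagged AJ0) is not counted, mirroring the part-of-speech restriction in rule2.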

Some Issues to Do with Word Counts


After a count for each text file was obtained, several other features had to be counted before an accurate picture of core vocabulary usage could emerge. These counts were done separately from the core vocabulary count and then added to or subtracted from the appropriate totals (see Table 1 later in the text). These steps are discussed in the following sections.

The Unclassified Words (FU Tags)


The BNC was tagged using the CLAWS automatic tagger,11 and one of the CLAWS tags allows us to discard from our count those words that were nonanalyzable or unrecognized.12 Such items are given the tag FU. This is especially a problem in the spoken part of the corpus, as many words were half-words or words that were begun and then dropped halfway as the speaker reformulated his or her words. Hesitation fillers like er and erm also come under this category (see Stenström 1990 for a discussion of their discourse functions). The following are some concordanced examples (the items in question are within double-angled brackets):

er (10,126 cases):
What wi So when do they come back? Haley and Tuesday. <<Er>>, Haley, yeah. Haley in the end Got actually, flying to A

th (442 cases):
Mm? He likes the drills. Oh cutie! Oh! Ah! Oh! <<Th>>, the trouble with Felix is when you're doing something he

i (394 cases):
I wouldn't go down it. You and I went. Did we? Yeah, <<i>>, it wasn't a working mine, it was an old mine and they We

A problematic case, however, is that of the word ain't. Because the ai part of the word can stand for various forms of the verb to be (e.g., ain't = am not/is not/are not), CLAWS treats ai as not analyzable. It is not part of the LDV list, even though it is merely a variant of the copula verb (once again, the written bias of the vocabulary list shows itself). However, a count showed that there were only 252 occurrences of ain't in the spoken BNC:S and 8 occurrences in the written half, so not including ain't in my count of core vocabulary did not affect my figures to any significant degree.

The Interjections (UH Tags)


In the spoken texts, there are a lot of words that some linguists might argue are not really lexical items, because they serve as discourse markers rather than carrying any real lexical content. These are the so-called interjections, tagged UH in the corpus. Perhaps controversially, I made a theoretical decision to include all of these as core words, because I feel that the frequency, nature, and importance of these words qualify them for inclusion as part of the core vocabulary of English. Traditional grammar has very little to say about them, in part due to the bias toward written language, and core vocabulary lists largely ignore them too. The Longman list, for example, appears to have a literate bias in the sense that while it includes lexical items such as owing to and admittance, only the canonical, standard spellings of words like yes and no are included, with the very common variants yeah, yup, nah, na, and so on left out. These variants are all tagged as UH in the BNC, so it made sense to count all the UHs rather than having to specify all the spelling variants.13 The following list shows the top ten words tagged as UH in the BNC:S:

[1] yeah 9,087 (27.69 percent)
[2] oh 6,088 (18.55 percent)
[3] no 4,807 (14.65 percent)
[4] yes 4,260 (12.98 percent)
[5] mm 3,697 (11.27 percent)
[6] ah 911 (2.78 percent)
[7] mhm 637 (1.94 percent)
[8] ooh 486 (1.48 percent)
[9] aye 366 (1.12 percent)
[10] eh 275 (0.84 percent)

As can be seen, the first ten types of UH words account for 93.3 percent of all the cases, so other researchers can judge for themselves whether to admit UH words as a class into any core vocabulary list. The alternative, of course, would be to count just yes and no separately since these are LDV words. UH words as a class can then be either included in or subtracted from the total number of words in a text.
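The shares above are simple proportions of the overall UH token count. A small sketch (note: the overall total of c. 32,812 UH tokens in the BNC:S is not stated in the text; I have back-derived it here from the listed counts and percentages, so treat it as an assumption):

```python
# Top-ten UH (interjection) counts in the BNC:S, from the list above.
UH_COUNTS = {"yeah": 9087, "oh": 6088, "no": 4807, "yes": 4260, "mm": 3697,
             "ah": 911, "mhm": 637, "ooh": 486, "aye": 366, "eh": 275}

# Assumed total of all UH tokens in the BNC:S (back-derived, not from the text).
TOTAL_UH = 32812

def uh_share(word):
    """Percentage share of one UH type among all UH tokens."""
    return 100.0 * UH_COUNTS[word] / TOTAL_UH

top_ten_share = 100.0 * sum(UH_COUNTS.values()) / TOTAL_UH
print(round(uh_share("yeah"), 2))  # 27.69
print(round(top_ten_share, 1))     # 93.3
```

The cumulative figure of 93.3 percent for the top ten types matches the one reported in the text.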


The Multiwords (Ditto Tags)


A very important point to make about my count of core vocabulary is that I made the theoretical decision not to count core items where they appear as part of a multiword phrase or expression (hereafter called multiwords). Such expressions are tagged in the BNC with ditto tags; for example, all of a sudden is tagged as follows:

all_RR41 of_RR42 a_RR43 sudden_RR44

In the above example, all the individual words (sudden, a, of, and all) in the expression are separate entries in my core vocabulary list. However, because this is a phrase that is learned as a whole and functions as a whole, I ensured that the separate counts of the words all, of, a, and sudden did not each increment by one whenever such a multiword was encountered in the texts. The TT rules for all affected items in my vocabulary list were given the added restriction that the part-of-speech tag may not end with a double numeral, thus ensuring that no multiwords were counted (except where they are specifically being searched for, as in according to and upside down).

A further explanation is needed in relation to Table 1. Having excluded counts of words used in multiword expressions as core usages, I had to be consistent and count such multiword expressions as single units. The total number of all words in the corpus texts (see the column Raw Word Count in Table 1) was computed using an algorithm that counts each part of a multiword expression separately (thus, all of a sudden would be counted as four words). Therefore, to get a total word count that would count multiword expressions as single words, I had to first subtract the counts of the individual parts of the multiwords (which I call multiparts in the table) and then add the counts of multiwords as single units (see the column labeled + Multiwords). The net total now provides a more accurate basis for calculations.
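The adjustments described here and in the FU section combine into simple arithmetic. A sketch using the Demographic B row of Table 1 (the figures are from the table; the function names are mine):

```python
def net_word_total(raw, fu, multiparts, multiwords):
    # Discard unclassified (FU) items, subtract the individual parts of
    # multiword expressions, then add each multiword back as one unit.
    return raw - fu - multiparts + multiwords

def core_index(core_count, uh_count, net_total):
    # Core words plus interjections (UH), as a percentage of the net total.
    return 100.0 * (core_count + uh_count) / net_total

# Demographic B row of Table 1:
net = net_word_total(raw=114489, fu=2500, multiparts=1617, multiwords=773)
print(net)                                     # 111145
print(round(core_index(97959, 5136, net), 2))  # 92.76
```

The result, 92.76 percent, reproduces the figure given for Demographic B in Table 1.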

Results and Discussion: Frequency of Core Vocabulary in English


The spoken component of the sampler (following the composition of the full BNC) is divided into a demographic component (representatively sampled according to socioeconomic classes) and a context-governed (c-g) component (sampled according to various contextual situations), in the ratio of 6:4. This was done to get as balanced a corpus as possible.14 For the demographic component, the conventional market-survey type of classification system was adopted to sample the British population. It is worth noting that these class codes as applied to BNC files refer only to the social class of the respondent (i.e., the person asked to do the recording): it is respondents who were demographically sampled, not their interlocutors (since there was no way to predict whom the respondents were going to speak with and thus sample them as well). Respondents can and do speak to people from other socioeconomic classes, and most BNC spoken files are composites of several conversations concatenated together. So a file that comes under the AB category, for instance, cannot be assumed to consist of only AB participants or to represent any notional AB-class type of language.

Results for the Sampler BNC


With the above facts in mind, consider Table 1, which shows the percentage of core vocabulary usage in each broad category of the sampler BNC, arranged in descending order. The average difference between spoken and written text categories was thus 15.62 percent (90.52 vs. 74.90 percent). There is a difference of only about 2 percent between the lowest score for a spoken genre (85.6 percent, context-governed business) and the highest score for a written genre (83.7 percent, fiction). On the other hand, the largest difference is between the demographic texts as a whole (92.03 percent) and people writing about pure science (65.62 percent), a difference of 26.41 percent.

It is interesting that among the written domains, the imaginative texts (fiction, poetry, and drama) are the most demotic, in that they use mostly core words: 80 to 83 percent of all words in imaginative texts are core in nature. This suggests that the creativity and imaginativeness typically associated with such writing have not so much to do with the type or level of vocabulary as with the way core words are used and combined (among themselves or with the other 17 to 20 percent of words), that is, with aspects of grammar and style. The science texts (pure and applied) quite predictably come at the bottom of the table, showing that they typically contain the lowest proportions of core words (and therefore, conversely, the greatest proportions of more difficult or unusual words or technical vocabulary). Among the spoken texts, there is a separation of the context-governed (i.e., more formal) material from the demographically sampled material (i.e., spontaneous conversations), although the difference lies along a very gentle cline, since there is only a 7 percent difference overall between the highest and the lowest spoken domain scores.
This shows that all spoken language, no matter where it occurs and under what circumstances, is relatively homogeneous with respect to lexical richness: 85 to 92 percent of the words will, on average, be core vocabulary. Explanations for linguistic variation within spoken language, therefore, are to be sought elsewhere: for example, at the levels of grammar or style. Looking at the table closely, it can be seen that the differences among adjacent domains are really small: they differ from each other by only 1 to 3 percent, with the notable exception of the (relatively) sharp drop in core usage from 70 to 65 percent between applied science and pure science. The overall smoothness of the transitions between genres suggests that core vocabulary usage is a real cline of variation, that the linguistic resource of lexical richness is adapted to subject domains in a graduated and gradual way. This cline appears to run parallel to that of the typical spoken versus typical written cline, with spontaneous conversations at one end and highly information-heavy, specialist expository texts at the other. Along the core vocabulary cline, the only coherent division that can be made is that between spoken texts as a whole and informative (i.e., nonimaginative or nonnarrative) texts as a whole: there is a minimum difference of 8 percent separating them, so these two large groupings could be said to be quite distinctive (even if not entirely distinct). Imaginative texts interestingly come in between to bridge this gap and thus represent, in terms of vocabulary, a kind of mix between what we would typically speak about and what we would typically write about. These results, therefore, are in line with our intuitions but, importantly, empirically verify them.

TABLE 1: Results of Core Vocabulary Count on Files in the BNC Sampler Corpus
(columns: Core Count / + UH Words / = Total Core Words / Raw Word Count / − FU Words / − Multiparts / + Multiwords / = Net Total Words / % Core Vocabulary)

Spoken domains (in descending frequency):
Demographic B             97,959   5,136   103,095   114,489   2,500   1,617     773   111,145   92.76
Demographic C2           130,269   7,887   138,156   152,481   2,241   2,249   1,067   149,058   92.69
Demographic C1            76,646   4,723    81,369    89,990   1,450   1,157     557    87,940   92.53
Demographic D             55,731   3,411    59,142    65,918   1,154     839     400    64,325   91.94
Demographic E             43,071   1,877    44,948    49,963     645     565     267    49,020   91.69
Demographic unclassified  18,630     821    19,451    21,912     520     283     134    21,243   91.56
Demographic A             38,723   2,309    41,032    46,201     791     654     312    45,068   91.04
Context-governed educational  87,011   2,567    89,578   103,577   3,450   1,825     859    99,161   90.34
Context-governed leisure 101,106   1,899   103,005   120,909   3,921   2,006     931   115,913   88.86
Context-governed public  102,366   1,319   103,685   124,592   3,625   2,359   1,049   119,657   86.65
Context-governed business 135,862  2,130   137,992   166,720   3,654   3,345   1,486   161,207   85.60
Spoken totals: UH 34,079; total core words 921,453; FU 23,951; net total 1,023,737; average across spoken domains 90.52

Written domains:
Fiction                  147,939     801   148,740   179,437     142   2,967   1,377   177,705   83.70
Poetry                    25,159     121    25,280    31,336      63     389     186    31,070   81.36
Drama                     19,063     178    19,241    24,004       8     405     195    23,786   80.89
Community and social science  20,414   5    20,419    26,633       2     628     287    26,290   77.67
Arts                      38,917      20    38,937    52,142      15     862     390    51,655   75.38
Belief and thought        31,951      51    32,002    43,880       1     458     210    43,631   73.35
Commerce and finance      66,850       8    66,858    93,082       3   1,806     800    92,073   72.61
Leisure                   96,150      34    96,184   135,026      10   1,679     782   134,119   71.72
World affairs            197,005      69   197,074   279,565       6   4,352   1,974   277,181   71.10
Applied science           83,007       7    83,014   119,155      25   2,675   1,237   117,692   70.53
Pure science              21,631       1    21,632    33,270       2     570     270    32,968   65.62
Written totals: UH 1,295; total core words 749,381; FU 277; net total 1,008,170; average across written domains 74.90

NOTE: Average for the demographic conversation texts (ignoring misleading class divisions) is 92.03 percent.

Results for the BNC Spoken and Written Corpus


The results presented above were for the sampler BNC. As mentioned earlier, the sampler texts were not classified according to genre, and many important genres (e.g., academic written prose, spoken medical consultations) were left out. As part of some other research I was carrying out, a four-million-word better version of the sampler had been compiled, and so the core vocabulary frequency routine was repeated with this corpus (hereafter called the BNC Spoken and Written Corpus or BNC:SW). Figure 1 shows the spoken and written genres of this corpus ranked by frequency of use of core vocabulary (the full forms of the abbreviated genre labels are in Table 2). It will be observed that, again, there is a very gentle gradation between the text categories, with the categories being genres this time instead of domains or other broad categories. The slope in the graph is very gentle, with no really sharp breaks between genres. Also note that, with few exceptions, the spoken genres are in the top half while the written genres occupy the bottom half. The exceptions are again easily accounted for: in the top half are the written genres in which the representation of speech features largely (namely, the prose fiction written for teens and children); in the bottom half, the exceptions are the spoken genres in which texts are prepared (or sometimes written) in advance (namely, news broadcasts and some lectures). However, not everything is accounted for by the spoken-written distinction. It is clear that the reason for some of the written genres appearing at the bottom end of the scale is that they typically contain lots of technical or specialized vocabulary or proper names: business, sports, technology, natural sciences, and medicine.


Figure 1: Cline of Core Vocabulary Usage across Genres in BNC Spoken and Written Corpus. [Bar chart ranking every spoken and written genre by percentage of core vocabulary, in descending order from consultation (c. 91 percent) down to W_acad_medicine (c. 66 percent); the exact figures are listed in Table 2.]

Armed with these detailed tables, we no longer need to generalize about spoken English and written English but can talk in terms of genres of texts. The clines shown in this article show us that frequency of use of core vocabulary items is one way of conceptualizing the continuum of difference between spoken and written language, in terms of genres or other text categories.

TABLE 2: Percentage of Core Vocabulary Usage (descending order) in BNC Spoken and Written Corpus Genres

consultation 90.645
demonstration 90.322
conversation 90.088
oral history interviews 89.881
job interviews 89.487
sermons 88.628
classroom (school) 88.066
meetings 87.777
phone conversations 87.628
TV and radio discussion 87.518
legal cross-exam 85.444
lecture: social sciences 85.439
lecture: technology 85.037
tutorials (university) 84.941
lecture: humanities 84.939
W_fiction_prose_teen 83.788
debate 83.284
sports live 83.253
W_fiction_prose_child 82.555
legal presentation 82.529
parliament 82.425
lecture: politics, law, education 82.144
TV documentaries 81.641
speeches: planned 81.546
W_essays: school 81.033
W_fiction_prose_adult 80.814
W_letters: personal 79.538
W_fiction_drama 79.472
W_hansard 79.003
W_religion 78.345
news broadcast 78.286
W_biography 78.283
W_broadsheet_society_news 77.215
W_nonacademic_social science 77.139
W_nonacademic_natural sciences 76.825
W_news_scripts 76.315
W_broadsheet_personal_editorials 76.297
W_fiction_poetry 75.725
lecture: natural sciences 75.346
W_nonacademic_politics, law, education 74.995
W_instructional prose 74.776
W_essays: university 74.207
W_email 74.17
lecture: medicine 73.888
W_academic_politics, law, education 73.778
W_commerce and finance 73.778
W_broadsheet_editorials 73.523
W_letters: professional 73.402
W_broadsheet_letters to the editor 73.334
W_academic_social science 73.209
W_nonacademic_medicine 73.039
W_tabloids 72.847
W_academic_humanities 72.822
W_popular lore 72.819
W_advertisements 72.133
W_administrative prose 71.614
W_nonacademic_humanities 71.247
W_broadsheet_home_news 71.221
W_broadsheet_reviews 70.931
W_official_documents 70.9
W_academic_technology 70.601
W_broadsheet_business 70.001
W_broadsheet_sports 69.656
W_nonacademic_technology 68.079
W_academic_natural sciences 67.76
W_academic_medicine 65.548


Core Types: Drawing from the Core Resource


Another way of looking at core vocabulary is to see what percentage of the different vocabulary types in the list is used in the different text categories. Types here refers to the rough word groupings in the Template Tagger rule files used in the analysis. Table 3 shows the percentage of core vocabulary types used in 20,000-word samples15 of each spoken and written domain category in the sampler BNC. It will be observed that, apart from the pure science and context-governed: leisure domains, there is again a rough division between speech and writing (of about 7 percent on average), with a gradation in the way the different text categories draw from the resource of core vocabulary. Written domains, on average, almost universally use more different core types, indicating a more varied core vocabulary. This seems to suggest that the well-known phenomenon of a higher type-token ratio for written language actually starts with core vocabulary and does not just involve noncore, difficult, or unusual words.
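The third column of Table 3 is simply the second column expressed against the full inventory of core types. A quick sketch with figures from the table (the function name is mine):

```python
TOTAL_CORE_TYPES = 2866  # total core types in the expanded list (Table 3)

def type_coverage(types_used):
    """Percentage of all core vocabulary types drawn on by a sample."""
    return 100.0 * types_used / TOTAL_CORE_TYPES

# Figures from Table 3: Poetry draws on the most core types, the
# Demographic unclassified texts on the fewest, in 20,000-word samples.
print(round(type_coverage(1528), 2))  # 53.31 (Poetry)
print(round(type_coverage(867), 2))   # 30.25 (Demographic unclassified)
```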

Type-Token Ratios versus Core Vocabulary Index


The chief goal of this study was to see if and how speech and writing differ with respect to the coreness of their word choices. As mentioned earlier, this is a rather different linguistic feature from the more familiar and more easily computed ratio of types against tokens. A type-token ratio for spoken versus written texts can only measure how varied the words are in the respective texts while saying nothing about what kinds of words they are (e.g., common or learned, abstruse or vulgar). Theoretically (though improbably), two texts may share no words in common at all, even if they have the same type-token ratio.16 With a core vocabulary index, however, words are checked against a prepared list, compiled according to predetermined criteria, so that any two texts or genres can be measured against a common yardstick: namely, the coreness of the lexis. However, it is interesting to look at the two different ratios together and speculate about how they might possibly be related. The results I obtained are given in Table 4. From the second column of the table, it can be seen that written texts exhibit a much more varied use of words (more than twice as varied). Looking at the third column, however, we can further say that not only is the vocabulary of written texts more varied but, among the words used, more are drawn from outside the common stock of words that we can consider core. We have also seen in the previous section that even within the domain of core vocabulary, written texts draw more types from the stock of core words available. Spoken texts (and the prototypical casual conversation texts in particular) can thus be said to vary very little in terms of vocabulary, tending to use a fairly limited range of vocabulary, most of which comes from a restricted selection from the common core of the language.

TABLE 3: Percentage of Core Vocabulary Types Used in 20,000-Word Samples of Spoken and Written Genres
(columns: Number of Core Types Used in 20,000-Word Samples / Percentage of Total Core Types, N = 2,866)

Written domains:
Poetry 1,528 53.31
Leisure 1,436 50.10
Community and social science 1,394 48.64
Arts 1,386 48.36
World affairs 1,385 48.33
Fiction 1,293 45.12
Commerce and finance 1,214 42.36
Drama 1,162 40.54
Belief and thought 1,154 40.27
Applied science 1,149 40.09
Pure science 941 32.83

Spoken domains:
Context-governed: leisure 1,241 43.30
Context-governed: public 1,070 37.33
Context-governed: business 1,053 36.74
Demographic B 1,000 34.89
Demographic A 999 34.86
Demographic C1 971 33.88
Context-governed: educational 964 33.64
Demographic C2 937 32.69
Demographic E 934 32.59
Demographic D 922 32.17
Demographic unclassified 867 30.25

Average across written domains: 41.56
Average across spoken domains: 34.76

NOTE: Average for the demographic conversation texts (ignoring the class divisions) is 33 percent.
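The distinction drawn here between the two measures can be made concrete. In the toy example below (the texts and the three-word "core list" are invented, purely for illustration), the two texts share no words at all and have identical type-token ratios, yet their core vocabulary indexes are maximally different:

```python
CORE = {"the", "cat", "sat"}  # a toy stand-in for the 2,000+ word LDV-based list

def type_token_ratio(tokens):
    """How varied the words are: distinct types over running tokens."""
    return len(set(tokens)) / len(tokens)

def core_vocab_index(tokens):
    """What kind of words they are: percentage found in the core list."""
    return 100.0 * sum(t in CORE for t in tokens) / len(tokens)

text_a = "the cat sat".split()                        # all core words
text_b = "ontology precludes verisimilitude".split()  # no core words

print(type_token_ratio(text_a), type_token_ratio(text_b))  # 1.0 1.0
print(core_vocab_index(text_a), core_vocab_index(text_b))  # 100.0 0.0
```

The type-token ratio cannot tell these two texts apart; the core vocabulary index places them at opposite ends of the scale.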

Problems and Shortcomings of the Approach Taken


TABLE 4: Type-Token Ratio and Core Vocabulary Ratio
(columns: Category / Vocabulary Variability (a) / Vocabulary Coreness, percentage)

Spoken: Types/Tokens = 19,013/889,030 = 0.02; coreness 90.52
Written: Types/Tokens = 43,993/909,816 = 0.05; coreness 74.90

a. The ratios are for the equal-sized spoken and written components of the sampler BNC, so no adjustment for length was considered necessary for these rough figures.

Some brief points concerning certain shortcomings of the present study will be mentioned here. First, no automatic semantic disambiguation was possible, so the problems of polysemy and homography could not be eliminated. In other words, this study was concerned mainly with frequencies of word forms, not word senses. Second, the complex issues to do with collocations and fixed expressions have been ignored. One problem with vocabulary lists and counts of the total number of words in a text is that words enter into collocational relationships with one another, so that some words regularly prompt or even demand the use of certain other words, or prime them and their meanings. Pawley and Syder (1983, 215), for example, talk about "[lexicalized] sentence stems" that make up "by far the largest part of the English speaker's lexicon." These may be considered prefabricated chunks of language, which are by definition more than one word. (Nattinger and DeCarrico [1989], Eeg-Olofsson and Altenberg [1996], Aijmer [1996], Moon [1998], Hudson [1998], and Biber and Conrad [1999] are just some of the recent works dealing with collocations, idioms, and multiword expressions.) Should any core vocabulary therefore contain within itself at least the most common multiword expressions? And how should we treat idioms and idiomatic expressions? Should a list of core vocabulary consist of just isolated words, or should it include collocating pairs, phrases, and prefabs that are core as well? If so, do we still divide our vocabulary counts by the total number of running words in a text? Or, all things considered, are vocabulary lists of isolated words still a valid and practical compromise? Renouf (1992) shows that among the 150 most frequent words in English (all of which may be considered core and are in the LDV), there are significant prefabricated chunks. Many such expressions could be counted as wholes rather than as individual words, and this would obviously affect counts and statistical results. More work, therefore, remains to be done in this area, and these are issues that should be discussed.

Conclusion
I am conscious that in this study I have used and operationalized just one word list and one approach to core vocabulary (the dictionary defining vocabulary approach), and all the attendant problems and caveats pointed out need to be borne in mind when interpreting the results. However, I hope that the suggestive findings have vindicated my decision and have shown that the approach taken here can serve to provide a rough-and-ready core vocabulary index that can be used in stylistic analysis and has a variety of applications in language teaching or even natural language processing. In terms of the spoken versus written language divide that is commonly referred to by linguists (more as a shorthand than as a sincere belief, one hopes), I believe this study has presented more evidence for this division being more of a continuum than a dichotomy, providing yet another cline or dimension of variation along which to compare texts. Such a measurement can possibly find linguistic applications in the automatic identification or analysis of genres or, if aided by a part-of-speech tagger, even in computer style-checkers (which at present use rather crude heuristics17).

Notes
1. The British National Corpus (BNC) is a machine-readable collection of contemporary British English, both written and spoken, totaling c.100 million words, sampled from a comprehensive range of genres and speech situations and representing speakers from all social classes. A full description of the aims and goals of the BNC project and the design and format of the corpus can be found in the BNC Users' Reference Guide (Burnard 1995), an electronic version of which is at http://info.ox.ac.uk/bnc.

2. The difference from the 1978 list was that the new list was updated with reference to more recent frequency information (LDOCE 1987, F8-9).

3. This subset of texts was sampled from the whole BNC and is intended to represent a 2 percent "taster" of the full corpus. The BNC Sampler Corpus is equally balanced between written and spoken language, unlike the BNC as a whole, where the proportion is 9:1, respectively. Also, the sampler corpus has been manually postedited for tagging errors.

4. Some telephone conversation files were taken from the ICE-GB corpus to fill a gap in the BNC. Also, some details of the genre scheme described in my forthcoming article do not apply to the BNC Spoken and Written Corpus (BNC:SW) used in the present study. However, these differences are minor. Readers interested in the details of the composition of the BNC:SW may write to the author.

5. This old but useful software was included with version 3 of Microsoft Word (for DOS).

6. Other manual additions: after some consideration, I decided to add the following words to the word list, as they are very common words with clear meanings: anybody, anytime, and anyway (the LDV already had anyhow, anyone, anything, and anywhere, so it was puzzling that these were left out). The days of the week (Monday-Sunday) were also added, as were the words kilobyte/kilobytes, since I initially felt these to be common enough words for the 1990s. However, I acknowledge that this was a methodological mistake, as one cannot be adding words to the list at will; these will be removed from the word list in future implementations. I also accepted appointment as a possible formation through appoint + ment, though it is doubtful whether it should in fact be left there.

7. In addition, one could question how many nonlinguists (whether native speakers or not) really know what an adverb is.

8. This is a C program originally developed for the ESRC project Lancaster Database of Linguistic Corpora and later used in the BNC Tagging Enhancement Project. A fuller description of Template Tagger (TT) and a report on its application to postediting the BNC can be found in Fligelstone et al. (1996). I used TT for this study because, surprisingly, there were no other available tools or programs that could take a list of words + tags and search it against a corpus, generating a report at the end. The utility fgrep can do this if the input files are first specially formatted, but even this method was not flexible enough for my purposes. Paul Nation's VocabProfile program is fine if the input is changed to word_tag format, but it does not deal with multiwords in the way I do in the present study. Nation's program is at http://www.vuw.ac.nz/lals/staff/paul_nation/index.html.

9. TT currently accepts tagged corpus files in CLAWS and JAWS vertical (database) formats as input, but in theory it is extensible to other formats through the writing of suitable input-output routines. See Garside, Leech, and Sampson (1987) for more information on CLAWS and Fligelstone (1996) on JAWS.

10. TT is also in the process of being adapted for use directly with SGML marked-up text. I gratefully acknowledge the generosity of Mike Pacey in implementing the counting feature specifically for the purpose of this study and for tweaking the TT program code in many other ways to meet some of my research needs.

11. Constituent-likelihood automatic word-tagging system.

12.
That is, words that were not in the system's lexicon and whose word class could not be guessed by algorithms using punctuation and contextual cues.

13. In the case of yes and no, I had to avoid double counting, since they would be included under my separate count of UH words, so I restricted occurrences of yes/yeah, no/nah, and so on to usages where they are not functioning as interjections (which is very rare: e.g., "Is that a yes or a no?" where yes and no are tagged as nouns).

14. As Crowdy (1995, 225) explains, "Many types of spoken text are produced only rarely in comparison with the total output of all speech producers: for example, broadcast interviews, lectures, legal proceedings, and other texts produced in situations where, broadly speaking, there are few producers and many receivers. A corpus constituted solely on the demographic model would thus omit important spoken text types." Consequently, the demographic component of the corpus was


complemented with a separate text typology intended to cover the full range of linguistic variation found in spoken language; this is termed the context-governed part of the corpus.

15. Samples were taken from different files within each genre to make up 20,000-word composite texts. This is because when counting types, the number of tokens/words needs to be held constant: the type-token ratio measures a phenomenon that does not vary in a mathematically linear fashion and thus has to be computed for texts truncated to equal lengths.

16. For a discussion of the different measures of vocabulary, see Schmitt and McCarthy (1997, 314) and Nation (1995). A recent mathematical reformulation of the type-token ratio is presented in McKee, Malvern, and Richards (2000).

17. For example, (1) Flesch Reading Ease/Flesch-Kincaid Grade Level: readability based on the average number of syllables per word and the average number of words per sentence; (2) Coleman-Liau Grade Level/Bormuth Grade Level: both use word length in characters and sentence length in words to determine a grade level.
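The two Flesch measures mentioned in note 17 can be sketched as follows. The constants are the standard published Flesch formulas; the vowel-group syllable counter, however, is only a rough approximation of the kind a style-checker's "crude heuristics" might use, not part of any official specification.

```python
import re

def count_syllables(word):
    # Naive approximation: count vowel-letter groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for `text`."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # average words per sentence
    spw = syllables / len(words)   # average syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

print(flesch_scores("The cat sat."))  # short, monosyllabic -> very "easy" text
```

Both measures rely purely on word and sentence length; a core-vocabulary proportion of the kind explored in this study would add a lexical dimension such heuristics currently lack.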

References
Aijmer, Karin. 1996. Conversational Routines in English: Convention and Creativity. London: Addison Wesley Longman.
Bauer, Laurie, and I.S.P. Nation. 1993. Word Families. International Journal of Lexicography 6 (3): 1-27.
Berlin, Brent, and Paul Kay. 1969. Basic Color Terms: Their Universality and Evolution. Berkeley: University of California Press.
Biber, Douglas. 1988. Variation across Speech and Writing. Cambridge, UK: Cambridge University Press.
Biber, Douglas, and Susan Conrad. 1999. Lexical Bundles in Conversation and Academic Prose. In Out of Corpora: Studies in Honour of Stig Johansson, edited by Hilde Hasselgård and Signe Oksefjell, 181-90. Amsterdam: Rodopi.
Burnard, Lou, ed. 1995. The British National Corpus Users' Guide (SGML version, dated 25 April 1995, first release with version 1.0 of BNC). Oxford, UK: Oxford University Computing Services.
Carroll, John, Peter Davies, and Barry Richman, eds. 1971. Word Frequency Book. New York: American Heritage.
Carter, Ronald. 1987. Vocabulary. London: Allen & Unwin.
Carter, Ronald, and Michael McCarthy. 1988. Vocabulary and Language Teaching. London: Longman.
Chafe, Wallace. 1986. Writing in the Perspective of Speaking. In Studying Writing, edited by Charles Cooper and Sidney Greenbaum, 12-39. London: Sage.


Crowdy, Steve. 1995. The BNC Spoken Corpus. In Spoken English on Computer: Transcription, Mark-up and Application, edited by Geoffrey Leech, Greg Myers, and Jenny Thomas, 224-34. Harlow, UK: Longman.
Eeg-Olofsson, Mats, and Bengt Altenberg. 1996. Recurrent Word Combinations in the London-Lund Corpus: Coverage and Use for Word-Class Tagging. In Synchronic Corpus Linguistics (Papers from the 16th International Conference on English Language Research on Computerized Corpora (ICAME 16)), edited by Carol Percy, Charles Meyer, and Ian Lancashire, 109-20. Amsterdam: Rodopi.
Fligelstone, Steve. 1996. JAWS: Using Lemmatisation Rules and Contextual Disambiguation Rules to Enhance CLAWS Output. Unpublished project report, Department of Linguistics, Lancaster University.
Garside, Roger, Geoffrey Leech, and Geoffrey Sampson, eds. 1987. The Computational Analysis of English: A Corpus-Based Approach. London: Longman.
Harley, Trevor. 1995. The Psychology of Language: From Data to Theory. Hove: Psychology Press.
Hornby, A. S., ed. 1995. Oxford Advanced Learner's Dictionary. 5th ed. Oxford, UK: Oxford University Press.
Hudson, Jean. 1998. Perspectives on Fixedness: Applied and Theoretical. Lund, Sweden: Lund University Press.
Kennedy, Graeme D. 1998. An Introduction to Corpus Linguistics. London: Longman.
Lee, David Y. W. Forthcoming. Genres, Registers, Text Types, Domains and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle. Language Learning & Technology.
Longman Dictionary of Contemporary English (LDOCE). 1987. 2d ed. Harlow, UK: Longman.
Longman Dictionary of Contemporary English (LDOCE). 1995. 3d ed. Harlow, UK: Longman.
McKee, Gerald, David Malvern, and Brian Richards. 2000. Measuring Vocabulary Diversity Using Dedicated Software. Literary and Linguistic Computing 15 (3): 323-38.
Moon, Rosamund. 1998. Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford, UK: Clarendon.
Nation, I.S.P. 1995. The Word on Words: An Interview with Paul Nation. The Language Teacher 19 (2): 5-7.
Nation, Paul, and Rob Waring. 1997. Vocabulary Size, Text Coverage and Word Lists. In Vocabulary: Description, Acquisition and Pedagogy, edited by Norbert Schmitt and Mike McCarthy, 6-19. Cambridge, UK: Cambridge University Press.
Nattinger, J. R., and J. S. DeCarrico. 1989. Lexical Phrases and Language Teaching. Oxford, UK: Oxford University Press.
Ogden, Charles Kay. 1968. Basic English: International Second Language. New York: Harcourt, Brace and World.


Pawley, Andrew, and Frances Hodgetts Syder. 1983. Two Puzzles for Linguistic Theory: Nativelike Selection and Nativelike Fluency. In Language and Communication, edited by J. C. Richards and R. W. Schmidt, 191-226. London: Longman.
Peyawary, Ahmad S. 1999. The Core Vocabulary of International English: A Corpus Approach. Bergen: HIT-senterets publikasjonsserie Nr. 2/99.
Renouf, Antoinette. 1992. What Do You Think of That: A Pilot Study of the Phraseology of the Core Words of English. In New Directions in English Language Corpora, edited by Gerhard Leitner, 301-17. Berlin: Mouton de Gruyter.
Rosch, Eleanor. 1973. On the Internal Structure of Perceptual and Semantic Categories. In Cognitive Development and the Acquisition of Language, edited by T. E. Moore, 111-44. New York: Academic Press.
Rosch, Eleanor, and Barbara Lloyd, eds. 1978. Cognition and Categorization. Hillsdale, NJ: Lawrence Erlbaum.
Schmitt, Norbert. 2000. Vocabulary in Language Teaching. Cambridge, UK: Cambridge University Press.
Schmitt, Norbert, and Michael McCarthy, eds. 1997. Vocabulary: Description, Acquisition and Pedagogy. Cambridge, UK: Cambridge University Press.
Sinclair, John McHardy. 1991. Corpus, Concordance, Collocation. Oxford, UK: Oxford University Press.
Stenström, Anna-Brita. 1990. Lexical Items Peculiar to Spoken Discourse. In The London-Lund Corpus of Spoken English: Description and Research, edited by Jan Svartvik, 137-75. Lund, Sweden: Lund University Press.
West, Michael. 1953. A General Service List of English Words with Semantic Frequencies and a Supplementary Word-List for the Writing of Popular Science and Technology. London: Longman, Green.
Willis, Dave. 1990. The Lexical Syllabus. London: Collins ELT.
