
Research Notes

EXTRACT FROM RESEARCH NOTES ISSUE 13 / AUGUST 2003 / PAGES 5-7

13/02

What constitutes a basic spoken vocabulary?


MICHAEL MCCARTHY AND RONALD CARTER, UNIVERSITY OF NOTTINGHAM

Introduction

In the last 20 years or so, corpus linguists have been able to offer computerised frequency counts based on written and, more recently, spoken corpora. In this article we look at frequency in the 5-million word CANCODE spoken corpus (see McCarthy 1998). CANCODE stands for Cambridge and Nottingham Corpus of Discourse in English. The corpus was established at the Department of English Studies, University of Nottingham, UK, and is funded by Cambridge University Press, with whom the sole copyright resides. We also look at the spoken element of the British National Corpus (BNC) (see Rundell 1995a and b; Leech et al 2001). The spoken BNC amounts to 10 million words, and the corpus is in the public domain. Frequency statistics from the BNC are available in Leech et al (2001). Using such resources it is possible to obtain at least some answers to the question: what vocabulary is used most frequently in day-to-day spoken interaction?

How big is a basic vocabulary?

There is no easy answer to this question, except to say that, in frequency counts, there is usually a point where frequency drops off rather sharply, from extremely high frequency, hard-working words to words that occur very infrequently. In other words, frequencies do not decline at a regular rate, but usually have a point where there is a sudden change to low frequency. This applies to both spoken and written corpora. The point where high frequency suddenly drops to low can be seen as a boundary between the core and the rest, though that point might be expected to vary a little from corpus to corpus. Figure 1 shows how frequency drops off in a 5-million word spoken sample of the BNC. The horizontal axis shows frequency bands (i.e. 15 indicates a band of words occurring 15 times in the corpus, 400 = a band of words occurring 400 times, etc.). The vertical axis shows how many words in the corpus actually occur at those bands (e.g. around 2500 words occur 100 times).

Round about 2000 words down in the frequency ratings (indicated by an arrow), the graph begins to rise very steeply, with a marked increase in the number of words that occur less than 100 times, such that almost 9000 words are occurring 15 times. Even at an occurrence level of 50, there are more than 4000 words. We can conclude that words occurring 100 times or more in the spoken corpus belong to some sort of heavy-duty core vocabulary, which amounts to about 2000 words. It is reasonable to suppose, therefore, that a round-figure pedagogical target of the first 2000 words will safely cover the everyday spoken core with some margin for error.

Figure 1: Frequency distribution: 5 million words BNC spoken

[Graph not reproduced. Vertical axis: Number of words (0 to 10,000). Horizontal axis: Occurrences (bands at 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 50, 25 and 15). Data labels on the graph, in ascending order: 680, 736, 795, 864, 926, 1017, 1142, 1281, 1499, 1887, 2491, 4028, 6398, 8877.]
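The band counts behind a graph such as Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used for the BNC or CANCODE counts: it assumes that a band of N includes every word-form occurring at least N times (the article does not say whether the bands are cumulative), the function names words_per_band and steepest_rise are hypothetical, and the sample tokens are invented toy data.

```python
from collections import Counter


def words_per_band(tokens, bands):
    """Count, for each occurrence threshold, the distinct word-forms
    that occur at least that many times in `tokens` (assumed reading
    of a 'band'; the article does not define it precisely)."""
    freq = Counter(tokens)
    return {band: sum(1 for n in freq.values() if n >= band) for band in bands}


def steepest_rise(band_counts):
    """Return the adjacent pair of bands (higher, lower) between which
    the number of word-forms grows by the largest factor."""
    bands = sorted(band_counts, reverse=True)
    best_pair, best_ratio = None, 0.0
    for higher, lower in zip(bands, bands[1:]):
        if band_counts[higher] == 0:
            continue  # no word-forms at this band, so no ratio to compare
        ratio = band_counts[lower] / band_counts[higher]
        if ratio > best_ratio:
            best_pair, best_ratio = (higher, lower), ratio
    return best_pair


# Toy data, invented for illustration only (not drawn from the BNC or CANCODE).
tokens = "the cat sat on the mat the cat ran".split()
counts = words_per_band(tokens, bands=[3, 2, 1])
print(counts)                 # {3: 1, 2: 2, 1: 6}
print(steepest_rise(counts))  # (2, 1)
```

On a real corpus the tokens would come from a tokenised transcript, and the pair of bands returned by steepest_rise would give a rough first indication of where the core ends. It compares only adjacent bands, so unevenly spaced bands (such as 50, 25 and 15) need interpreting with care.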


In the case of written data, the same phenomenon occurs (i.e. a similar shape of graph), but the number of words in the core is greater. We see a similar abrupt change from the core, high-frequency words to a huge number of low frequency items, but that change occurs at over 3000 words, not 2000. This is not surprising, since lexical density and variation is greater in written than in spoken texts.

Some observations on the spoken core

Table 1 lists the words that occur in excess of 1,000 times per million words in the BNC and in CANCODE, and thus perform heavy duty.

The BNC and CANCODE are remarkably consistent on the top 100 words, suggesting a good level of reliability for the figures. However, questions arise as to the place of many of these items in a vocabulary list. The first 100 include articles, pronouns, auxiliary verbs, demonstratives, basic conjunctions, etc. The types of meaning they convey are traditionally considered to be grammatical rather than lexical. Another problem raised by the top 100 list is that of fixed phrases, or chunks extending over more than one word. Word #31 (know) and word #78 (mean) are so frequent mainly because of their collocation with you and I, in the formulaic phrases you know, and I mean.

All in all, the top 100 BNC spoken list shows that arriving at the basic vocabulary is not just a matter of instructing the computer to list the most frequent forms, and considerable analytical work is necessary to refine the raw data. The computer does not know what a vocabulary item is. Nonetheless, the top 2000 word list is an invaluable starting point, for a good many reasons, not least because clear basic meaning categories emerge from it. Those basic categories are what the rest of this article is about. If, on the basis of general professional consensus, we exclude as a category anything up to 200 grammar/functional word-forms, the remainder of the 2000 word list falls into roughly nine types of item. These are not presented in any prioritised order, and all may be considered equally important.

Modal items

Modal items carry meanings referring to degrees of certainty or necessity. The 2000 list includes the modal verbs (can, could, will, should, etc.), but the list also contains other high frequency items carrying related meanings. These include the verbs look, seem and sound, the adjectives possible and certain and the adverbs maybe, definitely, probably and apparently. The spoken list offers compelling evidence of the ubiquity of modal items in everyday communication, beyond the well-trodden core modal verbs.

Delexical verbs

This category embraces high-frequency verbs such as do, make, take and get. They are called delexical because of their low lexical content and the fact that their meanings are normally derived from

Table 1: 100 most frequent items, total spoken segment (10 million words), BNC
(Per 1m = frequency per 1 million words)

Rank  Word     Per 1m    Rank  Word     Per 1m    Rank  Word     Per 1m
   1  the       39605      35  got        5025      68  two        2710
   2  I         29448      36  've        4735      69  said       2685
   3  you       25957      37  not        4693      70  one        2532
   4  and       25210      38  are        4663      71  'm         2512
   5  it        24508      39  if         4544      72  see        2507
   6  a         18637      40  with       4446      73  me         2444
   7  's        17677      41  no         4388      74  very       2373
   8  to        14912      42  're        4255      75  out        2316
   9  of        14550      43  she        4136      76  my         2278
  10  that      14252      44  at         4115      77  when       2255
  11  n't       12212      45  there      4067      78  mean       2250
  12  in        11609      46  think      3977      79  right      2209
  13  we        10448      47  yes        3840      80  which      2208
  14  is        10164      48  just       3820      81  from       2178
  15  do         9594      49  all        3644      82  going      2174
  16  they       9333      50  can        3588      83  say        2116
  17  er         8542      51  then       3474      84  been       2082
  18  was        8097      52  get        3464      85  people     2063
  19  yeah       7890      53  did        3368      86  because    2039
  20  have       7488      54  or         3357      87  some       1986
  21  what       7313      55  would      3278      88  could      1949
  22  he         7277      56  mm         3163      89  will       1890
  23  that       7246      57  them       3126      90  how        1888
  24  to         6950      58  'll        3066      91  on         1849
  25  but        6366      59  one        3034      92  an         1846
  26  for        6239      60  there      2894      93  time       1819
  27  erm        6029      61  up         2891      94  who        1780
  28  be         5790      62  go         2885      95  want       1776
  29  on         5659      63  now        2864      96  like       1762
  30  this       5627      64  your       2859      97  come       1737
  31  know       5550      65  had        2835      98  really     1727
  32  well       5310      66  were       2749      99  three      1721
  33  so         5067      67  about      2730     100  by         1663
  34  oh         5052


the words they co-occur with (e.g. make a mistake, make dinner). However, those collocating words may often be of relatively low frequency (e.g. get a degree, get involved, make an appointment), or may be combinations with high-frequency particles generating semantically opaque phrasal verbs (e.g. get round to doing something, take over from someone).

Interactive markers

There are a number of items which represent speakers' attitudes and stance. These are central to communicative well-being and to maintaining social relations. They are not a luxury, and it is hard to conceive of anything but the most sterile survival-level communication occurring without them. The words include just, whatever, thing(s), actually, basically, hopefully, really, pretty, quite, literally. The interactive words may variously soften or make indirect potentially face-threatening utterances, purposely make things vague or fuzzy in the conversation, or intensify and emphasise one's stance.

Discourse markers

Discourse markers organise and monitor the talk. A range of such items occur in the top 2000 most frequent forms and combinations, including I mean, right, well, so, good, you know, anyway. Their functions include marking openings and closings, returns to diverted or interrupted talk, signalling topic boundaries and so on. They are, like the interactive words, an important feature of the interpersonal stratum of discourse. The absence of discourse markers in the talk of an individual leaves him/her potentially disempowered and at risk of becoming a second-class participant in the conversation.

Deictic words

Deictic words relate the speaker to the world in relative terms of time and space. The most obvious examples are words such as this and that, where "this box" for the speaker may be "that box" for a remotely placed listener, or the speaker's "here" might be "here" or "there" for the listener, depending on where each person is relative to each other. The 2000 list contains words with deictic meanings such as now, then, ago, away, front, side and the extremely frequent back (as the opposite of front, but mostly meaning returned from another place).

Basic nouns

In the 2000 list we find a wide range of nouns of very general, non-concrete and concrete meanings, such as person, problem, life, noise, situation, sort, trouble, family, kids, room, car, school, door, water, house, TV, ticket, along with the names of days, months, colours, body-parts, kinship terms, other general time and place nouns such as the names of the four seasons, the points of the compass, and nouns denoting basic activities and events such as trip and breakfast. These nouns, because of their general meanings, have wide communicative coverage. Trip, for example, can substitute for lower frequency items such as voyage, flight, drive, etc. In terms of everyday categories, there is a degree of unevenness. In the names of the four seasons in CANCODE, summer is three times more frequent than winter, and four times more frequent than spring, with autumn trailing behind at ten times less frequent than summer and outside of the top 2000 list. Pedagogical decisions may override such awkward but fascinating statistics. However, some closed sets are large (e.g. all the possible body parts, or the names of all countries in the world), and in such cases frequency lists are helpful for establishing priorities.

Basic adjectives

In this class there appear a number of adjectives for everyday positive and negative evaluations. These include lovely, nice, different, good, bad, horrible, terrible. Basic adjectives (and basic adverbs) often occur as response tokens (speaker A says "See you at five", speaker B says fine/great/good/lovely). Great, good, fine, wonderful, excellent, lovely, etc. occur very frequently in this function. These items make the difference between a respondent who repeatedly responds with an impoverished range of vocalisations or the constant use of yes and/or no and one who sounds engaged, interested and interesting.

Basic adverbs

Many time adverbs are of extremely high frequency, such as today, yesterday, tomorrow, eventually, finally, as are adverbs of frequency and habituality, such as usually, normally, generally, and of manner and degree, such as quickly (but not slowly, which comes in at word #2685), suddenly, fast, totally, especially. This class of word is fairly straightforward, but some prepositional phrase adverbials are also extremely frequent, such as in the end, and at the moment, which occur 205 and 626 times, respectively, in CANCODE. Once again, the single word-form list often hides the frequency of phrasal combinations (see McCarthy and Carter, in press).

Basic verbs

Beyond the delexical verbs, there are verbs denoting everyday activity, such as sit, give, say, leave, stop, help, feel, put, listen, explain, love, eat. It is worth noting the distribution of particular tense/aspect forms. Of the 14,682 occurrences of the forms of SAY (i.e. say, says, saying, said) in CANCODE, 5,416 of these (36.8%) are the past tense said, owing to the high frequency of speech reports. Such differences may be important in elementary level pedagogy, where vocabulary growth often outstrips grammatical knowledge, and a past form might need to be introduced even though familiarity with the past tense in general may be low.

Conclusion

With spoken data, there is a core vocabulary based around the 1500-2000 most frequent words, a vocabulary that does very hard work in day-to-day communication. Written data has a larger core.


However, raw lists of items need careful evaluation and further observations of the corpus itself before an elementary-level vocabulary syllabus can be established. Not least of the problems is that of widely differing frequencies within sets of items that seem, intuitively, to form useful families for language learning and testing purposes. Equally, the list needs to take account of collocations and phrasal items, as in the case of delexical verbs, discourse markers and basic adverbs. But the list can also be very useful in suggesting priorities for the grading of closed sets consisting of large numbers of items (e.g. the human body parts).

Corpus statistics take us a considerable way from what intuition and conventional practice alone can provide, but the one should not exist without the other.

References and Further Reading

Leech, G, Rayson, P and Wilson, A (2001): Word Frequencies in Written and Spoken English, London: Longman.

McCarthy, M (1998): Spoken Language and Applied Linguistics, Cambridge: Cambridge University Press.

McCarthy, M and Carter, R (in press): This that and the other: Multi-word clusters in spoken English as visible patterns of interaction, Teanga. Special issue on corpus and language variation.

Rundell, M (1995a): The BNC: A spoken corpus, Modern English Teacher, 4/2, 13-15.

Rundell, M (1995b): The word on the street, English Today, 11/3, 29-35.

This is an extract from Research Notes, published quarterly by University of Cambridge ESOL Examinations.
Previous issues are available on-line from www.CambridgeESOL.org
© UCLES 2003. This article may not be reproduced without the written permission of the copyright holder.
