Register in BNC

A Study of Register Variation in the
British National Corpus
Kaoru Takahashi
Toyota National College of Technology, Japan

Correspondence:
Kaoru Takahashi, 2-1 Eisei-Cho, Toyota City, Aichi-Pref. 471-8525, Japan.
Email: takahasi@toyota-ct.ac.jp

Abstract
This article is concerned with the study of register variation, the process of focusing
on the similarities and dissimilarities between register categories in terms of
various linguistic phenomena. The British National Corpus World Edition, which
is a 100 million word collection of British English, is used to characterize
register variation by identifying the linguistic characteristics of registers.
By means of multivariate analysis, the variation in the occurrence of selected
linguistic features among registers is classified. A multivariate analysis holds
out the promise of being able to systematize the register categories in the corpus
while also revealing the characteristic linguistic features of the groups classified.
In this article, by focusing on a sociolinguistic variable which is fairly
systematically associated with social class in the British National Corpus,
the dimensions revealed by the multivariate analysis were interpreted
linguistically. That is, the linguistic dimension concerned with formal style
versus casual style proved the validity of the social variable in the British
National Corpus and enabled its characterization in the light of linguistic
features. Furthermore, several words which pertain to interjection, filler, modal
auxiliary verb, and negation, e.g. hmm, ay, may, 'd, not, and nae, turned out
to be crucial markers for characterizing the register in which texts are used.
1 Introduction
Register can be regarded as a general term for any
language variety defined in terms of a particular
constellation of situational characteristics. Conrad
and Biber (2001, p. 3) claim:
Register distinctions are defined in nonlinguistic terms, including the speaker's
purpose in communication, the topic, the
relationship between speakers and hearer, and
the production circumstances and there are
usually important linguistic differences across
registers that correspond to the differences in
situational characteristics.
The present research, which uses the British National Corpus
World Edition (the BNC hereinafter), attempts to
shift this restricted notion of register to a definition
of register that is more central to linguistics. The
study of register variation is becoming more and
more sophisticated, largely as a result of the use of
corpora in concert with computational techniques
(Lee, 2000; Trudgill, 2002). By addressing English
texts quantitatively, it is possible to classify and
systematize language samples, e.g. by counting
variants and comparing the incidence of variants
in different registers. Since the criteria used to
classify and systematize the language samples are
largely associated with linguistic dimensions, the
focal point shifts to the issue of how linguistic
Literary and Linguistic Computing Vol. 21, No. 1, 2006. The Author 2005. Published by Oxford University Press on
behalf of ALLC and ACH. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
doi:10.1093/llc/fqi028
Advance Access Published on 18 May 2005
dimensions can be identified. In this respect, the
claim of Milroy and Milroy (1997, p. 53) can be
noted:
In the history of a major language, such as
English and French, the process of
maintenance has also been prominent
sometimes carried out by overt legislation,
and sometimes in a less formal way by
imposing the codified linguistic norm of elite
social groups on society as a whole through
education and literacy.
In other words, education is one dimension of
language variation which is influenced by social
class. Of the social variables that are commonly used
in sociolinguistics, socioeconomic class and social
networks are debatable, yet attractive, ideas. Defining social class is not a simple task, as social class
may be a fluid concept that varies according to the
perspective one takes: it may be oriented to occupation
or to social networks, for example. However, using
the socioeconomic variable in the BNC to classify
texts in concert with other features generates new
perspectives both on the classification and on the effect
of socioeconomic variables on language use. Such
techniques can also be used to explore register
distinctions, because linguistic dimensions emerge in the
course of the analysis along which registers can be distinguished.
For now, let us focus on the relationship between
linguistic dimensions and register and in so doing
propose a hypothesis. Labov's (1972) claim regarding style is interesting in that style can be arranged
on a single dimension from least to most formal,
according to context. As such it shows a correlation
with linguistic variations similar to that of social
class (see Mesthrie et al., 2000, p. 95). Mesthrie
et al. criticize Labov's account of style as one-dimensional in nature. They propose that styles
can be arranged on a continuum, depending on the
amount of attention people pay to the act of
using language. Indeed, Labov failed to build on
an earlier account by Joos (1959), which had
outlined five styles, varying on a scale of formality
from least to most formal, i.e. (1) intimate; (2)
casual; (3) consultative; (4) formal; and (5) frozen.
Joos claims that:

(1) Intimate style involves a great deal of shared
knowledge and background in a private
conversation between equals.
(2) Casual style, which is typical of informal
speech between peers, includes ellipsis (or
omission of certain grammatical elements)
and slang between peers.
(3) Consultative style is the norm for informal
conversations between strangers. Slang and
ellipsis might not be used to the extent that
they are used in casual speech with a friend.
(4) Formal style is determined more by the setting
than by the person interacting. Markers of
formal style include whom, may I, and so on.
(5) Frozen style is a hyper-formal style designed
to discourage friendly relations between
participants.
Concrete features, for example ellipsis and slang,
may identify the position of a speaker on the
continuum. Words such as whom and phrases such
as may I are also noteworthy in that they can be
markers of a formal style. It is the author's
hypothesis that the formal/informal dimension is a
key in determining register.
In the present study, the focus is on the notions
of textual dimensions and discussion of register
distinctions using a sophisticated statistical methodology. Once linguistic dimensions are identified
and interpreted by a statistical method, they can be
used to classify and systematize register distinctions.
That is, each text can be given a precise quantitative
characterization with respect to each dimension in
terms of the frequencies of the co-occurring features
that constitute the dimension. This characterization
enables the classification and systematization of
register distinctions with respect to each dimension
and, at the same time, specifies the characteristics of
linguistic features among classified groups. What is
more, analyses in the light of vocabulary, as well as
part-of-speech, are required to identify linguistic
features.
To sum up, the author's hypothesis is that styles
can be arranged on a continuum constituted by the
formal/informal dimension, along which register
variation relating to social class is characterized in
terms of parts of speech, words, and phrases.
Socioeconomic variables in the BNC serve as crucial
variables in this process. In order to explore this
hypothesis, a multivariate analysis is used.
2 Methodology
2.1 Analysis in Terms of Register
In order to understand the hypothesis outlined in
Section 1, let us explore social variables in the BNC.
The BNC World Edition is a 100 million word
collection of samples of written (82.82%) and
spoken (17.78%) British English from the late
20th century. Spoken texts are organized in two
parts, a context-governed part containing orthographic transcriptions of recordings made at specific
types of meeting and event, and a demographic part
containing orthographic transcriptions of spontaneous natural conversations made by members of
the public.1 The present study addresses the
demographic part only. This contains 4,211,216
words.
Given the focus of this study is on social class, we
will focus here on the social class variable encoded
in the corpus metadata. Table 1 shows the values of
this variable.2 This variable is used to classify the
texts of the demographic section of the BNC using a
multivariate analysis, the exact form of which will be
discussed shortly.
Table 1 Classification of social class

Label  Description                                               Number of sentences
AB     Top or middle management, administrative or professional  197,803
C1     Junior management, supervisory or professional            169,384
C2     Skilled manual                                            144,876
DE     Semi-skilled or unskilled                                  93,159
UU     Class unknown                                               5,339

2.2 Multivariate Analysis and Hayashi's Quantification Type III
As noted earlier, a multivariate analysis technique,
Extended Hayashi's Quantification Type III (EHT3
hereinafter), is used in this article. However,
before introducing this technique, it is necessary to
explain what multivariate analysis is and how it can
be used on data such as that in Table 1.
Multivariate analysis is a statistical procedure
concerned with the analysis of multiple measurements of each individual or object in one or more
samples. The technique is used in various areas such
as psychology, sociology, and biology. In linguistics,
due to the availability of electronic texts, multivariate analysis promises to allow a computer to do
more than just count the frequencies of words or
find the functions of specific features in the texts.
Multivariate analysis holds out the promise of a
multi-feature/multi-dimensional approach to language in which an abundance of linguistic features
contribute to the many dimensions used to
categorize or discriminate the data. In this analysis,
continuums of linguistic variation, rather than
discrete, unrelated classifications, are dealt with.
In this way, the multivariate analysis provides an
overall account of linguistic variation among texts
and offers a framework for the discussion of the
similarities and dissimilarities between particular
texts or registers. At the same time, this approach
leads to the classification and systematization of
register categories. The feasibility of this approach
was confirmed by Biber's (1988, p. 55) work, which
argues that dimensions are bundles of linguistic
features and that features are made into dimensions.
In turn, there are multiple related dimensions which
combine and recombine in many configurations to
generate discrete text types.
This study employs EHT3. This type of method is
widely known as correspondence analysis. Similar
techniques were developed independently in several
countries, where they were known as optimal
scaling, optimal scoring, or homogeneity analysis.
There are other kinds of multivariate analysis,
of which principal component analysis
(PCA hereinafter) and factor analysis (FA hereinafter) are commonly used. In order to understand why EHT3 was chosen, it is necessary to
introduce the technique in the context of other
approaches to multivariate analysis.
PCA was first described by Pearson (1901) and
can be regarded as one of the simplest of the
multivariate techniques. FA is related to PCA because one way to
do FA is to begin with PCA and then to use other
approaches. According to Rencher's claim
(2002, p. 49), both analyses seek a simpler structure
in a set of variables. However, in PCA the principal components are defined as linear
combinations of the original variables, and the aim is to explain a
large part of the total variance of the variables. In
contrast, in FA, the original variables are expressed
as linear combinations of the factors, and the aim is to
account for the covariances or correlations among
the variables.
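The PCA side of this contrast can be made concrete with a small numerical sketch. The following is a minimal illustration only, assuming NumPy is available; the function name `pca` and the data matrix `X` are made up for the example and do not come from the BNC. It shows that each principal component is a linear combination of the original variables and that the components are ordered by the share of the total variance they explain.

```python
import numpy as np

def pca(X):
    """Return (explained-variance ratios, loadings, component scores) for X."""
    Xc = X - X.mean(axis=0)                   # centre each variable
    cov = np.cov(Xc, rowvar=False)            # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                     # components = linear combinations of variables
    return eigvals / eigvals.sum(), eigvecs, scores

# Hypothetical data: 6 observations of 3 variables, two of them strongly correlated.
X = np.array([
    [2.0, 1.9, 0.5],
    [3.1, 3.0, 0.4],
    [4.2, 3.9, 0.6],
    [5.0, 5.2, 0.5],
    [6.1, 5.8, 0.7],
    [7.0, 7.1, 0.4],
])
ratios, loadings, scores = pca(X)
print(np.round(ratios, 3))
```

Because two of the three hypothetical variables move together, the first component alone accounts for most of the total variance, which is the sense in which PCA seeks a simpler structure in a set of variables.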
In any case, many statisticians are skeptical about
the value of FA in that it is not as objective as most
statistical methods. Chatfield and Collins (1980,
p. 89) list several problems with FA and conclude
that FA should not be used in most situations. Seber
(1984) notes as a result of simulation studies that
even if a postulated FA model is correct, the chance
of recovering it using any available FA method is
not high.
Yet, FA is commonly employed in psychology,
for example in the determination of intelligence by
test. It is used in such contexts as it is safe to say that
the high intellectual faculties cause the high score of
the intelligence test. It is unlikely that the high score
of the intelligence test causes the high intellectual
faculties. Yet such deterministic causality is not
easily argued for in the study of text typology
focusing on the occurrence of linguistic features.
In the author's pilot study, judging from a
frequency table of BNC tags across three
informative domains, it was clear that FA had a
weaker ability to discriminate or cluster the texts of
the same domains than PCA or EHT3.3 Both EHT3
and PCA ease the interpretation of
the results of a multivariate analysis. As both
analyses place register categories on every dimension
simultaneously, the weights resulting from the
analysis are easier to interpret. This system is
called a biplot. It is therefore safe to say that both
analyses have advantages in terms both of discriminatory power and of supporting biplots. However,
EHT3 is employed here mainly because
it was developed at the Institute of Statistical
Mathematics in Japan, where the author investigated how to use it and independently wrote the
program so that it works effectively and
comfortably.
2.3 An initial EHT3 Analysis

Table 2 is a frequency table showing the ratio of
the occurrence of each tag in each social class.4 In
the BNC there are 61 part-of-speech tags, exclusive
of ambiguous tags. Excluding UU, the social
variable has four possible values, amounting to a
matrix of 4 rows and 61 columns. However, this
analysis adopted the 24 most frequently occurring
tags as features in order to observe the behaviour of
the social classes. By doing so, we can identify their
characteristics in terms of 24 tags, which reveals the
overall tendency of the social classes.5 The matrix is
analyzed by EHT3. The purpose is to find the
similarities and dissimilarities in the frequency
matrix between social variables and between tags
as well. The procedure changes the positions of rows
and columns of the original data matrix so that the
proportionally large frequency figures converge
around the diagonal. Consequently, the social
variables placed close together and the tags placed
close together are considered to be qualitatively
similar. Those located distant from one another are
qualitatively different. Statistically, the analysis
assigns quantities to the social
variables and tags in the original data matrix in
such a way that the assigned quantities yield the highest
correlation coefficient between the two. As a result
of this calculation, some dimensions appear
which serve as the criteria to characterize social
classes.6 The dimensions are later interpreted
sociolinguistically. Because both social variables
and tags have weights on every dimension simultaneously, the weights can help us to interpret each
criterion to characterize the social classes by
considering the variations and relationships among
social classes and parts of speech tags.

Table 2 Frequency of counts of 24 tags in each social class

Tag   AB      C1      C2      DE
PUN   1.2036  1.0857  1.0946  1.0502
PNP   1.0361  0.9848  1.1540  1.0572
NN1   0.5235  0.4845  0.5511  0.5014
AV0   0.4841  0.4386  0.5050  0.4331
ITJ   0.3257  0.3377  0.3497  0.3391
AT0   0.3330  0.3108  0.3476  0.2996
PRP   0.3006  0.2725  0.3246  0.2836
DT0   0.2282  0.2258  0.2622  0.2285
VVI   0.2383  0.2141  0.2451  0.2181
VVB   0.2317  0.2157  0.2445  0.2258
AJ0   0.2246  0.2045  0.2203  0.1970
CJC   0.2156  0.1966  0.2367  0.2040
VBZ   0.2068  0.2037  0.2184  0.2011
XX0   0.1616  0.1532  0.1861  0.1788
VM0   0.1516  0.1455  0.1670  0.1500
NN2   0.1336  0.1200  0.1346  0.1170
NP0   0.1268  0.1103  0.1294  0.1296
CRD   0.0981  0.1087  0.1318  0.1569
VVD   0.1002  0.1053  0.1354  0.1207
UNC   0.1135  0.1009  0.1079  0.0903
CJS   0.1054  0.0940  0.1140  0.1012
TO0   0.1023  0.0933  0.1068  0.0889
VVG   0.0956  0.0851  0.0992  0.0860
AVP   0.0830  0.0841  0.1018  0.0927

\[
r = \frac{\dfrac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{m}\delta_{ij}x_i y_j - \left(\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}x_i\right)\left(\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}y_j\right)}{\sqrt{\left\{\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}x_i^{2} - \left(\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}x_i\right)^{2}\right\}\left\{\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}y_j^{2} - \left(\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}y_j\right)^{2}\right\}}} \tag{1}
\]
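The idea of assigning scores to tags and social classes so that their correlation is as high as possible can be sketched numerically. The following is a minimal illustration, not the author's program: it assumes that EHT3 applied to a frequency table behaves like standard correspondence analysis computed through a singular value decomposition, it assumes NumPy is available, and the function name `correspondence_scores` is made up for the example. Only six of the 24 tags from Table 2 are used to keep the example short, so the resulting scores are not expected to match the values reported in the article.

```python
import numpy as np

# Rows: six of the 24 tags; columns: AB, C1, C2, DE (values copied from Table 2).
tags = ["PUN", "PNP", "AJ0", "XX0", "CRD", "VVD"]
F = np.array([
    [1.2036, 1.0857, 1.0946, 1.0502],
    [1.0361, 0.9848, 1.1540, 1.0572],
    [0.2246, 0.2045, 0.2203, 0.1970],
    [0.1616, 0.1532, 0.1861, 0.1788],
    [0.0981, 0.1087, 0.1318, 0.1569],
    [0.1002, 0.1053, 0.1354, 0.1207],
])

def correspondence_scores(F):
    """Return (correlations, row scores, column scores) with the trivial axis removed."""
    N = F.sum()
    row, col = F.sum(axis=1), F.sum(axis=0)        # marginal totals
    S = F / np.sqrt(np.outer(row, col))            # table scaled by its margins
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # s[0] is the trivial value 1 (constant scores); the remaining axes are kept.
    x = U[:, 1:] / np.sqrt(row / N)[:, None]       # scores for rows (tags)
    y = Vt.T[:, 1:] / np.sqrt(col / N)[:, None]    # scores for columns (classes)
    return s, x, y

s, x, y = correspondence_scores(F)
P = F / F.sum()
r1 = x[:, 0] @ P @ y[:, 0]   # weighted correlation of the first pair of scores
print(np.round(s, 4), round(float(r1), 4))
```

In this sketch the scores have weighted mean 0 and weighted variance 1, so the weighted correlation `r1` of the first pair of score vectors coincides with the second singular value; the first singular value is the trivial solution with value 1.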
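Now suppose that a numerical value f_ij is given to
the jth subcategory of the ith item as follows.
\[
F = \begin{bmatrix}
f_{11} & f_{12} & \cdots & f_{1j} & \cdots & f_{1m} \\
f_{21} & f_{22} & \cdots & f_{2j} & \cdots & f_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
f_{i1} & f_{i2} & \cdots & f_{ij} & \cdots & f_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nj} & \cdots & f_{nm}
\end{bmatrix}
\]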
In the case of Table 2, f_ij is the frequency score of the ith
part-of-speech tag in the jth social class. To make the
explanation easier to follow, the frequency scores are
replaced here with responses of 1 or 0. That is, the
calculation of EHT3 can be shown by considering
the original version of Hayashi's
Quantification Type III, which deals with responses of
1 or 0. The matrix of the given
responses takes the form
\[
D = (\delta_{ij}) = \begin{bmatrix}
\delta_{11} & \delta_{12} & \cdots & \delta_{1j} & \cdots & \delta_{1m} \\
\delta_{21} & \delta_{22} & \cdots & \delta_{2j} & \cdots & \delta_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
\delta_{i1} & \delta_{i2} & \cdots & \delta_{ij} & \cdots & \delta_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
\delta_{n1} & \delta_{n2} & \cdots & \delta_{nj} & \cdots & \delta_{nm}
\end{bmatrix}
\]
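Then a value x_i is given to the ith item and y_j is
given to the jth category so as to maximize
the correlation coefficient shown in Equation 1
above as r.
The following relations hold, where each response \delta_{ij} is 1 or 0:
\[
\delta_{\cdot j} = \sum_{i=1}^{n}\delta_{ij} \quad \text{(the score of the jth column)},
\]
\[
\delta_{i\cdot} = \sum_{j=1}^{m}\delta_{ij} \quad \text{(the score of the ith row)},
\]
\[
N = \sum_{j=1}^{m}\delta_{\cdot j} = \sum_{i=1}^{n}\delta_{i\cdot} = \sum_{i=1}^{n}\sum_{j=1}^{m}\delta_{ij}.
\]
To maximize r with respect to x_i and y_j, the conditions
\[
\frac{\partial r}{\partial x_i} = 0, \qquad \frac{\partial r}{\partial y_j} = 0 \qquad (i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, m)
\]
are necessary.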
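Equation 1 can be evaluated directly for a small response matrix. The sketch below is illustrative only: the function name `correlation` and the 3 × 3 response pattern are made up for the example and are not taken from the article. It shows that scores which bring the 1 responses onto the diagonal yield a higher r, which is the sense in which the procedure rearranges rows and columns.

```python
from math import sqrt

def correlation(delta, x, y):
    """Correlation r of Equation 1 for a 0/1 response matrix and given scores."""
    n, m = len(delta), len(delta[0])
    row = [sum(delta[i]) for i in range(n)]                          # row totals
    col = [sum(delta[i][j] for i in range(n)) for j in range(m)]     # column totals
    N = sum(row)
    mean_x = sum(row[i] * x[i] for i in range(n)) / N
    mean_y = sum(col[j] * y[j] for j in range(m)) / N
    cov = sum(delta[i][j] * x[i] * y[j]
              for i in range(n) for j in range(m)) / N - mean_x * mean_y
    var_x = sum(row[i] * x[i] ** 2 for i in range(n)) / N - mean_x ** 2
    var_y = sum(col[j] * y[j] ** 2 for j in range(m)) / N - mean_y ** 2
    return cov / sqrt(var_x * var_y)

# Hypothetical response pattern: item 2 responds to category 3, item 3 to category 2.
delta = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
r_original = correlation(delta, [1, 2, 3], [1, 2, 3])
r_reordered = correlation(delta, [1, 2, 3], [1, 3, 2])   # swap the scores of categories 2 and 3
print(round(r_original, 6), round(r_reordered, 6))
```

With the original scores r is 0.5; once categories 2 and 3 exchange scores, every response lies on the diagonal and r reaches 1.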
By solving the resulting characteristic equation,
at most k eigenvalues can be
obtained, where k is the smaller of n
and m. Among the eigenvalues, the largest one, which
is equal to 1, is excluded, and the other
eigenvalues yield the corresponding eigenvectors. The eigenvectors are the quantities
(x_1, x_2, ..., x_n), which are assigned to the n items, and
the quantities (y_1, y_2, ..., y_m), which are assigned to the m
categories. Almost the same
procedure can be employed for the numerical values f_ij
instead of the above dichotomous response pattern \delta_{ij}.
However, because it is not necessary for the purpose
of the present study to enter into a detailed
discussion of this calculation, these values are not
presented here. Interested readers should see
Hayashi's paper (1952) for details.

Table 3 Eigenvectors of the 24 tags

Tag   Dimension 1  Dimension 2
PUN   0.49483      0.57667
PNP   0.258207     0.110518
NN1   0.005233     0.036192
AV0   0.12424      0.257758
ITJ   0.087467     0.28158
AT0   0.11245      0.16239
PRP   0.014521     0.206288
DT0   0.095129     0.179705
VVI   0.06211      0.080389
VVB   0.02669      0.01659
AJ0   0.15862      0.00837
CJC   0.025429     0.224032
VBZ   0.00705      0.06508
XX0   0.225889     0.02552
VM0   0.051563     0.06311
NN2   0.10345      0.081673
NP0   0.080286     0.14587
CRD   0.602305     0.46212
VVD   0.30724      0.192747
UNC   0.20874      0.087
CJS   0.031885     0.108259
TO0   0.08473      0.161182
VVG   0.04706      0.105171
AVP   0.176196     0.078303
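The characteristic equation itself is not written out in the text. As a point of orientation only, and assuming that the method coincides with the standard formulation of correspondence analysis (an assumption; Hayashi (1952) should be consulted for the original derivation), the stationarity conditions lead to an eigenvalue problem of the following form for the item scores, written here for the frequencies f_ij with row totals f_{i\cdot} = \sum_j f_{ij} and column totals f_{\cdot j} = \sum_i f_{ij}:

\[
\sum_{k=1}^{n}\left(\sum_{j=1}^{m}\frac{f_{ij}\,f_{kj}}{f_{i\cdot}\,f_{\cdot j}}\right)x_k = r^{2}x_i \qquad (i = 1, 2, \ldots, n).
\]

Setting every x_k to the same constant satisfies this equation with r^2 = 1, because \sum_k f_{kj} = f_{\cdot j} and \sum_j f_{ij} = f_{i\cdot}. Under this assumption, that constant solution is the eigenvalue equal to 1 that is excluded from the analysis.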
3 Results
By solving the above equation, we obtain eigenvalues whose maximum number is the smaller
of n and m. Eigenvalues are a special set of scalars
associated with a matrix equation. Eigenvectors
corresponding to (k − 1) eigenvalues are obtained, that is, for all eigenvalues except
the maximum eigenvalue, namely, 1. Each
eigenvalue is paired with a corresponding
eigenvector. These eigenvectors are the
scores (x_1, x_2, ..., x_n) given to the tags and the scores
(y_1, y_2, ..., y_m) given to the social classes, respectively.
In Table 2, k is 4, so we obtain three eigenvalues.
Accordingly, the scores of the tags (x_1, x_2, ..., x_24) and of the social
classes (y_1, y_2, ..., y_4) are obtained for each of the
corresponding eigenvalues. The eigenvalues are
called axis 1, axis 2, etc. in descending order. As a
result of the calculation on Table 2, we obtain the
eigenvectors in Table 3, along with the proportion
accounted for in Table 4. Each eigenvalue
denotes the degree to which the tags or social
groups have salient characteristics on the axis. An
axis refers to a dimension. The proportion accounted
for is the amount of information contained in the
original data matrix that is explained by the axis in
question. The cumulative proportion indicates the
percentage of the information in the original data
explained up to the axis in question. In practice, not
all sets of social class scores and tag scores are used
for analysing given data. In the present case, only
two axes are used. Up to the second axis, the
cumulative proportion accounted for is 93.95%,
hence leaving unaccounted for about 6.05% of the
information contained in the frequency table.
Although axis 3 may be significant, only two axes
are included in the following analysis, mainly
because of the practical consideration that the
figures presented later can deal with a maximum
of two axes if they are to be grasped easily.

Table 4 Proportion accounted for, and cumulative proportion accounted for, for each axis

Axis  Proportion accounted for  Cumulative proportion
1     0.7470                    0.7470
2     0.1925                    0.9395
3     0.0605                    1.0000

The two sets of social class scores and tag
scores calculated for producing three correlation
coefficients are then normalized, with the means
equal to 0, and the variances equal to 1.0.
Normalized scores for both dimensions calculated
in this way are given in Table 5.
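The arithmetic behind the proportion accounted for and the cumulative proportion can be checked with a short sketch. It is illustrative only: the helper `proportions` assumes that each axis's proportion is its eigenvalue divided by the sum of the non-trivial eigenvalues, and the eigenvalues passed to it are hypothetical, since the article reports proportions rather than eigenvalues. The list `table4` holds the proportions as printed in Table 4.

```python
from itertools import accumulate

def proportions(eigenvalues):
    """Share of the total for each axis, given the non-trivial eigenvalues."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Proportions reported in Table 4 for axes 1 to 3.
table4 = [0.7470, 0.1925, 0.0605]
cumulative = [round(c, 4) for c in accumulate(table4)]
unexplained_after_two = round(1 - cumulative[1], 4)
print(cumulative, unexplained_after_two)

# Hypothetical eigenvalues, used only to show how proportions are derived.
demo = [round(p, 2) for p in proportions([0.006, 0.003, 0.001])]
print(demo)
```

The running sum reproduces the cumulative column of Table 4, and the remainder after two axes corresponds to the roughly 6.05% of the information left unaccounted for.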
EHT3 makes the interpretation easier because the
relationships between the features and the categories
along each axis of coordinates serve as a key to
interpreting factors. In other words, there is a clear
correspondence between the scores showing the
distribution of social classes and those showing the
tag distribution. The reason why a social class
with an outstanding score on an axis is located
at a certain position along the axis is explained by
referring to the distribution of tags adjacent to it
along the same axis. In other words, because
quantities are given to both social classes and
linguistic features along axes, the given quantities
often indicate the extent to which the social
classes or tags are similar and thus also serve for
their classification.7 It is not easy to grasp the
relationships between numerical values in a table;
therefore, the values given in the table are plotted
along Dimensions 1 and 2 in Fig. 1. In this way, the
visualization makes it easier for us to identify the
relationship between the sdecla variables and the linguistic
features.

Table 5 Normalized category scores

Class/Tag  Dimension 1  Dimension 2
sdecla1    0.05370      0.00199
sdecla2    0.01960      0.00976
sdecla3    0.02360      0.03100
sdecla4    0.04910      0.02240
PUN        0.04810      0.02850
PNP        0.02570      0.00559
NN1        0.00075      0.00262
AV0        0.01870      0.01960
ITJ        0.01540      0.02520
AT0        0.02030      0.01490
PRP        0.00274      0.01970
DT0        0.02000      0.01920
VVI        0.01330      0.00873
VVB        0.00571      0.00180
AJ0        0.03530      0.00095
CJC        0.00564      0.02520
VBZ        0.00159      0.00743
XX0        0.05610      0.00322
VM0        0.01350      0.00837
NN2        0.02980      0.01190
NP0        0.02330      0.02150
CRD        0.17500      0.06830
VVD        0.09260      0.02950
UNC        0.06660      0.01410
CJS        0.01010      0.01750
TO0        0.02770      0.02680
VVG        0.01590      0.01810
AVP        0.06000      0.01350
In this graph, along Dimension 1 (x-axis), the
social variables are plotted in the order of the level
of social class. Along Dimension 2 (y-axis), however, the behaviour of the variables is different
from that along Dimension 1. That is, in Dimension 2,
there are no salient characteristics in either AB or
C1, namely, the highest class and the second
highest class, whereas C2 (skilled manual workers)
and DE (semi-skilled or unskilled workers) are
located at opposite poles.
Taking into account the dispersion of both social
variables and linguistic features along the two axes, the
discussion moves to the issue of the interpretation of
the two axes and the characterization of the four social
classes. An axis can sometimes be given a straightforward interpretation explaining the relationship
of the social class and the features through the
distribution of the features and social classes
along an identical axis. At the same time, it often
becomes possible to explain why the social classes
are distributed or arranged in a certain way along
the axis.
The next section contains a discussion of the
interpretation of the heavily weighted axes and
identifies their characteristics.
3.1 Distribution of Tags and Social Classes
3.1.1 Dimension 1
Table 6 indicates the extreme features with a
positive value in Dimension 1. The features are
arranged according to their values, in descending
order, together with the DE social class, which
has the highest score among all the social classes.
Thus, placing the values of the features and the
social class simultaneously will eventually
simplify the interpretations of the quantities given
to them.
Taking a closer look at this table, note that CRD,
the cardinal number, is the most striking tag. It can
be regarded as one of the typical characteristics
of the positive side in Dimension 1, followed by
VVD (the past tense form of lexical verbs) and
AVP (adverb particle). Focusing on the structure of
sentences, we can safely say that the past tense
of verbs or the phrasal verbs is the most striking
structure in the positive side of Dimension 1.
[Figure 1: scatter plot of the four social classes (AB, C1, C2, DE) and the 24 tags; x-axis: Dimension 1; y-axis: Dimension 2]
Fig. 1 The distribution of four social classes and tags
Table 6 Tags and the social class characteristic of the positive side in Dimension 1

Tag and Social Class  Score   Description of tags and social class
CRD                   0.1750  Cardinal number (e.g. one, 3, fifty-five, 3609)
VVD                   0.0926  The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
AVP                   0.0600  Adverb particle (e.g. up, off, out)
XX0                   0.0561  The negative particle not or n't
sdecla4               0.0491  DE: semi-skilled or unskilled
PNP                   0.0257  Personal pronoun (e.g. I, you, them, ours)
sdecla3               0.0236  C2: skilled manual
It is also noted that the tags for the
negative particle, XX0, and the personal pronoun,
PNP, are located around both DE and C2, meaning
that they are typical markers of phrasal characteristics in the lower social classes.
Secondly, as for the negative range of Dimension
1, the extreme features are lined up in Table 7. In
this case, the highest social class, AB, has the second
highest score among all the social classes, followed
by C1, the second highest social class. Tags between
these social variables help to examine the tendency
in this region. Tags such as AJ0 (adjective), NN2
(plural common noun), TO0 (infinitive marker to)
and AT0 (article) are the most frequently used in
the higher social classes.8
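Reading off the extreme features on each side of a dimension amounts to sorting the scores. The following minimal sketch uses the Dimension 1 scores listed in Tables 6 and 7; the signs are an assumption taken from the table captions (positive side for Table 6, negative side for Table 7), and the helper name `extremes` is made up for the example.

```python
# Dimension 1 scores; signs follow the table captions (Table 6 positive, Table 7 negative).
dimension1 = {
    "CRD": 0.1750, "VVD": 0.0926, "AVP": 0.0600, "XX0": 0.0561,
    "sdecla4": 0.0491, "PNP": 0.0257, "sdecla3": 0.0236,
    "UNC": -0.0666, "sdecla1": -0.0537, "PUN": -0.0481, "AJ0": -0.0353,
    "NN2": -0.0298, "TO0": -0.0277, "AT0": -0.0203, "sdecla2": -0.0196,
}

def extremes(scores, side, top=5):
    """Return the `top` labels with the most extreme scores on the given side."""
    if side == "positive":
        picked = [(k, v) for k, v in scores.items() if v > 0]
    else:
        picked = [(k, v) for k, v in scores.items() if v < 0]
    picked.sort(key=lambda kv: abs(kv[1]), reverse=True)   # most extreme first
    return [k for k, _ in picked[:top]]

positive = extremes(dimension1, "positive")
negative = extremes(dimension1, "negative")
print(positive)
print(negative)
```

Listing tags and social classes in one ranking, as the tables do, makes it easy to see which tags sit next to which class on the same side of the axis.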
Table 7 Tags and the social class characteristic of the negative side in Dimension 1

Tag and Social Class  Score    Description of tags and social class
UNC                   0.0666   Unclassified items which are not appropriately considered as items of the English lexicon
sdecla1               0.0537   AB: top or middle management, administrative or professional
PUN                   0.0481   Punctuation: general separating mark, i.e. . , ! , : ; - or ?
AJ0                   0.0353   Adjective (general or positive) (e.g. good, old, beautiful)
NN2                   0.0298   Plural common noun (e.g. pencils, geese, times, revelations)
TO0                   0.0277   Infinitive marker to
AT0                   0.0203   Article (e.g. the, a, an, no)
sdecla2               0.0196   C1: junior management, supervisory or professional

3.1.2 Dimension 2
In Dimension 2, there is no coherent relationship
among the four social classes of the kind observed in
Dimension 1. However, in Dimension 2,
a conspicuous contrast is observed between C2
(skilled manual) and DE (semi-skilled or unskilled),
namely the second lowest class and the lowest class,
respectively, whereas the highest and the second
highest social class do not show any such contrast.
In the negative region, to which the lowest social
class pertains, CRD (cardinal number), ITJ (interjection or other isolate) and NP0 (proper noun) are
regarded as the typical characteristics (see Table 8).
The opposite, positive region, to which
C2 pertains, tends to have the characteristics of the
past tense form of lexical verbs (VVD), infinitive
marker to (TO0) and coordinating conjunction
(CJC) (see Table 9). In this way, it is clear that
Dimension 2 is related to the features that
discriminate between the two lower social classes.
When it comes to the sequence of social classes
from higher to lower, however, no continuum can
be observed along Dimension 2. Taking into
account the low value of the proportion accounted
for by Dimension 2, i.e. 19.25% (see Table 4), we will henceforth
concentrate on Dimension 1. That is, the register
distinction will be largely addressed in the light of
the continuum of social class.

Table 8 Tags and the class characteristic of the negative side in Dimension 2

Tag and Social Class  Score    Description of tags and social class
CRD                   0.0683   Cardinal number (e.g. one, 3, fifty-five, 3609)
PUN                   0.0285   Punctuation: general separating mark, i.e. . , ! , : ; - or ?
ITJ                   0.0252   Interjection or other isolate (e.g. oh, yes, mhm, wow)
sdecla4               0.0224   DE: semi-skilled or unskilled
NP0                   0.0215   Proper noun (e.g. London, Michael, Mars, IBM)
sdecla2               0.00976  C1: junior management, supervisory or professional
4 Interpretation
When the multivariate analysis reveals distinguishing tags in a
dimension, the dimension enables the characterization of the
sentence structure relevant to those tags. A distinguishing tag is
a tag located relatively far from the origin of the coordinates.
Such a tag describes the disposition of the dimension, which can
then be interpreted linguistically. If a social class is far from
the origin of the coordinates, the social class is strongly
associated with the side of the dimension on which it lies. The
farther from the origin the social variable is located, the more
strongly it describes the disposition of the dimension. Also, the
closer the social variable is located to particular tags, the more
likely it is that there is a strong relationship between them. In
other words, those tags are commonly used in that social class.
There are other things to note. The positive and the negative
sides are expected to assume contrasting characteristics. These
notions help interpret dimensions linguistically. Lastly, we should
not overlook that dimensions can be regarded as continuums of
linguistic variation.

Table 9 Tags and the class characteristic of the positive side in Dimension 2

Tag and social class   Score    Description
sdecla3                0.0310   C2: skilled manual
VVD                    0.0295   The past tense form of lexical verbs
                                (e.g. forgot, sent, lived, returned)
TO0                    0.0268   Infinitive marker to
CJC                    0.0252   Coordinating conjunction (e.g. and, or, but)
PRP                    0.0197   Preposition (except for of) (e.g. about, at, in,
                                on, on behalf of, with)
AV0                    0.0196   General adverb: an adverb not subclassified as
                                AVP or AVQ (e.g. often, well, longer (adv.), furthest)
DT0                    0.0192   General determiner-pronoun: i.e. a
                                determiner-pronoun which is not a DTQ or an AT0
VVG                    0.0181   The -ing form of lexical verbs (e.g. forgetting,
                                sending, living, returning)
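
As a rough illustration of how distinguishing tags might be picked out by their distance from the origin, the following minimal sketch ranks the scores listed in Table 9. The function name distinguishing_items and the cut-off value are hypothetical choices made for illustration only; the article itself does not state a numerical threshold.

```python
# Scores on the positive side of Dimension 2, copied from Table 9.
# "sdecla3" is the social-class variable (C2: skilled manual); the rest are tags.
TABLE_9_SCORES = {
    "sdecla3": 0.0310,
    "VVD": 0.0295,
    "TO0": 0.0268,
    "CJC": 0.0252,
    "PRP": 0.0197,
    "AV0": 0.0196,
    "DT0": 0.0192,
    "VVG": 0.0181,
}


def distinguishing_items(scores, threshold):
    """Return (name, score) pairs whose distance from the origin is at least
    `threshold`, farthest first. On a single axis the distance is |score|."""
    far = [(name, s) for name, s in scores.items() if abs(s) >= threshold]
    return sorted(far, key=lambda pair: abs(pair[1]), reverse=True)


# The cut-off 0.025 is an arbitrary, hypothetical value for illustration.
for name, score in distinguishing_items(TABLE_9_SCORES, threshold=0.025):
    print(f"{name:8s} {score:+.4f}")
```

With this cut-off the sketch keeps sdecla3, VVD, TO0, and CJC. A social class and the tags that lie far out on the same side are the ones read together when the dimension is interpreted.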
Based on the concept and procedure mentioned above, the most
powerful dimension, Dimension 1, can be interpreted. We must draw
attention to particular tags located at the far ends of both sides
of Dimension 1, namely VVD (the past tense form of lexical verbs),
AVP (adverb particle), and XX0 (the negative particle) on the
positive side, and AJ0 (adjective) and NN2 (plural common noun) on
the negative side. However, tags alone are hard to deal with as
linguistic features, which suggests that the words relevant to
particular tags should be examined alongside the tags when
interpreting the dimension. Therefore, this article focuses on XX0
as the feature expected to give a brief understanding of this
dimension. Another reason is that XX0 is most relevant to the
concept of formality mentioned in Section 1. In this respect, ITJ
(interjection or other isolate) and VM0 (modal auxiliary verb) are
also included in the following analyses, although they do not have
high scores along Dimension 1. The reason is that tags which reveal
no characteristic along a dimension also deserve attention: even if
a tag is close to the origin of the coordinates, distinguishing
words relevant to that tag may lie on both sides of the dimension
and so dilute the characteristic of the tag. The words to be
analyzed are therefore those tagged as XX0, ITJ, and VM0.
4.1 Distribution of Words Tagged as ITJ

The same methodology as the previous analysis is
employed to identify the characteristics of words
associated with ITJ. The discrepancy in
the frequency count of the words among the four
social classes yields similarity and dissimilarity along
dimensions by multivariate analysis. Table 10 shows the
per-sentence frequency counts of words tagged as ITJ. As a result
of the analysis, normalized category scores and feature scores are
obtained in the same way as in the previous analysis. The numerical
results of the multivariate analysis are omitted in this article.
The resulting values are plotted along Dimensions 1 and 2 in
Fig. 2.
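
The step described above, in which a table of word frequencies against the four social classes is turned into scores along a dimension, can be illustrated with a small sketch. Hayashi's Quantification Type III is generally regarded as mathematically equivalent to correspondence analysis, so the sketch uses a plain correspondence analysis as a stand-in. It is not the ready-made program used in this study, and its scaling of scores may differ from the one behind Fig. 2. Only five rows of Table 10 are used, and the function name first_dimension is hypothetical.

```python
import math

CLASSES = ["AB", "C1", "C2", "DE"]

# Per-sentence frequencies of a few ITJ words, copied from Table 10.
# Using only a subset is an assumption made for brevity, so the scores
# printed here will not match the positions plotted in Fig. 2.
ITJ_RATES = {
    "Yeah":  [0.1141, 0.1594, 0.1831, 0.1622],
    "yes":   [0.0629, 0.0386, 0.0382, 0.0292],
    "aye":   [0.0021, 0.0064, 0.0071, 0.0117],
    "Hello": [0.0050, 0.0046, 0.0026, 0.0033],
    "Cor":   [0.0006, 0.0011, 0.0025, 0.0012],
}


def first_dimension(table, iterations=500):
    """First axis of a simple correspondence analysis.

    Returns (inertia, row_scores, column_scores); rows are words and
    columns are social classes.
    """
    names = list(table)
    rows = [table[name] for name in names]
    total = sum(sum(r) for r in rows)
    p = [[v / total for v in r] for r in rows]      # relative frequencies
    r_mass = [sum(r) for r in p]                    # row masses
    c_mass = [sum(col) for col in zip(*p)]          # column masses
    n_rows, n_cols = len(rows), len(c_mass)

    # Standardised residuals: (observed - expected) / sqrt(expected).
    s = [[(p[i][j] - r_mass[i] * c_mass[j]) / math.sqrt(r_mass[i] * c_mass[j])
          for j in range(n_cols)] for i in range(n_rows)]

    def s_times(v):       # S v
        return [sum(sij * vj for sij, vj in zip(row, v)) for row in s]

    def st_times(u):      # S^T u
        return [sum(s[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]

    # Power iteration on S^T S approximates the leading right singular vector.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        w = st_times(s_times(v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    sv = s_times(v)
    inertia = sum(x * x for x in sv)                # squared singular value
    # Words in principal coordinates, classes in standard coordinates.
    row_scores = {names[i]: sv[i] / math.sqrt(r_mass[i]) for i in range(n_rows)}
    col_scores = [v[j] / math.sqrt(c_mass[j]) for j in range(n_cols)]
    return inertia, row_scores, col_scores


inertia, word_scores, class_scores = first_dimension(ITJ_RATES)
print("inertia of axis 1:", round(inertia, 5))
for name, score in zip(CLASSES, class_scores):
    print(name, round(score, 3))
for word, score in sorted(word_scores.items(), key=lambda kv: kv[1]):
    print(word, round(score, 3))
```

The sign of the axis is arbitrary, so only the relative positions of the words and classes carry meaning. A second dimension could be obtained in the same way after removing the first one, which is left out here to keep the sketch short.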
It should be noted in Fig. 2 that the social classes are lined up
from left to right, from the highest to the lowest. The
distribution gives a good account of Dimension 1 as a scale to
measure social levels. In other words, each word identified in this
figure can be characterized according to the extent to which it
indicates social class. In this respect, hmm is the interjection
most strongly associated with the highest class. In contrast, ay,
cor, ta, and aye tend to be used commonly in the lowest class. Let
me now interpret this dimension in view of the claim by Joos
mentioned in Section 1, that is, the five styles: intimate, casual,
consultative, formal, and frozen. The position of ay, cor, ta, and
aye alone may be enough to explain why the positive side of this
dimension can be regarded as a scale of casualness. If we can
safely regard the negative side as more formal, then hmm, aargh,
and hm are more frequently used in formal language (see Note 9).
Thus, it is intriguing that a degree of formality or casualness can
be assigned to words tagged as ITJ (interjection or isolate) along
this dimension.

Table 10 Frequency counts of words tagged as ITJ (per sentence), by social class (sdecla)

Word     AB      C1      C2      DE
Yeah     0.1141  0.1594  0.1831  0.1622
Oh       0.0940  0.1115  0.1129  0.1119
No       0.0783  0.0840  0.0880  0.0900
Mm       0.0433  0.0667  0.0640  0.0552
yes      0.0629  0.0386  0.0382  0.0292
Ah       0.0133  0.0168  0.0143  0.0159
Ooh      0.0066  0.0112  0.0114  0.0125
aye      0.0021  0.0064  0.0071  0.0117
ha       0.0046  0.0072  0.0044  0.0054
Eh       0.0042  0.0051  0.0059  0.0050
Mhm      0.0043  0.0047  0.0036  0.0042
Hello    0.0050  0.0046  0.0026  0.0033
dear     0.0033  0.0036  0.0035  0.0036
aha      0.0033  0.0026  0.0018  0.0041
bye      0.0029  0.0033  0.0024  0.0022
Yep      0.0015  0.0028  0.0016  0.0016
hey      0.0015  0.0016  0.0018  0.0012
Cor      0.0006  0.0011  0.0025  0.0012
Urgh     0.0015  0.0012  0.0006  0.0009
ee       0.0008  0.0007  0.0007  0.0015
Hm       0.0014  0.0012  0.0007  0.0001
Tt       0.0011  0.0008  0.0005  0.0006
Hi       0.0011  0.0004  0.0010  0.0003
ta       0.0002  0.0007  0.0010  0.0006
Huh      0.0008  0.0005  0.0003  0.0009
Ya       0.0006  0.0007  0.0005  0.0008
hmm      0.0016  0.0003  0.0003  0.0003
blah     0.0004  0.0005  0.0004  0.0012
oi       0.0010  0.0005  0.0005  0.0003
Ow       0.0008  0.0005  0.0004  0.0004
ho       0.0004  0.0005  0.0009  0.0002
Wow      0.0015  0.0012  0.0006  0.0009
Gosh     0.0008  0.0007  0.0007  0.0015
Ay       0.0014  0.0012  0.0007  0.0001
Aargh    0.0011  0.0008  0.0005  0.0006
blimey   0.0011  0.0004  0.0010  0.0003

Fig. 2 The distribution of words tagged as ITJ and social classes (horizontal axis: Dimension 1; vertical axis: Dimension 2)
4.2 Distribution of Words Tagged as VM0

The modal verbs are analyzed in the same manner. Common words
tagged as VM0 are, in descending order: would, will, can, could,
may, should, must, 'll, might, and 'd. The frequency counts of the
modal auxiliary verbs in all four social classes are listed in
Table 11. The table of frequencies of modal auxiliary verbs against
the four social classes is analysed, and their behaviour is shown
in Fig. 3. Here too the social classes are lined up in the order of
social level along Dimension 1, so this dimension can again be
regarded as a criterion of social class. More interestingly, may is
very far from the origin along the y axis, suggesting that may is a
salient marker of the negative side (see Note 10). The same is true
of 'd on the positive side. It is also noted that can and will are
more often used in the top class, whereas might is more often used
in the lowest class.

Table 11 Frequency counts of words tagged as VM0 (per sentence), by social class (sdecla)

Word     AB      C1      C2      DE
would    0.0161  0.0141  0.0162  0.0161
will     0.0090  0.0082  0.0087  0.0071
can      0.0274  0.0241  0.0253  0.0209
could    0.0124  0.0135  0.0132  0.0133
may      0.0013  0.0008  0.0007  0.0009
should   0.0071  0.0066  0.0083  0.0063
must     0.0048  0.0044  0.0055  0.0044
'll      0.0293  0.0320  0.0359  0.0332
might    0.0053  0.0054  0.0067  0.0063
'd       0.0081  0.0085  0.0107  0.0112
ca       0.0130  0.0119  0.0142  0.0142
shall    0.0025  0.0024  0.0031  0.0022
wo       0.0053  0.0062  0.0083  0.0070
used     0.0038  0.0041  0.0048  0.0049
lets     0.0032  0.0021  0.0022  0.0013
ought    0.0006  0.0006  0.0012  0.0006
need     0.0001  0.0001  0.0001  0.0001
dear     0.0001  0.0002  0.0001  0.0002

Fig. 3 Distribution of words tagged as VM0 and social classes (horizontal axis: Dimension 1; vertical axis: Dimension 2)
4.3 Distribution of Words Tagged as XX0

Words tagged as XX0 are even fewer than those tagged as ITJ. They
are not, n't, n, nt, and nae, in order from higher to lower social
class. Their frequency counts in all social classes are listed in
Table 12. The account of the social class scale is also supported
by the distribution of the social classes, which are lined up in
sequence from higher to lower (see Fig. 4). The graph indicates
that nae is definitely used by lower social classes whereas not
tends to be used by higher social classes.

Table 12 Frequency counts of words tagged as XX0 (per sentence), by social class (sdecla)

Word   AB       C1       C2       DE
n      0.00610  0.00723  0.01275  0.01263
nae    0.00002  0.00007  0.00009  0.00019
not    0.05243  0.05133  0.05873  0.05729
nt     0        0.00001  0        0.00002
n't    0.17353  0.19108  0.23119  0.21335

Fig. 4 Distribution of words tagged as XX0 and social classes (horizontal axis: Dimension 1; vertical axis: Dimension 2)
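
The contrast between not and nae can also be checked directly from Table 12 by comparing, within each social class, the share that each negative form takes among all XX0 tokens. The following is a minimal sketch under two assumptions: the function name class_profiles is hypothetical, and the most frequent form in the table is read as n't, since apostrophes appear to have been lost in the source.

```python
CLASSES = ["AB", "C1", "C2", "DE"]

# Per-sentence frequencies of XX0 word forms, copied from Table 12.
# Reading the most frequent form as "n't" is an assumption (see lead-in).
XX0_RATES = {
    "n":   [0.00610, 0.00723, 0.01275, 0.01263],
    "nae": [0.00002, 0.00007, 0.00009, 0.00019],
    "not": [0.05243, 0.05133, 0.05873, 0.05729],
    "nt":  [0.0,     0.00001, 0.0,     0.00002],
    "n't": [0.17353, 0.19108, 0.23119, 0.21335],
}


def class_profiles(table):
    """For each class, the share of every word form among all XX0 tokens."""
    totals = [sum(col) for col in zip(*table.values())]
    return {word: [v / t for v, t in zip(rates, totals)]
            for word, rates in table.items()}


profiles = class_profiles(XX0_RATES)
for word in ("not", "nae"):
    shares = ", ".join(f"{c}: {s:.5f}" for c, s in zip(CLASSES, profiles[word]))
    print(f"{word:4s} {shares}")
```

With these figures, not takes its largest share in AB, while the share of nae rises steadily from AB to DE. Profiles of this kind are what a correspondence-type analysis compares when it places words and classes along a dimension.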
5 Conclusion
First of all, it seems clear that the most powerful dimension,
Dimension 1, can be regarded as a criterion to measure the extent
to which the form of the text indicates social class. Therefore,
this study supports the validity of register variation, that is,
of the social variables in the BNC. Let me now discuss features
which can characterize this dimension in terms of the claim made
by Joos. To begin with, the disposition of the dimension can be
identified linguistically to some extent, in view of the highest
values, both positive and negative.
Joos claims that casual style, which is typical of informal speech
between peers, includes ellipsis (or omission of certain
grammatical elements) and slang. Ellipsis occurs more often in
conversation than in written text because conversation tends to be
less explicit. Given this lower explicitness, ellipsis is expected
to appear as a feature of the lower social classes. The present
study shows only contractions such as n't, n, and nt as salient
features of the lower social classes; nevertheless, it is
reasonable to suppose that ellipsis is a crucial discriminator in
Dimension 1. As for slang, assuming that isolated words, a crucial
marker in Dimension 1, are associated with slang, it is perhaps
right to say that slang can be a crucial marker in Dimension 1.
Thus, this analysis demonstrates that the positive side of
Dimension 1 supports Joos's claim to some extent. Another
explanation for the disposition of Dimension 1 may be that usage of
language in terms of semantics serves to prescribe the social class
more clearly than usage in terms of structure does. The
discriminators are hmm, aargh, hm, and oi versus ay, cor, ta, and
aye in ITJ; not versus nae in XX0; and may versus 'd in VM0. Among
them, the modals are noteworthy. Saeed (1997, p. 126) claims that
the modal verbs mark the speaker's attitude of obligation,
responsibility, and permission. It is conceivable that the
speaker's attitude is largely associated with the social class to
which the speaker belongs. Above all, the way people talk will vary
according to their social class. The higher the social class, the
more strongly the way people talk will be restricted by notions
such as obligation, responsibility, and permission, suggesting that
the relevant modals such as may and can describe the
characteristics of higher classes. Thus, Dimension 1, which
measures the extent to which the texts are formal or casual, is
also associated with the speaker's attitude described above, which
tends to prescribe the social class.
Turning now to the issue of formal versus informal style along
this dimension, this problem has so far been left
untouched. Styles are characterized as varieties of
language that can be ranged on a continuum
from very formal to very informal, i.e. intimate,
casual, consultative, formal, and frozen, according
to the claim by Joos. In this respect, as is mentioned
in Section 3.1.1, it can only be said that Dimension
1 has two opposite sides, that is, formal or informal.
It is hardly possible to show the split between
intimate and casual or between formal and frozen in
terms of words relating to these styles. Further
examination is required to clarify the relationship
between these styles and words.
Let us leave the consideration of Dimension 1 and now turn to the
issue of Dimension 2. The interpretation of Dimension 2 in Section
3.1.2 is rather problematic in that the distribution of words along
Dimension 2 in the analyses of ITJ, VM0, and XX0 does not yield a
consistent tendency that could serve as an interpretation of this
dimension. Although Dimension 1 shows a coherent characteristic,
ranking the four social classes in the order of social level, in
Dimension 2 the four social classes do not appear in the same
sequence. This article largely focuses on the register continuum.
Yet, apart from the register continuum, how the tags correlate
with each social class should be scrutinized in more detail in
further investigation. The linguistic interpretation of Dimension 2
remains to be established.
Finally, we should not overlook the controversial nature of the
BNC as an accurate source of linguistic/class distinctions. This
subject deserves more than a passing note. The class demarcations
used in the BNC are highly contentious in social science today.
Taking a brief look at Note 2, which details the occupations in
each social class, hgv driver, for example, arguably should not
belong to the second lowest class, C2. The same is true of tv
engineer, which is also assigned to C2. Rather, they should be
upper class. It is also a fact that linguistic/class distinctions
are old and socially fraught, and subject to great inaccuracy.
Furthermore, there seems to be a contradiction in that all the
speech is claimed to consist of spontaneous natural conversations
among members of the public, whereas my research deals with the
issue of formal style versus casual style. Thus, it is not
altogether clear whether the issue of register can even be
entertained objectively. In these respects, however, my claim is as
follows: admitting that the BNC may have obscure linguistic/class
distinctions in a sense, multivariate analysis helps clarify the
overall tendency of the four social classes and substantiate the
claim concerning the register continuum. It is noteworthy in my
study that the subtle differences in spontaneous natural
conversation among the four social classes can also be identified
using the same methodology. For more precise research on register,
we can for the moment focus on another classification employed in
the BNC, that is, the genre codes defined by Lee (2000). Lee aimed
at a more detailed classification of both written and spoken
texts. This classification will be employed in my further research.
Acknowledgements
This study was funded by a Grant-in-aid for
Scientific Research (C/1) from the Japan Society
for the Promotion of Science and the Ministry of
Education, Science, Sports and Culture. I wish to
thank Prof. Tony McEnery of Lancaster University.
I owe the completion of this paper very much to his
useful comments. I also thank the anonymous
reviewers for their constructive comments.
References
Biber, D. (1988). Variation Across Speech and Writing.
Cambridge: Cambridge University Press.
Chatfield, C. and Collins, A. J. (1980). Introduction to
Multivariate Analysis. London: Chapman and Hall.
Hayashi, C. (1952). On the prediction of phenomena
from qualitative data and the quantification of
qualitative data from the mathematico-statistical point
of view. Annals of the Institute of Statistical
Mathematics, 55: 69–97.
Joos, M. (1959). The isolation of styles. Georgetown
University Monograph Series on Languages and
Linguistics, 12: 107–13.
Labov, W. (1972). Sociolinguistic Patterns. Philadelphia:
University of Pennsylvania Press.
Lee, Young Wey David. (2000). Modelling Variation in
Spoken and Written Language: The Multi-Dimensional
Approach Revised, Ph.D. thesis, Lancaster University.
Mesthrie, R., Swann, J., Deumert, A., and Leap W. L.
(2000). Introducing Sociolinguistics. Edinburgh:
Edinburgh University Press.
Milroy, J. and Milroy, L. (1997). Varieties and variation.
In Florian Coulmas (ed.). The Handbook of
Sociolinguistics. Oxford: Blackwell, pp. 47–64.
Pearson, K. (1901). A User's Guide to Principal
Components. New York: Wiley.
Rencher, Alvin C. (2002). Methods of Multivariate
Analysis. New York: John Wiley & Sons.
Saeed, J. I. (1997). Semantics. Oxford: Blackwell.
Seber, G. A. F. (1984). Multivariate Observations. New
York: Wiley.
Trudgill, Peter. (2002). Sociolinguistic Variation and
Change. Edinburgh: Edinburgh University Press.
Notes
1 The sampling frame was defined in terms of the
language production of the population of British
English speakers in the United Kingdom, according to
Reference Guide for the British National Corpus
(World Edition) (Burnard, 2000) on CD-ROM.
2 Details of the occupation in each social class are as
follows. AB: moderator, teacher, doctor, BBC
employee, chairman of the board, lecturer, barrister,
judge, sales executive, director general, group captain,
member of parliament, council chairman, deputy
prison governor; C1: administrative assistant, stable
hand, teacher, student, unemployed, catering manager,
retired, secretary, housewife, lecturer; C2: student, team
leader, retired, housewife, hgv driver, miner
chargehand, carpenter, tv engineer, taxi driver, driving
instructor, landscape gardener, home care assistant,
engineer, nurse, apprentice engineer, care assistant,
take-away worker, chargehand, crossing warden,
electrician, telecommunication engineer, aircraft

engineer; DE: student, sales executive, councillor,
trainee, driver, unemployed, forecourt attendant, out of
work (pt), engineer, factory operative, production
worker, disabled unemployed, stores person, plasterer,
nurse (pt), aircraft engineer, painter.
3 The domains are: natural and pure science (wridom2),
world affairs (wridom5), leisure (wridom9). Excerpts
were taken at random from fifty-four texts in each
domain. As a result of three multivariate analyses, the
extent to which the texts of the same domain are
clustered can be visually identified. Taking into account
the dispersion of plotted texts in the two powerful
dimension charts, in FA the texts of the three domains
are mingled loosely and it is hard to draw borderlines
among three domains. On the other hand, in both PCA
and EHT3, we can easily draw border lines visually.
Therefore, it is safe to say that FA has less power to
discriminate or cluster the texts of the same domains
than PCA and EHT3.
4 The grammatical description of each tag can be referred
to in Tables 6 through 9. The ratio is per sentence.
Traditionally, the normalization of the frequency count
has tended to be calculated as the ratio to one word or
to certain units of words. That is, the raw frequency
counts are divided by the entire number of words
contained in the target text. However, I supposed that
the normalization should be calculated in terms of the
sentence length in a text, not of the text length in a text.
This assertion was based on the grounds that if focus is
put upon the linguistic features concerned with
sentences, normalization should be made considering
the sentence as a basic unit. My assumption is as
follows: admitting the claim that exposition tends to
have longer sentences than fiction, expository text and
fictional text that have the same amount of words seem
to contain a different number of sentences. In other
words, in the same number of words of both an
expository text and fictional text, fictional texts have
more sentences than expository texts because fiction
tends to have shorter sentences than exposition. So, the
expository text is liable to have more features related to
sentence structures superficially. It is not reasonable to
compare the frequency count of features in terms of
text length. Therefore, in order to deal with any feature
simultaneously, normalization is made in terms of the
sentence length in a text.
5 The preliminary analysis adapted as features: all 61 tags,
the top 40 tags, the top 30 tags, and the top 24 tags. At
the conclusion of the research, it was found that the top
24 tags are enough to reveal reliable dimensions that
enable the classification of social classes.
6 Concerning the process of calculation, I used the
ready-made program which was developed by the Institute of
Mathematics and Statistics in Japan.
7 In this respect, Biber (1988) employs a different
methodology for an interpretation of factors. The way
in which my interpretation is different from his is that
Biber takes only the feature distribution into account,
whereas I take the distribution of both categories and
features. This means that Biber at first interprets the
factor in terms of the distribution of features and later
he adopts the interpreted factors to the textual relation.
The difference mainly lies in the fact that factor
analysis, which Biber employed, does not produce the biplot
mentioned in Section 2.1. Therefore the interpretation
must be made separately. This procedure is often
employed in discussing the relationship between
variables and items in factor analysis.
8 Tags such as UNC (unclassified items) and
PUN (punctuation) are disregarded in the discussion
although they have high values in Table 7.
9 Parts of sentences including hmm in the top social
class are as follows.
KB8 7233: We can hear you, hmm, erm Do you want a Polo?
KB8 7583: yeah Hmm Aye you are not, people'll
KBM 1530: we have nothing to do with it hmm What? It was an accident
KBM 1773: they? They can if I can, hmm, er when's the bed going to get there?
KBS 1081: ? Hmm. Keeping a few cards up your sleeve.
KCD 85: It's a bit cold. Hmm. I was just reading that.
KCD 198: he's got one like that. Hmm. Red. Hmm.
Although the present study suggests that hmm is closely
associated with higher social classes, it actually co-occurs
with yeah and Aye in KB8 7583, which are typically associated
with the lower classes. Such co-occurrence is thus possible;
statistically, however, the difference between them is
apparent.
10 Parts of sentences including may in the top social
class are as follows.
J3M 333: we have a show of hands on that? May well Yes. the erm
J3M 382: one slip through his fingers me were, we may well then be liable to be sued by the person
JJ6 17: right, so your as an individual your income may be very low, right, in bad years, right
JJ6 34: subsequent calculations considerably easier, and you may think in actual fact that linear demand curves are
JJ6 35: linear demand curves are quite restrictive. We may not expect consumer behaviour, right, to be the
JJ6 37: , average and marginal functions As you may be aware that economists are obsessed with the . . .
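
The argument for per-sentence normalization made in Note 4 can be illustrated with a toy calculation. The sketch below is only an illustration: the two texts and all of their figures are hypothetical and do not come from the BNC, and the function names are chosen here for convenience.

```python
def per_thousand_words(count, n_words):
    """Conventional normalisation: occurrences per 1,000 words."""
    return 1000.0 * count / n_words


def per_sentence(count, n_sentences):
    """Normalisation argued for in Note 4: occurrences per sentence."""
    return count / n_sentences


# Hypothetical texts of equal length; none of these figures come from the BNC.
texts = {
    "exposition": {"words": 1000, "sentences": 40, "feature_count": 20},
    "fiction":    {"words": 1000, "sentences": 100, "feature_count": 20},
}

for name, t in texts.items():
    print(name,
          per_thousand_words(t["feature_count"], t["words"]),
          per_sentence(t["feature_count"], t["sentences"]))
```

In this toy case the two texts are indistinguishable per 1,000 words (20.0 each) but differ per sentence (0.5 against 0.2), which is the kind of difference that word-based normalization would hide for features tied to sentence structure.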


Literary and Linguistic Computing, Vol. 21, No. 1, 2006
