Kaoru Takahashi
Toyota National College of Technology, Japan
Correspondence:
Kaoru Takahashi, 2-1 Eisei-Cho, Toyota City, Aichi-Pref. 471-8525, Japan.
Email: takahasi@toyota-ct.ac.jp

Abstract
This article is concerned with the study of register variation, that is, with
the similarities and dissimilarities between register categories in terms of
various linguistic phenomena. The British National Corpus World Edition, which
is a 100 million word collection of British English, is used to characterize
register variation by identifying the linguistic characteristics of each register.
By means of multivariate analysis, the variation in the occurrence of selected
linguistic features among registers is classified. A multivariate analysis holds
out the promise of being able to systematize the register categories in the corpus
while also revealing the characteristic linguistic features of the groups classified.
In this article, the dimensions revealed by the multivariate analysis are
interpreted linguistically by focusing on a sociolinguistic variable which is
fairly systematically associated with social class in the British National Corpus.
The linguistic dimension concerned with formal style versus casual style proved
the validity of the social variable in the British National Corpus and enabled
its characterization in the light of linguistic features. Furthermore, several
words which pertain to interjection, filler, modal auxiliary verb, and negation,
e.g. hmm, ay, may, 'd, not, and nae, turned out to be crucial markers of the
register in which texts are used.
1 Introduction
Register can be regarded as a general term for any
language variety defined in terms of a particular
constellation of situational characteristics. Conrad
and Biber (2001, p. 3) claim:
Register distinctions are defined in nonlinguistic terms, including the speaker's
purpose in communication, the topic, the
relationship between speakers and hearer, and
the production circumstances and there are
usually important linguistic differences across
registers that correspond to the differences in
situational characteristics.
Literary and Linguistic Computing, Vol. 21, No. 1, 2006. © The Author 2005. Published by Oxford University Press on
behalf of ALLC and ACH. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
doi:10.1093/llc/fqi028
Advance Access Published on 18 May 2005
2 Methodology
2.1 Analysis in Terms of Register
In order to understand the hypothesis outlined in
Section 1, let us explore social variables in the BNC.
The BNC World Edition is a 100 million word
collection of samples of written (82.82%) and
spoken (17.78%) British English from the late
20th century. Spoken texts are organized in two
parts, a context-governed part containing orthographic transcriptions of recordings made at specific
types of meeting and event, and a demographic part
containing orthographic transcriptions of spontaneous natural conversations made by members of
the public.1 The present study addresses the
demographic part only. This contains 4,211,216
words.
Because this study focuses on social class, we
concentrate here on the social class variable encoded
in the corpus metadata. Table 1 shows the values of
this variable.2 The variable is used to classify the
texts of the demographic section of the BNC by means of a
multivariate analysis, the exact form of which is
discussed shortly.
Value    Number of Sentences
AB       197,803
C1       169,384
C2       144,876
DE       93,159
UU       5,339
         Social Class
Tags     AB       C1       C2       DE
PUN      1.2036   1.0857   1.0946   1.0502
PNP      1.0361   0.9848   1.1540   1.0572
NN1      0.5235   0.4845   0.5511   0.5014
AV0      0.4841   0.4386   0.5050   0.4331
ITJ      0.3257   0.3377   0.3497   0.3391
AT0      0.3330   0.3108   0.3476   0.2996
PRP      0.3006   0.2725   0.3246   0.2836
DT0      0.2282   0.2258   0.2622   0.2285
VVI      0.2383   0.2141   0.2451   0.2181
VVB      0.2317   0.2157   0.2445   0.2258
AJ0      0.2246   0.2045   0.2203   0.1970
CJC      0.2156   0.1966   0.2367   0.2040
VBZ      0.2068   0.2037   0.2184   0.2011
XX0      0.1616   0.1532   0.1861   0.1788
VM0      0.1516   0.1455   0.1670   0.1500
NN2      0.1336   0.1200   0.1346   0.1170
NP0      0.1268   0.1103   0.1294   0.1296
CRD      0.0981   0.1087   0.1318   0.1569
VVD      0.1002   0.1053   0.1354   0.1207
UNC      0.1135   0.1009   0.1079   0.0903
CJS      0.1054   0.0940   0.1140   0.1012
TO0      0.1023   0.0933   0.1068   0.0889
VVG      0.0956   0.0851   0.0992   0.0860
AVP      0.0830   0.0841   0.1018   0.0927
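\[
r = \frac{\dfrac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{m}\delta_{ij}\,x_i y_j
      - \left(\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}\,x_i\right)
        \left(\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}\,y_j\right)}
     {\sqrt{\left\{\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}\,x_i^{2}
      - \left(\dfrac{1}{N}\sum_{i=1}^{n}\delta_{i\cdot}\,x_i\right)^{2}\right\}
      \left\{\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}\,y_j^{2}
      - \left(\dfrac{1}{N}\sum_{j=1}^{m}\delta_{\cdot j}\,y_j\right)^{2}\right\}}}
\]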
similarities and dissimilarities in the frequency
matrix between social variables and also between tags.
The procedure rearranges the rows
and columns of the original data matrix so that the
proportionally large frequency figures converge
around the diagonal. Consequently, social
variables placed close together, and tags placed
close together, are considered to be qualitatively
similar, while those located far from one another are
qualitatively different. Statistically, the analysis
assigns quantities to the social
variables and to the tags in the original data matrix in
such a way that the assigned quantities yield the highest
possible correlation coefficient between the two. As a result
of this calculation, several dimensions appear
which serve as the criteria to characterize social
classes.6 The dimensions are later interpreted
sociolinguistically. Because both social variables
and tags have weights on every dimension simultaneously, the weights help us to interpret each
criterion for characterizing the social classes by
considering the variations and relationships among
social classes and part-of-speech tags.
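The idea of assigning quantities to rows and columns so that their correlation is as high as possible can be sketched numerically. The following is only a minimal illustration under stated assumptions, not the program used in this study: it assumes NumPy, obtains the scores through a singular value decomposition of the kind commonly used in correspondence analysis (a technique swapped in here for illustration), and takes a handful of rows as they appear in Table 2, with the columns assumed to be AB, C1, C2, and DE. The function names quantify and correlation are hypothetical.

```python
import numpy as np

# A few tags as they appear in Table 2; columns assumed to be AB, C1, C2, DE.
tags = ["CRD", "VVD", "XX0", "AJ0", "NN2"]
F = np.array([
    [0.0981, 0.1087, 0.1318, 0.1569],
    [0.1002, 0.1053, 0.1354, 0.1207],
    [0.1616, 0.1532, 0.1861, 0.1788],
    [0.2246, 0.2045, 0.2203, 0.1970],
    [0.1336, 0.1200, 0.1346, 0.1170],
])

def quantify(F):
    """Return, for each non-trivial axis, the correlation and the row/column scores."""
    P = F / F.sum()                      # relative frequencies
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Subtracting the product of the masses removes the trivial solution
    # (the one whose eigenvalue equals 1).
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(F.shape) - 1                 # number of non-trivial axes
    x = U[:, :k] / np.sqrt(r)[:, None]   # row (tag) scores
    y = Vt[:k].T / np.sqrt(c)[:, None]   # column (social class) scores
    return s[:k], x, y

def correlation(F, x, y):
    """Frequency-weighted correlation between row scores x and column scores y."""
    N = F.sum()
    fi = F.sum(axis=1)                   # row totals
    fj = F.sum(axis=0)                   # column totals
    mean_x = fi @ x / N
    mean_y = fj @ y / N
    cov = x @ F @ y / N - mean_x * mean_y
    var_x = fi @ x**2 / N - mean_x**2
    var_y = fj @ y**2 / N - mean_y**2
    return cov / np.sqrt(var_x * var_y)

corr, x, y = quantify(F)
r1 = correlation(F, x[:, 0], y[:, 0])
print(np.round(corr, 4), round(float(r1), 4))
```

In this sketch the correlation computed from the first pair of score vectors coincides with the first singular value, which is what "the highest correlation coefficient between the two" amounts to. Because only five rows are used, the numbers are not expected to reproduce the results reported in this article.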
Now suppose that a numerical value fij is given to
the jth subcategory of the ith item as follows.
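\[
F = \begin{bmatrix}
f_{11} & f_{12} & \cdots & f_{1j} & \cdots & f_{1m} \\
f_{21} & f_{22} & \cdots & f_{2j} & \cdots & f_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
f_{i1} & f_{i2} & \cdots & f_{ij} & \cdots & f_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nj} & \cdots & f_{nm}
\end{bmatrix}
\]
\[
D = (\delta_{ij}) = \begin{bmatrix}
\delta_{11} & \delta_{12} & \cdots & \delta_{1j} & \cdots & \delta_{1m} \\
\delta_{21} & \delta_{22} & \cdots & \delta_{2j} & \cdots & \delta_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
\delta_{i1} & \delta_{i2} & \cdots & \delta_{ij} & \cdots & \delta_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
\delta_{n1} & \delta_{n2} & \cdots & \delta_{nj} & \cdots & \delta_{nm}
\end{bmatrix},
\quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, m
\]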
are necessary.
By solving the above characteristic equation,
at most k eigenvalues can be
obtained, where k is the smaller of n
and m. Among the eigenvalues, the largest, which
is equal to 1, is excluded, and each of the other
eigenvalues yields a corresponding eigenvector. The eigenvectors are the quantities
(x1, x2, . . . , xn), which are assigned to the n items, and
the quantities (y1, y2, . . . , ym), which are assigned to the m
categories. We can employ almost the same
procedure in the case of the numerical value fij,
instead of the above dichotomous response pattern δij.
However, because it is not necessary for the purpose
of the present study to enter into a detailed
discussion of this calculation, these values are not
Tag      Dimension 1   Dimension 2
PUN      0.49483       0.57667
PNP      0.258207      0.110518
NN1      0.005233      0.036192
AV0      0.12424       0.257758
ITJ      0.087467      0.28158
AT0      0.11245       0.16239
PRP      0.014521      0.206288
DT0      0.095129      0.179705
VVI      0.06211       0.080389
VVB      0.02669       0.01659
AJ0      0.15862       0.00837
CJC      0.025429      0.224032
VBZ      0.00705       0.06508
XX0      0.225889      0.02552
VM0      0.051563      0.06311
NN2      0.10345       0.081673
NP0      0.080286      0.14587
CRD      0.602305      0.46212
VVD      0.30724       0.192747
UNC      0.20874       0.087
CJS      0.031885      0.108259
TO0      0.08473       0.161182
VVG      0.04706       0.105171
AVP      0.176196      0.078303
3 Results
By solving the above equation, we obtain eigenvalues whose maximum number is the smaller
of n and m. Eigenvalues are a special set of scalars
associated with a matrix equation. Excluding the
maximum eigenvalue, namely 1, the eigenvectors
corresponding to the remaining (k - 1) eigenvalues are obtained. That is,
each eigenvalue is paired with a corresponding
eigenvector. These eigenvectors are the
scores (x1, x2, . . . , xm) given to the tags and the scores
(y1, y2, . . . , yn) given to the social classes, respectively.
In Table 2, k is 4, so we obtain three eigenvalues.
Also, the scores of the 24 tags and the four social
classes are obtained according to the
corresponding eigenvalues. The corresponding axes are
called axis 1, axis 2, etc., in descending order of eigenvalue. As a
result of the calculation on Table 2, we obtain
eigenvectors as in Table 3, along with the proportion
accounted for as in Table 4. Each eigenvalue
denotes the degree to which the tags or social
groups have salient characteristics on the axis. An
axis refers to a dimension. Proportion accounted
for is the amount of information contained in the
original data matrix that is explained by the axis in
question. Cumulative proportion indicates the
percentage of the information in the original data
explained up to the axis in question. In practice, not
all sets of social class scores and tag scores are used
Axis   Proportion accounted for   Cumulative Proportion
1      0.7470                     0.7470
2      0.1925                     0.9395
3      0.0605                     1.0000
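The relationship between the proportion accounted for and the cumulative proportion is simply a running sum over the axes. A minimal sketch, using the three proportions reported in Table 4 (the variable names are illustrative only):

```python
from itertools import accumulate

# Proportion accounted for by axes 1 to 3, as reported in Table 4.
proportion = [0.7470, 0.1925, 0.0605]

# Cumulative proportion: running sum over the axes, rounded to four decimals.
cumulative = [round(total, 4) for total in accumulate(proportion)]

for axis, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"axis {axis}: proportion {p:.4f}, cumulative {c:.4f}")
```

The running sum shows that the first two axes together explain 0.9395 of the information in the original data matrix.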
Tag        Dimension 1   Dimension 2
sdecla1    0.05370       0.00199
sdecla2    0.01960       0.00976
sdecla3    0.02360       0.03100
sdecla4    0.04910       0.02240
PUN        0.04810       0.02850
PNP        0.02570       0.00559
NN1        0.00075       0.00262
AV0        0.01870       0.01960
ITJ        0.01540       0.02520
AT0        0.02030       0.01490
PRP        0.00274       0.01970
DT0        0.02000       0.01920
VVI        0.01330       0.00873
VVB        0.00571       0.00180
AJ0        0.03530       0.00095
CJC        0.00564       0.02520
VBZ        0.00159       0.00743
XX0        0.05610       0.00322
VM0        0.01350       0.00837
NN2        0.02980       0.01190
NP0        0.02330       0.02150
CRD        0.17500       0.06830
VVD        0.09260       0.02950
UNC        0.06660       0.01410
CJS        0.01010       0.01750
TO0        0.02770       0.02680
VVG        0.01590       0.01810
AVP        0.06000       0.01350
Fig. 1 The distribution of four social classes and tags
[Axes: Dimension 1 and Dimension 2; plotted points: the social classes AB, C1, C2, DE and the part-of-speech tags]
Table 6 Tags and the social class characteristic of the positive side in Dimension 1
Tag and Social Class   Score
CRD                    0.1750
VVD                    0.0926
AVP                    0.0600
XX0                    0.0561
sdecla4                0.0491
PNP                    0.0257
sdecla3                0.0236
3.1.2 Dimension 2
In Dimension 2, the four social classes do not show
the coherent relationship observed in
Dimension 1. However, in Dimension 2,
a conspicuous contrast is observed between C2
(skilled manual) and DE (semi-skilled or unskilled),
namely the second lowest class and the lowest class,
Table 7 Tags and the social class characteristic of the negative side in Dimension 1
Tag and Social Class   Score
UNC                    0.0666
sdecla1                0.0537
PUN                    0.0481
AJ0                    0.0353
NN2                    0.0298
TO0                    0.0277
AT0                    0.0203
sdecla2                0.0196
Table 8 Tags and the class characteristic of the negative side in Dimension 2
Tag and Social Class   Score
CRD                    0.0683
PUN                    0.0285
ITJ                    0.0252
sdecla4                0.0224
NP0                    0.0215
sdecla2                0.00976
4 Interpretation
When the multivariate analysis reveals distinguishing tags in a dimension, the dimension enables the
characterization of the structure of the sentences
relevant to those tags. A distinguishing tag is
a tag which is located relatively far from the origin
of the coordinates. Such a tag describes the disposition
of the dimension, so that the dimension can be
interpreted linguistically. If a social class is far from
the origin of the coordinates, the social
class is strongly associated with the side of the
dimension on which it lies. The farther
from the origin of the coordinates the social
variable is located, the more strongly the
variable describes the disposition of the dimension.
Also, the closer the social variable is located to particular tags,
the more likely it is that there is a strong relationship between them. In other words, those tags are
commonly used in that social class. There are other
things to note. The positive and the negative
sides are expected to assume contrasting characteristics. These notions help to interpret dimensions
linguistically. Lastly, we should not overlook that
dimensions can be regarded as continuums of
linguistic variation.

Table 9 Tags and the class characteristic of the positive side in Dimension 2
Tag and Social Class   Score
sdecla3                0.0310
VVD                    0.0295
TO0                    0.0268
CJC                    0.0252
PRP                    0.0197
AV0                    0.0196
DT0                    0.0192
VVG                    0.0181
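The notion of a distinguishing tag, namely one located relatively far from the origin of the coordinates, can be sketched as a ranking by distance. The example below is an illustration only: it uses the score magnitudes of a few tags as listed in the table of Dimension 1 and Dimension 2 scores, and the helper name distance_from_origin is hypothetical. Because a distance depends only on magnitudes, the side (positive or negative) on which a tag lies does not affect the ranking.

```python
import math

# (Dimension 1, Dimension 2) score magnitudes for a few tags,
# as listed in the table of scores for social classes and tags.
scores = {
    "CRD": (0.17500, 0.06830),
    "VVD": (0.09260, 0.02950),
    "UNC": (0.06660, 0.01410),
    "AVP": (0.06000, 0.01350),
    "XX0": (0.05610, 0.00322),
    "VBZ": (0.00159, 0.00743),
    "NN1": (0.00075, 0.00262),
}

def distance_from_origin(point):
    """Euclidean distance of a (Dimension 1, Dimension 2) point from the origin."""
    return math.hypot(*point)

# Tags farthest from the origin come first.
ranked = sorted(scores, key=lambda tag: distance_from_origin(scores[tag]), reverse=True)

for tag in ranked:
    print(f"{tag}: {distance_from_origin(scores[tag]):.4f}")
```

Among these seven tags, CRD lies farthest from the origin and NN1 closest to it, which is consistent with CRD heading the list of characteristic tags and NN1 not appearing in it.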
Based on the concept and procedure described
above, the most powerful dimension, Dimension 1,
can be interpreted. We must draw attention to
particular tags located at the far ends of both sides
of Dimension 1, namely VVD (the past tense form
of lexical verbs), AVP (adverb particle), and XX0
(the negative particle) on the positive side, and AJ0
(adjective) and NN2 (plural common noun) on the
negative side. However, it seems hard to deal with
tags alone as linguistic features, which suggests that
the words relevant to particular tags should
be examined alongside the tags in interpreting the
dimension. Therefore, this article focuses on XX0
as the feature expected to give us
a brief understanding of this dimension. Another
reason is that XX0 is most relevant to the concept of
formality mentioned in Section 1. In this respect,
furthermore, ITJ (interjection or other isolate) and
VM0 (modal auxiliary verb) are added in the
following analyses, although they do not have high
scores along Dimension 1. The reason is that
we also have to pay attention to tags
that do not reveal any characteristics along a
dimension. Even if a tag is close to the origin of the
coordinates, the distinguishing words relevant to the
Class: sdecla
Tags      AB       C1       C2       DE
Yeah      0.1141   0.1594   0.1831   0.1622
Oh        0.0940   0.1115   0.1129   0.1119
No        0.0783   0.0840   0.0880   0.0900
Mm        0.0433   0.0667   0.0640   0.0552
yes       0.0629   0.0386   0.0382   0.0292
Ah        0.0133   0.0168   0.0143   0.0159
Ooh       0.0066   0.0112   0.0114   0.0125
aye       0.0021   0.0064   0.0071   0.0117
ha        0.0046   0.0072   0.0044   0.0054
Eh        0.0042   0.0051   0.0059   0.0050
Mhm       0.0043   0.0047   0.0036   0.0042
Hello     0.0050   0.0046   0.0026   0.0033
dear      0.0033   0.0036   0.0035   0.0036
aha       0.0033   0.0026   0.0018   0.0041
bye       0.0029   0.0033   0.0024   0.0022
Yep       0.0015   0.0028   0.0016   0.0016
hey       0.0015   0.0016   0.0018   0.0012
Cor       0.0006   0.0011   0.0025   0.0012
Urgh      0.0015   0.0012   0.0006   0.0009
ee        0.0008   0.0007   0.0007   0.0015
Hm        0.0014   0.0012   0.0007   0.0001
Tt        0.0011   0.0008   0.0005   0.0006
Hi        0.0011   0.0004   0.0010   0.0003
ta        0.0002   0.0007   0.0010   0.0006
Huh       0.0008   0.0005   0.0003   0.0009
Ya        0.0006   0.0007   0.0005   0.0008
hmm       0.0016   0.0003   0.0003   0.0003
blah      0.0004   0.0005   0.0004   0.0012
oi        0.0010   0.0005   0.0005   0.0003
Ow        0.0008   0.0005   0.0004   0.0004
ho        0.0004   0.0005   0.0009   0.0002
Wow       0.0015   0.0012   0.0006   0.0009
Gosh      0.0008   0.0007   0.0007   0.0015
Ay        0.0014   0.0012   0.0007   0.0001
Aargh     0.0011   0.0008   0.0005   0.0006
blimey    0.0011   0.0004   0.0010   0.0003
[Figure: distribution of interjection words (e.g. ay, hm, hi, cor, blimey, ta, oi, yes, Yeah, Urgh, hmm, aargh, huh, Mhm, Ooh, ee, aye, blah) and social classes AB, C1, C2, DE; axes: Dimension 1 and Dimension 2]
Class: sdecla
Tags      AB       C1       C2       DE
would     0.0161   0.0141   0.0162   0.0161
will      0.0090   0.0082   0.0087   0.0071
can       0.0274   0.0241   0.0253   0.0209
could     0.0124   0.0135   0.0132   0.0133
may       0.0013   0.0008   0.0007   0.0009
should    0.0071   0.0066   0.0083   0.0063
must      0.0048   0.0044   0.0055   0.0044
'll       0.0293   0.0320   0.0359   0.0332
might     0.0053   0.0054   0.0067   0.0063
'd        0.0081   0.0085   0.0107   0.0112
ca        0.0130   0.0119   0.0142   0.0142
shall     0.0025   0.0024   0.0031   0.0022
wo        0.0053   0.0062   0.0083   0.0070
used      0.0038   0.0041   0.0048   0.0049
lets      0.0032   0.0021   0.0022   0.0013
ought     0.0006   0.0006   0.0012   0.0006
need      0.0001   0.0001   0.0001   0.0001
dear      0.0001   0.0002   0.0001   0.0002
[Figure: distribution of modal auxiliary words (can, will, should, must, 'll, could, would, might, 'd, may) and social classes AB, C1, C2, DE; axes: Dimension 1 and Dimension 2]
Fig. 4 Distribution of words tagged as XX0 and social classes
[Axes: Dimension 1 and Dimension 2; plotted points: nt, nae, n, not, n't and the social classes AB, C1, C2, DE]
Table 12 Frequency counts of words tagged as XX0 (per sentence)
Class: sdecla
Tags     AB        C1        C2        DE
n        0.00610   0.00723   0.01275   0.01263
nae      0.00002   0.00007   0.00009   0.00019
not      0.05243   0.05133   0.05873   0.05729
nt       0         0.00001   0         0.00002
n't      0.17353   0.19108   0.23119   0.21335
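How a word-level table of this kind can be read against the social classes can be sketched as follows. The snippet is an illustration only: it takes the per-sentence frequencies of three negation forms from Table 12, reports the class in which each form is most frequent, and expresses each frequency relative to class AB. The names peak_class and ratio_to_ab are hypothetical.

```python
# Per-sentence frequencies of three words tagged XX0, as given in Table 12.
classes = ["AB", "C1", "C2", "DE"]
freq = {
    "n":   [0.00610, 0.00723, 0.01275, 0.01263],
    "nae": [0.00002, 0.00007, 0.00009, 0.00019],
    "not": [0.05243, 0.05133, 0.05873, 0.05729],
}

def peak_class(values):
    """Return the social class with the highest per-sentence frequency."""
    return classes[values.index(max(values))]

def ratio_to_ab(values):
    """Express each class frequency relative to the frequency in class AB."""
    return [v / values[0] for v in values]

peaks = {word: peak_class(values) for word, values in freq.items()}

for word, values in freq.items():
    ratios = ", ".join(f"{c}={r:.2f}" for c, r in zip(classes, ratio_to_ab(values)))
    print(f"{word}: peak in {peaks[word]}; relative to AB: {ratios}")
```

In this reading, nae is most frequent in DE, whereas n and not peak in C2.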
5 Conclusion
First of all, there seems to be no doubt that the most
powerful dimension, Dimension 1, can be regarded
as a criterion to measure the extent to which the form
of the text indicates social class. Therefore, this
study proved the validity of register variation, that is,
the social variables in the BNC. Let me now discuss
features which can characterize this dimension in
terms of the claim made by Joos. First of all, the
disposition of the dimension can be identified
linguistically to some extent, in view of the highest
values, both positive and negative.
Joos claims that casual style, which is typical
of informal speech between peers, includes ellipsis
Acknowledgements
This study was funded by a Grant-in-aid for
Scientific Research (C/1) from the Japan Society
for the Promotion of Science and the Ministry of
Education, Science, Sports and Culture. I wish to
thank Prof. Tony McEnery of Lancaster University,
to whose useful comments I owe much in the
completion of this paper. I also thank the anonymous
reviewers for their constructive comments.
References
Biber, D. (1988). Variation Across Speech and Writing.
Cambridge: Cambridge University Press.
Chatfield, C. and Collins, A. J. (1980). Introduction to
Multivariate Analysis. London: Chapman and Hall.
Notes
1 The sampling frame was defined in terms of the
language production of the population of British
English speakers in the United Kingdom, according to
Reference Guide for the British National Corpus
(World Edition) (Burnard, 2000) on CD-ROM.
2 Details of the occupation in each social class are as
follows. AB: moderator, teacher, doctor, BBC
employee, chairman of the board, lecturer, barrister,
judge, sales executive, director general, group captain,
member of parliament, council chairman, deputy
prison governor; C1: administrative assistant, stable
hand, teacher, student, unemployed, catering manager,
retired, secretary, housewife, lecturer; C2: student, team
leader, retired, housewife, hgv driver, miner
chargehand, carpenter, tv engineer, taxi driver, driving
instructor, landscape gardener, home care assistant,
engineer, nurse, apprentice engineer, care assistant,
take-away worker, chargehand, crossing warden,
6 For the calculation, I used a ready-made program which was developed by the Institute of
Mathematics and Statistics in Japan.
7 In this respect, Biber (1988) employs a different
methodology for the interpretation of factors. My
interpretation differs from his in that
Biber takes only the distribution of features into account,
whereas I take into account the distribution of both categories and
features. That is, Biber first interprets each
factor in terms of the distribution of features and only later
applies the interpreted factors to the relations among texts.
The difference mainly lies in the fact that factor
analysis, which Biber employed, is not the biplot method
mentioned in Section 2.1. Therefore the interpretation
must be made separately. This procedure is often
employed in discussing the relationship between
variables and items in factor analysis.
8 Tags such as UNC (unclassified items) and
PUN (punctuation) are excluded from the discussion
although they have high values in Table 7.
9 A part of a sentence including hmm in the top social
class is as follows. KB8 7233: We can hear you, hmm,
erm Do you want a Polo? KB8 7583: yeah Hmm Aye
you are not, peoplell KBM 1530: we have nothing to do
with it hmm What? It was an accident KBM 1773: they?