Corpus Linguistics

An introduction to computational teaching
Prof. Rogério Pereira Azeredo

Semana de Letras – Faculdade Pitágoras – Vitória
Outubro 2008
What is a Corpus?
The word "corpus", derived from the Latin word meaning "body", may be used to refer
to any text in written or spoken form. However, in modern Linguistics this term is used
to refer to large collections of texts which represent a sample of a particular variety or
use of language(s) and are presented in machine-readable form. Other definitions,
broader or stricter, exist.
Computer-readable corpora can consist of raw text only, i.e. plain text with no
additional information. Many corpora have been provided with some kind of linguistic
information, called mark-up or annotation.
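The difference between raw text and annotated text can be sketched in a few lines of Python. This is only an illustration: the sentence, the part-of-speech labels and the word_TAG output format are assumptions chosen for the example, not the conventions of any particular corpus.

```python
# Raw text: plain characters with no linguistic information attached.
raw_text = "The cats sat on the mat"

# Annotated version: each token is paired with a part-of-speech label.
# The labels are hand-assigned for illustration, not produced by a tagger.
annotated = [
    ("The", "DET"),
    ("cats", "NOUN"),
    ("sat", "VERB"),
    ("on", "PREP"),
    ("the", "DET"),
    ("mat", "NOUN"),
]


def to_markup(tokens):
    """Render (word, tag) pairs in a simple word_TAG format."""
    return " ".join(f"{word}_{tag}" for word, tag in tokens)


def words_with_tag(tokens, tag):
    """Return the words that carry the given tag."""
    return [word for word, t in tokens if t == tag]


print(to_markup(annotated))
print(words_with_tag(annotated, "NOUN"))
```

With the annotation in place, a question such as "which words are nouns?" becomes a simple lookup, which is not possible on the raw text alone.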
Types of corpora
There are many different kinds of corpora. They can contain written or spoken
(transcribed) language, modern or old texts, texts from one language or several
languages. The texts can be whole books, newspapers, journals, speeches, etc., or
consist of extracts of varying length. The kind of texts included and the combination of
different texts vary between different corpora and corpus types.

What is Corpus Linguistics?
Corpus Linguistics is now seen as the study of linguistic phenomena through
large collections of machine-readable texts: corpora. These are used within a
number of research areas going from the Descriptive Study of the Syntax of a
Language to Prosody or Language Learning, to mention but a few.
The use of real examples of texts in the study of language is not a new issue in
the history of linguistics. However, Corpus Linguistics has developed considerably
in the last decades due to the great possibilities offered by the processing of natural
language with computers. The availability of computers and machine-readable text
has made it possible to get data quickly and easily and also to have this data
presented in a format suitable for analysis.
Corpus linguistics is, however, not merely a matter of obtaining language data
through the use of computers. Corpus linguistics is the study and analysis of data
obtained from a corpus. The main task of the corpus linguist is not to find the data
but to analyze it. Computers are useful, and sometimes indispensable, tools used
in this process.

SOME HISTORICAL BACKGROUND
A landmark in modern corpus linguistics was the publication by Henry Kucera and
Nelson Francis of Computational Analysis of Present-Day American English in 1967,
a work based on the analysis of the Brown Corpus, a carefully compiled selection of
current American English, totaling about a million words drawn from a wide variety of
sources. Kucera and Francis subjected it to a variety of computational analyses,
from which they compiled a rich and variegated opus, combining elements of
linguistics, language teaching, psychology, statistics, and sociology. A further key
publication was Randolph Quirk's 'Towards a Description of English Usage' (1960), in
which he introduced the Survey of English Usage.
Shortly thereafter, Boston publisher Houghton-Mifflin approached Kucera to supply a
million-word, three-line citation base for its new American Heritage Dictionary,
the first dictionary to be compiled using corpus linguistics. The AHD took the
innovative step of combining prescriptive elements (how language should be used)
with descriptive information (how it actually is used).

Other publishers followed suit. The British publisher Collins' COBUILD monolingual
learner's dictionary, designed for users learning English as a foreign language, was
compiled using the Bank of English. The Survey of English Usage Corpus was
used in the development of one of the most important corpus-based grammars,
A Comprehensive Grammar of the English Language (Quirk et al. 1985).
The Brown Corpus has also spawned a number of similarly structured corpora: the
LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New
Zealand English), Australian Corpus of English (Australian English), the Frown Corpus
(early 1990s American English), and the FLOB Corpus (1990s British English). Other
corpora represent many languages, varieties and modes, and include the International
Corpus of English, and the British National Corpus, a 100 million word collection of a
range of spoken and written texts, created in the 1990s by a consortium of publishers,
universities (Oxford and Lancaster) and the British Library. For contemporary
American English, work has stalled on the American National Corpus, but the 360
million word Corpus of Contemporary American English (COCA) (1990-present) is now
available.
The first computerized corpus of transcribed spoken language was constructed in
1971 by the Montreal French Project, containing one million words, which inspired
Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area.

Knowing your corpus
Something about corpus compilation
Combining texts into a corpus is called compiling a corpus. There are various ways
of doing this, depending on what kind of corpus you want to create and on what
resources (time, money, knowledge) you have at your disposal.
Even if you are not compiling your own corpus, it is important to know something
about corpus compilation when you use a corpus. Using a corpus is using a selection
of texts to represent the language. How the corpus has been compiled is of utmost
importance for the results you get when using it. What texts are included, how these
are marked up, the proportions of different text types, the size of the various
texts, how the texts have been selected, etc. are all important issues.

Illustration: the language as a newspaper
Let us imagine that you have a newspaper - a collection of texts of different kinds
(editorials, reportage on different topics, reviews, cartoons, letters to the editor,
sports commentaries, lists of shares, etc) written by different people. You then cut
the paper into small pieces with one word on each. You put all the pieces/words
into a bowl and pick a sample of ten at random. Obviously there would be several
words that you know exist in the newspaper that are not found in your sample.
If you were to pick another ten pieces of paper you would not expect the two sets
of ten words to be exactly the same. If you picked two sets of 100 words each,
you would probably find that some words, especially frequent words like function
words, can be found in both samples, if not in exactly the same numbers. You
would also find that many words are found in only one of the samples. If you took
two very large samples you would find that the frequent words would occur to a
similar extent. Words that occur only once in the newspaper would be found in
only one of the samples (at most). Words that occur infrequently would not
necessarily be evenly distributed across the two samples.
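
The thought experiment above can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions: the miniature "newspaper" (three frequent function words plus 80 words that occur once) and the helper names two_samples and shared_types are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical miniature "newspaper": a few frequent function words
# and many words that occur only once (200 slips in total).
newspaper = (
    ["the"] * 60 + ["of"] * 30 + ["and"] * 30
    + [f"rare{i}" for i in range(80)]
)


def two_samples(words, size, rng):
    """Draw two samples of `size` slips each; no slip is used twice."""
    slips = rng.sample(words, 2 * size)
    return slips[:size], slips[size:]


def shared_types(sample_a, sample_b):
    """Word types that occur in both samples."""
    return set(sample_a) & set(sample_b)


rng = random.Random(2008)  # fixed seed so the run is repeatable
for size in (10, 100):
    a, b = two_samples(newspaper, size, rng)
    print(
        f"sample size {size}:",
        "'the' occurs", Counter(a)["the"], "and", Counter(b)["the"], "times;",
        len(shared_types(a, b)), "word types appear in both samples",
    )
```

Whatever the seed, a word that occurs only once can land in at most one of the two samples, so only the repeated function words can be shared. How evenly those frequent words are split depends on the sample size.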

Now imagine that you divide the newspaper into sections (or classify its content
into categories/text types) before cutting it up, and then put the cuttings in different
bowls. By picking your paper slips from the different bowls you can influence the
composition of your sample. You can choose to take slips from only one bowl or
from several, in equal or different proportions. If there is a difference in the
language in the bowls, there will be a difference in the language on the slips and
that will affect your sample correspondingly. You can easily see that if you were
to take 100 slips of paper from the 'sports' bowl and 100 slips from the
'editorial' bowl, you would probably find a larger number of the word football in
the sample taken from the 'sports' bowl than from the 'editorial'.
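
The effect of drawing from separate bowls can be sketched in the same way. The bowl contents, the word counts and the function name sample_from_bowls are assumptions made up for this illustration; a real corpus would have many more categories and far more words.

```python
import random

# Hypothetical bowls: word slips grouped by newspaper section
# (1000 slips each; "football" is common in one and rare in the other).
bowls = {
    "sports": ["football"] * 200 + ["match"] * 200 + ["the"] * 600,
    "editorial": ["football"] * 10 + ["policy"] * 390 + ["the"] * 600,
}


def sample_from_bowls(bowls, quotas, rng):
    """Draw quotas[name] slips from each named bowl, without replacement."""
    sample = []
    for name, n in quotas.items():
        sample.extend(rng.sample(bowls[name], n))
    return sample


rng = random.Random(2008)  # fixed seed so the run is repeatable
designs = {
    "sports only": {"sports": 100},
    "editorial only": {"editorial": 100},
    "mixed 50/50": {"sports": 50, "editorial": 50},
}
for label, quotas in designs.items():
    sample = sample_from_bowls(bowls, quotas, rng)
    print(label, "->", sample.count("football"), "occurrences of 'football'")
```

Changing the quotas changes the composition of the sample, which is the reason the proportions of text types in a corpus matter for the results obtained from it.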

A practical example
The Dictionary Research Centre is an umbrella for lexicographical activities and
interests within the Department of English. The Department has been involved in
dictionary projects of different kinds for nearly twenty-five years. Perhaps best
known are the COBUILD project (1980-2000, in association with HarperCollins),
and the Johnson Dictionary project (1988 onwards, partly in association with
Cambridge University Press). There have also been several smaller research
initiatives, for example within the Centre for Corpus Linguistics. The Dictionary
Research Centre was created in autumn 2001, when the Dictionary Research
Centre in the School of English, University of Exeter, was transferred to the
University of Birmingham. Exeter itself had a long tradition in relation to
dictionaries and lexicography within its Dictionary Research Centre, created by
Professor Reinhard Hartmann (now an honorary professor at Birmingham).
University of Birmingham: http://www.english.bham.ac.uk/drc/

LINKS
The British National Corpus: http://www.natcorp.ox.ac.uk/
Virtual Language Centre (Hong Kong): http://vlc.polyu.edu.hk
The Corpus of Contemporary American English (385+ million words, 1990-2008): http://www.americancorpus.org
Web Concordancer VLC: http://www.edict.com.hk/concordance/WWWConcappE.htm
The Collins WordbanksOnline English Corpus: http://www.collins.co.uk/Corpus/CorpusSearch.aspx
Business Letter Corpus Online KWIC Concordancer: http://ysomeya.hp.infoseek.co.jp/
ONLINE CORPORA: http://corpus.byu.edu/
WEBCORP: http://www.webcorp.org.uk/cgi-bin/webcorp2.nm
SOME QUERIES
have a bath x take a bath
have a nap x take a nap
make a mistake x commit a mistake
salt and pepper x pepper and salt
dead or alive x alive or dead
lost and found x found and lost
on and off x off and on
fish and chips x chips and fish
sick and tired x tired and sick
black and white x black on white
cats and dogs x dogs and cats
bacon and eggs x eggs and bacon
out of the (blue/green/white/black/red/grey)
(green/red/black/white/yellow) with anger
(green/red/black/white/yellow) with envy
make/prepare/fix/cook dinner
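
Queries of the "A x B" kind amount to counting how often each variant occurs in a corpus. The following is a minimal sketch, assuming a short invented text in place of a real corpus; the names sample_text, count_phrase and compare are hypothetical, and an online concordancer would do the same job over millions of words.

```python
import re

# Hypothetical stand-in for a corpus; a real query would run against
# a large collection such as those listed under LINKS.
sample_text = (
    "She ordered fish and chips. Pass the salt and pepper, please. "
    "He was sick and tired of waiting. They sell fish and chips here. "
    "I prefer pepper and salt on mine."
)


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of `phrase`."""
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def compare(text, variant_a, variant_b):
    """Return the frequency of two competing variants side by side."""
    return {
        variant_a: count_phrase(text, variant_a),
        variant_b: count_phrase(text, variant_b),
    }


print(compare(sample_text, "fish and chips", "chips and fish"))
print(compare(sample_text, "salt and pepper", "pepper and salt"))
```

Counts from a text this small say nothing about the language as a whole; the point is only to show the shape of the comparison that a corpus query performs.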

REFERENCES
http://www.corpus-linguistics.de/
http://www.english.bham.ac.uk/drc/
http://www.essex.ac.uk/linguistics/clmt/w3c/corpu
http://en.wikipedia.org/wiki/Corpus_linguistics
