
Using Corpus for Research
5/27/15

What is a Corpus?
A corpus is a collection of pieces of
language text in electronic form, selected
according to external criteria to represent,
as far as possible, a language or language
variety as a source of data for linguistic
research (Sinclair 2005: 16).

The Characteristics of the Corpus Approach

It is empirical, analyzing the actual patterns of language use in natural texts.

It utilizes a large and principled collection of natural texts as the basis for analysis.

It makes extensive use of computers for analysis.

It depends on both quantitative and qualitative analytical techniques.

What is Corpus Linguistics?

It is the study of language based on examples of real life language use (McEnery and Wilson 1996: 1).

It is an area which focuses upon a set of procedures, or methods, for studying language (McEnery and Hardie 2012: 1).

Research in corpus linguistics has led to the elaboration of better quality learner input and provided researchers and teachers with a wider, finer perspective into language in use (Campoy-Cubillo, Bellés-Fortuño and Gea-Valor 2010: 3).

Case #1
What is the difference between the use of will and shall in the simple future tense?

Who says I/We shall?


(Data from the BNC, the British National Corpus)

Case #2
If you want to state that something is very good, you say

(1) It's fabulous.

(2) It's great.

Who uses the adjective fabulous?

Who uses the adjective great?

Case #3 Language and Culture

Case Study: How has the word marriage been used in Britain?

Data: The British Corpora in the Bank of English

The British Corpora in the Bank of English

today = Today tabloids

sunnow = The Sun & News of the World tabloids

brbooks = British books (scientific and popular)

times = Times newspapers

brmags = British magazines

guard = Guardian newspapers

indy = Independent newspapers

econ = Economist newspapers

brephem = British ephemeral (ads, booklets, etc.)

bbc = BBC radio

brspok = British spoken language

newsci = New Scientist magazines

The frequency information

What can we infer from the frequency information?

The concordance lines

What can we infer from the concordance lines regarding the image of the word marriage?

The collocation information

What can we infer from the collocations of the word marriage?

Refer to the Collocation Information

Divide the words into the following groups:

Possessives:

Words indicating a sequence or period of time:

Words to do with other relationships:

Words to do with a marriage ending:

Words indicating happiness and success:

The English language analysis


The verb end is an ergative verb.

What can we infer?

Further Analysis on Frequency Data
Referring to the previous Case Study:

Words indicating women (her, she, daughter) are more significantly associated with marriage than those indicating men (even though men and women must get married in equal numbers).

Is it because the frequency of the word woman is higher than that of the word man?

Which one has the higher frequency, the word husband or wife?

Further Analysis on Frequency Data
Frequency Data

woman    114,022
man      280,290
wife      79,562
husband   52,181

What is your interpretation?
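
One way to start on an interpretation is to turn the four counts into ratios. The Python sketch below does only that arithmetic on the figures given above. The variable and function names are illustrative, and the ratios alone do not explain why the words pattern this way.

```python
# Frequencies exactly as given in the frequency data above.
freq = {"woman": 114_022, "man": 280_290, "wife": 79_562, "husband": 52_181}

def ratio(a, b):
    """Return how many times more frequent word a is than word b."""
    return freq[a] / freq[b]

man_to_woman = ratio("man", "woman")
wife_to_husband = ratio("wife", "husband")

print(f"man / woman    = {man_to_woman:.2f}")
print(f"wife / husband = {wife_to_husband:.2f}")
```

On these figures, man is roughly 2.5 times as frequent as woman, whereas wife is roughly 1.5 times as frequent as husband.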

Case #4 The Indonesian Corpus


What is the difference between the use of the word pria (man) and the word wanita (woman)?

Case #4 The Indonesian Corpus

What is the difference between the use of the word wanita and the word perempuan (both meaning woman)?

Building a Corpus and Offline Corpus Analysis

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005: 16).

An offline corpus usually uses the .txt format, with UTF-8 encoding.
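
To illustrate why the plain-text, UTF-8 convention is convenient, here is a minimal Python sketch that reads every .txt file in a folder into memory. The function name load_corpus and the folder layout are assumptions made for this example, not part of any tool mentioned in these slides.

```python
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in folder as UTF-8 and return {filename: text}."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # The encoding is stated explicitly so that the result does not
        # depend on the operating system's default encoding.
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts
```

A call such as load_corpus("my_corpus"), where my_corpus is a hypothetical folder name, returns a dictionary with one entry per text file. Files saved in another encoding may raise a decoding error, which is a useful early warning that a file still needs converting.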

How to Build an Offline Corpus

If the data are in MS Word format, open the file and follow these steps: (1) Click File, (2) Choose Save As, (3) Under Save as type, select Plain Text, (4) Click Save, (5) Click OK.

If the data are in PDF format, use the AntFileConverter software.

If the file is a PDF that consists of images, it must first be converted with other software (e.g. Nitro PDF or OmniPage).

Example of a small corpus: a word list analysis on the descriptive paragraphs written by male and female students at a university

Using AntConc

AntConc is a freeware, multiplatform tool for carrying out corpus linguistics research and data-driven learning.

It was created by Laurence Anthony, Waseda University, Japan.

Website: http://www.antlab.sci.waseda.ac.jp/antconc_index.html

Using AntConc
AntConc contains seven tools:

Concordance Tool: This tool shows search results in a 'KWIC' (KeyWord In Context) format.
This allows you to see how words and phrases are commonly used in a corpus of texts.

Concordance Plot Tool: This tool shows search results plotted in a 'barcode' format. This
allows you to see the position where search results appear in target texts.

File View Tool: This tool shows the text of individual files. This allows you to investigate in
more detail the results generated in other tools of AntConc.

Clusters (N-Grams): The N-Grams Tool scans the entire corpus for 'N' (e.g. 1 word, 2
words, ...) length clusters. This allows you to find common expressions.

Collocates: This tool shows the collocates of a search term.

Word List: This tool counts all the words in the corpus and presents them in an ordered
list. This allows you to quickly find which words are the most frequent.

Keyword List: This tool shows which words are unusually frequent (or infrequent) in
the corpus in comparison with the words in a reference corpus.
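
To make the idea behind several of these tools concrete, the Python sketch below gives toy versions of a word list, a KWIC concordance, an n-gram count and a collocate count. It is an illustration under simplifying assumptions, not AntConc's implementation: the tokenizer only keeps lowercase letters and apostrophes, collocates are counted raw rather than ranked by a statistical measure, and the sample sentence is invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Word List: every word with its count, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, node, span=3):
    """Concordance: each hit of node with up to `span` words on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n=2):
    """N-Grams: counts of every run of n consecutive tokens."""
    return Counter(zip(*(tokens[k:] for k in range(n))))

def collocates(tokens, node, span=2):
    """Collocates: raw counts of words within `span` tokens of each hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, not taken from any corpus.
sample = "Her marriage ended in divorce. His first marriage ended happily."
tokens = tokenize(sample)

for left, node, right in kwic(tokens, "marriage"):
    print(f"{left:>20}  {node}  {right}")
```

Aligning the search word in a central column, as the final loop does, is what makes recurring patterns to the left and right of the word easy to spot by eye.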

Word Types and Word Tokens
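
Tokens are the running words of a text, while types are the distinct word forms among them. The short Python sketch below counts both for a single sentence. The tokenizer is a deliberately simple assumption (lowercase letters and apostrophes only), so a real tool may count differently.

```python
import re

def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)  # each distinct word form is counted once
    return len(types), len(tokens)

n_types, n_tokens = count_types_and_tokens("The cat sat on the mat")
print(n_types, n_tokens)  # 5 types, 6 tokens: "the" occurs twice
```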

Using AWP

AntWordProfiler contains two tools for carrying out corpus linguistics research on vocabulary profiling.

Vocabulary Profile Tool: This tool allows you to generate vocabulary statistics and frequency information about a corpus of texts loaded into the program.

File Viewer and Editor Tool: This tool allows you to view an individual user file and highlight the different levels of vocabulary in the file using colour coding. It also shows the overall coverage of different vocabulary levels.
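
The notion of coverage by vocabulary level can be sketched in a few lines of Python. This is a rough illustration, not AntWordProfiler's method: the level lists below are toy lists invented for the example, words are matched as exact forms only, and the function name vocabulary_profile is an assumption.

```python
import re
from collections import Counter

def vocabulary_profile(text, level_lists):
    """Return the share of tokens falling in each level list or off-list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for level, words in level_lists.items():
            if tok in words:
                counts[level] += 1
                break
        else:
            # No level list contains this token.
            counts["off-list"] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {level: counts[level] / total for level in [*level_lists, "off-list"]}

# Toy level lists, invented for illustration only.
levels = {
    "level 1": {"the", "a", "is", "on", "cat"},
    "level 2": {"corpus", "analysis"},
}
profile = vocabulary_profile("The corpus is on the syzygy", levels)
for level, share in profile.items():
    print(f"{level}: {share:.0%}")
```

In this toy case four of the six tokens fall in the first list, one in the second, and one is off-list.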

Thank you