
MODELOS GRAMATICALES DEL INGLÉS

Tfno 958 241000 - Ext. 20243 Fax 958 243678. jsantana@ugr.es www.ugr.es/local/jsantana

Juan Santana Lario

MODELOS GRAMATICALES. Corpus Linguistics


2. Types of corpora
According to purpose:
o General-purpose corpora: designed as a general representation of the language and as the basis for a wide range of linguistic studies. E.g.: Brown, LOB (Lancaster-Oslo/Bergen corpus), BNC (British National Corpus).
o Domain-specific (or sub-language) corpora: represent a specific variety (regional, temporal, language domain, etc.) and/or are intended for specific purposes (language teaching, dictionary making, translation studies, etc.). E.g.: Guangzhou Petroleum English Corpus, JDEST Computer Corpus of Text in English for Science and Technology.

According to text selection procedure:
o Sample corpus: consists of sections of texts (samples) of approximately the same length, representing a variety of text categories (balance, representativeness). E.g.: Brown, LOB, SEU (Survey of English Usage corpus). Brown and LOB: 15 text categories, 500 samples, 2,000 words per sample (about one million words in total).
o Full-text corpus: consists of full texts. E.g.: English Poetry Full-Text Database.

According to open/closed character:
o Closed/static corpus: once the corpus is completed, no more texts are added. E.g.: all the corpora above.
o Open/dynamic corpus (monitor corpus or textbank): new materials are continually added and older materials are discarded, while the balance between text types is maintained. E.g.: Bank of English (University of Birmingham), originally compiled to produce the CoBuild Dictionary.
o Collections: not exactly corpora (they lack an explicit design/purpose) but large sets of texts. E.g.: Oxford Text Archive, LDC (Linguistic Data Consortium), Project Gutenberg.

According to medium:
o Written corpora: only written texts. E.g.: Brown, LOB.
o Spoken corpora: only spoken material. E.g.:
  - LLC (London-Lund Corpus): the spoken section of SEU; about half a million words of British English speech with detailed transcription by means of a prosodic notation showing features such as stress and intonation.
  - SEC (IBM/Lancaster Spoken English Corpus): 50,000 words, available in various versions: orthographically transcribed, prosodically transcribed, grammatically tagged, sound-recorded.
  - Canadian Hansard: official record of the proceedings of the Canadian House of Commons; over 60 million words, in French and English versions.
  - MARSEC (Machine Readable Spoken English Corpus): each string in the orthographic transcription is linked to the corresponding section of the audio recording.
  - COLT (Bergen Corpus of London Teenage Language): collected in 1993, it consists of the spoken language of 13- to 17-year-old teenagers from different boroughs of London; half a million words, orthographically transcribed and word-class tagged; it is a constituent of the BNC.
o Mixed corpora: both written and spoken material. E.g.: Birmingham Bank of English, BNC (British National Corpus), ICE (International Corpus of English).

According to number of languages/dialects represented:
o Monolingual corpora: texts in one language (or language variety) only. E.g.: all of the above except the Canadian Hansard.
o Multilingual or parallel corpora: more than one language/dialect.
Parallelism comes in various degrees. At one end are strictly parallel corpora, which contain an original and one or more translated versions of the same texts (e.g. Canadian Hansard, English-Norwegian Parallel Corpus); these are very useful for lexicography, language teaching and translation studies. At the other end are loosely parallel (comparable) corpora, i.e. collections of "similar" texts in different languages or in different varieties of a language. E.g.: ICE (International Corpus of English), with texts compiled in 15 countries where English is the first or an official second language, on the basis of exactly the same compilation principles. Taken together, Brown (American English), LOB (British English) and Kolhapur (Indian English) could also be considered comparable corpora.
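
As a rough illustration of what "strictly parallel" means in practice, the sketch below stores a corpus as a list of aligned sentence pairs and retrieves the translations of every sentence containing a given word. The class name `AlignedPair`, the function `find_translations` and the three sentence pairs are all invented for this example; they are not taken from the Canadian Hansard or any other real corpus, and real parallel corpora use far richer alignment formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned unit (here: a sentence) of a strictly parallel corpus."""
    english: str
    french: str


# Invented toy sentences, for illustration only (not from a real corpus).
TOY_PARALLEL_CORPUS = [
    AlignedPair("The House met at ten o'clock.",
                "La Chambre s'est réunie à dix heures."),
    AlignedPair("The motion was agreed to.",
                "La motion est adoptée."),
    AlignedPair("The House adjourned.",
                "La Chambre s'est ajournée."),
]


def find_translations(corpus, word):
    """Return the aligned pairs whose English side contains `word`.

    Matching is case-insensitive and ignores surrounding punctuation.
    """
    target = word.lower()
    hits = []
    for pair in corpus:
        tokens = [t.strip(".,;:!?").lower() for t in pair.english.split()]
        if target in tokens:
            hits.append(pair)
    return hits


if __name__ == "__main__":
    for pair in find_translations(TOY_PARALLEL_CORPUS, "house"):
        print(pair.english, "|", pair.french)
```

A comparable (loosely parallel) corpus could not be stored this way, since its texts are only similar in type and have no sentence-by-sentence correspondence.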


According to temporal variety:
o Synchronic: one variety, normally contemporary (at compilation time).
o Diachronic: several periods of the language. E.g.: Helsinki Corpus.

According to type of speaker:
o Native vs learner corpora.

According to annotation:
o Plain: e.g. Project Gutenberg texts, produced by scanning; no information about the text (usually not even the edition). Not really a corpus but a collection of texts.
o Annotated:
  - marked up for formatting attributes (page breaks, paragraphs, font sizes, italics, etc.). E.g.: Brown.
  - annotated with identifying information (edition date, author, genre, register, etc.). E.g.: BNC, ICE-GB.
  - annotated for part of speech, syntactic structure, discourse information, etc. E.g.: LOBTAG, BNC, ICE-GB.
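
To make the plain/annotated distinction concrete, here is a minimal sketch that reads a line of part-of-speech-tagged text and recovers both the plain text and the tag counts. The `word_TAG` notation, the tag names (`DET`, `NOUN`, etc.) and the function names are simplified assumptions made for this example; they do not reproduce the actual tagsets or file formats of LOBTAG, the BNC or ICE-GB.

```python
from collections import Counter


def parse_tagged(line, sep="_"):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs.

    A token without the separator is treated as untagged (tag is None).
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def strip_annotation(line, sep="_"):
    """Recover the plain text by discarding the tags."""
    return " ".join(word for word, _tag in parse_tagged(line, sep))


def tag_frequencies(line, sep="_"):
    """Count how often each tag occurs; only possible in annotated text."""
    return Counter(tag for _word, tag in parse_tagged(line, sep)
                   if tag is not None)


# Hypothetical tagged line using an invented, simplified tagset.
TAGGED_LINE = "the_DET corpus_NOUN contains_VERB tagged_ADJ texts_NOUN"

if __name__ == "__main__":
    print(strip_annotation(TAGGED_LINE))
    print(tag_frequencies(TAGGED_LINE))
```

Going from annotated to plain text is trivial, as `strip_annotation` shows, whereas the reverse requires a tagger or manual analysis; this is one reason annotated corpora are the more valuable resource.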

For a comprehensive list of corpora and links to them, visit: http://www.uow.edu.au/~dlee/CBLLinks.htm http://www.ugr.es/~pedrou/
