Application
1. PRELIMINARIES
2. CONSULTING A CORPUS THROUGH AN INTERFACE
Address: http://corpus.byu.edu/
CORPUS.BYU.EDU
http://corpus.byu.edu/x.asp
This site contains links to several free online corpora that we have created:
Corpus of Contemporary American English (COCA)
    Size: 400 million words
    Language and period: American English, 1990-present
    Contents: 20 million words each year, 1990-present. Equally divided into spoken, fiction, popular magazine, newspaper, and academic. Will be continually updated.
Corpus of Historical American English (COHA)
    Size: 300 million words
    Language and period: American English, c1810-present
    Contents: (Available August 2010) 13-18 million words each decade, divided among fiction, popular magazines, newspaper, and other non-fiction.
BYU-BNC: The British National Corpus
    Size: 100 million words
    Language and period: British English, ~1980s-1993
    Contents: 90 million words written (fiction, newspaper, academic, etc); 10 million spoken. [Website for the original BNC]
TIME Magazine
    Size: 100 million words
    Language and period: American English, 1923-present
    Contents: More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.
Other languages
Corpus del Español
    Size: 100 million words
    Language and period: Spanish, 1200s-1900s
    Contents: 20 million words 1900s, 20m 1800s, 40m 1500s-1700s, 20m 1200s-1400s
Corpus do Português
    Size: 45 million words
    Language and period: Portuguese, 1300s-1900s
    Contents: 20 million words 1900s, including spoken, fiction, newspaper, and academic. Equally divided Brazil/Portugal. 10m 1800s, 15m 1300s-1700s
Search options
To illustrate corpus search operations: Corpus of Contemporary American English (http://www.americancorpus.org/)
Information on the search options offered by the interface of these corpora: see below (in English)
Architecture, interface, and searches
Using our unique corpus architecture, users can:
Search by word, phrase, substring, part of speech (e.g. nouns or verbs), lemma (e.g. all forms of go: goes, went, etc), synonyms, customized wordlists, or any combination of these
See the individual frequency of all matching forms (as well as in each section of the corpus), or the overall frequency in each genre and time period
Find the collocates (nearby words) of a given word or phrase, which provides insight into the meaning of the word
Compare the collocates of two words, to see differences in meaning or usage (e.g. collocates of rob vs. steal, or warm vs hot, or men vs. women, or Democrats vs. Republicans)
Compare the collocates across time periods (provides insight into changes in meaning, such as new uses with green)
Compare the collocates across genres to show differences in 'word sense', e.g. chair = 'committee leader' (academic) vs. 'piece of furniture' (fiction)
Order results by Mutual Information score (shows 'relevance', in addition to raw frequency)
With integrated thesauruses, find the frequency and distribution of synonyms of a given word (to see which synonyms are most frequent, in which genres they are used most, which are increasing or decreasing in use, etc)
Create personalized lists of words and phrases (e.g. for a particular semantic field) and then re-use them as part of subsequent queries
Complete, context-sensitive help files and "guided tours" for each corpus
Save and re-use queries, as well as annotate and share your queries with others
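The Mutual Information ordering mentioned in the list above can be illustrated with a minimal sketch. The text does not state the exact formula the interface uses, so this assumes the standard pointwise mutual information (log base 2 of observed over expected co-occurrence). The function name and all counts are hypothetical.

```python
import math

def mutual_information(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise mutual information (base 2) for a node word and a collocate.

    Assumption: standard PMI = log2(P(node, collocate) / (P(node) * P(collocate))).
    The corpus interface may use a variant (e.g. one that adjusts for span size).
    """
    if min(pair_count, node_count, collocate_count, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    # Co-occurrences expected if the two words were independent.
    expected = node_count * collocate_count / corpus_size
    return math.log2(pair_count / expected)

# Hypothetical counts, for illustration only (not taken from the corpus).
score = mutual_information(pair_count=50, node_count=1_000,
                           collocate_count=2_000, corpus_size=400_000_000)
print(round(score, 2))
```

A high score means the pair occurs together far more often than chance would predict, which is why a rare but strongly associated collocate can outrank a frequent function word.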
Searching the corpus
By way of introduction, remember that when you search the corpus, you're searching over 400 million words from all types of
contemporary American English (1990-2008) -- transcripts of conversations (on shows like Good Morning America and Oprah), fiction
(e.g. novels and movie scripts), 100 different popular magazines (e.g. Good Housekeeping or Sports Illustrated), newspapers, and
academic journals. What you see is what is really going on in American English today, and this is the only site and corpus that will allow
you to carry out searches like these.
At the most basic level, you can see whether a word, phrase, grammatical construction, or word meaning has been increasing or
decreasing over the past 15-20 years, and what genre of American English uses this the most. For example, take a look at words like
laptop, funky, or bling, parts of words like the suffix -dom or the word root -clean-, phrases like perfect storm or global warming, or
grammatical constructions like end up V-ing (e.g. end up paying), or get + V-ed (e.g. get stopped). If you're looking at a chart (rather
than the table display), feel free to click on [SEE ALL SECTIONS] to compare frequency by sub-section (e.g. African-American or sports
magazines, or movie scripts and novels). So this is the most basic level -- the frequency over time and in different genres of English.
(More information on looking at changes in American English, or on comparing between different genres)
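Comparing frequency across genres or years only makes sense once raw counts are normalized by section size. The following is a minimal sketch of a per-million-words normalization; the function name, section sizes, and counts are hypothetical and are not taken from the corpus.

```python
def per_million(raw_count, section_size):
    """Normalize a raw frequency to occurrences per million words."""
    if section_size <= 0:
        raise ValueError("section_size must be positive")
    return raw_count * 1_000_000 / section_size

# Hypothetical raw counts and section sizes (in words), for illustration only.
sections = {
    "SPOKEN": (120, 80_000_000),
    "FICTION": (300, 80_000_000),
    "ACADEMIC": (40, 80_000_000),
}

# Map each section name to its normalized frequency.
normalized = {name: per_million(count, size)
              for name, (count, size) in sections.items()}
most_frequent = max(normalized, key=normalized.get)
print(normalized, most_frequent)
```

The same calculation applies to time periods: replace the genre names with years or decades and the section sizes with the number of words in each period.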
A somewhat more advanced search is seeing how a word or phrase is being used. In order to do this, you'll probably want to look at the
collocates (i.e. "nearby" words). The basic idea is that you can "tell a lot about a word by the other words that it 'hangs out' with". For
example:
Before you look for the collocates of each of the words deep, run, smile, and fairly -- what would you guess are the best collocates -- in
other words, surrounding words that really help to "define" these words? Are there any that are surprises in what you see in the corpus?
Compare the collocates of the two words democrats and republicans. According to these texts (from newspapers, magazines, TV talk
shows, etc), which of the two political parties is "electable, open-minded, and fun", and which is "extremist, mean-spirited, and greedy".
Any possible media bias here?
Compare the verbs near Clinton and Bush. Can you find out anything about recent events by looking at this list?
Compare the adjectives used to describe women and men. According to the corpus data, who is grumpy, impotent, masked, burly, and
armed, and who is glamorous, petite, and voluptuous? Does this reflect biases in contemporary American culture?
Finally, you can also see how the usage of a word is changing over time. For example, compare the types of crises talked about in the
past 2-3 years (list to the left), compared to the 1990s (list to the right). What are we worrying about now that they weren't (so much)
back then, and vice versa?
The corpus is also very useful for language learners, because it allows them to quickly "tap into" hundreds of millions of words to see
what's going on in the language, which can help to compensate for their lack of "native speaker intuition".
Using the web interface, you can search by words (mysterious), phrases (nooks and crannies or faint + noun), lemmas (all forms of
words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or verb + any word + a form
of ground. Notice that from the "frequency results" window you can click on the word or phrase to see it in context in this lower
window.
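The wildcard searches mentioned above (un*ly, r?n*) can be approximated offline with a short sketch. It assumes that * stands for any run of characters and ? for exactly one character, and it uses Python's shell-style matcher as a stand-in; the interface's own matching rules may differ. The word list is hypothetical.

```python
from fnmatch import fnmatchcase

def match_wildcard(pattern, words):
    """Return the words that match a pattern where * is any run of characters
    and ? is exactly one character.

    Note: fnmatchcase also treats [seq] as a character class, so this sketch
    is only meant for patterns built from letters, * and ?.
    """
    return [w for w in words if fnmatchcase(w, pattern)]

# Hypothetical word list, for illustration only.
words = ["unlikely", "unfriendly", "unhappy", "run", "running", "rain", "ran"]
print(match_wildcard("un*ly", words))  # ['unlikely', 'unfriendly']
print(match_wildcard("r?n*", words))   # ['run', 'running', 'ran']
```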
As the preceding searches indicate, the first option in the search form allows you to either see a list of all matching strings, or a chart
display that shows the frequency in the five "macro" registers (spoken, fiction, popular magazines, newspapers, and academic journals).
Look for the frequency of funky, whom, incredibly + adjective, or forms of need + to + VERB. Via the chart display, you can also see the
frequency of the word or phrase in subregisters as well, such as movie scripts, children's fiction, women's magazines, or medical
journals. With the list display, you can also see the frequency of each matching string in each of the major sections of the corpus (look for
deep + noun, with and without the totals for each section).
You can also search for collocates (words nearby a given word), which often provides insight into the meaning of a given word. For
example, you can search for the most common nouns near thick, adjectives near smile (sorted by relevance), nouns after look into, or
words starting with clos* near eyes.
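To make the idea of collocates concrete, here is a minimal sketch that counts the words found within a fixed span on either side of a node word. The function name, the span of two words, and the toy sentence are assumptions for illustration; the corpus interface works on tagged text and lets you filter by part of speech, which this sketch does not do.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Up to `span` tokens before the node plus up to `span` tokens after it.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Hypothetical toy text, for illustration only.
text = "she gave a warm smile and he returned a shy smile with a nod"
tokens = text.split()
print(collocates(tokens, "smile", span=2).most_common(3))
```

Raw counts like these favor very frequent words such as "a", which is why the interface also offers sorting by relevance.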
You can also include information about genre or a specific time period directly as part of the query. This allows you to see how words
and phrases vary across speech and many different types of written texts. We can easily find which words and phrases occur much more
frequently in one register than another, such as good + [noun] in fiction, or verbs in the slot [we * that] in academic writing. You can also
apply this to collocates, such as nouns with the verb break in NEWS or adjectives with woman in FICTION. Finally, you can compare
one section to another, such as nouns near chair in (ACAD vs FICTION), nouns with passionate (FICTION vs NEWSPAPER), verbs in
sports magazines compared to other magazines, or adjectives in medical journals compared to other journals.
Finally, you can easily carry out semantically-oriented searches. For example, you can compare nouns that appear with small and
little, with men and women, nouns with utter and sheer, adjectives with Democrats and Republicans (notice any bias here?), or verbs with
Clinton and Bush (or emphases there?). You can also find the frequency and distribution of synonyms of a given word, such as beautiful
or the verb clean, see which synonyms are more frequent in competing registers (such as synonyms of strong in FICTION and
ACADEMIC), and use synonyms as part of a more complex query (such as synonyms of clean with nouns). Finally, you can create
"customized lists" for any category that interests you, and then re-use these in subsequent queries (such as colors + clothes, or words
related to beautiful + forms of woman).
We hope this short five-minute overview of the corpus has been helpful. Now feel free to look at more examples of the types of possible
searches, including word/phrase, collocates (surrounding words), synonyms, word comparisons, and customized/user-defined lists. You can
also find more information on how to search by section (genre or year), and how to refine your searches with certain search options.
One "slot" : Make sure there is no space, or it will be interpreted as two consecutive words
lemma, word, etc. Most
useful for "multiple slot"
queries; see below)
You can limit to a particular part of speech by adding a period (full stop) and then the part of speech tag in brackets. This is always
optional. Make sure there is no space before or after the period (full stop), or it will be interpreted as two consecutive words
word.[pos]    Exact word and part of speech    Example: strike.[v*]    Matches: strike (only as a verb)
You can add "lemma" to any other type of search, such as synonym or customized list, to see all forms of the matching words. Just
use an extra set of brackets.
[[user:list]]    Customized list and lemma    Example: [[davies:clothes]]    Matches: tie, tying, socks, socked, shirt, blouses (no part of speech specified, hence tying)
You can also choose lemma and part of speech by combining the preceding symbols
[[=word]].[pos]    Synonym and lemma and part of speech    Example: [[=clean]].[v*]    Matches: mop, scrubs, polishing
[[user:list]].[pos]    Customized list and lemma and part of speech    Example: [[davies:clothes]].[n*]    Matches: tie, ties, sock, socks (i.e. just nouns)
Multiple "slots": Create sequences of words, using any of the preceding query types. Note that in each case, there is a space between
the word "slots" in the query. These are just a few examples, from an unlimited number of combinations.
Example: [[beat]].[v*] * [nn*]    Matches: beat the Yankees; beaten to death
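The way space-separated slots combine can be sketched as a small matcher over tagged tokens. This is an illustration under stated assumptions, not the interface's implementation: it handles only the * slot, the [pos] slot, exact words, and the word.[pos] form described above, and it leaves out lemma, synonym, and customized-list slots. The function names and the hand-tagged sentence are hypothetical.

```python
from fnmatch import fnmatchcase

def slot_matches(slot, word, tag):
    """Check one query slot against one (word, tag) token.

    Supported slot forms (a subset of the syntax described above):
      *            any word
      [pos]        any word with a matching part-of-speech tag, e.g. [nn*]
      word         exact word
      word.[pos]   exact word with a matching tag, e.g. strike.[v*]
    Lemma, synonym, and customized-list slots are not handled here.
    """
    word, tag = word.lower(), tag.lower()
    if slot == "*":
        return True
    if slot.startswith("[") and slot.endswith("]"):
        return fnmatchcase(tag, slot[1:-1].lower())
    if ".[" in slot and slot.endswith("]"):
        literal, pos = slot[:-1].split(".[", 1)
        return word == literal.lower() and fnmatchcase(tag, pos.lower())
    return word == slot.lower()

def find_sequences(query, tagged_tokens):
    """Return every run of tokens that matches the space-separated slots in order."""
    slots = query.split()
    hits = []
    for start in range(len(tagged_tokens) - len(slots) + 1):
        window = tagged_tokens[start:start + len(slots)]
        if all(slot_matches(s, w, t) for s, (w, t) in zip(slots, window)):
            hits.append(" ".join(w for w, _ in window))
    return hits

# Hypothetical hand-tagged sentence, for illustration only.
sentence = [("They", "PPHS2"), ("strike", "VV0"), ("the", "AT"),
            ("ball", "NN1"), ("after", "II"), ("the", "AT"), ("strike", "NN1")]
print(find_sequences("strike.[v*] * [nn*]", sentence))  # ['strike the ball']
```

Because the query is split on spaces, a stray space inside a slot such as "strike. [v*]" would be read as two consecutive slots, which mirrors the warning given above.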
NN1 singular common noun (e.g. book, girl)
NN2 plural common noun (e.g. books, girls)
NNA following noun of title (e.g. M.A.)
NNB preceding noun of title (e.g. Mr., Prof.)
NNL1 singular locative noun (e.g. Island, Street)
NNL2 plural locative noun (e.g. Islands, Streets)
NNO numeral noun, neutral for number (e.g. dozen, hundred)
NNO2 numeral noun, plural (e.g. hundreds, thousands)
NNT1 temporal noun, singular (e.g. day, week, year)
NNT2 temporal noun, plural (e.g. days, weeks, years)
NNU unit of measurement, neutral for number (e.g. in, cc)
NNU1 singular unit of measurement (e.g. inch, centimetre)
NNU2 plural unit of measurement (e.g. ins., feet)
NP proper noun, neutral for number (e.g. IBM, Andes)
NP1 singular proper noun (e.g. London, Jane, Frederick)
NP2 plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1 singular weekday noun (e.g. Sunday)
NPD2 plural weekday noun (e.g. Sundays)
NPM1 singular month noun (e.g. October)
NPM2 plural month noun (e.g. Octobers)
PN indefinite pronoun, neutral for number (none)
PN1 indefinite pronoun, singular (e.g. anyone, everything, nobody, one)
PNQO objective wh-pronoun (whom)
PNQS subjective wh-pronoun (who)
PNQV wh-ever pronoun (whoever)
PNX1 reflexive indefinite pronoun (oneself)
PPGE nominal possessive personal pronoun (e.g. mine, yours)
PPH1 3rd person sing. neuter personal pronoun (it)
PPHO1 3rd person sing. objective personal pronoun (him, her)
PPHO2 3rd person plural objective personal pronoun (them)
PPHS1 3rd person sing. subjective personal pronoun (he, she)
PPHS2 3rd person plural subjective personal pronoun (they)
PPIO1 1st person sing. objective personal pronoun (me)
PPIO2 1st person plural objective personal pronoun (us)
PPIS1 1st person sing. subjective personal pronoun (I)
PPIS2 1st person plural subjective personal pronoun (we)
PPX1 singular reflexive personal pronoun (e.g. yourself, itself)
PPX2 plural reflexive personal pronoun (e.g. yourselves, themselves)
PPY 2nd person personal pronoun (you)
RA adverb, after nominal head (e.g. else, galore)
REX adverb introducing appositional constructions (namely, e.g.)
RG degree adverb (very, so, too)
RGQ wh- degree adverb (how)
RGQV wh-ever degree adverb (however)
RGR comparative degree adverb (more, less)
RGT superlative degree adverb (most, least)
RL locative adverb (e.g. alongside, forward)
RP prep. adverb, particle (e.g. about, in)
RPK prep. adv., catenative (about in be about to)
RR general adverb
RRQ wh- general adverb (where, when, why, how)
RRQV wh-ever general adverb (wherever, whenever)
RRR comparative general adverb (e.g. better, longer)
RRT superlative general adverb (e.g. best, longest)
RT quasi-nominal adverb of time (e.g. now, tomorrow)
TO infinitive marker (to)
UH interjection (e.g. oh, yes, um)
VB0 be, base form (finite i.e. imperative, subjunctive)
VBDR were
VBDZ was
VBG being
VBI be, infinitive (To be or not... It will be ..)
VBM am
VBN been
VBR are
VBZ is
VD0 do, base form (finite)
VDD did
VDG doing
VDI do, infinitive (I may do... To do...)
VDN done
VDZ does
VH0 have, base form (finite)
VHD had (past tense)
VHG having
VHI have, infinitive
VHN had (past participle)
VHZ has
VM modal auxiliary (can, will, would, etc.)
VMK modal catenative (ought, used)
VV0 base form of lexical verb (e.g. give, work)
VVD past tense of lexical verb (e.g. gave, worked)
VVG -ing participle of lexical verb (e.g. giving, working)
VVGK -ing participle catenative (going in be going to)
VVI infinitive (e.g. to give... It will work...)
VVN past participle of lexical verb (e.g. given, worked)
VVNK past participle catenative (e.g. bound in be bound to)
VVZ -s form of lexical verb (e.g. gives, works)
XX not, n't
ZZ1 singular letter of the alphabet (e.g. A,b)
ZZ2 plural letter of the alphabet (e.g. A's, b's)
Real Academia Española - Corpus Diacrónico del Español (CORDE). http://corpus.rae.es/cordenet.html
The data bank of the Real Academia Española consists of two large textual corpora: the Corpus de Referencia del Español Actual (CREA, written and spoken) (http://corpus.rae.es/creanet.html) and the Corpus Diacrónico del Español (CORDE) (http://corpus.rae.es/cordenet.html). The two collections are complementary: CREA contains texts from 1975 to 2004, while CORDE includes texts from all earlier periods. The integrated nature of the two corpora is reflected in the plan that texts from periods which, with the passage of time, fall outside the scope of CREA will become part of CORDE.
At present, CREA comprises slightly more than 154 million word forms in its written part and slightly more than eight million in its spoken part. CORDE comes to nearly 250 million word forms. Through the concordance application, researchers have at their disposal around 400 million word forms from all periods of Spanish, from both Spain and the Americas, which is without doubt the most important resource ever available for the study of this language.
Encoding. All materials processed in both CREA and CORDE have been given a set of textual tags, defined according to the international standard SGML (Standard Generalized Markup Language) and following the recommendations of the TEI (Text Encoding Initiative). These tags facilitate information retrieval and the exchange of texts with other corpora, and ensure independence from operating systems and programs.
Statistics
Corpus    Words
CORDE     236709914
CREA      154212661