Corpus Linguistics

6.4. Corpus linguistics (4).
APPLICATION: consulting a corpus through an interface
1. PRELIMINARIES
A corpus can be consulted in several ways:
through an interface (a graphical window for communicating with the corpus)

Advantages:
- ease of consultation; at most, the user may need to read a guide on how to formulate queries (the search syntax);
- the results of a query are displayed in a clear, expressive form that is easy to interpret
Limitations (disadvantages):
- an interface is tied to a particular corpus and can be used to consult only that corpus;
- the ways of consulting the corpus are largely predefined and therefore restricted
through data-processing languages (for example, awk: http://en.wikipedia.org/wiki/AWK; http://www.pement.org/awk/awk1line.txt) and through programming languages (Java, C++, Perl, etc.)

Advantages:
- many types of data-processing operations can be combined, which gives great flexibility and even an open-ended potential for consulting corpora;
- these tools are entirely independent of the data being processed, so they can be used on any corpus in a suitable format and can therefore be reused;
- results can be saved and reused as input for further processing operations
Limitations (disadvantages):
- the user must know the data-processing or programming language, namely the data-processing commands that can be executed in it; learning this is generally harder and takes longer;
- the results of a query are not displayed in a very expressive form
through other predefined data-manipulation tools (for example, WordSmith: http://www.lexically.net/wordsmith/)
(a middle way between the two preceding options)

These tools combine the advantages of interfaces with those of data-processing or programming languages:
relative ease of use, plus reusability for processing new corpora
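To make the programming-language option concrete, here is a minimal sketch in Python. Python is an assumption of this example (the handout itself names awk, Java, C++ and Perl), the function names `word_frequencies` and `concordance` are hypothetical, and the corpus is assumed to be a plain-text string.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word form occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = "The corpus is large. A corpus can be queried with a few lines of code."
for line in concordance(sample, "corpus"):
    print(line)
print(word_frequencies(sample).most_common(3))
```

The sketch shows both sides of the trade-off: the same functions work on any plain-text corpus and their results can be reused, but the output is a bare text listing rather than a polished display.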
2. CONSULTING A CORPUS THROUGH AN INTERFACE
Examples of corpora with query interfaces
Address: http://corpus.byu.edu/
CORPUS.BYU.EDU
http://corpus.byu.edu/x.asp
This site contains links to several free online corpora that we have created:
Corpus of Contemporary American English 400 million words 1990 - present
BYU-BNC: British National Corpus 100 million words 1980s - 1993
TIME Corpus of American English 100 million words 1920s - 2000s
Corpus del Español 100 million words 1200s - 1900s
Corpus do Português 45 million words 1300s - 1900s
English

Corpus of Contemporary American English (COCA)
  Size: 400 million words
  Language / time: American English, 1990-present
  Content: 20 million words each year, 1990-present. Equally divided into spoken, fiction, popular magazine, newspaper, and academic. Will be continually updated.
Corpus of Historical American English (COHA)
  Size: 300 million words
  Language / time: American English, c1810-present
  Content: (Available August 2010) 13-18 million words each decade, divided among fiction, popular magazines, newspaper, and other non-fiction.
BYU-BNC: The British National Corpus
  Size: 100 million words
  Language / time: British English, ~1980s-1993
  Content: 90 million words written (fiction, newspaper, academic, etc); 10 million spoken.
TIME Magazine
  Size: 100 million words
  Language / time: American English, 1923-present
  Content: More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.

Other languages

Corpus del Español
  Size: 100 million words
  Language / time: Spanish, 1200s-1900s
  Content: 20 million words 1900s, 20m 1800s, 40m 1500s-1700s, 20m 1200s-1400s
Corpus do Português
  Size: 45 million words
  Language / time: Portuguese, 1300s-1900s
  Content: 20 million words 1900s, including spoken, fiction, newspaper, and academic. Equally divided Brazil/Portugal. 10m 1800s, 15m 1300s-1700s
Search options
The search operations are illustrated with the Corpus of Contemporary American English (http://www.americancorpus.org/)
The search options offered by the interface of these corpora are described below.
Architecture, interface, and searches
Using our unique corpus architecture, users can:
- Search by word, phrase, substring, part of speech (e.g. nouns or verbs), lemma (e.g. all forms of go: goes, went, etc), synonyms, customized wordlists, or any combination of these
- See the individual frequency of all matching forms (as well as in each section of the corpus), or the overall frequency in each genre and time period
- Find the collocates (nearby words) of a given word or phrase, which provides insight into the meaning of the word
- Compare the collocates of two words, to see differences in meaning or usage (e.g. collocates of rob vs. steal, or warm vs hot, or men vs. women, or Democrats vs. Republicans)
- Compare the collocates across time periods (provides insight into changes in meaning, such as new uses with green)
- Compare the collocates across genres to show differences in 'word sense', e.g. chair = 'committee leader' (academic) vs. 'piece of furniture' (fiction)
- Order results by Mutual Information score (shows 'relevance', in addition to raw frequency)
- With integrated thesauruses, find the frequency and distribution of synonyms of a given word (to see which synonyms are most frequent, in which genres they are used most, which are increasing or decreasing in use, etc)
- Create personalized lists of words and phrases (e.g. for a particular semantic field) and then re-use them as part of subsequent queries
- Complete, context-sensitive help files and "guided tours" for each corpus
- Save and re-use queries, as well as annotate and share your queries with others
Searching the corpus
By way of introduction, remember that when you search the corpus, you're searching over 400 million words from all types of
contemporary American English (1990-2008) -- transcripts of conversations (on shows like Good Morning America and Oprah), fiction
(e.g. novels and movie scripts), 100 different popular magazines (e.g. Good Housekeeping or Sports Illustrated), newspapers, and
academic journals. What you see is what is really going on in American English today, and this is the only site and corpus that will allow
you to carry out searches like these.
At the most basic level, you can see whether a word, phrase, grammatical construction, or word meaning has been increasing or
decreasing over the past 15-20 years, and what genre of American English uses this the most. For example, take a look at words like
laptop, funky, or bling, parts of words like the suffix -dom or the word root -clean-, phrases like perfect storm or global warming, or
grammatical constructions like end up V-ing (e.g. end up paying), or get + V-ed (e.g. get stopped). If you're looking at a chart (rather
than the table display), feel free to click on [SEE ALL SECTIONS] to compare frequency by sub-section (e.g. African-American or sports
magazines, or movie scripts and novels). So this is the most basic level -- the frequency over time and in different genres of English.
(More information on looking at changes in American English, or on comparing between different genres)
A somewhat more advanced search is seeing how a word or phrase is being used. In order to do this, you'll probably want to look at the
collocates (i.e. "nearby" words). The basic idea is that you can "tell a lot about a word by the other words that it 'hangs out' with". For
example:
Before you look for the collocates of each of the words deep, run, smile, and fairly -- what would you guess are the best collocates -- in
other words, surrounding words that really help to "define" these words? Are there any that are surprises in what you see in the corpus?
Compare the collocates of the two words democrats and republicans. According to these texts (from newspapers, magazines, TV talk
shows, etc), which of the two political parties is "electable, open-minded, and fun", and which is "extremist, mean-spirited, and greedy".
Any possible media bias here?
Compare the verbs near Clinton and Bush. Can you find out anything about recent events by looking at this list?
Compare the adjectives used to describe women and men. According to the corpus data, who is grumpy, impotent, masked, burly, and
armed, and who is glamorous, petite, and voluptuous? Does this reflect biases in contemporary American culture?
Finally, you can also see how the usage of a word is changing over time. For example, compare the types of crises talked about in the
past 2-3 years (list to the left), compared to the 1990s (list to the right). What are we worrying about now that they weren't (so much)
back then, and vice versa?
The corpus is also very useful for language learners, because it allows them to quickly "tap into" hundreds of millions of words to see
what's going on in the language, which can help to compensate for their lack of "native speaker intuition".
BRIEF TOUR (GENERAL)

The following links provide a good overview of the features of the corpus. Each link inputs values into the search interface and runs the
query against the 400+ million word corpus (i.e. these are not "canned" results). You might want to note which options have been
selected in the form, and then modify the values to create your own queries.
Using the web interface, you can search by words (mysterious), phrases (nooks and crannies or faint + noun), lemmas (all forms of
words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or verb + any word + a form
of ground. Notice that from the "frequency results" window you can click on the word or phrase to see it in context in this lower
window.
As the preceding searches indicate, the first option in the search form allows you to either see a list of all matching strings, or a chart
display that shows the frequency in the five "macro" registers (spoken, fiction, popular magazines, newspapers, and academic journals).
Look for the frequency of funky, whom, incredibly + adjective, or forms of need + to + VERB. Via the chart display, you can also see the
frequency of the word or phrase in subregisters as well, such as movie scripts, children's fiction, women's magazines, or medical
journals. With the list display, you can also see the frequency of each matching string in each of the major sections of the corpus (look for
deep + noun, with and without the totals for each section).
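The chart display described above amounts to counting a word separately in each section of the corpus. A minimal Python sketch of that idea follows; the function name `frequency_by_section`, the toy documents, and the per-million normalization are assumptions of this example, not a description of how the site computes its charts.

```python
from collections import Counter


def frequency_by_section(documents, word):
    """Return {section: (raw count, count per million words)} for one word."""
    hits, sizes = Counter(), Counter()
    for section, text in documents:
        tokens = text.lower().split()
        sizes[section] += len(tokens)
        hits[section] += tokens.count(word.lower())
    return {
        section: (hits[section], hits[section] * 1_000_000 / sizes[section])
        for section in sizes
        if sizes[section]  # skip sections that contain no words at all
    }


# Hypothetical mini-corpus: (section label, text) pairs.
documents = [
    ("SPOKEN", "that was funky really funky"),
    ("ACADEMIC", "the results were significant"),
]
for section, (raw, per_million) in frequency_by_section(documents, "funky").items():
    print(section, raw, per_million)
```

Normalizing by section size matters whenever the sections differ in length, since raw counts alone would favor the larger section.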
You can also search for collocates (words nearby a given word), which often provides insight into the meaning of a given word. For
example, you can search for the most common nouns near thick, adjectives near smile (sorted by relevance), nouns after look into, or
words starting with clos* near eyes.
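A collocate search of this kind can be sketched in Python as below. The text does not give the formula behind the site's "relevance" (Mutual Information) ordering, so the sketch assumes one common window-based formulation: the base-2 logarithm of observed co-occurrences divided by the co-occurrences expected by chance. The function name `collocates_by_mi` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def collocates_by_mi(tokens, node, span=4):
    """Rank words found within `span` tokens of `node` by a mutual information score."""
    freq = Counter(tokens)
    joint = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start, end = max(0, i - span), min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i:
                joint[tokens[j]] += 1
    scores = {}
    for word, observed in joint.items():
        # Co-occurrences expected by chance in a window of 2 * span positions.
        expected = freq[node] * freq[word] * (2 * span) / len(tokens)
        scores[word] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "thick fog covered the road and thick fog hid the thick smoke".split()
for word, score in collocates_by_mi(tokens, "thick", span=1):
    print(f"{word}\t{score:.2f}")
```

On a sample this small the scores are only illustrative; mutual information tends to overrate rare words, which is why real tools usually combine it with a minimum-frequency threshold.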
You can also include information about genre or a specific time period directly as part of the query. This allows you to see how words
and phrases vary across speech and many different types of written texts. We can easily find which words and phrases occur much more
frequently in one register than another, such as good + [noun] in fiction, or verbs in the slot [we * that] in academic writing. You can also
apply this to collocates, such as nouns with the verb break in NEWS or adjectives with woman in FICTION. Finally, you can compare
one section to another, such as nouns near chair in (ACAD vs FICTION), nouns with passionate (FICTION vs NEWSPAPER), verbs in
sports magazines compared to other magazines, or adjectives in medical journals compared to other journals.
Finally, you can easily carry out semantically-oriented searches. For example, you can compare nouns that appear with small and
little, with men and women, nouns with utter and sheer, adjectives with Democrats and Republicans (notice any bias here?), or verbs with
Clinton and Bush (or emphases there?). You can also find the frequency and distribution of synonyms of a given word, such as beautiful
or the verb clean, see which synonyms are more frequent in competing registers (such as synonyms of strong in FICTION and
ACADEMIC), and use synonyms as part of a more complex query (such as synonyms of clean with nouns). Finally, you can create
"customized lists" for any category that interests you, and then re-use these in subsequent queries (such as colors + clothes, or words
related to beautiful + forms of woman).
Hopefully this short five minute overview of the corpus has been helpful. Now feel free to look at more examples of the types of possible
searches, including word/phrase, collocates (surrounding words), synonyms, word comparisons, and customized/user-defined lists. Find
also more info on how to search by section (genre or year), and how to refine your searches with certain search options.
Search formulas understood by the interface (search syntax)

(Video instructions: http://www.youtube.com/watch?v=c8Uz2CyfOXc&feature=related; http://www.youtube.com/watch?v=dss_hlqmMGM&feature=related)
HELP: SEARCH SYNTAX
One "slot": Make sure there is no space, or it will be interpreted as two consecutive words
Syntax          Meaning                           Examples                      Sample matches
word            One exact word                    mysterious                    mysterious
[pos]           Part of speech (exact)            [vvg]                         walking, talking
[pos*]          Part of speech (wildcard)         [v*]                          swim, swims, swam, swimming
[lemma]         Lemmas (all forms of a word)      [sing]                        sing, singing, sang
                                                  [tall]                        tall, taller, tallest
[=word]         Synonyms                          [=strong]                     formidable, muscular, fervent
[user:list]     Customized lists                  [davies:clothes]              tie, shirt, blouse
word|word       Any of these words                stunning|gorgeous|charming    stunning, charming, gorgeous
*xx             Wildcard: * = any # letters       un*ly                         unlikely, unusually
x?xx            Wildcard: ? = one letter          s?ng                          sing, sang, song
x?xx*                                             s?ng*                         song, singer, songbirds
-word           NOT (followed by PoS, lemma,      -[nn*]                        the, in, is
                word, etc. Most useful for
                "multiple slot" queries; see below)
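The two wildcards in the table behave much like a restricted regular expression. The Python sketch below shows one way to emulate them over a word list; it is an illustration under assumptions, not the site's implementation. The helper names `wildcard_to_regex` and `search_wordlist` are hypothetical, and "letter" is approximated by the regex class `\w`.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a one-slot wildcard pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(r"\w*")   # * = any number of letters (including none)
        elif char == "?":
            parts.append(r"\w")    # ? = exactly one letter
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def search_wordlist(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]


words = ["unlikely", "unusually", "sing", "sang", "song", "singer", "songbirds", "ring"]
print(search_wordlist("un*ly", words))
print(search_wordlist("s?ng", words))
print(search_wordlist("s?ng*", words))
```

Using `fullmatch` rather than `search` mirrors the table: `s?ng` matches sing but not singer, while `s?ng*` matches both.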
Combinations of preceding (samples)
You can limit to a particular part of speech by adding a period (full stop) and then the part of speech tag in brackets. This is always
optional. Make sure there is no space before or after the period (full stop), or it will be interpreted as two consecutive words
word.[pos]            Exact word and part of speech                 strike.[v*]               strike (only as a verb)
word*.[pos]           Substring and part of speech                  dis*.[vvd]                discovered, disappeared, discussed
[lemma].[pos]         Lemma and part of speech                      [strike].[v*]             strike, struck, striking
[=word].[pos]         Synonym and part of speech                    [=beat].[v*]              hit, strike, defeat (but not nouns, like rhythm or drumming)
You can add "lemma" to any other type of search, such as synonym or customized list, to see all forms of the matching words. Just
use an extra set of brackets.
[[=word]]             Synonym and lemma                             [[=publish]]              announced, circulating, publishes, issue (no part of speech specified, so some noun uses)
[[user:list]]         Customized list and lemma                     [[davies:clothes]]        tie, tying, socks, socked, shirt, blouses (no part of speech specified, hence tying)
You can also choose lemma and part of speech by combining the preceding symbols
[[=word]].[pos]       Synonym and lemma and part of speech          [[=clean]].[v*]           mop, scrubs, polishing
[[user:list]].[pos]   Customized list and lemma and part of speech  [[davies:clothes]].[n*]   tie, ties, sock, socks (i.e. just nouns)
Multiple "slots": Create sequences of words, using any of the preceding query types. Note that in each case, there is a space between
the word "slots" in the query. These are just a few examples, from an unlimited number of combinations.
Query                                     Sample matches
nooks and crannies                        nooks and crannies
fast|quick|rapid [nn*]                    fast food
                                          rapid transit
pretty -[nn*]                             pretty smart
                                          pretty as
                                          (but not pretty girl, pretty picture, etc)
[get] her to [v*]                         get her to stay
                                          got her to sleep
.|,|; nevertheless [p*] [v*]              . Nevertheless it is
                                          ; nevertheless he said
(Notice that punctuation can be used like any "word"; just make sure that it is separated from words by a space)
[break] the [nn*]                         break the law
                                          broke the story
[[beat]].[v*] * [nn*]                     beat the Yankees
                                          beaten to death
[=gorgeous] [nn*]                         beautiful woman
                                          attractive wife
[put] on [ap*] [davies:clothes].[n*]      put on her hat
                                          putting on my pants
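The multi-slot idea (one pattern per word position, separated by spaces) can be sketched in Python as follows. This is a deliberately reduced illustration, not the site's query engine. It assumes the text is already available as (word, tag) pairs with lowercase CLAWS7-style tags, and it handles only exact words, `|` alternatives, wildcards, bracketed part-of-speech patterns and `-` negation. Lemma, synonym and customized-list slots are not supported, so every bracketed slot is read as a part-of-speech pattern. The names `slot_matches` and `find_sequences` are hypothetical.

```python
from fnmatch import fnmatchcase


def slot_matches(slot, word, tag):
    """Check one query slot against one (word, tag) pair."""
    negate = slot.startswith("-")
    if negate:
        slot = slot[1:]
    if slot.startswith("[") and slot.endswith("]"):
        # Bracketed slot: treated here as a part-of-speech pattern such as [nn*].
        hit = fnmatchcase(tag.lower(), slot[1:-1].lower())
    else:
        # Plain slot: one word or several alternatives separated by |.
        hit = any(fnmatchcase(word.lower(), alt.lower()) for alt in slot.split("|"))
    return hit != negate


def find_sequences(query, tagged_tokens):
    """Return every run of consecutive tokens that matches all slots of the query."""
    slots = query.split()
    matches = []
    for start in range(len(tagged_tokens) - len(slots) + 1):
        window = tagged_tokens[start:start + len(slots)]
        if all(slot_matches(s, w, t) for s, (w, t) in zip(slots, window)):
            matches.append(" ".join(w for w, _ in window))
    return matches


# Hypothetical tagged sentence.
tagged = [("She", "pphs1"), ("looked", "vvd"), ("pretty", "rg"), ("smart", "jj"),
          ("and", "cc"), ("ate", "vvd"), ("fast", "jj"), ("food", "nn1")]
print(find_sequences("pretty -[nn*]", tagged))
print(find_sequences("fast|quick|rapid [nn*]", tagged))
```

Splitting the query on spaces is what makes the warning in the table matter: a stray space inside a slot would silently turn one slot into two.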
Part of speech (POS) tags
(CLAWS7 tagset: http://ucrel.lancs.ac.uk/claws7tags.html)
APPGE possessive pronoun, pre-nominal (e.g. my, your, our)

AT article (e.g. the, no)
AT1 singular article (e.g. a, an, every)
BCL before-clause marker (e.g. in order (that),in order (to))
CC coordinating conjunction (e.g. and, or)
CCB adversative coordinating conjunction ( but)
CS subordinating conjunction (e.g. if, because, unless, so, for)
CSA as (as conjunction)
CSN than (as conjunction)
CST that (as conjunction)
CSW whether (as conjunction)
DA after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
DA1 singular after-determiner (e.g. little, much)
DA2 plural after-determiner (e.g. few, several, many)
DAR comparative after-determiner (e.g. more, less, fewer)
DAT superlative after-determiner (e.g. most, least, fewest)
DB before determiner or pre-determiner capable of pronominal function (all, half)
DB2 plural before-determiner ( both)
DD determiner (capable of pronominal function) (e.g any, some)
DD1 singular determiner (e.g. this, that, another)
DD2 plural determiner ( these,those)
DDQ wh-determiner (which, what)
DDQGE wh-determiner, genitive (whose)
DDQV wh-ever determiner, (whichever, whatever)
EX existential there
FO formula
FU unclassified word
FW foreign word
GE germanic genitive marker (' or 's)
IF for (as preposition)
II general preposition
IO of (as preposition)
IW with, without (as prepositions)
JJ general adjective
JJR general comparative adjective (e.g. older, better, stronger)
JJT general superlative adjective (e.g. oldest, best, strongest)
JK catenative adjective (able in be able to, willing in be willing to)
MC cardinal number,neutral for number (two, three..)
MC1 singular cardinal number (one)
MC2 plural cardinal number (e.g. sixes, sevens)
MCGE genitive cardinal number, neutral for number (two's, 100's)
MCMC hyphenated number (40-50, 1770-1827)
MD ordinal number (e.g. first, second, next, last)
MF fraction,neutral for number (e.g. quarters, two-thirds)
ND1 singular noun of direction (e.g. north, southeast)
NN common noun, neutral for number (e.g. sheep, cod, headquarters)
NN1 singular common noun (e.g. book, girl)
NN2 plural common noun (e.g. books, girls)
NNA following noun of title (e.g. M.A.)
NNB preceding noun of title (e.g. Mr., Prof.)
NNL1 singular locative noun (e.g. Island, Street)
NNL2 plural locative noun (e.g. Islands, Streets)
NNO numeral noun, neutral for number (e.g. dozen, hundred)
NNO2 numeral noun, plural (e.g. hundreds, thousands)
NNT1 temporal noun, singular (e.g. day, week, year)
NNT2 temporal noun, plural (e.g. days, weeks, years)
NNU unit of measurement, neutral for number (e.g. in, cc)
NNU1 singular unit of measurement (e.g. inch, centimetre)
NNU2 plural unit of measurement (e.g. ins., feet)
NP proper noun, neutral for number (e.g. IBM, Andes)
NP1 singular proper noun (e.g. London, Jane, Frederick)
NP2 plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1 singular weekday noun (e.g. Sunday)
NPD2 plural weekday noun (e.g. Sundays)
NPM1 singular month noun (e.g. October)
NPM2 plural month noun (e.g. Octobers)
PN indefinite pronoun, neutral for number (none)
PN1 indefinite pronoun, singular (e.g. anyone, everything, nobody, one)
PNQO objective wh-pronoun (whom)
PNQS subjective wh-pronoun (who)
PNQV wh-ever pronoun (whoever)
PNX1 reflexive indefinite pronoun (oneself)
PPGE nominal possessive personal pronoun (e.g. mine, yours)
PPH1 3rd person sing. neuter personal pronoun (it)
PPHO1 3rd person sing. objective personal pronoun (him, her)
PPHO2 3rd person plural objective personal pronoun (them)
PPHS1 3rd person sing. subjective personal pronoun (he, she)
PPHS2 3rd person plural subjective personal pronoun (they)
PPIO1 1st person sing. objective personal pronoun (me)
PPIO2 1st person plural objective personal pronoun (us)
PPIS1 1st person sing. subjective personal pronoun (I)
PPIS2 1st person plural subjective personal pronoun (we)
PPX1 singular reflexive personal pronoun (e.g. yourself, itself)
PPX2 plural reflexive personal pronoun (e.g. yourselves, themselves)
PPY 2nd person personal pronoun (you)
RA adverb, after nominal head (e.g. else, galore)
REX adverb introducing appositional constructions (namely, e.g.)
RG degree adverb (very, so, too)
RGQ wh- degree adverb (how)
RGQV wh-ever degree adverb (however)
RGR comparative degree adverb (more, less)
RGT superlative degree adverb (most, least)
RL locative adverb (e.g. alongside, forward)
RP prep. adverb, particle (e.g about, in)
RPK prep. adv., catenative (about in be about to)
RR general adverb
RRQ wh- general adverb (where, when, why, how)
RRQV wh-ever general adverb (wherever, whenever)
RRR comparative general adverb (e.g. better, longer)
RRT superlative general adverb (e.g. best, longest)
RT quasi-nominal adverb of time (e.g. now, tomorrow)
TO infinitive marker (to)
UH interjection (e.g. oh, yes, um)
VB0 be, base form (finite i.e. imperative, subjunctive)
VBDR were
VBDZ was
VBG being
VBI be, infinitive (To be or not... It will be ..)
VBM am
VBN been
VBR are
VBZ is
VD0 do, base form (finite)
VDD did
VDG doing
VDI do, infinitive (I may do... To do...)
VDN done
VDZ does
VH0 have, base form (finite)
VHD had (past tense)
VHG having
VHI have, infinitive
VHN had (past participle)
VHZ has
VM modal auxiliary (can, will, would, etc.)
VMK modal catenative (ought, used)
VV0 base form of lexical verb (e.g. give, work)
VVD past tense of lexical verb (e.g. gave, worked)
VVG -ing participle of lexical verb (e.g. giving, working)
VVGK -ing participle catenative (going in be going to)
VVI infinitive (e.g. to give... It will work...)
VVN past participle of lexical verb (e.g. given, worked)
VVNK past participle catenative (e.g. bound in be bound to)
VVZ -s form of lexical verb (e.g. gives, works)
XX not, n't
ZZ1 singular letter of the alphabet (e.g. A,b)
ZZ2 plural letter of the alphabet (e.g. A's, b's)
NOTE: "DITTO TAGS"

Any of the tags listed above may in theory be modified by the addition of a pair of numbers to it: eg. DD21, DD22 This signifies that the tag occurs as
part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example the
expression in terms of is treated as a single preposition, receiving the tags:
in_II31 terms_II32 of_II33
The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence.
Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range
of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is
taken into account, and also that, depending on the context, ditto tags may or may not be required for a particular word sequence:
at_RR21 length_RR22
a_DD21/RR21 lot_DD22/RR22
in_CS21/II that_CS22/DD1
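The two-digit convention can be decoded mechanically. The Python sketch below groups a tagged sequence into multi-word units; it assumes `word_TAG` tokens separated by spaces, as in the example above, and it does not handle ambiguous entries written with a slash (such as a_DD21/RR21). The function name `group_ditto_tags` is hypothetical.

```python
import re

# Base tag, then two digits: number of words in the unit, position in the unit.
# A base tag may itself end in one digit (e.g. DD1), hence the optional \d.
DITTO = re.compile(r"^([A-Z]+\d?)([2-9])([1-9])$")


def group_ditto_tags(tagged_text):
    """Merge ditto-tagged words into single (expression, base tag) units."""
    tokens = [item.rsplit("_", 1) for item in tagged_text.split()]
    units = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        match = DITTO.match(tag)
        if match and match.group(3) == "1":
            length = int(match.group(2))
            words = [w for w, _ in tokens[i:i + length]]
            units.append((" ".join(words), match.group(1)))
            i += length
        else:
            units.append((word, tag))
            i += 1
    return units


print(group_ditto_tags("in_II31 terms_II32 of_II33 cost_NN1"))
print(group_ditto_tags("at_RR21 length_RR22"))
```

Ordinary tags that end in a single digit, such as NN1, are left untouched because the pattern requires two trailing digits after the base tag.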
Real Academia Española - Corpus Diacrónico del Español (CORDE). http://corpus.rae.es/cordenet.html
The data bank of the Real Academia Española consists of two large text corpora: the Corpus de Referencia del Español Actual (CREA, written and spoken) (http://corpus.rae.es/creanet.html) and the Corpus Diacrónico del Español (CORDE) (http://corpus.rae.es/cordenet.html). The two collections are complementary: CREA contains texts from 1975 to 2004, while CORDE includes texts from all earlier periods. The integrated nature of the two corpora is reflected in the plan that texts from periods which, with the passage of time, fall outside the scope of CREA will be moved into CORDE.
At present, CREA comprises slightly more than 154 million word forms in its written part and slightly more than eight million in its spoken part. CORDE reaches nearly 250 million forms. Through the concordance application, researchers have at their disposal about 400 million forms from all periods of Spanish, from both Spain and the Americas, which is without doubt the most important resource ever available for the study of this language.
Encoding. All the materials processed in both CREA and CORDE carry a set of textual markers, established according to the international standard SGML (Standard Generalized Markup Language) and in line with the recommendations of the TEI (Text Encoding Initiative). These markers facilitate information retrieval and the exchange of texts with other corpora, and guarantee independence from operating systems and programs.
Statistics

Corpus                      Words
CORDE                       236709914
CREA                        154212661

Origin                      CORDE words   CREA words    Total words
Spain                       196106277     85563661      281669938
Spanish America             37562461      68569561      106132022
Other                       3041176       79439         3120615

Medium                      CORDE words   CREA words    Total words
Books                       234056904     80677065      314733969
Press                       2653010       70702838      73355848
Miscellaneous (CREA only)                 2832758       2832758

Linguistic zones of America CORDE words   CREA words    Total words
Andean                      15076129      16536515      31612644
Caribbean                   3345900       9543456       12889356
Central                     714341        2883979       3598320
Chilean                     3983488       6494966       10478454
Mexican                     7822486       16897942      24720428
Rioplatense                 6621954       16212703      22834657
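The breakdowns by origin and by medium can be checked against the overall corpus sizes with a few lines of Python. The sketch below only re-adds the figures given in the statistics; the English row labels and the function name `column_sums` are choices of this example.

```python
# Word counts copied from the statistics above, as (CORDE, CREA) pairs.
TOTALS = {"CORDE": 236709914, "CREA": 154212661}
BY_ORIGIN = {
    "Spain": (196106277, 85563661),
    "Spanish America": (37562461, 68569561),
    "Other": (3041176, 79439),
}
BY_MEDIUM = {
    "Books": (234056904, 80677065),
    "Press": (2653010, 70702838),
    "Miscellaneous": (0, 2832758),  # this category exists in CREA only
}


def column_sums(breakdown):
    """Sum the CORDE and CREA columns of one breakdown table."""
    corde = sum(row[0] for row in breakdown.values())
    crea = sum(row[1] for row in breakdown.values())
    return {"CORDE": corde, "CREA": crea}


for name, table in (("origin", BY_ORIGIN), ("medium", BY_MEDIUM)):
    print(name, column_sums(table) == TOTALS)
print(sum(TOTALS.values()))
```

Both breakdowns add up exactly to the corpus totals, and the two corpora together come to 390,922,575 words, in line with the figure of about 400 million forms quoted in the description.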
Nuevo tesoro lexicográfico de la lengua española (NTLLE)

The Nuevo tesoro lexicográfico de la lengua española (NTLLE) brings together the most representative Spanish lexicographic works: a broad selection of the works that, over the last five hundred years, have recorded, defined and consolidated the lexical heritage of the Spanish language. It is a dictionary of dictionaries, a total dictionary containing the entire lexicon of Spanish from the 15th to the 20th century, as recorded, systematized, defined and inventoried by the most important lexicographic repertories, monolingual or bilingual, devoted to the Spanish language. The NTLLE thus not only offers the extraordinary possibility of having together nearly 70 dictionaries that no library in the world is in a position to hold as a set, but also permits something previously unthinkable: searching, in a single query operation, for one or several words simultaneously across all the dictionaries it contains.
It is a project to create a digital library of dictionaries without parallel in the world.
Intended users: people interested in knowing the Spanish language better, in discovering the evolution of its words, and in deepening their knowledge of its lexicon
List of dictionaries: http://www.rae.es/rae/gestores/gespub000020.nsf/(voAnexos)/arch212F9B52EA5926BCC125716A003313E8/$FILE/ListaDiccionarios.htm
Query: http://buscon.rae.es/ntlle/SrvltGUIMenuNtlle?cmd=Lema&sec=1.0.0.0.0.
Trésor de la Langue Française informatisé (TLFi): http://atilf.atilf.fr/

100,000 words, 270,000 definitions, 430,000 examples
Tesoro della Lingua Italiana delle Origini (TLIO) http://tlio.ovi.cnr.it/TLIO/ricindex.html

- produced on the basis of the Opera del Vocabolario Italiano (OVI)
Tesoro della Lingua Italiana delle Origini (TLIO): the first historical dictionary of early Italian
The TLIO is a historical dictionary of early Italian, based on all the available documentation from the first document that can be called Italian, namely the Indovinello veronese of the beginning of the 9th century, to the end of the 14th century (the symbolic end point is 1375, the year of Boccaccio's death).
- The dictionary is currently about forty percent complete. By the end of 2009 it had reached a total of about 20,500 entries, out of about 50,000 foreseen for the complete work.
- The TLIO is compiled first-hand, that is, by studying the texts directly rather than taking words, definitions and examples from existing dictionaries, except in limited cases that are always flagged. To do this it draws on the large database of early Italian that the OVI has built for this purpose, which is available to the public for free consultation on the same site.
- The TLIO treats all the linguistic varieties of early Italian together, including those that today are called dialects. This reflects the fact that the Italian national language as such is a creation of the early 16th century. Although already in the 14th century Florentine, and Tuscan in general, tend to prevail in the Italian linguistic landscape, a dictionary that neglected the other varieties (Lombard, Venetian, Campanian, Sicilian and so on) would give a distorted picture of the history of the language.
Text corpus of the Tesoro della Lingua Italiana delle Origini

The text corpus of the TLIO is the largest database available today for the Italian language before 1375. It currently contains 1978 texts totalling 21,818,088 words (occurrences).
- The corpus is lemmatized and can therefore be queried online through GattoWeb.
- A non-lemmatized version of the same corpus can be queried online through ItalNet (this is the version longest known to scholars, having been online since 1998; the GattoWeb version was first published in October 2005).
- The ItalNet version of the OVI's Banca Dati dell'Italiano Antico is awaiting an update
GATTOWEB:
http://www.ovi.cnr.it/index.php?page=banchedati
Corpus OVI dell'Italiano antico: the corpus on which the Tesoro della Lingua Italiana delle Origini is compiled. It can also be searched by lemma.
Corpus TLIO aggiuntivo: contains texts destined to enter the Corpus OVI dell'Italiano antico, provisionally not lemmatized.
Archivio Datini: the corpus of the published letters from the archive of Francesco di Marco Datini, prepared by the OVI on behalf of the Archivio di Stato di Prato, with a thematic lemmatization.
Corpus ARTESIA (Archivio Testuale del Siciliano Antico), edited by Mario Pagano: a corpus built in GATTO at the University of Catania, on whose behalf the OVI makes it searchable online.
Corpus Folchetto, by Paolo Squillacioti: an experiment in building a lemmatized corpus of troubadour poetry, carried out on the poems of Folchetto di Marsiglia.
Dicționarul Tezaur al Limbii Române in electronic format (eDTLR)

The Digital Form of the Thesaurus Dictionary of the Romanian Language (eDTLR)
http://profs.info.uaic.ro/~dcristea/papers/Cristea%20et%20al-SPeD07.pdf