Professional Documents
Culture Documents
Application #2:
Lexicography Application #2:
Collocations Collocations
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
Calculating
collocations
Corpora for lexicography Calculating
collocations
Corpus Linguistics Practical work I Can extract authentic & typical examples, with Practical work
1 / 35 2 / 35
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
3 / 35 4 / 35
Collocations Collocations are expressions of two or more words that are Collocations
Defining a collocation Defining a collocation
Krishnamurthy in some sense conventionalized as a group Krishnamurthy
“People disagree on collocations”
Calculating Calculating
I Intuition does not seem to be a completely reliable way collocations I strong tea (cf. ??powerful tea) collocations
5 / 35 6 / 35
Corpus Linguistics Corpus Linguistics
Prototypical collocations Application #2:
Compositionality tests Application #2:
Collocations Collocations
Collocations Collocations
Defining a collocation Defining a collocation
Calculating Calculating
collocations with corpus data collocations
I Non-compositional: meaning of kick the bucket not
Practical work Practical work
composed of meaning of parts (At least) two tests we can use with corpora:
I More subtly: red/white hair sort of composable, but the I Is the collocation translated word-by-word into another
color of red/white here is different than usual language?
I Non-substitutable: orange hair just as accurate as red I e.g., Collocation make a decision is not translated
hair, but we don’t say it literally into French
I Do the two words co-occur more frequently together
I Non-modifiable: often we cannot modify a collocation,
than we would otherwise expect?
even though we normally could modify one of those I e.g., of the is frequent, but both words are frequent, so
words: ??kick the red bucket we might expect this
7 / 35 8 / 35
Collocations Collocations
Defining a collocation Defining a collocation
Collocations come in different guises: Krishnamurthy
Semantic prosody = “a form of meaning which is Krishnamurthy
Calculating Calculating
I Light verbs: verbs convey very little meaning but must collocations established through the proximity of a consistent series of collocations
be the right one: Practical work collocates” (Louw 2000) Practical work
I make a decision vs. *take a decision, take a walk vs. I These are typically negative: e.g., peddle, ripe for, get
*make a walk
I Phrasal verbs: main verb and particle combination,
often translated as a single word: I The idea is that you can tell the semantic prosody of a
I to tell off, to call up word by the types of words it frequently co-occurs with
I Proper nouns: slightly different than others, but each This type of co-occurrence often leads to general semantic
refers to a single idea (e.g., Brooks Brothers) preferences
I Terminological expressions: technical terms that form a I e.g., utterly, totally, etc. typically have a feature of
unit (e.g., hydraulic oil filter) ‘absence or change of state’
9 / 35 10 / 35
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
Firth 1957: “You shall know a word by the company it keeps” Calculating Calculating
I Collocational meaning is a syntagmatic type of
collocations We often look for words which are adjacent to make up a collocations
An example is that ass is associated with a set of adjectives We can also speak of upward/downward collocations:
(think of goose if you prefer) I downward: involves a more frequent node word A with
I silly, obstinate, stupid, awful a less frequent collocate B
I We can see a lexical set associated with this word
I upward: weaker relationship, tending to be more of a
grammatical property
Lexical sets and collocations can vary across genres,
subcorpora, etc.
11 / 35 12 / 35
Corpus Linguistics Corpus Linguistics
Corpus linguistics Application #2:
Calculating collocations Application #2:
Krishnamurthy 2000 Collocations Collocations
Collocations (The slides from here on out are based on Manning & Schütze Collocations
Defining a collocation Defining a collocation
Krishnamurthy (M&S) 1999) Krishnamurthy
Calculating Calculating
collocations collocations
The simplest thing to do to find collocations is to use
Practical work Practical work
Where collocations fit into corpus linguistics: frequency counts: two words appearing together a lot are a
1. Pattern recognition: recognize lexical and grammatical collocation
units
The problem is that we get lots of uninteresting pairs of
2. Frequency list generation: rank words function words (M&S 1999, table 5.1)
3. Concordancing: observe word behavior
C(w1 , w2 ) w1 w2
4. Collocations: take concordancing a step further ...
80871 of the
58841 in the
26430 to the
21842 on the
13 / 35 14 / 35
Collocations Collocations
Defining a collocation Defining a collocation
To remove frequent pairings which are uninteresting, we can Krishnamurthy Some results after tag filtering (M&S 1999, table 5.3) Krishnamurthy
15 / 35 16 / 35
Collocations Collocations
Defining a collocation We want to compare the likelihood of 2 words next to other Defining a collocation
Krishnamurthy Krishnamurthy
19 / 35 20 / 35
Collocations Collocations
Defining a collocation
Krishnamurthy
Our probabilities (p (w1 w2 ), p (w1 ), p (w2 )) are all basically Defining a collocation
Krishnamurthy
One way to see if two words are strongly connected is to Calculating calculated in the same way: Calculating
collocations collocations
compare C (x )
Practical work (8) p (x ) = N Practical work
I the probability of the two words appearing together if
they are independent (p (w1 )p (w2 )) I N is the number of words in the corpus
I the actual probability of the two words appearing I The number of bigrams ≈ the number of unigrams
together (p (w1 w2 )) p (w w )
(9) I(w1 , w2 ) = log p (w1 )1p (w2 2 )
The pointwise mutual information is a measure to do this:
C (w1 w2 )
p ( w1 w2 ) = log N
(7) I(w1 , w2 ) = log p ( w1 ) p ( w2 )
C (w1 ) C (w2 )
N N
= log[N CC(w(1w)1Cw(2w)2 ) ]
21 / 35 22 / 35
Collocations Collocations
Defining a collocation The formula we have also has the following equivalencies: Defining a collocation
I C (Ayatollah ) = 42 Practical work Mutual information tells us how much more information we Practical work
23 / 35 24 / 35
Corpus Linguistics Corpus Linguistics
Motivating Contingency Tables Application #2:
Contingency Tables Application #2:
Collocations Collocations
Collocations Collocations
Defining a collocation Defining a collocation
What we can instead get at is: which bigrams are likely, out Krishnamurthy
We can count up these different possibilities and put them
Krishnamurthy
Calculating Calculating
of a range of possibilities? collocations into a contingency table (or 2x2 table) collocations
25 / 35 26 / 35
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
Because each cell indicates a bigram, divide each of the If we assumed that sherlock and holmes are
Calculating Calculating
cells by the total number of bigrams (7105) to get collocations independent—i.e., the probability of one is unaffected by the collocations
probabilities: Practical work probability of the other—we would get the following table: Practical work
27 / 35 28 / 35
Collocations Collocations
Defining a collocation The chi-square (χ2 ) test measures how far the observed Defining a collocation
Krishnamurthy Krishnamurthy
values are from the expected values:
Multiplying by 7105 (the total number of bigrams) gives us Calculating Calculating
collocations collocations
the expected number of times we should see each bigram: P (fo −fe )2
Practical work (12) χ2 = fe Practical work
= 974.05
I The values in this chart are the expected frequencies
(fe ) If you look this up in a table, you’ll see that it’s unlikely to be
chance
NB: The χ2 test does not work well for rare events, i.e., fe < 6
29 / 35 30 / 35
Corpus Linguistics Corpus Linguistics
Working with collocations Application #2:
Calculating collocations: web interface Application #2:
Collocations Collocations
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
Calculating
For today’s practice, let’s work with an online concordancer Calculating
collocations of the BNC, http://corpus.byu.edu/bnc/ collocations
31 / 35 32 / 35
Collocations Collocations
Defining a collocation Defining a collocation
Krishnamurthy Krishnamurthy
Calculating Calculating
collocations You’ll note that I simply wrote a function (pmi) which collocations
I wrote a Perl script which does the following:
Practical work calculate pointwise mutual information Practical work
1. Reads in a corpus file (could be changed to read over a
I This could be replaced by any function that someone
directory of files, if need be)
(else) writes to calculate a score
2. Stores unigram and bigram counts as it reads the file in I Or, you could output the data without a score and input
3. Loops over all bigrams that into a statistical package
4. For each bigram, calculates the pointwise mutual I e.g., we could change the output into a
information score comma-separated value (CSV) file, readable by excel
and other software
33 / 35 34 / 35
Corpus Linguistics
Testing our intutions Application #2:
Collocations
Collocations
Defining a collocation
Krishnamurthy
Calculating
collocations
Practical work
Let’s play around with collocations:
I Trying different corpora on jones
I Hypothesizing better/worse collocations
I Trying to implement a POS filter in our Perl code
I . . . or any other ideas we get . . .
35 / 35