Lecture 5: Collocations
(Chapter 5 of Manning and Schütze)
Wen-Hsiang Lu
Department of Computer Science and Information Engineering,
National Cheng Kung University
2004/10/13
(Slides from Dr. Mary P. Harper,
http://min.ecn.purdue.edu/~ee669/)
Fall 2001 EE669: Natural Language Processing 2
What is a Word?
This is harder to pin down than one might imagine. In
general we talk about word forms: a word form (in written
English) is a particular configuration of letters, wherever it
occurs. Each individual occurrence of a word form is
called a token.
As an example, take 'water'. There are related word
forms: water, waters, watered, watering, watery. This set
of word forms is called a lemma.
Some argue that the best way to find how water is used is
by studying many cases of actual use, and abandoning
prejudices about generalities such as syntactic categories.
Definition Of Collocation
(wrt Corpus Literature)
A collocation is defined as a sequence of two or
more consecutive words, that has characteristics of
a syntactic and semantic unit, and whose exact and
unambiguous meaning or connotation cannot be
derived directly from the meaning or connotation
of its components. [Choueka, 1988]
Word Collocations
It is believed that children learn word forms like water as
distinct items. Only in school do they begin to recognize
nouns and verbs after learning formal rules of language.
Many units of meaning extend over several words, for
example kick the bucket, whose meaning cannot be
derived from its parts.
[Many of the] sentences we utter [can be thought of as]
[having a substantial component of] [frequently used]
phrases. Again, this can be thought of in terms of
collocations, where these groups of words reflect
frequently encountered contexts in the language
community.
Word Collocations
Many of the ways that, say, water can be used are subtly
linked to very specific situations, and to other words. An
example:
His mouth watered.
His eyes watered.
But this paradigm doesn't extend to watering.
The roast was mouthwatering.
*The [smokey] nightclub was eyewatering.
Special relationships between words which tend to be used
together (or not) are called collocational constraints.
Word Collocations
Collocation
Firth: word is characterized by the company it keeps;
collocations of a given word are statements of the
habitual or customary places of that word.
non-compositionality of meaning
cannot be derived directly from its parts (heavy rain)
non-substitutability in context
for parts (red light)
non-modifiability (& non-transformability)
kick the yellow bucket; take exceptions to
Association and Co-occurrence:
Terms
Considerable overlap between the concepts of collocations and
(in technical domains) terms, technical term and terminological
phrase. (Terms in IR refers to both words and phrases.)
Terms appear together or in the same (or similar) context:
(doctors, nurses)
(hardware, software)
(gas, fuel)
(hammer, nail)
(communism, free speech)
Collocations sometimes reflect attitudes (e.g., towards different
types of substances: strong cigarettes, tea, coffee versus
powerful drug (e.g., heroin)).
Linguistic Subclasses of
Collocations
Light verbs: verbs with little semantic content like make,
take, do
Terminological Expressions: concepts and objects in
technical domains (e.g., hard drive)
Idioms: fixed phrases
kick the bucket, birds-of-a-feather, run for office
Proper names: difficult to recognize even with lists
Tuesday (person's name), May, Winston Churchill, IBM, Inc.
Numerical expressions
containing ordinary words
Monday Oct 04 1999, two thousand seven hundred fifty
Verb-particle constructions or Phrasal Verbs
Separable parts:
look up, take off, tell off
Motivation
Tasks where words and the company they keep are
important:
word sense disambiguation (MT, IR, IE)
lexical entries: subdivision and definitions (lexicography)
language modeling (generalization, smoothing)
word/phrase/term translation (MT, Multilingual IR)
NL generation (natural phrases) (Generation, MT)
parsing (lexically-based selectional preferences)
Collocations
Collocations are not necessarily adjacent
Collocations cannot be directly translated into
other languages.
It may be better to use the term collocation in the
narrower sense of grammatically bound elements
that occur in a particular order, and use
association or co-occurrence for words that
appear together in context.
Other Word Relations
Synonymy: different form/word, same meaning:
notebook / laptop
Antonymy: opposite meaning:
new/old, black/white, start/stop
Homonymy: same form/word, different meaning:
true (random, unrelated): can (aux. verb / can of Coke)
related: polysemy; notebook, shift, grade, ...
Other:
Hyperonymy/Hyponymy: general vs. specific: vehicle/car
Meronymy/Holonymy: part vs. whole: leg/body
Collocation Exploration
Computers can be used to study collocation in large text
corpora. Typically this means selecting a word form type
which one wants to study, which will serve as the node.
Each occurrence of the type is a token. We then select a
span within which we want to study co-occurrence of other
words. Most of the significant relationships are within a
span of plus or minus four.
Whenever a token of our node word occurs, we tally each
of the tokens of other words which occur within its span.
Where there are many occurrences of the node word, a
statistical profile of that word's collocates starts to emerge.
This works a lot better for content words than it does with
function words.
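The tallying procedure described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a toy sentence; the function name `collocate_profile` and the sample text are hypothetical and not part of the original slides.

```python
from collections import Counter

def collocate_profile(tokens, node, span=4):
    """Tally tokens of other words within +/- span positions of each node token."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            # Count only other words, not further tokens of the node itself.
            if tokens[j] != node:
                profile[tokens[j]] += 1
    return profile

# Toy text (an assumption for illustration only).
text = "his mouth watered and his eyes watered as the roast was served"
tokens = text.split()
profile = collocate_profile(tokens, "watered", span=4)
print(profile.most_common(3))
```

With a real corpus the same counter, accumulated over many node tokens, gives the statistical profile of the node word's collocates.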
Overview of the Collocation
Detection Techniques Surveyed
Selection of Collocations by Frequency
Selection of Collocation based on Mean and
Variance of the distance between focal word and
collocating word.
Hypothesis Testing
Pointwise Mutual Information
Using Frequency to Hunt for
Collocations
The most frequent n-grams are not, in general, collocations;
many involve function words or are common names.
Simple heuristic methods help to improve the collocation
yield of the n-grams.
Use knowledge of stop words: words/forms that cannot alone
make up a collocation
a, the, and, or, but, not,
Use part-of-speech patterns to filter the n-grams (Justeson and
Katz, 1995)
Adj Noun (cold feet)
Noun Noun (oil prices)
Noun Preposition Noun (degrees of freedom)
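A part-of-speech filter over bigram counts might look like the sketch below. It assumes the text has already been tagged; the toy tagged corpus, the coarse tag names (A, N, V, ...), and the function name `candidate_bigrams` are hypothetical choices for illustration, not the tag set used by Justeson and Katz.

```python
from collections import Counter

# Hypothetical pre-tagged toy corpus: (word, coarse POS tag) pairs.
tagged = [
    ("the", "DET"), ("oil", "N"), ("prices", "N"), ("rose", "V"),
    ("and", "CONJ"), ("the", "DET"), ("oil", "N"), ("prices", "N"),
    ("fell", "V"), ("on", "P"), ("cold", "A"), ("feet", "N"),
]

# Two-word tag patterns that are allowed to form a candidate collocation.
ALLOWED = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_tokens, patterns=ALLOWED):
    """Count adjacent word pairs whose tag sequence matches an allowed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return counts

counts = candidate_bigrams(tagged)
print(counts.most_common())
```

Bigrams such as "the oil" never enter the counts, so ranking by frequency afterwards surfaces content-word pairs only.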
Mean and Variance (Smadja et al.,
1993)
Frequency-based search works well for fixed phrases.
However, many collocations consist of two words in more
flexible (although regular) relationships. For example,
knock and door may not occur at a fixed distance from each other.
One method of detecting these flexible relationships uses the
mean and variance of the offset (signed distance) between the
two words in the corpus.
If the offsets are randomly distributed (i.e., no collocation),
then the variance will be high (and means close to zero as
would be the case for a uniform distribution).
Mean, Sample Variance, and
Standard Deviation
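\[ \bar{X} = \mathrm{Mean}(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i \]
\[ s^2 = \mathrm{Var}(X_1, X_2, \ldots, X_n) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
\[ s = \mathrm{SD}(X_1, X_2, \ldots, X_n) = \sqrt{\mathrm{Var}(X_1, X_2, \ldots, X_n)} \]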
Example: Knock and Door
1. She knocked on his door.
2. They knocked at the door.
3. 100 women knocked on the big red door.
4. A man knocked on the metal front door.
Average offset between knock and door:
\[ (3 + 3 + 5 + 5) / 4 = 4 \]
Variance:
\[ \frac{(3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2}{4-1} = \frac{4}{3} \]
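The offsets in this example can be recomputed with a short script. This is a sketch under simple assumptions: whitespace tokenization, punctuation stripped, and prefix matching on "knock" and "door"; the helper name `offset` is hypothetical.

```python
import statistics

sentences = [
    "She knocked on his door.",
    "They knocked at the door.",
    "100 women knocked on the big red door.",
    "A man knocked on the metal front door.",
]

def offset(sentence, first="knock", second="door"):
    """Signed distance in tokens from the first word form to the second."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    i = next(k for k, w in enumerate(words) if w.startswith(first))
    j = next(k for k, w in enumerate(words) if w.startswith(second))
    return j - i

offsets = [offset(s) for s in sentences]
mean_offset = statistics.mean(offsets)
# statistics.variance uses the n - 1 denominator (sample variance).
var_offset = statistics.variance(offsets)
print(offsets, mean_offset, var_offset)
```

A low variance relative to the span suggests the two words sit at a fairly regular distance from each other.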
Hypothesis Testing: Overview
We want to determine whether the cooccurrence is
random or whether it occurs more often than chance. This
is a classical problem of Statistics called Hypothesis
Testing.
We formulate a null hypothesis $H_0$ (the association occurs
by chance). Assuming this, calculate the probability that a
collocation would occur if $H_0$ were true. If the probability
is very low (e.g., p < 0.05) (thus confirming interesting
things are happening!), then reject $H_0$; otherwise retain it
as possible.
In this case, we assume that two words are not collocations
if they occur independently.
Hypothesis Testing: The t test
The t test looks at the mean and variance of a
sample of measurements, where the null
hypothesis is that the sample is drawn from a
distribution with mean $\mu$.
The test looks at the difference between the
observed and expected means, scaled by the
variance of the data, and tells us how likely one is
to get a sample of that mean and variance
assuming that the sample is drawn from a normal
distribution with mean $\mu$.
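The Student's t test
To determine the probability of getting a certain
sample, we compute the t statistic, where $\bar{X}$ is the
sample mean and $s^2$ is the sample variance, and look
up its significance wrt the normal distribution.
\[ t = \frac{\bar{X} - \mu}{\sqrt{s^2 / N}} \]
(used when $N \le 30$ and $\sigma$ is unknown; the population is assumed to follow a normal distribution)
\[ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}} \]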
The t test
Significance of difference
Compare with normal distribution (mean $\mu$)
Using real-world data, compute t
Find in tables (see Manning and Schütze, p. 609):
d.f. = degrees of freedom (parameters which are not determined by
other parameters; sample size)
percentile level p = 0.05 (or lower)
The bigger the t statistic:
the better chance that it is an interesting combination (i.e. we can
reject the null hypothesis)
t: significance level from the t table
The t test on Collocations
Null hypothesis: independence
mean $\mu$: $p(w_1)\,p(w_2)$
Data estimates:
$\bar{x}$ = MLE of joint probability from data
$\sigma^2$ is $p(1-p)$, i.e. almost $p$ for small $p$; $N$ is the data size
Example: compute t value for new companies
C(new) = 15,828; C(companies) = 4,675; N = 14,307,668
$H_0$: p(new companies) = 15,828/14,307,668 * 4,675/14,307,668 = $3.615 \times 10^{-7}$
Observed: p(new companies) = 8/14,307,668 = $5.591 \times 10^{-7}$
$\sigma^2 = p(1-p) = p - p^2 \approx 5.591 \times 10^{-7}$
\[ t = \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14{,}307{,}668}} = .999932 \]
For $\alpha = 0.05$, need a t value of 1.645, so the null hypothesis is not
rejected.
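The new companies calculation can be reproduced from the raw counts. The sketch below assumes the sample variance is taken as p(1-p) with p the observed bigram probability, as on the slide; the function name `t_score` is hypothetical.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for a bigram under the null hypothesis of independence.

    c1, c2: unigram counts of w1 and w2; c12: bigram count; n: corpus size.
    """
    x_bar = c12 / n                 # observed (MLE) bigram probability
    mu = (c1 / n) * (c2 / n)        # expected probability under independence
    s2 = x_bar * (1 - x_bar)        # Bernoulli variance, close to x_bar for small p
    return (x_bar - mu) / math.sqrt(s2 / n)

t = t_score(15828, 4675, 8, 14307668)
print(round(t, 6))
```

Because the result is below the critical value of 1.645 quoted above, the counts alone do not justify treating new companies as a collocation.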
Hypothesis Testing of Differences
(Church & Hanks, 1989)
We may also want to find words whose co-occurrence
patterns best distinguish between two
words (e.g., strong versus powerful). This
application can be useful for Lexicography.
The t test is extended to the comparison of the
means under the assumption that they are
normally distributed.
The null hypothesis is that the average difference
is 0.
The t test for Comparing Two
Populations
This t test compares the means of two normal
populations. The variances of the two populations
are added since the variance of the difference of
two RVs is the sum of their variances.
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]
Collocation Testing
t values are calculated assuming a Bernoulli
distribution: $w$ is the collocate of interest, $v_1$ and $v_2$
are the words to compare, and assume that $\sigma^2 \approx p$.
\[ t = \frac{P(v_1 w) - P(v_2 w)}{\sqrt{\dfrac{P(v_1 w) + P(v_2 w)}{N}}} \]
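A small sketch of this comparison, working from bigram counts. The function name `t_difference` and the example counts (10 versus 0 occurrences with the same collocate) are illustrative assumptions, not data from the slides; only the corpus size is reused from the earlier example.

```python
import math

def t_difference(c_v1w, c_v2w, n):
    """t statistic for the difference between P(v1 w) and P(v2 w).

    Uses the approximation sigma^2 ~ p for each bigram probability.
    """
    if c_v1w + c_v2w == 0:
        raise ValueError("at least one of the two bigrams must occur")
    p1 = c_v1w / n
    p2 = c_v2w / n
    return (p1 - p2) / math.sqrt((p1 + p2) / n)

# Hypothetical counts: v1 occurs 10 times with w, v2 never does.
t = t_difference(10, 0, 14_307_668)
print(round(t, 4))
```

Substituting P = C/N shows that N cancels, so the statistic reduces to (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w)).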
Pearson's Chi-Square Test
Use of the t test has been criticized by Church and Mercer
(1993) because it assumes that probabilities are
approximately normally distributed (not true, generally).
The Chi-Square test does not make this assumption.
The essence of the test is to compare observed frequencies
with frequencies expected in the case of independence. If
the difference between observed and expected frequencies
is large, then we can reject the null hypothesis of
independence.
$\chi^2$ test (general formula):
\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where $O_{ij}$ and $E_{ij}$ are the observed versus expected counts of events
$i, j$
Pearson's Chi-Square Test
Example of a two-outcome event:
                  w1 = true    w1 ≠ true
w2 = species          9          1,770
w2 ≠ species         75        219,243
Pearson's Chi-Square Test
The expected frequencies are computed from the
marginal probabilities:
\[ E_{11} = \frac{O_{11} + O_{12}}{N} \times \frac{O_{11} + O_{21}}{N} \times N \]
where N is the number of bigrams. Under independence:
\[ P(w_1 w_2) = P(w_1)\,P(w_2) = E_{11}/N \;\Rightarrow\; E_{11} = P(w_1)\,P(w_2)\,N \]
Layout of the observed counts:
              w2        ¬w2
w1           O_11      O_12
¬w1          O_21      O_22
For a 2-by-2 table:
\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{N\,(O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})} \]
\[ \chi^2 = \frac{221097\,(219243 \times 9 - 75 \times 1770)^2}{1779 \times 84 \times 221013 \times 219318} = 103.39 > 7.88 \]
(at $\alpha = .005$, thus we can reject the independence assumption)
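Both forms of the statistic can be checked numerically against the true species counts. This is a sketch; the function names `chi_square_2x2` and `chi_square_general` are hypothetical, and the general version assumes a rectangular table of observed counts given as a list of rows.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Closed-form chi-square statistic for a 2-by-2 contingency table."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

def chi_square_general(observed):
    """Sum of (O - E)^2 / E, with E computed from the row and column marginals."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total

chi2_closed = chi_square_2x2(9, 1770, 75, 219243)
chi2_sum = chi_square_general([[9, 1770], [75, 219243]])
print(round(chi2_closed, 2), round(chi2_sum, 2))
```

The two functions should agree up to floating-point error, which is a quick sanity check on the closed-form expression.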
Pearson's Chi-Square: Applications
One of the early uses of the Chi-Square test in
Statistical NLP was the identification of
translation pairs in aligned corpora (Church &
Gale, 1991).
A more recent application is to use Chi-Square as
a metric for corpus similarity (Kilgarriff and Rose,
1998).
Note that the Chi-Square test should not be used
for small counts.
Likelihood Ratios Within a Single
Corpus (Dunning, 1993)
Likelihood ratios are more appropriate for sparse data than
the Chi-Square test. In addition, they are easier to interpret
than the Chi-Square statistic.
In applying the likelihood ratio test to collocation
discovery, use the following two alternative explanations
for the occurrence frequency of a bigram $w_1 w_2$:
H1: The occurrence of $w_2$ is independent of the
previous occurrence of $w_1$: $P(w_2 \mid w_1) = P(w_2 \mid \neg w_1) = p$
H2: The occurrence of $w_2$ is dependent on the previous
occurrence of $w_1$: $p_1 = P(w_2 \mid w_1) \neq P(w_2 \mid \neg w_1) = p_2$
Likelihood Ratios Within a Single
Corpus
Use the MLE for probabilities for $p$, $p_1$, and $p_2$
and assume the binomial distribution:
Under $H_1$: $P(w_2 \mid w_1) = c_2/N$, $P(w_2 \mid \neg w_1) = c_2/N$
Under $H_2$: $P(w_2 \mid w_1) = c_{12}/c_1 = p_1$,
$P(w_2 \mid \neg w_1) = (c_2 - c_{12})/(N - c_1) = p_2$
Under $H_1$: $b(c_{12}; c_1, p)$ gives the probability that $c_{12}$ out of $c_1$ bigrams
are $w_1 w_2$, and $b(c_2 - c_{12}; N - c_1, p)$ gives the probability that $c_2 - c_{12}$ out of
$N - c_1$ bigrams are $\neg w_1 w_2$
Under $H_2$: $b(c_{12}; c_1, p_1)$ gives the probability that $c_{12}$ out of $c_1$ bigrams
are $w_1 w_2$, and $b(c_2 - c_{12}; N - c_1, p_2)$ gives the probability that $c_2 - c_{12}$ out of
$N - c_1$ bigrams are $\neg w_1 w_2$
\[ b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k} \]
Likelihood Ratios Within a Single
Corpus
The likelihood of $H_1$ (likelihood of independence):
\[ L(H_1) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p) \]
The likelihood of $H_2$ (likelihood of dependence):
\[ L(H_2) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2) \]
The log of the likelihood ratio:
\[ \log \lambda = \log \frac{L(H_1)}{L(H_2)} = \log b(c_{12}; c_1, p) + \log b(c_2 - c_{12}; N - c_1, p) - \log b(c_{12}; c_1, p_1) - \log b(c_2 - c_{12}; N - c_1, p_2) \]
The quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed, so we
can test for significance.
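The log likelihood ratio can be computed directly from the four counts. The sketch below is one possible implementation, not Dunning's own code: the function names are hypothetical, and the binomial coefficients are omitted because they are identical under both hypotheses and cancel in the ratio. The example reuses the true species counts from the chi-square slides (C(true) = 84, C(species) = 1,779, C(true species) = 9, N = 221,097).

```python
import math

def _xlogy(a, b):
    """a * log(b), defined as 0 when a == 0 so that 0 * log(0) is harmless."""
    return 0.0 if a == 0 else a * math.log(b)

def _log_l(k, n, x):
    """log of x^k (1 - x)^(n - k); the binomial coefficient is left out."""
    return _xlogy(k, x) + _xlogy(n - k, 1 - x)

def log_likelihood_ratio(c1, c2, c12, n):
    """Return -2 log(lambda) for the bigram w1 w2."""
    p = c2 / n                      # P(w2) under independence
    p1 = c12 / c1                   # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # P(w2 | not w1)
    log_lambda = (_log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
                  - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

score = log_likelihood_ratio(84, 1779, 9, 221097)
print(round(score, 2))
```

The returned value can be compared against chi-square critical values with one degree of freedom, such as the 7.88 threshold used earlier.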
Likelihood Ratios II: Between two or
more corpora (Damerau, 1993)
Ratios of relative frequencies between two or
more different corpora can be used to discover
collocations that are characteristic of a corpus
when compared to other corpora.
This approach is most useful for the discovery of
subject-specific collocations.
For example, suppose foo bar occurs 2 times out of
1,232,444 in one corpus and 55 times out of
5,348,212 in another; then the frequency ratio is
(2/1,232,444)/(55/5,348,212) = .1578
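The ratio in the example can be verified in a couple of lines. The function name `relative_frequency_ratio` is a hypothetical helper; the counts are the ones quoted above.

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Relative frequency in corpus A divided by relative frequency in corpus B."""
    return (count_a / size_a) / (count_b / size_b)

ratio = relative_frequency_ratio(2, 1_232_444, 55, 5_348_212)
print(round(ratio, 4))
```

A ratio well below 1 means the phrase is more characteristic of the second corpus; a ratio well above 1 points to the first.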
Pointwise Mutual Information
An information-theoretic measure for discovering
collocations is pointwise mutual information (Church et
al., 1989, 1991).
This is NOT MI as defined in Information Theory
(MI: random variables; not values of random variables)
\[ I(a,b) = \log_2 \frac{p(a,b)}{p(a)\,p(b)} = \log_2 \frac{p(a \mid b)}{p(a)} \]
Pointwise Mutual Information is roughly a measure of how
much one word tells us about the other.
Pointwise Mutual Information
Example:
\[ I(\text{true}, \text{species}) = \log_2 \frac{4.1 \times 10^{-5}}{3.8 \times 10^{-4} \times 8.0 \times 10^{-3}} = 3.74 \]
measured in bits but it is difficult to give it an interpretation
used for ranking (NOT null hypothesis tests)
Pointwise mutual information works particularly
badly in sparse environments (favors low frequency
events).
May not be a good measure of what an interesting
correspondence between two events is (Church and
Gale, 1995).
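Pointwise mutual information can be computed from counts as in the sketch below. The function name `pmi` is hypothetical, and the counts are taken from the marginals of the true species contingency table shown earlier (C(true) = 84, C(species) = 1,779, C(true species) = 9, N = 221,097), so the result differs slightly from a calculation with rounded probabilities.

```python
import math

def pmi(c_a, c_b, c_ab, n):
    """Pointwise mutual information in bits, from raw counts."""
    p_ab = c_ab / n
    p_a = c_a / n
    p_b = c_b / n
    return math.log2(p_ab / (p_a * p_b))

score = pmi(84, 1779, 9, 221097)
print(round(score, 2))

# A pair that occurs exactly once, with each word occurring once,
# gets a very high score: this illustrates the bias toward rare events.
rare = pmi(1, 1, 1, 221097)
print(round(rare, 2))
```

The second call shows why the measure is usually combined with a minimum frequency threshold when ranking candidates.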