Misq2007 Enhancing Information Retrieval Through Statistical Natural Language Processing A Study of Collocation Indexing

Arazy & Woo/Enhancing Information Retrieval
RESEARCH ARTICLE
ENHANCING INFORMATION RETRIEVAL THROUGH

STATISTICAL NATURAL LANGUAGE PROCESSING:
A STUDY OF COLLOCATION INDEXING1
By: Ofer Arazy In this paper, we provide preliminary evidence for the useful-
The University of Alberta ness of statistical natural language processing (NLP) tech-
3-23 Business Building niques, and specifically of collocation indexing, for IR in
Edmonton, AB T6G 2R6 general settings. We investigate the effect of three key param-
CANADA eters on collocation indexing performance: directionality,
ofer.arazy@ualberta.ca distance, and weighting. We build on previous work in IR to
(1) advance our knowledge of key design elements for
Carson Woo collocation indexing, (2) demonstrate gains in retrieval
Sauder School of Business precision from the use of statistical NLP for general-settings
University of British Columbia IR, and, finally, (3) provide practitioners with a useful cost-
2053 Main Mall benefit analysis of the methods under investigation.
Vancouver, BC V6T 1Z2
CANADA Keywords: Document management, information retrieval
carson.woo@ubc.ca (IR), word ambiguity, natural language processing (NLP),
collocations, distance, directionality, weighting, general
settings
Abstract
Although the management of information assets—specifically, Introduction

of text documents that make up 80 percent of these assets—
an provide organizations with a competitive advantage, the Text repositories make up 80 percent of organizations’ infor-
ability of information retrieval (IR) systems to deliver relevant mation assets (Chen 2001), with the knowledge encoded in
information to users is severely hampered by the difficulty of electronic text now far surpassing that available in data alone
disambiguating natural language. The word ambiguity prob- (Spangler et al. 2003). Consequently, the use of information
lem is addressed with moderate success in restricted settings, technology to manage documents remains one of the most
but continues to be the main challenge for general settings, important challenges facing IS managers (Sprague 1995, p.
characterized by large, heterogeneous document collections. 29). Information retrieval (IR) systems enable users to access
relevant textual information in documents. Despite their rapid
adoption, current IR systems are struggling to cope with the
1
diversity and sheer volume of information (Baeza-Yates and
Veda Storey was the accepting senior editor for this paper. Praveen Pathak
served as a reviewer. The associate editor and two additional reviewers
Ribiero-Neto 1999). The performance of IR systems is often
chose to remain anonymous. less than satisfactory, especially when applied to information
MIS Quarterly Vol. 31 No. 3, pp. 525-546/September 2007 525

searches in general settings, such as networked organizations In spite of evidence suggesting the use of collocations is
or the World Wide Web, which are often characterized by essential for representing text documents (Church and Hanks
large, unrestricted collections, and multiple domains of 1990; Lewis and Spärck-Jones 1996), its usage and word co-
interest. This highlights the need for scalable IR techniques occurrence patterns to improve IR effectiveness continues to
(Jain et al. 1998). be controversial (Baeza-Yates and Ribeiro-Neto 1999).
The effectiveness of IR systems is determined by the extent This paper makes two contributions to theory and practice of
to which the representations of documents actually capture IR. First, we identify key parameters that affect collocation
their meaning. The main challenge inhibiting the perfor- indexing performance, namely collocation directionality and
mance of current retrieval systems is word ambiguity (Baeza- the distance between the words making up collocations.
Yates and Ribiero-Neto 1999): the gap between the way in Second, we demonstrate empirically that the use of statistical
which people think about information, and the way in which collocation indexing can substantially enhance IR from large
it is represented in IR systems. IR systems require that both and heterogeneous collections.
documents and user requests be formulated in words, which
are inherently ambiguous (Deerwester et al. 1990). The paper continues as follows. The next section provides a
context for this study, briefly review the problem of content
representation in information retrieval; the third addresses
The objective of this research is to gain an understanding of
scientific rigor by positioning the work within the context of
the potential and limitations of automatically generating
Information Science and Computational Linguistics; the
meaningful representations of textual documents. Our goal
fourth section lays out our research questions; the fifth section
is to identify IR techniques that are both suitable for general
presents the proposed research method, detailing the
settings and deliver relevant information. Accordingly, we
evaluation procedure; the sixth section presents the results
focus on statistical IR techniques which are scalable to large,
from our experiments; the seventh section discusses the
open environments.
findings of this study, and the eight section concludes the
paper with some future research directions.
Statistical natural language processing (statistical NLP)
methods “have led the way in providing successful disam-
biguation in large scale systems using naturally occurring
text” (Manning and Schutze 2003, p. 19). In order to Information Retrieval, Representations,
represent the meaning of documents, statistical NLP uses and Constructs
information contained in the relationships between words,
including how words tend to group together. Our study Information retrieval systems represent documents and
focuses on one specific statistical NLP technique: collocation queries through profiles (also referred to as indexes). A
indexing. A collocation is a word combination which carries profile is a short-form representation of a document, easier to
a distinct meaning (i.e., a concept). Collocation indexing manipulate than the entire document, and serves as a
refers to the process of extracting collocations from natural surrogate at the retrieval stage. Next, matching is performed
text and representing document meaning through these by measuring the similarity between the document and query
collocations (instead of through words). Collocations are profiles, returning to the user a ranked list of documents that
extracted by grouping together words that appear close are predicted to be of relevance to the query.
together in the text (the simplest example of collocation is a
phrase, such as “artificial intelligence”). Firth (1957, p. 11) IR effectiveness is measured through precision, the percent-
coined the phrase, “You shall know a word by the company age of retrieved documents which are relevant to the user, and
it keeps,” suggesting that words could be disambiguated by recall, the percentage of the total relevant documents in the
analyzing their patterns of co-occurrence. Statistical colloca- collection which are included in the set of returned docu-
tion indexing provides a good solution to the word ambiguity ments. Recall and precision are generally complementary,
problem (Morita et al. 2004), and this approach is robust and negatively correlated. There are situations in which recall
(even in the presence of error and new data) and generalizes will be most important to a user; for example, in retrieving
well (Manning and Schutze 2003). Collocations extracted every possible piece of information regarding a specific
from large text corpora have been shown to convey both medicine (even at the cost of retrieving nonrelevant informa-
syntactic and semantic information (Maarek et al. 1991), thus tion). However, precision is considered more important to
collocation-based IR systems can more accurately address most search tasks, since users are primarily interested in
users’ requests. exploring only the documents at the top of the results list
526 MIS Quarterly Vol. 31 No. 3/September 2007

(Lewis and Spärck-Jones 1996). Since users’ intentions are Language models are also domain-dependent and, to date,
unknown during system design, the general performance of a have not been applied to general settings.
system could be evaluated using measures that combine preci-
sion and recall. Two such measures are popular: the In the vector space model, documents and queries are repre-
precision–recall curve (plotting a precision value for each sented as weighted vectors of indexing units (i.e., a list of
recall level), and the harmonic mean, or F measure, terms and their associated weights). Geometrically, the
indexing units define the semantic space, and documents and
(2 × Precision × Recall) queries are represented by points in that space (see Figure 1).
F=
(Precision + Recall) The similarity of query and document indexes is commonly
calculated through the cosine of the angle between the two
Another important aspect of IR systems is their efficiency. IR
index vectors (i.e., the dot product of the two normalized
efficiency is estimated by (1) an algorithm’s computational
vectors) (Baeza-Yates and Ribiero-Neto 1999; Salton and
complexity and (2) the storage space of document profiles.
McGill 1983).2
Efficient systems are capable of processing large amounts of
documents, and are thus suitable for a wide variety of
business settings. In testing the performance of collocation
indexing, we will consider both effectiveness (precision and Constructs in IR
recall), and efficiency (complexity and storage space).
The major challenge for IR research on constructs is in pro-
posing indexing units that are meaningful, yet can be
efficiently extracted (automatically) from text (Lewis and
IR Representations Spärck-Jones 1996). We discuss two different types of
indexing units used in IR models: tokens and concepts.
Research in IR systems investigates four different types of
artifacts: instantiations (IR software systems), methods
(algorithms for developing the IR system), IR models Tokens as Indexing Units
(representations of documents and queries), and constructs
(the indexing units used to represent documents and queries). Traditionally, single-word terms, referred to as tokens, were
used to represent a document’s meaning. For example, this
paper may be indexed by the following set of tokens (infor-
Representations in IR are defined in terms of constructs (i.e.,
mation, systems, research, retrieval, text). Token indexing is
the indexing units that define the semantic space) and models,
usually accomplished through (1) statistical analysis of word
which describe how documents and queries are mapped onto
frequency for the entire corpus, and removal of high-
the semantic space and how their similarity is measured in
frequency words which in themselves do not carry meaning
that space. There are several reasons why we focus in this (e.g., the, of, and at); (2) stemming, by removing word pre-
study on IR representations and, more specifically, on con- fixes and suffixes; (3) removal of infrequent tokens and the
structs. First, IR effectiveness is primarily determined by the generation of a token list; and (4) indexing of each document
extent to which representations actually capture the meaning (and later, of query), using tokens from the list (Baeza-Yates
of documents. Second, most IR research tends to focus on and Ribiero-Neto 1999). Token indexing is useful for
retrieval models, algorithms, and the development of systems; excluding words that do not carry distinctive meaning;
while the remaining type of artifact—constructs—has been however, tokens do not solve the problem of word ambiguity.
overlooked to a large extent. For example, a person interested in computer viruses sub-
mitting the query “virus” may be presented with documents
An investigation of IR indexing units could only be performed describing biological viruses, resulting in low precision.
in the context of a specific retrieval model. We use the vector Techniques for automatically generating token indices rely on
space model (Salton et al. 1975) in our study, as this is the de the early works of Luhn (1958), and are still used in current
facto standard in the design of commercial retrieval systems. commercial IR systems. Despite the popularity of token
Popular alternative models include the probabilistic model indexing, it is widely believed that this approach “is inade-
(Robertson and Spärck-Jones 1976) and language models quate for text content representation in IR” (Salton and
(Ponte and Croft 1998). The fundamental difference between Buckley 1991, p. 21), and therefore higher-level indexing
these alternative models and the vector space model is in the units are essential for large-scale IR (Mittendorf et al. 2000).
way in which indexing terms are weighted. Because the 2
We tested the cosine function by comparing it to a Euclidean distance
probabilistic model requires initial training of the system, it is function and found that the performance of the two functions was
less appropriate for large, domain-independent collections. indistinguishable.
MIS Quarterly Vol. 31 No. 3/September 2007 527

Indexing
Unit B
Doc. #1
W1,B Doc. #2
Query
W1,A Indexing Unit A
Figure 1. IR Models (Illustrated through a simple semantic space—defined by two indexing units: A
and B—where query and documents are mapped onto that space)
Concepts as Indexing Units The alternative approach, statistical NLP, is founded on the
philosophies of Firth (1957), who argued that the meaning of
There is a substantial body of evidence from the fields of a word is defined by the patterns of its use and circumstances
linguistics, semantics, education, and psychology that people (i.e., the words surrounding it). Statistical NLP can be used
process and represent knowledge using concepts. Conceptual to extract meaningful patterns of words and these collocations
(versus token-based) document representations have the can then be utilized to represent the meaning of text
potential to alleviate word ambiguity; yet, automatically documents. This enables the automatic processing of very
generating such representations is extremely difficult, as large and heterogeneous text collections. We will, therefore,
language is complex and patterns of word usage do not follow adopt the statistical NLP approach for our study.
strict rules (Manning and Schutze 2003). Two classes of
natural language processing (NLP) approaches have been
used to automatically extract concepts from text by grouping
words into sets of higher-level abstractions (i.e., latent Collocations in Information Retrieval
concepts): linguistic and statistical.
From linguistic and semantic points of view, the idea that
Linguistic NLP relies on formal models of language to often a combination of words is a better representation of a
analyze the text using syntactic and/or semantic methods. concept than the individual words is well-accepted (Church
Although linguistic analysis can be performed automatically, and Hanks 1990). A collocation4 is defined as “an expression
this is often not effective in general settings, because syntactic consisting of two or more words that corresponds to some
analysis is extremely ambiguous and computers struggle to conventional way of saying things” (Manning and Schutze
determine the syntactic structure of arbitrary text (Lewis and 2003, p. 151). Collocations include compounds (hard-disk),
Spärck-Jones 1996). Alternatively, linguistic analysis can be phrasal verbs [adjacent (get-up) and nonadjacent (knock….
complemented by manual rule creation and hand-tuning, but door)] and other stock phrases (bacon and eggs). Colloca-
hand-coding syntactic constraints and preference rules are tions can be used as indexing units, in addition to single-word
time consuming to build, do not scale-up well, and are not tokens, since the frequency of a collocation is correlated with
easily portable across domains (Chen 2001).3 Furthermore, the importance of the collocation in the text (Luhn 1958;
linguistic NLP performs poorly when evaluated on naturally Maarek et al. 1991).
occurring text (Manning and Schutze 2003).
The discrepancy between the potential of collocations and
their actual use may be due to the challenge of automatically
3
One popular approach for addressing the word ambiguity problem is word- extracting meaningful collections using statistical methods.
sense disambiguation, where preconstructed lexical resources, such as
WordNet (Miller 1995), are used to disambiguate terms appearing in
4
documents and queries. Despite the use of these external resources, in Also referred to as term associations, term co-occurrence, term combina-
general this method has not proved effective in IR (Stokoe et al. 2003). tions, second-order features, higher-order features, or simply phrase.

The untapped potential of collocation indexing makes it an modeling approach to IR provides a well-studied theoretical
area where statistical NLP—given its scalability to large framework that has been successful in other fields. However,
collections—can make important contributions to IR (Lewis language models (similarly to the vector space and proba-
and Spärck-Jones 1996; Manning and Schutze 2003; bilistic IR models) assume independence of indexing terms,
Mittendorf et al. 2000; Strzalkowski et al. 1996). and attempts to integrate co-occurrence information into the
language models have not shown consistent improvements
(e.g., Alvarez et al. 2004; Jiang et al. 2004; Miller et al.
Collocations and Information 1999).
Retrieval Performance
We believe that the major factor inhibiting the performance of
One reason that collocation analysis has been under-utilized collocation-based IR is the insufficient knowledge we have
in IR may be that, traditionally, collocations were associated regarding the key parameters that affect collocation perfor-
with linguistic analysis, complex models, and artificial intelli- mance. In the following sections, we identify such parameters
gence implementations, and thus were regarded as most and discuss their effect on IR performance.
suitable for very small and restricted problems (Salton and
Buckley 1991).5 To address these limitations, statistical tech-
niques have been investigated. The early works on statistical Collocation Extraction Parameters
collocation indexing (e.g., Fagan 1987; Lesk 1969; van and IR Performance
Rijsbergen 1977) were restricted to small and domain-specific
collections. These early studies yielded inconsistent, minor, Hevner et al. (2004) noted that designing useful artifacts is
and insignificant performance improvements (Fagan 1989). complex due to the need for creative advances in areas in
Later works, investigating collocation extraction for much which existing theory is often insufficient. Extant theory on
larger and diverse collections, usually in the context of the collocations provides little practical guidance on the design of
Text Retrieval Conference (TREC; e.g., Carmel et al. 2001), collocation-based retrieval systems. Collocation theory, aside
have also yielded weak and inconclusive results (Khoo et al. from discussions on grammatical features (which are irrele-
2001; Mittendorf et al. 2000), and the effectiveness of vant for statistical NLP), does not prescribe a procedure for
collocation-based IR continues to be controversial (Baeza- distinguishing meaningful collocations from meaningless
Yates and Ribiero-Neto 1999). Lewis and Spärck-Jones word combinations.
(1996), in their survey of NLP in IR, argue that “auto-
matically combining single indexing terms into multi-word A survey of collocation literature revealed three salient
indexing phrases or more complex structures has yielded only characteristics of collocations: directionality (i.e., whether
small and inconsistent improvements” (p. 94). In TREC, the ordering of collocation terms should be preserved),
phrase (i.e., adjacent collocation) indexing has shown to distance (i.e., the proximity of the two words comprising the
provide only minor (2 to 4 percent) improvements in precision collocation), and weighting (i.e., the algorithm for assigning
(Spärck-Jones 1999). Other experiments (e.g., Mitra et al. weights to collocations in document and query profiles), as
1997) show that collocations alone perform no better than discussed below.
tokens, and that enhancing token indexes with collocations
results in minor (1 to 5 percent) precision gains. In some
specific cases, collocations were able to yield small, still Collocation Directionality
insignificant, improvements (e.g., 10 percent gains in Buckley
et al. 1996). In summary, to date, no method for employing Collocation theory does not provide a definite answer on the
collocations in general settings has yielded substantial and issue of directionality, as in some cases (e.g., artificial
significant precision improvements over token-based repre- intelligence) the ordering of collocation terms is essential for
sentation on an accepted benchmark. preserving the meaning, while in other cases (e.g., Africa and
elephants) directionality does not impact meaning (Church
Recently, the focus of collocation indexing research has and Hanks 1990). Directionality is one of the key features
shifted toward advanced retrieval models. The language that distinguishes linguistic from statistical collocations:
while linguistic collocations preserve tokens order, most
5
Furthermore, syntactic collocation indexing, in most cases, has failed to statistical approaches normalize collocations to an nondirec-
yield improvements beyond traditional token indexing or statistically xtracted tional form. Most works in IR on collocation indexing have
collocations (e.g., Croft et al. 1991, Mitra et al. 1997; Strzalkowski et al.
treated collocations as nondirectional (e.g., Fagan 1989; Mitra
1996). Semantic collocation extraction methods have also failed to prove
useful in IR (Khoo et al. 2001). et al. 1997). However, Fagan (1989) reported on problems

with nondirectional collocations, and there have been few suggest that across-sentence collocation could potentially
exceptions where directional collocations were used in IR enhance the performance of the vector space model for large
(e.g., Mittendorf et al. 2000). The effect of directionality on text collections.
IR performance has not been tested in the context of the
vector space model, although evidence from experiments with Another important aspect, which has not been investigated in
language models (Srikanth and Srihari 2002) suggests that IR literature, is the importance of distance within a given
directional collocations are more precise (by 5 to 10 percent) structural element: that is, whether collocation terms sepa-
than nondirectional collocations. rated by few words are more meaningful than collocations at
larger distances (still within the same sentence). Studies to
date have tended to regard all collocations within the
The Distance Between Collocation Terms predefined text window (usually a five-word window, after
Martin et al.) equally, and failed to take into account the fact
The intensity of links between words—commonly opera- that more proximate collocations are likely to carry more
tionalized through distance—reflects the semantic proximity semantics.
of the words. Capturing that semantic proximity is important
for IR effectiveness, and it is generally recognized that collo- To summarize our discussion of collocation distance, the
cation distance is of critical importance (Fagan 1987, 1989). exact relationship between physical and semantic proximity
requires further investigation. First, should collocation
IR research on collocation extraction assumes that co- extraction be restricted to within-sentence co-occurrence, or
occurrence of words within tight structural elements (i.e., a should it include across-sentence collocations? Second,
sentence) conveys more meaning than within less tight struc- should more weight be assigned to closer collocations (within
tural elements (i.e., paragraphs or sections). Thus, research the same structural element)?
on collocation extraction has been dominated by within-
sentence analysis. Empirical analysis justifies restricting
collocation extraction to combinations of words appearing in Collocation Weighting
the same sentence. Martin et al. (1983) found that 98 percent
of syntactic combinations associate words that are within the A document (or query) profile includes a list of indexing
same sentence and are separated by five words or fewer. units, each associated with a weight. Indexing units’ weights
Fagan (1987) found that restricted extraction of collocations determine the position of the document in the semantic space.
to a five-token distance window is almost as effective as With tokens as indexing units, there is a substantial body of
collocations extracted within a sentence with no such restric- research on weighting schemes (Salton and McGill 1983). It
tion, supporting Martin’s findings. Others that followed (e.g., is now commonly accepted that an effective scheme should
Carmel et al. 2001; Maarek et al. 1991) employed a five-word assign a weight based on (1) a local, document-specific factor,
window for extracting collocations. such as token frequency in the document (often normalized,
to counter the variations in document length), and (2) a
Collocations binding terms across sentence boundaries also global, corpus-level factor, such as the number of documents
may be of importance (e.g., doctor-nurse or airport-plane and the frequency of the token in the entire collection. The
commonly co-occur across sentences). Fagan (1987) pro- token weight in the document profile is correlated positively
vided initial evidence showing that, for a small domain- with the local factor, and negatively with the token global
specific corpus, across-sentence collocations are more effec- factor. Different weighting schemes could be used with the
tive than within-sentence collocations. Two recent studies in vector space model, and the de facto standard is term
related fields support Fagan’s earlier findings. Mittendorf et frequency – inverse document frequency (tf-idf), defined
al. (2000; routing task; the probabilistic IR model) report that formally as:
collocations combining terms across sentences were just as
effective as those within a sentence, while document-level Let N be the total number of documents in the
collocations proved less effective than sentence and para- collection and dfi be the number of documents in
graph-level collocations. Srikanth and Srihari (2002; the set- which the index term ki appears. Let tfi, j be the raw
based IR model) investigated a text window for combining frequency of term ki in the document dj. Then, the
terms, and attained optimal performance with a 50 to 80 word normalized frequency fi, j of term ki in the document
window, well beyond the limits of a sentence. Although these tfi, j
di is given by ntfi, j = , and is referred to as
studies were carried out in a context different from ours (e.g., Maxl tfi, j
small collection, different IR models, different tasks), they do the term frequency (TF) factor. The maximum, tfi, j,

is computed over all terms which are mentioned in Research Questions

the text of the document dj. The inverse document
frequency (IDF) factor for ki is given by IDFi = log N . This study is intended to advance our understanding of collo-
dfi
cation indexing in IR in general settings. We address two key
N
Thus, wi, j = TF × IDF = ntfi, j × log df . questions that IS design should consider (Hevner et al., 2004):
i
Tf-idf has been reported to increase IR precision up to 70 per- (1) Does the artifact work?
cent, when compared to simple tf (i.e., weight is assigned
based on the frequency of a term in the document; Salton and (2) What are the characteristics of the environment in
Buckley 1988). which it works?
With collocations as indexing units, however, there is no We first conduct an examination on the effects of key
well-accepted scheme and the development of weighting parameters (i.e., collocation directionality, distance, and
schemes for collocations is at early phases of research. The weighting), and then investigate the extent to which
main challenge for collocation weighting is that the vector– collocation-indexing enhances retrieval performance beyond
space model assumes terms are independent, and thus does the traditional token-based indexing.
not lend itself to modeling inter-term dependencies, which are
essential for handling collocation (Strzalkowski et al. 1996).
Two alternatives for alleviating this problem are possible: Research Method
(1) use the simple tf weighting scheme for collocations (e.g.,
Khoo et al. 2001), based on the fact that collocation frequency In order to study the effect of collocations on IR performance,
is correlated with the importance of the collocation (Luhn we developed a prototype IR system with collocation indexes
1958; Maarek et al. 1991), or (2) adapt a more effective for documents and queries. We conduct a series of laboratory
scheme, such as tf-idf, to suit collocations. There have been experiments to address the research questions presented
various attempts to adapt the tf-idf weighting scheme to earlier.
collocations (e.g., Alvarez et al. 2004; Buckley et al. 1996;
Church and Hanks 1990; Maarek et al. 1991; Mitra et al. To study IR constructs (i.e., tokens and collocations) within
1997; Strzalkowski et al. 1996), but these schemes do not a retrieval system, we fixed the models, methods, and
address the term-independence problem. Furthermore, perfor- instantiations, discussed earlier, in the following ways:
mance gains from these tf-idf adaptations were not sub-
stantial, and the weighted collocations indexes were usually • Model: we employed the standard vector space model,
only 5 percent more effective than weighted token indexes with two alternative weighting schemes: simple tf and tf-
(Jiang et al. 2004; Mitra et al. 1997). idf. For collocations, we also tested the mutual informa-
tion adaptation to tf-idf.
Among the various suggestions, the scheme proposed by
Church and Hanks (1990) stands out for its theoretical • Methods: we used the cosine function for matching
grounding. Church and Hanks’ proposal is based on indexes, and statistical NLP methods for extracting
Shannon’s information theory, where the mutual information collocations (where collocations were restricted to two-
(MI) of two words x and y with occurrence probabilities word combinations, a common and practical approach
P(x), P(y), and co-occurrence probability P(x, y) is
that is scalable to large and heterogeneous collections7).
P(x, y)
I(x, y) = log .6 Mutual information compares the • Instantiations: we restrict the prototype system to only
P(x) × P(y)
the core features described above (vector space model,
“information” captured in the collocations to that captured by collocation indexing, and cosine function for matching).
the single terms making-up the collocation. When applied to This allowed us to focus on the core questions of this
IR, word probabilities are estimated by their occurrence study, and control for the effect of exogenous factors.
frequencies. To date, this information theoretic approach to
collocation weighting has not been tested on a large corpus.
7
Mitra et al. (1997) tested the performance of two-word collocations against
an index that included both two-word and three-word collocations. They
6
Alternative bases for the log functions are possible. In our implementation found that incorporating three-word collocations adds less than 1 percent to
(see details in the “Research Methods” section), we used a base of 2. precision.

Data appeared within the same sentence, while in another scheme

we grouped collocations across sentences. Likewise, in one
In order to simulate a real-life general setting, we studied the scheme we preserved the ordering of collocation terms while
research questions on a large and heterogeneous collection: in another scheme we disregarded term ordering.
the Ttext Retrieval Conference (TREC) database. The TREC
collection is becoming a standard for comparing IR models Efficiency is a challenge in collocation processing, since the
(Baeza-Yates and Ribiero-Neto 1999). TREC includes (1) a possible two-word combinations in an average document are
document collection, (2) a predefined set of queries, and in the billions.10 To address this issue, we restricted collo-
(3) predetermined manual assessments for the relevance of cation extraction in each document to combinations of only
each document to each of the queries. For each approach the 20 most frequent tokens.11 We further restricted the num-
tested, the retrieval prototype system processes the documents ber of collocations by combining the collocations extracted
and queries to produce a list of predicted relevant documents from each document into one cumulative list, and pruning
for each query, which is then compared to manual relevance collocations of low frequency from the list. Similar to Mitra
assessments. Effectiveness is measured using precision, et al. (1997), we ranked the cumulative list based on the
recall, recall–precision graph, and the F measure. We used number of documents in which each collocation appeared, and
disks 4 and 5 of the TREC collection that include 528,030 removed those with lowest frequency, thus restricting the list
documents from a variety of sources8 and cover a variety of to one million unique collocations. In order to have a level
topics, 100 queries, and manually constructed binary playing field, all of the collocation lists generated under
relevance judgments (i.e., either relevant or nonrelevant) for different combinations of distance, directionality, and
all documents on each of the queries. weighting were pruned to one million collocations. Table 1
provides examples of extracted collocations.
Implementation We then indexed documents and queries with the remaining

set of collocations. The outcome of this process was a
To test our hypotheses, we generated several alternative collocation profile for all documents and queries in the test
representations—for each document and query—that differ collection. We generated several alternative sets of profiles,
only in terms of directionality and distance between colloca- based on directionality, distance, and weighting settings.
tion terms.
We first processed the TREC collection documents to gener- Measures

ate a token list, using stop-word removal (with SMART’s
system common words list [Buckley et al. 1996]), stemming Both precision and recall were measured for the 10, 20, 30,
(with Porter’s [1980] algorithm), and removal of tokens that and 100 top ranked documents (i.e., Precision[10],
appear in less than six documents,9 arriving at 72,354 unique Precision[20], …; Recall[10], Recall[20], …). Similarly we
tokens. We then generated documents and query profiles with employed the precision–recall graph and the F measure for the
these tokens. We used the resulting token index (1) to com- 10, 20, 30, and 100 top ranked documents.
pare the performance of collocations to the traditional token-
based approach, and (2) as the starting point for the colloca-
tion extraction process. Experimental Design
To extract collocations, we created “tokenized documents” by We conducted three experiments: Experiment #1 to test the
replacing words with tokens, while preserving sentence effect of collocation indexing parameters (directionality,
boundaries. We then extracted two-token collocations, based distance, and weighting), Experiment #2 to determine the best
on the co-occurrence of these tokens in the text. We applied collocation indexing scheme, and Experiment #3 to test
several alternative procedures that differed in terms of collo- whether collocations can enhance IR performance.
cation distance, directionality, and weighting. For example,
in one scheme we grouped two-token combinations that 10
With the total number of unique tokens in our data set at 72,354, the total
number of unique collocations could potentially reach 5 billion, highlighting
8
Documents from the Los Angeles Times (1989-1990), Financial Times the need for an efficient collocation extraction algorithm.
(1991-1994), Federal Register (1994), and the Foreign Broadcast Information
Service (1996). 11
The choice of the top 20 tokens is based on scalability considerations, and
is justified by the fact that the token list is later trimmed to only the most
9
A similar threshold was employed in other studies (e.g., Deerwester et al. frequent collocations, and frequent collocations regularly are made up of
1990). frequent tokens.

Table 1. Examples of the Frequent Collocations Extracted from the TREC Database (Disks 4 and 5)
Token 1 Words it Represents Token 2 Words it Represents
Fund funding, funded, … Corp corporate, corporation, …
Grade grading, grades, … Math math, …
Gain gains, gaining, … Lab lab, …
In Experiment #1, we employed a 2 × 2 × 3 design combined scheme against the “pure” schemes of tokens and
(directionality: direct/indirect; distance: within/across sen- collocations.
tence; weighting: tf ÷ tf-idf ÷ MI) to test the effect of each
parameter, as well as the interaction effects. Since query In IR research, the differences between competing designs are
length (i.e., the number of terms in the query) has been often small and not statistically significant, thus in most IR
reported to impact IR performance considerably, we con- works no analysis of statistical significance is provided. For
trolled for this effect. We processed each document to instance, IR literature on collocations reports only minor
generate a collocation index by extracting two-word combi- effects for distance, directionality, or weighting (usually
nations (using the standard procedure of restricting word-pairs below 5 percent differences), and rarely test for statistical
to a five-token window,12 and “sliding” this window over the significance. Despite the statistical insignificance of results
entire document). We calculated the frequency of a colloca- from these works, they are considered important contributions
tion in a document by counting all instances of that colloca- to the cumulative knowledge in the field, and are reported in
tion. Across-sentence collocation extraction was performed academic publications. Nonetheless, we will test and report
similarly, using a window of five sentences (and excluding the statistical significance for our experiments.14
collocations within the same sentence). Weighting was tested
with simple tf13 and two variations of tf-idf: the classic
formula (similarly to tokens) and the mutual information
adaptation for tf-idf, as discussed earlier. Results and Analysis
In Experiment #2, we built on results of the first experiment to In order to appreciate the results of the algorithms we studied,
try and design an optimal collocations indexing scheme. We it is useful to provide performance measures for random
performed two types of analysis, focusing on collocation retrieval. On average, a query in our test set had 558 relevant
distance. First, we studied in detail the impact of distance with- documents, out of the total set of 528,030 documents. Thus,
in structural elements (e.g., whether collocations separated by the precision (i.e., percentage of relevant documents) for any
one term yield higher performance than collocations separated randomly selected set is 558 ÷ 528,030 = 0.0011.
by two terms), and tried designing a scheme that assigns higher
weight to more proximate collocations within the same
structural element (e.g., a sentence). Second, we worked to Experiment #1: Collocation Directionality,
combine within-sentence and across-sentence schemes. Distance, and Weighting
Experiment #3 tested the optimal collocations scheme from The 2 × 2 × 2 (i.e., directionality × distance × weighting) fac-
the previous experiment against the traditional token-based torial ANOVA analysis for Experiment #1 (for each of the per-
scheme, while controlling for query length. Since colloca- formance measures) revealed no statistically significant effects.
tions are used to complement tokens in document indexes— We, therefore, report below only on the performance levels for
rather than to replace tokens (e.g., Carmel et al. 2001; Mitra each directionality-weighting-distance combinations.15
et al. 1997; Strzalkowski et al. 1996)—we proceeded to
combine token and collocations schemes, and tested the 14
We will model our analysis after Khoo et al. (2001), one of the rare IR
papers that reports the statistical significance of results, and employ one-
sided significance tests.
12
We also experimented with not restricting collocation distance within a
sentence. Results were almost identical to those of a five-term window. 15
Similar to (Khoo et al. 2001), our comparison included only queries in
which the profile contains at least one collocation. Out of the 100 queries
13
We also tested a normalized tf scheme (i.e., term frequency normalized by provided with the TREC database (numbered 351 through 450), we employed
the highest term frequency in the document) to account for document length, the 84 queries that included at least one indexing unit (excluding queries 361,
but results were indistinguishable from the simple tf, and therefore are not 362, 364, 365, 367, 369, 392, 393, 397, 405, 417, 423, 425, 427, 439, and
reported. 444).

Table 2. The Effects of Directionality, Distance, and Weighting on

Collocation Indexing
Precision100
Precision10
Precision20
Precision30
Recall100
Recall10
Recall20
Recall30
F100
F10
F20
F30
Direct ÷ Tf 0.273 0.226 0.190 0.104 0.054 0.085 0.099 0.149 0.091 0.124 0.130 0.123
Within-Sentence tf-idf 0.288 0.239 0.196 0.110 0.058 0.090 0.103 0.153 0.096 0.130 0.135 0.128
Direct ÷ Tf 0.268 0.212 0.185 0.107 0.037 0.056 0.070 0.131 0.065 0.088 0.101 0.118
Across-Sentence tf-idf 0.276 0.224 0.202 0.113 0.039 0.060 0.077 0.137 0.068 0.094 0.111 0.123
Nondirect ÷ Tf 0.240 0.199 0.173 0.098 0.049 0.074 0.039 0.142 0.081 0.108 0.117 0.116
Within-Sentence tf-idf 0.260 0.218 0.181 0.100 0.053 0.0978 0.092 0.147 0.088 0.116 0.122 0.119
Nondirect ÷ Tf 0.242 0.205 0.181 0.106 0.048 0.071 0.087 0.149 0.080 0.106 0.117 0.124
Across-Sentence tf-idf 0.260 0.224 0.193 0.112 0.052 0.076 0.091 0.153 0.086 0.113 0.124 0.129
The results show a moderate effect for the weighting scheme, directionality does provide important insights. For within-
where tf-idf is superior to tf, for both precision (0.002 – sentence, directional collocations yield higher precision
0.020, or 2.3 to 9.3 percent) and recall (0.002 – 0.007, or 2.6 (0.006 – 0.033, or 6.7 to 13.4 percent) and recall (0.005 –
to 10.4 percent) measures (and thus for the F measure, 0.003 0.011, or 9.0 to 19.6 percent) over nondirectional. For across-
– 0.010, or 2.6 to 9.8 percent). The effect size of tf-idf over sentence, precision is still higher with directional collocations
tf is small. This result is in line with previous studies, which (although the effect size is smaller than for within-sentence;
demonstrated that tf-idf is not as effective for collocations as only up to 0.026, or 10.8 percent), while recall is substantially
it is for tokens. The results for the Mutual Information higher for nondirectional collocations (0.011 – 0.018, or 10.7
adaptation to tf-idf were indistinguishable from the results for to 24.7 percent). These results are consistent across both
standard tf-idf, and thus are excluded from Table 2. Thus, weighting schemes. The graphs in Figure 3 illustrate the
despite the theoretical grounding for the mutual information interaction effect of distance and directionality on P[10] and
scheme, it does not seem to perform well in practice. Figure R[10], when using tf-idf weighting (interaction effects are
2 below demonstrates the effect of weighting on F[10]. statistically insignificant).
No clear effect is visible for collocation distance (for either

precision or recall), and within-sentence and across-sentence Experiment #2: An In-Depth Exploration
schemes perform similarly. While current collocation-extrac- of Collocation Distance
tion practices commonly restrict collocation extraction to
within-sentence word combinations, our findings demonstrate In the second experiment, we explored the effect of distance
that across-sentence collocations are equally important, thus in more detail in an effort to improve performance. We ini-
suggesting that both types of collocations should be incor- tially explored the effect of distance within structural ele-
porated into an effective collocation indexing scheme. ments. Previous studies treated all collocations appearing
within the same structural element (whether sentence or para-
Collocation directionality has a positive impact on precision graph) similarly, and we wanted to test whether collocations
(directional is up to 0.033, or 14.8 percent better than non- binding proximate terms are more important than distant
directional), thus supporting findings in related areas. collocations (all within the same element, for example, a
Directionality seems to have a mixed effect on recall and the sentence).
F measure. These findings are consistent across tf and tf-idf
weighting schemes. Distance Effects for Within-Sentence Collocations
Although distance and directionality alone do not demonstrate For within-sentence, we generated several alternative collo-
large effects, the interaction effect between distance and cation indexes, one for each distance, and then compared their

F[10]: tf vs. tf-idf

tf tf-idf
Nondirect ÷ Across-Sentence
Nondirect ÷ Within-Sentence
Direct ÷ Across-Sentence
Direct ÷ Within-Sentence
F[10]
0.000 0.020 0.040 0.060 0.080 0.100 0.120
Figure 2. F[10]: tf Versus tf-idf for Different Combinations of Collocation Directionality and Distance
P[10]: Interaction Effect Distance x Directionality R[10]: Interaction Effect Distance x Directionality
0.295 0.07
0.290
0.06
0.285
0.280 0.05
0.275 Directional
P[10]
R[10]
0.04
0.270 Nondirectional
0.265 0.03 Directional
0.260 Nondirectional
0.02
0.255
0.01
0.250
0.245 0
Within Sentence Across Sentence Within Sentence Across Sentence
Figure 3. Distance and Directionality Interaction Effect on Precision[10] and Recall[10] Using tf-idf
Weighting
performance. For example, in one scheme we extracted only more effective than any one of the distinct schemes.16 Never-
collocations of adjacent terms, then collocations at one token theless, our findings show that shorter-distance collocations
difference, and so on, up to a five-token distance. We then yield better performance, and thus suggest that weighting all
compared the performance of these alternative schemes (see collocations equally might not be optimal. The relative
Figure 4). weighting of collocations at different distances could be
obtained by assigning the highest weight to the collocations
The results clearly show that collocation combining adjacent that yielded the highest precision in our experiment (colloca-
terms (i.e., phrases) are the most precise (similar effect was tions of adjacent terms), and low weight to collocations that
observed for recall, and for the F measure), and that the resulted in lower precision (collocation combining terms at
relative effectiveness decreases as distance increases. These two-, three-, four-, and five-term distances). We explored
findings were consistent across both weighting schemes. A several alternative combinations of within-sentence colloca-
combination of these collocations into an integrative within-
16
sentence scheme, using the industry-standard procedure of The integrative scheme yielded significant gains over the adjacent collo-
cations scheme (i.e., zero token difference): Precision[10] and Precision[30]
treating all collections equally (i.e., the within-sentence gains of 21 percent (p < 0.05), and Precision[15] and Precision[20] gains of
scheme we explored in Experiment #1) was substantially 34 percent (p < 0.01).

Within-Sentence - Collocations at Different Distances

0.250
0.200
Precision[10]
Precision
0.150
Precision[20]
0.100 Precision[30]
0.050
0.000
0 term 1 term 2 term 3 term 4 term 5 term
distance distance distance distance distance distance
Distance between collocation terms
Figure 4. The Effect of Distance Within a Sentence on Precision (Using tf Weighting)
tions (i.e., polynomial, logarithmic, and exponential), but indexing, and our analysis suggests that, in both schemes, all
failed to obtain gains beyond the standard equal-weight collocation within a structural element should be weighted
scheme. This suggests that the problem is not simple, and equally. Thus we proceeded to combine the two schemes we
thus warrants further research. studied in Experiment #1: within-sentence (restricting collo-
cations to a five-term window; all collocations weighted
equally) and across sentence (restricting collocations to a five-
Distance Effects for Across-Sentence sentence window; all collocations weighted equally). Since,
Collocations in general, directional collocations proved superior to non-
directional, we performed this analysis using directional
We compared the performance of collocations extracted at collocations. We explored various linear combinations of the
adjacent sentences, one sentence difference, and so on, up to two directional schemes,17 and obtain maximum effectiveness
a five sentence difference (similarly to our analysis of within- with query-document distance, D = 0.7 × Dwithin-sentence + 0.3
sentence collocations). Figure 5 illustrates the effects on × Dacross-sentence for tf weighting, and D = 0.6 × Dwithin-sentence
precision (similar effect were observed for recall and for the + 0.4 × Dacross-sentence for tf-idf. Table 3 describes the perfor-
F measure, and these results were consistent across weighting mance for the combined scheme, and the gains it yields over
schemes). each of the individual schemes.18
The results show that distance does not play a large role for The combined scheme is superior to any of the individual
across-sentence boundaries, and, contrary to our expectations, schemes on most performance measures, and the gains are
the number of sentences separating collocation terms appears consistent for both tf and tf-idf weighting schemes (effects
to have very little effect on retrieval performance. These size, in percentages, are presented in Table 3). The combina-
findings suggest that a collocation scheme that combines all tions yield some precision gains (up to 0.027). Recall of the
across-sentence collocations should assign even weights to
collocations at all distances.
17
We employ the common approach for combining schemes (Mitra et al.
1997; Strzalkowski et al. 1996). We initially preserved separate indexes for
Combining Within and Across-Sentence both schemes, performed query-document matching separately for each, and
Collocation Indexing Schemes then combined the matching results into one query-document similarity value.
18
We employed the 88 queries that, for these schemes, included at least one
Findings from Experiment #1 suggest that both within and indexing unit (excluding queries 361, 362, 365, 369, 393, 397, 405, 417, 423,
across-sentence collocations are essential for collocation 425, 427, and 439).

Across-Sentence - Collocations at Different Distances

0.300
0.250
0.200
Precision
Precision[10]
0.150 Precision[20]
Precision[30]
0.100
0.050
0.000
Within sent. 0 sent. diff. 1 sent. diff. 2 sent. diff. 3 sent. diff. 4 sent. diff. 5 sent. diff.
Sentence Difference
Figure 5. The Effect of Distance for Across-Sentence Collocations (Using TF Weighting)
Table 3. Combining Within- and Across-Sentence Collocation Indexing Schemes (Using Directional
Collocations)
Precision100
Precision10
Precision20
Precision30
Recall100
Recall10
Recall20
Recall30
Weight
F100
F10
F20
F30
Scheme
within- and across-
0.277 0.231 0.190 0.107 0.053 0.081 0.095 0.149 0.089 0.120 0.127 0.125
sentence
tf
versus within-sentence 6.6% 6.8% 4.0% 7.6% 2.8% –0.1% 0.8% 5.2% 3.4% 1.7% 2.2% 6.6%
versus across-sentence 8.4% 14.0% 7.5% 5.5% 50.3%* 52.7%** 42.6%* 19.1% 43.5% 42.7%* 30.9% 11.2%
within- and across-
0.303 0.247 0.208 0./112 0.057 0.085 0.099 0.155 0.096 0.127 0.134 0.130
sentence
tf-
idf versus within-sentence 10.3% 8.2% 10.7% 6.6% 2.9% –0.2% 0.9% 5.8% 4.1% 1.9% 4.1% 6.3%
versus across-sentence 15.1% 15.1% 7.9% 4.2% 53.4%* 50.5%** 35.3%* 18.4% 47.3%* 41.4%* 26.4% 10.2%
Asterisks indicate statistical significance of the results (using a one-sided T test): *p < 0.1, **p < 0.05.
combined scheme is slightly better than that of within- Experiment #2) to the traditional token-based indexing. We
sentence collocations (up to 0.002 higher), and substantially initially compared the two distinct indexing schemes, tokens
better (0.025) than that of across-sentence collocations and collocations, and then integrated the two to obtain an
(recalls 10, 20, and 30 are statistically significant). F measure effective combined scheme.
is somewhat better than that of within-sentence (up to 0.002
gains), and substantially better (0.024 – 0.033) than that of
across-sentence (statistical significance: F[20] for tf; F[10] Comparing Collocations Against Tokens
and F[20] for tf-idf). The results for Precision[10] and
Recall[10] (using tf-idf weighting) are illustrated in Figure 6. Table 4 compares the best collocations scheme against the
traditional token scheme, for both tf and tf-idf weighting.
Experiment #3: Token and Collocations Overall, the collocations scheme alone is superior in precision
to tokens, while the effect of recall is dependent on the
In the third experiment, we compared collocation-based weighting scheme. In line with the findings from previous
indexing (using the best collocation scheme identified in studies, tf-idf works well for tokens (0.031 – 0.045, or 18 to

Combining Within & Across-Sentence Schemes Combining Within & Across-Sentence Schemes
Combined
Within- Combined
scheme
0.310
sentence scheme
0.060
0.300
0.050 Across-
Within-
Precision[10]
0.290 sentence
Recall[10]
sentence Across- 0.040
0.280
sentence 0.030
0.270
0.260 0.020
0.250 0.010
0.240 0.000
Figure 6. Precision[10] and Recall[10]: Combining Within- and Across-Sentence Collocations
Table 4. Comparing Collocations (Using the Combined Within- and Across-Sentence Scheme) to Tokens
for tf and tf-idf Weighting
Precision100
Precision10
Precision20
Precision30
Recall100
Recall10
Recall20
Recall30
Weight
F100
F10
F20
F30
Scheme
Tokens 0.226 0.185 0.159 0.098 0.049 0.074 0.089 0.149 0.080 0.106 0.114 0.118
tf
Collocations 0.277 0.231 0.190 0.107 0.053 0.081 0.095 0.149 0.089 0.120 0.127 0.125
tf- Tokens 0.267 0.230 0.202 0.129 0.066 0.094 0.121 0.208 0.106 0.133 0.151 0.160
idf Collocations 0.303 0.247 0.208 0.112 0.057 0.085 0.099 0.155 0.096 0.127 0.134 0.130
32 percent, precision gains; and 0.017 – 0.059, or 27 to 40 usually interested in exploring only the top-ranked
percent, recall gains), and to a lesser extent for collocations documents.
(0.005 – 0.026, or 4 to 9 percent, precision gains; and 0.004
– 0.006, or 4 to 7 percent, recall gains). When using simple In order to gain additional insights into the factors driving
tf weighting, collocations outperform tokens: precision (up to performance, we analyzed all queries in detail. We found that
0.051, or 25 percent, and statistically significant19), recall (up the number of terms in a query has a substantial effect on
to 0.007, or 10 percent), and F (up to 0.014, or 14 percent). performance. For collocations, query length has a large posi-
When using tf-idf, the precision advantage of collocations tive effect on precision (long queries are roughly 30 percent
over tokens is diminished (up to 0.036, or 14 percent), and more precise), and a small negative effect on recall (roughly
recall levels are lower (0.009-0.053, or 9 to 26 percent) than 7 percent). For tokens, on the other hand, we observe a
those of tokens. Figure 7 illustrates the effect of weighting on reverse effect: long queries yield substantially higher recall
F measure. (roughly 30 percent), and no clear gains for precision. This
interesting effect has not been reported in previous studies; it
The relative performance of collocations depends on the has no simple intuitive explanation, and thus remains an issue
length of the result list we employ. The performance of for future research.
collocations is better at the top rankings, and this effect (using
tf-idf weighting) is illustrated in Figure 8. Performance at the Using Collocations to Enhance
top ranking list is of key importance, since searchers are Token-Based Indexing
19
Precision[10] and Precision[20] for collocations are (respectively) 0.051 (or Since in practice collocations are used to complement tokens,
23 percent) and 0.046 (or 25 percent) better than tokens, and the results are
statistically significant (p < 0.1). rather than to replace token indexing, we proceeded to com-

Collocations vs. tokens: tf and tf-idf

0.200
Tokens: tf-idf
0.150
F measure Collocations: tf-idf
0.100
Tokens: tf
0.050
Collocations: tf
0.000
F10 F20 F30 F100
Figure 7. Comparing Tokens to Collocations and the Effect of Weighting Scheme on F Measure
Collocations vs. Tokens (tf-idf weighting) Tokens vs. Collocations (tf-idf weighting)
0.350
0.600
0.300 Tokens
Tokens
0.500
0.250
Precision
Collocations Precision
0.400 Collocations
0.200
0.300
0.150
0.200
0.100
0.100
0.050
0.000
0.000
6
0
0
0.
0.
0.
0.
0.
1.
P10 P20 P30 P100
Recall
Figure 8. Tokens Versus Collocations: Precision for Top Ranked Documents and Recall Precision
Curve Using tf-idf
bine tokens and collocations schemes,20 exploring various significant; recall: 0.012 – 0.025). With tf-idf, the combi-
linear combinations, reaching an optimum with a 60:40 nation still yields considerable gains over both tokens
collocations/tokens ratio (for both tf and tf-idf weighting). (precision: 0.023 – 0.084, and statistically significant; recall:
The effect for F[10] is illustrated in Figure 9. 0.004 – 0.012) and collocations (precision: 0.040 – 0.048;
recall: 0.013 – 0.064). Figure 10 illustrates the gains for the
Table 5 presents the results for the optimal collocations/ combined scheme (using tf-idf weighting).
tokens combination, when compared to the distinct colloca-
tions and tokens schemes. Figure 10 and Table 5 demonstrate that collocations can
enhance the precision of the traditional tokens-based indexing
The combined collocations/tokens scheme is superior to the substantially. Using the industry-standard tf-idf weighting,
distinct schemes, for both tf and tf-idf weighting (effects size, Precision[10] for the combined scheme is 0.084 (or 31.5
in percentages, are presented in Table 5). With tf weighting, percent) higher than that of the baseline tokens scheme,
the combination yields moderate gains over collocations (pre- whereas Precision[20] and Precision[30] gains are 0.062 (or
cision: 0.016 – 0.025; recall: 0.008 – 0.025), and substantial 27 percent) and 0.051 (or 25 percent) respectively (and all
gains over tokens (precision: 0.025 – 0.073, and statistically these effects are statistically significant). These gains are
20
substantially higher than those reported in the literature, and
For combining tokens and collocations we employed the same methodology to the best of our knowledge, no similar gains for collocations
we used for combining the two collocations schemes, that is, combining
query-document similarity scores. were obtained over a large-scale domain-independent test bed.

Combining Collocations and Tokens
0.140
0.120
0.100
F[10]
0.080 tf
0.060 tf-idf
0.040
0.020
0.000
/1
/2
/3
lo ken 4
en /5
/6
/7
To 9
ns
s
2/
/
To tion
To : 9
To : 8
To : 7
To : 6
To s: 4
To : 3
:1
ke
s:
:
ns
ns
ns
ns
ns
ns
ns
lo oca
ke
ke
ke
ke
ke
ke
ke
ok
l
To
ol
T
C
c/
c/
c/
c/
c/
c/
c/
c/
c/
lo
lo
lo
lo
lo
lo
lo
ol
ol
ol
ol
ol
ol
ol
ol
ol
C
C
C
C
Figure 9. Combining Collocations and Tokens with Various Linear Combinations, the Impact on F[10]
Table 5. Combining Collocations and Tokens (Using the 60:40 Collocations/Tokens Combination)
Precision100
Precision10
Precision20
Precision30
Recall100
Recall10
Recall20
Recall30
Weight
F100
F10
F20
F30
Scheme
Collocations/Tokens
0.299 0.247 0.215 0.123 0.061 0.090 0.109 0.174 0.101 0.132 0.144 0.144
Combination
Combination versus
tf 32.2%** 33.5%** 35.3%** 26.3%** 24.8% 21.5% 22.5% 16.9% 26.0% 24.7% 26.8%* 22.4%*
Tokens
Combinations
7.8% 6.9% 12.9% 15.0% 13.8% 10.6% 14.6% 16.5% 12.7% 9.6% 14.0% 15.7%
versus Collocations
Collocations/Tokens
0.351 0.292 0.253 0.152 0.070 0.106 0.132 0.219 0.117 0.156 0.174 0.180
Combination
tf- Combination versus

31.5%** 27.2%** 25.6%** 17.6% 5.4% 13.8% 9.5% 5.3% 9.8% 17.4% 15.0% 12.5%
idf Tokens
Combination versus
15.7% 18.4% 21.9%* 35.8%** 23.0% 24.5% 33.4%* 41.7%** 21.8% 22.9%* 29.4%* 38.2%***
Collocations
Asterisks indicate statistical significance of the results (using a one-sided T test): *p < 0.1; **p < 0.05; ***p < 0.01

Combining Collocations and Tokens (tf-idf weighting)

0.400
0.350
0.300
Precision
0.250
0.200
0.150
Tokens
0.100
Collocations
0.050
Collocations & Tokens
0.000
P10 P20 P30 P100
Figure 10. Effectiveness of the Combined Tokens/Collocations
Discussion findings present the first large-scale demonstration of the

importance of across-sentence collocations for the vector–
Collocation and IR Effectiveness space model, support recent findings in related areas—Mitten-
dorf et al. (2000) with the probabilistic model and Srikanth
Although theory in linguistics suggests that collocations are and Srihari (2002) with the set-based model—and suggest that
essential for capturing meaning, disambiguating terms, and collocation extraction should be performed both within and
enhancing IR precision, to date there has been no empirical across sentence boundaries. While Mittendorf et al. report on
evidence in the literature demonstrating that collocations can difficulties in designing an effective combination of within
enhance retrieval substantially. This study investigated and across sentence collocations, we found the combination
whether and how statistical collocation indexing can benefit moderately superior to any of the individual schemes on most
IR in terms of enhancing users’ ability to access textual infor- performance measures (the combination yielded both preci-
mation from large and heterogeneous collections. Linguistic sion (4 to 15 percent) and recall (6 to 53 percent) gains).
theory provides little guidance on operationalizations of
collocations and on ways to extract collocations from text Further explorations of collocation distance in Experiment 2
effectively. Accordingly, we investigated three key colloca- suggest that within a sentence, proximate collocations are
tions indexing parameters, and tested their impact on IR more useful than distant ones, and a collocation indexing
effectiveness. Two parameters, collocations directionality scheme should assign weights to collocations based on their
and distance, are applicable for any retrieval model, and thus distance, thus challenging the current practices of weighting
our findings related to these parameters could be generalized all within-sentence collocations equally. Nevertheless, our
to alternative IR models. The third parameter, the weighting initial exploration of such a weighting scheme did not yield
scheme, is model specific, and thus findings related to this performance gains. Across sentence boundaries, on the other
parameter are restricted to the vector space model. hand, distance does not seem to impact IR effectiveness, thus
a collocation indexing scheme should weight all across-
Collocations distance plays an important role in determining sentence collocations equally.
collocations performance. While the computational linguis-
tics literature suggests that collocations extraction should be The literature does not provide a clear answer on the effect of
restricted to grammatically-bound elements (i.e., terms ap- collocation directionality on IR performance. In Experiment
pearing in the same sentence; see Carmel et al. 2001; Maarek 1, we found that directional collocations are superior (in both
et al. 1991; Strzalkowski et al. 1996), the results from precision and recall), while for across sentence we observe a
Experiment 1 indicate that collocations both within and across mixed effect: directional collocations provide small precision
sentences are important for retrieval performance. These gains, while nondirectional yields substantially higher recall.

This finding could be explained by the fact that collocations 13 percent higher at Precision[100]), while recall is consis-
extracted within a sentence usually capture lexico-syntactic tently higher for tokens (9 to 26 percent). Thus, collocations
relations (mostly noun/verb relations) and thus preserving are a more effective indexing unit (as evident in tf results), but
collocation directionality is important. Across-sentence collo- due to the lack of an effective collocation weighting scheme,
cations, on the other hand, usually capture noun-noun rela- collocations and tokens perform similarly when using tf-idf.
tions (e.g., doctor/nurse), which are symmetric. Since the
overall performance of directional collocations is superior to In practice collocations are utilized to enhance token indexing
nondirectional collocations, we recommend that directional (e.g., Carmel et al. 2001; Strzalkowski et al. 1996), and we
collocations be employed. reached the optimal performance with a 60:40 collocations/
tokens combination. This combined collocations/tokens
The weighting of indexing units in document profiles has method yielded substantial performance gains over the
been the subject of considerable research in IR (Baeza-Yates traditional token-based indexing, for both tf (precision: 26 to
and Ribeiro-Neto 1999; Salton and McGill 1983) and, in 35 percent and statistically significant; recall: 17 to 25
general, advanced weighting schemes have yielded substantial percent) and tf-idf (precision: 18 to 32 percent and statis-
performance gains. Consistent with findings from previous tically significant; recall: 5 to 14 percent) weighting. We
studies, we found that the effect of tf-idf weighting on conclude that, even with industry-standard weighting (i.e., tf-
collocations (4 to 9 percent precision and recall gains) is sub- idf), collocations could be used to enhance precision sub-
stantially and recall moderately.
stantially lower than the effect observed for traditional token-
based indexing (18 to 32 percent precision gains, and 27 to 40
Our study provides the first strong evidence for the usefulness
percent recall gains). According to Strzalkowski et al. (1996),
of statistical collocation indexing that is performed over a
the cause of this effect is the term-independence assumption
large and heterogeneous collection. We attribute our success
that is implicit in tf-idf weighting. Collocations capture
to our collocation extraction parameters (i.e., distance and
dependencies between terms, and thus violate the term-
directionality) settings, and specifically to the incorporation
independence assumption. Previous works have explored of across-sentence collocations.
various adaptations of tf-idf to collocation with little success
(e.g., Alvarez et al. 2004; Buckley et al. 1996; Mitra et al. It is worth noting that the reported effectiveness levels are
1997; Strzalkowski et al. 1996). We implemented an adap- lower than those attained by state-of-the-art retrieval engines,
tation to tf-idf that is grounded in Shannon’s information as these retrieval engines incorporate many additional features
theory and has been proposed by Church and Hanks (1990), (beyond collocation indexing).21
only to find that performance levels are indistinguishable
from those of traditional tf-idf weighting.
Collocation and IR Efficiency
Recent advancements in work with language models (a newer
retrieval model that seeks to replace the traditional vector– Hevner et al. (2004) suggest that design should consider cost-
space model) try to relax the term-independence assumption benefit tradeoffs. Because inefficient IR techniques will not
(Jiang et al. 2004; Miller et al. 1999; Srikanth and Srihari scale up to large and heterogeneous collections, the efficiency
2002), thus allowing for modeling of term dependencies, but of collocation indexing becomes critical. IR is performed in
these “attempts to integrate positional information into the lan- two steps, preprocessing and real-time processing, with
guage-modeling approach to IR have not shown consistent different efficiency considerations for each. In preprocessing
significant improvements” (Jiang et al. 2004, p. 1). Thus, to (i.e., collocation extraction), we are mainly concerned with
date, it seems that there is no effective method for incorporating scalability to large volumes of documents, and less concerned
term dependencies into existing retrieval models, and there is with the exact timing of these processes, as most of them are
no collocation weighting scheme that is better than tf-idf. only performed once or at relatively long intervals.
Comparing the performance of collocations to that of tokens 21

For instance, the IBM Juru system, which competed at TREC10 (Carmel
in Experiment 3 showed that collocations are more effective et al. 2001), in addition to using collocation, allows users to define stop-
with the simple tf weighting (precision is 10 to 25 percent words (excluded from indexing), as well as a set of proper names. Juru also
processes information included in hyperlinks pointing to that document. In
higher, and recall is up to 10 percent higher), while when addition, Juru extracts information from web pages citing the page that is
using tf-idf we observe a mixed effect: precision is higher for currently being indexed (and this feature alone has shown to improve
collocations at the top ranking list (14 percent for Preci- precision by 20 percent). Despite all these enhancements, the Precision[10]
results attained by Juru are only 20 to 25 percent higher than the results
sion[10]) and higher for tokens when using a long list (tokens
reported here.

Table 6. Summarizing the Relative Effectiveness and Efficiency of Token and Collocation Indexing
Tokens, Effectiveness Efficiency
Collocations, and
†
Combinations Precision Recall Complexity Storage
Tokens Low Medium Low Medium
O(T) 1 gigabyte
Collocations Medium Low Medium Low

(up to 14% over tokens) (9% to 26% under tokens O(T × (Log(T))²) 0.5 gigabyte
Tokens and High High Medium High

Collocations (18% to 32% over tokens) (5% to 32% over tokens) O(T × (Log(T))²) (1.5 gigabyte
†
The complexity of collocation matching is similar for both tokens and collocations, and this is absent from the table.
The complexity of token indexing for one document is O(T); Common practice, as well as our findings, suggests that an
T = number of tokens in a document, while collocation index combining token and collocations is more effective than
indexing complexity is O(T × (Log(T))²).22 Although both are the pure token or collocation indexes. However, a disadvan-
suitable for very large document collections, the collocation tage of a combined token-collocation indexing procedure is its
indexing complexity is somewhat higher. inefficiency, although these efficiency differences are not
crucial and the combined approach could easily scale-up to
Real-time processing occurs when a user submits his query, very large collections. Thus, combined token-collection in-
thus making computational efficiency critical, so as not to dexing is still recommended for large and heterogeneous
keep the user waiting for the system’s response. In matching collections.
a query to documents, both the token and collocation-based
models employ inverted matrixes (Salton and McGill 1983),
where a query is matched with only documents containing at
least one query indexing unit. Thus, the complexity of the Conclusion
matching process is similar for both tokens and collocations,
and linear with the number of documents that contain query Despite the understanding that collocations are essential for
indexing units. representing the meaning of natural language text, the
effectiveness of collocation indexing in IR remained in doubt.
Storage requirements are also an important constraint. Collo- Prior studies were unable to provide solid evidence for the
cation indexing has been shown to require less storage space, ability of collocations to provide substantial effectiveness
when compared to token indexing (Maarek et al. 1991), and gains in retrieval of textual information from large and
in our study, collocation indexing required approximately 50 heterogeneous document collections.
percent less storage space than token indexing. Of course,
expanding the number of unique items (tokens or colloca- This study makes four contributions to our understanding of
tions) will change this ratio. Table 6 summarizes the relative the role collocations can play in general-settings IR. First, our
efficiency and effectiveness for tokens and collocations in our investigation of the effect of distance clearly shows that
experiments. (1) within a sentence, collocation distance is associated with
effectiveness, thus suggesting that the common practice of
assigning equal weight to all within-sentence collocations is
not optimal, and (2) existing theory and practice restricting
collocation extraction to elements co-occurring within the
22
The reported complexity is based on the following extraction algorithm: same sentence should be revisited, as across-sentence colloca-
(1) processing the text file to register the location (sentence and word
positions) for all relevant tokens (in our case, the 20 most frequent tokens),
tion provides additional effectiveness gains. Second, we
(2) for each relevant token instance, register co-occurrences with all other demonstrate the interaction effect between distance and direc-
relevant tokens, and (3) for each collocation, sum-up all co-occurrence tionality, suggesting that directionality is effective within
instances. It should be noted that the aim of this study was not to develop the sentence boundaries, while across sentences nondirectional
most efficient collocation extraction algorithm, and it is likely that more
efficient algorithms could be designed. collocations are superior. Third, our exploration of weighting

schemes for collocations and the disappointing results ob- prone and ambiguous, or alternatively (2) use complex,
tained with the mutual information tf-idf adaptation illustrate domain-specific, analysis, which is not scalable to large
the challenge of modeling collocations within existing collections and not portable across domains. This research
retrieval models. Finally, our results show that collocation work provides important implications for practitioners, and
indexing can substantially enhance IR precision, despite the demonstrated that a third approach that balances the earlier
inappropriateness of tf-idf weighting for collocations. These two is now available: employing scalable and domain-inde-
findings, in addition to contributing to the existing body of pendent statistical NLP methods, which address the problem
knowledge regarding collocation, have immediate implica- of word ambiguity and enable users to effectively access
tions to IR practitioners in providing guidelines for the design textual information.
of effective collocations extraction schemes.
Despite the potential role of document management systems
Although this study enhances our understanding of colloca- to enhance organizational competitiveness, the design of such
tion indexing, several issues warrant further research. systems has been under-explored in management information
systems literature. Sprague (1995, p. 30) has argued that
• Future research should attempt to identify an effective “harnessing information technology to manage documents is
weighting scheme for collocations. one of the most important challenges facing IS managers,”
and encouraged a shift in the focus of MIS research from
• Future research should try to design a collocation extrac- database to information retrieval systems. We support
tion scheme that, within a sentence, assigns more impor- Sprague’s appeal and call for increased attention to the design
tance to proximate collocations. of IR systems in MIS research.
• To generalize the results, it is necessary that similar

experiments be repeated for additional large and hetero- Acknowledgments
geneous text collections.
We thank Reginald Oake, Edmund Szeto, and Andy Leung for their
• In these experiments, we employed only one retrieval work in implementation and testing. We are thankful to David
model, the vector space model. Although insights Patient for providing feedback to earlier drafts of the paper. This
research was funded by a grant from the Natural Sciences and
regarding collocation extraction parameters (i.e., distance
Engineering Research Council of Canada.
and directionality) are model-independent, in order to
provide stronger evidence to the value of collocation in
IR, similar experiments should be repeated with alter- References
native retrieval models.
Alvarez C., Langlais P., and Nie J. Y. “Word Pairs in Language
• To verify the usefulness of collocation for real-life Modeling for Information Retrieval,” in Proceedings of the 7th
settings, it is important that collocation indexing models Conference on Computer Assisted Information Retrieval (RIAO),
be studied in the context of specific applications, Avignon, France, April 26-28, 2004, pp. 686-705.
particular domains, and real business tasks. Baeza-Yates, R., and Ribeiro-Neto, B. Modern Information
Retrieval, ACM Press, New York, 1999.
The amount of knowledge encoded in electronic text Buckley C., Singhal A., Mitra M., and Salton G. “New Retrieval
far surpasses that available in data alone. However, Approaches Using SMART: TREC 4,” in Proceedings of the
the ability to take advantage of this wealth of Fourth Text Retrieval Conference (TREC-4), D. K. Harman (ed.),
NIST Special Publication 500-236, Gaithersburg, MD, November
knowledge is just beginning to meet the challenge.
1-3, 1996 (http://trec.nist.gov/ pubs/trec4/t4_proceedings.html).
Businesses that can take advantage of this potential
Carmel, D., Amitay, E., Herscovici, M., Maarek, Y., Petruschka, Y.,
will surely gain a competitive advantage through and Soffer, A. “Juru at TREC 10: Experiments with Index
better decision-making and increased efficiencies Pruning,” in Proceedings of the 10th Text Retrieval Conference,
(Spangler et al. 2003, p. 192). E. M. Voorhees and D. K. Harman (eds.), NIST Special
Publication 500-250, Gaithersburg, MD, November 13-16, 2001,
To date, organizations trying to process textual information (http://trec.nist.gov/pubs/trec10/ t10_proceedings.html).
into meaningful knowledge, in a general setting, were faced Chen, H. Knowledge Management Systems: A Text-Mining Per-
with a difficult choice: they could either (1) employ very spective, University of Arizona, Tucson, AZ, 2001 (http://dlist.
simplistic processing methods that are scalable, yet error sir.arizona.edu/483/01/chenKMSi.pdf).

Church, K. W., and Hanks, P. “Word Association Norms, Mutual graphical Information,” in Lexicography: Principles and Prac-
Information, and Lexicography,” Computational Linguistics (16), tice, R. R. K. Hartmann (ed.), Academic Press, London, 1983, pp.
1990, pp. 22-29. 77-87.
Croft, W. B., Turtle, H. R., and Lewis, D. D. “The Use of Phrases Miller, D. R., Leek, T., and Schwartz, R. M. “A Hidden Markov
and Structured Queries in Information Retrieval,” in Proceedings Model Information Retrieval System,” in Proceedings of the 22nd
of the 14th Annual Conference on Research and Development in ACM Conference on Research and Development in Information
Information Retrieval (ACM SIGIR), ACM Press, New York, Retrieval (SIGIR’99), ACM Press, New York, 1999, pp.214-221.
1991, pp. 32-45. Miller, G. A. “WordNet: A Lexical Database,” Communications of
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harsh- the ACM (38:11), 1995, pp. 39-41.
man, R. “Indexing by Latent Semantic Analysis,” Journal of the Mitra M., Buckley C., Singhal A., and Cardie C., An Analysis of
American Society of Information Science (41:6), 1990, pp. Statistical and Syntactic Phrases, in Proceedings of the Fifth
391-407. Conference on Computer Assisted Information Retrieval (RIAO),
Fagan, J. L. “The Effectiveness of Nonsyntactic Approach to Auto- Montreal, Canada, June 25-27, 1997, pp. 200-214.
matic Phrase Indexing for Document Retrieval,” Journal of the Mittendorf, E., Mateev, B., and Schauble, P. “Using the
American Society for Information Science (40:2), 1989, pp. Co-occurrence of Words for Retrieval Weighting,” Information
115-132. Retrieval (3:3), 2000, pp. 243-251.
Fagan, J. L. Experiments in Automatic Phrase Indexing for Docu- Morita, K., Atlam, E. S., Fuketra, M., Tsuda, K., Oono, M., and
ment Retrieval: A Comparison of Syntactic and Nonsyntactic Aoe, J. ”Word Classification and Hierarchy Using Co-occurrence
Methods, unpublished Ph.D. thesis, Department of Computer Word Information,” Information Processing and Management
Science, Cornell University, Ithaca, NY, 1987. (40:6), 2004, pp. 957-972.
Firth, J. “A Synopsis of Linguistic Theory 1930-1955,” in Studies Ponte, J. M., and Croft, W. B. “A Language Modeling Approach
in Linguistic Analysis, Philological Society, Oxford, UK, 1957, to Information Retrieval, 1998,” in Proceedings of 21st Annual
pp. 1-32; reprinted in Selected Papers of J. R. Firth 1952-1959, International ACM SIGIR Conference on Research and Develop-
F. Palmer (ed.), Longman, London, 1968. ment in Information Retrieval, Melbourne, Australia, 1998, pp.
Hevner, A., March, S., Park, J., and Ram, S. “Design Science in 275-281.
Information Systems Research,” MIS Quarterly (28:1), 2004, pp. Porter, M. F. “An Algorithm for Suffix Stripping,” Program (14:3),
75-105. 1980, pp. 130-137.
Jain, H., Ramamurthy, K., Ryu, H., and Yasai-Ardekani, M. Robertson, S. E., and Spärck-Jones, K. “Relevance Weighting of
“Success of Data Resource Management in Distributed Envi- Search Terms,” Journal of the American Society for Information
ronments: An Empirical Investigation,” MIS Quarterly (22:1), Sciences (27:3), 1976, pp.129-146.
March 1998, pp. 77-29. Salton, G., and Buckley, C. “Automatic Text Structuring and
Jiang, M., Jensen, E., Beitzel, S., and Argamon, A. “Effective Use Retrieval: Experiments in Automatic Encyclopedia Searching,”
of Phrases in Language Modeling to Improve Information in Proceedings of the 15th Annual International ACM SIGGIR
Retrieval,” paper presented at the Eighth Symposium on AI & Conference on Research and Development in Information
Math Special Session on Intelligent Text Processing, Fort Retrieval, Chicago, October 13-16, 1991, pp. 21-30.
Lauderdale, FL, January 4-6, 2004. Salton, G., and Buckley, C. “Term Weighting Approaches in Auto-
Khoo, C., Myaeng, S., and Oddy, R. “Using Cause-Effect Relations matic Text Retrieval, Information Processing and Management
in Text to Improve Information Retrieval Precision,” Information (24:5), 1988, pp. 513-523.
Processing and Management (37), 2001, pp. 119-145. Salton, G., and McGill, M. J. Introduction to Modern Information
Lesk, M. E. “Word-Word Associations in Document Retrieval Retrieval, McGraw-Hill, New York, 1983.
Systems,” American Documentation (20:1), 1969, pp. 27-38. Salton, G., Wong, A., and Yang, C. S. “A Vector Space Model for
Lewis, D., and Spärck-Jones, K. “Natural Language Processing for Automatic Indexing,” Communications of the ACM (18:11),
Information Retrieval,” Communications of the ACM (39:1), 1975, pp. 613-620.
1996, pp. 92-101. Spangler, S., Kreulen, J. T., and Lessler J. “Generating and
Luhn, H. P. “A Business Intelligence System,” IBM Journal of Browsing Multiple Taxonomies Over a Document Collection,”
Research and Development (2:4), 1958, pp. 314-319. Journal of Management Information Systems (19:4), Spring
Maarek, Y., Berry, D., and Kaiser, G. “An Information Retrieval 2003, pp. 191-212.
Approach for Automatically Constructing Software Libraries,” Spärck-Jones, K. S. “Summary Performance Comparisons TREC-2
IEEE Transactions on Software Engineering (17:8), 1991, pp. Through TREC-8,” in Proceedings of the 8th Text Retrieval Con-
800-813. ference, E. M. Voorhees and D. K. Harman (eds.), NIST Special
Manning, C., and Schutze, H. Foundations of Statistical Natural Publication 500-246, Gaithersburg, MD, November 17-19, 1999
Language Processing (6th ed.), MIT Press, Cambridge, MA, (http://trec.nist.gov/pubs/trec8/t8_proceedings.html).
2003. Sprague, R. H. “Electronic Document Management: Challenges
Martin, W. J. R., Al, B. P. F., and van Strenkenburg, P. J. G. “On and Opportunities for Information Systems Managers,” MIS
the Processing of Text Corpus: From Textual Data to Lexico- Quarterly (19:1), March 1995, pp. 29-49.

Srikanth, M., and Srihari, R. “Biterm Language Models for Docu- About the Authors
ment Retrieval,” in Proceedings of 25th Annual International
ACM SIGIR Conference on Research and Development in Ofer Arazy is an assistant professor at the University of Alberta.
Information Retrieval, Tampere, Finland, 2002, pp. 425-426. He is a B.Sc. (in Industrial Engineering) and an MBA, and worked
Stokoe, C. , Oakes, M. P., and Tait, J. “Word Sense Disambiguation for 7 years in IT projects management. Ofer earned his Ph.D. at the
in Information Retrieval Revisited, in Proceedings of the 26th University of British Columbia in 2004. His research approach is
Annual International ACM SIGIR Conference on Research and grounded in Design Science, and his thesis investigated design
Development in Information Retrieval, Toronto, Canada, 2003, principles for information retrieval systems. Additional areas of
pp. 159-166. interest are multi-agent systems, recommendation systems, and
social informatics.
Strzalkowski, T., Guthrie, L., Karlgren, J., Leistensnider, J., Lin, F.,
Perez-Carballo, J., Straszheim, T., Wang, J., and Wilding, J.
Carson Woo is an associate professor and Stanley Kwok Professor
“Natural Language Information Retrieval: TREC-5 Report,”
of Business at the University of British Columbia. He earned his
Proceedings of the Fifth Text Retrieval Conference, E. M. Ph.D. in Computer Science from the University of Toronto in 1988.
Voorhees and D. K. Harman (eds.), NIST Special Publication Carson served as president of the Workshop on Information Tech-
500-238, Gaithersburg, MD, November 20-22, 1996 (http:// nology and Systems (WITS) Inc. from 2004 to 2006 and on the
trec.nist.gov/pubs/trec5/t5_proceedings.html). editorial board of several journals. His main research interest is in
Van Rijsbergen, C. J. “A Theoretical Basis for the Use of providing methods and tools for supporting the change and evolution
Co-occurrence Data in Retrieval,” Journal of Documentation of information systems from the business and organizational
(33), 1977, pp. 106-119. perspective.

Misq2007 Enhancing Information Retrieval Through Statistical Natural Language Processing A Study of Collocation Indexing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Misq2007 Enhancing Information Retrieval Through Statistical Natural Language Processing A Study of Collocation Indexing

Uploaded by

Copyright:

Available Formats

Arazy & Woo/Enhancing Information Retrieval

ENHANCING INFORMATION RETRIEVAL THROUGH

Although the management of information assets—specifically, Introduction

MIS Quarterly Vol. 31 No. 3, pp. 525-546/September 2007 525

526 MIS Quarterly Vol. 31 No. 3/September 2007

MIS Quarterly Vol. 31 No. 3/September 2007 527

W1,A Indexing Unit A

528 MIS Quarterly Vol. 31 No. 3/September 2007

MIS Quarterly Vol. 31 No. 3/September 2007 529

530 MIS Quarterly Vol. 31 No. 3/September 2007

is computed over all terms which are mentioned in Research Questions

MIS Quarterly Vol. 31 No. 3/September 2007 531

Data appeared within the same sentence, while in another scheme

Implementation We then indexed documents and queries with the remaining

We first processed the TREC collection documents to gener- Measures

532 MIS Quarterly Vol. 31 No. 3/September 2007

MIS Quarterly Vol. 31 No. 3/September 2007 533

Table 2. The Effects of Directionality, Distance, and Weighting on

No clear effect is visible for collocation distance (for either

534 MIS Quarterly Vol. 31 No. 3/September 2007

F[10]: tf vs. tf-idf

MIS Quarterly Vol. 31 No. 3/September 2007 535

Within-Sentence - Collocations at Different Distances

Figure 4. The Effect of Distance Within a Sentence on Precision (Using tf Weighting)

536 MIS Quarterly Vol. 31 No. 3/September 2007

Across-Sentence - Collocations at Different Distances

Figure 5. The Effect of Distance for Across-Sentence Collocations (Using TF Weighting)

MIS Quarterly Vol. 31 No. 3/September 2007 537

Figure 6. Precision[10] and Recall[10]: Combining Within- and Across-Sentence Collocations

538 MIS Quarterly Vol. 31 No. 3/September 2007

Collocations vs. tokens: tf and tf-idf

MIS Quarterly Vol. 31 No. 3/September 2007 539

Combining Collocations and Tokens

tf- Combination versus

540 MIS Quarterly Vol. 31 No. 3/September 2007

Combining Collocations and Tokens (tf-idf weighting)

Figure 10. Effectiveness of the Combined Tokens/Collocations

Discussion findings present the first large-scale demonstration of the

MIS Quarterly Vol. 31 No. 3/September 2007 541

Comparing the performance of collocations to that of tokens 21

542 MIS Quarterly Vol. 31 No. 3/September 2007

Collocations Medium Low Medium Low

Tokens and High High Medium High

MIS Quarterly Vol. 31 No. 3/September 2007 543

• To generalize the results, it is necessary that similar

544 MIS Quarterly Vol. 31 No. 3/September 2007

MIS Quarterly Vol. 31 No. 3/September 2007 545

546 MIS Quarterly Vol. 31 No. 3/September 2007

You might also like