1 Introduction
In most cases, when the expression "information retrieval" is used, text retrieval is meant. Information retrieval systems manage large amounts of documents: a database containing 1,000,000 documents is a normal case in practice, and some database providers, e.g. NEXUS, have to deal with millions of documents a week. The special purpose that retrieval systems are designed for is the search for items that are relevant with respect to an information need of a user. Ideally, this would be realized by a system that understands the question of the user as well as the content of the documents in the database - but such systems are far beyond the state of the art.
The requirements on information retrieval systems - domain independence, efficiency, robustness - force them to work very superficially in the case of large databases. The search is usually based on so-called terms, the searchable items of the system. The process of mapping documents to term representations is called indexing. In most retrieval systems, the index terms are all words in the documents with the exception of stopwords like determiners, prepositions or conjunctions. A query, i.e. the search request, then consists of terms, and the documents in the result set of the retrieval process are those that contain these query terms. Most of the work in retrieval research in the last decades has concentrated on refining this term-based search method. One of these refinements is to use a thesaurus.
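The term-based scheme just described can be sketched in a few lines; the stopword list, documents, and the `search` function below are illustrative inventions, not part of any particular system:

```python
# Minimal sketch of term-based indexing and retrieval (illustrative data only).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "to"}

def index_terms(text):
    """Map a document to its index terms: all words except stopwords."""
    words = text.lower().split()
    return {w.strip(".,") for w in words} - STOPWORDS

documents = {
    1: "Clustering of word associations in large databases",
    2: "The design of hydraulic pumps",
}

# Indexing: build an inverted index from term to the documents containing it.
inverted = {}
for doc_id, text in documents.items():
    for term in index_terms(text):
        inverted.setdefault(term, set()).add(doc_id)

def search(query):
    """Return all documents that contain at least one query term."""
    result = set()
    for term in index_terms(query):
        result |= inverted.get(term, set())
    return result

print(search("word clustering"))  # -> {1}
```

A query term absent from every document simply contributes nothing to the result set, which is exactly why semantically similar terms are worth adding.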
A thesaurus in the field of Information and Documentation is an ordered compilation of concepts which serves for indexing and retrieval in one documentation domain. A central point is not only to define terms but also relations between terms [?]. Such relations are synonymy ("container" - "receptacle"), broader terms or hyperonyms ("container" - "tank"), narrower terms or hyponyms ("tank" - "container"), the part-of relation ("car" - "tank") and antonymy ("acceleration" - "deceleration"). The concept "semantically similar" subsumes all these thesaurus relations.
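As a minimal illustration, such relations could be stored as a labelled set of term pairs; the representation and lookup function below are invented for this sketch, and the entries are just the examples from the text:

```python
# A toy thesaurus storing labelled relations between term pairs
# (entries taken from the running examples; representation is illustrative).
thesaurus = {
    ("container", "receptacle"): "synonym",
    ("container", "tank"): "narrower term",  # "tank" is a hyponym of "container"
    ("car", "tank"): "part-of",              # a tank is part of a car
    ("acceleration", "deceleration"): "antonym",
}

def semantically_similar(term):
    """All terms related to `term` by any thesaurus relation."""
    related = set()
    for (a, b), _relation in thesaurus.items():
        if a == term:
            related.add(b)
        elif b == term:
            related.add(a)
    return related

print(semantically_similar("container"))  # -> {'receptacle', 'tank'}
```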
It is difficult for retrieval system users to bring to mind the large number of terms that are semantically similar to their initial query terms. This holds especially for untrained users, who formulate short queries. For the original query {WORD, CLUSTERING, TERM, ASSOCIATION} the following terms could be found in the titles of relevant documents.
Example 5. Peter drinks a sweet hot coffee.

Example 6. Peter drinks a coffee which is sweet and hot.

[Dependency trees of Examples 5 and 6: "drink" governs "Peter" and "coffee"; "sweet" and "hot" modify "coffee".]
3.2 Head and Modifier Extraction from Corpora
Some implementations of dependency analysis are practically applicable to large amounts of text because they realize a very quick and robust syntactic analysis. Hindle [?] reports on a free text analysis with subsequent head modifier extraction which deals with one million words overnight. Strzalkowski [?] reports a rate of one million words in 8 hours. The latest version of our own parser, described in [?], is much faster: 3 MB per minute real time on a SUN SPARCstation, i.e. approximately 15 minutes for one million words. Such a parser works robustly and domain-independently but results only in partial analysis trees. Our dependency analysis of noun phrases has been evaluated with respect to error rates: 85% of the head modifier tokens were determined correctly and 14% were introduced wrongly. These error rates are acceptable because the further processing is very robust, as shown by the results below. Table 1 shows the most frequent heads and modifiers of the term "pump" in three annual editions of abstracts of the American Patent and Trademark Office.
pump
Modifiers      Frequency    Heads        Frequency
heat           444          chamber      294
injection      441          housing      276
hydraulic      306          assembly     177
vacuum         238          system       160
driven         207          connected    141
displacement   183          pressure     124
fuel           181          piston       120
pressure       142          unit         119
oil            140          body         115

Table 1. Most frequent heads and modifiers of "pump" in 120 MB of patent abstracts
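A real dependency parser is far beyond a few lines, but the counting behind a table like Table 1 can be sketched under a strong simplifying assumption - that in a flat noun phrase every word modifies its right neighbour. This toy rule and the mini-corpus are invented for illustration and do not reflect our actual parser:

```python
from collections import Counter

def extract_pairs(noun_phrase):
    """Return (modifier, head) pairs from a flat noun phrase,
    assuming each word modifies its right neighbour (toy rule)."""
    words = noun_phrase.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

heads = Counter()      # heads observed for "pump"
modifiers = Counter()  # modifiers observed for "pump"
corpus = ["heat pump assembly", "hydraulic pump housing", "heat pump system"]
for np in corpus:
    for mod, head in extract_pairs(np):
        if mod == "pump":   # "pump" modifies something -> count its head
            heads[head] += 1
        if head == "pump":  # something modifies "pump" -> count the modifier
            modifiers[mod] += 1

print(modifiers.most_common())  # -> [('heat', 2), ('hydraulic', 1)]
print(heads.most_common())
```

Accumulating such counts over a large corpus yields, per term, one frequency vector of heads and one of modifiers - the raw material for the similarity measure below.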
"bachelor". These features also explain which words are compatible, e.g. "bachelor" is compatible with "young", but not with "fly". The selection of all possible heads and modifiers of a term therefore means the selection of all compatible words. If the corpus has a large coverage and all heads and modifiers of two terms are the same, they should also have the same feature representation, i.e. the same meaning.
Wittgenstein's view of the relation of meaning and use was mirrored in the co-occurrence approach in Sect. 2: terms appearing in similar contexts were supposed to be synonyms. This idea is probably correct if the contexts are very small. Heads and modifiers are contexts, and these contexts are as small as possible - only one term long. Thus, head and modifier comparison implements the comparison of smallest contexts.
In model-theoretic semantics the so-called extensional meaning of many words is denoted by a set of objects, e.g. the set of all dogs in the world represents the meaning of the word "dog". If two terms occur as head and modifier in a real-world corpus, in most cases the intersection of their extensional representations is not empty. Thus the head modifier relation between two terms means that each term is a possible property of the objects of the other term. Head modifier comparison therefore is the comparison of possible properties.
Head modifier relations can also be found in human memory structure. A variety of experiments with human associations suggest that in many cases heads or modifiers are responses to stimulus words, e.g. stimulus "rose" and response "red". As stimulus words, terms with common heads and modifiers therefore can effect the same associations.
The comparison of heads and modifiers is expressed by a value between 0 (no overlap) and 1 (identical heads and modifiers). This value is determined by a similarity measure. I found that the cosine measure with a logarithmic weighting function (1) works best ([?], [?]).
\[
\mathrm{sim}(t_i, t_j) \;=\; \frac{1}{2}\cdot\frac{h_i^{\ln}\cdot h_j^{\ln}}{|h_i^{\ln}|\,|h_j^{\ln}|} \;+\; \frac{1}{2}\cdot\frac{m_i^{\ln}\cdot m_j^{\ln}}{|m_i^{\ln}|\,|m_j^{\ln}|} \tag{1}
\]

In Eq. (1), h_i^ln = (ln'(h_{i1}), ..., ln'(h_{in})), where h_{ik} stands for the number of occurrences of term t_k as head of term t_i in the corpus, and ln'(r) = ln(r) if r > 1 and 0 otherwise. m_i^ln is defined analogously for modifiers. The cosine measure in principle gives the cosine of the angle between the two terms represented in a space spanned by the heads or spanned by the modifiers. In sim, the weights of the heads and modifiers are smoothed logarithmically. sim gives the mean of the head space cosine and the modifier space cosine of the two terms.
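The measure translates directly into code. The sketch below assumes raw head and modifier frequency vectors as input; the example counts are invented:

```python
import math

def ln_prime(r):
    """The logarithmic smoothing of Eq. (1): ln'(r) = ln(r) for r > 1, else 0."""
    return math.log(r) if r > 1 else 0.0

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim(h_i, h_j, m_i, m_j):
    """Mean of head-space and modifier-space cosines over log-smoothed counts.

    h_i[k] is the raw frequency of term k as head of term i;
    m_i[k] analogously for modifiers."""
    h_i = [ln_prime(x) for x in h_i]
    h_j = [ln_prime(x) for x in h_j]
    m_i = [ln_prime(x) for x in m_i]
    m_j = [ln_prime(x) for x in m_j]
    return 0.5 * cosine(h_i, h_j) + 0.5 * cosine(m_i, m_j)

# Identical head and modifier profiles give similarity ~1.0,
# completely disjoint profiles give 0.0.
print(sim([3, 0, 7], [3, 0, 7], [2, 5], [2, 5]))  # ~ 1.0
print(sim([5, 0], [0, 5], [5, 0], [0, 5]))        # -> 0.0
```

Note that counts of 1 are smoothed away entirely by ln', so hapax co-occurrences do not contribute to the similarity.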
The head modifier approach has been examined on different levels of evaluation. First, the rate of semantically similar terms among the 10 terms with the highest sim-values for 159 different initial terms was evaluated ([?], [?]). About 70% of the terms found were semantically similar in the sense that they could be used as additional terms for information retrieval. Grefenstette [?] and Strzalkowski [?] clustered those terms that had a high similarity value and then expanded queries by replacing all query terms by their clusters. The expanded queries performed better than the original queries; Strzalkowski found an improvement in the retrieval results of 20%, which is a very good value in the retrieval context. Unfortunately, the effect of query expansion alone is not known, because Strzalkowski used further linguistic techniques in his retrieval system.
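Cluster-based expansion of the kind used by Grefenstette and Strzalkowski can be sketched as follows; the clusters below are invented examples, not results from their evaluations:

```python
# Sketch of cluster-based query expansion: every query term is replaced
# by its whole similarity cluster (cluster contents are invented examples).
clusters = {
    "pump": {"pump", "compressor"},
    "fuel": {"fuel", "gasoline", "petrol"},
}

def expand_query(query_terms):
    """Replace each term by its cluster; unclustered terms stay as they are."""
    expanded = set()
    for term in query_terms:
        expanded |= clusters.get(term, {term})
    return expanded

print(sorted(expand_query({"fuel", "pump", "valve"})))
# -> ['compressor', 'fuel', 'gasoline', 'petrol', 'pump', 'valve']
```

The expanded query is then run through the ordinary term-based retrieval machinery, so documents mentioning only a cluster neighbour of a query term can still be found.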
4 Conclusions
A disadvantage of most Information Retrieval systems is that the search is based on document terms. These term-based systems can be improved by incorporating thesaurus relations. The expensive task of manual thesaurus construction should be supported or replaced by automatic tools. After a long period of disappointing results in this field, linguistically based systems have shown some encouraging results. These systems are based on the extraction and comparison of head modifier relations in large corpora. They are applicable in practice because they use robust and fast parsers.
References
1. Das-Gupta, P.: Boolean Interpretation of Conjunctions for Document Retrieval. JASIS 38 (1987) 245-254
2. DIN 1463: Erstellung und Weiterentwicklung von Thesauri. Deutsches Institut fuer Normung (1987), related standard published in English: ISO 2788:1986
3. Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, Dordrecht, London (1994)
4. Guentzer, U., Juettner, G., Seegmueller, G., Sarre, F.: Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Inf. Proc. & Management 25 (1989) 265-273
5. Hearst, M.: Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING 92, Nantes, Vol. 2 (1992) 539-545
6. Hindle, D.: A Parser for Text Corpora. Technical Memorandum, AT&T Bell Laboratories (1990), also published in Atkins, A., Zampolli, A.: Computational Approaches to the Lexicon. Oxford University Press (1993)
7. Katz, J., Fodor, J.: The Structure of a Semantic Theory. In Fodor, J., Katz, J.: The Structure of Language: Readings in the Philosophy of Language. Englewood Cliffs, NJ, Prentice Hall (1964) 479-518
8. Lesk, M.: Word-Word Associations in Document Retrieval Systems. American Documentation (1969) 27-38
9. Ruge, G.: Experiments on Linguistically Based Term Associations. Inf. Proc. & Management 28 (1992) 317-332
10. Ruge, G.: Wortbedeutung und Termassoziation. Reihe Sprache und Computer 14 (1995) Georg Olms Verlag, Hildesheim, Zuerich, New York
11. Ruge, G., Schwarz, C., Warner, A.: Effectiveness and Efficiency in Natural Language Processing for Large Amounts of Text. JASIS 42 (1991) 450-456
12. Schuetze, H., Pedersen, J.: A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval. Proceedings of RIAO 94, New York (1994) 266-274
13. Shaikevich, A.: Automatic Construction of a Thesaurus from Explanatory Dictionaries. Automatic Documentation and Mathematical Linguistics 19 (1985) 76-89
14. Salton, G.: Automatic Term Class Construction Using Relevance - A Summary of Work in Automatic Pseudoclassification. Inf. Proc. & Management 16 (1980) 1-15
15. Strzalkowski, T.: TTP: A Fast and Robust Parser for Natural Language. Proceedings of COLING 92, Vol. 1 (1992) 198-204
16. Strzalkowski, T.: Natural Language Information Retrieval. Inf. Proc. & Management 31 (1995) 397-417