Automatic Detection of Thesaurus Relations for Information Retrieval Applications

Gerda Ruge
Computer Science, Technical University of Munich

Abstract. Is it possible to discover semantic term relations useful for thesauri without any semantic information? Yes, it is. A recent approach to automatic thesaurus construction is based on explicit linguistic knowledge, i.e. a domain independent parser without any semantic component, and implicit linguistic knowledge contained in large amounts of real world texts. Such texts implicitly include the linguistic, especially semantic, knowledge that their authors needed to formulate them. This article explains how implicit semantic knowledge can be transformed into explicit knowledge. Evaluations of the quality and performance of the approach are very encouraging.

1 Introduction
In most cases, when the expression information retrieval is used, text retrieval is meant. Information retrieval systems manage large amounts of documents. A database containing 1,000,000 documents is a normal case in practice. Some database providers, e.g. NEXUS, have to deal with millions of documents a week.
The special purpose that retrieval systems are designed for is the search for items relevant to an information need of a user. Ideally this would be realized by a system that understands the question of the user as well as the content of the documents in the database - but this is far beyond the state of the art.
The requirements of information retrieval systems - domain independence, efficiency, robustness - force them to work very superficially in the case of large databases. The search is usually based on so-called terms. The terms are the searchable items of the system. The process of mapping documents to term representations is called indexing. In most retrieval systems, the index terms are all words in the documents with the exception of stopwords like determiners, prepositions or conjunctions. A query, i.e. the search request, then consists of terms; and the documents in the result set of the retrieval process are those that contain these query terms. Most of the work in retrieval research in the last decades has concentrated on refining this term based search method. One of these refinements is to use a thesaurus.
A thesaurus in the field of Information and Documentation is an ordered compilation of concepts which serves for indexing and retrieval in one documentation domain. A central point is not only to define terms but also relations between terms [2]. Such relations are synonymy ("container" - "receptacle"), broader terms or hyperonyms ("container" - "tank"), narrower terms or hyponyms ("tank" - "container"), the part-of relation ("car" - "tank") and antonymy ("acceleration" - "deceleration"). The concept semantically similar subsumes all these thesaurus relations. (In this article, synonymy is used in its strong sense; for the weak sense of "synonymy", semantically similar is used.)
It is difficult for retrieval system users to bring to mind the large number of terms that are semantically similar to their initial query terms. This holds especially for untrained users, who formulate short queries. For the original query {WORD, CLUSTERING, TERM, ASSOCIATION}, the following terms could be found in the titles of relevant documents like "Experiments on the Determination of the Relationships Between Terms" or "A Taxonomy of English Nouns and Verbs":
Example 1.
TERM, KEYWORD, DESCRIPTOR, WORD, NOUN, ADJECTIVE, VERB, STEM, CONCEPT, MEANING, SENSE, VARIANT, SEMANTIC, STATISTIC, RELATED, NARROWER, BROADER, DEPENDENT, CO-OCCURRENCE, INTERCHANGEABLE, FIRST ORDER, SECOND ORDER, CLUSTER, GROUP, EXPANSION, CLASS, SIMILARITY, SYNONYMY, ANTONYMY, HYPONYMY, HYPERONYMY, ASSOCIATION, RELATION, RELATIONSHIP, TAXONOMY, HIERARCHY, NETWORK, LEXICON, DICTIONARY, THESAURUS, GENERATION, CONSTRUCTION, CLASSIFICATION, DETERMINATION
To support the users, one important direction of research in automatic thesaurus construction is the detection of semantically similar terms as candidates for thesaurus relations.

2 Various Approaches for the Automatic Detection of Semantic Similarity
There is a large variety of approaches for the automatic detection of thesaurus relations, mainly suggested by retrieval researchers but also by linguists. In the following, some approaches are listed with brief characterizations. (There is a variety of derivations or combinations of these methods not cited here.)
In statistical term association, co-occurrence data of terms are analysed. The main idea of this approach relies on the assumption that terms occurring in similar contexts are synonyms. The contexts of an initial term are represented by terms frequently occurring in the same document or paragraph with the initial term. In theory, terms with a high degree of context term overlap should be synonyms, but in practice no synonyms could be found by this method [8]. Co-occurrence statistics can be refined by using singular value decomposition. In this case the relations are generated on the basis of a comparison of the main factors extracted from the co-occurrence data [12]. Even though this approach does not find semantically similar terms, the use of such term associations can improve retrieval results. Unfortunately, singular value decomposition is very costly, so it can only be applied to a small selection of the database terms.
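To make the factor-analytic variant concrete, here is a minimal sketch of co-occurrence comparison via singular value decomposition; the toy term-by-term co-occurrence matrix, the term list and the cutoff k are illustrative assumptions, not data from the original experiments.

    import numpy as np

    # Toy term-by-term co-occurrence matrix (rows/columns = terms).
    # In a real system these counts come from documents or paragraphs.
    terms = ["tank", "container", "pump", "piston"]
    C = np.array([[0, 8, 2, 1],
                  [8, 0, 1, 0],
                  [2, 1, 0, 9],
                  [1, 0, 9, 0]], dtype=float)

    # Keep only the k strongest factors of the co-occurrence data.
    k = 2
    U, S, Vt = np.linalg.svd(C)
    reduced = U[:, :k] * S[:k]   # each row: a term in the main-factor space

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Compare terms by their main factors rather than raw co-occurrences.
    print(cosine(reduced[0], reduced[1]))   # "tank" vs. "container"

The point of the reduction is that two terms can be similar in the factor space even if they rarely co-occur directly; the cost of the decomposition is what limits the method to a small selection of terms.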
Salton [14] gives a summary of work on pseudo-classification. For this approach, relevance judgements are required: each document must be judged with respect to its relevance to each of a set of queries. Then an optimization algorithm can be run: it assigns relation weights to all term pairs, such that expanding the query by terms related with high weights leads to retrieval results as correct as possible. This is the training phase of the approach. After the training, term pairs with high weights represent thesaurus relations. A disadvantage of this approach lies in the high effort required for the manual determination of the relevance judgements as well as for the automatic optimization. The manual effort is even comparable to manual thesaurus construction.
Hyponyms were extracted from large text corpora by Hearst [5]. She searched for relations directly mentioned in the texts and discovered text patterns that relate hyponyms, e.g. "such as". Frequently this method leads to hyponyms that are not directly related in the hierarchy, like "species" and "steatornis oilbird", or to questionable hyponyms, like "target" and "airplane".
Hyperonyms can also be detected by analysing definitions in monolingual lexica. A hyperonym of the defined term is the so-called genus term, the one that gives the main characterization of the defined term [?]. Genus terms can be recognized by means of syntactic analysis. A further approach is based on the idea that semantically similar terms have similar definitions in a lexicon. Terms that have many defining terms in common are supposed to be semantically similar or synonyms [13]. These lexicon based approaches seem very plausible; however, they have not been evaluated. One problem of these approaches is their coverage: only terms that are filed in a lexicon can be dealt with, thus many relations between technical terms will stay undiscovered.
Guentzer et al. [4] suggested an expert system that draws hypotheses about term relations on the basis of user observations. For example, if a retrieval system user combines two terms by OR in his/her query (and further requirements hold), these terms are probably synonyms. The system worked well, but unfortunately the users' capability of bringing to mind synonyms is very poor. Therefore the majority of the term pairs found were either morphologically similar, like "net" and "network", or translations of each other, like "user interface" and "Benutzeroberflaeche". Other synonyms were hardly found.
These examples of approaches for the detection of semantically similar terms show clearly that the main ideas can be very plausible but nevertheless do not work in practice. In the following, a recent approach is explained that has been confirmed by different research groups.
3 Detection of Term Similarities on the Basis of Dependency Analysis
In this section we describe a method that extracts term relations fully automatically from large corpora. Therefore, domain dependent thesaurus relations can be produced for each text database. The approach is based on linguistic as well as statistical analysis. The linguistic background of this approach is dependency theory.
3.1 Dependency Theory
The theory of dependency analysis has a long tradition in linguistics. Its central concept, the dependency relation, is the relation between heads and modifiers. Modifiers are terms that specify another term, the head, in a sentence or phrase. Examples 2, 3 and 4 below include the head modifier relation between "thesaurus" and "construction".
Example 2. thesaurus construction
Example 3. construction of a complete domain dependent monolingual thesaurus
Example 4. automatic thesaurus generation or construction
In example 3, the head "thesaurus" has three modifiers: "complete", "dependent", and "monolingual". Example 4 exemplifies that a modifier might have more than one head in the case of conjunctions. Here the modifiers "automatic" and "thesaurus" specify both heads, "generation" and "construction".
In dependency trees, the heads are always drawn above their modifiers. A line stands for the head modifier relation. For the purpose of information retrieval, stopwords like determiners, prepositions or conjunctions are usually neglected, such that the tree only contains relations between index terms. Such dependency trees differ from syntax trees of Chomsky grammars in that their structure is already abstracted from the particular structure of a language; e.g. example 5 has the same dependency tree as example 6.

drink
Example 5. Peter drinks a sweet hot co ee.   HH
Peter co ee
Example 6. Peter drinks a co ee which is sweet and hot. HHhot
sweet
3.2 Head and Modifier Extraction from Corpora
Some implementations of dependency analysis are practically applicable to large amounts of text because they realize a very quick and robust syntactic analysis. Hindle [6] reports on a free text analysis with subsequent head modifier extraction which deals with one million words over night. Strzalkowski [15] reports a rate of one million words in 8 hours. The latest version of our own parser, described in [11], is much faster: 3 MB per minute real time on a SUN SPARCstation, i.e. approximately 15 minutes for one million words. Such a parser works robustly and domain independently but yields only partial analysis trees. Our dependency analysis of noun phrases has been evaluated with respect to error rates: 85% of the head modifier tokens were determined correctly and 14% were introduced wrongly. These error rates are acceptable because the further processing is very robust, as shown by the results below. Table 1 shows the most frequent heads and modifiers of the term "pump" in three annual editions of abstracts of the American Patent and Trademark Office.

pump
Modifiers      Frequency   Heads       Frequency
heat           444         chamber     294
injection      441         housing     276
hydraulic      306         assembly    177
vacuum         238         system      160
driven         207         connected   141
displacement   183         pressure    124
fuel           181         piston      120
pressure       142         unit        119
oil            140         body        115

Table 1. Most frequent heads and modifiers of "pump" in 120 MB of patent abstracts
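The bookkeeping behind such a table is simple once a parser emits head modifier pairs. Below is a minimal sketch of the counting step, assuming the (head, modifier) pairs have already been extracted; the pair list is illustrative.

    from collections import Counter

    # Illustrative parser output: (head, modifier) pairs for one corpus.
    pairs = [("pump", "heat"), ("pump", "injection"), ("chamber", "pump"),
             ("housing", "pump"), ("pump", "heat"), ("chamber", "pump")]

    modifier_counts = Counter()   # modifiers observed for the term "pump"
    head_counts = Counter()       # heads observed for the term "pump"

    for head, modifier in pairs:
        if head == "pump":
            modifier_counts[modifier] += 1
        if modifier == "pump":
            head_counts[head] += 1

    print(modifier_counts.most_common())   # [('heat', 2), ('injection', 1)]
    print(head_counts.most_common())       # [('chamber', 2), ('housing', 1)]

Run over 120 MB of abstracts, exactly this kind of tally produces the frequency columns of Table 1.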

3.3 Semantics of the Head Modifier Relation

Head modifier relations bridge the gap between syntax and semantics: on the one hand they can be extracted on the basis of purely syntactic analysis; on the other hand the modifier specifies the head, and this is a semantic relation. How can the semantic information contained in head modifier relations give hints about semantic similarity? Different semantic theories suggest that terms having many heads and modifiers in common are semantically similar. These connections are now sketched very briefly.
Katz and Fodor [7] introduced a feature based semantic theory. They claimed that the meaning of a word can be represented by a set of semantic features; for example, {+human, +male, -married} is the famous feature representation of "bachelor". These features also explain which words are compatible, e.g. "bachelor" is compatible with "young", but not with "fly". The selection of all possible heads and modifiers of a term therefore means the selection of all compatible words. If the corpus has a large coverage and all heads and modifiers of two terms are the same, they should also have the same feature representation, i.e. the same meaning.
Wittgenstein's view of the relation of meaning and use was mirrored in the co-occurrence approach in Sect. 2. Terms appearing in similar contexts were supposed to be synonyms. Probably, this idea is correct if the contexts are very small. Heads and modifiers are contexts, and these contexts are as small as possible - only one term long. Thus, head and modifier comparison implements smallest contexts comparison.
In model theoretic semantics the so-called extensional meaning of many words is denoted by a set of objects, e.g. the set of all dogs in the world represents the meaning of the word "dog". If two terms occur as head and modifier in a real world corpus, in most cases the intersection of their representations is not empty. Thus the head modifier relation between two terms means that each term is a possible property of the objects of the other term. Head modifier comparison therefore is the comparison of possible properties.
Head modifier relations can also be found in human memory structure. A variety of experiments with human associations suggest that in many cases heads or modifiers are responses to stimulus words, e.g. stimulus "rose" and response "red". As stimulus words, terms with common heads and modifiers therefore can effect the same associations.

3.4 Experiments on Head-Modifier-Based Term Similarity

According to the theories in Sect. 3.3, terms are semantically similar if they correspond in their heads and modifiers. Section 3.2 shows that masses of head modifier relations can be extracted automatically from large amounts of text. Thus semantically similar terms can be generated fully automatically by means of head modifier comparison. Implementations of this method are described by Ruge [9], Grefenstette [3] and Strzalkowski [16]. Table 2 gives some examples of initial terms together with their most similar terms (with respect to head and modifier overlap). The examples contain different types of thesaurus relations: synonyms ("quantity" - "amount"), hyperonyms ("president" - "head"), hyponyms ("government" - "regime") and part-of relations ("government" - "minister").
quantity        government    president
amount          leader        director
volume          party         chairman
rate            regime        office
concentration   year          manage
ratio           week          executive
value           man           official
content         minister      head
level           president     lead

Table 2. Most similar terms of "quantity" [9], "government" [3] and "president" [16]

The comparison of heads and modifiers is expressed by a value between 0 (no overlap) and 1 (identical heads and modifiers). This value is determined by a similarity measure. I found that the cosine measure with a logarithmic weighting function (1) works best ([9], [10]).

\[
\mathrm{sim}(t_i, t_j) \;=\; \frac{1}{2}\,\frac{h_i^{\ln} \cdot h_j^{\ln}}{\|h_i^{\ln}\|\,\|h_j^{\ln}\|} \;+\; \frac{1}{2}\,\frac{m_i^{\ln} \cdot m_j^{\ln}}{\|m_i^{\ln}\|\,\|m_j^{\ln}\|} \tag{1}
\]

In Eq. (1), $h_i^{\ln} = (\ln'(h_{i1}), \ldots, \ln'(h_{in}))$, where $h_{ik}$ stands for the number of occurrences of term $t_k$ as head of term $t_i$ in the corpus, and $\ln'(r) = \ln(r)$ if $r > 1$ and $0$ otherwise. $m_i^{\ln}$ is defined analogously for modifiers. The cosine measure in principle gives the cosine of the angle between the two terms represented in a space spanned by the heads or spanned by the modifiers. In sim, the weights of the heads and modifiers are smoothed logarithmically; sim gives the mean of the head space cosine and the modifier space cosine of the two terms.
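A direct implementation of Eq. (1) over sparse count vectors might look as follows; the "pump" counts echo Table 1, while the "compressor" profiles are invented purely for illustration.

    import math
    from collections import Counter

    def ln_prime(r: float) -> float:
        """The smoothing ln'(r) of Eq. (1): ln(r) for r > 1, else 0."""
        return math.log(r) if r > 1 else 0.0

    def log_cosine(u: Counter, v: Counter) -> float:
        """Cosine of two sparse count vectors after logarithmic smoothing."""
        dot = sum(ln_prime(u[k]) * ln_prime(v[k]) for k in u.keys() & v.keys())
        norm_u = math.sqrt(sum(ln_prime(c) ** 2 for c in u.values()))
        norm_v = math.sqrt(sum(ln_prime(c) ** 2 for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def sim(heads_i, heads_j, mods_i, mods_j) -> float:
        """Eq. (1): mean of head-space and modifier-space cosine."""
        return 0.5 * log_cosine(heads_i, heads_j) + 0.5 * log_cosine(mods_i, mods_j)

    # Head and modifier profiles of two terms (counts from a corpus).
    heads_pump = Counter({"chamber": 294, "housing": 276, "system": 160})
    heads_comp = Counter({"housing": 80, "system": 40, "unit": 30})
    mods_pump = Counter({"heat": 444, "hydraulic": 306, "fuel": 181})
    mods_comp = Counter({"hydraulic": 50, "fuel": 20, "air": 90})

    print(sim(heads_pump, heads_comp, mods_pump, mods_comp))  # value in [0, 1]

Note how the logarithmic smoothing keeps very frequent heads or modifiers from dominating the cosine.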
The head modifier approach has been examined on different levels of evaluation. First, the rate of semantically similar terms among the 10 terms with the highest sim-values for 159 different initial terms was evaluated ([9], [10]). About 70% of the terms found were semantically similar in the sense that they could be used as additional terms for information retrieval. Grefenstette [3] and Strzalkowski [16] clustered those terms that had a high similarity value. Then they expanded queries by replacing all query terms by their clusters. The expanded queries performed better than the original queries. Strzalkowski found an improvement in the retrieval results of 20%. This is a very good value in the retrieval context. Unfortunately, the effect of query expansion alone is not known, because Strzalkowski used further linguistic techniques in his retrieval system.
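A minimal sketch of the cluster-based expansion step described above, assuming the similarity clusters have already been computed with sim; the clusters themselves are illustrative, not taken from the cited experiments.

    # Illustrative similarity clusters (term -> its cluster).
    clusters = {
        "quantity": ["quantity", "amount", "volume"],
        "government": ["government", "regime"],
    }

    def expand(query: list[str]) -> list[str]:
        """Replace each query term by its cluster; keep unclustered terms."""
        expanded: list[str] = []
        for term in query:
            expanded.extend(clusters.get(term, [term]))
        return expanded

    print(expand(["government", "quantity", "policy"]))
    # ['government', 'regime', 'quantity', 'amount', 'volume', 'policy']

The expanded query then retrieves documents that mention a semantically similar term even when the original query term is absent.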

4 Conclusions
A disadvantage of most information retrieval systems is that the search is based on document terms. These term based systems can be improved by incorporating thesaurus relations. The expensive task of manual thesaurus construction should be supported or replaced by automatic tools. After a long period of disappointing results in this field, linguistically based systems have shown encouraging results. These systems are based on the extraction and comparison of head modifier relations in large corpora. They are applicable in practice because they use robust and fast parsers.

References
1. Das-Gupta, P.: Boolean Interpretation of Conjunctions for Document Retrieval. JASIS 38 (1987) 245-254
2. DIN 1463: Erstellung und Weiterentwicklung von Thesauri. Deutsches Institut fuer Normung (1987), related standard published in English: ISO 2788:1986
3. Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, Dordrecht, London (1994)
4. Guentzer, U., Juettner, G., Seegmueller, G., Sarre, F.: Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Inf. Proc. & Management 25 (1989) 265-273
5. Hearst, M.: Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING 92, Nantes, Vol. 2 (1992) 539-545
6. Hindle, D.: A Parser for Text Corpora. Technical Memorandum, AT&T Bell Laboratories (1990); also published in Atkins, A., Zampolli, A.: Computational Approaches to the Lexicon. Oxford University Press (1993)
7. Katz, J., Fodor, J.: The Structure of a Semantic Theory. In Fodor, J., Katz, J.: The Structure of Language: Readings in the Philosophy of Language. Englewood Cliffs, NJ, Prentice Hall (1964) 479-518
8. Lesk, M.: Word-Word Associations in Document Retrieval Systems. American Documentation (1969) 27-38
9. Ruge, G.: Experiments on Linguistically Based Term Associations. Inf. Proc. & Management 28 (1992) 317-332
10. Ruge, G.: Wortbedeutung und Termassoziation. Reihe Sprache und Computer 14 (1995) Georg Olms Verlag, Hildesheim, Zuerich, New York
11. Ruge, G., Schwarz, C., Warner, A.: Effectiveness and Efficiency in Natural Language Processing for Large Amounts of Text. JASIS 42 (1991) 450-456
12. Schuetze, H., Pedersen, J.: A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval. Proceedings of RIAO 94, New York (1994) 266-274
13. Shaikevich, A.: Automatic Construction of a Thesaurus from Explanatory Dictionaries. Automatic Documentation and Mathematical Linguistics 19 (1985) 76-89
14. Salton, G.: Automatic Term Class Construction Using Relevance - A Summary of Work in Automatic Pseudoclassification. Inf. Proc. & Management 16 (1980) 1-15
15. Strzalkowski, T.: TTP: A Fast and Robust Parser for Natural Language. Proceedings of COLING 92, Vol. 1 (1992) 198-204
16. Strzalkowski, T.: Natural Language Information Retrieval. Inf. Proc. & Management 31 (1995) 397-417
