1 Introduction
In information retrieval (IR), the relationship between a query and a document is determined primarily by the number and frequency of terms which they have in common. Unfortunately, words have many morphological variants which will not be recognized by term-matching algorithms without additional text processing. In most cases, these variants have similar semantic interpretations and can be treated as equivalent for information retrieval (as opposed to linguistic) applications. Therefore, stemming or conflation algorithms have been created for IR systems which reduce these variants to a root form. The linguistics groups at Xerox1 have developed a number of linguistic tools for English which can be used in information retrieval. In particular, they have produced an English lexical database which provides a morphological analysis of any word in the lexicon and identifies the base form. There is good reason to expect that this technology would be ideally suited for use as a stemming algorithm. However, this assumption needs to be tested by conducting experiments using IR test collections.

In this paper, we present a detailed analysis of the impact of stemming algorithms on performance in information retrieval. We compare traditional approaches based on suffix removal to linguistic methods based on the Xerox morphological tools. We provide a detailed analysis which identifies specific examples of when the different methods succeed or fail. On average, there is not a lot of difference between stemming algorithms, but for specific queries, the choice of conflation strategy can have a large impact on performance.
1 Natural Language Theory and Technology (NLTT) at the Xerox Palo Alto Research Center and Multi-Lingual Theory and Technology (MLTT) at the Rank Xerox Research Center in Grenoble, France
suffixes: ly, ness, ion, ize, ant, ent, ic, al, ical, able, ance, ary, ate, ce, y, dom, ee, eer, ence, ency, ery, ess, ful, hood, ible, icity, ify, ing, ish, ism, ist, istic, ity, ive, less, let, like, ment, ory, ous, ty, ship, some, ure

prefixes: anti, bi, co, contra, counter, de, di, dis, en, extra, in, inter, intra, micro, mid, mini, multi, non, over, para, poly, post, pre, pro, re, semi, sub, super, supra, sur, trans, tri, ultra, un

The databases are constructed using finite state transducers, which promotes very efficient storage and access. This technology also allows the conflation process to act in reverse, generating all conceivable surface forms from a single base form. The database starts with a lexicon of about 77 thousand base forms from which it can generate roughly half a million surface forms [15]. The Derivational Analyzer is currently being used in the Visual Recall information access and retrieval product available from XSoft.
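As an illustration of the kind of rule-based conflation these affix lists support, here is a minimal longest-match suffix stripper (a toy sketch with a trimmed suffix list and an arbitrary minimum-stem length, not the Xerox finite-state implementation):

```python
# Minimal longest-match suffix stripper, illustrating rule-based conflation.
# The suffix list is a small subset of the one above; the min_stem cut-off
# is an illustrative assumption.
SUFFIXES = sorted(
    ["ly", "ness", "ion", "ize", "ic", "ical", "able", "ful", "ment", "ous", "ity"],
    key=len, reverse=True,  # try longer suffixes before shorter ones
)

def strip_suffix(word: str, min_stem: int = 3) -> str:
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word
```

Note that such rules operate blindly on surface strings, which is exactly the weakness the lexicon-based analyzers are meant to address.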
Wall Street Journal sub-collection, which consists of 550MB of text and about 180,000 articles for our experiments. The TREC queries, generally called topics, are large and very detailed, and provide a very explicit definition of what it means for a document to be relevant. In fact, with an average of 130 terms per topic, they are comparable in length to the documents in the Wall Street Journal sub-collection (median length = 200 terms). The TREC experiments have frequently been criticized for the length of the topics, since this is in such marked contrast to user behavior when querying the typical commercial retrieval system, where queries of one or two terms are often the norm. While the precision and detail of TREC topics is extremely valuable for making accurate relevance judgements and encouraging researchers to develop complex retrieval strategies, the experimental results may not reflect the tools and techniques that will be most valuable for operational systems. In order to address this issue, we have constructed shorter versions of the topics which attempt to summarize the key components of the query in a few short phrases (average length = 7 words). This follows the lead of Voorhees [14] and Jing [8], who use the description statement as a summary for the full query. In contrast to their approach, we construct the new queries by hand, as it was felt that some of the description statements were lacking important key words. There is certainly an element of subjectivity in this approach, but the queries were constructed without regard to the specific properties of stemming algorithms.
7 Experimental Results
The SMART system2 is used to index the queries and documents separately for each stemming algorithm. Documents are ranked according to their similarity to each query, and the results are evaluated using the IR measures precision (the fraction of retrieved documents that are
2 Available for research purposes via anonymous ftp to ftp.cs.cornell.edu in directory /pub/smart.
Query  Measure       nostem  remove s  Lovins  Porter  Inflect  Deriv
full   APR11         0.348   0.361     0.369   0.367   0.368    0.366
full   AP[5-15]      0.556   0.562     0.557   0.562   0.562    0.555
full   AR[50-150]    0.414   0.423     0.427   0.429   0.428    0.426
short  APR11         0.179   0.198     0.209   0.206   0.201    0.208
short  AP[5-15]      0.313   0.339     0.343   0.356   0.345    0.354
short  AR[50-150]    0.263   0.282     0.288   0.290   0.285    0.288

Table 1: Average evaluation scores by stemming algorithm.
relevant) and recall (the fraction of all relevant documents that are retrieved3). We choose to use three different evaluation measures: average precision at 11 recall points, 0%, 10%, ..., 100% (APR11); average precision at 5-15 documents examined (AP[5-15]); and average recall at 50, 60, ..., 150 documents examined (AR[50-150]). The first measurement is the current evaluation standard in IR experiments and captures a wide range of different performance characteristics. The second measurement is designed to estimate performance for shallow searches and the third is chosen to capture a more in-depth inquiry by the user. The evaluation scores for the six stemming algorithms over 200 TREC queries are presented in Table 1. In general, it appears that all stemmers are slightly better than no stemming at all, but there is little difference between stemmers. This comes as no real surprise and agrees with the conclusions obtained from many previous studies. It is also interesting to note how much sharply cutting query size reduces information retrieval performance. However, the size of the query does not make a large difference in the relative performance of stemming algorithms. The observed differences are relatively small, so it is important to validate them using statistical significance testing. Hypothesis tests make the preliminary assumption that all retrieval strategies are equally effective. The test determines the probability that the observed results could occur by chance (or p-value) given the initial hypothesis. If the p-value is very small, then evidence suggests that the observed effect reflects an underlying difference in performance. Statistical testing is important because the queries used for testing represent only a very small sample from the set of all possible queries. If the number of queries is small relative to the observed difference in evaluation scores, then the experimental results are not generally applicable.
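To make the first measure concrete, here is a small sketch of APR11 for a single query. The interpolation convention used (precision at recall level r is the maximum precision observed at any recall >= r) is the standard one and is assumed here rather than stated in the text:

```python
def interpolated_avg_precision_11pt(ranked_ids, relevant):
    """APR11: average interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_ids: documents in ranked order; relevant: set of relevant doc ids.
    """
    relevant = set(relevant)
    hits, pr = 0, []
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
        pr.append((hits / len(relevant), hits / i))  # (recall, precision)
    points = []
    for k in range(11):
        r = k / 10
        # interpolated precision: best precision at any recall >= r
        best = max((p for (rec, p) in pr if rec >= r), default=0.0)
        points.append(best)
    return sum(points) / 11
```

AP[5-15] and AR[50-150] would be computed analogously by averaging raw precision (or recall) over the stated cut-off depths.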
Our approach to statistical testing is based on the Analysis of Variance (ANOVA), which is useful for detecting differences between three or more experimental methods. The two-way ANOVA assumes that the evaluation scores can be modelled by an additive combination of a query effect and an effect due to the choice of experimental method. When these effects are factored out, it is assumed that the remaining component of the score is random error, and that all such values are drawn from an identical Normal distribution, independent of query or method. It is unlikely that these assumptions will be strictly accurate, but it is hoped that they represent a reasonable approximation. In order to run the ANOVA, the evaluation scores must be computed separately for each query and method, generating a table which is used as input to the statistical algorithm. For mathematical details on statistical testing in information retrieval, see Hull [6] or consult any elementary statistics textbook on experimental design, such as Neter [12]. The ANOVA test operates in two stages. First, a test statistic is computed which indicates whether there is any evidence of a difference between the methods. If the results indicate that there is a difference (in our experiments, we define a significant difference as one which produces a p-value less than .05), then a separate multiple comparisons test is applied to determine which
3 The term all relevant documents is a little misleading, since it refers only to the relevant documents in the subset of the collection that has actually been examined. For TREC queries, this is usually thousands of the most likely documents, so it is unlikely that many relevant documents have been missed.
Query        nostem  remove s  Lovins  Porter  Inflect  Deriv
Q178         0.228   0.170     0.486   0.486   0.191    0.169
Q178 ranks   4       2         5.5     5.5     3        1
Q187         0.127   0.129     0.117   0.124   0.132    0.188
Q187 ranks   3       4         1       2       5        6

Table 2: Example of within-query ranking

contrasts are significant. In this paper, we examine all pairwise differences in average scores. The results are presented in compact form in Table 1, and can be interpreted as shown below. The expression x > y indicates that method x is significantly better than method y according to the appropriate statistical test (ANOVA for the average scores, Friedman for the average ranks). The abbreviations are: N = nostem, R = remove s, L = Lovins, P = Porter, I = inflectional, D = derivational. For example, for APR11 using short queries, the entry is: RLPID > N, which indicates that the remove s, Lovins, Porter, Inflectional, and Derivational stemmers are all significantly better than no stemming. The results from the ANOVA make interpreting Table 1 very simple. As stated previously, all stemmers are better than no stemming, except for AP[5-15] measured on the full query, and there is no significant difference among stemming algorithms. A statistically significant difference is roughly 0.01 for most methods (0.023 for AP[5-15] short). When the query is well defined and the user is only looking at a few documents (AP[5-15] full), stemming provides absolutely no advantage. This helps to explain a bit of the inconsistency in the literature. Lennon [10] evaluates at 5 and 20 documents retrieved and Harman [5] evaluates at 10 and 30 documents retrieved, and they find no difference between stemmers. Harman's average precision-recall scores look very similar to our own. Krovetz [9] reports much larger improvements for stemming in collections where both queries and documents are short, but the other collections show a similar 4-5% improvement in performance. It seems natural to consider stemmers as recall enhancing devices, as they are creating more matches for each query term.
Our results reflect this hypothesis, as stemming is no help at low recall and provides some benefits when performance is averaged over a number of recall levels. Stemming is slightly more helpful for short queries and low recall, as demonstrated by the fact that no stemming leads to a drop in AP[5-15] for short queries but not for long ones. As Krovetz demonstrates, stemming becomes much more valuable when both queries and documents are short, as this is when it is most difficult to find term matches between query and document.
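The within-query ranking illustrated in Table 2 (rank 1 for the worst score, with ties sharing the average of their positions) can be sketched as follows; this is a generic rank transform, assumed to match the paper's convention:

```python
def within_query_ranks(scores):
    """Rank methods within one query: 1 = worst score; ties get average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while scores are equal
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied 1-based rank positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```

Applied to the Q178 row of Table 2, this reproduces the ranks 4, 2, 5.5, 5.5, 3, 1 shown there.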
Q187 has 429! Most of the variability in Q178 is accounted for by the change in ranking of a single relevant document. In this example, it could well be the case that the differences described by Q187 are more reliable. Since in our experiment the number of relevant documents per query ranges from 2 to 591, there is good reason to believe that it might be worthwhile to examine an evaluation measure that normalizes for unequal variance4. Using the ranking strategy described above, the evaluation results can be summarized by taking the average rank over all queries, as presented in Table 3.

Query  Measure       none  rem s  Lov   Port  Inf   Der   sig. test
full   APR11         2.70  3.12   3.83  3.76  3.79  3.81  LPID > R > N
full   AP[5-15]      3.34  3.48   3.66  3.57  3.58  3.37
full   AR[50-150]    2.83  3.29   3.68  3.63  3.79  3.79  LID > R > N, P > N
short  APR11         2.62  3.25   3.70  3.79  3.68  3.96  LPID > R > N
short  AP[5-15]      3.05  3.45   3.55  3.69  3.52  3.76  RLPID > N
short  AR[50-150]    2.63  3.20   3.60  3.91  3.70  3.97  LPID > R > N, D > L

Table 3: Average rank measurements
How does one compare average ranks and average precision-recall scores? It is hard to say, since the measures are on different scales. The disadvantage of ranks is that any sense of absolute performance is lost. The ranks also depend on which methods are being compared. Remove one stemming algorithm from the analysis and the ranks will change, perhaps significantly. However, there are a number of advantages as well. For example, it is now simple to compare between the different evaluation scores and also between short and long queries, since the average ranks of all evaluation measures are on the same scale. Note that average performance is 3.5 for a ranking based on six methods. While the overall pattern is about the same, there are several noticeable differences between the average ranks for the full and short queries.

(1) AP[5-15] nostem - No stemming is proportionally less effective for short queries. We can thus conclude that stemming always improves performance for short queries.

(2) AP[5-15] Deriv - The derivational stemmer works better over the short queries, where it can help to find a match, than on the long queries, where it adds noise to the results.

(3) AR[50-150] Porter - The Porter stemmer appears to work better with the short queries than the long queries at high recall.

Since rank based analysis is only a relative measure, it should not be used as the only means of evaluation, but it provides a nice supplement to the data provided by average precision and recall. The same approach to hypothesis testing used above can also be applied to the ranking data, and this alternative is known in the statistical literature as the Friedman Test. For more mathematical details, see Hull [6] or Conover [3]. The results are presented in Table 3. The Friedman test finds more significant differences than the ANOVA. In particular, it consistently finds that remove s is less effective than the other stemmers except for AP[5-15].
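The Friedman statistic itself is straightforward to compute from the table of per-query ranks (n queries by k methods). This sketch uses the standard textbook formula based on rank sums, without the correction for ties:

```python
def friedman_statistic(rank_rows):
    """Friedman chi-square statistic from per-query rank rows
    (n queries x k methods; no tie correction).

    chi2_F = 12 / (n k (k+1)) * sum_j R_j^2  -  3 n (k+1),
    where R_j is the rank sum of method j over all queries.
    """
    n, k = len(rank_rows), len(rank_rows[0])
    col_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    return (12 / (n * k * (k + 1))) * sum(r * r for r in col_sums) - 3 * n * (k + 1)
```

Under the null hypothesis of no method effect, the statistic is compared to a chi-square distribution with k - 1 degrees of freedom; larger values indicate more consistent rank differences across queries.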
The magnitude of a significant difference in ranking is roughly 0.35. There are some puzzling inconsistencies in the statistical results. For example, the derivational stemmer is significantly better than the Lovins stemmer using AR[50-150] short query rankings, but the actual recall scores of the two methods are equal to three decimal places. We will come back to this result at the end of the next section.
4 This problem is one of the unfortunate side-effects of using a subset of the TREC collection. If we used the full collection, we would have a lot less variation in the number of relevant documents for each query.
9 Statistical Testing
It is worth noting that the Friedman Test appears to be consistently more powerful than the ANOVA, in the sense that it finds more significant differences between methods. This is a bit surprising, since the Friedman Test is a nonparametric test (not model based), and tests from this family are usually less powerful than the ANOVA when its assumptions are reasonable. This suggests that it would be a good idea to check those assumptions. The ANOVA model assumes that the error variance is constant across queries. Given that each query has a widely variable number of relevant documents, there is good reason to expect that this condition may be violated. As we saw in the previous section, the evaluation score of queries with very few relevant documents may change dramatically with the change in rank of a single relevant document. We can examine the data more closely to see how badly the assumption of equal variance is violated. Let us presume that the variance is constant and equal to the error variance obtained from the ANOVA model, which we will denote as σ². It is well known in statistics [1] that:

    (n − 1)s² / σ²  ~  χ²(n − 1)

where s² is the variance of a sample of size n drawn from a normal population with variance σ². We can compute s² for each query and compare the distribution of the test statistic above to the appropriate chi-square reference distribution. In particular, we determine the expected value of the maximum of a sample of 200 observations drawn from the chi-square distribution. If our data is actually drawn from this distribution, we would expect an average of one observation to equal or exceed this value. In practice, we find that between 11-18 observations exceed the expected maximum for each of our experiments. The results above indicate that without question, the assumption of equal variance is violated. We have suggested that queries with few relevant documents will tend to have higher variability than those with many relevant documents. We examined this hypothesis and found that while a few of the queries with highest variability had fewer than 10 relevant documents, there was no consistent pattern. We also tested whether variability in performance was significantly correlated with average performance and also obtained a negative result. From this analysis, we believe that the results from the Friedman Test should be more reliable for our experiments. It also provides evidence that examining average rank data as well as average score data is valuable in determining which experimental methods are most effective. The reader should keep in mind that the statistical tests described here have been chosen for this particular experiment. In general, the optimal test may depend on the number of experimental methods being compared. For instance, both Friedman and ANOVA have different forms when there are only two methods [6], and Conover [3] suggests that there is a more powerful alternative to the Friedman Test for cases when three or four methods are being compared.
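The exceedance check described above can be sketched by Monte Carlo: estimate the expected maximum of 200 draws from the χ² reference distribution, then count how many per-query variance statistics exceed it. The degrees of freedom and replication counts below are illustrative assumptions, not the values used in the experiments:

```python
import random

def chisq_draw(df, rng):
    # one chi-square(df) draw, as a sum of df squared standard normals
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def expected_max_chisq(df, n_obs=200, reps=500, seed=0):
    """Monte Carlo estimate of E[max of n_obs chi-square(df) draws]."""
    rng = random.Random(seed)
    return sum(
        max(chisq_draw(df, rng) for _ in range(n_obs)) for _ in range(reps)
    ) / reps

def count_exceedances(stats, threshold):
    """Count per-query variance statistics above the expected maximum;
    about one exceedance is expected if equal variance holds."""
    return sum(1 for s in stats if s > threshold)
```

Finding 11-18 exceedances where roughly one is expected, as reported above, is strong evidence against the equal-variance assumption.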
Average Scores

Query  Measure       nostem  remove s  Lovins
full   APR11         0.375   0.393     0.402
full   AP[5-15]      0.623   0.640     0.628
full   AR[50-150]    0.377   0.391     0.394
short  APR11         0.208   0.233     0.244
short  AP[5-15]      0.374   0.410     0.412
short  AR[50-150]    0.259   0.284     0.287
Query  Measure       none  rem s  Lov   Port  Inf   Der   sig. test
full   APR11         2.50  3.19   3.82  3.74  3.88  3.88  LPID > R > N
full   AP[5-15]      3.24  3.52   3.50  3.64  3.58  3.51
full   AR[50-150]    2.66  3.32   3.78  3.70  3.84  3.70  RLPID > N, LI > R
short  APR11         2.51  3.24   3.58  3.90  3.85  3.92  PID > R > N, L > N
short  AP[5-15]      2.94  3.44   3.52  3.70  3.69  3.71  RLPID > N
short  AR[50-150]    2.54  3.22   3.40  3.99  3.89  3.97  PID > LR > N

Table 4: Evaluation scores for the important queries

(3) The query is hard and relevant documents tend to have low ranks in general. However, some methods rank a few relevant documents relatively well, which can make a large difference for many evaluation measures.

Clearly, factor (1) describes the type of behavior we are looking for, and these queries deserve to be considered very important. Some people might argue that queries of type (2) and (3) are also important, because they demonstrate real changes in the ranks of relevant documents; however, we believe that this is a dangerous conclusion. If only a few relevant documents are being ranked higher, it is hard to know if this is due to the fact that they are relevant or some reason unrelated to relevance. We would be particularly suspicious of type (3) queries, for one would hope that a method which is valuable would improve the ranking of a reasonably large sample of relevant documents. This does not mean that an effect which improves the rank of a single relevant document is not real and important in some cases, merely that this is a less reliable indicator of value than a method which improves the rank of many relevant documents. Therefore, we would like to evaluate average performance using only queries which satisfy factor (1). This can be accomplished by applying the Friedman Test on a per-query basis, comparing the ranks of individual relevant documents. A query of type (1) will have consistent differences in the ranks of the relevant documents, which can be detected by the Friedman Test.
There are a number of issues to consider when applying this method which will not be discussed here. The reader interested in more details is encouraged to refer to Hull [7]. From this analysis, we obtain 125 full queries and 139 short queries which satisfy our definition of an important query. Average performance scores are computed over this subset and presented in Table 4. Readers are cautioned that these results are not unbiased estimates of the true differences between methods due to the selective nature of the query filtering process. The significant differences increase to roughly 0.015 (0.032 for short AP[5-15]) for average scores and 0.42 for average ranks, due to the fact that fewer queries are being analyzed; however, the results remain roughly the same. This reassures us that the potential problems that we discussed at the beginning of the chapter do not change the results a great deal. We will concentrate on the average rank statistics since they are directly comparable to the rank statistics computed using all queries.
Average Ranks
For the full queries, the results are basically identical, although the average rank of no stemming drops even further relative to the other methods. This is probably a result of removing the queries where stemming is not important and indicates that when stemming is important, not stemming is rarely the right decision. For the short queries, there is another interesting pattern. The Lovins stemmer appears to be less effective while the Porter and Inflectional stemmers are proportionally more effective according to average rank. However, this pattern is not duplicated in the average scores, where the Lovins stemmer is fully competitive with the other stemmers. A closer look at the scores reveals that the Lovins stemmer performs slightly worse for a lot of queries and much better for a few queries, which explains the lower score with respect to the rank based measure. One possible explanation is that Lovins stems more heavily than all the other stemmers. This means that it conflates a lot more terms, which reduces performance slightly on a lot of queries by adding extra noise. However, occasionally an additional important term is recognized, which helps performance a lot. Therefore, while average performance is the same, choosing the Lovins stemmer may degrade performance slightly for a lot of queries in exchange for helping more significantly with a few queries. Note that this hypothesis has not been verified explicitly and applies only to the short queries when evaluated using APR11 or AR[50-150].
11 Detailed Analysis
While we have learned a lot from our extensive analysis of the average performance figures, it has not really helped us find out exactly why one stemming algorithm works better than another. If we wish to improve the current technology in stemming algorithms, it is important to see specific examples where the type of stemming makes a large difference in performance. Therefore, we will take a detailed look at a number of individual queries and figure out which word stems account for differences in performance. For this section of the analysis, we rely only on the short queries, for it will be far easier to identify the important word stems in these examples. At this point, it is very clear that no stemming and remove s are not really competitive with the other stemming algorithms, so they will not be considered further. Among the others, we have found little evidence which suggests how to choose between them, other than a hint that the Lovins stemmer may have slightly different behavior. For the detailed query analysis, we start with only those queries which were judged important based on a per-query Friedman Test, as described in the previous section. We then compute the error variability of the queries and rank them in decreasing order of variability, just as we did when testing the equal-variance assumption for ANOVA, except now we are working only with four stemming algorithms. We compute the chi-square score for each query and find that 11 exceed the expected maximum score of 12.07 for a sample of 139 queries with equal variance. The top 10 of these are presented in Table 5. The queries shown above can be divided into 6 categories according to their behavior:

(1) Lovins, Deriv > Inflect, Porter - 21, 70, 82, 143
(2) Inflect, Porter > Lovins, Deriv - 39, 147
(3) Deriv > others - 86
(4) Deriv < others - 182
(5) Porter > others - 128
(6) Lovins, Porter > Inflect, Deriv - 98

We examine each of these categories in more detail.
score  Lovins  Porter  Inflect  Deriv
61.37  0.216   0.225   0.230    0.546
57.27  0.092   0.343   0.365    0.094
53.5   0.596   0.302   0.278    0.484
39.23  0.438   0.282   0.151    0.406
26.11  0.199   0.101   0.007    0.249
24.87  0.246   0.047   0.251    0.255
24.42  0.305   0.341   0.262    0.115
24.37  0.501   0.507   0.326    0.331
20.27  0.032   0.186   0.187    0.034
15.9   0.635   0.496   0.495    0.647

Table 5: Revised evaluation scores
(1) Lovins, Deriv > Inflect, Porter
The query talks about research in superconductivity while many of the documents refer only to superconductors. Only the Derivational and the Lovins stemmers make this match. Note that the Derivational stemmer overstems by removing the prefix and so loses a bit in performance for this reason.
The Derivational and Lovins stemmers relate genetic and genetically. The Inflectional stemmer lost out even more because engineering did not get conflated to engineer.
The Inflectional and Porter stemmers did not recognize that farmer and farming were related. The Inflectional stemmer did not conflate farming and farm and thus was completely ineffective. The behavior of the inflectional stemmer with respect to farming and engineering deserves a more complete explanation. Both words can be either a noun or a verb. According to inflectional rules, the noun form should not be stemmed while the verb form should. We do not apply a part-of-speech tagger to the text before stemming and are therefore forced to make an arbitrary decision. Our decision not to stem engineering and farming turns out to be a poor one for IR performance, although it may be the right one from the linguistic perspective in the absence of a part-of-speech tag.
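The arbitrary decision described above can be made concrete with a toy inflectional rule; the word list and length cut-off are hypothetical illustrations, not the Xerox lexicon:

```python
# Hypothetical illustration of the noun/verb ambiguity for -ing forms.
# Words that can be nouns are left unstemmed (the linguistically safe,
# but IR-unfriendly, default discussed in the text).
CAN_BE_NOUN = {"farming", "engineering", "building"}

def inflectional_stem(word: str) -> str:
    """Stem -ing verb forms, but leave possible noun forms untouched."""
    if word.endswith("ing") and word not in CAN_BE_NOUN and len(word) > 5:
        return word[:-3]
    return word
```

With a part-of-speech tagger, the decision could instead be made per occurrence rather than per word type.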
(2) Inflect, Porter > Lovins, Deriv
The Derivational and the Lovins stemmers equate server with serve. This is a very bad decision, since serve is a common term used in a number of contexts and server has a much more specific meaning, particularly in the domain of computers and technology.
(3) Deriv > others
The Derivational stemmer converts failure to fail. The other stemmers do not make this connection. This is an example where linguistic knowledge can recognize a special case which is missed by suffixing rules.
(4) Deriv < others
(5) Porter > others
Since both stemmers analyze on the basis of linguistic rules, a word must be in the lexicon in order to be recognized. We make the decision not to stem any word which does not appear in the lexicon. While this works well for the most part, since most unrecognized words are proper names, there is a distressingly large list of exceptions. This is a particular problem when using the Wall Street Journal, because a large amount of financial and economic terminology is not included in our general purpose lexicon. Many of these would be important when used as query terms. It might be a good idea to construct a guesser which applies a rule-based algorithm when the word does not appear in the lexicon, because at the moment the linguistic stemmers do not even conflate plurals of unidentified nouns. We have implemented one special rule to remove possessives ('s) since this is obviously an important factor for proper names and the rule is extremely unlikely to make any errors.
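The fallback just described, possessive removal plus a rule-based guesser for out-of-lexicon words, might be sketched like this; the tiny lexicon, the guesser rules, and the function names are all illustrative assumptions, not the Xerox implementation:

```python
# Hypothetical sketch: lexicon-based stemming with a rule-based fallback
# ("guesser") for words missing from the lexicon. The lexicon entries and
# the naive plural rule are toy assumptions.
LEXICON = {"serve": "serve", "servers": "server", "failures": "failure"}

def guess_stem(word: str) -> str:
    """Fallback rules for out-of-lexicon words."""
    if word.endswith("'s"):          # possessive removal: safe for proper names
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]             # naive plural removal
    return word

def stem(word: str) -> str:
    w = word.lower()
    return LEXICON.get(w, guess_stem(w))
```

A domain-specific lexicon (e.g. financial terminology for the Wall Street Journal) would shrink the set of words that ever reach the guesser.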
(6) Lovins, Porter > Inflect, Deriv
The Inflectional and Derivational stemmers do not stem optics to optic. For the Derivational stemmer, this is a linguistically motivated decision which turns out to be very unfortunate for information retrieval.

The detailed query analysis has been very informative. We have recognized that there are many examples where the suffix-removal rules do not produce ideal behavior. We have also discovered that the Inflectional and Derivational stemmers make a number of decisions for linguistic reasons that are not correct for information retrieval. Also, the two linguistic stemmers could probably be improved by adding domain-specific lexicons or by building a guesser for unknown terms.
modified the suffix of the word, in order to see whether this would make a significant difference in retrieval performance. Table 6 compares performance to the original version.

Query  Measure       none  rem s  Lov   Port  Inf   Der   sig. test
full   APR11         2.68  3.10   3.81  3.75  3.77  3.90  LPID > R > N
full   AP[5-15]      3.32  3.47   3.63  3.54  3.56  3.48
full   AR[50-150]    2.81  3.25   3.69  3.62  3.78  3.87  LPID > R > N
short  APR11         2.63  3.25   3.68  3.75  3.68  4.01  LPID > R > N
short  AP[5-15]      3.03  3.47   3.55  3.69  3.52  3.75  RLPID > N
short  AR[50-150]    2.63  3.20   3.56  3.90  3.70  4.01  LPID > R > N, D > L

Table 7: Ranks with revised Derivational stemmer

The average scores change very little, but all the differences are in the right direction, so it seems to be worthwhile to use the modifications. To further examine this question, we replace the old Derivational stemmer with the new one and recompute the average ranks and the Friedman Test, as shown in Table 7. Both Table 6 and Table 7 are computed using the full query set. Although the revised derivational stemmer has slightly higher ranks, there is no significant change in the results. It might be worthwhile to look at only the important queries, but we did not run that experiment in time for this paper.
13 Conclusions
After an exhaustive analysis, we have reached the following conclusions concerning stemming algorithms.

(1) Some form of stemming is almost always beneficial. The only case where there is not overwhelming evidence in favor of stemming is when the full TREC queries are used and very few documents are examined. The average absolute improvement due to stemming is small, ranging from 1-3%, but it makes a large difference for many individual queries. There are probably particular queries where stemming hurts performance, but we did not investigate this issue in detail.

(2) Simple plural removal is less effective than more complex algorithms in most cases. However, when only a small number of documents are being examined, plural removal is very competitive with the other algorithms.

(3) There is no difference between the other stemmers in terms of average performance. For short queries, the Lovins stemmer tends to perform slightly worse for many queries but a lot better for a few others. This may be a result of overstemming, although we have not produced any concrete evidence for this hypothesis.

(4) The detailed query analysis demonstrates that the same suffix-removal rule can be beneficial in some cases and harmful in others. This suggests that rule-based suffix removal may not be the ideal approach to stemming in the long run.

(5) There are a number of problems with using linguistic approaches to word analysis directly as stemming algorithms. First, these methods are based on a lexicon and cannot correctly stem words which are not contained in the lexicon. Second, many decisions about the root form of a word which are properly motivated from the linguistic perspective are not optimal for information retrieval performance. It is clear that linguistic analysis tools must be tailored for information retrieval applications.
(6) In particular, prefix removal seems to be a particularly bad idea for a stemming algorithm, and the detailed analysis provides a number of specific examples of this problem. Modifying the linguistic tools provides only a small benefit in terms of average performance, but the difference is consistent over all evaluation measures. While the linguistically based stemming is not significantly better than current algorithms in terms of average performance, the detailed query analysis reveals a number of different ways in which the linguistic tools could be improved and optimized for information retrieval. With such modifications, and their inherent advantages over suffix-removal algorithms (stems are real words), linguistic tools based on morphological analysis should work very successfully as stemming algorithms.
Acknowledgements We are grateful to Donna Harman and the analysts from the TIPSTER project for providing such an outstanding resource which has really helped to advance the field of IR. We would also like to thank Chris Buckley for writing and supporting the SMART text retrieval system and making it available for research use. This is a real service to the information retrieval community.
References
[1] G.E.P. Box, W.G. Hunter, and J.S. Hunter. Statistics for Experimenters, pages 118-119. John Wiley and Sons, 1978.
[2] Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University, 1985.
[3] W.J. Conover. Practical Nonparametric Statistics. John Wiley and Sons, 2nd edition, 1980.
[4] W.B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[5] Donna Harman. How effective is suffixing? Journal of the American Society for Information Science, 42(1):321-331, 1991.
[6] David Hull. Using statistical testing in the evaluation of retrieval performance. In Proc. of the 16th ACM/SIGIR Conference, pages 329-338, 1993.
[7] David A. Hull. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70-84, 1996.
[8] Yufeng Jing and W. Bruce Croft. An association thesaurus for information retrieval. In Proc. of Intelligent Multimedia Retrieval Systems and Management Conference (RIAO), pages 146-160, 1994.
[9] Robert Krovetz. Viewing morphology as an inference process. In Proc. of the 16th ACM/SIGIR Conference, pages 191-202, 1993.
[10] M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3:177-183, 1981.
[11] Janet Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31, 1968.
[12] J. Neter, W. Wasserman, and M. Kutner. Applied Linear Statistical Models. R.D. Irwin, 2nd edition, 1985.
[13] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[14] Ellen M. Voorhees. Query expansion using lexical-semantic relations. In Proc. of the 17th ACM/SIGIR Conference, pages 61-69, 1994.
[15] Xerox Corp. Xerox Linguistic Database Reference, English version 1.1.4 edition, December 1994.