
A Detailed Analysis of English Stemming Algorithms

David A. Hull Gregory Grefenstette


Rank Xerox Research Centre, 6 chemin de Maupertuis, 38240 Meylan, France
{hull,grefen}@xerox.fr
January 31, 1996
Abstract

We present a study comparing the performance of traditional stemming algorithms based on suffix removal to linguistic methods performing morphological analysis. The results indicate that most conflation algorithms perform about 5% better than no stemming, and there is little difference between methods in terms of average performance. However, a detailed analysis of individual queries indicates that performance on this level is often highly sensitive to the choice of stemming technique. From this analysis, we can suggest a number of different ways to modify linguistic approaches so that they will be better suited to the stemming problem.

1 Introduction
In information retrieval (IR), the relationship between a query and a document is determined primarily by the number and frequency of terms which they have in common. Unfortunately, words have many morphological variants which will not be recognized by term-matching algorithms without additional text processing. In most cases, these variants have similar semantic interpretations and can be treated as equivalent for information retrieval (as opposed to linguistic) applications. Therefore, stemming or conflation algorithms have been created for IR systems which reduce these variants to a root form.

The linguistics groups at Xerox^1 have developed a number of linguistic tools for English which can be used in information retrieval. In particular, they have produced an English lexical database which provides a morphological analysis of any word in the lexicon and identifies the base form. There is good reason to expect that this technology would be ideally suited for use as a stemming algorithm. However, this assumption needs to be tested by conducting experiments using IR test collections.

In this paper, we present a detailed analysis of the impact of stemming algorithms on performance in information retrieval. We compare traditional approaches based on suffix removal to linguistic methods based on the Xerox morphological tools. We provide a detailed analysis which identifies specific examples of when the different methods succeed or fail. On average, there is not a lot of difference between stemming algorithms, but for specific queries, the choice of conflation strategy can have a large impact on performance.
^1 Natural Language Theory and Technology (NLTT) at the Xerox Palo Alto Research Center and Multi-Lingual Theory and Technology (MLTT) at the Rank Xerox Research Centre in Grenoble, France.

2 Background - The Stemming Problem


The problem of conflation has been approached with a wide variety of different methods, as detailed in Lennon [10], including suffix removal, strict truncation of character strings, word segmentation, letter bigrams, and linguistic morphology. Two of the most popular algorithms in information retrieval, the Lovins stemmer [11] and the Porter stemmer [13], are based on suffix removal. Lovins finds the longest match from a large list of endings, while Porter uses an iterative algorithm with a smaller number of suffixes and a few context-sensitive recoding rules.

Krovetz [9] accurately describes the problems associated with these methods. Most stemmers operate without a lexicon and thus ignore word meaning, which leads to a number of stemming errors. Words with different meanings are conflated to the same stem, and words with similar meanings are not conflated at all. For example, the Porter stemmer conflates general, generous, generation, and generic to the same root, while related pairs like recognize and recognition are not conflated.

In addition, stems produced by suffix removal algorithms are often not words, which makes it difficult to use them for any purpose other than information retrieval. Interactive techniques which require user input, such as term selection for query expansion, will suffer greatly if the user must work with stems instead of real words. It also becomes difficult to perform dictionary look-up without real words. While the entries in the dictionary could themselves be stemmed, there would not be a one-to-one correspondence between word stems and word definitions. Dictionary look-up is an important feature in many IR applications. For example, in multilingual information retrieval, the query may be subject to automatic translation, in which case the system must be able to find each query term in a transfer dictionary. The problems cited in this section can easily be overcome by an approach to stemming which uses morphological analysis.
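To make the over- and under-conflation concrete, the sketch below runs the examples above through NLTK's implementation of the Porter stemmer. This is an assumption for illustration only: NLTK's stemmer is a reimplementation and is not necessarily identical to the version used in our experiments.

```python
# Sketch: over- and under-conflation by a rule-based suffix stripper.
# Assumes the nltk package is installed; NLTK's Porter stemmer is a
# reimplementation and may differ slightly from the 1980 original.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words with different meanings that collapse to one stem (over-conflation).
for word in ["general", "generous", "generation", "generic"]:
    print(word, "->", stemmer.stem(word))

# Related words that are NOT conflated (under-conflation).
for word in ["recognize", "recognition"]:
    print(word, "->", stemmer.stem(word))
```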

3 Xerox Lexical Technology


Morphology is a branch of linguistics that studies and describes how words are formed in language, and includes inflection, derivation, and compounding. Inflection characterizes the changes in word form that accompany case, gender, number, tense, person, mood, or voice. Derivational analysis reduces surface forms to the base form from which they were derived, and includes changes in the part of speech.

Xerox linguists have developed a lexical database for English (as well as other languages) which can analyze and generate inflectional and derivational morphology. The inflectional database reduces each surface word to the form which can be found in the dictionary, as follows [15]:

  nouns      -> singular       (ex. children -> child)
  verbs      -> infinitive     (ex. understood -> understand)
  adjectives -> positive form  (ex. best -> good)
  pronouns   -> nominative     (ex. whom -> who)

The derivational database reduces surface forms to stems which are related to the original in both form and semantics. For example, government stems to govern, while department is not reduced to depart since the two forms have different meanings. All stems are valid English terms, and irregular forms are handled correctly. The derivational process uses both suffix and prefix removal, unlike most conventional stemming algorithms, which rely solely on suffix removal. A sample of the suffixes and prefixes that are removed is given below [15]:

suffixes: ly, ness, ion, ize, ant, ent, ic, al, ic, ical, able, ance, ary, ate, ce, y, dom, ee, eer, ence, ency, ery, ess, ful, hood, ible, icity, ify, ing, ish, ism, ist, istic, ity, ive, less, let, like, ment, ory, ous, ty, ship, some, ure

prefixes: anti, bi, co, contra, counter, de, di, dis, en, extra, in, inter, intra, micro, mid, mini, multi, non, over, para, poly, post, pre, pro, re, semi, sub, super, supra, sur, trans, tri, ultra, un

The databases are constructed using finite-state transducers, which promotes very efficient storage and access. This technology also allows the conflation process to act in reverse, generating all conceivable surface forms from a single base form. The database starts with a lexicon of about 77 thousand base forms from which it can generate roughly half a million surface forms [15]. The Derivational Analyzer is currently being used in the Visual Recall information access and retrieval product available from XSoft.
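The Xerox lexical databases themselves are not freely available, but the kind of lexicon-backed reduction described above can be illustrated with a stand-in. The sketch below uses NLTK's WordNet lemmatizer as a rough analogue of the inflectional database; this is an assumption for illustration only, and its coverage and behavior differ from the Xerox tools.

```python
# Sketch: lexicon-based reduction to dictionary form, using NLTK's WordNet
# lemmatizer as a stand-in for the Xerox inflectional database (illustrative
# only; behavior and coverage differ from the actual Xerox tools).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexicon data required by the lemmatizer
lemmatizer = WordNetLemmatizer()

# Irregular forms are handled via the lexicon, not via suffix rules.
print(lemmatizer.lemmatize("children", pos="n"))     # -> child
print(lemmatizer.lemmatize("understood", pos="v"))   # -> understand
print(lemmatizer.lemmatize("governments", pos="n"))  # -> government
```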

4 Description of previous work


There have been a large number of studies which have examined the impact of stemming algorithms on information retrieval performance. Frakes [4] provides a nice summary, reporting that the combined results of previous studies make it unclear whether any stemming is helpful. In those cases where stemming is beneficial, it tends to have only a small impact on performance, and the choice of stemmer among the most common variants is not important. However, there is no evidence that a reasonable stemmer will hurt retrieval performance.

In contrast, a recent study by Krovetz [9] reports an increase of 15-35% in retrieval performance when stemming is used on some collections (CACM and NPL). Krovetz mentions that these collections have both queries and documents which are extremely short, so the results make sense given that the likelihood of an exact match of surface forms should decrease with the size of the document. For collections with longer documents (TIME and WEST), stemming algorithms produce a much more modest improvement in retrieval performance.

Krovetz [9] also develops stemmers based on inflectional and derivational morphology. He finds that these approaches provide about the same improvements as were obtained by the Porter algorithm. However, he notes that the derivational stemmer performed slightly better than the inflectional stemmer over all four collections that he examined, and that the majority of its benefit came at low recall levels (i.e., when relatively few documents are examined). We should mention that Krovetz's inflectional and derivational stemmers are not equivalent to the Xerox linguistic tools. In particular, he experiments with a smaller collection of suffixes for derivational analysis, and it does not appear that he examines prefixes at all (perhaps wisely, as we shall discover). Therefore, his results are not directly comparable to the ones presented in this paper.

5 The Experimental Collection


Our retrieval experiments will use the document collection constructed for the TREC/TIPSTER project. The TREC collection consists of over a million documents (3.3 gigabytes of text) obtained from a variety of sources, including newspapers, computer journals, and government reports. As part of the exhaustive evaluation studies conducted during the TIPSTER project, analysts have constructed 200 queries and evaluated thousands of documents for relevance with respect to these queries.

The collection is so large that we have decided to select a subset of the queries and/or documents to use in our experiments. Since experiments using stemmers require that the entire document collection be reindexed for each algorithm, storing many different indexed versions of the full collection is not really feasible. Therefore, we have decided to use the Wall Street Journal sub-collection, which consists of 550 MB of text and about 180,000 articles, for our experiments.

The TREC queries, generally called topics, are large and very detailed, and provide a very explicit definition of what it means for a document to be relevant. In fact, with an average of 130 terms per topic, they are comparable in length to the documents in the Wall Street Journal sub-collection (median length = 200 terms). The TREC experiments have frequently been criticized for the length of the topics, since this is in such marked contrast to user behavior when querying the typical commercial retrieval system, where queries of one or two terms are often the norm. While the precision and detail of TREC topics is extremely valuable for making accurate relevance judgements and encouraging researchers to develop complex retrieval strategies, the experimental results may not reflect the tools and techniques that will be most valuable for operational systems.

In order to address this issue, we have constructed shorter versions of the topics which attempt to summarize the key components of the query in a few short phrases (average length = 7 words). This follows the lead of Voorhees [14] and Jing [8], who use the description statement as a summary for the full query. In contrast to their approach, we construct the new queries by hand, as it was felt that some of the description statements were lacking important key words. There is certainly an element of subjectivity in this approach, but the queries were constructed without regard to the specific properties of stemming algorithms.

6 The retrieval system and stemming experiments


We used the SMART^2 text retrieval system developed at Cornell University [2] for the information retrieval experiments. We found its high indexing speed (500 MB/hr on a Sparc 20) to be extremely valuable for the repeated indexing of the collection that is necessary in stemming experiments. Very frequent terms are removed using a slightly modified version of the SMART stop list, and queries and documents are treated as vectors of term frequencies and analyzed using the vector space model.

Term weighting is used to improve performance. Term frequencies in both queries and documents are dampened by applying the square-root transformation. Document vectors are normalized to have unit length, and query term weights are multiplied by the traditional IDF measure (inverse document frequency = log N/n_i, where N = number of documents in the collection and n_i = number of documents containing term i) to increase the importance of rare terms. No proper name or phrase recognition is used, although these might well be beneficial. Also, no attempt is made to segment long documents, which fortunately are rare in the Wall Street Journal sub-collection.

We will examine the performance of five different stemming algorithms and compare them to a baseline which consists of no stemming at all. Two stemmers are included with the SMART system: a simple algorithm which removes s's from the end of the word, and an extensively modified version of the Lovins algorithm [11]. In addition, we will study the Porter stemmer [13], and versions of the Xerox English inflectional and derivational analyzers [15] slightly modified for the conflation problem.

^2 Available for research purposes via anonymous ftp to ftp.cs.cornell.edu in directory /pub/smart.
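As a minimal sketch of the weighting scheme described above (our own illustrative reimplementation rather than SMART itself, with a hypothetical toy corpus), square-root-dampened term frequencies, unit-length document vectors, and IDF-weighted query terms combine as follows:

```python
# Sketch of the weighting scheme described above: sqrt-dampened term
# frequencies, unit-length document vectors, and IDF-weighted query terms.
# Illustrative only, not the SMART system; the tiny corpus is hypothetical.
import math
from collections import Counter

docs = [
    "farm subsidies and government protection of farming".split(),
    "surrogate mothers and surrogate motherhood".split(),
    "bank failures in the wall street journal".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequencies

def doc_vector(tokens):
    tf = Counter(tokens)
    vec = {t: math.sqrt(f) for t, f in tf.items()}        # dampen tf
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}          # unit length

def query_vector(tokens):
    tf = Counter(tokens)
    return {t: math.sqrt(f) * math.log(N / df[t])         # sqrt(tf) * IDF
            for t, f in tf.items() if t in df}

def score(query, doc):
    return sum(w * doc.get(t, 0.0) for t, w in query.items())

q = query_vector("government protection of farming".split())
for i, d in enumerate(docs):
    print(i, round(score(q, doc_vector(d)), 3))
```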

7 Experimental Results
The SMART system is used to index the queries and documents separately for each stemming algorithm. Documents are ranked according to their similarity to each query, and the results are evaluated using the IR measures precision (the fraction of retrieved documents that are relevant) and recall (the fraction of all relevant documents that are retrieved^3). We choose to use three different evaluation measures: average precision at 11 recall points (0, 10%, ..., 100%) (APR11), average precision at 5-15 documents examined (AP[5-15]), and average recall at 50, 60, ..., 150 documents examined (AR[50-150]). The first measurement is the current evaluation standard in IR experiments and captures a wide range of different performance characteristics. The second measurement is designed to estimate performance for shallow searches, and the third is chosen to capture a more in-depth inquiry by the user.

^3 The term "all relevant documents" is a little misleading, since it refers only to the relevant documents in the subset of the collection that has actually been examined. For TREC queries, this is usually thousands of the most likely documents, so it is unlikely that many relevant documents have been missed.

Query   Measure       nostem  remove s  Lovins  Porter  Inflect  Deriv   sig. test
full    APR11         0.348   0.361     0.369   0.367   0.368    0.366   RLPID > N
full    AP[5-15]      0.556   0.562     0.557   0.562   0.562    0.555
full    AR[50-150]    0.414   0.423     0.427   0.429   0.428    0.426   LPID > N
short   APR11         0.179   0.198     0.209   0.206   0.201    0.208   RLPID > N
short   AP[5-15]      0.313   0.339     0.343   0.356   0.345    0.354   RLPID > N
short   AR[50-150]    0.263   0.282     0.288   0.290   0.285    0.288   RLPID > N

Table 1: Average evaluation scores by stemming algorithm.

The evaluation scores for the six stemming algorithms over 200 TREC queries are presented in Table 1. In general, it appears that all stemmers are slightly better than no stemming at all, but there is little difference between stemmers. This comes as no real surprise and agrees with the conclusions obtained from many previous studies. It is also interesting to note how much sharply cutting query size reduces information retrieval performance. However, the size of the query does not make a large difference in the relative performance of the stemming algorithms.

The observed differences are relatively small, so it is important to validate them using statistical significance testing. Hypothesis tests make the preliminary assumption that all retrieval strategies are equally effective. The test determines the probability (or p-value) that the observed results could occur by chance given the initial hypothesis. If the p-value is very small, then the evidence suggests that the observed effect reflects an underlying difference in performance. Statistical testing is important because the queries used for testing represent only a very small sample from the set of all possible queries. If the number of queries is small relative to the observed difference in evaluation scores, then the experimental results are not generally applicable.

Our approach to statistical testing is based on the Analysis of Variance (ANOVA), which is useful for detecting differences between three or more experimental methods. The two-way ANOVA assumes that the evaluation scores can be modelled by an additive combination of a query effect and an effect due to the choice of experimental method. When these effects are factored out, it is assumed that the remaining component of the score is random error, and that all such values are drawn from an identical Normal distribution, independent of query or method. It is unlikely that these assumptions will be strictly accurate, but it is hoped that they represent a reasonable approximation. In order to run the ANOVA, the evaluation scores must be computed separately for each query and method, generating a table which is used as input to the statistical algorithm. For mathematical details on statistical testing in information retrieval, see Hull [6] or consult any elementary statistics textbook on experimental design, such as Neter [12].
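This two-way layout can be analyzed with standard statistical software. The sketch below is one way to run it in Python with pandas and statsmodels; the small scores table is hypothetical, and this is not the software used in the original experiments.

```python
# Sketch: two-way ANOVA on a queries x methods table of evaluation scores.
# Hypothetical data layout; not the software used for the original experiments.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Long-format table: one row per (query, method) with its evaluation score.
scores = pd.DataFrame({
    "query":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q3", "Q3", "Q3"],
    "method": ["nostem", "porter", "deriv"] * 3,
    "score":  [0.21, 0.25, 0.24, 0.40, 0.43, 0.41, 0.10, 0.12, 0.15],
})

# Additive model: score = query effect + method effect + error.
model = ols("score ~ C(query) + C(method)", data=scores).fit()
print(sm.stats.anova_lm(model, typ=2))  # F statistic and p-value per factor
```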

The ANOVA test operates in two stages. First, a test statistic is computed which indicates whether there is any evidence of a difference between the methods. If the results indicate that there is a difference (in our experiments, we define a significant difference as one which produces a p-value less than .05), then a separate multiple comparisons test is applied to determine which contrasts are significant. In this paper, we examine all pairwise differences in average scores. The results are presented in compact form in Table 1 and can be interpreted as follows. The expression x > y indicates that method x is significantly better than method y according to the appropriate statistical test (ANOVA for the average scores, Friedman for the average ranks). The abbreviations are: N = nostem, R = remove s, L = Lovins, P = Porter, I = inflectional, D = derivational. For example, for APR11 using short queries, the entry is RLPID > N, which indicates that the remove s, Lovins, Porter, Inflectional, and Derivational stemmers are all significantly better than no stemming.

The results from the ANOVA make interpreting Table 1 very simple. As stated previously, all stemmers are better than no stemming, except for AP[5-15] measured on the full query, and there is no significant difference among stemming algorithms. A statistically significant difference is roughly 0.01 for most methods (0.023 for AP[5-15] short). When the query is well defined and the user is only looking at a few documents (AP[5-15] full), stemming provides absolutely no advantage. This helps to explain a bit of the inconsistency in the literature. Lennon [10] evaluates at 5 and 20 documents retrieved and Harman [5] evaluates at 10 and 30 documents retrieved, and they find no difference between stemmers. Harman's average precision-recall scores look very similar to our own. Krovetz [9] reports much larger improvements for stemming in collections where both queries and documents are short, but the other collections show a similar 4-5% improvement in performance.

It seems natural to consider stemmers as recall-enhancing devices, as they create more matches for each query term. Our results reflect this hypothesis, as stemming is no help at low recall and provides some benefits when performance is averaged over a number of recall levels. Stemming is slightly more helpful for short queries and low recall, as demonstrated by the fact that no stemming leads to a drop in AP[5-15] for short queries but not for long ones. As Krovetz demonstrates, stemming becomes much more valuable when both queries and documents are short, as this is when it is most difficult to find term matches between query and document.

Query   nostem  remove s  Lovins  Porter  Inflect  Deriv
Q178    0.228   0.170     0.486   0.486   0.191    0.169
ranks   4       2         5.5     5.5     3        1
Q187    0.127   0.129     0.117   0.124   0.132    0.188
ranks   3       4         1       2       5        6

Table 2: Example of within-query ranking.

8 Evaluation based on Average Ranks


Unfortunately, it sometimes happens that average measurements do not adequately describe overall performance. For example, perhaps there exist one or a few queries which have a huge variability between methods, while for the others the difference is much smaller. The average performance figures will be dominated by these queries and completely insensitive to other patterns in the data. For this reason, we present an alternative evaluation measure that is based on the ranking of scores between methods for each query.

Table 2 presents the APR11 scores for two of the short queries. Clearly, query Q178 has more variability than query Q187. We can normalize for unequal variance by replacing the score for each method by its ranking with respect to the scores of the other methods for the same query. Most people would naturally wonder if this is desirable. Surely Q178 is demonstrating a much more important difference than Q187, which is completely lost when the scores are ranked. The answer to this is maybe, but maybe not. In this example, Q178 has only 3 relevant documents while Q187 has 429! Most of the variability in Q178 is accounted for by the change in ranking of a single relevant document. In this example, it could well be the case that the differences described by Q187 are more reliable. Since in our experiment the number of relevant documents per query ranges from 2 to 591, there is good reason to believe that it might be worthwhile to examine an evaluation measure that normalizes for unequal variance^4.

Using the ranking strategy described above, the evaluation results can be summarized by taking the average rank over all queries, as presented in Table 3.

Query   Measure       none    rem s   Lov     Port    Inflect  Deriv   sig. test
full    APR11         2.70    3.12    3.83    3.76    3.79     3.81    LPID > R > N
full    AP[5-15]      3.34    3.48    3.66    3.57    3.58     3.37
full    AR[50-150]    2.83    3.29    3.68    3.63    3.79     3.79    LID > R > N, P > N
short   APR11         2.62    3.25    3.70    3.79    3.68     3.96    LPID > R > N
short   AP[5-15]      3.05    3.45    3.55    3.69    3.52     3.76    RLPID > N
short   AR[50-150]    2.63    3.20    3.60    3.91    3.70     3.97    LPID > R > N, D > L

Table 3: Average rank measurements.
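The within-query ranking used in Tables 2 and 3 can be computed directly. A minimal sketch (using scipy, with the Q178/Q187 scores from Table 2) is shown below.

```python
# Sketch: convert per-query scores to within-query ranks (ties averaged),
# then average the ranks over queries, as in Tables 2 and 3.
import numpy as np
from scipy.stats import rankdata

methods = ["nostem", "remove s", "Lovins", "Porter", "Inflect", "Deriv"]
apr11 = np.array([
    [0.228, 0.170, 0.486, 0.486, 0.191, 0.169],  # Q178
    [0.127, 0.129, 0.117, 0.124, 0.132, 0.188],  # Q187
])

ranks = np.vstack([rankdata(row) for row in apr11])  # higher score -> higher rank
print(ranks)                # Q178: [4. 2. 5.5 5.5 3. 1.], Q187: [3. 4. 1. 2. 5. 6.]
print(ranks.mean(axis=0))   # average rank per method over these two queries
```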

How does one compare average ranks and average precision-recall scores? It is hard to say, since the measures are on different scales. The disadvantage of ranks is that any sense of absolute performance is lost. The ranks also depend on which methods are being compared: remove one stemming algorithm from the analysis and the ranks will change, perhaps significantly. However, there are a number of advantages as well. For example, it is now simple to compare between the different evaluation scores and also between short and long queries, since the average ranks of all evaluation measures are on the same scale. Note that average performance is 3.5 for a ranking based on six methods.

While the overall pattern is about the same, there are several noticeable differences between the average ranks for the full and short queries.

(1) AP[5-15] nostem - No stemming is proportionally less effective for short queries. We can thus conclude that stemming always improves performance for short queries.

(2) AP[5-15] Deriv - The derivational stemmer works better over the short queries, where it can help to find a match, than on the long queries, where it adds noise to the results.

(3) AR[50-150] Porter - The Porter stemmer appears to work better with the short queries than the long queries at high recall.

Since rank-based analysis is only a relative measure, it should not be used as the only means of evaluation, but it provides a nice supplement to the data provided by average precision and recall. The same approach to hypothesis testing used above can also be applied to the ranking data, and this alternative is known in the statistical literature as the Friedman Test. For more mathematical details, see Hull [6] or Conover [3]. The results are presented in Table 3. The Friedman test finds more significant differences than the ANOVA. In particular, it consistently finds that remove s is less effective than the other stemmers except for AP[5-15]. The magnitude of a significant difference in ranking is roughly 0.35. There are some puzzling inconsistencies in the statistical results. For example, the derivational stemmer is significantly better than the Lovins stemmer using AR[50-150] short query rankings, but the actual recall scores of the two methods are equal to three decimal places. We will come back to this result at the end of the next section.
^4 This problem is one of the unfortunate side-effects of using a subset of the TREC collection. If we used the full collection, we would have a lot less variation in the number of relevant documents for each query.
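The Friedman test on a per-query score table is also available in standard libraries. A minimal sketch is given below; the small score table mixes the two queries from Table 2 with two hypothetical rows, and scipy is only a stand-in for whatever software one prefers.

```python
# Sketch: Friedman test over a queries x methods table of evaluation scores.
# Rows 1-2 echo Table 2; the last two rows are hypothetical.
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.array([
    # nostem  remove_s  Lovins  Porter  Inflect  Deriv
    [0.228,   0.170,    0.486,  0.486,  0.191,   0.169],
    [0.127,   0.129,    0.117,  0.124,  0.132,   0.188],
    [0.350,   0.360,    0.370,  0.365,  0.368,   0.366],
    [0.200,   0.240,    0.260,  0.255,  0.250,   0.258],
])

statistic, p_value = friedmanchisquare(*scores.T)  # one argument per method
print(statistic, p_value)  # a small p-value suggests the methods differ in ranking
```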

9 Statistical Testing
It is worth noting that the Friedman Test appears to be consistently more powerful than the ANOVA, in the sense that it finds more significant differences between methods. This is a bit surprising, since the Friedman Test is a nonparametric test (not model based), and tests from this family are usually less powerful than the ANOVA when its assumptions are reasonable. This suggests that it would be a good idea to check those assumptions. The ANOVA model assumes that the error variance is constant across queries. Given that each query has a widely variable number of relevant documents, there is good reason to expect that this condition may be violated. As we saw in the previous section, the evaluation score of queries with very few relevant documents may change dramatically with the change in rank of a single relevant document.

We can examine the data more closely to see how badly the assumption of equal variance is violated. Let us presume that the variance is constant and equal to the error variance obtained from the ANOVA model, which we will denote as σ². It is well known in statistics [1] that

    (n - 1) s² / σ²  ~  χ²(n-1)

where s² is the variance of a sample of size n drawn from a normal population with variance σ². We can compute s² for each query and compare the distribution of the test statistic above to the appropriate chi-square reference distribution. In particular, we determine the expected value of the maximum of a sample of 200 observations drawn from the chi-square distribution. If our data is actually drawn from this distribution, we would expect an average of one observation to equal or exceed this value. In practice, we find that between 11 and 18 observations exceed the expected maximum for each of our experiments.

The results above indicate that, without question, the assumption of equal variance is violated. We have suggested that queries with few relevant documents will tend to have higher variability than those with many relevant documents. We examined this hypothesis and found that while a few of the queries with highest variability had fewer than 10 relevant documents, there was no consistent pattern. We also tested whether variability in performance was significantly correlated with average performance and also obtained a negative result. From this analysis, we believe that the results from the Friedman Test should be more reliable for our experiments. It also provides evidence that examining average rank data as well as average score data is valuable in determining which experimental methods are most effective.

The reader should keep in mind that the statistical tests described here have been chosen for this particular experiment. In general, the optimal test may depend on the number of experimental methods being compared. For instance, both Friedman and ANOVA have different forms when there are only two methods [6], and Conover [3] suggests that there is a more powerful alternative to the Friedman Test for cases when three or four methods are being compared.
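The equal-variance check described above can be sketched as follows. The per-query residual variances and the pooled error variance below are hypothetical placeholders, and the threshold is taken as the chi-square quantile that a single observation out of 200 is expected to exceed on average.

```python
# Sketch: check the equal-variance assumption by comparing per-query
# variance statistics to a chi-square reference distribution.
# The residual variances and sigma^2 below are hypothetical.
import numpy as np
from scipy.stats import chi2

n_methods = 6                      # number of stemming methods compared
n_queries = 200
dof = n_methods - 1

sigma2 = 0.004                     # pooled error variance from the ANOVA (assumed)
rng = np.random.default_rng(0)
s2 = rng.gamma(shape=2.0, scale=0.002, size=n_queries)  # per-query variances (fake)

test_stat = dof * s2 / sigma2      # (n - 1) s^2 / sigma^2 for each query

# Threshold that, on average, one of 200 chi-square draws should exceed.
threshold = chi2.ppf(1 - 1 / n_queries, dof)
print("expected maximum (approx):", round(threshold, 2))
print("queries exceeding it:     ", int((test_stat > threshold).sum()))
```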

10 Identifying Important Queries


In the previous section, we obtained strong evidence that variance is not constant over queries. Since high-variance queries are much more important in determining the average performance of an experimental method than their low-variance counterparts, it is valuable to get a better feeling for what causes this behavior. We have found three major causes for high variance in a query.

(1) A lot of relevant documents are ranked higher in some methods than in others.

(2) The query has very few relevant documents, and thus scores are sensitive to changes in the ranking of one or two relevant documents.

(3) The query is hard, and relevant documents tend to have low ranks in general. However, some methods rank a few relevant documents relatively well, which can make a large difference for many evaluation measures.

Clearly, factor (1) describes the type of behavior we are looking for, and these queries deserve to be considered very important. Some people might argue that queries of type (2) and (3) are also important, because they demonstrate real changes in the ranks of relevant documents; however, we believe that this is a dangerous conclusion. If only a few relevant documents are being ranked higher, it is hard to know if this is due to the fact that they are relevant or to some reason unrelated to relevance. We would be particularly suspicious of type (3) queries, for one would hope that a method which is valuable would improve the ranking of a reasonably large sample of relevant documents. This does not mean that an effect which improves the rank of a single relevant document is not real and important in some cases, merely that this is a less reliable indicator of value than a method which improves the rank of many relevant documents.

Therefore, we would like to evaluate average performance using only queries which satisfy factor (1). This can be accomplished by applying the Friedman Test on a per-query basis, comparing the ranks of individual relevant documents. A query of type (1) will have consistent differences in the ranks of the relevant documents, which can be detected by the Friedman Test. There are a number of issues to consider when applying this method which will not be discussed here; the reader interested in more details is encouraged to refer to Hull [7]. From this analysis, we obtain 125 full queries and 139 short queries which satisfy our definition of an important query. Average performance scores are computed over this subset and presented in Table 4. Readers are cautioned that these results are not unbiased estimates of the true differences between methods, due to the selective nature of the query filtering process. The significant differences increase to roughly 0.015 (0.032 for short AP[5-15]) for average scores and 0.42 for average ranks, due to the fact that fewer queries are being analyzed; however, the results remain roughly the same. This reassures us that the potential problems that we discussed at the beginning of the section do not change the results a great deal. We will concentrate on the average rank statistics, since they are directly comparable to the rank statistics computed using all queries.

Average Scores

Query   Measure       nostem  remove s  Lovins  Porter  Inflect  Deriv   sig. test
full    APR11         0.375   0.393     0.402   0.399   0.400    0.397   RLPID > N
full    AP[5-15]      0.623   0.640     0.628   0.642   0.640    0.631
full    AR[50-150]    0.377   0.391     0.394   0.398   0.397    0.391   RLPID > N
short   APR11         0.208   0.233     0.244   0.241   0.238    0.245   RLPID > N
short   AP[5-15]      0.374   0.410     0.412   0.429   0.421    0.426   RLPID > N
short   AR[50-150]    0.259   0.284     0.287   0.294   0.290    0.288   RLPID > N

Average Ranks

Query   Measure       none    rem s   Lov     Port    Inflect  Deriv   sig. test
full    APR11         2.50    3.19    3.82    3.74    3.88     3.88    LPID > R > N
full    AP[5-15]      3.24    3.52    3.50    3.64    3.58     3.51
full    AR[50-150]    2.66    3.32    3.78    3.70    3.84     3.70    RLPID > N, LI > R
short   APR11         2.51    3.24    3.58    3.90    3.85     3.92    PID > R > N, L > N
short   AP[5-15]      2.94    3.44    3.52    3.70    3.69     3.71    RLPID > N
short   AR[50-150]    2.54    3.22    3.40    3.99    3.89     3.97    PID > LR > N

Table 4: Evaluation scores for the important queries.
For the full queries, the results are basically identical, although the average rank of no stemming drops even further relative to the other methods. This is probably a result of removing the queries where stemming is not important, and it indicates that when stemming is important, not stemming is rarely the right decision.

For the short queries, there is another interesting pattern. The Lovins stemmer appears to be less effective, while the Porter and Inflectional stemmers are proportionally more effective according to average rank. However, this pattern is not duplicated in the average scores, where the Lovins stemmer is fully competitive with the other stemmers. A closer look at the scores reveals that the Lovins stemmer performs slightly worse for a lot of queries and much better for a few queries, which explains the lower score with respect to the rank-based measure. One possible explanation is that Lovins stems more heavily than all the other stemmers. This means that it conflates a lot more terms, which reduces performance slightly on a lot of queries by adding extra noise. However, occasionally an additional important term is recognized, which helps performance a lot. Therefore, while average performance is the same, choosing the Lovins stemmer may degrade performance slightly for a lot of queries in exchange for helping more significantly with a few queries. Note that this hypothesis has not been verified explicitly and applies only to the short queries when evaluated using APR11 or AR[50-150].
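The per-query importance filter described above can be sketched as follows. The matrix of relevant-document ranks is hypothetical, and Hull [7] discusses the practical issues we gloss over here.

```python
# Sketch: flag a query as "important" (type 1) when the ranks of its relevant
# documents differ consistently across methods, via a per-query Friedman test.
# Rows = relevant documents, columns = stemming methods, values = rank of the
# document in that run. The matrix below is hypothetical.
import numpy as np
from scipy.stats import friedmanchisquare

def is_important(relevant_doc_ranks, alpha=0.05):
    """relevant_doc_ranks: (n_relevant_docs, n_methods) array of ranks."""
    stat, p_value = friedmanchisquare(*relevant_doc_ranks.T)
    return p_value < alpha

ranks = np.array([
    [12,  5,  6, 30,  8, 40],
    [50, 20, 22, 90, 25, 95],
    [ 7,  3,  4, 15,  5, 18],
    [33, 14, 16, 60, 18, 70],
])
print(is_important(ranks))
```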

11 Detailed Analysis
While we have learned a lot from our extensive analysis of the average performance figures, it has not really helped us find out exactly why one stemming algorithm works better than another. If we wish to improve the current technology in stemming algorithms, it is important to see specific examples where the type of stemming makes a large difference in performance. Therefore, we will take a detailed look at a number of individual queries and figure out which word stems account for differences in performance. For this section of the analysis, we rely only on the short queries, for it will be far easier to identify the important word stems in these examples.

At this point, it is very clear that no stemming and remove s are not really competitive with the other stemming algorithms, so they will not be considered further. Among the others, we have found little evidence which suggests how to choose between them, other than a hint that the Lovins stemmer may have slightly different behavior.

For the detailed query analysis, we start with only those queries which were judged important based on a per-query Friedman Test, as described in the previous section. We then compute the error variability of the queries and rank them in decreasing order of variability, just as we did when testing the equal-variance assumption for the ANOVA, except that now we are working with only four stemming algorithms. We compute the chi-square score for each query and find that 11 exceed the expected maximum score of 12.07 for a sample of 139 queries with equal variance. The top 10 of these are presented in Table 5.

The queries shown above can be divided into 6 categories according to their behavior:

(1) Lovins, Deriv > Inflect, Porter - 21, 70, 82, 143
(2) Inflect, Porter > Lovins, Deriv - 39, 147
(3) Deriv > others - 86
(4) Deriv < others - 182
(5) Porter < others - 128
(6) Lovins, Porter > Inflect, Deriv - 98

We examine each of these categories in more detail.

Q#     score   Lovins  Porter  Inflect  Deriv
86     61.37   0.216   0.225   0.230    0.546
39     57.27   0.092   0.343   0.365    0.094
21     53.5    0.596   0.302   0.278    0.484
82     39.23   0.438   0.282   0.151    0.406
143    26.11   0.199   0.101   0.007    0.249
128    24.87   0.246   0.047   0.251    0.255
182    24.42   0.305   0.341   0.262    0.115
98     24.37   0.501   0.507   0.326    0.331
147    20.27   0.032   0.186   0.187    0.034
70     15.9    0.635   0.496   0.495    0.647

Table 5: Revised evaluation scores.

(1) Lovins, Deriv > Inflect, Porter

Q21: superconductivity vs. superconductor

Unstemmed Word           Lovins        Porter             Inflect            Deriv
superconduct(ed/ing)     superconduc   superconduct       superconduct       conduct
superconduction          superconduc   superconduct       superconduction    conduct
superconductive          superconduc   superconduct       superconductive    conduct
superconductively        superconduc   superconductiveli  superconductively  conduct
superconductivity('s)    superconduc   superconductiviti  superconductivity  conduct
superconductor(s/'s)     superconduc   superconductor     superconductor     conduct
The query talks about research in superconductivity, while many of the documents refer only to superconductors. Only the Derivational and the Lovins stemmers make this match. Note that the Derivational stemmer overstems by removing the prefix and so loses a bit in performance for this reason.

Q70: surrogate motherhood vs. surrogate mother


The query asks about surrogate motherhood while many of the documents talk about surrogate mothers. Only the Derivational and the Lovins stemmers make this connection.

Q82: genetic engineering vs. genetically engineered product


Unstemmed Word      Lovins   Porter       Inflect      Deriv
genetic('s)         genet    genet        genetic      genetic
genetically         genet    geneticalli  genetically  genetic
geneticist(s/'s)    genet    geneticist   geneticist   genetic
genetics('s)        genet    genet        genetics     genetics

The Derivational and Lovins stemmers relate genetic and genetically. The Inflectional stemmer lost out even more because engineering did not get conflated to engineer.

Q143: U.S. government protection of farming


Unstemmed Word    Lovins  Porter  Inflect  Deriv
farm(s/'s/ed)     farm    farm    farm     farm
farming('s)       farm    farm    farming  farm
farmer(s/'s)      farm    farmer  farmer   farm

The Inflectional and Porter stemmers did not recognize that farmer and farming were related. The Inflectional stemmer did not conflate farming and farm and thus was completely ineffective. The behavior of the inflectional stemmer with respect to farming and engineering deserves a more complete explanation. Both words can be either a noun or a verb. According to inflectional rules, the noun form should not be stemmed while the verb form should. We do not apply a part-of-speech tagger to the text before stemming and are therefore forced to make an arbitrary decision. Our decision not to stem engineering and farming turns out to be a poor one for IR performance, although it may be the right one from the linguistic perspective in the absence of a part-of-speech tag.

(2) Inflect, Porter > Lovins, Deriv

Q39: client-server architectures

The derivational and the Lovins stemmer equate server with serve. This is a very bad decision, since serve is a common term used in a number of contexts and server has a much more specific meaning, particularly in the domain of computers and technology.

Q147: productivity statistics for the U.S. economy


The Derivational and the Lovins stemmer equate productivity and produce. Once again, this turns out to be a bad decision, for the reasons described in the previous example.

Note the contrasts between sections (1) and (2). Conflating farmer to farm is good, but conflating server to serve is bad. Superconductivity should definitely be related to superconduct, but converting productivity to produce turns out to be a bad decision. This clearly indicates that it is impossible to produce an ideal stemmer using only suffixing rules. There are only two ways to make a distinction in the examples above: one is to construct new rules or exception lists by hand, the other is to identify which conflation pairs are used in similar contexts by conducting a corpus-driven analysis.

(3) Deriv > others

Q86: bank failures

The Derivational stemmer converts failure to fail. The other stemmers do not make this connection. This is an example where linguistic knowledge can recognize a special case which is missed by suffixing rules.

(4) Deriv < others

Q182: commercial overfishing

The Derivational stemmer converts overfishing to fish. It is becoming increasingly clear that prefix removal is a bad idea.

(5) Porter < others

Q128: privatization of state assets


The Porter stemmer equates privatization with private. This is certainly unfortunate for information retrieval. The irony here is that the Derivational stemmer would probably have made the same mistake, but the word privatization is not in the lexicon! This points out another potentially serious problem that is shared by both the Inflectional and Derivational stemmers.


Since both stemmers analyze on the basis of linguistic rules, a word must be in the lexicon in order to be recognized. We make the decision not to stem any word which does not appear in the lexicon. While this works well for the most part, since most unrecognized words are proper names, there is a distressingly large list of exceptions. This is a particular problem when using the Wall Street Journal, because a large amount of financial and economic terminology is not included in our general-purpose lexicon. Many of these terms would be important when used as query terms. It might be a good idea to construct a guesser which applies a rule-based algorithm when the word does not appear in the lexicon, because at the moment the linguistic stemmers do not even conflate plurals of unidentified nouns. We have implemented one special rule to remove possessives ('s), since this is obviously an important factor for proper names and the rule is extremely unlikely to make any errors.
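As a rough sketch of the fallback behavior discussed above: the toy lexicon and the plural-stripping guesser rule below are hypothetical illustrations, and only the possessive rule corresponds to something we actually implemented.

```python
# Sketch: lexicon-based stemming with a possessive rule and a simple guesser
# for out-of-lexicon words. The lexicon and the guesser rule are toy stand-ins.
lexicon = {"failure": "fail", "banks": "bank", "bank": "bank"}

def stem(word):
    w = word.lower()
    if w.endswith("'s"):            # possessive rule: safe even for proper names
        w = w[:-2]
    if w in lexicon:                # word known: use the lexical database
        return lexicon[w]
    # Hypothetical guesser rule for unknown words: conflate simple plurals.
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w                        # otherwise leave the word untouched

for w in ["failure", "Xerox's", "eurobonds", "glass"]:
    print(w, "->", stem(w))
```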

(6) Lovins, Porter > Inflect, Deriv

Q98: manufacturers of fiber optics equipment

The inflectional and derivational stemmers do not stem optics to optic. For the Derivational stemmer, this is a linguistically motivated decision which turns out to be very unfortunate for information retrieval.

The detailed query analysis has been very informative. We have recognized that there are many examples where the suffix-removal rules do not produce ideal behavior. We have also discovered that the Inflectional and Derivational stemmers make a number of decisions for linguistic reasons that are not correct for information retrieval. Also, the two linguistic stemmers could probably be improved by adding domain-specific lexicons or by building a guesser for unknown terms.

12 Modified Derivational Stemmer


The detailed query analysis has revealed that prefix removal is generally a bad idea for a stemming algorithm. The derivational stemmer suffers in performance on a number of occasions for this reason. For example, prefix removal for the terms overfishing (Q182) and superconductor (Q21) is certainly undesirable. A quick scan over the unexamined queries reveals several more instances where prefix removal causes problems: Q1 - antitrust, illegal and Q4 - debt rescheduling.

                      Prefix Removal
Query   Measure       Y       N       sig?
full    APR11         0.366   0.368   N
full    AP[5-15]      0.555   0.559   N
full    AR[50-150]    0.426   0.426   N
short   APR11         0.208   0.210   Y
short   AP[5-15]      0.354   0.356   N
short   AR[50-150]    0.288   0.289   N

Table 6: Effect of prefix removal on stemming performance.

To verify this hypothesis, we printed out a list of all terms for which prefix removal took place in the Wall Street Journal. It quickly became obvious that in the vast majority of cases, stemming would be undesirable. A lot of prefixes, such as anti-, un-, and il-, reverse the meaning of the root form. Therefore, we created a new version of the Derivational stemmer that only modified the suffix of the word, in order to see whether this would make a significant difference in retrieval performance. Table 6 compares performance to the original version.

Query   Measure       none    rem s   Lov     Port    Inflect  Deriv   sig. test
full    APR11         2.68    3.10    3.81    3.75    3.77     3.90    LPID > R > N
full    AP[5-15]      3.32    3.47    3.63    3.54    3.56     3.48
full    AR[50-150]    2.81    3.25    3.69    3.62    3.78     3.87    LPID > R > N
short   APR11         2.63    3.25    3.68    3.75    3.68     4.01    LPID > R > N
short   AP[5-15]      3.03    3.47    3.55    3.69    3.52     3.75    RLPID > N
short   AR[50-150]    2.63    3.20    3.56    3.90    3.70     4.01    LPID > R > N, D > L

Table 7: Ranks with revised Derivational stemmer.

The average scores change very little, but all the differences are in the right direction, so it seems to be worthwhile to use the modifications. To further examine this question, we replace the old Derivational stemmer with the new one and recompute the average ranks and the Friedman Test, as shown in Table 7. Both Table 6 and Table 7 are computed using the full query set. Although the revised derivational stemmer has slightly higher ranks, there is no significant change in the results. It might be worthwhile to look at only the important queries, but we did not run that experiment in time for this paper.

13 Conclusions
After an exhaustive analysis, we have reached the following conclusions concerning stemming algorithms.

(1) Some form of stemming is almost always beneficial. The only case where there is not overwhelming evidence in favor of stemming is when the full TREC queries are used and very few documents are examined. The average absolute improvement due to stemming is small, ranging from 1-3%, but it makes a large difference for many individual queries. There are probably particular queries where stemming hurts performance, but we did not investigate this issue in detail.

(2) Simple plural removal is less effective than more complex algorithms in most cases. However, when only a small number of documents are being examined, plural removal is very competitive with the other algorithms.

(3) There is no difference between the other stemmers in terms of average performance. For short queries, the Lovins stemmer tends to perform slightly worse for many queries but a lot better for a few others. This may be a result of overstemming, although we have not produced any concrete evidence for this hypothesis.

(4) The detailed query analysis demonstrates that the same suffix-removal rule can be beneficial in some cases and harmful in others. This suggests that rules-based suffix removal may not be the ideal approach to stemming in the long run.

(5) There are a number of problems with using linguistic approaches to word analysis directly as stemming algorithms. First, these methods are based on a lexicon and cannot correctly stem words which are not contained in the lexicon. Second, many decisions about the root form of a word which are properly motivated from the linguistic perspective are not optimal for information retrieval performance. It is clear that linguistic analysis tools must be tailored for information retrieval applications.

(6) In particular, prefix removal seems to be a particularly bad idea for a stemming algorithm, and the detailed analysis provides a number of specific examples of this problem. Modifying the linguistic tools provides only a small benefit in terms of average performance, but the difference is consistent over all evaluation measures.

While the linguistically based stemming is not significantly better than current algorithms in terms of average performance, the detailed query analysis reveals a number of different ways in which the linguistic tools could be improved and optimized for information retrieval. With such modifications, and their inherent advantages over suffix-removal algorithms (stems are real words), linguistic tools based on morphological analysis should work very successfully as stemming algorithms.

Acknowledgements

We are grateful to Donna Harman and the analysts from the TIPSTER project for providing such an outstanding resource which has really helped to advance the field of IR. We would also like to thank Chris Buckley for writing and supporting the SMART text retrieval system and making it available for research use. This is a real service to the information retrieval community.

References
[1] G.E.P. Box, W.G. Hunter, and J.S. Hunter. Statistics for Experimenters, pages 118-119. John Wiley and Sons, 1978.
[2] Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University, 1985.
[3] W.J. Conover. Practical Nonparametric Statistics. John Wiley and Sons, 2nd edition, 1980.
[4] W.B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[5] Donna Harman. How effective is suffixing? Journal of the American Society for Information Science, 42(1):321-331, 1991.
[6] David Hull. Using statistical testing in the evaluation of retrieval performance. In Proc. of the 16th ACM/SIGIR Conference, pages 329-338, 1993.
[7] David A. Hull. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70-84, 1996.
[8] Yufeng Jing and W. Bruce Croft. An association thesaurus for information retrieval. In Proc. of Intelligent Multimedia Retrieval Systems and Management Conference (RIAO), pages 146-160, 1994.
[9] Robert Krovetz. Viewing morphology as an inference process. In Proc. of the 16th ACM/SIGIR Conference, pages 191-202, 1993.
[10] M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3:177-183, 1981.
[11] Janet Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31, 1968.
[12] J. Neter, W. Wasserman, and M. Kutner. Applied Linear Statistical Models. R.D. Irwin, 2nd edition, 1985.

[13] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[14] Ellen M. Voorhees. Query expansion using lexical-semantic relations. In Proc. of the 17th ACM/SIGIR Conference, pages 61-69, 1994.
[15] Xerox Corp. Xerox Linguistic Database Reference, English version 1.1.4 edition, December 1994.

