
Inferring Probability of Relevance Using the Method of Logistic Regression

Fredric C. Gey

UC Data Archive and Technical Assistance
University of California, Berkeley, USA
e-mail: gey@sdp2.berkeley.edu

Abstract

This research evaluates a model for probabilistic text and document retrieval; the model utilizes the method of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming each statistical clue with respect to its distribution in a training collection (one standardized to mean \mu = 0 and standard deviation \sigma = 1), the model allows one to apply coefficients derived from a training collection to other document collections with little loss of predictive power. The method is applied to three well-known test collections and compared directly to the vector space model, which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference model performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The performance differences were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.

1. Introduction
We can consider the central task of Information Retrieval to be that of providing intellectual access to the contents of a collection of texts or documents [1] [2] [3]. The heart of this task is the concise representation of the meanings of the texts contained in the collection. One model which has both a clear geometric interpretation and well-known retrieval properties is the vector space model developed by Gerard Salton at Cornell University.

In the vector space model, documents and queries are both represented as m-dimensional vectors, assuming the indexing vocabulary of nontrivial terms is of size m. To measure the degree to which a document matches a query, the model computes the cosine of the angle between the query vector and the document vector. The sparsest form of the query and document vectors merely indicates the presence or absence of a vocabulary term in either the query or the document. These vectors can be easily generalized by adding weights which measure, for the appropriate terms, the magnitude of their contribution to the meaning of the query or the document. Weights may be mechanically derived from attributes of terms. For example, for terms in the query, the attribute query absolute frequency (QAF) is the count of the number of occurrences of that term in the query; the corresponding attribute for terms in the document is document absolute frequency (DAF). The attribute inverse document frequency (IDF), which is usually logged, is the ratio N/n_j, where N is the number of documents in the collection and n_j is the number of documents in which term t_j occurs, that is, in which it is used to index the document.

Two of these attribute weights are well known. Inverse document frequency was first suggested by Sparck-Jones [4]. Salton and Buckley [5] identified a number of term weighting schemes used in the SMART retrieval system and computed retrieval performance measurements against several test collections. Their experiments show that weights formed as a simple multiple of term occurrence frequency factors (QAF for the query, DAF for the document) and inverse document frequency accounted for among the best retrieval performance.
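To make the tfidf/cosine baseline concrete, here is a minimal sketch (illustrative code, not taken from the paper; the function names and the particular tf-times-logged-IDF weighting variant are assumptions chosen for clarity):

```python
import math
from collections import Counter

def tfidf_vector(terms, doc_freq, n_docs):
    """Weight each term: absolute frequency times logged inverse document frequency."""
    counts = Counter(terms)
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Documents are ranked against a query by descending cosine value.
```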

By a probabilistic retrieval system we will mean a system which, in response to a query specification, returns documents to the user ranked in the order of their probability of satisfying the user's information need, and in which the formal rules for the computation of that probability are derived from the theory of probability and statistical inference. Such a probabilistic approach to the indexing and retrieval processes attempts to place the computation of the retrieval status value of documents on a theoretically sound footing. The primary focus of this paper is to test a probabilistic model of text retrieval which is both theoretically and practically satisfying. If the retrieval status value of a document can be computed as the probability of the document satisfying the user's information need, then documents can be returned to the user ranked in order of their probability of relevance.

The probability ranking principle [6] states that optimal retrieval will be achieved if documents are returned to the user in order of their probability of relevance. This leads to the so-called Binary Independence model [7], which was generalized to the Linked Dependence model by Cooper [8]. Unfortunately, the statistical methods of probabilistic retrieval face the problem of insufficient statistical evidence: even relevance feedback experiments which use half the collection as a training set to predict relevance have encountered problems of insufficient sample size. Improved estimation methods have been only partially successful with this problem [9].

2. The logistic inference model

If no assumptions are made about the character of the probabilities of relevance, we merely allow that between query q_i and document d_j there exists a probability of relevance P = P(R | q_i, d_j), and that a text retrieval system can associate with each query-document-term triple a finite set of attribute values (v_1, ..., v_n). These attributes are derived from properties of the query term t_q (such as the count of occurrences of the term in the query), of the document term t_d (such as the count of occurrences of the term in the document divided by the total number of term occurrences in the document), and of the collection (such as the number of documents in the collection which are indexed by the term), and these attribute values vary statistically from one particular collection to another.

The logistic inference model says that we can use a random sample of query-document-term triples for which binary relevance judgments have been made, and compute the logarithm of the odds of relevance for a term t_k which is present in both document d_j and query q_i by the formula:

\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + \cdots + c_n v_n

and further that the logarithm of the odds of relevance for the i-th query q_i = <t_1, ..., t_q> is obtained by summing over all terms:

\log O(R \mid q_i, d_j) = \log O(R) + \sum_{k=1}^{q} [\log O(R \mid q_i, d_j, t_k) - \log O(R)]

where O(R), known as the prior odds of relevance, is the odds that a document chosen at random from the collection will be relevant to query q_i. The coefficients which relate relevance to the individual query and document properties are derived by using the method of logistic regression [10], which fits an equation to predict a dichotomous dependent variable as a function of independent variables which may be continuous. Once the log odds of relevance has been computed from this simple linear formula, the inverse logistic transformation may be applied to directly obtain the probability of relevance of a document to a query:

P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}

Once the coefficients of the equation have been derived from a random sample of query-document-term-relevance quadruples within a particular collection, they may be used to predict the odds of relevance for other query-document pairs (past or future). Regression models which approximate probability of relevance by linear combinations of clues such as term frequency, authorship, and co-citation were first introduced to information retrieval by Ed Fox [11]. More recently, Fuhr and Buckley [12] [13] used polynomial regression to approximate document relevance. There are two major well-known problems with the application of ordinary regression approaches to probability of relevance: the outcome variable is dichotomous rather than continuous, and the error distribution is binomial rather than normal, as is usually assumed by ordinary regression. An alternative approach is to use Bayesian inference networks, as developed by Turtle [15] and others [14] [16].
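As an illustration of this fitting step, the sketch below estimates the coefficients c_0, ..., c_n from judged triples and applies the inverse logistic transformation. It is a minimal example assuming scikit-learn, with invented clue values, not the paper's actual pipeline (note that scikit-learn applies mild regularization by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per query-document-term triple: the attribute values v1..vn.
# y holds the binary relevance judgment of the corresponding pair.
X = np.array([[2.0, 0.10, 3.0, 0.012],   # hypothetical clue values
              [1.0, 0.05, 1.0, 0.004],
              [3.0, 0.15, 7.0, 0.021],
              [1.0, 0.05, 2.0, 0.008]])
y = np.array([1, 0, 1, 0])

fit = LogisticRegression().fit(X, y)     # log O(R|q,d,t) = c0 + c1*v1 + ... + cn*vn
c0, coefs = fit.intercept_[0], fit.coef_[0]

logodds = c0 + X @ coefs                 # per-triple log odds of relevance
prob = 1.0 / (1.0 + np.exp(-logodds))    # inverse logistic transformation
```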

3. Sampling
One pute logistic

for logistic
problems coefficients.

regression
in using Even logistic supposing regression that only is the computational half the document size necessary to comas a collection is used

of the computational regression

224
training nomical. query, attempt than not ments, pensates ments, and factor We Cranfields lowing sonable queries fit all, set the number Naturally, naturally non-relevant and for to construct of the relevant adjusting for a very ones. of query-document-term approach fraction which small To of the triples total which Since are to be used for computation the number in the higher smaller process of relevant proportion proportion by fewer factor collection, documents, becomes is, for upon astroeach us to if

a sampling a sample

is necessary.

documents of relevant

documents a significantly a significantly

it is incumbent we would a weight than of 30. Cranfield beginning

encompasses while relevant sampling

documents most, docucomdocutriple

assure

adequate

representation

of relevant

sample

documents non-relevant Since into

of non-relevant factor which non-relevant

documents the sample,

in the regression are typically thirtieth by applying upon retrieval which has which for say, every

the differential. to take the logistic chosen

documents

1/ lo(hh
non-relevant

it is reasonable it into have

query-document-term

regression

computation

a weighting to fit our numerous
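The following sketch shows one way such a weighted sample might be constructed (a hypothetical helper, assuming a 1-in-30 sampling rate for non-relevant triples; the resulting weights would then be passed to the regression fit, e.g. via scikit-learn's sample_weight argument):

```python
import random

def sample_triples(triples, keep_rate=1/30, seed=0):
    """Keep every relevant triple with weight 1; keep each non-relevant
    triple with probability keep_rate, compensating with weight 1/keep_rate
    so the fit still reflects the true proportion of non-relevant items."""
    rng = random.Random(seed)
    sample, weights = [], []
    for clues, relevant in triples:
        if relevant:
            sample.append((clues, relevant))
            weights.append(1.0)
        elif rng.random() < keep_rate:
            sample.append((clues, relevant))
            weights.append(1.0 / keep_rate)
    return sample, weights
```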

We have used the Cranfield test collection to fit our logistic model. The Cranfield collection is a long-standing one in information retrieval experiments; it has some well-known deficiencies, as outlined by Swanson [17], but it offers both desirable properties and a reasonable fit to our experiments. First, while it only has 1,400 documents, it has 225 queries, which seems to be a sufficient number to afford a reasonable regression fit. Second, the documents seem to possess a better fit to the characteristics of an ideal test collection. The characteristics of an ideal test collection should be:

- The queries are in natural language form, so that query statistics can be collected as well as document statistics.
- The documents have both titles and abstracts present (abstracts are missing from some of the CACM collection).
- The queries are a random sample from some larger query population, and there must be a sufficiently large number of queries in order to achieve some statistical significance.
- The documents are a random sample from a larger document collection, and relevance judgments apply to all query-document pairs in the collection. If this is not possible (as it cannot be for million-document collections), then the relevance judgments are a carefully crafted sample of all possible query-document pairs, and the sample size must be known.

We will later test our fit against the CACM collection (52 queries, 3,204 documents) and the CISI collection (76 queries, 1,460 documents).

4. Logistic regression model

Our particular logistic inference model has the following formula for six term attributes, or elementary clues: query absolute frequency (QAF), query relative frequency (QRF), document absolute frequency (DAF), document relative frequency (DRF), inverse document frequency (IDF), and relative frequency of the term in all documents (RFAD). The complete formula for the logodds of relevance, given the presence of term t_j, is then:

Z_{t_j} = \log O(R \mid t_j) = c_0 + c_1 \log(QAF) + c_2 \log(QRF) + c_3 \log(DAF) + c_4 \log(DRF) + c_5 \log(IDF) + c_6 \log(RFAD)

Query absolute frequency (QAF) has been previously defined. Query relative frequency (QRF) is equal to query absolute frequency divided by the total number of occurrences of all terms in the query. Similarly, document relative frequency (DRF) is document absolute frequency divided by the total number of occurrences of all terms in the document, i.e., by document length. Relative frequency in all documents (RFAD) is the total number of occurrences of the term in the collection divided by the total number of occurrences of all terms in the entire collection.

You will note that all terms in the formula are logged. One reason for using logarithms is to dampen the influence of frequency information. It is unrealistic to assume, for example, that 50 occurrences of a term in a document make the term five times more important than 10 occurrences. Another reason is that taking logarithms smooths out a skewed distribution of clue values. Finally, logged variables have nice behavioral properties: after an anti-logarithm is applied, sums of logged variables behave as products of the raw items. Indeed, in comparing the fit for logged and non-logged clues, a higher maximum likelihood is in general attained for the logged clues (at least for these six clues).
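A small sketch of how the six logged clues might be computed for one query-document-term triple follows (an illustrative helper written for this summary; the argument names are assumptions, and IDF is taken here in its ratio form N/n_j before logging):

```python
import math

def term_clues(qaf, query_len, daf, doc_len, n_docs, n_docs_with_term,
               term_occs_in_coll, total_occs_in_coll):
    """Return the six logged clues for one query-document-term triple."""
    qrf = qaf / query_len                    # query relative frequency
    drf = daf / doc_len                      # document relative frequency
    idf = n_docs / n_docs_with_term          # inverse document frequency, N/n_j
    rfad = term_occs_in_coll / total_occs_in_coll
    return {'log_qaf': math.log(qaf), 'log_qrf': math.log(qrf),
            'log_daf': math.log(daf), 'log_drf': math.log(drf),
            'log_idf': math.log(idf), 'log_rfad': math.log(rfad)}
```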
The fit for this extended logistic model is done against the same sample of 20,246 query-document-term triples taken from the Cranfield collection. The following coefficients are then derived from this fit:

Z_{t_j} = \log O(R \mid t_j) = -0.2085 - 0.2036 \log(QAF) + 0.19143 \log(QRF) + 0.16789 \log(DAF) + 0.57544 \log(DRF) + 1.5967 \log(IDF) + 0.75033 \log(RFAD)

What remains is to compute the prior odds of relevance for this collection. The prior probability is estimated by counting the number of query-document pairs for which a judgment of relevance has been assigned (1,838 pairs for the Cranfield collection) and then dividing by the total number of query-document pairs (for Cranfield, 225 queries times 1,400 documents). Hence:

Prior = \frac{1838}{225 \times 1400} = 0.005835

becomes the prior probability of relevance,

logprior = \log O(R) = \log\left(\frac{0.005835}{1 - 0.005835}\right) = -5.138

and the final formula for ranking is:

\log O(R \mid q_i, d_j) = -5.138 + \sum_{j=1}^{q} (Z_{t_j} + 5.138)
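Putting the fitted coefficients and the prior together, a document's ranking score might be computed as in the sketch below (illustrative code using the constants reported above; the helper names and the clue-dict layout are assumptions):

```python
# Cranfield coefficients from the fitted equation above.
C = {'const': -0.2085, 'log_qaf': -0.2036, 'log_qrf': 0.19143,
     'log_daf': 0.16789, 'log_drf': 0.57544, 'log_idf': 1.5967,
     'log_rfad': 0.75033}
LOG_PRIOR = -5.138  # log prior odds of relevance for Cranfield

def term_logodds(clues):
    """Z_tj for one term present in both query and document."""
    return C['const'] + sum(C[name] * value for name, value in clues.items())

def document_score(matching_term_clues):
    """log O(R|q,d) = log O(R) + sum over matching terms of (Z_tj - log O(R))."""
    return LOG_PRIOR + sum(term_logodds(c) - LOG_PRIOR
                           for c in matching_term_clues)
```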

5. Performance evaluation: statistical significance tests

A key part of this paper is to compare the performances of two major retrieval methods: our logistic regression probabilistic model, which ranks documents according to probability of relevance, against the vector space model, which ranks documents according to the cosine of the angle between the query vector and the document vector. Performance evaluation is usually described in terms of recall and precision measures. Recall is the fraction of the relevant documents in the collection which have been retrieved at a certain point in the retrieval process, while precision is the fraction of retrieved documents which are relevant. These measures can be computed for a single query, but are usually averaged over all queries in the collection. It is our intent to use recall and precision to measure retrieval performance; however, it is also our intention to subject performance differences to a statistical test to determine whether the difference is significant or could have occurred by chance.

The reason one wishes to test statistical significance is that one method may perform better on a particular query and yet the performance could be reversed for the next query posed. In our model we assume that a query is chosen at random, and our method ranks documents according to probability of relevance for the randomly chosen query. Thus, if we have a randomly chosen set of queries, the performance might be entirely different for the next randomly chosen set of queries. We therefore need a test, for the two methods we are attempting to compare (such as logistic inference versus tfidf/cosine vector space), which will show whether the performance differences of the methods are statistically significant. Such a test, well known in the literature, would be a T-test which compares the differences between performance in ranking for each query. Our statistical approach, therefore, will be (see the sketch following this list):

- For each query, take the ranking of relevant documents by each method and compute an average precision value over all ranks (all levels of recall). This will give a unique number for each query for each method.
- Compute the difference between the two average precision numbers found with the two methods being compared. Thus, in applying the T-test, the sample size will equal the number of queries in the test collection.
- Apply a T-test to these differences; under the null hypothesis that the methods perform identically, their mean difference should be zero, and hence a mean difference significantly different from zero (relative to its standard deviation) indicates that the methods perform differently.

Hull [18] provides a summary of possible statistical tests which might be used to evaluate retrieval experiments.
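A compact sketch of this procedure (hypothetical per-query values; assumes SciPy's paired t-test, which matches the two-tailed test reported below):

```python
import numpy as np
from scipy import stats

def average_precision(relevant_ranks):
    """Average precision over the (1-based) ranks of the relevant documents."""
    ranks = sorted(relevant_ranks)
    return float(np.mean([(i + 1) / r for i, r in enumerate(ranks)]))

# One average-precision value per query for each method (invented numbers).
ap_logistic = np.array([0.52, 0.31, 0.44, 0.27, 0.61])
ap_cosine   = np.array([0.47, 0.33, 0.38, 0.21, 0.55])

# Two-tailed paired t-test on per-query differences; the null hypothesis
# is that the mean difference is zero (the methods perform identically).
t_stat, p_value = stats.ttest_rel(ap_cosine, ap_logistic)
```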


6. Logistic inference performance for the Cranfield collection

For the logistic inference model on the Cranfield collection, the following table displays the recall-precision performance of the two methods, averaged over the 225 queries:

Logistic Inference versus Tfidf/cosine Vector Space Performance
Cranfield Collection: Averages over 225 Queries

Recall        Logistic Precision    Vector Space Precision
0.00          0.8330                0.7787
0.10          0.8116                0.7440
0.20          0.7129                0.6434
0.30          0.6021                0.5301
0.40          0.5161                0.4380
0.50          0.4503                0.3814
0.60          0.3698                0.2994
0.70          0.2859                0.2267
0.80          0.2280                0.1882
0.90          0.1640                0.1379
1.00          0.1464                0.1251
11-pt Avg:    0.4655                0.4084
% Change:                           -12.3

If we further compute, for each of the 225 queries, the difference in average precision between the two methods (cosine and logistic), we find that the results are statistically significant at the one tenth of one percent level:

Hypothesis between N null mean 225 Thus we can see that clues, 0.0000 making

Test:

two-tail inference

t on Precision and Tfifdf/cosine df

Differences methods test P

Logistic sample mean -0.059885

sample SE 0.0072063 probabilistic model

stat 224 of logistic over -8.31 regression .0000 on the set of six term vector space model.

use of the full

property

we obtain

a statistically

significant

improvement

the tfidf/cosine

7. A criticism of the Cranfield results

The above results obtained for the logistic inference model have a fundamental difficulty. The entire set of queries, documents and terms from which the sample used to fit the logistic regression was drawn was also used, when the resulting logistic models ranked documents, to test the model. Naturally, we have a chicken and egg process, whereby the relevance judgments used to train the coefficients are included within the set upon which the test is based. In past tests of such models [19], the approach has been to use half of the queries to train the regression and then to apply the resulting coefficients in testing the other half of the queries. Since a large number of queries is necessary to obtain a decent fit, our approach will be, instead, to use the entire Cranfield collection as a training set and to apply the fitted model to the other collections in our test repertoire, in particular the CACM and CISI collections.

8. Logistic inference for CACM

If we apply the logistic regression equation fitted on the Cranfield sample to the CACM collection, we obtain the following result:

Cranfield Logistic versus Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

Recall        Logistic Precision    Vector Space Precision
11-pt Avg:    0.3322                0.3148
% Change:                           -5.2

While it may seem that the extended logistic model performs 5.2 percent better than the tfidf/cosine vector space model for the CACM collection, we must apply the same hypothesis test:

CACM Hypothesis Test: 2-tail T on Precision Differences
between Logistic Inference and Tfidf/cosine methods

N     null mean    sample mean    sample SE    df      test stat    P value
52    0.0000       -0.020528      0.015866     51.0    -1.29        .2016

We find that this difference is not statistically significant, and hence we cannot reject the null hypothesis that, in fact, given another set of queries, the tfidf/cosine vector space model might perform as well as or better than the extended logistic model.

9. Cranfield logistic model for CISI

Applying the Cranfield-fitted logistic regression equation to the CISI collection, we find:

Cranfield Logistic versus Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

Recall        Logistic Precision    Vector Space Precision
11-pt Avg:    0.2088                0.2137
% Change:                           4.4

While this seems to indicate better vector space performance, we must apply the statistical test on the average precision measures for the 76 CISI queries:

CISI Hypothesis Test: 2-tail T on Precision Differences
between Logistic Inference and Tfidf/cosine methods

N     null mean    sample mean    sample SE     df      test stat    P value
76    0.0000       0.0056973      0.0087791     75.0    0.65         .5183

We find that there can be no statistically significant difference between the performance of the extended logistic model and the vector space model for the CISI collection.

Thus, while the blind application of the Cranfield-derived coefficients directly to both the CACM collection and the CISI collection produces logistic inference results comparable to the vector space model, in neither collection can we find the difference to be statistically significant. The blind application of coefficients derived from one collection to another collection clearly leaves something to be desired. In the next section, we apply some thought to the transformation of coefficients, in order to adapt them to the statistical distribution of the clues in the other collections.

10. Standardized variables

The direct application of coefficients derived from one collection to another makes a very sensitive statistical assumption. It assumes not only that the clues from the query-document-term sample and set of queries of one collection have the identical mean and standard deviation as the clues from another collection and another set of queries, but also that they have the identical underlying statistical distribution as indicators of probability of relevance as in the original collection on which the logistic regression fitting has been done. Even though each clue might derive from the same underlying probability distribution, it seems highly unlikely that there would be no variation from collection to collection in the mean and standard deviation of a clue's distribution. This gives us insight into one possible adaptation of coefficients which fit one collection to the clue distributions of another collection. It is well known that one can take the actual distribution of any probability density function (even though the underlying distribution is not known) and transform it into a standardized distribution, one which has mean zero (\mu = 0) and standard deviation one (\sigma = 1). If the coefficients of a logistic fit over such standardized clues can be computed, then we can also apply those coefficients to the standardized clue distributions of yet another collection, using the same clues to indicate probability of relevance. Such an adaptation makes the following assumption:

- For each clue, the underlying probability distribution of that clue remains identical from collection to collection, changing only in its mean and standard deviation, i.e., only the mean and standard deviation of the particular collection-dependent distribution will change while the underlying standardized probability distribution remains constant over all collections.
One might well question this assumption on the following grounds:

Suppose the mean \mu_i and standard deviation \sigma_i for clue i vary meaningfully from collection to collection, and that these differences have predictive power with respect to relevance. Then the application of standardization of distributions will remove this predictive power.

In the face of such thinking, the best approach is to experimentally test the assumption. If we accept the assumption for our purposes of testing, we can derive the standardized coefficients directly from a first collection (here we assume the Cranfield collection) and compute the adjusted coefficients for any other collection to which we wish to apply our logistic regression technique. In terms of formulae, for the case of any two clues X and Y, we obtain the standardized coefficients c_0', c_1', and c_2' for the equation:

\log O(R \mid x_1, y_1) = c_0' + c_1' \left(\frac{x_1 - \mu_{X_1}}{\sigma_{X_1}}\right) + c_2' \left(\frac{y_1 - \mu_{Y_1}}{\sigma_{Y_1}}\right)

Given the equation above for the source collection, if we then look at the equation for the target collection, assuming the same standardized distribution, we would apply the same coefficients to the second collection. If the standardized distribution remains constant from collection to collection, we have the same equation for collection two:

\log O(R \mid x_2, y_2) = c_0' + c_1' \left(\frac{x_2 - \mu_{X_2}}{\sigma_{X_2}}\right) + c_2' \left(\frac{y_2 - \mu_{Y_2}}{\sigma_{Y_2}}\right)

and we wish to derive the coefficients c_0, c_1, and c_2 for the equation:

\log O(R \mid x_2, y_2) = c_0 + c_1 x_2 + c_2 y_2

But if we multiply out the first equation we obtain

\log O(R \mid x_2, y_2) = c_0' - c_1' \frac{\mu_{X_2}}{\sigma_{X_2}} - c_2' \frac{\mu_{Y_2}}{\sigma_{Y_2}} + \frac{c_1'}{\sigma_{X_2}} x_2 + \frac{c_2'}{\sigma_{Y_2}} y_2

and hence

c_0 = c_0' - c_1' \frac{\mu_{X_2}}{\sigma_{X_2}} - c_2' \frac{\mu_{Y_2}}{\sigma_{Y_2}}, \qquad c_1 = \frac{c_1'}{\sigma_{X_2}}, \qquad c_2 = \frac{c_2'}{\sigma_{Y_2}}

Thus to find the logistic coefficients for a target collection, we need only determine the mean, \mu_v, and standard deviation, \sigma_v, of each clue v in that collection, compute the new coefficients, and apply them directly to the new collection. This methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text retrieval conference (TREC-1) [21]. This paper provides a more complete analysis of the method.
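The coefficient transformation just derived generalizes directly from two clues to n clues; a small sketch (hypothetical function, following the formulas above):

```python
def adapt_coefficients(c0_prime, c_primes, target_stats):
    """Turn standardized coefficients (c0', c1', ..., cn', fit on clues
    standardized to mean 0 and std dev 1) into raw-clue coefficients for a
    target collection, given that collection's per-clue (mu, sigma) pairs."""
    c0 = c0_prime - sum(cp * mu / sigma
                        for cp, (mu, sigma) in zip(c_primes, target_stats))
    cs = [cp / sigma for cp, (_, sigma) in zip(c_primes, target_stats)]
    return c0, cs
```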

11. Applying to CACM and CISI collections

In order to obtain the means and standard deviations of all clues for the new collections, we must obtain a new sample (of 1 in 125) of query-document-term triples from each of those collections. From these triples and all the associated clue values, we can compute the following means and standard deviations for CACM and CISI:

Coll.           log QAF    log DAF    log QRF    log DRF    log IDF    log RFAD
CISI mean       0.3031     0.4093     -3.5450    -3.7922    1.9045     -5.5303
CISI std dev    0.5153     0.5795     0.9110     0.6944     0.8333     1.0675
CACM mean       0.1763     0.3900     -2.7644    -3.4976    2.0968     -5.2434
CACM std dev    0.3347     0.5855     0.5612     0.7999     1.1378     1.2146

We then use these means and standard deviations together with the Cranfield coefficients to obtain the following new logistic (logodds) coefficients for the CACM and CISI collections:

Coll.    const     log QAF      log DAF    log QRF    log DRF    log IDF    log RFAD
CRAN     -4.125    -0.03229     0.1059     0.06910    0.4193     1.4515     0.7734
CACM     -1.341    -0.08412     0.1520     0.11993    0.5391     1.0933     0.5507
CISI     -0.934    -0.061660    0.1622     0.0663     0.5572     1.5325     0.6680

If we then apply the fit in order to obtain rankings for all queries for both the CACM and CISI collections, we obtain the following recall-precision results:

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

Recall        Standardized Logistic Precision    Vector Space Precision
0.00          0.7452                             0.6985
0.10          0.6040                             0.5683
0.20          0.5338                             0.4830
0.30          0.4392                             0.3887
0.40          0.3683                             0.3210
0.50          0.3128                             0.2761
0.60          0.2682                             0.2306
0.70          0.1846                             0.1770
0.80          0.1388                             0.1441
0.90          0.0961                             0.1018
1.00          0.0694                             0.0735
11-pt Avg:    0.3419                             0.3148
% Change:                                        -7.9

Note that the average precision has increased approximately another 3 percent over the tfidf/cosine vector space method of ranking documents. The recall-precision table makes it quite clear that, except for the tail of recall-precision, the standardized extended logistic model performs substantially better than the tfidf/cosine vector space model for the CACM collection. Indeed, if we once again go through our process of computing average precision over all points of recall for each of the 52 CACM queries, and then do a statistical test of the difference between the standardized logistic method and the tfidf/cosine vector space average precision, we find this improvement is now statistically significant at the 2 percent level.

CACM Hypothesis Test: 2-tail T on Precision Differences
between Standardized Logistic and Tfidf/cosine methods

N     null mean    sample mean    sample SE    df      test stat    P value
52    0.0000       -0.030180      0.012335     51.0    -2.45        .0179

That is to say, we would expect, in less than two percent of all samples of queries applied to the CACM collection, the vector space model to perform better than the standardized logistic model. In this way, we can reject the null hypothesis that there is no difference in performance on the CACM collection, and accept the clear indication that standardized logistic regression performs better than the vector space model for the CACM collection.

On the other hand, if we apply this correction factor and standardization to the CISI collection, we find that there is a slight worsening in the performance results.

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

Recall        Standardized Logistic Precision    Vector Space Precision
11-pt Avg:    0.2042                             0.2137
% Change:                                        4.7

When comparing this result with the non-standardized extended logistic model for CISI, we see that the precision changes occur in the third decimal place. We can conclude that the difference between the two is certainly not statistically significant, nor is that between the CISI standardized and tfidf/cosine results. Among the reasons for this failure to achieve a performance improvement in the CISI collection are the different prior probabilities of relevance for the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of the prior probability of relevance.

12. Conclusions
In this research we have investigated a new probabilistic text and document search method based upon logistic regression. This logistic inference method estimates the probability of relevance for documents with respect to a query which represents the user's information need. Documents are then ranked in descending order of their estimated probability of relevance to the query.

1. The logistic inference method has been subjected to detailed performance tests comparing it (in terms of recall and precision averages) to the traditional tfidf/cosine vector space method, using the same retrieval and evaluation software (the Cornell SMART system) for both methods. In this way the test results remain free of bias which might be introduced from different software implementations of the evaluation methods. In terms of recall and precision averages, the logistic inference method outperforms the tfidf/cosine vector space method on the Cranfield and CACM test collections. The methods seem to perform equally well on the CISI test collection (for an appropriately estimated prior probability).

2. Statistical tests have been applied to ascertain whether these performance differences between the two methods are statistically significant. The performance improvement of the logistic inference method over the tfidf/cosine vector space method for the Cranfield and CACM collections is statistically significant at the five percent level. Performance differences for the CISI collection are not statistically significant, for the most plausible estimate of prior probability.

3. The use of standardized variables (statistical clues standardized to mean \mu = 0 and standard deviation \sigma = 1) seems to enable the training of and fitting for logistic regression coefficients to take place on the queries and documents of one collection and to be applied directly to the queries and documents of other collections.

13. Acknowledgments
The work described was part of the author's dissertation research at the School of Library and Information Studies at the University of California, Berkeley. Many of the ideas contained herein were jointly formulated with my dissertation advisor, Professor William S. Cooper, whose clarity of thinking and relentless persistence made this research both possible and doable. My outside committee member, Professor of Biostatistics Steve Selvin, provided much helpful advice on statistical techniques.

References
1. Salton G et al. The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, NJ, 1971

2. Salton G. Text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, Reading, MA-Menlo Park, CA, 1989

3. Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill, New York, 1983

4. Sparck-Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972; 28:11-21

5. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Information Processing and Management 1988; 24:513-523

6. Robertson S. The probability ranking principle in IR. Journal of Documentation 1977; 33:294-304

7. Robertson S, Sparck-Jones K. Relevance weighting of search terms. Journal of the ASIS 1976; 27:129-145

8. Cooper W. Inconsistencies and misnomers in probabilistic IR. In: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Ill, Oct 13-16, 1991, pp 57-61

9. Fuhr N, Huther H. Optimum probability estimation from empirical distributions. Information Processing and Management 1989; 25:493-507

10. Hosmer D, Lemeshow S. Applied logistic regression. John Wiley & Sons, New York, 1989

11. Fox E. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD dissertation, Computer Science, Cornell University, 1983

12. Fuhr N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 1989; 7:183-204

13. Fuhr N, Buckley C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 1991; 9:223-248

14. Haines D, Croft B. Relevance feedback and inference networks. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 2-12

15. Turtle H. Inference networks for document retrieval. PhD dissertation, University of Massachusetts, COINS Technical Report 90-92, February 1991

16. Fung R, Crawford S, Appelbaum L, Tong R. An architecture for probabilistic concept-based information retrieval. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium, September 5-7, 1990, pp 455-467

17. Swanson D. Information retrieval as a trial-and-error process. Library Quarterly 1977; 47:128-148

18. Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 329-338

19. Yu C, Buckley C, Lam H, Salton G. A generalized term dependence model in information retrieval. Information Technology: Research and Development 1983; 2:129-154

20. Cooper W, Gey F, Chen A. Information retrieval from the TIPSTER collection: an application of staged logistic regression. In: Proceedings of the First NIST Text Retrieval Conference, National Institute of Standards and Technology, Washington, DC, November 4-6, 1992, NIST Special Publication 500-207, March 1993, pp 73-88

21. Harman D. Overview of the first TREC conference. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 36-47

22. Gey F. Probabilistic dependence and logistic inference in information retrieval. PhD dissertation, University of California, Berkeley, 1993
