
Inferring Probability of Relevance Using the Method of Logistic Regression

Fredric C. Gey

UC Data Archive and Technical Assistance
University of California, Berkeley, USA
e-mail: gey@sdp2.berkeley.edu

Abstract

This research evaluates a model for probabilistic text and document retrieval; the model utilizes the method of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming each statistical clue with respect to its distribution in a training collection (one standardized to mean \mu = 0 and standard deviation \sigma = 1), the model allows one to apply coefficients derived from a training collection to other document collections with little loss of predictive power. The method is applied to three well-known test collections and compared directly to the vector space model, which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference model performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The performance differences were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.

1. Introduction
We can consider the central task of Information Retrieval to be that of providing intellectual access to the contents of a collection of texts or documents [1] [2] [3]. The heart of this task is the concise representation of the meanings of the texts contained in the collection. One model which has both a clear geometric interpretation and well-known retrieval properties is the vector space model developed by Gerard Salton at Cornell University.

In the vector space model, documents and queries are both represented as m-dimensional vectors, assuming the indexing vocabulary of nontrivial terms is of size m. To measure the degree to which a document matches a query, the model computes the cosine of the angle between the query vector and the document vector. The sparsest form of the query and document vectors merely indicates the presence or absence of a vocabulary term in either the query or the document. These vectors can be easily generalized by adding weights which measure, for the appropriate terms, the magnitude of their contribution to the meaning of the query or the document. Weights may be mechanically derived from attributes of terms. For example, for terms in the query, the attribute query absolute frequency (QAF) is the count of the number of occurrences of that term in the query; the corresponding attribute for terms in the document is document absolute frequency (DAF). The attribute inverse document frequency (IDF), which is usually logged, is the ratio N/n_j, where N is the number of documents in the collection and n_j is the number of documents in which term t_j occurs, that is, in which it is used to index the document.

Two of these attribute weights are well known. Inverse document frequency was first suggested by Sparck-Jones [4]. Salton and Buckley [5] identified a number of term weighting schemes used in the SMART retrieval system and computed retrieval performance measurements against several test collections. Their experiments show that weights formed as a simple multiple of term occurrence frequency factors (QAF for the query, DAF for the document) and inverse document frequency accounted for among the best retrieval performance.
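To make the tfidf/cosine baseline concrete, here is a minimal sketch (illustrative code, not taken from the paper; the function names and the particular tf-times-logged-IDF weighting variant are assumptions chosen for clarity):

```python
import math
from collections import Counter

def tfidf_vector(terms, doc_freq, n_docs):
    """Weight each term: absolute frequency times logged inverse document frequency."""
    counts = Counter(terms)
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Documents are ranked against a query by descending cosine value.
```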

By a probabilistic retrieval system we will mean a system which, in response to a query specification, returns documents to the user ranked in the order of their probability of satisfying the user's information need, and in which the formal rules for the computation of that probability are derived from the theory of probability and statistical inference. Such a probabilistic approach to the indexing and retrieval processes attempts to place the computation of the retrieval status value of documents on a theoretically sound footing. The primary focus of this paper is to test a probabilistic model of text retrieval which is both theoretically and practically satisfying. If the retrieval status value of a document can be computed as the probability of the document satisfying the user's information need, then documents can be returned to the user ranked in order of their probability of relevance.

The probability ranking principle [6] states that optimal retrieval will be achieved if documents are returned to the user in order of their probability of relevance. This leads to the so-called Binary Independence model [7], which was generalized to the Linked Dependence model by Cooper [8]. Unfortunately, the statistical methods of probabilistic retrieval face the problem of insufficient statistical evidence: even relevance feedback experiments which use half the collection as a training set to predict relevance have encountered problems of insufficient sample size. Improved estimation methods have been only partially successful with this problem [9].

2. The logistic inference model

If no assumptions are made about the character of the probabilities of relevance, we merely allow that between query q_i and document d_j there exists a probability of relevance P = P(R | q_i, d_j), and that a text retrieval system can associate with each query-document-term triple a finite set of attribute values (v_1, ..., v_n). These attributes are derived from properties of the query term t_q (such as the count of occurrences of the term in the query), of the document term t_d (such as the count of occurrences of the term in the document divided by the total number of term occurrences in the document), and of the collection (such as the number of documents in the collection which are indexed by the term), and these attribute values vary statistically from one particular collection to another.

The logistic inference model says that we can use a random sample of query-document-term triples for which binary relevance judgments have been made, and compute the logarithm of the odds of relevance for a term t_k which is present in both document d_j and query q_i by the formula:

\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + \cdots + c_n v_n

and further that the logarithm of the odds of relevance for the i-th query q_i = <t_1, ..., t_q> is obtained by summing over all terms:

\log O(R \mid q_i, d_j) = \log O(R) + \sum_{k=1}^{q} [\log O(R \mid q_i, d_j, t_k) - \log O(R)]

where O(R), known as the prior odds of relevance, is the odds that a document chosen at random from the collection will be relevant to query q_i. The coefficients which relate relevance to the individual query and document properties are derived by using the method of logistic regression [10], which fits an equation to predict a dichotomous dependent variable as a function of independent variables which may be continuous. Once the log odds of relevance has been computed from this simple linear formula, the inverse logistic transformation may be applied to directly obtain the probability of relevance of a document to a query:

P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}

Once the coefficients of the equation have been derived from a random sample of query-document-term-relevance quadruples within a particular collection, they may be used to predict the odds of relevance for other query-document pairs (past or future). Regression models which approximate probability of relevance by linear combinations of clues such as term frequency, authorship, and co-citation were first introduced to information retrieval by Ed Fox [11]. More recently, Fuhr and Buckley [12] [13] used polynomial regression to approximate document relevance. There are two major well-known problems with the application of ordinary regression approaches to probability of relevance: the outcome variable is dichotomous rather than continuous, and the error distribution is binomial rather than normal, as is usually assumed by ordinary regression. An alternative approach is to use Bayesian inference networks, as developed by Turtle [15] and others [14] [16].
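As an illustration of this fitting step, the sketch below estimates the coefficients c_0, ..., c_n from judged triples and applies the inverse logistic transformation. It is a minimal example assuming scikit-learn, with invented clue values, not the paper's actual pipeline (note that scikit-learn applies mild regularization by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per query-document-term triple: the attribute values v1..vn.
# y holds the binary relevance judgment of the corresponding pair.
X = np.array([[2.0, 0.10, 3.0, 0.012],   # hypothetical clue values
              [1.0, 0.05, 1.0, 0.004],
              [3.0, 0.15, 7.0, 0.021],
              [1.0, 0.05, 2.0, 0.008]])
y = np.array([1, 0, 1, 0])

fit = LogisticRegression().fit(X, y)     # log O(R|q,d,t) = c0 + c1*v1 + ... + cn*vn
c0, coefs = fit.intercept_[0], fit.coef_[0]

logodds = c0 + X @ coefs                 # per-triple log odds of relevance
prob = 1.0 / (1.0 + np.exp(-logodds))    # inverse logistic transformation
```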

3. Sampling
One pute logistic

for logistic
problems coefficients.

regression
in using Even logistic supposing regression that only is the computational half the document size necessary to comas a collection is used

of the computational regression

224
training nomical. query, attempt than not ments, pensates ments, and factor We Cranfields lowing sonable queries fit all, set the number Naturally, naturally non-relevant and for to construct of the relevant adjusting for a very ones. of query-document-term approach fraction which small To of the triples total which Since are to be used for computation the number in the higher smaller process of relevant proportion proportion by fewer factor collection, documents, becomes is, for upon astroeach us to if

a sampling a sample

is necessary.

documents of relevant

documents a significantly a significantly

it is incumbent we would a weight than of 30. Cranfield beginning

encompasses while relevant sampling

documents most, docucomdocutriple

assure

adequate

representation

of relevant

sample

documents non-relevant Since into

of non-relevant factor which non-relevant

documents the sample,

in the regression are typically thirtieth by applying upon retrieval which has which for say, every

the differential. to take the logistic chosen

documents

1/ lo(hh
non-relevant

it is reasonable it into have

query-document-term

regression

computation

a weighting to fit our numerous
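The following sketch shows one way such a weighted sample might be constructed (a hypothetical helper, assuming a 1-in-30 sampling rate for non-relevant triples; the resulting weights would then be passed to the regression fit, e.g. via scikit-learn's sample_weight argument):

```python
import random

def sample_triples(triples, keep_rate=1/30, seed=0):
    """Keep every relevant triple with weight 1; keep each non-relevant
    triple with probability keep_rate, compensating with weight 1/keep_rate
    so the fit still reflects the true proportion of non-relevant items."""
    rng = random.Random(seed)
    sample, weights = [], []
    for clues, relevant in triples:
        if relevant:
            sample.append((clues, relevant))
            weights.append(1.0)
        elif rng.random() < keep_rate:
            sample.append((clues, relevant))
            weights.append(1.0 / keep_rate)
    return sample, weights
```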

We have used the Cranfield test collection to fit our logistic model. The Cranfield collection is a long-standing one in information retrieval experiments; it has some well-known deficiencies, as outlined by Swanson [17], but it offers both desirable properties and a reasonable fit to our experiments. First, while it only has 1,400 documents, it has 225 queries, which seems to be a sufficient number to afford a reasonable regression fit. Second, the documents seem to possess a better fit to the characteristics of an ideal test collection. The characteristics of an ideal test collection should be:

- The queries are in natural language form, so that query statistics can be collected as well as document statistics.
- The documents have both titles and abstracts present (abstracts are missing from some of the CACM collection).
- The queries are a random sample from some larger query population, and there must be a sufficiently large number of queries in order to achieve some statistical significance.
- The documents are a random sample from a larger document collection, and relevance judgments apply to all query-document pairs in the collection. If this is not possible (as it cannot be for million-document collections), then the relevance judgments are a carefully crafted sample of all possible query-document pairs, and the sample size must be known.

We will later test our fit against the CACM collection (52 queries, 3,204 documents) and the CISI collection (76 queries, 1,460 documents).

4. Logistic regression model

Our particular logistic inference model has the following formula for six term attributes, or elementary clues: query absolute frequency (QAF), query relative frequency (QRF), document absolute frequency (DAF), document relative frequency (DRF), inverse document frequency (IDF), and relative frequency of the term in all documents (RFAD). The complete formula for the logodds of relevance, given the presence of term t_j, is then:

Z_{t_j} = \log O(R \mid t_j) = c_0 + c_1 \log(QAF) + c_2 \log(QRF) + c_3 \log(DAF) + c_4 \log(DRF) + c_5 \log(IDF) + c_6 \log(RFAD)

Query absolute frequency (QAF) has been previously defined. Query relative frequency (QRF) is equal to query absolute frequency divided by the total number of occurrences of all terms in the query. Similarly, document relative frequency (DRF) is document absolute frequency divided by the total number of occurrences of all terms in the document, i.e., by document length. Relative frequency in all documents (RFAD) is the total number of occurrences of the term in the collection divided by the total number of occurrences of all terms in the entire collection.

You will note that all terms in the formula are logged. One reason for using logarithms is to dampen the influence of frequency information. It is unrealistic to assume, for example, that 50 occurrences of a term in a document make the term five times more important than 10 occurrences. Another reason is that taking logarithms smooths out a skewed distribution of clue values. Finally, logged variables have nice behavioral properties: after an anti-logarithm is applied, sums of logged variables behave as products of the raw items. Indeed, in comparing the fit for logged and non-logged clues, a higher maximum likelihood is in general attained for the logged clues (at least for these six clues).
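A small sketch of how the six logged clues might be computed for one query-document-term triple follows (an illustrative helper written for this summary; the argument names are assumptions, and IDF is taken here in its ratio form N/n_j before logging):

```python
import math

def term_clues(qaf, query_len, daf, doc_len, n_docs, n_docs_with_term,
               term_occs_in_coll, total_occs_in_coll):
    """Return the six logged clues for one query-document-term triple."""
    qrf = qaf / query_len                    # query relative frequency
    drf = daf / doc_len                      # document relative frequency
    idf = n_docs / n_docs_with_term          # inverse document frequency, N/n_j
    rfad = term_occs_in_coll / total_occs_in_coll
    return {'log_qaf': math.log(qaf), 'log_qrf': math.log(qrf),
            'log_daf': math.log(daf), 'log_drf': math.log(drf),
            'log_idf': math.log(idf), 'log_rfad': math.log(rfad)}
```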
The fit for this extended logistic model is done against the same sample of 20,246 query-document-term triples taken from the Cranfield collection. The following coefficients are then derived from this fit:

Z_{t_j} = \log O(R \mid t_j) = -0.2085 - 0.2036 \log(QAF) + 0.19143 \log(QRF) + 0.16789 \log(DAF) + 0.57544 \log(DRF) + 1.5967 \log(IDF) + 0.75033 \log(RFAD)

What remains is to compute the prior odds of relevance for this collection. The prior probability is estimated by counting the number of query-document pairs for which a judgment of relevance has been assigned (1,838 pairs for the Cranfield collection) and then dividing by the total number of query-document pairs (for Cranfield, 225 queries times 1,400 documents). Hence:

Prior = \frac{1838}{225 \times 1400} = 0.005835

becomes the prior probability of relevance,

logprior = \log O(R) = \log\left(\frac{0.005835}{1 - 0.005835}\right) = -5.138

and the final formula for ranking is:

\log O(R \mid q_i, d_j) = -5.138 + \sum_{j=1}^{q} (Z_{t_j} + 5.138)
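Putting the fitted coefficients and the prior together, a document's ranking score might be computed as in the sketch below (illustrative code using the constants reported above; the helper names and the clue-dict layout are assumptions):

```python
# Cranfield coefficients from the fitted equation above.
C = {'const': -0.2085, 'log_qaf': -0.2036, 'log_qrf': 0.19143,
     'log_daf': 0.16789, 'log_drf': 0.57544, 'log_idf': 1.5967,
     'log_rfad': 0.75033}
LOG_PRIOR = -5.138  # log prior odds of relevance for Cranfield

def term_logodds(clues):
    """Z_tj for one term present in both query and document."""
    return C['const'] + sum(C[name] * value for name, value in clues.items())

def document_score(matching_term_clues):
    """log O(R|q,d) = log O(R) + sum over matching terms of (Z_tj - log O(R))."""
    return LOG_PRIOR + sum(term_logodds(c) - LOG_PRIOR
                           for c in matching_term_clues)
```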

5. Performance evaluation: statistical significance tests

A key part of this paper is to compare the performances of two major retrieval methods: our logistic regression probabilistic model, which ranks documents according to probability of relevance, against the vector space model, which ranks documents according to the cosine of the angle between the query vector and the document vector. Performance evaluation is usually described in terms of recall and precision measures. Recall is the fraction of the relevant documents in the collection which have been retrieved at a certain point in the retrieval process, while precision is the fraction of retrieved documents which are relevant. These measures can be computed for a single query, but are usually averaged over all queries in the collection. It is our intent to use recall and precision to measure retrieval performance; however, it is also our intention to subject performance differences to a statistical test to determine whether the difference is significant or could have occurred by chance.

The reason one wishes to test statistical significance is that one method may perform better on a particular query and yet the performance could be reversed for the next query posed. In our model we assume that a query is chosen at random, and our method ranks documents according to probability of relevance for the randomly chosen query. Thus, if we have a randomly chosen set of queries, the performance might be entirely different for the next randomly chosen set of queries. We therefore need a test, for the two methods we are attempting to compare (such as logistic inference versus tfidf/cosine vector space), which will show whether the performance differences of the methods are statistically significant. Such a test, well known in the literature, would be a T-test which compares the differences between performance in ranking for each query. Our statistical approach, therefore, will be (see the sketch following this list):

- For each query, take the ranking of relevant documents by each method and compute an average precision value over all ranks (all levels of recall). This will give a unique number for each query for each method.
- Compute the difference between the two average precision numbers found with the two methods being compared. Thus, in applying the T-test, the sample size will equal the number of queries in the test collection.
- Apply a T-test to these differences; under the null hypothesis that the methods perform identically, their mean difference should be zero, and hence a mean difference significantly different from zero (relative to its standard deviation) indicates that the methods perform differently.

Hull [18] provides a summary of possible statistical tests which might be used to evaluate retrieval experiments.
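A compact sketch of this procedure (hypothetical per-query values; assumes SciPy's paired t-test, which matches the two-tailed test reported below):

```python
import numpy as np
from scipy import stats

def average_precision(relevant_ranks):
    """Average precision over the (1-based) ranks of the relevant documents."""
    ranks = sorted(relevant_ranks)
    return float(np.mean([(i + 1) / r for i, r in enumerate(ranks)]))

# One average-precision value per query for each method (invented numbers).
ap_logistic = np.array([0.52, 0.31, 0.44, 0.27, 0.61])
ap_cosine   = np.array([0.47, 0.33, 0.38, 0.21, 0.55])

# Two-tailed paired t-test on per-query differences; the null hypothesis
# is that the mean difference is zero (the methods perform identically).
t_stat, p_value = stats.ttest_rel(ap_cosine, ap_logistic)
```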


6. Logistic inference performance for the Cranfield collection

For the logistic inference model on the Cranfield collection, the following table displays the recall-precision performance of the two methods, averaged over the 225 queries:

Logistic Inference versus Tfidf/cosine Vector Space Performance
Cranfield Collection: Averages over 225 Queries

Recall        Logistic Precision    Vector Space Precision
0.00          0.8330                0.7787
0.10          0.8116                0.7440
0.20          0.7129                0.6434
0.30          0.6021                0.5301
0.40          0.5161                0.4380
0.50          0.4503                0.3814
0.60          0.3698                0.2994
0.70          0.2859                0.2267
0.80          0.2280                0.1882
0.90          0.1640                0.1379
1.00          0.1464                0.1251
11-pt Avg:    0.4655                0.4084
% Change:                           -12.3

If we further compute, for each of the 225 queries, the difference in average precision between the two methods (cosine and logistic), we find that the results are statistically significant at the one tenth of one percent level:

Hypothesis between N null mean 225 Thus we can see that clues, 0.0000 making

Test:

two-tail inference

t on Precision and Tfifdf/cosine df

Differences methods test P

Logistic sample mean -0.059885

sample SE 0.0072063 probabilistic model

stat 224 of logistic over -8.31 regression .0000 on the set of six term vector space model.

use of the full

property

we obtain

a statistically

significant

improvement

the tfidf/cosine

7. A criticism of the Cranfield results

The above results obtained for the logistic inference model have a fundamental difficulty. The entire set of queries, documents and terms from which the sample used to fit the logistic regression was drawn was also used, when the resulting logistic models ranked documents, to test the model. Naturally, we have a chicken and egg process, whereby the relevance judgments used to train the coefficients are included within the set upon which the test is based. In past tests of such models [19], the approach has been to use half of the queries to train the regression and then to apply the resulting coefficients in testing the other half of the queries. Since a large number of queries is necessary to obtain a decent fit, our approach will be, instead, to use the entire Cranfield collection as a training set and to apply the fitted model to the other collections in our test repertoire, in particular the CACM and CISI collections.

8. Logistic inference for CACM

If we apply the logistic regression equation fitted on the Cranfield sample to the CACM collection, we obtain the following result:

Cranfield Logistic versus Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

Recall        Logistic Precision    Vector Space Precision
11-pt Avg:    0.3322                0.3148
% Change:                           -5.2

While it may seem that the extended logistic model performs 5.2 percent better than the tfidf/cosine vector space model for the CACM collection, we must apply the same hypothesis test:

CACM Hypothesis Test: 2-tail T on Precision Differences
between Logistic Inference and Tfidf/cosine methods

N     null mean    sample mean    sample SE    df      test stat    P value
52    0.0000       -0.020528      0.015866     51.0    -1.29        .2016

We find that this difference is not statistically significant, and hence we cannot reject the null hypothesis that, in fact, given another set of queries, the tfidf/cosine vector space model might perform as well as or better than the extended logistic model.

9. Cranfield logistic model for CISI

Applying the Cranfield-fitted logistic regression equation to the CISI collection, we find:

Cranfield Logistic versus Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

Recall        Logistic Precision    Vector Space Precision
11-pt Avg:    0.2088                0.2137
% Change:                           4.4

While this seems to indicate better vector space performance, we must apply the statistical test on the average precision measures for the 76 CISI queries:

CISI Hypothesis Test: 2-tail T on Precision Differences
between Logistic Inference and Tfidf/cosine methods

N     null mean    sample mean    sample SE     df      test stat    P value
76    0.0000       0.0056973      0.0087791     75.0    0.65         .5183

We find that there can be no statistically significant difference between the performance of the extended logistic model and the vector space model for the CISI collection.

Thus, while the blind application of the Cranfield-derived coefficients directly to both the CACM collection and the CISI collection produces logistic inference results comparable to the vector space model, in neither collection can we find the difference to be statistically significant. The blind application of coefficients derived from one collection to another collection clearly leaves something to be desired. In the next section, we apply some thought to the transformation of coefficients, in order to adapt them to the statistical distribution of the clues in the other collections.

10. Standardized variables

The direct application of coefficients derived from one collection to another makes a very sensitive statistical assumption. It assumes not only that the clues from the query-document-term sample and set of queries of one collection have the identical mean and standard deviation as the clues from another collection and another set of queries, but also that they have the identical underlying statistical distribution as indicators of probability of relevance as in the original collection on which the logistic regression fitting has been done. Even though each clue might derive from the same underlying probability distribution, it seems highly unlikely that there would be no variation from collection to collection in the mean and standard deviation of a clue's distribution. This gives us insight into one possible adaptation of coefficients which fit one collection to the clue distributions of another collection. It is well known that one can take the actual distribution of any probability density function (even though the underlying distribution is not known) and transform it into a standardized distribution, one which has mean zero (\mu = 0) and standard deviation one (\sigma = 1). If the coefficients of a logistic fit over such standardized clues can be computed, then we can also apply those coefficients to the standardized clue distributions of yet another collection, using the same clues to indicate probability of relevance. Such an adaptation makes the following assumption:

- For each clue, the underlying probability distribution of that clue remains identical from collection to collection, changing only in its mean and standard deviation, i.e., only the mean and standard deviation of the particular collection-dependent distribution will change while the underlying standardized probability distribution remains constant over all collections.
One might well question this assumption on the following grounds:

Suppose the mean \mu_i and standard deviation \sigma_i for clue i vary meaningfully from collection to collection, and that these differences have predictive power with respect to relevance. Then the application of standardization of distributions will remove this predictive power.

In the face of such thinking, the best approach is to experimentally test the assumption. If we accept the assumption for our purposes of testing, we can derive the standardized coefficients directly from a first collection (here we assume the Cranfield collection) and compute the adjusted coefficients for any other collection to which we wish to apply our logistic regression technique. In terms of formulae, for the case of any two clues X and Y, we obtain the standardized coefficients c_0', c_1', and c_2' for the equation:

\log O(R \mid x_1, y_1) = c_0' + c_1' \left(\frac{x_1 - \mu_{X_1}}{\sigma_{X_1}}\right) + c_2' \left(\frac{y_1 - \mu_{Y_1}}{\sigma_{Y_1}}\right)

Given the equation above for the source collection, if we then look at the equation for the target collection, assuming the same standardized distribution, we would apply the same coefficients to the second collection. If the standardized distribution remains constant from collection to collection, we have the same equation for collection two:

\log O(R \mid x_2, y_2) = c_0' + c_1' \left(\frac{x_2 - \mu_{X_2}}{\sigma_{X_2}}\right) + c_2' \left(\frac{y_2 - \mu_{Y_2}}{\sigma_{Y_2}}\right)

and we wish to derive the coefficients c_0, c_1, and c_2 for the equation:

\log O(R \mid x_2, y_2) = c_0 + c_1 x_2 + c_2 y_2

But if we multiply out the first equation we obtain

\log O(R \mid x_2, y_2) = c_0' - c_1' \frac{\mu_{X_2}}{\sigma_{X_2}} - c_2' \frac{\mu_{Y_2}}{\sigma_{Y_2}} + \frac{c_1'}{\sigma_{X_2}} x_2 + \frac{c_2'}{\sigma_{Y_2}} y_2

and hence

c_0 = c_0' - c_1' \frac{\mu_{X_2}}{\sigma_{X_2}} - c_2' \frac{\mu_{Y_2}}{\sigma_{Y_2}}, \qquad c_1 = \frac{c_1'}{\sigma_{X_2}}, \qquad c_2 = \frac{c_2'}{\sigma_{Y_2}}

Thus to find the logistic coefficients for a target collection, we need only determine the mean, \mu_v, and standard deviation, \sigma_v, of each clue v in that collection, compute the new coefficients, and apply them directly to the new collection. This methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text retrieval conference (TREC-1) [21]. This paper provides a more complete analysis of the method.
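The coefficient transformation just derived generalizes directly from two clues to n clues; a small sketch (hypothetical function, following the formulas above):

```python
def adapt_coefficients(c0_prime, c_primes, target_stats):
    """Turn standardized coefficients (c0', c1', ..., cn', fit on clues
    standardized to mean 0 and std dev 1) into raw-clue coefficients for a
    target collection, given that collection's per-clue (mu, sigma) pairs."""
    c0 = c0_prime - sum(cp * mu / sigma
                        for cp, (mu, sigma) in zip(c_primes, target_stats))
    cs = [cp / sigma for cp, (_, sigma) in zip(c_primes, target_stats)]
    return c0, cs
```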

11. Applying to CACM and CISI collections

In order to obtain the means and standard deviations of all clues for the new collections, we must obtain a new sample (of 1 in 125) of query-document-term triples from each of those collections. From these triples and all the associated clue values, we can compute the following means and standard deviations for CACM and CISI:

Coll.           log QAF    log DAF    log QRF    log DRF    log IDF    log RFAD
CISI mean       0.3031     0.4093     -3.5450    -3.7922    1.9045     -5.5303
CISI std dev    0.5153     0.5795     0.9110     0.6944     0.8333     1.0675
CACM mean       0.1763     0.3900     -2.7644    -3.4976    2.0968     -5.2434
CACM std dev    0.3347     0.5855     0.5612     0.7999     1.1378     1.2146

We then use these means and standard deviations together with the Cranfield coefficients to obtain the following new logistic (logodds) coefficients for the CACM and CISI collections:

Coll.    const     log QAF      log DAF    log QRF    log DRF    log IDF    log RFAD
CRAN     -4.125    -0.03229     0.1059     0.06910    0.4193     1.4515     0.7734
CACM     -1.341    -0.08412     0.1520     0.11993    0.5391     1.0933     0.5507
CISI     -0.934    -0.061660    0.1622     0.0663     0.5572     1.5325     0.6680

If we then apply the fit in order to obtain rankings for all queries for both the CACM and CISI collections, we obtain the following recall-precision results:

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

Recall        Standardized Logistic Precision    Vector Space Precision
0.00          0.7452                             0.6985
0.10          0.6040                             0.5683
0.20          0.5338                             0.4830
0.30          0.4392                             0.3887
0.40          0.3683                             0.3210
0.50          0.3128                             0.2761
0.60          0.2682                             0.2306
0.70          0.1846                             0.1770
0.80          0.1388                             0.1441
0.90          0.0961                             0.1018
1.00          0.0694                             0.0735
11-pt Avg:    0.3419                             0.3148
% Change:                                        -7.9

Note that the average precision has increased approximately another 3 percent over the tfidf/cosine vector space method of ranking documents. The recall-precision table makes it quite clear that, except for the tail of recall-precision, the standardized extended logistic model performs substantially better than the tfidf/cosine vector space model for the CACM collection. Indeed, if we once again go through our process of computing average precision over all points of recall for each of the 52 CACM queries, and then do a statistical test of the difference between the standardized logistic method and the tfidf/cosine vector space average precision, we find this improvement is now statistically significant at the 2 percent level.

CACM Hypothesis Test: 2-tail T on Precision Differences
between Standardized Logistic and Tfidf/cosine methods

N     null mean    sample mean    sample SE    df      test stat    P value
52    0.0000       -0.030180      0.012335     51.0    -2.45        .0179

That is to say, we would expect, in less than two percent of all samples of queries applied to the CACM collection, the vector space model to perform better than the standardized logistic model. In this way, we can reject the null hypothesis that there is no difference in performance on the CACM collection, and accept the clear indication that standardized logistic regression performs better than the vector space model for the CACM collection.

On the other hand, if we apply this correction factor and standardization to the CISI collection, we find that there is a slight worsening in the performance results.

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

Recall        Standardized Logistic Precision    Vector Space Precision
11-pt Avg:    0.2042                             0.2137
% Change:                                        4.7

When comparing this result with the non-standardized extended logistic model for CISI, we see that the precision changes occur in the third decimal place. We can conclude that the difference between the two is certainly not statistically significant, nor is that between the CISI standardized and tfidf/cosine results. Among the reasons for this failure to achieve a performance improvement in the CISI collection are the different prior probabilities of relevance for the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of the prior probability of relevance.

12. Conclusions
In this research we have investigated a new probabilistic text and document search method based upon logistic regression. This logistic inference method estimates the probability of relevance for documents with respect to a query which represents the user's information need. Documents are then ranked in descending order of their estimated probability of relevance to the query.

1. The logistic inference method has been subjected to detailed performance tests comparing it (in terms of recall and precision averages) to the traditional tfidf/cosine vector space method, using the same retrieval and evaluation software (the Cornell SMART system) for both methods. In this way the test results remain free of bias which might be introduced from different software implementations of the evaluation methods. In terms of recall and precision averages, the logistic inference method outperforms the tfidf/cosine vector space method on the Cranfield and CACM test collections. The methods seem to perform equally well on the CISI test collection (for an appropriately estimated prior probability).

2. Statistical tests have been applied to ascertain whether these performance differences between the two methods are statistically significant. The performance improvement of the logistic inference method over the tfidf/cosine vector space method for the Cranfield and CACM collections is statistically significant at the five percent level. Performance differences for the CISI collection are not statistically significant, for the most plausible estimate of prior probability.

3. The use of standardized variables (statistical clues standardized to mean \mu = 0 and standard deviation \sigma = 1) seems to enable the training of and fitting for logistic regression coefficients to take place on the queries and documents of one collection and to be applied directly to the queries and documents of other collections.

13. Acknowledgments
The work described was part of the author's dissertation research at the School of Library and Information Studies at the University of California, Berkeley. Many of the ideas contained herein were jointly formulated with my dissertation advisor, Professor William S. Cooper, whose clarity of thinking and relentless persistence made this research both possible and doable. My outside committee member, Professor of Biostatistics Steve Selvin, provided much helpful advice on statistical techniques.

References
1. Salton G et al. The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, NJ, 1971

2. Salton G. Text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, Reading, MA-Menlo Park, CA, 1989

3. Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill, New York, 1983

4. Sparck-Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972; 28:11-21

5. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Information Processing and Management 1988; 24:513-523

6. Robertson S. The probability ranking principle in IR. Journal of Documentation 1977; 33:294-304

7. Robertson S, Sparck-Jones K. Relevance weighting of search terms. Journal of the ASIS 1976; 27:129-145

8. Cooper W. Inconsistencies and misnomers in probabilistic IR. In: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Ill, Oct 13-16, 1991, pp 57-61

9. Fuhr N, Huther H. Optimum probability estimation from empirical distributions. Information Processing and Management 1989; 25:493-507

10. Hosmer D, Lemeshow S. Applied logistic regression. John Wiley & Sons, New York, 1989

11. Fox E. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD dissertation, Computer Science, Cornell University, 1983

12. Fuhr N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 1989; 7:183-204

13. Fuhr N, Buckley C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 1991; 9:223-248

14. Haines D, Croft B. Relevance feedback and inference networks. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 2-12

15. Turtle H. Inference networks for document retrieval. PhD dissertation, University of Massachusetts, COINS Technical Report 90-92, February 1991

16. Fung R, Crawford S, Appelbaum L, Tong R. An architecture for probabilistic concept-based information retrieval. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium, September 5-7, 1990, pp 455-467

17. Swanson D. Information retrieval as a trial-and-error process. Library Quarterly 1977; 47:128-148

18. Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 329-338

19. Yu C, Buckley C, Lam H, Salton G. A generalized term dependence model in information retrieval. Information Technology: Research and Development 1983; 2:129-154

20. Cooper W, Gey F, Chen A. Information retrieval from the TIPSTER collection: an application of staged logistic regression. In: Proceedings of the First NIST Text Retrieval Conference, National Institute of Standards and Technology, Washington, DC, November 4-6, 1992, NIST Special Publication 500-207, March 1993, pp 73-88

21. Harman D. Overview of the first TREC conference. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 36-47

22. Gey F. Probabilistic dependence and logistic inference in information retrieval. PhD dissertation, University of California, Berkeley, 1993
