
MINOR PROJECT SYNOPSIS

Implementation of the Fuzzy Logic based IR system for TREC dataset.


Submitted in partial fulfilment of the requirements
For the award of degree of
Bachelor of Technology
In
Computer Science Engineering
TEAM MEMBERS: PROJECT GUIDE:
Prabhjot Singh (03311502711)
Sumit Dhawan (04311502711)
Shubham Agarwal (04011502711) Mrs Narina Thakur
BHARATI VIDYAPEETH'S COLLEGE OF ENGINEERING
A-4, PASCHIM VIHAR, ROHTAK ROAD, NEW DELHI-110063
AFFILIATED TO
GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI
ABSTRACT
Relevance, evaluation, and information needs are key issues in Information Retrieval. Relevance
is the relational value of a given user query to the documents within the database. The relevance
of a document is normally determined by a document ranking algorithm. These algorithms define how
relevant a document is to a user query by using functions that relate the query to the documents
collected in the index.
Information needs concern how the user interacts with the information retrieval system. The data
within the system should be easy to access in a way that is convenient to the user. In some
systems retrieving too much information is inconvenient, while in others failing to return all
relevant information may be unacceptable. Research reveals the effectiveness of fuzzy logic in
handling the uncertainty and vagueness of queries and documents.
In the present project, a fuzzy-based similarity measure is implemented using the TREC data
collection. The performance of the proposed similarity measure is evaluated and compared with
Cosine similarity measures on the basis of Precision-Recall curves for individual queries.
INDEX
1. Objective
2. Introduction
3. Fuzzy Logic based information system
3.1 TREC Dataset
3.2 Information Retrieval
3.3 Fuzzy Logic
3.4 Hard Science with If-then rules
3.5 Propositional Fuzzy Logic
3.6 Information Retrieval on the web
4. Evaluation of performance of IR systems
5. References
1. OBJECTIVE
This project implements an improved IR system using a fuzzy logic based similarity
measure. Implementing the fuzzy logic based similarity measure is expected to improve the
IR system's performance. The performance of the fuzzy logic based similarity measure is
compared with Cosine similarity measures on the TREC data collection.
2. INTRODUCTION
In the last couple of decades, various methods have been suggested to improve the performance of
Information Retrieval (IR) systems. An IR system generally deals with the retrieval of relevant
documents against user-defined queries. Baeza-Yates defined a general Information Retrieval model
as a quadruple [D, Q, F, R(q_i, d_j)], where D is a set composed of logical views for the documents
in the collection, Q is a set composed of logical views for the user information needs expressed as
queries, F is a framework for modeling document representations, queries and their relationships,
and R(q_i, d_j) is a similarity measure/ranking function which associates a real number with a
query q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the
documents with regard to the query q_i. The similarity measure therefore plays an important role in
developing a quality IR system.
An IR system evaluates relevancy using some representation of a document and a query. There
are different models for representing documents and queries, and each model has its pros and cons.
The Boolean model was the first model and was adopted by most of the earlier systems; even
today some commercial systems use this model, which makes use of the concepts of Boolean
logic and set theory.
The documents and queries are collections of terms, and each term from a document is indexed.
The presence and absence of a term in a document are represented by 1 and 0 respectively. For
term matching between document and query we maintain an inverted index of the terms, i.e. for each
term we store a list of the documents that contain the term. However, the Boolean model has some
major limitations, such as a binary decision criterion without any notion of a grading scale, and
overloading of documents. While some researchers have tried to overcome the weaknesses of the
Boolean model by building refinements to it, others have approached IR with a different
search strategy called the Vector Space model.
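The inverted index and Boolean matching described above can be illustrated with a minimal sketch. The sample documents, the whitespace tokenisation and the function names are assumptions made for illustration only; they are not taken from the project.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenisation (assumption)
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return ids of documents that contain every query term (Boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy collection
docs = {
    1: "fuzzy logic for information retrieval",
    2: "vector space model for retrieval",
    3: "fuzzy sets and membership functions",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, ["fuzzy", "retrieval"])))
```

A document either matches or does not, which is the binary decision criterion noted above as a limitation: nothing in the result says how well a document matches.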
The Vector Space Model, as the name implies, represents documents and queries internally in the
form of vectors. In the vector space model all queries and documents are represented as vectors in
|V|-dimensional space, where V is the set of all distinct terms in the collection (the vocabulary).
Some of the advantages of the Vector Space Model are that it is a simple and fast model, that it can
handle weighted terms, that it produces a ranked list as output, and that the indexing process is
automated, which means a significantly lighter workload for the administrator of the collection.
Also, it is easy to modify individual vectors, which is essential for the query expansion technique
and the logic based similarity measure. Therefore, the vector space model is used as a base model in
this paper.
An IR system needs to calculate the similarity of the query and a particular document in order to
decide the relevancy of that document to the query. When a document retrieval system is used to
query a collection of documents with n terms, the system computes a vector D (d_i1, d_i2 ... d_in) of
size n for each document. The vectors are filled with the term weights and, similarly, a vector
Q (W_q1, W_q2 ... W_qn) is constructed for the terms found in the query. In recent years, some efforts
have been made to construct an effective similarity measure for enhancing the performance of IR systems.
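As a point of reference for the comparison made in this synopsis, the Cosine similarity between a document vector and a query vector of the same size n can be sketched as follows. The function name and the sample weights are illustrative assumptions.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document weights d and query weights q."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Hypothetical weights over a three-term vocabulary
document = [0.5, 0.0, 1.2]
query = [1.0, 0.0, 1.0]
print(cosine_similarity(document, query))
```

The score depends only on the direction of the two vectors, so a long document is not favoured merely for repeating terms.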
Fan presented similarity functions as trees and a classical generational scheme. Pathak et al. have
proposed the idea of a combined similarity measure, in which a linear combination of various
similarity measures is formed and the weight of each similarity measure is then optimized
using GA. Mehran Sahami proposes a novel method for measuring the similarity between short text
snippets by leveraging web search results to provide greater context for the short texts.
Vincent Schickel-Zuber et al. present a novel approach that allows similarities to be asymmetric
while still using only information contained in the structure of the ontology. Torra et al. presented a
method to calculate similarity between words based on dictionaries using fuzzy graphs.
Chen presented a new similarity measure based on the geometric mean averaging
operator to handle the similarity problems of generalized fuzzy numbers. Usharani et al. proposed a
genetic algorithm based method for finding the similarity of web documents based on cosine similarity.
In the past, the most popular similarity measures used in IR systems have been Cosine, Euclidean,
Jaccard and Okapi.
In the present project, a Fuzzy Logic Based Similarity Measure (FLBSM) is proposed for the vector
space IR model. The performance of the proposed similarity measure would be evaluated and compared
with the above-mentioned similarity measures.
3. Proposed Fuzzy Logic Based Similarity Measure
The IR system retrieves documents based on a query given by the user. In most cases, both
queries and documents are vague or imprecise and are usually expressed in Natural Language (NL).
Sometimes the user may change the query during the information retrieval process and/or may not be
conscious of the exact information needed [1].
Fuzzy Logic is therefore very suitable for handling this uncertainty, vagueness and impreciseness.
Fuzzy logic is based on Fuzzy Set theory and membership functions.
Documents retrieved by a query are evaluated by the rules of a Fuzzy Inference System (FIS). The
Vector Space Model is used as a base model due to its advantages over other models. In this FIS, we
have used three input variables: term frequency (tf), inverse document frequency (idf) and overlap,
and one output variable: relevance. These input variables are very useful for determining the
relevancy of a document to a particular query. TF indicates the number of occurrences of a term in
each document of the corpus [2].
IDF can be given as log (N/n), where N is the total number of documents in the corpus and n is the
number of documents that contain the term. Overlap reflects how many of the terms of the query are
found in a document. A Mamdani type fuzzy inference system is used in FLBSM with the help of the
Matlab Fuzzy Logic Toolbox.
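The three input variables can be computed as in the following sketch. The text gives IDF as log (N/n) without stating the base, so the natural logarithm is an assumption here; expressing overlap as the fraction of distinct query terms found in the document is likewise an assumption, as are the function names and the toy corpus.

```python
import math

def term_frequency(term, doc_tokens):
    """Number of occurrences of the term in one document."""
    return doc_tokens.count(term)

def inverse_document_frequency(term, corpus):
    """log(N / n): N documents in the corpus, n of them containing the term."""
    n = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / n) if n else 0.0  # 0.0 for unseen terms (assumption)

def overlap(query_tokens, doc_tokens):
    """Fraction of distinct query terms that occur in the document (assumption)."""
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)

# Hypothetical tokenised corpus
corpus = [
    ["fuzzy", "logic", "retrieval"],
    ["vector", "space", "model"],
    ["fuzzy", "sets"],
    ["cosine", "similarity"],
]
print(term_frequency("fuzzy", corpus[0]))
print(inverse_document_frequency("fuzzy", corpus))
print(overlap(["fuzzy", "retrieval"], corpus[2]))
```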
The ranges of the input variables tf and idf and of the output variable relevance are represented by
LOW, MEDIUM and HIGH, while the range of the input variable overlap is represented by LOW and HIGH.
In this paper, a triangular membership function is used to map the input space to a degree of
membership of a fuzzy set. The details of the membership functions for the input and output variables
of FLBSM are shown in Fig. 1 [3].
Fig 1: membership functions for input and output variables of FLBSM
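A triangular membership function can be sketched as below. The breakpoints used for LOW, MEDIUM and HIGH are hypothetical values on a variable normalised to [0, 1]; the actual breakpoints are those of Fig. 1 and are not reproduced here.

```python
def triangular(x, a, b, c):
    """Degree of membership of x in a triangle with feet a, c and peak b (a <= b <= c)."""
    if x <= a or x >= c:
        # Outside the support, except at a shoulder where the peak sits on a foot
        return 1.0 if x == b else 0.0
    if x <= b:
        return (x - a) / (b - a)  # rising edge
    return (c - x) / (c - b)      # falling edge

# Hypothetical fuzzy sets on a normalised variable
LOW = (0.0, 0.0, 0.5)
MEDIUM = (0.0, 0.5, 1.0)
HIGH = (0.5, 1.0, 1.0)

value = 0.25
print(triangular(value, *LOW), triangular(value, *MEDIUM), triangular(value, *HIGH))
```

A single crisp value can belong to two neighbouring sets at once, which is what lets the rules of the inference system fire to different degrees.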
Fuzzy rules are derived from the tf.idf weighting scheme, i.e. if a query term has high tf and high
idf in a document, then relevance is likely to be high. If many of the terms of the query are found
in the document (overlap), then relevance is likely to be high. It is known that if rules that
penalize low features are added, the performance of the system increases. So the following rules are
constructed for each of the query terms:
• If (tf is High) and (idf is High) then (relevance is High).
• If (tf is Medium) and (idf is Medium) then (relevance is Medium).
• If (tf is Low) and (idf is Low) then (relevance is Low).
Two fuzzy rules are also defined for overlap as follows:
• If (overlap is High) then (relevance is High).
• If (overlap is Low) then (relevance is Low).
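The five rules above can be evaluated for one query term as in the following simplified sketch. It is not the Matlab Fuzzy Logic Toolbox system used in the project: inputs are assumed to be normalised to [0, 1], the membership breakpoints are hypothetical, "and" is taken as the minimum, rules with the same consequent are combined with the maximum, and the crisp relevance is a weighted average of the output set peaks rather than the defuzzification of an aggregated output set that a full Mamdani system would perform.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (a <= b <= c)."""
    if x <= a or x >= c:
        return 1.0 if x == b else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical membership functions on normalised inputs
LOW, MEDIUM, HIGH = (0.0, 0.0, 0.5), (0.0, 0.5, 1.0), (0.5, 1.0, 1.0)
OVERLAP_LOW, OVERLAP_HIGH = (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
# Peak of each output set, used by the simplified defuzzification
PEAK = {"low": 0.0, "medium": 0.5, "high": 1.0}

def relevance(tf, idf, overlap):
    """Fire the five rules and return a crisp relevance in [0, 1]."""
    strength = {
        "high": max(min(tri(tf, *HIGH), tri(idf, *HIGH)),
                    tri(overlap, *OVERLAP_HIGH)),
        "medium": min(tri(tf, *MEDIUM), tri(idf, *MEDIUM)),
        "low": max(min(tri(tf, *LOW), tri(idf, *LOW)),
                   tri(overlap, *OVERLAP_LOW)),
    }
    total = sum(strength.values())
    if total == 0:
        return 0.0  # no rule fired
    return sum(PEAK[label] * s for label, s in strength.items()) / total

print(relevance(0.9, 0.9, 0.9))
print(relevance(0.1, 0.1, 0.1))
```

Because every rule is evaluated, a term with middling tf and idf still contributes a graded relevance instead of the all-or-nothing decision of the Boolean model.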
3.1 TREC Dataset
The Text REtrieval Conference (TREC) is an on-going series of workshops focusing on a list of
different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National
Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects
Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of
the TIPSTER Text program. Its purpose is to support and encourage research within the information
retrieval community by providing the infrastructure necessary for large-scale evaluation of text
retrieval methodologies and to increase the speed of lab-to-product transfer of technology.
Each track has a challenge wherein NIST provides participating groups with data sets and test
problems. Depending on the track, test problems might be questions, topics, or target extractable
features. Uniform scoring is performed so that the systems can be fairly evaluated. After evaluation
of the results, a workshop provides a place for participants to collect together thoughts and ideas
and present current and future research work [4].
TREC systems often provide a baseline for further research. Examples include:
• Hal Varian, Chief Economist at Google, says "Better data makes for better science. The
history of information retrieval illustrates this principle well," and describes TREC's
contribution.
• TREC's Legal track has influenced the e-Discovery community both in research and in
evaluation of commercial vendors.
• The IBM researcher team building IBM Watson (aka DeepQA), which beat the world's best
Jeopardy! players, used data and systems from TREC's QA Track as baseline performance
measurements.
3.2 Information Retrieval
Information retrieval is the activity of obtaining information resources relevant to an information
need from a collection of information resources. Searches can be based on metadata or on full-text
(or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information
overload". Many universities and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines are the most visible IR applications.
An information retrieval process begins when a user enters a query into the system. Queries are
formal statements of information needs, for example search strings in web search engines. In
information retrieval a query does not uniquely identify a single object in the collection. Instead,
several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are matched
against the database information. Depending on the application the data objects may be, for
example, text documents, images, audio, mind maps or videos. Often the documents themselves are
not kept or stored directly in the IR system, but are instead represented in the system by document
surrogates or metadata.
Most IR systems compute a numeric score for how well each object in the database matches the
query, and rank the objects according to this value. The top ranking objects are then shown to the
user. The process may then be iterated if the user wishes to refine the query.
To retrieve relevant documents effectively by IR strategies, the documents are typically
transformed into a suitable representation. Each retrieval strategy incorporates a specific model for
its document representation purposes.
Fig 2: Categorisation of IR models
Applications of IR include:
• Digital libraries
• Information filtering
• Recommender systems
• Media search
• Blog search
• Image retrieval
• Speech retrieval
• Video retrieval
To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection
consisting of three things:
• A document collection
• A test suite of information needs, expressible as queries
• A set of relevance judgments, standardly a binary assessment of either relevant or
nonrelevant for each query-document pair.
3.3 Fuzzy Logic
Fuzzy logic is a form of many-valued logic; it deals with reasoning that is approximate
rather than fixed and exact. In contrast to traditional binary sets, where a variable is either
true or false, fuzzy logic variables may have a truth value that ranges in degree between 0 and 1.
Fuzzy logic has been extended to handle the concept of partial truth, where the truth value
may range between completely true and completely false. Furthermore, when linguistic
variables are used, these degrees may be managed by specific functions.
The term "fuzzy logic" was introduced with the 1965 proposal of fuzzy set theory by Lotfi
A. Zadeh. Fuzzy logic has been applied to many fields, from control theory to artificial
intelligence. Fuzzy logics had, however, been studied since the 1920s as infinite-valued
logics, notably by Łukasiewicz and Tarski.
A basic application might characterize subranges of a continuous variable. For instance, a
temperature measurement for anti-lock brakes might have several separate membership
functions defining particular temperature ranges needed to control the brakes properly. Each
function maps the same temperature value to a truth value in the 0 to 1 range. These truth
values can then be used to determine how the brakes should be controlled.
Fig 3: Fuzzy logic temperature
3.4 Hard science with IF-THEN rules
Fuzzy set theory defines fuzzy operators on fuzzy sets. The problem in applying this is that the
appropriate fuzzy operator may not be known. For this reason, fuzzy logic usually uses IF-THEN
rules, or constructs that are equivalent, such as fuzzy associative matrices.
Rules are usually expressed in the form:
IF variable IS property THEN action
For example, a simple temperature regulator that uses a fan might look like this:
IF temperature IS very cold THEN stop fan
IF temperature IS cold THEN turn down fan
IF temperature IS normal THEN maintain level
IF temperature IS hot THEN speed up fan
There is no "ELSE": all of the rules are evaluated, because the temperature might be "cold" and
"normal" at the same time to different degrees.
The AND, OR, and NOT operators of boolean logic exist in fuzzy logic, usually defined as the
minimum, maximum, and complement; when they are defined this way, they are called the Zadeh
operators. So for the fuzzy variables x and y:
NOT x = (1 - truth(x))
x AND y = minimum(truth(x), truth(y))
x OR y = maximum(truth(x), truth(y))
There are also other operators, more linguistic in nature, called hedges that can be applied. These
are generally adverbs such as "very" or "somewhat", which modify the meaning of a set using a
mathematical formula.
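The Zadeh operators and the idea of a hedge can be written out directly. The text does not give the formulas for the hedges; squaring for "very" and taking the square root for "somewhat" are a common convention and are used here as an assumption. The truth values are hypothetical.

```python
import math

def fuzzy_not(x):
    """Complement: NOT x = 1 - truth(x)."""
    return 1.0 - x

def fuzzy_and(x, y):
    """Zadeh AND: the minimum of the two truth values."""
    return min(x, y)

def fuzzy_or(x, y):
    """Zadeh OR: the maximum of the two truth values."""
    return max(x, y)

def very(x):
    """Hedge that sharpens a set (assumed convention: square the truth value)."""
    return x * x

def somewhat(x):
    """Hedge that loosens a set (assumed convention: square root of the truth value)."""
    return math.sqrt(x)

# A temperature that is "cold" to degree 0.3 and "normal" to degree 0.7 (hypothetical)
cold, normal = 0.3, 0.7
print(fuzzy_and(cold, normal), fuzzy_or(cold, normal), fuzzy_not(cold))
print(very(normal), somewhat(normal))
```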
3.5 Propositional fuzzy logics
The most important propositional fuzzy logics are:
• Monoidal t-norm-based propositional fuzzy logic MTL is an axiomatization of logic where
conjunction is defined by a left continuous t-norm, and implication is defined as the residuum of the
t-norm. Its models correspond to MTL-algebras that are prelinear commutative bounded integral
residuated lattices.
• Basic propositional fuzzy logic BL is an extension of MTL logic where conjunction is defined by a
continuous t-norm, and implication is also defined as the residuum of the t-norm. Its models
correspond to BL-algebras.
• Łukasiewicz fuzzy logic is the extension of basic fuzzy logic BL where standard conjunction is the
Łukasiewicz t-norm. It has the axioms of basic fuzzy logic plus an axiom of double negation, and
its models correspond to MV-algebras.
• Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the Gödel t-norm.
It has the axioms of BL plus an axiom of idempotence of conjunction, and its models are called
G-algebras.
• Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the product
t-norm. It has the axioms of BL plus another axiom for cancellativity of conjunction, and its models
are called product algebras.
• Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a
further generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have
traditional syntax and many-valued semantics, in EVŁ the syntax is also evaluated. This means that
each formula has an evaluation. The axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A
generalization of the classical Gödel completeness theorem is provable in EVŁ.
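The three t-norms named above, together with the residuum that each logic uses as its implication, have standard closed forms on truth values in [0, 1]. The sketch below writes them out; the function names and sample values are illustrative only.

```python
def lukasiewicz_t(x, y):
    """Łukasiewicz t-norm: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)

def godel_t(x, y):
    """Gödel t-norm: the minimum."""
    return min(x, y)

def product_t(x, y):
    """Product t-norm: ordinary multiplication."""
    return x * y

def lukasiewicz_residuum(x, y):
    """Implication x -> y in Łukasiewicz logic: min(1, 1 - x + y)."""
    return min(1.0, 1.0 - x + y)

def godel_residuum(x, y):
    """Implication x -> y in Gödel logic: 1 if x <= y, otherwise y."""
    return 1.0 if x <= y else y

def product_residuum(x, y):
    """Implication x -> y in product logic: 1 if x <= y, otherwise y / x."""
    return 1.0 if x <= y else y / x  # x > y >= 0 here, so x is non-zero

x, y = 0.7, 0.6  # hypothetical truth values
for t_norm in (lukasiewicz_t, godel_t, product_t):
    print(t_norm.__name__, t_norm(x, y))
```

For the same pair of truth values the three conjunctions give different results, which is why the choice of t-norm defines a different logic.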
3.6 Information Retrieval On the Web
Retrieving information from the web can prove difficult because of the size and abstractness
of the data contained on the web. Approximations for 2011 estimated the web to be as large as 50
billion web pages or more. Web retrieval is made increasingly difficult by factors such
as word ambiguity (where a single word can take on multiple meanings) and the large number of
typographical errors contained within web information. It is estimated that one in every two
hundred words on an average web site contains a textual error.
There are several key issues involving information retrieval. These issues are relevance, evaluation,
and information needs. However, these are not the only issues involving information retrieval.
Performance, scalability and occurrences of paging update are other common
information retrieval issues.
Relevance is the relational value of a given user query to the documents within the database.
The relevance of a document is normally determined by a document ranking algorithm. These algorithms
define how relevant a document is to a user query by using functions that relate
the query to the documents collected in the index.
The evaluation of the feedback given by the information retrieval system is another issue with
information retrieval. The behavior of the system may not meet the expectations of the user, or the
documents returned from the system may not all be relevant to a query. Depending on the system
and the user, the results of a query should be in a format that best fits the data being searched and
returned.
Web information retrieval is an area open to many research opportunities. The larger problems
with web information retrieval (relevance, evaluation, and information needs, amongst others) are
still important topics that require attention.
Fuzzy information retrieval has proven to be a suitable solution for many areas of
information retrieval in which the data can be uncertain, such as the web. The individual
sections of the system were developed. These include a crawler system that obeys standard internet
etiquette rules, and an indexing application that stores all information retrieved from the web in the
form of a standard inverted index. Each section entered its gathered data into the database.
Information needs concern how the user interacts with the information retrieval system. The data
within the system should be easy to access in a way that is convenient to the user.
In some systems retrieving too much information is inconvenient, while in others
failing to return all relevant information may be unacceptable.
4. Evaluation of Performance of Information Retrieval System
In the past, various researchers have used the following parameters to evaluate the performance of
IR systems:
1. Precision: the fraction of the retrieved documents that are relevant. In practice it measures
the accuracy of the result.
Precision = |Ra| / |A| (1)
where,
Ra: Set of relevant documents retrieved
A: Set of documents retrieved
2. Recall: the fraction of all relevant documents that are retrieved. In practice it measures
the coverage of the result.
Recall = |Ra| / |R| (2)
Ra: Set of relevant documents retrieved
R: Set of all relevant documents
3. Precision-Recall Curve: This curve is based on the values of precision and recall, where the
x-axis is recall and the y-axis is precision. Instead of using precision and recall at each rank
position, the curve is commonly plotted using 11 standard recall levels: 0%, 10%, 20%, ..., 100%.
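The three measures above can be computed for a single query as in this sketch. Equations (1) and (2) are implemented directly; for the curve, the precision reported at each standard recall level is taken as the highest precision observed at any recall greater than or equal to that level, which is the usual interpolation rule and is an assumption about how the curves in this project are drawn. The document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = |Ra| / |A| and Recall = |Ra| / |R| for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # Ra: relevant documents retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0%, 10%, ..., 100%."""
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Hypothetical ranked result list and relevance judgments
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_recall(ranking, relevant))
print(eleven_point_precision(ranking, relevant))
```

Plotting the eleven values against their recall levels for each similarity measure gives curves that can be compared directly for the same query.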
Moreover, the average similarity value of documents for an individual query and the average number
of retrieved relevant documents can also be used as parameters to check the performance of an IR
system. If the values of both of these parameters are high, the performance of the IR system is
considered good.
5. REFERENCES
1. Yates, R.B., Berthier, R.: Modern Information retrieval. Addison Wesley (1999)
2. Cooper, W.S.: Getting beyond Boole. Information Processing and Management 24, 243-248
(1988)
3. Harman, D.: Ranking Algorithms. Information retrieval: data structures and algorithms,
pp. 363-392. Prentice-Hall (1992)
4. Salton, G.: Automatic text processing: the transformation, analysis, and retrieval of
information by computer. Addison Wesley (1998)
