
UNIT V – SEARCH ARCHITECTURE AND EVALUATION
• Index Search Optimization
• Text Search Optimization
• GOOGLE Scalable Multiprocessor Architecture
• Information System Evaluation
• Measures used in System Evaluation
Index Search Optimization

• Pruning the Index


• Champion Lists
Index Search Optimization

• The final result of a search will be a sorted, ranked list of hits, where any item whose relevance weight is above zero will be in the ranked list.
• This creates hit lists of potentially hundreds of thousands to millions of hits.
• The “total hit” count is really an estimate; the system has actually identified only a small subset of those hits.
• The user is really only concerned with less than 1% of the top hits.
• But since the user can only move through the linear hit list one page at a time, there is no mechanism for the user to get to hit 1,000,000 without paging through many pages of hits.
Pruning the Index
• Most systems already make a simplifying assumption and retrieve only the inversion lists for the search terms in the query.
• The problem of vocabulary mismatch, described earlier as one of the major issues in getting complete search results, is ignored in this approach.
• If a similarity measure is used, items with low relevance weights for a particular search term are not likely to appear in the top portion of the returned hit list.
• Thus when the inversion list is being merged in the similarity
calculations, values in the index list below a certain threshold can be
treated as if they were zero and skipped over.
• This significantly reduces the number of similarity calculations that
need to be performed.
• The inversion lists are sorted by Item ID.
• A better sort for finding the most similar items early would be by weight.
• Unfortunately, this then places the item IDs in the inversion lists in effectively random order.
• But if the goal is just to get the top few items for the first few pages of hits, this technique finds those items more quickly with minimal processing (see the sketch below).
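• A minimal Python sketch of this pruning idea follows; the index layout (each term mapped to a list of (item_id, weight) postings) and the threshold value are illustrative assumptions, not any particular system's structures.

from collections import defaultdict

PRUNE_THRESHOLD = 0.2   # illustrative cutoff, tuned per collection

def prune_and_merge(inverted_index, query_terms, threshold=PRUNE_THRESHOLD):
    """Accumulate similarity scores, skipping low-weight postings."""
    scores = defaultdict(float)
    for term in query_terms:
        for item_id, weight in inverted_index.get(term, []):
            if weight < threshold:        # treat as zero and skip
                continue
            scores[item_id] += weight     # simple dot-product style accumulation
    # items ranked by accumulated similarity, highest first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Tiny hypothetical index: postings below 0.2 never enter the calculation
index = {
    "search": [(1, 0.9), (2, 0.05), (3, 0.4)],
    "optimization": [(2, 0.7), (3, 0.1), (4, 0.6)],
}
print(prune_and_merge(index, ["search", "optimization"]))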
Champion Lists
• The concept of champion lists is based upon the approach of only analyzing the highest weighted items in the inversion lists, which are typically sorted by unique item ID.
• There is a “champion” inversion list created that is a subset of the
original inversion list that only has items in it where the weight of
the processing token exceeds a threshold
• The search is then executed against the champion lists rather than
the full inversion lists.
• The challenge is to find that percentage of high weighted items that
you want to keep in each inversion list
• In the current search systems in addition to weights associated with
specific processing tokens (terms) in items, there is sometimes a
weight assigned to each item in terms of its potential value (e.g., the
GOOGLE link analysis).
• This value is also used in determining which items are to be retrieved and the ranked order of the list.
• Instead of the inversion list only containing the weight of a term in a document using the weighting formulas previously discussed (e.g., inverse document frequency), the value in the inversion list could also be biased by the potential importance weight for the item.
• This runs a significant computational risk because the relative importance weights of each item can initially vary as the system changes, until it stabilizes.
• The inversion lists will keep changing as new data enters the system and existing values are updated.
• When there is a significant change to an item, a decision is needed on whether to update the inversion list or just keep a smaller list of those modified items to be dynamically adjusted for.
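• A minimal Python sketch of the champion-list idea follows; keeping the top-k postings per term (rather than an absolute weight threshold) and the toy index are illustrative assumptions.

from collections import defaultdict

def build_champion_lists(inverted_index, top_k=100):
    """Keep only the top_k highest weighted postings for each term."""
    champions = {}
    for term, postings in inverted_index.items():
        by_weight = sorted(postings, key=lambda p: p[1], reverse=True)
        champions[term] = by_weight[:top_k]
    return champions

def search(champion_lists, query_terms):
    """Score only the items that survived into the champion lists."""
    scores = defaultdict(float)
    for term in query_terms:
        for item_id, weight in champion_lists.get(term, []):
            scores[item_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = {"google": [(10, 0.8), (11, 0.1), (12, 0.6)],
         "architecture": [(10, 0.3), (12, 0.9), (13, 0.05)]}
champs = build_champion_lists(index, top_k=2)
print(search(champs, ["google", "architecture"]))

• The item-importance bias described above could be folded in by multiplying each posting weight by a per-item quality score before the top-k cut is applied.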
Text Search Optimization

• Software text search algorithms


• Hardware text search systems
Text Search Optimization

• The basic concept of a text scanning system is the ability for one or more users to enter queries; the text to be searched is accessed and compared to the query terms.
• When all of the text has been accessed, the query is complete.
• One advantage of this type of architecture is that, as soon as an item is identified as satisfying a query, it can be returned to the user while the rest of the search continues.
• The database
– contains the full text of the items.
• The term detector
– special hardware/software that contains all of the terms being
searched for and in some systems the logic between the items.
– It will input the text and detect the existence of the search terms.
– It will output to the query resolver the detected terms to allow
for final logical processing of a query against an item.
• The query resolver performs two functions.
– It will accept search statements from the users
– extract the logic and search terms and pass the search terms to
the detector.
• The Query Resolver will pass information to the user interface that
will be continually updating search status to the user and on request
retrieve any items that satisfy the user search statement.
• The worst case search for a pattern of m characters in a string of n characters is at least n − m + 1, or a magnitude of O(n)
• There are two approaches to the data stream
– In the first approach the complete database is being sent to the
detector(s) functioning as a search of the database.
– In the second approach random retrieved items are being
passed to the detectors.
• Examples of limits of index searches are:
• search for stop words
• search for exact matches when stemming is performed
• search for terms that contain both leading and trailing
“don’t cares”
• search for symbols that are on the interword symbol list (e.g., quotation marks, commas, semicolons)
• Many of the hardware and software text searchers use finite state automata as a basis for their algorithms. A finite state automaton is a logical machine that is composed of five elements:
– I: a set of input symbols from the alphabet supported by the automaton
– S: a set of possible states
– P: a set of productions that define the next state based upon the current state and input symbol
– S0: a special state called the initial state
– SF: a set of one or more final states from the set S
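• A minimal Python sketch of such an automaton used as a single-term detector follows; the function-based encoding of the production set P is just one convenient representation, not how any particular searcher stores it.

def build_automaton(term):
    """States 0..m (S), initial state 0 (S0), final state m (SF);
    next_state encodes the productions P over the input alphabet I."""
    m = len(term)
    def next_state(state, ch):
        if state < m and ch == term[state]:
            return state + 1              # extend the current match
        # fall back to the longest prefix of term that is a suffix of matched+ch
        candidate = term[:state] + ch
        for k in range(min(state + 1, m), 0, -1):
            if candidate.endswith(term[:k]):
                return k
        return 0
    return next_state, m

def detect(term, text):
    """Return positions in text where term ends (the term detector output)."""
    next_state, final = build_automaton(term)
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = next_state(state, ch)
        if state == final:
            hits.append(i - final + 1)
    return hits

print(detect("ana", "banana"))   # -> [1, 3]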
Software Text Search Algorithms
• In software streaming techniques, the item to be searched is read into
memory and then the algorithm is applied.
• The best example is searching an item before display to highlight the
search query terms in the text.
• Four algorithms associated with software text search optimization:
– Brute force approach
– Knuth-Morris-Pratt
– Boyer-Moore
– Rabin-Karp
• Boyer-Moore has been the fastest, requiring at most O(n + m) comparisons.
• Knuth-Morris-Pratt and Boyer-Moore both require O(m) preprocessing of the search pattern.
Brute Force Approach

• The Brute force approach is the simplest string matching algorithm.


• The idea is to try and match the search string against the input text.
• When a mismatch is detected in the comparison process, shift the alignment one position in the input text and start the comparison process over.
• The expected number of comparisons when searching an input text string of n characters for a pattern of m characters is approximately:
Nc ≈ (n − m + 1) · (c / (c − 1)) · (1 − 1/c^m)
• Nc – Expected number of comparisons
• c – Size of the alphabet for the text
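• A minimal Python sketch of the brute force approach, using the (a b d a a b) pattern from the example later in this unit:

def brute_force_search(text, pattern):
    """Return the start position of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):        # at most n - m + 1 alignments
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                        # extend the current match
        if j == m:                        # the full pattern matched
            matches.append(start)
        # on a mismatch simply shift the alignment by one and retry
    return matches

print(brute_force_search("abdaabdaab", "abdaab"))   # -> [0, 4]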
Knuth-Morris-Pratt Algorithm

• A major improvement over previous algorithms in that, even in the worst case, the number of character comparisons does not depend upon the length of the pattern being searched for.
• Basic concept behind the algorithm is that whenever a mismatch is
detected, the previous matched characters define the number of
characters that can be skipped in the input stream prior to starting
the comparison process again.
Knuth-Morris-Pratt Algorithm
• Example
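• A minimal Python sketch of the Knuth-Morris-Pratt idea: a failure (prefix) table computed from the pattern records how many already-matched characters can be reused after a mismatch, so the input stream never has to be re-read.

def kmp_failure(pattern):
    """failure[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]            # fall back to a shorter prefix
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k
    return failure

def kmp_search(text, pattern):
    failure = kmp_failure(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]            # reuse the previously matched prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):             # full match ending at position i
            matches.append(i - k + 1)
            k = failure[k - 1]
    return matches

print(kmp_search("ababcabcabababd", "ababd"))   # -> [10]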
Boyer-Moore

• The string matching algorithm can be significantly enhanced if the comparison process starts at the end of the search pattern, processing right to left, versus the start of the search pattern.
• The advantage is that large jumps are possible when a mismatch occurs.
• The character in the input stream that was mismatched also requires alignment with its next occurrence in the search pattern, or the complete pattern can be moved past it.
• This can be defined as: Algo1 and Algo2
• ALGO1
• on a mismatch, the character in the input stream is compared to the
search pattern to determine the shifting of the search pattern
(number of characters in input stream to be skipped) to align the
input character to a character in the search pattern.
• If the character does not exist in the search pattern, then it is possible to shift the search pattern completely past that position in the input stream.
• ALGO 2
• When a mismatch occurs after previously matching a substring of the input text,
• the matching process can jump to the repeating occurrence in the pattern of the initially matched subpattern, thus aligning that portion of the search pattern with the matching portion of the input text.
• Search Pattern = abcabcacab
• Upon a mismatch, the comparison process can skip the MAXIMUM
(ALGO1, ALGO2).
• In this example the search pattern is (a b d a a b) and the alphabet is (a, b, c, d, e, f) with m = 6 and c = 6.
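• A minimal Python sketch using only the bad-character rule (ALGO1 above); a full Boyer-Moore implementation also precomputes the good-suffix shifts (ALGO2) and on each mismatch skips by the maximum of the two.

def bad_character_table(pattern):
    """Rightmost position of each character in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}

def boyer_moore_search(text, pattern):
    last = bad_character_table(pattern)
    n, m = len(text), len(pattern)
    matches, s = [], 0                    # s = current alignment of the pattern
    while s <= n - m:
        j = m - 1
        while j >= 0 and text[s + j] == pattern[j]:
            j -= 1                        # compare right to left
        if j < 0:
            matches.append(s)
            s += 1                        # (the good-suffix rule would shift further)
        else:
            # align the mismatched text character with its last occurrence in
            # the pattern, or jump completely past it if it never occurs
            s += max(1, j - last.get(text[s + j], -1))
    return matches

print(boyer_moore_search("here is a simple example", "example"))   # -> [17]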
Hardware Text Search Systems

• History
• Current Systems
Hardware Text Search Systems
History
• Software text search is applicable to many circumstances but
has encountered restrictions on the ability to handle many search
terms simultaneously against the same text and limits due to I/O
speeds.
• In the 1980s the approach was to have a specialized hardware machine perform the searches and pass the results to the main computer, which supported the user interface and retrieval of hits.
• Since the searcher is hardware based, scalability is achieved by
increasing the number of hardware search devices.
• The only limit on speed is the time it takes to flow the text off of
secondary storage (i.e., disk drives) to the searchers.
• By having one search machine per disk, the maximum time it
takes to search a database of any size will be the time to search
one disk.
• Major advantage of using a hardware text search unit is in the
elimination of the index that represents the document database.
• New items can be searched as soon as received by the system rather
than waiting for the index to be created and the search speed is
deterministic
• The algorithmic part of the system is focused on the term detector.
• The speed of search is then based on the speed of the I/O
• There have been three approaches to implementing term detectors:
– parallel comparators
– Associative memory
– universal finite state automata
• One of the earliest hardware text string search units was the Rapid
Search Machine developed by General Electric
• The machine consisted of a special purpose search unit where a
single query was passed against a magnetic tape containing the
documents.
• A more sophisticated search unit, called the Associative File Processor, was developed by Operating Systems Inc.; it could search multiple queries against the same text.
• The High Speed Text Search (HSTS) machine uses an algorithm similar to the Aho-Corasick software finite state machine algorithm, except that it runs three parallel state machines.
• The GESCAN system uses a text array processor (TAP) that
simultaneously matches many terms and conditions against a given
text stream
– TAP receives the query information from the user's computer and directly accesses the textual data from secondary storage.
– The TAP consists of a large cache memory and an array of four to 128 query
processors.
– The text is loaded into the cache and searched by the query processors.
– Each query processor is independent and can be loaded at any time.
– A complete query is handled by each query processor.
– Queries support exact term matches, fixed length don’t cares, variable length
don’t cares, terms may be restricted to specified zones, Boolean logic, and
proximity.
• Another approach for hardware searchers is to augment disc storage.
• The augmentation is a generalized associative search element placed
between the read and write heads on the disk.
• The content addressable segment sequential memory (CASSM)
system uses these search elements in parallel to obtain structured
data from a database.
• The Fast Data Finder (FDF) is the most recent specialized hardware
text search unit still in use in many organizations. It was developed
to search text and has been used to search English and foreign
languages.
Fast Data Finder Architecture
• Fast Data Finder (FDF) is the most recent specialized hardware text
search unit still in use in many organizations.
• It was developed to search text and has been used to search English
and foreign languages.
• The early Fast Data Finders consisted of an array of programmable
text processing cells connected in series forming a pipeline hardware
search processor.
• The cells are implemented using a VLSI chip.
• It contains 3600 cells (the FDF-3 has a rack mount configuration with 10,800 cells).
• Each cell will be a comparator for a single character limiting the
total number of characters in a query to the number of cells.
• The cells are interconnected with an 8-bit data path and
approximately 20-bit control path.
• A cell is composed of both a register cell (Rs) and a comparator (Cs).
• The input from the document database is controlled and buffered by the microprocessor/memory and fed through the comparators.
• The search characters are stored in the registers.
• Groups of cells are used to detect query terms, along with logic between the
terms, by appropriate programming of the control lines.
• When a pattern match is detected, a hit is passed to the internal microprocessor
that passes it back to the host processor, allowing immediate access by the user
to the Hit item.
• The functions supported by the Fast Data Finder are:
– Boolean Logic including negation
– Proximity on an arbitrary pattern
– Variable length “don’t cares”
– Term counting and thresholds
– fuzzy matching
– term weights
– numeric ranges
Current Systems

• The Paracel search system continues to be used as a specialized search system for genetic information.
• It can also be used for data warehouse applications.
• Use of a search system embedded in the hardware significantly reduces the amount of time it takes to do a search across all of the data.
• The Netezza System uses a massively parallel architecture with the
filtering at the disk level using a Field Programmable Gate Array
(FPGA).
• FPGAs are programmable by the user using a hardware description
language and are relatively inexpensive because they can be mass
produced
• An FPGA consists of “logic blocks” whose connections can be defined using simple logic gates like AND, OR, and XOR.
• These are used in lieu of the traditional disk controller chips and can
be programmed with the search logic.
• Each node in the parallel processing architecture consists of the
FPGA along with the storage on the disk (e.g., 400GB).
• These are also called snippet processing units (SPUs).
• For example, a rack of equipment can hold 112 of these units, all executing the query in parallel.
Google Scalable Multiprocessor Architecture

• GOOGLE was founded on a research project that focused on using


link analysis between Internet web pages to influence the ranking of
search results
• GOOGLE's other major contribution to the evolution of search systems was in the architecture they developed to allow GOOGLE to provide fast response time against very large data sets.
• Until GOOGLE most systems used very large processors and in
some cases parallel processors (e.g., Alta Vista) to perform search
and functions like link analysis.
• The GOOGLE approach was to use inexpensive commodity type
processors
• GOOGLE developed many capabilities that are used in their overall
architecture.
• GOOGLE file system started with an earlier concept called BigFiles that
was part of the original research effort at Stanford University that laid the
groundwork for the GOOGLE system.
• The characteristic of an information retrieval system is that most data is
relatively static.
• The data is very large and thus large chunks of storage need to be managed.
• This led to the concept of “chunks”: 64 MB fixed-size chunks of data that can be redundantly stored on multiple nodes, called chunk servers.
• There is a Master node that keeps the metadata on the chunk servers, tracking which chunk servers are currently operational and what files are on them.
• Chunk servers can fail, but the Master node can then route data requests to the other replicated servers.
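• A toy Python illustration (not Google's actual data structures) of this division of labour: the Master holds only metadata mapping files to chunks and chunks to replica servers, and routes a read around a failed chunk server; the file and server names are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB fixed-size chunks

class Master:
    def __init__(self):
        self.file_chunks = {}             # file name -> [chunk ids]
        self.chunk_replicas = {}          # chunk id  -> [chunk server ids]
        self.live_servers = set()         # currently operational servers

    def locate(self, filename, chunk_index):
        """Return a live replica holding the requested chunk."""
        chunk_id = self.file_chunks[filename][chunk_index]
        for server in self.chunk_replicas[chunk_id]:
            if server in self.live_servers:      # skip failed chunk servers
                return server, chunk_id
        raise RuntimeError("no live replica for chunk " + chunk_id)

master = Master()
master.file_chunks["crawl.dat"] = ["c1", "c2"]
master.chunk_replicas = {"c1": ["s1", "s2", "s3"], "c2": ["s2", "s4"]}
master.live_servers = {"s2", "s3", "s4"}          # server s1 has failed
print(master.locate("crawl.dat", 0))              # -> ('s2', 'c1')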
• GOOGLE Web Servers (GWS) are the commodity search servers.
• A query is first processed and, via the Domain Name Server (DNS), initial load balancing takes place to determine which GOOGLE processing center the query should go to.
• Once at a processing center the query is then distributed to multiple GWS to process the search.
• The results of the search are an initial rank list of document IDs of
potential hits for the query.
• The document IDs are then passed to Document Servers that
manage access to the documents.
• The final document may still be at the web site it was found at or
cached locally in the file system
• The Document servers will determine the Title, a query weighted
summary snippet and the URL that will be used by the user to
retrieve the complete document.
• The architecture of the document servers is similar to the web
servers
• There will be a large number of servers, each randomly handling a subset of the document set (shards).
• They will use the file system to determine where to go to get the data to process.
• Like the index there are multiple copies of the documents from the
Web that GOOGLE caches locally.
• Creating applications that can execute on a distributed architecture can also be a difficult task,
• because the applications need to be easily replicated to multiple servers for processing and then have a mechanism to collect the results from the individual instances and consolidate them into a single answer.
• Google developed a programming model that allows for simple creation and execution of applications in a distributed environment against large data sets.
• The model is called MapReduce.
• The programmer first defines a “map” application that works against a key/value pair and generates intermediate key/value pairs as output.
• There is a “reduce” function that takes all of the intermediate values
associated with the same intermediate key and merges them.
• This architecture was based upon, and is an enhancement of, the map and reduce primitives in the LISP language and other functional languages.
• Programs written using this model can easily and automatically be
distributed across a large cluster of servers.
• This model also allows for easy re-execution if a particular server
fails.
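• A minimal single-process Python sketch of the map/reduce model using the classic word-count example; a real framework would distribute the map calls across many servers and shuffle the intermediate pairs by key before the reduce step.

from collections import defaultdict

def map_fn(doc_id, text):
    """The "map" application: a key/value pair in, intermediate pairs out."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """The "reduce" function: merge all intermediate values for one key."""
    return word, sum(counts)

def map_reduce(documents):
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():          # map phase
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)          # group ("shuffle") by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())   # reduce phase

docs = {"d1": "map reduce map", "d2": "reduce servers"}
print(map_reduce(docs))   # -> {'map': 2, 'reduce': 2, 'servers': 1}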
• The architecture concepts of the Google File System (and BigTable) and MapReduce were implemented in a Java software framework called Hadoop.
Introduction to Information System
Evaluation
• The Cranfield model created at the Cranfield Institute for Information
Retrieval Evaluation was one of the earliest structured approaches to
evaluating information retrieval and still continues in TREC and other
evaluation platforms
• TREC, the annual Text REtrieval Conference, sponsored by the Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST)
• There are many reasons to evaluate the effectiveness of an Information
Retrieval System:
– To aid in the selection of a system to procure
– To monitor and evaluate system effectiveness
– To evaluate the system to determine improvements
– To provide inputs to cost-benefit analysis of an information system
– To determine the effects of changes made to an existing information
system.
• In an operational system there is less concern over 55% versus 65%
precision than 99% versus 89% availability.
• For academic purposes, controlled environments can be created that
minimize errors in data.
• In operational systems, there is no control over the users and care
must be taken to ensure the data collected are meaningful.
• The most important evaluation metrics of information systems (i.e.,
precision and recall) will always be biased by human subjectivity.
• To discuss relevancy, it is necessary to define the context under
which the concept is used.
• From a human judgment standpoint, relevancy can be considered:
– Subjective depends upon a specific user’s judgment
– Situational relates to a user’s requirements
– Cognitive depends on human perception and behavior
– Temporal changes over time
– Measurable observable at points in time
• Another way of specifying relevance is from information, system
and situational views.
• The information view is subjective in nature and pertains to human
judgment of the conceptual relatedness between an item and the
search.
• Ingwersen categorizes the information view into four types of
“aboutness” (Ingwersen-92):
– Author Aboutness determined by the author’s language as matched by the
system in natural language retrieval
– Indexer Aboutness determined by the indexer’s transformation of the author’s
natural language into a controlled vocabulary they use to index the document
– Request Aboutness determined by the user’s or intermediary’s processing of a
search statement into a query
– User Aboutness determined by the indexer's attempt to represent the document according to presuppositions about what the user will want to know
• The system view is more technical in nature and ignores the subtle
definition of relevance and looks only at the match between query
terms and words within an item
• It can be objectively observed, manipulated and tested without
relying on human judgment because it uses metrics associated with
the matching of the query to the item.
• The situation view pertains to the relationship between information
and the user’s information problem situation. It assumes that only
users can make valid judgments regarding the suitability of
information to solve their information need.
• Relevancy can only be defined in the terms of human judgment.
• It’s possible to measure the amount of agreement (or disagreement)
between users on relevancy using a mathematical technique called
Cohen’s Kappa coefficient.
• The measure is of the agreement between two evaluators who each
classify N items into C mutually exclusive categories.
• In this case there will be N items placed in one of 2 categories
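• A standard form of the coefficient is:
Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e))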

• Pr(a) is the observed agreement between evaluators and Pr(e) is the hypothetical probability of chance agreement.
• Kappa takes a value between 0 and 1
• 1 means there is total agreement and zero means the agreement is no better than chance
• The process of evaluating an information retrieval system starts with
a collection of items, the information retrieval capability(s) and a
“ground truth” data set satisfying a number of information needs.
• The ground truth data set is the set of relevant items that should be
found based upon the information need.
• In some cases two sets of known relevant items are needed.
• The second set is needed as a training set when a categorization
system is being evaluated
• One thing that has been shown is that the evaluation results are very
query specific
• List of some of the major test data sets available.
– Cranfield collection
– TREC Text Retrieval Conference
– NII Test Collection for IR Systems (NTCIR)
– FIRE—Forum for Information Retrieval Evaluation
– Cross Language Evaluation Forum (CLEF)
– Reuters
– Reuters Corpus (RCV1 and RCV2)
– Newsgroup - 20 Newsgroup Dataset
– Linguistic Data Consortium (LDC)
Measures Used in System Evaluation

• Items arrive in the system and are automatically or manually


transformed by “indexing” into searchable data structures.
• The user determines what his information need is and creates a
search statement.
• The system processes the search statement, returning potential hits.
• The user selects those hits to review and retrieves the item.
• The item is reviewed to see if it has any needed information.
• Measurements can be made on each step in this process considering
two perspectives: User perspective and System perspective
• User Perspective:
• The Author’s Aboutness
– occurs as part of the system executing the query against the
index.
• User Aboutness, Indexer Aboutness
– occur when the items are indexed into the system
• Request Aboutness
– occurs when the user creates the search statement
• The system perspective is based upon aggregate functions, whereas
the user perspective takes a more localized personal view.
• Techniques for collecting measurements can also be objective or
subjective.
• Response time is a metric frequently collected to determine the
efficiency of the search execution.
• Response time is defined as the time it takes to execute the search.
• It is difficult to define objective measures on the process of a user
selecting hits for review and reviewing them.
• The most important evaluation factor is the quality of the search
results which can be measured typically by precision and recall.
• Precision is a measure of the accuracy of the search process.
• Recall is a measure of the ability of the search to find all of the
relevant items that are in the database.
• In evaluating multiple systems it is difficult to compare how each
does by trying to compare their precision/recall graphs.
• It is desirable to calculate a single value that can be used in
comparing systems at recall levels.
• The value typically used is the F-measure which trades off precision
versus recall and is adjustable.
• It calculates the weighted harmonic mean of the precision and recall:
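• A commonly used form, where β controls the relative weight given to recall versus precision (β = 1 gives the balanced F1 measure), is:
Fβ = ((β² + 1) · Precision · Recall) / (β² · Precision + Recall)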
• What is the value of recall if there are no relevant items in the database, or of precision if no items are retrieved?
• In both cases the mathematical formula becomes 0/0.
• A measure that remains defined in these cases is Fallout, the ratio of non-relevant items retrieved to the total number of non-relevant items in the database.
• There are other measures of search capabilities that have been
proposed.
• A new measure that provides additional insight in comparing
systems or algorithms is the “Unique Relevance Recall” (URR)
metric.
• URR is used to compare two or more algorithms or systems.
• It measures the number of relevant items that are retrieved by one
algorithm that are not retrieved by the others

• Number_unique_relevant is the number of relevant items retrieved that were not retrieved by the other algorithms
• TNRR-Total Number of Items Relevant Recall
– Using TNRR as the denominator provides a measure for an
algorithm of the percent of the total items that were found that
are unique and found by that algorithm.
• TURR-Total Unique Relevant Recall
– It is a measure of the contribution of uniqueness to the total
relevant items that the algorithm provides.
– Using the second measure, TURR, as the denominator, provides
a measure of the percent of total unique items that could be
found that are actually found by the algorithm
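• Written out using the definitions above, the two variants are:
URR(TNRR) = Number_unique_relevant / TNRR
URR(TURR) = Number_unique_relevant / TURR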
• Figure 9.2a, b provide an example of the overlap of relevant items
assuming there are four different algorithms.
• Figure 9.2 a gives the number of items in each area of the overlap
diagram in Fig.9.2b
• If a relevant item is found by only one or two techniques as a “unique item,” then from the diagram the following URR values can be produced.

• TNRR = A + B + C + · · · + M = 985
• TURR = A + B + C + E + H + J + L + M = 61
• There is a measure that is better when there is incomplete relevance
information.
• The approach looks at evaluating a system on only the judged items.
• The evaluations discussed so far assume an estimate of the total
number of relevant items is available to calculate the recall value.
• The preference measure that they proposed is a function of the
number of times judged non-relevant items are retrieved before
relevant items.
• It is called bpref because it uses binary relevance judgments to
define the preference relation.
• Where R is the number of items judged relevant, r ranges over the relevant retrieved items, and the factor nr is the number of non-relevant items retrieved prior to the current relevant item, counting from the start of the hit file.
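• A commonly cited form, consistent with these definitions, is:
bpref = (1/R) · Σ over relevant retrieved items r of (1 − nr/R)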
• Another measure that is oriented towards user satisfaction is the
discounted cumulative gain (DCG).
• The measure looks at the position in the hit list for relevant items
knowing the higher on the list the more valuable it is to the user.
• The gain starts with higher values if the item is near the top of the
hit list and decreases as you go down the hit list.
• The discounted cumulative gain is a modification of the earlier
cumulative gain (CG) measure:
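• In the commonly used notation, with reli the relevance at rank i and p the depth of the list:
CGp = rel1 + rel2 + · · · + relp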

• Where i is a rank position in the list and reli is the relevance at position i.
• Discounted cumulative gain where the position (i) is a factor is:
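• A commonly used form, which discounts each relevance value by the logarithm of its rank position, is:
DCGp = rel1 + Σ for i = 2 to p of (reli / log2(i))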

• Other measures have been proposed for judging the results of


searches
• Novelty Ratio:
– ratio of relevant items retrieved that were not previously known to the user to the total relevant items retrieved
• Coverage Ratio:
– ratio of relevant items retrieved to the total relevant items known to the user before the search
• Sought Recall:
– ratio of the total relevant items reviewed by the user after the search to the total relevant items the user would have liked to examine
