
Impact of Query Correlation on Web Searching

Ash Mohammad Abbas


Department of Computer Engineering
Zakir Husain College of Engineering and Technology
Aligarh Muslim University, Aligarh - 202002, India.

Abstract— Correlation among queries is an important factor to analyze, as it may affect the results delivered by a search engine. In this paper, we analyze correlation among queries and how it affects the information retrieved from the Web. We analyze two types of queries: (i) queries with embedded semantics, and (ii) queries without any semantics. In our analysis, we consider parameters such as search latencies and search relevance. We focus on two major search portals that are mainly used by end users. Further, we discuss a unified criterion for comparing the performance of the search engines.

Index Terms— Query correlation, search portals, Web information retrieval, unified criterion for comparison, earned points.

I. INTRODUCTION

The Internet, which was originally aimed at communicating research activities among a few universities in the United States, has now become a basic need of life for people throughout the world who can read and write. This has become possible only due to the proliferation of the World Wide Web (WWW), which is now simply called the Web. The Web has become the largest source of information in all walks of life. Users from different domains often extract information that fits their needs. The term Web information retrieval1 is used for extracting information from the Web.

Although Web information retrieval has its roots in traditional database systems [4], [5], the retrieval of information from the Web is more complex than information retrieval from a traditional database. This is due to subtle differences in their respective underlying databases2. In a traditional database, the data is often organized, limited, and static. As opposed to that, the Webbase is unorganized, unlimited, and often dynamic. Every second, a large number of updates are carried out in the Webbase. Moreover, as opposed to a traditional database, which is controlled by a specific operating system and whose data is located either at a central location or at least at a few known locations, the Webbase is not controlled by any specific operating system and its data may not reside at a central site or at a few known locations. Further, the Webbase can be thought of as a collection of a large number of traditional databases of various organizations. The expectations of a user searching for information on the Web are much higher than those of a user who is simply retrieving some information from a traditional database. This makes the task of extracting information from the Web a bit challenging [1].

Since Web searching is an important activity and the results so obtained may affect decisions and directions for individuals as well as organizations, it is of utmost importance to analyze the parameters or constituents involved in it. Many researchers have analyzed different issues pertaining to Web searching, including index quality [2], user-effort measures [3], Web page reputation [6], and user perceived quality [7].

In this paper, we try to answer the following question: What happens when a user fires queries to a search engine one by one that are correlated? Specifically, we wish to evaluate the effect of correlation among the queries submitted to a search engine (or a search portal).

The rest of this paper is organized as follows. In Section II, we briefly review the methodologies used in popular search engines. In Section III, we describe query correlation. Section IV contains results and discussion. In Section V, we describe a criterion for the comparison of search engines. Finally, Section VI is for conclusion and future work.

II. A REVIEW OF METHODOLOGIES USED IN SEARCH ENGINES

First we discuss a general strategy employed for retrieving information from the Web, and then we review some of the search portals.

A. A General Strategy for Searching

A general strategy for searching information on the Web is shown in Fig. 1. Broadly, a search engine consists of the following components: User Interface, Query Dispatcher, Cache3, Server Farm, and Web Base. The way these components interact with one another depends upon the strategy employed in a particular search engine. We describe here a broad view. An end user fires a query using an interface, say the User Interface. The User Interface provides a form to the user. The user fills the form with a set of keywords to be searched. The query goes to the Query Dispatcher which, after performing some refinements, sends it to the Cache.

1 The terms Web surfing, Web searching, Web information retrieval, and Web mining are often used in the same context. However, they differ depending upon the methodologies involved, the intensity of seeking information, and the intentions of the users who extract information from the Web.
2 Let us use the term Webbase for the collection of data in the case of the Web, in order to differentiate it from the traditional database.
3 We use the word Cache to mean the Search Engine Cache, i.e. the storage space where results matching previously fired queries or words are kept for future use.
Fig. 1. A general strategy for information retrieval from the Web.
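To make the interaction sketched in Fig. 1 concrete, the following Python sketch (ours, not taken from any actual search-engine implementation; all class and function names are illustrative) models a query dispatcher that refines a query, consults the cache first, and falls back to the server farm and its Web Base on a miss, caching the answer for future queries, as described in the text around the figure.

# Minimal sketch (ours) of the Fig. 1 flow: User Interface -> Query Dispatcher
# -> Cache -> Server Farm -> Web Base, with results cached for future queries.

STOP_WORDS = {"a", "an", "the", "is", "of", "in", "for"}  # illustrative subset

def refine(query):
    """Drop unimportant words (e.g. stop words) and normalize the query."""
    return tuple(w.lower() for w in query.split() if w.lower() not in STOP_WORDS)

class ServerFarm:
    """Stands in for the servers that build and consult the Web Base."""
    def __init__(self, webbase):
        self.webbase = webbase                 # {word: set of page ids}
    def search(self, refined):
        # Pages that contain any of the (refined) query words.
        pages = [self.webbase.get(w, set()) for w in refined]
        return set.union(*pages) if pages else set()

class QueryDispatcher:
    def __init__(self, farm):
        self.farm = farm
        self.cache = {}                        # refined query -> results
    def handle(self, query):
        refined = refine(query)
        if refined in self.cache:              # cache hit: answer immediately
            return self.cache[refined]
        results = self.farm.search(refined)    # cache miss: ask the server farm
        self.cache[refined] = results          # store for future reference
        return results

# Usage: the second query refines to the same word set, so it is a cache hit.
dispatcher = QueryDispatcher(ServerFarm({"multipath": {1, 2}, "routing": {2, 3}}))
print(dispatcher.handle("multipath routing"))
print(dispatcher.handle("the multipath routing"))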

If the query obtained after refinement4 matches a query in the Cache, the results are immediately sent by the Query Dispatcher to the User Interface and hence to the user. Otherwise, the Query Dispatcher sends the query to one of the servers in the Server Farm, which are busy building a Web Base for the search engine. The server so contacted, after due consultation of the Web Base, sends the results to the Cache so that the Cache may store them for future reference, if any. The Cache sends them to the Query Dispatcher. Finally, through the User Interface, the response is returned to the end user.

In what follows, we briefly review the strategies employed by different search portals.

B. Review of Strategies of Search Portals

The major search portals or search engines5 that end users generally use for searching are GoogleTM and YahooTM. Let us briefly review the methodologies behind the respective search engines6 of these search portals.

Google is based on the PageRank scheme described in [8]. It is somewhat similar to the scheme proposed by Kleinberg in [9], which is based on hub and authority weights and focuses on the citations of a given page. To understand Google's strategy, one has to first understand the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg. The reader is directed to [9] for HITS and to [8] for PageRank.

On the other hand, Yahoo employs an ontology based search engine. An ontology is a formal term used to mean a hierarchical structure of terms (or keywords) that are related. The relationships among the keywords are governed by a set of rules. As a result, an ontology based search engine such as Yahoo may search other related terms that are part of the ontology of the given term. Further, an ontology based search engine may not search words that are not part of its ontology, and it can modify its ontology with time. One step more, an ontology based search engine may also shorten the set of searched results before presenting them to the end users, dropping results that are not part of the ontology of the given term.

We now describe an important aspect pertaining to information retrieval from the Web. The results delivered by a search engine may depend upon how the queries are formulated and what relation a given query has with previously fired queries, if any. We wish to study the effect of correlation among the queries submitted to a search engine.

III. QUERY CORRELATION

The searched results may differ depending upon whether a search engine treats a set of words as an ordered set or an unordered set. In what follows, we consider each of them.

A. Permutations

The searched results delivered by a search engine may depend upon the order of words appearing in a given query7. If we take the order of words into account, the same set of words may form different queries for different orderings. The different orderings of the set of words of the given query are called permutations. The formal definition of the permutations of a given query is as follows.

Definition 1: Let the query Q = {w_i | 1 ≤ i ≤ m}, Q ≠ φ, be a set of words excluding the stop words of a natural language. Let P = {x_j | 1 ≤ j ≤ m} be a set of words excluding stop words. If P is such that w_i = x_j for some j not necessarily equal to i, and ∀ w_i ∈ Q, ∃ x_j ∈ P such that w_i = x_j where j may not be equal to i, then P is called a permutation of Q.

In the above definition, stop words are language dependent. For example, in the English language, the set of stop words, S, is often taken as

S = \{a, an, the, is, am, are, will, shall, of, in, for\}

4 By refinement of a query, we mean that the given query is transformed in such a way that the words and forms that are not so important are eliminated, so that they do not affect the results.
5 A search engine is a part of a search portal. A search portal provides many other facilities or services such as Advanced Search, News, etc.
6 The respective products are trademarks of their organizations.
7 The term 'query' means a set of words that is given to a search engine to search for the information available on the Web.
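As a concrete illustration of Definition 1, the short Python sketch below (ours; the stop word list is the example set S above, and the function names are illustrative) removes stop words from a query and generates all m! permutations of the remaining words.

from itertools import permutations

STOP_WORDS = {"a", "an", "the", "is", "am", "are", "will", "shall", "of", "in", "for"}

def query_words(query):
    """The words of the query excluding stop words (as in Definition 1)."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def query_permutations(query):
    """All orderings of the stop-word-free words of the query; there are m! of them."""
    return [" ".join(p) for p in permutations(query_words(query))]

# Example: a query of m = 3 words has 3! = 6 permutations.
for p in query_permutations("Ash Mohammad Abbas"):
    print(p)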
Note that if there are m words (excluding the stop words) in the given query, the number of permutations is m!.

The permutations are concerned with a single query. By submitting different permutations of the given query to a search engine, one may evaluate how the search engine behaves for different orderings of the same set of words. However, one would also like to know how the given search engine behaves when an end user fires different queries that may or may not be related. Specifically, one is interested in the behavior of a given search engine when the queries are related. In what follows, we discuss what is meant by correlation among different queries.

B. Correlation

An important aspect that may affect the results of Web searching is how different queries are related. Two queries are said to be correlated if there are common words between them. A formal definition of correlation among queries is as follows.

Definition 2: Let Q1 and Q2 be queries given to a search engine such that Q1 and Q2 are sets of words of a natural language and Q1, Q2 ≠ φ. Q1 and Q2 are said to be correlated if and only if there exists a set C = Q1 ∩ Q2, C ≠ φ.

One may use the above definition to define k-correlation between any two queries. Formally, it can be stated as a corollary of Definition 2.

Corollary 1: Two queries are said to be k-correlated if and only if |C| = k, where |·| denotes the cardinality.

For two queries that are correlated, we define a parameter called the Correlation Factor8 as follows.

Correlation Factor = \frac{|Q_1 \cap Q_2|}{|Q_1 \cup Q_2|}    (1)

This is based on the fact that |Q1 ∪ Q2| = |Q1| + |Q2| − |Q1 ∩ Q2|.

Note that 0 ≤ Correlation Factor ≤ 1. For two uncorrelated queries the Correlation Factor is 0. Further, one can see from Definition 1 that for the permutations of the same query, the Correlation Factor is 1.

Similarly, one may define the Correlation Factor for a cluster of queries. Let the number of queries be O. The cardinality of the union of the given cluster of queries is given by the following equation.

\left| \bigcup_{o=1}^{O} Q_o \right| = \sum_{i} |Q_i| - \sum_{i<j} |Q_i \cap Q_j| + \sum_{i<j<k} |Q_i \cap Q_j \cap Q_k| - \cdots + (-1)^{O-1} |Q_1 \cap Q_2 \cap \cdots \cap Q_O|    (2)

Using (2), one may define the Correlation Factor of a cluster of queries as follows.

Correlation Factor = \frac{\left| \bigcap_{o=1}^{O} Q_o \right|}{\left| \bigcup_{o=1}^{O} Q_o \right|}    (3)

A high correlation factor means that the queries in the cluster are highly correlated, and vice versa.

In what follows, we discuss results pertaining to query correlation.

Fig. 2. Latency versus page number for permutation P1.
Fig. 3. Latency versus page number for permutation P2.
Fig. 4. Latency versus page number for permutation P3.
Fig. 5. Latency versus page number for permutation P4.

8 This correlation factor is nothing but Jaccard's Coefficient, which is often used as a measure of similarity.
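To illustrate Definition 2, Corollary 1, and (1)–(3), the following Python sketch (ours; function names are illustrative, and stop words are assumed to have been removed already) computes the k-correlation and the correlation factor of two queries, and the cluster correlation factor of (3).

def words(query):
    """Treat a query as an (unordered) set of words; stop words assumed removed."""
    return set(query.lower().split())

def k_correlation(q1, q2):
    """k = |Q1 ∩ Q2| (Corollary 1)."""
    return len(words(q1) & words(q2))

def correlation_factor(q1, q2):
    """Jaccard coefficient |Q1 ∩ Q2| / |Q1 ∪ Q2| of (1)."""
    w1, w2 = words(q1), words(q2)
    return len(w1 & w2) / len(w1 | w2)

def cluster_correlation_factor(queries):
    """|∩ Qo| / |∪ Qo| of (3) for a cluster of queries."""
    sets = [words(q) for q in queries]
    return len(set.intersection(*sets)) / len(set.union(*sets))

# Example: the pair E3 of Table III is 3-correlated.
q1 = "node disjoint multipath routing"
q2 = "edge disjoint multipath routing"
print(k_correlation(q1, q2))          # 3
print(correlation_factor(q1, q2))     # 3/5 = 0.6
print(cluster_correlation_factor([q1, q2]))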
TABLE I
SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY "Ash Mohammad Abbas" FOR GOOGLE. FOR EACH PERMUTATION, THE FIRST ROW GIVES THE LATENCY, THE SECOND THE QUERY SPACE, AND THE THIRD THE NUMBER OF RELEVANT RESULTS ON PAGES p1 THROUGH p10.
Permutation p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
1 0.22 0.15 0.04 0.08 0.33 0.29 0.15 0.13 0.16 0.17
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
8 5 1 0 2 0 0 3 0 0
2 0.51 0.15 0.22 0.19 0.13 0.12 0.10 0.27 0.16 0.15
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
3 2 2 1 1 0 2 0 0 0
3 0.30 0.08 0.18 0.20 0.14 0.25 0.13 0.21 0.14 0.21
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
6 4 1 3 2 1 1 0 0 0
4 0.60 0.07 0.35 0.11 0.13 0.15 0.23 0.13 0.28 0.26
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
3 0 2 1 0 0 2 0 1 1
5 0.38 0.09 0.39 0.14 0.17 0.15 0.14 0.16 0.15 0.13
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
3 2 1 2 1 0 0 1 1 1
6 0.36 0.15 0.10 0.12 0.18 0.17 0.15 0.13 0.20 0.15
300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
5 4 1 3 0 2 1 2 2 0

TABLE II
SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY "Ash Mohammad Abbas" FOR YAHOO. FOR EACH PERMUTATION, THE FIRST ROW GIVES THE LATENCY, THE SECOND THE QUERY SPACE, AND THE THIRD THE NUMBER OF RELEVANT RESULTS ON PAGES p1 THROUGH p10.
Permutation p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
1 0.15 0.15 0.27 0.24 0.25 0.23 0.34 0.21 0.27 0.30
26100 26400 26400 27000 27000 26900 26900 26900 25900 25900
10 4 1 0 0 1 0 0 0 0
2 0.18 0.13 0.20 0.15 0.19 0.10 0.15 0.09 0.12 0.13
26900 27000 27000 26900 25800 26900 26900 26800 26800 26800
4 6 1 1 0 1 1 1 0 0
3 0.12 0.11 0.15 0.14 0.11 0.10 0.11 0.12 0.09 0.13
26900 27100 26900 26900 26500 26800 26800 26500 26500 26700
10 3 1 2 0 0 0 0 0 0
4 0.03 0.10 0.14 0.13 0.12 0.20 0.10 0.19 0.12 0.17
27000 26400 26400 26700 27000 26700 26400 26900 26800 26800
7 4 0 2 1 0 0 1 0 1
5 0.12 0.12 0.20 0.08 0.13 0.10 0.12 0.09 0.13 0.20
26400 26800 26800 26800 26800 26700 26700 26800 26700 26200
8 5 1 1 0 0 0 0 0 1
6 0.16 0.10 0.16 0.12 0.13 0.11 0.10 0.11 0.12 0.15
27100 26700 27100 26700 26600 27000 26600 26900 26500 26500
10 5 0 0 0 1 0 0 0 0

Fig. 6. Latency versus page number for permutation P5.
Fig. 7. Latency versus page number for permutation P6.

IV. RESULTS AND DISCUSSION

The search portals that we have evaluated are Google and Yahoo. We have chosen them because they represent the search portals that the majority of end users use in their day-to-day searching. One more reason for choosing them for performance evaluation is that they represent different classes of search engines. As mentioned earlier, Yahoo is based on an ontology while Google is based on page ranks. Therefore, if one selects them, one may evaluate two distinct classes of search engines.

The search environment is as follows. The client from which the queries were fired was a Pentium III machine. The machine was part of a 512 Kbps local area network. The operating system was Windows XP.

In what follows, we discuss the behavior of the search engines for different permutations of a query.
A. Query Permutations

To see how a search engine behaves for different permutations of a query, we consider the following query.

Ash Mohammad Abbas

The different permutations of this query are
1) Ash Mohammad Abbas
2) Ash Abbas Mohammad
3) Abbas Ash Mohammad
4) Abbas Mohammad Ash
5) Mohammad Ash Abbas
6) Mohammad Abbas Ash

We have assigned a number to each permutation to differentiate it from the others. We wish to analyze the search results on the basis of search time, number of relevant results, and query space. The query space is the cardinality of the set of all results returned by a given search engine in response to a given query. Note that the search time is defined as the actual time taken by the search engine to deliver the searched results. Ideally, it does not depend upon the speeds of the hardware, software, and network components from which queries are fired, because it is the time taken by the search engine server. Relevant results are those which the user intends to find. For example, the user intends to search for information about Ash Mohammad Abbas9; therefore, all results that contain Ash Mohammad Abbas are relevant for the given query.

In what follows, we discuss the results obtained for different permutations of the given query, Ash Mohammad Abbas. For all permutations, all results that contain Ash Mohammad Abbas are counted as relevant. Since both Google and Yahoo deliver results page wise, we list all parameters mentioned in the previous paragraph page wise. We go up to 10 pages for both search engines, as the results beyond that are rarely significant.

Table I shows the search latencies, query space, and number of relevant results for different permutations of the given query. The search portal is Google. Our observations are as follows.
• For all permutations, the query space remains the same and does not vary across the pages of results.
• The time to search the first page of results in response to the given query is the largest for all permutations.
• The first page of results contains the most relevant results.

Fig. 8. Latency versus correlation for queries with embedded semantics.
Fig. 9. Latency versus correlation for random queries.
Fig. 10. Query Space versus correlation for queries with embedded semantics.
Fig. 11. Query Space versus correlation for random queries.

9 We have intentionally taken the query Ash Mohammad Abbas. We wish to search for different permutations of a query and the effect of those permutations on the query space and on the number of relevant results. The relevance is partly related to the intentions of the end user. Since we already know the relevant results for the chosen query, it is easier to decide which of the returned results are relevant. The reader may take any other query, if he/she wishes; in that case, he/she has to decide which results are relevant to his/her query, and this will partly depend upon what he/she intended to search.
TABLE III
QUERIES WITH EMBEDDED SEMANTICS.
S. No. Query No. Query Correlation
E1 Q1 node disjoint multipath 1
Q2 edge disjoint multicast
E2 Q1 node disjoint multipath routing 2
Q2 edge disjoint multicast routing
E3 Q1 node disjoint multipath routing 3
Q2 edge disjoint multipath routing
E4 Q1 node disjoint multipath routing ad hoc 4
Q2 wireless node disjoint multipath routing

TABLE IV
QUERIES WITHOUT EMBEDDED SEMANTICS (RANDOM QUERIES).
S. No. Query No. Query Correlation
R1 Q1 adhoc node ergonomics 1
Q2 quadratic power node
R2 Q1 computer node constellations parity 2
Q2 hiring parity node biased
R3 Q1 wireless node parity common mitigate 3
Q2 mitigate node shallow rough parity
R4 Q1 few node parity mitigate common correlation 4
Q2 shallow mitigate node parity common stanza

TABLE V
SEARCH TIME AND QUERY SPACE FOR QUERIES WITH EMBEDDED SEMANTICS.
S. No. Query No. Google Yahoo
Time Query Space Time Query Space
E1 Q1 0.27 43100 0.37 925
Q2 0.23 63800 0.28 1920
E2 Q1 0.48 37700 0.40 794
Q2 0.32 53600 0.32 1660
E3 Q1 0.48 37700 0.40 794
Q2 0.24 21100 0.34 245
E4 Q1 0.31 23500 0.64 79
Q2 0.33 25600 0.44 518

TABLE VI
SEARCH TIME AND QUERY SPACE FOR RANDOM QUERIES.
S. No. Query No. Google Yahoo
Time Query Space Time Query Space
R1 Q1 0.44 28500 0.57 25
Q2 0.46 476000 0.28 58200
R2 Q1 0.46 34300 0.55 164
Q2 0.42 25000 0.35 90
R3 Q1 0.47 25000 0.40 233
Q2 0.33 754 0.68 31
R4 Q1 0.34 20000 0.58 71
Q2 1.02 374 0.64 23

Table II shows the same set of parameters for different permutations of the given query for the search portal Yahoo. From the table, we observe that
• As opposed to Google, the query space does not remain the same; rather, it varies with the pages of searched results. The query space in this case is less than that of Google.
• The time to search the first page of results is not necessarily the largest among the pages considered. More precisely, it is larger for the pages where there is no relevant result. Further, the time taken by Yahoo is less than that of Google.
• In most of the cases, the first page contains the largest number of relevant results. For permutation 2 (i.e. Ash Abbas Mohammad), the second page contains the largest number of relevant results.

Let us discuss the reasons for the above mentioned observations. Consider the question why the query space in case of Google is larger than that of Yahoo. We have pointed out that Google is based on page ranks. For a given query (or a set of words), it ranks the pages, and it delivers all the ranked pages that contain the words of the given query. On the other hand, Yahoo is an ontology based search engine. As mentioned earlier, it will search only that part of its Webbase that constitutes the ontology of the given query. This is the reason why the query space in case of Google is larger than that of Yahoo.
TABLE VII
LATENCY MATRIX, L, FOR DIFFERENT PERMUTATIONS.
P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   1    1/2  1    1    0    0    1    1    1    1
2   0    0    0    0    1    0    1    0    0    0
3   0    1    0    0    0    0    0    0    0    0
4   0    1    0    1    0    1    0    1    0    0
5   0    1    0    0    0    0    0    0    0    1
6   0    0    1    1/2  0    0    0    0    0    1/2

TABLE VIII
QUERY SPACE MATRIX, S, FOR DIFFERENT PERMUTATIONS.
P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   1    1    1    1    1    1    1    1    1    1
2   1    1    1    1    1    1    1    1    1    1
3   1    1    1    1    1    1    1    1    1    1
4   1    1    1    1    1    1    1    1    1    1
5   1    1    1    1    1    1    1    1    1    1
6   1    1    1    1    1    1    1    1    1    1

TABLE IX
RELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR GOOGLE.
P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   8    5    1    0    2    0    0    3    0    0
2   3    2    2    1    1    0    2    0    0    0
3   6    4    1    3    2    1    1    0    0    0
4   3    0    2    1    0    0    2    0    1    1
5   3    2    1    2    1    0    0    1    1    1
6   5    4    1    3    0    2    1    2    2    0

TABLE X
RELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR YAHOO.
P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   10   4    1    0    0    1    0    0    0    0
2   4    6    1    1    0    1    1    1    0    0
3   10   3    1    2    0    0    0    0    0    0
4   7    4    0    2    1    0    0    1    0    1
5   8    5    1    1    0    0    0    0    0    1
6   10   5    0    0    0    1    0    0    0    0

Let us answer the question why the query space changes in case of Yahoo and why it remains constant in case of Google. Note that an ontology may change with time and with the order of words in the given query. For every page of results, Yahoo estimates the ontology of the given permutation of the query before delivering the results to the end user. Therefore, the query space for different permutations of the given query is different, and it changes with the pages of the searched results10. However, page ranks change neither with pages nor with the order of words. The page ranks will only change when new links or documents that are relevant to the given query are added to the Web. Since neither a new link nor a new document was added to the Web during the evaluation of the permutations of the query, the query space does not change in case of Google.

In order to compare the performance of Google and Yahoo, the latencies versus page numbers for different permutations of the query are shown in Figures 2 through 7. Let us consider the question why the search time in case of Google is larger than that of Yahoo. Note that Google ranks the results before delivering them to end users while Yahoo does not. The ranking of pages takes time. This is the reason why the search time taken by Google is larger than that of Yahoo.

In what follows, we discuss how a search engine behaves for correlated queries.

B. Query Correlation

We have formulated k-correlated queries as shown in Table III. Since all words contained in a query are related11, we call them queries with embedded semantics. On the other hand, we have another set of k-correlated queries, as shown in Table IV. The words contained in these queries are random and are not related semantically.

We wish to evaluate the performance of a search engine for k-correlated queries. For that, we evaluate the search time and query space of a search engine for the first page of results. Since both Google and Yahoo deliver 10 results per page, looking at the first page of results means that we are evaluating the 10 topmost results of these search engines. Note that we do not consider the number of relevant results because relevancy in this case would be query dependent. Since there is no single query, evaluation of relevancy would not be very useful.

Table V shows the search time and query space for k-correlated queries with embedded semantics (see Table III). The second query, Q2, is fired after the first query Q1. On the other hand, Table VI shows the search time and query space for k-correlated queries whose words may not be related (see Table IV).

TABLE XI
RELEVANCE FOR DIFFERENT PERMUTATIONS.
P       Google  Yahoo
1       19      16
2       11      15
3       18      16
4       10      16
5       12      16
6       20      16
Total   90      95

TABLE XII
EARNED POINTS (EP) FOR DIFFERENT PERMUTATIONS.
        Google                          Yahoo
P       Latency  Query Space  EP        Latency  Query Space  EP
1       14.5     19           33.5      3        0            3
2       3        11           14        14       0            14
3       4        18           22        13       0            13
4       1        10           11        9        0            9
5       3        12           15        10       0            10
6       2.5      20           22.5      16       0            16
Total                         118                             65

10 This observed behavior may also be due to the use of a randomized algorithm. To understand the behavior of randomized algorithms, readers are referred to any text on randomized algorithms such as [10].
11 More precisely, all words in these queries are from ad hoc wireless networks, an area in which the authors of this paper like to work.
In order to compare the performance of Yahoo and Google, the latencies versus correlation for queries with embedded semantics are shown in Figure 8 and those for randomized queries in Figure 9. Similarly, the query space versus correlation for queries with embedded semantics is shown in Figure 10 and that for randomized queries in Figure 11.

The query space of Yahoo is much less than that of Google for the reasons discussed in the previous subsection. Other important observations are as follows.
• In case of k-correlated queries with embedded semantics, the time to search for Q2 is generally less than that for Q1. This is due to the fact that, since the queries are correlated, some of the words of Q2 have already been searched while searching for Q1.
• The query space increases when the given query has a word that is more frequently found in Web pages (e.g. in R1:Q2, the word quadratic, which is frequently used in Engineering, Science, Maths, Arts, etc.). The query space decreases when the query includes a word that is rarely used (e.g. mitigate, included in R3,R4:Q1,Q2, and shallow, included in R3,R4:Q2).
• The search time is larger in case of randomized queries as compared to queries with embedded semantics. The reason for this observation is as follows. In case of queries with embedded semantics, the words of a given query are related and are found in Web pages that are not too far from one another, either from the point of view of page rank as in Google or from the point of view of ontology as in Yahoo.
• One cannot infer anything about the relative search time of Google and Yahoo, as it depends upon the query. More precisely, it depends upon which strategy takes more time: page ranking in Google or estimation of the ontology in Yahoo.

However, from Table V and Table VI, one can infer the following. Google is better in the sense that its query space is much larger than that of Yahoo. However, Yahoo takes less time as compared to Google for different permutations of the same query. For k-correlated queries with embedded semantics, Google takes less time than Yahoo to search for the first query. This also applies to randomized queries, with some exceptions; in the exceptional cases, Google takes much more time as compared to Yahoo. As mentioned previously, this depends upon the given query as well as the strategy employed in the search engine.

In what follows, we describe a unified criterion for comparing the search engines considered in this paper.

V. A UNIFIED CRITERION FOR COMPARISON

Let us denote Google by a superscript '1' and Yahoo by a superscript '0'12. Let L = (l_ij) be a matrix where l_ij is defined as follows.

l_{ij} = \begin{cases} 1 & \text{if } latency^{1}_{ij} < latency^{0}_{ij} \\ 1/2 & \text{if } latency^{1}_{ij} = latency^{0}_{ij} \\ 0 & \text{otherwise} \end{cases}    (4)

Similarly, let S = (s_ij) be a matrix where s_ij is defined as follows.

s_{ij} = \begin{cases} 1 & \text{if } space^{1}_{ij} > space^{0}_{ij} \\ 1/2 & \text{if } space^{1}_{ij} = space^{0}_{ij} \\ 0 & \text{otherwise} \end{cases}    (5)

In the matrices defined above, a '1' means that at that place Google is the winner, and a '1/2' represents a tie between Google and Yahoo. We now define a parameter that we call Earned Points (EP), as follows.

EP^{k} = \sum_{i=1}^{pages} relevant^{k}_{i} \left( L^{k}_{i} + S^{k}_{i} \right)    (6)

where the superscript k ∈ {0, 1} denotes the search engine.

Table VII shows the latency matrix, L, for the different permutations of the query considered in Table I and Table II; it has been constructed using both of them. In the latency matrix, there are 40 '0's, 17 '1's, and 3 '1/2's. We observe from the latency matrix that Yahoo is the winner (as far as latencies are concerned), as there are 40 '0's out of 60 entries in total.

On the other hand, Table VIII shows the query space matrix, S, for different permutations of the same query, constructed using the tables mentioned in the preceding paragraph. One can see that, as far as query space is concerned, Google is the sole winner. In fact, the query space of Google is much larger than that of Yahoo.

The relevance matrix for Google is shown in Table IX and that for Yahoo is shown in Table X. The total relevance for the first ten pages is shown in Table XI for both Google and Yahoo. It is seen from Table XI that the total relevance for Google is 90 and that for Yahoo is 95. The average relevance per permutation and per page for Google is 1.5 and that for Yahoo is 1.583. Therefore, as far as average relevance is concerned, Yahoo is the winner.

TABLE XIII
CEP FOR DIFFERENT PERMUTATIONS FOR GOOGLE.
P       Latency Contribution    Query Space Contribution
1       123.834                 5700000
2       61.262                  3300000
3       116.534                 5400000
4       35.918                  3000000
5       73.458                  3600000
6       119.370                 6000000
Total   530.376                 27000000

TABLE XIV
CEP FOR DIFFERENT PERMUTATIONS FOR YAHOO.
P       Latency Contribution    Query Space Contribution
1       101.385                 419900
2       107.821                 404100
3       131.558                 431000
4       308.197                 428700
5       130.833                 425000
6       121.591                 431500
Total   901.385                 2540200

12 This is simply a representation. One may consider the reverse representation; even then, there will be no effect on the criterion.
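To make (4)–(6) concrete, the following Python sketch (ours; variable names are illustrative) computes the latency and query-space indicators and the earned points of both engines for permutation 2, using the per-page latencies, query spaces, and relevance counts taken from Tables I, II, IX, and X. Under these assumptions it should reproduce the second row of Table XII (14 points for each engine).

# Per-page data for permutation 2 of the query (from Tables I, II, IX and X).
g_lat   = [0.51, 0.15, 0.22, 0.19, 0.13, 0.12, 0.10, 0.27, 0.16, 0.15]
y_lat   = [0.18, 0.13, 0.20, 0.15, 0.19, 0.10, 0.15, 0.09, 0.12, 0.13]
g_space = [300000] * 10
y_space = [26900, 27000, 27000, 26900, 25800, 26900, 26900, 26800, 26800, 26800]
g_rel   = [3, 2, 2, 1, 1, 0, 2, 0, 0, 0]
y_rel   = [4, 6, 1, 1, 0, 1, 1, 1, 0, 0]

def indicator(mine, other, smaller_wins):
    """Per-page indicator of (4)/(5): 1 if this engine wins, 1/2 on a tie, else 0."""
    out = []
    for a, b in zip(mine, other):
        if a == b:
            out.append(0.5)
        elif (a < b) == smaller_wins:
            out.append(1.0)
        else:
            out.append(0.0)
    return out

def earned_points(rel, lat_ind, space_ind):
    """EP of (6): relevance weighted by the latency and query-space indicators."""
    return sum(r * (l + s) for r, l, s in zip(rel, lat_ind, space_ind))

L_g = indicator(g_lat, y_lat, smaller_wins=True)       # lower latency wins
S_g = indicator(g_space, y_space, smaller_wins=False)  # larger query space wins
L_y = indicator(y_lat, g_lat, smaller_wins=True)
S_y = indicator(y_space, g_space, smaller_wins=False)

print(earned_points(g_rel, L_g, S_g))  # Google, permutation 2
print(earned_points(y_rel, L_y, S_y))  # Yahoo, permutation 2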
Table XII shows the number of earned points for both Google and Yahoo for the different permutations of the query mentioned earlier. We observe that the number of earned points for Google is 118 and that for Yahoo is 65. The number of earned points of Google is far greater than that of Yahoo. The reason behind this is that the query space of Yahoo is always less than that of Google, and hence it does not contribute to Yahoo's earned points.

A closer look at the definition of EP reveals that, while defining the parameter EP in (6) together with (4) and (5), we have assumed that a search engine either has a constituent parameter (latency or query space) or it does not have that parameter at all. The contribution of a parameter is lost when the effective contribution of the indicator by which the given parameter is multiplied is zero. Note that our goal behind the introduction of (6) was to rank the given set of search engines. We call this type of ranking of search engines Lossy Constituent Ranking (LCR). We, therefore, feel that there should be a method of comparison between a given set of search engines that is lossless in nature. For that purpose, we define another parameter that we call Contributed Earned Points (CEP). The definition of CEP is as follows.

CEP^{k} = \sum_{i=1}^{pages} relevant^{k}_{i} \left( \frac{1}{d^{k}_{i}} + q^{k}_{i} \right)    (7)

where the superscript k ∈ {0, 1} denotes the search engine, d denotes the actual latency, and q denotes the actual query space. The reason behind having the inverse of the actual latency in (7) is that the better search engine is the one that takes less time.

Table XIII shows the contributions of latency and query space in CEP for Google. Similarly, Table XIV shows the same for Yahoo. We observe that the contribution of latency for Google is 530.376 and that for Yahoo is 901.385. However, the contribution of query space for Google is 27000000 and that for Yahoo is 2540200. In other words, the contribution of query space for Google is approximately 11 times that for Yahoo. Adding these contributions shall result in a larger CEP for Google as compared to Yahoo. The CEP defined using (7) has a problem that we call the dominating constituent problem (DCP): the larger parameter suppresses the smaller parameter. Note that the definition of CEP in (7) assumes equal weights for latency and query space. On the other hand, one may be interested in assigning different weights to the constituents of CEP depending upon their importance. Let us rewrite (7) to incorporate weights. Let w_l and w_q be the weights assigned to latency and query space, respectively. Then (7) can be written as follows.

CEP^{k} = \sum_{i=1}^{pages} relevant^{k}_{i} \left( \frac{1}{d^{k}_{i}} w_l + q^{k}_{i} w_q \right)    (8)

The weights should be chosen carefully. For example, the weights w_l = 1, w_q = 10^{-6} will add 27 to the contribution in CEP due to query space for Google and 2.54 for Yahoo. On the other hand, the set of weights w_l = 1, w_q = 10^{-5} shall add 270 for Google and 25.4 for Yahoo. Table XV shows the contribution of query space in CEP for different sets of weights. Note that w_l is fixed to 1 for all sets, and only w_q is varied. As w_q is increased beyond 10^{-5}, the contribution of query space starts dominating over the contribution of latency. The set of weights w_l = 1, w_q = 10^{-5} indicates that one can ignore the contribution of query space in comparison to the contribution of latencies, provided that one is more interested in comparing search engines with respect to latency. In that case, an approximate expression for CEP can be written as follows.

CEP^{k} \approx \sum_{i=1}^{pages} relevant^{k}_{i} \left( \frac{1}{d^{k}_{i}} \right)    (9)

Alternatively, one may consider an approach that is a combination of the definition of EP in (6) (together with (4) and (5)) and that of CEP in (7). In that, we may use the definition of the matrix S, which converts the contribution of query space into binary form13. The modified definition is as follows.

CCEP^{k} = \sum_{i=1}^{pages} relevant^{k}_{i} \left( \frac{1}{d^{k}_{i}} + S^{k}_{i} \right)    (10)

where S^{k}_{i} is in accordance with the definition of S given by (5). The acronym CCEP stands for Combined Contributory Earned Points. If one wishes to incorporate weights, then the definition of CCEP becomes as follows.

CCEP^{k} = \sum_{i=1}^{pages} relevant^{k}_{i} \left( \frac{1}{d^{k}_{i}} w_l + S^{k}_{i} w_q \right)    (11)

TABLE XV
CONTRIBUTION DUE TO QUERY SPACE IN CEP FOR DIFFERENT SETS OF WEIGHTS.
Weights                     Google          Yahoo
w_l = 1, w_q = 10^-6        27.00           2.54
w_l = 1, w_q = 10^-5        270.00          25.40
w_l = 1, w_q = 10^-4        2700.00         254.02
w_l = 1, w_q = 10^-3        27000.00        2540.20
w_l = 1, w_q = 10^-2        270000.00       25402.00
w_l = 1, w_q = 10^-1        2700000.00      254020.00
w_l = 1, w_q = 1            27000000        2540200

TABLE XVI
CCEP FOR DIFFERENT SETS OF COMPARABLE WEIGHTS.
Weights                     Google          Yahoo
w_l = 0.9, w_q = 0.1        486.3384        811.2465
w_l = 0.8, w_q = 0.2        442.3008        721.1080
w_l = 0.7, w_q = 0.3        398.2632        630.9695
w_l = 0.6, w_q = 0.4        354.2256        540.8310
w_l = 0.5, w_q = 0.5        310.1880        450.6925
w_l = 0.4, w_q = 0.6        266.1504        360.5540
w_l = 0.3, w_q = 0.7        222.1128        270.4155
w_l = 0.2, w_q = 0.8        178.0752        180.2770
w_l = 0.1, w_q = 0.9        134.0376        90.1385

13 We mean that the matrix S indicates either that there is a contribution of the query space of a search engine, provided that its query space is larger than that of the other one, or that there is no contribution of query space at all, otherwise.
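As a sketch of how (7), (8), (10), and (11) can be evaluated (ours; the per-page relevance, latency, and query-space values for permutation 1 of Google are taken from Tables I and IX, and the S entries are all 1 for Google as in Table VIII), the code below computes the latency and query-space contributions to CEP and a weighted CCEP. The latency part should match the first row of Table XIII.

# Permutation 1 for Google: relevance, latency and query space per page.
rel   = [8, 5, 1, 0, 2, 0, 0, 3, 0, 0]
lat   = [0.22, 0.15, 0.04, 0.08, 0.33, 0.29, 0.15, 0.13, 0.16, 0.17]
space = [300000] * 10
S_row = [1.0] * 10                                    # Google wins query space on every page

def cep(rel, lat, space, wl=1.0, wq=1.0):
    """Weighted CEP of (8); with wl = wq = 1 it reduces to (7)."""
    return sum(r * (wl / d + wq * q) for r, d, q in zip(rel, lat, space))

def ccep(rel, lat, s_row, wl=1.0, wq=1.0):
    """Weighted CCEP of (11); with wl = wq = 1 it reduces to (10)."""
    return sum(r * (wl / d + wq * s) for r, d, s in zip(rel, lat, s_row))

lat_part = sum(r / d for r, d in zip(rel, lat))       # ~123.83, cf. Table XIII
space_part = sum(r * q for r, q in zip(rel, space))   # 19 * 300000 = 5700000
print(lat_part, space_part)
print(cep(rel, lat, space, wl=1.0, wq=1e-6))          # query space adds only ~5.7 here
print(ccep(rel, lat, S_row, wl=0.5, wq=0.5))          # equal comparable weights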
In the definition of CCEP given by (11), the weights can be comparable, and the dominant constituent problem mentioned earlier can be mitigated for comparable weights. We define comparable weights as follows.

Definition 3: A set of weights W = {w_i}, w_i > 0, is said to consist of comparable weights if and only if \sum_i w_i = 1 and the condition 1/9 ≤ w_i / w_j ≤ 9 is satisfied ∀ w_i, w_j ∈ W.

Table XVI shows the values of CCEP for different sets of comparable weights. We observe that the rate of decrease of CCEP for Yahoo is larger than that for Google. For example, for w_l = 0.9, w_q = 0.1, the CCEP for Google is 486.3384 and that for Yahoo is 811.2465. For w_l = 0.8, w_q = 0.2, the CCEP for Google is 442.3008 and that for Yahoo is 721.1080. In other words, the rate of decrease in CCEP for Google is 9.05% and that for Yahoo is 11.11%. The reason is that in the query space matrix, S, (see Table VIII) all entries are '1', which means that the query space of Google is always larger than that of Yahoo. Therefore, in case of Yahoo, the contribution due to query space is always 0, irrespective of the weight assigned to it. However, in case of Google the contribution due to query space is nonzero and increases with an increase in the weight assigned to it. Moreover, for the set of weights W = {w_l = 0.5, w_q = 0.5}, the values of CCEP are 310.1880 and 450.6925 for Google and Yahoo, respectively. It means that if one wishes to assign equal weights to latency and query space, then Yahoo is the winner in terms of the parameter CCEP.

In case of CCEP, the effect of the dominating constituent problem is less than in case of CEP. In other words, the effect of large values of query space is fairly smaller in case of CCEP as compared to that in case of CEP. This is in line with our remark that with the use of CCEP the dominating constituent problem is mitigated.

VI. CONCLUSIONS

In this paper, we analyzed the impact of correlation among queries on search results for two representative search portals, namely Google and Yahoo. The major accomplishments of the paper are as follows:
• We analyzed the search time, the query space, and the number of relevant results per page for different permutations of the same query. We observed that these parameters vary with the pages of searched results and are different for different permutations of the given query.
• We analyzed the impact of k-correlation among two subsequent queries given to a search engine. In that, we analyzed the search time and the query space. We observed that
– The search time is less in case of queries with embedded semantics as compared to randomized queries without any semantic consideration.
– In case of randomized queries, the query space is increased when the given query includes a word that is frequently found on the Web, and vice versa.

Further, we considered a unified criterion for comparison between the search engines. Our criterion is based upon the concept of earned points. An end user may assign different weights to the different constituents of the criterion, namely latency and query space. Our observations are as follows.
• The performance of Yahoo is better in terms of latencies; however, Google performs better in terms of query space.
• We discussed the dominant constituent problem. This problem can be mitigated using the concept of contributory earned points if the weights assigned to the constituents are comparable. If both constituents are assigned equal weights, we found that Yahoo is the winner.

However, the performance of a search engine may depend upon the criterion itself, and only one criterion may not be sufficient for an exact analysis of the performance. Further investigations and improvements in this direction form our future work.

REFERENCES

[1] S. Malhotra, "Beyond Google", CyberMedia Magazine on Data Quest, vol. 23, no. 24, p. 12, December 2005.
[2] M.R. Henzinger, A. Heydon, M. Mitzenmacher, M. Najork, "Measuring Index Quality Using Random Walks on the Web", Proceedings of the 8th International World Wide Web Conference, pp. 213-225, May 1999.
[3] M.C. Tang, Y. Sun, "Evaluation of Web-Based Search Engines Using User-Effort Measures", LIBRES: Library and Information Science Research Electronic Journal, vol. 13, issue 2, 2003, http://libres.curtin.edu.au/libres13n2/tang.htm.
[4] C.W. Cleverdon, J. Mills, E.M. Keen, An Inquiry in Testing of Information Retrieval Systems, Cranfield, U.K., 1966.
[5] J. Gwizdka, M. Chignell, "Towards Information Retrieval Measures for Evaluation of Web Search Engines", http://www.imedia.mie.utoronto.ca/people/jacek/pubs/webIR eval1 99.pdf, 1999.
[6] D. Rafiei, A.O. Mendelzon, "What is This Page Known For: Computing Web Page Reputations", Elsevier Journal on Computer Networks, vol. 33, pp. 823-835, 2000.
[7] N. Bhatti, A. Bouch, A. Kuchinsky, "Integrating User-Perceived Quality into Web Server Design", Elsevier Journal on Computer Networks, vol. 33, pp. 1-16, 2000.
[8] S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", http://www-db.stanford.edu/pub/papers/google.pdf, 2000.
[9] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Proceedings of the 9th ACM/SIAM Symposium on Discrete Algorithms, 1998.
[10] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press, August 1995.
