Ranking of Web Search Through The Power Method


Hasan Tahir and Muhammad Tahir
Abstract: A search engine is a coherent search entity designed to facilitate internet users in searching for their required information. The web is expanding at a phenomenal rate, and a typical search engine is required to process every user query and provide results which are accurate, consistent, filtered and ranked according to importance. Besides this, a search engine is also required to process a user query quickly. Therefore the importance of a search algorithm that meets all the above goals cannot be denied. This paper presents the power method, which forms the basis of most web search algorithms. The power method is an efficient algorithm that ranks sites based on their importance. The algorithm is based on the use of an adjacency matrix, from which dominant eigenvalues and dominant eigenvectors are identified.

Index Terms: Dominant eigenvectors, page rank, power method, search engine.

1 INTRODUCTION
The internet is a network of millions of private, public, academic, business and government computers. The primary purpose of the internet is to share data, which can be in the form of pictures, text, documents, videos, etc. The internet can therefore also be seen as a vast library that is open to all users. To gain access to the endless resources the internet has to offer, all users regularly require the services of a search engine. Both users and researchers acknowledge that searching the internet for the most relevant information is a difficult task: users of the internet often find that they are unable to access the exact information they are looking for. Fundamentally a search engine is an entity that crawls the internet for the most relevant results regarding a user query. In other words, a search engine removes the complexities of searching the internet by providing a search interface. The modern search engine is the result of constant research and development, and thus provides better searching and quicker result generation. Researchers have suggested many techniques to eliminate the complexities of searching the web; common methods for improved web searching include ranking of sites, simplified interfaces and ranking based on site speed. This paper presents the power method (with appropriate scaling), which performs computations on an adjacency matrix defined from a search set containing numerous sites in order to extract the dominant eigenvalues and eigenvectors that rank websites based on their importance. Earlier generations of search engines presented search results without any formal algorithms; even where an algorithm was present, it did not provide any sort of intelligence to the search engine. With the help of the power method, modern search engines can present a user with a vast list of ranked search pages. The power method is an attempt to provide user satisfaction through improved and optimized web searching; further, it is evident that the power method is in line with the design goals of web search engines.

Section 2 of the paper presents the generations of search engines. Search engines have been broken down into five generations, and each generation builds upon the work of the previous one. In Section 3 the design goals of search engines are presented. To fully explain the working of search engines, a generic search engine architecture is presented in Section 4. Section 5 explains, in detail, the power method and its algorithm. Mathematical examples are given to explain the functioning of the power method, which is concluded with a programming algorithm and the formation of the search engine result page.

2 GENERATIONS OF SEARCH ENGINES


Search engines themselves are the result of an evolutionary process. The search engine of today has gone through many changes to arrive at its present form. In the early years of the web, the search engines that existed were not very mature. With the passage of time a very obvious paradigm shift occurred that changed how search engines worked. Discussed below are the generations of search engines and their various search mechanisms.

H. Tahir is with the Department of Computer Engineering, College of Electrical & Mechanical Engineering, National University of Sciences and Technology, Pakistan. M. Tahir is with the Department of Mathematics, HITEC University Taxila Cantt., Pakistan.

2.1 First Generation
The first generation of search engines was very primitive and inaccurate. These search engines had no formal algorithm for searching and ranking the results. Many of them were merely searching entities that could only search among a collection of websites rather than the entire web. Many of the initial search engines did not possess any form of intelligence, and hence they were considered fairly ineffective. A good example of a first generation search engine is ALIWEB. This search engine allowed users to submit their own webpage to the search engine, which meant that if a webpage was not submitted to the search engine then it would not be included in the search. In this way a repository of webpages was formed that allowed users to search for particular information.

2.2 Second Generation
The second generation search engines were very mature compared to the first generation. The search engines belonging to this generation could search the entire web by using more efficient searching techniques. For example, EXCITE was considered a very popular search engine of this generation because it used statistical analysis to search for relationships among words. A paradigm shift was thus observed that led to the incorporation of mathematical algorithms into search engines. Ranking was purely content-based and rooted in information retrieval. Although the search engines belonging to this generation were quite practical, there was still room for increased levels of perfection and intelligence.

2.3 Third Generation
The search engines of this generation were designed to rank pages according to popularity. The concept behind these search engines was that a high level of popularity implies that the webpage is appropriate and most likely to be read by the user. The popularity of a particular page was based on hyperlinks from one site to another. Each hyperlink was treated as a popularity vote and functioned very much like an academic citation of a research paper. The higher the popularity vote, the higher the ranking of that particular webpage. Lycos, a search engine developed at Carnegie Mellon University, belonged to this generation of search engines.

2.4 Fourth Generation
The fourth generation search engines are the first that formally incorporated mathematical logic into their searching. Although previous generations of search engines also had mathematical intelligence, they lacked accuracy and perfection. To overcome these limitations, search engines like Google and Yahoo used the PageRank algorithm to search for particular information. These search engines revolutionized the way search engines operate: they were more accurate, faster, and fulfilled all the other design goals of search engines. The fourth generation search engines search for data based on the content, the structure and the page rank of the webpage, and were the first of their kind to perform searches based on these three criteria. They also provide advanced searching options along with support for image search and search for particular file formats.

2.5 Fifth Generation
Earlier search engines performed searches based on the user query without understanding what the user was asking for. The search engines that are currently under development are being designed to comprehend what the user is asking for and then search for the exact information required, resulting in greater accuracy and user satisfaction. Researchers have also tried to incorporate artificial intelligence into search engines, and the concept of self-organizing neural networks is also being used in upcoming search engines.

3 DESIGN GOALS FOR SEARCH ENGINES


Search engines are used by a broad range of computer users. Some users are computer professionals while others are new to computers and, consequently, to the internet. Therefore search engines need to be designed for a broad audience, and they need to be planned to attain the highest levels of quality.

3.1 Usability
Search engines need to be designed for high levels of usability. Usability expresses the need for high quality design along with ease of use, and is a factor often studied in relation to man-machine interfaces. A search engine that is designed with usability in view promises higher levels of user satisfaction. Currently there are two widely used search engine designs: minimalistic and baroque. Minimalistic designs provide a basic interface with an uncluttered appearance; the search engines that follow the minimalistic design present the user with direct access to the search text box. Google is commonly seen as the first minimalistic search engine. The baroque design is a comprehensive interface that provides a broad range of features on a single page; a search engine following the baroque design provides features related to search, email, news, weather, advertisements etc. on a single page. Popular portals that fall into the category of baroque design are Yahoo and Hotmail.

3.2 Search Accuracy and Consistency
This design goal is perhaps the most important feature of a search engine. Search accuracy dictates that any search query provided to the search engine returns links to the very best documents/pages available on the subject [1]. This design goal is essential in determining the effectiveness of the search engine. Consistency is likewise mandatory for designing a good search engine: a search must return the same results if it is repeated a number of times. In other words, a user must not get different results for the same search query.

3.3 Search Filtration
Search filtration is a design goal that has emerged after the success of internet marketing and other related disciplines. Search filtration requires that any search results returned to the user be filtered for spam and inaccurate information; outdated texts also need to be filtered from the search results. Since the advent of search engine optimization, filtration has become a very effective tool because it filters out those pages that seem to provide the best resources but are in reality of no importance or relevance to the user.

3.4 Search Ranking
Search ranking is a relatively new design goal that has impacted the way users are provided with search results. Using search ranking, those pages which are most widely accessed and considered most credible are provided to the user first. If a user is searching for a particular definition, the search will return a list of the most relevant pages; at the top of the list are those pages that are considered most authoritative and accurate by internet users throughout the world. Many algorithms have been proposed to achieve search ranking, but the most widely used algorithm is called the PageRank algorithm.

3.5 Search Speed
Search speed is a design goal that ultimately results in user satisfaction. Search speed means the speed with which users are provided with search results. To achieve higher search speeds many search engines have shifted from a baroque design to a minimalistic design. The use of more efficient, optimized algorithms for performing web searches is also recommended. It has further been observed that search engines that provide search results at a quicker pace are more popular among internet users.

4 GENERIC SEARCH ENGINE ARCHITECTURE

To fully appreciate search engines it is necessary to understand the architecture and the inherent search engine anatomy. To the user a search engine is visible as a single coherent searching entity, but it is actually a collection of complex procedures that function together to provide results that meet the design goals for search engines. Fundamentally a search engine is based on a crawler, which is an automated program that traverses the web in a methodically organized fashion. The crawler visits websites based on predefined criteria; some crawlers are designed to visit only governmental websites while others are designed to visit only media sites. The results obtained from a crawler are the product of a well orchestrated set of policies, which can dictate a selection policy, revisit policy, politeness policy and parallelization policy [2]. Search engines begin processing with the help of a search query provided by the client. Modern search engines have dramatically reduced the need for typing the search tags (images, documents, videos, RSS etc.); instead they require the user to select check boxes to specify the exact search tags. Once a search query is received it is referred to the crawl control module, which directs crawlers to begin processing. Crawlers typically crawl the web in search of useful links that are returned to the crawl control module. The crawl control module decides which links need to be visited, and these links are then returned to the crawlers. The crawlers have a direct link with the page repository because all crawled pages are passed to the page repository. The crawlers continue to process pages until resources such as storage space are exhausted.

After a minimum of one processing cycle by the crawlers, the indexer module begins its processing. The indexer module is responsible for generating the ranking of the results obtained from the search engine. The indexing is commonly performed based on the specific search terms, the structure of the webpage, pages with certain tags, etc. Once all the processing is completed the results are formally ranked and presented to the user.

Fig 1. A generic search engine architecture [3]
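The crawl loop just described can be sketched compactly. The following Python sketch is a minimal illustration under our own assumptions, not a description of any particular engine: the names `crawl`, `LinkParser` and `should_visit` are hypothetical, `should_visit` stands in for the crawl control module's selection policy, and the fixed delay stands in for a politeness policy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import time

class LinkParser(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, should_visit, max_pages=10, delay=1.0):
    """Breadth-first crawl from seed; should_visit plays the role of
    the crawl control module by deciding which links are pursued."""
    frontier, seen, repository = deque([seed]), {seed}, {}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        repository[url] = html                  # page repository
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen and should_visit(absolute):
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)                       # politeness policy
    return repository

# e.g. a crawler restricted to one host, the way a crawler limited to
# governmental sites would restrict by domain:
pages = crawl("https://example.org/", lambda url: "example.org" in url)
```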

5 WEB SEARCHING AND THE POWER METHOD


The most popular search engines have successfully used the power method to develop their search algorithms: Google implements the PageRank algorithm whereas the Clever search engine uses the HITS algorithm [7]. The fundamental concept behind both techniques is as follows. To begin with, appropriate matrices are constructed that describe the referencing structure of the pages relevant to the search; the dominant eigenvectors of these matrices are then used to rank the pages according to certain criteria.


5.1 The Power Method
The values of the parameter $\lambda$ for which the matrix equation $Ax = \lambda x$, where $A$ is an $n \times n$ matrix and $x$ an $n \times 1$ column vector, possesses at least one non-trivial solution are called eigenvalues, and the corresponding solutions $x$ are called eigenvectors of the matrix $A$. It is important to note that the eigenvalues of a symmetric matrix $A$ are real. If the distinct eigenvalues of an $n \times n$ matrix $A$ are $\lambda_1, \lambda_2, \dots, \lambda_n$ and if $|\lambda_1|$ is larger than $|\lambda_2|, \dots, |\lambda_n|$, then $\lambda_1$ is called the dominant eigenvalue of $A$. An eigenvector associated with the dominant eigenvalue is called a dominant eigenvector of $A$. The power method is an iterative technique used to approximate the dominant eigenvalue and an associated eigenvector of a matrix [4]-[8]. For a given vector $x_0$ in $R^n$, a sequence of the form $x_0, Ax_0, A^2 x_0, A^3 x_0, \dots$ is called a power sequence generated by an $n \times n$ matrix $A$. The power method uses the power sequence generated by $A$ for an initial guess $x_0$ for the eigenvector, that is, the iteration

$$x_k = A x_{k-1} \quad \text{for } k = 1, 2, 3, \dots$$

that yields

$$x_k = A^k x_0 \quad \text{for } k = 1, 2, 3, \dots \qquad (1)$$

Expressing the vector $x_0$ as a linear combination of linearly independent eigenvectors $v_1, v_2, \dots, v_n$ of $A$ gives

$$x_0 = c_1 v_1 + c_2 v_2 + \dots + c_n v_n. \qquad (2)$$

From (1) and (2) the following expression immediately follows:

$$x_k = A^k x_0 = c_1 \lambda_1^k v_1 + \dots + c_n \lambda_n^k v_n = \lambda_1^k \left( c_1 v_1 + c_2 \left( \tfrac{\lambda_2}{\lambda_1} \right)^k v_2 + \dots + c_n \left( \tfrac{\lambda_n}{\lambda_1} \right)^k v_n \right).$$

Since $|\lambda_1| > |\lambda_i|$ for $i = 2, \dots, n$, each ratio $(\lambda_i / \lambda_1)^k$ converges to 0 as $k$ grows (a ratio $r^k$ converges to 0 if $|r| < 1$ and diverges if $|r| > 1$). Hence $x_k \approx \lambda_1^k c_1 v_1$ for large $k$, provided that $c_1 \neq 0$. That is, (1) will approach a scalar multiple of $v_1$, an eigenvector for the dominant eigenvalue $\lambda_1$. If each iterate is scaled to make its largest entry 1, then the method is called the power method with scaling.

5.2 Example
Consider a matrix $A$ whose eigenvalues are known, one of them dominant. Starting with an initial guess $x_0$, the iterates $x_1 = A x_0$, $x_2 = A x_1, \dots$ are computed in turn. The approximate dominant eigenvalue obtained by the power method after five iterations is already close to the true dominant eigenvalue, and the corresponding iterate approximates the dominant eigenvector.

To be more elaborate, if each iterate in the power method is scaled to make its largest entry 1, then the following theorem, called the power method with maximum entry scaling, applies [7].

5.3 Theorem
Let $A$ be a symmetric $n \times n$ matrix with a positive dominant eigenvalue $\lambda$. If $x_0$ is a nonzero vector in $R^n$ that is not orthogonal to the eigenspace corresponding to $\lambda$, then the sequence

$$x_0, \quad x_1 = \frac{A x_0}{\max(A x_0)}, \quad x_2 = \frac{A x_1}{\max(A x_1)}, \quad x_3 = \frac{A x_2}{\max(A x_2)}, \dots$$

converges to an eigenvector associated with $\lambda$, and the sequence of Rayleigh quotients

$$\lambda^{(k)} = \frac{A x_k \cdot x_k}{x_k \cdot x_k}, \quad k = 1, 2, 3, \dots$$

converges to $\lambda$.

5.4 Example
Let $A$ be a symmetric matrix with a positive dominant eigenvalue and let $x_0$ be an initial guess that is not orthogonal to the corresponding eigenspace. Applying the iteration of Theorem 5.3 gives the scaled iterates $x_1, x_2, x_3, \dots$ and the corresponding Rayleigh quotients $\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \dots$ That is, $\lambda^{(k)}$ approximates the dominant eigenvalue whereas $x_k$ approximates the dominant eigenvector. Keeping in view the estimated relative error, the criteria to stop computations given in [7] can easily be used.
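The iteration above translates directly into code. The sketch below is a minimal Python/NumPy illustration of the power method with maximum entry scaling; the function name `power_method_scaled`, the tolerance, the test matrix and the initial guess are all our own illustrative choices rather than values taken from the examples above.

```python
import numpy as np

def power_method_scaled(A, x0, tol=1e-8, max_iter=100):
    """Power method with maximum entry scaling: each iterate is
    scaled so that its entry of largest magnitude equals 1, and the
    scale factor converges to the dominant eigenvalue."""
    x = x0.astype(float)
    lam = 0.0
    for _ in range(max_iter):
        y = A @ x
        lam_new = y[np.argmax(np.abs(y))]   # entry of largest magnitude
        x = y / lam_new                     # largest entry scaled to 1
        if abs(lam_new - lam) < tol * abs(lam_new):  # relative-error stop
            break
        lam = lam_new
    return lam_new, x

# Illustrative symmetric matrix and initial guess (assumed for this
# sketch): dominant eigenvalue 3 with eigenvector [1, 1].
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_method_scaled(A, np.array([1.0, 0.0]))
print(lam, v)   # approximately 3.0 and [1.0, 1.0]
```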

5.5 Determining Authority and Hub Weights
Let a search set contain $n$ sites. Define the adjacency matrix $A$ to be the $n \times n$ matrix such that $a_{ij} = 1$ if site $i$ references site $j$ and $a_{ij} = 0$ otherwise. Assume that no site references itself, which signifies that all the diagonal elements of $A$ are zero. A site which references many other sites during a search process is termed a hub, whereas a site which is referenced by many other sites is referred to as an authority. Accordingly, denote the vector of row sums of the adjacency matrix $A$ by $h_0$ (called the initial hub vector) and the vector of column sums of $A$ by $a_0$ (called the initial authority vector). The entries of a hub vector are termed hub weights while those of an authority vector are called authority weights. The initial hub vector $h_0$ and initial authority vector $a_0$ are used to compute new hub and authority vectors $h_1, h_2, h_3, \dots$ and $a_1, a_2, a_3, \dots$ by the following iterative formulas, for $k = 1, 2, 3, \dots$:

$$h_k = \frac{A\,a_{k-1}}{\lVert A\,a_{k-1} \rVert} \qquad (3)$$

$$a_k = \frac{A^{T} h_k}{\lVert A^{T} h_k \rVert} \qquad (4)$$

thereby generating power sequences, where $\lVert \cdot \rVert$ denotes the Euclidean norm; the scaling used here is therefore termed Euclidean scaling. The sequences (3) and (4) can further be expressed as

$$h_k = \frac{(A A^{T})^{k-1} h_1}{\lVert (A A^{T})^{k-1} h_1 \rVert} \qquad (5)$$

$$a_k = \frac{(A^{T} A)^{k-1} a_1}{\lVert (A^{T} A)^{k-1} a_1 \rVert} \qquad (6)$$

Since the matrices $A A^{T}$ and $A^{T} A$ are symmetric, they possess positive dominant eigenvalues, and therefore the sequences of vectors $h_1, h_2, h_3, \dots$ and $a_1, a_2, a_3, \dots$ converge to dominant eigenvectors of $A A^{T}$ and $A^{T} A$ respectively. The entries in these dominant eigenvectors are the hub and authority weights that signify the ranking of search sites. The procedure to compute dominant eigenvectors via the power method, as in (3) and (4) above, can best be explained with a programming oriented algorithm. The input to the algorithm is an adjacency matrix that is formed by relating the hub and authority sites. After processing is complete the authority and hub weights are obtained, according to which the search engine result page is formed.
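Formulas (3) and (4) can be implemented in a few lines. The following is a minimal sketch in Python with NumPy, assuming a stopping test based on the change in the vectors; the function name `hits_weights` and the tolerance are our own choices, not part of the original algorithm.

```python
import numpy as np

def hits_weights(A, tol=1e-8, max_iter=100):
    """Iterate formulas (3) and (4) with Euclidean scaling until the
    hub vector h and authority vector a stabilize."""
    h = A.sum(axis=1).astype(float)   # initial hub vector: row sums
    a = A.sum(axis=0).astype(float)   # initial authority vector: column sums
    for _ in range(max_iter):
        h_new = A @ a
        h_new /= np.linalg.norm(h_new)      # formula (3)
        a_new = A.T @ h_new
        a_new /= np.linalg.norm(a_new)      # formula (4)
        stable = (np.linalg.norm(h_new - h) < tol and
                  np.linalg.norm(a_new - a) < tol)
        h, a = h_new, a_new
        if stable:
            break
    return h, a
```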

5.6 Algorithm
Input: Adjacency matrix A of order n x n
Output: Authority and hub weights
h_0 = row sum of A        // initial hub vector
a_0 = column sum of A     // initial authority vector
while (h_k and a_k are not stable)
    compute h_k = A a_{k-1} / ||A a_{k-1}||      // formula (3)
    compute a_k = A^T h_k / ||A^T h_k||          // formula (4)

5.7 Example
Let $A$ be an adjacency matrix for a search set with 4 internet sites, in which the rows of $A$ correspond to the referencing sites and the columns to the referenced sites. The initial hub vector $h_0$ and initial authority vector $a_0$ for the adjacency matrix are the row sums and column sums of $A$ respectively. From these sums it is obvious that site 2 is the largest hub and site 1 is the greatest authority. Using formula (3), the hub vector $h_1$ (the updated hub weights) is calculated, and using formula (4), the authority vector $a_1$ is found. In a similar way the iterates $h_2, a_2, h_3, a_3, \dots$ are approximated until the vectors seem to have stabilized.
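To make the example concrete, the run below uses a hypothetical 4-site adjacency matrix of our own choosing, not the paper's original data; it is constructed so that, as in the example above, site 2 is the largest hub and site 1 is the greatest authority, and it reuses the `hits_weights` sketch from Section 5.5.

```python
import numpy as np

# Hypothetical 4-site adjacency matrix (illustrative only):
# rows are referencing sites, columns are referenced sites,
# zero diagonal since no site references itself.
A = np.array([[0, 0, 1, 0],
              [1, 0, 1, 1],    # row 2 sums to 3: site 2 is the largest hub
              [1, 0, 0, 0],
              [1, 1, 0, 0]])   # column 1 sums to 3: site 1 is the greatest authority

h0 = A.sum(axis=1)   # initial hub vector       -> [1 3 1 2]
a0 = A.sum(axis=0)   # initial authority vector -> [3 1 2 1]

h, a = hits_weights(A)          # function from the sketch in Section 5.5
print("hub weights:      ", np.round(h, 2))   # approx [0.23 0.76 0.37 0.48]
print("authority weights:", np.round(a, 2))   # approx [0.77 0.23 0.48 0.37]
# The dominant entries remain those of site 2 (hub) and site 1 (authority).
```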

5.8 Formation of Search Engine Result Page
After processing of the adjacency matrix, the last step is to generate the Search Engine Result Page (SERP). The SERP is the actual page that presents the web pages in ranked order. Generation of the SERP is fairly simple because it follows directly from the latest updated authority weights. Once the individual authority weights are available, those sites that possess an exceptionally low weight are discarded. The pages with the highest weight are ranked at the top of the SERP; in other words, pages are ranked in descending order of their authority weights. Those pages that possess zero weight, or a weight that approaches zero, are simply discarded and are not incorporated into the SERP.
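A minimal sketch of this last step might look as follows; the cutoff value below is an assumed threshold for "a weight that approaches zero", since no specific value is given.

```python
def form_serp(pages, authority_weights, cutoff=1e-6):
    """Rank pages in descending order of authority weight and discard
    pages whose weight is zero or approaches zero (below the cutoff)."""
    ranked = sorted(zip(pages, authority_weights),
                    key=lambda pair: pair[1], reverse=True)
    return [page for page, weight in ranked if weight > cutoff]

# e.g. with the approximate hypothetical authority weights from Section 5.7:
print(form_serp(["site1", "site2", "site3", "site4"],
                [0.77, 0.23, 0.48, 0.37]))
# -> ['site1', 'site3', 'site4', 'site2']
```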


6 CONCLUSION
Modern search engines are designed to process a large number of user queries simultaneously. Internet users require search results to be presented to them in a quick, consistent, accurate, filtered and ranked manner. Since the web is an ever expanding network, ranking pages according to the user query is a very important feature: all users require that their search results be ranked according to importance. This paper presented the power method, a ranking algorithm that can efficiently rank web pages. The algorithm begins by forming an adjacency matrix defined from a search set containing various sites; this adjacency matrix encodes the hub and authority relationships between webpages. With the help of the power method, dominant eigenvectors are calculated in an iterative manner to achieve refinement of results. Every iteration performs processing on the results of the previous iteration, and iterations continue until the hub and authority weights stabilize. Once these results are obtained, the final search engine result page presented to the user is generated. The power method forms the basis of most search engine ranking algorithms.

REFERENCES
[1] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, Volume 30, Issues 1-7, April 1998.
[2] C. Castillo, "Effective Web Crawling," PhD Dissertation, University of Chile, 2004.
[3] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web," ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1, Aug 2001.
[4] D.W. Boyd, "The power method for l^p norms," Linear Algebra and Its Applications, Volume 9, pp. 95-101, 1974.
[5] G.H. Golub and C.F. Van Loan, Matrix Computations (3rd Ed.), The Johns Hopkins University Press, Baltimore, pp. 330-332, 1996.
[6] N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, pp. 291-294, 1996.
[7] H. Anton and R.C. Busby, Contemporary Linear Algebra, John Wiley & Sons, Inc., NJ, pp. 249-260, 2003.
[8] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.
