1 INTRODUCTION
The internet is a network of millions of private, public, academic, business and government computers. Its primary purpose is to share data, which may take the form of pictures, text, documents, videos and so on; the internet can therefore also be viewed as a vast library that provides access to all users. To reach the endless resources the internet has to offer, users regularly require the services of a search engine. Both users and researchers acknowledge that searching the internet for the most relevant information is a difficult task, and users often find that they cannot access the exact information they are looking for. Fundamentally, a search engine is an entity that crawls the internet for the results most relevant to a user query; in other words, it removes the complexities of searching the internet by providing a searching interface. The modern search engine is the result of constant research and development, and thus provides better searching and quicker result generation. Researchers have suggested techniques to eliminate the complexities of searching the web; common methods for improved web searching include ranking of sites, simplified interfaces and ranking based on site speed. This paper presents the power method (with appropriate scaling), which performs computations on an adjacency matrix defined from a search set containing numerous sites, extracting dominant eigenvalues and eigenvectors to rank websites by their importance. Earlier generations of search engines could present search results without any formal algorithm; even where an algorithm was present, it did not provide any sort of intelligence to the search engine. With the help of the power method, modern search engines can present the user with a vast list of ranked search pages. The power method is an attempt to provide user satisfaction through improved and optimized web searching, and it is in line with the design goals of web search engines. Section 2 of the paper presents the generations of search engines; search engines are broken down into five generations, each building upon the work of the previous one. Section 3 presents the design goals of search engines. To fully explain the working of search engines, a generic search engine architecture is presented in Section 4. Section 5 explains the power method and its algorithm in detail; mathematical examples illustrate its functioning, and the section concludes with a programming algorithm and the formation of the search engine result page.
H. Tahir is with the Department of Computer Engineering, College of Electrical & Mechanical Engineering, National University of Sciences and Technology, Pakistan. M. Tahir is with the Department of Mathematics, HITEC University Taxila Cantt., Pakistan.
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 8, AUGUST 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG

2 GENERATIONS OF SEARCH ENGINES

2.1 First Generation
The first generation of search engines was primitive and inaccurate. These search engines had no formal algorithm for searching and ranking results. Many were merely searching entities that could only search a collection of websites rather than the entire web, and since most possessed no form of intelligence they were considered fairly ineffective. A good example of a first generation search engine is ALIWEB, which allowed users to submit their own webpages to the search engine; a webpage that was not submitted would not be included in the search. A repository of webpages was thus formed that allowed users to search for particular information.
2.2 Second Generation
The second generation of search engines was far more mature than the first. Search engines of this generation could search the entire web using more efficient searching techniques. For example, EXCITE was considered a very popular search engine of this generation because it used statistical analysis to find relationships among words. A paradigm shift was thus observed that led to the incorporation of mathematical algorithms into search engines. Ranking was purely content-based and rooted in information retrieval. Although the search engines of this generation were quite practical, there was still room for greater accuracy and intelligence.

2.3 Third Generation
The search engines of this generation were designed to rank pages according to popularity. The concept behind them was that a high level of popularity implies that a webpage is appropriate and most likely to be read by the user. The popularity of a particular page was based on hyperlinks from one site to another: each hyperlink was considered a popularity vote and functioned much like an academic citation of a research paper. The higher the popularity vote, the higher the ranking of that particular webpage. Lycos, a search engine developed at Carnegie Mellon University, belonged to this generation.

2.4 Fourth Generation
The fourth generation search engines were the first to formally incorporate mathematical logic into their searching. Although previous generations also had mathematical intelligence, they lacked accuracy and perfection. To overcome these limitations, search engines like Google and Yahoo used the PageRank algorithm to search for particular information. These search engines revolutionized the way search engines operate.
These search engines were more accurate, faster, and fulfilled the other design goals of search engines. Fourth generation search engines search for data based on the content, structure and page rank of a webpage; they were the first of their kind to perform searches on three criteria. They also provide advanced searching options, along with support for image search and searches for particular file formats.

2.5 Fifth Generation
Earlier search engines performed searches based on the user query without understanding what the user was asking for. The search engines currently under development are being designed to comprehend what the user is asking for and then search for the exact information required, resulting in greater accuracy and user satisfaction. Researchers have also tried to incorporate artificial intelligence into search engines, and the concept of self-organizing neural networks is being used in upcoming search engines.
3 DESIGN GOALS OF SEARCH ENGINES

3.1 Usability
Search engines need to be designed for high levels of usability. Usability expresses the need for high quality design along with ease of use, and is a factor often studied in relation to man-machine interfaces. A search engine designed with usability in view promises higher levels of user satisfaction. Currently there are two widely used search engine designs: minimalistic and baroque. Minimalistic designs provide a basic interface with an uncluttered appearance; search engines that follow this design present the user with direct access to the search text box. Google is commonly seen as the first minimalistic search engine. The baroque design is a comprehensive interface that provides a broad range of features on a single page: search, email, news, weather, advertisements and so on. Popular search engines in the baroque category are Yahoo and Hotmail.

3.2 Search Accuracy and Consistency
This design goal is perhaps the most important feature of a search engine. Search accuracy dictates that any search query provided to the search engine returns links to the very best documents/pages available on the subject [1]. This goal is essential in determining the effectiveness of the search engine. Consistency is mandatory for designing a good search engine: a search must return the same results if it is repeated a number of times. In other words, a user must not get different results for the same search query.

3.3 Search Filtration
Search filtration is a design goal that emerged after the success of internet marketing and related disciplines. It requires that any search results returned to the user be filtered for spam and inaccurate information; outdated texts also need to be filtered from the results. Since the advent of search engine optimization, filtration has become a very effective tool because it filters out pages that seem to provide the best resources but are in reality of no importance or relevance to the user.
3.4 Search Ranking
Search ranking is a relatively new design goal that has changed the way users are provided search results. Using search ranking, the pages provided to the user first are those that are most widely accessed and considered most credible. If a user searches for a particular definition, the search returns a list of the most relevant pages, with the pages considered most authoritative and accurate by internet users throughout the world at the top. Many algorithms have been proposed to achieve search ranking, but the most widely used is the PageRank algorithm.

3.5 Search Speed
Search speed, the speed with which users are provided search results, is a design goal that ultimately results in user satisfaction. To achieve higher search speeds, many search engines have shifted from a baroque design to a minimalistic one; the use of more efficient, optimized algorithms for performing web searches is also recommended. It has further been observed that search engines that deliver results more quickly are more popular among internet users.
4 GENERIC SEARCH ENGINE ARCHITECTURE

After a minimum of one processing cycle by the crawlers, the indexer module begins its processing. The indexer module is responsible for generating the ranking of the results obtained from the search engine. Indexing is commonly performed based on specific search terms, the structure of the webpage, pages with certain tags, and so on. Once all processing is completed, the results are formally ranked and presented to the user.
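As a rough illustration of the indexer's role, the sketch below builds a minimal inverted index mapping terms to pages. The data structure and the sample pages are illustrative assumptions, not a description of any particular engine; real indexers also weight terms by position, tags and page structure, as noted above.

```python
from collections import defaultdict

def build_index(pages):
    """Build a minimal inverted index: term -> set of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

# Hypothetical pages for illustration only.
pages = {
    "p1": "power method ranks pages",
    "p2": "search engines rank pages",
}
index = build_index(pages)

# A query is answered by intersecting the posting sets of its terms.
result = index["pages"] & index["power"]  # -> {"p1"}
```

The ranked presentation of such results is what the later sections of the paper address.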
5 THE POWER METHOD

5.1 The Power Method
The values of the parameter $\lambda$ for which the matrix equation $A\mathbf{x} = \lambda\mathbf{x}$, where $A$ is an $n \times n$ matrix and $\mathbf{x}$ a nonzero $n \times 1$ column vector, possesses at least one non-trivial solution are called eigenvalues, and the corresponding solutions $\mathbf{x}$ are called eigenvectors of the matrix $A$. It is important to note that the eigenvalues of a symmetric matrix are real. If the distinct eigenvalues of an $n \times n$ matrix $A$ are $\lambda_1, \lambda_2, \ldots, \lambda_k$ and $|\lambda_1|$ is larger than $|\lambda_2|, \ldots, |\lambda_k|$, then $\lambda_1$ is called the dominant eigenvalue of $A$. An eigenvector associated with the dominant eigenvalue is called a dominant eigenvector of $A$. An iterative technique used to approximate the dominant eigenvalue and associated eigenvector of a matrix is the power method [4-8]. For a given vector $\mathbf{x}_0$ in $R^n$, a sequence of the form $\mathbf{x}_0, A\mathbf{x}_0, A^2\mathbf{x}_0, \ldots$ is called a power sequence generated by $A$. The power method uses the power sequence generated by $A$ for an initial guess $\mathbf{x}_0$ for the dominant eigenvector, that is, the iteration

$\mathbf{x}_k = A\mathbf{x}_{k-1} = A^k\mathbf{x}_0$ for $k = 1, 2, 3, \ldots$

Writing $\mathbf{x}_0$ as a linear combination $\mathbf{x}_0 = c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n$ of eigenvectors $\mathbf{v}_i$ of $A$ with eigenvalues $\lambda_i$ yields

$A^k\mathbf{x}_0 = c_1\lambda_1^k\mathbf{v}_1 + \cdots + c_n\lambda_n^k\mathbf{v}_n = \lambda_1^k \left( c_1\mathbf{v}_1 + \sum_{i=2}^{n} c_i \left(\frac{\lambda_i}{\lambda_1}\right)^k \mathbf{v}_i \right)$ for $k = 1, 2, 3, \ldots$   (1)

Since $(\lambda_i/\lambda_1)^k$ converges to 0 if $|\lambda_i| < |\lambda_1|$ and diverges if $|\lambda_i| > |\lambda_1|$, and $\lambda_1$ is dominant, each term of the sum in (1) converges to 0 for large $k$, provided that $c_1 \neq 0$. Hence, for large $k$, $A^k\mathbf{x}_0 \approx c_1\lambda_1^k\mathbf{v}_1$; that is, (1) will approach a scalar multiple of $\mathbf{v}_1$, an eigenvector for the dominant eigenvalue $\lambda_1$. If each iterate is scaled to make its largest entry 1, then the method is called the power method with scaling.

5.2 Example
Consider a matrix $A$ with a dominant eigenvalue and start with an initial guess $\mathbf{x}_0$. Computing the iterates $\mathbf{x}_1 = A\mathbf{x}_0$, $\mathbf{x}_2 = A\mathbf{x}_1$, and so on, the sequence settles toward a multiple of the dominant eigenvector; after five iterations the approximate dominant eigenvalue and the dominant eigenvector are obtained.

To be more elaborate, if each iterate in the power method is scaled to make its largest entry 1, then the following theorem, the power method with maximum entry scaling, applies [7].

5.3 Theorem
Let $A$ be a symmetric $n \times n$ matrix with a positive dominant eigenvalue $\lambda$. If $\mathbf{x}_0$ is a nonzero vector in $R^n$ that is not orthogonal to the eigenspace corresponding to $\lambda$, then the sequence

$\mathbf{x}_k = \frac{A\mathbf{x}_{k-1}}{\max(A\mathbf{x}_{k-1})}$, $k = 1, 2, 3, \ldots$   (2)

converges to an eigenvector associated with $\lambda$ whose largest entry is 1, and the quotients

$\lambda^{(k)} = \frac{A\mathbf{x}_k \cdot \mathbf{x}_k}{\mathbf{x}_k \cdot \mathbf{x}_k}$

converge to $\lambda$; from (1) and (2) this convergence immediately follows.

5.4 Example
Starting from an initial vector $\mathbf{x}_0$ and applying (2), the scaled iterates $\mathbf{x}_1, \mathbf{x}_2, \ldots$ and the quotients $\lambda^{(1)}, \lambda^{(2)}, \ldots$ are computed until they stabilize. That is, $\lambda^{(k)}$ approximates the dominant eigenvalue whereas $\mathbf{x}_k$ approximates the dominant eigenvector. Keeping the estimated relative error in view, the criteria to stop computations [7] can easily be used.
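The power method with scaling described above can be sketched as a short program. The 2x2 matrix, its eigenvalues and the initial guess below are illustrative assumptions, not taken from the paper's examples.

```python
def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def power_method_scaled(A, x0, iters=20):
    """Power method with maximum entry scaling.

    Each iterate A x is divided by its largest-magnitude entry, so the
    scale factor approaches the dominant eigenvalue and the iterate a
    dominant eigenvector whose largest entry is 1.
    """
    x = x0
    lam = 0.0
    for _ in range(iters):
        y = mat_vec(A, x)
        lam = max(y, key=abs)        # current eigenvalue estimate
        x = [yi / lam for yi in y]   # scale largest entry to 1
    return lam, x

# Illustrative symmetric matrix with eigenvalues 5 and 1.
A = [[3.0, 2.0],
     [2.0, 3.0]]
lam, v = power_method_scaled(A, [1.0, 0.0])
# lam is approximately 5.0 and v approximately [1.0, 1.0].
```

Because the ratio of the two eigenvalues is 1/5, the error shrinks by roughly that factor per iteration, so twenty iterations are far more than enough here.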
5.5 Determining Authority and Hub Weights
Let a search set contain $n$ sites. Define the adjacency matrix $A$ to be the $n \times n$ matrix such that $a_{ij} = 1$ if site $i$ references site $j$ and $a_{ij} = 0$ otherwise. Assume that no site references itself, which signifies that all the diagonal elements of $A$ are zero. A site which references many other sites during a search process is termed a hub, whereas a site which is referenced by many other sites is referred to as an authority. Accordingly, denote the row sum of the adjacency matrix $A$ by the vector $\mathbf{h}_0$ (called the initial hub vector) and the column sum of $A$ by the vector $\mathbf{a}_0$ (called the initial authority vector). The entries of a hub vector are termed hub weights while those of an authority vector are called authority weights. Use the initial hub vector $\mathbf{h}_0$ and initial authority vector $\mathbf{a}_0$ to compute new hub and authority vectors $\mathbf{a}_1, \mathbf{h}_1, \mathbf{a}_2, \mathbf{h}_2, \ldots$ by

$\mathbf{a}_k = \frac{A^{T}\mathbf{h}_{k-1}}{\lVert A^{T}\mathbf{h}_{k-1} \rVert}$   (3)

$\mathbf{h}_k = \frac{A\mathbf{a}_{k}}{\lVert A\mathbf{a}_{k} \rVert}$   (4)

thereby generating power sequences, where $\lVert \cdot \rVert$ denotes the Euclidean norm; the scaling used here is therefore termed Euclidean scaling. The sequences (3) and (4) can further be expressed as

$\mathbf{a}_k = \frac{(A^{T}A)\,\mathbf{a}_{k-1}}{\lVert (A^{T}A)\,\mathbf{a}_{k-1} \rVert}$   (5)

$\mathbf{h}_k = \frac{(AA^{T})\,\mathbf{h}_{k-1}}{\lVert (AA^{T})\,\mathbf{h}_{k-1} \rVert}$   (6)

Since the matrices $A^{T}A$ and $AA^{T}$ are symmetric, they possess positive dominant eigenvalues, and therefore the sequences of vectors $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3, \ldots$ and $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \ldots$ converge to dominant eigenvectors of $A^{T}A$ and $AA^{T}$ respectively. The entries in these dominant eigenvectors are the hub and authority weights that signify the ranking of search sites. The procedure to compute dominant eigenvectors via the power method, as in (3) and (4) above, can best be explained with a programming oriented algorithm. The input to the algorithm is an adjacency matrix formed by relating the hub and authority sites. After processing is complete, the authority and hub weights are obtained, according to which the search engine result page is formed.

5.6 Algorithm
Input: Adjacency matrix A of order n x n
Output: Authority and hub weights
h = row sum of A       // initial hub vector
a = column sum of A    // initial authority vector
while (a and h are not stable)
    compute a = A^T h / ||A^T h||
    compute h = A a / ||A a||

5.7 Example
Let $A$ be an adjacency matrix for a search set with 4 internet sites, where the rows correspond to the referencing site and the columns to the referenced site. The initial hub vector $\mathbf{h}_0$ and initial authority vector $\mathbf{a}_0$ for the adjacency matrix $A$ are obtained from the row and column sums of $A$; from these it is obvious that site 2 is the largest hub and site 1 is the greatest authority. Using formulas (3) and (4), updated authority and hub weights are calculated, and in a similar way the iterates are approximated until the vectors seem to have stabilized.
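The hub and authority iteration can be sketched as a short program. The 4-site adjacency matrix below is an illustrative assumption chosen to match the described example, where site 2 is the largest hub and site 1 the greatest authority; it is not the paper's actual matrix.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def normalize(x):
    n = math.sqrt(sum(xi * xi for xi in x))  # Euclidean scaling
    return [xi / n for xi in x]

def hits(A, iters=50):
    """Iterate hub and authority vectors until they stabilize."""
    h = [sum(row) for row in A]        # initial hub vector: row sums
    a = [sum(col) for col in zip(*A)]  # initial authority vector: column sums
    At = transpose(A)
    for _ in range(iters):
        a = normalize(mat_vec(At, h))  # authority update: A^T h
        h = normalize(mat_vec(A, a))   # hub update: A a
    return h, a

# Illustrative adjacency matrix: A[i][j] = 1 if site i references site j.
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
h, a = hits(A)
# Site 2 (second row) references the most sites -> largest hub weight;
# site 1 (first column) is referenced most -> largest authority weight.
```

A fixed iteration count stands in for the stability test of the algorithm above; in practice the loop would stop once successive vectors differ by less than a tolerance.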
5.8 Formation of Search Engine Result Page
After processing of the adjacency matrix, the last step is to generate the Search Engine Result Page (SERP). The SERP is the actual page that presents the web pages in ranked order. Generation of the SERP is fairly simple because it follows directly from the latest updated authority weights. Once the individual authority weights are available, the sites that possess an exceptionally low weight are discarded, and the pages with the highest weights are ranked at the top of the SERP. In other words, pages are ranked in descending order of their authority weights. Pages whose weight is zero or approaches zero are simply discarded and are not incorporated into the SERP.
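The SERP formation step can be sketched as follows. The weights, page names and cutoff threshold are illustrative assumptions rather than values from the paper.

```python
def form_serp(authority, pages, eps=1e-6):
    """Rank pages by descending authority weight; drop near-zero weights."""
    ranked = [(w, p) for w, p in zip(authority, pages) if w > eps]
    ranked.sort(key=lambda wp: wp[0], reverse=True)
    return [p for _, p in ranked]

# Hypothetical authority weights for four sites.
pages = ["site1", "site2", "site3", "site4"]
authority = [0.83, 0.14, 0.52, 0.0]
print(form_serp(authority, pages))  # ['site1', 'site3', 'site2']
```

Here site4, whose weight is zero, is discarded, and the remaining pages appear in descending order of authority weight, as the section describes.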
6 CONCLUSION
Modern search engines are designed to process a large number of user queries simultaneously. Internet users require search results to be presented in a quick, consistent, accurate, filtered and ranked manner. Since the web is an ever expanding network, ranking pages according to the user query is a very important feature, and all users require that their search results be ranked by importance. This paper presents the power method, a web ranking algorithm that can efficiently rank web pages. The algorithm begins by forming an adjacency matrix defined from a search set containing various sites; this matrix is generated from the hub and authority relationships between webpages. With the help of the power method, dominant eigenvectors are calculated iteratively to refine the results: every iteration processes the results of the previous one, and iterations continue until the hub and authority weights stabilize. Once these results are obtained, the final search engine result page is generated and presented to the user. The power method forms the basis of most search engine ranking algorithms.
REFERENCES
[1] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, April 1998.
[2] C. Castillo, "Effective Web Crawling," PhD dissertation, University of Chile, 2004.
[3] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the Web," ACM Transactions on Internet Technology (TOIT), vol. 1, no. 1, Aug. 2001.
[4] D.W. Boyd, "The Power Method for lp Norms," Linear Algebra and Its Applications, vol. 9, pp. 95-101, 1974.
[5] G.H. Golub and C.F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, Baltimore, pp. 330-332, 1996.
[6] N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, pp. 291-294, 1996.
[7] H. Anton and R.C. Busby, Contemporary Linear Algebra, John Wiley & Sons, Inc., NJ, pp. 249-260, 2003.
[8] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.