(Roll No : 09141)
Presentation Overview
- Problem Definition
- Design Goals
- Google Search Engine Characteristics
- Google Architecture
- Scalability
- Conclusions
Problem

The Web is vast and ever expanding; it is getting flooded with data. This data is heterogeneous and comes in all forms. Lists maintained by humans cannot keep track of it, and human attention is confined to roughly 10-1000 documents. Previous search methodologies relied on keyword matching, producing too many low-quality matches.

Search engines put documents of the user's choice within a click of the mouse. Some examples of search engines: Google, AltaVista, MetaCrawler, Kosmix. For a comprehensive list of search engines, visit: http://en.wikipedia.org/wiki/List_of_search_engines
Design Goals
- Make search engine technology transparent, i.e. advertising shouldn't bias results.
- Make the system user-friendly.
Other features include:
- Takes into account word proximity in documents
- Uses font size, word position, etc. to weight words
- Storage of full raw HTML pages
PageRank
Imagine a random surfer following links across the entire web for an infinite number of steps. Occasionally, the surfer will get bored and, instead of following a link pointing outward from the current page, will jump to another random page. At some point, the fraction of time spent at each page converges to a fixed value. This value is known as the PageRank of the page.
Why do we need d?
In the real world, virtually all web graphs are not connected, i.e. they have dead ends, islands, etc. Without d we get rank leaks for graphs that are not connected, which leads to numerical instability.
The more a page is pointed to by important pages, the more it is worth looking at. PageRank takes into account the global structure of the web.
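The random-surfer model above can be sketched as a power iteration with damping factor d. This is a minimal illustration, not the production algorithm; the three-page graph is a made-up example.

```python
def pagerank(graph, d=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}               # start uniform
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}      # random-jump share
        for p, links in graph.items():
            if links:                                # follow an outlink
                share = d * rank[p] / len(links)
                for q in links:
                    new[q] += share
            else:                                    # dead end: jump anywhere
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Here C outranks B because it is pointed to by both A and B, matching the intuition that links from many (or important) pages raise a page's rank.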
Google Architecture
Preliminary
A hitlist is defined as the list of occurrences of a particular word in a particular document, including additional meta info:
- position of the word in the doc
- font size
- capitalization
- descriptor type, e.g. title, anchor, etc.
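A hit record with the fields above might be sketched as follows. The field names are hypothetical; the real system packs each hit into a compact bit-encoded form rather than a plain object.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    position: int        # word position within the document
    font_size: int       # relative font size
    capitalized: bool
    descriptor: str      # e.g. "plain", "title", "anchor"

# A hitlist is simply the hits for one (word, document) pair:
hitlist = [Hit(3, 2, False, "title"), Hit(57, 1, True, "plain")]
```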
Repository
Contains the full HTML of every web page. Each document is prefixed by its docID, length, and URL.
Creates an inverted index whereby a document list containing docIDs and hitlists can be retrieved given a wordID.
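The inverted-index lookup can be sketched as a mapping from wordID to a doclist of (docID, hit positions) pairs. The numeric IDs are made-up illustration values.

```python
from collections import defaultdict

inverted_index = defaultdict(list)   # wordID -> [(docID, hit_positions), ...]

def index_document(doc_id, words):
    """words maps a wordID to the positions where it occurs in doc_id."""
    for word_id, positions in words.items():
        inverted_index[word_id].append((doc_id, positions))

index_document(1, {42: [0, 7], 99: [3]})
index_document(2, {42: [5]})

doclist = inverted_index[42]   # all documents containing wordID 42
```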
Document Index
A docID-keyed index where each entry includes info such as a pointer to the doc in the repository, a checksum, statistics, status, etc. It also contains URL info if the doc has been crawled; if not, it contains just the URL.
The new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks are used to answer queries. The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create the new lexicon used by the Searcher. The lexicon stores ~14 million words.
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
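The scan in step 4 can be sketched as a merge over sorted doclists: advance the lagging cursor until one docID appears in every list. The doclists below are hypothetical and sorted by docID, as in the real barrels.

```python
def first_common_doc(doclists):
    """Each doclist is a sorted list of docIDs; return the first docID
    present in every list, or None if the lists share no document."""
    cursors = [0] * len(doclists)
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        if len(set(current)) == 1:       # all cursors agree: a match
            return current[0]
        lagging = min(range(len(current)), key=lambda i: current[i])
        cursors[lagging] += 1            # advance the smallest docID
    return None

match = first_common_doc([[1, 4, 9], [2, 4, 9], [4, 7]])  # -> 4
```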
Each hit has a type, e.g. title, anchor, large font, small font, etc. Each hit type is assigned its own weight; the type-weights make up a vector of weights. The number of hits of each type is counted to form a count vector. The dot product of the two vectors is used to compute the IR score, and the IR score is combined with PageRank to compute the final rank.
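The IR score above is just a dot product of the type-weight vector with the count vector. The weights and the combination with PageRank below are made-up illustration values; the slides do not specify the actual combination function.

```python
# Hypothetical per-type weights (not the real values).
type_weights = {"title": 8.0, "anchor": 6.0, "large_font": 3.0, "plain": 1.0}

def ir_score(hit_counts):
    """hit_counts maps hit type -> number of hits of that type."""
    return sum(type_weights[t] * n for t, n in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    # One plausible way to combine IR score and PageRank.
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

score = ir_score({"title": 1, "plain": 4})   # 8.0 + 4.0 = 12.0
```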
Scalability
Cluster architecture combined with Moore's Law makes it feasible to keep up with the growing number of documents. Key optimizations:
- Parallelization of the indexing phase
- In-memory lexicon
- Compression of the repository
- Compact encoding of hitlists, accounting for major space savings
- The indexer is optimized so it is just faster than the crawler, so that crawling remains the bottleneck
- The document index is updated in bulk
- Critical data structures are placed on local disk
- The overall architecture is designed to avoid disk seeks wherever possible
References:
- http://video.google.com/videoplay?docid=-1400721382961784115
- http://google.stanford.edu
- http://en.wikipedia.org/wiki/List_of_search_engines
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Sergey Brin and Lawrence Page (PDF).

The audio presentation of my lecture will be posted on my homepage and will be given to Dr. Hankley. www.cis.ksu.edu/~vamsee