
Presented By: Ganesh C. Yadav.

(Roll No: 09141)

Presentation Overview
Problem Definition
Design Goals
Google Search Engine Characteristics
Google Architecture
Scalability
Conclusions

Problem
The Web is vast and ever expanding, and it is getting flooded with heterogeneous data of all forms:

Text
Images
ASCII
Java applets

Lists maintained by humans cannot keep track of this; human attention is confined to 10-1000 documents.
Previous search methodologies relied on keyword matching, producing inferior-quality results.

Solution = Search Engine


Search engines enable users to find the text or documents of their choice within a click of a mouse.
Some examples of search engines: Google, AltaVista, MetaCrawler, Kosmix.
For a comprehensive list of search engines, visit:
http://en.wikipedia.org/wiki/List_of_search_engines

Specific Design Goals


Deliver results that have very high precision, even at the expense of recall.
Bring search engine technology into the academic realm in order to support novel research activities.
Make search engine technology transparent, i.e. advertising shouldn't bias results.
Make the system user-friendly.

Google Search Engine Features


Uses the link structure of the web (PageRank)
Uses text surrounding hyperlinks to improve the accuracy of document retrieval
Other features include:
Takes into account word proximity in documents
Uses font size, word position, etc. to weight words
Storage of full raw HTML pages

PageRank For Layman


Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps.
Occasionally, the surfer will get bored and, instead of following a link pointing outward from the current page, will jump to another random page.
At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page.

PageRank For Techies


N(p): number of outgoing links from page p
B(p): set of pages that point to p
d: damping factor (models the tendency to get bored)
R(p): PageRank of p

R(p) = (1 - d) + d * Σ_{q ∈ B(p)} R(q) / N(q)

Why do we need d?
In the real world virtually all web graphs are not connected, i.e. they have dead-ends, islands, etc.
If we don't have d, we get rank "leaks" for graphs that are not connected, which leads to numerical instability.
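To make the recurrence concrete, below is a minimal power-iteration sketch in C++ (as the deck later notes, Google itself is implemented in C and C++). The toy four-page graph, d = 0.85, and the fixed iteration count are assumptions for illustration, not Google's actual code.

#include <cstdio>
#include <vector>

// Iterate R(p) = (1-d) + d * sum over q in B(p) of R(q)/N(q)
// until the ranks settle. out[q] lists the pages that q links to.
int main() {
    // Toy 4-page web graph (an illustrative assumption).
    std::vector<std::vector<int>> out = {{1, 2}, {2}, {0}, {0, 2}};
    const int n = static_cast<int>(out.size());
    const double d = 0.85;          // damping; 1-d is the bored surfer's random jump
    std::vector<double> r(n, 1.0);  // initial ranks

    for (int iter = 0; iter < 50; ++iter) {       // 50 iterations suffice here
        std::vector<double> next(n, 1.0 - d);
        for (int q = 0; q < n; ++q)               // page q donates rank...
            for (int p : out[q])                  // ...to each page p it links to
                next[p] += d * r[q] / out[q].size();
        r = next;
    }
    for (int p = 0; p < n; ++p)
        std::printf("R(%d) = %.3f\n", p, r[p]);
}

Note that every page in the toy graph has at least one outgoing link; a dead-end (N(q) = 0) would leak rank, which is exactly the problem d mitigates.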

Justifications for using PageRank


Attempts to model user behavior.
Captures the notion that the more a page is pointed to by important pages, the more it is worth looking at.
Takes into account the global structure of the web.

Implemented in C and C++ on Solaris and Linux.

Google Architecture

Preliminary
A hitlist is defined as the list of occurrences of a particular word in a particular document, including additional meta info:
- position of word in doc
- font size
- capitalization
- descriptor type, e.g. title, anchor, etc.
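For concreteness, here is how such a compact hit record might be laid out, loosely following the two-byte plain-hit encoding described in the Anatomy paper (capitalization: 1 bit, font size: 3 bits, position: 12 bits); the struct name and exact packing are illustrative.

#include <cstdint>

// One "hit": a single occurrence of a word in a document, packed
// into two bytes following the Anatomy paper's plain-hit layout.
struct Hit {
    uint16_t cap      : 1;   // word was capitalized
    uint16_t fontSize : 3;   // font size relative to the rest of the document
    uint16_t position : 12;  // word position within the document
};

static_assert(sizeof(Hit) == 2, "hits pack into two bytes");

A hitlist for a (word, document) pair is then just a length-prefixed array of such records.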

Google Architecture (cont.)


URL Server and Crawlers: keep track of URLs that have been and still need to be crawled. Multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and holds ~300 connections open at once.

Store Server: compresses and stores web pages.

Indexer: uncompresses and parses documents; stores each link, along with the text surrounding it, in an anchors file.

URL Resolver: converts relative URLs into absolute URLs.

Repository: contains the full HTML of every web page. Each document is prefixed by its docID, length, and URL.

Google Architecture (cont.)


URL Resolver: maps absolute URLs into docIDs stored in the Doc Index; stores anchor text in barrels; generates the database of links (pairs of docIDs).

Indexer: parses documents and distributes their hit lists into barrels.

Barrels: partially sorted forward indexes, sorted by docID. Each barrel stores the hitlists for a given range of wordIDs.

Lexicon: in-memory hash table that maps words to wordIDs; each entry contains a pointer to the doclist in the barrel that its wordID falls into.

Sorter: creates the inverted index, whereby the document list (docIDs plus hitlists) can be retrieved given a wordID.

Doc Index: docID-keyed index where each entry includes information such as a pointer to the document in the repository, a checksum, statistics, status, etc. It also contains URL info if the document has been crawled; if not, it contains just the URL.
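The Sorter's role can be sketched as follows: take a forward barrel, whose entries are grouped by docID, and re-sort it by wordID to obtain the inverted index. The types and names below are illustrative, not Google's actual structures.

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One entry of a forward barrel: which word occurred in which doc,
// plus the packed hit metadata.
struct ForwardEntry {
    uint32_t docID;
    uint32_t wordID;
    uint16_t hit;    // packed hit (position, font size, etc.)
};

// Re-sort a forward barrel (grouped by docID) by wordID, yielding an
// inverted index: wordID -> doclist of (docID, hit) pairs.
std::map<uint32_t, std::vector<std::pair<uint32_t, uint16_t>>>
invert(std::vector<ForwardEntry> barrel) {
    std::stable_sort(barrel.begin(), barrel.end(),
                     [](const ForwardEntry& a, const ForwardEntry& b) {
                         return a.wordID < b.wordID;
                     });
    std::map<uint32_t, std::vector<std::pair<uint32_t, uint16_t>>> inverted;
    for (const auto& e : barrel)
        inverted[e.wordID].emplace_back(e.docID, e.hit);
    return inverted;
}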

Google Architecture (cont.)


Barrels come in two kinds: short barrels, whose hitlists include title or anchor hits, and full (long) barrels for all hitlists.

Searcher: uses the new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks to answer queries.

DumpLexicon: takes the list of wordIDs produced by the Sorter together with the lexicon created by the Indexer and creates the new lexicon used by the Searcher. The lexicon stores ~14 million words.

Google Query Evaluation

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
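A minimal sketch of the doclist scan in steps 3-7, assuming each word's doclist is a sorted vector of docIDs; the short-vs-full barrel fallback and the rank computation are omitted.

#include <cstdint>
#include <vector>

// Return the docIDs that appear in every word's doclist (steps 4 and 7).
// Assumes each doclist is sorted ascending by docID.
std::vector<uint32_t> matchAll(const std::vector<std::vector<uint32_t>>& doclists) {
    std::vector<uint32_t> result;
    if (doclists.empty()) return result;
    std::vector<size_t> pos(doclists.size(), 0);   // one cursor per doclist
    for (;;) {
        // The candidate is the largest docID under any cursor.
        uint32_t target = 0;
        for (size_t i = 0; i < doclists.size(); ++i) {
            if (pos[i] >= doclists[i].size()) return result;   // a doclist ended
            if (doclists[i][pos[i]] > target) target = doclists[i][pos[i]];
        }
        // Advance every cursor to the candidate; check that they all agree.
        bool all = true;
        for (size_t i = 0; i < doclists.size(); ++i) {
            while (pos[i] < doclists[i].size() && doclists[i][pos[i]] < target) ++pos[i];
            if (pos[i] >= doclists[i].size()) return result;
            if (doclists[i][pos[i]] != target) all = false;
        }
        if (all) {                      // document matches all search terms
            result.push_back(target);
            for (size_t i = 0; i < doclists.size(); ++i) ++pos[i];
        }
    }
}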

Single Word Query Ranking


The hitlist is retrieved for the single word.
Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
Each hit type is assigned its own weight; the type-weights make up a vector of weights.
The number of hits of each type is counted to form a count vector.
The dot product of the two vectors is used to compute an IR score.
The IR score is combined with PageRank to compute the final rank.
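A sketch of this scoring with made-up type-weights and an assumed cap on the counts; the paper gives the dot-product scheme but not the numeric weights, and the exact IR/PageRank combination is likewise unspecified, so the mixing below is an assumption.

#include <algorithm>
#include <array>

enum HitType { TITLE, ANCHOR, URL_HIT, LARGE_FONT, SMALL_FONT, PLAIN, NTYPES };

// IR score = dot product of the type-weight vector and the count vector.
// The weights are illustrative; counts are capped so they taper off.
double irScore(const std::array<int, NTYPES>& counts) {
    const std::array<double, NTYPES> typeWeight = {8.0, 6.0, 4.0, 2.0, 0.5, 1.0};
    double score = 0.0;
    for (int t = 0; t < NTYPES; ++t)
        score += typeWeight[t] * std::min(counts[t], 8);
    return score;
}

// Combine the IR score with PageRank (this combination is an assumption).
double finalRank(double ir, double pageRank) {
    return ir * pageRank;
}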

Multi-word Query Ranking


Similar to single-word ranking, except that proximity must now be analyzed.
Hits occurring closer together are weighted higher.
Every proximity relation is classified into one of 10 values, ranging from a phrase match to "not even close."
Counts are computed for every combination of hit type and proximity.
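A sketch of the proximity classification, assuming doubling distance boundaries; the paper specifies 10 bins from phrase match to "not even close" but not the exact boundaries.

#include <cstdlib>

// Map the distance between two hits to a bin in 0..9:
// bin 0 = adjacent (phrase match), bin 9 = "not even close."
// The doubling boundaries (2, 4, 8, ...) are an assumption.
int proximityBin(int pos1, int pos2) {
    int gap = std::abs(pos1 - pos2);
    int bin = 0;
    for (int limit = 2; gap >= limit && bin < 9; limit *= 2) ++bin;
    return bin;
}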

Scalability
Cluster architecture combined with Moore's Law makes for high scalability. At the time of writing:

~24 million documents indexed in one week
~518 million hyperlinks indexed
Four crawlers collected 100 documents/sec

Summary of Key Optimization Techniques


Each crawler maintains its own DNS lookup cache
flex is used to generate a lexical analyzer with its own stack for parsing documents
Parallelization of the indexing phase
In-memory lexicon
Compression of the repository
Compact encoding of hitlists, accounting for major space savings
The indexer is optimized to be just faster than the crawlers, so that crawling stays the bottleneck
The document index is updated in bulk
Critical data structures are placed on local disk
The overall architecture is designed to avoid disk seeks wherever possible

References:
http://video.google.com/videoplay?docid=-1400721382961784115
http://google.stanford.edu
http://en.wikipedia.org/wiki/List_of_search_engines
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (PDF).
The audio presentation of my lecture will be posted on my home page and will be given to Dr. Hankley.
www.cis.ksu.edu/~vamsee
