Presented by:
B. Tech. Final Year
Information Technology
Outline
• Web Crawlers
• Basic Crawling
• Selective crawling
• Focused crawling
• URL Frontier
• Web Crawler Architecture
• Crawling Policy
• Web traps
Web Crawlers
• A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
• A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the fetched pages and adds them to the list of URLs to visit, called the crawl frontier.
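The seed/frontier bookkeeping can be sketched in a few lines of Python. This is an illustrative data structure only (the class and method names are made up for this sketch); a real frontier would also handle politeness and prioritization:

    from collections import deque

    class CrawlFrontier:
        """Sketch of a crawl frontier: FIFO queue of URLs plus a seen-set."""

        def __init__(self, seeds):
            self._queue = deque()
            self._seen = set()
            for url in seeds:
                self.add(url)

        def add(self, url):
            # Enqueue each URL only the first time it is encountered.
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

        def next_url(self):
            # FIFO order yields a breadth-first crawl.
            return self._queue.popleft() if self._queue else None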
Basic Crawling Algorithm
• A simple crawler uses a graph algorithm such as BFS
– Maintains a queue, Q, that stores URLs
– Two repositories: D stores fetched documents, E stores URLs already seen
• Given S0 (seeds): initial collection of URLs
• Each iteration
– Dequeue, fetch, and parse document for new URLs
– Enqueue new URLs that have not yet been visited (the web graph contains cycles, so the crawler must track visited URLs to avoid refetching)
• Termination conditions
– Time allotted to crawling expired
– Storage resources are full
– On termination, Q and D still hold data; the anchor text pointing to the URLs left in Q can be used to answer queries even for pages never fetched (many search engines do this)
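The loop above can be written out in Python. The sketch below is a minimal illustration under stated assumptions (the function name crawl, the max_pages limit, and the User-Agent string are inventions of this sketch, not from the slides); it uses only the standard library:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    class LinkParser(HTMLParser):
        """Collect href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        """BFS crawl: Q is the URL queue, D the documents, E the seen URLs."""
        q = deque(seeds)                     # Q: crawl frontier
        seen = set(seeds)                    # E: URLs already enqueued
        docs = {}                            # D: fetched documents
        while q and len(docs) < max_pages:   # termination: resource limit
            url = q.popleft()                # dequeue
            try:
                req = Request(url, headers={"User-Agent": "toy-crawler"})
                html = urlopen(req, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue                     # skip unreachable pages
            docs[url] = html                 # store the document
            parser = LinkParser()
            parser.feed(html)                # parse for new URLs
            for href in parser.links:
                link = urljoin(url, href)    # resolve relative links
                if link.startswith("http") and link not in seen:
                    seen.add(link)           # the web graph has cycles:
                    q.append(link)           # enqueue each URL only once
        return docs, q

    # Usage: docs, frontier = crawl(["https://example.com/"])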
Breadth-First Crawl
• Basic idea:
- start at a set of known URLs
- explore in “concentric circles” around these URLs
[Figure: concentric circles around the start pages, showing start pages, distance-one pages, and distance-two pages]
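To make the concentric-circle picture concrete, the following sketch groups pages by their BFS distance from the start set; the toy link graph is hypothetical, used only for illustration:

    from collections import deque

    def bfs_layers(graph, start_pages):
        """Group pages into 'concentric circles': distance 0, 1, 2, ... from the seeds."""
        dist = {u: 0 for u in start_pages}
        q = deque(start_pages)
        while q:
            u = q.popleft()
            for v in graph.get(u, []):
                if v not in dist:            # first visit fixes the page's layer
                    dist[v] = dist[u] + 1
                    q.append(v)
        layers = {}
        for page, d in dist.items():
            layers.setdefault(d, []).append(page)
        return layers

    # Hypothetical link graph: A links to B and C; B links to D; C links back to A.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["A"], "D": []}
    print(bfs_layers(graph, ["A"]))
    # {0: ['A'], 1: ['B', 'C'], 2: ['D']}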