
Web Crawlers

Presented by:
B. Tech. Final Year
Information Technology
Outline
• Web Crawlers
• Basic Crawling
• Selective crawling
• Focused crawling
• URL Frontier
• Web Crawler Architecture
• Crawling Policy
• Web traps
Web Crawlers
• A Web crawler, sometimes called
a spider or spiderbot and often shortened to crawler, is
an Internet bot that systematically browses the World
Wide Web, typically for the purpose of Web
indexing (web spidering).
• A Web crawler starts with a list of URLs to visit, called
the seeds. As the crawler visits these URLs, it identifies
all the hyperlinks in the page and adds them to the list of
URLs to visit, called the crawl frontier.
Basic Crawling Algorithm
• A simple crawler uses a graph algorithm such as BFS
– Maintains a queue, Q, that stores URLs
– Two repositories: D- stores documents, E- stores URLs
• Given S0 (seeds): initial collection of URLs
• Each iteration
– Dequeue, fetch, and parse document for new URLs
– Enqueue new URLs not yet visited (the Web graph has cycles, so visited URLs must be tracked)

• Termination conditions
– Time allotted to crawling expired
– Storage resources are full
– When crawling stops, Q and D still hold data; anchor text pointing to the
unfetched URLs in Q can still be used to return query results (many search engines do this)
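
A minimal Python sketch of this BFS crawl loop; the requests and BeautifulSoup libraries, the seed list, and the page limit are illustrative assumptions, not part of the original algorithm description:

from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def bfs_crawl(seeds, max_pages=100):
    q = deque(seeds)          # Q: queue of URLs to visit
    seen = set(seeds)         # E: URLs already discovered
    docs = {}                 # D: fetched documents

    while q and len(docs) < max_pages:      # termination: page/storage limit
        url = q.popleft()                   # dequeue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        docs[url] = resp.text               # store the document in D
        # parse for new URLs and enqueue the ones not yet seen
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                q.append(link)
    return docs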
Breadth-First Crawl:
• Basic idea:
- start at a set of known URLs
- explore in “concentric circles” around these URLs

[Diagram: concentric rings around the start pages – distance-one pages, then distance-two pages]

• used by broad web search engines
• balances load between servers
Practical Modifications & Issues
• Time to download a doc is unknown
– DNS lookup may be slow
– Network congestion, connection delays
– Exploit bandwidth- run concurrent fetching threads
• Crawlers should be respectful of servers and not abuse resources
at target site (robots exclusion protocol)
• Multiple threads should not fetch from same server
simultaneously or too often
• Broaden crawling fringe (more servers) and increase time
between requests to same server
• Storing Q, and D on disk requires careful external memory
management
• Crawlers must avoid aliases and “traps” – the same doc may be addressed by many
different URLs
• Web is dynamic and changes in topology and content
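
A hedged sketch of per-host politeness for concurrent fetching threads; the delay value, the lock-protected last-access table, and the helper name are illustrative assumptions:

import threading
import time
from urllib.parse import urlparse
import requests

POLITENESS_DELAY = 2.0          # seconds between requests to the same host
_last_access = {}               # host -> timestamp of the last request to it
_lock = threading.Lock()

def polite_get(url):
    """Fetch url, waiting if the same host was contacted too recently."""
    host = urlparse(url).netloc
    while True:
        with _lock:
            wait = POLITENESS_DELAY - (time.time() - _last_access.get(host, 0.0))
            if wait <= 0:
                _last_access[host] = time.time()
                break
        time.sleep(wait)        # other threads may fetch from other hosts meanwhile
    return requests.get(url, timeout=10)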
Selective Crawling

• Recognizing the relevance or importance of
sites, limit fetching to the most important subset
• Define a scoring function for relevance,
s^(τ)_θ(u), where u is a URL,
τ is the relevance criterion, and
θ is the set of parameters.

• E.g. best-first search: use the score to order the URLs in the queue


• Measure efficiency as r_t / t, where t = #pages fetched and r_t =
#fetched pages with score above a threshold s_t (ideally r_t = t)
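
A small sketch of a best-first selective crawler, using a priority queue over scores; the libraries and the pluggable score(url) function are illustrative assumptions:

import heapq
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def selective_crawl(seeds, score, max_pages=100):
    """Best-first crawl: always fetch the highest-scoring frontier URL next.
    score(url) can be any relevance function s(u) chosen by the caller."""
    heap = [(-score(u), u) for u in seeds]   # max-heap via negated scores
    heapq.heapify(heap)
    seen, docs = set(seeds), {}
    while heap and len(docs) < max_pages:
        _, url = heapq.heappop(heap)
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        docs[url] = page
        for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return docs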
Ex: Scoring Functions
(Selective Crawling)

• Depth – limit #docs downloaded from a single site by a)
setting a threshold, b) depth in the directory tree, or c) limiting path
length; maximizes breadth
s^(depth)(u) = 1 if |root(u) ~> u| < δ, and 0 otherwise,
where root(u) is the root of the site containing u and δ is the depth threshold

• Popularity – assigning importance to the most popular pages; e.g. a
relevance function based on backlinks
s^(backlinks)(u) = 1 if indegree(u) > τ, and 0 otherwise,
where τ is the backlink threshold

• PageRank – a measure of popularity that recursively assigns
each link a weight proportional to the popularity of the doc it comes from
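
Minimal sketches of the depth and backlink scoring functions above; the thresholds (delta, tau) and the indegree map are illustrative assumptions:

from urllib.parse import urlparse

def depth_score(url, delta=5):
    """s^(depth)(u): 1 if the path from the site root to u is shorter than delta."""
    path_depth = len([p for p in urlparse(url).path.split("/") if p])
    return 1 if path_depth < delta else 0

def backlink_score(url, indegree, tau=10):
    """s^(backlinks)(u): 1 if u has more than tau known in-links.
    indegree maps URL -> number of backlinks seen so far (an assumed input)."""
    return 1 if indegree.get(url, 0) > tau else 0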
Focused Crawling
• Searches for info related to a certain topic, not
driven by generic quality measures
• Relevance prediction
• Context graphs
• Reinforcement learning
• Examples: CiteSeer, the Fish-Search algorithm (agents
accumulate energy for relevant docs, consume
energy for network resources)
Relevance Prediction
(Focused Crawling)

• Define the score as the conditional probability that a doc is
relevant, given the text in the doc; c is the topic of interest
s^(topic)(u) = P(c | d(u), θ)
where d(u) is the content of the doc at vertex u and θ are the
adjustable parameters of the classifier
• Strategies for approximating the topic score
– Parent-based: score a fetched doc and extend the score
to all URLs in that doc (“topic locality”)
s^(topic)(u) ≈ P(c | d(v), θ), where v is a parent of u
– Anchor-based: just use the text d(v,u) in the anchor(s)
where the link to u appears (“semantic linkage”)
• E.g. a naïve Bayes classifier trained on relevant docs.
• The naïve Bayes algorithm is based on probabilistic learning and classification. It
assumes that the features are conditionally independent of one another given the class.
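
A small parent-based scoring sketch using a naïve Bayes classifier from scikit-learn (assumed available); the two training documents are toy examples:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: docs labelled 1 (on-topic) or 0 (off-topic).
train_docs = ["web crawler frontier url indexing", "football league match results"]
labels = [1, 0]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_docs), labels)

def topic_score(parent_text):
    """Parent-based score: estimate of P(c | d(v), theta) from the parent doc d(v)."""
    x = vectorizer.transform([parent_text])
    return clf.predict_proba(x)[0, 1]      # probability of the on-topic class c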
Context Graphs
(Focused Crawling)

• Take advantage of knowledge of Internet topology
• Train a machine-learning system to predict “how far”
relevant info can be expected to be found
– E.g. a 2-layer context graph: a layered graph built around node u
[Diagram: node u at the centre, surrounded by a Layer 1 ring and a Layer 2 ring]
– After training, predict which layer a new doc belongs to; the layer
indicates how many links must be followed before relevant info is reached
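
A sketch of how the training layers might be assigned; backlinks() is a hypothetical helper (e.g. backed by a search engine's link query) and the two-layer limit mirrors the example above:

from collections import deque

def build_context_layers(targets, backlinks, max_layer=2):
    """Label training docs by layer: layer k = pages k links away from a
    relevant target page. backlinks(url) returns pages linking to url."""
    layer_of = {u: 0 for u in targets}      # layer 0 = the relevant docs themselves
    frontier = deque(targets)
    while frontier:
        u = frontier.popleft()
        if layer_of[u] == max_layer:
            continue
        for v in backlinks(u):
            if v not in layer_of:           # keep the smallest layer for each page
                layer_of[v] = layer_of[u] + 1
                frontier.append(v)
    return layer_of   # use these labels to train one classifier per layer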
Reinforcement Learning
(Focused Crawling)

• Immediate rewards when the crawler downloads a relevant doc
• Policy learned by RL can guide agent toward high
long-term cumulative rewards
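
A very small tabular sketch of the idea; real focused crawlers learn regressors over rich text features rather than a single feature table, so the feature keys and constants below are illustrative assumptions:

GAMMA = 0.9          # discount factor for long-term reward
ALPHA = 0.2          # learning rate
q = {}               # link feature (e.g. an anchor word) -> estimated future reward

def q_value(feature):
    return q.get(feature, 0.0)

def update(feature, reward, next_features):
    """Q-learning-style update: immediate reward (1 if the fetched doc is relevant)
    plus the discounted value of the best link found on the new page."""
    best_next = max((q_value(f) for f in next_features), default=0.0)
    q[feature] = q_value(feature) + ALPHA * (reward + GAMMA * best_next - q_value(feature))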
URL Frontier
• A crawl frontier is the part of a crawling system that
decides the logic and policies to follow when a crawler is
visiting websites (what pages should be crawled next,
priorities and ordering, how often pages are revisited,
etc).
• The frontier is initialized with a list of start URLs that we
call the seeds. Once the frontier is initialized, the crawler
asks it which pages should be visited next.
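
A minimal frontier sketch with priority ordering and revisit scheduling; the policy knobs (priority, revisit_after) are illustrative assumptions:

import heapq
import time

class Frontier:
    """Minimal crawl frontier: seed it, then ask it which URL to visit next."""
    def __init__(self, seeds):
        self._heap = [(0, time.time(), u) for u in seeds]   # (priority, not_before, url)
        heapq.heapify(self._heap)
        self._known = set(seeds)

    def add(self, url, priority=1, revisit_after=0.0):
        # new URLs are enqueued once; known URLs only when scheduled for a revisit
        if url not in self._known or revisit_after > 0:
            self._known.add(url)
            heapq.heappush(self._heap, (priority, time.time() + revisit_after, url))

    def next_url(self):
        """Return the best URL whose (re)visit time has arrived, or None."""
        if self._heap and self._heap[0][1] <= time.time():
            return heapq.heappop(self._heap)[2]
        return None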
Anatomy of a crawler.
• Page fetching threads
– Starts with DNS resolution
– Finishes when the entire page has been fetched
• Each page
– stored in compressed form to disk/tape
– scanned for outlinks
• Work pool of outlinks
– maintain network utilization without overloading it
• Dealt with by load manager
• Continue till the crawler has collected a sufficient number
of pages.
Basic crawl architecture
DNS (Domain Name Server)
• A lookup service on the internet
– Given a URL, retrieve its IP address
– Service provided by a distributed set of servers –
thus, lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are
blocking: only one outstanding request at a time
• Solutions
– DNS caching
– Batch DNS resolver – collects requests and sends
them out together
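
A simple caching wrapper around the (blocking) OS resolver, as a sketch; a production crawler would also batch lookups or use an asynchronous resolver:

import socket

_dns_cache = {}      # hostname -> IP address

def resolve(host):
    """Resolve a hostname, caching results to avoid repeated slow lookups."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # blocking OS lookup
    return _dns_cache[host]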
Parsing: URL normalization
• When a fetched document is parsed,
some of the extracted links are relative
URLs
• During parsing, must normalize (expand)
such relative URLs
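
A short sketch of expanding relative URLs with Python's standard urljoin; stripping "#fragment" anchors is added as a common extra step:

from urllib.parse import urljoin, urldefrag

def normalize(base_url, href):
    """Expand a relative link against the page it was found on and drop fragments."""
    absolute = urljoin(base_url, href)        # e.g. "../other.html" -> full URL
    return urldefrag(absolute).url            # strip "#section" anchors

# normalize("http://example.com/dir/page.html", "../other.html#top")
# -> "http://example.com/other.html"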
Content seen?
• Duplication is widespread on the web
• If the page just fetched is already in the
index, do not further process it
• This is verified using document fingerprints
or shingles
• A k-shingle is a sequence of k consecutive words
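
A sketch of k-shingle fingerprints and a Jaccard resemblance check for near-duplicate detection; k = 4 and the MD5 hash are illustrative choices:

import hashlib

def shingles(text, k=4):
    """Set of k-shingles (sequences of k consecutive words) in a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprints(text, k=4):
    """Hash each shingle so near-duplicate pages can be compared cheaply."""
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles(text, k)}

def resemblance(a, b):
    """Jaccard overlap of two fingerprint sets; close to 1.0 means near-duplicates."""
    return len(a & b) / len(a | b) if a | b else 0.0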
Filters and robots.txt
• Filters – regular expressions for URLs that should
(or should not) be crawled
• Once a robots.txt file is fetched from a
site, need not fetch it repeatedly
– Doing so burns bandwidth, hits web
server
• Cache robots.txt files
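
A sketch of robots.txt caching with Python's standard urllib.robotparser, so each host's file is fetched at most once:

from urllib.parse import urlparse
from urllib import robotparser

_robots_cache = {}    # host -> parsed robots.txt

def allowed(url, user_agent="*"):
    """Check robots.txt, fetching it at most once per host."""
    host = urlparse(url).netloc
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()                      # fetch and parse once, then cache
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(user_agent, url)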
Crawl policies
• Selection policy
• Re-visit policy
• Politeness policy
• Parallelization policy
• Selection policy
– PageRank
– Path-ascending
– Focused crawling
• Re-visit policy
– Freshness
– Age (see the sketch after this list)
• Politeness policy
– So that crawlers don’t overload web servers
– Set a delay between GET requests
• Parallelization policy
– Distributed web crawling
– To maximize the download rate
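
A small sketch of the standard freshness and age measures used by re-visit policies; the function signatures are illustrative assumptions:

import time

def freshness(copy_is_current):
    """Freshness of page p at time t: 1 if the local copy equals the live page, else 0."""
    return 1 if copy_is_current else 0

def age(modified_at, now=None):
    """Age of page p at time t: 0 if the copy is still current, otherwise the
    time elapsed since the live page was last modified."""
    now = now or time.time()
    return 0.0 if modified_at is None else max(0.0, now - modified_at)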
Spider traps
• Protecting from crashing on
– Ill-formed HTML
• E.g.: page with 68 kB of null characters
– Misleading sites
• indefinite number of pages dynamically generated
by CGI scripts
• paths of arbitrary depth created using soft directory
links and path remapping features in HTTP server
Spider Traps: Solutions
• No automatic technique can be foolproof
• Check for URL length
• Guards
– Preparing regular crawl statistics
– Adding dominating sites to guard module
– Disable crawling active content such as CGI
form queries
– Eliminate URLs with non-textual data types
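
A heuristic guard sketch combining the checks listed above (URL length, path depth, active content, non-textual types); the limits and patterns are illustrative assumptions:

import re
from urllib.parse import urlparse

MAX_URL_LENGTH = 256
MAX_PATH_DEPTH = 15
BLOCKED = re.compile(r"\.(cgi|exe|zip|jpg|png|gif|mp4)$|[?&]", re.IGNORECASE)

def looks_like_trap(url):
    """Heuristic guard: no single check is foolproof, so combine several."""
    if len(url) > MAX_URL_LENGTH:                 # suspiciously long URL
        return True
    if urlparse(url).path.count("/") > MAX_PATH_DEPTH:   # arbitrarily deep paths
        return True
    if BLOCKED.search(url):                       # CGI form queries / non-textual types
        return True
    return False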
Text repository
• Crawler’s last task
 Dumping fetched pages into a repository
• Decoupling the crawler from other functions
is preferred for efficiency and reliability
• Page-related information stored in two
parts
 meta-data
 page contents.
Storage of page-related
information
• Meta-data
 relational in nature
 usually managed by custom software to avoid
relational database system overheads
 text index involves bulk updates
 includes fields like content-type, last-modified
date, content-length, HTTP status code, etc.
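
A sketch of the meta-data record implied by these fields, as a simple Python dataclass (the field names are illustrative):

from dataclasses import dataclass

@dataclass
class PageMetadata:
    """Meta-data kept alongside the stored (compressed) page contents."""
    url: str
    content_type: str        # e.g. "text/html"
    last_modified: str       # value of the Last-Modified header, if any
    content_length: int
    http_status: int         # e.g. 200, 404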
Any Questions
Thank You
