2016
Name of Paper: Optimal Web Page Download Scheduling Policies for Green Web Crawling

Policy
A. Staleness Policy
1) In this policy, time is divided into a number of slots.
2) Each time slot is long enough to download a page.
3) If a web page is selected for download, its staleness becomes zero; otherwise its staleness is incremented by one.
Pros (Staleness Policy): Easy to implement.
Cons (Staleness Policy): Freshness is not considered.

Pros (Greenness Policy): Easy to implement.
Cons (Greenness Policy): If the sky is cloudy, solar irradiation is reduced.

S_j(t+1) = 0             if X_j(t) = 1
S_j(t+1) = S_j(t) + 1    if X_j(t) = 0

where X_j(t) = 1 when the page is selected for download and X_j(t) = 0 when it is not.
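The staleness update above can be sketched in Python; the helper name and data layout are ours, and the rule for choosing which pages to download is left abstract:

```python
def update_staleness(staleness, selected):
    """Apply the staleness recurrence for one time slot.

    staleness: dict mapping page id -> S_j(t)
    selected:  set of page ids with X_j(t) = 1 (downloaded this slot)
    Returns the dict of S_j(t+1) values.
    """
    return {
        page: 0 if page in selected else s + 1  # S_j resets on download, else +1
        for page, s in staleness.items()
    }

# One slot: page "a" is downloaded, "b" is not.
s1 = update_staleness({"a": 3, "b": 1}, selected={"a"})
# s1 == {"a": 0, "b": 2}
```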
B. Greenness Policy
1) In this policy, each server is assumed to be powered by a mixture of green (solar) energy during the day and grid ("brown") energy during the night.
2) Solar irradiation is considered here; it varies from 0 kW/m² at night to a maximum of about 1 kW/m², reached at noon when the sun is at its highest point in the sky.
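The day/night irradiation profile described in item 2 can be modelled, for illustration only, as a half-sine that is zero at night and peaks at about 1 kW/m² at noon; the sunrise/sunset hours and the sinusoid are our assumptions, not the paper's model:

```python
import math

def solar_irradiation(hour, sunrise=6.0, sunset=18.0, peak=1.0):
    """Toy solar irradiation curve in kW/m^2: zero outside daylight
    hours, a half-sine peaking at `peak` at solar noon.
    Illustrative assumption, not the paper's exact model."""
    if hour <= sunrise or hour >= sunset:
        return 0.0  # night: the crawler would fall back to brown (grid) energy
    return peak * math.sin(math.pi * (hour - sunrise) / (sunset - sunrise))

# Peaks at noon, zero at midnight and at sunset.
print(solar_irradiation(12.0), solar_irradiation(0.0), solar_irradiation(18.0))
```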
3) Based upon the ...

Name of Paper: Effective Page Refreshment Policies for Web Crawler
Synchronization Policies
A) Synchronization Frequency
1) Synchronize N elements per I time units, varying the value of I.
B) Resource Allocation
1) Uniform allocation policy
i) Synchronize all elements at the same rate.
ii) All elements are synchronized at the same frequency f.
2) Non-uniform allocation policy
i) Synchronize elements at different rates.
C) Synchronization Order
i) Fixed Order: Synchronize all elements repeatedly in the same order.
ii) Random Order: Select all elements in each cycle, but in a random order.
iii) Purely Random: Select a random element from the database and synchronize it.
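The three synchronization orders can be contrasted in a short Python sketch (the function names and page list are ours):

```python
import random

def fixed_order(elements):
    """Fixed order: visit every element in the same order each cycle."""
    return list(elements)

def random_order(elements, rng):
    """Random order: visit every element exactly once per cycle,
    but shuffle the order each time."""
    order = list(elements)
    rng.shuffle(order)
    return order

def purely_random(elements, rng, n):
    """Purely random: n independent picks from the database;
    an element may be picked twice or missed entirely."""
    return [rng.choice(elements) for _ in range(n)]

pages = ["p1", "p2", "p3"]
rng = random.Random(0)          # fixed seed so the sketch is repeatable
print(fixed_order(pages))       # same order every cycle
print(random_order(pages, rng)) # all pages, shuffled
print(purely_random(pages, rng, 3))  # may repeat or miss pages
```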
Pros: Easy to implement.
Cons: Pages may get updated even when it is not required.
Remarks: Fixed Order is good; Purely Random is hypothetical.
D) Synchronization Points:
e.g., for a 10-page database from site A:
Option 1) Synchronize all pages at the beginning of the day, say midnight.
Option 2) Synchronize most pages at the beginning of the day, but still synchronize some pages during the rest of the day.
Option 3) Synchronize the 10 pages uniformly over the day.
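The three options can be sketched as schedules of synchronization times over a 24-hour day; the 80/20 split used for "most pages" in option 2 is our assumption:

```python
def sync_times(option, n_pages=10):
    """Return hypothetical synchronization times (hours, 0-24) for n_pages.

    option 1: all pages at the beginning of the day (midnight).
    option 2: most pages at midnight, the rest spread over the day.
    option 3: pages spread uniformly over the day.
    """
    if option == 1:
        return [0.0] * n_pages
    if option == 2:
        head = n_pages * 8 // 10   # "most" taken as 80% here, an assumption
        rest = n_pages - head
        return [0.0] * head + [24.0 * (i + 1) / (rest + 1) for i in range(rest)]
    if option == 3:
        return [24.0 * i / n_pages for i in range(n_pages)]
    raise ValueError("option must be 1, 2, or 3")

print(sync_times(1))  # ten syncs, all at midnight
print(sync_times(3))  # one sync every 2.4 hours
```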
2012
Name of Paper: Aiding Web Crawlers: Projecting Web Page Last Modification

Methods to decide whether pages need updating:
1) HTTP metadata
Header field (Last-Modified):
i) Indicates the last modification date of the page.
Header field (ETag):
i) An ETag is a string which uniquely identifies a specific version of a component.
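A crawler uses these header fields through standard HTTP conditional requests (If-Modified-Since / If-None-Match): the server answers 304 Not Modified when the cached copy is still current, so the page is not re-downloaded. A minimal sketch with Python's standard library (the URL and values are placeholders):

```python
import urllib.request

def conditional_get(url, etag=None, last_modified=None):
    """Build a conditional GET request: the server replies
    304 Not Modified (no body) when the cached ETag /
    Last-Modified value still matches, so the crawler can
    skip re-downloading an unchanged page."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req  # pass to urllib.request.urlopen(); a 304 raises HTTPError

req = conditional_get("http://example.com/", etag='"abc123"',
                      last_modified="Wed, 01 Jun 2016 00:00:00 GMT")
```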
2015
2014
Name of Paper: Design of Improved Focused Web Crawler by Analysing Semantic Nature of URL & Anchor Text
Comparison of ...

In this method, the URL and anchor text are used to determine the relevance of a web page to the information need.
C) Neighbourhood Method:
i) It considers outgoing links & assets.
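One simple way to score relevance from URL and anchor text is term overlap with the topic keywords; the scoring scheme below is an illustrative sketch of the idea, not the paper's exact method:

```python
import re

def relevance(url, anchor_text, topic_terms):
    """Toy relevance score for a focused crawler: the fraction of
    topic terms that appear among the tokens of the URL and the
    anchor text. Illustrative only, not the paper's formula."""
    tokens = set(re.findall(r"[a-z0-9]+", (url + " " + anchor_text).lower()))
    hits = sum(1 for term in topic_terms if term in tokens)
    return hits / len(topic_terms)

score = relevance("http://example.com/solar-energy/panels",
                  "cheap solar panels", ["solar", "energy", "panels"])
# score == 1.0: all three topic terms occur in the URL or anchor text
```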
Pros: Human readable.
Remarks: ETag & Neighbourhood methods are good.

2009
Name of Paper: Scheduling Algorithm for Domain Specific Web Crawler

Policies:
i) DFS
ii) BFS
iii) Best-N-first search
Pros: Best-first search is best among these policies.
Cons: It may not complete.

Name of Paper: The Implementation of a Web Crawler URL Filter Algorithm Based on Caching

Methods:
i) Hash table
ii) Bloom filter
iii) Caching
Pros: Caching is best among these methods.
Cons: Hashing is less efficient.
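A best-first frontier and a URL filter fit together naturally: a priority queue orders candidate URLs by score, and a seen-set (the role the hash table, Bloom filter, or cache plays above) drops duplicates before they re-enter the queue. A minimal sketch with an in-memory set standing in for the filter (link graph and scores are made up):

```python
import heapq

def crawl_order(seeds, links, score):
    """Best-first frontier sketch: always expand the highest-scoring
    unseen URL. `links` maps url -> outgoing urls; `score` maps
    url -> priority. A seen-set filters already-queued URLs."""
    heap = [(-score(u), u) for u in seeds]  # max-heap via negated score
    heapq.heapify(heap)
    seen, order = set(seeds), []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)                   # "download" the page
        for out in links.get(url, []):
            if out not in seen:             # URL filter: skip duplicates
                seen.add(out)
                heapq.heappush(heap, (-score(out), out))
    return order

links = {"a": ["b", "c"], "c": ["b", "d"]}
scores = {"a": 5, "b": 1, "c": 4, "d": 3}
# crawl_order(["a"], links, scores.get) == ["a", "c", "d", "b"]
```

Swapping the priority queue for a FIFO queue gives BFS, and for a LIFO stack gives DFS, which is why the three policies share the same frontier-plus-filter skeleton.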