Year: 2016
Paper: Optimal Web Page Download Scheduling Policies for Green Web Crawling

A. Staleness Policy
1) Time is divided into a number of slots.
2) Each time slot is big enough to download a page.
3) If a web page is selected for download, its staleness becomes zero for that page; otherwise it is incremented by one:

   Sj(t+1) = 0          if Xj(t) = 1
   Sj(t+1) = Sj(t) + 1  if Xj(t) = 0

   where Xj(t) = 1 when page j is selected for download and Xj(t) = 0 when it is not downloaded.

Pros: Easy to implement.
Cons: Freshness is not considered.

B. Greenness Policy
1) Each server is assumed to be powered by a mixture of green energy during the day and energy provided by the grid (brown energy) during the night.
2) The solar irradiation is considered; it varies from 0 kW/m2 at night to a maximum of about 1 kW/m2, reached at noon when the sun is at its highest point in the sky.
3) Based on the day of the year, the time zone, and the latitude & longitude of each server, the day length as well as the sunrise & sunset times at the server location are estimated. Here Gi(t) denotes the normalized solar irradiation.

Pros: Easy to implement.
Cons: A cloudy sky reduces the solar irradiation.
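The staleness update and the greenness idea above can be combined in a small sketch. This is a hypothetical greedy slot scheduler, not the paper's exact policy: in each slot it downloads the stalest page, but only when a toy model of the normalized solar irradiation Gi(t) is high enough.

```python
import math

def update_staleness(staleness, selected):
    """Apply Sj(t+1) = 0 if Xj(t) = 1, else Sj(t) + 1."""
    return [0 if j in selected else s + 1 for j, s in enumerate(staleness)]

def normalized_irradiation(hour, sunrise=6.0, sunset=18.0):
    """Toy Gi(t): 0 at night, peaking at 1 at solar noon (assumed sine model)."""
    if hour < sunrise or hour > sunset:
        return 0.0
    return math.sin(math.pi * (hour - sunrise) / (sunset - sunrise))

# One slot of the hypothetical greedy green scheduler:
staleness = [3, 0, 5, 1]                      # Sj(t) for pages j = 0..3
hour = 12.0                                   # time of day for this slot
if normalized_irradiation(hour) > 0.5:        # enough green energy available
    chosen = max(range(len(staleness)), key=lambda j: staleness[j])
    staleness = update_staleness(staleness, {chosen})
print(staleness)                              # page 2 was downloaded -> [4, 1, 0, 2]
```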
Year: 2003
Paper: Effective Page Refreshment Policies for Web Crawler

Synchronization Policies
A) Synchronization Frequency
1) Synchronize N elements per I time units, varying the value of I.
B) Resource Allocation
1) Uniform allocation policy
   i) Synchronize all elements at the same rate, i.e. every element is synchronized at the same frequency f.
2) Non-uniform allocation policy
   i) Synchronize elements at different rates.
C) Synchronization Order
   i) Fixed order: synchronize all elements in the same order in every cycle.
   ii) Random order: select all elements, but in a random order.
   iii) Purely random: select a random element from the database & synchronize it.
D) Synchronization Points, e.g.:
   Option 1) For a 10-page database from site A, synchronize at the beginning of the day, say midnight.
   Option 2) Synchronize most pages at the beginning of the day, but still synchronize some pages during the rest of the day.
   Option 3) Synchronize the 10 pages uniformly over the day.

Pros: Easy to implement. Fixed order is good.
Cons: Pages may get updated even when it is not required. Purely random is hypothetical.
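The three orders in C) can be sketched as follows; a minimal sketch in which the element database is just a Python list:

```python
import random

def fixed_order(elements):
    """Fixed order: visit every element in the same order each cycle."""
    return list(elements)

def random_order(elements):
    """Random order: visit every element once per cycle, but shuffled."""
    order = list(elements)
    random.shuffle(order)
    return order

def purely_random(elements, n):
    """Purely random: draw n elements with replacement; some may never be picked."""
    return [random.choice(elements) for _ in range(n)]

pages = ["a", "b", "c", "d"]
print(fixed_order(pages))             # always ['a', 'b', 'c', 'd']
print(sorted(random_order(pages)))    # same set every cycle, shuffled order
print(len(purely_random(pages, 4)))   # 4 draws, duplicates possible
```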
Year: 2012
Paper: Aiding Web Crawlers; Projecting Web Page Last Modification

Methods to detect page updates:
A) HTTP metadata
   i) Header field (Last-Modified): indicates the last modification date.
   ii) Header field (E-tag): an E-tag is a string which uniquely identifies a specific version of a component.
B) Content & semantic timestamping
C) Neighbourhood method
   i) Considers outgoing links & assets.

Pros: The human-readable E-tag & the neighbourhood method are good.

Year: 2015
Paper: Design of Improved Focused Web Crawler by Analysing Semantic Nature of URL & Anchor Text

In this method the URL & anchor text are used to determine the relevance of a web page to the information need.

Year: 2014
Paper: Comparison of

Policies compared:
i) DFS
ii) BFS
iii) Best-N-first search

Pros: Best-first search is the best among these policies.
Cons: It may not complete.

Year: 2009
Paper: Scheduling Algorithm for Domain Specific Web Crawler

Paper: The Implementation of a Web Crawler URL Filter Algorithm Based on Caching

Methods compared:
i) Hash table
ii) Bloom filter
iii) Caching

Pros: Caching is the best among these methods.
Cons: Hashing is less efficient.
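The Last-Modified and E-tag header fields from the 2012 entry above drive conditional revalidation: a crawler sends back the stored validators, and the server answers 304 Not Modified when the page is unchanged. A sketch of the server-side decision, with the function name hypothetical and If-Modified-Since compared as a plain string for simplicity (real servers parse the dates):

```python
def needs_refetch(request_headers, current_etag, current_last_modified):
    """Return False (i.e. 304 Not Modified) when the client's copy is current.

    E-tag wins over Last-Modified when both validators are present,
    mirroring HTTP conditional-request semantics.
    """
    if "If-None-Match" in request_headers:
        return request_headers["If-None-Match"] != current_etag
    if "If-Modified-Since" in request_headers:
        # Simplification: exact string match instead of real date comparison.
        return request_headers["If-Modified-Since"] != current_last_modified
    return True  # no validators sent: full download needed

headers = {"If-None-Match": '"v42"'}
print(needs_refetch(headers, '"v42"', None))   # False -> crawler can skip the page
print(needs_refetch(headers, '"v43"', None))   # True  -> page changed, refetch
```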