Veenu Mangat
I. INTRODUCTION
The World Wide Web (WWW) has become a popular medium through which people around the world spread and gather information of all kinds. However, dynamically generated web pages of various sites also contain undesired information, called noisy or irrelevant content. Advertisements, copyright statements, privacy statements, logos, tables of contents, navigation panels, headers, and footers fall under noisy content. Tables of contents and navigation panels are provided to make it easier for users to move through a site; these blocks are also called redundant blocks because they appear on almost every page. It has been estimated that almost 40-50% of the content of a web page can be considered irrelevant [1].
A user is primarily interested in the main content of a web page, and the process of identifying the main content blocks of a page is called content extraction. The term content extraction was coined by Rahman [2].
Content extraction has many applications. It lets users access information in a time-efficient manner, since irrelevant and redundant information is removed. It also improves the performance of search engines, which no longer waste time and memory indexing and storing irrelevant and redundant content; content extraction therefore acts as a preprocessor of web pages for search engines. It further helps users who access the Internet through small-screen devices, because the relevant content becomes easy to locate; otherwise it would be difficult to pick out the actual information on a small display. It simplifies several web mining tasks, such as web page crawling, web page classification, link-based ranking, and topic distillation. It can help in automatically generating a Rich Site Summary (RSS) feed from blogs or articles, and it is also used in applications such as ontology generation.
Many methods have been developed to extract content blocks from web pages. Lin and Ho [3] proposed a method named InfoDiscoverer, in which the <table> tag is used to divide a web page into blocks; features are then extracted from each block, and the entropy of these features is calculated to determine whether the block is informative. The drawback of this method is that it cannot segment pages laid out with tags other than <table>, such as <div>, and its experiments were performed only on news websites with Chinese pages.
Kao and Lin [4] proposed a method that uses the HITS (Hyperlink-Induced Topic Search) algorithm to obtain a concise structure of a web site by removing irrelevant structures, and then applies the InfoDiscoverer method to the filtered structure. This method improves on InfoDiscoverer because it operates on the filtered structure rather than on whole web pages. HITS works by finding hub and authority pages, but it has difficulty identifying hub pages that are linked to only a few authority pages.
Kao [5] proposed the WISDOM (Web Intrapage Informative Structure Mining Based on Document Object Model) method. This method uses information theory to evaluate the amount of information contained in each node of the DOM (Document Object Model) tree. It first divides the original DOM tree into subtrees and chooses candidate subtrees using an assigned threshold. A top-down greedy algorithm then selects the informative blocks and builds a skeleton set consisting of candidate informative structures. Merging and expanding operations are applied to the skeleton set to obtain the required informative blocks, and pseudo-informative nodes are removed during merging.
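The entropy idea shared by InfoDiscoverer and WISDOM can be sketched as follows. This is a minimal illustration, not the authors' implementation: blocks are assumed to be plain strings already produced by a segmentation step, and features are simply whitespace-separated terms. A term spread evenly across many blocks (such as a navigation label) gets high entropy, so a block whose terms have low average entropy is more likely to be informative.

```python
import math
from collections import Counter

def term_entropy(blocks):
    """Entropy of each term's distribution over the blocks it appears in.

    A term concentrated in one block gets entropy 0; a term spread
    evenly over k blocks gets entropy log2(k).
    """
    counts = {}  # term -> Counter mapping block index to frequency
    for i, block in enumerate(blocks):
        for term in block.lower().split():
            counts.setdefault(term, Counter())[i] += 1
    entropy = {}
    for term, per_block in counts.items():
        total = sum(per_block.values())
        entropy[term] = -sum((c / total) * math.log2(c / total)
                             for c in per_block.values())
    return entropy

def block_scores(blocks):
    """Average term entropy per block; lower scores suggest informative blocks."""
    h = term_entropy(blocks)
    scores = []
    for block in blocks:
        terms = block.lower().split()
        scores.append(sum(h[t] for t in terms) / len(terms) if terms else 0.0)
    return scores

blocks = [
    "home news sports contact about",                 # navigation-like block
    "government announces new climate policy today",  # article-like block
    "home news sports contact about",                 # repeated navigation
]
print(block_scores(blocks))
```

With this toy input, the repeated navigation terms are split evenly across two blocks and get entropy 1, while the article terms occur in only one block and get entropy 0, so the middle block scores lowest and would be kept as informative.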
Debnath [6] gave four algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor, for separating content blocks from irrelevant content. The ContentExtractor algorithm finds redundant blocks based on the occurrence of the same block across multiple web pages. The FeatureExtractor algorithm identifies the content block with the help of particular features.
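The cross-page redundancy idea behind ContentExtractor can be sketched as follows. This is a hedged illustration, not Debnath's implementation: pages are assumed to be already segmented into text blocks, and blocks are compared by normalized text rather than by the feature signatures the original algorithm uses.

```python
from collections import Counter

def redundant_blocks(pages, threshold=0.5):
    """Flag blocks whose normalized text recurs on more than `threshold`
    of the pages; such blocks are likely navigation, footers, etc."""
    seen = Counter()
    for page in pages:
        # Use a set so a block repeated within one page counts once.
        for block in {b.strip().lower() for b in page}:
            seen[block] += 1
    return {b for b, n in seen.items() if n / len(pages) > threshold}

pages = [
    ["Home | News | Contact", "Story A about the election", "© 2015 Example Corp"],
    ["Home | News | Contact", "Story B about the budget", "© 2015 Example Corp"],
]
print(redundant_blocks(pages))
```

Here the navigation bar and copyright line appear on both pages and are flagged as redundant, while each story appears only once and is kept as content.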
III. CONCLUSION
Informative content extraction from web pages is very important because web pages are unstructured and their number is growing at a very fast rate. Content extraction is useful for human users, who obtain the required information in a time-efficient manner, and it also serves as a preprocessing stage for systems such as robots, indexers, and crawlers that need to extract the main content of a web page while avoiding the treatment and processing of noisy, irrelevant, and useless information. We have presented traditional approaches for extracting the main content of web pages, as well as a new approach for content extraction based on the concepts of word-to-leaf ratio and link density.
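The link-density heuristic mentioned above can be sketched as follows. This is a simplified illustration using regular expressions on an HTML fragment, not the paper's method; a practical implementation would use a real HTML parser. The fraction of a block's words that fall inside <a> tags tends to be high for navigation blocks and low for main content.

```python
import re

def link_density(html_block):
    """Fraction of a block's words that sit inside <a> tags.

    High link density suggests a navigational block; low link
    density suggests main content.
    """
    # All visible words in the block (tags replaced by spaces).
    text = re.sub(r"<[^>]+>", " ", html_block)
    # Words that appear inside anchor elements.
    anchors = re.findall(r"<a\b[^>]*>(.*?)</a>", html_block, re.S | re.I)
    anchor_text = re.sub(r"<[^>]+>", " ", " ".join(anchors))
    total = len(text.split())
    return len(anchor_text.split()) / total if total else 0.0

nav = '<a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a>'
article = ('<p>The committee published its full report on Tuesday, '
           'see <a href="/r">here</a>.</p>')
print(link_density(nav), link_density(article))
```

In this toy example every word of the navigation block is linked, while only one word of the article sentence is, so a simple threshold on link density separates the two.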