
2009 International Forum on Computer Science-Technology and Applications

Improve the Performance of the Webpage Content Extraction using Webpage Segmentation Algorithm
Fu Lei, Meng Yao, Yu Hao
Fujitsu R&D Center CO., LTD, Beijing, China, 100025

fulei@cn.fujitsu.com, mengyao@cn.fujitsu.com, yu@cn.fujitsu.com


ABSTRACT: In this paper, we present a method that uses a webpage segmentation algorithm to improve the performance of webpage content extraction. Traditional methods often depend on parsing the DOM tree of the webpage and judging each node to determine whether it is a text node. This kind of method has a potential problem: it sometimes throws part of the content away because of its local judgment strategy. Our method, which is based on the VIPS (Vision-based Page Segmentation) algorithm, addresses this problem: it extracts the content according to the coordinate information of each block and helps the traditional method recall the lost part of the content.

KEYWORDS: Webpage Segmentation; Webpage Content Extraction; DOM tree analysis; VIPS

I. INTRODUCTION

With the explosion of the World Wide Web, a large amount of data on many different subjects has become available online, which has opened the opportunity for users to benefit from the available data in many interesting ways. Usually, users retrieve web data by browsing and keyword searching, which are intuitive forms of accessing data on the web. However, these search strategies present several limitations. Browsing is not suitable for locating particular items of data, because following links is tedious and it is easy to get lost. Keyword searching is sometimes more efficient than browsing, but often returns large amounts of data, far beyond what the user can handle. As a result, in spite of being publicly and readily available, web data can hardly be properly queried or manipulated. Researchers have therefore begun to consider how to extract the content of a webpage for further handling.

The traditional approaches for extracting data from a webpage can be classified as follows. The first is the method based on wrappers [1-5]. Wrappers are specialized programs that identify data of interest and map them to some suitable format. This method has a well-known shortcoming: wrappers are usually developed manually, which is very time-consuming, and they are difficult to debug. Although many researchers have introduced machine learning methods to optimize the process, the approach still lacks the power to deal with many different web pages; a wrapper often works only on a set of similar web pages, not on most online web pages. The second is the method based on HTML DOM tree analysis [6-8, 10, 11], on which much recent work focuses. The main idea of this method is to judge, for each node of the DOM tree, whether it is a text node. Although many researchers have tried to improve it in many detailed aspects, such as the tag tree method [11] and the ontology method [12], it still has some problems. One main problem is that this method often throws some sentences of the main body content away. Because it is based on local judgment of the DOM tree, it cannot get the whole view of the page. Moreover, the DOM tree was initially introduced for presentation in the browser rather than for description of the semantic structure of the webpage, so the semantic relation between different sentences cannot be obtained directly. It is no wonder that this method sometimes loses part of the content.

In fact, most research shows that when a page is presented to the user, spatial and visual cues play a very important role: they help the user to unconsciously divide the webpage into several semantic parts. If we can make use of this information, it will help us extract the body content of the page much more precisely. Detecting the semantic content structure of a webpage could potentially improve the performance of webpage content extraction. The VIPS algorithm [9] does this work well: it divides the webpage into independent semantic blocks, and it also provides the coordinate information of each block to assist the webpage content extraction. Based on the VIPS algorithm, we can recall the lost sentences.

The rest of the paper is organized as follows. Section II provides an overview of the VIPS algorithm. Section III introduces our method, which uses VIPS to improve the performance of webpage content extraction; the results are also shown in that section. Finally, we give concluding remarks in Section IV.
978-0-7695-3930-0/09 $26.00 2009 IEEE
DOI 10.1109/IFCSTA.2009.84

II. OVERVIEW OF VIPS ALGORITHM

The VIPS algorithm makes full use of the webpage layout features. First, it extracts all the suitable blocks based on the HTML DOM tree structure. Then it tries to find the separators between these extracted blocks. Here, separators denote the horizontal or vertical lines in a webpage that visually cross no blocks. Finally, based on these separators, the semantic structure of the webpage is constructed and the webpage is divided into independent blocks. The VIPS algorithm employs a top-down approach, which is very effective.
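The separator idea described above can be illustrated with a minimal sketch. It is not the VIPS implementation from [9]; it assumes each extracted block is an axis-aligned rectangle and only looks for horizontal separators, that is, horizontal bands between blocks that no block crosses. The names `Block` and `horizontal_separators` are hypothetical and chosen only for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Hypothetical rectangle for an extracted block (pixel coordinates)."""
    x: int
    y: int
    width: int
    height: int


def horizontal_separators(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Return (top, bottom) bands between blocks that no block crosses."""
    # Project every block onto the y axis and sort by top edge.
    spans = sorted((b.y, b.y + b.height) for b in blocks)
    separators: List[Tuple[int, int]] = []
    if not spans:
        return separators
    cursor = spans[0][1]  # lowest y coordinate covered so far
    for top, bottom in spans[1:]:
        if top > cursor:
            # A gap that no block overlaps: this band is a separator.
            separators.append((cursor, top))
        cursor = max(cursor, bottom)
    return separators


# Example layout: a header, two side-by-side columns, and a footer.
layout = [
    Block(0, 0, 800, 100),
    Block(0, 120, 500, 400),
    Block(520, 120, 280, 400),
    Block(0, 540, 800, 60),
]
print(horizontal_separators(layout))  # [(100, 120), (520, 540)]
```

Vertical separators could be found the same way by projecting the blocks onto the x axis. The sketch ignores separator weights and the recursive, top-down refinement of blocks.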
The basic model of VIPS is described below. A web page $W$ is represented as a triple:

$W = (B, S, \delta)$

$B = \{B_1, B_2, \ldots, B_N\}$ is a finite set of blocks. These blocks must not overlap. Each block can be recursively viewed as a sub-web-page associated with a sub-structure induced from the whole page structure. $S = \{S_1, S_2, \ldots, S_N\}$ is a finite set of separators, including horizontal separators and vertical separators. Every separator has a weight indicating its visibility, and all the separators in the same $S$ have the same weight. $\delta$ is the relationship between every two blocks in $B$ and can be expressed as:

$\delta = B \times B \to S \cup \{\mathrm{NULL}\}$

For example, suppose $B_i$ and $B_j$ are two blocks in $B$. $\delta(B_i, B_j) \neq \mathrm{NULL}$ indicates that $B_i$ and $B_j$ are exactly separated by the separator $\delta(B_i, B_j)$; in other words, the two blocks are adjacent to each other. Otherwise, there are other blocks between $B_i$ and $B_j$.
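As an illustration of this model, the following sketch stores the triple of blocks, separators, and the relationship between blocks as a small data structure. It is only one possible encoding: the class and field names are hypothetical, NULL is represented by Python's `None`, and the relationship is assumed to be symmetric, which the text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical block with a name and its coordinate information."""
    name: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Separator:
    orientation: str  # "horizontal" or "vertical"
    weight: int       # visibility of the separator


@dataclass
class Page:
    """A page as blocks, separators, and a block-to-block relationship."""
    blocks: List[Block]
    separators: List[Separator]
    relation: Dict[Tuple[str, str], Separator] = field(default_factory=dict)

    def separate(self, a: Block, b: Block, sep: Separator) -> None:
        # Assumption: the relationship is symmetric, so store both orders.
        self.relation[(a.name, b.name)] = sep
        self.relation[(b.name, a.name)] = sep

    def delta(self, a: Block, b: Block) -> Optional[Separator]:
        """Return the separator between a and b, or None (NULL)."""
        return self.relation.get((a.name, b.name))


b1 = Block("B1", 0, 0, 800, 100)
b2 = Block("B2", 0, 120, 800, 400)
b3 = Block("B3", 0, 540, 800, 60)
s1 = Separator("horizontal", 5)
s2 = Separator("horizontal", 5)
page = Page([b1, b2, b3], [s1, s2])
page.separate(b1, b2, s1)
page.separate(b2, b3, s2)
print(page.delta(b1, b2) is not None)  # True: B1 and B2 are adjacent
print(page.delta(b1, b3) is None)      # True: B2 lies between B1 and B3
```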
The VIPS algorithm divides the webpage into independent blocks. Every two adjacent blocks are separated by a separator in $S$, and the coordinate information of each block is also available. Figure 1 and Figure 2 show an example.

Figure 2. Blocks and Separators in Source Webpage (blocks labeled B1 to B5)

III. INTRODUCTION OF OUR METHOD

Our method is based on the VIPS algorithm and overcomes the shortcoming of the method based on DOM tree analysis. The traditional method throws some sentences of the content away mainly because of its local analysis strategy, as the example in Figure 3 shows.

Figure 1. Source Webpage

Figure 3. Lost Sentences Example


In this example, the text in the blue circle contains a hyperlink. Traditional DOM tree methods use the text-link ratio (the ratio of the length of the text in a node to the length of the hyperlink text in that node) to judge whether the node is a text node. As a result, the text in the blue circle is treated as a pure link that carries no content, and it is wrongly thrown away.
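To make the text-link ratio concrete, here is a minimal sketch that computes it for a single HTML fragment using Python's standard `html.parser` module. The function name `text_link_ratio`, the character-count measure of length, and the treatment of link-free nodes (ratio of infinity) are assumptions made for this illustration; the paper does not fix these details or a decision threshold.

```python
from html.parser import HTMLParser


class _TextLinkCounter(HTMLParser):
    """Counts all text characters and those that sit inside <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.text_len = 0
        self.link_len = 0
        self._a_depth = 0  # greater than zero while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_len += n
        if self._a_depth:
            self.link_len += n


def text_link_ratio(fragment: str) -> float:
    """Length of all text in the node divided by length of its link text."""
    counter = _TextLinkCounter()
    counter.feed(fragment)
    counter.close()
    if counter.link_len == 0:
        return float("inf")  # assumption: no link text means pure text
    return counter.text_len / counter.link_len


# A sentence that consists entirely of a hyperlink has ratio 1.0, so a
# threshold on this ratio would discard it even if it is body content.
print(text_link_ratio('<p><a href="#">Read more</a></p>'))  # 1.0
```

A node whose whole sentence is a hyperlink therefore looks the same as a navigation link to this local test, which is the failure case described above.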
In our method, we use the VIPS algorithm to overcome this problem and improve the performance of webpage content extraction. Because VIPS divides the webpage into semantic blocks, it provides a whole view of the webpage and the position information of each block. In order to recall the sentences that are thrown away, we keep the DOM tree node tags when using the traditional method to extract the content. The steps are as follows.
1. Use VIPS to divide the webpage into several blocks, and keep the coordinate information of each block and the node tags in each block.
2. Use the traditional method to extract the content of the webpage, and keep the HTML tag information of each content node.
3. Use the coordinate information of each block to determine which blocks should be content blocks.
4. Map the extracted content node tag sequence to the content blocks according to the node tags and the content itself. If some node tags in a content block do not appear in the extracted content node tag sequence, recall those nodes and the text in them.
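Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the rule for choosing content blocks (a block counts as content when it spans at least a given fraction of the page width) is a hypothetical stand-in, since the paper does not specify how the coordinates are used, and nodes are identified by hypothetical string ids rather than real DOM node tags.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class VisualBlock:
    """Hypothetical VIPS block: coordinates plus the nodes it contains."""
    x: int
    y: int
    width: int
    height: int
    nodes: Dict[str, str]  # node id -> text of that node, in page order


def is_content_block(block: VisualBlock, page_width: int,
                     min_width_ratio: float = 0.4) -> bool:
    # Assumed heuristic for step 3: wide blocks are treated as content.
    return block.width >= min_width_ratio * page_width


def recall_lost_nodes(blocks: List[VisualBlock], extracted_ids: Set[str],
                      page_width: int) -> Dict[str, str]:
    """Step 4: return nodes in content blocks that the extractor missed."""
    recalled: Dict[str, str] = {}
    for block in blocks:
        if not is_content_block(block, page_width):
            continue
        for node_id, text in block.nodes.items():
            if node_id not in extracted_ids:
                recalled[node_id] = text
    return recalled


nav = VisualBlock(0, 100, 180, 800, {"n1": "Home"})
main = VisualBlock(200, 100, 600, 800, {
    "p1": "First paragraph.",
    "p2": "Linked sentence.",
    "p3": "Last paragraph.",
})
# The traditional extractor kept p1 and p3 but dropped the linked sentence.
print(recall_lost_nodes([nav, main], {"p1", "p3"}, page_width=1000))
# {'p2': 'Linked sentence.'}
```

The navigation block is skipped because it is too narrow under the assumed rule, so its link is not recalled; only the node missing from the main block is restored.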
Because the VIPS algorithm provides, to some extent, a whole view of the webpage, our method can use this information to supervise the process of content extraction and recall the lost sentences. In effect, the content is extracted as a whole to a certain extent. In the experiment we conducted, almost all the sentences lost by the traditional method were recalled using our method.
IV. CONCLUSION

In this paper, we present a method that uses the VIPS algorithm to improve the performance of webpage content extraction. Our method overcomes the shortcoming of the traditional method based on DOM tree analysis: to some extent, it extracts the content from a global view of the page rather than a local one. It makes full use of the webpage layout information to guide the process of content extraction. By recalling the sentences that the traditional method throws away, it improves the performance of the traditional method greatly.

REFERENCES
[1] Adelberg, B., NoDoSE: A tool for semiautomatically extracting structured and semistructured data from text documents, In Proceedings of ACM SIGMOD Conference on Management of Data, 1998, pp. 283-294.
[2] Ashish, N. and Knoblock, C. A., Semi-Automatic Wrapper Generation for Internet Information Sources, In Proceedings of the Conference on Cooperative Information Systems, 1997, pp. 160-169.
[3] Ashish, N. and Knoblock, C. A., Wrapper Generation for Semi-structured Internet Sources, SIGMOD Record, Vol. 26, No. 4, 1997, pp. 8-15.
[4] Embley, D. W., Jiang, Y., and Ng, Y.-K., Record-boundary discovery in Web documents, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia PA, 1999, pp. 467-478.
[5] Valter Crescenzi, Giansalvatore Mecca. RoadRunner: Towards Automatic Data Extraction from Large WebSite [A]. In Proceedings of the 26th International Conference on Very Large Database Systems [C], 2001: 109-118.
[6] Chakrabarti, S., Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction, In the 10th International World Wide Web Conference, 2001.
[7] Shian-Hua Lin, Jan-Ming Ho: Discovering informative content blocks from Web documents. KDD 2002: 588-593.
[8] Suhit Gupta, Gail E. Kaiser, David Neistadt, Peter Grimm: DOM-based content extraction of HTML documents. WWW 2003: 207-214.
[9] Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma: Extracting Content Structure for Web Pages Based on Visual Representation. APWeb 2003: 406-417.
[10] ,. DOM Web [J]. , 2002, 25(5): 128
[11] ,,. [J]. , 2004(16): 129-132
[12] ,,. Ontology Web [J]. , 2004, 27(3): 310-317
