I. INTRODUCTION
II.
δ = B × B → S ∪ {NULL}.
For example, suppose Bi and Bj are two objects in B. Then
δ(Bi, Bj) ≠ NULL indicates that Bi and Bj are separated
exactly by the separator δ(Bi, Bj); in other words, the two
objects are adjacent to each other. Otherwise, there are
other objects between the two blocks Bi and Bj.
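The block-to-separator relation described above can be sketched as a small lookup structure. This is only an illustration: the class names `Block`, `Separator` and `PageStructure`, the method names `delta` and `adjacent`, and the coordinate fields are assumptions made for this sketch and are not taken from any VIPS implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Block:
    """A visual block with assumed coordinate fields (hypothetical layout)."""
    name: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Separator:
    """A separator between two blocks (hypothetical representation)."""
    name: str


class PageStructure:
    """Stores the relation B x B -> S U {NULL} as an unordered-pair lookup."""

    def __init__(self) -> None:
        self._relation: Dict[FrozenSet[str], Separator] = {}

    def add_separator(self, a: Block, b: Block, sep: Separator) -> None:
        # The relation is symmetric, so the pair is stored without order.
        self._relation[frozenset((a.name, b.name))] = sep

    def delta(self, a: Block, b: Block) -> Optional[Separator]:
        # None plays the role of NULL: no single separator lies between a and b.
        return self._relation.get(frozenset((a.name, b.name)))

    def adjacent(self, a: Block, b: Block) -> bool:
        # Two blocks are adjacent exactly when the relation is not NULL.
        return self.delta(a, b) is not None
```

With three stacked blocks B1, B2 and B3 and one separator registered for each neighbouring pair, `adjacent(B1, B2)` holds while `adjacent(B1, B3)` does not, because B2 lies between them.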
The VIPS algorithm divides the webpage into a number of
independent blocks. Any two blocks are separated by
separators in S, and the coordinate information of each
block is also available, as in the example of Figure 1
and Figure 2 below.
[Figures 1 and 2: example webpage divided into blocks B1 to B5]
III.
In this example, the text in the blue circle contains a
hyperlink. Traditional DOM tree methods use the text-link
ratio (the ratio of the length of the text in a node to the
length of the hyperlink text in that node) to judge whether
a node is a text node. As a result, the text in the blue
circle is always treated as a pure link that carries no
meaning, and it is wrongly thrown away as well.
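The failure described above can be reproduced with a minimal sketch of a text-link test. The class `LinkTextCounter`, the functions `link_text_share` and `looks_like_link_node`, and the 0.5 threshold are assumptions for illustration and do not come from any particular extractor. The sketch also uses the share of hyperlink text in all node text (the inverse form of the ratio, bounded between 0 and 1) so that nodes without links do not cause a division by zero.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts all visible text and the part of it that lies inside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.text_len = 0
        self.link_len = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_len += n
        if self._anchor_depth > 0:
            self.link_len += n


def link_text_share(fragment: str) -> float:
    """Share of hyperlink text in all text of the fragment (0.0 if no text)."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    if counter.text_len == 0:
        return 0.0
    return counter.link_len / counter.text_len


def looks_like_link_node(fragment: str, threshold: float = 0.5) -> bool:
    """Assumed rule: a node dominated by hyperlink text is treated as a link."""
    return link_text_share(fragment) > threshold
```

Under this rule, a sentence whose node consists entirely of anchor text, such as `<span><a href="#">Read the full statement</a></span>`, is classified as a link node and dropped, even when it sits in the middle of the article body.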
In our method, we use the VIPS algorithm to overcome
this problem and improve the performance of webpage
content extraction. Because VIPS divides the webpage into
semantic blocks, it provides a whole view of the webpage
together with the position information of each block. In
order to recall the sentences that are thrown away, we
keep the DOM tree node tags when using the traditional
method to extract the content. The steps are as follows.
1. Use VIPS to divide the webpage into several
blocks, and keep the coordinate information of each
block and the node tags in each block.
2. Use the traditional method to extract the content of
the webpage, and keep the HTML tag information of
each content node.
3. Use the coordinate information of each block to
determine which blocks are content blocks.
4. Map the extracted content node tag sequence to the
content blocks according to the node tags and the
content itself. If some node tags in a content block
do not appear in the extracted content node tag sequence,
recall those nodes and the text in them.
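The recall in step 4 can be sketched as a comparison between the nodes of a content block and the nodes kept by the traditional extractor. This is a minimal sketch under stated assumptions: the `Node` type, the tag-path strings, and the functions `recall_lost_nodes` and `merged_content` are hypothetical, and the content block is assumed to have been selected already in step 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """A DOM node inside a block, identified by its tag and its text."""
    tag: str   # assumed form: a tag path such as "div/p[2]"
    text: str


def recall_lost_nodes(content_block: List[Node], extracted: List[Node]) -> List[Node]:
    """Return the nodes of the content block that the extractor dropped."""
    # Match on both the node tag and the content itself.
    seen = {(n.tag, n.text) for n in extracted}
    return [
        n for n in content_block
        if n.text.strip() and (n.tag, n.text) not in seen
    ]


def merged_content(content_block: List[Node], extracted: List[Node]) -> str:
    """Join extracted and recalled text in the order of the content block."""
    recalled = set(recall_lost_nodes(content_block, extracted))
    seen = {(n.tag, n.text) for n in extracted}
    kept = [
        n.text for n in content_block
        if n in recalled or (n.tag, n.text) in seen
    ]
    return "\n".join(kept)
```

For a content block with nodes `p[1]`, `p[2]` and `p[3]`, where the extractor kept only `p[1]` and `p[3]`, `recall_lost_nodes` returns the `p[2]` node, and `merged_content` restores its text between the other two.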
Because the VIPS algorithm gives a whole view of the
webpage to some extent, our method can make full use of
this information to supervise the content extraction
process and recall the lost sentences. To a certain extent,
the content is extracted as a whole. The experiment we
conducted also shows that almost all the sentences lost by
the traditional method are recalled by our method.
IV. CONCLUSION