A Web Page Topic Segmentation Algorithm Based On Visual Criteria and Content Layout

Idir Chibane
Supelec
3, rue Joliot-Curies
91192 Gif-sur-Yvette Cedex (France)
(++33) (0)1 69 85 14 77
Idir.Chibane@supelec.fr

Bich-Liên Doan
Supelec
3, rue Joliot-Curies
91192 Gif-sur-Yvette Cedex (France)
(++33) (0)1 69 85 14 76
Bich-Lien.Doan@supelec.fr

ABSTRACT
This paper presents experiments with a web page topic segmentation algorithm that show a significant precision improvement in the retrieval of documents from the Web track corpus of TREC 2001. Instead of processing the whole document, a web page is segmented into different semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings <H1>~<H6> and paragraphs <P>). We conclude that combining visual and content layout criteria gives the best results for increasing precision: the ranking of a page for a query is calculated from the relevant segments of the page produced by the segmentation algorithm.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods, Indexing methods; H.3.3 [Information Storage and Retrieval]

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Segmentation, topic analysis, evaluation, block coherence.

1. INTRODUCTION
Most information retrieval systems on the Web process web pages as the smallest, indivisible units of information, but a web page as a whole may not be appropriate to represent a single topic. A web page usually contains various contents that are not all related to the same topic. Moreover, a web page often contains multiple topics that are not necessarily semantically linked to each other. Therefore, detecting the semantic content structure of a web page could potentially improve the performance of web information retrieval. Many web applications can use the semantic content structure of web pages to improve information retrieval. Previous work uses ad hoc methods to deal with different types of web pages. If we can obtain the semantic content structure of a web page, wrappers can be built more easily and information can be extracted more easily.

2. RELATED WORKS
A straightforward approach for segmenting web pages is to use tag information. Usually, a small set of tags serves as segment delimiters. In [4], four types of tags, <P>, <TABLE>, <LI>/<UL> and <H1>~<H6>, are used to detect four major types of segments: paragraphs, tables, lists and headings, respectively. [4] treats segments of web pages in a learning-based web query processing system and deals with these major types of tags. In [6], only the <TABLE> tag is used to partition a page into several blocks: a <TABLE> and its offspring are treated as a content block, and an entropy-based approach is used to discover the informative ones. Similarly, in [1], several simple tags, such as <P>, <TABLE> and <UL>, are chosen to divide the web page for subsequent conversion and summarization. A Function-based Object Model (FOM) is proposed in [3]. FOM attempts to understand the author's intention by identifying object functions. The VIPS (VIsion-based Page Segmentation) algorithm [2] extracts the semantic structure of a web page. In VIPS, a tree structure is used to model the page. Each node corresponds to a block in the page and has a value that indicates its Degree of Coherence. The DOM tree is analyzed from root to leaves, and the DOM nodes are divided based on their spatial layout and visual cues. [5] describes an HTML web page segmentation algorithm that is applied to segment online medical journal articles. In [5], the web page content is modelled by a zone tree structure based on the geometric layout of the web page.

3. TOPIC SEGMENTATION
A single web page often contains multiple semantics, and the different parts of a web page have different importance in that page. We suppose that there are two types of Web pages: single-topic Web pages and multiple-topic Web pages. The contents of a single-topic Web page are homogeneous, while multiple-topic Web pages are divided into several blocks of homogeneous content. The textual contents of a page follow the sequential organization of its topics. Topic analysis is based on boundary delimitation. In the case of flat text, we distinguish two types of basic units: the sentence, which is made up of a fixed number of words, and the paragraph. However, in the case of the World Wide Web, a document is composed of textual contents and HTML structure. Page authors use both visual criteria, such as horizontal lines, vertical lines and colors, and content layout tags, such as headings <H1>~<H6>, paragraphs <P> and tables <TABLE>, to separate possible segments into different topics. The way of separating differs from one author to another, and visual criteria and content layout tags are not always used as segment delimiters, which is a major obstacle to segmenting Web pages. Consequently, the criteria for delimiting segments are arbitrary and do not follow specific rules. We propose a solution for Web page segmentation based on the evaluation of several candidate segmentations using a topic analysis method.

The topic segmentation algorithm based on visual criteria and content layout proceeds as follows. From each web page of our collection (the TREC collection), we extract different segments by using the various segment delimiters that appear both in the page and in our predefined criteria list. One segmentation solution is generated per criterion, so the result is a set of segmentation solutions.
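
The generation of one candidate segmentation per criterion can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the `CRITERIA` table, the class `_Splitter` and the functions `segment` and `candidate_segmentations` are hypothetical names, and only tag-based delimiters are handled (visual criteria such as colors would need rendering information that this sketch does not model).

```python
from html.parser import HTMLParser

# Hypothetical criteria list: the paper does not enumerate its exact contents.
CRITERIA = {
    "horizontal_line": {"hr"},
    "heading": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "paragraph": {"p"},
    "table": {"table"},
}


class _Splitter(HTMLParser):
    """Collects page text and opens a new block at each delimiter start tag."""

    def __init__(self, delimiters):
        super().__init__()
        self.delimiters = delimiters
        self.blocks = [[]]
        self.hits = 0  # number of delimiter tags seen in the page

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> arrives as "table".
        if tag in self.delimiters:
            self.hits += 1
            self.blocks.append([])

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks[-1].append(text)


def segment(html, delimiters):
    """Return (blocks, hits): non-empty text blocks and the delimiter count."""
    parser = _Splitter(delimiters)
    parser.feed(html)
    parser.close()
    return [" ".join(b) for b in parser.blocks if b], parser.hits


def candidate_segmentations(html, criteria=CRITERIA):
    """One segmentation per criterion whose delimiter appears in the page."""
    solutions = {}
    for name, tags in criteria.items():
        blocks, hits = segment(html, tags)
        if hits:  # skip criteria that are absent from this page
            solutions[name] = blocks
    return solutions
```

For a page such as `<h1>Cats</h1><p>Cats purr.</p><hr><h1>Cars</h1><p>Cars drive.</p>`, this sketch produces three candidate solutions (heading, paragraph and horizontal line) and none for tables, since the page contains no <TABLE> tag.
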
The evaluation function is then applied to each segmentation solution. The best segmentation solution is selected and the block index is created. The goal is to find a segmentation based on visual criteria (lines, colors) and content presentation (paragraphs, subtitles) that extracts blocks whose contents are internally coherent and that are far apart from one another. Our contribution, compared with the segmentation algorithms reviewed above, consists in dividing web pages into topic segment units: our topic segmentation algorithm is a method for partitioning Web pages into coherent segment units that correspond to a sequence of subtopical passages. The algorithm assumes that a given set of words is used during the discussion of a given subtopic and that, when the subtopic changes, a significant proportion of the vocabulary changes as well. With our evaluation function of segmentation solutions, we keep only the candidate delimiters that are useful for segmenting a Web page and eliminate noisy HTML tags.

4. SEGMENTATION EVALUATION
The segmentation evaluation function is calculated from two measures: the content coherence of each block and the distance between blocks. The first measure is applied inside a segment and depends on the co-occurrence values between terms belonging to the same segment. A block's content coherence reflects the density of the information linked to one topic and the degree of correlation between the terms of the block. The second measure is based on a similarity measure between two segment vectors. The distance between adjacent blocks allows us to place boundaries between dissimilar neighbouring blocks. The evaluation function is defined as follows:

\[
SegmEvalFunct(S_j, P) = \left( \frac{\sum_{1 \le i \le nb(P,S_j)} Coh(b_i)}{nb(P,S_j)} \right) \times \left( \frac{\sum_{1 \le k \le nb(P,S_j)-1} Dist(b_k, b_{k+1})}{nb(P,S_j) - 1} \right)
\]

where $S_j$ is a segmentation solution of the page $P$ based on the visual criterion $j$, and $nb(P,S_j)$ is the number of blocks extracted from $P$ according to the segmentation solution $S_j$. The best segmentation solution is the one with the greatest value of the function. $Coh(b)$ is the coherence inside the block $b$, calculated as follows:

\[
Coh(b) = \frac{1}{nt(b)^2} \sum_{t_i \in b} \sum_{t_j \in b} Cooccurrence(t_i, t_j)
\]

with

\[
Cooccurrence(t_i, t_j) = \frac{Nbdoc(t_i, t_j)}{Nbdoc(t_i) + Nbdoc(t_j) - Nbdoc(t_i, t_j)}
\]

where $Nbdoc(t_1, .., t_n)$ is the number of documents containing all the terms $t_1, .., t_n$ and $nt(b)$ is the number of terms of the block $b$. $Dist(b_k, b_{k+1})$ is a distance measure between adjacent blocks, defined as follows:

\[
Dist(b_k, b_{k+1}) = \frac{1}{Sim(V_{b_k}, V_{b_{k+1}})} = \frac{1}{\cos(V_{b_k}, V_{b_{k+1}})} = \frac{\sqrt{\sum_{l=1}^{n} w_{l,b_k}^2 \times \sum_{l=1}^{n} w_{l,b_{k+1}}^2}}{\sum_{l=1}^{n} w_{l,b_k} \times w_{l,b_{k+1}}}
\]

where $V_{b_k}$ and $V_{b_{k+1}}$ are the block vectors of $b_k$ and $b_{k+1}$ respectively. The weight $w_{l,b}$ of each term $l$ in a block $b$ is calculated using the Okapi BM25 measure.

5. EMPIRICAL EVALUATIONS
Our experiments are based on the Web track of TREC 2001. We used the Okapi BM25 measure in our ranking function. We compare two categories of algorithms, DocRank and BlockRank: DocRank(P) is the BM25 score of the page P, and BlockRank(P) is the highest BM25 score among the blocks of the page P. Table 1 shows that BlockRank performs better than DocRank on MAP, P@5 and P@10 on the TREC collection: on WT10g, it achieves improvements of 58%, 75% and 57% over DocRank on MAP, P@5 and P@10 respectively.

Table 1. MAP, P@5 and P@10 comparison

                                DocRank   BlockRank
MAP (Mean Average Precision)    0.133     0.2112
P@5                             0.18      0.316
P@10                            0.172     0.27

6. CONCLUSION
In this paper, we proposed a topic segmentation method that extracts semantic blocks from Web pages using visual criteria and content presentation HTML tags. The topic segmentation algorithm partitions Web pages into coherent segment units that correspond to a sequence of subtopical passages. We performed experimental evaluations of our algorithm using the information retrieval test collection of TREC 9. We found that the proposed web page topic segmentation algorithm improves information retrieval by indexing documents more precisely and by subdividing texts into thematically coherent segments. Our topic segmentation method gives a better estimate of relevance with respect to the query.

7. REFERENCES
[1] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., Accordion Summarization for End-Game Browsing on PDAs and Cellular Phones, Proc. of Conference on Human Factors in Computing Systems, 2001.
[2] Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y., Extracting Content Structure for Web Pages Based on Visual Representation, Proc. of 5th Asia Pacific Web Conference, 2003.
[3] Chen, J., Zhou, B., Shi, J., Zhang, H., and Wu, Q., Function-Based Object Model towards Website Adaptation, Proc. of 10th International World Wide Web Conference, 2001.
[4] Diao, Y., Lu, H., Chen, S., and Tian, Z., Toward Learning Based Web Query Processing, Proc. of International Conference on Very Large Databases, pp. 317-328, 2000.
[5] Jie, Z.D.L., and George, R.T., Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation, JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA.
[6] Lin, S.-H., and Ho, J.-M., Discovering Informative Content Blocks from Web Documents, Proc. of ACM SIGKDD, 2002.
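
As an illustration of the evaluation function defined in Section 4, the following Python sketch computes the block coherence, the distance between adjacent blocks and the overall score of a segmentation solution. It is a sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, blocks are given as lists of terms, the collection `docs` is a list of term sets, nt(b) is taken to be the number of distinct terms in a block, and raw term frequencies stand in for the Okapi BM25 weights. The handling of blocks that share no terms, and of pages with a single block, is also an assumption, since the paper does not discuss these cases.

```python
import math
from collections import Counter


def cooccurrence(ti, tj, docs):
    """Nbdoc(ti,tj) / (Nbdoc(ti) + Nbdoc(tj) - Nbdoc(ti,tj)) over term sets."""
    n_i = sum(1 for d in docs if ti in d)
    n_j = sum(1 for d in docs if tj in d)
    n_ij = sum(1 for d in docs if ti in d and tj in d)
    denom = n_i + n_j - n_ij
    return n_ij / denom if denom else 0.0


def coherence(block, docs):
    """Coh(b): mean co-occurrence over all ordered pairs of terms of the block."""
    terms = set(block)  # assumption: nt(b) counts distinct terms
    if not terms:
        return 0.0
    total = sum(cooccurrence(a, b, docs) for a in terms for b in terms)
    return total / len(terms) ** 2


def distance(block_a, block_b):
    """Dist = 1 / cosine similarity of the two block vectors."""
    # Assumption: raw term frequency replaces the Okapi BM25 weights.
    va, vb = Counter(block_a), Counter(block_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    if dot == 0:
        return math.inf  # assumption: blocks sharing no term are infinitely far
    norm = math.sqrt(sum(w * w for w in va.values())
                     * sum(w * w for w in vb.values()))
    return norm / dot


def segm_eval(blocks, docs):
    """Mean block coherence multiplied by mean distance of adjacent blocks."""
    nb = len(blocks)
    if nb < 2:
        return 0.0  # assumption: a single block has no adjacent pair to score
    mean_coh = sum(coherence(b, docs) for b in blocks) / nb
    if mean_coh == 0.0:
        return 0.0  # avoids 0 * inf when adjacent blocks share no terms
    pairs = zip(blocks, blocks[1:])
    mean_dist = sum(distance(a, b) for a, b in pairs) / (nb - 1)
    return mean_coh * mean_dist


def best_segmentation(solutions, docs):
    """Name of the criterion whose segmentation has the greatest score."""
    return max(solutions, key=lambda name: segm_eval(solutions[name], docs))
```

Given a dictionary that maps each criterion to its list of tokenized blocks, `best_segmentation` returns the criterion to keep. With BM25 weights in place of term frequencies, `distance` would follow the formula of Section 4 more closely.
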
