Idir Chibane
Supelec
3, rue Joliot-Curies
91192 Gif-sur-Yvette Cedex (France)
(++33) (0)1 69 85 14 77
Idir.Chibane@supelec.fr

Bich-Liên Doan
Supelec
3, rue Joliot-Curies
91192 Gif-sur-Yvette Cedex (France)
(++33) (0)1 69 85 14 76
Bich-Lien.Doan@supelec.fr

ABSTRACT
This paper presents experiments with a web page topic segmentation algorithm that show a significant precision improvement in the retrieval of documents from the Web track corpus of TREC 2001. Instead of processing the whole document, a web page is segmented into different semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings <H1>~<H6> and paragraphs <P>). We conclude that combining visual and content layout criteria gives the best results for increasing precision: the rank of a page for a query is computed from the relevant segments produced by the segmentation algorithm.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods, Indexing methods; H.3.3 [Information Storage and Retrieval]

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Segmentation, topic analysis, evaluation, block's coherence.

1. INTRODUCTION
Most information retrieval systems on the Web process web pages as the smallest, indivisible units of information, but a web page as a whole may not be appropriate to represent a single topic. A web page usually contains various contents which are not all related to the same topic. Moreover, a web page often contains multiple topics that are not necessarily semantically linked to each other. Therefore, detecting the semantic content structure of a web page could potentially improve the performance of web information retrieval. Many web applications can use the semantic content structure of web pages to improve information retrieval. Previous work uses ad hoc methods to deal with different types of web pages. If we can obtain the semantic content structure of a web page, wrappers can be built more easily and information can be extracted more easily.

2. RELATED WORKS
A straightforward approach for segmenting web pages is to use tag information. Usually, a small set of tags serves as segment delimiters. In [4], four types of tags, <P>, <TABLE>, <LI>/<UL> and <H1>~<H6>, are used to detect four major types of segments: paragraphs, tables, lists and headings, respectively. [4] treats segments of web pages in a learning-based web query processing system and deals with these major types of tags. In [6], only the <TABLE> tag is used to partition a page into several blocks: each <TABLE> and its offspring are treated as a content block, and an entropy-based approach is used to discover the informative ones. Similarly, in [1], several simple tags, such as <P>, <TABLE> and <UL>, are chosen to divide the web page for subsequent conversion and summarization. A Function-based Object Model (FOM) is proposed in [3]. FOM attempts to understand the author's intention by identifying object functions. The VIPS (VIsion-based Page Segmentation) algorithm [2] extracts the semantic structure of a web page. In VIPS, a tree structure is used to model the page. Each node corresponds to a block in the page and has a value indicating its Degree of Coherence. The DOM tree is analyzed from root to leaves and the DOM nodes are divided based on their spatial layout and visual cues. [5] describes an HTML web page segmentation algorithm which is applied to segment online medical journal articles. In [5], the web page content is modelled by a zone tree structure based on the geometric layout of the web page.

3. TOPIC SEGMENTATION
A single web page often contains multiple semantics, and the different parts of the web page have different importance in that page. We suppose that there are two types of Web pages: single-topic Web pages and multiple-topic Web pages. The contents of a single-topic Web page are homogeneous, while multiple-topic Web pages are divided into several blocks of homogeneous contents. The textual contents of a page follow the sequential organization of the topics. Topic analysis is based on boundary delimitation. In the case of flat texts, we distinguish two types of basic units: the sentence, which is made up of a fixed number of words, and the paragraph. However, in the case of the World Wide Web, a document is composed of textual contents and HTML structure. Page authors use both visual criteria, such as horizontal lines, vertical lines and colors, and the content layout of the page, such as heading <H1>~<H6>, paragraph <P> and table <TABLE> tags, in order to separate possible segments into different topics. The mode of separation differs from one author to another, and the visual criteria and content layout tags are not used consistently as segment delimiters, which is a major obstacle to segmenting Web pages. Consequently, the criteria for delimiting segments are arbitrary and do not follow specific rules. We propose a solution for Web page segmentation based on the evaluation of several segmentations using a topic analysis method.

The topic segmentation algorithm based on visual criteria and content layout works as follows. From each web page of our collection (the TREC collection), we extract different segments by using the various segment delimiters that appear both in the page and in our predefined criteria list. One solution is generated per criterion, so the result is a set of segmentation solutions. The evaluation function is then applied to each segmentation solution. The best segmentation solution is selected and the block index is created. The goal is to find a segmentation based on visual criteria (lines, color) and content presentation (paragraphs, subtitles) that extracts blocks which are internally coherent and for which the distance between blocks is large. Our contribution compared to the segmentation algorithms reviewed above consists in dividing web pages into topic segment units. Our topic segmentation algorithm is a method for partitioning Web pages into coherent segment units that correspond to a sequence of subtopical passages. The algorithm assumes that a given set of words is used during the course of a given subtopic discussion and that, when the subtopic changes, a significant proportion of the vocabulary changes as well. With our evaluation function for segmentation solutions, we keep only candidate delimiters for segmenting a Web page by eliminating noisy HTML tags.

4. SEGMENTATION EVALUATION
The segmentation evaluation function is calculated from two measures: the content coherence of the blocks and the distance between the blocks. The first measure is applied inside a segment and depends on the co-occurrence value between terms belonging to the same segment. The content coherence of a block reflects the density of the information linked to one topic and the degree of correlation between the terms of the block. The second measure is based on a similarity measure between two segment vectors. The distance between adjacent blocks allows us to place boundaries between dissimilar neighbouring blocks. The evaluation function is defined as follows:

$$\mathrm{SegmEvalFunct}(S_j, P) = \frac{\sum_{1 \le i \le nb(P,S_j)} \mathrm{Coh}(b_i)}{nb(P,S_j)} \times \frac{\sum_{1 \le k \le nb(P,S_j)-1} \mathrm{Dist}(b_k, b_{k+1})}{nb(P,S_j) - 1}$$

where S_j is a segmentation solution of the page P based on the visual criterion j, and nb(P,S_j) is the number of blocks extracted from P according to the segmentation solution S_j. The best segmentation solution is the one with the largest value of the function. Coh(b) is the coherence inside the block b, which is calculated as follows:

$$\mathrm{Coh}(b) = \frac{1}{nt(b)^2} \sum_{t_i \in b} \sum_{t_j \in b} \mathrm{Cooccurrence}(t_i, t_j)$$

with

$$\mathrm{Cooccurrence}(t_i, t_j) = \frac{\mathrm{Nbdoc}(t_i, t_j)}{\mathrm{Nbdoc}(t_i) + \mathrm{Nbdoc}(t_j) - \mathrm{Nbdoc}(t_i, t_j)}$$

where Nbdoc(t_1,..,t_n) is the number of documents containing all the terms t_1,..,t_n and nt(b) is the number of terms of the block b. Dist(b_k,b_{k+1}) is a distance measure between adjacent blocks. This measure is defined as follows:

$$\mathrm{Dist}(b_k, b_{k+1}) = \frac{1}{\mathrm{Sim}(V_{b_k}, V_{b_{k+1}})} = \frac{1}{\cos(V_{b_k}, V_{b_{k+1}})} = \frac{\sqrt{\sum_{l=1}^{n} w_{l,b_k}^2 \times \sum_{l=1}^{n} w_{l,b_{k+1}}^2}}{\sum_{l=1}^{n} w_{l,b_k} \times w_{l,b_{k+1}}}$$

where V_{b_k} and V_{b_{k+1}} are the block vectors of b_k and b_{k+1} respectively. The weight of each term is calculated using the Okapi BM25 measure.

5. EMPIRICAL EVALUATIONS
Our experiments are based on the Web Track of TREC 2001. We used the Okapi BM25 measure in our ranking function. We compare two categories of algorithms, DocRank and BlockRank. DocRank(P) is the BM25 score of the page P, and BlockRank(P) is the highest BM25 score among the blocks of the page P. Table 1 shows that BlockRank performs better than DocRank on MAP, P@5 and P@10 on the TREC collection: it achieves improvements of 58%, 75% and 57% over the DocRank algorithm on MAP, P@5 and P@10, respectively, on WT10g.

Table 1. MAP, P@5 and P@10 comparison

                                  DocRank    BlockRank
MAP (Mean Average Precision)      0.133      0.2112
P@5                               0.18       0.316
P@10                              0.172      0.27

6. CONCLUSION
In this paper, we proposed a topic segmentation method which allows us to extract semantic blocks from Web pages using visual criteria and content presentation HTML tags. The topic segmentation algorithm is a method for partitioning Web pages into coherent segment units that correspond to a sequence of subtopical passages. We performed experimental evaluations of our algorithm using the information retrieval test collection of TREC 9. We found that the proposed web page topic segmentation algorithm improves information retrieval by indexing documents more precisely and by subdividing texts into thematically coherent segments. Our topic segmentation method gives a better estimate of relevance with respect to the query.

7. REFERENCES
[1] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., Accordion Summarization for End-Game Browsing on PDAs and Cellular Phones, Proc. of Conference on Human Factors in Computer Systems, 2001.
[2] Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y., Extracting Content Structure for Web Pages Based on Visual Representation, Proc. of 5th Asia Pacific Web Conference, 2003.
[3] Chen, J., Zhou, B., Shi, J., Zhang, H., and Wu, Q., Function-Based Object Model towards Website Adaptation, Proc. 10th International World Wide Web Conference, 2001.
[4] Diao, Y., Lu, H., Chen, S., and Tian, Z., Toward Learning Based Web Query Processing, Proc. of International Conference on Very Large Databases, pp. 317-328, 2000.
[5] Zou, J., Le, D., and Thoma, G.R., Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation, JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA.
[6] Lin, S.-H., and Ho, J.-M., Discovering Informative Content Blocks from Web Documents, Proc. of ACM SIGKDD, 2002.
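
As an illustration of the segmentation evaluation function of Section 4, the following Python sketch computes Coh(b), Dist(b_k, b_{k+1}) and SegmEvalFunct for one segmentation solution. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: each block is assumed to be a dictionary mapping distinct terms to precomputed weights (the paper uses Okapi BM25 weights, which are not recomputed here), Nbdoc is derived from a hypothetical in-memory inverted index, and the function names are illustrative. The paper does not say how a solution with a single block or two adjacent blocks with no shared term should be scored, so the values returned in those cases are assumptions.

```python
import math
from itertools import product


def make_nbdoc(index):
    """Build Nbdoc from a hypothetical inverted index {term: set of doc ids}."""
    def nbdoc(*terms):
        # Number of documents containing all the given terms.
        postings = [index.get(t, set()) for t in terms]
        return len(set.intersection(*postings)) if postings else 0
    return nbdoc


def cooccurrence(ti, tj, nbdoc):
    """Nbdoc(ti, tj) / (Nbdoc(ti) + Nbdoc(tj) - Nbdoc(ti, tj))."""
    both = nbdoc(ti, tj)
    union = nbdoc(ti) + nbdoc(tj) - both
    return both / union if union else 0.0


def coherence(block, nbdoc):
    """Coh(b): co-occurrence summed over all term pairs, divided by nt(b)^2."""
    terms = list(block)  # distinct terms of the block (dictionary keys)
    if not terms:
        return 0.0
    total = sum(cooccurrence(ti, tj, nbdoc)
                for ti, tj in product(terms, repeat=2))
    return total / len(terms) ** 2


def distance(vk, vk1):
    """Dist(b_k, b_{k+1}) = 1 / cos(V_bk, V_bk+1) for weighted term vectors."""
    dot = sum(w * vk1.get(t, 0.0) for t, w in vk.items())
    if dot == 0:
        return math.inf  # assumption: no shared term means unbounded distance
    norms = math.sqrt(sum(w * w for w in vk.values())
                      * sum(w * w for w in vk1.values()))
    return norms / dot


def segm_eval(blocks, nbdoc):
    """Mean block coherence times mean distance between adjacent blocks."""
    n = len(blocks)
    if n < 2:
        return 0.0  # assumption: the formula divides by nb(P, S_j) - 1
    mean_coh = sum(coherence(b, nbdoc) for b in blocks) / n
    mean_dist = sum(distance(blocks[k], blocks[k + 1])
                    for k in range(n - 1)) / (n - 1)
    return mean_coh * mean_dist
```

Given several candidate solutions for a page, one per criterion, the best one under this sketch would be `max(solutions, key=lambda s: segm_eval(s, nbdoc))`, matching the rule that the solution with the largest value of the function is kept.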