

IEEE Transactions on Consumer Electronics, Vol. 56, No. 2, May 2010

Repetition-based Web Page Segmentation by Detecting Tag Patterns for Small-Screen Devices
Jinbeom Kang, Jaeyoung Yang, Nonmember and Joongmin Choi, Member, IEEE
Abstract: Web page segmentation into logical blocks is an important preprocessing step for recognizing the informative content blocks in a page, which leads to efficient information extraction and convenient display on devices with small-sized screens. Previous methods for Web page segmentation are not flexible in a dynamic Web environment because they largely rely on heuristic rules generated by exploiting structural tags and visual information inherent in a page. To resolve this problem, this paper proposes a new method of Web page segmentation that recognizes repetitive tag patterns, called key patterns, in the DOM tree structure of a page. We report on the Repetition-based Page Segmentation (REPS) algorithm, which detects key patterns in a page and generates virtual nodes to correctly segment nested blocks. A series of experiments performed on real Web sites showed that REPS greatly improves the correctness of Web page segmentation.1

Index Terms: Web Page Segmentation, REPS, Key Patterns, Information Extraction

Fig. 1. An example of Web page segmentation into logical blocks.

I. INTRODUCTION

Web page segmentation is a method of dividing a Web page into logical blocks such that each block contains distinctive content information [1]. Fig. 1 shows an example of page segmentation where the segmented blocks are marked with boxes. This technique can be used as a preprocessing tool for information extraction, classifying the segmented blocks into informative blocks that contain the page's core contents and noise blocks that contain irrelevant information such as menus, advertisements, or copyright statements [2],[3]. Web page segmentation is important because it greatly affects the outcome of recognizing informative blocks. Ultimately, it enhances the performance of information extraction by ignoring noise blocks and focusing only on informative content blocks, and it also facilitates the display of useful information on mobile devices with small-sized screens [4]-[7].
1 Jinbeom Kang was with the Department of Computer Science and Engineering, Hanyang University, Ansan, Korea. He is now with the Mobile Communication Research Lab., LG Electronics, Anyang, Korea (e-mail: midgetfx@gmail.com). Jaeyoung Yang is with the Human-Computer Interaction Institute, School of Computer Science, Carnegie Mellon University, PA 15232 USA (e-mail: jyyang@cs.cmu.edu). Joongmin Choi is with the Department of Computer Science and Engineering, Hanyang University, Ansan, Korea (corresponding author, phone:+82-31-400-5666, Cell: +82-10-2345-4880, FAX: +82-31-409-7351, email: jmchoi@hanyang.ac.kr).

Early research on Web page segmentation was based on the structure of a Web page. This template-based method focused on measuring the structural similarities among the DOM trees of Web pages. A pitfall of this method is that it cannot segment information contained within a single tree node, because it recognizes only the structural information of DOM tree nodes, not their content. It also needs training examples of sufficient quality and quantity to cover real-world situations. To overcome these difficulties, another approach was proposed that uses the visual information in a Web page. This vision-based segmentation method exploits visual clues such as font size, font color, background color, and spaces between paragraphs. Despite some successes, the vision-based method relies heavily on heuristic rules, which makes it difficult to cope with a dynamic Web environment where the structure and visual information of Web pages often change. To resolve these problems, this paper proposes a new method of Web page segmentation based on recognizing repetitive tag patterns in the DOM tree structure of a page. We call these repetitive HTML tag patterns key patterns. This idea comes mainly from the observation that Web designers usually build a Web page with structural and repetitive layouts to present information consistently. Even for personal homepages and article pages that maintain no structural layout, we are still able to segment them by using such patterns.

Contributed Paper. Manuscript received March 16, 2010; current version published June 29, 2010; electronic version published July 6, 2010.

0098-3063/10/$20.00 © 2010 IEEE


The proposed algorithm, named REPS (Repetition-based Page Segmentation), detects key patterns in a Web page and generates virtual nodes to correctly segment nested blocks. REPS proceeds in four phases. First, a Web page is represented by a DOM tree structure after removing less meaningful tags such as <a>, <b>, and <script> from the HTML source of the page. In the second phase, REPS generates a sequence from the DOM tree by using the tags in the child nodes of the root node. The third phase is to find key patterns in the sequence and recognize candidate blocks by matching the sequence against the key patterns. The final phase is to generate blocks in the page by modifying the DOM tree into a more deeply hierarchical structure through the introduction of virtual nodes. Eventually, each virtual node corresponds to a block that is a unit of page segmentation.

A series of experiments has been performed on real Web sites. A performance comparison with a well-known vision-based page segmentation system indicates that REPS improves the correctness of Web page segmentation, which in turn improves the efficiency of information extraction by allowing the system to correctly recognize informative blocks in a Web page.

The rest of the paper is organized as follows: Section II reports on the limitations of previous approaches to Web page segmentation and outlines our solution; Section III describes our technique for detecting repetitions in a sequence; Section IV describes the details of the REPS algorithm for segmenting blocks in a Web page by using key patterns and virtual nodes; Section V reports the results of evaluating REPS by comparing its performance with a well-known vision-based approach; Section VI concludes with a summary and future directions.

II. RELATED WORK

Traditionally, Web page segmentation has been approached in three directions: template-based methods, vision-based methods, and tag-based methods.
The template-based method builds a template with extraction rules expressed as regular expressions [8]-[10]. It collects Web pages from a target site and generates regular-expression rules to extract content blocks by analyzing their common zone; Web pages are then decomposed into blocks according to the generated rules. While simple to use and prone to few errors, the template-based method requires many sample pages from the same domain to build a template. It also has to maintain a template for each domain, since Web pages may not be processed correctly with templates built for different sites. Recently, to alleviate this problem, Chakrabarti proposed a method using a classifier [11]. However, it still needs many Web pages to train the classifier and build a template, which restricts the segmentation.

The vision-based method utilizes visual clues in a Web page. Chen proposed a method that considers visual information such as the height and length of a node's zone, together with separation information such as the <HR> tag [4], [5]. Yang proposed the VIPS (Vision-based Page Segmentation) algorithm, which combines vision information with heuristic rules to identify blocks [12]. Despite its success in many applications, the vision-based method has the problem of maintaining heuristic rules, because the HTML structure of a Web page changes often as the Web grows.

The tag-based method predefines content tags that contain useful information and finds content blocks by measuring the distance between these tags [1], [2], [4], [13]. In particular, Lin assumed that the <TABLE> tag is widely used to structure a Web page, and proposed a method that primarily uses the <TABLE> tag to extract blocks [3]. However, this method cannot be applied to Web pages without <TABLE> tags. To solve this problem, Debnath and Peng considered not only the <TABLE> tag but also the <TR>, <P>, <HR>, and <UL> tags [14]-[16].

To sum up, traditional Web page segmentation methods generally rely on heuristic rules generated by exploiting the hierarchical features of structural tags and visual information inherent in a page. However, in a dynamic Web environment where the structure of a Web page changes often with the introduction of new tags, such heuristic rules cannot analyze Web pages correctly. The rules must be maintained and updated every time a new standard is announced or the structure of a Web site changes. In contrast, we propose a different approach to Web page segmentation, based on detecting repetitive tag patterns, that does not require rule maintenance.

III. REPETITION DETECTION

Repetition detection is a widely used technique for finding patterns in a sequence. In this paper, a repetition is defined as a subsequence of length m (> 1) occurring twice or more in a sequence of length n. According to this definition, the maximum length of a repetition in a sequence of length n is n/2, so a repetition satisfies 1 < m ≤ n/2.
Fig. 2. An example of finding repetitions from a sequence.


For instance, as shown in Fig. 2, we can generate candidate subsequences of length 2 and length 3 from the sequence abcabd of length 6. In this example, the only repetition is ab, as it occurs twice in the sequence. We consider this repeated subsequence a pattern.



Fig. 3. An example of generating repetitions when all elements are the same.

Algorithm 1 Repetitive Pattern Detection (continued)
17:         end if
18:       end for
19:       update(maxAllEqual, allEqual)
20:     end for
21:     candi ← candi ∪ maxAllEqual
22:     rep ← rep ∪ findCandidate(candi)
23:   end for
24:   return rep
25: end function

Note that a sequence whose elements are all the same must be handled as an exception. According to the above definition, for the sequence aaaaa there could be four repetitions of aa, as shown in case 1 of Fig. 3. However, to be a useful pattern in a Web page, a repetition must not overlap with other occurrences in the sequence. From this observation, the number of repetitions of aa in this situation should be 2, as shown in case 2 of Fig. 3. Thus, in generating candidate repetitions, we handle separately the cases where the elements of a sequence are all the same and where they are not. Algorithm 1 gives pseudo-code for repetitive pattern detection on an input sequence list of length n. Since the maximum length of a possible pattern is n/2, we only examine pattern lengths up to half of the input sequence by setting maxRepSize appropriately in line 4. The sublist(list, j) function (line 9) returns the subsequence of the input running from the j-th to the last element of list. The partition(subList, i) function (line 10) returns the consecutive subsequences of subList of length i; the total number of subsequences generated by partition is at most n/i. The update(maxAllEqual, allEqual) function (line 19) assigns to maxAllEqual the all-equal subsequence with the maximum length in the current step. The findCandidate(candi) function (line 22) identifies all repetitions occurring at least twice in the candidate set candi.
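The two helper functions can be rendered in Python as follows; this is an illustrative sketch whose bodies are our assumption, mirroring only the descriptions above:

```python
def sublist(lst, j):
    # Suffix of the input sequence from index j to the last element.
    return lst[j:]

def partition(sub, i):
    # Consecutive, non-overlapping chunks of length i (at most n/i of them).
    return [tuple(sub[k:k + i]) for k in range(0, len(sub) - i + 1, i)]
```

For the sequence abcabd, partition(sublist(list("abcabd"), 0), 3) yields the two chunks (a, b, c) and (a, b, d) that appear in Fig. 2.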

Fig. 4 shows the result of applying Algorithm 1 to an example sequence, where seven repetitions are found.

Example sequence: abcdfabcfabcccccc
Rep:   ab   abc   bc   cc   ccc   fab   fabc
Freq:   3     3    3    3     2     2      2

Fig. 4. A result of repetitive pattern detection by Algorithm 1.
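The counts in Fig. 4 can be checked with a brute-force Python sketch of the definition (quadratic, counting non-overlapping occurrences as in case 2 of Fig. 3; it is not the authors' partition-based implementation):

```python
def find_repetitions(seq):
    # Count non-overlapping occurrences of every subsequence of length >= 2
    # and keep those occurring at least twice.
    n = len(seq)
    reps = {}
    for m in range(2, n // 2 + 1):        # a pattern is at most n/2 long
        for i in range(n - m + 1):
            pat = seq[i:i + m]
            if pat in reps:
                continue                  # already counted
            count, j = 0, 0
            while j <= n - m:             # greedy left-to-right scan
                if seq[j:j + m] == pat:
                    count += 1
                    j += m                # skip past the match: no overlaps
                else:
                    j += 1
            if count >= 2:
                reps[pat] = count
    return reps

reps = find_repetitions("abcdfabcfabcccccc")
# ab, abc, bc, and cc each occur 3 times; ccc, fab, and fabc occur twice
```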

IV. WEB PAGE SEGMENTATION BY USING KEY PATTERNS

In REPS, Web page segmentation using the repetition detection algorithm proceeds in four phases. First, less meaningful tags such as <a>, <b>, <script>, <span>, and #comment nodes are removed from the HTML source of the page. After this preprocessing step, the Web page is represented as a DOM tree whose nodes are either HTML tags or contents consisting of texts and images.

In the second phase, we generate a sequence from the DOM tree of the page using the tags in the child nodes of the root node. For example, from the main page of WordNet2 in Fig. 5, we can get the sequence of child nodes of the <div> node, which is h2 p p ul p p h2 p p p p p p div. This approach of considering only one-depth child nodes, ignoring deeper descendant nodes, has the advantage of reducing computational cost while still preserving some hierarchical features of the DOM tree. It reflects the phenomenon that a parent node often has child nodes with similar semantics. For example, in a page of a shopping mall site, a parent node denoting a product list has children describing specific products. Hence, without considering descendants more than one level below the root, the blocks generated from repetitions remain consistent. In this paper, we generate blocks by using this approach and evaluate informative blocks by counting the number of repetitions.

The third phase is to generate candidate Web page blocks by using the key patterns. A key pattern is a repetitive pattern in a sequence that is longest and most frequent. For example, for the sequence h2 p p ul p p h2 p p p p p p div

Algorithm 1 Repetitive Pattern Detection
 1: function Repetition(list)
 2:   // list: input sequence
 3:   rep ← ∅
 4:   maxRepSize ← (|list| + 1)/2 if |list| mod 2 > 0, |list|/2 otherwise
 5:   for i = maxRepSize to 2 do
 6:     candi ← ∅
 7:     maxAllEqual ← ∅
 8:     for j = 1 to maxRepSize do
 9:       subList ← sublist(list, j)
10:       parts ← partition(subList, i)
11:       allEqual ← ∅
12:       for all seq ∈ parts do
13:         if all elements of seq are the same then
14:           allEqual ← allEqual ∪ {seq}
15:         else
16:           candi ← candi ∪ {seq}

2 http://wordnet.princeton.edu/
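The first two phases (tag filtering and one-depth sequence extraction) can be sketched with Python's html.parser; the class name and the SKIP set are our illustrative choices, and well-formed markup is assumed:

```python
from html.parser import HTMLParser

SKIP = {"a", "b", "script", "span"}  # "less meaningful" tags dropped in phase one

class ChildTagSequence(HTMLParser):
    # Collects the tag sequence of the depth-1 children of a chosen root tag.
    def __init__(self, root):
        super().__init__()
        self.root, self.depth, self.seq = root, 0, []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:                       # already inside the root
            if self.depth == 1 and tag not in SKIP:
                self.seq.append(tag)             # record one-depth child only
            self.depth += 1
        elif tag == self.root:
            self.depth = 1                       # entered the root element

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

p = ChildTagSequence("div")
p.feed("<div><h2>t</h2><p>x</p><p>y</p><ul><li>z</li></ul><p>w</p></div>")
# p.seq is now ["h2", "p", "p", "ul", "p"]
```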


obtained in the previous phase, the repetitions obtained by using Algorithm 1 are [h2, p], [h2, p, p], [p, p], and [p, p, p]. Among these, key patterns will be [h2, p, p] and [p, p, p], since [h2, p] and [p, p] are properly contained in [h2, p, p] and [p, p, p], respectively.
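The containment test used to promote repetitions to key patterns can be sketched as follows (a simplification that keeps only maximal repetitions; the frequency criterion mentioned above is omitted):

```python
def contains(longer, shorter):
    # True if shorter occurs contiguously inside the strictly longer pattern.
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def key_patterns(repetitions):
    # Keep only repetitions not properly contained in a longer repetition.
    keys = []
    for p in sorted(repetitions, key=len, reverse=True):
        if not any(contains(k, p) for k in keys):
            keys.append(p)
    return keys

key_patterns([("h2", "p"), ("h2", "p", "p"), ("p", "p"), ("p", "p", "p")])
# → [("h2", "p", "p"), ("p", "p", "p")], as in the example above
```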

The fourth and final phase of key pattern-based Web page segmentation is to recognize blocks in a page by modifying the flat, one-depth DOM tree into a more hierarchical structure by building virtual nodes. Our intention is that each subtree rooted at a virtual node can be a potential content block. For instance, humans can easily understand that the Web page in Fig. 5 consists of two logical blocks: one with the title About WordNet and the other with News. In its corresponding DOM tree, however, it is hard to recognize the two blocks, since the tags are lined up horizontally under the same parent node.

Fig. 5. A Web page of the WordNet site and corresponding DOM tree.

We use key patterns to build virtual nodes by separating the subsequences containing key patterns. To do this, a key pattern is matched against the given sequence from left to right. When a match is found at position i, the process continues to find the next match in the remaining part of the sequence. If there is another match at position j, the subsequence ranging from i through j-1 is grouped under a new virtual node. For instance, consider the key pattern [h2, p, p] and the sequence h2 p p ul p p h2 p p p p p p div. The first match between the key pattern and the sequence occurs at position 1 and the second match occurs at position 7, separating the two subsequences h2 p p ul p p and h2 p p p p p p div. For each subsequence, we add a virtual node and build a subtree with the virtual node as its root and the elements of the subsequence as its children. Finally, from the two key patterns, we can make three virtual nodes as shown in Fig. 6. Note that key pattern matching is done differently when all elements of a key pattern are the same. In this situation, all subsequences matched with the key pattern are grouped under a single virtual node. For instance, the virtual node v3 in Fig. 6 is generated as a result of matching the key pattern [p p p] against the sequence p p p p p p.

Fig. 6. Generating virtual nodes by using key patterns.

Pseudo-code for generating virtual nodes using key pattern matching is described in Algorithm 2.

Algorithm 2 Matching Key Patterns
function Match(list, rep)
  // list : input sequence
  // rep  : list of key patterns
  mSeq ← ∅
  allEqualSeq ← ∅
  for all pattern ∈ rep do
    innerSeq ← []
    pos ← 0
    cont ← false
    allEqual ← isAllEqual(pattern)
    for i = 0 to |list| do
      if pos > |pattern| - 1 then
        pos ← 0
        cont ← true
      end if
      if list[i] = pattern[pos] then
        if cont = true and allEqual ≠ true then
          cont ← false
          mSeq ← mSeq ∪ {innerSeq}
          innerSeq ← []
        end if
        append(innerSeq, list[i])
        pos ← pos + 1
      else if cont = true and allEqual = true then
        update(allEqualSeq, innerSeq)
        pos ← 0
      else if cont = true and allEqual ≠ true then
        append(innerSeq, list[i])
      else
        pos ← 0


Algorithm 2 Matching Key Patterns (continued)
        innerSeq ← []
      end if
    end if
    end for
    if |innerSeq| > 0 then
      append(mSeq, innerSeq)
    end if
  end for
  append(mSeq, allEqualSeq)
  return mSeq
end function
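For a single non-all-equal key pattern, the grouping performed by Algorithm 2 can be illustrated with a much simpler Python sketch (the all-equal case and multiple patterns are omitted):

```python
def group_blocks(seq, pattern):
    # Each block starts where the key pattern matches and extends to the
    # start of the next match; each block becomes one virtual node's children.
    m = len(pattern)
    starts = [i for i in range(len(seq) - m + 1)
              if seq[i:i + m] == list(pattern)]
    return [seq[a:b] for a, b in zip(starts, starts[1:] + [len(seq)])]

seq = "h2 p p ul p p h2 p p p p p p div".split()
group_blocks(seq, ["h2", "p", "p"])
# → [["h2","p","p","ul","p","p"], ["h2","p","p","p","p","p","p","div"]]
```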

We mentioned before that each virtual node, with its children, is a candidate for segmentation. Each candidate becomes an actual segmented block only when it is informative. Informative blocks are determined by evaluating the amount of information in a block, which is done by assigning an importance weight to each node according to the number of repetitions of node patterns in the Web page. As an exception, we set the importance weight of a virtual node to half the weight of its parent node; this has the effect of distributing the importance of a parent node to its virtual children through the structural hierarchy. For simple calculation, we normalize the importance weight of each node by the maximum number of repetitions. Thus, informative blocks must have normalized importance weights greater than a threshold θ.

V. EXPERIMENTAL RESULTS

We evaluated the performance of our REPS algorithm for Web page segmentation on four different types of Web sites: article, blog, portal, and shopping-mall style, as shown in Fig. 7. In this figure, the results of page block segmentation are shown with colored boxes. In the case of the article style shown in Fig. 7(a), although the content nodes have a common parent node, REPS could generate the detail blocks that are nested in the page. For the blog and shopping-mall styles shown in Fig. 7(b) and 7(c), the content items were correctly distinguished. Finally, for the portal style shown in Fig. 7(d), the informative blocks and the detail blocks were also correctly identified. Note that the threshold θ is set to a different value for each site to achieve optimal results; these optimal values were determined through prior experiments.

For a more objective performance evaluation, we compare the precision and recall values of our algorithm (REPS) with those of VIPS [12], the most well-known algorithm for vision-based Web page segmentation. We collected 100 Web pages from each of 8 sites, and the comparison results are summarized in TABLE I. Here, P denotes the precision, which measures the ratio of correctly segmented blocks over the blocks segmented by the algorithm, and R denotes the recall, which measures the ratio of correctly segmented blocks over the ideal blocks obtained manually by humans.
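The thresholding step can be sketched as below; the weights mapping (block to repetition-based importance, assumed to already include the halving rule for virtual nodes) is an illustrative input of ours, not an interface from the paper:

```python
def informative_blocks(weights, theta):
    # Normalize each weight by the maximum and keep blocks above threshold theta.
    max_w = max(weights.values())
    return [block for block, w in weights.items() if w / max_w > theta]

informative_blocks({"v1": 6, "v2": 3, "v3": 1}, theta=0.4)
# → ["v1", "v2"]  (normalized weights 1.0, 0.5, 0.17)
```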

(a) An article style Web page: NETTUTS (http://nettuts.com/articles/10-steps-to-learning-a-new-coding-language-fast), θ = 0.4
(b) A blog style Web page: TechTipsBlog (http://www.techtipsblog.com/), θ = 0.5

(c) A shopping-mall style Web page: Amazon (http://www.amazon.com/), θ = 0.3
(d) A portal style Web page: BBC (http://www.bbc.co.uk/), θ = 0.3

Fig. 7. Results of Web page segmentation for different types of Web sites.

TABLE I
PERFORMANCE COMPARISON BETWEEN REPS AND VIPS

REPS               Main Pages                Detail Pages
SITE            θ      P      R           θ      P      R
ABC             0.5    100    98.73       0.45   100    66.67
BBC             0.1    100    100         0.5    100    70
CNN             0.1    100    69.57       0.6    100    100
FOX             0.4    100    77.78       0.15   100    100
Amazon          0.4    100    36.84       0.4    100    75
eBay            0.4    100    72.73       0.2    100    100
NETTUTS         0.02   96.67  82.86       0.4    100    88.89
TechTipsBlog    0.4    100    100         0.6    100    71.43
Average                99.58  79.81              100    84

VIPS               Main Pages                Detail Pages
SITE            PDoC   P      R           PDoC   P      R
ABC             6      42.86  100         6      100    100
BBC             6      100    100         6      100    53.33
CNN             6      96.67  100         6      100    93.33
FOX             6      92.31  100         6      100    100
Amazon          5      72.73  100         6      98.81  95.4
eBay            5      100    100         6      93.44  89.06
NETTUTS         6      0      0           6      74.36  59.18
TechTipsBlog    7      100    61.9        6      100    50
Average                75.57  82.74              95.83  80.04

Based on the data in the table, we can claim that REPS mostly provides more correct results in terms of precision. In terms of recall, however, REPS does not show much improvement over VIPS. The main reason is that, although REPS builds virtual nodes for nested blocks, it has difficulty in finding deeply nested blocks in some pages, since the block in REPS is determined by the number of repetitions. Also, REPS cannot control the number of blocks expected from the segmentation, whereas VIPS can do so through its PDoC parameter. However, when PDoC is set to a higher value than the one used in this experiment, VIPS can behave incorrectly by generating many small-sized blocks. From these observations, we claim that REPS mostly outperforms VIPS in terms of the correctness of Web page segmentation.

VI. CONCLUSION

This paper has proposed a new method of Web page segmentation that uses repetitive tag patterns in the DOM tree structure of a page. The REPS algorithm detects key patterns in a page and generates virtual nodes to segment nested blocks. A series of experiments performed on real Web sites revealed that REPS contributes to improving the correctness of Web



page segmentation, which eventually affects the efficiency of information extraction by allowing the system to correctly recognize informative blocks in a Web page. One limitation is that REPS cannot preset the number of blocks expected from the segmentation. This may lead to inflexibility, since the desired granularity of logical blocks may vary across users; for the same reason, REPS had some difficulty in finding deeply nested blocks for certain pages. We are currently working on this issue by exploiting the importance weights of blocks to select an appropriate number of blocks during segmentation.

REFERENCES
[1] G. Hattori, K. Hoashi, K. Matsumoto, and F. Sugaya, "Robust web page segmentation for mobile terminal using content distances and page layout information," Proc. 16th Intl. Conf. on World Wide Web, pp. 361-370, 2007.
[2] C. Choi, J. Kang, and J. Choi, "Extraction of user-defined data blocks using the regularity of dynamic web pages," Lecture Notes in Computer Science, vol. 4681, pp. 123-133, 2007.
[3] S. Lin and J. Ho, "Discovering informative content blocks from Web documents," Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 588-593, 2002.
[4] Y. Chen, W.-Y. Ma, and H.-J. Zhang, "Detecting web page structure for adaptive viewing on small form factor devices," Proc. 12th Intl. Conf. on World Wide Web, pp. 225-233, 2003.
[5] Y. Chen, X. Xie, W. Ma, and H. Zhang, "Adapting web pages for small-screen devices," IEEE Internet Computing, vol. 9, no. 1, pp. 40-56, 2005.
[6] B. Zheng and M. Atiquzzaman, "A novel scheme for streaming multimedia to personal wireless handheld devices," IEEE Trans. Consumer Electron., vol. 49, no. 1, pp. 32-40, 2003.
[7] W. Lee, S. Kang, S. Lim, M. Shin, and Y. Kim, "Adaptive hierarchical surrogate for searching web with mobile devices," IEEE Trans. Consumer Electron., vol. 53, no. 2, pp. 796-803, 2007.
[8] A. Arasu and H. Garcia-Molina, "Extracting structured data from web pages," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 337-348, 2003.
[9] V. Crescenzi, P. Merialdo, and P. Missier, "Fine-grain web site structure discovery," Proc. 5th ACM Intl. Workshop on Web Information and Data Management, pp. 15-22, 2003.
[10] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in web pages for data mining," Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 296-305, 2003.
[11] D. Chakrabarti, R. Kumar, and K. Punera, "Page-level template detection via isotonic smoothing," Proc. 16th Intl. Conf. on World Wide Web, pp. 61-70, 2007.
[12] Y. Yang and H. Zhang, "HTML page analysis based on visual cues," Proc. 16th Intl. Conf. on Document Analysis and Recognition, p. 859, 2001.
[13] X. Xie, G. Miao, R. Song, J. Wen, and W. Ma, "Efficient browsing of web search results on mobile devices based on block importance model," Proc. 3rd IEEE Intl. Conf. on Pervasive Computing and Communications, pp. 17-26, 2005.
[14] S. Debnath, P. Mitra, and C. Giles, "Automatic extraction of informative blocks from webpages," Proc. ACM Symp. on Applied Computing, pp. 1722-1726, 2005.
[15] S. Debnath, P. Mitra, and C. Giles, "Identifying content blocks from web documents," Lecture Notes in Computer Science, vol. 3488, pp. 285-293, 2007.
[16] T. Peng, C. Zhang, and W. Zuo, "Tunneling enhanced by web page content block partition for focused crawling," Concurrency and Computation: Practice & Experience, vol. 20, no. 1, pp. 61-74, 2008.

BIOGRAPHIES

Jinbeom Kang is currently a senior research engineer in the Mobile Communication Research Lab. at LG Electronics Inc., Korea. Dr. Kang received his Ph.D. degree in Computer Science from Hanyang University in 2009. His research interests include informative Web block extraction, intelligent agents, logic-based proof systems, situation awareness, combined pattern mining, semantic facilitators, knowledge modeling, and behavioral analysis.

Jaeyoung Yang received his B.S., M.S., and Ph.D. degrees in Computer Science and Engineering from Hanyang University, Korea, in 1998, 2000, and 2003, respectively. He is a postdoctoral researcher in the Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA, USA. He is also a research member of the Web Intelligence Consortium Korea Center. His research interests include intelligent agents, machine learning, context-awareness, ubiquitous computing, and text mining.

Joongmin Choi (M'97) is a professor in the Department of Computer Science and Engineering, Hanyang University, Ansan, Korea. He is the director of the Web Intelligence Consortium (WIC) Korea Center. He received his B.S. and M.S. degrees in computer engineering from Seoul National University, Korea, in 1984 and 1986, respectively, and his Ph.D. degree in computer science from the State University of New York at Buffalo, USA, in 1993. His research interest focuses on Web Intelligence, a broad concept covering Web information extraction, Web data mining, the Semantic Web and ontologies, and other intelligent techniques for manipulating Web information. His other interest areas include intelligent agents, artificial intelligence, and context-aware personalization.
