[Figure: Information gain per feature for the Professional Documents and Culture Documents subsets. Features are ranked by information gain (0 to 0.8); in both subsets the Number of Words Quotient (P/C) ranks highest (0.585 for Professional, 0.746 for Culture), followed by Text Density Delta (P/C) and the Number of Words, Link Density, Text Density, Full-Stop Ratio, Average Word Length and Start-Uppercase Ratio of the previous (P), current (C) and next (N) block. Mid-ranked features include Relative Position, Average Sentence Length and the <P> tag features; Text Frequency, uppercase-ratio counts, Absolute Position and markup features such as vertical bars, date tokens, <H*> and <DIV> tags carry the least information.]
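The ranking in the figure above is produced by information gain. As a minimal sketch of how such a ranking can be computed for a single numeric block feature (a generic entropy-based formulation, not the paper's actual feature-selection code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, feature_values, threshold):
    """Information gain of a binary split on a numeric feature:
    H(labels) minus the weighted entropy of the two partitions."""
    left = [l for l, v in zip(labels, feature_values) if v <= threshold]
    right = [l for l, v in zip(labels, feature_values) if v > threshold]
    n = len(labels)
    split_entropy = (len(left) / n * entropy(left) +
                     len(right) / n * entropy(right))
    return entropy(labels) - split_entropy

# Toy example: content blocks tend to have many words, so splitting
# on the number of words separates the classes perfectly here.
labels = ["content", "content", "content",
          "boilerplate", "boilerplate", "boilerplate"]
num_words = [45, 80, 30, 4, 7, 2]
print(information_gain(labels, num_words, threshold=10))  # 1.0
```

Ranking every feature by its best achievable gain in this fashion yields orderings of the kind shown above, where word-count and density features dominate.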
4.3.2 Application to CleanEval and Re-Validation

To test the domain independence of the determined classifiers, we applied the two simplified C4.8 classifiers to the CleanEval collection. We evaluated the 2-class problem (boilerplate vs. content of any kind, called TO in CleanEval) for the classifier that was trained on the GoogleNews collection and for one that was trained on the CleanEval training set. In the latter case, the decision rule was even simpler: accept all blocks that have a minimum text density of 7 and a maximum link density of 0.35.

Because the CleanEval collection only provides assessments and the algorithmic output at text level, we cannot directly reuse the setup we used for the GoogleNews evaluation. The CleanEval initiative provides its own accuracy measure, which is based upon a weighted Levenshtein edit distance at token level [4]. The computation is expensive and not essential for this task. We confirmed that the scores can be approximated well by the much simpler bag-of-words token-level F1 score (as in the GoogleNews setup, except that class weights are not taken into account). As our scores therefore differ slightly from the ones in [4], we re-evaluated the available results of three CleanEval contestants (BTE, Victor, NCleaner) and also added the heuristics by Pasternack et al. [21] (in two flavors: the unigram model trained on CleanEval and the trigram model trained on a news corpus) to the set of competitors, as well as a baseline ("Keep all text") and a classifier based solely on the feature with the highest information gain, the number of words: we mark every block with at least 10 words as content.

The average (µ) and median (m) as well as the ranked accuracy for each evaluated strategy are depicted in Figure 2a. We see that the two flavors of the Pasternack heuristic differ drastically in terms of accuracy. We assume that the algorithm needs proper training to succeed on a particular corpus, and that the trigram model from the news domain was not generic enough for the CleanEval dataset.

Additionally, to understand how far heuristic additions could further improve the classification, we extended our two decision tree classifiers downstream with hand-crafted rules. In one extension, we only keep the content block with the highest number of words (Largest Content Filter). In another extension, we add rules that are specific to news (Main Content Filter). It extends the Largest Content Filter by removing any text that is below a clearly identifiable comments section (a block solely containing one of 11 indicator strings like "User comments:" etc.) and above a clearly identifiable title (derived from the HTML document's TITLE value). As we can see from Figure 2a, for the CleanEval collection these modifications resulted in much worse results than the baseline. On the other hand, the very simple strategy of keeping all blocks with at least 10 words (as well as our NumWords/LinkDensity classifier) performed just as well as the Pasternack unigram and NCleaner setups that were specifically trained for CleanEval.

Ultimately, we see that basically keeping all text (i.e., not removing anything) would be a good strategy, which is only marginally worse than the apparently best solution (BTE)! This raises the question of whether there were failures in the assessment process, and whether the collection is comparable to the GoogleNews collection or appropriate at all for the purpose of boilerplate detection.

We repeated this evaluation for the GoogleNews collection, computing accuracy scores in the same way using the same algorithms (BTE as the alleged winner for CleanEval, Pasternack Trigrams as a supposedly mediocre strategy, and the algorithms we introduced in this paper). For the Pasternack algorithm, we used the Web service provided by the authors; unfortunately, no unigram implementation was available. Figure 2b shows the corresponding results.

We see that the baseline for GoogleNews is much lower than for CleanEval; all tested algorithms perform differently and are usually better than the baseline, except for the Pasternack strategy, which underperformed in a few cases.

[Figure 2: Performance of Boilerplate Detection Strategies (token-level F-measure per document). (a) CleanEval: µ=91.11%, m=97.41% BTE; µ=90.41%, m=97.03% Victor; µ=90.10%, m=96.23% Pasternack Unigrams (CleanEval); µ=90.08%, m=95.52% Keep Blocks with >=10 Words; µ=89.86%, m=97.18% NumWords/LinkDensity Classifier; µ=89.41%, m=95.66% NCleaner; µ=89.24%, m=95.11% Baseline (Keep everything); µ=79.80%, m=94.91% NumWords/LinkDensity + Main Content Filter; µ=77.76%, m=93.34% Densitometric Classifier + Main Content Filter; µ=72.10%, m=93.94% Pasternack Trigrams, trained on News Corpus; µ=68.30%, m=79.81% Densitometric Classifier + Largest Content Filter; µ=67.81%, m=76.89% NumWords/LinkDensity + Largest Content Filter. (b) GoogleNews: µ=68.30%, m=70.60% Baseline (Keep everything).]

In the end, the use of the two heuristic filters (Largest/Main Content Filter) can further improve the detection accuracy for the GoogleNews dataset to an almost perfect average F1 score of 95.93% (a 40% relative improvement over the baseline). Even though these algorithms failed for CleanEval, we expect them to work generically for the news domain. On the other hand, we see that both BTE and our two simplified classifiers work quite well for both collections. Of note, our classifier is far more efficient than BTE: it runs in linear time, whereas BTE has a quadratic upper bound. Furthermore, it can return more than a single piece of text. BTE's assumption that the main content is completely covered by a single block (the largest one with the least tag density) seems not to hold in all cases: compared to the densitometric classifier, BTE only achieves a suboptimal accuracy in our news experiment. This discrepancy may be due to the fact that the CleanEval collection actually contains less boilerplate text than all other collections we examined (only very few words are in blocks with a text density lower than 11; see Figure 3).

5. QUANTITATIVE LINGUISTIC ANALYSIS

5.1 Text Density vs. Number of Words

In all cases but one, the classifier using Number of Words per Block performed slightly better than the classifier using Text Density, and this feature alone seems sufficient for a good classification. To get a better understanding of why this strategy performs so well, we need to analyze the created decision tree rules (see Algorithms 1 and 2). We see that the two classifiers do not differ in the link-density-specific rules: if the text block consists of more than 33% linked words, it is most likely boilerplate, unless the block is surrounded by long/dense text blocks.
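The rule structure just described can be summarized in a few lines. The sketch below is a simplification, not the exact trees of Algorithms 1 and 2: the 0.33 link-density cut-off and the 10-word threshold are taken from the text above, but the neighbor conditions are condensed into two boolean flags.

```python
def classify_block(num_words, link_density,
                   prev_is_dense=False, next_is_dense=False):
    """Toy sketch of the shallow-feature rules discussed above.

    A block with more than 33% linked words is treated as boilerplate
    unless it is surrounded by long/dense text blocks; otherwise a
    simple word-count threshold decides (the "keep blocks with >= 10
    words" heuristic). The actual decision trees are more nuanced.
    """
    if link_density > 0.33:
        # Likely navigation boilerplate, unless embedded in dense content.
        return "content" if (prev_is_dense and next_is_dense) else "boilerplate"
    return "content" if num_words >= 10 else "boilerplate"

# A 40-word paragraph with few links is kept; a short, link-heavy
# navigation block is dropped.
print(classify_block(40, 0.05))  # content
print(classify_block(5, 0.60))   # boilerplate
```

Note how this classifier needs only per-block counts plus two bits about the neighbors, which is why it runs in linear time over the blocks of a page.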
[Figure 4: Number of Words Distribution (GoogleNews). Histogram of the number of blocks over the number of words per block (0–60), separated into Content and Not Content, overlaid with a complete fit composed of a Beta-distributed functional text class, a Beta-distributed descriptive text class, and a normally distributed fuzzy class transition.]

[Figure 5: Text Density Distribution by class (GoogleNews). Number of words over text density (0–20) for the classes Not Content (Plain Text), Not Content (Anchor Text), Related Content, Headlines, Supplemental Text, User Comments, Main Article and Linked Text.]
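The two block features plotted above are cheap to compute. The sketch below assumes the densitometric definition of text density from the authors' earlier work [18, 19] (words per line after wrapping the block at a fixed column width); the wrapping width of 80 is an assumption for illustration, not necessarily the value used for the figures.

```python
import textwrap

def text_density(block_text, wrap_width=80):
    """Words per wrapped line: a sketch of the densitometric text
    density shown in Figure 5 (exact wrapping rules may differ)."""
    lines = textwrap.wrap(block_text, width=wrap_width)
    if not lines:
        return 0.0
    return len(block_text.split()) / len(lines)

def link_density(num_linked_words, num_words):
    """Fraction of the block's words that occur inside anchor tags."""
    return num_linked_words / num_words if num_words else 0.0
```

A long paragraph fills its wrapped lines and yields a high density, while a short headline or navigation label occupies a single sparse line, which is exactly the separation visible in the distributions above.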
6. RETRIEVAL EXPERIMENTS

6.1 Setup

In this section we quantify the impact of boilerplate detection on search. The obvious assumption here is that boilerplate is not only another sort of text; it may also deteriorate search precision, particularly in those cases where keywords match "related articles" text or other keywords that are not relevant to the actual main content. To evaluate our hypothesis for a representative scenario, we examine yet another domain of Web documents: blogs.

Blogs are particularly relevant for this task because we may expect many links from one blog page to other blog entries, being topically or temporally related, and those links often include headlines and teaser texts of the referenced item. In addition, a TREC reference collection already exists, containing 3 million permalink documents retrieved from 100,000 different feeds, along with test queries (consisting of one to five words) and document assessments (TREC'06 Blog Track [20]), which were mainly used for measuring opinion retrieval performance. They used graded relevance scores (not relevant, topically relevant, and three levels indicating positive, negative and mixed opinions). We will use this collection for our evaluation. We indexed the BLOGS06 collection using the Lucene IR library. Separate parallel indexes were created for document blocks with a particular number of words or a particular text density; this allows a selection of permitted ranges at runtime without reindexing. If our assumption holds, one can expect an improvement of search precision when only words of the descriptive text class (long text) are considered for search.

6.2 Evaluation

We perform 50 top-k searches (with k = 10) for the queries defined in the TREC'06 Blog Track and evaluate precision at rank 10 (P@10; true/false relevance as in the TREC'06 benchmark results) as well as the normalized discounted cumulative gain (NDCG10; graded relevance scores as present in the TREC'06 assessments). Using the standard Lucene ranking formula, we perform 50 searches from the queries predefined in the TREC'06 Blog Track and count the number of documents in the top-10 results that have been marked relevant by the TREC assessors. We repeatedly issue the queries for a sliding minimum text density between 1 and 20 and a sliding minimum number of words from 1 to 100, respectively (a minimum of 1 word equals the baseline). As we only remove short text, there is no need for a maximum bound. We benchmark the performance of the BTE algorithm for this task and compare P@10 as well as NDCG10 to our solution. Finally, we compare the results to the P@10 scores reported from TREC'06.

Using the sliding minimum text density, we are able to significantly improve precision (the baseline results in P@10 = 0.18; NDCG10 = 0.0985); at the minimum threshold of 14 (with slightly lower values for the surrounding densities between 11 and 20) we get P@10 = 0.32 and NDCG10 = 0.1823, which is almost equal to the scores of the BTE algorithm (P@10 = 0.33 and NDCG10 = 0.1627). For the simple sliding minimum number of words, we achieved a remarkable accuracy of P@10 = 0.44 and NDCG10 = 0.2476 for any minimum threshold between 11 and 100 words (an improvement of 144%/151% over the baseline and 33%/52% over BTE). We did not examine higher thresholds for practical reasons; at some point, of course, precision would drop again because of lacking input. Figure 7 depicts the results for P@10; the NDCG10 curves show identical behaviour.

[Figure 7: BLOGS06 Search Results. P@10 over the sliding minimum threshold (1–20).]

In a direct comparison with the BLOGS06 competition, our results are of course somewhat lower, since our strategy does not do opinion mining at all. However, boilerplate removal seems to be strongly beneficial for this purpose: we can still compete with the lower 4 of the 16 contestants. One can therefore expect that adding our strategy to an opinion mining pipeline would further increase accuracy.

7. CONCLUSIONS AND FURTHER WORK

Conclusions. In this paper, we presented a simple yet effective approach for boilerplate detection using shallow text features, which is theoretically grounded in stochastic text generation processes from Quantitative Linguistics.

We have shown that textual content on the Web can apparently be grouped into two classes: long text (most likely the actual content) and short text (most likely navigational boilerplate text). Through our systematic analysis we found that removing the words of the short-text class alone is already a good strategy for cleaning boilerplate, and that using a combination of multiple shallow text features achieves an almost perfect accuracy. To a large extent, the detection of boilerplate text requires neither inter-document knowledge (frequency of text blocks, common page layout etc.) nor any training at token level.

We analyzed our boilerplate detection strategies on four representative multi-domain corpora (news, blogs and cross-domain) and also evaluated the impact of boilerplate removal on document retrieval. In all cases we achieve significant improvements over the baseline and accuracies that withstand and even outperform more complex competitive strategies, which incorporate inter-document information, n-grams or other heuristics. The costs of detecting boilerplate are negligible, as it comes down simply to counting words.

Further Work. The presented results raise research issues in many different directions. Obviously, for boilerplate detection we need to remove the "last 5%" of classification error. Even though the generality of our approach might suggest that it is universally applicable, we need to test our approach on other content domains and languages. Finally, we need to investigate more deeply and extend our textual model from a Quantitative Linguistics perspective. To open up further possibilities, the algorithms used in this paper are freely available from the authors' website (http://www.L3S.de/~kohlschuetter/boilerplate).

8. REFERENCES

[1] G. Altmann. Quantitative Linguistics - an international handbook, chapter Diversification processes. de Gruyter, 2005.
[2] S. Baluja. Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 33–42, New York, NY, USA, 2006. ACM.
[3] Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In WWW, pages 580–591, 2002.
[4] M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. CleanEval: a competition for cleaning web pages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odjik, S. Piperidis, and D. Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 2008.
[5] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. Extracting content structure for web pages based on visual representation. In X. Zhou, Y. Zhang, and M. E. Orlowska, editors, APWeb, volume 2642 of LNCS, pages 406–417. Springer, 2003.
[6] D. Chakrabarti, R. Kumar, and K. Punera. Page-level template detection via isotonic smoothing. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 61–70, New York, NY, USA, 2007. ACM.
[7] D. Chakrabarti, R. Kumar, and K. Punera. A graph-theoretic approach to webpage segmentation. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 377–386, New York, NY, USA, 2008. ACM.
[8] Y. Chen, W.-Y. Ma, and H.-J. Zhang. Detecting web page structure for adaptive viewing on small form factor devices. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, pages 225–233, New York, NY, USA, 2003. ACM.
[9] S. Debnath, P. Mitra, N. Pal, and C. L. Giles. Automatic identification of informative sections of web pages. IEEE Transactions on Knowledge and Data Engineering, 17(9):1233–1246, 2005.
[10] S. Evert. A lightweight and efficient tool for cleaning web pages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odjik, S. Piperidis, and D. Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.
[11] D. Fernandes, E. S. de Moura, B. Ribeiro-Neto, A. S. da Silva, and M. A. Gonçalves. Computing block importance for searching on web sites. In CIKM '07, pages 165–174, 2007.
[12] A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the WAC4 Workshop at LREC 2008.
[13] A. Finn, N. Kushmerick, and B. Smyth. Fact or fiction: Content classification for digital libraries. Joint DELOS-NSF Workshop on Personalisation and Recommender Systems in Digital Libraries (Dublin), 2001.
[14] D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In WWW '05, pages 830–839, New York, NY, USA, 2005. ACM.
[15] J. Gibson, B. Wellner, and S. Lubar. Adaptive web-page content identification. In WIDM '07: Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management, pages 105–112, New York, NY, USA, 2007. ACM.
[16] K. Hofmann and W. Weerkamp. Web corpus cleaning using content and structure. In Building and Exploring Web Corpora, pages 145–154. UCL Presses Universitaires de Louvain, September 2007.
[17] H.-Y. Kao, J.-M. Ho, and M.-S. Chen. WISDOM: Web intrapage informative structure mining based on document object model. IEEE Transactions on Knowledge and Data Engineering, 17(5):614–627, May 2005.
[18] C. Kohlschütter. A densitometric analysis of web template content. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, New York, NY, USA, 2009. ACM.
[19] C. Kohlschütter and W. Nejdl. A densitometric approach to web page segmentation. In ACM 17th Conference on Information and Knowledge Management (CIKM 2008), 2008.
[20] I. Ounis, C. Macdonald, M. de Rijke, G. Mishne, and I. Soboroff. Overview of the TREC 2006 Blog Track. In E. M. Voorhees and L. P. Buckland, editors, TREC, volume Special Publication 500-272. National Institute of Standards and Technology (NIST), 2006.
[21] J. Pasternack and D. Roth. Extracting article text from the web with maximum subsequence segmentation. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 971–980, New York, NY, USA, 2009. ACM.
[22] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27, 1948.
[23] M. Spousta, M. Marek, and P. Pecina. Victor: the web-page cleaning tool. In WaC4, 2008.
[24] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire. A fast and robust method for web page template detection and removal. In CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 258–267, 2006.
[25] R. Vulanović and R. Köhler. Quantitative Linguistics - an international handbook, chapter Syntactic units and structures, pages 274–291. de Gruyter, 2005.
[26] L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In KDD '03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 296–305, 2003.