
Preprocessing and Feature Preparation in Chinese Web Page Classification

Weitong Huang
Department of Computer Science and Technology
Tsinghua University
Beijing,China,100084
huangwt@tsinghua.edu.cn
Luxiong Xu
Fujian Normal University, Fuqing Brunch
Fuqing, Fujian,China
xlx@fjnu.edu.cn
Yanmin Liu
Department of Computer Technology and Applications
Qinghai University
Xining, Qinghai,810016
lyanmin@qhu.edu.cn

Abstract—This paper describes the detailed design and implementation of a Chinese web-page classification system and proposes several methods for Chinese web-page preprocessing and feature preparation. Experimental results on a Chinese web-page dataset show that the methods we designed improve the classification performance (Micro-F1) from 75.82% to 81.88%.
Keywords—Text classification, Chinese web-page preprocessing, feature preparation
I. INTRODUCTION
With the rapid development of the Internet, the number of web-pages has been increasing dramatically. While providing all-embracing information, these large collections of web-pages pose a great challenge: how can people find what they need? In order to organize and utilize information effectively, it is desirable to classify web-pages according to their contents. Web-page classification is widely applied in fields such as vertical search and personalized search, and it has been a hotspot in the research domains of text mining and web mining. In this paper, we introduce a Chinese web-page classification system we implemented. The system design and our proposed methods are described in Section 2. In Section 3, the experimental results on a Chinese web-page dataset are presented together with some discussion. Finally, we conclude our work in Section 4.
II. SYSTEM DESIGN AND IMPLEMENTATION
A. System Architecture
There are three parts in our Chinese web-page
classification system: web-page preprocessing, feature
preparation, and web-page classification. The system
architecture is illustrated in Figure 1.
B. Web-page Preprocessing
Data preprocessing is often estimated to account for as much as 60% of the work in a data mining task. Our web-page preprocessing consists of six procedures: HTML parsing, English lexical analysis, Chinese word segmentation, stopword removal, stemming, and vocabulary selection.
Figure 1. A Chinese Web-page Classification System

1) HTML Parsing
The purpose of HTML parsing is to remove irrelevant HTML code and extract text. We save the parsing result as XML files for further preprocessing and feature extraction. The procedure of HTML parsing is as follows:
Firstly, remove the HTML source code embedded in tags such as <style>, <script>, and <applet>. Keep the attributes of <meta> and <a>, and discard the attributes of all other tags; except for <meta> and <a>, tag attributes are of little help to the classification task.
Secondly, add user-defined tags for the text in a web-page, because standard HTML provides no dedicated tags for text. The rule is: plain text, i.e. non-hyperlink text, is identified by a <text> tag, while hyperlink text is identified by an <anchor> tag. Following this rule, an HTML web-page can be represented as an HTML tag tree whose leaf nodes are plain text identified by <text> or hyperlink text identified by <anchor>.
The page layout is usually controlled by <table> tags, and the tables along the two sides, the top, and the bottom are often noise such as navigation bars and advertisements. Given the HTML tag tree described above, some algorithms [1][2] can be applied to remove noise and extract the content text. In this paper, we adopt a simpler but effective method to remove noise. Through experiments (see Section 3.2), we discovered that in a content page (as opposed to a hub page), hyperlink text is usually used for navigation or advertisement. Therefore, simply discarding hyperlink text is an effective way to remove noise, as sketched below.
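As an illustration, the following is a minimal sketch of this step using Python's standard html.parser module. The skipped tags and the decision to keep only non-hyperlink text follow the rules above; all class, function, and variable names are our own illustrative choices, not the system's actual implementation.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects plain (non-hyperlink) text, skipping style/script/applet."""
    SKIP_TAGS = {"style", "script", "applet"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # inside <style>/<script>/<applet>
        self.anchor_depth = 0  # inside <a> ... </a> (hyperlink text)
        self.plain_text = []   # kept: content-bearing text
        self.anchor_text = []  # discarded as noise (navigation, ads)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        target = self.anchor_text if self.anchor_depth else self.plain_text
        target.append(data.strip())

page = '<html><body><a href="/nav">Home</a><p>Actual article text.</p></body></html>'
parser = ContentExtractor()
parser.feed(page)
print(" ".join(parser.plain_text))  # -> "Actual article text."
```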
2) English Lexical Analysis and Chinese Word
Segmentation
In English lexical analysis, spaces usually serve as separators, while in Chinese text there are no separators between words. Therefore, Chinese word segmentation is a prerequisite for Chinese information processing. Two open-source Chinese word segmentation projects are available. One is ChineseSegmenter [5], which works with a version of the maximal matching algorithm: when looking for words, it attempts to match the longest word possible. This simple algorithm is surprisingly effective; a minimal sketch is given below. The other is ICTCLAS [6], a Chinese lexical analysis system developed by the Institute of Computing Technology in China using an approach based on a multi-layer HMM. ICTCLAS includes word segmentation, part-of-speech tagging, and unknown-word recognition. Because of ICTCLAS's higher precision, we adopt it as the Chinese word segmentation module in our classification system.
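As a rough illustration of the maximal matching idea (not the actual ChineseSegmenter or ICTCLAS code), the following sketch greedily matches the longest dictionary word at each position; the toy dictionary is purely illustrative.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy forward maximal matching: at each position, take the
    longest dictionary word; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy dictionary; a real segmenter would use a large lexicon.
toy_dict = {"清华", "大学", "清华大学", "计算机"}
print(forward_max_match("清华大学计算机", toy_dict))
# -> ['清华大学', '计算机']
```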
3) Stopword Removal and Stemming
Words with very high frequency not only fail to distinguish documents but also increase the dimension of the feature space. Such words are called stopwords. We maintain a stopword list containing 580 English words and 637 Chinese words, which is used for stopword removal. In English, many words have morphological variants; using stemming [7], words sharing the same stem can be treated as equivalent. A minimal sketch of both steps follows.
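The sketch below assumes the NLTK implementation of the Porter stemming algorithm [7] and uses a tiny illustrative stopword set; the system's actual lists contain 580 English and 637 Chinese entries.

```python
from nltk.stem.porter import PorterStemmer  # Porter stemming algorithm [7]

# Illustrative fragment only; the real stopword lists are much larger.
STOPWORDS = {"the", "a", "of", "and", "is", "的", "了", "和"}

stemmer = PorterStemmer()

def normalize(tokens):
    """Drop stopwords, then map remaining tokens to their stems."""
    return [stemmer.stem(t) for t in tokens if t.lower() not in STOPWORDS]

print(normalize(["Stemming", "of", "the", "running", "words"]))
# -> ['stem', 'run', 'word']
```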
4) Vocabulary Selection
A sentence in natural language is made up of words of different parts of speech, such as nouns, pronouns, articles, verbs, adjectives, adverbs, prepositions, and conjunctions. Among them, nouns and verbs carry the most semantic meaning. It is therefore feasible to choose only nouns and verbs as feature words in order to reduce the feature space, remove noise, and improve classification quality; a sketch follows. We use ICTCLAS [6] to extract the nouns and verbs from a sentence, and the corresponding experiments are in Section 3.3.
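A sketch of this filtering step, assuming the segmenter returns (word, POS) pairs with ICTCLAS-style tags in which noun tags start with 'n' and verb tags with 'v'; the pair format and the sample tags are illustrative assumptions, not the system's actual interface.

```python
def select_nouns_and_verbs(tagged_words):
    """Keep only tokens whose POS tag marks a noun ('n...') or a verb ('v...')."""
    return [word for word, pos in tagged_words if pos.startswith(("n", "v"))]

# Hypothetical segmenter output for a short sentence.
tagged = [("我们", "r"), ("使用", "v"), ("分类", "v"), ("系统", "n"), ("的", "u")]
print(select_nouns_and_verbs(tagged))  # -> ['使用', '分类', '系统']
```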
C. Feature Extraction and Weighting
HTML web-pages are semi-structured text. A web-page consists of two parts: the <head> part and the <body> part. The former includes the meta and title information, which is a summary description of the whole page, while the latter is the actual content text that is visible to readers. The content text in the <body> part includes two types of text: plain text, which is content related, and hyperlink text. Assigning different weights to the different parts can improve classification quality; a weighting sketch follows, and our experiments are in Section 3.2.
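For instance, a minimal sketch of part-dependent weighting, assuming tokenized header and body text and the 5:1 header-to-body ratio selected in Section 3.2; the data structures are illustrative, not the system's actual code.

```python
from collections import Counter

def weighted_term_counts(header_tokens, body_tokens, header_weight=5):
    """Count terms, counting each header occurrence header_weight times."""
    counts = Counter()
    for token in header_tokens:
        counts[token] += header_weight
    for token in body_tokens:
        counts[token] += 1
    return counts

header = ["汽车", "新闻"]
body = ["汽车", "发布", "新款", "汽车"]
print(weighted_term_counts(header, body))
# -> Counter({'汽车': 7, '新闻': 5, '发布': 1, '新款': 1})
```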
D. Feature Selection
One difficulty in text classification is the high dimension of the feature space. The task of feature selection is to remove less important words from the feature space. Many feature selection methods have been studied, for example document frequency (DF), information gain (IG), mutual information (MI), and the χ²-test (CHI). [3][4] compared these four methods. The results show that IG and CHI achieve the best performance; DF has performance comparable to IG and CHI, with the advantage of simplicity and low computational complexity; MI is the worst.
We adopt document frequency selection in our system. The
assumption of this method is: the features (words) with low
document frequency have small influence on the
classification accuracy.
We use the revised formula below to compute the document frequency of word $w_t$:

$$freq(w_t) = \sum_{C_i \in C} \frac{freq(w_t, C_i)}{num(C_i)}$$

Here $freq(w_t, C_i)$ is the document frequency of $w_t$ in class $C_i$, and $num(C_i)$ is the number of documents in class $C_i$.
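A sketch of this class-normalized document-frequency selection under the formula above; the corpus representation (a list of (token-set, class) pairs) and the top-k cutoff are illustrative assumptions.

```python
from collections import Counter, defaultdict

def select_by_class_normalized_df(docs, k=4000):
    """docs: list of (set_of_words, class_label) pairs.
    Scores each word by the sum over classes of df(w, C_i) / num(C_i),
    then keeps the k highest-scoring words."""
    df_per_class = defaultdict(Counter)  # class -> word -> document frequency
    docs_per_class = Counter()           # class -> number of documents
    for words, label in docs:
        docs_per_class[label] += 1
        for w in words:
            df_per_class[label][w] += 1

    scores = Counter()
    for label, df in df_per_class.items():
        n = docs_per_class[label]
        for w, count in df.items():
            scores[w] += count / n
    return [w for w, _ in scores.most_common(k)]

docs = [({"汽车", "发动机"}, "Auto"), ({"股票", "汽车"}, "Business")]
print(select_by_class_normalized_df(docs, k=2))  # '汽车' scores highest
```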

E. Classifier
1) Naïve Bayes Classifier
The Naïve Bayes classifier is a simple but effective text classification algorithm that performs very well in practice. Given a document $d_i$, we determine the probability $P(c_j \mid d_i)$ that the document belongs to class $c_j$ by Bayes' rule and the naïve Bayes assumption:

$$P(c_j \mid d_i) \propto P(c_j) \, P(d_i \mid c_j) = P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)$$

Here $P(c_j)$ indicates the document frequency of class $c_j$ relative to all other classes; $P(w_t \mid c_j)$ indicates the frequency with which the classifier expects word $w_t$ to occur in documents of class $c_j$; and $w_{d_i,k}$ denotes the $k$-th word in document $d_i$.
The word probabilities are estimated with Laplace smoothing:

$$P(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i) \, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i) \, P(c_j \mid d_i)}$$

Here $V$ is the vocabulary; $D$ is the set of training documents; $N(w_t, d_i)$ is the number of times word $w_t$ occurs in document $d_i$; and $P(c_j \mid d_i) \in \{0, 1\}$ according to the class label of document $d_i$. The class prior is estimated analogously:

$$P(c_j) = \frac{1 + \sum_{d_i \in D} P(c_j \mid d_i)}{|C| + |D|}$$

where $|C|$ is the number of classes and $|D|$ is the number of training documents.
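The following is a minimal sketch of training and prediction under exactly these estimates (Laplace-smoothed multinomial naïve Bayes, using log-probabilities for numerical stability); it is an illustrative implementation, not the system's actual code.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one class label per document."""
        self.vocab = {w for d in docs for w in d}
        self.classes = set(labels)
        doc_count = Counter(labels)
        word_count = defaultdict(Counter)  # class -> word -> total count
        for d, c in zip(docs, labels):
            word_count[c].update(d)
        # P(c_j) = (1 + #docs in c_j) / (|C| + |D|)
        self.log_prior = {c: math.log((1 + doc_count[c]) /
                                      (len(self.classes) + len(docs)))
                          for c in self.classes}
        # P(w_t | c_j) = (1 + N(w_t, c_j)) / (|V| + sum_s N(w_s, c_j))
        self.log_cond = {}
        for c in self.classes:
            total = sum(word_count[c].values())
            self.log_cond[c] = {w: math.log((1 + word_count[c][w]) /
                                            (len(self.vocab) + total))
                                for w in self.vocab}

    def predict(self, doc):
        def score(c):
            return self.log_prior[c] + sum(self.log_cond[c][w]
                                           for w in doc if w in self.vocab)
        return max(self.classes, key=score)

nb = NaiveBayes()
nb.fit([["汽车", "发动机"], ["股票", "市场"]], ["Auto", "Business"])
print(nb.predict(["汽车", "市场", "汽车"]))  # -> 'Auto'
```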
2) Evaluation Measure
We use the standard measures to evaluate the performance of our classification system, i.e. precision, recall, and the F1-measure [8]; a micro-averaging sketch is given below.
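As an illustration, micro-averaged F1 can be computed from per-class counts pooled over all classes (a sketch, not the paper's evaluation code); note that for single-label classification, micro-precision and micro-recall coincide.

```python
def micro_f1(y_true, y_pred, classes):
    """Pool true-positive, false-positive, and false-negative counts
    over all classes, then compute F1 from the pooled totals."""
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(micro_f1(["Auto", "IT", "IT"], ["Auto", "IT", "Auto"], ["Auto", "IT"]))
# -> 0.666...
```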

III. EXPERIMENTS
A. Chinese Web-page Dataset
We downloaded 6,745 Chinese web-pages from http://www.sohu.com/ as our training set and 2,814 web-pages from http://www.sina.com.cn/ as our test set. These web-pages are distributed among 8 categories. Table 1 shows the number of documents in the training and test sets for each category. We have good reasons to choose the training set and the test set from two different sources: our earlier work shows that performance on a dataset drawn from a single source is surprisingly high (usually above 95%), but that does not reflect the real world. We therefore use a test set from a different source to evaluate the effect of training.
TABLE I. CHINESE WEB-PAGE DATASET

Category Name    Training Set (from sohu)    Test Set (from sina)
Auto             841                         351
Business         630                         263
Entertainment    1254                        523
Health           784                         327
House            736                         307
IT               1050                        438
Sports           841                         351
Women            609                         254
Sum              6745                        2814
B. Feature Extraction and Weight Distribution

Figure 2. Classification Results of Different Feature Extraction Schemes
Section 2.2.1 defined plain text, i.e. the non-hyperlink text in web-pages, which is almost always correlated with the topic of the page and contains little noise. We extract only the plain text of every web-page as one feature extraction scheme and compare it with the scheme extracting the full text. Figure 2 shows the experimental results. When the number of feature words varies from 1,000 to 10,000, the average Micro-F1 of the full-text scheme is 75.82%, while the plain-text scheme reaches 78.93%. The results show that extracting only plain text as features improves classification quality by 3.11 percentage points by effectively removing the noise in the web-pages.

Figure 3. Classification Results of Different Header-Weighting Schemes
The header of a web-page is concise and exact, and reflects how the page author summarizes the content of the page. Properly raising the header weight in the feature space can improve classification quality. We ran an experiment to determine the ratio of the header weight to the body text weight: choose 4,000 feature words by the document frequency method, and compare the classification quality as the ratio of the header weight to the body text weight varies from 2:1 to 10:1.
The results, illustrated in Figure 3, show that Micro-F1 is 79.75% when the header and the body text share the same weight, and that Micro-F1 rises as the header weight is raised, reaching a maximum at a header-to-body weight ratio of 5:1. Therefore, we set the ratio of the header weight to the body text weight to 5:1. Figure 2 also shows the experimental results of feature extraction after raising the header weight: the average Micro-F1 goes up from 78.93% to 82.00% thanks to the increased header weight.

C. Vocabulary Selection

Figure 4. Classification Results using vocabularies of different parts of speech
We also investigated different candidate vocabularies, including all words versus nouns and verbs only. The results are illustrated in Figure 4. The classification result using only nouns and verbs is clearly better than that using all words, which means that nouns and verbs are enough to reflect the content of a web-page; restricting the vocabulary to them eliminates the noise caused by pronouns, adjectives, quantifiers, and adverbs, and consequently improves classification quality.
D. Final Experimental Results
Considering the experimental comparisons and analysis above, we settle on a final web-page classification scheme: extract the plain text of each web-page, set the ratio of the header weight to the body text weight to 5:1, choose nouns and verbs as candidate features, select 4,000 feature words using the document frequency method, and use the Naïve Bayes classifier to train and test on the Chinese web-page dataset. The precision and recall of each category are given in Table 2. Micro-F1 is 81.88% when we use the preprocessing and feature preparation methods described above. In comparison, Micro-F1 is only 75.82%, as shown in Figure 2, if we extract the full text and use no special feature preparation methods.
TABLE II. CLASSIFICATION RESULTS OF EACH CATEGORY

Category Name    Precision    Recall
Auto             87.69%       83.19%
Business         87.72%       76.05%
Entertainment    85.89%       95.41%
Health           64.39%       69.11%
House            84.56%       82.08%
IT               70.54%       90.18%
Sports           97.90%       92.88%
Women            87.69%       44.88%

Micro-F1: 81.88%

We are satisfied with this improvement, because many pages cannot be assigned to a single category unambiguously. For example, web-pages about women's health can be classified into the category Health as well as the category Women; this ambiguity is inherent in real-world data.
IV. CONCLUSION
In this paper, a series of web-page preprocessing and feature preparation methods is proposed. Through experiments, we draw the following conclusions: extracting only plain text effectively eliminates noise in web-pages, and both raising the header weight and choosing only nouns and verbs as candidate features improve classification quality. On our Chinese web-page dataset, the proposed methods improve Micro-F1 greatly, from 75.82% to 81.88%, compared with the full-text method. In the future, we will enrich our Chinese web-page dataset and run experiments on larger and more varied datasets.

REFERENCES
[1] Lawrence Kai Shih and David R. Karger. Using URLs and Table Layout for Web Classification Tasks. In Proceedings of WWW'04, New York, NY, USA, 2004.
[2] Chengjie Sun and Yi Guan. A Statistical Approach for Content Extraction from Web Page. Journal of Chinese Information Processing, 2004, 18(5): 17–22.
[3] Yiming Yang and Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of ICML, Nashville, Tennessee, USA, 1997.
[4] Songwei Shan, Shicong Feng, and Xiaoming Li. A Comparative Study on Several Typical Feature Selection Methods for Chinese Web Page Categorization. Computer Engineering and Applications, 2003, 39(22): 146–148.
[5] Chinese Segmenter: http://www.mandarintools.com/segmenter.html
[6] ICTCLAS: http://www.nlp.org.cn/project/project.php?proj_id=6
[7] Porter Stemming Algorithm: http://www.tartarus.org/martin/PorterStemmer/
[8] C. J. van Rijsbergen. Information Retrieval. Butterworth, London, 1979, 173–176.
