ABSTRACT
Text classification is a fundamental problem in web information processing, and text classification algorithms provide the
theoretical basis for designing and developing classifiers. The most typical classification algorithms at present include the
decision tree algorithm, the Bayes algorithm and the KNN algorithm. This paper discusses the theoretical basis of these
algorithms and introduces their applications by analyzing the advantages and disadvantages of each. Finally, the paper
presents an implementation process for a webpage classifier based on the C4.5 algorithm.
1. FOREWORD
The rapid development of information technology has brought a significant increase in the quantity of information
resources of all kinds, among which text data account for a large proportion. This has raised a series of problems, the
most essential of which is how to sort, analyze and use such huge volumes of text data efficiently. With the rapid
development of technologies such as computer networking and data storage, collecting, sorting and analyzing the
tremendous amount of data on the network has become much easier. Data mining techniques, including data
classification, play an important role in the deeper analysis of these data.
Classifying these text data efficiently is essential to analyzing and processing such huge volumes of text. For the
mountains of information on the network, the traditional solution is manual classification, which has two main
weaknesses: 1. It consumes tremendous human and material resources. 2. The consistency of the classified data is low:
even when the people responsible for the classification have very good linguistic competence, the results vary from
person to person.
Automatic text classification techniques are therefore particularly attractive for their convenience and accuracy, and
they have become a hot research topic in many fields. Text classification starts from a collection of texts with category
labels. A classification model is derived from the features shared by the texts in each category, and untagged texts are
then assigned to the existing categories with a suitable text classification algorithm.
In general, the process of text classification can be divided into four steps: preprocessing the text, extracting features,
building the classifier and evaluating the classification results. At present, research on text classification algorithms
concentrates on two aspects: feature extraction and classifier construction [1]. This paper focuses on text classification
algorithms, mainly the decision tree algorithm, the Bayes algorithm and the KNN algorithm, all of which are technical
foundations of web text classification.
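
The four steps above can be illustrated with a minimal Python sketch. It is not the method of this paper: the tiny corpus, the function names and the simple term-overlap classifier (which scores a text against a summed term-frequency profile for each class) are all assumptions chosen only to keep the example short, and any of the algorithms discussed in this paper could take the place of the classifier step.

```python
import re
from collections import Counter

def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def extract_features(tokens):
    """Step 2: represent a document as a bag-of-words term-frequency vector."""
    return Counter(tokens)

def build_classifier(labelled_docs):
    """Step 3: sum the feature vectors of each class into one profile per class."""
    profiles = {}
    for text, label in labelled_docs:
        features = extract_features(preprocess(text))
        profiles.setdefault(label, Counter()).update(features)
    return profiles

def classify(profiles, text):
    """Assign the class whose profile shares the most term weight with the text."""
    features = extract_features(preprocess(text))

    def overlap(label):
        return sum(count * profiles[label][term] for term, count in features.items())

    # sorted() makes tie-breaking deterministic (alphabetical order of labels).
    return max(sorted(profiles), key=overlap)

def evaluate(profiles, labelled_docs):
    """Step 4: accuracy, the fraction of held-out documents labelled correctly."""
    correct = sum(classify(profiles, text) == label for text, label in labelled_docs)
    return correct / len(labelled_docs)

# Hypothetical toy data, invented for this illustration only.
train_docs = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]
test_docs = [
    ("a goal in the final match", "sports"),
    ("the bank cut interest rates", "finance"),
]

profiles = build_classifier(train_docs)
accuracy = evaluate(profiles, test_docs)
print("accuracy on the toy test set:", accuracy)
```

In this sketch the evaluation uses plain accuracy on a held-out set; a real system would normally also remove stop words during preprocessing and report precision and recall per category.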
similarity between the known text and unseen text. Then it can tell which categories the new text belongs to through
the established classifier.
In the process of text classification, constructing the classifier according to a text classification algorithm is an
essential step. The algorithm is the theoretical basis for implementing the classifier and is also the focus of current
research. Among the many text classification algorithms available at present, the most representative are the KNN
algorithm, the Bayes algorithm, the decision tree algorithm and SVM (support vector machine). All of these algorithms
are widely applied in various data mining areas, and many optimized variants based on them have also been put into
practical use.
as to determine the class of the tested text. The core idea of this algorithm is as follows: for a new text to be
classified, find the K texts in the training text set that are most similar to it, then classify the new text according to
the classes of those K texts. In other words, if the majority of the K most similar texts belong to one class (this can be
determined by calculating weights), the new text also belongs to that class. Note that in the KNN algorithm, all K
selected training texts have already been correctly categorized. Moreover, in the final decision, the class of the new
text is judged only by the dominant class among its nearest one or several texts.
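
The KNN decision rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this paper: the similarity measure (cosine similarity over term-frequency vectors), the similarity-weighted vote, the toy training set and all function names are choices made for this example.

```python
import math
import re
from collections import Counter, defaultdict

def vectorize(text):
    """Turn a text into a term-frequency vector (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(training_set, new_text, k=3):
    """Classify new_text by a similarity-weighted vote of its K nearest texts."""
    query = vectorize(new_text)
    scored = [(cosine_similarity(query, vectorize(text)), label)
              for text, label in training_set]
    # Keep the K training texts most similar to the new text.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Each neighbour votes for its class with a weight equal to its similarity.
    votes = defaultdict(float)
    for similarity, label in nearest:
        votes[label] += similarity
    # sorted() makes tie-breaking deterministic (alphabetical order of labels).
    return max(sorted(votes), key=votes.get)

# Hypothetical, correctly labelled training texts invented for this example.
training_set = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the goalkeeper", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
    ("investors sold bank shares", "finance"),
]

print(knn_classify(training_set, "bank shares fell after the rates decision", k=3))
print(knn_classify(training_set, "the goalkeeper saved a late goal", k=3))
```

Because every training text must be compared with the new text, the cost of one classification grows with the size of the training set, and the choice of K affects the result; both points are worth checking on real data.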
5. CONCLUSION
The general principle of a text classification algorithm is to use the features of the data in the training text set to
find or construct a vector model or hypothesis in space, so as to determine the class of a given text. Its purpose is to
make the classification results produced by the algorithm match the actual classification of the texts as closely as
possible. Text classification algorithms play an essential role in automatic text classification systems. However, they
also have many shortcomings, which are caused by characteristics of text such as polysemy, multi-words and
ambiguity. In future studies and applications, more effort is needed to improve text classification algorithms so that
the classified results come closer to the actual categories of the texts.
6. ACKNOWLEDGMENTS
National Natural Science Foundation (61373148); National Social Science Fund (12BXW040); Shandong Province
Natural Science Foundation (ZR2012FM038, ZR2011FM030); Shandong Province Outstanding Young Scientist
Award Fund (BS2013DX033); Science Foundation of Ministry of Education of China (14YJC860042).
REFERENCES
[1] Zhao Yan, Zhou Bin & Chen Ruhua, 2013, 12 (10), Study on Text Sorting Algorithms [J], Software Guide.
[2] Tao Wei, Ma Jiming & Zhang Suzhi, 2009, 5 (13), Analysis on Decision Tree Algorithm and Its Application [J],
Computer Knowledge and Technology.
[3] Mao Guojun, Wang Shi & Duan Lijuan, 2005, Principle and Algorithm of Data Mining [M], Beijing, Tsinghua
University Press.
[4] Ma Zhiyuan & Cao Baoxiang, 2013, Application of the Improved Decision Tree Algorithm in Invasion Detect [J],
Computer Technology and Development.
[5] Chen Hongyu, 2009, Study on the Bayes Algorithm in Data Mining [J], Disc Technology.
[6] Zhang Huazhong, 2013, Study on Bayes Algorithm [J], Digital Technology and Application.
[7] Wang Dafu, 2009, Study on the Email Filter System based on Bayes Algorithm [J], Computer & Information
Technology.
[8] Zhang Ning, Jia Ziyan & Shi Zhongzhi, 2005, 31 (8), Text Classification Based on KNN Algorithm [J], Computer
Engineering.
[9] Huang Wei, 2011, 6, Application of KNN in Enterprise Information Search [J], Information Technology.
[10] Cao Wei & Zhang Naizhou, 2010, 19 (10), Webpage Classification Algorithm based on C4.5 Decision Tree [J],
Computer System & Application.
AUTHOR
ZHU Zhenfang, PhD, lecturer, was born in 1980 in Linyi City, Shandong Province. He obtained his
Ph.D. in management engineering and industrial engineering at Shandong Normal University in
2012. His main research fields include network information security, network information
filtering and information processing. He is currently a lecturer at Shandong Jiaotong
University and has published more than 30 papers over the years.