
International Journal of Computer Trends and Technology, Volume 3, Issue 4, 2012

A Study on the Architecture for Text Categorization and Summarization


Suvidha #1, Ravikishan #2

# Computer Science and Engineering, JNT University, V. R. Siddhartha Engineering College, Vijayawada, India

Abstract- Most search techniques only check whether a word is present in a document; they do not check whether the presence of the word is meaningful. Text categorization is the process in which a given document or set of documents is searched through, and text summarization is the process in which the subjectivity of the given documents is found. Combining the two provides a meaningful search technique. This paper proposes an architecture that combines text categorization and text summarization for document search. A Term Frequency and Inverse Document Frequency (TFIDF) style equation, combined with various machine learning techniques, is used for text categorization and text summarization. Here n documents are considered and searched. Search results are displayed along with the subjectivity of each document, so that the user gets the matching documents together with their subjectivity and can quickly identify the wanted document.

Keywords- Machine learning, text categorization, text summarization, TFIDF style equation.

I. INTRODUCTION

Text categorization is a fast-moving research field. Present text categorization techniques focus mainly on how often a word is repeated. The proposed system does not consider the maximum repetition of a word; rather, it considers whether the repetition of the word is meaningful. If a word is repeated many times in a document without any meaning, the document is not an appropriate match. To identify whether a word is meaningful, we use ensemble learning techniques over features such as the distribution, compactness, absolute value, and appearance of the word in the document. Text categorization alone is not accurate, so the proposed system uses text summarization along with it. Text summarization is the process by which the subjectivity of a document is known.

One of the most important areas where summaries can be applied is the categorization of text documents, where the goal of a system is to assign each document to one or more predefined categories. Most studies focus on the subjective impact of summarization, where the quality of a summary and its utility for a particular task are judged by a group of human experts. In this work, we focus on the task of automatic document categorization in scenarios where a document's summary is functionally equivalent to reducing the number of features of the original.

The paper is organized as follows. Section II describes the related work in this area. The proposed architecture is discussed in Section III.

II. RELATED WORK

Until now, researchers have considered the frequency of a word in combination with machine learning techniques. Franca Debole et al. [1], in Supervised Term Weighting for Automated Text Categorization, discuss the phase of term selection and weighting, in which document weights for the selected terms are computed. This process uses supervised learning techniques. Term weighting methods designed for IR (Information Retrieval) applications such as text categorization and text filtering are used for supervised learning, but the work does not consider the performance of the effective documents.

Inderjit S. Dhillon [2] worked on an information theoretic feature clustering algorithm for text classification, which covers feature clustering and feature selection of word clusters, but does not consider meaningful documents or text summarization for better search.

Johannes Furnkranz et al. [3] presented the effect of using n-grams (sequences of words of length n) for text categorization. Stop words are removed and word sequences of length 2 or 3 are used; using longer sequences reduces classification performance. Training data is most useful to machine learning algorithms when it is represented as a set of feature vectors, and feature vectors can be built with a set-of-words approach. The approach is limited in the documents that can be used.

ISSN: 2231-280

http://www.internationaljournalssrg.org

Page 637

Man Lan, Sam-yuan Sung et al. [4], in a comparative study on term weighting schemes for text categorization, mainly compare various term weighting schemes. They vary only the term weighting scheme, using a bag-of-words approach, and do not consider the absolute values of term frequency or an improved IDF factor for text categorization.

Pascal Soucy et al. [5] considered TFIDF weighting for text categorization in the vector space model. They describe weighting approaches in text categorization, weighting methods based on confidence, and supervised weighting, and they introduce a new weighting method based on statistical estimation of the importance of a word for a specific categorization problem. This method also makes feature selection implicit, since features that are useless for the categorization problem considered get a very small weight. However, they do not consider the compactness of the keywords.

Sam Scott, Stan Matwin et al. [7] discuss feature engineering for text classification. The authors examine alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms and hyponyms) and focus mainly on the performance of text classification. The difficulty of detecting the micro-averaging step of the analysis still has to be evaluated.

Thorsten Joachims et al. [8] discuss text categorization with support vector machines, learning with many relevant features. The paper mainly explores the use of support vector machines (SVMs) for text classification, identifies the benefits of SVMs for text categorization, and compares SVMs with four standard methods. Each method represents a different machine learning approach: density estimation using a naive Bayes classifier, the Rocchio algorithm, a k-nearest neighbour classifier, and a decision tree rule learner. However, learning text categorization problems in this way is a difficult and time-consuming process.

Xiao-Bing Xue and Zhi-Hua Zhou et al. [9] considered distributional features for text categorization. They explore the effect of other types of values that express the distribution of a word in the document; these novel values assigned to a word are called distributional features. These features are still not enough to fully capture the information contained in the document. The limitation is that no specific combinations of Term Frequency, Compactness, and First Appearance are given; how to find optimal combinations for different tasks is an important practical issue.

Yanlei Diao, Hongjun Lu et al. [10] worked toward a web query processing approach with learning capabilities. They describe techniques for modelling HTML pages and the knowledge used for query processing. Search engines are widely used to locate information across the web; here a user can issue queries in free-text sentences and get the results in the form of segments containing the required information. Only web documents or web pages play a major role in this web query processing, and HTML page segmentation is the important issue.

Yiming Yang, Jan O. Pedersen et al. [11] present a comparative study on feature selection in text categorization, focusing on aggressive dimensionality reduction. Five methods were evaluated: term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). Feature selection in text categorization is done based on these five methods, but the subjectivity of the documents is not addressed.

Youngjoong Ko, Jinwoo Park et al. [12] explore improving text categorization using the importance of sentences, which they measure using text summarization techniques. Four kinds of classifiers are used in the experiment: naive Bayes, Rocchio, k-NN, and SVM. A supervised learning algorithm is applied using a training data set of categorized documents. The paper presents a new indexing method for text categorization using two kinds of text summarization techniques: one uses the title and the other uses the importance of terms. It does not consider how important a document is, nor does it compare documents to determine which is better.

To overcome these drawbacks, an architecture is proposed that performs both text categorization and summarization. Section III gives a detailed description of this architecture.

III. PROPOSED SYSTEM

The proposed system summarizes the documents and calculates the term frequency, absolute value, compactness, distribution, and M-parts, followed by an efficiency comparison. Finally, it shows the results along with their subjectivity. Figure 1 presents the Text Categorization and Text Summarization architecture of the proposed system.
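As a rough illustration of the quantities listed above, the sketch below builds a positional inverted index of the kind the conclusion mentions and reads term frequency, first appearance, and a compactness value from it. The paper does not give formulas for these features, so the definitions here are assumptions: first appearance is taken as the index of the first occurrence, and compactness as the span between the first and last occurrence. The function names `build_positional_index` and `distributional_features` are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split into words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def distributional_features(index, word, doc_id):
    """Return the assumed feature values for one word-document pair."""
    positions = index.get(word, {}).get(doc_id, [])
    if not positions:
        return None
    return {
        "term_frequency": len(positions),
        "first_appearance": positions[0],                   # assumed definition
        "compactness_span": positions[-1] - positions[0],   # assumed definition
    }

docs = {
    "d1": "Text search finds text. Summaries help search.",
    "d2": "Categorization assigns each document to a category.",
}
index = build_positional_index(docs)
features = distributional_features(index, "text", "d1")
```

Because positions are stored once per word-document pair, all three values come from a single lookup rather than a rescan of the document.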

[Figure 1: architecture flow with the stages Upload Documents, Text Categorization, Text Summarization, Summary]

Fig.1. Text Categorization and Text Summarization (TCTS)

A. Upload Documents

Initially upload n documents for text categorization and text summarization.

B. Text Categorization

Fig. 2 represents the text categorization of distributional features.

[Figure 2: stages Auto Segmentation, Position Tagging, Feature Extraction, TFIDF, Plotting]

Fig.2. Text Categorization

1) Auto Segmentation: It identifies the search word and the stop words.

2) Position Tagging: It records the position of the word, that is, where the word appears.

3) Feature Extraction: It describes the overall features of the search words, which are used to generate results for distribution, compactness, and absolute values.

4) TFIDF: The TFIDF (Term Frequency Inverse Document Frequency) weight is often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the TFIDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the TFIDF weight for each query term. The following modules are calculated from the term frequency: calculation of term frequency, calculation of absolute value, calculation of distribution, calculation of M-Parts, and efficiency comparison.

5) Text Documents: Text documents are collected from predefined texts on various subjects. These documents are uploaded, and machine learning techniques based on the TFIDF style equation are applied to them.

6) Plotting: Results are displayed based on the rank of the entire document. The best three matching results are displayed, and those documents can be viewed using this application.

C. Text Summarization

After the text categorization module is done, the subjectivity of the text is determined through text summarization techniques. First, we construct unique-feature data sets. The N best unique features for a document, i.e., its ideal summary, are obtained in the following manner:

1) For all features available to the classifier (i.e., extracted from the training document collection), a relevance weight is assigned by the feature selection technique. This step is performed just once and its results are shared by all subsequent steps.

2) For each document, its set of unique term features is identified and then ranked according to their relevance weights; the top N elements are retained to be used by a classifier.
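The two summarization steps above can be sketched as follows. In the paper the relevance weights come from a feature selection technique run once over the training collection; here they are a hand-written dictionary with illustrative values, and the names `relevance` and `summarize` are hypothetical.

```python
def summarize(document_tokens, relevance, n):
    """Keep the n unique terms of a document with the highest relevance weight."""
    unique_terms = set(document_tokens)
    # Terms unseen during feature selection get weight 0.0 (an assumption);
    # ties are broken alphabetically so the result is deterministic.
    ranked = sorted(unique_terms, key=lambda t: (-relevance.get(t, 0.0), t))
    return ranked[:n]

# Step 1: weights assigned once for all features (illustrative values only).
relevance = {"categorization": 0.9, "summary": 0.7, "document": 0.4, "the": 0.01}

# Step 2: per document, rank its unique terms and retain the top N.
tokens = ["the", "summary", "of", "the", "document", "aids", "categorization"]
top_terms = summarize(tokens, relevance, n=3)
```

The retained terms act as the reduced feature set that is handed to the classifier in place of the full document.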

D. Summary

Based on the text categorization and text summarization, results are displayed according to the rank of the entire document, and the subjectivity of each document is shown next to it.
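A minimal sketch of the scoring and ranking described in this section: each document is scored by summing a TFIDF weight over the query terms, and the best three matches are returned. The paper does not state its exact TFIDF style equation, so the common form tf × log(N / df) is assumed here, and the function names `tfidf_scores` and `top_matches` are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(docs, query_terms):
    """Score each document by summing tf * log(N / df) over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term]:  # skip terms that occur in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores

def top_matches(scores, k=3):
    """Return the k highest-scoring document ids, best first."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

docs = {
    "d1": "text categorization assigns text to categories",
    "d2": "summarization shortens a document",
    "d3": "categorization and summarization of text",
    "d4": "weather report for today",
}
scores = tfidf_scores(docs, ["text", "categorization"])
best = top_matches(scores)
```

A word that appears in every document gets log(N / N) = 0 under this form, which is how the corpus frequency offsets raw repetition.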
IV. CONCLUSION

The proposed distributional features are exploited by a TFIDF style equation, and the different features are combined using ensemble learning techniques. The extraction of the distributional features is efficiently implemented using the inverted index constructed for the corpus. Using this type of index, for a given word-document pair, we can obtain not only the frequency of the word but also the positions where the word appears and the compactness of the keyword. We can also find the optimal combinations of distributional features based on analysis: comparing which documents rank higher than others, determining how important a document is, and finding the optimal combinations for text categorization and ranking. Experiments show that the text summarization techniques are useful for text categorization, especially when they are combined with term frequency or combined together.

V. FUTURE WORK

Previous research on text categorization usually used the appearance or the frequency of appearance to characterize a word. These features are not enough to fully capture the information contained in a document. The research reported here extends preliminary research that advocates using distributional features of a word in text categorization. Further analysis reveals that the effect of the distributional features is obvious when the documents are long and when the writing style is informal. Here we concentrated on word documents; this work can be extended further to web documents, PDF, etc.

VI. REFERENCES

[1] Franca Debole, Fabrizio Sebastiani, Supervised Term Weighting for Automated Text Categorization, Istituto di Scienza e Tecnologie dell'Informazione, Via G. Moruzzi 1, 56124 Pisa, Italy, 2003.
[2] Inderjit S. Dhillon, A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification, Journal of Machine Learning Research 3, 2003.
[3] Johannes Furnkranz, A Study Using n-gram Features for Text Categorization, Technical Report OEFAI-TR-98-30.
[4] Man Lan, Sam-yuan Sung, A Comparative Study on Term Weighting Schemes for Text Categorization, @12r.a-star.edu.sg, 2005.
[5] Pascal Soucy, Beyond TFIDF Weighting for Text Categorization in the Vector Space Model, 2005.
[6] Reuters-21578 text categorization test collection, Distribution 1.0, http://www.research.att.com/~lewis, 1999.
[7] Sam Scott, Stan Matwin, Feature Engineering for Text Classification, cognitive science department, Canada, 1999.
[8] Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, volume 1398/1998, 137-142, 1998.
[9] Xiao-Bing Xue and Zhi-Hua Zhou, Distributional Features for Text Categorization, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 3, 2009.
[10] Yanlei Diao, Hongjun Lu, Songting Chen, Zengping Tian, Toward Learning Based Web Query Processing, 2000.
[11] Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Carnegie Mellon University, Pittsburgh, PA 15213-3702, USA, 1997.
[12] Youngjoong Ko, Jinwoo Park, Improving Text Categorization Using the Importance of Sentences, Department of Computer Science, Seoul 121-742, 2002.
[13] WWW.daviddlewis.com/resources/testcollections.
[14] http://www.dcs.gla.ac.uk/idom/ir-resources/linguistic-utiles/stopwords.
