
MINOR PROJECT II REPORT TEXT MINING : REUTERS-21578

SUBMITTED BY :
Aarshi Taneja (10104666) Divya Gautam (10104673) Nupur (10104676) Shruti Jadon (10104776) Batch: IT-B10 Group Code: DMB10G04

TABLE OF CONTENTS

Abstract
Problem definition
Data set chosen
Preprocessing applied
Description of algorithms
Actual implementation
Results
Screenshots
Future work
References

Abstract
Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. If a document belongs to exactly one of the categories, it is a single-label classification task; otherwise, it is a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent years from both researchers in academia and industry developers.

Information Retrieval
Information Retrieval (IR) is the science of searching for information within relational databases, documents, text, multimedia files, and the World Wide Web. The applications of IR are diverse; they include, but are not limited to, extraction of information from large documents, searching in digital libraries, information filtering, spam filtering, object extraction from images, automatic summarization, document classification and clustering, and web searching. The breakthrough of the Internet and web search engines has urged scientists and large firms to create very large-scale retrieval systems to keep pace with the exponential growth of online data. The figure below depicts the architecture of a general IR system. The user first submits a query, which is executed over the retrieval system. The latter consults a database of document collections and returns the matching documents.

In general, in order to learn a classifier that is able to correctly classify unseen documents, it is necessary to train it with some pre-classified documents from each category, so that the classifier can generalize the model it has learned from the pre-classified documents and use that model to correctly classify the unseen documents.

Problem definition
Our project is about categorizing news articles into various categories. We work on two major scenarios:
a. Classification of documents into various categories, packaged as an application where the user can upload an article and we classify it into the appropriate categories.
b. When the user enters keywords, we show the most relevant document to the user.

Data Set Chosen


The data set used for this project is distributed in the form of SGML files. The Reuters-21578 dataset is available at:
http://www.daviddlewis.com/resources/testcollections/reuters21578/

There are 21578 documents; according to the 'ModApte' split there are 9603 training documents, 3299 test documents and 8676 unused documents. They were labeled manually by Reuters personnel. Labels belong to 5 different category classes: 'exchanges', 'orgs', 'people', 'places' and 'topics'. The total number of categories is 672, but many of them occur only very rarely. The dataset is divided into 22 SGML files of up to 1000 documents each, delimited by SGML tags.
A sample SGML file:
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE> <TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES> <UNKNOWN> &#5;&#5;&#5;F Y &#22;&#22;&#1;f0708&#31;reute d f BC-STANDARD-OIL-&lt;SRD>-TO 02-26 0082</UNKNOWN> <TEXT>&#2; <TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE> <DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America Inc said they plan to form a venture to manage the money market borrowing and investment activities of both companies. BP North America is a subsidiary of British Petroleum Co Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil. The venture will be called BP/Standard Financial Trading and will be operated by Standard Oil under the oversight of a joint management committee. Reuter &#3;</BODY></TEXT> </REUTERS>

Each article starts with an "open tag" of the form

<REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>

where:
LEWISSPLIT: The possible values are TRAINING, TEST, and NOT-USED. TRAINING indicates the story was used in the training set in the experiments reported in LEWIS91d (Chapters 9 and 10), LEWIS92b, LEWIS92e, and LEWIS94b. TEST indicates it was used in the test set for those experiments, and NOT-USED means it was not used in those experiments.
NEWID: The identification number (ID) the story has in the Reuters-21578, Distribution 1.0 collection. These IDs are assigned to the stories in chronological order.
<TOPICS>: Encloses the list of TOPICS categories, if any, for the document. If TOPICS categories are present, each is delimited by the tags <D> and </D>.
<BODY>: The main text of the story.
<AUTHOR>: Author of the story.
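As an illustration, the article fields can be pulled out of each SGML file with simple pattern matching. The following is a minimal sketch only; the file name and the use of java.util.regex are assumptions for illustration and are not part of the project code, which converts the SGML files to text in its own way.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReutersSgmlReader {
    // Matches one <REUTERS ...> ... </REUTERS> article, capturing LEWISSPLIT, NEWID and the article content.
    private static final Pattern ARTICLE = Pattern.compile(
            "<REUTERS[^>]*LEWISSPLIT=\"(\\w+)\"[^>]*NEWID=\"(\\d+)\"[^>]*>(.*?)</REUTERS>",
            Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    public static void main(String[] args) throws Exception {
        // Hypothetical path to one of the 22 SGML files.
        String sgml = new String(Files.readAllBytes(Paths.get("reut2-000.sgm")));
        Matcher article = ARTICLE.matcher(sgml);
        while (article.find()) {
            String split = article.group(1);   // TRAIN / TEST / NOT-USED
            String newId = article.group(2);   // chronological story ID
            Matcher body = BODY.matcher(article.group(3));
            String text = body.find() ? body.group(1) : "";
            System.out.println(newId + " [" + split + "] "
                    + text.substring(0, Math.min(60, text.length())));
        }
    }
}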

Preprocessing applied
Document Term Weighting
Document indexing is the process of mapping a document into a compact representation of its content that can be interpreted by a classifier. The techniques used to index documents in TC are borrowed from Information Retrieval, where text documents are represented as a set of index terms which are weighted according to their importance for a particular document. A text document d_j is represented by an n-dimensional vector of index terms or keywords, where each index term corresponds to a word that appears at least once in the initial text and has a weight associated with it, which should reflect how important this index term is.

Term Frequency / Inverse Document Frequency


In this case, which is the most usual in TC, the weight of a term in a document increases with the number of times the term occurs in that document and decreases with the number of documents in the collection in which the term occurs. This means that the importance of a term in a document is proportional to the number of times the term appears in the document, while it is inversely proportional to how widespread the term is in the entire collection. This term-weighting approach is referred to as term frequency / inverse document frequency (tf-idf). Formally, w_ij, the weight of term t_i for document d_j, is defined as:

w_ij = tf_ij * log(|D| / df_i)

where tf_ij is the number of times that term t_i appears in document d_j, |D| is the total number of documents in the collection, and df_i is the number of documents where term t_i appears.
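A minimal sketch of this weighting, assuming the raw term counts per document are already available; the class and variable names below are illustrative and are not taken from the project code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    /** Computes w_ij = tf_ij * log(|D| / df_i) for every term of every document. */
    public static List<Map<String, Double>> weigh(List<Map<String, Integer>> termCounts) {
        int totalDocs = termCounts.size();                   // |D|
        Map<String, Integer> docFreq = new HashMap<>();      // df_i: documents containing each term
        for (Map<String, Integer> doc : termCounts)
            for (String term : doc.keySet())
                docFreq.merge(term, 1, Integer::sum);

        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> doc : termCounts) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double idf = Math.log((double) totalDocs / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);       // tf * idf
            }
            weights.add(w);
        }
        return weights;
    }
}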

Term Distributions
Term distributions are a more recent and more sophisticated term-weighting approach than tf-idf, based on term frequencies within a particular class and within the collection of training documents. The weight of a term using term distributions is determined by combining three different factors that depend on the average term frequency of term t_i in the documents of class c_k, represented as:

avg_tf(t_i, c_k) = (1 / |c_k|) * sum over d_j in c_k of tf_ij

where c_k represents the set of documents that belong to class k, |c_k| is the number of documents belonging to class c_k, tf_ij is the frequency of term t_i in document d_j of class c_k, and |C|, which will be used in the following formulas, represents the number of classes in the collection.

Stop Words
Words that are of little value in conveying the meaning of a document, and which happen to have a high frequency, are dropped entirely during the tokenization process. These words are called stop words and are generally detected either by their high frequency or by matching them against a dictionary. A typical stop list consists of a few dozen semantically non-selective words that are very common in English.

Dropping stop words sounds like a very good way to discard useless, redundant words; however, this is not the case for some phrases. Imagine a user searching for "President of the United States" or "Flight to London": the Information Retrieval system would then search for "President" and "United States" separately, or for "Flight" and "London" separately. This would of course return an incorrect result that does not reflect the user's initial query.
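A minimal sketch of stop-word removal during tokenization. The stop list here is a small illustrative sample, not the project's longstoplist.txt.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
            "in", "is", "it", "of", "on", "that", "the", "to", "was", "with"));

    /** Tokenizes on whitespace and drops any token found in the stop list. */
    public static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+"))
            if (!token.isEmpty() && !STOP_WORDS.contains(token))
                kept.add(token);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("President of the United States"));  // [president, united, states]
    }
}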

Document Length Normalization


Naturally, long documents contain more terms than short documents. Considering that the similarity between documents can be measured by how many terms they have in common, long documents will have more terms in common with other documents than short documents, so they will appear more similar to other documents than short documents do. To counteract this tendency, term weights for TC tasks are usually normalized so that document vectors have unit length.
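A minimal sketch of this normalization, dividing every term weight by the Euclidean length of the document vector so that the result has unit length; the map-based vector representation is an assumption for illustration.

import java.util.HashMap;
import java.util.Map;

public class LengthNormalizer {
    /** Returns a copy of the vector scaled so that its Euclidean norm is 1. */
    public static Map<String, Double> normalize(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double w : vector.values()) sumOfSquares += w * w;
        double length = Math.sqrt(sumOfSquares);

        Map<String, Double> unit = new HashMap<>();
        for (Map.Entry<String, Double> e : vector.entrySet())
            unit.put(e.getKey(), length == 0 ? 0.0 : e.getValue() / length);
        return unit;
    }
}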

Case-Folding
A typical strategy is to do case-folding by converting all uppercase characters to lowercase characters. This is a form of word normalization in which all words are reduced to a standard form. It would equate Door with door and university with UNIVERSITY. This sounds very nice; however, a problem arises when a proper noun such as Black is equated with the color black, or when the company name VISION is equated with the word vision. To remedy this, one can convert to lowercase only words at the beginning of a sentence and words located within titles and headings.
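A minimal sketch of the selective strategy described above, lowercasing only the first word of each sentence. This is a heuristic illustration, not the exact rule used in the project code.

public class CaseFolder {
    /** Lowercases only the first word of each sentence, leaving mid-sentence capitals (likely proper nouns) intact. */
    public static String foldSentenceStarts(String text) {
        StringBuilder out = new StringBuilder();
        boolean sentenceStart = true;
        for (String token : text.split(" ")) {
            out.append(sentenceStart ? token.toLowerCase() : token).append(' ');
            sentenceStart = token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // VISION and Black stay capitalized; sentence-initial The and Door are folded.
        System.out.println(foldSentenceStarts("The company VISION opened a Black office. Door prices fell."));
    }
}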

Relevant Algorithms/Techniques
Classification Methods
This section concerns methods for the classification of natural language text, that is, methods that, given a set of training documents with known categories and a new document (usually called the query), predict the query's category.

Naive Bayes
The Naive Bayes classifier has found its way into many applications due to its simple principle yet powerful accuracy [13].

Bayesian classifiers are based on a statistical principle. Here, the presence or absence of a word in a textual document determines the outcome of the prediction. In other words, each processed term is assigned a probability that it belongs to a certain category. This probability is calculated from the occurrences of the term in the training documents, where the categories are already known. When all these probabilities have been calculated, a new document can be classified according to the sum of the probabilities, for each category, of each term occurring within the document. However, this classifier does not take the number of occurrences into account, which is a potentially useful additional source of information. These classifiers are called naive because the algorithm assumes that all terms occur independently of each other. Given a set of r document vectors, classified along a set C of q classes, Bayesian classifiers estimate the probability of each class c_k given a document d_j as:

P(c_k | d_j) = P(c_k) * P(d_j | c_k) / P(d_j)

In this equation, P(d_j) is the probability that a randomly picked document has vector d_j as its representation, and P(c_k) is the probability that a randomly picked document belongs to class c_k. Because the number of possible documents is very high, the estimation of P(d_j | c_k) is problematic. To simplify this estimation, Naive Bayes assumes that the probability of a given word or term is independent of the other terms that appear in the same document. While this may seem an over-simplification, in fact Naive Bayes presents results that are very competitive with those obtained by more elaborate methods. Moreover, because only words, and not combinations of words, are used as predictors, this naive simplification allows the computation of the model of the data associated with this method to be far more efficient than other, non-naive Bayesian approaches. Using this simplification, it is possible to determine P(d_j | c_k) as the product of the probabilities of each term that appears in the document:

P(d_j | c_k) = product over terms t_i of d_j of P(t_i | c_k)

where each P(t_i | c_k) may be estimated from the frequency with which term t_i appears in the training documents of class c_k.
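A minimal sketch of this classification rule, scoring one class with log-probabilities for numerical stability. The add-one smoothing and the map-based counts are assumptions for illustration, not the exact estimates used in the project code.

import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {
    /**
     * Scores one class: log P(c_k) plus, for each term of the document, log P(t_i | c_k),
     * where P(t_i | c_k) is estimated from term counts in that class with add-one smoothing.
     */
    public static double score(List<String> docTerms,
                               double classPrior,
                               Map<String, Integer> termCountsInClass,
                               int totalTermsInClass,
                               int vocabularySize) {
        double logProb = Math.log(classPrior);
        for (String term : docTerms) {
            int count = termCountsInClass.getOrDefault(term, 0);
            logProb += Math.log((count + 1.0) / (double) (totalTermsInClass + vocabularySize));
        }
        return logProb;   // the class with the highest score is chosen
    }
}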

The Vector Space Model


The vector space model is a data model for representing documents and queries in an Information Retrieval system. Every document and query is represented by a vector whose dimensions, called features, represent the words that occur within them [4]. In that sense, each vector representing a document or query consists of a set of features which denote words, and the value of each feature is the frequency, or number of occurrences, of that particular word in the document itself. Since an IR system usually contains more than one document, the vectors are stacked together to form a matrix. Figure 2 shows a single vector populated with the frequencies of the words contained in document D.

To make things clearer, let's consider an example of finding the occurrences of the words information, processing, and language in three documents Doc1, Doc2, and Doc3. After counting the occurrences of the three words in each of the three documents, Doc1 is represented by the vector D1(1,2,1), Doc2 by the vector D2(6,0,1), and Doc3 by the vector D3(0,5,1).

In order to emphasize the contribution of the higher-value features, those vectors are normalized. By normalization we simply mean converting all vectors to a standard length. This is done by dividing each dimension of a vector by the length of that vector, where the length is calculated as:

length = sqrt((ax * ax) + (ay * ay) + (az * az))

Normalizing D1: length = sqrt((1 * 1) + (2 * 2) + (1 * 1)) = sqrt(6) = 2.449. Dividing each dimension by the length gives 1/2.449 = 0.41, 2/2.449 = 0.81, 1/2.449 = 0.41, so the final result is D1(0.41, 0.81, 0.41). The same applies to D2 and D3, which eventually gives D2(0.98, 0, 0.16) and D3(0, 0.98, 0.19).

Now, in order to determine the difference between two documents, or whether a query matches a document, we calculate the cosine of the angle between the two vectors. When two documents are identical (or when a query completely matches a document) they receive a cosine of 1; when they are orthogonal (share no common terms) they receive a cosine of 0. Returning to the previous example, let's consider a query with corresponding normalized vector Q(0.57, 0.57, 0.57). The first task is to compute the cosines between this vector and our three document vectors.

Sim(D1,Q) = 0.41*0.57 + 0.81*0.57 + 0.41*0.57 = 0.92
Sim(D2,Q) = 0.65
Sim(D3,Q) = 0.67

These results show clearly that D1 is the closest match to Q, followed by D3 and then D2 (remember that the closer the cosine is to 1, the closer the two vectors are).
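The cosine computation from the worked example above can be reproduced with a short sketch; the vectors are plain arrays here purely for illustration.

public class CosineExample {
    /** Cosine similarity; equivalent to normalizing both vectors and taking their dot product. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] q  = {1, 1, 1};   // query containing all three words once
        double[] d1 = {1, 2, 1};   // Doc1 raw term counts
        double[] d2 = {6, 0, 1};   // Doc2
        double[] d3 = {0, 5, 1};   // Doc3
        // Prints roughly 0.94, 0.66, 0.68; the worked example's 0.92/0.65/0.67 differ slightly
        // only because its hand-normalized components were rounded to two decimals first.
        System.out.printf("Sim(D1,Q)=%.2f Sim(D2,Q)=%.2f Sim(D3,Q)=%.2f%n",
                cosine(d1, q), cosine(d2, q), cosine(d3, q));
    }
}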

Term Graph Model


The term graph model is an improved version of the vector space model [13] that weights each term according to its relative importance with regard to term associations. Specifically, a text document Di is represented as a vector of term weights Di = <w_1i, ..., w_|T|i>, where T is the ordered set of terms that occur at least once in at least one document in the collection. Each weight w_ji represents how much the corresponding term t_j contributes to the semantics of document Di. Although a number of weighting schemes have been proposed (e.g., boolean weighting, frequency weighting, tf-idf weighting, etc.), those schemes determine the weight of each term individually. As a result, important and rich information regarding the relationships among the terms is not captured by those schemes. We propose to determine the weight of each term in a document collection by constructing a term graph. The basic steps are as follows:
1. Preprocessing Step: For a collection of documents, extract all the terms.
2. Graph Building Step:
(a) Each document is viewed as a transaction: the document ID is the corresponding transaction ID, and the terms contained in the document are the items contained in the corresponding transaction. Association rule mining algorithms can thus be applied to mine the frequently co-occurring terms that occur more than minsup times in the collection.

(b) The frequent co-occurring terms are mapped to a weighted and directed graph, i.e., the term graph.

Preprocessing
In our term graph model, we capture the relationships among terms using the frequent itemset mining method. To do so, we consider each text document in the training collection as a transaction in which each word is an item. However, not all words in a document are important enough to be retained in the transaction. To reduce the processing space as well as increase the accuracy of our model, the text documents need to be preprocessed by (1) removing stopwords, i.e., words that appear frequently in the document but carry no essential meaning; and (2) retaining only the root form of words by stemming their suffixes and prefixes.

Graph Building
As mentioned above, we capture the relationships among terms using the frequent itemset mining method. While this idea has been explored by previous research [9], our approach differs from previous approaches in that we maintain all such important associations in a graph. The graph not only reveals the important semantics of the document, but also provides a basis to extract novel features about the document, as we will show in the next section.

Frequent Itemset Mining. After the preprocessing step, each document in the text collection is stored as a transaction (a list of items) in which each item (term) is represented by a unique non-negative integer. Frequent itemset mining algorithms can then be used to find all the subsets of items that appear more than a threshold number of times (controlled by minsup) in the collection.

Graph Builder. In our system, the goal is to explore the relationships among the important terms of the text in a category and to define a strategy for using these relationships in the classifier and other text mining tasks. The vector space model cannot express such rich relationships among terms. A graph is thus the most suitable data structure in our context because, in general, each term may be associated with more than one term. We use the following simple method to construct the graph from the set of frequent itemsets mined from the text collection. First, we construct a node for each unique term that appears at least once in the frequent itemsets. Then we create an edge between two nodes u and v if and only if they are both contained in one frequent itemset. Furthermore, we assign weights to the edges in the following way: the weight of the edge between u and v is the largest support value among all the frequent itemsets that contain both of them. Example: consider the frequent itemsets and their absolute supports shown in the figure below; the corresponding graph is shown alongside.
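A minimal sketch of the graph-building rule just described: nodes are the unique terms of the frequent itemsets, an edge joins two terms that co-occur in some itemset, and the edge weight is the largest support among the itemsets containing both. The adjacency-map representation and the input format are assumptions for illustration, not the project's adjacency-matrix implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermGraphBuilder {
    /**
     * Builds a weighted term graph from frequent itemsets.
     * weight(u, v) = max support over all itemsets containing both u and v (stored symmetrically).
     */
    public static Map<String, Map<String, Integer>> build(Map<List<String>, Integer> frequentItemsets) {
        Map<String, Map<String, Integer>> graph = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : frequentItemsets.entrySet()) {
            List<String> itemset = e.getKey();
            int support = e.getValue();
            for (int i = 0; i < itemset.size(); i++) {
                graph.putIfAbsent(itemset.get(i), new HashMap<>());
                for (int j = i + 1; j < itemset.size(); j++) {
                    String u = itemset.get(i), v = itemset.get(j);
                    graph.putIfAbsent(v, new HashMap<>());
                    // keep the largest support seen so far for this pair
                    graph.get(u).merge(v, support, Math::max);
                    graph.get(v).merge(u, support, Math::max);
                }
            }
        }
        return graph;
    }
}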

k-Nearest Neighbors
The initial application of k-Nearest Neighbors (k-NN) to text categorization was reported by Masand and colleagues. The basic idea is to determine the category of a given query based not only on the document that is nearest to it in the document space, but on the categories of the k documents that are nearest to it. With this in mind, the Vector method can be viewed as an instance of the k-NN method with k = 1. This work uses a vector-based, distance-weighted matching function, as did Yang, calculating document similarity as in the Vector method. It then uses a voting strategy to find the query's class: each retrieved document contributes a vote for its class, weighted by its similarity to the query. The query's possible classifications are then ranked according to the votes they received in the previous step.
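A minimal sketch of the distance-weighted voting step, assuming the k most similar training documents and their cosine similarities have already been retrieved; the names are illustrative and not taken from the project code (which uses a centroid-based variant, as described later).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnnVoting {
    public static class Neighbor {
        final String category;
        final double similarity;   // cosine similarity to the query
        public Neighbor(String category, double similarity) {
            this.category = category;
            this.similarity = similarity;
        }
    }

    /** Each of the k neighbors votes for its category, weighted by its similarity to the query. */
    public static String classify(List<Neighbor> kNearest) {
        Map<String, Double> votes = new HashMap<>();
        for (Neighbor n : kNearest)
            votes.merge(n.category, n.similarity, Double::sum);
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}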

Actual Implementation
For classifying the documents in Reuters-21578 we initially preprocessed the data using various techniques:
a. Bag of words
b. Stop word removal
c. Tf-idf
d. Case folding
e. Normalisation

After pre-processing we applied the Naive Bayes algorithm to classify the documents in the training set into the five category classes (exchanges, organisations, people, places and topics). We then applied our classifier model to the test documents and calculated the accuracy by comparing the predictions with the default answers given for the test documents. After Naive Bayes we implemented the Term Graph algorithm to build a better classifier model, again checked its accuracy on the test documents, and compared the accuracy of the classification algorithms. To evaluate and compare the algorithms we used the following measures:

Precision is defined as the fraction of the retrieved documents that are relevant, and can be viewed as a measure of the system's soundness, that is:

Precision = TP / (TP + FP)

Recall is defined as the fraction of the relevant documents that is actually retrieved, and can be viewed as a measure of the system's completeness, that is:

Recall = TP / (TP + FN)

Accuracy, defined as the percentage of correctly classified documents, is generally used to evaluate single-label TC tasks:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The Mean Reciprocal Rank (MRR) is calculated for each individual query document as the reciprocal of the rank at which the first correct category was returned, or 0 if none of the first n choices contained the correct category. The score for a sequence of classification queries, considering the first n choices, is the mean of the individual queries' reciprocal ranks:

MRR = (1 / |Q|) * sum over i = 1..|Q| of 1 / rank_i

where rank_i is the rank of the first correct category for query i, considering the first n categories returned by the system, and |Q| is the number of queries.
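A minimal sketch of the MRR computation, assuming that for each query document we already know the rank at which its first correct category was returned (0 meaning it did not appear within the first n choices); the class name is illustrative.

public class MeanReciprocalRank {
    /** ranks[i] is the 1-based rank of the first correct category for query i, or 0 if not found. */
    public static double mrr(int[] ranks) {
        double sum = 0.0;
        for (int rank : ranks)
            sum += (rank > 0) ? 1.0 / rank : 0.0;
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // e.g. correct category returned first, third, and not at all for three query documents
        System.out.println(mrr(new int[] {1, 3, 0}));   // (1 + 1/3 + 0) / 3 = 0.444...
    }
}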

We then created an application where the user can input some keywords and, based on the algorithm showing the higher accuracy, we show the most relevant document to the user.

WORK FLOW

The news articles are split into a training set and a test set.

Training set:
1. Conversion of the SGML files to text files.
2. Building a LOCAL DICTIONARY for each document using bag of words.
3. Case folding.
4. Stop-word elimination.
5. Building the UNIVERSAL DICTIONARY using the tf-idf algorithm, and storing it.
6. Creating a database for each category.

Test set:
1. Conversion of the SGML files to text files.
2. Keyword extraction using the bag-of-words algorithm.
3. Building a LOCAL DICTIONARY and comparing it against the stored UNIVERSAL DICTIONARY.

Classification and evaluation:
1. Application of the Naive Bayes algorithm to predict each test document's category.
2. Application of the Term Graph model to predict the category of the test documents.
3. Application of k-Nearest Neighbour to classify the test documents.
4. Calculating complexity and comparing the accuracy of the algorithms.
5. Information Retrieval application using the Vector Space Model.

Information Retrieval Application


In this application the user can enter keywords; based on those keywords we show the relevant documents with the highest similarity values, and on selecting one of the shown documents its content is displayed. This application is based on the Vector Space Model for Information Retrieval.

NAIVE BAYES ALGORITHM

Checking the keyword in Test document and storing it in a map.

Calculating yes and no frequency of each keyword in the test document.

Calculating the probability of each keyword of the test document.

Classifying the Test Document into various categories on the basis of probability calculated.

TERM GRAPH ALGORITHM

Setting each unique word occurring the document as nodes of the graph

Making Adjacency Matrix of the keywords

Making Distance Matrix using Dijkstra

Calculating similarity between the test document keywords and the keywords of each category

Classifying the test document by checking the category with highest similarity value

K-NEAREST NEIGHBOUR
Make vector for every document in the test set.

Make centroid vector for each class.

Calculate similarity between each document vector and class vector

Document belongs to the class for which the similarity is maximum.

VECTOR SPACE MODEL

Make query vector.

Make Document vector.

Calculate similarity between query vector and document vector for each document.

The retrieved document is the one for which the similarity is maximum.

For Feature Selection (Tf-idf)


public class WordFrequencyCmd { private String traindir; private String traincsv; private String traincsv2; private String traincsv3; private String hdfile; private String clist; public WordFrequencyCmd (String a, String b, String c, String d, String e, String f) { this.traindir = a; this.traincsv = b; this.hdfile = c; this.clist = d; this.traincsv2 = e; this.traincsv3 = f; } public void generatekeywords() { Hashtable<String, Integer> result = new Hashtable<String, Integer>(); HashSet<String> words = new HashSet<String>(); List catg = new ArrayList(); //list to store all categories File file = new File(traindir); File[] inputFile=file.listFiles(); InputStream inp; try { FileWriter writer = new FileWriter(traincsv); inp = new FileInputStream(new File(hdfile)); Scanner sc = new Scanner(inp); String word; writer.append("doc-id"); writer.append(','); writer.append("keyword-list"); writer.append(','); // gets one word at a time from input

while (sc.hasNext()) {         // is there another word?
    word = sc.next();          // get next word
    catg.add(word);
    writer.append(word);
    writer.append(',');
    //System.out.println(word);
}

writer.append('\n'); String temp = null; File ignoreFile = new File("E:\\Mining\\longstoplist.txt");

for(int j=0;j<inputFile.length;j++) { BufferedReader br = new BufferedReader(new FileReader(inputFile[j])); String line = ""; StringTokenizer st = null; List keylist = new ArrayList(); while ((line = br.readLine()) != null) { st = new StringTokenizer(line, " "); while (st.hasMoreTokens()) { temp = st.nextToken(); if (st.hasMoreTokens()) { break; } else { keylist.add(temp); } } } WordCounter counter = new WordCounter(); counter.ignore(ignoreFile); counter.countWords(inputFile[j]); String[] wrds = counter.getWords(WordCounter.SortOrder.BY_FREQUENCY); int[] frequency = counter.getFrequencies(WordCounter.SortOrder.BY_FREQUENCY); // System.out.println("for the"+j+"th file"); writer.append(inputFile[j].getName()); writer.append(','); //... Display the results. int n = counter.getEntryCount(); for (int i=0; i<n; i++) { if(frequency[i]>1) { //System.out.println(frequency[i] + " " + wrds[i]); writer.append(wrds[i]+" "+frequency[i]); writer.append("+");

//comparing the values from the hash set and table and if match increase to 1 if (words.contains(wrds[i]) == false) { if (result.get(wrds[i]) == null) result.put(wrds[i], 1); else result.put(wrds[i], result.get(wrds[i]) + 1); words.add(wrds[i]); } else { result.put(wrds[i], result.get(wrds[i]) + 1); } } }

// code for yes/no Iterator it = catg.iterator(); for (int i=0; i < catg.size(); i++) { writer.append(','); if(keylist.contains(it.next())) { writer.append("yes");

} else { writer.append("no"); } } // System.out.println(); writer.append('\n'); }

FileOutputStream out3; PrintStream p3; out3 = new FileOutputStream(clist); p3 = new PrintStream( out3 ); for (Object o: result.entrySet() ) { Map.Entry entry = (Map.Entry) o; int val=Integer.parseInt(entry.getValue().toString()); String k=entry.getKey().toString(); if(val>4){ // System.out.println(k+" "+val); p3.println(k+" "+val); } } //writer.flush(); writer.close(); sc.close(); } catch (IOException iox) { System.out.println(iox); } } public void indocfreq(int nod) { double idf = 0; String word = null; int df = 0; Map m = new HashMap(); try { BufferedReader br = new BufferedReader(new FileReader(clist)); String line = ""; StringTokenizer st = null; while ((line = br.readLine()) != null) { st = new StringTokenizer(line, " "); while(st.hasMoreTokens()) { word = st.nextToken(); df = Integer.parseInt(st.nextToken()); idf = Math.log(nod/df); //System.out.print(word+": "+idf+"\t"); m.put(word,idf); } }

FileWriter writer = new FileWriter(traincsv2); FileWriter writer2 = new FileWriter(traincsv3); String artname = null; BufferedReader br2 = new BufferedReader(new FileReader(traincsv)); String line2 = null; StringTokenizer st2 = null; String y = br2.readLine(); writer.append(y).append("\n"); writer2.append(y).append("\n"); double wt = 0; String temp2 = null; int flag = 0; while ((line2 = br2.readLine()) != null) { st2 = new StringTokenizer(line2,","); artname = st2.nextToken(); writer.append(artname).append(","); writer2.append(artname).append(","); String temp1 = st2.nextToken(); flag = 0;

String[] keyarr = temp1.split("[+\\s]"); for(int i=0; i < keyarr.length-1; i=i+2) { flag = 1; temp2 = keyarr[i]; if (m.get(temp2) != null) { wt = (Double)m.get(keyarr[i]) * Double.parseDouble(keyarr[i+1]); if (wt <= 15) { writer.append(keyarr[i]+" "+Integer.parseInt(keyarr[i+1])+"+"); writer2.append(keyarr[i]+" "+wt+"+"); } } } if (flag == 0) { writer.append(",").append(temp1); writer2.append(",").append(temp1); } while (st2.hasMoreTokens()) { String z = st2.nextToken(); writer.append(",").append(z); writer2.append(",").append(z); } writer.append("\n"); writer2.append("\n"); } writer.close(); writer2.close(); br2.close();

}catch (IOException iox) { System.out.println(iox); } }

For Naive Bayes on Exchange category


public class Naive_Exg { private Map m2; //map for yes private Map m3; //map for no private String clname; private int clnum; private static int first = 0; public Naive_Exg(String st, int x) { this.clname = st; this.m2 = new HashMap(); this.m3 = new HashMap(); this.clnum = x; } public void generatemaps() throws IOException { String csvFile = "E:\\Mining\\training_csvs\\Exg_tf.csv"; BufferedReader br2 = new BufferedReader(new FileReader(csvFile)); String line = ""; line = br2.readLine(); // ignore the first line of headers StringBuffer sb = new StringBuffer(); // buffer for keywords with frequency in yes cases StringBuffer sb2 = new StringBuffer(); // buffer for keywords with frequency in no cases String temp3 = null; StringTokenizer st2 = null; while ((line = br2.readLine()) != null) { st2 = new StringTokenizer(line, ","); int f = 0; st2.nextToken(); //ignore docid while (st2.hasMoreTokens()) { temp3 = st2.nextToken(); if (temp3.equals("yes") || temp3.equals("no")) { // to ignore rest of the classes f = 1; } if(f == 0) { for(int i =0;i < clnum-1; i++){ st2.nextToken(); } } if(f == 0 && st2.nextToken().equals("yes")) { sb.append(temp3); } else if(f==0){ sb2.append(temp3); }

break; } } String keys = sb.toString(); String nokeys = sb2.toString(); String[] keyarr = keys.split("[+\\s]"); String[] nokeyarr = nokeys.split("[+\\s]"); for (int i=0; i <(keyarr.length)-1; i=i+2) { int temp5=Integer.parseInt(keyarr[i+1]); if (m2.get(keyarr[i]) == null) { m2.put(keyarr[i], temp5); } else { m2.put(keyarr[i],(Integer)m2.get(keyarr[i])+ temp5); } } for (int i=0; i <(nokeyarr.length)-1; i=i+2) { int temp5=Integer.parseInt(nokeyarr[i+1]); if (m3.get(nokeyarr[i]) == null) { m3.put(nokeyarr[i], temp5); } else { m3.put(nokeyarr[i],(Integer)m3.get(nokeyarr[i])+ temp5); } } } public int testarticle(double pyes, double pno) throws IOException { double totalprobyes = 1; double totalprobno = 1; System.out.println("class is "+clname); String csvFile2 = "E:\\Mining\\Naive_result\\Exg_result.csv"; BufferedReader br2 = new BufferedReader(new FileReader(csvFile2)); String line2 = ""; StringTokenizer st2 = null; int numyes = 0; StringBuffer outBuffer = new StringBuffer(1024); outBuffer.append(br2.readLine()).append("\n"); String csvFile = "E:\\Mining\\test_csv\\forTesting.csv"; BufferedReader br = new BufferedReader(new FileReader(csvFile)); String line = ""; StringTokenizer st = null; line = br.readLine(); int numofart = 0; while ((line = br.readLine()) != null) {

st = new StringTokenizer(line, ","); String artname = st.nextToken(); if(first == 0) { outBuffer.append(artname).append(","); numofart++; } else { numofart++; if((line2 = br2.readLine()) != null) { st2 = new StringTokenizer(line2, ","); while(st2.hasMoreTokens()) { outBuffer.append(st2.nextToken()).append(","); } } } if(st.hasMoreTokens()) { String temp1 = st.nextToken(); String[] keyarr = temp1.split("[+\\s]"); //for yes String temp2 = null; double temp3 = 0; double x = 0; double y = 0; double probyes = 0; double probno = 0; for(int i=0; i < keyarr.length-1; i=i+2) { temp2 = keyarr[i]; temp3 = Integer.parseInt(keyarr[i+1]); if (m2.get(temp2) != null) { x = (Integer)m2.get(temp2); } else { x = 0; } if (m3.get(temp2) != null) { y = (Integer)m3.get(temp2); } else { y = 0; } probyes = probyes + ( (temp3) * (Math.log((x+1)/(x+y+38))) ); probno = probno + ( (temp3) * (Math.log((y+1)/(x+y+38))) ); } totalprobyes = Math.abs(pyes + probyes); totalprobno = Math.abs(pno + probno); if(totalprobyes > totalprobno && totalprobyes > 500) { outBuffer.append("yes"); numyes++;

} else {

outBuffer.append("no");

} } outBuffer.append("\n"); } String out = outBuffer.toString(); try { FileWriter writer = new FileWriter("E:\\Mining\\Naive_result\\Exg_result.csv"); writer.append(out); writer.close(); }catch (IOException iox) { System.out.println(iox); } first = 1; System.out.println(numofart); return numyes; }

For Term Graph on Exchange category


public class TermGraph_Exg { private int[][] adj; private int [][] T; private int clnum; private ArrayList unikeywords; private String clname; private int siz; private int nVerts; private int[] next; private int current_edge_weight; public TermGraph_Exg(String name,int x) { this.clname = name; this.clnum = x; this.unikeywords = new ArrayList(); this.current_edge_weight = 0; } public void makeAdj(){ try { // for unique keywords list String temp1 = null; BufferedReader br = new BufferedReader(new FileReader("E:\\Mining\\training_csvs\\Exg2_tfidf.csv")); String line = br.readLine(); //avoid first line StringTokenizer st = null; while ((line = br.readLine()) != null) { st = new StringTokenizer(line, ","); st.nextToken(); // avoid article name temp1 = st.nextToken(); if (temp1.equals("yes") || temp1.equals("no")) { // check for keyword list for(int i =0; i < clnum-1-1; i++){ st.nextToken(); }

} else { for(int i =0; i < clnum-1; i++){ st.nextToken(); } } if(st.nextToken().equals("yes")) { String[] keyarr = temp1.split("[+\\s]"); for(int i=0; i < keyarr.length-1; i=i+2) { if( !(unikeywords.contains(keyarr[i])) ) { unikeywords.add(keyarr[i]); } } } } br.close(); this.siz = unikeywords.size(); this.adj = new int [siz][siz]; this.nVerts = siz; this.next = new int[siz]; this.T = new int [siz][siz]; for (int i=0; i < siz; i++){ for(int j=0; j < siz; j++) { adj[i][j] = T[i][j] = 0; } } for(int i=0; i < nVerts; i++) { next[i]=-1; }

// initialize next neighbor

//for adjacency matrix int m = 0; int n = 0; BufferedReader br2 = new BufferedReader(new FileReader("E:\\Mining\\training_csvs\\Exg2_tfidf.csv")); String line2 = br2.readLine(); StringTokenizer st2 = null; while ((line2 = br2.readLine()) != null) { st2 = new StringTokenizer(line2, ","); st2.nextToken(); // avoid article name temp1 = st2.nextToken(); if (temp1.equals("yes") || temp1.equals("no")) { // check for keyword list for(int i =0; i < clnum-1-1; i++){ st2.nextToken(); } } else { for(int i =0; i < clnum-1; i++){ st2.nextToken(); } } if(st2.nextToken().equals("yes")) { String[] keyarr = temp1.split("[+\\s]"); for(int i=0; i < keyarr.length-1; i=i+2) { for(int j=i+2; j < keyarr.length-1; j=j+2) { m = unikeywords.indexOf(keyarr[i]);

n = unikeywords.indexOf(keyarr[j]); if(m > -1 && n > -1) { adj[m][n] = adj[n][m] = 1; } } } } } br2.close(); for (int i=0; i < siz; i++){ for(int j=0; j < siz; j++) { } } } catch (IOException iox) { System.out.println(iox); } } //graph functions public int vertices() { return nVerts; } public int edgeLength(int a, int b) { return adj[a][b]; } public int nextneighbor(int v) { next[v] = next[v] + 1; // initialize next[v] to the next neighbor

// return the number of vertices

// return the edge length

if(next[v] < nVerts) { while(adj[v][next[v]] == 0 && next[v] < nVerts) { next[v] = next[v] + 1; // initialize next[v] to the next neighbor if(next[v] == nVerts) break; } } if(next[v] >= nVerts) { next[v]=-1; // reset to -1 current_edge_weight = -1; } else { current_edge_weight = adj[v][next[v]]; } } return next[v]; // return next neighbor of v to be processed

public void resetnext() {      // reset the array next to all -1's
    for (int i = 0; i < nVerts; i++)

next[i] = -1; } public void dijkstra_function(int s) throws IOException { int u, v; int [] dist = new int[nVerts]; for(v=0; v<nVerts; v++) { dist[v] = 99999; // 99999 represents infinity } dist[s] = 0; PriorityQueue Q = new PriorityQueue(dist); while(Q.Empty() == 0) { u = Q.Delete_root(); v = nextneighbor(u); while(v != -1) { // for each neighbor of u if(dist[v] > dist[u] + edgeLength(u,v)) { dist[v] = dist[u] + edgeLength(u,v); Q.Update(v, dist[v]); } v = nextneighbor(u); } } for(int col=0; col<nVerts; col++) { T[s][col] = dist[col]; } } public void makeTermGraph() throws IOException { for(int i = 0; i < siz; i++) { dijkstra_function(i); } for(int i = 0; i < siz; i++) { for(int j = 0; j < siz; j++) { } } } public double testarticle(String x){ String[] keyarr = x.split("[+\\s]"); double sim = 0; double n = 0; double w = 0; int u = 0; int v = 0; for(int i=0; i < keyarr.length-1; i=i+2) { for(int j=i+2; j < keyarr.length-1; j=j+2) { u = unikeywords.indexOf(keyarr[i]); v = unikeywords.indexOf(keyarr[j]); if(u > -1 && v > -1) { w = 0 + Math.pow(T[u][v],2); // get the next neighbor of u

n = n+1; } } } sim = n/w; return (sim); }

k-nearest neighbour on Exchange Category


public class knn { private Map<String, Double> m2; private String clname; private int clnum; private static ArrayList unikeywords = new ArrayList(); private Map<String, Double> docvec; private String docname; private Map<String, Double> qvec;

public knn(String st, int x) { this.clname = st; this.m2 = new HashMap(); this.qvec =new HashMap(); this.clnum = x; } public static void setList() throws IOException { BufferedReader br = new BufferedReader(new FileReader("E:\\Mining\\commonlist\\exg.txt")); String line = ""; StringTokenizer st = null; while ((line = br.readLine()) != null) { st = new StringTokenizer(line, " "); while(st.hasMoreTokens()) { unikeywords.add(st.nextToken()); st.nextToken(); } } br.close(); } public void generatemaps() throws IOException { String csvFile = "E:\\Mining\\training_csvs\\Exg3_wts.csv"; BufferedReader br2 = new BufferedReader(new FileReader(csvFile)); String line = ""; line = br2.readLine(); // ignore the first line of headers StringBuffer sb = new StringBuffer(); String temp3 = null; StringTokenizer st2 = null; while ((line = br2.readLine()) != null) { st2 = new StringTokenizer(line, ","); int f = 0;

st2.nextToken(); //ignore docid while (st2.hasMoreTokens()) { temp3 = st2.nextToken(); if (temp3.equals("yes") || temp3.equals("no")) { f = 1; // System.out.println(temp3); } //System.out.println("temp3 is "+temp3); for(int i =0;i < clnum-1; i++){ st2.nextToken(); } if(f == 0 && st2.nextToken().equals("yes")) { sb.append(temp3); //System.out.print(temp3+" "); } else if(f==0){ sb2.append(temp3); } break; } } String keys = sb.toString(); String[] keyarr = keys.split("[+\\s]"); for(int i=0; i<unikeywords.size(); i++) { m2.put((String)unikeywords.get(i),0.0); } for (int i=0; i <(keyarr.length)-1; i=i+2) { Double temp5=Double.parseDouble(keyarr[i+1]); if (m2.get(keyarr[i]) == null) { m2.put(keyarr[i], temp5); } else { m2.put(keyarr[i],(Double)m2.get(keyarr[i])+ temp5); } } for (int i=0; i <(keyarr.length)-1; i=i+2) {if (m2.get(keyarr[i]) != null) m2.put(keyarr[i],(Double)m2.get(keyarr[i])/276); } // System.out.println("centroid :"+m2); }

// to ignore rest of the classes

public void setQuevec(String q) { // docname = aname; // System.out.println(aname); String[] keyar = q.split("[+\\s]"); for(int i=0; i<unikeywords.size(); i++) { qvec.put((String)unikeywords.get(i),0.0); }

int temp1 = 0; for(int i=0; i < (keyar.length)-1; i=i+2) { // System.out.println(keyar[i]); // temp1 = qvec.get(keyar[i]); Double temp5=Double.parseDouble(keyar[i+1]); if (qvec.get(keyar[i]) == null) { qvec.put(keyar[i], temp5); } else { qvec.put(keyar[i],(Double)qvec.get(keyar[i])+ temp5); } } // System.out.println("testvector: "+qvec); } public double calcsim() { double sim = 0; double dprod = 0; double dmag = 0; double qmag = 0; double sumofsq = 0; for (Map.Entry<String, Double> entry : m2.entrySet()) { // System.out.println("hey"+entry.getKey()); dprod = dprod + entry.getValue() * qvec.get(entry.getKey()); // System.out.println(dprod); } // System.out.println("hey"); for (Map.Entry<String, Double> entry2 : m2.entrySet()) { sumofsq = sumofsq + Math.pow(entry2.getValue(),2); } dmag = Math.sqrt(sumofsq); sumofsq = 0; for (Map.Entry<String, Double> entry3 : qvec.entrySet()) { sumofsq = sumofsq + Math.pow(entry3.getValue(),2); } qmag = Math.sqrt(sumofsq); sim = dprod/(dmag*qmag); return sim; } public static void main(String[] args) throws IOException { setList(); InputStream inp; List catg = new ArrayList(); inp = new FileInputStream(new File("E:\\Mining\\headerfiles\\all-exchanges.txt")); Scanner sc = new Scanner(inp); // gets one word at a time from input String word = null; //System.out.printf("My Little Program%n%n"); while (sc.hasNext()) { word = sc.next(); catg.add(word); // is there another word? // get next word

} knn [] obj = new knn [catg.size()]; int j = 0; String csvFile = "E:\\Mining\\test_csv\\forTesting.csv"; BufferedReader br = new BufferedReader(new FileReader(csvFile)); String line = ""; StringTokenizer st = null; StringBuffer sb=new StringBuffer(); line = br.readLine(); String artname=null; Iterator it2 = catg.iterator(); String name = null; String temp1 = null; FileWriter writer = new FileWriter("E:\\Mining\\knn.csv"); writer.append("doc-id"); // System.out.println(catg.size()); for (int i=0; i < catg.size(); i++) { name = (String)catg.get(i); obj[i] = new knn(name,i+1); writer.append(","); writer.append(catg.get(i).toString()); } writer.append("\n"); while ((line = br.readLine()) != null) { double s = 0.0; double value=0.0; st = new StringTokenizer(line,","); while(st.hasMoreTokens()) { artname = st.nextToken(); // System.out.println(artname); writer.append(artname).append(","); if(st.hasMoreTokens()) { temp1 = st.nextToken(); int index=0; for (int i=0; i < catg.size(); i++) { obj[i].setQuevec(temp1); // System.out.println(catg.get(i)); obj[i].generatemaps(); s = obj[i].calcsim(); // int flag; if(s>value) { value=s; index=i; } } //System.out.println("class name : "+catg.get(index)); System.out.println(value); if(value!=0.0)

{for(int p=0;p<index;p++) writer.append("no").append(","); writer.append("yes").append(","); for(int p=index+1;p<=catg.size()-1;p++) writer.append("no").append(","); writer.append("\n"); } else { for (int i=0; i < catg.size(); i++) { writer.append("no").append(","); } writer.append("\n"); }

} else { for (int i=0; i < catg.size(); i++) { writer.append("no").append(","); } writer.append("\n"); } } } writer.close(); } }

Vector Space Model for Information Retrieval


public class Vsm_Exg { private static ArrayList unikeywords = new ArrayList(); private Map<String, Double> docvec; private String docname; private static Map<String, Double> qvec = new HashMap <String,Double>(); public Vsm_Exg() { this.docname = null; this.docvec = new HashMap <String,Double>(); } public static void setList() throws IOException { BufferedReader br = new BufferedReader(new FileReader("E:\\Mining\\commonlist\\exg.txt")); String line = "";

StringTokenizer st = null; while ((line = br.readLine()) != null) { st = new StringTokenizer(line, " "); while(st.hasMoreTokens()) { unikeywords.add(st.nextToken()); st.nextToken(); //df = Integer.parseInt(st.nextToken()); } } br.close(); //System.out.println(unikeywords); } public static void setQuevec(String q) { String[] keyar = q.split(" "); for(int i=0; i<unikeywords.size(); i++) { qvec.put((String)unikeywords.get(i),0.0); } double temp1 = 0; for(int i=0; i < keyar.length; i++) { if(qvec.get(keyar[i]) != null){ temp1 = qvec.get(keyar[i]); qvec.put(keyar[i],temp1+1); } } double normval = 0; double dummy = 0; for (Map.Entry<String, Double> entry : qvec.entrySet()) { dummy = dummy + Math.pow(entry.getValue(),2); } normval = Math.sqrt(dummy); for(int i=0; i<unikeywords.size(); i++) { String x = (String)unikeywords.get(i); double val = qvec.get(x)/normval ; qvec.put(x,val); } System.out.println("Queryvector: "+qvec); } public void setDocvec(String aname, String keywrds) throws IOException { docname = aname; for(int i=0; i<unikeywords.size(); i++) { docvec.put((String)unikeywords.get(i),0.0); } //System.out.println("Initial vector: "+docvec); String[] keyarr = keywrds.split("[+\\s]"); for(int i=0; i < keyarr.length-1; i=i+2) { docvec.put(keyarr[i],Double.parseDouble(keyarr[i+1])); } double normval = 0; double dummy = 0; for (Map.Entry<String, Double> entry : docvec.entrySet()) { dummy = dummy + Math.pow(entry.getValue(),2); } normval = Math.sqrt(dummy); for(int i=0; i<unikeywords.size(); i++) {

String x = (String)unikeywords.get(i); double val = docvec.get(x)/normval ; docvec.put(x,val); } System.out.println("Docvector: "+docvec); } public double calcsim() { double sim = 0; double dprod = 0; double dmag = 0; double qmag = 0; double sumofsq = 0; for (Map.Entry<String, Double> entry : docvec.entrySet()) { dprod = dprod + entry.getValue() * qvec.get(entry.getKey()); } for (Map.Entry<String, Double> entry2 : docvec.entrySet()) { sumofsq = sumofsq + Math.pow(entry2.getValue(),2); } dmag = Math.sqrt(sumofsq); sumofsq = 0; for (Map.Entry<String, Double> entry3 : qvec.entrySet()) { sumofsq = sumofsq + Math.pow(entry3.getValue(),2); } qmag = Math.sqrt(sumofsq); sim = dprod/(dmag*qmag); return sim; }

Accuracy Calculation
public class CalAccuracy_TermGraph { private String csvFile; private String csvFile2; private String catname; public CalAccuracy_TermGraph(String a, String b, String c){ csvFile = a; csvFile2 = b; catname = c; } public void acc() throws IOException{ System.out.println(catname+":"); BufferedReader br = new BufferedReader(new FileReader(csvFile)); String line = ""; StringTokenizer st = null;

BufferedReader br2 = new BufferedReader(new FileReader(csvFile2)); String line2 = ""; StringTokenizer st2 = null; line = br.readLine(); line2 = br2.readLine(); String temp1 = null; String temp2 = null;

double a = 0; double b = 0; double c = 0; double d = 0; double accuracy = 0; double precision = 0; double recall = 0; double f = 0; while ((line = br.readLine()) != null && (line2 = br2.readLine()) != null) { st = new StringTokenizer(line, ","); st2 = new StringTokenizer(line2, ","); st.nextToken(); st2.nextToken(); while (st2.hasMoreTokens()) { temp1 = st.nextToken(); //actual temp2 = st2.nextToken(); //predicted if (temp1.equals("yes") && temp2.equals("yes")) { a++; } if (temp1.equals("yes") && temp2.equals("no")) { b++; } if (temp1.equals("no") && temp2.equals("yes")) { c++; } if (temp1.equals("no") && temp2.equals("no")) { d++; } } } accuracy = (((a+d) * 100)/(a+b+c+d)); precision = (a)/(a+c); recall = (a)/(a+b); f = (2*precision*recall)/(precision+recall); System.out.println(a); System.out.println(b); System.out.println(c); System.out.println(d); System.out.println("accuracy : "+accuracy+"% precision : "+precision*100+"% recall : "+recall*100+"% f: "+f); }

Result:
We compared the accuracy of Naive Bayes, Term Graph and k-NN for text classification of the articles of Reuters-21578.

As shown in the bar graph above, we found that k-NN gives the best results, with accuracies (in %) as follows:

Category        k-NN      Naive Bayes    Term Graph
Exchange        98.00     74.68          97.41
Organisation    98.51     51.43          98.23
People          -         33.19          99.61
Topics          -         81.80          99.19
Places          -         72.23          99.19

Our project is about categorizing news articles into various categories. We have built a web application where the user can enter some keywords (in the form of a query) and we retrieve the article matching those keywords by applying the Vector Space Model.

Conclusion:
We conclude that k-NN shows the maximum accuracy as compared to Naive Bayes and Term Graph. The drawback of k-NN is that its time complexity is high, but it gives better accuracy than the others. We used tf-idf with the Term Graph rather than the traditional Term Graph combined with AFOPT; this hybrid shows better results than the traditional combination. Finally, we built an INFORMATION RETRIEVAL APPLICATION using the Vector Space Model that answers the query entered by the client by showing the relevant document.

Screenshots:
Output of feature generation
Screenshot showing the keywords with their respective frequencies for each document and stating whether the document belongs to the header class or not.

Screenshot showing the keywords with their respective weights calculated using the tf-idf algorithm for each document, and stating whether the document belongs to the header class or not.

Output of Naive bayes


Showing the documents and keywords classified on application of Naive Bayes.

Output for Term Graph


Showing the Keywords and the Documents classification by applying Term Graph

Output for k-nearest neighbour

Accuracy output
Showing the accuracy for various categories obtained by applying Term Graph.

Output for Vector Space Model

Output for Vector Space Model with GUI:


Showing the application where user enters a query and we output the appropriate article relating to the query using Vector Space Model.

Future Work
We will focus in the future on:
a. Reducing complexity
b. Increasing accuracy
c. Text summarization

Similar applications are used in Yahoo Alerts, where relevant documents are shown when the user enters keywords.

References
http://www.informatik.uni-hamburg.de/WTM/ps/coling-232.pdf

http://web.mit.edu/6.863/www/fall2012/projects/writeups/newspaperarticle-classifier.pdf


http://jatit.org/volumes/research-papers/Vol3No2/9vol3.pdf

Lijun Wang, Xiquing Zhao. "Improved kNN Classification Algorithm Research in Text Categorization."

Andrew McCallum, Kamal Nigam. "A Comparison of Event Models for Naive Bayes Text Classification."

Zhang Yun-tao, Gong Ling, Wang Yong-cheng. "An Improved TF-IDF Approach for Text Classification" (2004).

Wei Wang, Diep Bich Do, Xuemin Lin. "Term Graph Model for Text Classification."
