
Sample document collection

This is the initial stage of the proposed architecture, in which the sample documents are
collected. These documents belong to four different subjects of the computer science domain, namely
database management systems, computer networks, operating systems, and software engineering.
The extracted documents are arranged in a directory to facilitate the training process.

Indexing of Documents
Indexing is the process of preparing the raw document collection into an easily accessible
representation of documents. This transformation from document text into a representation of the
text is known as indexing of the documents. To transform a document into an indexed form, the
Text to Matrix Generator (TMG) toolbox in MATLAB is used, which involves the following steps:

Figure: TMG indexing pipeline. Input: file name / directory and options; output: term-to-document
matrix (TDM). For each text, the stoplist is read, the file is parsed, and the dictionary is
normalized; the TDM is then constructed and the terms are filtered.

Input: filename, OPTIONS
Output: tdm, dictionary, and several optional outputs

parse files or input directory;
read the stoplist;
for each input file,
    parse the file (construct dictionary);
end
normalize the dictionary (remove stopwords and too long or too short terms, stemming);
construct tdm;
remove terms as per frequency parameters;
compute global weights;
apply local weighting function;
form final tdm;
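
As an illustration, the same sequence of steps can be sketched outside MATLAB. The following
Python fragment is a minimal analogue of the pipeline above, using scikit-learn's
CountVectorizer; the directory path, file pattern, and frequency thresholds are assumptions made
for this sketch and are not TMG parameters.

import glob
from sklearn.feature_extraction.text import CountVectorizer

# Read every text file in the sample-document directory (path is hypothetical).
filenames = sorted(glob.glob("sample_docs/*.txt"))
documents = [open(name, encoding="utf-8", errors="ignore").read() for name in filenames]

# Build the dictionary and the matrix in one pass:
# stop_words applies the stoplist, min_df/max_df filter terms by frequency,
# and token_pattern drops terms shorter than three characters.
vectorizer = CountVectorizer(stop_words="english",
                             min_df=2, max_df=0.5,
                             token_pattern=r"(?u)\b\w{3,}\b")
dtm = vectorizer.fit_transform(documents)        # documents x terms (sparse)
dictionary = vectorizer.get_feature_names_out()

tdm = dtm.T                                      # transpose to terms x documents, as in TMG
print(tdm.shape, len(dictionary))

Note that scikit-learn returns a document-by-term matrix, so the transpose gives the
term-to-document orientation used above; tf*IDF weighting is applied later, in the weighting
stage described below.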

Document Linearization

Document Linearization is the process by which a document is reduced to a stream of terms.


This is usually done in two steps, as follows.

1. Markup and Format Removal: During this phase, the markup tags and formatting tags
are removed from the document.
2. Tokenization: During this phase, the text remaining from the previous phase is parsed,
lowercased, and stripped of all punctuation (as sketched below).
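
As a rough illustration of these two steps, a document can be linearized with a few lines of
Python; the regular expressions and the sample HTML string below are assumptions made for this
sketch.

import re

def linearize(html):
    # Step 1: markup and format removal - strip anything that looks like a tag.
    text = re.sub(r"<[^>]+>", " ", html)
    # Step 2: tokenization - lowercase and keep only alphanumeric runs,
    # which also discards all punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

print(linearize("<html><body><h1>Operating Systems</h1><p>Deadlock, paging.</p></body></html>"))
# ['operating', 'systems', 'deadlock', 'paging']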

Filtration

Filtration is the process of deciding which terms or attributes should be used to represent the
documents so that these terms can

1. Describe the content of the document.


2. Discriminate the document from the other documents in the collection.

For this purpose, frequently occurring terms are not suitable, for two reasons. First, the number of
documents that are relevant to a topic or subtopic is likely to be a small proportion of the
collection. A term that will be effective in separating the relevant documents from the non-
relevant documents, then, is likely to be a term that appears in a small number of documents.
This means that high frequency terms are poor discriminators. The second reason is that terms
appearing in many contexts do not define a topic or sub-topic of a document.

"The more documents in which a term appears (the more contexts in which it is used) then the
less likely it is to be a content-bearing term. Consequently it is less likely that the term is one of
those terms that contribute to the user's relevance assessment. Hence, terms that appear in many
documents are less likely to be the ones used by a searcher to discriminate between relevant and
non-relevant documents."

Figure: Document preprocessing pipeline. A download agent retrieves HTML files from the World
Wide Web; the documents are then linearized and transformed (filtration, indexing, stemming,
weighting) into document vectors.

For these reasons, frequently used terms, or stopwords, are removed from text streams. Stop words
are words whose frequency exceeds some user-specified threshold. Special care is taken so that
important words that occur more frequently are not removed. The stop-word removal is done with the aid
of a publicly available list of stop-words []. Using a public list of stop-words is category independent and
ensures that important words which occur more frequently within a category are not removed. The disadvantage
is that there are many different public lists of stop-words, not all of which are the same. Nevertheless,
a number of these lists can be compared and the most appropriate one chosen.

However, removing stopwords from one document at a time is time consuming. A cost-effective
approach is to remove all terms that appear commonly in the document collection and that will not
improve the retrieval of relevant material.

This can be accomplished with a stopword library, a stop-list of terms to be removed. These
lists contain words that are assumed to have no impact on the meaning of a document. Such a
list usually contains words like ‘the’, ‘is’, ‘a’, etc. During preprocessing, all words matching
the stop word list are removed from the document.

These lists can be either generic (applied to all collections) or specific (created for a given
collection). For instance, in some IR systems terms that appear in more than 5% of a collection
are removed. In others, terms that are not in the stop-list but appear in more than 50% of a
collection are deemed "negative terms" and are also removed to avoid weighting
complications.
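
Both strategies, a generic stop-list and collection-specific frequency cut-offs, can be sketched
in a few lines of Python; the toy documents and the 50% threshold below are illustrative
assumptions, echoing the figures mentioned above rather than the actual system.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the database stores the relation in a table",
        "the network protocol routes the packet",
        "the operating system schedules the process"]

# Generic filtering: stop_words applies a public English stop-word list.
# Collection-specific filtering: max_df=0.5 removes any remaining term that
# appears in more than 50% of the documents ("negative terms").
vec = CountVectorizer(stop_words="english", max_df=0.5)
tdm = vec.fit_transform(docs)
print(vec.get_feature_names_out())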

Stemming

A stemming algorithm is a process of linguistic normalization, in which the variant forms of a


word are reduced to a common form, for example,

connection
connections
connective ---> connect
connected
connecting
It refers to the process of reducing terms to their stems or root variants. Thus, "computer",
"computing", and "compute" are reduced to "compute", and "walks", "walking", and "walker" are
reduced to "walk".

Stemming can be strong (e.g. ‘houses’, ‘mice’ reduced to the word stems ‘hous’, ‘mic’) or weak
(e.g. ‘houses’, ‘mice’ reduced to ‘house’, ‘mouse’). For pre-processing of
English documents the Porter stemming algorithm is often used; other algorithms include
the n-gram stemmer, the Snowball stemming algorithms, Lovins' English
stemmer, etc.
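
For instance, the Porter stemmer as implemented in Python's NLTK library behaves as follows;
the word list is an illustrative assumption.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connection", "connections", "connected", "connecting",
         "computing", "computer", "walks", "walking"]
print([stemmer.stem(w) for w in words])
# The 'connect' family is reduced to 'connect' and 'walks'/'walking' to 'walk';
# 'computing' and 'computer' become the truncated stem 'comput', showing that
# a stem need not be a dictionary word.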

Weighting

Weighting is the final stage of the indexing process. Terms are weighted according to a given
weighting model, which may include local weighting, global weighting, or both. If local weights
are used, term weights are normally expressed as term frequencies, tf. If global weights are
used, the weight of a term is given by its IDF value. The most common (and basic) weighting
scheme is one in which both local and global weights are used (weight of a term = tf*IDF). This is
commonly referred to as tf*IDF weighting.
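
As a small worked example, the tf*IDF weight of a term t in a document d can be computed as
tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of
documents containing t. The following Python sketch uses a toy collection and one common IDF
variant; both are assumptions made for illustration.

import math
from collections import Counter

docs = [["database", "index", "query", "query"],
        ["network", "packet", "query"],
        ["process", "thread", "scheduler"]]

N = len(docs)
# Global statistic: document frequency of each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)                        # local weight: term frequency
    # weight of a term = tf * IDF, with IDF = log(N / df)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
# 'query' occurs twice locally but appears in two documents, so its global
# weight (IDF) is lower than that of 'database' or 'index'.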

Clustering of documents

Clustering is the process of grouping similar documents from a given set of documents. It puts
documents with similar content or related topics into the same cluster (group). Each cluster
is assigned a label based on the content of the documents belonging to that cluster.

k-means clustering

K-means clustering is a partitioning method. It partitions data into k mutually exclusive clusters,
and returns the index of the cluster to which it has assigned each observation.

K-means treats each document in the collection as an object having a location in space. It finds a
partition in which objects within each cluster are as close to each other as possible, and as far
from objects in other clusters as possible.

Each cluster in the partition is defined by its member objects and by its centroid, or center. The
centroid for each cluster is the point to which the sum of distances from all objects in that cluster
is minimized. kmeans computes cluster centroids differently for each distance measure, to
minimize the sum with respect to the measure that you specify.

K-means uses an iterative algorithm that minimizes the sum of distances from each object to its
cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum
cannot be decreased further. The result is a set of clusters that are as compact and well-separated
as possible.
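
A minimal k-means sketch over tf*IDF document vectors, here using Python's scikit-learn rather
than MATLAB's kmeans; the four toy documents and the choice k = 2 are assumptions made for
illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["primary key and foreign key in a relational database",
        "normalization of relations in a database schema",
        "tcp and udp are transport layer protocols",
        "routing packets across a computer network"]

# Each document becomes a point in term space (tf*IDF weighted).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Partition the points into k = 2 mutually exclusive clusters; labels_[i]
# is the index of the cluster to which document i has been assigned.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
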
Text mining

NMF can be used for text mining applications. In this process, a document-term matrix is
constructed with the weights of various terms (typically weighted word frequency information)
from a set of documents. This matrix is factored into a term-feature and a feature-document
matrix. The features are derived from the contents of the documents, and the feature-document
matrix describes data clusters of related documents.
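
A brief sketch of this factorization with scikit-learn; the toy corpus and the choice of two
features are assumptions made for illustration. Note that scikit-learn factors a document-by-term
matrix A into W (document-feature) and H (feature-term), the transpose of the orientation
described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["sql query optimization in a relational database",
        "transactions and concurrency control in a database",
        "congestion control in a computer network",
        "ip addressing and routing in a network"]

# Weighted document-term matrix (documents x terms).
A = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Factor A ~ W * H with non-negative entries: each row of H can be read as a
# "feature" (topic) over terms, and each row of W as the soft membership of a
# document in those features, i.e. a grouping of related documents.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(A)
H = nmf.components_
print(W.round(2))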
