This is the initial stage of the proposed architecture. In this stage the sample documents are
collected. These documents belong to four different subjects of the computer science domain, namely
database management systems, computer networks, operating systems and software engineering.
The extracted documents are arranged in a directory to facilitate the training process.
Indexing of Documents
Indexing is the process of preparing the raw document collection as an easily accessible
representation of the documents. This transformation from document text into a representation of
the text is known as indexing of the documents. To transform a document into an indexed form,
the Text to Matrix Generator (TMG) toolbox for MATLAB is used, which involves the following steps:
for each text file
    read the file
    parse the text
    normalize terms against the dictionary
    remove stoplist terms
    filter the remaining terms
end
construct the TDM (term-document matrix)
Document Linearization
1. Markup and Format Removal During this phase, the markup tags and formatting tags
are removed from the document.
2. Tokenization During this phase, the outcome of the previous phase, that is, the remaining
text, is parsed, lowercased and stripped of all punctuation.
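As an illustrative sketch (not the TMG implementation itself), the two linearization phases can be combined in a few lines of Python; the tag-stripping regex and the function name are assumptions for the example:

```python
import re
import string

def linearize(text):
    """Phase 1: strip markup/formatting tags; Phase 2: lowercase,
    remove punctuation, and split into tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # markup and format removal
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()  # tokenization

tokens = linearize("<p>Database Management Systems, e.g. MySQL!</p>")
# tokens: ['database', 'management', 'systems', 'eg', 'mysql']
```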
Filtration
Filtration is the process of deciding which terms or attributes should be used to represent the
documents, so that these terms can best characterize their content.
For this purpose, frequently used terms cannot be used, for two reasons. First, the number of
documents that are relevant to a topic or subtopic is likely to be a small proportion of the
collection. A term that will be effective in separating the relevant documents from the non-
relevant documents, then, is likely to be a term that appears in a small number of documents.
This means that high frequency terms are poor discriminators. The second reason is that terms
appearing in many contexts do not define a topic or sub-topic of a document.
"The more documents in which a term appears (the more contexts in which it is used) then the
less likely it is to be a content-bearing term. Consequently it is less likely that the term is one of
those terms that contribute to the user's relevance assessment. Hence, terms that appear in many
documents are less likely to be the ones used by a searcher to discriminate between relevant and
non-relevant documents."
[Figure: proposed architecture. A download agent fetches HTML files from the World Wide Web; the documents then pass through linearization, filtration, stemming and weighting (the transformation and indexing stages) to produce document vectors.]
For these reasons, frequently used terms, or stopwords, are removed from text streams. Stopwords
are words whose frequency exceeds some user-specified threshold. Special care is taken so that
important words that occur more frequently are not removed. Stop-word removal is done with the aid
of a publicly available list of stop-words []. Using a public list of stop-words is category-independent and
ensures that important words which occur frequently within a category are not removed. The disadvantage
is that there are many different public lists of stop-words, not all of which agree. Nevertheless,
a number of lists could be compared and the most appropriate one chosen.
However, removing stopwords from one document at a time is time consuming. A cost-effective
approach consists of removing all terms which appear commonly in the document collection and
which will not improve retrieval of relevant material.
This can be accomplished with a stopword library -- a stop-list of terms to be removed. These
lists contain words which are assumed to have no impact on the meaning of a document. Such a
list usually contains words like ‘the’, ‘is’, ‘a’, etc. During preprocessing, all words matching
the stop-word list are removed from the document.
These lists can be either generic (applied to all collections) or specific (created for a given
collection). For instance, in some IR systems terms that appear in more than 5% of a collection
are removed. In others, terms that are not in the stop-list but appear in more than 50% of a
collection are deemed as "negative terms" and are also removed to avoid weighting
complications.
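A minimal sketch of this collection-level approach, assuming each document is already a token list; the function name and the 50% document-frequency threshold for "negative terms" follow the description above:

```python
def remove_stopwords(docs, stoplist, df_threshold=0.5):
    """Drop generic stop-list terms, plus 'negative terms': terms that
    appear in more than df_threshold of the documents in the collection."""
    n = len(docs)
    df = {}                                  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    negative = {t for t, c in df.items() if c / n > df_threshold}
    return [[t for t in doc if t not in stoplist and t not in negative]
            for doc in docs]

docs = [["the", "database", "stores", "data"],
        ["the", "network", "sends", "data"],
        ["the", "kernel", "schedules", "tasks"]]
cleaned = remove_stopwords(docs, stoplist={"the", "is", "a"})
# "the" is on the stop-list; "data" appears in 2/3 of the documents
# and is removed as a negative term.
```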
Stemming
connection, connections, connective, connected, connecting ---> connect
Stemming refers to the process of reducing terms to their stems or root variants. Thus, "computer",
"computing" and "compute" are reduced to "compute", and "walks", "walking" and "walker" are
reduced to "walk".
Stemming can be strong (e.g. reducing ‘houses’, ‘mice’ to the word stems ‘hous’, ‘mic’) or weak
(e.g. reducing ‘houses’, ‘mice’ to ‘house’, ‘mouse’). For pre-processing English documents
the Porter stemming algorithm is often used; other algorithms include the n-gram stemmer,
the Snowball stemming algorithms, Lovins' English stemmer, etc.
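A toy illustration of weak suffix-stripping (the suffix list here is an ad-hoc assumption for the example; production systems use the Porter, Snowball or Lovins algorithms instead):

```python
def weak_stem(word):
    """Toy weak stemmer: strip one common English suffix, keeping at
    least four characters of stem. Illustrative only -- real systems
    use Porter, Snowball or Lovins stemmers."""
    for suffix in ("ations", "ation", "ings", "ing", "ions", "ion",
                   "ives", "ive", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

stems = [weak_stem(w) for w in
         ["connection", "connections", "connective", "connected", "connecting"]]
# all five variants reduce to "connect"
```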
Weighting
Weighting is the final stage of the indexing process. Terms are weighted according to a given
weighting model, which may include local weighting, global weighting or both. If local weights
are used, term weights are normally expressed as term frequencies, tf. If global weights are
used, the weight of a term is given by its IDF value. The most common (and basic) weighting
scheme is one in which local and global weights are combined (weight of a term = tf*IDF). This is
commonly referred to as tf*IDF weighting.
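The tf*IDF scheme can be sketched as follows, assuming raw term counts for tf and IDF = log(N/df) over an N-document collection (one common convention among several):

```python
import math

def tfidf(docs):
    """Compute weight(t, d) = tf * IDF, where tf is the raw term
    frequency in document d and IDF = log(N / df_t)."""
    n = len(docs)
    df = {}                                   # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1      # local weight: tf
        for term in w:
            w[term] *= math.log(n / df[term]) # global weight: IDF
        weights.append(w)
    return weights

docs = [["query", "index", "index"], ["query", "cluster"]]
w = tfidf(docs)
# "query" occurs in every document, so its IDF (and weight) is zero;
# "index" gets weight 2 * log(2).
```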
Clustering of documents
Clustering is a process of grouping similar documents from a given set of documents. It will put
documents with similar content or with related topics into the same cluster (group). Each cluster
is assigned a label based on the content of the documents belong to this cluster.
k-means clustering
K-means clustering is a partitioning method. It partitions data into k mutually exclusive clusters,
and returns the index of the cluster to which it has assigned each observation.
K-means treats each document in the collection as an object having a location in space. It finds a
partition in which objects within each cluster are as close to each other as possible, and as far
from objects in other clusters as possible.
Each cluster in the partition is defined by its member objects and by its centroid, or center. The
centroid for each cluster is the point to which the sum of distances from all objects in that cluster
is minimized. The kmeans function computes cluster centroids differently for each distance measure,
to minimize the sum with respect to the measure that you specify.
K-means uses an iterative algorithm that minimizes the sum of distances from each object to its
cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum
cannot be decreased further. The result is a set of clusters that are as compact and well-separated
as possible.
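A minimal sketch of this iterative procedure (Lloyd's algorithm) with squared Euclidean distance, not MATLAB's kmeans itself; the 2-D points here stand in for document vectors:

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its members; stop when assignments
    no longer change (the sum of distances cannot decrease further)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                      for pt in points]
        if new_labels == labels:      # converged
            break
        labels = new_labels
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
labels, cents = kmeans(pts, 2)
# the two nearby pairs of points end up in separate clusters
```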
Text mining
NMF can be used for text mining applications. In this process, a document-term matrix is
constructed with the weights of various terms (typically weighted word frequency information)
from a set of documents. This matrix is factored into a term-feature and a feature-document
matrix. The features are derived from the contents of the documents, and the feature-document
matrix describes data clusters of related documents.
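A small sketch of this factorization using Lee and Seung's multiplicative update rules, one standard way to compute NMF (assumptions for the example: a Frobenius-norm objective, random nonnegative initialization, and a tiny document-term matrix with two latent topics):

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=500, seed=0):
    """Factor the nonnegative document-term matrix V (docs x terms) into
    W (docs x k features) and H (k features x terms) by alternating
    multiplicative updates that keep every entry nonnegative."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps)
              for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(matmul(W, H), transpose(H))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps)
              for j in range(k)] for i in range(n)]
    return W, H

# Toy document-term matrix: documents 0-1 use topic-1 terms,
# documents 2-3 use topic-2 terms.
V = [[2.0, 1.0, 0.0, 0.0],
     [4.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 1.0],
     [0.0, 0.0, 6.0, 2.0]]
W, H = nmf(V, k=2)
recon = matmul(W, H)
err = sum((v - r) ** 2 for rv, rr in zip(V, recon) for v, r in zip(rv, rr))
```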