Volume: 2 Issue: 12
ISSN: 2321-8169
4060 - 4062
_______________________________________________________________________________________________
HOD
Department of Computer Engineering
P.V.P.I.T College, Bavdhan
e-mail-ybgurav@gmail.com
Abstract: In text mining, most operations are based on the statistical analysis of a term, word, or phrase. Clustering is a popular technique for
automatically organizing a large collection of text; it is also used for text classification. Many text mining applications contain side information
alongside the text documents, in the form of web documents, user-access web logs, and links attached to text files. This side information is
helpful for clustering, but it is sometimes risky to use because it may add noise to the process. A better technique for text mining is therefore
needed to improve the quality of presentation. In this paper, we use different algorithms to enhance clustering
quality through document-based, sentence-based, corpus-based, and combined-approach concept analysis, so as to maximize the benefits
of using side information.
Keywords: Text classification, clustering, side information, concept analysis, document-based, sentence-based.
__________________________________________________*****_________________________________________________
because the logs can often pick up subtle correlations in
content, which cannot be picked up by the raw text
alone.
I. INTRODUCTION
Data mining is the practice of automatically searching large
stores of data to discover patterns and trends that go beyond simple analysis.
helps to identify the topic of the document. One of the
traditional data mining techniques is an unsupervised learning
paradigm in which clustering methods try to identify inherent
groupings of the text documents, so that a set of clusters is
produced in which clusters exhibit high intra-cluster similarity
and low inter-cluster similarity. Generally, text document
clustering methods attempt to segregate the documents into
groups where each group represents some topic that is
different from the topics represented by the other groups.
Most current document clustering methods are based on the
Vector Space Model (VSM), which is a widely used data
representation for text classification and clustering [8].
Document clustering has been investigated for use in a number
of different areas of text mining and information retrieval [5].
Initially, document clustering was investigated for improving
the precision or recall in information retrieval systems and as
an efficient way of finding the nearest neighbors of a
document. More recently, clustering has been proposed for use
in browsing a collection of documents or in organizing the
results returned by a search engine in response to a user's
query. Document clustering has also been used to
automatically generate hierarchical clusters of documents. A
somewhat different approach finds the natural clusters in
an already existing document taxonomy, and then uses these
clusters to produce an effective document classifier for new
documents. Agglomerative hierarchical clustering and K-means are two clustering techniques that are commonly used
for document clustering. Agglomerative hierarchical clustering
is often portrayed as better than K-means, although
slower. A widely known study, discussed in [9], indicated that
agglomerative hierarchical clustering is superior to K-means,
although we stress that these results were with non-document
data. In the document domain, Scatter/Gather, a document
browsing system based on clustering, uses a hybrid approach
involving both K-means and agglomerative hierarchical
clustering. K-means is used because of its efficiency and
agglomerative hierarchical clustering is used to improve
quality. Recent work to generate document hierarchies uses
some of the clustering techniques from and presents a result
that indicates that agglomerative hierarchical clustering is
better than K-means, although this result is just for a single
data set and is not one of the major results of the paper.
Initially we also believed that agglomerative hierarchical
clustering was superior to K-means clustering, especially for
building document hierarchies, and we sought to find new and
better hierarchical clustering algorithms. During the course of
our experiments we discovered that a simple and efficient
variant of K-means, bisecting K-means, can produce clusters
of documents that are better than those produced by regular
K-means and as good as or better than those produced by
agglomerative hierarchical clustering techniques. We have
also found what we think is a reasonable explanation for
this behavior.
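The bisecting K-means idea described above can be illustrated with a minimal Python sketch. It is not the implementation used in the studies cited here: the function names (`bisecting_kmeans`, `two_means`), the use of cosine similarity on plain lists of term weights, and the rule of always splitting the largest cluster are assumptions made for illustration. Published variants typically try several trial splits and keep the best one; this sketch performs a single split per step.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def two_means(vectors, rng, iters=10):
    """Split vectors into two groups with basic K-means (K = 2)."""
    centers = rng.sample(vectors, 2)  # two random documents as initial centers
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in vectors:
            nearest = 0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            groups[nearest].append(v)
        if not groups[0] or not groups[1]:
            break  # degenerate split, stop refining
        centers = [centroid(groups[0]), centroid(groups[1])]
    return groups


def bisecting_kmeans(vectors, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist.

    Assumption: the cluster to split is chosen by size; other criteria
    (for example, lowest internal similarity) are equally possible.
    """
    rng = random.Random(seed)
    clusters = [list(vectors)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster
        if len(target) < 2:
            clusters.append(target)
            break
        left, right = two_means(target, rng)
        if not left or not right:
            clusters.append(target)  # could not split any further
            break
        clusters.extend([left, right])
    return clusters


if __name__ == "__main__":
    docs = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]]
    for cluster in bisecting_kmeans(docs, 2):
        print(cluster)
```

Because each step only runs K-means with K = 2 on one cluster, the cost stays close to that of regular K-means, while the sequence of splits also yields a hierarchy of clusters.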
Co-clustering is a technique for tapping the rich meta-information of web documents such as multimedia [3], including
category, annotation, and explanation, for relation discovery.
Most co-clustering methods are designed for other kinds of data and
ignore the representation issue of short and noisy text, and
their performance is limited by the empirical weighting of
the multi-modal features.
III. SYSTEM IMPLEMENTATION
The selected method consists of a concept-based similarity
measure, document-based concept analysis, corpus-based
concept analysis, and sentence-based concept analysis. Many
forms of text databases contain a large amount of side information, which is the input to the proposed model.
Separate sentence
Label terms
Stem words
Term frequency
Document frequency
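The preprocessing steps listed above (separating sentences, stemming words, and counting term and document frequency) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the suffix-stripping `stem` function is a toy stand-in for a real stemmer, and the term-labeling step is reduced to lowercase tokenization because the source does not specify how terms are labeled.

```python
import re
from collections import Counter

# Assumption: a toy suffix list, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(text):
    """Separate a document into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(sentence):
    """Tokenize a sentence into lowercase alphabetic terms and stem each one."""
    return [stem(w) for w in re.findall(r"[a-z]+", sentence.lower())]


def term_frequency(document):
    """Count how often each stemmed term occurs in one document."""
    counts = Counter()
    for sentence in split_sentences(document):
        counts.update(terms(sentence))
    return counts


def document_frequency(documents):
    """Count in how many documents each stemmed term occurs at least once."""
    df = Counter()
    for document in documents:
        df.update(set(term_frequency(document)))
    return df


if __name__ == "__main__":
    corpus = [
        "Clustering groups documents. Clusters share topics.",
        "Mining text documents.",
    ]
    print(term_frequency(corpus[0]))
    print(document_frequency(corpus))
```

Term frequency is computed per document and document frequency across the corpus, so the two counters can later be combined into a weighting scheme of the reader's choice.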
introduce some notations and terms which are related to the
classification problem.
IV. CONCLUSION
In this survey paper, we find that better text clustering results can be
achieved when mining text data by using side information.
Many text databases contain a large amount of side information or meta-information, which can be used to
improve the efficiency of the clustering process. The
clustering method is designed by combining an iterative partitioning
technique with a probability estimation process that measures
the importance of different kinds of side information. General
approaches will be used to improve the clustering and
classification algorithms. In this paper we have studied
different techniques to improve clustering and data mining,
giving a brief overview of clustering in text mining. Each
method increases the accuracy and quality of the clusters
by using side information.
References:
[1]
[2]
[3]
B.
[5]
Methodology Comparison:
[4]
[6]
[7]
[8]
[9]
[10]
IJRITCC | December 2014, Available @ http://www.ijritcc.org
_______________________________________________________________________________________