
ISSN: 2229-6093

R.Santhanalakshmi,Dr.K.Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198

An Innovative Approach in Text Mining


R.Santhanalakshmi (1), Dr.K.Alagarsamy (2)

(1) Research Scholar, Dept of MCA, Computer Center, Madurai Kamaraj University, Madurai. sanlakconference@gmail.com
(2) Associate Professor, Dept of MCA, Computer Center, Madurai Kamaraj University, Madurai. alagarsamymku@gmail.com

Abstract:
Text mining provides classification and predictive modelling; models based on bootstrapping techniques re-use a source data set for a specific application, which helps to avoid information overloading and redundancy. The resulting classification and prediction results are minimal compared with the original data source.
Text mining is the common approach used to examine text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve bovine disease information from the internet using k-means clustering and principal component analysis.

Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.

I. Introduction:
1.1 Bovine Diseases:

Bovine diseases are the common diseases of the cattle sector. They take a variety of forms and show any number of symptoms; here we discuss some of those forms. BVDV is one of the common causes of infectious abortion. It is also correlated with a wide range of diseases, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle. BVDV belongs to the family of pestiviruses; other diseases associated with other pestiviruses include classical swine fever and border disease in sheep. Although pestiviruses infect cloven-hoofed stock only, BVDV has been found in pigs and sheep. BVDV causes such a wide range of disease that it is rarely possible to diagnose on clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility: if the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was an active infection.

BVD is a viral disease of cattle caused by a pestivirus. It has many different manifestations in a herd, depending on the herd’s immune and reproductive status. Transient diarrhoea, mixed respiratory infection, infertility or abortion, and mucosal disease are the most common clinical signs of the disease and can be seen simultaneously in a herd. Due to its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved. Bovine herpes virus 1 is a virus of the family Herpesviridae that causes several diseases worldwide in cattle, including rhinotracheitis,
vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus which integrates a DNA intermediate as a provirus into the DNA of B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.

1.2 K-Means Clustering:
In statistics and machine learning, k-means clustering [4] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and both employ an iterative refinement approach.

Procedure:
 The algorithm is initiated by creating ‘k’ different clusters. The given sample set is first randomly distributed between these ‘k’ different clusters.
 As a next step, the distance between each sample within a given cluster and its respective cluster centroid is calculated.
 Samples are then moved to the cluster that records the shortest distance from the sample to the cluster centroid.
 As a first step of the cluster analysis, the user decides on the number of clusters ‘k’. This parameter can take definite integer values, with a lower bound of 1 and an upper bound that equals the total number of samples.

The K-Means algorithm is repeated a number of times to obtain an optimal clustering solution, every time starting with a random set of initial clusters.

1.3 Principal Component Analysis:
The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation of the data via the singular value decomposition. However, this noise-reduction property alone is inadequate to explain the effectiveness of PCA.

PCA is also a basic method of social network mining, with applications to ranking and clustering that can be further deployed in marketing and in user segmentation, by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading posts that friends write or have recently read.

Principal component analysis is similar to the HITS ranking algorithm; in fact the hub and authority ranking is defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter result observing a strong correlation of HITS score and degree, indicating that the first principal axis will contain no high-level information but will simply order by number of friends. We demonstrate that HITS-

style ranking can be used, but with special care, due to the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. Probably the first to identify the TKC problem in the HITS algorithm proposed an algorithmic solution that, however, turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter out the relevant high-level structural information by removing the TKCs.

II. Proposed Method (SAN Method):

In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. While searching for information on the internet, we have to get exactly the information we require; otherwise the search becomes null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combined both techniques and propose a new, innovative idea for optimizing searches over a large database, the internet, etc. Both techniques are related to clustering: K-means clustering groups the source data into certain groups, called clusters, based on a distance measure, while principal component analysis focuses on dimension reduction based on mathematical models.

Our domain information is related to bovine disease, which is very specific; instead of searching all domains, even though we have a specific domain, we should search throughout the internet if the data is online, or otherwise in a large database. In our earlier work we used a modified HITS algorithm for searching, and separately we used a stemming algorithm with hierarchical clustering. In this work we combine K-means and principal component analysis and evaluate the results. Our research ends with a comparison among all those approaches to determine which technique is better for this work.

The bovine disease keyword is given as the search element; using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors bv1, bv2, ..., bvn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. Here, for calculating distances to the means, we use the Euclidean distance formula, which is a standard as well as simple way to find the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if the distance x − mi is the minimum of all the k distances. This suggests the following procedure for finding the k means:

 Make initial guesses for the means m1, m2, ..., mk.
 Until there are no changes in any mean:
o Use the estimated means to classify the samples into clusters.
o For i from 1 to k:
o Replace mi with the mean of all of the samples for cluster i.
o end_for
 end_until
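The procedure above can be sketched in plain Python. This is only a minimal illustration: the paper's simulations use Matlab, and its actual bovine-disease feature vectors are not given, so the data below is a hypothetical set of well-separated 2-D vectors.

```python
import random

def kmeans(samples, k, max_iter=100, seed=42):
    """Minimal k-means: guess k initial means, then alternate between
    classifying each sample to its nearest mean (Euclidean distance) and
    replacing each mean, until no mean changes (the end_until condition)."""
    rng = random.Random(seed)
    means = [list(s) for s in rng.sample(samples, k)]  # initial guesses m1..mk
    for _ in range(max_iter):
        # use the estimated means to classify the samples into clusters
        clusters = [[] for _ in range(k)]
        for s in samples:
            d = [sum((a - b) ** 2 for a, b in zip(s, m)) for m in means]
            clusters[d.index(min(d))].append(s)
        # replace m_i with the mean of all samples in cluster i
        new_means = [
            [sum(col) / len(c) for col in zip(*c)] if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:  # no changes in any mean: stop
            break
        means = new_means
    return means, clusters

# two well-separated groups of 2-D feature vectors
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
means, clusters = kmeans(data, k=2)
```

With well-separated clusters, as assumed in the text, the recovered means converge to the true group centres within a few iterations.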

In addition, to improve the K-means algorithm while forming the clustering analysis, we include stemming: the brute force (lookup table) approach and suffix stripping. Brute force stemmers maintain a lookup table which contains the relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection; if a matching inflection is found, the original word is replaced by the associated root form. Suffix stripping algorithms do not rely on a lookup table; instead, a typically smaller list of rules is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include: Rule 1) if the word ends in 'ed', remove the 'ed'; Rule 2) if the word ends in 'ing', remove the 'ing'; Rule 3) if the word ends in 'ly', remove the 'ly'. In this way some groups of clusters are formed at the final stage, but we cannot say these are the final optimized result, so we analyze these clusters further: the final clusters are passed into principal component analysis.

Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components.

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.

Define a data matrix X^T with zero sample mean, where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is X = W Σ V^T, where the m × m matrix W is the matrix of eigenvectors of X X^T, the matrix Σ is an m × n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is n × n. The PCA transformation that preserves dimensionality is then given by:

Y^T = X^T W = V Σ^T

V is not uniquely defined in the usual case when m < n − 1, but Y will usually still be uniquely defined. Since W is an orthogonal matrix, each row of Y^T is simply a rotation of the corresponding row of X^T. The first column of Y^T is made up of the scores of the cases with respect to the first principal component; the next column has the scores with respect to the second principal component, and so on.

If we want a reduced-dimensionality representation, we can project X down into the reduced space defined by only the first L singular vectors, W_L:

Y = W_L^T X = Σ_L V_L^T

The matrix W of singular vectors of X is equivalently the matrix of eigenvectors of the matrix of observed covariances C = X X^T:

X X^T = W Σ Σ^T W^T
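As an illustration of the reduced projection Y = W_L^T X with L = 1, the sketch below finds the first principal axis of a hypothetical 2-D data set. For brevity it uses power iteration on the covariance matrix C = X X^T instead of a full singular value decomposition; power iteration converges to the same leading eigenvector, i.e. the direction of largest variance.

```python
# Toy PCA: find the first principal axis of centred 2-D data by power
# iteration on the covariance matrix C = X X^T, then project the points
# onto it (the Y = W_L^T X reduction with L = 1).

def pca_first_component(points, iters=200):
    n = len(points)
    # centre the data so each coordinate has zero mean
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [(p[0] - mx, p[1] - my) for p in points]
    # covariance matrix C = X X^T (2 x 2, unnormalised)
    cxx = sum(x * x for x, _ in xs)
    cxy = sum(x * y for x, y in xs)
    cyy = sum(y * y for _, y in xs)
    # power iteration: repeatedly apply C to a unit vector; it converges
    # to the eigenvector with the largest eigenvalue (largest variance)
    w = (1.0, 0.0)
    for _ in range(iters):
        v = (cxx * w[0] + cxy * w[1], cxy * w[0] + cyy * w[1])
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        w = (v[0] / norm, v[1] / norm)
    # scores: projection of each centred point onto the principal axis
    scores = [x * w[0] + y * w[1] for x, y in xs]
    return w, scores

# hypothetical points spread mainly along the y = x diagonal
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
w, scores = pca_first_component(pts)
```

Each score is the coordinate of a point along the first principal axis; dropping the remaining dimension discards only the small residual variance, as described below.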
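The lookup-table and suffix-stripping steps described in Section II can be sketched as follows. The table entries here are hypothetical examples; a production stemmer such as Porter's uses a much richer rule set.

```python
# Brute force lookup table relating inflected forms to root forms
# (example entries are hypothetical; a real table would be much larger).
LOOKUP = {"ran": "run", "geese": "goose"}

# Suffix-stripping rules from the text: Rule 1) strip 'ed',
# Rule 2) strip 'ing', Rule 3) strip 'ly'.
SUFFIXES = ("ed", "ing", "ly")

def stem(word):
    word = word.lower()
    # brute force step: if the table knows the word, return its root form
    if word in LOOKUP:
        return LOOKUP[word]
    # suffix stripping step: apply the first matching rule, keeping at
    # least two characters of the stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word
```

For example, stem("infected") yields "infect" by Rule 1, while stem("virus") is returned unchanged because no rule matches.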

Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squares of the distances of the points from the line. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points. The singular values (in Σ) are the square roots of the eigenvalues of the matrix X X^T. Each eigenvalue is proportional to the portion of the variance that is correlated with its eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information. Finally, we obtain the reduced cluster as the output of our query.

III. Result Analysis:

The simulation is carried out in Matlab (Matrix Laboratory) software. For example, take this query: "Symptoms of Bovine leukemia". First we see the outcome of K-means clustering in Fig. 1.

Fig. 1: K-means cluster output for the query (cluster terms include Lymphocytes, Mononucleosis, B-cell leukemia, Colostrum, Leukaemia, BLV).

Fig. 2: Reduced cluster for the query (cluster terms include Feeding, Gouge, Dehorning, Rhinotracheitis, Ataxia, Palpation, Colostrum, Provirus).

The K-means cluster output values are given to the principal component analysis. It generates the variance matrix, which is reduced in further steps; finally we obtain the terms of Fig. 2 as the result of the query. The comparison analysis gives a performance evaluation of the combined approach against each technique taken alone:

Sample size    SAN method    K-means    PCA

2750           0.91          0.87       0.82
4550           0.81          0.75       0.69
7700           0.84          0.65       0.62
10100          0.89          0.73       0.65

As the result analysis depicts, the SAN method's performance is higher than that of the other methods. As the data set grows, the performance ratio decreases for K-means and for principal component analysis.

IV. Conclusion:
In this paper we provided an effective method for information retrieval about bovine disease. The SAN method gives an optimal solution compared with principal component analysis and K-means alone. In our earlier work we focused on enhancing Medline & PubMed searches using a modified HITS algorithm, and we also experimented with stemming algorithms. Our conclusion is that, among all those methods, the SAN method gave the most effective solution for the bovine disease searching methodology.

V. References:
[1] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD ’05: Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43, New York, NY, USA, 2005. ACM.
[2] Pedro Domingos and Matt Richardson. Mining the network value of customers. In KDD ’01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57–66, New York, NY, USA, 2001. ACM.
[3] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, New York, NY, USA, 2006. ACM Press.
[4] D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a recursive spectral algorithm for clustering from pairwise similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.
[5] Matthew Hurst, Matthew Siegler, and Natalie Glance. On estimating the geographic distribution of social media. In Proceedings of the Int. Conf. on Weblogs and Social Media (ICWSM-2007), 2007.
[6] M. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter, 38(2):321–330, March 2004.
[7] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: structure and algorithms. In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pages 221–230, New York, NY, USA, 2007. ACM Press.
