
IMPLEMENTING CORRELATION SIMILARITY MEASURE SPACE IN A TEXT CLUSTER

S.P.Karthick (20208205046)
S.Karthik (20208205047)

Batch Number: 21

Project guide: Mr.Muthukrishnan.P
Project coordinator: Mrs.Pushpavalli

Domain: Knowledge and Data Engineering


Document clustering: A technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering.
Correlation: A measure that indicates a predictive relationship between two data points (a measure of similarity).
Euclidean distance: The straight-line distance between two data points (a measure of dissimilarity).

Example : Euclidean distance


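John: Age = 20, Height = 170, Weight = 80
Henry: Age = 30, Height = 160, Weight = 120

Euclidean distance: $d(X, Y) = \sqrt{(X_1 - Y_1)^2 + (X_2 - Y_2)^2 + \dots + (X_n - Y_n)^2}$

Let X be John and Y be Henry, with $(X_1, Y_1)$ = Age, $(X_2, Y_2)$ = Height, and $(X_3, Y_3)$ = Weight:

$d = \sqrt{(20 - 30)^2 + (170 - 160)^2 + (80 - 120)^2} = \sqrt{100 + 100 + 1600} = \sqrt{1800} \approx 42.43$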
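The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean_distance` and the ordering of the attributes (age, height, weight) are illustrative choices, not part of the project code.

```python
import math

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Attribute order assumed for illustration: age, height, weight.
john = [20, 170, 80]
henry = [30, 160, 120]

distance = euclidean_distance(john, henry)
print(round(distance, 2))  # 42.43, i.e. sqrt(1800)
```

Note that the weight difference (40) dominates the result, because Euclidean distance is sensitive to the scale of each attribute.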

Abstract:
The clustering method used here is the Correlation Preserving Indexing (CPI) algorithm. The documents are projected into a low-dimensional semantic space in which the correlations between documents within local patches are maximized while the correlations between documents outside these patches are simultaneously minimized. As a similarity measure, correlation is better suited than Euclidean distance to detecting the intrinsic geometrical structure of the document space.
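To illustrate how correlation and Euclidean distance can disagree, the sketch below compares three toy term-count vectors. It assumes correlation is computed as the normalized inner product of two document vectors (the cosine form); that definition, the function name `correlation`, and the sample vectors are assumptions made for illustration only.

```python
import math

def correlation(u, v):
    """Normalized inner product of two vectors: dot(u, v) / (|u| * |v|).

    Assumption: this cosine-style form stands in for the correlation
    measure; zero vectors are given a correlation of 0.0.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term counts over a four-word vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]   # same word proportions as doc_a, twice the length
doc_c = [0, 0, 3, 1]   # shares no words with doc_a

corr_ab = correlation(doc_a, doc_b)   # close to 1: same topic
corr_ac = correlation(doc_a, doc_c)   # 0: no shared terms
dist_ab = euclidean(doc_a, doc_b)     # non-zero despite identical proportions
```

Under these assumptions, `doc_a` and `doc_b` are maximally correlated even though their Euclidean distance is greater than zero, since correlation ignores document length.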

Existing System

Euclidean distance is a widely used measure for grouping documents, and the k-means method is one of the methods that use it. Euclidean distance is a dissimilarity measure, and it cannot effectively capture the nonlinear manifold structure embedded in the similarities between documents. The document space is always of high dimensionality, which increases the complexity of clustering.
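For reference, the way k-means relies on Euclidean distance can be sketched as follows. This is a minimal version of the standard assign-and-update loop with caller-supplied starting centroids; the function names, the sample points, and the fixed iteration count are assumptions for illustration, not the existing system's code.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, centroids, iterations=10):
    """Cluster points around the given initial centroids.

    Returns (labels, centroids), where labels[i] is the index of the
    centroid nearest to points[i] at the last assignment step.
    """
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical two-dimensional data with two obvious groups.
points = [[1, 0], [2, 0], [10, 10], [20, 20]]
labels, centroids = k_means(points, centroids=[[0, 0], [15, 15]])
print(labels)  # [0, 0, 1, 1]
```

Every decision in the loop depends only on straight-line distance to a centroid, which is why the method struggles when the documents lie on a nonlinear manifold.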

Proposed System

CPI (Correlation Preserving Indexing) explicitly considers the manifold structure embedded in the similarities between the documents. The documents are projected into a low-dimensional semantic space, which reduces complexity. CPI maximizes the correlations between the documents in the local patches and minimizes the correlations between the documents outside these patches. The CPI method focuses on detecting the intrinsic structure between nearby documents rather than between widely separated documents, and on the similarities rather than the dissimilarities between the data.
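The notion of a local patch can be sketched in code. The example below covers only the neighborhood step, under the assumption that a document's local patch is the set of its k most correlated documents and that correlation is the normalized inner product of term vectors. It does not implement the CPI projection itself, and all names and data are hypothetical.

```python
import math

def correlation(u, v):
    """Normalized inner product (assumed form of the correlation measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def local_patches(docs, k):
    """For each document index, list the indices of its k most correlated documents."""
    patches = []
    for i, d in enumerate(docs):
        others = [j for j in range(len(docs)) if j != i]
        # Highest correlation first; ties keep their original index order.
        others.sort(key=lambda j: correlation(d, docs[j]), reverse=True)
        patches.append(others[:k])
    return patches

# Hypothetical term-count vectors: documents 0 and 1 share one topic,
# documents 2 and 3 share another.
docs = [
    [3, 1, 0, 0],
    [6, 2, 0, 1],
    [0, 0, 2, 4],
    [0, 1, 1, 2],
]
patches = local_patches(docs, k=1)
print(patches)  # [[1], [0], [3], [2]]
```

In this sketch, documents inside each other's patches are the pairs whose correlation CPI would try to keep high after projection, while pairs outside the patches are those whose correlation it would try to reduce.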

Advantages

The intrinsic geometrical structure of the documents is preserved.
Software and code complexity is reduced.
Information retrieval and filtering are faster.
Performance is greatly improved.

Hardware Requirements :

Processor: Pentium IV, 2.4 GHz
Hard Disk: 40 GB
RAM: 256 MB
Keyboard: Standard 102 keys
Monitor: 15-inch color
Mouse: 3 buttons

Software Requirements :

Operating System: Windows XP Professional
Front End: Visual Studio .NET 2008
Platform: VB.NET
.NET Framework: Version 3.5
Back End: Oracle 9i
