www.iaetsd.in
I. INTRODUCTION
A.Jayanthi,
M.E(CSE),Department of CSE,
Velammal Engineering College,Anna University,
Chennai,India.
jayanthiarumugamk@gmail.com
[Work flow diagram labels: Database A, Database B, Prepruning Technique, Compare Entities, Matching Entity, Non-Matching Entity]
E. CANOPY CLUSTERING
Canopy clustering [14] is built by converting BKVs into lists of
tokens, with each unique token becoming a key in an inverted
index. Overlapping clusters (canopies) are then formed using
either a threshold-based approach or a nearest-neighbor-based
approach. The drawback of canopy clustering is similar to that
of the sorted neighborhood technique based on the sorted array.
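The threshold-based variant can be illustrated with a minimal sketch. It is not taken from [14]: the q-gram tokenization, the Jaccard similarity, the `loose` and `tight` threshold values, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def tokenize(bkv, q=2):
    """Split a blocking key value (BKV) into a set of character q-grams."""
    s = bkv.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canopy_clusters(bkvs, loose=0.3, tight=0.7):
    """Threshold-based canopy clustering over {record_id: BKV}."""
    tokens = {rid: tokenize(v) for rid, v in bkvs.items()}

    # Inverted index: each unique token is a key pointing to record ids.
    index = defaultdict(set)
    for rid, toks in tokens.items():
        for t in toks:
            index[t].add(rid)

    remaining = set(bkvs)
    canopies = []
    for centre in bkvs:  # fixed order instead of a random pick
        if centre not in remaining:
            continue
        # Only records sharing at least one token can be similar.
        candidates = set().union(*(index[t] for t in tokens[centre]))
        candidates &= remaining
        canopy, removed = {centre}, {centre}
        for rid in candidates:
            sim = jaccard(tokens[centre], tokens[rid])
            if sim >= loose:   # close enough to join this canopy
                canopy.add(rid)
            if sim >= tight:   # so close it cannot seed or join another
                removed.add(rid)
        remaining -= removed
        canopies.append(canopy)
    return canopies


canopies = canopy_clusters({1: "smith", 2: "smyth", 3: "smithe", 4: "jones"})
```

Records that fall between the two thresholds stay in the pool, which is why canopies may overlap. Only records inside the same canopy would then be compared in detail.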
F. STRING-MAP-BASED INDEXING
String-map-based indexing [9] is based on mapping BKVs to
objects in a multidimensional Euclidean space, such that the
distances between pairs of strings are preserved. Groups of
similar strings are then generated by extracting the objects
that are close to each other. However, this technique fails
when the size of the database is too large or too small.
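The idea of mapping strings to points and grouping nearby points can be sketched as follows. This is a simpler pivot-based (Lipschitz-style) embedding, not the mapping algorithm of [9]: each coordinate is the edit distance to a pivot string. The pivot choice, the `radius` value and all function names are assumptions for illustration.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def embed(strings, pivots):
    """Map each string to a point; one coordinate per pivot string."""
    return {s: tuple(edit_distance(s, p) for p in pivots) for s in strings}


def similar_pairs(points, radius):
    """Return pairs of strings whose points lie within `radius` of each other."""
    keys = sorted(points)
    return [(a, b)
            for i, a in enumerate(keys)
            for b in keys[i + 1:]
            if math.dist(points[a], points[b]) <= radius]


points = embed(["smith", "smyth", "jones"], pivots=["smith", "jones"])
pairs = similar_pairs(points, radius=1.5)
```

The pairwise scan in `similar_pairs` is quadratic and is used only to keep the sketch short; a practical implementation would typically use a spatial index for the nearest-object search.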
Hence, all of the indexing techniques discussed above have
drawbacks in the data linkage process. To overcome these
indexing problems, a new approach called the One Class
Clustering Tree (OCCT) is proposed. It uses four splitting
criteria to split the data, namely the Coarse-Grained Jaccard
coefficient, the Fine-Grained Jaccard coefficient, Least
Probable Intersection (LPI) and Maximum Likelihood
Estimation (MLE), together with pruning techniques.
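As a rough illustration of a Jaccard-based splitting criterion, the sketch below scores a candidate split by the average Jaccard coefficient between the record sets it produces. It assumes that a split is preferred when its subsets overlap as little as possible. The attribute names, record identifiers and function names are hypothetical, and the code is not the exact OCCT formulation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two record sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_score(subsets):
    """Average pairwise Jaccard coefficient of the subsets of one split."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical candidate splits: attribute -> record sets per attribute value.
candidates = {
    "city": [{"b1", "b2"}, {"b3", "b4"}],
    "gender": [{"b1", "b2", "b3"}, {"b2", "b3", "b4"}],
}

# Assumption: the attribute whose subsets are least similar is chosen.
best = min(candidates, key=lambda attr: split_score(candidates[attr]))
```

Here `city` separates the records into disjoint sets (score 0.0), whereas `gender` leaves half of the records shared (score 0.5), so `city` is selected.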
Fig 1: Work Flow Diagram (output stage: Final Result)
Initially, the tree is constructed so that each inner node of the
tree consists of an attribute and each leaf represents a
cluster of matching entities. Secondly, the prepruning
technique is applied: the algorithm stops expanding a branch
whenever the sub-branch does not improve the accuracy of the
model. OCCT uses a probabilistic model to find the similar
entities that are to be matched. This probabilistic approach
helps to avoid overfitting. OCCT is therefore chosen as a
better approach for data linkage than the indexing techniques.
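The prepruning rule described above can be sketched with a generic tree builder that stops expanding a branch when no split improves a score. The `purity` function is a toy stand-in for model accuracy, not one of the OCCT criteria. The record layout, the `label` field and the `min_gain` parameter are likewise assumptions.

```python
from collections import Counter


def purity(parts):
    """Toy score: share of records that carry the majority label of their part."""
    total = sum(len(p) for p in parts)
    hits = sum(Counter(r["label"] for r in p).most_common(1)[0][1]
               for p in parts if p)
    return hits / total if total else 0.0


def grow(records, attributes, score, min_gain=0.0):
    """Grow a tree top-down, prepruning branches that do not improve the score."""
    base = score([records])
    best_attr, best_parts, best_gain = None, None, min_gain
    for attr in attributes:
        parts = {}
        for r in records:
            parts.setdefault(r[attr], []).append(r)
        if len(parts) < 2:          # attribute does not split these records
            continue
        gain = score(list(parts.values())) - base
        if gain > best_gain:
            best_attr, best_parts, best_gain = attr, parts, gain
    if best_attr is None:           # prepruning: no split helps, make a leaf
        return {"leaf": records}
    rest = [a for a in attributes if a != best_attr]
    return {"split": best_attr,
            "children": {value: grow(part, rest, score, min_gain)
                         for value, part in best_parts.items()}}


records = [
    {"city": "A", "age": "x", "label": "p"},
    {"city": "A", "age": "y", "label": "p"},
    {"city": "B", "age": "x", "label": "q"},
    {"city": "B", "age": "y", "label": "q"},
]
tree = grow(records, ["city", "age"], purity)
```

In this example the root splits on `city`. Splitting either child on `age` would not raise the score, so both branches end in leaves.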
IV. CONCLUSION
In this paper, the OCCT approach is used to perform
one-to-many data linkage. The method is based on a one-class
decision tree model that sums up the knowledge of which
records should be linked together. Using a one-class approach
gives more accurate results. The OCCT model has also proved
successful in three different domains, namely data leakage
prevention, recommender systems and fraud detection.