
A Novel Two-Step Approach for Record Pair Classification in the Record Linkage Process

K. Gomathi, PG Scholar, Department of CSE, Anna University Chennai, Archana Institute of Technology, Krishnagiri. Mathikrishnan55@gmail.com

Abstract: The aim of record linkage is to match records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of the candidate record pairs generated during the indexing step into matches, non-matches and possible matches. Traditional classification methods are based on setting threshold levels manually, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies the candidate record pairs automatically. In the first step, a training set is selected automatically from the compared candidate record pairs using their weight vectors. In the second step, a support vector machine (SVM) classifier is used, iteratively refining the training set selected in the first step. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.

Keywords: Indexing, record linkage, training set, candidate record, classifier.

I. INTRODUCTION

The task of linking databases is an important step in a growing number of data mining projects, such as fraud and crime detection, national security and bioinformatics, because linked data contain information that is either not available otherwise or would require time-consuming and expensive collection of specific data. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, tax payers, students, or travelers. Record linkage is now commonly used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce the costs and effort of data acquisition.

1.1 Record Linkage Process

Figure 1 shows the general steps involved in linking two databases. Most real data is dirty and contains noisy or incomplete information.
Figure 1. Steps involved in the record linkage process: input datasets, cleaning and data standardization, indexing (sorted block indexing), data comparison (edit distance comparison function), pair classification (the proposed novel approach) and evaluation.

The main task of data cleaning and standardization is the conversion of the raw input data into well-defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded. The second step, indexing, generates pairs of candidate records that are compared in detail in the comparison step using a variety of comparison functions appropriate to the content of the record fields (attributes). The next step in the record linkage process is to classify the compared candidate record pairs into matches, non-matches and possible matches, depending on the decision model used. If record pairs are classified as possible matches, a clerical review process is required in which these pairs are manually assessed and classified into matches or non-matches. This is usually a time-consuming, cumbersome and error-prone process, especially when large databases are being linked or deduplicated. Measuring and evaluating the quality and complexity of a record linkage project is the final step in the record linkage process.

II. INDEXING FOR RECORD LINKAGE

The aim of the indexing step is to reduce the large number of potential comparisons by removing as many record pairs as possible that correspond to non-matches. Traditional record linkage has employed an indexing technique called blocking [2], which splits the databases into non-overlapping blocks. A blocking criterion, called a blocking key, is used; it is based on a single record field (attribute) or on the concatenation of values from several fields. Because real-world data is dirty and contains errors, an important criterion for a good blocking key is that it groups similar values into the same block. Similarity can refer to similar-sounding or similar-looking values based on phonetic characteristics. For strings containing personal names, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex or Double Metaphone.
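As an illustration, the following sketch implements a simplified Soundex encoding and uses it as a blocking key (a minimal sketch, not the paper's implementation; the `surname` field name and dictionary-based record format are illustrative assumptions):

```python
from itertools import combinations

def soundex(name):
    """Simplified Soundex: keep the first letter, then encode the
    remaining consonants as digits, collapsing adjacent duplicates."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    encoded = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "HW":          # H and W do not separate duplicate codes
            prev = code
    return (encoded + "000")[:4]    # pad or truncate to 4 characters

def candidate_pairs(records, key_field="surname"):
    """Group records by the Soundex code of the blocking key field and
    generate candidate record pairs only within each block."""
    blocks = {}
    for rec_id, rec in records.items():
        blocks.setdefault(soundex(rec[key_field]), []).append(rec_id)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs
```

For the surnames of Table 1 below, soundex("Smith") and soundex("Smyth") both yield S530, so R1, R5 and R7 fall into one block and are paired with each other only.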

TABLE 1. Example records with surnames and the Soundex encoding used as blocking key

Identifier  Surname  BK (Soundex encoding)
R1          Smith    S530
R2          Miller   M460
R3          Peters   P362
R4          Myler    M460
R5          Smyth    S530
R6          Millar   M460
R7          Smyth    S530
R8          Miller   M460

Table 1 illustrates a small data set with the Soundex encoding scheme. For example, for the blocking key value S530 in Table 1 the pairs (R1, R5), (R1, R7) and (R5, R7) are generated. These pairs are called candidate record pairs, and they are compared in the comparison step using various string comparison functions.

2.1 Hamming Distance Comparison Function

The Hamming distance is used primarily for numerical fixed-size fields like zip code or SSN. It counts the number of mismatches between two numbers. For example, the Hamming distance between the zip codes 47905 and 46901 is 2, since they differ in two positions.

2.2 Edit Distance Comparison Function

The edit distance between two strings is the minimum cost to convert one of them into the other by a sequence of character insertions, deletions and replacements. Each of these modifications is assigned a cost value; for insertion and deletion the cost is equal to 1, and replacement is commonly assigned a cost of 1 as well. The edit distance can be computed with a dynamic programming algorithm such as Smith-Waterman. An edit distance based function is more accurate than the Hamming distance, since the Hamming distance cannot account for insertions and deletions when measuring the similarity between two strings.
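Both comparison functions can be sketched in a few lines (a minimal sketch with unit costs, assuming the standard Levenshtein formulation; the paper does not prescribe a particular implementation):

```python
def hamming_distance(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein edit distance via dynamic programming, with unit
    costs for insertion, deletion and replacement."""
    prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion of x
                           cur[j - 1] + 1,           # insertion of y
                           prev[j - 1] + (x != y)))  # replacement (0 if equal)
        prev = cur
    return prev[-1]
```

Here hamming_distance("47905", "46901") returns 2, matching the zip-code example above, and edit_distance("Smith", "Smyth") returns 1, a single replacement that the Hamming distance alone could not distinguish from a larger change involving shifts.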

2.3 Comparison of Records Using Weight Vectors

The two records of a candidate pair generated during indexing are compared using similarity functions applied to selected record attributes. These functions can perform exact string or numerical comparisons, or approximate comparisons that take typographical variations into account. There are also various approaches to learning such similarity functions from training data [5]. Each similarity function returns a numerical matching weight that is usually normalized such that 1 corresponds to an exact match and 0 corresponds to no match; similar values receive a matching weight somewhere between 0 and 1.

TABLE 2. Comparison of records using weight vectors

Record  Name       Surname  Street No
R1      Christine  Smith    42
R2      Christina  Smith    42
R3      Bob        OBrain   11
R4      Robert     Byree    12

WV(R1, R2): [0.9, 1.0, 1.0]
WV(R1, R3): [0.0, 0.0, 0.0]
WV(R1, R4): [0.0, 0.0, 0.5]
WV(R2, R3): [0.0, 0.0, 0.0]
WV(R2, R4): [0.0, 0.0, 0.5]
WV(R3, R4): [0.7, 0.3, 0.5]

As illustrated in the table, a weight vector is formed for each compared record pair that contains the matching weights calculated for that pair. Using these vectors, the pairs are classified as matches, possible matches and non-matches, depending on the decision model used.

III. TWO-STEP CLASSIFICATION

The idea of two-step record pair classification is based on the following assumptions. First, weight vectors that contain exact or high similarity values in all their matching weights were with high likelihood generated when two records that refer to the same entity were compared. Second, weight vectors that contain mostly low similarity values were with high likelihood generated when two records that refer to different entities were compared. As a result, selecting

such weight vectors in a first step for generating training data, and then training a classifier on these weight vectors in a second step, should enable automatic, efficient and accurate record pair classification.

3.1 Step 1: Selection of Training Data

The aim of the first step is to select, from the set W of all weight vectors generated in the comparison step, those vectors that correspond to true matches and true non-matches, and to insert them into the match training set WM and the non-match training set WN, respectively. There are two approaches to selecting the training sets: distance-threshold based and nearest based. We select the nearest based approach, since it outperforms the distance-threshold approach. In this approach weight vectors are sorted according to their Euclidean distances from the vector containing only exact similarities and from the vector containing only total dissimilarities, and the nearest vectors are selected for the respective training sets. An estimate of the ratio r of true matches to true non-matches can be calculated using the numbers of records in the two databases to be linked, A and B, and the number of weight vectors in W:

r = min(|A|, |B|) / (|W| - min(|A|, |B|))

where |·| denotes the number of elements in a set. The problem with a balanced training set is that weight vectors that likely do not correspond to true matches will be selected into WM.

3.2 Step 2: Classification of Record Pairs

Once the training sets for matches and non-matches have been generated, they can be used to train any binary classifier. In this paper the nearest neighbour classifier and an iterative SVM classifier are investigated. In the following, the set of weight vectors not selected into the training sets is denoted by WU, with WU = W \ (WM ∪ WN).

3.2.1 Nearest Neighbour Classification

The basic idea of this classifier is to iteratively add unclassified weight vectors from WU into the training sets until all weight vectors are classified. In each iteration, the unclassified weight vector closest to k already classified weight vectors is classified according to the majority vote of its

classified neighbours (i.e., whether the majority are matches or non-matches). Using the training example sets, the nearest based classifier can be implemented efficiently, as shown in Figure 2.
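The seed selection of Step 1 and the iterative nearest neighbour classification can be sketched as follows (a minimal sketch with k = 1 on toy weight vectors; the function names and seed-set sizes are illustrative assumptions, not values from the paper):

```python
import math

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def select_seeds(vectors, n_match, n_nonmatch):
    """Step 1: pick the vectors nearest to the all-ones (exact match)
    and all-zeros (total dissimilarity) vectors as seed training sets."""
    dim = len(vectors[0])
    ones, zeros = [1.0] * dim, [0.0] * dim
    by_match = sorted(vectors, key=lambda v: euclidean(v, ones))
    by_nonmatch = sorted(vectors, key=lambda v: euclidean(v, zeros))
    wm = by_match[:n_match]
    wn = [v for v in by_nonmatch if v not in wm][:n_nonmatch]
    return wm, wn

def classify(vectors, wm, wn):
    """Step 2: iteratively label each remaining vector according to its
    nearest already classified neighbour (k = 1)."""
    wm, wn = list(wm), list(wn)
    unclassified = [v for v in vectors if v not in wm and v not in wn]
    while unclassified:
        # pick the unclassified vector closest to any classified one
        v = min(unclassified,
                key=lambda u: min(euclidean(u, c) for c in wm + wn))
        d_m = min(euclidean(v, c) for c in wm)
        d_n = min(euclidean(v, c) for c in wn)
        (wm if d_m <= d_n else wn).append(v)
        unclassified.remove(v)
    return wm, wn
```

On the weight vectors of Table 2, WV(R1, R2) = [0.9, 1.0, 1.0] seeds the match set, while the low-similarity vectors, including WV(R3, R4) = [0.7, 0.3, 0.5], end up in the non-match set.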

Figure 2. Example of a nearest neighbour classifier with 2-dimensional weight vectors and k = 1, showing matches, non-matches and unclassified vectors.

3.2.2 Iterative SVM Classification

The iterative SVM classifier first trains an initial SVM using the training example sets WM and WN, and then iteratively adds the strongest positive and negative vectors from WU into the training sets of subsequent SVMs.

3.3 Measurement Tools

The following subsections introduce the metrics using the following notation. Let nM and nU be the total numbers of matched and non-matched record pairs in the entire dataset. Let s be the size of the reduced comparison space generated by the searching method, and let sM and sU be the total numbers of matched and non-matched record pairs in the reduced comparison space. Finally, let ca,d be the number of record pairs whose actual matching status is a and whose predicted matching status is d, where a is either M or U and d is M, U or P, with M, U and P representing matched, unmatched and possibly matched, respectively.

3.3.1 Reduction Ratio

The reduction ratio metric is defined as RR = 1 - s/(nM + nU). It measures the relative reduction in the size of the comparison space accomplished by a searching method.

3.3.2 Pairs Completeness

A searching method can be evaluated based on the number of actual matched record pairs contained in its reduced comparison space. We define the pairs completeness metric as the ratio of the matched record pairs found in the reduced comparison space to the total number of matched pairs in the entire comparison space. Formally, the pairs completeness metric is defined as PC = sM/nM.

3.3.3 Accuracy

The accuracy metric tests how accurate a decision model is. The accuracy of a decision model is defined as the percentage of correctly classified record pairs. Formally, the accuracy metric can be defined as AC = (cM,M + cU,U)/s.

IV. EXPERIMENTAL EVALUATION

The two record pair classifiers presented above were evaluated and compared with two other classifiers. The first is a supervised SVM classifier that has access to the true match status of all weight vectors; nine parameter variations were evaluated. The second is based on the hybrid approach implemented in the TAILOR toolbox [6]: it first employs k-means clustering and then uses the match and non-match clusters to train an SVM. The Euclidean distance function was used for the k-means step, while for the classifier step nine parameter variations were used.

TABLE 3. Datasets used in the experiments

Dataset  Number of records  PC     RR     No of weight vectors  Ratio r
DS-A     1,000              0.957  0.995  2,475                 1/1.48
DS-B     2,500              0.940  0.997  9,878                 1/2.95
DS-C     5,000              0.953  0.997  35,491                1/6.10
DS-D     10,000             0.948  0.997  132,532               1/12.25

Experiments were conducted using synthetic data, as summarized in Table 3. The four synthetic datasets of various sizes were created using the Febrl dataset generator. This synthetic data contains name and address attributes that are based on real-world frequency tables, and includes 60% original records and 40% duplicate records. The duplicates were created by randomly modifying record attributes. All classifiers were implemented using the Febrl [3] record linkage system.
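The evaluation metrics and the match-ratio estimate follow directly from their definitions (a minimal sketch; the function names are illustrative, and the figures used in the check below are the DS-A row of Table 3):

```python
def reduction_ratio(s, n_m, n_u):
    """RR = 1 - s / (nM + nU)."""
    return 1.0 - s / (n_m + n_u)

def pairs_completeness(s_m, n_m):
    """PC = sM / nM."""
    return s_m / n_m

def accuracy(c_mm, c_uu, s):
    """AC = (cM,M + cU,U) / s."""
    return (c_mm + c_uu) / s

def match_ratio(n_a, n_b, n_w):
    """Estimated ratio of true matches to true non-matches:
    r = min(|A|, |B|) / (|W| - min(|A|, |B|))."""
    m = min(n_a, n_b)
    return m / (n_w - m)
```

For DS-A (1,000 records, 2,475 weight vectors), match_ratio gives 1000/1475, i.e. roughly 1/1.48, consistent with the Ratio r column of Table 3.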

TABLE 4. Quality of the seed training sets (percentage of correctly selected vectors)

Seed size  DS-A    DS-B    DS-C    DS-D
1%         100%    99.0%   100%    100%
5%         96.7%   98.4%   99.8%   99.8%
10%        95.5%   98.3%   99.5%   99.7%

Table 4 shows the quality of the seed training sets generated in the first step of the proposed classification approach, given as the percentage of correctly selected vectors.

4.1 Results and Discussion

As can be seen in Table 4, the training example sets selected in the first step of the two-step classification approach are mostly of very high quality. While the 1% seed selection contains the highest percentage of correctly selected vectors, its size is very small, so classifiers based on the 1% seed perform worse than those based on the 5% and 10% seeds. Therefore the 1% classifier is not included in the F score results presented in Figure 3.
Figure 3.1. F score measure for DS-B (2,500 records)

Figure 3.2. F score measure for DS-C (5,000 records)

Figure 3.3. F score measure for DS-D (10,000 records)

(Each figure compares the F scores of the supervised SVM, the TAILOR classifier, the nearest based classifier, and the SVM 0-0 and SVM 25-50 variants.)

The F score measure for dataset DS-A is not included, since the dataset is small and a 1% seed yields very few training vectors. The weight vectors generated when attribute values are compared include a mixed distribution of matches and non-matches that are hard to classify without knowing the true match status of these vectors. This can be seen in Table 4, where with a 10% seed size the training set contains 9% non-matches.

The F score results for the synthetic datasets are shown in Figure 3. As expected, the supervised SVM classifier performs best on all datasets. The TAILOR classifier has the lowest performance on most datasets. The nearest based classifier performs better than the iterative SVM on all synthetic datasets. These experiments show the limitation of unsupervised classification approaches such as the TAILOR hybrid approach, which works based only on pair-wise attribute similarities, compared to supervised classification.
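For reference, the iterative SVM classifier of Section 3.2.2 can be sketched as follows (a minimal sketch assuming scikit-learn is available; the number of rounds and the fraction of vectors added per round are illustrative choices, not values from the paper):

```python
import numpy as np
from sklearn.svm import SVC

def iterative_svm(wm, wn, wu, rounds=3, frac=0.2):
    """Train an initial SVM on the seed sets WM/WN, then iteratively add
    the strongest positive and negative vectors from WU to the training
    sets of subsequent SVMs."""
    wm, wn, wu = list(wm), list(wn), list(wu)
    clf = None
    for _ in range(rounds):
        X = np.array(wm + wn)
        y = np.array([1] * len(wm) + [0] * len(wn))
        clf = SVC(kernel="linear").fit(X, y)
        if not wu:
            break
        scores = clf.decision_function(np.array(wu))
        order = np.argsort(scores)
        k = max(1, int(frac * len(wu)))
        # most negative (strong non-matches) and most positive (strong matches)
        strongest = set(order[:k]) | set(order[-k:])
        for i in sorted(strongest, reverse=True):   # pop from the end first
            (wm if scores[i] > 0 else wn).append(wu.pop(i))
    return clf, wm, wn
```

A usage sketch: starting from two clear match seeds and two clear non-match seeds, unclassified vectors near [1, 1] migrate into the match training set over the rounds, while those near [0, 0] migrate into the non-match set.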

V. CONCLUSION AND FUTURE WORK

This paper presented a novel unsupervised two-step approach to record pair classification that is aimed at automating the record linkage process. The approach combines automatic selection of training data with the training of a classifier on that data. The two classifiers, nearest based and iterative SVM, achieve improved record pair classification results compared with other classifiers such as the TAILOR approach. Future work will include conducting more experiments using different datasets, including runtime tests on datasets of various sizes, in order to obtain experimental scalability results. The overall efficiency of the proposed classifier can be further improved using data reduction and fast searching and indexing techniques.

REFERENCES

[1] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 9, 2012.
[2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.
[3] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.
[4] M.A. Hernandez and S.J. Stolfo, "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, 1998.
[5] P. Christen, "Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System With a Graphical User Interface," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 1065-1068, 2008.

[6] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 151-159, 2008.
[7] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 25-27, 2003.
[8] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[9] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[10] T. Churches, P. Christen, K. Lim, and J.X. Zhu, "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9, 2002.
