
International Journal of Dynamics of Fluids

ISSN 0973-1784 Volume 5, Number 2 (2009), pp. 145–164


© Research India Publications
http://www.ripublication.com/ijdf.htm

Detection and Elimination of Duplicate Data Using Token-Based Method for a Data Warehouse: A Clustering Based Approach

1J. Jebamalar Tamilselvi and 2V. Saravanan

1PhD Research Scholar, Department of Computer Application, Karunya University, Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: jjebamalar@gmail.com

2Professor & HOD, Department of Computer Application, Karunya University, Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: saravanan@karunya.edu

Abstract

The process of detecting and removing database defects and duplicates is referred to as data cleaning. The fundamental issue of duplicate detection is that inexact duplicates in a database may refer to the same real-world object due to errors and missing data. Duplicate elimination is hard because duplicates arise from different types of errors such as typographical errors, missing values, abbreviations and different representations of the same logical value. Existing approaches to duplicate detection and elimination are domain dependent; they rely on similarity functions and thresholds for duplicate elimination and produce many false positives.
This research work presents a general sequential framework for duplicate detection and elimination. The proposed framework uses six steps to improve the process of duplicate detection and elimination. First, an attribute selection algorithm is used to select the attributes best suited for duplicate identification and elimination. In the next step, tokens are formed for the selected attribute field values. After token formation, a clustering algorithm or blocking method is used to group the records based on their similarity values. The best blocking key is selected for blocking the records by comparing duplicate detection performance. In the next step the threshold value is calculated based on the similarities between records and fields. Then, a rule-based approach is used to detect duplicates and to eliminate poor-quality duplicates by retaining only one copy of the best duplicate record.

Finally, all the cleaned records are grouped or merged and made available for the next process.
This research work is efficient at reducing the number of false positives without missing true duplicates. In contrast to previous approaches, the token concept is included to speed up the data cleaning process and reduce its complexity. Several blocking keys are analysed through extensive experiments to select the best blocking key for bringing similar records together and so avoid comparing all pairs of records. A rule-based approach is used to identify exact and inexact duplicates and to eliminate them.

Introduction
In the 1990s, as large organizations began to need more timely data for their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. Completing reporting requests could take days or weeks using antiquated reporting tools that were designed more or less to 'execute' the business rather than 'run' the business.
A data warehouse is essentially a database, and the unintentional duplication of records created from the millions of records drawn from other sources can hardly be avoided. In the data warehousing community, the task of finding duplicated records within a data warehouse has long been a persistent problem and has become an area of active research. Many research efforts have addressed the problems caused by duplicate contamination of data.
There are two issues to be considered for duplicate detection: Accuracy and
Speed. The measure of accuracy in duplicate detection depends on the number of false
negatives (duplicates that were not classified as such) and false positives (non-
duplicates which were classified as duplicates). The algorithm’s speed is mainly
affected by the number of records compared, and how costly these comparisons are.
Generally, CPUs are not able to perform duplicate detection on large databases within any reasonable time, so the number of record comparisons normally needs to be cut down [4].
In this research work, a framework is developed to handle any duplicate data in a
data warehouse. The main objective of this research work is to improve data quality
and increase speed of the data cleaning process. A high quality, scalable blocking
algorithm, similarity computation algorithm and duplicate elimination algorithm are
used and evaluated on real datasets from an operational data warehouse to achieve the
objective.

Framework Design
A sequential framework is developed for the detection and elimination of duplicate data. This framework comprises existing data cleaning approaches together with new approaches that reduce the complexity of duplicate detection and elimination and allow cleaning with more flexibility and less effort.

Fig. 1 shows the framework to clean the duplicate data in a sequential order. Each step of the framework is suited to a different purpose. The framework adapts to the data by using a software agent in each step, with little user interaction.
The principle of this framework is as follows:
A. Selection of attributes: There is a clear need to identify and select attributes.
These selected attributes are to be used in the other steps.
B. Formation of tokens: A well-suited token is created to check the similarities between records as well as fields.
C. Clustering/Blocking of records: A clustering algorithm or blocking method is used to group the records based on the similarity of the block-token-key values.
D. Similarity computation for selected attributes: Jaccard similarity method is
used for token-based similarity computation.
E. Detection and elimination of duplicate records: The rule based detection
and elimination approach is used for detecting and eliminating the duplicates.
F. Merge: The result or cleaned data is merged.

A. Selection of attributes
The data cleaning process is complex because of the large amount of data in the data warehouse. Attribute selection is very important to reduce the time and effort for the subsequent work such as record similarity computation and the elimination process. Attribute selection is also very important when comparing two records [5]. This step is the foundation for all the remaining steps; therefore the attributes to be considered must be selected promptly and with care.

[Figure 1 layout: New Data and the Data Warehouse feed (A) Selection of Attributes → (B) Forming Tokens → (C) Clustering/Blocking Algorithm → (D) Similarity computation for selected attributes using selected functions → (E) Elimination → (F) Merge → Cleaned Data, with similarity and elimination LOG tables and a bank of similarity functions maintained alongside.]

Figure 1: Framework for Data Cleaning.



A missing value is expressed and treated as a string of blanks; obviously, missing character values are not the smallest strings. A distinct count is used to retrieve the number of rows that have unique values for each attribute. Uniqueness is the characteristic of an attribute that makes up its relevance for duplicate detection. Uniqueness is a property that is reflected in the distribution of the similarity of the attribute pairs of non-duplicates, i.e. if an attribute is highly unique, the mean similarity on non-duplicates for this attribute will be low. The measurement type of each attribute is also considered for attribute selection: data cleaning on numeric data is not very effective, whereas categorical data is well suited to the data cleaning process.
To identify relevant attributes for the further data cleaning process, three criteria are used: (a) identifying key attributes, (b) classifying attributes with high distinct-value and low missing-value counts and (c) the measurement types of the attributes. Based on this information, a 'weight' or 'rank value' is calculated for all the attributes. Finally, the highest-priority attributes are selected for the further data cleaning process [2].
The flow diagram in Fig. 3 shows the attribute selection procedure in a sequential way. First, the data set is identified for the data cleaning process. From this data set, attributes are analyzed by identifying the type of each attribute, the relationships between attributes, and the properties of each attribute in order to select the appropriate attributes. The type of an attribute is classified using its data type and the size or length of the attribute.
A threshold value is used to identify the best attributes for the data cleaning process. Attributes are assessed using three different criteria – a high threshold value, data quality and a high rank. A high threshold value identifies high-power attributes for the data cleaning process. Attributes are ranked based on the threshold value and the data quality value. Finally, high-rank attributes are selected for the next cleaning step to improve the speed of the data cleaning process. The developed attribute selection algorithm is presented in Fig. 2.
The attribute selection algorithm can eliminate both irrelevant and redundant attributes and is applicable to any type of data (nominal, numeric, etc.). It can also handle data of different attribute types smoothly. The quality of the algorithm is confirmed by applying a set of rules. The main purpose of this attribute selection for data cleaning is to reduce the time and improve the speed of the further data cleaning steps, such as token formation, record similarity computation and the elimination process.

Input  : N attributes, number of tuples n, relation instance r
Output : S – subset of the attributes
Var    : L – temporary relation instance, X – attribute set, F – field set, i, j

begin
  I.   Analyze attribute set X to select the best attributes
  II.  Calculate the threshold value σ for each attribute
       for each attribute x_i, i ∈ {0, 1, 2, ..., N}
       do
          i)  Assign threshold value 1 to a key attribute and put it into L
          ii) Calculate threshold σ with (σ : D ∧ M ∧ MT ∧ S)
              a. Distinct (D) value of attribute x_i over tuples i = 1..n (t = t_i)
              b. Missing (M) value of attribute x_i over tuples i = 1..n (t_i = NULL)
              c. Measurement type (MT) of attribute x_i (ordinal, nominal, interval, ratio)
              Put the result into L
       end
  III. Select attributes with a high threshold value σ_i and put them into L
  IV.  Ignore attributes with a low threshold value
  V.   Calculate the data quality dq_j of the selected attributes
       for each attribute x_i, i ∈ {0, 1, 2, ..., N}
         for each field f_j, j ∈ {0, 1, 2, ..., n}
           compute
             a. Completeness_j = ( Σ_{i=0..N} completeness_{i,j} ) / n
             b. Accuracy_j     = ( Σ_{i=0..N} accuracy_{i,j} ) / n
             c. Consistency_j  = ( Σ_{i=0..N} consistency_{i,j} ) / n
             d. dq_j = ( α·Completeness_j + β·Accuracy_j + γ·Consistency_j ) / n
         end for
       end for
  VI.  Rank attributes based on data quality and threshold value
  VII. Select the high-rank attributes
end

Figure 2: Algorithm for attribute selection.
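As a rough Python sketch of the scoring idea in Fig. 2 (not the authors' exact implementation), the snippet below ranks attributes by a threshold built from the distinct-value ratio, the missing-value ratio and the measurement type; the weights (0.4/0.3/0.3), the measurement-type scores and the function names are assumptions made for illustration.

```python
# Illustrative sketch of ranking attributes by a threshold score.
# The weights and measurement-type scores are assumptions, not the paper's exact values.

def attribute_threshold(values, is_key, measurement_type):
    """Score one attribute from distinctness, completeness and measurement type."""
    if is_key:
        return 1.0                                    # key attributes get the maximum score
    n = len(values)
    distinct_ratio = len({v for v in values if v not in (None, "")}) / n
    missing_ratio = sum(1 for v in values if v in (None, "")) / n
    # categorical (nominal/ordinal) data is treated as more useful for cleaning than numeric
    type_score = {"nominal": 1.0, "ordinal": 0.8, "interval": 0.4, "ratio": 0.3}[measurement_type]
    return 0.4 * distinct_ratio + 0.3 * (1.0 - missing_ratio) + 0.3 * type_score

def select_attributes(table, meta, top_k=7):
    """Rank attributes by threshold score and keep the top_k highest-ranked ones."""
    scores = {
        col: attribute_threshold(vals, meta[col]["key"], meta[col]["type"])
        for col, vals in table.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Tiny in-memory example
table = {
    "name":    ["John", "Jon", "Mary", None],
    "pincode": ["641114", "641114", "641001", "641001"],
}
meta = {
    "name":    {"key": False, "type": "nominal"},
    "pincode": {"key": False, "type": "nominal"},
}
print(select_attributes(table, meta, top_k=1))   # ['name']
```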



[Figure 3 layout: Original data set → analyze attributes → calculate threshold values → calculate data quality for selected attributes → rank attributes by threshold value and data quality → select high-rank attributes.]

Figure 3: Flow diagram of attribute selection.

B. Formation of Tokens
This step makes use of the selected attribute field values to form a token. Tokens can be created for a single attribute field value or for combined attributes. For example, the contact name attribute may be selected to create a token for the further cleaning process: the contact name attribute is split into first name, middle name and last name, and the first name and last name are then combined as the contact name to form a token.
Tokens are formed from numeric values, alphanumeric values and alphabetic values by selecting some combination of characters. Unimportant elements (title tokens such as Mr., Dr. and so on) are removed before token formation [6].
Numeric tokens comprise only digits (0–9). Alphabetic tokens consist of alphabetic characters (a–z, A–Z); the first character of each word in the field is taken and the characters are sorted. Alphanumeric tokens comprise both numeric and alphabetic parts: a given alphanumeric element is decomposed into its numeric and alphabetic components [7].
This step eliminates the need to use entire string records with multiple passes for duplicate identification. It also addresses the similarity computation problem in a large database by forming a token key from a few selected fields, which reduces the number of comparisons.
Fig. 4 shows an algorithm for token formation. In this algorithm, rules are specified to form the tokens, and the algorithm works according to the type of the data. For example, if the address attribute is selected, the alphanumeric rule is used to form the tokens. The formed tokens are stored in the LOG table.
The idea behind this algorithm is to define smart tokens from the fields of the selected, most important attributes by applying simple rules for numeric, alphabetic, and alphanumeric tokens. A temporary table then consists of smart token records, composed from the field tokens of the records. These smart token records are sorted using the block-token-key.

Input  : tables with dirty data, reference table, selected attributes
Output : LOG table with tokens
Var    : i, j, m – attribute set, n – number of records

begin
  for attribute i = 1 to m
    for row j = 1 to n
    do
      i)   remove unimportant characters
      ii)  expand abbreviations using the reference table
      iii) if row(j) is numeric then
             a. convert the string into a number
             b. sort or arrange the digits in order
             c. form a token, then put it into the LOG table
           end if
      iv)  if row(j) is alphanumeric then
             a. separate the numeric and alphabetic parts
             b. split the alphanumeric value into numeric and alphabetic components
             c. sort the numeric and alphabetic parts separately
             d. form a token, then put it into the LOG table
           end if
      v)   if row(j) is alphabetic then
             a. select the first character from each word
             b. sort these letters in a specific order
             c. string them together
             d. if only one word is present, take its first three characters as the token, then sort the characters
             e. form a token, then put it into the LOG table
           end if
    end
end

Figure 4: Algorithm for Token Formation.
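A minimal Python sketch of the token rules of Fig. 4 follows; the abbreviation reference table, the title list and the exact ordering choices are assumptions made for illustration only.

```python
import re

# Assumed abbreviation reference table and title list; the paper uses a separate Reference table.
ABBREVIATIONS = {"st.": "street", "rd.": "road"}
TITLES = {"mr.", "mrs.", "dr.", "prof."}

def form_token(value):
    """Form a smart token from one field value using the numeric/alphabetic/alphanumeric rules."""
    words = [ABBREVIATIONS.get(w, w) for w in value.lower().split() if w not in TITLES]
    text = " ".join(words)
    if text.replace(" ", "").isdigit():                # numeric rule: arrange the digits in order
        return "".join(sorted(text.replace(" ", "")))
    if re.search(r"\d", text):                         # alphanumeric rule: split, then sort each part
        digits = "".join(sorted(re.findall(r"\d+", text)))
        letters = "".join(sorted(w[0] for w in re.findall(r"[a-z]+", text)))
        return digits + letters
    if len(words) == 1:                                # single word: first three characters, sorted
        return "".join(sorted(words[0][:3]))
    return "".join(sorted(w[0] for w in words))        # multi-word: sorted first characters

print(form_token("Mr. John K. Smith"))   # -> 'jks'
print(form_token("12 Park Street"))      # -> '12ps'
```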

C. Clustering/Blocking of records
Data mining primarily works with large databases. Sorting large datasets and running the duplicate elimination process over such a large database faces scalability problems. Clustering techniques are used to cluster or group the dataset into small groups based on distance values or threshold values, in order to reduce the time for the elimination process.
Blocking methods are used to reduce the huge number of comparisons. This step groups the records that are most likely to be duplicates based on the similarity of the block-token-key; i.e. the block-token-key is used to split the data set into blocks, and the choice of this key is therefore very important. Four types of block-token-key generation are used, and the best block-token-key for blocking is identified among them: i) blocking with a single attribute, ii) blocking with multiple attributes, iii) array-based block-token-key and iv) token-based blocking key. The choice of a good block-token-key can greatly reduce the number of record pair evaluations to be performed, so the user can achieve significant results.

Input  : data set (records) D, keys k[ ]
Output : blocked records
Var    : n – number of records, blocks b[ ], i, j

begin
  1. Sort the database using the block-token-key
  2. Initialize a new block
  3. Block the records
     i. Single attribute
        for each record i = 0 to n
          for each key j = 1 to kn
            for each block k = 1 to bn
              if distance(key[j], key[j + 1]) > threshold
                add key[j] to block b[k]
              else
                add key[j + 1] to b1
                initialize new block b[k + 1]
              end if
              j = j + 1
            end for
        end
     ii. Multiple attributes
        for each record i = 0 to rn
          for each column j = 0 to cn
            for each key k = 1 to kn
              for each block b = 1 to bn
                calculate the distance of all selected column key values d(values)
                if distance d(values) > threshold
                  add key[j] to block b[k]
                else
                  add key[j + 1] to b1
                  initialize new block b[k + 1]
                end if
                j = j + 1
              end for
        end
end

Figure 5: Algorithm for blocking single and multiple attribute records.
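The following is a simplified Python sketch of blocking on a sorted block-token-key, in the spirit of Fig. 5 rather than a literal transcription; the key_similarity measure and the threshold of 0.6 are assumptions.

```python
from difflib import SequenceMatcher

def key_similarity(a, b):
    """Assumed similarity measure between two block-token-keys (0..1)."""
    return SequenceMatcher(None, a, b).ratio()

def block_records(records, key_field="token_key", threshold=0.6):
    """Group records into blocks of adjacent, similar block-token-keys."""
    ordered = sorted(records, key=lambda r: r[key_field])   # 1. sort by block-token-key
    blocks, current = [], []
    for rec in ordered:
        # close the current block when the next key is no longer similar to the previous one
        if current and key_similarity(current[-1][key_field], rec[key_field]) < threshold:
            blocks.append(current)
            current = []
        current.append(rec)
    if current:
        blocks.append(current)
    return blocks

records = [
    {"id": 1, "token_key": "jks"},
    {"id": 2, "token_key": "jks"},
    {"id": 3, "token_key": "mrt"},
]
print([[r["id"] for r in b] for b in block_records(records)])   # [[1, 2], [3]]
```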

This algorithm efficiently identifies exact and inexact duplicate records based on the selection of the block-token-key. The number of comparisons is reduced compared with other existing methods, which increases the speed of the data cleaning process. This blocking method gives better performance than methods that shrink or expand the window size based on the block-token-key.

D. Similarity Computation for Selected Attributes


Record linkage algorithms fundamentally depend on string similarity functions for comparing record fields as well as on record similarity functions for comparing records. Similarity computation functions depend on the data type; therefore the user must choose the function according to the attribute's data type, for example numerical, string and so on.
This step uses the Jaccard similarity function to compare token values of adjacent field values for the selected attributes. Tokenization is typically performed by treating each individual word of a certain minimum length as a separate token or by taking the first character from each word. In step (ii), tokens have been created for the selected attributes. Each function measures the similarity of the selected attributes with other record fields and assigns a similarity value to each field. In the next step, clustering techniques are used to group the fields based on the similarity values.
Accurate similarity functions are important for the clustering and duplicate detection problem. A better string distance may also be useful for classifying a record pair as a match or non-match. These matching and non-matching pairs are used for clustering and for eliminating the duplicates.
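The Jaccard computation itself is straightforward, as in the small sketch below; here whitespace tokenisation stands in for the smart tokens formed in step B.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Comparing the token values of two adjacent field values
print(jaccard("j k smith".split(), "j smith".split()))   # 2/3 ~ 0.67
```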

E. Detection and elimination of duplicate data


In step (v), a rule-based duplicate detection and elimination approach is used for detecting and eliminating records. During the elimination process, only one copy of each set of duplicated records is retained and the other duplicate records are eliminated [8], [9]. The elimination process is very important for producing clean data. The preceding steps are used to identify the duplicate records; this step detects and removes the duplicate records from one or more clusters. Before the elimination process, the similarity threshold values for all the records in the dataset are calculated, since these values are important for the elimination process.
A comparison process similar to that of step (iv) is also required in step (v). To reduce the comparisons, the threshold or similarity values are stored in a LOG file. The threshold criteria and certainty factors are used to detect and eliminate the duplicate records. Finally, one record is retained as the prime representative and this value is maintained in the LOG file. This primary copy is also used for the incremental cleaning process and for further work. This approach can substantially reduce the probability of false mismatches, with a relatively small increase in running time.
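A simplified sketch of the elimination step is given below: within a block, records whose similarity to a retained prime representative exceeds the threshold are logged as duplicates and dropped. The completeness rule for choosing the prime copy, the toy similarity function and the threshold value are assumptions, not the paper's exact rules.

```python
def completeness(record):
    """Assumed quality rule: prefer the record with the fewest missing fields."""
    return sum(1 for v in record.values() if v not in (None, ""))

def field_similarity(a, b):
    """Toy stand-in for the token-based similarity score between two records."""
    ta, tb = set(str(a["name"]).lower().split()), set(str(b["name"]).lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def eliminate_duplicates(block, similarity, threshold=0.7):
    """Keep one prime representative per group of duplicates inside a block."""
    survivors, log = [], []
    for rec in sorted(block, key=completeness, reverse=True):
        match = next((s for s in survivors if similarity(rec, s) >= threshold), None)
        if match is None:
            survivors.append(rec)                            # first of its kind: prime representative
        else:
            log.append({"duplicate": rec, "prime": match})   # record the elimination in the LOG
    return survivors, log

block = [{"name": "j smith", "phone": ""}, {"name": "j smith", "phone": "98401"}]
kept, log = eliminate_duplicates(block, field_similarity)
print(len(kept), len(log))   # 1 1
```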

F. Merge
This step merges the corrected data into a single cluster [10], [11]. The user must maintain the merged records and the prime representatives as a separate file in the data warehouse. This information helps the user with further changes in the duplicate elimination process. The merge step is also useful for incremental data cleaning: when new data enters the data warehouse, incremental data cleaning checks the new data against the already created LOG file, which reduces the time needed for the data cleaning process.
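One way the LOG file could support incremental cleaning is sketched below: the token key of each incoming record is checked against tokens recorded in earlier runs, so only unseen records need the full comparison pass. The JSON layout, the file name and the use of the name field are assumptions for illustration.

```python
import json, os

LOG_PATH = "cleaning_log.json"          # assumed name/location of the LOG file

def load_log():
    """Load the token LOG written by earlier cleaning runs (empty on the first run)."""
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            return json.load(f)
    return {}

def incremental_check(new_records, form_token):
    """Return only the records whose token key has not been seen in earlier cleaning runs."""
    log = load_log()
    unseen = []
    for rec in new_records:
        key = form_token(rec["name"])
        if key in log:
            continue                     # token already cleaned: candidate duplicate, no full pass needed
        log[key] = rec["id"]             # remember the prime representative for this token
        unseen.append(rec)
    with open(LOG_PATH, "w") as f:
        json.dump(log, f)
    return unseen

new = [{"id": 10, "name": "john smith"}, {"id": 11, "name": "john smith"}]
token = lambda n: "".join(sorted(w[0] for w in n.split()))
print([r["id"] for r in incremental_check(new, token)])   # [10] on a first run
```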

Experimental Results
An attribute selection algorithm is implemented to select the most important attributes, i.e. those having enough information for identifying duplicate records; the selected attributes are then used for duplicate record detection. The results are presented below for different attribute values, numbers of detected duplicate records and token formation. The efficiency of duplicate detection and elimination largely depends on the selection of attributes. The token formation algorithm is used to form tokens for the selected attribute field values in order to reduce the time for the cleaning process.

Attribute Selection with parameters


Wrong selection of attributes affects the performance and accuracy of the data cleaning process. Hence key fields are selected in such a way that the fields contain sufficient information to identify duplication of records. Fig. 6 shows the attribute selection for the data cleaning process. The threshold value is calculated using four important criteria: size of the field, missing values, distinct values and measurement type. The best attributes are selected based on the threshold value for the data cleaning process. The name attribute has the highest threshold value compared with the other attributes, and seven attributes are selected for the next stage of data cleaning.

[Bar chart: field size, missing value, distinct value and measurement type scores for each attribute.]

Figure 6: Attribute selection in Student Dataset.

Attribute Vs Duplicates
The identification of duplicates is mainly based on the selection of attributes and the selection of the window size. In existing methods, a fixed-size sliding window is used to minimize the number of comparisons. In this method, a dynamic window size is used based on the similarities of field values, and the best duplicate detection accuracy is obtained using this dynamic method. Fig. 7 shows how the number of duplicate records detected varies as the sliding window size changes, and for a dynamic window size. The duplicate detection result varies based on the selected window size and the dynamic size. To test this, results are taken by varying the attribute values for each execution, setting the window size between 10 and 50 as well as using the dynamic size.
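A rough sketch of the dynamic-window idea is given below; the expansion rule is an assumption, since the exact mechanism is not spelled out here. Starting from a small window over the sorted records, the window grows while the record at its edge remains similar to the anchor record.

```python
def dynamic_window_pairs(sorted_keys, similarity, start=2, max_size=50, threshold=0.6):
    """Yield candidate record pairs using a window that expands while neighbours stay similar."""
    n = len(sorted_keys)
    for i in range(n):
        size = start
        # grow the window as long as the record at its edge is still similar to the anchor record
        while i + size < n and size < max_size and similarity(sorted_keys[i], sorted_keys[i + size]) >= threshold:
            size += 1
        for j in range(i + 1, min(i + size, n)):
            yield i, j

keys = ["jks", "jks", "jkt", "mrt"]
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
print(list(dynamic_window_pairs(keys, sim)))   # [(0, 1), (0, 2), (1, 2), (2, 3)]
```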

[Chart: duplicates detected for ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_ADDRESS2, ADD_CDATE and ADD_PINCODE at window sizes 2, 10, 20, 30, 40, 50 and with the dynamic window.]

Figure 7: Attribute Vs Duplicate detected with varying window size.

Duplicates Vs No. of Attributes


The efficiency of the record matching algorithm mainly depends on the selection of the attributes. It can be observed from Fig. 8 that the selection of key columns greatly influences the result. Table 1 shows the key field selections and the approximate number of duplicate records detected for those selected key fields. Selecting a combination of multiple attributes is more useful for identifying exact and inexact duplicates.


Figure 8: Duplicate detected Vs No. of attribute selected.



Table 1: Key columns and number of duplicates detected.

No. of Columns  Key Columns Selected                                                                  No. of duplicates detected
1               ADD_ADDRESS1                                                                          60988
2               ADD_ADDRESS1, ADD_NAME                                                                58944
3               ADD_ADDRESS1, ADD_NAME, ADD_PHONE1                                                    50860
4               ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE                                         45128
5               ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL                                40156
6               ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE                39160
7               ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE, ADD_PINCODE   33172

Time Vs Token Formation


In this research work, the token formation step takes very little time. Table loading and attribute selection from the data warehouse also take only a few seconds to select the best attributes for the data cleaning process (shown in Fig. 9). The time varies based on the size of the data set.

[Chart: time in seconds for token creation and for table definition and attribute selection, across different data sizes.]

Figure 9: Time taken Vs token formation, attribute selection with different data size.

Similarity Time Vs Database Size


In the multiple-similarity approach, the similarity between entire records and fields is compared bit by bit using an exact string matching function. In the token-similarity approach, the token values of the adjacent selected attribute fields are compared. The time taken by the token-based similarity method is lower than that of the multiple-similarity method. Fig. 10 shows the variation in time between token-based similarity and multiple-attribute similarity for different database sizes. The speed of the similarity computation is increased with token-based similarity computation. As the size of the data increases, the time taken for similarity computation also increases; nevertheless, the time taken is reduced when token-based similarity computation is compared with multi-similarity computation.

[Chart: similarity computation time in seconds for Token_Similarity and Multi_Similarity as database size grows.]

Figure 10: Similarity time Vs Database size.

Duplicates Vs Size of dataset and attributes


The number of blocks varies with the size of the dataset and the number of duplicates; basically, the block size depends on the number of duplicates available in the data set. In Fig. 11, the identification of duplicates is very low using the address1 attribute and high using the phone attribute. The address1 attribute identifies more duplicates because it has more highly distinct values than the other attributes; as a result, high-power attributes identify more duplicates than other attributes. At the same time, the identification of duplicates varies for different sizes of data.
[Stacked chart: duplicates identified per attribute (Address1, Name, Address2, Phone) for dataset sizes from 100,000 to 500,000 records.]

Figure 11: Duplicates Vs Size of dataset and attributes.



Duplicates Vs Window
Fig. 12 shows the relationship between the identification of duplicates, the window size and the attributes. In this figure, the size of each window and the number of windows vary for each attribute, and the number of identified duplicates also varies with each attribute. Accurate results are obtained using the attributes address1 and phone; these two attributes have enough information, i.e. a high number of distinct values, for duplicate identification. The attributes name and address2 have fewer unique values than the other attributes, and the number of wrongly identified duplicates is high with these two attributes. Overall, smaller blocks result in fewer comparisons but match pairs are missed, while pair completeness improves with larger windows. Fig. 12 shows the identification of duplicates varying with the size of the window: as the duplicates increase, the size of the window also increases for the different attributes.

[Scatter chart: duplicates versus window size for 1 – Address1, 2 – Name, 3 – Address2, 4 – Phone.]

Figure 12: Duplicates Vs Window.

Pair completeness for blocking keys and dynamic window


Fig. 13 shows pair completeness for different types of blocking keys. Pair completeness (PC) is defined as PC = SM / NM, where SM is the number of true match record pairs in the set of record pairs produced for comparison by the blocking key and dynamic window size blocking method, and NM is the total number of true match record pairs in the entire data. These results show the pair completeness obtained using the blocking keys and the dynamic blocking method. Pair completeness varies as the size of the dataset increases, depending on the blocking key, and the size of the window varies based on the resemblance of the records. In this research work, the token-based blocking key gives the best pair completeness results: the identification of duplicates is higher with the token-based blocking key than with the other three blocking key methods.
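Pair completeness can be computed directly from the candidate pairs produced by a blocking method and the set of true matches, as in this small sketch:

```python
def pair_completeness(candidate_pairs, true_matches):
    """PC = SM / NM: the share of true match pairs retained by the blocking method."""
    candidates = {tuple(sorted(p)) for p in candidate_pairs}
    matches = {tuple(sorted(p)) for p in true_matches}
    return len(candidates & matches) / len(matches)

# Blocking produced three candidate pairs; both true matches are among them, so PC = 1.0
print(pair_completeness([(1, 2), (3, 4), (5, 6)], [(1, 2), (4, 3)]))
```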
To evaluate the performance of this research work, the errors introduced in duplicate records range from small typographical changes to large changes of some fields. The database duplicate ratio is the ratio of the number of duplicate records to the number of records in the database. To analyse the efficiency of this research work, the proposed approach is applied to a selected data warehouse with varying window sizes, database duplicate ratios and database sizes. In all tests, the time taken for the duplicate data detection and elimination processes is analysed to evaluate the time saved by this research work.

[Chart: pair completeness (0.5–0.9) versus dataset size for the array-based block-token-key, blocking with a single attribute, blocking with multiple attributes and the token-based blocking key.]

Figure 13: Pair completeness for blocking keys and dynamic window.

Duplicates Vs Threshold Values


Fig. 14 shows the performance of duplicate detection, including false mismatches, for varying threshold values. Table 2 contains the number of duplicates detected by each combination of attributes for a particular threshold value. The values shown in bold represent the total number of duplicate records detected at the optimum threshold values, which are 0.6, 0.7 and 0.8. The results are not accurate if the threshold value is greater or less than the optimum: the number of mismatches and false mismatches increases and it is not possible to detect exact and inexact duplicates, so the accuracy of duplicate detection suffers. The threshold value is therefore an important parameter for the duplicate detection process.


Figure 14: Duplicate detected Vs varying threshold value.



Table 2: Attributes and threshold value.

ATTRIBUTES                                                                           0.5    0.6    0.7    0.8    0.9    1
ADD_ADDRESS1                                                                         40153  38511  37462  34423  31048  30988
ADD_ADDRESS1, ADD_NAME                                                               48523  45263  45139  42874  40743  38944
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1                                                   49898  48990  48987  45214  41985  39860
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE                                        52984  50234  49136  46987  42035  40128
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL                               55133  53134  52456  50156  42542  40156
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_ADDRESS2                 55982  53634  53452  50598  42590  40160
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_ADDRESS2, ADD_PINCODE    57213  54136  53874  52345  42898  40172

Percent of incorrectly detected duplicated pairs Vs Attributes


Fig. 15 shows that the false positive ratio increases as the window size increases. Identification of duplicates depends on the keys selected and the size of the window. The figure shows the percentage of records incorrectly marked as duplicates as a function of the window size. The percentage of false positives is almost insignificant for each independent run and grows slowly as the window size increases. The percentage of false positives grows very slowly when the window size is dynamic, which suggests that a dynamic window size is more accurate than a fixed-size window for duplicate detection.

[Chart: false positives (%) versus window size for ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_ADDRESS2, ADD_CDATE, ADD_PINCODE and ADD_ADDRESS3.]

Figure 15: False Positives Vs Window size and Attributes.

Time taken on databases with different dataset size


Fig. 16 shows the variation of the time taken for different database sizes, along with the time taken by each step of this research work. The time taken by the proposed approach increases as the database size increases, and it also increases when the duplicate ratio in the dataset increases. The time taken for duplicate data detection and elimination mainly depends on the size of the dataset and the duplicate ratio in the dataset. The time saving is much larger than in existing work because the token-based approach is implemented to reduce the time taken for the cleaning process while improving the quality of the data.

[Chart: time in seconds for table definition, token creation, blocking, similarity computation, duplicate detection, duplicate elimination and total time, for database sizes from 50,000 to 150,000 records.]

Figure 16: Time taken Vs Database size.

Time performance Vs Different size databases and Percentage of created duplicates
Duplicates amounting to 10%, 30% and 50% of the records are created in the selected datasets for result analysis. The results are shown in Fig. 17. The time taken for duplicate detection and elimination varies with the size of the data and the percentage of duplicates in the dataset. For these relatively large databases, the time appears to increase linearly as the size of the database increases, independently of the duplicate factor.

[Chart: time versus database size (50,000 to 150,000 records) for 10%, 30% and 50% duplicates.]

Figure 17: Time Vs Database size and % of Duplicates.

Conclusion
Deduplication and data linkage are important tasks in the pre-processing step of many data mining projects. It is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in large databases is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data in order to improve data quality and to support any subject-oriented data. This framework can be used to develop a powerful data cleaning tool by applying existing data cleaning techniques in a sequential order.
A new attribute selection algorithm is implemented and evaluated through extensive experiments. The attribute selection algorithm can eliminate both irrelevant and redundant attributes and is applicable to any type of data (nominal, numeric, etc.). It can also handle data of different attribute types smoothly. The quality of the algorithm's results is confirmed by applying a set of rules. The main purpose of this attribute selection for data cleaning is to reduce the time for the further data cleaning steps, such as token formation, record similarity computation and the elimination process.
The token formation algorithm is used to form smart tokens for data cleaning and is suitable for numeric, alphanumeric and alphabetic data. Three different rules are described for numeric, alphabetic, and alphanumeric tokens. The result of token-based data cleaning is the removal of duplicate data in an efficient way; the time is reduced both by the selection of attributes and by the token-based approach. The formed tokens are stored in the LOG table. Comparing entire strings takes more time than comparing tokens, and the formed token is used as the blocking key in the further data cleaning process. Token formation is therefore very important for defining the best, smart tokens.
Using an unsuitable key that cannot group the duplicates together has a detrimental effect on the result, i.e. many false duplicates are detected in comparison with the true duplicates when using, say, the address key. Hence, key creation and the selection of attributes are important in the blocking method for grouping similar records together. The selection of the most suitable blocking key (parameter) for the blocking method is addressed in this research work. Dynamically adjusting the blocking key for the blocking method is effective in record linkage algorithms at execution time. The blocking key is selected based on the type of the data and the usage of the data in the data warehouse. The dynamically adjusted blocking key and the token-based blocking key, as well as the dynamic window size SNM method, are used in this research work. An agent is used to tune the parameters, so everything is set dynamically for the blocking method without human intervention, to yield better performance. In most real-world problems, where expert knowledge is hard to obtain, it is helpful to have methods that can automatically choose reasonable parameters.
Time is critical in cleansing large databases. In this research work, an efficient token-based blocking method and similarity computation method are used to reduce the time taken for each comparison. An efficient duplicate detection and elimination approach is developed to obtain good duplicate detection and elimination results while reducing false positives. The performance of this research work shows significant time savings and better duplicate detection results than the existing approaches.
In contrast to previous approaches, the token concept is included to speed up the data cleaning process and reduce its complexity. Each step of this new framework is specified clearly in a sequential order by means of the six data cleaning processes offered: attribute selection, token formation, clustering, similarity computation, elimination, and merge. An agent is used in this framework to reduce the effort required from the user; this agent works according to the type and size of the data set. The framework is flexible for all kinds of data in relational databases.
The framework is mainly developed to increase the speed of the duplicate data detection and elimination process and to increase the quality of the data by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategy are improved by introducing the concept of a certainty factor for a rule. Data cleansing is a complex and challenging problem; this rule-based strategy helps to manage the complexity, but does not remove it. The approach can be applied to subject-oriented databases in any domain. The proposed research work maintains LOG files throughout the cleaning process for incremental data cleaning.

References
[1] Dorian Pyle, Data Preparation for Data Mining, Published by Morgan
Kaufmann, ISBN 1558605290, 9781558605299, 540 pages, 1999.
[2] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques,
Publisher: Elsevier Science & Technology Books, ISBN-13:9781558609013,
March 2006.
[3] Hans-Peter Kriegel, Karsten M. Borgwardt, Peer Kröger, Alexey Pryakhin, Matthias Schubert, Arthur Zimek, Future Trends in Data Mining, Data Mining and Knowledge Discovery, Volume 15, Issue 1, pp. 87–97, ISSN 1384-5810, 2007.
[4] Robert Leland, Duplicate Detection with PMC – A Parallel Approach to
Pattern Matching Department of Computer and Information Science,
Norwegian University of Science and Technology, Ph.D. Thesis, August 2007.
[5] Kononenko and S.J. Hong, Attribute Selection for Modeling, Future
Generation Computer Systems, Elsevier, Volume 13, Number 2, pp. 181-
195(15), November 1997.
[6] C.I. Ezeife and Timothy E. Ohanekwu, Use of Smart Tokens in Cleaning
Integrated Warehouse Data, the International Journal of Data Warehousing
and Mining (IJDW), Vol. 1, No. 2, pp. 1-22, Ideas Group Publishers, April-
June 2005.
[7] Ohanekwu, T.E. & Ezeife, C.I., A token-based data cleaning technique for
data warehouse systems. Proceedings of the International Workshop on Data
Quality in Cooperative Information Systems (pp. 21-26), held in conjunction
with the 9th International Conference on Database Theory (ICDT 2003),
Siena, Italy, January 2003.
[8] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, Eliminating Fuzzy Duplicates
in Data Warehouses. VLDB, pages 586-597, 2002.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record
Detection: A Survey, IEEE TKDE, 19(1):1-16, 2007.
[10] Hernandez, M.A. & Stolfo, S.J. The merge/purge problem for large databases,
Proceedings of the ACM SIGMOD International Conference on Management
of Data, pp. 127-138, 1995.
[11] M.A. Hernandez and S.J. Stolfo, Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Data Mining and Knowledge Discovery, pp. 9–27, 1998.
[12] Patrick Lehti, Unsupervised Duplicate Detection Using Sample Non-duplicates, Lecture Notes in Computer Science, Volume 4244, pp. 136–164, 2006.
[13] Prabhu C S R, Data Warehousing: Concepts Techniques 3/e, ISBN:
8120336275, ISBN-13: 9788120336278, Publisher: Prentice-hall Of India Pvt
Ltd, 2008
[14] J. Paul Kirby, Tom Pohlmann, Shelby Semmes, Topic Overview: Data
Warehousing, Information & Knowledge Management Professionals,
Forrester Research, January 23, 2007.
[15] Su Yan, Dongwon Lee, Min-Yen Kan, C. Lee Giles, Adaptive sorted
neighborhood methods for efficient record linkage, 185-194, JCDL 2007.
