
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 19

Enhancing Quality of Data using Data Mining Method
Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad Alishahi

Abstract—Data is an asset for companies and organizations, because data and the information obtained from data analysis play
an important role in decision making. The quality of data affects the quality of decisions, and incorrect data leads to incorrect
decisions. Recently, a great deal of research has focused on enhancing data quality. It is infeasible or very difficult to
improve the quality of data through manual inspection, because data quality is a complicated and non-structured concept,
data cleansing cannot be done without the help of professional domain experts, and the detection of errors requires
thorough knowledge of the domain of the data. Therefore, (semi-)automatic data cleansing methods are employed to find
data errors and defects and to resolve them. Data mining methods are appropriate for enhancing different dimensions of data
quality, since they are aimed at finding abnormal patterns in large volumes of data. In this paper, a new approach is
presented to detect errors inside a dataset using fuzzy association rules. Fuzzy association rules are used to build a
model that is intended to capture the structure of the data under consideration. Experimental results show the
effectiveness of the proposed method in finding errors in datasets.

Index Terms—data quality, data mining, fuzzy association rules


1 INTRODUCTION

WIDESPREAD use of data, and of decisions based on data analysis, makes data quality central to business success today.
Nevertheless, a study by the Meta Group reveals that 41% of projects based on data analysis fail. As one of the main
reasons, they identified insufficient data quality leading to wrong decisions [1]. Traditional data cleaning methods can
rarely be used in this setting: it is normally infeasible to guarantee data quality by manual inspection, especially when
data are collected over long periods of time and through multiple generations of databases. Therefore, (semi-)automatic
data cleaning methods have to be used [1].

Since the early 90s, knowledge discovery in databases (KDD) has become a well-established field of research, and over the
years new methods together with scalable algorithms have been developed to analyze very large datasets efficiently.
Unfortunately, most of this work is oriented towards specific and theoretical problems.

The application of data mining methods to improve data quality is a relatively new and promising approach, from both the
research and the usage viewpoint, and can open new application domains for these methods outside pure data analysis [2].

Data mining methods have already been used in data quality studies. For example, Brodley and Friedl used data mining to
preprocess data in the process of knowledge discovery in databases. By filtering out probably incorrect training data,
they significantly reduce misclassifications; their goal, however, is to improve final classification accuracy, not to
detect errors in the training data [3].

In addition, Hipp et al. proposed an extended data mining algorithm to extract the structure of the data. Deviations from
this structure can then be hypothesized to be incorrect [4].

Grüning showed that classifiers can be used to detect conflicts in datasets and gave practical recommendations for data
correction. This approach uses support vector machines as the classification algorithm [5].

Marcus and Maletic used different data mining methods, including statistical methods, clustering, pattern-based methods
and ordinal association rules, for automatic error detection in real datasets. Their experiments showed that ordinal
association rules are more efficient than the other methods, but they are appropriate only for datasets in which many
attributes are of decimal or date type [6].

In this paper, fuzzy association rules are used to enhance data quality. The structure of this paper is as follows:
Section 2 presents an introduction to data quality and the reasons for using data mining to improve it. In Section 3, the
proposed method is explained, and in Section 4, the experimental results of the proposed method are given and analyzed.

• F. Ghorbanpour A. is with the Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran.
• M.M. Pedram is with the Department of Computer Engineering, Faculty of Engineering, Tarbiat Moallem University, Karaj/Tehran, Iran.
• K. Badie is with the Information Technology Research Center, Telecom Research Center, Tehran, Iran.
• M. Alishahi is with the Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran.

2 DATA QUALITY AND DATA MINING

The definition of data quality depends on the purpose under consideration, so there is no unique definition that can be
stated formally. In the literature, "appropriate for use" or "to meet end user needs" are used. According to the general
definition of quality management, we define quality of data as meeting customer needs. Contrary to popular belief,
quality does not necessarily mean zero errors [7]. Data have high quality if they are appropriate for applications,
decision making and planning.

Researchers have defined different dimensions of data quality. Each dimension shows a particular view of quality. The
more interesting dimensions are as follows:
• Consistency: the rate at which the significant rules defined on the dataset are violated.
• Availability: the rate of data availability, and the ease and speed of data retrieval.
• Accuracy: the distance between a value v and the value v' that correctly represents the entity which v tries to portray.
• Completeness: the database includes all of the related entities.
• Value added: the rate of being useful.

Data mining is an automatic process for extracting patterns that exist as implicit knowledge in a database. Data mining
draws on multiple scientific fields simultaneously, such as database technology, artificial intelligence, machine
learning, neural networks, statistics, pattern recognition, knowledge-based systems and information retrieval.

Data mining methods are appropriate for improving different dimensions of data quality, since they are aimed at finding
abnormal patterns in large volumes of data. In 2001, Hipp introduced data quality mining as a new approach from the
research and application viewpoint. The goal of data quality mining is to employ data mining methods in order to detect,
quantify, explain and correct data quality deficiencies in huge databases [4].

In this paper, the accuracy dimension of data quality is addressed using a data mining method.

3 THE PROPOSED IDEA

Among data mining methods, association rules are a very active research topic and have been used widely in many
applications such as basket analysis, decision support systems, theft detection, etc. Data quality is also one of the
topics where association rules can be useful, because, in comparison with other data mining methods, association rules
are understandable and can clearly discover and describe the dependencies between data. Association rules are also
independent of each other; therefore, removing a rule has no effect on the other rules.

One of the drawbacks of using association rules for data cleaning is that they cannot properly express associations
between quantitative values. As an example, in Table 1, job and degree are categorical attributes and income is a
numerical attribute. If we want to obtain association rules from these data with support > 1 via the Apriori algorithm,
we can find only a few patterns, such as those involving the BS degree, whereas there is also an association between
income and job, or between degree and income. For example, someone who has a high degree level has a higher income than
people who have a mean degree level. The Apriori algorithm is not capable of discovering such associations between
quantitative values. As much real-world data contains quantitative values, and there may be errors in this type of
value, the method proposed in [4] is not applicable for detecting such errors. In this paper, the method presented in
[4] is extended, and association rules between quantitative values are discovered with the help of the fuzzy set
concept.

3.1 Fuzzy Association Rule

Fuzzy association rules are rules extracted from a fuzzy dataset. A fuzzy rule of the form $A \rightarrow B$, where
$A$ and $B$ are fuzzy sets and $A \cap B = \emptyset$, is called a fuzzy association rule.

Using fuzzy concepts has some advantages. First, the discovered association rules are more understandable, since these
concepts provide a transition between the numerical values of the data and categorical concepts. Second, these concepts
help discover rules between quantitative values. For example, the following fuzzy sets are defined for the data shown in
Table 1:
High job level = {manager, deputy}
Mean job level = {designer}
High degree grade = {PHD, MS}
Mean degree grade = {BS, 12th}
High income = {400-500}
Mean income = {100-200}

TABLE 1
INSTANCE DATABASE
Trans_id   Job        Degree   Income
1          designer   BS       100
2          designer   BS       150
3          designer   12th     200
4          manager    PHD      400
5          manager    MS       450
6          deputy     PHD      500

The following fuzzy association rules may be stated using the fuzzy sets above:
High job level → High degree grade
Mean job level → Mean degree grade
High job level → High income
Mean job level → Mean income

Now assume the following transaction:
Transaction_id 7: job = deputy, degree = PHD, income = 100

Considering the dataset shown in Table 1, an error can be detected in the above transaction: a person who is a deputy is
not expected to have an income of 100, since the income for this job level is expected to be much higher. This
transaction violates the fuzzy rule "High job level → High income". It should be noted that the error would not be
detected by the rules extracted by the Apriori algorithm.

3.2 Proposed Method

The proposed method consists of the following steps:
a) Data preprocessing: in this step, data are converted to a standard format and missing values are handled. For
example, the values of a date attribute may be in the form YYYY/MM/DD in some transactions and YYYY-MM-DD in others.
This step is necessary to obtain a correct model of the associations.

b) Mapping data to fuzzy values: fuzzy sets and fuzzy membership functions for quantitative data are defined in
accordance with the knowledge of experienced experts. Then, data values are mapped to fuzzy values by the fuzzy
membership functions. Table 2 shows the fuzzy mapping for the income and job attributes of Table 1.

TABLE 2
MAPPING TABLE 1 TO FUZZY VALUES
Fuzzy-trans_id   High income   Mean income   High job level   Mean job level
1                0             1             0                1
2                0.02          0.97          0                1
3                0.04          0.91          0                1
4                0.80          0.09          0.93             0.03
5                0.92          0.03          0.93             0.01
6                0.96          0.01          1                0

c) Fuzzy association rules extraction: fuzzy association rules are extracted via the algorithm presented in [9], in
which all attributes are weighted equally. In this step, only the rules with a high confidence level are considered.
Thus, the rule set R is limited to

$R := \{\, r \in R \mid \mathrm{confidence}(r) \geq \mathit{min\_confidence} \,\}$

where min_confidence is the minimum confidence. Table 3 shows some rules extracted from Table 2.

TABLE 3
FUZZY ASSOCIATION RULES EXTRACTED FROM TABLE 2
Fuzzy association rule           Confidence
High job level → High income     0.44
Mean job level → Mean income     0.48

d) Consistency check of transactions with discovered rules: the consistency of transactions with the discovered rules is
checked in this step. Each transaction may violate some rules and be consistent with others. It may also not fire some
of the rules at all. Let R be the set of fuzzy association rules and let D be the database of fuzzy transactions. Let
$r : X \rightarrow Y$ be a fuzzy association rule and let $\mu_X$ be the membership function of X. The mapping that
determines whether a fuzzy transaction $T \in D$ violates a rule $r \in R$ is defined as:

$$\mathrm{violate} : D \times R \rightarrow [0,1], \qquad
\mathrm{violate}(T, r) =
\begin{cases}
1 - \mu_Y(T) & \text{if } \mu_X(T) \geq \mathit{rule\_satisfy} \,\wedge\, \mu_Y(T) < \mathit{rule\_satisfy} \\
0 & \text{else}
\end{cases}
\qquad (1)$$

The parameter rule_satisfy is the threshold for fuzzy rule satisfaction. In (1), $1 - \mu_Y(T)$ is the degree to which
the fuzzy transaction T violates r: the larger $1 - \mu_Y(T)$, the more inconsistent transaction T is with rule r.

As an example, the consistency check for transaction 7 is shown in Table 4.

TABLE 4
CONSISTENCY CHECK FOR TRANSACTION 7 WITH RULES IN TABLE 3
Fuzzytrans_id   Non-correlated rule            Consistent rule                      Inconsistent rule
7               Mean job level → Mean income   High job level → High degree grade   High job level → High income

e) Scoring and ranking the transactions: a score is assigned to each transaction by summing the confidence values of the
rules it violates. The score of each transaction is computed as follows:

$$\mathrm{score}_R : D \rightarrow \mathbb{R}, \qquad
T \mapsto \sum_{r \in R} \mathrm{confidence}(r)^{\tau} \cdot \mathrm{violate}(T, r)
\qquad (2)$$

The tuning parameter $\tau \in \mathbb{R}^{+}$ allows the confidences to be weighted depending on their value [4]. For
example, the score of transaction 7 is $(0.44)^{\tau}$.

Transactions with scores higher than score_treshold, the minimal threshold for scores, are presented as a list sorted
according to the rules that are violated or hold. Based on this information, together with background knowledge, the
user decides upon the trustworthiness of single transactions or groups of similar transactions, and finally upon the
quality of the whole dataset.

The algorithm corresponding to the proposed method is shown in Fig. 1.

Preprocess records // for enhancing data quality
Map dataset to fuzzy values
Generate fuzzy association rules with determined support and confidence
For all unmarked transactions in the dataset
    For all generated fuzzy association rules
        If transaction doesn't satisfy the rule
        Then mark transaction as a possible error and save confidence of rule
    End for
    Set score of the transaction by summing saved confidences
End for
Sort marked transactions by score in descending order
Output marked transactions with consequents of the violated rules

Fig. 1. Proposed algorithm
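Steps d) and e) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names FuzzyRule, violate, score, rank and min_score are hypothetical, the comparisons against rule_satisfy (antecedent at least the threshold, consequent below it) are an assumed reading of (1), and the membership degrees of transaction 7 are assumed from the rows of Table 2 that share its job (deputy, row 6) and income (100, row 1). Rule confidences are taken from Table 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FuzzyRule:
    """A fuzzy association rule X -> Y together with its confidence."""
    antecedent: str   # name of fuzzy set X
    consequent: str   # name of fuzzy set Y
    confidence: float


def violate(t: Dict[str, float], rule: FuzzyRule, rule_satisfy: float) -> float:
    """Degree to which fuzzy transaction t violates rule (0 means no violation).

    Assumption: the rule fires when mu_X(t) >= rule_satisfy and counts as
    violated when mu_Y(t) < rule_satisfy.
    """
    mu_x = t.get(rule.antecedent, 0.0)
    mu_y = t.get(rule.consequent, 0.0)
    if mu_x >= rule_satisfy and mu_y < rule_satisfy:
        return 1.0 - mu_y
    return 0.0


def score(t: Dict[str, float], rules: List[FuzzyRule],
          rule_satisfy: float, tau: float) -> float:
    """Sum of confidence(r)**tau weighted by the violation degree of each rule."""
    return sum(r.confidence ** tau * violate(t, r, rule_satisfy) for r in rules)


def rank(transactions: Dict[int, Dict[str, float]], rules: List[FuzzyRule],
         rule_satisfy: float, tau: float, min_score: float) -> List[Tuple[int, float]]:
    """Return (id, score) pairs whose score exceeds min_score, highest first."""
    scored = [(tid, score(t, rules, rule_satisfy, tau))
              for tid, t in transactions.items()]
    suspicious = [(tid, s) for tid, s in scored if s > min_score]
    return sorted(suspicious, key=lambda pair: pair[1], reverse=True)


# Rules and confidences from Table 3.
RULES = [
    FuzzyRule("High job level", "High income", 0.44),
    FuzzyRule("Mean job level", "Mean income", 0.48),
]

# Transaction 6 is row 6 of Table 2; the degrees for transaction 7 are assumed.
TRANSACTIONS = {
    6: {"High job level": 1.0, "Mean job level": 0.0,
        "High income": 0.96, "Mean income": 0.01},
    7: {"High job level": 1.0, "Mean job level": 0.0,
        "High income": 0.0, "Mean income": 1.0},
}

if __name__ == "__main__":
    # Threshold values 0.4 and 0.3 follow the settings reported in Section 4.
    print(rank(TRANSACTIONS, RULES, rule_satisfy=0.4, tau=1.0, min_score=0.3))
```

Under these assumptions only transaction 7 is reported: it fires "High job level → High income" with a consequent membership of 0, so its score is 0.44 raised to the power of the tuning parameter, while "Mean job level → Mean income" does not fire for it. Larger values of the tuning parameter shrink the contribution of low-confidence rules relative to high-confidence ones.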
4 EVALUATION RESULTS

In this section, the proposed method is evaluated and the results are compared with the method presented in [4]. All
evaluations were run on a system equipped with a Core 2 Duo 2.26 GHz CPU, 3 GB RAM and Windows 7.

Precision and recall are the measures usually used to evaluate the performance of information retrieval and
classification algorithms. These measures are defined by (3) and (4) based on the confusion matrix entries shown in
Table 5. In this paper, in addition to these two measures, the runtime of the two methods is also compared.

TABLE 5
CONFUSION MATRIX
                                          Classification by proposed method
                                          Correct transactions   Incorrect transactions
Real             Correct transactions     TP                     FN
classification   Incorrect transactions   FP                     TN

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

In order to evaluate the proposed method, two datasets have been used. One is the Adult dataset from UCI [10], and the
other is a real dataset containing the personnel data of an IT company. Adult includes two numerical attributes called
income and hpw and two categorical attributes called job and degree. The personnel data includes two numerical
attributes called income and valuation score and two categorical attributes called job and degree. 1000 transactions of
these datasets were selected and manipulated to introduce artificial errors. Then, the proposed method and the one
presented in [4] were applied to the datasets. In the tests related to precision and recall, min_support,
min_confidence, score_treshold and min_rule_satisfy are set to 0.2, 0.7, 0.3 and 0.4, respectively. The tuning
parameter τ is set to 7 in both methods because, according to the experimental results of Hipp et al., this value works
well when it is important that transactions do not violate rules with a high confidence value [4].

Figure 2 compares the precision of the proposed method with that of the method proposed in [4], in terms of the number
of errors created in the Adult and Personnel datasets. The figure shows that the proposed method outperforms the method
proposed in [4], as many of the associations between quantitative data in the datasets are not discovered by the Apriori
algorithm and therefore play no role in the ranking of transactions. Some attribute values lie in a range for which the
support of the rules is not satisfied (see the example above).

The precision curves have wavy shapes, because any change in the data causes a change in the support of the rules, which
can affect what they detect.

[Figure: precision (y-axis, 0 to 1) versus number of errors in dataset (x-axis, 0 to 100); series: Adult Dataset-Fuzzy Association Rule, Adult Dataset-Apriori, Real Dataset-Fuzzy Association Rule, Real Dataset-Apriori]
Fig. 2. The precision measure for the proposed method and the method described in [4] for two datasets

Figure 3 compares the recall of the two methods on the Adult and Personnel datasets. It is clear that the proposed
method delivers better results.

[Figure: recall (y-axis, 0 to 1) versus number of errors in dataset (x-axis, 0 to 100); series: Adult Dataset-Fuzzy Association Rule, Adult Dataset-Apriori, Real Dataset-Fuzzy Association Rule, Real Dataset-Apriori]
Fig. 3. The recall measure for the proposed method and the method described in [4] for two datasets

Figures 4 and 5 compare the precision and recall of the proposed method with those of the method proposed in [4] for
varying min_support and min_confidence on the Adult dataset. The figures show that increasing min_support or
min_confidence decreases precision and recall, because with higher thresholds some association rules are not discovered
and play no role in error detection. Therefore, some errors are not detected.
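The precision and recall measures defined in (3) and (4) can be computed directly from confusion-matrix counts. The following is a minimal Python sketch; the function names and the example counts are hypothetical and are not results from the paper, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the text.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP), following (3)."""
    denominator = tp + fp
    # Convention assumed here: report 0.0 when nothing was counted.
    return tp / denominator if denominator else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN), following (4)."""
    denominator = tp + fn
    return tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, fp, fn = 80, 20, 10
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"recall    = {recall(tp, fn):.3f}")
```

With these illustrative counts, precision is 80/100 and recall is 80/90. The two measures share the numerator TP and differ only in which kind of misclassification enters the denominator.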
[Figure: precision and recall (y-axis, 0 to 0.7) versus min_support (x-axis, 0 to 0.3); series: Precision-Fuzzy Association Rule, Precision-Apriori, Recall-Fuzzy Association Rule, Recall-Apriori]
Fig. 4. The precision and recall measures for the proposed method and the method described in [4] for the Adult dataset, for varying min_support

[Figure: precision and recall (y-axis, 0 to 0.7) versus min_confidence (x-axis, 0.6 to 1.1); series: Precision-Fuzzy Association Rule, Precision-Apriori, Recall-Fuzzy Association Rule, Recall-Apriori]
Fig. 5. The precision and recall measures for the proposed method and the method described in [4] for the Adult dataset, for varying min_confidence

Figure 6 shows the runtime of the proposed method and the method proposed in [4].

[Figure: runtime (y-axis, 0 to 140) versus number of transactions (x-axis, 0 to 1200); series: Fuzzy Association Rule, Apriori]
Fig. 6. The runtime of the proposed method and the method presented in [4] for the Personnel dataset

According to Fig. 6, the runtime of the proposed method is better. The main reason is the difference in the runtime of
association rule discovery. The fuzzy association rules algorithm is much faster than the Apriori algorithm, because the
Apriori algorithm must check all of the numerical and categorical data in the dataset to find frequent itemsets. It must
also find the support of each itemset, which is time consuming due to the variety of numeric values. In contrast, the
fuzzy association rules algorithm works only with the fuzzy sets, whose number is more limited and which are defined by
the user.

5 CONCLUSION

Automated data quality improvement is a necessity for finding incorrect data in large databases. In this paper, a method
based on data mining approaches is presented to improve data quality. The proposed approach uses fuzzy association
rules, by which hidden rules in datasets are discovered. Then, the consistency of transactions with these rules is
checked and a score is assigned to each transaction. High scores are assigned to transactions that are suspected of
containing defects. The user's knowledge of the suspicious transactions and background knowledge about the data are then
used to determine the accuracy of transactions and ultimately the overall quality of the dataset. Evaluation results
show the effectiveness of the proposed method.

REFERENCES

[1] D. Luebbers, U. Grimmer, M. Jarke, “Systematic Development of Data Mining-Based Data Quality Tools”, Proc. of the 29th International Conference on Very Large Data Bases, Berlin, Germany, pp. 548-559, 2003.
[2] J. Hipp, M. Müller, J. Hohendorff, F. Naumann, “Rule-Based Measurement of Data Quality in Nominal Data”, Proc. of the 12th International Conference on Information Quality (ICIQ), Cambridge, USA, 2007.
[3] C. Brodley, M. Friedl, “Identifying Mislabeled Training Data”, Journal of Artificial Intelligence Research, Vol. 11, pp. 113-167, 1999.
[4] J. Hipp, U. Güntzer, U. Grimmer, “Data Quality Mining – Making a Virtue of Necessity”, Proc. of the 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara, California, pp. 52-57, 2001.
[5] F. Grüning, “Data Quality Mining: Employing Classifiers for Assuring Consistent Datasets”, Proc. of the 21st Information Technologies in Environmental Engineering, Springer-Verlag Berlin Heidelberg, pp. 58-94, 2007.
[6] J. Maletic, A. Marcus, “Data Cleansing: Beyond Integrity Analysis”, Proc. of the Conference on Information Quality, MIT, Boston, pp. 200-209, 2000.
[7] J. Geiger, “Data Quality Management: The Most Critical Initiative You Can Implement”, Intelligent Solutions, Inc., Boulder, Paper 098-29, 2004.
[8] J. Maletic, A. Marcus, K. Lin, “Ordinal Association Rules for Error Identification in DataSets”, Proc. of the 10th Intl. Conf. on Information and Knowledge Management (CIKM), Atlanta, GA, pp. 589-591, 2001.
[9] D. Olson, Y. Li, “Mining Fuzzy Weighted Association Rules”, Proc. of the 40th IEEE Intl. Conference on System Sciences, Hawaii, 2007.
[10] http://archive.ics.uci.edu/ml/datasets/adult
Fatemeh Ghorbanpour A. received B.Sc. and M.Sc. degrees in Computer Engineering-Software from universities in Iran. Her
current research interests include information retrieval and data mining. She is a member of the IEEE.

Dr. M. Mohsen Pedram received his B.Sc., M.Sc. and Ph.D. in electronic engineering from universities in Iran. His major
research interests are Machine Learning, Image Processing, Artificial Intelligence, Data Mining, and Pattern
Recognition. He has published many papers in various fields. At present, he is teaching in a Department of Computer
Engineering in Iran.

Dr. Kambiz Badie received his B.Sc., M.Sc. and Ph.D. in electronic engineering from the Tokyo Institute of Technology,
Japan, majoring in Pattern Recognition & Artificial Intelligence. His major research interests are Machine Learning,
Cognitive Modeling, and Systematic Knowledge Processing in general, and Analogical Knowledge Processing,
Experience-Based Modeling, and Interpretative Modeling in particular, with emphasis on idea and technique generation. He
has published many papers in various fields. At present, he is the Director of the IT Faculty at Iran Telecom Research
Center.

Mohammad Alishahi received a B.Sc. degree in Computer Engineering-Software and an M.Sc. degree in Industrial Engineering
from universities in Iran. His current research interests include data mining and management information systems.
