QMDM Coverprefacetdm

Quality Measures in Data Mining
Book · January 2007

DOI: 10.1007/978-3-540-44918-8
2 authors:
Fabrice Guillet, University of Nantes
Howard J. Hamilton, University of Regina

Studies in Computational Intelligence
Guillet · Hamilton (Eds.)
Quality Measures
in Data Mining
Fabrice Guillet, Howard J. Hamilton (Eds.)
Quality Measures in Data Mining
Studies in Computational Intelligence, Volume 43
Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
Further volumes of this series can be found on our homepage: springer.com

Vol. 26. Nadia Nedjah, Luiza de Macedo Mourelle (Eds.)
Swarm Intelligent Systems, 2006
ISBN 3-540-33868-3

Vol. 27. Vassilis G. Kaburlasos
Towards a Unified Modeling and Knowledge-Representation based on Lattice Theory, 2006
ISBN 3-540-34169-2

Vol. 28. Brahim Chaib-draa, Jörg P. Müller (Eds.)
Multiagent based Supply Chain Management, 2006
ISBN 3-540-33875-6

Vol. 29. Sai Sumathi, S.N. Sivanandam
Introduction to Data Mining and its Application, 2006
ISBN 3-540-34689-9

Vol. 30. Yukio Ohsawa, Shusaku Tsumoto (Eds.)
Chance Discoveries in Real World Decision Making, 2006
ISBN 3-540-34352-0

Vol. 31. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.)
Stigmergic Optimization, 2006
ISBN 3-540-34689-9

Vol. 32. Akira Hirose
Complex-Valued Neural Networks, 2006
ISBN 3-540-33456-4

Vol. 33. Martin Pelikan, Kumara Sastry, Erick Cantú-Paz (Eds.)
Scalable Optimization via Probabilistic Modeling, 2006
ISBN 3-540-34953-7

Vol. 34. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.)
Swarm Intelligence in Data Mining, 2006
ISBN 3-540-34955-3

Vol. 35. Ke Chen, Lipo Wang (Eds.)
Trends in Neural Computation, 2007
ISBN 3-540-36121-9

Vol. 36. Ildar Batyrshin, Janusz Kacprzyk, Leonid Sheremetov, Lotfi A. Zadeh (Eds.)
Perception-based Data Mining and Decision Making in Economics and Finance, 2006
ISBN 3-540-36244-4

Vol. 37. Jie Lu, Da Ruan, Guangquan Zhang (Eds.)
E-Service Intelligence, 2007
ISBN 3-540-37015-3

Vol. 38. Art Lew, Holger Mauch
Dynamic Programming, 2007
ISBN 3-540-37013-7

Vol. 39. Gregory Levitin (Ed.)
Computational Intelligence in Reliability Engineering, 2007
ISBN 3-540-37367-5

Vol. 40. Gregory Levitin (Ed.)
Computational Intelligence in Reliability Engineering, 2007
ISBN 3-540-37371-3

Vol. 41. Mukesh Khare, S.M. Shiva Nagendra (Eds.)
Artificial Neural Networks in Vehicular Pollution Modelling, 2007
ISBN 3-540-37417-5

Vol. 42. Bernd J. Krämer, Wolfgang A. Halang (Eds.)
Contributions to Ubiquitous Computing, 2007
ISBN 3-540-44909-4

Vol. 43. Fabrice Guillet, Howard J. Hamilton (Eds.)
Quality Measures in Data Mining, 2007
ISBN 3-540-44911-6
Fabrice Guillet
Howard J. Hamilton
(Eds.)
Quality Measures
in Data Mining
With 51 Figures and 78 Tables
Fabrice Guillet
LINA-CNRS FRE 2729 - Ecole polytechnique
de l’université de Nantes
Rue Christian-Pauc-La Chantrerie - BP 60601
44306 NANTES Cedex 3-France
E-mail: Fabrice.Guillet@polytech.univ-nantes.fr
Howard J. Hamilton
Department of Computer Science
University of Regina
Regina, SK S4S 0A2-Canada
E-mail: hamilton@cs.uregina.ca
Library of Congress Control Number: 2006932577

ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-44911-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-44911-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of
this publication or parts thereof is permitted only under the provisions of the German Copyright Law
of September 9, 1965, in its current version, and permission for use must always be obtained from
Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media

springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
Cover design: deblik, Berlin

Typesetting by the editors.
Printed on acid-free paper SPIN: 11588535 89/SPi 543210
Preface
Data Mining has been identified as one of the ten emergent technologies of
the 21st century (MIT Technology Review, 2001). This discipline aims at
discovering knowledge relevant to decision making from large amounts of data.
After some knowledge has been discovered, the final user (a decision-maker
or a data-analyst) is unfortunately confronted with a major difficulty in the
validation stage: he/she must cope with the typically numerous extracted
pieces of knowledge in order to select the most interesting ones according
to his/her preferences. For this reason, during the last decade, the design
of quality measures (or interestingness measures) has become an important
challenge in Data Mining.
The purpose of this book is to present the state of the art concerning
quality/interestingness measures for data mining. The book summarizes
recent developments and presents original research on this topic. The chapters
include reviews, comparative studies of existing measures, proposals of new
measures, simulations, and case studies. Both theoretical and applied chapters
are included.
Structure of the book
The book is structured in three parts. The first part gathers four overviews
of quality measures. The second part contains four chapters dealing with
data quality, data linkage, contrast sets and association rule clustering.
Lastly, in the third part, four chapters describe new quality measures and
rule validation.
Part I: Overviews of Quality Measures
• Chapter 1: Choosing the Right Lens: Finding What is Interesting
in Data Mining, by Geng and Hamilton, gives a broad overview
of the use of interestingness measures in data mining. This survey
reviews interestingness measures for rules and summaries, classifies them
from several perspectives, compares their properties, identifies their roles
in the data mining process, describes methods of analyzing the measures,
reviews principles for selecting appropriate measures for applications, and
predicts trends for research in this area.
• Chapter 2: A Graph-based Clustering Approach to Evaluate
Interestingness Measures: A Tool and a Comparative Study, by
Huynh et al., is concerned with the study of interestingness measures. As
interestingness depends both on the structure of the data and on the
decision-maker’s goals, this chapter introduces a new contextual approach
implemented in ARQAT, an exploratory data analysis tool, in order to
help the decision-maker select the most suitable interestingness measures.
The tool, which embeds a graph-based clustering approach, is used to
compare and contrast the behavior of thirty-six interestingness measures
on two typical but quite different datasets. This experiment leads to the
discovery of five stable clusters of measures.
• Chapter 3: Association Rule Interestingness Measures: Experimental
and Theoretical Studies, by Lenca et al., discusses the selection
of the most appropriate interestingness measures, according to a variety
of criteria. It presents a formal and an experimental study of 20 measures.
The experimental studies carried out on 10 data sets lead to an experimental
classification of the measures. This study leads to the design of
a multi-criteria decision analysis in order to select the measures that best
take into account the user’s needs.
• Chapter 4: On the Discovery of Exception Rules: A Survey, by
Duval et al., presents a survey of approaches developed for mining exception
rules. They distinguish two approaches to using an expert’s knowledge:
using it as syntactic constraints and using it to form commonsense rules.
Works that rely on either of these approaches, along with their particular
quality evaluation, are presented in this survey. Moreover, this chapter
also gives ideas on how numerical criteria can be intertwined with
user-centered approaches.
Part II: From Data to Rule Quality
• Chapter 5: Measuring and Modelling Data Quality for
Quality-Awareness in Data Mining, by Berti-Équille, offers an
overview of data quality management, data linkage and data cleaning
techniques that can be advantageously employed for improving quality
awareness during the knowledge discovery process. It also details the steps of a
pragmatic framework for data quality awareness and enhancement. Each
step may use, combine and exploit the data quality characterization,
measurement and management methods, and the related techniques proposed
in the literature.
• Chapter 6: Quality and Complexity Measures for Data Linkage
and Deduplication, by Christen and Goiser, proposes a survey of
different measures that have been used to characterize the quality and
complexity of data linkage algorithms. It is shown that measures in the
space of record pair comparisons can produce deceptive quality results.
Various measures are discussed and recommendations are given on how to
assess data linkage and deduplication quality and complexity.
• Chapter 7: Statistical Methodologies for Mining Potentially
Interesting Contrast Sets, by Hilderman and Peckham, focuses on contrast
sets that aim at identifying the significant differences between classes
or groups. They compare two contrast set mining methodologies, STUCCO
and CIGAR, and discuss the underlying statistical measures. Experimental
results show that both methodologies are statistically sound, and thus
represent valid alternative solutions to the problem of identifying
potentially interesting contrast sets.
• Chapter 8: Understandability of Association Rules: A Heuristic
Measure to Enhance Rule Quality, by Natarajan and Shekar, deals
with the clustering of association rules in order to facilitate easy exploration
of connections between rules, and introduces the Weakness measure
dedicated to this goal. The average linkage method is used to cluster rules
obtained from a small artificial data set. Clusters are compared with those
obtained by applying a commonly used method.
Part III: Rule Quality and Validation
• Chapter 9: A New Probabilistic Measure of Interestingness
for Association Rules, Based on the Likelihood of the Link, by
Lerman and Azé, presents the foundations and the construction of a probabilistic
interestingness measure called the likelihood of the link index.
They discuss two facets, symmetrical and asymmetrical, of this measure
and the two stages needed to build this index. Finally, they report the results
of experiments to estimate the relevance of their statistical approach.
• Chapter 10: Towards a Unifying Probabilistic Implicative Normalized
Quality Measure for Association Rules, by Diatta et al.,
defines the so-called normalized probabilistic quality measures (PQM) for
association rules. Then, they consider a normalized and implicative PQM
called M_GK, and discuss its properties.
• Chapter 11: Association Rule Interestingness: Measure and
Statistical Validation, by Lallich et al., is concerned with association
rule validation. After reviewing well-known measures and criteria, the statistical
validity of selecting the most interesting rules by performing a
large number of tests is investigated. An original, bootstrap-based validation
method is proposed that controls, for a given level, the number of
false discoveries. The potential value of this method is illustrated by several
examples.
• Chapter 12: Comparing Classification Results between N-ary
and Binary Problems, by Felkin, deals with supervised learning and
the quality of classifiers. This chapter presents a practical tool that will
enable the data-analyst to apply quality measures to a classification task.
More specifically, the tool can be used during the pre-processing step, when
the analyst is considering different formulations of the task at hand. This
tool is well suited for illustrating the choices for the number of possible
class values to be used to define a classification problem and the relative
difficulties of the problems that result from these choices.
Topics
The topics of the book include:

• Measures for data quality
• Objective vs subjective measures
• Interestingness measures for rules, patterns, and summaries
• Quality measures for classification, clustering, pattern discovery, etc.
• Theoretical properties of quality measures
• Human-centered quality measures for knowledge validation
• Aggregation of measures
• Quality measures for different stages of the data mining process
• Evaluation of measure properties via simulation
• Application of quality measures and case studies
Review Committee
All published chapters have been reviewed by at least two referees.

• Henri Briand (LINA, University of Nantes, France)
• Régis Gras (LINA, University of Nantes, France)
• Yves Kodratoff (LRI, University of Paris-Sud, France)
• Vipin Kumar (University of Minnesota, USA)
• Pascale Kuntz (LINA, University of Nantes, France)
• Robert Hilderman (University of Regina, Canada)
• Ludovic Lebart (ENST, Paris, France)
• Philippe Lenca (ENST-Bretagne, Brest, France)
• Bing Liu (University of Illinois at Chicago, USA)
• Amedeo Napoli (LORIA, University of Nancy, France)
• Gregory Piatetsky-Shapiro (KDNuggets, USA)
• Gilbert Ritschard (University of Geneva, Switzerland)
• Sigal Sahar (Intel, USA)
• Gilbert Saporta (CNAM, Paris, France)
• Dan Simovici (University of Massachusetts Boston, USA)
• Jaideep Srivastava (University of Minnesota, USA)
• Einoshin Suzuki (Yokohama National University, Japan)
• Pang-Ning Tan (Michigan State University, USA)
• Alexander Tuzhilin (Stern School of Business, USA)
• Djamel Zighed (ERIC, University of Lyon 2, France)
Associated Reviewers
Jérôme Azé, Laure Berti-Équille, Libei Chen, Peter Christen, Béatrice Duval,
Mary Felkin, Liqiang Geng, Karl Goiser, Stéphane Lallich, Rajesh Natarajan,
Ansaf Salleb, Benoît Vaillant
Acknowledgments
The editors would like to thank the chapter authors for their insights and
contributions to this book.
The editors would also like to acknowledge the members of the review
committee and the associated referees for their involvement in the review
process of the book. Without their support the book would not have been
satisfactorily completed.
Special thanks go to D. Zighed and H. Briand for their kind support
and encouragement.
Finally, we thank Springer and the publishing team, and especially
T. Ditzinger and J. Kacprzyk, for their confidence in our project.
Regina, Canada and Nantes, France, May 2006
Fabrice Guillet
Howard Hamilton
Contents
Part I Overviews on rule quality
Choosing the Right Lens: Finding What is Interesting in Data Mining
Liqiang Geng, Howard J. Hamilton

A Graph-based Clustering Approach to Evaluate Interestingness Measures: A Tool and a Comparative Study
Xuan-Hiep Huynh, Fabrice Guillet, Julien Blanchard, Pascale Kuntz, Henri Briand, Régis Gras

Association Rule Interestingness Measures: Experimental and Theoretical Studies
Philippe Lenca, Benoît Vaillant, Patrick Meyer, Stéphane Lallich

On the Discovery of Exception Rules: A Survey
Béatrice Duval, Ansaf Salleb, Christel Vrain
Part II From data to rule quality
Measuring and Modelling Data Quality for Quality-Awareness in Data Mining
Laure Berti-Équille

Quality and Complexity Measures for Data Linkage and Deduplication
Peter Christen, Karl Goiser

Statistical Methodologies for Mining Potentially Interesting Contrast Sets
Robert J. Hilderman, Terry Peckham

Understandability of Association Rules: A Heuristic Measure to Enhance Rule Quality
Rajesh Natarajan, B. Shekar
Part III Rule quality and validation
A New Probabilistic Measure of Interestingness for Association Rules, Based on the Likelihood of the Link
Israël-César Lerman, Jérôme Azé

Towards a Unifying Probabilistic Implicative Normalized Quality Measure for Association Rules
Jean Diatta, Henri Ralambondrainy, André Totohasina

Association Rule Interestingness: Measure and Statistical Validation
Stéphane Lallich, Olivier Teytaud, Elie Prudhomme

Comparing Classification Results between N-ary and Binary Problems
Mary Felkin

About the Authors
Index
List of Contributors
Jérôme Azé
LRI, University of Paris-Sud, Orsay, France
aze@lri.fr

Laure Berti-Équille
IRISA, University of Rennes I, France
Laure.Berti-Equille@irisa.fr

Julien Blanchard
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Henri Briand
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Peter Christen
The Australian National University, Canberra, Australia
peter.christen@anu.edu.au

Jean Diatta
IREMIA, University of La Réunion, Saint-Denis, France
jdiatta@univ-reunion.fr

Béatrice Duval
LERIA, University of Angers, France
Beatrice.Duval@univ-angers.fr

Mary Felkin
LRI, University of Paris-Sud, Orsay, France
felkin@lri.fr

Liqiang Geng
Department of Computer Science, University of Regina, Canada
gengl@cs.uregina.ca

Karl Goiser
The Australian National University, Canberra, Australia
karl.goiser@anu.edu.au

Régis Gras
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Fabrice Guillet
LINA CNRS 2729, Polytechnic School of Nantes University, France
Fabrice.Guillet@univ-nantes.fr

Howard J. Hamilton
Department of Computer Science, University of Regina, Canada
hamilton@cs.uregina.ca

Robert J. Hilderman
Department of Computer Science, University of Regina, Canada
hilder@cs.uregina.ca

Xuan-Hiep Huynh
Xuan-Hiep.Huynh@univ-nantes.fr

Pascale Kuntz
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Stéphane Lallich
ERIC, University of Lyon 2, France
stephane.lallich@univ-lyon2.fr

Philippe Lenca
TAMCIC CNRS 2872, GET/ENST Bretagne, France
philippe.lenca@enst-bretagne.fr

Israël-César Lerman
IRISA, University of Rennes I, France
lerman@irisa.fr

Patrick Meyer
University of Luxemburg, Luxemburg
patrick.meyer@uni.lu

Rajesh Natarajan
Cognizant Technology Solutions, Chennai, India
rajesh.natarajan@cognizant.com

Terry Peckham
Department of Computer Science, University of Regina, Canada
peckham@cs.uregina.ca

Elie Prudhomme
ERIC, University of Lyon 2, France
eprudhomme@eric.univ-lyon2.fr

Henri Ralambondrainy
IREMIA, University of La Réunion, Saint-Denis, France
ralambon@univ-reunion.fr

Ansaf Salleb
UCCLS, Columbia University, New York, U.S.A
Ansaf@ccls.columbia.edu

B. Shekar
QMIS, Indian Institute of Management (IIMB), Bangalore, India
shek@iimb.ernet.in

Olivier Teytaud
TAO-Inria, LRI, University of Paris-Sud, Orsay, France
teytaud@lri.fr

André Totohasina
ENSET, University of Antsiranana, Madagascar
totohasina@yahoo.fr

Benoît Vaillant
TAMCIC CNRS 2872, GET/ENST Bretagne, France
benoit.vaillant@enst-bretagne.fr

Christel Vrain
LIFO, University of Orléans, France
Christel.Vrain@univ-orleans.fr