
Quality Measures in Data Mining

Book · January 2007


DOI: 10.1007/978-3-540-44918-8


2 authors:

Fabrice Guillet, University of Nantes
Howard J. Hamilton, University of Regina




Studies in Computational Intelligence

Guillet · Hamilton (Eds.)

Quality Measures
in Data Mining


Fabrice Guillet, Howard J. Hamilton (Eds.)
Quality Measures in Data Mining
Studies in Computational Intelligence, Volume 43
Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Vol. 26. Nadia Nedjah, Luiza de Macedo Mourelle (Eds.)
Swarm Intelligent Systems, 2006
ISBN 3-540-33868-3

Vol. 27. Vassilis G. Kaburlasos
Towards a Unified Modeling and Knowledge-Representation based on Lattice Theory, 2006
ISBN 3-540-34169-2

Vol. 28. Brahim Chaib-draa, Jörg P. Müller (Eds.)
Multiagent based Supply Chain Management, 2006
ISBN 3-540-33875-6

Vol. 29. Sai Sumathi, S.N. Sivanandam
Introduction to Data Mining and its Application, 2006
ISBN 3-540-34689-9

Vol. 30. Yukio Ohsawa, Shusaku Tsumoto (Eds.)
Chance Discoveries in Real World Decision Making, 2006
ISBN 3-540-34352-0

Vol. 31. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.)
Stigmergic Optimization, 2006
ISBN 3-540-34689-9

Vol. 32. Akira Hirose
Complex-Valued Neural Networks, 2006
ISBN 3-540-33456-4

Vol. 33. Martin Pelikan, Kumara Sastry, Erick Cantú-Paz (Eds.)
Scalable Optimization via Probabilistic Modeling, 2006
ISBN 3-540-34953-7

Vol. 34. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.)
Swarm Intelligence in Data Mining, 2006
ISBN 3-540-34955-3

Vol. 35. Ke Chen, Lipo Wang (Eds.)
Trends in Neural Computation, 2007
ISBN 3-540-36121-9

Vol. 36. Ildar Batyrshin, Janusz Kacprzyk, Leonid Sheremetov, Lotfi A. Zadeh (Eds.)
Perception-based Data Mining and Decision Making in Economics and Finance, 2006
ISBN 3-540-36244-4

Vol. 37. Jie Lu, Da Ruan, Guangquan Zhang (Eds.)
E-Service Intelligence, 2007
ISBN 3-540-37015-3

Vol. 38. Art Lew, Holger Mauch
Dynamic Programming, 2007
ISBN 3-540-37013-7

Vol. 39. Gregory Levitin (Ed.)
Computational Intelligence in Reliability Engineering, 2007
ISBN 3-540-37367-5

Vol. 40. Gregory Levitin (Ed.)
Computational Intelligence in Reliability Engineering, 2007
ISBN 3-540-37371-3

Vol. 41. Mukesh Khare, S.M. Shiva Nagendra (Eds.)
Artificial Neural Networks in Vehicular Pollution Modelling, 2007
ISBN 3-540-37417-5

Vol. 42. Bernd J. Krämer, Wolfgang A. Halang (Eds.)
Contributions to Ubiquitous Computing, 2007
ISBN 3-540-44909-4

Vol. 43. Fabrice Guillet, Howard J. Hamilton (Eds.)
Quality Measures in Data Mining, 2007
ISBN 3-540-44911-6

Fabrice Guillet
Howard J. Hamilton
(Eds.)

Quality Measures
in Data Mining

With 51 Figures and 78 Tables

Fabrice Guillet
LINA-CNRS FRE 2729 - Ecole polytechnique
de l’université de Nantes
Rue Christian-Pauc-La Chantrerie - BP 60601
44306 NANTES Cedex 3-France
E-mail: Fabrice.Guillet@polytech.univ-nantes.fr

Howard J. Hamilton
Department of Computer Science
University of Regina
Regina, SK S4S 0A2-Canada
E-mail: hamilton@cs.uregina.ca

Library of Congress Control Number: 2006932577


ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-44911-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-44911-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broad-
casting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of
this publication or parts thereof is permitted only under the provisions of the German Copyright Law
of September 9, 1965, in its current version, and permission for use must always be obtained from
Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media


springer.com
© Springer-Verlag Berlin Heidelberg 2007

The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.

Cover design: deblik, Berlin


Typesetting by the editors.
Printed on acid-free paper SPIN: 11588535 89/SPi 543210
Preface

Data Mining has been identified as one of the ten emergent technologies of
the 21st century (MIT Technology Review, 2001). This discipline aims at
discovering knowledge relevant to decision making from large amounts of data.
After some knowledge has been discovered, the final user (a decision-maker
or a data-analyst) is unfortunately confronted with a major difficulty in the
validation stage: he/she must cope with the typically numerous extracted
pieces of knowledge in order to select the most interesting ones according
to his/her preferences. For this reason, during the last decade, the design
of quality measures (or interestingness measures) has become an important
challenge in Data Mining.
The purpose of this book is to present the state of the art concerning
quality/interestingness measures for data mining. The book summarizes recent
developments and presents original research on this topic. The chapters
include reviews, comparative studies of existing measures, proposals of new
measures, simulations, and case studies. Both theoretical and applied chapters
are included.
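
As a concrete illustration of what a quality (interestingness) measure computes, the following minimal sketch scores a single association rule on a toy transaction set. The three measures used here (support, confidence, and lift) are standard textbook examples chosen for illustration; the preface itself does not single them out, and the data and function names are hypothetical.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Estimate of P(consequent | antecedent); assumes support(antecedent) > 0."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def lift(antecedent: FrozenSet[str], consequent: FrozenSet[str],
         transactions: List[Transaction]) -> float:
    """Confidence divided by the consequent's support; assumes both supports > 0."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


# Hypothetical toy data: four market-basket transactions.
transactions = [frozenset(t) for t in (
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
)]

# Rule under evaluation: bread -> butter
antecedent = frozenset({"bread"})
consequent = frozenset({"butter"})

rule_support = support(antecedent | consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
rule_lift = lift(antecedent, consequent, transactions)
print(rule_support, rule_confidence, rule_lift)
```

Ranking many extracted rules by one or several such scores is the kind of selection task that the measures surveyed in this book are meant to support; which measure is appropriate depends on the data and on the user's goals.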

Structure of the book

The book is structured in three parts. The first part gathers four overviews
of quality measures. The second part contains four chapters dealing with
data quality, data linkage, contrast sets and association rule clustering.
Lastly, in the third part, four chapters describe new quality measures and
rule validation.

Part I: Overviews of Quality Measures

• Chapter 1: Choosing the Right Lens: Finding What is Interesting in Data
Mining, by Geng and Hamilton, gives a broad overview of the use of
interestingness measures in data mining. This survey reviews interestingness
measures for rules and summaries, classifies them from several perspectives,
compares their properties, identifies their roles in the data mining process,
describes methods of analyzing the measures, reviews principles for selecting
appropriate measures for applications, and predicts trends for research in
this area.

• Chapter 2: A Graph-based Clustering Approach to Evaluate Interestingness
Measures: A Tool and a Comparative Study, by Huynh et al., is concerned with
the study of interestingness measures. As interestingness depends both on the
structure of the data and on the decision-maker’s goals, this chapter
introduces a new contextual approach implemented in ARQAT, an exploratory
data analysis tool, in order to help the decision-maker select the most
suitable interestingness measures. The tool, which embeds a graph-based
clustering approach, is used to compare and contrast the behavior of
thirty-six interestingness measures on two typical but quite different
datasets. This experiment leads to the discovery of five stable clusters of
measures.

• Chapter 3: Association Rule Interestingness Measures: Experimental and
Theoretical Studies, by Lenca et al., discusses the selection of the most
appropriate interestingness measures, according to a variety of criteria. It
presents a formal and an experimental study of 20 measures. The experimental
studies carried out on 10 data sets lead to an experimental classification of
the measures. These studies lead to the design of a multi-criteria decision
analysis in order to select the measures that best take into account the
user’s needs.

• Chapter 4: On the Discovery of Exception Rules: A Survey, by Duval et al.,
presents a survey of approaches developed for mining exception rules. They
distinguish two approaches to using an expert’s knowledge: using it as
syntactic constraints and using it in the form of commonsense rules. Works
that rely on either of these approaches, along with their particular quality
evaluation, are presented in this survey. Moreover, this chapter also gives
ideas on how numerical criteria can be intertwined with user-centered
approaches.

Part II: From Data to Rule Quality

• Chapter 5: Measuring and Modelling Data Quality for Quality-Awareness in
Data Mining, by Berti-Équille. This chapter offers an overview of data
quality management, data linkage and data cleaning techniques that can be
advantageously employed for improving quality awareness during the knowledge
discovery process. It also details the steps of a pragmatic framework for
data quality awareness and enhancement. Each step may use, combine and
exploit the data quality characterization, measurement and management
methods, and the related techniques proposed in the literature.

• Chapter 6: Quality and Complexity Measures for Data Linkage and
Deduplication, by Christen and Goiser, proposes a survey of different
measures that have been used to characterize the quality and complexity of
data linkage algorithms. It is shown that measures in the space of record
pair comparisons can produce deceptive quality results. Various measures are
discussed and recommendations are given on how to assess data linkage and
deduplication quality and complexity.

• Chapter 7: Statistical Methodologies for Mining Potentially Interesting
Contrast Sets, by Hilderman and Peckham, focuses on contrast sets that aim at
identifying the significant differences between classes or groups. They
compare two contrast set mining methodologies, STUCCO and CIGAR, and discuss
the underlying statistical measures. Experimental results show that both
methodologies are statistically sound, and thus represent valid alternative
solutions to the problem of identifying potentially interesting contrast
sets.

• Chapter 8: Understandability of Association Rules: A Heuristic Measure to
Enhance Rule Quality, by Natarajan and Shekar, deals with the clustering of
association rules in order to facilitate easy exploration of connections
between rules, and introduces the Weakness measure dedicated to this goal.
The average linkage method is used to cluster rules obtained from a small
artificial data set. Clusters are compared with those obtained by applying a
commonly used method.

Part III: Rule Quality and Validation

• Chapter 9: A New Probabilistic Measure of Interestingness for Association
Rules, Based on the Likelihood of the Link, by Lerman and Azé, presents the
foundations and the construction of a probabilistic interestingness measure
called the likelihood of the link index. They discuss two facets, symmetrical
and asymmetrical, of this measure and the two stages needed to build this
index. Finally, they report the results of experiments to estimate the
relevance of their statistical approach.

• Chapter 10: Towards a Unifying Probabilistic Implicative Normalized Quality
Measure for Association Rules, by Diatta et al., defines the so-called
normalized probabilistic quality measures (PQM) for association rules. Then,
they consider a normalized and implicative PQM called MGK, and discuss its
properties.

• Chapter 11: Association Rule Interestingness: Measure and Statistical
Validation, by Lallich et al., is concerned with association rule validation.
After reviewing well-known measures and criteria, the statistical validity of
selecting the most interesting rules by performing a large number of tests is
investigated. An original, bootstrap-based validation method is proposed that
controls, for a given level, the number of false discoveries. The potential
value of this method is illustrated by several examples.

• Chapter 12: Comparing Classification Results between N-ary and Binary
Problems, by Felkin, deals with supervised learning and the quality of
classifiers. This chapter presents a practical tool that will enable the
data-analyst to apply quality measures to a classification task. More
specifically, the tool can be used during the pre-processing step, when the
analyst is considering different formulations of the task at hand. This tool
is well suited for illustrating the choices for the number of possible class
values to be used to define a classification problem and the relative
difficulties of the problems that result from these choices.

Topics

The topics of the book include:


• Measures for data quality
• Objective vs subjective measures
• Interestingness measures for rules, patterns, and summaries
• Quality measures for classification, clustering, pattern discovery, etc.
• Theoretical properties of quality measures
• Human-centered quality measures for knowledge validation
• Aggregation of measures
• Quality measures for different stages of the data mining process
• Evaluation of measure properties via simulation
• Application of quality measures and case studies

Review Committee

All published chapters have been reviewed by at least 2 referees.


• Henri Briand (LINA, University of Nantes, France)
• Régis Gras (LINA, University of Nantes, France)
• Yves Kodratoff (LRI, University of Paris-Sud, France)
• Vipin Kumar (University of Minnesota, USA)
• Pascale Kuntz (LINA, University of Nantes, France)
• Robert Hilderman (University of Regina, Canada)
• Ludovic Lebart (ENST, Paris, France)
• Philippe Lenca (ENST-Bretagne, Brest, France)
• Bing Liu (University of Illinois at Chicago, USA)
• Amedeo Napoli (LORIA, University of Nancy, France)
• Gregory Piatetsky-Shapiro (KDNuggets, USA)
• Gilbert Ritschard (University of Geneva, Switzerland)
• Sigal Sahar (Intel, USA)
• Gilbert Saporta (CNAM, Paris, France)
• Dan Simovici (University of Massachusetts Boston, USA)
• Jaideep Srivastava (University of Minnesota, USA)
• Einoshin Suzuki (Yokohama National University, Japan)
• Pang-Ning Tan (Michigan State University, USA)
• Alexander Tuzhilin (Stern School of Business, USA)
• Djamel Zighed (ERIC, University of Lyon 2, France)

Associated Reviewers

Jérôme Azé, Laure Berti-Équille, Libei Chen, Peter Christen, Béatrice Duval,
Mary Felkin, Liqiang Geng, Karl Goiser, Stéphane Lallich, Rajesh Natarajan,
Ansaf Salleb, Benoît Vaillant

Acknowledgments

The editors would like to thank the chapter authors for their insights and
contributions to this book.

The editors would also like to acknowledge the members of the review
committee and the associated referees for their involvement in the review
process of the book. Without their support the book would not have been
satisfactorily completed.

Special thanks go to D. Zighed and H. Briand for their kind support
and encouragement.

Finally, we thank Springer and the publishing team, and especially
T. Ditzinger and J. Kacprzyk, for their confidence in our project.

Regina, Canada and Nantes, France, May 2006
Fabrice Guillet
Howard Hamilton
Contents

Part I Overviews on rule quality

Choosing the Right Lens: Finding What is Interesting in Data Mining
Liqiang Geng, Howard J. Hamilton

A Graph-based Clustering Approach to Evaluate Interestingness Measures:
A Tool and a Comparative Study
Xuan-Hiep Huynh, Fabrice Guillet, Julien Blanchard, Pascale Kuntz,
Henri Briand, Régis Gras

Association Rule Interestingness Measures: Experimental and Theoretical Studies
Philippe Lenca, Benoît Vaillant, Patrick Meyer, Stéphane Lallich

On the Discovery of Exception Rules: A Survey
Béatrice Duval, Ansaf Salleb, Christel Vrain

Part II From data to rule quality

Measuring and Modelling Data Quality for Quality-Awareness in Data Mining
Laure Berti-Équille

Quality and Complexity Measures for Data Linkage and Deduplication
Peter Christen, Karl Goiser

Statistical Methodologies for Mining Potentially Interesting Contrast Sets
Robert J. Hilderman, Terry Peckham

Understandability of Association Rules: A Heuristic Measure to Enhance
Rule Quality
Rajesh Natarajan, B. Shekar

Part III Rule quality and validation

A New Probabilistic Measure of Interestingness for Association Rules,
Based on the Likelihood of the Link
Israël-César Lerman, Jérôme Azé

Towards a Unifying Probabilistic Implicative Normalized Quality Measure
for Association Rules
Jean Diatta, Henri Ralambondrainy, André Totohasina

Association Rule Interestingness: Measure and Statistical Validation
Stéphane Lallich, Olivier Teytaud, Elie Prudhomme

Comparing Classification Results between N-ary and Binary Problems
Mary Felkin

About the Authors

Index
List of Contributors

Jérôme Azé
LRI, University of Paris-Sud, Orsay, France
aze@lri.fr

Laure Berti-Équille
IRISA, University of Rennes I, France
Laure.Berti-Equille@irisa.fr

Julien Blanchard
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Henri Briand
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Peter Christen
The Australian National University, Canberra, Australia
peter.christen@anu.edu.au

Jean Diatta
IREMIA, University of La Réunion, Saint-Denis, France
jdiatta@univ-reunion.fr

Béatrice Duval
LERIA, University of Angers, France
Beatrice.Duval@univ-angers.fr

Mary Felkin
LRI, University of Paris-Sud, Orsay, France
felkin@lri.fr

Liqiang Geng
Department of Computer Science, University of Regina, Canada
gengl@cs.uregina.ca

Karl Goiser
The Australian National University, Canberra, Australia
karl.goiser@anu.edu.au

Régis Gras
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Fabrice Guillet
LINA CNRS 2729, Polytechnic School of Nantes University, France
Fabrice.Guillet@univ-nantes.fr

Howard J. Hamilton
Department of Computer Science, University of Regina, Canada
hamilton@cs.uregina.ca

Robert J. Hilderman
Department of Computer Science, University of Regina, Canada
hilder@cs.uregina.ca

Xuan-Hiep Huynh
LINA CNRS 2729, Polytechnic School of Nantes University, France
Xuan-Hiep.Huynh@univ-nantes.fr

Pascale Kuntz
LINA CNRS 2729, Polytechnic School of Nantes University, France
Julien.Blanchard@univ-nantes.fr

Stéphane Lallich
ERIC, University of Lyon 2, France
stephane.lallich@univ-lyon2.fr

Philippe Lenca
TAMCIC CNRS 2872, GET/ENST Bretagne, France
philippe.lenca@enst-bretagne.fr

Israël-César Lerman
IRISA, University of Rennes I, France
lerman@irisa.fr

Patrick Meyer
University of Luxemburg, Luxemburg
patrick.meyer@uni.lu

Rajesh Natarajan
Cognizant Technology Solutions, Chennai, India
rajesh.natarajan@cognizant.com

Terry Peckham
Department of Computer Science, University of Regina, Canada
peckham@cs.uregina.ca

Elie Prudhomme
ERIC, University of Lyon 2, France
eprudhomme@eric.univ-lyon2.fr

Henri Ralambondrainy
IREMIA, University of La Réunion, Saint-Denis, France
ralambon@univ-reunion.fr

Ansaf Salleb
UCCLS, Columbia University, New York, U.S.A
Ansaf@ccls.columbia.edu

B. Shekar
QMIS, Indian Institute of Management (IIMB), Bangalore, India
shek@iimb.ernet.in

Olivier Teytaud
TAO-Inria, LRI, University of Paris-Sud, Orsay, France
teytaud@lri.fr

André Totohasina
ENSET, University of Antsiranana, Madagascar
totohasina@yahoo.fr

Benoît Vaillant
TAMCIC CNRS 2872, GET/ENST Bretagne, France
benoit.vaillant@enst-bretagne.fr

Christel Vrain
LIFO, University of Orléans, France
Christel.Vrain@univ-orleans.fr
