Entity Resolution and Information Quality

About this ebook

Entity Resolution and Information Quality presents topics and definitions and clarifies confusing terminology regarding entity resolution and information quality. It takes a very wide view of IQ, including the six-domain framework of IQ knowledge and skills developed by the International Association for Information and Data Quality (IAIDQ). The book includes chapters that cover the principles of entity resolution and the principles of information quality, along with their concepts and terminology. It also discusses the Fellegi-Sunter theory of record linkage, the Stanford Entity Resolution Framework, and the Algebraic Model for Entity Resolution, the major theoretical models that support entity resolution. In relation to this, the book briefly discusses entity-based data integration (EBDI) and its model, which serves as an extension of the Algebraic Model for Entity Resolution. There is also an explanation of how three commercial ER systems operate and a description of the non-commercial, open-source system known as OYSTER. The book concludes by discussing trends in entity resolution research and practice. Students taking IT courses and IT professionals will find this book invaluable.
  • First authoritative reference explaining entity resolution and how to use it effectively
  • Provides practical system design advice to help you get a competitive advantage
  • Includes a companion site with synthetic customer data for hands-on exercises and access to a Java-based entity resolution program.
Language: English
Release date: January 14, 2011
ISBN: 9780123819734
Author

John R. Talburt

Dr. John R. Talburt is Professor of Information Science at the University of Arkansas at Little Rock (UALR), where he is the Coordinator for the Information Quality Graduate Program and the Executive Director of the UALR Center for Advanced Research in Entity Resolution and Information Quality (ERIQ). He is also the Chief Scientist for Black Oak Partners, LLC, an information quality solutions company. Prior to his appointment at UALR he was the leader for research and development and product innovation at Acxiom Corporation, a global leader in information management and customer data integration. Professor Talburt holds several patents related to customer data integration, has written numerous articles on information quality and entity resolution, and is the author of Entity Resolution and Information Quality (Morgan Kaufmann, 2011). He also holds the IAIDQ Information Quality Certified Professional (IQCP) credential.


    Book preview

    Entity Resolution and Information Quality - John R. Talburt

    Table of Contents

    Cover Image

    Front matter

    Copyright

    Dedication

    Foreword

    Preface

    Acknowledgements

    1. Principles of Entity Resolution

    2. Principles of Information Quality

    3. Entity Resolution Models

    4. Entity-Based Data Integration

    5. Entity Resolution Systems

    6. The OYSTER Project

    7. Trends in Entity Resolution Research and Applications

    Bibliography

    Glossary

    Appendix A

    Index

    Front matter

    ENTITY RESOLUTION AND INFORMATION QUALITY

    Entity Resolution and Information Quality

    John R. Talburt

    Copyright © 2011 Elsevier Inc. All rights reserved.

    Copyright

    Morgan Kaufmann Publishers is an imprint of Elsevier.

    30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

    This book is printed on acid-free paper.

    © 2011 Elsevier Inc. All rights reserved.

    No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

    This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    Library of Congress Cataloging-in-Publication Data

    Application submitted

    British Library Cataloguing-in-Publication Data

    A catalogue record for this book is available from the British Library.

    ISBN: 978-0-12-381972-7

    For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com

    Printed in the United States of America

    10 11 12 13 14    5 4 3 2 1

    Dedication

    To Rebeca and Geneva for their patience and understanding during the writing of this book

    Foreword

    Entity resolution is the process of probabilistically identifying some real thing based upon a set of possibly ambiguous clues. Humans have been performing entity resolution throughout history. Early humans looked at footprints and tried to match that clue to the animals that made the tracks. Later, people with special domain knowledge looked at the shape of a whale's spout to determine if the particular whale belonged to the right class of whale to hunt. During World War II, English analysts learned to identify individual German radio operators solely based upon that operator's fist, the timing and style the operator used to key Morse code.

    In the middle of the twentieth century, people began applying the power of computers to the problem of entity resolution. For example, entity resolution techniques were used to process and analyze U.S. Census records, and the early direct marketing industry developed merge-purge systems to help identify and resolve individuals and households. The speed of the computer allows analysis of far more data than is possible for a human expert, but it requires that the heuristics and expertise we often take for granted in humans be codified into algorithms the computer can execute.

    One industry particularly interested in effective entity resolution is direct marketing. Acxiom Corporation provides many entity resolution services to the direct marketing industry and has developed many tools and algorithms to address the entity resolution problem. I met John Talburt in 1996 when we both began work at Acxiom. At that time, much of the knowledge about how to effectively apply computers to the problem of entity resolution was fragmented and dispersed. For example, the criteria for what made two entities similar or distinct were often defined differently across teams, as was the assessment of the quality of the results. Similarly, from a technical perspective, the strategies and techniques for extracting clues from digital data, including possible transformations to correct or enhance the extracted clues, were often based directly upon the experience of the particular people involved. This was also true of the matching algorithms used for resolution. While some papers had been published about the techniques, much of the knowledge was held in the heads of practitioners. That knowledge was carefully guarded and often considered a trade secret or competitive advantage, particularly in the commercial sector.

    In 1997, John and I, along with several others at Acxiom, set out to create a single entity resolution system that combined all the experience and knowledge about entity resolution for names and postal addresses. The product resulting from this effort is called AbiliTec™. At the start of the AbiliTec™ project, most of the people working on it had either no previous background in entity resolution or hard-won trial-and-error knowledge from implementing previous entity resolution systems. I was one of those with no previous knowledge, and looking back, I realize, not for the first time, how valuable a comprehensive introductory book would have been to our efforts.

    I am very happy that John has written this book to help fill that need. In this book, John brings organization to the topic and provides definitions and clarifications for terminology that has been overlapping and confusing. This book continues the transformation of entity resolution into a discipline rather than merely a toolbox of techniques. John is uniquely qualified to write this book. He has not only the practical experience of building important real-world entity resolution systems (e.g., AbiliTec™), but also the academic background to explain and unify the theory of entity resolution. John also brings his expertise in information quality to this book. Information quality and entity resolution are closely related, and John, along with Rich Wang from MIT, was a driving force behind the creation of the Information Quality program at the University of Arkansas at Little Rock (UALR). This was the first program of its kind in the world and remains at the center of the information quality field.

    I am writing this foreword on September 11, 2010, the anniversary of the terrorist attacks on Washington and New York. This gives me a perspective on how entity resolution has continued to expand and evolve since the early merge-purge days. Following the terrorist attacks, the United States government investigated how entity resolution techniques could help prevent such attacks. The government looked not only at entity resolution techniques already employed by the security agencies, but also at commercial systems such as those used in the gambling industry and the direct marketing industry. John was engaged, and is still engaged, in some of the work with the government on these problems. Much of the entity resolution work up to that time had focused on analyzing direct attributes of an individual (e.g., name, address, date of birth, etc.), but these efforts brought much more focus on the links between people and how those links can help in identifying and resolving not only at the individual level, but also at the group level.

    Resolution through linkage has become critical not only for security and law enforcement work, but also for the analysis of social networks. Indeed, the explosion of applications on the Internet has generated many new challenges for entity resolution. The early direct marketing industry dealt with people with known names at postal addresses. Today, in the Internet world, people are increasingly known by multiple artificial names or personas and are contacted through virtual addresses. This requires new techniques for entity resolution. For example, resolving anonymous entities (e.g., visitors to a web site) based upon browsing fingerprints (e.g., the IP address of the client machine, the operating system of that machine, the browser used, etc.) is an interesting challenge and an active area of work. Examples such as this also bring questions of privacy into the discussion of entity resolution. Similarly, efforts supporting selective exposure of private data on the Internet (e.g., information cards) and distributed authentication (e.g., OpenID) also complicate and expand the discussion of entity resolution from both a technical and a policy perspective. This book will not only help provide the background for these efforts, but also help organize and frame the discussions as entity resolution continues to evolve.

    Terry Talley

    September 11, 2010

    Preface

    Motivation for the Book

    Entity resolution (ER) and information quality (IQ) are both emerging disciplines in the field of information science. It is my hope that this book will make some contribution to the growing bodies of knowledge in these areas. I find it very rewarding to be a part of starting something new. The opportunity to help organize the first graduate degree programs in IQ has been an exciting journey. One of the struggles has been to find appropriate books and resources for the students. Not many college-level textbooks have been written on these topics. With the notable exceptions of Introduction to Information Quality by Craig Fisher, Eitel Lauria, Shobha Chengalur-Smith, and Richard Wang, and Journey to Data Quality by Yang Lee, Leo Pipino, James Funk, and Richard Wang, most of the titles in the area of IQ have been written by practitioners primarily for other practitioners. However, I must say that this is not necessarily a bad thing. Very practical and detailed books like Data Quality Assessment by Arkady Maydanchik and Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray have served well as texts for some of our classes and have been well received by both instructors and students. As more schools begin to teach courses in these areas, I have no doubt that more textbooks will be produced to meet the demand.

    As you read this book, especially Chapter 2, you will see that I take a very broad view of IQ. I think that the six-domain framework of IQ knowledge and skills developed by the International Association for Information and Data Quality (IAIDQ), which also appears in Chapter 2, is an excellent outline of the scope of the new discipline. It confirms that many of the currently popular information technology and information management themes, such as master data management and data governance, properly fall within the discipline of IQ, and that many others, such as entity and identity resolution and information architecture, have very strong bonds with IQ.

    This book has emerged from the material developed for a graduate course titled Entity Resolution and Information Quality that has been offered as an elective in the Information Quality Graduate Program at the University of Arkansas at Little Rock since the fall of 2009. In these offerings, the book Data Quality and Record Linkage Techniques by Thomas Herzog, Fritz Scheuren, and William Winkler has been an important resource for the students. Although I highly recommend this book for its excellent coverage of value imputation, the Fellegi-Sunter record linkage model, and a number of well-written case studies, it does not cover the breadth of topics that in my view comprise the whole of entity resolution. As with IQ, I also take a very broad view of entity resolution, and one of my goals for writing this book was to encourage both ER and IQ researchers and practitioners to take a more holistic view of both of these topics.

    My observation has been that there are many highly qualified practitioners and researchers publishing in these areas. For example, it is not hard to find papers that plumb the depths of almost any given topic in entity resolution. My hope is that this book will help place these more narrowly defined topics into the larger framework of ER, and that by doing so, it will promote the cross-fertilization of ideas and techniques among them. I am sure that not everyone will entirely agree with my definitions or categorizations, but this is the view I offer for the reader's consideration. Knowledge grows by examining and contrasting ideas, and building step by step on the work of others.

    Audience

    Although written in textbook format, this book should be helpful to IT professionals as well as students. Even for experts in this area, I think it can be useful, if for no other reason than to provide an organized perspective on the very broad range of topics that comprise entity resolution. My hope is that the designers of ER systems may be inspired to create even more robust applications by integrating some of the techniques and methods presented here that they had not previously considered. I also think that the material in this book will be useful to both technical and non-technical managers who want to be conversant in the basic terminology and concepts related to this important area of information systems technology.

    Because IQ is very interdisciplinary, the UALR Information Quality program was positioned as a graduate program to accommodate students coming from a variety of undergraduate disciplines. Even though the course that motivated the writing of this book is taught at the graduate level, most of the material is accessible to upper-division undergraduate students and could support either an undergraduate or a dual-listed graduate/undergraduate course.

    Organization of the Material

    The first two chapters of the book cover the principles of ER and the principles of IQ, respectively. They introduce the basic terminology and concepts used throughout the remainder of the book, including the definition of ER, the unique reference assumption, and the fundamental law of ER. The main thrust of Chapter 1 is that ER is much more than just record matching; it is about determining the equivalence of references. It discusses the five ER activities of entity reference extraction, entity reference preparation, entity reference resolution, entity identity management, and entity association analysis. Chapter 1 also introduces the four ER architectures of merge-purge or record linkage, heterogeneous database join, identity resolution, and identity capture, as well as the four methods for determining the equivalence of references: direct matching, transitive equivalence, association analysis, and asserted equivalence.

    Chapter 2 outlines the emerging discipline of IQ. Here the primary emphasis is that IQ must always be connected to business value: IQ is more than just cleaning data; it is about viewing information as a non-fungible asset of the organization whose quality is directly related to the value produced by its application. The chapter also discusses the information product model of IQ and the application of TQM principles to IQ management. Both Chapter 1 and Chapter 2 speak to the close relationship between ER and IQ.

    Chapter 3 describes the major theoretical models that underpin the basic aspects of ER, starting with the Fellegi-Sunter theory of record linkage. This is followed by the Stanford Entity Resolution Framework and the Algebraic Model for Entity Resolution, along with a brief description of the ENRES meta-model.
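
    To make the record linkage scoring idea concrete before Chapter 3, the sketch below shows a Fellegi-Sunter-style comparison in Python: each field comparison is weighted by the log-likelihood ratio of its m-probability (the chance the field agrees for a true match) and u-probability (the chance it agrees for a non-match), and the summed weight is tested against two thresholds. All probabilities, field names, and thresholds here are hypothetical values chosen for illustration, not figures from the book.

        import math

        # Hypothetical m- and u-probabilities for three comparison fields.
        # m = P(field agrees | records are a true match)
        # u = P(field agrees | records are not a match)
        FIELD_PROBS = {
            "last_name":     {"m": 0.95, "u": 0.01},
            "date_of_birth": {"m": 0.97, "u": 0.005},
            "zip_code":      {"m": 0.90, "u": 0.10},
        }

        def field_weight(field, agrees):
            """Log-likelihood-ratio weight for one field comparison."""
            m, u = FIELD_PROBS[field]["m"], FIELD_PROBS[field]["u"]
            if agrees:
                return math.log2(m / u)          # positive agreement weight
            return math.log2((1 - m) / (1 - u))  # negative disagreement weight

        def classify(agreements, upper=10.0, lower=0.0):
            """Sum the field weights and apply the two Fellegi-Sunter thresholds."""
            total = sum(field_weight(f, a) for f, a in agreements.items())
            if total >= upper:
                return "link"
            if total <= lower:
                return "non-link"
            return "possible link (clerical review)"

        # A pair agreeing on last name and date of birth but not ZIP code:
        print(classify({"last_name": True, "date_of_birth": True, "zip_code": False}))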

    Chapter 4 is a brief excursion into the realm of entity-based data integration (EBDI). It describes a model for EBDI that is an extension of the algebraic model for entity resolution. The algebraic EBDI model provides a framework for formally describing integration contexts and operators independently of their actual implementation. It also discusses some of the more commonly defined integration selection operators and how they are evaluated.
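
    As one concrete illustration of what an integration selection operator can do, the sketch below selects an attribute value from the most recently updated of several equivalent records. This is a generic, hypothetical operator and schema written for illustration; it is not necessarily one of the operators defined in Chapter 4.

        from datetime import date

        # Two hypothetical records for the same entity from different sources.
        records = [
            {"source": "CRM",     "updated": date(2009, 5, 1),  "phone": "555-0101"},
            {"source": "Billing", "updated": date(2010, 11, 3), "phone": "555-0199"},
        ]

        def select_most_recent(equivalent_records, attribute):
            """Selection operator: take the value from the freshest record."""
            freshest = max(equivalent_records, key=lambda r: r["updated"])
            return freshest[attribute]

        print(select_most_recent(records, "phone"))  # -> 555-0199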

    As a balance to the theoretical discussions in Chapters 3 and 4, the material in Chapter 5 describes the operation of three commercial ER systems. It also includes step-by-step details on how two of these tools are set up to execute actual ER scenarios.

    Chapter 6 extends Chapter 5 by describing a non-commercial, open-source system called OYSTER. Although used in this book as an instructional tool, OYSTER has the demonstrated capability of supporting real applications in business and government. OYSTER is the only open-source ER system with a resolution engine that can be configured to perform merge-purge (record linkage), identity resolution, or identity capture operations. An appendix to the book provides the reader with example OYSTER XML scripts that can be used to operate each of these configurations and guidance for those who want to download and experiment with OYSTER.

    Chapter 7 discusses some of the trends in ER research and practice. These include the growing use of identity resolution to support information hubs, the impact of high-performance computing on entity resolution, research in the application of graph theory and network analysis to improve resolution results, and the use of machine learning techniques, such as genetic programming, to optimize the accuracy of entity-based data integration.

    In addition to OYSTER, another important resource for the material in this book is the use of synthetic data. Synthetic data solves one of the more difficult problems in teaching ER and IQ when the entities are persons: privacy and legal concerns make it very difficult to obtain and use personally identifiable information. Even though trivial examples such as "John Doe on Elm Street" can illustrate many of the basic ER concepts, they fail to exhibit the many complexities, nuances, and data quality issues that make real entity references difficult to resolve. In order to give students more realistic ER exercises, synthetic data is used. The synthetic data used in the Chapter 5 scenarios is available to the reader through the Center for Advanced Research in Entity Resolution and Information Quality (ERIQ, ualr.edu/eriq). The data was generated in previous research projects to simulate a population of synthetic identities of different ages moving through a set of real US addresses over a period of time.

    Acknowledgements

    There are many people and organizations whose support for the UALR Information Quality Graduate Program and the Center for Advanced Research in Entity Resolution and Information Quality (ERIQ) has made it possible for me to write this book and to whom I owe a great debt of gratitude. First and foremost, I want to thank my friend and mentor Dr. Richard Wang, Director of the MIT IQ Program and currently serving as the Chief Data Quality Officer and Deputy Chief Data Officer of the U.S. Army. Without his vision, encouragement, and tireless efforts to establish information quality as an academic discipline, none of my work would have been possible. I am also grateful to Acxiom Corporation and its leadership team for their willingness to underwrite and support these programs during their formation, especially former Acxiom executives Charles Morgan, Rodger Kline, Alex Dietz, Jerry Adams, Don Hinman, Zack Wilhoit, Jim Womble, and Wally Anderson, as well as the current CEO, John Meyer, and executives Jennifer Barrett, Jerry Jones, Chad Fitz, Todd Greer, and Catherine Hughes. I would also like to thank Dr. Mary Good, the Founding Dean of the UALR Donaghey College of Engineering and Information Technology, for her support and willingness to embark into uncharted academic waters.

    Special thanks to Mike Shultz, CEO of Infoglide Software, who provided the academic license for their Identity Resolution Engine (IRE), and to Bob Barker of 2020 Outlook, who helped to arrange the collaboration. Thanks also to Dr. Jim Goodnight, President of SAS, who provided the academic license for dfPowerStudio, and to Lisa Dodson, Product Manager for DataFlux, who helped us learn how to use it.

    Others who have provided support include Jim Boardman, Dr. Neal Gibson, and Dr. Greg Holland, Arkansas Department of Education; Rick McGraw, Managing Partner for Black Oak Partners; Alba Alemán and Raymond Roberts, Citizant; Frank Ponzio, Symbolic Systems; Larry English, Information Impact International; Michael Boggs, Analytix Data Services; Ken Kotansky and Steven Meister, AMB New Generation Empowerment; Terry Talley, Southwest Power Pool; Chuck Backus, Lexis-Nexis (who suggested the student data challenge exercise); Rob Williams and Brian Tsou, US Air Force Research Laboratory (AFRL) at Wright Patterson Air Force Base, as well as Qbase and the Wright Brothers Institute in Dayton, Ohio.

    I would also like to acknowledge Dr. Ali Kooshesh, Sonoma State University, for his work on the genetic programming approach to entity-based data integration, and Dr. Ray Hashemi, Armstrong Atlantic State University, who assisted me on several ER-related projects. I also owe a great debt to the many students who have contributed to the development of this material, especially Eric Nelson for his help in the development of OYSTER; Yinle Zhou, my teaching assistant in the ER course and co-developer with Sabitha Shiviah of the synthetic data generator (SOG); Isaac Osesina for providing input on entity reference extraction and named-entity recognition (NER); and Fumiko Kobayashi for her work in testing and documenting OYSTER. Finally, I would like to thank my Department Chair, Dr. Elizabeth Pierce, and my support staff, including Natalie Rego, Administrative Assistant for the Information Quality Graduate Program; Brenda Barnhill, the ERIQ Center Program Manager; and Gregg Webster, the ERIQ Center Technical Manager.

    1. Principles of Entity Resolution

    Entity Resolution

    Entity resolution (ER) is the process of determining whether two references to real-world objects are referring to the same object or to different objects. The term entity describes the real-world object (a person, place, or thing), and the term resolution is used because ER is fundamentally a decision process to answer (resolve) the question, "Are the references to the same or to different entities?" Although the ER process is defined between pairs of references, it can be systematically and successively applied to a larger set of references so as to aggregate all the references to the same object into subsets or clusters. Viewed in this larger context, ER is also defined as the process of identifying and merging records judged to represent the same real-world entity (Benjelloun, Garcia-Molina, Menestrina, et al., 2009).
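
    As a minimal sketch of how a pairwise decision extends to clusters, the Python code below applies a placeholder decision function to every pair of references and merges the results with a union-find structure, so that equivalence propagates transitively. The exact-name-match rule is a deliberately naive stand-in for real resolution logic, and the data is hypothetical.

        # "same_entity" is a naive stand-in for real resolution logic:
        # it declares two references equivalent only on an exact name match.
        def same_entity(ref_a, ref_b):
            return ref_a["name"] == ref_b["name"]

        def cluster(references):
            """Apply the pairwise decision to every pair, merging with
            union-find so that equivalence propagates transitively."""
            parent = list(range(len(references)))

            def find(i):
                while parent[i] != i:
                    parent[i] = parent[parent[i]]  # path compression
                    i = parent[i]
                return i

            for i in range(len(references)):
                for j in range(i + 1, len(references)):
                    if same_entity(references[i], references[j]):
                        parent[find(i)] = find(j)  # union the two clusters

            clusters = {}
            for i, ref in enumerate(references):
                clusters.setdefault(find(i), []).append(ref)
            return list(clusters.values())

        refs = [{"name": "Jane Doe"}, {"name": "J. Doe"}, {"name": "Jane Doe"}]
        print(cluster(refs))  # the two exact "Jane Doe" references share a cluster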

    Entities are described in terms of their characteristics, called attributes. The values of these attributes provide information about a specific entity. Identity attributes are those that, when taken together, distinguish one entity from another. Identity attributes for people are things such as name, address, date of birth, and fingerprint: the kinds of things often asked for to identify the person requesting a driver's license or hospital admission. For a product, identity attributes might be model number, size, manufacturer, or universal product code (UPC).
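
    For instance, an entity reference can be represented as a simple set of attribute-value pairs, and it is the identity attributes taken together that separate otherwise similar entities. The field names and values below are hypothetical illustrations, not a standard schema.

        # Two hypothetical person references that agree on name but are
        # distinguished once date of birth is considered as well.
        ref1 = {"name": "Mary K. Smith", "date_of_birth": "1975-03-14", "address": "123 Oak St"}
        ref2 = {"name": "Mary K. Smith", "date_of_birth": "1982-07-02", "address": "45 Elm Ave"}

        identity_attrs = ("name", "date_of_birth")
        print(all(ref1[a] == ref2[a] for a in identity_attrs))  # False: different people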

    A reference is a collection of attribute values for a specific entity. When two references are to the same entity, they are sometimes said to co-refer (Chen, Kalashnikov, & Mehrotra, 2009) or to be matching references (Benjelloun et al., 2009). However, for reasons that will become clear later, the term equivalent references will be used throughout this text to describe references to the same entity.

    An important assumption throughout the following discussion of ER is the unique reference assumption. The unique reference assumption simply states that a reference is always created to refer to one, and only one, entity. The reason for this assumption is that in real-world situations a reference may appear to be ambiguous—that is, it could refer to more than one entity or possibly no entity. For example, a salesperson could write a product description on a sales order,
