
Data Quality Integration for PowerCenter Guide

Informatica PowerCenter
(Versions 8.1.1-8.6)

Informatica Data Quality


(Version 8.6.2)

Informatica Data Quality Integration for PowerCenter Guide Versions 8.1.1, 8.5.1, and 8.6 January 2009 Copyright (c) 1998-2008 Informatica Corporation. All rights reserved. This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and international Patents and other Patents Pending. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable. The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing. Informatica, PowerCenter, PowerExchange, Informatica B2B Data Exchange, Informatica B2B Data Transformation, Informatica Data Quality, Informatica Data Explorer, Informatica Identity Resolution and Matching, Informatica On Demand, PowerMart, PowerBridge, PowerConnect, PowerChannel, PowerPartner, PowerAnalyzer, PowerCenter Connect and PowerPlug are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners. Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright Melissa Data Corporation. 
All rights reserved. Copyright MySQL AB. All rights reserved. Copyright Platon Data Technology GmbH. All rights reserved. Copyright Seaview Software. All rights reserved. Copyright Sun Microsystems. All rights reserved. Copyright Oracle Corporation. All rights reserved. This product includes software developed by the Apache Software Foundation (http://www.apache.org/), software developed by l2fprod.com (http://common.l2fprod.com) and other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. This product includes software which was developed by the JFreeChart project (http://www.jfree.org/freechart/), software developed by the JDIC project (https://jdic.dev.java.net/) and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved. This product includes ICU software which is copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved.
Permissions and limitations regarding this software are subject to terms available at http://www-306.ibm.com/software/globalization/icu/license.jsp. This product includes software which is licensed under the MIT License, which may be found at http://www.opensource.org/licenses/mit-license.html. This product includes software which is licensed under the Eclipse Public License, which may be found at http://www.eclipse.org/org/documents/epl-v10.html. Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use, copy, modify, distribute, and license this software and its documentation for any purpose. This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved. This product includes software which is licensed under the Open LDAP Public License, which may be found at http://www.openldap.org/software/release/license.html. Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com). This Software may be protected by U.S. and international Patents and Patents Pending. DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this product or documentation is error free. The information provided in this product or documentation may include technical inaccuracies or typographical errors.

Part Number: IDQ-INT-86200-0002

Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Informatica Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 2: Integrating Data Quality and PowerCenter . . . . . . . . . . . . . . . . . . . . . . . . . 3


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Integrating with Data Quality Workbench . . . . . . . . . . . . . . . . . . . . . . . . 3
Using Association and Consolidation Transformations . . . . . . . . . . . . . . . . . 4
Using the Data Quality Integration Transformation . . . . . . . . . . . . . . . . . . 5
Deploying Data Quality Plans in PowerCenter . . . . . . . . . . . . . . . . . . . . . 5
Adding Address Validation Functionality to PowerCenter . . . . . . . . . . . . . . . 6
Adding Identity Matching Functionality to PowerCenter . . . . . . . . . . . . . . . . 6

Chapter 3: Working with Plans in the PowerCenter Repository . . . . . . . . . . . . . . . . . 9


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Running Plans as Mapplets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Running Plans with the Data Quality Integration Transformation . . . . . . . . . . . 12
Data Quality Integration Transformation Properties . . . . . . . . . . . . . . . . . 13

Chapter 4: Association Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Association Transformation Properties . . . . . . . . . . . . . . . . . . . . . . . . 15
Creating an Association Transformation . . . . . . . . . . . . . . . . . . . . . . . 16

Chapter 5: Consolidation Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Consolidation Transformation Ports . . . . . . . . . . . . . . . . . . . . . . . . . 20
Consolidation Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Consolidation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Creating a Consolidation Transformation . . . . . . . . . . . . . . . . . . . . . . . 22

Chapter 6: Creating Mappings for Data Quality Plans . . . . . . . . . . . . . . . . . . . . . . . . 25


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Creating a Mapping to Cleanse, Parse, or Validate Data . . . . . . . . . . . . . . . 27
Creating a Mapping to Match Data from a Single Source . . . . . . . . . . . . . . . . 28
Creating a Mapping to Match Data from Two Data Sources . . . . . . . . . . . . . . . 28
Creating a Mapping to Match Identity Information . . . . . . . . . . . . . . . . . . 29

Chapter 7: Creating Mappings for Association and Consolidation . . . . . . . . . . . . . . 33


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Associating and Consolidating Data from a Single Source . . . . . . . . . . . . . . . 33
Associating and Consolidating Data from Multiple Data Sources . . . . . . . . . . . . 34

Appendix A: Working with Data Matching Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Appendix B: Working with Identity Matching Plans . . . . . . . . . . . . . . . . . . . . . . . . . 41


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


Preface
This guide describes the Data Quality Integration components developed for Informatica Data Quality 8.6. It is provided for the following audiences:

- PowerCenter systems administrators who will install and register the Data Quality Integration components on their PowerCenter systems.
- PowerCenter users who will run data quality plans embedded in PowerCenter mappings.

Note: The Data Quality Integration transformation has changed in significant ways in the Data Quality 8.6 release. Read the Data Quality Integration Release Notes before installing and registering these components.

Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.

Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at infa_documentation@informatica.com. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments. The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to the Informatica Documentation Center from http://my.informatica.com.

Informatica Web Site


You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Informatica How-To Library


As an Informatica customer, you can access the Informatica How-To Library at http://my.informatica.com. The How-To Library is a collection of resources to help you learn more about Informatica products and features. It includes articles and interactive demonstrations that provide solutions to common problems, compare features and behaviors, and guide you through performing specific real-world tasks.

Informatica Knowledge Base


As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips.

Informatica Global Customer Support


There are many ways to access Informatica Global Customer Support. You can contact a Customer Support Center through telephone, email, or the WebSupport Service. Use the following email addresses to contact Informatica Global Customer Support:

- support@informatica.com for technical inquiries
- support_admin@informatica.com for general customer service requests

WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com. Use the following telephone numbers to contact Informatica Global Customer Support:
North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States
Toll Free: +1 877 463 2435
Standard Rate: Brazil +55 11 3523 7761; Mexico +52 55 1168 9763; United States +1 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Toll Free: 00 800 4632 4357
Standard Rate: Belgium +32 15 281 702; France +33 1 41 38 92 26; Germany +49 1805 702 702; Netherlands +31 306 022 797; Spain and Portugal +34 93 480 3760; United Kingdom +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008
India
Toll Free: Australia 1 800 151 830; Singapore 001 800 4632 4357
Standard Rate: India +91 80 4112 5738


CHAPTER 1

Introduction
This chapter includes the following topics:

Overview, 1

Overview
Informatica Data Quality components can add significant data quality management capabilities to your PowerCenter projects. The Data Quality Workbench application allows you to design data quality management processes, called plans, and write them to the PowerCenter repository. The Data Quality Integration plug-in installs a set of transformations to PowerCenter that allow you to run plans in sessions and to create mappings that identify and consolidate groups of duplicate records. Table 1-1 summarizes the tasks you can perform by integrating PowerCenter and Data Quality:
Table 1-1. Data Quality-PowerCenter Integration Options
- Save a data quality plan to the PowerCenter repository as a mapplet: Use Data Quality Workbench to export the plan from the Data Quality repository to the PowerCenter repository.
- Import a plan file as a PowerCenter mapplet: Use Data Quality Workbench to export the plan from the Data Quality repository to an XML file. Import the XML file using PowerCenter Repository Manager.
- Run a plan in a PowerCenter session using the PowerCenter Integration Service: Use PowerCenter to run a mapplet in a session.
- Link related data rows for consolidation: Use the Data Quality Association transformation in PowerCenter.
- Consolidate duplicate or overlapping data rows: Use the Data Quality Consolidation transformation in PowerCenter.

Note: The Integration plug-in does not install the Data Quality Integration transformation. Informatica provided this transformation with Data Quality version 8.5, and the transformation is deprecated. The current integration supports instances of the Data Quality Integration transformation that are saved in the PowerCenter repository, but it does not permit you to edit these transformations or to create new instances of this transformation.

Data Quality allows you to export data quality plans from the Data Quality repository to the PowerCenter repository as mapplets. Follow this path to integrate your plans into PowerCenter processes. You can also convert instances of the Data Quality Integration transformation in your PowerCenter repository to mapplets.


CHAPTER 2

Integrating Data Quality and PowerCenter


This chapter includes the following topics:

Overview, 3
Integrating with Data Quality Workbench, 3
Using Association and Consolidation Transformations, 4
Using the Data Quality Integration Transformation, 5
Deploying Data Quality Plans in PowerCenter, 5
Adding Address Validation Functionality to PowerCenter, 6
Adding Identity Matching Functionality to PowerCenter, 6

Overview
The Data Quality Integration plug-in adds a set of transformations to your PowerCenter client-side and server-side installations. Several of these transformations correspond to Data Quality Workbench components. For information on the features and functionality of these components, see the Informatica Data Quality User Guide. The plug-in also installs Data Quality Association and Consolidation transformations, which are not found in Data Quality Workbench. For information on the Association and Consolidation transformations, see page 15 and page 19 respectively. There are client-side and server-side versions of the plug-in.

- Install the client version locally to the PowerCenter Designer.
- Install the server version locally to the PowerCenter Integration Service that runs a workflow containing either transformation.

Integrating with Data Quality Workbench


Data Quality Workbench is the plan design application in the Data Quality software suite. Plan designers build plans in Workbench in the same way that mapping designers build mappings in PowerCenter. A plan comprises a series of operational components, each of which performs a different data quality management task. Plan designers configure these components to read data sources, perform data analysis or data enhancement tasks on source columns, and write the results to data targets. Plan designers save plans to the Data Quality repository and write saved plans from the Data Quality repository to the PowerCenter repository, where they appear as mapplets. When a plan enters the PowerCenter repository, PowerCenter stores the plan as a mapplet and converts each component in the data quality plan to a PowerCenter transformation. Workbench users can also save a data quality plan as XML to the file system for import to the PowerCenter repository. The result in both cases is the same: the plan appears as a mapplet in the PowerCenter repository. A session containing one or more data quality mapplets runs with the PowerCenter Integration Service and does not require a Data Quality engine. The Data Quality and PowerCenter repositories can reside on remote machines.
Note: Not all Workbench components are installed as transformations to PowerCenter. For more information about the Data Quality components that install to PowerCenter, see page 10.

Editable Properties in Data Quality Transformations


You can edit the port names and precision settings for a port on a Data Quality transformation. Do not edit any other aspect of the transformation, as this will invalidate the session in which the transformation is used.

Using Association and Consolidation Transformations


In addition to the Workbench transformations, the Integration plug-in adds the following transformations to PowerCenter:

- Association transformation. Allows you to create links between related records so that they are treated as members of a single set in data consolidation. The Association transformation allows you to link records that do not share a group ID but share other characteristics that make them candidates for consolidation. This transformation generates an association ID that you can use to link such records.
- Consolidation transformation. Allows you to create a single, consolidated record using field values from one or more records with a common association ID. You can create expressions to determine how fields in the consolidated record are defined.

Informatica provides the Association and Consolidation transformations to consolidate duplicate records identified by data quality plans, but you can use these transformations on data from any source in a mapplet or mapping.

Understanding the Association and Consolidation Transformations


A dataset can contain duplicate records: exact duplicates, as well as different versions of the same record. This can occur when data entry systems allow you to enter duplicate or inaccurate records, when different record sets are merged, or as records change over time. Duplicate records can create an inaccurate representation of data and adversely impact the functioning of your organization. Informatica Data Quality integrates with PowerCenter to provide an effective means of identifying duplicate records and reconciling or removing them from the dataset. The Association transformation enhances PowerCenter's ability to flag duplicate records and then to reconcile or remove them. Use the Association and Consolidation transformations to identify and link related records and to consolidate duplicate or overlapping records:

- Use the Association transformation to order related records into groups for processing and to generate an association ID for each group of associated records. Use this association ID in downstream transformations to enable operations on grouped records.
- Use the Consolidation transformation to create a single, consolidated record from a group of associated records.

You can use the Association and Consolidation transformations in the same mapping as a Data Quality Integration transformation. You can also use the transformations in any PowerCenter mapping, independent of each other and the Data Quality Integration transformation. The Association and Consolidation transformations operate independently of Data Quality and do not require that their inputs originate from data quality plans.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web service workflows with the Association or Consolidation transformations.
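To make the division of labor between the two transformations concrete, the following sketch illustrates the kind of logic they apply. This is a hypothetical Python illustration, not Informatica code or the actual transformation implementation: records that share a value in any linking column receive a common association ID, and each associated group is then consolidated into one record by keeping the most complete value per field. The record layout, column names, and the survivorship rule (longest value wins) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: duplicates share a name OR a phone number,
# but no single group ID links all three rows.
records = [
    {"id": 1, "name": "John Smith", "phone": "555-0100", "city": "Dublin"},
    {"id": 2, "name": "John Smith", "phone": "",         "city": ""},
    {"id": 3, "name": "J. Smith",   "phone": "555-0100", "city": "Dublin"},
]

def associate(records, keys):
    """Assign a shared association ID to records linked by any key value
    (a union-find over the linking columns)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    seen = {}  # (key column, value) -> first record id carrying it
    for r in records:
        for k in keys:
            v = r[k]
            if not v:
                continue  # empty fields never link records
            if (k, v) in seen:
                union(r["id"], seen[(k, v)])
            else:
                seen[(k, v)] = r["id"]
    return {r["id"]: find(r["id"]) for r in records}

def consolidate(records, assoc):
    """Build one consolidated record per association group, keeping the
    longest (most complete) value seen for each field."""
    groups = defaultdict(list)
    for r in records:
        groups[assoc[r["id"]]].append(r)
    out = []
    for gid, members in groups.items():
        merged = {"association_id": gid}
        for field in ("name", "phone", "city"):
            merged[field] = max((m[field] for m in members), key=len)
        out.append(merged)
    return out

assoc = associate(records, keys=("name", "phone"))
survivors = consolidate(records, assoc)
print(survivors)  # all three records collapse into one consolidated row
```

A real Consolidation transformation lets you define an expression per output field; the longest-value rule above simply stands in for one such survivorship expression.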

Using the Data Quality Integration Transformation


Note: The Data Quality Integration transformation is deprecated.

It provided a means for PowerCenter users to read the Data Quality repository and save data quality plan information into the PowerCenter repository as metadata extensions to the transformation. When you ran a session containing this transformation, PowerCenter loaded an instance of the Data Quality engine to process the plan information. This capability was introduced in an earlier version of the transformation. The current Integration components support any Data Quality Integration transformations saved in your repository, but you can no longer use the Data Quality Integration transformation to read plans from the Data Quality repository. The current Data Quality-PowerCenter integration model does not require a Data Quality engine to run a plan from PowerCenter, and Informatica recommends that you re-save any plans embedded in a Data Quality Integration transformation as a mapplet in the PowerCenter repository.

The Data Quality Integration transformation provides a Convert to DQ Mapplet option to facilitate the process of saving a data quality plan that has been embedded in a Data Quality Integration transformation as a mapplet. Right-click on the transformation in Mapping Designer to access this option.
Note: Do not convert a Data Quality Integration transformation to a mapplet if the transformation is currently saved within a mapplet. PowerCenter does not support the presence of mapplets within mapplets. If you wish to run a session containing a Data Quality Integration transformation in which a data quality plan is embedded, verify that an instance of the Data Quality engine is installed locally to the Integration Service that runs the session.

Deploying Data Quality Plans in PowerCenter


Run a data quality plan in PowerCenter by adding it to a mapping and including the mapping as part of a session task. Add a plan to a mapping in one of the following ways:

- Use Data Quality Workbench to export the plan as a mapplet to the PowerCenter repository, and add the mapplet to a mapping.
- Use Data Quality Workbench to export the plan as an XML file and import this file to the PowerCenter repository. PowerCenter imports this file as a mapplet. Add the mapplet to a mapping.
- Locate a Data Quality Integration transformation that contains a data quality plan, and use this transformation in a mapping. You cannot create a Data Quality Integration transformation and add a plan to it.

The PowerCenter Integration Service runs the session containing a data quality mapplet wholly within the PowerCenter engine. When PowerCenter runs a session containing a Data Quality Integration transformation, it calls an instance of the Data Quality engine to process the plan information embedded in the transformation.

In the latter case, PowerCenter requires an instance of the Data Quality engine on the same machine as the PowerCenter Integration Service that runs the session. Data quality plan information is stored as XML in the PowerCenter repository. Plan information added through the Data Quality Integration transformation is stored with the transformation XML at the time the transformation is saved. Plans that you export from Data Quality Workbench are stored as mapplets, and no further PowerCenter steps are necessary to save the mapplet.
Note: Ensure that resources required by Data Quality mapplets are present on the machine running the PowerCenter Integration Service. Examples of resources used in Data Quality mapplets are dictionary files, database dictionaries, and address validation files. For more information about writing plans to the PowerCenter repository, see the Informatica Data Quality User Guide.

Adding Address Validation Functionality to PowerCenter


Data Quality Workbench users can create plans that validate and enhance the accuracy and completeness of postal address records using reference data. Address validation plans compare source addresses against comprehensive address reference datasets that are approved by national postal carriers such as the USPS and the Royal Mail. You can run these plans in PowerCenter. Informatica can provide you with the necessary address reference datasets. Informatica sources address reference datasets from third-party vendors, and each vendor also provides a processing engine for its data. You purchase address reference data for your countries of choice on a subscription basis, and Informatica provides updates to the reference data on a monthly or quarterly basis. Install the reference data through the Data Quality Content Installer, an executable fileset included with Data Quality. The Content Installer also installs a set of text-based reference dictionary files.
Note: The Integration plug-in installer installs the third-party address reference data engines.

Adding Identity Matching Functionality to PowerCenter


In data quality terminology, an identity is a set of data values within a record that collectively provide enough information to identify an individual or entity. The Data Quality Integration installs sources, targets, and matching transformations: components that can analyze identity information across record values and return likely duplicates. You can create identity matching plans in Data Quality Workbench and copy them to the PowerCenter repository in the same way as other plans. To perform duplicate analysis on identities, you must first generate an index of key values that represent the identity permutations possible within your source data. For example, the input name John Smith could have at least four identities in the index:
John Smith
J Smith
Smith, John
Smith, J

Identity matching transformations use the index as a reference dataset when searching for duplicate identities in the source data. For information on identity transformations, see the Informatica Data Quality User Guide.
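As a rough illustration of how such an index behaves, the following Python sketch generates the four permutations listed above and maps each key back to the records that produced it. This is a simplified, hypothetical example, not the key-building algorithm shipped in the Informatica population files, which also account for phonetic variants, nicknames, and locale-specific conventions.

```python
def identity_keys(full_name):
    """Generate simplified identity-key permutations for a personal name.
    Illustrative only: assumes a plain 'First Last' input."""
    first, last = full_name.split()
    return {
        f"{first} {last}",     # John Smith
        f"{first[0]} {last}",  # J Smith
        f"{last}, {first}",    # Smith, John
        f"{last}, {first[0]}", # Smith, J
    }

def build_index(records):
    """Map every key permutation back to the record ids that produced it."""
    index = {}
    for rec_id, name in records:
        for key in identity_keys(name):
            index.setdefault(key, set()).add(rec_id)
    return index

records = [(101, "John Smith"), (102, "J Smith")]
index = build_index(records)
# Both records generate the key "J Smith", so a lookup on that key
# returns both ids, flagging them as candidate duplicates even though
# the raw name strings differ.
print(index["J Smith"])
```

A lookup on any shared key returns multiple record IDs; those records then go forward for full identity matching.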
Note: When you run a mapping containing an identity match mapplet in a session, you must ensure that your session properties reference the location of the index file. How you do so depends on your PowerCenter version. For more information on setting the location of the key index, see page 41.

Population Files
Before you can use identity transformations, you must install population files. A population file contains key-building algorithms, search strategies, and matching schemas that enable duplicate analysis of identity information. Population files can allow for multiple languages and character sets within the source data. Informatica provides proprietary population files for use in Data Quality and PowerCenter. Before you begin, ensure you have a suitable population file for your source data installed on your computer. Use the Data Quality Content Installer to install your population files.



CHAPTER 3

Working with Plans in the PowerCenter Repository


This chapter includes the following topics:

Overview, 9
Running Plans as Mapplets, 10
Running Plans with the Data Quality Integration Transformation, 12
Data Quality Integration Transformation Properties, 13

Overview
Integrate your data quality plans with PowerCenter mappings and sessions in one of the following ways:
- Export the plan from the Data Quality repository to the PowerCenter repository. Use Data Quality Workbench to export a plan from the Data Quality repository directly to the PowerCenter repository. The Workbench components used to create the plan appear in PowerCenter as transformations, and the plan is saved to the PowerCenter repository as a mapplet. Create a mapping that includes this mapplet, and add the mapping to a session task.
- Export the plan to an XML file from the Data Quality repository, and import the XML file to the PowerCenter repository. Use this method when you cannot connect to the required PowerCenter repository from Data Quality Workbench. Use Data Quality Workbench to export the plan as an XML file to the file system, and use PowerCenter Repository Manager to import the plan to the repository. The Workbench components used to create the plan appear in PowerCenter as transformations, and the plan is saved to the PowerCenter repository as a mapplet. Create a mapping that includes this mapplet, and add the mapping to a session task.
- Use a plan embedded in a Data Quality Integration transformation. Your PowerCenter repository may contain one or more Data Quality Integration transformations in which a plan is embedded. Informatica recommends converting the embedded plan to a data quality mapplet. Create a mapping that includes this mapplet, or add it to the mapping that contained the original transformation, and add the mapping to a session task.
Note: You cannot use the Data Quality Integration transformation to connect to the Data Quality repository.

Running Plans as Mapplets


A data quality mapplet is similar to mapplets created in PowerCenter Mapping Designer. Follow the same procedure when adding either type of mapplet to a mapping. Figure 3-1 shows a data quality mapplet in a PowerCenter mapping:
Figure 3-1. PowerCenter Mapping Containing a Data Quality Mapplet

A data quality mapplet contains the operational components configured for the plan in Data Quality Workbench. These components appear onscreen as transformations. The Integration plug-in installs these transformations to PowerCenter. You cannot edit the properties of these transformations in PowerCenter. Figure 3-2 shows the transformations in this mapplet:
Figure 3-2. PowerCenter Mapplet Containing Data Quality Components

Add data quality mapplets to the PowerCenter repository through Data Quality Workbench. For information about exporting data quality plans as mapplets from the Data Quality repository to the PowerCenter repository, see the Informatica Data Quality User Guide. For information about importing XML files to the PowerCenter repository and for information about adding a mapplet to a mapping, see your PowerCenter Designer online help.

Data Quality Components


The Data Quality Integration plug-in adds data quality transformations to PowerCenter that mimic the functionality of the Workbench components used to build the plan. When you open a data quality mapplet in Mapplet Designer, each component in the plan appears onscreen as a transformation. The plan source and target components are represented as mapplet input and output components.
Note: When working with data quality transformations, you can modify the port names and precision settings for each port. Do not edit any other aspect of the transformation, as doing so will invalidate the session in which the transformation is used. Not all Workbench components are added as transformations in PowerCenter, and not all Workbench sources and targets convert to mapplet inputs and outputs.


Table 3-1 lists the data quality plan components that are usable in PowerCenter. If your data quality plan contains components not listed in this table, the plan will not function in PowerCenter.
Table 3-1. Data Quality Components Usable In PowerCenter

Bigram. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings based on pairs of consecutive characters that are common to both strings.
Context Parser. Parses free-text fields containing multiple tokens into multiple single-token fields.
Character Labeller. Creates a character-by-character profile of data values in a data field.
CSV Identity Group Source. File-based source in an identity matching plan. Performs identity matching on CSV sources using keys created by the Identity Group Target. Converts to an Identity Match Pair Generator. An Identity Match Pair Generator configures pairs of data values that will be subjected to match analysis in an identity data matching operation.
CSV Identity Match Target. File-based target in an identity matching plan. Converts to an Identity Match Identifier. A Match Identifier appends the match score and match cluster information calculated by matching components to each output record at the end of the matching process.
CSV Match Source. File-based source in a matching plan. Converts to a Match Pair Generator. A Match Pair Generator configures pairs of data values that will be subjected to match analysis in a data matching operation.
CSV Match Target. File-based target in a matching plan. Converts to a Match Identifier. A Match Identifier appends the match score and match cluster information calculated by matching components to each output record at the end of the matching process.
CSV Source. File-based source in a non-matching plan. Converts to a mapplet input.
CSV Target. File-based target in a non-matching plan. Converts to a mapplet output.
DB Identity Group Source. Database source in an identity matching plan. Performs identity matching on database sources using keys created by the Identity Group Target. Converts to an Identity Match Pair Generator. A Match Pair Generator configures pairs of data values that will be subjected to match analysis in a data matching operation.
DB Source. Database source in a non-matching plan. Converts to a mapplet input.
DB Target. Database target in a non-matching plan. Converts to a mapplet output.
Edit Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the minimum cost of transforming one string into another by the insertion, deletion, and replacement of characters.
Global AV. Global address validation component. Enables Data Quality to evaluate input addresses against address reference data with third-party validation engines.
Hamming Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the number of positions in which characters differ between them.
Identity Group Target. Generates keys for groups of input data for use by the CSV Identity Group Source and the DB Identity Group Source. Converts to an Identity Key Store.
Identity Match. Identifies similar or duplicate strings at identity level. An identity is a set of fields providing name and address information for a person or organization.
Jaro Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the minimum cost of transforming one string into another by the insertion, deletion, and replacement of characters. Reduces this score if the two strings do not share a common prefix.
Merge. Combines the data values from multiple input fields to form a single output field.
Mixed Field Matcher. Compares multiple fields against one another in match calculations.
Normalization. Shell component combining the functionality of third-party data normalization or standardization engines.
NYSIIS. Converts the values of an input field into their phonetic equivalent. Useful in match key generation.
Profile Standardizer. Parses the output data from a Token Labeller into a number of output fields based on a user-defined data structure.
Realtime Source. Source component that feeds data to a plan in real time. Converts to a mapplet input.
Realtime Target. Target component that writes plan output data in real time. Converts to a mapplet output.
Rule Based Analyzer. Applies user-defined business rules to input data.
Scripting. Applies business rules written in TCL (Tool Command Language) to input data.
Similarity. Shell component combining the functionality of third-party components that calculate the levels of similarity between strings.
Soundex. Assigns a value to a string based on the phonetic characteristics of the initial characters in the string. Useful in match key generation.
Splitter. Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets.
Token Labeller. Analyzes the format of data values within a field and categorizes each value according to a list of standard or user-defined tokens.
Token Parser. Parses free-text fields containing multiple tokens into multiple single-token fields.
ToUpper. Changes the case (upper or lower) of characters in a string.
Weight Based Analyzer. Reads the outputs from matching components and calculates a single, overall match score for their matching operations.
Word Manager. Applies reference dictionaries to strings to determine and improve their accuracy or usability.
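The Bigram, Hamming Distance, and Edit Distance components in Table 3-1 are based on standard string-comparison measures. The following sketch illustrates those textbook measures only; it is not Informatica's implementation, and the exact score normalization used by the components may differ.

```python
def bigram_score(a, b):
    """Dice coefficient over pairs of consecutive characters (bigram idea)."""
    pairs = lambda s: [s[i:i + 2] for i in range(len(s) - 1)]
    pa, pb = pairs(a), pairs(b)
    if not pa and not pb:
        return 1.0
    shared, remaining = 0, list(pb)
    for p in pa:                      # multiset intersection of bigrams
        if p in remaining:
            remaining.remove(p)
            shared += 1
    return 2.0 * shared / (len(pa) + len(pb))

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # replacement
        prev = cur
    return prev[-1]
```

For example, "night" and "nacht" share one bigram ("ht") out of eight, giving a bigram score of 0.25, while the classic edit distance between "kitten" and "sitting" is 3.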

Running Plans with the Data Quality Integration Transformation


Your repository may contain instances of the Data Quality Integration transformation in which data quality plan information has been embedded. This transformation is represented iconically as an upper case Q.
Note: This transformation is deprecated and is no longer installed. The current Integration installation supports mappings that contain these transformations, but you cannot create a new instance of this transformation or edit an instance that exists in your repository. Informatica recommends converting a plan saved with a Data Quality Integration transformation to a data quality mapplet.
Tip: The name of the deprecated transformation is the Data Quality Integration transformation. The name of the installer that adds Informatica Data Quality components and capabilities to PowerCenter is the Integration installer. Take care not to confuse these names.
PowerCenter does not need to communicate with a Data Quality engine or Data Quality repository to run a plan embedded in a Data Quality Integration transformation. However, you cannot reconnect to the Data Quality repository to change or refresh the plan.

Figure 3-3 shows a Data Quality Integration transformation in iconic form in a mapping:
Figure 3-3. Mapping Containing a Data Quality Integration Transformation

A Data Quality Integration transformation contains a single data quality plan. You can add multiple Data Quality Integration transformations to a mapping. A session based on a single mapping that contains a series of Data Quality Integration transformations runs faster than a series of sessions whose mappings each contain a single Data Quality Integration transformation.

Active and Passive Transformations


Data Quality Integration transformations can be active or passive. Once set, the transformation type cannot be changed. In passive transformations, the number and order of output data records must match the number and order of input data records. In active transformations, the number and order of output data records can differ from the number and order of input data records. Use active transformations in data matching operations.

Data Quality Integration Transformation Properties


To view the configuration of the Data Quality Integration transformation, double-click its icon on the PowerCenter workspace or right-click its title bar and select Edit from the shortcut menu. This opens the Edit Transformations dialog box. The settings on the Configurations tab are unique to the Data Quality Integration transformation. This tab allows you to view details of the Data Quality repository from which the plan was loaded to PowerCenter and the ports configured for the plan. You cannot edit the settings on this tab. Figure 3-4 shows the Configurations tab settings:
Figure 3-4. Edit Transformation Dialog Box, Configurations Tab


For information about the other tabs on this dialog box, consult PowerCenter Designer online help. Table 3-2 describes the options on this tab:
Table 3-2. Configurations Tab Options List
Plan Name. Identifies the plan to be added to the transformation.
Grouping Port. Buffers data on the selected field before sending it to the Data Quality engine. Used in matching plans.
Plan Location. Lists the location of the Data Quality repository from which PowerCenter read the plan and the original path to the plan within that repository.
Status. Describes the last connection state between PowerCenter and Data Quality.
I/O Ports. Lists any pass-through ports added to the transformation. These ports enable data to pass through the transformation unchanged. They are not included in the input and output ports created by the data quality plan and are added in PowerCenter.
Include Pass Through Ports. Activates pass-through ports on the transformation.


CHAPTER 4

Association Transformations
This chapter includes the following topics:

Overview, 15
Association Transformation Properties, 15
Creating an Association Transformation, 16

Overview
In data quality, association is an extension of the data matching process and a precursor of the data consolidation process. The Association transformation creates links between records that share duplicate characteristics across more than one data field. The transformation generates an association ID value for each row in a group of associated records and writes the association ID values as a new output port. Use a Consolidation transformation to create a master record based on the records with common association ID values.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web service workflows with the Association transformation.

Association Transformation Properties


To view the configuration of the Association transformation, double-click its icon on the PowerCenter workspace or right-click its title bar and select Edit from the shortcut menu. This opens the Edit Transformations dialog box. Table 4-1 describes the tabs available:
Table 4-1. Edit Transformations Dialog Box Tabs

Transformation. Lists the name and type of the transformation. Includes a Description text field.
Ports. Lists the input and output ports configured on the transformation. This tab also lists Datatype, Precision, and Scale values for each port.
Properties. Lists the properties of the transformation. This tab provides names and values for several transformation attributes.
Initialization Properties. Allows you to enter external procedure initialization properties for the transformation.
Metadata Extensions. Allows you to extend the metadata stored in the repository by associating information with individual repository objects.
Port Attribute Definitions. Allows you to create port attributes for the transformation.
Association Ports. Allows you to select the ports to which PowerCenter will add association IDs.

The Association Ports tab is unique to this transformation. When you configure ports as association ports, PowerCenter creates a common association ID for all records with common values in either port. Configure at least two association ports for each Association transformation. The transformation creates a new AssociationID port that contains the ID values. You can configure any input/output port as an association port. To associate data from Data Quality matching plans, configure the cluster ID ports as association ports. The port data is of type Integer(10).
Note: In addition to the association ID port, you can create input/output ports to pass related data to downstream transformations. Create and configure all Association transformation ports on the Association Ports tab.

Example
The following data fragment includes cluster IDs generated by data quality plans that matched last names and addresses:
First_Name  Last_Name  Address           ClusterID_LastName  ClusterID_Address
John        Smith      11, Bridge ST     9                   14
Mary        Anne       345, tracy blvd   10                  15
Stuart      Peterson   1 Main street     11                  16
Kevin       Smith      11 Bridge street  9                   14
Paul        Smith      11 Bridge st      9                   14

When you route data to an Association transformation and configure each ClusterID port as an association port, the Association transformation evaluates cluster IDs and generates a single association ID for associated rows:
First_Name  Last_Name  Address           ClusterID_LastName  ClusterID_Address  AssociationID
John        Smith      11, Bridge ST     9                   14                 1
Kevin       Smith      11 Bridge street  9                   14                 1
Paul        Smith      11 Bridge st      9                   14                 1
Mary        Anne       345, tracy blvd   10                  15                 2
Stuart      Peterson   1 Main street     11                  16                 3
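The grouping behavior in this example can be sketched as a connected-components problem: rows that share a value in any association port are linked into one group, and each group receives one ID. This is a minimal illustrative sketch of the behavior described above, not the transformation's actual implementation; the field names follow the example data.

```python
def associate(rows, ports):
    """Assign a common association ID to rows linked through any port value."""
    parent = list(range(len(rows)))          # union-find over row indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_row = {}
    for i, row in enumerate(rows):
        for port in ports:
            key = (port, row[port])
            if key in first_row:
                union(i, first_row[key])     # shared value links the rows
            else:
                first_row[key] = i

    ids, assoc = {}, []
    for i in range(len(rows)):
        root = find(i)
        ids.setdefault(root, len(ids) + 1)   # number groups in row order
        assoc.append(ids[root])
    return assoc

rows = [
    {"Name": "John",   "ClusterID_LastName": 9,  "ClusterID_Address": 14},
    {"Name": "Mary",   "ClusterID_LastName": 10, "ClusterID_Address": 15},
    {"Name": "Stuart", "ClusterID_LastName": 11, "ClusterID_Address": 16},
    {"Name": "Kevin",  "ClusterID_LastName": 9,  "ClusterID_Address": 14},
    {"Name": "Paul",   "ClusterID_LastName": 9,  "ClusterID_Address": 14},
]
```

Running `associate(rows, ["ClusterID_LastName", "ClusterID_Address"])` groups John, Kevin, and Paul together and leaves Mary and Stuart in their own groups, matching the example output.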

Creating an Association Transformation


You can create an Association transformation in the Transformation Developer or the Mapping Designer.


To create an Association transformation:

1. In the Mapping Designer, click Transformation > Create. Select Association transformation and enter the name of the transformation. The naming convention for Association transformations is AT_TransformationName. Click Create, and then click Done.
2. Select and drag ports from an upstream transformation to the Association transformation. Copies of these ports appear as input/output ports in the Association transformation. Or, in the Association transformation properties, click the Association Ports tab and create each port manually.
Note: To make this transformation reusable, you must create each port manually within the transformation.
3. On the Association Ports tab, select Associate to define a port as an association port. Select two or more association ports.
4. Click OK.


CHAPTER 5

Consolidation Transformations
This chapter includes the following topics:

Overview, 19
Consolidation Transformation Ports, 20
Consolidation Expressions, 21
Consolidation Functions, 21
Creating a Consolidation Transformation, 22

Overview
Use the Consolidation transformation to create a single, consolidated record from a group of associated records. When you use a Consolidation transformation, you configure the following components:

Group by port. The port used to group related records. The Consolidation transformation generates one record for each group.
Consolidation expressions. The expressions used to define each field of the consolidated record. You can use consolidation functions as well as other standard PowerCenter functions to define expressions.

Use the Consolidation transformation with the Association transformation to consolidate duplicate records from Data Quality matching plans. When used with an Association transformation, configure the Consolidation transformation to group records by association ID.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web service workflows with the Consolidation transformation.

Sorting Input Data


The Consolidation transformation requires input data sorted by the group by port. The Consolidation transformation generates a single record for each new value that appears in the group by port. For example, when the Consolidation transformation is configured to group by State, it inaccurately generates two rows for Maryland for the following data because it is not sorted by State:
State (group by port)  County        Stores
MD                     Montgomery    10
MD                     Frederick     6
VA                     Fairfax       15
VA                     Loudoun       3
MD                     Anne Arundel  4

Association transformations provide data sorted by association ID. When you use the Consolidation transformation with an Association transformation and configure AssociationID as the group by port, you do not need to perform additional sorting on input data. Otherwise, you can use a Sorter transformation to sort data.
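The sorting requirement exists because the transformation groups consecutive records only. Python's `itertools.groupby` has the same contract, so a short sketch shows why the unsorted sample data above yields two Maryland groups:

```python
from itertools import groupby

# Group by State, as in the example above. groupby only merges
# *consecutive* equal keys, mirroring the transformation's behavior.
states = ["MD", "MD", "VA", "VA", "MD"]          # unsorted: MD reappears

unsorted_groups = [key for key, _ in groupby(states)]
sorted_groups = [key for key, _ in groupby(sorted(states))]
```

On the unsorted input the group keys come out as MD, VA, MD (two Maryland groups); after sorting they come out as MD, VA (one group per state).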

Default Values
The Consolidation transformation uses the fields in the first record in a group as default data for the consolidated record. You can configure consolidation expressions to provide specific results for the consolidated record. When you do not enter consolidation expressions or when consolidation expressions do not generate a result, the field defaults to the value in the first record of the group. For example, if you do not configure any consolidation expressions for the sample data above, the consolidated record for Maryland is MD, Montgomery, 10.

Consolidation Transformation Ports


Create and configure Consolidation transformation ports on the Consolidation Ports tab in the transformation properties. In addition to any ports you create, the Consolidation transformation includes the following ports:

Group by port
IsConsolidatedRecord

Group By Port
In a Consolidation transformation, the group by port determines how records are grouped for consolidation. When you create a Consolidation transformation, configure at least one group by port. For each group of consecutive records that has the same value in the group by port, the Consolidation transformation generates a consolidated record. You can select more than one group by port to create a composite group identifier. In that case, the Consolidation transformation creates a consolidated record for each composite group identifier. You can configure any input/output port as a group by port. However, when you use a Consolidation transformation with an Association transformation, use the association ID as the group by port. The Association transformation provides data that is already associated and sorted by association ID. Configure the group by port using the GroupBy option on the Consolidation Ports tab in the transformation properties. Data in the group by port should be sorted to ensure expected results. For more information, see Sorting Input Data on page 19.

IsConsolidatedRecord
The Consolidation transformation provides an IsConsolidatedRecord output port to indicate if a record was consolidated from a group of records.


Table 5-1 describes the flags that are used in the IsConsolidatedRecord field:
Table 5-1. Consolidation Flags
Flag 1. The record was consolidated. It represents a group of records that share the same group by value.
Flag 0. The record was not consolidated. It represents a group consisting of a single record.
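The consolidation pass described above can be sketched as follows: one output record per group of consecutive records sharing the group by value, with the first record supplying default field values and IsConsolidatedRecord flagging whether the group held more than one record. This is an illustrative sketch, not the transformation's implementation.

```python
from itertools import groupby

def consolidate(records, group_by):
    """One consolidated record per run of equal group-by values (sorted input)."""
    out = []
    for _, group in groupby(records, key=lambda r: r[group_by]):
        group = list(group)
        master = dict(group[0])                  # defaults from first record
        master["IsConsolidatedRecord"] = 1 if len(group) > 1 else 0
        out.append(master)
    return out

# Sorted sample input: two MD records, one VA record.
records = [
    {"State": "MD", "County": "Montgomery", "Stores": 10},
    {"State": "MD", "County": "Frederick", "Stores": 6},
    {"State": "VA", "County": "Fairfax", "Stores": 15},
]
result = consolidate(records, "State")
```

The MD group collapses to the Montgomery record with flag 1, and the single-record VA group passes through with flag 0.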

Consolidation Expressions
You can create expressions for any input/output port in the Consolidation transformation except group by ports. The expression determines how the port is defined for the consolidated record. For example, you can configure an expression to return the most frequently appearing value in the port for each group. If you do not enter an expression for a port, the Consolidation transformation uses the default value for the port. For more information, see Default Values on page 20. Use the Expression Editor to create and validate expressions. You can use any valid PowerCenter function or variable in expressions. You can also use the consolidation functions installed with the Consolidation transformation.

Consolidation Functions
The Consolidation transformation installation includes a set of new consolidation functions:

Store. Stores the value of the port or of an expression related to the current record.
Stored. Returns the stored value of the port.
Most_Frequent. Returns the most frequently occurring value for the port within a group, including blank and null values.
Most_Frequent_NonBlank. Returns the most frequently occurring value for the port within a group, excluding blank and null values.

For information about other PowerCenter functions, see the Transformation Language Reference.

Store
Store uses the following syntax:
STORE(port)

or
STORE(port, expression)

Store(port) stores the value of the port as the candidate value for the consolidated record. Store(port, expression) stores the value of the expression as the candidate value for the consolidated record. The expression must return a value of the same datatype as the port. Use only the port that you are configuring with Store. Do not use Store to generate string literals.


Stored
Stored uses the following syntax:
STORED(port)

Stored returns the stored candidate value of the port. When using Stored, include Store in the same expression to store candidate values. If you use Stored without Store, the Integration Service returns null values. For example, to store the most recent date in a Date port, you might use the following expression:
IIF (ISNULL (STORED(Date)), STORE(Date), IIF (Date > STORED(Date), STORE(Date)))

The return value must be the same datatype as the port you are configuring. Do not use Stored to generate string literals.
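The Store/Stored pair maintains a running candidate value across the records of a group. The "most recent date" expression above behaves like the following loop, shown here only as a sketch of the mechanism:

```python
def latest(dates):
    """Keep a running candidate, mirroring the STORE/STORED date expression."""
    stored = None                      # STORED(Date) is null before any STORE
    for date in dates:
        if stored is None or date > stored:
            stored = date              # STORE(Date): new candidate value
    return stored

# ISO-format date strings compare correctly as plain strings.
group_dates = ["2008-01-15", "2009-01-10", "2008-06-30"]
```

Each record either replaces the candidate or leaves it unchanged, and the final candidate becomes the value in the consolidated record.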

Most_Frequent
Most_Frequent function uses the following syntax:
MOST_FREQUENT(port)

Most_Frequent evaluates all values within a group and returns the most frequently occurring value in a group, including null and blank values. The return value must be the same datatype as the port you are configuring. Do not use aggregate functions with Most_Frequent. Do not use Most_Frequent to generate string literals.

Most_Frequent_NonBlank
Most_Frequent_NonBlank uses the following syntax:
MOST_FREQUENT_NONBLANK(port)

Most_Frequent_NonBlank ignores blank and null values when it evaluates values within a group. It returns the most frequently occurring value in a group, excluding null and blank values. The return value must be the same datatype as the port you are configuring. Do not use aggregate functions with Most_Frequent_NonBlank. Do not use Most_Frequent_NonBlank to generate string literals.
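The difference between the two functions is only whether blank and null values count as candidates. A minimal sketch of that semantics, not Informatica's implementation:

```python
from collections import Counter

def most_frequent(values):
    """Most common value in the group, counting blanks and nulls."""
    return Counter(values).most_common(1)[0][0]

def most_frequent_nonblank(values):
    """Most common value in the group, ignoring blanks and nulls."""
    return most_frequent([v for v in values if v not in ("", None)])

group_values = ["", "Smith", "", "Smyth", "Smith", ""]
```

For this group, Most_Frequent would return the blank value (three occurrences), while Most_Frequent_NonBlank returns "Smith".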

Creating a Consolidation Transformation


You can create a Consolidation transformation in the Transformation Developer or the Mapping Designer.
To create a Consolidation transformation:

1. In the Mapping Designer, click Transformation > Create. Select Consolidation transformation and enter the name of the transformation. The naming convention for Consolidation transformations is CON_TransformationName. Click Create, and then click Done.
2. Select and drag ports from an upstream transformation to the Consolidation transformation. Copies of these ports appear as input/output ports in the Consolidation transformation. Or, in the Consolidation transformation properties, click the Consolidation Ports tab and create each port manually.
Note: To make this transformation reusable, you must create each port manually within the transformation.
3. On the Consolidation Ports tab, select GroupBy to define a group by port. Define at least one group by port.
4. To enter an expression for a port, click the button in the Expression field and use the Expression Editor. To prevent typographic errors, use the listed port names and functions when possible.
5. Click Validate to validate the expression. To close the Expression Editor, click OK.
6. For each port that requires an expression, repeat steps 4 to 5.
7. Click OK.


CHAPTER 6

Creating Mappings for Data Quality Plans


This chapter includes the following topics:

Overview, 25
Creating a Mapping to Cleanse, Parse, or Validate Data, 27
Creating a Mapping to Match Data from a Single Source, 28
Creating a Mapping to Match Data from Two Data Sources, 28
Creating a Mapping to Match Identity Information, 29

Overview
This chapter describes how to define mappings that add the data quality plans you exported or imported to the PowerCenter repository into PowerCenter processes. The mapping descriptions demonstrate how data quality transformations interact with standard PowerCenter transformations and what dependencies may apply. How you define a mapping depends on the type of data quality plan you use.
Note: The descriptions illustrate the ways that mappings can be configured for different types of plan, but they do not represent the only ways that such mappings can be configured.

Creating Mappings for Data Cleansing, Parsing, and Validation


For data cleansing, parsing, or validation, define a mapping with a qualified data source, a data quality mapplet, and a data target. If your plan contains an address validation component, ensure that the address validation engine and reference data that the plan requires are installed on the PowerCenter Integration Service machine.

Creating Mappings for Data Matching


A mapping that supports a data matching plan needs additional PowerCenter transformations to sort the data and optionally to enable matching on two data sources. For information on mappings designed for identity matching, see page 26.


Add a Sorter transformation upstream of the data quality mapplet or Data Quality Integration transformation to sort the data on a key field. Optionally, add a Sequence Generator transformation if your data lacks unique IDs. The Sorter transformation performs the same task as a pre-match grouping plan in Data Quality. Grouping plans are not necessary in PowerCenter. The Grouping port must be set on any Data Quality Integration transformation that contains a matching plan.

Plan designers can create both single-source and dual-source matching plans in Data Quality Workbench. PowerCenter runs single-source matching plans only. Therefore, you must combine two data sources into a single stream to match across them in PowerCenter. Use a Union transformation to combine data from two data sources.

Creating Mappings for Identity Matching


In data quality terminology, an identity is a set of values, read from multiple fields in a record, that together provide enough information to identify an individual or entity. Identity match processes read a source dataset, create an index of key data values for each record in the dataset, and match the source records against the possible identity permutations for the records in the index file. The complete set of processes requires two plans, or two mapplets: one to create the index, and another to perform the match analysis and identify the likely duplicates. Table 6-1 lists the types of operation and the components or transformations that perform them:
Table 6-1. Identity Matching Operations In Data Quality And PowerCenter

Operation: Read key values from a set of input records, and create an index file containing possible identities based on multiple combinations of the key values in each record.
Plan/Mapplet Type: Key Generation. Data Quality Component: Identity Group Target. PowerCenter Transformation: Identity Key Store.

Operation: Collate source records in preparation for identity matching so that records are matched properly against the index entries.
Plan/Mapplet Type: Identity Match. Data Quality Component: CSV Identity Group Source or DB Identity Group Source. PowerCenter Transformation: Identity Match Pair Generator.

Operation: Generate match scores for each record-index entry comparison.
Plan/Mapplet Type: Identity Match. Data Quality Component: Identity Match or other data quality match component. PowerCenter Transformation: Identity Match.

Operation: Analyze the input records, index entries, and match scores to create clusters of matching identities.
Plan/Mapplet Type: Identity Match. Data Quality Component: CSV Identity Match Target. PowerCenter Transformation: Identity Match Identifier.

The Integration installer installs the identity transformations to PowerCenter. For more information on identity matching in Data Quality, see the Informatica Data Quality User Guide.
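The two-stage flow in Table 6-1 can be sketched conceptually: a key generation pass builds an index of key values, and a match pass probes the index for candidate records before scoring them. The key strategy below (first three letters of the surname) is purely hypothetical and far simpler than the permutation-based keys a population file generates; the sketch shows only the index-then-probe structure.

```python
def build_index(records):
    """Key generation stage: map a key value to the record IDs that share it."""
    index = {}
    for rec_id, rec in records.items():
        key = rec["surname"][:3].upper()     # hypothetical, simplistic key
        index.setdefault(key, []).append(rec_id)
    return index

def candidates(index, record):
    """Match stage: probe the index for records worth scoring in detail."""
    return index.get(record["surname"][:3].upper(), [])

source = {
    1: {"surname": "Petersen", "city": "Dublin"},
    2: {"surname": "Peterson", "city": "Dublin"},
    3: {"surname": "Smith", "city": "Boston"},
}
index = build_index(source)
```

Probing with a "Petersen" record returns records 1 and 2 as candidates for detailed match scoring, while "Smith" returns only record 3; the index keeps every record from being compared against every other record.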


Creating a Mapping to Cleanse, Parse, or Validate Data


Figure 6-1 illustrates the simplest model for a data quality mapping. The mapplet may represent a data quality plan designed for data cleansing, standardization, parsing, or address validation. It could also contain an index key generation process for identity matching.
Figure 6-1. Mapping Defined for Standardizing, Parsing, and Validation

You can configure a mapping like this one with multiple data quality mapplets to conduct cleansing, standardization, parsing, or validation operations in sequence in a single mapping. If your repository contains a Data Quality Integration transformation, you can use that transformation in place of the mapplet.
To create a mapping that includes a single data quality mapplet or transformation:

1. Add a Source Definition to the Mapping Designer workspace and connect to your source data.
2. Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations.
3. Add the required data quality mapplet or transformation.
4. Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that anticipates name data.
5. Add a Target Definition and connect the mapplet output ports to it.

Expression Transformations in Cleansing, Parsing, and Validation Mapplets


When Data Quality exports a cleansing, parsing, or validation plan as a mapplet, or when it exports such a plan as an XML object ready for import to the PowerCenter repository, it typically adds at least one Expression transformation alongside the transformations created from the plan. An Expression transformation enables a data quality source or target to map its inputs or outputs to other transformations in the same manner as a PowerCenter source or target. For example, the Expression transformation enables a data quality source to map a given output field to an input field in multiple other transformations. It is possible to save a mapplet without an Expression transformation, but doing so risks losing some PowerCenter capabilities.

Data Quality adds an Expression transformation whenever it exports a non-matching source or target. If the plan contains a non-matching source, the source is followed immediately in the exported mapplet by an Expression transformation. If the plan contains a non-matching target, the target is preceded immediately in the exported mapplet by an Expression transformation. This is also the case for sources and targets in identity matching plans. Match sources and targets, including identity match sources and targets, do not require an Expression transformation for these purposes.

The Expression transformation does not apply any expressions to the data. It acts as a conduit between a source or target and other transformations in the mapplet.

Creating a Mapping to Cleanse, Parse, or Validate Data


Creating a Mapping to Match Data from a Single Source


This model describes a matching plan created for a single data source. A matching plan compares every row in the dataset with every other row to identify duplicates. In Data Quality operations, a matching plan is often preceded by a grouping plan. Grouping plans are not necessary in PowerCenter. Figure 6-2 shows a mapping set up for single-source matching.
Figure 6-2. Mapping Defined for Single-Source Matching

To create a mapping that matches data from a single source:
1. Add a Source Definition to the Mapping Designer workspace and connect it to your source data.
2. Add a Source Qualifier transformation. This transformation reads the data from the source file and enables the data to be read by other transformations in the mapping.
Note: If the input records lack unique identifiers, add a Sequence Generator transformation. This transformation generates a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records have unique IDs, you can omit this step.
3. Add a Sorter transformation. Set this transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name.
4. Add the required data quality mapplet. Connect all required ports.
5. In the data quality mapplet, verify that you have selected as the group key port the field you set as the Key column in the Sorter transformation.
6. Add a Target Definition and connect the mapplet output ports to it.

Creating a Mapping to Match Data from Two Data Sources


This model describes a mapping that contains a matching plan set up for two data sources. The mapping combines the two data sources into a single dataset in which the source records are flagged A and B to indicate their dataset of origin. The data quality mapplet matches every A record against every B record. The mapping includes a Sequence Generator that provides unique IDs for the input data rows.


Chapter 6: Creating Mappings for Data Quality Plans

Figure 6-3 illustrates this mapping:


Figure 6-3. Mapping Designed for Matching on Two Data Sources

To create a mapping that matches data from two sources:
1. Add two Source Definitions to the Mapping Designer workspace and connect them to your source data.
2. Add a Source Qualifier transformation for each Source Definition. These transformations read the data from the source files and enable the data to be read by other transformations in the mapping.
3. Add two Expression transformations. Use each Expression transformation to flag the data from one source as Source A and from the other as Source B. This facilitates matching across the sources.
4. Add a Union transformation. Use this transformation to combine the Source A and Source B data into a single dataset, as required by the matching plan.
5. Add a Sequence Generator transformation. This transformation generates a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records contain a field that constitutes a unique ID, you can omit this step.
6. Add a Sorter transformation. Set the Sorter transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name.
7. Add the required data quality mapplet. Connect all required ports.
8. In the data quality mapplet, verify that you have selected as the group key port the field you set as the Key column in the Sorter transformation.
9. Add a Target Definition and connect the mapplet output ports to it.
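The flag, union, and cross-matching logic described above can be sketched outside PowerCenter. The following Python sketch is illustrative only: the record values, the 85 percent threshold, and the `similarity` helper stand in for whatever matching algorithm the data quality plan actually uses.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Percentage similarity of b to a (0-100); a stand-in for a
    # Data Quality matching algorithm.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

source_a = [{"name": "John Smith"}, {"name": "Mary Jones"}]
source_b = [{"name": "Jon Smith"}, {"name": "Peter Hall"}]

# Flag each record with its dataset of origin (the Expression
# transformations), union the flagged records into one dataset (the
# Union transformation), and assign sequential unique IDs (the
# Sequence Generator transformation).
combined = []
for origin, rows in (("A", source_a), ("B", source_b)):
    for row in rows:
        combined.append({**row, "origin": origin})
for seq, row in enumerate(combined, start=1):
    row["id"] = seq

# Match every A record against every B record; A-A and B-B pairs
# are never compared.
matches = [
    (a["id"], b["id"], round(similarity(a["name"], b["name"])))
    for a in combined if a["origin"] == "A"
    for b in combined if b["origin"] == "B"
    if similarity(a["name"], b["name"]) >= 85
]
```

Here only the "John Smith"/"Jon Smith" pair clears the threshold, so `matches` contains a single A-to-B pair.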

Creating a Mapping to Match Identity Information


A complete identity matching process involves two plans: one to generate the key index and another to match the source dataset against the potential identities in that index. Export both plans to the PowerCenter repository to perform identity matching in PowerCenter. Create a mapping for each plan.
Note: Identity matching uses population files to create the index keys and to perform match analyses. A population file contains key-building algorithms, search strategies, and matching schemas that enable duplicate analysis of identity information. Informatica provides proprietary population files for use in Data Quality and PowerCenter. Before you begin, ensure you have a suitable population file for your source data installed on the PowerCenter Integration Service computer.


Population files install through the Data Quality Content Installer. For more information on installing population files, see the Informatica Data Quality Installation Guide.
To create a mapping that performs index key generation:
1. Add a Source Definition to the Mapping Designer workspace and connect it to your source data.
2. Add a Source Qualifier transformation. This transformation reads the data from the source file and enables the data to be read by other transformations.
3. Add the required data quality mapplet or transformation.
Note: The Identity Key Store transformation in your mapplet creates an index of key values and writes it to a folder on the system. You must ensure that any identity matching components that must read the index are able to do so. For information on setting the index directory, see page 41.
4. Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that expects name data.
5. Add a Target Definition and connect the mapplet output ports to it.

The high-level steps to create a mapping that matches a source dataset against the contents of the key index are almost identical.
Note: You can perform identity matching across two datasets by selecting a different source dataset in each plan. For this form of matching to succeed, both datasets must have the same structure and their respective columns must contain the same types of information.
To create a mapping that performs identity match analysis between a source dataset and a key index:
1. Add a Source Definition to the Mapping Designer workspace and connect it to your source data.
2. Add a Source Qualifier transformation. This transformation reads the data from the source file and enables the data to be read by other transformations.
3. Add the required data quality mapplet or transformation.
Note: The identity matching components within the mapplet used in this mapping must refer to the index key folder specified in the preceding mapping. For information on setting the index directory, see page 41.
4. Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that expects name data.
5. Add a Target Definition and connect the mapplet output ports to it.

Figure 6-4 shows the Mapping Designer view of a mapping created for identity match analysis above the Mapplet Designer view of the identity mapplet exported from Data Quality:
Figure 6-4. Identity Matching Mapping and Mapplet

In Figure 6-4, the mapplet contains the following transformations:



Mapplet Input
Identity Match Pair Generator
Edit Distance
Identity Match Identifier
Mapplet Output

Note: If a workflow containing an identity match analysis mapplet fails, you must delete the key index folder

that was read by the identity components in the mapplet. The workflow failure corrupts the index data. Recreate the index by running the workflow that contains the index key generation mapplet.



CHAPTER 7

Creating Mappings for Association and Consolidation


This chapter includes the following topics:

Overview, 33
Associating and Consolidating Data from a Single Source, 33
Associating and Consolidating Data from Multiple Data Sources, 34

Overview
Informatica provides the Association and Consolidation transformations to process records that have been identified as potential duplicates. The transformations are designed to process data from data quality matching plans, although they are not limited to such data. Use the Association and Consolidation transformations to link groups of matching records and to generate a consolidated master record for each group. You can create mappings that use the Association and Consolidation transformations with Data Quality data in the following general formats:

Single source mapping. A single pipeline that routes all data through more than one matching plan before passing to Association and Consolidation transformations. Create this type of mapping when you want to run different matching plans on different ports in the same source dataset.

Multiple source mappings. Multiple pipelines that merge data from different sources before passing through a matching plan to Association and Consolidation transformations.

To understand how matching plans run in PowerCenter mappings, read Chapter 6, Creating Mappings for Data Quality Plans, before you read this chapter.

Associating and Consolidating Data from a Single Source


When you create a mapping with a single source, you use multiple matching plans in the pipeline and route their outputs to the Association and Consolidation transformations. Use this model to run different matching plans on different ports in the same source data and create a single, consolidated record for each group of related records. Figure 7-1 shows an example of this type of mapping:
Figure 7-1. Associating and Consolidating Data from a Single Source

The mapping includes the following transformations:


Sequence Generator transformation. If the source data does not include a column of unique identifiers that can be used as a key column, you can use a Sequence Generator transformation to generate keys.

Data Quality Integration transformations or data quality mapplets. Use two Data Quality Integration transformations or two data quality mapplets that each contain a matching plan. For example, one plan may match surname data and the other may match city name data in an address dataset. They pass their respective cluster IDs to the Association transformation.

Sorter and Joiner transformations. Use the Sorter transformations to ensure that all records with common cluster IDs are sequenced together in the data streams. Because the Association transformation allows only one input group, use a Joiner transformation to merge data. When you use more than two data sources, you can use multiple Joiner transformations in the pipeline.
Tip: When you use relational sources, you can configure the Number of Sorted Ports option in the Source

Qualifier transformation to have the database server sort source data.

Association transformation. Use the cluster IDs generated by the matching plans to associate related data records. This transformation generates a single association ID for each associated group and provides output data, sorted according to the new groups, to the Consolidation transformation.

Consolidation transformation. Use to generate a single, consolidated record for each group of consecutive records that share a common association ID. The value for each field is created based on the expressions configured for each port.
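The following Python sketch illustrates what the Association and Consolidation stages accomplish. It is not Informatica code: the field names, the union-find association, and the most-frequent consolidation rule are illustrative assumptions, since the actual behavior depends on the expressions configured in the transformations.

```python
from collections import Counter

# Each record carries cluster IDs from two matching plans (for example,
# a surname match and a city match). Records that share either cluster
# ID belong in the same association group.
records = [
    {"id": 1, "surname_cluster": "S1", "city_cluster": "C1", "name": "Smith"},
    {"id": 2, "surname_cluster": "S1", "city_cluster": "C2", "name": "Smyth"},
    {"id": 3, "surname_cluster": "S2", "city_cluster": "C2", "name": "Jones"},
]

# Association: union-find over shared cluster IDs yields one
# association ID per connected group of records.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for r in records:
    union(("rec", r["id"]), ("surname", r["surname_cluster"]))
    union(("rec", r["id"]), ("city", r["city_cluster"]))

for r in records:
    r["association_id"] = find(("rec", r["id"]))

# Consolidation: one master record per association group, taking the
# most frequent value for each field (one possible port expression).
groups = {}
for r in records:
    groups.setdefault(r["association_id"], []).append(r)

masters = [
    {"name": Counter(r["name"] for r in grp).most_common(1)[0][0]}
    for grp in groups.values()
]
```

Records 1 and 2 share a surname cluster, and records 2 and 3 share a city cluster, so all three records receive one association ID and consolidate into a single master record.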

Associating and Consolidating Data from Multiple Data Sources


When you create a mapping with multiple sources, you define two source data pipelines and add a data matching plan to each one. Sort and join the outputs of these matching operations before routing them to the Association and Consolidation transformations. Use this model to run matching plans on different data sources and create a single master record for duplicates identified across the data sources.
Note: This model differs from the dual-source data matching process described in Chapter 6, Creating Mappings for Data Quality Plans. In Chapter 6, the two data sources are joined before matching takes place.


Figure 7-2 shows an example of this type of mapping:


Figure 7-2. Associating and Consolidating Data from Two Sources

This mapping includes the following transformations:

Data quality mapplets. Use two data quality mapplets that each contain a matching plan. Each plan may perform the same type of matching operations on the same or similar fields in each dataset. The plans pass their respective cluster IDs to the Association transformation.

Sorter and Joiner transformations. Use the Sorter transformations to ensure that all records with common cluster IDs are sequenced together in the data streams. Because the Association transformation allows only one input group, use a Joiner transformation to merge data. When you use more than two data sources, you can use multiple Joiner transformations in the pipeline.
Tip: When you use relational sources, you can configure the Number of Sorted Ports option in the Source

Qualifier transformation to have the database server sort source data.

Association transformation. Use the cluster IDs generated by the matching plans to associate related data records. This transformation generates a single association ID for each associated group and provides output data, sorted according to the new groups, to the Consolidation transformation.

Consolidation transformation. Use to generate a single, consolidated record for each group of consecutive records that share a common association ID. The value for each field is created based on the expressions configured for each port.



APPENDIX A

Working with Data Matching Plans


The appendix includes the following topics:

Overview, 37

Overview
Data matching plans identify duplicate records in a dataset or across datasets. Matching plans operate differently from other types of plan. To understand matching plans, you must understand how data quality mapplets handle matching in PowerCenter.

A matching plan compares all values on a user-selected input port with one another and generates a match score for every comparison pair. The score represents the degree of similarity between the two matched values, taking the first value as a baseline and calculating the percentage similarity of the second value to the first. All values are matched against all other values in this way, and the plan designer sets a match threshold value that defines the level of similarity that constitutes a strong match. A set of records with values that demonstrate a high level of similarity to one another is called a cluster.
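The pairwise scoring and threshold behavior described above can be illustrated in Python. The scorer below uses a general-purpose string similarity in place of a Data Quality matching algorithm, and the 85 percent threshold and sample values are arbitrary examples.

```python
from difflib import SequenceMatcher
from itertools import combinations

values = ["Anderson", "Andersen", "Brown", "Browne"]
MATCH_THRESHOLD = 85  # the plan designer's threshold, as a percentage

def match_score(baseline, candidate):
    # Percentage similarity of the candidate to the baseline value.
    return SequenceMatcher(None, baseline.lower(), candidate.lower()).ratio() * 100

# Compare every value with every other value; keep the strong matches.
strong_matches = [
    (a, b, round(match_score(a, b), 1))
    for a, b in combinations(values, 2)
    if match_score(a, b) >= MATCH_THRESHOLD
]
```

With this scorer, "Anderson"/"Andersen" and "Brown"/"Browne" clear the threshold while the cross-name pairs do not, so each surviving pair belongs to a candidate cluster.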

Grouping and Matching Considerations


Unless your input dataset is small in size, you must provide a means of grouping your data before it reaches the data quality plan. Grouping means sorting input records by values in one or more user-selected fields. When a matching plan is run on grouped data, matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches across groups in the dataset. Data Quality and PowerCenter handle grouping differently:

Data Quality users typically create a separate grouping plan that runs before the matching plan. The grouping plan creates a set of temporary files or database entries that indicate the records that belong to each group. These files or database entries are read by the matching plan to determine which records are matched together. They can be discarded when the matching plan has run, and they can be recreated by re-running the grouping plan. Every time you run the grouping plan in Workbench, you overwrite the group data.

In PowerCenter, pre-match grouping is not necessary. Use a Sorter transformation instead.

When designing a mapping with a data quality mapplet, add a Sorter transformation before the mapplet and sort the data records according to a key field. Link the Group Key port on the mapplet input to the Sorter transformation key field.

37

When designing a mapping with a Data Quality Integration transformation, add a Sorter transformation before the Data Quality Integration transformation and sort the data records according to a key field. Set the Grouping Port in the Data Quality Integration transformation to this field to enable match processing on grouped data.

As well as assigning data to groups, a grouping plan may create columns of potential group keys. You can select one of these columns as the group key in the Sorter transformation. Always select the same group key port in the mapplet input as you selected in the Sorter transformation.

Any column containing a statistically meaningful range of values can be used as a group key column, so long as the range of values has a meaningful association with the main focus of the matching exercise. For example, if your data quality plan focuses on matching person names, you could select date of birth information as a group key, on the basis that two records with common values for name and date of birth are likely to be the same person. In such a case, a City or Town name column would be a poor choice of group key, as there may be many people with similar or identical names in a city whose records are not duplicates of one another.

In Workbench you can create composite group keys composed of data from two or more existing fields. For example, you could create a composite group key that included both date of birth and city or town of residence.
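The value of grouping is easy to quantify: n records require n(n-1)/2 comparisons without grouping, but only the within-group pairs with it. The following Python sketch uses the date-of-birth group key from the example above; the records themselves are invented.

```python
from itertools import combinations

records = [
    {"name": "Ann Lee",  "dob": "1970-03-12"},
    {"name": "Anne Lee", "dob": "1970-03-12"},
    {"name": "Bob King", "dob": "1981-07-04"},
    {"name": "Rob King", "dob": "1981-07-04"},
    {"name": "Carl Ray", "dob": "1964-11-30"},
]

# Group on the date-of-birth key: only records that share a key value
# are candidates for comparison.
groups = {}
for rec in records:
    groups.setdefault(rec["dob"], []).append(rec)

ungrouped_pairs = len(list(combinations(records, 2)))
grouped_pairs = sum(len(list(combinations(g, 2))) for g in groups.values())
# Five records need 10 ungrouped comparisons but only 2 grouped ones.
```

The saving grows quadratically with dataset size, which is why a well-designed group key can dramatically cut plan processing time.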

Match Output Types and Cluster Information


PowerCenter adds output ports named MatchScore, ClusterID, and RecordsPerCluster to a mapplet or transformation containing a matching plan. These fields enable transformations downstream in the mapping to recognize records that have been identified as likely duplicates by the data quality process. ClusterID values are effective inputs for the AssociationID ports of Association transformations.

MatchScore. A numerical value representing the degree of similarity between two input strings, as a percentage.
ClusterID. A unique identifier for each set of matching records identified by the plan.
RecordsPerCluster. The number of records in a given cluster.

ClusterID Format
Data Quality Workbench and PowerCenter create ClusterID values in different ways. In Data Quality Workbench, the ClusterID values created by a matching plan are numbers that increment for each new cluster. In PowerCenter, the ClusterID value contains additional information that ensures it is unique within the system. The output format for a data row on the ClusterID port in PowerCenter is as follows:
<hostname>:<process id>:<thread id>:<internal cluster id for the row>
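A downstream process that needs only the per-run cluster number can split the ClusterID value on its colon separators. This is an illustrative Python sketch: the sample value is invented, and it assumes the hostname contains no colons and that the ID fields are numeric.

```python
def parse_cluster_id(cluster_id):
    # <hostname>:<process id>:<thread id>:<internal cluster id for the row>
    hostname, process_id, thread_id, internal_id = cluster_id.split(":")
    return {
        "hostname": hostname,
        "process_id": int(process_id),
        "thread_id": int(thread_id),
        "internal_cluster_id": int(internal_id),
    }

parsed = parse_cluster_id("etlhost01:4120:7:42")
```

The hostname, process ID, and thread ID components make the value unique within the system; the final component corresponds to the incrementing cluster number that Data Quality Workbench produces on its own.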

Matching Components in Data Quality Mapplets


Data matching plans that are saved as mapplets in PowerCenter contain components not found in other types of plan: the Match Pair Generator and the Match Identifier.

Match Pair Generator. Data Quality adds this transformation to enable PowerCenter to match the values on a given mapplet input port against one another. The Match Pair Generator creates two identical output ports for each input port that it receives. For identity matching mapplets, Data Quality adds an Identity Match Pair Generator.

Match Identifier. The data quality mapplet uses this transformation to write match score and match cluster information to its output rows. The transformation adds three fields to each data row. For identity matching mapplets, Data Quality adds an Identity Match Identifier.

The Match Pair Generator and Match Identifier transformations allow PowerCenter to record duplicate analysis results in a data flow while maintaining the same numbers of input and output ports in the data quality mapplet.


Note: If you are performing identity match analysis in PowerCenter, you must run an index key generation

process prior to an identity match analysis process. This creates the index of key values from which PowerCenter will create a set of possible identities within the source data. Data Quality plans use an Identity Group Target to create the key index. This target component installs to PowerCenter as the Identity Key Store transformation. To support the grouping port, add a Sorter transformation before the data quality mapplet when you create your mapping, and sort the data on the same port that you use as the grouping port. This ensures that all data rows with common values on this port are grouped together as they enter the mapplet.



APPENDIX B

Working with Identity Matching Plans


The appendix includes the following topics:

Overview, 41

Overview
When you create a plan to generate a key index in Informatica Data Quality, you specify the location of the index within the Data Quality folder structure. When you create a plan to read that key index, you must set the path to the index in the plan's identity components. You must follow similar steps when you use a mapplet created from an identity matching plan in a PowerCenter mapping. When you add the parent mapping to a session, you must specify the index location as a session-level property for any identity component that reads or writes an identity index in the session. Three transformations can read or write an index: Identity Match Identifier, Identity Key Store, and Identity Match Pair Generator.

Setting the Index Folder Location as a Session-Level Property


To set the key index location, open the Edit Tasks dialog box for the session and type the absolute path to the key index folder as a value for the Key Index Path attribute. When adding the key index path, do not include the index name. How your session reads the index folder location depends on your version of PowerCenter. For more information, see page 42.


Figure B-1 shows the Edit Task dialog box with an Identity Key Store transformation selected.
Figure B-1. Setting the Identity Index Location, Edit Tasks dialog box

Specifying the Key Index in PowerCenter 8.5.1


Note: In default installations of PowerCenter 8.5.1, the Identity Key Store and Identity Match Pair Generator

transformations do not read the index location from the session-level properties. Instead, they read the index location from the mapplet variables. This location is typically a folder within the Data Quality folder structure on the PowerCenter Integration Service machine. If the data quality plan reads an index folder at this location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes

Then the Identity Key Store and Identity Match Pair Generator transformations in PowerCenter 8.5.1 will read the index folder at this location:
C:\Informatica\PowerCenter8.5.1

You must ensure that the Identity Match Identifier reads the index from the same folder. Set the folder path for the Identity Match Identifier as a session-level property.
Note: PowerCenter provides an EBF that enables the transformations affected on the Integration Service

machine to read the session properties. For more information, contact Global Customer Support.

Table B-1 summarizes how different versions of PowerCenter read the key index folder location.

Table B-1. Index Folder Location Settings in PowerCenter

Identity Key Store (Data Quality component: Identity Group Target)
PowerCenter 8.1.1 SP5: Reads the index folder location from session-level properties.
PowerCenter 8.5.1: Reads the index folder location from a path specified in the mapplet.
PowerCenter 8.6: Reads the index folder location from session-level properties.

Identity Match Pair Generator (Data Quality component: Identity Group Source)
PowerCenter 8.1.1 SP5: Reads the index folder location from session-level properties.
PowerCenter 8.5.1: Reads the index folder location from a path specified in the mapplet.
PowerCenter 8.6: Reads the index folder location from session-level properties.

Identity Match Identifier (Data Quality component: CSV Identity Match Target)
PowerCenter 8.1.1 SP5: Reads the index folder location from session-level properties.
PowerCenter 8.5.1: Reads the index folder location from session-level properties. The path must match the path specified for the other identity transformations in the mapping.
PowerCenter 8.6: Reads the index folder location from session-level properties.


INDEX

A
Address validation 6
  Data Quality Content Installer 6
association IDs
  using to consolidate records 19
association ports
  description 15
Association transformation
  creating 16
  example 16
  in Data Quality mappings 33
  naming convention 17
  overview 4, 15
  ports 15
AssociationID port 16
  description 15

C
cluster IDs
  using to associate records 15
Consolidation functions
  Most_Frequent 22
  Most_Frequent_NonBlank 22
  overview 21
  Store 21
  Stored 22
Consolidation transformation
  creating 22
  default values 20
  expressions 21
  expressions in 19
  group by port 19
  in Data Quality mappings 33
  IsConsolidatedRecord port 20
  naming convention 22
  overview 4, 19
  ports 20
  sorted input 19

D
data matching plans 37, 41
Data Quality Content Installer 6
Data Quality Integration transformation
  convert plans to mapplets 5
  deprecation 1, 5
  embedded plans 12
  in mappings for data standardizing, parsing, and validation 27
  in mappings for dual-source matching 28
  in mappings for single-source matching 28
  overview 5
  pass-through ports 14
Data Quality mappings
  for association and consolidation 33
deprecated component
  Data Quality Integration transformation 1, 5
designing mappings
  for association and consolidation, multiple sources 34
  for association and consolidation, single source 33
  for data standardizing, parsing, and validation 27
  for dual-source matching 28
  for single-source matching 28

E
Expression transformation
  in data quality mapplets 27
expressions
  Consolidation transformations 19, 21
  default values for Consolidation transformations 20

G
group by ports
  defined 19
  in Consolidation transformations 20
grouping records
  in Association transformations 15
  in Consolidation transformations 20
  overview 37

I
identity matching
  creating mappings 26, 29
  description 6
  index folder location 41
  population files 7
  Workbench components in PowerCenter 11
IsConsolidatedRecord port
  Consolidation transformation port 20

M
mappings
  data standardizing, parsing, and validation 27

N
naming conventions
  Association transformation 17
  Consolidation transformation 22

P
pass-through ports
  in Data Quality Integration transformations 14
ports
  Association transformation 15
  AssociationID 15
  Consolidation transformation 20
  group by 19, 20
  IsConsolidatedRecord 20

R
records
  associating 19

S
Sequence Generator transformation 26
Sorter transformation 20, 26, 34, 37
sorting data
  for Consolidation transformations 19
  for data matching 25
syntax
  for consolidation functions 21

T
transformations
  active and passive 13
  Association 4, 15
  Consolidation 4, 19
  Data Quality Integration transformation 12

NOTICES
This Informatica product (the Software) includes certain drivers (the DataDirect Drivers) from DataDirect Technologies, an operating company of Progress Software Corporation (DataDirect) which are subject to the following terms and conditions: 1. THE DATADIRECT DRIVERS ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. 2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.
