
dfPower Studio 7.1.2 User's Guide

dfPower Studio - Contact and Legal Information


Technical Support
Phone: 1-919-531-9000 Email: techsupport@dataflux.com Web: www.dataflux.com/techsupport

Contact DataFlux
Corporate Headquarters
DataFlux Corporation
940 NW Cary Parkway, Suite 201
Cary, NC 27513-2792
Toll Free Phone: 1-877-846-FLUX (3589)
Toll Free Fax: 1-877-769-FLUX (3589)
Local Telephone: 1-919-447-3000
Local Fax: 1-919-447-3100
Web: www.dataflux.com

European Headquarters
DataFlux UK Limited
59-60 Thames Street
WINDSOR
Berkshire SL4 1TX
United Kingdom
UK (EMEA): +44 (0) 1753 272 020

Legal Information
PCRE Copyright Disclosure
A modified version of the open source software PCRE library package, written by Philip Hazel and copyrighted by the University of Cambridge, England, has been used by DataFlux for regular expression support. Copyright (c) 1997-2005 University of Cambridge. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University of Cambridge nor the name of Google Inc. nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

gSOAP Copyright Disclosure


Part of the software embedded in this product is gSOAP software. Portions created by gSOAP are Copyright (C) 2001-2004 Robert A. van Engelen, Genivia Inc. All Rights Reserved. THE SOFTWARE IN THIS PRODUCT WAS IN PART PROVIDED BY GENIVIA INC AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Expat Copyright Disclosure


Part of the software embedded in this product is Expat software. Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache/Xerces Copyright Disclosure


The Apache Software License, Version 1.1 Copyright (c) 1999-2003 The Apache Software Foundation. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. The end-user documentation included with the redistribution, if any, must include the following acknowledgment: "This product includes software developed by the Apache Software Foundation (http://www.apache.org/)." Alternately, this acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear. The names "Xerces" and "Apache Software Foundation" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact apache@apache.org. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation and was originally based on software copyright (c) 1999, International Business Machines, Inc., http://www.ibm.com. For more information on the Apache Software Foundation, please see <http://www.apache.org/>.

SQLite Copyright Disclosure


The original author of SQLite has dedicated the code to the public domain. Anyone is free to copy, modify, publish, use, compile, sell, or distribute the original SQLite code, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

USPS Copyright Disclosure


National ZIP, ZIP+4, Delivery Point Barcode Information, DPV, RDI. © United States Postal Service 2005. ZIP Code and ZIP+4 are registered trademarks of the U.S. Postal Service. DataFlux holds a non-exclusive license from the United States Postal Service to publish and sell USPS CASS, DPV, and RDI information. This information is confidential and proprietary to the United States Postal Service. The price of these products is neither established, controlled, nor approved by the United States Postal Service. DataFlux and all other DataFlux Corporation LLC product or service names are registered trademarks or trademarks of, or licensed to, DataFlux Corporation LLC in the USA and other countries. ® indicates USA registration. Copyright © 2005 DataFlux Corporation LLC, Cary, NC, USA. All Rights Reserved.

Table of Contents

dfPower Studio - Contact and Legal Information
    Technical Support
    Contact DataFlux
    Legal Information

Chapter 1: Introducing dfPower Studio
    dfPower Studio Overview
    dfPower Studio Main Screen: Main Menu and Toolbar
    dfPower Studio Base
    dfPower Profile
    dfPower Quality
    dfPower Integration
    dfPower Enrichment
    dfPower Design
    dfPower Favorites

Chapter 2: Getting Started
    Before You Upgrade dfPower Studio and/or the Quality Knowledge Base
    System Requirements
    dfPower Studio Installation Wizard
    dfPower Studio Installation Command-Line Switches
    Downloading and Installing dfPower Verify Databases
    Launching dfPower Studio
    Licensing dfPower Studio
    Moving Around in dfPower Studio
    Accessing dfPower Studio Online Help

Chapter 3: dfPower Studio in Action
    Sample Data Management Scenario
    Data Management Scenario Background
    Profiling
    Quality
    Integration
    Enrichment
    Monitoring
    Summary

Chapter 4: Performance Guide
    dfPower Studio Settings
    Database and Connectivity Issues
    Quality Knowledge Base Issues

Exercises
    Profile
    Profile
    Architect


Chapter 1: Introducing dfPower Studio


This chapter describes dfPower Studio and its component applications and application bundles. Topics in this chapter include:

- dfPower Studio Overview
- dfPower Studio Main Menu and Toolbar descriptions
- dfPower Studio node descriptions for Base, Profile, Quality, Integration, Enrichment, and Design

dfPower Studio Overview


dfPower Studio is a powerful, easy-to-use suite of data cleansing and data integration software applications. dfPower Studio connects to virtually any data source, and any dfPower Studio user in any department can use Studio to profile, cleanse, integrate, enrich, monitor, and otherwise improve data quality throughout the enterprise. An innovative job flow builder allows users to build complex data management workflows quickly and logically. This capability allows frontline staff, not IT or development resources, to discover and address data problems, merge customer and prospect databases, verify and complete address information, transform and standardize product codes, and perform just about any other data management process required by your organization. You can access dfPower Studio's functionality through a graphical user interface (GUI), from the command line, or in batch operation mode, providing you great flexibility in addressing your data quality issues. The dfPower Studio main GUI screen is a centralized location from which to launch the dfPower Studio applications. The toolbar on the left of the screen lists all the dfPower Studio applications. dfPower Studio consists of these nodes and application bundles:

- dfPower Studio Base
- dfPower Studio Profile
- dfPower Studio Quality
- dfPower Studio Integration
- dfPower Studio Enrichment

These applications and application bundles are described in this chapter, along with the Navigator window.

Note: Some applications will not be available for launching if you have not licensed them.

The dfPower Studio Navigator Window


When you launch dfPower Studio, the Navigator window appears.


Figure: The dfPower Studio Navigator window.

The Navigator is a revolutionary concept in the data quality and data integration industry, and is an essential part of the dfPower solution. The Navigator was designed to help you collect, store, examine, and otherwise manage the various data quality and integration logic and business rules that are generated and created during dfPower Studio use. Quality Knowledge Base, Management Resources, and References are all containers for metadata that provide you a complete view of your data quality assets and help you build complicated data management processes with ease. Use the Navigator to incorporate data quality business rules that are used across an entire enterprise within many different aspects of business intelligence and overall data hygiene. These business rules are stored as objects that are independent of the underlying data sources used to create them. Thus, there is no limit to the application of these business rules to other sources of data throughout the enterprise, including internal datasets, external datasets, data entry, internal applications, the web, and so on. From the smallest spreadsheets buried in the corners of the enterprise, all the way up to corporate operational systems on mainframes, you can use the same business rules to achieve consistent data integrity and usability across the entire enterprise. More specifically, the Navigator:

- Stores all business rules generated during any of the various dfPower Studio processes
- Allows metadata to be maintained about each of these data quality objects
- Allows reuse of these data quality jobs and objects across the various applications
- Facilitates quick launches of various stored data management jobs with a few clicks of the mouse
- Maintains all reports that are generated during data quality processing
- Manages various configurations of the various data quality processes routinely implemented by the organization
- Maintains information about batch jobs and the various schedules of these jobs


Studio Supports the AIC Process and Building Blocks of Data Management
The dfPower Studio bundles have been designed to directly support the building blocks of data management. You will find this design particularly helpful if you are using the approach DataFlux takes to data management initiatives: an Analyze, Improve, and Control (AIC) process. AIC is a method of finding data problems, building reusable transformations, and strategically applying those transformations to improve the usability and reliability of an organization's data. The five building blocks of AIC data management are:

- data profiling
- data quality
- data integration
- data enrichment
- data monitoring

As you will see, these building blocks closely mirror the structure of dfPower Studio's main menu.

dfPower Studio Main Screen: Main Menu and Toolbar


Use the dfPower Studio toolbar to launch the various dfPower Studio applications. The node menu and toolbar contain several categories of applications and functions that mirror the data management methodology: Profile, Quality, Integration, Enrichment, and Monitoring. You can customize the contents of the seventh category, Favorites.

Figure 1. The Main Screen menu and toolbar reflect the Data Management Methodology process.

The toolbar palette offers quick access to various functions, which are identified by balloon text.

You can also display a toolbar palette by selecting Studio > Show Palette. To hide the palette, select Studio > Hide Palette. The following sections describe the various nodes available from the main window.

dfPower Studio Base


At the heart of dfPower Studio is the bundle of applications and functions collectively known as dfPower Studio Base. dfPower Studio Base is the infrastructure that ties all the dfPower Studio applications together, allowing them to interact, process data simultaneously at scheduled times, and share information where necessary. dfPower Studio Base consists of the following components:

- Architect: Design a workflow for processing your data
- Batch: Schedule jobs to be run in batch mode


- DB Viewer: View and retrieve records from your data sources
- ODBC Admin: Configure dfPower Studio to connect to your various data sources
- Options: Configure dfPower Studio options
- Rule Manager: Create and manage business rules
- Monitor Viewer: View task data in a repository
- Repository Administrator: Create and manage repositories

Note: Navigator and Extend appeared in dfPower Studio Base in earlier releases. Navigator now opens automatically with Studio and appears as a menu option at the top of Studio's main menu. The Extend function is now accessible from dfPower Studio Enrichment.

Architect: Design a Workflow for Processing Your Data


dfPower Base - Architect brings much of the functionality of the other dfPower Studio applications (such as dfPower Base - Batch and dfPower Quality - Analysis), as well as some unique functionality, into a single, intuitive user interface. In Architect, you define a sequence of operations (for example, selecting source data, parsing that data, verifying address data, and outputting that data into a new table) and then run all the operations at once. This functionality not only saves you the time and trouble of using multiple dfPower Studio applications, but also helps to ensure consistency in your data-processing work. To use Architect, you specify operations by selecting job flow steps and then configuring those steps, via straightforward settings, to meet your specific data needs. The steps you choose are displayed on dfPower Architect's main screen as node icons, together forming a visual job flow; a conceptual sketch of this idea appears below. Architect can read data from virtually any source, and can output processed data to text files, HTML reports, database tables, and a host of other formats.

Note: Understanding the other dfPower Studio applications is very helpful when working with dfPower Architect.
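To make the shape of a job flow concrete, here is a minimal conceptual sketch in Python. It is not a DataFlux API: the step names (read_source, parse_names, write_table) are invented stand-ins for Architect nodes, and real Architect steps are configured in the GUI rather than coded.

    # Each "step" consumes a stream of records and yields a transformed
    # stream, mirroring how Architect nodes pass data from one to the next.
    def read_source(rows):
        for row in rows:
            yield dict(row)

    def parse_names(rows):
        # Split a full name into first/last fields, as a parsing step might.
        for row in rows:
            first, _, last = row.get("name", "").partition(" ")
            yield {**row, "first_name": first, "last_name": last}

    def write_table(rows):
        # Terminal step: materialize the processed records.
        return list(rows)

    records = [{"name": "Ada Lovelace"}, {"name": "Alan Turing"}]
    output = write_table(parse_names(read_source(records)))
    print(output)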

Batch: Schedule Jobs to Run in Batch Mode


dfPower Base - Batch supports dfPower Studio's built-in batch facilities, allowing you to schedule and run data quality procedures on databases without actually using the dfPower Studio applications' graphical interfaces. These procedures include logging on and off the appropriate database, as well as the actual processing. Using Batch, you can instruct dfPower Studio to analyze multiple tables and databases at once and to run multiple batch processing jobs. You can also use Batch to schedule jobs to run at a time when database activity is non-existent or at a minimum, even when you cannot be physically present.

DB Viewer: View and Retrieve Records


dfPower Base - DB Viewer is a record viewer that permits you to view and retrieve records from your various data sources.

ODBC Admin: Configure dfPower Studio to Connect to Various Data Sources


Configure dfPower Studio and dfPower Studio applications to connect to databases by using the Microsoft Windows ODBC (Open Database Connectivity) Data Source Administrator screen. You can access this screen from a variety of locations within the dfPower Studio suite.


To process a database in dfPower Studio, an ODBC driver for the specified DBMS must be installed, and the database must be configured as an ODBC data source. If you have successfully completed these steps, the database name will display in the database lists found on the dfPower Studio main screen, dfPower Integration - Match main screen, and dfPower Enrichment - Verify main screen.

Figure 2. dfPower Studio - Base - ODBC Admin screen

Note: Extend appeared in dfPower Studio Base in earlier releases. The Extend function is now accessible from dfPower Studio Enrichment - Extend. Extend permits you to connect directly to a data source using ODBC and creates fixed-width text files that can be sent to Donnelley Marketing's InfoConnect web service for address hygiene, privacy protection, and data enhancement services. See the online Help for additional details.

Options: Configure dfPower Studio Options


After installing and licensing dfPower Studio, use the Options screen to set how dfPower Studio Base and certain other dfPower Studio applications operate.


Figure 3. dfPower Studio - Base - Options

Rule Manager: Create Custom Business Rules and Metrics


Use the Rule Manager to create and manage business rules and custom metrics, and to add them to a job to monitor the data. You can use business rules and custom metrics to analyze data to identify problems. The Rule Manager provides tools for building expressions for custom metrics and rules. You can create custom metrics to supplement the standard metrics available in dfPower Profile or to be used in rules and tasks with dfPower Architect. Use the Rule Manager to create tasks that implement one or more rules. You can then execute these tasks at various points in a dfPower Architect job flow and trigger events when the conditions of a rule are met. A conceptual sketch of this rule/task/event pattern follows.
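The sketch below illustrates the rule/task/event pattern conceptually in Python. The rule name, record fields, and event handling are invented for illustration; actual dfPower rules are built with the Rule Manager's expression tools, not Python.

    # A business rule is a predicate over a record; a task bundles rules
    # and fires an event whenever a rule's condition is violated.
    def zip_present(record):
        return bool(record.get("zip"))

    def run_task(rules, records, on_violation):
        for record in records:
            for rule in rules:
                if not rule(record):
                    on_violation(rule.__name__, record)

    run_task([zip_present],
             [{"zip": "27513"}, {"zip": ""}],
             on_violation=lambda rule, rec: print(f"{rule} failed for {rec}"))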


Figure 4. dfPower Studio - Base - Rule Manager

Monitor Viewer: View Task Data in a Repository


When you run a job in dfPower Architect, the job executes tasks and rules that you created using the Business Rule Manager. Information about these executed tasks and their associated rules is stored in the default repository. The Monitor Viewer lets you view the information stored in the repository about the executed tasks and rules.

Figure 5. dfPower Studio - Base - Monitor Viewer

Repository Administrator: Create and Manage Repositories


The Repository Administrator enables you to create and register new repositories, upgrade existing repositories and dfPower Profile reports, and import objects from one repository to another. A DataFlux unified repository is a set of tables in a database or file that stores profile reports, custom metrics, business rules, and results from data monitoring jobs.

Figure 6. dfPower Studio - Base - Repository Administrator

dfPower Profile
Start any data quality initiative with a complete assessment of your organization's information assets. Use dfPower Profile to examine the structure, completeness, suitability, and relationships of your data. Through built-in metrics and analysis functionality, you can find out what problems you have and determine the best way to address them. Using dfPower Profile, you can:

- Select and connect to multiple databases of your choice without worrying about whether your sources are local, over a network, on a different platform, or at a remote location.
- Create virtual tables using business rules from your data sources in order to scrutinize and filter your data.
- Run multiple data metrics operations on different data sources at the same time.
- Run primary/foreign key and redundant data analyses to maintain the referential integrity of your data.
- Monitor the structure of your data as you change and update your content.

The following tools are available within Profile.

dfPower Profile Configurator: Set up Jobs to Profile Data Sources


The Configurator screen appears when you first start Profile. Use the Configurator to set up jobs to profile your data sources.


dfPower Profile Viewer: View Profile Results


Use the Viewer to view the results of a dfPower Profile job. By default, this screen appears automatically. To access this screen from the Configurator main screen, choose Tools > dfPower Profile (Viewer).

dfPower Profile Compare: Compare Tables from Multiple Data Sources


The dfPower Compare tool (DBMC) helps you compare content from multiple data sources. With this tool, you can:

- Compare table copies for identical content, determine whether files have been deleted, reveal whether extra calculated fields have been added, ascertain whether a table is a subset, and so on.
- Produce one spreadsheet, text file, or HTML page showing all variables on all databases.
- Build an output file with all the values of all the variables on any number of databases.

For more information on the Compare function, see the online Help links to the DBMC manual and web content.

dfPower Profile Scheme Builder: Build Standardization Schemes


Use the Scheme Builder to build schemes from an analysis report, to edit existing schemes, or to create schemes from scratch.

dfPower Quality
dfPower Quality uses the DataFlux matching technology, transformation routines, and identification logic to help correct most common data problems, including duplicate records, non-standard data representations, and indeterminate data types. dfPower Quality consists of two components:

- dfPower Quality - Analysis
- dfPower Quality - Standardize


Note: Match appeared in dfPower Studio Quality in earlier releases. The Match function is now accessible from dfPower Studio Integration.

dfPower Quality Analysis: Analyze Data Sources


dfPower Quality - Analysis helps you discover the data quality issues that exist in your data sources, and then build business logic that can be used by other dfPower Studio applications and DataFlux technology to resolve data quality issues throughout the enterprise. Primarily, dfPower Quality - Analysis addresses the subset of data quality known as data inconsistency. Inconsistent data representation can have damaging effects on any system, whether it is a data warehouse, a customer relationship management system, a marketing database, a human resource system, a sales force automation system, or even a plain operational data store. It can cause inaccurate results, especially when reporting, querying, or performing any type of quantitative analysis on a data set. Importantly, dfPower Quality - Analysis uses your enterprise's own data to generate the business logic that will be used throughout dfPower Studio. This allows you not only to apply generic data quality standards to achieve better business intelligence results, but also to resolve issues that are unique to your organization. dfPower Quality - Analysis will actually write transformation rules for you by inferring standard values from your own data.

dfPower Quality Standardize: Standardize Your Data


Use dfPower Quality Standardize to scrub, clean, purify, standardize, and otherwise ensure a consistent representation of any data that exists in your organization's databases. This function makes the data significantly more usable, enabling more accurate results when issuing queries and generating reports, while also increasing the economic value hidden within the data. A minimal sketch of the scheme-lookup idea follows.
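To illustrate the general technique (this is a toy Python example, not a DataFlux-shipped scheme or API), a standardization scheme can be thought of as a lookup table from observed variants to standard values, applied element by element:

    # Map observed variants to a single standard representation.
    SCHEME = {
        "inc": "Incorporated", "inc.": "Incorporated",
        "corp": "Corporation", "corp.": "Corporation",
    }

    def standardize(value, scheme):
        # Replace each element found in the scheme; pass others through.
        return " ".join(scheme.get(word.lower(), word) for word in value.split())

    print(standardize("DataFlux Corp.", SCHEME))  # -> DataFlux Corporation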

dfPower Integration
Similar data often exists in multiple databases. For example, the same product line might have different product names in European and U.S. markets. Integration helps establish the commonality among the various records. Several methods can be used for integration, or data matching. DataFlux uses a modified deterministic matching method: the rules used to determine matches are known before matching begins. This approach differs from probabilistic matching, which looks at data elements in relation to other data elements in the data set to determine matches. A rough sketch of the deterministic idea appears below. For more detailed information on integration strategies, refer to the DataFlux white paper, The Building Blocks of Data Management, which you can find via the Customer Portal at www.dataflux.com.
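Here is a rough Python sketch of the deterministic idea. The match-code derivation below (normalize case, strip punctuation, sort truncated tokens) is invented for illustration and is far simpler than the DataFlux matching technology, but it shows what it means for the rules to be known before matching begins:

    import re
    from collections import defaultdict

    def match_code(name, sensitivity=6):
        # Fixed, pre-declared rules: uppercase, drop punctuation, keep the
        # first three letters of each token, sort to ignore word order.
        cleaned = re.sub(r"[^A-Z ]", "", name.upper())
        return "".join(sorted(w[:3] for w in cleaned.split()))[:sensitivity]

    records = ["Smith, John", "John Smith", "Jon Smyth"]
    groups = defaultdict(list)
    for record in records:
        groups[match_code(record)].append(record)

    # Records that share a match code are candidate duplicates.
    for code, members in groups.items():
        print(code, members)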

dfPower Integration Match: Generate Match Codes to Identify Duplicate Records


dfPower Integration Match automates the otherwise tedious, time-consuming task of identifying duplicate and near-duplicate records that can threaten the integrity of an organization's data. Its flexible match sensitivities and detailed reports enable dfPower Integration Match to identify and help eliminate a wide array of duplicate and near-duplicate records. dfPower Integration Match has complete functionality not only for producing detailed match reports, but also for manual and automatic best-surviving-record determination (also known as a duplicate elimination process).


You can also use dfPower Integration Match to identify households: groups of related customers that share a physical address and/or other relationship. Households provide a way to use data about your customers' relationships with each other to gain insight into their spending patterns, job mobility, family moves and additions, and more.

dfPower Integration Merge: Combine Data from Multiple Sources


Use dfPower Integration Merge to combine data from multiple sources. You will typically have performed various data transformations before you attempt to merge files. Most likely, you will generate match codes via the Integration Match function (or directly from Architect) and then use the virtually created Match Code fields as comparison data to match records. For more detailed information on data merging strategies, refer to the DataFlux white paper, The Building Blocks of Data Management, which you can find via the Customer Portal at www.dataflux.com.

dfPower Enrichment
Enrichment permits you to incorporate external data that is not directly related to the base data; you may gain useful insights from this extra data if your data set is incomplete. Missing data can prevent you from adequately recognizing customer needs, or it may be difficult to tie these record types to other information that already exists in your system. These problems really have two aspects. First is the issue of data completeness: a typical example of incomplete data would be a missing ZIP code in an address field; the incomplete data precludes any mailing effort or ZIP code-based demographic mapping. Second is the issue of data keying: you may have a complete view of your data for existing needs but desire geographical data like longitude and latitude to further some recently instituted goal. Finally, DataFlux considers any combination of supplemental data and custom or targeted algorithms designed to tackle industry-specific problems to be data enrichment. A hypothetical sketch of reference-table enrichment follows.
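As a hypothetical Python sketch of the completeness side, the following fragment fills a missing ZIP code and appends latitude/longitude from an invented reference table keyed on city and state; real enrichment uses licensed reference databases rather than an in-memory dictionary:

    # Invented reference data for illustration only.
    REFERENCE = {("CARY", "NC"): {"zip": "27513", "lat": 35.79, "lon": -78.78}}

    def enrich(record, reference):
        key = (record["city"].upper(), record["state"].upper())
        extra = reference.get(key, {})
        if not record.get("zip") and "zip" in extra:
            record["zip"] = extra["zip"]            # complete the missing field
        record.setdefault("lat", extra.get("lat"))  # append geocode data
        record.setdefault("lon", extra.get("lon"))
        return record

    print(enrich({"city": "Cary", "state": "NC", "zip": ""}, REFERENCE))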

dfPower Enrichment Verify: Verify Addresses Against Reference Databases


Address standardization and verification is the key to most data quality projects involving customer contact information. Depending on your dfPower Studio license, dfPower Verify can process US, Canadian, and international addresses: standardizing address elements, adding valuable address-related and geocode data, and correcting spelling errors, invalid addresses, invalid cities and states, invalid postal codes, and invalid organization names. For US addresses, dfPower Verify can also geocode addresses to the ZIP+4 centroid level, providing longitude, latitude, state and county FIPS codes, and census tract and block group numbers. You can also use dfPower Verify to report on telephone numbers with invalid area codes or determine area codes from postal codes. dfPower Verify is CASS (Coding Accuracy Support System) certified by the United States Postal Service and SERP (Software Evaluation and Recognition Program) certified by Canada Post. Companies that use CASS-certified software applications are eligible to receive significant postal discounts for bulk mailings.


dfPower Enrichment Extend: Use an Online Data Enhancement Tool


dfPower Extend is an application that connects directly to a data source using ODBC and creates fixed-width text files that can be sent to Donnelley Marketing's InfoConnect web service, which performs any number of address hygiene, privacy protection, and data enhancement services on your data.

Note: You must have an account with Donnelley Marketing in order to use this feature. Extend appeared in dfPower Studio Base in earlier releases. The Extend function is now accessible from dfPower Studio Base - ODBC Admin or from dfPower Enrichment - Extend.

dfPower Design
Designed for advanced users, dfPower Design is a development and testing bundle of applications for creating and modifying data management algorithms (such as algorithms for matching and parsing) that are surfaced in other DataFlux products. dfPower Design provides several GUI components to facilitate this process, as well as extensive testing and reporting features to help you fine-tune your data quality routines and transformations. Some of the dfPower Design tasks you can perform include:

- Creating data types and processing definitions based directly on your own data.
- Modifying DataFlux-designed processing definitions to meet the needs of your data.
- Creating and editing regular expression (Regex Library) files to format and otherwise clean inconsistent data.
- Creating and editing Phonetics Library files, which provide a means for better data matching.
- Creating and modifying extensive look-up tables (Vocabularies) and parsing rule libraries (Grammars) used for complex parsing routines.
- Creating and modifying character-level Chop Tables used to split apart strings of text according to the structure inherent in the data.
- Creating and modifying transformation tables (Schemes) that you can apply on a phrase or element-level basis.
- Testing parsing, standardization, and matching rules and definitions in an interactive or batch mode before you use them to process live data.
- Generating extensive character-level and token-level reporting information that allows you to see exactly how a string is parsed, transformed, cleaned, matched, or reformatted.
- Creating data types and processing definitions specific to your locale.

dfPower Design Customize


Modify or create data quality algorithms.

dfPower Design Vocab Editor


Create and modify collections of words.

dfPower Design Regex Editor


Create and modify regular expressions. An illustrative example of regex-based cleanup follows.
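For instance (an illustrative Python pattern, not an entry from a shipped Regex Library), a regular expression can collapse inconsistent phone-number formats into one shape:

    import re

    def normalize_phone(value):
        digits = re.sub(r"\D", "", value)  # strip everything but digits
        if len(digits) == 10:
            return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        return value  # leave unrecognized shapes untouched

    for raw in ["919.531.9000", "(919) 531-9000", "9195319000"]:
        print(normalize_phone(raw))  # all print (919) 531-9000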


dfPower Design Phonetics Editor


Create rules to identify similar-sounding data strings.

dfPower Design Chop Table Editor


Create ordered word lists from input data.

For more detailed information on the available Customize functions, refer to the online Help.

dfPower Favorites
The Favorites menu provides a convenient place to store quick links to your favorite dfPower Studio applications.


Chapter 2: Getting Started


This chapter describes how to get started using dfPower Studio. Topics in this chapter include:

- Before You Upgrade dfPower Studio and/or the Quality Knowledge Base
- System Requirements
- dfPower Studio Installation Wizard
- dfPower Studio Installation Command-Line Switches
- Downloading and Installing dfPower Verify Databases
- Launching dfPower Studio
- Licensing dfPower Studio
- Moving Around in dfPower Studio
- Accessing dfPower Studio Online Help

Before You Upgrade dfPower Studio and/or the Quality Knowledge Base
You do not need to uninstall a previous version of dfPower Studio before you install a new version. By default, new versions install into a separate directory indicated by the version number. We recommend you install the new version into a new directory, allowing the older and newer versions to operate side by side. Note that uninstalling dfPower Studio first might delete ODBC drivers that you are currently using but that have been discontinued in the upgraded software installation.

Note: To install dfPower Studio on Terminal Server, use the Windows Control Panel > Add or Remove Programs option. Failing to do so may cause dfPower Studio to be installed incorrectly.


Step 1: Create backup copies of your personal resource files, such as jobs and reports that are displayed in the dfPower Studio Navigator.
Note: These files are stored by default in the \Program Files\DataFlux\dfPower Studio\<version number>\mgmtrsrc directory.

Step 2: Make a backup copy of your Quality Knowledge Base (QKB) prior to installing a new QKB.
Note: By default, the QKB location is \Program Files\DataFlux\QltyKB. If you choose to install the latest QKB, you can install it into a new directory and change your Options > QKB Directory setting in dfPower Studio to point to the new location. You can also install it into a new directory and use dfPower Customize and the import features of dfPower Studio to selectively import portions of the updated QKB. The final option is to install the new QKB directly into an existing QKB location. The installation has merge functionality that will incorporate the updates from DataFlux into your existing QKB while keeping any changes you might have made.

For version 7, the QKB format for storing metadata has changed. This change could cause problems during your installation unless a few steps are taken to accommodate the change. Here are a few scenarios:

1. If you install dfPower Studio version 7 on a machine that has version 6.2 or earlier installed and QKB version 2004D or earlier already installed, the installation process will convert the version 6.2 metadata file to be compatible with version 7.

2. If you install dfPower Studio version 7 on a new machine and then install QKB version 2004D or earlier, you will need to run a conversion program that converts QKB version 2004D to use the new metadata file format. To do this, use the vault_merge.exe application in the root directory of your QKB. Launch it from the command line with the following syntax:

   vault_merge --convert <new file> <old file>

   For example:

   vault_merge --convert "C:\Program Files\DataFlux\QltyKB\2004D\qltykb.db" "C:\Program Files\DataFlux\QltyKB\2004D\qltykb.xml"

3. You can also install a QKB version 2005A or later. This version will already have a metadata file in the correct format for dfPower Studio 7.

A QKB is not delivered in the same installation as dfPower Studio. You can use an existing Quality Knowledge Base or you can install the most recent version of the Quality Knowledge Base, available at www.dataflux.com/qkb/.

Note: If you do choose to download the latest QKB version, install it after you install dfPower Studio.


System Requirements
                 Minimum                        Recommended
Platform         Windows NT 4.0/2000/2003/XP    Windows 2000 or Windows XP Professional
Processor        Pentium 4 1.2 GHz or higher    Pentium 4 2.2 GHz or higher
Memory (RAM)     512 MB                         2+ GB
Disk Space       5 GB                           10+ GB

dfPower Studio Installation Wizard


Use the following procedure to install your copy of dfPower Studio using the installation wizard.

Note: This procedure is designed for Windows users. Non-Windows users should refer to dfPower Studio Installation Command-Line Switches.

1. Insert the dfPower Studio CD into your computer's CD drive. A screen appears that offers several options. Select Install dfPower Studio 7.1.

2. From the Windows taskbar, choose Start > Run. The Run dialog box appears.

3. In the Open text box, type <drive>:\dfPowerStudio.exe, replacing <drive> with the letter for your CD drive. For example, if your CD drive is E, type e:\dfPowerStudio.exe.

4. Press Enter. The setup wizard begins with a screen that offers a general welcome. Click Next to continue.

Note: You can also start the dfPower Studio installation by navigating to and double-clicking the dfPowerStudio.exe file on the dfPower Studio CD.


5. The Choose Destination Location screen appears. Use this screen to specify the directory where you want to install dfPower Studio. We suggest you use the default directory. Do not install a new version of dfPower Studio into an older version's directory; for example, do not install version 7.1 into an existing 7.0 directory. However, if you need to, you can re-install the same version of dfPower Studio into the same directory. Click Next to continue.


6. The Select Additional Components screen appears. Use this screen to specify the dfPower Studio applications you want to install. By default, all applications are installed; however, only the applications you have licensed will be operational. Installing applications for which you have no license will not harm your computer or your dfPower Studio installation, and it makes later licensing of additional applications easier. If you do not install all the applications now, you can install additional applications later by re-running the installation. Click Next to continue.


7. The Copy Resources screen appears. If you have a previously installed version of dfPower Studio on your computer, you can use this screen to make all your existing reports and jobs available to the new version of dfPower Studio. Click Next to continue.


8. The dfPower Studio License Agreement screen appears. Use this screen to review and accept the software license agreement. Click Accept to continue.


9. The Select Program Manager Group screen appears. Use this screen to specify the name for the Program Manager group that is added to the Windows Start menu. By default, the group will be named DataFlux dfPower Studio 7.1. Click Next to continue.


10. The Start Installation screen appears. Use this screen to confirm your installation wizard selections so far. Click Next to continue.

11. The dfPower Batch Scheduler screen appears. dfPower Studio includes a batch scheduler as part of its Base - Batch application. Click Yes to install this scheduler as a Windows service, or click No to install it as a standard executable file.

12. When the installation wizard completes, check for any available dfPower Studio patches or updates, available at www.dataflux.com/products/update.asp.


dfPower Studio Installation Command-Line Switches


Two command-line switches allow you to modify the way dfPower Studio installs:

/S - Install dfPower Studio in "silent mode." Use this switch in conjunction with the /M switch described below.

/M=<filename> - Use installation variables from an external text file. The available variables are:

- MAINDIR=<path> - Specify the dfPower Studio installation location.
- COMPONENTS=[A] [B] [C] [D] - Specify the dfPower Studio application components to install: A = dfPower Verify, B = dfPower Match, C = dfPower Customize, D = dfPower Profile.
- RUNMDAC=[YES/NO] - Specify whether to install MDAC (Microsoft Data Access Components).
- RUNCONVWIZ=[YES/NO] - Specify whether to run the Conversion Wizard utility. The conversion wizard changes dfPower Studio version 4.x-style jobs into version 5.x/6.x-style jobs. Note: Specifying YES will bring up a related dialog box during installation.
- RUNSASODBC=[YES/NO] - Specify whether to install the SAS ODBC driver.
- RUNODBC=[YES/NO] - Specify whether to install DataFlux ODBC drivers. Note: This variable also controls whether the sample DataFlux database is installed. If you do not install this database, none of the sample jobs set up in dfPower Studio will work correctly.

Sample installation command line:

    dfPowerStudio.exe /S /M=dfinst.txt

where dfinst.txt contains the following text:

    MAINDIR=C:\Program Files\DataFlux\dfPower Studio\7.1
    COMPONENTS=ABCD
    RUNMDAC=NO
    RUNCONVWIZ=NO
    RUNSASODBC=NO

Downloading and Installing dfPower Verify Databases


If your dfPower Studio installation included dfPower Verify, use one or both of the following procedures to install the proper USPS, SERP, and/or Geocode databases to make dfPower Verify work correctly. Until you install the proper database(s), dfPower Verify will operate in "trial mode" using a sample North Carolina database. If you are licensed to use dfPower Verify QAS, you must acquire the postal reference databases directly from QAS for the countries they support. For more information, contact your DataFlux representative. A few things to consider when installing QAS data:

1. Install dfPower Studio. This installs the QAS libraries and required configuration files.

2. Start the installation wizard for your QAS data, as described later in Installing dfPower Verify Databases.

3. The installation wizard will ask you to select the application you want to update. Specify the path to the bin directory where you installed dfPower Studio. The default location is \Program Files\DataFlux\dfPower Studio\7.1\bin.

4. Continue with the installation. When the installation wizard prompts you for a license key, enter your key for the locale you are installing.

If QAS data is already installed on your computer and you want to use it with dfPower Studio, perform the following steps:

1. In the bin directory where you installed dfPower Studio (the default location is \Program Files\DataFlux\dfPower Studio\7.1\bin), use a text editor such as Windows Notepad to open the qalicn.ini file.

2. Enter your license keys, each on a separate line. (For additional information, see the instructions in the qalicn.ini file.)

3. Copy your existing qawserve.ini file to the bin directory.

If you have dfPower Verify World address databases, follow these steps to install the data and configure dfPower Studio appropriately:

1. Copy the country database (.MD file) and the address configuration file (addressformat.cfg) into the same folder of your choice. For example:

   C:\World_Database\AD40NZL.MD (database file)
   C:\World_Database\addressformat.cfg (configuration file)

2. Use a text editor to edit your dfStudio7.ini file (typically found in your Windows system directory) to add entries for WorldDir and WorldLic. Make sure WorldDir points to the directory where you installed the database and configuration file, and then copy your license key next to the WorldLic entry:

   WorldDir=C:\World_Database
   WorldLic=license key goes here

3. In dfPower Studio, in the Options dialog, under the References section, set the World Address Database Directory to the same directory you used in the dfStudio7.ini file. In this example, you would type C:\World_Database in that text box.

Downloading dfPower Verify Databases


If you have received a CD from DataFlux containing the dfPower Verify database installation file, skip to the next procedure, Installing dfPower Verify Databases. If you do NOT have such a CD, use the following procedure to download the installation file from the DataFlux FTP server.

Before You Begin: This procedure is designed for Windows users and requires an active Internet connection.

If you prefer, you can use your favorite FTP client or the FTP command instead of Internet Explorer to download this file as described in the following procedure. This procedure requires a DataFlux FTP username and password. If you do not know your username and password, contact DataFlux Technical Support at 919-531-9000.

1. Start Microsoft's Internet Explorer.

2. In Internet Explorer, choose Tools > Internet Options. The Internet Options screen appears.

3. On the Advanced tab, check Enable Folder View for FTP Sites, then click OK.

4. In the Address box near the top of the Internet Explorer screen, type ftp://<username>:<password>@ftp.dataflux.com/usps/all/win, where <username> is your DataFlux FTP username and <password> is your DataFlux FTP password. For example, if your username is fred and your password is ABC123, type ftp://fred:ABC123@ftp.dataflux.com/usps/all/win. Do not include any spaces.

5. Press Enter. The DataFlux FTP directory that contains the dfPower Verify database installation file appears.

6. Right-click the appropriate installation file, then click Copy to Folder. The Browse for Folder screen appears. For dfPower Verify US users, this file is VerifyDataSetup.exe. For dfPower Verify Canada users, this file is VerifyCanDataSetup.exe. For dfPower Verify Geocode US, dfPower Verify Geocode Canada, and dfPower Verify PhonePlus users, this file is VerifyGeoPhoneDataSetup.exe.

7. Use the Browse for Folder screen to select a location for the installation file, then click OK. Internet Explorer downloads the file.


Note: Be sure to select a location to which you have write access and which has at least 430 MB of free space.

Installing dfPower Verify Databases


If you have received a CD from DataFlux containing the dfPower Verify database installation file, insert that CD in your CD drive. If you have downloaded the installation file from the DataFlux FTP server, make sure you know the file's exact location.

Note: This procedure is designed for Windows users.

1. Close all other Windows applications.

2. Browse to and double-click the dfPower Verify database installation file on the CD or in your download location. The setup wizard begins.

3. Follow the setup wizard instructions.

Launching dfPower Studio


After you install dfPower Studio and any required databases, use the following one-step procedure to launch dfPower Studio.

1. From the Windows taskbar, choose Start > Programs > DataFlux dfPower Studio 7.1 > dfPower Studio. The dfPower Studio main screen appears.


Tip
- If a small dfPower Studio Trial Version screen appears, click Start dfPower to display the dfPower Studio main screen.
- To exit dfPower Studio, from the dfPower Studio main screen, choose Studio > Exit.

Licensing dfPower Studio


Use the following procedure to license dfPower Studio and gain access to all dfPower Studio applications that you have licensed.

1. If necessary, launch dfPower Studio.

2. From the dfPower Studio main screen, choose Help > Licensing. The dfPower Studio Licensing screen appears.

3. Note your five-digit Machine Code and License Directory.

4. Exit dfPower Studio.


5. Use your Web browser to navigate to www.dataflux.com/customers/unlock.asp. A DataFlux Product Registration form appears.

6. Complete the form. Be sure to enter your Machine Code correctly and to complete the form fully.

7. Click Submit to submit the form to DataFlux. Within 24 hours, DataFlux will email you a license file, studio.lic.

8. Copy studio.lic to the License Directory you noted in Step 3. (By default, this directory is Program Files\DataFlux\dfPower Studio\7.1\license.)

9. Re-launch dfPower Studio. The dfPower Studio title bar no longer includes the words Trial Version, and the dfPower Studio main screen displays the appropriate dfPower Studio applications based on your license agreement.

Note: Rather than using the Product Registration form in steps 5-7, you can email the following information to unlock@dataflux.com:

- Your first and last name
- Your company's name
- Your email address
- Your work telephone number
- Your work address
- Your Machine Code
- The DataFlux product you want to license (dfPower Studio)
- The dfPower Studio release (7.1)
- Your computer's operating system (for example, Windows 2000 Professional)

Moving Around in dfPower Studio


Moving around in dfPower Studio is similar to moving around in other Windows applications. The only significant differences in navigation are:

- The main toolbar appears at the left side of the dfPower Studio main screen, while many Windows applications instead have toolbars along the top of the screen.
- Some actions in dfPower Studio will open a separate application screen for the task you want to perform. For example, if you click Architect in the toolbar, dfPower Studio opens a separate dfPower Architect screen.

Accessing dfPower Studio Online Help


In addition to this User's Guide, dfPower Studio provides an online Help system that describes the finer details of using dfPower Studio. Categories of information in the online Help include:

- Help Me Understand, which contains explanations of key dfPower Studio concepts.
- How Do I, which contains step-by-step instructions for performing common dfPower Studio tasks.


- Screen Descriptions, which contains detailed descriptions of dfPower Studio screens and the controls and fields on each of those screens.
- Reference, which contains details on data types and layouts, codes, metrics, the Expression Engine Language, and the like.

There are two main ways to access the online Help:

- On any dfPower Studio screen that has a menu bar, choose Help > Help Topics. The dfPower Studio Help opening screen appears.
- If available, click a screen's Help button. The dfPower Studio Help screen appears, usually displaying a detailed description of the current screen, as well as the controls and fields on that screen.

Regardless of how you access the online Help, you can use the navigation pane at the left side of the Help screen to display other Help topics.


Tip
- If the navigation pane does not appear automatically, click Show Navigation Pane near the top of the Help screen.
- Help categories display book icons. Help topics display page icons.
- To list the topics available in a category (book), click the category's name. To view a topic, click the topic's name.
- Use the navigation pane's Search tab to find Help topics.
- For more information on this and other online Help features, see the Tips for Using this Online Help System topic in the online Help.


Chapter 3: dfPower Studio in Action


This chapter walks through a sample data management scenario to provide you with an overview of using dfPower Studio. It then walks you through a Profile job and an Architect job.

Sample Data Management Scenario


The sample data management scenario highlights the typical procedures in a data-management project, following the DataFlux five-building-block methodology shown in the following illustration. Topics in this chapter include:
- Data Management Scenario Background
- Profiling
- Quality
- Integration
- Enrichment
- Monitoring

Figure 6. The Building Blocks of the AIC Process

Other Sources
This chapter highlights some typical procedures in a data-improvement project. For step-by-step details of procedures, see the How Do I... topics in the online Help. This chapter covers only part of the functionality available from dfPower Studio. For more information on what you can accomplish with dfPower Studio, see the online Help.

The first scenario in this chapter follows a straight-line path through the DataFlux methodology. In practice, however, this methodology is highly iterative; as you find problems, you will dig deeper to find more problems, and as you fix problems, you will find more problems to fix.

Throughout the data management scenario, we mention the dfPower Studio applications we use to accomplish each task. In most cases, a task can be performed by more than one application, although the available options and specific steps to accomplish that task might differ. For example, you can profile data with both dfPower Profile and dfPower Architect's Basic Statistics and Frequency Distribution job steps.

Data Management Scenario Background


State Bank services thousands of personal, business, and public-sector accounts, with offices throughout the state. State Bank has just acquired County Bank, and needs to integrate the customer records from both banks. Our job is to:
- ensure that all records follow the same data standards,
- join the records into one database,

- identify and merge duplicate records, and
- prepare the records for an upcoming mailing to all customers.

Profiling
Profiling is a proactive approach to understanding your data. Also called data discovery or data auditing, data profiling helps you discover the major features of your data landscape: what data is available in your organization and the characteristics of those data. In preparing for an upcoming mailing, we know that invalid and non-standard addresses will cause a high rate of returned mail. By eventually standardizing and validating the data, we will lower our risk of not reaching customers and incurring unnecessary mailing costs. Also, by understanding the data structure of both banks' customer records, we will be better able to join, merge, and de-duplicate those records.

Where are your organization's data? How do data in one database or table relate to data in another? Are your data consistent within and across databases and tables? Are your data complete and up to date? Do you have multiple records for the same customer, vendor, or product? Good data profiling serves as the foundation for successful data-management projects by answering these questions up front. After all, if you do not know the condition of your data, how can you effectively address your data problems? Profiling helps you determine what data and types of data you need to change to make your data usable.

Data Profiling Discovery Techniques

The many techniques and processes used for data profiling fall into three major categories:
- Structure Discovery
- Data Discovery
- Relationship Discovery

The next three sections address each of these major categories in our State Bank/County Bank scenario. Caution! As you profile your own data, resist the urge to correct data on the spot. Not only can this become a labor-intensive project, but you might be changing valid data that only appears invalid. Only after you profile your data will you be fully equipped to start correcting data efficiently.

Structure Discovery
The first step in profiling the State and County Bank data is to examine the structure of that data. In structure discovery, we will determine if:
- our data match their corresponding metadata,
- data patterns match expected patterns, and
- data adhere to appropriate uniqueness and null-value rules.

Discovering structure issues early in the process can save much work later on, and establishes a solid foundation for all other data-management tasks.


To get started, we launch the Configurator component of the dfPower Profile application from the dfPower Studio main screen, connect to the State Bank database and customer table, select all fields in the table, select all available metrics, and run the resulting Profile job. When the job is complete, the results appear in dfPower Profile's Viewer component. Here is some of the information we can see about each data field:
- Column profiling: For each field, a set of metrics such as data type, minimum and maximum values, null and blank counts, and data length.
- Frequency distribution: How often the same value occurs in that field, presented both as text and in a chart.
- Pattern distribution: How often the same pattern occurs in that field, presented both as text and in a chart.

The information we can glean from each metric depends on the field. For example, we look at some column profiling metrics for the State Bank field that specifies the year customers opened their first account with the bank, 1STACCTOPND:

State Bank 1STACCTOPND Field
Metric Name      Metric Value
Data Type        VARCHAR
Unique Count     976
Pattern Count    2
Minimum Value    01
Maximum Value    2208
Maximum Length   4
Null Count       24
Blank Count      35
Actual Type      integer
Data Length      20

These metrics highlight several issues:
- The official type set for the field is VARCHAR, but the actual values are all integers. While this is not a major data problem, it can slow processing.
- The pattern count is 2, indicating that the years are not recorded in a consistent pattern. Perhaps a year such as 1980 is sometimes stored as just 80.
- The maximum length of the actual data values is 4, but the data length reserved for the field is 20. That is 16 characters of wasted space for each record.
- Both the null and blank counts are greater than zero. Is it the bank's policy to have a value in this field for every customer?
Note: We will look at the Unique Count, Minimum Value, and Maximum Value metrics later in Data Discovery.
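The computations behind these column profiling metrics are straightforward to reproduce. The following is a minimal Python sketch, assuming a list of raw string values; it is illustrative only, not dfPower Profile's implementation:

import re

def profile(values):
    non_null = [v for v in values if v is not None]
    non_blank = [v for v in non_null if v.strip() != ""]
    # Pattern: digits -> 9, uppercase -> A, lowercase -> a
    patterns = {re.sub(r"[a-z]", "a", re.sub(r"[A-Z]", "A", re.sub(r"[0-9]", "9", v)))
                for v in non_blank}
    return {
        "Unique Count": len(set(non_blank)),
        "Pattern Count": len(patterns),
        "Minimum Value": min(non_blank),   # string comparison, as for a VARCHAR field
        "Maximum Value": max(non_blank),
        "Maximum Length": max(len(v) for v in non_blank),
        "Null Count": len(values) - len(non_null),
        "Blank Count": len(non_null) - len(non_blank),
    }

print(profile(["1936", "01", None, "  ", "2208"]))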


Next, we look at the pattern frequency distribution for this field:


State Bank 1STACCTOPND Field
Pattern   Alternate   Count   Percentage
99        9(2)        229     0.0458
9999      9(4)        4771    0.9542

The table above shows that 1STACCTOPND years are expressed with both two and four digits. Fortunately, the 1STACCTOPND field does appear to contain year data, as we expected.

Note: dfPower Studio expresses patterns using 9 for a digit, A for an uppercase letter, a for a lowercase letter, and spaces and punctuation marks for spaces and punctuation marks. In addition, dfPower Studio also expresses alternate shorthand patterns, where each 9, A, and a is followed by a number in parentheses that provides a count of each character type at that location in the value, unless that count is 1. The following table illustrates dfPower Studio patterns and alternates.
Value                Pattern              Alternate
1                    9                    9
1997                 9999                 9(4)
1-919-555-1212       9-999-999-9999       9-9(3)-9(3)-9(4)
(919) 555-1212       (999) 999-9999       (9(3)) 9(3)-9(4)
Mary                 Aaaa                 Aa(3)
Mary Smith-Johnson   Aaaa Aaaaa-Aaaaaaa   Aa(3) Aa(4)-Aa(6)
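For illustration, here is a small Python sketch of how such patterns and alternates could be derived from a value. It approximates the notation described above; it is not DataFlux's actual pattern engine:

import itertools

def pattern(value):
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)

def alternate(value):
    parts = []
    for ch, grp in itertools.groupby(pattern(value)):
        n = len(list(grp))
        if n > 1 and ch in "9Aa":
            parts.append(f"{ch}({n})")  # collapse runs of the same character class
        else:
            parts.append(ch * n)
    return "".join(parts)

print(pattern("1-919-555-1212"))     # 9-999-999-9999
print(alternate("1-919-555-1212"))   # 9-9(3)-9(3)-9(4)
print(alternate("Smith-Johnson"))    # Aa(4)-Aa(6)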

We continue to review these metrics for every field in this table, making notes on each data problem we find. We also review metrics for County Bank's fields, where we find additional data problems. For example, County Bank's Phone field uses three different patterns.

Data Discovery
Our second data profiling step, data discovery, will help us determine whether our data values are complete, accurate, and unambiguous. If they are not, this can prevent us from adequately recognizing the needs of the banks' customers, and can make it difficult to tie these records to other data. We take a second look at the set of metrics we first saw in structure discovery for the 1STACCTOPND field:


State Bank 1STACCTOPND Field
Metric Name      Metric Value
Data Type        VARCHAR
Unique Count     976
Pattern Count    2
Minimum Value    01
Maximum Value    2208
Maximum Length   4
Null Count       24
Blank Count      35
Actual Type      integer
Data Length      20

This time, focusing on the Unique Count, Minimum Value, and Maximum Value metrics, we discover new problems:
- Given that State Bank was founded in 1936, the likely range of years is 1936 through the current year, or about 70 years. However, the field's unique count is 976, implying there are many more values than just 1936, 1937, 1938, and so on.
- The Minimum Value is 01. This appears to be part of the problem we saw earlier with dates being expressed in one to four digits. Does 01 mean 2001?
- The Maximum Value is 2208. Assuming no State Bank employees or customers are time travelers, this must be incorrect.

We do this analysis for every field for both banks.


We next look at part of the frequency distribution for each field. The following table shows the frequency distribution for State Bank's 1STACCTOPND field.

State Bank 1STACCTOPND Field
Value   Count   Percentage
54      14      0.0028
55      9       0.0018
56      6       0.0012
1936    34      0.0068
1937    41      0.0082
1938    43      0.0086
2001    165     0.0330
2002    189     0.0378
2003    203     0.0406

We already determined from the pattern frequency distribution in structure discovery that there are two patterns for 1STACCTOPND and that there are 229 1STACCTOPND values we need to change to always express years with four digits. Now, however, we can see the actual values, and the distribution of the four-digit years seems right, with each year showing progressively more customers who remain with the bank. Next, we look at these same count metrics for both banks across several fields:
State Bank
               Tax ID   Gender   Street Address   City   State   ZIP    Phone
Count          4985     3873     4998             4650   4650    4747   4184
Null Count     15       1127     2                350    114     253    816
Unique Count   4879     4        3962             27     4       46     3922

County Bank
               Tax ID   Gender   Street Address   City   State   ZIP    Phone
Count          997      836      985              926    966     881    766
Null Count     3        164      15               74     34      119    234
Unique Count   916      3        815              6      2       86     692

The two preceding tables clearly show that a large number of records from both banks are incomplete. In particular, null counts greater than 0 for both banks show that data is missing for Gender, City, State, ZIP, and Phone. Also, County Bank has some missing data for Street Address. Looking at the Count and Unique Count values, we are glad to discover that both values are high for both banks' Tax ID fields. This means we should be able to use this field to identify most if not all customers uniquely, greatly helping us find multiple records for the same customer within and across the banks.


The Unique Count values also reveal some problems:
- State Bank's Gender field contains four unique values, even though there are only three valid Gender values: F (female), M (male), and U (unknown).
- County Bank's City field contains six unique values while the ZIP field contains 86 unique values. Something is wrong here.

To find other issues, we review data length and type metrics for both banks across several fields:
State Bank
                 Tax ID    Name      Street Address   City      State     ZIP
Data Length      9         30        75               20        25        10
Maximum Length   9         27        39               10        20        5
Data Type        INTEGER   VARCHAR   VARCHAR          VARCHAR   VARCHAR   INTEGER

County Bank
                 Tax ID    First Name   Last Name   Street Address   City      State     ZIP
Data Length      11        20           20          50               25        15        10
Maximum Length   9         17           18          47               23        3         5
Data Type        VARCHAR   VARCHAR      VARCHAR     VARCHAR          VARCHAR   VARCHAR   VARCHAR

These metrics highlight some more problems:
- State Bank stores customers' full names in a single Name field, while County Bank uses separate First Name and Last Name fields.
- While many of the other fields in both tables have the same name, the fields are of different lengths. For example, the City Data Length is 20 for State Bank and 25 for County Bank. Also, the City Maximum Length is 23 for County Bank, which is larger than State Bank's City Data Length of 20. This difference would cause truncated City data if we merged the two tables without first making some changes.
- Similar fields also have different data types between the two tables. For example, State Bank's Tax ID field is set to INTEGER, while County Bank's Tax ID is set to VARCHAR.

We next look at the actual records. To do this from the report in dfPower Profile's Viewer component, we select a field with questionable metrics, and then double-click on that metric. For example, to view records with 80 as the 1STACCTOPND value, in the Viewer we select the 1STACCTOPND field, show the frequency distribution for that field, then double-click on the 80 value. The following two tables show selected field values for a random five of each bank's customer records:


State Bank
Name          Tax ID      Gender   Street Address         City     State            ZIP     Phone
Bill Jones    152387623   M        1324 New Road                   North Carolina   27712   716-479-4990
Sam Smith     587354917            253 Forest RD          Durham   NC
Julie Swift   784632845   F        159 Merle St                                     27513
Mary Wise     785213678   O        115 Dublin Woods DR                              27712
Jim           458795477   U        100 Main Street East   Cary     NC               32934   919-662-5301

Bill Jones' record is missing a City value; Sam Smith's record is missing Gender, ZIP, and Phone values; Julie Swift's record is missing City, State, and Phone values; and Mary Wise's record is missing City, State, and Phone values. Viewing the actual record data, we also discover something metrics did not show:
- The area code for Bill Jones' telephone number is invalid because 716 is a code for western New York.
- Jim's Name field contains a value, but there is no last name.
- Jim's ZIP code is invalid because 32934 is a code for central Florida.
- The street designations in Street Address are valid but inconsistent. For example, Bill Jones' record uses Road while Sam Smith's uses RD.
- The state designations in State are valid but inconsistent. For example, Bill Jones' record uses North Carolina while Sam Smith's uses NC.

We can also see why State Bank's Gender field has four unique values: Mary Wise's Gender value is an invalid O. To see how extensive this problem is, we can later look at the frequency distribution for the Gender field to see how often O appears in the field.
County Bank
First Name   Last Name   Tax ID        Gender   Street Address     City        State            ZIP     Phone
William      Jones       458-795-689   M        1324 Milton Road   Cary        North Carolina           (919) 303-9516
Joe          Mead        487-562-389            159 Poof St        Melbourne                    32934
Brad         Martin      455-687-747   F                           Durham                       27712   919 485 1963
James        Smith       467-888-898            100 Main Street    Carrboro    NC               27514   919-479-4992
Jim          Smith       467-885-898   M        100 Main E St                                   27514

County Bank's records are also missing data, specifically Gender, Street Address, State, ZIP, and Phone values. As with State Bank, when we view the actual County Bank record data, we discover some problems metrics did not show:


- Brad Martin's Gender field indicates a female, but Brad is an unusual name for a woman.
- The bottom two records appear as if they might be for the same customer. The names and addresses are similar, but the Tax IDs differ by one digit. Is this a father and son living in the same household with two similar Tax IDs, or one man with a typo in his Tax ID?

By examining both banks' tables, we also discover that the Tax ID patterns are different. State Bank's Tax ID uses a 999999999 pattern, while County Bank's uses a 999-999-999 pattern.

Relationship Discovery
Our final data profiling step, relationship discovery, can help us answer these questions:
- Does the data adhere to specified required key relationships across columns and tables?
- Are there inferred relationships across columns, tables, or databases?
- Are there redundant data?
Here are some problems that can result from the wrong relationships:
- A product ID exists in your invoice register, but no corresponding product is available in your product database. According to your systems, you have sold a product that does not exist.
- A customer ID exists on a sales order, but no corresponding customer is in your customer database. In effect, you have sold something to a customer with no possibility of delivering the product or billing the customer.
- You run out of a product in your warehouse with a particular UPC number. Your purchasing database has no corresponding UPC number. You have no way to restock the product.

For the purposes of demonstrating relationship discovery with our State Bank scenario, let us say we know from earlier profiling that while both banks use Standard Industrial Classification (SIC) codes for business customers, some of County Bank's SIC codes are invalid because County Bank did not use a SIC code lookup table as State Bank did. To help determine how much work might be involved in identifying and fixing the invalid codes for County Bank's records, we will run two analyses:
- A redundant data analysis will determine how many values are common between the County Bank customer records and State Bank's SIC code lookup table.
- A primary key/foreign key analysis will show us which specific values in the customer records do not match, and thus are likely to be the invalid codes.

To run our redundant data analysis, we start with dfPower Profile's Configurator component. On the Configurator main screen, we right-click on the Code field in State Bank's SIC Lookup table and choose Redundant Data Analysis. In the screen that appears, we select the SIC field in County Bank's customer records table. Then, we run the job, and the results appear in dfPower Profile's Viewer component. In the Viewer, we select the SIC codes records table and Code field and display the Redundant Data Analysis results:
Field Name   Table Name   Primary Count   Common Count   Secondary Count
Code         SIC Lookup   1516            873            127

(The overlap of values between the Code and SIC fields is also shown graphically on a Venn diagram.) The results tell us that in the customer table, 873 records have SIC code values that are also in the SIC Lookup table's Code field, the SIC Lookup table contains 1,516 codes that are not used by the customer table, and the customer table has 127 occurrences of invalid SIC codes. This means we need to update 127 records in County Bank's customer table.
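Conceptually, a redundant data analysis reduces to set logic between the two fields. Here is a rough Python sketch; the variable names and sample codes are invented for illustration and are not the scenario's actual data:

# Hypothetical stand-ins for the SIC Lookup table's Code field (primary)
# and the customer table's SIC field (secondary).
lookup_codes = {"0111", "0112", "2382", "5065"}
customer_sics = ["0111", "0111", "2382", "1007"]

common = set(customer_sics) & lookup_codes          # values present in both fields
primary_only = lookup_codes - set(customer_sics)    # lookup codes never used
outlier_rows = [s for s in customer_sics if s not in lookup_codes]

print("Common Count:", len(common))
print("Primary Count (unused lookup codes):", len(primary_only))
print("Secondary Count (invalid occurrences):", len(outlier_rows))
print("Match Percentage:",
      100 * (len(customer_sics) - len(outlier_rows)) / len(customer_sics))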


Now let us look at the same tables using primary key/foreign key analysis. To do this, we again start with dfPower Profile's Configurator component. On the Configurator main screen, we right-click on the Code field in State Bank's SIC Lookup table and choose Primary Key/Foreign Key Analysis. In the screen that appears, we select the SIC field in County Bank's customer records table. Then, we run the job, and the results appear in dfPower Profile's Viewer component. In the Viewer, we select the SIC lookup table's Code field and display the Primary Key/Foreign Key Analysis results:
Field Name   Table Name   Match Percentage
Code         SIC Lookup   87.3

Outliers for Field: SIC
Value   Count   Percentage
1007    47      0.047
2382    13      0.013
2995    3       0.003
4217    16      0.016
5068    9       0.009
7223    32      0.032
9327    7       0.007

The Match Percentage confirms that of all the SIC codes in County Bank's customer records, 87.3% are also in State Bank's SIC code lookup table. Much more valuable for our purposes, though, is the list of outliers for the SIC field. We can now see that although 12.7% of the SIC codes in County Bank's customer records are invalid, only seven distinct codes (1007, 2382, 2995, 4217, 5068, 7223, and 9327) are involved. This could be pretty good news if the invalid codes are consistent; if so, we only need to figure out how the seven invalid codes map to valid ones.

With all that we have learned about our data in data profiling (through structure discovery, data discovery, and relationship discovery), we are now ready to move on to the next data-management building block, Quality. For more information on Profiling, start with the following topics in the dfPower Studio online Help:
- dfPower Profile Introduction
- dfPower Profile Getting Started
- dfPower Profile Main Screen
- dfPower Profile Metrics
- dfPower Profile Custom Metrics
- dfPower Profile Viewer Main Screen
- dfPower Profile Viewer DataFlux DB Record Viewer Screen

Quality
Using the next data management building block, Quality, we start to correct the problems we found through profiling. When we correct problems, we can choose to correct the source data, write the corrected data to a different field or table, or even write the proposed corrections to a report or audit file for team review. For example, we might leave the County Bank customer table untouched, writing all the corrected data to an entirely new table.


Typical operations to improve data quality include parsing, standardization, and general data sanitization.

The goal of these operations is to improve overall consistency, validity, and usability of the data. Quality can also include creating match codes in preparation for joining and merging in the Integration building block.

Parsing
Parsing refers to breaking a data string into its constituent parts based on the data's type. Parsing is usually done in preparation for later data-management work, such as verifying addresses, determining gender, and integrating data from multiple sources. Parsing can also improve and speed up database searches. For our data, parsing State Bank's Name field into separate First Name and Last Name fields will help us in three ways:
- It will make integrating the State Bank and County Bank name data easier.
- It will make it possible to verify existing State Bank gender identification and fill in missing values.
- For an upcoming customer mailing, it will allow us to address customers in a more personal way. For example, we can open a letter with Dear Edward or Dear Mr. Smith instead of Dear Edward Smith.

Parsing Techniques

Using the dfPower Architect application, we can both parse names and identify gender for State Bank customers. To parse, we launch dfPower Architect from the dfPower Studio main screen, use the Data Source job step to add the State Bank customer table to the dfPower Architect job flow, add a Parsing step, indicate that dfPower Architect should look in the Name field for both first and last names, then output the parsed values into separate First Name and Last Name fields. The following two tables show values before and after parsing:
State Bank Before Parsing
Name          Gender
Bill Jones    M
Sam Smith
Julie Swift   F
Mary Wise     O
Jim           U


State Bank After Parsing
Name          Gender   First Name   Last Name
Bill Jones    M        Bill         Jones
Sam Smith              Sam          Smith
Julie Swift   F        Julie        Swift
Mary Wise     O        Mary         Wise
Jim           U        Jim

Notice that we did not remove the Name values or change any of the Gender values. We just added fields and values for First Name and Last Name. Also, note that dfPower Architect assumed Jim was a first name and left the Last Name field blank. With First Name in its own field, we can now use dfPower Architect for gender analysis. To do this, we add a Gender Analysis (Parsed) step and indicate that dfPower Architect should look in the new First Name field to determine gender and output updated gender values back to the existing Gender field. Our data now looks like this:
State Bank After Parsing
Name          Gender   First Name   Last Name
Bill Jones    M        Bill         Jones
Sam Smith     U        Sam          Smith
Julie Swift   F        Julie        Swift
Mary Wise     F        Mary         Wise
Jim           M        Jim

This time, the Gender field was updated. Note that Sam Smith is marked as U because Sam is a common nickname for both men (Samuel) and women (Samantha). Also note that the Gender value for Jim was changed from U (unknown) to M (male). We could have used the dfPower Architect Gender Analysis step both to parse the Name field and identify gender in the same step. However, we would not then have been able to use the First Name and Last Name fields, which we know we will need later for our mailing. We also could have used the Gender Analysis step if the Name field contained first, middle, and last names; this approach would allow Architect to identify gender even better by using both first and middle names.
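To make the parsing and gender analysis steps concrete, here is a toy Python sketch. The name-to-gender lookup table is invented for illustration; dfPower uses its own parse and gender definitions:

# Hypothetical lookup; "sam" is ambiguous, so it maps to U (unknown).
GENDER_BY_FIRST_NAME = {"bill": "M", "jim": "M", "julie": "F", "mary": "F", "sam": "U"}

def parse_name(name):
    parts = name.split()
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""   # single token -> first name only
    return first, last

def gender_of(first):
    return GENDER_BY_FIRST_NAME.get(first.lower(), "U")

for name in ["Bill Jones", "Sam Smith", "Jim"]:
    first, last = parse_name(name)
    print(name, "->", first, "/", last or "(blank)", "/", gender_of(first))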

Standardization
While profiling our data, we discovered several inconsistency issues, including:
- Different ways of expressing street designations, such as Road and RD.
- Different ways of expressing state designations, such as North Carolina and NC.
- Two different ways of expressing years: with two digits and with four digits.


Let us look first at the street and state designations:


State Bank
First Name   Last Name   Street Address         State
Bill         Jones       1324 New Road          North Carolina
Sam          Smith       253 Forest RD          NC
Julie        Swift       159 Merle St
Mary         Wise        115 Dublin Woods DR
Jim                      100 Main Street East   NC

To standardize these designations, we again use the dfPower Architect application. Continuing the job flow we created for parsing and gender identification, we add a Standardization job step. For this job step, we instruct dfPower Architect to use the DataFlux-designed Address standardization definition for the Street Address field and the State (Two Letter) standardization definition for the State field. Both standardization definitions will change the addresses to meet US Postal Service standards. After standardizing these two fields, here is how our records look:
State Bank
First Name   Last Name   Street Address        State
Bill         Jones       1324 New Rd           NC
Sam          Smith       253 Forest Rd         NC
Julie        Swift       159 Merle St
Mary         Wise        115 Dublin Woods Dr
Jim                      100 Main St E         NC

Notice that two of the State fields are still blank. Standardization conforms existing values to a standard, but does not fill in missing values. Standardizing the two- and four-digit years requires more work. Here is what a sampling of the data looks like currently:
State Bank
First Name   Last Name   1STACCTOPND
Bill         Jones       1936
Sam          Smith       01
Julie        Swift       54
Mary         Wise        1937
Karen        Hyder       41
William      Travis      16
Fred         Jones       1962
Craig        Spencer     57
Mary         Kossowski   71


Because there is no DataFlux-designed standardization definition for this task, we will need to create a standardization scheme. To do this, we launch the dfPower Quality Analysis application from the dfPower Studio main screen, select State Bank's customer database and table, select the 1STACCTOPND field, specify a Phrase Analysis, and run the Analysis job. The results appear in the Analysis Editor, which lists each permutation of a year and how often that year occurs in the customer records. For each two-digit permutation, we set a standard to transform the two-digit value to the appropriate four-digit value. For the values 00 through 05, we will add 20 to the beginning of the value. For example, we will set each occurrence of 01 to transform to 2001. For the values 36 through 99, we will add 19 to the beginning. For example, we will set each occurrence of 45 to transform to 1945. Because State Bank opened in 1936, we know that the values 06 through 35 are invalid, so we will leave those values untouched for now; these values will later require some manual editing or the use of another table with valid year data.

After saving our new standardization scheme, we can now standardize years in a similar manner to how we standardized street and state designations. In dfPower Architect, we use a Standardization job step to instruct dfPower Architect to use our new standardization scheme (before, we used DataFlux-designed standardization definitions) to update two-digit values in the 1STACCTOPND field. Our data now looks like this:
State Bank
First Name   Last Name   1STACCTOPND
Bill         Jones       1936
Sam          Smith       2001
Julie        Swift       1954
Mary         Wise        1937
Karen        Hyder       1941
William      Travis      16
Fred         Jones       1962
Craig        Spencer     1957
Mary         Kossowski   1971

Notice that William Travis's 1STACCTOPND value remains unchanged because it is an invalid value we chose not to transform. One last thing before we leave Standardization: As you might recall, in Data Discovery we saw that State Bank's Tax ID field uses a 999999999 pattern, while County Bank's Tax ID field uses a 999-999-999 pattern. We will skip the details here, but to transform one of the patterns to the other, we use dfPower Customize to create a new standardization definition, and then use that new definition in a dfPower Architect Standardization job step.

Note: Assuming we had all the required standardization schemes and definitions at the outset, the most efficient way to standardize all four fields (Street Address, State, 1STACCTOPND, and Tax ID) would be to use one Standardization job step in dfPower Architect, assigning the appropriate definition or scheme to each field.
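The year scheme just described is easy to express in code. The following Python sketch applies the same rules (00 through 05 map to 20xx, 36 through 99 map to 19xx, and the known-invalid 06 through 35 pass through untouched); it is a simplification of what the saved standardization scheme does:

def standardize_year(value):
    if len(value) != 2 or not value.isdigit():
        return value                  # already four digits (or not a year)
    n = int(value)
    if n <= 5:
        return "20" + value           # 00-05 -> 2000-2005
    if n >= 36:
        return "19" + value           # 36-99 -> 1936-1999
    return value                      # 06-35: invalid, leave for manual review

for v in ["1936", "01", "54", "16"]:
    print(v, "->", standardize_year(v))
# 1936 -> 1936, 01 -> 2001, 54 -> 1954, 16 -> 16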

Matching
Matching seeks to uniquely identify records within and across data sources. Record uniqueness is a precursor to Integration activities. While profiling, we discovered some potentially duplicate records. As an example, we have identified three State Bank records that might all be for the same person, as shown below:


                 Record 1      Record 2      Record 3
First Name       Robert        Bob           Rob
Last Name        Smith         Smith         Smith
Street Address   100 Main St   100 Main      100 Main St
Tax ID           265-67-7890   265-67-7890   242-63-9990

To help determine if these records are indeed for the same customer, we launch the dfPower Integration Match application, select the State Bank database and table, assign the appropriate DataFlux-designed and custom-created match definitions to the fields we want considered during the match process, set match sensitivities and conditions, and run an Append Match Codes job. The result is a new field containing match codes, as shown below.
                 Record 1      Record 2      Record 3
First Name       Robert        Bob           Rob
Last Name        Smith         Smith         Smith
Street Address   100 Main St   100 Main      100 Main St
Tax ID           265-67-7890   265-67-7890   242-63-9990
Match Code       GHWS$$EWT$    GHWS$$EWT$    GHWS$$WWI$

Based on our match criteria, dfPower Integration Match considers Record 1 and Record 2 to be for the same person. Record 3 is very similar (note how the leading characters of its match code, GHWS$$, are the same as Records 1 and 2), but the Tax ID is different. Of course, we could have figured this out manually, but it gets much more difficult when you need to make matches across thousands of records. With this in mind, we create match codes for both banks.

Now that we have finished parsing, standardizing, and creating match codes for both banks' tables, we are ready to move on to the Integration building block. For more information on the Quality building block, start with the following topics in the dfPower Studio online Help:
- dfPower Architect Introduction
- dfPower Quality Analysis Introduction
- dfPower Quality Standardize Introduction
- dfPower Customize Introduction
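To make the idea of match codes more concrete, here is a toy Python sketch. The normalization rules, nickname table, and code layout are all invented for illustration; DataFlux match definitions are far more sophisticated than this:

# Hypothetical nickname normalization so "Bob"/"Rob" collide with "Robert".
NICKNAMES = {"bob": "robert", "rob": "robert"}

def match_code(first, last, address, tax_id):
    first = NICKNAMES.get(first.lower(), first.lower())
    street = address.lower().replace(" st", "").strip()   # drop street designation
    # Keep only coarse features so small variations still produce the same code.
    return "{}{}{}{}".format(first[:3], last.lower()[:4], street[:6], tax_id[-4:])

print(match_code("Robert", "Smith", "100 Main St", "265-67-7890"))
print(match_code("Bob", "Smith", "100 Main", "265-67-7890"))      # same code as above
print(match_code("Rob", "Smith", "100 Main St", "242-63-9990"))   # differs only in the Tax ID part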

Integration
Data integration encompasses a variety of techniques for joining, merging/de-duplicating, and householding data from a variety of data sources. Together, these techniques can help identify potentially duplicate records, merge dissimilar data sets, purge redundant data, and link data sets across the enterprise.

Joining
Joining is the process of bringing data from multiple tables into one. For our data, we have three options for joining:
- Combine the County Bank table into the State Bank table
- Combine the State Bank table into the County Bank table
- Combine both tables into a new table


We decide to combine both tables into a new table. As we start this process, we recall that the data types and lengths for similar fields differ between the two banks' tables, and that some data types and lengths should be adjusted for increased efficiency. For example, currently:
- The City Data Length is 20 for State Bank and 25 for County Bank. Also, the City Maximum Length is 23 for County Bank, which is larger than the City Data Length of 20 for State Bank.
- The 1STACCTOPND data length for State Bank is 20, but the maximum length of the data is 4, thus wasting space.
- The 1STACCTOPND data type for State Bank is VARCHAR, but the actual values are all integers.
- The Tax ID data type is set to INTEGER for State Bank, but VARCHAR for County Bank.

We could address all these issues by opening each bank's table using the appropriate database tool (for example, dBase or Oracle) and changing field names, lengths, and types as needed, making sure the data length for any given field is at least as large as the actual maximum length of that field in both tables. For example, for the City field, we would set the data length to at least 23 in both tables. However, we can change data types and lengths and join the two tables into a new table using just dfPower Architect.

To get started, we start a new job flow in dfPower Architect and use a Data Source job step to add the State Bank table to the flow. On the Output Fields tab for that job step, we click the Advanced button. This opens the Override Defaults screen, which lists each field and its length. To set a new length for a field, we specify that length in the Override column. To set a new type for a field, we specify that type in the Override column. To set a new length and type, we specify the length followed by the type in parentheses. For example, to change a field from a data length of 20 and a VARCHAR type to a data length of 4 and an INTEGER type, we specify 4(INTEGER). We take similar steps for the County Bank table, adding a second Data Source job step to the flow and using the Override Defaults screen to change data lengths and types as necessary.

Now that both bank tables are at the top of our job flow and similar fields have the same data lengths and types, we add a Data Union job step, and use the job step's Add button to create one-to-one mappings between similar fields in each table and the combined field in the new table. The Data Union job step will not create records with combined data, but rather will add records from one table as new records to the other without making any changes to those records. For example, we map the State Tax ID and County Tax ID fields to a combined TID field, and the State First Name field and County First Name field to a combined Customer First Name field. Finally, we use a Data Target (Insert) job step to specify a database and new table name, then run the job flow to create a new table that contains all the customer records for both banks.

Note: If we had a common field in both tables (ideally primary key/foreign key fields), we could have used a Data Joining job step in dfPower Architect to combine matching records as we combined the two tables. For example, if we had one table that contained unique customer IDs and customer name information and another table that had unique customer IDs and customer phone numbers, we could use a Data Joining job step to create a new table with customer IDs, names, and phone numbers.


To illustrate this approach, we will start with these two sets of records:
Customer ID   First Name   Last Name
100034        Robert       Smith
100035        Jane         Kennedy
100036        Karen        Hyder
100037        Moses        Jones

Customer ID   Phone
100034        (919) 484-0423
100035        (919) 479-4871
100036        (919) 585-2391
100037        (919) 452-8472

Using the Data Joining job step, we could identify Customer ID as the primary key/foreign key fields and create the following records:
Customer ID   First Name   Last Name   Phone
100034        Robert       Smith       (919) 484-0423
100035        Jane         Kennedy     (919) 479-4871
100036        Karen        Hyder       (919) 585-2391
100037        Moses        Jones       (919) 452-8472
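The same primary key/foreign key join can be sketched in a few lines of Python; the dictionaries below stand in for the two tables above:

# Table 1: Customer ID -> name fields; Table 2: Customer ID -> phone.
names = {100034: ("Robert", "Smith"), 100035: ("Jane", "Kennedy"),
         100036: ("Karen", "Hyder"), 100037: ("Moses", "Jones")}
phones = {100034: "(919) 484-0423", 100035: "(919) 479-4871",
          100036: "(919) 585-2391", 100037: "(919) 452-8472"}

# Inner join on the shared key, Customer ID.
joined = [(cid, first, last, phones[cid])
          for cid, (first, last) in sorted(names.items())
          if cid in phones]
for row in joined:
    print(row)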

Merging/De-duplicating
Now that we have all the customer records in one table, we can look for records that appear to be for the same customer and merge the data from all records for a single customer into a single surviving record. Let us look at selected fields in three records that might be for the same customer:
                 Record 1     Record 2     Record 3
First Name       Robert       Bob          Rob
Last Name        Smith        Smith        Smith
Street Address                100 Main     100 Main St
City             Carrboro                  Carrboro
State            NC
ZIP              27510        27510
Match Code       GHWS$$EWT$   GHWS$$EWT$   GHWS$$WWI$

Notice that each record contains a match code, which we generated in the Matching section of the Quality building block. We can use these codes as keys to find and merge duplicate records. To do this, we start by launching the dfPower Integration Match application from the dfPower Studio main screen, select our combined customer records table, choose Outputs > Eliminate Duplicates, then set a match definition of Exact for the Match Code field.


Note: Because we already generated and appended match codes to our records in the Quality building block, we only need to match on the Match Code field. If we did not already have match codes, however, we could specify match definitions and sensitivities for selected fields in Integration Match, and the application would use match codes behind the scenes to find multiple records for the same customer. We could even set up OR conditions such as: If First Name, Last Name, and Tax ID are similar, OR if First Name, Last Name, Street Address, and ZIP are similar.

We then choose Settings from the Eliminate Duplicates output mode and specify several options for our duplicate elimination job, including Manually Review Duplicate Records and Physically Delete Remaining Duplicates, and specify that surviving records should be written back to the Current Source Table. When the options are all set, we run the job. After processing, the Duplicate Elimination File Editor screen appears, showing our first surviving record from a cluster of records that might be for the same customer. The screen looks something like this:
Surviving record:
First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$

Cluster records:
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$
Bob          Smith       100 Main                            27510   GHWS$$EWT$
Rob          Smith       100 Main St      Carrboro                   GHWS$$WWI$

Notice that the surviving record and first record in the cluster are both checked and contain the same data. The Duplicate Elimination File Editor has selected the Robert Smith record as the surviving record, probably because it is the most complete record. However, the Street Address for that record is still blank. To fix this, we double-click on the Street Address value for the Rob Smith record to copy 100 Main St from the Rob Smith record to the Robert Smith record. The screen now looks something like this:
Surviving record:
First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith       100 Main St      Carrboro   NC      27510   GHWS$$EWT$

Cluster records:
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$
Bob          Smith       100 Main                            27510   GHWS$$EWT$
Rob          Smith       100 Main St      Carrboro                   GHWS$$WWI$

The Street Address is highlighted in red in the surviving record to indicate that we manually added the value. To review and edit the surviving record and record cluster for the next customer, we click Next and start the process again, repeating as necessary until all duplicate records have been merged and the excess records deleted.

Note: If you find patterns of missing information in surviving records, especially if you have thousands of potentially duplicate records, consider returning to dfPower Integration Match and setting field rules and/or record rules to help the Duplicate Elimination File Editor make better choices about surviving records.
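A simplified sketch of the survivor-selection idea might look like the following Python: pick the most complete record in a cluster, then borrow missing values from the other members. This illustrates the concept only; it is not the Duplicate Elimination File Editor's actual rule set:

def merge_cluster(records):
    # Survivor = record with the most populated fields.
    survivor = dict(max(records, key=lambda r: sum(1 for v in r.values() if v)))
    for record in records:
        for field, value in record.items():
            if value and not survivor.get(field):
                survivor[field] = value   # borrow a missing value from a duplicate
    return survivor

cluster = [
    {"First Name": "Robert", "Street Address": "", "City": "Carrboro", "ZIP": "27510"},
    {"First Name": "Bob", "Street Address": "100 Main", "City": "", "ZIP": "27510"},
    {"First Name": "Rob", "Street Address": "100 Main St", "City": "Carrboro", "ZIP": ""},
]
print(merge_cluster(cluster))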


Householding
Another use for match codes is householding. Householding entails using a special match code, called a household ID or HID, to link customer records that share a physical address and/or some other relationship. Householding provides a straightforward way to use data about your customers' relationships with each other to gain insight into their spending patterns, job mobility, family moves and additions, and more. Consider the following pair of records:
First Name   Last Name   Street Address   Phone
Joe          Mead        159 Milton St    (410) 569-7893
Julie        Swift       159 Milton St    (410) 569-7893

Given that Joe and Julie have the same address and phone number, it is highly likely that they are married or at least share some living expenses. Because our bank prefers to send marketing pieces to entire households rather than just individuals, and because our bank is considering offering special account packages to households that have combined deposits of more than $50,000, we need to know which customers share a household.

Note: Although a household is typically thought of in the residential context, similar concepts apply to organizations. For example, householding can be used to group together members of a marketing department, even if those members work at different locations.

To create household IDs, we use dfPower Integration Match to select our customer records table, and then set up the following match criteria as OR conditions:
- Match Code 1 (MC1): Last Name AND Address
- Match Code 2 (MC2): Address AND Phone
- Match Code 3 (MC3): Last Name AND Phone


Behind the scenes, dfPower Integration Match generates three match codes, as shown below. If any one of the codes matches across records, the application assigns the same persistent HID to all of those records. (A toy sketch of this linking logic follows the table.)
First Name   Last Name   Street Address         Phone            MC1   MC2   MC3   HID
Joe          Mead        159 Milton St          (410) 569-7893   $MN   #L1   %RQ   1
Julie        Swift       159 Milton St          (410) 569-7893   $RN   #L1   %LQ   1
Michael      Becker      1530 Hidden Cove Dr    (919) 688-2856   $BH   #H6   %B6   2
Jason        Green       1530 Hidden Cove Dr    (919) 688-2856   $GH   #H6   %G6   2
Becker       Ruth        1530 Hidden Cove Dr    (919) 688-2856   $RH   #H6   %R6   2
Courtney     Benson      841B Millwood Ln       (919) 231-2611   $BM   #M2   %B2   3
Courtney     Myers       841 B Millwood Lane    (919) 231-2611   $MM   #M2   %M2   3
David        Jordan      4460 Hampton Ridge     (919) 278-8848   $JH   #H2   %J2   5
Carol        Jordan      4460 Hampton Ridge     (919) 806-9920   $JH   #H2   %J8   5
Robin        Klein       5574 Waterside Drive   (919) 562-7448   $KW   #W5   %K5   7
Sharon       Romano      5574 Waterside Drive   (919) 562-7448   $RW   #W5   %R5   7
Carol        Romano      5574 Waterside Drive   (919) 239-7436   $RW   #W2   %R2   7
Melissa      Vegas       PO Box 873             (919) 239-2600   $VB   #B2   %V2   14
Melissa      Vegas       12808 Flanders Ln      (919) 239-2600   $VF   #F2   %V2   14
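The sketch below shows one way such household IDs could be assigned: a small union-find that groups records sharing any of the three match codes. The sample codes come from the first rows of the table above; note that real HIDs are persistent across runs, which this toy version does not attempt:

def assign_hids(records):
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    by_code = {}
    for i, rec in enumerate(records):
        for code in rec:                     # rec = (mc1, mc2, mc3)
            j = by_code.setdefault(code, i)
            parent[find(i)] = find(j)        # union records sharing a code
    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(len(records))]

sample = [("$MN", "#L1", "%RQ"), ("$RN", "#L1", "%LQ"), ("$JH", "#H2", "%J2")]
print(assign_hids(sample))   # first two records share #L1 -> [1, 1, 2]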

Now that our data is integrated, including joining, merging/de-duplicating, and householding, we move on to the next building block: Enrichment. For more information on the Integration building block, start with the following topics in the dfPower Studio online Help:
- dfPower Integration Match Introduction
- dfPower Integration Match Creating Duplicate Elimination Jobs
- dfPower Architect Introduction

Enrichment
In the Enrichment building block, we add to our existing data by verifying and completing incomplete fields, and adding new data such as geocodes. For our data, we are going to perform three types of enrichment: address verification, phone validation, and geocoding.


Address Verification
Address verification is the process of verifying and completing address data based on existing address data. As examples:
- You can use an existing ZIP code to determine the city and state.
- You can use an existing street address, city, and state to determine the ZIP or ZIP+4 code.
- You can use an existing ZIP code to determine whether the street address actually exists in that ZIP code.

Currently, a sampling of our data looks like this:


First Name   Last Name   Street Address        City   State   ZIP     Phone
Bill         Jones       1324 New Rd                  NC      27712   716-479-4990
Sam          Smith       253 Forest Rd         Cary   NC              919-452-8253
Julie        Swift       159 Merle St                 NC      14127   716-662-5301
Mary         Wise        115 Dublin Woods Dr                  43953   614-484-0555

To verify and help complete some of these records, we first launch dfPower Architect from the dfPower Studio main screen, use a Data Source job step to specify our combined customer records table, and then add an Address Verification job step. In the Address Verification job step, we assign the following address types to each field:
- Street Address: Address Line 1
- City: City
- State: State
- ZIP: ZIP/Postal Code

This tells dfPower Architect what type of data to expect in each field. (It is only coincidence that the City and State fields and address types use the same terms. We could just as easily have had a CtyTwn field to which we would assign a City address type.) For the output fields, we specify the following:
- Output Type: Address Line 1, Output Name: Street Address
- Output Type: City, Output Name: City
- Output Type: State, Output Name: State
- Output Type: ZIP/Postal Code, Output Name: ZIP
- Output Type: US County Name, Output Name: County

Notice that except for County, all the fields will get updated with new data. For County, Architect will create a new field. We also specify the following additional output fields:
- First Name
- Last Name
- Phone

These three fields will not be changed, just carried through the Address Verification job step.

After running the job flow, our data now looks like this:
First Name   Last Name   Street Address          City           State   ZIP          County     Phone
Bill         Jones       1324 E New Rd           Durham         NC      27712-1525   Durham     716-479-4990
Sam          Smith       253 Forest Rd           Cary           NC      27513-1627   Wake       919-452-8253
Julie        Swift       159 Merle St            Orchard Park   NY      14127-5283   Erie       716-662-5301
Mary         Wise        115 Dublin Woods Dr S   Cadiz          OH      43953-1524   Harrison   614-484-0555

Notice that the Street Address, City, State, and ZIP fields are now completely populated, a couple of street addresses have new directional designations (for example, 1324 E New Rd), all the ZIP codes are now ZIP+4 codes, and the County field has been created and populated. Also note that Julie Swift's State was changed from NC to NY because 14127 is a ZIP code for western New York.

Note: To change these values, dfPower Architect used data from the US Postal Service. If you have licensed dfPower Verify, you should have access to this same data, as well as data for phone validation and geocoding.
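As a rough illustration of ZIP-driven completion, here is a Python sketch using a tiny invented reference dictionary; real verification works against the full USPS reference database and covers far more cases:

# Hypothetical ZIP -> (city, state, county) reference data.
ZIP_REFERENCE = {"14127": ("Orchard Park", "NY", "Erie"),
                 "27712": ("Durham", "NC", "Durham")}

def verify(record):
    ref = ZIP_REFERENCE.get(record.get("ZIP", ""))
    if ref:
        # Overwrite City/State and create County based on the ZIP.
        record["City"], record["State"], record["County"] = ref
    return record

print(verify({"Street Address": "159 Merle St", "State": "NC", "ZIP": "14127"}))
# State is corrected to NY because 14127 is a western New York ZIP code.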

Phone Validation
Phone validation is the process of checking that the phone numbers in your data are valid, working numbers. dfPower Studio accomplishes this by comparing your phone data to its own reference database of valid area code and exchange information and returning several pieces of information about that phone data, including a verified area code, phone type, and MSA/PSA codes. In addition, you can add one of the following result codes to each record:
- FOUND FULL: The full telephone number appears to be valid.
- FOUND AREA CODE: The area code appears to be valid, but the full phone number does not.
- NOT FOUND: Neither the area code nor the full phone number appears to be valid.

Ignoring the address fields, our data currently looks like this:
First Name   Last Name   Phone
Bill         Jones       716-479-4990
Sam          Smith       919-452-8253
Julie        Swift       716-662-5301
Mary         Wise        614-484-0555

To validate the phone numbers, we continue our dfPower Architect job flow from address verification and add a Phone job step. In the Phone job step, we identify Phone as the field that contains phone numbers; select Phone Type, Area Code, and Result as outputs; and add First Name and Last Name as additional output fields. The data now looks like this:
First Name   Last Name   Phone          Area Code   Phone Type   Result
Bill         Jones       716-479-4990   919         Standard     FOUND FULL
Sam          Smith       919-452-8253                            NOT FOUND
Julie        Swift       716-662-5301   716         Standard     FOUND AREA CODE
Mary         Wise        614-484-0555   740         Cell         FOUND FULL
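The validation logic can be pictured as lookups against reference data, as in this Python sketch. The reference sets here are invented for illustration and are far smaller than a real area code and exchange database:

# Hypothetical reference data: valid area codes, and valid (area, exchange) pairs.
VALID_AREA_CODES = {"716", "919", "740"}
VALID_EXCHANGES = {("919", "452"), ("740", "484")}

def validate(phone):
    area, exchange = phone.split("-")[0], phone.split("-")[1]
    if (area, exchange) in VALID_EXCHANGES:
        return "FOUND FULL"
    if area in VALID_AREA_CODES:
        return "FOUND AREA CODE"
    return "NOT FOUND"

for p in ["919-452-8253", "716-662-5301", "555-123-4567"]:
    print(p, "->", validate(p))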

Geocoding
Geocoding is the process of using ZIP code data to add geographical information to your records. This information will be useful because it allows us to determine where customers live in relation to each other and to our bank branches. We might use this information to plan new branches or inform new customers of their closest branch. Geocode data can also indicate certain demographic features of our customers, such as average household income. To add geocode data to our records, we add a Geocoding job step to our dfPower Architect job flow, identify our ZIP field, and indicate what kind of geocode data we want. Options include latitude, longitude, census tract, FIPS (the Federal Information Processing Standard code assigned to a given county or parish within a state), and census block.

Note: Geocode data generally needs to be used with a lookup table that maps the codes to more meaningful data.

For more information on the Enrichment building block, start with the following topics in the dfPower Studio online Help:
- dfPower Architect Introduction
- dfPower Verify Introduction

Our data is now ready for our customer mailing. However, one data-management building block remains: Monitoring.

Monitoring
Because data is a fluid, dynamic, ever-evolving resource, building quality data is not a one-time activity. The integrity of data degrades over time as incorrect, nonstandard, and invalid data is introduced from various sources. For some data, such as customer records, existing data becomes incorrect as people move and change jobs. In the fifth and final building block, Monitoring, we set up techniques and processes to help us understand when data gets out of limits and to identify ways to correct data over time. Monitoring helps ensure that once data is consistent, accurate, and reliable, we have the information we need to keep it that way. For our data, we will use two types of monitoring techniques and processes:
- Auditing
- Alerts

Auditing
Auditing involves periodic reviews of your data to help ensure you can identify and correct bad data as quickly as possible. Auditing requires you to set a baseline for the acceptable characteristics of your data. For example, it might be that certain fields should:
- Never contain null values
- Contain only unique values
- Contain only values within a specified range


For our customer records, we know there have been problems with the ZIP field being blank, so we plan to regularly audit this field to see whether the number of blank fields is increasing or decreasing over time. We will generate a trend chart that shows these changes graphically. To prepare for auditing, we create a Profile job just as we did in the Profiling building block: We launch dfPower Profile's Configurator component from the dfPower Studio main screen, connect to our customer records database and table, select the ZIP field, use the Job > Options menu command to select the Blank Count metric, and run the resulting Profile job. This creates a profile report that appears in dfPower Profile's Viewer component and establishes our auditing baseline.

Note: For your own data, you might want to select all fields and all metrics. This will give you the greatest flexibility in auditing data on the fly.

Now, each day, we open and run that same job from the Configurator, making sure to keep Append to Report (If It Already Exists) checked on the Run Job screen.

Note: If you plan to do a lot of data auditing, consider using dfPower Studio's Base Batch component to schedule and automatically run your Profile job or jobs.

When the dfPower Profile - Viewer main screen appears, we choose Tools > Data Monitoring > Historical Analysis. In the Metric History screen that appears, we select ZIP as the field name and Blank Count as the metric. If we want to audit data for a certain time period, we can also select start and end dates and times; a date and time will be available for each time we ran the Profile job. The Metric History screen displays a chart like this:
[Chart: Metric History for DSN Bank Records, Table Customer Records, Field ZIP, plotting the Blank Count metric (0 to 600) daily from 2/1/05 through 2/5/05]
Viewing this chart, we can quickly see that the number of blank ZIP fields generally declined over the week until the last day, when the number increased a bit.


Alerts
Alerts are messages generated by dfPower Studio when data does not meet criteria you set. For our data, we know that the Last Name field is sometimes left blank. To be alerted whenever this happens, we will set up an alert. To do this, we launch dfPower Profile's Configurator component from the dfPower Studio main screen, choose Tools > Data Monitoring > Alert Configuration, and select our customer records database and table. Then, we select the Last Name field and set the following:
- Metric: Blank Count
- Comparison: Metric is Greater Than
- Value: 0

This automatically generates a description of Blank Count is greater than 0. We also indicate that we want to receive alerts by email, and specify the email address of a group of people responsible for addressing these alerts. Now, whenever dfPower Profile finds a blank Last Name field, it will send out an email message. We can also review alerts through dfPower Profile's Viewer component. To do this, in the Viewer, we choose Tools > Data Monitoring > Alerts. The Alert screen appears, listing the current alerts, similar to what is shown in the following table:
Data Source Name   Table Name         Field Name   Description
Bank Records       Customer Records   ZIP          Blank Count is greater than 0
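The alert rule configured above amounts to computing a metric and comparing it to a threshold. Here is a minimal Python sketch of that idea; email delivery is stubbed out, since dfPower handles delivery through its own alert configuration:

def check_alert(values, metric="Blank Count", threshold=0):
    # Blank = present but empty or whitespace-only (nulls are counted separately).
    blank_count = sum(1 for v in values if v is not None and str(v).strip() == "")
    if blank_count > threshold:
        return f"{metric} is greater than {threshold} (found {blank_count})"
    return None

alert = check_alert(["Smith", "", "Jones", "  "])
if alert:
    print("ALERT:", alert)   # in practice, emailed to the responsible group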

For more information on the Monitoring building block, see the following topics in the online Help:
- dfPower Profile Introduction
- dfPower Batch Using Studio in Batch Mode

Summary
We hope this scenario has provided you with a solid foundation for starting to use dfPower Studio through all five data-management building blocks: Profiling, Quality, Integration, Enrichment, and Monitoring. As we mentioned at the beginning of this chapter, this scenario covers only a sampling of dfPower Studio's functionality. For more information on what you can do with dfPower Studio, see the online Help.


Chapter 4: Performance Guide


Many settings, environment variables, and associated database and system settings can significantly impact the performance of dfPower Studio. Aside from normal data-processing issues that stem from user hardware and operating-system capabilities, dfPower Studio's performance can benefit either from directly modifying settings within dfPower Studio or from modifying the data source on which dfPower Studio is working. These modifications can range from the amount of memory used to sort records, to the complexity of the parse definition selected for data processing, to the number and size of columns in a database table. This chapter describes some of the more common ways that you can optimize dfPower Studio for performance. Topics in this chapter include:
- dfPower Studio Settings
- Database and Connectivity Issues
- Quality Knowledge Base Issues

dfPower Studio Settings


Several dfPower Studio environment settings can be accessed through the various dfPower Studio applications you use to process your data. Currently, a few of the settings can only be changed by directly editing the dfStudio7.ini file, which is stored in your computer's main Windows directory. This file contains many of the user-defined options and system path relationships used by dfPower Studio.

Caution! Edit the dfStudio7.ini file with care. Unwarranted changes can corrupt your dfPower Studio installation, causing it to perform unexpectedly or fail to initialize.

dfPower Studio (All Applications)


Working Directory Settings: Most dfPower Studio applications use a working directory to create temporary files and save miscellaneous log files. Some processes, such as multi-condition matching, can create very large temporary files, depending on the number of records and conditions being processed. To speed processing, the working directory should be on a local drive with plenty of physical space to handle large temporary files. Normal hard drive performance issues apply, so a faster hard drive is advantageous, as is keeping the drive defragmented. You can change the working directory from the dfPower Studio main screen: Choose Studio > Options, click Directories, and change the path in the Working Directory box.

dfPower Base Architect


Memory Allocated for Sorting and Joining: Architect has an option, Amount of Memory to Use During Sorting Operations, for allocating the amount of memory used for sorting. Access this option from the Architect main screen by choosing Tools > Options. The number for this option indicates the amount of memory Architect uses for sorting and joining operations. This number is represented in bytes, and the default is 64MB. Increasing this number will allow more memory to be used for these types of procedures, and can improve performance substantially if your computer has the RAM available. As a rule, if you have only one join or sort step in an Architect job flow, you should not set this number greater than half the total available memory. If you have multiple sorts and joins, divide this number by the number of sorts or joins in the job flow.


Caching the USPS Reference Database: Architect has an option, Percentage of Cached USPS Data Indexes, for how much United States Postal Service data your computer should store in memory. You can access this option from the Architect main screen by choosing Tools > Options. The number for this option indicates an approximate percentage of how much of the USPS reference data set will be cached in memory prior to an address verification procedure. The default setting is 20. The range is from 0, which indicates that only some index information will be cached in memory, to 100, which directs that most of the indices and normalization information (approximately 200MB) be cached in memory.

Sort by ZIP Codes: For up to about 100,000 records, sorting records containing address data by postal code prior to the address verification process will enhance performance. This has to do with the way address information is searched for in the reference database. If you use Architect to verify addresses, this step can be done inside the application; otherwise, it must be done in the database system. For over 100,000 records, it might be faster to use the SQL Query step in Architect, write a SQL statement that sorts records by ZIP, and let the database do the work (see the SQL sketch at the end of this section).

New Table Creation vs. Table Updates: When using Architect, it is faster to create a new table as an output step than to update existing tables. If you choose to update an existing table, set the commit interval on the output step to an explicit value (for example, 500) instead of choosing Commit Every Row.

Clustering Performance in Version 6.1.1 and Higher: To take advantage of performance enhancements to the clustering engine that provides the functionality for the Clustering job step in Architect:

o Try to have as much RAM as possible on your computer, preferably 4GB. If more than 3GB of RAM is installed, the Windows boot.ini system file needs a special switch to instruct Windows to allow a user process to take up to 3GB; otherwise, Windows automatically reserves 2GB of RAM for itself. To add this switch, add /3GB to the Windows boot.ini file. Note: This is not supported on all Windows versions. For more information on making this change and for supported versions, see www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx.
o Terminate all non-essential processes to free up memory resources.
o Set the Windows Sort Bytes memory allocation parameter close to 75-80% of total physical RAM. If the data set is rather small, this might not be as important, but when clustering millions of records, using more memory dramatically improves performance. Do not exceed 80-85%, because higher memory allocation might result in memory thrashing, which dramatically decreases clustering performance.
o Defragment the hard drive used for temporary cluster engine files. This is the dfPower Studio working directory described earlier.
o Use the fastest hard drive for cluster engine temporary files. If possible, set the dfPower Studio working directory to be on the fastest physical drive that is not the drive for the operating system or page file.
o Defragment the page file disk. Do this by setting the Windows Virtual Memory size to 0, then rebooting the system and defragmenting that drive.
o Manually set both the minimum and maximum values of the Windows Virtual Memory file size to the same large value, preferably 2GB or more, depending on the disk space available. This prevents the operating system from running out of virtual memory and from needing to resize the file dynamically.
o Disable Fast User Switching in Windows XP. This will free up space in the page file.
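As mentioned under Sort by ZIP Codes above, for large tables you can let the database do the sorting. A minimal sketch of the kind of statement you might use in the SQL Query step, assuming a hypothetical Contacts table with a ZIP column (the table and column names are illustrative only):

-- Sorting by postal code lets address verification walk the USPS
-- reference data roughly in order instead of jumping around in it.
SELECT ID, NAME, ADDRESS, CITY, STATE, ZIP
FROM Contacts
ORDER BY ZIP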

Multiple Match Code Generation: When defining matching criteria, do not set up rules that generate match codes on the same fields at different sensitivities; for example, do not create match codes for Name and Address fields at both 55 and 85 percent sensitivities using the Match Code job step alone. Each use of the Match Code job step makes several calls to the parsing algorithms to generate the match codes. A better approach is to use the Parsing job step first on the Name and Address fields, and then use the Match Codes (Parsed) job step to generate the match codes at the different sensitivities. This accesses the parse engine only once for each field and record instead of several times for each sensitivity. Depending on the match criteria, the gains can cut processing time by more than half.

Limit Data Passed Through Each Step: While it might be convenient to use the Architect setting that passes every output field by default to the next step, this creates a large amount of overhead when you are only processing a few fields. Passing only 10 fields through each step might not be that memory intensive, but with 110 fields, performance can really suffer. After you create your job flow with the Output Fields setting on All, go through each job step and delete the output fields you do not need.

dfPower Verify
Caching the USPS Reference Database: dfPower Verify has an option, USPS Cache, for how much USPS data your computer should store in memory. You can access this option from the dfPower Verify main screen by choosing Tools > Options and displaying the Runtime tab. The number for this option indicates an approximate percentage of how much of the USPS reference data set will be cached in memory prior to an address verification procedure. The default setting is 20. The range is from 0, which indicates that only some index information will be cached in memory, to 100, which directs that most of the indices and normalization information (approximately 200MB) be cached in memory.

Create a Verified Table and Join to Source: For quicker database updates, you can write verified address data to a new table, and then use Architect to join the records in the source table with the information in the verified table, using primary keys for the join.
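Architect performs the join through its own job steps, but the logic is equivalent to the following sketch, assuming a hypothetical source table Contacts and a verified table Contacts_Verified that share a numeric ID primary key:

-- Bring the verified address fields back alongside the source records.
SELECT s.ID, s.NAME, v.ADDRESS, v.CITY, v.STATE, v.ZIP
FROM Contacts s
JOIN Contacts_Verified v ON s.ID = v.ID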

dfPower Integration - Match


Set Sort Memory Size: If you want to create match jobs that take advantage of multiple OR steps in your match criteria, you might want to increase the memory used for match code sorting. To do this from the dfPower Integration - Match main screen, choose Options > Set Sort Memory Size. The range is from 0, which indicates that only minimal memory should be used for the sorting and clustering process, to 100, which indicates that almost all available memory will be used for this process. The default value is 15. We recommend that you not set this number much higher than 40 unless you have a very large amount of available RAM.


Multiple Condition Matching: The number of match conditions, in addition to the number of fields you choose to use as match criteria, will significantly impact performance. A match job of 100,000 records that uses four OR conditions with four fields for each condition might take hours to complete, while a match job of 100,000 records with a single condition and six fields might complete in minutes.

Generate Cluster Data Mode: See Clustering Performance in Version 6.1.1 and Higher in the dfPower Base Architect section.

dfPower Quality - Standardize


Log Updates: Choosing to log database updates to the statistics file can slow down performance to a small degree. On very large job runs, the difference might be more evident. You can turn off the logging feature from the dfPower Quality main screen by choosing Control > Log Updates.

dfPower Profile - Configurator


Metric Calculation Options: From the dfPower Profile - Configurator main screen, choose Job > Options to access settings that can directly affect the performance of a Profile job. Two of these settings are Count all Rows for Frequency Distribution and Count All Rows for Pattern Distribution. By default, all values are examined to determine the metrics garnered from frequency and pattern distributions. You can decrease processing time by setting a limit on the number of accumulated values; the tradeoff is that you might not get a complete assessment of the true nature of your data. Another setting, Maximum Number of Values per Field to Store in Memory for Processing, is set to 10000 by default. If your computer has more memory available, you can increase this number to improve performance. The number is per field, so if you profile all five fields of a five-field table, 50000 values will be stored in memory between memory flushes. To optimize performance, you can profile fewer fields or increase the number, but you run the risk of running out of memory before the process completes.

Subset the Data: You can explicitly specify which metrics to run for each field in a table. If certain metrics do not apply to certain kinds of data, or if you only want to examine select fields of large tables, be sure to manually change the list of metrics to be run for each field. Incidentally, the biggest performance loss is generating a frequency distribution on unique or almost-unique columns. Thus, if you know a field to be mostly unique, you might choose not to run metrics that use frequency distribution. For numeric fields, these metrics are Percentile and Median; for all fields, they are Primary Key Candidate, Mode, Unique Count, and Unique Percentage. You can also subset the data by creating a business rule or SQL query that pares the data down to only the elements you want to profile (see the sketch at the end of this section).

Sample the Data: dfPower Profile allows you to specify a sample interval on tables that you want to profile. This feature improves performance at the expense of accuracy, but for very large tables with mostly similar data elements, it might be a good option.

Memory Allocated for Frequency Distribution: dfPower Profile allows you to configure the amount of memory allocated to the Frequency Distribution Engine (FRED). This option is set in the dfstudio.ini file and must be within the [Profile] section. If your .ini file does not contain the [Profile] heading, you may add it manually. For example:

[Profile]
pertablebytes = 128000


By default, FRED allocates 256 KB per column being profiled, on top of the amount configured in the Job > Options menu of dfPower Profile - Configurator. If you have a large number of columns to profile (in the hundreds), this can cause your machine to run low on (or out of) memory, and you may need to reduce the pertablebytes value. For performance reasons, this value should always be a power of 2. Setting this value to 1 MB (note: 1 MB = 1024 * 1024 bytes, not 1000 * 1000 bytes) yields optimal performance. Setting it to a value larger than 1 MB (again, always a power of 2) may help slightly when processing very large data sets (tens of millions of rows), but might actually reduce performance for data sets with only a few million rows or fewer.
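For example, to set the per-table allocation to the 1 MB suggested above:

[Profile]
pertablebytes = 1048576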
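Returning to the Subset the Data tip above, the pare-down query can be as simple as selecting only the columns and rows you care about. A minimal sketch, assuming a hypothetical Contacts table:

-- Profile only the address columns, and only the North Carolina rows.
SELECT NAME, ADDRESS, CITY, STATE, ZIP
FROM Contacts
WHERE STATE = 'NC'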

Database and Connectivity Issues


There are certain issues related to the way databases are physically described and built that can impact the overall performance of dfPower Studio applications. These issues are generally common to all relational database systems. With just a few exceptions, database changes to enhance performance take place external to dfPower Studio applications.

dfPower Studio
Commit Interval: From the dfPower Studio main screen, choose Studio > Options and click Transactions to access transaction processing options. For example, you can choose among several ways to commit changes to the database in write procedures. The default value is Auto Commit, which commits every change to the database table one record at a time. You can instead select Every N Transactions and set the number to somewhere between 100 and 500. This should increase performance; the tradeoff is that on the remote chance a problem occurs during processing, any uncommitted changes made by dfPower Studio will not be saved to the source table.

dfPower Base Architect


Commit Interval: Architect commit settings are controlled on a per-job basis. The Data Target (Update) and Data Target (Insert) job steps both have commit options. The default value is Commit Every Row, which commits every change to the database table one record at a time. You can instead select Every N Rows and set the number to somewhere between 100 and 500. This should increase performance; the tradeoff is that on the remote chance a problem occurs during processing, any uncommitted changes made by Architect will not be saved to the source table.

Database
ODBC Logging: Use the Microsoft Windows ODBC Administrator to turn off ODBC Tracing, which can dramatically decrease performance.

Using Primary Keys: For many dfPower Studio processes, primary keys enhance performance only if they are constructed correctly: they should be numeric, unique, and indexed. You cannot currently use composite keys to enhance performance.

Database Reads and Writes: All dfPower Studio processes read data from databases, but only some of them write data back. Processes that only read from a database, such as match reports, run quickly. However, if you choose to flag duplicate records in the same table using the same criteria as the match report, the processing time will greatly increase, as you might expect. Using correctly constructed primary keys will help. In addition, as a general rule, smaller field sizes enhance performance: a table where all fields default to a length of 255 will be processed more slowly than a table with more reasonable lengths for each field, such as 20 for a name field and 40 for an address field.
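A minimal sketch of a correctly constructed key of the kind described under Using Primary Keys above, assuming a hypothetical Contacts table (exact syntax varies by database):

-- Numeric, unique, and indexed; most databases index a primary
-- key automatically when it is declared. Field sizes are kept
-- reasonable rather than defaulting to 255.
CREATE TABLE Contacts (
    ID      INTEGER PRIMARY KEY,
    NAME    VARCHAR(20),
    ADDRESS VARCHAR(40)
)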


Turn Off Virus Protection Software: Some virus-scanning applications such as McAfee NetShield 4.5 cause processing times to increase substantially. While you might not want to turn off your virus-scanning software completely, you might be able to change some settings to ensure that the software is not harming performance by doing things such as scanning for local database changes.

Quality Knowledge Base Issues


The Quality Knowledge Base is the set of files and file relationships that dictates how all DataFlux applications parse, standardize, match, and otherwise process data. The Quality Knowledge Base uses several file types and libraries that work with each other to produce the expected outputs. A few of the file types are used in ways that are not simply direct data look-ups, and it is these files that require a certain degree of optimization to ensure that DataFlux applications process data as efficiently as they can.

Complex Parsing: Processes using definition types that incorporate parsing functionality (parse definitions obviously do, but match, standardization, and gender definitions can use parse functionality as well) are directly affected by the way parse definitions are constructed. A parse definition built to process e-mail addresses will be much less complex than one that processes address information; this has to do with the specifics of the data type itself. Data types with more inherent variability will most likely have more complex parse definitions, and processes using these more complex definitions will perform more slowly. When creating custom parse definitions, it is very easy to accidentally create an algorithm that is far from optimized for performance. Training materials are available from DataFlux that teach the proper way to design a parse definition.

Complex Regular Expression Libraries: Many definitions use regular expression files to do some of the required processing work. Sometimes regular expression libraries are used to normalize data; other times they are used to categorize data. Incorrectly constructed regular expressions are notorious for being resource intensive: you might design a perfectly valid regular expression that takes an extremely long time to accomplish a seemingly simple task. DataFlux-designed definitions have been optimized for performance, but if you create custom definitions, be sure to learn how to create efficient regular expressions. Training materials are available from DataFlux that teach Quality Knowledge Base customization.
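As a hypothetical illustration of the trap described above (not taken from any DataFlux library): on a backtracking regular expression engine, nested quantifiers can make a pattern's worst case exponential on input that almost matches.

(\w+)+@     catastrophic: on a long string with no "@", the engine
            retries every possible way of splitting the characters
            between the inner and outer quantifiers before failing
\w+@        matches exactly the same strings, in linear time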


Exercises
Profile
Exercise 1: Create a Profile Report
Overview: This is the discovery phase of a data management project. We will profile a Contacts table.

Assignment: Create a simple Profile report from the Configurator. Among the tasks to be performed are:

o Select a data source
o Select job metrics
o Select job options
o Profile a selected field
o Access drill-through options such as Visualize and Filter
o Save the job
o Output a report

Step-by-Step: Creating a Profile Report

1. Select Profile > Configurator (or Tools > Profile > Configurator) to open the dfPower Profile Configurator. The Configurator appears with data sources displayed on the left.
2. Double-click the DataFlux Sample database to begin a profile. The data source listing expands to show available tables for that data source.

The icons in the data source listing indicate selection state:

o Indicates the data source or table is not selected to be part of the profile report.
o Indicates all items within the data source or table are selected to be part of the profile report.
o Indicates only some of the items within the data source or table are selected to be part of the profile report.

3. Select Contacts to view the fields for that table. A drop-down menu shows report options, which display on the right.


Report options are available by field to run full or filtered profile jobs.

Before proceeding with the profile job, we will define options and metrics for the profile. 4. Select Job > Options. Select the General tab. Ensure that both Count all rows for frequency distribution and Count all rows for pattern distribution are selected. The profile engine will look at every value for these records. 5. Click OK.


Note: These processes can also be constructed to run on a remote server, though that requires advanced options; most users will run profiles from a desktop computer.

6. Select Job > Select Metrics to open the Metrics dialog box.


The Column Profiling list shows the metrics available for analysis.

7. To select all the metrics, click the Select/unselect all box. 8. Click OK. The Configurator displays structural information with preliminary data about these selections, which helps you see the structure of the tables selected for profiling. Note: You can override features from the Configurator window by selecting the checkbox and then returning to the options.

9. Select the desired field name(s) for profiling. For this exercise, select the Contacts table from the left side of the dfPower Profile Configurator main screen to select all fields. You can also select individual fields from within the Contacts Table display. 10. To save the job, select File > Save As, specify a job name, type a report description, and click Save. (Here we named the job Profile 1.) 11. Process the job. Click the Run Job icon on the toolbar to launch the job.

Run Job icon

The Run Job dialog box appears, displaying the Profile 1 name just saved. Choose either Standard or Repository output; for this exercise, select Standard Output, type the report description Contacts Profile 1, and click OK.


Append to Report is a useful option for historical data analysis.

The dfPower Profile (Viewer) appears, showing the Profile 1 job. 12. Select a Data Source (Contacts for this exercise) and then a Field Name (State) to view detailed profile data; note that several tabs appear in the lower half of the viewer that permit users to see full details of the data profile. The Column Profiling tab is selected in the figure that follows.


These tabs enable you to view various report details.

13. Select the Frequency Distribution tab from the lower half of the viewer. The viewer now displays information about the data value, count, and percentage. Note the various values used to denote California.

Several different values denote California; some are nonstandard.

14. Right-click anywhere in the panel to access Visualize and Filter drill-through options. 15. Double-click a specific value to get to the drill-through feature.


The Visualize function accessed from Frequency Distribution for State generates this chart.

16. Select the Pattern Frequency Distribution tab for another view of the data. 17. Double-click the first pattern AA to view the Pattern Frequency Distribution Drill-through screen, shown in the next image. Note that users can scroll to the right to view more columns.


You can export or print this report from this window. The export function permits you to save the report as a .txt or .csv file that can be viewed by users without access to dfPower Studio. Next, we will examine filtering capabilities more fully by applying a business rule to a table.


Profile
Exercise 2: Applying a Business Rule to Create a Filtered Profile Report
Overview: Here, we will create a filtered profile in order to perform a standardizing function on the State field in the Contacts database.

Assignment: Create a Profile report to compare filtered and full tables. Among the tasks to be performed are:

o Select a data source
o Insert a business rule
o Create the rule expression
o Save and run the job

Note: You can view filtered data in a couple of ways: by applying a filter to the Pattern and/or Frequency Distribution tabs, or by applying a business rule. The first method filters only the displayed data, while applying a business rule filters data out of the report altogether.

Step-by-Step: Creating a Filtered Profile Report

1. Return to dfPower Studio Profile (Configurator). If the Viewer is still open, close it.
2. Select the Contacts table from DataFlux Sample.
3. Select the State field name from the Contacts table data. We now want to customize the profile by writing a new business rule that permits us to examine contacts for California only.
4. Select Insert > Business Rule. A dialog box appears.

User Tips:

o Select Insert > Business Rule to create a profile report based on a business rule. A business rule allows you to filter data to ensure the data adheres to standards. A business rule can be based on a table or a SQL query.
o Select Insert > SQL Query to create a profile report based on a SQL query. Query Builder can be used to help construct the SQL statement.
o Highlight Text Files and select Insert > Text File to create a profile report based on a text file instead of an ODBC data source.

5. Type a Business rule name that is easily identifiable. For this example, we will use Contacts_California. 6. Click OK. A Business Rule on Table dialog box appears, as shown next.


7. Select STATE from the Field Name browser. 8. Select Equal to from the Operation browser. 9. Type CA to indicate California as a Single value in the Value section. 10. Select Add Condition to create a Rule Expression, which appears in the corresponding display within this window. 11. Save the Profile Job as Profile 2. 12. Run the Profile job. dfPower Studio opens a Profile 2 job viewer to display details of the job. You now have a full table and a filtered table to study for Contacts. You can view the data in several ways using the Visualize function, as in Exercise 1.
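For readers who think in SQL, the rule built in steps 7 through 10 is logically equivalent to the following filter (a sketch only; the Configurator applies the rule itself, and the table and column names follow this exercise):

-- Rows kept by the Contacts_California business rule.
SELECT *
FROM Contacts
WHERE STATE = 'CA'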

Reporting
You can run multiple jobs with the same setup; for example, you may want to run the same job weekly. These reports can be set up as real-time or batch reports, and you can even set up a report database as an advanced option. You can view reports from the Profile Viewer, or by selecting the Reports folder in the Management Resources directory (Navigator > DFStudio > Profile > Reports).


Such reports can be quite valuable for data monitoring, which can show trends that permit you to fine-tune your data management strategy.


Architect
Exercise 3: Create an Architect Job from a Profile Report
Overview: After creating the Profile reports, you can use Architect for several functions (Data Validation, Pattern Analysis, Basic Statistics, Frequency Distribution, and more). In this exercise, we will create an Architect job from a Profile report that is based on a SQL query.

Assignment: Create an Architect Profile report based on a SQL query to standardize data. Among the tasks to be performed:

o Create a Profile report
o Build the SQL query
o Standardize State data
o Create the Architect job
o Output Architect data to HTML
o Save and run the Architect job

To prepare, we will create a Profile report and then begin the Architect procedure.

Step-by-Step: Creating a Profile Architect Job from a Profile Report 1. Select Configurator from the Studio Profile node. An untitled Profile appears in the Configurator window. 2. Create a Profile report based on a SQL query. a. Select the Client Info table from the data sources. All Client Info field names are automatically selected. b. Select Insert > SQL Query from the Configurator menu.


c. Type a SQL query name for the job; for this exercise, type Client_Info for State. A SQL Query dialog opens, with Query Builder available.
d. Type the query:

SELECT State FROM Client_Info
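Query Builder is optional; you can type any valid SQL statement here. As a hypothetical variant (an illustration only, not part of this exercise), you could also filter out empty values:

SELECT State
FROM Client_Info
WHERE State IS NOT NULL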

The untitled profile information is updated in the Configurator. 3. Select File > Save As. 4. Name the job Sample Architect Job for State. We can now build the job with our specifications and open Architect to display it.

5. To run the Profile job, click Run Job.
6. Specify Standard Output. A Profile Viewer display shows the data source and how the data is parsed. We now need to create an output for the data. Architect performs data transformations as a job flow, and it contains nodes and steps for 50 different data quality transformations.
7. From the open Client_Info report, highlight the field name State.


8. With the State field highlighted, select Tools > Add Task > Standardization. The Standardization dialog box appears.

9. From the Standardization definition drop-down menu, select State (Two Letter). 10. Leave Standardization scheme at the default setting of (None). 11. Click OK. The Profile Viewer Task List window will display the status of the task being performed. 12. Select Tools > Create Job to build the Architect job from the task list. 13. Save the job.


14. Architect is launched, displaying the job. You can continue data transformations from this point.

By default, Architect jobs are saved in the Management Resources directory (Navigator). 15. Select from Data Outputs to view the available output nodes. 16. Double-click HTML Report to add it to the Architect job. The Report Properties screen appears.


17. Enter the following Name and Report Title: HTML Report 1 for Client_Info. 18. Select OK. 19. Select File > Save to save the Architect job. 20. Select Job > Run to run the Architect job. The Run Job window displays. 21. Close the Run Job status window. The HTML Report now displays, as pictured.


22. Select File > Close to close the HTML report. 23. Select File > Exit to close the Architect job. 24. Select File > Exit to close the Profile report.
