Data Warehousing
Methodology
Executive Summary
Information Technology (IT) plays a crucial role in delivering the data foundation for key
performance indicators such as revenue growth, margin improvement and asset
efficiency at the corporate, business unit and departmental levels. And IT now has the
tools and methods to succeed at any of these levels. An enterprise-wide, integrated
hub is the most effective approach to track and improve fundamental business
measures. It is not only desirable; it is both necessary and feasible. Here are the
reasons why:
Organizations may choose to implement different levels of Data Warehouses from line
of business level implementations to Enterprise Data Warehouses. As the size and
scope of a Warehouse increases so does the complexity, risk and effort. For those that
achieve an Enterprise Data Warehouse, the benefits are often the greatest. However,
an organization must be committed to delivering an Enterprise Data Warehouse and
must ensure the resources, budget and timeline are sufficient to overcome the
organizational hurdles to having a single repository of corporate data assets.
The primary business drivers behind a data warehouse project vary and can be very
organization specific. However, a few general trends are evident across most
organizations. Below are some of the key drivers that typically motivate data
warehouse projects:
Desire for a ‘360 degree view’ around customers, products, or other subject areas
In order to make effective business decisions and have meaningful interactions with
customers, suppliers and other partners it is important to gather information from a
variety of systems to provide a ‘360 degree view’ of the entity. For example, consider a
software company looking to provide a 360 degree view of their customers. To provide
this view, it may require gathering and relating sales orders, prospective sales
interactions, maintenance payments, support calls and services engagements. These
items merged together paint a more complete picture of a particular customer’s value
and interaction with the organization. The challenge is that in any organization this data
might reside in numerous systems with different customer codes and structures across
different technologies, making it nearly impossible to produce a single report
programmatically. This gives rise to the need for a centralized location, such as a Data
Warehouse, that merges and rationalizes this data for easy reporting.
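The cross-system rationalization described above can be sketched in a few lines. The system names, customer codes, and cross-reference table below are purely illustrative assumptions, not part of any real implementation:

```python
# Each operational system identifies the same customer differently, so a
# cross-reference table maps (system, source code) pairs to one master key.
customer_xref = {
    ("crm", "C-1001"): "CUST-1",
    ("billing", "ACME01"): "CUST-1",
    ("support", "8872"): "CUST-1",
}

# Activity pulled from each source system (invented records for illustration)
source_events = [
    {"system": "crm", "customer": "C-1001", "type": "sales_order", "amount": 50000},
    {"system": "billing", "customer": "ACME01", "type": "maintenance", "amount": 9000},
    {"system": "support", "customer": "8872", "type": "support_call", "amount": 0},
]

def build_360_view(events, xref):
    """Merge per-system activity under a single master customer key."""
    view = {}
    for e in events:
        master = xref[(e["system"], e["customer"])]
        view.setdefault(master, []).append((e["type"], e["amount"]))
    return view

view = build_360_view(source_events, customer_xref)
print(view["CUST-1"])  # all three systems' activity under one customer
```

In a real warehouse the cross-reference itself is the hard part; it is typically built and maintained through data-quality matching rules rather than a hand-keyed table.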
Operational systems are built and tuned for the best operational performance possible.
A slowdown in an order entry system may cost a business lost sales and decreased
customer satisfaction. Given that analytic reporting often requires summarizing and
gathering large amounts of information, queries against operational systems for
analytic purposes are usually discouraged and even outright prohibited for fear of
impacting system performance. One key value of a data warehouse is the ability to
access large data sets for analytic purposes while remaining physically separated from
operational systems. This ensures that operational system performance is not
adversely affected by analytic work and that business users are free to crunch large
data sets and metrics without impacting daily operations.
In most cases, operational systems only store current state information on orders,
transactions, customers, products and other data. Historical information has little use in
the operational world. Point of sale transactions, for example, may be purged from the
operational system shortly after they are processed, whereas a Data Warehouse retains
this history so that trends can be analyzed over time.
There are many other specific business drivers that can spur the need for a Data
Warehouse. However, these are some of the most common seen across most
organizations.
To ensure success for a Data Warehouse implementation, there are key success
factors that must be kept in mind throughout the project. Often, data warehouses
are built by IT staff who have been pulled or moved from other implementation efforts,
such as system implementations and upgrades. In these cases, implementing a Data
Warehouse can be quite a change from past IT work. These Key
Success Factors point out important topics to consider as you begin project planning.
● Data Sources are from many disparate systems, internal and external to the
organization
● Data models are used to understand relationships and business rules within the data
Typically data must be modeled, structured and populated in a relational database for it
to be available for a Data Warehouse reporting project. Data Integration is designed
based on available Operational Application sources to pull, cleanse, transform and
populate an Enterprise Subject Area Database. Once the data is present in the Subject
Area Database, projects can fulfill their requirements to provide Business Intelligence
reporting to the Data Warehouse end users. This is done by identifying detailed
reporting requirements and designing corresponding Business Intelligence Data Marts
that capture, at the proper grain, all of the facts and dimensions needed for reporting.
These Data Marts are then populated using a Data Integration process and coupled to
the Reporting components developed in the Business Intelligence tool.
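As a rough illustration of the mart-population step, the sketch below rolls subject-area detail into a small date dimension and a daily-grain sales fact. All table and column names are hypothetical:

```python
import datetime

# Detail rows as they might sit in a subject area database (invented data)
subject_area_sales = [
    {"order_id": 1, "order_date": datetime.date(2024, 3, 1), "product": "WidgetA", "amount": 120.0},
    {"order_id": 2, "order_date": datetime.date(2024, 3, 1), "product": "WidgetB", "amount": 80.0},
    {"order_id": 3, "order_date": datetime.date(2024, 3, 2), "product": "WidgetA", "amount": 50.0},
]

# Dimension: one row per calendar date, keyed by a surrogate key
dim_date = {}
for row in subject_area_sales:
    d = row["order_date"]
    dim_date.setdefault(d, {"date_key": len(dim_date) + 1, "year": d.year, "month": d.month})

# Fact: daily grain per product, referencing the date dimension's surrogate key
fact_sales = {}
for row in subject_area_sales:
    key = (dim_date[row["order_date"]]["date_key"], row["product"])
    fact_sales[key] = fact_sales.get(key, 0.0) + row["amount"]

print(fact_sales)
```

The essential point is the choice of grain: the fact table is summarized no further than the level at which users need to drill, while the dimension carries the descriptive attributes used for slicing.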
This type of project addresses the need to gather data from an area of the enterprise
where no prior familiarity of the business data or requirements exists. All project
components are required.
Physical data modeling and data discovery components will drive out the identification
and design of the new database requirements and the new data sources. New Data
Integration processes must be created to bring new data into new Data Warehouse
database structures. A set of history loads may be required to backload the data and
bring it up to the current timeline. New Dimensional Data Mart and BI Reporting
offerings must be modeled, designed and implemented to satisfy the user information
and access needs.
This type of project addresses the need to add a new data source or to alter an existing
data source, but always within the context of already established logical data structures
and definitions. No logical data modeling is needed because no new business data
requirements are being entertained. Minor adjustments to the physical model and
database may be needed to accommodate changes in volume due to the new source,
and new or altered views may be needed to report on the new data instances that are
now available to the users.
Data discovery analysis comprises a key portion of this type of project, as does the
corresponding new or altered data integration processes that move the data to the
database. Business intelligence reports and queries may need to change to incorporate
new views or expanded drill-downs and data value relationships. Back loading
historical data may also be required. When enhancing existing data, Metadata
management efforts to track data from the physical data sources through the data
integration process and to business intelligence data marts and reports can assist with
impact analysis and scoping efforts.
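A minimal sketch of how such metadata can support impact analysis: represent lineage as a directed graph from sources through the integration layer to marts and reports, and compute the downstream set of any object. The object names below are illustrative assumptions:

```python
# Lineage captured as "this object feeds these objects" (hypothetical names)
lineage = {
    "src.orders": ["stg.orders"],
    "stg.orders": ["mart.fact_sales"],
    "mart.fact_sales": ["report.monthly_revenue", "report.region_dashboard"],
}

def downstream(node, graph):
    """All objects reachable from `node` -- the impact set for a change to it."""
    impacted, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# Everything that must be reviewed if the source orders table changes
print(sorted(downstream("src.orders", lineage)))
```

Commercial metadata managers build this graph automatically from mapping definitions; the value for scoping is the same either way: a change to any source can be traced to the exact marts and reports it touches.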
This type of project is focused solely on the expansion or alteration of the business
intelligence reporting and query capability using existing subject area data. This type of
project does not entertain the introduction of any new or altered data (in structure or
content) to the warehouse subject area database. New or altered dimensional data
mart tables/views may be required to support the business intelligence enhancements;
otherwise the majority, if not all, of the work is within the business intelligence layer.
It is important to assess the business case and executive sponsorship early on in the
Data Warehouse project. The project is at risk if the business value of the warehouse
cannot be articulated at the executive level and on down through the organization. If
executives do not have a clear picture of how the data warehouse will impact their
business and the value it will provide, it won't be long before a decision is made to
reduce or stop funding the effort.
The Data Warehouse team should always keep the end goal in mind for an enterprise
wide data warehouse. Often an enterprise data warehouse will strive to achieve a
‘single source of truth’ across the entire enterprise and across all data stores.
Delivering this in a ‘big bang’ approach nearly always fails. By the time all of the
enterprise modeling, data rationalization and data integration have taken place across
all facets of the organization, the value of the project is called into question, and the
project is either delayed or cancelled.
The key is that, while short-term milestones are being delivered, the Data
Warehouse team should not lose sight of the end goal of the enterprise vision. For
example, when implementing customer retention metrics for two key systems - as an
early ‘win’ – be sure to consider the five other systems in the organization and try to
ensure that the model and process are flexible enough that the current work will not
need to be re-architected when this data is added in a later phase. Keep the final goal
in mind when designing and building the incremental milestones.
End-user reporting must provide flexibility, offering straightforward reports for basic
users, and for analytic users allowing drilling and roll-ups, views of both summary and
detailed data and ad-hoc reporting. Report design that is too rigid may lead to clutter
(as multiple structures are developed for reports that are very similar to each other) in
the Business Intelligence Application and in the Data Integration and Data Warehouse
Database contents. Providing flexible structures and reports allows data to be queried
from the same reports and database structures without redundancy or the time required
to develop new objects and Data Integration processes. Users can create reports from
the common structures, thus removing the bottleneck of IT activities and the need to
wait for development. Data modeling and physical database structures that reflect the
business model (rather than the requirements for a single report) enable flexibility as a
by-product.
Often reporting requirements are defined for summary data. While summary data may
be available from transaction and operational systems, it is best to bring the detailed
data into the Data Warehouse and summarize based on that detail. This avoids
potential problems due to different calculation methods, aggregating on different criteria
and other ways in which the summary data brought in as a source might differ from
roll-ups that begin with the raw detailed records. Because the summary offers smaller
database table sizes, it may be tempting to bring this data in first, and then bring in the
detailed data at a later stage in order to drill down to the details. Having standard
sources of the raw data and using the same source for various summaries increases
the quality of the data and avoids ending up with multiple versions of the truth.
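The principle can be illustrated with a small sketch: both summaries below roll up from the same detailed records, so they necessarily reconcile. The data and field names are invented for illustration:

```python
# Raw detail records -- the single source for every summary (invented data)
detail = [
    {"region": "East", "month": "2024-01", "amount": 100.0},
    {"region": "East", "month": "2024-01", "amount": 250.0},
    {"region": "West", "month": "2024-01", "amount": 75.0},
]

def roll_up(rows, key):
    """Aggregate detail rows by any grouping key."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0.0) + r["amount"]
    return totals

by_region = roll_up(detail, "region")  # {'East': 350.0, 'West': 75.0}
by_month = roll_up(detail, "month")    # {'2024-01': 425.0}

# Both summaries reconcile to the same grand total because they share a source
assert sum(by_region.values()) == sum(by_month.values()) == 425.0
```

Had each summary been sourced independently from different operational extracts, nothing would guarantee that the two grand totals agree.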
Business Users must be engaged throughout the entire development process. The
resulting reports and supporting data from the Data Warehouse project should answer
business questions. If the users are not involved, then it is probable that the
end-result will not meet their needs for business data; and the overall success of the
project is diminished. The more the business users feel that the solution is focused on
solving their analytic needs – the more likelihood there is of adoption.
Once lost, trust is difficult to regain. As a Data Warehouse is rolled out (and throughout
its existence), it is important to thoroughly validate the data it contains in order to
maintain end users’ trust in the data warehouse analytics. If a key metric is incorrect
(e.g., the gross sales amount for a region in a particular month), end users may lose
confidence in the system and all of its reports and metrics. If users lose faith in the
analytics, this can hamper enterprise adoption and even spell the end of a data
warehouse.
Not only is thorough testing and validation required to ensure that data is loaded
completely and accurately into the warehouse, but organizations will often create
ongoing balancing and auditing procedures. These procedures are run on a regular basis
to ensure metrics are accurate and that they ‘tie out’ with source systems. Sometimes
these procedures are manual and sometimes they are automated. If the warehouse is
suspected to be inaccurate - or a daily load fails to run – communications are initiated
with end users to alert them to the problem. It is better to limit user reporting for a
morning until the issues are addressed, than to risk that an executive makes a critical
business decision with incorrect data.
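A simple automated tie-out along these lines might compare row counts and key totals between the source extract and the warehouse load. The check names and tolerance below are assumptions for illustration:

```python
def tie_out(source_rows, warehouse_rows, amount_field="amount"):
    """Compare a source extract against the loaded warehouse rows.

    Returns the individual check results and an overall pass/fail flag
    that can drive end-user alerting when a load fails to balance.
    """
    checks = {
        "row_count": len(source_rows) == len(warehouse_rows),
        "amount_total": abs(
            sum(r[amount_field] for r in source_rows)
            - sum(r[amount_field] for r in warehouse_rows)
        ) < 0.01,  # small tolerance for rounding during transformation
    }
    return checks, all(checks.values())

# Invented post-load comparison: both sides balance, so reports can be released
src = [{"amount": 100.0}, {"amount": 250.0}]
wh = [{"amount": 100.0}, {"amount": 250.0}]
checks, ok = tie_out(src, wh)
print(ok)  # True; False would trigger a communication to end users
```

In practice such checks run after every scheduled load, and a failure suppresses or flags the affected reports until the discrepancy is explained.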
The following pages describe the roles used throughout this Guide, along with the responsibilities typically
associated with each. Please note that the concept of a role is distinct from that of an employee or full time
equivalent (FTE). A role encapsulates a set of responsibilities that may be fulfilled by a single person in a part-time or full-
time capacity, or may be accomplished by a number of people working together. The Velocity Guide refers to roles with an
implicit assumption that there is a corresponding person in that role. For example, a task description may discuss the involvement
of "the DBA" on a particular project; however, there may be one or more DBAs, or a person whose part-time responsibility is
database administration.
In addition, note that there is no assumption of staffing level for each role -- that is, a small project may have one individual filling
the role of Data Integration Developer, Data Architect, and Database Administrator, while large projects may have multiple
individuals assigned to each role. In cases where multiple people represent a given role, the singular role name is used, and
project planners can specify the actual allocation of work among all relevant parties. For example, the methodology always refers
to the Technical Architect, when in fact, there may be a team of two or more people developing the Technical Architecture for a
very large development effort.
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Under normal circumstances, someone from the business community fills this role,
since deep knowledge of the business requirement is indispensable. Ideally, familiarity
with the technology and the development life-cycle allows the individual to function as
the communications channel between technical and business users.
Reports to:
Responsibilities:
● Ensures that the delivered solution fulfills the needs of the business (should be
involved in decisions related to the business requirements)
● Assists in determining the data integration system project scope, time and
required resources
● Provides support and analysis of data collection, mapping, aggregation and
balancing functions
● Performs requirements analysis, documentation, testing, ad-hoc reporting,
user support and project leadership
● Produces detailed business process flows, functional requirements
specifications and data models and communicates these requirements to the
design and build teams
● Conducts cost/benefit assessments of the functionality requested by end-users
● Prioritizes and balances competing priorities
● Plans and authors the user documentation set
Qualifications/Certifications
Recommended Training
● Interview/workshop techniques
● Project Management
● Data Analysis
● Structured analysis
● UML or other business design methodology
● Data Warehouse Development
Reports to:
● Project Sponsor
Responsibilities:
Qualifications/Certifications
Recommended Training
● Project Management
Depending on the specific structure of the development organization, the Data Architect
may also be considered a Data Warehouse Architect, in cooperation with the Technical
Architect. This role involves developing the overall Data Warehouse logical
architecture, specifically the configuration of the data warehouse, data marts, and an
operational data store or staging area if necessary. The physical implementation of the
architecture is the responsibility of the Database Administrator.
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
● Modeling Packages
● Data Warehouse Development
Reports to:
Responsibilities:
● Uses the Informatica Data Integration platform to extract, transform, and load
data
● Develops Informatica mapping designs
● Develops Data Integration Workflows and load processes
● Ensures adherence to locally defined standards for all developed components
● Performs data analysis for both Source and Target tables/columns
● Provides technical documentation of Source and Target mappings
● Supports the development and design of the internal data integration
framework
● Participates in design and development reviews
● Works with System owners to resolve source data issues and refine
transformation rules
● Ensures performance metrics are met and tracked
● Writes and maintains unit tests
● Conducts QA reviews
● Performs production migrations
Qualifications/Certifications
Recommended Training
● Data Modeling
● PowerCenter – Level I & II Developer
● PowerCenter - Performance Tuning
● PowerCenter - Team Based Development
● PowerCenter - Advanced Mapping Techniques
● PowerCenter - Advanced Workflow Techniques
● PowerCenter - XML Support
● PowerCenter - Data Profiling
● PowerExchange
Reports to:
Responsibilities:
● Profile source data and determine all source data and metadata characteristics
● Design and execute Data Quality Audit
● Present profiling/audit results, in summary and in detail, to the business
analyst, the project manager, and the data steward
● Assist the business analyst/project manager/data steward in defining or
modifying the project plan based on these results
● Assist the Data Integration Developer in designing source-to-target mappings
● Design and execute the data quality plans that will cleanse, de-duplicate, and
otherwise prepare the project data for the Build phase
● Test Data Quality plans for accuracy and completeness
● Assist in deploying plans that will run in a scheduled or batch environment
● Document all plans in detail and hand-over documentation to the customer
● Assist in any other areas relating to the use of data quality processes, such as
unit testing
Qualifications/Certifications
Recommended Training
Typically the Data Steward is a key member of a Data Stewardship Committee put into
place by the Project Sponsor. This committee will include business users and technical
staff such as Application Experts. There is often an arbitration element to the role
where data is put to different uses by separate groups of users whose requirements
have to be reconciled.
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
● DBMS Administration
● Data Warehouse Development
● PowerCenter Administrator Level I & II
● PowerCenter Security and Migration
● PowerCenter Metadata Manager
Reports to:
Responsibilities:
Recommended Training
● DBMS Administration
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
● DBMS Basics
● Data Modeling
● PowerCenter - Metadata Manager
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
The Presentation Layer Developer designs the application, ensuring that the end-user
requirements gathered during the requirements definition phase are accurately met by
the final build of the application. In most cases, the developer works with front-end
Business Intelligence tools, such as Cognos, Business Objects and others. To be most
effective, the Presentation Layer Developer should be familiar with metadata concepts
and the Data Warehouse/Data Mart data model.
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
● Executive Leadership
Responsibilities:
Qualifications/Certifications
Recommended Training
● N/A
Reports to:
Responsibilities:
● Leads the effort to validate the integrity of the data through the data integration
processes
● Ensures that the data contained in the data integration solution has been
accurately derived from the source data
● Develops and maintains quality assurance plans and test requirements
documentation
● Verifies compliance to commitments contained in quality plans
● Works with the project management and development teams to resolve issues
● Participates in the enforcement of data quality standards
● Communicates concerns, issues and problems with data
● Participates in the testing and post-production verification
● Together with the Technical Lead and the Repository Administrator, articulates
the development standards
● Advises on the development methods to ensure that quality is built in
● Designs the QA and standards enforcement strategy
● Together with the Test Manager, coordinates the QA and Test strategies
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
● Operating Systems
● DBMS
● PowerCenter Developer and Administrator - Level I
● PowerCenter New Features
● Basic and advanced XML
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
The test manager is also responsible for the creation of the test data set. An integrated
test data set is a valuable project resource in its own right; apart from its obvious role in
testing, the test data set is very useful to the developers of integration and presentation
components. In general, separate functional and volume test data sets will be required.
In most cases, these should be derived from the production environment. It may also
be necessary to manufacture a data set which triggers all the business rules and
transformations specified for the application.
Finally, the Test Manager must continually advocate adherence to the Test Plans.
Projects at risk of delayed completion often sacrifice testing, at the expense of a high-
quality end result.
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
Reports to:
Responsibilities:
Qualifications/Certifications
Reports to:
Responsibilities:
Qualifications/Certifications
Recommended Training
● N/A
1 Manage
Description
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
This task entails constructing the business context for the project, defining in business
terms the purpose and scope of the project as well as the value to the business (i.e.,
the business case).
Prerequisites
None
Roles
Considerations
There are no technical considerations during this task; in fact, any discussion of
implementation specifics should be avoided at this time. The focus here is on defining
the project deliverable in business terms with no regard for technical feasibility. Any
discussion of technologies is likely to sidetrack the strategic thinking needed to develop
the project objectives.
Best Practices
None
Sample Deliverables
Project Definition
Description
In many ways, the potential for success of the development effort for a data
integration solution correlates directly with the clarity and focus of its business scope. If
the business purpose is unclear or the boundaries of the business objectives are poorly
defined, there is a much higher risk of failure or, at least, of a less-than-direct path to
limited success.
Prerequisites
None
Roles
Considerations
The primary consideration in developing the Business Project Scope is balancing the
high-priority needs of the key beneficiaries with the need to provide results within the
near-term. The Project Manager and Business Analysts need to determine the key
business needs and determine the feasibility of meeting those needs to establish a
scope that provides value, typically within a 60 to 120 day time-frame.
Best Practices
None
Sample Deliverables
Project Charter
Description
Building support and funding for a data integration solution nearly always requires convincing executive IT management of its value
to the business. The best way to do this, where possible, is to build a business case that calculates the project's estimated return on
investment (ROI).
In addition to traditional ROI modeling on data integration initiatives, quantitative and qualitative ROI assessments should also
include assessments of data quality. Poor data quality costs organizations vast sums in lost revenues. Defective data leads
to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. Moreover, poor
quality data can lead to failures in compliance with industry regulations and even to outright project failure at the IT level.
It is vital to acknowledge data quality issues at an early stage in the project. Consider a data integration project that is planned
and resourced meticulously but that is undertaken on a dataset where the data is of a poorer quality than anyone realized. This
can lead to the classic “code-load-explode” scenario, wherein the data breaks down in the target system due to a poor
understanding of the data and metadata. What is worse, a data integration project can succeed from an IT perspective but deliver
little if any business value if the data within the system is faulty. For example, a CRM system containing a dataset with a large
quantity of redundant or inaccurate records is likely to be of little value to the business. Often an organization does not realize it
has data quality issues until it is too late. For this reason, data quality should be a consideration in ROI modeling for all data
integration projects – from the beginning.
For more details on how to quantify business value and associated data integration project cost, please see Assessing the
Business Case.
Prerequisites
Roles
Considerations
The Business Case must focus on business value and, as much as possible, quantify that value. The business beneficiaries
are primarily responsible for assessing the project benefits, while technical considerations drive the cost assessments. These
two assessments - benefits and costs - form the basis for determining overall ROI to the business.
When creating your ROI model, it is best to start by looking at the expected business benefit of implementing the data
integration solution. Common business imperatives include:
Each of these business imperatives requires support via substantial IT initiatives. Common IT initiatives include:
For these IT initiatives to be successful, you must be able to integrate data from a variety of disparate systems. The form of those
data integration projects may vary. You may have a:
● Data Warehousing project, which enables new business insight usually through business intelligence.
● Data Migration project, where data sources are moved to enable a new application or system.
● Data Consolidation project, where certain data sources or applications are retired in favor of another.
● Master Data Management project, where multiple data sources come together to form a more complex, master view of the
data.
● Data Synchronization project, where data between two source systems need to stay perfectly consistent to enable different
applications or systems.
● B2B Data Transformation project, where data from external partners is transformed to internal formats for processing by
internal systems and responses are transformed back to partner appropriate formats.
● Data Quality project, where the goals are to cleanse data and to correct errors such as duplicates, missing information,
mistyped information and other data deficiencies.
Once you have established the heritage of your data integration project back to its origins in the business imperatives, it is important
to estimate the value derived from the data integration project. You can estimate the value by asking questions such as:
After asking the questions above, you will begin to be able to express the business value of the data
integration project in monetary terms. Remember to estimate the business value not only over the first year after implementation,
but also over the course of time. Most business cases and associated ROI models factor in expected business value for at least
three years.
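As a hedged sketch of such a multi-year model, the simple undiscounted ROI calculation below compares estimated benefits against project and operating costs; all figures are illustrative placeholders, not benchmarks from this Guide:

```python
def roi(annual_benefits, initial_cost, annual_running_cost):
    """Simple (undiscounted) ROI over the benefit horizon.

    A real business case would typically also discount future years
    to present value; this sketch omits that for clarity.
    """
    total_benefit = sum(annual_benefits)
    total_cost = initial_cost + annual_running_cost * len(annual_benefits)
    return (total_benefit - total_cost) / total_cost

# Three years of estimated business value vs. build + operating costs
r = roi(annual_benefits=[400_000, 600_000, 600_000],
        initial_cost=500_000,
        annual_running_cost=100_000)
print(round(r, 2))  # 1.0 -> estimated benefits are double the total cost
```

The same function can be run against a second cost scenario (for example, hand-coded integration versus a tooled approach) to compare the two alternatives described in Step 2 below.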
If you are still struggling with estimating business value with the data integration initiative, see the table below that outlines
common business value categories and how they relate to various data integration initiatives:
INCREASE REVENUE

Cross-Sell / Up-Sell
Objective: Increase penetration and sales within existing customers
Metrics:
- % cross-sell rate
- # products/customer
- % share of wallet
- customer lifetime value
Data Integration Initiatives:
- Single view of customer across all products, channels
- Marketing analytics & customer segmentation
- Customer lifetime value analysis

Sales and Channel Management
Objective: Increase sales productivity, and improve visibility into demand
Metrics:
- sales per rep or per employee
- close rate
- revenue per transaction
Data Integration Initiatives:
- Sales/agent productivity dashboard
- Sales & demand analytics
- Customer master data integration
- Demand chain synchronization

New Product / Service Delivery
Objective: Accelerate new product/service introductions, and improve "hit rate" of new offerings
Metrics:
- # new products launched/year
- new product/service launch time
- new product/service adoption rate
Data Integration Initiatives:
- Data sharing across design, development, production and marketing/sales teams
- Data sharing with third parties, e.g. contract manufacturers, channels, marketing agencies

LOWER COSTS

Supply Chain Management
Objective: Lower procurement costs, increase supply chain visibility, and improve inventory management
Metrics:
- purchasing discounts
- inventory turns
- quote-to-cash cycle time
- demand forecast accuracy
Data Integration Initiatives:
- Product master data integration
- Demand analysis
- Cross-supplier purchasing history

Production & Service Delivery
Objective: Lower the costs to manufacture products and/or deliver services
Metrics:
- production cycle times
- cost per unit (product)
- cost per transaction (service)
- straight-through-processing rate
Data Integration Initiatives:
- Cross-enterprise inventory rollup
- Scheduling and production synchronization

Logistics & Distribution
Objective: Lower distribution costs and improve visibility into the distribution chain
Metrics:
- distribution costs per unit
- average delivery times
- delivery date reliability
Data Integration Initiatives:
- Integration with third-party logistics management and distribution partners

Compliance Risk (e.g. SEC/SOX/Basel II/PCI)
Objective: Prevent compliance outages to avoid investigations, penalties, and negative impact on brand
Metrics:
- # negative audit/inspection findings
- probability of compliance lapse
- cost of compliance lapses (fines, recovery costs, lost business)
- audit/oversight costs
Data Integration Initiatives:
- Financial reporting
- Compliance monitoring & reporting

Financial/Asset Risk Management
Objective: Improve risk management of key assets, including financial, commodity, energy or capital assets
Metrics:
- errors & omissions
- probability of loss
- expected loss
- safeguard and control costs
Data Integration Initiatives:
- Risk management data warehouse
- Reference data integration
- Scenario analysis
- Corporate performance management

Business Continuity/Disaster Recovery Risk
Objective: Reduce downtime and lost business, prevent loss of key data, and lower recovery costs
Metrics:
- mean time between failure (MTBF)
- mean time to recover (MTTR)
- recovery time objective (RTO)
- recovery point objective (RPO -- data loss)
Data Integration Initiatives:
- Resiliency and automatic failover/recovery for all data integration processes
Now that you have estimated the monetary business value of the data integration project in Step 1, you will need to calculate
the costs associated with that project in Step 2. In most cases, the data integration project is inevitable – one way or another
the business initiative is going to be accomplished – so it is best to compare two alternative cost scenarios. One scenario would
be implementing that data integration with tools from Informatica, while the other scenario would be implementing the data
integration project without Informatica’s toolset.
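The two-scenario comparison described above can be sketched as a simple three-year cost model. All figures below are illustrative placeholders, not vendor benchmarks, and the cost categories are assumptions chosen for the example.

```python
# Hypothetical three-year cost comparison of the two scenarios described
# above. All figures are illustrative placeholders, not vendor benchmarks.

def three_year_cost(license_fees, annual_dev_cost, annual_maintenance):
    """One-time license fees plus three years of recurring costs."""
    return license_fees + 3 * (annual_dev_cost + annual_maintenance)

# Scenario A: tool-based data integration (license cost, lower labor).
with_tool = three_year_cost(license_fees=250_000,
                            annual_dev_cost=400_000,
                            annual_maintenance=100_000)

# Scenario B: hand coding (no license, higher labor and maintenance).
hand_coded = three_year_cost(license_fees=0,
                             annual_dev_cost=650_000,
                             annual_maintenance=250_000)

tco_savings = hand_coded - with_tool  # difference attributed to the tool
```

With these placeholder inputs, the model surfaces the structural trade-off: a license fee up front against lower recurring development and maintenance costs over the analysis period.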
Some examples of benchmarks to support the case for Informatica lowering the total cost of ownership (TCO) on data integration
and data quality projects are outlined below:
Forrester Research, "The Total Economic Impact of Deploying Informatica PowerCenter", 2004
The average savings of using a data integration/ETL tool vs. hand coding:
• The top-performing third of Integration Competency Centers (ICCs) will save an average of:
Larry English, Improving Data Warehouse and Business Information Quality, Wiley Computer
Publishing, 1999.
• "The business costs of non-quality data, including irrecoverable costs, rework of products
and services, workarounds, and lost and missed revenue may be as high as 10 to 25 percent
of revenue or total budget of an organization."
• "Invalid data values in the typical customer database averages around 15 to 20 percent…
Actual data errors, even though the values may be valid, may be 25 to 30 percent or more in
those same databases."
Ponemon Institute -- study of costs incurred by 14 companies that had security breaches affecting
between 1,500 and 900,000 consumer records
• Total costs to recover from a breach averaged $14 million per company, or $140 per lost
customer record
• Direct costs for incremental, out-of-pocket, unbudgeted spending averaged $5 million per
company, or $50 per lost customer for outside legal counsel, mail notification letters, calls to
individual customers, increased call center costs and discounted product offers
• Indirect costs for lost employee productivity averaged $1.5 million per company, or $15 per
customer record
• Opportunity costs covering loss of existing customers and increased difficulty in recruiting new
customers averaged $7.5 million per company, or $75 per lost customer record.
• Overall customer loss averaged 2.6 percent of all customers and ranged as high as 11 percent
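The per-record figures cited above can be cross-checked arithmetically: the direct, indirect, and opportunity components sum to the reported $140 per record, and dividing the $14 million company average by that figure implies roughly 100,000 lost records per company.

```python
# Cross-check of the per-record and per-company figures cited above.
direct = 50       # legal counsel, notification letters, call center, offers
indirect = 15     # lost employee productivity
opportunity = 75  # lost customers and harder recruitment of new ones

total_per_record = direct + indirect + opportunity  # reported as $140

implied_records = 14_000_000 // total_per_record    # records per company
```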
In addition to lowering cost of implementing a data integration solution, Informatica adds value to the ROI model by mitigating risk
in the data integration project. In order to quantify the value of risk mitigation, you should consider the cost of project overrun and
the associated likelihood of overrun when using Informatica vs. when you don’t use Informatica for your data integration project. An
example analysis of risk mitigation value is below:
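As a minimal sketch of such an analysis: the expected cost of overrun is the probability of overrun multiplied by its cost, and the risk mitigation value is the difference in expected cost between the two scenarios. The probabilities and overrun cost below are illustrative assumptions, not measured values.

```python
# Minimal sketch of quantifying risk mitigation value. The probabilities
# and overrun cost are illustrative assumptions.

def expected_overrun_cost(probability, overrun_cost):
    """Expected value of a project overrun: likelihood times cost."""
    return probability * overrun_cost

cost_of_overrun = 1_000_000                       # assumed overrun exposure
without_tool = expected_overrun_cost(0.40, cost_of_overrun)
with_tool = expected_overrun_cost(0.15, cost_of_overrun)

risk_mitigation_value = without_tool - with_tool  # expected savings
```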
Once you have calculated the three-year business/IT benefits and the three-year costs of using PowerCenter vs. not
using PowerCenter, put all of this information into a format that is easy to read for IT and line-of-business executive management.
The following is a sample summary of an ROI model:
1. Informatica Software can reduce the overall project timeline by accelerating migration development efforts.
2. Informatica delivered migrations will have lower risk due to ease of maintenance, less development effort, higher quality of data,
and increased project management tools with the metadata driven solution.
3. Availability of lineage reports as to how the data was manipulated by the data migration process and by whom.
Best Practices
None
Sample Deliverables
None
Description
If an ICC does not already exist, this subtask is finished since there are no centralized
resources to assess and all the tasks in the Velocity WBS are the responsibility of the
development team.
If an ICC does exist, it is necessary to assess the extent and nature of the resources
available in order to demarcate the responsibilities between the ICC and project teams.
Typically, the ICC acquires responsibility for some or all of the data integration
infrastructure (essentially the Non-Functional Requirements) and the project teams are
liberated to focus on the functional requirements. The precise division of labor is
obviously dependent on the degree of centralization and the associated ICC model that
has been adopted.
In the task descriptions that follow, an ICC section is included under the Considerations
heading where alternative or supplementary activity is required if an ICC is in place.
Prerequisites
None
Roles
Considerations
Best Practices
Sample Deliverables
None
Description
This task incorporates the initial project planning and management activities as well as
project management activities that occur throughout the project lifecycle. It includes the
initial structure of the project team and the project work steps based on the business
objectives and the project scope, and the continuing management of expectations
through status reporting, issue tracking and change management.
Prerequisites
None
Roles
Considerations
The tools of the trade, apart from strong people skills (especially, interpersonal
communication skills), are detailed documentation and frequent review of the status of
the project effort against plan, of the unresolved issues, and of the risks regarding
enlargement of scope ("change management"). Successful project management is
predicated on regular communication of these project aspects with the project
manager, and with other management and project personnel.
For data migration projects there is often a project management office (PMO) in place.
The PMO is typically found in high-dollar, high-profile projects, such as implementing a
new ERP system, that will often cost in the millions of dollars. It is important to identify
the roles and gain the understanding of the PMO as to how these roles are needed and
will intersect with the broader system implementation. More specifically, these roles will
have responsibility beyond the data migration, so the resource requirements for the
Data Migration must be understood and guaranteed as part of the larger effort
overseen by the PMO.
For B2B projects, technical considerations typically play an important role. The format
of data received from partners (and replies sent to partners) forms a key consideration
in overall business operations and has a direct impact on the planning and scoping of
changes. Informatica recommends having the Technical Architect directly involved
throughout the process.
Best Practices
None
Sample Deliverables
None
Description
This subtask involves defining the roles/skill sets that will be required to complete the
project. This is a precursor to building the project team and making resource
assignments to specific tasks.
Prerequisites
None
Roles
Considerations
The Business Project Scope established in 1.1.1 Establish Business Project Scope
provides a primary indication of the required roles and skill sets. The following types of
questions are useful discussion topics and help to validate the initial indicators:
● What are the main tasks/activities of the project and what skills/roles are
needed to accomplish them?
● How complex or broad in scope are these tasks? This can indicate the level of
skills needed.
● What responsibilities will fall to the company resources and which are off-
loaded to a consultant? Who (i.e. company resource or consultant) will provide
the project management? Who will have primary responsibility for
infrastructure requirements? ...for data architecture? ...for documentation? ...
for testing? ...for deployment/training/support?
This is a definitional activity and very distinct from the later assignment of resources.
These roles should be defined as generally as possible rather than attempting to match
a requirement with a resource at hand.
After the project scope and required roles have been defined, there is often pressure to
combine roles due to limited funding or availability of resources. There are some roles
that inherently provide a healthy balance with one another, and if one person fills both
of these roles, project quality may suffer.
The classic conflict is between development roles and highly procedural or operational
roles. For example, a QA Manager or Test Manager or Lead should not be the same
person as a Project Manager or one of the development team. The QA Manager is
responsible for determining the criteria for acceptance of project quality and managing
quality-related procedures. These responsibilities directly conflict with the developer’s
need to meet a tight development schedule. For similar reasons, development
personnel are not ideal choices for filling such operational roles as Metadata Manager,
DBA, Network Administrator, Repository Administrator, or Production Supervisor.
Those roles require operational diligence and adherence to procedure as opposed to
ad hoc development. When development roles are mixed with operational roles,
resulting ‘shortcuts’ often lead to quality problems in production systems.
Tip
Involve the Project Sponsor.
Before defining any roles, be sure that the Project Sponsor is in agreement as to the
project scope and major activities, as well as the level of involvement expected from
company personnel and consultant personnel. If this agreement has not been
explicitly accomplished, review the project scope with the Project Sponsor to resolve
any remaining questions.
In defining the necessary roles, be sure to provide the Sponsor with a full description
of all roles, indicating which will rely on company personnel and which will use
consultant personnel. This sets clear expectations for company involvement and
indicates if there is a need to fill additional roles with consultant personnel if the
company does not have personnel available in accordance with the project timing.
The Role Descriptions in Roles provides typical role definitions. The Project Role Matrix
can serve as a starting point for completing the project-specific roles matrix.
Sample Deliverables
Project Definition
Description
Once the overall project scope and roles have been defined, details on project
execution must be developed. These details should answer the questions of what must
be done, who will do it, how long it will take, and how much will it cost.
The objective of this subtask is to develop a complete WBS and, subsequently, a solid
project estimate.
● Work Breakdown Structure (WBS), which can be viewed as a list of tasks that
must be completed to achieve the desired project results. (See Developing a
Work Breakdown Structure (WBS) for more details)
● Project Estimate, which, at this time, focuses solely on development costs
without consideration for hardware and software expenditures.
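A WBS can be thought of as nested tasks whose effort estimates roll up into the overall project estimate. The sketch below illustrates that rollup; the phase names, tasks, and hours are illustrative assumptions, not Velocity-prescribed values.

```python
# Minimal sketch of a WBS as nested tasks whose effort estimates roll up
# into the project estimate. Phase names, tasks, and hours are illustrative.

wbs = {
    "Analyze": {"Define scope": 40, "Gather requirements": 80},
    "Design":  {"Data model": 60, "Mapping specifications": 100},
    "Build":   {"Develop mappings": 200, "Unit test": 80},
}

def total_effort(wbs):
    """Sum the leaf-task estimates across the whole breakdown structure."""
    return sum(hours for tasks in wbs.values() for hours in tasks.values())

estimate_hours = total_effort(wbs)
```

Keeping the estimate as a rollup of leaf tasks, rather than a single top-down number, makes it easy to revise when individual tasks change scope.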
Estimating a project is never an easy task, and often becomes more difficult as project
visibility increases and there is an increasing demand for an "exact estimate". It is
important to understand that estimates are never exact. However, estimates are useful
for providing a close approximation of the level of effort required by the project. Factors
such as project complexity, team skills, and external dependencies always have an
impact on the actual effort required.
The accuracy of an estimate largely depends on the experience of the estimator (or
estimators). For example, an experienced traveller who frequently travels the route
between his/her home or office and the airport can easily provide an accurate estimate
of the time required for the trip. When the same traveller is asked to estimate travel
time to or from an unfamiliar airport however, the estimation process becomes much
more complex, requiring consideration of numerous factors such as distance to the
airport, means of transportation, speed of available transportation, time of day that the
travel will occur, expected weather conditions, and so on. The traveller can arrive at a
valid overall estimate by assigning time estimates to each factor, then summing the
whole. The resulting estimate, however, is not likely to be nearly as accurate as one
based on knowledge gained through experience. The same holds true for estimating data integration projects.
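The unfamiliar-airport approach described above amounts to estimating each factor separately and summing the parts. The factor names and minutes below are illustrative.

```python
# The unfamiliar-airport example: with no direct experience, the traveller
# estimates each factor separately and sums them. Values are illustrative.

travel_factors = {
    "drive to airport": 45,
    "traffic allowance": 20,
    "parking and shuttle": 15,
    "weather allowance": 10,
}

estimate_minutes = sum(travel_factors.values())  # overall trip estimate
```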
Prerequisites
None
Roles
Considerations
For B2B projects (and non-B2B projects that have significant unstructured or semi-
structured data transformation requirements) the actual creation and subsequent QA of
transformations relies on having sufficient samples of input and output data, as well as
specifications for data formats.
By their nature, the full authoring of B2B data transformations cannot be completed (or
in some cases proceed) without the availability of adequate sample data both for input
to transformations and for comparison purposes during the quality assurance process.
Best Practices
None
Sample Deliverables
None
Description
In this subtask, the Project Manager develops a schedule for the project using the
agreed-upon business project scope to determine the major tasks that need to be
accomplished and estimates of the amount of effort and resources required.
Prerequisites
None
Roles
Considerations
The initial project plan is based on agreements-to-date with the Project Sponsor
regarding project scope, estimation of effort, roles, project timelines and any
understanding of requirements.
Updates to the plan (as described in Developing and Maintaining the Project Plan) are
typically based on changes to scope, approach, priorities, or simply on more precise
determinations of effort and of start and/or completion dates as the project unfolds. In
some cases, later phases of the project, like System Test (or "alpha"), Beta Test and
Deployment, are represented in the initial plan as a single set of activities, and will be
more fully defined as the project progresses. Major activities (e.g., System Test,
Deployment, etc.) typically involve their own full-fledged planning processes once the
technical design is completed. At that time, additional activities may be added to the
project plan to allow for more detailed tracking of those project activities.
Best Practices
Sample Deliverables
Project Roadmap
Description
In the broadest sense, project management begins before the project starts and
continues until its completion and perhaps beyond. The management effort includes:
In a more specific sense, project management involves being constantly aware of, or
preparing for, anything that needs to be accomplished or dealt with to further the
project objectives, and making sure that someone accepts responsibility for such
occurrences and delivers in a timely fashion.
● Project Kick-off, including the initial project scope, project organization, and
project plan
● Project Status and reviews of the plan and scope
● Project Content Reviews, including business requirements reviews and
technical reviews
● Change Management as scope changes are proposed, including changes to
staffing or priorities
● Issues Management
Prerequisites
None
Considerations
In all management activities and actions, the Project Manager must balance the needs
and expectations of the Project Sponsor and project beneficiaries with the needs,
limitations and morale of the project team. Limitations and specific needs of the team
must be communicated clearly and early to the Project Sponsor and/or company
management to mitigate unwarranted expectations and avoid an escalation of
expectation-frustration that can have a dire effect on the project outcome. Issues that
affect the ability to deliver in any sense, and potential changes to scope, must be
brought to the Project Sponsor's attention as soon as possible and managed to
satisfactory resolution.
Best Practices
None
Sample Deliverables
Issues Tracking
Description
This is a summary task that entails closing out the project and creating project wrap-up
documentation.
Each project should end with an explicit closure procedure. This process should include
Sponsor acknowledgement that the project is complete and the end product meets
expectations. A Project Close Report should be completed at the conclusion of the
effort, along with a final status report.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
2 Analyze
Description
Increasingly, organizations
demand faster, better, and cheaper delivery of data integration and business
intelligence solutions. Many development failures and project cancellations can be
traced to an absence of adequate upfront planning and scope definition. Inadequately
defined or prioritized objectives and project requirements foster scenarios where
project scope becomes a moving target as requirements may change late in the game,
requiring repeated rework of design or even development tasks. The purpose of
the Analyze Phase is to build a solid foundation for project scope through a deliberate
determination of the business drivers, requirements, and priorities that will form the
basis of the project design and development.
Once the business case for a data integration or business intelligence solution is
accepted and key stakeholders are identified, the process of detailing and prioritizing
objectives and requirements can begin - with the ultimate goal of defining project scope
and, if appropriate, a roadmap for major project stages.
Prerequisites
None
Roles
Considerations
Functional and technical requirements must focus on the business goals and objectives
of the stakeholders, and must be based on commonly agreed-upon definitions of
business information. The initial business requirements are then compared to feasibility
studies of the source systems to help the prioritization process that will result in a
project roadmap and rough timeline. This sets the stage for incremental delivery of the
requirements so that some important needs are met as soon as possible, thereby
providing value to the business even though there may be a much longer timeline to
complete the entire project. In addition, during this phase it can be valuable to identify
the available technical metadata as a way to accelerate the design and improve its
quality. A successful Analyze Phase can serve as a foundation for a successful project.
Best Practices
None
Sample Deliverables
None
Description
In many ways, the potential for success of any data integration/business intelligence
solution correlates directly to the clarity and focus of its business scope. If the business
objectives are vague, there is a much higher risk of failure or, at least, of a less-than-
direct path to likely limited success.
Business Drivers
The business drivers explain why the solution is needed and is being recommended at
a particular time by identifying the specific business problems, issues, or increased
business value that the project is likely to resolve or deliver. Business drivers may
include background information necessary to understand the problems and/or needs.
There should be clear links between the project’s business drivers and the company’s
underlying business strategies.
Business Objectives
Objectives are concrete statements describing what the project is trying to achieve.
Objectives should be explicitly defined so that they can be evaluated at the conclusion
of a project to determine if they were achieved.
Objectives written for a goal statement are nothing more than a deconstruction of the
goal statement into a set of necessary and sufficient objective statements. That is,
every objective must be accomplished to reach the goal, and no objective is
superfluous.
Objectives are important because they establish a consensus between the project
sponsor and the project beneficiaries regarding the project outcome. The specific
deliverables of an IT project, for instance, may or may not make sense to the project
sponsor. However, the business objectives should be written so they are
understandable by all of the project stakeholders.
Goal statements provide the overall context for what the project is trying to accomplish.
They should align with the company's stated business goals and strategies. Project
context is established in a goal statement by stating the project's object of study, its
purpose, its quality focus, and its viewpoint. Characteristics of a well-defined goal
should reference the project's business benefits in terms of cost, time, and/or quality.
Because goals are high-level statements, it may take more than one project to achieve
a stated goal. If the goal's achievement can be measured, it is probably defined at too
low a level and may actually be an objective. If the goal is not achievable through any
combination of projects, it is probably too abstract and may be a vision statement.
Every project should have at least one goal. It is the agreement between the company
and the project sponsor about what is going to be accomplished by the project. The
goal provides focus and serves as the compass for determining if the project outcomes
are appropriate. In the project management life cycle, the goal is bound by a number of
objective statements. These objective statements clarify the fuzzy boundary of the goal
statement. Taken as a pair, the goal and objectives statements define the project. They
are the foundation for project planning and scope definition.
Prerequisites
None
Roles
Considerations
Business Drivers
The business drivers must be defined using business language. Identify how the
project is going to resolve or address specific business problems. Key components
when identifying business drivers include:
Large projects often have significant business and technical requirements that drive the
project's development. Consider explaining the origins of the significant requirements
as a way of explaining why the project is needed.
Business Objectives
Before the project starts, define and agree on the project objectives and the business
goals they define. The deliverables of the project are created based on the objectives -
not the other way around. A meeting between all major stakeholders is the best way to
create the objectives and gain a consensus on them at the same time. This type of
meeting encourages discussion among participants and minimizes the amount of time
involved in defining business objectives and goals. It may not be possible to gather all
the project beneficiaries and the project sponsor together at the same time so multiple
meetings may have to be arranged with the results summarized.
The business objectives should take into account the results of any data quality
investigations carried out before or during the project. If the project source data quality is known to be poor, the objectives should allow for the effort needed to remediate it.
Generally speaking, the number of objectives comes down to how much business
investment is going to be made in pursuit of the project's goals. High investment
projects generally have many objectives. Low investment projects must be more
modest in the objectives they pursue. There is considerable discretion in how granular
a project manager may get in defining objectives. High-level objectives generally need
a more detailed explanation and often lead to more definition in the project's
deliverables to obtain the objective. Lower level, detailed objectives tend to require less
descriptive narrative and deconstruct into fewer deliverables to obtain. Regardless of
the number of objectives identified, the priority should be established by ranking the
objectives with their respective impacts, costs, and risks.
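Ranking objectives by their respective impacts, costs, and risks can be sketched as a simple scoring exercise. The objective names, 1-5 scores, and weighting scheme below are illustrative choices, not a prescribed method.

```python
# Sketch of ranking objectives by impact, cost, and risk, as suggested
# above. The scores (1-5) and the weighting are illustrative choices.

objectives = [
    {"name": "Single customer view",  "impact": 5, "cost": 4, "risk": 3},
    {"name": "Daily sales reporting", "impact": 4, "cost": 2, "risk": 1},
    {"name": "Supplier scorecards",   "impact": 3, "cost": 3, "risk": 2},
]

def priority(obj):
    """Higher impact raises priority; higher cost and risk lower it."""
    return obj["impact"] * 2 - obj["cost"] - obj["risk"]

ranked = sorted(objectives, key=priority, reverse=True)
```

Under this illustrative weighting, a high-impact but costly and risky objective can rank below a moderate-impact objective that is cheap and safe to deliver.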
Business Goals
The goal statement must also be written in business language so that anyone who
reads it can understand it without further explanation. The goal statement should:
Smaller projects generally have a single goal. Larger projects may have more than one
goal, which should also be prioritized. Since the goal statement is meant to be succinct,
regardless of the number of goals a project has, the goal statement should always be
brief and to the point.
Best Practices
None
Sample Deliverables
None
Description
The goal of this task is to ensure the participation and consensus of the project sponsor
and key beneficiaries during the discovery and prioritization of these information
requirements.
Prerequisites
None
Roles
Strategic Requirements
Tactical Requirements
● The tactical requirements serve the ‘day to day’ business. Operational level
employees want solutions to enable them to manage their on-going work and
solve immediate problems. For instance, a distributor running a fleet of trucks
has an unavailable driver on a particular day. They would want to answer
questions such as, 'How can the delivery schedule be altered in order to meet
the delivery time of the highest priority customer?' Answers to these questions
are valid and pertinent for only a short period of time in comparison to the
strategic requirements.
Best Practices
None
Sample Deliverables
None
Description
A business rule is a compact and simple statement that represents some important
aspect of a business process or policy. By capturing the rules of the business—the
logic that governs its operation—systems can be created that are fully aligned with the
needs of the organization.
Business rules stem from the knowledge of business personnel and constrain some
aspect of the business. From a technical perspective, a business rule expresses
specific constraints on the creation, updating, and removal of persistent data in an
information system. For example, a new bank account cannot be created unless the
customer has provided an adequate proof of identification and address.
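The bank-account rule above can be expressed as a single, testable constraint. The field names are illustrative; a real system would verify actual documents rather than flag-like values.

```python
# The bank-account rule above as an atomic, testable constraint.
# Field names are illustrative assumptions.

def may_create_account(customer):
    """A new account requires proof of identification and of address."""
    return bool(customer.get("id_proof")) and bool(customer.get("address_proof"))

approved = may_create_account({"id_proof": "passport",
                               "address_proof": "utility bill"})
rejected = may_create_account({"id_proof": "passport",
                               "address_proof": None})
```

Expressing the rule as one small predicate keeps it atomic: it states a single constraint, and it can later be mapped directly onto a data model constraint.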
Prerequisites
None
Roles
Considerations
The aim is to define atomic business rules, that is, rules that cannot be decomposed
further. Each atomic business rule is a specific, formal statement of a single term, fact,
derivation, or constraint on the business. The components of business rules, once
formulated, provide direct inputs to a subsequent conceptual data modeling and
analysis phase. In this approach, definitions and connections can eventually be
mapped onto a data model and constraints and derivations can be mapped onto a set
of rules that are enforced in the data model.
Best Practices
None
Sample Deliverables
Description
Data stewardship is about keeping the business community involved and focused on
the goals of the project being undertaken. This subtask outlines the roles and
responsibilities that key personnel can assume within the framework of an overall
stewardship program. This participation should be regarded as ongoing because
stewardship activities need to be performed at all stages of a project lifecycle and
continue through the operational phase.
Prerequisites
None
Roles
Considerations
● An executive sponsor
● A business steward
● A technical steward
● A data steward
Executive Sponsor
Technical Steward
Business Steward
Data Steward
The mix of personnel for a particular activity should be adequate to provide expertise in
each of the major business areas that will be undertaken in the project.
The success of the stewardship function relies on the early establishment and
distribution of standardized documentation and procedures. These should be
distributed to all of the team members working on stewardship activities.
● Arbitration
● Sanity checking
● Preparation of metadata
● Support
Arbitration
Arbitration means resolving data contention issues, deciding which is the best data to
use, and determining how this data should best be transformed and interpreted so that
it remains meaningful and consistent. This is particularly important during the phases
where ambiguity needs to be resolved, for example, when conformed dimensions and
standardized facts are being formulated by the analysis teams.
Sanity Checking
There is a role for the data stewardship committee to check the results and ensure that
the transformation rules and processes have been applied correctly. This is a key
verification task and is particularly important in evaluating prototypes developed in the
Analyze Phase , during testing, and after the project goes live.
Preparation of Metadata
The data stewardship committee should be actively involved in the preparation and
verification of technical and business metadata. Specific tasks are:
Depending on the tools used to determine the metadata (for example, PowerCenter
Profiling option, Informatica Data Explorer), the Data Steward may take a lead role in
this activity.
Business metadata is used to answer questions such as: “How does this division
of the enterprise calculate revenue?”
Technical metadata is used to perform analysis such as: “What would be the
impact of changing the length of a field from 20 to 30 characters and what
systems would be affected?”
Support
The data stewardship committee should be involved in the inception and preparation of
training of the user community by answering questions about data and the tools
available to perform analytics. During the Analyze Phase the team would provide
inputs to induction training programs prepared for system users when the project goes
live. Such programs should include, for example, technical information about how to
query the system and semantic information about the data that is retrieved.
New Functionality
The data stewardship committee needs to assess any major additions to functionality.
The assessment should consider return on investment, priority, and scalability in terms
of new hardware/software requirements. There may be a need to perform this activity
during the Analyze Phase if functionality that was initially overlooked is to be included
in the scope of the project. After the project has gone live, this activity is of key
importance because new functionality needs to be assessed for ongoing development.
Best Practices
None
Sample Deliverables
None
Description
The business scope forms the boundary that defines where the project begins and
ends. Throughout the project discussions about the business requirements and
objectives, it may appear that everyone views the project scope in the same way.
However, there is commonly confusion about what falls inside the boundary of a
specific project and what does not. Developing a detailed project scope and socializing
it with your project team, sponsors, and key stakeholders is critical.
Prerequisites
None
Roles
The primary consideration in developing the business scope is balancing the high-
priority needs of the key beneficiaries with the need to provide results within the near-
term. The Project Manager and Business Analysts need to determine the key business
needs and determine the feasibility of meeting those needs to establish a scope that
provides value, typically within a 60 to 120 day time-frame.
Quick WINS (Ways to Implement New Solutions) are accomplishments achieved in a relatively short time, without great
expense, and with a positive outcome; they can be included in the business scope.
Tip
As a general rule, involve as many project beneficiaries as possible in the needs
assessment and goal definition. A "forum" type of meeting may be the most efficient
way to gather the necessary information since it minimizes the amount of time
involved in individual interviews and often encourages useful dialog among the
participants. However, it is often difficult to gather all of the project beneficiaries and
the project sponsor together for any single meeting, so you may have to arrange
multiple meetings and summarize the input for the various participants.
A common mistake made by project teams is to define the project scope only in general
terms. This lack of definition causes managers and key beneficiaries throughout the
company to make assumptions related to their own processes or systems falling inside
or outside of the scope of the project. Then later, after significant work has been
completed by the project team, some managers are surprised to learn that their
assumptions were not correct, resulting in problems for the project team. Other project
teams report problems with "scope creep" as their project gradually takes on more and
more work. The safest rule is “the more detail, the better” along with details regarding
what related elements are not within scope or will be delayed to a later effort.
Best Practices
None
Sample Deliverables
None
Description
Before beginning any work with the data, it is necessary to determine precisely what
data is required to support the data integration solution. In addition, the developers
must also determine what source systems house the data, where the data resides in
the source systems, and how the data is accessed.
In this subtask, the development project team needs to validate the initial list of source
systems and source formats and obtain documentation from the source system owners
describing the source system schemas. For relational systems, the documentation
should include Entity-Relationship diagrams (E-R diagrams) and data dictionaries, if
available. For file-based data sources (e.g., unstructured, semi-structured, and complex
XML), documentation may also include data format specifications, both internal and
public (in the case of open data format standards), and any deviations from public
standards. The development team needs to carefully review the source system
documentation to ensure that it is complete (i.e., specifies data owners and
dependencies) and current. The team also needs to ensure that the data is fully
accessible to the developers and analysts that are building the data integration solution.
Prerequisites
None
Roles
In determining the source systems for data elements, it is important to request copies
of the source system data to serve as samples for further analysis. This is a
requirement in 2.8.1 Perform Data Quality Analysis of Source Data , but is also
important at this stage of development. As data volumes in the production environment
are often large, it is advisable to request a subset of the data for evaluation purposes.
However, requesting too small a subset can be dangerous in that it fails to provide a
complete picture of the data and may hide quality issues that truly exist.
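As an illustration of this sampling risk, the following Python sketch (using a hypothetical customer extract) shows how a head-of-file subset can hide a quality problem that a random sample is likely to expose:

```python
import random

def sample_source(rows, n, seed=42):
    """Draw a random sample of n rows from source data.

    A head-of-file sample (rows[:n]) often reflects only the oldest
    records; a random sample is more likely to expose quality issues
    spread across the data set.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if len(rows) <= n:
        return list(rows)
    return rng.sample(rows, n)

# Hypothetical source extract: only the later records have a quality
# problem (missing email addresses)
rows = [{"cust_id": i, "email": ("" if i > 900 else f"u{i}@x.com")}
        for i in range(1000)]

head = rows[:100]                # naive subset: first 100 rows
rand = sample_source(rows, 100)  # random subset: 100 rows

missing_in_head = sum(1 for r in head if not r["email"])
missing_in_rand = sum(1 for r in rand if not r["email"])
# The head sample sees no missing emails; the random sample usually does.
```

The same principle applies whatever tooling performs the extract: the sample should span the full keyspace and history of the source, not just the most convenient slice.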
Another important element of the source system analysis is to determine the life
expectancy of the source system itself. Try to determine if the source system is likely
to be replaced or phased out in the foreseeable future. As companies merge, or
technologies and processes improve, many companies upgrade or replace their
systems. This can present challenges to the team, as the people with primary
knowledge of those systems may depart as well. Understanding the life expectancy of
the source system will play a crucial part in the design process.
For example, assume you are building a customer data warehouse for a small bank.
The primary source of customer data is a system called Shucks, and you will be
building a staging area in the warehouse to act as a landing area for all of the source
data. After your project starts, you discover that the bank is being bought out by a
larger bank and that Shucks will be replaced within three months by the larger bank's
source of customer data: a system called Grins. Instead of having to redesign your
entire data warehouse to handle the new source system, it may be possible to design a
generic staging area that could fit any customer source system instead of building a
staging area based on one specific source system. Assuming that the bulk of your
processing occurs after the data has landed in the staging area, you can minimize the
impact of replacing source systems by designing a generic staging area that would
essentially allow you to plug in the new source system. Designing this type of staging
area, however, takes considerable planning and adds time to the schedule, but it is
well worth the effort because the warehouse can then handle source system
changes.
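The generic staging idea can be sketched in a few lines of Python. The staging columns and the Shucks/Grins field names below are purely illustrative; the point is that only the per-source mapping changes when a source system is replaced:

```python
# Per-source column mappings are the only source-specific artifact;
# everything downstream reads the generic staging layout. The system
# names follow the example above; the fields are hypothetical.
STAGING_COLUMNS = ["customer_id", "full_name", "postal_code"]

SOURCE_MAPPINGS = {
    "shucks": {"customer_id": "CUST_NO", "full_name": "CNAME",
               "postal_code": "ZIP"},
    "grins":  {"customer_id": "client_ref", "full_name": "display_name",
               "postal_code": "post_cd"},
}

def land(source, record):
    """Translate one source record into the generic staging layout."""
    mapping = SOURCE_MAPPINGS[source]
    return {col: record.get(mapping[col]) for col in STAGING_COLUMNS}

old = land("shucks", {"CUST_NO": 7, "CNAME": "Ada", "ZIP": "02139"})
new = land("grins", {"client_ref": 7, "display_name": "Ada",
                     "post_cd": "02139"})
# Both sources land identically; replacing Shucks with Grins only
# requires registering a new mapping, not redesigning the warehouse.
```

In a real warehouse the mapping would live in metadata tables or ETL configuration rather than code, but the design principle is the same.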
For Data Migration, the source systems that are in scope should be understood at the
start of the project. During the Analyze Phase these systems should be confirmed and
communicated to all key stakeholders. If there is a disconnect between which systems
are in and out of scope, it is important to document and analyze the impact. Identifying
new source systems can dramatically increase the resources needed on the project
and require re-planning. Make a point to over-communicate which systems are in
scope.
Sample Deliverables
None
Description
Before beginning to work with the data, it is necessary to determine precisely what data
is required to support the data integration solution.
Take care to focus only on data that is within the scope of the requirements.
Involvement of the business community is important in order to prioritize the business
data needs based upon how effectively the data supports the users' top priority
business problems.
Prerequisites
None
Roles
Considerations
In determining the source systems for data elements, it is important to request copies
of the source system data to serve as samples for further analysis. Because data
volumes in the production environment are often large, it is advisable to request a
subset of the data for evaluation purposes. However, requesting too small a subset can
be dangerous in that it fails to provide a complete picture of the data and may hide any
quality issues that exist.
Particular care needs to be taken when archived historical data (e.g., data archived on
tapes) or syndicated data sets (i.e., externally provided data such as market research)
is required as a source to the data integration application. Additional resources and
procedures may be required to sample and analyze these data sources.
A list of business data sources should have been prepared during the business
requirements phase. This list typically identifies 20 or more types of data that are
required to support the data integration solution and may include, for example, sales
forecasts, customer demographic data, product information (e.g., categories and
classifiers), and financial information (e.g., revenues, commissions, and budgets).
The candidate source systems (i.e., where the required data can be found) can be
identified based on this list. There may be a single source or multiple sources for the
required data.
Consider, for example, a low-latency data integration application that requires credit
checks to be performed on customers seeking a loan. In this case, the relevant source
systems may be:
● A call center that captures the initial transactional request and passes this
information in real time to a data integration application.
● An external system against which a credibility check needs to be performed by
the data integration application (i.e., to determine a credit rating).
● An internal data warehouse accessed by the data integration application to
validate and complement the information.
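The flow above can be sketched as follows; the three systems are represented by hypothetical stubs, and all identifiers and field names are illustrative:

```python
# Stub for the external credibility-check system (credit bureau).
def external_credit_rating(ssn):
    return {"123-45-6789": "A"}.get(ssn, "unrated")

# Stub for the internal data warehouse used to validate and
# complement the request.
WAREHOUSE = {"123-45-6789": {"customer_since": 2015, "defaults": 0}}

def process_loan_request(request):
    """Low-latency flow: enrich the call-center request with the
    external rating and the internal warehouse history."""
    ssn = request["ssn"]
    return {
        **request,
        "credit_rating": external_credit_rating(ssn),  # external system
        "history": WAREHOUSE.get(ssn, {}),             # internal warehouse
    }

result = process_loan_request({"ssn": "123-45-6789", "amount": 10000})
```

The sketch makes the dependency explicit: the latency and availability of each of the three sources constrains the end-to-end response time of the integration application.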
Timeliness, reliability, accuracy of data, and a single source for reference data may be
key factors influencing the selection of the source systems. Note that projects typically
underestimate problems in these areas. Many projects run into difficulty because poor
data quality, both at high (metadata) and low (record) levels, impacts the ability to
perform transform and load operations.
An appreciation of the underlying technical feasibility may also impact the choice of
data sources and should be within the scope of the high-level analysis being
undertaken. This activity involves compiling information about the “as is” and “as will be”
technological landscape that affects the characteristics of the data source systems and
their impact on the data integration solution. Factors to consider in this survey are:
For B2B solutions, solutions with significant file-based data sources, and other
solutions with complex data transformation requirements, it is also necessary to assess
data sizes, volumes, and the frequency of data updates with respect to the ability to
parse and transform the data and the implications these factors will have on hardware
and software requirements.
A high-level analysis should also allow for the early identification of risks associated
with the planned development, for example:
Data Quality
The next step in determining source feasibility is to perform a detailed analysis of the
data sources, both in structure and in content, and to create an accurate model of the
source data systems. Understanding data sources requires the participation of a data
source expert/Data Quality Developer and a business analyst to clarify the relevance
and meaning of the data.
The output of the data profiling effort is a survey that documents the findings; its
recipients include the data stewardship committee.
Bear in mind that the issue of data quality cuts in two directions: discovering the
structure and metadata characteristics of the source data, and analyzing the low-level
quality of the data in terms of record accuracy, duplication, and other metrics. In-depth
structural and metadata profiling of the data sources can be conducted through
Informatica Data Explorer. Low-level/per-record data quality issues also must be
uncovered and, where necessary, corrected or flagged for correction at this stage in the
project. See 2.8 Perform Data Quality Audit for more information on required data
quality and data analysis steps.
The next step is to determine when all source systems are likely to be available for data
extraction. This is necessary in order to determine realistic start and end times for the
load window. The developers need to work closely with the source system
administrators during this step because the administrators can provide specific
information about the hours of operations for their systems.
The Source Availability Matrix lists all the sources that are being used for data
extraction and specifies the systems' downtimes during a 24-hour period. This matrix
should contain details of the availability of the systems on different days of the week,
including weekends and holidays.
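A minimal sketch of such a matrix, and of deriving the common load window from it, might look like this in Python (the systems and downtimes are illustrative):

```python
# A minimal Source Availability Matrix for one day: per source, the
# hours (0-23) during which the system is available for extraction.
# System names and downtimes are hypothetical.
availability = {
    "orders_db": set(range(1, 24)),                   # down 00:00-01:00
    "crm":       set(range(0, 22)),                   # down 22:00-24:00
    "mainframe": set(range(2, 6)) | set(range(20, 24)),
}

def common_load_window(matrix):
    """Return the hours when every source is simultaneously available,
    i.e., the candidate load window for extraction."""
    hours = set(range(24))
    for avail in matrix.values():
        hours &= avail
    return sorted(hours)

window = common_load_window(availability)
# → [2, 3, 4, 5, 20, 21]: extraction can run 02:00-06:00 or 20:00-22:00
```

A full matrix would repeat this per day of the week, including weekends and holidays, so the narrowest window across all days can be identified.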
For Data Migration projects, access to data is not normally a problem given the premise
of the solution. Typically, data migration projects have high-level sponsorship and
whatever is needed is provided. However, for smaller-impact projects it is important
that direct access is provided to all systems that are in scope. If direct access is not
available, timelines should be extended and risk items should be added to the project.
Historically, most projects without direct access run over schedule due to the limited
availability of key resources to provide extracted data. If this can be avoided by
providing direct access, it should be.
For solutions with complex data transformation requirements, the final step is to
determine the feasibility of transforming the data to target formats and any implications
that will have on the eventual system design.
Very large flat file formats often require splitting processes to be introduced into the
design in order to break the data into manageably sized chunks for subsequent
processing. This will require identification of appropriate boundaries for splitting and
may require additional steps to convert the data into formats that are suitable for
splitting.
For example, large PDF-based data sources may require conversion into another
format, such as XML, before the data can be split.
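For a simple line-oriented flat file, the splitting step can be sketched as below; hierarchical formats such as XML require more sophisticated boundary detection so that no logical record is cut in half:

```python
def split_records(lines, max_chunk):
    """Split a flat-file extract into chunks of at most max_chunk
    records, always breaking on record (line) boundaries so that no
    record is split across chunks."""
    chunks = []
    for start in range(0, len(lines), max_chunk):
        chunks.append(lines[start:start + max_chunk])
    return chunks

# Hypothetical ten-record extract split into chunks of at most four
records = [f"rec{i}" for i in range(10)]
chunks = split_records(records, 4)
# → 3 chunks of sizes 4, 4 and 2
```

The chunk size would in practice be tuned to the memory and parallelism characteristics of the transformation engine, which is exactly the hardware/software implication noted above.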
Best Practices
None
Description
This subtask provides detailed business requirements that lead to design of the target
data structures for a data integration project. For Operational Data Integration projects,
this may involve identifying a subject area or transaction set within an existing
operational schema or a new data store. For Data Warehousing / Business Intelligence
projects, this typically involves putting some structure to the informational
requirements. The preceding business requirements tasks (see Prerequisites) provide
a high-level assessment of the organization's business initiative and provide business
definitions for the information desired.
Note that if the project involves enterprise-wide data integration, it is important that the
requirements process involve representatives from all interested departments and that
those parties reach a semantic consensus early in the process.
Prerequisites
None
Roles
Considerations
Metrics
Often a mix of financial (e.g., budget targets) and operational (e.g., trends in customer
satisfaction) key performance metrics is required to achieve a balanced measure of the
organizational performance.
The key performance metrics may be directly sourced from an existing operational
system or may require integration of data from various systems. Market analytics may
indicate a requirement for metrics to be compared to external industry performance
criteria.
Dimensions
Data migration projects should be exclusively driven by the target system needs, not by
what is available in the source systems. Therefore, it is recommended to identify the
target system needs early in the Analyze Phase and focus the analysis activities on
those objects.
B2B Projects
For B2B and non-B2B projects that have significant flat-file-based data targets,
consideration needs to be given to the target data to be generated. Considerations
include:
At a higher level, the number and complexity of data sources, the number and
complexity of data targets and the number and complexity of intermediate data formats
and schemas determine the overall scope of the data transformation and integration
aspects of B2B data integration projects as a whole.
Best Practices
None
Sample Deliverables
None
Description
Data Integration projects, whether data warehousing or operational data integration, are often large-scale, long-term
projects. This can also be the case with analytics visualization projects or metadata reporting/management projects. Any
complex project should be considered a candidate for incremental delivery. Under this strategy, the project's
comprehensive objectives are broken into prioritized deliverables, each of which can be completed within
approximately three months. This yields near-term deliverables that provide early value to the business (which can be
helpful in funding discussions) and also creates an important avenue for early end-user feedback that may enable the
development team to avoid major problems. This feedback may point out misconceptions or other design flaws which, if
undetected, could cause costly rework later on.
This roadmap, then, provides the project stakeholders with a rough timeline for completion of their entire objective, but also
communicates the timing of these incremental sub-projects based on their prioritization. Below is an example of a timeline
for a Sales and Finance data warehouse with the increments roughly spaced each quarter. Each increment builds on the
completion of the prior increment, but each delivers clear value in itself.
Q1 Yr 1: Implement Data Warehouse Architecture
Q2 Yr 1: Revenue Analytics
Q3 Yr 1: Complete Bookings, Billings, Backlog
Q4 Yr 1: GL Analytics
Q1 Yr 2: COGS Analysis
Prerequisites
None
Roles
Considerations
The roadmap is the culmination of business requirements analysis and prioritization. The business requirements are
reviewed for logical subprojects (increments), source analysis is reviewed to provide feasibility, and business priorities are
used to set the sequence of the increments, factoring in feasibility and the interoperability or dependencies of the
increments.
Advantages can be:
● Customer value is delivered earlier – the business sees an early start to its ROI.
● Early increments elicit feedback and sometimes clearer requirements that will be valuable in designing the later
increments.
● Much lower risk of overall project failure because of the plan for early, attainable successes.
● Highly likely that even if all of the long-term objectives are not achieved (they may prove infeasible or lose favor with
the business), the project still provides the value of the increments that are completed.
● Because the early increments reflect high-priority business needs, they may attract more visibility and have greater
perceived value than the project as a whole.
Disadvantages can be:
● There is always some extra effort involved in managing the release of multiple increments. However, there is less
risk of costly rework effort due to misunderstood (or changing) requirements because of early feedback from end-
users.
● There may be schema redesign or other rework necessary after initial increments because of unforeseen
requirements or interdependencies.
Best Practices
None
Sample Deliverables
None
Description
For any project to be ultimately successful, it must resolve the business objectives in a
way that the end users find easy to use and satisfactory in addressing their needs. A
functional requirements document is necessary to ensure that the project team
understands these needs in detail and is capable of proceeding with a system design
based upon the end user needs. The business drivers and goals provide a high-level
view of these needs and serve as the starting point for the detailed functional
requirements document. Business rules and data definitions further clarify specific
business requirements and are very important in developing detailed functional
requirements and ultimately the design itself.
Prerequisites
None
Roles
Considerations
The analysis may include studying existing reporting, interviewing current information
providers (i.e., those currently developing reports and analyses for Finance and other
departments), and even reviewing mock-ups and usage scenarios with key end-users.
Data Migration
Operational data integration projects are similar to data migration projects in terms of
the need to understand the target transactions and how the data will be processed to
accommodate them. The
processing may involve multiple load steps, each with a different purpose, some
operational and perhaps some for reporting. There may also be real-time requirements
for some, and there may be a need for interfaces with queue-based messaging
systems in situations where EAI-type integration between operational databases is
involved or master data management requirements.
For all data integration projects (i.e., all of the above), developers also need to review
the source analysis with the DBAs to determine the functional requirements of the
source extraction processes.
B2B Projects
For B2B projects and flat file/XML-based data integration projects, the data formats that
are required for trading partners to interact with the system, the mechanisms for trading
partners and operators to determine the success and failure of transformations and the
internal interactions with legacy systems and other applications all form part of the
functional requirements.
For large B2B projects, overall business process management will typically form part of
the overall system which may impose requirements around the use of partner
management software such as B2B Data Exchange and/or business process
management software.
Often B2B systems may have real-time requirements and involve the use of interfaces
with queue-based messaging systems, web services and other application integration
technologies.
While these are technical, rather than business requirements, for Business Process
Outsourcing and other types of B2B interaction, technical considerations often form a
core component of the business operation.
Best Practices
None
Sample Deliverables
None
Description
Metadata is often articulated as ‘data about data’. It is the collection of information that
further describes the data used in the data integration project. Examples of metadata
include:
In terms of flat file and XML sources, metadata can include open and proprietary data
standards and an organization’s interpretations of those standards. In addition to the
previous examples, flat file metadata can include:
All of these pieces of metadata are of interest to various members of the metadata
community: some are of interest only to certain technical staff members, while other
pieces may be very useful for business people attempting to navigate through the
enterprise data warehouse or across various business/subject-area-oriented data
marts. That is, metadata can provide answers to such typical business questions as:
❍ Capturing
❍ Establishing standards and procedures
❍ Maintaining and securing the metadata
❍ Proper use, quality control, and update procedures
Prerequisites
None
Roles
Considerations
One of the primary objectives of this subtask is to attain broad consensus among all
key business beneficiaries regarding metadata business requirements priorities; it is
therefore critical to obtain as much participation as possible in this process.
For B2B and flat file oriented data integration projects, metadata is often defined in less
structured forms than for data dictionaries or other traditional means of managing
metadata. The process of designing the system may include the need to determine and
document the metadata consumed and produced by legacy and 3rd party systems. In
some cases, applicable metadata may need to be mined from sample operational data
and from unstructured and semi-structured system documentation.
For B2B projects, getting adequate sample source and target data can become a
critical part of defining the metadata requirements.
Best Practices
None
Sample Deliverables
None
Description
Organizations undertaking new initiatives require access to consistent and reliable data
resources. Confidence in the underlying information assets and an understanding of
how those assets relate to one another can provide valuable leverage in the
strategic decision-making process. As organizations grow through mergers and
consolidations, systems that generate data become isolated resources unless they are
properly integrated. Integrating these data assets and turning them into key
components of the decision-making process requires significant effort.
Prerequisites
None
Considerations
The first part of the process is to establish a Metadata Inventory that lists all metadata
sources.
The second part of the process is to investigate in detail those metadata repositories or
sources that will be required to meet the next phase of requirements. This investigation
will establish:
B2B Projects
For B2B and flat file oriented data integration projects, metadata is often defined and
maintained in the form of non-database oriented metadata such as XML schemas or
data format specifications (and specifications as to how standards should be
interpreted). Metadata may need to be mined from sample data, legacy systems and/or
mapping specifications.
Metadata repositories may take the form of document repositories using document
management or source control technologies.
In B2B systems, the responsibility for tracking metadata may shift to members of the
technical architecture team, as traditional database design, planning, and maintenance
may play a lesser role in these systems.
Best Practices
None
Sample Deliverables
Metadata Inventory
Description
Prerequisites
None
Roles
Considerations
Operations personnel generally require metadata regarding either the data integration
processes or business intelligence reporting, or both. This information is helpful in
determining issues or problems with delivering information to the final end-user with
regard to items such as the expected source data sizes versus actual processed; the
time to run specific processes, and if load windows are being met; the number of end
users running specific reports; the time of day reports are being run and when the load
on the system is highest; etc. This metadata allows operations to address issues as
they arise.
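The operational checks described above can be sketched as follows, assuming a hypothetical run log that records expected row counts and load-window deadlines (all process names and values are illustrative):

```python
# Illustrative run log; process names, row counts, and times are
# hypothetical. Times are "HH:MM" strings, so lexical comparison works.
runs = [
    {"process": "load_orders", "expected_rows": 50000, "actual_rows": 49950,
     "window_end": "04:00", "finished": "03:41"},
    {"process": "load_crm", "expected_rows": 12000, "actual_rows": 8000,
     "window_end": "04:00", "finished": "04:25"},
]

def flag_issues(runs, row_tolerance=0.05):
    """Flag runs that missed their load window or whose processed row
    counts deviate from expectations by more than the tolerance."""
    issues = []
    for r in runs:
        if r["finished"] > r["window_end"]:
            issues.append((r["process"], "missed load window"))
        if abs(r["actual_rows"] - r["expected_rows"]) > \
                row_tolerance * r["expected_rows"]:
            issues.append((r["process"], "row count deviation"))
    return issues

problems = flag_issues(runs)
# → [("load_crm", "missed load window"), ("load_crm", "row count deviation")]
```

In practice this operational metadata would be sourced from the run-time repositories of the data integration and BI platforms rather than hand-built structures, but the checks are the same.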
When reviewing metadata, business users want to know how the data was generated
(and related) and what manipulation, if any, was performed to produce it. The
information examined ranges from specific reference metadata (i.e., ontologies and
taxonomies) to the transformations and/or calculations used to create the final report
values.
Sources of Metadata
After initial reporting requirements are developed, the location and accessibility of the
metadata must be considered.
Various other more formalized sources of metadata usually have automated methods
for loading to a metadata repository or warehouse. This includes information that is
held in data modeling tools, data integration platforms, database management systems
and business intelligence tools.
For each of the various types of metadata reporting requirements mentioned, as well as
the various types of metadata sources, different methods of storage may fit better than
others and affect how the various metadata can be sourced.
In the case of metadata for developers and operations personnel, this type can
generally be found and stored in the repositories of the software used to accomplish
the tasks, such as the PowerCenter repository or the business intelligence software
repository. Usually, these software packages include sufficient reporting capability to
meet the required needs of this type of reporting. At the same time, most of these
metadata repositories include locations for manually entering metadata, as well as
automatically importing metadata from various sources.
Specifically, when using the PowerCenter repository as a metadata hub, there are
various locations where description fields can be used to include unstructured type /
more descriptive metadata. Mechanisms such as metadata extensions also allow for
user-defined fields of metadata. In terms of automated loading of
metadata, PowerCenter can import definitions from data modeling tools using Metadata
Exchange. Also, metadata from various sources, targets, and other objects can be
imported natively from the connections the PowerCenter software can make to these
systems, including items such as database management systems, ERP systems via
PowerConnects, and XML schema definitions.
In the case of metadata requirements for a business user, this usually requires a
platform that can integrate the metadata from various metadata sources, as well as
provide a relatively robust reporting function, which specific software metadata
repositories usually lack. Thus, in these cases, a platform like Metadata Manager is
optimal.
The specific types of analysis and reports must also be considered with regard to
specifically what metadata needs to be sourced.
For metadata repositories like PowerCenter, the available analysis is very specific, and
little information beyond what is normally sourced into the repository is available
for reporting.
Metadata Manager provides more specific metadata analysis to help analyze source
repository metadata, including:
● Metadata browsing
● Metadata searching
● Where-used analysis
● Lineage analysis
● Packaged reports
Even with a metadata warehouse platform like Metadata Manager, some analysis
requirements may not be fulfilled by the above-mentioned features and out-of-the-box
reports. Analysis should be performed to identify any gaps
and to determine if any customization or design can be done within Metadata Manager
to resolve the gaps.
Bear in mind that Informatica Data Explorer (IDE) also provides a range of source data
and metadata profiling and source-to-target mapping capabilities.
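Lineage and where-used analysis can be illustrated with a toy metadata graph in Python (all element names are hypothetical); tools such as Metadata Manager implement the same traversals at enterprise scale:

```python
# Toy metadata graph: each element lists the elements it is derived
# from. All element names are hypothetical.
LINEAGE = {
    "report.revenue": ["dw.fact_sales.amount"],
    "dw.fact_sales.amount": ["stg.orders.amt"],
    "stg.orders.amt": ["src.orders.AMT"],
}

def lineage(element):
    """Lineage analysis: walk upstream to the ultimate source elements."""
    parents = LINEAGE.get(element, [])
    if not parents:
        return [element]
    out = []
    for parent in parents:
        out.extend(lineage(parent))
    return out

def where_used(element):
    """Where-used analysis: walk downstream to every dependent element."""
    users = [tgt for tgt, srcs in LINEAGE.items() if element in srcs]
    out = list(users)
    for user in users:
        out.extend(where_used(user))
    return out

sources = lineage("report.revenue")   # upstream origin of a report value
users = where_used("src.orders.AMT")  # everything derived from a source column
```

This is exactly the kind of question the business asks ("where did this report value come from?") and the kind operations asks ("what breaks if this source column changes?").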
Best Practices
None
Sample Deliverables
None
Description
Prerequisites
None
Roles
Considerations
Overall
Environment
● What are the current hardware or software standards? For example, NT vs.
UNIX vs. Linux? Oracle vs. SQL Server? SAP vs. PeopleSoft?
● What, if any, data extraction and integration standards currently exist?
● What source systems are currently utilized? For example, mainframe? flat file?
relational database?
● What, if any, regulatory requirements exist regarding access to and historical
maintenance of the source data?
● What, if any, load window restrictions exist regarding system and/or source
data availability?
● How many environments are used in a standard deployment? For example: 1)
Development, 2) Test, 3) QA, 4) Pre-Production, 5) Production.
● What is or will be the end-user presentation layer?
Project Team
Project Lifecycle
Resolving the answers to questions such as these enables greater accuracy in
project planning, scoping, and staffing efforts. Additionally, the understanding gained
from this assessment ensures that any new project effort will better align its approach
with the established practices of the organization.
Best Practices
None
Sample Deliverables
None
Description
The goal of this task is to determine the readiness of an IT organization with respect to
its technical architecture, the implementation of that architecture, and the associated
staffing required to support the technical solution. Conducting this analysis through
interviews with existing IT team members (such as those noted in the Roles section)
provides evidence as to whether the critical technologies and associated support
systems are sufficiently mature not to present significant risk to the endeavor.
Prerequisites
None
Roles
Considerations
Carefully consider the following questions when evaluating the technical readiness of a
given enterprise:
● Has the architecture team been staffed and trained in the assessment of
critical technologies?
● Have all of the decisions been made regarding the various components of the
infrastructure, including: network, servers, and software?
Best Practices
None
Sample Deliverables
None
Description
Many organizations must now comply with a range of regulatory requirements such as
financial services regulation, data protection, Sarbanes-Oxley, retention of data for
potential criminal investigations, and interchange of data between organizations. Some
industries may also be required to complete specialized reports for government
regulatory bodies. This can mean prescribed reporting, detailed auditing of data, and
specific controls over actions and processing of the data.
These requirements differ from the "normal" business requirements in that they are
imposed by legislation and/or external bodies. The penalties for not precisely meeting
the requirements can be severe. However, there is a "carrot and stick" element to
regulatory compliance. Regulatory requirements and industry standards can also
present the business with an opportunity to improve its data processes and update the
quality of its data in key areas. Successful compliance — for example, in the banking
sector, with the Basel II Accord — brings the potential for more productive and
profitable uses of data.
As data is prepared for the later stages in a project, the project personnel must
establish what government or industry standards the project data must adhere to and
devise a plan to meet these standards. These steps include establishing a catalog of all
reporting and auditing required, including any prescribed content, formats, processes,
and controls. The definitions of content (e.g., inclusion/exclusion rules, timescales,
units, etc.) and any metrics or calculations, are likely to be particularly important.
Prerequisites
None
Roles
Considerations
If your project must comply with a government or industry regulation, or if the business
simply insists on high standards for its data (for example, to establish a “single version
of the truth” for items in the business chain), then you must increase your focus on data
quality in the project. 2.8 Perform Data Quality Audit is dedicated to performing a Data
Quality Audit that can provide the project stakeholders with a detailed picture of the
strengths and weaknesses of the project data in key compliance areas such as
accuracy, completeness, and duplication.
For example, compliance with a request for data under Section 314 of the USA-
PATRIOT Act is likely to be difficult for a business that finds it has large numbers of
duplicate records, or records that contain empty fields, or fields populated with default
values. Such problems should be identified and addressed before the data is moved
downstream in the project.
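A minimal profiling sketch along these lines might count duplicates on a match key together with empty and default-populated fields; the records and field names below are hypothetical:

```python
from collections import Counter

# Illustrative customer extract exhibiting the problems described
# above: a duplicate tax_id, an empty name, and a default placeholder.
records = [
    {"name": "A. Smith", "tax_id": "111", "branch": "UNKNOWN"},
    {"name": "A. Smith", "tax_id": "111", "branch": "NYC"},
    {"name": "",         "tax_id": "222", "branch": "BOS"},
]

def profile(records, key="tax_id", default_values=("UNKNOWN", "N/A")):
    """Count duplicates on the match key, empty fields, and fields
    populated with default placeholder values."""
    keys = Counter(r[key] for r in records)
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    empties = sum(1 for r in records for v in r.values() if v == "")
    defaults = sum(1 for r in records for v in r.values()
                   if v in default_values)
    return {"duplicates": duplicates, "empty_fields": empties,
            "default_values": defaults}

stats = profile(records)
# → {"duplicates": 1, "empty_fields": 1, "default_values": 1}
```

A profile like this, run before data moves downstream, gives stakeholders concrete counts of the compliance exposures rather than anecdotes.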
Regulatory requirements often require the ability to clearly audit the processes affecting
the data. This may require a metadata reporting system that can provide viewing and
reporting of data lineage and ‘where-used’ analysis.
Industry and regulatory standards for data interchange may also affect data model and
ETL designs. HIPAA and HL7-compliance may dictate transaction definitions that affect
healthcare-related projects, as may SWIFT or Basel II for finance-related data.
Potentially there are now two areas to investigate in more detail: data and metadata.
● Map the requirements back to the data and/or metadata required using a
standard modeling approach.
● Use data models and the metadata catalog to assess the availability and
quality of the required data and metadata. Use the data models of the systems
and data sources involved, along with the inventory of metadata.
● Verify that the target data models meet the regulatory requirements.
Also, ensure that reporting requirements can be met, again filling any gaps. It is
important to check that the format, content, and delivery mechanisms for all reports
comply with the regulatory requirements.
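A lightweight way to verify report format compliance is to express the prescribed content rules as validation patterns and check each report row against them. The rule names and patterns below are illustrative assumptions, not taken from any specific regulation:

```python
# Sketch: verify that a generated report row satisfies prescribed content
# and format rules before delivery. Rule names and patterns are illustrative.
import re

REPORT_RULES = {
    "report_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),   # ISO date required
    "entity_id":   re.compile(r"^[A-Z]{2}\d{6}$"),       # assumed ID format
    "amount":      re.compile(r"^\d+\.\d{2}$"),          # two decimal places
}

def check_report_row(row):
    """Return the list of fields that are missing or mis-formatted."""
    failures = []
    for field, pattern in REPORT_RULES.items():
        value = row.get(field, "")
        if not pattern.match(value):
            failures.append(field)
    return failures

row = {"report_date": "2024-03-31", "entity_id": "GB123456", "amount": "100.5"}
print(check_report_row(row))  # ['amount'] -- only one decimal place
```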
Best Practices
None
Sample Deliverables
None
Description
Data Quality is a key factor for several tasks and subtasks in the Analyze Phase. The
quality of the proposed project source data, in terms of both its structure and content, is
a key determinant of the specifics of the business scope and of the success of the
project in general. For information on issues relating primarily to data structure, see
subtask 2.3.2 Determine Sourcing Feasibility; this subtask focuses on the quality of the
data content.
Problems with the data content must be communicated to senior project personnel as
soon as they are discovered. Poor data quality can impede the proper execution of
later steps in the project, such as data transformation and load operations, and can
also compromise the business’ ability to generate a return on the project investment.
This is compounded by the fact that most businesses underestimate the extent of their
data quality problems. There is little point in undertaking a data warehouse, migration, or
integration project if the underlying data is in bad shape.
The Data Quality Audit is designed to analyze representative samples of the source
data and discover their data quality characteristics so that these can be articulated to
all relevant project personnel. The project leaders can then decide what actions, if any,
are necessary to correct data quality issues and ensure that the successful completion
of the project is not in jeopardy.
Prerequisites
None
Roles
Considerations
The Data Quality Audit can typically be conducted very quickly, but the actual time
required is determined by the starting condition of the data and the success criteria
defined at the beginning of the audit. The main steps are as follows:
● Representative samples of source data from all main areas are provided to the
Data Quality Developer.
● The Data Quality Developer uses a data analysis tool to determine the quality
of the data according to several criteria.
● The Data Quality Developer generates summary reports on the data and
distributes these to the relevant roles for discussion and next steps.
Two important aspects of the audit are (1) the data quality criteria used, and (2) the
type of report generated.
You can define any number and type of criteria for your data quality. However, there
are six standard criteria. For example:
❍ A dataset may contain user-entered records for “Batch No. 12345” and
“Batch 12345”, where both records describe the same batch.
❍ A dataset may contain several records with common surnames and street
addresses, indicating that the records refer to a single household; this
type of information is relevant to marketing personnel.
This list is not absolute; the characteristics above are sometimes described with other
terminology, such as redundancy or timeliness. Every organization’s data needs are
different, and the prevalence and relative priority of data quality issues differ from one
organization and one project to the next. Note that the accuracy factor differs from the
other five factors in the following respect: whereas, for example, a pair of duplicate
records may be visible to the naked eye, it can be difficult to tell simply by “eye-balling”
if a given data record is inaccurate. Accuracy can be determined by applying fuzzy
logic to the data or by validating the records against a verified reference data set.
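As a small illustration of the reference-validation approach, the sketch below uses Python's standard-library fuzzy matcher in place of a commercial matching engine; the reference data and threshold are assumptions:

```python
# Sketch: assess accuracy by validating values against a verified reference
# set, using simple fuzzy matching (difflib) in place of a commercial
# matching engine. Reference data and threshold are assumptions.
from difflib import SequenceMatcher

REFERENCE_CITIES = {"Springfield", "Shelbyville", "Capital City"}

def best_match(value, reference, threshold=0.85):
    """Return the closest reference value if similarity clears the threshold."""
    scored = [(SequenceMatcher(None, value.lower(), ref.lower()).ratio(), ref)
              for ref in reference]
    score, match = max(scored)
    return match if score >= threshold else None

print(best_match("Springfeild", REFERENCE_CITIES))  # transposed letters still match
print(best_match("Gotham", REFERENCE_CITIES))       # no plausible match -> None
```

The point is that an inaccurate value ("Springfeild") cannot be caught by eye-balling a single record; it is only detectable by comparison against verified reference data.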
Best Practices
Sample Deliverables
None
Description
The data quality audit is a business rules-based approach that aims to help define project expectations through the use of data
quality processes (or plans) and data quality scorecards. It involves conducting a data analysis on the project data, or on
a representative sample of the data, and producing an accurate and qualified summary of the data’s quality. This subtask focuses
on data quality analysis. The results are processed and presented to the business users in the next subtask 2.8.2 Report
Analysis Results to the Business.
Prerequisites
None
Roles
Considerations
The main objective of this step is to meet with the data steward and business owners to identify the data sources to be analyzed.
For each data source, the Data Quality Developer will need all available information on the data format, content, and structure, as
well as input on known data quality issues. The result of this step is a list of the sources of data to be analyzed, along with
the identification of all known issues. These define the initial scope of the audit. The following figure illustrates selecting target
data from multiple sources.
This step identifies and quantifies data quality issues in the source data. Data quality analysis plans are configured in Informatica
Data Quality (IDQ) Workbench. (The plans should be configured in a manner that enables the production of scorecards in the
next subtask. A scorecard is a graphical representation of the levels of data quality in the dataset.)
Data analysis provides detailed metrics to guide the next steps of the audit. For example:
● For character data, analysis identifies all distinct values (such as code values) and their frequency distribution.
● For numeric data, analysis provides statistics on the highest, lowest, average, and total, as well as the number of positive
values, negative values, zero/null values, and any non-numeric values.
● For dates, analysis identifies the highest and lowest dates, the number of blank/null fields, as well as any invalid date values.
● For consumer packaging data, analysis can detect issues such as bar codes with correct/incorrect numbers of digits.
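The column-level metrics above can be sketched in a few lines of profiling code. This is a simplified stand-in for what a profiling tool such as IDQ computes; the sample values are illustrative:

```python
# Sketch: column profiling along the lines described above -- distinct-value
# frequencies for character data, and min/max/null statistics for numeric
# data. Sample values are illustrative.
from collections import Counter

def profile_character(values):
    """Frequency distribution of distinct values (e.g., code fields)."""
    return Counter(values)

def profile_numeric(values):
    """Basic numeric statistics, counting nulls and non-numeric values."""
    stats = {"nulls": 0, "non_numeric": 0, "positive": 0, "negative": 0, "zero": 0}
    numbers = []
    for v in values:
        if v is None or v == "":
            stats["nulls"] += 1
            continue
        try:
            n = float(v)
        except ValueError:
            stats["non_numeric"] += 1
            continue
        numbers.append(n)
        key = "positive" if n > 0 else "negative" if n < 0 else "zero"
        stats[key] += 1
    if numbers:
        stats.update(lowest=min(numbers), highest=max(numbers),
                     total=sum(numbers), average=sum(numbers) / len(numbers))
    return stats

print(profile_character(["A", "B", "A", "C"]))
print(profile_numeric(["10", "-2", "0", "", "abc"]))
```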
The key objectives of this step are to identify issues in the areas of completeness, conformity, and consistency, to prioritize data
quality issues, and to define customized data quality rules. These objectives involve:
● Discussions of data quality analyses with business users to define completeness, conformity, and consistency rules for each
data element.
● Tuning and re-running the analysis plans with these business rules.
For each data set, a set of base rules must be established to test the conformity of the attributes' data values against basic
rule definitions. For example, if an attribute has a date type, then that attribute should only have date information stored. At a
minimum, all the necessary fields must be tested against the base rule sets. The following figure illustrates business rule evaluation.
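Base-rule testing of this kind can be sketched as follows; the attribute names, rules, and sample rows are assumptions for illustration:

```python
# Sketch: base-rule conformity testing -- each attribute's values are checked
# against its declared type, as in the date example above. Attribute names
# and sample data are illustrative.
from datetime import datetime

def is_date(value, fmt="%Y-%m-%d"):
    """True if the value parses as a date in the expected format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

BASE_RULES = {"order_date": is_date, "quantity": str.isdigit}

def conformity(rows):
    """Percentage of values per attribute that pass the base rule."""
    results = {}
    for field, rule in BASE_RULES.items():
        values = [r[field] for r in rows]
        passed = sum(1 for v in values if rule(v))
        results[field] = 100.0 * passed / len(values)
    return results

rows = [{"order_date": "2024-01-15", "quantity": "3"},
        {"order_date": "15/01/2024", "quantity": "two"}]
print(conformity(rows))  # 50.0 percent conformity for each attribute
```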
Sample Deliverables
None
Description
The steps outlined in subtask 2.8.1 lead to the preparation of the Data Quality Audit Report, which
is delivered in this subtask. The Data Quality Audit report highlights the state of the data analyzed in
an easy-to-read, high-impact fashion. Typical report components include:
● Data quality scorecards - charts and graphs of data quality that can be pre-set to present
and compare data quality across key fields and data types
● Drill-down reports that permit reviewers to access the raw data underlying the summary
information
● Exception files
In this subtask, potential risk areas are identified and alternative solutions are evaluated. The Data
Quality Audit concludes with a presentation of these findings to the business and project
stakeholders and agreement on recommended next steps.
Prerequisites
None
Roles
Considerations
There are two key activities in this subtask: delivering the report, and framing a discussion for the
business about what actions to take based on the report conclusions.
Creating Scorecards
Informatica Data Quality (IDQ) is used to identify, measure, and categorize data quality issues
according to business criteria. IDQ reports information in several formats, including database tables,
CSV files, HTML files, and graphically. (Graphical displays, or scorecards, are linked to the
underlying data so that viewers can move from high-level to low-level views of the data.)
Part of the report creation process is agreeing on pass/fail scores for the data and
assigning weights to the data performance for different criteria. For example, the business may
state that at least 98 percent of values in address data fields must be accurate and weight the zip
+four field as most important. Once the scorecards are defined, the data quality plans can be re-used
to track data quality progress over time and throughout the organization.
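A weighted scorecard of this kind reduces to a simple calculation. The 98 percent threshold comes from the example above; the field weights and scores below are assumed for illustration:

```python
# Sketch: a weighted data quality scorecard in the spirit described above.
# The 98 percent threshold comes from the text; field weights and the
# per-field scores are assumptions.
WEIGHTS = {"zip_plus_four": 0.5, "street": 0.3, "city": 0.2}  # zip+4 weighted highest
PASS_THRESHOLD = 98.0  # percent of values that must be accurate

def scorecard(field_scores):
    """Return the weighted overall score and per-field pass/fail verdicts."""
    overall = sum(WEIGHTS[f] * s for f, s in field_scores.items())
    verdicts = {f: ("pass" if s >= PASS_THRESHOLD else "fail")
                for f, s in field_scores.items()}
    return round(overall, 2), verdicts

overall, verdicts = scorecard({"zip_plus_four": 99.1, "street": 97.4, "city": 98.6})
print(overall, verdicts)  # 98.49, street fails the 98 percent threshold
```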
The data quality scorecard can also be presented through a dashboard framework, which adds value
to the scorecard by grouping graphical information in business-intelligent ways.
By integrating various data analysis results within the dashboard application, the stakeholders can
review the current state of data quality and decide on appropriate actions within the project.
The set of stakeholders should include one or more members of the data stewardship committee, the
project manager, data experts, a Data Quality Developer, and representatives of the business.
Together, these stakeholders can review the data quality audit conclusions and conduct a cost-
benefit comparison of the desired data quality levels versus the impact on the project of the steps to
achieve these levels.
In some projects — for example, when the data must comply with government or industry regulations
— the data quality levels are non-negotiable, and the project stakeholders must work to those
regulations. In other cases, the business objectives may be achieved by data quality levels that are
less than 100 percent. In all cases, the project data must attain a minimum quality level in order to
pass through the project processes and be accepted by the target data source.
Conducting a data quality audit one time provides insight into the then-current state of the data, but
does not reflect how project activity can change data quality over time. Tracking levels of data quality
over time, as part of an ongoing monitoring process, provides a historical view of when and how
much the quality of data has improved. The following figure illustrates how ongoing audits can chart
progress in data quality.
As part of a statistical control process, data quality levels can be tracked on a periodic basis and
charted to show if the measured levels of data quality reach and remain in an acceptable range, or
whether some event has caused the measured level to fall below what is acceptable. Statistical
control charts can help in notifying data stewards when an exception event impacts data quality and
can help to identify the offending information process. Historical statistical tracking and charting
capabilities are available within a data quality scorecard, and scorecards can be easily updated;
once configured, the scorecard typically does not need to be re-created for successive data quality
analyses.
Best Practices
None
Sample Deliverables
None
3 Architect
Description
Proper execution during the Architect Phase is especially important for Data
Migration projects. In the Architect Phase, a series of key tasks is undertaken to
accelerate development, ensure consistency and expedite completion of the data
migration.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
Data integration solutions have grown in scope as well as in the amount of data they
process. This necessitates careful consideration of issues across a number of
architectural domains. A well-designed solution architecture is crucial to any data
integration effort, and can be the most influential, visible part of the whole effort. A
robust solution architecture not only meets the business requirements but can also
exceed the expectations of the business community. Given the continuous state of
change that has become a trademark of information technology, it is prudent to have
an architecture that is not only easy to implement and manage, but also flexible enough
to accommodate changes in the future: easily extendable, reliable (with minimal or no
downtime), and highly scalable.
Prerequisites
None
Considerations
The specific activities that comprise this task focus primarily on the Execution
Architecture. 3.2 Design Development Architecture focuses on the development
architecture and the Operate Phase discusses the important aspects of operating a
data integration solution. Refer to the Operate Phase for more information on the
operations architecture.
Best Practices
None
Sample Deliverables
None
Description
Prerequisites
None
Roles
Considerations
For Data Migration projects the technical requirements are fairly consistent and known.
They will require processes to:
Best Practices
None
Sample Deliverables
None
Description
Much like a logical data model, a logical view of the architecture provides a high-level
depiction of the various entities and relationships as an architectural blueprint of the
entire data integration solution. The logical architecture helps people to visualize the
solution and show how all the components work together. The major purposes of the
logical view are:
● To describe how the various solution elements work together (i.e., databases,
ETL, reporting, and metadata).
● To communicate the conceptual architecture to project participants to validate
the architecture.
● To serve as a blueprint for developing the more detailed physical view.
The logical diagram provides a road map of the enterprise initiative and an opportunity
for the architects and project planners to define and describe, in some detail, the
individual components.
The logical view should show relationships in the data flow and among the functional
components; indicating, for example, how local repositories relate to the global
repository (if applicable).
The logical view must take into consideration all of the source systems required to
support the solution, the repositories that will contain the runtime metadata, and all
known data marts and reports. This is a “living” architectural diagram, to be refined as
you implement or grow the solution.
The logical view does not contain detailed physical information such as server names,
IP addresses, hardware specifications, etc. These details will be fleshed out in the
development of the physical view.
Prerequisites
None
Considerations
For Data Migration projects a key component is the documentation of the various utility
database schemas. This will likely include legacy staging, pre-load staging, reference
data, and audit database schemas. Additionally, database schemas for Informatica
Data Quality and Informatica Data Explorer will also be included.
Best Practices
Sample Deliverables
None
Description
Using the Architecture Logical View as a guide, and considering any corporate
standards or preferences, develop a set of recommendations for how to technically
configure the analytic solution. These recommendations will serve as the basis for
discussion with the appropriate parties, including project management, the Project
Sponsor, system administrators, and potentially the user community. At this point, the
recommendations of the Data Architect and Technical Architect should be very well
formed, based on their understanding of the business requirements and the current and
planned technical standards.
The recommendations will be formally documented in the next subtask 3.1.4 Develop
Architecture Physical View but are not documented at this stage since they are still
considered open to debate. Discussions with interested constituents should focus on
the recommended architecture, not on protracted debate over the business
requirements.
It is critical that the scope of the project be set - and agreed upon - prior to developing
and documenting the technical configuration recommendations. Changes in the
requirements at this point can have a definite impact on the project delivery date.
(Refer back to the Manage Phase for a discussion of scope setting and control issues).
Prerequisites
None
Roles
The primary areas to consider in developing the recommendations include, but are not
necessarily limited to:
TIP
Use the Architecture Logical View as a starting point for discussing the
technical configuration recommendations. As drafts of the physical view are
developed, they will be helpful for explaining the planned architecture.
Best Practices
Sample Deliverables
None
Description
The physical view of the architecture is a refinement of the logical view, but takes into
account the actual hardware and software resources necessary to build the
architecture. Much like a physical data model, this view of the architecture depicts
physical entities (i.e., servers, workstations, and networks) and their attributes (i.e.,
hardware model, operating system, server name, IP address). In addition, each entity
should show the elements of the logical model supported by it. For example, a UNIX
server may be serving as a PowerCenter server engine, Data Analyzer server engine,
and may also be running Oracle to store the associated PowerCenter repositories.
The physical view is the summarized planning document for the architecture
implementation. The physical view is unlikely to explicitly show all of the technical
information necessary to configure the system, but should provide enough information
for domain experts to proceed with their specific responsibilities. In essence, this view
is a common blueprint that the system's general contractor (i.e., the Technical
Architect) can use to communicate to each of the subcontractors (e.g., UNIX
Administrator, Mainframe Administrator, Network Administrator, Application Server
Administrator, DBAs, etc.).
Prerequisites
None
Roles
None
Best Practices
Sample Deliverables
None
Description
Estimating the data volume and physical storage requirements of a data integration
project is a critical step in the architecture planning process. This subtask represents a
starting point for analyzing data volumes, but does not include a definitive discussion of
capacity planning. Due to the varying complexity and data volumes associated with
data integration solutions, it is crucial to review each technical area of the proposed
solution with the appropriate experts (i.e., DBAs, Network Administrators, Server
System Administrators, etc.).
Prerequisites
None
Roles
Considerations
Capacity planning and volume estimation should focus on several key areas that are
likely to become system bottlenecks or to strain system capacity, specifically:
During the Architect Phase only a rough volume estimate is required. After the Design
Phase is completed, the database sizing model should be updated to reflect the data
model and any changes to the known business requirements. The basic techniques for
database sizing are well understood by experienced DBAs. Estimates of database size
must factor in:
● Determine the upper bound of the width of each table row. This can
obviously be affected by certain DBMS data types, so be sure to take into
account each physical byte consumed. The documentation for the DBMS
should specify storage requirements for all supported data types. After the
physical data model has been developed, the row width can be calculated.
● Estimate the number of rows in each table. Depending on the type of table,
this number may be vastly different for a
"young" warehouse than one at "maturity". For example, if the database is
designed to store three years of historical sales data, and there is an average
daily volume of 5,000 sales, the table will contain 150,000 rows after the first
month, but will have swelled to nearly 5.5 million rows at full maturity. Beyond
the third year, there should be a process in place for archiving data off the
table, thus limiting the size to 5.5 million rows.
● Indexing can add a significant disk usage penalty to a database. Depending on
the overall size of the indexed table, and the size of the keys used in the index,
an index may require 30 to 80 percent additional disk space. Again, the DBMS
documentation should contain specifics about calculating index size.
● Partitioning the physical target can greatly increase the efficiency and
organization of the load process. However, it does increase the number of
physical units to be maintained. Be sure to discuss with the DBAs the most
intelligent structuring of the database partitions.
Using these basic factors, it is possible to construct a database sizing model (typically
in spreadsheet form) that lists all database tables and indexes, their row widths, and
estimated number of rows. Once the row number estimates have been validated, the
estimating model should produce a fairly accurate estimate of database size. Note that
the model will provide an estimate of raw data size. Be sure to consult the DBAs to
understand how to factor in the physical storage characteristics relative to the DBMS
being used, such as block parameter sizes.
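The spreadsheet-style sizing model described above can be sketched as follows. The row widths, row counts, and index overhead factor are assumptions; real values come from the physical data model and the DBMS documentation:

```python
# Sketch: a database sizing model like the spreadsheet described above.
# Row widths, row counts, and the index overhead factor are assumptions;
# real values come from the physical data model and DBMS documentation.
TABLES = {
    # name: (row_width_bytes, estimated_rows, index_overhead_factor)
    "sales_fact":   (120, 5_500_000, 0.5),   # ~3 years at 5,000 rows/day
    "customer_dim": (400, 250_000,   0.3),
}

def sizing_model(tables):
    """Estimated raw data size in MB per table, including index overhead."""
    report = {}
    for name, (width, rows, idx) in tables.items():
        data_bytes = width * rows
        report[name] = round(data_bytes * (1 + idx) / 1024 ** 2, 1)
    return report

print(sizing_model(TABLES))  # raw sizes only; add DBMS block overhead separately
```

Note that, as the text says, this yields raw data size only; block parameters and other physical storage characteristics of the specific DBMS still need to be factored in with the DBAs.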
Since there is the possibility of unstructured data being sourced, transformed and
stored, it is important to factor in any conversion in data size, either up or down, from
source to target.
TIP
If you have determined that the star schema is the right model to use for the data
integration solution, be sure that the DBAs who are responsible for the target
data model understand its advantages. A DBA who is unfamiliar with the star
schema may seek to normalize the data model in order to save space. Firmly
resist this tendency to normalize.
Data processing volume refers to the amount of data being processed by a given
PowerCenter server within a specified timeframe. In most data integration
implementations, a load window is allotted representing clock time. This window is
determined by the availability of the source systems for extracts and the end-user
requirements for access to the target data sources. Maintenance jobs that run on a
regular basis may further limit the length of the load window.
As a result of the limited load window, the PowerCenter server engine must be able to
perform its operations on all data in a given time period. The ability to do so is
constrained by three factors:
● Time it takes to extract the data (potentially including network transfer time, if
the source is remote)
● Throughput of the transformation engine
● Time it takes to load the data into the target
The biggest factors affecting extract and load times are, however, related to database
tuning. Refer to Performance Tuning Databases (Oracle) for suggestions on improving
database performance.
The throughput of the PowerCenter Server engine is typically the last option for
improved performance. Refer to the Velocity Best Practice Tuning Sessions for Better
Performance which includes suggestions on tuning mappings and sessions to optimize
performance. From an estimating standpoint, however, it is impossible to accurately
project the throughput (in terms of rows per second) of a mapping due to the high
variability in mapping complexity, quantity and complexity of transformations, and the
nature of the data being transformed. It is more accurate to estimate in terms of clock
time to ensure processing within the given load window.
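A clock-time estimate of load-window feasibility reduces to summing the per-stage durations. The window and durations below are assumed figures of the kind gathered from DBAs and prior runs:

```python
# Sketch: load-window feasibility check using clock time, per the guidance
# above. The window and per-stage durations are assumed estimates.
LOAD_WINDOW_HOURS = 6.0  # window set by source availability and user SLAs

ESTIMATED_HOURS = {
    "extract (incl. network transfer)": 1.5,
    "transform": 2.0,
    "load": 1.75,
}

def fits_window(estimates, window):
    """Whether the total estimated clock time fits the load window."""
    total = sum(estimates.values())
    return total <= window, round(window - total, 2)

ok, slack = fits_window(ESTIMATED_HOURS, LOAD_WINDOW_HOURS)
print(ok, slack)  # True with 0.75 hours of slack
```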
If the project includes steps dedicated to improving data quality (for example, as
described in Task 4.6) then a related performance factor is the time taken to perform
data matching (that is, record de-duplication) operations. Depending on the size of the
dataset concerned, data matching operations in Informatica Data Quality can take
several hours of processor time to complete. Data matching processes can be tuned
and executed on remote machines on the network to significantly reduce record
processing time. Refer to the Best Practice Effective Data Matching Techniques for
more information.
Network Throughput
Once the physical data row sizes and volumes have been estimated, it is possible to
estimate the required network capacity. It is important to remember the network
overhead associated with packet headers, as this can have an effect on the total
volume of data being transmitted. The Technical Architect should work closely with a
Network Administrator to examine network capacity between the different components
involved in the solution.
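A first-cut transfer-time estimate that accounts for header overhead can be sketched as below. The overhead ratio, utilization factor, and link speed are assumptions to be refined with the Network Administrator:

```python
# Sketch: first-cut network transfer estimate that factors in packet-header
# overhead, as noted above. Overhead ratio, utilization, and link speed
# are assumptions to validate with the Network Administrator.
def transfer_hours(data_gb, link_mbps, header_overhead=0.05, utilization=0.6):
    """Hours to move data_gb over a link, inflating the payload for packet
    headers and assuming only a fraction of raw bandwidth is available."""
    bits = data_gb * 8 * 1024 ** 3 * (1 + header_overhead)
    effective_bps = link_mbps * 1_000_000 * utilization
    return bits / effective_bps / 3600

hours = transfer_hours(data_gb=50, link_mbps=1000)
print(round(hours, 2))  # ~0.21 hours for 50 GB over a gigabit link
```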
The initial estimate is likely to be rough, but should provide a sense of whether the
existing capacity is sufficient and whether the solution should be architected differently
(i.e., move source or target data prior to session execution, re-locate server engine(s),
etc.). The Network Administrator can thoroughly analyze network throughput during
system and/or performance testing, and apply the appropriate tuning techniques. It is
important to involve the network specialists early in the Architect Phase so that they
can anticipate the solution's demands on the network.
TIP
Informatica generally recommends having either the source or target database
co-located with the PowerCenter Server engine because this can significantly
reduce network traffic. If such co-location is not possible, it may be advisable to
FTP data from a remote source machine to the PowerCenter Server as this is a
very efficient way of transporting the data across the network.
Best Practices
None
Sample Deliverables
None
Description
Although the various subtasks that compose this task are described here in linear
fashion, they are all interrelated, so it is important to approach the overall body of
work in this task as a whole and to consider the development architecture in its
entirety.
Prerequisites
None
Roles
Considerations
The scope of a typical PowerCenter implementation, possibly covering more than one
project, is much broader than a departmentally-scoped solution. It is important to
consider this statement fully, because it has implications for the planned deployment of
a solution, as well as the requisite planning associated with the development
environment. The main difference is that a departmental data mart type project can be
created with only two or three developers in a very short time period. By contrast, a full
integration solution involving the creation of an ICC (Integration Competency Center) or
an analytic solution that approaches enterprise scale requires more of a "big team"
approach. This is because many more organizational groups are involved, adherence
to standards is much more important, and testing must be more rigorous, since the
results will be visible to a larger audience.
The following paragraphs outline some of the key differences between a departmental
development effort and an enterprise effort:
However, as the development team grows and the project becomes more complex, this
simplified environment leads to serious development issues:
These factors represent only a subset of the issues that may occur when the
development architecture is haphazardly constructed, or "organically" grown. As is the
case with the execution environment, a departmental data mart development effort can
"get away with" minimal architectural planning. But any serious effort to develop an
enterprise-scale analytic solution must be based on well-planned architecture, including
both the development and execution environments.
Best Practices
None
Sample Deliverables
None
Description
Although actual testing starts with unit testing during the build phase followed by the
project’s Test Phase, there is far more involved in producing a high-quality project. The
QA Strategy includes definition of key QA roles, key verification processes and key QA
assignments involved in detailing all of the validation procedures for the project.
Prerequisites
None
Roles
Considerations
In determining which project steps will require verification, the QA Manager, or “owner” of
the project’s QA processes, should consider the business requirements and the project
methodology. Although it may take a “sales” effort to win over management to a QA
process that is highly involved throughout the project, the benefits can be proven
historically in the success rates of projects and their ongoing maintenance costs.
However, the trade-offs of cost vs. value will likely affect the scope of QA.
Best Practices
None
Sample Deliverables
None
Description
Although the development environment was relatively simple in the early days of
computer system development when a mainframe-based development project typically
involved one or more isolated regions connected to one or more database instances,
distributed systems, such as federated data warehouses, involve much more complex
development environments, and many more "moving parts." The basic concept of
isolating developers from testers, and both from the production system, is still critical to
development success. However, relative to a centralized development effort, there are
many more technical issues, hardware platforms, database instances, and specialized
personnel to deal with.
The task of defining the development environment is, therefore, extremely important
and very difficult. Because of the wide variance in corporate technical environments,
standards, and objectives, there is no "optimal" development environment. Rather,
there are key areas of consideration and decisions that must be made with respect to
them.
After the development environment has been defined, it is important to document its
configuration, including (most importantly) the information the developers need to use
the environments. For example, developers need to understand what systems they are
logging into, what databases they are accessing, what repository (or repositories) they
are accessing, and where sources and targets reside. An important component of any
development environment is to configure it as close to the test and production
environments as possible given time and budget. This can significantly ease the
development and integration efforts downstream and will ultimately save time and cost
during the testing phases.
Prerequisites
Roles
Considerations
The development environment for any data integration solution must consider many of
the same issues as a "traditional" development project. The major differences are that
the development approach is "repository-centric" (as opposed to code-based), there
are multiple sources and targets (unlike a typical system development project, which
deals with a single database), and few (if any) hand-coded objects to build and
maintain. In addition, because of the repository-based development approach, the
development environment must consider all of the following key areas:
Repository Configuration
Most data integration solutions currently being developed involve data from multiple
sources, target multiple data marts, and include the participation of developers from
multiple areas within the corporate organization. In order to develop a cohesive analytic
solution, with shared concepts of the business entities, transformation rules, and end
results, a PowerCenter-based development environment is required.
In this advanced repository configuration, the Technical Architect must pay careful
attention to the sharing of development objects and the use of multiple repositories.
Again, there is no single "correct" solution, only general guidelines for consideration.
TIP
It is very important to house all globally-shared database schemas in the
Global Repository. Because most IT organizations prefer to maintain their
database schemas in a CASE/data modeling tool, the procedures for updating
the PowerCenter definitions of source/target schemas must include importing
these schemas from tools such as ERwin. It is far easier to develop these
procedures for a single (global) repository than for each of the (independent)
local repositories that may be using the schemas.
For example, any data quality steps taken with Informatica Data Quality
(IDQ) applications (such as those implemented in 2.8 Perform Data Quality Audit or 5.3
Design and Build Data Quality Process) are performed using processes saved to a
discrete IDQ repository. These processes (called plans in IDQ parlance) can be added
to PowerCenter transformations and subsequently saved with those transformations in
the PowerCenter repository. As indicated above, data quality plans can be designed
and tested within an IDQ repository before deployment in PowerCenter. Moreover,
depending on their purpose, plans may remain in an IDQ server repository, from which
they can be distributed as needed across the enterprise, for the life of the project.
Repository folders provide development teams with a simple method for grouping and
organizing work units. The process for creating and administering folders is quite
simple, and thoroughly explained in Informatica’s product documentation. The main
area for consideration is the determination of an appropriate folder structure within one
or more repositories.
Finally, it is also important to consider the migration process in the design of the folder
structures. The migration process depends largely on the folder structure that is
established, and the type of repository environment. In earlier versions of PowerCenter,
the most efficient method to migrate an object was to perform a complete folder copy.
This involves grouping mappings meaningfully within a folder, since all mappings within
the folder migrate together. However, if individual objects need to be migrated, the
migration process can become very cumbersome, since each object needs to be
"manually" migrated.
Data Analyzer 4.x uses the export and import of repository objects for the migration
process among environments. Objects are exported and imported as individual pieces
and cannot be linked together in a deployment group as they can in PowerCenter 7.x or
migrated as a complete folder as they can in earlier versions of PowerCenter.
Developer Security
The security features built into PowerCenter and Data Analyzer allow the development
team to be grouped according to the functions and responsibilities of each member.
One common, but risky, approach is to give all developers access to the default
Administrator ID provided upon installation of the PowerCenter or Data Analyzer
software; a safer approach is to create individual accounts grouped by function and
responsibility.
For companies that have the capabilities to do so, LDAP integration is an available
option that can minimize the administration of usernames and passwords separately. If
you use LDAP authentication for repository users, the repository maintains an
association between repository user names and external login names. When you
create a user, you can select the login name from the external directory.
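The association the repository maintains between repository user names and external LDAP login names can be pictured as a simple lookup table. The sketch below is illustrative only; it uses no Informatica API, and the user and distinguished names are invented.

```python
class UserDirectory:
    """Toy model of the repository's user-name-to-LDAP-login mapping."""

    def __init__(self):
        self._logins = {}  # repository user name -> external login name

    def create_user(self, repo_user, external_login):
        # In the product, the external login is selected from the
        # LDAP directory at user-creation time.
        self._logins[repo_user] = external_login

    def external_login(self, repo_user):
        return self._logins.get(repo_user)

directory = UserDirectory()
directory.create_user("jdoe_dev", "CN=Jane Doe,OU=ETL,DC=example,DC=com")
print(directory.external_login("jdoe_dev"))
```

The benefit the text describes follows from this shape: passwords live only in the external directory, and the repository keeps just the name association.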
The tightest security of all is reserved for promoting development objects into
production. In some environments, no member of the development team is permitted to
move anything into production. In these cases, a System Owner or other system
representative outside the development group must be given the appropriate repository
privileges to perform the migration.
Best Practices
Configuring Security
Sample Deliverables
None
Description
Changes are inevitable during the initial development and maintenance stages of any
project. Wherever and whenever the changes occur - in the logical and physical data
models, extract programs, business rules, or deployment plans - they must be
controlled.
This subtask addresses many of the factors influencing the design of the change
control procedures. The procedures themselves should be a well-documented series of
steps, describing what happens to a development object once it has been modified (or
created) and unit tested by the developer. The change control procedures document
should also provide background contextual information, including the configuration of
the environment, repositories, and databases.
Prerequisites
None
Roles
Considerations
It is important to recognize that the change control procedures and the organization of
the development environment are heavily dependent upon each other. It is impossible
to thoroughly design one without considering the other. The following development
environment factors influence the approach taken to change control:
Repository Configuration
Subtask 3.2.2 Define Development Environments discusses the two basic approaches
to repository configuration. The first one, Stand-Alone PowerCenter, is the simplest
configuration in that it involves a single repository. If that single repository supports
both development and production (although this is not generally advisable), then the
change control process is fairly straightforward; migrations involve copying the relevant
object from a development folder to a production folder, or performing a complete folder
copy.
The general approach for migration is similar regardless of whether the environment is
a single repository or multiple repository approach. In either case, logical groupings of
development objects have been created, representing the various promotion levels
within the promotion hierarchy (e.g., DEV, TEST, QA, PROD). In the single repository
approach, the logical grouping is accomplished through the use of folders named
accordingly. In the multiple repository approach, an entire repository may be used for
each promotion level. Where a change must be applied depends on where the object
resides in the global/local architecture:
● If the object is a global object (reusable or not reusable), the change must be
applied to the global repository.
● If the object is shared, the shortcuts referencing this object automatically
reflect the change from any location in the global or local architecture.
Therefore, only the "original" object must be migrated.
● If the object is stored in both repositories (i.e., global and local), the change
must be made in both repositories.
● Finally, if the object is only stored locally, the change is only implemented in
the local repository.
Tip
With a PowerCenter Data Integration Hub implementation, global repositories
can register local repositories. This provides access to both repositories
through one "console", simplifying the administrative tasks for completing
change requests. In this case, the global Repository Administrator can perform
all repository migration tasks.
The change procedures must include a means for tracking change requests and their
migration schedules, as well as a procedure for backing out changes, if necessary.
The Change Request Form should include information about the nature of the change,
the developer making the change, the timing of the request for migration, and enough
technical information about the change that it can be reversed if necessary.
The team-based development option provides functionality in two areas: versioning and
deployment. Other features, such as repository queries and labeling, are necessary
to ensure optimal use of versioning and deployment. The following sections describe
this functionality at a general level. For a more detailed explanation of any of the
capabilities of the team-based development features of PowerCenter, refer to the
appropriate sections of the PowerCenter documentation.
For Data Migration projects change control is critical for success. It is common that the
target system has continual changes during the life of the data migration project. These
cause changes to specifications, which in turn cause a need to change the mappings,
sessions, workflows, and scripts that make up the data migration project. Change
control is important to allow the project management to understand the scope of
change and to limit the impact that process changes cause to related processes. For
data migration, the key to change control is in the communication of changes to ensure
that testing activities are integrated.
Best Practices
None
Description
Designing, implementing and maintaining a solid metadata strategy is a key enabler of high-quality solutions. The
federated architecture model of a PowerCenter-based global metadata repository provides the ability to share metadata
that crosses departmental boundaries while allowing non-shared metadata to be maintained independently.
A proper metadata strategy provides Data Integration Developers, End-User Application Developers, and End Users with
the ability to create a common understanding of the data, where it came from, and what business rules have been applied
to it. As such, the metadata may be as important as the data itself, because it provides context and credibility to the data
being analyzed.
The metadata strategy should describe where metadata will be obtained, where it will be stored, and how it will be
accessed. After the strategy is developed, the Metadata Manager is responsible for documenting and distributing it to the
development team and end-user community. This solution allows for the following capabilities:
The Business Intelligence Metadata strategy can also assist in achieving the goals of data orientation by providing a focus
for sharing the data assets of an organization. It can provide a map for managing the expanding requirements for
reporting information that the business places upon the IT environment. The metadata strategy highlights the importance
of a central data administration department for organizations that are concerned about data quality, integrity, and reuse.
The components of a metadata strategy for Data Analyzer include:
Prerequisites
None
Roles
The metadata captured while building and deploying analytic solution architecture should pertain to each of the system's
points of integration, an area where managing metadata provides benefit to IT and/or business users. The Metadata
Manager should analyze each point of integration in order to answer the following questions:
It is important to centralize metadata management functions despite the potential "metadata bottleneck" that may be
created during development. This consolidation is beneficial when a production system based on clean, reliable metadata
is unveiled to the company. The following table expands the concept of the Who, What, Why, Where, and How approach
to managing metadata:
Note that the Informatica Data Explorer (IDE) application suite possesses a wide range of functional capabilities for data
and metadata profiling and for source-to-target mapping.
The Metadata Manager and Repository Manager need to work together to determine how best to capture the metadata,
always considering the following points:
● Source structures. Are source data structures captured or stored already in a CASE/data modeling tool? Are
they maintained consistently?
● Target structures. Are target data structures captured or stored already in a CASE/data modeling tool? Is
PowerCenter being used to create target data structure? Where will the models be maintained?
● Extract, Transform, and Load process. Assuming PowerCenter is being used for the ETL processing, the
metadata will be created and maintained automatically within the PowerCenter repository. Also, remember that
any ETL code developed outside of a PowerCenter mapping (i.e., in stored procedures or external procedures)
will not have metadata associated with it.
● Analytic applications. Several front-end analytic tools have the ability to import PowerCenter metadata. This can
simplify the development and maintenance of the analytic solution.
● Reporting tools. End users working with Data Analyzer may need access to the PowerCenter metadata in order
to understand the business context of the data in the target database(s).
● Operational metadata. PowerCenter automatically captures rich operational data when batches and sessions are
executed. This metadata may be useful to operators and end users, and should be considered an important part
of the analytic solution.
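The capture points listed above can be brought together in a single record per target element. The sketch below is a hypothetical lineage entry, not a PowerCenter repository structure; every field name and value is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class LineageEntry:
    """One target column with the metadata the strategy recommends capturing."""
    target_table: str
    target_column: str
    source_columns: list          # where the data came from
    transformation_rule: str      # business rule applied in the mapping
    maintained_in: str            # e.g., the CASE/data modeling tool of record
    operational: dict = field(default_factory=dict)  # run-time statistics

entry = LineageEntry(
    target_table="DIM_CUSTOMER",
    target_column="CUST_STATUS",
    source_columns=["CRM.ACCOUNT.STATUS_CD"],
    transformation_rule="decode status code to business description",
    maintained_in="ERwin",
    operational={"last_load_rows": 15_204},
)
print(entry.target_column, entry.operational["last_load_rows"])
```

A record of this shape answers the end user's questions directly: where the data came from, what rule produced it, and how the last load behaved.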
Best Practices
None
Sample Deliverables
None
Description
The purpose of Change Management is to minimize the disruption to services caused
by change and to ensure that records of hardware, software, services and
documentation are kept up to date.
The Change Management process enables the actual change to take place. Elements
of the process include identify change, create request for change, impact assessment,
approval, scheduling, and implementation.
Prerequisites
None
Roles
Considerations
Identify Change
● A problem arises that requires a change that will affect more than one
business user or a user group such as sales, marketing, etc.
A request for change should be completed for each proposed change, with a checklist
of items to be considered and approved before implementing the change. The change
procedures must include a means for tracking change requests and their migration
schedules, as well as a procedure for backing out changes, if necessary. The Change
Request Form should include information about the nature of the change, the
developer making the change, the timing of the request for migration, and enough
technical information about the change that it can be reversed if necessary.
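The fields the text requires on a Change Request Form can be sketched as a record with a completeness check before the initial review. The field names are illustrative assumptions, not a standard form layout.

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    nature_of_change: str      # what is changing and why
    developer: str             # who made the change
    requested_migration: str   # timing of the request for migration
    rollback_notes: str        # enough detail to reverse the change

    def ready_for_review(self) -> bool:
        """An incomplete form is returned to the originator for details."""
        return all([self.nature_of_change, self.developer,
                    self.requested_migration, self.rollback_notes])

req = ChangeRequest("add region to sales mapping", "jdoe",
                    "2024-05-01 test window", "")
print(req.ready_for_review())
```

Here the empty rollback notes make the form incomplete, mirroring the process step in which the reviewer returns an under-specified request to the originator.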
The team-based development option provides functionality in two areas: versioning and
deployment. Other features, such as repository queries and labeling, are required to
ensure optimal use of versioning and deployment. The following sections describe this
functionality at a general level. For a more detailed explanation of any of the
capabilities of the Team-based Development features of PowerCenter, please refer to
the appropriate sections of the PowerCenter documentation.
Approval to Proceed
An initial review of the Change Request form should assess the cost and value of
proceeding with the change. If sufficient information is not provided on the request form
to enable the initial reviewer to thoroughly assess the change, he or she should return
the request form to the originator for further details. The originator can then resubmit
the change request with the requested information. The change request must be
tracked through all stages of the change request process, with thorough documentation
regarding approval or rejection and resubmission.
Once approval to proceed has been granted, the originator may plan and prepare the
change in earnest.
The following sections on the request for change must be completed at this stage:
The Change Control Process must include a formalized approach to completing impact
analysis. Any implemented change has some planned downstream impact (e.g., the
values on a report will change, additional data will be included, a new target file will be
populated, etc.). The importance of the impact analysis process is in recognizing
unforeseen downstream effects prior to implementing the change. In many cases, the
impact is easy to define. For example, if a requested change is limited to changing the
target of a particular session from a flat file to a table, the impact is obvious. However,
most changes occur within mappings or within databases, and the hidden impacts can
be worrisome. For example, if a business rule change is made, how will the end results
of the mapping be affected? If a target table schema needs to be modified within the
repository, the corresponding target database must also be changed, and it must be
done in sync with the migration of the repository change.
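Recognizing hidden downstream effects amounts to walking a dependency graph outward from the changed object. The sketch below is a generic traversal over an invented dependency map; the object names are hypothetical and this is not an Informatica feature.

```python
from collections import deque

def downstream_impact(dependencies, changed):
    """Return every object reachable downstream of `changed`.

    `dependencies` maps an object to the objects that consume it.
    """
    seen, queue = set(), deque([changed])
    while queue:
        obj = queue.popleft()
        for consumer in dependencies.get(obj, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Hypothetical objects: a target schema change ripples outward.
deps = {
    "TGT_ORDERS schema": ["m_load_orders"],
    "m_load_orders": ["s_m_load_orders"],
    "s_m_load_orders": ["wf_nightly_load"],
}
print(sorted(downstream_impact(deps, "TGT_ORDERS schema")))
```

Even this toy traversal shows why a schema change is never "just" a repository edit: the mapping, session, and workflow that depend on it all surface in the impact list.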
Implementation
Following final approval and after relevant and timely communications have been
issued, the change may be implemented in accordance with the plan and the
scheduled date and time.
Identifying the most efficient method for applying change to all environments is
essential. Within the PowerCenter and Data Analyzer environments, the types of
objects to manage are:
● Source definitions
● Target definitions
● Mappings and mapplets
● Reusable transformations
● Sessions
● Batches
● Reports
● Schemas
● Global variables
● Dashboards
● Schedules
In addition, there are objects outside of the Informatica architecture that are directly
linked to these objects, so the appropriate procedures need to be established to ensure
that all items are synchronized.
1. Perform impact analysis on the request. List all objects affected by the change,
including development objects and databases.
2. Approve or reject the change or migration request. The Project Manager has
authority to approve/reject change requests.
3. If approved, pass the request to the PowerCenter Administrator for processing.
4. Migrate the change to the test environment.
5. Test the requested change. If the change does not pass testing, the process will
need to start over for this object.
6. Submit the promotion request for migration to QA and/or production
environments.
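The six-step flow above can be sketched as a small state machine; the restart rule in step 5 (a failed test sends the object back to the beginning) is the notable transition. The state labels are invented shorthand for the numbered steps.

```python
STEPS = ["impact_analysis", "approval", "handoff_to_admin",
         "migrate_to_test", "testing", "promote_to_qa_prod"]

def advance(step, passed=True):
    """Return the next step; a testing failure restarts the process."""
    if step == "testing" and not passed:
        return STEPS[0]  # the process starts over for this object
    i = STEPS.index(step)
    return STEPS[i + 1] if i + 1 < len(STEPS) else "done"

print(advance("testing", passed=False))
print(advance("promote_to_qa_prod"))
```

Modeling the flow this way makes it easy to see that rejection and test failure are the only exits from the forward path, which is exactly what the tracking procedures need to record.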
Best Practices
None
Sample Deliverables
None
Description
While it is crucial to design and implement a technical architecture as part of the data
integration project development effort, most of the implementation work is beyond the
scope of this document. Specifically, the acquisition and installation of hardware and
system software is generally handled by internal resources, and is accomplished by
following pre-established procedures. This section touches on these topics, but is not
meant to be a step-by-step guide to the acquisition and implementation process.
After determining an appropriate technical architecture for the solution (3.1 Develop
Solution Architecture), the next step is to physically implement that architecture. This
includes procuring and installing the hardware and software required to support the
data integration processes.
Prerequisites
Roles
The project schedule should be the focus of the hardware and software implementation
process. The entire procurement process, which may require a significant amount of
time, must begin as soon as possible to keep the project moving forward. Delays in this
step can cause serious delays to the project as a whole. There are, however, a number
of proven methods for expediting the procurement and installation processes, as
described in the related subtasks.
Best Practices
None
Sample Deliverables
None
Description
This is the first step in implementing the technical architecture. The procurement
process varies widely among organizations, but is often based on a purchase request
(i.e., a Request for Purchase, or RFP) generated by the Project Manager after the project
architecture is planned and configuration recommendations are approved by IT
management.
An RFP is usually mandatory for procuring any new hardware or software. Although the
forms vary widely among companies, an RFP typically lists what products need to be
purchased, when they will be needed, and why they are necessary for the project. The
document is then reviewed and approved by appropriate management and the
organization's "buyer".
Prerequisites
Roles
Considerations
Frequently, the Project Manager does not control purchasing new hardware and
software. Approval must be received from another group or individual within the
organization, often referred to as a "buyer". Even before product purchase decisions
are finalized, it is a good idea to notify the buyer of necessary impending purchases,
providing a brief overview of the types of products that are likely to be required and for
what reasons.
It may also be possible to begin the procurement process before all of the prerequisite
steps are complete (see 2.2 Define Business Requirements, 3.1.2 Develop
Architecture Logical View, and 3.1.3 Develop Configuration Recommendations). The
Technical Architect should have a good idea of at least some of the software and
hardware choices before a physical architecture and configuration recommendations
are solidified.
Finally, if development is ready to begin and the hardware procurement process is not
yet complete, it may be worthwhile to get started on a temporary server with the
intention of moving the work to the new server when it is available.
Best Practices
None
Sample Deliverables
None
Description
Installing, configuring, and deploying new hardware and software should not affect the
progress of a data integration project. The entire development team depends on a
properly configured technical environment. Incorrect installation or delays can have
serious negative effects on the project schedule.
Establishing and following a detailed installation plan can help avoid unnecessary
delays in development. (See 3.1.2 Develop Architecture Logical View).
Prerequisites
Roles
Considerations
When installing and configuring hardware and software for a typical data warehousing
project, the following Informatica software components should be considered:
The PowerCenter services need to be installed and configured, along with any
necessary database connectivity drivers, such as native drivers or ODBC. Connectivity
needs to be established among all the platforms before the Informatica applications can
be used.
For step-by-step instructions for installing the PowerCenter services, refer to the
Informatica PowerCenter Installation Guide. The following list is intended to
complement the installation guide when installing PowerCenter:
● Network Protocol - TCP/IP and IPX/SPX are the supported protocols for
communication between the PowerCenter services and PowerCenter client
tools. To improve repository performance, consider installing the Repository
service on a machine with a fast network connection. To optimize
performance, do not install the Repository service on a Primary Domain
Controller (PDC) or a Backup Domain Controller (BDC).
● Native Database Drivers (or ODBC in some instances) are used by the
Server to connect to the source, target, and repository databases. Ensure that
the appropriate drivers are installed and that connectivity to each database has
been verified before development begins.
The PowerCenter Client needs to be installed on all developer workstations, along with
any necessary drivers, including database connectivity drivers such as ODBC.
Before you begin the installation, verify that you have enough disk space for the
PowerCenter Client. You must have 300MB of disk space to install the PowerCenter 8
Client tools. Also, make sure you have 30MB of temporary file space available for the
PowerCenter Setup. When installing PowerCenter Client tools via a standard
installation, choose to install the “Client tools” and “ODBC” components.
TIP
You can install the PowerCenter Client tools in standard mode or silent mode.
You may want to perform a silent installation if you need to install the
PowerCenter Client on several machines on the network, or if you want to
standardize the installation across all machines in the environment. When you
perform a silent installation, the installation program uses information in a
response file to locate the installation directory. You can also perform a silent
installation for remote machines on the network.
When adding an ODBC data source name (DSN) to client workstations, it is a good
idea to keep the DSN consistent among all workstations. Aside from eliminating the
potential for confusion on individual developer machines, this is important when
importing and exporting repository registries.
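Keeping DSN names consistent across workstations can be verified mechanically. The sketch below compares DSN name lists collected from each machine (how they are collected, from odbc.ini on UNIX or the registry on Windows, is outside the sketch); the machine and DSN names are invented.

```python
def inconsistent_dsns(workstations):
    """Return DSNs that are missing on at least one workstation.

    `workstations` maps a machine name to the set of DSNs defined on it.
    """
    every_dsn = set().union(*workstations.values())
    common = set.intersection(*workstations.values())
    return every_dsn - common

# Hypothetical developer machines; one DSN name has drifted.
machines = {
    "dev-pc-01": {"DWH_TARGET", "CRM_SOURCE"},
    "dev-pc-02": {"DWH_TARGET", "CRM_SRC"},
}
print(sorted(inconsistent_dsns(machines)))
```

A drifted name like `CRM_SRC` versus `CRM_SOURCE` is exactly the kind of inconsistency that causes confusion when repository registries are moved between workstations.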
For step-by-step instructions for installing the PowerCenter Reports, refer to the
Informatica PowerCenter Installation Guide. The following list of considerations is
intended to complement the installation guide when installing PCR:
The PCR client is a web-based, thin-client tool that uses Microsoft Internet Explorer 6
as the client. The proper version of Internet Explorer should be verified on client
machines, ensuring that Internet Explorer 6 is the default web browser, and the
minimum system requirements should be validated.
In order to use PCR, the client workstation should have at least a 300MHz processor
and 128MB of RAM. Please note that these are the minimum requirements for the PCR
Certain interactive features in the PCR require third-party plug-in software to work
correctly. Users must download and install the plug-in software on their workstation
before they can use these features. PCR uses the following third-party plug-in software:
● Microsoft SOAP Toolkit - In PCR, you can export a report to an Excel file and
refresh the data in Excel directly from the cached data in PCR or from data in
the data warehouse through PCR. To use the data refresh feature, you must
first install the Microsoft SOAP Toolkit. For information on downloading the
Microsoft SOAP Toolkit, see “Working with Reports” in the Data Analyzer User
Guide.
● Adobe SVG Viewer - In PCR, you can display interactive report charts and
chart indicators. You can click on an interactive chart to drill into the report
data and view details and select sections of the chart. To view interactive
charts, you must install Adobe SVG Viewer. For more information on
downloading Adobe SVG Viewer, see “Managing Account Information” in the
Data Analyzer User Guide.
Lastly, for PCR to display its application windows correctly, Informatica recommends
disabling any pop-up blocking utility on your browser. If a pop-up blocker is running
while you are working with PCR, the PCR windows may not display properly.
The Data Analyzer Server needs to be installed and configured along with the
application server foundation software. Currently, Data Analyzer is certified on the
following application servers:
● BEA WebLogic
● IBM WebSphere
● JBoss Application Server
Refer to the PowerCenter Installation Guide for the current list of supported application
servers and exact version numbers.
The recommended configuration for the Data Analyzer environment is to put the Data
Analyzer Server, application server, repository, and data warehouse databases on the
same multiprocessor machine. This approach minimizes network input/output as the
Data Analyzer Server reads from the data warehouse database. Use this approach
when available CPU and memory resources on the multiprocessor machine allow all
software processes to operate efficiently without “pegging” the server. If available
hardware dictates that the Data Analyzer Server is separated physically from the data
warehouse database server, Informatica recommends placing a high-speed network
connection between the two servers.
For step-by-step instructions for installing the Data Analyzer Server components, refer
to the Informatica Data Analyzer Installation Guide. The following list of considerations
is intended to complement the installation guide when installing Data Analyzer:
The Data Analyzer Client is a web-based, thin-client tool that uses Microsoft Internet
Explorer 6 as the client. The proper version of Internet Explorer should be verified on
client machines, ensuring that Internet Explorer 6 is the default web browser, and the
minimum system requirements should be validated.
In order to use the Data Analyzer Client, the client workstation should have at least a
300MHz processor and 128MB of RAM. Please note that these are the minimum
requirements for the Data Analyzer Client, and that if other applications are running on
the client workstation, additional CPU and memory is required. In most situations, users
are likely to be multi-tasking using multiple applications, so this should be taken into
consideration.
Certain interactive features in Data Analyzer require third-party plug-in software to work
correctly. Users must download and install the plug-in software on their workstation
before they can use these features. Data Analyzer uses the following third-party plug-in
software:
Lastly, for Data Analyzer to display its application windows correctly, Informatica
recommends disabling any pop-up blocking utility on your browser. If a pop-up blocker
is running while you are working with Data Analyzer, the Data Analyzer windows may
not display properly.
Metadata Manager requires a web server and a Java 2 Enterprise Edition (J2EE)-
compliant application server. Metadata Manager works with BEA WebLogic Server,
IBM WebSphere Application Server, and JBoss Application Server. If you choose to
use BEA WebLogic or IBM WebSphere, they must be installed prior to the Metadata
Manager installation. The JBoss Application Server can be installed from the Metadata
Manager installation process.
The Metadata Manager installation includes the following components:
● Metadata Manager
● Limited edition of PowerCenter
● Metadata Manager documentation in PDF format
● Metadata Manager and Data Analyzer integrated online help
● Configuration Console online help
To install Metadata Manager for the first time, complete each of the following tasks in
the order listed below:
1. Create database user accounts. Create one database user account for the
Metadata Manager Warehouse and Metadata Manager Server repository and
another for the Integration repository.
2. Install the application server. Install BEA WebLogic Server or IBM
WebSphere Application Server.
3. Install PowerCenter 8. Install PowerCenter 8 to manage metadata extract and
load tasks.
4. Install Metadata Manager. When installing Metadata Manager, provide the
connection information for the database user accounts for the Integration
repository and the Metadata Manager Warehouse and Metadata Manager
Server repository. The Metadata Manager installation creates both repositories
and installs other Metadata Manager components, such as the Configuration
Console, documentation, and XConnects.
5. Optionally, run the pre-compile utility (for BEA WebLogic Server and IBM
WebSphere). If you are using the BEA WebLogic Server as your Application
server, optionally pre-compile the JSP scripts to display the Metadata Manager
web pages faster when they are accessed for the first time.
6. Apply the product license. Apply the application server license, as well as the
PowerCenter and Metadata Manager licenses.
7. Configure the PowerCenter Server. Assign the Integration repository to the
PowerCenter Server to enable running of prepackaged XConnect workflows.
The workflow for each XConnect extracts metadata from the metadata source
repository and loads it into the Metadata Manager Warehouse.
Note: For more information about installing Metadata Manager, see “Installing
Metadata Manager” chapter of the PowerCenter Installation Guide.
After the software has been installed and tested, the Metadata Manager Administrator
can begin creating security groups, users, and the repositories. Following are some
of the initial steps for the Metadata Manager Administrator once the Metadata Manager
is installed. For more information on any of these steps, refer to the Metadata Manager
Administration Guide.
PowerExchange Installation
Before beginning the installation, take time to read the PowerExchange Installation
Guide as well as the documentation for the specific PowerExchange products you have
licensed and plan to install.
Take time to identify and notify resources you are going to need to complete the
installation. Depending on the specific product, you could need any or all of the
following:
● Database Administrator
● PowerCenter Administrator
● MVS Systems Administrator
● UNIX Systems Administrator
● Security Administrator
● Network Administrator
● Desktop (PC) Support
The process for installing PowerExchange on the source system varies greatly
depending on the source system. Take care to read through the installation
documentation prior to attempting the installation. The PowerExchange Installation
Guide has step by step instructions for installing PowerExchange on all supported
platforms.
The Navigator allows you to create and edit data maps and tables. To install
PowerExchange on the desktop (PC) for the first time, complete the installation tasks in
the order given in the PowerExchange Installation Guide.
The PowerExchange client for the PowerCenter server allows PowerCenter to read
data from PowerExchange data sources. The PowerCenter Administrator should
perform the installation with the assistance of a server administrator. It is recommended
that a separate user account be created to run the required processes. A PowerCenter
Administrator needs to register the PowerExchange plug-in with the PowerCenter
repository.
Best Practices
None
Sample Deliverables
None
4 Design
Description
Each task in the Design Phase provides the functional architecture for the
development process using PowerCenter. The design of the target data store may
include data warehouses and data marts, star schemas, web services, message
queues, or custom databases that drive specific applications or effect a data migration.
The Design Phase requires that several preparatory tasks be completed before
beginning the development work of building and testing mappings, sessions, and
workflows within PowerCenter.
Prerequisites
3 Architect
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
Depending on the structure and approach to data storage supporting the data
integration solution, the data architecture may include an Enterprise Data Warehouse
(EDW) and one or more data marts. In addition, many implementations also include an
Operational Data Store (ODS), which may also be referred to as a dynamic data store
(DDS) or staging area. Each of these data stores may exist independently of the
others, and may reside on completely different database management systems
(DBMSs) and hardware platforms. In any case, each of the database schemas
comprising the overall solution will require a corresponding logical model.
An ODS may be needed when there are operational or reporting uses for the
consolidated detail data or to provide a staging area, for example, when there is a short
time span to pull data from the source systems. It can act as a buffer between the EDW
and the source applications. The data model for the ODS is typically in third-normal
form and may be a virtual duplicate of the source systems' models. The ODS typically
receives the data after some cleansing and integration, but with little or no
summarization from the source systems; the ODS can then become the source for the
EDW.
Data marts (DMs) are effectively subsets of the EDW. Data marts are fed directly from
the enterprise data warehouse, ensuring synchronization of business rules and
snapshot times. The logical design structures are typically dimensional star or
snowflake schemas. The structures of the data marts are driven by the requirements of
particular business users and reporting tools. There may be additions and reductions to
the logical data mart design depending on the requirements for the particular data mart.
Historical data capture requirements may differ from those on the enterprise data
warehouse. A subject-oriented data mart may be able to provide for more historical
analysis, or alternatively may require none. Detailed requirements drive content, which
in turn drives the logical design that becomes the foundation of the physical database
design.
Two generic assumptions about business users also affect data mart design: users think in business rather than technology-centric terms, and they want structures simple enough to query quickly on their own.
These assumptions encourage the use of star and snowflake schemas in the solution
design. These types of schemas represent business activities as a series of discrete,
time-stamped events (or facts) with business-oriented names, such as orders or
shipments. These facts contain foreign key "pointers" to one or more dimensions that
place the fact into a business context, such as the fiscal quarter in which the shipment
occurred, or the sales region responsible for the order. The use of business
terminology throughout the star or snowflake schema is much more meaningful to the
end user than the typical normalized, technology-centric data model.
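As a concrete illustration of this pattern, the sketch below builds a tiny star schema in an in-memory SQLite database. All table, column, and dimension names are illustrative, not drawn from the methodology:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimensions place each fact into a business context.
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, fiscal_quarter TEXT)")
cur.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT)")

# The fact table records discrete, time-stamped business events (shipments)
# with foreign-key "pointers" to the dimensions.
cur.execute("""CREATE TABLE fact_shipment (
    shipment_id INTEGER PRIMARY KEY,
    date_id INTEGER REFERENCES dim_date(date_id),
    region_id INTEGER REFERENCES dim_region(region_id),
    quantity INTEGER)""")

cur.execute("INSERT INTO dim_date VALUES (1, 'FY24-Q1'), (2, 'FY24-Q2')")
cur.execute("INSERT INTO dim_region VALUES (10, 'EMEA'), (20, 'APAC')")
cur.executemany("INSERT INTO fact_shipment VALUES (?, ?, ?, ?)",
                [(1, 1, 10, 5), (2, 1, 20, 3), (3, 2, 10, 7)])

# A business-oriented query: total shipments per fiscal quarter.
rows = cur.execute("""
    SELECT d.fiscal_quarter, SUM(f.quantity)
    FROM fact_shipment f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.fiscal_quarter ORDER BY d.fiscal_quarter""").fetchall()
```

Note how the query is phrased entirely in business terms (fiscal quarter, shipments), which is the point of the business-oriented schema.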
During the modeling phase of a data integration project, it is important to consider all
possible methods of obtaining a data model. Analyzing the cost benefits of build vs. buy
may well reveal that it is more economical to buy a pre-built subject area model than to
invest the time and money in building your own.
Roles
Considerations
Requirements
Are the requirements sufficiently defined in at least one subject area that the
data modeling tasks can begin?
If the data modeling requires too much guesswork, at best, time will be wasted or, at
worst, the Data Architect will design models that fail to support the business
requirements.
This question is particularly critical for designing logical data warehouse and data mart
schemas. The EDW logical model is largely dependent on source system structures.
Some internal standards need to be set at the beginning of the modeling process to
define data types and names. It is extremely important for project team members to
adhere to whatever conventions are chosen. If project team members deviate from the
chosen conventions, the entire purpose is defeated. Conventions should be chosen for
the prefix and suffix names of certain types of fields. For example, numeric surrogate keys in the data warehouse might use either seq or id as a suffix so that key fields are easy to identify.
Data modeling tools refer to common data types as domains. Domains are also
hierarchical. For example, address can be of a string data type. Residential and
business addresses are children of address. Establishing these data types at the
beginning of the model development process is beneficial for consistency and
timeliness in implementing the subsequent physical database design.
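Naming conventions like these can also be checked mechanically. The sketch below assumes a hypothetical rule set; the suffixes shown are examples, not prescribed by the methodology:

```python
# Illustrative convention: surrogate keys end in "_id", dates in "_dt",
# flags in "_flg". Real conventions come from the project standards.
SUFFIX_RULES = {"surrogate_key": "_id", "date": "_dt", "flag": "_flg"}

def conforms(column_name: str, column_kind: str) -> bool:
    """Return True if a column name carries the agreed suffix for its kind."""
    suffix = SUFFIX_RULES.get(column_kind)
    return suffix is None or column_name.endswith(suffix)

# Checking a few candidate columns; "active" breaks the flag convention.
violations = [name for name, kind in [
    ("customer_id", "surrogate_key"),
    ("order_dt", "date"),
    ("active", "flag"),
] if not conforms(name, kind)]
```

A check like this can run against the model export so that deviations from the chosen conventions are caught before they reach the physical design.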
Metadata
Logical data models can be delivered to PowerCenter ready for data integration
development. Additionally, metadata originating from these models can be delivered to
end users through business intelligence tools. Many business intelligence vendors have tools that can access the PowerCenter Repository through the Metadata Services and Metadata Manager architectures.
Data models are valuable documentation, both to the project and the business users.
They should be stored in a repository in order to take advantage of PowerCenter's
integrated metadata approach. Additionally, they should be regularly backed-up to file
after major changes. Versioning should take place regularly within the repository so
that it is possible to roll back several versions of a data model, if necessary. Once the
backbone of a data model is in place, a change control procedure should be
implemented to monitor any changes requested and record implementation of those
changes. Adhering to rigorous change control procedures will help to ensure that all
impacts of a change are recognized prior to their implementation.
To facilitate metadata analysis and to keep your documentation up-to-date, you may
want to consider the metadata reporting capabilities in Metadata Manager to provide
automatically updated lineage and impact analysis.
Best Practices
None
Sample Deliverables
None
Description
If the aim of the data integration project is to produce an Enterprise Data Warehouse
(EDW), then the logical EDW model should encompass all of the sources that feed the
warehouse. This model will be a slightly de-normalized structure to replicate source
data from operational systems; it should be neither a full star, nor snowflake schema,
nor a highly normalized structure of the source systems. Some of the source structures
are redesigned in the model to migrate non-relational sources to relational structures.
In some cases, it may be appropriate to provide limited consolidation where common
fields are present in various incoming data sources. In summary, the developed EDW
logical model should be the sum of all the parts but should exclude detailed attribute
information.
Prerequisites
None
Roles
Considerations
Analyzing Sources
Universal Tables
Universal tables provide some consolidation and commonality among sources. For
example, different systems may use different codes for the gender of a customer. A
universal table brings together the fields that cover the same business subject or
business rule. Universal tables are also intended to be the sum of all parts. For
example, a customer table in one source system may have only standard contact
details while a second system may supply fields for mobile phones and email
addresses, but not include a field for a fax number. A universal table should hold all of
the contact fields from both systems (i.e., standard contact details plus fields for fax,
mobile phones and email). Additionally, universal tables should ensure syntactic
consistency such that fields from different source tables represent the same data items
and possess the same data types.
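A minimal sketch of building such a universal record, assuming two hypothetical source systems with different field sets and different gender codings:

```python
# Hypothetical records from two source systems: the CRM supplies standard
# contact details, billing supplies mobile and email but no fax.
crm_customer = {"name": "A. Smith", "phone": "555-0100", "gender": "F"}
billing_customer = {"name": "A. Smith", "mobile": "555-0199",
                    "email": "a@example.com", "gender": "2"}

# Map each source's gender coding onto one agreed standard.
GENDER_MAP = {"F": "female", "M": "male", "1": "male", "2": "female"}

# The universal record is the union of fields from all sources; fields no
# source supplies (e.g., fax) are present but empty.
UNIVERSAL_FIELDS = ["name", "phone", "mobile", "email", "fax", "gender"]

def to_universal(*records):
    merged = {field: None for field in UNIVERSAL_FIELDS}
    for record in records:
        for field, value in record.items():
            if field == "gender":
                value = GENDER_MAP[value]  # enforce syntactic consistency
            merged[field] = value
    return merged

customer = to_universal(crm_customer, billing_customer)
```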
Relationship Modeling
Historical Considerations
Business requirements and refresh schedules should determine the amount and type
of history that an EDW should hold. The logical history maintenance architecture
should be common to all tables within the EDW.
Capturing historical data usually involves taking snapshots of the database on a regular
basis and adding the data to the existing content with time stamps. Alternatively,
individual updates can be recorded and the previously current records can be time-
period stamped or versioned. It is also necessary to decide how far back the history
should go.
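The time-period-stamping alternative described above is often implemented as versioned rows, where the previously current record is closed out when an update arrives. A minimal sketch, with illustrative field names:

```python
from datetime import date

history = []  # all versions of every record; the current row has end_date=None

def apply_update(key, new_values, as_of):
    """Close out the current version of `key` and append the new version."""
    for row in history:
        if row["key"] == key and row["end_date"] is None:
            row["end_date"] = as_of  # time-period stamp the old version
    history.append({"key": key, "start_date": as_of, "end_date": None,
                    **new_values})

apply_update("cust-1", {"city": "Leeds"}, date(2020, 1, 1))
apply_update("cust-1", {"city": "York"}, date(2021, 6, 1))

current = [r for r in history if r["end_date"] is None]
```

The full history remains queryable by date range, while the current view is simply the rows with no end date.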
Data Quality
Data can be verified for validity and accuracy as it comes into the EDW, and the EDW can reasonably be expected to answer basic questions about the validity of the data it holds.
Additionally, data values can be evaluated against expected ranges. For example, dates of birth should be in a reasonable range (not after the current date, and not before 1 January 1900). Values can also be validated against reference datasets. As well
as using industry-standard references (e.g., ISO Currency Codes, ISO Units of
Measure), it may be necessary to obtain or generate new reference data to perform all
relevant data quality checks.
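These range and reference-data checks can be sketched as follows; the currency subset and field names are illustrative:

```python
from datetime import date

ISO_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative subset of ISO 4217

def check_row(row):
    """Return a list of data-quality problems found in one incoming row."""
    problems = []
    dob = row.get("date_of_birth")
    # Range check: dates of birth must fall between 1 Jan 1900 and today.
    if dob is None or not (date(1900, 1, 1) <= dob <= date.today()):
        problems.append("date_of_birth out of range")
    # Reference-data check against an industry-standard code set.
    if row.get("currency") not in ISO_CURRENCIES:
        problems.append("unknown currency code")
    return problems

good = check_row({"date_of_birth": date(1985, 3, 2), "currency": "EUR"})
bad = check_row({"date_of_birth": date(1899, 12, 31), "currency": "XXX"})
```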
The Data Architect should focus on the common factors in the business requirements
as early as possible. Variations and dimensions specific to certain parts of the
organization can be dealt with later in the design. More importantly, focusing on the
commonalities early in the process also allows other tasks in the project cycle to
proceed earlier. A project to develop an integrated solution architecture is likely to
encounter such common business dimensions as organizational hierarchy, regional
definitions, a number of calendars, and product dimensions, among others.
The Data Architect may determine at this point that subject areas thought to be
common are, in fact, not common across the entire organization. Various departments
may use different rules to calculate their profit, commission payments, and customer
values. These facts need to be identified and labeled in the logical model according to
the part of the organization using the differing methods. There are two reasons for this:
● Common semantics enable business users to know if they are using the same
organizational terminology as their colleagues.
● Commonality ensures continuity between the measures a business currently
takes from an operational system and the new ones that will be available in the
data integration solution.
Trade-offs between objectives, such as ease of maintenance and minimal disk storage versus speed and usability, determine whether a simple star or a snowflake structure is preferable. One or two central tables should hold the facts. Variations in facts can be
included in these tables along with common organizational facts. Variations in
dimension may require additional dimensional tables.
Tip
Syndicated data sets, such as weather records, should be held in the data warehouse.
These external dimensions will then be available as a subset of the data warehouse. It
should be assumed that the data set will be updated periodically and that the history
will be kept for reference unless the business determines it is not necessary. If the
historical data is needed, the syndicated data sets will need to be date-stamped.
A single code lookup table does not provide the same benefits in a data warehouse as it does on an OLTP system. The function of a single code lookup table is to provide central maintenance of codes and descriptions. This benefit cannot be realized when populating a data warehouse, since data warehouses are potentially loaded from more than one source several times.
Having a single database structure is likely to complicate matters in the future. A single
code lookup table implies the use of a single surrogate key. If problems occur in the
load, they affect all code lookups - not just one. Separate codes would have to be
loaded from their various sources and checked for existing records and updates.
A single lookup table simply increases the amount of work mapping developers need to
carry out to qualify the parts of the table they are concerned with for a particular
mapping. Individual lookup tables remove the single point of failure for code lookups
and improve development time for mappings; however, they also involve more work for
the Data Architect. The Data Architect may prefer to show a single object for codes on
the diagrams. He/she should however, ensure that regardless of how the code tables
are modeled, they will be physically separable when the physical database
implementation takes place.
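The trade-off can be seen in miniature with SQLite: in a single consolidated code table every lookup must qualify by code type, whereas a separate table per code type needs no qualifier. All names below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Single consolidated code table: every lookup must qualify by code_type.
cur.execute("CREATE TABLE codes (code_type TEXT, code TEXT, description TEXT)")
cur.executemany("INSERT INTO codes VALUES (?, ?, ?)",
                [("status", "A", "Active"), ("region", "A", "Americas")])

# Separate lookup table: one table per code type, no qualifier needed.
cur.execute("CREATE TABLE status_codes (code TEXT PRIMARY KEY, description TEXT)")
cur.execute("INSERT INTO status_codes VALUES ('A', 'Active')")

# Without the qualifier, the consolidated table is ambiguous...
ambiguous = cur.execute(
    "SELECT description FROM codes WHERE code = 'A'").fetchall()
# ...so every mapping must add "AND code_type = 'status'".
qualified = cur.execute(
    "SELECT description FROM codes "
    "WHERE code = 'A' AND code_type = 'status'").fetchall()
separate = cur.execute(
    "SELECT description FROM status_codes WHERE code = 'A'").fetchall()
```

This extra qualification is exactly the additional work for mapping developers described above.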
Surrogate Keys
The use of surrogate keys in most dimensional models presents an additional obstacle
that must be overcome in the solution design. It is important to determine a strategy to
create, distribute, and maintain these keys as you plan your design. Any of the
following strategies may be appropriate:
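For example, two commonly used strategies, a central sequence and a deterministic hash of the natural key, can be sketched as follows. This is purely illustrative; in practice the mechanism usually lives in the database or the ETL tool rather than in application code:

```python
import hashlib
import itertools

# Strategy 1: a central sequence hands out the next integer key.
# Simple, but the sequence must be a single point of coordination.
sequence = itertools.count(start=1)

def next_sequence_key() -> int:
    return next(sequence)

# Strategy 2: derive a deterministic key from the natural key, so the same
# source record always maps to the same surrogate value, with no central
# coordination needed across loads.
def hash_key(natural_key: str) -> str:
    return hashlib.sha256(natural_key.encode("utf-8")).hexdigest()[:16]

k1, k2 = next_sequence_key(), next_sequence_key()
h1, h2 = hash_key("cust|1001"), hash_key("cust|1001")
```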
Best Practices
None
Sample Deliverables
None
Description
The data mart's logical data model supports the final step in the integrated enterprise
decision support architecture. These models should be easily identified with their
source in the data warehouse and will provide the foundation for the physical design. In
most modeling tools, the logical model can be used to automatically resolve and
generate some of the physical design, such as lookups used to resolve many-to-many
relationships.
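Resolving a many-to-many relationship into an associative (bridge) table, as modeling tools typically generate it, looks roughly like this; all names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A logical many-to-many between products and promotions is resolved
# physically through an associative (bridge) table.
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE promotion (promo_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE product_promotion (
    product_id INTEGER REFERENCES product(product_id),
    promo_id INTEGER REFERENCES promotion(promo_id),
    PRIMARY KEY (product_id, promo_id))""")

cur.execute("INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget')")
cur.execute("INSERT INTO promotion VALUES (10, 'Spring Sale')")
cur.executemany("INSERT INTO product_promotion VALUES (?, ?)",
                [(1, 10), (2, 10)])

# All products participating in a given promotion.
products_in_promo = cur.execute("""
    SELECT p.name FROM product p
    JOIN product_promotion pp ON p.product_id = pp.product_id
    WHERE pp.promo_id = 10 ORDER BY p.name""").fetchall()
```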
If the data integration project was initiated for the right reasons, the aim of the data
mart is to solve a specific business issue for its business sponsors. As a subset of the
data warehouse, one data mart may focus on the business customers while another
may focus on residential services. The logical design must incorporate transformations
supplying appropriate metrics and levels of aggregation for the business users. The
metrics and aggregations must incorporate the dimensions that the data mart business
users can use to study their metrics. The structure of the dimensions must be
sufficiently simple to enable those users to quickly produce their own reports, if desired.
Prerequisites
None
Roles
Considerations
The subject area of the data mart should be the first consideration, because it determines the facts that must be drawn from the Enterprise Data Warehouse into the mart.
Tip
Keep it Simple!
If, as is generally the case, the data mart is going to be used primarily as a
presentation layer by business users extracting data for analytic purposes, the mart
should use as simple a design as possible.
Best Practices
None
Sample Deliverables
None
Description
The goal of this task is to understand the various data sources that will be feeding the
solution. Completing this task successfully increases the understanding needed to
efficiently map data using PowerCenter. It is important to understand all of the data
elements from a business perspective, including the data values and dependencies on
other data elements. It is also important to understand where the data comes from, how
the data is related, and how much data there is to deal with (i.e., volume estimates).
Prerequisites
None
Roles
Considerations
None
Sample Deliverables
None
Description
The third step in analyzing data sources is to determine the relationship between the
sources and targets and to identify any rework or target redesign that may be required
if specific data elements are not available. This step defines the relationships between
the data elements and clearly illuminates possible data issues, such as incompatible
data types or unavailable data elements.
Prerequisites
None
Roles
Considerations
Creating the relationships between the sources and targets is a critical task in the
design process. It is important to map all of the data elements from the source data to
an appropriate counterpart in the target schema. Taking the necessary care in this
effort should result in the following:
● Identification of any data elements in the target schema that are not currently available from the identified sources
The next step in this subtask produces a Target-Source Matrix, which provides a
framework for matching the business requirements to the essential data elements and
defining how the source and target elements are paired. The matrix lists each of the target tables from the data mart in the rows and descriptions of the source systems in the columns.
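A Target-Source Matrix can be sketched as a simple structure that also surfaces unmapped target elements; the system and table names below are hypothetical:

```python
# Target tables in the rows, source systems in the columns; each cell names
# the source element feeding the target, or None if nothing has been found.
matrix = {
    "dim_customer": {"CRM": "customers", "Billing": "account_holder"},
    "fact_order":   {"CRM": None,        "Billing": "orders"},
}

# Target elements with no source counterpart need follow-up with the
# Business Analyst before mapping work can begin (see Undefined Data below).
unmapped = [(target, source)
            for target, sources in matrix.items()
            for source, element in sources.items()
            if element is None]
```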
Undefined Data
In some cases the Data Architect cannot locate or access the data required to establish
a rule defined by the Business Analyst. When this occurs, the Business Analyst may
need to revalidate the particular rule or requirement to ensure that it meets the end-
users' needs. If it does not, the Business Analyst and Data Architect must determine if
there is another way to use the available data elements to enforce the rule. Enlisting the services of the System Administrator or another knowledgeable source system resource may be helpful. If no solution is found, or if the data meets requirements but
is not available, the Project Manager should communicate with the end-user community
and propose an alternative business rule.
Choosing to eliminate data too early in the process due to inaccessibility, however, may
cause problems further down the road. The Project Manager should meet with the
Business Analyst and the Data Architect to determine what rules or requirements can
be changed and which must remain as originally defined. The Data Architect can
propose data elements that can be safely dropped or changed without compromising
the integrity of the user requirements. The Project Manager must then identify any risks
inherent in eliminating or changing the data elements and decide which are acceptable
to the project.
Some of the potential risks involved in eliminating or changing data elements are:
● Losing a critical piece of data required for a business rule that was not
originally defined but is likely to be needed in the future. Such data loss may
require a substantial amount of rework and can potentially affect project
timelines.
● Any change in data that needs to be incorporated in the Source or Target data
models requires substantial time to rework and could significantly delay
development. Such a change would also push back all tasks defined and
require a change in the Project Plan.
● Changes in the Source system model may drop secondary relationships that downstream requirements depend on
When a source changes after the initial assessment, the corresponding Target-Source
Matrix must also change. The Data Architect needs to outline everything that has
changed, including the data types, names, and definitions. Then, the various risks
involved in changing or eliminating data elements must be re-evaluated. The Data
Architect should also decide which risks are acceptable. Once again, the System
Administrator may provide useful information about the reasons for any changes to the
source system and their effect on data relationships.
Best Practices
None
Sample Deliverables
None
Description
The final step in the 4.2 Analyze Data Sources task is to determine when all source
systems are likely to be available for data extraction. This is necessary in order to
determine realistic start and end times for the load window. The developers need to
work closely with the source system administrators during this step because the
administrators can provide specific information about the hours of operations for their
systems.
The final deliverable in this subtask, the Source Availability Matrix, lists all the sources
that are being used for data extraction and specifies the systems' downtimes during a
24-hour period. This matrix should contain details of the availability of the systems on
different days of the week, including weekends and holidays.
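Given a Source Availability Matrix, the common load window is effectively the intersection of the individual availability windows. A sketch with hypothetical systems and hours:

```python
# Availability windows per source system, as (start_hour, end_hour) in a
# 24-hour day; end hours past midnight are expressed as hour + 24 so the
# wrap-around is handled arithmetically. All systems and hours are invented.
availability = {
    "ERP":     (22, 6 + 24),   # 22:00 to 06:00
    "CRM":     (20, 5 + 24),   # 20:00 to 05:00
    "Billing": (23, 7 + 24),   # 23:00 to 07:00
}

# The shared window opens when the last system becomes available and
# closes when the first system comes back online.
window_start = max(start for start, _ in availability.values())
window_end = min(end for _, end in availability.values())
load_window_hours = max(0, window_end - window_start)
```

If the resulting window is too short or empty, that is a strong signal that an ODS or staging area is needed for logistical reasons, as noted below.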
Prerequisites
None
Roles
Considerations
The information generated in this step will be crucial later in the development process.
This information is also helpful for determining whether an Operational Data Store
(ODS) is needed. Sometimes, the extraction times can be so varied among necessary
source systems that an ODS or staging area is required purely for logistical reasons.
Best Practices
None
Sample Deliverables
Target-Source Matrix
Description
The physical database design is derived from the logical models created in Task 4.1.
Where the logical design details the relationships between logical entities in the
system, the physical design considers the physical characteristics of the database itself.
The physical design must reflect the end-user reporting requirements, organizing the
data entities to allow a fast response to the expected business queries. Physical target
schemas typically range from fully normalized (essentially OLTP structures) to
snowflake and star schemas, and may contain both detail and aggregate information.
The relevant end-user reporting tools, and the underlying RDBMS, may dictate
following a particular database structure (e.g., multi-dimensional tools may arrange the
data into data "cubes").
Prerequisites
None
Roles
Considerations
Although many factors influence the physical design of the data marts, end-user
reporting needs are the primary driver. These needs determine the likely selection
criteria, filters, selection sets and measures that will be used for reporting. These
elements may, in turn, suggest indexing or partitioning policies (i.e., to support the most
frequent cross-references between data objects or tables and identify the most
common table joins) and appropriate access rights, as well as indicate which elements
are likely to grow or change most quickly.
A final consideration is how to implement the schema. Database design tools may
generate and execute the necessary processes to create the physical tables, and
the PowerCenter Metadata Exchange can interact with many common tools to pull
target table definitions into the repository. However, automated scripts may still be
necessary for dropping, truncating, and creating tables.
For Data Migration, the tables that are designed and created are normally either stage
tables or reference tables. These tables are generated to simplify the migration
process. The table definitions for the target application are almost always provided to
the data migration team. These are typically delivered with a packaged application or
already exist for the broader project implementation.
Sample Deliverables
Description
As with all design tasks, there are both enterprise and workgroup considerations in
developing the physical database design. Optimally, the final design should balance these enterprise-level and workgroup-level factors.
Physical designs are required for target data marts, as well as any ODS/DDS schemas
or other staging tables.
The relevant end-user reporting tools, and the underlying RDBMS, may dictate
following a particular database structure (e.g., multi-dimensional tools may arrange the
data into data "cubes").
Prerequisites
None
Roles
Considerations
The logical target data models provide the basic structure of the physical design. The
physical design provides a structure that enables the source data to be quickly
extracted and loaded in the transformation process, and allows a fast response to the
end-user queries.
The design must also reflect the end-user reporting requirements, organizing the data
entities to provide answers to the expected business queries.
● Operational Data Store (ODS) design. This is usually closely related to the
individual sources, and is, therefore, relationally organized (like the source
OLTP), or simply relational copies of source flat files, optimized for fast loading (to keep the connection to the source system as short as possible) with few or no indexes or constraints.
● Data Warehouse design. Tied to subject areas, this may be based on a star-
schema (i.e., where significant end-user reporting may occur), or a more
normalized relational structure (where the data warehouse acts purely as a
feeder to several dependent data marts) to speed up extracts to the
subsequent data marts.
● Data Mart design. The usual source for complex business queries, this
typically uses a star or snowflake schema, optimized for set-based reporting
and cross-referenced against many, varied combinations of dimensional
attributes. May use multi-dimensional structures if a specific set of end-user
reporting requirements can be identified.
Tip
The tiers of a multi-tier strategy each have a specific purpose, which strongly
suggests the likely physical structure:
● ODS - Staging from source should be designed to quickly move data from
the operational system. The ODS structure should be very similar to the
source since no transformations are performed, and has few indexes or
constraints (which slow down loading).
● The Data Warehouse design should be biased toward feeding subsequent
data marts, and should be indexed to allow rapid feeds to the marts, along
with a relational structure. At the same time, since the data warehouse
functions as the enterprise-wide central point of reference, physical
partitioning of larger tables allows it to be quickly loaded via parallel
processes. Because data volumes are high, the data warehouse and ODS
structures should be as physically close as possible so as to avoid network
traffic.
● Data Marts should be strongly biased toward reporting, most likely as star-
schemas, or multi-dimensional cubes. The volumes will be smaller than the
parent data warehouse, so the impact of indexes on loading is not as
significant.
The physical database design is tempered by the functionality of the operating system
and RDBMS. In an ideal world, all RDBMS products would provide the same set of functions, level of configuration, and scalability. This is not the case, however: different vendors include different features in their systems, and new features arrive with each new release. This may affect:
● Physical partitioning. This is not available with all systems. A lack of physical
partitioning may affect performance when loading data into growing tables.
When it is available, partitioning allows faster parallel loading to a single table, as well as greater flexibility in table reorganizations, backup, and recovery.
● Physical device management. Spreading individual targets or partitions across many physical devices can speed loading, because several tables sharing a single device must use the same read-write heads when being updated in parallel. Of course, using multiple, separate devices may result in added
administrative overhead and/or work for the DBA (i.e., to define additional
pointers and create more complex backup instructions).
● Limits to individual tables. Older systems may not allow tables to physically
grow past a certain size. This may require amending an initial physical design
to split up larger tables.
Tip
Using multiple physical devices to store whole tables allows faster parallel updates to
them.
If target tables are physically partitioned, the separate partitions can be stored on separate physical devices, allowing a further degree of parallel loading. The downside is that extra initial and ongoing DBA and systems administration effort is required to manage the partitions, although much of this can be automated using external scripts.
Tools
The relevant end-user reporting tools may dictate following a particular database
structure, at least for the data mart and data warehouse designs.
Hardware Issues
Physical designs should be able to be implemented on the existing system (which can
help to identify weaknesses in the physical infrastructure). The areas to consider are:
For a large system, the likely demands on the data mart should affect the physical
design. Factors to consider include:
● Will end-users require continuous (i.e., 24x7) access, or will a batch window
be available to load new data? Each involves some issues: continuous access
may require complex partitioning schemes and/or holding multiple copies of
the data, while a batch window would allow indexes/constraints to be dropped
before loading, resulting in significantly decreased load times.
● Will different users require access to the same data, but in different forms (e.
g., different levels of aggregation, or different sub-sets of the data)?
● Will all end-users access the same physical data, or local copies of it (which
need to be distributed in some way)? This issue affects the potential size of
any data mart.
● Will the designs fit into existing back up processes? Will they execute within
the available timeframes and limits?
● Will recovery processes allow end-users to quickly re-gain access to their
reporting system?
● Will the structures be easy to maintain (i.e., to change, reorganize, rebuild, or
upgrade)?
Tip
Indexing frequently-used selection fields/columns can substantially speed up the response for end-user reporting: when appropriately indexed fields are used in a request, the database engine optimizes its search pattern rather than simply scanning all rows of the table. The more indexes that exist on the target, however, the slower the data loads into the target, since maintaining the indexes becomes an additional load on the database engine.
Where an appropriate batch window is available for performing the data load, the
indexes can be dropped before loading, and then re-generated after the load. If no
window is available, the strategy should be one of balancing the load and reporting
needs by careful selection of which fields to index.
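The drop-load-recreate pattern can be sketched with SQLite; the table and index names are illustrative, and production scripts would of course target the warehouse RDBMS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER, region TEXT)")
cur.execute("CREATE INDEX ix_region ON fact_sales (region)")

# During the batch window: drop the index, bulk load, then re-create it,
# so the engine does not maintain the index row by row during the load.
cur.execute("DROP INDEX ix_region")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(i, "EMEA" if i % 2 else "APAC") for i in range(1000)])
cur.execute("CREATE INDEX ix_region ON fact_sales (region)")

indexes = [row[1] for row in cur.execute("PRAGMA index_list('fact_sales')")]
row_count = cur.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```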
For Data Migration projects, it is rare that any tables will be designed for the source or target application. If tables are needed, they will most likely be staging tables or tables used to assist in transformation. It is common for the staging tables to mirror either the source system or the target system. It is encouraged to create two levels of staging, where a Legacy Stage mirrors the source system and a Pre-Load Stage mirrors the target system.
Best Practices
None
Sample Deliverables
None
Description
The objective of this task is to design a presentation layer for the end-user community.
The developers will use the design that results from this task and its associated
subtasks in the Build Phase to build the presentation layer (5.6 Build Presentation
Layer). This task includes activities to develop a prototype, demonstrate it to users and
get their feedback, and document the overall presentation layer.
The purpose of any presentation layer is to design an application that can transform
operational data into relevant business information. An analytic solution helps end
users to formulate and support business decisions by providing this information in the
form of context, summarization, and focus.
Note: Readers are reminded that this guide is intentionally analysis-neutral. This
section describes some general considerations and deliverables for determining how to
deliver information to the end user. This step may actually take place earlier in this
phase, or occur in parallel with the data integration tasks.
Prerequisites
None
Roles
Considerations
The analysis tool does not necessarily have to be "one size fits all." Meeting the
requirements of all end users may require mixing different approaches to end-user
analysis. For example, if most users are likely to be satisfied with an OLAP tool while a
group focusing on fraud detection requires data mining capabilities, the end-user
analysis solution should include several tools, each satisfying the needs of the various
user groups. The needs of the various users should be determined by the user
requirements defined in 2.2 Define Business Requirements.
Best Practices
None
Sample Deliverables
Description
The purpose of this subtask is to develop a prototype of the end-user presentation layer
"application" for review by the business community (or its representatives).
The result of this subtask is a working prototype for end-user review and investigation.
PowerCenter can deliver a rough cut of the data to the target schema; then, Data
Analyzer (or other business intelligence tools) can build reports relatively quickly,
thereby allowing the end-user capability to evolve through multiple iterations of the
design.
Prerequisites
None
Roles
Considerations
It is important to use actual source data in the prototype. The closer the prototype is to
what the end user will actually see upon final release, the more relevant the feedback.
In this way, end users can see an initial interpretation of their needs and validate or
expand upon certain requirements.
Also consider the benefits of baselining the user requirements through a sign-off
process. This makes it easier for the development team to focus on deliverables. A
formal change control request process complements this approach. Baselining user
requirements also allows accurate tracking of progress against the project plan and
provides transparency to changes in the user requirements. This approach helps to keep the project focused and on schedule.
Best Practices
None
Sample Deliverables
None
Description
The purpose of this subtask is to present the presentation layer prototype to business
analysts and the end users. The result of this subtask will be a deliverable,
the Prototype Feedback document, containing detailed results from the prototype
presentation meeting or meetings.
Technologies such as OLAP, EIS and Data Mining often bring a new data analysis
capability and approach to end users. In an ad hoc reporting paradigm, end users must
precisely specify their queries. Multidimensional analysis allows for much more
discovery and research, which follows a different paradigm. A prototype that uses
familiar data to demonstrate these abilities helps to launch the education process while
also improving the design. The demonstration of the prototype is also an opportunity to
further refine the business requirements discovered in the requirements gathering
subtask. The end users themselves can offer feedback and ensure that the method of
data presentation and the actual data itself are correct.
Prerequisites
Roles
Considerations
As with the tool selection process, it is important here to assemble a group that
represents the spectrum of end-users across the organization, from business analysts
to high-level managers. A cross section of end users at various levels ensures an
accurate representation of needs across the organization. Different job functions
require different information and may also require various data access methods (i.e., ad
hoc, OLAP, EIS, Data Mining). For example, information that is important to business analysts, such as metadata, may not be important to a high-level manager, and vice versa.
The demonstration of the presentation layer tool prototype should not be a one-time
activity; instead it should be conducted at several points throughout design and
development to facilitate and elicit end-user feedback. Involving the end users is vital to
getting "buy-in" and ensuring that the system will meet their requirements. User
involvement also helps build support for the presentation layer tool throughout the
organization.
Best Practices
None
Sample Deliverables
Prototype Feedback
Description
Prerequisites
Roles
Considerations
Types of Layouts
Each piece of information presented to the end user has its own level of importance.
The significance and required level of detail in the information to be delivered
determines whether to present the information on a dashboard or a report.
For example, information that needs to be concise and answers the question “Has this measurement fallen below the critical threshold?” qualifies to be an Indicator on a dashboard. The most critical information in this category, which needs to reach the end user without waiting for the user to log onto the dashboard, is better delivered as an Alert.
Dashboards
Data Analyzer dashboards contain all the critical information users need in one single
interface. Data can be provided via Alerts, Indicators, or links to Favorite Reports and
Shared Documents.
Data Analyzer facilitates the design of an appealing presentation layout for the
information by providing predefined dashboard layouts. A clear understanding of what needs to be displayed, as well as how many different types of indicators and alerts are going to be put on the dashboard, is important in the selection of an appropriate dashboard layout. Generally, each subset of data should be placed in a separate container. Links to detailed reports can be placed on dashboards so that users can easily navigate to them.
Each report you build should use design features suited to the data it displays, so that the report communicates its message effectively. To that end, understand the type of data each report will display before choosing a report table layout. For example, a tabular
layout would be appropriate for a sales revenue report that shows the dollar amounts
against only one dimension (e.g., product category), but a sectional layout would be
more appropriate if the end users are interested in seeing the dollar amounts for each
category of the product in each district, one at a time.
When developing either a dashboard or report, be sure to consider the following points:
● Who is your audience? You have to know who the intended recipient of the information is. The audience’s requirements and
preferences should drive your presentation style. Often there will be multiple
audiences for the information you have to share. On many occasions, you will
find that the same information will best serve its purpose if presented in two
different styles to two different users. For example: you may have to create
multiple dashboards in a single project and personalize each dashboard to a
specific group of end users’ needs.
● What type of information do the users need and what are their expectations?
Always remember that the users are looking for very specific pieces of
information in the presentation layout. Most of the time, the business users are
interested only in the information relevant to their own roles.
Additionally, the users' expectations will affect the way your information is
presented to them. Some users may be interested in more indicators and
charts, while others may want to see detailed reports. The more thoroughly you
understand the user expectations, the better you can design the presentation
layout.
● Why do they need it? Understanding this can help you to choose the right
layout for each piece of information that you have to present. If they want
granular information, then they are likely to want a detailed report. However, if
they just need quick glimpses of the data, indicators on a dashboard or
emailed alerts are likely to be more appropriate.
● When does the data need to be displayed? It is critical to know when
important business processes occur. This can help drive the development and
scheduling of reports – daily, weekly, monthly, etc. This can also help to
determine what type of indicators to develop, such as monthly or daily sales.
● How should the data be displayed? A well-designed chart, graph or an
indicator can convey critical information to the concerned users quickly and
accurately. It becomes important to choose the right colors and backgrounds
to catch the user’s attention where it is needed the most. A good example of
this would be using a bright red color for all your alerts, green for all the ‘good’
values and so on.
Tip
It is also important to determine if there are any enterprise standards set for
the layout designs of the reports and dashboards, especially the color codes
as given in the example above.
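The threshold-driven color convention described above can be sketched as a simple rule. The function name, thresholds, and colors below are illustrative assumptions, not part of Data Analyzer or any enterprise standard:

```python
# Hypothetical sketch: mapping a measurement against its thresholds to a
# dashboard indicator color, following the red/green convention above.
# Threshold values and color names are invented for illustration.

def indicator_color(value: float, critical: float, warning: float) -> str:
    """Return a display color for a dashboard indicator."""
    if value < critical:
        return "red"      # breached the critical threshold: raise an alert
    if value < warning:
        return "yellow"   # approaching the threshold: worth watching
    return "green"        # healthy value

print(indicator_color(80, critical=100, warning=150))   # red
print(indicator_color(200, critical=100, warning=150))  # green
```

In practice, any such color coding should be checked against the enterprise layout standards mentioned in the tip above.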
Best Practices
None
5 Build
Description
At this point, the project scope, plan, and business requirements defined in
the Manage Phase should be re-evaluated to ensure that the project can deliver the
appropriate value at an appropriate time.
Prerequisites
None
Roles
Considerations
PowerCenter serves as a complete data integration platform to move data from source
to target databases, perform data transformations, and automate the extract, transform,
and load (ETL) processes. As a project progresses from the Design Phase to the
Build Phase, it is helpful to review the activities involved in each of these processes.
Best Practices
None
Sample Deliverables
None
Description
In order to begin the Build phase, all analysis performed in previous phases of the
project needs to be compiled, reviewed and disseminated to the members of the Build
team. Attention should be given to project schedule, scope, and risk factors. The team
should be provided with:
● Project background
● Business objectives for the overall solution effort
● Project schedule, complete with key milestones, important deliverables,
dependencies, and critical risk factors
● Overview of the technical design including external dependencies
● Mechanism for tracking scope changes, problem resolution, and other
business issues
A series of meetings may be required to transfer the knowledge from the Design team
to the Build team, ensuring that the appropriate staff is provided with relevant
information. Some or all of the following types of meetings may be required to get
development under way:
● Kick-off meeting to introduce all parties and staff involved in the Build phase
● Functional design review to discuss the purpose of the project and the benefits
expected and review the project plan
● Technical design review to discuss the source to target mappings, architecture
design, and any other technical documentation
Information provided in these meetings should enable members of the data integration
team to immediately begin development. As a result of these meetings, the integration
team should have a clear understanding of the environment in which they are to work,
including databases, operating systems, database/SQL tools available in the
environment, file systems within the repository and file structures within the
organization relating to the project, and all necessary user logons and passwords.
Prerequisites
None
Roles
Considerations
It is important to include all relevant parties in the launch activities. If all points of contact are represented, issues and dependencies can be surfaced and addressed early.
Because of the nature of the development process, there are often bottlenecks in the
development flow. The Project Manager should be aware of the risk factors, which
emanate from outside the project, and should be able to anticipate where bottlenecks
are likely to occur. The Project Manager also needs to be aware of the external factors
that create project dependencies, and should avoid having meetings prematurely when
external dependencies have not been resolved. Having meetings prior to resolving
these issues can result in significant down time for the developers while they wait to
have their sources in place and finalized.
Best Practices
None
Sample Deliverables
None
Description
The Build team needs to understand the project's objectives, scope, and plan in order
to prepare themselves for the Build Phase. There is often a tendency to waste time
developing non-critical features or functions. The team should review the project plan
and identify the critical success factors and key deliverables to avoid focusing on
relatively unimportant tasks. This helps to ensure that the project stays on its original track and avoids unnecessary effort. The team should again be provided with the project background, business objectives, schedule, technical design overview, and issue-tracking mechanisms outlined in the launch activities.
With this information, the Build team should be able to enhance the project plan to
navigate through the risk areas, dependencies, and tasks to reach its goal of
developing an effective solution.
Prerequisites
None
Roles
Considerations
With the Design Phase complete, this is the first opportunity for the team to review
what it has learned during the Architect Phase and the Design Phase about the
sources of data for the solution. It is also a good time to review and update the project
plan, which was created before these findings, to incorporate the knowledge gained
during the earlier phases. For example, the team may have learned that the source of
data for marketing campaign programs is a spreadsheet that is not easily accessible by
the network on which the data integration platform resides. In this case, the team may
need to plan additional tasks and time to build a method for accessing the data. This is
also an appropriate time to review data profiling and analysis results to ensure all data
quality requirements have been taken into consideration.
During the project scope and plan review, significant effort should be made to identify
upcoming Build Phase risks and assess their potential impact on project schedule and/
or cost. Because the design is complete, risk management at this point tends to be
more tactical than strategic; however, the team leadership must be fully aware of key
risk factors that remain. Team members are responsible for identifying the risk factors
in their respective areas and notifying project management during the review process.
Best Practices
None
Sample Deliverables
Description
The data integration team needs the physical model of the target database in order to
begin analyzing the source to target mappings and develop the end user interface
known as the presentation layer.
The Data Architect can provide database specifics, such as which columns are indexed, what partitions are available and how they are defined, and what type of data is stored in each table.
The Data Warehouse Administrator can provide metadata information and other source
data information, and the Data Integration Developer(s) needs to understand the entire
physical model of both the source and target systems, as well as all the dimensions,
aggregations, and transformations that will be needed to migrate the data from the
source to the target.
Prerequisites
None
Roles
Considerations
Depending on how much up-front analysis was performed prior to the Build phase, the
project team may find that the model for the target database does not correspond well
with the source tables or files. This can lead to extremely complex and/or poorly
performing mappings. For this reason, it is advisable to allow some flexibility in the
design of the physical model to permit modifications to accommodate the sources. In
addition, some end user products may not support some datatypes specific to a
database. For example, Teradata's BYTEINT datatype is not supported by some end
user reporting tools.
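A pre-build compatibility check along these lines can catch such datatype mismatches early. This is a minimal sketch; the supported-type set and column names are invented for illustration and would come from the actual reporting tool's documentation:

```python
# Illustrative sketch: checking a target model's column datatypes against the
# set a (hypothetical) reporting tool supports, so unsupported types such as
# Teradata's BYTEINT surface before the build begins.

SUPPORTED_BY_REPORTING_TOOL = {"INTEGER", "DECIMAL", "VARCHAR", "DATE", "TIMESTAMP"}

def unsupported_columns(columns: dict) -> list:
    """Return (column, datatype) pairs the reporting tool cannot consume."""
    return [(name, dtype) for name, dtype in columns.items()
            if dtype.upper() not in SUPPORTED_BY_REPORTING_TOOL]

target_model = {"order_id": "INTEGER", "flag": "BYTEINT", "amount": "DECIMAL"}
print(unsupported_columns(target_model))  # [('flag', 'BYTEINT')]
```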
As a result of the various kick-off and review meetings, the data integration team
should have sufficient understanding of the database schemas to begin work on the
Build-related tasks.
Best Practices
None
Sample Deliverables
Description
Since testing is designed to uncover defects, it is crucial to properly record the defects
as they are identified, along with their resolution process. This requires a ‘defect
tracking system’ that may be entirely manual, based on shared documents such as
spreadsheets, or automated using, say, a database with a web browser front-end.
Whatever tool is chosen, sufficient details of the problem must be recorded to allow
proper investigation of the root cause and then the tracking of the resolution process. Effective defect tracking depends on:
● Formal test plans and schedules being in place, to ensure that defects are
discovered, and that their resolutions can be retested.
● Sufficient details being recorded to ensure that any problems reported are
repeatable and can be properly investigated.
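Whatever the tracking tool, a minimal defect record might capture the fields described above. This sketch assumes nothing about any specific tracker; the field names and status flow are illustrative:

```python
# Minimal sketch of a defect-tracking record for a spreadsheet- or
# database-backed tracker. Field names and statuses are assumptions.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Defect:
    defect_id: int
    summary: str
    steps_to_reproduce: str          # enough detail to make the problem repeatable
    reported_on: date = field(default_factory=date.today)
    status: str = "open"             # open -> resolved -> retested/closed

    def resolve(self, resolution: str) -> None:
        """Record how the defect was fixed so the fix can be retested."""
        self.resolution = resolution
        self.status = "resolved"

d = Defect(1, "Totals wrong in sales report", "Run mapping m_sales with June data")
d.resolve("Fixed aggregation filter")
print(d.status)  # resolved
```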
Prerequisites
None
Roles
Considerations
The Project Manager and Test Manager review the test results at their next meeting
and agree on closure, if appropriate.
Best Practices
None
Issues Tracking
Description
Implementing the physical database is a critical task that must be performed efficiently
to ensure a successful project. In many cases, correct database implementation can
double or triple the performance of the data integration processes and presentation
layer applications. Conversely, poor physical implementation generally has the greatest
negative performance impact on a system.
The information in this section is intended as an aid for individuals responsible for the
long-term maintenance, performance, and support of the database(s) used in the
solution. It should be particularly useful for programmers, Database Administrators, and
System Administrators with an in-depth understanding of their database engine and
Informatica product suite, as well as the operating system and network hardware.
Prerequisites
None
Roles
Considerations
The DBA is responsible for determining which of the many available alternatives is the
best implementation choice for the particular database. For this reason, it is critical for
this individual to have a thorough understanding of the data, database, and desired use
of the database by the end-user community prior to beginning the physical design and
implementation processes.
The DBA should be thoroughly familiar with the design of star-schemas for Data
Warehousing and Data Integration solutions, as well as standard 3rd Normal
Form implementations for operational systems.
For data migration projects this task often refers exclusively to the development of new
tables in either a reference data schema or staging schemas. Developers are
encouraged to leverage a reference data database which will hold reference data such
as valid values, cross-reference tables, default values, exception handling details, and
other tables necessary for successful completion of the data migration. Additionally,
tables will get created in staging schemas. There should be little creation of tables in
the source or target system due to the nature of the project. Therefore, most of the table development will take place in the developer space rather than in the applications that are part of the data migration.
Best Practices
None
Sample Deliverables
None
Description
Follow the steps in this task to design and build the data quality enhancement
processes that can ensure that the project data meets the standards of data quality
required for progress through the rest of the project.
The processes designed in this task are based on the results of 2.8 Perform Data
Quality Audit. Both the design and build components are captured in the Build Phase because much of this work is iterative: as intermediate builds of the data quality process are reviewed, the design is further expanded and enhanced.
Note: If the results of the Data Quality Audit indicate that the project data already
meets all required levels of data quality, then you can skip this task. However, this
is unlikely to occur.
Here again (as in subtask 2.3.1 Identify Source Data Systems) it is important to work as
far as is practicable with the actual source data. Using data derived from the actual
source systems - either the complete dataset or a subset - was essential in identifying
quality issues during the Data Quality Audit and determining if the data meets the
business requirements (i.e., if it answers the business questions identified in
the Manage Phase). The data quality enhancement processes designed in the
subtasks of this task must operate on as much of the project dataset(s) as deemed
necessary, and possibly the entire dataset.
Data quality checks can be of two types: one can cover the metadata characteristics of
the data, and the other covers the quality of the data contents from a business
perspective. In the case of complex ERP systems like SAP, where implementation has
a high degree of variation from the base product, a thorough data quality check should
be performed to consider the customizations.
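The two check types can be sketched as separate predicates: one against the expected metadata (schema), one applying a business expectation to the content. The schema and rules below are invented examples, not drawn from any real system:

```python
# Hedged sketch of the two data quality check types described above.
# EXPECTED_FIELDS and the content rule are illustrative assumptions.

EXPECTED_FIELDS = {"customer_id", "country", "order_total"}

def metadata_check(record: dict) -> bool:
    """Metadata check: do the record's fields match the expected schema?"""
    return set(record) == EXPECTED_FIELDS

def content_check(record: dict) -> bool:
    """Business-perspective check: total non-negative, country populated."""
    return record["order_total"] >= 0 and bool(record["country"].strip())

rec = {"customer_id": 7, "country": "DE", "order_total": 99.5}
print(metadata_check(rec), content_check(rec))  # True True
```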
Prerequisites
None
Roles
Considerations
Because the quality of the source system data has a major effect on the correctness of
all downstream data, it is imperative to resolve as many of the data issues as possible,
as early as possible. Making the necessary corrections at this stage eliminates many of
the questions that may otherwise arise later during testing and validation.
If the data is flawed, the development initiative faces a very real danger of failing. In
addition, eliminating errors in the source data makes it far easier to determine the
nature of any problems that may arise in the final data outputs. If data comes from
different sources, it is mandatory to correct data for each source as well as for the
integrated data. If data comes from a mainframe, it is necessary to use the proper
access method to interpret the data correctly. Note, however, that Informatica Data Quality (IDQ) applications do not read data directly from mainframes.
As indicated above, the issue of data quality covers far more than simply whether the
source and target data definitions are compatible. From the business perspective, data
quality processes seek to answer the following questions: what standard has the data
achieved in areas that are important to the business, and what standards are required
in these areas?
There are six main areas of data quality performance: Accuracy, Completeness,
Conformity, Consistency, Integrity, and Duplication. These are fully explained in
task 2.8 Perform Data Quality Audit. The Data Quality Developer uses the results of the
Data Quality Audit as the benchmark for the data quality enhancement steps you need
to apply in the current task. Before beginning to design the data quality processes, the Data Quality Developer, Business Analyst, Project Sponsor, and other interested parties should agree on the standard the data must meet in each of these areas.
The tasks that follow are written from the perspective of Informatica Data Quality,
Informatica’s dedicated data quality application suite.
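Two of those six dimensions lend themselves to simple metrics, sketched here with invented sample data: completeness as the share of non-empty values in a column, and duplication as the share of repeated key values.

```python
# Illustrative metrics for two of the six data quality dimensions.
# The sample email and id values are invented.

def completeness(values) -> float:
    """Share of values that are populated (not None or empty string)."""
    filled = [v for v in values if v not in (None, "")]
    return len(filled) / len(values)

def duplication(keys) -> float:
    """Share of key values that are repeats of an earlier value."""
    return 1 - len(set(keys)) / len(keys)

emails = ["a@x.com", "", "b@x.com", None]
ids = [1, 2, 2, 3]
print(completeness(emails))  # 0.5
print(duplication(ids))      # 0.25
```

Comparable figures from the Data Quality Audit serve as the benchmark the enhancement processes must improve upon.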
Best Practices
Data Cleansing
Sample Deliverables
None
Description
Business rules are a key driver of data enhancement processes. A business rule is a
condition of the data that must be true if the data is to be valid and, in a larger sense, for a specific business objective to succeed. In many cases, poor data quality is directly related to the data’s failure to satisfy a business rule.
In this subtask the Data Quality Developer and the Business Analyst, and optionally
other personnel representing the business, establish the business rules to be applied to
the data. An important factor in completing this task is proper documentation of the
business rules.
Prerequisites
None
Roles
Considerations
All areas of data quality can be affected by business rules, and business rules can be
defined at high- and low-levels and at varying levels of complexity. Some business
rules can be tested mathematically using simple processes, whereas others may
require complex processes or reference data assistance.
For example, consider a financial institution that must store several types of information
for account holders in order to comply with the Sarbanes-Oxley or the USA-PATRIOT
Act. It defines several business rules for its database data, including:
These three rules are equally easy to express, but they are implemented in different
ways. All three rules can be checked in a straightforward manner using Informatica
Data Quality (IDQ), although the third rule, concerning address validation, requires
reference data verification. The decision to use external reference data is covered in
subtask 5.3.2 Determine Dictionary and Reference Data Requirements.
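Business rules of this kind can be expressed as named predicates over a record, so each record's failures can be reported by rule name. The rules below are hypothetical examples in the spirit of the account-holder scenario; the actual rules come from the Business Analyst and the business:

```python
# Hypothetical business rules as predicates. The field names, the SSN format
# rule, and the other checks are invented illustrations, not real compliance
# rules. Address validation would additionally consult postal reference data.

import re

RULES = {
    "ssn_format": lambda r: bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", r.get("ssn", ""))),
    "dob_present": lambda r: bool(r.get("date_of_birth")),
    "address_present": lambda r: bool(r.get("address", "").strip()),
}

def failed_rules(record: dict) -> list:
    """Names of the business rules this record fails."""
    return [name for name, check in RULES.items() if not check(record)]

rec = {"ssn": "123-45-6789", "address": "1 Main St"}
print(failed_rules(rec))  # ['dob_present']
```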
When defining business rules, the Data Quality Developer must consider the following
questions:
Note: In IDQ, a discrete data quality process is called a plan. A plan has inputs, outputs, and analysis or enhancement algorithms, and is analogous to a PowerCenter mapping.
When the Data Quality Developer and Business Analyst have agreed on the business
rules to apply to the data, the Data Quality Developer must decide how to convert the
rules into data quality plans. (The Data Quality Developer need not create the plans at this stage.)
The Data Quality Developer may create a plan for each rule, or may incorporate
several rules into a single plan. This decision is taken on a rule-by-rule basis. There is
a trade-off between simplicity in plan design, wherein each plan contains a single rule,
and efficiency in plan design, wherein a single plan addresses several rules.
Typically, a plan handles more than one rule. One advantage of this approach is that the Data Quality Developer does not need to define and maintain multiple sets of inputs and outputs, each covering a small increment of data quality progress, when a single set of inputs and outputs can do the same job in a more sophisticated plan.
It is also worth considering whether the plan will be run from within IDQ or added to a PowerCenter mapping for execution in a workflow. Bear in mind that the Data Quality Integration transformation in PowerCenter accepts information from one plan; to add several plans to a mapping, you must add a corresponding number of transformations.
Best Practices
None
Sample Deliverables
None
Description
Many data quality plans make use of reference data files to validate and improve the
quality of the input data. The main purposes of reference data are:
● To validate the accuracy of the data in question. For example, in cases where
input data is verified against tables of known-correct data.
● To enrich data records with new data or enhance partially-correct data values.
For example, in cases of address records that contain usable but incomplete
postal information. (Typos can be identified and fixed; Plus-4 information can
be added to zip codes.)
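Both purposes can be sketched with small lookup tables standing in for real reference files; the city dictionary and Plus-4 lookup below are invented placeholders for the much larger files used in practice:

```python
# Sketch of the two reference-data purposes: validating a value against a
# known-correct dictionary, and enriching a ZIP code with Plus-4 information.
# Both lookup tables are invented stand-ins for real reference files.

CITY_DICTIONARY = {"springfield", "riverton"}                 # assumed dictionary
PLUS4_LOOKUP = {("62701", "100 Main St"): "62701-4321"}       # assumed lookup

def validate_city(city: str) -> bool:
    """Is the city a known-correct value?"""
    return city.strip().lower() in CITY_DICTIONARY

def enrich_zip(zip5: str, street: str) -> str:
    """Return the ZIP+4 if the lookup knows it; otherwise keep the 5-digit ZIP."""
    return PLUS4_LOOKUP.get((zip5, street), zip5)

print(validate_city("Springfield"))        # True
print(enrich_zip("62701", "100 Main St"))  # 62701-4321
```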
When preparing to build data quality plans, the Data Quality Developer must determine
the types of dictionary and reference files that may be used in the data quality plans,
obtain approval to use third-party data, if necessary, and define a strategy for
maintaining and distributing reference files. An important factor in completing this task
is the proper documentation of the required dictionary or reference files.
Prerequisites
None
Roles
Considerations
Data quality plans can make use of three types of reference data.
● Standard dictionary files. These files are installed with Informatica Data
Quality (IDQ) and can be used by several types of components in Workbench.
All dictionaries installed with IDQ are text dictionaries. These are plain-text
files saved in .DIC file format. They can be created and edited manually.
IDQ installs with a set of dictionary files in generic business information areas
including forenames, city and town names, units of measurement, and gender
identification. Informatica also provides and supports reference data of external
origin, such as postal address data endorsed by national postal carriers.
If the Data Quality Developer believes that externally-derived reference data files are necessary, he or she must inform the Project Manager or other business personnel as soon as possible, as this is likely to affect (1) the project budget and (2) the software architecture implementation.
A non-standard location is one where the plans cannot find the dictionary files. If the relevant dictionary files are moved out of the recognized locations, the plan cannot run unless the config.xml file has been edited. Conversely, if the user has created new or modified dictionaries within the standard dictionary format and wishes to copy (publish) plans to a server or another IDQ installation, the user must also copy the new dictionary files to a location recognized by that server or installation.
Third-party reference data adds another set of actions. The third-party data currently
available from Informatica is packaged in a manner that installs to locations recognized
by IDQ. (Again, these locations are defined in the config.xml file.) However, copying these files to other locations is less straightforward, because they are installed differently and because they are licensed and delivered separately from IDQ. The
business must agree to license these files before the Data Quality Developer can
assume he or she can develop plans using third-party files, and the system
administrator must understand that the reference data will be installed in the required
locations.
Whenever you add a dictionary or reference data file to a plan, you must document
exactly how you have done so: record the plan name, the reference file name, and the
component instance that uses the reference file. Make sure you pass the inventory of
reference data to all other personnel who are going to use the plan.
Data migration projects have additional reference data requirements which include a
need to determine the valid values for key code fields and to ensure that all input data
aligns with these codes. It is recommended to build valid-value processes to perform this validation. It is also recommended to use a table-driven approach for otherwise hard-coded values, which allows for easy changes if those values change over time. Additionally, a large number of basic cross-references are also
required for data migration projects. These data types are examples of reference data
that should be planned for by using a specific approach to populate and maintain them
with input from the business community. These needs can be met with a variety of
Informatica products, but to expedite development, they must be addressed prior to
building data integration processes.
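The table-driven approach can be sketched as follows, with dicts standing in for the reference tables; the status codes and the legacy-to-target mapping are invented:

```python
# Sketch of the table-driven valid-value and cross-reference approach
# recommended above. Reference data lives in tables (modeled as dicts here)
# rather than being hard-coded, so values can change without code changes.

VALID_STATUS_CODES = {"A", "I", "P"}                 # assumed valid-value table
LEGACY_TO_TARGET_STATUS = {"ACT": "A", "INA": "I"}   # assumed cross-reference

def migrate_status(legacy_code: str):
    """Map a legacy code to the target code; None if it cannot be validated."""
    target = LEGACY_TO_TARGET_STATUS.get(legacy_code)
    return target if target in VALID_STATUS_CODES else None

print(migrate_status("ACT"))  # A
print(migrate_status("XXX"))  # None
```

Records that map to None would be routed to exception handling for review by the business community.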
Sample Deliverables
None
Description
This subtask, along with subtask 5.3.4 Design Run-time and Real-time Processes for Operate
Phase Execution concerns the design and execution of the data quality plans that will prepare the
project data for the Data Integration Design and Development in the Build Phase.
While this subtask describes the creation and execution of plans through Informatica Data Quality
(IDQ) Workbench, subtask 5.3.4 focuses on the steps to deploy plans in a runtime or scheduled
environment. All plans are created in Workbench. However, there are several aspects to creating
plans primarily for runtime use, and these are covered in 5.3.4. Users who are creating plans
should read both subtasks.
Note: IDQ provides a user interface, the Data Quality Workbench, within which plans can be
designed, tested, and deployed to other Data Quality engines across the network. Workbench is an
intuitive user interface; however, the plans that users construct in Workbench can grow in size and
complexity, and Workbench, like all software applications, requires user training. These subtasks
are not a substitute for that training. Instead, they describe the rudiments of plan construction, the
elements required for various types of plans, and the next steps to plan deployment. Both subtasks
assume that the Data Quality Developer will have received formal training in IDQ.
Prerequisites
None
Roles
Considerations
A data quality plan is a discrete set of data analysis and/or enhancement operations with a data
source and a data target (or sink). At a high level, the design of a plan is not dissimilar to the
design of a PowerCenter mapping. The data sources, sinks, and analysis/enhancement
components are represented on-screen by icons, much like the sources, targets, and
transformations in a mapping. Sources, sinks, and other components can be configured through a
tabbed dialog box in the same way as PowerCenter transformations. One difference between
PowerCenter and Workbench is that users cannot define workflows that contain serial data quality plans.
Data quality plans can read source data from, and write data to, files and databases. Most delimited, flat, or fixed-width file types are usable, as are DB2, Oracle, and SQL Server databases, and any
database legible via ODBC. Informatica Data Quality (IDQ) stores plan data in its own MySQL data
repository. The following figure illustrates a simple data quality plan.
This data quality plan shows a data source reading from a SQL database, an operational
component analyzing the data, and a data sink component that receives the data available as plan
output. A plan can have any number of operational components.
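The plan structure just described, a source feeding any number of operational components feeding a sink, can be modeled conceptually. This is not IDQ code, only a sketch of the data flow; the components shown are invented:

```python
# Conceptual model of a data quality plan: source -> operational components
# -> sink. Each operational component transforms a record in turn.

def run_plan(source, operations, sink):
    """Pass each source record through the operational components to the sink."""
    for record in source:
        for op in operations:
            record = op(record)
        sink.append(record)

# Two illustrative operational components (cleansing/standardization).
strip_name = lambda r: {**r, "name": r["name"].strip()}
uppercase_name = lambda r: {**r, "name": r["name"].upper()}

out = []
run_plan([{"name": "  ada "}], [strip_name, uppercase_name], out)
print(out)  # [{'name': 'ADA'}]
```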
Plans can be designed to fulfill several data quality requirements, including data analysis, parsing,
cleansing and standardization, enrichment, validation, matching, and consolidation. These are
described in detail in the Best Practice Data Cleansing.
● What types of plans are necessary to meet the needs of the project? The business
should have already signed off on specific data quality goals as part of agreeing on the
overall project objectives, and the Data Quality Audit should have indicated the areas
where the project data requires improvement. For example, the audit may indicate that the
project data contains a high percentage of duplicate records, and therefore matching and
pre-match grouping plans may be necessary.
● What test cycles are appropriate for the plans? Testing and tuning plans in Workbench
is a normal part of plan development. In many cases, testing a plan in Workbench is akin
to validating a mapping in PowerCenter, and need not be part of a formal test scenario.
However, the Data Quality Developer must be able to sign-off on each plan as valid and
executable.
● What source data will be used for the plans? This is related to the testing issue
mentioned above. The final plans that operate on the project data are likely to operate on
the full project dataset, whereas test runs can use smaller, representative samples of that data.
Bear in mind that a plan that is published to a service domain repository will translate the data
source locations set at design time into new locations local to the new computer on which it
resides. See subtask 5.3.4 Design Run-time and Real-time Processes for Operate
Phase Execution and the Informatica Data Quality User Guide for more information.
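Pre-match grouping, mentioned above in connection with duplicate records, limits the expensive match comparison to records that share a candidate group key. The sketch below shows only the idea; the field names and the choice of group key are invented, and real IDQ matching plans implement this quite differently:

```python
from itertools import combinations

# Records to be checked for duplicates; the zip code serves as the group key.
records = [
    {"id": 1, "zip": "10001", "name": "ACME Corp"},
    {"id": 2, "zip": "10001", "name": "Acme Corporation"},
    {"id": 3, "zip": "94105", "name": "Globex"},
]

groups = {}
for rec in records:
    groups.setdefault(rec["zip"], []).append(rec)

# Only records within the same group become candidate match pairs.
pairs = [(a["id"], b["id"])
         for group in groups.values()
         for a, b in combinations(group, 2)]
print(pairs)  # → [(1, 2)]
```

Without grouping, every record would be compared against every other record; with it, the record in zip 94105 is never compared at all.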
Where will the plans be deployed? IDQ can be installed in a client-server configuration, with
multiple Workbench installations acting as clients to the IDQ server. The server employs service
domain architecture, so that a connected Workbench user can run a plan from a local or domain
repository to any Execution Service on the service domain. Likewise, the Data Quality Developer
may publish plans from Workbench to a remote repository on the IDQ service domain for execution
by other Data Quality Developers.
An important consideration here is whether the plans will be deployed as runtime plans. A plan is
considered a runtime plan if it is deployed in a scheduled or batch operation with other plans. In
such cases, the plan is run using a command line instruction. See subtask 5.3.4 Design Run-time
and Real-time Processes for Operate Phase Execution for details.
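A scheduled batch of runtime plans can be driven by a simple controller around the command line. The sketch below shows only the batch-control pattern (stop on the first failure); the commands are stand-ins so the example runs anywhere, not the actual IDQ runtime executable or its arguments:

```python
import subprocess
import sys

def run_plan(command):
    """Run one runtime plan via its command line and report success."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0

# Stand-in commands; a real batch would invoke the IDQ runtime
# executable with a plan file instead.
batch = [
    [sys.executable, "-c", "print('plan 1 ok')"],
    [sys.executable, "-c", "print('plan 2 ok')"],
]

for cmd in batch:
    if not run_plan(cmd):
        break  # later plans may depend on earlier output
else:
    print("batch complete")
```

A scheduler such as cron would invoke this controller overnight, which is exactly the scenario in which a long-running matching plan is typically deployed.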
Bear in mind also that it is possible to add a plan to a mapping if the Data Quality Integration plug-
in has been installed client-side and server-side to PowerCenter. The Integration enables the
following types of interaction:
● It enables you to browse the Data Quality repository and add a data quality plan to the
Data Quality Integration transformation. The functional details of the plan are saved as
XML in the PowerCenter repository.
● It enables the PowerCenter Integration Service to send data quality plan XML to the Data
Quality engine when a session containing a Data Quality Integration transformation is run.
A plan designed for use in a PowerCenter mapping must set its data source and data sink
components to process data in realtime. A subset of the source and sink components can be
configured in this way (six out of twenty-one components).
Note that plans with realtime capabilities are also suitable for use in a request-response
environment, such as a point of data entry environment. These realtime plans can be called by a
third-party application to analyze keyboard data inputs and correct human error.
Best Practices
None
Sample Deliverables
Description
This subtask, along with subtask 5.3.3 Design and Execute Data Enhancement
Processes concerns the design and execution of the data quality plans to prepare the
project data for the Data Integration component of the Build Phase and possibly later
phases.
While subtask 5.3.3 describes the creation and execution of plans through Data Quality
Workbench, this subtask focuses on the steps to deploy plans in a runtime or
scheduled environment. All data quality plans are created in Workbench. However,
there are several aspects to creating plans primarily for runtime which are described in
this subtask. Users who are creating plans should read both subtasks.
Because they can be scheduled and run in a batch, runtime plans present two
opportunities for the Data Quality Developer and the data project as a whole:
● A plan that may take several hours to run — such as a large-scale data
matching plan — can be scheduled to run overnight as a runtime plan.
● A runtime plan can be scheduled to run at regular intervals on the dataset to
analyze dataset quality; such plans can outlive the project in which they are
designed and provide a method for ongoing data monitoring in the enterprise.
Because runtime plans need not be run from a user interface, they are commonly
published or moved to a computer where higher-performance is available. When
publishing or moving a runtime plan, consider the issues discussed in this subtask.
Prerequisites
None
Roles
Considerations
The two main factors to consider when planning to use runtime plans are:
In both cases, the source data and reference files must reside in locations that are
visible to Informatica Data Quality (IDQ). This is pertinent as the runtime plan will
typically be moved from its design-time computer to another computer for execution.
Data source locations are set in the plan at design time. If the plan connects to a
file, the name and path to the file(s) are set in the data source component. If the source
data is stored in a database, the same database connection must be available on the
machine to which the plans are moved. If the plan is run on the machine on which it
was designed, then the data locations can remain static — so long as the data source
details do not change. However, if the plan is moved to another machine, consider the
following questions:
Will the plan be run in an IDQ service domain? A plan moved to another machine
may be run through Data Quality Server (specifically, by a machine hosting a Data
Quality Execution Service.) In this case, the Data Quality engine can run the plan from
the repository, and you can publish the plan to repository from the Workbench client.
When you publish a plan, bear in mind that IDQ recognizes a specific set of folders as
valid source file locations. If a Data Quality Developer defines a plan with a source file
stored in the following location on the Workbench computer:
C:\Myfiles\File.txt
A Data Quality Server on UNIX will look for the file here:
/home/informatica/dataquality/users/user.name/Files/Myfiles
where user.name is the logged-on Data Quality Developer name. (The Data Quality
Developer must be working on a Workbench machine that has a client connection to
the Data Quality Server.)
Path translations work across platforms; that is, a Windows path on the client is mapped to the
corresponding UNIX path on the server.
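The translation in the example above can be expressed as a small helper. The server base path is taken from the example; the function itself is an illustration of the mapping, not part of the product:

```python
from pathlib import PureWindowsPath, PurePosixPath

SERVER_BASE = "/home/informatica/dataquality/users"  # base path from the example

def translate_path(windows_path, user_name):
    """Map a Workbench (Windows) source-file path to the location a UNIX
    Data Quality Server would search, per the example in the text."""
    p = PureWindowsPath(windows_path)
    # Drop the drive letter ("C:\") and re-root under the user's Files folder.
    relative = p.relative_to(p.anchor)
    return str(PurePosixPath(SERVER_BASE, user_name, "Files", *relative.parts))

print(translate_path(r"C:\Myfiles\File.txt", "user.name"))
# → /home/informatica/dataquality/users/user.name/Files/Myfiles/File.txt
```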
Will the plan be deployed to IDQ machines outside the service domain? If so, the
plans must be saved as a .xml file for runtime deployment. (Plans can also be saved
as .pln files for use in another instance of Workbench.) The Data Quality Developer can
set the run command to distinguish between plans stored in the Data Quality repository
and plans saved on the file system.
If a plan uses standard dictionary files (i.e., the files installed with the product), then
IDQ takes care of this automatically, as long as the plan resides on a service domain. If
a plan is published or copied to a network location and uses non-standard reference
files, these files must be copied to a location that is recognizable to the IDQ
installation that will run the deployed plans. For more information on valid dictionary
and reference data files, see the Informatica Data Quality User Guide.
The above settings can have a significant bearing on plan design. When the Data
Quality Developer designs a plan in Workbench, he or she should ensure that the
folders created for file resources can map efficiently to the server folder structure.
When the plan runs on the server side, the Data Quality Server looks for the source file
in the following location:
Note that the folder path Program Files\Data Quality is repeated here: in this case,
good plan design suggests the creation of folders under C:\ that can be recreated
efficiently on the server.
Best Practices
None
Sample Deliverables
None
Description
When the Data Quality Developer has designed and tested the plans to be used later in
the project, he or she must then create an inventory of the plans. This inventory should
be as exhaustive as possible. Data quality plans, once they achieve any size, can be
hard for personnel other than the Data Quality Developer to read. Moreover, other
project personnel and business users are likely to rely on the inventory to identify
where the plan functioned in the project.
Prerequisites
None
Roles
Considerations
For each plan created for use in the project (or for use in the Operate Phase and post-
project scenarios), the inventory document should answer the following questions. The
questions can be divided into two sections: one relating to the plan’s place and function
relative to the project and its objectives, and the other relating to the plan design itself.
The questions below are a subset of those included in the sample deliverable
document Data Quality Plan Documentation and Handover.
Project-related Questions
● What project is the plan part of? Where does the plan fit in the overall project?
● What are the predicted ‘before and after’ states of the plan data?
● Where is the plan located (include machine details and folder location) and
when was it executed?
● Is the plan version-controlled? What are the creation/metadata details for the
plan?
● What Informatica application will run the plan, and on which applications will
the plan run?
● Where is the source located? What are the format and origin of the database
table?
● What business rules are defined? This question can refer to the documented
business rules from subtask 5.3.1 Design Data Quality Technical Rules.
Provide the logical statements, if appropriate.
● What are the outputs for the instance, and how are they named?
● Who receives the plan output data, and what actions are they likely to take?
Best Practices
None
Sample Deliverables
None
Description
In this subtask the Data Quality Developer collates all the documentation produced for
the data quality operations thus far in the project and makes them available to the
Project Manager, Project Sponsor, and Data Integration Developers — in short, to all
personnel who need them.
The Data Quality Developer must also ensure that the data quality plans themselves
are stored in locations known to and usable by the Data Integration Developers.
Prerequisites
None
Roles
Considerations
After the Data Quality Developer verifies that all data quality-related materials produced
in the project are complete, he or she should hand them all over to other interested
parties in the project. The Data Quality Developer should either arrange a handover
meeting with all relevant project roles or ask the Data Steward to arrange such a
meeting. The meeting should cover:
● Progress in treating the quality of the project data (‘before and after’ states of
the data in the key data quality areas)
● Success stories, lessons learned
● Data quality targets: met or missed?
● Recommended next steps for project data
Regarding data quality targets met or missed, the Data Quality Developer must be able
to say whether the data operated on is now in a position to proceed through the rest of
the project. If the Data Quality Developer believes that there are “show stopper” issues
in the data quality, he or she must inform the business managers and provide an
estimate of the work necessary to remedy the data issues. The business managers can
then decide if the data can pass to the next stage of the project or if remedial action is
appropriate.
The materials that the Data Quality Developer must assemble include:
Best Practices
Description
A properly designed data integration process performs better and makes more efficient
use of machine resources than a poorly designed process. This task includes the
necessary steps for developing a comprehensive design plan for the data integration
process, which incorporates high-level standards such as error-handling strategies, and
overall load-processing strategies, as well as specific details and benefits of individual
mappings. Many development delays and oversights are attributable to an incomplete
or incorrect data integration process design, thus underscoring the importance of this
task.
When complete, this task should provide the development team with all of the detailed
information necessary to construct the data integration processes with minimal
interaction with the design team. This goal is somewhat unrealistic, however, because
requirements are likely to change, design elements need further clarification, and some
items are likely to be missed during the design process. Nevertheless, the goal of this
task should be to capture and document as much detail as possible about the data
integration processes prior to development.
Prerequisites
None
Roles
Considerations
The PowerCenter platform provides facilities for developing and executing mappings
for extraction, transformation and load operations. These mappings determine the flow
of data between sources and targets, including the business rules applied to the data
before it reaches a target. Depending on the complexity of the transformations, moving
data can be a simple matter of passing data straight from a data source through an
expression transformation to a target, or may involve a series of detailed
transformations that use complicated expressions to manipulate the data before it
reaches the target. The data may also undergo data quality operations inside or outside
PowerCenter mappings; note also that some business rules may be closely aligned
with data quality issues. (Pre-emptive steps to define business rules and to avoid data
errors may have been performed already as part of task 5.3 Design and Build Data
Quality Process.)
Data Migration projects differ from typical data integration projects in that they should
have an established process and templates for most of the processes that are
developed. This is because development is accelerated and more time is
spent on data quality and driving out incomplete business rules than on traditional
development. For migration projects, the data integration processes can be further
subdivided into the following processes:
Best Practices
Sample Deliverables
None
Description
Designing the high-level load process involves the factors that must be considered
outside of the mapping itself. Determining load windows, availability of sources and
targets, session scheduling, load dependencies and session level error handling are all
examples of issues that developers should deal with in this task. Creating a solid load
process is an important part of developing a sound data integration solution.
This subtask incorporates three steps, all of which involve specific activities,
considerations, and deliverables. The steps are:
1. Identify load requirements. In this step, members of the development team work
together to determine the load window: the window of time available to load an individual
table, or an entire data warehouse or data mart. To begin this step, the team must have a
thorough understanding of the business requirements developed in task 1.1 Define Project.
The team should also consider the differences between the requirements for initial and
subsequent loading; tables may be loaded differently in the initial load than they will be
subsequently. The load document generated in this step describes the rules that should be
applied to the session or mapping in order to complete the loads successfully.
2. Determine dependencies. In this step, the Database Administrator works with the
Data Warehouse Administrator and Data Integration Developer to identify and
document the relationships and dependencies that exist between tables within the
physical database. These relationships affect the way in which a warehouse is loaded.
In addition, the developers should consider other environmental factors, such as
database availability, network availability, and other processes that may be executing
concurrently with the data integration processes.
3. Create initial and ongoing load plan. In this step, the Data Integration Developer
and Business Analyst use information created in the two earlier steps to develop a load
plan document; this lists the estimated run times for the batches and sessions required
to populate the data warehouse and/or data marts.
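A load plan document of this kind can be checked mechanically against the load window. The session names and run-time estimates below are invented for illustration, and the check assumes the sessions run serially:

```python
# Estimated serial run times (in minutes) for the sessions in the load plan.
load_window_minutes = 240
sessions = {
    "s_load_customers": 45,
    "s_load_products": 30,
    "s_load_orders": 120,
}

total = sum(sessions.values())
print(f"estimated {total} of {load_window_minutes} minutes used")
assert total <= load_window_minutes, "load plan exceeds the load window"
```

Re-running such a check whenever run-time estimates are revised gives early warning that mappings need tuning before the window is actually missed.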
Prerequisites
Roles
Considerations
The load window determined in step 1 of this subtask can be used by the Data
Integration Developers as a performance target. Mappings should be tailored to ensure
that their sessions run to successful completion within the constraints set by the load
window requirements document. The Database Administrator, Data Warehouse
Administrator and Technical Architect are responsible for ensuring that their respective
environments are tuned properly to allow for maximum throughput, to assist with
this goal.
Subsequent loads of a table are often performed differently than the initial load; the
initial load may involve the execution of a subset of the database operations used by
subsequent loads. For example, if the primary focus of a mapping is an update of a
dimension, the dimension table will be empty before the first load of the warehouse.
Consequently, the first load will perform a large number of inserts, while subsequent
loads may perform a smaller number of both insert and update operations. The
development team should consider and document such situations and convey the
different load requirements to the developer creating the mappings, and to the
operations personnel configuring the sessions.
Foreign key (i.e., parent / child) relationships are the most common variable that should
be considered in this step. When designing the load plan, the parent table must always
be loaded before the child table, or integrity constraints (if applied) will be broken and
the data load will fail. The Data Integration Developer is responsible for documenting
these dependencies at a mapping level so that loads can be planned to coordinate with
the existence of dependent relationships. The Developer should also consider and
document other variables such as source and target database availability, network up/
down time, and local server processes unrelated to PowerCenter when designing the
load schedule.
TIP
Load parent / child tables in the same mapping to speed development and
reduce the number of sessions that must be managed.
To load tables with parent / child relationships in the same mapping, use the
constraint-based loading option at the session level. Use the target load plan
option in PowerCenter Designer to ensure that the parent table is marked to be
loaded first. The parent table keys will be loaded before an associated child
foreign key is loaded into its table.
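The documented dependencies can drive the load order directly. The sketch below derives a parent-first order from foreign-key relationships using a topological sort; the table names are invented for illustration:

```python
from graphlib import TopologicalSorter

# Documented foreign-key dependencies, mapping each child table to the
# parent tables it references.
deps = {
    "ORDERS":      {"CUSTOMERS", "PRODUCTS"},
    "ORDER_ITEMS": {"ORDERS", "PRODUCTS"},
    "CUSTOMERS":   set(),
    "PRODUCTS":    set(),
}

# static_order() emits every parent before any child that references it,
# which is exactly the constraint a load plan must honor.
load_order = list(TopologicalSorter(deps).static_order())
print(load_order)
```

The same structure also exposes cycles in the documented relationships (the sort raises an error), which is worth catching at design time rather than during a failed load.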
The load plans should be designed around the known availability of both source and
target databases; it is particularly important to consider the availability of source
systems, as these systems are typically beyond the operational control of the
development team. Similarly, if sources or targets are located across a network, the
development team should consult with the Network Administrator to discuss network
capacity and availability in order to avoid poorly performing batches and sessions.
Finally, although unrelated local processes executing on the server are not likely to
cause a session to fail, they can severely decrease performance by keeping available
processors and memory away from the PowerCenter server engine, thereby slowing
throughput and possibly causing a load window to be missed.
Best Practices
None
Sample Deliverables
None
Description
After the high-level load process is outlined and source files and tables are identified, a
decision needs to be made regarding how the load process will account for data errors.
The identification of a data error within a load process is driven by the standards of
acceptable data quality. The identification of a process error is driven by the stability
of the process itself. It is unreasonable to expect any source system to contain perfect
data. It is also unreasonable to expect any automated load process to execute correctly
100 percent of the time. Errors can be triggered by any number of events or scenarios,
including session failure, platform constraints, bad data, time constraints, mismatched
control totals, dependencies, or server availability.
The error handling development effort should include all the work that needs to be
performed to correct errors in a reliable, timely, and automated manner.
Several types of tasks within the Workflow Manager are designed to assist in error
handling. The following is a subset of these tasks:
● Command Task allows the user to specify one or more shell commands to
run during the workflow.
● Control Task allows the user to stop, abort, or fail the top-level workflow or
the parent workflow based on an input-link condition.
● Decision Task allows the user to enter a condition that determines the
execution of the workflow. This task determines how the PowerCenter
Integration Service executes a workflow.
● Event Task specifies the sequence of task execution in a workflow. The event
is triggered based on the completion of the sequence of tasks.
Data integration developers should find an acceptable balance between the end users'
needs for accurate and complete information and the cost of additional time and
resources required to repair errors. The Data Integration Developer should consult
closely with the Data Quality Developer in making these determinations, and include in
the discussion the outputs from tasks 2.8 Perform Data Quality Audit and 5.3 Design
and Build Data Quality Process.
Prerequisites
None
Roles
Considerations
● Session Failure. If a PowerCenter session fails during the load process, the
failure of the session itself needs to be recognized as an error in the load
process. The error handling strategy commonly includes a mechanism for
notifying the process owner that the session failed, whether it is in the form of
a message to a pager from operations or a post-session email from a
PowerCenter Integration Service. There are several approaches to handling
session failures within the Workflow Manager. These include custom-written
recovery routines with pre- and post- session scripts, workflow variables such
as the pre-defined task-specific variables or user-defined variables, and event
tasks (e.g., the event-raise task and the event-wait task) can be used to start
specific tasks in reaction to a failed task.
● Rejected Rows. Two common scenarios cause rows to be rejected without failing the session:
❍ The database server will reject a row if the primary key field(s) of that row
already exist in the target.
❍ A PowerCenter Integration Service will reject a row if a date/time field is
sent to a character field without explicitly converting the data.
In both of these scenarios, the data will be rejected regardless of whether or not
it was accounted for in the code. Although the data is rejected without
developer intervention, accounting for it remains a challenge. In the first
scenario, the data will end up in a reject file on the PowerCenter server. In the
second scenario, the row of data is simply skipped by the Data Transformation
Manager (DTM) and is not written to the target or to any reject file. Both
scenarios require post-load reconciliation of the rejected data. An error handling
strategy should account for data that is rejected in this manner; either by
parsing reject files or balancing control totals.
● "Bad" Data. Bad data can be defined as data that enters the load process
from one or more source systems, but is prevented from entering the target
systems, which are typically staging areas, end-user environments, or
reporting environments. This data can be rejected by the load process itself or
designated as "bad" by the mapping logic created by developers.
To a degree, the PowerCenter session logs and repository tables store this type
of information. Depending on the level of detail desired to capture control totals,
some organizations run post-session reports against the repository tables and
parse the log files. Others, wishing to capture more in-depth information about
their loads, incorporate control totals in their mapping logic, spinning off
checksums, row counts, and other calculations during the load process. These totals
can then be balanced against source-system counts after the load completes.
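Control-total balancing can be sketched minimally: rows read from the source should equal rows written to the target plus rows rejected. The counts and the reject-file layout below are invented; real PowerCenter reject files use their own format:

```python
# Control totals for one load (all figures illustrative).
source_row_count = 1000
target_row_count = 993
reject_lines = [
    "1001|duplicate primary key",
    "1002|type conversion failed",
] + ["%d|missing required field" % n for n in range(1003, 1008)]

# Tally the rejects by reason, as an exception report would.
rejects_by_reason = {}
for line in reject_lines:
    key, reason = line.split("|")
    rejects_by_reason.setdefault(reason, []).append(key)

rejected = sum(len(keys) for keys in rejects_by_reason.values())
balanced = source_row_count == target_row_count + rejected
print(rejected, balanced)  # → 7 True
```

If `balanced` is false, some rows were skipped without appearing in any reject file, which is precisely the second rejection scenario described above.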
The accuracy of the data, before any logic is applied to it, is dependent on the source
systems from which it is extracted. It is important, therefore, for developers to identify
the source systems and thoroughly examine the data in them. Task 2.8 Perform Data
Quality Audit is specifically designed to establish such knowledge about project data
quality, and task 5.3 Design and Build Data Quality Process is designed specifically to
eliminate data quality problems as far as possible before data enters the Build Phase
of the project. In the absence of dedicated data quality steps such as these, one
approach is to estimate, along with source owners and data stewards, how much of the
data is still bad (vs. good) on a column-by-column basis, and then to determine which
data can be fixed in either the source or the mappings, and which does not need to be
fixed before it enters the target. However, the dedicated data quality approach is
preferable, as it (1) provides metrics to business and project personnel and (2) provides
an effective means of addressing data quality problems.
Data Integrity deals with the internal relationships of the data in the system and how
those relationships are maintained (i.e., data in one table must match corresponding
data in another table). When relationships cannot be maintained because of incorrect
information entered from the source systems, the load process needs to determine if
processing can continue or if the data should be rejected.
Including lookups in a mapping is a good way of checking for data integrity. Lookup
tables are used to match and validate data based upon key fields. The error handling
process should account for the data that does not pass validation. Ideally, data integrity
issues will not arise since the data has already been processed in the steps described
in task 4.6.
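The lookup-based integrity check can be sketched as follows. The key set and rows are illustrative, and in PowerCenter this logic would live in a Lookup transformation and a Router, not in hand-written code:

```python
# Reference keys that would normally come from a lookup table.
valid_customers = {"1001", "1002", "1003"}
incoming = [("1001", "order-A"), ("9999", "order-B")]

# Rows that fail validation are routed to the error handling process
# rather than the target.
loaded, rejected = [], []
for cust_id, order in incoming:
    (loaded if cust_id in valid_customers else rejected).append((cust_id, order))

print(loaded)    # → [('1001', 'order-A')]
print(rejected)  # → [('9999', 'order-B')]
```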
Since it is unrealistic to expect any source system to contain data that is 100 percent
accurate, it is essential to assign the responsibilities of correcting data errors. Taking
ownership of these responsibilities throughout the project is vital to correcting errors
during the load process. Specifically, individuals should be held accountable for:
Part of the load process validates that the data conforms to known rules from the
business. When these rules are not met by the source system data, the process should
handle these exceptions in an appropriate manner. End users should either accept the
consequences of permitting invalid data to enter the target system or they should
choose to reject the invalid data. Both options involve complex issues for the business
organization.
The individuals responsible for providing business information to the developers must
be knowledgeable and experienced in both the internal operations of the organization
and the common practices of the relevant industry. It is important to understand the
data and functionality of the source systems as well as the goals of the target
environment. If developers are not familiar with the business practices of the
organization, it is practically impossible to make valid judgments about which data
should be allowed in the target system and which data should be flagged for error
handling.
The primary purpose for developing an error handling strategy is to prevent data that
inaccurately portrays the state of the business from entering the target system.
Providers of business information play a key role in distinguishing good data from bad.
The individuals responsible for maintaining the physical data structures play an equally
crucial role in designing the error handling strategy. These individuals should be
thoroughly familiar with the format, layout, and structure of the data. After
understanding the business requirements, developers must gather data content
information from the individuals that have first-hand knowledge of how the data is laid
out in the source systems and how it is to be presented in the target systems. This
knowledge helps to determine which data should be allowed in the target system based
on the physical nature of the data as opposed to the business purpose of the data.
Data stewards, or their equivalent, are responsible for the integrity of the data in and
around the load process. They are also responsible for maintaining translation tables,
codes, and consistent descriptions across source systems. Their presence is not
always required, depending on the scope of the project, but if a data steward is
designated, he or she will be relied upon to provide developers with insight into such
things as valid values, standard codes, and accurate descriptions.
For Data Migration projects, it is important to develop a standard method to track data
exceptions. Normally this tracking data is stored in a relational database with a
corresponding set of exception reports. By developing this important standardized
strategy, all data cleansing and data correction development will be expedited due to
having a predefined method of determining what exceptions have been raised and
which data caused the exception.
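An exception-tracking store of the kind described can be sketched with a relational table and a grouped report. The schema, table, and column names below are assumptions for illustration only:

```python
import sqlite3

# Illustrative exception-tracking table for a migration project.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE data_exceptions (
    source_table TEXT, source_key TEXT, rule_violated TEXT, raised_on TEXT)""")
con.execute("INSERT INTO data_exceptions VALUES (?, ?, ?, date('now'))",
            ("CUSTOMER", "1002", "missing postal code"))

# A simple exception report: violation counts grouped by rule.
for rule, count in con.execute(
        "SELECT rule_violated, COUNT(*) FROM data_exceptions "
        "GROUP BY rule_violated"):
    print(rule, count)  # → missing postal code 1
```

Keeping both the raised rule and the offending key makes it possible to trace each exception back to the source row that caused it, which is the point of the standardized strategy described above.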
Best Practices
Sample Deliverables
None
Description
The process of updating a data warehouse with new data is sometimes described as
"conducting a fire drill". This is because it often involves performing data updates within
a tight timeframe, taking all or part of the data warehouse off-line while new data is
loaded. While the update process is usually very predictable, it is possible for
disruptions to occur, stopping the data load in mid-stream.
To minimize the amount of time required for data updates and further ensure the quality
of data loaded into the warehouse, the development team must anticipate and plan for
potential disruptions to the loading process. The team must design the data integration
platform so that the processes for loading data into the warehouse can be restarted
efficiently in the event that they are stopped or disrupted.
Prerequisites
None
Roles
Considerations
Providing backup schemas for sources and staging areas for targets is one step toward
improving the efficiency with which a stopped or failed data loading process can be
restarted. Source data should not be changed prior to restarting a failed process, as
this may cause the PowerCenter server to return missing or repeat values. A backup
schema preserves the source data in its pre-load state so that a failed process can be
restarted against consistent data.
If flat file sources are being used, all sources should be date-stamped and stored until
the loading processes that use them have successfully completed. A script
can be incorporated into the data update process to delete or move flat file sources
only upon successful completion of the update.
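The move-on-success script mentioned above might look like this in outline. The directory layout, file pattern, and success flag are all illustrative:

```python
import shutil
import tempfile
from pathlib import Path

def archive_sources(source_dir, archive_dir, load_succeeded):
    """Move flat-file sources to an archive, but only after the loads
    that read them have completed successfully."""
    if not load_succeeded:
        return 0  # leave files in place so a failed load can be restarted
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in sorted(Path(source_dir).glob("*.txt")):
        shutil.move(str(f), str(archive_dir))
        moved += 1
    return moved

# Demo with a throwaway directory and a date-stamped source file.
src = Path(tempfile.mkdtemp())
(src / "orders_20240101.txt").write_text("sample row")
n = archive_sources(src, src / "archive", load_succeeded=True)
print(n)  # → 1
```

Because nothing is moved on failure, the date-stamped files remain available for a clean restart of the update process.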
TIP
You can configure the links between sessions to only trigger downstream
sessions upon success status.
Also, PowerCenter versions 6 and above have the ability to configure a Workflow
to Suspend on Error. This places the workflow in a state of suspension, so that
the environmental problem can be assessed and fixed, while the workflow can be
resumed from the point of suspension.
Follow these steps to identify and create points of recovery within a workflow:
On the session property screen, configure the session to stop if errors occur in pre-
session scripts. If the session stops, review and revise scripts as necessary.
Determine whether or not a session really needs to be run in bulk mode. Successful
recovery of a bulk-load session is not guaranteed, because bulk loading bypasses the
database log. While bulk mode can improve session performance, it may be easier to
recover a large normal-load session than to truncate the targets and re-run a
bulk-load session.
Always be sure to examine log files when a session stops, and research and resolve
potential reasons for the stop.
Data Migration Projects often have a need to migrate significant volumes of data. Due
to this fact, re-start processing should be considered in the Architect Phase and
throughout the Design Phase and Build Phase. In many cases a full refresh is the
best course of action. However, if large amounts of data need to be loaded, then the
final load processes should include a re-start processing design, which should be
prototyped during the Architect Phase. This will limit the amount of time lost if any
load process fails partway through.
Best Practices
Sample Deliverables
None
Description
The next step in designing the data integration processes is breaking the development
work into an inventory of components. These components then become the work tasks
that are divided among developers and subsequently unit tested. Each of these
components would help further refine the project plan by adding the next layer of detail
for the tasks related to the development of the solution.
Prerequisites
None
Roles
Considerations
The smallest divisions of assignable work in PowerCenter are typically mappings and
reusable objects. The Inventory of Reusable Objects and Inventory of Mappings
created during this subtask are valuable high-level lists of development objects that
need to be created for the project.
Naturally, the lists will not be completely accurate at this point; they will be added to
and subtracted from over the course of the project and should be continually updated
as the project moves forward. Despite the ongoing changes, however, these lists are
valuable tools, particularly from the perspective of the lead developer and project
manager, because the objects on these lists can be assigned to individual developers
and their progress tracked over the course of the project.
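An inventory of this kind can be kept as simple structured data so that assignments and progress can be tracked; the field names and status values below are illustrative, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class InventoryItem:
    """One assignable development object (a mapping or reusable object)."""
    name: str
    kind: str            # e.g. "mapping" or "reusable object"
    developer: str = ""  # assigned developer, if any
    status: str = "not started"

def progress(inventory: list) -> float:
    """Percentage of inventory items marked complete."""
    if not inventory:
        return 0.0
    done = sum(1 for item in inventory if item.status == "complete")
    return 100.0 * done / len(inventory)
```

Even a spreadsheet serves the same purpose; the point is that each object is individually assignable and its completion individually measurable.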
● Primary Key extract (full extract of the primary keys used in the delete
mapping)
It is important to break down the work into this level of detail because from the list
above, you can see how a single source to target matrix may generate 5 separate
mappings that could each be developed by different developers. From a project
planning perspective, it is then useful to track each of these 5 mappings separately for
status and completion.
Your mapping inventory should also include the special-purpose mappings that are
involved in the end-to-end process but are not specifically defined by the business
requirements and source-to-target matrices. These include audit mappings,
aggregate mappings, mapping-generation mappings, templates, and other objects that
will need to be developed during the build phase.
For reusable objects, it is important to keep a holistic view of the project in mind when
determining which objects are reusable and which are custom built. An object that
appears shareable across every mapping that uses it may in fact need
different versions depending on its purpose.
Having a list of the common objects being developed across the project allows
individual developers to plan their mapping-level development more effectively. Knowing
that a particular mapping will use four reusable objects, a developer can focus on the
work unique to that mapping rather than duplicating the functionality of those
objects. This is another area where Metadata Manager can be very useful for
developers who want to perform where-used analysis on objects. As a result of the
processes and tools implemented during the project, developers can achieve a high
degree of reuse.
Best Practices
Sample Deliverables
Mapping Inventory
Description
After the Inventory of Mappings and Inventory of Reusable Objects is created, the next
step is to provide detailed design for each object on each list. The detailed design
should incorporate sufficient detail to enable developers to complete the task of
developing and unit testing the reusable objects and mappings. These details include
specific physical information, down to the table, field, and datatype level, as well as
error processing and any other information requirements identified.
Prerequisites
None
Roles
Considerations
A detailed design must be completed for each of the items identified in the Inventory of
Mappings and Inventory of Reusable Objects. Developers use the documents created in
subtask 5.4.4 Develop Inventory of Mappings & Reusable Objects to construct the
mappings and reusable objects, as well as any other required processes.
Reusable Objects
Three key items should be documented for the design of reusable objects: inputs,
outputs, and the transformations or expressions in between.
Developers who have a clear understanding of which reusable objects are available are
likely to create better mappings that are easier to maintain. For the project, consider
maintaining a shared, current list of the available reusable objects.
Mappings
After the high-level flow has been established, it is important to document pre-mapping
logic. Special joins for the source, filters, or conditional logic should be made clear
upfront. The data being extracted from the source system dictates how the developer
implements the mapping. Next, document the details at the field level, listing each of
the target fields and the source field(s) that are used to create the target field.
Document any expression that may take place in order to generate the target field (e.g.,
a sum of a field, a multiplication of two fields, a comparison of two fields, etc.).
Whatever the rules, be sure to document them and remember to keep it at a physical
level. The designer may have to do some investigation at this point for business rules
as well. For example, the business rules may say, "For active customers, calculate a
late fee rate". The designer of the mapping must determine that, on a physical level,
that translates to 'for customers with an ACTIVE_FLAG of "1", multiply the
DAYS_LATE field by the LATE_DAY_RATE field'.
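That translation from business rule to physical logic can be expressed directly; the sketch below uses the field names from the example above, with the record represented as a simple dictionary for illustration:

```python
def late_fee(row: dict) -> float:
    """Physical form of the rule 'for active customers, calculate a
    late fee': customers with ACTIVE_FLAG of "1" are charged
    DAYS_LATE * LATE_DAY_RATE; all other customers get 0.0."""
    if row["ACTIVE_FLAG"] == "1":
        return row["DAYS_LATE"] * row["LATE_DAY_RATE"]
    return 0.0
```

Recording the rule at this physical level in the detailed design removes ambiguity for the developer implementing the transformation expression.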
Document any other information about the mapping that is likely to be helpful in
developing the mapping. Helpful information may, for example, include source and
target database connection information, lookups and how to match data in the lookup
tables, data cleansing needed at a field level, potential data issues at a field level, any
known issues with particular fields, pre or post mapping processing requirements, and
any information about specific error handling for the mapping.
The mapping and reusable object detailed designs are a crucial input for building the
data integration processes, and can also be useful for system and unit testing. The
specific details used to build an object are useful for developing the expected results to
be used in system testing.
For Data Migrations, often the mappings are very similar for some of the stages; such
as populating the reference data structures, acquiring data from the source, loading the
target and auditing the loading process. In these cases, it is likely that a detailed
‘template’ is documented for these mapping types. For mapping-specific alterations,
such as converting data from source to target format, individual mapping designs
may be created. This strategy reduces the sheer volume of documentation required for
the project, while still providing sufficient detail to develop the solution.
Best Practices
None
Sample Deliverables
None
Description
With the analysis and design steps complete, the next priority is to put everything
together and build the data integration processes, including the mappings and reusable
objects.
Reusable objects can be very useful in the mapping building process. By this point,
most reusable objects should have been identified, although the need for additional
objects may become apparent during the development work. Commonly-used objects
should be put into a shared folder to allow for code reuse via shortcuts. The mapping
building process also requires adherence to naming standards, which should be
defined prior to beginning this step. Developing, and consistently using, naming
standards helps to ensure clarity and readability for the original developer and
reviewers, as well as for the maintenance team that inherits the mappings after
development is complete.
In addition to building the mappings, this subtask involves updating the design
documents to reflect any changes or additions found necessary to the original design.
Accurate, thorough documentation helps to ensure good knowledge transfer and is
critical to project success.
Once the mapping is completed, a session must be created for it in Workflow
Manager. A unit testing session can be created initially to verify that the mapping logic is
executing as designed. To identify and troubleshoot problems in more detail, the debug
feature may be leveraged; this feature is useful for looking at the data as it flows
through each transformation. Once the initial session testing proves satisfactory, then
pre- and post-session processes and session parameters should be incorporated and
tested (if needed), so that the session and all of its processes are ready for unit testing.
Prerequisites
None
Roles
Considerations
Although documentation for building the mapping already exists in the design
document, it is extremely important to document the sources, targets, and
transformations in the mapping at this point to help end users understand the flow of
the mapping and ensure effective knowledge transfer.
Importing the sources and targets is the first step in building a mapping. Although the
targets and sources are determined during the Design Phase the keys, fields, and
definitions should be verified in this subtask to ensure that they correspond with the
design documents.
TIP
When data modeling or database design tools (e.g., CA ERwin, Oracle
Designer/2000, or Sybase PowerDesigner) are used in the design phase,
Informatica PowerPlugs can be helpful for extracting the data structure
definitions of the sources and targets. Metadata Exchange for Data Models
extracts table, column, index, and relationship definitions, as well as descriptions,
from a data model. This can save significant time because the PowerPlugs also
import documentation and help users to understand the source and target
structures in the mapping.
For more information about Metadata Exchange for Data Models PowerPlugs,
refer to Informatica's web site (www.informatica.com) or the Metadata Exchange
for Data Models manuals.
The design documents may specify that data can be obtained from numerous sources,
including DB/2, Informix, SQL Server, Oracle, Sybase, ASCII/EBCDIC flat files
(including OCCURS and REDEFINES), Enterprise Resource Planning (ERP)
applications, and mainframes via PowerExchange data access products. The design
documents may also define the use of target schema and specify numerous ways of
creating the target schema. Specifically, target schema may be created:
● From scratch.
Reusable objects are useful when standardized logic is going to be used in multiple
mappings. One type of reusable object is the mapplet. Mapplets represent a
set of transformations and are constructed in the Mapplet Designer, much like creating
a "normal" mapping. When mapplets are used in a mapping, they encapsulate logic
into a single transformation object, making the flow of a mapping easier to understand.
However, because the mapplets hide their underlying logic, it is particularly important to
carefully document their purpose and function.
Other types of reusable objects, such as reusable transformations, can also be very
useful in mapping. When reusable transformations are used with mapplets, they
facilitate overall mapping maintenance. Reusable transformations can be built in
either of two ways: designed from scratch in the Transformation Developer, or
promoted to reusable from an existing transformation instance within a mapping.
When all the transformations are complete, everything must be linked together (as
specified in the design documentation) and arrangements made to begin unit testing.
Development FAQs
Sample Deliverables
None
Description
The success of the solution rests largely on the integrity of the data available for
analysis. If the data proves to be flawed, the solution initiative is in danger of failure.
Complete and thorough unit testing is, therefore, essential to the success of this type of
project. Within the presentation layer, there is always a risk of performing less than
adequate unit testing. This is due primarily to the iterative nature of development and
the ease with which a prototype can be deployed. Experienced developers are,
however, quick to point out that data integration solutions and the presentation layers
should be subject to more rigorous testing than transactional systems. To underscore
this point, consider which poses a greater threat to an organization: sending a supplier
an erroneous purchase order or providing a corporate vice president with flawed
information about that supplier's ranking relative to other strategic suppliers?
Prerequisites
None
Roles
Considerations
Successful unit testing examines any inconsistencies in the transformation logic and
ensures correct implementation of the error handling strategy.
The first step in unit testing is to build a test plan (see Unit Test Plan). The test plan
should briefly discuss the coding inherent in each transformation of a mapping and
elaborate on the tests that are to be conducted. These tests should be based upon the
business rules defined in the design specifications rather than on the specific code
being tested. If unit tests are based only upon the code logic, they run the risk of
merely confirming the developer's interpretation rather than verifying the intended
business behavior.
If the transformation types include data quality transformations (that is, transformations
designed on the Data Quality Integration transformation that links to Informatica Data
Quality (IDQ) software) then the data quality processes (or plans) defined in IDQ are
also candidates for unit testing. Good practice holds that all data quality plans that are
going to be used on project data — whether as part of a PowerCenter transformation or
a discrete process — should be tested before formal use on such data. Consider
establishing a discrete unit test stage for data quality plans.
Test data should be available from the initial loads of the system. Depending on
volumes, a sample of the initial load may be appropriate for development and unit
testing purposes. It is important to use actual data in testing since test data does not
necessarily cover all of the anomalies that are possible with true data, and creating test
data can be very time consuming. However, depending upon the quality of the actual
data used, it may be necessary to create test data in order to test any exception, error,
and/or value threshold logic that may not be triggered by actual data.
While it is possible to analyze test data without tools, there are many good tools
available for creating and manipulating test data. Some are useful for editing data in a
flat file, and most offer some improvement in productivity.
A detailed test script is essential for unit testing; the test scripts indicate the
transformation logic being tested by each test record and should contain an expected
result for each record.
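A test script of this kind amounts to a per-record comparison of expected and actual results; a minimal harness sketch, where the transformation and the record layout are hypothetical stand-ins for whatever logic is under test:

```python
def run_test_script(transform, test_cases):
    """Apply a transformation to each test record and compare the
    actual output against the expected result recorded in the test
    script.  Returns (record, expected, actual) tuples for failures;
    an empty list means every record passed."""
    failures = []
    for record, expected in test_cases:
        actual = transform(record)
        if actual != expected:
            failures.append((record, expected, actual))
    return failures
```

For example, a script for a name-standardization rule would pair each input record with the exact standardized value the rule is supposed to produce.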
TIP
Session log tracing can be set at the transformation level within a mapping, in a
session's "Mapping" tab, or in a session's "Config Object" tab. For testing, it is
generally good practice to override logging in a session's "Mapping" tab
transformation properties. For instance, if you are testing the logic performed in
a Lookup transformation, create a test session and only activate verbose data
logging on the appropriate Lookup. This focuses the log file on the unit test at
hand.
If you change the tracing level in the mapping itself, you will have to go back
and modify the mapping after the testing has been completed. If you override
tracing in a session's "Config Object" tab properties, this will affect all
transformation objects in the mapping and potentially create a significantly
larger session log to parse.
Running the mapping in the Debugger also allows you to view the target data
without the session writing data to the target tables. You can then document
the actual results as compared to the expected results outlined in the test
script. The ability to change the data running through the mapping while in
debug mode is an extremely valuable tool because it allows you to test all
conditions and logic as you step through the mapping, thereby ensuring
appropriate results.
The first session should load test data into empty targets. After checking for errors from
the initial load, a second run of test data should occur if the business requirements
demand periodic updates to the target database.
A thorough unit test should uncover any transformation flaws and document the
adjustments needed to meet the data integration solution's business requirements.
Best Practices
None
Sample Deliverables
Defect Log
Defect Report
Description
Peer review is a powerful technique for uncovering and resolving issues that otherwise
would be discovered much later in the development process (i.e., during testing) when
the cost of fixing is likely to be much higher. The main types of objects that can be
subject to formal peer review are documents, code, and configurations.
Prerequisites
None
Roles
Considerations
The peer review process encompasses several steps, which vary depending on the
object (i.e., document, code, etc.) being reviewed. In general, the process should
include these steps:
There are two main factors to consider when rating the ‘impact’ of defects discovered
during peer review: the effect on functionality and the saving in rework time. If a defect
would result in a significant functional deficiency, or a large amount of rework later in
the project, it should be rated as ‘high impact’.
Metrics can be used to help in tracking the value of the review meetings. The ‘cost’ of
formal peer reviews is the man-time spent on meeting preparation, the review meeting
itself, and the subsequent re-work. This can be recorded in man-days.
The ‘benefit’ of such reviews is the potential time saved. Although this can be estimated
when the defect is originally noted, such estimates are unlikely to be reliable. It may be
better to assign a notional ‘benefit’ – say two hours for a low-impact defect, one day for
a medium-impact defect and two days for a high-impact defect. Adding up the benefit in
man-days allows a direct comparison with ‘cost’. If no net benefit is obtained from the
peer reviews, the Quality Assurance Manager should investigate a less intensive
review regime, which can be implemented across the project or, more likely, in specific
areas of the project.
Best Practices
None
Description
This task bridges the gap between unit testing and system testing. After unit testing is
complete, the sessions for each mapping must be ordered so as to properly execute
the complete data migration from source to target. This is done by creating workflows
that contain sessions and other tasks in the proper execution order.
By incorporating link conditions and/or decision tasks into workflows, the execution
order of each session or task is very flexible. Additionally, event raises and event waits
can be incorporated to further develop dependencies. The tasks within the workflows
should be organized so as to achieve an optimum load in terms of data quality and
efficiency.
When this task is completed, the development team should have a completely
organized loading model that it can use to perform a system test. The objective here is
to eliminate any possible errors in the system test that relate directly to the load
process. The final product of this task - the completed workflow(s) - is not static,
however. Since the volume of data used in production may differ significantly from the
volume used for testing, it may be necessary to move sessions and workflows around
to improve performance.
Prerequisites
None
Roles
Considerations
At a minimum, this task requires a single instance of the target database(s). Also, while
data may not be required for initial testing, the structure of the tables must be identical
to those in the operational database(s). Additionally, consider putting all mappings to
be tested in a single folder. This will allow them to be executed in the same workflows
and reordered to assess optimum performance.
Best Practices
None
Sample Deliverables
None
Description
Proper organization of the load process is essential for achieving two primary load
goals:
Prerequisites
None
Roles
Considerations
If the volume of data is low enough for the available hardware to handle, you may
consider volume analysis optional, basing the load process solely on the
dependency analysis. Also, if the hardware is not adequate to run the sessions
concurrently, you will need to prioritize them. The highest priority within a group is
usually assigned to sessions with the most child dependencies.
Another possible component to add into the load process is sending e-mail. Three e-
mail options are available for notification during the load process:
When the integrated load process is complete, it should be subject to unit test. This is
true even if all of the individual components have already been subjected to unit test.
The larger volumes associated with an actual operational run would be likely to hamper
validation of the overall process. With unit test data, the staff members who perform
unit testing should be able to easily identify major errors when the system is placed in
operation.
The Load Dependency Analysis should list all sessions, in order of their dependency,
together with any other events (Informatica or other), on which the sessions depend.
The analysis must clearly document the dependency relationships between each
session and/or event, the algorithm or logic needed to test the dependency conditions
during execution, and the impact of any possible dependency test results (e.g., do not
run a session, fail a session, fail a parent or worklet, etc.).
The load dependency documentation would, for example, follow this format:
● The first set of sessions or events listed in the analysis (Group A)
would be those with no dependencies.
● The second set (Group B) would be those with a dependency on one or
more sessions or events in the first set (Group A). Against each
session in this list, the relevant dependency information would be
included.
● The third set (Group C) would be those with a dependency on one or
more sessions or events in the second set (Group B). Against each
session in this list, similar dependency information as above would be
included.
● The listing would continue in the document until all sessions are
included.
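The grouping described above is effectively a topological ordering by dependency depth. A sketch, where each session name maps to the set of sessions it waits for (every session must appear as a key; session names are illustrative):

```python
def dependency_groups(depends_on: dict) -> list:
    """Partition sessions into groups A, B, C, ... such that each
    group depends only on sessions in earlier groups.  depends_on
    maps a session name to the set of sessions it waits for."""
    groups, placed = [], set()
    remaining = dict(depends_on)
    while remaining:
        # sessions whose dependencies have all been placed already
        ready = sorted(s for s, deps in remaining.items()
                       if set(deps) <= placed)
        if not ready:
            raise ValueError("circular dependency among sessions")
        groups.append(ready)
        placed.update(ready)
        for s in ready:
            del remaining[s]
    return groups
```

Each resulting group corresponds to one set in the dependency document, and sessions within a group are candidates to run concurrently.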
The Load Volume Analysis should list all the sources, source row counts, and row
widths expected for each session. This should include the sources for all lookup
transformations, in addition to the extract sources, as the amount of data read to
initialize a lookup cache can materially affect the initialization and total execution time
of a session. The Load Volume Analysis should also list sessions in descending order
of processing time, estimated based on these factors (i.e., the number of rows extracted,
the number of rows loaded, and the number and volume of lookups in the mappings).
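An analysis along those lines can be tabulated and sorted as follows; the relative weighting of extract, load, and lookup-cache rows is a placeholder assumption, to be replaced with measured figures from test runs:

```python
def order_by_estimated_time(sessions: list) -> list:
    """Sort sessions in descending order of a rough processing-time
    estimate.  Each session is a dict carrying row counts for the
    extract, the load, and any lookup caches; the weight on lookup
    rows is an illustrative assumption, not a measured cost."""
    def estimate(s):
        return (s["rows_extracted"]
                + s["rows_loaded"]
                + 2 * sum(s.get("lookup_rows", [])))  # cache-build cost
    return sorted(sessions, key=estimate, reverse=True)
```

Ordering sessions this way lets the longest-running work start first, which tends to shorten the overall load window when sessions run concurrently.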
For Data Migration projects, the final load processes are the set of load scripts,
scheduling objects, or master workflows that will be executed for the data migration. It
is important that developers work with a load plan in mind so that these load
procedures can be developed quickly, as they are often built late in the project
development cycle, when time is in short supply.
Best Practices
None
Sample Deliverables
Description
The task of integration testing is to check that components in a software system (or,
one step up, software applications at the company level) interact without error. A
number of strategies can be employed for integration testing; two examples are as
follows:
Prerequisites
None
Roles
Although this is a minor test from an ETL perspective, it is crucial for the ultimate goal
of a successful process implementation. Primary proofing of the testing method
involves matching the number of rows loaded to each individual table.
It is a good practice to keep the Load Dependency Analysis and Load Volume Analysis
in mind during this testing, particularly if the process identifies a problem in the load
order. Any deviations from those analyses are likely to cause errors in the loaded data.
The final product of this subtask, the Final Load Process document, is the layout of
workflows, worklets, and session tasks that will achieve an optimal load process. The
Final Load Process document orders workflows, worklets, and session tasks in such a
way as to maintain the required dependencies while minimizing the overall load
window. This document will differ from the one generated in the previous subtask, 5.5.1
Build Load Process, in that it represents the current actual result. However, this layout
is still dynamic and may change as a result of ongoing performance testing.
Tip
The Integration Test Percentage (ITP) is a useful tool that indicates the percentage
of the project's transformation objects that has been unit and integration tested. The
formula for ITP is:
ITP = (number of transformation objects tested / total number of transformation
objects) x 100
As an example, this table shows the number of transformation objects for mappings.
If mapping M_ABC is the only one unit tested, the ITP is:
If mappings M_GHI and M_JKL are unit tested, the ITP is:
The ITP metric provides a precise measurement as to how much unit and integration
testing has been done. On actual projects, the definition of a unit can vary. A unit
may be defined as an individual function, a group of functions, or an entire Computer
Software Unit (which can be several thousand lines of code). The ITP metric is not
based on the definition of a unit. Instead, the ITP metric is based on the actual
number of transformation objects tested with respect to the total number of
transformation objects defined in the project.
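The calculation can be expressed directly; the mapping names and transformation-object counts below are illustrative, not taken from any actual project:

```python
def integration_test_percentage(mappings: dict, tested: set) -> float:
    """ITP: transformation objects in tested mappings divided by the
    total number of transformation objects, times 100.  mappings maps
    each mapping name to its transformation-object count; tested is
    the set of mapping names that have been unit/integration tested."""
    total = sum(mappings.values())
    done = sum(count for name, count in mappings.items() if name in tested)
    return 100.0 * done / total if total else 0.0
```

Because the metric counts transformation objects rather than "units", it stays comparable across projects that define a unit differently.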
Best Practices
None
Sample Deliverables
None
Description
The objective of this task is to develop the end-user analysis, using the results from 4.4
Design Presentation Layer. The result of this task should be a final presentation layer
application that satisfies the needs of the organization. While this task may run in
parallel with the building of the data integration processes, data is needed to validate
the results of any presentation layer queries. This task cannot, therefore, be completed
before tasks 5.4 Design and Develop Data Integration Processes and 5.5 Populate and
Validate Database. The Build Presentation Layer task consists of two subtasks which
may need to be performed iteratively several times:
1. Building the presentation layer.
2. Presenting the presentation layer to business analysts to elicit and
incorporate their feedback.
Throughout the Build Phase, the developers should refer to the deliverables produced
during the Design Phase. These deliverables include a working prototype, end user
feedback, metadata design framework and, most importantly, the Presentation
Layer Design document, which is the final result of the Design Phase and incorporates
all efforts completed during that phase. This document provides the necessary
specifications for building the front-end application for the user community.
This task incorporates both development and unit testing. Test data will be available
from the initial loads of the target system. Depending on volumes, a sample of the initial
load may be appropriate for development and unit testing purposes. This sample data
set can be used to assist in building the presentation layer and validating reporting
results, without the added effort of fabricating test data.
Prerequisites
None
Roles
Considerations
Best Practices
None
Sample Deliverables
None
Description
By the time you get to this subtask, all the design work should be complete, making this
subtask relatively simple. Now is the time to put everything together and build the
actual objects such as reports, alerts and indicators.
During the build, it is important to follow any naming standards that may have been
defined during the design stage, in addition to the standards set on layouts, formats,
etc. Also, keep detailed documentation of these objects during the build activity. This
will ensure proper knowledge transfer and ease of maintenance in addition to improving
the readability for everyone.
After an object is built, thorough testing should be performed to ensure that the data
presented by the object is accurate and the object is meeting the performance that is
expected.
The principles for this subtask also apply to metadata solutions providing metadata to
end users.
Prerequisites
None
Roles
Considerations
During the Build task, it is good practice to verify and review all the design options and
to be sure you have a clear picture of the goal. Keep in mind that you have to
create a report no matter what the final form of the information delivery is. In other
words, the indicators and alerts are derived from a report, so your first task is to
create a report. The following considerations should be taken into account while
building the report:
The measurements, which are called metrics in the BI terminology, are perhaps the
most important part of the report. Begin the build task by selecting your metrics, unless
you are creating an Attribute-only Report. Add all the metrics that you want to see on
the report and arrange them in the required order. You can add a prompt to the report if
you want to make it more generic over, for example, time periods or product categories.
Optionally, you can choose a Time Key that you want to use as well for each metric.
The metrics are always measured against a set of predefined parameters. Select these
parameters, which are called Attributes in the BI terminology, and add them to the
Report (unless you are creating a Metric-only Report). You can add Prompts and Time
Keys for the attributes too, just like the metrics.
Tip
Create a query for metrics and attributes. This will help in searching for the specific
metrics or attributes much faster than manually searching in a pool of hundreds of
metrics and attributes.
Time setting preferences can differ vastly from one user to another. One group of
users may be interested only in the current data, while another
group may want to compare trends and patterns over a period of time. It is
important to thoroughly analyze the end user’s requirements and expectations prior to
adding the Time Settings to reports.
Now that you have selected all the data elements that you need in the report, it is time
to make sure that you are delivering only the relevant data set to the end users. Make
sure to use the right Filters and Ranking criteria to accomplish this in the report.
Consider using Filtersets instead of just Filters so that important criteria limiting the
data sets can be standardized over a project or department, for example.
Table report type: The data in the report can be arranged in one of the following three
table types: tabular, cross tabular, or sectional. Select the one that suits the report the
best.
Data sort order: Arrange the data such that the pattern makes it easy to find any part
of the information that one is interested in.
Chart or graph: A picture is worth a thousand words. A chart or graph can be very
useful when you are trying to make a comparison between two or more time periods,
regions, or product categories, etc.
Once the report is ready, you should think about how the report should be delivered. In
doing so, be sure to address the following points:
Where should the report reside? – Select a folder that is most suited for the data that
the report contains. If the report is shared by more than one group of users, you may
want to save it in a shared folder.
Who should get the report, when, and how should they get it? – Make sure that
proper security options are implemented for each report. There may be sensitive and
confidential data that you want to ensure is not accessible by unauthorized users.
When should the report be refreshed? - You can choose to run the report on-demand
or schedule it to be automatically refreshed at regular intervals. Ad-hoc reports that are
of interest to a smaller set of individuals are usually run on-demand. However, the bulk
of the reports that are viewed regularly by different business users need to be
scheduled to refresh periodically. The refresh interval should typically consider the
period for which the business users are likely to consider the data ‘current’ as well as
the frequency of data change in the data warehouse.
Occasionally, there will be a requirement to see the data in the report as soon as the
data changes in the data warehouse (and data in the warehouse may change very
frequently). You can handle situations like this by having the report refresh at ‘real-time’.
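The refresh-interval guidance above can be expressed as a simple rule of thumb. This is a sketch of one reasonable policy, not a feature of any particular tool: refresh no more often than the users' currency expectation, and no more often than new data actually arrives in the warehouse.

```python
# Sketch (hypothetical helper): choosing a report refresh interval from two
# business inputs -- how long users consider the data 'current', and how
# often the warehouse itself is loaded. All figures are in hours.

def refresh_interval_hours(currency_window, load_frequency):
    # Refreshing more often than the warehouse loads wastes resources;
    # refreshing less often than users tolerate makes data look stale.
    # Taking the larger of the two intervals satisfies both constraints.
    return max(currency_window, load_frequency)

print(refresh_interval_hours(currency_window=24, load_frequency=1))   # 24
print(refresh_interval_hours(currency_window=4,  load_frequency=24))  # 24
```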
Adding certain features to the report can make it more useful for everybody. Consider
the following for each report that you build:
Title of the report: The title of the report should reflect what the report contents are
meant to convey. Occasionally, it may be difficult to name a report accurately
when the same report is viewed from two different perspectives by two different sets of users.
You may consider making a copy of the report and naming the two instances to suit
each set of users.
Drill paths: Check to make sure that the required drill paths are set up. If you don’t find
a drill path that you think is useful for this report, you may have to contact the
Administrator and have it set up for you.
Highlighters: It may also be a good idea to use highlighters to make critical pieces of
information more conspicuous in the report.
Comments and description: Comments and descriptions make the reports easier to
read and also help when searching for a report.
Indicator Considerations
After the base report is complete, you can build indicators on top of that report. First,
you will need to determine and select the type of indicator that best suits the primary
purpose. You can use chart, table or gauge indicators. Remember that there are
several types of chart indicators as well as several different gauge indicators to choose from.
Gauge indicators allow you to monitor a single metric and display whether or not that
indicator is within an acceptable range. For example, you can create a gauge indicator
to monitor the revenue metric value for each division of your company. When you
create a gauge indicator, you have to determine and specify three ranges (low,
medium, and high) for the metric value. Additionally, you have to decide how the
gauge should be displayed: circular, flat, or digital.
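The three-range logic behind a gauge indicator can be sketched as follows. This is a minimal illustration of the classification a gauge performs; the threshold values are invented for the example:

```python
# Sketch of the range logic behind a gauge indicator: classify a metric
# value into the low / medium / high bands that the gauge displays.
# The revenue thresholds below are invented for illustration.

def gauge_band(value, low_max, medium_max):
    """Return which of the three gauge ranges a metric value falls into."""
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"

# Monitoring a revenue metric for one division (figures are made up).
print(gauge_band(350_000, low_max=250_000, medium_max=500_000))  # medium
```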
If you want to display information for one or more attributes or multiple metrics, you can
create either chart or table indicators. If you choose chart indicators, you have more than
a dozen different types of charts to choose from (standard bar, stacked line, pie, etc).
However, if you’d like to see a subset of an actual report, including sum calculations, in
a table view, choose a table indicator.
Alert Considerations
Alerts are created when something important is occurring, such as falling revenue or
record-breaking sales. When creating alerts, consider the following:
These answers will come from discussions with your users. Once you find out what is
important to the users, you can define the Alert rules.
It is important that the alert is delivered to the appropriate audience. An alert may go on
a business unit’s dashboard or a personal dashboard.
Once the appropriate Alert receiver is identified, you must determine the proper
delivery device. If the user doesn’t log into Power Analyzer on a daily basis, maybe an
email should be sent. If the alert is critical, a page could be sent. Furthermore, make
sure that the required delivery device (i.e., email, phone, fax, or pager) has been
configured.
Always keep the performance of your reports in mind. If a report takes too long to
generate data, you need to identify what is causing the bottleneck and eliminate or
reduce the bottleneck. The following points are worth remembering:
Tip
If your report is taking a long time to get data from the source system, you can view
the query behind the report. Copy the query and evaluate it by running utilities such
as Explain Plan (in Oracle) to make sure that it is optimized.
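As a concrete illustration of the tip, the two Oracle statements involved can be assembled as below. The sketch only builds the SQL text; connecting to the database and executing the statements (e.g., through a client library) is omitted, and the report query shown is an invented example.

```python
# Sketch: wrapping a slow report query in Oracle's EXPLAIN PLAN so its
# execution plan can be reviewed. Only statement construction is shown;
# executing them against a database is out of scope here.

def explain_plan_statements(report_query):
    """Return the two Oracle statements needed to produce an execution plan."""
    return [
        f"EXPLAIN PLAN FOR {report_query}",
        "SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)",
    ]

stmts = explain_plan_statements(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region"  # invented query
)
print(stmts[0])
```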
Best Practices
None
Sample Deliverables
None
Description
After the initial development effort, the development team should present the
presentation layer to the Business Analysts to elicit and incorporate their feedback.
When educating the end users about the front end tool whether a business intelligence
tool or an application, it is important to focus on the capabilities of the tool and the
differences between typical reporting environments and solution architectures. When
end users thoroughly understand the capabilities of the front end that they will use, they
can offer more relevant feedback.
Prerequisites
None
Roles
Considerations
Best Practices
None
Sample Deliverables
None
6 Test
Description
The Test phase includes the full design of your testing plans and infrastructure as well
as two categories of comprehensive system-wide verification procedures; the System
Test and the User Acceptance Test (UAT). The System Test is conducted after all
elements of the system have been integrated into the test environment. It includes a
number of detailed technically-oriented verifications that are managed as processes by
the technical team with primarily technical criteria for acceptance. UAT is a detailed
user-oriented set of verifications with User Acceptance as the objective. It is typically
managed by end-users with participation from the technical team. No test can be
considered complete until there is verification that it has accomplished the agreed-upon
Acceptance Criteria. Because of the natural tension that exists between completion of
the preset project timeline and completion of Acceptance Criteria (which may take
longer than expected) the Test Phase schedule is often owned by a QA Manager
or Project Sponsor rather than the Project Manager.
Velocity includes as a final step in the Test Phase activities related to tuning system
performance. Satisfactory performance and system responsiveness can be a critical
element of user acceptance.
Prerequisites
None
Roles
Considerations
To ensure the Test Phase is successful it must be preceded by diligent planning and
preparation. Early on, project leadership and project sponsors should establish test
strategies and begin building plans for System Test and UAT. Velocity recommends
that this planning process begins, at the latest, during the Design Phase, and that it
includes descriptions of timelines, participation, test tools, guidelines and scenarios, as
well as detailed Acceptance Criteria.
The Test Phase includes the development of test plans and procedures.
The Test Phase includes other important activities in addition to testing. Any defects or
deficiencies discovered must be categorized (severity, criticality, priority), recorded, and
weighed against the Acceptance Criteria (AC). The technical team should repair them
within the guidelines of the AC, and the results must be retested with the inclusion of
satisfactory regression testing. This process requires some type of Defect Tracking
System to be in place; Velocity recommends that this be developed during the Build Phase.
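The minimum a Defect Tracking System needs to support the workflow above can be sketched as a small record type. The field names, the severity scale, and the acceptance rule here are illustrative assumptions, not Velocity prescriptions:

```python
# Sketch of the minimal record a Defect Tracking System needs so each defect
# can be categorized (severity, priority) and weighed against the Acceptance
# Criteria. Field names and the 1-4 severity scale are invented assumptions.
from dataclasses import dataclass

@dataclass
class Defect:
    summary: str
    severity: int        # 1 = critical ... 4 = cosmetic (assumed scale)
    priority: int        # 1 = fix before acceptance ... 3 = defer
    status: str = "open" # open -> fixed -> retested -> closed

def blocks_acceptance(defect, max_open_severity=2):
    """A defect blocks acceptance if it is not closed and is severe enough."""
    return defect.status != "closed" and defect.severity <= max_open_severity

d = Defect("Fact table load drops duplicate keys", severity=1, priority=1)
print(blocks_acceptance(d))  # True
```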
Although formal user acceptance signals the completion of the Test Phase, some of its
activities will be revisited, perhaps many times, throughout the operation of the
system. Performance tuning is recommended as a recurrent process. As data volume
grows and the profile of the data changes, performance and responsiveness may
degrade. You may want to plan for regular periods of benchmarking and tuning, rather
than waiting to be reactive to end-user complaints. By its nature, software
development is not always perfect, so some repair and retest should be expected. The
Defect Tracking System must be maintained to record defects and enhancements for
as long as the system is supported and used. Test scenarios, regression test
procedures, and other testing aids must also be retained for this purpose.
Best Practices
None
Sample Deliverables
None
Description
The purpose of testing is to verify that the software has been developed according to
the requirements and design specifications. Although the major testing actually occurs
at the end of the Build Phase , determining the amount and types of testing to be
performed should occur early in the development lifecycle. This enables project
management to allocate adequate time and resources to this activity. This also enables
the project to build the appropriate testing infrastructure prior to the beginning of the
testing phase. Thus, while all of the testing-related activities have been consolidated
in the Testing Phase, these activities often begin as early as the
Design Phase. The detailed object level testing plans are continually updated and
modified as the development process continues since any change to development work
is likely to create a new scenario to test.
Prerequisites
None
Roles
Considerations
Best Practices
None
Sample Deliverables
None
Description
Ideally, actual data from the production environment will be available for testing so that
tests can cover the full range of possible values and states in the data. However, the
full set of production data is often not available.
If generated data is used, the main challenge is to ensure that it accurately reflects the
production environment. Theoretically, generated data can be made to be
representative and engineered to test all of the project functionality. While the actual
record counts in generated tables are likely to differ from production environments, the
ratios between tables should be maintained; for example, if there is a one-to-ten ratio
between products and customers in the live environment, care should be taken to
retain this same ratio in the test environment.
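The ratio-preservation rule above can be sketched as a uniform scaling of production row counts. This is an illustrative helper, not part of any product; the table names and counts are invented:

```python
# Sketch: scaling generated test data down from production volumes while
# preserving the ratios between tables (the one-to-ten products-to-customers
# example from the text). All table names and figures are illustrative.

def scaled_counts(production_counts, scale):
    """Scale every table's row count by the same factor so inter-table
    ratios survive (with a floor of one row per table)."""
    return {table: max(1, round(n * scale)) for table, n in production_counts.items()}

prod = {"products": 10_000, "customers": 100_000}   # 1:10 ratio
test = scaled_counts(prod, scale=0.01)
print(test)  # {'products': 100, 'customers': 1000} -- ratio preserved
```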
The deliverable from this subtask is a description and schedule for how test data will be
derived, stored, and migrated to testing environments. Adequate test data can be
important for proper unit testing and is critical for satisfactory system and user
acceptance tests.
Prerequisites
None
Roles
Considerations
Usually, data for testing purposes is stored in the same structure as the source in the
data flow. However, it is also possible to store test data in a format that is geared
toward ease of maintenance and to use PowerCenter to transfer the data to the source
system format. So if the source is a database with a constantly changing structure, it
may be easier to store test data in XML or CSV formats where it can easily be
maintained with a text editor. The PowerCenter mappings that load the test data from
this source can make use of techniques to insulate (to some degree) the logic from
schema changes by including pass-through transformations after source qualifiers and
before targets.
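The CSV option described above can be sketched as follows. The sample data is invented, and the actual load into the source-system format (e.g., via a PowerCenter mapping) is out of scope; the point is only that a text-based format is trivial to read and maintain:

```python
# Sketch: keeping test data in a CSV file that is easy to maintain with a
# text editor, then reading it for transfer to the source-system format.
# The records below are invented sample data.
import csv
import io

# Inline sample standing in for a maintained test-data file.
test_data_csv = """customer_id,name,region
1,Acme Corp,EMEA
2,Globex,APAC
"""

rows = list(csv.DictReader(io.StringIO(test_data_csv)))
print(rows[0]["name"])  # Acme Corp
```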
For Data Migration, the test data strategy should be focused on how much source data
to use rather than how to manufacture test data. It is strongly recommended that the
data used for testing is real production data, though most likely of smaller volume than the
production system. By using real production data, the final testing will be more
meaningful and increase the level of confidence from the business community thus
making ‘go/no-go’ decisions easier.
Best Practices
None
Description
Any distinct unit of development must be adequately tested by the developer before it is
designated ready for system test and for integration with the rest of the project
elements. This includes any element of the project that can, in any way, be tested on its
own. Rather than conducting unit testing in a haphazard fashion with no means of
certifying satisfactory completion, all unit testing should be measured against a
specified unit test plan and its completion criteria.
Unit test plans are based on the individual business and functional requirements and
detailed design for mappings, reports, or components for the mapping or report. The
unit test plans should include specification of inputs, tests to verify, and expected
outputs and results. The unit test is the best opportunity to discover any
misinterpretation of the design as well as errors of development logic. The creation of
the unit test plan should be a collaborative effort by the designer and the developer,
and must be validated by the designer as meeting the business and functional
requirements and design criteria. The designer should begin with a test scenario or test
data descriptions and include checklists for the required functionality; the developer
may add technical tests and make sure all logic paths are covered.
Prerequisites
None
Roles
Considerations
Reference to design documents should contain the name and location of any related
requirements documents, high-level and detailed design, mock-ups, workflows, and
other applicable documents.
Specification of the test environment should include such details as which reference or
conversion tables must be used to translate the source data for the appropriate target
(e.g., for conversion of postal codes, for key translation, other code translations). It
should also include specification of any infrastructure elements or tools to be used in
conjunction with the tests.
The description of test runs should include the functional coverage, and any
dependencies between test runs.
Prerequisites should include whatever is needed to create the correct environment for
the test to take place, and any dependencies the test has on the completion of other logic or components.
The input files or tables must be specified with their locations. These data must be
maintained in a secure place to make repeatable tests possible.
Specifying the expected output is the main part of the test plan. It specifies in detail any
output records and fields, and any functional or operational results through each step of
the test run. The script should cover all of the potential logic paths and include all code
translations and other transformations that are part of the unit. Comparing the produced
output from the test run with this specification provides the verification that the build
satisfies the design.
The test script specifies all the steps needed to create the correct environment for the
test, to complete the actual test run itself, and the steps to analyze the results. Analysis
can be done by hand or by using compare scripts.
The Comments and Findings section is where all errors and unexpected results found
in the test run should be logged. In addition, errors in the test plan itself can be logged
here as well. It is up to the QA Management and/or QA Strategy to determine whether
to use a more advanced error tracking system for unit testing or to wait until system
test. Some sites demand a more advanced error logging system, (e.g., ClearCase)
where errors can be logged along with an indication of their severity and impact, as well
as information about who is assigned to resolve the problem.
One or more test runs can be specified in a single unit test plan. For example, one run
may be an initial load against an empty target, with subsequent runs covering
incremental loads against existing data or tests with empty input or with duplicate input
records or files and empty reports.
Test data must contain a mix of correct and incorrect data. Correct data can be
expected to result in the specified output; incorrect data may have results according to
the defined error-handling strategy such as creating error records or aborting the
process. Examples of incorrect data are:
Data quality plans should be tested using IDQ applications before they are added to
PowerCenter transformations. The results of these tests will feed as prerequisites into
the main unit test plan. The tests for data quality processes should follow the same
guidelines as outlined in this document. A PowerCenter mapping should be validated
once the Data Quality Integration transformation has been added to it and configured
with a data quality plan.
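The error-handling behavior described earlier for incorrect test data (create error records rather than lose bad rows, or abort the process) can be sketched as a simple routing step. The validation rule shown is an invented example:

```python
# Sketch of an error-handling strategy for test runs: correct records flow
# to the target, incorrect records become error records instead of silently
# disappearing. The 'non-negative quantity' rule is an invented example.

def route_records(records, is_valid):
    """Split records into (target, errors) according to a validation rule."""
    target, errors = [], []
    for rec in records:
        (target if is_valid(rec) else errors).append(rec)
    return target, errors

records = [{"qty": 5}, {"qty": -1}]          # -1 is deliberately incorrect
good, bad = route_records(records, lambda r: r["qty"] >= 0)
print(len(good), len(bad))  # 1 1
```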
Every difference between the output expectation and the test output itself should be
logged in the Comments and Findings section, along with information about the
severity and impact on the test process. The unit test can proceed after analysis and
error correction.
The unit test is complete when all test runs are successfully completed and the findings
are resolved and retested. At that point, the unit can be handed over to the next test
phase.
Best Practices
Sample Deliverables
Description
System Test (sometimes known as Integration Test) is crucial for ensuring that the
system operates reliably as a fully integrated system and functions according to the
business requirements and technical design. Success rests largely on business users'
confidence in the integrity of the data. If the system has flaws that impede its functions,
the data may also be flawed or users may perceive it as flawed, which results in a loss
of confidence in the system. If the system does not provide adequate performance and
responsiveness, the users may abandon it (especially if it is a reporting system)
because it does not meet their perceived needs.
As with the other testing processes, it is very important to begin planning for System
Test early in the project to make sure that all necessary resources are scheduled and
prepared ahead of time.
Prerequisites
None
Roles
Considerations
Since the system test addresses multiple areas and test types, creation of the test plan
should involve several specialists. The System Test Manager is then responsible for
compiling their inputs into one consistent system test plan. All individuals participating
in executing the test plan must agree on the relevant performance indicators that are
required to determine if project goals and objectives are being met. The performance
indicators must be documented, reviewed, and signed-off on by all participating team
members.
Test Cases
The test case (i.e., unit of work to be tested) must be sufficiently specific to track and
improve data quality and performance.
Test Levels
Each test case is categorized as occurring on a specific level or levels. This helps to
clearly define the actual extent of testing expected within a given test case. Test levels
may include one or more of the following:
● System Level. Covers all "end to end" integration testing, and involves the
complete validation of total system functionality and reliability through all
system entry points and exit points. Typically, this test level is the highest, and
the last level of testing to be completed.
● Support System Level. Involves verifying the ability of existing support
systems and infrastructure to accommodate new systems or the proposed
expansion of existing systems. For example, this level of testing may
determine the effect of a potential increase in network traffic due to an
expanded system user base on overall business operations.
● Internal Interface Level. Covers all testing that involves internal system data
flow. For example, this level of testing may validate the ability of PowerCenter
to successfully connect to a particular data target and load data.
● External Interface Level. Covers all testing that involves external data
sources. For example, this level of testing may collect data from diverse
business systems into a data warehouse.
● Hardware Component Level. Covers all testing that involves verifying the
function and reliability of specific hardware components. For example, this
level of testing may validate a back-up power system by removing the primary
power source. This level of testing typically occurs during the development
cycle.
● Software Process Level. Covers all testing that involves verifying the function
and reliability of specific software applications. This level of testing typically
occurs during the development cycle.
● Data Unit Level. Covers all testing that involves verifying the function and
reliability of specific data items and structures. This typically occurs during the
development cycle in which data types and structures are defined and tested.
Test Types
The Data Integration Developer generates a list of the required test types based on the
desired level of testing. The defined test types determine what kind of tests must be
performed to satisfy a given test case. Test types that may be required include:
As part of 6.2 Execute System Test, other specific tests should also be planned for:
The system test plan consists of one or more test runs, each of which must be
described in detail. The interaction between the test runs must also be specified. After
each run, the System Test Manager can decide, depending on the defect count and
severity, whether the system test can proceed with subsequent test runs or that errors
must be corrected and the previous run repeated.
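The go/no-go decision described above can be sketched as a threshold check on defect counts per severity. The limits used here are invented for illustration; in practice they come from the agreed Acceptance Criteria:

```python
# Sketch of the System Test Manager's decision after a test run: proceed to
# the next run only if defect counts stay under per-severity thresholds.
# The thresholds are illustrative, not prescribed values.

def can_proceed(defect_severities, limits=None):
    """defect_severities: list of severity codes (1 = most severe).
    Block the next run if any limited severity exceeds its threshold."""
    if limits is None:
        limits = {1: 0, 2: 2}  # assumed: no criticals, at most two severity-2
    for sev, limit in limits.items():
        if defect_severities.count(sev) > limit:
            return False
    return True

print(can_proceed([2, 3, 3]))  # True  -- no critical defects found
print(can_proceed([1, 3]))     # False -- a critical defect blocks progress
```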
Every difference between the expected output and the test output itself should
be recorded and entered into the defect tracking system with a description of the
severity and impact on the test process. These errors and the general progress of the
system test should be discussed in a weekly or bi-weekly progress meeting. At this
meeting, participants review the progress of the system test, any problems identified,
and assignments to resolve or avoid them. The meeting should be directed by the
System Test Manager and attended by the testers and other necessary specialists like
designers, developers, systems engineers and database administrators.
After assignment of the findings, the specialists can take the necessary actions to
resolve the problems. After the solution is approved and implemented, the system test
can proceed.
Best Practices
None
Sample Deliverables
None
Description
User Acceptance Testing (often known as UAT) is essential for gaining approval,
acceptance, and project sign-off. It is the end-user community that needs to carry out the
testing and identify relevant issues for fixing. Resources for the testing will include
physical environment setup as well as allocation of staff to testing from the user
community. As with system testing, planning for User Acceptance Testing should
begin early in the project so as to ensure the necessary resources are scheduled and
ready. In addition, the user acceptance criteria will need to be distilled from the
requirements and existing gold standard reports. These criteria need to be documented
and agreed by all parties so as to avoid delays through scope creep.
Prerequisites
None
Roles
Considerations
The plan should be built from the acceptance criteria, with test scripts of actions
that users will need to carry out to achieve certain results; for example, instructions to
run particular workflows and reports, after which the users can examine the resulting
data. The author of the plan needs to bear in mind that testers from the user community
may not be technically minded. Indeed, one possible benefit of having non-technical
users involved is that they will provide insight into the time and effort required.
In addition to the test scripts for execution, additional acceptance criteria need to be
defined:
In Data Migration projects, user acceptance testing is even more user-focused than
other data integration efforts. This testing usually takes two forms, traditional UAT and
‘day-in-the-life’. During these two phases, business users are working through the
system, executing their normal daily routine and driving out issues and
inconsistencies. It is very important that the data migration team works closely with the
business testers to both provide appropriate data for these tests and to
capture feedback to improve the data as soon as possible. This UAT activity is the best
way to find out if the data is correct and if the data migration was completed
successfully.
Best Practices
None
Sample Deliverables
None
Description
Test scenarios provide the context, the “story line”, for much of the test procedures,
whether Unit Test, System Test or UAT. How can you know that the software solution
you’re developing will work within its ultimate business usage? A scenario provides the
business case for testing specific functionality, enabling testers to pretend to carry-out
the related business activity and then measure the results against expectations. For
this reason, design of the scenarios is a critical activity and one that may involve
significant effort in order to provide coverage for all the functionality that needs testing.
The test scenario forms the basis for development of test scripts and checklists, the
source data definitions, and other details of specific test runs.
Prerequisites
None
Roles
Considerations
Test scenarios must be based on the functional and technical requirements by dividing
them into specific functions that can be treated in a single test process.
Best Practices
None
Sample Deliverables
None
Description
This subtask deals with the procedures and considerations for actually creating,
storing, and maintaining the test data. The procedures for any given project are, of
course, specific to its requirements and environments, but are also opportunistic. For
some projects, there will exist a comprehensive set of data or at least a good start in
that direction, while for other projects, the test data may need to be created from
scratch.
In addition to test data that allows full functional testing (i.e., functional test data), there
is also a need for adequate data for volume tests (i.e., volume test data). The following
paragraphs discuss each of these data types.
Creating a source data set to test the functionality of the transformation software should
be the responsibility of a specialized team largely consisting of business-aware
application experts. Business application skills are necessary to ensure that the test
data not only reflects the eventual production environment but that it is also engineered
to trigger all the functionality specified for the application. Technical skills in whatever
storage format is selected are also required to facilitate data entry and/or movement.
Volume is not a requirement of the functional test data set; indeed, too much data is
undesirable since the time taken to load it needlessly delays the functional test.
In a data integration project, while functional test data for the application sources is
indispensable, the case for a predefined data set for the targets should also be
considered. If available, such a data set makes it possible to develop an automated test
procedure to compare the actual result set to a predicted result set (making the
necessary adjustments to generated data, such as surrogate keys, timestamps, etc.).
This has additional value in that the definition of a target data set in itself serves as a
sort of design audit.
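The automated comparison described above can be sketched as follows. Generated columns (surrogate keys, load timestamps) are excluded before comparing; the column names and rows are invented for the example:

```python
# Sketch of comparing an actual target result set against a predicted one,
# with generated columns (surrogate keys, timestamps) excluded first.
# Column names and row values below are illustrative assumptions.

VOLATILE = {"surrogate_key", "load_timestamp"}

def stable(row):
    """Drop generated columns so only business content is compared."""
    return {k: v for k, v in row.items() if k not in VOLATILE}

def result_sets_match(actual, predicted):
    """Order-insensitive comparison of the stable parts of both row sets."""
    return sorted(map(stable, actual), key=str) == sorted(map(stable, predicted), key=str)

actual    = [{"surrogate_key": 42, "load_timestamp": "2024-01-01",
              "region": "EMEA", "revenue": 100}]
predicted = [{"region": "EMEA", "revenue": 100}]
print(result_sets_match(actual, predicted))  # True
```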
Once again, PowerCenter can be used to generate volumes of data and to modify
sensitive live information in order to preserve confidentiality. There are a number of
techniques to generate multiple output rows from a single source row, such as:
If possible, the volume test data set should also be available to developers for unit
testing in order to identify problems as soon as possible.
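One common row-multiplication technique, producing N varied copies of each source row, can be sketched as below. The field names and the key-variation rule are invented for illustration; in practice PowerCenter transformations would perform the equivalent step:

```python
# Sketch of one technique for generating volume test data: multiply each
# source row into N copies, varying the key so every output row is unique.
# Field names and the suffix rule are invented assumptions.

def multiply_rows(source_rows, copies):
    """Produce 'copies' varied output rows per input row."""
    out = []
    for row in source_rows:
        for i in range(copies):
            new_row = dict(row)
            new_row["order_id"] = f'{row["order_id"]}-{i}'  # keep keys unique
            out.append(new_row)
    return out

volume = multiply_rows([{"order_id": "A1", "amount": 10}], copies=1000)
print(len(volume))  # 1000
```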
Maintenance
In addition to the initial acquisition or generation of test data, you will need a protected
location for its storage and procedures for migrating it to test environments in such a
fashion that the original data set is preserved (for the next test sequence). In addition,
you are likely to need procedures that will enable you to rebuild or rework the test data,
as required.
Prerequisites
None
Roles
Considerations
Creating the source and target data sets and conducting automated testing are non-
trivial efforts, and are therefore often dismissed as impractical.
Data Migration projects should have little need for generating test data. It is strongly
recommended that all data migration integration and system tests use
actual production data. Therefore, effort spent generating test data on a data migration
project should be very limited.
Best Practices
None
Sample Deliverables
None
Description
This is the first major task of the Test Phase – general preparations for System Test
and UAT. This includes preparing environments, ramping up defect management
procedures, and generally making sure the test plans and all their elements are
prepared and that all participants have been notified of the upcoming testing processes.
Prerequisites
None
Roles
Considerations
Prior to beginning this subtask, you will need to collect and review the documentation
generated by the previous tasks and subtasks, including the test strategy, system test
plan, and UAT plan. Verify that all required test data has been prepared and that the
defect tracking system is operational. Ensure that all unit test certification procedures have been completed.
Best Practices
None
Sample Deliverables
None
Description
It is important to prepare the test environments in advance of System Test with the
following objectives:
Prerequisites
None
Roles
Considerations
A formal test plan needs to be prepared by the Project Manager in conjunction with the
Test Manager. This plan should cover responsibilities, tasks, time-scales, resources,
training, and success criteria. It is vital that all resources, including off-project support
staff, are made available for the entire testing period.
Test scripts need to be prepared, together with a definition of the data required to
execute the scripts. The Test Manager is responsible for preparing these items, but is
likely to delegate a large part of the work.
A formal definition of the required environment also needs to be prepared, including all
necessary hardware components (i.e., server and client), software components (i.e.,
operating system, database, data movement, testing tools, application tools, custom
application components etc., including versions), security and access rights, and
networking.
Review the test plans and scenarios to determine the technical requirements for the
test environments. Volume tests and disaster/recovery tests may require special
system preparations.
The System Test environment may evolve into the UAT environment, depending on
requirements and stability.
Processes
Where possible, all processes should be supported by the use of appropriate tools.
Some of the key terminology related to the preparation of the environments and the
associated processes include:
Data
The data required for testing can be derived from the test cases defined in the scripts.
This should enable a full dataset to be defined, ensuring that all possible cases are
tested. 'Live data' is usually not sufficient because it does not cover all the cases the
system should handle, and may require some sampling to keep the data volumes at
realistic levels. It is, of course, possible to use modified live data, adding the additional
cases or modifying the live data to create the required cases.
The process of creating the test data needs to be defined. Some automated approach
to creating all or the majority of the data is best. There is often a need to process data
through a system where some form of OLTP is involved. In this case, it must be
possible to roll-back to a base-state of data to allow reapplication of the ‘transaction’
data – as would be achieved by restoring from back-up.
Where multiple data repositories are involved, it is important to define how these
datasets relate. It is also important that the data is consistent across all the repositories
and that it can be restored to a known state (or states) as and when required.
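The points above can be sketched in code. The following Python fragment is an illustrative sketch only (field names and values are invented for the example) of defining a base-state dataset plus engineered edge cases, so that each test cycle can restore the data to a known state:

```python
# Illustrative sketch: field names and values are invented for the example.
import copy

BASE_STATE = [
    {"cust_id": 1, "name": "Acme Ltd", "balance": 100.0},
    {"cust_id": 2, "name": "Bolt Inc", "balance": 0.0},
]

# Cases that 'live' data rarely covers: empty names, negative and extreme values.
EDGE_CASES = [
    {"cust_id": 3, "name": "", "balance": -50.0},
    {"cust_id": 4, "name": "O'Hare & Sons", "balance": 10**9},
]

def build_test_dataset():
    """Return a fresh copy of the base state plus edge cases, so every
    test cycle starts from the same known state."""
    return copy.deepcopy(BASE_STATE) + copy.deepcopy(EDGE_CASES)

data = build_test_dataset()
data[0]["balance"] = 999.0                           # a test run mutates the data...
assert build_test_dataset()[0]["balance"] == 100.0   # ...rebuilding restores the base state
```

In practice the restore would be a database back-up/restore rather than an in-memory copy, but the principle of a reproducible base state is the same.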
Environment
● Server(s) – must be available for the required duration and have sufficient disk
space and processing power for the anticipated workload.
● Client workstations – must be available and sufficiently powerful to run the
required client tools.
● Server and client software – all necessary software (OS, database, ETL, test
tools, data quality tools, connectivity etc.) should be installed at the version
used in development (normally) with databases created as required.
● Networking – all required LAN and WAN connectivity must be set up and
firewalls configured to allow appropriate access. Bandwidth must be available
for any particularly large data transmissions.
● Databases – all necessary schemas must be created and populated with an appropriate dataset that can be restored to a known base state.
For Data Migration, the system test environment should not be limited to the
Informatica environment, but should also include all source systems, target systems,
reference data and staging databases, and file systems. The system tests will be a
simulation of production systems so the entire process should execute like a production
environment.
Best Practices
None
Sample Deliverables
None
Description
The key measure of software quality is, of course, the number of defects (a defect is
anything that produces results other than the expected results based on the software
design specification). Therefore it is essential for software projects to have a systematic
approach to detecting and resolving defects early in the development life cycle.
Prerequisites
None
Roles
Considerations
Personal and peer reviews are primary sources of early defect detection. Unit testing,
system testing and UAT are other key sources, however, in these later project stages,
defect detection is a much more resource-intensive activity. Worse yet, change
requests and trouble reports are evidence of defects that have made their way to the
end users.
There are two major components of successful defect management, defect prevention
and defect detection. A good defect management process should enable developers to
both lower the number of defects that are introduced, and remove defects early in the
life cycle prior to testing.
Defect management begins with the design of the initial QA strategy and a good,
detailed test strategy. They should clearly define methods for reviewing system
requirements and design, and spell out guidelines for testing processes and defect tracking.
To support early defect resolution, you must have a defect tracking system that is
readily accessible to developers and includes the following:
● Ability to identify and type the defect, with details of its behaviour
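As a minimal illustration of such a tracking record, the following Python sketch captures the identity, classification, and observed behaviour of a defect; the field names and status values are assumptions for the example, not a prescribed schema:

```python
# Hypothetical minimal defect record; fields and statuses are illustrative.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Defect:
    defect_id: str
    defect_type: str              # e.g. "data", "logic", "performance"
    severity: int                 # 1 = critical ... 4 = cosmetic
    behaviour: str                # observed vs. expected results
    raised_on: date = field(default_factory=date.today)
    status: str = "open"          # open -> assigned -> fixed -> retested -> closed
    assigned_to: Optional[str] = None

d = Defect("DEF-001", "data", 2, "990 rows loaded, 1000 expected")
d.status, d.assigned_to = "assigned", "developer_a"   # move through the lifecycle
```

A real defect tracking system adds workflow, history, and reporting on top of a record like this, but these are the minimum fields it needs to hold.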
Best Practices
None
Sample Deliverables
None
Description
System Test (sometimes known as Integration Test) is crucial for ensuring that the
system operates reliably and according to the business requirements and technical
design. Success rests largely on business users' confidence in the integrity of the data.
If the system has flaws that impede its function, the data may also be flawed, or users
may perceive it as flawed, which results in a loss of confidence in the system. If the
system does not provide adequate performance and responsiveness, the users may
abandon it (especially if it is a reporting system) because it does not meet their
perceived needs.
System testing follows unit testing, providing the first tests of the fully integrated
system, and offers an opportunity to clarify users' performance expectations and
establish realistic goals that can be used to measure actual operation after the system
is placed in production. It also offers a good opportunity to refine the data volume
estimates that were originally generated in the Architect Phase. This is useful for
determining if existing or planned hardware will be sufficient to meet the demands on
the system.
1. 6.3.1 Prepare for System Test, in which the test team determines how to test the system from end-to-end to ensure a successful load, as well as planning the environments, participants, tools and timelines for the test.
2. 6.3.2 Execute Complete System Test, in which the data integration team works with the Database Administrator to run the system tests planned in the prior subtask. It is crucial to also involve end-users in the planning and review of system tests.
3. 6.3.3 Perform Data Validation, in which the QA Manager and QA team ensure that the system is capable of delivering complete, valid data to the business users.
4. 6.3.4 Conduct Disaster Recovery Testing, in which the system's robustness and recovery procedures are verified.
Prerequisites
None
Roles
Considerations
For Data Migration projects, system tests are important because these are essentially
‘dress-rehearsals’ for the final migration. These tests should be executed with
production-level controls and be tracked and improved upon from system test cycle to
system test cycle. In data migration projects these system tests are often referred to as
‘mock-runs’ or ‘trial cutovers’.
Best Practices
None
Sample Deliverables
None
Description
System test preparation consists primarily of creating the environment(s) required for
testing the application and staging the system integration. System Test is the first
opportunity, following comprehensive unit testing, to fully integrate all the elements of
the system, and to test the system by emulating how it will be used in production. For
this reason, the environment should be as similar as possible to the production
environment in its hardware, software, communications, and any support tools.
Prerequisites
None
Roles
Considerations
The preparations for System Test often take much more effort than expected, so they
should be preceded by a detailed integration plan that describes how all of the system
elements will be physically integrated within the System Test environment. The
integration plan should be specific to your environment, but some general steps are
common to most integration plans.
For Data Migration projects the system test should not just involve running Informatica
Workflows; it should also include data set-up, migrating code, executing data and
process validation and post-process auditing. The system test set-up should be part of
the system test, not a pre-system test step.
Best Practices
None
Sample Deliverables
Description
This subtask involves a number of guidelines for running the complete system test and
resolving or escalating any issues that may arise during testing.
Prerequisites
None
Roles
Considerations
A system test plan needs to include prerequisites for entering the system test phase,
criteria for successfully exiting it, and defect classifications. In addition, all
test conditions, expected results, and test data need to be available prior to system test.
Load Routines
Ensure that the system test plan includes all types of load that may be encountered
during the normal operation of the system. For example, a new data warehouse (or a
new instance of a data warehouse) may include a one-off initial load step. There may
also be weekly, monthly, or ad-hoc processes beyond the normal incremental load
routines.
System testing is a cyclical process. The project team should plan to execute multiple
iterations of the most common load routines within the timeframe allowed for system
testing. Applications should be run in the order specified in the test plan.
Scheduling
The scheduling tools in PowerCenter and/or a third-party scheduling tool can be used to detect
long-running sessions/tasks and alert the system test team via email. This helps to
identify issues early and manage the system test timeframe effectively.
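As an illustration of this kind of alerting, the following Python sketch flags long-running tasks from captured elapsed times; in practice the run times would come from the scheduler or the Workflow Monitor, and the alert would be sent by email (session names and the threshold are invented):

```python
# Hypothetical sketch: session names, run times and threshold are invented.
def long_running(sessions, threshold_minutes):
    """Return (name, minutes) pairs for tasks exceeding the threshold."""
    return [(name, mins) for name, mins in sessions.items() if mins > threshold_minutes]

runs = {"s_load_customers": 12, "s_load_orders": 95, "s_load_fx_rates": 3}
alerts = long_running(runs, threshold_minutes=60)
for name, mins in alerts:
    print(f"ALERT: {name} has been running for {mins} minutes")  # e-mail in practice
```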
The team executing the system test plan is responsible for tracking the expected and
actual results of each run. The details of each PowerCenter session run can be found in
the Workflow Monitor.
The testing team must document the specific statistical results of each run and
communicate those results back to the project development team. If the results do not
meet the criteria listed in the test case, or if any process fails during testing, the test
team should immediately generate a change request. The change request is assigned
to the developer(s) responsible for completing system modifications. In the case of a
PowerCenter session failure, the test team should seek the advice of the appropriate
developer and business analyst before continuing with any other dependent tests.
Ideally, all defects will be captured, fixed, and successfully retested within the system
testing timeframe. In reality, this is unlikely to happen. If outstanding defects are still
apparent at the end of the system testing period, the project team needs to decide how
to proceed. If the system test plan contains system test completion criteria, those
criteria must be fulfilled.
Defect levels must meet established criteria for completion of the system test cycle.
Defects should be judged by their number and by their impact. Ultimately, the project
team is responsible for ensuring that the tests adhere to the system test plan and the
test cases within it (developed in Subtask 6.3.1 Prepare for System Test ). The project
team must review and sign-off on the results of the tests.
For Data Migration projects, because they are usually part of a larger
implementation, the system test should be integrated with the larger project system
test. The results of this test should be reviewed, improved upon and communicated to
the project manager or project management office (PMO). It is common for these types
of projects to have three or four full system tests otherwise known as ‘mock runs’ or
‘trial cutovers’.
Sample Deliverables
None
Description
The purpose of data validation is to ensure that target data is populated as per
specification. The team responsible for completing the end-to-end test plan should be
in a position to utilize the results detailed in the testing documentation (e.g.,
CTPs, TCDs, and TCRs). Test team members should review and analyze the test
results to determine if project and business expectations are being met.
● If the team concludes that the expectations are being met, it can sign-off on
the end-to-end testing process.
● If expectations are not met, the testing team should perform a gap analysis on
the differences between the test results and the project and business
expectations.
The gap analysis should list the errors and requirements not met so that a Data
Integration Developer can be assigned to investigate the issue. The analysis should
also include data from initial runs in production. The Data Integration Developer should
assess the resources and time required to modify the data integration environment to
achieve the required test results. The Project Sponsor and Project Manager should
then finalize the approach for incorporating the modifications, which may include
obtaining additional funding or resources, limiting the scope of the modifications, or re-
defining the business requirements to minimize modifications.
Prerequisites
None
Roles
Considerations
The Integration Service generates the following tables to help you track row
errors:
● Involvement. The test team, the QA team, and, ultimately, the end-user
community are all jointly responsible for ensuring the accuracy of the data. At
the conclusion of system testing, all must sign-off to indicate their acceptance
of the data quality.
● Access To Front-End for Reviewing Results. The test team should have
access to reports and/or a front-end tool to help review the results of each load.
The Data Validation task has enormous scope and is a significant phase in any project
cycle. Data validation can be either manual or automated.
Manual. This technique involves manually validating target data with source and also
ensuring that all the transformations have been correctly applied. Manual validation may
be valid for a limited set of data or for master data.
Automated. This technique involves using various techniques and/or tools to validate
data and ensure, at the end of cycle, that all the requirements are met. The
following tools are very useful for data validation:
● File Diff. This utility is generally available with any testing tool and is very useful if the source(s) and target(s) are files. Otherwise, the result sets from the source and/or target systems can be saved as flat files and compared using file diff utilities.
● Data Analysis Using IDQ. The testing team can use Informatica Data Quality (IDQ) Data Analysis plans to assess the level of data quality. Plans can be built to identify problems with data conformity and consistency. Once the data is analyzed, scorecards can be used to generate a high-level view of the data quality. Using the results from data analysis and scorecards, new test cases can be added and new test data can be created for the testing cycle.
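A simple automated comparison of the kind a file diff performs can be sketched as follows; this Python fragment (keys and row values are invented for the example) reports rows missing from the target, extra in the target, or changed between source and target extracts:

```python
# Illustrative sketch: compare source and target result sets keyed on the
# first column; the row values are invented for the example.
def validate(source_rows, target_rows, key=lambda r: r[0]):
    src = {key(r): r for r in source_rows}
    tgt = {key(r): r for r in target_rows}
    return {
        "missing": sorted(src.keys() - tgt.keys()),   # in source, never loaded
        "extra":   sorted(tgt.keys() - src.keys()),   # in target, no source row
        "changed": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }

result = validate(
    [("1", "Acme", "100.00"), ("2", "Bolt", "0.00")],
    [("1", "Acme", "100.00"), ("2", "Bolt", "9.99"), ("3", "Carr", "5.00")],
)
# result -> {'missing': [], 'extra': ['3'], 'changed': ['2']}
```

In a real project the inputs would be the flat-file extracts described above, read with a CSV reader, and the key would be the business or surrogate key of the table.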
Defect Management:
The defects encountered during the data validation should be organized using either a
simple tool like an Excel (or comparable) spreadsheet or a more advanced tool.
Advanced tools may have facilities for defect assignment, defect status changes, and/
or a section for defect explanation. The Data Integration Developer and the testing
team must ensure that all defects are identified and corrected before changing
the defect status.
For Data Migration projects it is important to identify a set of processes and procedures
to be executed to simplify the validation process. These processes and procedures
should be built into the Punch List and should focus on reliability and efficiency. For
large scale data migration projects it is important to realize the scale of validation. A set
of tools must be developed to enable the business validation personnel to quickly and
accurately validate that the data migration was complete. Additionally it is important
that the run book includes steps to verify that all technical steps were completed
successfully. PowerCenter Metadata Reporter should be leveraged and documented in
the punch list steps and detailed records of all interaction points should be included in
operational procedures.
Best Practices
None
Sample Deliverables
None
Description
Disaster testing is crucial for proving the resilience of the system to the business
sponsors and IT support teams, and for ensuring that staff roles and responsibilities are
understood if a disaster occurs.
Prerequisites
None
Roles
Considerations
First, consider the system's disaster tolerance; second, consider the system
architecture. A well-designed system minimizes the risk of disaster and, if a disaster
occurs, should allow a smooth and timely recovery.
Disaster Tolerance
Disaster tolerance is the ability to successfully recover applications and data after a
disaster within an acceptable time period. A disaster is an event that unexpectedly
disrupts service availability, corrupts data, or destroys data. Disasters may be
triggered by natural phenomena, malicious acts of sabotage against the organization,
or terrorist activity against society in general.
The need for a disaster tolerant system depends on the risk of disaster and how long
the business can afford applications to be out of action. The location and geographical
proximity of data centers plus the nature of the business affect risk. The vulnerability of
the business to disaster depends upon the importance of the system to the business as
a whole and the nature of the system. Service level agreements (SLAs) for the availability
of a system dictate the need for disaster testing. For example, a real-time message-
based transaction processing application that has to be operational 24/7 needs to be
recovered faster than a management information system with a less stringent SLA.
System Architecture
In a server grid, for example, although a failed workflow has to be manually recovered
if one of the servers unexpectedly shuts down, other servers in the grid should be
available to rerun it, unless a catastrophic network failure occurs.
The guideline is to aim to avoid single points of failure in a system where possible.
Clustering and server grid solutions alleviate single points of failure. Be aware that
single physical points of failure are often hardware and network related. Be sure to
have backup facilities and spare components available, for example auxiliary
generators, spare network cards, cooling systems; even a torch in case the lights go
out!
Perhaps the greatest risk to a system is human error. Businesses need to provide
proper training for all staff involved in maintaining and supporting the system. Also be
sure to provide documentation and procedures to cope with common support issues.
Remember a single mis-typed command or clumsy action can bring down a whole
system.
After disaster tolerance and system architecture have been considered, you can begin
to prepare the disaster test plan. Allow sufficient time to prepare the plan. Disaster
testing requires a significant commitment in terms of staff and financial resources.
Therefore, the test plan and activities should be precise, relevant, and achievable.
The test plan identifies the overall test objectives; consider what the test goals are and
whether they are worthwhile for the allocated time and resources. Furthermore, the
plan explains the test scope, establishes the criteria for measuring success, specifies
any prerequisites and logistical requirements (e.g., the test environment), includes test
scripts, and clarifies roles and responsibilities.
Test Scope
Test scope identifies the exact systems and functions to be tested. There may not be
time to test for every possible disaster scenario. If so the scope should list and explain
why certain functions or scenarios cannot be tested.
Focus on the stress points for each particular application when deciding on the test
scope. For example, in a typical data warehouse it is quite easy to recover data during
the batch load by re-running the load processes.
Success Criteria
In theory, success criteria can be measured in several ways. Success can mean
identifying a weakness in the system highlighted in the test cycle or successfully
executing a series of scripts to recover critical processes that were impacted by the
disaster test case.
Use SLAs to help establish quantifiable measures of success; SLAs specific to disaster
recovery should already exist.
In general, if the disaster testing results meet or beat the SLA standards, then the
exercise can be considered a success.
Try to prepare a dedicated environment for disaster testing. As new applications are
created and improved, they should be tested in the isolated disaster-testing
environment. It is important to regularly test for disaster tolerance, particularly if new
hardware and/or software components are introduced to the system being tested.
Make sure that the testing environment is kept up to date with code and infrastructure
changes that are being applied in the normal system testing environment(s).
The test schedule is important because it explains what will happen and when. For
example, if the electricity supply is going to be turned off or the plug pulled on a
particular server, it must be scheduled and communicated to all concerned parties.
Test Scripts
The disaster test plan should include test scripts, detailing the actions and activities
required to actually conduct the technical tests. These scripts can be simple or
complex, and can be used to provide instructions to test participants. The test scripts
should be prepared by the business analysts and application developers.
Ensure that the test plan is approved by the appropriate staff members and business
groups.
Disaster test execution should expose any flaws in the system architecture or in the
test plan itself. The testing team should be able to run the tests based on the
information within the test plan and the instructions in the test scripts.
Any deficiencies in this area need to be addressed because a good test plan forms the
basis of an overall disaster recovery strategy for the system.
The test team is responsible for capturing and logging test results. It needs to
communicate any issues in a timely manner to the application developers, business
analysts, end-users, and system architects.
It is advisable to involve other business and IT departmental staff in the testing where
possible, not just the department members who planned the test. If other staff can
understand the plan and successfully recover the system by following it, then the
impact of a real disaster is reduced.
While data migration projects don't require a full-blown disaster recovery solution,
it is recommended to establish a disaster recovery plan. Typically this is a simple
document to identify emergency procedures to follow if something were to happen to
any of the major pieces of infrastructure. Additionally, a back-out plan should be
present in the event the migration must stop mid-stream during the final implementation
weekend.
Disaster testing is a critical aspect of the overall system testing strategy. If conducted
properly, disaster testing provides valuable feedback and lessons that will prove
important if a real disaster strikes.
Best Practices
Sample Deliverables
None
Description
Basic volume testing seeks to verify that the system can cope with anticipated
production data levels. Taken to extremes, volume testing seeks to find the physical
and logical limits of a system; this is also known as stress testing. Stress and volume
testing seek to determine when and if system behavior changes as the load increases.
A volume testing exercise is similar to a disaster testing exercise. The test scenarios
encountered may never happen in the production environment. However, a well-
planned and conducted test exercise provides invaluable reassurance to the business
and IT communities regarding the stability and resilience of the system.
Prerequisites
None
Roles
Considerations
Before starting the volume test exercise, consider the Service Level Agreements (SLAs) that apply to the system.
Estimate Projected Data Volumes Over Time and Consider Peak Load Periods
Enlist the help of the DBAs and Business Analysts to estimate the growth in projected
data volume across the lifetime of the system. Remember to make allowances for any
data archiving strategy that exists in the system. Data archiving helps to reduce the
volume of data in the actual core production system, although of course, the net
volume of data will increase over time. Use the projected data volumes to provide
benchmarks for testing.
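As a worked illustration of such an estimate, the following Python sketch projects the net rows held in the core system when rows older than the archive window are moved out; all figures are assumptions for the example, not real estimates:

```python
# Illustrative projection: starting volume, monthly growth and archive
# window are assumed figures, not real estimates.
def project_volumes(start_rows, monthly_growth, archive_after_months, months):
    """Return rows held in the core system per month: new rows accumulate,
    rows older than the archive window are moved out to the archive."""
    monthly = [start_rows] + [monthly_growth] * (months - 1)
    held = []
    for m in range(months):
        window = monthly[max(0, m - archive_after_months + 1): m + 1]
        held.append(sum(window))
    return held

held = project_volumes(1_000_000, 200_000, archive_after_months=12, months=24)
print(held[-1])  # 2400000: a steady state of 12 months' growth retained
```

Projections like this give the benchmark figures to test against; the net volume across core plus archive still grows over time, as noted above.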
Data matching is also a processor-intensive activity: the speed of the processor has a
significant impact on how fast a matching process completes. If the project includes
data quality operations, consult with a Data Quality Developer when estimating data
volumes over time and peak load periods.
Volume test planning is similar in many ways to disaster test planning. See 6.3.4
Conduct Disaster Recovery Testing for details on disaster test planning guidelines.
However, there are some volume-test specific issues to consider during the planning
stage:
The test team responsible for completing the end-to-end test plan should
ensure that the volume(s) of test data accurately reflect the production business
environment. Obtaining adequate volumes of data for testing in a non-
production environment can be time-consuming and logistically difficult, so
remember to make allowances in the test plan for this.
Some organizations choose to copy data from the production environment into
the test system. Security protocols need to be maintained if data is copied from
a production environment, since sensitive data is likely to need scrambling.
Some of the popular RDBMS products contain built-in scrambling packages;
third-party scrambling solutions are also available. Contact the DBA and the IT
security manager for guidance on the data scrambling protocol of the
department or organization.
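To illustrate the principle, the following Python sketch shows a deterministic, irreversible masking function that preserves referential consistency across tables; the column names are invented, and real projects should use the DBA-approved scrambling package rather than a hand-rolled routine:

```python
# Hypothetical masking sketch; column names are invented, and a real project
# should use the approved RDBMS or third-party scrambling package.
import hashlib

def mask_name(value, salt="test-env"):
    """Deterministic, irreversible replacement that keeps joins consistent."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "CUST_" + digest[:8].upper()

row = {"cust_id": 42, "name": "Jane Smith", "balance": 120.5}
masked = {**row, "name": mask_name(row["name"])}
assert mask_name("Jane Smith") == mask_name("Jane Smith")  # stable across tables
assert masked["cust_id"] == 42                             # keys left untouched
```

Determinism matters here: the same source value always masks to the same output, so joins between scrambled tables still work in the test environment.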
For new applications, production data probably does not exist. Some
commercially-available software products can generate large volumes of data.
Alternatively, one of the developers may be able to build a customized suite of
programs to artificially generate data.
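A customized generator of this kind can be very simple; the following Python sketch (field names and value distributions are assumptions for the example) produces reproducible synthetic order rows and writes them as CSV:

```python
# Illustrative generator: field names and distributions are assumptions.
import csv, io, random

def generate_orders(n, seed=0):
    """Yield reproducible synthetic (order_id, cust_id, amount) rows."""
    rng = random.Random(seed)                    # fixed seed => repeatable volumes
    for i in range(n):
        yield (i, rng.randint(1, 50_000), round(rng.uniform(1, 5000), 2))

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["order_id", "cust_id", "amount"])
writer.writerows(generate_orders(10_000))
print(buf.getvalue().count("\n"))  # 10001: header plus 10,000 data rows
```

Seeding the generator makes each test cycle repeatable, which matters when comparing run statistics between volume test iterations.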
Volume testing cycles need to include normal expected volumes of data and
some exceptionally high volumes of data. Incorporate peak period loads into the
volume testing schedules. If stress tests are being carried out, data
volumes need to be increased even further. Additional pressure can be applied
to the system, for example, by adding a high number of database users or
temporarily bringing down a server.
The volume testing team is responsible for capturing volume test results. Be
sure to capture performance statistics for PowerCenter tasks, database
throughput, server performance and network efficiency.
If jobs and tasks are being run through a scheduling tool, use the features
within the scheduling tool to capture lapse time data. Alternatively, use shell
scripts or batch file scripts to retrieve time and process data from the operating
system.
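Where the scheduler cannot capture lapse times, a small wrapper script can record them; this Python sketch times an arbitrary command (the `echo` command here is a placeholder for the real load job or script):

```python
# Simple lapse-time wrapper; the echo command is a placeholder for the
# real load job or script being timed.
import subprocess, time

def timed_run(cmd):
    """Run a command and return its exit code and wall-clock duration."""
    start = time.monotonic()
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode, time.monotonic() - start

rc, secs = timed_run(["echo", "load complete"])
print(f"exit={rc} elapsed={secs:.2f}s")
```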
If the system has been well-designed and built, the applications are more likely
to perform in a predictable manner as data volumes increase. This is known as
scalability and is a very desirable trait in any software system.
Eventually however, the limits of the system are likely to be exposed as data
volumes reach a critical mass and other stresses are introduced into the
system. Physical or user-defined limits may be reached on particular
parameters, for example, exceeding the maximum file size supported on an operating system.
Bottlenecks are likely to appear in the load processes before such limits are
exceeded. For example, a SQL query called in a PowerCenter session may
experience a sudden drop in performance when data volumes reach a
threshold figure. The DBA and application developer need to investigate any
sudden drop in the performance of a particular query. Volume and stress testing
is intended to gradually increase the data load in order to expose weaknesses
in the system as a whole.
Conclusion
Volume and stress testing are important aspects of the overall system testing strategy.
The test results provide important information that can be used to resolve issues before
they occur in the live system.
However, be aware that it is not possible to test all scenarios that may cause the
system to crash. A sound system architecture and well-built software applications can
help prevent sudden catastrophic errors.
Best Practices
None
Sample Deliverables
None
Description
User Acceptance Testing (UAT) is arguably the most important step in the project and
is crucial to verifying that the system meets the users' requirements. Being business
usage-focused, it tests against the business requirements rather than all the
details of the technical specification. As such, UAT is considered black box testing (i.e.,
without knowledge of all the underlying logic) that focuses on the deliverables to the
end user, primarily through the presentation layer. UAT is the responsibility of the user
community in terms of organization, staffing and final acceptance, but much of the
preparation will have been undertaken by IT staff working to a plan agreed with the
users. The function of the user acceptance testing is to obtain final functional approval
from the user community for the solution to be deployed into production. As such,
every effort must be made to replicate the production conditions.
Prerequisites
None
Roles
Considerations
Plans
By this time User Acceptance Criteria should have been precisely defined by the user
community as well, of course, as the specific business objectives and requirements for
the project. UAT Acceptance Criteria should include
As the testers may not have a technical background, the plan should include detailed
procedures for testers to follow. The success of UAT depends on having certain critical
items in place:
It is important that the user acceptance testers and their management are thoroughly
committed to the new system and ensuring its success. There needs to be
communication with the user community so that they are informed of the project’s
progress and able to identify appropriate members of staff to make available to carry
out the testing. These participants will become the users most equipped to adopt the
new system and so should be considered “super-users” who may participate in user
training thereafter.
Best Practices
None
Sample Deliverables
None
Description
Tuning a system can, in some cases, provide orders of magnitude performance gains.
However, tuning is not something that should just be performed after the system is in
production; rather, it is a concept of continual analysis and optimization. More
importantly, tuning is a philosophy. The concept of performance must permeate all
stages of development, testing, and deployment. Decisions made during the
development process can seriously impact performance and no level of production
tuning can compensate for an inefficient design that must be redeveloped.
The information in this section is intended for use by Data Integration Developers, Data
Quality Developers, Database Administrators, and System Administrators, but should
be useful for anyone responsible for the long-term maintenance, performance, and
support of PowerCenter Sessions, Data Quality Plans, PowerExchange
Connectivity and Data Analyzer Reports.
Prerequisites
None
Roles
Considerations
Tuning the performance of the Data Integration environment involves more than simply
tuning PowerCenter or any other Informatica product. True system performance
analysis requires looking at all areas of the environment to determine opportunities for
better performance from relational database systems, file systems, network bandwidth,
and even hardware. The tuning effort requires benchmarking, followed by small
incremental tuning changes to the environment, then re-executing the
benchmarked data integration processes to determine the effect of the tuning changes.
Often, tuning efforts mistakenly focus on PowerCenter as the only point of concern
when there may be other areas causing the bottleneck and needing attention. If you
are sourcing data from a relational database for example, your data integration loads
can never be faster than the source database can provide data. If the source database
is poorly indexed, poorly implemented, or underpowered - no amount of downstream
tuning in PowerCenter, hardware, network, file systems etc. can fix the problem of slow
source data access. Throughout the tuning process, the entire end-to-end process
must be considered and measured. The unit of work being baselined may be a single
PowerCenter session for example, but it is always necessary to consider the end-to-
end process of that session in the tuning efforts.
Best Practices
None
Sample Deliverables
None
Description
Benchmarking involves the process of running sessions or reports and collecting run
statistics to set a baseline for comparison. The benchmark can be used as the standard
for comparison after the session or report is tuned for performance. When determining
a benchmark, the two key statistics to record are the total run time and the rows-per-second throughput.
Prerequisites
None
Roles
Considerations
After choosing a set of mappings, create a set of new sessions that use the default
settings. Run these sessions when no other processes are running in the background.
Tip: Tracking Results
Track two values for rows per second throughput: rows per second as calculated
by PowerCenter (from transformation statistics in the session properties), and the
average rows processed per second (the total number of rows loaded divided by the
total time duration).
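A worked example of the two values, using assumed figures from a benchmark run:

```python
# Assumed figures for illustration only.
rows_loaded = 1_200_000
transform_seconds = 400   # active transformation time reported by the engine
total_seconds = 600       # wall-clock duration of the whole run

engine_rows_per_sec = rows_loaded / transform_seconds    # 3000.0
average_rows_per_sec = rows_loaded / total_seconds       # 2000.0
```

The gap between the two values is itself informative: it shows how much of the run is spent outside the transformation pipeline (start-up, reads, commits).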
If it is not possible to run the session without background processes, schedule the
session to run daily at a time when few other processes are running on the
server. Be sure that the session runs at the same time each day or night, both for
benchmarking and for future tests.
Track the performance results in a spreadsheet over a period of days or for several
runs. After the statistics are gathered, compile the averages of the results in a new
spreadsheet. Once the average results are calculated, identify the sessions that have
the lowest throughput or that miss their load window. These sessions are the first
candidates for performance tuning.
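The averaging and ranking step can be sketched as follows, assuming the run statistics have been collected into a simple mapping of session name to observed rows-per-second values (the session names and figures here are invented):

```python
from statistics import mean

# run_history: session name -> rows/sec observed on each benchmark run.
run_history = {
    "s_load_orders":    [4200, 3900, 4100],
    "s_load_customers": [950, 1020, 980],
    "s_load_products":  [2600, 2550, 2700],
}

# Average the runs, then list sessions from slowest to fastest --
# the slowest sessions are the first candidates for tuning.
averages = {name: mean(runs) for name, runs in run_history.items()}
candidates = sorted(averages, key=averages.get)
```

Sessions that miss their load window should be added to the candidate list regardless of where they rank on throughput.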
When the benchmark is complete, the sessions should be tuned for performance. It
should be possible to identify potential areas for improvement by considering the
machine, network, database, and PowerCenter session and server process.
Data Analyzer benchmarking should focus on the time taken to run the source query.
Best Practices
None
Sample Deliverables
None
Description
The goal of this subtask is to identify areas for improvement, based on the performance
benchmarks established in Subtask 6.5.1, Benchmark.
Prerequisites
None
Roles
Considerations
After performance benchmarks are established (in 6.5.1, Benchmark), careful analysis
of the results can reveal areas that may be improved through tuning. It is important to
consider all possible areas for improvement, including:
The actual tuning process can begin after the areas for improvement have been
identified and documented.
For data migration projects, other considerations must be included in the performance
tuning activities. Many ERP applications use a two-step process in which the data is
loaded through simulated on-line processes: an API is executed that replicates, in
batch, the way on-line entry works, executing all edits. In such a case, performance will
not be the same as when a relational database is populated directly. The best approach
to performance tuning is to set the expectation that all data errors should be identified
and corrected in the ETL layer prior to the load to the target application. This approach
can improve performance by as much as 80%.
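The recommended approach, identifying and correcting data errors in the ETL layer before the API load, can be sketched as a simple pre-load validation split. The rules and field names below are hypothetical:

```python
def split_valid(rows, rules):
    """Partition rows into (valid, rejected) before the target load.

    rules is a list of (description, predicate) pairs; a row is valid
    only if every predicate passes. Rejected rows carry the failure
    reasons so they can be corrected and resubmitted in the ETL layer
    instead of failing inside the target application's load API.
    """
    valid, rejected = [], []
    for row in rows:
        failures = [desc for desc, ok in rules if not ok(row)]
        if failures:
            rejected.append((row, failures))
        else:
            valid.append(row)
    return valid, rejected

# Hypothetical validation rules for an order feed.
rules = [
    ("missing customer id", lambda r: bool(r.get("cust_id"))),
    ("negative amount",     lambda r: r.get("amount", 0) >= 0),
]
valid, rejected = split_valid(
    [{"cust_id": "C1", "amount": 10.0},
     {"cust_id": "",   "amount": -5.0}],
    rules)
```

Only the rows in `valid` would be passed to the application's load API; the rejected rows are reported for correction upstream.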
Sample Deliverables
None
Description
The goal of this subtask is to implement system changes to improve overall system
performance, based on the areas for improvement that were identified and documented
in Subtask 6.5.2, Identify Areas for Improvement.
Prerequisites
None
Roles
Considerations
2. Re-run the session and monitor the performance details, watching the buffer
input and outputs for the sources and targets.
3. Tune the source system and target system based on the performance details.
Once the source and target are optimized, re-run the PowerCenter session or
Data Analyzer report to determine the impact of the changes.
4. Only after the server, source, and target have been tuned to their peak
performance should the mapping and session be analyzed for tuning. In most
cases, the mapping is driven by business rules; since those rules are usually
dictated by the business unit in concert with the end-user community, it is rare
that the mapping itself can be greatly tuned. Points to look for in tuning
mappings are: filtering unwanted data early, using cached lookups, eliminating
aggregators through programming finesse, and using sorted input on certain
active transformations. For more details on tuning mappings and sessions, refer
to the Best Practices.
5. After the tuning achieves the desired level of performance, the DTM (data
transformation manager) process should be the slowest portion of the session
details. This indicates that the source data is arriving quickly, the target is
inserting the data quickly, and the actual application of the business rules is the
slowest portion. This is the optimal state. Only minor tuning of the session can
be conducted at this point, and it usually has only a minimal effect.
6. Finally, re-run the benchmark sessions, comparing the new performance with
the old. In some cases, optimizing one or two sessions to run quickly can have a
disastrous effect on another mapping; take care to ensure that this does not
occur.
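The cross-session check in step 6 can be automated with a small comparison of the before- and after-tuning benchmarks. In this sketch the session names, throughput figures, and the 5% tolerance are assumptions for illustration:

```python
def compare_benchmarks(before, after, tolerance=0.05):
    """Compare per-session rows/sec before and after tuning.

    Returns the sessions whose throughput dropped by more than the
    tolerance -- the cross-session regressions step 6 warns about.
    """
    regressions = {}
    for name, old in before.items():
        new = after.get(name, 0.0)
        if new < old * (1 - tolerance):
            regressions[name] = (old, new)
    return regressions

before = {"s_orders": 4000, "s_customers": 1000}
after  = {"s_orders": 5200, "s_customers": 700}   # customers slowed down
regressions = compare_benchmarks(before, after)
```

Any session appearing in `regressions` should be re-examined before the tuning changes are promoted.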
Best Practices
Sample Deliverables
None
Description
The goal of this subtask is to identify areas where changes can be made to improve the
performance of Data Analyzer reports.
Prerequisites
None
Roles
Considerations
Database Performance
1. Generate the SQL for each report and run it through the database's explain facility to examine the execution plan for potential bottlenecks.
2. Analyze SQL requests made against the database to identify common patterns
with user queries. If you find that many users are running aggregations against
detail tables, consider creating an aggregate table in the database and perform
the aggregations via ETL processing. This will save time when the user runs the
report as the data will already be aggregated.
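As an illustration of both points, the sketch below uses Python's built-in sqlite3 module; a production warehouse would use its own database's explain facility (for example, EXPLAIN PLAN in Oracle) and would build the aggregate table in the ETL load rather than ad hoc. The table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (region TEXT, amount REAL)")

# A typical user report query that aggregates the detail table.
report_sql = "SELECT region, SUM(amount) FROM sales_detail GROUP BY region"

# Point 1: inspect the execution plan before tuning.
plan = conn.execute("EXPLAIN QUERY PLAN " + report_sql).fetchall()

# Point 2: if many reports run this aggregation, precompute an
# aggregate table (in practice, during the ETL load) so reports
# read pre-summarized rows instead of scanning the detail table.
conn.execute("""CREATE TABLE sales_by_region AS
                SELECT region, SUM(amount) AS amount
                FROM sales_detail GROUP BY region""")
```

The report would then be pointed at `sales_by_region`, saving the aggregation cost on every run.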
Report Performance
1. Within Data Analyzer, use filters within reports as much as possible,
restricting the data returned as much as you can. Also try to architect reports to
start with a high-level query, then provide analytic workflows to drill down to
more detail. Data Analyzer report rendering performance is directly related to
the number of rows returned from the database.
2. If the data within the report does not get updated frequently, make the report a
cached report. If the data is being updated frequently, make the report a
dynamic report.
3. Try to avoid sectional reports as much as possible, since they take more time
to render.
4. Schedule reports to run during off-peak hours. Reports run in batches can use
considerable resources; subject to other dependencies, such reports should be
run when use of the system is lightest.
Application Server Performance
1. Fine-tune the application server Java Virtual Machine (JVM) to correspond with
the recommendations in the Best Practice on Data Analyzer Configuration and
Performance Tuning. This can significantly enhance Data Analyzer's reporting
performance.
2. Ensure that the application server has sufficient CPU and memory to handle the
expected user load. Strawman estimates for CPU and memory are as follows:
Best Practices
None
Sample Deliverables
None
7 Deploy
Description
The deployment strategy developed during the Architect Phase is now put into action.
During the Build Phase, components are created that may require special initialization
steps and procedures. For the production deployment, checklists and procedures are
developed to ensure that crucial steps are not missed in the production cutover.
To the end user, this is where the fruits of the project are exposed and the end user
acceptance begins. Up to this point, developers have been developing data cleansing,
data transformations, load processes, reports, and dashboards in one or more
development environments. But whether a project team is developing the back-end
processes for a legacy migration project or the front-end presentation layer for a
metadata management system, deploying a data integration solution is the final step in
the development process.
Metadata, which is the cornerstone of any data integration solution, should play an
integral role in the documentation and training rollout to users. Not only is metadata
critical to the current data integration effort, but it will be integral to planned metadata
management projects down the road. After the solution is actually deployed, it must be
maintained to ensure stability and scalability.
All data integration solutions must be designed to support change as user requirements
and the needs of the business change. As data volumes grow and user interest
increases, organizations face many hurdles such as software upgrades, additional
functionality requests, and regular maintenance. Use the Deploy Phase as a guide to
deploying an on-time, scalable, and maintainable data integration solution that provides
business value to the user community.
Prerequisites
None
Considerations
None
Sample Deliverables
None
Description
The success or failure associated with deployment often determines how users and
management perceive the completed data integration solution. The steps involved in
planning and implementing deployment are, therefore, critical to project success. This
task addresses three key areas of deployment planning:
● Training
● Metadata documentation
● User documentation
Prerequisites
None
Roles
Considerations
Although training and documentation are considered part of the Deploy Phase, both
activities need to start early in the development effort and continue throughout the
project lifecycle. Neither can be planned nor implemented effectively without the
following:
Companies that have training and documentation groups in place should include
representatives of these groups in the project development team. Companies that do
not have groups in place need to assign resources on the project team to these tasks,
ensuring effective knowledge transfer throughout the development effort. Everyone
involved in the system design and build should understand the need for good
documentation and make it part of his or her everyday activities. This "in-process"
documentation then serves as the foundation for the training curriculum and user
documentation that is generated during the Deploy Phase.
Although most companies have training programs and facilities in place, it is sometimes
necessary to create these facilities to provide training on the data integration solution. If
this is the case, the determination to create a training program must be made as early
in the project lifecycle as possible, and the project plan must specify the necessary
resources and development time. Creating a new training program is a double-edged
sword: it can be quite time-consuming and costly, especially if additional personnel
and/or physical facilities are required, but it also gives project management the
opportunity to tailor a training program specifically for users of the solution rather than
"fitting" the training needs into an existing program.
Project management also needs to determine policies and procedures for documenting
and automating metadata reporting early in the deployment process rather than making
reporting decisions on-the-fly.
For Data Migration projects it is very important that the operations team has the tools
and processes to allow for a mass deployment of large amounts of code at one time, in
a consistent manner. Capabilities should include:
This is why team-based development is normally a part of any data migration project.
Best Practices
None
Sample Deliverables
None
Description
Companies often misjudge the level of effort and resources required to plan, create, and successfully
implement a user training program. In some cases, such as legacy migration initiatives, it may be that very
little training is required on the data integration component of the project. However, in most cases, multiple
training programs are required in order to address a wide assortment of user types and needs. For
example, when deploying a metadata management system, it may be necessary to train administrative
users, presentation layer users, and business users separately. When deploying a data conversion project,
on the other hand, it may only be necessary to train administrative users. Note also that users of data
quality applications such as Informatica Data Quality or Informatica Data Explorer will require training, and
that these products may be of interest to personnel at several layers of the organization.
The project plan should include sufficient time and resources for implementing the training program - from
defining the system users and their needs, to developing class schedules geared toward training as many
users as possible, efficiently and effectively, with minimal disruption of everyday activities.
In developing a training curriculum, it is important to understand that there is seldom a "one size fits all"
solution. The first step in planning user training is identifying the system users and understanding both their
needs and their existing level of expertise. It is generally best to focus the curriculum on the needs of
"average" users who will be trained prior to system deployment, then consider the specialized needs of
high-end (i.e., expert) users and novice users who may be completely unfamiliar with decision-support
capabilities. The needs of these specialized users can be addressed most effectively in follow-up classes.
Planning user training also entails ensuring the availability of appropriate facilities. Ideally, training should
take place on a system that is separate from the development and production environments. In most cases,
this system mirrors the production environment, but is populated with only a small subset of data. If a
separate system is not available, training can use either a development or production platform, but this
arrangement raises the possibility of affecting either the development efforts or the production data. In any
case, if sensitive production data is used in a training database, ensure appropriate security measures are
in place to prevent unauthorized users in training from accessing confidential data.
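One common security measure when populating the training subset from production data is to mask sensitive columns. A minimal sketch follows; the field names are assumptions:

```python
import hashlib

def mask_row(row, sensitive=("name", "ssn", "email")):
    """Return a copy of row with sensitive fields replaced by a
    deterministic, irreversible token. Masked values remain joinable
    across tables (same input -> same token), but confidential values
    never reach the training database."""
    masked = dict(row)
    for field in sensitive:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = "X-" + digest[:8]
    return masked

training_row = mask_row({"cust_id": 42, "name": "Jane Doe",
                         "ssn": "123-45-6789"})
```

Non-sensitive keys pass through unchanged, so referential integrity in the training subset is preserved.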
Prerequisites
None
Roles
Considerations
Successful training begins with careful planning. Training content and duration must correspond with end-
user requirements. A well-designed and well-planned training program is a "must have" for a data
integration solution to be considered successfully deployed.
● While the presentation layer is often the primary focus of training, data content and application
training are also important to business users. Many companies overlook the importance of training
users on the data content and application, providing only data access tool training. In this case,
users often fail to understand the full capabilities of the data integration system and the company
is unlikely to achieve optimal value from the system.
Careful curriculum preparation includes developing clear, attractive training materials, including good
graphics and well-documented exercise materials that encourage users to practice using the system
features and functions. Laboratory materials can make or break a training program by encouraging users to
try using the system on their own. Training materials that contain obvious errors or poorly documented
procedures actually discourage users from trying to use the system, as does a poorly-designed
presentation layer. If users do not gain confidence using the system during training, they are unlikely to use
the data integration solution on a regular basis in their everyday activities.
The training curriculum should include a post-training evaluation process that provides users with an
opportunity to critique the training program, identifying both its strengths and weaknesses and making
recommendations for future or follow-up training classes. The evaluation should address the effectiveness
of both the course and the trainer because both are crucial to the success of a training program.
As an example, the curriculum for a two-day training class on a data integration solution might look
something like this:
Curriculum Duration
Day 1
Lunch 1 hour
Day 2
Lunch 1 hour
Best Practices
None
Sample Deliverables
None
Description
Prerequisites
None
Roles
Considerations
During this subtask, it is important to decide what metadata to capture, how to access
it, and where to place change-control checkpoints in the process so that all changes to
the metadata are maintained.
From the developer's perspective, PowerCenter provides the ability to enter descriptive
information for all repository objects, sources, targets, and transformations. Moreover,
column-level descriptions, along with information about column size and scale,
datatypes, and primary keys, are stored in the repository. This enables business users
to maintain the actual business name and description of a field on a particular table,
which helps users in a number of ways: for example, it eliminates confusion about
which columns should be used for a calculation. 'C_Year' and 'F_Year' might be
column names on a table, but 'Calendar Year' and 'Fiscal Year' are more useful to
business users trying to calculate market share for the company's fiscal year.
Informatica does not recommend accessing the repository tables directly, even for
select access, because the repository structure can change with any product release.
Instead, Informatica provides several methods of gaining access to this data:
MX2 is a set of encapsulated objects that can communicate with the metadata
repository through a standard interface. These MX2 objects offer developers an
advanced object-based API for accessing and manipulating the PowerCenter
Repository from a variety of programming languages.
Best Practices
None
Sample Deliverables
None
Description
Good system and user documentation is invaluable for a number of data integration
system users, such as:
A well-documented project can save development and production team members time
and effort in getting the new system into production and bringing new employees up to
speed.
User documentation usually consists of two sets: one geared toward ad-hoc users,
providing details about the data integration architecture and configuration; and another
geared toward "push button" users, focusing on understanding the data, and providing
details on how and where they can find information within the system. This increasingly
includes documentation on how to use and/or access metadata.
Prerequisites
None
Roles
Considerations
To improve users' ability to access information effectively and to increase their
understanding of the content, many companies create resource groups within the
business organization. Group members attend detailed training sessions and work with
the documentation and training specialists to develop materials that are geared toward
the needs of typical, or frequent, system users like themselves. Such groups have two
benefits: they help to ensure that training and documentation materials are on-target for
the needs of the users, and they serve as in-house experts on the data integration
architecture, reducing users' reliance on the central support organization.
Best Practices
None
Sample Deliverables
None
Description
A comprehensive communication plan can ensure that all required people in the
organization are ready for the production deployment. Since many of these people are
outside the immediate data integration project team, it cannot be assumed that
everyone is up to date on the production go-live planning and timing. For example, you
may need to communicate with DBAs, IT infrastructure, web support teams, and other
system owners that have assigned tasks and monitoring activities during the first
production run. The communication plan ensures proper and timely communication
across the organization so there are no surprises when the production run is initiated.
Prerequisites
Roles
Considerations
The communication plan should specify who communicates what, to whom, and when.
It must include steps to take if a specific person on the plan is unresponsive, escalation
procedures, and emergency communication protocols (i.e., how the entire core project
team would communicate in a dire emergency). Since many go-live events occur over
weekends, it is also important to retain not only business contact information but also
weekend contact information, such as cell phone or pager numbers, in case a key
contact needs to be reached on a non-business day.
Best Practices
None
Sample Deliverables
Description
The Run Book contains detailed descriptions of the tasks from the punch list that
was used for the first production run. It details the tasks more explicitly for the
individual mock-run and final go-live production run.
Typically the punch list will be created for the first trial cutover or mock-run and the run
book will be developed during the first and second trial cutovers and completed by the
start of the final production go-live.
Prerequisites
Roles
Considerations
One of the biggest challenges in completing a run book (like completing an operations
manual) is providing an adequate level of detail. It is important to find a balance
between providing too much information, which makes the book unwieldy and unlikely
to be used, and providing too little detail, which could jeopardize the successful
execution of the tasks.
For data migration projects this is even more imperative, since there is normally only
one critical go-live event. This is the one chance to have a successful production go-
live without negatively impacting the operational systems that depend on the migrated
data. The run book is developed and refined during the trial cutovers and should
contain all the information necessary to ensure a successful migration. Go/no-go
procedure information is also included in the run book. For a data migration project, the
run book eliminates the need for the operations manual that accompanies most other
data integration solutions.
Best Practices
None
Sample Deliverables
Description
Before the deployment tasks are undertaken, however, it is necessary to determine the
organization's level of preparedness for the deployment and thoroughly plan end-user
training materials and documentation. If all prerequisites are not satisfactorily
completed, it may be advisable to delay the migration, training, and delivery of finalized
documentation rather than hurrying through these tasks solely to meet a predetermined
target delivery date.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
Before training can begin, company management must work with the development
team to review the training curricula to ensure that they meet the needs of the various
application users. First, however, management and the development team need to
understand who the users are and how they are likely to use the application.
Application users may include individuals who have reporting needs and need to
understand the presentation layer; operational users who need to review the content
being delivered by a data conversion system; administrative users managing the
sourcing and delivery of metadata across the enterprise; production operations
personnel responsible for day-to-day operations and maintenance; and more.
After the training curricula are planned and users are scheduled to attend classes
appropriate to their needs, a training environment must be prepared for the training
sessions. This involves ensuring that a "laboratory" environment is set up properly for
multiple concurrent users, and that clean data is available in that environment. If
the presentation layer is not ready or the data appears incomplete or inaccurate, users
may lose interest in the application and choose not to use it for their regular business
tasks. This lack of interest can result in an underutilized resource critical to business
success.
It is also important to prevent untrained users from accessing the system; otherwise,
the support staff is likely to be overburdened, spending a significant amount of time
providing on-the-job training to uneducated users.
Prerequisites
None
Roles
Considerations
It is important to consider the many and varied roles of all application users when
planning user training. The user roles should be defined up-front to ensure that
everyone who needs training receives it. If the roles are not defined up-front, some key
users may not be properly trained, resulting in a less-than-optimal hand-off to the user
departments. For example, in addition to training obvious users such as the operational
staff, it may be important to consider users such as DBAs, data modelers, and
metadata managers, at least from a high-level perspective, and ensure that they
receive appropriate training.
The training curricula should educate users about the data content as well as the
effective use of the data integration system. While correct and effective use of the
system is important, a thorough understanding of the data content helps to ensure that
training moves along smoothly without interruption for ad-hoc questions about the
meaning or significance of the data itself. Additionally, it is important to remember that
no one training curriculum can address all needs of all users. The basic training class
should be geared toward the average user with follow-up classes scheduled for those
users needing training on the application's advanced features.
It is also wise to schedule follow-up training for data and tool issues that are likely to
arise after the deployment is complete and the end-users have had time to work with
the tools and data. This type of training can be held in informal "question and answer"
sessions rather than formal classes.
Finally, be sure that training objectives are clearly communicated between company
management and the development team to ensure complete satisfaction with the
training deliverable. If the training needs of the various user groups vary widely, it may
be necessary to obtain additional training staff or services from a vendor or consulting
firm.
Best Practices
None
Training Evaluation
Description
● Pre-deployment phase
● Deployment phase
● Post-deployment phase
While there are multiple tasks to perform in the deployment process, the actual
migration phase consists of moving objects from one environment to another. A
migration can include the following objects:
Prerequisites
None
Roles
Considerations
The tasks below should be completed before, during, and after the migration to ensure
a successful deployment. Failure to complete one or more of these tasks can result in
an incomplete or incorrect deployment.
Pre-deployment tasks:
● Ensure all objects have been successfully migrated and tested in the Quality
Assurance environment.
● Ensure the Production environment is compliant with specifications and is
ready to receive the deployment.
● Obtain sign-off from the deployment team and project teams to deploy to the
Production environment.
● Obtain sign-off from the business units to migrate to the Production
environment.
Deployment tasks:
Post-deployment tasks:
Best Practices
Deployment Groups
Sample Deliverables
Description
● Gathering all of the various documents that have been created during the life
of the project;
● Updating and/or revising them as necessary, and
● Distributing them to the departments and individuals that will need them to use
or supervise use of the application. By this point, management should have
reviewed and approved all of the documentation.
Documentation types and content vary widely among projects, depending on the type
of engagement, expectations, scope of the project, and so forth. Typical deliverables
include those listed in the Sample Deliverables section.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
8 Operate
Description
During its day-to-day operations the system continually faces new challenges such as
increased data volumes, hardware and software upgrades, and network or other
physical constraints. The goal of this phase is to keep the system operating smoothly
by anticipating these challenges before they occur and planning for their resolution.
Planning is probably the most important task in the Operate Phase. Often, the project
team plans the system's development and deployment, but does not allow adequate
time to plan and execute the turnover to day-to-day operations. Many companies have
dedicated production support staff with both the necessary tools for system monitoring
and a standard escalation process. This team requires only the appropriate system
documentation and lead time to be ready to provide support. Thus, it is imperative for
the project team to acknowledge this support capability by providing ample time to
create, test, and turn over the deliverables discussed throughout this phase.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
In this task, the project team produces an Operations Manual, which tells system
operators how to run the system on a day-to-day basis. The manual should include
information on how to restart failed processes and who to contact in the event of a
failure. In addition, this task should produce guidelines for performing system upgrades
and other necessary changes to the system throughout the project's lifetime. Note that
this task must occur prior to the system actually going live. The production support
procedures should be clear to system operators even before the system is in
production, because any production issues that are going to arise will probably do so
very shortly after the system goes live.
Prerequisites
None
Roles
Considerations
The watchword here is: Plan Ahead. Most organizations have well-established and
documented system support procedures in place. The support procedures for the
solution should fit into these existing procedures, deviating only where absolutely
necessary, and then only with the prior knowledge and approval of the Project
Manager and Production Supervisor. Any such deviations should be determined and
documented as early as possible in the development effort, preferably before the
system actually goes live. Be sure to thoroughly document specific procedures and
contact information for problem escalation, especially if the procedures or contacts
deviate from the organization's standard practice.
Best Practices
None
Sample Deliverables
None
Description
After the system is deployed, the Operations Manual is likely to be the most frequently
used document in the operations environment. The system operators, the individuals
who monitor the system on a day-to-day basis, use this manual to determine how to
run the various pieces of the implemented solution. In addition, the manual provides the
operators with error processing information, as well as reprocessing steps in the event
of a system failure.
The Operations Manual should contain a high-level overview of the system, to
familiarize the operations staff with new concepts, along with the specific details
necessary to execute day-to-day operations successfully. For data visualization
solutions, the manual should also contain high-level explanations of reports,
dashboards, and shared objects.
For a data visualization or metadata reporting solution, the manual should include
details on the following:
Operations manuals for all projects should provide information for performing the
following tasks:
● Start servers
● Stop servers
● Notify the appropriate second-tier support personnel in the event of a serious
system malfunction
● Test the health of the reporting and/or data integration environment (i.e., check
DB connections to the repositories, source and target databases / files and
real time feeds, check CPU and memory usage on the PowerCenter and Data
Analyzer servers).
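A health test of this kind can be scripted for the Operations Manual. The sketch below checks TCP reachability of a repository database and local disk headroom; the host and port are placeholders, not real repository settings:

```python
import shutil
import socket

def check_tcp(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds -- a cheap test
    that a repository or source database endpoint is reachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_disk(path="/", min_free_bytes=1 << 30):
    """True if the filesystem holding `path` has at least the
    required headroom (default 1 GB) left."""
    return shutil.disk_usage(path).free >= min_free_bytes

# Placeholder endpoints -- substitute the real repository/DB hosts.
checks = {
    "repository_db": lambda: check_tcp("localhost", 1521),
    "disk_space":    check_disk,
}
results = {name: fn() for name, fn in checks.items()}
healthy = all(results.values())
```

The `results` dictionary gives the operator a per-check status to record or escalate; CPU and memory checks could be added in the same pattern.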
Prerequisites
None
Roles
Considerations
A draft version of the Operations Manual can be started during the Build Phase as the
developers document the individual components. Documents such as mapping
specifications, report specifications, and unit and integration testing plans contain a
great deal of information that can be transferred into the Operations Manual. Bear in
mind that data quality processes are executed earlier, during the Design Phase,
although the Data Quality Developer and Data Integration Developer will be available
during the Build Phase to agree on any data quality measures (such as ongoing run-
time data quality process deployment) that need to be added to the Operations Manual.
Restart and recovery procedures should be thoroughly tested and documented, and
the processing window should be calculated and published. Escalation procedures
should be thoroughly discussed and distributed so that members of the development
and operations staff are fully familiar with them. In addition, the manual should include
information on any manual procedures that may be required, along with step-by-step
instructions for implementing the procedures. This attention to detail helps to ensure a
smooth transition into the Operate Phase.
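As one example, the processing window to be published can be calculated from recorded session start and end times. The session times below are hypothetical:

```python
from datetime import datetime

def processing_window(session_times):
    """Given (start, end) datetimes for each session in the batch,
    return the overall window start, end, and total duration."""
    starts = [s for s, _ in session_times]
    ends = [e for _, e in session_times]
    return min(starts), max(ends), max(ends) - min(starts)

# Hypothetical session times for one nightly batch.
sessions = [
    (datetime(2024, 1, 5, 1, 0), datetime(2024, 1, 5, 2, 30)),
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 4, 15)),
]
start, end, duration = processing_window(sessions)
```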
Although it is important, the Operations Manual is not meant to replace user manuals
and other support documentation. Rather, it is intended to provide system operators
with a consolidated source of documentation to help them support the system. The
Operations Manual also does not replace proper training on PowerCenter, Data
Analyzer, and supporting products.
Best Practices
None
Sample Deliverables
Operations Manual
Description
After the data integration solution has been built and deployed, the job of running it
begins. For a data migration or consolidation solution, the system must be monitored to
ensure that data is being loaded into the database. A data visualization or metadata
reporting solution should be monitored to ensure that the system is accessible to the
end users. The goal of this task is to ensure that the necessary processes are in place
to facilitate the monitoring of and the reporting on the system's daily processes.
Prerequisites
None
Roles
Considerations
None
Best Practices
None
Sample Deliverables
None
Description
Once a Data Integration solution is fully developed, tested and signed off for production
it is time to execute the first run in the production environment. During the
implementation, the first run is key to a successful deployment. While the first run is
often similar to the on-going load process, it can be distinctively different. There are
often specific one-time setup tasks that need to be executed on the first run that will not
be part of the regular daily data integration process.
In most cases the first production run is a high-profile set of activities that must be
executed, documented, and improved for all future production runs. This run should
leverage a Punch List and should execute a set of tested workflows or scripts
(not manual steps such as executing a specific SQL statement for set-up).
It is important that the first run is executed successfully with limited manual interactions.
Any manual steps should be closely monitored, controlled, documented and
communicated.
This first run should be executed following the Punch List and should be revisited upon
completion of the execution.
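A simple runner for the Punch List makes the goal of limited manual interaction concrete: each tested workflow or script is executed in order, every step is logged, and execution stops at the first failure so it can be investigated. The step names below are hypothetical:

```python
def run_punch_list(steps):
    """Execute punch-list steps in order; each step is a (name, fn)
    pair returning True on success. Stop at the first failure and
    return the log of (name, status) entries."""
    log = []
    for name, fn in steps:
        ok = fn()
        log.append((name, "OK" if ok else "FAILED"))
        if not ok:
            break
    return log
```

The returned log doubles as the documentation of the run, to be revisited after execution as described above.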
Prerequisites
Roles
Considerations
For some projects (such as a data migration effort) the first production run is the only
production run. The system does not continue beyond it, since a data
migration by its nature requires a single movement of the production data. Further, the
set of tasks that make up the production run may not be executed again. Any future
runs will be a part of the execution that addresses a specific data problem, not the
entire batch.
For data warehouses, often the first production run may include loading historical data
as well as initial loads of code tables and dimension tables. The load process may
execute much longer than a typical on-going load due to the extra amount of data and
the different criteria it is run against to pick up the historical data. There may be extra
data validation and verification at the end of the first production run to ensure that the
system is properly initialized and ready for on-going loads. It is important to
plan and execute the first load carefully, as the subsequent periodic
refreshes of the data warehouse (daily, hourly, real time) depend on the setup and
success of the first production run.
Best Practices
None
Sample Deliverables
Operations Manual
Punch List
Description
Increasing data volume is a challenge throughout the life of a data integration solution.
As the data migration or consolidation system matures and new data sources are
introduced, the amount of data processed and loaded into the database continues to
grow. Similarly, as a data visualization or metadata management system matures, the
amount of data processed and presented increases. One of the operations team's
greatest tasks is to monitor the data volume processed by the system to determine any
trends that are developing.
If generated correctly, the data volume estimates used by the Technical Architect and
the development team in building the architecture should ensure that the system is
capable of growing to meet ever-changing business requirements. By continuously monitoring
volumes, however, the development and operations teams can act proactively as data
volumes increase. Monitoring affords team members the time necessary to determine
how best to accommodate the increased volumes.
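A simple illustration of such proactive monitoring: compute the average day-over-day growth in rows loaded and project roughly how long until a capacity limit is reached. The figures below are hypothetical:

```python
def growth_rate(daily_rows):
    """Average absolute day-over-day growth in rows loaded."""
    deltas = [b - a for a, b in zip(daily_rows, daily_rows[1:])]
    return sum(deltas) / len(deltas)

def days_until_capacity(current, capacity, rate):
    """Rough projection of days until capacity at the current rate;
    None when volume is flat or shrinking."""
    if rate <= 0:
        return None
    return (capacity - current) / rate
```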
Prerequisites
None
Roles
Considerations
The Session Run Details report can also be configured to display data over ranges of
time for trending. This information provides the project team with both a measure of the
increased volume over time and an understanding of the increased volume's impact on
the data load window.
Dashboards and alerts can be set to monitor loads on an on-going basis, alerting data
integration administrators if load times exceed specified thresholds. By customizing
the standard reports, Data Integration support staff can create any variety of monitoring
levels -- from individual projects to full daily load processing statistics -- across all
projects.
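The threshold check behind such an alert can be sketched as follows; the workflow names and limits are hypothetical placeholders for whatever the administrators configure:

```python
def check_load_times(run_times, thresholds):
    """Compare each workflow's run time (minutes) against its
    configured threshold; return the workflows that should alert."""
    alerts = []
    for workflow, minutes in run_times.items():
        limit = thresholds.get(workflow)
        if limit is not None and minutes > limit:
            alerts.append(workflow)
    return alerts
```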
Best Practices
None
Sample Deliverables
None
Description
After the data integration solution is deployed, the system operators begin the task of
monitoring the daily processes. For data migration and consolidation solutions, this
includes monitoring the processes that load the database. For presentation layers and
metadata management reporting solutions, this includes monitoring the processes that
create the end-user reports. This monitoring is necessary to ensure that the system is
operating at peak efficiency. It is important to ensure that any processes that stop, are
delayed, or simply fail to run are noticed and appropriate steps are taken.
Prerequisites
None
Roles
Considerations
Data Analyzer with Repository and Administration Reports installed can provide
information about session run details, average loading times, and server load trends by
day. Administrative and operational dashboards can display all vital metrics needing to
be monitored. They can also provide the project management team with a high-level view of system operations.
Large installations may already have monitoring software in place that can be adapted
to monitor the load processes of the analytic solution. This software typically includes
both visual monitors for the client desktop of the System Operator as well as electronic
alerts that can be programmed to contact various project team members.
Best Practices
None
Sample Deliverables
None
Description
The process of tracking change control requests is integral to the Operate Phase. It is
here that any production issues are documented and resolved. The change control
process allows the project team to prioritize the problems and create schedules for their
resolution and eventual promotion into the production environment.
Prerequisites
None
Roles
Considerations
Ideally, a change control process was implemented during the Architect Phase,
enabling the developers to follow a well-established process during the Operate
Phase. Many companies rely on a Configuration Control Board to prioritize and
approve work for the various maintenance releases.
The Change Control Procedure document, created in conjunction with the Change
Control Procedures in the Architect Phase should describe precisely how the project
team is going to identify and resolve problems that come to light during system
development or operation.
Most companies use a Change Request Form to kick off the Change Control
procedure. These forms should include the following:
Best Practices
None
Sample Deliverables
Description
One of the most important aspects of the Operate Phase is monitoring how and when
the organization's end users use the data integration solution. This subtask enables the
project team to gauge what information is the most useful, how often it is retrieved, and
what type of user generally requests it. All of this information can then be used to
gauge the system's return on investment and to plan future enhancements.
Monitoring the use of the presentation layer during User Acceptance Testing can
indicate bottlenecks. When the project is complete, Operations continues to monitor the
tasks to maintain system performance. The monitoring results can be used to plan for
changes in hardware and/or network facilities to support increased requests to the
presentation layer. For example, new requirements may be determined by the number
of users requesting a particular report or by requests for more or different information in
the report. These requirements may trigger changes in hardware capabilities and/or
network bandwidth.
Prerequisites
None
Roles
Considerations
Most business organizations have tools in place to monitor the use of their production
systems. Some end-user reporting tools have built-in reports for such purposes. The
project team should review the available tools, as well as software that may be bundled
with the RDBMS, and determine which tools best suit the project's monitoring needs.
Informatica provides tools and metadata sources that meet the need for monitoring
information from the presentation layer, as well as metadata on the processes used to
provide the presentation layer with data. This information can be extracted using
Informatica tools to provide a complete view of information presentation usage.
Best Practices
None
Sample Deliverables
None
Description
This subtask is concerned with data quality processes that may have been scoped into
the project for late-project or post-project use. Such processes are an optional
deliverable for most projects. However, there is a strong argument for building into the
project plan data quality initiatives that will outlast the project: ongoing monitoring
should be treated as a key deliverable because it provides a means to check the
existing data and ensure that previously identified data quality
issues do not recur. For new data
entering the system, monitoring provides a means to ensure that any new feeds do not
compromise the integrity of the existing data. Moreover, the processes created for the
Data Quality Audit task in the Analyze Phase may still be suitable for application to the
data in the Operate Phase, or may be suitable with a reasonable amount of tuning.
There are three types of data quality process relevant in this context:
This subtask is concerned with agreeing to a strategy to use any or all such processes
to validate the continuing quality of the business’ data and to safeguard against lapses
in data quality in the future.
Prerequisites
None
Roles
Considerations
Ongoing data quality initiatives bring the data quality process full-circle. This subtask is
the logical conclusion to a process that began with the performance of a Data Quality
Audit in the Analyze Phase and the creation of data quality processes (called plans in
Informatica Data Quality terminology) in the Design Phase.
The plans created during and after the Operate Phase are likely to be runtime or real-
time plans. A runtime plan is one that can be scheduled for automated, regular
execution (e.g., nightly or weekly). A real-time plan is one that can accept a live data
feed, for example, from a third-party application, and write output data back to a live
application.
Real-time plans are useful in data entry scenarios; they can be used to capture data
problems at the point of keyboard entry and thus before they are saved to the data
system. The real-time plan can be used to check data entries, pass them if accurate,
cleanse them of error, or reject them as unusable.
Runtime plans can be used to monitor the data stored to the system; these plans can
be run during periods of relative inactivity (e.g., weekends). For example, the Data
Quality Developer may design a plan to identify duplicate records in the system, and
the Developer or the system administrator can schedule the plan to run overnight. Any
duplication issues found in the system can be addressed manually or by other data
quality plans.
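A duplicate-identification pass of this kind might be sketched as follows; the key fields and the simple trim/lower-case normalization are illustrative, not a description of Informatica Data Quality's actual matching logic:

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records by a normalized key built from key_fields and
    return only the groups containing more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

The resulting groups could then be addressed manually or fed to other data quality plans, as described above.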
The Data Quality Developer must discuss the importance of ongoing data quality
management with the business early in the project, so that the business can decide
what data quality management steps to take within the project or outside of it.
The Data Quality Developer must also consider the impact that ongoing data quality
initiatives are likely to have on the business systems. Should the data quality plans be
deployed to several locations or centralized? Will the reference data be updated at
regular intervals and by whom? Can plan resource files be moved easily across the
enterprise? Once the project resources are released, these matters require a
committed strategy from the business. However, the results (clean, complete,
compliant data) are well worth the effort.
Best Practices
Sample Deliverables
None
Description
The goal in this task is to develop and implement an upgrade procedure to facilitate
upgrading the hardware, software, and/or network hardware that supports the overall
analytic solution. This plan should enable both the development and operations staff to
plan for and execute system upgrades in an efficient, timely manner, with as little
impact on the system's end users as possible. The deployed system incorporates
multiple components, many of which are likely to undergo upgrades during the system's
lifetime. Ideally, upgrading system components should be treated as a system change
and as such, use many of the techniques discussed in 8.2.4 Track Change Control
Requests. After these changes are prioritized and authorized by the Project Manager,
an upgrade plan should be developed and executed. This plan should include the tasks
necessary to perform the upgrades as well as the tasks necessary to update system
documentation and the Operations Manual, when appropriate.
Prerequisites
None
Roles
Considerations
Once the Build Phase has been completed, the development and operations staff
should begin determining how upgrades should be carried out. The team should
consider all aspects of the systems' architecture including any software and hardware
being used. Special attention should be paid to software release schedules, hardware
limitations, network limitations, and vendor release support schedules. This information
Best Practices
None
Sample Deliverables
None
Description
Prerequisites
None
Roles
Considerations
Exclusive Mode
The Repository Service executes in normal or exclusive mode. Running the Repository
Service in exclusive mode allows only one user to access the repository through the
Administration Console or the pmrep command line program.
Repository Backup
The Repository Service provides backup processing for repositories through the Administration Console or the pmrep command line program.
TIP
A simple approach to automating PowerCenter repository backups is to use the
pmrep command line program. Commands can be packaged and scheduled so
that backups occur on a desired schedule without manual intervention. The
backup file name should minimally include repository name and backup date
(yyyymmdd).
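A minimal sketch of this automation, assuming pmrep is on the PATH and a repository connection has already been established with pmrep connect; the .rep extension is an arbitrary choice:

```python
import subprocess
from datetime import date

def backup_filename(repo_name: str, on: date) -> str:
    """Build a backup file name containing, at minimum, the
    repository name and the backup date in yyyymmdd form."""
    return f"{repo_name}_{on.strftime('%Y%m%d')}.rep"

def backup_repository(repo_name: str) -> None:
    # Assumes pmrep is on the PATH and already connected; verify
    # the option syntax against your PowerCenter release.
    subprocess.run(
        ["pmrep", "backup",
         "-o", backup_filename(repo_name, date.today())],
        check=True,
    )
```

Wrapped in a scheduled job, a script like this produces a dated backup without manual intervention.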
TIP
Keep in mind that you cannot restore a single folder or mapping from a
repository backup. If, for example, a single important mapping is deleted by
accident, you need to obtain a temporary database space from the DBA in order
to restore the backup to a temporary repository DB. With the PowerCenter client
tools, copy the lost metadata, and then remove the temporary repository from
the database and the cache.
If the developers need this service often, it may be prudent to keep the
temporary database around all the time and copy over the development
repository to the backup repository on a daily basis in addition to backing up to a
file. Only the DBA should have access to the backup repository, and requests
should be made through the DBA.
Repositories may grow in size due to the execution of workflows, especially in large
projects. As the repository grows, response may become slower. Consider these
techniques to maintain a repository for better performance:
Audit Trail
Best Practices
Sample Deliverables
None
Description
Upgrading the application software of a data integration solution to a new release is a continuous operations task as
new releases are offered periodically by every software vendor. New software releases offer expanded functionality,
new capabilities, and fixes to existing functionality that can benefit the data integration environment and future
integration work. However, an upgrade can be a disruptive event since project work may halt while the upgrade
process is in progress.
Given that data integration environments often contain a host of different applications including Informatica
software, database systems, operating systems, EAI tools, BI tools, and other related technologies – an upgrade in
any one of these technologies may require an upgrade in any number of other software programs for the full system
to function properly. System architects and administrators must continually evaluate the new software offerings
across the various products in their data integration environment and balance the desire to upgrade with the impact
of an upgrade.
Software upgrades require a continuous assessment and planning process. A regular schedule should be defined
where new releases are evaluated on functionality and need in the environment. Once approved, upgrades must be
coordinated with on-going development work and on-going production data integration. Appropriate planning and
coordination of software upgrades allow a data integration environment to stay current on its technology stack with
minimal disruptions to production data integration efforts and development projects.
Prerequisites
None
Roles
Considerations
When faced with a new software release, the first consideration is to decide whether the upgrade is appropriate for
the data integration environment. The pros and cons of every upgrade decision typically include the following:
Pros:
● New functionality and features
● Bug fixes and refinements of existing functionality
● Often provides enhanced performance
● Support for older software releases is dropped, forcing an upgrade to maintain support
● May be required to support newer releases of other software in the environment
Cons:
● Disruptive to the development environment
● Disruptive to the production environment
● May require new training and adversely affect productivity
● May require other pieces of software to be upgraded to function properly
Architects sometimes decide to forgo a particular software version and skip ahead to the future releases if the
current release does not provide enough benefit to warrant the disruption to the environment. It is not uncommon
for data integration teams to skip minor releases (and sometimes even major releases) if they aren’t appropriate for
their environment or when the upgrade effort outweighs the benefits.
Whether you are in a production environment or still in development mode, an upgrade requires careful planning to
ensure a successful transition and minimal disruption. The following issues need to be factored into the overall
upgrade plan:
● Training - New releases of software often include new features and functionality that are likely to require
some level of training for administrators and developers. Proper planning of the necessary training can
ensure that employees are trained ahead of the upgrade so that productivity does not suffer once the new
software is in place. Because it is impossible to properly estimate and plan the upgrade effort if you do not
have knowledge of the new features and potential environment changes, best practice dictates training a
core set of architects and system administrators early in the upgrade process so they can assist in the
upgrade planning process.
● Environment Assessment - A future release of software may range from minimal architectural changes to
major changes in the overall data integration architecture. Investigation and strategy around potential
architecture changes should occur early. In PowerCenter for example, as the architecture has moved to a
Service-Oriented Architecture with high availability and failover, the underlying physical setup and location
of software components has changed from release to release. Planning for these architecture changes
allows users to take full advantage of the new features when the software upgrade is deployed. Often
these changes provide an opportunity to redesign and improve the existing architecture in coordination of
the software upgrade.
● Testing - Often more than 60 percent of the total upgrade time is devoted to testing the data integration
environment with the new software release. Ensuring that data continues to flow correctly, software
versions are compatible, and new features do not cause unexpected results requires detailed
testing. Developing a well thought-out test plan is crucial to a successful upgrade.
● New Features - A new software release likely includes new and expanded features that may create a need
to alter the current data integration processes. During the upgrade process, existing processes may be
altered to incorporate and implement the new features. Time is required to make and test these changes
as well. Reviewing the new features and assessing the impact on the upgrade process is a key pre-
planning step.
● Sandbox Upgrade - In environments with production systems, it is advisable to copy the production
environment to a ‘sandbox’ instance. The ‘sandbox’ environment should be as close to an exact copy of
production as possible, including production data. A software upgrade is then performed on the ‘sandbox
instance’ and data integration processes run on both the current production and the sandbox instance for a
period of time. In this way, results can be compared over time to ensure that no unforeseen differences
occur in the new software version. If differences do occur, they can be investigated, resolved,
and accounted for in the final upgrade plan.
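One simple form of that comparison is to diff per-table row counts between the production and sandbox runs after each load cycle; the table names below are hypothetical:

```python
def compare_runs(prod_counts, sandbox_counts):
    """Compare per-table row counts from the production and sandbox
    runs; return the tables whose counts differ, for investigation."""
    diffs = {}
    for table in set(prod_counts) | set(sandbox_counts):
        p, s = prod_counts.get(table), sandbox_counts.get(table)
        if p != s:
            diffs[table] = (p, s)
    return diffs
```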
Once a comprehensive plan for the upgrade is in place, the time comes to perform the actual upgrade on the
development, test, and production environments. The Installation Guides for each of the Informatica products and
online help provide instructions on upgrading and the step-by-step process for applying the new version of the
software. However, there are a few important steps to emphasize in the upgrade process:
A well-planned upgrade process is key to ensuring success during the transition from the current version to a new
version, with minimal disruption to the development and production environments. A smooth upgrade process
enables data integration teams to take advantage of the latest technologies and advances in data integration.
Best Practices
None
Sample Deliverables
None
Challenge
Using Data Analyzer's security architecture to establish a robust security system that
safeguards valuable business information across a range of technologies and security models, and
ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the
security infrastructure of a Business Intelligence environment at every level.
Description
Four main architectural layers must be completely secure: user layer, transmission layer, application
layer and data layer.
User Layer
Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the
following LDAP-compliant directory servers:
● SunOne/iPlanet Directory Server 4.1
● IBM SecureWay Directory 3.2
● IBM SecureWay Directory 4.1
In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing
authentication and access control for the various web applications in the organization.
Transmission Layer
The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security
protocol Secure Sockets Layer (SSL) to provide a secure environment.
Application Layer
Only appropriate application functionality should be provided to users with associated privileges. Data
Analyzer provides three basic types of application-level security:
● Report, Folder and Dashboard Security. Restricts access for users or groups to specific
reports, folders, and/or dashboards.
● Column-level Security. Restricts users and groups to particular metric and attribute columns.
● Row-level Security. Restricts users to specific attribute values within an attribute column of a
table.
Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data
Analyzer provides the following components for managing application layer security:
● Roles. A role can consist of one or more privileges. You can use system roles or create custom
roles. You can grant roles to groups and/or individual users. When you edit a custom role, all
users and groups assigned that role inherit the change.
Types of Roles
● System roles - Data Analyzer provides a set of roles when the repository is created. Each role
has sets of privileges assigned to it.
● Custom roles - The end user can create and assign privileges to these roles.
Managing Groups
Groups allow you to classify users according to a particular function. You may organize users into groups
based on their departments or management level. When you assign roles to a group, you grant the same
privileges to all members of the group. When you change the roles assigned to a group, all users in the
group inherit the changes. If a user belongs to more than one group, the user has the privileges from all
groups. To organize related users into related groups, you can create group hierarchies. With hierarchical
groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you
edit a group, all subgroups contained within it inherit the changes.
For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead
group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager
group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer
role privileges.
Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but
group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the
object.
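The inheritance and union rules above can be expressed as a small sketch; the group and role names come from the Lead/Manager example, while the functions themselves are illustrative and not Data Analyzer APIs:

```python
def effective_roles(user_groups, group_roles, parent_of):
    """Union of the roles granted to each of the user's groups,
    including roles inherited from ancestor groups."""
    roles = set()
    for group in user_groups:
        g = group
        while g is not None:
            roles |= group_roles.get(g, set())
            g = parent_of.get(g)
    return roles

# Names matching the Lead/Manager example above.
group_roles = {"Lead": {"Advanced Consumer"},
               "Manager": {"Manage Data Analyzer"}}
parent_of = {"Manager": "Lead"}
```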
If you use Windows Domain or LDAP authentication, you typically modify the users or groups in Data
Analyzer. However, some organizations keep only user accounts in the Windows Domain or LDAP
directory service, but set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer
provides a way for you to keep user accounts in the authentication server and still keep the groups in
Data Analyzer.
Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory
service, it updates the users and groups in the repository and deletes users and groups that are not found
in the Windows Domain or LDAP directory service.
To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the
web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and
manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.
The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR
file, use the EAR Repackager utility provided with Data Analyzer.
Note: Be sure to back up the web.xml file before you modify it.
1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the
following directory:
/custom/properties
2. Open the web.xml file with a text editor and locate the line containing the following property:
enableGroupSynchronization
The enableGroupSynchronization property determines whether Data Analyzer updates the groups
in the repository.
3. Set the value of the property to false:
<init-param>
<param-name>
InfSchedulerStartup.com.informatica.ias.
scheduler.enableGroupSynchronization
</param-name>
<param-value>false</param-value>
</init-param>
When the value of enableGroupSynchronization property is false, Data Analyzer does not
synchronize the groups in the repository with the groups in the Windows Domain or LDAP
directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer
updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows
Domain or LDAP authentication server. You must create and manage groups, and assign users to
groups in Data Analyzer.
Managing Users
Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a
user must have the appropriate privileges. You can assign privileges to a user with roles or groups.
Data Analyzer creates a System Administrator user account when you create the repository. The default
user name for the System Administrator user account is admin. The system daemon, ias_scheduler/
padaemon, runs the updates for all time-based schedules. System daemons must have a unique user
name and password in order to perform Data Analyzer system functions and tasks. You can change the
password for a system daemon, but you cannot change the system daemon user name via the GUI. Data
Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to
system daemons or assign them to groups.
To change the password for a system daemon, complete the following steps:
When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property.
In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished
name entries define the type of information that is stored in the LDAP directory. If you do not know the
value for BaseDN, contact your LDAP system administrator.
You can customize Data Analyzer user access with the following security options:
● Access permissions. Restrict user and/or group access to folders, reports, dashboards,
attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access
to a particular folder or object in the repository.
● Data restrictions. Restrict user and/or group access to information in fact and dimension tables
and operational schemas. Use data restrictions to prevent certain users or groups from
accessing specific values when they create reports.
● Password restrictions. Restrict users from changing their passwords. Use password restrictions
when you do not want users to alter their passwords.
When you create an object in the repository, every user has default read and write permissions for that
object. By customizing access permissions for an object, you determine which users and/or groups can
read, write, delete, or change access permissions for that object.
When you set data restrictions, you determine which users and groups can view particular attribute
values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to
that user.
Access permissions determine the tasks that you can perform for a specific repository object. When you
set access permissions, you determine which users and groups have access to the folders and repository
objects. You can assign the following types of access permissions to repository objects:
● Read
● Write
● Delete
● Change permission
By default, Data Analyzer grants read and write access permissions to every user in the repository. You
can use the General Permissions area to modify default access permissions for an object, or turn off
default access permissions.
You can restrict access to data based on the values of related attributes. Data restrictions are set to keep
sensitive data from appearing in reports. For example, you may want to restrict data related to the
performance of a new store from outside vendors. You can set a data restriction that excludes the store
ID from their reports.
You can set data restrictions using one of the following methods:
● Set data restrictions by object. Restrict access to attribute values in a fact table, operational
schema, real-time connector, and real-time message stream. You can apply the data restriction
to users and groups in the repository. Use this method to apply the same data restrictions to
more than one user or group.
● Set data restrictions for one user at a time. Edit a user account or group to restrict user or
group access to specified data. You can set one or more data restrictions for each user or group.
Use this method to set custom data restrictions for different users or groups.
● Inclusive. Use the IN option to allow users to access data related to the attributes you select. For
example, to allow users to view only data from the year 2001, create an “IN 2001” rule.
● Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes
you select. For example, to allow users to view all data except from the year 2001, create a “NOT
IN 2001” rule.
You can edit a user or group profile to restrict the data the user or group can access in reports. When you
edit a user profile, you can set data restrictions for any schema in the repository, including operational
schemas and fact tables.
You can set a data restriction to limit user or group access to data in a single schema based on the
attributes you select. If the attributes apply to more than one schema in the repository, you can also
restrict the user or group access from related data across all schemas in the repository. For example, you
may have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one
data restriction that applies to both the Sales and Salary fact tables based on the region you select.
To set data restrictions for a user or group, you need the following role or privilege:
When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the
data restrictions for the report owner. However, if the reports have consumer-based security, the Data
Analyzer Server creates a separate report for each unique security profile.
● Repository authentication. You must use the Update System Accounts utility to change the
system administrator account name in the repository.
● LDAP or Windows Domain Authentication. Set up the new system administrator account in
Windows Domain or LDAP directory service. Then use the Update System Accounts utility to
change the system administrator account name in the repository.
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp
\Repository Utils\Refresh\
8. Save the file and open up a command prompt window and navigate to D:\Temp\Repository Utils
\Refresh\
The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin",
respectively.
● mkdir \tmp
● cd \tmp
● jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
● cd META-INF
● Edit META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
● cd \
● jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .
Challenge
Database sizing involves estimating the types and sizes of the components of a data architecture.
This is important for determining the optimal configuration for the database servers in order to
support the operational workloads. Individuals involved in a sizing exercise may be data architects,
database administrators, and/or business analysts.
Description
The first step in database sizing is to review system requirements to define such things as:
● Expected data architecture elements (will there be staging areas? operational data stores?
centralized data warehouse and/or master data? data marts?)
Each additional database element requires more space. This is especially true when data is
replicated across multiple components, such as a data warehouse that also maintains an
operational data store (ODS): the same data is present in both the ODS and the warehouse,
albeit in different formats.
It is useful to analyze how each row in the source system translates into the target system. In
most situations the row count in the target system can be calculated by following the data
flows from the source to the target. For example, say a sales order table is being built by
denormalizing a source table. The source table holds sales data for 12 months in a single row
(one column for each month). Each row in the source translates to 12 rows in the target. So a
source table with one million rows ends up as a 12 million row table.
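The fan-out arithmetic from the example above can be expressed directly; the figures are the ones already given (one million source rows, twelve monthly columns per row).

```shell
# Row-count fan-out when denormalizing: each source row holding 12 monthly
# columns becomes 12 rows in the target.
SOURCE_ROWS=1000000
FAN_OUT=12                       # one target row per month column
echo "target rows: $((SOURCE_ROWS * FAN_OUT))"
```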
Granularity refers to the lowest level of detail that is going to be stored in a fact table.
Granularity affects the size of a database to a great extent, especially for aggregate tables:
the level at which a table is aggregated increases or decreases its row count.
For example, a sales order fact table's size is likely to differ greatly depending on whether
the table is aggregated at a monthly level or at a quarterly level. The granularity of a fact
table is determined by the dimensions linked to it; the number of dimensions connected to the
fact table affects its granularity, and hence the size of the table.
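The effect of grain on row count can be sketched with a quick upper-bound calculation. An aggregate table's row count is bounded by the product of the distinct values of its linked dimensions; the store and product cardinalities below are illustrative assumptions, not figures from the example.

```shell
# Upper bound on aggregate rows = product of distinct values of the linked
# dimensions. Cardinalities below (stores, products) are assumptions.
STORES=200
PRODUCTS=500
MONTHS=60                        # 5 years at monthly grain
QUARTERS=20                      # 5 years at quarterly grain
echo "monthly grain:   $((STORES * PRODUCTS * MONTHS)) rows max"
echo "quarterly grain: $((STORES * PRODUCTS * QUARTERS)) rows max"
```

Moving from monthly to quarterly grain cuts the bound by a factor of three, which is why the choice of grain dominates aggregate-table sizing.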
One way to estimate projections of data growth over time is to use scenario analysis. As an example,
for scenario analysis of a sales tracking data mart you can use the number of sales transactions to
be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are
expected; this equates to 10 million fact-table records.
Next, use the sales growth forecasts for the upcoming years for database growth calculations. That
is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next
year. At the end of five years, the fact table is likely to contain about 60 million records. You may
want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent
annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very
helpful.
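The three scenarios above can be computed with a short loop; the starting volume (10 million rows per year) and the growth rates are the ones from the example.

```shell
# Five-year cumulative fact-table projection for the three growth scenarios
# discussed above: 5%, 10%, and 20% annual sales growth from a 10M-row base.
for RATE in 5 10 20; do
    YEARLY=10000000
    TOTAL=0
    Y=0
    while [ $Y -lt 5 ]; do
        TOTAL=$((TOTAL + YEARLY))
        YEARLY=$((YEARLY * (100 + RATE) / 100))
        Y=$((Y + 1))
    done
    echo "${RATE}% growth: $TOTAL rows after 5 years"
done
```

For the 10-percent case this yields 61,051,000 rows, i.e., the "about 60 million records" cited above.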
Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature
can be useful in predicting table space requirements.
Oracle incorporates a table space prediction model in the database engine that provides projected
statistics for space used by a table. The following Oracle 10g query returns projected space usage
statistics:
SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend ('schema','tablename','TABLE'))
ORDER BY timepoint;
The results of this query are shown below:
TIMEPOINT SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- --------------------
Baseline Volumetric
Next, use the physical data models for the sources and the target architecture to develop a baseline
sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various
database structures such as tables, indexes, sort space, data files, log files, and database cache.
Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical
data model, along with field data types and field sizes. Various database products use different
storage methods for data types. For this reason, be sure to use the database manuals to determine
the size of each data type. Add up the field sizes to determine row size. Then use the data volume
projections to determine the number of rows to multiply by the table size.
A common default is to estimate index size as equal to the table size. Also estimate the
temporary space for sort operations. For data warehouse applications where summarizations are
common, plan on large temporary spaces. The temporary space can be as much as 1.5 times larger
than the largest table in the database.
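One worksheet line of such a baseline volumetric can be sketched as below. The 220-byte row size is an assumption standing in for the summed field sizes from the physical model; the row count is the 60-million-row projection from the scenario analysis.

```shell
# One line of a baseline volumetric worksheet:
#   table size = row bytes x rows, index ~= table, temp ~= 1.5x largest table.
# ROW_BYTES is an assumed stand-in for the summed field sizes.
ROW_BYTES=220
ROWS=60000000
TABLE_MB=$((ROW_BYTES * ROWS / 1048576))
INDEX_MB=$TABLE_MB
TEMP_MB=$((TABLE_MB * 3 / 2))
echo "table: ${TABLE_MB} MB  index: ${INDEX_MB} MB  temp: ${TEMP_MB} MB"
```

With these assumptions the table works out to roughly 12 GB before indexes and temp space; substitute the real field sizes from the database manuals for each data type.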
Another approach that is sometimes useful is to load the data architecture with representative data
and determine the resulting database sizes. This test load can be a fraction of the actual data and is
used only to gather basic sizing statistics. You then need to apply growth projections to these
statistics. For example, after loading ten thousand sample records to the fact table, you determine
the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60
million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB *
(60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.
Guesstimating
When there is not enough information to calculate an estimate as described above, use educated
guesses and “rules of thumb” to develop as reasonable an estimate as possible.
● If you don’t have the source data model, use what you do know of the source data to
estimate average field size and average number of fields in a row to determine table size.
Based on your understanding of transaction volume over time, determine your growth
metrics for each type of data and calculate out your source data volume (SDV) from table
size and growth metrics.
● If your target data architecture is not completed so that you can determine table sizes, base
your estimates on multiples of the SDV:
❍ If it includes staging areas: add another SDV for any source subject area that you will
And finally, remember that there is always much more data than you expect, so you may want to add
a reasonable fudge factor to the calculations as a margin of safety.
Challenge
In selectively migrating objects from one repository folder to another, there is a need for
a versatile and flexible mechanism that can overcome such limitations as confinement
to a single source folder.
Description
Deployment Groups are containers that hold references to objects that need to be
migrated. This includes objects such as mappings, mapplets, reusable transformations,
sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the
repository folders). Deployment groups are faster and more flexible than folder moves
for incremental changes. In addition, they allow for migration “rollbacks” if necessary.
Migrating a deployment group involves moving objects, in a single copy operation, from
multiple folders in the source repository into multiple folders in the target
repository. When copying a deployment group, individual objects to be copied can be
selected as opposed to the entire contents of a folder.
A deployment group exists in a specific repository. It can be used to move items to any
other accessible repository/folder. A deployment group maintains a history of all
migrations it has performed. It tracks what versions of objects were moved from which
folders in which source repositories, and into which folders in which target repositories
those versions were copied (i.e., it provides a complete audit trail of all migrations
performed). Because the deployment group knows what it moved and where, an
administrator can, if necessary, have the deployment group “undo” the most recent
deployment, reverting the target repository to its pre-deployment state. Using labels (as
described in the Using PowerCenter Labels Best Practice) allows objects in the
subsequent repository to be tracked back to a specific deployment.
It is important to note that the deployment group only migrates the objects it contains to
the target repository/folder. It does not, itself, move to the target repository. It still
resides in the source repository.
Migrations can be performed via the GUI or the command line (pmrep). In order to
migrate objects via the GUI, simply drag a deployment group from the repository it
resides in onto the target repository where the referenced objects are to be moved. The
Deployment Wizard appears and steps the user through the deployment process. Once
the wizard is complete, the migration occurs, and the deployment history is created.
Alternatively, the PowerCenter pmrep command can be used to automate both Folder
Level deployments (e.g., in a non-versioned repository) and deployments using
Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in
pmrep are used respectively for these purposes. Whereas deployment via the GUI
requires stepping through a wizard and answering a series of questions to deploy, the
command-line deployment requires an XML control file that supplies the same answers.
The following steps can be used to create a script to wrap pmrep commands and
automate PowerCenter deployments:
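As a sketch of such a wrapper, a script along the following lines can drive a deployment-group migration. It is shown in dry-run form: the repository names, user, deployment group, and control-file name are illustrative assumptions, and the exact pmrep option spellings should be verified against the pmrep reference for the installed version. The control file follows the depcntl.dtd shipped with PowerCenter.

```shell
#!/bin/sh
# Dry-run wrapper for a deployment-group migration. Repository names, the
# deployment group, user, and control file are illustrative assumptions.
RUN="echo"   # set RUN="" to execute the commands for real
$RUN pmrep connect -r INFATEST -n deploy_user -x "$PM_PASSWORD"
$RUN pmrep deploydeploymentgroup -p MKT_RELEASE_1 \
    -c deploy_control.xml -r INFAPROD
```

With RUN left as "echo", the script prints the commands it would run, which is useful for reviewing a deployment before executing it from a scheduler.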
Deployment groups help to ensure that there is a back-out methodology and that the
latest version of a deployment can be rolled back. To do this:
In the target repository (where the objects were migrated to), go to:
Versioning>>Deployment>>History>>View History>>Rollback.
The rollback purges all objects (of the latest version) that were in the deployment
group. Initiate a rollback on a deployment in order to roll back only the latest versions of
the objects in that deployment.
As objects are checked in and objects are deployed to target repositories, the number
of object versions in those repositories increases, as does the size of the repositories.
In order to manage repository size, use a combination of Check-in Date and Latest
Status (both are query parameters) to purge the desired versions from the repository
and retain only the very latest version. All deleted versions of objects should also be
purged to reduce the size of the repository.
If it is necessary to keep more than the latest version, labels can be included in the
query. These labels are ones that have been applied to the repository for the specific
purpose of identifying objects for purging.
Challenge
Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and
production environments, thereby protecting the integrity of each of these environments as the system evolves.
Description
Ensuring that an application has a smooth migration process between development, QA, and production
environments is essential for the deployment of an application. Deciding which migration strategy works best for a
project depends on two primary factors.
● How is the PowerCenter repository environment designed? Are there individual repositories for
development, QA, and production, or do one or two repositories share one or all of these
phases?
● How has the folder architecture been defined?
Each of these factors plays a role in determining the migration procedure that is most beneficial to the project.
PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter
migration options include repository migration, folder migration, object migration, and XML import/export. In
versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which
provides the capability to migrate any combination of objects within the repository with a single command.
This Best Practice is intended to help the development team decide which technique is most appropriate for the
project. The following sections discuss various options that are available, based on the environment and architecture
selected. Each section describes the major advantages of its use, as well as its disadvantages.
Repository Environments
The following section outlines the migration procedures for standalone and distributed repository environments. The
distributed environment section touches on several migration architectures, outlining the pros and cons of each.
Also, please note that any methods described in the Standalone section may also be used in a Distributed
environment.
In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata
store. Separate folders are used to represent the development, QA, and production workspaces and segregate work.
This type of architecture within a single repository ensures seamless migration from development to QA, and from
QA to production.
The following example shows a typical architecture. In this example, the company has chosen to create separate
development folders for each of the individual developers for development and unit test purposes. A single shared or
common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources,
targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the
unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the
tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.
Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings
to test, and then eventually to production.
After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this
process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder. This can be done using one of two methods:
● The first, and most common method, is object migration via an object copy. In this case, a user opens the
SHARED_MARKETING_TEST folder and drags the object from the SHARED_MARKETING_DEV into the
appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file
from one folder to another using Windows Explorer.
● The second approach is object migration via object XML import/export. A user can export each of the
objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into the
SHARED_MARKETING_TEST via XML import. With the XML import/export, the XML files can be uploaded
to a third-party versioning tool, if the organization has standardized on such a tool. Otherwise, versioning
can be enabled in PowerCenter. Migrations with versioned PowerCenter repositories are covered later in this
document.
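The XML export/import round trip described above can also be scripted with pmrep. The following is a dry-run sketch: the mapping name, folders, and import control file are illustrative assumptions, and option spellings should be checked against the pmrep reference.

```shell
#!/bin/sh
# Dry-run sketch of object-level XML migration. The mapping name, folders,
# and import control file are illustrative assumptions.
RUN="echo"   # set RUN="" to execute for real
$RUN pmrep objectexport -n m_load_customers -o mapping \
    -f SHARED_MARKETING_DEV -u m_load_customers.xml
$RUN pmrep objectimport -i m_load_customers.xml -c import_control.xml
```

Exported XML files produced this way can also be checked into a third-party versioning tool, as noted above.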
After you've copied all common or shared objects, the next step is to copy the individual mappings from each
development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration
methods described above to copy the mappings to the folder, although the XML import/export method is the most
intuitive method for resolving shared object conflicts. However, the migration method is slightly different here when
you're copying the mappings because you must ensure that the shortcuts in the mapping are associated with the
SHARED_MARKETING_TEST folder. Designer prompts the user to choose the correct shortcut folder that you
created in the previous example, which points to SHARED_MARKETING_TEST (see image below). You can then
continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later
versions, you can export multiple objects into a single XML file, and then import them at the same time.
1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the
destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a default
name is used. Then click “Next” to continue the copy process.
2. The next step for each task is to see if it exists (as shown below). If the task is present, you can rename or
replace the current one. If it does not exist, then the default name is used (see below). Then click “Next.”
3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow.
Select the mapping and continue by clicking “Next".
The move to production is very different for the initial move than for subsequent changes to mappings and workflows.
Since the repository only contains folders for development and test, we need to create two new folders to house the
production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and
MARKETING_TEST has been approved.
The following steps outline the creation of the production folders and, at the same time, address the initial test to
production migration.
1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder,
drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.
5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on
this screen is the folder name followed by the date. In this case, enter the name as
“SHARED_MARKETING_PROD.”
7. The final screen begins the actual copy process. Click "Finish" when the process is complete.
At the end of the migration, you should have two additional folders in the repository environment for
production: SHARED_MARKETING_PROD and MARKETING_ PROD (as shown below). These folders
contain the initially migrated objects. Before you can actually run the workflow in these production folders, you
need to modify the session source and target connections to point to the production environment.
When you copy or replace a PowerCenter repository folder, the Copy Wizard copies the permissions for the
folder owner to the target folder. The wizard does not copy permissions for users, groups, or all others in the
repository to the target folder. Previously, the Copy Wizard copied the permissions for the folder owner,
owner’s group, and all users in the repository to the target folder.
Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the
folder.
1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object
to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you
to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the
object.
3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in
Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are
making are what you intend. See below for an example of the mapping compare window.
In this example, we look at moving development work to QA and then from QA to production, using multiple
development folders for each developer, with the test and production folders divided into the data mart they
represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move
objects and mappings from each individual folder to the test folder and then how to move tasks, worklets, and
workflows to the new area.
1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2
❍ Copy the tested objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder.
❍ Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to
MARKETING_TEST.
❍ Save your changes.
2. Copy the mapping from Development into Test.
❍ In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping
and each reusable session from the developers’ folders into the MARKETING_TEST folder. The Copy
Session Wizard guides you through the copying process.
❍ Open each newly copied session and click on the Source tab. Change the source to point to the
source database for the Test environment.
❍ Click the Target tab. Change each connection to point to the target database for the Test
environment. Be sure to double-check the workspace from within the Target tab to ensure that the
load options are correct.
❍ Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
❍ Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy
Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new
folder.
❍ As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare
conflicts from within Workflow Manager to ensure that the correct migrations are being made.
❍ Save your changes.
5. Implement the appropriate security.
❍ In Development, the owner of the folders should be a user(s) in the development group.
❍ In Test, change the owner of the test folder to a user(s) in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the production folders.
With the current security model:
● The folder or global object owner, or a user assigned the Administrator role for the Repository Service, can
grant folder and global object permissions.
● Permissions can be granted to users, groups, and all others in the repository.
● The folder or global object owner, and a user assigned the Administrator role for the Repository Service,
have all permissions, which you cannot change.
Previously:
● Users with the appropriate repository privileges could grant folder and global object permissions.
● Permissions could be granted to the owner, owner’s group, and all others in the repository.
● You could change the permissions for the folder or global object owner.
The biggest disadvantage or challenge of a single repository environment is the migration of repository objects with
respect to database connections: when migrating objects from development to test to production, the same database
connections cannot be reused, because they point to the development or test environments. A single repository
structure can also create confusion, since the same users and groups exist in all environments and the number of
folders can grow quickly.
With a fully distributed approach, separate repositories function much like the separate folders in a standalone
environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our
Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following
example, we discuss a distributed repository architecture.
There are four techniques for migrating from development to production in a distributed repository architecture, with
each involving some advantages and disadvantages.
● Repository Copy
● Folder Copy
● Object Copy
● Deployment Groups
Repository Copy
So far, this document has covered object-level migrations and folder migrations through drag-and-drop object
copying and object XML import/export. This section discusses migrations in a distributed repository environment
through repository copies. Advantages of the repository copy method include:
● The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformation, etc.) at once
from one environment to another.
● The ability to automate this process using pmrep commands, thereby eliminating many of the manual
processes that users typically perform.
● The ability to move everything without breaking or corrupting any of the objects.
Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository
Copy method:
Copying the Test repository to Production through the GUI client tools is the easiest of all the migration
methods. First, ensure that all users are logged out of the destination repository and then connect to the
PowerCenter Repository Administration Console (as shown below).
If the Production repository already exists, you must delete the repository before you can copy the Test repository.
Before you can delete the repository, you must run the repository in exclusive mode.
1. Click on the “INFA_PROD” Repository in the left pane to select it, and change the running mode to “exclusive”
by clicking the Edit button in the right pane under the Properties tab.
Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the
repository to a binary file that can be restored to any new location. This method is preferable to the repository copy
process because, if any error occurs, the backup remains available as a binary file on the repository server.
From 8.5 onwards, security information is maintained at the domain level. Before you back up a repository and
restore it in a different domain, verify that users and groups with privileges for the source Repository Service exist in
the target domain. The Service Manager periodically synchronizes the list of users and groups in the repository with
the users and groups in the domain configuration database. During synchronization, users and groups that do not
exist in the target domain are deleted from the repository.
You can use infacmd to export users and groups from the source domain and import them into the target domain.
Use infacmd ExportUsersAndGroups to export the users and groups to a file. Use infacmd ImportUsersAndGroups to
import the users and groups from the file to a different PowerCenter domain.
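As a sketch, the round trip might look like the following dry-run commands. The domain names, the administrator account, and the option spellings are assumptions to verify against the infacmd reference for the installed version.

```shell
#!/bin/sh
# Dry-run sketch of moving users and groups between domains with infacmd.
# Domain names, account, and option spellings are assumptions; confirm them
# in the infacmd reference for the installed version.
RUN="echo"   # set RUN="" to execute for real
$RUN infacmd ExportUsersAndGroups -dn Domain_Test -un Administrator \
    -pd "$INFA_PASSWORD" -f users_and_groups.xml
$RUN infacmd ImportUsersAndGroups -dn Domain_Prod -un Administrator \
    -pd "$INFA_PASSWORD" -f users_and_groups.xml
```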
The following steps outline the process of backing up and restoring the repository for migration.
1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service. Select
Action -> Backup Contents from the drop-down menu.
3. After you've selected the location and file name, click OK to begin the backup process.
4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage
Repositories screen. When the backup is complete, select the repository connection to which the backup will
be restored (i.e., the Production repository).
When the restoration process is complete, you must repeat the steps listed in the copy repository option in order to
delete all of the unused objects and rename the folders.
PMREP
Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is
run from the command line rather than through the GUI client tools. pmrep is installed in the PowerCenter Client and
PowerCenter Services bin directories. PMREP utilities can be used from the Informatica Server or from any client
machine connected to the server. Refer to the Repository Manager Guide for a list of PMREP commands.
PMREP backup backs up the repository to the file specified with the -o option. You must provide the backup file
name. Use this command when the repository is running. You must be connected to a repository to use this
command.
backup
-o <output_file_name>
[-d <description>]
[-f (overwrite existing output file)]
[-b (skip workflow and session logs)]
[-j (skip deploy group history)]
[-q (skip MX data)]
[-v (skip task statistics)]
The following is a sample of the command syntax used within a Windows batch file to connect to and back up a
repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform routine
functions such as repository backups.
backupproduction.bat
REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
@echo off
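The remainder of such a script might look like the following, shown here as a portable-shell sketch rather than a literal .bat file. The repository name, account, and connect options are assumptions; the backup options (-o, -f, -b, -v) are those documented in the syntax above.

```shell
#!/bin/sh
# Dry-run sketch of a nightly backup script. Repository name, account, and
# connect options are assumptions; backup options are from the syntax above.
RUN="echo"   # set RUN="" to execute for real
OUT_FILE="Production_$(date +%Y%m%d).rep"
$RUN pmrep connect -r Production -n backup_user -x "$PM_PASSWORD"
$RUN pmrep backup -o "$OUT_FILE" -f -b -v
```

Dating the output file name keeps a rolling history of backups rather than overwriting a single file each night.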
After you have used one of the repository migration procedures to migrate into Production, follow these steps to
convert the repository to Production:
1. Disable workflows that are not ready for Production or simply delete the mappings, tasks, and workflows.
❍ Disable the workflows not being used in the Workflow Manager by opening the workflow
properties, then checking the Disabled checkbox under the General tab.
❍ Delete the tasks not being used in the Workflow Manager and the mappings in the Designer
2. Modify the database connection strings to point to the production sources and targets.
❍ In the Workflow Manager, select Relational connections from the Connections menu.
❍ Edit each relational connection by changing the connect string to point to the production
sources and targets.
❍ If you are using lookup transformations in the mappings and the connect string is anything
other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.
❍ In the Workflow Manager, open the session task properties, and from the Components tab
make the required changes to the pre- and post-session scripts.
Folder Copy
Although deployment groups are becoming a very popular migration method, the folder copy method has historically
been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly
promote all of the objects located within that folder. All source and target objects, reusable transformations,
mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this, however, everything in
the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the
Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is
copied.
● The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the
objects located within it.
● If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships
are automatically converted to point to this newly copied common or shared folder.
● All connections, sequences, mapping variables, and workflow variables are copied automatically.
The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being
performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least
utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a
serious consideration in real-time or near real-time environments.
The following example steps through the process of copying folders from each of the different environments. The first
example uses three separate repositories for development, test, and production.
1. If the project uses a common or shared folder, copy that folder to the Test repository first so that shortcut
relationships can be converted to point to it.
2. Copy the Development folder to Test.
3. If the folder already exists in the destination repository, choose to replace the folder.
4. When prompted, select the folder where the new shortcuts are located.
5. When testing is complete, repeat the steps above to migrate to the Production repository.
When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to
the appropriate target location. Ensure that all tasks are updated correctly and that folder and repository security is
modified for test and production.
Object Copy
Copying mappings into the next stage in a networked environment involves many of the same advantages and
disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked
environment. For additional information, see the earlier description of Object Copy for the standalone environment.
One advantage of Object Copy in a distributed environment is that it provides more granular control over objects.
Below are the steps to complete an object copy in a distributed repository environment:
1. In each of the distributed repositories, create a common folder with the exact same name and case.
2. Copy the shortcuts into the common folder in Production, making sure each shortcut has the exact same
name.
3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping
(first ensure that the mapping exists in the current repository).
4. Adjust folder security for each environment:
❍ In Development, ensure the owner of the folders is a user in the development group.
❍ In Test, change the owner of the test folders to a user in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the Production folders.
Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows
the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in
an object copy migration, but can also have the convenience of a repository- or folder-level migration as all objects
are deployed at once. There are no restrictions on the objects included in a deployment group; they can come from
one or multiple folders. For additional convenience, you can set up a dynamic deployment group whose contents are
defined by a repository query rather than being added manually. Lastly, because deployment groups are available on
versioned repositories, a deployment can also be rolled back, reverting to the previous versions of the objects, when
necessary.
Deployment groups have the following characteristics:
● They are containers that hold references to objects that need to be migrated.
● They allow for version-based object migration.
● They are faster and more flexible than folder moves for incremental changes.
● They allow for migration “rollbacks”.
● They allow specifying individual objects to copy, rather than the entire contents of a folder.
Deployment groups can be either Static or Dynamic.
Pre-Requisites
Creating Labels
A label is a versioning object that you can associate with any versioned object or group of versioned objects in a
repository. To prepare for deployment group migrations:
● Create a label for each stage and transition of the migration path, for example:
■ Development
■ Deploy_Test
■ Test
■ Deploy_Production
■ Production
● Apply the appropriate label to objects as they complete each stage.
Queries
A query is an object used to search for versioned objects in the repository that meet specific conditions.
● Create a query: the Query Browser allows you to create, edit, run, or delete object queries.
● Execute the query to identify the labeled objects that are ready to deploy.
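A saved object query can also be executed from the command line with pmrep, which is useful in migration scripts. In the following sketch the query name and output file are illustrative placeholders; confirm the executequery options in the pmrep reference for your PowerCenter version.

```shell
#!/bin/sh
# Run a saved shared query and write the matching object list to a file
# (query name and output path are placeholders).
QUERY=Deploy_Test_Ready
RESULTS=/tmp/query_results.txt

pmrep executequery -q "$QUERY" -t shared -u "$RESULTS" ||
    echo "pmrep executequery failed" >&2
```

The output file can then feed a deployment control process or a review step.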
1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on “Deployment Groups” and choose “New Group.”
3. In the dialog window, give the deployment group a name, and choose whether it should be static or
dynamic. In this example, we are creating a static deployment group. Click OK.
1. In Designer, Workflow Manager, or Repository Manger, right-click an object that you want to add to the
deployment group and choose “Versioning” -> “View History.” The “View History” window appears.
2. In the “View History” window, right-click the object and choose “Add to Deployment Group.”
3. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want
to add dependent objects to the deployment group so that they will be migrated as well. Click OK.
Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is
similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter
provides the capability to create dynamic deployment groups.
Dynamic Deployment groups are similar in function to static deployment groups, but differ in the way that objects are
added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the
contents of the deployment group are defined by a repository query. Do not worry about the complexity of writing a
repository query; it is quite simple and is aided by the PowerCenter GUI interface.
1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the
dynamic option. Also, select the “Queries” button.
2. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that
should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata
categories. In this case, the developers have assigned the “RELEASE_20050130” label to all objects that
need to be migrated, so the query is defined as “Label Is Equal To ‘RELEASE_20050130’”. The creation and
application of labels are discussed in Using PowerCenter Labels.
A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep
command line utility. With the client tool, you simply drag the deployment group from the source repository and drop
it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-
by-step options for executing the deployment group.
To roll back a deployment, you must first locate the deployment via the target repository's menu bar
(i.e., Deployments -> History -> View History -> Rollback).
Automated Deployments
For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep
DeployDeploymentGroup command, which can execute a deployment group migration without human intervention.
This is ideal because the deployment group allows the greatest flexibility and convenience; the script can be scheduled
to run overnight, causing minimal impact on developers and the PowerCenter administrator. You can also use
the pmrep utility to automate importing objects via XML.
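As a sketch of this approach (not a definitive implementation): the deployment group name, control file path, and connection values below are illustrative placeholders, and the DeployDeploymentGroup and connect options should be verified against the pmrep Command Line Reference for your version.

```shell
#!/bin/sh
# Unattended deployment group migration sketch (placeholder values).
GROUP=RELEASE_20050130
CONTROL=/scripts/deploy_control.xml    # deployment control file (XML)

# Connect to the source repository, then deploy the group to Test.
pmrep connect -r Development -n Administrator -x "$PMPASSWORD" \
      -h Central -o 5001 || echo "pmrep connect failed" >&2
pmrep deploydeploymentgroup -p "$GROUP" -c "$CONTROL" -r Test ||
      echo "deployment failed" >&2
```

Scheduled overnight, a script of this shape gives the unattended migration described above.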
Recommendations
Informatica recommends using the following process when running in a three-tiered environment with development,
test, and production servers.
For migrating from development into test, Informatica recommends using the Object Copy method. This method
gives you total granular control over the objects that are being moved. It also ensures that the latest development
mappings can be moved over manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps listed in the Object Copy section.
Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in
a distributed repository environment. This method provides the greatest flexibility in that you can promote any object
from within a development repository (even across folders) into any destination repository. Also, by using labels,
dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group
migration method results in automated migrations that can be executed without manual intervention.
Third-Party Versioning
Some organizations have standardized on third-party version control software. PowerCenter’s XML import/export
functionality offers integration with such software and provides a means to migrate objects. This method is most
useful in a distributed environment because objects can be exported into an XML file from one repository and
imported into the destination repository.
The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable
transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the
export/import functionality supports exporting and importing multiple objects to and from a single XML file, which can
significantly cut down on the work associated with object-level XML import/export.
The following steps outline the process of exporting the objects from source repository and importing them into the
destination repository:
Exporting
1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the object
to be exported.
2. Select Repository -> Export Objects.
Importing
Launch Designer or the Workflow Manager client tool and log in to the destination repository. Open the folder where
the object is to be imported.
Challenge
Description
DTLURDMO Utility
● Test communication between clients and all listeners in the production environment with:
dtlrexe prog=ping loc=<nodename>.
● Run selected jobs to exercise data access through PowerExchange data maps.
At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than
existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility
DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO
does have the ability to copy selectively, however, and the full functionality of the utility is documented in
the PowerExchange Utilities Guide.
The types of definitions that can be managed with this utility are data maps, capture registrations, and extraction maps.
On MVS, the input statements for this utility are taken from SYSIN.
On non-MVS platforms, the input argument points to a file containing the input definition. If no input
argument is provided, the utility looks for a file dtlurdmo.ini in the current path.
● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. If no definition file is specified, it looks for a file dtlurdmo.ini in the current path.
MVS utility
Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.
● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates and is read from the SYSIN card.
AS/400 utility
● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib
library.
If you want to create a separate DTLURDMO definition file rather than use the default location, you must
give the library and filename of the definition file as a parameter. For example:
CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')
Running DTLURDMO
The utility should be run extracting information from the files locally, then writing out the datamaps through
the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format
required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again
for the registrations, and then the extract maps if this is a capture environment. Commands for mixed
datamaps, registrations, and extract maps cannot be run together.
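These separate passes can be scripted on Windows or UNIX. In the following sketch the definition file names are illustrative, and each file is assumed to contain the copy command for a single object type (for example, DM_COPY for data maps), as documented in the PowerExchange Utilities Guide.

```shell
#!/bin/sh
# Run DTLURDMO once per object type; mixed commands cannot be combined
# in a single run. Definition file names are placeholders.
for DEFFILE in dm_copy.ini reg_copy.ini xm_copy.ini
do
    echo "DTLURDMO pass: $DEFFILE"
    dtlurdmo "$DEFFILE" || echo "DTLURDMO failed for $DEFFILE" >&2
done
```

Run the registrations and extract map passes only in capture environments, per the sequence described above.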
The following example shows a definition file to copy all datamaps from the existing local datamaps (the
local datamaps are defined in the DATAMAP DD card in the MVS JCL or by the path on Windows or
UNIX) to the V8.x.x listener (defined by the TARGET location node1):
USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;
Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option from
the PowerExchange Navigator.
● Test communication between clients and all listeners in the production environment with:
dtlrexe prog=ping loc=<nodename>.
On the drop-down list box, choose the appropriate location (in this case mvs_prod).
Challenge
Understanding the recovery options that are available for PowerCenter when errors are
encountered during the load.
Description
When a task in the workflow fails at any point, one option is to truncate the target and
run the workflow again from the beginning. As an alternative, the workflow can be
suspended while the error is fixed, avoiding re-processing of the portion of the
workflow that completed without errors. This option, "Suspend on Error", results in accurate and
complete target data, as if the session had completed successfully in a single run. There are
also recovery options available for workflows and tasks that can be used to handle
different failure scenarios.
For consistent recovery, the mapping needs to produce the same result, and in the
same order, in the recovery execution as in the failed execution. This can be achieved
by sorting the input data using either the sorted ports option in Source Qualifier (or
Application Source Qualifier) or by using a sorter transformation with distinct rows
option immediately after source qualifier transformation. Additionally, ensure that all the
targets received data from transformations that produce repeatable data.
The recovery strategy can be configured on the Properties page of the Session task.
Enable the session for recovery by selecting one of the following three Recovery
Strategies:
● Restart task
● Fail task and continue workflow
● Resume from last checkpoint
The Suspend on Error option directs the Integration Service to suspend the workflow
while the error is being fixed and then it resumes the workflow. The workflow is
suspended when any of the following tasks fail:
● Session
● Command
● Worklet
● Email
When a task fails in the workflow, the Integration Service stops running tasks in the
path. The Integration Service does not evaluate the output link of the failed task. If no
other task is running in the workflow, the Workflow Monitor displays the status of the
workflow as "Suspended."
If one or more tasks are still running in the workflow when a task fails, the Integration
Service stops running the failed task and continues running tasks in other paths. The
Workflow Monitor displays the status of the workflow as "Suspending." When the status
of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target
database error, and recover the workflow in the Workflow Monitor. When you recover a
workflow, the Integration Service restarts the failed tasks and continues evaluating the
rest of the tasks in the workflow. The Integration Service does not run any task that
already completed successfully.
Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log
when it resumes the workflow from the point of suspension. However, the earlier runs
that caused the suspension are recorded in the historical run information in the
repository.
Suspension Email
The workflow can be configured to send an email when the Integration Service
suspends the workflow. When a task fails, the workflow is suspended and suspension
email is sent. The error can be fixed and the workflow can be resumed subsequently.
If another task fails while the Integration Service is suspending the workflow, another
suspension email is not sent. The Integration Service only sends out another
suspension email if another task fails after the workflow resumes. Use the "Browse
Emails" button on the General tab of the Workflow Designer Edit sheet to configure the
suspension email.
Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration
Service also suspends the worklet if a task within the worklet fails. When a task in the
worklet fails, the Integration Service stops executing the failed task and other tasks in
its path. If no other task is running in the worklet, the status of the worklet is
"Suspended". If other tasks are still running in the worklet, the status of the worklet is
"Suspending". The parent workflow is also suspended when the worklet is "Suspended"
or "Suspending".
Starting Recovery
The recovery process can be started using the Workflow Manager or the Workflow Monitor.
Alternatively, the recovery process can be started by using pmcmd in command-line
mode or by using a script.
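For the command-line route, pmcmd provides a recoverworkflow command. The following is a hedged sketch: the service, domain, user, folder, and workflow names are placeholders, and the exact option set should be confirmed in the pmcmd reference for your PowerCenter version.

```shell
#!/bin/sh
# Recover a suspended workflow from the command line (placeholder names).
FOLDER=DW_LOAD
WORKFLOW=wf_daily_load

pmcmd recoverworkflow -sv IntSvc_Prod -d Domain_Prod \
      -u Administrator -p "$PMPASSWORD" \
      -f "$FOLDER" "$WORKFLOW" || echo "recovery failed" >&2
```

Wrapping this call in a script allows recovery to be triggered by an operator or a scheduler after the underlying error is fixed.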
When the Integration Service runs a session that has a resume recovery strategy, it writes recovery information to
recovery tables on the target database, including the following:
PM_RECOVERY - Contains target load information for the session run. The Integration
Service removes the information from this table after each successful session and
initializes the information at the beginning of subsequent sessions.
PM_REC_STATE - When the Integration Service runs a real-time session that uses the
recovery table and that has recovery enabled, it creates a recovery table,
PM_REC_STATE, on the target database to store message IDs and commit numbers.
When the Integration Service recovers the session, it uses the information in this table
to determine whether it needs to write each message to the target table.
If you edit or drop the recovery tables before you recover a session, the Integration
Service cannot recover the session. If you disable recovery, the Integration Service
does not remove the recovery tables from the target database and you must manually
remove them.
For recovery to be effective, the recovery session must produce the same set of rows,
in the same order, as the original run. Any change made after the initial failure (in the
mapping, the session, and/or the Integration Service) that affects the ability to produce
repeatable data results in inconsistent data during the recovery process. The following
situations may produce inconsistent data during a recovery session:
HA Recovery
Challenge
Using labels effectively in a data warehouse or data integration project to assist with
administration and migration.
Description
A label is a versioning object that can be associated with any versioned object or group of
versioned objects in a repository. Labels provide a way to tag a number of object versions with
a name for later identification. Therefore, a label is a named object in the repository, whose
purpose is to be a “pointer” or reference to a group of versioned objects. For example, a label
called “Project X version X” can be applied to all object versions that are part of that project and
release.
Note that labels apply to individual object versions, and not objects as a whole. So if a mapping
has ten versions checked in, and a label is applied to version 9, then only version 9 has that
label. The other versions of that mapping do not automatically inherit that label. However,
multiple labels can point to the same object for greater flexibility.
The “Use Repository Manager” privilege is required in order to create or edit labels. To create a
label, choose Versioning-Labels from the Repository Manager.
Locking the label is also advisable. This prevents anyone from accidentally associating
additional objects with the label or removing object references for the label.
Labels, like other global objects such as Queries and Deployment Groups, can have user and
group privileges attached to them. This allows an administrator to create a label that can only
be used by specific individuals or groups. Only those people working on a specific project
should be given read/write/execute permissions for labels that are assigned to that project.
Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent and/or
child objects. For example, to group dependencies for a workflow, apply a label to all children
objects. The Repository Server applies labels to sources, targets, mappings, and tasks
associated with the workflow. Use the “Move label” property to point the label to the latest
version of the object(s).
Note: Labels can be applied to any object version in the repository except checked-out
versions. Execute permission is required for applying labels.
After the label has been applied to related objects, it can be used in queries and deployment
groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the
size of the repository (i.e., to purge object versions).
An object query can be created using the existing labels. Labels can be associated only with a
dynamic deployment group; based on the object query, objects associated with that label can be
used in the deployment.
Repository Administrators and other individuals in charge of migrations should develop their
own label strategies and naming conventions in the early stages of a data integration project.
Be sure that developers are aware of the uses of these labels and when they should apply
labels.
For each planned migration between repositories, choose three labels for the development and
subsequent repositories:
● The first is to identify the objects that developers can mark as ready for migration.
● The second should apply to migrated objects, thus developing a migration audit trail.
● The third is to apply to objects as they are migrated into the receiving repository,
completing the migration audit trail.
Additional labels can be created with developers to allow the progress of mappings to be
tracked if desired. For example, when an object is successfully unit-tested by the developer, it
can be marked as such. Developers can also label the object with a migration label at a later
time if necessary. Using labels in this fashion along with the query feature allows complete or
incomplete objects to be identified quickly and easily, thereby providing an object-based view of
progress.
Challenge
Data Migration and Data Integration projects are often challenged to verify that the data in an
application is complete; more specifically, to verify that all the appropriate data was extracted
from a source system and propagated to its final target. This best practice illustrates how to do this
in an efficient and repeatable fashion for increased productivity and reliability. This is particularly
important in businesses that are either highly regulated internally and externally or that have to
comply with a host of government compliance regulations such as Sarbanes-Oxley, BASEL II,
HIPAA, Patriot Act, and many others.
Description
The common practice for audit and balancing solutions is to produce a set of common tables that
can hold various control metrics regarding the data integration process. Ultimately, business
intelligence reports provide insight at a glance to verify that the correct data has been pulled from
the source and completely loaded to the target. Each control measure that is being tracked will
require development of a corresponding PowerCenter process to load the metrics to the Audit/
Balancing Detail table.
1. Work with business users to identify what audit/balancing processes are needed. Some
examples of this may be:
a. Customers – (Number of Customers or Number of Customers by Country)
b. Orders – (Qty of Units Sold or Net Sales Amount)
c. Deliveries – (Number of shipments or Qty of units shipped or Value of all shipments)
d. Accounts Receivable – (Number of Accounts Receivable Shipments or Total
Accounts Receivable Outstanding)
2. For each process defined in step 1, define which columns should be used for tracking purposes
for both the source and target systems.
3. Develop a data integration process that will read from the source system and populate the
detail audit/balancing table with the control totals.
4. Develop a data integration process that will read from the target system and populate the
detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that will query the audit/balancing table and identify whether the
source and target entries match or if there is a discrepancy.
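As an illustration of steps 3 through 5, control totals unloaded from the source and target systems can be reconciled with a small script. The file names and the CONTROL_AREA,CONTROL_SUB_AREA,count layout below are assumptions for demonstration only.

```shell
#!/bin/sh
# Demo data: CONTROL_AREA,CONTROL_SUB_AREA,CONTROL_COUNT_1
cat > source_totals.csv <<'EOF'
Customers,US,1500
Orders,US,98000
EOF
cat > target_totals.csv <<'EOF'
Customers,US,1500
Orders,US,97950
EOF

# Compare source vs. target totals keyed on area and sub-area.
awk -F, '
    NR == FNR { src[$1 "," $2] = $3; next }   # first file: source totals
    {
        if (src[$1 "," $2] == $3)
            print $1 "," $2 ": OK (" $3 ")"
        else
            print $1 "," $2 ": MISMATCH source=" src[$1 "," $2] " target=" $3
    }
' source_totals.csv target_totals.csv > balance_report.txt

cat balance_report.txt
```

For the demo data the report flags the deliberate discrepancy, e.g. `Orders,US: MISMATCH source=98000 target=97950`. In practice the same comparison would be done in the database or the reporting layer against the audit/balancing table.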
Audit/Balancing Details
AUDIT_KEY            NUMBER(10)
CONTROL_AREA         VARCHAR2(50)
CONTROL_SUB_AREA     VARCHAR2(50)
CONTROL_COUNT_1      NUMBER(10)
CONTROL_COUNT_2      NUMBER(10)
CONTROL_COUNT_3      NUMBER(10)
CONTROL_COUNT_4      NUMBER(10)
CONTROL_COUNT_5      NUMBER(10)
CONTROL_SUM_1        VARCHAR2(50)
CONTROL_SUM_2        VARCHAR2(50)
CONTROL_SUM_3        VARCHAR2(50)
CONTROL_SUM_4        VARCHAR2(50)
CONTROL_SUM_5        VARCHAR2(50)
UPDATE_TIMESTAMP     TIMESTAMP
UPDATE_PROCESS       VARCHAR2(50)
A single PowerCenter mapping can populate both the source and target values in the audit/balancing table.
Straw-man examples of an Audit/Balancing Report, the end result of this type of process, present these control totals
side by side so that discrepancies are visible at a glance.
There are also a set of basic tasks that can be leveraged and shared across any audit/balancing
needs. By building a common model for meeting audit/balancing needs, projects can lower the
time needed to develop these solutions and still provide risk reductions by having this type of
solution in place.
Challenge
Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005
study by the Gartner Group stated that the majority of then-planned data warehouse projects would suffer
limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of
attention to data quality.
Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues.
Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer
relationship management. It is essential that data quality issues are tackled during any large-scale data
project to enable project success and future organizational success.
Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure
that all data entering the organizational data stores provides for consistent and reliable decision-making.
Description
A significant portion of time in the project development process should be dedicated to data quality,
including the implementation of data cleansing processes. In a production environment, data quality
reports should be generated after each data warehouse implementation or when new source systems are
integrated into the environment. There should also be provision for rolling back if data quality testing
indicates that the data is unacceptable.
Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE)
and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data
integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ
has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a
complete solution for identifying and resolving all types of data quality problems and preparing data for the
consolidation and load processes.
Concepts
Following are some key concepts in the field of data quality. These data quality concepts provide a
foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and
effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to
consolidation.
Profiling and Analysis - although data profiling and data analysis are often synonymous terms, in
Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily
concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can
discover data quality issues at a record and field level, and Velocity best practices recommend the use of
IDQ for such purposes.
Note: The remaining items in this document therefore focus on the context of IDQ usage.
Enhancement - refers to adding useful, but optional, information to existing data, or to completing partial data.
Examples may include: sales volume, number of employees for a given business, and zip+4 codes.
Validation - the process of correcting data using algorithmic components and secondary reference data
sources, to check and validate information. Example: validating addresses with postal directories.
Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality
records where high-quality records of the same information exist. Use matching components and business
rules to identify records that may refer, for example, to the same customer. For more information, see the
Best Practice Effective Data Matching Techniques.
Consolidation - using the data sets defined during the matching process to combine all cleansed or
approved data into a single, consolidated view. Examples are building best record, master record, or
house-holding.
Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality
issues, including data cleansing. The suite comprises the following elements:
● IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality
functionality on a single computer (Windows only).
● IDQ Server - a server-side component that allows data quality plans and resources to be deployed
to, and run on, remote machines.
IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its
own applications or to provide them for addition to PowerCenter transformations.
Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is,
Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input
components, output components, and operational components. Plans can perform analysis, parsing,
standardization, enhancement, validation, matching, and consolidation operations on the specified data.
Plans are saved into projects that can provide a structure and sequence to your data quality endeavors.
The following figure illustrates how data quality processes can function in a project setting:
In stage 1, you profile the source data and measure its current quality levels; this forms the initial data quality assessment.
In stage 2, you verify the target levels of quality for the business according to the data quality
measurements taken in stage 1, and in accordance with project resourcing and scheduling.
In stage 3, you use Workbench to design the data quality plans and projects to achieve the
targets. Capturing business rules and testing the plans are also covered in this stage.
In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy
plans and resources to remote repositories and file systems through the user interface. If you are running
Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which
data cleansing and other data quality tasks are performed on the project data.
In stage 5, you’ll test and measure the results of the plans and compare them to the initial data quality
assessment to verify that targets have been met. If targets have not been met, this information feeds into
another iteration of data quality operations in which the plans are tuned and optimized.
In a large data project, you may find that data quality processes of varying sizes and impact are necessary
at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at
a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design
Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the
level of unit testing required.
● On the PowerCenter client side, the Data Quality Integration enables you to browse the Data
Quality repository and add data quality plans to custom transformations. The data quality plans’
functional details are saved as XML in the PowerCenter repository.
● On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to
send data quality plan XML to the Data Quality engine for execution.
The Integration requires that at least the following IDQ components are available to PowerCenter:
● Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
● Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan
instructions.
An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by
Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North
American name and postal address records.
● Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality
repository.
● The PowerCenter Designer user opens a Data Quality Integration transformation and configures it
to read from the Data Quality repository. Next, the user selects a plan from the Data Quality
repository and adds it to the transformation.
● The PowerCenter Designer user saves the transformation and the mapping containing it to the
PowerCenter repository. The plan information is saved with the transformation as XML.
The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant
source data and plan information will be sent to the Data Quality engine, which processes the data (in
conjunction with any reference data files used by the plan) and returns the results to PowerCenter.
Challenge
Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data
profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This
Best Practice is intended to provide an introduction on usage for new users.
Bear in mind that Informatica’s Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following
Velocity Best Practice documents for more information:
● Data Cleansing
● Using Data Explorer for Data Discovery and Analysis
Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems and enables users to measure changes
in the source data over time. This information can help to improve the quality of the source data.
An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good
overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source
level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level.
Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short
amount of time.
A custom data profile is useful when there is a specific question about a source. Custom profiling is useful for validating
business rules and/or verifying that data matches a particular pattern. For example, use custom profiling if you have a
business rule that you want to validate, or if you want to test whether data matches a particular pattern.
Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking
“Configure Session” on the “Function-Level Operations” tab of the wizard.
● Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration
parameters.
● For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow
Manager and configure and schedule them appropriately.
Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.
You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can
also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.
Sampling Techniques
The following sampling techniques are available with the PowerCenter Data Profiling option:
● Automatic random sampling: PowerCenter determines the appropriate percentage to sample, then samples
random rows. Suited to larger data sources where you want a statistically significant data analysis.
● Manual random sampling: PowerCenter samples random rows of the source data based on a user-specified
percentage. Use this to sample more or fewer rows than the automatic option chooses.
● Sample first N rows: Samples a user-specified number of rows. Provides a quick readout of a source (e.g.,
the first 200 rows).
The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be
sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script
that is generated and run it.
ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';
SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
INFORMIX
IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;' from syscat.tables where tabname like 'PMDP%'
TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'
Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and
connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.
Challenge
Use PowerCenter to create data quality mapping rules to enhance the usability of the
data in your system.
Description
The issue of poor data quality is one that frequently hinders the success of data
integration projects. It can produce inconsistent or faulty results and ruin the credibility
of the system with the business users.
This Best Practice focuses on techniques for use with PowerCenter and third-party or
add-on software. Comments that are specific to the use of PowerCenter are enclosed
in brackets.
Bear in mind that you can augment or supplant the data quality handling capabilities of
PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite
dedicated to data quality issues. Data analysis and data enhancement processes, or
plans, defined in IDQ can deliver significant data quality improvements to your project
data. A data project that has built-in data quality steps, such as those described in the
Analyze and Design phases of Velocity, enjoys a significant advantage over a project
that has not audited and resolved issues of poor data quality. If you have added these
data quality steps to your project, you are likely to avoid the issues described below.
A description of the range of IDQ capabilities is beyond the scope of this document. For
a summary of Informatica’s data quality methodology, as embodied in IDQ, consult the
Best Practice Data Cleansing.
Data integration/warehousing projects often encounter general data problems that may
not merit a full-blown data quality project, but which nonetheless must be addressed.
This document discusses some methods to ensure a base level of data quality; much
of the content discusses specific strategies to use with PowerCenter.
The quality of data is important in all types of projects, whether data warehousing, data migration, or another form of data integration.
Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users
want to see data in its “raw” format without any capitalization, trimming, or formatting
applied to it. This is easily achievable as it is the default behavior, but there is danger in
taking this requirement literally since it can lead to duplicate records when some of
these fields are used to identify uniqueness and the system is combining data from
various source systems.
One solution to this issue is to create additional fields that act as a unique key to a
given table, but which are formatted in a standard way. Since the “raw” data is stored in
the table, users can still see it in this format, but the additional columns mitigate the risk
of duplication.
Another possibility is to explain to the users that “raw” data in unique, identifying fields
is not as clean and consistent as data in a common format. In other words, push back
on this requirement.
This issue can be particularly troublesome in data migration projects where matching
the source data is a high priority. Failing to trim leading/trailing spaces from data can
often lead to mismatched results since the spaces are stored as part of the data value.
The project team must understand how spaces are handled from the source systems to
determine the amount of coding required to correct this. (When using PowerCenter and
sourcing flat files, the options provided while configuring the File Properties may be
sufficient.) Remember that certain RDBMS products use the data type CHAR, which
then stores the data with trailing blanks. These blanks need to be trimmed before
matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
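The standardized-key approach described above can be sketched simply. This is an illustrative Python sketch (the standardization rule of trimming plus upper-casing is an assumed convention, not prescribed by the text):

```python
# Hypothetical sketch: derive a standardized match-key column so that "raw"
# values can be displayed as-is while uniqueness checks use a common format.

def match_key(raw_value):
    """Trim leading/trailing spaces (including CHAR padding) and upper-case."""
    return raw_value.strip().upper()

# Three "raw" variants of the same name, as they might arrive from
# different source systems (including CHAR trailing blanks).
rows = ["  Acme Corp ", "ACME CORP", "acme corp"]
keys = {match_key(r) for r in rows}

# All three variants collapse to a single key, avoiding duplicate records
# while the raw values remain available for display.
print(keys)
```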
Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular
data value.
Dates
Dates can cause many problems when moving and transforming data from one place
to another because an assumption must be made that all data values are in a
designated format.
If the majority of the dates coming from a source system arrive in the same format, then
it is often wise to create a reusable expression that handles dates, so that the proper
checks are made. It is also advisable to determine if any default dates should be
defined, such as a low date or high date. These should then be used throughout the
system for consistency. However, do not fall into the trap of always using default dates
as some are meant to be NULL until the appropriate time (e.g., birth date or death date).
Where a default is genuinely appropriate, a NULL date value can be replaced with one of the standard default dates described here.
Decimal Precision
With numeric data columns, developers must determine the expected or required
precisions of the columns. (By default, to increase performance, PowerCenter treats all
numeric columns as 15 digit floating point decimals, regardless of how they are defined
in the transformations. The maximum numeric precision in PowerCenter is 28 digits.)
If it is determined that a column realistically needs a higher precision, then the Enable
Decimal Arithmetic in the Session Properties option needs to be checked. However, be
aware that enabling this option can slow performance by as much as 15 percent. The
Enable Decimal Arithmetic option must be enabled when comparing two numbers for
equality.
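The equality pitfall that Enable Decimal Arithmetic addresses can be illustrated in any language with binary floating point. A short Python sketch of the analogous behavior:

```python
from decimal import Decimal

# Why high-precision equality comparisons need decimal arithmetic rather
# than floating point (analogous to PowerCenter's Enable Decimal Arithmetic
# session option).

a = 0.1 + 0.2
print(a == 0.3)             # False: binary floating-point rounding error

b = Decimal("0.1") + Decimal("0.2")
print(b == Decimal("0.3"))  # True: exact decimal arithmetic
```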
When requesting a data feed from an upstream system, be sure to request an audit file
or report that contains a summary of what to expect within the feed. Common requests
here are record counts or summaries of numeric data fields. If you have performed a
data quality audit as specified in the Analyze Phase, these metrics and others should
be readily available.
Assuming that the metrics can be obtained from the source system, it is advisable to
then create a pre-process step that ensures your input source matches the audit file. If
the values do not match, stop the overall process from loading into your target system.
The source system can then be alerted to verify where the problem exists in its feed.
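The pre-process check described above can be sketched as follows. This is an illustrative Python sketch; the audit metrics (record count and a numeric total) and field names are assumptions:

```python
# Hypothetical pre-process step: compare the incoming feed against the audit
# summary supplied by the source system, and stop the load on any mismatch.

def verify_feed(records, audit_count, audit_amount_total):
    """Raise if the feed does not match the audit file's metrics."""
    actual_count = len(records)
    actual_total = sum(r["amount"] for r in records)
    if actual_count != audit_count or actual_total != audit_amount_total:
        raise ValueError(
            f"Feed/audit mismatch: {actual_count} vs {audit_count} records, "
            f"{actual_total} vs {audit_amount_total} total"
        )
    return True

feed = [{"amount": 100}, {"amount": 250}]
print(verify_feed(feed, audit_count=2, audit_amount_total=350))
```

In practice the raised error would halt the overall load workflow so the source system can be alerted.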
Another method of filtering bad data is to have a set of clearly defined data rules built
into the load job. The records are then evaluated against these rules and routed to an
Error or Bad Table for further re-processing accordingly. An example of this is to check
all incoming Country Codes against a Valid Values table. If the code is not found, then
the record is flagged as an Error record and written to the Error table.
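The rule-based routing described above can be sketched as follows. This is an illustrative Python sketch; the Valid Values set and record layout are assumptions:

```python
# Sketch of rule-based routing: records failing a Valid Values check are
# flagged and routed to an error table instead of the target.

VALID_COUNTRY_CODES = {"US", "CA", "GB"}   # stand-in for a Valid Values table

def route(records):
    """Split records into target-bound rows and flagged error rows."""
    good, errors = [], []
    for rec in records:
        if rec["country_code"] in VALID_COUNTRY_CODES:
            good.append(rec)
        else:
            errors.append({**rec, "error": "Unknown country code"})
    return good, errors

good, errors = route([{"id": 1, "country_code": "US"},
                      {"id": 2, "country_code": "XX"}])
print(len(good), len(errors))
```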
A pitfall of this method is that you must determine what happens to the record once it
has been loaded to the Error table. If the record is pushed back to the source system to
be fixed, then a delay may occur until the record can be successfully loaded to the
target system. In fact, if the proper governance is not in place, the source system may
refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the
data manually and risk not matching with the source system; or 2) relax the business
rule to allow the record to be loaded.
Often, in the absence of an enterprise data steward, it is a good idea to assign a
team member the role of data steward. It is this person’s responsibility to patrol these
tables and push back to the appropriate systems as necessary, as well as help to make
decisions about fixing or filtering bad data. A data steward should have a good
command of the metadata, and he/she should also understand the consequences to
the user community of data decisions.
The majority of current data warehouses are built using a dimensional model. A
dimensional model relies on the presence of dimension records existing before loading
the fact tables. This can usually be accomplished by loading the dimension tables
before loading the fact tables. However, there are some cases where a corresponding
dimension record is not present at the time of the fact load. When this occurs,
consistent rules are needed so that data is not improperly exposed to, or hidden
from, the users.
One solution is to continue to load the data to the fact table, but assign the foreign key
a value that represents Not Found or Not Available in the dimension. These keys must
also exist in the dimension tables to satisfy referential integrity, but they provide a clear
and easy way to identify records that may need to be reprocessed at a later date.
Another solution is to filter the record from processing since it may no longer be
relevant to the fact table. The team will most likely want to flag the row through the use
of either error tables or process codes so that it can be reprocessed at a later time.
A third solution is to use dynamic caches and load the dimensions when a record is not
found there, even while loading the fact table. This should be done very carefully since
it may add unwanted or junk values to the dimension table. One occasion when this
may be advisable is in cases where dimensions are simply made up of the distinct
combination values in a data set. Thus, this dimension may require a new record if a
new combination occurs.
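The first of these solutions can be sketched as follows. This is an illustrative Python sketch; the -1 "Not Found" surrogate key is an assumed convention, not prescribed by the text:

```python
# Sketch of a fact load that substitutes a reserved "Not Found" surrogate key
# when the dimension lookup fails, so the fact row still loads and can be
# identified for reprocessing later.

NOT_FOUND_KEY = -1   # must also exist as a row in the dimension table
                     # to satisfy referential integrity

dimension = {"CUST01": 101, "CUST02": 102}   # natural key -> surrogate key

def resolve_fk(natural_key):
    """Return the dimension surrogate key, or the Not Found key."""
    return dimension.get(natural_key, NOT_FOUND_KEY)

facts = [{"cust": "CUST01"}, {"cust": "CUST99"}]
loaded = [{**f, "cust_sk": resolve_fk(f["cust"])} for f in facts]
print(loaded)
```

Rows carrying the Not Found key can then be selected in bulk for reprocessing once the missing dimension records arrive.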
It is imperative that all of these solutions be discussed with the users before making
any decisions since they will eventually be the ones making decisions based on the
reports.
Challenge
Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource
management initiatives, and it is an increasingly important driver of cost-efficient compliance with regulatory
initiatives such as KYC (Know Your Customer).
Once duplicate records are identified, you can remove them from your dataset, and better recognize key
relationships among data records (such as customer records from a common household). You can also match
records or values against reference data to ensure data accuracy and validity.
This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching
approach. It has two high-level objectives:
● To identify the key performance variables that affect the design and execution of IDQ matching plans.
● To describe plan design and plan execution actions that will optimize plan performance and results.
To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.
Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or
prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer
numbers or product ID fields) that, if present, would allow clear ‘joins’ between the datasets and improve business
knowledge.
Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single
view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from
being sent to the same person or household; and it can assist marketing efforts by identifying households or
individuals who are heavy users of a product or service.
Data can be enriched by matching across production data and reference data sources. Business intelligence
operations can be improved by identifying links between two or more systems to provide a more complete picture
of how customers interact with a business.
IDQ’s matching capabilities can help to resolve dataset duplications and deliver business results. However, a
user’s ability to design and execute a matching plan that meets the key requirements of performance and match
quality depends on understanding the best-practice approaches described in this document.
An integrated approach to data matching involves several steps that prepare the data for matching and improve the
overall quality of the matches. The following table outlines the processes in each step.
Step Description
Profiling: Typically the first stage of the data quality process, profiling generates
a picture of the data and indicates the data elements that can comprise
effective group keys. It also highlights the data elements that require
standardizing to improve match scores.
The sections below identify the key factors that affect the performance (or speed) of a matching plan and the
quality of the matches identified. They also outline the best practices that ensure that each matching plan is
implemented with the highest probability of success. (This document does not make any recommendations on
profiling, standardization or consolidation strategies. Its focus is grouping and matching.)
The following table identifies the key variables that affect matching plan performance and the quality of matches
identified.
Informatica Data Quality components (affects plan performance): The plan designer must weigh file-based
versus database matching approaches when considering plan requirements.
Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons
performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a
matching plan compares the records within each group with one another. When grouping is implemented properly,
plan execution speed is increased significantly, with no meaningful effect on match quality.
The most important determinant of plan execution speed is the size of the groups to be processed — that is, the
number of data records in each group.
For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If
9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000
records; based on this one large group, the matching plan would require 87 days to complete, processing
1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12
minutes if the group sizes were evenly distributed.
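The arithmetic behind this example follows from the pairwise comparison count: a group of n records requires n*(n-1)/2 comparisons. A short Python check of the figures quoted above:

```python
# Verifying the grouping example: one badly skewed group of 500,000 records
# dominates the whole matching run, while 9,999 evenly sized groups of ~50
# records finish in minutes.

RATE_PER_MINUTE = 1_000_000   # comparisons per minute, as stated in the text

def comparisons(n):
    """Pairwise comparisons needed within a group of n records."""
    return n * (n - 1) // 2

big_group_minutes = comparisons(500_000) / RATE_PER_MINUTE
print(f"{big_group_minutes / (60 * 24):.0f} days")   # ~87 days

even_minutes = 9_999 * comparisons(50) / RATE_PER_MINUTE
print(f"{even_minutes:.1f} minutes")                 # ~12 minutes
```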
Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups
perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small
groups. As groups get smaller, fewer comparisons are possible, and the potential for missing good matches is
increased. The goal of grouping is to optimize performance while minimizing the possibility that valid matches will
be overlooked because like records are assigned to different groups. Therefore, groups must be defined
intelligently through the use of group keys.
Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a
significant effect on the success of matching operations.
Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the
plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are
compared against one another.
● Candidate group keys should represent a logical separation of the data into distinct units where there is a
low probability that matches exist between records in different units. This can be determined by
profiling the data and uncovering the structure and quality of the content prior to grouping.
● Candidate group keys should also have high scores in three key areas of data quality: completeness,
conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior
to grouping.
For example, geography is a logical separation criterion when comparing name and address data: a record in one geographic region is unlikely to match a record in another.
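The grouping mechanism can be sketched as follows. This is an illustrative Python sketch; the choice of a zip-code prefix as the group key is an assumption for the example:

```python
from itertools import combinations

# Sketch of grouping: records are bucketed by a group key (here, a postal
# code prefix) and pairwise comparisons happen only within each bucket.

records = [
    {"name": "J Smith", "zip": "02116"},
    {"name": "John Smith", "zip": "02116"},
    {"name": "A Jones", "zip": "94105"},
]

groups = {}
for rec in records:
    groups.setdefault(rec["zip"][:3], []).append(rec)   # group key: zip prefix

# Only the two "021" records are ever compared; the "941" record is never
# compared against them, which is why key quality matters so much.
pairs = [pair for grp in groups.values() for pair in combinations(grp, 2)]
print(len(pairs))
```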
Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the
definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to
produce a matching plan — both in terms of the preparation of the data and the plan execution.
IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components.
However, there are performance implications for certain component types, combinations of components, and the
quantity of components used in the plan.
Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational
components. In tests comparing file-based matching against database matching, file-based matching outperformed
database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching
plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed
Field Matcher component performed more slowly than plans without a Mixed Field Matcher.
Raw performance should not be the only consideration when selecting the components to use in a matching plan.
Different components serve different needs and may offer advantages in a given scenario.
Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the
completion of a matching plan can have a significant impact on the perception that the plan is running correctly.
Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping
strategy, and the IDQ components to employ.
Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to
be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the
execution time will have to be considered.
Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key
methods for assessing matches are:
● deterministic matching
● probabilistic matching
Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ’s
fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first test whether two records agree on one field before applying further checks to other fields.
The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to
others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages to
this method are its rigidity and its requirement that each dependency be true. This can result in matches being
missed, or can require several different rule checks to cover all likely combinations.
Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in
order to calculate a weighted average that indicates the degree of similarity between two pieces of information.
The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no
dependencies on certain data elements matching in order for a full match to be found. Weights assigned to
individual components can place emphasis on different fields or areas in a record. However, even if a heavily-
weighted score falls below a defined threshold, match scores from less heavily-weighted components may still
produce a match.
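The weighted-average calculation behind probabilistic matching can be sketched as follows. This is an illustrative Python sketch; the field names, weights, and threshold are hypothetical, and the per-field scores stand in for the output of fuzzy matching components:

```python
# Sketch of probabilistic match scoring: fuzzy per-field scores (0.0-1.0)
# are combined as a weighted average, and the record pair is treated as a
# match when the weighted score clears a user-defined threshold.

def weighted_score(field_scores, weights):
    """Weighted average of per-field fuzzy match scores."""
    total_weight = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total_weight

scores = {"surname": 0.95, "address": 0.80, "dob": 0.60}
weights = {"surname": 0.5, "address": 0.3, "dob": 0.2}

score = weighted_score(scores, weights)
# Even with a weak dob score, the strongly weighted surname and address
# scores carry the pair over an assumed 0.80 threshold.
print(round(score, 3), score >= 0.80)
```

Tuning these weights, and choosing the threshold, is precisely the effort the text describes as the method's disadvantage.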
The disadvantage of this method is the higher degree of tuning required on the user’s part to find the
balance of weights that optimizes successful matches. This balance can be difficult for users to understand and
to communicate to one another.
Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching
plan with 95 to 100 percent success may have found all good matches, but matching plan success between 90 and
94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to
only 65 percent genuine matches, and so on. The following table illustrates this principle.
Close analysis of the match results is required because of the relationship between match quality and the
match threshold scores assigned, since there may not be a one-to-one mapping between the plan’s weighted
score and the number of records that can be considered genuine matches.
The following section outlines best practices for matching with IDQ.
Capturing client requirements is key to understanding how successful and relevant your matching plans are likely
to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and
implementing a matching plan:
Test Results
Performance is the key to success in high-volume matching solutions. IDQ’s architecture supports massive
scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly
enhances IDQ’s ability to meet the service levels required by users without sacrificing quality or requiring an overly
complex solution.
If IDQ is integrated with PowerCenter, matching scalability can be achieved using PowerCenter's partitioning
capabilities.
As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity
of small groups should be minimized to ensure that the greatest number of comparisons are captured. Keep the
following parameters in mind when designing a grouping plan.
Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about
to be matched has been profiled and standardized to identify candidate keys.
Group keys act as a “first pass” or high-level summary of the shape of the dataset(s). Remember that only data
records within a given group are compared with one another. Therefore, it is vital to select group keys that have
high data quality scores for completeness, conformity, consistency, and accuracy.
Group key selection depends on the type of data in the dataset, for example whether it contains name and address
data or other data types such as product codes.
Hardware Specifications
The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has
a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is
one million comparisons per minute, the speed can range from as low as 250,000 comparisons to 6.5 million
comparisons per minute, depending on the hardware specification, background processes running, and
components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for
high-volume matching plans.
Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and
writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how
quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during
plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information
that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be
appropriate for datasets of 3 to 4 million records and a minimum of 512MB of memory should be available.
The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms.
Specifications for UNIX-based systems vary.
With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based
or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware
however, that this requires additional effort to create the groups and consolidate the match output. Also, matching
plans split across four processors do not run four times faster than a single-processor matching plan. As a result,
multi-processor matching may not significantly improve performance in every case.
The following table can help in estimating the execution time between a single and multi-processor match plan.
For example, if a single processor plan takes one hour to group and standardize the data and eight hours to match,
a four-processor match plan should require approximately one hour and 20 minutes to group and standardize and
two and one half hours to match. The time difference between a single- and multi-processor plan in this case would
be more than five hours (i.e., nine hours for the single processor plan versus three hours and 50 minutes for the
quad-processor plan).
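The estimate above can be reproduced with a small calculation. The scaling factors used here (a 4/3 overhead on grouping and standardization, and a roughly 3.2x matching speedup on four processors) are inferred from the worked example in the text; they are illustrative assumptions, not vendor-published figures.

```python
# Estimate multi-processor match plan duration from single-processor timings.
# Scaling factors are inferred from the worked example above: grouping slows
# by ~4/3 due to subgroup management, matching speeds up ~3.2x on four CPUs.

def estimate_quad_processor_time(group_hours, match_hours,
                                 group_overhead=4/3, match_speedup=3.2):
    """Return (group_time, match_time, total) in hours for a 4-CPU plan."""
    new_group = group_hours * group_overhead
    new_match = match_hours / match_speedup
    return new_group, new_match, new_group + new_match

g, m, total = estimate_quad_processor_time(group_hours=1.0, match_hours=8.0)
print(f"group/standardize: {g:.2f} h, match: {m:.2f} h, total: {total:.2f} h")
```

With the one-hour and eight-hour inputs from the example, this yields roughly 1 hour 20 minutes of grouping, 2.5 hours of matching, and a total of about 3 hours 50 minutes.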
No best-practice research has yet been completed on which type of comparison is most effective at determining a
match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for
deterministic comparisons since they remove the burden of identifying a universal match threshold from the user.
Bear in mind that IDQ supports deterministic matching operations only. However, IDQ’s Weight Based Analyzer
component lets plan designers calculate weighted match scores for matched fields.
File-based matching and database matching perform essentially the same operations. The major differences
between the two methods revolve around how data is stored and how the outputs can be manipulated after
matching is complete. With regards to selecting one method or the other, there are no best practice
recommendations since this is largely defined by requirements.
The following table outlines the strengths and weakness of each method:
This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of
execution and quality of results. It highlights the key factors affecting matching performance and discusses the
results of IDQ performance testing in single and multi-processor environments.
Checking for duplicate records where no clear connection exists among data elements is a resource-intensive
activity. In order to detect matching information, a record must be compared against every other record in a
dataset. For a single data source, the number of comparisons required to check an entire dataset increases
quadratically as the volume of data increases. A similar situation arises when matching between two datasets,
where the number of comparisons required is the product of the volumes of data in the two datasets.
When the volume of data increases into the tens of millions, the number of comparisons required to identify
matches — and consequently, the amount of time required to check for matches — reaches impractical levels.
The number of comparisons can be controlled in IDQ through grouping, which involves logically segmenting the dataset into
distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of
records outside of the group. Grouping data greatly reduces the total number of required comparisons without
affecting match accuracy.
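The effect of grouping on comparison counts can be quantified directly. The group sizes below are illustrative; a self-match of n records needs n(n-1)/2 comparisons, while grouped matching only compares records within each group.

```python
# Compare the number of record comparisons required with and without grouping.
# An ungrouped self-match of n records needs n*(n-1)/2 comparisons; grouping
# into groups of sizes n_i reduces this to the sum of n_i*(n_i-1)/2.

def pair_comparisons(n):
    return n * (n - 1) // 2

def grouped_comparisons(group_sizes):
    return sum(pair_comparisons(n) for n in group_sizes)

n = 1_000_000
ungrouped = pair_comparisons(n)                # ~5.0e11 comparisons
grouped = grouped_comparisons([1000] * 1000)   # same rows in 1,000 groups
print(f"ungrouped: {ungrouped:,}")
print(f"grouped:   {grouped:,}  ({ungrouped // grouped}x fewer)")
```

For one million records split into 1,000 groups of 1,000, the comparison count drops from roughly 500 billion to roughly 500 million, a thousand-fold reduction.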
● Its matching components maximize the comparison activities assigned to the computer processor. This
reduces the amount of disk I/O communication in the system and increases the number of comparisons
per minute. Therefore, hardware with higher processor speeds has higher match throughput.
● IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple
processors. The use of multiple processors to handle matching operations greatly enhances IDQ
scalability with regard to high-volume matching problems.
The following section outlines how a multi-processor matching solution can be implemented and illustrates the
results obtained in Informatica Corporation testing.
IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take
advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel.
To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability
comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the
plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are run in parallel.
The following diagram outlines how multi-processor matching can be implemented in a database model. Source
data is first grouped and then subgrouped according to the number of processors available to the job. Each
subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against
each table. Results from each plan are consolidated to generate a single match result for the original source data.
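In outline, the grouping and subgroup assignment described above can be sketched as follows. The group key (zip code) and the greedy balancing policy are illustrative assumptions for the sketch, not IDQ internals.

```python
# Sketch of splitting match groups into per-processor subgroups so that
# discrete matching plans can run in parallel. The group key and the
# balancing policy are illustrative assumptions, not IDQ internals.
from collections import defaultdict

def build_groups(records, key):
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return groups

def assign_subgroups(groups, n_processors):
    """Distribute whole groups across n staging areas, largest first."""
    staging = [[] for _ in range(n_processors)]
    for grp in sorted(groups.values(), key=len, reverse=True):
        min(staging, key=len).extend(grp)   # greedy balance by current size
    return staging

records = [{"zip": z, "name": n} for z, n in
           [("10001", "Smith"), ("10001", "Smyth"), ("94105", "Jones"),
            ("94105", "Joness"), ("60601", "Brown"), ("60601", "Browne")]]
groups = build_groups(records, key=lambda r: r["zip"])
staging = assign_subgroups(groups, n_processors=2)
print([len(s) for s in staging])
```

Keeping whole groups together in one staging area is what preserves match accuracy: records that could match are never separated across plans.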
Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows
2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors
effectively provided four CPUs on which to run the tests.
Several tests were performed using file-based and database-based matching methods and single and multiple
processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total
number of comparisons to approximately 500,000,000.
Test results using file-based and database-based methods showed a near-linear scalability as the number of
available processors increased. As the number of processors increased, so too did the demand on disk I/O
resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the
benefits of adding additional processor capacity. This is demonstrated in the graph below.
Challenge
To enable users to streamline their data cleansing and standardization processes (or plans) with
Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a
consistent and methodological approach to cleansing and standardizing project data.
Description
Data cleansing refers to operations that remove non-relevant information and “noise” from the
content of the data. Examples of cleansing operations include the removal of person names, “care
of” information, excess character spaces, or punctuation from postal addresses.
Data standardization refers to operations that modify the appearance of the data, so that it takes on a
more uniform structure, and that enrich the data by deriving additional details from existing content.
Data can be transformed into a “standard” format appropriate for its business type. This is typically
performed on complex data types such as name and address or product data. A data
standardization operation typically profiles data by type (e.g., word, number, code) and parses
data strings into discrete components. This reveals the content of the elements within the data as
well as standardizing the data itself.
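A minimal sketch of this kind of profiling is shown below, assuming simplified token categories (word, number, code); a real standardization operation would apply far richer typing and reference data.

```python
# Minimal sketch of profiling a data string by token type (word, number,
# code) and parsing it into discrete components, as a standardization
# operation does. The three categories are deliberately simplified.
import re

def profile_tokens(value):
    profile = []
    for tok in value.split():
        if tok.isdigit():
            profile.append(("number", tok))
        elif re.fullmatch(r"[A-Za-z]+", tok):
            profile.append(("word", tok))
        else:
            profile.append(("code", tok))   # mixed alphanumerics, punctuation
    return profile

print(profile_tokens("100 Main St Apt 4B"))
```

Profiling reveals the structure of each value ("number word word word code" here), which is what allows the parser to route each component into the correct field.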
For best results, the Data Quality Developer should carry out these steps in consultation with a
member of the business. Often, this individual is the data steward, the person who best
understands the nature of the data within the business scenario.
● Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the
correct fields. However, when using the Profile Standardizer, be aware that there is a
finite number of profiles (500) that can be contained within a cleansing plan. Users can
extend the number of profiles by using the first 500 profiles within one component and
then feeding the data overflow into a second Profile Standardizer via the Token Parser
component.
After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to
further standardize the data. It may take several iterations of dictionary construction and review
before the data is standardized to an acceptable level. Once acceptable standardization has been
achieved, data quality scorecard or dashboard reporting can be introduced. For information on
dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User
Guide.
At this point, the business user may discover and define business rules applicable to the data.
These rules should be documented and converted to logic that can be contained within a data
quality plan. When building a data quality plan, be sure to group related business rules together in
a single rules component whenever possible; otherwise the plan may become very difficult to read.
If there are rules that do not lend themselves easily to regular IDQ components (e.g., when
standardizing product data), it may be necessary to perform some custom
scripting using IDQ’s scripting component. This requirement may arise when a string or an
element within a string needs to be treated as an array.
Reference data can be a useful tool when standardizing data. Terms with variant formats or
spellings can be standardized to a single form. IDQ installs with several reference dictionary files
that cover common name and address and business terms. The illustration below shows part of a
dictionary of street address suffixes.
Data values can often appear ambiguous, particularly in name and address data where name,
address, and premise values can be interchangeable. For example, Hill, Park, and Church are all
common surnames. In some cases, the position of the value is important. “ST” can be a suffix for
street or a prefix for Saint, and sometimes they can both occur in the same string.
The address string “St Patrick’s Church, Main St” can reasonably be interpreted as “Saint Patrick’s
Church, Main Street.” In this case, if the delimiter is a space (thus ignoring any commas and
periods), the string has five tokens. You may need to write business rules using the IDQ Scripting
component, as you are treating the string as an array. St with position 1 within the string would be
standardized to meaning_1, whereas St with position 5 would be standardized to meaning_2.
Each data value can then be compared to a discrete prefix and suffix dictionary.
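A simplified sketch of such a position-based rule follows, assuming tokens are produced by splitting on spaces (with commas and periods ignored) and that only the first and last positions are disambiguated; a production rule set would be more nuanced.

```python
# Illustrative position-based standardization of the ambiguous token "ST",
# treating the address string as an array of tokens as described above:
# "ST" as the first token expands to "Saint"; as the last token, "Street".
def standardize_st(address):
    tokens = address.replace(",", " ").replace(".", " ").split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.upper() == "ST":
            out.append("Saint" if i == 0 else
                       "Street" if i == len(tokens) - 1 else tok)
        else:
            out.append(tok)
    return " ".join(out)

print(standardize_st("St Patrick's Church, Main St"))
```

Applied to the example string, this produces "Saint Patrick's Church Main Street", resolving both meanings of "St" from position alone.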
Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help
an organization to recognize the value of incorporating IDQ into their development
methodology. Because data quality is an iterative process, the business rules initially developed
may require ongoing modification, as the results produced by IDQ will be affected by the starting
condition of the data and the requirements of the business users.
When data arrives in multiple languages, it is worth creating similar IDQ plans for each country
and applying the same rules across these plans. The data would typically be staged in a database,
and the plans developed using a SQL statement as input, with a “where country_code= ‘DE’”
clause, for example. Country dictionaries are identifiable by country code to facilitate such
statements. Remember that IDQ installs with a large set of reference dictionaries and additional
dictionaries are available from Informatica.
IDQ provides several components that focus on verifying and correcting the accuracy of name and
postal address data. These components leverage address reference data that originates from
national postal carriers such as the United States Postal Service. Such datasets enable IDQ to
validate an address to premise level. Please note, the reference datasets are licensed and
installed as discrete Informatica products, and thus it is important to discuss their inclusion in the
project with the business in advance so as to avoid budget and installation issues. Several types of
reference data, with differing levels of address granularity, are available from Informatica. Pricing
for the licensing of these components may vary and should be discussed with the Informatica
Account Manager.
Challenge
This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter
mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have
been completed as part of the software installation process and these steps are not included in this document.
Description
Preparing IDQ Plans for PowerCenter Integration
IDQ plans are typically developed and tested by executing from workbench. Plans running locally from workbench
can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into
PowerCenter as they can only use Source and Sink components that contain the “Enable Real-time processing”
check box. Specifically those components are CSV Source, CSV Match Source, CSV Sink and CSV Match Sink.
In addition, the Real-time Source and Sink can be used; however, they require additional setup as each field name
and length must be defined. Database source and sinks are not allowed in PC integration.
When IDQ plans are integrated within a PowerCenter mapping, the source and sink need to be enabled by setting
the enable real-time processing option on them. Consider the following points when developing a plan for
integration in PC.
● If the IDQ plan was developed using a database source and/or sink, you must replace them with CSV Sink/
Source or CSV Match Sink/Source.
● If the IDQ plan was developed using group sink/source (or dual group sink), you must replace them with
either CSV Sink/Source or CSV Match Sink/Source depending on the functionality you are replacing.
When replacing group sink you also must add functionality to the PC mapping to replicate the grouping.
This is done by placing a join and sort prior to the IDQ plan containing the match.
● PowerCenter only sees the input and output ports of the IDQ plan from within the PC mapping. This is
driven by the input file used for the workbench plan and the fields selected as output in the sink. If you
don’t see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not
selected as output.
● PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is
defined as a passive transformation. If the IDQ transformation is configured as active this is not an issue
as you must select all fields needed as output from the IDQ transformation within the sink transformation
of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for
active and passive transformations in PowerCenter.
● The delimiter of the Source and Sink must be a comma for integrated IDQ plans. Other delimiters, such as
pipe, will cause an error within the PowerCenter Designer. If you encounter this error, go back to workbench,
change the delimiter to comma, save the plan and then go back to PowerCenter Designer and perform
the import of the plan again.
● For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example,
rather than naming a field Customer address1, customer address2, customer city, name the field
address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple
sources you can integrate the same IDQ plan, which will reduce development time as well as ongoing
maintenance.
● Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50
fields and you only really need 10 fields for the IDQ plan, create a file that contains only the necessary
field names, save it as a comma delimited file and then point to that newly created file from the source of
the IDQ plan. This changes the input field reference to only those fields that must be visible in the
PowerCenter integration.
After the IDQ plans are converted to real-time enabled, they are ready to be integrated into a PowerCenter mapping.
Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration,
including:
Note: The plug-in must be registered in each repository from which an IDQ transformation is to be
developed.
When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the
PowerCenter repository.
To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to
insert the transformation into the mapping. The following dialog box appears:
Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching
plan. If selecting Active, the IDQ plan needs to have all input fields passed through, as the typical PowerCenter rules for active transformations apply.
As the following figure illustrates, the IDQ transformation is “empty” in its initial, un-configured state. Notice all
ports are currently blank; they will be populated upon import/integration of the IDQ plan.
Double-click on the title bar for the IDQ transformation to open it for editing.
When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to
establish a connection to the appropriate IDQ repository.
In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually
the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct
value. Next, click Test Connection.
When the connection is established, click the down arrow to the right of the Plan Name box, and the following
dialog is displayed:
Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog
box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box
appears.
If the plan is valid for PowerCenter integration, the following dialog is displayed.
After Data Quality Plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not
reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When
you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy
of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter
repository do not communicate updates automatically.
The following paragraphs detail the process for refreshing integrated IDQ plans when necessary to reflect changes
made in workbench.
Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ Repository that is visible to
the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located
on the PowerCenter server.
In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions
to the MySQL on the server. If the client machine has not been granted access, the client receives an error
message when attempting to access the server repository. The person at your organization who has login rights to
the server on which IDQ is installed needs to perform this task for all users who will need to save or retrieve plans
from the IDQ Server. This procedure is detailed below.
● Identify the IP address for any client machine that needs to be granted access.
● Login to the server on which the MySQL repository is located and login to MySQL:
mysql -u root
● For a user to connect to IDQ server, save and retrieve plans, enter the following command:
● For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:
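The exact commands depend on the environment; a typical MySQL grant sequence might resemble the following. The account name, client IP address, and privilege scope are illustrative assumptions, so consult the IDQ installation documentation for the precise privileges required.

```sql
-- Illustrative only: grant a Workbench client (identified by IP address)
-- access to the IDQ MySQL repository. Account name, host, and privilege
-- scope are assumptions, not the documented IDQ requirements.
GRANT ALL PRIVILEGES ON *.* TO 'idq_user'@'192.168.1.25';
FLUSH PRIVILEGES;
```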
Challenge
To provide guidelines for the development and management of the reference data sources that can be
used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition
from development to production for reference data files and the plans with which they are associated.
Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan.
A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those
terms. It may be a list of employees, package measurements, or valid postal addresses — any data set
that provides an objective reference against which project data sources can be checked or corrected.
Reference files are essential to some, but not all data quality processes.
Internal data is specific to a particular project or client. Such data is typically generated from internal
company information. It may be custom-built for the project.
External data has been sourced or purchased from outside the organization. External data is used when
authoritative, independently-verified data is needed to provide the desired level of data quality to a
particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal
address data sets that have been verified as current and complete by a national postal carrier, such as
United States Postal Service, or company registration and identification information from an industry-
standard source such as Dun & Bradstreet.
Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that
requires intermediary (third-party) software in order to be read by Informatica applications.
Internal data files, as they are often created specifically for data quality projects, are typically saved in the
dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases
can also be used as a source for internal data.
External files are more likely to remain in their original format. For example, external data may be
contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal
discrete data values.
Most organizations already possess much information that can be used as reference data — for example,
employee tax numbers or customer names. These forms of data may or may not be part of the project
source data, and they may be stored in different parts of the organization.
IDQ installs with a set of reference dictionaries that have been created to handle many types of business
data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from
dictionary, and dictionary files are essentially comma delimited text files.
● You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of
your IDQ (client or server) installation.
● You can use the Dictionary Manager within Data Quality Workbench. This method allows you to
create text and database dictionaries.
● You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).
The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text
editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct
or standardized form of each datum from the dictionary’s perspective. The Item columns contain versions
of each datum that the dictionary recognizes as identical or equivalent to the Label entry. Therefore,
each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration
below). A dictionary can have multiple Item columns.
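Based on the format described above (comma-delimited rows, Label first, followed by one or more Item columns), a dictionary fragment and a lookup against it might look like the following. The street-suffix entries are hypothetical sample data.

```python
# A reference dictionary maps known variants (Item columns) to a
# standardized form (Label). This follows the .DIC layout described above:
# comma-delimited, Label first, then recognized variants (Label repeated
# as Item1). The entries are illustrative sample data.
import csv, io

DIC_TEXT = """\
Street,Street,St,Str
Avenue,Avenue,Ave,Av
Boulevard,Boulevard,Blvd,Boul
"""

def load_dictionary(text):
    lookup = {}
    for row in csv.reader(io.StringIO(text)):
        label, items = row[0], row[1:]
        for item in items:
            lookup[item.lower()] = label
    return lookup

suffixes = load_dictionary(DIC_TEXT)
print(suffixes["blvd"], suffixes["st"])
```

Inverting the rows into a variant-to-label map is what makes standardization a single lookup per token.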
To edit a dictionary value, open the DIC file and make your changes. You can make changes either
through a text editor or by opening the dictionary in the Dictionary Manager.
Note: IDQ users with database expertise can create and specify dictionaries that are linked to database
tables, and that thus can be updated dynamically when the underlying data is updated. Database
dictionaries are useful when the reference data has been originated for other purposes and is likely to
change independently of data quality. By making use of a dynamic connection, data quality plans can
always point to the current version of the reference data.
As you can publish or export plans from a local Data Quality repository to server repositories, so you can
copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like
mechanism for moving files to other machines across the network.
Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when
running a plan. By default, Data Quality relies on dictionaries being located in the following locations:
IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file
when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will
fail.
This is most relevant when you publish or export a plan to another machine on the network. You must
ensure that copies of any dictionary files used in the local plan are available in a suitable location on the
service domain — in the user space on the server, or at a location in the server’s Dictionaries folders that
corresponds to the dictionaries’ location on Workbench — when the plan is copied to the server-side
repository.
Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file.
However, this is the master configuration file for the product and you should not edit it without consulting
Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.
Plans can be version-controlled during development in Workbench and when published to a domain
repository. You can create and annotate multiple versions of a plan, and review/roll back to earlier versions
when necessary.
Dictionary files are not version controlled by IDQ, however. You should define a process to log changes
and back-up your dictionaries using version control software if possible or a manual method. If
modifications are to be made to the versions of dictionary files installed by the software, it is recommended
that these modifications be made to a copy of the original file, renamed or relocated as desired. This
approach avoids the risk that a subsequent installation might overwrite changes.
External data may or may not permit the copying of data into text format — for example, external data
contained in a database or in library files. Currently, third-party postal address validation data is provided
to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The
third-party software has a very small footprint.) However, some software files can be amenable to data
extraction to file.
External data vendors produce regular data updates, and it’s vital to refresh your external reference data
when updates become available. The key advantage of external data — its reliability — is lost if you do not
apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept
up to date with the latest data as it becomes available for as long as your data subscription warrants. You
can check that you possess the latest versions of third-party data by contacting your Informatica Account
Manager.
If your organization has a reference data subscription, you will receive either regular data files on compact
disc or regular information on how to download data from Informatica or vendor web sites. You must
develop a strategy for distributing these updates to all parties who run plans with the external data. This
may involve installing the data on machines in a service domain.
Bear in mind that postal address data vendors update their offerings every two or three months, and that a
significant percentage of postal addresses can change in such time periods.
You should plan for the task of obtaining and distributing updates in your organization at frequent intervals.
Depending on the number of IDQ installations that must be updated, updating your organization with third-
party reference data can be a sizable task.
Experience working with reference data leads to a series of best practice tips for creating and managing
reference data files.
With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-
compatible format.
For example, let’s say you have an exception file containing suspect or invalid customer account records.
Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a
new text file containing the account serial numbers only. This file effectively constitutes the labels column
of your dictionary.
By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into
Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns.
Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the
dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account
numbers that you can use in any plans checking the validity of the organization's account records.
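The spreadsheet steps above can also be scripted; this sketch writes a column of values straight to a dictionary-format file, duplicating each value as its own Item1 entry. The file name and account numbers are illustrative.

```python
# Scripted equivalent of the Excel steps above: take a single column of
# values (e.g., bad account numbers), duplicate each as the Item1 column,
# and write a comma-delimited .DIC file. Names and values are illustrative.
import csv

def write_dictionary(values, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for value in values:
            writer.writerow([value, value])   # Label and Item1 are identical

bad_accounts = ["AC-10457", "AC-20931", "AC-33102"]
write_dictionary(bad_accounts, "bad_account_numbers.dic")
print(open("bad_account_numbers.dic").read())
```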
The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data.
The figure below illustrates how you can drill-down into report data, right-click on a column, and save the
column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to
the column data.
In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically,
records containing bad zip codes). The plan designer can now create plans to check customer databases
against these serial numbers. You can also append data to an existing dictionary file in this manner.
As a general rule, it is a best practice to follow the dictionary organization structure installed by the
application, adding to that structure as necessary to accommodate specialized and supplemental
dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible
modifications, thereby lowering the risk of accidental errors during migration. When following the original
dictionary organization structure is not practical or contravenes other requirements, take care to document the deviations.
Since external data may be obtained from third parties and may not be in file format, the most efficient way
to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically,
this is the machine that hosts the Execution Service.)
This is a similar issue to that of sharing reference data across the organization. If you must move or
relocate your reference data files post-plan development, you have three options:
● You can reset the location to which IDQ looks by default for dictionary files.
● You can reconfigure the plan components that employ the dictionaries to point to the new
location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
● If deploying plans in a batch or scheduled task, you can append the new location to the plan
execution command. You can do this by appending a parameter file to the plan execution
instructions on the command line. The parameter file is an xml file that can contain a simple
command to use one file path instead of another.
Challenge
This Best Practice describes the rationale for matching in real-time along with the concepts and strategies used in
planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this
process using Informatica’s PowerCenter and Data Quality.
The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from being
entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a
queue, changes captured from a database, or other common data feeds, matching incoming records against the
master data that already exists allows only the new, unique records to be added.
Description
1. There is a master data set (or possibly multiple master data sets) that contain clean and unique customers,
prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction; typically thought to be a new item. This
transaction can be anything from a new customer signing up on the web to a list of new products; this is
anything that is assumed to be new and intended to be added to master.
3. There must be a process to determine if a “new” item really is new or if it already exists within the master data
set. In a perfect world of consistent IDs, spellings, and representations of data across all companies and
systems, checking for duplicates would simply be an exact lookup into the master to see if the item
already exists. Unfortunately, this is not the case, and even creative use of %LIKE% syntax does not
provide thorough results. For example, comparing Bob to Robert or GRN to Green requires a more
sophisticated approach.
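As a sketch of why a more sophisticated approach is needed, the following Python fragment (an illustration, not Informatica's matching algorithm; the nickname entries are assumed sample data) shows how a similarity score catches pairs that an exact lookup or %LIKE% comparison would miss:

```python
# Illustrative only: a similarity score catches duplicates that exact
# comparison misses. The nickname dictionary entries are assumed samples.
from difflib import SequenceMatcher

NICKNAMES = {"bob": "robert", "bill": "william"}  # assumed sample entries

def similarity(a: str, b: str) -> float:
    """Score two values 0.0-1.0, normalizing known nicknames first."""
    a, b = a.strip().lower(), b.strip().lower()
    a, b = NICKNAMES.get(a, a), NICKNAMES.get(b, b)
    return SequenceMatcher(None, a, b).ratio()

# "Bob" vs "Robert" fails as an exact comparison, but scores 1.0 here
# because the nickname is normalized before comparing; "GRN" vs "Green"
# still scores well above unrelated strings.
```

A production match rule would typically combine several such scores across name, address, and other fields rather than rely on a single comparison.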
The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires
well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent,
conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with
the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to
cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization
will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.
Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching,
there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On
the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be
compared. For example, when matching consumer data on name and address, it may be sensible to limit candidate
pull records to those having the same zip code and the same first letter of the last name, because we can reason that if
those elements are different between two records, those two records will not match.
Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching
one or more elements of the input record to each candidate pulled from the master. Once the data is compared, each
pair of records (one input and one candidate) will have a match score or a series of match scores. Scores below a
certain threshold can then be discarded, and potential matches can be output or displayed.
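A minimal sketch of this scoring step, assuming illustrative fields, weights, and an 80% threshold (none of which come from the product):

```python
# Sketch: score each input/candidate pair per field, combine into a weighted
# overall score, and discard pairs below the threshold. The weights and the
# 0.8 cut-off are illustrative assumptions.
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "street": 0.4}  # assumed field weights

def field_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(record, candidates, threshold=0.8):
    kept = []
    for cand in candidates:
        scores = {f: field_score(record[f], cand[f]) for f in WEIGHTS}
        overall = sum(scores[f] * w for f, w in WEIGHTS.items())
        if overall >= threshold:          # discard low-scoring pairs
            kept.append((cand, overall, scores))
    return kept
```

Keeping the per-field scores alongside the overall score makes it possible to see why a given pair did or did not match.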
Determining which records from the master should be compared with the incoming record is a critical decision in an
effective real-time matching system. For most organizations it is not realistic to match an incoming record against all
master records. Consider even a modest customer master data set with one million records: the amount of processing,
and thus the wait in real time, would be unacceptable.
Candidate selection for real-time matching is analogous to grouping or blocking in batch matching. The goal of
candidate selection is to select only the subset of records from the master that are definitively related by a field, part
of a field, or a combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this
key would be constructed and stored in an indexed field within the master table(s), allowing for the quickest retrieval.
There are many instances where multiple keys are used to allow for one key to be missing or different, while another
pulls in the record as a candidate.
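A sketch of the group-key idea, using the strategy adopted later in this document (first three characters of the zip, the house number, and the first character of the street name); the function and field names are illustrative:

```python
# Sketch of a candidate (group) key: first 3 characters of the zip, the
# house number, and the first character of the street name. In PowerCenter
# this would be an Expression transformation; the key column in the master
# table should be indexed for fast retrieval.
def candidate_key(zipcode: str, house_number: str, street_name: str) -> str:
    return zipcode[:3] + house_number + street_name[:1].upper()

# Two differently formatted versions of the same address share one key,
# so both would be pulled into the same candidate set.
```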
What specific data elements the candidate key should consist of depends very much on the scenario and the match
rules. The one common theme with candidate keys is that the data elements used should have the highest levels of
completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code.
The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For
acceptable two to three second response times, candidate record counts should be kept under 5000 records.
The following instructions further explain the steps for building a real-time matching solution using the Informatica
suite. They involve Informatica PowerCenter and Informatica Data Quality (IDQ).
Solution:
1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is
complete for all records and largely of high accuracy. Assume also that neither the first name nor the last
name field is completely populated; thus the match rules must account for blank names.
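One way to account for blank names in a match rule, sketched in Python (the neutral score of 0.5 is an assumption, not a product default):

```python
# Sketch of a null-tolerant match rule: a blank name on either side returns
# a neutral score rather than a hard mismatch, so incomplete records are
# not automatically excluded. The 0.5 neutral value is an assumption.
def name_score(a: str, b: str, neutral: float = 0.5) -> float:
    if not a.strip() or not b.strip():
        return neutral   # blank on either side: neither match nor mismatch
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
```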
2. The next step is to load the customer master file into the database. Below is a list of tasks that should be
implemented in the mapping that loads the customer master data into the database:
● Standardize and validate the address, outputting the discrete address components such as house
number, street name, street type, directional, and suite number. (Pre-built mapplet to do this; country
pack)
● Generate the candidate key field, populate that with the selected strategy (assume it is the first 3
characters of the zip, house number, and the first character of street name), and generate an index on
that field. (Expression, output of previous mapplet, hint: substr(in_ZIPCODE, 0, 3)||
in_HOUSE_NUMBER||substr(in_STREET_NAME, 0, 1))
● Standardize the phone number. (Pre-built mapplet to do this; country pack)
● Parse the name field into individual fields. Although the data structure indicates names are already
parsed into first, middle, and last, assume there are examples where the names are not properly
fielded. Also remember to output a value to handle nicknames. (Pre-built mapplet to do this; country
pack)
● Once complete, your customer master table should look something like this:
● Within PowerCenter Designer, go to the source analyzer and select the source menu. From there
select Web Service Provider and the Create Web Service Definition.
● You will see a screen like the one below where the Service can be named and input and output ports
can be created. Since this is a matching scenario, the potential that multiple records will be returned
must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports
section. Also add a match score output field to return the percentage at which the input record matches
the different potential matching records from the master.
4. An IDQ match plan must be built for use within the mapping. The most significant difference from a similar
match plan designed for standalone IDQ use is that a real-time plan uses a CSV source and a CSV sink,
both enabled for real-time. The source will have the _1 and _2 fields that a Group Source would
supply built into it, e.g., Firstname_1 & Firstname_2. Another difference from batch matching in PowerCenter is
that the DQ transformation can be set to passive. The following steps illustrate converting the North America
Country Pack’s Individual Name and Address Match plan from a plan built for use in a batch mapping to a plan
built for use in a real-time mapping.
● Open the DCM_NorthAmerica project and from within the Match folder make a copy of the
“Individual Name and Address Match” plan. Rename it to “RT Individual Name and Address
Match”.
● Create a new stub CSV file with only the header row. This will be used to generate a new CSV
Source within the plan. This header must use all of the input fields used by the plan before
modification. For convenience, a sample stub header is listed below. The header for the stub
file will duplicate all of the fields, with one set having a suffix of _1 and the other _2.
IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,
IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,
IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,
IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,
● Now delete the CSV Match Source from the plan and add a new CSV Source, and point it at the
new stub file.
● Because the components were originally mapped to the CSV Match Source and that was
deleted, the fields within your plan need to be reselected. As you open the different match
components and RBAs, you can see the different instances that need to be reselected as they
appear with a red diamond, as seen below.
● Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s)
must be selected for output. Because this plan will be imported into a passive transformation,
data can be passed around it and does not need to be carried through the
transformation. With this implementation you can output multiple match scores, making it
possible to see why two records matched or didn’t match on a field-by-field basis.
● Select the check box for Enable Real-time Processing in both the source and the sink and the
plan will be ready to be imported into PowerCenter.
❍ mplt_dq_p_Personal_Name_Standardization_FML
❍ mplt_dq_p_USA_Address_Validation
❍ mplt_dq_p_USA_Phone_Standardization_Validation
● Add an Expression Transformation and build the candidate key from the Address Validation mapplet
output fields. Remember to use the same logic as in the mapping that loaded the customer master.
Also within the expression, concatenate the pre and post directional field into a single directional field
for matching purposes.
● Add a SQL transformation to the mapping. The SQL transform will present a dialog box with a few
questions related to the SQL transformation. For this example select Query mode, MS SQL Server
(change as desired), and a Static connection. For details on the other options refer to the PowerCenter
help.
● Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to
the SQL transformation. These fields should include:
● The next step is to build the query from within the SQL transformation to select the candidate records.
Make sure that the output fields agree with the query in number, name, and type.
The output of the SQL transform will be the incoming customer record along with the candidate record.
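The candidate-selection query itself can be pictured as below. This is a hedged sketch against SQLite rather than the actual target database, and the table and column names are assumptions; the point is that the query pulls only the master rows sharing the incoming record's candidate key, through an indexed column:

```python
# Sketch of the candidate-selection query the SQL transformation would issue.
# Table/column names are assumptions; shown against an in-memory SQLite DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_master (
    cust_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
    candidate_key TEXT)""")
# Index the key column so candidate retrieval stays fast.
conn.execute("CREATE INDEX idx_candidate_key ON customer_master (candidate_key)")
conn.executemany(
    "INSERT INTO customer_master VALUES (?, ?, ?, ?)",
    [(1, "Robert", "Green", "902123M"),
     (2, "Ann", "Lee", "060455O")])

incoming_key = "902123M"   # built from the incoming record's address fields
candidates = conn.execute(
    "SELECT cust_id, firstname, lastname FROM customer_master "
    "WHERE candidate_key = ?", (incoming_key,)).fetchall()
# Only the one master row sharing the key is returned as a candidate.
```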
● Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4
into the mapping through the use of the Data Quality transformation. When this transformation is
created, select passive as the transformation type. The output of the Data Quality transformation will
be a match score. This match score will be in a float type format between 0.0 and 1.0.
● Using a Filter transformation, all records with a match score below a certain threshold are
filtered out. For this scenario, the cut-off is 80%. (Hint: TO_FLOAT(out_match_score) >= .80)
● Any record coming out of the filter transformation is a potential match that exceeds the specified
threshold, and the record will be included in the response. Each of these records needs a new Unique
ID so the Sequence Generator transformation will be used.
● To complete the mapping, the output of the Filter and Sequence Generator transformations need to be
mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the
primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the
response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator
to the primary key field of the response element group.
● The mapping should look like this:
● Using the Workflow Manager, generate a new workflow and session for this mapping using all the
defaults.
● Once created, edit the session task. On the Mapping tab select the SQL transformation and make sure
the connection type is relational. Also make sure to select the proper connection. For more advanced
tweaking and web service settings see the PowerCenter documentation.
● The final step is to expose this workflow as a Web Service. This is done by editing the workflow
and selecting the enabled checkbox for Web Services. Once the Web Service is enabled, it
should be configured. For all the specific details, refer to the PowerCenter documentation, but
for the purpose of this scenario:
a. Give the service the name you would like to see exposed to the outside world
Challenge
To provide a guide for testing data quality processes or plans created using Informatica
Data Quality (IDQ) and to manage some of the unique complexities associated with
data quality plans.
Description
Testing data quality plans is an iterative process that occurs as part of the Design
Phase of Velocity. Plan testing often precedes the project’s main testing activities, as
the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to
formally test the plans used in the Analyze Phase of Velocity.
Bear in mind that data quality plans are designed to analyze and resolve data content
issues. These are not typically cut-and-dried problems; more often they represent a
continuum of data improvement issues where every data instance may be
unique and there is a target level of data quality rather than a “right or wrong” answer.
Data quality plans tend to resolve problems in terms of percentages and probabilities
that a problem is fixed. For example, the project may set a target of 95 percent
accuracy in its customer addresses. The acceptable level of inaccuracy is also likely
to change over time, based upon the importance of a given data field to the underlying
business process. As well, accuracy should continuously improve as the data quality
rules are applied and the existing data sets adhere to a higher standard of quality.
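Measuring progress against such a target is straightforward to sketch; the validity check below is a placeholder assumption, not an IDQ rule:

```python
# Sketch: compute the share of records passing a validity rule and compare
# it against the project's accuracy target. The rule here is a placeholder.
def accuracy(records, is_valid) -> float:
    return sum(1 for r in records if is_valid(r)) / len(records)

addresses = ["12 Main St", "9 Oak Ave", ""]           # illustrative data
pct = accuracy(addresses, lambda a: bool(a.strip()))  # placeholder rule
meets_target = pct >= 0.95                            # e.g., a 95% target
```

Re-running the same measurement after each round of standardization shows whether the applied rules are actually moving the data toward the target.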
● What dataset will you use to test the plans? While the ideal situation is to
use a data set that exactly mimics the project production data, you may not
gain access to this data. If you obtain a full cloned set of the project data for
testing purposes, bear in mind that some plans (specifically some data
The best practice steps for testing plans can be grouped under two headings.
This process is concerned with establishing that a data enhancement plan has been
properly designed; that is, that the plan delivers the required improvements in data
quality.
This is largely a matter of comparing the business and project requirements for data
quality and establishing whether the plans are on course to deliver them. If not, the plans may
need a thorough redesign, or the business and project targets may need to be
revised. In either case, discussions should be held with the key business stakeholders
to review the results of the IDQ plan and determine the appropriate course of action.
Challenge
This document gives an insight into the type of considerations and issues a user needs
to be aware of when making changes to data quality processes defined in Informatica
Data Quality (IDQ). In IDQ, data quality processes are called plans.
The principal focus of this best practice is to know how to tune your plans without
adversely affecting the plan logic. This best practice is not intended to replace training
materials but serve as a guide for decision making in the areas of adding, removing or
changing the operational components that comprise a data quality plan.
Description
You should consider the following questions prior to making changes to a data quality
plan:
● What is the purpose of changing the plan? You should consider changing a
plan if you believe it is not optimally configured, if it is not functioning properly
and there is a problem at execution time, or if it is not delivering the expected
results as per the plan design principles.
● Are you trained to change the plan? Data quality plans can be complex.
You should not alter a plan unless you have been trained or are highly
experienced with IDQ methodology.
● Is the plan properly documented? You should ensure all plan
documentation on the data flow and the data components are up-to-date. For
guidelines on documenting IDQ plans, see the Sample Deliverable Data
Quality Plan Design.
● Have you backed up the plan before editing? If you are using IDQ in a
client-server environment, you can create a baseline version of the plan using
IDQ version control functionality. In addition, you should copy the plan to a
new project folder (e.g., Work_Folder) in the Workbench for changing and
testing, and leave the original plan untouched during testing.
● Is the plan operating directly on production data? This applies especially
to standardization plans. When editing a plan, always work on staged data
(database or flat-file). You can later migrate the plan to the production
environment after complete and thorough testing.
Bear in mind that at a high level there are two types of data quality plans: data analysis
and data enhancement plans.
● Data analysis plans produce reports on data patterns and data quality across
the input data. The key objective in data analysis is to determine the levels of
completeness, conformity, and consistency in the dataset. In pursuing these
objectives, data analysis plans can also identify cases of missing, inaccurate
or “noisy” data.
● Data enhancement plans correct completeness, conformity, and consistency
problems; they can also identify duplicate data entries and fix accuracy issues
through the use of reference data.
Your goal in a data analysis plan is to discover the quality and usability of your data. It
is not necessarily your goal to obtain the best scores for your data. Your goal in a data
enhancement plan is to resolve the data quality issues discovered in the data analysis.
Adding Components
In general, simply adding a component to a plan is not likely to directly affect results if
no further changes are made to the plan. However, once the outputs from the new
component are integrated into existing components, the data process flow is changed
and the plan must be re-tested and results reviewed in detail before migrating the plan
into production.
Bear in mind, particularly in data analysis plans, that improved plan statistics do not
always mean that the plan is performing better. It is possible to configure a plan that
moves “beyond the point of truth” by focusing on certain data elements and excluding
others.
When added to existing plans, some components have a larger impact than others. For
example, adding a “To Upper” component to convert text into upper case may not
cause the plan results to change meaningfully, although the presentation of the output
data will change. However, adding and integrating a Rule Based Analyzer component
can change the plan results significantly.
As well as adding a new component — that is, a new icon — to the plan, you can add a
new instance to an existing component. This can have the same effect as adding and
integrating a new component icon. To avoid overloading a plan with too many
components, it is a good practice to add multiple instances to a single component,
within reason. Good plan design suggests that instances within a single component
should be logically similar and work on the selected inputs in similar ways. The overall
name for the component should also be changed to reflect the logic of the instances
contained in the component. If you add a new instance to a component, and that
instance behaves very differently from the other instances in that component — for
example, if it acts on an unrelated set of outputs or performs an unrelated type of action
on the data — you should probably add a new component for this instance. This will
also help you keep track of your changes onscreen.
To avoid making plans over-complicated, it is often a good practice to split tasks into
multiple plans where a large amount of data quality measures need to be checked. This
makes plans and business rules easier to maintain and provides a good framework for
future development. For example, in an environment where a large number of attributes
must be evaluated against the six standard data quality criteria (i.e., completeness,
conformity, consistency, accuracy, duplication and consolidation) using one plan per
data quality criterion may be a good way to move forward. Alternatively, splitting plans
up by data entity may be advantageous. Similarly, during standardization, you can
create plans for specific function areas (e.g,. address, product, or name) as opposed to
adding all standardization tasks to a single large plan.
For more information on the six standard data quality criteria, see Data Cleansing.
Removing Components
Removing a component from a plan is likely to have a major impact since, in most
cases, data flow in the plan will be broken. If you remove an integrated component,
configuration changes will be required to all components that use the outputs from the
component. The plan cannot run without these configuration changes being completed.
The only exceptions are when the output(s) of the removed component are
used solely by a CSV Sink component or by a frequency component. Even in these
cases, however, note that the plan output changes, since the column(s) no longer
appear in the result set.
By contrast, changing the name of a component instance output does not break a plan. By
default, component output names “cascade” through the other components in the plan,
so when you change an output name, all subsequent components automatically update
with the new output name. It is not necessary to change the configuration of dependent
components.
Challenge
To understand and make full use of Informatica Data Explorer’s potential to profile and define mappings for your
project data.
Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration,
consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise
application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate
understanding of the true structure of the source data in order to correctly transform the data for a given target
database design. However, the data’s actual form rarely coincides with its documented or supposed form.
The key to success for data-related projects is to fully understand the data as it actually is, before attempting to
cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this
purpose.
This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.
Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality,
content and structure of project data sources. Data profiling analyzes several aspects of data structure and
content, including characteristics of each column or field, the relationships between fields, and the commonality
of data values between fields, often an indicator of redundant data.
Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics
against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a
field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality
standards may either be the native rules expressed in the source data’s metadata, or an external standard (e.g.,
corporate, industry, or government) to which the source data must be mapped in order to be assessed.
Data mapping involves establishing relationships among data elements in various data structures or sources, in
terms of how the same information is expressed or stored in different ways in different sources. By performing
these processes early in a data project, IT organizations can preempt the “code/load/explode” syndrome, wherein
a project fails at the load stage because the data is not in the anticipated form.
Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure
summarizes and abstracts these scenarios into a single depiction of the IDE solution.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents
cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data
sources and then modifying the model as required to meet current business and performance
requirements. In this scenario, IDE is used to develop the normalized schema into a target database.
The normalized and target schemas are then exported to IDE’s FTM/XML tool, which documents
transformation requirements between fields in the source, normalized, and target schemas.
OR
5. In a fixed-target scenario, the design of the target database is a given (i.e., because another
organization is responsible for developing it, or because an off-the-shelf package or industry standard is
to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to
map the source data fields to the corresponding fields in an externally-specified target schema, and to
document transformation requirements between fields in the normalized and target schemas. FTM is
used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based
metadata structures. Externally specified targets are typical for ERP package migrations, business-to-
business integration projects, or situations where a data modeling team is independently designing the
target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and
loading or formatting specs developed with IDE applications.
Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely
metadata and alternate metadata which is consistent with the data.
Cross-Table profiling - determines the overlap of values across a set of columns, which may come from
multiple tables.
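As a rough illustration of the two ideas (not IDE's implementation), column profiling can be pictured as inferring simple metadata from the values themselves, and cross-table profiling as measuring value overlap between columns:

```python
# Illustrative sketch of the two profiling ideas. Column profiling infers
# simple metadata from values; cross-table profiling measures value overlap
# between columns - a possible sign of redundant data.
def column_profile(values):
    non_null = [v for v in values if v not in (None, "")]
    return {
        "completeness": len(non_null) / len(values),
        "max_length": max((len(v) for v in non_null), default=0),
        "all_numeric": all(v.isdigit() for v in non_null),
    }

def overlap(col_a, col_b):
    a, b = set(col_a) - {None, ""}, set(col_b) - {None, ""}
    return len(a & b) / len(a | b)   # Jaccard overlap of distinct values
```

A real profiler also proposes alternate metadata consistent with the data and examines field dependencies; this sketch shows only the simplest per-column and cross-column measures.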
Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the
quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality
should be considered as an alternative tool for data cleansing.
Fixed-target migration projects involve the conversion and migration of data from one or more sources to an
externally defined, or fixed, target. IDE is used to profile the data and develop a normalized schema representing
the data source(s).
The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:
The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may
discover ‘hidden’ tables within tables.
Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a
target database defined by the migration team. IDE is used to profile the data and develop a normalized schema
representing the data source(s), and to further develop the normalized schema into a target schema by adding
tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or
denormalizing the schema to enhance performance. When the target schema is developed from the normalized
schema within IDE, the product automatically maintains the mappings from the source to normalized schema,
and from the normalized to target schemas.
The figure below shows that the general sequence of activities for a derived-target migration project is as follows:
Challenge
To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data
Cleanse and Match (DC&M) product offering.
Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter
system:
● Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be
designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until
needed.
● Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in
adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users
can connect to the Data Quality repository and read data quality plan information into this transformation.
Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document
focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components
of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication
functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address
reference data files.
Description
The North America Content Pack installs several plans to the Data Quality Repository:
● Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
● Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual
source matching operations (identifying matching records between two datasets).
The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.
These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-
structured data sources. The level of structure contained in a given data set determines the plan to be used.
The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and
validate an address.
● It is possible to apply these plans to the data on an individual basis. There is no requirement that the plans be run in
sequence with one another. For example, the address validation plan (plan 03) can be run successfully to validate input
addresses discretely from the other plans. In practice, the Data Quality Developer will not run all seven plans consecutively on
the same dataset: plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.
● Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the
seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation
and extremely complex plan logic that would be difficult to modify and maintain.
01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example,
consider data stored in the following format:
While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such
as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered
throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-
specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses,
depending on the profile of the content. As a result, the above data will be parsed into the following format:
The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by
information type. As demonstrated with the address fields in the above example, the contents are not arranged in a standard
address format; they are simply flagged as addresses in the order in which they were processed in the file.
The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because
they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.
The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and
address element are contained in the same field, the General Parser would label the entire field either a name or an address - or
leave it unparsed - depending on the elements in the field it can identify first (if any).
While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data
that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form
containing unparsed data.
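The dictionary-plus-pattern approach described above can be sketched as follows. This is a minimal illustration, not IDQ's actual implementation: the dictionary and patterns here are tiny stand-ins for the plan's reference dictionaries, and the function names are hypothetical. Note that, like the General Parser, the sketch checks structure only, so a value of 99/99/9999 still classifies as a date.

```python
import re

# Hypothetical mini-dictionary; a real plan uses much larger reference dictionaries.
COMPANY_SUFFIXES = {"inc", "llc", "corp", "ltd"}

# Pattern recognition for structured information types. These validate
# structure, not correctness -- 99/99/9999 "looks like" a date and passes.
PATTERNS = [
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("ssn", re.compile(r"^\d{3}-\d{2}-\d{4}$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("phone", re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")),
    ("address", re.compile(r"^\d+\s+\w+")),
]

def classify(value: str) -> str:
    """Sort a field value into a type-specific bucket, or leave it unparsed."""
    v = value.strip()
    for label, pattern in PATTERNS:
        if pattern.match(v):
            return label
    # Fall back to a dictionary check for company suffixes.
    if v and v.split()[-1].rstrip(".").lower() in COMPANY_SUFFIXES:
        return "company"
    return "unparsed"

record = ["john@example.com", "99/99/9999", "123 Main St", "Acme Inc."]
print([classify(f) for f in record])  # ['email', 'date', 'address', 'company']
```

As in the plan itself, the effectiveness of this kind of classifier is entirely a function of the dictionaries and patterns supplied to it.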
The effectiveness of the General Parser to recognize various information types is a function of the dictionaries used to identify that
data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan.
Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g.,
telephone and email in the same contact field), or where the data has been badly managed, such as when several files of differing
structures have been merged into a single file.
02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and
standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.
The plan input fields include two inputs for company names. Data entered in these fields is assumed to consist of valid company
names, and no additional tests are performed to validate that the data represents an existing company. Any combination of letters,
numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to
validate a company name are not likely to yield usable results.
Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the
Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is
matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.
The second track for name standardization is person names standardization. While this track is dedicated to standardizing person
names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow
a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company
suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is
applied to identify people whose last name is similar (or equal) to a valid company name (for example John Sears); inputs that
contain an identified first name and a company name are treated as a person name.
If the company name track inputs are already fully populated for the record in question, then any company name detected in a
person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name
(e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining
data is accepted as being a valid person name and parsed as such.
North American person names are typically entered in one of two different styles: either in a “firstname middlename surname”
format or “surname, firstname middlename” format. Name parsing algorithms have been built using this assumption.
Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name elements.
When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details
including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated.
In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate.
The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality
plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).
Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed
according to person name processing rules. Likewise, some person names may be identified as companies and standardized
according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when
working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.
Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the
fields. For example, an address datum such as “Corporate Parkway” may be standardized as a business name, as “Corporate” is
also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on
whether or not the field contains a recognizable company suffix in the text.
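The routing logic described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the plan's actual rule logic: the suffix, first-name, and company dictionaries are tiny stand-ins for the plan's reference dictionaries (such as company_names.dic), and the `route` function is hypothetical.

```python
COMPANY_SUFFIXES = {"inc", "corp", "llc", "ltd", "co"}
KNOWN_FIRST_NAMES = {"john", "steven", "chris", "eugene"}  # stand-in for a first-name dictionary
KNOWN_COMPANIES = {"sears", "staples"}                     # stand-in for company_names.dic

def route(value: str) -> str:
    """Decide whether a name-field value follows the person or company track."""
    tokens = [t.rstrip(".,").lower() for t in value.split()]
    has_suffix = any(t in COMPANY_SUFFIXES for t in tokens)
    has_company = any(t in KNOWN_COMPANIES for t in tokens)
    # A recognized first name followed by a company-like surname is still
    # treated as a person (e.g., "John Sears").
    if tokens and tokens[0] in KNOWN_FIRST_NAMES and not has_suffix:
        return "person"
    if has_suffix or has_company:
        return "company"
    # Digits without a company indicator are routed to non-name output.
    if any(any(c.isdigit() for c in t) for t in tokens):
        return "non-name"
    return "person"

print(route("John Sears"))     # person
print(route("Acme Corp."))     # company
print(route("Sears Chicago"))  # company
print(route("Route 66"))       # non-name
```

The sketch makes the same trade-off the plan does: routing quality depends entirely on dictionary coverage, which is why adjusting the dictionaries is the recommended remedy when names are routed to the wrong track.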
To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-
execution analysis of the data.
ROW ID  IN NAME1
1       Steven King
2       Chris Pope Jr.
3       Shannon C. Prince
4       Dean Jones
5       Mike Judge
6       Thomas Staples
7       Eugene F. Sears
8       Roy Jones Jr.
9       Thomas Smith, Sr
10      Eddie Martin III
11      Martin Luther King, Jr.
12      Staples Corner
13      Sears Chicago
14      Robert Tyre
15      Chris News
03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United
States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where
processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key
search elements into discrete fields, thereby speeding up the validation process.
The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All
remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that
cannot be parsed into the remaining fields is merged into the non-address data field.
The plan makes a number of assumptions that may or may not suit your data:
● When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are
spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where
town names are commonly misspelled, the standardization plan may not correctly parse the information.
● Zip codes are all assumed to be five digits. In some files, zip codes that begin with “0” may have lost this leading digit and so
appear as four-digit codes; these may be missed during parsing. Adding four-digit zips to the dictionary is not
recommended, as these will conflict with the “Plus 4” element of a zip code. Zip codes may also be confused with other
five-digit numbers in an address line, such as street numbers.
● City names are also commonly found in street names and other address elements. For example, “United” is part of a
country (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates
from right to left across the data, so that country name and zip code fields are analyzed before city names and street
addresses. Therefore, the word “United” may be parsed and written as the town name for a given address before the
actual town name datum is reached.
● The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there
is no need to include any country code field in the address inputs when configuring the plan.
Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding
some pre-processing logic to a workflow prior to passing the data into the plan.
The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been
parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as
address lines 1-3.
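The right-to-left parsing behavior described above can be sketched as follows. The city and state dictionaries here are tiny stand-ins for the plan's reference dictionaries, and the country-derivation rule is deliberately simplified; the `parse_address` function is hypothetical, not part of IDQ.

```python
import re

US_STATES = {"NY", "CA", "IL", "TX"}       # stand-in for a state dictionary
CA_PROVINCES = {"ON", "BC"}                # stand-in for a province dictionary
CITIES = {"chicago", "dallas", "toronto"}  # stand-in for a city dictionary

def parse_address(fields):
    """Scan generic address fields right to left, pulling out zip, state, city.

    Note the caveat from the text: a five-digit street number encountered
    before the real zip code would be grabbed as the zip.
    """
    tokens = " ".join(fields).split()
    out = {"zip": None, "state": None, "city": None}
    rest = []
    for tok in reversed(tokens):  # right-to-left, as the dictionary parsing does
        clean = tok.rstrip(",.")
        if out["zip"] is None and re.fullmatch(r"\d{5}(-\d{4})?", clean):
            out["zip"] = clean
        elif out["state"] is None and clean.upper() in US_STATES | CA_PROVINCES:
            out["state"] = clean.upper()
        elif out["city"] is None and clean.lower() in CITIES:
            out["city"] = clean.title()
        else:
            rest.append(tok)
    # Remaining tokens are assumed to be street address data, kept in order.
    out["address_line"] = " ".join(reversed(rest))
    # Append a country code when the state/province identifies it.
    if out["state"] in US_STATES:
        out["country"] = "US"
    elif out["state"] in CA_PROVINCES:
        out["country"] = "CA"
    else:
        out["country"] = None
    return out

print(parse_address(["123 Main St", "Chicago IL 60601"]))
```

As with the plan, everything not claimed by the zip, state, or city checks is assumed to be street address data, which is why misspelled city names fall through into the address lines.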
04 NA Address Validation
The NA Address Validation plan serves two purposes:
● To match input addresses against known valid addresses in an address database, and
● To parse, standardize, and enrich the input addresses.
The address validation APIs store specific area information in memory and continue to use that information from one record to the
next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to
maximize the usage of data in memory.
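The sort-before-validate advice above amounts to nothing more than ordering the input by zip/postal code so that records for the same area arrive consecutively; a trivial sketch (field names hypothetical):

```python
records = [
    {"name": "A", "zip": "60601"},
    {"name": "B", "zip": "10001"},
    {"name": "C", "zip": "60601"},
]

# Sorting by zip keeps records for the same area adjacent, so the address
# validation APIs can reuse the area data they already hold in memory
# instead of reloading it for each record.
records.sort(key=lambda r: r["zip"])
print([r["zip"] for r in records])  # ['10001', '60601', '60601']
```

In a PowerCenter mapping, a Sorter transformation upstream of the validation plan accomplishes the same thing.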
In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1
User Guide for information on how to interpret them.
These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs: either plans 05 and 06, or plans
05 and 07. The plans work as follows:
● 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on
the data prior to matching.
● 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
● 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.
Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed
directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English
data. Although they work with datasets in other languages, the results may be sub-optimal.
Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and
group the data.
The aim of standardization here differs from a classic standardization plan: the intent is to make different spellings, abbreviations,
and other variants as similar to each other as possible in order to return a better match set. For example, 123 Main Rd. and 123
Main Road will obtain an imperfect match score, although they clearly refer to the same street address.
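The point can be shown with a minimal sketch. The abbreviation map here is a hypothetical stand-in for the plan's reference dictionaries:

```python
# Hypothetical abbreviation map; real plans drive this from reference dictionaries.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def standardize(address: str) -> str:
    """Expand abbreviations and normalize case so variant spellings converge."""
    tokens = [t.rstrip(".").lower() for t in address.split()]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

a = standardize("123 Main Rd.")
b = standardize("123 Main Road")
print(a == b)  # True -- the pair can now match exactly
```

After this pre-processing step, the two addresses produce a perfect rather than an imperfect match score.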
Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a
matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records
within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing
time while minimizing the likelihood of missed matches in the dataset.
Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data
columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to
facilitate matching performance; therefore, the main function of grouping in a PowerCenter context is to create candidate group
keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)
Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of
additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John
Smith with the record for Angela Murphy as they are obviously not going to be considered as duplicate entries. The type of
grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data
are location-based (e.g. city name, zip codes) or person/company based (surname and company name composites). For more
information on grouping strategies for best result/performance relationship, see the Best Practice Effective Data Matching
Techniques.
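The group-then-match pattern described above can be sketched as follows. This is an illustration of the concept only: the similarity measure here is Python's difflib ratio standing in for IDQ's matching components, and the 0.85 threshold mirrors the plans' default 85% setting.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple string similarity; a stand-in for IDQ's matching components."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_within_groups(records, group_key, threshold=0.85):
    """Compare records only within groups sharing a key -- never across groups."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    matches = []
    for members in groups.values():
        # Serial matching on a group-by-group basis.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if similarity(members[i]["name"], members[j]["name"]) >= threshold:
                    matches.append((members[i]["name"], members[j]["name"]))
    return matches

records = [
    {"name": "John Smith", "zip": "60601"},
    {"name": "Jon Smith", "zip": "60601"},
    {"name": "Angela Murphy", "zip": "10001"},
]
print(match_within_groups(records, "zip"))  # [('John Smith', 'Jon Smith')]
```

Because John Smith and Angela Murphy fall into different zip groups, that pointless comparison is never made, which is exactly how grouping cuts processing time without sacrificing meaningful matches.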
Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before
group keys are generated, and it offers a number of grouping options. The grouping output used depends on the data contents and
data volume.
Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used.
However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression
transform upstream in the PowerCenter mapping.
A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-
based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ
matching components, consult the Informatica Data Quality 3.1 User Guide.
By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The
Data Quality Developer can easily adjust this figure in each plan.
PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.
To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts
data according to the group key to be used during matching. This transformation should follow standardization and grouping
operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality
Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the
same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active
transformation.
The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if one is not
present in the source data. (Note that a unique identifier is not required for matching processes.)
When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for
the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively.
The data from the two sources is then joined together using a Union transformation, before being passed to the Integration
transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single
source version.
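The dual-source flow above (tag each source, union the streams, then match across sources) can be sketched as follows. The data and field names are hypothetical, and difflib stands in for the matching plan; in PowerCenter the tagging is done by Expression transformations and the combining by a Union transformation.

```python
from difflib import SequenceMatcher

# Tag each input with a source label, as the Expression transformations do.
source_a = [{"name": "Acme Inc"}, {"name": "Globex Corp"}]
source_b = [{"name": "ACME Inc."}, {"name": "Initech"}]
union = [dict(r, source="A") for r in source_a] + \
        [dict(r, source="B") for r in source_b]

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().rstrip("."), b.lower().rstrip(".")).ratio()

# Dual-source matching compares records *across* the two sources only,
# unlike single-source matching, which looks for duplicates within one set.
matches = []
for a in (r for r in union if r["source"] == "A"):
    for b in (r for r in union if r["source"] == "B"):
        if similar(a["name"], b["name"]) >= 0.85:  # plans default to 85%
            matches.append((a["name"], b["name"]))
print(matches)  # [('Acme Inc', 'ACME Inc.')]
```

Downstream of the union, the sketch mirrors the single-source design; only the source tag and the cross-source comparison rule differ.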
Challenge
Develop a sound data integration architecture that can serve as a foundation for data integration
solutions.
Description
Historically, organizations have approached the development of a "data warehouse" or "data mart"
as a departmental effort, without considering an enterprise perspective. The result has been silos
of corporate data and analysis, which very often conflict with each other in terms of both detailed
data and the business conclusions implied by it. Data integration efforts are often the cornerstone
in today's IT initiatives. Taking an enterprise-wide, architectural stance in developing data
integration solutions provides many advantages, including:
● A sound architectural foundation ensures the solution can evolve and scale with the
business over time.
● Proper architecture can isolate the application component (business context) of the data
integration solution from the technology.
● Broader data integration efforts will be simplified by using a holistic, enterprise-based
approach.
● Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.
As the evolution of data integration solutions (and the corresponding nomenclature) has
progressed, the necessity of building these solutions on a solid architectural framework has
become more and more clear. To understand why, a brief review of the history of data
integration solutions and their predecessors is warranted.
As businesses become more global, Service Oriented Architecture (SOA) becomes more of an
Information Technology standard. Having a solid architecture is paramount to the success of data
integration efforts.
Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed,
transaction-oriented view of an organization's data. While this view was indispensable for the day-
to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for
management decision-making, was severely limited. Initial attempts to address this problem took
several directions:
Reporting directly against the production system. This approach minimized the effort
associated with developing management reports, but introduced a number of significant issues:
● Ad hoc queries against the production database introduced uncontrolled performance issues,
resulting in slow reporting results and degradation of OLTP system performance.
● Trending and aggregate analysis was difficult (or impossible) with only the detailed data available
in the OLTP systems.
The initial attempts at reporting solutions were typically point solutions; they were developed
internally to provide very targeted data to a particular department within the enterprise. For
example, the Marketing department might extract sales and demographic data in order to infer
customer purchasing habits. Concurrently, the Sales department was also extracting sales data for
the purpose of awarding commissions to the sales force. Over time, these isolated silos of
information became irreconcilable, since the extracts and business rules applied to the data during
the extract process differed for the different departments.
The result of this evolution was that the Sales and Marketing departments might report completely
different sales figures to executive management, resulting in a lack of confidence in both
departments' "data marts." From a technical perspective, the uncoordinated extracts of the same
data from the source systems multiple times placed undue strain on system resources.
The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would
be supported by a single set of periodic extracts of all relevant data into the data warehouse (or
Operational Data Store), with the data being cleansed and made consistent as part of the extract
process. The problem with this solution was its enormous complexity, typically resulting in project
failure. The scale of these failures led many organizations to abandon the concept of the enterprise
data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these
solutions still had all of the issues discussed previously, they had the clear advantage of providing
individual departments with the data they needed without the unmanageability of the enterprise
solution.
As individual departments pursued their own data and data integration needs, they not only created
data stovepipes, they also created technical islands. The approaches to populating the data marts
and performing the data integration tasks varied widely, leaving a single enterprise evaluating,
purchasing, and maintaining many disparate tools and technologies.
The first approach to gain popularity was the centralized data warehouse. Designed to solve the
decision support needs for the entire enterprise at one time, with one effort, the data integration
process extracts the data directly from the operational systems. It transforms the data according to
the business rules and loads it into a single target database serving as the enterprise-wide data
warehouse.
Advantages
The centralized model offers a number of benefits to the overall architecture, including:
● Centralized control. Since a single project drives the entire process, there is centralized
control over everything occurring in the data warehouse. This makes it easier to manage a
production system while concurrently integrating new components of the warehouse.
● Consistent metadata. Because the warehouse environment is contained in a single
database and the metadata is stored in a single repository, the entire enterprise can be
queried, whether you are looking at data from Finance, Customers, or Human Resources.
Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks, most notably the
enormous scope and complexity that such an all-at-once, enterprise-wide effort entails.
The second warehousing approach is the independent data mart, which gained popularity in 1996
when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the
same principles as the centralized approach, but it scales down the scope from solving the
warehousing needs of the entire company to the needs of a single department or workgroup.
Much like the centralized data warehouse, an independent data mart extracts data directly from the
operational sources, manipulates the data according to the business rules, and loads a single
target database.
Advantages
The independent data mart is the logical opposite of the centralized data warehouse: the
disadvantages of the centralized approach are the strengths of the independent data mart.
Disadvantages
● Lack of centralized control. Because several independent data marts are needed to
solve the decision support needs of an organization, there is no centralized control. Each
data mart or project controls itself, but there is no central control from a single location.
● Redundant data. After several data marts are in production throughout the organization,
all of the problems associated with data redundancy surface, such as inconsistent
definitions of the same data object or timing differences that make reconciliation
impossible.
● Metadata integration. Due to their independence, the opportunity to share metadata - for
example, the definition and business rules associated with the Invoice data object - is lost.
Subsequent projects must repeat the development and deployment of common data
objects.
● Manageability. The independent data marts control their own scheduling routines and
therefore store and report their metadata differently, with a negative impact on the
manageability of the data warehouse. There is no centralized scheduler to coordinate the
individual loads appropriately, nor a metadata browser to maintain the global metadata and
share development work among related projects.
The third warehouse architecture is the dependent data mart approach supported by the hub-and-
spoke architecture of PowerCenter and PowerExchange. After studying more than one hundred
different warehousing projects, Informatica introduced this approach in 1998, leveraging the
benefits of the centralized data warehouse and independent data mart.
The more general term being adopted to describe this approach is the "federated data warehouse."
Industry analysts have recognized that, in many cases, there is no "one size fits all" solution.
Although the goal of true enterprise architecture, with conformed dimensions and strict standards,
is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated
data warehouse was born. It allows for the relatively independent development of data marts, but
leverages a centralized PowerCenter repository for sharing transformations, source and target
objects, business rules, etc.
Recent literature describes the federated architecture approach as a way to get closer to the goal
of a truly centralized architecture while allowing for the practical realities of most organizations. The
centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the
organization can develop semi-autonomous data marts, so long as they subscribe to a common
view of the business. This common business model is the fundamental, underlying basis of the
federated architecture, since it ensures consistent use of business terms and meanings throughout
the enterprise.
With the exception of the rare case of a truly independent data mart, where no future growth is
planned or anticipated, and where no opportunities for integration with other business areas exist,
the federated data warehouse architecture provides the best framework for building a data
integration solution.
This environment allows for relatively independent development of individual data marts, but also
supports metadata sharing without obstacles. The common business model and names described
above can be captured in metadata terms and stored in the Global Repository. The data marts use
the common business model as a basis, but extend the model by developing departmental
metadata and storing it locally.
A typical characteristic of the federated architecture is the existence of an Operational Data Store
(ODS). Although this component is optional, it can be found in many implementations that extract
data from multiple source systems and load multiple targets. The ODS was originally designed to
extract and hold operational data that would be sent to a centralized data warehouse, working as a
time-variant database to support end-user reporting directly from operational systems. A typical
ODS had to be organized by data subject area because it did not retain the data model from the
operational system.
Informatica's approach to the ODS, by contrast, involves virtually no change in data model from the
operational system, so it need not be organized by subject area. The ODS does not permit direct
end-user access or reporting; it serves as a staging and consolidation layer.
Advantages
The Federated architecture brings together the best features of the centralized data warehouse
and independent data mart:
● Room for expansion. While the architecture is designed to quickly deploy the initial data
mart, it is also easy to share project deliverables across subsequent data marts by
migrating local metadata to the Global Repository. Reuse is built in.
● Centralized control. A single platform controls the environment from development to test
to production. Mechanisms to control and monitor the data movement from operational
databases into the data integration environment are applied across the data marts, easing
the system management task.
● Consistent metadata. A Global Repository spans all the data marts, providing a
consistent view of metadata.
● Enterprise view. Viewing all the metadata from a central location also provides an
enterprise view, easing the maintenance burden for the warehouse administrators.
Business users can also access the entire environment when necessary (assuming that
security privileges are granted).
● High data integrity. Using a set of integrated metadata repositories for the entire
enterprise removes data integrity issues that result from duplicate copies of data.
● Minimized impact on operational systems. Frequently accessed source data, such as
customer, product, or invoice records, is moved into the decision support environment
once, leaving the operational systems unaffected by the number of target data marts.
Disadvantages
● Data propagation. This approach moves data twice: to the ODS, then into the individual
data mart. This requires extra database space to store the staged data as well as extra
time to move the data. However, this disadvantage can be mitigated by not saving the data
permanently in the ODS; after the warehouse is refreshed, the ODS can be truncated, or
a rolling three months of data can be saved.
● Increased development effort during initial installations. For each table in the target,
one load must be developed from the ODS to the target, in addition to all the
loads from the sources into the ODS.
Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is
not organized by subject area and is not customized for viewing by end users or even for reporting.
Data from the various operational sources is staged for subsequent extraction by target systems in
the ODS. In the ODS, data is cleaned and remains normalized, tables from different databases are
joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS
refreshes, for instance).
The ODS and the data marts may reside in a single database or be distributed across several
physical databases and servers.
Data stored in the ODS is typically:
● Normalized
● Detailed (not summarized)
● Integrated
● Cleansed
● Consistent
Within an enterprise data mart, the ODS consolidates data from disparate systems in a number
of ways. It:
● Normalizes data where necessary (such as non-relational mainframe data), preparing it for
storage in a relational system.
● Cleans data by enforcing commonalties in dates, names and other data types that appear
across multiple systems.
● Maintains reference data to help standardize other formats; references might range from
zip codes and currency conversion rates to product-code-to-product-name translations.
The ODS may apply fundamental transformations to some database tables in order to
reconcile common definitions, but the ODS is not intended to be a transformation
processor for end-user reporting requirements.
Its role is to consolidate detailed data within common formats. This enables users to create a wide
variety of data integration reports, with confidence that those reports will be based on the same
detailed data, using common definitions and formats.
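The kinds of consolidation listed above (enforcing common date formats, applying reference data, repairing format damage) can be sketched as follows. All data, field names, and the reference table are hypothetical illustrations, not part of any product.

```python
from datetime import datetime

# Hypothetical reference data, of the kind an ODS might maintain
# (product-code-to-product-name translations).
PRODUCT_NAMES = {"P-100": "Widget", "P-200": "Gadget"}

def to_iso(date_str: str) -> str:
    """Enforce one date format across records arriving in several styles."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {date_str}")

def consolidate(record: dict) -> dict:
    """Apply fundamental, reconciling transformations -- no report-level logic."""
    return {
        "order_date": to_iso(record["order_date"]),
        "product": PRODUCT_NAMES.get(record["product_code"], record["product_code"]),
        "zip": record["zip"].zfill(5),  # restore leading zeros lost in extraction
    }

print(consolidate({"order_date": "3/15/2024", "product_code": "P-100", "zip": "2138"}))
```

Note that the transformations stop at reconciling common definitions; summarization and report-specific shaping are deliberately left to the downstream data marts, consistent with the ODS role described above.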
The following table compares the key differences in the three architectures:
The federated architecture approach allows for the planning and implementation of an enterprise
architecture framework that addresses not only short-term departmental needs, but also the long-
term enterprise requirements of the business. This does not mean that the entire architectural
investment must be made in advance of any application development. However, it does mean that
development is approached within the guidelines of the framework, allowing for future growth
without significant technological change. The remainder of this chapter will focus on the process of
designing and developing a data integration solution architecture using PowerCenter as the
platform.
Very few organizations have the luxury of creating a "green field" architecture to support their
decision support needs. Rather, the architecture must fit within an existing set of corporate
guidelines regarding preferred hardware, operating systems, databases, and other software. The
Technical Architect, if not already an employee of the organization, should ensure that he or she
has a thorough understanding of the existing technical infrastructure and the future vision for it.
Doing so eliminates the possibility of developing an elegant technical solution that will never be
implemented because it defies corporate standards.
Challenge
Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution.
While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions
that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup
Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with
PowerCenter for additional information.
Description
The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.
Mapping Design
Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)
In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-
width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform
intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which
allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and
custom SQL SELECTs where appropriate.
Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by
a single map?)
With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you
can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of
complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can
also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to
multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple
targets, and to multiple sessions running simultaneously.
Q: What are some considerations for determining how many objects and transformations to include in a single mapping?
The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement.
Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging
and better understandability, as well as to create potential partition points. This should be balanced against the fact that more
objects means more overhead for the DTM process.
It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to
use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the
WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to
increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.
The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To
perform the logging function, the Service Manager runs a Log Manager and a Log Agent.
The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain
operations and application services. The log events contain operational and error messages for a domain.
The Log Agent runs on the nodes to collect and process log events for session and workflows. Log events for workflows include
information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for
sessions include information about the tasks performed by the Integration Service, session errors, and load summary and
transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the
Workflow Monitor.
Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log
events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or
application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.
Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays
domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error
messages.
Q: Where should log event files be stored?
One often-recommended location is a shared directory location that is accessible to the gateway node. If you have more than
one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed
in the Administration Console.
If you have more than one PowerCenter domain, you must configure a different directory path for each domain’s Log Manager.
Multiple domains cannot use the same shared directory path.
For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.
Q: What documentation is available for the error codes that appear within the error log files?
Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error
information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific
errors, consult your Database User Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?
Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the
warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group
can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the
targets.
Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either.
● A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure
that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2
when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next
session only if the previous session was successful, or to stop on errors, etc.
● A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at
one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multi-
processing (SMP) architecture.
Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse.
This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.
Q: If a workflow fails, can it be restarted from the point of failure?
No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is
possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow
fails, first recover and then restart that workflow from the Workflow Monitor.
Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?
Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow
From Task."
Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?
The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The
load manager is always running as a process. If I/O and network bottlenecks are addressed, a session will be
compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive,
so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a
session needs about 120 percent of a processor for the DTM, reader, and writer in total.
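The rule of thumb above can be turned into a quick sizing calculation; the 120 percent figure is the guideline quoted here, not a product guarantee:

```python
import math

# Sizing sketch based on the "~120% of a processor per session" rule of thumb.
def max_concurrent_sessions(processors: int, cpu_per_session: float = 1.2) -> int:
    """Estimate how many sessions can run before the host becomes CPU-bound."""
    return math.floor(processors / cpu_per_session)

for cpus in (4, 8, 16):
    print(cpus, "processors ->", max_concurrent_sessions(cpus), "sessions")
```

For example, an 8-processor host supports roughly six concurrent compute-bound sessions before CPU becomes the limiting factor, which is consistent with the "about one session per processor" guidance that follows.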
One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what
number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.
If possible, sessions should run at "off-peak" hours to have as many available resources as possible.
Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory
usage is more difficult than the processor calculation; it tends to vary according to system load and number of PowerCenter
sessions running.
Next, each session being run needs to be examined with regard to the memory usage, including the DTM buffer size and any
cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters and joiners.
At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the
production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may
be able to run only one large session, or several small sessions concurrently.
Load-order dependencies are also an important consideration because they often create additional constraints. For example,
load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may
become saturated if overloaded; and some target tables may need to be available to end users earlier than others.
Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify
the Server Administrator?
The application level of event notification can be accomplished through post-session email. Post-session email allows you to
include session information in the message body using email variables:

%s — Session name
%e — Session status
%a<filename> — Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>
The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter
server must have the rmail tool installed in the path in order to send email.
1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and ensure that directory is included in the PATH.
A post-session email for a successful session resembles the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Status: Completed
Load summary: 1 0 30 1 (t_Q3_sales)
No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to any text pager that accepts email.
Q: Can individual objects within a repository be restored from the backup or from a prior version?
At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you
can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then
manually copy the individual objects back into the main repository.
It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To
correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.
An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of
individual objects, mappings, tasks, workflows, etc.
Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA,
and production environments.
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other
significant event occurs?
The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the
Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by
another user. Notification messages are received through the PowerCenter Client tools.
Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?
The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.
A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT
documentation.
Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?
If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this
is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started
correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.
Custom Transformations
Q: What is the relationship between the Java or SQL transformation and the Custom transformation?
Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations
operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality.
Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator,
and many others. Below is a summary of noticeable differences.
Q: What is the main benefit of a Custom transformation over an External Procedure transformation?
A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation
handles both the input and output simultaneously. Additionally, an External Procedure transformation’s parameters consist of all
the ports of the transformation.
The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to be received before any output rows can be generated.
Note that after a Custom transformation is created, its transformation type cannot be changed. To set the appropriate type,
delete and recreate the transformation.
Q: What is the difference between active and passive Java transformations? When should one be used over the other?
An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive
Java transformation only allows for the generation of one output row per input row.
Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports
that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use
passive when you need one output row for each input.
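The start-date/end-date example can be sketched in Python (PowerCenter Java transformations use their own API; this is only a conceptual illustration of active behavior, with invented dates):

```python
from datetime import date, timedelta

# Sketch of an *active* transformation: one input row (start date, end date)
# produces one output row per day in the range.
def expand_dates(start: date, end: date):
    current = start
    while current <= end:
        yield current            # each yielded value corresponds to one output row
        current += timedelta(days=1)

rows = list(expand_dates(date(2024, 1, 1), date(2024, 1, 4)))
print(len(rows))  # four output rows from a single input row
```

A passive transformation, by contrast, would map the single input row to exactly one output row.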
Q: What is the SQL transformation used for?
A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete,
update, and retrieve rows from a database. For example, you might need to create database tables before adding new
transactions. The SQL transformation allows for the creation of these tables from within the workflow.
Q: What is the difference between the SQL transformation’s Script and Query modes?
Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a
query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters.
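Query mode's parameter binding behaves conceptually like a parameterized query executed once per input row. A sketch using Python's sqlite3 as a stand-in database (table and column names are invented for illustration):

```python
import sqlite3

# The query is defined once; the selection parameter varies per input row,
# which is the essence of a dynamic query in Query mode.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?, ?)",
                 [(1, "EMEA", 10.0), (2, "APAC", 20.0), (3, "EMEA", 30.0)])

query = "SELECT id, amount FROM txn WHERE region = ?"   # defined once
for input_row in ("EMEA", "APAC"):                      # parameter per input row
    print(input_row, conn.execute(query, (input_row,)).fetchall())
```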
For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may
be extracted from the PowerCenter repository and used in others?
With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the
amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column
level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and
primary keys are stored in the repository.
The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to
enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. Therefore, this
decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata.
There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party
metadata software and, for sources and targets, data modeling tools.
Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store,
retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata
Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.
Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and
MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because some releases of
PowerCenter change the structure of the repository tables, creating a maintenance burden for you. Instead, the MX views are
provided for metadata access. Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting
against the repository database and are able to present reports to the end user and/or management.
Versioning
Q: How can I keep multiple copies of the same object within PowerCenter?
A: With PowerCenter, you can use version control to maintain previous copies of every changed object.
You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an
object, control development of the object, and track changes. You can configure a repository for versioning when you create it,
or you can upgrade an existing repository to support versioned objects.
When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object
has an active status.
You can perform the following tasks when you work with a versioned object:
● View object version properties. Each versioned object has a set of version properties and a status. You can also
configure the status of a folder to freeze all objects it contains or make them active for editing.
● Track changes to an object. You can view a history that includes all versions of a given object, and compare any
version of the object in the history to any other version. This allows you to determine changes made to an object over
time.
● Check the object version in and out. You can check out an object to reserve it while you edit the object. When you
check in an object, the repository saves a new version of the object and allows you to add comments to the version.
You can also find objects checked out by yourself and other users.
● Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You
can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from
the repository.
Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time
on making a list of all changed/affected objects?
You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can
create the following types of deployment groups:
● Static. You manually select the objects to include in the group.
● Dynamic. The group is populated by the results of an object query run at deployment time.
To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment
group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of
deployment. You can associate an object query with a deployment group when you edit or create a deployment group.
If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another.
Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source
repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects
to copy, rather than the entire contents of a folder.
Performance
Q: How does the Load Balancer dispatch tasks?
Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels
to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console’s domain
properties.
For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.
Web Services
Q: What is the Web Services Hub?
A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients
that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and
Repository Service through the Web Services Hub.
The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter
installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For
more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide.
The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log
in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name
and password. You can use the Web Services Hub console to view service information and download Web Services
Description Language (WSDL) files necessary for running services and workflows.
Challenge
Using pre-defined and user-defined events with Event-Wait and Event-Raise tasks to control the flow of workflow execution.
Description
The following paragraphs describe events that can be triggered by an Event-Wait task.
Pre-defined Event
To use a pre-defined event, you need a session, shell command, script, or batch file to
create an indicator file. You must create the file locally or send it to a directory local to
the PowerCenter Server. The file can be any format recognized by the PowerCenter
Server operating system. You can choose to have the PowerCenter Server delete the
indicator file after it detects the file, or you can manually delete the indicator file. The
PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot
delete the indicator file.
To configure an Event-Wait task to use a pre-defined event:
1. Create an Event-Wait task and double-click the Event-Wait task to open the
Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects
the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
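A pre-defined event can be signaled by any process able to create the indicator file. A minimal sketch in Python, using a hypothetical indicator path that would have to match the path entered in the Events tab and be local to the server:

```python
import tempfile
from pathlib import Path

# Create the indicator file an Event-Wait task watches for. An empty file is
# enough; only the file's existence matters to the waiting task.
def signal_event(indicator: Path) -> None:
    indicator.parent.mkdir(parents=True, exist_ok=True)
    indicator.touch()

# The directory and file name here are illustrative assumptions.
signal_dir = Path(tempfile.mkdtemp())
signal_event(signal_dir / "orders_loaded.ind")
print((signal_dir / "orders_loaded.ind").exists())
```

If the Delete Indicator File option is selected, the server removes this file after detecting it, so the signaling process should not rely on the file persisting.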
User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise
task triggers the event at one point of the workflow/worklet. If an Event-Wait task is
configured in the same workflow/worklet to listen for that event, then execution will
continue from the Event-Wait task forward.
Assume that you have four sessions that you want to execute in a workflow. You want
P1_session and P2_session to execute concurrently to save time. You also want to
execute Q3_session after P1_session completes. You want to execute Q4_session
only when P1_session, P2_session, and Q3_session complete. Follow these steps:
Be sure to take care in setting the links, though. If they are left at the default and Q3 fails, the Event-Raise will never
happen. The Event-Wait will then wait forever and the workflow will run until it is stopped. To avoid this, check the workflow
option 'Suspend on error'. With this option, if a session fails, the whole workflow goes into suspended mode and can send an
email to notify developers.
Challenge
Key management refers to the technique of managing key allocation in a decision support RDBMS to create a single view of
reference data drawn from multiple sources. Informatica recommends an approach to key management that ensures everything
extracted from a source system is loaded into the data warehouse.
This Best Practice provides tips for employing the Informatica-recommended approach to key management, which deviates
from many traditional data warehouse solutions that apply logical and surrogate key strategies under which transactions with
referential integrity issues are rejected and logged as errors.
Description
Key management comprises three techniques:
● Key merging/matching
● Missing keys
● Unknown keys
All three methods are applicable to a Reference Data Store, whereas only the missing
and unknown keys are relevant for an Operational Data Store (ODS). Key management
should be handled at the data integration level, thereby making it transparent to the
Business Intelligence layer.
Key Merging/Matching
When companies source data from more than one transaction system of a similar type,
the same object may have different, non-unique legacy keys. Additionally, a single key
may have several descriptions or attributes in each of the source systems. The
independence of these systems can result in incongruent coding, which poses a
greater problem than records being sourced from multiple systems.
The bottom line is that nearly every data warehouse project encounters this issue and
needs to find a solution in the short term.
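The cross-referencing idea behind key merging can be sketched as follows; the systems, keys, and matching decision are invented for illustration:

```python
# Conceptual sketch: legacy keys from independent source systems are
# cross-referenced to a single warehouse (surrogate) key.
xref = {}            # (source_system, legacy_key) -> warehouse key
next_key = [1000]

def warehouse_key(system: str, legacy_key: str) -> int:
    """Return the shared surrogate key, allocating a new one on first sight."""
    if (system, legacy_key) not in xref:
        next_key[0] += 1
        xref[(system, legacy_key)] = next_key[0]
    return xref[(system, legacy_key)]

# The same customer is held under different keys in two order systems:
k1 = warehouse_key("ORDERS_EU", "C-0042")
# A matching process has decided these refer to one customer, so the second
# legacy key is mapped onto the existing surrogate key rather than a new one:
xref[("ORDERS_US", "7781")] = k1
print(warehouse_key("ORDERS_US", "7781") == k1)
```

The hard part in practice is the matching decision itself (deciding that `C-0042` and `7781` are the same entity), not the mechanics of the cross-reference table.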
Missing Keys
A problem arises when a transaction is sent through without a value in a column where
a foreign key should exist (i.e., a reference to a key in a reference table). This normally
occurs during the loading of transactional data, although it can also occur when loading
reference data into hierarchy structures. In many older data warehouse solutions, this
condition would be identified as an error and the transaction row would be rejected.
The row would have to be processed through some other mechanism to find the correct
code and loaded at a later date. This is often a slow and cumbersome process that
leaves the data warehouse incomplete until the issue is resolved.
The more practical way to resolve this situation is to allocate a special key in place of
the missing key, which links it with a dummy 'missing key' row in the related table. This
enables the transaction to continue through the loading process and end up in the
warehouse without further processing. Furthermore, the row ID of the bad transaction
can be recorded in an error log, allowing the addition of the correct key value at a later
time.
The major advantage of this approach is that any aggregate values derived from the
transaction table will be correct because the transaction exists in the data warehouse
rather than being in some external error processing file waiting to be fixed.
Simple Example:
In the transaction above, there is no code in the SALES REP column. As this row is
processed, a dummy sales rep key (UNKNOWN) is added to the record to link to a
record in the SALES REP table. A data warehouse key (8888888) is also added to the transaction row.
An error log entry to identify the missing key on this transaction may look like:
This type of error reporting is not usually necessary because the transactions with
missing keys can be identified using standard end-user reporting tools against the data
warehouse.
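The substitution logic described above can be sketched as follows (the column names and the dummy key value are illustrative, not PowerCenter syntax):

```python
MISSING_KEY = "UNKNOWN"   # dummy key linked to a 'missing key' row in the reference table

def resolve_missing_keys(transaction: dict, fk_columns: list, error_log: list) -> dict:
    """Substitute the dummy key so the row still loads, and log the row ID."""
    for col in fk_columns:
        if not transaction.get(col):
            transaction[col] = MISSING_KEY
            error_log.append((transaction["dw_key"], col, "missing key"))
    return transaction

errors = []
row = resolve_missing_keys(
    {"dw_key": 8888888, "sales_rep": None, "amount": 125.0},
    ["sales_rep"], errors)
print(row["sales_rep"], errors)
```

The transaction reaches the warehouse intact, so aggregates remain correct, and the error log preserves enough information to patch in the true key later.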
Unknown Keys
Unknown keys need to be treated much like missing keys except that the load process
has to add the unknown key value to the referenced table to maintain integrity rather
than explicitly allocating a dummy key to the transaction. The process also needs to
make two error log entries. The first, to log the fact that a new and unknown key has
been added to the reference table and a second to record the transaction in which the
unknown key was found.
Simple example:
The sales rep reference data record might look like the following:
In the transaction above, the code 2424242 appears in the SALES REP column. As
this row is processed, a new row has to be added to the Sales Rep reference table.
This allows the transaction to be loaded successfully.
The new reference table row contains the key 2424242 with the placeholder description 'Unknown'.
Some warehouse administrators like to have an error log entry generated to identify the
addition of a new reference table entry. This can be achieved simply by adding the
following entries to an error log.
A second log entry can be added with the data warehouse key of the transaction in
which the unknown key was found.
As with missing keys, error reporting is not essential because the unknown status is
clearly visible through the standard end-user reporting.
Moreover, regardless of the error logging, the system is self-healing because the newly
added reference data entry will be updated with full details as soon as these changes
appear in a reference data feed.
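A sketch of the unknown-key process (the reference data and keys follow the example above; this is conceptual, not PowerCenter syntax):

```python
# An incoming foreign key with no matching reference row causes a placeholder
# row to be added to the reference table (preserving integrity), plus two
# error-log entries as described above.
sales_reps = {1111111: "J. Smith"}          # reference table
error_log = []

def resolve_unknown_key(dw_key: int, rep_key: int) -> None:
    if rep_key not in sales_reps:
        sales_reps[rep_key] = "Unknown"     # placeholder reference row
        error_log.append((rep_key, "unknown key added to reference table"))
        error_log.append((dw_key, "transaction carried unknown key"))

resolve_unknown_key(dw_key=8888889, rep_key=2424242)
print(sales_reps[2424242], len(error_log))
```

When full details for key 2424242 later arrive in a reference data feed, the placeholder row is simply updated, which is why the system is described as self-healing.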
Challenge
In the course of developing mappings for PowerCenter, situations can arise where a set of similar functions/procedures must be
executed for each mapping. The first reaction to this issue is generally to employ a mapplet. These objects are suited to
situations where all of the individual fields/data are the same across uses of the mapplet. However, in cases where the fields are
different – but the ‘process’ is the same – a requirement emerges to ‘generate’ multiple mappings using a standard template of
actions and procedures.
The potential benefits of Autogeneration are focused on a reduction in the Total Cost of Ownership (TCO) of the integration
application and include:
Description
From the outset, it should be emphasized that auto-generation should be integrated into the overall development strategy. It is
probable that some components will still need to be manually developed and many of the disciplines and best practices that are
documented elsewhere in Velocity still apply. It is best to regard autogeneration as a productivity aid in specific situations and not
as a technique that works in all situations. Currently, the autogeneration of 100% of the components required is not a realistic
objective.
All of the techniques discussed here revolve around the generation of an XML file which shares the standard format of exported
PowerCenter components as defined in the powrmart.dtd schema definition. After being generated, the resulting XML document
is imported into PowerCenter using standard facilities available through the user interface or via command line.
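As a deliberately oversimplified sketch of the generation step: emit one mapping element per source/target pair into an XML document. The element and attribute names below are placeholders only; a real export must conform to powrmart.dtd, which is far more detailed than shown here:

```python
import xml.etree.ElementTree as ET

# Pattern-driven generation sketch: the same transformation pattern is
# replicated once per source/target pair. Names are illustrative assumptions.
pairs = [("SRC_ORDERS", "TGT_ORDERS"), ("SRC_ITEMS", "TGT_ITEMS")]

root = ET.Element("POWERMART")
folder = ET.SubElement(root, "FOLDER", NAME="GENERATED")
for src, tgt in pairs:
    ET.SubElement(folder, "MAPPING", NAME=f"m_{src}_to_{tgt}",
                  DESCRIPTION=f"Generated: {src} -> {tgt}")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text.count("<MAPPING"))  # one mapping element per pair
```

The resulting file would then be imported through the standard Designer import facility or the command line, as described above.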
With Informatica technology, there are a number of options for XML targeting that can be leveraged to implement
autogeneration; in effect, the technology can be made self-generating. Three types of autogeneration build can be distinguished:
● Pattern-Driven
● Rules-Driven
● Metadata-Driven
A Pattern-Driven build is appropriate when a single pattern of transformation is to be replicated for multiple source-target combinations.
The potential for Rules-Driven build typically arises when non-technical users are empowered to articulate transformation
requirements in a format which is the source for a process generating components. Usually, this is accomplished via a
spreadsheet which defines the source-to-target mapping and uses a standardized syntax to define the transformation rules. To
implement this type of autogeneration, it is necessary to build an application (typically based on a PowerCenter mapping) which
reads the spreadsheet, matches the sources and targets against the metadata in the repository and produces the XML output.
Finally, the potential for Metadata-Driven build arises when the import of source and target metadata enables transformation
requirements to be inferred; this also requires a mechanism for mapping sources to targets. For example, when a text source
column is mapped to a numeric target column, the inferred rule is to test for data type compatibility.
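A minimal sketch of that kind of inference (the type names and rule labels are illustrative assumptions, not product metadata):

```python
# Sketch of metadata-driven rule inference: compare source and target column
# datatypes and infer a transformation rule. Type names and rule labels are
# illustrative assumptions, not actual repository metadata values.
NUMERIC_TYPES = {"decimal", "integer", "double", "small integer"}
TEXT_TYPES = {"string", "nstring", "text", "varchar2"}

def infer_rule(source_type: str, target_type: str) -> str:
    src, tgt = source_type.lower(), target_type.lower()
    if src in TEXT_TYPES and tgt in NUMERIC_TYPES:
        # Text mapped to numeric: generate a datatype-compatibility test.
        return "TEST_NUMERIC_COMPATIBILITY"
    if src == tgt:
        return "STRAIGHT_MOVE"
    return "REVIEW_MANUALLY"

print(infer_rule("string", "decimal"))  # text source -> numeric target
```

In a real implementation, each inferred rule would then be expanded into the corresponding transformation logic in the generated XML.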
The first stage in the implementation of an autogeneration strategy is to decide which of these autogeneration types is applicable
and to ensure that the appropriate technology is available.
In most cases, it is the Pattern-Driven build which is the main area of interest; this is precisely the requirement which the mapping
generation license option within PowerCenter is designed to address. This option uses the freely distributed Informatica Data
Stencil design tool for Microsoft Visio and freely distributed Informatica Velocity-based mapping templates to accelerate and
automate mapping design.
Generally speaking, applications which involve a small number of highly-complex flows of data tailored to very specific source/
target attributes are not good candidates for pattern-driven autogeneration.
Currently, there is a great deal of product innovation in the areas of Rules-Driven and Metadata-Driven autogeneration. One
option is to use PowerCenter itself with an XML target to generate the required XML files, which are later imported as mappings.
Depending on the scale and complexity of both the autogeneration-rules and the functionality of the generated components, it
may be advisable to acquire a license for the PowerCenter Unstructured Data option.
In conclusion, at the end of this stage the type of autogeneration should be identified and all the required technology licenses
should be acquired.
It is assumed that the standard development activities in the Velocity Architect and Design phases have been undertaken and at
this stage, the development team should understand the data and the value to be added to it.
It is recommended that a prototype is manually developed for a representative subset of the sources and targets since the
adoption of autogeneration techniques does not obviate the need for a re-usability strategy. Even if some components are
generated rather than built, it is still necessary to distinguish between the generic and the flow-specific components. This will
allow the generic functionality to be mapped onto the appropriate re-usable PowerCenter components – mapplets,
transformations, user defined functions etc.
The manual development of the prototype also allows the scope of the autogeneration to be established. It is unlikely that every
component can be generated automatically.
It will also be necessary to devise a customization strategy if the autogeneration is seen as a repeatable process. How are
manual modifications to the generated component to be implemented? Should this be isolated in discrete components which are
called from the generated components?
If the autogeneration strategy is based on an application rather than the Visio stencil mapping generation option, ensure that the
components you are planning to generate are consistent with the restrictions on the XML export file by referring to the product
documentation.
TIP
If you modify an exported XML file, you need to make sure that the XML file conforms to the structure of powrmart.dtd. You
also need to make sure the metadata in the XML file conforms to Designer and Workflow Manager rules. For example, when
you define a shortcut to an object, define the folder in which the referenced object resides as a shared folder. Although
PowerCenter validates the XML file before importing repository objects from it, it might not catch all invalid changes. If you
import into the repository an object that does not conform to Designer or Workflow Manager rules, you may cause data
inconsistencies in the repository.
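As a first line of defense before import, a script can at least confirm that a hand-modified export is still well-formed XML. This stdlib-only sketch does not attempt full validation against powrmart.dtd (that requires a validating parser such as lxml) and does not replace PowerCenter's own import validation:

```python
# Sketch: pre-check a hand-modified export file before importing it into
# PowerCenter. This only verifies well-formedness; conformance to
# powrmart.dtd and to Designer/Workflow Manager rules still requires a
# validating parser and the product's own import validation.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<REPOSITORY><FOLDER NAME='DEV'/></REPOSITORY>"))  # True
print(is_well_formed("<REPOSITORY><FOLDER>"))                           # False
```

Catching a malformed file at this stage is much cheaper than discovering it through a failed import.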
CRCVALUE Codes
Informatica restricts which elements you can modify in the XML file. When you export a Designer object, the PowerCenter
Client might include a Cyclic Redundancy Checking Value (CRCVALUE) code in one or more elements in the XML file. The
CRCVALUE code is another attribute in an element.
When the PowerCenter Client includes a CRCVALUE code in the exported XML file, you can modify some attributes and
elements before importing the object into a repository. For example, VSAM source objects always contain a CRCVALUE
code, so you can only modify some attributes in a VSAM source object. If you modify certain attributes in an element that
contains a CRCVALUE code, you cannot import the object.
For more information, refer to the Chapter on Exporting and Importing Objects in the PowerCenter Repository Guide.
Essentially, the requirements for the autogeneration may be discerned from the XML exports of the manually developed
prototype.
(Refer to the product documentation for more information on installation, configuration and usage.)
It is important to confirm that all the required PowerCenter transformations are supported by the installed version of the Stencil.
The use of an external industry-standard interface such as MS Visio allows the tool to be used by Business Analysts rather than
PowerCenter specialists. Apart from allowing the mapping patterns to be specified, the Stencil may also be used as a
documentation tool.
The icons for transformation objects should be familiar to PowerCenter users. Less easily understood will be the concept of
properties for the links (i.e. relationships) between the objects in the Stencil. These link rules define what ports propagate from
one transformation to the next and there may be multiple rules in a single link.
Essentially, the process of developing the template consists of identifying the dynamic components in the pattern and
parameterizing them.
Once the template is saved and validated, it needs to be “published”; publishing simply makes the template available in formats
which the generating mechanisms can understand.
One of the outputs from the publishing is the template for the definition of the parameters specified in the template; this
parameter file is later modified to supply values for generation.
The other output from the publishing is the template in XML format. This file is only used in manual generation.
There is a choice of either manual or scripted mechanisms for generating components from the published files.
The manual mechanism involves the importation of the published XML template through the Mapping Template Import Wizard in
the PowerCenter Designer. The parameters defined in the template are entered manually through the user interface.
Alternately, the scripted process is based on a supplied command-line utility – mapgen. The first stage is to manually modify the
published parameter file to specify values for all the mappings to be generated. The second stage is to use PowerCenter to
export source and target definitions for all the objects referenced in the parameter file. These are required in order to generate
the ports.
The generated output file is imported using the standard import facilities in PowerCenter.
TIP
Even if the scripted option is selected as the main generating mechanism, use the Mapping Template Import Wizard in the
PC Designer to generate the first mapping; this allows the early identification of any errors or inconsistencies in the template.
This strategy generates PowerCenter XML but can be implemented through either PowerCenter itself or the Unstructured Data
option. Essentially, it will require the same build sub-stages as any other data integration application. The following components
are anticipated:
● Specification of the formats for source to target mapping and transformation rules definition
● Development of a mapping to load the specification spreadsheets into a table
● Development of a mapping to validate the specification and report errors
● Development of a mapping to generate the XML output excluding critical errors
● Development of a component to automate the importation of the XML output into PowerCenter
One of the main issues to be addressed is whether there is a single generation engine which deals with all of the required
patterns, or a series of pattern-specific generation engines.
One of the drivers for the design should be the early identification of errors in the specifications. Otherwise the first indication of
any problem will be the failure of the XML output to import in PowerCenter.
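A specification validator of the kind described above might be sketched as follows (the field names and checks are illustrative assumptions; a real validator would apply the project's full rule set):

```python
# Sketch: validate specification rows before generation so that errors are
# caught early, rather than surfacing later as an XML import failure in
# PowerCenter. Field names and checks are illustrative assumptions.
def validate_spec(rows, known_sources, known_targets):
    """Return a list of (row_number, message) errors; an empty list means clean."""
    errors = []
    for i, row in enumerate(rows, start=1):
        if row.get("SOURCE") not in known_sources:
            errors.append((i, f"unknown source {row.get('SOURCE')!r}"))
        if row.get("TARGET") not in known_targets:
            errors.append((i, f"unknown target {row.get('TARGET')!r}"))
        if not row.get("RULE"):
            errors.append((i, "missing transformation rule"))
    return errors

rows = [
    {"SOURCE": "CUSTOMERS", "TARGET": "DIM_CUSTOMER", "RULE": "UPPER(NAME)"},
    {"SOURCE": "ORDERS", "TARGET": "DIM_ORDER", "RULE": ""},
]
print(validate_spec(rows, {"CUSTOMERS"}, {"DIM_CUSTOMER", "DIM_ORDER"}))
```

Reporting all errors per run, with row numbers, lets the analysts who own the spreadsheet correct their specifications without involving the development team.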
It is very important to define the process around the generation and to allocate responsibilities appropriately.
The Mapping SDK distribution includes:
● The javadoc (api directory), which describes all the classes of the Java API
● The API libraries (lib directory), which contain the jar files used by Mapping SDK applications
● Some basic samples which show how Java development with the Mapping SDK is done
The Java application also requires a mechanism to define the final mapping between source and target structures; the
application interprets this data source and combines it with the metadata in the repository in order to output the required mapping
XML.
Presumably, there is less of a requirement for QA and testing with generated components. This does not mean that the
need to test no longer exists. To some extent, the testing effort should be redirected to the components in the assembly line
itself.
There is a great deal of material in Velocity to support QA and Test activities. In particular, refer to Naming Conventions .
Informatica suggests adopting a Naming Convention that distinguishes between generated and manually-built components.
For more information on the QA strategy refer to Using PowerCenter Metadata Manager and Metadata Exchange Views for
Quality Assurance .
Challenge
Description
Although PowerCenter environments vary widely, most sessions and/or mappings can
benefit from the implementation of common objects and optimization procedures.
Follow these procedures and rules of thumb when creating mappings to help ensure
optimization.
❍ Using flat files located on the server machine loads faster than reading
from a database located on the server machine.
❍ Fixed-width files are faster to load than delimited files because
delimited files require extra parsing.
❍ If processing intricate transformations, consider first loading the
source flat file into a relational database, which allows the
PowerCenter mappings to access the data in an optimized fashion by
using filters and custom SQL SELECTs where appropriate.
8. If working with data that is not able to return sorted data (e.g., Web Logs),
consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter
Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an
Aggregator Transformation to optimize the aggregate. With a Sorter
Transformation, the Sorted Ports option can be used even if the original source
cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider
filtering before the update strategy if retaining these rows is not critical because
logging causes extra overhead on the engine. Choose the option in the update
strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the
smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup
transformation just in front of the target to retrieve the primary key. The primary
key update is much faster than the non-indexed lookup override.
Challenge
A wide array of Mapping Template examples can be obtained for the most current
PowerCenter version from the Informatica Customer Portal. As "templates," each of the
objects in Informatica's Mapping Template Inventory illustrates the transformation logic
and steps required to solve specific data integration requirements. These sample
templates, however, are meant to be used as examples, not as means to implement
development standards.
Description
Templates can be heavily used in a data integration and warehouse environment, when
loading information from multiple source providers into the same target structure, or
when similar source system structures are employed to load different target instances.
Using templates guarantees that any transformation logic that is developed and tested
correctly, once, can be successfully applied across multiple mappings as needed. In
some instances, the process can be further simplified if the source/target structures
have the same attributes, by simply creating multiple instances of the session, each
with its own connection/execution attributes, instead of duplicating the mapping.
When the process is not simple enough to allow usage based on the need to duplicate
transformation logic to load the same target, Mapping Templates can help to reproduce
transformation techniques. In this case, the implementation process requires more than
just replacing source/target transformations. This scenario is most useful when certain
logic (i.e., logical group of transformations) is employed across mappings. In many
instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized for
expression logic reuse and to build complex expressions.
Transport mechanism
Once Mapping Templates have been developed, they can be distributed by any of the
following procedures:
The following Mapping Templates can be downloaded from the Informatica Customer
Portal and are listed by subject area:
Transformation Techniques
Source-Specific Requirements
Industry-Specific Requirements
Challenge
A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked
component. The application and enforcement of naming standards not only establishes consistency in the repository, but also provides
a developer-friendly environment. Choose a good naming standard and adhere to it to ensure that the repository can be easily
understood by all developers.
Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the
former. Choosing a convention and sticking with it is the key.
Having a good naming convention facilitates smooth migrations and improves readability for anyone reviewing or carrying out
maintenance on the repository objects. It helps them to understand the processes being affected. If consistent names and descriptions
are not used, significant time may be needed to understand the workings of mappings and transformation objects. If no description is
provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.
The following pages offer suggested naming conventions for various repository objects. Whatever convention is chosen, it is important
to make the selection very early in the development cycle and communicate the convention to project staff working on the repository.
The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test
execution documents.
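One way to build such checks into the test process is a small script that matches object names against the agreed prefixes. The prefix map below reflects a subset of the conventions suggested in this Best Practice; treat it as a starting point to extend:

```python
# Sketch: check repository object names against the project's naming
# convention as part of peer review or test execution. The prefix map is a
# subset of the conventions in this Best Practice; extend it as needed.
import re

PREFIXES = {
    "Mapplet": r"mplt_",
    "Aggregator": r"AGG_",
    "Expression": r"EXP_",
    "Filter": r"(FIL_|FILT_)",
    "Lookup": r"(LKP_|ULKP_)",
    "Session": r"s_",
    "Shortcut": r"sc_",
}

def check_name(object_type: str, name: str) -> bool:
    """True if the name carries the expected prefix for its object type."""
    pattern = PREFIXES.get(object_type)
    return bool(pattern and re.match(pattern, name))

print(check_name("Expression", "EXP_CALC_SALES_TAX"))  # True
print(check_name("Filter", "MyFilter"))                # False
```

Run against a repository export or metadata query, a checker like this turns the convention from a document into an enforceable test-phase gate.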
Mapplet mplt_{DESCRIPTION}
Aggregator Transformation AGG_{FUNCTION} that leverages the expression and/or a name that
describes the processing being done.
Custom Transformation CT_{TRANSFORMATION} name that describes the processing being done.
Data Quality Transform IDQ_{descriptor}_{plan} with the descriptor describing what this plan is
doing with the optional plan name included if desired.
Expression Transformation EXP_{FUNCTION} that leverages the expression and/or a name that
describes the processing being done.
Filter Transformation FIL_ or FILT_{FUNCTION} that leverages the expression or a name that
describes the processing being done.
Lookup Transformation LKP_{TABLE_NAME} or suffix with _{descriptor} if there are multiple look-
ups on a single table. For unconnected look-ups, use ULKP in place of LKP.
Mapplet Input Transformation MPLTI_{DESCRIPTOR} indicating the data going into the mapplet.
Mapplet Output Transformation MPLTO_{DESCRIPTOR} indicating the data coming out of the mapplet.
Normalizer Transformation NRM_{FUNCTION} that leverages the expression or a name that describes
the processing being done.
Rank Transformation RNK_{FUNCTION} that leverages the expression or a name that describes
the processing being done.
SAP DMI Prepare dmi_{Entity Descriptor}_{Secondary Descriptor} defining what entity is being
loaded and a secondary description if multiple DMI objects are being
leveraged in a mapping.
Sequence Generator Transformation SEQ_{DESCRIPTOR}; if using keys for a target table entity, then refer to that entity.
Unstructured Data Transform UDO_{descriptor} with the descriptor identifying the kind of data being
parsed by the UDO transform.
Update Strategy Transformation UPD_{UPDATE_TYPE(S)} or UPD_{UPDATE_TYPE(S)}_
{TARGET_NAME} if there are multiple targets in the mapping. E.g.,
UPD_UPDATE_EXISTING_EMPLOYEES
Port Names
Ports names should remain the same as the source unless some other action is performed on the port. In that case, the port should be
prefixed with the appropriate name.
When the developer brings a source port into a lookup, the port should be prefixed with ‘in_’. This helps the user immediately identify
the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port
is transformed into an output port with the same name, prefix the input port with ‘in_’.
Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many
other transformations. If it is intended to be able to use the autolink feature based on names, then outputs may be better left as the
name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix ‘v’, 'var_’ or
‘v_' plus a meaningful name.
With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the
Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data
from the database.
Other transformations that are not applicable to the port standards are:
● Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
● Sequence Generator - The ports are reserved words.
● Router - Because output ports are created automatically, prefixing the input ports with an I_ prefixes the output ports with I_
as well. Port names should not have any prefix.
● Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to
rename them unless they are prefixed. Prefixed port names should be removed.
● Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in
both the input and output. The port names should not have any prefix.
Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for
longer port names.
Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the Designer.
The description should also indicate if any overrides are used. If so, it should describe the filters or settings used. Some projects
prefer items such as the SQL statement to be included in the description as well.
● Lookup Transformation Descriptions. Describe the lookup along the lines of the [lookup attribute] obtained from [lookup
table name] to retrieve the [lookup attribute name].
Where:
❍ Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
❍ Lookup table name is the table on which the lookup is being performed.
❍ Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition
when the lookup is actually executed.
It is also important to note lookup features such as persistent cache or dynamic lookup.
Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions
being performed.
Within each Expression, transformation ports have their own description in the format:
Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions
being performed.
Within each Aggregator, transformation ports have their own description in the format:
“This Sequence Generator provides the next value for the [column name] on the [table name].”
Where:
❍ Table name is the table being populated by the sequence number, and the
❍ Column name is the column within that table being populated.
“This Joiner uses … [joining field names] from [joining table names].”
Where:
❍ joining field names are the fields on which the join is made, and
❍ joining table names are the tables supplying those fields.
● Filter Transformation Descriptions. “This filter … [explanation].”
Where:
❍ explanation describes what the filter criteria are and what they do.
● Stored Procedure Transformation Descriptions. Explain the stored procedure’s functionality within the mapping (i.e., what
does it return in relation to the input ports?).
● Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.
● Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an
example, for an exchange rate mapplet, describe what currency the output value will be in. Answer the questions like: is the
currency fixed or based on other data? What kind of rate is used? is it a fixed inter-company rate? an inter-bank rate?
business rate or tourist rate? Has the conversion gone through an intermediate currency?
● Update Strategies Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or
determined by a calculation.
● Sorter Transformation Descriptions. Explanation of the port(s) that are being sorted and their sort direction.
● Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if
any) is expected to take place in later transformations in the mapping.
● Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of
the control to commit or rollback.
● Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure
which is used.
● External Procedure Transformation Descriptions. Describe the function of the external procedure and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the
procedure that is used.
● Java Transformation Descriptions. Describe the function of the java code and what data is expected as input and what data
is generated as output. Also indicate whether the java code determines the object to be an Active or Passive transformation.
● Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the
rank, the rank direction, and the purpose of the transformation.
● XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the
purpose of the XML being generated.
● XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser and indicate the purpose of the XML being parsed.
Mapping Comments
These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember
to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues
arise that need to be discussed with business analysts.
Mapplet Comments
These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions
for the input and output transformation.
Repository Objects
Repositories, as well as repository level objects, should also have meaningful names. Repository names should be prefixed with either ‘L_’
for local or ‘G_’ for global, followed by a descriptor. Descriptors usually include information about the project and/or level of the environment (e.g.,
PROD, TEST, DEV).
Working folder names should be meaningful and include project name and, if there are multiple folders for that one project, a
descriptor. User groups should also include project name and descriptors, as necessary. For example, folder DW_SALES_US and
DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should
prefix with ‘z_’ so that they are grouped together and not confused with working production folders.
Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to
facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of
copies.
Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across
the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created
by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared
before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the
maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.
If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression
transformation that calculates sales tax, the developer can place the object in a shared folder. Then use the object in other folders by
creating a shortcut to the object. In this case, the naming convention is ‘sc_’ (e.g., sc_EXP_CALC_SALES_TAX). The folder should
prefix with ‘SC_’ to identify it as a shared folder and keep all shared folders grouped together in the repository.
Session s_{MappingName}
Worklet wk or wklt_{DESCRIPTOR}
All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines.
PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the
ODBC DSN since the PowerCenter Client talks to all databases through ODBC.
Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach ensures that
there is less chance of a discrepancy occurring among users when they use different (i.e., colleagues’) machines and have to recreate
a new DSN when they use a separate machine.
If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example,
machine1 has ODBC DSN Name0 that points to database1. TableA is analyzed on machine 1. TableA is uniquely identified as
Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that points to database1. TableA is analyzed on machine 2.
TableA is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by
multiple names, creating confusion for developers, testers, and potentially end users.
Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev,
to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names
should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.
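A quick sanity check of this rule can be scripted; the token list below is an assumption to be extended for local conventions:

```python
# Sketch: flag ODBC DSN / connection names that embed environment tokens,
# which cause trouble when folders are migrated between repositories.
# The token list is an assumption; extend it to match local conventions.
import re

ENV_TOKENS = re.compile(r"(^|_)(dev|test|qa|uat|prod)(_|$)", re.IGNORECASE)

def has_env_token(name: str) -> bool:
    """True if the name contains an environment token such as dev or prod."""
    return bool(ENV_TOKENS.search(name))

for dsn in ["dev_db01", "DW", "SALES_PROD"]:
    print(dsn, has_env_token(dsn))
```

Names like DW pass, while dev_db01 and SALES_PROD are flagged for renaming before they leak into the test or production repositories.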
Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for
developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or
environment tokens in the database connection name. Database connection names must be very generic to be understandable and
ensure a smooth migration.
The naming convention should be applied across all development, test, and production environments. This allows seamless migration
of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information
is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also
copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually
wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change
connection names, user names, passwords, and possibly even connect strings.
Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the
development environment to the test environment, the sessions automatically use the existing connection in the test repository. With
the right naming convention, you can migrate sessions from the test to production repository without manual intervention.
Administration console objects such as domains, nodes, and services should also have meaningful names.
Services:
Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager.
When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target
databases. Connections are saved in the repository.
For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you
configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_
[Instance_Name]). The following table shows some examples.
The connection you configure depends on the type of target data you want to load.
Challenge
As with any other development process, the use of clear, consistent, and documented naming conventions
contributes to the effective use of Informatica Data Quality (IDQ). This Best Practice provides suggested naming
conventions for the major structural elements of the IDQ Designer and IDQ Plans.
Description
IDQ Designer
The IDQ Designer is the user interface for the development of IDQ plans.
Each IDQ plan holds the business rules and operations for a distinct process. IDQ plans may be constructed for use
inside the IDQ Designer (a runtime plan), using the athanor-rt command line utility (also runtime), or within an
integration with PowerCenter (a real-time plan).
IDQ requires that each IDQ plan belong to a project. Optionally, plans may be organized in folders within a project.
Folders may be nested to span more than one level.
At any common level of visibility, IDQ requires that all elements have distinct names. Thus no two projects within a
repository may share the same name. Likewise, no two folders at the same level within a project may share the
same name. The rule also applies to plans within the same folder.
IDQ will not permit an element to be renamed if the new name would conflict with an existing element at the same
level. A dialog will explain the error.
Naming Projects
When a project is created, it will by default have the name “New Project”.
Project naming should be clear and consistent within a repository. The exact approach to naming will vary
depending on an organization’s needs. Suggested naming rules include:
1. Limit project names to 22 characters if possible. The limit imposed by the repository is 30 characters.
Limiting project names to 22 characters allows “Copy of” to be prefixed to copies of a project without
truncating characters.
2. Include enough descriptive information within the project name so an unfamiliar user will have a reasonable
idea of what plans may be included in the project.
3. If plans within a project will operate on only one data source, including the data source in the project name
may be helpful.
4. If abbreviations are used, they should be consistent and documented.
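The length arithmetic behind rule 1 can be expressed as a quick check. The following is a Python sketch; the function name and sample project names are illustrative, not part of IDQ:

```python
# Sketch: check a project name against the suggested 22-character limit.
# The repository caps names at 30 characters; reserving 8 for the
# "Copy of " prefix that IDQ prepends to copies leaves 22 for the name.
REPOSITORY_LIMIT = 30
COPY_PREFIX = "Copy of "  # 8 characters, including the trailing space

def fits_convention(project_name: str) -> bool:
    """True if a copy of the project can keep the full original name."""
    return 0 < len(project_name) <= REPOSITORY_LIMIT - len(COPY_PREFIX)

print(fits_convention("CustAddrCleansing"))         # True  (17 characters)
print(fits_convention("CustomerAddressCleansing"))  # False (24 characters)
```

The same check applies to folders and plans with a 50-character repository limit and a 42-character suggested limit.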
Naming Folders
When a new project is created, by default it will contain four folders, named “Consolidation”, “Matching”, “Profiling”,
and “Standardization”.
While the default naming convention may prove satisfactory in many cases, it imposes an organizational structure
for plans that may not be optimal. Therefore, another naming convention may make more sense in a particular
circumstance.
1. Limit folder names to 42 characters if possible. The limit imposed by the repository is 50 characters.
Limiting folder names to 42 characters allows “Copy of” to be prefixed to copies of a folder without truncating
characters.
2. Include enough descriptive information within the folder name so an unfamiliar user will have a reasonable
idea of what plans may be included in the folder.
3. If abbreviations are used, they should be consistent and documented.
Naming Plans
When a new plan is created, the user is required to select from one of the four main plan classifications, “Analysis”,
“Matching”, “Standardization”, or “Consolidation”. By default, the new plan name will correspond to the option
selected.
1. Limit plan names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting
plan names to 42 characters allows “Copy of” to be prefixed to copies of a plan without truncating
characters.
2. Include enough descriptive information within the plan name so an unfamiliar user will have a reasonable
idea of what the plan does at a high level.
3. While the project and folder structure will be visible within the IDQ Designer and will be required when using
athanor-rt, it is not as readily visible within PowerCenter. Therefore, repetition of the information conveyed
by the project and folder names may be advisable.
4. If abbreviations are used, they should be consistent and documented.
Naming Components
Within the Designer, component types may be identified by their unique icons as well as by hovering over a
component with a mouse.
It is suggested that component names be prefixed with an acronym identifying the component type. While less
critical than field naming, as discussed below, using a prefix promotes consistency and clarity, and in some cases it
makes field naming more efficient.
Component Prefix
Bigram BG_
Merge MG_
Nysiis NYS_
Scripting SC_
Soundex SX_
Splitter SPL_
To Upper TU_
In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 32 characters is suggested. In many cases, component
names are also useful for field names, and databases limit field lengths at varying sizes.
2. Consider using the name of the input field or at least the field type.
3. Consider limiting names to alphabetic characters, spaces, underscores, and numbers. This will make the
corresponding field names compatible with most likely output destinations.
4. If the component type abbreviation itself is not sufficient to identify what the component does, include an
identifier for the function of the component in its name.
5. If abbreviations are used, they should be consistent and documented.
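Applied mechanically, the prefix table and rules above look like the following Python sketch; the helper function and field names are hypothetical, and the prefix mapping mirrors the table above:

```python
# Sketch: derive a component name from the suggested type prefixes.
PREFIXES = {
    "Bigram": "BG_", "Merge": "MG_", "Nysiis": "NYS_", "Scripting": "SC_",
    "Soundex": "SX_", "Splitter": "SPL_", "To Upper": "TU_",
}

def component_name(component_type: str, input_field: str, limit: int = 32) -> str:
    """Prefix the input field name with the component-type acronym (rule 2)."""
    name = PREFIXES[component_type] + input_field
    if len(name) > limit:  # rule 1: keep names reasonably short
        raise ValueError(f"name exceeds {limit} characters: {name}")
    return name

print(component_name("Soundex", "LAST_NAME"))  # SX_LAST_NAME
```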
Naming Dictionaries
Dictionaries may be given any name suitable for the operating system on which they will be used.
1. Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both
Windows and UNIX, avoid using spaces.
2. If a dictionary supplied by Informatica is to be modified, it is suggested that the dictionary be renamed and/
or moved to a new folder. This will avoid accidentally overwriting the modifications when an update is
installed.
3. If abbreviations are used, they should be consistent and documented.
Naming Fields
Careful field naming is probably the most critical standard to follow when using IDQ.
● IDQ requires that all fields output by components have unique names; a name cannot be carried through
from component to component.
● The power of IDQ leads to complex plans with many components.
● IDQ does not have the data lineage feature of PowerCenter, so the component name is the clearest
indicator of the source of an input component when a plan is being examined.
With those considerations in mind, it is suggested that field names reuse the component-type prefixes listed above
and follow the same naming rules suggested for components.
Challenge
Data warehousing incorporates very large volumes of data. The process of loading the
warehouse in a reasonable timescale without compromising its functionality is
extremely difficult. The goal is to create a load strategy that can minimize downtime for
the warehouse and allow quick and robust data management.
Description
As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact on the overall system.
This Best Practice describes several possible load strategies.
Incremental Aggregation
If the source changes only incrementally, and you can capture those changes, you can
configure the session to process only those changes with each run. This allows the
PowerCenter Integration Service to update the target incrementally, rather than forcing
it to process the entire source and recalculate the same calculations each time you run
the session.
Some conditions that may help in making a decision on an incremental strategy include:
● Error handling, loading, and unloading strategies for recovering, reloading, and unloading data.
● History tracking requirements for keeping track of what has been loaded and when.
● Slowly-changing dimensions. The Informatica Mapping Wizards are a good start to an incremental load
strategy; the Wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer
Guide).
Source Analysis
● Delta records. Records supplied by the source system include only new or
changed records. In this scenario, all records are generally inserted or updated
into the data warehouse.
● Record indicator or flags. Records that include columns that specify the
intention of the record to be populated into the warehouse. Records can be
selected based upon this flag for all inserts, updates, and deletes.
● Date stamped data. Data is organized by timestamps, and loaded into the
warehouse based upon the last processing date or the effective date range.
● Key values are present. When only key values are present, data must be
checked against what has already been entered into the warehouse. All values
must be checked before entering the warehouse.
After the sources are identified, you need to determine which records need to be
entered into the warehouse and how. Here are some considerations:
● Compare with the target table. When source delta loads are received,
determine if the record exists in the target table. The timestamps and natural
keys of the record are the starting point for identifying whether the record is
new, modified, or should be archived. If the record does not exist in the target,
insert the record as a new row. If it does exist, determine if the record needs to
be updated, inserted as a new record, or removed (deleted from target) or
filtered out and not added to the target.
● Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for update
or delete scenarios, or does not exist for successful inserts. Some design
effort may be needed to manage errors in these situations.
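The compare-with-the-target decision described above can be sketched as follows. This is a Python illustration in which an in-memory dictionary stands in for an actual target lookup or join; the field names are hypothetical:

```python
# Sketch: classify incoming delta records against the target.
# target is keyed by natural key and holds the last-seen timestamp.
def classify(record, target):
    key, stamp = record["natural_key"], record["updated_at"]
    if key not in target:
        return "insert"   # record does not exist in the target: new row
    if stamp > target[key]:
        return "update"   # newer version of an existing row
    return "filter"       # already loaded; filter out, do not add

target = {"C001": "2021-01-10"}
print(classify({"natural_key": "C002", "updated_at": "2021-01-11"}, target))  # insert
print(classify({"natural_key": "C001", "updated_at": "2021-02-01"}, target))  # update
print(classify({"natural_key": "C001", "updated_at": "2020-12-01"}, target))  # filter
```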
There are four main strategies in mapping design that can be used as a method of
comparison:
● Joins of sources to targets. Records are directly joined to the target using
Source Qualifier join conditions or using Joiner transformations after the
Source Qualifiers (for heterogeneous sources). When using Joiner
transformations, take care to ensure the data volumes are manageable and
that the smaller of the two datasets is configured as the Master side of the join.
● Lookup on target. Using the Lookup transformation, lookup the keys or
critical columns in the target relational database. Consider the caches and
indexing possibilities.
● Load table log. Generate a log table of records that have already been
inserted into the target system. You can use this table for comparison with
lookups or joins, depending on the need and volume. For example, store keys
in a separate table and compare source records against this log table to
determine load strategy. Another example is to store the dates associated with
the data already loaded into a log table.
● MD5 checksum function. Generate a unique value for each row of data and
then compare previous and current checksum values to determine whether the
row is new or has changed.
The simplest method for incremental loads is from flat files or a database in which all
records are going to be loaded. This strategy requires bulk loads into the warehouse
with no overhead on processing of the sources or sorting the source records.
Data can be loaded directly from the source locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.
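The MD5 checksum comparison described earlier can be sketched as follows. This is a Python illustration; the delimiter and column handling are choices for the sketch, not a prescribed implementation:

```python
import hashlib

# Sketch: detect changed rows by comparing MD5 checksums of row values.
# Joining values with a delimiter avoids ("ab","c") colliding with ("a","bc").
def row_checksum(row: dict) -> str:
    payload = "|".join(str(row[k]) for k in sorted(row))  # stable column order
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

previous = {"id": 42, "city": "Austin"}
current = {"id": 42, "city": "Boston"}
print(row_checksum(previous) == row_checksum(current))   # False: row changed
print(row_checksum(previous) == row_checksum(dict(previous)))  # True: unchanged
```

Storing the checksum alongside each target row lets the load compare one value per row instead of every column.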
Date-Stamped Data
This method involves data that has been stamped using effective dates or sequences.
The incremental load can be determined by dates greater than the previous load date
or data that has an effective key greater than the last key processed.
With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date are loaded into the warehouse. Views
can also be created to perform the selection criteria. This way, the processing does not
have to be incorporated into the mappings but is kept on the source component.
Alternatively, placing the load strategy into the mapping components themselves is more flexible and gives the
data integration developers greater control, with the logic captured in the associated metadata.
To compare the effective dates, you can use mapping variables to provide the previous
date processed (see the description below). An alternative to Repository-maintained
mapping variables is the use of control tables to store the dates and update the control
table after each load.
Non-relational data can be filtered as records are loaded based upon the effective
dates or sequenced keys. A Router transformation or filter can be placed after the
Source Qualifier to remove old records.
Data that is uniquely identified by keys can be sourced according to selection criteria.
For example, records that contain primary keys or alternate keys can be used to
determine if they have already been entered into the data warehouse. If they exist, you
can update or filter out the records; if not, you can insert them as new rows.
It may be possible to perform a join with the target tables in which new data can be
selected and loaded into the target. It may also be feasible to lookup in the target to
see if the data exists.
● Loading directly into the target. Loading directly into the target is possible
when the data is going to be bulk loaded. The mapping is then responsible for
error control, recovery, and update strategy.
● Load into flat files and bulk load using an external loader. The
mapping loads data directly into flat files. You can then invoke an external
loader to bulk load the data into the target. This method reduces the load times
(with less downtime for the data warehouse) and provides a means of
maintaining a history of data being loaded into the target. Typically, this
method is only used for updates into the warehouse.
● Load into a mirror database. The data is loaded into a mirror database to
avoid downtime of the active data warehouse. After data has been loaded, the
databases are switched, making the mirror the active database and the active
the mirror.
You can use a mapping variable to perform incremental loading. By referencing a date-
based mapping variable in the Source Qualifier or join condition, it is possible to select
only those rows with greater than the previously captured date (i.e., the newly inserted
source data). However, the source system must have a reliable date to use.
In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create
variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet
Designer.
Click Add and enter the name of the variable (i.e., $$INCREMENT_DATE). In this case,
make your variable a date/time. For the Aggregation option, select MAX.
In the same screen, state your initial value. This date is used during the initial run of the
session and can be entered in one of the following formats:
● MM/DD/RR
● MM/DD/RR HH24:MI:SS
● MM/DD/YYYY
● MM/DD/YYYY HH24:MI:SS
Step 3: Refresh the mapping variable for the next session run using
an Expression Transformation
Use an Expression transformation and the pre-defined variable functions to set and use
the mapping variable.
SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)
CREATE_DATE in this example is the date field from the source that should be used to
identify incremental rows. The variable functions can be used in the following
transformations:
● Expression
● Filter
● Router
● Update Strategy
Note: This behavior has no effect on the date used in the Source Qualifier. The initial
select always contains the maximum date value encountered during the previous
successful session run.
When the mapping completes, the PERSISTENT value of the mapping variable is
stored in the repository for the next run of your session. You can view the value of the
mapping variable in the session log file.
The advantage of the mapping variable and incremental loading is that it allows the
session to use only the new rows of data. No table is needed to store the max(date)
since the variable takes care of it.
After a successful session run, the PowerCenter Integration Service saves the final
value of each variable in the repository. So when you run your session the next time,
only new data from the source system is captured. If necessary, you can override the
value saved in the repository with a value saved in a parameter file.
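The mapping-variable mechanics can be simulated outside PowerCenter to illustrate the behavior. In this Python sketch, a JSON file plays the role of the repository that persists the variable value; the state file name and sample rows are illustrative:

```python
import json
import os
import tempfile

# Sketch of the $$INCREMENT_DATE mechanics: select only rows newer than the
# persisted maximum date, then persist the new maximum (the role that
# SETMAXVARIABLE and the repository play in PowerCenter).
STATE = os.path.join(tempfile.gettempdir(), "increment_date.json")
if os.path.exists(STATE):
    os.remove(STATE)  # start clean so the demo below is deterministic

def incremental_load(rows):
    last = "1900-01-01"  # initial value, used on the first run
    if os.path.exists(STATE):
        with open(STATE) as fh:
            last = json.load(fh)["max_date"]
    new_rows = [r for r in rows if r["create_date"] > last]  # source filter
    if new_rows:
        with open(STATE, "w") as fh:  # persist MAX(create_date) for next run
            json.dump({"max_date": max(r["create_date"] for r in new_rows)}, fh)
    return new_rows

source = [{"id": 1, "create_date": "2021-03-01"},
          {"id": 2, "create_date": "2021-03-05"}]
print(len(incremental_load(source)))  # first run: 2 (all rows are new)
print(len(incremental_load(source)))  # second run: 0 (nothing newer)
```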
● After installing PWX, ensure the PWX Listener is up and running and that
connectivity is established to the Listener. For best performance, the Listener
should be co-located with the source system.
● In the PWX Navigator client tool, use metadata to configure data access. This
means creating data maps for the non-relational to relational view of
mainframe sources (such as IMS and VSAM) and capture registrations for all
sources (mainframe, Oracle, DB2, etc). Registrations define the specific tables
and columns desired for change capture. There should be one registration per
source. Group the registrations logically, for example, by source database.
● For an initial test, make changes in the source system to the registered
sources. Ensure that the changes are committed.
● Still working in PWX Navigator (and before using PowerCenter), perform Row
Tests to verify the returned change records, including the transaction action
flag (the DTL__CAPXACTION column) and the timestamp. Set the required
access mode: CAPX for change and CAPXRT for real time. Also, if desired,
edit the PWX extraction maps to add the Change Indicator (CI) column. This
CI flag (Y or N) allows for field level capture and can be filtered in the
PowerCenter mapping.
● Use PowerCenter to materialize the targets (i.e., to ensure that sources and
targets are in sync prior to starting the change capture process). This can be
accomplished with a simple pass-through “batch” mapping. This same bulk
mapping can be reused for CDC purposes, but only if specific CDC columns
are not included, and by changing the session connection/mode.
● Import the PWX extraction maps into Designer. This requires the PWXPC
(PowerExchange Client for PowerCenter) component.
● Use “group sourcing” to create the CDC mapping by including multiple sources
in the mapping. This enhances performance because only one read/
connection is made to the PWX Listener and all changes (for the sources in
the mapping) are pulled at one time.
● Keep the CDC mappings simple. There are some limitations; for instance, you
cannot use active transformations. In addition, if loading to a staging area,
store the transaction types (i.e., insert/update/delete) and the timestamp for
subsequent processing downstream. Also, if loading to a staging area, include
an Update Strategy transformation in the mapping with DD_INSERT or
DD_UPDATE in order to override the default behavior and store the action
flags.
● In the CDC session properties, enable session recovery (i.e., set the Recovery
Strategy to “Resume from last checkpoint”).
● Use post-session commands to archive the restart token files for restart/
recovery purposes. Also, archive the session logs.
Challenge
Configure PowerCenter to work with various PowerExchange data access products to process real-time
data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting
up a real-time session to work with PowerCenter.
Description
PowerCenter with real-time option can be used to process data from real-time data sources. PowerCenter
supports the following types of real-time data:
● Messages and message queues. PowerCenter with the real-time option can be used to
integrate third-party messaging applications using a specific PowerExchange data access
product. Each PowerExchange product supports a specific industry-standard messaging
application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods.
You can read from messages and message queues and write to messages, messaging
applications, and message queues. WebSphere MQ uses a queue to store and exchange data.
Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the
message exchange is identified using a topic.
● Web service messages. PowerCenter can receive a web service message from a web service
client through the Web Services Hub, transform the data, and load the data to a target or send a
message back to a web service client. A web service message is a SOAP request from a web
service client or a SOAP response from the Web Services Hub. The Integration Service
processes real-time data from a web service client by receiving a message request through the
Web Services Hub and processing the request. The Integration Service can send a reply back to
the web service client through the Web Services Hub or write the data to a target.
● Changed source data. PowerCenter can extract changed data in real time from a source table
using the PowerExchange Listener and write data to a target. Real-time sources supported by
PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL
Server, Oracle and VSAM.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the third-party
messaging application and message itself. Each PowerExchange product supplies its own connection
attributes that need to be configured properly before running a real-time session.
The PowerCenter real-time option uses a zero latency engine to process data from the messaging
system. Depending on the messaging systems and the application that sends and receives messages,
there may be a period when there are many messages and, conversely, there may be a period when
there are no messages. PowerCenter uses the attribute ‘Flush Latency’ to determine how often the
messages are being flushed to the target. PowerCenter also provides various attributes to control when
the session ends.
● Message Count - Controls the number of messages the PowerCenter Server reads from the
source before the session stops reading from the source.
● Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it
stops reading from the source.
● Time Slice Mode - Indicates a specific range of time during which the server reads messages from the
source. Only PowerExchange for WebSphere MQ uses this option.
● Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading
messages from the source.
The specific filter conditions and options available to you depend on which Real-Time source is being
used. For example, the attributes for PowerExchange for DB2 for i5/OS:
Set the attributes that control how the reader ends. One or more attributes can be used to control the end
of session.
If more than one attribute is selected, the first attribute that satisfies the condition is used to control the
end of session.
Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS,
TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ, the real-time attributes
must be specified as a filter condition.
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often
PowerCenter should flush messages, expressed in milliseconds.
For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two
seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition
is reached. The Source Based Commit condition is defined in the Properties tab of the session.
The message recovery option can be enabled to ensure that no messages are lost if a session fails as a
result of unpredictable error, such as power loss. This is especially important for real-time sessions
because some messaging applications do not store the messages after the messages are consumed by
another application.
A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on
the source system from an external application. Each UOW may consist of a different number of rows
depending on the transaction to the source system. When you use the UOW Count Session condition, the
Integration Service commits source data to the target when it reaches the number of UOWs specified in
the session condition.
For example, if the value for UOW Count is 10, the Integration Service commits all data read from the
source after the 10th UOW enters the source. The lower you set the value, the faster the Integration
Service commits data to the target. A lower value also causes the system to consume more resources.
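The UOW Count behavior can be sketched as follows. This is a Python illustration in which the buffered "commit" stands in for the target commit; the sample rows are hypothetical:

```python
# Sketch: commit to the target every N units of work (UOW Count = 3 here).
# Each UOW is a list of rows from one commit scope on the source system.
def process_uows(uows, uow_count=3):
    commits, pending, seen = [], [], 0
    for uow in uows:
        pending.extend(uow)  # buffer rows from this UOW
        seen += 1
        if seen == uow_count:            # UOW Count condition reached
            commits.append(list(pending))  # commit the buffered rows
            pending, seen = [], 0
    if pending:
        commits.append(pending)          # final commit at end of data
    return commits

uows = [["r1"], ["r2", "r3"], ["r4"], ["r5"]]
print(process_uows(uows))  # [['r1', 'r2', 'r3', 'r4'], ['r5']]
```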
A real-time session often has to be up and running continuously to listen to the messaging application and
to process messages immediately after they arrive. Set the reader attribute Idle Time to -1 and
Flush Latency to a specific time interval. This is applicable to all PowerExchange products except
PowerExchange for WebSphere MQ; in that case the session continues to run and flushes the messages to the
target using the specified flush latency interval.
Another scenario is the ability to read data from another source system and immediately send it to a real-
time target. For example, reading data from a relational source and writing it to WebSphere MQ. In this
case, set the session to run continuously so that every change in the source system can be immediately
reflected in the target.
A real-time session may run continuously until a condition is met to end the session. In some situations it
may be required to periodically stop the session and restart it. This is sometimes necessary to execute a
post-session command or run some other process that is not part of the session. To stop the session and
restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run
of the workflow as soon as the previous run completes.
To set a workflow to run continuously, edit the workflow and select the ‘Scheduler’ tab. Edit the
‘Scheduler’ and select ‘Run Continuously’ from ‘Run Options’. A continuous workflow starts automatically
when the Integration Service initializes. When the workflow stops, it restarts immediately.
Some of the transformations in PowerCenter are ‘active transformations’, which means that the number of
input rows and output rows of the transformation may differ. In most cases, an active transformation
requires all of the input rows to be processed before passing output rows to the next transformation
or target. For a real-time session, the flush latency will be ignored if the DTM needs to wait for all the rows to
be processed.
Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-
time session by setting the Transaction Scope property in the active transformation to ‘Transaction’. This
signals the session to process the data in the transformation once per transaction. For example, if a real-time
session uses an Aggregator that sums an input field, the summation will be done per transaction,
as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an
active transformation with a real-time session if you want to process the data per transaction.
Custom transformations can also be defined to handle data per transaction so that they can be used in a
real-time session.
PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS,
DATACOM, IDMS, IMS and VSAM sources in real time.
The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection
to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.
The NRDB CDC connection requires the application name and the restart token file name to be
overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes
the last restart token to a physical file called the RestartToken File. The next time the session starts, the
PowerCenter Server reads the restart token from the file and then starts reading changes from the point
where it last left off. Every PowerCenter session needs to have a unique restart token filename.
Informatica recommends archiving the file periodically. The reader timeout or the idle timeout can be used
to stop a real-time session. A post-session command can be used to archive the RestartToken file.
The encryption mode for this connection can slow down the read performance and increase resource
consumption. Compression mode can help in situations where the network is a bottleneck; using
compression also increases the CPU and memory usage on the source system.
When the PowerCenter session completes, the Integration Service writes the last restart token to the
RestartToken file. If, for some reason, the changes from a particular point in time have to be “replayed”, the
PowerExchange token from that point in time is needed.
To enable such a process, it is a good practice to periodically copy the token file to a backup folder. This
procedure is necessary to maintain an archive of the PowerExchange tokens. A real-time PowerExchange
session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session
command is used to copy the restart token file to an archive folder. The session will be part of a
continuous running workflow, so when the session completes after the post session command, it
automatically restarts again. From a data processing standpoint very little changes; the process pauses
for a moment, archives the token, and starts again.
A post-session command can be used to copy the restart token file (session.token) and append the
current system date/time to the file name for archive purposes.
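The archiving step can be sketched as follows. This is a Python illustration of the logic a post-session command would perform; all paths and file names here are illustrative, not PowerCenter defaults:

```python
import os
import shutil
import tempfile
import time

# Sketch: copy session.token to an archive folder with a timestamp suffix,
# as a post-session command would. Paths are illustrative.
token_dir = os.path.join(tempfile.gettempdir(), "pwx_tokens")
archive_dir = os.path.join(token_dir, "archive")
os.makedirs(archive_dir, exist_ok=True)

token_file = os.path.join(token_dir, "session.token")
with open(token_file, "w") as fh:  # simulate an existing restart token file
    fh.write("restart-token")

stamp = time.strftime("%Y%m%d%H%M%S")
archived = os.path.join(archive_dir, f"session.token.{stamp}")
shutil.copy2(token_file, archived)  # timestamped copy preserves the original
print(os.path.basename(archived).startswith("session.token."))  # True
```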
1. In the Workflow Manager, connect to a repository and choose Connection > Queue
2. The Queue Connection Browser appears. Select New > Message Queue
3. The Connection Object Definition dialog box appears
You need to specify three attributes in the Connection Object Definition dialog box:
● Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the
connection.)
● Queue Manager - the Queue Manager name for the message queue. (in Windows, the default
Queue Manager name is QM_<machine name>)
● Queue Name - the Message Queue name
● Open the MQ Series Administration Console. The Queue Manager should appear on the left
panel
● Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left
panel
Note that the Queue Manager’s name and Queue Name are case-sensitive.
PowerExchange for JMS can be used to read or write messages from various JMS providers, such as
WebSphere MQ JMS and BEA WebLogic Server.
PowerExchange for JMS uses two types of application connections:
● JNDI Application Connection, which is used to connect to a JNDI server during a session run.
● JMS Application Connection, which is used to connect to a JMS provider during a session run.
The JNDI application connection requires the following attributes:
● Name
● JNDI Context Factory
● JNDI Provider URL
● JNDI UserName
● JNDI Password
The JMS application connection requires the following attributes:
● Name
● JMS Destination Type
● JMS Connection Factory Name
● JMS Destination
● JMS UserName
● JMS Password
The JNDI settings for WebSphere MQ JMS can be configured using a file system service or LDAP
(Lightweight Directory Access Protocol).
The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed in the
WebSphere MQ Java installation/bin directory.
If you are using a file system service provider to store your JNDI settings, remove the number sign (#)
before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#)
before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory
If you are using a file system service provider to store your JNDI settings, remove the number sign (#)
before the following provider URL setting and provide a value for the JNDI directory:
PROVIDER_URL=file:/<JNDI directory>
<JNDI directory> is the directory where you want JNDI to store the .binding file.
Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#)
before the provider URL setting and specify a hostname.
#PROVIDER_URL=ldap://<hostname>/context_name
PROVIDER_URL=ldap://<localhost>/o=infa,c=rc
If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the
following settings and enter a user DN and password:
PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test
The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI
application connection in the Workflow Manager:
The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere
MQ Java installation/bin directory. Use this tool to configure the JMS Connection Factory.
● When Queue Connection Factory is used, define a JMS queue as the destination.
● When Connection Factory is used, define a JMS topic as the destination.
The following table shows the JMS object types and the corresponding attributes in the JMS application
connection in the Workflow Manager:
Configure the JNDI settings for WebSphere to use WebSphere as a provider for JMS sources or targets in
a PowerCenterRT session.
JNDI Connection
Add the following option to the file JMSAdmin.bat to configure JMS properly:
The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin
directory.
INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/
For example:
PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test
JMS Connection
The JMS configuration is similar to the JMS Connection for WebSphere MQ.
Configure the JNDI settings for BEA WebLogic to use BEA WebLogic as a provider for JMS sources or
targets in a PowerCenterRT session.
PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the same server. PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.
JNDI Connection
WebLogic Server automatically provides a context factory and URL during JNDI set-up. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.
Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow
Manager:
weblogic.jndi.WLInitialContextFactory
Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow
Manager:
t3://<WebLogic_Server_hostname>:<port>
where WebLogic Server hostname is the hostname or IP address of the WebLogic Server and port is the
port number for the WebLogic Server.
The JMS connection is configured from the BEA WebLogic Server console. Select JMS > Connection Factory.
The JMS Destination is also configured from the BEA WebLogic Server console.
From the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under
your domain.
The following table shows the JMS object types and the corresponding attributes in the JMS application
connection in the Workflow Manager:
In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.
TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for
JMS can’t connect directly with the Rendezvous Server. TIBCO Enterprise Server, which is JMS-
compliant, acts as a bridge between the PowerCenter Connect for JMS and TIBCO Rendezvous Server.
Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server for
PowerCenter Connect for JMS to be able to read messages from and write messages to TIBCO
Rendezvous Server.
To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server,
follow these steps:
1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.
To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:
1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below,
so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:
tibrv_transports = enabled
2. Enter the following transports in the transports.conf file:
[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server
The transports in the transports.conf configuration file specify the communication protocol between
TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties
on a destination can list one or more transports to use to communicate with the TIBCO
Rendezvous system.
3. Optionally, specify the name of one or more transports for reliable and certified message delivery
in the export property in the file topics.conf, as in the following example:
topicname export="RV"
The export property allows messages published to a topic by a JMS client to be exported to the external
systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous
reliable and certified messaging protocols.
When importing webMethods sources into the Designer, be sure the host name doesn't contain the '.' character; you can't use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets because PowerCenter doesn't use the same grouping method for importing sources and targets. To get around this, modify the host file to resolve the short name to the IP address.
For example:
Host File:
crpc23232.crp.informatica.com crpc23232
Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing a webMethods source definition. This step is only required for importing PowerExchange for webMethods sources into the Designer.
If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate
document back to the broker for every document it receives. PowerCenter populates some of the
envelope fields of the webMethods target to enable webMethods broker to recognize that the published
document is a reply from PowerCenter. The envelope fields ‘destid’ and ‘tag’ are populated for the request/
reply model. ‘Destid’ should be populated from the ‘pubid’ of the source document and ‘tag’ should be
populated from ‘tag’ of the source document. Use the option ‘Create Default Envelope Fields’ when
importing webMethods sources and targets into the Designer in order to make the envelope fields
available in PowerCenter.
To create or edit the PowerExchange for webMethods connection, select Connections > Application > webMethods Broker from the Workflow Manager and specify the following connection attributes:
● Name
● Broker Host
● Broker Name
● Client ID
● Client Group
● Application Name
● Automatic Reconnect
● Preserve Client State
Enter the connection to the Broker Host in the following format: <hostname>:<port>.
If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up a different webMethods connection for each pair because they cannot share a client ID.
Challenge
Description
On hardware systems that are under-utilized, you may be able to improve performance
by processing partitioned data sets in parallel in multiple threads of the same session
instance running on the PowerCenter Server engine. However, parallel execution may
impair performance on over-utilized systems or systems with smaller I/O capacity.
Assumptions
The following assumptions pertain to the source and target systems of a session that is
a candidate for partitioning. These factors can help to maximize the benefits that can
be achieved through partitioning.
● Indexing has been implemented on the partition key when using a relational
source.
● Source files are located on the same physical machine as the PowerCenter
Server process when partitioning flat files, COBOL, and XML, to reduce
network overhead and delay.
● All possible constraints are dropped or disabled on relational targets.
● All possible indexes are dropped or disabled on relational targets.
● Table spaces and database partitions are properly managed on the target
system.
● Target files are written to the same physical machine that hosts the PowerCenter Server process.
First, determine if you should partition your session. Parallel execution benefits
systems that have the following characteristics:
Check idle time and busy percentage for each thread. This gives the high-level
information of the bottleneck point/points. In order to do this, open the session log and
look for messages starting with “PETL_” under the “RUN INFO FOR TGT LOAD
ORDER GROUP” section. These PETL messages give the following details against the
reader, transformation, and writer threads:
Sufficient memory. If too much memory is allocated to your session, you will receive a
memory allocation error. Check to see that you're using as much memory as you can. If
the session is paging, increase the memory. To determine if the session is paging:
If you determine that partitioning is practical, you can begin setting up the partition.
Partition Types
Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin
partitioning when you need to distribute rows evenly and do not need to group data
among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin
partitioning. For example, consider a session based on a mapping that reads data from
three flat files of different sizes.
In this scenario, the recommended best practice is to set a partition point after the
Source Qualifier and set the partition type to round-robin. The PowerCenter Server
distributes the data so that each partition processes approximately one third of the
data.
Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among partitions.
Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition: for example, when you need to sort items by item ID but do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to
ensure that groups of data based on a primary key are processed in the same partition.
Key Range Partitioning
With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port.
Use key range partitioning where the sources or targets in the pipeline are partitioned
by key range. Refer to Workflow Administration Guide for further directions on setting
up Key range partitions.
For example, with key range partitioning set at End range = 2020, the PowerCenter
Server passes in data where values are less than 2020. Similarly, for Start range =
2020, the PowerCenter Server passes in data where values are greater than or equal to
2020. Null values, and values that do not fall in either partition, are passed through the
first partition.
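The routing rule described above can be illustrated with a small sketch (a toy simulation of the rule, not PowerCenter itself, assuming two partitions split at 2020):

```shell
# Toy simulation of the key-range rule above (not PowerCenter itself):
# End range = 2020 is exclusive, Start range = 2020 is inclusive, and
# NULL or missing values fall through to the first partition.
route_row() {
  key="$1"
  if [ -z "$key" ]; then
    echo "partition1"          # NULL values go to the first partition
  elif [ "$key" -lt 2020 ]; then
    echo "partition1"          # values below the end range
  else
    echo "partition2"          # values at or above the start range
  fi
}
```

Here `route_row 1999` and `route_row ""` land in partition1, while `route_row 2020` lands in partition2.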
Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition
point to the next partition point without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to
improve performance, but do not want to (or cannot) change the distribution of data
across partitions. The Data Transformation Manager spawns a master thread on each
session run, which in turn creates three threads (reader, transformation, and writer
threads) by default. Each of these threads can, at the most, process one data set at a
time and hence, three data sets simultaneously. If there are complex transformations in
the mapping, the transformation thread may take a longer time than the other threads,
which can slow data throughput.
When you have considered all of these factors and selected a partitioning strategy, you
can begin the iterative process of adding partitions. Continue adding partitions to the
session until you meet the desired performance threshold or observe degradation in
performance.
● Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before adding additional partitions.
Refer to the Workflow Administration Guide for more information on Restrictions on
the Number of Partitions.
● Set DTM buffer memory. For a session with n partitions, set this value to at
least n times the original value for the non-partitioned session.
● Set cached values for sequence generator. For a session with n partitions,
there is generally no need to use the Number of Cached Values property of
the sequence generator. If you must set this value to a value greater than
zero, make sure it is at least n times the original value for the non-partitioned
session.
● Partition the source data evenly. The source data should be partitioned into
equal sized chunks for each partition.
● Partition tables. A notable increase in performance can also be realized when
the actual source and target tables are partitioned. Work with the DBA to
discuss the partitioning of source and target tables, and the setup of
tablespaces.
● Consider using external loader. As with any session, using an external
loader may increase session performance. You can only use Oracle external
loaders for partitioning. Refer to the Session and Server Guide for more
information on using and setting up the Oracle external loader for partitioning.
● Write throughput. Check the session statistics to see if you have increased
the write throughput.
● Paging. Check to see if the session is now causing the system to page. When
you partition a session and there are cached lookups, you must make sure
that DTM memory is increased to handle the lookup caches. When you
partition a source that uses a static lookup cache, the PowerCenter Server
creates one memory cache for each partition and one disk cache for each
transformation. Thus, memory requirements grow for each partition. If the
memory is not increased, the system may start paging to disk, causing degraded performance.
When you finish partitioning, monitor the session to see if the partition is degrading or
improving session performance. If the session performance is improved and the
session meets your requirements, add another partition.
Dynamic Partitioning
Challenge
Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.
Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined for specific
transformations and those server variables that were global in nature. Transformation variables were defined as
variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression,
Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect
the subdirectories for source files, target files, log files, and so forth.
More current versions of PowerCenter made variables and parameters available across the entire mapping rather
than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow
Manager. Using parameter files, these values can change from session-run to session-run. With the addition of
workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility
and reducing parameter file maintenance. Other important functionality that has been added in recent releases is
the ability to dynamically create parameter files that can be used in the next session in a workflow or in other
workflows.
Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or
session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or
variables and their values in the parameter file. Parameter files can contain the following types of parameters and
variables:
● Workflow variables
● Worklet variables
● Session parameters
● Mapping parameters and variables
When using parameters or variables in a workflow, worklet, mapping, or session, the Integration Service checks
the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize
workflow variables, worklet variables, mapping parameters, and mapping variables. If not defining start values for
these parameters and variables, the Integration Service checks for the start value of the parameter or variable in
other places.
Session parameters must be defined in a parameter file. Because session parameters do not have default values,
if the Integration Service cannot locate the value of a session parameter in the parameter file, it fails to initialize
the session. To include parameter or variable information for more than one workflow, worklet, or session in a
single parameter file, create separate sections for each object within the parameter file.
Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks
use, as necessary. To specify the parameter file that the Integration Service uses with a workflow, worklet, or
session, do either of the following:
● Enter the parameter file name and directory in the workflow, worklet, or session properties.
● Start the workflow or session with pmcmd and specify the parameter file name and directory in the command line.
If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd
command line, the Integration Service uses the information entered in the pmcmd command line.
When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet
or session whose parameters and variables are to be assigned. Assign individual parameters and variables
directly below this heading, entering each parameter or variable on a new line. List parameters and variables in
any order for each task.
● parameter name=value
● parameter2 name=value
● variable name=value
● variable2 name=value
For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter,
$$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value
of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The
session also uses session parameters to connect to source files and target databases, as well as to write session
log to the appropriate session log file. The following table shows the parameters and variables that can be defined
in the parameter file:
The parameter file for the session includes the folder and session name, as well as each parameter and variable:
● [Production.s_MonthlyCalculations]
● $$State=MA
● $$Time=10/1/2000 00:00:00
● $InputFile1=sales.txt
● $DBConnection_target=sales
● $PMSessionLogFile=D:/session logs/firstrun.txt
The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable.
This allows the Integration Service to use the value for the variable that was set in the previous session run.
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and
Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a
variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to
creating a port in most transformations (See the second figure, below). Variable values can be set in an expression using one of the following functions:
● SetVariable
● SetMaxVariable
● SetMinVariable
● SetCountVariable
A mapping variable can store the last value from a session run in the repository to be used as the starting value
for the next session run.
● Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily
identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
● Aggregation type. This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository at the end of each
session run would be the maximum value across ALL records until the value is deleted.
● Initial value. This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value
is identified, then a data-type specific default value is used.
Variable values are not stored in the repository when the session:
● Fails to complete.
● Is configured for a test load.
● Is a debug session.
Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the
parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined
initial value for the variable, or the default value based on the variable data type. The Integration Service looks for
the start value in the following order:
1. Value in the parameter file
2. Value saved in the repository
3. Initial value
4. Default value
Since parameter values do not change over the course of the session run, the value used is based on:
1. Value in the parameter file
2. Initial value
3. Default value
Once defined, mapping parameters and variables can be used in the Expression Editor section of the following
transformations:
● Expression
● Filter
● Router
● Update Strategy
● Aggregator
Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined
join, and source filter sections, as well as in a SQL override in the lookup transformation.
● Enter folder names for non-unique session names. When a session name exists more than once in a
repository, enter the folder name to indicate the location of the session.
● Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions
individually. Specify the same parameter file for all of these tasks or create several parameter files.
● If including parameter and variable information for more than one session in the file, create a new
section for each session. The folder name is optional.
[folder_name.session_name]
parameter_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
● Specify headings in any order. Place headings in any order in the parameter file. However, if defining
the same parameter or variable more than once in the file, the Integration Service assigns the parameter
or variable value using the first instance of the parameter or variable.
● Specify parameters and variables in any order. Below each heading, the parameters and variables
can be specified in any order.
● When defining parameter values, do not use unnecessary line breaks or spaces. The Integration
Service may interpret additional spaces as part of the value.
● List all necessary mapping parameters and variables. Values entered for mapping parameters and
variables become the start value for parameters and variables in a mapping. Mapping parameter and
variable names are not case sensitive.
● List all session parameters. Session parameters do not have default values. An undefined session
parameter can cause the session to fail. Session parameter names are not case sensitive.
● Use correct date formats for datetime values. When entering datetime values, use the following date
formats:
MM/DD/RR
MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
● Do not enclose parameters or variables in quotes. The Integration Service interprets everything after
the equal sign as part of the value.
● Do enclose parameters in single quotes in a Source Qualifier SQL override. Use single quotes if the
parameter represents a string or date/time value to be used in the override.
● Precede parameters and variables created in mapplets with the mapplet name as follows:
mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
Parameter files, along with session parameters, allow you to change certain values between sessions.
Another commonly used feature is the ability to create parameters in source qualifiers, which allows you to
reuse the same mapping, with different sessions, to extract the data specified in the parameter file the session
references. Moreover, it is sometimes necessary to create one mapping that generates a parameter file and a
second mapping that uses it: the first mapping builds a flat file that serves as the parameter file for another
session, and the second mapping pulls its data using a parameter in the Source Qualifier transformation, read
from the parameter file created by the first mapping.
Variables and parameters can enhance incremental strategies. The following example uses a mapping variable,
an expression transformation object, and a parameter file for restarting.
Scenario
Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new
information. The source data has an inherent post date, stored in a column named Date_Entered, that can be
used. The process will run once every twenty-four hours.
Sample Solution
Create a mapping with source and target objects. From the menu create a new mapping variable named $
$Post_Date with the following attributes:
● TYPE Variable
● DATATYPE Date/Time
● AGGREGATION TYPE MAX
● INITIAL VALUE 01/01/1900
Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used
within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE
(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute:
DATE_ENTERED > to_Date('$$Post_Date','MM/DD/YYYY HH24:MI:SS') [please be aware that this sample
refers to Oracle as the source RDBMS]. Also note that the initial value 01/01/1900 will be expanded by the
Integration Service to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.
The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the
function for setting the variable will reside. An output port named Post_Date is created with a data type of date/
time. In the expression code section, place the following function:
SETMAXVARIABLE($$Post_Date,DATE_ENTERED)
1. In order for the function to assign a value, and ultimately store it in the repository, the port must be
connected to a downstream object. It need not go to the target, but it must go to another Expression
Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream
transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an
update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not
work. In this case, make the session Data Driven and add an Update Strategy after the transformation
containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an
ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is
preserved.
The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900
providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it
encounters. Upon successful completion of the session, the variable is updated in the repository for use in the
next session run. To view the current value for a particular variable associated with the session, right-click on the
session in the Workflow Manager and choose View Persistent Values.
The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this
session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered >
02/03/1998 will be processed.
To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow
Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing
the Order of Evaluation to use the Initial Value declared from the mapping.
If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:
● Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A
session may (or may not) have a variable, and the parameter file need not have variables and
parameters defined for every session using the parameter file. To override the variable, either change,
uncomment, or delete the variable in the parameter file.
● Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.
Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in
the workflow or session properties:
● Select either the Workflow or Session, choose Edit, and click the Properties tab.
● Enter the parameter directory and name in the Parameter Filename field.
● Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Integration
Service operating system.
The following graphic shows the parameter filename and location specified in the session task.
In this example, after the initial session is run, the parameter file contents may look like:
[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the
subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a
simple Perl script or manual change can update the parameter file to:
[Test.s_Incremental]
$$Post_Date=04/21/2001
Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value
and uses that value for the session run. After successful completion, run another script to reset the parameter file.
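The update-and-reset step can be scripted. A minimal sketch using sed rather than Perl (assuming the two-line parameter file shown above; the file name Incremental.parm and the use of GNU sed's -i option are assumptions):

```shell
PARMFILE=Incremental.parm     # hypothetical file name

# Start from the two-line file shown above, with the override commented out
printf '[Test.s_Incremental]\n;$$Post_Date=\n' > "$PARMFILE"

# Set the override to a specific processing date (also uncomments the line)
set_post_date() {
  sed -i "s|^;*\\\$\\\$Post_Date=.*|\\\$\\\$Post_Date=$1|" "$PARMFILE"
}

# Reset: comment the override out so the stored repository value is used again
reset_post_date() {
  sed -i 's|^;*\$\$Post_Date=.*|;$$Post_Date=|' "$PARMFILE"
}

set_post_date 04/21/2001      # file now contains: $$Post_Date=04/21/2001
```

Calling reset_post_date after the run restores the ;$$Post_Date= line so the next session falls back to the stored value.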
Scenario
Company X maintains five Oracle database instances. All instances have a common table definition for sales
orders, but each instance has a unique instance name, schema, and login.
Each sales order table has a different name, but the same definition:
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the connections are named
according to the DB instance name. Using Designer, create the mapping that sources the commonly defined
table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:
Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.
Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.
Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-
down box, choose the following parameter:
$DBConnection_Source
Now create the parameter files. In this example, there are five separate parameter files.
Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1
Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99
Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC
Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY
Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular
parameter file is as follows:
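A sketch of the command shape, consistent with the startworkflow example given further below (the server, folder, workflow, and credential values here are placeholders, not values from this section):

```shell
pmcmd startworkflow -u USERNAME -p PASSWORD \
    -s SALES:6258 -f FOLDER_NAME -w WORKFLOW_NAME \
    -paramfile /path/to/parmfile.txt
```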
You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual
password.
When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter
Integration Service runs the workflow using the parameters in the file specified. For UNIX shell users, enclose the
parameter file name in single quotes:
-paramfile '$PMRootDir/myfile.txt'
For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the
name includes spaces, enclose the file name in double quotes:
-paramfile "$PMRootDir\my file.txt"
Note: When writing a pmcmd command that includes a parameter file located on another machine, use the
backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the
server variable.
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\
$PMRootDir/myfile.txt'
In the event that it is necessary to run the same workflow with different parameter files, use the following five
separate commands:
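Rather than typing the five commands by hand, a small shell loop can generate them from the parameter file names above (a sketch; the credentials, server, and workflow name are placeholders, not values from this section):

```shell
# Print the pmcmd invocation for one parameter file. USERNAME, PASSWORD,
# SALES:6258, and the workflow name wf_SOURCE_CHANGES are placeholders.
run_cmd() {
  echo "pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258" \
       "-f Test -w wf_SOURCE_CHANGES -paramfile Parmfile$1.txt"
}

# One command per parameter file; pipe the output to sh to execute them.
for i in 1 2 3 4 5; do
  run_cmd "$i"
done
```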
Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script
can change the parameter file for the next session.
Using advanced techniques, a PowerCenter mapping can be built that produces, as its target file, a parameter file
(.parm) that can be referenced by other mappings and sessions. When many mappings use the same parameter
file, it is desirable to be able to re-create the file easily when mapping parameters are changed or updated. This
can also be beneficial when parameters change from run to run. There are a few different methods of creating a
parameter file with a mapping.
There is a mapping template example on my.informatica.com that illustrates a method of using a PowerCenter
mapping to source from a process table containing mapping parameters and to create a parameter file. The same
result can also be achieved by sourcing a flat file in parameter file format, with code characters in the fields to
be altered.
[folder_name.session_name]
parameter_name= <parameter_code>
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name= <parameter_code>
variable_name=value
mapplet_name.parameter_name=value
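As a rough sketch of the process-table approach, a delimited extract of the table can be turned into the format above. The column layout here (folder, session, parameter, value) and the file names are assumptions for illustration:

```shell
# Sample extract from a hypothetical process table: one row per
# parameter, columns are folder,session,parameter,value.
cat > params.csv <<'EOF'
Test,s_Incremental_SOURCE_CHANGES,$$Source_Schema_Table,aardso.orders
Test,s_Incremental_SOURCE_CHANGES,$DBConnection_Source,ORC1
EOF

# Emit a [folder.session] header whenever the group changes, then the
# name=value pairs belonging to that group.
awk -F, '
  $1 "." $2 != hdr { hdr = $1 "." $2; print "[" hdr "]" }
  { print $3 "=" $4 }
' params.csv > generated.parm
```

Re-running the script after the process table changes regenerates the file in one step, which is the benefit described below.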
In place of the text <parameter_code> one could place the text filename_<timestamp>.dat. The mapping would
then perform a string replace wherever the text <timestamp> occurred and the output might look like:
Src_File_Name= filename_20080622.dat
This method works well when values change often and parameter groupings use different parameter sets. The
main benefit is that when many mappings share the same parameter file, changes can be made by updating the
source table and re-creating the file, which is faster than manually updating the file line by line.
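The <timestamp> replacement described above can be sketched in shell; the template and output file names are illustrative:

```shell
# Substitute <timestamp> in a parameter-file template with today's date.
printf 'Src_File_Name= filename_<timestamp>.dat\n' > template.parm
ts=$(date +%Y%m%d)
sed "s/<timestamp>/${ts}/g" template.parm > session.parm
cat session.parm   # e.g. Src_File_Name= filename_20080622.dat
```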
Use a single parameter file to group parameter information for related sessions.
When sessions are likely to use the same database connection or directory, you might want to include them in the
same parameter file. When connections or directories change, you can update information for all sessions by
editing one parameter file. Sometimes you reuse session parameters in a cycle. For example, you might run a
session against a sales database every day, but run the same session against the sales and marketing databases
once a week. You can create separate parameter files for each session run. Instead of changing the parameter
file in the session properties each time you run the weekly session, use pmcmd to specify the parameter file to
use when you start the session.
When you use a target file or target database connection parameter with a session, you can keep track of reject
files by using a reject file parameter. You can also use the session log parameter to write the session log to the
target machine.
Use a resource to verify the session runs on a node that has access to the parameter file.
In the Administration Console, you can define a file resource for each node that has access to the parameter file
and configure the Integration Service to check resources. Then, edit the session that uses the parameter file and
assign the resource. When you run the workflow, the Integration Service runs the session with the required
resource on a node that has the resource available.
If you keep all parameter files in one of the process variable directories, such as $SourceFileDir, use the process
variable in the session property sheet. If you need to move the source and parameter files at a later date, you can
update all sessions by changing the process variable to point to the new directory.
Challenge
For an error handling strategy to be implemented successfully, it must be integral to the load process as a
whole. The method of implementation for the strategy will vary depending on the data integration
requirements for each project.
The resulting error handling process should, however, always involve the following three steps:
1. Error identification
2. Error retrieval
3. Error correction
This Best Practice describes how each of these steps can be facilitated within the PowerCenter
environment.
Description
A typical error handling process leverages the error management technology available in PowerCenter.
These capabilities can be integrated to facilitate error identification, retrieval, and correction, as described in
the flow chart below:
The first step in the error handling process is error identification. Error identification is often achieved
through the use of the ERROR() function within mappings, enablement of relational error logging in
PowerCenter, and referential integrity constraints at the database.
This approach ensures that row-level issues such as database errors (e.g., referential integrity failures),
transformation errors, and business rule exceptions for which the ERROR() function was called are captured
in relational error logging tables.
Enabling the relational error logging functionality automatically writes row-level data to a set of four error
handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can
be centralized in the PowerCenter repository and store information such as error messages, error data, and
source row data. Row-level errors trapped in this manner include any database errors, transformation errors,
and business rule exceptions for which the ERROR() function was called within the mapping.
Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the
PowerCenter repository, it is important to make their retrieval simple and automated so that the process is
as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the
information stored in the PowerCenter repository. A typical error report prompts a user for the folder and
workflow name, and returns a report with information such as the session, error message, and data that
caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved
through a Data Analyzer report, or via an email alert that notifies a user when a certain threshold is crossed
(such as “number of errors is greater than zero”).
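A minimal sketch of such a threshold alert, assuming the error count has already been exported to a flat file (the file name, threshold, and mail hook are all placeholders):

```shell
# Alert when the number of captured errors exceeds a threshold.
echo 5 > errors.cnt            # count exported from the error tables
threshold=0
count=$(cat errors.cnt)
if [ "$count" -gt "$threshold" ]; then
  # Replace echo with your mail command or alerting hook.
  echo "ALERT: $count load errors detected"
fi
```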
Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of
error identification, and Data Analyzer can be used to simplify error retrieval, error correction is
straightforward. After retrieving an error through Data Analyzer, the error report (which contains information
such as workflow name, session name, error date, error message, error data, and source row data) can
be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of
an error, the error report can be extracted into a supported format and emailed to a developer or DBA to
resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface
supports emailing a report directly through the web-based interface to make the process even easier.
For further automation, a report broadcasting rule that emails the error report to a developer’s inbox can be
set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the
error, a fix for the error can be implemented. The exact method of data correction depends on various
factors such as the number of records with errors, data availability requirements per SLA, the level of data
criticality to the business unit(s), and the type of error that occurred. Considerations made during error
correction include:
• The ‘owner’ of the data should always fix the data errors. For example, if the source data is
coming from an external system, then the errors should be sent back to the source system to be
fixed.
• In some situations, a simple re-execution of the session will reprocess the data.
• Partial data that has been loaded into the target systems may need to be backed out in order to
avoid duplicate processing of rows.
• Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of
errors is low, the rejected data can be easily exported to Microsoft Excel or CSV format and
corrected in a spreadsheet from the Data Analyzer error reports. The corrected data can then be
manually inserted into the target table using a SQL statement.
Any approach to correct erroneous data should be precisely documented and followed as a standard.
For organizations that want to identify data irregularities post-load but do not want to reject such rows at load
time, the PowerCenter Data Profiling option can be an important part of the error management solution. The
PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that
provides profile reporting such as orphan record identification, business rule violation, and data irregularity
identification (such as NULL or default values). The Data Profiling option comes with a license to use Data
Analyzer reports that source the data profile warehouse to deliver data profiling information through an
intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports
can be delivered to users through the same easy-to-use application.
Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the
load management process and the load metadata; it is the integration of all these approaches that ensures
the system is sufficiently robust for successful operation and management. The flow chart below illustrates
this in the end-to-end load process.
• Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity
to source systems)?
• Source File Validation. Is the source file datestamp later than the previous load?
• File Check. Does the number of rows successfully loaded match the source rows read?
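The source file validation above can be sketched as follows, assuming the extract date is embedded in the file name and the last successful load date is read from a load-management table (both assumptions for illustration):

```shell
# Is the extract's datestamp later than the previous load?
src=orders_20080622.dat        # illustrative file name
last_load=20080621             # e.g. read from load management
file_date=${src##*_}           # strip through the last underscore
file_date=${file_date%.dat}    # strip the extension
if [ "$file_date" -gt "$last_load" ]; then
  echo "OK: $src is newer than the last load"
else
  echo "SKIP: $src has already been loaded" >&2
fi
```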
Challenge
A key requirement for any successful data warehouse or data integration project is that it attains credibility
within the user community. At the same time, it is imperative that the warehouse be as up-to-date as
possible: the more recent the information derived from it, the more relevant it is to the business
operations of the organization, thereby providing the best opportunity to gain an advantage over the
competition.
Transactional systems can manage to function even with a certain amount of error since the impact of an
individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can
be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse
systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number
of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse
"until someone notices" because business decisions may be driven by such information.
Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors
occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them
from the warehouse immediately (i.e., before the business tries to use the information in error).
These concerns cover both the high-level (i.e., related to the process or a load as a whole) and the
low-level (i.e., field- or column-related errors).
Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you
can be sure that every source element was populated correctly, with meaningful values, never missing a
value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed
structure, are always available on time (and in the correct order), and are never corrupted during transfer to
the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and
privileges change.
Realistically, however, the operational applications are rarely able to cope with every possible business
scenario or combination of events; operational systems crash, networks fall over, and users may not use the
transactional systems in quite the way they were designed. The operational systems also typically need
some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is
a risk that the source data does not match what the data warehouse expects.
Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by
the business managers. If erroneous data does reach the warehouse, it must be identified and removed
immediately (before the current version of the warehouse can be published). Preferably, error data should
be prevented from reaching the warehouse at all.
As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors
within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point
on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific
chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those
responsible for the source data into the warehouse process; source data staff understand that their
professionalism directly affects the quality of the reports, and end-users become owners of their data.
As a final consideration, error management (the implementation of an error handling strategy) complements
and overlaps load management, data quality and key management, and operational processes and
procedures.
Load management processes record at a high-level if a load is unsuccessful; error management records the
details of why the failure occurred.
Quality management defines the criteria whereby data can be identified as in error; and error management
identifies the specific error(s), thereby allowing the source data to be corrected.
Operational reporting shows a picture of loads over time, and error management allows analysis to identify
systematic errors, perhaps indicating a failure in operational procedure.
Error management must therefore be tightly integrated within the data warehouse load process. This is
shown in the high level flow chart below:
High-Level Issues
From previous discussion of load management, a number of checks can be performed before any attempt is
made to load a source data set. Without load management in place, it is unlikely that the warehouse process
will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot
(in so far as nearly all maintenance and development resources will be working full time to manually correct
bad data in the warehouse). The following assumes that you have implemented load management
processes similar to Informatica’s best practices.
• Process Dependency checks in the load management can identify when a source data set is
missing, duplicates a previous version, or has been presented out of sequence, and where the
previous load failed but has not yet been corrected.
• Load management prevents this source data from being loaded. At the same time, error
management processes should record the details of the failed load; noting the source instance, the
load affected, and when and why the load was aborted.
• Source file structures can be compared to expected structures stored as metadata, either from
header information or by attempting to read the first data row.
• Source table structures can be compared to expectations; typically this can be done by interrogating
the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by
simply running a ‘describe’ command against the table (again comparing to a pre-stored version in
metadata).
• Control file totals (for file sources) and row number counts (table sources) are also used to
determine if files have been corrupted or truncated during transfer, or if tables have no new data in
them (suggesting a fault in an operational application).
• In every case, information should be recorded to identify where and when an error occurred, what
sort of error it was, and any other relevant process-level details.
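The control-total check mentioned above can be sketched as follows, assuming the control file holds a single expected row count written by the extract process:

```shell
# Compare the rows in the extract against the control total written by
# the source system; abort the load on a mismatch.
printf 'row1\nrow2\nrow3\n' > orders.dat   # sample extract
echo 3 > orders.ctl                        # control total
expected=$(cat orders.ctl)
actual=$(wc -l < orders.dat)
if [ "$actual" -eq "$expected" ]; then
  echo "control total OK (${actual} rows)"
else
  echo "ABORT: expected ${expected} rows, got ${actual}" >&2
fi
```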
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load
to abort), further error management processes need to be applied to the individual source rows and fields.
• Individual source fields can be compared to expected data-types against standard metadata within
the repository, or against additional information added by the development team. In some instances, this is
enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the
source data set as a whole either cannot be processed at all or (more worryingly) is likely to be
processed unpredictably.
• Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-
in error handling can be used to spot failed date conversions, conversions of string to numbers, or
missing required data. In rare cases, stored procedures can be called if a specific conversion fails;
however this cannot be generally recommended because of the potentially crushing impact on
performance if a particularly error-filled load occurs.
• Business rule breaches can then be picked up. It is possible to define allowable values, or
acceptable value ranges within PowerCenter mappings (if the rules are few, and it is clear from the
mapping metadata that the business rules are included in the mapping itself). A more flexible
approach is to use external tables to codify the business rules. In this way, only the rules tables
need to be amended if a new business rule needs to be applied. Informatica has suggested
methods to implement such a process.
• Missing Key/Unknown Key issues have already been defined in their own best practice document
Key Management in Data Warehousing Solutions with suggested management techniques for
identifying and handling them. However, from an error handling perspective, such errors must still
be identified and recorded, even when key management techniques do not formally fail source rows
with key errors. Unless a record is kept of the frequency with which particular source data fails, it is
difficult to realize when there is a systematic problem in the source systems.
• Inter-row errors may also have to be considered. These may occur when a business process
expects a certain hierarchy of events (e.g., a customer query, followed by a booking request,
Since best practice means that referential integrity (RI) issues are proactively managed within the loads,
instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should
already have identified that reference information is missing).
However, there is little that can be done to identify the more generic RDBMS problems that are likely to
occur; changes to schema permissions, running out of temporary disk space, dropping of tables and
schemas, invalid indexes, no further table space extents available, missing partitions and the like.
Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space,
command syntax, and authentication may occur outside of the data warehouse. Often such changes are
driven by Systems Administrators who, from an operational perspective, are not aware that there is likely to
be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept
up to speed.
In both of the instances above, the nature of the errors may be such that not only will they cause a load to
fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS
user ids are revoked, it may be impossible to write a row to an error table if the error process depends on
the revoked id; if disk space runs out during a write to a target table, this may affect all other tables
(including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even
the log files) may not be accessible.
Most of these types of issues can be managed by a proper load management process, however. Since
setting the status of a load to ‘complete’ should be absolutely the last step in a given process, any failure
before, or including, that point leaves the load in an ‘incomplete’ state. Subsequent runs should note this,
and enforce correction of the last load before beginning the new one.
The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational
Administrators and DBAs have proper and working communication with the data warehouse management to
allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse
operators to rapidly explain and resolve such errors if they occur.
Load management and key management best practices (Key Management in Data Warehousing Solutions)
have already defined auto-correcting processes; the former to allow loads themselves to launch, rollback,
and reload without manual intervention, and the latter to allow RI errors to be managed so that the
quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as
the source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse should attempt to
change source data as a general principle. Even if this were possible (which is debatable), such functionality
would mean that the absolute link between the source data and its eventual incorporation into the data
warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking
the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely
from scratch.
The principle to apply here is to identify the errors in the load, and then alert the source system users that
data should be corrected in the source system itself, ready for the next load to pick up the right data. This
maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and
permits extra training needs to be identified and managed.
The following data structure is an example of the error metadata that should be captured as a minimum
within the error handling strategy.
• The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:
• The ERROR_HEADER table provides a high-level view on the process, allowing a quick
identification of the frequency of error for particular loads and of the distribution of error types. It is
linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which
other process-level information can be gathered.
• The ERROR_DETAIL table stores information about actual rows with errors, including how to
identify the specific row that was in error (using the source natural keys and row number) together
with a string of field identifier/value pairs concatenated together. It is not expected that this
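An illustrative shape for these three tables is sketched below. The SRC_INST_ID and PROC_INST_ID links come from the text above; the remaining column names and types are assumptions, not a prescribed schema:

```shell
# Write out illustrative DDL for the minimum error metadata.
cat > error_metadata.sql <<'EOF'
CREATE TABLE ERROR_DEFINITION (
  ERROR_TYPE_ID   INTEGER PRIMARY KEY,
  ERROR_TYPE_DESC VARCHAR(255)      -- description of this type of error
);
CREATE TABLE ERROR_HEADER (
  ERROR_HDR_ID    INTEGER PRIMARY KEY,
  SRC_INST_ID     INTEGER,          -- link to load management (source instance)
  PROC_INST_ID    INTEGER,          -- link to load management (process instance)
  ERROR_TYPE_ID   INTEGER,          -- FK to ERROR_DEFINITION
  ERROR_DTM       TIMESTAMP
);
CREATE TABLE ERROR_DETAIL (
  ERROR_HDR_ID    INTEGER,          -- FK to ERROR_HEADER
  SRC_NATURAL_KEY VARCHAR(255),     -- identifies the source row in error
  SRC_ROW_NUM     INTEGER,
  FIELD_VALUES    VARCHAR(4000)     -- concatenated field-id/value pairs
);
EOF
```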
Challenge
The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes
various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for
handling data errors, and alternatives for addressing the most common types of problems. For the most part, these
strategies are relevant whether your data integration project is loading an operational data structure (as with data
migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing
structure.
Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the
business. When the source system data does not meet these rules, the process needs to handle the exceptions in
an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to
enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide
what is acceptable and prioritize two conflicting goals:
In general, there are three methods for handling data errors detected in the loading process:
● Reject All. This is the simplest to implement since all errors are rejected from entering the target when they
are detected. This provides a very reliable target that the users can count on as being correct, although it
may not be complete. Both dimensional and factual data can be rejected when any errors are encountered.
Reports indicate what the errors are and how they affect the completeness of the data.
Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key
relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a
subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and
loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the
users need to take into account that the data they are looking at may not be a complete picture of the
operational systems until the errors are fixed. For an operational system, this delay may affect downstream
transactions.
The development effort required to fix a Reject All scenario is minimal, since the rejected data can be
processed through existing mappings once it has been fixed. Minimal additional code may need to be written
since the data will only enter the target if it is correct, and it would then be loaded into the data mart using
the normal process.
● Reject None. This approach gives users a complete picture of the available data without having to consider
data that was not available due to it being rejected during the load process. The problem is that the data
may not be complete or accurate. All of the target data structures may contain incorrect information that
can lead to incorrect decisions or faulty transactions.
With Reject None, the complete set of data is loaded, but the data may not support correct transactions or
The development effort to fix this scenario is significant. After the errors are corrected, a new loading
process needs to correct all of the target data structures, which can be a time-consuming effort based on the
delay between an error being detected and fixed. The development strategy may include removing
information from the target, restoring backup tapes for each night’s load, and reprocessing the data. Once
the target is fixed, these changes need to be propagated to all downstream data structures or data marts.
● Reject Critical. This method provides a balance between missing information and incorrect information. It
involves examining each row of data and determining the particular data elements to be rejected. All
changes that are valid are processed into the target to allow for the most complete picture. Rejected
elements are reported as errors so that they can be fixed in the source systems and loaded on a
subsequent run of the ETL process.
This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts
or updates.
Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be
summarized at various levels in the organization. Attributes provide additional descriptive information per
key element.
Inserts are important for dimensions or master data because subsequent factual data may rely on the
existence of the dimension data row in order to load properly. Updates do not affect the data integrity as
much because the factual data can usually be loaded with the existing dimensional data unless the update is
to a key element.
The development effort for this method is more extensive than Reject All since it involves classifying fields
as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The
effort also incorporates some tasks from the Reject None approach, in that processes must be developed to
fix incorrect data in the entire target data architecture.
Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target.
By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to
enter the target on each run of the ETL process, while at the same time screening out the unverifiable data
fields. However, business management needs to understand that some information may be held out of the
target, and also that some of the information in the target data structures may be at least temporarily
allocated to the wrong hierarchies.
Profiles are tables used to track history changes to the source data. As the source systems change, profile records
are created with date stamps that indicate when the change took place. This allows power users to review the target
data using either current (As-Is) or past (As-Was) views of the data.
A profile record should occur for each change in the source data. Problems occur when two fields change in the
source system and one of those fields results in an error. The first value passes validation, which produces a new
profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it
would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this
UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the
correction process is complicated further.
The following example represents three field values in a source system. The first row on 1/1/2000 shows the original
values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is
invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field
2 is finally corrected to a valid value.
Three methods exist for handling the creation and update of profiles:
1. The first method produces a new profile record each time a change is detected in the source. If a field value
was invalid, then the original field value is maintained.
By applying all corrections as new profiles, this method simplifies the process: every change in the source
system is applied directly to the target. Each change -- regardless of whether it is a fix to a previous error
-- creates a new profile. This incorrectly shows in the target that two changes occurred to the source
information when, in reality, a mistake was entered on the first change and should be reflected in the first
profile; the second profile should not have been created.
2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000,
which loses the profile record for the change to Field 3.
If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile
information. If the third field changes before the second field is fixed, we show the third field changed at the
same time as the first. When the second field was fixed, it would also be added to the existing profile, which
incorrectly reflects the changes in the source system.
3. The third method creates only two new profiles, but then causes an update to the profile records on
1/15/2000 to fix the Field 2 value in both.
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create
complex algorithms that handle the process correctly. It involves being able to determine when an error occurred
and examining all profiles generated since then and updating them appropriately. And, even if we create the
algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If
an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error,
causing an automated process to update old profile records, when in reality a new profile record should have been
entered.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters
a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile
records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile
records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is
decided. Once an action is decided, another process examines the existing Profile records and corrects them as
necessary. This method only delays the As-Was analysis of the data until the correction method is determined
because the current information is reflected in the new Profile.
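The recommended strategy can be sketched in Python. This is a minimal illustration, not the actual load process: the `load_change` helper, the `VALID_COLORS` domain, and the in-memory `profiles` list are all assumed names standing in for real target structures.

```python
# Hypothetical sketch of the recommended profile-correction strategy:
# rejected values are remembered, and when a valid value later arrives
# for a field that previously failed validation, the new profile is
# flagged for review instead of silently rewriting history.

VALID_COLORS = {"Black", "Red", "White"}  # assumed domain for the field

profiles = []          # target profile records, newest last
rejected_fields = {}   # field name -> last rejected value

def load_change(date, field, value):
    """Apply one source change; return the profile record created."""
    if field == "color" and value not in VALID_COLORS:
        # keep the prior value in the profile, remember the rejection
        rejected_fields[field] = value
        prior = profiles[-1][field] if profiles else None
        record = {"date": date, field: prior, "needs_review": False}
    else:
        # a valid value arriving after a rejection is a potential fix
        # to older profiles, so flag it for manual review
        record = {"date": date, field: value,
                  "needs_review": field in rejected_fields}
        rejected_fields.pop(field, None)
    profiles.append(record)
    return record

load_change("1/5/2000", "color", "BRed")        # invalid -> prior value kept
fix = load_change("1/15/2000", "color", "Red")  # valid -> flagged for review
```

The As-Was correction is thereby deferred: the flagged profile carries the current information immediately, while the review of older profiles happens later, as the text describes.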
Quality indicators can be used to record definitive statements regarding the quality of the data received and stored
in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary
key. Quality indicators can be used to:
● Show the record and field level quality associated with a given record at the time of extract.
● Identify data sources and errors encountered in specific records.
● Support the resolution of specific record error types via an update and resubmission process.
Quality indicators can be used to record several types of errors – e.g., fatal errors (missing primary key value),
missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error,
data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data
quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors
were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the
original file name and record number. These records cannot be loaded to the target because they lack a primary
key field to be used as a unique record identifier in the target.
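The appending of DQ fields and the routing of fatal errors can be sketched as follows. The field names, the `validate` helper, and the in-memory `rejected` list are illustrative assumptions; a real implementation would write to the Rejected Record Table described above.

```python
# Sketch: one DQ field is appended per source field when a record has
# errors, and records with a fatal error (missing primary key) go to a
# rejected-record table with the file name and line number.

REQUIRED = {"cust_id", "name"}   # assumed required fields

def validate(record):
    """Return a DQ code per field: 0=no error, 1=fatal, 2=missing required."""
    dq = {}
    for field, value in record.items():
        if value not in (None, ""):
            dq[field] = "0"
        elif field == "cust_id":
            dq[field] = "1"            # missing primary key is fatal
        elif field in REQUIRED:
            dq[field] = "2"            # missing data in a required field
        else:
            dq[field] = "0"
    return dq

rejected = []   # stands in for the Rejected Record Table

def process(record, source_file, line_no):
    dq = validate(record)
    if "1" in dq.values():
        rejected.append({"file": source_file, "line": line_no, **record})
        return None                    # fatal rows never reach the target
    if any(code != "0" for code in dq.values()):
        # append one DQ field per original field only when an error exists
        return {**record, **{f"DQ_{f}": c for f, c in dq.items()}}
    return record

row = process({"cust_id": "42", "name": "", "city": "Oslo"}, "cust.dat", 7)
bad = process({"cust_id": "", "name": "x", "city": ""}, "cust.dat", 8)
```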
For the remaining, non-fatal error types, the records can still be processed, but they contain errors. When an error
is detected during ingest and cleansing, the identified error type is recorded.
The requirement to validate virtually every data element received from the source data systems mandates the
development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality
of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues, business process problems and
information technology breakdowns.
The quality indicators apply a concise indication of the quality of the data within specific fields for every data type:
● “0” – No Error
● “1” – Fatal Error
● “2” – Missing Data from a Required Field
● “3” – Wrong Data Type/Format
● “4” – Invalid Data Value
● “5” – Outdated Reference Table in Use
These indicators provide the opportunity for operations staff, data quality analysts and users to readily identify
issues potentially impacting the quality of the data. At the same time, they provide the level of detail necessary for
acute quality problems to be remedied in a timely manner.
The need to periodically correct data in the target is inevitable. But how often should these corrections be
performed?
The correction process can be as simple as updating field information to reflect actual values, or as complex as
deleting data from the target, restoring previous loads from tape, and then reloading the information correctly.
Although we try to avoid performing a complete database restore and reload from a previous point in time, we
cannot rule this out as a possible solution.
As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data
and the related error messages indicating the causes of error. The business needs to decide whether analysts
should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors
are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility
problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source
systems, the target remains synchronized as the corrections flow through the normal load process.
Attribute Errors
Attributes provide additional descriptive information about a dimension concept. Attributes include things like the
color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate
characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the
target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data, (e.g. to find
specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected
and reapplied to the data in the target.
When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new
record enter the target. Some rules that have been proposed for handling defaults are as follows:
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value
does not translate into a reference table value, we use the ‘Unknown’ value. (All reference tables contain a value of
‘Unknown’ for this purpose.)
The business should provide default values for each identified attribute. Fields that are restricted to a limited domain
of values (e.g., On/Off or Yes/No indicators), are referred to as small-value sets. When errors are encountered in
translating these values, we use the value that represents off or ‘No’ as the default. Other values, like numbers, are
handled on a case-by-case basis. In many cases, the data integration process is set to populate ‘Null’ into these
fields, which means “undefined” in the target. After a source system value is corrected and passes validation, it is
corrected in the target.
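The default rules above can be sketched as a small lookup. The rule table, field kinds, and domains below are assumptions for illustration; the business would supply the real defaults per attribute.

```python
# Sketch of the proposed default-value rules: reference-table misses get
# 'Unknown', small-value sets default to the off/'No' value, and numeric
# fields default to Null ("undefined" in the target).

DEFAULTS = {
    "reference": "Unknown",  # every reference table carries an 'Unknown' row
    "small_set": "No",       # limited domains default to 'No'
    "numeric":   None,       # numbers default to Null
}

FIELD_KINDS = {"store_type": "reference", "is_active": "small_set",
               "floor_area": "numeric"}

def apply_default(field, value, valid):
    """Return value if it validates, otherwise the default for its kind."""
    if value in valid.get(field, {value}):
        return value
    return DEFAULTS[FIELD_KINDS[field]]

valid = {"store_type": {"OFFICE", "STORE", "WAREHSE"},
         "is_active": {"Yes", "No"}}

store_type = apply_default("store_type", "DEPOT", valid)   # untranslatable
is_active = apply_default("is_active", "Maybe", valid)     # bad small-set value
```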
The business also needs to decide how to handle new dimensional values such as locations. Problems occur when
the new key is actually an update to an old key in the source system. For example, a location number is assigned
and the new location is transferred to the target using the normal process; then the location number is changed due
to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the
change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error
causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An
analyst would be unable to get a complete picture.
Fixing this type of error involves integrating the two records in the target data, along with the related facts.
Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of
the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as
to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together
and the originals deleted in order to correct the data.
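The fact-integration step can be sketched as a re-key-and-sum. The row shapes and keys (`W100`, `W5100`) are hypothetical, echoing the warehouse renumbering example; profile merging and effective-date coordination would be handled separately.

```python
# Sketch of merging two warehouse keys that represent the same entity:
# fact rows under both keys are added together and the originals dropped.

facts = [
    {"warehouse": "W100", "date": "1/5/2000", "units": 30},
    {"warehouse": "W5100", "date": "1/5/2000", "units": 12},
    {"warehouse": "W5100", "date": "1/6/2000", "units": 8},
]

def merge_keys(facts, old_key, new_key):
    """Re-key old facts, then sum rows that now share (key, date)."""
    merged = {}
    for row in facts:
        key = new_key if row["warehouse"] == old_key else row["warehouse"]
        bucket = merged.setdefault(
            (key, row["date"]),
            {"warehouse": key, "date": row["date"], "units": 0})
        bucket["units"] += row["units"]
    return list(merged.values())

facts = merge_keys(facts, "W100", "W5100")
```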
If information is captured as dimensional data from the source, but used as measures residing on the fact records in
the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until
the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time
consuming and difficult to implement.
If we let the facts enter downstream target structures, we need to create processes that update them after the
dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes
simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to dimensional data, then when
we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the
following night. This nightly reprocessing continues until the data successfully enters the target data structures.
Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.
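The nightly reprocess cycle can be sketched as follows. The `dimension_keys` set, row shapes, and in-memory tables are illustrative stand-ins for the real reject table and target structures.

```python
# Sketch of the nightly reprocessing loop for rejected facts: each run
# retries the reject table, loads rows whose dimension keys now resolve,
# and carries the rest forward to the next night.

dimension_keys = {"P1"}          # products currently in the target
target, rejects = [], [{"product": "P2", "qty": 5}]

def nightly_run(new_rows):
    """Process tonight's rows plus prior rejects; refill the reject table."""
    global rejects
    pending, rejects = rejects + new_rows, []
    for row in pending:
        if row["product"] in dimension_keys:
            target.append(row)
        else:
            rejects.append(row)     # retry again tomorrow night

nightly_run([{"product": "P1", "qty": 3}])   # P2 still rejected
dimension_keys.add("P2")                     # dimension arrives next day
nightly_run([])                              # P2 now loads
```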
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new
entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data
and translation tables enable the target data architecture to maintain consistent descriptions across multiple source
systems, regardless of how the source system stores the data. New entities in dimensional data include new
locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different
data for the same dimensional entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a
short code value as a primary key and a long description for reporting purposes. A translation table is associated
with each reference table to map the codes to the source system values. Using both of these tables, the ETL
process can load data from the source systems into the target structures.
The translation tables contain one or more rows for each source value and map the value to a matching row in the
reference table. For example, the SOURCE column in FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data
steward would be responsible for entering in the translation table the following values:
O → OFFICE
S → STORE
W → WAREHSE
OF → OFFICE
ST → STORE
WH → WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into
descriptions. The ETL process uses the reference table to populate the following values into the target:
OFFICE → Office
STORE → Store
WAREHSE → Warehouse
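The two-step lookup through the translation and reference tables can be sketched in Python. The dictionaries mirror the example values above; in the real ETL process these would be database tables maintained by the data steward.

```python
# Minimal sketch of the two-step lookup: the translation table maps raw
# source codes to reference keys, and the reference table supplies the
# reporting description. Untranslatable values fall back to 'Unknown'.

translation = {"O": "OFFICE", "S": "STORE", "W": "WAREHSE",
               "OF": "OFFICE", "ST": "STORE", "WH": "WAREHSE"}
reference = {"OFFICE": "Office", "STORE": "Store", "WAREHSE": "Warehouse"}

def describe(source_value):
    """Translate a raw source code into its reporting description."""
    code = translation.get(source_value, "UNKNOWN")
    return reference.get(code, "Unknown")

desc = describe("ST")        # source 'ST' -> code 'STORE' -> 'Store'
missing = describe("XX")     # untranslatable -> 'Unknown'
```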
Error handling results when the data steward enters incorrect information for these mappings and needs to correct
them after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered
ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore
and reload source data from the first time the mistake was entered. Processes should be built to handle these types
of situations, including correction of the entire target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include Locations
and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These
translation tables map the source system value to the target value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same product in the target. (Other similar
translation issues may also exist, but Products serves as a good example for error handling.)
There are two possible methods for loading new dimensional entities. Either require the data steward to enter the
translation data before allowing the dimensional data into the target, or create the translation data through the ETL
process and force the data steward to review it. The first option requires the data steward to create the translation
for new entities, while the second lets the ETL process create the translation, but marks the record as ‘Pending
Verification’ until the data steward reviews it and changes the status to ‘Verified’ before any facts that reference it
can be loaded.
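The second option can be sketched as a status gate. The in-memory `translations` table and record shapes are hypothetical; the point is that facts referencing a ‘Pending Verification’ entry are held back until a steward marks it ‘Verified’.

```python
# Sketch of ETL-created translations with steward review: the load
# creates the entry as 'Pending Verification', and facts referencing it
# are rejected until the status changes to 'Verified'.

translations = {}   # source value -> {"target": ..., "status": ...}

def load_dimension(source_value):
    if source_value not in translations:
        translations[source_value] = {"target": source_value,
                                      "status": "Pending Verification"}

def load_fact(source_value, qty, facts, rejects):
    entry = translations.get(source_value)
    if entry and entry["status"] == "Verified":
        facts.append({"key": entry["target"], "qty": qty})
    else:
        rejects.append({"key": source_value, "qty": qty})

facts, rejects = [], []
load_dimension("PROD-9")
load_fact("PROD-9", 4, facts, rejects)         # held: still pending
translations["PROD-9"]["status"] = "Verified"  # steward reviews it
load_fact("PROD-9", 4, facts, rejects)         # now loads
```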
When the dimensional value is left as ‘Pending Verification’, however, facts may be rejected or allocated to dummy
values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to
this issue is to generate an email each night if there are any translation table entries pending verification. The data
steward then opens a report that lists them.
The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same
product, but really represent two different products). In this case, it is necessary to restore the source information for
all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from
the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split
to generate correct profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs
to be established for manually entering fixed data and applying it correctly to the entire target data architecture,
including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further,
a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of
the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources
contain subsets of the required information. For example, one system may contain Warehouse and Store
information while another contains Store and Hub information. Because they share Store information, it is difficult to
decide which source contains the correct information.
When this happens, both sources have the ability to update the same row in the target. If both sources are allowed
to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared
information on only one source system, the two systems then contain different information. If the changed system is
loaded into the target, it creates a new profile indicating the information changed. When the second system is
loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another
new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be
loaded every day until the two source systems are synchronized with the same information.
To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source
that should be considered primary for the field. Then, only if the field changes on the primary source would it be
changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can
provide information toward the one profile record created for that day.
One solution to this problem is to develop a system of record for all sources. This allows developers to pull the
information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to
indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can
use the field level information to update only the fields that are marked as primary. However, this requires additional
effort by the data stewards to mark the correct source fields as primary and by the data integration team to
customize the load process.
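The field-level primary-source idea can be sketched as a merge guarded by a source map. The `PRIMARY` map, system names, and record shapes are assumptions for illustration.

```python
# Sketch of field-level primary-source designation: each shared field is
# updated only from the system marked as primary for that field, which
# prevents two sources from ping-ponging profile changes.

PRIMARY = {"address": "SYS_A", "manager": "SYS_B"}  # assumed designations

def merge_profile(current, incoming, source_system):
    """Apply only those incoming fields for which this source is primary."""
    updated = dict(current)
    for field, value in incoming.items():
        if PRIMARY.get(field) == source_system:
            updated[field] = value
    return updated

profile = {"address": "1 Main St", "manager": "Lee"}
profile = merge_profile(profile,
                        {"address": "9 Oak Ave", "manager": "Kim"},
                        "SYS_A")
# only address changes: SYS_A is not primary for manager
```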
Challenge
Identifying and capturing data errors using a mapping approach, and making such errors available for further processing or correction.
Description
Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the
production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data
errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and
unexpected transformation or database constraint errors.
The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.
Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and
improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing
a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in
a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table;
constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log
and the reject/bad file, thus improving performance.
Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them.
This approach can be effective for many types of data content errors, including date conversion errors, null values intended for
not-null target fields, and incorrect data formats or data types.
In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to
not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from
the data flow and logged in an error table.
An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors.
A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows with
a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to the mapping in which the error occurred, and the ROW_ID would uniquely identify the rejected row.
The composite key is designed to allow developers to trace rows written to the error tables that store information useful for
error reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.
The table ERR_DESC_TBL, is designed to hold information about the error, such as the mapping name, the ROW_ID, and the
error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of
reference for reporting.
The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns:
ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire
row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them.
The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for
a not null target field could generate an error message such as ‘NAME is NULL’ or ‘DOB is NULL’. This step can be done in
an expression transformation (e.g., EXP_VALIDATION in the sample mapping).
After the field descriptions are assigned, the error row can be split into several rows, one for each possible error using a
normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that
are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would
be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.
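The normalize-and-filter step can be sketched outside PowerCenter as a function that emits zero or more error rows per source row. The table and column names follow the example (`ERR_DESC_TBL`, MAPPING_ID, ROW_ID); the mapping name is hypothetical.

```python
# Sketch of the normalizer-plus-filter logic: each validated row yields
# one (MAPPING_ID, ROW_ID, ERROR_DESC) row per error actually present,
# so a clean row yields nothing and a row with three errors yields three.

def error_rows(mapping_id, row_id, row, not_null_fields):
    """Emit one ERR_DESC_TBL row per null found in a not-null field."""
    return [{"MAPPING_ID": mapping_id, "ROW_ID": row_id,
             "ERROR_DESC": f"{field} is NULL"}
            for field in not_null_fields if row.get(field) is None]

row = {"NAME": None, "DOB": None, "CITY": "Pune"}
err_desc_tbl = error_rows("m_load_customer", 17, row,
                          ["NAME", "DOB", "CITY"])
```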
The following table shows how the error data produced may look.
The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in
mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of
data validation errors.
Data validation error handling can be extended by including mapping logic to grade error severity. For example, flagging data
validation errors as ‘soft’ or ‘hard’.
● A ‘hard’ error can be defined as one that would fail when being written to the database, such as a constraint error.
● A ‘soft’ error can be defined as a data content error.
A record flagged as ‘hard’ can be filtered from the target and written to the error tables, while a record flagged as ‘soft’ can be written
to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data
imperfections while still allowing the records to be processed for end-user reporting.
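The grading-and-routing rule can be sketched as follows. The severity checks are illustrative stand-ins for real constraint and content validations; the routing matches the text: ‘hard’ rows reach only the error tables, ‘soft’ rows reach both the target and the error tables.

```python
# Sketch of soft/hard error grading: a 'hard' error would fail at the
# database (e.g., a constraint violation), a 'soft' error is a content
# imperfection that may still be loaded for reporting.

def grade(row):
    if row.get("CUST_ID") is None:       # would violate a constraint
        return "hard"
    if row.get("DOB") is None:           # content imperfection
        return "soft"
    return None

target, errors = [], []
for row in [{"CUST_ID": 1, "DOB": None},
            {"CUST_ID": None, "DOB": "1970-01-01"}]:
    severity = grade(row)
    if severity:
        errors.append({**row, "SEVERITY": severity})
    if severity != "hard":               # soft and clean rows reach the target
        target.append(row)
```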
Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems.
The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings
that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified,
the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors,
the operations team can effectively communicate data quality issues to the business users.
Perfect data can never be guaranteed. In implementing the mapping approach described above to detect errors and log them to
an error table, how can we handle unexpected errors that arise in the load? For example, PowerCenter may apply the validated data
to the database; however the relational database management system (RDBMS) may reject it for some unexpected
reason. An RDBMS may, for example, reject data if constraints are violated. Ideally, we would like to detect these database-level
errors automatically and send them to the same error table used to store the soft errors caught by the mapping approach
described above.
An alternative might be to have the load process continue in the event of records being rejected, and then reprocess only the
records that were found to be in error. This can be achieved by configuring the ‘stop on errors’ property to 0 and switching on
relational error logging for a session. By default, the error-messages from the RDBMS and any un-caught transformation errors
are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four
tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS.
The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. The
PMERR_MSG table stores the error messages that were encountered in a session; the following columns of this table
allow us to retrieve any RDBMS errors:
● SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX)
View REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.
● TRANS_NAME: Name of the transformation where an error occurred. When an RDBMS error occurs, this is the name of the
target transformation.
● TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target
when the error occurred.
With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an
additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX View
REP_LOAD_SESSION in the repository, and insert the error details into ERR_DESC_TBL. When the post process
ends, ERR_DESC_TBL will contain both ‘soft’ errors and ‘hard’ errors.
One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can
be difficult when the source and target rows are not directly related (i.e., one source row can actually result in zero or more rows at
the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key
and target row number). The translation table can then be used by the post-load session to identify the source key by the target
row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat
file, or a primary key in the case of a relational data source.
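The post-load lineage join can be sketched as follows. The row contents and staging-table shape are illustrative; the PMERR_MSG columns named match those described above, and the translation table is the staging table the load mapping is assumed to have written.

```python
# Sketch of the post-load lineage join: RDBMS errors carry a target row
# number (TRANS_ROW_ID), and the staging translation table written during
# the load maps that row number back to the source key.

pmerr_msg = [{"SESS_INST_ID": 301, "TRANS_NAME": "T_CUSTOMER",
              "TRANS_ROW_ID": 52,
              "ERROR_MSG": "unique constraint violated"}]
row_translation = [{"TARGET_ROW_NO": 52, "SOURCE_KEY": "CUST-0042"}]

def post_load_errors(pmerr_msg, row_translation):
    """Join RDBMS errors to source keys for insertion into ERR_DESC_TBL."""
    lineage = {t["TARGET_ROW_NO"]: t["SOURCE_KEY"] for t in row_translation}
    return [{"SOURCE_KEY": lineage.get(e["TRANS_ROW_ID"]),
             "ERROR_DESC": e["ERROR_MSG"]} for e in pmerr_msg]

err_desc_rows = post_load_errors(pmerr_msg, row_translation)
```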
Reprocessing
After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of
the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations
team can, therefore, fix the data in the source that resulted in ‘soft’ errors and may be able to explain and remediate the ‘hard’ errors.
Once the errors have been fixed, the source data can be reloaded. Ideally, only the rows resulting in errors during the first run
should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using
a parameter to configure the mapping for an initial load or for a reprocess load. If the mapping is reprocessing, the lookup searches
for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. If
initial loading, all rows are passed through the filter, validated, and loaded.
With this approach, the same mapping can be used for initial and reprocess loads. During a reprocess run, the records
successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should
be inserted as if an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures
that reprocessing loads are repeatable and result in reducing numbers of records in the error table over time.
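The shared initial/reprocess mapping can be sketched as a parameterized filter. The function and parameter names are illustrative; in PowerCenter this would be a mapping parameter driving a lookup and filter.

```python
# Sketch of the combined initial/reprocess load: a parameter switches the
# filter on, and the lookup keeps only source rows whose row numbers
# appear in the error table from the previous run.

def load(source_rows, error_row_nos, reprocess):
    """Return the rows to validate and load for this run."""
    if not reprocess:
        return list(source_rows)          # initial load: everything passes
    return [r for r in source_rows if r["row_no"] in error_row_nos]

source = [{"row_no": 1}, {"row_no": 2}, {"row_no": 3}]
initial = load(source, set(), reprocess=False)
rerun = load(source, {2}, reprocess=True)   # only the failed row reloads
```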
Challenge
Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and
reprocess the corrected data.
Description
Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in
an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the
standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.
The first step in implementing an error handling strategy is to understand and define the error handling requirement.
Consider the following questions:
● What tools and methods can help in detecting all the possible errors?
● What tools and methods can help in correcting the errors?
● What is the best way to reconcile data across multiple systems?
● Where and how will the errors be stored? (i.e., relational tables or flat files)
A robust error handling strategy can be implemented using PowerCenter’s built-in error handling capabilities along with
Data Analyzer as follows:
● Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process
failures.
● Data Errors: Setup the ETL process to:
❍ Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables
for analysis, correction, and reprocessing.
❍ Setup Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
❍ Setup customized Data Analyzer reports and dashboards at the project level to provide information on
failed sessions, sessions with failed rows, load time, etc.
Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the
event of a session failure. Create a reusable email task and use it in the “On Failure Email” property settings in the
Components tab of the session, as shown in the following figure.
● %s – Session name.
● %e – Session status.
● %t – Source and target table details, including read throughput in bytes per second and write throughput in rows
per second. The PowerCenter Server includes all information displayed in the session detail dialog box.
● %a<filename> – Attach the named file. The file must be local to the PowerCenter Server. The following are valid
file names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>.
Note: The file name cannot include the greater than character (>) or a line break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include
these variables in the email message only.
PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these
tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when
compared with a custom error handling solution.
When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the
PowerCenter Server logs error information that allows you to determine the cause and source of the error. The
PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error
code, error message, repository name, folder name, session name, and mapping information. This error metadata is
logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR()
function, such as business rule violations.
Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you
enable error logging and choose the ‘Relational Database’ Error Log Type, the PowerCenter Server provides the
following tables and features:
❍ PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source
row.
❍ PMERR_MSG. Stores metadata about an error and the error message.
❍ PMERR_SESS. Stores metadata about the session.
❍ PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype,
when a transformation error occurs.
■ Appends error data to the same tables cumulatively, if they already exist, for the further runs of the
session.
■ Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors
to go to one set of error tables, you can specify the prefix as ‘EDW_’.
■ Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do
this, you specify the same error log table name prefix for all sessions.
Example:
In the following figure, the session ‘s_m_Load_Customer’ loads Customer Data into the EDW Customer table. The
Customer Table in EDW has the following structure:
To take advantage of PowerCenter’s built-in error handling features, you would set the session properties as shown
below:
The session property ‘Error Log Type’ is set to ‘Relational Database’, and ‘Error Log DB Connection’ and ‘Table name
Prefix’ values are given accordingly.
When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes
information into the Error Tables as shown below:
EDW_PMERR_DATA
Columns: WORKFLOW_RUN_ID, WORKLET_RUN_ID, SESS_INST_ID, TRANS_NAME, TRANS_ROW_ID, TRANS_ROW_DATA,
SOURCE_ROW_ID, SOURCE_ROW_TYPE, SOURCE_ROW_DATA, LINE_NO
EDW_PMERR_MSG
EDW_PMERR_SESS
EDW_PMERR_TRANS
By looking at the workflow run id and other fields, you can analyze the errors and reprocess them after fixing the errors.
Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer
is Informatica’s powerful business intelligence tool that is used to provide insight into the PowerCenter repository
metadata.
● Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever there is an
entry made into the error tables PMERR_DATA or PMERR_TRANS.
● Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter
folders for easy analysis.
● Configure reports to provide detailed information on the row-level errors for each session. This can be
accomplished by using the four error tables as data sources for the reports.
Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS
to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing
tedious queries, comparing two separately produced reports, or using constructs such as DBLinks.
Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your
company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provide a reliable and
reusable way to accomplish data reconciliation. Using Data Analyzer’s reporting capabilities, you can select data from
various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers
through aggregate reports. You can further schedule the reports to run automatically every time the relevant
PowerCenter sessions complete, and setup alerts to notify the appropriate business or technical users in case of any
discrepancies.
For example, a report can be created to ensure that the same number of customers exist in the ODS in comparison to a
data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by
comparing key metrics (e.g., customer counts, aggregated financial metrics) across data silos. Such reconciliation
reports can be run automatically after PowerCenter loads the data, or run on demand by technical or business users.
This process allows users to verify the accuracy of data and builds confidence in the data warehouse solution.
Challenge
Successfully identify the need for, and scope of, reusability. Create inventories of reusable
objects within a folder, shortcuts across folders (local shortcuts), or shortcuts across
repositories (global shortcuts).
Description
Reusable Objects
Please note that shortcuts are not supported for workflow level objects (Tasks).
Identify the need for reusable objects based on the following criteria:
Creating and testing common objects does not always save development time or
facilitate future maintenance. For example, if a simple calculation like subtracting a
current rate from a budget rate that is going to be used for two different mappings,
carefully consider whether the effort to create, test, and document the common object is
worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the
calculation is performed in many mappings, is complex, and all occurrences must be
updated following any change or fix, then it is an ideal case for a reusable object. When you add instances of a reusable
transformation to mappings, be careful that the changes do not invalidate the mapping or
generate unexpected data. The Designer stores each reusable transformation as
metadata, separate from any mapping that uses the transformation.
The second criterion for a reusable object concerns the data that will pass through the
reusable object. Developers often encounter situations where they may perform a certain
type of high-level process (i.e., a filter, expression, or update strategy) in two or more
mappings. For example, if you have several fact tables that require a series of dimension
keys, you can create a mapplet containing a series of lookup transformations to find each
dimension key. You can then use the mapplet in each fact table mapping, rather than
recreating the same lookup logic in each mapping. This seems like a great candidate for
a mapplet. However, after performing half of the mapplet work, the developers may
realize that the actual data or ports passing through the high-level logic are totally
different from case to case, thus making the use of a mapplet impractical. Consider
whether there is a practical way to generalize the common logic so that it can be
successfully applied to multiple cases. Remember, when creating a reusable object, the
actual object will be replicated in one to many mappings. Thus, in each mapping using
the mapplet or reusable transformation object, the same size and number of ports must
pass into and out of the mapping/reusable object.
Document the list of the reusable objects that pass this criteria test, providing a high-level
description of what each object will accomplish. The detailed design will occur in a future
subtask, but at this point the intent is to identify the number and functionality of reusable
objects that will be built for the project. Keep in mind that it is impossible to identify
one hundred percent of the reusable objects at this point; the goal here is to create an
initial inventory.
In some cases, data from different sources must pass through the same transformation
logic and be written to one destination database or to multiple destination databases.
Sometimes, depending on the availability of the sources, these loads must be scheduled
at different times. This is an ideal case for creating a reusable session and overriding
the database connections and pre- and post-session commands at the session instance
level.
Logging load statistics, failure criteria, and success criteria are usually common pieces of
logic executed for multiple loads in most projects, and these common tasks are likewise
good candidates for reuse.
Mappings
A mapping is a set of source and target definitions linked by transformation objects that
define the rules for data transformation. Mappings represent the data flow between
sources and targets. In a simple world, a single source table would populate a single
target table. However, in practice, this is usually not the case. Sometimes multiple
sources of data need to be combined to create a target table, and sometimes a single
source of data creates many target tables. The latter is especially true for mainframe
data sources, where COBOL OCCURS statements litter the landscape. In a typical
warehouse or data mart model, each OCCURS statement decomposes into a separate table.
The goal here is to create an inventory of the mappings needed for the project. For this
exercise, the challenge is to think in individual components of data movement. While the
business may consider a fact table and its three related dimensions as a single ‘object’ in
the data mart or warehouse, five mappings may be needed to populate the
corresponding star schema with data (i.e., one for each of the dimension tables and two
for the fact table, each from a different source system).
Typically, when creating an inventory of mappings, the focus is on the target tables, with
an assumption that each target table has its own mapping, or sometimes multiple
mappings. While often true, if a single source of data populates multiple tables, this
approach yields multiple mappings. Efficiencies can sometimes be realized by loading
multiple tables from a single source. By simply focusing on the target tables, however,
these efficiencies can be overlooked.
When completed, the spreadsheet can be sorted either by target table or source table.
Sorting by source table can help determine potential mappings that create multiple
targets.
When using a source to populate multiple tables at once for efficiency, be sure to keep
restartabilty and reloadability in mind. The mapping will always load two or more target
tables from the source, so there will be no easy way to rerun a single table. In this
example, potentially the Customers table and the Customer_Type tables can be loaded
in the same mapping.
When merging targets into one mapping in this manner, give both targets the same
At this point, it is often helpful to record some additional information about each mapping
to help with planning and maintenance.
First, give each mapping a name, applying the naming standards generated in 3.2 Design
Development Architecture. These names can then be used to distinguish mappings from
one another and can be put on the project plan as individual tasks.
Next, determine for the project a threshold for a high, medium, or low number of target
rows. For example, in a warehouse where dimension tables are likely to number in the
thousands and fact tables in the hundred thousands, the following thresholds might apply:
Assign a likely row volume (high, medium or low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates will
help to determine how many mappings are of ‘high’ volume; these mappings will be the
first candidates for performance tuning.
Add any other columns of information that might be useful to capture about each
mapping, such as a high-level description of the mapping functionality, resource
(developer) assigned, initial estimate, actual completion time, or complexity rating.
Challenge
Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.
Description
The Informatica tool suite can capture extensive metadata, but how much metadata is entered depends on the
metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings,
sources, targets, transformations, ports). All information about column size and scale, datatypes, and primary keys
is also stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may
be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., doing so requires
extra time and effort. But once that information is in the Informatica repository, it can be retrieved at any time using
the Metadata Reporter. Several out-of-the-box reports are available, and customized reports can also be created
to view that information. There are several options for exporting these reports (e.g., Excel spreadsheet, Adobe .pdf
file). Informatica offers two ways to access the repository metadata:
● Metadata Reporter, which is a web-based application that allows you to run reports against the repository metadata. This is a
very comprehensive tool that is powered by the functionality of Informatica’s BI reporting tool, Data Analyzer. It is included on the
PowerCenter CD.
● Because Informatica does not support or recommend direct reporting access to the repository, even for select-only queries, the
second way of reporting on repository metadata is through views written using Metadata Exchange (MX).
Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata
reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data
Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access
every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository. The
architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on
Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup.
Metadata Reporter setup requires importing the following .xml files from the PowerCenter CD, in the sequence listed
below:
● Schemas.xml
● Schedule.xml
● GlobalVariables_Oracle.xml (this file is database-specific; Informatica provides GlobalVariables files for DB2, SQL Server,
Sybase, and Teradata. Select the appropriate file based on your PowerCenter repository environment)
● Reports.xml
● Dashboards.xml
Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem
importing these files. However, if you are using an existing instance of Data Analyzer that currently serves some other
reporting purpose, be careful while importing these files; some of the objects (e.g., global variables, schedules) may already exist
with the same name. You can rename the conflicting objects.
The following folders are created in Data Analyzer when you import the files listed above:
● Data Analyzer Metadata Reporting - contains reports on the Data Analyzer repository itself (e.g., Todays Logins, Reports
Accessed by Users Today).
● PowerCenter Metadata Reports - contains reports on the PowerCenter repository. To better organize these reports by
functionality, they are grouped into the following subfolders:
● Configuration Management - contains a set of reports that provide detailed information on configuration management, including
deployment and label details. This folder contains the following subfolders:
❍ Deployment
❍ Label
● Operations - contains a set of reports that enable users to analyze operational statistics (server load, connection usage,
run times, load times, number of runtime errors, etc.) for workflows, worklets, and sessions. This folder contains the following
subfolders:
❍ Session Execution
❍ Workflow Execution
● PowerCenter Objects - contains a set of reports that enable users to identify all types of PowerCenter objects, their properties,
and their interdependencies with other objects in the repository. This folder contains the following subfolders:
❍ Mappings
❍ Mapplets
❍ Metadata Extension
❍ Server Grids
❍ Sessions
❍ Sources
❍ Targets
❍ Transformations
❍ Workflows
❍ Worklets
● Security - contains a set of reports that provide detailed information on users, groups, and their associations within the
repository.
Informatica recommends retaining this folder organization, adding new folders if necessary.
The Metadata Reporter provides 44 standard reports, which can be customized with parameters and wildcards.
Metadata Reporter is accessible from any computer with a browser that can reach the web server where the Metadata Reporter
is installed, even if the other Informatica client tools are not installed on that computer. The Metadata Reporter connects to
the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform.
(Note: you can also use the JDBC-ODBC bridge to connect to the repository, e.g., jdbc:odbc:<data_source_name>.)
● Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of
metadata objects.
● Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any
machine that has access to the web server.
● The reports in the Metadata Reporter are customizable. The Metadata Reporter allows you to set parameters for the
metadata objects to include in the report.
● The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a
report links to an associated report. As you view a report, you can generate reports for objects on which you need more
information.
The following list shows reports provided by the Metadata Reporter, along with their location and a brief description.

PowerCenter Metadata Reports:
4. All Object Version History (Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History): Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.
5. Server Load by Day of Week (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week): Displays the total number of sessions that ran, and the total session run duration, for any day of the week in any given month of the year, by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details): Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.
7. Target Table Load Analysis (Last Month) (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)): Displays the load statistics for each table for the last month by repository by folder. This is a primary report in an analytic workflow.
8. Workflow Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details): Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.
9. Worklet Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details): Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.
19. Server Grid List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List): Displays all server grids and the servers associated with each grid, including host name, port number, and internet protocol address of the servers.
20. Session List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List): Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.
22. Source Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts): Displays sources that are defined as shortcuts, by repository and folder.
24. Target Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts): Displays targets that are defined as shortcuts, by repository and folder.
27. Scheduler (Reusable) List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List): Displays all the reusable schedulers defined in the repository, with their descriptions and properties, by repository by folder. This is a primary report in an analytic workflow.
30. Users By Group (Public Folders>PowerCenter Metadata Reports>Security>Users By Group): Displays users by repository and group.

Data Analyzer Metadata Reporting:
1. Bottom 10 Least Accessed Reports this Year (Public Folders>Data Analyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year): Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
2. Report Activity Details (Public Folders>Data Analyzer Metadata Reporting>Report Activity Details): Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year", and "Usage by Login (Month To Date)".
3. Report Activity Details for Current Month (Public Folders>Data Analyzer Metadata Reporting>Report Activity Details for Current Month): Provides information about reports accessed in the current month up to the current date.
5. Reports Accessed by Users Today (Public Folders>Data Analyzer Metadata Reporting>Reports Accessed by Users Today): Part of the analytic workflow for "Todays Logins". It provides detailed information on the reports accessed by users today, and can be used independently to get comprehensive information about today's report activity.
6. Todays Logins (Public Folders>Data Analyzer Metadata Reporting>Todays Logins): Provides the login count and average login duration for users who logged in today.
7. Todays Report Usage by Hour (Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour): Provides the number of reports accessed today for each hour. The attached analytic workflow provides more details on the reports accessed, and the users who accessed them, during the selected hour.
8. Top 10 Most Accessed Reports this Year (Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year): Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9. Top 5 Logins (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date)): Provides users and their corresponding login counts for the current month to date. The attached analytic workflow provides more details about the reports accessed by a selected user.
10. Top 5 Longest Running On-Demand Reports (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)): Shows the five longest running on-demand reports for the current month to date, displaying the average total response time, average DB response time, and average Data Analyzer response time (all in seconds) for each report shown.
11. Top 5 Longest Running Scheduled Reports (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)): Shows the five longest running scheduled reports for the current month to date, displaying the average response time (in seconds) for each report shown.
12. Total Schedule Errors for Today (Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today): Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
14. Users Who Have Never Logged On (Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On): Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.
Once you select a report, you can customize it by setting parameter values and/or creating new attributes or metrics.
Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters
offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can
be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter,
consult the product documentation.
Wildcards
You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is
the same as using %. For example, the parameter value item_ matches values such as Items, because the underscore wildcard
matches any single character.
A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange
the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the
clipboard. Use Ctrl+V to paste the copy into a Word document.
Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a
robust security mechanism that is inherited by Metadata Reporter: you can establish groups, roles, and/or privileges for users
based on their profiles. Since the information in a PowerCenter repository does not change often after it goes to production,
the Administrator can create some reports and export them to files for distribution to the user community. If the number of
Metadata Reporter users is limited, you can implement security using report filters or the data restriction feature. For example, if
a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's
profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data
Analyzer documentation.
The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and
display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the
underlying repository tables while exposing the metadata in several categories that were more suitable for external parties.
Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the
MX views to report and query the Informatica metadata.
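For instance, a simple mapping inventory can be pulled from the MX views rather than from the underlying repository tables. The sketch below assumes an Oracle repository; the account and TNS alias are placeholders, and column names should be verified against the MX view documentation for your release.

```shell
#!/bin/sh
# Query the REP_ALL_MAPPINGS MX view for a folder-by-folder mapping inventory.
# Credentials and connection alias are illustrative placeholders.
SQL="SELECT subject_area, mapping_name
FROM   rep_all_mappings
ORDER  BY subject_area, mapping_name;"

if command -v sqlplus >/dev/null 2>&1; then
    echo "$SQL" | sqlplus -s repo_user/repo_pass@REPO_DB
else
    echo "sqlplus not found; query prepared but not executed"
fi
```

Because the MX views insulate you from the physical OPB_ table layout, a query like this is far less likely to break across repository upgrades than one written against the tables directly.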
Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for
creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2
supersede those of MX.
Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing
and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved
by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require
interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based
API for accessing and manipulating the PowerCenter Repository from various programming languages.
Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the
repository database and thus can be used independent of any of the Informatica software products. The same requirement also
holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently
of the client or server products.
Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data
warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This
type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica
partners by means of the new MX2 interfaces.
Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not
be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by
directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with
the appropriate verification and validation features to ensure the integrity of the metadata in the repository.
Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with
MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of
the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every
major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from
the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.
Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated
procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.
Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to
reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the
validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to
implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.
Challenge
Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.
Description
Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted
information from the repository keep the repository performing well.
Managing Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The
role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and
managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating
statistics.
Repository backup
Repository back up can be performed using the client tool Repository Server Admin Console or the command line
program pmrep. Backup using pmrep can be automated and scheduled for regular backups.
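A minimal sketch of such a backup script follows. The repository name, host, port, and administrator credentials are placeholders, not values from this document; verify the pmrep options against your PowerCenter version.

```shell
#!/bin/sh
# Nightly repository backup: connect with pmrep, write a date-stamped backup
# file, and compress it. All connection values below are placeholders.
REPO_NAME="DEV_REPO"
BACKUP_DIR="${BACKUP_DIR:-/tmp/infa_backups}"
STAMP=$(date +%Y%m%d)
BACKUP_FILE="${BACKUP_DIR}/${REPO_NAME}_${STAMP}.rep"

mkdir -p "$BACKUP_DIR"

# Guard so the script degrades gracefully where the client is not installed.
if command -v pmrep >/dev/null 2>&1; then
    pmrep connect -r "$REPO_NAME" -n Administrator -x "$PM_PASSWORD" \
          -h repo_host -o 5001
    pmrep backup -o "$BACKUP_FILE"
    gzip "$BACKUP_FILE"          # backup files compress well
else
    echo "pmrep not found; would have written $BACKUP_FILE"
fi
```

Scheduled from cron (e.g., 0 2 * * * /path/to/repo_backup.sh) or called from a PowerCenter command task, this produces one compressed, date-stamped backup per run.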
This shell script can be scheduled to run as cron job for regular backups. Alternatively, this shell script can be
called from PowerCenter via a command task. The command task can be placed in a workflow and scheduled to
run daily.
Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is
recommended once a month or prior to a major release. For development repositories, a backup is recommended
once a week or once a day, depending on the team size.
Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a
utility such as winzip or gzip.
Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than
the repository itself.
Move backups offline: Review the backups on a regular basis to determine how long they need to remain
online. Any that are not required online should be moved offline, to tape, as soon as possible.
Restore repository
Although the Repository restore function is used primarily as part of disaster recovery, it can also be useful for
testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica
recommends testing the backup files and recovery process at least once each quarter. The repository can be
restored using the client tool, Repository Server Administrator Console, or the command line programs
pmrepagent.
Restore folders
There is no easy way to restore a single folder from a backup. First, the backup repository must be
restored into a new repository; then you can use the client tool Repository Manager to copy the entire folder from
the restored repository into the target repository.
Use the purge command to remove older versions of objects from the repository. To purge a specific version of an
object, view the history of the object, select the version, and purge it.
If a PowerCenter repository is enabled for versioning through the Team Based Development option, objects that
have been deleted from the repository are not visible in the client tools. To list or view deleted objects, use either
the find checkouts command in the client tools or a repository query.
After an object has been deleted from the repository, you cannot create another object with the same name
unless the deleted object has been completely removed from the repository. Use the purge command to
completely remove deleted objects. Keep in mind, however, that you must remove all versions
of a deleted object to completely remove it from the repository.
Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository either by using
repository manager or the pmrep command line program. Logs can be truncated for the entire repository or for a
particular folder.
Options allow truncating all log entries or selected entries based on date and time.
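The truncation options can be sketched as a short script. This is a dry-run sketch: the commands are echoed rather than executed so the sequence can be reviewed first. The repository name, credentials, folder name, and exact option spellings are illustrative assumptions; verify them against the pmrep Command Reference for your PowerCenter version.

```shell
# Build the pmrep command lines as strings; echo them instead of running
# them so the sketch is safe to execute anywhere.
CONNECT_CMD='pmrep connect -r MyRepository -n Administrator -x MyPassword'
# Truncate every log entry in the repository:
TRUNC_ALL_CMD='pmrep truncatelog -t all'
# Truncate only entries for one folder, older than a cutoff date/time:
TRUNC_FOLDER_CMD='pmrep truncatelog -t "01/03/2005 00:00:00" -f FINDW_SRC_STG'
echo "$CONNECT_CMD"
echo "$TRUNC_ALL_CMD"
echo "$TRUNC_FOLDER_CMD"
```

Once the command lines have been validated interactively, the echoes can be removed and the script scheduled like any other maintenance job.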
Analyzing (updating the statistics of) repository tables can help to improve repository performance.
Because this process should be carried out for all tables in the repository, a script offers the most efficient means.
You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a
command task to call the script.
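Such a script might look like the following minimal sketch, which assumes an Oracle repository and the convention that repository tables share the OPB prefix. The three table names below are samples only; in a real script the list would be selected from user_tables (WHERE table_name LIKE 'OPB%') via sqlplus, and the generated file would then be executed against the repository schema.

```shell
# Generate ANALYZE statements for a (sample) list of repository tables.
# In practice, replace the hardcoded list with a query against user_tables.
for t in OPB_TASK_INST_RUN OPB_WFLOW_RUN OPB_SESS_TASK_LOG
do
  echo "ANALYZE TABLE $t COMPUTE STATISTICS;"
done > /tmp/analyze_repo.sql
cat /tmp/analyze_repo.sql
```

The resulting analyze_repo.sql can be run by the scheduled job mentioned above, either from an external scheduler or from a PowerCenter command task.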
Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on
repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various
causes should be analyzed and the repository server (or agent) configuration file modified to improve
performance.
Managing Metadata
The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The
queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7.
Minor changes in the queries may be required for PowerCenter repositories residing on other databases.
Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last ‘n’ days, replace
SYSDATE-1 with SYSDATE - n
SELECT Subject_Area AS Folder,
Session_Name,
Last_Error AS Error_Message,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code != 1
AND Actual_Start > SYSDATE - 1
The following query lists long running sessions in the last day. To make it work for the last ‘n’ days, replace
SYSDATE-1 with SYSDATE - n
SELECT Subject_Area AS Folder,
Session_Name,
Successful_Source_Rows AS Source_Rows,
Successful_Rows AS Target_Rows,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code = 1
AND Actual_Start > SYSDATE - 1
AND (Session_TimeStamp - Actual_Start) > (30/(24*60))
ORDER BY Session_TimeStamp
Adjust the duration predicate (here, sessions running longer than 30 minutes) to suit your definition of a long-running session.
Invalid Tasks
The following query lists the folder name, task name, version number, and last-saved date for all invalid tasks.
SELECT SUBJECT_AREA AS FOLDER_NAME,
TASK_NAME AS OBJECT_NAME,
VERSION_NUMBER,
LAST_SAVED
FROM REP_ALL_TASKS
WHERE IS_VALID=0
AND IS_ENABLED=1
ORDER BY SUBJECT_AREA,TASK_NAME
Load Counts
The following query lists the load counts (number of rows loaded) for the successful sessions.
SELECT
subject_area,
workflow_name,
session_name,
successful_rows,
failed_rows,
actual_start
FROM
REP_SESS_LOG
WHERE
run_status_code = 1
ORDER BY
subject_area,
workflow_name,
session_name
Challenge
Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in
the repository by associating information with individual objects in the repository.
Informatica Client applications can contain two types of metadata extensions: vendor-
defined and user-defined.
You can create reusable or non-reusable metadata extensions. You associate reusable
metadata extensions with all repository objects of a certain type. So, when you create a
reusable extension for a mapping, it is available for all mappings. Vendor-defined
metadata extensions are always reusable.
● Source definitions
● Target definitions
Metadata Extensions offer a very easy and efficient method of documenting important
information associated with repository objects. For example, when you create a
mapping, you can store the mapping owner's name and contact information with the
mapping, or when you create a source definition, you can enter the name of the
person who created or imported the source.
The power of metadata extensions is most evident in the reusable type. When you
create a reusable metadata extension for any type of repository object, that metadata
extension becomes part of the properties of that type of object. For example, suppose
you create a reusable metadata extension for source definitions called SourceCreator.
When you create or edit any source definition in the Designer, the SourceCreator
extension appears on the Metadata Extensions tab. Anyone who creates or edits a
source can enter the name of the person that created the source into this field.
You can create, edit, and delete non-reusable metadata extensions for sources,
targets, transformations, mappings, and mapplets in the Designer. You can create,
edit, and delete non-reusable metadata extensions for sessions, workflows, and
worklets in the Workflow Manager. You can also promote non-reusable metadata
extensions to reusable extensions using the Designer or the Workflow Manager. You
can also create reusable metadata extensions in the Workflow Manager or Designer.
You can create, edit, and delete reusable metadata extensions for all types of
repository objects using the Repository Manager. If you want to create, edit, or delete
metadata extensions for multiple objects at one time, use the Repository Manager.
When you edit a reusable metadata extension, you can modify the properties Default
Value, Permissions and Description.
You can also migrate Metadata Extensions from one environment to another.
Additionally, Metadata Extensions can be populated via data modeling tools such
as ERWin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange
for Data Models. With the Informatica Metadata Exchange for Data Models, the
Informatica Repository interface can retrieve and update the extended properties of
source and target definitions in PowerCenter repositories. Extended Properties are the
descriptive, user defined, and other properties derived from your Data Modeling tool
and you can map any of these properties to the metadata extensions that are already
defined in the source or target object in the Informatica repository.
Challenge
Once the data warehouse has been moved to production, the most important task is keeping the system running and
available for the end users.
Description
In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support
team. This team is typically involved with the support of other systems and has expertise in database systems and
various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production
Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an
Operations Manual, to help in the support of the production data warehouse.
Monitoring the system is useful for identifying any problems or outages before the users notice. The Production
Support team must know what failed, where it failed, when it failed, and who needs to be working on the
solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The
goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold
violations, service level agreements, and other organizational requirements helps to determine the effectiveness of the
data warehouse and any need for changes.
The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a
high-level document that discusses system maintenance and the components of the system, and identifies the groups
responsible for monitoring the various components. The SLA should be able to be measured against key performance
indicators. At a minimum, it should contain the following information:
Operations Manual
The Operations Manual is crucial to the Production Support team because it provides the information needed to
perform the data warehouse system maintenance. This manual should be self-contained, providing all of the
information necessary for a production support operator to maintain the system and resolve most problems that can
arise. This manual should contain information on how to maintain all data warehouse system components. At a
minimum, the Operations Manual should contain:
● Information on how to stop and re-start the various components of the system.
● IDs and passwords (or how to obtain passwords) for the system components.
The need to maintain archive logs and listener logs, use started tasks, perform recovery, and other operation functions
on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a
regular basis, operations is likely to face space issues. Setting up archive logs on MVS requires datasets to be
allocated and sized. Recovery after failure requires operations intervention to restart workflows and set the restart
tokens. For Change Data Capture, operations are required to start the started tasks in a scheduler and/or after an
IPL. There are certain commands that need to be executed by operations.
The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information on the
operation of PowerExchange Change Data Capture.
The archive log should be controlled by using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS
in parameter ARCHIVE_RTPD=. The default supplied in the Install (in RUNLIB member SETUPCC2) is 9999. This is
generally longer than most organizations need. To change it, just rerun the first step (and only the first step) in
SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention
period. This does not, however, fix the old archive datasets; to do that, use SMS to override the specification, removing
the need to change the EDMUPARM.
The listener's default log is part of the joblog of the running listener. If the listener job runs continuously, there is a
potential risk of the spool file reaching its maximum size and causing issues with the listener. To avoid this, if the
listener started task is scheduled to restart every weekend, the log is refreshed and a new spool file is created.
If necessary, change the started task listener jobs from //DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG;
this logs the file to the member LOG in the HLQ..RUNLIB.
The last-resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC process from
the new initial load start point. Fortunately, there are other solutions; and if you cannot afford to miss any changes,
re-initializing may not be an option anyway.
Application ID
PowerExchange documentation talks about “consuming” applications – the processes that extract changes, whether
in realtime mode or change mode (periodic batch extraction).
Each “consuming” application must identify itself to PowerExchange. Realistically, this means that each session must
have an application id parameter containing a unique “label”.
Restart Tokens
PowerExchange records each time that a consuming application successfully extracts changes. The end-point of
the extraction (Address in the database Log – RBA or SCN) is stored in a file on the server hosting the Listener that
reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is
possible, using the Navigator interface directly, or by updating the restart file, to force the next extraction to restart from
any of these points. If you’re using the ODBC interface for PowerExchange, this is the best solution to implement.
There are more likely scenarios though. If you are running realtime extractions, potentially never-ending or until there’s
a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may already have
processed and committed many changes. You can’t afford to “miss” any changes and you don’t want to reapply the
same changes you’ve just processed, but the previous restart token does not correspond to the reality of what you’ve
processed.
If you are using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies
with PowerCenter, which has historically been able to deal with restarting this type of process – Guaranteed Message
Delivery. This functionality is applicable to both realtime and change CDC options.
The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each
Application Id in files on the PowerCenter Server. The directory and file name are required parameters when
configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures
compared to using the ODBC interface to PowerExchange.
To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the
session properties. During normal session execution, PowerCenter Server stores recovery information in cache files in
the directory specified for $PMCacheDir.
If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the GMD cache
is purged.
If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token corresponding to
the point in time of the last of the unprocessed changes. This information is useful for recovery.
Recovery
If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode – either from the
PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this assumes that you are
able to identify that the session failed previously.
1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records
from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends “cleanly”.
The CDC session is now ready for you to execute in normal mode again.
You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you have to build
in some things yourself – coping with processing all the changes from the last restart token, even if you’ve already
processed some of them.
When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange restart
token. Your session has to cope with processing some of the same changes you already processed at the start of the
failed execution – for example, by using lookups/joins to the target to see if you have already applied the change you
are processing.
If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save
the results, you can use the generated restart token to force a recovery at a more recent point in time than the last
session-end restart token. This is especially useful if you are running realtime extractions using ODBC, otherwise you
may find yourself re-processing several days of changes you’ve already processed.
Finally, you can always re-initialize the target and the CDC processing:
● Take an image copy of the tablespace containing the table to be captured, with QUIESCE option.
● Monitor the EDMMSG output from the PowerExchange Logger job.
● Look for message DTLEDM172774I which identifies the PowerExchange Logger sequence number
corresponding to the QUIESCE event.
● The Logger output shows detail in the following format:
● Note how the sequence number is a repeated string from the sequence number found in the Logger
messages after the Copy/Quiesce.
Note that the Restart parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same
message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the
QUIESCE was done above.
The image copy obtained above can be used for the initial materialization of the target tables.
The PowerExchange started tasks, with their start and stop commands, are summarized below.

Task: Agent
Start Command: /S DTLA
Stop Command: /DTLA shutdown
Notes: /DTLA DRAIN and /DTLA SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.
Description: The PowerExchange Agent, used to manage connections to the PowerExchange Logger and to handle repository and other tasks. This must be started before the Logger.

Task: Logger
Start Command: /S DTLL
Stop Command: /P DTLL or /F DTLL, STOP
Notes: If you are installing, you need to run setup2 prior to starting the Logger. /F DTLL, display displays Logger information.
Description: The PowerExchange Logger, used to manage the linear datasets and hiperspace that hold change capture data.

Task: ECCR (DB2)
Start Command: /S DTLDB2EC
Stop Command: /F DTLDB2EC, STOP or /F DTLDB2EC, QUIESCE or /P DTLDB2EC
Notes: The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. There must be registrations present prior to bringing up most adaptor ECCRs. /F DTLDB2EC, display will publish stats into the ECCR sysout.

Task: Condense
Start Command: /S DTLC
Stop Command: /F DTLC, SHUTDOWN
Description: The PowerExchange Condenser, used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Task: Apply
Start Command: Submit JCL or /S DTLAPP
Stop Command: (1) To identify all tasks running through a certain listener, issue F <Listener job>, D A. (2) Then, to stop the Apply, issue F DTLLST, STOPTASK name, where name = DBN2 (the apply name). If the CAPX access and apply is running locally, not through a listener, issue F <Listener job>, CLOSE.
Description: The PowerExchange Apply process, used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.
If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and
that the Logger will come down AFTER the ECCRs go away. Instead, bring the ECCR(s) down first; you can shut the
Listener and the ECCR(s) down at the same time.
The Listener:
1. F <Listener_job>,CLOSE
2. If this isn’t coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn’t coming down fast enough, issue C <Listener_job>
Note that these commands are listed in the order of most to least desirable method for bringing the listener
down.
The DB2 ECCR:
1. F <DB2 ECCR>,QUIESCE - this waits for all OPEN UOWs to finish, which can take a while if a long-running
batch job is active.
2. F <DB2 ECCR>,STOP - this terminates immediately
3. P <DB2 ECCR> - this also terminates immediately
Once the ECCR(s) are down, you can then bring the Logger down.
If you know that you are headed for an IPL, you can issue all of these commands at the same time. The Listener and
ECCR(s) should start coming down; if you are looking for speed, issue F <Listener_job>,CLOSE FORCE to shut down
the Listener and F <DB2 ECCR>,STOP to terminate the DB2 ECCR, then shut down the Logger and the Agent.
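The shutdown order just described can be condensed into a checklist. The sketch below simply echoes the command sequence for reference; the names in angle brackets are placeholders for your site's started-task names, and the commands themselves are issued on the MVS console, not from a shell.

```shell
# Echo the recommended shutdown order: Listener, ECCR, Logger, Agent.
SHUTDOWN_STEPS='F <Listener_job>,CLOSE FORCE
F <DB2_ECCR>,STOP
/P DTLL
/DTLA shutdown'
echo "$SHUTDOWN_STEPS"
```

Keeping the Agent last matters most: as the note below explains, bringing the Agent down while sources are still being updated can silently lose captured data.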
Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file/DB2
table/IMS database is updated during this shutdown process while the Agent is not available, the call to see if the
source is registered returns a “Not being captured” answer. The update therefore occurs without being captured,
leaving your target in a broken state (which you won't know about until it is too late!).
When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size requirements. The
information in this section can help to determine if you need to increase the size of the data sets, and if you should
allocate additional log data sets. When you define your active log data sets, consider your system’s capacity and your
changed data requirements, including archiving and performance issues.
After the PWX Logger is active, you can change the log data set configuration as necessary. In general,
remember that you must balance the following variables:
An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger
data sets need to be archived less often than smaller data sets.
Note: Although smaller data sets require more frequent archiving, the archiving process requires less time.
Use the following formulas to estimate the total space you need for each active log data set. For an example of the
calculated data set size, refer to the PowerExchange Reference Guide.
● active log data set size in bytes = (average size of captured change record * number of changes captured per
hour * desired number of hours between archives) * (1 + overhead rate)
● active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
● active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder
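The formulas can be turned into a quick worked example. The inputs below are illustrative assumptions, not values from the text: a 1,000-byte average captured change record (including the roughly 300-byte PWX header described below), 50,000 changes per hour, 24 hours between archives, a 10 percent overhead rate, and 3390 disk geometry of 56,664 usable bytes per track and 15 tracks per cylinder.

```shell
# Worked sizing example; all input values are illustrative assumptions.
avg_rec_bytes=1000          # average captured change record, incl. header
changes_per_hour=50000
hours_between_archives=24
base=$(( avg_rec_bytes * changes_per_hour * hours_between_archives ))
bytes=$(( base + base / 10 ))               # add 10% overhead
tracks=$(( (bytes + 56663) / 56664 ))       # usable bytes per 3390 track, rounded up
cylinders=$(( (tracks + 14) / 15 ))         # 15 tracks per 3390 cylinder, rounded up
echo "bytes=$bytes tracks=$tracks cylinders=$cylinders"
```

Rounding up at each step mirrors how DASD space is actually allocated; the final cylinder count is what would go into the CYL() parameter of the IDCAMS DEFINE.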
When determining the average size of your captured change records, note the following information:
● PWX Change Capture captures the full object that is changed. For example, if one field in an IMS
segment has changed, the product captures the entire segment.
● The PWX header adds overhead to the size of the change record. Per record, the overhead is
approximately 300 bytes plus the key length.
● The type of change transaction affects whether PWX Change Capture includes a before-image, after-
image, or both:
Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:
You have some control over the frequency of system checkpoints when you define your PWX Logger parameters.
See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.
The estimated size of each active log data set in bytes is calculated as follows:
The following example shows an IDCAMS DEFINE statement that uses the above calculations:
DEFINE CLUSTER -
(NAME (HLQ.EDML.PRILOG.DS01) -
LINEAR -
VOLUMES(volser) -
SHAREOPTIONS(2,3) -
CYL(410) ) -
DATA -
(NAME(HLQ.EDML.PRILOG.DS01.DATA) )
The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger
does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS
when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through
IDCAMS with:
● No secondary allocation.
● A single VOLSER in the VOLUME parameter.
● An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.
AG01 LOGCLOSE
The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the
agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange
Agent commands that you issue. If the PowerExchange Agent has not been started during the current IPL, or if
you issue the command with the wrong prefix, MVS generates the following message:
See PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.
The PowerExchange Logger uses two types of commands: interactive and batch.
You run interactive commands from the MVS console when the PowerExchange logger is running. You can use
PowerExchange Logger interactive commands to:
● Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.
● Resolve in-doubt UOWs.
● Stop a PowerExchange Logger.
● Print the contents of the PowerExchange active log file (in hexadecimal format).
You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations
when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:
● Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger
names, archive log options, buffer options, and mode (single or dual).
● Add log definitions to the restart data set.
● Delete data set records from the restart data set.
● Display log data sets, UOWs, and reader/writer connections.
See PowerExchange Reference Guide (8.1.1) for detailed information on Logger Commands (Chapter 4, Page 59)
Challenge
Description
Tasks such as getting server and session properties, session status, or starting or
stopping a workflow or a task can be performed either through the Workflow Monitor or
by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be
integrated with PowerCenter at any of several levels. The level of integration depends
on the complexity of the workflow/schedule and the skill sets of production support
personnel.
Many companies want to automate the scheduling process by using scripts or third-
party schedulers. In some cases, they are using a standard scheduler and want to
continue using it to drive the scheduling process.
A third-party scheduler can start or stop a workflow or task, obtain session statistics,
and get server details using the pmcmd commands. Pmcmd is a program used to
communicate with the PowerCenter server.
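A scheduler typically wraps a handful of pmcmd calls. The dry-run sketch below echoes the command lines rather than executing them; the host:port, credentials, folder, and workflow names mirror the AutoSys example later in this section and are placeholders, and the option spellings should be verified against the pmcmd Command Reference for your PowerCenter version.

```shell
# Build representative pmcmd command lines as strings and echo them.
PING='pmcmd pingserver -s ah-hp9:4001 -u Administrator -p informat01'
START='pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -wait wf_stg_tmp_product_xref_table'
STOP='pmcmd stopworkflow -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG wf_stg_tmp_product_xref_table'
echo "$PING"
echo "$START"
echo "$STOP"
```

Note the -wait flag on startworkflow: a scheduler that needs the workflow's return code must run pmcmd in wait mode, otherwise the command returns as soon as the request is accepted.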
In general, there are three levels of integration between a third-party scheduler and
PowerCenter: Low, Medium, and High.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter
workflow. This process subsequently kicks off the rest of the tasks or sessions. The
PowerCenter scheduler handles all processes and dependencies after the third-party
scheduler has kicked off the initial workflow. In this level of integration, nearly all control
lies with the PowerCenter scheduler.
This type of integration is very simple to implement because the third-party scheduler is responsible only for kicking
off the initial workflow.
Medium Level
With Medium-level integration, a third-party scheduler kicks off some, but not all,
workflows or tasks. Within the tasks, many sessions may be defined with
dependencies. PowerCenter controls the dependencies within the tasks.
With this level of integration, control is shared between PowerCenter and the third-party
scheduler, which requires more integration between the third-party scheduler and
PowerCenter. Medium-level integration requires Production Support personnel to have
a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not
have in-depth knowledge about the tool, they may be unable to fix problems that arise,
so the production support burden is shared between the Project Development team
and the Production Support team.
High Level
With High-level integration, the third-party scheduler has full control of scheduling and
kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible
for controlling all dependencies among the sessions. This type of integration is the
most complex to implement because there are many more interactions between the
third-party scheduler and PowerCenter.
Production Support personnel may have limited knowledge of PowerCenter but must
have thorough knowledge of the scheduling tool. Because Production Support
personnel in many companies are knowledgeable only about the company’s standard
scheduler, one of the main advantages of this level of integration is that if the batch
fails at some point, the Production Support personnel are usually able to determine the
exact breakpoint. Thus, the production support burden lies with the Production Support
team.
There are many independent scheduling tools on the market. The following is an
example of an AutoSys script that can be used to start tasks; it is included here simply
as an illustration of how a scheduler can be implemented in the PowerCenter
environment. This script can also capture the return codes and abort on error,
returning success or failure (with the associated return codes) to the command line or
the AutoSys GUI monitor.
# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------
. jobstart $0 $*
# set variables
ERR_DIR=/tmp
if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
    cd /dbvol03/vendor/informatica/pmserver/
    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
    # The above lines need to be edited to include the name of the workflow or the task to be started.
fi
jobend normal
exit 0
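The return-code capture mentioned above can be sketched as follows. The run_step stub stands in for the real pmcmd invocation so the pattern can be exercised anywhere; jobstart/jobend are site-specific AutoSys helpers and are represented here only by echoes.

```shell
# Return-code handling pattern around a scheduler step.
ERR_DIR=/tmp
run_step() { return 0; }   # stand-in; replace with: pmcmd startworkflow ... -wait ...
run_step
rc=$?
if [ $rc -ne 0 ]
then
    # Record the failure and abort so the scheduler sees a bad exit code.
    echo "Step failed with return code $rc" >> $ERR_DIR/jobname.err
    echo "jobend abnormal"
    exit $rc
fi
echo "jobend normal"
```

Because pmcmd in wait mode propagates the workflow's outcome as its exit code, this is all the scheduler needs to decide whether to continue, retry, or alert.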
Challenge
Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.
Description
When hunting for performance bottlenecks, examine the following areas, in this order:
1. Target
2. Source
3. Mapping
4. Session
5. System
If a transformation thread is 100 percent busy and there are additional resources (e.g.,
CPU cycles and memory) available on the Integration Service server, add a partition
point in the segment.
If the reader or writer thread is 100 percent busy, consider using string data types in the
source or target ports, since non-string ports require more processing.
Attempt to isolate performance problems by running test sessions. You should be able
to compare the session's original performance with the tuned session's performance.
The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:
Target Bottlenecks
Relational Targets
The most common performance bottleneck occurs when the Integration Service writes
to a target database. This type of bottleneck can easily be identified with the following
procedure: make a copy of the session and configure it to write to a flat file target instead.
If session performance increases significantly when writing to a flat file, you have a
write bottleneck. Consider performing the following tasks to improve performance:
If the session targets a flat file, you probably do not have a write bottleneck. If the
session is writing to a SAN or a non-local file system, performance may be slower than
writing to a local file system. If possible, a session can be optimized by writing to a flat
file target local to the Integration Service. If the local flat file is very large, you can
optimize the write process by dividing it among several physical drives.
If the SAN or non-local file system is significantly slower than the local file system, work
with the appropriate network/storage group to determine if there are configuration
issues within the SAN.
Source Bottlenecks
Relational sources
If the session reads from a relational source, you can use a filter transformation, a read
test mapping, or a database query to identify source bottlenecks.
You can create a read test mapping to identify source bottlenecks. A read test mapping
isolates the read query by removing any transformation logic from the mapping. Use
the following steps to create a read test mapping:
Use the read test mapping in a test session. If the test session performance is similar to
the original session, you have a source bottleneck.
You can also identify source bottlenecks by executing a read query directly against the
source database. To do so, perform the following steps:
If there is a long delay between the two time measurements, you have a source
bottleneck.
If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the line sequential buffer length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may improve
performance when reading flat file sources. Also, ensure the flat file source is local to
the Integration Service.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a
mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping.
Begin by adding a Filter transformation in the mapping immediately before each target
definition. Set the filter condition to false so that no data is loaded into the target tables.
If the time it takes to run the new session is the same as the original session, you have
a mapping bottleneck. You can also use the performance details to identify mapping
bottlenecks: high Rowsinlookupcache and Errorrows counters indicate mapping
bottlenecks.
Multiple lookups can slow the session. You may improve session performance by
locating the largest lookup tables and tuning those lookup expressions.
For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create
performance details by selecting “Collect Performance Data” in the session properties
before running the session.
View the performance details through the Workflow Monitor as the session runs, or
view the resulting file. The performance details provide counters about each source
qualifier, target definition, and individual transformation within the mapping to help you
understand session and mapping efficiency.
To view the resulting performance data file, look for the file session_name.perf in the
same directory as the session log and open it in any text editor.
All transformations have basic counters that indicate the number of input rows, output
rows, and error rows. Source qualifiers, normalizers, and targets have additional
counters indicating the efficiency of data moving into and out of buffers. Some
transformations have counters specific to their functionality. When reading performance
details, the first column displays the transformation name as it appears in the mapping,
the second column contains the counter name, and the third column holds the resulting
number or efficiency percentage.
Note: PowerCenter versions 6.x and above include the ability to assign memory
allocation per object. In versions earlier than 6.x, aggregators, ranks, and joiners were
assigned at a global/session level.
For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning
Sessions for Better Performance and Tuning SQL Overrides and Environment for
Better Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the
system hosting the Integration Service.
You can use system performance monitoring tools to monitor the amount of system
resources the Integration Service uses and to identify system bottlenecks.
Challenge
Description
Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar
with these tools, so we’ve included only a short description of some of the major ones
here.
V$ Views
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks
and developing a strategy to avoid them.
Explain Plan allows the DBA or developer to determine the execution path of a block of
SQL code. The SQL in a source qualifier or in a lookup that is running for a long time
should be generated, copied into SQL*Plus or another SQL tool, and tested to avoid
inefficient execution of these statements. Review the PowerCenter session log for a long
initialization time (an indicator that the source qualifier may need tuning) and the time it
takes to build a lookup cache to determine whether the SQL for these transformations
should be tuned.
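As an illustration, a classic PLAN_TABLE-based invocation might look like the
following sketch; the statement ID and table name are hypothetical, and the
PLAN_TABLE is assumed to exist (it is created by Oracle's utlxplan.sql script):

```sql
-- Hypothetical example: capture the execution path of a slow
-- source-qualifier query under a chosen statement ID.
EXPLAIN PLAN SET STATEMENT_ID = 'sq_orders'
FOR
SELECT o.order_id, o.amount
FROM orders o
WHERE o.order_date >= TO_DATE('2004-01-01', 'YYYY-MM-DD');

-- Display the recorded plan, indented by nesting level.
SELECT LPAD(' ', 2 * level) || operation || ' ' ||
       options || ' ' || object_name AS plan_step
FROM plan_table
START WITH id = 0 AND statement_id = 'sq_orders'
CONNECT BY PRIOR id = parent_id AND statement_id = 'sq_orders';
```

Full-table scans or Cartesian joins appearing in the plan output are the usual
signals that the generated SQL needs tuning.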
Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most
systems. Database files should be separated and identified. Rollback files should be
separated onto their own disks because they have significant disk I/O. Co-locate tables
that are heavily used with tables that are rarely used to help minimize disk contention.
Separate indexes so that when queries run indexes and tables, they are not fighting for
the same resource. Also be sure to implement disk striping; this, or RAID technology
can help immensely in reducing disk contention. While this type of planning is time
consuming, the payoff is well worth the effort in terms of performance gains.
Dynamic Sampling
Dynamic sampling of optimizer statistics is worthwhile when:
● The sample time is small compared to the overall query execution time.
● Dynamic sampling results in a better performing query.
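Dynamic sampling can be requested per statement with a hint; in the sketch below
the table name, alias, and sampling level are illustrative:

```sql
-- Hypothetical example: ask the optimizer to sample SALES_STAGE
-- at level 4 before choosing a plan (levels range from 0 to 10).
SELECT /*+ DYNAMIC_SAMPLING(s 4) */ COUNT(*)
FROM sales_stage s
WHERE s.load_flag = 'N';
```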
TIP
The automatic SQL tuning features are accessible from Enterprise Manager on
the "Advisor Central" page
Useful Views
● DBA_ADVISOR_TASKS
● DBA_ADVISOR_FINDINGS
● DBA_ADVISOR_RECOMMENDATIONS
● DBA_ADVISOR_RATIONALE
● DBA_SQLTUNE_STATISTICS
● DBA_SQLTUNE_BINDS
● DBA_SQLTUNE_PLANS
● DBA_SQLSET
● DBA_SQLSET_BINDS
● DBA_SQLSET_STATEMENTS
● DBA_SQLSET_REFERENCES
● DBA_SQL_PROFILES
● V$SQL
● V$SQLAREA
● V$ACTIVE_SESSION_HISTORY
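As a sketch of how these views are typically used, the following query (the column
choices are illustrative) lists the statements responsible for the most physical I/O:

```sql
-- Find the top SQL statements by disk reads; high reads with
-- few executions usually indicate a missing or unused index.
SELECT sql_text, disk_reads, executions
FROM v$sqlarea
ORDER BY disk_reads DESC;
```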
The settings presented here were used on a four-CPU AIX server running Oracle
7.3.4, configured to use the parallel query option for parallel processing of
queries and indexes. We have also included Oracle's descriptions and
documentation for each setting so that DBAs of other (i.e., non-Oracle)
systems can determine what each command does in the Oracle environment and
set the equivalent commands and settings in their native database.
HASH_AREA_SIZE = 16777216
● HASH_MULTIBLOCK_IO_COUNT
● OPTIMIZER_INDEX_COST_ADJ
OPTIMIZER_PERCENT_PARALLEL = 33
This parameter defines the amount of parallelism that the optimizer uses in its cost
functions. The default of 0 means that the optimizer chooses the best serial plan. A
value of 100 means that the optimizer uses each object's degree of parallelism in
computing the cost of a full-table scan operation.
The value of this parameter can be changed without shutting down the Oracle instance
by using the ALTER SESSION command. Low values favor indexes, while high values
favor table scans.
Cost-based optimization is always used for queries that reference an object with a
nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or
goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero
setting of OPTIMIZER_PERCENT_PARALLEL.
PARALLEL_MAX_SERVERS = 40
PARALLEL_MIN_SERVERS = 8
If these parameters are set to a non-zero value, they represent the minimum size for
the pool. These minimum values may be necessary if you experience application errors
when certain pool sizes drop below a specific threshold.
The following parameters must be set manually, and they take their memory from the quota:
● DB_KEEP_CACHE_SIZE
● DB_RECYCLE_CACHE_SIZE
● DB_nK_CACHE_SIZE (non-default block size)
● STREAMS_POOL_SIZE
● LOG_BUFFER
On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same
box), using an IPC connection can significantly reduce the time it takes to build a
lookup cache. In one case, a fact mapping that was using a lookup to get five columns
(including a foreign key) and about 500,000 rows from a table was taking 19 minutes.
Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the
total time decreased from 24 minutes to 8 minutes for a 500,000-row write (array
inserts) of approximately 120-130 bytes per row, with a primary key and unique index in
place. Throughput went from about 2 MB/min (280 rows/sec) to about 10 MB/min
(1,360 rows/sec).
A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:
DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )
Make a new entry in tnsnames.ora like the following, and use it for connections to the
local Oracle instance:
DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = IPC)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )
Experts often recommend dropping and reloading indexes during very large loads to a
data warehouse, but there is no easy way to do this. For example, writing a SQL
statement to drop each index and then another SQL statement to rebuild it can be
a very tedious process.
Run the following to generate output to disable the foreign keys in the data warehouse:
FROM USER_CONSTRAINTS
Dropping or disabling primary keys also speeds loads. Run the results of this SQL
statement after disabling the foreign key constraints:
FROM USER_CONSTRAINTS
FROM USER_CONSTRAINTS
Save the results in a single file and name it something like ‘DISABLE.SQL’
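A sketch of the kind of generator query described above; constraint type 'R'
selects foreign keys, and the SELECT list may need adjusting for your environment:

```sql
-- Generate ALTER TABLE statements that disable every foreign key
-- owned by the current user; spool the output to DISABLE.SQL.
SELECT 'ALTER TABLE ' || table_name ||
       ' DISABLE CONSTRAINT ' || constraint_name || ';'
FROM user_constraints
WHERE constraint_type = 'R';
```

Analogous queries over USER_CONSTRAINTS with constraint types 'P' (primary key)
and 'U' (unique) generate the remaining disable statements.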
To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with
‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run
it as a post-session command.
Re-enable constraints in the reverse order that you disabled them. Re-enable the
unique constraints first, and re-enable primary keys before foreign keys.
TIP
Dropping or disabling foreign keys often boosts loading, but also slows queries
(such as lookups) and updates. If you do not use lookups or updates on your
target tables, you should get a boost by using this SQL statement to generate
scripts. If you use lookups and updates (especially on large tables), you can
exclude the index that will be used for the lookup from your script. You may
want to experiment to determine which method is faster.
With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree
index. A b-tree index can greatly improve query performance on data that has high
cardinality or contains mostly unique values, but is not much help for low cardinality/
highly-duplicated data and may even increase query time. A typical example of a low
cardinality field is gender – it is either male or female (or possibly unknown). This kind
of data is an excellent candidate for a bitmap index, and can significantly improve query
performance.
Bitmap indexes are suited to data warehousing because of their performance, their small
size, and the speed with which they can be created and dropped. Since most dimension tables in a warehouse
have nearly every column indexed, the space savings is dramatic. But it is important to
note that when a bitmap-indexed column is updated, every row associated with that
bitmap entry is locked, making bit-map indexing a poor choice for OLTP database
tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after
each DML statement (e.g., inserts and updates), which can make loads very slow. For
this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-
create or re-enable them after the load.
The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query is processed by joining all the
Dimension tables in a Cartesian product based on the WHERE clause, then joining back
to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be
created that accesses the Fact table first followed by the Dimension table joins,
avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’
access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is
equal to TRUE in the init.ora file and if there are single column bitmapped indexes
on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree
indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’.
All other syntax is identical.
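For example, a bitmap index on the low-cardinality gender column described earlier
(the table and index names here are hypothetical) would be created as:

```sql
-- Identical to b-tree syntax except for the keyword 'bitmap'.
CREATE BITMAP INDEX customer_gender_bmx
ON customer (gender);
```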
Bitmap Indexes
B-tree Indexes
To enable bitmap indexes, you must set the following items in the instance initialization
file:
Also note that the parallel query option must be installed in order to create bitmap
indexes. If you try to create bitmap indexes without the parallel query option, a syntax
error appears in the SQL statement; the keyword ‘bitmap’ won't be recognized.
TIP
To check if the parallel query option is installed, start and log into SQL*Plus. If
the parallel query option is installed, the word ‘parallel’ appears in the banner
text.
Index Statistics
Index statistics are used by Oracle to determine the best method to access tables and
should be updated periodically as part of normal DBA procedures. Updating the table
and index statistics for the data warehouse should improve query results on Fact and
Dimension tables (including appending and updating records).
Table method
The following SQL statement can be used to analyze the tables in the database:
FROM USER_TABLES
The following SQL statement can be used to analyze the indexes in the database:
FROM USER_INDEXES
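A sketch of the generator pattern for the table method; spool the output and run
it as a script:

```sql
-- Generate ANALYZE statements for every table owned by the user;
-- an analogous query over USER_INDEXES covers the indexes.
SELECT 'ANALYZE TABLE ' || table_name || ' COMPUTE STATISTICS;'
FROM user_tables;
```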
Schema method
Another way to update index statistics is to compute indexes by schema rather than by
table. If data warehouse indexes are the only indexes located in a single schema, you
can use the following command to update the statistics:
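One way to compute statistics for an entire schema is the DBMS_UTILITY package;
the schema name below is hypothetical:

```sql
-- Compute statistics for every object in the DW schema.
-- Replace 'COMPUTE' with 'ESTIMATE' to reduce resource usage.
EXECUTE DBMS_UTILITY.ANALYZE_SCHEMA('DW', 'COMPUTE');
```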
TIP
These SQL statements can be very resource intensive, especially for very large
tables. For this reason, Informatica recommends running them at off-peak
times when no other process is using the database. If you find the exact
computation of the statistics consumes too much time, it is often acceptable to
estimate the statistics rather than compute them. Use ‘estimate’ instead of
‘compute’ in the above examples.
Parallelism
Hints are used to define parallelism at the SQL statement level. The following examples
demonstrate how to utilize four processors:
TIP
When using a table alias in the SQL Statement, be sure to use this alias in the
hint. Otherwise, the hint will not be used, and you will not receive an error
message.
FROM EMP A
Parallelism can also be defined at the table and index level. The following example
demonstrates how to set a table’s degree of parallelism to four for all eligible SQL
statements on this table:
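Sketches of both forms, using the EMP alias from the tip above; a degree of four
matches the four-processor example:

```sql
-- Statement-level: the hint must reference the alias A, not EMP,
-- or it is silently ignored.
SELECT /*+ PARALLEL(A, 4) */ *
FROM emp A;

-- Table-level: applies to all eligible statements against EMP.
ALTER TABLE emp PARALLEL (DEGREE 4);
```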
Ensure that Oracle is not contending with other processes for these resources or you
may end up with degraded performance due to resource contention.
Additional Tips
You can execute queries as both pre- and post-session commands. For a UNIX
environment, the format of the command is:
For example, to execute the ENABLE.SQL file created earlier (assuming the data
warehouse is on a database named ‘infadb’), you would execute the following as a post-
session command:
In some environments, this may be a security issue since both username and
password are hard-coded and unencrypted. To avoid this, use the operating system’s
authentication to log onto the database instance.
In the following example, the Informatica id “pmuser” is used to log onto the Oracle
database. Create the Oracle user “pmuser” with the following SQL statement:
In the following pre-session command, “pmuser” (the id Informatica is logged onto the
operating system as) is automatically passed from the operating system to the
database and used to execute the script:
You may want to use the init.ora parameter “os_authent_prefix” to distinguish between
“normal” oracle-users and “external-identified” ones.
DRIVING_SITE ‘Hint’
If the source and target are on separate instances, the Source Qualifier transformation
should be executed on the target instance.
For example, you want to join two source tables (A and B) together, which may reduce
the number of selected rows. However, Oracle fetches all of the data from both tables,
moves the data across the network to the target instance, then processes everything
on the target instance. If either data source is large, this causes a great deal of network
traffic. To force the Oracle optimizer to process the join on the source instance, use the
‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the
SQL statement as:
Challenge
Description
Proper tuning of the source and target database is a very important consideration in the
scalability and usability of a business data integration environment. Managing
performance on a SQL Server involves the considerations discussed below.
Taking advantage of grid computing is another option for improving the overall SQL
Server performance. To set up a SQL Server cluster environment, you need to set up a
cluster where the databases are split among the nodes. This provides the ability to
distribute the load across multiple nodes. To achieve high performance, Informatica
recommends using a fibre-attached SAN device for shared storage.
● Max async I/O is used to specify the number of simultaneous disk I/O
operations that SQL Server can submit to the operating system. Note that this
setting is automated in SQL Server 2000.
● SQL Server allows several selectable models for database recovery; these
include:
❍ Full Recovery
❍ Bulk-Logged Recovery
❍ Simple Recovery
Creating and maintaining good indexes is key to maintaining minimal I/O for all
database queries.
To reduce overall I/O contention and improve parallel operations, consider partitioning
table data and indexes. Multiple techniques for achieving and managing partitions
using SQL Server 2000 are addressed in this document.
The simplest technique for creating disk I/O parallelism is to use hardware partitioning
and create a single "pool of drives" that serves all SQL Server database files except
transaction log files, which should always be stored on physically-separate disk drives
dedicated to log files. (See Microsoft documentation for installation procedures.)
The following areas of SQL Server activity can be separated across different hard
drives, RAID controllers, and PCI channels (or combinations of the three):
● Transaction logs
● Tempdb
● Database
● Tables
● Nonclustered Indexes
Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the
server as a shared working area for various activities, including temporary tables,
sorting, processing subqueries, building aggregates to support GROUP BY or ORDER
BY clauses, queries using DISTINCT (temporary worktables have to be created to
remove duplicate rows), cursors, and hash joins.
To move the tempdb database, use the ALTER DATABASE command to change the
physical file location of the SQL Server logical file name associated with tempdb. For
example, to move tempdb and its associated log to the new file locations E:\mssql7 and
C:\temp, use the following commands:
ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME =
'e:\mssql7\tempnew_location.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME =
'c:\temp\tempnew_loglocation.ldf')
The master, msdb, and model databases are not used much during
production (as compared to user databases), so it is generally not necessary to
consider them in I/O performance tuning. The master database is
usually used only for adding new logins, databases, devices, and other system objects.
Database Partitioning
● Primary filegroup. Contains the primary data file and any other files not
placed into another filegroup. All pages for the system tables are allocated
from the primary filegroup.
● User-defined filegroup. Any filegroup specified using the FILEGROUP
keyword in a CREATE DATABASE or ALTER DATABASE statement, or on
the Properties dialog box within SQL Server Enterprise Manager.
● Default filegroup. Contains the pages for all tables and indexes that do not
have a filegroup specified when they are created. In each database, only one
filegroup at a time can be the default filegroup. If no default filegroup is
specified, the default is the primary filegroup.
Files and filegroups are useful for controlling the placement of data and indexes
and eliminating device contention. Quite a few installations also leverage files and
filegroups as a mechanism that is more granular than a database in order to exercise
more control over their database backup/recovery strategy.
When you partition data across multiple tables or multiple servers, queries accessing
only a fraction of the data can run faster because there is less data to scan. If the
tables are located on different servers, or on a computer with multiple processors, each
table involved in the query can also be scanned in parallel, thereby improving query
performance. Additionally, maintenance tasks, such as rebuilding indexes or backing
up a table, can execute more quickly.
By using a partitioned view, the data still appears as a single table and can be queried
as such without having to reference the correct underlying table manually.
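A minimal sketch of a SQL Server partitioned view over hypothetical yearly sales
tables; CHECK constraints on the partitioning column of each member table let the
optimizer skip irrelevant partitions:

```sql
-- Each member table holds one year of rows, enforced by a CHECK
-- constraint on its sales_year column, so queries that filter on
-- sales_year touch only the relevant member tables.
CREATE VIEW all_sales
AS
SELECT * FROM sales_2003
UNION ALL
SELECT * FROM sales_2004;
```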
Cost Threshold for Parallelism
Use this option to specify the threshold where SQL Server creates and executes
parallel plans. SQL Server creates and executes a parallel plan for a query only when
the estimated cost to execute a serial plan for the same query is higher than the value
set in cost threshold for parallelism. The cost refers to an estimated elapsed time in
seconds required to execute the serial plan on a specific hardware configuration. Only
set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max Degree of Parallelism
Use this option to limit the number of processors (from a maximum of 32) to use in
parallel plan execution. The default value is zero, which uses the actual number of
available CPUs. Set this option to one to suppress parallel plan generation. Set the
value to a number greater than one to restrict the maximum number of processors
used by a single query execution.
Priority Boost
Use this option to specify whether SQL Server should run at a higher scheduling
priority than other processes on the same computer. If you set this option to one, SQL
Server runs at a priority base of 13. The default is zero, which is a priority base of
seven.
When configuring a SQL Server that contains only a few gigabytes of data and does
not sustain heavy read or write activity, you need not be particularly concerned with the
subject of disk I/O and balancing of SQL Server I/O activity across hard drives for
optimal performance. To build larger SQL Server databases however, which can
contain hundreds of gigabytes or even terabytes of data and/or that sustain heavy read/
write activity (as in a DSS application), it is necessary to drive configuration around
maximizing SQL Server disk I/O performance by load-balancing across multiple hard
drives.
For SQL Server databases that are stored on multiple disk drives, performance can be
improved by partitioning the data to increase the amount of disk I/O parallelism.
Partitioning can be performed using a variety of techniques. Methods for creating and
managing partitions include configuring the storage subsystem (i.e., disk, RAID
partitioning) and applying various data configuration mechanisms in SQL Server such
as files, file groups, tables and views. Some possible candidates for partitioning include:
● Transaction log
● Tempdb
● Database
● Tables
● Non-clustered indexes
Two mechanisms exist inside SQL Server to address the need for bulk movement of
data: the bcp utility and the BULK INSERT statement.
TIP
Both of these mechanisms enable you to exercise control over the batch size.
Unless you are working with small volumes of data, it is good to get in the habit
of specifying a batch size for recoverability reasons. If none is specified, SQL
Server commits all rows to be loaded as a single batch. For example, you
attempt to load 1,000,000 rows of new data into a table. The server suddenly
loses power just as it finishes processing row number 999,999. When the
server recovers, those 999,999 rows will need to be rolled back out of the
database before you attempt to reload the data. By specifying a batch size of
10,000, you could have saved significant recovery time, because SQL Server
would only have had to roll back 9,999 rows instead of 999,999.
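A sketch of a batched bulk load reflecting the tip above; the table name, file
path, and field terminator are hypothetical:

```sql
-- Commit every 10,000 rows so a mid-load failure rolls back at
-- most one batch; TABLOCK takes a single table-level lock for
-- the duration of the load.
BULK INSERT dbo.sales_stage
FROM 'D:\loads\sales.dat'
WITH (BATCHSIZE = 10000, TABLOCK, FIELDTERMINATOR = '|');
```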
● Remove indexes.
● Use Bulk INSERT or bcp.
● Parallel load using partitioned data files into partitioned tables.
● Run one load stream for each available CPU.
● Set Bulk-Logged or Simple Recovery model.
● Use the TABLOCK option.
● Create indexes.
● Switch to the appropriate recovery model.
● Perform backups.
Challenge
Description
Tuning MultiLoad
There are many aspects to tuning a Teradata database. Several aspects of tuning can
be controlled by setting MultiLoad parameters to maximize write throughput. Other
areas to analyze when performing a MultiLoad job include estimating space
requirements and monitoring MultiLoad performance.
MultiLoad parameters
Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the space
that may be required by target tables, each MultiLoad job needs permanent space for:
● Work tables
● Error tables
● Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for the
MultiLoad tables, data is preserved for restart operations after a system failure. Work
tables, in particular, require a lot of extra permanent space. Also remember to account
for the size of error tables since error tables are generated for each target table.
Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a
separate NUSI change row to be applied to each NUSI sub-table after all of the
rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables
are performed at normal SQL speed, which is much slower than normal
MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems.
8. Poor performance can happen when the input data is skewed with respect to
the Primary Index of the database. Teradata depends upon random and well
distributed data for data input and retrieval. For example, a file containing a
million rows with a single value 'AAAAAA' for the Primary Index will take an
infinite time to load.
9. One common tool for determining load issues, skewed data, and locks is
Performance Monitor (PMON). PMON requires MONITOR access on the
Teradata system; if you do not have MONITOR access, the DBA can help.
After spool usage has reached its peak, spool falls rapidly as data is
inserted from spool into the table. If the spool grows slowly, then the input data
is probably skewed.
FastExport
FastExport is a bulk export Teradata utility. One way to pull data for lookups and
sources is to use ODBC, since there is no native connectivity to Teradata. However,
ODBC is slow. For higher performance, use FastExport if the number of rows to be
pulled is on the order of a million rows. FastExport writes to a file, which the lookup
or source qualifier then reads. FastExport is integrated with PowerCenter.
BTEQ
BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you
to export data to a flat file, but it is suitable for smaller volumes of data. This provides
faster performance than ODBC but doesn't tax Teradata system resources the way
FastExport can. A possible use for BTEQ with PowerCenter is to export smaller
volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by
PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-
session script.
TPump
TPump is a load utility primarily intended for streaming data (think of loading bundles
of messages arriving from MQ using PowerCenter Real Time). TPump can also load
from a file or a named pipe.
While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility.
Another important difference between MultiLoad and TPump is that TPump locks at the
row-hash level rather than locking the entire table, which allows concurrent access to
the target table during the load.
The ELT design paradigm can be achieved through the Pushdown Optimization option
offered with PowerCenter.
ETL or ELT
Because many database vendors and consultants advocate using ELT (Extract, Load
and Transform) over ETL (Extract, Transform and Load), the use of Pushdown
Optimization can be somewhat controversial. Informatica advocates using Pushdown
Optimization as an option to solve specific performance situations rather than as the
default design of a mapping.
1. When the load needs to look up only dimension tables then there may be no
need to use Pushdown Optimization. In this context, PowerCenter's ability to
build dynamic, persistent caches is significant. If a daily load involves tens or
hundreds of fact files to be loaded throughout the day, then dimension surrogate
keys can be easily obtained from PowerCenter's cache in memory. Compare
this with the cost of running the same dimension lookup queries on the
database.
2. In many cases large Teradata systems contain only a small amount of data. In
such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied on the data then
there may be no need to push down. The special case is that of applying filters
or expression logic to non-unique columns in incoming data in PowerCenter.
Compare this to loading the same data into the database and then applying a
WHERE clause on a non-unique column, which is highly inefficient for a large
table.
The principle here is: Filter and resolve the data AS it gets loaded instead of
loading it into a database, querying the RDBMS to filter/resolve and re-loading it
into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data
needs to be merged or queried to arrive at your final load set.
You can push transformation logic to either the source or target database using
pushdown optimization. The amount of work you can push to the database depends on
the pushdown optimization configuration, the transformation logic, and the mapping
and session configuration.
When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and processes any transformation logic
that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
You may encounter problems when using ODBC drivers with a Teradata
database.
You can configure the Integration Service to perform an SQL override with Pushdown
Optimization. To perform an SQL override, you configure the session to create a view.
When you use a SQL override for a Source Qualifier transformation in a session
configured for source or full Pushdown Optimization with a view, the Integration Service
creates a view in the source database based on the override. After it creates the view
in the database, the Integration Service generates a SQL query that it can push to the
database. The Integration Service runs the SQL query against the view to perform
Pushdown Optimization.
Note: To use an SQL override with pushdown optimization, you must configure the
session for pushdown optimization with a view.
Running a Query
If the Integration Service did not successfully drop the view, you can run a query
against the source database to search for the views generated by the Integration
Service. When the Integration Service creates a view, it uses the prefix PM_V. You
can search for view names that begin with this prefix.
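A sketch of such a search against a hypothetical target database DW_DB, using the
Teradata data dictionary:

```sql
-- List views whose names carry the Integration Service prefix.
SELECT TableName
FROM DBC.Tables
WHERE DatabaseName = 'DW_DB'
  AND TableKind = 'V'
  AND TableName LIKE 'PM_V%';
```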
Use the following rules and guidelines when you configure pushdown optimization for a
session containing an SQL override:
Challenge
As data integration becomes a broader and more service-oriented Information Technology initiative, real-time and
right-time solutions become critical to the success of the overall architecture. Tuning real-time processes is often
different from tuning batch processes.
Description
To remain agile and flexible in increasingly competitive environments, today’s companies are dealing with
sophisticated operational scenarios such as consolidation of customer data in real time to support a call center or
the delivery of precise forecasts for supply chain operation optimization. To support such highly demanding
operational environments, data integration platforms must do more than serve analytical data needs. They must
also support real-time, 24x7, mission-critical operations that involve live or current information available across the
enterprise and beyond. They must access, cleanse, integrate and deliver data in real time to ensure up-to-the-
second information availability. Data integration platforms must also intelligently scale to meet both increasing data
volumes and increasing numbers of concurrent requests, as is typical of shared-services Integration
Competency Center (ICC) environments. They must also be extremely reliable, providing
high availability to minimize outages and ensure seamless failover and recovery as every minute of downtime can
lead to huge impacts on business operations.
PowerCenter can be used to process data in real time. Real-time processing is on-demand processing of data from
real-time sources. A real-time session reads, processes and writes data to targets continuously. By default, a
session reads and writes bulk data at scheduled intervals unless it is configured for real-time processing.
To process data in real time, the data must originate from a real-time source. Real-time sources include JMS,
WebSphere MQ, TIBCO, webMethods, MSMQ, SAP, and web services. Real-time processing can also be used for
processes that require immediate access to dynamic data (e.g., financial data).
Use the Real-time Flush Latency session condition to control the target commit latency when running in real-time
mode. PWXPC commits source data to the target at the end of the specified maximum latency period. This
parameter requires a valid value and has a valid default value.
When the session runs, PWXPC begins to read data from the source. After data is provided to the source qualifier,
the Real-Time Flush Latency interval begins. At the end of each Real-Time Flush Latency interval, once an end-UOW
boundary is reached, PWXPC issues a commit to the target. The following message appears in the session log to
indicate that this has occurred:
[PWXPC_10082] [INFO] [CDCDispatcher] raising real-time flush with restart tokens [restart1_token],
[restart2_token] because Real-time Flush Latency [RTF_millisecs] occurred
The commit to the target when reading CDC data is not strictly controlled by the Real-Time Flush Latency
specification. The UOW Count and the Commit Threshold values also determine the commit frequency.
The value specified for Real-Time Flush Latency also controls the PowerExchange Consumer API (CAPI) interface
timeout value (PowerExchange latency) on the source platform. The CAPI interface timeout value is displayed in the
The CAPI interface timeout also affects latency as it will affect how quickly changes are returned to the PWXPC
reader by PowerExchange. PowerExchange will ensure that it returns control back to PWXPC at least once every
CAPI interface timeout period. This allows the PWXPC to regain control and, if necessary, perform the real-time
flush of data returned. A high RTF Latency specification will also impact the speed with which stop requests from
PowerCenter are handled as the PWXPC CDC Reader must wait for PowerExchange to return control before it can
handle the stop request.
TIP
Use the PowerExchange STOPTASK command to shut down more quickly when using a high RTF Latency value.
For example, if the value for Real-Time Flush Latency is 10 seconds, PWXPC will issue a commit for all data read
after 10 seconds have elapsed and the next end-UOW boundary is received. The lower the value, the faster the
data is committed to the target. If the lowest possible latency is required for the application of changes to the
target, specify a low Real-Time Flush Latency value.
Warning: When you specify a low Real-Time Flush Latency interval, the session might consume more system
resources on the source and target platforms. This is because:
● The session will commit to the target more frequently therefore consuming more target resources.
● PowerExchange will return more frequently to the PWXPC reader, thereby passing fewer rows on each
iteration and consuming more resources on the source PowerExchange platform.
Balance performance and resource consumption with latency requirements when choosing the UOW Count and
Real-Time Flush Latency values.
Commit Threshold is only applicable to Real-Time CDC sessions. Use the Commit Threshold session condition to
cause commits before reaching the end of the UOW when processing large UOWs. This parameter requires a valid
value and has a valid default value.
Commit Threshold can be used to cause a commit before the end of a UOW is received, a process also referred to
as sub-packet commit. The value specified in the Commit Threshold is the number of records within a source UOW
to process before inserting a commit into the change stream. This attribute differs from the UOW Count attribute
in that it counts records within a UOW rather than complete UOWs. The Commit Threshold counter is reset when
either the specified number of records or the end of the UOW is reached.
This attribute is useful when there are extremely large UOWs in the change stream that might cause locking issues
on the target database or resource issues on the PowerCenter Integration Server.
The Commit Threshold count is cumulative across all sources in the group. This means that sub-packet commits are
inserted into the change stream when the count specified is reached regardless of the number of sources to which
the changes actually apply. For example, a UOW contains 900 changes for one source followed by 100 changes for
a second source and then 500 changes for the first source. If the Commit Threshold is set to 1000, the commit
record is inserted after the 1000th change record which is after the 100 changes for the second source.
Warning: A UOW may contain changes for multiple source tables. Using Commit Threshold can cause commits to
be generated at points in the change stream where the relationship between these tables is inconsistent, which can
leave related target tables temporarily inconsistent.
If 0 or no value is specified, commits occur on UOW boundaries only. Otherwise, the value specified is used to
insert commit records into the change stream between UOW boundaries, where applicable.
The value of this attribute overrides the value specified in the PowerExchange DBMOVER configuration file
parameter SUBCOMMIT_THRESHOLD. For more information on this PowerExchange parameter, refer to the
PowerExchange Reference Manual.
The commit to the target when reading CDC data is not strictly controlled by the Commit Threshold specification.
The commit records inserted into the change stream as a result of the Commit Threshold value affect the UOW
Count counter. The UOW Count and the Real-Time Flush Latency values determine the target commit frequency.
For example, a UOW contains 1,000 change records (any combination of inserts, updates, and deletes). If 100 is
specified for the Commit Threshold and 5 for the UOW Count, then a commit record will be inserted after each 100
records and a target commit will be issued after every 500 records.
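The interaction above reduces to simple arithmetic; the following sketch illustrates it (the attribute values are hypothetical examples, not recommendations):

```shell
#!/bin/sh
# Illustration of how Commit Threshold and UOW Count combine.
# Values are hypothetical examples, not tuning recommendations.
COMMIT_THRESHOLD=100   # records between commit records inserted into the stream
UOW_COUNT=5            # commit records (UOWs) per target commit

# A commit record enters the change stream every COMMIT_THRESHOLD records,
# and a target commit is issued after every UOW_COUNT commit records.
RECORDS_PER_TARGET_COMMIT=$((COMMIT_THRESHOLD * UOW_COUNT))
echo "commit record inserted every $COMMIT_THRESHOLD records"
echo "target commit issued every $RECORDS_PER_TARGET_COMMIT records"
```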
Challenge
Identify opportunities for performance improvement within the complexities of the UNIX
operating environment.
Description
This section provides an overview of the subject area, followed by discussion of the use
of specific tools.
Overview
All system performance issues are fundamentally resource contention issues. In any
computer system, there are three essential resources: CPU, memory, and I/O (disk
and network). From this standpoint, performance tuning for PowerCenter
means ensuring that PowerCenter and its sub-processes have adequate resources
to execute in a timely and efficient manner.
Each resource has its own particular set of problems. Resource problems are
complicated because all resources interact with each other. Performance tuning is
about identifying bottlenecks and making trade-offs to improve the situation. Your best
approach is to take a baseline measurement first and obtain a good understanding of
how the system behaves, then evaluate any bottleneck revealed on each system
resource during your load window and remove whichever resource contention offers
the greatest opportunity for performance enhancement.
Here is a summary of each system resource area and the problems it can have.
CPU
Memory
Disk I/O
❍ iostat can give you information about the transfer rates for each disk
drive. ps and vmstat can give some information about how many
processes are blocked waiting for I/O.
❍ sar can provide voluminous information about I/O efficiency.
❍ sadp can give detailed information about disk access patterns.
● The source data, the target data, or both the source and target data are likely
to be connected through an Ethernet channel to the system where
PowerCenter resides. Be sure to consider the number of Ethernet channels
and bandwidth available to avoid congestion.
❍ netstat shows packet activity on a network; watch for a high collision rate
of output packets on each interface.
❍ nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not
from the NFS server); watch for a high timeout rate relative to total calls
and “not responding” messages.
Given that these issues all boil down to access to some computing resource, mitigation
of each issue consists of making some adjustment to the environment to provide more
(or preferential) access to the resource; for instance:
Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips are likely to be more helpful than others in a particular
environment, all are worthy of consideration.
Running ps -axu
● Are there any processes waiting for disk access or for paging? If so check the I/
O and memory subsystems.
● What processes are using most of the CPU? This may help to distribute the
workload better.
● What processes are using most of the memory? This may help to distribute the
workload better.
● Does ps show that your system is running many memory-intensive jobs? Look
for jobs with a large resident set size (RSS) or a high storage integral.
Use vmstat or sar to check for paging/swapping actions. Check the system to
ensure that excessive paging/swapping does not occur at any time during the session
processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/
swapping. If paging or excessive swapping does occur at any time, increase memory to
prevent it. Paging/swapping, on any database system, causes a major performance
decrease and increased I/O. On a memory-starved and I/O-bound server, this can
effectively shut down the PowerCenter process and any databases running on the
server.
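As a rough sketch, paging activity can be flagged by scanning the page-in/page-out columns of vmstat output with awk (the sample data and column positions below are illustrative; vmstat layouts vary by platform, so check your system's column order first):

```shell
#!/bin/sh
# Flag paging/swapping in vmstat-style output. The sample is hard-coded
# for illustration; in practice pipe `vmstat 1 10` into the awk filter.
# Column positions (si = field 5, so = field 6) are an assumption and
# differ across UNIX variants.
sample=' procs       memory        page
 r  b  swpd   free   si   so
 1  0     0  90000    0    0
 2  1     0  40000  120   85'

printf '%s\n' "$sample" |
awk 'NR > 2 { if ($5+0 > 0 || $6+0 > 0) n++ }
     END { if (n > 0) print "paging detected"; else print "no paging" }'
```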
Some swapping may occur normally regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.
Run vmstat 5 (sar -wpgr) for SunOS, or vmstat -S 5, to detect and confirm memory
problems and check for the following:
Use iostat to check I/O load and utilization as well as CPU load. iostat can be used
to monitor the I/O load on specific disks on the UNIX server. Take notice of how
evenly disk activity is distributed among the system disks. If it is not even, are the
most active disks also the fastest disks?
Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area
of the disk (good), spread evenly across the disk (tolerable), or in two well-defined
peaks at opposite ends (bad)?
● Reorganize your file systems and disks to distribute I/O activity as evenly as
possible.
● Using symbolic links helps to keep the directory structure the same throughout
while still moving the data files that are causing I/O contention.
● Use your fastest disk drive and controller for your root file system; this almost
certainly has the heaviest activity. Alternatively, if single-file throughput is
important, put performance-critical files into one file system and use the fastest
drive for that file system.
● Put performance-critical files on a file system with a large block size: 16KB or
32KB (BSD).
● Increase the size of the buffer cache by increasing BUFPAGES (BSD). Note
that this may hurt your system's memory performance.
If your system has a disk capacity problem and is constantly running out of disk space,
try the following actions:
● Write a find script that detects old core dumps, editor backup and auto-save
files, and other trash and deletes it automatically. Run the script through cron.
● Use the disk quota system (if your system has one) to prevent individual users
from gathering too much storage.
● Use a smaller block size on file systems that are mostly small files (e.g.,
source code files, object modules, and small data files).
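A minimal sketch of such a find script follows. The directory, age threshold, and file patterns are assumptions to adapt; it prints candidates rather than deleting them, so the list can be verified before removal is enabled:

```shell
#!/bin/sh
# cleanup.sh - list old core dumps and editor backup files for deletion.
# The target directory, the 7-day age threshold, and the name patterns
# below are examples; adjust them for your site. Schedule via cron once
# verified, e.g.:  0 2 * * * /usr/local/bin/cleanup.sh /home
DIR=${1:-/tmp}
find "$DIR" \( -name core -o -name 'core.*' -o -name '*.bak' -o -name '*~' \) \
    -type f -mtime +7 -print
# Once the printed list looks right, append: -exec rm -f {} \;
```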
Use uptime or sar -u to check for CPU loading. sar provides more detail, including
%usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A
target goal should be %usr + %sys = 80, with %wio = 10, leaving %idle at 10.
If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O
bottleneck on the UNIX server. If the system shows a heavy %sys load while %usr is
low and %idle is high, this is indicative of memory contention and swapping/paging
problems. In this case, it is necessary to make memory changes to reduce the load on
the system server.
When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without
letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the
time, work must be piling up somewhere. This points to CPU overload.
Suspect problems with network capacity or with data integrity if users experience
slow performance when they are using rlogin or when they are accessing files via NFS.
If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system. If
the number of dropped packets is large, the remote system most likely cannot respond
to incoming data fast enough. Look to see if there are CPU, memory or disk I/O
problems on the remote system. If not, the system may just not be able to tolerate
heavy network workloads. Try to reorganize the network so that this system isn’t a file
server.
A large number of dropped packets may also indicate data corruption. Run netstat -s
on the remote system, then spray the remote system from the local system and run
netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is
equal to or greater than the number of dropped packets that spray reports, the remote
system is a slow network server. If the increase in socket full drops is less than the
number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent
of calls, the network or an NFS server is overloaded. If timeout is high, at least one
NFS server is overloaded, the network may be faulty, or one or more servers may have
crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If
timeout and retrans are high, but badxid is low, some part of the network between the
NFS client and server is overloaded and dropping packets.
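The 5-percent retransmission check can be computed directly from the nfsstat counters; a sketch with hypothetical counter values:

```shell
#!/bin/sh
# Compute the RPC retransmission rate from nfsstat-style client counters.
# The counts below are hypothetical; read real values from `nfsstat -c`.
CALLS=120000
RETRANS=9000

# Integer percentage of calls that were retransmitted.
PCT=$((RETRANS * 100 / CALLS))
echo "retrans: ${PCT}% of calls"
if [ "$PCT" -gt 5 ]; then
    echo "network or an NFS server is overloaded"
fi
```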
Try to prevent users from running I/O-intensive programs across the network.
The grep utility is a good example of an I/O-intensive program. Instead, have users log
into the remote system to do their work.
In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file
system (CFS) is recommended. The PowerCenter Grid option requires that the directories
for each Integration Service be shared with other servers. This allows Integration
Services to share files such as cache files between different session runs. CFS
performance is a result of tuning parameters and tuning the infrastructure; therefore,
using the parameters recommended by each CFS vendor is the best approach for CFS
tuning.
PowerCenter Options
The PowerCenter 64-bit option can allocate more memory to sessions and achieve
higher throughput than the 32-bit version of PowerCenter.
Challenge
Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.
Description
The following tips have proven useful in performance-tuning Windows Servers. While
some are likely to be more helpful than others in any particular environment, all are
worthy of consideration.
● Performance Monitor.
● Performance tab (hit ctrl+alt+del, choose task manager, and click on the
Performance tab).
Server Load: Assume that some software will not be well coded and that background
processes (e.g., a mail server or web server) running on the same machine can
starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the
only recourse.
Memory and services: Although adding memory to Windows Server is always a good
solution, it is also expensive and usually must be planned in advance. Before adding
memory, check the Services in Control Panel, because many background applications
do not uninstall the old service when installing a new version. Thus, both the unused
old service and the new service may be using valuable CPU and memory resources.
I/O Optimization: This is, by far, the best tuning option for database applications in
the Windows Server environment. If necessary, level the load across the disk devices
by moving files. In situations where there are multiple controllers, be sure to level the
load across the controllers too.
Using solid-state devices and fast-wide SCSI can also help to increase performance.
Further, fragmentation can usually be eliminated by using a Windows Server disk
defragmentation product.
Finally, on Windows Servers, be sure to implement disk striping to split single data files
across multiple disk drives and take advantage of RAID (Redundant Arrays of
Inexpensive Disks) technology. Also increase the priority of the disk devices on the
Windows Server. Windows Server, by default, sets the disk device priority low.
Windows Server provides the following tools (accessible under the Control Panel/
Administration Tools/Performance) for monitoring resource usage on your computer:
● System Monitor
● Performance Logs and Alerts
These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks.
System Monitor
The System Monitor displays a graph which is flexible and configurable. You can copy
counter paths and settings from the System Monitor display to the Clipboard and paste
counter paths from Web pages or other sources into the System Monitor display.
Because the System Monitor is portable, it is useful in monitoring other systems that
require administration.
Performance Logs and Alerts
The Performance Logs and Alerts tool provides two types of performance-related logs—
counter logs and trace logs—and an alerting function.
Counter logs record sampled data about hardware resources and system services
based on performance objects and counters in the same manner as System Monitor.
They can, therefore, be viewed in System Monitor. Data in counter logs can be saved
as comma-separated or tab-separated files that are easily viewed with Excel.
Trace logs collect event traces that measure performance statistics associated with
events such as disk and file I/O, page faults, or thread activity. The alerting function
allows you to define a counter value that will trigger actions such as sending a network
message, running a program, or starting a log. Alerts are useful if you are not actively
monitoring a particular counter threshold value but want to be notified when it exceeds
or falls below a specified value so that you can investigate and determine the cause of
the change. You may want to set alerts based on established performance baseline
values for your system.
Note: You must have Full Control access to a subkey in the registry in order to create
or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM
\CurrentControlSet\Services\SysmonLog\Log_Queries).
The predefined log settings under Counter Logs (i.e., System Overview) are configured
to create a binary log that, after manual start-up, updates every 15 seconds and logs
continuously until it achieves a maximum size. If you start logging with the default
settings, data is saved to the Perflogs folder on the root directory and includes the
counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and
Processor(_Total)\ % Processor Time.
If you want to create your own log setting, right-click one of the log types and choose the option to create a new log setting.
PowerCenter Options
PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-
bit Windows Server 2003 can allocate more memory to sessions and achieve higher
throughputs than the 32-bit version of PowerCenter on Windows Server.
Challenge
Description
1. Perform Benchmarking
You should always have a baseline of current load times for a given workflow or
session with a similar row count. Perhaps you are not achieving your required load
window, or comparison with other similar tasks suggests your processes could run
more efficiently. Use the benchmark to estimate what your desired performance goal
should be and tune to that goal. Begin with the problem mapping that you created,
along with a session and workflow that use all default settings. This helps to identify
which changes have a positive impact on performance.
This step helps to narrow down the areas on which to focus further. Follow the areas
and sequence below when attempting to identify the bottleneck:
● Target
The methodology steps you through a series of tests using PowerCenter to identify
trends that point to where to focus next. Remember to go through these tests in a
scientific manner: run them multiple times before reaching any conclusion,
and always keep in mind that fixing one bottleneck area may create a different
bottleneck. For more information, see Determining Bottlenecks.
Problems “outside” PowerCenter refers to anything indicating that the source of the
performance problem is external to PowerCenter. The most common performance
problems “outside” PowerCenter are source/target database problems, network
bottlenecks, or server and operating system problems.
● For source database related bottlenecks, refer to Tuning SQL Overrides and
Environment for Better Performance
● For target database related problems, refer to Performance Tuning Databases
- Oracle, SQL Server, or Teradata
● For operating system problems, refer to Performance Tuning UNIX Systems
or Performance Tuning Windows 2000/2003 Systems for more information.
Although there are certain procedures to follow to optimize mappings, keep in mind
that, in most cases, the mapping design is dictated by business logic; there may be a
limit to how much it can be optimized.
After you have completed the recommended steps for each relevant performance
bottleneck, re-run the problem workflow or session and compare its load performance
against the baseline. This step is iterative, and should be performed after any
performance-based setting is changed. You are trying to answer the question, “Did the
performance change have a positive impact?” If so, move on to the next bottleneck. Be
sure to prepare detailed documentation at every step along the way so you have a
clear record of what was and wasn't tried.
While it may seem like there are an enormous number of areas where a performance
problem can arise, if you follow the steps for finding the bottleneck(s), and apply the
tuning techniques specific to it, you are likely to improve performance and achieve your
desired goals.
Challenge
A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can be a
crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning
Data Analyzer and Data Analyzer reports.
Description
Performance tuning of reports occurs at both the environment level and the report level. Often, report performance
can be enhanced by looking closely at the objective of the report rather than its suggested appearance. The
following guidelines should help with tuning the environment and the report itself.
1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform
benchmarks at various points throughout the day and evening hours to account for inconsistencies in
network traffic, database server load, and application server load. This provides a baseline to measure
changes against.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data
elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider if the
report can be broken into multiple reports or presented at a higher level. These are often ways to create
more visually appealing reports and allow for linked detail reports or drill down to detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the
report to run during hours when the system use is minimized. Consider scheduling large numbers of reports
to run overnight. If mid-day updates are required, test the performance at lunch hours and consider
scheduling for that time period. Reports that require filters by users can often be copied and filters pre-
created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the
report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the
creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA
monitors the database environment. This provides the DBA the opportunity to tune the database for
querying. Finally, look into changes in database settings. Increasing the database memory in the initialization
file often improves Data Analyzer performance significantly.
5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL"
button on the report. Run the query from the report against the database using a client tool on the server
on which the database resides. One caveat is that even the database tool on the server may contact the
outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath or
IPC, Oracle’s local database communication protocols) and monitor the database throughout this process.
This test may pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the
query performs well regardless of where it is executed, but the report continues to be slow, this indicates an
application server bottleneck. Common locations for network bottlenecks include router tables, web server
demand, and server input/output. Informatica recommends installing Data Analyzer on a
dedicated application server.
6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of
tuning involves changes to the database tables. Review the underperforming reports.
Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes
efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an
aggregate table. By studying the existing reports and future requirements, you can determine what key
aggregates can be created in the ETL tool and stored in the database.
Calculated metrics can also be created in an ETL tool and stored in the database instead of being created in Data
Analyzer.
7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report
query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the
SQL into a query utility and execute. (DBA assistance may be beneficial here.) If the query appears to be
the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once
you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a
query utility. Additional options include utilizing database views to cache data prior to report generation.
Reports are then built based on the view.
Note: Editing the report query requires query editing for each report change and may require editing during
migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning.
The Data Analyzer repository database should be tuned for an OLTP workload.
JVM Layout
The Java Virtual Machine (JVM) is the repository for all live objects, dead objects, and free memory. It has the
following primary jobs:
● Execute code
● Manage memory
● Remove garbage objects
The size of the JVM determines how often and how long garbage collection runs.
The JVM parameters can be set in the "startWebLogic.cmd" or "startWebLogic.sh" if using the Weblogic application
server.
1. The -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like Data
Analyzer, the values should be set equal to each other.
2. Start with -Xms512m -Xmx512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage
collection.
3. When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the
old generation.
4. When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is
more expensive in terms of resources than a minor collection.
5. If you increase the new generation size, the old generation size decreases. Minor collections occur less
often, but the frequency of major collections increases.
6. If you decrease the new generation size, the old generation size increases. Minor collections occur more
often, but the frequency of major collections decreases.
7. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
8. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three
CPUs per JVM.
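Putting the heap guidelines together, the settings might look like the following fragment of startWebLogic.sh (the sizes and the JAVA_OPTIONS variable name are illustrative assumptions, not prescriptions; the new generation is held to roughly a quarter of the heap):

```shell
# Illustrative JVM settings for startWebLogic.sh: equal minimum and
# maximum heap, new generation ~1/4 of the heap. Sizes are examples
# only; tune them against your own garbage-collection behavior.
JAVA_OPTIONS="-Xms512m -Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m"
export JAVA_OPTIONS
```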
Execute Threads
● Too few threads means CPUs are under-utilized and jobs are waiting for threads to become
available.
● Too many threads means the system is wasting resources managing threads. The OS performs
unnecessary context switching.
● The default is 15 threads. Informatica recommends using the default value, but you may need
to experiment to determine the optimal value for your environment.
Connection Pooling
The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.
● Initial capacity = 15
● Maximum capacity = 15
● The sum of connections across all pools should be equal to the number of execution threads.
Connection pooling avoids the overhead of growing and shrinking the pool size dynamically; set the initial and
maximum capacities to the same value.
Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available
on: Windows NT/2000 (default installed), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.
For Websphere, use the Performance Tuner to modify the configurable parameters.
For optimal configuration, separate the application server, the data warehouse database, and the repository
database onto separate dedicated machines.
Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable
number of HTTP requests, as required by the Data Analyzer installation. Ensure that the web container has an
optimal number of threads available so that it can accept and process more HTTP requests.
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml
● maxProcessors. Maximum number of threads that can ever be created in the pool.
● acceptCount. Controls the length of the queue of waiting requests when no more threads are
available from the pool to process the request.
● connectionTimeout. Amount of time to wait before a URI is received from the stream. The
default is 20 seconds. This avoids problems where a client opens a connection and does not
send any data.
● tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer
to be full. This reduces latency at the cost of more packets being sent over the network. The
default is true.
● enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled
to prevent IP spoofing, but can cause problems when a DNS is misbehaving; it can be turned
off when you implicitly trust all clients.
● connectionLinger. How long connections should linger after they are closed. Informatica
recommends using the default value: -1 (no linger).
In the Data Analyzer application, each web page can potentially have more than one request to the application
server. Hence, the maxProcessors should always be more than the actual number of concurrent users. For an
installation with 20 concurrent users, a minProcessors of 5 and maxProcessors of 100 is a suitable value.
If the number of threads is too low, the following message may appear in the log files:
ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
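As a sketch, the thread-pool parameters discussed above map onto the HTTP connector element of the embedded Tomcat configuration roughly as follows. The values reflect the 20-concurrent-user example; the exact element name, class, and attribute set depend on the Tomcat version bundled with your JBoss release, so treat this as illustrative only:

```xml
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
           port="8080"
           minProcessors="5"
           maxProcessors="100"
           acceptCount="100"
           connectionTimeout="20000"
           tcpNoDelay="true"
           enableLookups="false"
           connectionLinger="-1"/>
```

Note that connectionTimeout is specified in milliseconds in the connector element (20000 ms = the 20-second default described above).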
JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first
time, Informatica ships Data Analyzer with pre-compiled JSPs.
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml
<servlet>
<servlet-name>jsp</servlet-name>
<servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
<init-param>
<param-name>logVerbosityLevel</param-name>
<param-value>WARNING</param-value>
</init-param>
<init-param>
<param-name>development</param-name>
<param-value>false</param-value>
</init-param>
<load-on-startup>3</load-on-startup>
</servlet>
Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information.
When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of
database connections for the repository. It also keeps a separate database connection pool for each data source. To
optimize Data Analyzer database connections, you can tune the database connection pools.
Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss
configuration file:
<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml
The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example,
for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the
configuration file may simply be named DataAnalyzer-ds.xml.
● min-pool-size. The minimum number of connections in the pool. (The pool is lazily
constructed, that is, it will be empty until it is first accessed. Once used, it will always have at
least the min-pool-size connections.)
● max-pool-size. The maximum number of connections in the pool.
● idle-timeout-minutes. The length of time an idle connection remains in the pool before it is
closed and removed.
The max-pool-size value should be at least five more than the maximum number of concurrent users
because there may be several scheduled reports running in the background and each of them needs a database
connection.
A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very
frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out. Checking for
idle connections may block other threads that require new connections.
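A minimal sketch of these settings in an Oracle repository descriptor, sized for roughly 20 concurrent users per the guidance above. The element names follow the standard JBoss 3.x datasource descriptor; the JNDI name, host, and SID shown here are hypothetical and must match your installation:

```xml
<datasources>
  <local-tx-datasource>
    <jndi-name>jdbc/IASDataSource</jndi-name>
    <connection-url>jdbc:oracle:thin:@repo-host:1521:IAS</connection-url>
    <driver-class>oracle.jdbc.driver.OracleDriver</driver-class>
    <min-pool-size>5</min-pool-size>
    <!-- concurrent users + 5 headroom for scheduled background reports -->
    <max-pool-size>25</max-pool-size>
    <!-- keep high: frequent idle-connection sweeps block threads needing connections -->
    <idle-timeout-minutes>30</idle-timeout-minutes>
  </local-tx-datasource>
</datasources>
```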
Data Source Database Connection Pool. Similar to the repository database connection pools, the data source
also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a
connection.
The tuning parameters for these dynamic pools are in the following file:
<JBOSS_HOME>/bin/IAS.properties
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
● dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
● dynapool.maxCapacity. The maximum number of connections that the data source pool may
grow to.
● dynapool.poolNamePrefix. A prefix added to the dynamic JDBC pool name
for identification purposes.
● dynapool.waitSec. The maximum amount of time (in seconds) a client waits to grab a
connection from the pool if none is readily available.
EJB Container
Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity
beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time
functionalities.
Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune
the EJB pool parameters in the following file:
<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml.
<container-configuration>
<container-name> Standard Stateless SessionBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>
stateless-rmi-invoker</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor>
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These
two parameters are not set by default in Data Analyzer. They can be tuned after you have
performed proper iterative testing in Data Analyzer to increase the throughput for high-
concurrency installations.
● strictMaximumSize. When the value is set to true, the <strictMaximumSize> enforces a rule
that only <MaximumSize> number of objects can be active. Any subsequent requests must
wait for an object to be returned to the pool.
● strictTimeout. The length of time a request waits for an instance from the pool when
<strictMaximumSize> is true, before the request times out.
Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The
main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the
MDB when they are available.
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml
<container-configuration>
<container-name>Standard Message Driven Bean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>message-driven-bean
</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor
</interceptor>
<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor
</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT
</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool
</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then
<MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are
requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.
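For illustration, a container-pool-conf with the strict settings enabled might look like the following. This is a sketch: <strictMaximumSize> and <strictTimeout> are standard JBoss pool parameters that are not set in Data Analyzer by default, and the values shown are illustrative:

```xml
<container-pool-conf>
  <MaximumSize>100</MaximumSize>
  <!-- cap active instances at MaximumSize; further requests wait -->
  <strictMaximumSize>true</strictMaximumSize>
  <!-- milliseconds a request waits for a pooled instance before timing out -->
  <strictTimeout>30000</strictTimeout>
</container-pool-conf>
```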
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are
not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data
Analyzer to increase the throughput for high-concurrency installations.
Entity Beans (EB). Data Analyzer entity beans use BMP (bean-managed persistence) as opposed to CMP
(container-managed persistence). The EJB tuning parameters are very similar to the stateless bean tuning
parameters.
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml.
<container-configuration>
<container-name>Standard BMP EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-rmi-invoker
</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT
</interceptor>
<interceptor metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor
</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
<interceptor>
org.jboss.ejb.plugins.EntitySynchronizationInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool
</instance-pool>
<instance-cache>org.jboss.ejb.plugins.EntityInstanceCache
</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then
<MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are
requests for more objects. However, only the <MaximumSize> number of objects are returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are
not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data
Analyzer to increase the throughput for high-concurrency installations.
RMI Pool
The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for
remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other
custom applications, you can optimize the RMI thread pool parameters.
<JBOSS_HOME>/server/informatica/conf/jboss-service.xml
● NumAcceptThreads. The number of threads used to accept connections from the client.
● MaxPoolSize. A strict maximum size for the pool of threads that service requests on the server.
● ClientMaxPoolSize. A strict maximum size for the pool of threads that service requests on the
client.
● Backlog. The number of requests that can wait in the queue when all the processing threads are in use.
● EnableTcpNoDelay. Indicates whether information should be sent before the buffer is full. Setting it to true
may increase the network traffic because more packets will be sent across the network.
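As a sketch, these attributes appear on the pooled invoker MBean in jboss-service.xml. The MBean class and attribute names follow the JBoss 3.x pooled invoker; the values shown are illustrative only:

```xml
<mbean code="org.jboss.invocation.pooled.server.PooledInvoker"
       name="jboss:service=invoker,type=pooled">
  <attribute name="NumAcceptThreads">1</attribute>
  <attribute name="MaxPoolSize">300</attribute>
  <attribute name="ClientMaxPoolSize">300</attribute>
  <attribute name="Backlog">200</attribute>
  <attribute name="EnableTcpNoDelay">false</attribute>
</mbean>
```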
WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of some
of the parameters and arrive at good settings.
Web Container
Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following
parameters.
● Minimum Size: Specifies the minimum number of threads to allow in the pool. The default
value of 10 is appropriate.
● Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent
usage scenario (with a 3 VM load-balanced configuration), a value of 50-60 has been determined to be
optimal.
● Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a
thread is reclaimed. The default of 3500 ms is considered optimal.
● Is Growable: Specifies whether the number of threads can increase beyond the maximum size
configured for the thread pool. Be sure to leave this option unchecked, so that the number of
threads is hard-limited to the "Maximum Size" value.
Note: In a load-balanced environment, there is likely to be more than one server instance, possibly spread across
multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server
instances.
Transaction Services
Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value of 120
seconds may not be sufficient and should be increased. This parameter can be modified during runtime also.
Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service >
Debugging Service “ and make sure “Startup” is not checked.
This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the
application server after a certain interval; if the server is found to be dead, it then tries to restart the server.
Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy “ and tune the
parameters according to a policy determined for each Data Analyzer installation.
Note: The parameter “Ping Timeout” determines the time after which a no-response from the server implies that it is
faulty. The monitoring service then attempts to kill the server and restart it if “Automatic restart” is checked. Take
care that “Ping Timeout” is not set to too small a value.
For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum
and the maximum heap size be set to the same values. This avoids the heap allocation-reallocation expense during
a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the values of
minimum heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is recommended after
carefully studying the garbage collection behavior by turning on the verbosegc option.
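Following the recommendation above, the generic JVM arguments for a high-concurrency installation might look like this (illustrative values, set via the Process Definition > Java Virtual Machine panel or the equivalent command line):

```
-Xms1000m -Xmx1000m -verbose:gc
```

Setting -Xms equal to -Xmx fixes the heap size up front, and -verbose:gc logs each collection so the heap size can be tuned from observed behavior.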
The following is a list of java parameters (for IBM JVM 1.4.1) that should not be modified from the default values for
Data Analyzer installation:
● -Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap
compaction results in heap fragmentation. Since Data Analyzer frequently allocates large
objects, heap fragmentation can result in OutOfMemory exceptions.
● -Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out a
compaction, whether or not it is needed, which adds unnecessary overhead.
● -Xgcthreads. This controls the number of garbage collection helper threads created by the
JVM during startup. The default is N-1 threads for an N-processor machine. These threads
provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause
time during garbage collection.
You may want to alter the following parameters after carefully examining the application server processes:
Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine".
● Verbose garbage collection. Check this option to turn on verbose garbage collection. This can
help in understanding the behavior of the garbage collection for the application. It has a very
low overhead on performance and can be turned on even in the production environment.
● Initial heap size. This is the -Xms value. Only the numeric value (without MB) needs to be
specified. For concurrent usage, the initial heap size should start at 1000 and,
depending on the garbage collection behavior, can potentially be increased up to 2000. A value
beyond 2000 may actually reduce throughput because the garbage collection cycles will take
more time to go through the large heap, even though the cycles may occur less
frequently.
● Maximum heap size. This is the -Xmx value. It should be equal to the “Initial heap size” value.
● RunHProf. This should remain unchecked in production mode, because it slows down the VM
considerably.
● Debug Mode. This should remain unchecked in production mode, because it slows down the VM
considerably.
● Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).
Performance Monitoring Services
Be sure that performance monitoring services are not enabled in a production environment.
Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup”
is not checked.
The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC
Provider > Data Sources > IASDataSource > Connection Pools”
● Connection Timeout. The default value of 180 seconds is generally sufficient. This means that after 180 seconds,
a request to grab a connection from the pool times out and Data Analyzer throws an
exception. If that happens, the pool size may need to be increased.
Much like the repository database connection pools, the data source or data warehouse databases also have a pool
of connections that are created dynamically by Data Analyzer as soon as the first client makes a request.
The tuning parameters for these dynamic pools are present in <WebSphere_Home>/AppServer/IAS.properties file.
# Datasource definition
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
To process scheduled reports, Data Analyzer uses Message-Driven-Beans. It is possible to run multiple reports
within one schedule in parallel by increasing the number of instances of the MDB catering to the Scheduler
(InfScheduleMDB). Take care, however, not to increase this to an arbitrarily high value: each report
consumes considerable resources (e.g., database connections and CPU processing at both the application server
and database server levels), and setting it very high may actually be detrimental to the whole system.
Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports >
IAS_ScheduleMDB_ListenerPort” .
● Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not
recommend going beyond five.
● Maximum messages. This should remain as one. This implies that each report in a schedule will be
executed in a separate transaction instead of a batch. Setting it to more than one may have unwanted
effects like transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.
When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load-
balancing between each server in the cluster. The proxy http-server sends the request to the plug-in and the plug-in
then routes the request to the proper application-server.
The default plug-in file contains ConnectTimeOut=0, which means that it relies on the tcp timeout setting of the
server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies
that if the server does not respond within the given number of seconds, it is marked as down and the request is
sent to the next available member of the cluster.
The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down.
The default value is 10 seconds. This means if a cluster member is marked as down, the server does not try to send
a request to the same member for 10 seconds.
Challenge
In general, mapping-level optimization takes time to implement, but can significantly boost performance.
Sometimes the mapping is the biggest bottleneck in the load process because business rules determine
the number and complexity of transformations in a mapping.
Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic
issues. Mapping tuning techniques fall into two groups. The first group helps almost universally,
bringing about a performance increase in all scenarios. The second group of tuning processes may yield
only a small performance increase, or may be of significant value, depending on the situation.
Some factors to consider when choosing tuning processes at the mapping level include the specific
environment, software/ hardware limitations, and the number of rows going through a mapping. This Best
Practice offers some guidelines for tuning mappings.
Description
Analyze mappings for tuning only after you have tuned the target and source for peak performance. To
optimize mappings, you generally reduce the number of transformations in the mapping and delete
unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations),
limit connected input/output or output ports. Doing so can reduce the amount of data the transformations
store in the data cache. Having too many Lookups and Aggregators can encumber performance because
each requires index cache and data cache. Since both are fighting for memory space, decreasing the
number of these transformations in a mapping can help improve speed. Splitting them up into different
mappings is another option.
Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on
the cache directory. Unless the seek/access time is fast on the directory itself, having too many
Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk
and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.
If several mappings use the same data source, consider a single-pass reading. If you have several
sessions that use the same sources, consolidate the separate mappings with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the
separate data flows.
Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that
function is called in the session. For example, if you need to subtract percentage from the PRICE ports for
both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage
before splitting the pipeline.
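The single-pass idea can be illustrated outside PowerCenter with a short sketch (hypothetical data and names): the shared calculation runs once per source row, and both downstream flows consume its output instead of each recomputing the discount.

```python
def discounted(rows, pct):
    """Apply the shared calculation once, before the pipeline splits."""
    for row in rows:
        row = dict(row)
        row["PRICE"] = row["PRICE"] * (1 - pct)  # subtract percentage once
        yield row

source = [{"ITEM": "A", "PRICE": 100.0}, {"ITEM": "B", "PRICE": 50.0}]

shared = list(discounted(source, 0.10))  # single pass over the source

# Both "branches" (stand-ins for the Aggregator and Rank flows)
# reuse the already-adjusted rows instead of recomputing the discount.
total = sum(r["PRICE"] for r in shared)       # Aggregator-like branch
top = max(shared, key=lambda r: r["PRICE"])   # Rank-like branch
```

Had each branch applied the discount itself, the same expression would execute twice per row; factoring it out before the split halves that work, which is the point of the single-pass pattern.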
When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override
of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned
depends on the underlying source or target database system. See Tuning SQL Overrides and
Environment for Better Performance for more information.
PowerCenter Server automatically makes conversions between compatible datatypes. When these
conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from
an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary.
In some instances however, datatype conversions can help improve performance. This is especially true
when integer values are used in place of other datatypes for performing comparisons using Lookup and
Filter transformations.
Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During
transformation errors, the PowerCenter Server engine pauses to determine the cause of the error,
removes the row causing the error from the data flow, and logs the error in the session log.
Transformation errors can be caused by many things including: conversion errors, conflicting mapping
logic, any condition that is specifically set up as an error, and so on. The session log can help point out the
cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for
these transformations. If you need to run a session that generates a large number of transformation errors,
you might improve performance by setting a lower tracing level. However, this is not a long-term response
to transformation errors. Any source of errors should be traced and eliminated.
There are several ways to optimize Lookup transformations that are set up in a mapping.
Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table
and queries the lookup cache during the session. When this option is not enabled, the PowerCenter
Server queries the lookup table on a row-by-row basis.
Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for
lookups are sufficient to ensure that caches will not page to disks. Information regarding memory and
cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better
Performance.
A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard
to the number of rows expected to be processed. For example, consider two lookups: LKP_Manufacturer,
against the MANUFACTURER table, and LKP_DIM_ITEMS, against the DIM_ITEMS table.
Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a
total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it
will take a total of 10,000 total disk reads to execute the lookup. In this case, the number of records in the
lookup table is small in comparison with the number of times the lookup is executed. So this lookup should
be cached. This is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in
105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk
reads would total 10,000. In this case the number of records in the lookup table is not small in comparison
with the number of times the lookup will be executed. Thus, the lookup should not be cached.
(LS*NRS*CRS)/(CRS-NRS) = X
Where X is the breakeven point. If the number of expected source records is less than X, it is better not to
cache the lookup. If it is more than X, it is better to cache the lookup.
For example, suppose that for a given lookup the formula yields X = 66,603.
Thus, if the source has fewer than 66,603 records, the lookup should not be cached. If it has more
than 66,603 records, then the lookup should be cached.
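Under a simplified cost model grounded in the disk-read comparison above (caching costs roughly one read per lookup-table row to build the cache; an un-cached lookup issues one query per source row), the decision can be sketched as follows. This is a deliberate simplification: the full breakeven formula above also accounts for rows returned per disk read.

```python
def should_cache_lookup(lookup_rows: int, source_rows: int) -> bool:
    """Simplified model: cache the lookup when building the cache (one read
    per lookup-table row) is cheaper than querying once per source row."""
    cached_cost = lookup_rows     # reads to build the cache once
    uncached_cost = source_rows   # one database query per source row
    return cached_cost < uncached_cost

# MANUFACTURER-style case: small lookup table, many probes -> cache it
small = should_cache_lookup(lookup_rows=200, source_rows=10_000)

# DIM_ITEMS-style case: huge lookup table, fewer probes -> leave un-cached
large = should_cache_lookup(lookup_rows=100_000, source_rows=10_000)
```

The two calls mirror the MANUFACTURER and DIM_ITEMS scenarios described above: the first favors caching, the second does not.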
● Within a specific session run for a mapping, if the same lookup is used multiple times in a
mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup.
Using the same lookup multiple times in the mapping will be more resource intensive with each
successive instance. If multiple cached lookups are from the same table but are expected to
return different columns of data, it may be better to setup the multiple lookups to bring back the
same columns even though not all return ports are used in all lookups. Bringing back a common
set of columns may reduce the number of disk reads.
● Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple
runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a
persistent cache is set in the lookup properties, the memory cache created for the lookup during
the initial run is saved to the PowerCenter Server. This can improve performance because the
Server builds the memory cache from cache files instead of the database. This feature should
only be used when the lookup table is not expected to change between session runs.
● Across different mappings and sessions, the use of a named persistent cache allows the same
cache files to be shared, so the cache needs to be built only once.
There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the
WHERE clause to reduce the set of records included in the resulting cache.
Note: If you use a SQL override in a lookup, the lookup must be cached.
In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first
in order to optimize lookup performance.
The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a
result, indexes on the database table should include every column used in a lookup condition. This can
improve performance for both cached and un-cached lookups.
In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement
used to create the cache. Columns used in the ORDER BY condition should be indexed.
The session log will contain the ORDER BY statement.
●
In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the lookup transformation, performance can be helped by indexing
columns in the lookup condition.
If the lookup source does not change between sessions, configure the Lookup transformation to use a
persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to
session, eliminating the time required to read the lookup source.
Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of
using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use
a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve
performance.
Avoid complex expressions when creating the filter condition. Filter transformations are most
effective when a simple integer or TRUE/FALSE expression is used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if
rejected rows do not need to be saved.
Aggregator Transformations often slow performance because they must group data before processing it.
Use simple columns in the group by condition to make the Aggregator Transformation more efficient.
When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex
expressions in the Aggregator expressions, especially in GROUP BY ports.
Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be
sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option
decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is
sorted by group and, as a group is passed through an Aggregator, calculations can be performed and
information passed on to the next transformation. Without sorted input, the Server must wait for all rows of
data before processing aggregate calculations. Use of the Sorted Inputs option is usually accompanied by
a Source Qualifier which uses the Number of Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can
only be used if the source data can be sorted. Further, using this option assumes that a mapping is using
an Aggregator with Sorted Input option. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to use the previous row of
data to determine whether the current row is a part of the current group or is the beginning of a new group.
Thus, if the row is a part of the current group, then its data would be used to continue calculating the
current group function. An Update Strategy Transformation would follow the Expression Transformation
and set the first row of a new group to insert, and the following rows to update.
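The variable-port technique can be sketched in ordinary code: carry the previous row's group key, flag the first row of each group as an insert, and flag subsequent rows as updates of the running aggregate. Column names here are hypothetical, and the input must be sorted by the group key, matching the Sorted Input assumption above.

```python
def sorted_aggregate(rows):
    """Emit (action, group_key, running_total) per input row, mimicking an
    Expression transformation whose variable ports hold the previous row,
    followed by an Update Strategy that sets insert/update per row."""
    prev_key = object()  # sentinel: no previous row seen yet
    running = 0
    out = []
    for row in rows:  # rows must be sorted by DEPT
        if row["DEPT"] != prev_key:
            running = row["SALES"]              # new group: reset the total
            out.append(("INSERT", row["DEPT"], running))
        else:
            running += row["SALES"]             # same group: keep accumulating
            out.append(("UPDATE", row["DEPT"], running))
        prev_key = row["DEPT"]
    return out

rows = [{"DEPT": "A", "SALES": 10}, {"DEPT": "A", "SALES": 5},
        {"DEPT": "B", "SALES": 7}]
result = sorted_aggregate(rows)
```

The last row emitted for each group carries the final aggregate, so inserting the first row and updating on the rest leaves the target with one correct row per group.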
Use incremental aggregation if you can capture changes from the source that affect less than half of the
target. When using incremental aggregation, you apply captured changes in the source to aggregate
calculations in a session. The PowerCenter Server updates your target incrementally, rather than
processing the entire source and recalculating the same calculations every time you run the session.
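Conceptually, incremental aggregation merges only the captured changes into previously stored aggregate values instead of recomputing from the full source. A minimal sketch with hypothetical structures (stored sums keyed by group):

```python
def apply_incremental(stored, changes):
    """stored maps group key -> aggregate (a sum here); changes are only the
    new or changed source rows since the last run. Only affected groups are
    touched; untouched groups keep their previously computed values."""
    for row in changes:
        stored[row["DEPT"]] = stored.get(row["DEPT"], 0) + row["SALES"]
    return stored

stored = {"A": 100, "B": 40}  # aggregates carried over from prior runs
changes = [{"DEPT": "A", "SALES": 5},   # existing group updated in place
           {"DEPT": "C", "SALES": 3}]   # brand-new group created
stored = apply_incremental(stored, changes)
```

Group B is never read or rewritten, which is the saving: work is proportional to the change set, not to the full source.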
Joiner Transformation
You can join data from the same source in two ways: by joining two branches of the same pipeline, or
by joining two instances of the same source.
You may want to join data from the same source if you want to perform a calculation on part of the data
and join the transformed data with the original data. When you join the data using this method, you can
maintain the original data and transform parts of that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline. When you branch
a pipeline, you must join sorted data and configure the Joiner transformation for sorted input.
If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
For example, you may have a source with the following ports:
● Employee
● Department
● Total Sales
In the target table, you want to view the employees who generated sales that were greater than the
average sales for their respective departments. To accomplish this, you create a mapping with the
following transformations:
Note: You can also join data from output groups of the same transformation, such as the Custom
transformation or XML Source Qualifier transformations. Place a Sorter transformation between each
output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.
Joining two branches can affect performance if the Joiner transformation receives data from one branch
much later than the other branch. The Joiner transformation caches all the data from the first branch, and
writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk
when it receives the data from the second branch. This can slow processing.
You can also join same source data by creating a second instance of the source. After you create the
second source instance, you can join the pipelines from the two source instances.
Note: When you join data using this method, the PowerCenter Server reads the source data for each
source instance, so performance can be slower than joining two branches of a pipeline.
Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a
source:
● Join two branches of a pipeline when you have a large source or if you can read the source data
only once. For example, you can only read source data from a message queue once.
● Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you
use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
Performance Tips
Use the database to do the join when sourcing data from the same database schema. Database
systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a
join condition should be used when joining multiple tables from the same database schema.
Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of
data is also smaller.
Join sorted data when possible. You can improve session performance by configuring the Joiner
transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the
PowerCenter Server improves performance by minimizing disk input and output. You see the greatest
performance improvement when you work with large data sets.
For an unsorted Joiner transformation, designate as the master source the source with fewer rows.
For optimal performance and disk storage, the master source should be the source with fewer rows.
During a session, the Joiner transformation compares each row of the master source against the detail
source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which
speeds the join process.
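Why the smaller source belongs on the master side can be seen in a simplified sketch of an unsorted join (illustrative Python only, not the PowerCenter implementation):

```python
def join_master_detail(master, detail):
    """Unsorted join: the master side is cached in memory, then the detail
    side streams against the cache. Fewer unique master rows means a
    smaller cache and fewer comparisons per detail row."""
    cache = {}
    for key, mval in master:          # build the master cache once
        cache.setdefault(key, []).append(mval)
    joined = []
    for key, dval in detail:          # one cache probe per detail row
        for mval in cache.get(key, []):
            joined.append((key, mval, dval))
    return joined
```

Swapping a large master for a small one multiplies the cache size and the per-row comparison work, which is exactly what the guideline above avoids.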
For a sorted Joiner transformation, designate as the master source the source with fewer duplicate
key values. For optimal performance and disk storage, designate the master source as the source with
fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it
caches rows for one hundred keys at a time. If the master source contains many rows with the same key
value, the PowerCenter Server must cache more rows, and performance can be slowed.
Optimizing sorted joiner transformations with partitions. When you use partitions with a sorted Joiner
transformation, you may optimize performance by grouping data and using n:n partitions.
To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you
must group and sort data. To group data, ensure that rows with the same key value are routed to the same
partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a
hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort
the data ensures that you maintain grouping and sort the data within each group.
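The grouping requirement for hash auto-keys partitioning amounts to routing every row with the same key value to the same partition, which can be sketched as (a simplified illustration, not Informatica's internal hash):

```python
def hash_partition(rows, key_fn, n_partitions):
    """Hash auto-keys style partitioning: rows with the same key value are
    routed to the same partition, so each partition can sort and join its
    groups independently of the others."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        idx = hash(key_fn(row)) % n_partitions   # same key -> same index
        partitions[idx].append(row)
    return partitions
```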
You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When
you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not
need to cache all of the master data. This reduces memory usage and speeds processing. When you use
1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache
to disk if the memory cache fills.
Sequence Generator Transformation
Sequence Generator transformations need to determine the next available sequence number; thus,
increasing the Number of Cached Values property can increase performance. This property determines
the number of values the PowerCenter Server caches at one time. If it is set to cache no values, then the
PowerCenter Server must query the repository each time to determine the next number to be used. You
may consider configuring the Number of Cached Values to a value greater than 1000. Note that any
cached values not used in the course of a session are lost: once a block of values is cached, the
sequence value in the repository is advanced past that block, so the next fetch starts a new block.
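The trade-off between repository round trips and lost values can be sketched as (illustrative Python, with the repository simulated by a counter; class and attribute names are hypothetical):

```python
class CachedSequence:
    """Fetch sequence values in blocks from a (simulated) repository:
    one repository round trip per block instead of one per value.
    Unused values in the final block are lost, as with the Sequence
    Generator's Number of Cached Values property."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.repo_next = 1            # value the repository would hand out next
        self.repo_calls = 0           # round trips to the repository
        self.cache = []
    def next(self):
        if not self.cache:            # cache exhausted: fetch another block
            self.repo_calls += 1
            self.cache = list(range(self.repo_next,
                                    self.repo_next + self.block_size))
            self.repo_next += self.block_size
        return self.cache.pop(0)
```

With a block size of 1000, generating 2,500 values costs three repository calls instead of 2,500, but the 500 unused values of the third block are gone when the session ends.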
For the most part, making calls to external procedures slows a session. If possible, avoid the use of these
Transformations, which include Stored Procedures, External Procedures, and Advanced External
Procedures.
As a final step in the tuning process, you can tune expressions used in transformations. When examining
expressions, focus on complex expressions and try to simplify them when possible.
Processing field level transformations takes time. If the transformation expressions are complex, then
processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by
optimizing complex field level transformations. Use the target table mapping reports or the Metadata
Reporter to examine the transformations. Likely candidates for optimization are the fields with the most
complex expressions. Keep in mind that there may be more than one field causing performance problems.
Factoring out common logic can reduce the number of times a mapping performs the same logic. If a
mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the
logic to be performed just once. For example, a mapping has five target tables. Each target requires a
Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup
to a position before the data flow splits.
Aggregate function calls can sometime be reduced. In the case of each aggregate function call, the
PowerCenter Server must search and group the data. Thus, the following expression:
SUM(Column A) + SUM(Column B)
can be optimized to:
SUM(Column A + Column B)
In general, operators are faster than functions, so operators should be used whenever possible. For
example, if you have an expression which involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
it can be rewritten with the || operator as:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical
statements to be written in a more compact fashion. For example:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in
three IIFs, three comparisons, and two additions.
Avoid calculating or testing the same value multiple times. If the same sub-expression is used several
times in a transformation, consider making the sub-expression a local variable. The local variable can be
used only within the transformation in which it was created. Calculating the variable only once and then
referencing the variable in following sub-expressions improves performance.
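The payoff of a local variable is simply that the shared sub-expression is evaluated once instead of once per use, as this Python sketch shows (the call counter is purely for illustration):

```python
calls = {"n": 0}

def expensive(x):
    """Stands in for a complex sub-expression that is costly to evaluate."""
    calls["n"] += 1
    return x * x

def without_variable(x):
    # the sub-expression is evaluated twice, once per output expression
    return expensive(x) + 1, expensive(x) - 1

def with_variable(x):
    v = expensive(x)          # local variable: evaluated once
    return v + 1, v - 1       # both outputs reference the variable
```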
The PowerCenter Server processes numeric operations faster than string operations. For example, if a
lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID,
configuring the lookup around EMPLOYEE_ID improves performance.
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows
each time it finds trailing blank spaces in the row. To resolve this, enable the Treat CHAR as CHAR On
Read option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end
of CHAR source fields.
When a LOOKUP function is used, the PowerCenter Server must lookup a table in the database. When a
DECODE function is used, the lookup values are incorporated into the expression itself so the server does
not need to lookup a separate table. Thus, when looking up a small set of unchanging values, using
DECODE may improve performance.
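The DECODE idea maps naturally onto an in-memory translation table, sketched here in Python (the status codes are invented for illustration):

```python
# DECODE-style translation: a small, unchanging set of values is embedded
# in the expression itself, so no database table is queried per row
# (unlike an uncached LOOKUP).
STATUS_DECODE = {"A": "Active", "I": "Inactive", "T": "Terminated"}

def decode_status(code, default="Unknown"):
    """Translate a code via the embedded table, with a default for
    unmatched values (DECODE's final argument)."""
    return STATUS_DECODE.get(code, default)
```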
Because there is always overhead involved in moving data among transformations, try, whenever
possible, to reduce the number of transformations. Also, resolve unnecessary links between
transformations to minimize the amount of data moved. This is especially important with data being pulled
from the Source Qualifier Transformation.
You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier
transformation and in the Properties tab of the target instance in a mapping. To increase the load speed,
use these commands to drop indexes on the target before the session runs, then recreate them after the
session completes. Keep the following guidelines in mind:
● You can use any command that is valid for the database type. However, the PowerCenter Server
does not allow nested comments, even though the database may.
● You can use mapping parameters and variables in SQL executed against the source, but not
against the target.
● Use a semi-colon (;) to separate multiple statements.
● The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/.
● If you need to use a semi-colon outside of quotes or comments, you can escape it with a back
slash (\).
● The Workflow Manager does not validate the SQL.
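The statement-splitting rules above can be modeled in a short sketch (an illustration of the described behavior, not Informatica's parser):

```python
def split_sql(script):
    """Split pre/post-session SQL on semicolons, ignoring semicolons inside
    single quotes, double quotes, or /* ... */ comments, and treating a
    backslash-escaped semicolon as a literal, per the guidelines above."""
    stmts, buf = [], []
    i, n = 0, len(script)
    in_sq = in_dq = in_cmt = False
    while i < n:
        ch = script[i]
        if in_cmt:
            buf.append(ch)
            if ch == '*' and i + 1 < n and script[i + 1] == '/':
                buf.append('/'); i += 1; in_cmt = False
        elif in_sq:
            buf.append(ch)
            if ch == "'": in_sq = False
        elif in_dq:
            buf.append(ch)
            if ch == '"': in_dq = False
        elif ch == '\\' and i + 1 < n and script[i + 1] == ';':
            buf.append(';'); i += 1        # escaped semicolon: keep literally
        elif ch == ';':
            stmts.append(''.join(buf).strip()); buf = []   # statement boundary
        else:
            buf.append(ch)
            if ch == "'": in_sq = True
            elif ch == '"': in_dq = True
            elif ch == '/' and i + 1 < n and script[i + 1] == '*':
                buf.append('*'); i += 1; in_cmt = True
        i += 1
    tail = ''.join(buf).strip()
    if tail:
        stmts.append(tail)
    return stmts
```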
For relational databases, you can execute SQL commands in the database environment when connecting
to the database. You can use this for source, target, lookup, and stored procedure connections. For
instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the
guidelines listed above for using the SQL statements.
You can use local variables in Aggregator, Expression, and Rank transformations.
Rather than parsing and validating the same expression each time, you can define these components as
variables. This also allows you to simplify complex expressions. For example, the following expressions:
AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
both evaluate the same condition. You can define the condition once as a variable, reference it in both
expressions, and also use variables to temporarily store data.
You can use variables to store data from prior rows. This can help you perform procedural calculations.
To compare the previous state to the state just read:
Variables also provide a way to capture multiple columns of return values from stored procedures.
Challenge
Running sessions is where the pedal hits the metal. A common misconception is that
this is the area where most tuning should occur. While it is true that various specific
session options can be modified to improve performance, PowerCenter 8 comes with
PowerCenter Enterprise Grid Option and Pushdown optimizations that also improve
performance tremendously.
Description
Once you optimize the source and target database, and mapping, you can focus on
optimizing the session. The greatest area for improvement at the session level usually
involves tweaking memory cache settings. The Aggregator (without sorted ports),
Joiner, Rank, Sorter and Lookup transformations (with caching enabled) use caches.
The PowerCenter Server uses index and data caches for each of these
transformations. If the allocated data or index cache is not large enough to store the
data, the PowerCenter Server stores the data in a temporary disk file as it processes
the session data. Each time the PowerCenter Server pages to the temporary file,
performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining
the performance details. The transformation_readfromdisk or
transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or
Joiner transformation indicate the number of times the PowerCenter Server must page
to disk to process the transformation. Index and data caches should both be sized
according to the requirements of the individual lookup. The sizing can be done using
the estimation tools provided in the Transformation Guide, or through observation of
actual cache sizes in the session caching directory.
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used by
the PowerCenter Server for these files is PM[type of transformation][session instance id number]_
[transformation instance id number]_[partition index].dat or .idx. For example, an aggregate data cache
file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a
constraint.
Informatica recommends that the cache directory be local to the PowerCenter
Server. A RAID 0 arrangement, which gives maximum performance with no redundancy, is well-suited to
these cache files since they are temporary.
If the PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow session
performance, the RAM allocated needs to be available on the server. If the server
doesn’t have available RAM and uses paged memory, your session is again accessing
the hard disk. In this case, it is more efficient to allow PowerCenter to page the data
rather than the operating system. Adding additional memory to the server is, of course,
the best solution.
Refer to Session Caches in the Workflow Administration Guide for detailed information
on determining cache sizes.
The PowerCenter Server writes to the index and data cache files during a session in
the following cases:
When a session is running, the PowerCenter Server writes a message in the session
log indicating the cache file name and the transformation name. When a session
completes, the DTM generally deletes the overflow index and data cache files.
However, index and data files may exist in the cache directory if the session is
configured for either incremental aggregation or to use a persistent lookup cache.
Cache files may also remain if the session does not complete successfully.
PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you
can configure the Integration Service to automatically calculate cache memory settings
at run time. When you run a session, the Integration Service allocates buffer memory to
the session to move the data from the source to the target. It also creates session caches in memory.
The Integration Service can determine cache memory requirements for the Lookup,
Aggregator, Rank, Joiner, Sorter, and XML target transformations.
You can configure auto for the index and data cache size in the transformation
properties or on the Mappings tab of the session properties.
Configuring maximum memory limits allows you to ensure that you reserve a
designated amount or percentage of memory for other processes. You can configure
the memory limit as a numeric value and as a percent of total memory. Because
available memory varies, the Integration Service bases the percentage value on the
total memory on the Integration Service process machine.
For example, you configure automatic caching for three Lookup transformations in a
session. Then, you configure a maximum memory limit of 500MB for the session. When
you run the session, the Integration Service divides the 500MB of allocated memory
among the index and data caches for the Lookup transformations.
When you configure a maximum memory value, the Integration Service divides
memory among transformation caches based on the transformation type.
When you configure both a numeric value and a percentage, the Integration Service
compares the values and uses the lower value as the maximum memory limit.
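The rule reduces to a one-line calculation, sketched here with invented figures:

```python
def session_memory_limit(numeric_limit, percent_limit, total_memory):
    """When both a numeric value and a percentage are configured, the lower
    of the two becomes the maximum memory limit; the percentage is taken
    of the total memory on the Integration Service process machine."""
    from_percent = total_memory * percent_limit / 100
    return min(numeric_limit, from_percent)
```

For example, a 500 MB numeric limit and a 10 percent limit resolve to 500 MB on an 8 GB machine but to 400 MB on a 4 GB machine.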
When you configure automatic memory settings, the Integration Service specifies a
minimum memory allocation for the index and data caches. The Integration Service
allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for
each transformation instance. If you configure a maximum memory limit that is less
than the minimum value for an index or data cache, the Integration Service overrides
this value and uses the minimum allocations instead.
When you run a session on a grid and you configure Maximum Memory Allowed for
Auto Memory Attributes, the Integration Service divides the allocated memory among
all the nodes in the grid. When you configure Maximum Percentage of Total Memory
Allowed for Auto Memory Attributes, the Integration Service allocates the specified
percentage of memory on each node in the grid.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
● Allocate at least enough space to hold at least one row in each aggregate
group.
● Remember that you only need to configure cache memory for an Aggregator
transformation that does not use sorted ports. The PowerCenter Server uses
Session Process memory to process an Aggregator transformation with sorted
ports, not cache memory.
● Incremental aggregation can improve session performance. When it is used,
the PowerCenter Server saves index and data cache information to disk at the
end of the session. The next time the session runs, the PowerCenter Server
uses this historical information to perform the incremental aggregation. The
PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and
saves them to the cache directory. Mappings that have sessions which use
incremental aggregation should be set up so that only new detail records are
read with each subsequent run.
● When configuring Aggregate data cache size, remember that the data cache
holds row data for variable ports and connected output ports only. As a result,
the data cache is generally larger than the index cache. To reduce the data
cache size, connect only the necessary output ports to subsequent
transformations.
Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads
from master and detail sources concurrently and builds index and data caches based
on the master rows. The PowerCenter Server then performs the join based on the
detail source data and the cache data.
The number of rows the PowerCenter Server stores in the cache depends on the configured index and
data cache sizes.
After the memory caches are built, the PowerCenter Server reads the rows from the
detail source and performs the joins. The PowerCenter Server uses the index cache to
test the join condition. When it finds source data and cache data that match, it retrieves
row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
● Persistent caches should be used when lookup data is not expected to change
often. Lookup cache files are saved after a session with a persistent cache
lookup is run for the first time. These files are reused for subsequent runs,
bypassing the querying of the database for the lookup. If the lookup table
changes, you must be sure to set the Recache from Database option to
ensure that the lookup cache files are rebuilt. You can also delete the cache
files before the session run to force the session to rebuild the caches.
● Lookup caching should be enabled for relatively small tables. Refer to the Best
Practice Tuning Mappings for Better Performance to determine when lookups
should be cached. When the Lookup transformation is not configured for
caching, the PowerCenter Server queries the lookup table for each input row.
The result of the lookup query and processing is the same, regardless of
whether the lookup table is cached or not. However, when the transformation
is configured to not cache, the PowerCenter Server queries the lookup table
instead of the lookup cache. Using a lookup cache can usually increase
session performance.
● Just as with a Joiner, the PowerCenter Server aligns all data for lookup caches
on an eight-byte boundary, which helps increase the performance of the
lookup.
The Integration Service can also determine the memory requirements for the buffer memory.
You can also configure DTM buffer size and the default buffer block size in the session
properties. When the PowerCenter Server initializes a session, it allocates blocks of memory to hold
source and target data.
To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the buffer
size and/or the buffer block size based on the default settings, to create the required
number of session blocks.
If there are XML sources or targets in the mappings, use the number of groups in the
XML source or target in the total calculation for the total number of sources and targets.
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter
Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer
memory to create the internal data structures and buffer blocks used to bring data into
and out of the server. When the DTM buffer memory is increased, the PowerCenter
Server creates more buffer blocks, which can improve performance during momentary
slowdowns.
If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer
pool size may improve performance.
Using DTM buffer memory allocation generally causes performance to improve initially
and then level off. (Conversely, it may have no impact on source or target-bottlenecked
sessions at all and may not have an impact on DTM bottlenecked sessions). When the
DTM buffer memory allocation is increased, you need to evaluate the total memory
available on the PowerCenter Server. If a session is part of a concurrent batch, the
combined DTM buffer memory allocated for the sessions or batches must not exceed
the total memory for the PowerCenter Server system. You can increase the DTM buffer
size in the Performance settings of the Properties tab.
The PowerCenter Server can process multiple sessions in parallel and can also
process multiple partitions of a pipeline within a session. If you have a symmetric multi-
processing (SMP) platform, you can use multiple CPUs to concurrently process session
data or partitions of data. This provides improved performance since true parallelism is
achieved. On a single processor platform, these tasks share the CPU, so there is no
parallelism.
Partitioning Sessions
When you create or edit a session, you can change the partitioning information for each
pipeline in a mapping. If the mapping contains multiple pipelines, you can specify
multiple partitions in some pipelines and single partitions in others. Keep the following
attributes in mind when specifying partitioning information for a pipeline:
If you find that your system is under-utilized after you have tuned the application,
databases, and system for maximum single-partition performance, you can reconfigure
your session to have two or more partitions to make your session utilize more of the
hardware. Use the following tips when you add partitions to a session:
● Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before you add each partition.
● Set DTM buffer memory. For a session with n partitions, this value should be
at least n times the value for the session with one partition.
● Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the number of cached values property of the
Sequence Generator transformation. If you must set this value to a value
greater than zero, make sure it is at least n times the original value for the session
with one partition.
One method of resolving target database bottlenecks is to increase the commit interval.
Each time the target database commits, performance slows. If you increase the commit
interval, the number of times the PowerCenter Server commits decreases and
performance may improve.
When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate the larger
number of rows. One of the major reasons that Informatica set the default commit
interval to 10,000 is to accommodate the default rollback segment / extent size of most
databases. If you increase both the commit interval and the database rollback
segments, you should see an increase in performance. In some cases though, just
increasing the commit interval without making the appropriate database changes may
cause the session to fail part way through (i.e., you may get a database error like
"unable to extend rollback segments" in Oracle).
If a session runs with high precision enabled, disabling high precision may improve
session performance.
To reduce the amount of time spent writing to the session log file, set the tracing level
to Terse. At this tracing level, the PowerCenter Server does not write error messages
or row-level information for reject data. However, if terse is not an acceptable level of
detail, you may want to consider leaving the tracing level at Normal and focus your
efforts on reducing the number of transformation errors. Note that the tracing level must
be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to verbose initialization or verbose data.
However, the verbose initialization and verbose data logging options significantly affect
the session performance. Do not use Verbose tracing options except when testing
sessions. Always remember to switch tracing back to Normal after the testing is
complete.
The session tracing level overrides any transformation-specific tracing levels within the
mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
Pushdown Optimization
You can push transformation logic to the source or target database using pushdown
optimization. The amount of work you can push to the database depends on the
pushdown optimization configuration, the transformation logic, and the mapping and
session configuration.
When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and it processes any transformation logic
that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
Pushdown Optimization.
When you run a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target back toward the source until it reaches a transformation that it
cannot push to the target database.
To use full pushdown optimization, the source and target must be on the same
database. When you run a session configured for full pushdown optimization, the
Integration Service analyzes the mapping from source to target and analyzes each
transformation in the pipeline until it analyzes the target. It generates and executes the
SQL on the sources and targets.
When you run a session for full pushdown optimization, the database must run a long
transaction if the session contains a large quantity of data. Consider the following
database performance issues when you generate a long transaction:
The Rank transformation cannot be pushed to the database. If you configure the
session for full pushdown optimization, the Integration Service pushes the Source
Qualifier transformation and the Aggregator transformation to the source. It pushes the
Expression transformation and target to the target database, and it processes the Rank
transformation. The Integration Service does not fail the session if it can push only part
of the transformation logic to the database and the session is configured for full
optimization.
Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias
assigned to a group of nodes that allows you to automate the distribution of workflows
and sessions across nodes.
When you run a session on a grid, you improve scalability and performance by
distributing session threads to multiple DTM processes running on nodes in the grid.
To run a workflow or session on a grid, you assign resources to nodes, create and
configure the grid, and configure the Integration Service to run on a grid.
When you run a session on a grid, the master service process runs the workflow and
workflow tasks, including the Scheduler. Because it runs on the master service process
node, the Scheduler uses the date and time for the master service process node to
start scheduled workflows. The Load Balancer distributes Command tasks as it does
when you run a workflow on a grid. In addition, when the Load Balancer dispatches a
Session task, it distributes the session threads to separate DTM processes.
The master service process starts a temporary preparer DTM process that fetches the
session and prepares it to run. After the preparer DTM process prepares the session, it
acts as the master DTM process, which monitors the DTM processes running on other
nodes.
The worker service processes start the worker DTM processes on other nodes. The
worker DTM runs the session. Multiple worker DTM processes running on a node might
be running multiple sessions or multiple partition groups from a single session
depending on the session configuration.
For example, you run a workflow on a grid that contains one Session task and one
Command task. You also configure the session to run on the grid.
When the Integration Service process runs the session on a grid, it performs the
following tasks:
● On Node 1, the master service process runs workflow tasks. It also starts a
For information about configuring and managing a grid, refer to the PowerCenter
Administrator Guide and to the best practice PowerCenter Enterprise Grid Option.
For information about how the DTM distributes session threads into partition groups,
see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.
Challenge
Tuning SQL overrides and SQL queries within Source Qualifier objects can improve performance when selecting data from
source database tables, which positively impacts overall session performance. This Best Practice explores ways to
optimize a SQL query within the Source Qualifier object. The tips here can be applied to any PowerCenter mapping. While
the SQL discussed here was executed in Oracle 8 and above, the techniques are generally applicable; specifics for other
RDBMS products (e.g., SQL Server, Sybase) are not included.
Description
Optimizing SQL queries is perhaps the most complex part of performance tuning. When tuning SQL, the developer must
look at the type of execution being forced by hints, the execution plan, the indexes on the tables referenced in the query,
the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.
When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In
Oracle, the NVL function is used, while in DB2, the COALESCE function is used.
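For example, to substitute a single space for NULL values before comparing or joining on a column (the table and column names here are illustrative):

```sql
-- Oracle: substitute a value for NULL with NVL
SELECT NVL(MIDDLE_NAME, ' ') FROM EMPLOYEE;

-- DB2: the equivalent using COALESCE
SELECT COALESCE(MIDDLE_NAME, ' ') FROM EMPLOYEE;
```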
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around
this limitation.
You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of
the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain:
the logic now lives in two places, an Informatica mapping and a database view.
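A sketch of this approach, using hypothetical table and view names:

```sql
-- Create a view that pre-joins two tables (names are illustrative)
CREATE VIEW V_ORDER_CUSTOMER AS
SELECT O.ORDER_ID, O.ORDER_DATE, C.CUSTOMER_NAME
  FROM ORDERS O, CUSTOMERS C
 WHERE O.CUSTOMER_ID = C.CUSTOMER_ID;

-- The source qualifier then selects from the view as if it were a table
SELECT ORDER_ID, ORDER_DATE, CUSTOMER_NAME
  FROM V_ORDER_CUSTOMER;
```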
You can use in-line views, which are SELECT statements in the FROM or WHERE clause. These can help focus the query on
a subset of data in the table and can work more efficiently than a traditional join. Here is an example of an in-line view in
the FROM clause:
SELECT
N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
     (SELECT R.DOSE_REGIMEN_ID
        FROM EXPERIMENT_PARAMETER R,
             NEW_GROUP_TMP TMP
       WHERE R.EXPERIMENT_ID = TMP.EXPERIMENT_ID  -- join keys shown are illustrative
     ) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID
Surmounting the Single SQL Statement Limitation in DB2: Common Table Expressions and the WITH Clause
The Common Table Expression (CTE) stores intermediate results, much as a temp table does, for the duration of the SQL
statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places
in the query by specifying the query name. For example (the main SELECT shown here is illustrative):
WITH maxseq AS (SELECT MAX(seq_no) AS seq_no FROM data_load_log WHERE load_status = 'P')
SELECT d.* FROM data_load_log d, maxseq m WHERE d.seq_no = m.seq_no
A WITH clause can also express recursive SQL: the CTE selects seed rows FROM a PARENT_CHILD table, then uses
UNION ALL to join the table back to the CTE. In PARENT_CHILD, the PARENT_ID in any particular row refers to the
PERSON_ID of the parent. This is a simplification, since everyone has two parents, but it illustrates the idea. A level
counter in the WHERE clause prevents infinite recursion.
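A complete recursive WITH query over such a PARENT_CHILD table might look like the following sketch (column and table names as described above; the starting PERSON_ID and the depth limit are arbitrary choices):

```sql
WITH ancestors (person_id, parent_id, level) AS (
  -- seed: start from one person
  SELECT person_id, parent_id, 1
    FROM PARENT_CHILD
   WHERE person_id = 1001
  UNION ALL
  -- recursive step: walk up to each parent
  SELECT p.person_id, p.parent_id, a.level + 1
    FROM PARENT_CHILD p, ancestors a
   WHERE p.person_id = a.parent_id
     AND a.level < 20   -- the level counter prevents infinite recursion
)
SELECT * FROM ancestors;
```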
The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single condition,
since DECODE was the only legal way to test a condition in earlier Oracle versions. For example (the tested values are
illustrative):
In Oracle:
SELECT DECODE(SALARY, 0, 'UNPAID', 'PAID') AS COMMENT
FROM EMPLOYEE
In DB2:
SELECT CASE WHEN SALARY = 0 THEN 'UNPAID' ELSE 'PAID' END AS COMMENT
FROM EMPLOYEE
It is often useful to get a small sample of the data from a long-running query that returns a large data set. The sampling
logic can be commented out or removed before the query is put into general use.
DB2 uses the FETCH FIRST n ROWS ONLY clause to do this, as follows:
SELECT * FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY
In Oracle, the ROWNUM pseudo-column serves the same purpose:
SELECT * FROM EMPLOYEE WHERE ROWNUM <= 10
Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL
return all rows.
Oracle uses the system variable SYSDATE for the current date and time, and its date functions let you display the date
and/or time however you want.
Here is an example that returns yesterday's date in Oracle (formatted as mm/dd/yyyy):
SELECT TO_CHAR(SYSDATE - 1, 'MM/DD/YYYY')
FROM EMPLOYEE
Hints affect the way a query or sub-query is executed and can therefore provide a significant performance increase in
queries. Hints cause the database engine to relinquish control over how a query is executed, giving the developer control
over the execution. Hints are always honored unless execution is not possible, and because the database engine does not
evaluate whether a hint makes sense, developers must be careful in implementing hints. Oracle has many types of hints:
optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and
access method hints are the most common.
In recent versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer available. It
was in rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2,
however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using
older versions of Oracle, however, proper INDEX hints should help performance.
The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below
provides a partial list of optimizer hints and descriptions.
Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts
while the nested loop involves no sorts. The hash join also requires memory to build the hash table.
Hash joins are most effective when the amount of data is large and one table is much larger than the other.
ALL_ROWS: The database engine creates an execution plan that optimizes for throughput. Favors full table scans; the
optimizer favors sort/merge joins.
FIRST_ROWS: The database engine creates an execution plan that optimizes for response time. It returns the first row of
data as quickly as possible. Favors index lookups; the optimizer favors nested loops.
CHOOSE: The database engine creates an execution plan that uses cost-based execution if statistics have been run on
the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty
tables, the engine still uses cost-based execution, but performance is extremely poor.
RULE: The database engine creates an execution plan based on a fixed set of rules.
HASH: The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.
Access method hints control how data is accessed. These hints are used to force the database engine to use indexes,
hash scans, or row id scans. The following table provides a partial list of access method hints.
ROWID: The database engine performs a scan of the table based on ROWIDs.
INDEX: DO NOT USE in Oracle 9.2 and above. The database engine performs an index scan of a specific table, but in 9.2
and above, the optimizer does not use any indexes other than those mentioned.
USE_CONCAT: The database engine converts a query with an OR condition into two or more queries joined by a
UNION ALL statement.
For example, a hint is placed in a comment immediately after the SELECT keyword (illustrative queries):
SELECT /*+ ALL_ROWS */ empno FROM emp;
SELECT /*+ FIRST_ROWS */ empno FROM emp;
The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be
accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best
SQL execution, it relies upon optimized Oracle parameters and up-to-date database statistics. If these statistics are
not maintained, cost-based query execution can degrade over time. When that happens, rule-based execution can actually
provide better execution times.
The developer can determine which type of execution is being used by running an explain plan on the SQL query in
question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results
of that statement are then used as input by the next level statement.
Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible. Full
table scans cause degradation in performance.
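In Oracle, a plan can be generated and displayed as follows (the queried table is illustrative; DBMS_XPLAN is available in Oracle 9i and above):

```sql
EXPLAIN PLAN FOR
SELECT * FROM EMPLOYEE WHERE SALARY > 50000;

-- display the most recent plan recorded in PLAN_TABLE
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```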
Information provided by the explain plan can be enhanced using the SQL Trace utility, which provides additional
information such as CPU and elapsed times, physical reads, and rows processed for each statement.
The SQL Trace utility adds value because it definitively shows the statements that are using the most resources, and it
can immediately show the change in resource consumption after a statement has been tuned and a new explain plan has
been run.
Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should
compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that
are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once
implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being
used, it is possible to force the query to use it by using an access method hint, as described earlier.
The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine
whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for
additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme
cases, the entire SQL statement may need to be re-written to become more efficient.
SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:
● EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent
query and cannot take advantage of indexes, while the IN clause is executed once and does use indexes, and
may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:
Situation: an index exists in the parent query that matches the sub-query columns.
With EXISTS: possibly no benefit, since EXISTS cannot use the index.
With IN: yes, IN uses the index.
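To illustrate the rewrite recommended above (table and column names are assumptions):

```sql
-- Correlated EXISTS: evaluated once per row of CUSTOMERS
SELECT C.CUSTOMER_ID
  FROM CUSTOMERS C
 WHERE EXISTS (SELECT 1
                 FROM ORDERS O
                WHERE O.CUSTOMER_ID = C.CUSTOMER_ID);

-- Equivalent IN form: the sub-query runs once and can use an index
SELECT C.CUSTOMER_ID
  FROM CUSTOMERS C
 WHERE C.CUSTOMER_ID IN (SELECT O.CUSTOMER_ID FROM ORDERS O);
```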
● Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way
can improve performance by more than 100 percent.
● Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup
objects within the mapping to fill in the optional information.
Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the
incremental ETL load.
Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work
from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
Outer joins limit the join order that the optimizer can use. Don’t use them needlessly.
Anti-Joins with NOT IN, NOT EXISTS, MINUS/EXCEPT, or OUTER JOIN
● Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this
may not be a problem on small tables, it can become a performance drain on large tables.
● In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.
● Also consider using outer joins with IS NULL conditions for anti-joins.
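The anti-join forms discussed above can be sketched as follows, here finding customers with no orders (table and column names are illustrative):

```sql
-- NOT IN: forces a full table scan on large tables
SELECT CUSTOMER_ID FROM CUSTOMERS
 WHERE CUSTOMER_ID NOT IN (SELECT CUSTOMER_ID FROM ORDERS);

-- MINUS (Oracle) / EXCEPT (DB2): usually a better anti-join
SELECT CUSTOMER_ID FROM CUSTOMERS
MINUS
SELECT CUSTOMER_ID FROM ORDERS;

-- Outer join with IS NULL (Oracle outer-join syntax)
SELECT C.CUSTOMER_ID
  FROM CUSTOMERS C, ORDERS O
 WHERE C.CUSTOMER_ID = O.CUSTOMER_ID (+)
   AND O.CUSTOMER_ID IS NULL;
```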
Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change
based on the database engine.
● In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source
qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders
entered into the system since the previous load of the database, then, in the product information lookup, only
select the products that match the distinct product IDs in the incremental sales orders.
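For instance, a product-information lookup override might be limited as follows (table and column names are illustrative):

```sql
-- Return only the products referenced by the incremental sales orders
SELECT P.PRODUCT_ID, P.PRODUCT_NAME
  FROM PRODUCTS P
 WHERE P.PRODUCT_ID IN (SELECT DISTINCT S.PRODUCT_ID
                          FROM SALES_ORDER_STG S);
```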
● Avoid range lookups, that is, a SELECT that uses a BETWEEN in the WHERE clause with limit values retrieved
from a table. Here is an example:
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
     ETL_AUDIT_LOG L
WHERE L.LOAD_DATE_PREV IN
      (SELECT MAX(Y.LOAD_DATE_PREV) FROM ETL_AUDIT_LOG Y)
  -- range lookup: the BETWEEN limits are retrieved from ETL_AUDIT_LOG
  AND R.GCW_LOAD_DATE BETWEEN L.LOAD_DATE_PREV AND L.LOAD_DATE
The work-around is to use an in-line view in the FROM clause to apply the lower limit of the range, and join it to the main
query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit within the in-line view.
This is likely to reduce the throughput time from hours to seconds.
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
(SELECT
R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1
-- the in-line view applies the lower limit and orders by it
WHERE R1.LOAD_DATE >= (SELECT MAX(Y.LOAD_DATE_PREV) FROM ETL_AUDIT_LOG Y)
ORDER BY R1.LOAD_DATE) R,
ETL_AUDIT_LOG L
WHERE R.LOAD_DATE <= L.LOAD_DATE
System Resources
● CPU
● Load Manager shared memory
● DTM buffer memory
● Cache memory
When tuning the system, evaluate the following considerations during the implementation process.
● Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of
network hops between the PowerCenter Server and the databases.
● Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
● When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the
PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to
store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can
potentially slow session performance.
● Check hard disks on related machines. Slow disk access on source and target databases, source and target file
systems, and the machine running the PowerCenter Server can slow session performance.
Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the
many available alternatives is the best implementation choice for the particular database. The project team must have a
thorough understanding of the data, database, and desired use of the database by the end-user community prior to
beginning the physical implementation process. Evaluate the following considerations during the implementation process.
● Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and
primary key to foreign key relationships, and also eliminating join tables.
● Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a
degradation of the time required to load data rows into the target tables. A recommended approach is to use
carefully written pre-session scripts to drop indexes before the load and post-session scripts to rebuild them afterward.
● Constraints. Avoid constraints if possible; instead, enforce integrity by incorporating that additional logic in the
mappings.
● Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for
queries (temporary) and INSERTs and UPDATES (rollback). The rollback area must be large enough to hold all
the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions,
particularly on initial loads.
● OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to
determine after the fact. DBAs must work with the System Administrator to ensure all the database processes
have the same priority.
● Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5
(striping with parity) to improve disk I/O throughput.
● Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by placing the striped disks on
separate disk controllers.
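The index-management approach described above can be sketched as a pair of pre- and post-session scripts (the index and table names are illustrative):

```sql
-- Pre-session script: drop the index before the bulk load
DROP INDEX IDX_SALES_FACT_DATE;

-- Post-session script: rebuild it after the load completes
CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (LOAD_DATE);
```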
Challenge
Setting the Registry to ensure consistent client installations, resolve potential missing or invalid
license key issues, and change the Server Manager Session Log Editor to your preferred editor.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across the domain,
the Administrator can create a single "official" set of data sources, then use the Repository
Manager to export that connection information to a file. You can then distribute this file and import
the connection information for each client machine.
Solution:
● From Repository Manager, choose Export Registry from the Tools drop-down menu.
● For all subsequent client installs, simply choose Import Registry from the Tools drop-down
menu.
The “missing or invalid license key” error occurs when attempting to install PowerCenter Client
tools on NT 4.0 or Windows 2000 with a userid other than Administrator.
This problem also occurs when the client software tools are installed under the Administrator
account, and a user with a non-administrator ID subsequently attempts to run the tools. The user
who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the
PowerCenter Client tools. Instead, the software displays the message indicating that the license
key is missing or invalid.
Solution:
● While logged in as the installation user with administrator authority, use regedt32 to edit
the registry.
● Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/.
● From the menu bar, select Security/Permissions, and grant read access to the users that
should be permitted to use the PowerMart Client. (Note that the registry entries for both
PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and
PowerMart Client tools.)
For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the
wordpad.exe can be found in the path statement. Instead, a window appears the first time a
session log is viewed from the PowerCenter Server Manager prompting the user to enter the full
path name of the editor to be used to view the logs. Users often set this parameter incorrectly and
must access the registry to change it.
Solution:
● While logged in as the installation user with administrator authority, use regedt32 to go into
the registry.
● Move to registry path location: HKEY_CURRENT_USER Software\Informatica\PowerMart
Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar,
select View Tree and Data.
● Select the Log File Editor entry by double clicking on it.
● Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.
exe).
● Select Registry --> Exit from the menu bar to save the entry.
For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow
Monitor.
The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for
workflow and session logs.
Other tools, in addition to the PowerCenter client tools, are often needed during development and
testing. For example, you may need a tool such as Enterprise manager (SQL Server) or Toad
(Oracle) to query the database. You can add shortcuts to executable programs from any client
tool’s ‘Tools’ drop-down menu to provide quick access to these programs.
Solution:
Choose ‘Customize’ under the Tools menu and add a new item. Once it is added, browse to find
the executable it is going to call (as shown below).
In the following example, TOAD can be called quickly from the Repository Manager tool.
In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type
‘bulk’, although this was not necessarily what was desired and could cause the session to fail under
certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow
Manager to choose the default load type to be either 'bulk' or 'normal'.
● In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
● Click the button for either 'normal' or 'bulk', as desired.
● Click OK, then close and open the Workflow Manager tool.
After this, every time a session is created, the target load type for all relational targets will default to
your choice.
The Repository Navigator window sometimes becomes undocked. Docking it again can be
frustrating because double clicking on the window header does not put it back in place.
Solution:
● To get the Window correctly docked, right-click in the white space of the Navigator
window.
● Make sure that ‘Allow Docking’ option is checked. If it is checked, double-click on the title
bar of the Navigator Window.
If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears,
try the following solutions to recover it:
Note: If none of the above solutions resolve the problem, you may want to try the following solution
using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause
serious problems that may require reinstalling the operating system. Informatica does not
guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the
Registry Editor at your own risk.
Solution:
Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can
often be resolved as follows:
PowerCenter Version    Folder Name
7.1                    7.1
7.1.1                  7.1.1
7.1.2                  7.1.1
7.1.3                  7.1.1
7.1.4                  7.1.1
8.1                    8.1
● Open the key of the affected tool (for the Repository Manager open Repository Manager
Options).
● Export all of the Toolbars sub-folders and rename them.
● Re-open the client tool.
The PowerCenter client tools allow you to customize the look and feel of the display. Here are a
few examples of what you can do.
Designer
Changing the background workspace colors can help identify which workspace is currently
open. For example, changing the Source Analyzer workspace color to green or the Target Designer
workspace to purple to match their respective metadata definitions helps to identify the workspace.
Alternatively, click the Select Theme button to choose a color theme, which displays background
colors based on predefined themes.
You can modify the Workflow Manager using the same approach as the Designer tool.
From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or
customize each element individually.
Workflow Monitor
You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors
for one task to give it a dimensional appearance, which can make the chart easier to read.
To modify the Gantt Chart appearance, go to the Menu bar and select Tools > Options > Gantt Chart.
Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable macros when you start
Data Stencil. If the security level for Visio is set to High or Very High, you cannot run the Data Stencil macros.
To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium.
When you start Data Stencil, Visio displays a security warning about viruses in macros. Click
Enable Macros to enable the macros for Data Stencil.
Challenge
Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings;
using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.
Description
Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To
edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties >
Edit.
Limit on Resilience Timeouts (Optional). Maximum amount of time (in seconds) that the service holds on to resources
for resilience purposes. This property places a restriction on clients that connect to the service; any resilience
timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived
from the domain-level settings.
Resilience Timeout (Optional). Period of time (in seconds) that the service tries to establish or reestablish a
connection to another service. If blank, the value is derived from the domain-level settings.
One configuration best practice is to properly configure and leverage the Integration Service (IS) process variables.
You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files
include run-time files, state of operation files, and session log files.
Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run
on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-
time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.
State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates
files to store the state of operations for the service. The state of operations includes information such as the active service requests,
scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover
operations from the point of interruption.
All Integration Service processes associated with an Integration Service must use the same shared location. However, each
Integration Service can use a separate location.
You must specify the directory path for each type of file. You specify the following directories using service process variables:
Each registered server has its own set of variables. The list is fixed, not user-extensible. The variables and their default values are:
$PMSessionLogDir $PMRootDir/SessLogs
$PMBadFileDir $PMRootDir/BadFiles
$PMCacheDir $PMRootDir/Cache
$PMTargetFileDir $PMRootDir/TargetFiles
$PMSourceFileDir $PMRootDir/SourceFiles
$PMExtProcDir $PMRootDir/ExtProc
$PMTempDir $PMRootDir/Temp
$PMSessionLogCount 0
$PMSessionErrorThreshold 0
$PMWorkflowLogCount 0
$PMWorkflowLogDir $PMRootDir/WorkflowLogs
$PMLookupFileDir $PMRootDir/LkpFiles
$PMStorageDir $PMRootDir/Storage
Starting with PowerCenter 8, all logging for services and sessions uses the Log Service and can only be viewed through
the PowerCenter Administration Console. However, it is still possible to get this information logged to files, similar to
previous versions, by setting Integration Service custom properties (undocumented server parameters) in the
Administration Console. To write all Integration Service logs (session, workflow, server, etc.) to files:
1. At the bottom of the custom properties list, enter the Name and Value of the custom property.
2. Click OK.
When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent
collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.
Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending
on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as
database servers.
The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and
system. The method used to change the parameter depends on the operating system:
Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set
these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note
that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after
configuring the UNIX kernel.
HP-UX
For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to
500. NCALL is referred to as NCALLOUT.
1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.
The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.
IBM AIX
None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.
SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform. Shared memory and semaphore parameters are set by adding lines such as the following to /etc/system:
set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value
Then reboot so that the new kernel parameters take effect:
# init 6
The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a
restart.
For example, to allow 128MB, type the following command:
# echo 134217728 > /proc/sys/kernel/shmmax
Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar
to the following:
kernel.shmmax = 134217728
This file is usually processed at startup, but sysctl can also be called explicitly later.
To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/
sem.h.
SuSE Linux
The default shared memory limits (shhmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a
restart. For example, to allow 512MB, type the following commands:
You can also put these commands into a script run at startup.
Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:
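The exact lines depend on site policy; a typical sketch of such /etc/profile additions (the values shown are assumptions, not Informatica-mandated settings) might be:

```
# /etc/profile additions: raise per-user limits for the PowerCenter user
ulimit -n 4096          # maximum open file descriptors
ulimit -s unlimited     # stack size
ulimit -v unlimited     # virtual memory
```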
With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size
at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source
to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank,
Joiner, and Lookup transformations, as well as Sorter and XML target caches.
Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer
memory and cache memory settings, consider the overall memory usage for best performance.
Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the
Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the
Integration Service disables automatic memory settings and uses default values.
Challenge
Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.
Description
Parameter files are a means of providing run time values for parameters and variables defined in a
workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple
workflows, sessions, and mappings, and can be created using text editors such as notepad, vi, shell
script, or an Informatica mapping.
Variable values are stored in the repository and can be changed within mappings. However, variable
values specified in parameter files supersede values stored in the repository. The values stored in the
repository can be cleared or reset using workflow manager.
A Parameter File contains the values for variables and parameters. Although a parameter file can
contain values for more than one workflow (or session), it is advisable to build a parameter file to contain
values for a single or logical group of workflows for ease of administration. When using the command
line mode to execute workflows, multiple parameter files can also be configured and used for a single
workflow if the same workflow needs to be run with different parameters.
If a session uses a parameter file, it must run on a node that has access to the file. You create a
resource for the parameter file and make it available to one or more nodes. When you configure the
session, you assign the parameter file resource as a required resource. The Load Balancer dispatches
the Session task to a node that has the parameter file resource. If no node has the parameter file
resource available, the session fails.
Depending on the database workload, you may want to use source-side, target-side, or full pushdown
optimization at different times. For example, you may want to use partial pushdown optimization during
the database's peak hours and full pushdown optimization when activity is low. Use the
$$PushdownConfig mapping parameter to apply different pushdown optimization configurations at different
times. The parameter lets you run the same session using different types of pushdown optimization.
When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute.
Define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in
the parameter file:
● None. The Integration Service processes all transformation logic for the session.
● Source. The Integration Service pushes part of the transformation logic to the source database.
● Source with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database.
● Target. The Integration Service pushes part of the transformation logic to the target database.
● Full. The Integration Service pushes all transformation logic to the database.
● Full with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database. The Integration Service pushes any remaining transformation logic to the
target database.
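As an illustration, a parameter file entry of the following shape (the folder, workflow, and session names are hypothetical) selects source-side pushdown for a given run:

```
[FIN_DW.WF:wf_Load_Orders.ST:s_Load_Orders]
$$PushdownConfig=Source
```

Changing the value to Full (or any of the other values above) before the next run switches the optimization type without modifying the session.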
Informatica recommends giving the parameter file the same name as the workflow, with a suffix of
“.par”. This helps in identifying and linking the parameter file to a workflow.
While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a
file specified at the workflow level always supersedes files specified at session levels.
Place parameter files in a directory that can be referenced using a server variable. This makes it
possible to move sessions and workflows to a different server without modifying workflow or session
properties. You can override the location and name of the parameter file specified in the session or
workflow when executing workflows via the pmcmd command.
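For example, a workflow can be started with an alternate parameter file using pmcmd (the service, domain, user, folder, and workflow names below are illustrative):

```
pmcmd startworkflow -sv IntSvs_01 -d Domain_Dev -u Administrator -p password \
    -f PROJ_DP -paramfile /app/param/wf_Client_Data_alt.par wf_Client_Data
```

The file named with -paramfile takes effect for that run only, overriding the parameter file configured in the workflow or session properties.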
The following points apply to both parameter and variable files; however, they are more relevant to
parameters and parameter files, and are therefore detailed accordingly.
To run a workflow with a different set of parameter values on each run, maintain one parameter file per
run (or edit the file between runs) and update the workflow properties to point to the appropriate file
before starting the workflow. Alternatively, run the workflow using pmcmd with the -paramfile option to
supply the file at run time.
Based on requirements, you can obtain the values for certain parameters from relational tables or
generate them programmatically. In such cases, the parameter files can be generated dynamically using
shell (or batch scripts) or using Informatica mappings and sessions.
Consider a case where a session has to be executed only on specific dates (e.g., the last working day of
every month), which are listed in a table. You can create the parameter file containing the next run date
(extracted from the table) in more than one way.
Method 1: Use a shell (or batch) script that queries the table and writes the next run date into the
parameter file before the workflow starts.
Method 2: Use an Informatica mapping that reads the table and writes the parameter file as a flat-file
target, run in a session that precedes the main session.
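Using the shell-script approach, a minimal sketch looks like this (the database query is stubbed with a fixed value, and the folder, workflow, and parameter names are hypothetical):

```shell
# Sketch: write the next run date into a parameter file.
# In practice NEXT_RUN_DATE would come from a query, e.g. via sqlplus -s.
NEXT_RUN_DATE="01/31/2025"                 # stubbed query result
PARAM_FILE="/tmp/wf_monthly_load.par"      # hypothetical file name

cat > "$PARAM_FILE" <<EOF
[PROJ_DP.WF:wf_Monthly_Load]
\$\$NEXT_RUN_DATE=$NEXT_RUN_DATE
EOF
```

The workflow is then started (manually, from a scheduler, or via pmcmd) with this freshly generated file.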
In some other cases, the parameter values change between runs, but the change can be incorporated
into the parameter files programmatically. There is no need to maintain separate parameter files for
each run.
Consider, for example, a service provider who receives the source data for each client from flat files
located in client-specific directories and writes processed data into a global database. The source data
structure, target data structure, and processing logic are all the same, but the log file for each client run
has to be preserved in a client-specific directory. The directory names include the client ID as part of the
directory structure (e.g., /app/data/Client_ID/).
You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one
parameter file per client. However, the number of parameter files may become cumbersome to manage
when the number of clients increases.
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow.
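One way to script this substitution (a sketch; the file names are illustrative) is with sed over a template file:

```shell
# Sketch: build a client-specific parameter file from a template by
# replacing the Client_ID and curdate placeholders.
CLIENT_ID="C1234"
CURDATE=$(date +%Y%m%d)

# Template matching the excerpt above (created here for illustration).
cat > /tmp/client_template.par <<'EOF'
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
EOF

sed -e "s|Client_ID|$CLIENT_ID|g" -e "s|curdate|$CURDATE|g" \
    /tmp/client_template.par > "/tmp/wf_client_data_$CLIENT_ID.par"
```

The generated file can then be passed to pmcmd with the -paramfile option for that client's run.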
The following text is an excerpt from a parameter file that contains service variables for one Integration
Service and parameters for four workflows:
[Service:IntSvs_01]
$PMSuccessEmailUser=pcadmin@mail.com
$PMFailureEmailUser=pcadmin@mail.com
[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix
[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc
Some financial and retail organizations use a fiscal calendar for accounting purposes. Use mapping
parameters to process the correct fiscal period.
For example, create a calendar table in the database with the mapping between the Gregorian calendar
and fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates.
Create another mapping with the logic to create a parameter file. Run the parameter file creation
session before running the main session.
The calendar table can be joined directly with the main table, but performance may be poor in some
databases, depending on how the indexes are defined. Using a parameter file avoids the join and its
index issues, and can result in better performance.
Mapping parameters and variables can be used to extract inserted or updated data since the previous
extract. Use mapping parameters or variables in the Source Qualifier to determine the beginning and
ending timestamps for extraction.
For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the
timestamp of the last row the Integration Service read in the previous session. Use this variable for the
beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source
filter.
Use the following filter to incrementally extract data from the database:
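A source filter of the following general form accomplishes this (the column name LAST_UPDATE_TS and the date format are illustrative; adjust them to the source system):

```
LAST_UPDATE_TS >  TO_DATE('$$PREVIOUS_RUN_DATE_TIME', 'MM/DD/YYYY HH24:MI:SS')
AND LAST_UPDATE_TS <= TO_DATE('$$$SessStartTime', 'MM/DD/YYYY HH24:MI:SS')
```

Rows touched after the session starts are picked up by the next run, since the variable advances to the session start time.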
Mapping parameters can be used to extract data from different tables using a single mapping. In some
cases the table name is the only difference between extracts.
For example, suppose there are two similar extracts from tables FUTURE_ISSUER and EQUITY_ISSUER; the
column names and data types within the tables are the same. Use a mapping parameter $$TABLE_NAME in
the Source Qualifier SQL override and create two parameter files, one for each table name. Run the
workflow using the pmcmd command with the corresponding parameter file, or create two sessions, each
with its own parameter file.
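For illustration, the two parameter files might look like this (the folder, workflow, and session names are hypothetical), with the Source Qualifier SQL override selecting FROM $$TABLE_NAME:

```
--- future_issuer.par ---
[FIN.WF:wf_Extract_Issuer.ST:s_Extract_Issuer]
$$TABLE_NAME=FUTURE_ISSUER

--- equity_issuer.par ---
[FIN.WF:wf_Extract_Issuer.ST:s_Extract_Issuer]
$$TABLE_NAME=EQUITY_ISSUER
```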
You can create variables within a workflow. When you create a variable in a workflow, it is valid only in
that workflow.
Use user-defined variables when you need to make a workflow decision based on criteria you specify.
For example, you create a workflow to load data to an orders database nightly. You also need to load a
subset of this data to headquarters periodically, every tenth time you update the local orders database.
Create separate sessions to update the local database and the one at headquarters. Use a user-defined
variable to determine when to run the session that updates the orders database at headquarters.
Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow
has run. Add a Start task and both sessions to the workflow. Place a Decision task after the session that
updates the local orders database. Set up the decision condition to check whether the number of
workflow runs is evenly divisible by 10, using the modulus (MOD) function. Create an Assignment task
to increment the $$WorkflowCount variable by one.
Link the Decision task to the session that updates the database at headquarters when the decision
condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.
When you configure workflow variables using conditions, the session that updates the local database
runs every time the workflow runs. The session that updates the database at headquarters runs every
10th time the workflow runs.
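The decision and assignment logic described above can be expressed as follows (expressions use the standard PowerCenter transformation language):

```
Decision task condition:     MOD($$WorkflowCount, 10) = 0
Assignment task expression:  $$WorkflowCount = $$WorkflowCount + 1
```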
Challenge
Description
The main factors that affect the sizing estimate are the input parameters that are based
on the requirements and the constraints imposed by the existing infrastructure and
budget. Other important factors include choice of Grid/High Availability Option, future
growth estimates and real time versus batch load requirements.
The required platform size to support PowerCenter depends upon each customer’s unique
infrastructure and processing requirements. The Integration Service allocates resources for individual
extraction, transformation, and load (ETL) jobs or sessions, and each session has its own resource
requirement. The resources required by the Integration Service depend upon the number of sessions,
the complexity of each session (i.e., what it does while moving data), and how many sessions run
concurrently.
This Best Practice discusses the relevant questions pertinent to estimating the platform
requirements.
TIP
An important concept regarding platform sizing is not to size your environment
too soon in the project lifecycle. A common mistake is to size the servers
before any ETL is designed or developed, and in many cases these platforms
are too small for the resulting system. Thus, it is better to analyze sizing
requirements after the data transformation processes have been well defined
during the design and development phases.
Environment Questions
● Is the overall ETL task currently being performed? If so, how is it being done,
and how long does it take?
● What is the total volume of data to move?
● What is the largest table (i.e., bytes and rows)? Is there any key on this table
that can be used to partition load sessions, if needed?
● How often does the refresh occur?
● Will refresh be scheduled at a certain time, or driven by external events?
● Is there a "modified" timestamp on the source table rows?
● What is the batch window available for the load?
● Are you doing a load of detail data, aggregations, or both?
● If you are doing aggregations, what is the ratio of source/target rows for the
largest result set? How large is the result set (bytes and rows)?
The answers to these questions provide an approximation guide to the factors that
affect PowerCenter's resource requirements. To simplify the analysis, focus on large
jobs that drive the resource requirement.
Processor
Note: for sizing purposes, each physical core is typically counted as 0.75 of a virtual CPU. For example,
4 CPUs with 4 cores each (16 cores) could be counted as 12 virtual CPUs.
Memory
System Recommendations
Minimum Server
1 Node, 4 CPUs and 16GB of memory (instead of the minimal requirement of 4GB
RAM) and 6 GB storage for PowerCenter binaries. A separate file system is
recommended for the infa_shared working file directory and it can be sized depending
on the work load profile.
Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless
the following conditions exist:
If any of these factors is true, additional storage should be allocated for the file system used by the
infa_shared directory. Typically, Informatica customers allocate a minimum of 100 to 200 GB for this
file system. Informatica recommends monitoring disk space on a regular basis or maintaining a script to
purge unused files.
Sizing Analysis
The basic goal is to size the server so that all jobs can complete within the specified
load window. You should consider the answers to the questions in the "Environment"
and "PowerCenter Server Sizing" sections to estimate the required number of sessions,
the volume of data that each session moves, and its lookup table, aggregation, and
heterogeneous join caching requirements. Use these estimates with the
recommendations in the "PowerCenter Resource Consumption" section to determine
the required number of processors, memory, and disk space to achieve the required
performance to meet the load window. PowerCenter provides an advanced level of automatic memory
configuration, with the option of manual configuration. The minimum required cache memory for each
active transformation in a mapping can be calculated using the “Cache Calculator” feature, which is
available for the Aggregator, Joiner, Rank, and Lookup transformations.
Note that the deployment environment often creates performance constraints that
hardware capacity cannot overcome. The Integration Service throughput is usually
constrained by one or more of the environmental factors addressed by the questions in
the "Environment" section. For example, if the data sources and target are both remote
from the PowerCenter server, the network is often the constraining factor. At some
point, additional sessions, processors, and memory may not yield faster execution
because the network (not the PowerCenter services) imposes the performance limit.
The hardware sizing analysis is highly dependent on the environment in which the
server is deployed. You need to understand the performance characteristics of the
environment before making any sizing conclusions.
Challenge
Description
There are two performance types that need to be considered when dealing with Oracle
CDC; latency of the data and restartability of the environment. Some of the factors that
impact these performance types are configurable within PowerExchange, while others
are not. These two performance types are addressed separately in this Best Practice.
The objective of latency performance is to minimize the amount of time that it takes for
a change made to the source database to appear in the target database. Some of the
factors that can affect latency performance are discussed below.
The optimal location for installing PowerExchange CDC is on the server that contains
the Oracle source database. This eliminates the need to use the network to pass data
between Oracle’s LogMiner and PowerExchange. It also eliminates the need to use
SQL*Net for this process and it minimizes the amount of data being moved across the
network. For best results, install the PowerExchange Listener on the same server as
the source database server.
Volume of Data
The volume of data that the Oracle Log Miner has to process in order to provide
changed data to PowerExchange can have a significant impact on performance. Bear
in mind that in addition to the changed data rows, other processes may be writing large
volumes of data to the Oracle redo logs. These include, but are not limited to:
Server Workload
Optimize the performance of the Oracle database server by reducing the number of
unnecessary tasks it is performing concurrently with the PowerExchange CDC
components. This may include a full review of the backup and restore schedules,
Oracle import and export processing and other application software utilized within the
production server environment.
The condense option for Oracle CDC provides only the required data by reducing the
collected data based on the Unit of Work information. This can prevent the transfer of
unnecessary data and save CPU and memory resources. In order to properly allocate
space for the files created by the condense process it is necessary to perform capacity
planning.
In determining the space required for the CDC data files it is important to know whether
before and after images (or just after images) are required. Also, the retention period
for these files must be considered. The retention period is defined in the
COND_CDCT_RET_P parameter in the dtlca.cfg file; the value specified for this parameter determines
how long the condense files are retained.
Before/After Image
Accurate capacity planning can be accomplished by running sample condense jobs for
a given number of source changes to determine the storage required. The size of files
created by the condense process can be used for projecting the actual storage required
in a production environment.
When Continuous Capture Extract is used for Oracle CDC, condense files can be
consumed with CAPXRT processing. Since the PowerCenter session waits for the
creation of new condense files (rather than stopping and restarting) the CPU and
memory impact of real-time processing is reduced. Similar to the Condense option,
there is a need to perform proper capacity planning for the files created as a result of
using the Continuous Capture Extract option.
The amount of time required to restart the PowerExchange CDC process should be
considered when determining performance. The PowerExchange CDC process will
need to be restarted whenever any of the following events occur:
A copy of the Oracle catalog must be placed on the archive log in order for LogMiner to
function correctly. The frequency of these copies is very site specific and it can impact
the amount of time that it takes the CDC process to restart.
There are several parameters that appear in the dbmover.cfg configuration file that can
assist in optimizing restart performance. These parameters are:
RSTRADV: The RSTRADV parameter specifies the number of seconds to wait after
receiving a Unit of Work (UOW) for a source table before advancing the restart tokens
by returning an “empty” UOW. This parameter is very beneficial in cases where the
frequency of updates on some tables is low in comparison to other tables.
CATINT: The CATINT parameter specifies the frequency in which the Oracle catalog is
copied to the archive logs. Since LogMiner needs a copy of the catalog on the archive
log to become operational, this is an important parameter as it will have an impact on
which archive log is used to restart the CDC process. When Oracle places a catalog
copy on the archive log, it will first flush all of the online redo logs to the archive logs
prior to writing out the catalog.
CATBEGIN: The CATBEGIN parameter specifies the time of day that the Oracle
catalog copy process should begin. The time of day that is specified in this parameter is
based on a 24 hour clock.
CATEND: The CATEND parameter specifies the time of day that the Oracle catalog
copy process should end. The time of day that is specified in this parameter is based
on a 24 hour clock.
It is important to code these parameters carefully, as they affect the amount of time it takes to restart
the PowerExchange CDC process.
The following is a sample of the dbmover.cfg parameters that affect the Oracle CDC process:
/********************************************************************/
/* Change Data Capture Connection Specifications
/********************************************************************/
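A fragment of this general shape might follow (the connection names and parameter values are illustrative; consult the PowerExchange reference for the exact syntax at your release level):

```
/* UOW Cleanser connection: advance restart tokens every 600 seconds
CAPI_CONNECTION=(NAME=ORAUOWC,
                 TYPE=(UOWC,CAPINAME=ORACDC,RSTRADV=600))
/* Oracle LogMiner connection: catalog copy interval and window
CAPI_CONNECTION=(NAME=ORACDC,
                 TYPE=(ORCL,CATINT=60,CATBEGIN=00:01,CATEND=23:59))
```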
Challenge
Install, configure, and performance tune PowerExchange for MS SQL Server Change Data Capture (CDC).
Description
PowerExchange Real-Time for MS SQL Server uses SQL Server publication technology to capture changed
data. To use this feature, Distribution must be enabled. The publisher database handles replication, while
the distributor database transfers the replicated data to PowerExchange, which is installed on the
distribution database server.
When looking at the architecture for SQL Server capture, we see that PowerExchange treats the SQL Server
Publication process as a “virtual” change stream. By turning the standard SQL Server publication process on,
SQL Server publishes changes to the SQL Server Distribution database. PowerExchange then reads the
changes from the Distribution database.
When Publication is used and the Distribution function is enabled, support for capturing changes for a table of
interest is dynamically activated through the registration of a source in the PowerExchange Navigator GUI (i.e.,
PowerExchange makes the appropriate calls to SQL Server automatically, via SQL-DMO objects).
CAPI_CONN_NAME=CAPIMSSC
CAPI_CONNECTION=(NAME=CAPIMSSC,
TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052))
Microsoft SQL Server Replication must be enabled using the Microsoft SQL Server Publication
technology. Informatica recommends enabling distribution through the SQL Server Management Console.
Multiple SQL Servers can use a single Distribution database. However, Informatica recommends using a
single Distribution database for Production and a separate one for Development/Test. In addition, for a
busy environment, placing the Distribution database on a separate server is advisable. Also, configure
the Distribution database for a retention period of 10 to 14 days.
3. Ensure that the MS SQL Server Agent Service is running.
4. Register sources using the PowerExchange Navigator.
Source tables must have a primary key. Note that system admin authority is required to register source
tables.
If you plan to capture large numbers of transaction updates, consider using a dedicated server as the
host of the distribution database. This avoids contention for CPU and disk storage with a production instance.
SQL Server CDC performance can sometimes be slow, particularly at low data volumes: it can take
approximately ten seconds for changes made at the source to take effect at the target. Two parameters
affect this polling behavior:
● POLWAIT
● PollingInterval
POLWAIT
This parameter specifies the number of seconds to wait between polls for new data after the end of the
current data has been reached.
● Specify this parameter in the dbmover.cfg file of the Microsoft SQL Distribution database machine.
● The default is ten seconds. Reducing this value to one or two seconds can improve the performance.
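For example, the CAPI_CONNECTION statement shown earlier could specify a two-second POLWAIT (an illustrative value):

```
CAPI_CONNECTION=(NAME=CAPIMSSC,
TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,POLWAIT=2))
```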
PollingInterval
Be aware, however, that the trade-off with the above options is, to some extent, increased overhead and
frequency of access to the source distribution database. To minimize overhead and frequency of access to the
database, increase the delay between the time an update is performed and the time it is extracted.
Increasing the value of POLWAIT in the dbmover.cfg file reduces the frequency with which the source distribution
database is accessed. In addition, increasing the value of Real-Time Flush Latency in the PowerCenter
application connection can also reduce the frequency of access to the source.
Challenge
Description
1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
You will need a valid license key in order to run any of the PowerExchange
components. This is a 44 or 64-byte key that uses hyphens every 4 bytes. For example:
1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1
The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F).
Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address.
They also control access to certain databases. You cannot successfully install PowerExchange without
a valid key for all required components.
Note: When copying software from one machine to another, you may encounter
license key problems since the license key is IP specific. Be prepared to deal with this
eventuality, especially if you are going to a backup site for disaster recovery testing. In
the case of such an event Informatica Product Shipping or Support can generate a
temporary key very quickly.
Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming
convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD or from
the extract of the zip file downloaded to this directory. Double click the file to
unzip its contents into this directory.
Step 3: Run the “MVS_Install” file. This displays the MVS Install Assistant.
Configure the IP Address, Logon ID, Password, HLQ, and Default volume
setting on the display screen. Also, enter the license key.
Be sure that the HLQ on this screen matches the HLQ of the allocated
RUNLIB (from step 2).
Save these settings and click Process. This creates the JCL libraries.
Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g.,
execution class, message class, etc.)
Step 5: Edit the SETUPBLK member in RUNLIB. Copy in the JOBCARD and
SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with
return code 0 (success) or 1, and a list of the needed installation jobs can be
found in the XJOBS member.
The installed PowerExchange Listener can be run as a normal batch job or as a started
task. Informatica recommends that it initially be submitted as a batch job: RUNLIB
(STARTLST). If it will be run as a started task then copy the PSTRTLST member in
runlib to the started task proclib.
If implementing change capture, start the PowerExchange Agent (as a started task):
/S DTLA
Note: The load libraries must be APF authorized prior to starting the Agent.
Step 1: Run the Windows or UNIX installation file in the software folder on the
installation CD and follow the prompts.
Step 3: Follow the wizard to complete the install and reboot the machine.
Step 1: Create a user for the PowerExchange installation on the UNIX box.
Step 4: Use the UNIX tar command to extract the files. The command is “tar -xvf
pwxxxx_vxxx.tar”.
Step 5: Update the logon profile with the correct path, library path, and
home environment variables.
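For example, the profile entries might look like this (the installation directory and variable layout are illustrative; use the actual install location):

```shell
# Add the PowerExchange installation to the environment.
PWX_HOME=/opt/informatica/pwx            # hypothetical install directory
PATH=$PATH:$PWX_HOME
LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}:$PWX_HOME
export PWX_HOME PATH LD_LIBRARY_PATH
```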
DRIVER=<install dir>/libdtlodbc.so
DESCRIPTION=MVS DB2
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T
There is a separate manual for each type of change data capture option. Each manual contains the
specifics of the following general steps; you will need to understand the appropriate options guide to
ensure success.
Step 1: APF authorize the .LOAD and the .LOADLIB libraries. This is required
for external security.
Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site
PROCLIB.
Step 3: After the Agent has been started, run job SETUP2.
Challenge
Assessing the business case for a project must consider both the tangible and intangible potential
benefits. The assessment should also validate the benefits and confirm that they are realistic to the
Project Sponsor and Key Stakeholders, in order to secure project funding.
Description
A Business Case should include both qualitative and quantitative measures of potential
benefits.
The Qualitative Assessment portion of the Business Case is based on the Statement
of Problem/Need and the Statement of Project Goals and Objectives (both generated in
Subtask 1.1.1 Establish Business Project Scope ) and focuses on discussions with the
project beneficiaries regarding the expected benefits in terms of problem alleviation,
cost savings or controls, and increased efficiencies and opportunities.
Many qualitative items are intangible, but you may be able to cite examples of the
potential costs or risks if the system is not implemented. An example may be the cost
of bad data quality resulting in the loss of a key customer or an invalid analysis
resulting in bad business decisions. Risk factors may be classified as business,
technical, or execution in nature. Examples of these risks are uncertainty of value or
the unreliability of collected information, new technology employed, or a major change
in business thinking for personnel executing change.
● Cash flow analysis - projects the positive and negative cash flows for the
anticipated life of the project. Typically, ROI measurements use the cash flow
formula to depict results.
The following are steps to calculate the quantitative business case or ROI:
Step 1 – Develop Enterprise Deployment Map. This is a model of the project phases
over a timeline, estimating as specifically as possible participants, requirements, and
systems involved. A data integration or migration initiative or amendment may require
estimating customer participation (e.g., by department and location), subject area and
type of information/analysis, numbers of users, numbers and complexity of target data
systems (data marts or operational databases, for example) and data sources, types of
sources, and size of data set. A data migration project may require customer
participation, legacy system migrations, and retirement procedures. The types of
estimations vary by project types and goals. It is important to note that the more details
you have for estimations, the more precise your phased solutions are likely to be. The
scope of the project should also be made known in the deployment map.
Step 3 – Calculate Net Present Value for all Benefits. Information gathered in this
step should help the customer representatives to understand how the expected
benefits are going to be allocated throughout the organization over time, using the
enterprise deployment map as a guide.
Step 4 – Define Overall Costs. Customers need specific cost information in order to
assess the dollar impact of the project. Cost estimates should address the following
fundamental cost components:
Step 5 – Calculate Net Present Value for all Costs. Use either actual cost estimates
or percentage-of-cost values (based on cost allocation assumptions) to calculate costs
for each cost component, projected over the timeline of the enterprise deployment map.
Actual cost estimates are more accurate than percentage-of-cost allocations, but much
more time-consuming. The percentage-of-cost allocation process may be valuable for
initial ROI snapshots until costs can be more clearly predicted.
Step 6 – Assess Risk, Adjust Costs and Benefits Accordingly. Review potential
risks to the project and make corresponding adjustments to the costs and/or benefits.
Some of the major risks to consider are:
● Scope creep, which can be mitigated by thorough planning and tight project
scope.
● Integration complexity, which may be reduced by standardizing on vendors
with integrated product sets or open architectures.
● Architectural strategy that is inappropriate.
● Current support infrastructure may not meet the needs of the project.
● Conflicting priorities may impact resource availability.
● Other miscellaneous risks from management or end users who may withhold
project support; from the entanglements of internal politics; and from
technologies that don't function as promised.
● Unexpected data quality, complexity, or definition issues often are discovered
late in the course of the project and can adversely affect effort, cost, and
schedule. This can be somewhat mitigated by early source analysis.
Step 7 – Determine Overall ROI. When all other portions of the business case are
complete, calculate the project's "bottom line". Determining the overall ROI is simply a
matter of subtracting the net present value of total costs from the net present value of
total benefits.
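As a worked illustration of this calculation (the cash flows and the 10% discount rate below are invented for the example):

```shell
# ROI = NPV(total benefits) - NPV(total costs), here over a
# three-year deployment map with illustrative yearly figures.
ROI=$(awk 'BEGIN {
    split("100 300 500", benefits)   # benefits per year
    split("400 100 100", costs)      # costs per year
    rate = 0.10                      # discount rate
    for (t = 1; t <= 3; t++) {
        npv_b += benefits[t] / (1 + rate) ^ t
        npv_c += costs[t] / (1 + rate) ^ t
    }
    printf "%.2f", npv_b - npv_c
}')
echo "ROI (net present value): $ROI"
```

A positive result indicates that discounted benefits outweigh discounted costs over the deployment timeline.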
Final Deliverable
The final deliverable of this phase of development is a complete business case that
documents both tangible (quantified) and intangible (non-quantified, but estimated) benefits
and risks, to be presented to the Project Sponsor and Key Stakeholders. This allows them
to review the Business Case in order to justify the development effort.
If your organization has the concept of a Project Office, which provides governance for
projects and priorities, much of this is often part of the original Project Charter, which
states items such as scope, initial high-level requirements, and key project stakeholders.
However, developing a full Business Case can validate any initial analysis and provide
additional justification. Additionally, the Project Office should provide guidance in
building and communicating the Business Case.
Once completed, the Project Manager is responsible for scheduling the review and
socialization of the Business Case.
Challenge
Requirements need to be gathered from business users who currently use and/or have
the potential to use the information being assessed. All input is important since the
assessment should encompass an enterprise view of the data rather than a limited
functional, departmental, or line-of-business view.
By gathering and documenting some of the key detailed data requirements, a solid
understanding of the business rules involved is reached. Certainly, not all elements can
be analyzed in detail, but doing so helps in getting to the heart of the business system
so you are better prepared when speaking with business and technical users.
Description
The following steps are key for successfully defining and prioritizing requirements:
Step 1: Discovery
Gathering business requirements is one of the most important stages of any data
integration project. Business requirements affect virtually every aspect of the data
integration project, from Project Planning and Management through to end-user delivery.
Data Profiling
Informatica Data Explorer (IDE) is an automated data profiling and analysis software
product that can be extremely beneficial in defining and prioritizing requirements. It
provides a detailed description of data content, structure, rules, and quality by profiling
the actual data that is loaded into the product.
Data profiling is crucial prior to beginning the development process: using a data
profiling tool can lower the risk and cost of the project and increase the chances of
success.
IDE provides the ability to promote collaboration through tags, notes, action items,
transformations, and rules. Profiling the information sets the framework for an
effective interview process with business and technical users.
Interviews
Business interviewees. Questioning during these sessions should include the following:
● Who are the stakeholders for this milestone delivery (IT, field business
analysts, executive management)?
● What are the target business functions, roles, and responsibilities?
● What are the key relevant business strategies, decisions, and processes (in
brief)?
● What information is important to drive, support, and measure success for
those strategies/processes? What key metrics? What dimensions for those
metrics?
● What current reporting and analysis is applicable? Who provides it? How is it
presented? How is it used? How can it be improved?
IT interviewees. The IT interviewees have a different flavor than the business user
community. Interviewing the IT team is generally very beneficial because it is
composed of data gurus who deal with the data on a daily basis. They can provide
great insight into data quality issues, help in the systematic exploration of legacy
source systems, and clarify business users' needs around critical reports. If you are
developing a prototype, they can help get things done quickly and address important
business reports. Questioning during these sessions should cover similar topics from a
technical perspective.
Facilitated Sessions
Facilitated sessions provide quick feedback by gathering all the people from the various
teams into a meeting and initiating the requirements process. You need a facilitator
who is experienced in these meetings to ensure that all the participants get a chance to
speak and provide feedback. Individual (or small group) interviews with high-level
management often have a focus and clarity of vision that may be hindered in large
meetings. Thus, it is extremely important to encourage all attendees to participate and
to prevent a small number from dominating the requirements process.
The Business Analyst, with the help of the Project Architect, documents the findings of
the interviews and facilitated sessions.
At this time, the Architect also develops the Information Requirements Specification to
clearly represent the structure of the information requirements. This document, based
on the business requirements findings, can facilitate discussion of informational details
and provide the starting point for the target model definition.
Concurrent with the validation of the business requirements, the Architect begins the
Functional Requirements Specification providing details on the technical requirements
for the project.
As general technical feasibility is compared to the prioritization from Step 2, the Project
Manager refines the project scope accordingly.
Final Deliverable
This is presented to the Project Sponsor for approval and becomes the first "increment"
or starting point for the Project Plan.
Challenge
Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required
for a data integration project. Overlooked items, such as full analysis, testing, or even specification
development, can create a sense of false optimism for the project. The WBS clearly depicts all of the
various tasks and subtasks required to complete a project, and most project time and resource estimates are
supported by it. A thorough, accurate WBS is critical for effective monitoring and also
facilitates communication with project sponsors and key stakeholders.
Description
The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of
related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various
resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves
as a starting point as well as a monitoring tool for the project.
One challenge in developing a thorough WBS is striking the correct balance between sufficient detail and
too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break the
tasks down to a manageable level. One general guideline is to keep task detail to a duration of at
least a day. It is also important to maintain a consistent level of detail across the project.
A well designed WBS can be extracted at a higher level to communicate overall project progress, as shown
in the following sample. The actual WBS for the project manager may, for example, be a level of detail
deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll
up a level or two to make things more clear.
Plan                     % Complete   Budget Hours   Actual Hours
Inventory (3 tables)         0%           60              0
Shipping (3 tables)          0%           60              0
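The roll-up described above lends itself to a simple tree structure. The sketch below is a hypothetical illustration (the task names, hours, and the effort-based percent-complete formula are assumptions, not part of the methodology) showing how subtask hours aggregate to a summary level suitable for sponsor communication:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    budget_hours: float = 0.0
    actual_hours: float = 0.0
    subtasks: list = field(default_factory=list)

    def rolled_budget(self):
        # A summary task's budget is its own plus all of its subtasks' budgets.
        return self.budget_hours + sum(t.rolled_budget() for t in self.subtasks)

    def rolled_actual(self):
        return self.actual_hours + sum(t.rolled_actual() for t in self.subtasks)

    def pct_complete(self):
        # Crude effort-based proxy; real tools weight by remaining estimates.
        budget = self.rolled_budget()
        return 0.0 if budget == 0 else 100.0 * self.rolled_actual() / budget

# Hypothetical subtasks rolling up to the "Inventory" summary line.
inventory = Task("Inventory (3 tables)", subtasks=[
    Task("Source analysis", 20, 20),
    Task("Mapping development", 30, 10),
    Task("Unit test", 10, 0),
])
print(f"{inventory.name}: {inventory.pct_complete():.0f}% of "
      f"{inventory.rolled_budget():.0f} budgeted hours")
```

The same structure supports reporting at whatever level the audience needs: the project manager tracks the leaves, while the sponsor sees only the summary rows.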
A fundamental question is whether to include “activities” as part of a WBS. The following statements are
generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving
this question.
● The project manager should have the right to decompose the WBS to whatever level of detail he or
she requires to effectively plan and manage the project. The WBS is a project management tool
that can be used in different ways, depending upon the needs of the project manager.
● At the lowest level in the WBS, an individual should be identified and held accountable
for the result. This person should be an individual contributor, creating the deliverable
personally, or a manager who will in turn create a set of tasks to plan and manage the
results.
● The WBS is not necessarily a sequential document. Tasks in the hierarchy are often
worked in parallel. For example, subtasks 4.3.1 through 4.3.4 may have sequential
requirements that force them to be completed in order, while subtasks 4.3.5 through
4.3.7 can - and should - be completed in parallel if they do not have sequential
requirements.
❍ It is important to remember that a task is not complete until all of its corresponding subtasks
are completed - whether sequentially or in parallel. For example, the Build Phase is not
complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin
for the Deploy Phase long before the Build Phase is complete.
The Project Plan provides a starting point for further development of the project WBS. This sample is a
Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the
Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to
ensure that it corresponds to the specific development effort, removing any steps that aren’t relevant or
adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the
development effort.
If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown
Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other
project management tools, simplifying the effort of developing the WBS.
Sometimes it is best to build an initial task list and timeline using a facilitator with the project team. The
project manager can act as a facilitator or can appoint one, freeing up the project
manager and enabling team members to focus on determining the actual tasks and effort needed.
Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams
creating their own project plans. The overall project manager then brings the plans together into a master
project plan. This group of projects can be defined as a program and the project manager and project
architect manage the interaction among the various development teams.
Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses;
new information becomes available; scope, resources and priorities change; deliverables are (or are not)
completed on time, etc. The process of estimating and modifying the plan should be repeated many times
throughout the project. Even initial planning is likely to take several iterations to gather enough information.
Significant changes to the project plan become the basis to communicate with the project sponsor(s) and/
or key stakeholders with regard to decisions to be made and priorities rearranged. The goal of the project
manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape
direction.
Data integration projects differ somewhat from other types of development projects, although they also
share some key attributes. The following list summarizes some unique aspects of data integration projects:
● Business requirements are less tangible and predictable than in OLTP (online transactional
processing) projects.
● Database queries are very data intensive, involving few or many tables, but with many, many rows.
In OLTP, transactions are data selective, involving few or many tables and comparatively few
rows.
them, they must follow a clear plan. Data integration project managers often have a
more difficult job than those managing OLTP projects because there are so many
pieces and sources to manage.
Two purposes of the WBS are to manage work and ensure success. Although this is the same as any
project, data integration projects are unlike typical waterfall projects in that they are based on an iterative
approach. Three of the main principles of iteration are as follows:
● Iteration. Division of work into small “chunks” of effort using lessons learned from
earlier iterations.
● Time boxing. Delivery of capability in short intervals, with the first release typically
requiring from three to nine months (depending on complexity) and quarterly releases
thereafter.
● Increments. Delivery of the overall solution piece-by-piece, with each increment building
on the patterns established by earlier deliveries.
Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The
danger is that projects can iterate or spiral out of control.
The three principles listed above are very important because even the best data integration plans are
likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a
fully detailed plan, is a large common data management project that gathers all requirements upfront and
delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all
requirements upfront" and the "all-at-once in three years."
Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The
feedback you can gather from increment to increment is critical to the success of the future increments. The
benefit is that such incremental deliveries establish patterns for development that can be used and
leveraged for future deliveries.
The correct development approach is usually dictated by corporate standards and by departments such as
the Project Management Office (PMO). Regardless of the development approach chosen, high-level phases
typically include planning the project; gathering data requirements; developing data models; designing and
developing the physical database(s); developing the source, profile, and map data; and extracting,
transforming, and loading the data. Lower-level planning details are typically carried out by the project
manager and project team leads.
In many cases, a manual technique is used to identify and record the high-level phases and tasks, and the
information is then transferred to project tracking software such as Microsoft Project. Project team members
typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky
notes or index cards, then mounting the notes or cards on a wall or white board. Use one sticky note or card
per phase or task so that you can easily rearrange them as the project order evolves. As the project plan
progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time
estimates, and dependencies. This information can then be fed into the project tracking software.
Once you have a fairly detailed methodology, you can enter the phase and task information into your project
tracking software. When the project team is assembled, you can enter additional tasks and details directly
into the software. Be aware, however, that the project team can better understand a project and its various
components if they actually participate in the high-level development activities, as they do in the manual
approach. Using software alone, without input from relevant project team members, to designate phases,
tasks, dependencies, and timelines can be difficult and prone to errors and omissions.
Benefits of developing the project timeline manually, with input from team members, include:
Team members have an opportunity to work with each other and set the foundation.
This is particularly important if the team is geographically dispersed and cannot work
face-to-face throughout much of the project.
The project plan should incorporate a thorough description of the project and its goals. Be sure to review the
business objectives, constraints, and high-level phases, but keep the description as short and simple as
possible. In many cases, a verb-noun form works well (e.g., interview users, document requirements, etc.).
After you have described the project at a high level, identify the tasks needed to complete each phase. It is
often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide narrative for
each task or subtask. In general, decompose the tasks until they have a rough duration of two to 20 days.
Remember to break down the tasks only to the level of detail that you are willing to track. Include key
checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones
(e.g., requirements completed, data model completed, etc.).
Identify a single owner for each task in the project plan. Although other resources may help to complete the
task, the individual who is designated as the owner is ultimately responsible for ensuring that the task, and
any associated deliverables, is completed on time.
After the WBS is loaded into the selected project tracking software and refined for the specific project
requirements, the Project Manager can begin to estimate the level of effort involved in completing each of
the steps. When the estimate is complete, the project manager can assign individual resources and prepare
the project schedule.
Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan
updated throughout the project.
Challenge
The challenge of developing and maintaining a project plan is to incorporate all of the necessary components while
retaining the flexibility necessary to accommodate change.
Without these components, the project is subject to slippage and to incorrect expectations being set with the Project
Sponsor.
Project Plans are subject to revision and change throughout the project. It is imperative to establish a
communication plan with the Project Sponsor; such communication may involve a weekly status report of
accomplishments, and/or a report on issues and plans for the following week. This type of forum is very helpful in
involving the Project Sponsor to actively make decisions with regard to changes in scope or timeframes.
If your organization has the concept of a Project Office that provides governance for the project and priorities, look for
a Project Charter that contains items like scope, initial high-level requirements, and key project stakeholders. Additionally,
the Project Office should provide guidance in funding and resource allocation for key projects.
Informatica’s PowerCenter and Data Quality are not exempt from this project planning process. However, the purpose
here is to provide some key elements that can be used to develop and maintain a data integration, data migration,
or data quality project plan.
Description
Use the following steps as a guide for developing the initial project plan:
1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design,
development, and testing.)
2. Break the milestones down into major tasks and activities. The sample Project Plan can be helpful as a starting point
or for recommending tasks for inclusion.
3. Continue the detailed breakdown, if possible, to a level at which logical “chunks” of work can be completed
and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate
estimation, assignment of resources, and tracking of progress. If the detailed tasks are too broad in scope
(e.g., requiring multiple resources), estimates are much less likely to be accurate and resource accountability
becomes difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if
applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must
start or complete concurrently with another).
6. Define the resources based on the role definitions and estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization’s system development methodology.
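The dependencies established in step 5 form a directed graph, and a scheduling tool essentially computes a topological order over it. A minimal sketch of that idea, using Python's standard library (the task names here are hypothetical, not from the methodology):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "gather requirements": set(),
    "design data model": {"gather requirements"},
    "build mappings": {"design data model"},
    "unit test": {"build mappings"},
    "deploy": {"unit test"},
}

# static_order() yields every task after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
# A graphlib.CycleError here would indicate contradictory predecessor
# relationships, i.e., a circular dependency in the plan.
```

This is the same predecessor logic Microsoft Project applies when it sequences tasks; catching a cycle early is exactly the kind of plan defect the review in step 4 is meant to surface.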
Note: Informatica Professional Services has found success in projects that blend the “waterfall” method with the
“iterative” method. The “waterfall” method works well in the early stages of a project, such as analysis and initial
design. The “iterative” method works well in accelerating development and testing, where feedback from extensive
testing informs each subsequent iteration.
At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor
relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities.
Set the constraint type to “As Soon As Possible” and avoid setting a constraint date. Use the Effort-Driven approach so
that the Project Plan can be easily modified as adjustments are made.
By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the
project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in
your initial estimations, even if the resulting scheduling is likely to miss Project Sponsor expectations. This helps to
establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good
faith.
This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for
opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule.
When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions,
dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.
Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks
and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule
and updating the plan based on status and changes to assumptions.
One of the key communication methods is building the concept of a weekly or bi-weekly Project Sponsor
meeting. Attendance at this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the
Project Manager.
Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones, events at a high-level),
b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues and e) Plans for Next
Period.
Key Accomplishments
Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is
an opportunity to bring in the lead developers and have them report to management on what they have accomplished;
it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they
own the work and are accountable to management.
Keep accomplishments at a high-level and coach the team members to be brief, keeping their presentation to a five to
ten minute maximum during this portion of the meeting.
The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so that the
report is readable by the Project Sponsor (see sample below).
Plan                                                         Percent Complete   Budget Hours
Architecture - Set up of Informatica Migration Environment                          167
Develop data integration solution architecture                     10%               28
Design development architecture                                    28%               32
Customize and implement Iterative Migration Framework
A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional
funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when
additional funding or a change in scope is required.
Many projects are cancelled because of cost overruns, so it is the Project Manager’s job to keep expenditures under
control. The following example shows how a budgeted vs. actual report may look.
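The budgeted-vs-actual comparison is simple arithmetic that can be automated. The sketch below uses a basic estimate-at-completion formula (actual hours divided by fraction complete), a common earned-value-style forecast; the formula choice and all figures are illustrative assumptions, not prescribed by the methodology.

```python
def estimate_at_completion(actual_hours, pct_complete):
    """Forecast total hours, assuming effort continues at the current burn rate."""
    if pct_complete <= 0:
        raise ValueError("cannot forecast before any progress is reported")
    return actual_hours / (pct_complete / 100.0)

# Illustrative status figures for one reporting period.
budget_hours = 400
actual_hours = 180
pct_complete = 40  # percent of the work actually done

eac = estimate_at_completion(actual_hours, pct_complete)
variance = eac - budget_hours
print(f"Estimate at completion: {eac:.0f} h "
      f"({'over' if variance > 0 else 'under'} budget by {abs(variance):.0f} h)")
```

A forecast like this, refreshed each reporting period, gives the Project Sponsor the early warning needed to decide between additional funding and a change in scope.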
Key Issues
This is the most important part of the meeting. Presenting key issues such as resource commitment, user roadblocks,
key design concerns, etc., to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate
decisions and minimizes the risk of impact to the project.
Plans for Next Period
This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change,
this is an opportunity to redirect the resources and use them correctly.
Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample
Deliverable), or changes in priority or approach, as they arise to determine if they affect the plan. It may be necessary to
revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add
new tasks or postpone existing ones.
One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With
Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If
company and project management do not require tracking against a baseline, simply maintain the plan through updates
without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is
completed.
Summary
Managing a data integration, data migration, or data quality project requires good project planning and
communication. Many data integration projects fail because of issues such as poor data quality or the complexity of
integration. However, good communication and expectation setting with the Project Sponsor can prevent such
issues from causing a project to fail.
Challenge
Identifying the departments and individuals that are likely to benefit directly from the
project implementation. Understanding these individuals, and their business information
requirements, is key to defining and scoping the project.
Description
The following four steps summarize business case development and lay a good
foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project
beneficiaries and understanding their business roles and project participation. In many
cases, the Project Sponsor can help to identify the beneficiaries and the various
departments they represent. This information can then be summarized in an
organization chart that is useful for ensuring that all project team members understand
the corporate/business organization.
2. The next step in establishing the business scope is to understand the business
problem or need that the project addresses. This information should be clearly defined
in a Problem/Needs Statement, using business terms to describe the problem. For
example, the problem may be expressed as "a lack of information" rather than "a lack
of technology" and should detail the business decisions or analysis that is required to
resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.
3. The next step in creating the project scope is defining the business goals and
objectives for the project and detailing them in a comprehensive Statement of Project
Goals and Objectives.
4. The final step is creating a Project Scope and Assumptions statement that clearly
defines the boundaries of the project based on the Statement of Project Goals and
Objectives and the associated project assumptions. This statement should focus on the
type of information or analysis that will be included in the project rather than what will
not.
The assumptions statements are optional and may include qualifiers on the scope,
such as assumptions of feasibility, specific roles and responsibilities, or availability of
resources or data.
Challenge
Description
The quality of a project can be directly correlated to the amount of review that occurs
during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.
In addition to the initial project plan review with the Project Sponsor, it is critical to
schedule regular status meetings with the sponsor and project team to review status,
issues, scope changes and schedule updates. This is known as the project sponsor
meeting.
Gather status, issues, and schedule update information from the team one day before
the status meeting in order to compile and distribute the Project Status Report. In
addition, make sure lead developers of major assignments are present to report on
status and issues, if applicable.
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans
and deliverables with company management, including business requirements reviews
with business personnel and technical reviews with project technical personnel.
Set a process in place beforehand to ensure appropriate personnel are invited, any
relevant documents are distributed at least 24 hours in advance, and that reviews focus
on questions and issues (rather than a laborious "reading of the code"). A typical status
report includes:
● Key Accomplishments.
● Activities Next Week.
● Tracking of Progress to-Date (Budget vs. Actual).
● Key Issues / Roadblocks.
It is the Project Manager’s role to stay neutral to any issue and to effectively state facts
and allow the Project Sponsor or other key executives to make decisions. Many times
this process builds the partnership necessary for success.
Change in Scope
Directly address and evaluate any changes to the planned project activities, priorities,
or staffing as they arise, or are proposed, in terms of their impact on the project plan.
Management of Issues
Any questions, problems, or issues that arise and are not immediately resolved should
be tracked to ensure that someone is accountable for resolving them and so that their
effect is visible.
Use the Issues Tracking template, or something similar, to track issues, their owner,
and dates of entry and resolution as well as the details of the issue and of its solution.
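The tracking record just described (issue, owner, dates of entry and resolution, details of the solution) can be modeled as a minimal data structure. The field names and helper below are a hypothetical sketch mirroring those columns, not the actual Issues Tracking template:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Issue:
    """Minimal issue-tracking record mirroring the fields described above."""
    description: str
    owner: str
    entered: date = field(default_factory=date.today)
    resolved: Optional[date] = None
    resolution: str = ""
    showstopper: bool = False  # showstoppers also go on the status report

def open_issues(issues):
    """Issues still awaiting resolution, oldest first."""
    return sorted((i for i in issues if i.resolved is None),
                  key=lambda i: i.entered)
```

Sorting open issues oldest-first makes stale items visible at the weekly project sponsor meeting, which is exactly where unresolved showstoppers should surface.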
Significant or "showstopper" issues should also be mentioned on the status report and
communicated through the weekly project sponsor meeting. This way, the Project
Sponsor has the opportunity to resolve a potential issue before it impacts the project.
A formal project acceptance and close helps document the final status of the project.
Rather than simply walking away from a project when it seems complete, this explicit
close procedure both documents and helps finalize the project with the Project Sponsor.
For most projects this involves a meeting where the Project Sponsor and/or department
managers acknowledge completion or sign a statement of satisfactory completion.
● Prepare for the close by considering what the project team has learned about
the environments, procedures, data integration design, data architecture, and
other project plans.
● Formulate the recommendations based on issues or problems that need to be
addressed. Succinctly describe each problem or recommendation and if
applicable, briefly describe a recommended approach.
Challenge
Data warehousing projects are usually initiated out of a business need for a certain
type of report (e.g., “we need consistent reporting of revenue, bookings and backlog”).
Except in the case of narrowly-focused, departmental data marts, however, this is not
enough guidance to drive a full data integration solution. Further, a successful, single-
purpose data mart can build a reputation such that, after a relatively brief period of
proving its value to users, business management floods the technical group with
requests for more data marts in other areas. The only way to avoid silos of data marts
is to think bigger at the beginning and canvass the enterprise (or at least the
department, if that’s your limit of scope) for a broad analysis of data
integration requirements.
Description
Process Steps
The first step in the process is to identify and interview “all” major sponsors and
stakeholders. This typically includes the executive staff and CFO, since they are likely to
be the key decision makers who will depend on the data integration. At a minimum,
figure on 10 to 20 interview sessions.
The next step in the process is to interview representative information providers. These
individuals include the decision makers who provide the strategic perspective on what
information to pursue, as well as details on that information, and how it is currently
used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors
and stakeholders regarding the findings of the interviews and the recommended
subject areas and information profiles. It is often helpful to facilitate a Prioritization
Workshop with the major stakeholders, sponsors, and information providers in order to
set priorities on the subject areas.
The following paragraphs offer some tips on the actual interviewing process. Two
sections at the end of this document provide sample interview outlines for the executive
staff and information providers.
Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A
focused, consistent interview format is desirable. Don't feel bound to the script,
however, since interviewees are likely to raise some interesting points that may not be
included in the original interview format. Pursue these subjects as they come up,
asking detailed questions. This approach often leads to “discoveries” of strategic uses
for information that may be exciting to the client and provide sparkle and focus to the
project.
Interviews of information providers are secondary but can be very useful. These are the
business-analyst types who report to decision makers and currently use Excel, Lotus,
or a database program to consolidate data from more than one source, produce regular
and ad hoc reports, or conduct sophisticated analysis. In subsequent phases of the
project, you must identify all of these individuals, learn what information they access,
and how they process it. At this stage, however,
you should focus on the basics, building a foundation for the project and discovering
what tools are currently in use and where gaps may exist in the analysis and reporting
functions.
Be sure to take detailed notes throughout the interview process. If there are a lot of
interviews, you may want the interviewer to partner with someone who can take good
notes, perhaps on a laptop to save note transcription time later. It is important to take
down the details of what each person says because, at this stage, it is difficult to know
what is likely to be important. While some interviewees may want to see detailed notes
from their interviews, this is not very efficient since it takes time to clean up the notes
for review. The most efficient approach is to simply consolidate the interview notes into
a summary format following the interviews.
Be sure to review previous interviews as you go through the interviewing process. You
can often use information from earlier interviews to pursue topics in later interviews in
more detail and with varying perspectives.
Keep the interview groups small. One or two Professional Services personnel should
suffice, with at most one client project person. For executive interviews especially, there
should be a single interviewee. There is sometimes a need to interview a group of middle
managers together, but with more than two or three participants you are likely to get
much less input from each.
At the completion of the interviews, compile the interview notes and consolidate the
content into a summary. This summary should break out the input into departments or
other groupings significant to the client. Use this content and your interview experience,
along with best practices or industry experience, to recommend specific, well-defined
subject areas.
Remember that this is a critical opportunity to position the project to the decision-
makers by accurately representing their interests while adding enough creativity to
capture their imagination. Provide them with models or profiles of the sort of information
that could be included in a subject area so they can visualize its utility. This sort of
“visionary concept” of their strategic information needs is crucial to drive their
awareness and is often suggested during interviews of the more strategic thinkers. Tie
descriptions of the information directly to stated business drivers (e.g., key processes
and decisions) to further accentuate the “business solution.”
A typical table of contents in the initial Findings and Recommendations document might
look like this:
This is a critical workshop for building consensus on the business drivers. Key executives
and decision makers should attend, along with some key information providers. It is
advisable to schedule this workshop offsite to ensure attendance and attention, but the
workshop must be efficient, typically confined to a half-day.
Be sure to announce the workshop well enough in advance to ensure that key
attendees can put it on their schedules. Sending the announcement of the workshop
may coincide with the initial distribution of the interview findings.
Keep the presentation as simple and concise as possible, and avoid technical
discussions or detailed sidetracks.
Key business drivers should be determined well in advance of the workshop, using
information gathered during the interviewing process. Prior to the workshop, these
business drivers should be written out, preferably in display format on flipcharts or
similar presentation media, along with relevant comments or additions from the
interviewees and/or workshop attendees.
During the validation segment of the workshop, attendees need to review and discuss
the specific types of information that have been identified as important for triggering or
monitoring the business drivers. At this point, it is advisable to compile as complete a
list as possible; it can be refined and prioritized in subsequent phases of the project.
As much as possible, categorize the information needs by function, maybe even by
specific driver (i.e., a strategic process or decision). Considering the information needs
on a function by function basis fosters discussion of how the information is used and by
whom.
With the results of brainstorming over business drivers and information needs listed (all
over the walls, presumably), take a brief detour into reality before prioritizing and
planning. You need to consider overall feasibility before establishing the first priority
information area(s) and setting a plan to implement the data warehousing solution with
initial increments to address those first priorities.
Briefly describe the current state of the likely information sources (SORs). What
information is currently accessible with a reasonable likelihood of the quality and
content necessary for the high priority information areas? If there is likely to be a high
degree of complexity or technical difficulty in obtaining the source information, you may
need to reduce the priority of that information area (i.e., tackle it after some successes
in other areas).
Avoid getting into too much detail or technical issues. Describe the general types of
information that will be needed (e.g., sales revenue, service costs, customer descriptive
data).
The project sponsors, stakeholders, and users should all understand that the process
of implementing the data warehousing solution is incremental. Develop a high-level
plan for implementing the project, focusing on increments that are both high-value and
high-feasibility. Implementing these increments first provides an opportunity to build
credibility for the project. The objective during this step is to obtain buy-in for your
implementation plan and to begin to set expectations in terms of timing. Be practical,
though; don't establish too rigorous a timeline!
At the close of the workshop, review the group's decisions (in 30 seconds or less),
schedule the delivery of notes and findings to the attendees, and discuss the next steps
of the data warehousing project.
As soon as possible after the workshop, provide the attendees and other project
stakeholders with the results:
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
● Interviews to understand business information strategies and
expectations
● Document strategy findings and model the strategic subject
areas
● Consensus-building meeting to prioritize information
requirements and identify “quick hits”
● Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
1. Understanding of how business issues drive information needs
2. High-level understanding of what information is currently provided to
whom
● Where does it come from
● How is it processed
● What are its quality or access issues
IV. Briefly describe your roles and responsibilities
● The interviewee may provide this information before the actual interview
● Business Requirements
Specification
● Change Request Form
● Data Migration Communication Plan
● Data Quality Plan Design
● Database Sizing Model
● Functional Requirements Specification
● Information Requirements Specification
● Issues Tracking
● Mapping Inventory
● Mapping Specifications
● Metadata Inventory
● Migration Request Checklist
● Operations Manual
● Physical Data Model Review Agenda
● Project Definition
● Project Plan
● Project Roadmap
● Project Role Matrix
● Prototype Feedback
● Restartability Matrix
● Scope Change Assessment
● Source Availability Matrix
● System Test Plan
● Target-Source Matrix
● Technology Evaluation Checklist
● Test Case List
● Test Condition Results
● Unit Test Plan
● Work Breakdown Structure
VELOCITY
SAMPLE DELIVERABLE
Business Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
BUSINESS REQUIREMENTS SPECIFICATION
DOCUMENT OVERVIEW
This document presents a brief description of the business of CompanyX and specific business
requirements applicable to the project. Any subsequent changes, additions, or deletions are not part of
this document and will be submitted to CompanyX separately for acceptance and inclusion as an
additional requirement for the project.
<Describe high-level view of business environment, strategy, reason for system implementation, etc>
Business Requirement: Data from log files and the application database will be
extracted and distributed into a central repository.
Constraints: The log files will be transported from each cluster machine and placed in
the repository. The application server must be available for data extract.
Inputs: Hourly log files from the cluster processing machines; tables from the
application database server.
Outputs: Flat file database built from the log files and tables using a hierarchical
directory/file structure.
Dependencies: Each processing machine in each cluster will correctly generate the log
files. The log files will be transferred to the repository machine intact. The database is
operational.
Hardware / Software Requirements: Central Repository Machine: SUN E4500, 4 CPU,
4GB RAM, 70GB HDD, Sun Solaris 2.6. PowerCenter will be used to extract data from
the application database. PERL scripts will be used to extract data from the log files.
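As a sketch of how the hourly log transport into the hierarchical repository might work. The `<root>/<cluster>/<machine>/<YYYY>/<MM>/<DD>/<HH>.log` layout and the function name are illustrative assumptions, not part of the specification (the project itself used PERL scripts and PowerCenter):

```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_log(repo_root: str, cluster: str, machine: str,
                log_path: str, ts: datetime) -> Path:
    """Copy one hourly log file into the hierarchical repository layout."""
    # Assumed layout: <root>/<cluster>/<machine>/<YYYY>/<MM>/<DD>/<HH>.log
    dest_dir = Path(repo_root) / cluster / machine / ts.strftime("%Y/%m/%d")
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the path on first use
    dest = dest_dir / ts.strftime("%H.log")
    shutil.copy2(log_path, dest)                 # preserve timestamps for auditing
    return dest
```

A directory-per-machine layout like this keeps each cluster machine's hourly files separable for the downstream extract, which is the dependency the requirement states ("log files transferred to the repository machine intact").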
Business
Requirement
Constraints
Inputs
Outputs
Dependencies
Hardware /
Software
Requirements
Business
Requirement
Constraints
Inputs
Outputs
Dependencies
Hardware /
Software
Requirements
REQUESTOR INFORMATION:
Name: Phone Number: Date:
DESCRIPTION OF OTHER:
MIGRATION INFORMATION:
Migrate From: DEV TEST QA PRODUCTION
Migrate To: DEV TEST QA PRODUCTION
Deployment Group:
Label:
WORKFLOW/SESSION DETAILS: (Include any special details about the session configuration,
automatic memory configuration, recovery options, load strategy, etc.)
IMPLEMENTATION INFORMATION:
Reviewed/Approved By: Implemented By:
Date Received: Date Implemented:
Comments:
Communication Plan
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
COMMUNICATION PLAN
TABLE OF CONTENTS
INTRODUCTION
INTRODUCTION
[Provide a brief introduction to the Communication Plan -- specify the purpose of the document.]
OVERALL OWNERSHIP
[Provide verbiage that identifies the owner of the migration project. This person is the decision-maker for all issues
and points that require clarification (e.g., Joe Blob, Implementation Architect, will own the plan and make all final
decisions; responsibilities include setting up ad hoc calls, consulting with the PMO and Project Manager, and
relaying information about appropriate options).]
CONTACT INFORMATION
Name Team Cell Phone Number Home Phone Number
TEAM INFORMATION
Team Name Role Manager
Additional calls will be scheduled as needed. Required attendees will receive notification of ad hoc conference calls
via cell phone and email invitations.
Teleconference Access:
Phone Number: 1-800-123-4567
Conference Code: 1234
GO/NO-GO PROCEDURE
[Provide details on how a Go/No-Go decision will be determined and communicated.]
The document should serve as a plan handover document for business users and be written in a manner that a user
trained in IDQ can understand and update the plan design unaided.
We recommend that you build a plan design document as you build the plan.
Introduction
Document scope and readership
Document history
Plan heading [plan name].pln
Overview
Inputs
Component descriptions
Dictionaries
Outputs
Next steps
THE INTRODUCTION
The introduction should describe the data quality objectives of the plan and its relationship to the parent project.
When writing the introduction, consider these questions:
OVERVIEW
INPUTS
This section identifies the source data for the plan. Consider these questions:
COMPONENT DESCRIPTIONS
This section describes at a low level the operational components and business rules used in the plan. Where
possible, these should be listed in the order of their interaction with the data. How much detail you go into depends
on the audience for the document and what their needs are. It also depends on whether the business rules are
documented elsewhere.
Component functionality can be described at a high level as shown in the examples below:
Search Replace component takes Addr Line1 from CSV Source and removes spaces anywhere and full
stops from end.
Output from Search Replace component is put through the Word Manager and Addr Line 1 is standardized
using ‘Address Prefix’ and ‘Address Suffix’ dictionaries.
Output from Word Manager is used as input to Token Labeller and profiled using the following dictionaries in
this order: A.dic, Bb.dic, C.dic.
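The component chain described above can be sketched in Python. IDQ itself configures these steps graphically; the function names and the toy dictionary contents below are assumptions for illustration only:

```python
import re

def search_replace(line: str) -> str:
    # Search Replace step: tidy stray spaces and drop full stops from the end
    return re.sub(r"\s+", " ", line).strip().rstrip(".")

def word_manage(line: str, replacements: dict) -> str:
    # Word Manager step: standardize each token via a prefix/suffix dictionary
    return " ".join(replacements.get(tok.upper(), tok) for tok in line.split())

def token_label(line: str, dictionaries: list) -> list:
    # Token Labeller step: tag each token with the first dictionary containing it,
    # checking the dictionaries in the configured order (e.g., A.dic, Bb.dic, C.dic)
    labels = []
    for tok in line.split():
        label = next((name for name, entries in dictionaries
                      if tok.upper() in entries), "UNLABELLED")
        labels.append((tok, label))
    return labels
```

A possible pipeline over an Addr Line1 value: `search_replace("12  Main   St.")` yields `"12 Main St"`, and passing that through `word_manage` with a suffix dictionary containing `{"ST": "STREET"}` yields `"12 Main STREET"`. Here `dictionaries` is a list of `(name, entry_set)` pairs standing in for the `.dic` files.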
DICTIONARIES
List dictionaries and other reference content used, and their file locations.
OUTPUTS
In this section, describe the plan output and identify its file or database destination. Consider these questions:
[PLAN NAME.PLN]
This section is optional; it can be used in the same manner as the previous plan section and its subsections, above, if
another plan is described in this document.
NEXT STEPS
This section is relevant if there are other actions dependent on the plan and if the plan output is to be used elsewhere
in the project, as is typically the case. Consider these questions:
Functional Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
FUNCTIONAL REQUIREMENTS SPECIFICATION
Constraints
Inputs Source data from the shopping logs and application database.
Functional Requirement
Constraints
Inputs
Outputs
Dependencies
Hardware / Software
Requirements
Information Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
INFORMATION REQUIREMENTS SPECIFICATION
Issue Resolution
# | Short Description | Assign To | Status | Priority | Severity | Date ID'd | ID'd By | Description | Work Around | Investigation | Solution
Map ID | Mapping Name | Target Table(s) | Source(s) | Volume | Complexity | Assigned | Estimate | Actual | Issues
1 | m_DM_Customer_Dimension | DataMart.CUSTOMER | Staging.RS_CUSTOMER, Staging.RS_CUSTOMER_ADDRESS | Med | Low | <Developer> | 5 days | 4 days | Need to determine how to merge with the second source system, i.e., on what fields can the two files be joined?
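The open issue for the customer-dimension mapping, i.e., on what fields the two staging files can be joined, can be probed with a quick key-overlap check during source analysis. The field names below are hypothetical:

```python
def key_overlap(left_rows, right_rows, key):
    """Fraction of left-side key values that find a match on the right."""
    left_keys = {row[key] for row in left_rows if row.get(key) is not None}
    right_keys = {row[key] for row in right_rows if row.get(key) is not None}
    if not left_keys:
        return 0.0
    return len(left_keys & right_keys) / len(left_keys)

def best_join_key(left_rows, right_rows, candidates):
    """Pick the candidate field with the highest key overlap."""
    return max(candidates, key=lambda k: key_overlap(left_rows, right_rows, k))
```

Running this over samples of the two staging tables for each candidate field (a shared customer ID, a tax ID, a name-plus-postcode composite, and so on) gives the developer evidence for the join decision before the mapping is built.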
Mapping Specifications
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
MAPPING SPECIFICATIONS
Mapping Name:
Source System(s):
Target System(s):
Short Description:
Load Frequency:
Preprocessing:
Post Processing:
Error Strategy:
Reload Strategy:
Unique Source
Fields (PK):
Dependant
Objects
SOURCES
Tables
Table Name System/Schema/Owner Selection/Filter
Files
File Name File Location Fixed/Delimited Additional File
Info
TARGETS
Files
File Name File Location Fixed/Delimited Additional File Info
LOOKUPS
Lookup Name
Table Location
Match Condition(s)
Persistent / Dynamic
Filter/SQL Override
Source Target
REQUESTOR INFORMATION:
Name: Phone Number:
Completion Date:
DESCRIPTION:
INFORMATICA OBJECTS:
From Repository To Repository From Folder To Folder
DATABASE OBJECTS:
From Database To Database From Schema To Schema
CODE REVIEW:
Reviewer Approval Date Code Review Comments
FUNCTIONAL TEST:
Functional User Approval Date
GOVERNANCE CHECKLIST:
Are workflow task links set with 'if the previous task completed successfully'? Yes No N/A
MIGRATION LIST:
INFORMATICA OBJECTS:
Sources
Targets
Mappings
Workflow
DATABASE OBJECTS: (i.e. Tables, Stored Procedures, Triggers, Views, Sequences, Functions…)
Flat File Source Information (if the source is a flat file, provide this information)
Flat File Target Information (if the target is a flat file, provide this information)
Pre-Session Information
Post-Session Information
Operations Manual
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
OPERATIONS MANUAL
TABLE OF CONTENTS
INTRODUCTION
INFRASTRUCTURE
  POWERCENTER INFRASTRUCTURE
  STEPS TO STOP/RESTART INFORMATICA COMPONENTS
  HIGH AVAILABILITY CONFIGURATION
  POWERCENTER – WORKFLOW MANAGER
DATA EXTRACTION
DATA TRANSFORMATION
  POWERCENTER TRANSFORMATIONS
  PERFORM DATA TRANSFORMATIONS
    LOAD THE TARGET TABLE
    REPROCESS CORRECTED DATA
  SQL SCRIPTS
  STORED PROCEDURES
DATA LOAD
  SUBJECT AREA LOAD ORDER
  MAPPING LOAD ORDER AND RECOVERY
    WORKFLOW/SESSIONS A
    RESTART STEPS
    RECOVERY/ROLLBACK PROCEDURES
ERROR HANDLING
  ERROR REPROCESSING STEPS
METADATA
[OTHER SOFTWARE] DESCRIPTION
[OTHER SOFTWARE] PROCEDURES
[OTHER SOFTWARE] OPERATIONS
SUPPORT/MAINTENANCE
  SERVICE LEVEL AGREEMENT
  CONTACT INTERNAL SUPPORT PERSONNEL
  CONTACT EXTERNAL VENDOR CUSTOMER SUPPORT
    INFORMATICA GLOBAL CUSTOMER SUPPORT
  COMMUNICATIONS WITH OTHER TEAMS
  MAINTENANCE/OUTAGE SCHEDULE
APPENDIX A - REFERENCES
TABLE OF FIGURES
[List of any figures or diagrams used in this document]
INTRODUCTION
[Provide a brief introduction for the Operation Manual. Specify the purpose of the document.]
INFRASTRUCTURE
[Describe the project’s infrastructure. If possible, provide a diagram of the system.]
POWERCENTER INFRASTRUCTURE
[Describe the PowerCenter infrastructure. Include setup and location of Informatica servers.]
[Outline steps to perform graceful shutdown and restart of the PowerCenter domain, PowerCenter node,
PowerCenter service, Repository service, Data Analyzer web service providers and the PowerCenter Repository.
Include the locations and names of any startup and shutdown scripts.]
[Outline the high availability configuration for the domain gateway, the PowerCenter services and other service
components.]
[Discuss the set-up within Workflow Manager. Include the setup of the database connections and an explanation of
the Workflow Manager variables.]
DATA EXTRACTION
[Describe the source data that are being processed in this system.]
DATA TRANSFORMATION
[Provide an overview of how data is loaded into the data warehouse/mart tables.]
POWERCENTER TRANSFORMATIONS
[Describe the overall data transformation process. Indicate how data flows from the source to the target(s). If
necessary, discuss the error handling process. If possible, include a diagram that illustrates the data flows and
describe each stage in the process.]
[Discuss how and when data is loaded into the target database. If there are intermediate steps, describe these as
well. Describe the distinctions between the initial target table load and incremental load, if any.]
SQL SCRIPTS
[Describe any SQL scripts that are used in the process. Include where scripts are located and when in the load
process they are executed.]
STORED PROCEDURES
[Describe any stored procedures that must be used throughout the process. Include where the procedures are
located and any special permissions needed to execute them.]
DATA LOAD
[Describe the data load order.]
[Discuss the individual mappings and their load orders. If applicable, discuss the steps necessary to recover from a
failure. List each workflow and/or session involved in this system.]
WORKFLOW/SESSIONS A
wf_A :
s_Session1
s_Session2
cmd_Task1
RESTART STEPS:
RECOVERY/ROLLBACK PROCEDURES:
[Refer to the Workflow Monitor, session logs and database audit trail to determine failed sessions and the point of
failure. Outline steps to roll back/recover load operations.]
[Identify and describe when to run Data Analyzer scheduled reports manually, for example where downtime has
resulted in the scheduled time passing.]
ERROR HANDLING
[Provide a detailed description of the error handling strategies used in this system. Be sure to include any error
tables or files that are created in the process. Also mention any scripting that is executed or emails that are sent to
notify either operations or development staff of session failures.]
[If applicable, describe how error reprocessing is to take place. Be sure to include any procedures that the operations
staff must perform as well as any automated procedures.]
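Error reprocessing of the kind described above, rereading an error file, retrying the normal load for each corrected row, and retaining rows that still fail, might be sketched as follows. The file layout and function names are assumptions, not a prescribed mechanism:

```python
import csv

def reprocess_errors(error_file: str, load_row, still_failing_file: str) -> int:
    """Retry the normal load for each corrected error row.

    Rows that fail again are written back out, with the error message,
    for another correction pass. Returns the count of rows loaded.
    """
    failed, loaded = [], 0
    with open(error_file, newline="") as fh:
        for row in csv.DictReader(fh):
            try:
                load_row(row)                # delegate to the normal load routine
                loaded += 1
            except Exception as exc:
                row["ERROR_MSG"] = str(exc)  # keep the row and the reason it failed
                failed.append(row)
    if failed:
        with open(still_failing_file, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(failed[0].keys()))
            writer.writeheader()
            writer.writerows(failed)
    return loaded
```

Keeping the failure reason alongside each rejected row gives the operations staff the information the manual asks them to record, without requiring access to session logs.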
METADATA
[If applicable, describe the metadata strategy in use in this system. If possible, list the metadata elements that will
be most important for the operations and development staff to track for error handling, reprocessing, and general
volume-estimating purposes.]
SUPPORT/MAINTENANCE
[Outline agreed service level information, including a formal Service Level Agreement.]
[List organizational contacts in the event of a component failure that cannot be resolved by the Production Support
team. Include type of contact, contact name, department, telephone number, and e-mail address (if applicable).]
[Include details of named Informatica support contacts and support contract (and project id) with Informatica in case
they need to be contacted.]
[Provide a list of teams (i.e. SysAdmin, DBAdmin) that require coordination between the operations and its specific
support function (e.g., system, database maintenance, security, etc.). Describe regular communications and include
a schedule for coordination activities.]
MAINTENANCE/OUTAGE SCHEDULE
[Describe the process for maintenance/planned outage scheduling and list their potential impact on system reliability.
List system maintenance/outage schedule.]
APPENDIX A - REFERENCES
[List any documents used as reference in creating this document.]
Review Source-to-Target Relationships – Discuss the preliminary analysis performed regarding how the
physical model differs from the logical model.
Perform Physical Model Walk-through - Using a selection of 20 user-query scenarios, step through the
tables and attributes required to answer each query.
Review Source-to-Target Relationships – Discuss the preliminary analysis performed regarding how data
will be integrated from source systems to target systems and what data transformations are expected to be
required.
PROJECT DEFINITION
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROJECT DEFINITION
PROJECT OBJECTIVES
These are the key business “drivers” for the project—business-focused goals and objectives.
<Objective 1>
<Objective 2>
<Objective 3> …
PROJECT TIMING
Key Milestone or Deliverable Target Date(s)
TECHNICAL ENVIRONMENT
Informatica Products, versions
Platforms / OS
Source systems / data characteristics
Target systems / DBMS
Target architectures
PROJECT PERSONNEL
Personnel you are likely to interact with (e.g., DBA, Business Analyst, System Administrator).
Name Phone(s) E-mail Role
Project Roadmap
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROJECT ROADMAP
TIMELINE
INCREMENT DESCRIPTIONS
The initial project roadmap above is based on the priorities set forth in the Functional Requirements Specification
document. Based on those priorities and estimated feasibility and time-to-deliver, the above timeline is an
approximate high-level plan for this project.
Key:
P = Primary participant(s)
S = Secondary participant(s)
R = Review only
A = Approve
[The matrix columns list the project roles (for example, Business Analyst, Data Architect, Data Steward, DBA,
Project Manager, Project Sponsor, Technical Architect); the rotated column headings are not reproduced here.]
1 Manage P P S S S S A P A P P
1.1 Define Project P P P
1.1.1 Establish Business Project Scope P R P
1.1.2 Build Business Case P S
1.1.3 Assess Centralized Resources P
1.2 Plan and Manage Project P S S S A P P
1.2.1 Establish Project Roles P A P
1.2.2 Develop Project Estimate P S S S S A S S
1.2.3 Develop Project Plan P A S
1.2.4 Manage Project P R P
1.3 Perform Project Close P A A A A
2 Analyze P P P P P P P P P P S P P P P
2.1 Define Business Drivers, Objectives and Goals P R R
2.2 Define Business Requirements S P S P A P A A
2.2.1 Define Business Rules and Definitions S P A P A
2.2.2 Establish Data Stewardship S P S A
2.3 Define Business Scope P P P P P P S P S P P
2.3.1 Identify Source Data Systems P P P P P
2.3.2 Determine Sourcing Feasibility P P P P P
2.3.3 Determine Target Requirements S P P S S P P
2.3.4 Determine Business Process Data Flows P P P P P
2.3.5 Build Roadmap for Incremental Delivery S P P P S S P P
2.4 Define Functional Requirements P R
2.5 Define Metadata Requirements R P P P P P P P
2.5.1 Establish Inventory of Technical Metadata R P P P P
2.5.2 Review Metadata Sourcing Requirements P R P R
2.5.3 Assess Technical Strategies and Policies R P P P P
2.6 Determine Technical Readiness P P P P
2.7 Determine Regulatory Requirements P R P P
2.8 Perform Data Quality Audit S S P P S
2.8.1 Perform Data Quality Analysis of Source Data S P S
2.8.2 Report Analysis Results to the Business S S P P S
3 Architect P P P S P R P P S A P P S P P P
3.1 Develop Solution Architecture P P P R P P P R
3.1.1 Define Technical Requirements P S P R
3.1.2 Develop Architecture Logical View S P
3.1.3 Develop Configuration Recommendations S P
Prototype Feedback
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROTOTYPE FEEDBACK
ADMINISTRATIVE INFORMATION
This section should record the date and time of the meeting and a list of attendees.
INTRODUCTION
This section should describe the overall effort, including the business objectives of the end product. It should
explicitly describe which parts of the prototype are demonstrated for feedback and those that are not included in the
feedback.
It is important to restate the requirements here so that the final product can be measured against the end users’
requests. This makes it possible to determine whether the final product is likely to satisfy the users’ needs.
FUNCTIONAL REQUIREMENTS
This section should specify the functional requirements as defined by the end users. Functional requirements can
include such capabilities as: drill down and up, alert functionality, and access via the web (including both dynamic
and static reporting), as well as information display capabilities such as graphs, bar charts, pie charts, etc.
DATA REQUIREMENTS
This section should specify the data requirements defined by the end users. Data requirements include specific
information requests such as inventory amounts, revenues, organizational data, locations, personnel, etc. If desired,
data requirements can be further broken down by fact data and dimension data.
HARDWARE/SOFTWARE REQUIREMENTS
This section should contain the specific hardware and software requirements necessary to install and run the end
user application. Examples of requirements include minimum memory requirements, software requirements such as
ODBC drivers, database connections, web browsers, or any other required software.
DESCRIPTION OF INTERFACES
This section should contain a detailed description of the interfaces that end users will use to access the data,
including how the interfaces are organized (e.g., by subject area or by organizational constraints). It should
also provide screen captures of the actual system to be developed.
PREDEFINED REPORTS
This section should contain a detailed description of each predefined report planned for development. Each report
description should include the report name and all of its attributes, including any attributes calculated by the end-user
analysis application. The description should also detail how the information will be presented (e.g., tabular, bar graph,
line chart, pie chart).
USER FEEDBACK
This section should describe the results of the user review of the prototype. Any comments or suggestions that
impact the design/implementation should be noted here, and may result in a Scope Change Assessment.
Restartability Matrix
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
RESTARTABILITY MATRIX
PURPOSE
To identify issues that may have an impact on the data integration team’s ability to restart or recover a failed session
and maintain the integrity of data in the data warehouse.
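One pattern the matrix often captures is a session checkpoint: each load records its last committed position so a failed run can resume without reprocessing or duplicating rows. A minimal sketch of that idea, assuming a keyed source and a hypothetical checkpoint table (all names here are illustrative, not part of any particular tool):

```python
import sqlite3

def init_checkpoints(conn):
    # Hypothetical checkpoint table: one row per load session.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_checkpoint ("
        "session_name TEXT PRIMARY KEY, last_committed_key INTEGER)"
    )

def get_restart_point(conn, session_name):
    # Resume after the last committed key; 0 means a fresh start.
    row = conn.execute(
        "SELECT last_committed_key FROM load_checkpoint WHERE session_name = ?",
        (session_name,),
    ).fetchone()
    return row[0] if row else 0

def commit_batch(conn, session_name, batch_max_key):
    # Advance the checkpoint only after the batch itself has committed,
    # so a crash mid-batch replays from the previous checkpoint.
    conn.execute(
        "INSERT INTO load_checkpoint (session_name, last_committed_key) "
        "VALUES (?, ?) ON CONFLICT(session_name) "
        "DO UPDATE SET last_committed_key = excluded.last_committed_key",
        (session_name, batch_max_key),
    )
    conn.commit()
```

On restart, the load reads `get_restart_point()` and selects only source rows with a key above it, which is what keeps the target consistent across failures.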
ISSUE OR REQUIREMENT
<Description of issue to be resolved or new requirement>
PROPOSED RESOLUTION
<Description of approach and reasoning (and alternatives if applicable)>
IMPACT TO PLAN
<Estimated change to schedule, budget, staffing, or other costs>
= In Use
= Available for Extraction
TABLE OF CONTENTS
1.0 SCOPE
1.1 INTRODUCTION
1.2 PURPOSE
1.3 LIMITATIONS
1.4 ROLES, RESPONSIBILITIES AND SUPPORT AGENCIES
1.4.1 POINTS OF CONTACT
1.4.2 TESTING ORGANIZATION DIAGRAM
1.4.3 SUPPORT SYSTEMS
1.5 SYSTEM OVERVIEW
1.6 SYSTEM CONFIGURATION
1.6.1 DATA SOURCES
1.6.2 DATA WAREHOUSE
1.6.3 DATA STORES/TARGETS
1.6.4 DATA MODELING
1.6.5 DATA INTEGRATION ADMINISTRATION
1.6.6 DATA STAGING
1.6.7 CLIENT INTERFACE
1.6.8 NETWORK
1.6.9 SYSTEM OVERVIEW DIAGRAM
1.7 RELATIONSHIP TO OTHER PLANS
2.0 REFERENCES
3.0 SYSTEM TEST ENVIRONMENT
3.1 TEST AREAS
3.1.1 ENVIRONMENT (HARDWARE AND SOFTWARE)
3.1.2 ENVIRONMENT (HARDWARE AND SOFTWARE)
3.1.3 OTHER MATERIALS
3.1.4 PROPRIETARY NATURE, ACQUIRER'S RIGHTS AND LICENSING
3.1.5 INSTALLATION, TESTING AND CONTROL
3.1.6 TEST PERSONNEL
4.0 TEST IDENTIFICATION
4.1 TEST CASE DESCRIPTION
4.1.1 TEST CASE OBJECTIVE
4.1.2 TEST LEVELS
4.1.3 TEST TYPES
4.1.4 CRITICAL TECHNICAL PARAMETERS (CTPS)
4.1.5 TEST CONDITION REQUIREMENTS (TCRS)
4.1.6 TEST EXECUTION AND PROGRESSION
4.1.7 TEST SCHEDULE
4.2 PHASED TESTING BREAKDOWN DIAGRAM
5.0 DATA RECORDING, ANALYSIS, AND REPORTING
5.1 RECORDING
5.2 ANALYSIS
5.3 REPORTING
1.0 SCOPE
1.1 INTRODUCTION
[Discuss the System Test Plan, its strategy, and intended use.]
1.2 PURPOSE
[Describe the purpose of this document.]
1.3 LIMITATIONS
[Include any disclaimers or issues that can make this test plan less effective.]
[Include all points of contact involved in any phase of the testing process. Also include the group or affiliation of each
POC (e.g., DBA, developer).]
[Provide the organization chart of the testing organization. If an outside testing group is used, provide information
and contacts within that group.]
[Discuss any other organizations that will be supporting the testing effort. Include contacts and organization charts
where necessary.]
[Briefly describe the data modeling tool used. If desired, include a reference to the data model of the system under
test.]
[Describe how and where source, intermediate, and/or target files will be stored. If possible, include a reference to
any documentation that describes the process in greater detail.]
[Describe the client interface used to view and/or query the resulting data.]
1.6.8 NETWORK
[Describe the network architecture of the system under test. If possible, discuss the implication of the battery of tests
on the network.]
2.0 REFERENCES
[List any documents referenced within this test plan.]
Document Name Source Date
[List all hardware and software under test. Also provide a point-of-contact (POC) for each item, indicating who will be
available to support the testing effort.]
Software Version Purpose POC
Compilers
Application
Database
Interfaces
COTS
Hardware
Component
Server
Hard Drive
PCs
Printers
UPS
Tape Drive
Switch
Controller
Drivers
Communications
[Describe all other materials including manuals, installation software, documentation and procedures that will be
supplied on an as needed basis.]
[Describe who will install, maintain, and control the testing environment.]
[Summarize the reasoning for establishing the specific test case, answering in the simplest possible terms the
question: “Why are we doing this?”]
[Describe the various testing levels that will be performed within the test plan. Examples of levels are hardware,
software, data, application, etc.]
[Describe the types of testing that will be performed. Discuss whether the testing will consist of black-box testing,
white-box testing, or a combination of both. (Black-box testing verifies external behavior against requirements
without knowledge of the internal design, while white-box testing exercises the internal structure and logic of each
level under test.)]
[Each CTP will define specific functional units that will be tested. This should include any specific data items,
component, or functional parts that will be tested. ]
[TCR scripts will be developed to satisfy all identified CTPs. Personnel identified by the Technical System Manager
will develop these TCRs. All TCRs will be assigned a numeric designation and will include the test objective, list of
any prerequisites, test steps, actual results, expected results and identification of tester, the current date, and the
current iteration of the test. ]
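The TCR fields listed in the guidance above map naturally onto a simple record structure. A sketch of such a record, with field names invented for illustration:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TestConditionRequirement:
    """One TCR record carrying the fields named in the plan guidance.

    Field names are illustrative; adapt them to the project's templates.
    """
    number: int                 # numeric designation assigned to the TCR
    objective: str              # test objective
    prerequisites: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    expected_results: str = ""
    actual_results: str = ""
    tester: str = ""
    test_date: date = None      # current date of execution
    iteration: int = 1          # current iteration of the test

    def passed(self):
        # Simple convention: the TCR passes when actual matches expected.
        return bool(self.expected_results) and self.actual_results == self.expected_results
```

Keeping expected and actual results side by side in one record makes the pass/fail determination, and the later summary reporting, mechanical rather than manual.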
[Provide a set of control procedures for executing a test such as special conditions and processes for returning a
TCR to a technical developer in the event that it fails.]
5.2 ANALYSIS
[Describe how the test results will be gathered, reviewed, and analyzed. In addition, describe who will receive the
results of this analysis.]
5.3 REPORTING
[Describe test case summary report provided for each test case. Provide detail about the items that will be included
in this test case summary.]
Data Governance
Technology Evaluation Checklist
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
TECHNOLOGY EVALUATION CHECKLIST
IT organizations can use this evaluation checklist to ensure that the data integration platform they
have (or select) offers the comprehensive set of capabilities required for a robust data
governance program. It is critical that a unified platform supply these capabilities to ensure consistency,
re-use, and uniform process and policy controls.
1. DATA ACCESSIBILITY
The platform should ensure that all enterprise data can be accessed, regardless of its source or structure.
□ Pre-built connectivity. Does the platform have pre-built connectivity to a wide variety of
systems, including multiple mainframe formats, messaging systems, and numerous applications?
□ Input/output data validation. Does the platform validate input/output data?
□ Event logging. Does the platform provide failed session statistics, error messages, metadata
statistics and lineage that help assess the exceptions and failures related to accessing data?
□ Federated access. Does the platform provide both physical and virtual/federated access to data
in one common tool?
□ Cross-firewall access. Does the platform support secure, high performance data movement
across firewalls?
□ Supported data types. Is the platform able to access the following data types with one tool while
leveraging common metadata: mainframe data; structured data; unstructured data (e.g., Microsoft
Word documents and Excel spreadsheets); XML and EDI data; relational data; application data;
and message queue data?
2. DATA AVAILABILITY
The platform should ensure that data is available to users and applications—when, where, and how
needed.
□ Throughput. Does the platform make it easy to configure multiple performance enhancement
options including pipelining, dynamic partitioning, and smart parallelism?
□ Scalability. Does the platform take advantage of 64-bit, thread-based parallel processing and
grid deployment for near-linear scalability?
□ Automatic failover and recovery. Does the platform feature automatic failover and recovery
capabilities? Does it provide a graphical status on the grid, as well as other key indicators/alerts?
□ High availability. Does the platform enable you to easily configure high availability? Does it
include built-in resiliency, failover and recovery? Does it support multi-node/grid deployment?
□ Volume and timing. Does the platform allow data volumes and latencies (e.g., large volume
batch vs. message-based real-time) to be configured to meet business needs, without any
recoding?
□ Breadth of delivery protocols. Can the platform be easily configured to deliver data via different
protocols and methods, including loading physical databases for SQL-based access, creating
virtual data views (EII), publishing to a message bus or queue, and publishing Web Services?
3. DATA QUALITY
The platform should ensure the accuracy and validity of data.
□ Profiling. Does the platform include tools to automatically profile data sources to understand the
data and flag potential issues? Is that tool integrated with the rest of the data quality and data
integration platform?
□ Monitoring and measurement. Does the platform enable you to establish key data quality
metrics, monitor them on an ongoing basis, and receive alerts on items that fall out of acceptable
ranges?
□ Cleansing and remediation. Does the platform allow you to define business rules to address
data quality issues on an automated basis? Does it provide historical statistics, which help root
cause analysis on data quality issues, including accuracy, completeness, conformity, consistency,
referential integrity, and duplication?
□ Breadth of data. Does the platform include data quality capabilities that address all key data
types—customer, product/service, financial, employee, etc.— not just a single data type such as
customer contact information?
□ Ease of use. Does the platform provide an easy-to-use interface to enable both business users
(e.g., business analysts and data stewards) and IT users to visualize and address data quality
issues?
□ Integrated metadata. Does the platform automatically capture the metadata from your data
quality processes? Is the metadata seamlessly incorporated as part of the overall data integration
lifecycle?
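As an illustration of the rule-based monitoring and cleansing described above, data quality rules can be expressed as named predicates evaluated against each record, with failure counts feeding the quality metrics. A hypothetical sketch (the rule names and field names are invented for illustration, not drawn from any particular platform):

```python
def run_quality_rules(records, rules):
    """Apply named predicate rules to each record; return failure counts per rule."""
    failures = {name: 0 for name in rules}
    for rec in records:
        for name, predicate in rules.items():
            if not predicate(rec):
                failures[name] += 1
    return failures

# Hypothetical completeness and conformity rules for a customer feed.
rules = {
    "email_present": lambda r: bool(r.get("email")),         # completeness
    "country_iso2": lambda r: len(r.get("country", "")) == 2, # conformity
}

records = [
    {"email": "a@example.com", "country": "US"},
    {"email": "", "country": "USA"},
]
# The second record fails both rules, so each rule reports one failure.
```

Tracking these counts per run gives the historical statistics the checklist asks about, which is what enables root-cause analysis when a metric drifts out of its acceptable range.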
4. DATA CONSISTENCY
The platform should ensure that the value, structure, and meaning of data are consistent and reconciled
across systems, processes, and organizations.
□ Validation. Does the platform provide an integrated design and mapping tool that automatically
validates the data model on-the-fly?
□ Transformation. Does the platform feature robust transformation capabilities that address not
only syntactic issues, but also structural and semantic variances across different systems?
□ Logical design and workflow. Does the platform capture all design and business rules at the
logical level as metadata, abstracting them from the physical layer?
□ Reusability. Are you able to capture all data integration and data quality logic and workflows as
metadata via one platform? Does the platform enable sharing at both local and global levels?
□ Cataloging. Does the platform allow you to easily search, filter, define, and modify data
dictionaries and business rules?
□ Data synchronization. Does the platform easily interoperate with enterprise application
integration (EAI) and messaging technologies to help synchronize the context and meaning of
data, as well as data values, across operational systems?
5. DATA AUDITABILITY
The platform should ensure that there is an audit trail on the data and that internal controls have been
appropriately implemented.
□ Lineage. Can the platform provide a visual lineage of data across multiple systems and
applications, including both backward and forward tracking? Does it provide drill-down
capabilities?
□ Impact Assessment. Can the platform automatically assess the impact of changes across
applications and systems? Does it provide reports on port details, metadata extensions and
usage, and mapping dependencies across connected systems?
□ Workflow. Does the platform include robust workflow orchestration capabilities including support
for grid deployments and global, cross-team collaboration?
□ Dashboard. Does the platform provide a dashboard with a high-level summary of workflows,
processes and status? Does it include the ability to easily drill down into the details?
□ Testing. Does the platform include an integrated test environment that not only detects mapping
and session errors, but also helps identify the root causes of invalid mapping and session errors?
□ Version control. Does the platform feature robust, granular version management and
deployment capabilities?
6. DATA SECURITY
The platform should ensure secure access to the data.
SCHEDULE
# Test Case Description Total TCs Passed Total % Passed
[One row per test case; rows 1–15.]
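The “Total % Passed” column in the summary table is simply passed over total; a small helper keeps the arithmetic consistent and guards against empty rows:

```python
def percent_passed(passed, total):
    # Guard empty test-case rows; report to one decimal place.
    return 0.0 if total == 0 else round(100.0 * passed / total, 1)
```

For example, a test case with 3 of 4 TCs passing reports 75.0, and a row with no TCs reports 0.0 rather than raising a division error.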
Source Tables:
Target Tables:
System/Subsystem:
System Function:
Business Scenario:
Expected Results:
Error Code:
Actual Occurrences:
Expected Occurrences:
MAPPING OVERVIEW
<Description of the function(s) performed by the mapping.>
FILTERS
LOOKUPS
EXPRESSIONS
ERROR HANDLING
Does the error table record count in the error file(s) match the number of errors that exist in the
source table?
Is the .bad file free of error messages?
Was the error table built correctly?
Error Cases
• Refer to the expected results section.
• Were the error cases caught?
• Were the error records written to the appropriate file/table?
• Was the record layout for the error record correct?
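The first checks above amount to a reconciliation: rows rejected from the source should equal the records landed in the error file and in the error table. A hedged sketch of that comparison (the three tallies are assumed to be gathered elsewhere, e.g. from session logs):

```python
def reconcile_error_counts(source_error_count, error_file_records, error_table_count):
    """Compare the three error tallies; return a list of discrepancies (empty = reconciled)."""
    issues = []
    if error_table_count != source_error_count:
        issues.append(
            f"error table has {error_table_count} rows, "
            f"source produced {source_error_count} errors"
        )
    if error_file_records != source_error_count:
        issues.append(
            f"error file has {error_file_records} records, "
            f"source produced {source_error_count} errors"
        )
    return issues
```

An empty result means the error handling captured every rejected row exactly once; any discrepancy message points the tester at which sink to inspect.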
Post Session (all commands are executed only if the session is successful)
[Enter specific tests for the post-session script.]
LOAD STATISTICS
2 Analyze
2.1 Define Business Drivers, Objectives and Goals
2.2 Define Business Requirements
2.2.1 Define Business Rules and Definitions
2.2.2 Establish Data Stewardship
2.3 Define Business Scope
2.3.1 Identify Source Data Systems
2.3.2 Determine Sourcing Feasibility
2.3.3 Determine Target Requirements
2.3.4 Determine Business Process Data Flows
2.3.5 Build Roadmap for Incremental Delivery
2.4 Define Functional Requirements
2.5 Define Metadata Requirements
2.5.1 Establish Inventory of Technical Metadata
2.5.2 Review Metadata Sourcing Requirements
2.5.3 Assess Technical Strategies and Policies
2.6 Determine Technical Readiness
2.7 Determine Regulatory Requirements
2.8 Perform Data Quality Audit
2.8.1 Perform Data Quality Analysis of Source Data
2.8.2 Report Analysis Results to the Business
3 Architect
3.1 Develop Solution Architecture
3.1.1 Define Technical Requirements
4 Design
5 Build
5.1 Launch Build Phase
5.1.1 Review Project Scope and Plan
5.1.2 Review Physical Model
5.1.3 Define Defect Tracking Process
5.2 Implement Physical Database
5.3 Design and Build Data Quality Process
5.3.1 Design Data Quality Technical Rules
5.3.2 Determine Dictionary and Reference Data Requirements
5.3.3 Design and Execute Data Enhancement Processes
5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution
5.3.5 Develop Inventory of Data Quality Processes
5.3.6 Review and Package Data Transformation Specification Processes and Documents
5.4 Design and Develop Data Integration Processes
5.4.1 Design High Level Load Process
5.4.2 Develop Error Handling Strategy
5.4.3 Plan Restartability Process
5.4.4 Develop Inventory of Mappings & Reusable Objects
5.4.5 Design Individual Mappings & Reusable Objects
5.4.6 Build Mappings & Reusable Objects
5.4.7 Perform Unit Test
5.4.8 Conduct Peer Reviews
6 Test
6.1 Define Overall Test Strategy
6.1.1 Define Test Data Strategy
6.1.2 Define Unit Test Plan
6.1.3 Define System Test Plan
6.1.4 Define User Acceptance Test Plan
6.1.5 Define Test Scenarios
6.1.6 Build/Maintain Test Source Data Set
6.2 Prepare for Testing Process
6.2.1 Prepare Environments
6.2.2 Prepare Defect Management Processes
6.3 Execute System Test
6.3.1 Prepare for System Test
6.3.2 Execute Complete System Test
6.3.3 Perform Data Validation
6.3.4 Conduct Disaster Recovery Testing
6.3.5 Conduct Volume Testing
6.4 Conduct User Acceptance Testing
6.5 Tune System Performance
6.5.1 Benchmark
6.5.2 Identify Areas for Improvement
6.5.3 Tune Data Integration Performance
6.5.4 Tune Reporting Performance
7 Deploy
7.1 Plan Deployment
7.1.1 Plan User Training
7.1.2 Plan Metadata Documentation and Rollout
7.1.3 Plan User Documentation Rollout
7.1.4 Develop Punch List
8 Operate
8.1 Define Production Support Procedures
8.1.1 Develop Operations Manual
8.2 Operate Solution
8.2.1 Execute First Production Run
8.2.2 Monitor Load Volume
8.2.3 Monitor Load Processes
8.2.4 Track Change Control Requests
8.2.5 Monitor Usage
8.2.6 Monitor Data Quality
8.3 Maintain and Upgrade Environment
8.3.1 Maintain Repository
8.3.2 Upgrade Software